Logistic regression: model and methods

Methods of logistic regression and discriminant analysis are used when it is necessary to clearly differentiate respondents by target categories. Moreover, the groups themselves are represented by the levels of one univariate parameter. Let us consider in more detail the model of logistic regression, and also find out why it is needed.

logistic regression

General information

An example of a problem in the solution of which logistic regression is used can be the classification of respondents by groups of buyers and non-buyers of mustard. Differentiation is carried out in accordance with socio-demographic characteristics. These, in particular, include age, gender, number of relatives, income, etc. In operations, there are differentiation criteria and a variable. The latter encodes target categories into which, in fact, it is necessary to divide respondents.

Nuances

It should be said that the range of cases in which logistic regression is applied is much narrower than for discriminant analysis. In this regard, the use of the latter as a universal method of differentiation is considered more preferable. Moreover, experts recommend starting classification studies with discriminant analysis. And only in case of uncertainty about the results can logistic regression be used. This need is due to several factors. Logistic regression is used when there is a clear idea of ​​the type of independent and dependent variables. In accordance with this, one of the 3 possible procedures is selected. In discriminant analysis, the researcher always deals with one static operation. It involves one dependent and several independent categorical variables with a scale of any type.

Kinds

The task of a statistical study that uses logistic regression is to determine the likelihood that a particular respondent will be assigned to a particular group. Differentiation is carried out according to certain parameters. In practice, in accordance with the values ​​of one or several independent factors, it is possible to classify respondents into two groups. In this case, there is a binary logistic regression . Also, the specified parameters can be used when distributing to groups of more than two. In such a situation, there is a multinomial logistic regression. The resulting groups are expressed by the levels of a single variable.

logistic regression

Example

Suppose there are respondents' answers to the question of whether they are interested in a proposal to acquire a land plot in a suburb of Moscow. In this case, the options are "no" and "yes." It is necessary to find out exactly which factors have a primary influence on the decision of potential buyers. To do this, the respondents are asked questions about the infrastructure of the territory, the distance to the capital, the area of ​​the plot, the presence / absence of a residential building, etc. Using binary regression, we can distribute the respondents into two groups. The first will include those who are interested in acquiring potential buyers, and the second, respectively, those who are not interested in such an offer. For each respondent, in addition, the probability of being assigned to one or another category will be calculated.

Comparative characteristics

The difference from the two options indicated above is in the different number of groups and the type of dependent and independent variables. In binary regression, for example, the dependence of the dichotomous factor on one or more independent conditions is studied. Moreover, the latter can have any type of scale. Multinominal regression is considered a variation of this classification option. In it, more than 2 groups belong to the dependent variable. Independent factors must have either an ordinal or nominal scale.

Logistic regression in spss

In statistical package 11-12, a new version of the analysis was introduced - ordinal. This method is used when the dependent factor refers to the eponymous (ordinal) scale. In this case, independent variables are selected of one particular type. They must be either ordinal or nominal. Classification into several categories is considered the most universal. This method can be used in all studies that use logistic regression. Improving the quality of the model , however, is only possible with all three tricks.

quality control of adequacy and logistic regression

Ordinal classification

It is worth saying that earlier in the statistical package there was no standard possibility of performing specialized analysis for dependent factors with an ordinal scale. For all variables with the number of groups more than 2, the multinominal option was used. The recently introduced ordinal analysis has a number of features. They take into account the specifics of the scale. Meanwhile, in methodological manuals, ordinal logistic regression is often not considered as a separate technique. This is due to the following: ordinal analysis does not have any significant advantages over multinominal. The researcher may well use the latter in the presence of both an ordinal and a nominal dependent variable. Moreover, the classification processes themselves hardly differ from each other. This means that conducting an ordinal analysis will not cause any difficulties.

Analysis Option

Let's consider a simple case - binary regression. Suppose, in the process of marketing research, the demand for graduates of a certain metropolitan university is assessed. The questionnaire asked respondents questions, including:

  1. Are you working? (ql).
  2. Indicate the year of graduation (q 21).
  3. What is the average graduation point (aver).
  4. Gender (q22).

Logistic regression will allow us to evaluate the effect of independent factors aver, q 21 and q 22 on the variable ql. Simply put, the purpose of the analysis will be to determine the likely employment of graduates based on gender, graduation year, and GPA.

logistic sigmoid regression indicator

Logistic regression

To set parameters using binary regression, use the Analyze►Regression►Binary Logistic menu. In the Logistic Regression window, you need to select the dependent factor in the left list of available variables. It is ql. This variable must be placed in the Dependent field. After that, independent factors must be introduced into the Covariates site - q 21, q 22, aver. Then you need to choose the method of their inclusion in the analysis. If the number of independent factors is more than 2, not the method of simultaneous introduction of all variables, which is set by default, is used, but step-by-step. The most popular way is Backward: LR. Using the Select button, you can include in the study not all respondents, but only a specific target category.

Define Categorical Variables

The Categorical button should be used when one of the independent variables is the nominal one with more than 2 categories. In this situation, just such a parameter is placed in the Define Categorical Variables window on the Categorical Covariates section. In this example, such a variable is absent. After that, in the Contrast drop-down list, select the Deviation item and click the Change button. As a result, several dependent variables will be formed from each nominal factor. Their number corresponds to the number of categories of the initial condition.

Save New Variables

Using the Save button in the main study dialog box, you can create new parameters. They will contain indicators calculated during the regression process. In particular, you can create variables that define:

  1. Belonging to a specific classification category (Groupmembership).
  2. The probability of assigning the respondent to each study group (Probabilities).

When using the Options button, the researcher does not receive any significant opportunities. Accordingly, it can be ignored. After clicking the “OK” button, the analysis results will be displayed in the main window.

logistic regression coefficient

Verification of the quality of adequacy and logistic regression

Consider the Omnibus Testsof Model Coefficients table. It displays the results of the analysis of the quality of approximation of the model. Due to the fact that a step-by-step option was set, you need to look at the results of the last stage (Step2). A positive result will be considered such that an increase in the Chi-square indicator is detected when moving to the next stage with a high degree of significance (Sig. <0.05). The quality of the model is evaluated on the Model line. If a negative value is obtained, but it is not considered significant when the overall high materiality of the model, the latter can be considered practically suitable.

Tables

Model Summary makes it possible to evaluate the indicator of total dispersion, which is described by the constructed model (indicator R Square). The Nagelker value is recommended. A positive indicator can be considered the parameter Nagelkerke R Square, if it is higher than 0.50. After that, the classification results are evaluated, in which the actual indicators of belonging to one or another of the studied categories are compared with those predicted on the basis of the regression model. To do this, use the Classification Table. It also allows us to draw conclusions about the correctness of differentiation for each group under consideration.

logistic regression model
The following table makes it possible to find out the statistical significance of the independent factors introduced into the analysis, as well as each non-standardized coefficient of logistic regression . Based on these indicators, it is possible to predict that each respondent in the sample belongs to a specific group. Using the Save button, you can enter new variables. They will contain information about belonging to a specific classification category (Predicted category) and the probability of inclusion in these groups (Predicted probabilities membership). After clicking "OK" in the main window of the Multinomial Logistic Regression, the calculation results will appear.

The first table in which indicators important for the researcher are present is Model Fitting Information. A high level of statistical significance will indicate the high quality and suitability of using the model in solving practical problems. Another significant table is the Pseudo R-Square. It allows you to estimate the proportion of the total variance in the dependent factor, which is determined by the independent variables selected for analysis. According to the Likelihood Ratio Tests table, conclusions can be drawn about the statistical significance of the latter. Parameter Estimates reflects non-standard ratios. They are used in the construction of the equation. In addition, for each combination of variables, the statistical significance of their impact on the dependent factor is determined. Meanwhile, in marketing research, there is often a need to differentiate by categories of respondents, not separately, but as part of the target group. To do this, use the table Observedand Predicted Frequencies.

Practical use

The considered analysis method is widely used in the work of traders. In 1991, an indicator of logistic sigmoid regression was developed. It is an easy-to-use and effective tool with which you can predict likely prices before they "overheat". The indicator is presented on the chart in the form of a channel formed by two lines running in parallel. They are removed at an equal distance from the trend. The width of the corridor will depend solely on the timeframe. The indicator is used when working with almost all assets - from currency pairs to precious metals.

logistic regression in spss

In practice, 2 key strategies for using the tool have been developed: for breakdown and for reversal. In the latter case, the trader will focus on the dynamics of price changes within the channel. As the cost approaches the support or resistance line, the bet is on the likelihood that the movement will begin in the opposite direction. If the price comes close to the upper limit, then you can get rid of the asset. If it is at the lower limit, then you should think about the acquisition. The strategy for the breakdown involves the use of orders. They are set outside the boundaries at a relatively small distance. Taking into account that the price in some cases violates them for a short time, you should play it safe and set stop loss. At the same time, of course, regardless of the chosen strategy, the trader needs to calmly perceive and evaluate the situation that has arisen on the market.

Conclusion

Thus, the use of logistic regression allows you to quickly and easily classify respondents into categories according to specified parameters. In the analysis, you can use any specific method. In particular, multinominal regression is universal. However, experts recommend applying all of the above methods in combination. This is due to the fact that in this case the quality of the model will be significantly higher. This, in turn, will expand the range of its application.

Source: https://habr.com/ru/post/G25401/


All Articles