In scientific research, it is often necessary to find a relationship between outcome (response) variables and factor variables: crop yield and rainfall, a person's height and weight within groups homogeneous by gender and age, pulse rate and body temperature, and so on.
Factor variables are the characteristics that contribute to changes in the outcome variables associated with them.
The concept of correlation analysis
There are many definitions of the term. Based on the foregoing, we can say that correlation analysis is a method used to test the hypothesis that the relationship between two or more variables is statistically significant, when the researcher can measure the variables but cannot change them.
There are other definitions of this concept. Correlation analysis is a method of processing statistical data that consists in studying the correlation coefficients between variables. The correlation coefficients for one pair or for many pairs of characteristics are compared in order to establish statistical relationships between them. Correlation analysis is also defined as a method for studying the statistical relationship between random variables, a relationship that need not be strictly functional, in which a change in one random variable leads to a change in the mathematical expectation of another.
The concept of spurious (false) correlation
When conducting a correlation analysis, it must be borne in mind that it can be carried out on any set of characteristics, including ones that are absurd to compare. Sometimes the characteristics have no causal relationship with each other at all.
In this case, the correlation is called spurious, or false.
Tasks of correlation analysis
Based on the above definitions, the following tasks of the method can be formulated: to obtain information about one of the variables of interest from the other, and to determine the strength of the relationship between the studied variables.
Correlation analysis involves determining the relationship between the studied characteristics, so its tasks can be supplemented with the following:
- identifying the factors that have the greatest impact on the outcome variable;
- identifying previously unexplored causes of the relationship;
- building a correlation model and analyzing its parameters;
- studying the significance of the relationship parameters and estimating their confidence intervals.
Relationship between correlation analysis and regression analysis
Correlation analysis is often not limited to measuring the strength of the relationship between the studied values. Sometimes it is supplemented by regression equations, which are obtained through regression analysis and which describe the dependence of the outcome variable on the factor variable or variables. Together, the two methods make up correlation and regression analysis.
Conditions for applying the method
Outcome indicators depend on one or several factors. Correlation analysis can be applied when a large number of observations of the outcome and factor indicators is available; the studied factors must be quantitative and recorded in specific sources. If the data follow a normal distribution, the result of the analysis is the Pearson correlation coefficient; if they do not, the Spearman rank correlation coefficient is used.
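The difference between the two coefficients can be illustrated with a minimal sketch in plain Python. The function names (`pearson_r`, `average_ranks`, `spearman_rho`) and the sample data are illustrative assumptions, not part of the original text; Spearman's coefficient is computed here as Pearson's coefficient applied to ranks, with tied values sharing an average rank.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical data: a monotonic but non-linear dependence (y = x ** 3).
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]
print(round(pearson_r(x, y), 3), round(spearman_rho(x, y), 3))
```

For data like these, the rank coefficient reports a perfect monotonic relationship, while the Pearson coefficient stays below 1 because the dependence is not linear.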
Rules for the selection of correlation analysis factors
When applying this method, the factors that influence the outcome indicators must first be identified. They are selected on the basis that causal relationships must exist between the indicators. When building a multivariate correlation model, the factors that have a significant impact on the outcome indicator are selected. It is preferable not to include interdependent factors whose pair correlation coefficient exceeds 0.85, or factors whose relationship with the outcome parameter is non-linear or functional.
Display Results
The results of correlation analysis can be presented in textual and graphical form. In the first case they are given as a correlation coefficient, in the second as a scatter diagram.
If there is no correlation between the parameters, the points on the diagram are scattered randomly. A relationship of medium strength shows more order: the points lie at a more or less uniform distance from the median line. A strong relationship approaches a straight line, and for r = ±1 the scatter plot is exactly a straight line. An inverse (negative) correlation runs from the upper left to the lower right of the plot; a direct (positive) correlation runs from the lower left to the upper right.
Three-dimensional representation of the scatter chart
In addition to the traditional 2D scatter diagram, 3D plots are now also used to display the results of correlation analysis.
A scatter chart matrix is also used, which shows all pairwise plots in one picture in matrix form. For n variables, the matrix contains n rows and n columns. The diagram at the intersection of the i-th row and the j-th column plots the variable Xi against Xj. Thus each row and each column corresponds to one variable, and each cell shows the scatter chart for a pair of variables.
Rating the strength of the relationship
The strength of the correlation is determined by the correlation coefficient (r): strong for r from ±0.7 to ±1, medium for r from ±0.3 to ±0.699, weak for r from 0 to ±0.299. This classification is not strict. The figure shows a slightly different scheme.
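As a small illustration, the thresholds quoted above can be written as a helper function. This is only a sketch: the function name `relationship_strength` is hypothetical, and the cut-offs are the non-strict ones from the text rather than a universal standard.

```python
def relationship_strength(r):
    """Label a correlation coefficient using the thresholds quoted above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    magnitude = abs(r)  # the sign gives the direction, not the strength
    if magnitude >= 0.7:
        return "strong"
    if magnitude >= 0.3:
        return "medium"
    return "weak"


# Illustrative values only
for value in (0.716, -0.45, 0.1):
    print(value, relationship_strength(value))
```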
An example of applying the method of correlation analysis
An interesting study was undertaken in the UK. It examined the relationship between smoking and lung cancer and was carried out by means of correlation analysis. The data are presented below.
Source data for correlation analysis
Professional group | Smoking | Mortality |
Farmers, foresters and fishermen | 77 | 84 |
Miners and quarry workers | 137 | 116 |
Manufacturers of gas, coke and chemicals | 117 | 123 |
Manufacturers of glass and ceramics | 94 | 128 |
Workers in furnaces, forges, foundries and rolling mills | 116 | 155 |
Electrical and Electronics Workers | 102 | 101 |
Engineering and related professions | 111 | 118 |
Woodworking industry | 93 | 113 |
Tanners | 88 | 104 |
Textile workers | 102 | 88 |
Workwear Manufacturers | 91 | 104 |
Food, drink and tobacco workers | 104 | 129 |
Manufacturers of paper and printing | 107 | 86 |
Manufacturers of other products | 112 | 96 |
Builders | 113 | 144 |
Artists and decorators | 110 | 139 |
Drivers of stationary engines, cranes, etc. | 125 | 113 |
Workers not elsewhere classified | 133 | 146 |
Transport and communications workers | 115 | 128 |
Warehouse workers, storekeepers, packers and filling machine workers | 105 | 115 |
Clerical workers | 87 | 79 |
Sellers | 91 | 85 |
Sports and recreation workers | 100 | 120 |
Administrators and Managers | 76 | 60 |
Professionals, technicians and artists | 66 | 51 |
We begin the correlation analysis. For clarity, it is best to start with the graphical method, so we construct a scatter diagram.
It shows a direct relationship. However, it is difficult to draw an unambiguous conclusion from the graphical method alone, so we continue the correlation analysis. An example of calculating the correlation coefficient is presented below.
Using software (MS Excel is used as an example below), we obtain a correlation coefficient of 0.716, which indicates a strong relationship between the studied parameters. To check the statistical reliability of this value, we use the corresponding table of critical values. The number of degrees of freedom is the 25 pairs of values minus 2, that is, 23. In that row of the table we find r critical for p = 0.01, which is 0.51 (because these are medical data, a stricter significance level is used; in other cases p = 0.05 is sufficient). Since the calculated r is greater than r critical, the correlation coefficient is considered statistically significant.
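The calculation can also be reproduced outside a spreadsheet. The sketch below, in plain Python, computes the Pearson coefficient for the 25 pairs in the table and converts a tabulated critical t value into a critical r. The variable and function names are illustrative, and the value 2.807 is an assumption taken from a standard two-sided Student's t table for 23 degrees of freedom at p = 0.01.

```python
from math import sqrt

# Smoking and mortality columns from the table above, in row order.
smoking = [77, 137, 117, 94, 116, 102, 111, 93, 88, 102, 91, 104, 107,
           112, 113, 110, 125, 133, 115, 105, 87, 91, 100, 76, 66]
mortality = [84, 116, 123, 128, 155, 101, 118, 113, 104, 88, 104, 129, 86,
             96, 144, 139, 113, 146, 128, 115, 79, 85, 120, 60, 51]


def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


n = len(smoking)
degrees_of_freedom = n - 2          # 25 pairs minus 2
r = pearson_r(smoking, mortality)

# t statistic for testing whether the correlation differs from zero.
t_stat = r * sqrt(degrees_of_freedom) / sqrt(1 - r ** 2)

# Assumption: two-sided critical t for 23 degrees of freedom at p = 0.01,
# read from a standard Student's t table.
T_CRITICAL = 2.807
r_critical = T_CRITICAL / sqrt(T_CRITICAL ** 2 + degrees_of_freedom)

print(f"r = {r:.3f}, t = {t_stat:.2f}, r critical = {r_critical:.2f}")
print("significant" if abs(r) > r_critical else "not significant")
```

With these data the coefficient comes out at about 0.716, matching the spreadsheet result, and the derived critical r rounds to the 0.51 quoted in the text.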
Using software for correlation analysis
This type of statistical data processing can be carried out with software, in particular MS Excel. Correlation analysis in Excel involves calculating the following parameters using functions:
1. The correlation coefficient is determined using the CORREL function (array1; array2), where array1 and array2 are the cell ranges holding the values of the outcome and factor variables.
The linear correlation coefficient is also called the Pearson correlation coefficient, so in Excel 2007 and later you can also use the PEARSON function with the same arrays.
A graphical display of the correlation analysis in Excel is produced from the Charts panel by choosing Scatter Chart.
After specifying the source data, we get a graph.
2. Assessment of the significance of the pair correlation coefficient using Student's t-test. The calculated value of the t statistic is compared with the tabular (critical) value from the corresponding table, taking into account the chosen significance level and the number of degrees of freedom. The critical value is obtained using the TINV function (probability; degrees of freedom).
3. The matrix of pair correlation coefficients. The analysis is performed using the "Data Analysis" tool, in which "Correlation" is selected. Each pair correlation coefficient is evaluated statistically by comparing its absolute value with the tabular (critical) value. If the calculated pair correlation coefficient exceeds the critical one, we can say, at the given probability level, that the null hypothesis of no linear relationship is rejected, that is, the linear relationship is statistically significant.
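The third step, the matrix of pair correlation coefficients and its comparison with a critical value, can be sketched in plain Python as follows. Everything here is illustrative: the function names, the three variables and their values are hypothetical, and the critical value 0.811 is an assumption taken from a standard table for 4 degrees of freedom (6 observations minus 2) at p = 0.05.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """Map each pair of column names to its pair correlation coefficient."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names}
            for a in names}


def significant_pairs(matrix, r_critical):
    """Pairs above the diagonal whose |r| exceeds the critical value."""
    names = list(matrix)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if abs(matrix[a][b]) > r_critical]


# Hypothetical observations, six per variable.
data = {
    "rainfall": [10, 12, 15, 18, 20, 25],
    "fertilizer": [5, 3, 6, 2, 7, 4],
    "yield": [2.1, 2.4, 3.1, 3.4, 4.2, 4.9],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name.ljust(12), "  ".join(f"{row[other]:6.3f}" for other in matrix))

# Assumption: critical r for 4 degrees of freedom at p = 0.05.
R_CRITICAL = 0.811
print(significant_pairs(matrix, R_CRITICAL))
```

The matrix is symmetric with ones on the diagonal, so only the pairs above the diagonal need to be compared with the critical value.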
Conclusion
Correlation analysis in scientific research makes it possible to determine the relationship between various factors and outcome indicators. It should be borne in mind that a high correlation coefficient can also be obtained from an absurd pair or set of data, so this type of analysis must be carried out on a sufficiently large data set.
After the calculated value of r is obtained, it is advisable to compare it with r critical to confirm its statistical reliability. Correlation analysis can be carried out manually using formulas or with software, in particular MS Excel. Excel can also be used to construct a scatter diagram that visualizes the relationship between the studied factors and the outcome variable.