The principal component analysis (PCA) method simplifies the complexity of high-dimensional data while preserving trends and patterns. It does this by transforming the data into fewer dimensions, which act as summaries of the original features. Such data are very common across science and technology and arise whenever several features are measured for each sample, for example the expression levels of many genes. This type of data raises problems caused by an elevated error rate due to multiple testing.
The method is similar to clustering: it finds patterns without using labels, and it checks whether samples come from different groups and whether those groups differ significantly. As with all statistical methods, it can be applied incorrectly. Scaling the variables can change the results of the analysis, and it is important that the analysis is not tuned to fit prior expectations about the data.
Component Analysis Objectives
The main goal of the method is to detect and reduce the dimensionality of a data set and to identify new meaningful underlying variables. For this, it helps to collect the multidimensional data in a TableOfReal data matrix, in which rows correspond to cases and columns to variables. A TableOfReal is therefore interpreted as numberOfRows data vectors, each with numberOfColumns elements.
Traditionally, principal component analysis is performed on a covariance matrix or a correlation matrix, either of which can be computed from the data matrix. The covariance matrix contains scaled sums of squares and cross-products. The correlation matrix is similar to the covariance matrix, except that the variables, i.e. the columns, are standardized first. Standardize the data first if the variances or the units of the variables differ greatly. To perform the analysis, select the TableOfReal data matrix in the list of objects and run the PCA analysis command.
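Outside the Praat workflow described here, the same covariance-versus-correlation distinction can be sketched in numpy. This is a minimal sketch with synthetic data; the exaggerated variable scales show why standardization matters:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 100 samples, 3 variables with very different scales
X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 100.0])

# Covariance-based PCA: center only
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals_cov, eigvecs_cov = np.linalg.eigh(cov)

# Correlation-based PCA: center and scale to unit variance first
Xs = Xc / X.std(axis=0, ddof=1)
corr = np.cov(Xs, rowvar=False)          # equals the correlation matrix
eigvals_corr, eigvecs_corr = np.linalg.eigh(corr)

# With unscaled data the large-variance column dominates the first component;
# after standardization all variables contribute on an equal footing.
print(eigvals_cov[::-1])   # eigh returns eigenvalues in ascending order
print(eigvals_corr[::-1])
```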
This produces a new PCA object in the list of objects. You can now plot the eigenvalue curve (a scree plot) to get an idea of the importance of each component. The program can also report the fraction of variance accounted for, or test whether some of the eigenvalues are equal. Since the components are obtained by solving a particular optimization problem, they have some "built-in" properties, for example maximal variance. In addition, they provide a number of other quantities familiar from factor analysis (see the sketch after this list):
- the variance of each component, whose share of the total variance of the original variables is given by the eigenvalues;
- score calculations, which give the value of each component for each observation;
- loadings, which describe the correlation between each component and each variable;
- the correlations among the original variables as reproduced by the p components;
- reconstruction of the original data from the p components;
- "rotation" of the components to improve their interpretability.
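The following minimal scikit-learn sketch, using synthetic data (all variable names are hypothetical), shows where each of these quantities lives in code; the loading formula used is one common convention, not the only one:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
X[:, 1] += 2 * X[:, 0]   # induce correlation so PCA has structure to find

pca = PCA().fit(X)

# Eigenvalues: the variance of each component and its share of the total
print(pca.explained_variance_)
print(pca.explained_variance_ratio_)

# Scores: the value of each component for each observation
scores = pca.transform(X)

# Loadings: correlations between components and original variables
loadings = (pca.components_.T * np.sqrt(pca.explained_variance_)) \
           / X.std(axis=0, ddof=1)[:, None]

# Reconstruction of the original data from the first p components
p = 2
X_hat = scores[:, :p] @ pca.components_[:p, :] + pca.mean_
```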
Choosing the number of components to retain
There are two common ways to select the number of components to retain, both based on the eigenvalues. The first is to plot the eigenvalues: if the points on the chart level off and come close enough to zero, the corresponding components can be ignored. The second is to limit the number of components to however many account for a given proportion of the total variance. For example, a user satisfied with 95% of the total variance chooses the number of components at which the variance accounted for (VAF) reaches 0.95.
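A minimal sketch of the second rule, assuming synthetic data and scikit-learn; np.searchsorted finds the smallest number of components whose cumulative VAF reaches 0.95:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 20)) @ rng.normal(size=(20, 20))  # correlated toy data

pca = PCA().fit(X)
cum_vaf = np.cumsum(pca.explained_variance_ratio_)

# Smallest k such that the first k components reach 95% of total variance
k = int(np.searchsorted(cum_vaf, 0.95)) + 1
print(k, cum_vaf[k - 1])
```

scikit-learn can also perform this selection directly via PCA(n_components=0.95).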
The principal component scores are obtained by projecting the data vectors onto the space spanned by the eigenvectors. This can be done in two ways: directly from the TableOfReal, without first forming a PCA object, after which the resulting configuration can be displayed; or by selecting a PCA object together with a TableOfReal and choosing "To Configuration", so that the analysis is performed in the space of the components.
If the starting point is a symmetric matrix, for example a covariance matrix, it is first reduced to tridiagonal form, after which the QL algorithm with implicit shifts is applied. If, instead, the starting point is the data matrix, the matrix of sums of squares and cross-products is not formed. A numerically more stable method is used instead: the singular value decomposition of the data matrix. The matrix of singular vectors then contains the eigenvectors, and the squared singular values give the eigenvalues.
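A sketch of the equivalence of the two routes, with synthetic data; numpy's eigh and svd stand in for the tridiagonal/QL and SVD computations described above:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix
# (library routines typically reduce the symmetric matrix to
# tridiagonal form and then apply a QL/QR-type iteration internally)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))

# Route 2: SVD of the centered data matrix, numerically more stable
# because the sums-of-squares matrix is never formed explicitly
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals_svd = s**2 / (Xc.shape[0] - 1)   # squared singular values -> eigenvalues

# Same spectrum, up to ordering and floating-point error
print(np.allclose(np.sort(eigvals), np.sort(eigvals_svd)))
```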
Types of linear combinations
A principal component is a normalized linear combination of the original predictors in the data set. In a typical plot of the first two components, PC1 and PC2 are the principal components. Suppose there are p predictors X1, X2, ..., Xp.
The first principal component can be written as: Z1 = φ11X1 + φ21X2 + φ31X3 + ... + φp1Xp
Where:
- Z1 is the first principal component;
- φ1 is the loading vector, consisting of the loadings (φ11, φ21, ..., φp1) of the first principal component.
The loadings are constrained so that their sum of squares equals 1, since otherwise an arbitrarily large loading could produce an arbitrarily large variance. The loading vector also defines the direction of the principal component (Z1) along which the data vary the most, which makes it the line in p-dimensional space that is closest to the n observations.
Closeness is measured by root-mean-square Euclidean distance. X1, ..., Xp are normalized predictors: normalized predictors have mean zero and standard deviation one. The first principal component is therefore the linear combination of the original predictor variables that captures the maximum variance in the data set. It determines the direction of greatest variability in the data. The more of the variability captured by the first component, the more information it carries. No other component can have variability higher than the first.
The first principal component yields the line that is closest to the data, in the sense that it minimizes the sum of squared distances between the data points and the line. The second principal component (Z2) is also a linear combination of the original predictors; it captures the remaining variance in the data set and is uncorrelated with Z1. In other words, the correlation between the first and second components should be zero. It can be written as: Z2 = φ12X1 + φ22X2 + φ32X3 + ... + φp2Xp.
If they are uncorrelated, their directions should be orthogonal.
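A numpy sketch on synthetic data, verifying the unit sum-of-squares constraint on the loadings and the zero correlation (and orthogonality) between Z1 and Z2:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3))
X[:, 2] += X[:, 0] - X[:, 1]
Xc = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
phi1, phi2 = eigvecs[:, -1], eigvecs[:, -2]   # eigh sorts ascending

# Loadings are constrained to unit sum of squares
print(np.sum(phi1**2))                        # -> 1.0

# Z1 and Z2 are the projections onto the loading vectors
Z1, Z2 = Xc @ phi1, Xc @ phi2

# Uncorrelated components, orthogonal directions
print(np.corrcoef(Z1, Z2)[0, 1])              # ~ 0
print(phi1 @ phi2)                            # ~ 0
```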
Test Data Prediction Process
After the principal components have been computed, they can be used to make predictions on test data. The procedure is straightforward.
For example, the same centering and scaling transformation must be applied to the test set, for instance using the center and scale options in the R language (version 3.4.2). R is a free programming language for statistical computing and graphics; it was created in the early 1990s for solving statistical problems. This completes the modeling process once the PCA has been extracted.
Python Dataset:
To implement PCA in Python, import it from the sklearn library. The interpretation is the same as for R users; the only difference is that the data set used for Python is a cleaned version, in which missing values have been imputed and categorical variables converted to numeric. The modeling process is the same as described above for R users. A calculation sketch follows below.
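The original calculation example is not reproduced here, so this is a minimal sklearn sketch of the train/test workflow under assumed synthetic data: fit the centering, scaling, and PCA on the training set only, then apply the identical transformation to the test set.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 10))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Fit centering, scaling and PCA on the training set only,
# then apply the same transformation to the test set
pipe = make_pipeline(StandardScaler(), PCA(n_components=3))
train_scores = pipe.fit_transform(X_train)
test_scores = pipe.transform(X_test)       # no refitting on test data
print(train_scores.shape, test_scores.shape)
```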
Spectral decomposition
The idea behind the principal component approach to factor analysis is to approximate the spectral decomposition of the covariance matrix, Σ = λ1e1e1' + ... + λpepep'. Instead of summing from 1 to p, we sum only from 1 to m, ignoring the last p − m terms. Rewriting this defines the factor loading matrix L, whose columns are √λi·ei, and gives the final expression in matrix notation. If standardized measurements are used, S is replaced by the sample correlation matrix R.
This yields the factor loading matrix L of factor analysis, multiplied by its transpose L'. To estimate the specific variances, the factor model for the variance-covariance matrix is used:
Σ = LL' + Ψ
so that Ψ equals the variance-covariance matrix minus LL':
Ψ = Σ − LL'
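A numpy sketch of this estimation, with a toy positive semidefinite matrix standing in for Σ; L is built from the first m eigenpairs exactly as described above:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(5, 5))
Sigma = A @ A.T                       # a toy variance-covariance matrix

eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order

# Keep the first m terms of the spectral decomposition:
# L has columns sqrt(lambda_i) * e_i, so L L' approximates Sigma
m = 2
L = eigvecs[:, :m] * np.sqrt(eigvals[:m])

# Specific variances: Psi = Sigma - L L'
Psi = Sigma - L @ L.T
print(np.diag(Psi))                   # in factor analysis only the diagonal is kept
```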
The principal components are determined by the formula Yij = êj'Xi, i.e. the j-th component score for the i-th subject is the projection of the observation vector onto the j-th eigenvector.
Where:
- Xi is the observation vector for the i-th subject.
- S stands for the sample variance-covariance matrix.
Then λ̂1 through λ̂p are the eigenvalues of this variance-covariance matrix, together with the corresponding eigenvectors.
The eigenvalues of S: λ̂1, λ̂2, ..., λ̂p.
The eigenvectors of S: ê1, ê2, ..., êp.
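As a check on these definitions, a small numpy sketch (synthetic data assumed) showing that the sample variance of the j-th component score equals the j-th eigenvalue of S:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 4))
X[:, 0] += X[:, 3]
xbar = X.mean(axis=0)

S = np.cov(X, rowvar=False)                      # sample covariance matrix
lam, E = np.linalg.eigh(S)
lam, E = lam[::-1], E[:, ::-1]                   # descending eigenvalues

# Score of the j-th component for subject i: Y_ij = e_j' (x_i - xbar)
Y = (X - xbar) @ E

# The sample variance of the j-th score equals the j-th eigenvalue
print(np.allclose(Y.var(axis=0, ddof=1), lam))
```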
Excel Analysis in Bioinformatics
PCA is a powerful and popular multivariate analysis method for exploring data sets with many quantitative variables. Accordingly, the principal component method is widely used in bioinformatics, marketing, sociology and many other fields. XLSTAT provides a complete and flexible implementation for exploring data directly in Excel, with several standard and advanced options that allow a deeper understanding of the data.
You can run the analysis on raw data or on dissimilarity matrices, add supplementary variables or observations, and filter variables according to various criteria to make the maps easier to read. You can also perform rotations. The correlation circle and the observation plot are easy to customize as standard Excel charts, and the data from the results report can be carried over directly into further analysis.
XLSTAT offers several data processing options that are applied to the input data before the principal components are computed:
- Pearson, the classic PCA, which automatically standardizes the data before the calculations so that variables with large variances do not dominate the result;
- Covariance, which works with unstandardized variances and covariances;
- Polychoric, for ordinal data.
Dimensionality Analysis Examples
Consider the principal component method on the example of a symmetric correlation or covariance matrix. This means the matrix must be numeric, and for the correlation case the data must be standardized. Suppose there is a data set of dimension 300 (n) × 50 (p), where n is the number of observations and p is the number of predictors.
Since p = 50 is large, there could be p(p − 1)/2 = 1225 pairwise scatterplots. A better approach is to select a small number of components m (m << 50) that captures most of the information, and then plot the observations in the resulting low-dimensional space. Keep in mind that each dimension is a linear combination of the p features.
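A sketch of this scenario with synthetic 300 × 50 data; low-rank structure is added so that a few components genuinely dominate:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(8)
n, p = 300, 50
X = rng.normal(size=(n, 12)) @ rng.normal(size=(12, p))  # low-rank structure
X += 0.1 * rng.normal(size=(n, p))

# p*(p-1)/2 = 1225 pairwise scatterplots would be needed for the raw data
print(p * (p - 1) // 2)

# A handful of components usually captures most of the information
pca = PCA(n_components=3).fit(X)
Z = pca.transform(X)                    # 300 x 3: plot these instead
print(pca.explained_variance_ratio_.sum())
```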
Consider also an example for a matrix with two variables. Here a data set is created with two variables (long length and diagonal length) using Davis's artificial data.
The components can be drawn on the scatterplot as follows.
This graph illustrates the idea of the first principal component, which provides an optimal summary of the data: no other line drawn through such a scatterplot produces a set of projected values of the data points with less dispersion.
The first component also has an application in reduced major axis (RMA) regression, which assumes that both the x and y variables carry errors or uncertainties, or that there is no clear distinction between predictor and response.
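A minimal sketch of the standard RMA estimator (slope = sign(r)·sy/sx), assuming synthetic data in which both variables are noisy:

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)    # both variables treated as uncertain

r = np.corrcoef(x, y)[0, 1]
slope = np.sign(r) * y.std(ddof=1) / x.std(ddof=1)   # RMA slope
intercept = y.mean() - slope * x.mean()
print(slope, intercept)
```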
Econometric forecasting models
In econometrics, the principal component method is applied to variables such as GNP, inflation, exchange rates, and so on. The model equations are then estimated from available data, mainly aggregate time series. Econometric models are not limited to macroeconomic applications, however. Econometrics literally means economic measurement.
By applying statistical methods to relevant data, econometrics reveals relationships between economic variables. A simple example of an econometric model: suppose monthly consumer spending depends linearly on consumer income in the previous month. Then the model consists of the equation Ct = a + b·Yt−1, where Ct is consumption in month t and Yt−1 is income in the previous month.
The task of econometrics is to obtain estimates of the parameters a and b. When used in the model equation, these estimated values make it possible to predict future consumption from the income of the previous month (see the sketch below). When developing models of this kind, several points must be taken into account:
- the nature of the probabilistic process that generates the data;
- level of knowledge about this;
- system size;
- form of analysis;
- forecast horizon;
- mathematical complexity of the system.
All these considerations matter because the sources of error in the model depend on them. In addition, a forecasting method must be chosen to address them. The model can be reduced to a linear one even when only a small sample is available; this type is among the most common for predictive analysis.
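A minimal least-squares sketch of the consumption model above, with hypothetical synthetic series (the names income and consumption and the true values of a and b are illustrative):

```python
import numpy as np

rng = np.random.default_rng(10)
# Hypothetical monthly series: income Y_t and consumption C_t = a + b*Y_{t-1} + noise
income = 100 + rng.normal(scale=5, size=37)
consumption = 20 + 0.7 * income[:-1] + rng.normal(scale=2, size=36)

# Estimate a and b by ordinary least squares on lagged income
A = np.column_stack([np.ones(36), income[:-1]])
(a_hat, b_hat), *_ = np.linalg.lstsq(A, consumption, rcond=None)

# Forecast next month's consumption from the latest observed income
forecast = a_hat + b_hat * income[-1]
print(a_hat, b_hat, forecast)
```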
Nonparametric statistics
The principal component method for nonparametric data refers to methods in which no specific distribution is assumed for the measurements. Nonparametric statistical methods are widely used in many kinds of studies. In practice, when the assumption of normality does not hold, parametric statistical methods can produce misleading results. In contrast, nonparametric methods make far less stringent assumptions about the distribution of the measurements.
They are reliable regardless of the underlying distribution of the observations. Because of this attractive property, many types of nonparametric tests have been developed for analyzing various experimental designs, including one-sample, two-sample, and randomized block designs. Currently, a nonparametric Bayesian approach using the principal component method is applied to simplify the reliability analysis of railway systems.
A railway system is a typical large-scale complex system with interconnected subsystems containing numerous components. System reliability is maintained through appropriate maintenance measures, and cost-effective asset management requires an accurate assessment of reliability at the lowest level. However, real reliability data at the level of railway system components are not always available in practice, let alone complete. The distribution of component lifetimes from manufacturers is often obscured and complicated by actual use and the operating environment. Reliability analysis therefore requires a suitable methodology for estimating component lifetimes in the absence of failure data.
The principal component method in the social sciences is used for two main tasks:
- analysis of data from sociological surveys;
- building models of social phenomena.
Model Calculation Algorithms
Different algorithms for the principal component method give different views of the model's structure and interpretation, reflecting how PCA is used across disciplines. NIPALS, the nonlinear iterative partial least squares algorithm, computes the components sequentially, and the calculation can be stopped early when the user decides that enough components have been extracted. Most software packages tend to use the NIPALS algorithm, as it has two main advantages:
- it handles missing data;
- it computes the components sequentially.
The purpose of considering this algorithm here is that it gives an additional idea of what the loadings and scores mean.
In outline, NIPALS computes the first score vector t1 and loading vector p1 directly from the data matrix X; the rank-one contribution t1p1' is then subtracted from X, and the procedure is repeated on the residual to extract the next component. In the regression setting, the scores can be written as T = XW, and the regression of Y on X then has coefficients B = WQ', so the fitted model predicts Y from X through the extracted components.
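A compact sketch of the NIPALS iteration for PCA, on synthetic data. This basic version omits the missing-data handling mentioned above; the function name nipals_pca is illustrative:

```python
import numpy as np

def nipals_pca(X, n_components, max_iter=500, tol=1e-9):
    """Sequential NIPALS PCA: components are extracted one at a time,
    so the computation can be stopped once enough have been obtained."""
    X = X - X.mean(axis=0)
    scores, loadings = [], []
    for _ in range(n_components):
        t = X[:, 0].copy()                     # initial score guess
        for _ in range(max_iter):
            p = X.T @ t / (t @ t)              # loading estimate
            p /= np.linalg.norm(p)
            t_new = X @ p                      # score estimate
            if np.linalg.norm(t_new - t) < tol:
                t = t_new
                break
            t = t_new
        X = X - np.outer(t, p)                 # deflate: remove this component
        scores.append(t)
        loadings.append(p)
    return np.array(scores).T, np.array(loadings).T

rng = np.random.default_rng(11)
X = rng.normal(size=(100, 6))
T, P = nipals_pca(X, 2)
print(T.shape, P.shape)
```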