Statistical information: collection, processing, analysis

Throughout the history of statistics, various attempts have been made to create a taxonomy of levels of measurement. The psychophysicist Stanley Smith Stevens defined nominal, ordinal, interval, and ratio scales.

Nominal measurements have no meaningful rank order among their values and permit any one-to-one transformation.

Ordinal measurements have imprecise differences between consecutive values but a meaningful order of those values, and they permit any order-preserving transformation.

Interval measurements have meaningful distances between values, but the zero point is arbitrary (as with longitude, or temperature in degrees Celsius or Fahrenheit); they permit any linear transformation of the form y = ax + b.

Ratio measurements have both a meaningful zero value and meaningful distances between values, and they permit any rescaling transformation.
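The difference between the interval and ratio scales can be checked numerically. The sketch below is only an illustration: the temperature and length values are made up, and the helper names are hypothetical. It shows that a Celsius-to-Fahrenheit conversion preserves ratios of differences but not ratios of values, while a pure rescaling (metres to feet) preserves ratios of values.

```python
import math

def celsius_to_fahrenheit(c):
    """Linear map with an offset: permissible on an interval scale."""
    return c * 9 / 5 + 32

def metres_to_feet(m):
    """Pure rescaling: permissible on a ratio scale."""
    return m * 3.28084

# Interval scale: three illustrative temperatures in degrees Celsius.
t1, t2, t3 = 10.0, 20.0, 40.0
f1, f2, f3 = (celsius_to_fahrenheit(t) for t in (t1, t2, t3))

# Ratios of differences are the same in both units.
diff_ratio_c = (t3 - t2) / (t2 - t1)
diff_ratio_f = (f3 - f2) / (f2 - f1)

# Ratios of values are not: "twice as hot" has no meaning on this scale.
value_ratio_c = t2 / t1
value_ratio_f = f2 / f1

# Ratio scale: two illustrative lengths in metres.
x, y = 2.0, 6.0
ratio_m = y / x
ratio_ft = metres_to_feet(y) / metres_to_feet(x)

print(diff_ratio_c, diff_ratio_f)    # equal
print(value_ratio_c, value_ratio_f)  # different
print(math.isclose(ratio_m, ratio_ft))
```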

Variables and classification of information

Because variables that conform only to nominal or ordinal measurements cannot reasonably be measured numerically, they are sometimes grouped together as categorical variables. Ratio and interval measurements are grouped together as quantitative variables, which can be either discrete or continuous because of their numerical nature. These distinctions often correspond loosely to data types in computer science: dichotomous categorical variables can be represented by Boolean values, polytomous categorical variables by arbitrarily assigned integers of an integral data type, and continuous variables by a real data type that uses floating-point arithmetic. However, the mapping between computing data types and statistical data types depends on which classification of the latter is used.
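The mapping described above might look as follows in Python. This is a minimal sketch: the record fields, the BloodType enumeration, and the sample values are hypothetical and chosen only to show one field of each kind.

```python
from dataclasses import dataclass
from enum import IntEnum

class BloodType(IntEnum):
    """Polytomous categorical variable; the integer codes are arbitrary labels."""
    A = 0
    B = 1
    AB = 2
    O = 3

@dataclass
class Record:
    smoker: bool           # dichotomous categorical -> Boolean
    blood_type: BloodType  # polytomous categorical -> arbitrary integer code
    visits: int            # discrete quantitative (a count) -> integer
    height_cm: float       # continuous quantitative -> floating point

record = Record(smoker=False, blood_type=BloodType.AB, visits=3, height_cm=172.5)

# Both fields are stored as integers, yet only the count supports
# meaningful arithmetic; the storage type alone does not reveal this.
print(isinstance(record.blood_type, int), isinstance(record.visits, int))
```

The last line is the reason the correspondence is only loose: the same integer storage type holds both a categorical label and a count.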

Other classifications

Other classifications of statistical data have also been proposed. For example, Mosteller and Tukey distinguished grades, ranks, counted fractions, counts, amounts, and balances. Nelder described continuous counts, continuous ratios, count ratios, and categorical modes of data. All of these classifications are used when collecting statistical information.

Issue

The question of whether it is appropriate to apply different kinds of statistical methods to data obtained from different kinds of measurement (collection) procedures is complicated by issues concerning the transformation of variables and the precise interpretation of research questions. The relationship between the data and what they describe simply reflects the fact that certain kinds of statistical statements can have truth values that are not invariant under some transformations. Whether a transformation is sensible to consider depends on the question you are trying to answer.
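A small example makes the point about invariance concrete. The sketch below uses made-up ordinal scores for two groups and a hypothetical recoding that preserves their order. The statement "group A has the lower mean" changes its truth value under the recoding, while "group A has the lower median" does not.

```python
from statistics import mean, median

# Illustrative ordinal scores on a 1-5 scale (invented data).
group_a = [1, 2, 5]
group_b = [3, 3, 3]

# A strictly increasing recoding, which is permissible on an ordinal scale.
recode = {1: 1, 2: 2, 3: 3, 4: 4, 5: 10}
group_a_recoded = [recode[v] for v in group_a]
group_b_recoded = [recode[v] for v in group_b]

# Comparison of means: the truth value flips after recoding.
mean_lower_before = mean(group_a) < mean(group_b)
mean_lower_after = mean(group_a_recoded) < mean(group_b_recoded)

# Comparison of medians: the truth value is preserved.
median_lower_before = median(group_a) < median(group_b)
median_lower_after = median(group_a_recoded) < median(group_b_recoded)

print(mean_lower_before, mean_lower_after)
print(median_lower_before, median_lower_after)
```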

What is a data type?

A data type is a fundamental component of the semantic content of a variable. It determines which probability distributions can logically be used to describe the variable, which operations on it are permissible, which type of regression analysis can be used to predict it, and so on. The concept of a data type is similar to that of a level of measurement but more specific: for example, count data require a different distribution (such as the Poisson or binomial) than non-negative real values do, yet both fall under the same level of measurement (the ratio scale).
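To illustrate, the sketch below contrasts a distribution for counts with one for non-negative real values. The Poisson distribution is named in the text; the exponential distribution is an assumption chosen here as one common model for non-negative reals, and the parameter values are arbitrary.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability mass of a count k under a Poisson distribution with mean lam."""
    if k < 0:
        return 0.0
    return math.exp(-lam) * lam ** k / math.factorial(k)

def exponential_pdf(x: float, rate: float) -> float:
    """Probability density of a non-negative real x under an exponential distribution."""
    if x < 0:
        return 0.0
    return rate * math.exp(-rate * x)

# Counts take integer values, so probabilities are summed.
total_mass = sum(poisson_pmf(k, 3.0) for k in range(60))

# Real values form a continuum, so densities are integrated
# (here with a crude midpoint rule on [0, 40]).
step = 0.001
total_density = sum(exponential_pdf((i + 0.5) * step, 1.0) * step for i in range(40_000))

print(round(total_mass, 6), round(total_density, 6))
```

Both variables live on a ratio scale, but one is described by a mass function over integers and the other by a density over the real line.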

Scales

Various attempts have been made to create a taxonomy of levels of measurement for processing statistical information. To recap the four scales defined by Stanley Smith Stevens: nominal measurements have no meaningful rank order and permit any one-to-one transformation; ordinal measurements have a meaningful order but imprecise differences between consecutive values and permit any order-preserving transformation; interval measurements have meaningful distances but an arbitrary zero point and permit any linear transformation; ratio measurements have both a meaningful zero value and meaningful distances and permit any rescaling transformation.

Data that cannot be described by a single number are often forced into random vectors of real-valued random variables, although there is a growing tendency to treat them in their own right. Examples are discussed below.

Random vectors

The individual elements of a random vector may or may not be correlated. Examples of distributions used to describe correlated random vectors are the multivariate normal distribution and the multivariate t-distribution. In general, arbitrary correlations may exist between any pair of elements, but this often becomes unmanageable above a certain size, which calls for additional restrictions on the correlated elements.
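As a rough illustration, the sketch below draws a two-element correlated normal vector and counts how many free covariance parameters a vector of a given size needs. The function names, the correlation of 0.8, and the sample size are hypothetical choices for the example.

```python
import math
import random

def sample_bivariate_normal(rho, n, seed=0):
    """Draw n pairs from a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        xs.append(z1)
        # Mixing two independent draws induces the correlation rho.
        ys.append(rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    return xs, ys

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def covariance_parameter_count(d):
    """Free parameters in an unrestricted d x d covariance matrix."""
    return d * (d + 1) // 2

xs, ys = sample_bivariate_normal(rho=0.8, n=20_000)
print(round(pearson(xs, ys), 2))
print(covariance_parameter_count(2), covariance_parameter_count(100))
```

The parameter count grows quadratically with the size of the vector, which is one way to see why unrestricted correlations become hard to manage for large vectors.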

Random matrices

Random matrices can be laid out linearly and treated as random vectors, but this may not be an efficient way of representing the correlations between different elements. Some probability distributions are designed specifically for random matrices, for example the matrix normal distribution and the Wishart distribution.
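The sketch below shows both ideas in plain Python. It builds one draw from a Wishart distribution with an identity scale matrix as a sum of outer products of standard normal vectors, then flattens it into a vector. The function names, dimension, and degrees of freedom are hypothetical choices for the example.

```python
import random

def sample_wishart_identity(dim, dof, seed=0):
    """One Wishart draw with identity scale: the sum of dof outer products x x^T."""
    rng = random.Random(seed)
    w = [[0.0] * dim for _ in range(dim)]
    for _ in range(dof):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in range(dim):
            for c in range(dim):
                w[r][c] += x[r] * x[c]
    return w

def vec(matrix):
    """Stack the columns of a matrix into a single flat list."""
    rows, cols = len(matrix), len(matrix[0])
    return [matrix[r][c] for c in range(cols) for r in range(rows)]

w = sample_wishart_identity(dim=3, dof=10)
flat = vec(w)

# The matrix is symmetric, so the flat vector repeats every off-diagonal entry.
is_symmetric = all(w[r][c] == w[c][r] for r in range(3) for c in range(3))
print(is_symmetric, len(flat))
```

The flattened form has nine entries but only six distinct ones, and nothing in a plain vector records which entries must be equal; a matrix-valued distribution keeps that structure.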

Random sequences

Random sequences are sometimes considered the same as random vectors, but in other cases the term is applied specifically where each random variable is correlated only with nearby variables (as in a Markov model). This is a special case of a Bayesian network and is often used for very long sequences, such as gene sequences or long text documents. A number of models are designed specifically for such sequences, for example hidden Markov models.
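A first-order Markov chain is the simplest case of such local dependence: each element depends only on the one before it. The sketch below generates a short nucleotide-like string; the transition probabilities are invented for illustration and do not come from any real genome.

```python
import random

# Hypothetical transition probabilities: TRANSITIONS[current][next].
TRANSITIONS = {
    "A": {"A": 0.6, "C": 0.1, "G": 0.2, "T": 0.1},
    "C": {"A": 0.1, "C": 0.6, "G": 0.1, "T": 0.2},
    "G": {"A": 0.2, "C": 0.1, "G": 0.6, "T": 0.1},
    "T": {"A": 0.1, "C": 0.2, "G": 0.1, "T": 0.6},
}

def sample_markov_sequence(length, start="A", seed=0):
    """Generate a sequence in which each symbol depends only on its predecessor."""
    if length <= 0:
        return ""
    rng = random.Random(seed)
    sequence = [start]
    while len(sequence) < length:
        probs = TRANSITIONS[sequence[-1]]
        symbols, weights = list(probs), list(probs.values())
        sequence.append(rng.choices(symbols, weights=weights)[0])
    return "".join(sequence)

sequence = sample_markov_sequence(50)
print(sequence)
```

Only a 4 x 4 transition table is needed regardless of the sequence length, which is what makes this kind of model practical for very long sequences.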

Random processes

Random processes are similar to random sequences, except that the length of the sequence is indefinite or infinite and the elements are processed one at a time. They are often used for data that can be described as a time series, for example the price of a stock on the following day.
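A Python generator is a natural way to express a sequence of indefinite length whose elements are produced one at a time. The sketch below models a daily price as a Gaussian random walk; this is an assumption made purely for illustration, not a realistic pricing model, and the starting price and volatility are arbitrary.

```python
import itertools
import random

def price_process(start=100.0, volatility=1.0, seed=0):
    """Yield an unbounded series of prices, one per day (Gaussian random walk)."""
    rng = random.Random(seed)
    price = start
    while True:
        yield price
        # Tomorrow's price is today's price plus a random shock.
        price += rng.gauss(0.0, volatility)

# The process has no fixed length; the consumer decides how much to take.
first_five = list(itertools.islice(price_process(), 5))
print(first_five)
```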

Conclusion

The analysis of statistical information depends entirely on the quality of its collection, which in turn is closely tied to how the information can be classified. As this article has shown, there are many ways of classifying statistical information. With effective tools, a good knowledge of mathematics, and some grounding in sociology, you can conduct a survey or study without having to make large corrections for error. Fortunately, sources of statistical information, in the form of people, organizations, and other subjects of sociology, are abundant. And no difficulty should stand in the way of a real researcher.

Source: https://habr.com/ru/post/F1355/

