Any research consists of observing the properties of objects in order to identify and evaluate the significant relationships between the indicators of these properties.
The subject area comprises objects that differ in their properties and are interconnected in certain respects. Solving a programming problem begins with a study of the subject area.
The subject area is a part of the real world; it is infinite and contains both significant and insignificant data. The researcher must be able to single out the significant part. For example, when deciding whether to grant a loan, all data on the client's private life are considered significant (whether the spouse is employed, whether the client is raising minor children, the client's education, etc.). For another banking problem, the same data will be completely irrelevant. The significance of the data therefore depends on what we choose as the subject area.
In the research process, it is necessary to create a domain model in which knowledge from various sources is formalized. The means of formalization can vary widely: for example, a textual description of the subject area or a specialized graphic notation. The domain model is used to describe the processes that occur in the subject area and to study its data.
The statement of the problem also includes a description of the static and dynamic behavior of the objects under investigation. Describing static behavior means characterizing the objects and their properties. Describing dynamic behavior means characterizing the causes of the objects' behavior.
The dynamic behavior of objects is often described along with static behavior.
Sometimes the analysis of the subject area and the statement of the problem are combined into one stage.
At the stage of determining and analyzing data requirements, the data needed for Data Mining are modeled. To do this, the following are investigated: the distribution of users, the analytical characteristics of the system, and issues of access to the data needed for analysis.
The subject area is easier and more efficient to analyze when the organization has a data warehouse. However, far from all enterprises have one. In that case, the source data come from operational databases and from reference and archival materials, that is, from existing information systems (IS).
Information may also be required from managers' information systems, external and internal sources, and various paper documents, as well as the knowledge of specialists and/or the results of surveys.
During data preparation, developers should describe as many of the factors that influence the process as possible. Some data may be encoded at this point. For example, one characteristic of a client is income level, which can be encoded as very low, low, medium, high, or very high. In this case, it is necessary to define the boundaries of each income gradation.
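The income-level encoding described above can be sketched as follows. This is only an illustration: the boundary values in `INCOME_BOUNDS` and the function name `encode_income` are assumptions made for the example, not values taken from the text. In practice, the gradations have to be agreed with domain specialists.

```python
from bisect import bisect_right

# Hypothetical boundaries between gradations (for example, monthly income
# in arbitrary currency units). Real thresholds must come from the domain.
INCOME_BOUNDS = [500, 1500, 4000, 10000]
INCOME_LABELS = ["very low", "low", "medium", "high", "very high"]


def encode_income(income: float) -> str:
    """Map a numeric income to one of five ordered gradations.

    Each boundary is the lower limit of the next gradation, so an income
    exactly equal to a boundary falls into the higher gradation.
    """
    if income < 0:
        raise ValueError("income must be non-negative")
    return INCOME_LABELS[bisect_right(INCOME_BOUNDS, income)]
```

With these assumed boundaries, an income of 300 is encoded as "very low" and an income of 500 as "low". Keeping the boundaries in a single list makes the gradations easy to review and change.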
When determining the right amount of data, you must consider whether the data are ordered.
If the data are ordered, you need to find out whether the dataset contains a seasonal or cyclical component. If they are not ordered, i.e. the events in the database are not linked by a timeline, the following rules should be observed during collection:
1) a small number of records in the database can lead to an inadequate model;
2) the accuracy of the model can be improved by increasing the amount of data;
3) obsolete data should be excluded from the set;
4) the algorithms used to build a model from very large databases should be scalable.
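Rules 1 and 3 can be checked mechanically during data collection. The sketch below is one possible way to do it, under stated assumptions: each record is a dictionary with a `recorded_on` date, and the limits `max_age_days` and `min_records` are hypothetical defaults that would have to be chosen for the specific task.

```python
from datetime import date, timedelta


def select_records(records, today, max_age_days=3 * 365, min_records=1000):
    """Drop obsolete records and report whether enough records remain.

    records      -- iterable of dicts with a 'recorded_on' date (assumed layout)
    today        -- reference date for computing record age
    max_age_days -- hypothetical age limit after which a record is obsolete
    min_records  -- hypothetical minimum size for building a model

    Returns (fresh_records, is_enough).
    """
    cutoff = today - timedelta(days=max_age_days)
    # Rule 3: exclude records older than the cutoff date.
    fresh = [r for r in records if r["recorded_on"] >= cutoff]
    # Rule 1: flag sets that are too small to give an adequate model.
    is_enough = len(fresh) >= min_records
    return fresh, is_enough
```

The function only reports whether the remaining set is large enough. What to do in that case (collect more data, relax the age limit) is left to the analyst.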