The Internet has made information readily available, but choosing the right piece of it still takes serious effort and considerable time. Hypertext languages formalized the presentation of information, yet this did not simplify the task of parsing (recognition); in some areas it even became more complicated. A parser must "know and handle" a great variety of presentation formats, languages, design styles, access options, and data markup methods in order to decide that "this is exactly what is needed."
A person sees and hears primarily through the prism of their own knowledge and experience; once that experience is formalized as an algorithm, the result is a static mechanism, and it becomes clear that the ideal solution is still a long way off.
Parsing Tools Palette
The parser's task, informally: find the necessary information in search-engine results, site content, documents, spreadsheets, and files of other formats. More formally: define and form a flow of information, then apply a set of keywords to it according to certain rules and for a specific purpose.
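As a minimal illustration of that formal definition, here is a hedged sketch in Python: the document list, the keyword set, and the "at least N keywords" rule are all assumptions invented for this example, not part of any tool discussed in this article.

```python
import re

# Hypothetical information flow: any iterable of text fragments
# (pages, documents, spreadsheet rows) reduced to plain strings.
documents = [
    "Hotel prices and museum opening hours in Prague",
    "Quarterly sales spreadsheet, internal use only",
    "Weather forecast and food prices for popular vacation spots",
]

# The "set of keywords" and the "rule": keep a fragment if it
# contains at least `threshold` of the keywords (case-insensitive).
keywords = {"prices", "weather", "museum"}
threshold = 1

def parse_flow(texts, keywords, threshold):
    for text in texts:
        words = set(re.findall(r"\w+", text.lower()))
        hits = keywords & words
        if len(hits) >= threshold:
            yield text, sorted(hits)

for text, hits in parse_flow(documents, keywords, threshold):
    print(hits, "->", text)
```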
Parsing algorithms are traditionally divided into syntactic and semantic and may cover any number of languages. A parsing tool can be a standalone program, a website, or a plugin, and there are many implementations, each with its own advantages and disadvantages. The X-Parser content parser, for example, works from a list of keywords and returns plain text, lists of snippets, links, and URLs; it also offers an advanced system of filters, language settings, and result formatting.
The DataCol program focuses on collecting information to fill a site with content. When creating a site on a specific topic (restaurants, shops, a tour operator, ...), a certain amount of general information is always needed, and it is far faster to gather it from the Internet automatically than to scan or type it in manually.
Mailagent Parser specializes in collecting email addresses, while SlimerJS makes it possible to analyze complex dynamic sites quickly. The WordPress content management system offers its own parsing module, which can be configured, for example, as a continuously and automatically updated news feed.
There are many tools, yet the amount of work involved in forming, taking apart, and formatting information flows keeps growing steadily.
Using the available tools is less about bolting something ready-made onto your resource and more about understanding what mechanism a particular parsing task actually requires.
The main areas of parsing
The typical customer confidently insists that a parser is simply a filter. Indeed, to satisfy a visitor's request, a search site analyzes many information sources, although most often it digs through its own databases, which it nevertheless replenishes systematically. Any decent site also offers search over its own content, its information, and related sites. All of this belongs to the topic of "what is a parser," but the real substance of the task lies in a different plane.
Credit is due to the hypertext languages: their numerous but strict tags and data formatting conventions make it possible to formalize rigidly what the browser should recognize, and that is already parsing. Many information-retrieval tools rely on browser engines. Regular expressions are another effective way to find the information you need. jQuery is a special case: a parser that lives inside the document, forms part of it, and manipulates it.
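As a small illustration of the regular-expression route, the sketch below pulls links out of an HTML fragment using only Python's standard library; the fragment and the pattern are invented for this example and are not tied to any of the tools named above.

```python
import re

# Hypothetical HTML fragment; in practice it would come from a downloaded page.
html_text = """
<ul>
  <li><a href="https://example.com/news">News</a></li>
  <li><a href="https://example.com/weather">Weather</a></li>
</ul>
"""

# A deliberately simple pattern: the href value and the visible link text.
# Real-world HTML is messy, which is why dedicated parsers usually beat
# regular expressions for anything beyond quick one-off extraction.
link_pattern = re.compile(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>',
                          re.IGNORECASE | re.DOTALL)

for href, text in link_pattern.findall(html_text):
    print(href, "->", text.strip())
```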
What is a parser? It is PHP, the browser, and the JavaScript built into it. These tools perform a largely syntactic function. The real essence lies elsewhere: a parser is defined by the value it delivers, and that value is set by its scope and purpose.
For a travel agency, the task might be a parser for vacation spots that keeps information on accommodation, weather, food prices, and museum opening hours up to date. For a news site, you would write something that analyzes a specific set of sites and collects fresh information from them; a minimal sketch of that idea follows below.
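The following hedged sketch shows the bare bones of "analyze a set of sites and collect something fresh from each": the source URLs are placeholders, only the page title is extracted, and a real news parser would of course run on a schedule and dig deeper into each page.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Hypothetical list of sources; a real news parser would use its own list.
sources = [
    "https://example.com/",
    "https://example.org/",
]

class TitleParser(HTMLParser):
    """Remembers the text inside the first <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def collect_titles(urls):
    for url in urls:
        parser = TitleParser()
        with urlopen(url, timeout=10) as response:
            parser.feed(response.read().decode("utf-8", errors="replace"))
        yield url, parser.title.strip()

for url, title in collect_titles(sources):
    print(url, "->", title)
```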
The structure and content of the process
Before a meaningful answer can be given to the question "a parser: what is it?", an information flow has to be formed and a set of keywords defined. Despite the apparent formality, the input to a search analysis algorithm is varied, and the searched words and their sequences can easily drift beyond the intended semantics.
Even the major search engines, when executing a user's query, often fail to return what is actually meant, and on top of that they pad whatever they do return with a significant amount of advertising and spam at their own discretion.
It is far too early to call a parser the equivalent of artificial intelligence, even though building one means constructing algorithms that must adapt to changing information flows and to flexible rules for forming and using keywords.
The lion's share of the "parsing" that a person performs automatically and unconsciously every second is very simple; the logic of this process can be formalized fairly easily, as existing tools partly demonstrate.
From statics to dynamics
One can also say that a parser is a combination of an algorithm for forming the information flow, the rules for determining keywords, and the way those keywords are applied. But these three foundations are as unsteady as sand, and in any particular application they can be interpreted differently.
A banal Google search, its own flavor of parsing, for the word "key" has practically zero chance of finding even one article about a spring quietly murmuring somewhere in a lovely spot (in Russian the same word, "ключ", means both a key and a spring). The chance does not improve even if you refine the query to "key in the clearing." Google will faithfully return:
- The key to the start!
- Outdoor recreation - Official site of the administration ...
- Goryachy Klyuch: the official "Goryachiy Klyuch" site, the "Goryachiy Klyuch" forum ...
- On a glade: Attractions of Taganay - Taganay National Park
- Guest house on Krasnaya Polyana, rent a house (cottage) on New ...
- Sky Key - Results from Google Books
...
Naturally, a parsing algorithm should refine this output and provide information about the key as a spring: what springs are, where they occur, why they are interesting and useful. Obviously, even the most sophisticated parsing of Google's output will yield nothing here.
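As a toy illustration of what "refining the output" might mean, the sketch below re-scores a list of result titles against context words for the intended sense of "key" (a spring); the titles, context words, and scoring rule are all invented for this example rather than taken from any real engine.

```python
# Hypothetical search-result titles (a mix of senses of "key").
results = [
    "The key to the start!",
    "Goryachy Klyuch: official site and forum",
    "Springs and brooks of the Taganay National Park",
    "Rent a cottage on Krasnaya Polyana",
]

# Context words that hint at the intended sense: a natural spring.
spring_context = {"spring", "springs", "brook", "brooks", "water", "clearing", "glade"}

def score(title, context):
    """Counts how many context words appear in the title."""
    words = set(title.lower().replace(",", " ").split())
    return len(words & context)

# Keep only results that share at least one context word, best first.
filtered = sorted(
    (r for r in results if score(r, spring_context) > 0),
    key=lambda r: score(r, spring_context),
    reverse=True,
)
print(filtered)
```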
Active knowledge
For the problem to be solved properly, what must be parsed is not the search-engine results but the content of many sites and an indefinite number of articles. How do you get a meaningful flow of information from the single word "key"?
There is really only one option: the keywords have to be made active, meaning that the search for a specific word expands along its meaning. The search rule must also be active: what is initially given is first turned into a preliminary clarification of the meaning, and only then does the work proceed, both toward forming the right source of information (the analyzed stream) and toward what is actually parsed out of it.
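One crude way to picture "active" keywords is expansion through a sense map: the ambiguous word is first clarified into a sense, and that sense brings its own related terms into the query. The map and terms below are invented for illustration; a real system would learn or look them up rather than hard-code them.

```python
# Hypothetical sense map: each sense of an ambiguous word carries
# its own cluster of related terms that expand the search.
senses = {
    "key (spring)": {"spring", "brook", "clearing", "water source"},
    "key (lock)": {"lock", "door", "duplicate", "keychain"},
}

def expand(word, chosen_sense, sense_map):
    """Returns the original word plus the terms of the chosen sense."""
    return {word} | sense_map.get(chosen_sense, set())

# After a preliminary clarification of meaning (here simply chosen by hand),
# the expanded set drives both source selection and extraction.
query_terms = expand("key", "key (spring)", senses)
print(sorted(query_terms))
```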
Active knowledge belongs to the chain Man > Intellect > Programming, a hybrid of all three. It is not just a rule and not just a keyword. A person acquired intelligence and formalized it through programming, not statically but dynamically, giving parsing a new quality: variability at the input and mobility in the process.
This concept presupposes an element of self-development. That is hard, but if popular search engines have "learned" to analyze search queries and to send reasonably relevant ads to each browser, the same success can surely be steered in a more useful direction.
The ideal solution: your own knowledge and experience > the prism of the right rules
Parsing has become a serious, tangible task and has produced concrete experience in forming information flows and in the rules for using keywords. Character recognition, the recognition of scanned images, and near-"perfect" machine translation, together with the development of interaction interfaces (site APIs, search engines, parsers), point to the right direction of travel.
It is still hard to say exactly how all of this will be implemented, but one thing is clear: the rules for forming information flows, the structure of the keywords, and the evolution of the tool itself should all be active, and because modern programming languages are on the whole static and formal, this active component will have to be shaped in the course of use.
This is a case where the natural human factor, applied to solving pressing problems, can and will contribute to the training and development of the parsing field and to the formation of that prism of the right rules.