Frequency text analysis: features and examples

If you have ever had to work with texts, you have probably met this concept more than once. You may, for example, have used online calculators that perform frequency analysis of a text. These handy tools show how many times a particular character or letter occurs in a passage, often together with its percentage. Why is this needed? How does frequency analysis "crack" simple ciphers? What is its essence, and who invented it? This article answers these and other questions on the topic.

Definition

Frequency analysis is one of the methods of cryptanalysis. It rests on the assumption that individual characters and their regular sequences follow a non-trivial statistical distribution in both plaintext and ciphertext.

It is assumed that this distribution is preserved, up to the substitution of individual characters, through encryption and decryption.

Characteristics of the process

Let us now look at frequency analysis in plain language. The idea is that, in texts of sufficient length, a given letter of the alphabet occurs with roughly the same relative frequency in different texts written in the same language.

What does this mean for monoalphabetic ciphers? If some character in a stretch of ciphertext occurs with a similar probability, it is reasonable to assume that it stands for that very letter.
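The reasoning above can be sketched in a few lines of Python. This is a minimal illustration, not a complete attack: the function names and the reference ordering `ENGLISH_ORDER` are assumptions introduced here, and on short ciphertexts the guess will usually need manual correction.

```python
from collections import Counter

# Assumed reference ranking of English letters, most frequent first.
# Published tables differ slightly from corpus to corpus.
ENGLISH_ORDER = "etaoinshrdlcumwfgypbvkjxqz"

def guess_substitution(ciphertext, reference_order=ENGLISH_ORDER):
    """Pair ciphertext letters, ranked by frequency, with the reference ranking."""
    letters = [ch for ch in ciphertext.lower() if ch.isalpha()]
    ranked = [ch for ch, _ in Counter(letters).most_common()]
    return dict(zip(ranked, reference_order))

def apply_guess(ciphertext, mapping):
    """Replace every mapped letter; leave spaces and other symbols untouched."""
    return "".join(mapping.get(ch, ch) for ch in ciphertext.lower())

mapping = guess_substitution("xxxx yyy zz q")
print(apply_guess("xyz q", mapping))
```

The most frequent ciphertext letter is paired with the most frequent letter of the language, the second with the second, and so on. The result is only a first guess that an analyst would then refine.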

Proponents of frequency analysis apply the same reasoning to bigrams (two-letter sequences), trigrams and so on, including in the case of polyalphabetic ciphers.

History of the method

Frequency analysis is not a modern invention. It has been known to the scientific world since the 9th century, and its creation is associated with the name of Al-Kindi.

The best-known applications of the method, however, date from a much later period. The most striking example is the decipherment of Egyptian hieroglyphs by J.-F. Champollion in 1822.

Fiction also contains many interesting references to this decryption method:

  • Arthur Conan Doyle, "The Adventure of the Dancing Men".
  • Jules Verne, "The Children of Captain Grant".
  • Edgar Allan Poe, "The Gold-Bug".

However, since the middle of the 20th century most encryption algorithms in use have been designed to resist this kind of frequency cryptanalysis. Today it is therefore used mostly for training future cryptographers.

The basis of the method

Let us now look at the analysis of frequency characteristics in more detail. It rests directly on the fact that a text consists of words and words consist of letters. The number of letters in any national alphabet is limited, so the letters can simply be enumerated.

The most important characteristics of a text are the repetition rates of letters, bigrams, trigrams and longer n-grams, as well as the way different letters combine with one another, the alternation of consonants and vowels, and similar patterns.

The main idea of the method is to count the occurrences of each of the $n^m$ possible m-grams in plaintexts $T = t_1 t_2 \ldots t_l$ that are long enough for analysis and composed of letters of the national alphabet $\{a_1, a_2, \ldots, a_n\}$. To do so, the consecutive m-grams of the text are written out:

$t_1 t_2 \ldots t_m,\; t_2 t_3 \ldots t_{m+1},\; \ldots,\; t_{l-m+1} t_{l-m+2} \ldots t_l.$

If $\vartheta(a_{i_1} a_{i_2} \ldots a_{i_m})$ is the number of occurrences of the m-gram $a_{i_1} a_{i_2} \ldots a_{i_m}$ in a text $T$, and $L$ is the total number of m-grams counted, then experiment shows that for sufficiently large $L$ the relative frequencies $\vartheta(a_{i_1} a_{i_2} \ldots a_{i_m}) / L$ of a given m-gram differ little from one text to another.
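As a rough sketch of this counting procedure, the Python function below slides a window of length m over a text and divides each count by the total number of m-grams. The function name `mgram_frequencies` is an assumption made for illustration; a real study would first normalise the text (case, punctuation, spaces).

```python
from collections import Counter

def mgram_frequencies(text, m):
    """Return the relative frequency of every consecutive m-gram in text."""
    # A text of length l yields l - m + 1 overlapping m-grams.
    grams = [text[i:i + m] for i in range(len(text) - m + 1)]
    total = len(grams)
    if total == 0:
        return {}
    counts = Counter(grams)
    return {gram: count / total for gram, count in counts.items()}

freqs = mgram_frequencies("abab", 2)
print(freqs)  # "ab" occurs in 2 of the 3 bigrams, "ba" in 1 of 3
```

Running the same function over several long texts in one language and comparing the results is a simple way to observe how stable these relative frequencies are.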

Frequently encountered letters of the Russian alphabet

Time-frequency analysis, despite the similar name, has nothing to do with our topic. That kind of analysis is applied to the signals of stealth radars and uses a special wavelet transform.

Let's return to the main topic. Frequency analysis of fairly long texts shows which letters of the Russian alphabet occur most often (relative frequencies from 0.062 down to 0.018):

  • A.
  • V.
  • D.
  • G.
  • I.
  • K.
  • M.
  • O.
  • R.
  • T.
  • F.
  • C.
  • Sh.
  • B.
  • E.
  • Ya.

There is even a mnemonic for the most common letters of the Russian alphabet: it is enough to remember a single word, "senovaletr".

In general, the frequency of a letter as a percentage is computed simply: count how many times the letter appears in the text, divide that count by the total number of characters in the text, and multiply the result by 100.
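The calculation just described might look like this in Python. It is a minimal sketch: the name `letter_percentages` is hypothetical, and the divisor is the total number of characters in the text, as in the description above, not the number of letters.

```python
from collections import Counter

def letter_percentages(text):
    """Map each letter to its share of all characters in the text, in percent."""
    total = len(text)  # every character counts, including spaces and punctuation
    if total == 0:
        return {}
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    # most_common() orders the letters from most to least frequent.
    return {ch: 100 * n / total for ch, n in counts.most_common()}

print(letter_percentages("Abca"))
```

Dividing by the number of letters instead of the number of characters is an equally common convention. The two give different percentages, so it is worth stating which one a table uses.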

Keep in mind that the frequency depends not only on the length of the text but also on its nature. For example, the letter "F" appears much more often in technical texts than in fiction. For objective results, therefore, a researcher should collect texts of different kinds and styles.

Bigrams, trigrams and four-grams

Meaningful texts also contain combinations of two or more letters that recur more often than others. Specialists have compiled tables listing the frequencies of such bigrams for various alphabets.

For Russian, frequency analysis of large meaningful texts has established the most common bigrams and trigrams:

  • EN.
  • ST.
  • NO.
  • NE.
  • NA.
  • RA.
  • OV.
  • KO.
  • VO.
  • STO.
  • NOV.
  • ENO.
  • TOV.
  • OVA.
  • OVO.

Preferred connections between letters

This does not exhaust what frequency analysis can offer to text researchers. By systematizing the information in such tables of bigrams and trigrams, one can extract data on the most common letter combinations or, in other words, on the preferred connections of each letter.

Specialists have already carried out such an extensive study. Its result was a table in which every letter of the alphabet is listed together with its neighbors, namely the characters that often occur immediately before it and immediately after it. The order of the letters in the table is not accidental: the most frequent neighbors are placed closest to the letter, and rarer neighbors further away.

Consider the following examples:

  • The letter "A". The following preferred connections stand out here: "l-d-k-t-v-r-n-A-l-n-s-t-r-v-k-m". From this we see that the letter found most often before "A" is "N" ("NA"), and the letter found most often after "A" in Russian texts is "L" ("AL").
  • The letter "M". Experts have identified these preferred connections: "ya-s-a-i-e-o-M-i-e-o-o-a-n-p-s".
  • The letter "b". Preferred connections: "n-c-t-l-b-n-c-b-c-s-e-o-o".
  • The letter "". Preferred connections: "e-ba-a-i-u-uh-e-and-a".
  • The letter "P". Preferred connections of this letter of the Russian alphabet: "w-w-o-a-i-e-o-p-o-r-e-a-u-i-l".
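A table of this kind can be approximated from any sample text. The sketch below is an illustration only: the function name `preferred_neighbours` and the `top` parameter are assumptions, and the published tables were built from far larger corpora than a single string.

```python
from collections import Counter

def preferred_neighbours(text, letter, top=5):
    """Return the most frequent letters found just before and just after `letter`."""
    text = text.lower()
    letter = letter.lower()
    before, after = Counter(), Counter()
    for i, ch in enumerate(text):
        if ch != letter:
            continue
        # Only letters count as neighbours; spaces and punctuation are skipped.
        if i > 0 and text[i - 1].isalpha():
            before[text[i - 1]] += 1
        if i + 1 < len(text) and text[i + 1].isalpha():
            after[text[i + 1]] += 1
    return (
        [c for c, _ in before.most_common(top)],
        [c for c, _ in after.most_common(top)],
    )

print(preferred_neighbours("banana", "a"))
```

Both lists are ordered from the most frequent neighbour to the rarest. To reproduce the layout of the table described above, where the closest neighbours sit next to the letter, the "before" list would be printed in reverse.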

What does the analysis report?

Modern frequency analysis programs help to study large volumes of all kinds of articles, essays, passages and so on. As standard, the researcher receives the following information:

  • The total number of characters in the text.
  • The number of spaces used by the author.
  • The number of digits.
  • Information about the punctuation marks used: periods, commas, etc.
  • The number of letters of each of the alphabets present: Cyrillic, Latin, etc.
  • Information on the frequency of each letter and symbol in the text: the number of occurrences and its percentage of the whole text.
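A report with the first few items of this list could be produced roughly as follows. This is a sketch under stated assumptions: the function name `text_stats` and the dictionary keys are invented for the example, punctuation is limited to the ASCII marks in `string.punctuation`, and alphabets are recognised by their Unicode character names.

```python
import string
import unicodedata

def text_stats(text):
    """Count characters, spaces, digits, punctuation and letters per alphabet."""
    stats = {"total": len(text), "spaces": 0, "digits": 0,
             "punctuation": 0, "cyrillic": 0, "latin": 0}
    for ch in text:
        if ch.isspace():
            stats["spaces"] += 1
        elif ch.isdigit():
            stats["digits"] += 1
        elif ch in string.punctuation:  # ASCII punctuation only
            stats["punctuation"] += 1
        elif ch.isalpha():
            # Unicode names look like "CYRILLIC SMALL LETTER EM".
            name = unicodedata.name(ch, "")
            if name.startswith("CYRILLIC"):
                stats["cyrillic"] += 1
            elif name.startswith("LATIN"):
                stats["latin"] += 1
    return stats

print(text_stats("Hi, мир 42!"))
```

Per-letter counts and percentages, the last item of the list, can be added on top of these totals with a simple counter.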

The fight against over-optimization and oversaturation

What is frequency analysis of a text for? Is it merely a matter of curiosity, to find out which characters occur most often? No: its main application is a practical one, and of a different kind.

N-grams are not limited to stable bigrams and trigrams of letters. The category also includes keywords (tags) and collocations, that is, stable combinations of two or more words. What distinguishes them is that such words occur together in a text and carry a particular meaning as a unit.

This plays into the hands of unscrupulous SEO specialists. They sometimes abuse the repetition of tags and keywords in order to artificially raise the relevance of a web page. One such trick is to turn a natural word combination, one that follows the usual word order of the language ("buy a mink coat"), into an unnatural one obtained by rearranging the words of that N-gram (for instance, "mink coat buy").

Today, however, search algorithms detect over-optimization as effectively as spam, that is, the oversaturation of a text with keywords and tags meant to influence ranking on the results page. Over-optimized pages now rank lower for the user's query instead. Readers, for their part, have no wish to read meaningless text stuffed with tags and prefer useful information on another resource.

How frequency analysis helps SEO specialists

Search engine text filters now favor pages whose content is not only easy to read but also useful to visitors. To adapt their work to the new standards, SEO specialists turn to frequency analysis of the text, which many popular services offer today.

Frequency analysis helps to check a text for informational value before publication and to eliminate redundant tags and key phrases. It also draws the author's attention to unnatural word combinations that arouse the suspicion of search engine text filters.
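A check of this kind can be sketched with word-level n-grams. The code below is a hypothetical illustration and does not describe how any particular service or search engine works; the function name `phrase_usage` and the returned keys are assumptions.

```python
import re

def phrase_usage(text, phrase):
    """Count exact and reordered occurrences of a key phrase among word n-grams."""
    words = re.findall(r"\w+", text.lower())
    target = re.findall(r"\w+", phrase.lower())
    n = len(target)
    if n == 0:
        return {"windows": 0, "exact": 0, "permuted": 0}
    # Every run of n consecutive words is one candidate window.
    windows = [words[i:i + n] for i in range(len(words) - n + 1)]
    exact = sum(1 for w in windows if w == target)
    # Same words in a different order: a sign of an unnatural permutation.
    permuted = sum(1 for w in windows
                   if w != target and sorted(w) == sorted(target))
    return {"windows": len(windows), "exact": exact, "permuted": permuted}

sample = "buy mink coat now buy mink coat cheap coat mink buy"
print(phrase_usage(sample, "buy mink coat"))
```

A high share of exact matches suggests keyword oversaturation, while permuted matches point to the rearranged phrases described earlier. Where to draw the line is a judgment call that this sketch leaves to the author.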

Frequency analysis of a text thus determines how often each symbol occurs in the source. Today the method is also used to assess how saturated a text is with tags and with unnatural permutations of words.

Source: https://habr.com/ru/post/F16811/

