Encoding text information in a computer

A computer is a complex device with which you can create, convert and store information. However, it does not operate on that information in a form that is obvious to us: graphic, textual and numerical data are all stored as arrays of binary numbers. In this article we will look at how textual information is encoded.

What we see as text, a computer sees as a sequence of characters. Each character is represented by a specific pattern of zeros and ones. "Characters" here means not only lowercase and uppercase letters of the Latin alphabet, but also punctuation marks, arithmetic signs, auxiliary and special symbols, and even the space.

Binary Text Encoding

When a key is pressed, an electrical signal is sent to the internal controller, where it is converted into a binary code. That code is mapped to a specific character, which is displayed on the screen. To represent the Latin alphabet in digital form, the international ASCII coding system was created. In it, one character takes 1 byte, so each character is an eight-digit sequence of zeros and ones. The values run from 00000000 to 11111111, which means this system can encode 256 characters. In most cases, this is enough.
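
To make this concrete, here is a minimal sketch in Python (an illustration, not part of the original article) that shows a character, its ASCII code and the eight-digit binary pattern behind it:

```python
# A character, its ASCII code, and the 8-bit binary pattern behind it.
ch = "A"
code = ord(ch)                # ordinal value (code point) of the character
print(code)                   # -> 65
print(format(code, "08b"))    # -> 01000001, the eight-digit binary form
print(ch.encode("ascii"))     # -> b'A', the character stored as one byte
```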

ASCII is divided into two parts. The first 128 characters (from 00000000 to 01111111) are international: control characters, punctuation and the letters of the English alphabet. The second part, the extension (from 10000000 to 11111111), is intended for national alphabets whose spelling differs from Latin.

Text encoding in ASCII follows the principle of an increasing sequence: the greater the ordinal number of a Latin letter in the alphabet, the greater the value of its ASCII code. The digits and the Russian part of the table are built on the same principle.
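
This increasing-sequence property is easy to verify; a short Python check (again, an illustration rather than anything from the original article):

```python
# Consecutive Latin letters get consecutive, increasing ASCII codes.
for ch in "ABC":
    print(ch, ord(ch), format(ord(ch), "08b"))
# A 65 01000001
# B 66 01000010
# C 67 01000011

# Digits follow the same increasing-sequence principle.
print(ord("0") < ord("1") < ord("9"))   # -> True
```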

However, there are several other common encodings for Cyrillic letters. The most widespread are KOI-8 (an eight-bit encoding used as early as the 1970s on the first Russified Unix systems), ISO 8859-5 (developed by the International Organization for Standardization), CP 1251 (the text encoding used in modern Windows), and the 2-byte Unicode encoding, which can represent 65,536 characters. This variety exists because the encodings were developed at different times, for different operating systems and for different purposes. As a result, difficulties often arise when text is transferred from one medium to another: if the encodings do not match, the user sees only a set of obscure icons. How can this be fixed? In Word, for example, when you open such a document you get a message about problems with displaying the text, and several recoding options are offered.
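
The mismatch effect is easy to reproduce. The sketch below (a minimal Python illustration, assuming the standard codec names koi8_r, iso8859_5 and cp1251) encodes the same Cyrillic word with three code pages, then decodes one result with the wrong table:

```python
# The same Cyrillic word produces different bytes in each code page.
word = "текст"  # Russian for "text"
for enc in ("koi8_r", "iso8859_5", "cp1251"):
    print(enc, word.encode(enc).hex())

# Decoding CP 1251 bytes as KOI-8 yields the "obscure icons" described
# above: valid letters, but not the ones that were written.
raw = word.encode("cp1251")
print(raw.decode("koi8_r"))  # garbled output, e.g. 'РЕЙЯР'
```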

So, the encoding and processing of textual information deep inside a computer is a rather complicated and time-consuming process. All the characters of any alphabet are just particular sequences of binary digits, one cell holding one byte of information.

Source: https://habr.com/ru/post/C38310/

