UTF-8 - character encoding

Unicode covers almost all existing character sets. The best encoding form for Unicode is UTF-8. It offers compatibility with ASCII, resistance to data corruption, efficiency, and ease of processing. But first things first.

Encoding forms

Computers handle numbers not just as abstract mathematical objects, but as combinations of fixed-size units of storage and processing: bytes and 32-bit words. An encoding standard must take this into account when defining how characters are represented as numbers.

In computer systems, integers are stored in memory cells of 8 bits (1 byte), 16 bits, or 32 bits. Each Unicode encoding form determines which sequence of such cells represents the integer assigned to a particular character. The standard defines three encoding forms, based on 8-bit, 16-bit, and 32-bit code units, called UTF-8, UTF-16, and UTF-32 respectively. UTF stands for Unicode Transformation Format. The three forms are equally valid ways of representing Unicode characters, and each has advantages in particular areas of application.

Each of these encoding forms can represent every Unicode character. They are therefore fully interoperable for solutions that, for various reasons, use different encoding forms. Each can be converted unambiguously to either of the other two without loss of data.
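
The lossless conversion described above can be checked with a short Python sketch. The sample string and the helper name transcode are illustrative assumptions, not part of the original article; the codec names are the ones shipped with Python.

```python
# Sample text: ASCII, Cyrillic, CJK, and one supplementary-plane character.
sample = "A \u0414 \u8a9e \U0001F600"

def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Convert bytes from one encoding form to another via code points."""
    return data.decode(src).encode(dst)

# Chain the text through all three forms and back to the starting one.
utf8 = sample.encode("utf-8")
utf16 = transcode(utf8, "utf-8", "utf-16-le")
utf32 = transcode(utf16, "utf-16-le", "utf-32-le")
back = transcode(utf32, "utf-32-le", "utf-8")

print(back == utf8)  # True: the round trip preserves every byte
```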


Non-overlap principle

Each Unicode encoding form is designed so that the encodings of different characters never partially overlap. By contrast, Windows code page 932 builds characters from one or two bytes. The length of a sequence depends on its first byte, so the lead byte of a two-byte sequence never has the same value as a single-byte character. However, a single byte and the trail byte of a two-byte sequence can have the same value. This means, for example, that a search for the character "D" (code 0x44) can mistakenly match the second byte of the two-byte character "Д" (code 0x84 0x44). To decide which interpretation is correct, the program must examine the preceding bytes.

The situation becomes more complicated when lead and trail byte values can also coincide. To resolve the ambiguity, the program then has to scan backward until it reaches the beginning of the text or an unambiguous code sequence. This is not only inefficient but also fragile, because a single wrong byte is enough to make the whole text unreadable.

The Unicode encoding forms avoid this problem because the values of lead, trail, and single code units never coincide. As a result, all Unicode encoding forms are suitable for searching and comparison, and never give a false match because different parts of character codes happen to coincide. This absence of overlap distinguishes them from the legacy multibyte East Asian encodings.
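
The false match in code page 932 and its absence in UTF-8 can be reproduced with a minimal Python sketch. It assumes Python's built-in cp932 codec as a stand-in for the Windows code page; the variable names are illustrative.

```python
needle = "D"
haystack = "\u0414"  # Cyrillic capital letter De: one character, not "D"

# Legacy double-byte encoding: the trail byte has the same value as ASCII "D".
legacy = haystack.encode("cp932")
print(legacy.hex(" "))                   # 84 44
print(needle.encode("cp932") in legacy)  # True: a false match

# UTF-8: every byte of a multibyte sequence is >= 0x80, so ASCII cannot match inside it.
utf8 = haystack.encode("utf-8")
print(utf8.hex(" "))                     # d0 94
print(needle.encode("utf-8") in utf8)    # False
```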

Another aspect of the non-overlap of Unicode encoding forms is that every character has clearly defined boundaries. This removes the need to scan back over an indefinite number of preceding characters. The property is sometimes called self-synchronization. Corrupting one code unit damages only one character, and the surrounding characters remain intact. In UTF-8, if a pointer refers to a byte of the form 10xxxxxx (in binary), it takes one to three steps backward to find the beginning of the character.
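
The backward scan can be sketched in Python as follows. The function name char_start is hypothetical, and the sketch assumes its input is well-formed UTF-8.

```python
def char_start(data: bytes, index: int) -> int:
    """Return the index of the first byte of the character containing data[index].

    Assumes well-formed UTF-8, so at most three steps back are needed.
    """
    steps = 0
    # Continuation bytes have the bit pattern 10xxxxxx.
    while index > 0 and (data[index] & 0xC0) == 0x80 and steps < 3:
        index -= 1
        steps += 1
    return index

# One 1-byte, one 3-byte, and one 4-byte character.
data = "a\u20ac\U0001F600".encode("utf-8")
print([char_start(data, i) for i in range(len(data))])
# [0, 1, 1, 1, 4, 4, 4, 4]
```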


Coherence

The Unicode Consortium fully supports all three encoding forms. It is important not to set UTF-8 against Unicode, because all of the transformation formats are equally legitimate encoding forms of the Unicode Standard.

Byte orientation

A UTF-32 character takes one 32-bit code unit, whose value equals the Unicode code point. UTF-16 takes one or two 16-bit code units, and UTF-8 takes one to four bytes.
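
As a rough illustration, the number of code units per character can be counted in Python. The helper code_units and the sample characters are assumptions chosen for the example.

```python
def code_units(ch: str) -> dict:
    """Count the code units each encoding form needs for one character."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit units
    }

# Latin A, Cyrillic De, the euro sign, and a supplementary-plane emoji.
for ch in ("A", "\u0414", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", code_units(ch))
```

Only the last character, which lies above U+FFFF, needs two UTF-16 units and four UTF-8 bytes.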

UTF-8 is designed for compatibility with byte-oriented, ASCII-based systems. Most existing software and long-standing practice in information technology rely on representing characters as sequences of bytes. Many protocols depend on ASCII values remaining unchanged and either use or avoid particular control characters. Unicode can be adapted to such environments by using an 8-bit encoding form in which every ASCII character, including the control characters, is represented by the same byte value as in ASCII. UTF-8 is intended for exactly this purpose.

Variable length

UTF-8 is a variable-length encoding built from 8-bit code units, whose high bits indicate which part of a sequence each byte belongs to. One range of values is reserved for the first byte of a sequence and another for the bytes that follow. This is what keeps the encoding free of overlap.
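
A minimal Python sketch of how the high bits identify the role of each byte. The function name byte_role and its labels are illustrative, and the check looks at bit patterns only.

```python
def byte_role(b: int) -> str:
    """Classify one UTF-8 byte by its high bits (bit pattern only)."""
    if b < 0x80:
        return "single"        # 0xxxxxxx: a complete ASCII character
    if b < 0xC0:
        return "continuation"  # 10xxxxxx: second or later byte of a sequence
    if b < 0xE0:
        return "lead-2"        # 110xxxxx: starts a two-byte sequence
    if b < 0xF0:
        return "lead-3"        # 1110xxxx: starts a three-byte sequence
    if b < 0xF8:
        return "lead-4"        # 11110xxx: starts a four-byte sequence
    return "invalid"           # 0xF8-0xFF never occur in UTF-8

# Note: 0xC0, 0xC1 and 0xF5-0xF7 match a lead pattern here,
# but they do not occur in well-formed UTF-8.
print([byte_role(b) for b in "a\u20ac".encode("utf-8")])
# ['single', 'lead-3', 'continuation', 'continuation']
```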


ASCII

UTF-8 fully preserves the ASCII codes (0x00-0x7F). Unicode characters U+0000-U+007F are converted to the single bytes 0x00-0x7F and are thus indistinguishable from ASCII. Moreover, to avoid ambiguity, the values 0x00-0x7F are never used in any byte of a multibyte sequence. Characters in the range U+0080-U+07FF are represented by two bytes, characters in the range U+0800-U+FFFF by three bytes, and supplementary characters with codes above U+FFFF by four bytes.
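
The range boundaries above can be turned into a small hand-written encoder. This is a sketch for illustration only: the name encode_utf8 is hypothetical, and in practice Python's built-in str.encode("utf-8") does the same job.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by following the range table."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0x7F:       # one byte:    0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:      # two bytes:   110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:     # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Compare against the built-in codec at each range boundary.
for cp in (0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    print(f"U+{cp:04X}", encode_utf8(cp).hex(" "), encode_utf8(cp) == chr(cp).encode("utf-8"))
```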

Application area

UTF-8 is usually the preferred encoding for HTML and similar formats and protocols.

XML was the first standard with full support for UTF-8. Standards organizations also recommend it. The problem of non-ASCII characters in URLs was resolved when the W3C and the IETF agreed that all URLs should be encoded exclusively in UTF-8.
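
In practice, a non-ASCII character in a URL is first encoded as UTF-8 and each resulting byte is then percent-escaped. A brief Python sketch using the standard urllib.parse module; the path is a made-up example.

```python
from urllib.parse import quote, unquote

path = "/caf\u00e9"     # hypothetical path containing a non-ASCII character
encoded = quote(path)   # quote() encodes as UTF-8 by default, then percent-escapes

print(encoded)                   # /caf%C3%A9
print(unquote(encoded) == path)  # True
```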

ASCII compatibility eases the transition to new software. Most text editors work with UTF-8, including jEdit, Emacs, BBEdit, Eclipse, and Notepad on Windows. No other Unicode encoding form can boast such support from tools.

Another advantage of UTF-8 is that it is simply a sequence of bytes. UTF-8 strings are easy to use in C and other programming languages. It is also the only encoding form that does not require a byte order mark (BOM) or an XML encoding declaration.
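
One reason often given for this ease of use in C, which the article does not spell out, is that UTF-8 never produces a zero byte except for U+0000 itself, so null-terminated strings keep working. The Python sketch below illustrates that assumption with an arbitrary sample string.

```python
text = "Hello, \u043c\u0438\u0440"  # ASCII followed by Cyrillic

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

print(0 in utf8)   # False: no zero byte that would terminate a C string early
print(0 in utf16)  # True: every ASCII character carries a zero high byte
print(utf8[:7])    # b'Hello, ' (the ASCII part is unchanged byte for byte)
```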


Self-synchronization

In an environment that uses 8-bit character processing, compared to other multibyte encodings, UTF-8 has the following advantages:

  • The first byte of a code sequence indicates its length, which makes forward scanning more efficient.
  • Finding the beginning of a character is simple, because a start byte is restricted to a fixed range of values.
  • Byte values used for different roles do not overlap.

Benefit Comparison

UTF-8 is compact, but East Asian text (Chinese, Japanese, and Korean, which use Chinese characters) takes three bytes per character. UTF-8 can also be slower to process than the other encoding forms. On the other hand, sorting UTF-8 strings byte by byte gives the same order as sorting by Unicode code point.
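
The sorting property can be checked with a short Python sketch. The sample strings are chosen as an assumption to include a character above U+FFFF, which is where byte-wise UTF-16 order is known to diverge from code point order.

```python
words = ["\uff5e", "\U00010000", "z", "\u00e9"]

by_code_point = sorted(words, key=lambda s: [ord(c) for c in s])
by_utf8 = sorted(words, key=lambda s: s.encode("utf-8"))
by_utf16 = sorted(words, key=lambda s: s.encode("utf-16-be"))

print(by_utf8 == by_code_point)   # True
print(by_utf16 == by_code_point)  # False: surrogates D800-DFFF sort below U+FF5E
```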

Character Encoding Scheme

A character encoding scheme consists of a character encoding form plus a rule for serializing code units as bytes. To identify the encoding scheme, the Unicode Standard provides for an initial byte order mark (BOM).

When a BOM is included in UTF-8 data, its only function is to signal that the data is encoded in UTF-8. UTF-8 has no byte-order issue because its code unit is a single byte. Using a BOM with UTF-8 is neither required nor recommended. A BOM can nevertheless appear in text converted from other encoding forms that use a byte order mark, or serve as a UTF-8 signature. It is the three-byte sequence EF BB BF (hexadecimal).
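
Detecting and removing this signature takes only a few lines. A minimal Python sketch; the helper name strip_utf8_bom is hypothetical, while codecs.BOM_UTF8 and the utf-8-sig codec are part of the standard library.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature, if one is present."""
    if data.startswith(codecs.BOM_UTF8):  # the bytes EF BB BF
        return data[len(codecs.BOM_UTF8):]
    return data

with_bom = codecs.BOM_UTF8 + "text".encode("utf-8")

print(strip_utf8_bom(with_bom))      # b'text'
print(with_bom.decode("utf-8-sig"))  # text (this codec drops the signature itself)
```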


How to set UTF-8 encoding

In HTML, the UTF-8 encoding is set using the following code:

<head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

In PHP, the UTF-8 encoding is set with the header() function at the very beginning of the file, right after the error reporting level is set:

<?php

error_reporting(-1);

header('Content-Type: text/html; charset=utf-8');

To connect to MySQL databases, the UTF-8 encoding is set as follows:

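<?php

// The legacy mysql extension was removed in PHP 7; with mysqli the equivalent call is mysqli_set_charset($link, 'utf8mb4').
mysql_set_charset('utf8');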

In CSS files, the UTF-8 character encoding is specified as follows:

@charset "utf-8";


When saving files of any type, choose UTF-8 without a BOM; otherwise the site may not work correctly. In Dreamweaver, select the menu item "Modify - Page Properties - Title/Encoding" and change the encoding to UTF-8. Then reload the page, clear the "Include Unicode Signature (BOM)" option, and apply the changes. If any text on the page or in the database was entered in a different encoding, it must be re-entered or transcoded. When working with regular expressions in PHP, use the u modifier.

You can also save a file as UTF-8 in Notepad on Windows. Select the menu item "File - Save As ...", choose UTF-8 in the encoding list, and save the file.

In the Notepad++ text editor, if the file's encoding is not UTF-8, convert it with the menu item "Convert to UTF-8 without BOM" and then save it.


There is no alternative

In an era of globalization, as political and linguistic boundaries blur, character sets tied to a particular locale become unsuitable. Unicode is the only character set that supports all locales. UTF-8 is an example of a proper Unicode implementation, which:

  • is supported by a wide range of tools and is compatible with ASCII;
  • is resistant to data corruption;
  • is simple and efficient to process;
  • is platform independent.

With the advent of UTF-8, debates about which encoding form or character set is better have become pointless.

Source: https://habr.com/ru/post/K15385/

