UTF-8
UTF-8 (8-bit Unicode Transformation Format) is a character encoding for Unicode and ISO 10646 that uses variable-length sequences of bytes. UTF-8 was created by Rob Pike and Ken Thompson. It is defined as a standard by RFC 3629 of the Internet Engineering Task Force (IETF). It is currently one of the three encoding forms recognized by Unicode and by web languages, or one of four in ISO 10646.
Its main features are:
- It can represent any Unicode character.
- It uses variable-length sequences (from 1 to 4 bytes per Unicode character).
- It includes the 7-bit US-ASCII repertoire, so any ASCII message is represented without change.
- It is self-synchronizing: it is possible to determine the beginning of each symbol without re-reading from the start of the communication.
- No overlap: the sets of values that lead bytes and continuation bytes can take are disjoint, so one cannot be confused with the other.
These features make it attractive for encoding e-mail and web pages. The IETF requires all Internet protocols to indicate which encoding they use for text, and that UTF-8 be one of the supported encodings. The Internet Mail Consortium (IMC) recommends that all e-mail programs be able to create and display messages encoded in UTF-8.
History
UTF-8 was devised by Kenneth L. Thompson, following design criteria set by Rob Pike, on September 2, 1992. Both implemented it and deployed it in their operating system, Plan 9 from Bell Labs. It was later officially presented at the USENIX conference in San Diego in January 1993. It was promoted to a standard with the sponsorship of the X/Open Joint Internationalization Group (XOJIG), and in the process it received different names, such as FSS/UTF and UTF-2.
Description
UTF-8 divides Unicode characters into several groups according to the number of bytes needed to encode them. The number of bytes depends exclusively on the code point that Unicode assigns to the character. The distribution of characters is as follows:
- Characters encoded with one byte: those included in US-ASCII, a total of 128 characters.
- Characters encoded with two bytes: a total of 1920 characters. This group includes the Latin characters with diacritics and the Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Thaana alphabets, among others.
- Characters encoded with three bytes: the remaining characters of the Unicode Basic Multilingual Plane, which, together with the previous groups, covers practically all characters in common use, including those of the CJK group: Chinese, Japanese and Korean.
- Characters encoded with four bytes: characters of the Supplementary Multilingual Plane (classical mathematical symbols and alphabets of mainly academic use: syllabic and ideographic Linear B, Old Persian, Phoenician...) and of the Supplementary Ideographic Plane (Han characters of uncommon use).
An important property of the encoding is that the most significant bits of the first byte of a multi-byte sequence determine the length of the sequence: they are 110 for two-byte sequences, 1110 for three-byte sequences, and so on. These bits also provide synchronization information that makes it possible to identify the start of each symbol.
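As a rough illustration of this property, the following Python sketch (the helper name utf8_sequence_length is chosen for this example only) reads the prefix bits of a lead byte to deduce how many bytes the whole sequence occupies:

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Deduce the length of a UTF-8 sequence from the prefix bits of its first byte."""
    if lead_byte >> 7 == 0b0:        # 0xxxxxxx -> single byte (US-ASCII)
        return 1
    if lead_byte >> 5 == 0b110:      # 110yyyyy -> two-byte sequence
        return 2
    if lead_byte >> 4 == 0b1110:     # 1110zzzz -> three-byte sequence
        return 3
    if lead_byte >> 3 == 0b11110:    # 11110uuu -> four-byte sequence
        return 4
    raise ValueError("continuation byte or illegal lead byte")

print(utf8_sequence_length(0x41))  # 'A' -> 1
print(utf8_sequence_length(0xC3))  # lead byte of 'ñ' -> 2
```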
Character encoding
The following table shows how characters are encoded. The fixed values at the beginning of each byte guarantee compliance with the non-overlapping principle, since they differ depending on the position of the byte within the sequence. The UTF-16 encoding is also included for comparison with an encoding based on fixed-size 16-bit units.
Unicode code point range | Scalar value (binary) | UTF-16 | UTF-8 | Notes |
000000-00007F | 00000000 0xxxxxxx | 00000000 0xxxxxxx | 0xxxxxxx | Range equivalent to US-ASCII. Single-byte symbols, where the most significant bit is 0 |
000080-0007FF | 00000yyy yyxxxxxx | 00000yyy yyxxxxxx | 110yyyyy 10xxxxxx | Two-byte symbols. The first byte starts with 110, the second byte starts with 10 |
000800-00FFFF | zzzzyyyy yyxxxxxx | zzzzyyyy yyxxxxxx | 1110zzzz 10yyyyyy 10xxxxxx | Three-byte symbols. The first byte starts with 1110, the following bytes start with 10 |
010000-10FFFF | 000uuuuu zzzzyyyy yyxxxxxx | 110110ww wwzzzzyy 110111yy yyxxxxxx (where wwww = uuuuu - 1) | 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx | Four-byte symbols. The first byte starts with 11110, the following bytes start with 10 |
Following the scheme above, it would be possible to extend the maximum symbol size from 4 to 6 bytes. The definition of UTF-8 given by Unicode does not allow this possibility, which is, however, supported by the ISO/IEC definition.
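As an illustration, a minimal Python sketch of an encoder that follows the table above could look like the following (the helper encode_utf8 is hypothetical; a real application should use the language's built-in codecs, and a complete encoder would also reject the surrogate range):

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode a single code point into UTF-8 bytes, following the table above (sketch)."""
    if code_point <= 0x7F:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                      # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:                     # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:                   # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point outside the Unicode range")

print(encode_utf8(0x00F1).hex())                    # c3b1 (the ñ example below)
print(encode_utf8(0x00F1) == "ñ".encode("utf-8"))   # True
```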
As an example, let's see how the character eñe (ñ), which corresponds to the Unicode code point U+00F1, is encoded in UTF-8:
- Its value falls in the range U+0080 to U+07FF. The table shows that it must be encoded using 2 bytes, with the format 110yyyyy 10xxxxxx.
- The hexadecimal value 0x00F1 is equivalent to the binary value 000-1111-0001 (only the 11 least significant bits are taken, since they are enough to represent any value in this range).
- The 11 required bits are placed, from left to right, in the positions marked by the placeholders: 11000011 10110001.
- The final result is two bytes with the hexadecimal values 0xC3 0xB1. That is the encoding of the letter ñ in UTF-8.
To recover the original code point, the reverse process is performed, decomposing the bit sequences into their components and taking only the necessary bits.
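A minimal sketch of that reverse process for the two-byte case (the helper decode_utf8_two_bytes is hypothetical), checked against Python's built-in decoder:

```python
def decode_utf8_two_bytes(b1: int, b2: int) -> int:
    """Recover the code point from a two-byte UTF-8 sequence (illustrative sketch)."""
    assert b1 >> 5 == 0b110 and b2 >> 6 == 0b10, "malformed two-byte sequence"
    # Drop the 110 and 10 prefixes and concatenate the remaining payload bits.
    return ((b1 & 0x1F) << 6) | (b2 & 0x3F)

code_point = decode_utf8_two_bytes(0xC3, 0xB1)
print(hex(code_point))                          # 0xf1 -> U+00F1 (ñ)
print(bytes([0xC3, 0xB1]).decode("utf-8"))      # 'ñ', same result with the built-in decoder
```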
Encoding errors
The encoding rules therefore place limits on the byte strings that can be formed. According to the standard, an interpreter of UTF-8 strings should reject malformed sequences as invalid and not try to interpret them. It can abort processing by signaling an error, omit the malformed sequences, or replace them with the character U+FFFD (REPLACEMENT CHARACTER); the example after the list below illustrates the rejection and replacement behaviors.
The following are encoding errors:
- Truncated sequences, when the lead byte of a multi-byte character is not followed by enough continuation bytes.
- Continuation bytes (beginning with 10) without a corresponding lead byte.
- Overlong encodings: for example, representing with 2 bytes a character from the single-byte ASCII range. The bytes 0xC0 and 0xC1 are not allowed.
- Lead bytes announcing an overlong sequence of 5 or 6 bytes: the bytes 0xF8 to 0xFD are not allowed.
- Values outside the Unicode range: the bytes 0xF5 to 0xF7 are not allowed.
- Invalid code points: the characters in the UTF-16 surrogate range, with codes 0xD800 to 0xDFFF, are not real characters and must not be encoded in UTF-8.
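As an illustration, the snippet below uses Python's standard codec to show the rejection and replacement behaviors for two of these errors, an overlong sequence and a truncated sequence; the byte values are taken from the list above:

```python
# 0xC0 0xAF would be an overlong two-byte encoding of '/', U+002F.
overlong = bytes([0xC0, 0xAF])

try:
    overlong.decode("utf-8")                 # strict mode rejects the malformed sequence
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)

# A lenient decoder may instead substitute U+FFFD (REPLACEMENT CHARACTER).
print(overlong.decode("utf-8", errors="replace"))        # '��'

# A truncated sequence: the lead byte 0xC3 announces two bytes but none follows.
print(bytes([0xC3]).decode("utf-8", errors="replace"))   # '�'
```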
Byte order mark (BOM)
When placed at the start of a UTF-8 string, the character U+FEFF, encoded in UTF-8 as 0xEF 0xBB 0xBF, is called a byte order mark (BOM) and identifies the content as a string of Unicode characters. When this character appears anywhere else in the string, it must be interpreted with its original Unicode meaning (ZERO WIDTH NO-BREAK SPACE). Since UTF-8 is an encoding whose unit of information is the byte, the BOM does not have the usefulness it has in UTF-16 and UTF-32, where it identifies the byte order within a word (endianness).
The specification neither recommends nor discourages the use of the BOM, although it advises against removing it when it is already present, as a precaution that prevents errors in applications such as digital signatures. It also warns that it must be removed in concatenation operations so that it does not end up in non-initial positions.
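As an illustration, the snippet below uses Python's standard 'utf-8' and 'utf-8-sig' codecs to show the BOM bytes at the start of a string and one common way of stripping them while decoding:

```python
BOM_UTF8 = b"\xef\xbb\xbf"
data = BOM_UTF8 + "texto".encode("utf-8")

# The plain codec keeps the BOM as U+FEFF at the start of the decoded string.
print(repr(data.decode("utf-8")))        # '\ufefftexto'

# The 'utf-8-sig' codec strips a leading BOM, if present, while decoding.
print(repr(data.decode("utf-8-sig")))    # 'texto'
```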
UTF-8 Derivations
The following encoding standards differ from, and are therefore incompatible with, the UTF-8 specification.
CESU-8
This implementation performs a direct translation of the string represented in UTF-16 instead of encoding the Unicode code points themselves. The result is a different encoding for Unicode characters with codes above 0xFFFF. Oracle, starting with version 8, implements CESU-8 under the alias UTF8 and, starting with version 9, standard UTF-8 under another alias. Java and Tcl use this encoding.[citation needed]
Modified UTF-8
With modified UTF-8, the null character is encoded as 0xC0 0x80 instead of 0x00. In this way, a text containing the null character will not contain the byte 0x00 and therefore will not be truncated in languages like C, which treat 0x00 as an end-of-string marker.
All known implementations of modified UTF-8 are also CESU-8 compliant.[citation needed]
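A minimal Python sketch of the null-character convention only (the helper encode_modified_utf8_nulls is hypothetical; full modified UTF-8 also applies the CESU-8 treatment to supplementary characters, which this sketch does not do):

```python
def encode_modified_utf8_nulls(text: str) -> bytes:
    """Encode text as UTF-8, but map U+0000 to 0xC0 0x80 as modified UTF-8 does (sketch)."""
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")

encoded = encode_modified_utf8_nulls("a\x00b")
print(encoded)   # b'a\xc0\x80b' -> no 0x00 byte, so C-style string handling will not truncate it

# A standard UTF-8 decoder rejects 0xC0 0x80 because it is an overlong form:
print(encoded.decode("utf-8", errors="replace"))   # 'a\ufffd\ufffdb'
```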
Advantages and disadvantages
Advantages
- UTF-8 allows you to encode any Unicode character.
- It is compatible with US-ASCII; the encoding of the 7-bit repertoire is direct.
- Easy identification. It is possible to reliably identify a data sample as UTF-8 with a simple algorithm, and the probability of a correct identification increases with the size of the sample.
- UTF-8 saves storage space for texts in Latin script, where the characters included in US-ASCII are common, compared with other formats such as UTF-16.
- A byte sequence for one character is never part of a longer sequence for another character, because the encoding carries synchronization information.
Disadvantages
- UTF-8 uses variable-length symbols, meaning different characters may be encoded with different numbers of bytes. It is therefore necessary to traverse the string from the beginning in order to find the character that occupies a given position.
- Ideographic characters use 3 bytes in UTF-8 but only 2 in UTF-16, so Chinese, Japanese or Korean texts take up more space when represented in UTF-8 (the size comparison sketched after this list illustrates the difference).
- UTF-8 has a higher computational cost than UTF-16 and UTF-32, for example in string-handling operations.
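As an illustration of this size trade-off, the following snippet compares the encoded length of an arbitrary Latin-script sample and an arbitrary CJK sample using Python's built-in codecs:

```python
samples = {
    "Latin": "El veloz murciélago hindú comía feliz cardillo y kiwi.",
    "CJK": "春眠不覚暁処処聞啼鳥",
}

for label, text in samples.items():
    utf8_size = len(text.encode("utf-8"))
    utf16_size = len(text.encode("utf-16-le"))   # 2 bytes per character for BMP text, no BOM
    print(f"{label}: {len(text)} characters, UTF-8 = {utf8_size} bytes, UTF-16 = {utf16_size} bytes")
```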