Character encoding
Character encoding is the method used to convert a character of a natural language (for example, a letter of an alphabet or a sign of a syllabary) into a symbol of another representation system, such as a number or a sequence of electrical pulses in an electronic system, by applying a set of standards or coding rules.
These rules define how a given character is encoded as a symbol in the other representation system. Examples include Morse code, the ASCII standard and UTF-8, among others.
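As a minimal sketch of this idea in Python, the snippet below maps characters to symbols under three different rules. The `MORSE` table and the `to_morse` helper are illustrative assumptions and cover only two letters; `ord` and `str.encode` are standard Python built-ins.

```python
# Hypothetical, deliberately tiny Morse table: only the letters used below.
MORSE = {"S": "...", "O": "---"}


def to_morse(text):
    """Encode text as Morse code, with one space between letters."""
    return " ".join(MORSE[ch] for ch in text.upper())


char = "A"
ascii_number = ord(char)            # the number that ASCII assigns to "A"
utf8_octets = char.encode("utf-8")  # the octets that UTF-8 uses for "A"

print(to_morse("sos"))   # ... --- ...
print(ascii_number)      # 65
print(utf8_octets)       # b'A'
```

The same character ends up as dots and dashes, as an integer, or as a sequence of octets, depending on which rule is applied.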
ASCII
ASCII is closely tied to the octet and to the integers from 0 to 127, so it cannot encode more than 128 different symbols. 128 is the total number of configurations that can be achieved with 7 binary digits (0000000, 0000001, …, 1111111); the eighth digit of each octet was used as a parity bit to detect transmission errors. A quota of 128 is enough to include the upper- and lower-case letters of the English alphabet, plus digits, punctuation and some "control characters" (for example, one that instructs a printer to advance to the next page), but no more. ASCII does not include the accented characters or the opening question mark (¿) used in Spanish, nor many other symbols (mathematical signs, Greek letters, ...) that are necessary in many contexts.
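The arithmetic above can be sketched in Python. The helper names are hypothetical, and the choice of even parity is an assumption made for illustration; the text only says that the eighth bit was used as a parity bit.

```python
def with_even_parity(code):
    """Pack a 7-bit ASCII code and an even-parity bit into one octet."""
    if not 0 <= code < 128:
        raise ValueError("ASCII codes must fit in 7 bits (0 to 127)")
    parity = bin(code).count("1") % 2   # 1 when the number of 1-bits is odd
    return (parity << 7) | code         # parity goes in the eighth (top) bit


def parity_ok(octet):
    """Return True when the octet contains an even number of 1-bits."""
    return bin(octet).count("1") % 2 == 0


def fits_in_ascii(text):
    """Return True when every character of text has an ASCII code."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(2 ** 7)                           # 128 possible 7-bit configurations
octet = with_even_parity(ord("C"))      # "C" is 67, binary 1000011
print(format(octet, "08b"))             # 11000011
print(parity_ok(octet))                 # True
print(parity_ok(octet ^ 0b00000100))    # False: one flipped bit is detected
print(fits_in_ascii("Que?"))            # True
print(fits_in_ascii("¿Qué?"))           # False
```

A single parity bit detects any odd number of flipped bits in an octet, but it cannot detect an even number of flips or say which bit changed.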
Extended ASCII
Due to the limitations of ASCII, various 8-bit character codes were defined, often grouped under the name Extended ASCII. The problem with these 8-bit codes is that each of them is defined for a set of languages with similar scripts, so they do not provide a unified solution for encoding all the languages of the world. In other words, 8 bits (256 values) are not enough to encode all the alphabets and scripts in the world.
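To illustrate why these 8-bit codes are not a unified solution, the following sketch decodes the same octet under four different code pages using Python's built-in codecs. The choice of the octet 0xE9 and of these four codecs is arbitrary.

```python
# One octet, interpreted under four different 8-bit character codes.
octet = bytes([0xE9])

codecs_to_try = ["iso8859_1", "iso8859_5", "iso8859_7", "cp437"]
decoded = {name: octet.decode(name) for name in codecs_to_try}

for name, char in decoded.items():
    print(f"{name:10} -> {char!r} (U+{ord(char):04X})")
```

Each code page yields a different character for the same value, so a receiver that does not know which 8-bit code the sender used cannot recover the text reliably.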
Unicode
As a solution to these problems, the Unicode standard has been agreed internationally since 1991. It is a large table that currently assigns a code to each of more than one hundred thousand characters, including all European alphabets, Chinese, Japanese and Korean ideographs, many other writing systems, and more than a thousand local symbols.
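The codes in this table can be inspected from Python with the built-in `ord` function and the standard `unicodedata` module. This is only a sketch; the `code_point` helper and the sample characters are illustrative choices.

```python
import unicodedata


def code_point(ch):
    """Return the Unicode code of a character in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


for ch in "Añ€中":
    print(ch, code_point(ch), unicodedata.name(ch))
# A U+0041 LATIN CAPITAL LETTER A
# ñ U+00F1 LATIN SMALL LETTER N WITH TILDE
# € U+20AC EURO SIGN
# 中 U+4E2D CJK UNIFIED IDEOGRAPH-4E2D
```

Characters from very different scripts share a single numbering, which is what the 8-bit codes could not offer.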
Transmission Rules
The transmission rules define how characters that have already been encoded (using the encoding rules) are transmitted over a communication channel (for example, the Internet).
Currently, on the Internet, messages are transmitted in packets that always consist of a whole number of octets, and error detection is no longer done with the eighth digit of each octet but with special octets that are automatically added to each packet. Transmission rules are therefore limited to specifying a reversible mapping between codes (representing characters) and sequences of octets (to be transmitted as data).
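A minimal sketch of such a reversible mapping, using UTF-8 and UTF-16 as two examples of transmission rules for the same Unicode codes. The sample text and variable names are illustrative.

```python
text = "¿Qué?"

# The codes (numbers) that the encoding rules assign to each character.
codes = [ord(ch) for ch in text]

# Two different mappings from those codes to sequences of octets.
utf8_octets = text.encode("utf-8")
utf16_octets = text.encode("utf-16-be")

print(codes)               # [191, 81, 117, 233, 63]
print(utf8_octets.hex())   # c2bf5175c3a93f
print(len(utf8_octets))    # 7
print(len(utf16_octets))   # 10

# Both mappings are reversible: decoding returns the original text.
print(utf8_octets.decode("utf-8") == text)       # True
print(utf16_octets.decode("utf-16-be") == text)  # True
```

The codes are the same in both cases; only the octet sequences sent over the channel differ.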
Typographic tables
Finally, one important detail is still missing before it is possible to correspond electronically in, for example, simplified Chinese.
The table that the Unicode Consortium publishes for human reading contains a graphical representation, or description, of each character included so far. Document display systems, however, need typographic (font) tables that associate a glyph (a drawing) with each character they cover. There are many font tables, with names like Arial or Times, that draw the same letter from different designs and in different styles (for example, a sans-serif «A» or a serif «A»). The vast majority of typefaces contain only a small subset of all Unicode characters.
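Because each typeface covers only part of Unicode, display systems typically fall back from one font to another. The following is a toy sketch of that idea. The font names and coverage sets are invented for illustration, and real font files store this information in their own character-to-glyph tables.

```python
# Hypothetical coverage tables: the characters each font has a glyph for.
FONT_COVERAGE = {
    "LatinFont": set("ABCabc"),
    "CJKFont": set("中文"),
}


def pick_font(ch, preferred):
    """Return the first font, in order of preference, with a glyph for ch."""
    for name in preferred:
        if ch in FONT_COVERAGE.get(name, set()):
            return name
    return None  # no font can draw this character


def layout(text, preferred):
    """Pair every character of text with the font chosen to draw it."""
    return [(ch, pick_font(ch, preferred)) for ch in text]


print(layout("A中é", ["LatinFont", "CJKFont"]))
# [('A', 'LatinFont'), ('中', 'CJKFont'), ('é', None)]
```

A result of `None` corresponds to the case where the character is valid Unicode but none of the available fonts can draw it.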
Common Character Encoding Rules
- ISO 646
- ASCII
- EBCDIC
- CP930
- ISO 8859:
  - ISO 8859-1 Western Europe
  - ISO 8859-2 Western Europe and Central Europe (Czech, Polish, Croatian, Romanian, Slovenian, ...)
  - ISO 8859-3 Western Europe and Southern Europe
  - ISO 8859-4 Western Europe and Baltic countries (Lithuanian, Estonian and Sami)
  - ISO 8859-5 Cyrillic alphabet
  - ISO 8859-6 Arabic
  - ISO 8859-7 Greek
  - ISO 8859-8 Hebrew
  - ISO 8859-9 Western Europe with the Turkish character set
  - ISO 8859-10 Western Europe with the Nordic character sets, including Icelandic
  - ISO 8859-11 Thai
  - ISO 8859-13 Baltic and Polish languages
  - ISO 8859-14 Celtic languages (Irish Gaelic, Scottish Gaelic, Welsh)
  - ISO 8859-15 Adds the euro sign and other symbols to ISO 8859-1
  - ISO 8859-16 Central European languages (Polish, Czech, Slovenian, Slovak, Hungarian, Albanian, Romanian, German and Italian)
- CP437, CP850, CP852, CP855, CP857, CP858, CP860, CP861, CP863, CP865, CP869
- Windows character sets:
  - Windows-1250 for Central European languages using a Latin script (Polish, Czech, Slovak, Hungarian, Slovenian, Serbian, Croatian, Romanian and Albanian)
  - Windows-1251 for Cyrillic alphabets
  - Windows-1252 for Western languages
  - Windows-1253 for Greek
  - Windows-1254 for Turkish
  - Windows-1255 for Hebrew
  - Windows-1256 for Arabic
  - Windows-1257 for Baltic languages
  - Windows-1258 for Vietnamese
- Mac OS Roman
- KOI8-R, KOI8-U, KOI7
- MIK
- Cork or T1
- ISCII
- VISCII
- Big5
- HKSCS
- Guobiao:
  - GB2312
  - GBK (Microsoft Code Page 936)
  - GB18030
- Shift JIS for Japanese (Microsoft Code Page 932)
- EUC-KR for Korean (Microsoft Code Page 949)
- ISO-2022 and EUC for CJK character sets
- Unicode (and its subsets, such as the 16-bit Basic Multilingual Plane). See also UTF-8 and UTF-16.
- ANSEL or ISO/IEC 6937
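Several of the encodings listed above are available as built-in Python codecs. As a hedged sketch, the snippet below tries to encode the euro sign with a few of them; the `try_encode` helper and the selection of codecs are illustrative choices.

```python
def try_encode(text, codec):
    """Return the encoded octets, or None when the codec lacks a character."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


euro = "€"
codecs_to_try = ["ascii", "iso8859_1", "iso8859_15", "cp1252", "utf-8"]
results = {codec: try_encode(euro, codec) for codec in codecs_to_try}

for codec, octets in results.items():
    print(f"{codec:10} -> {octets!r}")
# ascii      -> None
# iso8859_1  -> None
# iso8859_15 -> b'\xa4'
# cp1252     -> b'\x80'
# utf-8      -> b'\xe2\x82\xac'
```

ASCII and ISO 8859-1 have no code for the euro sign, ISO 8859-15 and Windows-1252 place it at different values, and UTF-8 needs three octets for it.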