Character encoding


Character encoding is the method used to convert a character from a natural language (such as a letter of an alphabet or a sign of a syllabary) into a symbol of another representation system, such as a number or a sequence of electrical pulses in an electronic system, by applying a set of standards or encoding rules.

These rules define how a given character is encoded as a symbol in the other representation system. Examples include Morse code, the ASCII standard, and UTF-8.

ASCII

Because ASCII uses only seven bits of each octet (and therefore only the integers from 0 to 127), its main problem is that it cannot encode more than 128 different symbols. 128 is the total number of distinct configurations that can be achieved with 7 binary digits (0000000, 0000001, …, 1111111); the eighth digit of each octet was used as a parity bit to detect transmission errors. A quota of 128 is enough to include the upper- and lower-case letters of the English alphabet, plus digits, punctuation, and some "control characters" (for example, one that instructs a printer to advance to the next page), but not much more. ASCII does not include the accented characters or the inverted question mark used in Spanish, nor many other symbols (mathematical signs, Greek letters, ...) that are necessary in many contexts.
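As a rough illustration of the 7-bit limit, the following Python sketch counts the available codes, checks whether a string fits in them, and fills the eighth bit with a parity bit. The helper names `is_ascii` and `with_even_parity` are hypothetical, and even parity is an assumption: the text does not say which parity convention was used.

```python
# Number of distinct values that fit in 7 binary digits.
ASCII_CODE_COUNT = 2 ** 7  # 128

def is_ascii(text: str) -> bool:
    """Return True if every character has a code in the range 0..127."""
    return all(ord(ch) < ASCII_CODE_COUNT for ch in text)

def with_even_parity(code: int) -> int:
    """Hypothetical helper: store an even-parity bit in the eighth (top) bit.

    Assumes even parity: the parity bit is chosen so that the total
    number of 1 bits in the resulting octet is even.
    """
    if not 0 <= code < ASCII_CODE_COUNT:
        raise ValueError("not a 7-bit code")
    parity = bin(code).count("1") % 2
    return (parity << 7) | code

print(ASCII_CODE_COUNT)                            # 128
print(is_ascii("Hello"))                           # True
print(is_ascii("año"))                             # False: ñ has no ASCII code
print(format(with_even_parity(ord("A")), "08b"))   # 01000001
print(format(with_even_parity(ord("C")), "08b"))   # 11000011
```

A receiver using the same convention could recount the 1 bits of each octet and flag any octet with an odd count as damaged in transit.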

Extended ASCII

Because of the limitations of ASCII, various 8-bit character codes were defined, including Extended ASCII. The problem with these 8-bit codes is that each of them is defined for a set of languages with similar scripts, so they do not provide a unified solution for encoding all the languages of the world. In other words, 8 bits are not enough to encode all the world's alphabets and scripts.
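The lack of a unified solution can be seen by decoding the same octet under several 8-bit codes. The sketch below uses Python's built-in codecs; the choice of the octet 0xE1 and of these three code tables is only for illustration.

```python
# One octet value, interpreted under three different 8-bit code tables.
octet = bytes([0xE1])

decoded = {
    name: octet.decode(name)
    for name in ("iso8859_1", "iso8859_7", "koi8_r")
}

for name, ch in decoded.items():
    # Print the codec name, the resulting Unicode code point, and the character.
    print(f"{name}: U+{ord(ch):04X} {ch}")
```

The same octet comes out as a Latin, a Greek, or a Cyrillic letter, so a text is only readable when sender and receiver agree on the code table.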

Unicode

To solve these problems, the Unicode standard has been adopted internationally since 1991. Unicode is a large table that currently assigns a code to each of more than one hundred thousand symbols, covering all European alphabets, Chinese, Japanese and Korean ideograms, many other writing systems, and more than a thousand local symbols.
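To make the idea of "one code per symbol" concrete, this short Python sketch looks up the Unicode code and official name of a few characters with the standard `unicodedata` module. The function name `describe` and the sample characters are chosen here for illustration only.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Latin, Spanish-specific, Greek and Chinese characters share one table.
for ch in "Añ¿α中":
    print(ch, describe(ch))
```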

Transmission Rules

The transmission rules define how characters, once encoded using the encoding rules, are transmitted over a communication channel (for example, the Internet).

Currently, on the Internet, messages are transmitted in packets that always consist of a whole number of octets. Error detection is no longer done with the eighth digit of each octet, but with special octets that are automatically added to each packet. Transmission rules are therefore limited to specifying a reversible mapping between codes (representing characters) and sequences of octets (to be transmitted as data).
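UTF-8 is one such reversible mapping between codes and octet sequences. The following is a minimal sketch of its encoding direction, written by hand so the bit layout is visible; the function name `utf8_encode` is hypothetical, and in practice Python's built-in `str.encode("utf-8")` would be used instead.

```python
def utf8_encode(code_point: int) -> bytes:
    """Map one Unicode code point to its UTF-8 octet sequence."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:          # 1 octet:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:         # 2 octets: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:       # 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

text = "señal"
octets = b"".join(utf8_encode(ord(ch)) for ch in text)
print(octets.hex(" "))                    # 73 65 c3 b1 61 6c

# The hand-written mapping agrees with the built-in codec, and decoding
# the octets gives back the original text, so the mapping is reversible.
assert octets == text.encode("utf-8")
assert octets.decode("utf-8") == text
```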

Typographic tables

Finally, to exchange electronic text in simplified Chinese (for example), one important detail is still missing:

The table that the Unicode Consortium publishes for human reading contains a graphical representation, or description, of each character included so far. To work, however, document display systems require typographic tables (fonts), which associate a glyph (drawing) with each character they cover. There are many fonts, with names like Arial or Times, that draw the same letter based on different matrices and in different styles (a sans-serif «A» or a serif «A»). The vast majority of typefaces, however, contain only a small subset of all Unicode characters.

Common Character Encoding Rules

ISO 646
- ASCII
EBCDIC
- CP930
ISO 8859:
- ISO 8859-1 Western Europe
- ISO 8859-2 Western Europe and Central Europe (Czech, Polish, Croatian, Romanian, Slovenian, ...)
- ISO 8859-3 Western Europe and South Europe
- ISO 8859-4 Western Europe and Baltic countries (Lithuanian, Estonian and Sami)
- ISO 8859-5 Cyrillic alphabet
- ISO 8859-6 Arabic
- ISO 8859-7 Greek
- ISO 8859-8 Hebrew
- ISO 8859-9 Western Europe with the Turkish character set
- ISO 8859-10 Western Europe with the Nordic character sets, including that of Icelandic
- ISO 8859-11 Thai
- ISO 8859-13 Baltic and Polish languages
- ISO 8859-14 Celtic Languages (Irish Gaelic, Scottish, Welsh)
- ISO 8859-15 Adds the euro sign and other symbols to ISO 8859-1
- ISO 8859-16 Central European languages (Polish, Czech, Slovenian, Slovak, Hungarian, Albanian, Romanian, German and Italian)
CP437, CP850, CP852, CP855, CP857, CP858, CP860, CP861, CP863, CP865, CP869
Windows character sets:
- Windows-1250 for Central European languages using a Latin script (Polish, Czech, Slovak, Hungarian, Slovenian, Serbian, Croatian, Romanian and Albanian)
- Windows-1251 for Cyrillic Alphabets
- Windows-1252 for Western languages
- Windows-1253 for Greek
- Windows-1254 for Turkish
- Windows-1255 for Hebrew
- Windows-1256 for Arabic
- Windows-1257 for Baltic Languages
- Windows-1258 for Vietnamese
Mac OS Roman
KOI8-R, KOI8-U, KOI7
MIK
Cork or T1
ISCII
VISCII
Big5
- HKSCS
Guobiao
- GB2312
- GBK (Microsoft Code Page 936)
- GB18030
Shift JIS for Japanese (Microsoft Code Page 932)
EUC-KR for Korean (Microsoft Code Page 949)
ISO-2022 and EUC for CJK character sets
Unicode (and its subsets, such as the 16-bit Basic Multilingual Plane). See also UTF-8 and UTF-16.
ANSEL or ISO/IEC 6937
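As an illustration of how the encodings in this list differ, the sketch below tries to encode the euro sign with a few of them using Python's built-in codecs. The helper name `try_encode` is hypothetical, and the selection of encodings is arbitrary.

```python
def try_encode(text: str, encoding: str):
    """Return the encoded bytes, or None if the encoding lacks a character."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

for name in ("iso8859_1", "iso8859_15", "cp1252", "utf-8"):
    result = try_encode("€", name)
    shown = result.hex(" ") if result is not None else "not representable"
    print(f"{name}: {shown}")
```

ISO 8859-1 has no code for the euro sign, while ISO 8859-15 and Windows-1252 each assign it a different single octet, and UTF-8 uses three octets.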

Spanish character encoding

Lowercase
character	ISO-8859-1	UTF-8	UTF-16 (big-endian)
a	0x61	0x61	0x00 0x61
b	0x62	0x62	0x00 0x62
c	0x63	0x63	0x00 0x63
d	0x64	0x64	0x00 0x64
e	0x65	0x65	0x00 0x65
f	0x66	0x66	0x00 0x66
g	0x67	0x67	0x00 0x67
h	0x68	0x68	0x00 0x68
i	0x69	0x69	0x00 0x69
j	0x6a	0x6a	0x00 0x6a
k	0x6b	0x6b	0x00 0x6b
l	0x6c	0x6c	0x00 0x6c
m	0x6d	0x6d	0x00 0x6d
n	0x6e	0x6e	0x00 0x6e
o	0x6f	0x6f	0x00 0x6f
p	0x70	0x70	0x00 0x70
q	0x71	0x71	0x00 0x71
r	0x72	0x72	0x00 0x72
s	0x73	0x73	0x00 0x73
t	0x74	0x74	0x00 0x74
u	0x75	0x75	0x00 0x75
v	0x76	0x76	0x00 0x76
w	0x77	0x77	0x00 0x77
x	0x78	0x78	0x00 0x78
y	0x79	0x79	0x00 0x79
z	0x7a	0x7a	0x00 0x7a

Uppercase
character	ISO-8859-1	UTF-8	UTF-16 (big-endian)
A	0x41	0x41	0x00 0x41
B	0x42	0x42	0x00 0x42
C	0x43	0x43	0x00 0x43
D	0x44	0x44	0x00 0x44
E	0x45	0x45	0x00 0x45
F	0x46	0x46	0x00 0x46
G	0x47	0x47	0x00 0x47
H	0x48	0x48	0x00 0x48
I	0x49	0x49	0x00 0x49
J	0x4a	0x4a	0x00 0x4a
K	0x4b	0x4b	0x00 0x4b
L	0x4c	0x4c	0x00 0x4c
M	0x4d	0x4d	0x00 0x4d
N	0x4e	0x4e	0x00 0x4e
O	0x4f	0x4f	0x00 0x4f
P	0x50	0x50	0x00 0x50
Q	0x51	0x51	0x00 0x51
R	0x52	0x52	0x00 0x52
S	0x53	0x53	0x00 0x53
T	0x54	0x54	0x00 0x54
U	0x55	0x55	0x00 0x55
V	0x56	0x56	0x00 0x56
W	0x57	0x57	0x00 0x57
X	0x58	0x58	0x00 0x58
Y	0x59	0x59	0x00 0x59
Z	0x5a	0x5a	0x00 0x5a

Accents and Tildes
character	ISO-8859-1	UTF-8	UTF-16 (big-endian)
á	0xe1	0xc3 0xa1	0x00 0xe1
Á	0xc1	0xc3 0x81	0x00 0xc1
é	0xe9	0xc3 0xa9	0x00 0xe9
É	0xc9	0xc3 0x89	0x00 0xc9
í	0xed	0xc3 0xad	0x00 0xed
Í	0xcd	0xc3 0x8d	0x00 0xcd
ó	0xf3	0xc3 0xb3	0x00 0xf3
Ó	0xd3	0xc3 0x93	0x00 0xd3
ú	0xfa	0xc3 0xba	0x00 0xfa
Ú	0xda	0xc3 0x9a	0x00 0xda
ü	0xfc	0xc3 0xbc	0x00 0xfc
Ü	0xdc	0xc3 0x9c	0x00 0xdc
ñ	0xf1	0xc3 0xb1	0x00 0xf1
Ñ	0xd1	0xc3 0x91	0x00 0xd1

Symbols
character	ISO-8859-1	UTF-8	UTF-16 (big-endian)
¿	0xbf	0xc2 0xbf	0x00 0xbf
?	0x3f	0x3f	0x00 0x3f
¡	0xa1	0xc2 0xa1	0x00 0xa1
!	0x21	0x21	0x00 0x21
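Rows like those in the tables above can be reproduced with Python's built-in codecs. This is only a sketch: the function name `table_row` is hypothetical, the octets of each encoding are joined by spaces within one cell, and UTF-16 is assumed to be big-endian without a byte order mark, which matches the octet order shown in the tables.

```python
def table_row(ch: str) -> str:
    """Build one tab-separated row: character, ISO-8859-1, UTF-8, UTF-16BE."""
    cells = [ch]
    for encoding in ("iso8859_1", "utf-8", "utf-16-be"):
        octets = ch.encode(encoding)
        # Show each octet as two lowercase hexadecimal digits.
        cells.append(" ".join(f"0x{b:02x}" for b in octets))
    return "\t".join(cells)

for ch in "aAáÑü¿¡":
    print(table_row(ch))
```

For the unaccented letters all three encodings carry the same value, while the accented letters and inverted punctuation take one octet in ISO-8859-1 and two in UTF-8.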
