Character encoding

Character encoding is the method used to convert a character of a natural language (for example, a letter of an alphabet or a sign of a syllabary) into a symbol of another representation system, such as a number or a sequence of electrical pulses in an electronic system, by applying a set of standards or encoding rules.

These rules define how a given character is encoded as a symbol in the other representation system. Examples include Morse code, the ASCII standard, and UTF-8.

ASCII

Because ASCII is closely tied to the octet and uses only the integers from 0 to 127, it cannot encode more than 128 different symbols. That is the number of distinct configurations of 7 binary digits (0000000, 0000001, …, 1111111); the eighth bit of each octet was used as a parity bit to detect transmission errors. A repertoire of 128 codes is enough for the upper- and lowercase letters of the English alphabet, the digits, punctuation, and some "control characters" (for example, one that instructs a printer to advance to the next page), but not for other languages. ASCII does not include the accented characters or the opening question mark (¿) used in Spanish, nor many other symbols (mathematical signs, Greek letters, ...) that are needed in many contexts.
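
As a rough illustration of the 7-bit limit and the parity bit described above, the following Python sketch packs an ASCII character into one octet. The function name `with_even_parity`, the choice of even parity, and the placement of the parity bit in the top position are assumptions made for this example; they are not prescribed by the text.

```python
def with_even_parity(ch: str) -> int:
    """Return an octet holding the 7-bit ASCII code plus an even-parity bit."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is outside the 128 ASCII codes")
    ones = bin(code).count("1")   # number of 1-bits in the 7-bit code
    parity = ones % 2             # 1 only when that count is odd
    return code | (parity << 7)   # parity goes into the eighth bit


print(f"{with_even_parity('A'):08b}")  # 'A' is 1000001: two 1-bits, parity 0
print(f"{with_even_parity('C'):08b}")  # 'C' is 1000011: three 1-bits, parity 1

# A Spanish letter has no ASCII code at all.
try:
    "ñ".encode("ascii")
except UnicodeEncodeError as err:
    print("not representable in ASCII:", err.object)
```

With even parity, every transmitted octet carries an even number of 1-bits, so a single flipped bit makes the count odd and can be detected by the receiver.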

Extended ASCII

Due to the limitations of ASCII, various 8-bit character codes were defined, including Extended ASCII. The problem with these 8-bit codes is that each one is defined for a set of languages with similar scripts, so none of them provides a unified solution for encoding all the languages of the world. In other words, 8 bits are not enough to encode all the world's alphabets and scripts.
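
To make the problem concrete, the sketch below decodes one and the same octet with three 8-bit codes from the ISO 8859 family, using the codec names from Python's standard library. The choice of the octet 0xE1 and of these three codes is only for illustration.

```python
raw = bytes([0xE1])  # a single octet above the 7-bit ASCII range

# The same octet, interpreted under three different 8-bit codes.
decoded = {
    name: raw.decode(name)
    for name in ("iso8859_1", "iso8859_5", "iso8859_7")
}

for name, ch in decoded.items():
    print(f"{name}: 0x{raw[0]:02x} -> {ch!r} (U+{ord(ch):04X})")
```

Each code yields a different character (a Latin, a Cyrillic, and a Greek letter), so the octet is meaningless unless the receiver knows which 8-bit code was used.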

Unicode

As a solution to these problems, the Unicode standard has been agreed internationally since 1991. It is a large table that currently assigns a code to each of more than fifty thousand symbols, covering all European alphabets, Chinese, Japanese, and Korean ideograms, many other writing systems, and more than a thousand local symbols.
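
The idea of one table with one code per character can be explored from Python, whose built-in `ord` returns the Unicode code of a character and whose `unicodedata` module exposes the official character names. The sample characters below are an arbitrary selection for this sketch.

```python
import unicodedata

samples = ["A", "ñ", "Ω", "中"]

# Map each character to its Unicode code (an integer).
code_points = {ch: ord(ch) for ch in samples}

for ch, cp in code_points.items():
    print(f"U+{cp:04X}  {ch}  {unicodedata.name(ch)}")
```

Characters from unrelated scripts live side by side in the same table, which is what the 8-bit codes could not offer.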

Transmission Rules

The transmission rules define how characters, once encoded using the encoding rules, are transmitted over a communication channel (for example, the Internet).

Currently, on the Internet, messages are transmitted in packets that always consist of a whole number of octets, and error detection is no longer done with the eighth bit of each octet, but with special octets that are automatically added to each packet. Transmission rules are therefore limited to specifying a reversible mapping between codes (representing characters) and sequences of octets (to be transmitted as data).
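
A minimal sketch of both ideas follows: a reversible mapping between characters and octets (here UTF-8) and check octets appended to the data. The helper names `to_packet` and `from_packet`, the 4-octet CRC-32 trailer, and the packet layout are assumptions for illustration only; real network protocols define their own framing and checksums.

```python
import zlib


def to_packet(text: str) -> bytes:
    """Encode text as UTF-8 octets and append 4 check octets (CRC-32)."""
    payload = text.encode("utf-8")
    check = zlib.crc32(payload).to_bytes(4, "big")
    return payload + check


def from_packet(packet: bytes) -> str:
    """Verify the check octets, then map the octets back to characters."""
    payload, check = packet[:-4], packet[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != check:
        raise ValueError("packet corrupted in transit")
    return payload.decode("utf-8")


message = "¿Qué tal, señor?"
packet = to_packet(message)
print(from_packet(packet) == message)  # the mapping is reversible

# Flip one bit of the first octet to simulate a transmission error.
damaged = bytes([packet[0] ^ 0x01]) + packet[1:]
try:
    from_packet(damaged)
    detected = False
except ValueError:
    detected = True
print("error detected:", detected)
```

Note that no bit of the payload octets is reserved for parity; all eight bits carry data and the check lives in the extra octets.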

Typographic tables

Finally, to correspond electronically in simplified Chinese (for example), one important piece is still missing.

The table that the Unicode Consortium publishes for human reading contains a graphical representation, or description, of each character included so far. Document display systems, however, need font tables that associate a glyph (drawing) with each character they cover. There are many such font tables, with names like Arial or Times, that draw the same letter based on different matrices and in different styles (for example, a sans-serif «A» and a serif «A»). The vast majority of typefaces contain only a small subset of all Unicode characters.

Common Character Encoding Rules

  • ISO 646
    • ASCII
  • EBCDIC
    • CP930
  • ISO 8859:
    • ISO 8859-1 Western Europe
    • ISO 8859-2 Western Europe and Central Europe (Czech, Polish, Croatian, Romanian, Slovenian, ...)
    • ISO 8859-3 Western Europe and South Europe
    • ISO 8859-4 Western Europe and Baltic countries (Lithuanian, Estonian and Sami)
    • ISO 8859-5 Cyrillic alphabet
    • ISO 8859-6 Arabic
    • ISO 8859-7 Greek
    • ISO 8859-8 Hebrew
    • ISO 8859-9 Western Europe with Turkish character set
    • ISO 8859-10 Western Europe with Nordic character sets, including Icelandic
    • ISO 8859-11 Thai
    • ISO 8859-13 Baltic and Polish languages
    • ISO 8859-14 Celtic Languages (Irish Gaelic, Scottish, Welsh)
    • ISO 8859-15 Adds the euro sign and other symbols to ISO 8859-1
    • ISO 8859-16 Central European languages (Polish, Czech, Slovenian, Slovak, Hungarian, Albanian, Romanian, German and Italian)
  • CP437, CP850, CP852, CP855, CP857, CP858, CP860, CP861, CP863, CP865, CP869
  • Windows character sets:
    • Windows-1250 for Central European languages using a Latin script (Polish, Czech, Slovak, Hungarian, Slovenian, Serbian, Croatian, Romanian and Albanian)
    • Windows-1251 for Cyrillic Alphabets
    • Windows-1252 for Western languages
    • Windows-1253 for Greek
    • Windows-1254 for Turkish
    • Windows-1255 for Hebrew
    • Windows-1256 for Arabic
    • Windows-1257 for Baltic Languages
    • Windows-1258 for Vietnamese
  • Mac OS Roman
  • KOI8-R, KOI8-U, KOI7
  • MIK
  • Cork or T1
  • ISCII
  • VISCII
  • Big5
    • HKSCS
  • Guobiao
    • GB2312
    • GBK (Microsoft Code Page 936)
    • GB18030
  • Shift JIS for Japanese (Microsoft Code Page 932)
  • EUC-KR for Korean (Microsoft Code Page 949)
  • ISO-2022 and EUC for CJK character sets
  • Unicode (and its subsets, such as the 16-bit Basic Multilingual Plane). See also UTF-8 and UTF-16.
  • ANSEL or ISO/IEC 6937
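
Several of the encodings listed above ship as codecs in Python's standard library, so their differences can be observed directly. The sketch below encodes the letter "ñ" with a handful of them; the selection of codecs and the variable names are arbitrary choices for this example.

```python
text = "ñ"
codecs_to_try = [
    "ascii", "iso8859_1", "cp437", "cp850", "cp1252",
    "mac_roman", "koi8_r", "utf_8", "utf_16_be",
]

results = {}
for name in codecs_to_try:
    try:
        # Hexadecimal octets, separated by spaces.
        results[name] = text.encode(name).hex(" ")
    except UnicodeEncodeError:
        # This encoding has no code for the character.
        results[name] = None

for name, octets in results.items():
    print(f"{name:10} {octets if octets is not None else 'not representable'}")
```

The same character maps to different octets depending on the encoding, and some encodings cannot represent it at all.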

Spanish character encoding

Lowercase
character ISO-8859-1 UTF-8 UTF-16
a 0x61 0x61 0x00 0x61
b 0x62 0x62 0x00 0x62
c 0x63 0x63 0x00 0x63
d 0x64 0x64 0x00 0x64
e 0x65 0x65 0x00 0x65
f 0x66 0x66 0x00 0x66
g 0x67 0x67 0x00 0x67
h 0x68 0x68 0x00 0x68
i 0x69 0x69 0x00 0x69
j 0x6a 0x6a 0x00 0x6a
k 0x6b 0x6b 0x00 0x6b
l 0x6c 0x6c 0x00 0x6c
m 0x6d 0x6d 0x00 0x6d
n 0x6e 0x6e 0x00 0x6e
o 0x6f 0x6f 0x00 0x6f
p 0x70 0x70 0x00 0x70
q 0x71 0x71 0x00 0x71
r 0x72 0x72 0x00 0x72
s 0x73 0x73 0x00 0x73
t 0x74 0x74 0x00 0x74
u 0x75 0x75 0x00 0x75
v 0x76 0x76 0x00 0x76
w 0x77 0x77 0x00 0x77
x 0x78 0x78 0x00 0x78
y 0x79 0x79 0x00 0x79
z 0x7a 0x7a 0x00 0x7a
Uppercase
character ISO-8859-1 UTF-8 UTF-16
A 0x41 0x41 0x00 0x41
B 0x42 0x42 0x00 0x42
C 0x43 0x43 0x00 0x43
D 0x44 0x44 0x00 0x44
E 0x45 0x45 0x00 0x45
F 0x46 0x46 0x00 0x46
G 0x47 0x47 0x00 0x47
H 0x48 0x48 0x00 0x48
I 0x49 0x49 0x00 0x49
J 0x4a 0x4a 0x00 0x4a
K 0x4b 0x4b 0x00 0x4b
L 0x4c 0x4c 0x00 0x4c
M 0x4d 0x4d 0x00 0x4d
N 0x4e 0x4e 0x00 0x4e
O 0x4f 0x4f 0x00 0x4f
P 0x50 0x50 0x00 0x50
Q 0x51 0x51 0x00 0x51
R 0x52 0x52 0x00 0x52
S 0x53 0x53 0x00 0x53
T 0x54 0x54 0x00 0x54
U 0x55 0x55 0x00 0x55
V 0x56 0x56 0x00 0x56
W 0x57 0x57 0x00 0x57
X 0x58 0x58 0x00 0x58
Y 0x59 0x59 0x00 0x59
Z 0x5a 0x5a 0x00 0x5a
Accents and tildes
character ISO-8859-1 UTF-8 UTF-16
á 0xe1 0xc3 0xa1 0x00 0xe1
Á 0xc1 0xc3 0x81 0x00 0xc1
é 0xe9 0xc3 0xa9 0x00 0xe9
É 0xc9 0xc3 0x89 0x00 0xc9
í 0xed 0xc3 0xad 0x00 0xed
Í 0xcd 0xc3 0x8d 0x00 0xcd
ó 0xf3 0xc3 0xb3 0x00 0xf3
Ó 0xd3 0xc3 0x93 0x00 0xd3
ú 0xfa 0xc3 0xba 0x00 0xfa
Ú 0xda 0xc3 0x9a 0x00 0xda
ü 0xfc 0xc3 0xbc 0x00 0xfc
Ü 0xdc 0xc3 0x9c 0x00 0xdc
ñ 0xf1 0xc3 0xb1 0x00 0xf1
Ñ 0xd1 0xc3 0x91 0x00 0xd1
Symbols
character ISO-8859-1 UTF-8 UTF-16
¿ 0xbf 0xc2 0xbf 0x00 0xbf
? 0x3f 0x3f 0x00 0x3f
¡ 0xa1 0xc2 0xa1 0x00 0xa1
! 0x21 0x21 0x00 0x21
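
The rows of these tables can be reproduced with a few lines of Python. The helper name `table_row` is hypothetical, and the sketch assumes the UTF-16 column uses big-endian byte order without a byte order mark, which is what the listed octets (0x00 first) suggest.

```python
def table_row(ch: str) -> str:
    """Format one row: character, ISO-8859-1, UTF-8, UTF-16 (big-endian)."""

    def hex_octets(encoding: str) -> str:
        return " ".join(f"0x{b:02x}" for b in ch.encode(encoding))

    return (
        f"{ch} {hex_octets('iso8859_1')} "
        f"{hex_octets('utf_8')} {hex_octets('utf_16_be')}"
    )


for ch in "aÁñ¿":
    print(table_row(ch))
```

The output shows the pattern visible in the tables: characters in the ASCII range take one octet in both ISO-8859-1 and UTF-8, while accented letters and the inverted punctuation marks take one octet in ISO-8859-1 but two in UTF-8.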