Unicode

Examples of Unicode characters:
Latin capital letter "A" (U+0041).
Devanagari syllable "Aum" (U+0950).
Chinese ideogram "yue" (U+6708).

Unicode is a character encoding standard designed to facilitate the computer processing, transmission, and display of texts from numerous languages and technical disciplines, as well as classical texts from dead languages. The term Unicode comes from the three objectives pursued: universality, uniformity, and uniqueness.

Unicode defines each character or symbol by a name and a numeric identifier, the code point. It also includes other information needed to use each character correctly, such as writing system, category, directionality, case and other attributes. Unicode treats alphabetic, ideographic, and symbol characters equally, which means they can be mixed in the same text without using markup or control characters.

This standard is maintained by the Unicode Technical Committee (UTC), part of the Unicode Consortium, whose members include companies with varying degrees of involvement, such as Microsoft, Apple, Adobe, IBM, Oracle, SAP, Google and Facebook; institutions such as the University of California, Berkeley, and the Government of India; and individual professionals and academics. The Unicode Consortium maintains a close relationship with ISO/IEC, with which it has had an agreement since 1991 to keep their standards synchronized so that they contain the same characters and code points.

The creation of Unicode has been an ambitious project to replace existing character encoding schemes, many of which were severely limited in size and incompatible with multilingual environments. Unicode has become the most extensive and complete character encoding scheme and is the dominant one in the internationalization and localization of computer software. The standard has been adopted by a considerable number of recent technologies, such as XML, Java, and modern operating systems.

The full description of the standard and character tables are available on the official Unicode website. The complete reference is also published in book form each time a new major version is completed. The digital version of this book is available for free. Revisions and additions are published independently.

Scope of the standard

Unicode includes all the characters in common use today. Version 15.0 contains 149,186 characters from alphabets, ideographic systems and collections of symbols (mathematical, technical, musical, icons...). The figure grows with each version.

Unicode includes modern writing systems, such as Latin, and, for academic purposes, extinct historical scripts, such as cuneiform and runic. Among the non-alphabetic characters included in Unicode are musical and mathematical symbols, game tiles such as dominoes, arrows, icons, etc.

In addition, Unicode includes diacritics as stand-alone characters that can be combined with other characters and has predefined versions of most letters with diacritics in use today, such as accented vowels in Spanish.

Unicode is a constantly evolving standard and new characters are added with each version. Certain proposed alphabets, such as the Klingon alphabet, have been rejected for various reasons.

Relation to other standards

As noted above, Unicode is synchronized with the ISO/IEC standard known as UCS, or Universal Character Set. From a technical point of view, it includes or is compatible with previous encodings such as ASCII or ISO 8859-1; the national standards ANSI Z39.64, KS X 1001, JIS X 0208, JIS X 0212, JIS X 0213, GB 2312, GB 18030, HKSCS, and CNS 11643; and the proprietary encodings of software manufacturers such as Apple, Adobe, Microsoft, IBM, etc. Unicode also reserves space in which software manufacturers can create extensions for their own use.

Character repertoire

The basic element of the Unicode standard is the character. A character is considered to be the smallest meaningful element of a writing system. The Unicode standard encodes the essential characters (graphemes), defining them in an abstract way, and leaves the visual representation (size, dimension, font or style) to the software that handles them, such as word processors or web browsers. Letters, diacritics, punctuation characters, ideograms, syllabic characters, control characters, and other symbols are included. The characters are grouped into alphabets or writing systems. Characters from different alphabets are considered to be different, even when they share shape and meaning.

Characters are identified by a number or code point and by their name or description. When a code has been assigned to a character, that character is said to be encoded. The code space has 1,114,112 possible positions (U+0000 to U+10FFFF). Code points are written in hexadecimal notation with the prefix U+. The hexadecimal value is padded with leading zeros up to 4 hexadecimal digits when necessary; if it is longer than 4 digits, no zeros are added.
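The U+ notation described above can be sketched in a few lines of Python. The helper names `format_code_point` and `parse_code_point` are illustrative, not part of any library.

```python
# Minimal sketch of the U+ notation; helper names are hypothetical.
MAX_CODE_POINT = 0x10FFFF  # last position of the code space


def format_code_point(cp: int) -> str:
    """Return the U+ notation, zero-padded to at least 4 hex digits."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError(f"{cp!r} is outside the Unicode code space")
    return f"U+{cp:04X}"


def parse_code_point(text: str) -> int:
    """Convert a string such as 'U+0041' back to an integer code point."""
    if not text.upper().startswith("U+"):
        raise ValueError(f"{text!r} is not in U+ notation")
    return int(text[2:], 16)


print(format_code_point(ord("A")))   # U+0041
print(format_code_point(0x1F600))    # U+1F600 (longer than 4 digits, no padding)
print(MAX_CODE_POINT + 1)            # 1114112 positions in total
```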

Types of characters

Different versions of the angstrom character: as a letter (preferred version), as a character with a diacritical mark, and as a letter-like symbol.

Code space blocks contain points with the following information:

  • Characters: letters, diacritical marks, digits, punctuation characters, symbols and spaces.
  • Format characters: invisible characters that affect how the surrounding text is processed. Examples: U+2028 line separator, U+2029 paragraph separator, U+00A0 no-break space, etc.
  • Control codes: 65 codes defined for compatibility with ISO/IEC 2022. They are the characters in the ranges [U+0000..U+001F], U+007F and [U+0080..U+009F]. Interpreting them is the responsibility of higher-level protocols.
  • Private use: code points reserved for use outside the standard by software manufacturers.
  • Reserved: code points reserved for future use by Unicode. They are unassigned positions.
  • Surrogates (high and low): Unicode reserves the code points U+D800 to U+DFFF for use as surrogate codes in UTF-16, to represent supplementary characters.
  • Noncharacters: code points permanently reserved for internal use by Unicode, such as the last two code points of each plane (U+FFFE and U+FFFF in plane 0).
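As a rough illustration of the list above, Python's standard `unicodedata` module exposes the general category that the Unicode Character Database assigns to each code point. The mapping from categories to the labels used here is an assumption made for this sketch, and `code_point_type` and `is_noncharacter` are hypothetical helper names. The noncharacter test also covers the range U+FDD0..U+FDEF, which the standard reserves in addition to the last two code points of each plane.

```python
import unicodedata

# Assumed mapping from UCD general categories to the types listed above.
CATEGORY_LABELS = {
    "Cc": "control code",
    "Cf": "format character",
    "Co": "private use",
    "Cs": "surrogate",
    "Cn": "unassigned (reserved or noncharacter)",
}


def code_point_type(cp: int) -> str:
    """Classify a code point by its general category."""
    category = unicodedata.category(chr(cp))
    return CATEGORY_LABELS.get(category, f"character ({category})")


def is_noncharacter(cp: int) -> bool:
    """True for the last two code points of any plane and for U+FDD0..U+FDEF."""
    return (cp & 0xFFFE) == 0xFFFE or 0xFDD0 <= cp <= 0xFDEF


for cp in (0x0041, 0x0009, 0x200E, 0xE000, 0xD800, 0xFFFF):
    print(f"U+{cp:04X}: {code_point_type(cp)}")

# Count the control codes across the whole code space.
control_codes = [cp for cp in range(0x110000)
                 if unicodedata.category(chr(cp)) == "Cc"]
print(len(control_codes))  # 65
```

The exact category of a given code point can depend on the Unicode version bundled with the Python interpreter, so results for recently assigned characters may vary.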

Composition of characters and sequences

Composition of the character "ñ". The first is a standalone character; the second is an n plus a tilde (known in English as a combining tilde).

Unicode includes a mechanism for composing characters, which extends the repertoire while keeping compatibility with existing symbols. A base character is supplemented with marks: diacritics, punctuation marks, or enclosing marks. The type of each character and its attributes define the role it can play in a combination. For this reason, there may be several ways to represent the same character. To facilitate compatibility with older encodings, precomposed characters are provided; the definition of each of these characters states which characters are involved in its composition.

A group of consecutive characters, regardless of their type, forms a sequence. In case several sequences represent the same set of essential characters, the standard does not define one of them as 'correct', but considers them equivalent. In order to identify these equivalences, Unicode defines the mechanisms of canonical equivalence and compatibility equivalence based on obtaining normalized forms of the strings to be compared.
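The two kinds of equivalence can be tried out with the normalization forms exposed by Python's standard `unicodedata.normalize` function. This is a minimal sketch; the helper names `canonically_equivalent` and `compatibility_equivalent` are illustrative.

```python
import unicodedata

precomposed = "\u00F1"   # ñ as a single precomposed character
decomposed = "n\u0303"   # n followed by a combining tilde

# The raw sequences differ even though they represent the same character.
print(precomposed == decomposed)  # False


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def compatibility_equivalent(a: str, b: str) -> bool:
    """Compare two strings after compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


print(canonically_equivalent(precomposed, decomposed))   # True
# The ligature U+FB01 is only compatibility-equivalent to the letters "fi".
print(canonically_equivalent("\uFB01", "fi"))            # False
print(compatibility_equivalent("\uFB01", "fi"))          # True
```

Normalizing both strings to the same form before comparing, searching or sorting is the usual way to apply these equivalences in practice.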

Chinese, Korean and Japanese Unified Repertoire

In the Unicode standard, East Asian ideograms (popularly called "Chinese characters") are called "Han ideograms." These ideograms were developed in China and were adapted by nearby cultures, such as Japan, Korea, and Vietnam, for their own use. Japan and Korea also developed their own syllabic or alphabetic systems to be used in combination with the Chinese symbols: hiragana and katakana (in Japan) and hangul (in Korea). The natural evolution of the writing systems and the different moments at which the characters entered each culture have led to marked differences in the ideograms used. Unicode considers the different versions of an ideogram to be variants of the same abstract character, that is, the result of applying a different font in each case, and considers the national variants to belong to the same writing system. The original version of the standard was developed from existing industry standards in the affected countries.

The body in charge of developing the character repertoire is the Ideographic Rapporteur Group (IRG). IRG is a working group integrated into ISO/IEC JTC1/SC2/WG2, including China, Hong Kong, Macao, Taipei Computer Association, Singapore, Japan, South Korea, North Korea, Vietnam and the United States of America.

The CJK character database is called Unihan. It also contains auxiliary information about meanings and conversions, and the data needed to use the characters in the different languages that employ them. Below are the blocks that describe this repertoire. IRG defines the characters of the seven unified groups; the last two groups contain characters for compatibility with earlier standards.

Block | Code range | Comments
CJK Unified Ideographs | 4E00-9FFF | Ideograms in common use. Code size: 2 bytes.
CJK Unified Ideographs Extension A | 3400-4DBF | Ideograms of unusual use. Code size: 2 bytes.
CJK Unified Ideographs Extension B | 20000-2A6DF | Ideograms of unusual and historical use.
CJK Unified Ideographs Extension C | 2A700-2B73F | Ideograms of unusual and historical use.
CJK Unified Ideographs Extension D | 2B740-2B81F | Ideograms of unusual and historical use.
CJK Unified Ideographs Extension E | 2B820-2CEAF | Ideograms of unusual and historical use.
CJK Unified Ideographs Extension F | 2CEB0-2EBEF | Ideograms of unusual and historical use.
CJK Compatibility Ideographs | F900-FAFF | Duplicates, unifiable variants and corporate characters. Code size: 2 bytes.
CJK Compatibility Ideographs Supplement | 2F800-2FA1F | Unifiable variants.

Ideographic description sequences

It is acknowledged that the task of including ideograms in the standard can never be finished, mainly because new ideograms continue to be created. To fill any gaps, Unicode offers a mechanism called "ideographic description sequences" that allows missing symbols to be represented. It is based on the fact that, in practice, all ideograms can be broken down into smaller pieces that are, in turn, ideograms. Although a symbol can be represented by means of a sequence, the standard specifies that whenever an encoded version exists, its use should be preferred. There is no method for the "canonical decomposition" of ideograms and there are no equivalence algorithms, so operations on the text, such as searching or sorting, may fail.

Unicode defines 12 control characters for the description of ideograms representing different possibilities of spatial combination of other Han characters.
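The 12 description characters occupy the range U+2FF0 to U+2FFB, and their names can be listed with Python's `unicodedata` module. The sketch below also builds one sequence by hand. The decomposition of 明 (U+660E) into 日 (U+65E5) and 月 (U+6708) placed left to right is chosen as an illustration, and the function name is hypothetical.

```python
import unicodedata

# Range of the 12 ideographic description characters.
IDC_FIRST, IDC_LAST = 0x2FF0, 0x2FFB


def ideographic_description_characters() -> list[tuple[str, str]]:
    """Return (U+ notation, name) pairs for the description characters."""
    return [
        (f"U+{cp:04X}", unicodedata.name(chr(cp)))
        for cp in range(IDC_FIRST, IDC_LAST + 1)
    ]


for notation, name in ideographic_description_characters():
    print(notation, name)

# Illustrative sequence: "left to right" operator followed by its two parts.
# It describes the shape of U+660E; since that ideogram is already encoded,
# the encoded character is the preferred representation.
ids_for_ming = "\u2FF0\u65E5\u6708"
print(len(ids_for_ming))  # 3 code points: operator + two components
```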

Elements of the Unicode Standard

Design principles

The standard was designed with the following objectives:

  • Universality: A repertoire wide enough to cover all the characters likely to appear in the exchange of multilingual text.
  • Efficiency: Generated sequences should be easy to treat.
  • No ambiguity: A given code always represents the same character.

Character database

The set of characters encoded by Unicode is described in the UCD (Unicode Character Database). In addition to the name and code point, it includes more information: the alphabet to which the character belongs, classification, case, directionality and other usage properties, standardized variants, combination rules, etc.
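Part of this database can be queried from Python through the standard `unicodedata` module, which bundles a snapshot of the UCD. The sketch below gathers a few properties for one character; the function name `describe` and the dictionary keys are illustrative choices.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few UCD properties for a single character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),            # e.g. Lu = uppercase letter
        "bidi_class": unicodedata.bidirectional(ch),     # e.g. L = left to right
        "combining_class": unicodedata.combining(ch),    # 0 for non-combining characters
        "decomposition": unicodedata.decomposition(ch),  # empty string if none
        "uppercase": ch.upper(),
    }


for ch in ("A", "\u00F1", "\u0303", "\u0950", "\u6708"):
    print(describe(ch))
```

For example, `describe("\u00F1")["decomposition"]` returns `"006E 0303"`, the pair of code points from which the precomposed ñ is built.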

Formally, the database is divided into planes and these in turn into areas and blocks. With exceptions, the encoded characters are grouped in the code space by categories such as alphabet or script, so that related characters are close together in the encoding tables.

Planes

For convenience, the code space is divided into large groups called planes. Each plane contains a maximum of 65,536 code points. Given a code point expressed in hexadecimal, the last 4 digits determine the position of the character within the plane.

  • Basic Multilingual Plane: BMP or plane 0. It contains most modern alphabets, including the most common characters of the CJK system, other historical or unusual characters, and 6,400 code points reserved for private use.
  • Supplementary Multilingual Plane: SMP or plane 1. Rarely used historical alphabets and systems for technical or other uses.
  • Supplementary Ideographic Plane: SIP or plane 2. It contains the characters of the CJK system that are not included in plane 0. Most are very rare characters or of historical interest.
  • Supplementary Special-purpose Plane: SSP or plane 14. Area for control characters that have not been placed in plane 0.
  • Private use planes: planes 15 and 16. Reserved for private use by software manufacturers.
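Because each plane spans 65,536 positions, the plane number and the position within the plane can be obtained with a shift and a mask. The sketch below assumes the plane abbreviations from the list above; `plane_and_offset` is a hypothetical helper name.

```python
# Abbreviations for the planes named in the list above.
PLANE_NAMES = {
    0: "BMP",
    1: "SMP",
    2: "SIP",
    14: "SSP",
    15: "private use",
    16: "private use",
}


def plane_and_offset(cp: int) -> tuple[int, int]:
    """Split a code point into (plane number, position within the plane)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"{cp!r} is outside the Unicode code space")
    return cp >> 16, cp & 0xFFFF   # the low 16 bits are the last 4 hex digits


for cp in (0x6708, 0x1F600, 0x20000, 0xE0001, 0x10FFFF):
    plane, offset = plane_and_offset(cp)
    label = PLANE_NAMES.get(plane, "other")
    print(f"U+{cp:04X}: plane {plane} ({label}), position {offset:04X}")
```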

Areas and blocks

The various planes are divided into addressing areas based on the general types they include. This division is conventional, not regulated, and can vary over time. The areas are divided, in turn, into blocks. The blocks are normatively defined and are consecutive ranges of the code space. The blocks are used to form the printed tables of characters but should not be taken as definitions of significant groups of characters.

Information processing

Forms of encoding

Unicode code points are identified by an integer. Depending on its architecture, a computer will use units of 8, 16, or 32 bits to represent these integers. Unicode's encoding forms regulate the way in which code points are to be transformed into computer-processable units.

Unicode defines three forms of encoding under the name UTF (Unicode transformation format):

  • UTF-8: Byte-oriented encoding with variable length symbols.
  • UTF-16: Variable-length encoding with 16-bit code units, optimized for representing the Basic Multilingual Plane (BMP).
  • UTF-32: 32-bit encoding of fixed length, and the simplest of the three.

The forms of encoding merely describe the way code points are represented in machine-intelligible form. From the 3 identified forms, 7 coding schemes are defined.
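The difference between the three encoding forms can be seen by splitting the output of Python's built-in codecs into code units. This is a sketch: `code_units` and `utf16_surrogates` are illustrative helpers, and big-endian serialization is chosen here only so that the units can be read back as integers.

```python
UNIT_SIZES = {"utf-8": 1, "utf-16": 2, "utf-32": 4}  # bytes per code unit


def code_units(text: str, form: str) -> list[int]:
    """Return the code units of `text` in the given encoding form."""
    size = UNIT_SIZES[form]
    codec = form if size == 1 else form + "-be"   # fixed byte order, no BOM
    data = text.encode(codec)
    return [int.from_bytes(data[i:i + size], "big")
            for i in range(0, len(data), size)]


def utf16_surrogates(cp: int) -> tuple[int, int]:
    """Compute the surrogate pair for a code point above U+FFFF."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points use surrogates")
    value = cp - 0x10000
    return 0xD800 + (value >> 10), 0xDC00 + (value & 0x3FF)


for ch in ("A", "\u00F1", "\u6708", "\U0001F600"):
    for form in UNIT_SIZES:
        units = " ".join(f"{u:X}" for u in code_units(ch, form))
        print(f"U+{ord(ch):04X} {form}: {units}")
```

A character from the BMP takes one UTF-16 unit, while U+1F600 takes the pair D83D DE00, which matches `utf16_surrogates(0x1F600)`.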

Encoding schemes

The encoding schemes deal with the way in which the encoded information is serialized. Reliable information exchange between heterogeneous systems requires mechanisms that determine the correct order of bits and bytes and ensure that the information is reconstructed correctly. A fundamental difference between processors is the order in which bytes are arranged in 16-bit and 32-bit words, called endianness. Encoding schemes must ensure that both ends of a communication know how to interpret the information received. From the 3 encoding forms, 7 schemes are defined. Although they share names, encoding schemes and encoding forms should not be confused.

Encoding scheme | Endianness | BOM allowed
UTF-8 | Not applicable | Yes
UTF-16 | Big-endian or little-endian | Yes
UTF-16BE | Big-endian | No
UTF-16LE | Little-endian | No
UTF-32 | Big-endian or little-endian | Yes
UTF-32BE | Big-endian | No
UTF-32LE | Little-endian | No

Unicode defines a special mark, the Byte Order Mark (BOM), at the start of a file or communication to make byte ordering explicit. When a higher protocol specifies byte order, the mark is not required and can be omitted resulting in the schemes in the list above with BE or LE suffix. In the UTF-16 and UTF-32 schemes, which support BOM, if this is not specified the byte ordering is assumed to be big-endian.

The encoding unit in UTF-8 is the byte, so it does not need a byte order indication. The standard neither requires nor recommends the use of a BOM in UTF-8, but supports it as a mark that the text is Unicode or as a result of conversion from other schemes.

History

The Unicode project began in late 1987, after discussions between Joe Becker, Lee Collins, and Mark Davis (engineers at Apple and Xerox). As a result of their collaboration, the first draft of Unicode was published in August 1988 under the name Unicode88. This first version assumed that only the characters necessary for modern use would be encoded, so 16-bit codes were used.

During 1989, collaborators from other companies such as Microsoft or Sun Microsystems joined. On January 3, 1991, the Unicode Consortium was formed, and in October 1991, the first version of the standard was published. The second version, which already included the Han ideographic script, was published in June 1992. Below is a table with the different versions of the Unicode Standard with their most important additions or modifications.

Version | Date | Publication | Associated ISO/IEC 10646 edition | Scripts | Characters | Notable additions
1.0 | October 1991 | ISBN 0-201-56788-1 (Vol. 1) | | 24 | 7,161 | The initial repertoire covers the scripts Arabic, Armenian, Bengali, Bopomofo, Cyrillic, Devanagari, Georgian, Greek/Coptic, Gujarati, Gurmukhi, Hangul, Hebrew, Hiragana, Kannada, Katakana, Lao, Latin, Malayalam, Oriya, Tamil, Telugu, Thai, and Tibetan.
1.0.1 | June 1992 | ISBN 0-201-60845-6 (Vol. 2) | | 25 | 28,359 | Defined the first set of 20,902 unified CJK ideograms.
1.1 | June 1993 | | ISO/IEC 10646-1:1993 | 24 | 34,233 | Added 4,306 Hangul characters to the original set of 2,350. The Tibetan script is removed.
2.0 | July 1996 | ISBN 0-201-48345-9 | ISO/IEC 10646-1:1993 with amendments 5, 6 and 7 | 25 | 38,950 | Removed the original set of Hangul characters; added a new set of 11,172 Hangul characters in a new location. The Tibetan script is reinstated in a new location and with a different character set. The surrogate mechanism is defined and planes 15 and 16 are created for private use.
2.1 | May 1998 | | ISO/IEC 10646-1:1993 with amendments 5, 6 and 7, and two characters of amendment 18 | 25 | 38,952 | The euro sign is added.
3.0 | September 1999 | ISBN 0-201-61633-5 | ISO/IEC 10646-1:2000 | 38 | 49,259 | Cherokee, Ethiopic, Khmer, Mongolian, Myanmar, Ogham, Runic, Sinhala, Syriac, Thaana, Unified Canadian Aboriginal Syllabics, and Yi, as well as Braille patterns.
3.1 | March 2001 | | ISO/IEC 10646-1:2000; ISO/IEC 10646-2:2001 | 41 | 94,205 | Deseret, Gothic and Old Italic (Etruscan) are added, along with symbols for modern musical notation and Byzantine music, and 42,711 unified CJK ideograms.
3.2 | March 2002 | | ISO/IEC 10646-1:2000 with amendment 1; ISO/IEC 10646-2:2001 | 45 | 95,221 | Added the Philippine scripts Buhid, Hanunó'o, Tagalog, and Tagbanwa.
4.0 | April 2003 | ISBN 0-321-18578-1 | ISO/IEC 10646:2003 | 52 | 96,447 | Cypriot, Limbu, Linear B, Osmanya, Shavian, Tai Le, and Ugaritic are added, along with the I Ching hexagrams.
4.1 | March 2005 | | ISO/IEC 10646:2003 with amendment 1 | 59 | 97,720 | Added Buginese, Glagolitic, Kharoshthi, New Tai Lue, Old Persian, Syloti Nagri, and Tifinagh. Coptic is separated from the Greek alphabet. Ancient Greek symbols for music and numbers.
5.0 | July 2006 | ISBN 0-321-48091-0 | ISO/IEC 10646:2003 with amendments 1 and 2 and four characters of amendment 3 | 64 | 99,089 | Added Balinese, Cuneiform, N'Ko (Mande), Phags-pa, and Phoenician.
5.1 | April 2008 | | ISO/IEC 10646:2003 with amendments 1, 2, 3 and 4 | 75 | 100,713 | Added Carian, Cham, Kayah Li, Lepcha, Lycian, Lydian, Ol Chiki, Rejang, Saurashtra, Sundanese, and the Vai syllabary. Also the symbols of the Phaistos Disc, mahjong tiles and domino tiles. Important additions to Burmese, scribal letters and abbreviations used in medieval manuscripts, and the capital ß.
5.2 | October 2009 | ISBN 978-1-936213-00-9 | ISO/IEC 10646:2003 with amendments 1 to 6 | 90 | 107,361 | Added Bamum, Javanese, Lisu, Meetei Mayek, Samaritan, Tai Tham, and Tai Viet. Devanagari is expanded with characters for Sanskrit. Significant expansions for Abkhaz, Unified Canadian Aboriginal Syllabics, Coptic, Khamti Shan, Malay, and Myanmar. Historical symbols and scripts are also added, such as the Egyptian hieroglyphs of Gardiner's list, Imperial Aramaic, Avestan, Kaithi, Old South Arabian and Old Turkic.
6.0 | October 2010 | ISBN 978-1-936213-01-6 | ISO/IEC 10646:2011 | 93 | 109,449 | First major version of the standard published exclusively in electronic form. Adds Mandaic, Batak and Brahmi, and expansions for African scripts such as Tifinagh, Ethiopic and Bamum. Other important additions: 222 CJK ideograms, 1,000 symbols including emoji pictographs, the new official rupee sign and alchemical symbols, as well as extensions of character attributes and other normative and algorithmic changes.
6.1 | 2012 | ISBN 978-1-936213-02-3 | ISO/IEC 10646:2012 | | 110,116 | Extensions of several existing alphabets; significant additions to the Arabic script, including 143 mathematical alphabetic symbols; the scripts Pollard Miao, Sora Sompeng, Meroitic, Chakma and Sharada; and 13 emoticons.
6.2 | 2012 | ISBN 978-1-936213-07-8 | ISO/IEC 10646:2012 plus the Turkish lira sign | | 110... | Special publication for the introduction of the Turkish lira sign.
6.3 | 2013 | ISBN 978-1-936213-08-5 | ISO/IEC 10646:2012 with amendments | | 110,122 | Revision of the bidirectional text algorithm with the addition of 5 special characters. The new bidirectional algorithm improves the joint display of texts from different sources while respecting the correct order of the characters.
7.0 | 2014 | ISBN 978-1-936213-09-2 | ISO/IEC 10646:2012 with amendments and the ruble sign | | 112,956 | Adds 23 new writing systems.
8.0 | 2015 | ISBN 978-1-936213-10-8 | | | |
9.0 | 2016 | ISBN 978-1-936213-13-9 | | | |
10.0 | 2017 | ISBN 978-1-936213-16-0 | | 139 | 136... | Among others, the Bitcoin sign, 56 emoji characters and the scripts Masaram Gondi, Nüshu, Soyombo and Zanabazar Square. Extension F of the CJK unified ideograms is added.
11.0 | 5 June 2018 | ISBN 978-1-936213-19-1 | | | 137,374 | Dogra, Georgian Mtavruli capital letters, Gunjala Gondi, Hanifi Rohingya, Makasar, Medefaidrin, Old Sogdian, Sogdian, and various symbols (5 new CJK unified ideograms, 66 additional emoji, copyleft, half stars, additional astrological symbols and xiangqi Chinese chess pieces).
12.0 | 5 March 2019 | ISBN 978-1-936213-22-1 | | 150 | 137,928 | Elymaic.
12.1 | 7 May 2019 | ISBN 978-1-936213-25-2 | | | 137,929 | Adds a single character for the Reiwa era.
13.0 | 10 March 2020 | ISBN 978-1-936213-26-9 | | 154 | 143,859 | Adds 4 new scripts, among them Chorasmian, Dives Akuru, and Khitan Small Script.
14.0 | 14 September 2021 | | | | 144,697 | Toto, Cypro-Minoan, Vithkuqi, Old Uyghur, Tangsa; Latin script additions in SMP blocks (Ext-F, Ext-G) for use in extended IPA; Arabic script additions for languages of Africa, Iran, Pakistan, Malaysia, Indonesia, Java and Bosnia, and for writing honorifics; additions for use in the northern Philippines; other additions to support languages in Kyrgyzstan.
15.0 | 13 September 2022 | | | | 149,186 | Kawi and Mundari, various new characters including 20 emoji, 4,192 CJK ideograms, and Egyptian hieroglyph control characters.
