ISO/IEC 10646


The international standard ISO/IEC 10646 defines the Universal Character Set (UCS) as a multi-octet character encoding system. The latest version contains about 136,000 abstract characters, each precisely identified by an integer called a code point. ISO/IEC 10646 is maintained alongside the Unicode standard, and the two are identical in their character codes. It was first published in 1993, which is why it is sometimes also cited as ISO/IEC 10646-1:1993.

This set contains the various characters (letters, numbers, symbols, ideograms, logograms, etc.) of many languages and writing systems, as well as those of the various punctuation traditions found in the world's languages, each represented in the UCS by a unique code point. New characters are added continually, so the set is permanently being updated.

Since 1991, the Unicode Consortium has worked with ISO to develop the Unicode standard and ISO/IEC 10646 together. The repertoire, character names, and code points of Unicode version 2.0 exactly match those of ISO/IEC 10646-1:1993 with its first seven published amendments. After the publication of Unicode 3.0 in February 2000, new characters were introduced into the UCS via ISO/IEC 10646-1:2000.

The UCS has about 1.1 million code points, but only the first 65,536 (the Basic Multilingual Plane, or BMP) were in use before 2000. This began to change when the People's Republic of China (PRC) ruled in 2000 that computer systems sold in its territory must support GB18030, which meant that systems intended for sale in the PRC had to go beyond the BMP. The set as defined today deliberately leaves some gaps so that further characters can be added in the future without conflicting with the current ones.

Ways to encode the Universal Character Set

ISO 10646 defines several character "encoding forms" for the Universal Character Set. The simplest, UCS-2, uses a single "code value" (defined as one or more numbers representing a code point) between 0 and 65,535 for each character and allows exactly two bytes (one 16-bit word) to represent that value. UCS-2 therefore permits a binary representation of every code point in the BMP, with a one-to-one correspondence between the code value and the code point representing the character. UCS-2 cannot represent code points outside the BMP.
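The two-byte rule can be illustrated with a short Python sketch. The function name `encode_ucs2_be` and the choice of big-endian byte order are assumptions made for this example; they are not taken from the standard's text.

```python
import struct


def encode_ucs2_be(text: str) -> bytes:
    """Encode text as big-endian UCS-2: exactly two bytes per code point.

    Hypothetical helper for illustration only.
    """
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        if code_point > 0xFFFF:
            # Code points beyond the BMP do not fit in one 16-bit code value.
            raise ValueError(f"U+{code_point:04X} is outside the BMP")
        out += struct.pack(">H", code_point)  # one unsigned 16-bit word
    return bytes(out)


if __name__ == "__main__":
    print(encode_ucs2_be("A\u20ac").hex())  # 004120ac
```

Because every character occupies exactly two bytes, the n-th character always starts at byte offset 2n, which is the practical appeal of a fixed-width form.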

The first amendment to the original edition of the UCS defined UTF-16, an extension of UCS-2, to represent code points outside the BMP. A range of code points in the S (Special) Zone of the BMP remains unassigned to characters. UCS-2 does not allow code values to be used for these code points, but UTF-16 allows them to be used in pairs. Each element of a pair is an "RC-element" (a two-octet sequence made up of the R-octet and the C-octet of the four-octet sequence that corresponds to a cell in the coding space). The Unicode standard has also adopted UTF-16, but in Unicode terminology the elements of the high-half zone are called "high surrogates" and those of the low-half zone "low surrogates".
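The pairing mechanism can be sketched in Python as follows. The function names are hypothetical. The constants (0xD800 and 0xDC00 as the starts of the two reserved half-ranges, 10 bits carried by each element) are the values used by UTF-16 as commonly specified, stated here because the paragraph above does not give them.

```python
def to_pair(code_point: int) -> tuple[int, int]:
    """Split a code point outside the BMP into two 16-bit code values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a pair")
    offset = code_point - 0x10000        # a 20-bit number
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low


def from_pair(high: int, low: int) -> int:
    """Recombine a pair of code values into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_pair(0x1F600)
    print(f"{high:04X} {low:04X}")        # D83D DE00
    print(f"{from_pair(high, low):X}")    # 1F600
```

The result agrees with Python's built-in codec: `"\U0001F600".encode("utf-16-be")` yields the same four bytes, D8 3D DE 00.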

Another encoding form is UCS-4, which uses a single code value between 0 and (theoretically) hexadecimal 7FFFFFFF for each character (although the UCS stops at 10FFFF, and ISO/IEC 10646 has committed to keeping all future character assignments within that range). UCS-4 represents each value using exactly four bytes (one 32-bit word). It therefore allows a binary representation of every code point in the UCS, including those outside the BMP. As in UCS-2, every encoded character has a fixed length in bytes.
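As a minimal sketch of the four-byte form, the Python function below packs each code point into one 32-bit word. The name `encode_ucs4_be` and the big-endian byte order are assumptions for this example. Python's built-in `utf-32-be` codec is used only as a cross-check.

```python
import struct


def encode_ucs4_be(text: str) -> bytes:
    """Encode text as big-endian UCS-4: exactly four bytes per code point.

    Hypothetical helper for illustration only.
    """
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)


if __name__ == "__main__":
    sample = "A\U0001F600"  # one BMP character, one character outside the BMP
    encoded = encode_ucs4_be(sample)
    print(encoded.hex())                          # 000000410001f600
    print(encoded == sample.encode("utf-32-be"))  # True
```

Both characters take four bytes, including the one outside the BMP, so no pairing mechanism is needed.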

History of ISO 10646

The International Organization for Standardization (ISO) set out to specify a universal character set in 1989 and published a draft of ISO 10646 in 1990. Hugh McGregor Ross was one of its principal architects. That draft differed considerably from the current standard. It defined 128 groups of 256 planes, each plane having 256 rows of 256 cells. This gave an apparent capacity of 2,147,483,648 characters, but the draft could actually encode only 679,477,248, because its rules prohibited control character values (0x00 through 0x1F and 0x80 through 0x9F, in hexadecimal notation) in any of the four octets. For example, the letter A was located at group 0x20, plane 0x20, row 0x20, cell 0x41.
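Both totals can be reproduced with a few lines of arithmetic. The sketch below assumes that the control-value prohibition applied to each of the four octets, which is the reading that yields the quoted figure; the variable and function names are illustrative.

```python
# Apparent capacity: 128 groups, then 256 values for each of the other
# three octets (plane, and the two octets locating a cell within a plane).
apparent = 128 * 256 * 256 * 256

# Values forbidden in an octet: 0x00-0x1F and 0x80-0x9F (32 values each).
allowed_per_octet = 256 - 32 - 32   # 192
# Groups only range over 0x00-0x7F, so only 0x00-0x1F is excluded there.
allowed_groups = 128 - 32           # 96

usable = allowed_groups * allowed_per_octet ** 3


def position(group: int, plane: int, third: int, cell: int) -> int:
    """Combine the four octets of a draft position into one 32-bit value."""
    return (group << 24) | (plane << 16) | (third << 8) | cell


if __name__ == "__main__":
    print(f"{apparent:,}")   # 2,147,483,648
    print(f"{usable:,}")     # 679,477,248
    print(hex(position(0x20, 0x20, 0x20, 0x41)))  # 0x20202041
```

The position of the letter A shows the effect of the rule: each octet starts at 0x20, the first value that is not a control character.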

Characters of this original ISO 10646 draft could be encoded in one of three ways:

UCS-4, four octets for each character, allowing simple encoding of all characters;
UCS-2, two octets for each character, allowing direct encoding of the first plane, 0x20 (the Basic Multilingual Plane, containing the first 36,864 code points), with other planes and groups reached by switching to them with ISO 2022 escape sequences;
UTF-1, which encodes all characters as variable-length sequences of octets (1 to 5 octets, none of which is a control character value).

In 1990, two initiatives to create a universal character set were therefore under way: Unicode, with 16 bits for each character (65,536 possible characters), and ISO 10646. Software companies refused to accept the complexity and size requirements of the ISO standard and managed to convince a number of national ISO bodies to vote against it. The ISO standardizers accepted that they could not continue to support the standard in its current state and agreed to negotiate the unification of their standard with Unicode. Two changes followed: the limitation on characters (the prohibition of control character values) was dropped, thus permitting code values such as 0x0000101F, whose octets fall in the formerly prohibited ranges; and the repertoire of the Basic Multilingual Plane was synchronized with that of Unicode.

Meanwhile, the situation changed in the Unicode standard itself: 65,536 characters began to seem insufficient, and from version 2.0 onward the standard supports 1,112,064 code points by means of the UTF-16 encoding. For that reason, ISO 10646 was limited to as many characters as UTF-16 can encode, and no more: a little over a million code points instead of over 2 billion. The UCS-4 encoding of ISO 10646 was incorporated into the Unicode standard, restricted to the UTF-16 range, under the name UTF-32. As for UTF-1, no one used it, due in part to its poor design (there was no way to tell the role of an octet within a sequence, a problem it shares with the Japanese Shift-JIS encoding) and its poor performance (it requires many division operations). Rob Pike and Ken Thompson, the designers of Bell Labs' Plan 9 operating system, devised a new, fast, and well-designed variable-width encoding, which they called UTF-8.
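The figure of 1,112,064 follows from how UTF-16 is built. The sketch below derives it in two ways. It relies on two facts not spelled out above: the range 0 to 0x10FFFF consists of 17 planes of 65,536 code points, and 2,048 code points (0xD800 to 0xDFFF) are reserved for pairing. The variable names are illustrative.

```python
PLANE_SIZE = 0x10000            # 65,536 code points per plane
PLANES = 17                     # code points 0x000000 through 0x10FFFF
RESERVED_FOR_PAIRS = 0xDFFF - 0xD800 + 1    # 2,048 (two halves of 1,024)

# View 1: the whole code space minus the code points reserved for pairing.
total_code_points = PLANES * PLANE_SIZE
encodable = total_code_points - RESERVED_FOR_PAIRS

# View 2: what UTF-16 can express, either as a single 16-bit code value
# or as one of 1,024 x 1,024 possible pairs.
single_values = PLANE_SIZE - RESERVED_FOR_PAIRS
paired_values = 1024 * 1024
utf16_capacity = single_values + paired_values

if __name__ == "__main__":
    print(f"{total_code_points:,}")     # 1,114,112
    print(f"{encodable:,}")             # 1,112,064
    print(encodable == utf16_capacity)  # True
```

The highest value reachable with a pair is 0x10000 + 1024 * 1024 - 1 = 0x10FFFF, which is where the UCS stops.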

Field of application and scope

The multi-octet Universal Coded Character Set (UCS) is applicable to the representation, transmission, exchange, processing, storage, input, and presentation of virtually all of the world's languages in written form. The basic part of the standard, from 1993, specifies the general architecture of this four-octet (32-bit) encoding, a compendium of numerous national and international character sets. It defines the terms it uses and the general structure of the UCS. It also specifies the Basic Multilingual Plane (BMP), a reduction of the code to 2 octets (16 bits), together with the set of graphic symbols that make it up and their coded representations. Finally, it sets out the coded representations of the control functions and the procedure for managing future additions to the character set.

Map to Unicode

ISO/IEC 10646-1:1993 ≈ Unicode 1.1
ISO/IEC 10646-1:2000 ≈ Unicode 3.0
ISO/IEC 10646-2:2001 ≈ Unicode 3.2
ISO/IEC 10646:2003 ≈ Unicode 4.0
ISO/IEC 10646:2003 plus Amendment 1 ≈ Unicode 4.1
ISO/IEC 10646:2003 plus Amendment 1, Amendment 2, and part of Amendment 3 ≈ Unicode 5.0

See: §D.1 of The Unicode Standard for more details.

Connection with other standards

The first 128 characters of the Basic Multilingual Plane (BMP), used for the 16-bit interchange code, correspond to ISO 646, the international version of ASCII. The characters that make up the second half of the first row are those used by ISO 8859-1, the Latin-1 set. A related document, ISO/IEC DIS 14755, describes methods of inputting characters from the ISO/IEC 10646 repertoire with a keyboard or other input devices. ISO/IEC 10646 was expected to become the basic information representation code for all 16-bit and 32-bit systems.
