Floating point



Floating-point representation is a form of scientific notation used in computers. It allows very large and very small real numbers to be represented efficiently and compactly, and arithmetic operations to be performed on them. The current standard for floating-point representation is IEEE 754.

Scientific notation

Main article: Scientific notation

Since floating-point representation is almost identical to traditional scientific notation, with some additions and differences, we first describe scientific notation to understand how it works, and then describe floating-point representation and how it differs.

Representation

Scientific notation is used to represent real numbers. If r is the real number to be represented, its representation in scientific notation is made up of three parts:

$r = c \cdot b^{e}$

c. The coefficient, a real number with one integer digit followed by a decimal point (or comma) and several fractional digits.
b. The base, which is 10 in our decimal system and 2 in the binary system used by computers.
e. The exponent, an integer to which the base is raised.
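
As a rough illustration of these three parts for base 10, the following Python sketch splits a real number into a coefficient and an exponent. The function name `to_scientific` and the `digits` parameter are illustrative choices, not part of any standard; the normalization is delegated to Python's `e` format.

```python
def to_scientific(r: float, digits: int = 9) -> tuple[float, int]:
    """Split r into (c, e) such that r is approximately c * 10**e.

    `digits` is the number of significant digits kept in the coefficient.
    """
    # The 'e' format normalizes to one digit before the decimal point,
    # e.g. -1234.56789 -> '-1.23456789e+03'.
    text = f"{r:.{digits - 1}e}"
    coefficient, exponent = text.split("e")
    return float(coefficient), int(exponent)


c, e = to_scientific(-1234.56789)
print(c, e)  # -1.23456789 3
```

The coefficient comes back as a binary float, so it is itself only an approximation of the decimal digits shown.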

Coefficient

A sign on the coefficient indicates whether the real number is positive or negative.

The coefficient has a certain number of significant digits, which indicate the precision of the represented number: the more digits the coefficient has, the more precise the representation. For example, the number π can be represented in scientific notation with 3 significant figures, 3.14 × 10⁰, or with 12 significant figures, 3.14159265359 × 10⁰; the second representation is much more precise than the first.

Base and exponent

The coefficient is multiplied by the base raised to an integer exponent. In the decimal system, the base is 10. Multiplying the coefficient by the base raised to an integer power moves the decimal point of the coefficient by as many positions (digits) as the exponent indicates. The point shifts to the right if the exponent is positive, or to the left if it is negative.

Example of how a number changes when the exponent of the base changes:

2.71828 × 10⁻² represents the real number 0.0271828
2.71828 × 10⁻¹ represents the real number 0.271828
2.71828 × 10⁰ represents the real number 2.71828 (the zero exponent indicates that the decimal point does not move)
2.71828 × 10¹ represents the real number 27.1828
2.71828 × 10² represents the real number 271.828
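
The same shifts can be reproduced with Python's standard `decimal` module. This is only a sketch: `Decimal` is used here because it keeps decimal digits exactly, which an ordinary binary `float` would not.

```python
from decimal import Decimal

coefficient = Decimal("2.71828")

# scaleb(n) multiplies by 10**n exactly, i.e. it shifts the decimal point
# n places to the right (n > 0) or to the left (n < 0).
for exponent in (-2, -1, 0, 1, 2):
    print(f"{coefficient} x 10^{exponent} = {coefficient.scaleb(exponent)}")
```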

Example

An example of a number in scientific notation is as follows:

-1.234 567 89 × 10³

The coefficient is -1.23456789; it has 9 significant digits and is multiplied by the base 10 raised to the power 3. The sign of the coefficient indicates whether the real number represented is positive or negative.

The value of the exponent indicates how many places (digits) the decimal point of the coefficient must be shifted to obtain the final real number. The sign of the exponent gives the direction of the shift: a positive exponent shifts the point to the right, while a negative exponent shifts it to the left. If the exponent is zero, the point does not move. This is the reason for the name "floating point": the point moves, or "floats", by as many digits as the exponent of the base indicates, and changing the exponent makes it "float" to another position.

In the number given above in scientific notation, -1.23456789 × 10³, the exponent is +3, indicating that the decimal point of the coefficient -1.23456789 should be shifted 3 places to the right, which gives the equivalent real number:

-1234.567 89

The following is a table with examples of real numbers with three significant digits and their representation in scientific notation:

Real number	Scientific notation
123 000 000 000 000 000 000.0	1.23 × 10²⁰
123 000 000.0	1.23 × 10⁸
1230.0	1.23 × 10³
123.0	1.23 × 10²
12.3	1.23 × 10¹
1.23	1.23 × 10⁰
0.123	1.23 × 10⁻¹
0.012 3	1.23 × 10⁻²
0.001 23	1.23 × 10⁻³
0.000 000 012 3	1.23 × 10⁻⁸
0.000 000 000 000 000 000 012 3	1.23 × 10⁻²⁰

As the table shows, scientific notation is much more compact when numbers are very large in magnitude or very small in magnitude (close to zero). For this reason it is widely used in science, where huge quantities such as the mass of the Sun, 1.98892 × 10³⁰ kg, or tiny ones such as the charge of the electron, -1.602176487 × 10⁻¹⁹ coulombs, have to be handled. For the same reason it is also used, in floating-point form, to represent real numbers in computers.

Representation on computers and calculators

Computers and calculators accept and display numbers in scientific notation in several different ways. For example, depending on the system, the speed of light, 2.99792458 × 10⁸ m/s, can be written as follows:

Notation	Commentary
2.99792458 × 10⁸	Standard scientific notation used in science and technology.
2.99792458e8	Usually used in computers and calculators; the "e" is sometimes written in upper case.
2.99792458d8	Used in the BASIC language for double-precision numbers (about 15 significant digits). The "e", as in the previous row, is used for single-precision numbers (about 6 to 7 significant digits).
2.99792458 x 10⁸	Used in calculators. The exponent of 10 (the expression x 10⁸) is entered with different keys depending on the calculator, such as 10^x or EXP.
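
As an example of the "e" form in practice, the short Python sketch below parses and prints the speed of light using only built-in features. Other languages follow similar but not identical conventions, so this should be read as one instance rather than a general rule.

```python
# Python accepts the "e" form (lower or upper case) when parsing text.
c_lower = float("2.99792458e8")
c_upper = float("2.99792458E8")
print(c_lower == c_upper)   # True

# The "e" presentation type writes a float back out in the same form,
# here with 8 digits after the decimal point.
print(f"{c_lower:.8e}")     # 2.99792458e+08
print(f"{c_lower:.8E}")     # 2.99792458E+08
```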

Binary system

Main article: binary system

A real value can extend to an arbitrary number of digits. Floating point allows only a limited number of digits of a real number to be represented: only the most significant digits (those with the greatest weight) are kept. A real number therefore generally cannot be represented exactly, but only as an approximation whose quality depends on the number of significant digits of the floating-point representation in use. The limitation appears when the number has digits of less weight than those that fit in the significant part. These digits are usually rounded and, if they are very small, truncated. Depending on the application, the lost digits may be negligible, which is why the method is useful despite being a potential source of error.
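
The effect of this limited precision is easy to observe. The sketch below assumes Python's `float` is an IEEE 754 double (53 significant bits), which is the case on common platforms.

```python
from decimal import Decimal

# 0.1 has no finite binary expansion, so the stored double is only the
# nearest representable value. Decimal(float) shows that value exactly.
print(Decimal(0.1))

# Rounding errors in the operands and the sum become visible here.
print(0.1 + 0.2 == 0.3)        # False

# A double has 53 significant bits, so above 2**53 not every integer fits:
# the low-order digit of 2**53 + 1 is lost to rounding.
big = float(2**53)
print(big + 1.0 == big)        # True
```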

In the floating-point binary representation, the most significant bit defines the sign value (0 for positive and 1 for negative). This is followed by a series of bits that define the exponent. The rest of the bits are the significant part.

Because the significant part is generally normalized, its most significant bit is then always 1, so that bit is not stored but implicitly assumed. To perform calculations, the implicit bit is made explicit before operating on the floating-point number. There are other cases in which the most significant bit is not 1, such as the representation of the number zero, or numbers so small in magnitude that they fall below the range of the exponent. In the latter case the significant digits are stored in denormalized form, so that precision is lost progressively rather than all at once. The most significant bit is then zero, and the number gradually loses precision as calculations make it smaller in magnitude, until it finally becomes zero.
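
This gradual loss of precision can be observed with ordinary Python floats. The sketch assumes that `float` is an IEEE 754 double and that the platform does not flush denormalized numbers to zero, which holds for standard CPython builds on common hardware.

```python
import sys

smallest_normal = sys.float_info.min      # 2**-1022, implicit bit is 1
print(smallest_normal)                    # 2.2250738585072014e-308

# Below the smallest normal value the result does not jump to zero;
# it becomes a denormalized (subnormal) number with fewer significant bits.
x = smallest_normal
steps = 0
while x > 0.0:
    x /= 2.0
    steps += 1
print(steps)  # 53: 52 subnormal halvings, then the value rounds to zero
```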

Example

The following examples describe floating point notation. Below are three numbers in a 16-bit floating point representation. The leftmost bit is the sign, then there are 6 bits for the exponent, followed by 9 bits for the significant part:

$\begin{array}{cccl} \text{Sign} & \text{Exponent} & \text{Significant part} & \\ 1 & 100011 & 011101100 & = \mathrm{0xC6EC} \\ 0 & 011011 & 111001101 & = \mathrm{0x37CD} \\ 0 & 101001 & 000000001 & = \mathrm{0x5201} \end{array}$
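
The hexadecimal values follow directly from concatenating the three fields. A minimal Python sketch of that packing, with the hypothetical helper name `pack16` and the 1 + 6 + 9 bit layout of this example (not an IEEE 754 format):

```python
def pack16(sign: int, exponent: int, fraction: int) -> int:
    """Pack the three fields into one 16-bit word: 1 + 6 + 9 bits."""
    assert sign in (0, 1)
    assert 0 <= exponent < 2**6
    assert 0 <= fraction < 2**9
    return (sign << 15) | (exponent << 9) | fraction


examples = [
    (1, 0b100011, 0b011101100),
    (0, 0b011011, 0b111001101),
    (0, 0b101001, 0b000000001),
]
for fields in examples:
    print(f"0x{pack16(*fields):04X}")
# 0xC6EC
# 0x37CD
# 0x5201
```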

Sign

The leftmost bit expresses the sign, with 0 indicating that the number is positive and 1 indicating that the number is negative. In the examples above, the first number is negative and the other two are positive.

Exponent

The exponent indicates how far, and in which direction, to shift the binary point of the significant part. In this case the exponent occupies 6 bits, which can represent 64 different values: a binary (base 2) exponent ranging from -31 to +32, for powers of 2 between 2⁻³¹ and 2³². The binary point can thus be shifted up to 31 binary digits to the left (giving a number very close to zero) or up to 32 binary digits to the right (giving a very large number).

However, the exponent is not stored as a signed binary number (from -31 to +32) but as an equivalent non-negative integer from 0 to 63. To do this, a bias is added to the exponent. For this 6-bit exponent (64 values) the bias is 31 (half of the 64 representable values, minus 1). The exponent range from -31 to +32 is therefore stored internally as a number between 0 and 63: stored values 31 to 63 represent the exponents 0 to 32, and stored values 0 to 30 represent the exponents -31 to -1, respectively:

-31          0               32    Actual binary exponent
 +-----------+---------------+
 0          31               63    Stored value of the 6-bit exponent
                                   (the binary exponent plus a bias of 31)
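
The biased encoding amounts to one addition or subtraction. A small Python sketch for the 6-bit exponent of this example, where the names `encode_exponent` and `decode_exponent` are illustrative:

```python
EXPONENT_BITS = 6
BIAS = 2 ** (EXPONENT_BITS - 1) - 1   # 31 for a 6-bit exponent


def encode_exponent(e: int) -> int:
    """Map a true exponent in [-31, 32] to the stored value in [0, 63]."""
    if not -BIAS <= e <= BIAS + 1:
        raise ValueError("exponent out of range for 6 bits")
    return e + BIAS


def decode_exponent(stored: int) -> int:
    """Map a stored value in [0, 63] back to the true exponent."""
    return stored - BIAS


print(encode_exponent(4), decode_exponent(0b100011))   # 35 4
```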

Significant part

The significant part, in this case, is made up of 10 significant binary digits, of which 9 are explicit digits and 1 implicit that is not stored.

This significant part is generally normalized, so its most significant bit is always 1. Because, with certain exceptions, that bit is always 1, it is not stored, which saves space and gains one bit of precision; it is therefore called the hidden or implicit bit. Before performing calculations, however, the implicit bit must be made explicit.

Represented real numbers

The three 16-bit words in the example above represent, respectively, the following real numbers (expressed in binary). Red marks the most significant bit, which is implicit when stored (see the description of the significant part above) but becomes explicit when calculations are performed or the value is displayed:

$-{\color{red}1}.011101100 \times 2^{4} = -{\color{red}1}0111.01100$

(The binary point moves 4 binary positions (bits) to the right)

${\color{red}1}.111001101 \times 2^{-4} = 0.000{\color{red}1}111001101$

(The binary point moves 4 binary positions to the left)

${\color{red}1}.000000001 \times 2^{10} = {\color{red}1}0000000010.0$

(The binary point moves 10 binary positions to the right)

(with all values expressed in binary representation)
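
To see the same three values in decimal, the Python sketch below decodes a 16-bit word using the 1 + 6 + 9 bit layout and the bias of 31 from this example. The helper name `decode16` is hypothetical, and the sketch handles only normalized numbers.

```python
def decode16(word: int) -> float:
    """Decode a 16-bit word laid out as 1 sign, 6 exponent, 9 fraction bits.

    Assumes a normalized number: the hidden leading 1 is always restored.
    Zero and denormalized values are not handled in this sketch.
    """
    sign = (word >> 15) & 0x1
    exponent = ((word >> 9) & 0x3F) - 31          # remove the bias of 31
    fraction = word & 0x1FF                       # 9 stored bits
    significand = 1.0 + fraction / 2**9           # make the implicit bit explicit
    value = significand * 2.0**exponent
    return -value if sign else value


for word in (0xC6EC, 0x37CD, 0x5201):
    print(f"0x{word:04X} -> {decode16(word)}")
# 0xC6EC -> -23.375
# 0x37CD -> 0.1187744140625
# 0x5201 -> 1026.0
```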

Comparison with Fixed Point

For a given size in bytes, floating-point notation can be slower to process and less precise than fixed-point notation, because the exponent must be stored in addition to the significant part, leaving fewer bits for the digits themselves. In exchange, it allows a much greater range of representable numbers.
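
The trade-off can be illustrated with a 16-bit fixed-point layout. The sketch below assumes a hypothetical Q8.8 format (8 integer bits including the sign, 8 fractional bits), chosen only for comparison with the 16-bit floating-point example described earlier.

```python
FRACTION_BITS = 8            # hypothetical Q8.8 layout: 8 integer, 8 fractional bits
SCALE = 1 << FRACTION_BITS   # 256


def to_fixed(x: float) -> int:
    """Quantize x to a signed 16-bit fixed-point integer (Q8.8)."""
    q = round(x * SCALE)
    if not -2**15 <= q < 2**15:
        raise OverflowError(f"{x} does not fit in Q8.8")
    return q


def from_fixed(q: int) -> float:
    """Convert a Q8.8 integer back to a real value."""
    return q / SCALE


print(from_fixed(to_fixed(23.375)))   # 23.375, resolution is a uniform 1/256
print(from_fixed(2**15 - 1))          # 127.99609375, the largest Q8.8 value

try:
    to_fixed(1026.0)                  # fits the 16-bit floating-point example
except OverflowError as err:
    print(err)
```

With the same 16 bits, the fixed-point layout keeps a constant resolution of 1/256 but cannot reach 1026, a value the floating-point layout represents without difficulty.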

Numeric Coprocessor and Floating Point Libraries

Because arithmetic operations on floating-point numbers are complex, many systems dedicate a special processor to them, called a floating-point unit, or build specialized components into the main processor. Where no such hardware exists, or where it cannot perform certain operations, software libraries are used to perform the calculations.

Floating point formats

Binary floating-point formats from the IEEE 754 (2008) standard.

	Representation (number of bits)				Features
Type	Sign	Exponent	Significant part	Total	Size	Exponent bias	Precision bits (significant bits)	Significant decimal digits	Range
Half	1	5	10	16	2 bytes (16 bits)	15	11	3.31	? .. 65504
Single	1	8	23	32	4 bytes (32 bits)	127	24	7.22	1.175494351e-38 .. 3.402823466e+38
Double	1	11	52	64	8 bytes (64 bits)	1023	53	15.95	2.2250738585072014e-308 .. 1.7976931348623158e+308
Quadruple (quad)	1	15	112	128	16 bytes (128 bits)	16383	113	34.02	?
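
Several columns of this table can be derived from the field widths alone. The Python sketch below recomputes the bias, the precision in bits and the equivalent number of decimal digits; the dictionary `FORMATS` simply restates the exponent and significant-part widths from the table.

```python
import math

# (exponent bits, stored significand bits) per format, as in the table above.
FORMATS = {
    "half": (5, 10),
    "single": (8, 23),
    "double": (11, 52),
    "quadruple": (15, 112),
}

for name, (exp_bits, frac_bits) in FORMATS.items():
    bias = 2 ** (exp_bits - 1) - 1            # e.g. 127 for 8 exponent bits
    precision = frac_bits + 1                 # stored bits plus the implicit bit
    digits = precision * math.log10(2)        # equivalent decimal digits
    print(f"{name:10s} bias={bias:5d} precision={precision:3d} digits={digits:.2f}")
```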
