Programming

What's the difference between single precision and double precision floating point operations?

lottogame 2020. 6. 17. 20:27

What is the difference between single precision floating point arithmetic and double precision floating point arithmetic?

I'm especially interested in practical terms relating to video game consoles. For example, the Nintendo 64 has a 64-bit processor; if so, was it capable of double precision floating point operations? Can the PS3 and Xbox 360 pull off double precision floating point operations or only single precision, and in general use are the double precision capabilities made use of (if they exist)?


Note: the Nintendo 64 does have a 64-bit processor, however:

Many games took advantage of the chip's 32-bit processing mode, because the greater data precision available with 64-bit data types is generally not needed, and because processing 64-bit data uses twice as much RAM, cache, and bandwidth, reducing overall system performance.

From Webopedia:

The term double precision is something of a misnomer because the precision is not really double.
The word double derives from the fact that a double precision number uses twice as many bits as a regular floating point number.
For example, if a single precision number requires 32 bits, its double precision counterpart will be 64 bits long.

The extra bits increase not only the precision but also the range of magnitudes that can be represented.
The exact amount by which the precision and range of magnitudes are increased depends on the format the program uses to represent floating point values.
Most computers use a standard format known as the IEEE floating point format.
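The difference in range is easy to see in practice. The sketch below assumes CPython, where the built-in `float` is a 64-bit IEEE 754 double; the single precision maximum is computed by hand from the format's largest exponent and mantissa:

```python
import sys

# Python's float is a 64-bit IEEE 754 double, so its max shows the double range:
print(sys.float_info.max)            # ≈ 1.7976931348623157e+308

# The largest finite single precision value, (2 - 2**-23) * 2**127:
single_max = (2 - 2**-23) * 2.0**127
print(single_max)                    # ≈ 3.4028235e+38
```

So doubles extend the representable magnitude from roughly 10³⁸ to roughly 10³⁰⁸, not just the precision.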

From the IEEE Standard for Floating Point Arithmetic:

Single Precision

The IEEE single precision floating point standard representation requires a 32-bit word, which may be represented as numbered from 0 to 31, left to right.

  • The first bit is the sign bit, S,
  • the next eight bits are the exponent bits, 'E', and
  • the final 23 bits are the fraction 'F':

    S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
    0 1      8 9                    31
    

The value V represented by the word may be determined as follows:

  • If E=255 and F is nonzero, then V=NaN ("Not a number")
  • If E=255 and F is zero and S is 1, then V=-Infinity
  • If E=255 and F is zero and S is 0, then V=Infinity
  • If 0<E<255 then V=(-1)**S * 2 ** (E-127) * (1.F) where "1.F" is intended to represent the binary number created by prefixing F with an implicit leading 1 and a binary point.
  • If E=0 and F is nonzero, then V=(-1)**S * 2 ** (-126) * (0.F) These are "unnormalized" values.
  • If E=0 and F is zero and S is 1, then V=-0
  • If E=0 and F is zero and S is 0, then V=0

In particular,

0 00000000 00000000000000000000000 = 0
1 00000000 00000000000000000000000 = -0

0 11111111 00000000000000000000000 = Infinity
1 11111111 00000000000000000000000 = -Infinity

0 11111111 00000100000000000000000 = NaN
1 11111111 00100010001001010101010 = NaN

0 10000000 00000000000000000000000 = +1 * 2**(128-127) * 1.0 = 2
0 10000001 10100000000000000000000 = +1 * 2**(129-127) * 1.101 = 6.5
1 10000001 10100000000000000000000 = -1 * 2**(129-127) * 1.101 = -6.5

0 00000001 00000000000000000000000 = +1 * 2**(1-127) * 1.0 = 2**(-126)
0 00000000 10000000000000000000000 = +1 * 2**(-126) * 0.1 = 2**(-127) 
0 00000000 00000000000000000000001 = +1 * 2**(-126) * 
                                     0.00000000000000000000001 = 
                                     2**(-149)  (Smallest positive value)
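The bit patterns above can be reproduced with Python's `struct` module, which exposes the IEEE 754 single precision encoding directly (a small sketch; the helper name `float32_bits` is mine):

```python
import struct

def float32_bits(x):
    """Return the IEEE 754 single precision encoding of x as 'S EEEEEEEE F...'."""
    [n] = struct.unpack(">I", struct.pack(">f", x))  # reinterpret the 4 bytes as an int
    s = (n >> 31) & 0x1        # 1 sign bit
    e = (n >> 23) & 0xFF       # 8 exponent bits
    f = n & 0x7FFFFF           # 23 fraction bits
    return f"{s:01b} {e:08b} {f:023b}"

print(float32_bits(6.5))    # 0 10000001 10100000000000000000000
print(float32_bits(-6.5))   # 1 10000001 10100000000000000000000
print(float32_bits(2.0))    # 0 10000000 00000000000000000000000
```

For 6.5 the output matches the table: S=0, E=129 (so the exponent is 129-127=2), and 1.F = 1.101 in binary, giving 1.101 × 2² = 6.5.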

Double Precision

The IEEE double precision floating point standard representation requires a 64-bit word, which may be represented as numbered from 0 to 63, left to right.

  • The first bit is the sign bit, S,
  • the next eleven bits are the exponent bits, 'E', and
  • the final 52 bits are the fraction 'F':

    S EEEEEEEEEEE FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
    0 1        11 12                                                63
    

The value V represented by the word may be determined as follows:

  • If E=2047 and F is nonzero, then V=NaN ("Not a number")
  • If E=2047 and F is zero and S is 1, then V=-Infinity
  • If E=2047 and F is zero and S is 0, then V=Infinity
  • If 0<E<2047 then V=(-1)**S * 2 ** (E-1023) * (1.F) where "1.F" is intended to represent the binary number created by prefixing F with an implicit leading 1 and a binary point.
  • If E=0 and F is nonzero, then V=(-1)**S * 2 ** (-1022) * (0.F) These are "unnormalized" values.
  • If E=0 and F is zero and S is 1, then V=-0
  • If E=0 and F is zero and S is 0, then V=0
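These rules can be checked in Python, whose `float` is a 64-bit IEEE 754 double (a sketch; the helper name `float64_fields` is mine):

```python
import struct

def float64_fields(x):
    """Extract the (S, E, F) fields of a double's IEEE 754 encoding."""
    [n] = struct.unpack(">Q", struct.pack(">d", x))  # reinterpret the 8 bytes as an int
    s = (n >> 63) & 0x1              # 1 sign bit
    e = (n >> 52) & 0x7FF            # 11 exponent bits
    f = n & ((1 << 52) - 1)          # 52 fraction bits
    return s, e, f

print(float64_fields(1.0))   # (0, 1023, 0): V = (-1)**0 * 2**(1023-1023) * 1.0 = 1.0
print(float64_fields(-2.0))  # (1, 1024, 0): V = (-1)**1 * 2**(1024-1023) * 1.0 = -2.0
```

Note the exponent bias is 1023 here rather than the 127 of single precision, matching the 0<E<2047 rule above.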

Reference:
ANSI/IEEE Standard 754-1985,
Standard for Binary Floating Point Arithmetic.


I read a lot of answers but none seems to correctly explain where the word double comes from. I remember a very good explanation given by a University professor I had some years ago.

Recalling the style of VonC's answer, a single precision floating point representation uses a 32-bit word.

  • 1 bit for the sign, S
  • 8 bits for the exponent, 'E'
  • 24 bits for the fraction, also called mantissa, or coefficient (even though just 23 are represented). Let's call it 'M' (for mantissa, I prefer this name as "fraction" can be misunderstood).

Representation:

          S  EEEEEEEE   MMMMMMMMMMMMMMMMMMMMMMM
bits:    31 30      23 22                     0

(Just to point out, the sign bit is the last, not the first.)

A double precision floating point representation uses a 64-bit word.

  • 1 bit for the sign, S
  • 11 bits for the exponent, 'E'
  • 53 bits for the fraction / mantissa / coefficient (even though only 52 are represented), 'M'

Representation:

           S  EEEEEEEEEEE   MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
bits:     63 62         52 51                                                  0

As you may notice, I wrote that the mantissa has, in both types, one bit more of information compared to its representation. In fact, the mantissa is a number represented without all its non-significant leading zeros. For example,

  • 0.000124 becomes 0.124 × 10⁻³
  • 237.141 becomes 0.237141 × 10³

This means that the mantissa will always be in the form

0.α₁α₂...αₜ × βᵖ

where β is the base of representation. But since the fraction is a binary number, α₁ will always be equal to 1, thus the fraction can be rewritten as 1.α₂α₃...αₜ₊₁ × 2ᵖ and the initial 1 can be implicitly assumed, making room for an extra bit (αₜ₊₁).
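That implicit leading 1 means a single precision value really carries 24 significant bits, which can be demonstrated numerically (a sketch; the helper `to_f32` simply round-trips a value through a 32-bit encoding):

```python
import struct

def to_f32(x):
    """Round x to the nearest single precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# 2**24 fits exactly in the 23 stored + 1 implicit mantissa bits...
print(to_f32(2**24) == 2**24)        # True
# ...but 2**24 + 1 would need a 25th bit, so it rounds back down:
print(to_f32(2**24 + 1) == 2**24)    # True: the +1 is lost
```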

Now, it's obviously true that the double of 32 is 64, but that's not where the word comes from.

The precision indicates the number of decimal digits that are correct, i.e. without any kind of representation error or approximation. In other words, it indicates how many decimal digits one can safely use.

With that said, it's easy to estimate the number of decimal digits which can be safely used:

  • single precision: log₁₀(2²⁴), which is about 7~8 decimal digits
  • double precision: log₁₀(2⁵³), which is about 15~16 decimal digits
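These two estimates are quick to verify with the standard library:

```python
import math

# 24 and 53 significand bits (the stored fraction plus the implicit leading 1):
print(math.log10(2**24))  # ≈ 7.22 decimal digits
print(math.log10(2**53))  # ≈ 15.95 decimal digits
```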

Okay, the basic difference at the machine level is that double precision uses twice as many bits as single. In the usual implementation, that's 32 bits for single, 64 bits for double.

But what does that mean? If we assume the IEEE standard, then a single precision number has about 23 bits of the mantissa, and a maximum exponent of about 38; a double precision has 52 bits for the mantissa, and a maximum exponent of about 308.

The details are at Wikipedia, as usual.


To add to all the wonderful answers here

First of all, float and double are both used for representing fractional numbers. So, the difference between the two stems from how much precision they can store numbers with.

For example, say I have to store 123.456789. One type may be able to store only 123.4567 while the other may be able to store the exact 123.456789.

So, basically we want to know how accurately a number can be stored, and that is what we call precision.

Quoting @Alessandro here

The precision indicates the number of decimal digits that are correct, i.e. without any kind of representation error or approximation. In other words, it indicates how many decimal digits one can safely use.

Float can accurately store about 7-8 significant digits, while double can accurately store about 15-16 significant digits.

So, double can store about twice as many significant digits as float. That is why double is called double the float.
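The 123.456789 example above can be demonstrated directly. This sketch assumes CPython, where `float` is a 64-bit double, and uses `struct` to round-trip the value through a 32-bit single precision encoding:

```python
import struct

x = 123.456789                                             # fits comfortably in a double
as_single = struct.unpack(">f", struct.pack(">f", x))[0]   # round-trip through 32 bits

print(x)               # 123.456789  (the double keeps all 9 significant digits)
print(as_single)       # ≈ 123.4567871: only about 7 significant digits survive
print(as_single == x)  # False
```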


As to the question "Can the PS3 and Xbox 360 pull off double precision floating point operations or only single precision, and in general use are the double precision capabilities made use of (if they exist)?":

I believe that both platforms are incapable of double precision floating point. The original Cell processor only had 32-bit floats, same with the ATI hardware the Xbox 360 is based on (R600). The Cell got double precision floating point support later on, but I'm pretty sure the PS3 doesn't use that chippery.


Basically single precision floating point arithmetic deals with 32 bit floating point numbers whereas double precision deals with 64 bit.

The greater number of bits in double precision increases the maximum value that can be stored as well as the precision (i.e. the number of significant digits).


Double precision means the number takes twice the word length to store. On a 32-bit processor, the words are all 32 bits, so doubles are 64 bits. What this means in terms of performance is that operations on double precision numbers take a little longer to execute. So you get a better range, but there is a small hit on performance. This hit is mitigated a little by hardware floating point units, but it's still there.

The N64 used a MIPS R4300i-based NEC VR4300 which is a 64 bit processor, but the processor communicates with the rest of the system over a 32-bit wide bus. So, most developers used 32 bit numbers because they are faster, and most games at the time did not need the additional precision (so they used floats not doubles).

All three systems can do single and double precision floating point operations, but they might not because of performance. (Although pretty much everything after the N64 used a 32-bit bus, so...)


According to IEEE 754:

  • Standard for floating point storage
  • 32 and 64 bit standards (single precision and double precision)
  • 8 and 11 bit exponents respectively
  • Extended formats (both mantissa and exponent) for intermediate results



Everyone has explained this in great detail and there's nothing I could add. Still, I would like to explain it in layman's terms, in plain English:

1.9 is less precise than 1.99
1.99 is less precise than 1.999
1.999 is less precise than 1.9999

.....

A variable able to store or represent "1.9" provides less precision than one able to hold or represent 1.9999. These fractions can amount to a huge difference in large calculations.


A single precision number uses 32 bits, with the MSB being the sign bit, whereas a double precision number uses 64 bits, with the MSB being the sign bit.

Single precision

SEEEEEEEEFFFFFFFFFFFFFFFFFFFFFFF.(SIGN+EXPONENT+SIGNIFICAND)

Double precision:

SEEEEEEEEEEEFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF.(SIGN+EXPONENT+SIGNIFICAND)

Reference URL: https://stackoverflow.com/questions/801117/whats-the-difference-between-a-single-precision-and-double-precision-floating-p
