1 How score is stored: the 64-bit double floating-point format
Double-precision binary floating-point is a commonly used format on PCs, due to its wider range over single-precision floating point, in spite of its performance and bandwidth cost. It is commonly known simply as double. The IEEE 754 standard specifies a binary64 as having:
- Sign bit: 1 bit
- Exponent: 11 bits
- Significand precision: 53 bits (52 explicitly stored)
The sign bit determines the sign of the number (including when this number is zero, which is signed).
The exponent field is an 11-bit unsigned integer from 0 to 2047, in biased form: an exponent value of 1023 represents the actual zero. Exponents range from −1022 to +1023 because exponents of −1023 (all 0s) and +1024 (all 1s) are reserved for special numbers.
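The field layout described above can be checked directly in Go. The following is a minimal sketch; the helper name decodeFloat64 is made up for illustration and is not part of any library.

```go
package main

import (
	"fmt"
	"math"
)

// decodeFloat64 (hypothetical helper) splits a float64 into the three
// IEEE 754 binary64 fields: 1 sign bit, 11 exponent bits, 52 fraction bits.
func decodeFloat64(f float64) (sign, biasedExp, fraction uint64) {
	bits := math.Float64bits(f)
	sign = bits >> 63                // top bit
	biasedExp = (bits >> 52) & 0x7FF // next 11 bits, stored with a bias of 1023
	fraction = bits & (1<<52 - 1)    // low 52 bits (the implicit leading 1 is not stored)
	return
}

func main() {
	for _, f := range []float64{1.0, -2.0, 0.5, float64(1 << 52)} {
		s, e, frac := decodeFloat64(f)
		fmt.Printf("%g: sign=%d biased=%d unbiased=%d fraction=%052b\n",
			f, s, e, int(e)-1023, frac)
	}
}
```

For 1.0 the biased exponent is 1023, which is the stored form of an actual exponent of zero.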
The 53-bit significand precision gives from 15 to 17 significant decimal digits precision (2^−53 ≈ 1.11 × 10^−16). If a decimal string with at most 15 significant digits is converted to IEEE 754 double-precision representation, and then converted back to a decimal string with the same number of digits, the final result should match the original string. If an IEEE 754 double-precision number is converted to a decimal string with at least 17 significant digits, and then converted back to double-precision representation, the final result must match the original number.[1]
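Both round-trip guarantees can be tried out with the standard strconv package. This is only a sketch; the helper names roundTrip17 and roundTrip15 are invented here for illustration.

```go
package main

import (
	"fmt"
	"strconv"
)

// roundTrip17 (hypothetical helper) formats f with 17 significant digits
// and parses the string back into a float64.
func roundTrip17(f float64) (string, float64) {
	s := strconv.FormatFloat(f, 'g', 17, 64)
	back, err := strconv.ParseFloat(s, 64)
	if err != nil {
		panic(err)
	}
	return s, back
}

// roundTrip15 (hypothetical helper) parses a decimal string into a float64
// and formats it again with 15 significant digits.
func roundTrip15(s string) string {
	f, err := strconv.ParseFloat(s, 64)
	if err != nil {
		panic(err)
	}
	return strconv.FormatFloat(f, 'g', 15, 64)
}

func main() {
	s, back := roundTrip17(0.1)
	fmt.Println(s, back == 0.1) // the parsed value equals the original double

	in := "0.123456789012345" // 15 significant digits
	fmt.Println(roundTrip15(in) == in)
}
```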
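The format is written with the significand having an implicit integer bit of value 1 (except for special values, which use the reserved exponent encodings). With the 52 bits of the fraction (F) significand appearing in the memory format, the total precision is therefore 53 bits (approximately 16 decimal digits, 53 × log10(2) ≈ 15.955). The bits are laid out as follows: bit 63 is the sign, bits 62 to 52 hold the 11-bit biased exponent e, and bits 51 to 0 hold the 52-bit fraction. For a normal number the value is (−1)^sign × (1.fraction)_2 × 2^(e − 1023).
From the formula above, an integer (long) converted to a double keeps at most 53 significant bits (the 52 stored fraction bits plus the implicit leading 1); an integer that needs more bits than that can lose precision.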
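A quick way to see where integers stop being exact is to convert an int64 to float64 and back. The sketch below assumes a helper called exactInFloat64, a name chosen only for this illustration.

```go
package main

import "fmt"

// exactInFloat64 (hypothetical helper) reports whether n survives the
// conversion int64 -> float64 -> int64 unchanged.
func exactInFloat64(n int64) bool {
	f := float64(n)
	// float64(n) can round up to 2^63, which does not fit in an int64,
	// so reject that case before converting back.
	if f >= float64(1<<63) {
		return false
	}
	return int64(f) == n
}

func main() {
	limit := int64(1) << 53
	for _, n := range []int64{limit - 1, limit, limit + 1, limit + 2} {
		fmt.Println(n, exactInFloat64(n))
	}
}
```

Every integer up to 2^53 passes the check; 2^53 + 1 is the first one that fails, while 2^53 + 2 passes again because it is even.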
2 Examples
fmt.Printf("%b\n", math.Float64bits(1 << 52))
fmt.Printf("%b\n", math.Float64bits(1 << 51))
fmt.Printf("%b\n", math.Float64bits(1 << 50))
fmt.Printf("%b\n", math.Float64bits(1 << 51 + 2))
fmt.Printf("%b\n", math.Float64bits(1 << 51 + 3))
// OK
100001100100000000000000000000000000000000000000000000000000100
100001100100000000000000000000000000000000000000000000000000110
fmt.Printf("%b\n", math.Float64bits(1 << 52 + 2))
fmt.Printf("%b\n", math.Float64bits(1 << 52 + 3))
// OK
100001100110000000000000000000000000000000000000000000000000010
100001100110000000000000000000000000000000000000000000000000011
fmt.Printf("%b\n", math.Float64bits(1 << 53 + 2))
fmt.Printf("%b\n", math.Float64bits(1 << 53 + 3))
// The two outputs still differ, but 1<<53 + 3 is not representable and has been rounded to 1<<53 + 4
100001101000000000000000000000000000000000000000000000000000001
100001101000000000000000000000000000000000000000000000000000010
fmt.Printf("%b\n", math.Float64bits(1 << 54 + 6))
fmt.Printf("%b\n", math.Float64bits(1 << 54 + 7))
// Precision lost: both values round to 1<<54 + 8
100001101010000000000000000000000000000000000000000000000000010
100001101010000000000000000000000000000000000000000000000000010
fmt.Printf("%b\n", math.Float64bits(1 << 55 + 2))
fmt.Printf("%b\n", math.Float64bits(1 << 55 + 3))
// Precision lost: both values round to 1<<55
100001101100000000000000000000000000000000000000000000000000000
100001101100000000000000000000000000000000000000000000000000000
3 Comparison of precision loss beyond 53 and 54 bits
4 Precision is exact at 52 bits