18 浮点数运算

Floating Point Arithmetic

本节译者：洪世哲

本节校审：张梓源（ChatGPT辅助）

Computers approximate real values in \(\mathbb{R}\) using a fixed number of bits. This chapter explains how this is done and why it is important for writing robust Stan (and other numerical) programs. The subfield of computer science devoted to studying how real arithmetic works on computers is called numerical analysis.

计算机会使用固定的位数来近似 \(\mathbb{R}\) 中的实数值。本章将解释这是如何实现的，以及为什么编写稳健的 Stan（和其他数值）程序非常重要。研究计算机上实数算术工作原理的计算机科学子领域被称为数值分析。

18.1 Floating-point representations

浮点数表示

Stan’s arithmetic is implemented using double-precision arithmetic. The behavior of most¹ modern computers follows the floating-point arithmetic, IEEE Standard for Floating-Point Arithmetic (IEEE 754).

Stan 的算术运算是使用双精度浮点数实现的。大多数现代计算机的算术规则²遵循 IEEE 浮点数算术标准（IEEE 754）。

Finite values

有限值

The double-precision component of the IEEE 754 standard specifies the representation of real values using a fixed pattern of 64 bits (8 bytes). All values are represented in base two (i.e., binary). The representation is divided into two signed components:

IEEE 754标准中的双精度部分指定使用64位（8字节）的固定模式表示实数值。所有值均采用以 2 为底的形式（即二进制）表示。其表示由两个带符号的部分组成：

significand (53 bits): base value representing significant digits
有效数字 （53位）：表示有效数字的基值
exponent (11 bits): power of two multiplied by the base
指数（11位）：表示要乘的以2为底的幂

The value of a finite floating point number is

一个有限浮点数的值为

\[ v = (-1)^s \times c \, 2^q \]

Normality

正规性

A normal floating-point value does not use any leading zeros in its significand; subnormal numbers may use leading zeros. Not all I/O systems support subnormal numbers.

正规浮点数的有效数字部分不会以零开头；而次正规数第一位可能会使用零。并非所有的输入/输出（I/O）系统都支持次正规数。

Ranges and extreme values

范围和极值

There are some reserved exponent values so that legal exponent values range between\(-(2^{10}) + 2 = -1022\) and \(2^{10} - 1 = 1023\). Legal significand values are between \(-2^{52}\) and \(2^{52} - 1\). Floating point allows the representation of both really big and really small values. Some extreme values are

largest normal finite number: \(\approx 1.8 \times 10^{308}\)
largest subnormal finite number: \(\approx 2.2 \times 10^{308}\)
smallest positive normal number: \(\approx 2.2 \times 10^{-308}\)
smallest positive subnormal number: \(\approx 4.9 \times 10^{-324}\)

指数的某些取值被保留，因此合法指数的范围是 \(-(2^{10}) + 2 = -1022\) 和 \(2^{10} - 1 = 1023\) 之间。合法的有效数值介于 \(-2^{52}\) 和 \(2^{52} - 1\) 之间。浮点数允许表示非常大和非常小的值。一些极端值包括：

最大的正规有限数：\(\approx 1.8 \times 10^{308}\)
最大的次正规有限数：\(\approx 2.2 \times 10^{308}\)
最小的正规有限正数：\(\approx 2.2 \times 10^{-308}\)
最小的次正规有限正数：\(\approx 4.9 \times 10^{-324}\)

Signed zero

符号零

Because of the sign bit, there are two ways to represent zero, often called “positive zero” and “negative zero”. This distinction is irrelevant in Stan (as it is in R), because the two values are equal (i.e., 0 == -0 evaluates to true).

由于存在符号位，零有两种表示方式，分别为“正零”和“负零”。这种区别在 Stan 中是无关紧要的（就像在 R 中一样），因为这两个值是相等的(即 0 == -0 被认为是正确的)。

Not-a-number values

非数字值

A specially chosen bit pattern is used for the not-a-number value (often written as NaN in programming language output, including Stan’s).

有一种特别的位模式被用于表示非数字值（在编程语言的输出中通常写作 NaN，包括 Stan）。

Stan provides a value function not_a_number() that returns this special not-a-number value. It is meant to represent error conditions, not missing values. Usually when not-a-number is an argument to a function, the result will not-a-number if an exception (a rejection in Stan) is not raised.

Stan 提供了一个值函数 not_a_number()，它返回这个特殊的非数字值。它用于表示错误条件而不是缺失值。通常，当非数字是一个函数的参数时，如果没有输出异常（在 Stan 中是拒绝输出），则结果将是非数字。

Stan also provides a test function is_nan(x) that returns 1 if x is not-a-number and 0 otherwise.

Stan 还提供了一个测试函数 is_nan(x)，如果 x 是非数字，则返回1，否则返回0。

Not-a-number values propagate under almost all mathematical operations. For example, all of the built-in binary arithmetic operations (addition, subtraction, multiplication, division, negation) return not-a-number if any of their arguments are not-a-number. The built-in functions such as log and exp have the same behavior, propagating not-a-number values.

在几乎所有数学运算中，非数字值都会传播。例如，所有的内置二元算术运算（加法、减法、乘法、除法、取反运算）如果它们的任何一个参数是非数字，则返回非数字。内置函数如 log 和 exp 也具有该传播非数字值的行为。

Most of Stan’s built-in functions will throw exceptions (i.e., reject) when any of their arguments is not-a-number.

Stan 的大多数内置函数在它们的任何一个参数是非数字时将输出异常（即拒绝）。

Comparisons with not-a-number always return false, up to and including comparison with itself. That is, not_a_number() == not_a_number() somewhat confusingly returns false. That is why there is a built-in is_nan() function in Stan (and in C++). The only exception is negation, which remains coherent. This means not_a_number() != not_a_number() returns true.

与非数字的比较会始终返回 false，包括与自身的比较。也就是说，not_a_number() == not_a_number() 会返回 false，尽管这会有些令人困惑。这也是为什么 Stan（以及 C++）中会提供内置函数 is_nan()。唯一的例外是取反运算，它仍然连贯。这意味着 not_a_number() != not_a_number() 返回 true。

Undefined operations often return not-a-number values. For example, sqrt(-1) will evaluate to not-a-number.

未定义的操作通常返回非数字值。例如，sqrt(-1) 会输出非数字。

Positive and negative infinity

正无穷和负无穷

There are also two special values representing positive infinity (\(\infty)\) and negative infinity (\(-\infty\)). These are not as pathological as not-a-number, but are often used to represent error conditions such as overflow and underflow. For example, rather than raising an error or returning not-a-number, log(0) evaluates to negative infinity. Exponentiating negative infinity leads back to zero, so that 0 == exp(log(0)). Nevertheless, this should not be done in Stan because the chain rule used to calculate the derivatives will attempt illegal operations and return not-a-number.

还有两个特殊值用于表示正无穷（\(\infty\)）和负无穷（\(-\infty\)）。它们不像非数字那样“病态”，但仍常用于表示如上溢或下溢等错误情形。例如，log(0) 不会引发错误或返回非数字，而是求值为负无穷。对负无穷进行指数化将返回零，因此会有 0 == exp(log(0))。然而，在 Stan 中不应该这样做，因为用于计算导数的链式法则将尝试非法操作并返回非数字。

There are value functions positive_infinity() and negative_infinity() as well as a test function is_inf().

Stan 中有 positive_infinity() 和 negative_infinity() 两个值函数，以及一个测试函数 is_inf()。

Positive and negative infinity have the expected comparison behavior, so that negative_infinty() < 0 evaluates to true (represented with 1 in Stan). Also, negating positive infinity leads to negative infinity and vice-versa.

正无穷和负无穷具有符合预期的比较行为，因此对 negative_infinty() < 0 求值的结果为 true（在 Stan 中用1表示）。此外，对正无穷取反会得到负无穷，反之亦然。

Positive infinity added to either itself or a finite value produces positive infinity. Negative infinity behaves the same way. However, attempts to subtract positive infinity from itself produce not-a-number, not zero. Similarly, attempts to divide infinite values results in a not-a-number value.

正无穷加上它自身或有限值会得到正无穷。负无穷的行为类似。然而，试图将正无穷减去其自身将得到非数字，而不是零。类似地，对无穷大值进行除法操作会得到非数字值。

18.2 Literals: decimal and scientific notation

字面量：十进制和科学计数法

In programming languages such as Stan, numbers may be represented in standard decimal (base 10) notation. For example, 2.39 or -1567846.276452. Remember there is no point in writing more than 16 significant digits as they cannot be represented. A number may be coded in Stan using scientific notation, which consists of a signed decimal representation of a base and a signed integer decimal exponent. For example, 36.29e-3 represents the number \(36.29 \times 10^{-3}\), which is the same number as is represented by 0.03629.

在像 Stan 这样的编程语言中，数字可以用标准的十进制（基数为10）表示法来表示。例如，2.39 或 -1567846.276452。请注意，超过 16 位的有效数字将无法被准确表示，因此书写时无实际意义。可以使用科学计数法来编码 Stan 中的数字，它由带符号的基数的十进制表示和带符号的整数十进制指数组成。例如，36.29e-3表示的数字是 \(36.29 \times 10^{-3}\)，它与 0.03629 表示的数字相同。

18.3 Arithmetic precision

算术精度

The choice of significand provides \(\log_{10} 2^{53} \approx 15.95\) decimal (base 10) digits of arithmetic precision. This is just the precision of the floating-point representation. After several operations are chained together, the realized arithmetic precision is often much lower.

有效数字部分的选择提供了 \(\log_{10} 2^{53} \approx 15.95\) 位十进制（以10为底）数字的算术精度。这仅仅是浮点数表示的精度。当多个运算连接在一起后，实际的算术精度往往会大大降低。

Rounding and probabilities

舍入和概率

In practice, the finite amount of arithmetic precision leads to rounding, whereby a number is represented by the closest floating-point number. For example, with only 16 decimal digits of accuracy,

在实际应用中，有限的算术精度导致了舍入，即通过最接近的浮点数来表示一个数字。例如，当只有16个十进制位的精确度时，

1 + 1e-20 == 1

The closest floating point number to \(1 + 10^{-20}\) turns out to be \(1\) itself. By contrast,

最接近 \(1 + 10^{-20}\) 的浮点数是 \(1\) 本身。相比之下，

0 + 1e-20 == 1e-20

This highlights the fact that precision depends on scale. Even though 1 + 1e-20 == 1, we have 1e-20 + 1e-20 == 2e-20, as expected.

这说明了精度依赖于数值的尺度。尽管1 + 1e-20 == 1，但我们有1e-20 + 1e-20 == 2e-20，正如预期的那样。

Rounding also manifests itself in a lack of transitivity. In particular, it does not usually hold for three floating point numbers \(a, b, c\) that \((a + b) + c = a + (b + c)\).

舍入也表现为缺乏传递性。特别是，对于三个浮点数 \(a, b, c\)，通常不满足 \((a + b) + c = a + (b + c)\)。

In statistical applications, problems often manifest in situations where users expect the usual rules of real-valued arithmetic to hold. Suppose we have a lower triangular matrix \(L\) with strictly positive diagonal, so that it is the Cholesky factor of a positive-definite matrix \(L \, L^{\top}\). In practice, rounding and loss of precision may render the result \(L \, L^{\top}\) neither symmetric nor positive definite.

在统计应用中，当用户假设实数运算的常规规则始终成立时，问题往往就会暴露出来。假设我们有一个具有严格正对角线的下三角矩阵 \(L\)，使得它是正定矩阵 \(L \, L^{\top}\) 的 Cholesky 因子。在实际应用中，舍入和精度损失可能导致结果 \(L \, L^{\top}\) 既不对称也不正定。

In practice, care must be taken to defend against rounding. For example, symmetry may be produced by adding \(L \, L^{\top}\) with its transpose and dividing by two, or by copying the lower triangular portion into the upper portion. Positive definiteness may be maintained by adding a small quantity to the diagonal.

在实践中，应注意避免舍入误差带来的影响。例如，可以通过将 \(L \, L^{\top}\) 与其转置相加并除以二，或者将下三角部分复制到上半部分来产生对称性。正定性则可以通过在对角线上加一个小的数值来保持。

Machine precision and the asymmetry of 0 and 1

机器精度与0、1的不对称性

The smallest number greater than zero is roughly \(0 + 10^{-323}\). The largest number less than one is roughly \(1 - 10^{-15.95}\). The asymmetry is apparent when considering the representation of that largest number smaller than one—the exponent is of no help, and the number is represented as the binary equivalent of \(0.9999999999999999\).

大于零的最小数字大约是 \(0 + 10^{-323}\)。小于1的最大数字大约是 \(1 - 10^{-15.95}\)。当考虑到小于1的最大数字的表示时，不对称性就变得明显了—— 指数没有任何帮助，该数字表示为二进制的等效值 \(0.9999999999999999\)。

For this reason, the machine precision is said to be roughly \(10^{-15.95}\). This constant is available as machine_precision() in Stan.

基于这个原因，机器精度被认为大约是 \(10^{-15.95}\)。在 Stan 中，该常量可以使用 machine_precision() 函数获得。

Complementary and epsilon functions

互补和 epsilon 函数

Special operations are available to mitigate this problem with numbers rounding when they get close to one. For example, consider the operation log(1 + x) for positive x. When x is small (less than \(10^{-16}\) for double-precision floating point), the sum in the argument will round to 1 and the result will round to zero. To allow more granularity, programming languages provide a library function directly implementing \(f(x) = \log (1 + x)\). In Stan (as in C++), this operation is written as log1p(x). Because x itself may be close to zero, the function log1p(x) can take the logarithm of values very close to one, the results of which are close to zero.

有专门的运算可以用来缓解数值接近1时的舍入问题。例如，考虑对正数 x 进行 log(1 + x) 运算。当 x 很小时（对于双精度浮点数小于 \(10^{-16}\)），参数中的求和会四舍五入为1，结果会四舍五入为零。为了提供更高的精度，编程语言提供了一个直接实现 \(f(x) = \log (1 + x)\) 的库函数。在 Stan 中（与 C++ 相同），这个操作被写作 log1p(x)。由于 x 本身可能接近于零，函数 log1p(x) 能够计算非常接近1的数值的对数，其结果接近于零。

Similarly, the complementary cumulative distribution functions (CCDF), defined by \(F^{\complement}_Y(y) = 1 - F_Y(y)\), where \(F_Y\) is the cumulative distribution function (CDF) for the random variable \(Y\). This allows values very close to one to be represented in complementary form.

类似地，互补累积分布函数（CCDF）可以通过以下方式定义： \(F^{\complement}_Y(y) = 1 - F_Y(y)\)，其中 \(F_Y\) 是随机变量 \(Y\) 的累积分布函数（CDF）。这样可以用互补形式来表示接近于1的值。

Catastrophic cancellation

抵消灾难

Another downside to floating point representations is that subtraction of two numbers close to each other results in a loss of precision that depends on how close they are. This is easy to see in practice. Consider

浮点数表示的另一个缺点是：相近数值相减会导致精度显著损失，损失程度取决于它们的接近程度。这一问题在实践中表现得尤为明显，例如以下运算： \[\begin{align*} 1&.23456789012345 \\ - 1&.23456789012344 \\ = 0&.00000000000001 \end{align*}\] We start with fifteen decimal places of accuracy in the arguments and are left with a single decimal place of accuracy in the result.

我们开始时具有15位小数的准确性，结果仅保留了一个有效数字的精度。

Catastrophic cancellation arises in statistical computations whenever we calculate variance for a distribution with small standard deviations relative to its location. When calculating summary statistics, Stan uses Welford’s algorithm for computing variances. This avoids catastrophic cancellation and may also be carried out in a single pass.

在统计计算中，若一个分布具有相对于其位置而言很小的标准差，当我们计算该分布的方差时，会出现抵消灾难。在计算统计量时，Stan 使用 Welford 算法来计算方差。该算法可避免抵消灾难，并且支持单次遍历完成方差计算。

Overflow

上溢

Even though 1e200 may be represented as a double precision floating point value, there is no finite value large enough to represent 1e200 * 1e200. The result of 1e200 * 1e200 is said to overflow. The IEEE 754 standard requires the result to be positive infinity.

即使 1e200 可以表示为双精度浮点值，也没有足够大的有限值可以表示 1e200 * 1e200。1e200 * 1e200 的结果发生了上溢。IEEE 754标准要求结果为正无穷大。

Overflow is rarely a problem in statistical computations. If it is, it’s possible to work on the log scale, just as for underflow as described below.

在统计计算中，上溢一般不是问题。如果是的话，可以在对数尺度上进行运算，就像下面描述的下溢一样。

Underflow and the log scale

下溢和对数尺度

When there is no number small enough to represent a result, it is said to underflow. For instance, 1e-200 may be represented, but 1e-200 * 1e-200 underflows so that the result is zero.

当没有足够小的数字能够表示一个结果时，就称为下溢。例如，1e-200 可被正常表示，但是 1e-200 * 1e-200 会发生下溢，结果变为零。

Underflow is a ubiquitous problem in likelihood calculations, For example, if \(p(y_n \mid \theta) < 0.1\), then

下溢是似然计算中普遍存在的问题。例如，如果 \(p(y_n \mid \theta) < 0.1\)，那么一旦 \(N > 350\)，似然函数

\[ p(y \mid \theta) = \prod_{n=1}^N p(y_n \mid \theta) \] will underflow as soon as \(N > 350\) or so.

就会发生下溢。

To deal with underflow, work on the log scale. Even though \(p(y \mid \theta)\) can’t be represented, there is no problem representing

为了处理下溢，可以在对数尺度上进行运算。虽然无法表示 \(p(y \mid \theta)\)，但我们可以表示 \[ \begin{array}{rcl} \log p(y \mid \theta) & = & \log \prod_{n=1}^N p(y_n \mid \theta) \\[4pt] & = & \sum_{n = 1}^N \log p(y_n \mid \theta) \end{array} \]

This is why all of Stan’s probability functions operate on the log scale.

这就是为什么 Stan 中所有的概率函数都在对数尺度上进行操作。

18.4 Log sum of exponentials

对数求和的指数化

Working on the log scale, multiplication is converted to addition,

当在对数尺度上进行运算时，乘法会转化为加法， \[ \log (a \cdot b) = \log a + \log b. \] Thus sequences of multiplication operations can remain on the log scale. But what about addition? Given \(\log a\) and \(\log b\), how do we get \(\log (a + b)\)? Working out the algebra,

因此，乘法操作的序列可以保持在对数尺度上。但是加法呢？给定 \(\log a\) 和 \(\log b\)，我们如何得到 \(\log (a + b)\)？通过计算代数式，我们可以得到： \[ \log (a + b) = \log (\exp(\log a) + \exp(\log b)). \]

Log-sum-exp function

Log-sum-exp 函数

The nested log of sum of exponentials is so common, it has its own name, “log-sum-exp”,

指数到求和再到对数的嵌套运算非常常见，它有着自己的名字，“log-sum-exp”, \[ \textrm{log-sum-exp}(u, v) = \log (\exp(u) + \exp(v)). \] so that

由此有 \[ \log (a + b) = \textrm{log-sum-exp}(\log a, \log b). \]

Although it appears this might overflow as soon as exponentiation is introduced, evaluation does not proceed by evaluating the terms as written. Instead, with a little algebra, the terms are rearranged into a stable form,

虽然看起来一旦引入指数运算就可能发生溢出，但实际计算并不是按照书面形式逐项求值的。相反，通过一些代数变换，这些项被重新排列成稳定的形式，

\[ \textrm{log-sum-exp}(u, v) = \max(u, v) + \log\big( \exp(u - \max(u, v)) + \exp(v - \max(u, v)) \big). \]

Because the terms inside the exponentiations are \(u - \max(u, v)\) and \(v - \max(u, v)\), one will be zero and the other will be negative. Because the operation is symmetric, it may be assumed without loss of generality that \(u \geq v\), so that

因为指数中的项是 \(u - \max(u, v)\) 和 \(v - \max(u, v)\)，所以其中一个为零，另一个为负数。由于操作是对称的，不失一般性可以假设 \(u \geq v\)，从而有 \[ \textrm{log-sum-exp}(u, v) = u + \log\big(1 + \exp(v - u)\big). \]

Although the inner term may itself be evaluated using the built-in function log1p, there is only limited gain because \(\exp(v - u)\) is only near zero when \(u\) is much larger than \(v\), meaning the final result is likely to round to \(u\) anyway.

尽管内部项本身可以使用内置函数 log1p 进行求值，但这样做的改善很有限，因为只有当 \(u\) 远远大于 \(v\) 时，\(\exp(v - u)\) 才会接近零，这意味着最终的结果很可能还是会四舍五入为 \(u\)。

To conclude, when evaluating \(\log (a + b)\) given \(\log a\) and \(\log b\), and assuming \(\log a > \log b\), return

总之，当给定 \(\log a\) 和 \(\log b\) 来求解 \(\log (a + b)\) 时，假设\(\log a > \log b\)，则返回 \[ \log (a + b) = \log a + \textrm{log1p}\big(\exp(\log b - \log a)\big). \]

Applying log-sum-exp to a sequence

将 log-sum-exp 应用于序列

The log sum of exponentials function may be generalized to sequences in the obvious way, so that if \(v = v_1, \ldots, v_N\), then

Log-sum-exp 函数可以自然地推广到序列形式，即如果 \(v = v_1, \ldots, v_N\)，那么 \[\begin{eqnarray*} \textrm{log-sum-exp}(v) & = & \log \sum_{n = 1}^N \exp(v_n) \\[4pt] & = & \max(v) + \log \sum_{n = 1}^N \exp(v_n - \max(v)). \end{eqnarray*}\] The exponent cannot overflow because its argument is either zero or negative. This form makes it easy to calculate \(\log (u_1 + \cdots + u_N)\) given only \(\log u_n\).

指数部分不会发生上溢，因为其参数要么为零，要么为负。这种形式使得只需给出 \(\log u_n\) 就可以很容易计算出 \(\log (u_1 + \cdots + u_N)\)。

Calculating means with log-sum-exp

使用 log-sum-exp 计算均值

An immediate application is to computing the mean of a vector \(u\) entirely on the log scale. That is, given \(\log u\) and returning \(\log \textrm{mean}(u)\).

一个直接的应用是完全在对数尺度上计算向量 \(u\) 的均值。也就是说，给定 \(\log u\) 并返回 \(\log \textrm{mean}(u)\)。 \[\begin{eqnarray*} \log \left( \frac{1}{N} \sum_{n = 1}^N u_n \right) & = & \log \frac{1}{N} + \log \sum_{n = 1}^N \exp(\log u_n) \\[4pt] & = & -\log N + \textrm{log-sum-exp}(\log u). \end{eqnarray*}\] where \(\log u = (\log u_1, \ldots, \log u_N)\) is understood elementwise.

其中 \(\log u = (\log u_1, \ldots, \log u_N)\) 表示对每个元素逐个进行对数运算得到的结果。

18.5 Comparing floating-point numbers

浮点数的比较

Because floating-point representations are inexact, it is rarely a good idea to test exact inequality. The general recommendation is that rather than testing x == y, an approximate test may be used given an absolute or relative tolerance.

由于浮点数表示本身并不精确，因此进行精确相等性测试通常并不是一个好主意。一般的建议是，与其判断 x == y，不如基于绝对容差或相对容差进行近似比较。

Given a positive absolute tolerance of epsilon, x can be compared to y using the conditional

给定一个正的绝对容差 epsilon，可以使用条件语句比较 x 与 y

abs(x - y) <= epsilon.

Absolute tolerances work when the scale of x and y and the relevant comparison is known.

当 x 和 y 的数量级以及相关比较的尺度已知时，绝对容差是有效的。

Given a positive relative tolerance of epsilon, a typical comparison is

给定一个正的相对容差 epsilon，常见的比较方式是

2 * abs(x - y) / (abs(x) + abs(y)) <= epsilon.

The notable exception is Intel’s optimizing compilers under certain optimization settings.↩︎
在某些优化设置下，英特尔的优化编译器是一个显著的例外。↩︎