27 并行计算
Parallelization
本节译者:谭颂华、杨智
初次校审:李君竹
二次校审:李君竹(Claude 辅助)
Stan has support for different types of parallelization: multi-threading with Intel Threading Building Blocks (TBB), multi-processing with Message Passing Interface (MPI) and manycore processing with OpenCL.
Stan 支持不同类型的并行化:使用英特尔线程构建模块(Intel Threading Building Blocks,TBB)的多线程、使用消息传递接口(Message Passing Interface,MPI)的多进程以及使用 OpenCL 的众核处理。
Multi-threading in Stan can be used with two mechanisms: reduce with summation and rectangular map. The latter can also be used with multi-processing.
Stan 中的多线程可以通过两种机制使用:求和归约和矩形映射。后者也可以与多进程一起使用。
The advantages of reduce with summation are:
求和归约的优点是:
- More flexible argument interface, avoiding the packing and unpacking that is necessary with rectangular map.
  更灵活的参数接口,避免了矩形映射所需的打包和解包操作。
- Partitions data for parallelization automatically (this is done manually in rectangular map).
  自动为并行化划分数据(在矩形映射中需要手动完成)。
- Is easier to use.
  更易于使用。
The advantages of rectangular map are:
矩形映射的优点是:
- Returns a list of vectors, while reduce summation returns only a scalar.
  返回向量列表,而求和归约只返回标量。
- Can be parallelized across multiple cores and multiple computers, while reduce summation can only be parallelized across multiple cores on a single machine.
  可以跨多个核心和多台计算机并行化,而求和归约只能在单台机器的多个核心上并行化。
The actual speedup gained from using these functions will depend on many details. It is strongly recommended to only parallelize the computationally most expensive operations in a Stan program. Oftentimes this is the evaluation of the log likelihood for the observed data. When it is not clear which parts of the model are the most computationally expensive, we recommend using profiling, which is available in Stan 2.26 and newer.
使用这些函数获得的实际加速效果将取决于许多细节。强烈建议只对 Stan 程序中计算成本最高的操作进行并行化。通常这是对观测数据的对数似然评估。当不清楚模型的哪些部分计算成本最高时,我们建议使用性能分析,该功能在 Stan 2.26 及更新版本中可用。
Since only portions of a Stan program will run in parallel, the maximal speedup one can achieve is capped, a phenomenon described by Amdahl's law.
由于只有 Stan 程序的部分内容会并行运行,因此能够实现的最大加速是有上限的,这是阿姆达尔定律所描述的现象。
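For example, if a fraction \(p\) of a program's runtime can be parallelized across \(s\) workers, Amdahl's law bounds the overall speedup \(S\) as follows (a standard formulation, added here for illustration):
例如,如果程序运行时间中可并行化的比例为 \(p\),并行度为 \(s\),那么根据阿姆达尔定律,总体加速 \(S\) 的上限为(此为标准公式,在此仅作说明):
\[ S = \frac{1}{(1 - p) + p / s}. \]
With \(p = 0.9\), the speedup can never exceed \(10\), no matter how many cores are used.
当 \(p = 0.9\) 时,无论使用多少核心,加速都不会超过 \(10\)。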
27.1 Reduce-sum
求和归约
It is often necessary in probabilistic modeling to compute the sum of a number of independent function evaluations. This occurs, for instance, when evaluating a number of conditionally independent terms in a log-likelihood. If g: U -> real is the function and { x1, x2, ... } is an array of inputs, then that sum looks like:
在概率建模中,经常需要计算多个独立函数评估的总和。例如,在评估对数似然中的多个条件独立项时就会出现这种情况。如果 g: U -> real 是函数,{ x1, x2, ... } 是输入数组,那么该总和看起来像:
g(x1) + g(x2) + ...
reduce_sum and reduce_sum_static are tools for parallelizing these calculations.
reduce_sum 和 reduce_sum_static 是并行化这些计算的工具。
For efficiency reasons the reduce function doesn't work with the element-wise evaluated function g, but instead the partial sum function f: U[] -> real, where f computes the partial sum corresponding to a slice of the sequence x passed in. Due to the associativity of the sum reduction it holds that:
出于效率原因,归约函数不使用逐元素评估的函数 g,而是使用部分和函数 f: U[] -> real,其中 f 计算与传入的序列 x 的切片对应的部分和。由于求和归约的结合律,有:
g(x1) + g(x2) + g(x3) = f({ x1, x2, x3 })
                      = f({ x1, x2 }) + f({ x3 })
                      = f({ x1 }) + f({ x2, x3 })
                      = f({ x1 }) + f({ x2 }) + f({ x3 })
With the partial sum function f: U[] -> real, reduction of a large number of terms can be evaluated in parallel automatically, since the overall sum can be partitioned into arbitrary smaller partial sums. The exact partitioning into the partial sums is not under the control of the user. However, since the exact numerical result will depend on the order of summation, Stan provides two versions of the reduce summation facility:
使用部分和函数 f: U[] -> real,大量项的归约可以自动并行评估,因为总和可以划分为任意较小的部分和。部分和的确切划分不受用户控制。然而,由于确切的数值结果将取决于求和顺序,Stan 提供了两个版本的求和归约功能:
- reduce_sum: Automatically choose partial sums partitioning based on a dynamic scheduling algorithm.
  reduce_sum:基于动态调度算法自动选择部分和划分。
- reduce_sum_static: Compute the same sum as reduce_sum, but partition the input in the same way for a given data set (in reduce_sum this partitioning might change depending on computer load).
  reduce_sum_static:计算与 reduce_sum 相同的和,但对给定数据集以相同方式划分输入(在 reduce_sum 中,这种划分可能会根据计算机负载而改变)。
grainsize is the one tuning parameter. For reduce_sum, grainsize is a suggested partial sum size. A grainsize of 1 leaves the partitioning entirely up to the scheduler. This should be the default way of using reduce_sum unless time is spent carefully picking grainsize. For picking a grainsize, see details below.
grainsize 是唯一的调优参数。对于 reduce_sum,grainsize 是建议的部分和大小。grainsize 为 1 将划分完全交给调度器。除非花时间仔细选择 grainsize,否则这应该是使用 reduce_sum 的默认方式。关于选择 grainsize,详见下文。
For reduce_sum_static, grainsize specifies the maximal partial sum size. With reduce_sum_static it is more important to choose grainsize carefully since it entirely determines the partitioning of work. See details below.
对于 reduce_sum_static,grainsize 指定最大的部分和大小。对于 reduce_sum_static,仔细选择 grainsize 更为重要,因为它完全决定了工作的划分。详见下文。
For efficiency and convenience, additional shared arguments can be passed to every term in the sum. So for the array { x1, x2, ... } and the shared arguments s1, s2, ..., the effective sum (with individual terms) looks like:
为了效率和便利,可以将额外的共享参数传递给和中的每一项。因此,对于数组 { x1, x2, ... } 和共享参数 s1, s2, ...,有效的和(带有各项)看起来像:
g(x1, s1, s2, ...) + g(x2, s1, s2, ...) + g(x3, s1, s2, ...) + ...
which can be written equivalently with partial sums to look like:
这可以等价地用部分和写成:
f({ x1, x2 }, s1, s2, ...) + f({ x3 }, s1, s2, ...)
where the particular slicing of the x array can change.
其中 x 数组的特定切片可以改变。
Given this, the signatures are:
基于此,函数签名为:
real reduce_sum(F f, array[] T x, int grainsize, T1 s1, T2 s2, ...)
real reduce_sum_static(F f, array[] T x, int grainsize, T1 s1, T2 s2, ...)
- f - User defined function that computes partial sums
  f - 用户定义的计算部分和的函数
- x - Array to slice, each element corresponds to a term in the summation
  x - 要切片的数组,每个元素对应求和中的一项
- grainsize - Target for size of slices
  grainsize - 切片大小的目标
- s1, s2, ... - Arguments shared in every term
  s1, s2, ... - 每项中共享的参数
The user-defined partial sum functions have the signature:
用户定义的部分和函数具有签名:
real f(array[] T x_slice, int start, int end, T1 s1, T2 s2, ...)
and take the arguments:
并接受参数:
- x_slice - The subset of x (from reduce_sum / reduce_sum_static) for which this partial sum is responsible (x_slice = x[start:end])
  x_slice - 此部分和所负责的 x 的子集(来自 reduce_sum / reduce_sum_static,x_slice = x[start:end])
- start - An integer specifying the first term in the partial sum
  start - 指定部分和中第一项的整数
- end - An integer specifying the last term in the partial sum (inclusive)
  end - 指定部分和中最后一项的整数(包含)
- s1, s2, ... - Arguments shared in every term (passed on without modification from the reduce_sum / reduce_sum_static call)
  s1, s2, ... - 每项中共享的参数(从 reduce_sum / reduce_sum_static 调用中无修改地传递)
The user-provided function f is expected to compute the partial sum with the terms start through end of the overall sum. The user function is passed the subset x[start:end] as x_slice. start and end are passed so that f can index any of the trailing sM arguments as necessary. The trailing sM arguments are passed without modification to every call of f.
用户提供的函数 f 预期计算总和中从 start 到 end 项的部分和。用户函数接收子集 x[start:end] 作为 x_slice。传递 start 和 end 是为了让 f 可以根据需要索引任何尾部 sM 参数。尾部 sM 参数无修改地传递给 f 的每次调用。
A reduce_sum (or reduce_sum_static) call:
一个 reduce_sum(或 reduce_sum_static)调用:
real sum = reduce_sum(f, x, grainsize, s1, s2, ...);
can be replaced by either:
可以替换为:
real sum = f(x, 1, size(x), s1, s2, ...);
or the code:
或代码:
real sum = 0.0;
for (i in 1:size(x)) {
  sum += f({ x[i] }, i, i, s1, s2, ...);
}
Example: logistic regression
示例:逻辑回归
Logistic regression is a useful example to clarify both the syntax and semantics of reduce summation and how it can be used to speed up a typical model. A basic logistic regression can be coded in Stan as:
逻辑回归是一个有用的示例,可以澄清求和归约的语法和语义,以及如何使用它来加速典型模型。基本的逻辑回归可以在 Stan 中编码为:
data {
  int N;
  array[N] int y;
  vector[N] x;
}
parameters {
  vector[2] beta;
}
model {
  beta ~ std_normal();
  y ~ bernoulli_logit(beta[1] + beta[2] * x);
}
In this model predictions are made about the N outputs y using the covariate x. The intercept and slope of the linear equation are to be estimated. The key to getting this calculation to use reduce summation is recognizing that the statement:
在这个模型中,使用协变量 x 对 N 个输出 y 进行预测。需要估计线性方程的截距和斜率。使这个计算使用求和归约的关键是认识到语句:
y ~ bernoulli_logit(beta[1] + beta[2] * x);
can be rewritten (up to a proportionality constant) as:
可以重写(忽略比例常数)为:
for (n in 1:N) {
  target += bernoulli_logit_lpmf(y[n] | beta[1] + beta[2] * x[n]);
}
Now it is clear that the calculation is the sum of a number of conditionally independent Bernoulli log probability statements, which is the condition where reduce summation is useful. To use the reduce summation, a function must be written that can be used to compute arbitrary partial sums of the total sum. Using the interface defined in Reduce-Sum, such a function can be written like:
现在很清楚,计算是多个条件独立的伯努利对数概率语句的和,这正是求和归约有用的条件。要使用求和归约,必须编写一个可用于计算总和的任意部分和的函数。使用求和归约中定义的接口,这样的函数可以写成:
functions {
  real partial_sum(array[] int y_slice,
                   int start, int end,
                   vector x,
                   vector beta) {
    return bernoulli_logit_lpmf(y_slice | beta[1] + beta[2] * x[start:end]);
  }
}
The likelihood statement in the model can now be written:
模型中的似然语句现在可以写成:
target += partial_sum(y, 1, N, x, beta); // Sum terms 1 to N of the likelihood
In this example, y was chosen to be sliced over because there is one term in the summation per value of y. Technically x would have worked as well. Use whatever conceptually makes the most sense for a given model, e.g. slice over independent terms like conditionally independent observations or groups of observations as in hierarchical models. Because x is a shared argument, it is subset accordingly with start:end. With this function, reduce summation can be used to automatically parallelize the likelihood:
在这个例子中,选择对 y 进行切片,因为每个 y 值在求和中对应一项。技术上对 x 切片也可以。使用对给定模型在概念上最有意义的方式,例如,对独立项进行切片,如条件独立的观测或分层模型中的观测组。因为 x 是共享参数,它相应地用 start:end 进行子集化。有了这个函数,求和归约可以用来自动并行化似然:
int grainsize = 1;
target += reduce_sum(partial_sum, y,
                     grainsize, x, beta);
The reduce summation facility automatically breaks the sum into pieces and computes them in parallel. grainsize = 1 specifies that the grainsize should be estimated automatically. The final model is:
求和归约功能自动将和分解成片段并并行计算它们。grainsize = 1 指定应自动估计 grainsize。最终模型是:
functions {
  real partial_sum(array[] int y_slice,
                   int start, int end,
                   vector x,
                   vector beta) {
    return bernoulli_logit_lpmf(y_slice | beta[1] + beta[2] * x[start:end]);
  }
}
data {
  int N;
  array[N] int y;
  vector[N] x;
}
parameters {
  vector[2] beta;
}
model {
  int grainsize = 1;
  beta ~ std_normal();
  target += reduce_sum(partial_sum, y,
                       grainsize,
                       x, beta);
}
Picking the grainsize
选择 grainsize
The rationale for choosing a sensible grainsize is based on balancing the overhead implied by creating many small tasks against the limited parallelism implied by creating only a few large tasks.
选择合理的 grainsize 的理由在于权衡:创建许多小任务会带来额外开销,而创建较少的大任务则会限制潜在的并行性。
In reduce_sum, grainsize is a recommendation on how to partition the work in the partial sum into smaller pieces. A grainsize of 1 leaves this entirely up to the internal scheduler and should be chosen if no benchmarking of other grainsizes is done. Ideally this will be efficient, but there are no guarantees.
在 reduce_sum 中,grainsize 是关于如何将部分和中的工作划分为更小片段的建议。grainsize 为 1 将这完全留给内部调度器,如果没有对其他 grainsize 进行基准测试,应该选择这个值。理想情况下这将是高效的,但没有保证。
In reduce_sum_static, grainsize is an upper limit on the worksize. Work will be split until all partial sums are just smaller than grainsize (and the split will happen the same way every time for the same inputs). For the static version it is more important to select a sensible grainsize.
在 reduce_sum_static 中,grainsize 是工作大小的上限。工作将被分割,直到所有部分和都恰好小于 grainsize(对于相同的输入,分割将以相同的方式发生)。对于静态版本,选择合理的 grainsize 更为重要。
In order to figure out an optimal grainsize, if there are N terms and M cores, run a quick test model with grainsize set roughly to N / M. Record the time, cut the grainsize in half, and run the test again. Repeat this iteratively until the model runtime begins to increase. This is a suitable grainsize for the model, because this ensures the calculations can be carried out with the most parallelism without losing too much efficiency.
为了找出最优的 grainsize,如果有 N 项和 M 个核心,运行一个快速测试模型,将 grainsize 大致设置为 N / M。记录时间,将 grainsize 减半,然后再次运行测试。迭代重复这个过程,直到模型运行时间开始增加。这就是适合该模型的 grainsize,因为它确保计算能以最大的并行度进行,而不会损失太多效率。
For instance, in a model with N = 10000 and M = 4, start with grainsize = 2500, and sequentially try grainsize = 1250, grainsize = 625, etc.
例如,在 N = 10000 和 M = 4 的模型中,从 grainsize = 2500 开始,依次尝试 grainsize = 1250、grainsize = 625 等。
It is important to repeat this process until performance gets worse. It is possible that after many halvings the runtime barely changes, but there might still be a smaller grainsize that performs better. Even if a sum has many tens of thousands of terms, depending on the internal calculations, a grainsize of thirty or forty or smaller might be the best, and it is difficult to predict this behavior. Without doing these halvings until performance actually gets worse, it is easy to miss this.
重要的是重复这个过程直到性能变差。可能在多次减半后运行时间几乎不变,但可能仍有更小的 grainsize 表现更好。即使一个和有数万项,根据内部计算的不同,grainsize 为三十、四十或更小也可能是最好的,这种行为很难预测。如果不进行这些减半直到性能实际变差,很容易错过这一点。
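Because each candidate grainsize would otherwise require editing the model, one convenient pattern is to declare it as data so candidate values can be passed in at runtime. The following is a minimal sketch of this pattern (not from the original text), reusing the partial_sum function from the logistic regression example above:
由于每个候选 grainsize 否则都需要修改模型,一种方便的做法是将其声明为数据,以便在运行时传入候选值。下面是这种模式的最小示意(并非原文内容),复用了上面逻辑回归示例中的 partial_sum 函数:
functions {
  real partial_sum(array[] int y_slice,
                   int start, int end,
                   vector x, vector beta) {
    return bernoulli_logit_lpmf(y_slice | beta[1] + beta[2] * x[start:end]);
  }
}
data {
  int N;
  array[N] int y;
  vector[N] x;
  int<lower=1> grainsize;  // pass N / M first, then halve: e.g. 2500, 1250, 625, ...
}
parameters {
  vector[2] beta;
}
model {
  beta ~ std_normal();
  target += reduce_sum(partial_sum, y, grainsize, x, beta);
}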
27.2 Map-rect
矩形映射
Map-reduce allows large calculations (e.g., log likelihoods) to be broken into components which may be calculated modularly (e.g., data blocks) and combined (e.g., by summation and incrementing the target log density).
映射-归约允许将大型计算(例如对数似然)分解为可以模块化计算的组件(例如数据块),再将它们组合起来(例如通过求和并累加到目标对数密度)。
A map function is a higher-order function that applies an argument function to every member of some collection, returning a collection of the results. For example, mapping the square function, \(f(x) = x^2\), over the vector \([3, 5, 10]\) produces the vector \([9, 25, 100]\). In other words, map applies the square function elementwise.
映射函数是一个高阶函数,它将参数函数应用于某个集合的每个成员,返回结果的集合。例如,将平方函数 \(f(x) = x^2\) 映射到向量 \([3, 5, 10]\) 上产生向量 \([9, 25, 100]\)。换句话说,映射逐元素应用平方函数。
The output of mapping a sequence is often fed into a reduction. A reduction function takes an arbitrarily long sequence of inputs and returns a single output. Examples of reduction functions are summation (with the return being a single value) or sorting (with the return being a sorted sequence). The combination of mapping and reducing is so common it has its own name, map-reduce.
映射序列的输出通常被输入到归约中。归约函数接受任意长的输入序列并返回单个输出。归约函数的例子包括求和(返回单个值)或排序(返回排序后的序列)。映射和归约的组合非常常见,它有自己的名称:映射-归约。
Map function
映射函数
In order to generalize the form of functions and results that are possible and accommodate both parameters (which need derivatives) and data values (which don’t), Stan’s map function operates on more than just a sequence of inputs.
为了推广可能的函数和结果的形式,并容纳参数(需要导数)和数据值(不需要),Stan 的映射函数不仅仅对输入序列进行操作。
Map function signature
映射函数签名
Stan's map function has the following signature:
Stan 的映射函数具有以下签名:
vector map_rect((vector, vector, array[] real, array[] int):vector f,
                vector phi, array[] vector thetas,
                data array[,] real x_rs, data array[,] int x_is);
The arrays thetas of parameters, x_rs of real data, and x_is of integer data have the suffix "s" to indicate they are arrays. These arrays must all be the same size, as they will be mapped in parallel by the function f. The value of phi is reused in each mapped operation.
参数数组 thetas、实数数据数组 x_rs 和整数数据数组 x_is 带有后缀 "s",表示它们是数组。这些数组必须都是相同的大小,因为它们将由函数 f 并行映射。phi 的值在每个映射操作中被重用。
The _rect suffix in the name arises because the data structures it takes as arguments are rectangular; ragged inputs must be padded out to rectangular form.
名称中的 _rect 后缀源于它接受的参数数据结构是矩形的;不规则输入必须填充为矩形形式。
The last two arguments are two dimensional arrays of real and integer data values. These argument types are marked with the data qualifier to indicate that they must only contain variables originating in the data or transformed data blocks. This will allow such data to be pinned to a processor on which it is being processed to reduce communication overhead.
最后两个参数是实数和整数数据值的二维数组。这些参数类型用 data 限定符标记,表示它们必须只包含源自数据块或转换数据块的变量。这将允许将此类数据固定到正在处理它的处理器上,以减少通信开销。
The notation (vector, vector, array[] real, array[] int):vector indicates that the function argument f must have the following signature.
符号 (vector, vector, array[] real, array[] int):vector 表示函数参数 f 必须具有以下签名。
vector f(vector phi, vector theta,
         data array[] real x_r, data array[] int x_i);
Although f will often return a vector of size one, the built-in flexibility allows general multivariate functions to be mapped, even raggedly.
虽然 f 通常会返回大小为 1 的向量,但内置的灵活性允许映射一般的多元函数,甚至是不规则的。
Map function semantics
映射函数语义
Stan's map function applies the function f to the shared parameters along with one element each of the job parameters, real data, and integer data arrays. Each of the arguments theta, x_r, and x_i must be arrays of the same size. If the arrays are all size N, the result is defined as follows.
Stan 的映射函数将函数 f 应用于共享参数以及作业参数、实数数据和整数数据数组的每个元素。参数 theta、x_r 和 x_i 必须是相同大小的数组。如果数组都是大小 N,结果定义如下。
map_rect(f, phi, thetas, xs, ns)
  = f(phi, thetas[1], xs[1], ns[1]) . f(phi, thetas[2], xs[2], ns[2])
    . ... . f(phi, thetas[N], xs[N], ns[N])
The dot operators in the notation above are meant to indicate concatenation (implemented as append_row in Stan). The output of each application of f is a vector, and the sequence of N vectors is concatenated together to return a single vector.
上述符号中的点运算符表示连接(在 Stan 中实现为 append_row)。每次应用 f 的输出是一个向量,N 个向量的序列连接在一起返回单个向量。
Example: logistic regression
示例:逻辑回归
An example should help to clarify both the syntax and semantics of the mapping operation and how it may be combined with reductions built into Stan to provide a map-reduce implementation.
一个例子应该有助于澄清映射操作的语法和语义,以及它如何与 Stan 中内置的归约结合以提供映射-归约实现。
Unmapped logistic regression
未映射的逻辑回归
Consider the following simple logistic regression model, which is coded unconventionally to accommodate direct translation to a mapped implementation.
考虑以下简单的逻辑回归模型,它以非常规方式编码以适应直接转换为映射实现。
data {
  array[12] int y;
  array[12] real x;
}
parameters {
  vector[2] beta;
}
model {
  beta ~ std_normal();
  y ~ bernoulli_logit(beta[1] + beta[2] * to_vector(x));
}
The program is unusual in that it (a) hardcodes the data size, which is not required by the map function but is just used here for simplicity, (b) represents the predictors as a real array even though it needs to be used as a vector, and (c) represents the regression coefficients (intercept and slope) as a vector even though they're used individually. The bernoulli_logit distribution is used because the argument is on the logit scale; it implicitly applies the inverse logit function to map the argument to a probability.
该程序的不寻常之处在于它(a)硬编码数据大小,这不是映射函数所需的,只是为了简单起见在这里使用,(b)将预测变量表示为实数数组,即使它需要用作向量,(c)将回归系数(截距和斜率)表示为向量,即使它们是单独使用的。使用 bernoulli_logit 分布是因为参数在 logit 尺度上:它隐式地应用逆 logit 函数将参数映射到概率。
Mapped logistic regression
映射的逻辑回归
The unmapped logistic regression model described in the previous subsection may be implemented using Stan’s rectangular mapping functionality as follows.
前一小节中描述的未映射逻辑回归模型可以使用 Stan 的矩形映射功能实现,如下所示。
functions {
  vector lr(vector beta, vector theta, array[] real x, array[] int y) {
    real lp = bernoulli_logit_lpmf(y | beta[1]
                                       + to_vector(x) * beta[2]);
    return [lp]';
  }
}
data {
  array[12] int y;
  array[12] real x;
}
transformed data {
  // K = 3 shards
  array[3, 4] int ys = { y[1:4], y[5:8], y[9:12] };
  array[3, 4] real xs = { x[1:4], x[5:8], x[9:12] };
  array[3] vector[0] theta;
}
parameters {
  vector[2] beta;
}
model {
  beta ~ std_normal();
  target += sum(map_rect(lr, beta, theta, xs, ys));
}
The first piece of the code is the actual function to compute the logistic regression. The argument beta will contain the regression coefficients (intercept and slope), as before. The second argument theta of job-specific parameters is not used, but nevertheless must be present. The modeled data y is passed as an array of integers and the predictors x as an array of real values. The function body then computes the log probability mass of y and assigns it to the local variable lp. This variable is then wrapped as [lp]' to construct a one-element row vector, which is transposed to a vector for the return.
代码的第一部分是计算逻辑回归的实际函数。参数 beta 将包含回归系数(截距和斜率),如前所述。作业特定参数的第二个参数 theta 未使用,但必须存在。建模数据 y 作为整数数组传递,预测变量 x 作为实数值数组传递。函数体然后计算 y 的对数概率质量并将其赋给局部变量 lp。然后将此变量包装为 [lp]' 以构造单元素行向量,再转置为向量返回。
The data are taken in as before. There is an additional transformed data block that breaks the data up into three shards.1
数据如前所述被接收。有一个额外的转换数据块将数据分解为三个分片。1
The value 3 is also hard coded; a more practical program would allow the number of shards to be controlled. There are three parallel arrays defined here, each of size three, corresponding to the number of shards. The array ys contains the modeled data variables; each element of the array ys is an array of size four. The second array xs is for the predictors, and each element of it is also of size four. These contained arrays are the same size because the predictors x stand in a one-to-one relationship with the modeled data y. The final array theta is also of size three; its elements are empty vectors, because there are no shard-specific parameters.
值 3 也是硬编码的;更实用的程序将允许控制分片数量。这里定义了三个并行数组,每个大小为三,对应于分片数量。数组 ys 包含建模数据变量;数组 ys 的每个元素是大小为四的数组。第二个数组 xs 用于预测变量,它的每个元素也是大小为四。这些内含数组的大小相同,因为预测变量 x 与建模数据 y 呈一对一关系。最后的数组 theta 也是大小为三;它的元素是空向量,因为没有分片特定的参数。
The parameters and the prior are as before. The likelihood is now coded using map-reduce. The function lr to compute the log probability mass is mapped over the data xs and ys, which contain the original predictors and outcomes broken into shards. The parameters beta are in the first argument because they are shared across shards. There are no shard-specific parameters, so the array of job-specific parameters theta contains only empty vectors.
参数和先验如前所述。似然现在使用映射-归约编码。计算对数概率质量的函数 lr 映射到数据 xs 和 ys 上,它们包含分解为分片的原始预测变量和结果。参数 beta 位于第一个参数,因为它们在分片之间共享。没有分片特定的参数,所以作业特定参数的数组 theta 只包含空向量。
Example: hierarchical logistic regression
示例:分层逻辑回归
Consider a hierarchical model of American presidential voting behavior based on state of residence.2
考虑基于居住州的美国总统投票行为的分层模型。2
Each of the fifty states \(k \in \{1,\dotsc,50\}\) will have its own slope \(\beta_k\) and intercept \(\alpha_k\) to model the log odds of voting for the Republican candidate as a function of income. Suppose there are \(N\) voters, with voter \(n \in 1{:}N\) being in state \(s[n]\) and having income \(x_n\). The data model for the vote \(y_n \in \{ 0, 1 \}\) is
五十个州中的每一个州 \(k \in \{1,\dotsc,50\}\) 都有自己的斜率 \(\beta_k\) 和截距 \(\alpha_k\),用于将投票给共和党候选人的对数几率建模为收入的函数。假设有 \(N\) 个选民,选民 \(n \in 1{:}N\) 居住在州 \(s[n]\),收入为 \(x_n\)。投票 \(y_n \in \{ 0, 1 \}\) 的数据模型是
\[ y_n \sim \textsf{Bernoulli} \Big( \operatorname{logit}^{-1}\left( \alpha_{s[n]} + \beta_{s[n]} \, x_n \right) \Big). \]
The slopes and intercepts get hierarchical priors,
斜率和截距获得分层先验,
\[\begin{align*} \alpha_k &\sim \textsf{normal}(\mu_{\alpha}, \sigma_{\alpha}) \\ \beta_k &\sim \textsf{normal}(\mu_{\beta}, \sigma_{\beta}) \end{align*}\]
Unmapped implementation
未映射实现
This model can be coded up in Stan directly as follows.
这个模型可以直接在 Stan 中编码如下。
data {
  int<lower=0> K;
  int<lower=0> N;
  array[N] int<lower=1, upper=K> kk;
  vector[N] x;
  array[N] int<lower=0, upper=1> y;
}
parameters {
  matrix[K, 2] beta;
  vector[2] mu;
  vector<lower=0>[2] sigma;
}
model {
  mu ~ normal(0, 2);
  sigma ~ normal(0, 2);
  for (i in 1:2) {
    beta[ , i] ~ normal(mu[i], sigma[i]);
  }
  y ~ bernoulli_logit(beta[kk, 1] + beta[kk, 2] .* x);
}
For this model the vector of predictors x is coded as a vector, corresponding to how it is used in the model. The priors for mu and sigma are vectorized. The priors on the two components of beta (intercept and slope, respectively) are stored in a \(K \times 2\) matrix.
对于这个模型,预测变量 x 被编码为向量,对应于它在模型中的使用方式。mu 和 sigma 的先验是向量化的。beta 的两个分量(分别是截距和斜率)的先验存储在 \(K \times 2\) 矩阵中。
The distribution statement is also vectorized, using multi-indexing with index kk for the states and elementwise multiplication (.*) for the income x. The vectorized distribution statement works out to the same thing as the following less efficient looped form.
分布语句也进行了向量化,使用州索引 kk 的多重索引和收入 x 的逐元素乘法(.*)。向量化的分布语句等同于以下效率较低的循环形式。
for (n in 1:N) {
  y[n] ~ bernoulli_logit(beta[kk[n], 1] + beta[kk[n], 2] * x[n]);
}
Mapped implementation
映射实现
The mapped version of the model will map over the K states. This means the group-level parameters, real data, and integer data must be arrays of the same size.
模型的映射版本将映射到 K 个州上。这意味着组级参数、实数数据和整数数据必须是相同大小的数组。
The mapped implementation requires a function to be mapped. In this function we can't use distribution statements, but need to accumulate the desired log prior and log likelihood terms into the return value. The following function evaluates both the likelihood for the data observed for a group as well as the prior for the group-specific parameters (the name bl_glm derives from the fact that it's a generalized linear model with a Bernoulli data model and logistic link function).
映射实现需要一个要映射的函数。在这个函数中,我们不能使用分布语句,而需要将所需的对数先验和对数似然项累积到返回值中。以下函数既评估组内观测数据的似然,也评估组特定参数的先验(名称 bl_glm 源于它是具有伯努利数据模型和逻辑链接函数的广义线性模型)。
functions {
  vector bl_glm(vector mu_sigma, vector beta,
                array[] real x, array[] int y) {
    vector[2] mu = mu_sigma[1:2];
    vector[2] sigma = mu_sigma[3:4];
    real lp = normal_lpdf(beta | mu, sigma);
    real ll = bernoulli_logit_lpmf(y | beta[1] + beta[2] * to_vector(x));
    return [lp + ll]';
  }
}
The shared parameter mu_sigma contains the locations (mu_sigma[1:2]) and scales (mu_sigma[3:4]) of the priors, which are extracted in the first two lines of the program. The variable lp is assigned the log density of the prior on beta. The vector beta is of size two, as are the vectors mu and sigma, so everything lines up for the vectorization. Next, the variable ll is assigned the log likelihood contribution for the group. Here beta[1] is the intercept of the regression and beta[2] the slope. The predictor array x needs to be converted to a vector to allow the multiplication.
共享参数 mu_sigma 包含先验的位置(mu_sigma[1:2])和尺度(mu_sigma[3:4]),它们在程序的前两行中被提取。变量 lp 被赋予 beta 先验的对数密度。向量 beta 的大小为二,向量 mu 和 sigma 也是如此,所以一切都为向量化排列好了。接下来,变量 ll 被赋予组的对数似然贡献。这里 beta[1] 是回归的截距,beta[2] 是斜率。预测变量数组 x 需要转换为向量以允许乘法。
The data block is identical to that of the previous program, but repeated here for convenience. A transformed data block computes the data structures needed for the mapping by organizing the data into arrays indexed by group.
数据块与前一个程序的相同,但为了方便在这里重复。转换数据块通过将数据组织成按组索引的数组来计算映射所需的数据结构。
data {
  int<lower=0> K;
  int<lower=0> N;
  array[N] int<lower=1, upper=K> kk;
  vector[N] x;
  array[N] int<lower=0, upper=1> y;
}
transformed data {
  int<lower=0> J = N / K;
  array[K, J] real x_r;
  array[K, J] int<lower=0, upper=1> x_i;
  {
    int pos = 1;
    for (k in 1:K) {
      int end = pos + J - 1;
      x_r[k] = to_array_1d(x[pos:end]);
      x_i[k] = to_array_1d(y[pos:end]);
      pos += J;
    }
  }
}
The integer J is set to the number of observations per group.3
整数 J 设置为每组的观察数。3
The real data array x_r holds the predictors and the integer data array x_i holds the outcomes. The grouped data arrays are constructed by slicing the predictor vector x (and converting it to an array) and slicing the outcome array y.
实数数据数组 x_r 保存预测变量,整数数据数组 x_i 保存结果。分组数据数组通过切片预测变量向量 x(并将其转换为数组)和切片结果数组 y 来构建。
Given the transformed data with groupings, the parameters are the same as the previous program. The model has the same priors for the hyperparameters mu and sigma, but moves the prior for beta and the likelihood into the mapped function.
给定带有分组的转换数据,参数与前一个程序相同。模型对超参数 mu 和 sigma 具有相同的先验,但将 beta 的先验和似然移到了映射函数中。
parameters {
  array[K] vector[2] beta;
  vector[2] mu;
  vector<lower=0>[2] sigma;
}
model {
  mu ~ normal(0, 2);
  sigma ~ normal(0, 2);
  target += sum(map_rect(bl_glm, append_row(mu, sigma),
                         beta, x_r, x_i));
}
The model as written here computes the priors for each group's parameters along with the likelihood contribution for the group. An alternative mapping would leave the prior in the model block and only map the likelihood computation. In a serial setting this shouldn't make much of a difference, but with parallelization, the version that leaves the prior in the model block has reduced communication (the prior's parameters need not be transmitted) but also reduced parallelization (the prior is evaluated serially).
这里写的模型计算每组参数的先验以及该组的似然贡献。另一种映射方式是将先验留在模型块中,只映射似然计算。在串行设置中,这不应该有太大区别;但在并行化时,将先验留在模型块中的版本通信量更少(不需要传输先验的参数),并行度也更低(先验是串行评估的)。
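A sketch of that alternative is given below (hypothetical code under the same data structures as above; the function name ll_only is chosen for illustration). Only the likelihood is mapped, and the shared-parameter slot is filled with an empty vector:
下面给出这种替代方案的示意(在与上面相同的数据结构下的假设性代码;函数名 ll_only 仅用于说明)。只有似然被映射,共享参数槽位用空向量填充:
functions {
  // hypothetical sketch: only the likelihood is mapped; the prior on each
  // beta[k] stays in the model block, so mu and sigma need not be transmitted
  vector ll_only(vector phi, vector beta,
                 data array[] real x, data array[] int y) {
    return [bernoulli_logit_lpmf(y | beta[1] + beta[2] * to_vector(x))]';
  }
}
// data, transformed data, and parameters blocks unchanged from above
model {
  mu ~ normal(0, 2);
  sigma ~ normal(0, 2);
  for (k in 1:K) {
    beta[k] ~ normal(mu, sigma);
  }
  // phi is unused by ll_only, so an empty vector is passed in its slot
  target += sum(map_rect(ll_only, rep_vector(0, 0), beta, x_r, x_i));
}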
Ragged inputs and outputs
不规则输入和输出
The previous examples included rectangular data structures and single outputs. Despite the name, this is not technically required by map_rect.
前面的例子包括矩形数据结构和单个输出。尽管有这个名字,map_rect 在技术上并不需要这个。
Ragged inputs
不规则输入
If each group has a different number of observations, then the rectangular data structures for predictors and outcomes will need to be padded out to be rectangular. In addition, the size of the ragged structure will need to be passed as integer data. This holds for shards with varying numbers of parameters as well as varying numbers of data points.
如果每组有不同数量的观察,那么预测变量和结果的矩形数据结构将需要填充为矩形。此外,不规则结构的大小将需要作为整数数据传递。这适用于具有不同数量参数以及不同数量数据点的分片。
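One common convention, shown as a hypothetical sketch below (not from the original text), stores each shard's true size as the first integer of its integer-data shard and pads both data arrays out to the maximum group size:
一种常见的约定如下面的假设性示意所示(并非原文内容):将每个分片的真实大小存储为其整数数据分片的第一个元素,并将两个数据数组填充到最大组大小:
functions {
  // hypothetical sketch: x_i[1] holds the true group size n; x_i[2:(n + 1)]
  // are the outcomes and x_r[1:n] the predictors; padded entries are ignored
  vector ll(vector phi, vector theta,
            data array[] real x_r, data array[] int x_i) {
    int n = x_i[1];
    return [bernoulli_logit_lpmf(x_i[2:(n + 1)]
                                 | phi[1] + phi[2] * to_vector(x_r[1:n]))]';
  }
}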
Ragged outputs
不规则输出
The output of each mapped function is concatenated in order of inputs to produce the output of map_rect. When every shard returns a singleton (size one) array, the result is the same size as the number of shards and is easy to deal with downstream. If functions return longer arrays, they can still be structured using the to_matrix function if they are rectangular.
每个映射函数的输出按输入顺序连接以产生 map_rect 的输出。当每个分片返回单元素(大小为一)的数组时,结果的大小与分片数量相同,易于在下游处理。如果函数返回更长的数组,只要它们是矩形的,仍然可以使用 to_matrix 函数重新构造。
If the outputs are of varying sizes, then there will have to be some way to convert them back to a usable form based on the input, because there is no way to directly return sizes or a ragged structure.
如果输出大小不同,那么必须有某种方法根据输入将其转换回可用形式,因为没有办法直接返回大小或不规则结构。
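When the per-shard output sizes are known in advance as data, the concatenated result can be unpacked by walking a position index. The sketch below is hypothetical (g, ns, and K are assumed to be declared elsewhere):
当每个分片的输出大小作为数据预先已知时,可以通过移动位置索引来解包连接后的结果。下面的示意是假设性的(假定 g、ns 和 K 在别处声明):
// hypothetical sketch: ns[k] is the (data) output size of shard k
vector[sum(ns)] out = map_rect(g, phi, thetas, x_rs, x_is);
int pos = 1;
for (k in 1:K) {
  vector[ns[k]] out_k = out[pos:(pos + ns[k] - 1)];
  // ... use out_k ...
  pos += ns[k];
}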
27.3 OpenCL
OpenCL
OpenCL (Open Computing Language) is a framework that enables writing programs that execute across heterogeneous platforms. An OpenCL program can be run on CPUs and GPUs. In order to run OpenCL programs, an OpenCL runtime must be installed on the target system.
OpenCL(开放计算语言)是一个框架,能够编写跨异构平台执行的程序。OpenCL 程序可以在 CPU 和 GPU 上运行。为了运行 OpenCL 程序,必须在目标系统上安装 OpenCL 运行时。
Stan's OpenCL backend is currently supported in CmdStan and its wrappers. In order to use it, the model must be compiled with the STAN_OPENCL makefile flag. Setting this flag means that the Stan-to-C++ translator (stanc3) will be supplied the --use-opencl flag and that the OpenCL enabled backend (Stan Math functions) will be enabled.
Stan 的 OpenCL 后端目前在 CmdStan 及其包装器中得到支持。为了使用它,模型必须使用 STAN_OPENCL makefile 标志编译。设置此标志意味着 Stan-to-C++ 翻译器(stanc3)将被传入 --use-opencl 标志,并且将启用支持 OpenCL 的后端(Stan Math 函数)。
In Stan, the following distributions can be automatically run in parallel on both CPUs and GPUs with OpenCL:
在 Stan 中,以下分布可以使用 OpenCL 在 CPU 和 GPU 上自动并行运行:
- bernoulli_lpmf
- bernoulli_logit_lpmf
- bernoulli_logit_glm_lpmf*
- beta_lpdf
- beta_proportion_lpdf
- binomial_lpmf
- categorical_logit_glm_lpmf*
- cauchy_lpdf
- chi_square_lpdf
- double_exponential_lpdf
- exp_mod_normal_lpdf
- exponential_lpdf
- frechet_lpdf
- gamma_lpdf
- gumbel_lpdf
- inv_chi_square_lpdf
- inv_gamma_lpdf
- logistic_lpdf
- lognormal_lpdf
- neg_binomial_lpmf
- neg_binomial_2_lpmf
- neg_binomial_2_log_lpmf
- neg_binomial_2_log_glm_lpmf*
- normal_lpdf
- normal_id_glm_lpdf*
- ordered_logistic_glm_lpmf*
- pareto_lpdf
- pareto_type_2_lpdf
- poisson_lpmf
- poisson_log_lpmf
- poisson_log_glm_lpmf*
- rayleigh_lpdf
- scaled_inv_chi_square_lpdf
- skew_normal_lpdf
- std_normal_lpdf
- student_t_lpdf
- uniform_lpdf
- weibull_lpdf
* OpenCL is not used when the covariate argument to the GLM functions is a row_vector.
* 当 GLM 函数的协变量参数是 row_vector 时,不使用 OpenCL。
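As an illustration (a hypothetical sketch, not from the original text), the likelihood below uses normal_id_glm, one of the functions listed above; the covariate X is a matrix rather than a row_vector, so the OpenCL backend can take over when the program is compiled with STAN_OPENCL:
作为说明(假设性示意,并非原文内容),下面的似然使用了上面列出的函数之一 normal_id_glm;协变量 X 是矩阵而不是 row_vector,因此当程序使用 STAN_OPENCL 编译时,OpenCL 后端可以接管计算:
data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] X;   // matrix covariate: eligible for the OpenCL GLM kernel
  vector[N] y;
}
parameters {
  real alpha;
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  alpha ~ std_normal();
  beta ~ std_normal();
  sigma ~ std_normal();
  y ~ normal_id_glm(X, alpha, beta, sigma);
}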
1. The term "shard" is borrowed from databases, where it refers to a slice of the rows of a database. That is exactly what it is here if we think of rows of a dataframe. Stan's shards are more general in that they need not correspond to rows of a dataframe.↩︎
   术语“分片”借自数据库,指数据库中行的一个切片。如果我们把这里的数据看作数据框的行,它正是这个含义。Stan 的分片更通用,因为它们不需要对应于数据框的行。↩︎
2. This example is a simplified form of the model described in (Gelman and Hill 2007, sec. 14.2).↩︎
   这个例子是 (Gelman and Hill 2007, 第14.2节) 中描述的模型的简化形式。↩︎
3. This makes the strong assumption that each group has the same number of observations!↩︎
   这做出了每组具有相同数量观察的强假设!↩︎