1  回归模型

Regression Models

本节译者:于曙光、王才兴、王超

初次校审:张梓源

二次校审:邱怡轩(DeepSeek 辅助)

Stan supports regression models from simple linear regressions to multilevel generalized linear models.

Stan 支持从简单的线性回归到多层次广义线性模型等多种回归模型。

1.1 Linear regression

线性回归

The simplest linear regression model is the following, with a single predictor and a slope and intercept coefficient, and normally distributed noise. This model can be written using standard regression notation as

最简单的线性回归模型包含一个自变量、斜率、截距以及服从正态分布的噪声项。使用标准回归符号,可以将该模型写成如下形式:

\[ y_n = \alpha + \beta x_n + \epsilon_n \quad\text{where}\quad \epsilon_n \sim \operatorname{normal}(0,\sigma). \]

This is equivalent to the following sampling involving the residual,

这相当于涉及残差的以下采样,

\[ y_n - (\alpha + \beta X_n) \sim \operatorname{normal}(0,\sigma), \] and reducing still further, to

并可以进一步简化为如下形式:

\[ y_n \sim \operatorname{normal}(\alpha + \beta X_n, \, \sigma). \]

This latter form of the model is coded in Stan as follows.

最后一种模型的表述形式在 Stan 中的代码如下。

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
}

There are N observations and for each observation, \(n \in N\), we have predictor x[n] and outcome y[n]. The intercept and slope parameters are alpha and beta. The model assumes a normally distributed noise term with scale sigma. This model has improper priors for the two regression coefficients.

这里共有 N 个观测值,对于每个观测值 \(n \in N\),都包含一个自变量 x[n] 和对应的因变量 y[n]。截距参数为 alpha,斜率参数为 beta。该模型假设噪声项服从标准差为 sigma 的正态分布。该模型对于两个回归系数采用了非正规先验分布。

Matrix notation and vectorization

矩阵符号和向量化

The distribution statement in the previous model is vectorized, with

前面模型中的分布语句进行向量化处理后,有如下表达形式,

y ~ normal(alpha + beta * x, sigma);

providing the same model as the unvectorized version,

这与未向量化的版本对应的是相同的模型,

for (n in 1:N) {
  y[n] ~ normal(alpha + beta * x[n], sigma);
}

In addition to being more concise, the vectorized form is much faster.1

除了使表达式更加简洁外,向量化形式还能显著提高运行效率。2

In general, Stan allows the arguments to distributions such as normal to be vectors. If any of the other arguments are vectors or arrays, they have to be the same size. If any of the other arguments is a scalar, it is reused for each vector entry.

在 Stan 中,像 normal(正态分布)这样的分布函数通常允许使用向量作为参数。如果其他参数中有向量或数组,则它们必须具有相同的大小。如果其他参数是标量,则会为每个向量条目重复使用该标量。

The other reason this works is that Stan’s arithmetic operators are overloaded to perform matrix arithmetic on matrices. In this case, because x is of type vector and beta of type real, the expression beta * x is of type vector. Because Stan supports vectorization, a regression model with more than one predictor can be written directly using matrix notation.

这种写法的另一个优势在于,Stan 的算术运算符经过重载后可以直接执行矩阵运算。在这种情况下,因为 x 的类型是 vectorbeta 的类型是 real,所以表达式 beta * x 的类型是 vector。由于 Stan 支持向量化,可以直接使用矩阵符号编写具有多个自变量的回归模型。

data {
  int<lower=0> N;   // number of data items
  int<lower=0> K;   // number of predictors
  matrix[N, K] x;   // predictor matrix
  vector[N] y;      // outcome vector
}
parameters {
  real alpha;           // intercept
  vector[K] beta;       // coefficients for predictors
  real<lower=0> sigma;  // error scale
}
model {
  y ~ normal(x * beta + alpha, sigma);  // data model
}

The constraint lower=0 in the declaration of sigma constrains the value to be greater than or equal to 0. With no prior in the model block, the effect is an improper prior on non-negative real numbers. Although a more informative prior may be added, improper priors are acceptable as long as they lead to proper posteriors.

sigma 的声明中,通过 lower=0 的约束条件将其取值范围限制为非负实数。在模型块中没有设定先验分布,就相当于对非负实数设定了非正规先验分布。尽管可以添加一个有信息的先验,但只要能产生正规的后验分布,使用非正规先验分布也是可以接受的。

In the model above, x is an \(N \times K\) matrix of predictors and beta a \(K\)-vector of coefficients, so x * beta is an \(N\)-vector of predictions, one for each of the \(N\) data items. These predictions line up with the outcomes in the \(N\)-vector y, so the entire model may be written using matrix arithmetic as shown. It would be possible to include a column of ones in the data matrix x to remove the alpha parameter.

在上述模型中,x 是一个 \(N \times K\) 的自变量矩阵或数据矩阵,beta 是一个 \(K\) 维的系数向量,因此 x * beta 是一个 \(N\) 维的预测值向量,每个元素对应于 \(N\) 个数据项中的一个。这些预测值与 \(N\) 维向量 y 中的观测结果对应,因此整个模型可以使用矩阵运算来表示。如果希望消除 alpha 参数,可以在数据矩阵 x 中添加一列全为1的向量。

The distribution statement in the model above is just a more efficient, vector-based approach to coding the model with a loop, as in the following statistically equivalent model.

上述模型中的分布语句采用了基于向量的高效编码方式。在统计学意义上,上述写法与下面这一使用循环语句写成的版本是等价的。

model {
  for (n in 1:N) {
    y[n] ~ normal(x[n] * beta, sigma);
  }
}

With Stan’s matrix indexing scheme, x[n] picks out row n of the matrix x; because beta is a column vector, the product x[n] * beta is a scalar of type real.

按照 Stan 的矩阵索引规则,x[n] 表示矩阵 x 的第 n 行;由于 beta 是一个列向量,因此乘积 x[n] * beta 是一个 real 类型的标量。

Intercepts as inputs

将截距作为输入

In the model formulation

在模型的表述中,

y ~ normal(x * beta, sigma);

there is no longer an intercept coefficient alpha. Instead, we have assumed that the first column of the input matrix x is a column of 1 values. This way, beta[1] plays the role of the intercept. If the intercept gets a different prior than the slope terms, then it would be clearer to break it out. It is also slightly more efficient in its explicit form with the intercept variable singled out because there’s one fewer multiplications; it should not make that much of a difference to speed, though, so the choice should be based on clarity.

不再存在截距系数 alpha。取而代之的是,我们假设输入矩阵 x 的第一列是全为1的列。这样,beta[1] 就扮演截距的角色。如果截距与斜率项有不同的先验分布,那么将其单独分离出来会更清晰。将截距变量显式地单独列出,在计算效率上会略有提升,因为可以减少一次乘法运算;不过对速度来说,这并不会有太大的影响,所以我们应当选择更简明清晰的代码表述。

1.2 The QR reparameterization

QR 重参数化

In the previous example, the linear predictor can be written as \(\eta = x \beta\), where \(\eta\) is a \(N\)-vector of predictions, \(x\) is a \(N \times K\) matrix, and \(\beta\) is a \(K\)-vector of coefficients. Presuming \(N \geq K\), we can exploit the fact that any design matrix \(x\) can be decomposed using the thin QR decomposition into an orthogonal matrix \(Q\) and an upper-triangular matrix \(R\), i.e. \(x = Q R\).

在前面的例子中,线性预测器可以写作 \(\eta = x \beta\),其中 \(\eta\) 是一个 \(N\) 维的预测向量,\(x\) 是一个 \(N \times K\) 的矩阵,\(\beta\) 是一个 \(K\) 维的系数向量。当 \(N \geq K\) 时,我们可以对任意设计矩阵 \(x\) 进行薄QR分解(thin QR decomposition),将其分解为一个正交矩阵 \(Q\) 和一个上三角矩阵 \(R\),即 \(x = Q R\)

The functions qr_thin_Q and qr_thin_R implement the thin QR decomposition, which is to be preferred to the fat QR decomposition that would be obtained by using qr_Q and qr_R, as the latter would more easily run out of memory (see the Stan Functions Reference for more information on the qr_thin_Q and qr_thin_R functions). In practice, it is best to write \(x = Q^\ast R^\ast\) where \(Q^\ast = Q * \sqrt{n - 1}\) and \(R^\ast = \frac{1}{\sqrt{n - 1}} R\). Thus, we can equivalently write \(\eta = x \beta = Q R \beta = Q^\ast R^\ast \beta\). If we let \(\theta = R^\ast \beta\), then we have \(\eta = Q^\ast \theta\) and \(\beta = R^{\ast^{-1}} \theta\). In that case, the previous Stan program becomes

函数 qr_thin_Qqr_thin_R 实现了薄QR分解,它相比使用 qr_Qqr_R 得到的完整QR分解更可取,因为后者更容易耗尽内存(有关 qr_thin_Qqr_thin_R 函数的更多信息,请参阅 Stan 函数参考手册)。在实际应用中,最好将其表示为 \(x = Q^\ast R^\ast\),其中 \(Q^\ast = Q * \sqrt{n - 1}\)\(R^\ast = \frac{1}{\sqrt{n - 1}} R\)。因此,我们可以将其等价地写成 \(\eta = x\beta = QR\beta = Q^\ast R^\ast\beta\)。如果我们令 \(\theta = R^\ast\beta\),那么我们有 \(\eta = Q^\ast\theta\)\(\beta = R^{\ast^{-1}}\theta\)。在这种情况下,前面的 Stan 程序变为:

data {
  int<lower=0> N;   // number of data items
  int<lower=0> K;   // number of predictors
  matrix[N, K] x;   // predictor matrix
  vector[N] y;      // outcome vector
}
transformed data {
  matrix[N, K] Q_ast;
  matrix[K, K] R_ast;
  matrix[K, K] R_ast_inverse;
  // thin and scale the QR decomposition
  Q_ast = qr_thin_Q(x) * sqrt(N - 1);
  R_ast = qr_thin_R(x) / sqrt(N - 1);
  R_ast_inverse = inverse(R_ast);
}
parameters {
  real alpha;           // intercept
  vector[K] theta;      // coefficients on Q_ast
  real<lower=0> sigma;  // error scale
}
model {
  y ~ normal(Q_ast * theta + alpha, sigma);  // data model
}
generated quantities {
  vector[K] beta;
  beta = R_ast_inverse * theta; // coefficients on x
}

Since this Stan program generates equivalent predictions for \(y\) and the same posterior distribution for \(\alpha\), \(\beta\), and \(\sigma\) as the previous Stan program, many wonder why the version with this QR reparameterization performs so much better in practice, often both in terms of wall time and in terms of effective sample size. The reasoning is threefold:

由于这个 Stan 程序与之前的程序生成的预测值 \(y\) 相同,并且二者给出的 \(\alpha\)\(\beta\)\(\sigma\) 的后验分布也相同,许多人想知道为什么具有这种 QR 重参数化的版本在实践中表现得更好,不仅在实际时间上,也在有效样本量方面。原因有三点:

  1. The columns of \(Q^\ast\) are orthogonal whereas the columns of \(x\) generally are not. Thus, it is easier for a Markov Chain to move around in \(\theta\)-space than in \(\beta\)-space.

  2. The columns of \(Q^\ast\) have the same scale whereas the columns of \(x\) generally do not. Thus, a Hamiltonian Monte Carlo algorithm can move around the parameter space with a smaller number of larger steps

  3. Since the covariance matrix for the columns of \(Q^\ast\) is an identity matrix, \(\theta\) typically has a reasonable scale if the units of \(y\) are also reasonable. This also helps HMC move efficiently without compromising numerical accuracy.

  4. \(Q^\ast\) 的列是正交的,而 \(x\) 的列通常不是。因此,马尔可夫链在 \(\theta\) 空间中移动比在 \(\beta\) 空间中更容易。

  5. \(Q^\ast\) 的列具有相同的尺度,而 \(x\) 的列通常没有。 因此,哈密顿蒙特卡罗算法只需以较大步长在参数空间中进行少量移动。

  6. 由于 \(Q^\ast\) 的列的协方差矩阵是单位矩阵,如果 \(y\) 的单位也合理,\(\theta\) 通常具有合理的尺度。这也有助于哈密顿蒙特卡罗方法在保持数值精度的同时高效地移动。

Consequently, this QR reparameterization is recommended for linear and generalized linear models in Stan whenever \(K > 1\) and you do not have an informative prior on the location of \(\beta\). It can also be worthwhile to subtract the mean from each column of \(x\) before obtaining the QR decomposition, which does not affect the posterior distribution of \(\theta\) or \(\beta\) but does affect \(\alpha\) and allows you to interpret \(\alpha\) as the expectation of \(y\) in a linear model.

因此,在 Stan 中处理线性和广义线性模型时,当 \(K > 1\) 且缺乏关于 \(\beta\) 位置的有信息的先验时,建议采用这种 QR 重参数化方法。在获得 QR 分解之前,将每列的均值减去也是值得考虑的做法。这不会影响 \(\theta\)\(\beta\) 的后验分布,但会影响 \(\alpha\),并使您能够将 \(\alpha\) 解释为线性模型中 \(y\) 的期望值。

1.3 Priors for coefficients and scales

系数和尺度的先验分布

See our general discussion of priors for tips on priors for parameters in regression models.

关于回归模型中参数先验分布的建议,请参阅我们的先验分布的概述讨论

Later sections discuss univariate hierarchical priors and multivariate hierarchical priors, as well as priors used to identify models.

后续章节讨论了单变量层次先验分布多变量层次先验分布,以及用于识别模型的先验分布

However, as described in QR-reparameterization section, if you do not have an informative prior on the location of the regression coefficients, then you are better off reparameterizing your model so that the regression coefficients are a generated quantity. In that case, it usually does not matter much what prior is used on on the reparameterized regression coefficients and almost any weakly informative prior that scales with the outcome will do.

然而,如在 QR 重参数化章节所述,如果你对回归系数的位置参数先验信息不足,那么最好对模型进行重参数化,使得回归系数成为输出量。在这种情况下,通常对重参数化后的回归系数使用何种先验并不重要,几乎任何随着结果变化的弱信息先验都可以使用。

1.4 Robust noise models

稳健噪声模型

The standard approach to linear regression is to model the noise term \(\epsilon\) as having a normal distribution. From Stan’s perspective, there is nothing special about normally distributed noise. For instance, robust regression can be accommodated by giving the noise term a Student-\(t\) distribution. To code this in Stan, the distribution distribution is changed to the following.

线性回归的标准方法是将噪声项 \(\epsilon\) 建模为服从正态分布。从 Stan 的角度来看,正态分布的噪声没有什么特别之处。例如,稳健回归中可以让噪声项服从学生 \(t\) 分布。为了在 Stan 中编码这个模型,需要将采样分布更改为以下形式。

data {
  // ...
  real<lower=0> nu;
}
// ...
model {
  y ~ student_t(nu, alpha + beta * x, sigma);
}

The degrees of freedom constant nu is specified as data.

自由度常数 nu 被指定为数据。

1.5 Logistic and probit regression

Logistic 回归和 Probit 回归

For binary outcomes, either of the closely related logistic or probit regression models may be used. These generalized linear models vary only in the link function they use to map linear predictions in \((-\infty,\infty)\) to probability values in \((0,1)\). Their respective link functions, the logistic function and the standard normal cumulative distribution function, are both sigmoid functions (i.e., they are both S-shaped).

对于二元结果,可以使用紧密相关的 Logistic 回归模型或 Probit 回归模型之一。这些广义线性模型仅在它们用于将线性预测从 \((-\infty,\infty)\) 映射到 \((0,1)\) 的链接函数上有所不同。它们各自的链接函数,即 Logistic 函数和标准正态累积分布函数,都是 S 型函数(即这些函数的形状都与字母 S 相同)。

A logistic regression model with one predictor and an intercept is coded as follows.

具有一个自变量和截距项的 Logistic 回归模型可以按以下方式进行编码。

data {
  int<lower=0> N;
  vector[N] x;
  array[N] int<lower=0, upper=1> y;
}
parameters {
  real alpha;
  real beta;
}
model {
  y ~ bernoulli_logit(alpha + beta * x);
}

The noise parameter is built into the Bernoulli formulation here rather than specified directly.

噪声参数在这里内置于伯努利分布中,而不是直接给定。

Logistic regression is a kind of generalized linear model with binary outcomes and the log odds (logit) link function, defined by

Logistic 回归是一种具有二元结果和对数几率(逻辑)链接函数的广义线性模型,定义如下:

\[ \operatorname{logit}(v) = \log \left( \frac{v}{1-v} \right). \]

The inverse of the link function appears in the model:

模型中出现的链接函数的反函数:

\[ \operatorname{logit}^{-1}(u) = \texttt{inv}\mathtt{\_}\texttt{logit}(u) = \frac{1}{1 + \exp(-u)}. \]

The model formulation above uses the logit-parameterized version of the Bernoulli distribution, which is defined by

上面的模型中使用了伯努利分布的对数几率参数化版本(logit-parameterized version),其定义如下:

\[ \texttt{bernoulli}\mathtt{\_}\texttt{logit}\left(y \mid \alpha \right) = \texttt{bernoulli}\left(y \mid \operatorname{logit}^{-1}(\alpha)\right). \]

The formulation is also vectorized in the sense that alpha and beta are scalars and x is a vector, so that alpha + beta * x is a vector. The vectorized formulation is equivalent to the less efficient version

在这种表达中,alphabeta 是标量,x 是向量,因此 alpha + beta * x 是一个向量。该向量化表达式和下面这个低效形式等价:

for (n in 1:N) {
  y[n] ~ bernoulli_logit(alpha + beta * x[n]);
}

Expanding out the Bernoulli logit, the model is equivalent to the more explicit, but less efficient and less arithmetically stable

将伯努利逻辑表达式展开,模型等同于如下形式。该形式表述清晰,但更加低效,数值计算上也更不稳定。

for (n in 1:N) {
  y[n] ~ bernoulli(inv_logit(alpha + beta * x[n]));
}

Other link functions may be used in the same way. For example, probit regression uses the cumulative normal distribution function, which is typically written as

其他链接函数可以以相同的方式使用。例如,Probit 回归使用的是累积正态分布函数,通常表示为:

\[ \Phi(x) = \int_{-\infty}^x \textsf{normal}\left(y \mid 0,1 \right) \,\textrm{d}y. \]

The cumulative standard normal distribution function \(\Phi\) is implemented in Stan as the function Phi. The probit regression model may be coded in Stan by replacing the logistic model’s distribution statement with the following.

在 Stan 中,标准正态的累积分布函数 \(\Phi\) 通过函数 Phi 来实现。要在 Stan 中实现概率回归模型,只需用以下语句替换 Logistic 回归模型的采样语句即可。

y[n] ~ bernoulli(Phi(alpha + beta * x[n]));

A fast approximation to the cumulative standard normal distribution function \(\Phi\) is implemented in Stan as the function Phi_approx.3 The approximate probit regression model may be coded with the following.

在 Stan 中,对累积标准正态分布函数 \(\Phi\) 的快速近似实现为函数 Phi_approx4可以使用以下代码编写近似 Probit 回归模型。

y[n] ~ bernoulli(Phi_approx(alpha + beta * x[n]));

1.6 Multi-logit regression

多类别 Logistic 回归

Multiple outcome forms of logistic regression can be coded directly in Stan. For instance, suppose there are \(K\) possible outcomes for each output variable \(y_n\). Also suppose that there is a \(D\)-dimensional vector \(x_n\) of predictors for \(y_n\). The multi-logit model with \(\textsf{normal}(0,5)\) priors on the coefficients is coded as follows.

在 Stan 中可以直接编码多输出形式的 Logistic 回归。例如,假设对于每个输出变量 \(y_n\)\(K\) 个可能的结果。同时假设对于每个 \(y_n\) 都有一个维度为 \(D\) 的预测变量向量 \(x_n\) 。系数的先验分布为 \(\textsf{normal}(0,5)\) 的多输出形式的 Logistic 模型的编码如下。

data {
  int K;
  int N;
  int D;
  array[N] int y;
  matrix[N, D] x;
}
parameters {
  matrix[D, K] beta;
}
model {
  matrix[N, K] x_beta = x * beta;

  to_vector(beta) ~ normal(0, 5);

  for (n in 1:N) {
    y[n] ~ categorical_logit(x_beta[n]');

  }
}

where x_beta[n]' is the transpose of x_beta[n]. The prior on beta is coded in vectorized form. As of Stan 2.18, the categorical-logit distribution is not vectorized for parameter arguments, so the loop is required. The matrix multiplication is pulled out to define a local variable for all of the predictors for efficiency. Like the Bernoulli-logit, the categorical-logit distribution applies softmax internally to convert an arbitrary vector to a simplex,

其中 x_beta[n]'x_beta[n] 的转置。beta 的先验也是向量形式。截至 Stan 2.18版本,分类-逻辑分布(categorical-logit distribution)的参数未进行向量化处理,因此需要使用循环。这里用矩阵乘法来定义一个对应所有自变量的局部变量,以提高效率。与伯努利-逻辑(Bernoulli-logit)分布一样,分类-逻辑分布在内部应用 softmax 函数将任意向量转换为单纯形,

\[ \texttt{categorical}\mathtt{\_}\texttt{logit}\left(y \mid \alpha\right) = \texttt{categorical}\left(y \mid \texttt{softmax}(\alpha)\right), \]

where

其中

\[ \texttt{softmax}(u) = \exp(u) / \operatorname{sum}\left(\exp(u)\right). \]

The categorical distribution with log-odds (logit) scaled parameters used above is equivalent to writing

上面使用对数几率(逻辑)缩放参数的分布也等价于下面的代码:

y[n] ~ categorical(softmax(x[n] * beta));

Constraints on data declarations

数据声明的约束

The data block in the above model is defined without constraints on sizes K, N, and D or on the outcome array y. Constraints on data declarations provide error checking at the point data are read (or transformed data are defined), which is before sampling begins. Constraints on data declarations also make the model author’s intentions more explicit, which can help with readability. The above model’s declarations could be tightened to

上述模型中的数据块没有给 KNDy 定义约束。数据声明的约束可以在读取数据(或定义转换数据)时进行错误检查,该步骤发生在采样开始之前。数据声明的约束还可以使模型作者的意图更加明确,从而提高可读性。上述模型的声明可以进一步加强,如下所示:

int<lower=2> K;
int<lower=0> N;
int<lower=1> D;
array[N] int<lower=1, upper=K> y;

These constraints arise because the number of categories, K, must be at least two in order for a categorical model to be useful. The number of data items, N, can be zero, but not negative; unlike R, Stan’s for-loops always move forward, so that a loop extent of 1:N when N is equal to zero ensures the loop’s body will not be executed. The number of predictors, D, must be at least one in order for beta * x[n] to produce an appropriate argument for softmax(). The categorical outcomes y[n] must be between 1 and K in order for the discrete sampling to be well defined.

这些约束是因为在分类模型中,类别数量 K 必须至少为2才能有用。数据项数量 N 可以为零,但不能为负数;与 R 不同,Stan 的 for 循环总是向前移动,因此当 N 等于零时,循环范围为 1:N 可确保循环体不会被执行。自变量的数量 D 必须至少为 1,以便 beta * x[n] 生成 softmax() 的适当参数。分类结果 y[n] 必须在 1K 之间,以使离散采样定义良好。

Constraints on data declarations are optional. Constraints on parameters declared in the parameters block, on the other hand, are not optional—they are required to ensure support for all parameter values satisfying their constraints. Constraints on transformed data, transformed parameters, and generated quantities are also optional.

数据声明的约束是可选的。另一方面,在参数(parameters)块中声明的约束是不可选的,它们是必需的,以确保所有参数的支撑集满足约束条件。在变换后数据、变换后参数和生成量部分,约束也是可选的。

Identifiability

可识别性

Because softmax is invariant under adding a constant to each component of its input, the model is typically only identified if there is a suitable prior on the coefficients.

因为在向 softmax 输入值的每个分量添加同一个常数时,输出不变,所以模型通常只有系数有合适的先验分布时才有可识别性。

An alternative is to use \((K-1)\)-vectors by fixing one of them to be zero. The partially known parameters section discusses how to mix constants and parameters in a vector. In the multi-logit case, the parameter block would be redefined to use \((K - 1)\)-vectors

另一种方法是通过将其中之一固定为零的方式,来使用 \((K-1)\) -向量。部分已知参数章节讨论了如何在向量中混合常数和参数。在多类别逻辑回归情况下,参数块将被重新定义为使用 \((K-1)\) -向量的形式

parameters {
  matrix[D, K - 1] beta_raw;
}

and then these are transformed to parameters to use in the model. First, a transformed data block is added before the parameters block to define a vector of zero values,

然后这些会被转换为在模型中使用的参数。首先,在参数块之前需要添加一个转换数据块,以定义一个零向量,

transformed data {
  vector[D] zeros = rep_vector(0, D);
}

which can then be appended to beta_raw to produce the coefficient matrix beta,

然后可以将其附加到 beta_raw 上,以生成系数矩阵 beta

transformed parameters {
  matrix[D, K] beta = append_col(beta_raw, zeros);
}

The rep_vector(0, D) call creates a column vector of size D with all entries set to zero. The derived matrix beta is then defined to be the result of appending the vector zeros as a new column at the end of beta_raw; the vector zeros is defined as transformed data so that it doesn’t need to be constructed from scratch each time it is used.

rep_vector(0, D) 的调用创建了一个长度为 D 的列向量,其中所有元素均设为零。然后,通过在 beta_raw 的末尾作为新列附加向量 zeros 的方式,定义矩阵 beta;向量 zeros 被定义为转换数据,这样每次使用时就不需要从头开始构建它。

This is not the same model as using \(K\)-vectors as parameters, because now the prior only applies to \((K-1)\)-vectors. In practice, this will cause the maximum likelihood solutions to be different and also the posteriors to be slightly different when taking priors centered around zero, as is typical for regression coefficients.

这与将 \(K\) -向量作为参数使用的模型不同,因为现在先验分布仅适用于 \((K-1)\) -向量。实际上,这将导致最大似然的解有所不同,并且在采用以零为中心的先验分布(通常用于回归系数)时,得到的后验分布也会略有不同。

1.7 Parameterizing centered vectors

中心化向量的参数化

When there are varying effects in a regression, the resulting likelihood is not identified unless further steps are taken. For example, we might have a global intercept \(\alpha\) and then a varying effect \(\beta_k\) for age group \(k\) to make a linear predictor \(\alpha + \beta_k\). With this predictor, we can add a constant to \(\alpha\) and subtract from each \(\beta_k\) and get exactly the same likelihood.

当回归模型中存在变化效应时,除非采取进一步的措施,否则得到的似然函数是不可识别的。例如,我们可能有一个全局截距 \(\alpha\),然后为每个年龄组 \(k\) 设置一个变化效应 \(\beta_k\),从而构建线性预测器 \(\alpha + \beta_k\)。使用这种预测器时,我们可以为 \(\alpha\) 添加一个常数,并从每个 \(\beta_k\) 中减去该常数,得到的似然函数完全相同。

The traditional approach to identifying such a model is to pin the first varing effect to zero, i.e., \(\beta_1 = 0\). With one of the varying effects fixed, you can no longer add a constant to all of them and the model’s likelihood is identified. In addition to the difficulty in specifying such a model in Stan, it is awkward to formulate priors because the other coefficients are all interpreted relative to \(\beta_1\).

识别此类模型的传统方法是将第一个变化效应固定为零,即 \(\beta_1 = 0\)。通过固定其中一个变化效应,我们不能再为所有的 \(\beta_k\) 添加常数,这将使模型的似然函数变得可识别。然而,除了在 Stan 中指定此类模型的困难外,由于其他系数都是相对于 \(\beta_1\) 解释的,因此设定先验分布也变得复杂。

In a Bayesian setting, a proper prior on each of the \(\beta\) is enough to identify the model. Unfortunately, this can lead to inefficiency during sampling as the model is still only weakly identified through the prior—there is a very simple example of the difference in the discussion of collinearity in the collinearity section.

在贝叶斯框架中,为每个 \(\beta\) 设置适当的先验分布足以识别模型。然而,这可能会导致采样效率低下,因为模型仅通过先验分布弱识别——在共线性章节中有一个非常简单的例子来说明这种差异。

An alternative identification strategy that allows a symmetric prior is to enforce a sum-to-zero constraint on the varying effects, i.e., \(\sum_{k=1}^K \beta_k = 0.\)

另一种允许对称先验的识别策略是对变化效应施加总和为零的约束,即 \(\sum_{k=1}^K \beta_k = 0\)

A parameter vector constrained to sum to zero may also be used to identify a multi-logit regression parameter vector (see the multi-logit section for details), or may be used for ability or difficulty parameters (but not both) in an IRT model (see the item-response model section for details).

一个总和为零的参数向量也可以用于识别多类别 Logistic 回归的参数向量(详见多类别 Logistic 回归章节),或者用于项目-响应理论(IRT)模型中的能力参数或难度参数(但不能同时用于两者,详见项目-响应模型章节)。

Built-in sum-to-zero vector

内置的和为零的向量

As of Stan 2.36, there is a built in sum_to_zero_vector type, which can be used as follows.

从 Stan 2.36 开始,内置了 sum_to_zero_vector 类型,可以按如下方式使用:

parameters {
  sum_to_zero_vector[K] beta;
  // ...
}

This produces a vector of size K such that sum(beta) = 0. In the unconstrained representation requires only K - 1 values because the last is determined by the first K - 1.

这将生成一个大小为 K 的向量,且满足 sum(beta) = 0。在无约束表示中,只需要 K - 1 个值,因为最后一个值由前 K - 1 个值决定。

Placing a prior on beta in this parameterization, for example,

在这种参数化下对 beta 设置先验,例如

  beta ~ normal(0, 1);

leads to a subtly different posterior than what you would get with the same prior on an unconstrained size-K vector. As explained below, the variance is reduced.

会导致后验分布与在无约束的 K 维向量上使用相同先验时略有不同。如下所述,方差会减小。

The sum-to-zero constraint can be implemented naively by setting the last element to the negative sum of the first elements, i.e., \(\beta_K = -\sum_{k=1}^{K-1} \beta_k.\) But that leads to high correlation among the \(\beta_k\).

和为零的约束可以通过简单地将最后一个元素设置为前几个元素的负和来实现,即 \(\beta_K = -\sum_{k=1}^{K-1} \beta_k\)。但这会导致 \(\beta_k\) 之间的高度相关性。

The transform used in Stan eliminates these correlations by constructing an orthogonal basis and applying it to the zero-sum-constraint; Seyboldt (2024) provides an explanation. The Stan Reference Manual provides the details in the chapter on transforms. Although any orthogonal basis can be used, Stan uses the inverse isometric log transform because it is convenient to describe and the transform simplifies to efficient scalar operations rather than more expensive matrix operations.

Stan 中使用的变换通过构建正交基并将其应用于和为零的约束来消除这些相关性;Seyboldt (2024) 提供了相关解释。Stan 参考手册在变换章节中提供了详细信息。尽管可以使用任何正交基,但 Stan 使用逆等距对数变换,因为它便于描述,并且该变换可以简化为高效的标量操作,而不是更昂贵的矩阵操作。

Marginal distribution of sum-to-zero components

和为零的成分的边际分布

On the Stan forums, Aaron Goodman provided the following code to produce a prior with standard normal marginals on the components of beta,

在 Stan 论坛上,Aaron Goodman 提供了以下代码,用于生成在 beta 各分量上具有标准正态边缘分布的先验,

model {
  beta ~ normal(0, inv(sqrt(1 - inv(K))));
  // ...
}

The scale component can be multiplied by sigma to produce a normal(0, sigma) prior marginally.

由于分量总和为零,所以各分量之间不是独立的。由于求和和取负都是线性操作,我们不需要雅可比矩阵(因此雅可比为常数)。

To generate distributions with marginals other than standard normal, the resulting beta may be scaled by some factor sigma and translated to some new location mu.

我们可以通过对 beta 进行平移和伸缩变换(参数分别为 musigma)来生成具有非标准正态边缘分布的先验。

Soft centering

软中心化

Adding a prior such as \(\beta \sim \textsf{normal}(0,\epsilon)\) for a small \(\epsilon\) will provide a kind of soft centering of a parameter vector \(\beta\) by preferring, all else being equal, that \(\sum_{k=1}^K \beta_k = 0\). This approach is only guaranteed to roughly center if \(\beta\) and the elementwise addition \(\beta + c\) for a scalar constant \(c\) produce the same likelihood (perhaps by another vector \(\alpha\) being transformed to \(\alpha - c\), as in the IRT models). This is another way of achieving a symmetric prior, though it requires choosing an \(\epsilon\). If \(\epsilon\) is too large, there won’t be a strong enough centering effect and if it is too small, it will add high curvature to the target density and impede sampling.

相较于使 \(\sum_{k=1}^K \beta_k = 0\)(在其他条件相等的情况下),添加例如 \(\beta \sim \textsf{normal}(0,\sigma)\) 的先验提供了一种对参数向量 \(\beta\) 进行软中心化的方式。只有当 \(\beta\)\(\beta + c\)\(c\)是标量常数)产生相同的似然(可能通过将另一个向量 \(\alpha\) 转化为 \(\alpha - c\),如在 IRT 模型中一样),这种方法才能保证能够大致中心化。这是实现对称先验的另一种方式。

1.8 Ordered logistic and probit regression

有序 Logistic 回归和 Probit 回归

Ordered regression for an outcome \(y_n \in \{ 1, \dotsc, k \}\) with predictors \(x_n \in \mathbb{R}^D\) is determined by a single coefficient vector \(\beta \in \mathbb{R}^D\) along with a sequence of cutpoints \(c \in \mathbb{R}^{K-1}\) sorted so that \(c_d < c_{d+1}\). The discrete output is \(k\) if the linear predictor \(x_n \beta\) falls between \(c_{k-1}\) and \(c_k\), assuming \(c_0 = -\infty\) and \(c_K = \infty\). The noise term is fixed by the form of regression, with examples for ordered logistic and ordered probit models.

有序回归包含输出 \(y_n \in \{ 1, \dotsc, k \}\) 和预测变量 \(x_n \in \mathbb{R}^D\),它由单个系数向量 \(\beta \in \mathbb{R}^D\) 和已排序的截断点序列 \(c \in \mathbb{R}^{K-1}\) 确定,其中 \(c_d < c_{d+1}\)。如果线性预测器 \(x_n \beta\) 的值落在 \(c_{k-1}\)\(c_k\) 之间(假设 \(c_0 = -\infty\)\(c_K = \infty\)),则离散输出为 \(k\)。噪声项由回归模型的形式确定,以有序 Logistic 模型和有序 Probit 模型为例。

Ordered logistic regression

有序 Logistic 回归

The ordered logistic model can be coded in Stan using the ordered data type for the cutpoints and the built-in ordered_logistic distribution.

有序 Logistic 模型可以在 Stan 中使用 ordered 数据类型表示截断点和内置的 ordered_logistic 分布进行编码。

data {
  int<lower=2> K;
  int<lower=0> N;
  int<lower=1> D;
  array[N] int<lower=1, upper=K> y;
  array[N] row_vector[D] x;
}
parameters {
  vector[D] beta;
  ordered[K - 1] c;
}
model {
  for (n in 1:N) {
    y[n] ~ ordered_logistic(x[n] * beta, c);
  }
}

The vector of cutpoints c is declared as ordered[K - 1], which guarantees that c[k] is less than c[k + 1].

截断点向量 c 被声明为 ordered[K - 1],这确保 c[k] 小于 c[k + 1]

If the cutpoints were assigned independent priors, the constraint effectively truncates the joint prior to support over points that satisfy the ordering constraint. Luckily, Stan does not need to compute the effect of the constraint on the normalizing term because the probability is needed only up to a proportion.

如果给截断点分配独立的先验,约束条件会有效地将联合先验截断为满足排序约束的支持点。幸运的是,Stan 不需要计算约束对归一化项的影响,因为只需要计算概率的比例。

Ordered probit

有序 Probit 回归

An ordered probit model could be coded in exactly the same way by swapping the cumulative logistic (inv_logit) for the cumulative normal (Phi).

通过将累积逻辑函数 (inv_logit) 替换为累积正态函数 (Phi),可以以完全相同的方式编码有序 Probit 模型。

data {
  int<lower=2> K;
  int<lower=0> N;
  int<lower=1> D;
  array[N] int<lower=1, upper=K> y;
  array[N] row_vector[D] x;
}
parameters {
  vector[D] beta;
  ordered[K - 1] c;
}
model {
  vector[K] theta;
  for (n in 1:N) {
    real eta;
    eta = x[n] * beta;
    theta[1] = 1 - Phi(eta - c[1]);
    for (k in 2:(K - 1)) {
      theta[k] = Phi(eta - c[k - 1]) - Phi(eta - c[k]);
    }
    theta[K] = Phi(eta - c[K - 1]);
    y[n] ~ categorical(theta);
  }
}

The logistic model could also be coded this way by replacing Phi with inv_logit, though the built-in encoding based on the softmax transform is more efficient and more numerically stable. A small efficiency gain could be achieved by computing the values Phi(eta - c[k]) once and storing them for re-use.

尽管基于 softmax 变换的内置编码更高效且数值上更稳定,Logistic 回归模型也可以通过将 Phi 替换为 inv_logit 这种方式进行编码。通过一次性计算值 Phi(eta - c[k]) 并将其存储以供重复使用,可以获得一定的效率提升。

1.9 Hierarchical regression

分层回归

The simplest multilevel model is a hierarchical model in which the data are grouped into \(L\) distinct categories (or levels). An extreme approach would be to completely pool all the data and estimate a common vector of regression coefficients \(\beta\). At the other extreme, an approach with no pooling assigns each level \(l\) its own coefficient vector \(\beta_l\) that is estimated separately from the other levels. A hierarchical model is an intermediate solution where the degree of pooling is determined by the data and a prior on the amount of pooling.

最简单的多层次模型是分层模型,它将数据划分为 \(L\) 个不同的类别(或层级)。一种极端的方法是完全汇集所有数据并估计一个共同的回归系数向量 \(\beta\)。另一种极端的方法是不进行汇集,为每个层级 \(l\) 分别进行估计并分配其系数向量 \(\beta_l\)。分层模型是一种折中方案,其汇集程度由数据和汇集程度的先验分布共同决定。

Suppose each binary outcome \(y_n \in \{ 0, 1 \}\) has an associated level, \(ll_n \in \{ 1, \dotsc, L \}\). Each outcome will also have an associated predictor vector \(x_n \in \mathbb{R}^D\). Each level \(l\) gets its own coefficient vector \(\beta_l \in \mathbb{R}^D\). The hierarchical structure involves drawing the coefficients \(\beta_{l,d} \in \mathbb{R}\) from a prior that is also estimated with the data. This hierarchically estimated prior determines the amount of pooling. If the data in each level are similar, strong pooling will be reflected in low hierarchical variance. If the data in the levels are dissimilar, weaker pooling will be reflected in higher hierarchical variance.

假设每个二元结果 \(y_n \in \{0, 1\}\) 都对应一个层级 \(ll_n \in \{1, \dots, L\}\)。每个结果还将有一个相关的预测向量 \(x_n \in \mathbb{R}^D\)。每个层级 \(l\) 都有自己的系数向量 \(\beta_l \in \mathbb{R}^D\)。分层结构通过从先验分布中抽取系数 \(\beta_{l,d} \in \mathbb{R}\) 来实现,该先验分布也由数据估计得到。这种分层估计的先验分布决定了汇集的强度。如果每个层级中的数据相似,强汇集将反映在较低的层次方差中。如果层级中的数据不同,较弱的汇集将反映在较高的层次方差中。

The following model encodes a hierarchical logistic regression model with a hierarchical prior on the regression coefficients.

以下模型编码了一个具有分层先验的分层 Logistic 回归模型。

data {
  int<lower=1> D;
  int<lower=0> N;
  int<lower=1> L;
  array[N] int<lower=0, upper=1> y;
  array[N] int<lower=1, upper=L> ll;
  array[N] row_vector[D] x;
}
parameters {
  array[D] real mu;
  array[D] real<lower=0> sigma;
  array[L] vector[D] beta;
}
model {
  for (d in 1:D) {
    mu[d] ~ normal(0, 100);
    for (l in 1:L) {
      beta[l, d] ~ normal(mu[d], sigma[d]);
    }
  }
  for (n in 1:N) {
    y[n] ~ bernoulli(inv_logit(x[n] * beta[ll[n]]));
  }
}

The standard deviation parameter sigma gets an implicit uniform prior on \((0,\infty)\) because of its declaration with a lower-bound constraint of zero. Stan allows improper priors as long as the posterior is proper. Nevertheless, it is usually helpful to have informative or at least weakly informative priors for all parameters; see the regression priors section for recommendations on priors for regression coefficients and scales.

由于标准差参数 sigma 在声明中设定了下界为零,因此它隐式地获得了 \((0, \infty)\) 上的均匀先验分布。Stan 允许使用非正规先验分布,只要后验分布是正规的。然而,通常情况下,为所有参数提供有信息或至少弱信息的先验分布是有帮助的;有关回归系数和尺度的先验分布建议,请参阅回归先验章节

Optimizing the model

模型优化

Where possible, vectorizing distribution statements leads to faster log probability and derivative evaluations. The speed boost is not because loops are eliminated, but because vectorization allows sharing subcomputations in the log probability and gradient calculations and because it reduces the size of the expression tree required for gradient calculations.

在可能的情况下,将分布语句向量化可以显著提升对数概率和导数计算的效率。速度的提升并非源于消除循环,而是因为向量化使得对数概率和梯度计算能够共享子计算过程,同时减少了梯度计算所需的表达式树规模。

The first optimization vectorizes the for-loop over D as

第一个优化是将对 D 的循环向量化,如下所示:

mu ~ normal(0, 100);
for (l in 1:L) {
  beta[l] ~ normal(mu, sigma);
}

The declaration of beta as an array of vectors means that the expression beta[l] denotes a vector. Although beta could have been declared as a matrix, an array of vectors (or a two-dimensional array) is more efficient for accessing rows; see the indexing efficiency section for more information on the efficiency tradeoffs among arrays, vectors, and matrices.

beta 声明为向量数组意味着表达式 beta[l] 表示一个向量。尽管 beta 可以声明为一个矩阵,但是对于访问行而言,向量数组(或二维数组)更高效;有关数组、向量和矩阵之间的效率权衡的更多信息,请参阅索引效率部分

This model can be further sped up and at the same time made more arithmetically stable by replacing the application of inverse-logit inside the Bernoulli distribution with the logit-parameterized Bernoulli,5

通过用对数几率参数化的伯努利分布(logit-parameterized Bernoulli)替代伯努利分布中的逆 logit(inverse-logit)变换,可以进一步提升模型的计算效率并增强算法的稳定性。6

for (n in 1:N) {
  y[n] ~ bernoulli_logit(x[n] * beta[ll[n]]);
}

Unlike in R or BUGS, loops, array access and assignments are fast in Stan because they are translated directly to C++. In most cases, the cost of allocating and assigning to a container is more than made up for by the increased efficiency due to vectorizing the log probability and gradient calculations. Thus the following version is faster than the original formulation as a loop over a distribution statement.

与 R 或 BUGS 不同,Stan 中的循环、数组访问和赋值操作非常高效,因为它们会被直接转换为 C++ 代码。在大多数情况下,分配和赋值容器的成本被向量化的对数概率和梯度计算带来的效率提升所抵消。因此,以下版本比原始的循环采样语句更快。

{
  vector[N] x_beta_ll;
  for (n in 1:N) {
    x_beta_ll[n] = x[n] * beta[ll[n]];
  }
  y ~ bernoulli_logit(x_beta_ll);
}

The brackets introduce a new scope for the local variable x_beta_ll; alternatively, the variable may be declared at the top of the model block.

括号引入了一个新的作用域,用于局部变量 x_beta_ll;或者,该变量可以在模型块的顶部声明。

In some cases, such as the above, the local variable assignment leads to models that are less readable. The recommended practice in such cases is to first develop and debug the more transparent version of the model and only work on optimizations when the simpler formulation has been debugged.

在某些情况下(如上述示例),使用局部变量赋值可能会降低模型的可读性。在这种情况下,建议的做法是首先开发和调试更简单、更透明的模型版本,只有在简单版本调试完成后才进行优化。

1.10 Hierarchical priors

分层先验

Priors on priors, also known as “hyperpriors,” should be treated the same way as priors on lower-level parameters in that as much prior information as is available should be brought to bear. Because hyperpriors often apply to only a handful of lower-level parameters, care must be taken to ensure the posterior is both proper and not overly sensitive either statistically or computationally to wide tails in the priors.

关于先验的先验(即超先验)应当与低层参数的先验同等对待,尽可能充分利用所有可用的先验信息。由于超先验通常仅适用于少数低层参数,因此需要特别注意确保后验分布既合理,又不会对先验分布中厚尾部分的统计特性或计算过程过于敏感。

Boundary-avoiding priors for MLE in hierarchical models

为层次模型中的最大似然估计设置避免边界的先验分布

The fundamental problem with maximum likelihood estimation (MLE) in the hierarchical model setting is that as the hierarchical variance drops and the values cluster around the hierarchical mean, the overall density grows without bound. As an illustration, consider a simple hierarchical linear regression (with fixed prior mean) of \(y_n \in \mathbb{R}\) on \(x_n \in \mathbb{R}^K\), formulated as

在分层模型中,极大似然估计面临的根本问题是:随着分层方差的减小和参数值向分层均值聚集,整体密度会无限增大。以一个简单的分层线性回归模型(具有固定的先验均值)作为例子,假设有 \(y_n \in \mathbb{R}\)\(x_n \in \mathbb{R}^K\),模型可以表述为

\[\begin{align*} y_n & \sim \textsf{normal}(x_n \beta, \sigma) \\ \beta_k & \sim \textsf{normal}(0,\tau) \\ \tau & \sim \textsf{Cauchy}(0,2.5) \\ \end{align*}\]

In this case, as \(\tau \rightarrow 0\) and \(\beta_k \rightarrow 0\), the posterior density

在这种情况下,当 \(\tau \rightarrow 0\)\(\beta_k \rightarrow 0\) 时,后验密度

\[ p(\beta,\tau,\sigma|y,x) \propto p(y|x,\beta,\tau,\sigma) \] grows without bound. See the Neal’s funnel density, which has similar behavior.

将无限增大。参考 Neal 的漏斗状密度函数,其表现类似。

There is obviously no MLE estimate for \(\beta,\tau,\sigma\) in such a case, and therefore the model must be modified if posterior modes are to be used for inference. The approach recommended by Chung et al. (2013) is to use a gamma distribution as a prior, such as

显然,在这种情况下 \(\beta,\tau,\sigma\) 不存在极大似然估计,因此若想使用后验模型进行推断,必须对模型进行修改。根据 Chung et al. (2013) 推荐的方法,可以使用伽马分布作为 \(\sigma\) 的先验,例如:

\[ \sigma \sim \textsf{Gamma}(2, 1/A), \]

for a reasonably large value of \(A\), such as \(A = 10\).

这里,\(A\) 可以选择一个相对较大的值,比如 \(A = 10\)

1.11 Item-response theory models

项目-响应理论模型

Item-response theory (IRT) models the situation in which a number of students each answer one or more of a group of test questions. The model is based on parameters for the ability of the students, the difficulty of the questions, and in more articulated models, the discriminativeness of the questions and the probability of guessing correctly; see Gelman and Hill (2007, pps. 314–320) for a textbook introduction to hierarchical IRT models and Curtis (2010) for encodings of a range of IRT models in BUGS.

项目-响应理论(Item-Response Theory,IRT)模型用于描述多个学生对一组测试题目中一个或多个问题的回答情况。该模型基于学生的能力、题目的难度等参数,在更复杂的模型中还会考虑题目的区分度和猜测正确的概率。有关分层 IRT 模型的教材介绍,请参考 Gelman and Hill (2007, pps. 314–320),而关于在 BUGS 中编码各种 IRT 模型的内容,请参考 Curtis (2010)

Data declaration with missingness

带有缺失值的数据声明

The data provided for an IRT model may be declared as follows to account for the fact that not every student is required to answer every question.

考虑到不是每个学生都需要回答每个问题,可以将提供的 IRT 模型数据声明如下所示:

data {
  int<lower=1> J;                     // number of students
  int<lower=1> K;                     // number of questions
  int<lower=1> N;                     // number of observations
  array[N] int<lower=1, upper=J> jj;  // student for observation n
  array[N] int<lower=1, upper=K> kk;  // question for observation n
  array[N] int<lower=0, upper=1> y;   // correctness for observation n
}

This declares a total of N student-question pairs in the data set, where each n in 1:N indexes a binary observation y[n] of the correctness of the answer of student jj[n] on question kk[n].

这里声明了数据集中的 N 个学生-问题对,其中每个 n\(1 \leq n \leq N\))对应一个二元观测值 y[n],表示学生 jj[n] 对问题 kk[n] 的回答是否正确。

The prior hyperparameters will be hard coded in the rest of this section for simplicity, though they could be coded as data in Stan for more flexibility.

在本节剩余部分中,为简化说明,先验超参数将被硬编码,但在 Stan 中它们也可以作为数据进行编码,以提高灵活性。

1PL (Rasch) model

单参数 Logistic(Rasch)模型

The 1PL item-response model, also known as the Rasch model, has one parameter (1P) for questions and uses the logistic link function (L).

单参数 Logistic(1PL)项目-响应模型,也称为 Rasch 模型,它假设每个问题只有一个参数(1P),并使用 Logistic 链接函数(L)。

The model parameters are declared as follows.

模型参数的声明如下:

parameters {
  real delta;            // mean student ability
  array[J] real alpha;   // ability of student j - mean ability
  array[K] real beta;    // difficulty of question k
}

The parameter alpha[J] is the ability coefficient for student j and beta[k] is the difficulty coefficient for question k. The non-standard parameterization used here also includes an intercept term delta, which represents the average student’s response to the average question.7

参数 alpha[J] 表示学生 j 的能力系数,beta[k] 表示问题 k 的难度系数。这里采用的非标准参数化还包括截距项 delta,用于表示平均水平学生对平均难度问题的响应。8

The model itself is as follows.

模型具体如下:

model {
  alpha ~ std_normal();         // informative true prior
  beta ~ std_normal();          // informative true prior
  delta ~ normal(0.75, 1);      // informative true prior
  for (n in 1:N) {
    y[n] ~ bernoulli_logit(alpha[jj[n]] - beta[kk[n]] + delta);
  }
}

This model uses the logit-parameterized Bernoulli distribution, where

该模型采用了对数几率参数化的伯努利分布,其定义为

\[ \texttt{bernoulli}\mathtt{\_}\texttt{logit}\left(y \mid \alpha\right) = \texttt{bernoulli}\left(y \mid \operatorname{logit}^{-1}(\alpha)\right). \]

The key to understanding it is the term inside the bernoulli_logit distribution, from which it follows that

理解该模型的关键在于 bernoulli_logit 分布中的项,由此可得

\[ \Pr[y_n = 1] = \operatorname{logit}^{-1}\left(\alpha_{jj[n]} - \beta_{kk[n]} + \delta\right). \]

The model suffers from additive identifiability issues without the priors. For example, adding a term \(\xi\) to each \(\alpha_j\) and \(\beta_k\) results in the same predictions. The use of priors for \(\alpha\) and \(\beta\) located at 0 identifies the parameters; see Gelman and Hill (2007) for a discussion of identifiability issues and alternative approaches to identification.

若不使用先验分布,该模型将面临加法可识别性问题。例如,给每个 \(\alpha_j\)\(\beta_k\) 添加一个项 \(\xi\) 将导致相同的预测结果。通过为 \(\alpha\)\(\beta\) 设置以0为中心的先验分布可以解决这个问题。有关可识别性问题和可替代的识别方法,可以参考 Gelman and Hill (2007) 中的讨论。

For testing purposes, the IRT 1PL model distributed with Stan uses informative priors that match the actual data generation process used to simulate the data in R (the simulation code is supplied in the same directory as the models). This is unrealistic for most practical applications, but allows Stan’s inferences to be validated. A simple sensitivity analysis with fatter priors shows that the posterior is fairly sensitive to the prior even with 400 students and 100 questions and only 25% missingness at random. For real applications, the priors should be fit hierarchically along with the other parameters, as described in the next section.

出于测试目的,Stan 中提供的 IRT 1PL 模型使用了有信息的先验分布,这些先验与 R 中用于模拟数据的实际数据生成过程相匹配(模拟代码与模型位于同一目录下)。虽然这对大多数实际应用来说并不现实,但可以用于验证 Stan 的推断结果。通过使用更宽松的先验进行简单的敏感性分析可以发现,即使在拥有400名学生和100道题目,且仅有25%随机缺失数据的情况下,后验分布对先验仍然相当敏感。在实际应用中,应该将先验分布与其他参数一起进行分层拟合,这将在后续章节中详细讨论。

Multilevel 2PL model

多层两参数 Logistic 模型

The simple 1PL model described in the previous section is generalized in this section with the addition of a discrimination parameter to model how noisy a question is and by adding multilevel priors for the question difficulty and discrimination parameters. The model parameters are declared as follows.

在本节中,我们将对上一节中描述的简单 1PL 模型进行推广,通过添加一个区分度参数来模拟问题的噪声程度,并为问题的难度和区分度参数添加多层先验。模型的参数声明如下:

parameters {
  real mu_beta;                // mean question difficulty
  vector[J] alpha;             // ability for j - mean
  vector[K] beta;              // difficulty for k
  vector<lower=0>[K] gamma;    // discrimination of k
  real<lower=0> sigma_beta;    // scale of difficulties
  real<lower=0> sigma_gamma;   // scale of log discrimination
}

The parameters should be clearer after the model definition.

在定义模型后,参数将更加清晰。

model {
  alpha ~ std_normal();
  beta ~ normal(0, sigma_beta);
  gamma ~ lognormal(0, sigma_gamma);
  mu_beta ~ cauchy(0, 5);
  sigma_beta ~ cauchy(0, 5);
  sigma_gamma ~ cauchy(0, 5);
  y ~ bernoulli_logit(gamma[kk] .* (alpha[jj] - (beta[kk] + mu_beta)));
}

The std_normal function is used here, defined by

在这里使用了std_normal函数,其定义为

\[ \texttt{std}\mathtt{\_}\texttt{normal}(y) = \textsf{normal}\left(y \mid 0, 1\right). \]

The distribution statement is also vectorized using elementwise multiplication; it is equivalent to

采样语句还使用了逐元素乘法进行向量化处理;其等价于以下形式:

for (n in 1:N) {
  y[n] ~ bernoulli_logit(gamma[kk[n]]
                         * (alpha[jj[n]] - (beta[kk[n]] + mu_beta));
}

The 2PL model is similar to the 1PL model, with the additional parameter gamma[k] modeling how discriminative question k is. If gamma[k] is greater than 1, responses are more attenuated with less chance of getting a question right at random. The parameter gamma[k] is constrained to be positive, which prohibits there being questions that are easier for students of lesser ability; such questions are not unheard of, but they tend to be eliminated from most testing situations where an IRT model would be applied.

2PL 模型与 1PL 模型类似,但额外引入了参数 gamma[k] 来衡量问题 k 的区分度。如果 gamma[k] 大于1,则答案更加模糊,随机猜对的几率较低。我们限制参数 gamma[k] 为正数,这禁止了存在对低能力学生来说更容易的问题;虽然这种问题并非罕见,但它们在大多数应用 IRT 模型的考试情境中会被剔除。

The model is parameterized here with student abilities alpha being given a standard normal prior. This is to identify both the scale and the location of the parameters, both of which would be unidentified otherwise; see the problematic posteriors chapter for further discussion of identifiability. The difficulty and discrimination parameters beta and gamma then have varying scales given hierarchically in this model. They could also be given weakly informative non-hierarchical priors, such as

在此模型中,学生能力 alpha 的先验分布为标准正态分布。这样做是为了确定参数的尺度和位置,否则它们将无法确定;有关可识别性的进一步讨论,请参阅问题性后验分布章节。难度参数 beta 和区分度参数 gamma 在此模型中具有层次化的不同尺度。它们也可以给出弱信息的非层次化先验,例如

beta ~ normal(0, 5);
gamma ~ lognormal(0, 2);

The point is that the alpha determines the scale and location and beta and gamma are allowed to float.

关键在于 alpha 确定了尺度和位置,而 betagamma 可以浮动。

The beta parameter is here given a non-centered parameterization, with parameter mu_beta serving as the mean beta location. An alternative would’ve been to take:

在这里,beta 参数采用了非中心参数化,其中参数 mu_beta 作为 beta 位置的均值。另一种选择是采用以下形式:

beta ~ normal(mu_beta, sigma_beta);

and

y[n] ~ bernoulli_logit(gamma[kk[n]] * (alpha[jj[n]] - beta[kk[n]]));

Non-centered parameterizations tend to be more efficient in hierarchical models; see the reparameterization section for more information on non-centered reparameterizations.

非中心参数化在分层模型中往往更高效;有关非中心重参数化的更多信息,请参阅重参数化部分

The intercept term mu_beta can’t itself be modeled hierarchically, so it is given a weakly informative \(\textsf{Cauchy}(0,5)\) prior. Similarly, the scale terms, sigma_beta, and sigma_gamma, are given half-Cauchy priors. As mentioned earlier, the scale and location for alpha are fixed to ensure identifiability. The truncation in the half-Cauchy prior is implicit; explicit truncation is not necessary because the log probability need only be calculated up to a proportion and the scale variables are constrained to \((0,\infty)\) by their declarations.

截距项 mu_beta 本身不能被分层建模,因此它被赋予了一个弱信息先验 \(\textsf{Cauchy}(0,5)\)。类似地,尺度项 sigma_betasigma_gamma 被赋予半柯西先验。如上所述,为了确保可识别性,alpha 的尺度和位置是固定的。半柯西先验的截断是隐式的;不需要显式截断,因为对数概率只需计算比例,并且尺度变量的声明将其约束在 \((0,\infty)\) 范围内。

1.12 Priors for identifiability

可识别先验

Location and scale invariance

位置和尺度不变性

One application of (hierarchical) priors is to identify the scale and/or location of a group of parameters. For example, in the IRT models discussed in the previous section, there is both a location and scale non-identifiability. With uniform priors, the posteriors will float in terms of both scale and location. See the collinearity section for a simple example of the problems this poses for estimation.

(分层)先验的一个应用是识别一组参数的尺度和/或位置。例如,在前面讨论的 IRT 模型中,存在尺度和位置的不可识别问题。如果使用均匀先验,后验将在尺度和位置上浮动。请参考共线性章节中对此估计问题的简单示例。

The non-identifiability is resolved by providing a standard normal (i.e., \(\textsf{normal}(0,1)\)) prior on one group of coefficients, such as the student abilities. With a standard normal prior on the student abilities, the IRT model is identified in that the posterior will produce a group of estimates for student ability parameters that have a sample mean of close to zero and a sample variance of close to one. The difficulty and discrimination parameters for the questions should then be given a diffuse, or ideally a hierarchical prior, which will identify these parameters by scaling and locating relative to the student ability parameters.

这种不可识别问题可以通过为一组系数(如学生能力)提供标准正态(即 \(\textsf{normal}(0,1)\))先验来解决。对于学生能力参数,使用标准正态先验可以使 IRT 模型变成可识别,因为后验将产生一个对学生能力参数的估计组,其样本均值接近于零,样本方差接近于一。然后,对于问题难度和区分度参数,应该给予一个模糊的,或者理想情况下的分层先验,通过相对于学生能力参数的尺度和位置来确定这些参数。

Collinearity

共线性

Another case in which priors can help provide identifiability is in the case of collinearity in a linear regression. In linear regression, if two predictors are collinear (i.e, one is a linear function of the other), then their coefficients will have a correlation of 1 (or -1) in the posterior. This leads to non-identifiability. By placing normal priors on the coefficients, the maximum likelihood solution of two duplicated predictors (trivially collinear) will be half the value than would be obtained by only including one.

另一个可以通过先验提供可识别性的情况是线性回归中的共线性问题。在线性回归中,如果两个预测变量存在共线性(即一个是另一个的线性函数),那么它们的系数在后验中将具有相关性为1(或-1)。这会导致不可识别的问题。通过对系数施加正态先验,对于两个重复的预测变量(明显共线),最大似然解将是只包含一个预测变量时解的一半。

Separability

可分性

In a logistic regression, if a predictor is positive in cases of 1 outcomes and negative in cases of 0 outcomes, then the maximum likelihood estimate for the coefficient for that predictor diverges to infinity. This divergence can be controlled by providing a prior for the coefficient, which will “shrink” the estimate back toward zero and thus identify the model in the posterior.

在 Logistic 回归中,如果一个预测变量在输出为1的情况下为正,而在输出为0的情况下为负,则该预测变量的最大似然估计会发散到无穷大。通过为该系数提供先验可以控制这种发散,使估计值向零收缩,从而在后验中识别模型。

Similar problems arise for sampling with improper flat priors. The sampler will try to draw large values. By providing a prior, the posterior will be concentrated around finite values, leading to well-behaved sampling.

类似的问题也出现在使用非正规的均匀先验进行抽样时。采样器会试图抽取大的数值。通过提供先验,后验将集中在有限的值周围,从而实现良好的采样效果。

1.13 Multivariate priors for hierarchical models

分层模型的多元先验

In hierarchical regression models (and other situations), several individual-level variables may be assigned hierarchical priors. For example, a model with multiple varying intercepts and slopes within might assign them a multivariate prior.

在分层回归模型(以及其他情况)中,多个个体级别的变量可以被赋予分层先验。例如,在一个具有多个变化截距和斜率的模型中,可以为这些截距与斜率参数指定一个多元分布作为先验。

As an example, the individuals might be people and the outcome income, with predictors such as education level and age, and the groups might be states or other geographic divisions. The effect of education level and age as well as an intercept might be allowed to vary by state. Furthermore, there might be state-level predictors, such as average state income and unemployment level.

举个例子,个体可以是人,结果变量可以是收入,预测变量可以包括教育水平和年龄,组可以是州或其他地理划分。教育水平和年龄的效应以及截距可能会因州而异。此外,州级别可能还有其他预测变量,比如平均州收入和失业率。

Multivariate regression example

多元回归例子

Gelman and Hill (2007, chap. 13, Chapter 17) provide a discussion of a hierarchical model with \(N\) individuals organized into \(J\) groups. Each individual has a predictor row vector \(x_n\) of size \(K\); to unify the notation, they assume that \(x_{n,1} = 1\) is a fixed “intercept” predictor. To encode group membership, they assume individual \(n\) belongs to group \(jj[n] \in \{ 1, \dotsc, J \}\). Each individual \(n\) also has an observed outcome \(y_n\) taking on real values.

Gelman and Hill (2007, chap. 13, Chapter 17) 讨论了一个具有 \(N\) 个个体并分成 \(J\) 组的分层模型。每个个体都有一个大小为 \(K\) 的预测行向量 \(x_n\);为了统一符号,他们假设 \(x_{n,1} = 1\) 是一个固定的“截距”预测变量。为了编码群体成员关系,他们假设个体 \(n\) 属于群体 \(jj[n] \in \{ 1, \dotsc, J \}\)。每个个体 \(n\) 还具有一个取实值的观测结果 \(y_n\)

Data model

数据模型

The model is a linear regression with slope and intercept coefficients varying by group, so that \(\beta_j\) is the coefficient \(K\)-vector for group \(j\). The data model for individual \(n\) is then just

该模型是一个具有不同群体间斜率和截距系数的线性回归模型,其中 \(\beta_j\) 是第 \(j\) 组的回归系数向量,大小为 \(K\)。对于个体 \(n\),其数据模型可以表示为:

\[ y_n \sim \textsf{normal}(x_n \, \beta_{jj[n]}, \, \sigma) \quad\text{for}\quad n \in \{ 1, \dotsc, N \}. \]

Coefficient prior

系数先验

Gelman and Hill model the coefficient vectors \(\beta_j\) as being drawn from a multivariate distribution with mean vector \(\mu\) and covariance matrix \(\Sigma\),

Gelman 和 Hill 将系数向量 \(\beta_j\) 建模为从多元正态分布中采样得到,该分布有均值向量为 \(\mu\),协方差矩阵为 \(\Sigma\),即

\[ \beta_j \sim \textsf{multivariate normal}(\mu_j, \, \Sigma) \quad\text{for}\quad j \in \{ 1, \dotsc, J \}. \]

Below, we discuss the full model of Gelman and Hill, which uses group-level predictors to model \(\mu\); for now, we assume \(\mu\) is a simple vector parameter.

接下来,我们将讨论 Gelman 和 Hill 的完整模型,该模型使用组级预测变量来建模 \(\mu\);在此之前,我们假设 \(\mu\) 是一个简单的向量参数。

Hyperpriors

超先验

For hierarchical modeling, the group-level mean vector \(\mu\) and covariance matrix \(\Sigma\) must themselves be given priors. The group-level mean vector can be given a reasonable weakly-informative prior for independent coefficients, such as

在分层建模中,群组水平的均值向量 \(\mu\) 和协方差矩阵 \(\Sigma\) 本身也必须被赋予先验分布。我们可以个群组水平的均值向量指定合理的弱信息先验,比如独立系数的情况下可以指定

\[ \mu_j \sim \textsf{normal}(0,5). \] If more is known about the expected coefficient values \(\beta_{j, k}\), this information can be incorporated into the prior for \(\mu_j\).

如果对于预期系数值 \(\beta_{j, k}\) 有更多的了解,可以将这些信息纳入 \(\mu_j\) 的先验中。

For the prior on the covariance matrix, Gelman and Hill suggest using a scaled inverse Wishart. That choice was motivated primarily by convenience as it is conjugate to the multivariate likelihood function and thus simplifies Gibbs sampling

对于协方差矩阵的先验,Gelman 和 Hill 建议使用伸缩逆 Wishart 分布。这个选择主要是出于方便考虑,因为它与多变量似然函数共轭,从而简化了 Gibbs 采样过程。

In Stan, there is no restriction to conjugacy for multivariate priors, and we in fact recommend a slightly different approach. Like Gelman and Hill, we decompose our prior into a scale and a matrix, but are able to do so in a more natural way based on the actual variable scales and a correlation matrix. Specifically, we define

在 Stan 中,对于多元先验,没有限制于共轭性,事实上我们推荐稍微不同的方法。与 Gelman 和 Hill 类似,我们将先验分解为一个尺度参数和一个矩阵,但我们能够更自然地基于实际变量尺度和相关矩阵来进行分解。具体来说,我们定义

\[ \Sigma = \texttt{diag}\mathtt{\_}\texttt{matrix}(\tau) \times \Omega \times \texttt{diag}\mathtt{\_}\texttt{matrix}(\tau), \] where \(\Omega\) is a correlation matrix and \(\tau\) is the vector of coefficient scales. This mapping from scale vector \(\tau\) and correlation matrix \(\Omega\) can be inverted, using

其中 \(\Omega\) 是一个相关矩阵,\(\tau\) 是系数尺度的向量。这种从尺度向量 \(\tau\) 和相关矩阵 \(\Omega\) 的映射可以被反转,使用以下公式:

\[ \tau_k = \sqrt{\Sigma_{k,k}} \quad\textsf{and}\quad \Omega_{i, j} = \frac{\Sigma_{i, j}}{\tau_i \, \tau_j}. \]

The components of the scale vector \(\tau\) can be given any reasonable prior for scales, but we recommend something weakly informative like a half-Cauchy distribution with a small scale, such as

尺度向量 \(\tau\) 的组成部分可以使用任何合理的尺度先验,但我们推荐使用一些弱信息的先验,比如尺度较小的半柯西分布,例如

\[ \tau_k \sim \textsf{Cauchy}(0, 2.5) \quad\text{for}\quad k \in \{ 1, \dotsc, K \} \quad\text{constrained\ by}\quad \tau_k > 0. \] As for the prior means, if there is information about the scale of variation of coefficients across groups, it should be incorporated into the prior for \(\tau\). For large numbers of exchangeable coefficients, the components of \(\tau\) itself (perhaps excluding the intercept) may themselves be given a hierarchical prior.

至于先验均值,如果存在关于不同组之间系数变化程度的信息,应该将其纳入 \(\tau\) 的先验假设中。对于大量可交换系数的情况,\(\tau\) 的组成部分本身(可能不包括截距项)也可以给予一个分层先验。

Our final recommendation is to give the correlation matrix \(\Omega\) an LKJ prior with shape \(\eta \geq 1\),9

我们最后的建议是给予相关矩阵 \(\Omega\) 一个 LKJ 先验,其中形状参数 \(\eta \geq 1\)10

\[ \Omega \sim \textsf{LKJCorr}(\eta). \]

The LKJ correlation distribution is defined by

LKJ 相关分布的定义如下:

\[ \textsf{LKJCorr}\left(\Sigma \mid \eta\right) \propto \operatorname{det}\left(\Sigma\right)^{\eta - 1}. \]

The basic behavior of the LKJ correlation distribution is similar to that of a beta distribution. For \(\eta = 1\), the result is a uniform distribution. Despite being the identity over correlation matrices, the marginal distribution over the entries in that matrix (i.e., the correlations) is not uniform between -1 and 1. Rather, it concentrates around zero as the dimensionality increases due to the complex constraints.

LKJ 相关分布的基本性质类似于 Beta 分布。当 \(\eta = 1\) 时,其结果是均匀分布。尽管它在相关矩阵上是一个恒等变换,但该矩阵的元素(即相关系数)上的边际分布在-1和1之间并不均匀。相反,由于复杂的约束条件,随着维度的增加,它会集中在零附近。

For \(\eta > 1\), the density increasingly concentrates mass around the unit matrix, i.e., favoring less correlation. For \(\eta < 1\), it increasingly concentrates mass in the other direction, i.e., favoring more correlation.

对于 \(\eta > 1\),概率密度越来越集中在单位矩阵附近,即更倾向于较低的相关性。对于 \(\eta < 1\),概率密度越来越集中在另一个方向,即更倾向于较高的相关性。

The LKJ prior may thus be used to control the expected amount of correlation among the parameters \(\beta_j\). For a discussion of decomposing a covariance prior into a prior on correlation matrices and an independent prior on scales, see Barnard, McCulloch, and Meng (2000).

因此,LKJ 先验可以用于控制参数 \(\beta_j\) 之间的预期相关性程度。有关将协方差先验分解为相关矩阵的先验和独立的尺度先验的讨论,请参考 Barnard, McCulloch, and Meng (2000)

Group-level predictors for prior mean

先验均值的群组级别预测变量

To complete Gelman and Hill’s model, suppose each group \(j \in \{ 1, \dotsc, J \}\) is supplied with an \(L\)-dimensional row-vector of group-level predictors \(u_j\). The prior mean for the \(\beta_j\) can then itself be modeled as a regression, using an \(L\)-dimensional coefficient vector \(\gamma\). The prior for the group-level coefficients then becomes

为了完善 Gelman 和 Hill 的模型,假设每个组 \(j \in \{ 1, \dotsc, J \}\) 都有一个 \(L\) 维的行向量组级预测变量 \(u_j\)。然后,\(\beta_j\) 的先验均值本身可以使用一个 \(L\) 维系数向量 \(\gamma\) 建模为一个回归模型。组级系数的先验分布则变为

\[ \beta_j \sim \textsf{multivariate normal}(u_j \, \gamma, \Sigma) \]

The group-level coefficients \(\gamma\) may themselves be given independent weakly informative priors, such as

组级系数 \(\gamma\) 本身可以被赋予独立的弱信息先验,例如

\[ \gamma_l \sim \textsf{normal}(0,5). \] As usual, information about the group-level means should be incorporated into this prior.

通常情况下,关于组级均值的信息应该被纳入这个先验中。

Coding the model in Stan

在 Stan 中建模

The Stan code for the full hierarchical model with multivariate priors on the group-level coefficients and group-level prior means follows its definition.

下面完整的分层模型的 Stan 代码,根据模型定义给组级系数和组级先验均值设置多元先验。

data {
  int<lower=0> N;              // num individuals
  int<lower=1> K;              // num ind predictors
  int<lower=1> J;              // num groups
  int<lower=1> L;              // num group predictors
  array[N] int<lower=1, upper=J> jj;  // group for individual
  matrix[N, K] x;              // individual predictors
  array[J] row_vector[L] u;    // group predictors
  vector[N] y;                 // outcomes
}
parameters {
  corr_matrix[K] Omega;        // prior correlation
  vector<lower=0>[K] tau;      // prior scale
  matrix[L, K] gamma;          // group coeffs
  array[J] vector[K] beta;     // indiv coeffs by group
  real<lower=0> sigma;         // prediction error scale
}
model {
  tau ~ cauchy(0, 2.5);
  Omega ~ lkj_corr(2);
  to_vector(gamma) ~ normal(0, 5);
  {
    array[J] row_vector[K] u_gamma;
    for (j in 1:J) {
      u_gamma[j] = u[j] * gamma;
    }
    beta ~ multi_normal(u_gamma, quad_form_diag(Omega, tau));
  }
  for (n in 1:N) {
    y[n] ~ normal(x[n] * beta[jj[n]], sigma);
  }
}

The hyperprior covariance matrix is defined implicitly through the quadratic form in the code because the correlation matrix Omega and scale vector tau are more natural to inspect in the output; to output Sigma, define it as a transformed parameter. The function quad_form_diag is defined so that quad_form_diag(Sigma, tau) is equivalent to diag_matrix(tau) * Sigma * diag_matrix(tau), where diag_matrix(tau) returns the matrix with tau on the diagonal and zeroes off diagonal; the version using quad_form_diag should be faster. For details on these and other matrix arithmetic operators and functions, see the function reference manual.

超先验协方差矩阵通过代码中的二次型隐式定义,因为在输出时,相关矩阵 Omega 和尺度向量 tau 更容易检查;为了输出 Sigma,将其定义为一个转换后的参数。函数 quad_form_diag 的定义是使得 quad_form_diag(Sigma, tau) 等同于 diag_matrix(tau) * Sigma * diag_matrix(tau),其中 diag_matrix(tau) 返回以 tau 为对角线元素,零为非对角线元素的矩阵;使用 quad_form_diag 版本应该更快。有关这些和其他矩阵运算符和函数的详细信息,请参阅函数参考手册。

Optimization through vectorization

向量化优化

The code in the Stan program above can be sped up dramatically by replacing the the distribution statement inside the for loop:

通过替换循环中的分布语句,可以显著提升上述 Stan 程序的执行速度:

for (n in 1:N) {
  y[n] ~ normal(x[n] * beta[jj[n]], sigma);
}

with the vectorized distribution statement:

其向量化形式为:

{
  vector[N] x_beta_jj;
  for (n in 1:N) {
    x_beta_jj[n] = x[n] * beta[jj[n]];
  }
  y ~ normal(x_beta_jj, sigma);
}

The outer brackets create a local scope in which to define the variable x_beta_jj, which is then filled in a loop and used to define a vectorized distribution statement. The reason this is such a big win is that it allows us to take the log of sigma only once and it greatly reduces the size of the resulting expression graph by packing all of the work into a single distribution function.

外层括号创建了一个局部作用域,用于定义变量 x_beta_jj,随后在循环中为其赋值,并用于定义向量化的采样语句。这样做的好处在于只需对 sigma 进行一次对数运算,同时通过将所有计算整合到一个密度函数中,显著减少了结果表达式图的规模。

Although it is tempting to redeclare beta and include a revised model block distribution statement,

虽然重新声明 beta 并包含修订后的模型块采样语句看似很有吸引力,

parameters {
  matrix[J, K] beta;
// ...
}
model {
  y ~ normal(rows_dot_product(x, beta[jj]), sigma);
  // ...
}

this fails because it breaks the vectorization for beta,11

这种做法会失败,因为它破坏了 beta 采样的向量化。12

beta ~ multi_normal(...);

which requires beta to be an array of vectors. Both vectorizations are important, so the best solution is to just use the loop above, because rows_dot_product cannot do much optimization in and of itself because there are no shared computations.

这要求 beta 必须是一个向量数组。由于两种向量化都很重要,因此最佳解决方案是使用上述循环,因为 rows_dot_product 本身无法进行太多优化,其计算过程中没有可共享的部分。

The code in the Stan program above also builds up an array of vectors for the outcomes and for the multivariate normal, which provides a major speedup by reducing the number of linear systems that need to be solved and differentiated.

在上述 Stan 程序中,通过为观测值和多元正态分布构建向量数组,显著减少了需要求解和求导的线性系统数量,从而大幅提升了计算速度。

{
  matrix[K, K] Sigma_beta;
  Sigma_beta = quad_form_diag(Omega, tau);
  for (j in 1:J) {
    beta[j] ~ multi_normal((u[j] * gamma)', Sigma_beta);
  }
}

In this example, the covariance matrix Sigma_beta is defined as a local variable so as not to have to repeat the quadratic form computation \(J\) times. This vectorization can be combined with the Cholesky-factor optimization in the next section.

在此示例中,协方差矩阵 Sigma_beta 被定义为局部变量,从而避免了重复进行 \(J\) 次二次型计算。这种向量化可以与下一节的 Cholesky 分解优化结合使用。

Optimization through Cholesky factorization

Cholesky 分解优化

The multivariate normal density and LKJ prior on correlation matrices both require their matrix parameters to be factored. Vectorizing, as in the previous section, ensures this is only done once for each density. An even better solution, both in terms of efficiency and numerical stability, is to parameterize the model directly in terms of Cholesky factors of correlation matrices using the multivariate version of the non-centered parameterization. For the model in the previous section, the program fragment to replace the full matrix prior with an equivalent Cholesky factorized prior is as follows.

多元正态密度和 LKJ 相关矩阵先验都要求它们的矩阵参数进行因子分解。如前一节所述的向量化可以确保每个分布只需进行一次因子分解。从效率和数值稳定性角度考虑,更好的解决方案是直接使用相关矩阵的 Cholesky 因子对模型进行参数化,采用非中心参数化的多元版本。针对前一节的模型进行改进,将完整的矩阵先验替换为等价的 Cholesky 因子先验的程序片段如下:

data {
  matrix[L, J] u;              // group predictors transposed
  // ...
}
parameters {
  matrix[K, J] z;
  cholesky_factor_corr[K] L_Omega;
  matrix[K, L] gamma;
  // ...
}
transformed parameters {
  matrix[K, J] beta;
  beta = gamma * u + diag_pre_multiply(tau, L_Omega) * z;
}
model {
  to_vector(z) ~ std_normal();
  L_Omega ~ lkj_corr_cholesky(2);
  // ...
}

The data variable u was originally an array of vectors, which is efficient for access; here it is redeclared as a matrix in order to use it in matrix arithmetic. Moreover, it is transposed, along with gamma and beta, to minimize the number of transposition operations. The new parameter L_Omega is the Cholesky factor of the original correlation matrix Omega, so that

数据变量 u 最初是一个向量数组,这种结构便于高效访问;在此重新声明为矩阵,以便用于矩阵运算。此外,它与 gammabeta 一同被转置,以尽量减少转置操作的次数。新参数 L_Omega 是原始相关矩阵 Omega 的 Cholesky 分解因子,即

Omega = L_Omega * L_Omega'

The prior scale vector tau is unchanged, and furthermore, pre-multiplying the Cholesky factor by the scale produces the Cholesky factor of the final covariance matrix,

先前的先验尺度向量 tau 保持不变,同时,将 Cholesky 因子预乘以尺度会生成最终协方差矩阵的 Cholesky 因子,即

Sigma_beta
  = quad_form_diag(Omega, tau)
  = diag_pre_multiply(tau, L_Omega) * diag_pre_multiply(tau, L_Omega)'

where the diagonal pre-multiply compound operation is defined by

其中对角线预乘复合操作定义为:

diag_pre_multiply(a, b) = diag_matrix(a) * b

The new variable z is declared as a matrix, the entries of which are given independent standard normal priors; the to_vector operation turns the matrix into a vector so that it can be used as a vectorized argument to the univariate normal density. This results in every column of z being a \(K\)-variate normal random vector with the identity as covariance matrix. Therefore, multiplying z by the Cholesky factor of the covariance matrix and adding the mean (u * gamma)' produces a beta distributed as in the original model, where the variance is, letting \(L = \mathrm{diag}(\tau)\,\Omega_L\),

新变量 z 被声明为矩阵,其每个元素都具有独立的标准正态先验分布;to_vector 操作将矩阵转换为向量,以便用作单变量正态密度函数的向量化参数。这使得 z 的每一列都成为具有单位协方差矩阵的 \(K\) 维正态随机向量。因此,将 z 乘以协方差矩阵的 Cholesky 分解因子,再加上均值 (u * gamma)',即可得到与原模型具有相同分布的 beta。其中方差根据 \(L = \mathrm{diag}(\tau)\,\Omega_L\) 计算,

\[ \begin{aligned} \mathbb{V}[\beta] &= \mathbb{E}\big((L \, z)(L \, z)^\top) \\ &= \mathbb{E}\big((L \, z \, z^\top \, L^\top) \\ &= L \, \mathbb{E}(z \, z^\top) \, L^\top \\ &= L \, L^\top =(\mathrm{diag}(\tau)\,\Omega_L)\,(\mathrm{diag}(\tau)\,\Omega_L)^\top \\ &= \mathrm{diag}(\tau)\,\Omega\,\mathrm{diag}(\tau) \\ &= \Sigma. \end{aligned} \] Where we have used the linearity of expectations (line 2 to 3), the definition of \(\Omega = \Omega_L \, \Omega_L^\top\), and the fact that \(\mathbb{E}(z \, z^\top) = I\) since \(z \sim \mathcal{N}(0, I)\).

在此,我们运用了期望的线性性质(从第2行到第3行),\(\Omega = \Omega_L \, \Omega_L^\top\) 的定义,以及由于 \(z \sim \mathcal{N}(0, I)\),故 \(\mathbb{E}(z \, z^\top) = I\)

Omitting the remaining data declarations, which are the same as before with the exception of u, the optimized model is as follows.

除了 u 之外,省略了其他与之前相同的数据声明,优化后的模型如下:

parameters {
  matrix[K, J] z;
  cholesky_factor_corr[K] L_Omega;
  vector<lower=0, upper=pi() / 2>[K] tau_unif;  // prior scale
  matrix[K, L] gamma;                        // group coeffs
  real<lower=0> sigma;                       // prediction error scale
}
transformed parameters {
  vector<lower=0>[K] tau = 2.5 * tan(tau_unif);
  matrix[K, J] beta = gamma * u + diag_pre_multiply(tau, L_Omega) * z;
}
model {
  vector[N] mu;
  for(n in 1:N) {
    mu[n] = x[n, ] * beta[, jj[n]];
  }
  to_vector(z) ~ std_normal();
  L_Omega ~ lkj_corr_cholesky(2);
  to_vector(gamma) ~ normal(0, 5);
  y ~ normal(mu, sigma);
}

This model also reparameterizes the prior scale tau to avoid potential problems with the heavy tails of the Cauchy distribution. The statement tau_unif ~ uniform(0, pi() / 2) can be omitted from the model block because Stan increments the log posterior for parameters with uniform priors without it.

该模型还对先验尺度 tau 进行了重参数化,以避免柯西分布的厚尾可能导致的潜在问题。语句 tau_unif ~ uniform(0, pi() / 2) 可以从模型块中省略,因为 Stan 会自动为具有均匀先验的参数进行对数后验的增量计算。

1.14 Prediction, forecasting, and backcasting

预测、预报和回溯

Stan models can be used for “predicting” the values of arbitrary model unknowns. When predictions are about the future, they’re called “forecasts;” when they are predictions about the past, as in climate reconstruction or cosmology, they are sometimes called “backcasts” (or “aftcasts” or “hindcasts” or “antecasts,” depending on the author’s feelings about the opposite of “fore”).

Stan 模型可用于“预测”任意模型未知变量的值。当预测涉及未来时,称为“预报”(forecasts);当涉及过去的预测,如气候重建或宇宙学时,有时称为“回溯”(backcasts)(或“后验”、“推测”、“反向预测”等,具体取决于作者对“预测”相反概念的理解)。

Programming predictions

预测程序设计

As a simple example, the following linear regression provides the same setup for estimating the coefficients beta as in our very first example, using y for the N observations and x for the N predictor vectors. The model parameters and model for observations are exactly the same as before.

作为一个简单示例,以下线性回归模型采用了与我们第一个例子中估计系数 beta 相同的设置,其中 y 表示 N 个观测值,x 表示 N 个预测向量。模型参数和观测模型与之前完全一致。

To make predictions, we need to be given the number of predictions, N_new, and their predictor matrix, x_new. The predictions themselves are modeled as a parameter y_new. The model statement for the predictions is exactly the same as for the observations, with the new outcome vector y_new and prediction matrix x_new.

要进行预测,我们需要提供预测数量 N_new 及其预测矩阵 x_new。预测本身被建模为参数 y_new。预测模型的语句与观测模型完全相同,只是使用了新的输出向量 y_new 和预测矩阵 x_new

data {
  int<lower=1> K;
  int<lower=0> N;
  matrix[N, K] x;
  vector[N] y;

  int<lower=0> N_new;
  matrix[N_new, K] x_new;
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;

  vector[N_new] y_new;                  // predictions
}
model {
  y ~ normal(x * beta, sigma);          // observed model

  y_new ~ normal(x_new * beta, sigma);  // prediction model
}

Predictions as generated quantities

生成量作为预测

Where possible, the most efficient way to generate predictions is to use the generated quantities block. This provides proper Monte Carlo (not Markov chain Monte Carlo) inference, which can have a much higher effective sample size per iteration.

在可能的情况下,生成预测的最有效方法是使用生成量块(generated quantities block)。这提供了适当的蒙特卡洛(而非马尔可夫链蒙特卡洛)推断,每次迭代的有效样本量可能更高。

// ...data as above...

parameters {
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(x * beta, sigma);
}
generated quantities {
  vector[N_new] y_new;
  for (n in 1:N_new) {
    y_new[n] = normal_rng(x_new[n] * beta, sigma);
  }
}

Now the data are just as before, but the parameter y_new is now declared as a generated quantity, and the prediction model is removed from the model and replaced by a pseudo-random draw from a normal distribution.

现在数据与之前相同,但参数 y_new 被声明为生成量,且预测模型从模型中移除,取而代之的是从正态分布中进行伪随机抽样。

Overflow in generated quantities

生成量溢出

It is possible for values to overflow or underflow in generated quantities. The problem is that if the result is NaN, then any constraints placed on the variables will be violated. It is possible to check a value assigned by an RNG and reject it if it overflows, but this is both inefficient and leads to biased posterior estimates. Instead, the conditions causing overflow, such as trying to generate a negative binomial random variate with a mean of \(2^{31}\), must be intercepted and dealt with. This is typically done by reparameterizing or reimplementing the random number generator using real values rather than integers, which are upper-bounded by \(2^{31} - 1\) in Stan.

在生成量中,值可能会发生溢出或下溢。问题在于,如果结果为 NaN,将违反施加在变量上的所有约束条件。可以通过检查随机数生成器(RNG)分配的值,并在溢出时拒绝它,但这既低效又会导致后验估计出现偏差。更好的方法是识别并处理导致溢出的条件,例如尝试生成均值为 \(2^{31}\) 的负二项分布的随机变量。通常通过重新参数化,或使用实数值而非整数值来重新实现随机数生成器,因为整数在 Stan 中的上限为 \(2^{31} - 1\)

1.15 Multivariate outcomes

多元输出

Most regressions are set up to model univariate observations (be they scalar, boolean, categorical, ordinal, or count). Even multinomial regressions are just repeated categorical regressions. In contrast, this section discusses regression when each observed value is multivariate. To relate multiple outcomes in a regression setting, their error terms are provided with covariance structure.

大多数回归模型用于建模单变量观测值(无论是标量、布尔值、分类、有序还是计数)。即使是多项式回归,也仅仅是重复的分类回归。相比之下,本节讨论的是每个观测值均为多变量的回归模型。为了在回归设置中关联多个观测结果,需要为它们的误差项提供协方差结构。

This section considers two cases, seemingly unrelated regressions for continuous multivariate quantities and multivariate probit regression for boolean multivariate quantities.

本节考虑两种情况:连续多变量数量的看似不相关回归(seemingly unrelated regressions)和布尔多变量数量的多元 Probit 回归。

Seemingly unrelated regressions

看似不相关回归

The first model considered is the “seemingly unrelated” regressions (SUR) of econometrics where several linear regressions share predictors and use a covariance error structure rather than independent errors (Zellner 1962; Greene 2011).

第一个考虑的模型是计量经济学中的“看似无关”回归(seemingly unrelated regressions,SUR),其中多个线性回归共享预测变量,并使用协方差误差结构而非独立误差(Zellner 1962; Greene 2011)

The model is easy to write down as a regression,

该模型可以简单地表示为如下回归:

\[\begin{align*} y_n &= x_n \, \beta + \epsilon_n \\ \epsilon_n &\sim \textsf{multivariate normal}(0, \Sigma) \end{align*}\]

where \(x_n\) is a \(J\)-row-vector of predictors (\(x\) is an \((N \times J)\) matrix), \(y_n\) is a \(K\)-vector of observations, \(\beta\) is a \((K \times J)\) matrix of regression coefficients (vector \(\beta_k\) holds coefficients for outcome \(k\)), and \(\Sigma\) is covariance matrix governing the error. As usual, the intercept can be rolled into \(x\) as a column of ones.

其中 \(x_n\) 是一个长度为 \(J\) 的行向量,表示预测变量(\(x\) 是一个大小为 \((N \times J)\) 的矩阵),\(y_n\) 是一个 \(K\) 维观测向量,\(\beta\) 是一个 \((K \times J)\) 的回归系数矩阵(向量 \(\beta_k\) 存储第 \(k\) 个结果的系数),\(\Sigma\) 是控制误差的协方差矩阵。通常,截距可以通过在 \(x\) 中添加一列全为1的列来处理。

The basic Stan code is straightforward (though see below for more optimized code for use with LKJ priors on correlation).

基本的 Stan 代码非常直接(尽管后面会介绍在使用 LKJ 先验时更优化的代码):

data {
  int<lower=1> K;
  int<lower=1> J;
  int<lower=0> N;
  array[N] vector[J] x;
  array[N] vector[K] y;
}
parameters {
  matrix[K, J] beta;
  cov_matrix[K] Sigma;
}
model {
  array[N] vector[K] mu;
  for (n in 1:N) {
    mu[n] = beta * x[n];
  }
  y ~ multi_normal(mu, Sigma);
}

For efficiency, the multivariate normal is vectorized by precomputing the array of mean vectors and sharing the same covariance matrix.

为了提高效率,可以通过预计算均值向量数组并共享相同的协方差矩阵来向量化多元正态分布。

Following the advice in the multivariate hierarchical priors section, we will place a weakly informative normal prior on the regression coefficients, an LKJ prior on the correlations and a half-Cauchy prior on standard deviations. The covariance structure is parameterized in terms of Cholesky factors for efficiency and arithmetic stability.

根据多元层次先验部分的建议,我们将为回归系数设置弱信息正态先验,为相关性设置 LKJ 先验,为标准差设置半 Cauchy 先验。为了提高效率和数值稳定性,协方差结构采用 Cholesky 因子分解进行参数化。

// ...
parameters {
  matrix[K, J] beta;
  cholesky_factor_corr[K] L_Omega;
  vector<lower=0>[K] L_sigma;
}
model {
  array[N] vector[K] mu;
  matrix[K, K] L_Sigma;

  for (n in 1:N) {
    mu[n] = beta * x[n];

  }

  L_Sigma = diag_pre_multiply(L_sigma, L_Omega);

  to_vector(beta) ~ normal(0, 5);
  L_Omega ~ lkj_corr_cholesky(4);
  L_sigma ~ cauchy(0, 2.5);

  y ~ multi_normal_cholesky(mu, L_Sigma);
}

The Cholesky factor of the covariance matrix is then reconstructed as a local variable and used in the model by scaling the Cholesky factor of the correlation matrices. The regression coefficients get a prior all at once by converting the matrix beta to a vector.

随后,协方差矩阵的 Cholesky 因子分解被重建为局部变量,并通过缩放相关矩阵的 Cholesky 因子分解在模型中使用。回归系数通过将矩阵 beta 转换为向量,一次性获得先验分布。

If required, the full correlation or covariance matrices may be reconstructed from their Cholesky factors in the generated quantities block.

如果需要,可以在生成量块中从它们的 Cholesky 因子重建完整的相关矩阵或协方差矩阵。

Multivariate probit regression

多元 Probit 回归

The multivariate probit model generates sequences of boolean variables by applying a step function to the output of a seemingly unrelated regression.

多元 Probit 模型通过将阶跃函数应用于看似无关回归的输出,生成布尔变量序列。

The observations \(y_n\) are \(D\)-vectors of boolean values (coded 0 for false, 1 for true). The values for the observations \(y_n\) are based on latent values \(z_n\) drawn from a seemingly unrelated regression model (see the previous section),

观测值 \(y_n\) 是一个 \(D\) 维布尔向量(0 表示 false,1 表示 true)。观测值 \(y_n\) 的取值基于从看似无关回归模型(参见前一节)中得到的潜在值 \(z_n\)

\[\begin{align*} z_n &= x_n \, \beta + \epsilon_n \\ \epsilon_n &\sim \textsf{multivariate normal}(0, \Sigma) \end{align*}\]

These are then put through the step function to produce a \(K\)-vector \(z_n\) of boolean values with elements defined by

然后,这些值经过阶跃函数处理,生成一个 \(K\) 维布尔向量 \(z_n\),其元素定义如下:

\[ y_{n, k} = \operatorname{I}\left(z_{n, k} > 0\right), \] where \(\operatorname{I}()\) is the indicator function taking the value 1 if its argument is true and 0 otherwise.

其中 \(\operatorname{I}()\) 是指示函数,若其参数为真,则取值为1;否则取值为0。

Unlike in the seemingly unrelated regressions case, here the covariance matrix \(\Sigma\) has unit standard deviations (i.e., it is a correlation matrix). As with ordinary probit and logistic regressions, letting the scale vary causes the model (which is defined only by a cutpoint at 0, not a scale) to be unidentified (see Greene (2011)).

与看似无关回归情况不同,此处的协方差矩阵 \(\Sigma\) 具有单位标准差(即它是一个相关矩阵)。与普通的 Probit 和 Logistic 回归一样,若允许标准差变化,则模型(仅由0处的切点定义,而非标准差)将无法确定(参见 Greene (2011))。

Multivariate probit regression can be coded in Stan using the trick introduced by Albert and Chib (1993), where the underlying continuous value vectors \(y_n\) are coded as truncated parameters. The key to coding the model in Stan is declaring the latent vector \(z\) in two parts, based on whether the corresponding value of \(y\) is 0 or 1. Otherwise, the model is identical to the seemingly unrelated regression model in the previous section.

利用 Albert and Chib (1993) 引入的技巧,可以在 Stan 中编码多元 Probit 回归,其中基础连续值向量 \(y_n\) 被编码为截断参数。在 Stan 中编码该模型的关键是将潜变量向量 \(z\) 分为两部分,依据相应的 \(y\) 值是0还是1。除此之外,该模型与前一节中的看似无关回归模型一致。

First, we introduce a sum function for two-dimensional arrays of integers; this is going to help us calculate how many total 1 values there are in \(y\).

首先,我们引入一个用于计算二维整数数组之和的函数。这有助于我们计算 \(y\) 中值为1的元素总数。

functions {
  int sum2d(array[,] int a) {
    int s = 0;
    for (i in 1:size(a)) {
      s += sum(a[i]);
    }
    return s;
  }
}

The function is trivial, but it’s not a built-in for Stan and it’s easier to understand the rest of the model if it’s pulled into its own function so as not to create a distraction.

该函数虽然简单,但在 Stan 中并非内置函数。将其作为单独的函数提取出来,可以更轻松地理解模型的其余部分,避免分散注意力。

The data declaration block is much like for the seemingly unrelated regressions, but the observations y are now integers constrained to be 0 or 1.

数据声明块与前面的看似无关回归类似,但现在观测值 y 是取值为0或1的整数。

data {
  int<lower=1> K;
  int<lower=1> D;
  int<lower=0> N;
  array[N, D] int<lower=0, upper=1> y;
  array[N] vector[K] x;
}

After declaring the data, there is a rather involved transformed data block whose sole purpose is to sort the data array y into positive and negative components, keeping track of indexes so that z can be easily reassembled in the transformed parameters block.

在声明数据后,有一个相当复杂的转换数据块,其唯一目的是将数据数组 y 分为正部和负部并排序,同时跟踪索引,以便在转换参数块中轻松重组 z

transformed data {
  int<lower=0> N_pos;
  array[sum2d(y)] int<lower=1, upper=N> n_pos;
  array[size(n_pos)] int<lower=1, upper=D> d_pos;
  int<lower=0> N_neg;
  array[(N * D) - size(n_pos)] int<lower=1, upper=N> n_neg;
  array[size(n_neg)] int<lower=1, upper=D> d_neg;

  N_pos = size(n_pos);
  N_neg = size(n_neg);
  {
    int i;
    int j;
    i = 1;
    j = 1;
    for (n in 1:N) {
      for (d in 1:D) {
        if (y[n, d] == 1) {
          n_pos[i] = n;
          d_pos[i] = d;
          i += 1;
        } else {
          n_neg[j] = n;
          d_neg[j] = d;
          j += 1;
        }
      }
    }
  }
}

The variables N_pos and N_neg are set to the number of true (1) and number of false (0) observations in y. The loop then fills in the sequence of indexes for the positive and negative values in four arrays.

变量 N_posN_neg 分别设置为 y 中为真(1)和为假(0)的观测值数量。随后,循环在四个数组中填充正值和负值的索引序列。

The parameters are declared as follows.

参数声明如下:

parameters {
  matrix[D, K] beta;
  cholesky_factor_corr[D] L_Omega;
  vector<lower=0>[N_pos] z_pos;
  vector<upper=0>[N_neg] z_neg;
}

These include the regression coefficients beta and the Cholesky factor of the correlation matrix, L_Omega. This time there is no scaling because the covariance matrix has unit scale (i.e., it is a correlation matrix; see above).

这些包括回归系数 beta 和相关矩阵的 Cholesky 分解因子 L_Omega。这次没有进行缩放,因为协方差矩阵具有单位标度(即是一个相关矩阵;参见上文)。

The critical part of the parameter declaration is that the latent real value \(z\) is broken into positive-constrained and negative-constrained components, whose size was conveniently calculated in the transformed data block. The transformed data block’s real work was to allow the transformed parameter block to reconstruct \(z\).

参数声明的关键部分是将潜在的实值 \(z\) 分解为正约束和负约束的分量,其大小在转换数据块中方便地计算出来。转换数据块的真正作用是允许转换参数块重新构建 \(z\)

transformed parameters {
  array[N] vector[D] z;
  for (n in 1:N_pos) {
    z[n_pos[n], d_pos[n]] = z_pos[n];
  }
  for (n in 1:N_neg) {
    z[n_neg[n], d_neg[n]] = z_neg[n];
  }
}

At this point, the model is simple, pretty much recreating the seemingly unrelated regression.

此时,模型非常简单,基本上重新创建了看似无关回归模型。

model {
  L_Omega ~ lkj_corr_cholesky(4);
  to_vector(beta) ~ normal(0, 5);
  {
    array[N] vector[D] beta_x;
    for (n in 1:N) {
      beta_x[n] = beta * x[n];
    }
    z ~ multi_normal_cholesky(beta_x, L_Omega);
  }
}

This simple form of model is made possible by the Albert and Chib-style constraints on z.

这种简单形式的模型是通过对 z 施加 Albert 和 Chib 风格的约束实现的。

Finally, the correlation matrix itself can be put back together in the generated quantities block if desired.

最后,如果需要,相关矩阵本身可以在生成量块中重新组合。

generated quantities {
  corr_matrix[D] Omega;
  Omega = multiply_lower_tri_self_transpose(L_Omega);
}

The same could be done for the seemingly unrelated regressions in the previous section.

对于前一节中的看似无关回归模型,也可以采用相同的方法。

1.16 Applications of pseudorandom number generation

伪随机数生成的应用

The main application of pseudorandom number generator (PRNGs) is for posterior inference, including prediction and posterior predictive checks. They can also be used for pure data simulation, which is like a posterior predictive check with no conditioning. See the function reference manual for a complete description of the syntax and usage of pseudorandom number generators.

伪随机数生成器(PRNG)的主要应用是后验推断,包括预测和后验预测检验。它们还可用于纯数据模拟,类似于无条件的后验预测检验。有关伪随机数生成器语法和用法的完整描述,请参阅函数参考手册。

Prediction

预测

Consider predicting unobserved outcomes using linear regression. Given predictors \(x_1, \dotsc, x_N\) and observed outcomes \(y_1, \dotsc, y_N\), and assuming a standard linear regression with intercept \(\alpha\), slope \(\beta\), and error scale \(\sigma\), along with improper uniform priors, the posterior over the parameters given \(x\) and \(y\) is

在使用线性回归预测未观测结果时,给定预测变量 \(x_1, \dotsc, x_N\) 和观测结果 \(y_1, \dotsc, y_N\),假设标准线性回归模型具有截距 \(\alpha\)、斜率 \(\beta\) 和误差尺度 \(\sigma\),以及非正规的均匀先验分布,给定 \(x\)\(y\) 的参数后验分布为

\[ p\left(\alpha, \beta, \sigma \mid x, y \right) \propto \prod_{n=1}^N \textsf{normal}\left(y_n \mid \alpha + \beta x_n, \sigma\right). \]

For this model, the posterior predictive inference for a new outcome \(\tilde{y}_m\) given a predictor \(\tilde{x}_m\), conditioned on the observed data \(x\) and \(y\), is

对于该模型,在给定观测数据 \(x\)\(y\) 的条件下,对于给定预测变量 \(\tilde{x}_m\) 的新结果 \(\tilde{y}_m\),其后验预测推断为

\[ p\left(\tilde{y}_n \mid \tilde{x}_n, x, y\right) = \int_{(\alpha,\beta,\sigma)} \textsf{normal}\left(\tilde{y}_n \mid \alpha + \beta \tilde{x}_n, \sigma\right) \times p\left(\alpha, \beta, \sigma \mid x, y\right) \,\textrm{d}(\alpha,\beta,\sigma). \]

To code the posterior predictive inference in Stan, a standard linear regression is combined with a random number in the generated quantities block.

为了在 Stan 中编码后验预测推断,可以在生成量块中将标准线性回归与随机数结合。

data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x;
  int<lower=0> N_tilde;
  vector[N_tilde] x_tilde;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
}
generated quantities {
  vector[N_tilde] y_tilde;
  for (n in 1:N_tilde) {
    y_tilde[n] = normal_rng(alpha + beta * x_tilde[n], sigma);
  }
}

Given observed predictors \(x\) and outcomes \(y\), y_tilde will be drawn according to \(p\left(\tilde{y} \mid \tilde{x}, y, x\right)\). This means that, for example, the posterior mean for y_tilde is the estimate of the outcome that minimizes expected square error (conditioned on the data and model).

给定观测到的预测变量 \(x\) 和结果 \(y\)y_tilde 将根据 \(p\left(\tilde{y} \mid \tilde{x}, y, x\right)\) 进行抽样。这意味着,例如,y_tilde 的后验均值是使期望平方误差最小化的结果估计(在给定数据和模型的条件下)。

Posterior predictive checks

后验预测检验

A good way to investigate the fit of a model to the data, a critical step in Bayesian data analysis, is to generate simulated data according to the parameters of the model. This is carried out with exactly the same procedure as before, only the observed data predictors \(x\) are used in place of new predictors \(\tilde{x}\) for unobserved outcomes. If the model fits the data well, the predictions for \(\tilde{y}\) based on \(x\) should match the observed data \(y\).

通过根据模型参数生成模拟数据,可以很好地研究模型与数据的拟合程度,这是贝叶斯数据分析的关键步骤。这个过程与之前完全相同,只是观测到的数据预测变量 \(x\) 取代了未观测结果的预测变量 \(\tilde{x}\)。如果模型很好地拟合数据,基于 \(x\)\(\tilde{y}\) 的预测结果应与观测到的数据 \(y\) 匹配。

To code posterior predictive checks in Stan requires only a slight modification of the prediction code to use \(x\) and \(N\) in place of \(\tilde{x}\) and \(\tilde{N}\),

在 Stan 中编写后验预测检验只需对预测代码进行轻微修改,将 \(\tilde{x}\)\(\tilde{N}\) 替换为 \(x\)\(N\)

generated quantities {
  vector[N] y_tilde;
  for (n in 1:N) {
    y_tilde[n] = normal_rng(alpha + beta * x[n], sigma);
  }
}

Gelman et al. (2013) recommend choosing several posterior draws \(\tilde{y}^{(1)}, \dotsc, \tilde{y}^{(M)}\) and plotting each of them alongside the data \(y\) that was actually observed. If the model fits well, the simulated \(\tilde{y}\) will look like the actual data \(y\).

Gelman et al. (2013) 建议选择多个后验抽样 \(\tilde{y}^{(1)}, \dotsc, \tilde{y}^{(M)}\),并将每个抽样结果与实际观测数据 \(y\) 进行对比绘图。如果模型拟合良好,模拟得到的 \(\tilde{y}\) 将与实际数据 \(y\) 相似。

Albert, J. H., and S. Chib. 1993. “Bayesian Analysis of Binary and Polychotomous Response Data.” Journal of the American Statistical Association 88: 669–79.
Barnard, John, Robert McCulloch, and Xiao-Li Meng. 2000. “Modeling Covariance Matrices in Terms of Standard Deviations and Correlations, with Application to Shrinkage.” Statistica Sinica, 1281–311.
Chung, Yeojin, Sophia Rabe-Hesketh, Vincent Dorie, Andrew Gelman, and Jingchen Liu. 2013. “A Nondegenerate Penalized Likelihood Estimator for Variance Parameters in Multilevel Models.” Psychometrika 78 (4): 685–709.
Curtis, S. McKay. 2010. BUGS Code for Item Response Theory.” Journal of Statistical Software 36 (1): 1–34.
Gelman, Andrew, J. B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. Third Edition. London: Chapman & Hall / CRC Press.
Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel-Hierarchical Models. Cambridge, United Kingdom: Cambridge University Press.
Greene, William H. 2011. Econometric Analysis. 7th ed. Prentice-Hall.
Lewandowski, Daniel, Dorota Kurowicka, and Harry Joe. 2009. “Generating Random Correlation Matrices Based on Vines and Extended Onion Method.” Journal of Multivariate Analysis 100: 1989–2001.
Seyboldt, Adrian. 2024. “Add ZeroSumNormal Distribution.” https://github.com/pyro-ppl/numpyro/pull/1751#issuecomment-1980569811.
Zellner, Arnold. 1962. “An Efficient Method of Estimating Seemingly Unrelated Regression Equations and Tests for Aggregation Bias.” Journal of the American Statistical Association 57: 348–68.

  1. Unlike in Python and R, which are interpreted, Stan is translated to C++ and compiled, so loops and assignment statements are fast. Vectorized code is faster in Stan because (a) the expression tree used to compute derivatives can be simplified, leading to fewer virtual function calls, and (b) computations that would be repeated in the looping version, such as log(sigma) in the above model, will be computed once and reused.↩︎

  2. 与 Python 和 R 不同,Stan 是被转换为 C++ 并进行编译的,因此循环和赋值语句速度较快。在 Stan 中,向量化代码更快的原因是:(1) 用于计算导数的表达式树可以简化,从而减少虚函数调用;(2) 在循环版本中会重复进行的计算,例如上述模型中的 log(sigma),将只计算一次并被重复使用。↩︎

  3. The Phi_approx function is a rescaled version of the inverse logit function, so while the scale is roughly the same \(\Phi\), the tails do not match.↩︎

  4. Phi_approx 函数是逆逻辑函数的重新缩放版本,因此尽管尺度大致与 \(\Phi\) 相同,但尾部并不匹配。↩︎

  5. The Bernoulli-logit distribution builds in the log link function, taking \[\texttt{bernoulli}\mathtt{\_}\texttt{logit}\left(y \mid \alpha\right) = \texttt{bernoulli}\left(y \mid \operatorname{logit}^{-1}(\alpha)\right).\]↩︎

  6. 伯努利对数几率分布内置了对数链接函数,其公式为 \[\texttt{bernoulli}\mathtt{\_}\texttt{logit}\left(y \mid \alpha\right) = \texttt{bernoulli}\left(y \mid \operatorname{logit}^{-1}(\alpha)\right).\]↩︎

  7. Gelman and Hill (2007) treat the \(\delta\) term equivalently as the location parameter in the distribution of student abilities.↩︎

  8. Gelman and Hill (2007)\(\delta\) 视为学生能力分布的位置参数。↩︎

  9. The prior is named for Lewandowski, Kurowicka, and Joe, as it was derived by inverting the random correlation matrix generation strategy of Lewandowski, Kurowicka, and Joe (2009).↩︎

  10. 该先验以 Lewandowski、Kurowicka 和 Joe 的名字命名,因为它是通过反转 Lewandowski, Kurowicka, and Joe (2009) 中的随机相关矩阵生成策略推导得到的。↩︎

  11. Thanks to Mike Lawrence for pointing this out in the GitHub issue for the manual.↩︎

  12. 感谢 Mike Lawrence 在 GitHub 上的问题反馈。↩︎