28  Posterior Predictive Sampling

29 后验预测抽样

本节译者:谭朕斯 本节校审:张梓源

The goal of inference is often posterior prediction, that is, evaluating or sampling from the posterior predictive distribution \(p(\tilde{y} \mid y),\) where \(y\) is observed data and \(\tilde{y}\) is yet to be observed data. Often there are unmodeled predictors \(x\) and \(\tilde{x}\) for the observed data \(y\) and unobserved data \(\tilde{y}\). With predictors, the posterior predictive density is \(p(\tilde{y} \mid \tilde{x}, x, y).\) All of these variables may represent multivariate quantities.

推断的目标通常是后验预测,即评估或从后验预测分布中进行抽样 \(p(\tilde{y} \mid y)\),其中 \(y\) 是观测数据,\(\tilde{y}\) 是尚未观测到的数据。通常存在未建模的预测变量 \(x\) 和 \(\tilde{x}\),用于观测数据 \(y\) 和未观测数据 \(\tilde{y}\)。在考虑了预测变量后,后验预测密度为 \(p(\tilde{y} \mid \tilde{x}, x, y)\)。所有这些变量都可以表示多元变量。

This chapter explains how to sample from the posterior predictive distribution in Stan, including applications to posterior predictive simulation and calculating event probabilities. These techniques can be coded in Stan using random number generation in the generated quantities block. Further, a technique for fitting and performing inference in two stages is presented in a section on stand-alone generated quantities in Stan.

本节重点介绍了如何在 Stan 中从后验预测分布中进行抽样,它涵盖了后验预测模拟和计算事件发生概率的技术和相应的应用,所提供的示例和解释将帮助您理解如何在 Stan 中实现这些技术。

29.1 Posterior predictive distribution

29.2 后验预测分布

Given a full Bayesian model \(p(y, \theta)\), the posterior predictive density for new data \(\tilde{y}\) given observed data \(y\) is

给定一个贝叶斯模型 \(p(y, \theta)\),对于给定观测数据 \(y\) 的新数据 \(\tilde{y}\) 的后验预测密度为:

\[ p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta) \cdot p(\theta \mid y) \, \textrm{d}\theta. \] The product under the integral reduces to the joint posterior density \(p(\tilde{y}, \theta \mid y),\) so that the integral is simply marginalizing out the parameters \(\theta,\) leaving the predictive density \(p(\tilde{y} \mid y)\) of future observations given past observations.

积分下的乘积简化为联合后验密度 \(p(\tilde{y}, \theta \mid y)\),因此积分实际上是对参数 \(\theta\) 进行边际化,得到给定过去观测的未来观测的预测密度 \(p(\tilde{y} \mid y)\)

29.3 Computing the posterior predictive distribution

29.4 计算后验预测分布

The posterior predictive density (or mass) of a prediction \(\tilde{y}\) given observed data \(y\) can be computed using \(M\) Monte Carlo draws

给定观测数据 \(y\),可以使用 \(M\) 次蒙特卡洛抽样来计算预测 \(\tilde{y}\) 的后验预测密度(或概率)。

\[ \theta^{(m)} \sim p(\theta \mid y) \] from the posterior as

从后验分布中进行抽样,

\[ p(\tilde{y} \mid y) \approx \frac{1}{M} \sum_{m = 1}^M p(\tilde{y} \mid \theta^{(m)}). \]

Computing directly using this formula will lead to underflow in many situations, but the log posterior predictive density, \(\log p(\tilde{y} \mid y)\) may be computed using the stable log sum of exponents function as

直接使用这个公式进行计算在许多情况下可能会导致下溢(underflow),但是可以使用稳定的对数求和函数(stable log sum of exponents function)计算对数后验预测密度 \(\log p(\tilde{y} \mid y)\),如下所示:

\[\begin{eqnarray*} \log p(\tilde{y} \mid y) & \approx & \log \frac{1}{M} \sum_{m = 1}^M p(\tilde{y} \mid \theta^{(m)}) \\[4pt] & = & - \log M + \textrm{log-sum-exp}_{m = 1}^M \log p(\tilde{y} \mid \theta^{(m)}), \end{eqnarray*}\]

where

其中

\[ \textrm{log-sum-exp}_{m = 1}^M v_m = \log \sum_{m = 1}^M \exp v_m \] is used to maintain arithmetic precision. See the section on log sum of exponentials for more details.

用于保持算术精度。有关更多详细信息,请参阅关于对数指数和(log sum of exponentials)的章节。
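As a minimal sketch of where the per-draw terms \(\log p(\tilde{y} \mid \theta^{(m)})\) might come from, a Poisson model with a gamma prior can record the log predictive mass of a held-out count in its generated quantities block. The names y_new and log_p_y_new are assumptions made for illustration and do not appear in the original text.

```stan
data {
  int<lower=0> N;
  array[N] int<lower=0> y;
  int<lower=0> y_new;          // hypothetical held-out count to be scored
}
parameters {
  real<lower=0> lambda;
}
model {
  lambda ~ gamma(1, 1);
  y ~ poisson(lambda);
}
generated quantities {
  // log p(y_new | lambda^(m)) evaluated at the current posterior draw
  real log_p_y_new = poisson_lpmf(y_new | lambda);
}
```

The \(M\) saved values of log_p_y_new can then be combined outside of Stan by taking their log-sum-exp and subtracting \(\log M\), which matches the formula above without ever exponentiating an individual term on its own.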

29.5 Sampling from the posterior predictive distribution

29.6 从后验预测分布中进行抽样

Given draws from the posterior \(\theta^{(m)} \sim p(\theta \mid y),\) draws from the posterior predictive \(\tilde{y}^{(m)} \sim p(\tilde{y} \mid y)\) can be generated by randomly generating from the sampling distribution with the parameter draw plugged in,

给定从后验分布中得到的抽样 \(\theta^{(m)} \sim p(\theta \mid y)\),可以通过将参数抽样代入到抽样分布中进行随机生成,从而生成从后验预测分布中得到的抽样 \(\tilde{y}^{(m)} \sim p(\tilde{y} \mid y)\)

\[ \tilde{y}^{(m)} \sim p(y \mid \theta^{(m)}). \]

Randomly drawing \(\tilde{y}\) from the data model is critical because there are two forms of uncertainty in posterior predictive quantities, aleatoric uncertainty and epistemic uncertainty. Epistemic uncertainty arises because \(\theta\) is unknown and estimated based only on a finite sample of data \(y\). Aleatoric uncertainty arises because even a known value of \(\theta\) leads to uncertainty about new \(\tilde{y}\) as described by the data model \(p(\tilde{y} \mid \theta)\). Both forms of uncertainty show up in the factored form of the posterior predictive distribution,

从抽样分布中随机抽取 \(\tilde{y}\) 是至关重要的,因为后验预测量中存在两种形式的不确定性,即抽样不确定性和估计不确定性。估计不确定性是由于 \(\theta\) 仅基于数据样本 \(y\) 进行估计而产生的。抽样不确定性是因为即使是已知的 \(\theta\) 值,也会导致 \(\tilde{y}\) 在抽样分布 \(p(\tilde{y} \mid \theta)\) 中具有变异性。这两种形式的不确定性体现在后验预测分布的分解形式中。

\[ p(\tilde{y} \mid y) = \int \underbrace{p(\tilde{y} \mid \theta)}_{\begin{array}{l} \textrm{aleatoric} \\[-2pt] \textrm{uncertainty} \end{array}} \cdot \underbrace{p(\theta \mid y)}_{\begin{array}{l} \textrm{epistemic} \\[-2pt] \textrm{uncertainty} \end{array}} \, \textrm{d}\theta. \]
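One way to see the two contributions separately is the law of total variance applied to the predictive distribution. It is offered here as a supplementary identity, not something stated in the text above, and it assumes the variances exist.

\[ \textrm{Var}[\tilde{y} \mid y] = \underbrace{\mathbb{E}\!\left[ \textrm{Var}[\tilde{y} \mid \theta] \mid y \right]}_{\textrm{aleatoric}} + \underbrace{\textrm{Var}\!\left[ \mathbb{E}[\tilde{y} \mid \theta] \mid y \right]}_{\textrm{epistemic}}. \]

Generating predictions from a single point estimate of \(\theta\) drops the second term entirely, which is why such predictions tend to be overconfident.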

29.7 Posterior predictive simulation in Stan

29.8 在 Stan 中进行后验预测模拟

Posterior predictive quantities can be coded in Stan using the generated quantities block.

在 Stan 中,可以使用 generated quantities 块来编写后验预测量。

Simple Poisson model

简单泊松模型

For example, consider a simple Poisson model for count data with a rate parameter \(\lambda > 0\) having a gamma-distributed prior,

例如,考虑一个简单的泊松模型,用于处理计数数据,其中参数 \(\lambda > 0\), 遵循先验分布为伽玛分布

\[ \lambda \sim \textrm{gamma}(1, 1). \] The \(N\) observations \(y_1, \ldots, y_N\) are modeled as Poisson distributed,

\(N\) 个观测数据 \(y_1, \ldots, y_N\) 被建模为泊松分布,

\[ y_n \sim \textrm{poisson}(\lambda). \]

Stan code

Stan 代码

The following Stan program defines a variable for \(\tilde{y}\) by random number generation in the generated quantities block.

以下是一个 Stan 程序,其中在 generated quantities 块中通过随机数生成为 \(\tilde{y}\) 定义了一个变量。

data {
  int<lower=0> N;
  array[N] int<lower=0> y;
}
parameters {
  real<lower=0> lambda;
}
model {
  lambda ~ gamma(1, 1);
  y ~ poisson(lambda);
}
generated quantities {
  int<lower=0> y_tilde = poisson_rng(lambda);
}

The random draw from the data model for \(\tilde{y}\) is coded using Stan’s Poisson random number generator in the generated quantities block. This accounts for the aleatoric component of the uncertainty; Stan’s posterior sampler will account for the epistemic uncertainty, generating a new \(\tilde{y}^{(m)} \sim p(y \mid \lambda^{(m)})\) for each posterior draw \(\lambda^{(m)} \sim p(\lambda \mid y).\)

在 Stan 的 generated quantities 块中,使用 Stan 的泊松随机数生成器编码了对于 \(\tilde{y}\) 的数据模型的随机抽样。这考虑了抽样的不确定性成分;而 Stan 的后验抽样器将考虑估计的不确定性,在每次后验抽样 \(\lambda^{(m)} \sim p(\lambda \mid y)\) 中生成一个新的 \(\tilde{y}^{(m)} \sim p(y \mid \lambda^{(m)})\)

The posterior draws \(\tilde{y}^{(m)}\) may be used to estimate the expected value of \(\tilde{y}\) or any of its quantiles or posterior intervals, as well as event probabilities involving \(\tilde{y}\). In general, \(\mathbb{E}[f(\tilde{y}, \theta) \mid y]\) may be evaluated as

后验抽样 \(\tilde{y}^{(m)}\) 可用于估计 \(\tilde{y}\) 的期望值、分位数或后验区间,以及涉及 \(\tilde{y}\) 的事件概率。一般来说,可以如下评估 \(\mathbb{E}[f(\tilde{y}, \theta) \mid y]\)

\[ \mathbb{E}[f(\tilde{y}, \theta) \mid y] \approx \frac{1}{M} \sum_{m=1}^M f(\tilde{y}^{(m)}, \theta^{(m)}), \]

which is just the posterior mean of \(f(\tilde{y}, \theta).\) This quantity is computed by Stan if the value of \(f(\tilde{y}, \theta)\) is assigned to a variable in the generated quantities block. That is, if we have

这个量就是 \(f(\tilde{y}, \theta)\) 的后验均值。如果在 generated quantities 块中将 \(f(\tilde{y}, \theta)\) 的值赋给一个变量,Stan 会自动计算这个量。也就是说,如果我们有以下代码:

generated quantities {
  real f_val = f(y_tilde, theta);
  // ...
}

where the value of \(f(\tilde{y}, \theta)\) is assigned to variable f_val, then the posterior mean of f_val will be the expectation \(\mathbb{E}[f(\tilde{y}, \theta) \mid y]\).

其中,将 \(f(\tilde{y}, \theta)\) 的值赋给变量 f_val,那么 f_val 的后验均值将是期望值 \(\mathbb{E}[f(\tilde{y}, \theta) \mid y]\)
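To make the generic f concrete, the sketch below picks one particular function, the squared deviation \(f(\tilde{y}, \lambda) = (\tilde{y} - \lambda)^2\), in the simple Poisson model. The choice of function is an assumption made for illustration only.

```stan
data {
  int<lower=0> N;
  array[N] int<lower=0> y;
}
parameters {
  real<lower=0> lambda;
}
model {
  lambda ~ gamma(1, 1);
  y ~ poisson(lambda);
}
generated quantities {
  // one predictive draw per posterior draw of lambda
  int<lower=0> y_tilde = poisson_rng(lambda);
  // illustrative choice of f: squared deviation of the prediction from the rate
  real f_val = square(y_tilde - lambda);
}
```

Because a Poisson variate has variance equal to its rate, \(\mathbb{E}[(\tilde{y} - \lambda)^2 \mid \lambda] = \lambda,\) so the posterior mean of f_val should land close to the posterior mean of lambda. That gives a quick sanity check on the output.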

Analytic posterior and posterior predictive

解析后验和解析后验预测

The gamma distribution is the conjugate prior distribution for the Poisson distribution, so the posterior density \(p(\lambda \mid y)\) will also follow a gamma distribution.

Gamma 分布是泊松分布的共轭先验分布,因此后验密度 \(p(\lambda \mid y)\) 也将遵循 Gamma 分布。

Because the posterior follows a gamma distribution and the sampling distribution is Poisson, the posterior predictive \(p(\tilde{y} \mid y)\) will follow a negative binomial distribution, because the negative binomial is defined as a compound gamma-Poisson. That is, \(y \sim \textrm{negative-binomial}(\alpha, \beta)\) if \(\lambda \sim \textrm{gamma}(\alpha, \beta)\) and \(y \sim \textrm{poisson}(\lambda).\) Rather than marginalizing out the rate parameter \(\lambda\) analytically as can be done to define the negative binomial probability mass function, the rate \(\lambda^{(m)} \sim p(\lambda \mid y)\) is sampled from the posterior and then used to generate a draw of \(\tilde{y}^{(m)} \sim p(y \mid \lambda^{(m)}).\)

由于后验分布遵循 Gamma 分布,而采样分布是泊松分布,后验预测分布 \(p(\tilde{y} \mid y)\) 将遵循负二项分布,因为负二项分布定义为 Gamma-泊松的组合。也就是说,如果 \(\lambda \sim \textrm{gamma}(\alpha, \beta)\) 且 \(y \sim \textrm{poisson}(\lambda)\),则 \(y \sim \textrm{negative-binomial}(\alpha, \beta)\)。为了生成负二项分布的概率质量函数,通常会对速率参数 \(\lambda\) 进行解析边际化。但是,在后验预测中,我们可以从后验中抽样得到 \(\lambda^{(m)} \sim p(\lambda \mid y)\),然后使用该值生成 \(\tilde{y}^{(m)} \sim p(y \mid \lambda^{(m)})\) 的样本。
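For the simple Poisson model above, the conjugate update can be written out explicitly. What follows is the standard gamma-Poisson calculation, sketched under the assumption that the gamma distribution is parameterized by shape and rate, as in Stan.

\[ p(\lambda \mid y) \ \propto \ \lambda^{1 - 1} e^{-\lambda} \prod_{n = 1}^N \frac{\lambda^{y_n} e^{-\lambda}}{y_n!} \ \propto \ \lambda^{\left(1 + \sum_{n = 1}^N y_n\right) - 1} \, e^{-(1 + N) \lambda}, \]

so that

\[ \lambda \mid y \sim \textrm{gamma}\!\left(1 + \textstyle\sum_{n = 1}^N y_n, \ 1 + N\right) \quad \textrm{and} \quad \tilde{y} \mid y \sim \textrm{negative-binomial}\!\left(1 + \textstyle\sum_{n = 1}^N y_n, \ 1 + N\right). \]

As a worked case, \(N = 4\) observations summing to 10 give the posterior \(\textrm{gamma}(11, 5)\). The predictive mean is then \(11 / 5 = 2.2\) and the predictive variance is \(\frac{11}{5^2} \cdot (5 + 1) = 2.64.\) The sampled \(\tilde{y}^{(m)}\) from the Stan program should approximately reproduce both values.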

29.9 Posterior prediction for regressions

29.10 回归的后验预测

Posterior predictive distributions for regressions

回归模型的后验预测分布

Consider a regression with a single predictor \(x_n\) for the training outcome \(y_n\) and \(\tilde{x}_n\) for the test outcome \(\tilde{y}_n.\) Without considering the parametric form of any of the distributions, the posterior predictive distribution for a general regression is

考虑一个回归模型,其中训练数据的预测变量为 \(x_n\),对应的观测结果为 \(y_n\),而测试数据的预测变量为 \(\tilde{x}_n\),对应的预测结果为 \(\tilde{y}_n\)。在不考虑分布的参数形式的情况下,回归模型的后验预测分布可以表示为:

\[\begin{eqnarray} p(\tilde{y} \mid \tilde{x}, y, x) & = & \int p(\tilde{y} \mid \tilde{x}, \theta) \cdot p(\theta \mid y, x) \, \textrm{d}\theta \\[4pt] & \approx & \frac{1}{M} \sum_{m=1}^M \, p(\tilde{y} \mid \tilde{x}, \theta^{(m)}), \end{eqnarray}\]

where \(\theta^{(m)} \sim p(\theta \mid x, y).\)

其中 \(\theta^{(m)} \sim p(\theta \mid x, y).\)

Stan program

Stan 程序

The following program defines a Poisson regression with a single predictor. The predictors \(x\) and \(\tilde{x}\) are coded as data, as are their sizes. Of the outcomes, only the observed \(y\) values are coded as data. The predictive quantities \(\tilde{y}\) appear in the generated quantities block, where they are generated by random number generation.

下面的程序定义了一个具有单个预测变量的泊松回归模型。这些预测变量都被编码为数据,包括它们的大小。只有观测到的 \(y\) 值被编码为数据。预测量 \(\tilde{y}\) 出现在 generated quantities 块中,它们通过随机数生成进行生成。

data {
  int<lower=0> N;
  vector[N] x;
  array[N] int<lower=0> y;
  int<lower=0> N_tilde;
  vector[N_tilde] x_tilde;
}
parameters {
  real alpha;
  real beta;
}
model {
  y ~ poisson_log(alpha + beta * x);
  { alpha, beta } ~ normal(0, 1);
}
generated quantities {
  array[N_tilde] int<lower=0> y_tilde
    = poisson_log_rng(alpha + beta * x_tilde);
}

The Poisson distributions in both the model and generated quantities block are coded using the log rate as a parameter (that’s poisson_log vs. poisson, with the suffixes defining the scale of the parameter). The regression coefficients, an intercept alpha and slope beta, are given standard normal priors.

在模型和 generated quantities 块中,泊松分布使用对数率作为参数进行编码(即 poisson_log 和 poisson,后缀定义了参数的尺度)。回归系数,即截距 alpha 和斜率 beta,具有标准正态分布的先验分布。

In the model block, the log rate for the Poisson is a linear function of the training data \(x\), whereas in the generated quantities block it is a function of the test data \(\tilde{x}\). Because the generated quantities block does not affect the posterior draws, the model fits \(\alpha\) and \(\beta\) using only the training data, reserving \(\tilde{x}\) to generate \(\tilde{y}.\)

在模型块中,泊松分布的对数率是训练数据 \(x\) 的线性函数,而在 generated quantities 块中,它是测试数据 \(\tilde{x}\) 的函数。由于 generated quantities 块不影响后验抽样,模型仅使用训练数据拟合 \(\alpha\) 和 \(\beta\),将 \(\tilde{x}\) 保留用于生成 \(\tilde{y}\)

The result from running Stan is a predictive sample \(\tilde{y}^{(1)}, \ldots, \tilde{y}^{(M)}\) where each \(\tilde{y}^{(m)} \sim p(\tilde{y} \mid \tilde{x}, x, y).\)

在运行 Stan 后,得到的结果是预测样本 \(\tilde{y}^{(1)}, \ldots, \tilde{y}^{(M)}\),其中每个 \(\tilde{y}^{(m)} \sim p(\tilde{y} \mid \tilde{x}, x, y)\)

The mean of the posterior predictive distribution is the expected value

后验预测分布的均值是期望值,可以通过计算预测样本的平均值得到。

\[\begin{align} \mathbb{E}[\tilde{y} \mid \tilde{x}, x, y] & = \int \int \tilde{y} \cdot p(\tilde{y} \mid \tilde{x}, \theta) \cdot p(\theta \mid x, y) \, \textrm{d}\theta \, \textrm{d}\tilde{y} \\[4pt] & \approx \frac{1}{M} \sum_{m = 1}^M \tilde{y}^{(m)}, \end{align}\]

where the \(\tilde{y}^{(m)} \sim p(\tilde{y} \mid \tilde{x}, x, y)\) are drawn from the posterior predictive distribution. Thus the posterior mean of y_tilde[n] after running Stan is the expected value of \(\tilde{y}_n\) conditioned on the training data \(x, y\) and predictor \(\tilde{x}_n.\) This is the Bayesian estimate for \(\tilde{y}\) with minimum expected squared error. The posterior draws can also be used to estimate quantiles for the median and any posterior intervals of interest for \(\tilde{y}\), as well as covariance of the \(\tilde{y}_n.\) The posterior draws \(\tilde{y}^{(m)}\) may also be used to estimate predictive event probabilities, such as \(\Pr[\tilde{y}_1 > 0]\) or \(\Pr[\prod_{n = 1}^{\tilde{N}} \tilde{y}_n > 1],\) as expectations of indicator functions.

在这种情况下,\(\tilde{y}^{(m)} \sim p(\tilde{y} \mid \tilde{x}, x, y)\) 是从后验预测分布中抽取的样本。因此,经过 Stan 运行后,y_tilde[n]的后验均值是在给定训练数据 \(x, y\) 和预测变量 \(\tilde{x}_n\) 的条件下 \(\tilde{y}_n\) 的期望值。这是 \(\tilde{y}\) 的贝叶斯估计,具有最小的期望平方误差。后验抽样还可以用于估计中位数和任何感兴趣的后验区间的分位数,以及 \(\tilde{y}_n\) 的协方差。后验抽样 \(\tilde{y}^{(m)}\) 还可以用于估计预测事件的概率,例如 \(\mbox{Pr}[\tilde{y}_1 > 0]\) 或 \(\mbox{Pr}[\prod_{n =1}^{\tilde{N}} \tilde{y}_n > 1]\),这些可以看作是指示函数的期望。

All of this can be carried out by running Stan only a single time to draw a single sample of \(M\) draws,

所有这些都可以通过仅运行一次 Stan 来进行,从而得到 \(M\) 个后验抽样。这样可以实现对预测分布、期望值、分位数、协方差和事件概率等进行估计。

\[ \tilde{y}^{(1)}, \ldots, \tilde{y}^{(M)} \sim p(\tilde{y} \mid \tilde{x}, x, y). \]

Multiple runs are required only when moving to cross-validation.

只有在进行交叉验证时才需要多次运行。在交叉验证中,需要多次运行 Stan 来获得不同的训练和测试数据集,并进行模型的训练和评估。这样可以更准确地评估模型的性能和泛化能力。

29.11 Estimating event probabilities

29.12 估计事件概率

Event probabilities involving either parameters or predictions or both may be coded in the generated quantities block. For example, to evaluate \(\Pr[\lambda > 5 \mid y]\) in the simple Poisson example with only a rate parameter \(\lambda\), it suffices to define a generated quantity

涉及参数或预测的事件概率可以在生成的量块中进行编码。例如,在只有速率参数 \(\lambda\) 的简单泊松模型中评估 \(\textrm{Pr}[\lambda > 5 \mid y]\),只需要定义一个生成的量:

generated quantities {
  int<lower=0, upper=1> lambda_gt_5 = lambda > 5;
  // ...
}

The value of the expression lambda > 5 is 1 if the condition is true and 0 otherwise. The posterior mean of this generated quantity is the event probability

如果条件为真,则表达式 lambda > 5 的值为1,否则为0。该参数的后验均值即为事件概率。

\[\begin{eqnarray*} \Pr[\lambda > 5 \mid y] & = & \int \textrm{I}(\lambda > 5) \cdot p(\lambda \mid y) \, \textrm{d}\lambda \\[4pt] & \approx & \frac{1}{M} \sum_{m = 1}^M \textrm{I}(\lambda^{(m)} > 5), \end{eqnarray*}\]

where each \(\lambda^{(m)} \sim p(\lambda \mid y)\) is distributed according to the posterior. In Stan, this is recovered as the posterior mean of the generated quantity lambda_gt_5.

其中每个 \(\lambda^{(m)} \sim p(\lambda \mid y)\) 是根据后验分布进行抽样。在 Stan 中,这可以通过参数 lambda_gt_5 的后验均值来计算。

In general, event probabilities may be expressed as expectations of indicator functions. For example,

通常情况下,事件概率可以表示为指示函数的期望。例如,可以将事件概率表示为以下形式:

\[\begin{eqnarray*} \Pr[\lambda > 5 \mid y] & = & \mathbb{E}[\textrm{I}(\lambda > 5) \mid y] \\[4pt] & = & \int \textrm{I}(\lambda > 5) \cdot p(\lambda \mid y) \, \textrm{d}\lambda \\[4pt] & \approx & \frac{1}{M} \sum_{m = 1}^M \textrm{I}(\lambda^{(m)} > 5). \end{eqnarray*}\]

The last line above is the posterior mean of the indicator function as coded in Stan.

以上最后一行是在 Stan 中编码的指示函数的后验均值。

Event probabilities involving posterior predictive quantities \(\tilde{y}\) work exactly the same way as those for parameters. For example, if \(\tilde{y}_n\) is the prediction for the \(n\)-th unobserved outcome (such as the score of a team in a game or a level of expression of a protein in a cell), then

涉及后验预测量 \(\tilde{y}\) 的事件概率与参数的情况完全相同。例如,如果 \(\tilde{y}_n\) 是第 \(n\) 个未观测到的结果的预测值(例如比赛中一支球队的得分或细胞中蛋白质的表达水平),那么可以使用以下方式计算事件概率:

\[\begin{eqnarray*} \Pr[\tilde{y}_3 > \tilde{y}_7 \mid \tilde{x}, x, y] & = & \mathbb{E}\!\left[\textrm{I}(\tilde{y}_3 > \tilde{y}_7) \mid \tilde{x}, x, y\right] \\[4pt] & = & \int \textrm{I}(\tilde{y}_3 > \tilde{y}_7) \cdot p(\tilde{y} \mid \tilde{x}, x, y) \, \textrm{d}\tilde{y} \\[4pt] & \approx & \frac{1}{M} \sum_{m = 1}^M \textrm{I}(\tilde{y}^{(m)}_3 > \tilde{y}^{(m)}_7), \end{eqnarray*}\]

where \(\tilde{y}^{(m)} \sim p(\tilde{y} \mid \tilde{x}, x, y).\)

其中 \(\tilde{y}^{(m)} \sim p(\tilde{y} \mid \tilde{x}, x, y)\) 表示从后验预测分布中抽取的样本。
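The comparison of \(\tilde{y}_3\) and \(\tilde{y}_7\) could be coded along the following lines, reusing the Poisson regression from the previous section. This is a sketch rather than a definitive implementation. The variable name y3_gt_y7 and the lower bound of 7 on N_tilde are assumptions added so that both indexed predictions exist, and the priors are written as two separate statements purely for readability.

```stan
data {
  int<lower=0> N;
  vector[N] x;
  array[N] int<lower=0> y;
  int<lower=7> N_tilde;        // at least 7 so that y_tilde[7] exists
  vector[N_tilde] x_tilde;
}
parameters {
  real alpha;
  real beta;
}
model {
  y ~ poisson_log(alpha + beta * x);
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
}
generated quantities {
  array[N_tilde] int<lower=0> y_tilde
    = poisson_log_rng(alpha + beta * x_tilde);
  // indicator I(y_tilde[3] > y_tilde[7]); its posterior mean
  // estimates the event probability
  int<lower=0, upper=1> y3_gt_y7 = y_tilde[3] > y_tilde[7];
}
```

Both predictions are generated from the same posterior draw of alpha and beta, so the estimate reflects the dependence between \(\tilde{y}_3\) and \(\tilde{y}_7\) induced by the shared parameters.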

29.13 Stand-alone generated quantities and ongoing prediction

29.14 独立的生成量和持续的预测

Stan’s sampling algorithms take a Stan program representing a posterior \(p(\theta \mid y, x)\) along with actual data \(x\) and \(y\) to produce a set of draws \(\theta^{(1)}, \ldots, \theta^{(M)}\) from the posterior. Posterior predictive draws \(\tilde{y}^{(m)} \sim p(\tilde{y} \mid \tilde{x}, x, y)\) can be generated by drawing

Stan 的抽样算法使用表示后验分布 \(p(\theta \mid y, x)\) 的 Stan 程序,以及实际数据 \(x\) 和 \(y\),生成一组从后验分布中抽取的样本 \(\theta^{(1)}, \ldots, \theta^{(M)}\)。可以通过抽取来生成后验预测样本 \(\tilde{y}^{(m)} \sim p(\tilde{y} \mid \tilde{x}, x, y)\)

\[ \tilde{y}^{(m)} \sim p(y \mid \tilde{x}, \theta^{(m)}) \] from the data model. Note that drawing \(\tilde{y}^{(m)}\) only depends on the new predictors \(\tilde{x}\) and the posterior draws \(\theta^{(m)}\). Most importantly, neither the original data nor the model density is required.

从数据模型中进行抽取。请注意,抽取 \(\tilde{y}^{(m)}\) 仅取决于新的预测变量 \(\tilde{x}\) 和后验样本 \(\theta^{(m)}\)。最重要的是,不需要原始数据或模型密度。

By saving the posterior draws, predictions for new data items \(\tilde{x}\) may be generated whenever needed. In Stan’s interfaces, this is done by writing a second Stan program that inputs the original program’s parameters and the new predictors. For example, for the linear regression case, the program to take posterior draws declares the data and parameters, and defines the model.

通过保存后验样本,可以在需要时生成新数据 \(\tilde{x}\) 的预测值。在 Stan 的接口中,这可以通过编写第二个 Stan 程序来实现,该程序输入原始程序的参数和新的预测变量。例如,在线性回归的情况下,用于获取后验样本的程序会声明数据和参数,并定义模型。

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
  alpha ~ normal(0, 5);
  beta ~ normal(0, 1);
  sigma ~ lognormal(0, 0.5);
}

A second program can be used to generate new observations. This follow-on program need only declare the parameters as they were originally defined. This may require defining constants in the data block such as sizes and hyperparameters that are involved in parameter size or constraint declarations. Then additional data is read in corresponding to predictors for new outcomes that have yet to be observed. There is no need to repeat the model or unneeded transformed parameters or generated quantities. The complete follow-on program for prediction just declares the predictors in the data, the original parameters, and then the predictions in the generated quantities block.

第二个程序可以用于生成新的观测数据。这个后续程序只需要像原来定义参数那样声明参数即可。这可能需要在数据块中定义一些常量,例如参数大小或约束声明中涉及的大小和超参数。然后,读入与尚未观测到的新结果相关的预测变量的额外数据。不需要重复模型、不需要的转换参数或生成的数量。完整的预测后续程序只需在数据中声明预测变量、原始参数,然后在生成的数量块中声明预测结果。

data {
  int<lower=0> N_tilde;
  vector[N_tilde] x_tilde;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
generated quantities {
  // vectorized normal_rng returns an array of reals, not a vector
  array[N_tilde] real y_tilde
    = normal_rng(alpha + beta * x_tilde, sigma);
}

When running stand-alone generated quantities, the inputs required are the original draws for the parameters and any predictors corresponding to new predictions, and the output will be draws for \(\tilde{y}\) or derived quantities such as event probabilities.

当运行独立的生成数量时,所需的输入包括原始参数的抽样结果以及与新预测对应的任何预测变量,而输出将是 \(\tilde{y}\) 或派生数量(如事件概率)的抽样结果。

Any posterior predictive quantities desired may be generated this way. For example, event probabilities are estimated in the usual way by defining indicator variables in the generated quantities block.

通过这种方式可以生成所需的任何后验预测数量。例如,可以通过在生成的数量块中定义指示变量来通常的方式估计事件概率
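As an illustration of an event probability in the stand-alone setting, the follow-on program for the linear regression could add an indicator next to the predictions. The data item threshold and the variable y1_gt_threshold are hypothetical names introduced for this sketch. The lower bound of 1 on N_tilde is likewise an assumption, so that the first prediction exists.

```stan
data {
  int<lower=1> N_tilde;
  vector[N_tilde] x_tilde;
  real threshold;              // hypothetical cutoff for the event of interest
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
generated quantities {
  // predictions for the new items, one set per saved posterior draw
  array[N_tilde] real y_tilde
    = normal_rng(alpha + beta * x_tilde, sigma);
  // indicator whose mean over the draws estimates
  // Pr[y_tilde[1] > threshold | x_tilde, x, y]
  int<lower=0, upper=1> y1_gt_threshold = y_tilde[1] > threshold;
}
```

This program has no model block, so it is meant to be run against previously saved draws of alpha, beta, and sigma (for example, through the generate_quantities method in CmdStan) rather than sampled on its own.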