31 Held-Out 评估与交叉验证
Held-Out Evaluation and Cross-Validation
本节译者:沈梓梁
初次校审:李君竹
二次校审:李君竹(Claude 辅助)
Held-out evaluation involves splitting a data set into two parts, a training data set and a test data set. The training data is used to estimate the model and the test data is used for evaluation. Held-out evaluation is commonly used to declare winners in predictive modeling competitions such as those run by Kaggle.
留出评估(Held-out evaluation)是将数据集分成两部分:训练数据集和测试数据集。训练数据用于估计模型,测试数据用于评估。留出评估常用于预测建模竞赛(如 Kaggle 举办的竞赛)中确定获胜者。
Cross-validation involves repeated held-out evaluations performed by partitioning a single data set in different ways. The training/test split can be done either by randomly selecting the test set, or by partitioning the data set into several equally-sized subsets and then using each subset in turn as the test data with the other folds as training data.
交叉验证涉及通过不同方式划分单个数据集来执行重复的留出评估。训练/测试分割可以通过随机选择测试集来完成,也可以通过将数据集划分为几个大小相等的子集,然后依次使用每个子集作为测试数据,其他折作为训练数据。
Held-out evaluation and cross-validation may involve any kind of predictive statistics, with common choices being the predictive log density on test data, squared error of parameter estimates, or accuracy in a classification task.
留出评估和交叉验证可以涉及任何类型的预测统计量,常见的选择包括测试数据的预测对数密度、参数估计的平方误差或分类任务的准确率。
31.1 Evaluating posterior predictive densities
评估后验预测密度
Given training data \((x, y)\) consisting of parallel sequences of predictors and observations and test data \((\tilde{x}, \tilde{y})\) of the same structure, the posterior predictive density is
给定由预测变量和观测值的并行序列组成的训练数据 \((x, y)\) 以及相同结构的测试数据 \((\tilde{x}, \tilde{y})\),后验预测密度为
\[ p(\tilde{y} \mid \tilde{x}, x, y) = \int p(\tilde{y} \mid \tilde{x}, \theta) \cdot p(\theta \mid x, y) \, \textrm{d}\theta, \]
where \(\theta\) is the vector of model parameters. This predictive density is the density of the test observations, conditioned on both the test predictors \(\tilde{x}\) and the training data \((x, y).\)
其中 \(\theta\) 是模型参数向量。该预测密度是测试观测值的密度,以测试预测变量 \(\tilde{x}\) 和训练数据 \((x, y)\) 为条件。
This integral may be calculated with Monte Carlo methods as usual,
该积分可以如常使用蒙特卡罗方法计算,
\[ p(\tilde{y} \mid \tilde{x}, x, y) \approx \frac{1}{M} \sum_{m = 1}^M p(\tilde{y} \mid \tilde{x}, \theta^{(m)}), \] where the \(\theta^{(m)} \sim p(\theta \mid x, y)\) are draws from the posterior given only the training data \((x, y).\)
其中 \(\theta^{(m)} \sim p(\theta \mid x, y)\) 是仅基于训练数据 \((x, y)\) 的后验抽样。
To avoid underflow in calculations, it will be more stable to compute densities on the log scale. Taking the logarithm and pushing it through results in a stable computation,
为了避免计算中的下溢,在对数尺度上计算密度会更加稳定。取对数并推导可以得到稳定的计算,
\[\begin{eqnarray*} \log p(\tilde{y} \mid \tilde{x}, x, y) & \approx & \log \frac{1}{M} \sum_{m = 1}^M p(\tilde{y} \mid \tilde{x}, \theta^{(m)}), \\[4pt] & = & -\log M + \log \sum_{m = 1}^M p(\tilde{y} \mid \tilde{x}, \theta^{(m)}), \\[4pt] & = & -\log M + \log \sum_{m = 1}^M \exp(\log p(\tilde{y} \mid \tilde{x}, \theta^{(m)})) \\[4pt] & = & -\log M + \textrm{log-sum-exp}_{m = 1}^M \log p(\tilde{y} \mid \tilde{x}, \theta^{(m)}) \end{eqnarray*}\] where the log sum of exponentials function is defined so as to make the above equation hold,
其中对数指数和函数的定义使得上述方程成立,
\[ \textrm{log-sum-exp}_{m = 1}^M \, \mu_m = \log \sum_{m=1}^M \exp(\mu_m). \] The log sum of exponentials function can be implemented so as to avoid underflow and maintain high arithmetic precision as
对数指数和函数可以实现为避免下溢并保持高算术精度,
\[ \textrm{log-sum-exp}_{m = 1}^M \mu_m = \textrm{max}(\mu) + \log \sum_{m = 1}^M \exp(\mu_m - \textrm{max}(\mu)). \] Pulling the maximum out preserves all of its precision. By subtracting the maximum, the terms \(\mu_m - \textrm{max}(\mu) \leq 0\), and thus will not overflow.
提取最大值可以保留其所有精度。通过减去最大值,项 \(\mu_m - \textrm{max}(\mu) \leq 0\),因此不会溢出。
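A minimal sketch of this computation in Python with NumPy (the helper name log_sum_exp is ours):

下面是该计算的一个 Python/NumPy 最小示意(辅助函数名 log_sum_exp 为自拟):

import numpy as np

def log_sum_exp(mu):
    # Pull out the max: each exponent mu_m - max(mu) <= 0, so nothing overflows.
    mu = np.asarray(mu, dtype=float)
    m = np.max(mu)
    return m + np.log(np.sum(np.exp(mu - m)))

# Naive log(sum(exp(...))) overflows here; the stable version returns 1000 + log(2).
print(log_sum_exp([1000.0, 1000.0]))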
Stan program
Stan 程序
To evaluate the log predictive density of a model, it suffices to implement the log predictive density of the test data in the generated quantities block. The log sum of exponentials calculation must be done on the outside of Stan using the posterior draws of \(\log p(\tilde{y} \mid \tilde{x}, \theta^{(m)}).\)
要评估模型的对数预测密度,只需在 generated quantities 块中实现测试数据的对数预测密度即可。对数指数和的计算必须在 Stan 外部使用 \(\log p(\tilde{y} \mid \tilde{x}, \theta^{(m)})\) 的后验抽样完成。
Here is the code for evaluating the log posterior predictive density in a simple linear regression of the test data \(\tilde{y}\) given predictors \(\tilde{x}\) and training data \((x, y).\)
以下是在给定预测变量 \(\tilde{x}\) 和训练数据 \((x, y)\) 的简单线性回归中评估测试数据 \(\tilde{y}\) 的对数后验预测密度的代码。
data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x;
  int<lower=0> N_tilde;
  vector[N_tilde] x_tilde;
  vector[N_tilde] y_tilde;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
}
generated quantities {
  real log_p = normal_lpdf(y_tilde | alpha + beta * x_tilde, sigma);
}
Only the training data x and y are used in the model block. The test data y_tilde and test predictors x_tilde appear in only the generated quantities block. Thus the program is not cheating by using the test data during training. Although this model does not do so, it would be fair to use x_tilde in the model block, because only the test observations y_tilde are unknown before they are predicted.
只有训练数据 x 和 y 在 model 块中使用。测试数据 y_tilde 和测试预测变量 x_tilde 只出现在 generated quantities 块中。因此,该程序没有在训练期间使用测试数据作弊。虽然这个模型没有这样做,但在 model 块中使用 x_tilde 也是合理的,因为只有测试观测值 y_tilde 在预测之前是未知的。
Given \(M\) posterior draws from Stan, the sequence log_p[1:M] will be available, so that the log posterior predictive density of the test data given training data and predictors is just log_sum_exp(log_p) - log(M).
给定来自 Stan 的 \(M\) 个后验抽样,序列 log_p[1:M] 将可用,因此给定训练数据和预测变量时,测试数据的对数后验预测密度就是 log_sum_exp(log_p) - log(M)。
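For example, a sketch with CmdStanPy (assuming the program above is saved as eval.stan and the data as eval_data.json):

例如,使用 CmdStanPy 的示意(假设上面的程序保存为 eval.stan,数据保存为 eval_data.json):

import numpy as np
from cmdstanpy import CmdStanModel

fit = CmdStanModel(stan_file="eval.stan").sample(data="eval_data.json")
log_p = fit.stan_variable("log_p")  # one value per posterior draw, shape (M,)
M = log_p.shape[0]

# log p(y_tilde | x_tilde, x, y) =~ log_sum_exp(log_p) - log(M)
m = np.max(log_p)
lpd = m + np.log(np.sum(np.exp(log_p - m))) - np.log(M)
print(lpd)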
31.2 Estimation error
估计误差
Parameter estimates
参数估计
Estimation is usually considered for unknown parameters. If the data from which the parameters were estimated were simulated, the true values of the parameters are known. If \(\theta\) is the true value and \(\hat{\theta}\) the estimate, then the error is just the difference between the estimate and the true value,
估计通常是针对未知参数而言的。如果用于估计参数的数据是模拟生成的,那么参数的真实值是已知的。如果 \(\theta\) 是真实值,\(\hat{\theta}\) 是估计值,那么误差就是估计值与真实值之间的差,
\[ \textrm{err} = \hat{\theta} - \theta. \]
If the estimate is larger than the true value, the error is positive, and if it's smaller, the error is negative. If an estimator is unbiased, then the expected error is zero. So typically, absolute error or squared error is used, both of which have positive expectation for an imperfect estimator. Absolute error is defined as
如果估计值大于真实值,则误差为正;如果估计值小于真实值,则误差为负。如果估计量是无偏的,那么期望误差为零。因此通常使用绝对误差或平方误差,对于不完美的估计量,它们的期望总是正的。绝对误差定义为
\[ \textrm{abs-err} = \left| \hat{\theta} - \theta \right| \] and squared error as
平方误差定义为
\[ \textrm{sq-err} = \left( \hat{\theta} - \theta \right)^2. \] Gneiting and Raftery (2007) provide a thorough overview of such scoring rules and their properties.
Gneiting and Raftery (2007) 提供了这些评分规则及其性质的全面概述。
Bayesian posterior means minimize expected square error, whereas posterior medians minimize expected absolute error. Estimates based on modes rather than probabilities, such as (penalized) maximum likelihood estimates or maximum a posteriori estimates, do not have these properties.
贝叶斯后验均值最小化期望平方误差,而后验中位数最小化期望绝对误差。基于众数而非概率的估计,如(惩罚的)最大似然估计或最大后验估计,不具有这些性质。
Predictive estimates
预测估计
In addition to parameters, other unknown quantities may be estimated, such as the score of a football match or the effect of a medical treatment given to a subject. In these cases, square error is defined in the same way. If there are multiple exchangeable outcomes being estimated, \(z_1, \ldots, z_N,\) then it is common to report mean square error (MSE),
除了参数之外,还可以估计其他未知量,如足球比赛的比分或对受试者施加的医疗治疗效果。在这些情况下,平方误差以相同的方式定义。如果有多个可交换的结果被估计,\(z_1, \ldots, z_N\),那么通常报告均方误差(MSE),
\[ \textrm{mse} = \frac{1}{N} \sum_{n = 1}^N \left( \hat{z}_n - z_n\right)^2. \] To put the error back on the scale of the original value, the square root may be applied, resulting in what is known prosaically as root mean square error (RMSE),
为了使误差回到原始值的尺度,可以取平方根,得到所谓的均方根误差(RMSE),
\[ \textrm{rmse} = \sqrt{\textrm{mse}}. \]
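A minimal Python sketch of these two error measures (the helper names mse and rmse are ours):

这两个误差度量的 Python 最小示意(函数名 mse、rmse 为自拟):

import numpy as np

def mse(z_hat, z):
    # Mean square error between estimates z_hat and true values z.
    return np.mean((np.asarray(z_hat) - np.asarray(z)) ** 2)

def rmse(z_hat, z):
    # Root mean square error, back on the scale of the original values.
    return np.sqrt(mse(z_hat, z))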
Predictive estimates in Stan
Stan 中的预测估计
Consider a simple linear regression model with parameters for the intercept \(\alpha\) and slope \(\beta\), along with predictors \(\tilde{x}_n\). The standard Bayesian estimate is the expected value of \(\tilde{y}_n\) given the predictors and training data,
考虑一个简单的线性回归模型,带有截距 \(\alpha\) 和斜率 \(\beta\) 参数,以及预测变量 \(\tilde{x}_n\)。标准贝叶斯估计是给定预测变量和训练数据时 \(\tilde{y}_n\) 的期望值,
\[\begin{eqnarray*} \hat{\tilde{y}}_n & = & \mathbb{E}[\tilde{y}_n \mid \tilde{x}_n, x, y] \\[4pt] & \approx & \frac{1}{M} \sum_{m = 1}^M \tilde{y}_n^{(m)} \end{eqnarray*}\] where \(\tilde{y}_n^{(m)}\) is drawn from the data model
其中 \(\tilde{y}_n^{(m)}\) 从数据模型中抽取
\[ \tilde{y}_n^{(m)} \sim p(\tilde{y}_n \mid \tilde{x}_n, \alpha^{(m)}, \beta^{(m)}), \] for parameters \(\alpha^{(m)}\) and \(\beta^{(m)}\) drawn from the posterior,
对于从后验中抽取的参数 \(\alpha^{(m)}\) 和 \(\beta^{(m)}\),
\[ (\alpha^{(m)}, \beta^{(m)}) \sim p(\alpha, \beta \mid x, y). \]
In the linear regression case, two stages of simplification can be carried out, the first of which helpfully reduces the variance of the estimator. First, rather than averaging samples \(\tilde{y}_n^{(m)}\), the same result is obtained by averaging linear predictions,
在线性回归的情况下,可以进行两个阶段的简化,第一个阶段有助于减少估计量的方差。首先,不是平均样本 \(\tilde{y}_n^{(m)}\),而是通过平均线性预测获得相同的结果,
\[\begin{eqnarray*} \hat{\tilde{y}}_n & = & \mathbb{E}\left[ \alpha + \beta \cdot \tilde{x}_n \mid \tilde{x}_n, x, y \right] \\[4pt] & \approx & \frac{1}{M} \sum_{m = 1}^M \alpha^{(m)} + \beta^{(m)} \cdot \tilde{x}_n. \end{eqnarray*}\] This is possible because
这是可能的,因为
\[ \tilde{y}_n^{(m)} \sim \textrm{normal}(\tilde{y}_n \mid \alpha^{(m)} + \beta^{(m)} \cdot \tilde{x}_n, \sigma^{(m)}), \] and the normal distribution has symmetric error so that the expectation of \(\tilde{y}_n^{(m)}\) is the same as \(\alpha^{(m)} + \beta^{(m)} \cdot \tilde{x}_n\). Replacing the sampled quantity \(\tilde{y}_n^{(m)}\) with its expectation is a general variance reduction technique for Monte Carlo estimates known as Rao-Blackwellization (Rao 1945; Blackwell 1947).
正态分布具有对称误差,因此 \(\tilde{y}_n^{(m)}\) 的期望与 \(\alpha^{(m)} + \beta^{(m)} \cdot \tilde{x}_n\) 相同。用期望替换采样量 \(\tilde{y}_n^{(m)}\) 是蒙特卡罗估计的一般方差减少技术,称为 Rao-Blackwellization (Rao 1945; Blackwell 1947)。
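A small simulation sketch of this variance reduction (illustrative only: the posterior draws of \(\alpha\) and \(\beta\) are faked with a normal RNG rather than taken from a fitted model):

下面用一个小型模拟示意这种方差减少(仅作演示:\(\alpha\) 和 \(\beta\) 的后验抽样用正态随机数伪造,并非来自真实拟合的模型):

import numpy as np

rng = np.random.default_rng(1)
M, x_tilde, sigma, reps = 100, 2.5, 1.0, 10_000
est_draws = np.empty(reps)  # estimator that averages sampled outcomes
est_rb = np.empty(reps)     # Rao-Blackwellized estimator: averages the means
for r in range(reps):
    alpha = rng.normal(1.0, 0.1, M)  # stand-ins for posterior draws
    beta = rng.normal(0.5, 0.1, M)
    mu = alpha + beta * x_tilde      # linear predictor per draw
    est_draws[r] = rng.normal(mu, sigma).mean()
    est_rb[r] = mu.mean()
# Both target the same expectation; the second has visibly smaller variance.
print(est_draws.var(), est_rb.var())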
In the linear case, because the predictor is linear in the coefficients, the estimate can be further simplified to use the estimated coefficients,
在线性情况下,因为预测器在系数上是线性的,估计可以进一步简化为使用估计的系数,
\[\begin{eqnarray*} \hat{\tilde{y}}_n & \approx & \frac{1}{M} \sum_{m = 1}^M \left( \alpha^{(m)} + \beta^{(m)} \cdot \tilde{x}_n \right) \\[4pt] & = & \frac{1}{M} \sum_{m = 1}^M \alpha^{(m)} + \frac{1}{M} \sum_{m = 1}^M (\beta^{(m)} \cdot \tilde{x}_n) \\[4pt] & = & \frac{1}{M} \sum_{m = 1}^M \alpha^{(m)} + \left( \frac{1}{M} \sum_{m = 1}^M \beta^{(m)}\right) \cdot \tilde{x}_n \\[4pt] & = & \hat{\alpha} + \hat{\beta} \cdot \tilde{x}_n. \end{eqnarray*}\]
In Stan, only the first of the two steps (the important variance reduction step) can be coded directly in the Stan program. The linear predictor is defined in the generated quantities block.
在 Stan 中,两个步骤中只有第一个(重要的方差减少步骤)可以直接在 Stan 程序中编码。线性预测器在 generated quantities 块中定义。
data {
  int<lower=0> N_tilde;
  vector[N_tilde] x_tilde;
  // ...
}
// ...
generated quantities {
  vector[N_tilde] tilde_y = alpha + beta * x_tilde;
}
The posterior mean of tilde_y calculated by Stan is the Bayesian estimate \(\hat{\tilde{y}}.\) The posterior median may also be calculated and used as an estimate, though square error and the posterior mean are more commonly reported.
Stan 计算的 tilde_y 的后验均值就是贝叶斯估计 \(\hat{\tilde{y}}\)。后验中位数也可以被计算并用作估计,尽管平方误差和后验均值更常被报告。
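For instance, a post-processing sketch (the placeholder array draws stands for the M-by-N_tilde posterior draws of tilde_y, e.g. fit.stan_variable("tilde_y") in CmdStanPy):

例如,后处理的示意(占位数组 draws 代表 tilde_y 的 M×N_tilde 后验抽样,例如 CmdStanPy 中的 fit.stan_variable("tilde_y")):

import numpy as np

# Placeholder posterior draws of tilde_y: M = 4 draws, N_tilde = 3 test points.
draws = np.array([[1.0, 2.0, 3.0],
                  [1.2, 1.8, 3.1],
                  [0.9, 2.2, 2.9],
                  [1.1, 2.0, 3.0]])
y_tilde_hat = draws.mean(axis=0)        # posterior mean estimate per test item
y_tilde_med = np.median(draws, axis=0)  # posterior median, an alternative estimate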
31.3 Cross-validation
交叉验证
Cross-validation involves choosing multiple subsets of a data set as the test set and using the other data as training. This can be done by partitioning the data and using each subset in turn as the test set with the remaining subsets as training data. A partition into ten subsets is common to reduce computational overhead. In the limit, when the test set is just a single item, the result is known as leave-one-out (LOO) cross-validation (Vehtari, Gelman, and Gabry 2017).
交叉验证涉及选择数据集的多个子集作为测试集,并使用其他数据作为训练集。这可以通过划分数据并依次使用每个子集作为测试集,其余子集作为训练数据来完成。通常分成十个子集以减少计算开销。在极限情况下,当测试集只是单个项目时,结果称为留一法(LOO)交叉验证 (Vehtari, Gelman, and Gabry 2017)。
Partitioning the data and reusing the partitions is very fiddly in the indexes and may not lead to even divisions of the data. It’s far easier to use random partitions, which support arbitrarily sized test/training splits and can be easily implemented in Stan. The drawback is that the variance of the resulting estimate is higher than with a balanced block partition.
划分数据并重用分区在索引处理上非常繁琐,可能不会导致数据的均匀划分。使用随机分区要容易得多,它支持任意大小的测试/训练分割,并且可以在 Stan 中轻松实现。缺点是结果估计的方差高于平衡块分区。
Stan implementation with random folds
使用随机折叠的 Stan 实现
For the simple linear regression model, randomized cross-validation can be implemented in a single model. To randomly permute a vector in Stan, the simplest approach is the following.
对于简单的线性回归模型,随机交叉验证可以在单个模型中实现。在 Stan 中随机排列向量,最简单的方法如下。
functions {
  array[] int permutation_rng(int N) {
    array[N] int y;
    for (n in 1 : N) {
      y[n] = n;  // initialize y as the identity permutation
    }
    vector[N] theta = rep_vector(1.0 / N, N);
    for (n in 1 : size(y)) {
      int i = categorical_rng(theta);  // random position to swap with
      int temp = y[n];
      y[n] = y[i];
      y[i] = temp;
    }
    return y;
  }
}
The name of the function must end in _rng because it uses other random functions internally. This restricts its usage to the transformed data and generated quantities blocks. The code walks through an array of integers, exchanging each item with another randomly chosen item, resulting in a random permutation of the integers 1:N.¹
函数名必须以 _rng 结尾,因为它在内部使用了其他随机函数,这限制了它只能在 transformed data 和 generated quantities 块中使用。代码遍历一个整数数组,将每个项目与另一个随机选择的项目交换,从而得到整数 1:N 的一个随机排列。¹
The transformed data block uses the permutation RNG to generate training data and test data by taking prefixes and suffixes of the permuted data.
transformed data 块使用排列随机数生成器通过获取排列数据的前缀和后缀来生成训练数据和测试数据。
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
  int<lower=0, upper=N> N_test;
}
transformed data {
  int N_train = N - N_test;
  array[N] int permutation = permutation_rng(N);
  vector[N_train] x_train = x[permutation[1 : N_train]];
  vector[N_train] y_train = y[permutation[1 : N_train]];
  vector[N_test] x_test = x[permutation[N_train + 1 : N]];
  vector[N_test] y_test = y[permutation[N_train + 1 : N]];
}
Recall that in Stan, permutation[1:N_train] is an array of integers, so that x[permutation[1 : N_train]] is a vector defined for i in 1:N_train by
回想一下,在 Stan 中,permutation[1:N_train] 是一个整数数组,因此 x[permutation[1 : N_train]] 是一个向量,对于 i in 1:N_train,定义为
x[permutation[1 : N_train]][i] = x[permutation[1 : N_train][i]]
                               = x[permutation[i]]
Given the test/train split, the rest of the model is straightforward.
给定测试/训练分割,模型的其余部分很简单。
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y_train ~ normal(alpha + beta * x_train, sigma);
  { alpha, beta, sigma } ~ normal(0, 1);
}
generated quantities {
  vector[N_test] y_test_hat = to_vector(normal_rng(alpha + beta * x_test, sigma));
  vector[N_test] err = y_test_hat - y_test;
}
The prediction y_test_hat is defined in the generated quantities block using the general form involving all uncertainty. The posterior mean of this quantity corresponds to using a posterior mean estimator,
预测 y_test_hat 在 generated quantities 块中使用包含所有不确定性的一般形式定义。该量的后验均值对应于使用后验均值估计量,
\[\begin{eqnarray*} \hat{y}^{\textrm{test}} & = & \mathbb{E}\left[ y^{\textrm{test}} \mid x^{\textrm{test}}, x^{\textrm{train}}, y^{\textrm{train}} \right] \\[4pt] & \approx & \frac{1}{M} \sum_{m = 1}^M \hat{y}^{\textrm{test}(m)}. \end{eqnarray*}\]
Because the test set is constant and the expectation operator is linear, the posterior mean of err as defined in the Stan program will be the error of the posterior mean estimate,
因为测试集是常数且期望算子是线性的,Stan 程序中定义的 err 的后验均值将是后验均值估计的误差,
\[\begin{eqnarray*} \hat{y}^{\textrm{test}} - y^{\textrm{test}} & = & \mathbb{E}\left[ \hat{y}^{\textrm{test}} \mid x^{\textrm{test}}, x^{\textrm{train}}, y^{\textrm{train}} \right] - y^{\textrm{test}} \\[4pt] & = & \mathbb{E}\left[ \hat{y}^{\textrm{test}} - y^{\textrm{test}} \mid x^{\textrm{test}}, x^{\textrm{train}}, y^{\textrm{train}} \right] \\[4pt] & \approx & \frac{1}{M} \sum_{m = 1}^M \hat{y}^{\textrm{test}(m)} - y^{\textrm{test}}, \end{eqnarray*}\] where \[ \hat{y}^{\textrm{test}(m)} \sim p(y \mid x^{\textrm{test}}, x^{\textrm{train}}, y^{\textrm{train}}). \] This just calculates error; taking absolute value or squaring will compute absolute error and mean square error. Note that the absolute value and square operation should not be done within the Stan program because neither is a linear function and the result of averaging squares is not the same as squaring an average in general.
这只是计算误差;取绝对值或平方将计算绝对误差和均方误差。注意,绝对值和平方操作不应该在 Stan 程序中完成,因为两者都不是线性函数,平均平方的结果与平方平均在一般情况下不同。
Because the test set size is chosen for convenience in cross-validation, results should be presented on a per-item scale, such as average absolute error or root mean square error, not on the scale of error in the fold being evaluated.
因为在交叉验证中为了方便选择测试集大小,结果应该以每个项目的尺度呈现,如平均绝对误差或均方根误差,而不是以被评估折叠的误差尺度呈现。
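A post-processing sketch (the placeholder err_draws stands for the M-by-N_test posterior draws of err from the program above):

后处理示意(占位数组 err_draws 代表上面程序中 err 的 M×N_test 后验抽样):

import numpy as np

# Placeholder posterior draws of err: M = 3 draws, N_test = 2 items.
err_draws = np.array([[0.3, -0.1],
                      [0.5, 0.2],
                      [-0.2, 0.1]])
err_hat = err_draws.mean(axis=0)       # error of the posterior mean estimate
mae = np.mean(np.abs(err_hat))         # average absolute error per test item
rmse = np.sqrt(np.mean(err_hat ** 2))  # root mean square error per test item
# Average over draws first; taking |err| or err^2 inside Stan would average
# the wrong quantity, because neither operation is linear.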
User-defined permutations
用户定义的排列
It is straightforward to declare the variable permutation in the data block instead of the transformed data block and read it in as data. This allows an external program to control the blocking, allowing non-random partitions to be evaluated.
在 data 块而不是 transformed data 块中声明变量 permutation 并将其作为数据读入是很简单的。这允许外部程序控制分块,从而可以评估非随机的分区。
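For example, a sketch that builds such a permutation for fold k of a K-fold partition (the helper kfold_permutation is ours, not part of Stan):

例如,为 K 折划分中的第 k 折构造这种排列的示意(辅助函数 kfold_permutation 为自拟,并非 Stan 的一部分):

import numpy as np

def kfold_permutation(N, K, k, seed=0):
    # 1-based permutation that places fold k last, so the Stan program's
    # suffix permutation[N_train + 1 : N] picks out fold k as the test set.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(N), K)
    train = np.concatenate([f for j, f in enumerate(folds) if j != k])
    perm = np.concatenate([train, folds[k]]) + 1  # Stan indexes from 1
    return perm.tolist(), len(folds[k])

permutation, N_test = kfold_permutation(N=10, K=5, k=0)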
Cross-validation with structured data
结构化数据的交叉验证
Cross-validation must be done with care if the data is inherently structured. For example, in a simple natural language application, data might be structured by document. For cross-validation, one needs to cross-validate at the document level, not at the individual word level. This is related to mixed replication in posterior predictive checking, where there is a choice to simulate new elements of existing groups or generate entirely new groups.
如果数据本质上是结构化的,则必须谨慎进行交叉验证。例如,在简单的自然语言应用中,数据可能按文档结构化。对于交叉验证,需要在文档级别而不是在单个词级别进行交叉验证。这与后验预测检查中的混合复制有关,其中可以选择模拟现有组的新元素或生成全新的组。
Education testing applications are typically grouped by school district, by school, by classroom, and by demographic features of the individual students or the school as a whole. Depending on the variables of interest, different structured subsets should be evaluated. For example, the focus of interest may be on the performance of entire classrooms, so it would make sense to cross-validate at the class or school level on classroom performance.
教育测试应用通常按学区、学校、教室以及学生个人或学校整体的人口统计特征分组。根据感兴趣的变量,应该评估不同的结构化子集。例如,关注的焦点可能是整个教室的表现,因此在班级或学校级别对教室表现进行交叉验证是有意义的。
Cross-validation with spatio-temporal data
时空数据的交叉验证
Often data measurements have spatial or temporal properties. For example, home energy consumption varies by time of day, day of week, on holidays, by season, and by ambient temperature (e.g., a hot spell or a cold snap). Cross-validation must be tailored to the predictive goal. For example, in predicting energy consumption, the quantity of interest may be the prediction for next week’s energy consumption given historical data and current weather covariates. This suggests an alternative to cross-validation, wherein individual weeks are each tested given previous data. This often allows comparing how well prediction performs with more or less historical data.
数据测量通常具有空间或时间属性。例如,家庭能源消耗因一天中的时间、一周中的某一天、节假日、季节和环境温度(例如热浪或寒流)而变化。交叉验证必须针对预测目标进行调整。例如,在预测能源消耗时,感兴趣的量可能是给定历史数据和当前天气协变量的下周能源消耗预测。这提出了交叉验证的替代方案,其中每个星期都根据先前的数据进行测试。这通常允许比较预测在拥有更多或更少历史数据时的表现。
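A sketch of such a scheme at the level of week indices (illustrative only):

这种方案在周索引层面的示意(仅作演示):

def expanding_window_splits(n_weeks, min_history):
    # Each test week is predicted from all weeks that precede it,
    # once at least min_history weeks of training data exist.
    for t in range(min_history, n_weeks):
        yield list(range(t)), t

for train_weeks, test_week in expanding_window_splits(n_weeks=6, min_history=3):
    print(train_weeks, "->", test_week)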
Approximate cross-validation
近似交叉验证
Vehtari, Gelman, and Gabry (2017) introduce a method that approximates the evaluation of leave-one-out cross validation inexpensively using only the data point log likelihoods from a single model fit. This method is documented and implemented in the R package loo (Gabry et al. 2019).
Vehtari, Gelman, and Gabry (2017) 引入了一种方法,仅使用来自单个模型拟合的数据点对数似然,以低成本近似评估留一法交叉验证。该方法在 R 包 loo (Gabry et al. 2019) 中有文档记录和实现。
¹ The traditional approach is to walk through a vector and replace each item with a random element from the remaining elements, which is guaranteed to move each item only once. This was not done here as it would require constructing a new categorical parameter theta of a different size at each step, because Stan does not have a uniform discrete RNG built in.↩︎
¹ 传统方法是遍历向量,并用剩余元素中随机选择的一个替换每个项目,这保证每个项目只被移动一次。这里没有这样做,因为 Stan 没有内置的均匀离散随机数生成器,这样做需要在每一步构造一个不同大小的分类分布参数 theta。↩︎