3  缺失数据和部分已知参数

Missing Data and Partially Known Parameters

本节译者:于娇

初次校审:张梓源

二次校审:邱怡轩(DeepSeek 辅助)

Bayesian inference supports a general approach to missing data in which any missing data item is represented as a parameter that is estimated in the posterior (Gelman et al. 2013). If the missing data are not explicitly modeled, as in the predictors for most regression models, then the result is an improper prior on the parameter representing the missing predictor.

贝叶斯推断提供了一种通用的缺失数据处理方法,将每个缺失数据项都视为后验分布中需要估计的参数 (Gelman et al. 2013)。如果缺失数据没有像大多数回归模型中的预测变量那样被显式建模,那么表示缺失预测变量的参数将具有非正规的先验分布。

Mixing arrays of observed and missing data can be difficult to include in Stan, partly because it can be tricky to model discrete unknowns in Stan and partly because unlike some other statistical languages (for example, R and Bugs), Stan requires observed and unknown quantities to be defined in separate places in the model. Thus it can be necessary to include code in a Stan program to splice together observed and missing parts of a data structure. Examples are provided later in the chapter.

在 Stan 中处理包含观测数据和缺失数据的混合数组可能具有挑战性,部分原因是 Stan 中对离散未知量建模较为复杂,而且与某些其他统计语言(例如 R 和 Bugs)不同,Stan 要求在模型中分别定义观测量和未知量。因此,可能需要在 Stan 程序中添加代码来拼接数据结构的观测部分和缺失部分。本章后面提供了示例。

3.1 Missing data

缺失数据

Stan treats variables declared in the data and transformed data blocks as known and the variables in the parameters block as unknown.

Stan 将在 datatransformed data 块中声明的变量视为已知的,将 parameters 块中声明的变量视为未知量。

An example involving missing normal observations could be coded as follows.1

涉及缺失正态观测值的示例可以如下编码。2

data {
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  array[N_obs] real y_obs;
}
parameters {
  real mu;
  real<lower=0> sigma;
  array[N_mis] real y_mis;
}
model {
  y_obs ~ normal(mu, sigma);
  y_mis ~ normal(mu, sigma);
}

The number of observed and missing data points are coded as data with non-negative integer variables N_obs and N_mis. The observed data are provided as an array data variable y_obs. The missing data are coded as an array parameter, y_mis. The ordinary parameters being estimated, the location mu and scale sigma, are also coded as parameters. The model is vectorized on the observed and missing data; combining them in this case would be less efficient because the data observations would be promoted and have needless derivatives calculated.

观测数据和缺失数据的数量分别用非负整数变量 N_obsN_mis 编码。观测数据被表示为数组数据变量 y_obs 。缺失数据编码为一个数组参数 y_mis。被估计的普通参数包括位参数 mu 和尺度参数 sigma,它们也被编码为参数。该模型在观测和缺失数据上进行向量化;在这种情况下合并它们会降低效率,因为数据观测值会被提升,导致不必要的导数计算。

3.2 Partially known parameters

部分参数已知

In some situations, such as when a multivariate probability function has partially observed outcomes or parameters, it will be necessary to create a vector mixing known (data) and unknown (parameter) values. This can be done in Stan by creating a vector or array in the transformed parameters block and assigning to it.

在某些情况下,如多变量概率函数具有部分观测结果或参数时,需要创建包含已知(数据)和未知(参数)值的混合向量。这可以在 Stan 中通过在 transformed parameters 块中创建向量或数组并对其赋值来完成。

The following example involves a bivariate covariance matrix in which the variances are known, but the covariance is not.

下面的例子涉及一个双变量协方差矩阵,其中方差是已知的,但协方差不是。

data {
  int<lower=0> N;
  array[N] vector[2] y;
  real<lower=0> var1;
  real<lower=0> var2;
}
transformed data {
  real<lower=0> max_cov = sqrt(var1 * var2);
  real<upper=0> min_cov = -max_cov;
}
parameters {
  vector[2] mu;
  real<lower=min_cov, upper=max_cov> cov;
}
transformed parameters {
  matrix[2, 2] Sigma;
  Sigma[1, 1] = var1;
  Sigma[1, 2] = cov;
  Sigma[2, 1] = cov;
  Sigma[2, 2] = var2;
}
model {
  y ~ multi_normal(mu, Sigma);
}

The variances are defined as data in variables var1 and var2, whereas the covariance is defined as a parameter in variable cov. The \(2 \times 2\) covariance matrix Sigma is defined as a transformed parameter, with the variances assigned to the two diagonal elements and the covariance to the two off-diagonal elements.

方差定义为变量 var1var2 中的数据,协方差定义为变量 cov 中的参数。\(2 \times 2\) 协方差矩阵 Sigma 被定义为一个变换参数,其中把方差分配给两个对角元素,把协方差分配给这两个非对角元素。

The constraint on the covariance declaration ensures that the resulting covariance matrix sigma is positive definite. The bound, plus or minus the square root of the product of the variances, is defined as transformed data so that it is only calculated once.

协方差声明的约束确保了生成的协方差矩阵 sigma 是正定的。协方差取值的上下限,即方差乘积的平方根的正负值,被定义为转换数据,这样它们只需计算一次。

The vectorization of the multivariate normal is critical for efficiency here. The transformed parameter Sigma could be defined as a local variable within the model block if it does not need to be included in the sampler’s output.

这里,多元正态分布的向量化对效率至关重要。如果不需要将转换参数 Sigma 包含在采样器的输出中,可以在模型块中将其定义为局部变量。

3.3 Sliced missing data

切片缺失数据

If the missing data are part of some larger data structure, then it can often be effectively reassembled using index arrays and slicing. Here’s an example for time-series data, where only some entries in the series are observed.

当缺失数据是更大数据结构的一部分时,通常可以使用索引数组和切片来有效地重组数据。这里是一个关于时间序列数据的例子,这个时间序列中只观测到了部分的数据点。

data {
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  array[N_obs] int<lower=1, upper=N_obs + N_mis> ii_obs;
  array[N_mis] int<lower=1, upper=N_obs + N_mis> ii_mis;
  array[N_obs] real y_obs;
}
transformed data {
  int<lower=0> N = N_obs + N_mis;
}
parameters {
  array[N_mis] real y_mis;
  real<lower=0> sigma;
}
transformed parameters {
  array[N] real y;
  y[ii_obs] = y_obs;
  y[ii_mis] = y_mis;
}
model {
  sigma ~ gamma(1, 1);
  y[1] ~ normal(0, 100);
  y[2:N] ~ normal(y[1:(N - 1)], sigma);
}

The index arrays ii_obs and ii_mis contain the indexes into the final array y of the observed data (coded as a data vector y_obs) and the missing data (coded as a parameter vector y_mis). See the time series chapter for further discussion of time-series model and specifically the autoregression section for an explanation of the vectorization for y as well as an explanation of how to convert this example to a full AR(1) model. To ensure y[1] has a proper posterior in case it is missing, we have given it an explicit, albeit broad, prior.

索引数组 ii_obsii_mis 包含观测数据(编码为数据向量 y_obs)和缺失数据(编码为参数向量 y_mis)的最终数组 y 中的索引。有关时间序列模型的进一步讨论,请参见 时间序列章节,特别是 自回归模型部分关于 y 的向量化的解释,以及如何将这个例子转换为完整的 AR(1) 模型。为了确保当 y[1] 缺失时也能获得正规的后验分布,我们为其指定了一个明确的(尽管范围较广的)先验分布。

Another potential application would be filling the columns of a data matrix of predictors for which some predictors are missing; matrix columns can be accessed as vectors and assigned the same way, as in

另一个潜在的应用是,当预测因子的数据矩阵有一些预测变量缺失时,可以填充这个数据矩阵的列; 矩阵的列可以像向量一样访问和赋值,例如

x[N_obs_2, 2] = x_obs_2;
x[N_mis_2, 2] = x_mis_2;

where the relevant variables are all hard coded with index 2 because Stan doesn’t support ragged arrays. These could all be packed into a single array with more fiddly indexing that slices out vectors from longer vectors (see the ragged data structures section for a general discussion of coding ragged data structures in Stan).

这里所有的相关变量都硬编码成索引 2,因为 Stan 不支持不规则数组。这些变量可以通过更复杂的索引打包到一个数组中,从更长的向量中切片出向量(请参阅 不规则数据结构章节,了解 Stan 中对不规则数据结构化编码的一般讨论)。

3.4 Loading matrix for factor analysis

在因子分析中加载矩阵

Rick Farouni, on the Stan users group, inquired as to how to build a Cholesky factor for a covariance matrix with a unit diagonal, as used in Bayesian factor analysis (Aguilar and West 2000). This can be accomplished by declaring the below-diagonal elements as parameters, then filling the full matrix as a transformed parameter.

Stan 用户小组的 Rick Farouni 询问了:如何为具有单位对角线的协方差矩阵构建 Cholesky 因子,这在贝叶斯因子分析中有应用 (Aguilar and West 2000)。这可以通过将下对角线元素声明为参数,然后将整个矩阵填充为转换参数来实现。

data {
  int<lower=2> K;
}
transformed data {
  int<lower=1> K_choose_2;
  K_choose_2 = (K * (K - 1)) / 2;
}
parameters {
  vector[K_choose_2] L_lower;
}
transformed parameters {
  cholesky_factor_cov[K] L;
  for (k in 1:K) {
    L[k, k] = 1;
  }
  {
    int i;
    for (m in 2:K) {
      for (n in 1:(m - 1)) {
        L[m, n] = L_lower[i];
        L[n, m] = 0;
        i += 1;
      }
    }
  }
}

It is most convenient to place a prior directly on L_lower. An alternative would be a prior for the full Cholesky factor L, because the transform from L_lower to L is just the identity and thus does not require a Jacobian adjustment (despite the warning from the parser, which is not smart enough to do the code analysis to infer that the transform is linear). It would not be at all convenient to place a prior on the full covariance matrix L * L', because that would require a Jacobian adjustment; the exact adjustment is detailed in the reference manual.

最简单的方式是直接在 L_lower 上放置先验。另一种选择是对完整的 Cholesky 因子 L 进行先验分布,因为从L_lowerL的变换只是一个恒等变换,因此不需要雅可比调整(尽管解析器发出了警告,但它不够智能到通过代码分析推断出这是个线性变换)。在完整的协方差矩阵 L * L' 上放置先验分布不是很方便,因为这需要雅可比调整;详细的调整在参考手册中有说明。

3.5 Missing multivariate data

缺失的多变量数据

It’s often the case that one or more components of a multivariate outcome are missing.3

很多情况下,多元结果中的一个或多个组成部分可能会缺失。4

As an example, we’ll consider the bivariate distribution, which is easily marginalized. The coding here is brute force, representing both an array of vector observations y and a boolean array y_observed to indicate which values were observed (others can have dummy values in the input).

例如,我们考虑双变量分布,它很容易被边缘化(容易求边际分布)。这里的编码是暴力编码,同时表示了向量观察值数组 y 和布尔数组 y_observed,后者用于指示哪些值已被观察到(其他值可以在输入中有虚拟值)。

array[N] vector[2] y;
array[N, 2] int<lower=0, upper=1> y_observed;

If both components are observed, we model them using the full multi-normal, otherwise we model the marginal distribution of the component that is observed.

如果观察到两个分量,我们使用完整的多元正态分布对它们进行建模,否则我们对观察到的分量的边际分布进行建模。

for (n in 1:N) {
  if (y_observed[n, 1] && y_observed[n, 2]) {
    y[n] ~ multi_normal(mu, Sigma);
  } else if (y_observed[n, 1]) {
    y[n, 1] ~ normal(mu[1], sqrt(Sigma[1, 1]));
  } else if (y_observed[n, 2]) {
    y[n, 2] ~ normal(mu[2], sqrt(Sigma[2, 2]));
  }
}

It’s a bit more work, but much more efficient to vectorize these distribution statements. In transformed data, build up three vectors of indices, for the three cases above:

虽然要做更多工作,但将这些采样语句向量化可以更有效率。在 transformed data 块中,针对上述三种情况,建立三个索引向量:

transformed data {
  array[observed_12(y_observed)] int ns12;
  array[observed_1(y_observed)] int ns1;
  array[observed_2(y_observed)] int ns2;
}

You will need to write functions that pull out the count of observations in each of the three situations. This must be done with functions because the result needs to go in top-level block variable size declaration. Then the rest of transformed data just fills in the values using three counters.

您需要编写函数来统计三种采样情况下的观测数量。这必须通过函数来实现,因为结果需要放入顶层块的变量大小声明中。然后,transformed data 块的其余部分只需使用三个计数器填充值即可。

int n12 = 1;
int n1 = 1;
int n2 = 1;
for (n in 1:N) {
  if (y_observed[n, 1] && y_observed[n, 2]) {
    ns12[n12] = n;
    n12 += 1;
  } else if (y_observed[n, 1]) {
    ns1[n1] = n;
    n1 += 1;
  } else if (y_observed[n, 2]) {
    ns2[n2] = n;
    n2 += 1;
  }
}

Then, in the model block, everything is vectorizable using those indexes constructed once in transformed data:

然后,在 model 块中,使用在 transformed data 块中一次性构建的索引,所有内容都可以向量化:

y[ns12] ~ multi_normal(mu, Sigma);
y[ns1] ~ normal(mu[1], sqrt(Sigma[1, 1]));
y[ns2] ~ normal(mu[2], sqrt(Sigma[2, 2]));

The result will be much more efficient than using latent variables for the missing data, but it requires the multivariate distribution to be marginalized analytically. It’d be more efficient still to precompute the three arrays in the transformed data block, though the efficiency improvement will be relatively minor compared to vectorizing the probability functions.

与使用潜在变量来处理缺失的数据相比,该结果要有效得多,但它要求多元分布能够解析得到边际分布。在 transformed data 块中预计算三个数组,效率会更高一些。尽管与向量化概率函数相比,效率的提高相对较小。

This approach can easily be generalized with some index fiddling to the general multivariate case. The trick is to pull out entries in the covariance matrix for the missing components. It can also be used in situations such as multivariate differential equation solutions where only one component is observed, as in a phase-space experiment recording only time and position of a pendulum (and not recording momentum).

通过一些索引操作,这种方法可以轻松地推广到一般的多变量情况。诀窍是在协方差矩阵中提取缺失分量对应的项。它也可以用于只有一个分量被观测到的情况,比如多变量微分方程组的解,以及在相空间实验中仅记录摆锤的时间和位置(而不记录动量)的情况。

Aguilar, Omar, and Mike West. 2000. “Bayesian Dynamic Factor Models and Portfolio Allocation.” Journal of Business & Economic Statistics 18 (3): 338–57.
Gelman, Andrew, J. B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. Third Edition. London: Chapman & Hall / CRC Press.

  1. A more meaningful estimation example would involve a regression of the observed and missing observations using predictors that were known for each and specified in the data block.↩︎

  2. 一个更有意义的估计示例涉及到,使用在 data 块中为每个观测值指定的已知预测变量,对观测到的和缺失的观测值进行回归分析。↩︎

  3. This is not the same as missing components of a multivariate predictor in a regression problem; in that case, you will need to represent the missing data as a parameter and impute missing values in order to feed them into the regression.↩︎

  4. 这与回归问题中多变量预测变量的缺失分量不同;在这种情况下,您需要将缺失的数据表示为一个参数,并估算缺失的值,以便将它们输入到回归中。 ↩︎