8  稀疏与不规则数据结构

Sparse and Ragged Data Structures

本节译者:郭鑫

初次校审:李君竹

二次校审:邱怡轩(DeepSeek 辅助)

Stan does not directly support either sparse or ragged data structures, though both can be accommodated with some programming effort. The sparse matrices chapter introduces a special-purpose sparse matrix times dense vector multiplication, which should be used where applicable; this chapter covers more general data structures.

Stan 并不直接支持稀疏或不规则数据结构,但通过适当的编程处理,两者都可以实现。稀疏矩阵章节 介绍了一种专用的稀疏矩阵乘稠密向量乘法,适用于特定场景;本章则涵盖了更通用的数据结构。

8.1 Sparse data structures

稀疏数据结构

Coding sparse data structures is as easy as moving from a matrix-like data structure to a database-like data structure. For example, consider the coding of sparse data for the IRT models discussed in the item-response model section. There are \(J\) students and \(K\) questions, and if every student answers every question, then it is practical to declare the data as a \(J \times K\) array of answers.

稀疏数据结构的编码可以通过从矩阵形式的数据结构转换为类似数据库的结构来实现。例如,考虑项目-响应模型章节 中讨论的 IRT 模型的稀疏数据编码。假设有 \(J\) 个学生和 \(K\) 个问题,如果每个学生都回答了每个问题,那么将数据声明为 \(J \times K\) 的答案数组是可行的。

data {
  int<lower=1> J;
  int<lower=1> K;
  array[J, K] int<lower=0, upper=1> y;
  // ...
model {
  for (j in 1:J) {
    for (k in 1:K) {
      y[j, k] ~ bernoulli_logit(delta[k] * (alpha[j] - beta[k]));
    }
  }
  // ...
}

When not every student is given every question, the dense array coding will no longer work, because Stan does not support undefined values.

当并非每个学生都回答每个问题时,稠密数组编码将不再适用,因为 Stan 不支持未定义的值。

The following missing data example shows an example with \(J=3\) and \(K=4\), with missing responses shown as NA, as in R.

以下缺失数据示例展示了一个 \(J=3\)\(K=4\) 的情况,缺失的响应用 NA 表示,类似于 R 中的表示方式。

\[\begin{equation*} y = \left[ \begin{array}{cccc} 0 & 1 & \mbox{NA} & 1 \\ 0 & \mbox{NA} & \mbox{NA} & 1 \\ \mbox{NA} & 0 & \mbox{NA} & \mbox{NA} \end{array} \right] \end{equation*}\]

There is no support within Stan for R’s NA values, so this data structure cannot be used directly. Instead, it must be converted to a “long form” as in a database, with columns indicating the indices along with the value. With columns \(jj\) and \(kk\) used for the indexes (following Gelman and Hill (2007)), the 2-D array \(y\) is recoded as a table. The number of rows in the table equals the number of defined array elements, here \(y_{1,1} = 0\), \(y_{1,2} = 1\), up to \(y_{3,2} = 1\). As the array becomes larger and sparser, the long form becomes the more economical encoding.

由于 Stan 不支持 R 的 NA 值,因此无法直接使用这种数据结构。相反,必须将其转换为类似数据库的“长格式”,其中列表示索引以及对应的值。使用 \(jj\)\(kk\) 作为索引(遵循 Gelman and Hill (2007)),二维数组 \(y\) 被重新编码为一个表格。表格的行数等于已定义的数组元素数量,例如 \(y_{1,1} = 0\)\(y_{1,2} = 1\),直到 \(y_{3,2} = 1\)。随着数组变得更大且更稀疏,长格式编码会变得更加高效。

jj kk y
1 1 0
1 2 1
1 4 1
2 1 0
2 4 1
3 2 0

Letting \(N\) be the number of \(y\) that are defined, here \(N=6\), the data and model can be formulated as follows.

假设 \(N\) 是已定义的 \(y\) 的数量,此处 \(N=6\),数据和模型可以如下表示。

data {
  // ...
  int<lower=1> N;
  array[N] int<lower=1, upper=J> jj;
  array[N] int<lower=1, upper=K> kk;
  array[N] int<lower=0, upper=1> y;
  // ...
}
model {
  for (n in 1:N) {
    y[n] ~ bernoulli_logit(delta[kk[n]]
                           * (alpha[jj[n]] - beta[kk[n]]));
  }
  // ...
}

In the situation where there are no missing values, the two model formulations produce exactly the same log posterior density.

在没有缺失值的情况下,两种模型公式会生成完全相同的对数后验密度。

8.2 Ragged data structures

不规则数据结构

Ragged arrays are arrays that are not rectangular, but have different sized entries. This kind of structure crops up when there are different numbers of observations per entry.

不规则数组是指非矩形的数组,其条目大小各不相同。当每个条目的观测值数量不同时,这种结构就会出现。

A general approach to dealing with ragged structure is to move to a full database-like data structure as discussed in the previous section. A more compact approach is possible with some indexing into a linear array.

处理不规则结构的常见方法是采用上一节中讨论的类似数据库的完整数据结构。通过将某些结构索引到线性数组中,可以实现更紧凑的编码方式。

For example, consider a data structure for three groups, each of which has a different number of observations.

例如,考虑一个包含三个组的数据结构,每个组的观测值数量不同。

\(y_1 = \left[1.3 \ \ 2.4 \ \ 0.9\right]\\\) \(y_2 = \left[-1.8 \ \ -0.1\right]\\\) \(y_3 = \left[12.9 \ \ 18.7 \ \ 42.9 \ \ 4.7\right]\)

\(z = [1.3 \ \ 2.4 \ \ 0.9 \ \ -1.8 \ \ -0.1 \ \ 12.9 \ \ 18.7 \ \ 42.9 \ \ 4.7]\\\) \(s = \{ 3 \ \ 2 \ \ 4 \}\)

On the left is the definition of a ragged data structure \(y\) with three rows of different sizes (\(y_1\) is size 3, \(y_2\) size 2, and \(y_3\) size 4). On the right is an example of how to code the data in Stan, using a single vector \(z\) to hold all the values and a separate array of integers \(s\) to hold the group row sizes. In this example, \(y_1 = z_{1:3}\), \(y_2 = z_{4:5}\), and \(y_3 = z_{6:9}\).

左边是不规则的数据结构 \(y\) 的定义,具有三行不同大小(\(y_1\) 大小为3,\(y_2\) 大小为 2,\(y_3\) 大小为 4)。右侧是如何在 Stan 中对数据进行编码的示例,使用单个向量 \(z\) 来保存所有值,并使用单独的整数数组 \(s\) 来保存组行大小。在此示例中,\(y_1 = z_{1:3}\)\(y_2 = z_{4:5}\)\(y_3 = z_{6:9}\)

Suppose the model is a simple varying intercept model, which, using vectorized notation, would yield a log-likelihood

假设该模型是一个简单的变化截距模型,它使用矢量化表示法,将产生对数似然

\[\begin{equation*} \sum_{n=1}^3 \log \textsf{normal}(y_n \mid \mu_n, \sigma). \end{equation*}\] There’s no direct way to encode this in Stan.

在 Stan 中无法直接对此进行编码。

A full database type structure could be used, as in the sparse example, but this is inefficient, wasting space for unnecessary indices and not allowing vector-based density operations. A better way to code this data is as a single list of values, with a separate data structure indicating the sizes of each subarray. This is indicated on the right of the example. This coding uses a single array for the values and a separate array for the sizes of each row.

虽然可以使用类似稀疏示例中的完整数据库结构,但这种方法效率较低,浪费了不必要的索引空间,并且无法支持基于向量的密度操作。更好的编码方式是将数据存储为单个值列表,并使用单独的数据结构来记录每个子数组的大小。这在示例的右侧指示。此编码对值使用单个数组,对每行的大小使用单独的数组。

The model can then be coded up using slicing operations as follows.

然后可以使用切片操作对模型进行编码,如下所示。

data {
  int<lower=0> N;   // # observations
  int<lower=0> K;   // # of groups
  vector[N] y;      // observations
  array[K] int s;   // group sizes
  // ...
}
model {
  int pos;
  pos = 1;
  for (k in 1:K) {
    segment(y, pos, s[k]) ~ normal(mu[k], sigma);
    pos = pos + s[k];
  }

This coding allows for efficient vectorization, which is worth the copy cost entailed by the segment() vector slicing operation.

这种编码方式支持高效的向量化操作,尽管 segment() 向量切片操作会带来一定的复制成本,但这种代价是值得的。

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel-Hierarchical Models. Cambridge, United Kingdom: Cambridge University Press.