9  聚类模型

Clustering Models

本节译者:王钦怡

初次校审:张梓源

二次校审:邱怡轩(DeepSeek 辅助)

Unsupervised methods for organizing data into groups are collectively referred to as clustering. This chapter describes the implementation in Stan of two widely used statistical clustering models, soft \(K\)-means and latent Dirichlet allocation (LDA). In addition, this chapter includes naive Bayesian classification, which can be viewed as a form of clustering which may be supervised. These models are typically expressed using discrete parameters for cluster assignments. Nevertheless, they can be implemented in Stan like any other mixture model by marginalizing out the discrete parameters (see the mixture modeling chapter).

聚类是一类无监督方法的统称,用于将数据组织成组。本章介绍两种广泛使用的统计聚类模型在 Stan 中的实现:软 \(K\)-均值聚类和潜在狄利克雷分配(latent Dirichlet allocation,LDA)。此外,本章还介绍朴素贝叶斯分类,它可以看作一种可带监督的聚类形式。这些模型通常使用离散参数来表示聚类分配;尽管如此,与其他混合模型一样,它们可以通过边缘化掉离散参数的方式在 Stan 中实现(参见混合模型章节)。

9.1 Relation to finite mixture models

与有限混合模型的关系

As mentioned in the clustering section, clustering models and finite mixture models are really just two sides of the same coin. The “soft” \(K\)-means model described in the next section is a normal mixture model (with varying assumptions about covariance in higher dimensions leading to variants of \(K\)-means). Latent Dirichlet allocation is a mixed-membership multinomial mixture.

正如混合模型章节中聚类一节所提到的,聚类模型和有限混合模型实际上是同一事物的两个方面。下一节中描述的“软” \(K\)-均值模型是一个正态混合模型(在高维中对协方差的不同假设会衍生出不同的 \(K\)-均值变体)。潜在狄利克雷分配则是一种混合隶属度的多项混合模型。

9.2 Soft K-means

软 K-均值

\(K\)-means clustering is a method of clustering data represented as \(D\)-dimensional vectors. Specifically, there will be \(N\) items to be clustered, each represented as a vector \(y_n \in \mathbb{R}^D\). In the “soft” version of \(K\)-means, the assignments to clusters will be probabilistic.

\(K\)-均值聚类是一种对表示为 \(D\) 维向量的数据进行聚类的方法。具体来说,有 \(N\) 个待聚类项,每项表示为向量 \(y_n \in \mathbb{R}^D\)。在“软” \(K\)-均值聚类中,各项到聚类(cluster)的分配是概率性的。

Geometric hard K-means clustering

几何硬 K-均值聚类

\(K\)-means clustering is typically described geometrically in terms of the following algorithm, which assumes the number of clusters \(K\) and data vectors \(y\) as input.

\(K\)-均值聚类通常用以下算法进行几何描述,该算法以聚类数量 \(K\) 和数据向量 \(y\) 作为输入。

  1. For each \(n\) in \(\{1,\dotsc,N\}\), randomly assign vector \(y_n\) to a cluster in \(\{1,\dotsc,K\}\);
  2. Repeat
    1. For each cluster \(k\) in \(\{1,\dotsc,K\}\), compute the cluster centroid \(\mu_k\) by averaging the vectors assigned to that cluster;
    2. For each \(n\) in \(\{1,\dotsc,N\}\), reassign \(y_n\) to the cluster \(k\) for which the (Euclidean) distance from \(y_n\) to \(\mu_k\) is smallest;
    3. If no vectors changed cluster, return the cluster assignments.
  1. 对于 \(\{1,\dotsc,N\}\) 中的每个 \(n\),将向量 \(y_n\) 随机分配到 \(\{1,\dotsc,K\}\) 中的一个聚类中;
  2. 重复
    1. 对于 \(\{1,\dotsc,K\}\) 中的每个聚类 \(k\),通过对分配给该聚类的向量求平均来计算聚类中心 \(\mu_k\);
    2. 对于 \(\{1,\dotsc,N\}\) 中的每个 \(n\),将 \(y_n\) 重新分配给使 \(y_n\) 到 \(\mu_k\) 的(欧氏)距离最小的聚类 \(k\);
    3. 如果没有向量更改聚类,则返回聚类分配结果。

This algorithm is guaranteed to terminate.

该算法可保证终止。
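To make the geometric algorithm concrete, the following is a minimal sketch of hard \(K\)-means written in Stan's function syntax. It is not part of this chapter's models; the function name hard_kmeans and the initial-assignment argument init_z are our own inventions.

functions {
  // Hypothetical helper: Lloyd's algorithm for hard K-means.
  // y: data vectors; K: number of clusters; init_z: initial assignments.
  array[] int hard_kmeans(array[] vector y, int K, array[] int init_z) {
    int N = size(y);
    int D = num_elements(y[1]);
    array[N] int z = init_z;
    int changed = 1;
    while (changed) {
      // step 2.1: recompute centroids as means of assigned vectors
      array[K] vector[D] mu = rep_array(rep_vector(0, D), K);
      array[K] int count = rep_array(0, K);
      for (n in 1:N) {
        mu[z[n]] += y[n];
        count[z[n]] += 1;
      }
      for (k in 1:K) {
        if (count[k] > 0) {
          mu[k] /= count[k];
        }
      }
      // step 2.2: reassign each vector to the nearest centroid
      changed = 0;
      for (n in 1:N) {
        int best = z[n];
        for (k in 1:K) {
          if (distance(y[n], mu[k]) < distance(y[n], mu[best])) {
            best = k;
          }
        }
        if (best != z[n]) {
          z[n] = best;
          changed = 1;
        }
      }
    }
    // step 2.3: no assignment changed, so return
    return z;
  }
}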

Soft K-means clustering

软 K-均值聚类

Soft \(K\)-means clustering treats the cluster assignments as probability distributions over the clusters. Because of the connection between Euclidean distance and multivariate normal models with a fixed covariance, soft \(K\)-means can be expressed (and coded in Stan) as a multivariate normal mixture model.

软 \(K\)-均值聚类将聚类分配视为聚类上的概率分布。基于欧氏距离与固定协方差的多元正态模型之间的联系,软 \(K\)-均值聚类可表示为(并在 Stan 中编码为)多元正态混合模型。

In the full generative model, each data point \(n\) in \(\{1,\dotsc,N\}\) is assigned a cluster \(z_n \in \{1,\dotsc,K\}\) with symmetric uniform probability,

在完整的生成模型中,每个数据点 \(n \in \{1,\dotsc,N\}\) 以对称均匀概率分配到聚类 \(z_n \in \{1,\dotsc,K\}\) 中,

\[ z_n \sim \textsf{categorical}(1/K), \]

where \(1\) is the unit vector of \(K\) dimensions, so that \(1/K\) is the symmetric \(K\)-simplex. Thus the model assumes that each data point is drawn from a hard decision about cluster membership. The softness arises only from the uncertainty about which cluster generated a data point.

其中 \(1\) 是 \(K\) 维的单位向量,因此 \(1/K\) 是对称的 \(K\)-单纯形。模型假设每个数据点均来自关于聚类归属的硬决策,而“软性”仅源于对生成数据点的聚类的不确定性。

The data points themselves are generated from a multivariate normal distribution whose parameters are determined by the cluster assignment \(z_n\),

数据点本身是从多元正态分布中生成的,其参数由聚类分配 \(z_n\) 决定,

\[ y_n \sim \textsf{normal}(\mu_{z[n]},\Sigma_{z[n]}) \]

The sample implementation in this section assumes a fixed unit covariance matrix shared by all clusters \(k\),

本节的示例实现假定所有聚类 \(k\) 共享固定的单位协方差矩阵,

\[ \Sigma_k = \mathrm{diag\_matrix}({\bf 1}), \] so that the log multivariate normal can be implemented directly up to a proportion by

从而对数多元正态密度可按如下比例关系直接实现:

\[ \mathrm{normal}\left( y_n \mid \mu_k, \mathrm{diag\_matrix}({\bf 1}) \right) \propto \exp \left(- \frac{1}{2} \sum_{d=1}^D \left( \mu_{k,d} - y_{n,d} \right)^2 \right). \] The spatial perspective on \(K\)-means arises by noting that the inner term is just half the negative squared Euclidean distance from the cluster mean \(\mu_k\) to the data point \(y_n\).

从几何视角看,内部项恰为聚类均值 \(\mu_k\) 到数据点 \(y_n\) 的欧氏距离平方的负一半。

Stan implementation of soft K-means

用 Stan 实现软 K-均值

Consider the following Stan program for implementing \(K\)-means clustering.

以下为实现 \(K\)-均值聚类的 Stan 程序。

data {
  int<lower=0> N;        // number of data points
  int<lower=1> D;        // number of dimensions
  int<lower=1> K;        // number of clusters
  array[N] vector[D] y;  // observations
}
transformed data {
  real<upper=0> neg_log_K;
  neg_log_K = -log(K);
}
parameters {
  array[K] vector[D] mu; // cluster means
}
transformed parameters {
  array[N, K] real<upper=0> soft_z; // log unnormalized clusters
  for (n in 1:N) {
    for (k in 1:K) {
      soft_z[n, k] = neg_log_K
                     - 0.5 * dot_self(mu[k] - y[n]);
    }
  }
}
model {
  // prior
  for (k in 1:K) {
    mu[k] ~ std_normal();
  }

  // likelihood
  for (n in 1:N) {
    target += log_sum_exp(soft_z[n]);
  }
}

There is an independent standard normal prior on the centroid parameters; this prior could be swapped with other priors, or even a hierarchical model to fit an overall problem scale and location.

聚类中心参数具有独立的标准正态先验;这一先验可以换成其他先验,甚至可以使用分层模型来拟合问题整体的尺度和位置。

The only parameter is mu, where mu[k] is the centroid for cluster \(k\). The transformed parameters soft_z[n] contain the log of the unnormalized cluster assignment probabilities. The vector soft_z[n] can be converted back to a normalized simplex using the softmax function (see the functions reference manual), either externally or within the model’s generated quantities block.

这个模型只有一个参数 mu,其中 mu[k] 是第 \(k\) 个聚类的中心。变换后的参数 soft_z[n] 包含了未归一化的聚类分配概率的对数。向量 soft_z[n] 可以使用 softmax 函数(参见函数参考手册)进行归一化,这一步可以在模型的“generated quantities”区块内部或外部进行。
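For example, the normalization could be done in the program itself with a generated quantities block along the following lines; this is a sketch, and the name cluster_prob is ours rather than part of the original program.

generated quantities {
  // Normalized probability that point n belongs to cluster k.
  array[N] simplex[K] cluster_prob;
  for (n in 1:N) {
    cluster_prob[n] = softmax(to_vector(soft_z[n]));
  }
}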

Generalizing soft K-means

软 K-均值的推广

The multivariate normal distribution with unit covariance matrix produces a log probability density proportional to the negative squared Euclidean distance (i.e., squared \(L_2\) distance). Other distributions relate to other geometries. For instance, replacing the normal distribution with the double exponential (Laplace) distribution produces a clustering model based on \(L_1\) distance (i.e., Manhattan or taxicab distance).

具有单位协方差矩阵的多元正态分布,其对数概率密度正比于负的欧氏距离平方(即 \(L_2\) 距离的平方)。其他分布则对应其他几何。例如,用双指数(拉普拉斯)分布替换正态分布,会得到基于 \(L_1\) 距离(即曼哈顿或“出租车”距离)的聚类模型。
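As a sketch, the \(L_1\) variant changes only the definition of soft_z in the transformed parameters block of the program above, replacing half the squared \(L_2\) distance with the \(L_1\) distance (the Laplace normalizing constant is additive and drops out).

transformed parameters {
  array[N, K] real<upper=0> soft_z;  // log unnormalized clusters
  for (n in 1:N) {
    for (k in 1:K) {
      soft_z[n, k] = neg_log_K;
      for (d in 1:D) {
        // L1 (taxicab) distance in place of 0.5 * dot_self(...)
        soft_z[n, k] -= abs(mu[k, d] - y[n, d]);
      }
    }
  }
}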

Within the multivariate normal version of \(K\)-means, replacing the unit covariance matrix with a shared covariance matrix amounts to working with distances defined in a space transformed by the inverse covariance matrix.

\(K\)-均值的多元正态版本中,用共享协方差矩阵代替单位协方差矩阵相当于在由逆协方差矩阵转换的空间中使用距离。

Although there is no global spatial analog, it is common to see soft \(K\)-means specified with a per-cluster covariance matrix. In this situation, a hierarchical prior may be used for the covariance matrices.

虽然没有全局的空间类比,但通常会看到软 \(K\)-均值中为每个聚类指定一个协方差矩阵。在这种情况下,可以将层次先验用于协方差矩阵。

9.3 The difficulty of Bayesian inference for clustering

聚类中贝叶斯推断的困难之处

Two problems make it pretty much impossible to perform full Bayesian inference for clustering models, the lack of parameter identifiability and the extreme multimodality of the posteriors. There is additional discussion related to the non-identifiability due to label switching in the label switching section.

两个问题使得对聚类模型进行完整的贝叶斯推断几乎不可能:参数不可识别性和后验的极端多模态性。关于由标签交换引起的非识别性的更多讨论,请参见标签交换一节。

Non-identifiability

不可识别性

Cluster assignments are not identified—permuting the cluster mean vectors mu leads to a model with identical likelihoods. For instance, permuting the first two indexes in mu and the first two indexes in each soft_z[n] leads to an identical likelihood (and prior).

聚类分配是不可识别的——置换聚类均值向量 mu 会产生另一个具有相同似然的模型。例如,置换 mu 中的前两个索引和每个 soft_z[n] 中的前两个索引会导致相同的似然(和先验)。

The lack of identifiability means that the cluster parameters cannot be compared across multiple Markov chains. In fact, the only parameter in soft \(K\)-means is not identified, leading to problems in monitoring convergence. Clusters can even fail to be identified within a single chain, with indices swapping if the chain is long enough or the data are not cleanly separated.

不可识别性意味着不同马尔可夫链之间的聚类参数不具有可比性。事实上,软 \(K\)-均值中唯一的参数就是不可识别的,这会给收敛性监测带来问题。如果链足够长或数据分离得不够清晰,聚类甚至在单条链内也可能无法识别,出现索引交换。

Multimodality

多峰性

The other problem with clustering models is that their posteriors are highly multimodal. One form of multimodality is the non-identifiability leading to index swapping. But even without the index problems the posteriors are highly multimodal.

聚类模型的另一个问题是它们的后验分布通常具有高度的多峰性。多峰性的一种形式是由于不可识别性导致的索引交换问题。但即使没有索引问题,后验分布也具有高度的多峰性。

Bayesian inference fails in cases of high multimodality because there is no way to visit all of the modes in the posterior in appropriate proportions and thus no way to evaluate integrals involved in posterior predictive inference.

在高度多峰的情况下,贝叶斯推断失败的原因是无法以适当的比例访问后验分布中的所有概率密度峰,因此无法计算涉及后验预测推断的积分。

In light of these two problems, the advice often given in fitting clustering models is to try many different initializations and select the sample with the highest overall probability. It is also popular to use optimization-based point estimators such as expectation maximization or variational Bayes, which can be much more efficient than sampling-based approaches.

鉴于这两个问题,拟合聚类模型时通常的建议是:尝试多种不同的初始化方案,并选择整体概率最高的样本。另一类流行的做法是使用基于优化的点估计(如期望最大化算法或变分贝叶斯),它们可能比基于采样的方法高效得多。

9.4 Naive Bayes classification and clustering

朴素贝叶斯分类和聚类

Naive Bayes is a kind of mixture model that can be used for classification or for clustering (or a mix of both), depending on which labels for items are observed.1

朴素贝叶斯是一种混合模型,可用于分类或聚类(或两者的结合),具体取决于数据中哪些项的标签是已知的。2

Multinomial mixture models are referred to as “naive Bayes” because they are often applied to classification problems where the multinomial independence assumptions are clearly false.

多项混合模型被称为“朴素贝叶斯”,是因为它们常被应用于多项独立性假设明显不成立的分类问题。

Naive Bayes classification and clustering can be applied to any data with multinomial structure. A typical example of this is natural language text classification and clustering, which is used as an example in what follows.

朴素贝叶斯分类和聚类可应用于任何具有多项分布结构的数据。一个典型的例子是自然语言文本的分类和聚类,下文将以此为例。

The observed data consists of a sequence of \(M\) documents made up of bags of words drawn from a vocabulary of \(V\) distinct words. A document \(m\) has \(N_m\) words, which are indexed as \(w_{m,1}, \dotsc, w_{m,N[m]} \in \{1,\dotsc,V\}\). Despite the ordered indexing of words in a document, this order is not part of the model, which is clearly defective for natural human language data. A number of topics (or categories) \(K\) is fixed.

观测数据由 \(M\) 个文档组成,每个文档是从包含 \(V\) 个不同词的词汇表中抽取的词袋。文档 \(m\) 有 \(N_m\) 个词语,索引为 \(w_{m,1}, \dotsc, w_{m,N[m]} \in \{1,\dotsc,V\}\)。尽管文档中的词语带有顺序索引,但这一顺序并不进入模型;对自然语言数据而言,这显然是一个缺陷。主题(或类别)的数量 \(K\) 是固定的。

The multinomial mixture model generates a single category \(z_m \in \{1,\dotsc,K\}\) for each document \(m \in \{1,\dotsc,M\}\) according to a categorical distribution,

多项分布混合模型为每个文档 \(m \in \{1,\dotsc,M\}\) 生成一个类别 \(z_m \in \{1,\dotsc,K\}\),其生成过程服从分类分布:

\[ z_m \sim \textsf{categorical}(\theta). \] The \(K\)-simplex parameter \(\theta\) represents the prevalence of each category in the data.

参数 \(\theta\) 是一个 \(K\)-单纯形,表示每个类别在数据中的普遍性。

Next, the words in each document are generated conditionally independently of each other and the words in other documents based on the category of the document, with word \(n\) of document \(m\) being generated as

接下来,每个文档中的词语基于文档的类别条件独立生成,且与其他文档中的词无关。文档 \(m\) 的第 \(n\) 个词语的生成方式为

\[ w_{m,n} \sim \textsf{categorical}(\phi_{z[m]}). \] The parameter \(\phi_{z[m]}\) is a \(V\)-simplex representing the probability of each word in the vocabulary in documents of category \(z_m\).

参数 \(\phi_{z[m]}\) 是一个 \(V\)-单纯形,表示类别为 \(z_m\) 的文档中词汇表中每个词语的概率。

The parameters \(\theta\) and \(\phi\) are typically given symmetric Dirichlet priors. The prevalence \(\theta\) is sometimes fixed to produce equal probabilities for each category \(k \in \{1,\dotsc,K\}\).

参数 \(\theta\) 和 \(\phi\) 通常被赋予对称狄利克雷先验。有时会将普遍性参数 \(\theta\) 固定,使每个类别 \(k \in \{1,\dotsc,K\}\) 的概率相等。
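For instance, fixing the prevalence to uniform can be encoded by moving theta out of the parameters block entirely; a minimal sketch, assuming K is declared as in the programs below:

transformed data {
  // Fixed uniform prevalence; theta is no longer a parameter and
  // its Dirichlet statement is dropped from the model block.
  simplex[K] theta = rep_vector(1.0 / K, K);
}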

Coding ragged arrays

编码不规则数组

The specification for naive Bayes in the previous sections has used a ragged array notation for the words \(w\). Because Stan does not support ragged arrays, the models are coded using an alternative strategy that provides an index for each word in a global list of words. The data is organized as follows, with the word arrays laid out in a column and each assigned to its document in a second column.

前面几节的朴素贝叶斯模型描述对词语 \(w\) 使用了不规则数组记号。由于 Stan 不支持不规则数组,模型采用了一种替代的编码策略:为全局词语列表中的每个词语提供一个索引。数据的组织方式如下表所示:词语数组排成一列,另一列标明每个词语所属的文档。

\[ \begin{array}{lll} \hline \mathrm{n} \qquad\qquad\qquad\qquad & \mathrm{w[n]} \qquad & \mathrm{doc[n]} \\ \hline 1 & w_{1,1} & 1 \\ 2 & w_{1,2} & 1 \\ \vdots & \vdots & \vdots \\ N_1 & w_{1,N[1]} & 1 \\ N_1 + 1 & w_{2,1} & 2 \\ N_1 + 2 & w_{2,2} & 2 \\ \vdots & \vdots & \vdots \\ N_1 + N_2 & w_{2,N[2]} & 2 \\ N_1 + N_2 + 1 & w_{3,1} & 3 \\ \vdots & \vdots & \vdots \\ N = \sum_{m=1}^M N_m & w_{M,N[M]} & M \\ \hline \end{array} \]

The relevant variables for the program are N, the total number of words in all the documents, the word array w, and the document identity array doc.

这个程序中相关的变量包括所有文档的总词数 N,词语数组 w,以及文档标识数组 doc
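As a concrete illustration of this layout, here is a hypothetical toy data set in CmdStan's JSON data format: document 1 consists of the words (3, 1, 4) and document 2 of the words (2, 2); the remaining declared variables (K, V, M, z, and the hyperparameters) would be supplied alongside for the programs below.

{
  "N": 5,
  "w": [3, 1, 4, 2, 2],
  "doc": [1, 1, 1, 2, 2]
}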

Estimation with category-labeled training data

使用带类别标签的训练数据进行估计

A naive Bayes model for estimating the simplex parameters given training data with documents of known categories can be coded in Stan as follows

对于已知类别的训练数据,可用以下 Stan 代码实现朴素贝叶斯模型的单纯形参数估计:

data {
  // training data
  int<lower=1> K;               // num topics
  int<lower=1> V;               // num words
  int<lower=0> M;               // num docs
  int<lower=0> N;               // total word instances
  array[M] int<lower=1, upper=K> z;    // topic for doc m
  array[N] int<lower=1, upper=V> w;    // word n
  array[N] int<lower=1, upper=M> doc;  // doc ID for word n
  // hyperparameters
  vector<lower=0>[K] alpha;     // topic prior
  vector<lower=0>[V] beta;      // word prior
}
parameters {
  simplex[K] theta;             // topic prevalence
  array[K] simplex[V] phi;      // word dist for topic k
}
model {
  theta ~ dirichlet(alpha);
  for (k in 1:K) {
    phi[k] ~ dirichlet(beta);
  }
  for (m in 1:M) {
    z[m] ~ categorical(theta);
  }
  for (n in 1:N) {
    w[n] ~ categorical(phi[z[doc[n]]]);
  }
}

The topic identifiers \(z_m\) are declared as data and the latent category assignments are included as part of the likelihood function.

主题标识符 \(z_m\) 被声明为数据,潜在类别分配被包含在似然函数中。

Estimation without category-labeled training data

无类别标签训练数据的估计

Naive Bayes models can be used in an unsupervised fashion to cluster multinomial-structured data into a fixed number \(K\) of categories. The data declaration includes the same variables as the model in the previous section excluding the topic labels z. Because z is discrete, it needs to be summed out of the model calculation. This is done for naive Bayes as for other mixture models. The parameters are the same up to the priors, but the likelihood is now computed as the marginal document probability

朴素贝叶斯模型可以以无监督的方式,将具有多项分布结构的数据聚类为固定数量 \(K\) 的类别。数据声明包含与上一节模型相同的变量,但不含主题标签 z。由于 z 是离散的,需要像其他混合模型一样,在模型计算中将其求和消去。除先验外,参数与之前相同,但似然现在按边缘文档概率计算:

\[\begin{align*} \log\, &p(w_{m,1},\dotsc,w_{m,N_m} \mid \theta,\phi) \\ &= \log \sum_{k=1}^K \left( \textsf{categorical}(k \mid \theta) \times \prod_{n=1}^{N_m} \textsf{categorical}(w_{m,n} \mid \phi_k) \right) \\ &= \log \sum_{k=1}^K \exp \left( \log \textsf{categorical}(k \mid \theta) + \sum_{n=1}^{N_m} \log \textsf{categorical}(w_{m,n} \mid \phi_k) \right). \end{align*}\]

The last step shows how the log_sum_exp function can be used to stabilize the numerical calculation and return a result on the log scale.

最后一步展示了如何使用 log_sum_exp 函数来稳定数值计算并以对数尺度返回结果。

model {
  array[M, K] real gamma;
  theta ~ dirichlet(alpha);
  for (k in 1:K) {
    phi[k] ~ dirichlet(beta);
  }
  for (m in 1:M) {
    for (k in 1:K) {
      gamma[m, k] = categorical_lpmf(k | theta);
    }
  }
  for (n in 1:N) {
    for (k in 1:K) {
      gamma[doc[n], k] = gamma[doc[n], k]
                         + categorical_lpmf(w[n] | phi[k]);
    }
  }
  for (m in 1:M) {
    target += log_sum_exp(gamma[m]);
  }
}

The local variable gamma[m, k] represents the value

局部变量 gamma[m, k] 表示

\[ \gamma_{m,k} = \log \textsf{categorical}(k \mid \theta) + \sum_{n=1}^{N_m} \log \textsf{categorical}(w_{m,n} \mid \phi_k). \]

Given \(\gamma\), the posterior probability that document \(m\) is assigned category \(k\) is

给定 \(\gamma\),文档 \(m\) 被分配到类别 \(k\) 的后验概率为

\[ \Pr[z_m = k \mid w,\alpha,\beta] = \exp \left( \gamma_{m,k} - \log \sum_{k=1}^K \exp \left( \gamma_{m,k} \right) \right). \]

If the variable gamma were declared and defined in the transformed parameter block, its sampled values would be saved by Stan. The normalized posterior probabilities could also be defined as generated quantities.

如果变量 gamma 在“transformed parameters”区块中声明和定义,则 Stan 会保存其采样值。归一化的后验概率也可定义为模型的生成量。
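A sketch of such a generated quantities block follows; it assumes gamma has been moved to the transformed parameters block so that it remains in scope, and the name doc_category_prob is our own.

generated quantities {
  // Normalized posterior probability that document m has category k,
  // i.e., softmax applied to gamma[m] as in the formula above.
  array[M] simplex[K] doc_category_prob;
  for (m in 1:M) {
    doc_category_prob[m] = softmax(to_vector(gamma[m]));
  }
}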

Full Bayesian inference for naive Bayes

朴素贝叶斯的全贝叶斯推断

Full Bayesian posterior predictive inference for the naive Bayes model can be implemented in Stan by combining the models for labeled and unlabeled data. The estimands include both the model parameters and the posterior distribution over categories for the unlabeled data. The model is essentially a missing data model assuming the unknown category labels are missing completely at random; see Gelman et al. (2013) and Gelman and Hill (2007) for more information on missing data imputation. The model is also an instance of semisupervised learning because the unlabeled data contributes to the parameter estimations.

在 Stan 中,可以通过将有标签和无标签数据的模型结合起来,实现朴素贝叶斯模型的全贝叶斯后验预测推断,估计对象包括模型参数和无标签数据的类别后验分布。该模型本质上是一个缺失数据模型,假设未知的类别标签是完全随机缺失的;有关缺失数据插补的更多信息,可参考 Gelman et al. (2013) 和 Gelman and Hill (2007)。该模型也是半监督学习的一个实例,因为无标签数据对参数估计有贡献。

To specify a Stan model for performing full Bayesian inference, the model for labeled data is combined with the model for unlabeled data. A second document collection is declared as data, but without the category labels, leading to new variables M2, N2, w2, and doc2. The number of categories and number of words, as well as the hyperparameters, are shared and only declared once. Similarly, there is only one set of parameters. Then the model contains a single set of statements for the prior, a set of statements for the labeled data, and a set of statements for the unlabeled data.

要指定用于全贝叶斯推断的 Stan 模型,需将有标签数据的模型与无标签数据的模型结合。声明第二个文档集合作为数据(但不包含类别标签),生成新变量 M2N2w2doc2。类别数和词语数以及超参数是共享的,仅需声明一次。类似地,模型仅有一组参数。模型包含一组先验语句、一组有标签数据语句和一组无标签数据语句。
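A sketch of the combined program follows, with the new declarations and the model block spelled out; the priors, parameters, and labeled-data statements are unchanged from the supervised program above.

data {
  // ... labeled data and hyperparameters as before ...
  int<lower=0> M2;                        // num unlabeled docs
  int<lower=0> N2;                        // total unlabeled word instances
  array[N2] int<lower=1, upper=V> w2;     // unlabeled word n
  array[N2] int<lower=1, upper=M2> doc2;  // doc ID for unlabeled word n
}
// ... parameters as before ...
model {
  array[M2, K] real gamma;
  // prior
  theta ~ dirichlet(alpha);
  for (k in 1:K) {
    phi[k] ~ dirichlet(beta);
  }
  // labeled data
  for (m in 1:M) {
    z[m] ~ categorical(theta);
  }
  for (n in 1:N) {
    w[n] ~ categorical(phi[z[doc[n]]]);
  }
  // unlabeled data, with categories summed out
  for (m in 1:M2) {
    for (k in 1:K) {
      gamma[m, k] = categorical_lpmf(k | theta);
    }
  }
  for (n in 1:N2) {
    for (k in 1:K) {
      gamma[doc2[n], k] += categorical_lpmf(w2[n] | phi[k]);
    }
  }
  for (m in 1:M2) {
    target += log_sum_exp(gamma[m]);
  }
}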

Prediction without model updates

不更新模型的预测

An alternative to full Bayesian inference involves estimating a model using labeled data, then applying it to unlabeled data without updating the parameter estimates based on the unlabeled data. This behavior can be implemented by moving the definition of gamma for the unlabeled documents to the generated quantities block. Because the variables no longer contribute to the log probability, they no longer jointly contribute to the estimation of the model parameters.

一种替代全贝叶斯推断的方法是使用有标记的数据来估计模型,然后将其应用于未标记的数据,而模型参数的估计并不基于无标签数据。此行为可通过将无标签文档的 gamma 定义移至生成量块来实现。因为这些变量不再对对数概率产生贡献,它们也不再共同参与模型参数的估计。
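A sketch of that change follows, with the unlabeled-document computation moved to generated quantities (the name gamma2 is ours); the model block then contains only the prior and the labeled-data statements.

generated quantities {
  // Log unnormalized category scores for the unlabeled documents;
  // nothing here contributes to target, so the parameter estimates
  // are driven by the labeled data alone.
  array[M2, K] real gamma2;
  for (m in 1:M2) {
    for (k in 1:K) {
      gamma2[m, k] = categorical_lpmf(k | theta);
    }
  }
  for (n in 1:N2) {
    for (k in 1:K) {
      gamma2[doc2[n], k] += categorical_lpmf(w2[n] | phi[k]);
    }
  }
}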

9.5 Latent Dirichlet allocation

潜在狄利克雷分配

Latent Dirichlet allocation (LDA) is a mixed-membership multinomial clustering model (Blei, Ng, and Jordan 2003) that generalizes naive Bayes. Using the topic and document terminology common in discussions of LDA, each document is modeled as having a mixture of topics, with each word drawn from a topic based on the mixing proportions.

潜在狄利克雷分配(Latent Dirichlet Allocation, LDA)是一种混合隶属度的多项聚类模型 (Blei, Ng, and Jordan 2003),它是朴素贝叶斯的推广。使用 LDA 讨论中常见的“主题”和“文档”术语,每个文档被建模为主题混合体,每个词根据混合比例从某个主题中生成。

The LDA Model

LDA 模型

The basic model assumes each document is generated independently based on fixed hyperparameters. For document \(m\), the first step is to draw a topic distribution simplex \(\theta_m\) over the \(K\) topics,

基本模型假设每个文档基于固定的超参数独立生成。对于文档 \(m\),第一步是抽取一个定义在 \(K\) 个主题上的主题分布单纯形 \(\theta_m\),

\[ \theta_m \sim \textsf{Dirichlet}(\alpha). \]

The prior hyperparameter \(\alpha\) is fixed to a \(K\)-vector of positive values. Each word in the document is generated independently conditional on the distribution \(\theta_m\). First, a topic \(z_{m,n} \in \{1,\dotsc,K\}\) is drawn for the word based on the document-specific topic-distribution,

先验超参数 \(\alpha\) 被固定为一个 \(K\) 维正值向量。文档中的每个词在给定分布 \(\theta_m\) 的条件下独立生成。首先,根据文档特定的主题分布为词语生成主题 \(z_{m,n} \in \{1,\dotsc,K\}\)

\[ z_{m,n} \sim \textsf{categorical}(\theta_m). \]

Finally, the word \(w_{m,n}\) is drawn according to the word distribution for topic \(z_{m,n}\),

最后,根据主题 \(z_{m,n}\) 的词语分布得到词语 \(w_{m,n}\)

\[ w_{m,n} \sim \textsf{categorical}(\phi_{z[m,n]}). \] The distributions \(\phi_k\) over words for topic \(k\) are also given a Dirichlet prior,

主题 \(k\) 中词语的分布 \(\phi_k\) 也服从狄利克雷先验,

\[ \phi_k \sim \textsf{Dirichlet}(\beta) \]

where \(\beta\) is a fixed \(V\)-vector of positive values.

其中 \(\beta\) 是一个固定的 \(V\) 维正值向量。

Summing out the discrete parameters

离散参数求和

Although Stan does not (yet) support discrete sampling, it is possible to calculate the marginal distribution over the continuous parameters by summing out the discrete parameters as in other mixture models. The marginal posterior of the topic and word variables is

尽管 Stan(目前)不支持离散参数的采样,但与其他混合模型一样,可以通过对离散参数求和来计算连续参数的边际分布。主题和词变量的边际后验为:

\[\begin{align*} p(\theta,\phi \mid w,\alpha,\beta) &\propto p(\theta \mid \alpha) \, p(\phi \mid \beta) \, p(w \mid \theta,\phi) \\ &= \prod_{m=1}^M p(\theta_m \mid \alpha) \times \prod_{k=1}^K p(\phi_k \mid \beta) \times \prod_{m=1}^M \prod_{n=1}^{N[m]} p(w_{m,n} \mid \theta_m,\phi). \end{align*}\]

The inner word-probability term is defined by summing out the topic assignments,

其中的词语概率项通过对主题分配求和来定义,

\[\begin{align*} p(w_{m,n} \mid \theta_m,\phi) &= \sum_{z=1}^K p(z,w_{m,n} \mid \theta_m,\phi) \\ &= \sum_{z=1}^K p(z \mid \theta_m) \, p(w_{m,n} \mid \phi_z). \end{align*}\]

Plugging the distributions in and converting to the log scale provides a formula that can be implemented directly in Stan,

将各分布代入并转换到对数尺度,即可得到能在 Stan 中直接实现的公式:

\[\begin{align*} \log\, &p(\theta,\phi \mid w,\alpha,\beta) \\ &= \sum_{m=1}^M \log \textsf{Dirichlet}(\theta_m \mid \alpha) + \sum_{k=1}^K \log \textsf{Dirichlet}(\phi_k \mid \beta) \\ &\qquad + \sum_{m=1}^M \sum_{n=1}^{N[m]} \log \left( \sum_{z=1}^K \textsf{categorical}(z \mid \theta_m) \times \textsf{categorical}(w_{m,n} \mid \phi_z) \right) \end{align*}\]

Implementation of LDA

LDA 的实现

Applying the marginal derived in the last section to the data structure described in this section leads to the following Stan program for LDA.

将上一节推导出的边缘分布应用于本节所描述的数据结构,将得到以下用于 LDA 的 Stan 程序。

data {
  int<lower=2> K;               // num topics
  int<lower=2> V;               // num words
  int<lower=1> M;               // num docs
  int<lower=1> N;               // total word instances
  array[N] int<lower=1, upper=V> w;    // word n
  array[N] int<lower=1, upper=M> doc;  // doc ID for word n
  vector<lower=0>[K] alpha;     // topic prior
  vector<lower=0>[V] beta;      // word prior
}
parameters {
  array[M] simplex[K] theta;    // topic dist for doc m
  array[K] simplex[V] phi;      // word dist for topic k
}
model {
  for (m in 1:M) {
    theta[m] ~ dirichlet(alpha);  // prior
  }
  for (k in 1:K) {
    phi[k] ~ dirichlet(beta);     // prior
  }
  for (n in 1:N) {
    array[K] real gamma;
    for (k in 1:K) {
      gamma[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
    }
    target += log_sum_exp(gamma);  // likelihood;
  }
}

As in the other mixture models, the log-sum-of-exponents function is used to stabilize the numerical arithmetic.

和其他混合模型一样,这里使用了对数-指数求和函数来稳定数值计算。

Correlated topic model

相关主题模型

To account for correlations in the distribution of topics for documents, Blei and Lafferty (2007) introduced a variant of LDA in which the Dirichlet prior on the per-document topic distribution is replaced with a multivariate logistic normal distribution.

为了考虑文档主题分布的相关性,Blei and Lafferty (2007) 引入了 LDA 的一种变体,其中文档主题分布的狄利克雷先验被替换为多元 Logistic-正态分布。

The authors treat the prior as a fixed hyperparameter. They use an \(L_1\)-regularized estimate of covariance, which is equivalent to the maximum a posteriori estimate given a double-exponential prior. Stan does not (yet) support maximum a posteriori estimation, so the mean and covariance of the multivariate logistic normal must be specified as data.

作者将先验视为固定的超参数。他们使用带 \(L_1\) 正则化的协方差估计,这等价于在双指数先验下的最大后验估计。由于 Stan 尚不支持最大后验估计,因此必须将多元 Logistic-正态分布的均值和协方差指定为数据。

Fixed hyperparameter correlated topic model

固定超参数的相关主题模型

The Stan model in the previous section can be modified to implement the correlated topic model by replacing the Dirichlet topic prior alpha in the data declaration with the mean and covariance of the multivariate logistic normal prior.

通过将数据声明中的狄利克雷主题先验 alpha 替换为多元 Logistic-正态先验的均值和协方差,可以修改上一节的 Stan 模型以实现相关主题模型:

data {
  // ... data as before without alpha ...
  vector[K] mu;          // topic mean
  cov_matrix[K] Sigma;   // topic covariance
}

Rather than drawing the simplex parameter theta from a Dirichlet, a parameter eta is drawn from a multivariate normal distribution and then transformed using softmax into a simplex.

下述代码不从狄利克雷分布中抽取单纯形参数 theta,而是从多元正态分布中抽取参数 eta,然后通过 softmax 转换为单纯形:

parameters {
  array[K] simplex[V] phi;     // word dist for topic k
  array[M] vector[K] eta;      // topic dist for doc m
}
transformed parameters {
  array[M] simplex[K] theta;
  for (m in 1:M) {
    theta[m] = softmax(eta[m]);
  }
}
model {
  for (m in 1:M) {
    eta[m] ~ multi_normal(mu, Sigma);
  }
  // ... model as before w/o prior for theta ...
}

Full Bayes correlated topic model

全贝叶斯相关主题模型

By adding a prior for the mean and covariance, Stan supports full Bayesian inference for the correlated topic model. This requires moving the declarations of topic mean mu and covariance Sigma from the data block to the parameters block and providing them with priors in the model. A relatively efficient and interpretable prior for the covariance matrix Sigma may be encoded as follows.

通过为均值和协方差添加先验,Stan 可支持相关主题模型的全贝叶斯推断。这需要将主题均值 mu 和协方差 Sigma 的声明从数据块移至参数块,并在模型中为它们设置先验。以下代码为协方差矩阵 Sigma 给出了一个相对高效且可解释的先验:

// ... data block as before, but without alpha ...
parameters {
  vector[K] mu;              // topic mean
  corr_matrix[K] Omega;      // correlation matrix
  vector<lower=0>[K] sigma;  // scales
  array[M] vector[K] eta;    // logit topic dist for doc m
  array[K] simplex[V] phi;   // word dist for topic k
}
transformed parameters {
  // ... theta = softmax(eta) as above ...
  cov_matrix[K] Sigma;       // covariance matrix
  for (m in 1:K) {
    Sigma[m, m] = sigma[m] * sigma[m] * Omega[m, m];
  }
  for (m in 1:(K-1)) {
    for (n in (m+1):K) {
      Sigma[m, n] = sigma[m] * sigma[n] * Omega[m, n];
      Sigma[n, m] = Sigma[m, n];
    }
  }
}
model {
  mu ~ normal(0, 5);      // vectorized, diffuse
  Omega ~ lkj_corr(2.0);  // regularize to unit correlation
  sigma ~ cauchy(0, 5);   // half-Cauchy due to constraint
  // ... words sampled as above ...
}
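As an aside, the explicit loops that assemble Sigma can be replaced with Stan's quad_form_diag function, which computes diag_matrix(sigma) * Omega * diag_matrix(sigma) in one call:

transformed parameters {
  // ... theta = softmax(eta) as above ...
  cov_matrix[K] Sigma = quad_form_diag(Omega, sigma);  // covariance matrix
}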

The \(\textsf{LKJCorr}\) distribution with shape \(\alpha > 0\) has support on correlation matrices (i.e., symmetric positive definite with unit diagonal). Its density is defined by

具有形状参数 \(\alpha>0\) 的 \(\textsf{LKJCorr}\) 分布的支撑集为相关矩阵(即对称正定且对角线元素为 1 的矩阵)。其密度由以下公式定义:

\[ \mathsf{LkjCorr}(\Omega\mid\alpha) \propto \mathrm{det}(\Omega)^{\alpha - 1} \] With shape \(\alpha = 2\), the weakly informative prior favors a unit correlation matrix. Thus the compound effect of this prior on the covariance matrix \(\Sigma\) for the multivariate logistic normal is a slight concentration around diagonal covariance matrices with scales determined by the prior on sigma.

\(\alpha=2\) 时,弱信息先验倾向于单位相关矩阵。因此,此先验对多元 Logistic-正态协方差矩阵 \(\Sigma\) 的复合效应是略微集中在由 sigma 先验决定尺度的对角协方差矩阵周围。

Blei, David M., and John D. Lafferty. 2007. “A Correlated Topic Model of Science.” The Annals of Applied Statistics 1 (1): 17–37.
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3: 993–1022.
Gelman, Andrew, J. B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. Third Edition. London: Chapman & Hall / CRC Press.
Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge, United Kingdom: Cambridge University Press.

  1. For clustering, the non-identifiability problems for all mixture models present a problem, whereas there is no such problem for classification. Despite the difficulties with full Bayesian inference for clustering, researchers continue to use it, often in an exploratory data analysis setting rather than for predictive modeling.↩︎

  2. 在聚类任务中,所有混合模型都会面临不可识别性问题,而分类任务则不存在此问题。尽管在聚类中采用完整的贝叶斯推断存在困难,但研究者通常还是会在探索性数据分析环境中继续使用它,而不是用于预测建模。↩︎