5 有限混合模型

Finite Mixtures

本节译者：胡向宇、张梓源

初次校审：张梓源

二次校审：邱怡轩（DeepSeek 辅助）

Finite mixture models of an outcome assume that the outcome is drawn from one of several distributions, the identity of which is controlled by a categorical mixing distribution. Mixture models typically have multimodal densities with modes near the modes of the mixture components. Mixture models may be parameterized in several ways, as described in the following sections. Mixture models may be used directly for modeling data with multimodal distributions, or they may be used as priors for other parameters.

在有限混合模型中，我们假设观测结果是从多个不同分布中抽取的，而具体从哪个分布抽取则由一个分类混合分布决定。混合模型通常具有多峰密度分布，其峰值位置接近各混合成分的峰值。混合模型可以通过多种方式进行参数化，以下各节所述。混合模型可直接用于建模具有多峰分布的数据，也可作为其他参数的先验分布。

5.1 Relation to clustering

与聚类的关系

Clustering models, as discussed in the clustering chapter, are just a particular class of mixture models that have been widely applied to clustering in the engineering and machine-learning literature. The normal mixture model discussed in this chapter reappears in multivariate form as the statistical basis for the $K$-means algorithm; the latent Dirichlet allocation model, usually applied to clustering problems, can be viewed as a mixed-membership multinomial mixture model.

如聚类章节所述，聚类模型是混合模型的一个特殊类别，这类模型在工程和机器学习领域的聚类研究中被广泛应用。本章讨论的正态混合模型以多元形式重现，成为 $K$-均值算法的统计基础；通常应用于聚类问题的潜在狄利克雷分布模型，可以视为混合成分版本的多项混合模型。

5.2 Latent discrete parameterization

通过潜在离散变量参数化

One way to parameterize a mixture model is with a latent categorical variable indicating which mixture component was responsible for the outcome. For example, consider $K$ normal distributions with locations $\mu_k \in \mathbb{R}$ and scales $\sigma_k \in (0,\infty)$. Now consider mixing them in proportion $\lambda$, where $\lambda_k \geq 0$ and $\sum_{k=1}^K \lambda_k = 1$ (i.e., $\lambda$ lies in the unit $K$-simplex). For each outcome $y_n$ there is a latent variable $z_n$ in $\{ 1,\dotsc,K \}$ with a categorical distribution parameterized by $\lambda$,

参数化混合模型的一种方法是使用潜在分类变量来指示哪个混合成分生成观测结果。例如，考虑 $K$ 个正态分布，它们的位置参数是 $\mu_k \in \mathbb{R}$ ，尺度参数是 $\sigma_k \in (0,\infty)$。现在将它们按比例 $\lambda$ 混合，其中 $\lambda_k \geq 0$ and $\sum_{k=1}^K \lambda_k = 1$（即 $$位于单位 $K$ -单纯形上）。对于每个结果 $y_n$，都有一个从 $\{ 1,\dotsc,K \}$ 中取值的潜在变量 $z_n$，并有一个参数为 $\lambda$ 的分类分布。

\[ z_n \sim \textsf{categorical}(\lambda). \]

The variable $y_n$ is distributed according to the parameters of the mixture component $z_n$,

变量 $y_n$ 的分布由混合成分 $z_n$ 所对应的参数决定，

\[ y_n \sim \textsf{normal}(\mu_{z[n]},\sigma_{z[n]}). \]

This model is not directly supported by Stan because it involves discrete parameters $z_n$, but Stan can sample $\mu$ and $\sigma$ by summing out the $z$ parameter as described in the next section.

由于涉及离散参数 $z_n$，Stan 并不直接支持这种模型，但是 Stan 可以通过对 $z$ 求和，来采样 $\mu$ 和 $\sigma$，就像下一节描述的那样。

5.3 Summing out the responsibility parameter

对责任参数进行求和

To implement the normal mixture model outlined in the previous section in Stan, the discrete parameters can be summed out of the model. If $Y$ is a mixture of $K$ normal distributions with locations $\mu_k$ and scales $\sigma_k$ with mixing proportions $\lambda$ in the unit $K$-simplex, then

要在 Stan 中实现前述的正态混合模型，可以通过对离散参数求和来消除它们。如果 $Y$ 是由 $K$ 个正态分布混合而成的，这些正态分布的位置参数是 $\mu_k$ ，尺度参数是 $\sigma_k$，并且混合比例 $\lambda$ 位于单位 $K$ -单纯形上，则

\[ p_Y\left(y \mid \lambda, \mu, \sigma \right) = \sum_{k=1}^K \lambda_k \, \textsf{normal}\left(y \mid \mu_k, \sigma_k\right). \]

Log sum of exponentials: linear Sums on the log scale

指数的对数和：对数尺度上的线性求和

The log sum of exponentials function is used to define mixtures on the log scale. It is defined for two inputs by

“指数的对数和”函数用于在对数尺度上定义混合。它由两个输入定义

\[ \texttt{log}\mathtt{\_}\texttt{sum}\mathtt{\_}\texttt{exp}(a, b) = \log \left(\exp(a) + \exp(b)\right). \]

If $a$ and $b$ are probabilities on the log scale, then $\exp(a) + \exp(b)$ is their sum on the linear scale, and the outer log converts the result back to the log scale; to summarize, log_sum_exp does linear addition on the log scale. The reason to use Stan’s built-in log_sum_exp function is that it can prevent underflow and overflow in the exponentiation, by calculating the result as

若 $a$ 和 $b$ 表示对数概率，则 $\exp(a) + \exp(b)$ 即为它们在原始概率尺度上的和，外部对数将结果转换回对数尺度；总结来说，log_sum_exp 在对数尺度上进行线性加法。使用 Stan 内置的 log_sum_exp 函数的原因是，它可以通过下面的计算方式防止在指数运算中发生下溢和上溢，

\[ \log \left( \exp(a) + \exp(b)\right) = c + \log \left( \exp(a - c) + \exp(b - c) \right), \] where $c = \max(a, b)$. In this evaluation, one of the terms, $a - c$ or $b - c$, is zero and the other is negative, thus eliminating the possibility of overflow or underflow in the leading term while extracting the most arithmetic precision possible by pulling the $\max(a, b)$ out of the log-exp round trip.

其中$c = \max(a, b)$。在这个计算中，$a - c$ 和 $b - c$ 中一个是零，另一个是负数，这样就在主要项中消除了溢出或下溢的可能性，同时通过提取 $\max(a, b)$ 来避免对数-指数转换，从而获得尽可能高的算术精度。

For example, the mixture of $\textsf{normal}(-1, 2)$ with $\textsf{normal}(3, 1)$, with mixing proportion $\lambda = [0.3,0.7]^{\top}$, can be implemented in Stan as follows.

例如，具有混合比例 $\lambda = [0.3,0.7]^{\top}$ 的 $\textsf{normal}(-1, 2)$ 和 $\textsf{normal}(3, 1)$ 的混合分布可以参考如下代码在 Stan 中实现。

parameters {
  real y;
}
model {
  target += log_sum_exp(log(0.3) + normal_lpdf(y | -1, 2),
                        log(0.7) + normal_lpdf(y | 3, 1));
}

The log probability term is derived by taking

对数概率项推导如下

\[\begin{align*} \log\, &p\left(y \mid \lambda,\mu,\sigma \right) \\ &= \log\big( 0.3 \times \textsf{normal}\left(y \mid -1,2 \right) + 0.7 \times \textsf{normal}\left(y \mid 3,1 \right) \big) \\ &= \log\bigg( \exp\Big(\log\big(0.3 \times \textsf{normal}\left(y \mid -1,2 \right)\big)\Big) + \exp\Big(\log\big(0.7 \times \textsf{normal}\left(y \mid 3,1 \right)\big)\Big) \bigg) \\ &= \texttt{log}\mathtt{\_}\texttt{sum}\mathtt{\_}\texttt{exp}\big( \log(0.3) + \log \textsf{normal}\left(y \mid -1,2 \right), \log(0.7) + \log \textsf{normal}\left(y \mid 3,1 \right) \big). \end{align*}\]

Dropping uniform mixture ratios

放弃均匀的混合比例

If a two-component mixture has a mixing ratio of 0.5, then the mixing ratios can be dropped, because

如果一个两成分混合分布的混合比例为0.5，则可以删除混合比例，因为

log_half = log(0.5);
for (n in 1:N) {
    target +=
        log_sum_exp(log_half + normal_lpdf(y[n] | mu[1], sigma[1]),
                    log_half + normal_lpdf(y[n] | mu[2], sigma[2]));
}

then the $\log 0.5$ term isn’t contributing to the proportional density, and the above can be replaced with the more efficient version

$\log 0.5$ 项不会影响比例密度，上述代码可以替换为更高效的版本

for (n in 1:N) {
  target += log_sum_exp(normal_lpdf(y[n] | mu[1], sigma[1]),
                        normal_lpdf(y[n] | mu[2], sigma[2]));
}

The same result holds if there are $K$ components and the mixing simplex $\lambda$ is symmetric, i.e.,

则同样的结果在这种条件下也成立：如果有 $K$ 个成分并且混合单纯形 $\lambda$ 是对称的，即，

\[ \lambda = \left( \frac{1}{K}, \dotsc, \frac{1}{K} \right). \]

The result follows from the identity

这个结论来自于下面的等式

\[ \texttt{log}\mathtt{\_}\texttt{sum}\mathtt{\_}\texttt{exp}(c + a, c + b) = c + \texttt{log}\mathtt{\_}\texttt{sum}\mathtt{\_}\texttt{exp}(a, b) \]

and the fact that adding a constant $c$ to the log density accumulator has no effect because the log density is only specified up to an additive constant in the first place. There is nothing specific to the normal distribution here; constants may always be dropped from the target.

并且将常数 $c$ 添加到对数密度累加器中没有任何影响，因为对数密度本身就是在一个可加常数的基础上指定的。结论不仅限于正态分布；从目标中省略常数是一种常见的做法。

Recovering posterior mixture proportions

恢复后验混合比例

The posterior $p(z_n \mid y_n, \mu, \sigma)$ over the mixture indicator $z_n \in 1:K$ is often of interest as $p(z_n = k \mid y, \mu, \sigma)$ is the posterior probability that that observation $y_n$ was generated by mixture component $k$. The posterior can be computed via Bayes’s rule,

对于混合指示变量 $z_n \in 1:K$,它的后验分布 $p(z_n \mid y_n, \mu, \sigma)$ 通常很有意义,因为 $p(z_n = k \mid y, \mu, \sigma)$ 是观测值 $y_n$ 由混合成分 $k$ 生成的后验概率。这个后验分布可以通过贝叶斯法则计算得到,

\[\begin{align*} \Pr\!\left[z_n = k \mid y_n, \mu, \sigma, \lambda \right] &\propto p\left(y_n \mid z_n = k, \mu, \sigma\right)\, p\left(z_n = k \mid \lambda\right) \\ &= \textsf{normal}\left(y_n \mid \mu_k, \sigma_k\right) \cdot \lambda_k. \end{align*}\]

The normalization can be done via summation, because $z_n \in 1{:}K$ only takes on finitely many values. In detail,

归一化参数可以通过求和来完成，因为 $z_n \in 1{:}K$ 只取有限多个值。详细来说，

\[ p\left(z_n = k \mid y_n, \mu, \sigma, \lambda \right) = \frac{p\left(y_n \mid z_n = k, \mu, \sigma \right) \cdot p\left(z_n = k \mid \lambda \right)} {\sum_{k' = 1}^K p\left(y_n \mid z_n = k', \mu, \sigma \right) \cdot p\left(z_n = k' \mid \lambda \right)}. \]

On the log scale, the normalized probability is computed as

在对数尺度上，归一化后的概率为

\[\begin{align*} \log\,&\Pr\!\left[z_n = k \mid y_n, \mu, \sigma, \lambda\right] \\ &= \log p\left(y_n \mid z_n = k, \mu, \sigma\right) + \log \Pr\!\left[z_n = k \mid \lambda\right] \\ &\quad - \texttt{log}\mathtt{\_}\texttt{sum}\mathtt{\_}\texttt{exp}_{k' = 1}^K \big(\log p\left(y_n \mid z_n = k', \mu, \sigma\right) + \log p\left(z_n = k' \mid \lambda\right)\big). \end{align*}\]

This can be coded up directly in Stan; the change-point model in the change point section provides an example.

这可以直接在 Stan 中编程；变点模型章节中的变点模型提供了一个例子。

Estimating parameters of a mixture

混合模型的参数估计

Given the scheme for representing mixtures, it may be moved to an estimation setting, where the locations, scales, and mixture components are unknown. Further generalizing to a number of mixture components specified as data yields the following model.

给定表示混合分布的方案，可以对其进行估计，其中位置参数、尺度参数和混合成分是未知的。进一步将混合成分的数量作为 data 进行指定，得到以下模型。

data {
  int<lower=1> K;          // number of mixture components
  int<lower=1> N;          // number of data points
  array[N] real y;         // observations
}
parameters {
  simplex[K] theta;          // mixing proportions
  ordered[K] mu;             // locations of mixture components
  vector<lower=0>[K] sigma;  // scales of mixture components
}
model {
  vector[K] log_theta = log(theta);  // cache log calculation
  sigma ~ lognormal(0, 2);
  mu ~ normal(0, 10);
  for (n in 1:N) {
    vector[K] lps = log_theta;
    for (k in 1:K) {
      lps[k] += normal_lpdf(y[n] | mu[k], sigma[k]);
    }
    target += log_sum_exp(lps);
  }
}

The model involves K mixture components and N data points. The mixing proportion parameter theta is declared to be a unit $K$-simplex, whereas the component location parameter mu and scale parameter sigma are both defined to be K-vectors.

该模型涉及 K 个混合成分和 N 个数据点。混合比例参数 theta 被声明为单位 $K$–单纯形，而每个成分的位置参数 mu 和尺度参数 sigma 都被定义为 K 维向量。

The location parameter mu is declared to be an ordered vector in order to identify the model. This will not affect inferences that do not depend on the ordering of the components as long as the prior for the components mu[k] is symmetric, as it is here (each component has an independent $\textsf{normal}(0, 10)$ prior). It would even be possible to include a hierarchical prior for the components.

为了识别模型，位置参数 mu 被声明为有序向量。只要混合成分的先验分布 mu[k] 是对称的（这里每个成分都有独立的 $\textsf{normal}(0, 10)$ 先验分布），那么成分的排序就不会影响最后的推断结果。甚至可以为成分设定分层先验。

The values in the scale array sigma are constrained to be non-negative, and have a weakly informative prior given in the model chosen to avoid zero values and thus collapsing components.

尺度参数数组 sigma 中的值被限制为非负，并在模型中给出了一个弱信息先验，选择这样的先验是为了避免零值产生的退化情况。

The model declares a local array variable lps to be size K and uses it to accumulate the log contributions from the mixture components. The main action is in the loop over data points n. For each such point, the log of $\theta_k \times \textsf{normal}\left(y_n \mid \mu_k,\sigma_k\right)$ is calculated and added to the array lps. Then the log probability is incremented with the log sum of exponentials of those values.

该模型声明一个大小为 K 的局部数组变量 lps，并使用它来累积混合成分的对数贡献值。主要操作发生在对数据点 n 的循环中。对于每个这样的点，计算 $\theta_k \times \textsf{normal}\left(y_n \mid \mu_k,\sigma_k\right)$ 的对数，并添加到数组 lpps 中。然后，用这些值的指数和的对数来增加对数概率。

5.4 Vectorizing mixtures

混合模型的向量化

There is (currently) no way to vectorize mixture models at the observation level in Stan. This section is to warn users away from attempting to vectorize naively, as it results in a different model. A proper mixture at the observation level is defined as follows, where we assume that lambda, y[n], mu[1], mu[2], and sigma[1], sigma[2] are all scalars and lambda is between 0 and 1.

在 Stan 中，目前无法在观测级别向量化混合模型。本节旨在警告用户不要试图轻率地进行向量化，因为这会导致模型不同。我们在观察级别上正确定义的混合模型如下，假设 lambda, y[n], mu[1], mu[2],和 sigma[1], sigma[2] 都是标量，并且 lambda 的取值在0到1之间。

for (n in 1:N) {
  target += log_sum_exp(log(lambda)
                          + normal_lpdf(y[n] | mu[1], sigma[1]),
                        log1m(lambda)
                          + normal_lpdf(y[n] | mu[2], sigma[2]));

or equivalently

或者等价地

for (n in 1:N) {
  target += log_mix(lambda,
                    normal_lpdf(y[n] | mu[1], sigma[1]),
                    normal_lpdf(y[n] | mu[2], sigma[2]))
  };

This definition assumes that each observation $y_n$ may have arisen from either of the mixture components. The density is

该模型假设每个观测值 $y_n$ 都可能来自任意一个混合成分。密度函数为

\[ p\left(y \mid \lambda, \mu, \sigma\right) = \prod_{n=1}^N \big(\lambda \times \textsf{normal}\left(y_n \mid \mu_1, \sigma_1 \right) + (1 - \lambda) \times \textsf{normal}\left(y_n \mid \mu_2, \sigma_2 \right)\big). \]

Contrast the previous model with the following (erroneous) attempt to vectorize the model.

将之前的模型与下面（错误的）向量化模型进行比较。

target += log_sum_exp(log(lambda)
                        + normal_lpdf(y | mu[1], sigma[1]),
                      log1m(lambda)
                        + normal_lpdf(y | mu[2], sigma[2]));

or equivalently,

或者等价地

target += log_mix(lambda,
                  normal_lpdf(y | mu[1], sigma[1]),
                  normal_lpdf(y | mu[2], sigma[2]));

This second definition implies that the entire sequence $y_1, \dotsc, y_n$ of observations comes form one component or the other, defining a different density,

这第二个定义意味着整个观测序列 $y_1, \dotsc, y_n$ 来自于两个组成部分中的一个，从而定义了一个不同的密度，

\[ p\left(y \mid \lambda, \mu, \sigma \right) = \lambda \times \prod_{n=1}^N \textsf{normal}\left(y_n \mid \mu_1, \sigma_1\right) + (1 - \lambda) \times \prod_{n=1}^N \textsf{normal}\left(y_n \mid \mu_2, \sigma_2\right). \]

5.5 Inferences supported by mixtures

混合模型支持的推断分析

In many mixture models, the mixture components are underlyingly exchangeable in the model and thus not identifiable. This arises if the parameters of the mixture components have exchangeable priors and the mixture ratio gets a uniform prior so that the parameters of the mixture components are also exchangeable in the likelihood.

许多混合模型中的混合成分具有可交换性，这意味着它们无法被明确区分。如果混合成分的参数具有相同（可交换）的先验分布，并且混合比率具有均匀先验分布，那么混合成分的参数在似然函数中也是可交换的。

We have finessed this basic problem by ordering the parameters. This will allow us in some cases to pick out mixture components either ahead of time or after fitting (e.g., male vs. female, or Democrat vs. Republican).

我们通过对参数施加排序约束来解决这个识别问题。这种方法在某些情况下可以帮助我们识别出具体的混合成分，无论是在模型建立之前还是在模型拟合完成后（例如男性与女性，或民主党与共和党）。

In other cases, we do not care about the actual identities of the mixture components and want to consider inferences that are independent of indexes. For example, we might only be interested in posterior predictions for new observations.

在其他情况下，我们不关心混合成分的实际类别，而是更关注于不依赖于成分顺序的推断。例如，我们可能只对新观测数据的后验预测感兴趣。

Mixtures with unidentifiable components

含有无法识别成分的混合模型

As an example, consider the normal mixture from the previous section, which provides an exchangeable prior on the pairs of parameters

以前一节中的正态混合模型为例，该模型对参数对$(\mu_1, \sigma_1)$ 和 $(\mu_2, \sigma_2)$提供了可交换先验分布。

$(\mu_1, \sigma_1)$ and $(\mu_2, \sigma_2)$, \[\begin{align*} \mu_1, \mu_2 &\sim \textsf{normal}(0, 10) \\ \sigma_1, \sigma_2 &\sim \textsf{halfnormal}(0, 10) \\ \end{align*}\]

The prior on the mixture ratio is uniform,

混合比例具有均匀先验分布

\[ \lambda \sim \textsf{uniform}(0, 1), \] so that with the likelihood

因此，在给定如下似然函数时

\[ p\left(y_n \mid \mu, \sigma\right) = \lambda \times \textsf{normal}\left(y_n \mid \mu_1, \sigma_1\right) + (1 - \lambda) \times \textsf{normal}\left(y_n \mid \mu_2, \sigma_2\right), \] the joint distribution $p(y, \mu, \sigma, \lambda)$ is exchangeable in the parameters $(\mu_1, \sigma_1)$ and $(\mu_2, \sigma_2)$ with $\lambda$ flipping to $1 - \lambda$.¹

联合分布 $p(y, \mu, \sigma, \lambda)$ 在参数 $(\mu_1, \sigma_1)$ 和 $(\mu_2, \sigma_2)$ 间具有可交换性，其中 $\lambda$ 转换为 $1 - \lambda$。²

Inference under label switching

标签切换下的推断

In cases where the mixture components are not identifiable, it can be difficult to diagnose convergence of sampling or optimization algorithms because the labels will switch, or be permuted, in different MCMC chains or different optimization runs. Luckily, posterior inferences which do not refer to specific component labels are invariant under label switching and may be used directly. This subsection considers a pair of examples.

当混合成分不可识别时，诊断采样或优化算法的收敛性可能会很困难，因为标签会在不同的 MCMC 链或不同的优化运行中切换或重排列。幸运的是，不涉及特定成分标签的后验推断在标签切换下是不变的，可以直接使用。本小节考虑了一对示例。

Posterior predictive distribution

后验预测分布

Posterior predictive distribution for a new observation $\tilde{y}$ given the complete parameter vector $\theta$ will be

给定完整的参数向量 $\theta$，对一个新观测 $\tilde{y}$ 的后验预测分布将是

\[ p(\tilde{y} \mid y) = \int_{\theta} p(\tilde{y} \mid \theta) \, p(\theta \mid y) \, \textsf{d}\theta. \]

The normal mixture example from the previous section, with $\theta = (\mu, \sigma, \lambda)$, shows that the model returns the same density under label switching and thus the predictive inference is sound. In Stan, that predictive inference can be done either by computing $p(\tilde{y} \mid y)$, which is more efficient statistically in terms of effective sample size, or simulating draws of $\tilde{y}$, which is easier to plug into other inferences. Both approaches can be coded directly in the generated quantities block of the program. Here’s an example of the direct (non-sampling) approach.

前一节中的正态混合模型，其中 $\theta = (\mu, \sigma, \lambda)$，表明似然函数在标签交换下返回相同的密度，因此预测推断是可靠的。在 Stan 中，可以通过计算 $p(\tilde{y} \mid y)$（在有效样本量方面更加高效）或模拟抽取$\tilde{y}$ 来进行预测推断，后者更容易结合进其他推断中。这两种方法都可以直接编写在程序的 generated quantities 块中。这里是直接（非采样）方法的一个例子。

data {
  int<lower=0> N_tilde;
  vector[N_tilde] y_tilde;
  // ...
}
generated quantities {
  vector[N_tilde] log_p_y_tilde;
  for (n in 1:N_tilde) {
    log_p_y_tilde[n]
      = log_mix(lambda,
                normal_lpdf(y_tilde[n] | mu[1], sigma[1])
                normal_lpdf(y_tilde[n] | mu[2], sigma[2]));
  }
}

It is a bit of a bother afterwards, because the logarithm function isn’t linear and hence doesn’t distribute through averages (Jensen’s inequality shows which way the inequality goes). The right thing to do is to apply log_sum_exp of the posterior draws of log_p_y_tilde. The average log predictive density is then given by subtracting log(N_new).

之后有点麻烦，因为对数函数不是线性的，因此不能通过平均值分配（Jensen 不等式显示了不等式的方向）。正确的做法是对 log_p_y_tilde 的后验抽样应用 log_sum_exp。然后通过减去 log(N_new) 给出平均对数预测密度。

Clustering and similarity

聚类与相似性

Often a mixture model will be applied to a clustering problem and there might be two data items $y_i$ and $y_j$ for which there is a question of whether they arose from the same mixture component. If we take $z_i$ and $z_j$ to be the component responsibility discrete variables, then the quantity of interest is $z_i = z_j$, which can be summarized as an event probability

混合模型经常被应用于聚类问题，可能存在两个数据项 $y_i$ 和 $y_j$，我们想知道它们是否来源于同一个混合成分。如果我们将 $z_i$ 和$z_j$ 视为表示成分的离散变量，那么我们感兴趣的量是 $z_i = z_j$这可以被总结为一个事件概率。

\[ \Pr[z_i = z_j \mid y] = \int_{\theta} \frac{\sum_{k=0}^1 p(z_i=k, z_j = k, y_i, y_j \mid \theta)} {\sum_{k=0}^1 \sum_{m=0}^1 p(z_i = k, z_j = m, y_i, y_j \mid \theta)} \, p(\theta \mid y) \, \textsf{d}\theta. \]

As with other event probabilities, this can be calculated in the generated quantities block either by sampling $z_i$ and $z_j$ and using the indicator function on their equality, or by computing the term inside the integral as a generated quantity. As with posterior predictive distribute, working in expectation is more statistically efficient than sampling.

与其他事件概率一样，这可以在 generated quantities 块中通过对 $z_i$ 和 $z_j$ 进行抽样并使用让它们相等的示性函数，或者通过计算积分内的项作为 generated quantities 来计算。与预测似然函数一样，计算期望比采样更具统计效率。

5.6 Zero-inflated and hurdle models

零膨胀模型与障碍模型

Zero-inflated and hurdle models both provide mixtures of a Poisson and Bernoulli probability mass function to allow more flexibility in modeling the probability of a zero outcome. Zero-inflated models, as defined by Lambert (1992), add additional probability mass to the outcome of zero. Hurdle models, on the other hand, are formulated as pure mixtures of zero and non-zero outcomes.

零膨胀模型和障碍模型都提供了泊松分布和伯努利分布的概率质量函数的混合，以便更灵活地建模零结果的概率。如 Lambert (1992) 所定义的零膨胀模型在零结果上添加了额外的概率质量。另一方面，障碍模型被制定为纯零和非零结果的混合。

Zero inflation and hurdle models can be formulated for discrete distributions other than the Poisson. Zero inflation does not work for continuous distributions in Stan because of issues with derivatives; in particular, there is no way to add a point mass to a continuous distribution, such as zero-inflating a normal as a regression coefficient prior. Hurdle models can be formulated as combination of point mass at zero and continuous distribution for positive values.

零膨胀和障碍模型可以用于除泊松分布以外的离散分布。由于导数问题，零膨胀不适用于 Stan 中的连续分布；特别是，无法将点质量添加到连续分布中，例如将正态分布作为回归系数先验进行零膨胀。障碍模型可以被构建为在零点处具有点质量和正值的连续分布的组合。

Zero inflation

零膨胀

Consider the following example for zero-inflated Poisson distributions. There is a probability $\theta$ of observing a zero, and a probability $1 - \theta$ of observing a count with a $\textsf{Poisson}(\lambda)$ distribution (now $\theta$ is being used for mixing proportions because $\lambda$ is the traditional notation for a Poisson mean parameter). Given the probability $\theta$ and the intensity $\lambda$, the distribution for $y_n$ can be written as

考虑以下关于零膨胀泊松分布的例子。存在一个概率 $\theta$ 观测到一个零，和一个概率 $1 - \theta$ 观测到一个$\textsf{Poisson}(\lambda)$ 分布的计数（现在使用 $\theta$ 作为混合比例，因为 $\lambda$ 是泊松分布的传统均值参数符号）。给定概率 $\theta$ 和强度 $\lambda$，$y_n$ 的分布可以写成

\[\begin{align*} y_n & = 0 & \quad\text{with probability } \theta, \text{ and}\\ y_n & \sim \textsf{Poisson}(y_n \mid \lambda) & \quad\text{with probability } 1-\theta. \end{align*}\]

Stan does not support conditional distribution statements (with ~) conditional on some parameter, and we need to consider the corresponding likelihood

Stan 不支持条件分布语句（使用 ~）在某些参数上进行条件抽样，我们需要考虑相应的似然函数。

\[ p(y_n \mid \theta,\lambda) = \begin{cases} \theta + (1 - \theta) \times \textsf{Poisson}(0 \mid \lambda) & \quad\text{if } y_n = 0, \text{ and}\\ (1-\theta) \times \textsf{Poisson}(y_n \mid \lambda) &\quad\text{if } y_n > 0. \end{cases} \] The log likelihood can be coded directly in Stan (with target +=) as follows.

对数似然函数可以直接在 Stan 中实现（使用target +=）如下。

data {
  int<lower=0> N;
  array[N] int<lower=0> y;
}
parameters {
  real<lower=0, upper=1> theta;
  real<lower=0> lambda;
}
model {
  for (n in 1:N) {
    if (y[n] == 0) {
      target += log_sum_exp(log(theta),
                            log1m(theta)
                              + poisson_lpmf(y[n] | lambda));
    } else {
      target += log1m(theta)
                  + poisson_lpmf(y[n] | lambda);
    }
  }
}

The log1m(theta) computes log(1-theta), but is more computationally stable. The log_sum_exp(lp1,lp2) function adds the log probabilities on the linear scale; it is defined to be equal to log(exp(lp1) + exp(lp2)), but is more computationally stable and faster.

log1m(theta) 计算的是 log(1-theta)，但更具计算稳定性。log_sum_exp(lp1,lp2)函数在线性尺度上求和对数概率；它被定义为等于 log(exp(lp1) + exp(lp2))，但在计算上更稳定且更快。

Optimizing the zero-inflated Poisson model

优化零膨胀泊松模型

The code given above to compute the zero-inflated Poisson redundantly calculates all of the Bernoulli terms and also poisson_lpmf(0 | lambda) every time the first condition body executes. The use of the redundant terms is conditioned on y, which is known when the data are read in. This allows the transformed data block to be used to compute some more convenient terms for expressing the log density each iteration.

上面给出的用于计算零膨胀泊松的代码每次执行第一个条件体时，都会重复计算所有的伯努利项和 poisson_lpmf(0 | lambda)。冗余项的使用基于已知数据读入时的 y。这允许使用 transformed data 块来计算一些更方便的项，用于每次迭代中表达对数密度。

The number of zero cases is computed and handled separately. Then the nonzero cases are collected into their own array for vectorization. The number of zeros is required to declare y_nonzero, so it must be computed in a function.

零值情况的数量是单独计算和处理的。然后将非零情况收集到它们自己的数组中进行矢量化。声明y_nonzero需要零的数量，因此它必须在函数中计算。

functions {
  int num_zeros(array[] int y) {
    int sum = 0;
    for (n in 1:size(y)) {
      sum += (y[n] == 0);
    }
    return sum;
  }
}
// ...
transformed data {
  int<lower=0> N_zero = num_zeros(y);
  array[N - N_zero] int<lower=1> y_nonzero;
  int N_nonzero = 0;
  for (n in 1:N) {
    if (y[n] == 0) continue;
    N_nonzero += 1;
    y_nonzero[N_nonzero] = y[n];
  }
}
// ...
model {
  // ...
   target
     += N_zero
          * log_sum_exp(log(theta),
                        log1m(theta)
                          + poisson_lpmf(0 | lambda));
   target += N_nonzero * log1m(theta);
   target += poisson_lpmf(y_nonzero | lambda);
  // ...
}

The boundary conditions of all zeros and no zero outcomes is handled appropriately; in the vectorized case, if y_nonzero is empty, N_nonzero will be zero, and the last two target increment terms will add zeros.

所有零和非零结果的边界条件都得到了适当处理；在矢量化情况下，如果 y_nonzero 为空，N_nonzero 将为零，代码最后两行 target 增量项将加零。

Hurdle models

障碍模型

The hurdle model is similar to the zero-inflated model, but more flexible in that the zero outcomes can be deflated as well as inflated. Given the probability $\theta$ and the intensity $\lambda$, the distribution for $y_n$ can be written as

障碍模型类似于零膨胀模型，但在处理零结果时更加灵活，既可以膨胀也可以缩小零值的概率。给定概率 $\theta$ 和强度 $\lambda$，$y_n$ 的分布可以写成

[ \[\begin{align*} y_n & = 0 \quad\text{with probability } \theta, \text{ and}\\ y_n & \sim \textsf{Poisson}_{x\neq 0}(y_n \mid \lambda) \quad\text{with probability } 1-\theta, \end{align*}\] ] Where $\textsf{Poisson}_{x\neq 0}$ is a truncated Poisson distribution, truncated at $0$.

其中 $\textsf{Poisson}_{x\neq 0}$ 是在0处截断的泊松分布。

The corresponding likelihood function for the hurdle model is defined by

障碍模型对应的似然函数定义为

\[ p(y\mid\theta,\lambda) = \begin{cases} \theta &\quad\text{if } y = 0, \text{ and}\\ (1 - \theta) \frac{\displaystyle \textsf{Poisson}(y \mid \lambda)} {\displaystyle 1 - \textsf{PoissonCDF}(0 \mid \lambda)} &\quad\text{if } y > 0, \end{cases} \] where $\textsf{PoissonCDF}$ is the cumulative distribution function for the Poisson distribution and and $1 - \textsf{PoissonCDF}(0 \mid \lambda)$ is the relative normalization term for the truncated Poisson (truncated at $0$).

其中 $\textsf{PoissonCDF}$ 是泊松分布的累积分布函数，$1 - \textsf{PoissonCDF}(0 \mid \lambda)$ 是截断泊松（在0处截断）的相对归一化项。

The hurdle model is even more straightforward to program in Stan, as it does not require an explicit mixture.

障碍模型在 Stan 中编程更加简单，因为它不需要显式混合。

if (y[n] == 0) {
  target += log(theta);
} else {
  target += log1m(theta) + poisson_lpmf(y[n] | lambda)
            - poisson_lccdf(0 | lambda));
}

Julian King pointed out that because

Julian King 指出，因为

\[\begin{align*} \log \left( 1 - \textsf{PoissonCDF}(0 \mid \lambda) \right) &= \log \left( 1 - \textsf{Poisson}(0 \mid \lambda) \right) \\ &= \log(1 - \exp(-\lambda)) \end{align*}\] the CCDF in the else clause can be replaced with a simpler expression.

else 子句中的 CCDF 可以用更简单的表达式替换。

target += log1m(theta) + poisson_lpmf(y[n] | lambda)
          - log1m_exp(-lambda));

The resulting code is about 15% faster than the code with the CCDF.

替换后的代码比含有 CCDF 的代码快约 15%。

This is an example where collecting counts ahead of time can also greatly speed up the execution speed without changing the density. For data size $N=200$ and parameters $\theta=0.3$ and $\lambda = 8$, the speedup is a factor of 10; it will be lower for smaller $N$ and greater for larger $N$; it will also be greater for larger $\theta$.

这是一个例子，其中提前收集计数也可以大大加快执行速度，而不改变密度。对于数据大小 $N=200$ 和参数 $\theta=0.3$、$\lambda = 8$，加速为10倍；对于较小的 $N$ ，加速比将更低，对于较大的 $N$ ，加速比将更高；对于较大的 $\theta$，加速比也将更高。

To achieve this speedup, it helps to have a function to count the number of non-zero entries in an array of integers,

为了实现这种加速，拥有一个函数来计算整数数组中非零条目的数量是有用的，

functions {
  int num_zero(array[] int y) {
    int nz = 0;
    for (n in 1:size(y)) {
      if (y[n] == 0) {
        nz += 1;
      }
    }
    return nz;
  }
}

Then a transformed data block can be used to store the sufficient statistics,

然后可以使用 transformed data 块能用来储存充分统计量

transformed data {
  int<lower=0, upper=N> N0 = num_zero(y);
  int<lower=0, upper=N> Ngt0 = N - N0;
  array[N - num_zero(y)] int<lower=1> y_nz;
  {
    int pos = 1;
    for (n in 1:N) {
      if (y[n] != 0) {
        y_nz[pos] = y[n];
        pos += 1;
      }
    }
  }
}

The model block is then reduced to three statements.

model 块简化为三行代码

model {
  N0 ~ binomial(N, theta);
  y_nz ~ poisson(lambda);
  target += -Ngt0 * log1m_exp(-lambda);
}

The first statement accounts for the Bernoulli contribution to both the zero and non-zero counts. The second line is the Poisson contribution from the non-zero counts, which is now vectorized. Finally, the normalization for the truncation is a single line, so that the expression for the log CCDF at 0 isn’t repeated. Also note that the negation is applied to the constant Ngt0; whenever possible, leave subexpressions constant because then gradients need not be propagated until a non-constant term is encountered.

第一条语句解释了伯努利对零和非零计数的贡献。第二行是来自非零计数的泊松贡献，现在已经矢量化。最后，截断的归一化是单独的一行，因此对数 CCDF 在0处的表达式不会重复。还要注意，常量 Ngt0 取了相反数；尽可能保留子表达式常量，因为这样只有在遇到非常量项时才需要传播梯度。

5.7 Priors and effective data size in mixture models

混合模型中的先验与有效数据大小

Suppose we have a two-component mixture model with mixing rate $\lambda \in (0, 1)$. Because the likelihood for the mixture components is proportionally weighted by the mixture weights, the effective data size used to estimate each of the mixture components will also be weighted as a fraction of the overall data size. Thus although there are $N$ observations, the mixture components will be estimated with effective data sizes of $\theta \, N$ and $(1 - \theta) \, N$ for the two components for some $\theta \in (0, 1)$. The effective weighting size is determined by posterior responsibility, not simply by the mixing rate $\lambda$.

假设我们有一个两成分的混合模型，混合率为 $\lambda \in (0, 1)$。由于混合成分的似然函数按混合权重成比例加权，因此用于估计每个混合组件的有效数据大小也将按整体数据大小的一部分加权。因此，尽管有 $N$ 个观测值，但混合成分将使用 $\theta \, N$ 和 $(1 - \theta)\, N$ 的有效数据大小来进行估计，其中 $\theta \in (0, 1)$。有效加权大小由后验确定，而不仅仅是混合率 $\lambda$。

Comparison to model averaging

与模型平均的比较

In contrast to mixture models, which create mixtures at the observation level, model averaging creates mixtures over the posteriors of models separately fit with the entire data set. In this situation, the priors work as expected when fitting the models independently, with the posteriors being based on the complete observed data $y$.

与在每个观测数据级别创建混合的混合模型不同，模型平均是在整个数据集上分别拟合多个模型后，对这些模型的后验进行混合。在这种情况下，当独立拟合模型时，先验函数按预期工作，后验函数基于完整观测数据 $y$。

If different models are expected to account for different observations, we recommend building mixture models directly. If the models being mixed are similar, often a single expanded model will capture the features of both and may be used on its own for inferential purposes (estimation, decision making, prediction, etc.). For example, rather than fitting an intercept-only regression and a slope-only regression and averaging their predictions, even as a mixture model, we would recommend building a single regression with both a slope and an intercept. Model complexity, such as having more predictors than data points, can be tamed using appropriately regularizing priors. If computation becomes a bottleneck, the only recourse can be model averaging, which can be calculated after fitting each model independently (see Hoeting et al. (1999) and Gelman et al. (2013) for theoretical and computational details).

如果预计不同的模型可以解释不同的观测结果，我们建议直接构建混合模型。如果要混合的模型相似，通常一个扩展的单一模型就能捕捉到两者的特点，并且可以单独用于推断目的（估计、决策制定、预测等）。例如，即使作为混合模型，我们仍建议构建一个同时包含斜率和截距的单一回归模型，而不是拟合一个只有截距的回归和一个只有斜率的回归然后平均它们的预测。可以使用适当正则化先验来控制模型复杂性（例如具有比数据点更多预测变量）。如果计算成为瓶颈，则唯一的补救措施可能是模型平均，这可以在独立拟合每个模型之后进行计算（有关理论和计算细节，请参见 Hoeting et al. (1999) 和 Gelman et al. (2013)）。

Gelman, Andrew, J. B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. Third Edition. London: Chapman & Hall / CRC Press.

Hoeting, Jennifer A., David Madigan, Adrian E Raftery, and Chris T. Volinsky. 1999. “Bayesian Model Averaging: A Tutorial.” Statistical Science 14 (4): 382–417.

Lambert, Diane. 1992. “Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing.” Technometrics 34 (1).

Imposing a constraint such as $\theta < 0.5$ will resolve the symmetry, but fundamentally changes the model and its posterior inferences.↩︎
施加一个约束，如 $\theta < 0.5$，将解决对称性问题，但从根本上改变了模型及其后验推断。↩︎