2  时间序列模型

Time-Series Models

本节译者:林余昌、洪世哲

初次校审:李君竹

二次校审:邱怡轩(DeepSeek 辅助)

Times series data come arranged in temporal order. This chapter presents two kinds of time series models, regression-like models such as autoregressive and moving average models, and hidden Markov models.

时间序列数据以时间顺序排列。本章介绍了两种时间序列模型,第一种是基于回归的模型,例如自回归模型和移动平均模型;另外一种是隐马尔可夫模型。

The Gaussian processes chapter presents Gaussian processes, which may also be used for time-series (and spatial) data.

高斯过程章节介绍了高斯过程,高斯过程也可以用于时间序列(和空间)数据分析。

2.1 Autoregressive models

自回归模型

A first-order autoregressive model (AR(1)) with normal noise takes each point \(y_n\) in a sequence \(y\) to be generated according to

带有正态噪声的一阶自回归模型(AR(1))假设序列 \(y\) 中的每个点 \(y_n\) 都遵循以下生成规则: \[ y_n \sim \textsf{normal}(\alpha + \beta y_{n-1}, \sigma). \]

That is, the expected value of \(y_n\) is \(\alpha + \beta y_{n-1}\), with noise scaled as \(\sigma\).

也就是说,\(y_n\) 的期望值是 \(\alpha + \beta y_{n-1}\),噪声尺度为 \(\sigma\)

AR(1) models

AR(1) 模型

With improper flat priors on the regression coefficients \(\alpha\) and \(\beta\) and on the positively-constrained noise scale (\(\sigma\)), the Stan program for the AR(1) model is as follows.1

对于回归系数 \(\alpha\)\(\beta\) 以及噪声尺度 \(\sigma\)\(\sigma\)>0),非正规的均匀先验,AR(1) 模型的 Stan 程序如下。2

data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  for (n in 2:N) {
    y[n] ~ normal(alpha + beta * y[n-1], sigma);
  }
}

The first observed data point, y[1], is not modeled here because there is nothing to condition on; instead, it acts to condition y[2]. This model also uses an improper prior for sigma, but there is no obstacle to adding an informative prior if information is available on the scale of the changes in y over time, or a weakly informative prior to help guide inference if rough knowledge of the scale of y is available.

这里没有对第一个观测数据点 y[1] 进行建模,因为缺乏 y[1] 之前的数据;相反,它作为条件作用于 y[2]。该模型还对 sigma 使用了非正规的均匀先验,但如果有关于 y 随时间变化的尺度的信息可用,那么添加一个有信息的先验是可行的,或者如果有关于 y 尺度的粗略知识可用,可以添加一个弱信息先验来帮助指导推断。

Slicing for efficiency

提高效率之切片处理

Although perhaps a bit more difficult to read, a much more efficient way to write the above model is by slicing the vectors, with the model above being replaced with the one-liner

通过向量切片可以实现更高效的模型表达,尽管这可能增加理解难度。上述代码可以简化为一行代码:

model {
  y[2:N] ~ normal(alpha + beta * y[1:(N - 1)], sigma);
}

The left-hand side slicing operation pulls out the last \(N-1\) elements and the right-hand side version pulls out the first \(N-1\).

左侧的切片操作取出向量的后 \(N-1\) 个元素,右侧取出向量的前 \(N-1\) 个元素。

Extensions to the AR(1) model

AR(1) 模型拓展

Proper priors of a range of different families may be added for the regression coefficients and noise scale. The normal noise model can be changed to a Student-\(t\) distribution or any other distribution with unbounded support. The model could also be made hierarchical if multiple series of observations are available.

可以为回归系数和噪声的尺度参数添加各种不同分布族的合理先验。噪声的正态分布可以改为 Student-\(t\) 分布或任何其他支撑无界的分布。如果有多个观察序列可用,该模型也可以被构造为层次化的。

To enforce the estimation of a stationary AR(1) process, the slope coefficient beta may be constrained with bounds as follows.

为了确保估计得到的 AR(1) 过程是平稳的,可以对斜率系数 beta 施加以下约束。

real<lower=-1, upper=1> beta;

In practice, such a constraint is not recommended. If the data are not well fit by a stationary model it is best to know this. Stationary parameter estimates can be encouraged with a prior favoring values of beta near zero.

在实践中,这样的约束不建议使用。当数据无法被平稳的 AR 模型很好地拟合时,了解这一点非常重要。不过在实践中可以通过设置 beta 先验质量集中在零的分布来鼓励稳定参数的估计。

AR(2) models

AR(2) 模型

Extending the order of the model is also straightforward. For example, an AR(2) model could be coded with the second-order coefficient gamma and the following model statement.

将 AR 模型的拓展到多阶滞后的方式也很直接。例如,一个 AR(2) 模型可以用二阶 AR 系数 gamma 和以下的模型语句进行表示。

for (n in 3:N) {
  y[n] ~ normal(alpha + beta*y[n-1] + gamma*y[n-2], sigma);
}

AR(\(K\)) models

AR(\(K\)) 模型

A general model where the order is itself given as data can be coded by putting the coefficients in an array and computing the linear predictor in a loop.

一个通用 AR 模型表示方式,其中阶数作为数据给出,通过将系数放在数组中并在循环中计算线性预测器来表示。

data {
  int<lower=0> K;
  int<lower=0> N;
  array[N] real y;
}
parameters {
  real alpha;
  array[K] real beta;
  real sigma;
}
model {
  for (n in (K+1):N) {
    real mu = alpha;
    for (k in 1:K) {
      mu += beta[k] * y[n-k];
    }
    y[n] ~ normal(mu, sigma);
  }
}

ARCH(1) models

ARCH(1) 模型

Econometric and financial time-series models usually assume heteroscedasticity: they allow the scale of the noise terms defining the series to vary over time. The simplest such model is the autoregressive conditional heteroscedasticity (ARCH) model (Engle 1982). Unlike the autoregressive model AR(1), which modeled the mean of the series as varying over time but left the noise term fixed, the ARCH(1) model takes the scale of the noise terms to vary over time but leaves the mean term fixed. Models could be defined where both the mean and scale vary over time; the econometrics literature presents a wide range of time-series modeling choices.

计量经济学和金融时间序列模型通常考虑异方差性:它们允许定义序列的噪声项的尺度随时间变化。最简单的这种模型是自回归条件异方差(ARCH)模型 (Engle 1982)。AR(1) 模型在噪声项尺度参数固定不随时间变化的假设下,刻画了序列条件均值随时间变化。ARCH(1) 模型放松了对噪声项尺度参数恒定的假设,允许其随时间变化,但条件均值部分不随着时间变化。当然我们可以定义条件均值和尺度都随时间变化的模型;计量经济学文献提出了一系列广泛的时间序列建模选择。

The ARCH(1) model is typically presented as the following sequence of equations, where \(r_t\) is the observed return at time point \(t\) and \(\mu\), \(\alpha_0\), and \(\alpha_1\) are unknown regression coefficient parameters.

ARCH(1) 模型通常有用以下几个方程的形式呈现,其中 \(r_t\) 是在时间点 \(t\) 的观测值,而 \(\mu\)\(\alpha_0\)\(\alpha_1\) 是未知的回归系数参数。

\[\begin{align*} r_t &= \mu + a_t \\ a_t &= \sigma_t \epsilon_t \\ \epsilon_t &\sim \textsf{normal}(0,1) \\ \sigma^2_t &= \alpha_0 + \alpha_1 a_{t-1}^2 \end{align*}\]

In order to ensure the noise terms \(\sigma^2_t\) are positive, the scale coefficients are constrained to be positive, \(\alpha_0, \alpha_1 > 0\). To ensure stationarity of the time series, the slope is constrained to be less than one, i.e., \(\alpha_1 < 1\).3

为了确保噪声项 \(\sigma^2_t\) 大于零,尺度系数被约束为正,即 \(\alpha_0, \alpha_1 > 0\)。为了确保时间序列的平稳性,斜率被约束为小于1,即 \(\alpha_1 < 1\)4

The ARCH(1) model may be coded directly in Stan as follows.

ARCH(1) 模型可以直接在 Stan 中编码如下。

data {
  int<lower=0> T;                // number of time points
  array[T] real r;               // return at time t
}
parameters {
  real mu;                       // average return
  real<lower=0> alpha0;          // noise intercept
  real<lower=0, upper=1> alpha1; // noise slope
}
model {
  for (t in 2:T) {
    r[t] ~ normal(mu, sqrt(alpha0 + alpha1
                                    * pow(r[t - 1] - mu,2)));
  }
}

The loop in the model is defined so that the return at time \(t=1\) is not modeled; the model in the next section shows how to model the return at \(t=1\). The model can be vectorized to be more efficient; the model in the next section provides an example.

模型中的循环定义为在时间 \(t=1\) 时的观测值不被建模;下一节的模型展示了如何建模在 \(t=1\) 时的回报。模型可以向量化表示以提高效率;下一节的模型提供了一个例子。

2.2 Modeling temporal heteroscedasticity

对序列异方差性质建模

A set of variables is homoscedastic if their variances are all the same; the variables are heteroscedastic if they do not all have the same variance. Heteroscedastic time-series models allow the noise term to vary over time.

如果一组变量的方差都相同,则称这组变量为同方差;如果它们的方差不都相同,则称为异方差。异方差时间序列模型允许噪声项随时间变化。

GARCH(1,1) models

GARCH(1,1) 模型

The basic generalized autoregressive conditional heteroscedasticity (GARCH) model, GARCH(1,1), extends the ARCH(1) model by including the squared previous difference in return from the mean at time \(t-1\) as a predictor of volatility at time \(t\), defining

广义自回归条件异方差(GARCH)模型,GARCH(1,1),通过包括在时间 \(t-1\) 的平均回报的平方前差作为时间 \(t\) 的波动性的预测因子,扩展了 ARCH(1) 模型,定义如下: \[ \sigma^2_t = \alpha_0 + \alpha_1 a^2_{t-1} + \beta_1 \sigma^2_{t-1}. \]

To ensure the scale term is positive and the resulting time series stationary, the coefficients must all satisfy \(\alpha_0, \alpha_1, \beta_1 > 0\) and the slopes \(\alpha_1 + \beta_1 < 1\).

为了确保尺度项是正的,并且时间序列模型是平稳的,所有的系数必须满足 \(\alpha_0, \alpha_1,\beta_1 > 0\) 并且斜率 \(\alpha_1 + \beta_1 < 1\)

data {
  int<lower=0> T;
  array[T] real r;
  real<lower=0> sigma1;
}
parameters {
  real mu;
  real<lower=0> alpha0;
  real<lower=0, upper=1> alpha1;
  real<lower=0, upper=(1-alpha1)> beta1;
}
transformed parameters {
  array[T] real<lower=0> sigma;
  sigma[1] = sigma1;
  for (t in 2:T) {
    sigma[t] = sqrt(alpha0
                     + alpha1 * pow(r[t - 1] - mu, 2)
                     + beta1 * pow(sigma[t - 1], 2));
  }
}
model {
  r ~ normal(mu, sigma);
}

To get the recursive definition of the volatility regression off the ground, the data declaration includes a non-negative value sigma1 for the scale of the noise at \(t = 1\).

为了开启波动率回归的递归表示,数据声明包括一个非负值 sigma1 作为 \(t = 1\) 时噪声的规模。

The constraints are coded directly on the parameter declarations. This declaration is order-specific in that the constraint on beta1 depends on the value of alpha1.

直接在参数声明上添加约束。这个声明是按顺序指定的,因为 beta1 的约束依赖于 alpha1 的值。

A transformed parameter array of non-negative values sigma is used to store the scale values at each time point. The definition of these values in the transformed parameters block is where the regression is now defined. There is an intercept alpha0, a slope alpha1 for the squared difference in return from the mean at the previous time, and a slope beta1 for the previous noise scale squared. Finally, the whole regression is inside the sqrt function because Stan requires scale (deviation) parameters (not variance parameters) for the normal distribution.

一个非负值的转换参数数组 sigma 被用来存储每个时间点的尺度值。这些值在转换参数块中的定义是现在定义回归的地方。有一个截距 alpha0,一个对前一时间的观测值与均值的平方差的斜率alpha1,和一个对前一个噪声规模平方的斜率 beta1。最后,整个回归在 sqrt 函数内部,因为 Stan 需要正态分布的规模(偏差)参数(不是方差参数)。

With the regression in the transformed parameters block, the model reduces a single vectorized distribution statement. Because r and sigma are of length T, all of the data are modeled directly.

有了转换参数块中的回归,模型就简化为一个单一的向量化采样语句。因为 rsigma 的长度是 T ,所有的数据都被直接建模。

2.3 Moving average models

移动平均模型

A moving average model uses previous errors as predictors for future outcomes. For a moving average model of order \(Q\), \(\mbox{MA}(Q)\), there is an overall mean parameter \(\mu\) and regression coefficients \(\theta_q\) for previous error terms. With \(\epsilon_t\) being the noise at time \(t\), the model for outcome \(y_t\) is defined by

一个移动平均模型使用先前的误差作为未来结果的自变量。对于阶数为 \(Q\) 的移动平均模型,\(\mbox{MA}(Q)\),有一个均值参数 \(\mu\) 和对先前误差项的回归系数 \(\theta_q\)。用 \(\epsilon_t\) 表示时间 \(t\) 的噪声,结果 \(y_t\) 的模型由以下定义: \[ y_t = \mu + \theta_1 \epsilon_{t-1} + \dotsb + \theta_Q \epsilon_{t-Q} + \epsilon_t, \] with the noise term \(\epsilon_t\) for outcome \(y_t\) modeled as normal,

其中,结果 \(y_t\) 的噪声项 \(\epsilon_t\) 建模为正态分布, \[ \epsilon_t \sim \textsf{normal}(0,\sigma). \] In a proper Bayesian model, the parameters \(\mu\), \(\theta\), and \(\sigma\) must all be given priors.

在一个合理的贝叶斯模型中,参数 \(\mu\)\(\theta\)\(\sigma\) 都必须给定先验分布。

MA(2) example

MA(2) 示例

An \(\mbox{MA}(2)\) model can be coded in Stan as follows.

在 Stan 中,\(\mbox{MA}(2)\) 模型可以表示如下。

data {
  int<lower=3> T;          // number of observations
  vector[T] y;             // observation at time T
}
parameters {
  real mu;                 // mean
  real<lower=0> sigma;     // error scale
  vector[2] theta;         // lag coefficients
}
transformed parameters {
  vector[T] epsilon;       // error terms
  epsilon[1] = y[1] - mu;
  epsilon[2] = y[2] - mu - theta[1] * epsilon[1];
  for (t in 3:T) {
    epsilon[t] = ( y[t] - mu
                    - theta[1] * epsilon[t - 1]
                    - theta[2] * epsilon[t - 2] );
  }
}
model {
  mu ~ cauchy(0, 2.5);
  theta ~ cauchy(0, 2.5);
  sigma ~ cauchy(0, 2.5);
  for (t in 3:T) {
    y[t] ~ normal(mu
                  + theta[1] * epsilon[t - 1]
                  + theta[2] * epsilon[t - 2],
                  sigma);
  }
}

The error terms \(\epsilon_t\) are defined as transformed parameters in terms of the observations and parameters. The definition of the distribution statement (which also defines the likelihood) follows the definition, which can only be applied to \(y_n\) for \(n > Q\). In this example, the parameters are all given Cauchy (half-Cauchy for \(\sigma\)) priors, although other priors can be used just as easily.

误差项 \(\epsilon_t\) 被定义为观察值和参数的转换参数。采样语句的定义(定义似然)跟随定义,它只能应用于 \(y_n\),其中 \(n > Q\)。在这个例子中,所有的参数都给定了 Cauchy(对于 \(\sigma\) 是半 Cauchy)先验,虽然其他的先验可以同样容易地使用。

This model could be improved in terms of speed by vectorizing the distribution statement in the model block. Vectorizing the calculation of the \(\epsilon_t\) could also be sped up by using a dot product instead of a loop.

这个模型可以在运行速度上进行改进,通过向量化模型块中的采样语句。通过使用点积而不是循环,可以加速 \(\epsilon_t\) 的计算。

Vectorized MA(Q) model

向量 MA(Q) 模型

A general \(\mbox{MA}(Q)\) model with a vectorized distribution statement may be defined as follows.

可以如下定义一个带有向量化分布语句的一般 \(\mbox{MA}(Q)\) 模型。

data {
  int<lower=0> Q;       // num previous noise terms
  int<lower=3> T;       // num observations
  vector[T] y;          // observation at time t
}
parameters {
  real mu;              // mean
  real<lower=0> sigma;  // error scale
  vector[Q] theta;      // error coeff, lag -t
}
transformed parameters {
  vector[T] epsilon;    // error term at time t
  for (t in 1:T) {
    epsilon[t] = y[t] - mu;
    for (q in 1:min(t - 1, Q)) {
      epsilon[t] = epsilon[t] - theta[q] * epsilon[t - q];
    }
  }
}
model {
  vector[T] eta;
  mu ~ cauchy(0, 2.5);
  theta ~ cauchy(0, 2.5);
  sigma ~ cauchy(0, 2.5);
  for (t in 1:T) {
    eta[t] = mu;
    for (q in 1:min(t - 1, Q)) {
      eta[t] = eta[t] + theta[q] * epsilon[t - q];
    }
  }
  y ~ normal(eta, sigma);
}

Here all of the data are modeled, with missing terms just dropped from the regressions as in the calculation of the error terms. Both models converge quickly and mix well at convergence, with the vectorized model being faster (per iteration, not to converge—they compute the same model).

在这里,所有的数据都被建模,缺失的项在计算误差项时就被从回归中删除了。两种模型都快速收敛并在收敛时混合得很好,其中向量化模型更快(每次迭代并不收敛——它们计算的是同一模型)。

2.4 Autoregressive moving average models

自回归滑动平均模型

Autoregressive moving-average models (ARMA), combine the predictors of the autoregressive model and the moving average model. An ARMA(1,1) model, with a single state of history, can be encoded in Stan as follows.

自回归移动平均模型(ARMA)结合了自回归模型和移动平均模型的预测器。ARMA(1,1) 模型,具有单一的历史状态,可以如下在 Stan 中进行编码。

data {
  int<lower=1> T;            // num observations
  array[T] real y;                 // observed outputs
}
parameters {
  real mu;                   // mean coeff
  real phi;                  // autoregression coeff
  real theta;                // moving avg coeff
  real<lower=0> sigma;       // noise scale
}
model {
  vector[T] nu;              // prediction for time t
  vector[T] err;             // error for time t
  nu[1] = mu + phi * mu;     // assume err[0] == 0
  err[1] = y[1] - nu[1];
  for (t in 2:T) {
    nu[t] = mu + phi * y[t - 1] + theta * err[t - 1];
    err[t] = y[t] - nu[t];
  }
  mu ~ normal(0, 10);        // priors
  phi ~ normal(0, 2);
  theta ~ normal(0, 2);
  sigma ~ cauchy(0, 5);
  err ~ normal(0, sigma);    // error model
}

The data are declared in the same way as the other time-series regressions and the parameters are documented in the code.

数据的声明方式与其他时间序列回归相同,参数在代码中有说明。

In the model block, the local vector nu stores the predictions and err the errors. These are computed similarly to the errors in the moving average models described in the previous section.

在模型块中,局部向量 nu 存储预测,err 存储误差。这些计算方式与上一节描述的移动平均模型中的误差类似。

The priors are weakly informative for stationary processes. The data model only involves the error term, which is efficiently vectorized here.

先验对于平稳过程来说是弱信息的。似然只涉及误差项,这里高效地进行了向量化。

Often in models such as these, it is desirable to inspect the calculated error terms. This could easily be accomplished in Stan by declaring err as a transformed parameter, then defining it the same way as in the model above. The vector nu could still be a local variable, only now it will be in the transformed parameter block.

在这样的模型中,通常希望检查计算出的误差项。在 Stan 中,可以通过将 err 声明为一个转换参数,然后按照上述模型中的方式定义它来轻松实现。向量 nu 仍然可以作为一个局部变量,只是现在它将在转换参数块中。

Wayne Folta suggested encoding the model without local vector variables as follows.

Wayne Folta 建议将模型编码为没有局部向量变量,如下所示。

model {
  real err;
  mu ~ normal(0, 10);
  phi ~ normal(0, 2);
  theta ~ normal(0, 2);
  sigma ~ cauchy(0, 5);
  err = y[1] - (mu + phi * mu);
  err ~ normal(0, sigma);
  for (t in 2:T) {
    err = y[t] - (mu + phi * y[t - 1] + theta * err);
    err ~ normal(0, sigma);
  }
}

This approach to ARMA models illustrates how local variables, such as err in this case, can be reused in Stan. Folta’s approach could be extended to higher order moving-average models by storing more than one error term as a local variable and reassigning them in the loop.

这种对 ARMA 模型的方法说明了在 Stan 中如何重复使用局部变量,例如在这种情况下的 err。Folta 的方法可以通过存储多个误差项作为局部变量,并在循环中重新赋值来扩展到高阶移动平均模型。

Both encodings are fast. The original encoding has the advantage of vectorizing the normal distribution, but it uses a bit more memory. A halfway point would be to vectorize just err.

两种编码都很快。原始编码的优点是向量化了正态分布,但是它使用了更多的内存。一个折中的方法是仅对 err 进行向量化。

Identifiability and stationarity

可识别性和平稳性

MA and ARMA models are not identifiable if the roots of the characteristic polynomial for the MA part lie inside the unit circle, so it’s necessary to add the following constraint5

如果 MA 部分的特征多项式的根位于单位圆内,则 MA 和 ARMA 模型是不可识别的,因此需要添加以下约束6

real<lower=-1, upper=1> theta;

When the model is run without the constraint, using synthetic data generated from the model, the simulation can sometimes find modes for (theta, phi) outside the \([-1,1]\) interval, which creates a multiple mode problem in the posterior and also causes the NUTS tree depth to get large (often above 10). Adding the constraint both improves the accuracy of the posterior and dramatically reduces the tree depth, which speeds up the simulation considerably (typically by much more than an order of magnitude).

从该模型中产生多组的数据进行模拟实验,当在没有约束情况下进行估计时,模拟有时会出现(theta, phi)在 \([-1,1]\) 区间之外的情况,这在后验中创建了一个多模式问题,并且导致 NUTS 树深度变大(通常超过10)。添加约束既提高了后验的准确性,也显著减少了树的深度,这大大加速了模拟(通常超过一个数量级)。

Further, unless one thinks that the process is really non-stationary, it’s worth adding the following constraint to ensure stationarity.

此外,除非你认为这个过程真的是非平稳的,否则值得添加以下约束来确保平稳性。

real<lower=-1, upper=1> phi;

2.5 Stochastic volatility models

随机波动率模型

Stochastic volatility models treat the volatility (i.e., variance) of a return on an asset, such as an option to buy a security, as following a latent stochastic process in discrete time (Kim, Shephard, and Chib 1998). The data consist of mean corrected (i.e., centered) returns \(y_t\) on an underlying asset at \(T\) equally spaced time points. Kim et al. formulate a typical stochastic volatility model using the following regression-like equations, with a latent parameter \(h_t\) for the log volatility, along with parameters \(\mu\) for the mean log volatility, and \(\phi\) for the persistence of the volatility term. The variable \(\epsilon_t\) represents the white-noise shock (i.e., multiplicative error) on the asset return at time \(t\), whereas \(\delta_t\) represents the shock on volatility at time \(t\).

随机波动率模型将资产(如购买证券的期权)的回报的波动性(即方差)视为在离散时间下遵循潜在的随机过程 (Kim, Shephard, and Chib 1998)。数据由在 \(T\) 个等间隔时间点上的底层资产的均值校正(即,中心化)回报 \(y_t\) 组成。Kim 等人用以下的类回归方程来构建典型的随机波动性模型,其中有一个潜在参数 \(h_t\) 用于表示对数波动性,以及参数 \(\mu\) 表示平均对数波动性,\(\phi\) 表示波动性项的持续性。变量 \(\epsilon_t\) 表示在时间 \(t\) 处的资产回报的白噪声冲击(即乘法误差),而 \(\delta_t\) 表示在时间 \(t\) 的波动性冲击。

\[\begin{align*} y_t &= \epsilon_t \exp(h_t / 2) \\ h_{t+1} &= \mu + \phi (h_t - \mu) + \delta_t \sigma \\ h_1 &\sim \textsf{normal}\left( \mu, \frac{\sigma}{\sqrt{1 - \phi^2}} \right) \\ \epsilon_t &\sim \textsf{normal}(0,1) \\ \delta_t &\sim \textsf{normal}(0,1) \end{align*}\]

Rearranging the first line, \(\epsilon_t = y_t \exp(-h_t / 2)\), allowing the distribution for \(y_t\) to be written as

将第一行重新排列,\(\epsilon_t = y_t \exp(-h_t / 2)\),允许为 \(y_t\) 编写分布

\[ y_t \sim \textsf{normal}(0,\exp(h_t/2)). \] The recurrence equation for \(h_{t+1}\) may be combined with the scaling of \(\delta_t\) to yield the distribution

\(h_{t+1}\) 的递归方程可以与 \(\delta_t\) 的缩放和采样结合,得到分布

\[ h_t \sim \mathsf{normal}(\mu + \phi(h_{t-1} - \mu), \sigma). \] This formulation can be directly encoded, as shown in the following Stan model.

这种表述可以直接编码,如下面的 Stan 模型所示。

data {
  int<lower=0> T;   // # time points (equally spaced)
  vector[T] y;      // mean corrected return at time t
}
parameters {
  real mu;                     // mean log volatility
  real<lower=-1, upper=1> phi; // persistence of volatility
  real<lower=0> sigma;         // white noise shock scale
  vector[T] h;                 // log volatility at time t
}
model {
  phi ~ uniform(-1, 1);
  sigma ~ cauchy(0, 5);
  mu ~ cauchy(0, 10);
  h[1] ~ normal(mu, sigma / sqrt(1 - phi * phi));
  for (t in 2:T) {
    h[t] ~ normal(mu + phi * (h[t - 1] -  mu), sigma);
  }
  for (t in 1:T) {
    y[t] ~ normal(0, exp(h[t] / 2));
  }
}

Compared to the Kim et al. formulation, the Stan model adds priors for the parameters \(\phi\), \(\sigma\), and \(\mu\). The shock terms \(\epsilon_t\) and \(\delta_t\) do not appear explicitly in the model, although they could be calculated efficiently in a generated quantities block.

与 Kim 等人的表述相比,Stan 模型在参数 \(\phi\)\(\sigma\)\(\mu\) 上添加了先验。冲击项 \(\epsilon_t\)\(\delta_t\) 在模型中没有明确出现,尽管它们可以在生成的数量块中有效地计算。

The posterior of a stochastic volatility model such as this one typically has high posterior variance. For example, simulating 500 data points from the above model with \(\mu = -1.02\), \(\phi = 0.95\), and \(\sigma = 0.25\) leads to 95% posterior intervals for \(\mu\) of \((-1.23, -0.54)\), for \(\phi\) of \((0.82, 0.98)\), and for \(\sigma\) of \((0.16, 0.38)\).

像这样的随机波动率模型的后验通常具有较大的后验方差。例如,用 \(\mu = -1.02\)\(\phi = 0.95\)\(\sigma = 0.25\) 模拟500个数据点,得到 \(\mu\) 的95%后验区间为 \((-1.23, -0.54)\)\(\phi\) 的为 \((0.82, 0.98)\)\(\sigma\) 的 为\((0.16, 0.38)\)

The NUTS draws show a high degree of autocorrelation, both for this model and the stochastic volatility model evaluated in (Hoffman and Gelman 2014). Using a non-diagonal mass matrix provides faster convergence and higher effective sample size than a diagonal mass matrix, but will not scale to large values of \(T\).

NUTS 的采样点展示出高度的自相关性,这对于此处的模型和在 (Hoffman and Gelman 2014) 中评估的随机波动率模型都是如此。使用非对角的质量矩阵将能获得比对角质量矩阵更快的收敛速度和更大的有效样本量,但无法扩展到较大的 \(T\) 值。

It is relatively straightforward to speed up the effective sample size per second generated by this model by one or more orders of magnitude. First, the distribution statements for return \(y\) is easily vectorized to

想要将当前模型的每秒有效样本量增大一个或更多数量级,可以使用如下相对直接的方法。首先,可以轻松将回报 \(y\) 的采样语句向量化为

y ~ normal(0, exp(h / 2));

This speeds up the iterations, but does not change the effective sample size because the underlying parameterization and log probability function have not changed. Mixing is improved by reparameterizing in terms of a standardized volatility, then rescaling. This requires a standardized parameter h_std to be declared instead of h.

这加快了迭代速度,但不会改变有效样本量,因为背后的参数化和对数概率函数并未改变。通过将标准化波动性重新参数化,然后重新缩放,可以改善混合效果。这需要声明一个标准化参数 h_std 而不是 h

parameters {
  // ...
  vector[T] h_std;  // std log volatility time t
}

The original value of h is then defined in a transformed parameter block.

原始值 h 在转换参数块中定义。

transformed parameters {
  vector[T] h = h_std * sigma;  // now h ~ normal(0, sigma)
  h[1] /= sqrt(1 - phi * phi);  // rescale h[1]
  h += mu;
  for (t in 2:T) {
    h[t] += phi * (h[t - 1] - mu);
  }
}

The first assignment rescales h_std to have a \(\textsf{normal}(0,\sigma)\) distribution and temporarily assigns it to h. The second assignment rescales h[1] so that its prior differs from that of h[2] through h[T]. The next assignment supplies a mu offset, so that h[2] through h[T] are now distributed \(\textsf{normal}(\mu,\sigma)\); note that this shift must be done after the rescaling of h[1]. The final loop adds in the moving average so that h[2] through h[T] are appropriately modeled relative to phi and mu.

第一个赋值操作将 h_std 重新缩放为 \(\textsf{normal}(0,\sigma)\) 分布,并暂时赋值给 h。第二个赋值操作重新缩放 h[1],使得其先验与 h[2]h[T] 的先验不同。接下来的赋值操作提供了一个 mu 偏移量,因此 h[2]h[T] 现在分布为 \(\textsf{normal}(\mu,\sigma)\);请注意,这个偏移必须在重新缩放 h[1] 之后进行。最后的循环添加了移动平均数,因此 h[2]h[T] 都根据 phimu 进行了适当的建模。

As a final improvement, the distribution statements for h[1] to h[T] are replaced with a single vectorized standard normal distribution statement.

作为最后的改进,h[1]h[T] 的分布语句被替换为单个向量化的标准正态分布语句。

model {
  // ...
  h_std ~ std_normal();
}

Although the original model can take hundreds and sometimes thousands of iterations to converge, the reparameterized model reliably converges in tens of iterations. Mixing is also dramatically improved, which results in higher effective sample sizes per iteration. Finally, each iteration runs in roughly a quarter of the time of the original iterations.

虽然原始模型可能需要数百乃至数千次迭代才能收敛,但经过重新参数化的模型通常只需数十次迭代即可可靠收敛。混合性也得到了显著改善,这导致了每次迭代的有效样本大小更大。最后,每次迭代的运行时间大约是原始迭代时间的四分之一。

2.6 Hidden Markov models

隐马尔可夫模型

A Hidden Markov model is a probabilistic model over \(N\) observations \(y_{1:N}\) and \(N\) hidden states \(z_{1:N}\). This models is defined by the conditional distributions \(p(y_n \mid z_n, \phi)\) and \(p(z_n \mid z_{n-1}, \phi)\). Here we make the dependency on additional model parameters \(\phi\) explicit. (\(\phi\) may be a vector of parameters.) The complete data likelihood is then

隐马尔可夫模型(Hidden Markov model, HMM)是一个基于 \(N\) 个观测值 \(y_{1:N}\)\(N\) 个隐状态 \(z_{1:N}\) 的概率模型。该模型由条件分布 \(p(y_n \mid z_n, \phi)\)\(p(z_n \mid z_{n-1}, \phi)\) 定义。这里我们明确了对其他模型参数 \(\phi\) 的依赖关系。(\(\phi\) 可以是参数向量。)完整数据似然函数为:

\[ p(y, z \mid \phi) = \prod_n p(y_n \mid z_n, \phi) p(z_n \mid z_{n - 1}, \phi) \] When \(z_{1:N}\) is continuous, the user can explicitly encode these distributions in Stan and use Markov chain Monte Carlo to integrate \(z\) out.

\(z_{1:N}\) 是连续时,用户可以在 Stan 中显式编码这些分布,并使用马尔可夫链蒙特卡洛方法将 \(z\) 积分掉。

When each state \(z\) takes a value over a discrete and finite set, say \(\{1, 2, ..., K\}\), we can use Stan’s suite of HMM functions to marginalize out \(z_{1:N}\) and compute

当每个状态 \(z\) 在离散有限集合上取值时,比如 \(\{1, 2, ..., K\}\),我们可以使用 Stan 的 HMM 函数套件将 \(z_{1:N}\) 边缘化,并计算:

\[ p(y_{1:N} \mid \phi) = \int_{\mathcal Z} p(y, z \mid \phi) \text d z. \] We start by defining the conditional observation distribution, stored in a \(K \times N\) matrix \(\omega\) with

我们首先定义条件观测分布,存储在一个 \(K \times N\) 矩阵 \(\omega\) 中,其中:

\[ \omega_{kn} = p(y_n \mid z_n = k, \phi). \] Next, we introduce the \(K \times K\) transition matrix, \(\Gamma\), with

接下来,我们引入 \(K \times K\) 转移矩阵 \(\Gamma\),其中:

\[ \Gamma_{ij} = p(z_n = j \mid z_{n - 1} = i, \phi). \] (This is a right-stochastic matrix.) Finally, we define the initial state \(K\)-vector \(\rho\), with

(这是一个右随机矩阵。)最后,我们定义初始状态 \(K\) 维向量 \(\rho\),其中:

\[ \rho_k = p(z_0 = k \mid \phi). \] It is common practice to set \(\rho\) to be the stationary distribution of the HMM, that is \(\rho\) is the first eigenvector of \(\Gamma\) and solves \(\Gamma \rho = \rho\).

常见的做法是将 \(\rho\) 设置为 HMM 的平稳分布,即 \(\rho\)\(\Gamma\) 的第一个特征向量,满足 \(\Gamma \rho = \rho\)

As an example, consider a three-state model with \(K=3\). The observations are normally distributed conditional on the HMM states with

举个例子,考虑一个三状态模型,\(K=3\)。观测值在 HMM 状态的条件下服从正态分布:

\[ y_n \sim \text{normal}(\mu_k, \sigma), \] where \(\mu = (1, 5, 9)\) and the standard deviation \(\sigma\) is the same across all observations. The model is then

其中 \(\mu = (1, 5, 9)\),标准差 \(\sigma\) 在所有观测中相同。该模型如下:

data {
  int N;  // Number of observations
  array[N] real y;
}

parameters {
  // Rows of the transition matrix
  array[3] simplex[3] gamma_arr;

  // Initial state
  simplex[3] rho;

  // Parameters of measurement model
  vector[3] mu;
  real<lower = 0.0> sigma;
}

transformed parameters {
  // Build transition matrix
  matrix[3, 3] gamma;
  for (k in 1:3) gamma[k, ] = to_row_vector(gamma_arr[k]);

  // Compute the log likelihoods in each possible state
  matrix[3, N] log_omega;
  for (n in 1:N) {
    for (i in 1:3) {
      log_omega[i, n] = normal_lpdf(y[n] | mu[i], sigma);
    }
  }
}

model {
  // prior
  mu ~ normal(0, 1);
  sigma ~ normal(0, 1);
  
  // no explicit prior on gamma_arr, meaning we default to a
  // uniform prior over the simplexes.

  // Increment target by log p(y | mu, sigma, Gamma, rho)
  target += hmm_marginal(log_omega, gamma, rho);
}

The last function hmm_marginal takes in all the ingredients of the HMM and computes the relevant log marginal distribution, \(\log p(y \mid \phi)\).

最后一个函数 hmm_marginal 接收 HMM 的所有组成部分,并计算相关的对数边缘分布 \(\log p(y \mid \phi)\)

If we desire draws from the posterior distribution of \(z\), we use the generated quantities block and draw, for each sample \(\phi\), a sample from \(p(z \mid y, \phi)\). In effect, MCMC produces draws from \(p(\phi \mid y)\) and with the draws in generated quantities, we obtain draws from \(p(\phi \mid y) p(z \mid y, \phi) = p(z, \phi \mid y)\). It is also possible to compute the posterior probbability of each hidden state, that is \(\text{Pr}(z_n = k \mid \phi, y)\). Averagging these probabilities over all MCMC draws, we obtain \(\text{Pr}(z_n = k \mid y)\).

如果我们希望从 \(z\) 的后验分布中抽取样本,我们使用生成量块,并对每个样本 \(\phi\)\(p(z \mid y, \phi)\) 中抽取一个样本。实际上,MCMC 从 \(p(\phi \mid y)\) 中产生样本,而通过生成量中的样本,我们从 \(p(\phi \mid y) p(z \mid y, \phi) = p(z, \phi \mid y)\) 中获得样本。也可以计算每个隐状态的后验概率,即 \(\text{Pr}(z_n = k \mid \phi, y)\)。在所有 MCMC 样本上对这些概率求平均,我们得到 \(\text{Pr}(z_n = k \mid y)\)

generated quantities {
  array[N] int latent_states = hmm_latent_rng(log_omega, gamma, rho);
  matrix[3, N] hidden_probs = hmm_hidden_state_prob(log_omega, gamma, rho);
}

hmm_hidden_state_prob returns the marginal probabilities of each state, \(\text{Pr}(z_n = k \mid \phi, y)\). This function cannot be used to compute the joint probability \(\text{Pr}(z \mid \phi, y)\), because such calculation requires accounting for the posterior correlation between the different components of \(z\). Therefore, hidden_probs should not be used to obtain posterior draws. Instead, users should rely on hmm_latent_rng.

hmm_hidden_state_prob 返回每个状态的边缘概率 \(\text{Pr}(z_n = k \mid \phi, y)\)。这个函数不能用来计算联合概率 \(\text{Pr}(z \mid \phi, y)\),因为这样的计算需要考虑 \(z\) 的不同分量之间的后验相关性。因此,hidden_probs 不应该用来获取后验样本。相反,用户应该依赖 hmm_latent_rng

generated quantities {
   array[N] int<lower=1, upper=K> z = hmm_latent_rng(...fill-in params here to match example...);
}

The example in this section is derived from the more detailed case study by Ben Bales: https://mc-stan.org/users/documentation/case-studies/hmm-example.html.

本节中的示例来源于 Ben Bales 更详细的案例研究: https://mc-stan.org/users/documentation/case-studies/hmm-example.html

Engle, Robert F. 1982. “Autoregressive Conditional Heteroscedasticity with Estimates of Variance of United Kingdom Inflation.” Econometrica 50: 987–1008.
Hoffman, Matthew D., and Andrew Gelman. 2014. The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research 15: 1593–623. http://jmlr.org/papers/v15/hoffman14a.html.
Kim, Sangjoon, Neil Shephard, and Siddhartha Chib. 1998. “Stochastic Volatility: Likelihood Inference and Comparison with ARCH Models.” Review of Economic Studies 65: 361–93.

  1. The intercept in this model is \(\alpha / (1 - \beta)\). An alternative parameterization in terms of an intercept \(\gamma\) suggested Mark Scheuerell on GitHub is \(y_n \sim \textsf{normal}\left(\gamma + \beta \cdot (y_{n-1} - \gamma), \sigma\right)\).↩︎

  2. 这个模型中的截距是 \(\alpha / (1 - \beta)\)。Mark Scheuerell 在 GitHub 上建议用截距 \(\gamma\) 表示的另一种参数化方式是 \(y_n \sim \textsf{normal}\left(\gamma + \beta \cdot (y_{n-1} - \gamma), \sigma\right)\)↩︎

  3. In practice, it can be useful to remove the constraint to test whether a non-stationary set of coefficients provides a better fit to the data. It can also be useful to add a trend term to the model, because an unfitted trend will manifest as non-stationarity.↩︎

  4. 在实践中,移除约束以测试非稳定系数是否能更好地拟合数据可能是有用的。添加趋势项到模型也可能是有用的,因为一个未拟合的趋势会表现为非稳定性。↩︎

  5. This subsection is a lightly edited comment of Jonathan Gilligan’s on GitHub; see https://github.com/stan-dev/stan/issues/1617#issuecomment-160249142.↩︎

  6. 这一小节是 Jonathan Gilligan 在 GitHub 上的评论; 见 https://github.com/stan-dev/stan/issues/1617#issuecomment-160249142↩︎