10  高斯过程

Gaussian Processes

本节译者:李巧玲、王嘉宁

初次校审:李君竹

二次校审:邱怡轩(DeepSeek 辅助)

Gaussian processes are continuous stochastic processes and thus may be interpreted as providing a probability distribution over functions. A probability distribution over continuous functions may be viewed, roughly, as an uncountably infinite collection of random variables, one for each valid input. The generality of the supported functions makes Gaussian processes popular choices for priors in general multivariate (non-linear) regression problems.

高斯过程是一种连续随机过程,因此可以解释为对函数的概率分布。连续函数的概率分布可以粗略地看作是随机变量的不可数无限集合,其中每个随机变量对应一个有效输入。由于所支持的函数具有广泛性,高斯过程常被用作一般多元(非线性)回归问题中的先验。

The defining feature of a Gaussian process is that the joint distribution of the function’s value at a finite number of input points is a multivariate normal distribution. This makes it tractable to both fit models from finite amounts of observed data and make predictions for finitely many new data points.

高斯过程的定义特征在于,函数在有限个输入点处的联合分布是多元正态分布。这使得既可以从有限数量的观测数据拟合模型,又可以预测有限数量的新数据点。

Unlike a simple multivariate normal distribution, which is parameterized by a mean vector and covariance matrix, a Gaussian process is parameterized by a mean function and covariance function. The mean and covariance functions apply to vectors of inputs and return a mean vector and covariance matrix which provide the mean and covariance of the outputs corresponding to those input points in the functions drawn from the process.

与由均值向量和协方差矩阵参数化的简单多元正态分布不同,高斯过程由均值函数和协方差函数参数化。均值函数和协方差函数作用于输入向量,并返回一个均值向量和一个协方差矩阵,它们给出了从该过程中抽取的函数在这些输入点处的输出的均值和协方差。

Gaussian processes can be encoded in Stan by implementing their mean and covariance functions or by using the specialized covariance functions outlined below, and plugging the result into the Gaussian model.
This form of model is straightforward and may be used for simulation, model fitting, or posterior predictive inference. A more efficient Stan implementation for the GP with a normally distributed outcome marginalizes over the latent Gaussian process, and applies a Cholesky-factor reparameterization of the Gaussian to compute the likelihood and the posterior predictive distribution analytically.

高斯过程可以通过实现其均值和协方差函数并将结果代入其采样分布的高斯形式,或使用下面概述的专用协方差函数,在 Stan 中进行编码。这种形式的模型很简单,可用于模拟、模型拟合或后验预测推断。对于具有正态分布结果的高斯过程,一种更高效的 Stan 实现是对潜在高斯过程进行边缘化,并应用高斯分布的 Cholesky 因子重参数化来解析地计算似然和后验预测分布。

After defining Gaussian processes, this chapter covers the basic implementations for simulation, hyperparameter estimation, and posterior predictive inference for univariate regressions, multivariate regressions, and multivariate logistic regressions. Gaussian processes are general, and by necessity this chapter only touches on some basic models. For more information, see Rasmussen and Williams (2006).

在定义高斯过程后,本章将介绍单变量回归、多元回归和多元 Logistic 回归的模拟、超参数估计和后验预测推断的基本实现。高斯过程具有普遍性,因此本章仅涉及一些基本模型。更多信息请参阅 Rasmussen and Williams (2006)。

Note that fitting Gaussian processes as described below using exact inference by computing Cholesky of the covariance matrix scales cubically with the size of data. Due to how Stan autodiff is implemented, Stan is also slower than specialized Gaussian process software. It is likely that Gaussian processes using exact inference by computing Cholesky of the covariance matrix with \(N>1000\) are too slow for practical purposes in Stan. There are many approximations to speed up Gaussian process computation, of which the basis function approaches for 1-3 dimensional \(x\) are easiest to implement in Stan (see, e.g., Riutort-Mayol et al. (2023)).

需要注意的是,使用协方差矩阵的 Cholesky 分解来拟合高斯过程并进行精确推断时,计算复杂度随数据量呈立方增长。由于 Stan 自动微分的实现方式,其速度也慢于专门的高斯过程软件。对于数据量 \(N>1000\) 的情况,在 Stan 中使用精确推断可能因速度过慢而不实用。目前有许多加速高斯过程计算的近似方法,其中基函数方法在1-3维 \(x\) 的情况下最容易在 Stan 中实现(参见 Riutort-Mayol et al. (2023))。

10.1 Gaussian process regression

高斯过程回归

The data for a multivariate Gaussian process regression consists of a series of \(N\) inputs \(x_1,\dotsc,x_N \in \mathbb{R}^D\) paired with outputs \(y_1,\dotsc,y_N \in \mathbb{R}\). The defining feature of Gaussian processes is that the probability of a finite number of outputs \(y\) conditioned on their inputs \(x\) is Gaussian:

多元高斯过程回归的数据由 \(N\) 个输入值 \(x_1,\dotsc,x_N \in \mathbb{R}^D\) 和相应的输出值 \(y_1,\dotsc,y_N \in \mathbb{R}\) 组成。高斯过程的定义特征在于,给定一组输入值 \(x\),有限个输出值 \(y\) 的概率分布是高斯分布:

\[ y \sim \textsf{multivariate normal}(m(x), K(x \mid \theta)), \] where \(m(x)\) is an \(N\)-vector and \(K(x \mid \theta)\) is an \(N \times N\) covariance matrix. The mean function \(m : \mathbb{R}^{N \times D} \rightarrow \mathbb{R}^{N}\) can be anything, but the covariance function \(K : \mathbb{R}^{N \times D} \rightarrow \mathbb{R}^{N \times N}\) must produce a positive-definite matrix for any input \(x\).1

其中 \(m(x)\) 是一个 \(N\) 维向量,\(K(x \mid \theta)\) 是一个 \(N \times N\) 的协方差矩阵。均值函数 \(m : \mathbb{R}^{N \times D} \rightarrow \mathbb{R}^{N}\) 可以是任意的函数,但协方差函数 \(K : \mathbb{R}^{N \times D} \rightarrow \mathbb{R}^{N \times N}\) 必须对于任何输入 \(x\) 都产生正定矩阵。2

A popular covariance function, which will be used in the implementations later in this chapter, is an exponentiated quadratic function,

一种常用的协方差函数是指数二次函数,后文实现中将使用该函数:

\[ K(x \mid \alpha, \rho, \sigma)_{i, j} = \alpha^2 \exp \left( - \dfrac{1}{2 \rho^2} \sum_{d=1}^D (x_{i,d} - x_{j,d})^2 \right) + \delta_{i, j} \sigma^2, \] where \(\alpha\), \(\rho\), and \(\sigma\) are hyperparameters defining the covariance function and where \(\delta_{i, j}\) is the Kronecker delta function with value 1 if \(i = j\) and value 0 otherwise; this test is between the indexes \(i\) and \(j\), not between values \(x_i\) and \(x_j\). This kernel is obtained through a convolution of two independent Gaussian processes, \(f_1\) and \(f_2\), with kernels

其中 \(\alpha\)、\(\rho\) 和 \(\sigma\) 是定义协方差函数的超参数,\(\delta_{i,j}\) 是 Kronecker delta 函数,如果 \(i=j\) 则值为 1,否则值为 0(这一测试是针对索引 \(i\) 和 \(j\),而不是针对值 \(x_i\) 和 \(x_j\))。该核函数通过两个独立的高斯过程 \(f_1\) 和 \(f_2\) 进行卷积得到,其中 \(f_1\) 的核函数为

\[ K_1(x \mid \alpha, \rho)_{i, j} = \alpha^2 \exp \left( - \dfrac{1}{2 \rho^2} \sum_{d=1}^D (x_{i,d} - x_{j,d})^2 \right) \] and

\(f_2\) 的核函数为

\[ K_2(x \mid \sigma)_{i, j} = \delta_{i, j} \sigma^2. \]

The addition of \(\sigma^2\) on the diagonal is important to ensure the positive definiteness of the resulting matrix in the case of two identical inputs \(x_i = x_j\). In statistical terms, \(\sigma\) is the scale of the noise term in the regression.

在对角线上加入 \(\sigma^2\) 是为了确保在两个输入相同(即 \(x_i = x_j\))的情况下,所得到的矩阵仍是正定的。从统计学的角度来看,\(\sigma\) 是回归中噪声项的尺度参数。

The hyperparameter \(\rho\) is the length-scale, and corresponds to the frequency of the functions represented by the Gaussian process prior with respect to the domain. Values of \(\rho\) closer to zero lead the GP to represent high-frequency functions, whereas larger values of \(\rho\) lead to low-frequency functions. The hyperparameter \(\alpha\) is the marginal standard deviation. It controls the magnitude of the range of the function represented by the GP. If you were to take the standard deviation of many draws from the GP \(f_1\) prior at a single input \(x\) conditional on one value of \(\alpha\) one would recover \(\alpha\).

超参数 \(\rho\) 是长度尺度,对应于高斯过程先验所表示函数在定义域上的频率。\(\rho\) 值接近 0,则高斯过程表示高频函数,而 \(\rho\) 较大时则表示低频函数。超参数 \(\alpha\) 是边际标准差,它控制高斯过程所表示函数的幅度范围。在给定 \(\alpha\) 值的情况下,如果从高斯过程先验 \(f_1\) 中抽取大量样本,并在单个输入 \(x\) 处计算标准差,则结果就是 \(\alpha\)。

The only term in the squared exponential covariance function involving the inputs \(x_i\) and \(x_j\) is their vector difference, \(x_i - x_j\). This produces a process with stationary covariance in the sense that if an input vector \(x\) is translated by a vector \(\epsilon\) to \(x + \epsilon\), the covariance at any pair of outputs is unchanged, because \(K(x \mid \theta) = K(x + \epsilon \mid \theta)\).

指数二次协方差函数中唯一涉及输入 \(x_i\) 和 \(x_j\) 的项是它们的向量差,即 \(x_i - x_j\)。这产生了一个具有平稳协方差的过程,也就是说,如果将输入向量 \(x\) 沿着向量 \(\epsilon\) 平移到 \(x+\epsilon\),任意一对输出之间的协方差将保持不变,因为 \(K(x \mid \theta) = K(x + \epsilon \mid \theta)\)。

The summation involved is just the squared Euclidean distance between \(x_i\) and \(x_j\) (i.e., the squared \(L_2\) norm of their difference, \(x_i - x_j\)). This results in support for smooth functions in the process. The amount of variation in the function is controlled by the free hyperparameters \(\alpha\), \(\rho\), and \(\sigma\).

涉及的求和仅仅是 \(x_i\) 和 \(x_j\) 之间的欧氏距离的平方(即它们差值 \(x_i - x_j\) 的 \(L_2\) 范数的平方),这使得过程支持平滑函数。函数变化的大小由自由超参数 \(\alpha\)、\(\rho\) 和 \(\sigma\) 控制。

Changing the notion of distance from Euclidean to taxicab distance (i.e., an \(L_1\) norm) changes the support to functions which are continuous but not smooth.

将距离的概念从欧氏距离更改为曼哈顿距离(即 \(L_1\) 范数)会变成支持连续但不平滑的函数。
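As a rough sketch of what this would look like in Stan (the function name gp_exp_l1_cov and the exact scaling below are illustrative choices, not one of Stan's built-in covariance functions), the taxicab analogue of the kernel above could be coded as a user-defined function:

functions {
  // exponentiated kernel using an L1 (taxicab) distance in place of the
  // squared Euclidean distance; a sketch only, not a built-in Stan kernel
  matrix gp_exp_l1_cov(array[] real x, real alpha, real rho, real sigma) {
    int N = size(x);
    matrix[N, N] K;
    for (i in 1:N) {
      for (j in 1:N) {
        K[i, j] = square(alpha) * exp(-0.5 * abs(x[i] - x[j]) / square(rho));
      }
      K[i, i] += square(sigma);  // noise term on the diagonal
    }
    return K;
  }
}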

10.2 Simulating from a Gaussian process

从高斯过程中模拟

It is simplest to start with a Stan model that does nothing more than simulate draws of functions \(f\) from a Gaussian process. In practical terms, the model will draw values \(y_n = f(x_n)\) for finitely many input points \(x_n\).

最简单的方法是从一个 Stan 模型开始,该模型仅用于从高斯过程中模拟函数 \(f\) 的抽取。在实际应用中,该模型将针对有限数量的输入点 \(x_n\) 抽取值 \(y_n = f(x_n)\)

The Stan model defines the mean and covariance functions in a transformed data block and then samples outputs \(y\) in the model using a multivariate normal distribution. To make the model concrete, the squared exponential covariance function described in the previous section will be used with hyperparameters set to \(\alpha^2 = 1\), \(\rho^2 = 1\), and \(\sigma^2 = 0.1\), and the mean function \(m\) is defined to always return the zero vector, \(m(x) = \textbf{0}\). Consider the following implementation of a Gaussian process simulator.

Stan 模型在“transformed data”块中定义了均值和协方差函数,然后使用多元正态分布对输出 \(y\) 进行采样。为使模型具体化,将使用前面一节中描述的指数二次协方差函数,并将超参数设置为 \(\alpha^2 = 1\)、\(\rho^2 = 1\)、\(\sigma^2 = 0.1\)。均值函数 \(m\) 被定义为始终返回零向量,即 \(m(x) = \textbf{0}\)。考虑以下高斯过程模拟器的实现。

data {
  int<lower=1> N;
  array[N] real x;
}
transformed data {
  matrix[N, N] K;
  vector[N] mu = rep_vector(0, N);
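  // kernel hard-codes alpha^2 = 1 and rho^2 = 1; the 0.1 added to the diagonal is sigma^2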
  for (i in 1:(N - 1)) {
    K[i, i] = 1 + 0.1;
    for (j in (i + 1):N) {
      K[i, j] = exp(-0.5 * square(x[i] - x[j]));
      K[j, i] = K[i, j];
    }
  }
  K[N, N] = 1 + 0.1;
}
parameters {
  vector[N] y;
}
model {
  y ~ multi_normal(mu, K);
}

The above model can also be written more compactly using the specialized covariance function that implements the exponentiated quadratic kernel.

以上模型也可以使用实现指数二次核的专门协方差函数更简洁地编写。

data {
  int<lower=1> N;
  array[N] real x;
}
transformed data {
  matrix[N, N] K = gp_exp_quad_cov(x, 1.0, 1.0);
  vector[N] mu = rep_vector(0, N);
  for (n in 1:N) {
    K[n, n] = K[n, n] + 0.1;
  }
}
parameters {
  vector[N] y;
}
model {
  y ~ multi_normal(mu, K);
}

The input data are just the vector of inputs x and its size N. Such a model can be used with values of x evenly spaced over some interval in order to plot sample draws of functions from a Gaussian process.

输入数据仅是输入向量 x 及其长度 N。此类模型可以配合在某个区间上等间距取值的 x 使用,以便绘制从高斯过程中抽取的函数样本。
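If one would rather not pass the grid in as data, a hypothetical variant builds the evenly spaced inputs inside the program itself; the interval endpoint T below is an assumed extra data item, and the rest of the program is unchanged from the compact version above.

data {
  int<lower=2> N;
  real<lower=0> T;  // assumed: right end of the interval [0, T]
}
transformed data {
  array[N] real x;
  matrix[N, N] K;
  vector[N] mu = rep_vector(0, N);
  for (n in 1:N) {
    x[n] = T * (n - 1.0) / (N - 1.0);  // N evenly spaced points over [0, T]
  }
  K = gp_exp_quad_cov(x, 1.0, 1.0);
  for (n in 1:N) {
    K[n, n] = K[n, n] + 0.1;  // add sigma^2 to the diagonal
  }
}
// parameters and model blocks unchanged from the program above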

Multivariate inputs

多元输入

Only the input data needs to change in moving from a univariate model to a multivariate model.

从单变量模型转到多元模型时,只需要更改输入数据。

The only lines that change from the univariate model above are as follows.

与之前的单变量模型相比,唯一变化的部分如下所示。

data {
  int<lower=1> N;
  int<lower=1> D;
  array[N] vector[D] x;
}
transformed data {
  // ...
}

The data are now declared as an array of vectors instead of an array of scalars; the dimensionality D is also declared.

数据现在声明为向量的数组,而不是标量的数组;同时还声明了维度 D

In the remainder of the chapter, univariate models will be used for simplicity, but any of the models could be changed to multivariate in the same way as the simple sampling model. The only extra computational overhead from a multivariate model is in the distance calculation.

为简便起见,本章的后续部分将仅使用单变量模型,但任何模型都可以像简单采样模型一样改为多元模型。多元模型的额外计算开销仅体现在距离计算中。

Cholesky factored and transformed implementation

Cholesky 分解与变换实现

A more efficient implementation of the simulation model can be coded in Stan by relocating, rescaling and rotating an isotropic standard normal variate. Suppose \(\eta\) is an isotropic standard normal variate

通过平移、缩放和旋转各向同性标准正态变量,可在 Stan 中实现更高效的模拟模型。假设 \(\eta\) 是一个各向同性的标准正态变量

\[ \eta \sim \textsf{normal}(\textbf{0}, \textrm{I}), \] where \(\textbf{0}\) is an \(N\)-vector of 0 values and \(\textrm{I}\) is the \(N \times N\) identity matrix. Let \(L\) be the Cholesky decomposition of \(K(x \mid \theta)\), i.e., the lower-triangular matrix \(L\) such that \(LL^{\top} = K(x \mid \theta)\). Then the transformed variable \(\mu + L\eta\) has the intended target distribution,

其中 \(\textbf{0}\) 是 \(N\) 维零向量,\(\textrm{I}\) 是 \(N\times N\) 的单位矩阵。令 \(L\) 表示 \(K(x \mid \theta)\) 的 Cholesky 分解,即下三角矩阵 \(L\) 满足 \(LL^{\top} = K(x \mid \theta)\)。于是变换后的变量 \(\mu + L\eta\) 服从目标分布,

\[ \mu + L\eta \sim \textsf{multivariate normal}(\mu(x), K(x \mid \theta)). \]

This transform can be applied directly to Gaussian process simulation.

这种变换可以直接应用于高斯过程的模拟中。

This model has the same data declarations for N and x, and the same transformed data definitions of mu and K as the previous model, with the addition of a transformed data variable for the Cholesky decomposition. The parameters change to the raw parameters sampled from an isotropic standard normal, and the actual samples are defined as generated quantities.

这个模型对于 Nx 的数据声明,以及对于 muK 的“transformed data”定义与之前的模型相同,并增加了一个用于 Cholesky 分解的“transformed data”变量。参数更改为从各向同性标准正态分布中采样的原始参数,并且实际的样本被定义为生成量(即“generated quantities”)。

// ...
transformed data {
  matrix[N, N] L;
  // ...
  L = cholesky_decompose(K);
}
parameters {
  vector[N] eta;
}
model {
  eta ~ std_normal();
}
generated quantities {
  vector[N] y;
  y = mu + L * eta;
}

The Cholesky decomposition is only computed once, after the data are loaded and the covariance matrix K computed. The isotropic normal distribution for eta is specified as a vectorized univariate distribution for efficiency; this specifies that each eta[n] has an independent standard normal distribution. The sampled vector y is then defined as a generated quantity using a direct encoding of the transform described above.

Cholesky 分解仅在加载数据并计算协方差矩阵 K 后计算一次。各向同性正态分布的 eta 被指定为向量化的单变量分布以提高效率;这表明每个 eta[n] 都具有独立的标准正态分布。采样向量 y 作为生成量通过上述变换直接定义。

10.3 Fitting a Gaussian process

拟合高斯过程

GP with a normal outcome

带有正态输出的高斯过程

The full generative model for a GP with a normal outcome, \(y \in \mathbb{R}^N\), with inputs \(x \in \mathbb{R}^N\), for a finite \(N\):

带有正态输出 \(y \in \mathbb{R}^N\),输入为 \(x \in \mathbb{R}^N\) 的完整生成模型如下,其中 \(N\) 表示一个有限的数据量:

\[\begin{align*} \rho &\sim \textsf{InvGamma}(5, 5) \\ \alpha &\sim \textsf{normal}(0, 1) \\ \sigma &\sim \textsf{normal}(0, 1) \\ f &\sim \textsf{multivariate normal}\left(0, K(x \mid \alpha, \rho)\right) \\ y_i &\sim \textsf{normal}(f_i, \sigma) \, \forall i \in \{1, \dots, N\} \end{align*}\] With a normal outcome, it is possible to integrate out the Gaussian process \(f\), yielding the more parsimonious model:

对于正态输出,可以将高斯过程 \(f\) 积分掉,得到更简洁的模型:

\[\begin{align*} \rho &\sim \textsf{InvGamma}(5, 5) \\ \alpha &\sim \textsf{normal}(0, 1) \\ \sigma &\sim \textsf{normal}(0, 1) \\ y &\sim \textsf{multivariate normal} \left(0, K(x \mid \alpha, \rho) + \textbf{I}_N \sigma^2\right) \\ \end{align*}\]

It can be more computationally efficient when dealing with a normal outcome to integrate out the Gaussian process, because this yields a lower-dimensional parameter space over which to do inference. We’ll fit both models in Stan. The former model will be referred to as the latent variable GP, while the latter will be called the marginal likelihood GP.

在处理正态输出时,通过积分掉高斯过程可以提高计算效率,因为这会产生一个维度更低的参数空间来进行推断。我们将在 Stan 中拟合这两个模型。前者称为潜变量高斯过程,后者称为边际似然高斯过程。

The hyperparameters controlling the covariance function of a Gaussian process can be fit by assigning them priors, like we have in the generative models above, and then computing the posterior distribution of the hyperparameters given observed data. The priors on the parameters should be defined based on prior knowledge of the scale of the output values (\(\alpha\)), the scale of the output noise (\(\sigma\)), and the scale at which distances are measured among inputs (\(\rho\)). See the Gaussian process priors section for more information about how to specify appropriate priors for the hyperparameters.

控制高斯过程协方差函数的超参数可以通过设置先验来进行拟合,就像我们在上面的生成模型中所做的那样,然后计算给定观测数据下超参数的后验分布。参数的先验应该根据输出值的尺度(\(\alpha\))、输出噪声的尺度(\(\sigma\))以及输入之间距离的测量尺度(\(\rho\))的先验知识来定义。更多关于超参数先验设置的信息,请参阅高斯过程参数的先验分布一节。

The Stan program implementing the marginal likelihood GP is shown below. The program is similar to the Stan programs that implement the simulation GPs above, but because we are doing inference on the hyperparameters, we need to calculate the covariance matrix K in the model block, rather than the transformed data block.

下面是实现边际似然高斯过程的 Stan 程序。该程序类似于上面实现模拟高斯过程的 Stan 程序,但由于我们将对超参数进行推断,因此需要在“model”块中计算协方差矩阵 K,而不是在“transformed data”块中进行计算。

data {
  int<lower=1> N;
  array[N] real x;
  vector[N] y;
}
transformed data {
  vector[N] mu = rep_vector(0, N);
}
parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
}
model {
  matrix[N, N] L_K;
  matrix[N, N] K = gp_exp_quad_cov(x, alpha, rho);
  real sq_sigma = square(sigma);

  // diagonal elements
  for (n in 1:N) {
    K[n, n] = K[n, n] + sq_sigma;
  }

  L_K = cholesky_decompose(K);

  rho ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  sigma ~ std_normal();

  y ~ multi_normal_cholesky(mu, L_K);
}

The data block declares a vector y of observed values y[n] for inputs x[n]. The transformed data block now only defines the mean vector to be zero. The three hyperparameters are defined as parameters constrained to be non-negative. The computation of the covariance matrix K is now in the model block because it involves unknown parameters and thus can’t simply be precomputed as transformed data. The rest of the model consists of the priors for the hyperparameters and the multivariate Cholesky-parameterized normal distribution, only now the value y is known and the covariance matrix K is an unknown dependent on the hyperparameters, allowing us to learn the hyperparameters.

“data”块声明了观测值向量 y,其中 y[n] 对应输入 x[n]。“transformed data”块现在仅将均值向量定义为零。三个超参数被定义为非负的参数。协方差矩阵 K 的计算现在在“model”块中进行,因为它涉及未知参数,因此不能像“transformed data”那样预先计算。模型的其余部分包括超参数的先验分布和基于 Cholesky 参数化的多元正态分布,只是现在 y 是已知观测值,而协方差矩阵 K 依赖于超参数,从而允许我们学习超参数。

We have used the Cholesky parameterized multivariate normal rather than the standard parameterization because it allows us to use the cholesky_decompose function which has been optimized for both small and large matrices. When working with small matrices the differences in computational speed between the two approaches will not be noticeable, but for larger matrices (\(N \gtrsim 100\)) the Cholesky decomposition version will be faster.

我们使用了基于 Cholesky 分解的多元正态分布而不是标准参数化形式,是因为 cholesky_decompose 函数针对大小矩阵都进行了优化。当处理小矩阵时,两种方法之间的计算速度差异不会很明显,但对于较大的矩阵(\(N \gtrsim 100\)),Cholesky 分解版本将更快。
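Concretely, the two statements below define the same density; this is only a sketch, assuming mu, K, and y as declared in the program above, and the Cholesky form is the one used there.

model {
  // equivalent distribution statements; use one or the other, not both
  // y ~ multi_normal(mu, K);
  y ~ multi_normal_cholesky(mu, cholesky_decompose(K));
}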

Hamiltonian Monte Carlo sampling is fast and effective for hyperparameter inference in this model (Neal 1997). If the posterior is well-concentrated for the hyperparameters the Stan implementation will fit hyperparameters in models with a few hundred data points in seconds.

在这个模型中,Hamiltonian Monte Carlo 采样对于超参数推断是快速且有效的 (Neal 1997)。如果超参数的后验分布较为集中,则 Stan 实现可以在几秒钟内完成包含数百个数据点的模型的超参数拟合。

Latent variable GP

潜变量高斯过程

We can also explicitly code the latent variable formulation of a GP in Stan. This will be useful for when the outcome is not normal. We’ll need to add a small positive term, \(\delta\) to the diagonal of the covariance matrix in order to ensure that our covariance matrix remains positive definite.

我们也可以在 Stan 中显式编写高斯过程的潜变量形式,这在处理非正态数据时非常有用。我们需要在协方差矩阵的对角线上添加一个小的正数项 \(\delta\),以确保我们的协方差矩阵始终是正定的。

data {
  int<lower=1> N;
  array[N] real x;
  vector[N] y;
}
transformed data {
  real delta = 1e-9;
}
parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
  vector[N] eta;
}
model {
  vector[N] f;
  {
    matrix[N, N] L_K;
    matrix[N, N] K = gp_exp_quad_cov(x, alpha, rho);

    // diagonal elements
    for (n in 1:N) {
      K[n, n] = K[n, n] + delta;
    }

    L_K = cholesky_decompose(K);
    f = L_K * eta;
  }

  rho ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  sigma ~ std_normal();
  eta ~ std_normal();

  y ~ normal(f, sigma);
}

Two differences between the latent variable GP and the marginal likelihood GP are worth noting. The first is that we have augmented our parameter block with a new parameter vector of length \(N\) called eta. This is used in the model block to generate a multivariate normal vector called \(f\), corresponding to the latent GP. We put a \(\textsf{normal}(0,1)\) prior on eta like we did in the Cholesky-parameterized GP in the simulation section. The second difference is that although we could code the distribution statement for \(y\) with one \(N\)-dimensional multivariate normal with an identity covariance matrix multiplied by \(\sigma^2\), we instead use a vectorized univariate normal distribution, which is equivalent but more efficient.

潜变量高斯过程和边际似然高斯过程之间有两个差异值得注意。首先,“parameter”块中增加了一个长度为 \(N\) 的新参数向量 eta,它用于在“model”块中生成对应于潜在高斯过程的多元正态向量 \(f\)。我们对 eta 使用 \(\textsf{normal}(0,1)\) 先验,就像在模拟章节中对基于 Cholesky 分解的高斯过程所做的那样。第二个区别是,虽然我们可以利用一个 \(N\) 维多元正态分布来编码 \(y\) 的分布,其中协方差是单位阵乘以 \(\sigma^2\),但现在我们使用向量化的一元正态分布,这两种形式等价,但后者更高效。
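As a sketch of the second point (using f, sigma, and N from the program above), the commented-out multivariate form and the vectorized univariate form below define the same observation model; the program uses the second:

model {
  // equivalent observation models; the diagonal multivariate form would
  // still pay the cost of a full matrix factorization each evaluation
  // y ~ multi_normal(f, diag_matrix(rep_vector(square(sigma), N)));
  y ~ normal(f, sigma);  // vectorized univariate form
}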

Discrete outcomes with Gaussian processes

使用高斯过程处理离散输出

Gaussian processes can be generalized the same way as standard linear models by introducing a link function. This allows them to be used as discrete data models.

高斯过程可以通过引入链接函数的方式与标准线性模型一样进行推广。这使得它们可以用作离散数据的模型。

Poisson GP

泊松高斯过程

If we want to model count data, we can remove the \(\sigma\) parameter, and use poisson_log, which implements a Poisson distribution with log link function, rather than normal. We can also add an overall mean parameter, \(a\), which will account for the marginal expected value for \(y\). We do this because we cannot center count data like we would for normally distributed data.

如果我们想要建模计数数据,可以移除 \(\sigma\) 参数,并使用 poisson_log 函数而非 normal 函数。该函数实现了一个对数链接的泊松分布。我们还可以添加一个总体均值参数 \(a\),它将考虑到 \(y\) 的边际期望值。这是因为我们不能像对于正态分布数据一样对计数数据进行中心化。

data {
  // ...
  array[N] int<lower=0> y;
  // ...
}
// ...
parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real a;
  vector[N] eta;
}
model {
  // ...
  rho ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  a ~ std_normal();
  eta ~ std_normal();

  y ~ poisson_log(a + f);
}

Logistic Gaussian process regression

Logistic 高斯过程回归模型

For binary classification problems, the observed outputs \(z_n \in \{ 0,1 \}\) are binary. These outputs are modeled using a Gaussian process with (unobserved) outputs \(y_n\) through the logistic link,

对于二元分类问题,观测输出 \(z_n \in \{ 0,1 \}\) 是二元取值的。通过 Logistic 链接函数,这些输出可以使用高斯过程进行建模,其中(未观测到的)输出为 \(y_n\),具体如下所示:

\[ z_n \sim \textsf{Bernoulli}(\operatorname{logit}^{-1}(y_n)), \] or in other words,

换言之,

\[ \Pr[z_n = 1] = \operatorname{logit}^{-1}(y_n). \]

We can extend our latent variable GP Stan program to deal with classification problems. Below a is the bias term, which can help account for imbalanced classes in the training data:

我们可以扩展潜变量高斯过程的 Stan 程序以处理分类问题。下面的 a 是偏置项,它可以帮助解决训练数据中类别不平衡的问题:

data {
  // ...
  array[N] int<lower=0, upper=1> z;
  // ...
}
// ...
model {
  // ...
  z ~ bernoulli_logit(a + f);
}

Automatic relevance determination

自动相关性确定

If we have multivariate inputs \(x \in \mathbb{R}^D\), the squared exponential covariance function can be further generalized by fitting a scale parameter \(\rho_d\) for each dimension \(d\),

如果我们有多元输入 \(x \in \mathbb{R}^D\),则可以通过为每个维度 \(d\) 拟合一个尺度参数 \(\rho_d\) 来进一步推广平方指数协方差函数,

\[ k(x \mid \alpha, \vec{\rho}, \sigma)_{i, j} = \alpha^2 \exp \left(-\dfrac{1}{2} \sum_{d=1}^D \dfrac{1}{\rho_d^2} (x_{i,d} - x_{j,d})^2 \right) + \delta_{i, j}\sigma^2. \] The estimation of \(\rho\) was termed “automatic relevance determination” by Neal (1996), but this is misleading, because the magnitude of the scale of the posterior for each \(\rho_d\) is dependent on the scaling of the input data along dimension \(d\). Moreover, the scale of the parameters \(\rho_d\) measures non-linearity along the \(d\)-th dimension, rather than “relevance” (Piironen and Vehtari 2016).

Neal (1996) 将 \(\rho\) 的估计称为“自动相关性确定”,但这一说法具有误导性,因为每个 \(\rho_d\) 的后验分布尺度大小取决于输入数据在第 \(d\) 维上的缩放比例。此外,参数 \(\rho_d\) 的尺度衡量的是第 \(d\) 维上的非线性程度,而不是“相关性” (Piironen and Vehtari 2016)。

A priori, the closer \(\rho_d\) is to zero, the more nonlinear the conditional mean in dimension \(d\) is. A posteriori, the actual dependencies between \(x\) and \(y\) play a role. With one covariate \(x_1\) having a linear effect and another covariate \(x_2\) having a nonlinear effect, it is possible that \(\rho_1 > \rho_2\) even if the predictive relevance of \(x_1\) is higher (Rasmussen and Williams 2006, 80). The collection of \(\rho_d\) (or \(1/\rho_d\)) parameters can also be modeled hierarchically.

从先验上讲,\(\rho_d\) 越接近于零,第 \(d\) 维的条件均值的非线性程度就越高。从后验上看,\(x\)\(y\) 之间的实际依赖关系也起到了作用。如果一个协变量 \(x_1\) 具有线性效应,而另一个协变量 \(x_2\) 具有非线性效应,则即使 \(x_1\) 的预测相关性更高,也可能出现 \(\rho_1 > \rho_2\) 的情况 (Rasmussen and Williams 2006,第80页)\(\rho_d\)(或 \(1/\rho_d\))参数的集合也可以按分层方式进行建模。

The implementation of automatic relevance determination is a straightforward extension of the one-dimensional case by modifying rho to be an array.

自动相关性确定的实现是对一维情况的直接扩展,需将 rho 修改为数组。

data {
  int<lower=1> N;
  int<lower=1> D;
  array[N] vector[D] x;
  vector[N] y;
}
transformed data {
  real delta = 1e-9;
}
parameters {
  array[D] real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
  vector[N] eta;
}
model {
  vector[N] f;
  {
    matrix[N, N] L_K;
    matrix[N, N] K = gp_exp_quad_cov(x, alpha, rho);
    for (n in 1:N) {
      K[n, n] = K[n, n] + delta;
    }
    L_K = cholesky_decompose(K);
    f = L_K * eta;
  }

  rho ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  sigma ~ std_normal();
  eta ~ std_normal();

  y ~ normal(f, sigma);
}
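The hierarchical treatment of the length-scales mentioned above could be sketched, for example, by giving the rho[d] a shared scale; the parameter name rho_loc and this particular inverse gamma structure are illustrative assumptions rather than a recommendation.

parameters {
  array[D] real<lower=0> rho;
  real<lower=0> rho_loc;  // assumed: shared scale for the length-scales
  // ... alpha, sigma, eta as in the program above
}
model {
  rho_loc ~ inv_gamma(5, 5);
  rho ~ inv_gamma(5, 5 * rho_loc);  // each rho[d] shrinks toward a common scale
  // ... remainder as in the program above
}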

Priors for Gaussian process parameters

高斯过程参数的先验分布

Formulating priors for GP hyperparameters requires the analyst to consider the inherent statistical properties of a GP, the GP’s purpose in the model, and the numerical issues that may arise in Stan when estimating a GP.

为高斯过程超参数设置先验时,需要分析者考虑高斯过程固有的统计特性、高斯过程在模型中的作用,以及 Stan 估计高斯过程时可能出现的数值问题。

Perhaps most importantly, the parameters \(\rho\) and \(\alpha\) are weakly identified (Zhang 2004). The ratio of the two parameters is well-identified, but in practice we put independent priors on the two hyperparameters because these two quantities are more interpretable than their ratio.

或许最重要的一点是,参数 \(\rho\) 和 \(\alpha\) 是弱可识别的 (Zhang 2004)。两者的比值可以被较好地识别,但在实践中,我们对这两个超参数分别设置独立先验,因为它们各自的实际意义比其比值更明确。

Priors for length-scale

长度尺度的先验

GPs are a flexible class of priors and, as such, can represent a wide spectrum of functions. For length scales below the minimum spacing of the covariates the GP likelihood plateaus. Unless regularized by a prior, this flat likelihood induces considerable posterior mass at small length scales where the observation variance drops to zero and the functions supported by the GP begin to exactly interpolate between the input data. The resulting posterior not only significantly overfits to the input data, it also becomes hard to accurately sample using Euclidean HMC.

高斯过程是一类灵活的先验分布,因此能够表示广泛的函数形式。当长度尺度小于协变量的最小间距时,高斯过程的似然函数趋于平坦。如果没有先验的正则化,这种平坦的似然会使相当大的后验质量集中在很小的长度尺度上,此时观测方差趋近于零,高斯过程所支持的函数开始在输入数据点之间进行精确插值。由此产生的后验分布不仅会严重过拟合输入数据,而且难以用欧氏 HMC 进行准确采样。

We may wish to put further soft constraints on the length-scale, but these are dependent on how the GP is used in our statistical model.

我们可能还希望对长度尺度施加进一步的软约束,但这些约束取决于高斯过程在统计模型中的用途。

If our model consists of only the GP, i.e.:

如果模型只包含了高斯过程,换言之:

\[\begin{align*} f &\sim \textsf{multivariate normal}\left(0, K(x \mid \alpha, \rho)\right) \\ y_i &\sim \textsf{normal}(f_i, \sigma) \, \forall i \in \{1, \dots, N\} \\ & x \in \mathbb{R}^{N \times D}, \quad f \in \mathbb{R}^N \end{align*}\]

we likely don’t need constraints beyond penalizing small length-scales. We’d like to allow the GP prior to represent both high-frequency and low-frequency functions, so our prior should put non-negligible mass on both sets of functions. In this case, an inverse gamma, inv_gamma_lpdf in Stan’s language, will work well as it has a sharp left tail that puts negligible mass on infinitesimal length-scales, but a generous right tail, allowing for large length-scales. Inverse gamma priors will avoid infinitesimal length-scales because the density is zero at zero, so the posterior for length-scale will be pushed away from zero. An inverse gamma distribution is one of many zero-avoiding or boundary-avoiding distributions.3

则除了对小长度尺度进行惩罚外,可能不需要额外的约束。我们希望高斯过程先验既能表示高频函数,也能表示低频函数,因此先验应当在这两类函数上都分配不可忽略的概率质量。在这种情况下,逆伽马分布(即 Stan 语言中的 inv_gamma_lpdf)是一个很好的选择,因为它具有尖锐的左尾,在极小的长度尺度上的质量可以忽略不计,同时它还有较厚的右尾,可以容纳较大的长度尺度。逆伽马先验可以避免极小的长度尺度,因为它在零点处的密度为零,从而长度尺度的后验会被推离零点。逆伽马分布是众多避免零点或避免边界的分布之一。4
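For reference, the inverse gamma density with shape \(a\) and scale \(b\) is

\[ \textsf{InvGamma}(\rho \mid a, b) = \frac{b^{a}}{\Gamma(a)} \, \rho^{-(a + 1)} \exp\left(-\frac{b}{\rho}\right), \]

so the factor \(\exp(-b/\rho)\) drives the density to zero as \(\rho \rightarrow 0^{+}\), while the right tail decays only polynomially as \(\rho^{-(a+1)}\); this is the sharp left tail and generous right tail described above.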

If we’re using the GP as a component in a larger model that includes an overall mean and fixed effects for the same variables we’re using as the domain for the GP, i.e.:

如果高斯过程作为一个更大模型的组成部分,且模型包含整体均值项和固定效应(与高斯过程的输入变量相同):

\[\begin{align*} f &\sim \textsf{multivariate normal}\big(0, K(x \mid \alpha, \rho)\big) \\ y_i &\sim \textsf{normal}\left(\beta_0 + x_i \beta_{[1:D]} + f_i, \sigma\right) \, \forall i \in \{1, \dots, N\} \\ & x_i^T, \beta_{[1:D]} \in \mathbb{R}^D,\quad x \in \mathbb{R}^{N \times D},\quad f \in \mathbb{R}^N \end{align*}\]

we’ll likely want to constrain large length-scales as well. A length scale that is larger than the scale of the data yields a GP posterior that is practically linear (with respect to the particular covariate) and increasing the length scale has little impact on the likelihood. This will introduce nonidentifiability in our model, as both the fixed effects and the GP will explain similar variation. In order to limit the amount of overlap between the GP and the linear regression, we should use a prior with a sharper right tail to limit the GP to higher-frequency functions. We can use a generalized inverse Gaussian distribution:

则可能还需要限制较大的长度尺度。当长度尺度大于数据尺度时,高斯过程的后验近似为线性(相对于特定的协变量),并且增加长度尺度对似然的影响很小。这将导致固定效应和高斯过程解释相似的变异,从而造成不可识别性。为了限制高斯过程和线性回归的重叠程度,我们应该使用具有更尖锐右尾的先验,从而将高斯过程限制在更高频率的函数上。我们可以使用广义逆高斯分布:

\[\begin{align*} f(x \mid a, b, p) &= \dfrac{\left(a/b\right)^{p/2}}{2K_p\left(\sqrt{ab}\right)} x^{p - 1}\exp\big(-(ax + b / x)/2\big) \\ & x, a, b \in \mathbb{R}^{+},\quad p \in \mathbb{Z} \end{align*}\]

which has an inverse gamma left tail if \(p \leq 0\) and an inverse Gaussian right tail. This has not yet been implemented in Stan’s math library, but it is possible to implement as a user defined function:

\(p \leq 0\) 的情况下,其左尾为逆伽马分布,右尾为逆高斯分布。目前 Stan 的数学库还未实现该分布,但是我们可以通过用户自定义函数去实现它:

functions {
  real generalized_inverse_gaussian_lpdf(real x, int p,
                                        real a, real b) {
    return p * 0.5 * log(a / b)
      - log(2 * modified_bessel_second_kind(p, sqrt(a * b)))
      + (p - 1) * log(x)
      - (a * x + b / x) * 0.5;
 }
}
data {
  // ...
}
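Once defined, the user-defined density can be used in a distribution statement like any built-in density. In the sketch below the values of p, a, and b are purely illustrative placeholders:

parameters {
  real<lower=0> rho;
  // ...
}
model {
  // illustrative hyperparameters only; p <= 0 keeps the inverse-gamma-like left tail
  rho ~ generalized_inverse_gaussian(0, 2.0, 2.0);
  // ...
}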

If we have high-frequency covariates in our fixed effects, we may wish to further regularize the GP away from high-frequency functions, which means we’ll need to penalize smaller length-scales. Luckily, we have a useful way of thinking about how length-scale affects the frequency of the functions supported by the GP. If we were to repeatedly draw from a zero-mean GP with a length-scale of \(\rho\) in a fixed-domain \([0,T]\), we would get a distribution for the number of times each draw of the GP crossed the zero axis. The expectation of this random variable, the number of zero crossings, is \(T / \pi \rho\). You can see that as \(\rho\) decreases, the expectation of the number of upcrossings increases as the GP is representing higher-frequency functions. Thus, this is a good statistic to keep in mind when setting a lower-bound for our prior on length-scale in the presence of high-frequency covariates. However, this statistic is only valid for one-dimensional inputs.

如果固定效应中包含高频的协变量,我们可能希望进一步将高斯过程正则化,使其远离高频函数,这意味着我们需要对较小的长度尺度进行惩罚。幸运的是,我们有一种有效的方法来考察长度尺度如何影响高斯过程所表示函数的频率。假设我们在固定的定义域 \([0,T]\) 上,从长度尺度为 \(\rho\) 的零均值高斯过程中重复采样,则每次采样得到的函数穿过零轴的次数服从某个分布。该随机变量(穿零次数)的期望为 \(T / (\pi \rho)\)。可以看到,随着 \(\rho\) 减小,穿零次数的期望不断增大,因为此时高斯过程表示的是更高频的函数。因此,当固定效应中存在高频协变量时,此统计量可为长度尺度先验的下界设置提供参考。但是,此统计量仅对一维输入有效。
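As a quick numerical illustration (the values of \(T\) and \(\rho\) are arbitrary): on a domain of width \(T = 10\), a length-scale of \(\rho = 2\) gives

\[ \frac{T}{\pi \rho} = \frac{10}{2 \pi} \approx 1.6 \]

expected zero crossings per draw, while shrinking the length-scale to \(\rho = 0.5\) raises this to \(10 / (0.5 \pi) \approx 6.4\), reflecting the higher-frequency functions supported at smaller \(\rho\).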

Priors for marginal standard deviation

边际标准差的先验

The parameter \(\alpha\) corresponds to how much of the variation is explained by the regression function and has a similar role to the prior variance for linear model weights. This means the prior can be the same as used in linear models, such as a half-\(t\) prior on \(\alpha\).

参数 \(\alpha\) 对应于回归函数解释的变异程度,其作用类似于线性模型权重的先验方差。这意味着可以采用与线性模型类似的先验,例如对 \(\alpha\) 使用半 \(t\) 先验。

A half-\(t\) or half-Gaussian prior on alpha also has the benefit of putting nontrivial prior mass around zero. This allows the GP to support the zero function, and allows the possibility that the GP won’t contribute to the conditional mean of the total output.

\(t\) 先验或半高斯先验的优点是在零点附近分配了非零先验质量,这使得高斯过程可以支持零函数,并允许高斯过程不对总输出的条件均值做出贡献。

Predictive inference with a Gaussian process

使用高斯过程进行预测推断

Suppose for a given sequence of inputs \(x\) that the corresponding outputs \(y\) are observed. Given a new sequence of inputs \(\tilde{x}\), the posterior predictive distribution of their labels is computed by sampling outputs \(\tilde{y}\) according to

假设对于给定的输入序列 \(x\),观测到相应的输出 \(y\)。对于一个新的输入序列 \(\tilde{x}\),其标签的后验预测分布通过采样输出 \(\tilde{y}\) 来计算,公式如下:

\[ p\left(\tilde{y} \mid \tilde{x},x,y\right) \ = \ \frac{p\left(\tilde{y}, y \mid \tilde{x},x\right)} {p(y \mid x)} \ \propto \ p\left(\tilde{y}, y \mid \tilde{x},x\right). \]

A direct implementation in Stan defines a model in terms of the joint distribution of the observed \(y\) and unobserved \(\tilde{y}\).

在 Stan 中的直接实现是通过定义观察到的 \(y\) 和未观察到的 \(\tilde{y}\) 的联合分布模型来完成的。

data {
  int<lower=1> N1;
  array[N1] real x1;
  vector[N1] y1;
  int<lower=1> N2;
  array[N2] real x2;
}
transformed data {
  real delta = 1e-9;
  int<lower=1> N = N1 + N2;
  array[N] real x;
  for (n1 in 1:N1) {
    x[n1] = x1[n1];
  }
  for (n2 in 1:N2) {
    x[N1 + n2] = x2[n2];
  }
}
parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
  vector[N] eta;
}
transformed parameters {
  vector[N] f;
  {
    matrix[N, N] L_K;
    matrix[N, N] K = gp_exp_quad_cov(x, alpha, rho);

    // diagonal elements
    for (n in 1:N) {
      K[n, n] = K[n, n] + delta;
    }

    L_K = cholesky_decompose(K);
    f = L_K * eta;
  }
}
model {
  rho ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  sigma ~ std_normal();
  eta ~ std_normal();

  y1 ~ normal(f[1:N1], sigma);
}
generated quantities {
  vector[N2] y2;
  for (n2 in 1:N2) {
    y2[n2] = normal_rng(f[N1 + n2], sigma);
  }
}

The input vectors x1 and x2 are declared as data, as is the observed output vector y1. The unknown output vector y2, which corresponds to input vector x2, is declared in the generated quantities block and will be sampled when the model is executed.

输入向量 x1x2 被声明为数据,观测到的输出向量 y1 同理。未知的输出向量 y2 对应于输入向量 x2,被声明在“generated quantities”块中,并将在模型执行时被采样。

A transformed data block is used to combine the input vectors x1 and x2 into a single vector x.

“transformed data”块用于将输入向量 x1x2 合并成一个单一的向量 x

The transformed parameters block declares and defines the combined latent vector f, which contains the values of the latent Gaussian process at both the observed inputs x1 and the prediction inputs x2. Thus the combined vector f is aligned with the combined input vector x. All that is left is the univariate normal distribution statement for y1 in the model block.

“transformed parameters”块声明并定义了合并后的潜变量向量 f,它包含潜在高斯过程在观测输入 x1 和预测输入 x2 处的取值。因此,合并后的向量 f 与合并后的输入向量 x 对齐。剩下的就是在“model”块中为 y1 写出单变量正态分布语句。

The generated quantities block defines the quantity y2. We generate y2 by randomly generating N2 values from univariate normals with each mean corresponding to the appropriate element in f.

“generated quantities”块定义了量 y2。我们通过从单变量正态分布中采样 N2 个值来生成 y2,其中每个均值对应于 f 中的相应元素。

Predictive inference in non-Gaussian GPs

非高斯输出高斯过程的预测推断

We can do predictive inference in non-Gaussian GPs in much the same way as we do with Gaussian GPs.

对于具有非高斯输出的高斯过程,其预测推断的方法与高斯输出的情形大体相同。

Consider the following full model for prediction using logistic Gaussian process regression.

以下是一个使用 Logistic 高斯过程回归进行预测的完整模型:

data {
  int<lower=1> N1;
  array[N1] real x1;
  array[N1] int<lower=0, upper=1> z1;
  int<lower=1> N2;
  array[N2] real x2;
}
transformed data {
  real delta = 1e-9;
  int<lower=1> N = N1 + N2;
  array[N] real x;
  for (n1 in 1:N1) {
    x[n1] = x1[n1];
  }
  for (n2 in 1:N2) {
    x[N1 + n2] = x2[n2];
  }
}
parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real a;
  vector[N] eta;
}
transformed parameters {
  vector[N] f;
  {
    matrix[N, N] L_K;
    matrix[N, N] K = gp_exp_quad_cov(x, alpha, rho);

    // diagonal elements
    for (n in 1:N) {
      K[n, n] = K[n, n] + delta;
    }

    L_K = cholesky_decompose(K);
    f = L_K * eta;
  }
}
model {
  rho ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  a ~ std_normal();
  eta ~ std_normal();

  z1 ~ bernoulli_logit(a + f[1:N1]);
}
generated quantities {
  array[N2] int z2;
  for (n2 in 1:N2) {
    z2[n2] = bernoulli_logit_rng(a + f[N1 + n2]);
  }
}

Analytical form of joint predictive inference

联合预测推断的解析形式

Bayesian predictive inference for Gaussian processes with Gaussian observations can be sped up by deriving the posterior analytically, then directly sampling from it.

对于具有高斯观测的高斯过程,可以通过解析推导后验分布来加速贝叶斯预测推断,并直接从中采样。

Jumping straight to the result,

直接给出结果,

\[ p\left(\tilde{y} \mid \tilde{x},y,x\right) = \textsf{normal}\left(K^{\top}\Sigma^{-1}y,\ \Omega - K^{\top}\Sigma^{-1}K\right), \]

where \(\Sigma = K(x \mid \alpha, \rho, \sigma)\) is the result of applying the covariance function to the inputs \(x\) with observed outputs \(y\), \(\Omega = K(\tilde{x} \mid \alpha, \rho)\) is the result of applying the covariance function to the inputs \(\tilde{x}\) for which predictions are to be inferred, and \(K\) is the matrix of covariances between inputs \(x\) and \(\tilde{x}\), which in the case of the exponentiated quadratic covariance function would be

其中 \(\Sigma = K(x \mid \alpha, \rho, \sigma)\) 是将协方差函数作用在输入 \(x\) 上得到的结果,该输入对应观测输出值 \(y\);\(\Omega = K(\tilde{x} \mid \alpha, \rho)\) 是将协方差函数作用在待预测输入 \(\tilde{x}\) 上的结果;\(K\) 是输入 \(x\) 与 \(\tilde{x}\) 之间的协方差矩阵。如果使用指数二次协方差函数,其具体形式为:

\[ K(x, \tilde{x} \mid \alpha, \rho)_{i, j} = \alpha^2 \exp\left(-\dfrac{1}{2 \rho^2} \sum_{d=1}^D \left(x_{i,d} - \tilde{x}_{j,d}\right)^2\right). \]

There is no noise term including \(\sigma^2\) because the indexes of elements in \(x\) and \(\tilde{x}\) are never the same.

此处不包含噪声项 \(\sigma^2\),因为 \(x\) 与 \(\tilde{x}\) 中的元素索引永远不会相同。
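For reference, this is the standard conditioning identity for the multivariate normal applied to the joint distribution of the observed outputs and the latent values at the new inputs (assuming the zero mean function used throughout this chapter),

\[ \begin{pmatrix} y \\ \tilde{y} \end{pmatrix} \sim \textsf{multivariate normal} \left( \textbf{0}, \begin{pmatrix} \Sigma & K \\ K^{\top} & \Omega \end{pmatrix} \right), \]

and conditioning on \(y\) yields the mean \(K^{\top}\Sigma^{-1}y\) and covariance \(\Omega - K^{\top}\Sigma^{-1}K\) stated above.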

This Stan code below uses the analytic form of the posterior and provides sampling of the resulting multivariate normal through the Cholesky decomposition. The data declaration is the same as for the latent variable example, but we’ve defined a function called gp_pred_rng which will generate a draw from the posterior predictive mean conditioned on observed data y1. The code uses a Cholesky decomposition in triangular solves in order to cut down on the number of matrix-matrix multiplications when computing the conditional mean and the conditional covariance of \(p(\tilde{y})\).

以下 Stan 代码使用了后验分布的解析形式,并通过 Cholesky 分解对得到的多元正态分布进行采样。“data”块的声明与前面潜变量的例子相同,但这里定义了一个名为 gp_pred_rng 的函数,它可以在给定观测数据 y1 的条件下,从后验预测分布中生成样本。代码在三角求解中使用 Cholesky 分解,以减少计算 \(p(\tilde{y})\) 的条件均值和条件协方差时所需的矩阵乘法次数。

functions {
  vector gp_pred_rng(array[] real x2,
                     vector y1,
                     array[] real x1,
                     real alpha,
                     real rho,
                     real sigma,
                     real delta) {
    int N1 = rows(y1);
    int N2 = size(x2);
    vector[N2] f2;
    {
      matrix[N1, N1] L_K;
      vector[N1] K_div_y1;
      matrix[N1, N2] k_x1_x2;
      matrix[N1, N2] v_pred;
      vector[N2] f2_mu;
      matrix[N2, N2] cov_f2;
      matrix[N2, N2] diag_delta;
      matrix[N1, N1] K;
      K = gp_exp_quad_cov(x1, alpha, rho);
      for (n in 1:N1) {
        K[n, n] = K[n, n] + square(sigma);
      }
      L_K = cholesky_decompose(K);
      K_div_y1 = mdivide_left_tri_low(L_K, y1);
      K_div_y1 = mdivide_right_tri_low(K_div_y1', L_K)';
      k_x1_x2 = gp_exp_quad_cov(x1, x2, alpha, rho);
      f2_mu = (k_x1_x2' * K_div_y1);
      v_pred = mdivide_left_tri_low(L_K, k_x1_x2);
      cov_f2 = gp_exp_quad_cov(x2, alpha, rho) - v_pred' * v_pred;
      diag_delta = diag_matrix(rep_vector(delta, N2));

      f2 = multi_normal_rng(f2_mu, cov_f2 + diag_delta);
    }
    return f2;
  }
}
data {
  int<lower=1> N1;
  array[N1] real x1;
  vector[N1] y1;
  int<lower=1> N2;
  array[N2] real x2;
}
transformed data {
  vector[N1] mu = rep_vector(0, N1);
  real delta = 1e-9;
}
parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
}
model {
  matrix[N1, N1] L_K;
  {
    matrix[N1, N1] K = gp_exp_quad_cov(x1, alpha, rho);
    real sq_sigma = square(sigma);

    // diagonal elements
    for (n1 in 1:N1) {
      K[n1, n1] = K[n1, n1] + sq_sigma;
    }

    L_K = cholesky_decompose(K);
  }

  rho ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  sigma ~ std_normal();

  y1 ~ multi_normal_cholesky(mu, L_K);
}
generated quantities {
  vector[N2] f2;
  vector[N2] y2;

  f2 = gp_pred_rng(x2, y1, x1, alpha, rho, sigma, delta);
  for (n2 in 1:N2) {
    y2[n2] = normal_rng(f2[n2], sigma);
  }
}

Multiple-output Gaussian processes

多输出高斯过程

Suppose we have observations \(y_i \in \mathbb{R}^M\) observed at \(x_i \in \mathbb{R}^K\). One can model the data like so:

假设我们在 \(x_i \in \mathbb{R}^K\) 点上有观测 \(y_i \in \mathbb{R}^M\)。可以按如下方式建模:

\[\begin{align*} y_i &\sim \textsf{multivariate normal}\left(f(x_i), \textbf{I}_M \sigma^2\right) \\ f(x) &\sim \textsf{GP}\big(m(x), K(x \mid \theta, \phi)\big) \\ & K(x \mid \theta) \in \mathbb{R}^{M \times M}, \quad f(x), m(x) \in \mathbb{R}^M \end{align*}\]

where the \(K(x, x^\prime \mid \theta, \phi)_{[m, m^\prime]}\) entry defines the covariance between \(f_m(x)\) and \(f_{m^\prime}(x^\prime)\). This construction of Gaussian processes allows us to learn the covariance between the output dimensions of \(f(x)\). If we parameterize our kernel \(K\):

其中 \(K(x, x^\prime \mid \theta, \phi)_{[m, m^\prime]}\) 元素定义了 \(f_m(x)\)\(f_{m^\prime}(x^\prime)\) 之间的协方差。这种高斯过程的结构为我们研究输出值不同维度之间的协方差提供了可能。我们对核函数 \(K\) 进行如下参数化:

\[ K(x, x^\prime \mid \theta, \phi)_{[m, m^\prime]} = k\left(x, x^\prime \mid \theta\right) k\left(m, m^\prime \mid \phi\right) \]

then our finite dimensional generative model for the above is:

此时我们对应的有限维生成模型如下所示:

\[\begin{align*} f &\sim \textsf{matrixnormal}\big(m(x), K(x \mid \alpha, \rho), C(\phi)\big) \\ y_{i, m} &\sim \textsf{normal}(f_{i,m}, \sigma) \\ f &\in \mathbb{R}^{N \times M} \end{align*}\]

where \(K(x \mid \alpha, \rho)\) is the exponentiated quadratic kernel we’ve used throughout this chapter, and \(C(\phi)\) is a positive-definite matrix, parameterized by some vector \(\phi\).

其中 \(K(x \mid \alpha, \rho)\) 是本章使用的指数二次核函数,\(C(\phi)\) 是由向量 \(\phi\) 参数化的一个正定矩阵。

The matrix normal distribution has two covariance matrices: \(K(x \mid \alpha, \rho)\) to encode column covariance, and \(C(\phi)\) to define row covariance. The salient features of the matrix normal are that the rows of the matrix \(f\) are distributed:

矩阵正态分布具有两个协方差矩阵:\(K(x \mid \alpha, \rho)\) 刻画同一列内各元素之间的协方差,\(C(\phi)\) 刻画同一行内各元素之间的协方差。矩阵正态分布的重要特性是,\(f\) 的行向量服从如下分布:

\[ f_{[n,]} \sim \textsf{multivariate normal}\big(m(x)_{[n,]}, K(x \mid \alpha, \rho)_{[n,n]} C(\phi)\big) \]

and that the columns of the matrix \(f\) are distributed:

以及 \(f\) 的列向量有如下分布:

\[ f_{[,m]} \sim \textsf{multivariate normal}\big(m(x)_{[,m]}, K(x \mid \alpha, \rho) C(\phi)_{[m,m]}\big) \]

This also means that \(\mathbb{E}\left[f^T f\right]\) is equal to \(\operatorname{trace}\!\big(K(x \mid \alpha, \rho)\big) \times C\), whereas \(\mathbb{E}\left[ff^T\right]\) is \(\operatorname{trace}(C) \times K(x \mid \alpha, \rho)\). We can derive this using properties of expectation and the matrix normal density.

这意味着 \(\mathbb{E}\left[f^T f\right]=\operatorname{trace}\!\big(K(x \mid \alpha, \rho)\big) \times C\), 同时 \(\mathbb{E}\left[ff^T\right]\) 等于 \(\operatorname{trace}(C) \times K(x \mid \alpha, \rho)\)。我们可以通过期望的性质和矩阵正态密度的性质得到上述结果。

We should set \(\alpha\) to \(1.0\) because the parameter is not identified unless we constrain \(\operatorname{trace}(C) = 1\). Otherwise, we can multiply \(\alpha\) by a scalar \(d\) and \(C\) by \(1/d\) and our likelihood will not change.

我们应该将 \(\alpha\) 设为 \(1.0\),因为只有在约束 \(\operatorname{trace}(C) = 1\) 之下该参数才可识别。否则,我们对 \(\alpha\) 乘以常数 \(d\),同时对 \(C\) 乘以 \(1/d\),似然函数将不会改变。

We can generate a random variable \(f\) from a matrix normal density in \(\mathbb{R}^{N \times M}\) using the following algorithm:

我们可以用如下的算法从矩阵正态中生成随机变量 \(f \in \mathbb{R}^{N \times M}\)

\[\begin{align*} \eta_{i,j} &\sim \textsf{normal}(0, 1) \, \forall i,j \\ f &= L_{K(x \mid 1.0, \rho)} \, \eta \, L_C(\phi)^T \\ f &\sim \textsf{matrixnormal}\big(0, K(x \mid 1.0, \rho), C(\phi)\big) \\ \eta &\in \mathbb{R}^{N \times M} \\ L_C(\phi) &= \texttt{cholesky}\mathtt{\_}\texttt{decompose}\big(C(\phi)\big) \\ L_{K(x \mid 1.0, \rho)} &= \texttt{cholesky}\mathtt{\_}\texttt{decompose}\big(K(x \mid 1.0, \rho)\big) \end{align*}\]

This can be implemented in Stan using a latent-variable GP formulation. We’ve used \(\textsf{LKJCorr}\) for \(C(\phi)\), but any positive-definite matrix will do.

我们可以使用 Stan 中的隐变量高斯过程来实现上述过程。此处对 \(C(\phi)\) 使用了 \(\textsf{LKJCorr}\),但事实上任何正定矩阵均可。

data {
  int<lower=1> N;
  int<lower=1> D;
  array[N] real x;
  matrix[N, D] y;
}
transformed data {
  real delta = 1e-9;
}
parameters {
  real<lower=0> rho;
  vector<lower=0>[D] alpha;
  real<lower=0> sigma;
  cholesky_factor_corr[D] L_Omega;
  matrix[N, D] eta;
}
model {
  matrix[N, D] f;
  {
    matrix[N, N] K = gp_exp_quad_cov(x, 1.0, rho);
    matrix[N, N] L_K;

    // diagonal elements
    for (n in 1:N) {
      K[n, n] = K[n, n] + delta;
    }

    L_K = cholesky_decompose(K);
    f = L_K * eta
        * diag_pre_multiply(alpha, L_Omega)';
  }

  rho ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  sigma ~ std_normal();
  L_Omega ~ lkj_corr_cholesky(3);
  to_vector(eta) ~ std_normal();

  to_vector(y) ~ normal(to_vector(f), sigma);
}
generated quantities {
  matrix[D, D] Omega;
  Omega = L_Omega * L_Omega';
}
Neal, Radford M. 1996. Bayesian Learning for Neural Networks. Lecture Notes in Statistics 118. New York: Springer.
———. 1997. “Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification.” Technical Report 9702, University of Toronto, Department of Statistics.
Piironen, Juho, and Aki Vehtari. 2016. “Projection Predictive Model Selection for Gaussian Processes.” In Machine Learning for Signal Processing (MLSP), 2016 IEEE 26th International Workshop on. IEEE.
Rasmussen, Carl Edward, and Christopher K. I. Williams. 2006. Gaussian Processes for Machine Learning. MIT Press.
Riutort-Mayol, Gabriel, Paul-Christian Bürkner, Michael R Andersen, Arno Solin, and Aki Vehtari. 2023. “Practical Hilbert Space Approximate Bayesian Gaussian Processes for Probabilistic Programming.” Statistics and Computing 33 (1): 17.
Zhang, Hao. 2004. “Inconsistent Estimation and Asymptotically Equal Interpolations in Model-Based Geostatistics.” Journal of the American Statistical Association 99 (465): 250–61.

  1. Gaussian processes can be extended to covariance functions producing positive semi-definite matrices, but Stan does not support inference in the resulting models because the resulting distribution does not have unconstrained support.↩︎

  2. 高斯过程可以扩展到产生半正定矩阵的协方差函数,但是 Stan 不支持此类模型的推断,因为它们的分布不具有无约束的支撑集。↩︎

  3. A boundary-avoiding prior is just one where the limit of the density is zero at the boundary, the result of which is estimates that are pushed away from the boundary.↩︎

  4. 避免边界点的先验指的是在边界处密度的极限为零,其结果是估计值将被推动远离边界。↩︎