Cross Validation

If you’re using this package it’s probably safe to say you’re interested in building a statistical model, training it with some data, and making predictions for another. In the process you’ll invariably end up wondering how well the model works. It could be that you have multiple models and want to know which will work best, or it could simply be that you’re curious how well your model will perform in real life.

This is where cross validation steps in. The motivation is relatively simple: fitting your model and testing it on the same data is cheating. Cross validation is an approach for getting a more representative estimate of a model’s generalization error, or the ability of your model to predict things it hasn’t seen. The process consists of splitting your data into groups, holding one group out, fitting a model to the remaining data, predicting the held out group, and then repeating for all groups. The result is out of sample predictions for every point in your dataset, which can be compared to the truth and used with your favorite metrics.

There is a cross_validate() helper available for any model you build in albatross. You provide a function get_group which takes a single feature (dataset.features[i]) and returns the group it belongs to.

// Map each feature to the group it belongs to.
const auto get_group = [](const FeatureType &f) -> GroupKey {
  return f.something;
};

Then you can ask for leave one group out predictions from your model, returned in the original order (which makes it easy to compare directly to the truth dataset),

// Get the cross validated marginal predictions in the original order;
MarginalDistribution cv_pred = model.cross_validate().predict(dataset, get_group).marginal();
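
Because the marginal predictions come back in the same order as dataset.features, they can be compared to the training targets directly. Below is a rough, illustrative sketch (not part of the albatross API) that assumes MarginalDistribution and dataset.targets store their means as an Eigen vector named mean:

// Illustrative only: score the cross validated means against the observed
// targets with a simple RMSE. Assumes `.mean` is an Eigen::VectorXd.
const Eigen::VectorXd residuals = cv_pred.mean - dataset.targets.mean;
const double rmse = std::sqrt(residuals.squaredNorm() / residuals.size());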

Or you can ask for predictions organized by group,

std::map<GroupKey, MarginalDistribution> cv_marginals =
         model.cross_validate().predict(dataset, get_group).marginals();
std::map<GroupKey, JointDistribution> cv_joints =
         model.cross_validate().predict(dataset, get_group).joints();

Efficient Cross Validation with Gaussian Processes

In Gaussian Processes for Machine Learning (Section 5.4.2) the authors describe an efficient way of making leave one out predictions; here we extend that same trick to make leave one group out predictions.

Consider the case where we have a set of observations, \(y\), and we would like to make leave one group out cross validated predictions, where by groups we mean independent sets of one or more variables.

We start with our GP,

\[\mathbf{f} \sim \mathcal{N}\left(0, \Sigma \right)\]

Which we can then break into groups,

\[\begin{split}\begin{bmatrix} \mathbf{\hat{y}} \\ \mathbf{y_i} \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix}\hat{\Sigma} & S \\ S^T & C \end{bmatrix}\right)\end{split}\]

Where we will be using a subset of observations, \(\hat{y}\), to make predictions for a held out set of locations, \(x_i\). We can do this directly using the Gaussian process predict formula,

\[[\mathbf{y_i}|\mathbf{\hat{y}}=\hat{y}] \sim \mathcal{N}\left(S^T \hat{\Sigma}^{-1} \hat{y}, C - S^T \hat{\Sigma}^{-1} S\right)\]

But doing so would require computing \(\hat{\Sigma}^{-1}\) for every group, \(i\), that we hold out. So if we’re doing leave one out with \(n\) observations we have to do the \(\mathcal{O}(n^3)\) inversion \(n\) times, leading to \(\mathcal{O}(n^4)\) complexity, which quickly becomes infeasible.

However, in the process of fitting our GP we’ll end up computing the inverse of the full covariance, \(\Sigma^{-1}\), as well as what we’ve been calling the information vector, \(v = \Sigma^{-1} y\). By using block inversion we get,

\[\begin{split}\Sigma^{-1} = \begin{bmatrix} \left(\hat{\Sigma} - S C^{-1} S^T\right)^{-1} & -\left(\hat{\Sigma} - S C^{-1} S^T\right)^{-1}SC^{-1} \\ -\left(C - S^T \hat{\Sigma}^{-1} S\right)^{-1} S^T \hat{\Sigma}^{-1} & \left(C - S^T \hat{\Sigma}^{-1} S\right)^{-1}\end{bmatrix}\end{split}\]

And if we break up \(v\) into \([\hat{v} \hspace{8pt} v_i]\) using the same partitioning as \(y\) we see,

\[\begin{split}v_i & = \left[\Sigma^{-1} y\right]_i \\ & = \begin{bmatrix} -\left(C - S^T \hat{\Sigma}^{-1} S\right)^{-1} S^T \hat{\Sigma}^{-1} & \left(C - S^T \hat{\Sigma}^{-1} S\right)^{-1} \end{bmatrix} \begin{bmatrix} \hat{y} \\ y_i \end{bmatrix} \\ & = -\left(C - S^T \hat{\Sigma}^{-1} S\right)^{-1} S^T \hat{\Sigma}^{-1} \hat{y} + \left(C - S^T \hat{\Sigma}^{-1} S\right)^{-1} y_i \\ & = -A S^T \hat{\Sigma}^{-1} \hat{y} + A y_i\end{split}\]

Where \(A = \left(C - S^T \hat{\Sigma}^{-1} S\right)^{-1}\) is the lower right corner of \(\Sigma^{-1}\) and \(A^{-1}\) is the leave one out prediction covariance. Notice that if we multiply \(v_i\) through by \(A^{-1}\) we end up with,

\[\begin{split}A^{-1} v_i &= - S^T \hat{\Sigma}^{-1} \hat{y} + y_i \\ &= -\mbox{E}[\mathbf{y_i}|\hat{y}] + y_i \\ \mbox{E}[\mathbf{y_i}|\mathbf{\hat{y}}=\hat{y}] &= y_i - A^{-1} v_i\end{split}\]

We can then recover the leave one out predictions,

\[[\mathbf{y_i}|\mathbf{\hat{y}}=\hat{y}] \sim \mathcal{N}\left(y_i - A^{-1} v_i, A^{-1}\right)\]
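
As a concrete illustration of the recovery step (not the albatross implementation), the sketch below assumes the dense inverse covariance and information vector are already available as Eigen objects, here hypothetically named sigma_inv and v, together with the observations y and the indices of the held out group:

// Illustrative sketch: recover the held out prediction for one group from the
// precomputed full inverse covariance and information vector.
//   sigma_inv : Sigma^{-1}                    (Eigen::MatrixXd, n x n)
//   v         : Sigma^{-1} * y                (Eigen::VectorXd, n)
//   y         : the observations              (Eigen::VectorXd, n)
//   group     : indices of the held out group (std::vector<Eigen::Index>)
const Eigen::Index k = static_cast<Eigen::Index>(group.size());
Eigen::MatrixXd A(k, k);
Eigen::VectorXd v_i(k), y_i(k);
for (Eigen::Index r = 0; r < k; ++r) {
  v_i[r] = v[group[r]];
  y_i[r] = y[group[r]];
  for (Eigen::Index c = 0; c < k; ++c) {
    A(r, c) = sigma_inv(group[r], group[c]);
  }
}
// [y_i | y_hat] ~ N(y_i - A^{-1} v_i, A^{-1})
const Eigen::LDLT<Eigen::MatrixXd> A_ldlt(A);
const Eigen::VectorXd held_out_mean = y_i - A_ldlt.solve(v_i);
const Eigen::MatrixXd held_out_cov =
    A_ldlt.solve(Eigen::MatrixXd::Identity(k, k));

Note that \(A\) is only \(k \times k\) for a group of size \(k\), so solving with it is cheap compared to refactorizing \(\hat{\Sigma}\) for every group.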

Computing \(A\)

Above we see that if we can compute \(A\) then we can recover the leave one out predictions without ever directly computing \(\hat{\Sigma}^{-1}\). Take the case of leaving a single observation out: \(A\) is then just the corresponding diagonal value of \(\Sigma^{-1}\). When training a Gaussian process we’ll often have a decomposition of \(\Sigma\) lying around, typically \(\Sigma = LDL^T\). To get the \(i^{th}\) diagonal value of \(\Sigma^{-1}\) we can first compute \(q = D^{-1/2} L^{-1} e_i\), where \(e_i\) is a vector of zeros with a one in element \(i\), and then we find that \(\Sigma^{-1}_{ii} = q^T q\). Since \(L\) is lower triangular and \(D\) is diagonal, \(q\) can be computed efficiently.

Similarly if we’re making leave one group out predictions we can build an indexing matrix \(E_i\) which consists of columns \(e_j\) for each \(j\) in group \(i\). Then we find that,

\[A = Q^T Q\]

with

\[Q = D^{-1/2} L^{-1} E_i.\]

Where \(L^{-1} E_i\) amounts to extracting columns of \(L^{-1}\).
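
As a rough sketch of how this might look with Eigen (again, not the albatross implementation), suppose the factorization is held in an Eigen::LDLT object named ldlt and the held out indices in a vector named group (both names hypothetical). Eigen’s LDLT is pivoted, factoring \(\Sigma = P^T L D L^T P\), so the permutation has to be applied to \(E_i\) before the triangular solve:

// Illustrative sketch: compute A = Q^T Q with Q = D^{-1/2} L^{-1} (P E_i).
//   ldlt  : Eigen::LDLT<Eigen::MatrixXd> of Sigma (n x n)
//   group : indices of the held out group (std::vector<Eigen::Index>)
const Eigen::Index n = ldlt.rows();
const Eigen::Index k = static_cast<Eigen::Index>(group.size());

// E_i selects the columns/rows belonging to group i.
Eigen::MatrixXd E_i = Eigen::MatrixXd::Zero(n, k);
for (Eigen::Index c = 0; c < k; ++c) {
  E_i(group[c], c) = 1.0;
}

// Apply the pivoting permutation, then the unit lower triangular solve and the
// diagonal scaling; all of these are cheap compared to refactorizing Sigma.
Eigen::MatrixXd Q = ldlt.transpositionsP() * E_i;
Q = ldlt.matrixL().solve(Q);
const Eigen::VectorXd d_inv_sqrt = ldlt.vectorD().cwiseSqrt().cwiseInverse();
Q = d_inv_sqrt.asDiagonal() * Q;

const Eigen::MatrixXd A = Q.transpose() * Q;

The single observation case above is just the one column version of this, where \(E_i = e_i\) and \(A = q^T q\).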