Custom Models
Defining a Model
The ModelBase
class in albatross
uses the Curiously Recurring Template Pattern (CRTP)
which makes defining them slightly different from the standard inheritence pattern in C++.
In general an albatross model requires defining _fit_impl
, _predict_impl
and a
struct Fit<MyModel>
which is in charge of storing any coefficients.
Fit Method
To get model.fit(dataset) to work you need to add a _fit_impl
method to your class,
this fit implementation needs to take a vector of features and corresponding targets
(often measurements) and needs to return a Fit<ModelType>
object holding any
information required to make predictions.
class ModelType : public ModelBase<ModelType> {
Fit<ModelType> _fit_impl(const std::vector<FeatureType> &features,
const MarginalDistribution &targets) const;
}
The FeatureType
here can be which ever type your problem requires and
you can have multiple _fit_impl
methods for different types in the same model.
Templated _fit_impl
methods also work.
Predict Method
To get model.predict(features) to work you need to add a _predict_impl
method to your class,
this predict implementation needs to take a vector of features and a Fit<ModelType>
and
needs to return either an Eigen::VectorXd
(mean only),
MarginalDistribution
(mean and variance) or JointDistribution
(mean and covariance).
class ModelType : public ModelBase<ModelType> {
JointDistribution _predict_impl(const std::vector<FeatureType> &features,
const Fit<ModelType> &fit,
PredictTypeIdentity<JointDistribution>) const;
}
In this case above we’ve implemented predict to return a JointDistribution
which holds
the mean prediction as well as a full covariance. A JointDistribution
can be converted
into a MarginalDistribution
by taking the diagonal of the covariance matrix and a MarginalDistribution
can be converted into a mean only prediction (Eigen::VectorXd
) by simply taking the mean of the distribution.
As a result, by implementing predict for a JointDistribution
you will be able to call
all of the following.
const auto prediction model.fit(dataset).predict(features);
JointDistribution joint_pred =prediction.joint();
MarginalDistribution marginal_pred = prediction.marginal();
Eigen::VectorXd mean_pred = prediction.mean();
If you define _predict_impl
with a MarginalDistribution
instead, then you’ll find
that you can call,
MarginalDistribution marginal_pred = prediction.marginal();
Eigen::VectorXd mean_pred = prediction.mean();
but calling prediction.joint();
would result in a compile time error. Similarly if you just define the mean only version
then asking for anything other than prediction.mean()
will result in a compile time error.
We saw above that you could implement the JointDistribution
version and have access to all the predict types,
but that is often inefficient. Instead you may want to impelement specialized version for each of
the predict types. This is what is done in for the Gaussian processes (see gp.hpp
). The
desire to have specialized predict types is what led to the mysterious PredictTypeIdentity<>
argument,
which is required to allow overridable _predict_impl
methods with different return types.
Fit Type
The fit type needs to be a specialization of the Fit<>
struct. The idea is that by forcing the
output of _fit_impl
to be a custom type we can subsequently make model types constant, which
gives us peace of mind that there isn’t accidentally some state that get’s stored in a model which
would cause two calls to fit
to produce different results.
Once you’ve defined the Fit<>
you shouldn’t ever need to actually inspect that type, that
should be left to the internals of albatross
. Instead you are encouraged to use auto
,
const auto fit_model = model.fit(dataset);
or write everything as one liners.
const Eigen::VectorXd mean = model.fit(dataset).predict(features).mean();
Here’s an illustration of the actual types that would result from a typical model workflow:
const ModelType model = make_my_model();
const FitModel<ModelType, Fit<ModelType>> fit_model = model.fit(dataset);
const Prediction<ModelType, FeatureType, Fit<ModelType>> prediction = fit_model.predict(features);
const JointDistribution joint_prediction = prediction.joint();
Again, thanks to auto
type declarations you shouldn’t need to actually know these types
but it may be helpful to get a glimpse of what’s happening under the hood. This chain of
types is what allows albatross
to keep track of how exactly you’re using a model and
decide (at compile time) the most efficient methods to use.
Example
Here’s an example of a model which always returns the mean of the training data.
struct Fit<MeanModel> {
double mean;
}
class MeanModel : public albatross::ModelBase<MeanModel> {
public:
using FitType = Fit<MeanModel>;
std::string get_name() const { return "mean"; }
template <typename FeatureType>
FitType _fit_impl(const std::vector<FeatureType> &features,
const MarginalDistribution &targets) const {
FitType model_fit = {targets.mean.mean()};
return model_fit;
}
template <typename FeatureType>
Eigen::VectorXd _predict_impl(const std::vector<FeatureType> &features,
const FitType &fit,
PredictTypeIdentity<Eigen::VectorXd>) const {
Eigen::VectorXd output(features.size());
output.fill(fit.mean);
return output;
}
}
While defining your own model isn’t as simple as standard inheritence
, the benefits are large. Once you’ve defined a model using the ModelBase
class you can immediately start using all the
tools built around it, things such as cross validation, outlier detection using RANSAC,
and tuning tuning.