We provide basic machine learning tooling by wrapping scikit-learn objects, contained in the dato.ml module. Where possible, we follow our typical naming convention (UpperCamelCase with underscores removed), but because many scikit-learn objects are already classes, this is not always possible. For Regressor and Classifier objects, we therefore simply abridge these terms to Reg and Clf in our Pipeable-wrapped versions. For example, the LinearRegression implementation is called LinearReg, and XGBRegressor is shortened in the same way.
Machine learning modeling does not always follow a single-input, single-output workstream. Data is often split off and reserved for validation, which requires an accumulator that is passed through our pipelines and stores this data for downstream consumption. We do this by initiating a _ModelSpec object once InitModel is called. Users may want to access the following components of this class while debugging their models. The training and test_ objects contain the training and test data, respectively. The estimator is the instantiated class object for the underlying model.
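To make the accumulator idea concrete, here is a minimal pure-Python sketch of how a state object can be threaded through `>>` steps. It is not dato's implementation: `ModelSpecSketch`, `Step`, `init_model`, and `split` are hypothetical names invented for illustration, and the real _ModelSpec may store its data quite differently.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ModelSpecSketch:
    """Hypothetical accumulator carried from step to step."""
    data: list
    label: str
    train: Optional[list] = None   # filled in by the split step
    test: Optional[list] = None    # filled in by the split step
    estimator: Any = None          # would hold the fitted model


class Step:
    """Wraps a function so it can be applied as `value >> Step(func)`."""

    def __init__(self, func: Callable[[Any], Any]):
        self.func = func

    def __rrshift__(self, other: Any) -> Any:
        # Python falls back to this when the left operand has no __rshift__.
        return self.func(other)


def init_model(label: str) -> Step:
    # Starts the accumulator, much as InitModel is described as doing.
    return Step(lambda rows: ModelSpecSketch(data=list(rows), label=label))


def split(test_fraction: float = 0.25) -> Step:
    def _split(spec: ModelSpecSketch) -> ModelSpecSketch:
        cut = int(len(spec.data) * (1 - test_fraction))
        spec.train, spec.test = spec.data[:cut], spec.data[cut:]
        return spec
    return Step(_split)


rows = [{"x": i, "y": 2 * i} for i in range(8)]
spec = rows >> init_model(label="y") >> split(test_fraction=0.25)
print(len(spec.train), len(spec.test))
```

Because every step receives and returns the same object, the held-out data stays available to later steps, and the object can be inspected at any point while debugging.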
Because there is typically a standard set of tasks for creating simple machine learning models, we lay out the necessary functions here. The following steps are generally required:
1. Encode categorical variables.
2. Split data into a training and test set.
3. Train the model over the training set.
4. Predict and evaluate the model.
This can be accomplished as follows:
df >> InitModel(label='y') >> LabelEncode('x1', 'x2') >> FillNA(-1) >> TrainTestSplit >> LinearReg
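For comparison, the four steps listed above can be written directly against scikit-learn, which may help when checking what the wrapped pipeline is expected to do. This is only a sketch: the toy data is made up, the split size and random seed are arbitrary choices, and the missing-value step is skipped because the toy data has no gaps.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy data (invented for illustration): two categorical features, one numeric label.
x1 = ["a", "b", "a", "c", "b", "a", "c", "b"]
x2 = ["u", "v", "v", "u", "u", "v", "u", "v"]
y = np.array([1.0, 2.0, 1.5, 3.0, 2.5, 1.0, 3.5, 2.0])

# 1. Encode each categorical column as integer codes.
X = np.column_stack([LabelEncoder().fit_transform(col) for col in (x1, x2)])

# 2. Split the data into a training and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 3. Train the model over the training set.
model = LinearRegression().fit(X_train, y_train)

# 4. Predict on the held-out data and evaluate.
preds = model.predict(X_test)
mse = mean_squared_error(y_test, preds)
print(f"test MSE: {mse:.3f}")
```

The pipeline form keeps the same order of operations but carries the split data and the estimator along inside the accumulator instead of in separate variables.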