Tuning and Cross-validation with tidymodels and scikit learn

2020/06/30

Introduction

tidymodels is the new machine learning framework from Max Kuhn, Davis Vaughan, and Julia Silge at RStudio. It’s the successor to the caret package, which was heavily featured in Max Kuhn’s book Applied Predictive Modeling.

tidymodels promises a modular, extensible design for machine learning in R. It also has a wonderfully designed website to help you get started as quickly as possible, with a focus on examples and case studies that walk you through the components of the platform.

In this article, I want to translate concepts between tidymodels and scikit-learn, Python’s standard machine learning ecosystem. I hope this will be useful for Python data scientists interested in R, and for R users who want to learn a little more about Python.

I’ll cover the key machine learning workflows with scikit-learn and tidymodels side by side: preprocessing and feature engineering, model training and cross-validation, and model evaluation and comparison.

I think you’ll find that while each framework has its own unique advantages, there are many conceptual similarities, and seeing them side by side has helped me understand both better.

tidymodels and sklearn

We’ll go through the tidymodels example 4 Tune Model Parameters side by side with sklearn.

Notebook saved here: https://github.com/tmastny/sageprocess/blob/master/vignettes/sklearn-tidymodels-example1.ipynb

tidymodels

library(tidymodels)
library(modeldata)
library(vip)
library(readr)

cells <- read_csv("cells.csv")

sklearn

import pandas as pd
import numpy as np

np.random.seed(753)
cells = pd.read_csv("cells.csv")
tidymodels

cell_split <- initial_split(
  cells,
  strata = class
)

cell_train <- training(cell_split)
cell_test <- testing(cell_split)
sklearn

from sklearn.model_selection import train_test_split

features = cells.drop('class', axis=1)
outcome = cells['class']

X_train, X_test, y_train, y_test = train_test_split(
    features,
    outcome,
    test_size=0.25,
    stratify=outcome
)

We have to do a little preprocessing. All the features are numeric, but the outcome is binary. R expects categorical outcomes to be factors, so the recipe converts the outcome strings with step_string2factor(). scikit-learn classifiers accept string labels directly and encode them internally. On the Python side, we use the identity transformation FunctionTransformer() inside a column transformer, which removes the case column simply by leaving it out of the selected columns.

tidymodels

tree_rec <- recipe(class ~ ., data = cells) %>%
  step_rm(case) %>%
  step_string2factor(all_outcomes())
sklearn

from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import make_column_transformer

tree_preprocess = make_column_transformer(
    (FunctionTransformer(), features.drop('case', axis=1).columns)
)
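
As a quick aside, you can fit the column transformer on the training features and confirm that case is gone, since make_column_transformer drops any unselected columns by default (remainder='drop'). A minimal sketch:

# Sketch: the identity transformer keeps every selected column, and the
# unselected `case` column is dropped, so the output has one fewer column.
print(X_train.shape, tree_preprocess.fit_transform(X_train).shape)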

With R, we choose the implementation using set_engine() and set_mode(). With sklearn, choosing a different implementation means importing a different class that implements the sklearn estimator interface, as sketched below.

tidymodels

tree_model <- decision_tree(
  cost_complexity = tune(),
  tree_depth = tune()
) %>%
  set_engine("rpart") %>%
  set_mode("classification")
sklearn

from sklearn.tree import DecisionTreeClassifier
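
For illustration only, swapping in a different tree implementation on the Python side is just a different import exposing the same fit/predict interface, roughly analogous to changing the engine in set_engine(). ExtraTreeClassifier here is an arbitrary stand-in; the rest of the example sticks with DecisionTreeClassifier.

# Sketch: a different implementation is just a different class with the
# same sklearn estimator interface.
from sklearn.tree import ExtraTreeClassifier

alternative_model = ExtraTreeClassifier()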

Both frameworks have a way to combine all the components of the model: a workflow in tidymodels and a pipeline in sklearn.

tidymodels

set.seed(345)

tree_wf <- workflow() %>%
  add_model(tree_model) %>%
  add_recipe(tree_rec)
sklearn

from sklearn.pipeline import make_pipeline

tree_pipeline = make_pipeline(
    tree_preprocess,
    DecisionTreeClassifier()
)
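
One detail worth noting: make_pipeline names each step after its lowercased class name, and those auto-generated names are exactly what the tuning grid keys below refer to. A quick way to see them:

# Sketch: pipeline steps are auto-named, which is where the
# 'decisiontreeclassifier__...' keys in the tuning grid come from.
print(tree_pipeline.named_steps.keys())
# dict_keys(['columntransformer', 'decisiontreeclassifier'])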

Next, we come up with tuning grids. tidymodels has sensible defaults for all tuning parameters, while with sklearn you typically have to supply the values yourself (one way to reproduce the tidymodels defaults is sketched after the grids below).

tidymodels

tree_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  levels = 5
)

tree_grid
## # A tibble: 25 x 2
##    cost_complexity tree_depth
##              <dbl>      <int>
##  1    0.0000000001          1
##  2    0.0000000178          1
##  3    0.00000316            1
##  4    0.000562              1
##  5    0.1                   1
##  6    0.0000000001          4
##  7    0.0000000178          4
##  8    0.00000316            4
##  9    0.000562              4
## 10    0.1                   4
## # … with 15 more rows
sklearn

param_grid = {
    'decisiontreeclassifier__max_depth': [1, 4, 8, 11, 15],
    'decisiontreeclassifier__ccp_alpha': [
        0.0000000001, 0.0000000178, 0.00000316, 0.000562, 0.1
    ]
}
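
The hand-typed ccp_alpha values above mirror the tidymodels defaults, which dials generates on a log10 scale from -10 to -1. A minimal sketch, assuming you would rather compute them with numpy than copy them:

# Sketch: log-spaced grid from 1e-10 to 1e-1 with 5 levels, matching the
# cost_complexity values in the tidymodels grid above.
ccp_alphas = np.logspace(-10, -1, num=5)
# ~ [1e-10, 1.78e-08, 3.16e-06, 5.62e-04, 1e-01]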

Next, we construct and run cross-validation. By default, tidymodels calculates both accuracy and area under the ROC curve; with sklearn we have to specify those metrics manually.

tidymodels

set.seed(234)
cell_folds <- vfold_cv(cell_train, v = 5)

tree_res <- tree_wf %>%
  tune_grid(
    resamples = cell_folds,
    grid = tree_grid
  )
sklearn

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

tree_scorer = {
    'roc_auc': make_scorer(roc_auc_score, needs_proba=True),
    'accuracy': make_scorer(accuracy_score)
}

tree_tuner = GridSearchCV(
    tree_pipeline, param_grid, cv=5,
    scoring=tree_scorer,
    refit='roc_auc'
)

tree_res = tree_tuner.fit(X_train, y_train)
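
A brief note on refit='roc_auc': with multiple scorers, GridSearchCV refits the parameter set with the best mean cross-validated ROC AUC on the full training data, and the usual best_* attributes report that choice.

# Sketch: because refit='roc_auc', these attributes describe the
# configuration with the highest mean CV ROC AUC.
print(tree_res.best_score_)  # mean CV roc_auc of the best parameter set
print(tree_res.best_index_)  # row of cv_results_ for that parameter set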

List the cross-validation metrics and the best hyperparameters. The exact numbers differ between the two frameworks because the resampling folds and random seeds are not shared.

tidymodels

best_param <- tree_res %>%
  select_best("roc_auc")

best_param
## # A tibble: 1 x 3
##   cost_complexity tree_depth .config
##             <dbl>      <int> <fct>  
## 1    0.0000000001          4 Model06
sklearn

tree_res.best_params_
## {'decisiontreeclassifier__ccp_alpha': 1.78e-08, 'decisiontreeclassifier__max_depth': 4}
tidymodels

tree_res %>%
  show_best("roc_auc")
## # A tibble: 5 x 8
##   cost_complexity tree_depth .metric .estimator  mean     n std_err .config
##             <dbl>      <int> <chr>   <chr>      <dbl> <int>   <dbl> <fct>  
## 1    0.0000000001          4 roc_auc binary     0.862     5 0.00578 Model06
## 2    0.0000000178          4 roc_auc binary     0.862     5 0.00578 Model07
## 3    0.00000316            4 roc_auc binary     0.862     5 0.00578 Model08
## 4    0.000562              4 roc_auc binary     0.862     5 0.00578 Model09
## 5    0.0000000001          8 roc_auc binary     0.846     5 0.0105  Model11
sklearn

pd.DataFrame(tree_res.cv_results_) \
    .sort_values('mean_test_roc_auc', ascending=False) \
    .rename(columns={
        'param_decisiontreeclassifier__ccp_alpha': 'cost',
        'param_decisiontreeclassifier__max_depth': 'max_depth'
    }) \
    [[
        'cost', 'max_depth',
        'mean_test_accuracy', 'mean_test_roc_auc'
    ]] \
    .head(5)
##         cost max_depth  mean_test_accuracy  mean_test_roc_auc
## 6   1.78e-08         4            0.797240           0.839899
## 1      1e-10         4            0.796584           0.839053
## 11  3.16e-06         4            0.795257           0.837103
## 16  0.000562         4            0.795917           0.837055
## 0      1e-10         1            0.738461           0.779942

Variable importance is where sklearn becomes significantly more difficult. Column-name metadata is not carried through to the fitted model, so we have to manually match the column names back to the ordered array of feature importances (a newer scikit-learn shortcut is sketched after the plot below).

tidymodels

best_tree <- tree_wf %>%
  finalize_workflow(best_param) %>%
  fit(data = cell_train)

best_tree %>%
  pull_workflow_fit() %>%
  vip()

sklearn

best_tree = tree_res.best_estimator_.named_steps['decisiontreeclassifier']
ct = tree_res.best_estimator_.named_steps['columntransformer']

feature_importances = pd.DataFrame({'name': ct.transformers_[0][2]}) \
    .assign(importance = best_tree.feature_importances_) \
    .sort_values('importance', ascending=False)

import seaborn as sns
sns.barplot(x='importance', y='name', data=feature_importances.head(10))
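
As an aside, on newer scikit-learn releases (roughly 1.1 and later) this bookkeeping can be avoided, provided the identity step is built with FunctionTransformer(feature_names_out='one-to-one') so that the fitted column transformer can report its output names itself. A minimal sketch under that assumption:

# Sketch, assuming scikit-learn >= 1.1: build the identity step so it can
# report feature names, then ask the fitted column transformer for them.
named_preprocess = make_column_transformer(
    (FunctionTransformer(feature_names_out='one-to-one'),
     features.drop('case', axis=1).columns)
)
named_preprocess.fit(X_train)
feature_names = named_preprocess.get_feature_names_out()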

Final validation

tidymodels

best_tree %>%
  last_fit(cell_split) %>%
  collect_metrics()
## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.813
## 2 roc_auc  binary         0.853
sklearn

pd.DataFrame.from_records([
    (name, scorer(tree_res.best_estimator_, X_test, y_test))
    for name, scorer in tree_scorer.items()
], columns=['metric', 'score'])
##      metric     score
## 0   roc_auc  0.824385
## 1  accuracy  0.776238