Introduction
I’m excited to announce my sagemaker R package!
AWS Sagemaker is a powerful tool, and I hope my package makes it easier for people to try it out!
Since the Github page and website already introduce the sagemaker R package, I want to use this blog post to introduce AWS Sagemaker, productionizing machine learning, and how the my sagemaker R package tries to make it all easier.
AWS Sagemaker
AWS Sagemaker is a platform for training machine learning models.
There are three components of Sagemaker:
- hosted development environment with juypter notebooks
- scalable training of machine learning models
- endpoint and batch predictions from trained models
Hosted juypter notebooks are a great feature, but this post will focus on Sagemaker’s scalable training and predictions.
AWS cost
Before you use Sagemaker you need an AWS account. AWS requires your credit card information so they can charge you for use of services.
However, by taking advantage of free tier services and closely monitoring usage, it’s possible to keep costs low.
Here is my Spend Summary for the entire time I developed the R package:
- Total Spend: $4.05
- By service:
- Sagemaker: $4.05
- Other services: $0.00
- By service:
- Free Tier Usage
- S3 Put: 28%
- S3 Get: 4.2%
- Others: <1%
So if you are interested in learning AWS, but aren’t in an enterprise environment, it is definitely possible to get started.
Productionizing machine learning predictions
Before we dive into AWS Sagemaker, let me introduce my mental model for generating machine learning predictions:
The workflow on the bottom is similar to the data exploration workflow shown in R for Data Science. We process data, evaluate, and select a model to deploy.
On the top, the key to repeatable, scheduled predictions is the predictions pipeline environment. A mature environment might be a docker image ran by AWS Batch, scheduled by a CRON trigger on Jenkins. A simple one would be an Rscript ran once a day on your desktop.
Think of the pipeline like a function, deployed on a certain trigger:
## 🕐
## <clock for 13:00 (~one o’clock) >
pipeline <- function(...) {
data <- fetch_data(...) %>%
transform(...)
model <- sagemaker::sagemaker_attach_tuner(...)
predict(model, data)
}
One key feature of the environment is that processes are shared with the training workflow.
For example, any data transformation or feature engineering process must be shared between the training workflow and the predictions pipeline. Otherwise, your predictions will not be reliable, and your model might not even run against the new data if its a different shape.
Likewise, the pipeline needs to use the model we selected during training.
Sagemaker Features
Training and Evaluation
For the training half of the productionization framework, Sagemaker provides tools for scalable training of machine learning models.
You can choose the compute power and parallel model building suitable for the scale of your training task.
In the R package, this is done by
sagemaker::sagemaker_estimator
sagemaker::sagemaker_hyperparameter_tuner
There are also features for hyperparameter tuning, as well as basic ways to evaluate tuning and model fit:
sagemaker::sagemaker_tuning_job_logs
sagemaker::sagemaker_training_job_logs
Deployment
In the predictions pipeline environment, Sagemaker can also provide a scalable model to service predictions.
Sagemaker offers two different services, depending on the requirements of the predictions. Real-time endpoint and batch predictions:
Real-time endpoint
First, the Sagemaker real-time predictions endpoint. This operates on a request-response model, where transformed data is sent and predictions are received.
The endpoint is scalable both in terms of compute speed and parallel servicing, depending on the expected demand.
One thing to note is that the environment (or the application) is still responsible for transforming the data as required by the model. Sagemaker does not take care of this.
In the sagemaker R package,
real-time predictions managed by the
predict.sagemaker
S3 method after the endpoint is deployed:
sagemaker::sagemaker_deploy_endpoint
(model)
predict(model, data)
Batch predictions
Sagemaker also offers batch predictions1, making predictions on data in S3 and writing the predictions to S3.
As in the previous example, the data in S3 should already be transformed as required by the model.
Additionally, the predictions pipeline needs to have permission and access to Sagemaker to spin up resources. And the predictions can’t be time sensitive, since you need to wait for the compute resources to deploy.
In the sagemaker R package, this process is executed by
What Sagemaker isn’t
Sagemaker is not a full end-to-end solution for productionizing machine learning predictions. Ultimately, Sagemaker will only act as an API service to generate predictions.
That means you’ll need to find other services to build and maintain the entire predictions pipeline environment. This is a complex system, which includes:
- the environment itself, usually a docker image
- scheduling and triggering the environment
- reusing components from the model training, like data transformation and feature engineering.
Sagemaker is a great tool for machine learning, but it won’t take you all the way.
sagemaker R package
So how does the new sagemaker R package fit into this?2 First and foremost, the sagemaker R package is an interface to the AWS Sagemaker API. This means that it’s easier to build a predictions pipeline environment using R.
You can do all your data maniuplation and cleaning from R, as well as manage your Sagemaker APIs.
Second, I think this R package vastly simplifies the Sagemaker interface, especially during model training and evaluation. It’s easier to compare hyperparameters across tuning jobs, and make predictions on new data.
Most importantly, I’ve tried to hide a lot of the details that get in the way when you are trying to quickly spin up a Sagemaker model. Less boilerplate, more machine learning.
What to watch out for
This package has been built for and tested on the xgboost Sagemaker models.
I spent most of my time working with xgboost, so I’ve even included
shortcuts for xgboost like
sagemaker_xgb_container
and sagemaker_xgb_estimator
.
However, this means there are a lot of models and features of Sagemaker
I might not have come across.
If for whatever reason, something in sagemaker
doesn’t work
leave an issue here: https://github.com/tmastny/sagemaker/issues.
What’s next
I have plans to extend the sagemaker R package in various ways. I really like Sagemaker and R, and I think there are many small improvements that could be made to the interface to help with some everyday stuff.
Things like a type
parameter on predict
for probability or class,
default objective metrics based on the dataset’s outcome,
and better intergration with named data frames.
I also have bigger projects I want to tackle. Cross-validation seems like a huge missing feature from Sagemaker, but something that I think is possible to implement on my own.
And as always, I’ll keep developing and maintaining functionality as I build more models.
The Sagemaker API calls it batch transform, because technically you can use it for generic transforms. However, in my experience this process is not well documented and there are better solutions in my opinion. See https://stackoverflow.com/questions/58985124/why-does-aws-sagemaker-run-a-web-server-for-batch-transform↩
https://aws.amazon.com/blogs/machine-learning/using-r-with-amazon-sagemaker/ http://www.rpubs.com/TimFlocke/SageMaker_R_demo ↩