Trigger Pre-built Framework Training Job via Amazon SageMaker API

TL;DR

A SageMaker training job that runs a custom training script on a pre-built framework container (TensorFlow, PyTorch, scikit-learn, and so on) can also be triggered through the SageMaker API alone, by setting these request body fields:

  • HyperParameters.sagemaker_submit_directory: the S3 location of the uploaded source.tar.gz file, which contains the training script.
  • HyperParameters.sagemaker_program: the name of the entry point file.
  • AlgorithmSpecification.TrainingImage: the Amazon ECR registry path of the pre-built framework container image. You can find the image URLs here or here.

Reason for This Blog

The Amazon SageMaker Python SDK is a great package for working with SageMaker. Still, some scenarios require triggering a pre-built framework (TensorFlow, PyTorch, scikit-learn, etc.) training job through the SageMaker API directly. However, there does not seem to be any explicit documentation or tutorial that describes how to do this, so I wrote this short article in the hope that it helps.

Running Pre-built Framework Training Job with Amazon SageMaker Python SDK

The Amazon SageMaker Python SDK is an open source library for training and deploying machine learning models on Amazon SageMaker. There are a bunch of examples for TensorFlow, PyTorch, scikit-learn, and other frameworks in the open source amazon-sagemaker-examples repository. In short, we create an Estimator with the custom script and fit the estimator, as in the code below. Among the PyTorch estimator's parameters, entry_point names the training script, while framework_version and py_version determine the pre-built container image.

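from sagemaker.pytorch import PyTorch

# `role` (the IAM execution role ARN) and `inputs` (the S3 location of the
# training data) are defined earlier in the original example.
# train_instance_count / train_instance_type are the SDK v1 argument names;
# SDK v2 renamed them to instance_count / instance_type.
estimator = PyTorch(entry_point='mnist.py',
                    role=role,
                    framework_version='1.4.0',
                    py_version='py3',
                    train_instance_count=2,
                    train_instance_type='ml.c4.xlarge',
                    hyperparameters={
                        'epochs': 6,
                        'backend': 'gloo'
                    })

estimator.fit({'training': inputs})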

Example code for starting a SageMaker PyTorch training job with the SageMaker Python SDK. The code is copied from this example in the amazon-sagemaker-examples repository.

Scenarios of Using SageMaker API Directly

Although the SageMaker Python SDK makes it very convenient to run managed training jobs for a variety of machine learning frameworks, there are still scenarios where we need to trigger a SageMaker training job directly through the Amazon SageMaker CreateTrainingJob API, such as:

  • Machine learning engineers or software developers use languages other than Python.
  • The machine learning operational pipeline is built with AWS Step Functions, whose SageMaker integration uses the SageMaker API interface.

Use SageMaker API to Trigger Pre-built Framework Training Job with Training Scripts

At first glance, the SageMaker API seems to have no place to set the training script location, so people may think it cannot trigger a TensorFlow/PyTorch/… training job with a custom training script. The good news is that it can!

First, we need to tar the training script as source.tar.gz and upload it to an S3 location, e.g., s3://bucket/prefix/source.tar.gz. This can be done as a step in the Step Functions state machine or in the CI/CD build stage, depending on how we operate the ML pipeline.
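The packaging step can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the helper names make_source_tarball and upload_source are hypothetical, and it assumes the entry point should sit at the root of the archive so that it can be referenced by its bare file name.

```python
import tarfile
from pathlib import Path


def make_source_tarball(script_paths, output_path="source.tar.gz"):
    """Pack the training script(s) into a gzip-compressed tarball.

    Each file is stored at the archive root (assumption: the entry point
    is later looked up by its bare file name, e.g. "mnist.py").
    """
    with tarfile.open(output_path, "w:gz") as tar:
        for script in script_paths:
            script = Path(script)
            tar.add(str(script), arcname=script.name)
    return output_path


def upload_source(tarball_path, bucket, key):
    """Upload the tarball to S3 and return its s3:// URI."""
    # Imported lazily so the packaging helper works without boto3 installed.
    import boto3

    boto3.client("s3").upload_file(tarball_path, bucket, key)
    return f"s3://{bucket}/{key}"
```

For example, upload_source(make_source_tarball(["mnist.py"]), "bucket", "prefix/source.tar.gz") would return s3://bucket/prefix/source.tar.gz, given valid AWS credentials and an existing bucket.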

Then, we need to set these fields in the SageMaker CreateTrainingJob API request body, or in the state definition of the Step Functions SageMaker connector:

  • HyperParameters.sagemaker_submit_directory: the S3 location of the uploaded source.tar.gz file, e.g., s3://bucket/prefix/source.tar.gz.
  • HyperParameters.sagemaker_program: the name of the entry point file.
  • AlgorithmSpecification.TrainingImage: the Amazon ECR registry path of the pre-built framework container image. You can find the image URLs here.
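The three fields above can be assembled with a couple of small helpers. This is a hedged sketch: build_hyperparameters and image_uri are hypothetical names, and JSON-encoding the user hyperparameters is an assumption (the API only accepts string values, and, as far as I know, the Python SDK encodes values the same way). The registry account ID may differ by region, so check the image list for yours.

```python
import json


def image_uri(account, region, repository, tag):
    """Compose the Amazon ECR path of a pre-built framework image."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def build_hyperparameters(src_path, entry_point, region, user_params=None):
    """Build the HyperParameters map for a CreateTrainingJob request.

    CreateTrainingJob accepts only string values, so user-supplied values
    are JSON-encoded here (assumption: the container decodes them).
    """
    params = {k: json.dumps(v) for k, v in (user_params or {}).items()}
    params.update({
        "sagemaker_submit_directory": src_path,  # S3 location of source.tar.gz
        "sagemaker_program": entry_point,        # entry point file name
        "sagemaker_region": region,
    })
    return params


hyperparameters = build_hyperparameters(
    "s3://bucket/prefix/source.tar.gz", "mnist.py", "us-east-1",
    user_params={"epochs": 6, "backend": "gloo"},
)
training_image = image_uri(
    "763104351884", "us-east-1", "pytorch-training",
    "1.4.0-gpu-py36-cu101-ubuntu16.04",
)
```

Keeping the reserved sagemaker_* keys in one place makes it harder to forget one of them when the request is built outside the Python SDK.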

Below is an example that triggers the same training job using the Python boto3 SDK. For more information about these HyperParameters, refer to this code in the sagemaker-training-toolkit package.

...
import boto3

# S3 location of the tarball that contains the training script
src_path = 's3://bucket/prefix/source.tar.gz'


def trigger_train():
    training_job_name = '<YOUR-TRAINING-JOB-NAME>'
    sm = boto3.client('sagemaker')
    resp = sm.create_training_job(
        TrainingJobName=training_job_name,
        AlgorithmSpecification={
            # Pre-built PyTorch training image in Amazon ECR
            'TrainingImage': '763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.4.0-gpu-py36-cu101-ubuntu16.04',
            ...
        },
        # role_arn is the IAM execution role ARN, defined elsewhere
        RoleArn=role_arn,
        HyperParameters={
            'sagemaker_program': 'mnist.py',          # entry point file name
            'sagemaker_submit_directory': src_path,   # uploaded source.tar.gz
            'sagemaker_region': '<your-aws-region>',
        },
        InputDataConfig=[<Input-data-setup>],
        OutputDataConfig={<output-data-setup>},
        ResourceConfig={<resource-setup>},
        ...
    )
    return resp
...
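For the Step Functions scenario, the same fields go into the Parameters of a Task state. The sketch below builds such a state as a Python dict and prints it as JSON. It is illustrative only: training_task_state is a hypothetical helper, $.training_job_name is an assumed field in the state input, the role ARN is a placeholder, and the data, output, resource, and stopping settings are left out and must be added before the state can run.

```python
import json


def training_task_state(image_uri, role_arn, src_path, entry_point):
    """Return a Task state (as a dict) that starts a framework training job."""
    return {
        "Type": "Task",
        # The ".sync" suffix makes the state wait until the job finishes.
        "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
        "Parameters": {
            # Assumed input field that carries a unique job name.
            "TrainingJobName.$": "$.training_job_name",
            "AlgorithmSpecification": {
                "TrainingImage": image_uri,
                "TrainingInputMode": "File",
            },
            "RoleArn": role_arn,
            "HyperParameters": {
                "sagemaker_program": entry_point,
                "sagemaker_submit_directory": src_path,
            },
            # InputDataConfig, OutputDataConfig, ResourceConfig and
            # StoppingCondition are omitted here and must be added before use.
        },
        "End": True,
    }


state = training_task_state(
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.4.0-gpu-py36-cu101-ubuntu16.04",
    "arn:aws:iam::123456789012:role/ExampleRole",  # placeholder role ARN
    "s3://bucket/prefix/source.tar.gz",
    "mnist.py",
)
print(json.dumps(state, indent=2))
```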