Amazon SageMaker Training

Amazon SageMaker is a service for building complete end-to-end machine learning pipelines, from model training to production deployment. It ships with built-in models that you can use straight away, but it also lets you train your own custom models via Docker images. This gives you complete control over the interesting parts of the process (model architecture and training logic) while abstracting away the tedious parts (model serving, instance creation, load balancing, scaling, etc.)

Model Training - involves the following:

  1. Training a model - Here you need an algorithm, and SageMaker gives you several options for providing one:
    1. Use an algorithm provided by Amazon SageMaker
    2. Use Apache Spark with Amazon SageMaker
    3. Submit custom code to train with deep learning frameworks
    4. Use your own custom algorithm - package your code in a Docker container and specify the registry path of the image in an Amazon SageMaker CreateTrainingJob API call
  2. Evaluating a model - after training the model, you evaluate it to determine whether the accuracy of the inferences is acceptable.
    1. Offline testing - Use historical, not live, data to send requests to the model for inferences. Deploy your trained model to an alpha endpoint, and use historical data to send inference requests to it. To send the requests, use a Jupyter notebook in your Amazon SageMaker notebook instance and either the AWS SDK for Python (Boto) or the high-level Python library provided by Amazon SageMaker.
    2. Online testing with live data - Amazon SageMaker lets you deploy multiple models (called production variants) behind a single Amazon SageMaker endpoint. You configure the production variants so that a small portion of the live traffic goes to the model that you want to validate. For example, you might choose to send 10% of the traffic to a model variant for evaluation. After you are satisfied with the model’s performance, you can route 100% of the traffic to the updated model (see the sketch after this list).
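
As an illustration of the traffic split, here is a minimal boto3 sketch that sends 90% of traffic to the current model and 10% to a candidate. The endpoint, config, and model names are placeholders, and it assumes both models have already been created in SageMaker.

    import boto3

    sm = boto3.client("sagemaker")

    # Two production variants behind one endpoint; SageMaker normalizes the
    # weights, so 0.9/0.1 yields a 90/10 traffic split.
    sm.create_endpoint_config(
        EndpointConfigName="my-endpoint-config",  # placeholder name
        ProductionVariants=[
            {
                "VariantName": "current",
                "ModelName": "my-current-model",  # assumed to exist
                "InstanceType": "ml.m5.large",
                "InitialInstanceCount": 1,
                "InitialVariantWeight": 0.9,
            },
            {
                "VariantName": "candidate",
                "ModelName": "my-candidate-model",  # assumed to exist
                "InstanceType": "ml.m5.large",
                "InitialInstanceCount": 1,
                "InitialVariantWeight": 0.1,
            },
        ],
    )

    sm.create_endpoint(
        EndpointName="my-endpoint",
        EndpointConfigName="my-endpoint-config",
    )

Once you are satisfied with the candidate, update_endpoint_weights_and_capacities can shift all traffic to it without re-creating the endpoint.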

AWS Services - The services required for training within the AWS ecosystem are:

  1. Amazon S3 bucket - this is to store your data and model weights
  2. Elastic Container Registry (ECR) - this is to store the Docker images for your training containers
  3. IAM - to create the roles that grant SageMaker access to the other services (see the sketch after this list)
  4. Amazon SageMaker - used to connect all the above services and to launch the training jobs
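
A minimal boto3 sketch of the IAM piece: creating an execution role that the SageMaker service can assume. The role name is a placeholder, and the broad AmazonSageMakerFullAccess managed policy is used only for simplicity.

    import json

    import boto3

    iam = boto3.client("iam")

    # Trust policy that lets the SageMaker service assume the role.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }

    role = iam.create_role(
        RoleName="sagemaker-training-role",  # placeholder name
        AssumeRolePolicyDocument=json.dumps(trust_policy),
    )

    # Broad managed policy for simplicity; scope this down for production.
    iam.attach_role_policy(
        RoleName="sagemaker-training-role",
        PolicyArn="arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
    )

    print(role["Role"]["Arn"])  # pass this ARN to the training job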

Training Steps - Below are the common steps involved:

  1. Upload your datasets to AWS S3 - store the training data in an S3 bucket
  2. Create an S3 bucket that will store training output - this is where SageMaker will upload the model artifacts once the training job finishes
  3. Write the training script - this is what will be executed as a training job (a sketch follows this list)
  4. Dockerize the training script and create a training container - the complexity of this step depends on whether you want to train your model on a GPU and whether it needs any special libraries installed. There aren’t really any constraints on what you can do in this Dockerfile. However, a number of directories are reserved by SageMaker for its own purposes, such as storing input data and saving output files after the training job has finished. Specifically, your training algorithm needs to look for data in the /opt/ml/input folder, and store model artifacts (and whatever other output you’d like to keep for later) in /opt/ml/model. SageMaker copies the training data you uploaded to S3 into the input folder, and copies everything in the model folder to the output S3 bucket.
  5. Host the training Docker container on ECR - push the training container to ECR
  6. Create IAM roles for SageMaker training jobs - as sketched after the services list above
  7. Configure the training job - point it at the ECR image, the S3 data, and the IAM role
  8. Launch the training job (a launch sketch follows this list)
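
First, a minimal sketch of a training script (step 3) that follows the /opt/ml conventions from step 4. The scikit-learn model, the "training" channel name, and the train.csv layout with a "label" column are illustrative assumptions, not anything SageMaker prescribes.

    # train.py - entry point baked into the training image
    import os
    import pickle

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # SageMaker mounts each input channel under /opt/ml/input/data/<channel>.
    INPUT_DIR = "/opt/ml/input/data/training"  # assumes a channel named "training"
    MODEL_DIR = "/opt/ml/model"  # contents are uploaded to the output bucket

    def main():
        df = pd.read_csv(os.path.join(INPUT_DIR, "train.csv"))
        X, y = df.drop(columns=["label"]), df["label"]  # assumed column layout

        model = LogisticRegression(max_iter=1000)
        model.fit(X, y)

        # Artifacts saved here are copied to S3 when the job finishes.
        with open(os.path.join(MODEL_DIR, "model.pkl"), "wb") as f:
            pickle.dump(model, f)

    if __name__ == "__main__":
        main()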
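
Next, a hedged boto3 sketch covering steps 1, 7, and 8: staging the data in S3, then configuring and launching the training job against the image pushed to ECR. The bucket names, image URI, account ID, region, and role ARN are all placeholders.

    import boto3

    s3 = boto3.client("s3")
    sm = boto3.client("sagemaker")

    # Step 1: stage the training data in S3.
    s3.upload_file("train.csv", "my-training-bucket", "data/train.csv")

    # Steps 7 and 8: configure and launch the training job.
    sm.create_training_job(
        TrainingJobName="my-training-job",
        AlgorithmSpecification={
            # Image pushed to ECR in step 5 (placeholder URI).
            "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
            "TrainingInputMode": "File",
        },
        RoleArn="arn:aws:iam::123456789012:role/sagemaker-training-role",  # step 6
        InputDataConfig=[
            {
                # Channel name must match what the training script expects.
                "ChannelName": "training",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": "s3://my-training-bucket/data/",
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        OutputDataConfig={"S3OutputPath": "s3://my-output-bucket/artifacts/"},
        ResourceConfig={
            "InstanceType": "ml.m5.large",
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        StoppingCondition={"MaxRuntimeInSeconds": 3600},
    )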

Model Deployment - After you train your machine learning model, you can deploy it using Amazon SageMaker to get predictions in any of the following ways, depending on your use case:

  1. For persistent, real-time endpoints that make one prediction at a time, use SageMaker Real-Time Inference hosting services.
  2. For workloads that have idle periods between traffic spurts and can tolerate cold starts, use Serverless Inference.
  3. For requests with large payload sizes up to 1GB, long processing times, and near real-time latency requirements, use Amazon SageMaker Asynchronous Inference.
  4. To get predictions for an entire dataset, use SageMaker Batch Transform (see the sketch after this list).
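
As an illustration of the batch option, a minimal boto3 sketch of a Batch Transform job. It assumes a model has already been created in SageMaker; the job, model, and bucket names are placeholders.

    import boto3

    sm = boto3.client("sagemaker")

    # Run inference over every object under the input prefix and write the
    # results to the output prefix; no persistent endpoint is created.
    sm.create_transform_job(
        TransformJobName="my-batch-job",  # placeholder name
        ModelName="my-model",  # assumed to exist in SageMaker
        TransformInput={
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://my-bucket/batch-input/",
                }
            },
            "ContentType": "text/csv",
            "SplitType": "Line",  # send the records line by line
        },
        TransformOutput={"S3OutputPath": "s3://my-bucket/batch-output/"},
        TransformResources={"InstanceType": "ml.m5.large", "InstanceCount": 1},
    )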

Practical Implementations

  • Students enrolling in any AI-related course at Carnegie Training Institute have access to practical, working implementation guidelines
