Amazon SageMaker Pipelines

What comes next after building a data lake


Building a data lake is a demanding undertaking. But once that hurdle is cleared, how do you best use the lake to the enterprise's advantage? One of the most effective uses of a data lake is to drive analytics and machine learning on top of it, and the practical way to do that is to set up a machine learning pipeline. This article looks at building machine learning pipelines with Amazon SageMaker.

What It Is

With Amazon SageMaker Pipelines, you can create, automate, and manage end-to-end machine learning (ML) workflows at scale. You can use Amazon SageMaker Model Building Pipelines to create end-to-end workflows that manage and deploy SageMaker jobs. SageMaker Pipelines comes with SageMaker Python SDK integration, so you can build each step of your pipeline using a Python-based interface.
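
As a minimal sketch of that Python-based interface, the snippet below defines a single preprocessing step. The role ARN, S3 paths, and preprocess.py script are placeholders, not part of any real project.

```python
# A minimal sketch of defining one pipeline step with the SageMaker Python SDK.
# The role ARN, bucket paths, and preprocess.py script are placeholders.
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

step_process = ProcessingStep(
    name="PreprocessData",
    processor=processor,
    inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(output_name="train",
                              source="/opt/ml/processing/train")],
    code="preprocess.py",  # your own preprocessing script
)
```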

After your pipeline is deployed, you can view the directed acyclic graph (DAG) for your pipeline and manage your executions using Amazon SageMaker Studio. Using SageMaker Studio, you can get information about your current and historical pipelines, compare executions, see the DAG for your executions, get metadata information, and more.
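
The same execution history is also available programmatically. Below is a small sketch using boto3, assuming a pipeline named MyModelBuildingPipeline already exists (the name is a placeholder).

```python
# A sketch of inspecting pipeline runs outside Studio via boto3.
# "MyModelBuildingPipeline" is a placeholder pipeline name.
import boto3

sm = boto3.client("sagemaker")

# List current and historical executions of the pipeline
executions = sm.list_pipeline_executions(PipelineName="MyModelBuildingPipeline")
for summary in executions["PipelineExecutionSummaries"]:
    print(summary["PipelineExecutionArn"], summary["PipelineExecutionStatus"])
```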

The pipeline that you create follows a typical machine learning (ML) application pattern of preprocessing, training, evaluation, model creation, batch transformation, and model registration.
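
As a sketch of that pattern, individually defined steps can be assembled into a single pipeline and executed. The step variables here are hypothetical stand-ins for steps built as shown earlier.

```python
# A sketch of assembling previously defined steps into one pipeline.
# step_process, step_train, step_eval, and step_register are assumed to be
# step objects built with the SDK, as in the earlier sketch; role is the
# execution role ARN placeholder defined there.
from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name="MyModelBuildingPipeline",
    steps=[step_process, step_train, step_eval, step_register],
)

pipeline.upsert(role_arn=role)   # create or update the pipeline definition
execution = pipeline.start()     # kick off an execution
execution.wait()                 # block until the run finishes
```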

SageMaker Projects

SageMaker projects build on SageMaker Pipelines by providing several MLOps templates that automate model building and deployment pipelines using continuous integration and continuous delivery (CI/CD). SageMaker Projects help organizations set up and standardize developer environments for data scientists and CI/CD systems for MLOps engineers. Projects also help organizations set up dependency management, code repository management, build reproducibility, and artifact sharing.

With SageMaker Projects, MLOps engineers and organization admins can define their own templates or use SageMaker-provided templates. The SageMaker-provided templates bootstrap the ML workflow with source version control, automated ML pipelines, and a set of code to quickly start iterating over ML use cases.
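
Projects can also be created programmatically from a template. The sketch below uses boto3; the Service Catalog product and artifact IDs are placeholders you would look up for your chosen template.

```python
# A sketch of creating a SageMaker Project from a template via boto3.
# The Service Catalog IDs below are placeholders for a real template.
import boto3

sm = boto3.client("sagemaker")
sm.create_project(
    ProjectName="my-mlops-project",
    ProjectDescription="Model build/deploy pipeline from a SageMaker-provided template",
    ServiceCatalogProvisioningDetails={
        "ProductId": "prod-EXAMPLE123",             # placeholder template product ID
        "ProvisioningArtifactId": "pa-EXAMPLE123",  # placeholder template version ID
    },
)
```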

Step Types

SageMaker Pipelines provides many predefined step types, such as the following (a training-step sketch appears after the list):

  1. Data processing steps
  2. Model training steps
  3. Model tuning steps
  4. Batch scoring steps
  5. Callback steps
  6. Lambda steps
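
For instance, a model training step might look like the following sketch; the container image URI, role ARN, and S3 paths are placeholders.

```python
# A sketch of a model-training step, one of the predefined step types.
# The image URI, role ARN, and S3 paths are placeholders.
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep

role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

estimator = Estimator(
    image_uri="<training-image-uri>",  # placeholder algorithm/framework image
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    output_path="s3://my-bucket/model/",
)

step_train = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"train": TrainingInput(s3_data="s3://my-bucket/train/")},
)
```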

Activities in SageMaker Pipelines

SageMaker Pipelines supports the following activities:

  1. Pipeline - A DAG of steps and conditions to orchestrate SageMaker jobs and resource creation.
  2. Processing job steps - A simplified, managed experience on SageMaker to run data processing workloads, such as feature engineering, data validation, model evaluation, and model interpretation.
  3. Training job steps - An iterative process that teaches a model to make predictions by presenting examples from a training dataset.
  4. Conditional execution steps - A step that provides conditional execution of branches in a pipeline.
  5. Register model steps - A step that creates a model package resource in the Model Registry that can be used to create deployable models in Amazon SageMaker.
  6. Create model steps - A step that creates a model for use in transform steps or later publication as an endpoint.
  7. Transform job steps - A batch transform to preprocess datasets to remove noise or bias that interferes with training or inference from a dataset, get inferences from large datasets, and run inference when a persistent endpoint is not needed.
  8. Fail steps - A step that stops a pipeline execution and marks the pipeline execution as failed.
  9. Parametrized pipeline executions - Enables variation in pipeline executions according to specified parameters (see the sketch after this list).
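
As a sketch of that last activity, parameters are declared on the pipeline definition and then overridden per execution. Here, pipeline refers to the object from the assembly sketch earlier, and the parameter names are illustrative.

```python
# A sketch of parametrized pipeline executions (activity 9 above).
# "pipeline" is the Pipeline object from the earlier assembly sketch.
from sagemaker.workflow.parameters import ParameterInteger, ParameterString

# Parameters are declared once and passed to the definition via
# Pipeline(parameters=[instance_count, input_data], steps=[...]).
instance_count = ParameterInteger(name="ProcessingInstanceCount", default_value=1)
input_data = ParameterString(name="InputDataUrl", default_value="s3://my-bucket/raw/")

# Defaults can then be overridden per run, without changing the definition
execution = pipeline.start(
    parameters={"ProcessingInstanceCount": 2, "InputDataUrl": "s3://my-bucket/new-raw/"}
)
```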

Benefits of a Machine Learning Pipeline

MLOps is a machine learning engineering culture and methodology that brings together the development of machine learning systems (Dev) and their operation (Ops). Adopting MLOps means advocating for automation and monitoring at every step of building an ML system, including integration, testing, releasing, deployment, and infrastructure management. Some advantages are:

  1. Predict continuously - An integrated machine learning pipeline can process a constant stream of raw data collected over time, unlike a one-time model. This lets you move machine learning from the lab to the real world, building a continuously learning process that adapts to new data and generates up-to-date decisions for real-time automation at scale.
  2. Get started faster - Automating every phase of the machine learning pipeline lets teams get started more quickly and cheaply than their rivals. MLOps also lays the groundwork for iterating on and building toward the machine learning objectives. You can stand up a new machine learning pipeline in a short amount of time once data is streaming into your database.
  3. Any team can access it - Automating the most difficult parts and wrapping the rest in a simple interface puts ML in the hands of the business owners who can actually use the forecasts, freeing up the data analysis team to work on bespoke modelling.

Practical Implementations

  • Students enrolling in any AI-related course from Carnegie Training Institute have access to practical, working implementation guidelines.

Sources

  1. AWS Documentation
