Technology

Amazon SageMaker Data Wrangler

Amazon SageMaker Data Wrangler provides an end-to-end solution to import, prepare, transform, featurize, and analyze data. You can integrate a Data Wrangler data preparation flow into your machine learning (ML) workflows to simplify and streamline data pre-processing and feature engineering with little to no code. You can also add custom Python scripts to customize workflows.
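For the custom-script option, Data Wrangler's custom transform step accepts a Python (Pandas) snippet that receives the current dataset as a DataFrame named `df` and uses whatever you assign back to `df` as the result. A minimal sketch, assuming a hypothetical `price` column (the stand-in DataFrame below replaces the dataset Data Wrangler would supply):

```python
import pandas as pd

# Stand-in for the dataset that Data Wrangler exposes as `df`
# inside a custom transform step.
df = pd.DataFrame({"price": ["$10", "$25", None]})

# Strip the currency symbol and cast to a numeric dtype;
# missing values stay missing (NaN).
df["price"] = (
    df["price"]
    .str.replace("$", "", regex=False)
    .astype(float)
)
```

In the Data Wrangler UI you would paste only the transformation lines; reassigning `df` is what emits the transformed dataset to the next step in the flow.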

Core functionalities

  1. Import - connect to and import data from Amazon S3, Amazon Athena, Amazon Redshift, Snowflake, and Databricks
  2. Data Flow - create a data flow to define a series of ML data prep steps. Use the flow to:
    1. combine datasets from different data sources
    2. identify the number and type of transformations you want to apply to datasets
    3. define a data prep workflow that can be integrated into an ML pipeline
  3. Transform -
    1. clean and transform your dataset using standard transforms like string, vector, and numeric data formatting tools
    2. featurize your data using transforms like text and date/time embedding and categorical encoding
  4. Generate data insights - automatically verify data quality and detect anomalies in your data with the Data Wrangler Data Quality and Insights Report
  5. Analyze -
    1. analyze features in the dataset at any point in the flow
    2. Data Wrangler includes built-in data visualization tools like scatter plots and histograms
    3. Data Wrangler includes data analysis tools like target leakage analysis and quick modeling to understand feature correlation
  6. Export - export data preparation workflow to:
    1. Amazon S3 bucket
    2. Amazon SageMaker model building pipeline (using SageMaker Pipelines to automate model building and deployment)
    3. Amazon SageMaker Feature Store - store the features and their data in a centralized location
    4. Python script - export the data flow steps as a Python script for your custom workflows
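The featurize transforms listed above (categorical encoding, date/time features) can be sketched in plain Pandas, which is roughly what an exported Python script produces. The column names below are hypothetical:

```python
import pandas as pd

# Hypothetical dataset with one categorical and one timestamp column.
df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "signup": pd.to_datetime(["2024-01-15", "2024-06-01", "2024-12-31"]),
})

# Categorical encoding: one-hot encode the `color` column.
df = pd.get_dummies(df, columns=["color"], prefix="color")

# Date/time featurization: expand the timestamp into numeric parts
# a model can consume, then drop the raw timestamp.
df["signup_month"] = df["signup"].dt.month
df["signup_dayofweek"] = df["signup"].dt.dayofweek
df = df.drop(columns=["signup"])
```

In Data Wrangler itself these are point-and-click transforms; the code view is what you would see after exporting the flow for a custom workflow.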

Core activities

  1. Upload dataset to Amazon S3 and import
  2. Analyze the data using Data Wrangler Analysis
  3. Define data flow using Data Wrangler data transforms
    1. Prepare and visualize
    2. Data exploration
    3. Drop unused columns
    4. Clean up missing values
    5. Custom Pandas transform for encoding
    6. Custom SQL
    7. Save the flow
  4. Export flow to notebook
  5. Train a classifier
  6. Shut down Data Wrangler
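Steps 3 through 5 above can be sketched end to end in plain Python, which mirrors what the exported notebook runs. The dataset, column names, and choice of scikit-learn's `LogisticRegression` are all assumptions for illustration:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical dataset standing in for the data imported from S3.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],           # unused identifier column
    "age": [22, None, 35, 41, None, 29],
    "plan": ["basic", "pro", "basic", "pro", "basic", "pro"],
    "churned": [0, 1, 0, 1, 0, 1],      # target label
})

# Drop unused columns.
df = df.drop(columns=["id"])

# Clean up missing values by imputing the column median.
df["age"] = df["age"].fillna(df["age"].median())

# Custom Pandas encoding of the categorical column.
df["plan"] = df["plan"].map({"basic": 0, "pro": 1})

# Train a simple classifier on the prepared features.
X, y = df[["age", "plan"]], df["churned"]
model = LogisticRegression().fit(X, y)
```

Each commented step corresponds to one node in the Data Wrangler flow; the training step is what the exported notebook hands off to a SageMaker training job or, as here, a local estimator.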

Practical Implementations

  • Students enrolled in any AI-related course from Carnegie Training Institute have access to practical, working implementation guidelines

Sources

  1. Amazon SageMaker Data Wrangler documentation
