Introduction to Extract, Transform, and Load (ETL) with AWS Glue

How exactly is big data extracted from its various sources and transformed into formats and shapes that data analysts, data scientists, and machine learning engineers can easily consume? This is the main job of AWS Glue, which makes that transformation seamless.

What is AWS Glue?

AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. It extracts data from a source, transforms it into the right shape for consumption, and loads it back into storage for access. You can discover and connect to over 70 diverse data sources, manage your data in a centralized Data Catalog, and visually create, run, and monitor ETL pipelines to load data into your data lakes. Because no servers need to be provisioned or managed, data preparation becomes simpler, faster, and cheaper.
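Conceptually, every ETL job runs through the three stages in the name. The plain-Python sketch below is not Glue code (Glue generates PySpark scripts for you); it only illustrates the extract-transform-load pattern on a tiny CSV sample:

```python
import csv
import io
import json

def extract(raw_csv: str) -> list[dict]:
    """Extract: parse raw CSV rows into dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows: list[dict]) -> list[dict]:
    """Transform: normalize names and cast amounts so analysts can consume them."""
    return [
        {"customer": r["Customer"].strip().title(), "amount": float(r["Amount"])}
        for r in rows
    ]

def load(rows: list[dict]) -> str:
    """Load: serialize to JSON Lines, a shape a warehouse loader can ingest."""
    return "\n".join(json.dumps(r) for r in rows)

raw = "Customer,Amount\n alice ,19.99\n BOB ,5.00\n"
print(load(transform(extract(raw))))
```

In a real Glue job, the same three stages read from and write to cataloged data stores such as Amazon S3 or Amazon Redshift instead of in-memory strings.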

Components of AWS Glue

  1. Data Catalog: The Data Catalog holds the metadata and the structure of the data.
  2. Database: Used to create or access the databases for the sources and targets.
  3. Table: One or more tables in the database that can be used by the source and target.
  4. Crawler and Classifier: A crawler retrieves data from the source using built-in or custom classifiers. It creates or uses metadata tables that are pre-defined in the Data Catalog.
  5. Job: A job is the business logic that carries out an ETL task. Internally, it is written as an Apache Spark script in Python or Scala.
  6. Trigger: A trigger starts an ETL job on demand or at a scheduled time.
  7. Development endpoint: Creates a development environment where an ETL job script can be developed, tested, and debugged.
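Several of these components can be managed programmatically through the AWS SDK. As a sketch, the helper below assembles the arguments for boto3's `create_crawler` call; the crawler name, IAM role, database, and S3 path are all hypothetical placeholders:

```python
def build_crawler_config(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Assemble keyword arguments for the boto3 glue.create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,                    # IAM role the crawler assumes
        "DatabaseName": database,            # Data Catalog database to populate
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

if __name__ == "__main__":
    # boto3 is imported here so the helper above can be tested offline.
    import boto3

    glue = boto3.client("glue")
    # All names below are illustrative placeholders.
    glue.create_crawler(**build_crawler_config(
        "orders-crawler",
        "arn:aws:iam::123456789012:role/GlueCrawlerRole",
        "sales_db",
        "s3://example-bucket/orders/",
    ))
    glue.start_crawler(Name="orders-crawler")
```

Once the crawler finishes, the tables it inferred appear in the `sales_db` database of the Data Catalog.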

Event-Driven

AWS Glue can run your extract, transform, and load (ETL) jobs as new data arrives. For example, you can configure AWS Glue to initiate your ETL jobs as soon as new data becomes available in Amazon Simple Storage Service (Amazon S3).
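One common way to wire this up is an S3 event notification that invokes an AWS Lambda function, which in turn starts the Glue job. A minimal handler sketch, assuming a hypothetical job named `nightly-etl`:

```python
def object_keys_from_s3_event(event: dict) -> list[tuple[str, str]]:
    """Pull (bucket, key) pairs out of a standard S3 put-notification event."""
    return [
        (r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
        for r in event.get("Records", [])
    ]

def lambda_handler(event, context):
    # boto3 is available in the Lambda runtime; it is imported here so the
    # parser above can be tested without AWS credentials.
    import boto3

    glue = boto3.client("glue")
    for bucket, key in object_keys_from_s3_event(event):
        # "nightly-etl" is a hypothetical job name; the argument name is
        # whatever the job script expects.
        glue.start_job_run(
            JobName="nightly-etl",
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
```

Glue triggers and Amazon EventBridge rules can achieve the same effect without writing a Lambda function at all.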

AWS Glue Data Catalog

You can use the Data Catalog to quickly discover and search multiple AWS datasets without moving the data. Once the data is cataloged, it is immediately available for search and query using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
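For example, once a crawler has populated the catalog, its tables can be listed and filtered through the SDK. A small sketch, assuming a hypothetical `sales_db` database:

```python
def matching_tables(tables: list[dict], keyword: str) -> list[str]:
    """Filter catalog table metadata (as returned by glue.get_tables) by name."""
    return [t["Name"] for t in tables if keyword in t["Name"]]

if __name__ == "__main__":
    # boto3 is imported here so the filter above can be tested offline.
    import boto3

    glue = boto3.client("glue")
    # "sales_db" is an illustrative database name.
    pages = glue.get_paginator("get_tables").paginate(DatabaseName="sales_db")
    tables = [t for page in pages for t in page["TableList"]]
    print(matching_tables(tables, "orders"))
```

Any table found this way can be queried in place with Amazon Athena or Amazon Redshift Spectrum, without moving the underlying data.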

No-Code ETL Jobs

AWS Glue Studio makes it easier to visually create, run, and monitor AWS Glue ETL jobs. You can build ETL jobs that move and transform data using a drag-and-drop editor, and AWS Glue automatically generates the code.

Manage and Monitor Data Quality

AWS Glue Data Quality automates data quality rule creation, management, and monitoring to help ensure high quality data across your data lakes and pipelines.
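Rules are expressed in Glue's Data Quality Definition Language (DQDL). A small illustrative ruleset, with hypothetical column names:

```
Rules = [
    IsComplete "order_id",
    ColumnValues "status" in ["PENDING", "SHIPPED", "DELIVERED"],
    RowCount > 0
]
```

Rulesets like this can be attached to Data Catalog tables or evaluated inside ETL jobs, and failed rules can stop a pipeline before bad data propagates downstream.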

Data Preparation

With AWS Glue DataBrew, you can explore and experiment with data directly from your data lake, data warehouses, and databases, including Amazon S3, Amazon Redshift, AWS Lake Formation, Amazon Aurora, and Amazon Relational Database Service (RDS). You can choose from over 250 prebuilt transformations in DataBrew to automate data preparation tasks such as filtering anomalies, standardizing formats, and correcting invalid values.

Advantages of Using AWS Glue

  1. AWS Glue scans through all the available data with a crawler.
  2. The final processed data can be stored in many different places (Amazon RDS, Amazon Redshift, Amazon S3, etc.).
  3. It's a cloud service, so no money needs to be spent on on-premises infrastructure.
  4. It's cost-effective because it is a serverless ETL service.
  5. It's fast: it gives you Python/Scala ETL code right off the bat.

Practical Implementations

  • Students enrolling in any AI-related course from Carnegie Training Institute have access to practical, working implementation guidelines.

Sources

  1. AWS Glue FAQs
