
Data Management

Types of data processing – OLTP and OLAP
Data warehouses and data marts
Data lakes
Data lakehouse
Data mesh
Apache Spark on the AWS cloud
AWS Glue
Querying data using AWS

A data warehouse is a centralized repository of data that has been gathered from various sources within an organization. Data marts are built on top of data warehouses, and users access these data marts for their queries. While data marts can be based on a star or snowflake schema, the star schema is generally preferred because it results in faster queries due to fewer joins.

A data lake can be defined as a centralized repository that allows you to store all structured and unstructured data at any scale. Organizations usually pick a replication tool such as AWS Database Migration Service (AWS DMS) to bring the data into the data lake. Organizations may also use a push mechanism such as FTP to transfer files to an Amazon Simple Storage Service (S3)-based data lake using AWS Transfer Family.

The data lakehouse blurs the lines between data lakes and data warehouses by enabling the atomicity, consistency, isolation, and durability (ACID) properties on the data in the data lake and by allowing multiple processes to concurrently read and write data. It is a new arrangement that does not try to force unstructured data into the strict models of a data warehouse.

Data mesh is about serving data as a product and setting clear ownership of that product. This thought process led to the creation of the data mesh.
Figure 1.1 – Overview of Apache Spark’s workload execution

The Cluster Manager can be Spark’s standalone cluster manager, Apache Mesos, Apache Hadoop Yet Another Resource Negotiator (YARN), or Kubernetes.

In client mode, the driver program runs on the machine that submitted the Spark job. In cluster mode, the driver program runs on one of the worker nodes inside the cluster.

To execute a job, an execution plan must be created based on a Directed Acyclic Graph (DAG). The DAG scheduler converts the logical execution plan into a physical execution plan.

Once Spark acquires the executors, SparkContext sends the tasks to the executors to perform.

Spark also has a component called Spark SQL, which allows users to write SQL queries for data transformation. Spark SQL is powered by the Catalyst optimizer and the Tungsten execution engine.
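As a brief, hedged illustration (not part of the original notes; the file path and column name are placeholders), a Spark SQL query can be run against a DataFrame that has been registered as a temporary view:

from pyspark.sql import SparkSession

# Entry point for Spark SQL
spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

# Read a sample CSV dataset into a DataFrame (placeholder path)
trips = spark.read.option("header", "true").csv("s3://example-bucket/trips/")

# Register the DataFrame as a temporary view so it can be queried with SQL
trips.createOrReplaceTempView("trips")

# The SQL query is planned by the Catalyst optimizer and executed by Tungsten
result = spark.sql(
    "SELECT passenger_count, COUNT(*) AS trip_count FROM trips GROUP BY passenger_count"
)
result.show()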
AWS Glue
On August 14, 2017, AWS released a new service called AWS Glue. AWS Glue is a serverless data integration service. AWS Glue also provides some easy-to-use features that almost eliminate the administrative overhead of infrastructure management and simplify how common data integration tasks are performed.

Let’s look at some of the notable components of the AWS Glue feature set:

AWS Glue DataBrew: Glue DataBrew is used for data cleansing and enrichment through a GUI. Creating AWS Glue DataBrew jobs does not require the user to write any source code; the jobs are created with the help of the GUI.

AWS Glue Data Catalog: AWS Glue Data Catalog is a central catalog of metadata that can be used with other AWS services such as Amazon Athena, Amazon Redshift, and Amazon EMR.

AWS Glue Connections: Glue Connections are catalog objects that help organize and store connection information for various data stores. AWS Glue Connections can also be created for Marketplace AWS Glue Connectors, which allow you to integrate with third-party data stores, such as Apache Hudi, Google BigQuery, and Elasticsearch.

AWS Glue Crawlers: Crawlers can be used to crawl existing data and populate an AWS Glue Data Catalog with metadata.

AWS Glue ETL Jobs: Glue ETL jobs enable users to extract source data from various data stores, process it, and write output to a data target based on the logic defined in the ETL script. Users can take advantage of Apache Spark-based ETL jobs to handle their workload in a distributed fashion. Glue also offers Python shell jobs for ETL workloads that don’t need distributed processing.

AWS Glue Interactive Sessions: Interactive sessions are managed interactive environments that can be used to develop and test AWS Glue ETL scripts.

AWS Glue Schema Registry: AWS Glue Schema Registry allows users to centrally control data stream schemas and has integrations with Apache Kafka, Amazon Kinesis, and AWS Lambda.

AWS Glue Triggers: AWS Glue Triggers are data catalog objects that allow us to either manually or automatically start executing one or more AWS Glue Crawlers or AWS Glue ETL jobs.

AWS Glue Workflows: Glue Workflows can be used to orchestrate the execution of a set of AWS Glue jobs and AWS Glue Crawlers using AWS Glue Triggers.

AWS Glue Blueprints: Blueprints are useful for creating parameterized workflows that can be created and shared for similar use cases.

AWS Glue Elastic Views: Glue Elastic Views helps users replicate data from one store to another using familiar SQL syntax.
Querying data using AWS

Data analysts may want to access the data and combine it even before the data starts hydrating
the data lake.
Amazon Athena and Amazon Redshift allow you to query data across multiple data stores.
While using Amazon Athena to query S3 data cataloged in AWS Glue Data Catalog is quite common, Amazon Athena can also be used to query data from Amazon CloudWatch Logs, Amazon DynamoDB, Amazon DocumentDB, Amazon RDS, and JDBC-compliant relational data sources such as MySQL and PostgreSQL (under the Apache 2.0 license) using AWS Lambda-based data source connectors. The Athena Query Federation SDK can be used to write a custom connector too. These connectors return data in the Apache Arrow format. Amazon Athena uses these connectors and manages parallelism, along with predicate pushdown.
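As a hedged sketch (not from the original notes; the query, database, and S3 output location are placeholders), a query can be submitted to Amazon Athena programmatically with boto3:

import time
import boto3

athena = boto3.client('athena')

# Submit the query; Athena writes the results to the S3 output location
response = athena.start_query_execution(
    QueryString='SELECT * FROM my_table LIMIT 10',                      # placeholder query
    QueryExecutionContext={'Database': 'my_glue_database'},             # placeholder database
    ResultConfiguration={'OutputLocation': 's3://my-athena-results-bucket/output/'}
)
query_id = response['QueryExecutionId']

# Poll until the query finishes, then fetch the result rows
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

if state == 'SUCCEEDED':
    rows = athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']
    print(rows)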
Similarly, Amazon Redshift also supports querying Amazon S3 data through Amazon Redshift
Spectrum. Redshift also supports querying data in Amazon RDS for PostgreSQL, Amazon Aurora
PostgreSQL-Compatible Edition, Amazon RDS for MySQL, and Amazon Aurora MySQL-Compatible
Edition through its Query Federation feature.
To handle this undifferentiated heavy lifting, AWS Glue introduced a new feature called AWS Glue Elastic Views. It allows users to combine and materialize data from various sources into a target using familiar SQL. Since AWS Glue Elastic Views is serverless, users do not have to worry about managing the underlying infrastructure or keeping the target hydrated.
Data integration is a complex operation that involves several tasks – data discovery, ingestion, preparation, transformation, and replication.

AWS Glue was initially introduced as a serverless ETL service that allows users to crawl, catalog, transform, and ingest data into AWS for analytics. However, over the years, it has evolved into a fully managed serverless data integration service.

Data discovery
AWS Glue Data Catalog can be used to discover and search data across all our datasets.

Data ingestion
AWS Glue makes it easy to ingest data from several standard data stores, such as HDFS, Amazon S3, and JDBC data stores. It also allows data to be ingested from SaaS and custom data stores via custom and Marketplace connectors.

Data preparation
AWS Glue enables us to de-duplicate and cleanse data with built-in ML capabilities using its FindMatches feature. AWS Glue DataBrew provides an interactive visual interface for cleaning and normalizing data without writing code, and it offers built-in capabilities to define data quality rules and profile data based on our requirements.

Data replication
The Elastic Views feature of AWS Glue enables us to create views of data stored in different AWS data stores and materialize them in a target data store of our choice. We can create materialized views by using PartiQL to write queries. AWS Glue Elastic Views continuously monitors changes in our dataset and updates the target data stores automatically.
The following are the key features of AWS Glue:

AWS Glue Data Catalog


AWS Glue Connections
AWS Glue Crawlers and Classifiers
AWS Glue Schema Registry
AWS Glue Jobs
AWS Glue Notebooks and interactive sessions
AWS Glue Triggers
AWS Glue Workflows
AWS Glue Blueprints
AWS Glue ML
AWS Glue Studio
AWS Glue DataBrew
AWS Glue Elastic Views
AWS Glue Data Catalog
Figure 2.2 – Structure of AWS Glue Data Catalog

A Data Catalog can be defined as an inventory of data assets in an organization that helps data professionals find and understand relevant datasets to extract business value. Each AWS account has one Glue Data Catalog per AWS region, identified by a combination of catalog_id and aws_region.

AWS Glue Data Catalog is comprised of the following components:

Databases
Tables
Partitions

It is important to note that AWS Glue supports versioning of catalog tables. Once a table has been created, the names, data types, and the order of the keys registered as part of the partition index cannot be modified.
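As a hedged sketch (the database name and its contents are placeholders, not from the original notes), catalog objects can be created and inspected programmatically with boto3:

import boto3

glue = boto3.client('glue')

# Create a database in the Glue Data Catalog of the current account/region
glue.create_database(DatabaseInput={
    'Name': 'sales_db',                      # placeholder database name
    'Description': 'Example catalog database'
})

# List the tables registered under the database (crawlers or ETL jobs populate these)
for table in glue.get_tables(DatabaseName='sales_db')['TableList']:
    print(table['Name'], table.get('StorageDescriptor', {}).get('Location'))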
Glue connections
AWS Glue connections are resources stored in AWS Glue Data Catalog that contain
connection information for a given data store. Typically, an AWS Glue connection contains
information such as login credentials, connection strings (URIs), and VPC configuration (VPC
subnet and security group information), which are required by different AWS Glue resources
to connect to the data store.
At the time of writing, there are eight types of Glue connections, each of which is designed
to establish a connection with a specific type of data store: JDBC, Amazon RDS, Amazon
Redshift, Amazon DocumentDB, MongoDB, Kafka, Network, and Custom/Marketplace
connections.
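The following is a hedged boto3 sketch (the connection name, JDBC URL, credentials, and VPC settings are placeholders) of how a JDBC-type Glue connection can be created:

import boto3

glue = boto3.client('glue')

# Create a JDBC connection; all values below are placeholders
glue.create_connection(ConnectionInput={
    'Name': 'my-rds-connection',
    'ConnectionType': 'JDBC',
    'ConnectionProperties': {
        'JDBC_CONNECTION_URL': 'jdbc:mysql://my-rds-endpoint:3306/mydb',
        'USERNAME': 'glue_user',
        'PASSWORD': 'example-password'       # in practice, prefer AWS Secrets Manager
    },
    'PhysicalConnectionRequirements': {
        'SubnetId': 'subnet-0123456789abcdef0',
        'SecurityGroupIdList': ['sg-0123456789abcdef0'],
        'AvailabilityZone': 'us-east-1a'
    }
})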
When a Glue connection is attached to any Glue compute resource (Jobs, Crawlers,
development endpoints, and interactive sessions), behind the scenes, Glue creates EC2
Elastic Network Interfaces (ENIs) with the VPC configuration (subnet and security groups)
specified by the user. These ENIs are then attached to compute resources on the server side.
This mechanism is used by AWS Glue to communicate with VPC-based or on-premises data stores.
Here, when a user makes an API call to execute the AWS Glue workload, the request is submitted to the AWS Glue workload orchestration system, which calculates the amount of compute resources required and allocates workers from the worker node fleet.
If the workload being executed requires VPC connectivity, ENIs are created in the end user’s AWS account and are attached to the worker nodes.

At the time of writing, a Glue resource can only use one subnet. If multiple connections with different
subnets are attached, the subnet settings from the first connection will be used by default. However, if the
first connection is unhealthy for any reason – for instance, if the availability zone is down – then the next
connection is used.

Figure 2.3 – VPC-based data store access from AWS Glue using ENIs

NOTE

It is important to make sure that the subnet being used by the Glue connection has enough IP addresses available, as each Glue resource creates multiple ENIs (each of which consumes one IP address) based on the compute capacity required for workload execution.
AWS Glue crawlers
Figure 2.4 – Workflow of a Glue crawler
A Crawler is a component of AWS Glue that helps crawl the data in different
types of data stores, infers the schema, and populates AWS Glue Data Catalog
with the metadata for the dataset that was crawled.
Crawlers can crawl a wide variety of data stores – Amazon S3, Amazon
Redshift, Amazon RDS, JDBC, Amazon DynamoDB, and
DocumentDB/MongoDB to name a few.

For a crawler to crawl a VPC resource or on-premises data stores such as Amazon Redshift, JDBC data stores (including Amazon RDS data stores), and Amazon DocumentDB (MongoDB compatible), a Glue connection is required. Crawlers are capable of crawling S3 buckets without using Glue connections. However, a Network connection type is required if you must keep S3 request traffic off the public internet.

For a crawler with a Glue connection, it is recommended to have at least 16 IP addresses available in the subnet. When a connection is attached to a Glue resource, multiple ENIs are created to run the workload.
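As a hedged sketch (the role ARN, bucket path, and names are placeholders, not from the original notes), a crawler for an S3 path can be created and started with boto3:

import boto3

glue = boto3.client('glue')

# Create a crawler for an S3 path; role, bucket, and database names are placeholders
glue.create_crawler(
    Name='sales-s3-crawler',
    Role='arn:aws:iam::123456789012:role/GlueCrawlerRole',
    DatabaseName='sales_db',
    Targets={'S3Targets': [{'Path': 's3://my-data-lake/sales/'}]},
    TablePrefix='raw_'
)

# Start the crawl; the crawler infers the schema and populates the Data Catalog
glue.start_crawler(Name='sales-s3-crawler')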

NOTE
At the time of writing, the maximum runtime for any crawler is 24 hours. After 24 hours, the crawler’s run is automatically stopped with the CANCELLED status.

The workflow of a crawl can be divided into three stages: classification, grouping, and output.
For Amazon S3 data store crawls, the crawler will read all the files in the specified path by default. The crawler classifies each of the files available in the S3 path and persists the metadata to the crawler’s service-side storage (not to be confused with AWS Glue Data Catalog). During subsequent crawler runs, this metadata is reused, new files are crawled, and the metadata stored on the service side is updated as necessary.

For the JDBC, Amazon DynamoDB, and Amazon DocumentDB (with MongoDB compatibility) data stores, the
stages of the crawler workflow are the same, but the logic that’s used for classification and clustering is
different for each data store type. The classification of the table(s) is decided based on the data store
type/database engine.
For JDBC data stores, Glue connects to the database server, and the schema is obtained for the tables that
match the include path value in the crawler settings.
AWS Glue crawlers have several key features and configuration options:

Data sampling – DynamoDB and DocumentDB/MongoDB
Data sampling – Amazon S3
Amazon S3 data store – incremental crawl
Amazon S3 data store – table-level specification
Custom classifiers (classifiers are responsible for determining a file’s classification string – for example, parquet – and the schema of the file), including:
Grok classifiers
XML classifiers
CSV classifiers
AWS Glue Schema Registry
Schema registries can be used to address the issues caused by schema evolution and allow streaming
data producers and consumers to discover and manage schema changes, as well as adapt to these
changes based on user settings.
The Glue Schema Registry (GSR) is a feature available in AWS Glue that allows users to discover, control, and evolve schemas for streaming data stores centrally. The Glue Schema Registry supports integrations with a wide variety of streaming data stores, such as Apache Kafka, Amazon Kinesis Data Streams, Amazon Managed Streaming for Apache Kafka (MSK), Amazon Kinesis Data Analytics for Apache Flink, and AWS Lambda, by allowing users to enforce and manage schemas.
At the time of writing, GSR supports the AVRO, JSON, and protocol buffer (protobuf) data formats for
schemas.
Note
The AWS Glue Schema Registry currently supports the Java programming language. Java version 8 (or
above) is required for both producers and consumers.
SerDe libraries can be added to both producer and consumer applications by adding the
software.amazon.glue:schema-registry-serde Maven dependency
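While the SerDe libraries are Java-based, registries and schemas themselves can also be managed with the AWS SDK for Python; the following is a hedged sketch with placeholder registry, schema, and field names:

import boto3

glue = boto3.client('glue')

# Create a registry and register an AVRO schema with BACKWARD compatibility
glue.create_registry(RegistryName='streaming-registry')

glue.create_schema(
    RegistryId={'RegistryName': 'streaming-registry'},
    SchemaName='orders-value',
    DataFormat='AVRO',
    Compatibility='BACKWARD',
    SchemaDefinition='{"type":"record","name":"Order","fields":[{"name":"order_id","type":"string"}]}'
)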
AWS Glue ETL jobs
At the time of writing, Glue allows users to create three different types of ETL jobs – Spark ETL, Spark
Streaming, and Python shell jobs. The key differences between these job types are in the libraries/packages
that are injected into the environment during job orchestration on the service side and billing practices.

During job creation, users can use the AWS Glue wizard to generate an ETL script for Spark and Spark
Streaming ETL jobs by choosing the source, destination, column mapping, and connection information.
However, for Python shell jobs, the user will have to provide a script. At the time of writing, Glue ETL supports
Scala 2 and PySpark (Java and R jobs are currently not supported) for Spark and Spark Streaming jobs and
Python 3 for Python shell jobs.

NOTE
AWS Glue allows multiple connections to be attached to ETL jobs. However, it is important to note that a Glue job can use only one subnet for VPC jobs. If multiple connections are attached to a job, only the first connection’s subnet settings are used by the ETL job.

There are some advanced features that users can select during job creation, such as job bookmarks,
continuous logging, Spark UI, and capacity settings (the number of workers and worker type). Glue allows
users to inject several job parameters (including Spark configuration parameters) so that they can alter the
default Spark behavior.
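As a hedged sketch (the role ARN, script location, and capacity values are placeholders, not from the original notes), an ETL job with some of these advanced options can be created programmatically:

import boto3

glue = boto3.client('glue')

# Create a Spark ETL job; role, script location, and capacity settings are placeholders
glue.create_job(
    Name='csv-to-json-job',
    Role='arn:aws:iam::123456789012:role/GlueJobRole',
    Command={
        'Name': 'glueetl',                               # 'gluestreaming' or 'pythonshell' for the other job types
        'ScriptLocation': 's3://my-glue-scripts/csv_to_json.py',
        'PythonVersion': '3'
    },
    DefaultArguments={
        '--job-bookmark-option': 'job-bookmark-enable',  # advanced feature: job bookmarks
        '--enable-continuous-cloudwatch-log': 'true'     # advanced feature: continuous logging
    },
    GlueVersion='3.0',
    WorkerType='G.1X',
    NumberOfWorkers=10
)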
Glue Spark extensions / APIs
Glue ETL introduces quite a lot of
advanced Spark extensions/APIs
and transformations to make it
easy to achieve complex ETL
operations.

GlueContext
DynamicFrame
Job bookmarks
GlueParquet
Glue development endpoints
This feature allows users to create
an environment for Glue ETL
development wherein the
developer/data engineer can use
Notebook environments
(Jupyter/Zeppelin), read-eval-print
loop (REPL) shells, or IDEs to
develop ETL scripts and test them
instantly using the endpoint.
AWS Glue interactive sessions
Interactive sessions make it easier for users to access the session
from Jupyter notebook environments hosted anywhere (the
notebook server can be running locally on a user workstation as
well) with minimal configuration.
Triggers
Triggers are Glue Data Catalog objects that can be used to
start (manually or automatically) one or more crawlers or
ETL jobs. Triggers allow users to chain crawlers and ETL jobs
that depend on each other.

There are three types of triggers:

On-demand triggers: These triggers allow users to start one or more crawlers or ETL jobs by activating the trigger. This can be done manually or via an event-driven API call.
Scheduled triggers: These time-based triggers are fired
based on a specified cron expression.
Conditional triggers: Conditional triggers fire when the
previous job(s)/crawler(s) satisfy the conditions specified.
Conditional triggers watch the status of the jobs/crawlers
specified – success, failed, timeout, and so on. If the list of
conditions specified is satisfied, the trigger is fired.
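The following is a hedged boto3 sketch (trigger and job names are placeholders) showing a scheduled trigger and a conditional trigger:

import boto3

glue = boto3.client('glue')

# Scheduled trigger: start a job every day at 02:00 UTC (names are placeholders)
glue.create_trigger(
    Name='nightly-etl-trigger',
    Type='SCHEDULED',
    Schedule='cron(0 2 * * ? *)',
    Actions=[{'JobName': 'csv-to-json-job'}],
    StartOnCreation=True
)

# Conditional trigger: start a downstream job only after the upstream job succeeds
glue.create_trigger(
    Name='downstream-trigger',
    Type='CONDITIONAL',
    Predicate={'Conditions': [{
        'LogicalOperator': 'EQUALS',
        'JobName': 'csv-to-json-job',
        'State': 'SUCCEEDED'
    }]},
    Actions=[{'JobName': 'aggregate-job'}],
    StartOnCreation=True
)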
Chapter 3: Data Ingestion
In the previous chapter, we discussed the fundamental concepts and inner workings of the various
features/microservices that are available in AWS Glue, such as Glue Data Catalog, connections,
crawlers, and classifiers, the schema registry, Glue ETL jobs, development endpoints, interactive
sessions, and triggers. We also explored how AWS Glue crawlers aid in data discovery by crawling different types of data stores – Amazon S3, JDBC (Amazon RDS or on-premises databases), and DynamoDB/MongoDB/DocumentDB – inferring the schema and populating AWS Glue Data Catalog. While
discussing Glue ETL in the previous chapter, we introduced a few of the important
extensions/features of Spark ETL, including GlueContext, DynamicFrame, JobBookmark, and
GlueParquet. In this chapter, we will see them in action by looking at some examples.

In this chapter, we will be discussing some of the components of AWS Glue mentioned in the
previous paragraph – specifically Glue ETL jobs, the schema registry, and Glue custom/Marketplace
connectors – in further detail and exploring data ingestion use cases, such as ingesting data from
file/object stores, JDBC-compatible data stores, streaming data sources, and SaaS data stores, to
demonstrate the capabilities of Glue.
AWS Glue supports three different types of ETL jobs – Spark, Spark Streaming, and a
Python Shell.

Each of these job types is designed to handle a specific type of workload and the
environment in which the workload is executed varies, depending on the type of ETL job.
For instance, Python Shell jobs allow users to execute Python scripts as a shell in AWS
Glue. These jobs run on a single host on the server side. Spark/Spark Streaming ETL, on
the other hand, allows you to execute PySpark/Scala-based ETL jobs in a distributed
environment and allows users to take advantage of Spark libraries to execute ETL
workloads.

In the upcoming sections, we will explore how Glue ETL can be used to ingest data from
different data stores, including file/object stores, HDFS, JDBC data stores, Spark
Streaming data sources, and SaaS data stores.
• Data ingestion from file/object stores
• Data ingestion from JDBC data stores
• Data ingestion from streaming data sources
• Data ingestion from SaaS data stores
Data ingestion from file/object stores
This is one of the most common use cases for Glue ETL, where the source data is already available in file
storage or cloud-based object stores. Here, depending on the type of job being executed, the methods or
libraries used to access the data store differ.

There are several file/object storage services available today – Amazon S3, HDFS, Azure Storage, Google Cloud
Storage, IBM Cloud Object Storage, FTP, SFTP, and HTTP(s) to name a few.

Most organizations already have some mechanism to move data to Amazon S3, typically by using the AWS
CLI/SDKs directly, AWS Transfer Family (https://aws.amazon.com/aws-transfer-family/), or some other third-
party tools.

If we are using Python Shell jobs, we can take advantage of several Python packages that allow us to connect to the desired file storage. If we wish to read an object from Amazon S3, we can use the Amazon S3 Boto3 client to get and read objects using Python packages/functions (for example, native Python functions and pandas), depending on the file format.
The following code snippet can be used with an AWS Glue Python Shell ETL job to read a CSV file from an Amazon S3 bucket, transform the file from CSV into JSON, and write the output to another Amazon S3 location:

import boto3, io, pandas as pd

client = boto3.client('s3')  # AWS Python SDK

# nyc-tlc - https://registry.opendata.aws/
src_bucket = 'nyc-tlc'  # SOURCE_S3_BUCKET_NAME
target_bucket = 'TARGET_S3_BUCKET_NAME'

src_object = client.get_object(
    Bucket=src_bucket,
    Key='trip data/yellow_tripdata_2021-07.csv'
)

# Read CSV and transform to JSON
df = pd.read_csv(src_object['Body'])
jsonBuffer = io.StringIO()
df.to_json(jsonBuffer, orient='records')

# Write JSON to target location
client.put_object(
    Bucket=target_bucket,
    Key='target_prefix/data.json',
    Body=jsonBuffer.getvalue()
)

The same source code can be executed in several ways in AWS – on any Amazon EC2 instance with Python installed, in an AWS Lambda function, or within a Docker container using Amazon ECR/AWS Batch, to name a few.

“Why should we use AWS Glue Python Shell jobs over AWS Lambda?”

While AWS Lambda and AWS Glue Python Shell jobs are both capable of running Python scripts, Python Shell jobs are designed for ETL workloads and can be orchestrated more easily with other Glue components such as crawlers, Spark jobs, and Glue triggers using AWS Glue workflows.

AWS Lambda functions can use a maximum of 512 MB of storage space (the /tmp directory), up to 10,240 MB of memory, and up to 6 vCPUs – the functions can run for up to a maximum of 15 minutes.

Glue Python Shell jobs, on the other hand, use the concept of Data Processing Units (DPUs) for capacity allocation, where one DPU provides four vCPUs and 16 GB of memory. Users can use either 0.0625 DPU or 1 DPU of capacity for Python Shell jobs. Essentially, a Python Shell job can use up to four vCPUs and 16 GB of memory, and the user can configure the timeout value for Python Shell jobs (the default is 48 hours). At the time of writing, Glue Python Shell jobs are allocated 20 GB of disk space by default, though this may change in the future.
Now, let’s consider the same ETL operation we performed in the previous script but using AWS Glue Spark
ETL instead.
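The original Spark ETL script is not reproduced in these notes; the following is a minimal sketch, assuming a Glue Spark ETL job and reusing the same placeholder bucket names, of how the same CSV-to-JSON conversion could look with GlueContext and DynamicFrames:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the source CSV from Amazon S3 as a DynamicFrame
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={'paths': ['s3://nyc-tlc/trip data/']},
    format='csv',
    format_options={'withHeader': True}
)

# Write the data back out as JSON to the target bucket (placeholder path)
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type='s3',
    connection_options={'path': 's3://TARGET_S3_BUCKET_NAME/target_prefix/'},
    format='json'
)

job.commit()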

Schema flexibility
Since DynamicRecords are self-describing, a
schema is computed on the fly and there is no
need to perform an additional pass over the
source data.
Advanced options for managing schema conflicts
DynamicFrames make it easier to handle schema conflicts by introducing ChoiceType whenever a schema conflict is encountered
instead of defaulting to the most compatible data type (usually, this is StringType). For instance, if one of the columns has
integer/long values and string values, Spark infers it as StringType by default. However, Glue creates ChoiceType and allows the user
to resolve the conflict.

Let’s consider an example where the Provider Id column has StringType and numeric (LongType) values. When the data is read using
a Spark DataFrame, the column will be inferred as StringType by Spark:

root
|-- ColumnA: string (nullable = true)
|-- ColumnB: string (nullable = true)
Now, when the same dataset is read using AWS Glue DynamicFrames, the column is represented with ChoiceType and lets the user
decide how to resolve the type conflict:
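The DynamicFrame schema listing is not reproduced in these notes; the following is an illustrative sketch (assuming a GlueContext named glueContext and a placeholder S3 path) of the choice type and how it can be resolved with resolveChoice:

# Assumes a GlueContext named glueContext, as in the Spark ETL sketch shown earlier
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={'paths': ['s3://example-bucket/provider-data/']},  # placeholder path
    format='csv',
    format_options={'withHeader': True}
)

# dyf.printSchema() would show something like (illustrative, not captured output):
# root
# |-- Provider Id: choice
# |    |-- long
# |    |-- string

# Resolve the conflict explicitly - here by casting the column to long
resolved_dyf = dyf.resolveChoice(specs=[('Provider Id', 'cast:long')])

# Alternatives: 'make_cols' splits the column into separate typed columns,
# 'make_struct' keeps both values in a struct, 'project:long' drops non-matching values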

“What is the advantage of using Glue DynamicFrame over a Spark DataFrame?”

As discussed in the previous chapter, DynamicFrames are structurally different from Spark DataFrames since they
have several optimizations enabled under the hood.

Under the hood, AWS Glue ETL (Spark) uses an EMRFS (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-
fs.html) driver by default to read from Amazon S3 data stores when the path begins with the s3:// URI scheme (class:
com.amazon.ws.emr.hadoop.fs.EmrFileSystem), regardless of whether Apache Spark DataFrames or AWS Glue
DynamicFrames are used. The EMRFS driver was originally developed for Amazon EMR and has since been adopted by AWS Glue for Amazon S3 reads and writes from Glue Spark ETL.
AWS Glue-specific ETL transformations and extensions

Unbox, SplitFields, ResolveChoice, and Relationalize are some examples. See https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python.html for an exhaustive list of transformations and extensions supported by DynamicFrames.
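As one hedged example (paths and names are placeholders, assuming a GlueContext named glueContext), Relationalize can flatten nested JSON into a collection of relational tables:

from awsglue.transforms import Relationalize

# Read nested JSON as a DynamicFrame (placeholder path)
nested_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={'paths': ['s3://example-bucket/nested-json/']},
    format='json'
)

# Relationalize flattens nested structures into a collection of relational tables
flattened = Relationalize.apply(
    frame=nested_dyf,
    staging_path='s3://example-bucket/glue-temp/',   # temporary working location
    name='root',
    transformation_ctx='relationalize_ctx'
)

# The result is a DynamicFrameCollection: 'root' plus one frame per nested array
print(flattened.keys())
root_dyf = flattened.select('root')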

Job bookmarks
To take advantage of job bookmarks – a key feature of Glue ETL – it is necessary to use Glue DynamicFrames.
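A minimal sketch, assuming the job boilerplate (args, glueContext, job) from the Spark ETL sketch earlier and a job created with job bookmarks enabled:

# The transformation_ctx string is what Glue uses to key the bookmark state,
# so only data that has not been processed in earlier runs is read again.
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={'paths': ['s3://example-bucket/incoming/']},  # placeholder path
    format='json',
    transformation_ctx='incoming_read'
)

# ... transformations and writes go here ...

# job.commit() persists the bookmark state so the next run resumes from here
job.commit()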
Grouping
We have all come across or heard of a classic problem in big data processing – reading a large number of small files.
Spark launches a separate task for each data partition for each stage; if the file size is less than the block size, Spark will
launch one task per file. Consider a scenario where there are billions of such files/objects in the data store – this will
lead to a huge number of tasks being created, which will cause unnecessary delays due to scheduling logic (any given
executor can run a finite number of tasks in parallel, depending on the number of CPU cores available). Using the
Grouping feature in Glue ETL (https://docs.aws.amazon.com/glue/latest/dg/grouping-input-files.html), users can group
input files to combine multiple files into a single task. This can be done by specifying the target size of groups in bytes
with groupSize. Glue ETL automatically enables this feature if the number of input files is higher than 50,000.

For example, in the following code snippet, we are reading JSON data from Amazon S3 while performing grouping. This
allows us to control the task size rather than letting Spark control the task size based on the number of input files:

dy_frame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options = {
        'paths': ["s3://s3path/"],
        'recurse': True,
        'groupFiles': 'inPartition',
        'groupSize': '1048576'
    }, format="json")

NOTE
groupFiles is supported for DynamicFrames that have been created using the csv, ion, grokLog, json, and xml formats. This option is not supported for the Avro, Parquet, or ORC data formats.
