Notes
Topics:
• Types of data processing – OLTP and OLAP
• Data warehouses and data marts
• Data lakes
• Data lakehouse
• Data mesh
• Apache Spark on the AWS cloud
• AWS Glue
• Querying data using AWS

Data marts are built on top of data warehouses, and users access these data marts for their queries. While data marts can be based on a star or snowflake schema, the star schema is generally preferred because it results in faster queries due to fewer joins.

A data lake can be defined as a centralized repository that allows you to store all structured and unstructured data at any scale. Organizations usually pick a replication tool such as AWS Database Migration Service (AWS DMS) to bring the data into the data lake. Organizations may also use a push mechanism such as FTP to transfer files to an Amazon Simple Storage Service (Amazon S3)-based data lake using AWS Transfer Family.
The data lakehouse blurs the lines between data lakes and data warehouses by enabling the atomicity, consistency, isolation, and durability (ACID) properties on the data in the data lake and enabling multiple processes to concurrently read and write data. It is a new arrangement that does not try to force unstructured data into the strict models of a data warehouse.
A data mesh serves data as a product and establishes ownership of that product; this thought process led to the creation of the data mesh.
Figure 1.1 – Overview of Apache Spark’s workload execution
The cluster manager can be Spark’s standalone cluster manager, Apache Mesos, Apache Hadoop Yet Another Resource Negotiator (YARN), or Kubernetes.
Let’s look at some of the notable components of the AWS Glue feature set:
• AWS Glue Data Catalog: AWS Glue Data Catalog is a central catalog of metadata that can be used with other AWS services such as Amazon Athena, Amazon Redshift, and Amazon EMR.
• AWS Glue Connections: Glue Connections are catalog objects that help organize and store connection information to various data stores. AWS Glue Connections can also be created for Marketplace AWS Glue Connectors, which allow you to integrate with third-party data stores, such as Apache Hudi, Google BigQuery, and Elasticsearch.
• AWS Glue Crawlers: Crawlers can be used to crawl existing data and populate an AWS Glue Data Catalog with metadata.
• AWS Glue ETL Jobs: Glue ETL Jobs enable users to extract source data from various data stores, process it, and write output to a data target based on the logic defined in the ETL script. Users can take advantage of Apache Spark-based ETL Jobs to handle their workload in a distributed fashion. Glue also offers Python shell Jobs for ETL workloads that don’t need distributed processing.
• AWS Glue Triggers: AWS Glue Triggers are data catalog objects that allow us to either manually or automatically start executing one or more AWS Glue Crawlers or AWS Glue ETL Jobs.
• AWS Glue Workflows: Glue Workflows can be used to orchestrate the execution of a set of AWS Glue Jobs and AWS Glue Crawlers using AWS Glue Triggers.
• AWS Glue Blueprints: Blueprints are useful for creating parameterized workflows that can be created and shared for similar use cases.
• AWS Glue Schema Registry: AWS Glue Schema Registry allows users to centrally control data stream schemas and has integrations with Apache Kafka, Amazon Kinesis, and AWS Lambda.
• AWS Glue DataBrew: Glue DataBrew is used for data cleansing and enrichment through another GUI. Creating AWS Glue DataBrew Jobs does not require the user to write any source code; the Jobs are created with the help of a GUI.
• AWS Glue Elastic Views: Glue Elastic Views helps users replicate the data from one store to another using familiar SQL syntax.
Querying data using AWS
Data analysts may want to access and combine the data even before it starts hydrating the data lake.
Amazon Athena and Amazon Redshift allow you to query data across multiple data stores.
While using Amazon Athena to query S3 data cataloged in AWS Glue Catalog is quite common,
Amazon Athena can also be used to query data from Amazon CloudWatch Logs, Amazon
DynamoDB, Amazon DocumentDB, Amazon RDS, and JDBC-compliant relational data sources
such as MySQL and PostgreSQL, using AWS Lambda-based data source connectors released under the Apache 2.0 license. The Athena Query Federation SDK can be used to write a custom connector too. These
connectors return data in Apache Arrow format. Amazon Athena uses these connectors and
manages parallelism, along with predicate pushdown.
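As a hedged sketch (the database, table, and output bucket names below are placeholders, and a federated source would be referenced through its registered connector catalog rather than the default Glue Data Catalog), a query can be submitted from Python with Boto3 along these lines:

import boto3, time

athena = boto3.client('athena')

# Submit the query; result files land in the specified S3 output location
response = athena.start_query_execution(
    QueryString='SELECT * FROM my_glue_db.my_table LIMIT 10',
    QueryExecutionContext={'Database': 'my_glue_db'},
    ResultConfiguration={'OutputLocation': 's3://MY_ATHENA_RESULTS_BUCKET/output/'}
)
query_id = response['QueryExecutionId']

# Poll until the query reaches a terminal state
while True:
    state = athena.get_query_execution(
        QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(2)

# Fetch the result rows (the first row holds the column headers)
rows = athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']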
Similarly, Amazon Redshift also supports querying Amazon S3 data through Amazon Redshift
Spectrum. Redshift also supports querying data in Amazon RDS for PostgreSQL, Amazon Aurora
PostgreSQL-Compatible Edition, Amazon RDS for MySQL, and Amazon Aurora MySQL-Compatible
Edition through its Query Federation feature.
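As a hedged sketch of Query Federation (the cluster, endpoint, role, and secret identifiers are placeholders), an external schema pointing at an Aurora PostgreSQL source can be registered and then queried through the Amazon Redshift Data API:

import boto3

rsd = boto3.client('redshift-data')

# Register the federated schema once; subsequent queries can join it with local tables
create_schema_sql = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS apg
FROM POSTGRES
DATABASE 'sourcedb' SCHEMA 'public'
URI 'my-aurora-pg.cluster-xxxx.us-east-1.rds.amazonaws.com' PORT 5432
IAM_ROLE 'arn:aws:iam::111122223333:role/MyRedshiftFederationRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:111122223333:secret:my-pg-secret'
"""

rsd.execute_statement(
    ClusterIdentifier='my-redshift-cluster',
    Database='dev',
    DbUser='awsuser',
    Sql=create_schema_sql
)

# Query the federated table alongside local Redshift data
rsd.execute_statement(
    ClusterIdentifier='my-redshift-cluster',
    Database='dev',
    DbUser='awsuser',
    Sql='SELECT count(*) FROM apg.my_source_table'
)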
To handle the undifferentiated heavy lifting, AWS Glue introduced a new feature called AWS Glue
Elastic Views. It allows users to use familiar SQL to combine and materialize the data from
various sources into the target. Since AWS Glue Elastic Views is serverless, users do not have to
worry about managing the underlying infrastructure or keeping the target hydrated.
Data integration
Data integration is a complex operation that involves several tasks – data discovery, ingestion, preparation, transformation, and replication.

Data ingestion
AWS Glue makes it easy to ingest data from several standard data stores, such as HDFS, Amazon S3, and JDBC sources. It allows data to get ingested from SaaS and custom data stores via custom and Marketplace connectors.

Data discovery
AWS Glue Data Catalog can be used to discover and search data across all our datasets. The catalog organizes metadata into the following objects (a browsing sketch follows this list):
Databases
Tables
Partitions
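As a hedged sketch (the database and table names are placeholders), these catalog objects can be browsed programmatically with Boto3:

import boto3

glue = boto3.client('glue')

# List databases in the account's Data Catalog
for db in glue.get_databases()['DatabaseList']:
    print(db['Name'])

# List tables in one database, then the partitions of one table
tables = glue.get_tables(DatabaseName='my_database')['TableList']
partitions = glue.get_partitions(DatabaseName='my_database', TableName='my_table')['Partitions']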
At the time of writing, a Glue resource can only use one subnet. If multiple connections with different
subnets are attached, the subnet settings from the first connection will be used by default. However, if the
first connection is unhealthy for any reason – for instance, if the availability zone is down – then the next
connection is used.
Crawlers can crawl a wide variety of data stores – Amazon S3, Amazon Redshift, Amazon RDS, JDBC,
Amazon DynamoDB, and DocumentDB/MongoDB to name a few.
Figure 2.3 – VPC-based data store access from AWS Glue using ENIs
NOTE
At the time of writing, the maximum runtime for any crawler is 24 hours. After
24 hours, the crawler’s run is automatically stopped with the CANCELLED
status. The workflow of a crawl can be divided into three stages: Classification, Grouping, and Output.
For Amazon S3 data store crawls, the crawler will read all the files in the path specified by default. The crawler
will classify each of the files available in the S3 path and persist the metadata to the crawler’s service side
storage (not to be confused with AWS Glue Data Catalog). During subsequent crawler runs, this metadata is reused, new files are crawled, and the metadata stored on the service side is updated as necessary.
For the JDBC, Amazon DynamoDB, and Amazon DocumentDB (with MongoDB compatibility) data stores, the
stages of the crawler workflow are the same, but the logic that’s used for classification and clustering is
different for each data store type. The classification of the table(s) is decided based on the data store
type/database engine.
For JDBC data stores, Glue connects to the database server, and the schema is obtained for the tables that
match the include path value in the crawler settings.
AWS Glue crawlers have several features and configuration options.
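For instance, a crawler over an S3 prefix could be defined and started with Boto3 as follows – a hedged sketch in which the role, database, and path are placeholders:

import boto3

glue = boto3.client('glue')

glue.create_crawler(
    Name='my-s3-crawler',
    Role='arn:aws:iam::111122223333:role/MyGlueCrawlerRole',
    DatabaseName='my_database',
    Targets={'S3Targets': [{'Path': 's3://MY_BUCKET/raw/'}]},
    # Optional: control how schema changes are applied to existing catalog tables
    SchemaChangePolicy={'UpdateBehavior': 'UPDATE_IN_DATABASE', 'DeleteBehavior': 'LOG'}
)

glue.start_crawler(Name='my-s3-crawler')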
During job creation, users can use the AWS Glue wizard to generate an ETL script for Spark and Spark
Streaming ETL jobs by choosing the source, destination, column mapping, and connection information.
However, for Python shell jobs, the user will have to provide a script. At the time of writing, Glue ETL supports
Scala 2 and PySpark (Java and R jobs are currently not supported) for Spark and Spark Streaming jobs and
Python 3 for Python shell jobs.
NOTE
AWS Glue allows multiple connections to be attached to ETL jobs. However, it is important to note that a Glue
job can use only one subnet for VPC jobs. If multiple connections are attached to a job, only the first connection's subnet is used by the ETL job by default.
There are some advanced features that users can select during job creation, such as job bookmarks,
continuous logging, Spark UI, and capacity settings (the number of workers and worker type). Glue allows
users to inject several job parameters (including Spark configuration parameters) so that they can alter the
default Spark behavior.
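As a hedged illustration of these settings (the job name, script location, role ARN, and buckets are placeholders, and the exact argument keys may evolve), a job with bookmarks, continuous logging, the Spark UI, capacity settings, and a custom Spark configuration could be created via the AWS SDK like this:

import boto3

glue = boto3.client('glue')

glue.create_job(
    Name='my-spark-etl-job',
    Role='arn:aws:iam::111122223333:role/MyGlueJobRole',
    Command={
        'Name': 'glueetl',                       # Spark ETL job type
        'ScriptLocation': 's3://MY_BUCKET/scripts/my_job.py',
        'PythonVersion': '3'
    },
    DefaultArguments={
        '--job-bookmark-option': 'job-bookmark-enable',
        '--enable-continuous-cloudwatch-log': 'true',
        '--enable-spark-ui': 'true',
        '--spark-event-logs-path': 's3://MY_BUCKET/spark-logs/',
        '--conf': 'spark.sql.shuffle.partitions=200'   # alter default Spark behavior
    },
    WorkerType='G.1X',
    NumberOfWorkers=10,
    GlueVersion='3.0'
)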
Glue Spark extensions / APIs
Glue ETL introduces quite a lot of advanced Spark extensions/APIs and transformations to make it easy to achieve complex ETL operations. Some notable ones are listed here; a minimal script sketch follows this list:
GlueContext
DynamicFrame
Job bookmarks
GlueParquet
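As a hedged sketch only (paths and names are placeholders, and the exact options may vary by Glue version), a minimal job tying these pieces together might look like this:

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)           # enables job bookmark tracking for this run

# DynamicFrame read; transformation_ctx is the key used by job bookmarks
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://MY_BUCKET/input/"]},
    format="json",
    transformation_ctx="source0"
)

# glueparquet is the Glue-optimized Parquet writer format
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://MY_BUCKET/output/"},
    format="glueparquet",
    transformation_ctx="sink0"
)

job.commit()                                # persists the bookmark state for the next run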
Glue development endpoints
This feature allows users to create an environment for Glue ETL development wherein the developer/data engineer can use Notebook environments (Jupyter/Zeppelin), read-eval-print loop (REPL) shells, or IDEs to develop ETL scripts and test them instantly using the endpoint.
AWS Glue interactive sessions
Interactive sessions make it easier for users to access the session
from Jupyter notebook environments hosted anywhere (the
notebook server can be running locally on a user workstation as
well) with minimal configuration.
Triggers
Triggers are Glue Data Catalog objects that can be used to
start (manually or automatically) one or more crawlers or
ETL jobs. Triggers allow users to chain crawlers and ETL jobs
that depend on each other.
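For example, a conditional trigger that starts an ETL job once a crawler succeeds could be created with Boto3 as follows – a hedged sketch in which the crawler and job names are placeholders:

import boto3

glue = boto3.client('glue')

glue.create_trigger(
    Name='start-etl-after-crawl',
    Type='CONDITIONAL',
    StartOnCreation=True,
    Predicate={
        'Conditions': [{
            'LogicalOperator': 'EQUALS',
            'CrawlerName': 'my-s3-crawler',
            'CrawlState': 'SUCCEEDED'
        }]
    },
    Actions=[{'JobName': 'my-spark-etl-job'}]
)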
In this chapter, we will be discussing some of the components of AWS Glue mentioned in the
previous paragraph – specifically Glue ETL jobs, the schema registry, and Glue custom/Marketplace
connectors – in further detail and exploring data ingestion use cases, such as ingesting data from
file/object stores, JDBC-compatible data stores, streaming data sources, and SaaS data stores, to
demonstrate the capabilities of Glue.
AWS Glue supports three different types of ETL jobs – Spark, Spark Streaming, and Python Shell.
Each of these job types is designed to handle a specific type of workload and the
environment in which the workload is executed varies, depending on the type of ETL job.
For instance, Python Shell jobs allow users to execute Python scripts as a shell in AWS
Glue. These jobs run on a single host on the server side. Spark/Spark Streaming ETL, on
the other hand, allows you to execute PySpark/Scala-based ETL jobs in a distributed
environment and allows users to take advantage of Spark libraries to execute ETL
workloads.
In the upcoming sections, we will explore how Glue ETL can be used to ingest data from
different data stores, including file/object stores, HDFS, JDBC data stores, Spark
Streaming data sources, and SaaS data stores.
• Data ingestion from file/object stores
• Data ingestion from JDBC data stores
• Data ingestion from streaming data sources
• Data ingestion from SaaS data stores
Data ingestion from file/object stores
This is one of the most common use cases for Glue ETL, where the source data is already available in file
storage or cloud-based object stores. Here, depending on the type of job being executed, the methods or
libraries used to access the data store differ.
There are several file/object storage services available today – Amazon S3, HDFS, Azure Storage, Google Cloud
Storage, IBM Cloud Object Storage, FTP, SFTP, and HTTP(s) to name a few.
Most organizations already have some mechanism to move data to Amazon S3, typically by using the AWS
CLI/SDKs directly, AWS Transfer Family (https://aws.amazon.com/aws-transfer-family/), or some other third-
party tools.
If we are using Python Shell jobs, the user can take advantage of several Python packages that allow them to
connect to the desired file storage. If the user wishes to read an object from Amazon S3, they can use the
Amazon S3 Boto3 client to get and read objects using Python packages/functions (for example, native Python
functions and pandas), depending on the file format.
The following code snippet can be used with an AWS Glue Python Shell ETL job to read a CSV from an Amazon S3 bucket, transform the file from CSV into JSON, and write the output to another Amazon S3 location:

import boto3, io, pandas as pd

client = boto3.client('s3')  # AWS Python SDK (Boto3) S3 client

# nyc-tlc - https://registry.opendata.aws/
src_bucket = 'nyc-tlc'  # SOURCE_S3_BUCKET_NAME
target_bucket = 'TARGET_S3_BUCKET_NAME'

src_object = client.get_object(
    Bucket=src_bucket,
    Key='trip data/yellow_tripdata_2021-07.csv'
)

# Read CSV and transform to JSON
df = pd.read_csv(src_object['Body'])
jsonBuffer = io.StringIO()
df.to_json(jsonBuffer, orient='records')

# Write JSON to target location
client.put_object(
    Bucket=target_bucket,
    Key='target_prefix/data.json',
    Body=jsonBuffer.getvalue()
)

The same source code can be executed in several ways in AWS – any Amazon EC2 instance with Python installed, an AWS Lambda function, or within a Docker container using Amazon ECR/AWS Batch, to name a few.

“Why should we use AWS Glue Python Shell jobs over AWS Lambda?” While AWS Lambda and AWS Glue Python Shell jobs are both capable of running Python scripts, Python Shell jobs are designed for ETL workloads and can be orchestrated more easily with other Glue components such as crawlers, Spark jobs, and Glue triggers using AWS Glue workflows.

AWS Lambda functions can use a maximum of 512 MB of storage space (the /tmp directory), up to 10,240 MB of memory, and up to 6 vCPUs – the functions can run for up to a maximum of 15 minutes.

Glue Python Shell jobs, on the other hand, use the concept of Data Processing Units (DPUs) for capacity allocation, where one DPU provides four vCPUs and 16 GB of memory. Users can use either 0.0625 DPU or 1 DPU capacity for Python Shell jobs. Essentially, a Python Shell job can use up to four vCPUs and 16 GB of memory, and the user can configure the timeout value for Python Shell jobs (the default is 48 hours). At the time of writing, Glue Python Shell jobs are allocated 20 GB of disk space by default, though this may change in the future.
Now, let’s consider the same ETL operation we performed in the previous script but using AWS Glue Spark
ETL instead.
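A hedged sketch of that Spark version follows, assuming the usual Glue job boilerplate (GlueContext created, job initialized) shown earlier and reusing the same source object and target bucket; the options are illustrative:

# Read the public CSV data into a DynamicFrame (header row parsed as column names)
yellow_trips = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://nyc-tlc/trip data/yellow_tripdata_2021-07.csv"]},
    format="csv",
    format_options={"withHeader": True}
)

# Write the same records back out as JSON to the target bucket
glueContext.write_dynamic_frame.from_options(
    frame=yellow_trips,
    connection_type="s3",
    connection_options={"path": "s3://TARGET_S3_BUCKET_NAME/target_prefix/"},
    format="json"
)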
Schema flexibility
Since DynamicRecords are self-describing, a
schema is computed on the fly and there is no
need to perform an additional pass over the
source data.
Advanced options for managing schema conflicts
DynamicFrames make it easier to handle schema conflicts by introducing ChoiceType whenever a schema conflict is encountered
instead of defaulting to the most compatible data type (usually, this is StringType). For instance, if one of the columns has
integer/long values and string values, Spark infers it as StringType by default. However, Glue creates ChoiceType and allows the user
to resolve the conflict.
Let’s consider an example where the Provider Id column has StringType and numeric (LongType) values. When the data is read using
a Spark DataFrame, the column will be inferred as StringType by Spark:
root
|-- Provider Id: string (nullable = true)
|-- ColumnA: string (nullable = true)
|-- ColumnB: string (nullable = true)
Now, when the same dataset is read using AWS Glue DynamicFrames, the column is represented with ChoiceType and lets the user
decide how to resolve the type conflict:
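As a hedged illustration (assuming the dataset has been read into a DynamicFrame named dyf; the printed schema below is representative rather than captured output):

dyf.printSchema()
# root
# |-- Provider Id: choice        <- Glue records both observed types
# |    |-- long
# |    |-- string

# Resolve the conflict explicitly, for example by casting every value to long;
# other actions include 'make_cols', 'make_struct', and 'project:long'
resolved = dyf.resolveChoice(specs=[("Provider Id", "cast:long")])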
As discussed in the previous chapter, DynamicFrames are structurally different from Spark DataFrames since they
have several optimizations enabled under the hood.
Under the hood, AWS Glue ETL (Spark) uses an EMRFS (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-
fs.html) driver by default to read from Amazon S3 data stores when the path begins with the s3:// URI scheme (class:
com.amazon.ws.emr.hadoop.fs.EmrFileSystem), regardless of whether Apache Spark DataFrames or AWS Glue
DynamicFrames are used. The EMRFS driver was originally developed for Amazon EMR and has been since adopted
by AWS Glue for Amazon S3 reads and writes from Glue Spark ETL.
AWS Glue-specific ETL transformations and extensions
Job bookmarks
To take advantage of job bookmarks – a key feature of Glue ETL – it is necessary to use Glue DynamicFrames.
Grouping
We have all come across or heard of a classic problem in big data processing – reading a large number of small files.
Spark launches a separate task for each data partition for each stage; if the file size is less than the block size, Spark will
launch one task per file. Consider a scenario where there are billions of such files/objects in the data store – this will
lead to a huge number of tasks being created, which will cause unnecessary delays due to scheduling logic (any given
executor can run a finite number of tasks in parallel, depending on the number of CPU cores available). Using the
Grouping feature in Glue ETL (https://docs.aws.amazon.com/glue/latest/dg/grouping-input-files.html), users can group
input files to combine multiple files into a single task. This can be done by specifying the target size of groups in bytes
with groupSize. Glue ETL automatically enables this feature if the number of input files is higher than 50,000.
For example, in the following code snippet, we are reading JSON data from Amazon S3 while performing grouping. This
allows us to control the task size rather than letting Spark control the task size based on the number of input files:
dy_frame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        'paths': ["s3://s3path/"],
        'recurse': True,
        'groupFiles': 'inPartition',
        'groupSize': '1048576'
    },
    format="json")

NOTE
groupFiles is supported for DynamicFrames that have been created using the csv, ion, grokLog, json, and xml formats. This option is not supported for Avro, Parquet, or ORC data formats.