Amazon EMR Serverless Architecture and Use Cases
Lesson objectives
In this lesson, you will learn the following:
The architecture of Amazon EMR Serverless
Typical use cases for Amazon EMR Serverless
Key points about Amazon EMR Serverless
Alternatively, you can submit the same Apache Spark ETL job without any code
changes to other deployment options, such as Amazon EMR on Amazon EC2 or
Amazon EMR on EKS. You can submit the job using the AWS Management Console,
AWS CLI, or Amazon EMR APIs. With Amazon EMR on Amazon EC2, you can submit
jobs using the Amazon EMR steps API during or after cluster launch.
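As a minimal sketch of the "no code changes" point, the snippet below builds the request for the same Spark script twice: once for the EMR Serverless StartJobRun API and once as a step for the Amazon EMR on Amazon EC2 AddJobFlowSteps API. The function names, the S3 path, the IDs, and the role ARN are hypothetical placeholders, and the sketch assumes boto3 is installed and AWS credentials are configured when the submit functions are called.

```python
def build_serverless_job_run(application_id, execution_role_arn, script_uri, script_args):
    """Build keyword arguments for the EMR Serverless StartJobRun API."""
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": script_uri,
                "entryPointArguments": list(script_args),
            }
        },
    }


def build_ec2_step(script_uri, script_args, name="spark-etl"):
    """Build one step definition for the EMR on EC2 AddJobFlowSteps API."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar runs spark-submit on the cluster.
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri, *script_args],
        },
    }


def submit_serverless(request):
    """Submit to EMR Serverless and return the job run ID."""
    import boto3  # imported lazily so the builders work without boto3

    return boto3.client("emr-serverless").start_job_run(**request)["jobRunId"]


def submit_ec2_step(cluster_id, step):
    """Add the step to a running EMR on EC2 cluster and return the step ID."""
    import boto3

    response = boto3.client("emr").add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


# Hypothetical values: the same script and arguments feed both deployment options.
SCRIPT = "s3://example-bucket/scripts/etl.py"
ARGS = ["--date", "2024-01-01"]
serverless_request = build_serverless_job_run(
    "example-application-id", "arn:aws:iam::111122223333:role/example-job-role", SCRIPT, ARGS
)
ec2_step = build_ec2_step(SCRIPT, ARGS)
```

Only the request envelope differs between the two targets; the script URI and its arguments are identical in both.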
Large-scale SQL queries using Hive
Apache Hive on Amazon EMR provides data warehouse-like query capabilities. You
can read, write, and manage petabytes of data using a SQL-like language with
Amazon EMR Serverless or Amazon EMR on Amazon EC2 clusters. Starting with EMR
6.0.0, Amazon EMR Hive supports the Live Long and Process (LLAP) functionality.
LLAP uses persistent daemons with intelligent in-memory caching to improve Hive
query performance.
Amazon EMR 6.1.0 and later support Hive ACID (atomicity, consistency, isolation,
durability) transactions, so Hive tables comply with the ACID properties of a
database. With this feature, you can run INSERT, UPDATE, DELETE, and MERGE SQL
operations on Hive-managed tables with data stored in Amazon S3.
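To make the ACID operations concrete, here is a hedged sketch that assembles the HiveQL for a transactional table and a MERGE (upsert) statement as Python strings. The table names, column names, and the build_merge helper are hypothetical; the sketch assumes the target table's columns are the key followed by the listed columns, in that order, and that the statements are run through whatever Hive client or job submission path you already use.

```python
# Transactional (ACID) Hive tables are stored as ORC and flagged as transactional.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS customers (id INT, email STRING)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true')
""".strip()


def build_merge(target, source, key, columns):
    """Build a HiveQL MERGE that updates matching rows and inserts new ones.

    Assumes the target table's column order is: key, then `columns`.
    """
    updates = ", ".join(f"{c} = s.{c}" for c in columns)
    inserts = ", ".join(f"s.{c}" for c in [key, *columns])
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        f"WHEN MATCHED THEN UPDATE SET {updates}\n"
        f"WHEN NOT MATCHED THEN INSERT VALUES ({inserts})"
    )


# Hypothetical staging table holding the latest customer records.
merge_sql = build_merge("customers", "customer_updates", "id", ["email"])
print(CREATE_TABLE)
print(merge_sql)
```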
Interactive analysis using Jupyter notebooks with EMR Studio
EMR Studio provides a managed interactive analysis environment for Jupyter
notebooks. It helps data scientists and data engineers develop, visualize, and
debug data engineering and data science applications written in R, Python, Scala, or
PySpark.
With EMR Studio, you can start notebooks in seconds, get onboarded with sample
notebooks, and perform your data exploration. You can collaborate with peers using
built-in real-time collaboration and track changes across notebook versions using
Git repositories. You can also customize your environment by loading custom
kernels and Python libraries from notebooks, or start parameterized notebooks as
part of scheduled workflows using orchestration services like Apache Airflow or
Amazon MWAA.
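A parameterized notebook run can be sketched with the Amazon EMR StartNotebookExecution API, which an orchestrator task could call on a schedule. The helper name, the Workspace ID, cluster ID, role name, notebook path, and parameters below are all hypothetical placeholders, and the sketch assumes the notebook runs against an Amazon EMR on Amazon EC2 cluster with boto3 and credentials available at call time.

```python
import json


def build_notebook_execution(workspace_id, notebook_path, cluster_id, service_role, params):
    """Build keyword arguments for the Amazon EMR StartNotebookExecution API."""
    return {
        "EditorId": workspace_id,          # EMR Studio Workspace (or notebook) ID
        "RelativePath": notebook_path,     # notebook file path inside the Workspace
        "ExecutionEngine": {"Id": cluster_id, "Type": "EMR"},
        "ServiceRole": service_role,
        # Parameters are passed as a JSON-formatted string.
        "NotebookParams": json.dumps(params),
    }


def start_notebook(request):
    """Start the execution and return its ID (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    return boto3.client("emr").start_notebook_execution(**request)["NotebookExecutionId"]


# Hypothetical identifiers for illustration only.
notebook_request = build_notebook_execution(
    workspace_id="e-EXAMPLEWORKSPACEID",
    notebook_path="reports/daily_summary.ipynb",
    cluster_id="j-EXAMPLECLUSTERID",
    service_role="EMR_Notebooks_DefaultRole",
    params={"run_date": "2024-01-01"},
)
```

In a scheduled workflow, a task would call start_notebook(notebook_request) with a different run_date on each run.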
Ad-hoc analysis using Presto
When using Presto on Amazon EMR on Amazon EC2, you can run interactive queries
on large datasets with minimal setup time. Amazon EMR handles the provisioning,
configuration, and tuning of Hadoop clusters. Presto is included in Amazon EMR
versions 5.0.0 and later.
Presto running on Amazon EMR gives you more flexibility in how you configure and
run queries, including the ability to federate to other data sources if needed. For
example:
A use case that requires Lightweight Directory Access Protocol (LDAP)
authentication for clients such as the Presto CLI or Java Database
Connectivity (JDBC) and Open Database Connectivity (ODBC) drivers.
A workflow in which you need to join data between different systems, such as
MySQL, Amazon Redshift, Apache Cassandra, and Hive.
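The federated join in the second example can be sketched as a single Presto query that addresses each system through its catalog, using catalog.schema.table names. The catalog names (hive, mysql), the schemas, tables, and columns below are hypothetical and depend on how the Presto connectors are configured on the cluster.

```python
def qualified(catalog, schema, table):
    """Return a fully qualified Presto table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def build_federated_join(orders_table, customers_table):
    """Build a query joining order facts in one system to customer data in another."""
    return (
        "SELECT c.region, sum(o.amount) AS total_amount\n"
        f"FROM {orders_table} AS o\n"
        f"JOIN {customers_table} AS c ON o.customer_id = c.customer_id\n"
        "GROUP BY c.region"
    )


# Hypothetical layout: orders live in Hive tables, customers in a MySQL database.
federated_sql = build_federated_join(
    qualified("hive", "sales", "orders"),
    qualified("mysql", "crm", "customers"),
)
print(federated_sql)
```

Presto resolves each catalog to its connector at query time, so the join runs without first copying either dataset into the other system.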
Building real-time streaming data pipelines
With Amazon EMR, you can perform fault-tolerant stream processing of live data
streams using the Apache Spark or Apache Flink frameworks. With Apache Spark,
you can run Spark Streaming or Spark Structured Streaming applications.
Structured Streaming is a scalable and fault-tolerant stream processing engine built
on the Spark SQL engine, while Spark Streaming uses the DStream API, powered by
Spark RDDs (Resilient Distributed Datasets), to process streams of data.
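The Structured Streaming model described above can be sketched in PySpark as a windowed count over a stream. This is an illustrative sketch, not a production pipeline: it assumes PySpark is available on the cluster, uses Spark's built-in rate test source in place of a real event stream, and the function names are hypothetical. The small pure-Python helper shows which tumbling window an event lands in.

```python
from datetime import datetime, timezone


def tumbling_window_start(event_time, width_seconds):
    """Return the start of the epoch-aligned tumbling window containing event_time."""
    epoch = int(event_time.timestamp())
    return datetime.fromtimestamp(epoch - epoch % width_seconds, tz=timezone.utc)


def build_windowed_counts(spark, width_seconds=10):
    """Return a streaming DataFrame that counts events per tumbling window."""
    # Imported lazily so the helper above works without PySpark installed.
    from pyspark.sql import functions as F

    events = (
        spark.readStream.format("rate")  # test source emitting (timestamp, value) rows
        .option("rowsPerSecond", 5)
        .load()
    )
    return (
        events.withWatermark("timestamp", "30 seconds")  # bound state kept for late data
        .groupBy(F.window("timestamp", f"{width_seconds} seconds"))
        .count()
    )


def main():
    """Run the sketch and print updated window counts to the console."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("windowed-counts-sketch").getOrCreate()
    query = (
        build_windowed_counts(spark)
        .writeStream.outputMode("update")
        .format("console")
        .start()
    )
    query.awaitTermination()
```

Calling main() in a job submitted with spark-submit starts the stream. The query is written with the same DataFrame operations as a batch job, which is the main practical difference from the DStream API.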
Amazon EMR makes it possible for the streaming data pipelines to distribute and
process data across dynamically scalable Amazon EC2 instances and Amazon S3.
You can use Amazon EMR to analyze event-based data for use cases such as
personalization, product discovery, and fraud detection. Amazon EMR also supports
Apache Flink, which lets you run real-time stream processing on high-throughput
data sources.
Running AI/ML workloads on Amazon EMR
You can preprocess data, train models, and perform prediction and validation to
build accurate ML models using Amazon EMR. You can analyze data using
open-source ML frameworks such as Apache Spark MLlib, TensorFlow, and Apache MXNet.
Amazon EMR suits ML use cases in which Spark is already used with a
persistent cluster, or where an end-to-end pipeline already exists and the team has
the skill set and inclination to run a persistent cluster. With a wide range of instance
options, including AWS Graviton-based instances and Amazon EC2 Spot Instances,
Amazon EMR offers flexibility and cost optimization for running ML workloads.
Amazon EMR also integrates with Amazon SageMaker, so a SageMaker model
training job can start from a Spark pipeline in Amazon EMR.
Amazon EMR Studio offers fully managed Jupyter notebooks for visualization, with
the ability to sign in through AWS IAM Identity Center (successor to AWS Single Sign-On).