
Amazon EMR Serverless Architecture and Use Cases

Lesson objectives
In this lesson, you will learn the following:
• The architecture of Amazon EMR Serverless
• Typical use cases for Amazon EMR Serverless
• Key points about Amazon EMR Serverless

How is Amazon EMR Serverless used to architect a cloud solution?


Amazon EMR Serverless automatically provisions, configures, and scales compute
and memory resources required at each stage of your data-processing application.
With Amazon EMR Serverless, your jobs run faster because it includes the
performance-optimized Amazon EMR runtime for Apache Spark, Hive, Presto, and
other technologies. Additionally, Amazon EMR Serverless integrates with EMR Studio
to provide an interactive development experience using notebooks and familiar
open-source tools. Such tools include Spark UI and Tez UI to help you develop,
visualize, and debug your applications.
How does Amazon EMR Serverless work?
As soon as an Amazon EMR Serverless application is created, users can start submitting
Apache Spark jobs. There are multiple ways to submit Apache Spark jobs. For
example, you can use Apache Airflow, AWS Step Functions, EMR Studio notebooks, the
AWS CLI, an AWS SDK, or custom-built pipelines. Amazon EMR Serverless automatically
provisions the workers required for data processing jobs in the Amazon EMR service
account. These workers interact with resources in your AWS account to run the jobs.
What are the core concepts of Amazon EMR Serverless?
With Amazon EMR Serverless, the behind-the-scenes architecture remains similar to
Amazon EMR running on Amazon EC2 and Amazon EMR on EKS. However, the core
concepts you work with shift from nodes to applications, jobs, workers, and
pre-initialized workers.
Application
You can create one or more applications that use open-source analytics frameworks
by specifying the framework that you want to use (for example, Apache Spark or
Apache Hive), the Amazon EMR release version, and the name of your application.
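Creating an application boils down to naming those three choices. The following is a minimal sketch of the request parameters for the EMR Serverless CreateApplication API; the application name and release label are illustrative, not required values.

```python
# Sketch of an application-creation request. Field names follow the
# EMR Serverless CreateApplication API; the name and release label
# here are illustrative placeholders.
create_application_params = {
    "name": "my-spark-app",       # your application name
    "releaseLabel": "emr-6.6.0",  # Amazon EMR release version
    "type": "SPARK",              # open-source framework: SPARK or HIVE
}

# With the AWS SDK for Python (boto3), this could be submitted as:
#   client = boto3.client("emr-serverless")
#   response = client.create_application(**create_application_params)
print(create_application_params["type"])
```

Once the application exists, the same application ID is reused for every job you submit to it.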
What are typical use cases for Amazon EMR Serverless?
Typical Amazon EMR Serverless use cases are described in the following six
sections.
Apache Spark ETL jobs

With Amazon EMR Serverless, you can run Apache Spark ETL jobs on an application
with the type parameter set to "SPARK."
For example:
• Extract: Read CSV data from Amazon S3.
• Transform: Add or remove columns in the dataset.
• Load: Write the updated data back to Amazon S3.
Jobs must be compatible with the Apache Spark version referenced in the Amazon
EMR release version. For example, when you run jobs on an application with
Amazon EMR release 6.6.0, your job must be compatible with Apache Spark 3.2.0.

Alternatively, you can submit the same Apache Spark ETL job without any code
changes to other deployment options, such as Amazon EMR on Amazon EC2 or
Amazon EMR on EKS. You can submit the job using the AWS Management Console,
AWS CLI, or Amazon EMR APIs. With Amazon EMR on Amazon EC2, you can submit
jobs using the Amazon EMR steps API during or after cluster launch.
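Submitting such an ETL job programmatically amounts to pointing the application at a Spark entry point. The following is a sketch of a StartJobRun request payload; the application ID, role ARN, bucket names, and script path are placeholders for illustration.

```python
# Sketch of a Spark ETL job submission payload for the EMR Serverless
# StartJobRun API. The application ID, role ARN, and S3 paths are
# hypothetical placeholders.
start_job_run_params = {
    "applicationId": "APPLICATION_ID",
    "executionRoleArn": "arn:aws:iam::111122223333:role/emr-serverless-job-role",
    "jobDriver": {
        "sparkSubmit": {
            # PySpark script that reads CSV from S3, transforms columns,
            # and writes the result back to S3
            "entryPoint": "s3://amzn-s3-demo-bucket/scripts/etl_job.py",
            "entryPointArguments": [
                "s3://amzn-s3-demo-bucket/input/",
                "s3://amzn-s3-demo-bucket/output/",
            ],
        }
    },
}

# With boto3, this could be submitted as:
#   boto3.client("emr-serverless").start_job_run(**start_job_run_params)
print(start_job_run_params["jobDriver"]["sparkSubmit"]["entryPoint"])
```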
Large-scale SQL queries using Hive

Apache Hive on Amazon EMR provides data warehouse-like query capabilities. You
can read, write, and manage petabytes of data using a SQL-like language with
Amazon EMR Serverless or Amazon EMR on Amazon EC2 clusters. Starting with EMR
6.0.0, Amazon EMR Hive supports the Live Long and Process (LLAP) functionality.
LLAP uses persistent daemons with intelligent in-memory caching to improve Hive
query performance.
Amazon EMR 6.1.0 and later support Hive ACID (atomicity, consistency, isolation,
durability) transactions, so Hive complies with the ACID properties of a database. With
this feature, you can run INSERT, UPDATE, DELETE, and MERGE SQL operations on
Hive-managed tables with data stored in Amazon S3.
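As a sketch of what that support allows, the statement below upserts rows from a staging table into a Hive-managed table. The table and column names are hypothetical, and the Python string is only a convenient wrapper; on EMR Serverless the query would typically be uploaded to S3 and referenced by the Hive job driver.

```python
# Illustrative Hive ACID MERGE of the kind Amazon EMR 6.1.0+ supports on
# Hive-managed tables backed by Amazon S3. Table and column names are
# hypothetical placeholders.
merge_statement = """
MERGE INTO customers AS t
USING customer_updates AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET email = s.email
WHEN NOT MATCHED THEN INSERT VALUES (s.customer_id, s.email)
"""
print(merge_statement.strip().splitlines()[0])  # → MERGE INTO customers AS t
```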
Interactive analysis using Jupyter notebooks with EMR Studio

EMR Studio provides a managed interactive analysis environment for Jupyter
notebooks. It helps data scientists and data engineers develop, visualize, and
debug data engineering and data science applications written in R, Python, Scala, or
PySpark.
With EMR Studio, you can start notebooks in seconds, get onboarded with sample
notebooks, and perform your data exploration. You can collaborate with peers using
built-in real-time collaboration and track changes across notebook versions using
Git repositories. You can also customize your environment by loading custom
kernels and Python libraries from notebooks, or start parameterized notebooks as
part of scheduled workflows using orchestration services like Apache Airflow or
Amazon MWAA.
Ad-hoc analysis using Presto

When using Presto on Amazon EMR on Amazon EC2, you can run interactive queries
on large datasets with minimal setup time. Amazon EMR handles the provisioning,
configuration, and tuning of Hadoop clusters. Presto is included in Amazon EMR
versions 5.0.0 and later.

Presto running on Amazon EMR gives you more flexibility in how you configure and
run queries, including the ability to federate to other data sources if needed. For
example:
• A use case that requires Lightweight Directory Access Protocol (LDAP)
authentication for clients such as the Presto CLI or Java Database
Connectivity/Open Database Connectivity (JDBC/ODBC) drivers
• A workflow in which you need to join data between different systems such as
MySQL, Amazon Redshift, Apache Cassandra, and Hive
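Federation of this kind is configured through Presto catalogs. As a sketch, a hypothetical MySQL catalog on an Amazon EMR cluster might look like the following properties file, where the host, user, and password are placeholders:

```properties
# etc/catalog/mysql.properties — registers a MySQL data source as a
# Presto catalog so its tables can be joined with Hive tables in queries
connector.name=mysql
connection-url=jdbc:mysql://mysql-host.example.com:3306
connection-user=presto_reader
connection-password=example-password
```

A query could then join across catalogs, for example joining a table in the `mysql` catalog against a table in the `hive` catalog in a single SELECT.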
Building real-time streaming data pipelines

With Amazon EMR, you can perform fault-tolerant stream processing of live data
streams using the Apache Spark or Apache Flink frameworks. With Apache Spark,
you can run Spark Streaming or Apache Spark Structured Streaming applications.
Structured Streaming is a scalable and fault-tolerant stream processing engine built
on the Spark SQL engine, while Spark Streaming uses the DStream API, powered by
Spark RDDs (Resilient Distributed Datasets), to process streams of data.

Amazon EMR makes it possible for the streaming data pipelines to distribute and
process data across dynamically scalable Amazon EC2 instances and Amazon S3.
You can use Amazon EMR to analyze event-based data for use cases such as
personalization, product discovery, and fraud detection. Amazon EMR also supports
Apache Flink, which lets you run real-time stream processing on high-throughput
data sources.
Running AI/ML workloads on Amazon EMR

You can preprocess data, train models, and perform prediction and validation to
build accurate ML models using Amazon EMR. You can analyze data using open-
source ML frameworks such as Apache Spark MLlib, TensorFlow, and Apache MXNet.

Amazon EMR is used for ML use cases in which Spark is already used with a
persistent cluster, or where an end-to-end pipeline already exists and the team has
the skill set and inclination to run a persistent cluster. With a wide range of instance
types, including AWS Graviton processors and Amazon EC2 Spot Instances, Amazon
EMR offers flexibility and cost optimization for running ML workloads.
Amazon EMR also features integrations with Amazon SageMaker, in which a
SageMaker model training job can start from a Spark pipeline in Amazon EMR.

Amazon EMR Studio offers fully managed Jupyter notebooks for visualization, with the
ability to log in through AWS IAM Identity Center (successor to AWS Single Sign-On).

What else should I keep in mind when using Amazon EMR Serverless?
There are multiple aspects to consider when designing workloads to run
on Amazon EMR Serverless. Key considerations are described in the
following three sections.
Fine-grained autoscaling with no need to guess cluster sizes

Amazon EMR Serverless eliminates the need to right-size clusters for
varying jobs and data sizes. It automatically adds and removes workers at
different stages of your job. With Amazon EMR Serverless, you provide the
minimum and maximum number of concurrent workers for your
application, as well as the compute resources and storage for each worker.
Amazon EMR Serverless automatically adds and removes workers based on what the
job requires, within your specified limits. It also provisions, configures,
and dynamically scales the compute and memory resources needed at
each stage of your data processing application. You're charged for the
aggregated vCPU, memory, and storage resources used from the time a
worker starts running until it stops, rounded up to the nearest second,
with a 1-minute minimum. It's a cost-effective offering: you pay only for
the compute time and resources that were used.
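The billing rule above can be sketched as a small function. The rounding and minimum follow the description in this section; the per-unit rates in the cost helper are hypothetical, not AWS pricing.

```python
import math

# Sketch of the metering rule described above: usage is billed per worker
# from start to stop, rounded up to the nearest second, with a 1-minute
# minimum. Rates below are hypothetical placeholders, not AWS pricing.
def billable_seconds(runtime_seconds: float) -> int:
    rounded = math.ceil(runtime_seconds)  # round up to the nearest second
    return max(rounded, 60)               # 1-minute minimum applies

def worker_cost(runtime_seconds, vcpus, memory_gb,
                vcpu_rate=0.000011, gb_rate=0.000001):  # hypothetical $/unit-second
    secs = billable_seconds(runtime_seconds)
    return secs * (vcpus * vcpu_rate + memory_gb * gb_rate)

print(billable_seconds(42.3))  # → 60 (a 42.3-second run bills the 1-minute minimum)
print(billable_seconds(90.1))  # → 91 (rounded up to the next second)
```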
Resilience to Availability Zone failures
Different Amazon EMR deployment options
Getting Started with Amazon EMR Serverless
To learn more about getting started with Amazon EMR Serverless by
deploying a sample Spark or Hive workload, see the Amazon EMR
Serverless User Guide.
To learn more about using Amazon EMR Serverless, see the AWS Big Data
Blog post "Run Big Data Applications without Managing Servers."
What's next?
In this lesson, you learned the basics of Amazon EMR Serverless
architecture and the use cases the service can be applied to. In the next
lesson, you will learn the basics of Amazon EMR clusters.
