Alluxio Open Source New York Meetup
Alluxio on AWS EMR
Fast storage access and sharing for Spark
Table of Contents
● Spark and Alluxio
● AWS EMR and deployment of Alluxio
● Demo
About me
Chengzhi Zhao
● Data Platform Engineer at Meetup for 2.5 years, plus 4 years in retail and healthcare
● Area of interest: building scalable, reliable, and maintainable data infrastructure and tools
How do we run Spark jobs?
Compute: EMR, Dataproc, HDInsight, on premise
Storage: S3, GCS, DB, HDFS
To Summarize...
● Multiple cloud providers
● Storage layer is separated from compute layer
● Data skew/Data Format
● Failure/Restart
● Data sharing is difficult
Alluxio Architecture
DATA LOCALITY
Spark tries to execute tasks as close to the data as possible to minimize
data transfer (over the wire).
Preferred locality order: PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, then ANY
Check Data Locality
Data Sharing
Spark Read & Write
val df = spark.read.parquet("alluxio://")
val rdd = sc.textFile("alluxio://")
df.write.parquet("alluxio://")
rdd.saveAsTextFile("alluxio://")
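The `alluxio://` URIs above are abbreviated on the slide. As a minimal sketch, assuming the Alluxio master listens on its default RPC port 19998, a full URI could be assembled as follows. The helper name `alluxio_uri`, the host name, and the path are hypothetical and not taken from the talk.

```shell
# Hypothetical helper: build an Alluxio URI from a master host and a path.
# $1 = master host, $2 = path, $3 = optional port (assumed default: 19998).
alluxio_uri() {
  # Strip one leading slash from the path so the URI has exactly one separator.
  printf 'alluxio://%s:%s/%s\n' "$1" "${3:-19998}" "${2#/}"
}

# Example with placeholder values
alluxio_uri "master-host" "/data/events.parquet"
```

The resulting string is what would be passed to `spark.read.parquet(...)` or `sc.textFile(...)`.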
Caching
● Avoid duplicated cache
● Crash/Restart
Tiered Storage
By default, Alluxio only enables a single, memory tier.
● MEM (Memory)
● SSD (Solid State Drives)
● HDD (Hard Disk Drives)
Tiered Storage Configurations
alluxio.worker.tieredstore.levels=2
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.quota=24GB
alluxio.worker.tieredstore.level0.watermark.high.ratio=0.9
alluxio.worker.tieredstore.level0.watermark.low.ratio=0.7
alluxio.worker.tieredstore.level1.alias=HDD
alluxio.worker.tieredstore.level1.dirs.quota=300GB
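To make the watermark settings concrete: as I understand Alluxio's space reservation, once usage of a tier passes the high watermark the worker frees space in the background until usage falls back to the low watermark. A minimal sketch of that arithmetic for the 24GB memory tier above follows; `watermark_gb` is a hypothetical helper, not an Alluxio command.

```shell
# Hypothetical helper: threshold in GB for a tier quota and a watermark ratio.
# $1 = quota in GB, $2 = watermark ratio between 0 and 1.
watermark_gb() {
  awk -v quota="$1" -v ratio="$2" 'BEGIN { printf "%.1f\n", quota * ratio }'
}

# Values from the level0 (MEM) configuration above
echo "eviction starts above: $(watermark_gb 24 0.9) GB"   # 21.6
echo "eviction stops near:   $(watermark_gb 24 0.7) GB"   # 16.8
```

With these settings roughly 4.8GB of the memory tier is cleared in each eviction pass, with evicted blocks moving down to the HDD tier.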
Preload data into Alluxio Uniformly
$ bin/alluxio fs load /path/to/load
Data Warehouse
EMR (Elastic MapReduce)
Amazon EMR simplifies building and operating big data environments
and applications. EMR features include easy provisioning, scaling, and
reconfiguring of clusters, and notebooks for collaborative development.
Applications: Flink, Ganglia, Hadoop, HBase, HCatalog, Hive, Hue,
JupyterHub, Livy, Mahout, MXNet, Oozie, Phoenix, Pig, Presto, Spark,
Sqoop, TensorFlow, Tez, Zeppelin, and ZooKeeper.
Set Up Alluxio on EMR
aws emr create-cluster \
  --applications Name=Spark \
  --release-label emr-5.23.0 \
  --configurations ... \
  --bootstrap-actions '[{"Path":"s3://test/bootstrap_cz.sh"}]' \
  --region us-east-1
core-site
"fs.alluxio.impl": "alluxio.hadoop.FileSystem"
"fs.AbstractFileSystem.alluxio.impl": "alluxio.hadoop.AlluxioFileSystem"
spark-defaults
"spark.driver.extraClassPath": ":/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/opt/alluxio-1.8.1-hadoop-2.8/client/alluxio-1.8.1-client.jar",
"spark.executor.extraClassPath": ":/opt/alluxio-1.8.1-hadoop-2.8/client/alluxio-1.8.1-client.jar"
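The core-site and spark-defaults entries above are the kind of content that goes into the `--configurations` argument of `aws emr create-cluster`. A minimal sketch of writing them to a file follows, assuming EMR's usual `Classification`/`Properties` JSON layout. The file name `alluxio-emr-config.json` is hypothetical, and the driver classpath is left out for brevity, so only the executor entry from the slide appears.

```shell
# Write an EMR configurations file with the settings from the slides.
# The file name is a placeholder chosen for this sketch.
config_file="alluxio-emr-config.json"
client_jar="/opt/alluxio-1.8.1-hadoop-2.8/client/alluxio-1.8.1-client.jar"

cat > "$config_file" <<EOF
[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.alluxio.impl": "alluxio.hadoop.FileSystem",
      "fs.AbstractFileSystem.alluxio.impl": "alluxio.hadoop.AlluxioFileSystem"
    }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.extraClassPath": ":${client_jar}"
    }
  }
]
EOF

echo "wrote $config_file"
```

A file like this can be referenced as `--configurations file://alluxio-emr-config.json`. A real cluster would also need the `spark.driver.extraClassPath` entry shown above.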
Bootstrap
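# Read the node role and the master's private DNS name from the EMR instance metadata
ismaster=$(jq -r '.isMaster' /mnt/var/lib/info/instance.json)
masterdns=$(jq -r '.masterPrivateDnsName' /mnt/var/lib/info/job-flow.json)
….
if [[ ${ismaster} == "true" ]]; then
  # Master node: generate the config, format, and start the Alluxio master
  sudo ./bin/alluxio bootstrapConf "${masterdns}"
  sudo ./bin/alluxio format
  sudo ./bin/alluxio-start.sh master
else
  # Worker node: generate the config, run the setup function (defined elsewhere in the script), start a worker
  sudo ./bin/alluxio bootstrapConf "${masterdns}"
  initialize_alluxio
  sudo ./bin/alluxio format
  sudo ./bin/alluxio-start.sh worker Mount
fi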
….
DEMO
Hardware
● r5.2xlarge
● 8 vCore
● 64 GiB memory
● 3 instances
Dataset size: 51 GB Parquet
[Figures: Non-Alluxio vs. Alluxio comparison]
Q & A
https://github.com/ChengzhiZhao/Alluxio-EMR-bootstrap
