SlideShare a Scribd company logo
EFFECTIVE SPARK WITH ALLUXIO
Spark Summit EU
Jiri Simsa, Alluxio, Inc.
OUTLINE
• Alluxio Overview
• Alluxio + Spark Use Cases
• Using Alluxio with Spark
• Performance Evaluation
2
3
BIG DATA ECOSYSTEM YESTERDAY
BIG DATA ECOSYSTEM TODAY
…
…
3
…
…
BIG DATA ECOSYSTEM ISSUES
3
BIG DATA ECOSYSTEM WITH ALLUXIO
…
…
FUSE Compatible File SystemHadoop Compatible File System Native Key-Value InterfaceNative File System
GlusterFS InterfaceAmazon S3 Interface Swift InterfaceHDFS Interface
3
BIG DATA ECOSYSTEM WITH ALLUXIO
…
…
FUSE Compatible File SystemHadoop Compatible File System Native Key-Value InterfaceNative File System
Unifying Data at Memory Speed
GlusterFS InterfaceAmazon S3 Interface Swift InterfaceHDFS Interface
3
WHY ALLUXIO
4
Co-located compute and data with memory-speed access to data
Virtualized across different storage systems under a unified namespace
Scale-out architecture
File system API, software only
5
Unification
New workflows across any
data in any storage system
Orders of magnitude
improvement in run time
Choice in compute and
storage – grow each
independently, buy only
what is needed
Performance Flexibility
BENEFITS
6
Popular Open Source Projects’ Growth
Spark Kafka Cassandra HDFS
FASTEST GROWING BIG DATA PROJECTS
6
• Fastest growing
open-source project
in the big data
ecosystem
• 300+ contributors
from 100+
organizations
• Running in large
production clusters
Popular Open Source Projects’ Growth
Alluxio
Spark Kafka Cassandra HDFS
FASTEST GROWING BIG DATA PROJECTS
OUTLINE
• Alluxio Overview
• Alluxio + Spark Use Cases
• Using Alluxio with Spark
• Performance Evaluation
7
ACCELERATE I/O TO/FROM REMOTE STORAGE
Baidu’s PMs and analysts run 

interactive queries to gain insights 

into their products and business
• 200+ nodes deployment
• 2+ petabytes of storage
• Mix of memory + HDD
ALLUXIO
Baidu File System
8
ACCELERATE I/O TO/FROM REMOTE STORAGE
The performance was amazing. With Spark SQL
alone, it took 100-150 seconds to finish a query;
using Alluxio, where data may hit local or
remote Alluxio nodes, it took 10-15 seconds.
- Baidu
RESULTS
• Data queries are now 30x faster with Alluxio
• Alluxio cluster runs stably, providing over 50TB
of RAM space
• By using Alluxio, batch queries usually lasting
over 15 minutes were transformed into an
interactive query taking less than 30 seconds
Baidu’s PMs and analysts run 

interactive queries to gain insights 

into their products and business
• 200+ nodes deployment
• 2+ petabytes of storage
• Mix of memory + HDD
ALLUXIO
Baidu File System
8
SHARE DATA ACROSS JOBS AT MEMORY SPEED
Barclays uses query and machine
learning to train models for risk
management
• 6 node deployment
• 1TB of storage
• Memory only
ALLUXIO
Relational Database
9
SHARE DATA ACROSS JOBS AT MEMORY SPEED
Thanks to Alluxio, we now have the raw data
immediately available at every iteration and
we can skip the costs of loading in terms of time
waiting, network traffic, and RDBMS activity.
- Barclays
RESULTS
• Barclays workflow iteration time decreased
from hours to seconds
• Alluxio enabled workflows that were
impossible before
• By keeping data only in memory, the I/O cost
of loading and storing in Alluxio is now on the
order of seconds
Barclays uses query and machine
learning to train models for risk
management
• 6 node deployment
• 1TB of storage
• Memory only
ALLUXIO
Relational Database
9
MANAGE DATA ACROSS STORAGE SYSTEMS
• 200+ nodes deployment
• 6 billion logs (4.5 TB) daily
• Mix of Memory + HDD
ALLUXIO
Qunar uses real-time machine
learning for their website ads.
10
MANAGE DATA ACROSS STORAGE SYSTEMS
We’ve been running Alluxio in production for
over 9 months, Alluxio’s unified namespace
enable different applications and frameworks
to easily interact with data from different
storage systems.
- Qunar
RESULTS
• Data sharing among Spark Streaming, Spark
batch and Flink jobs provide efficient data
sharing
• Improved the performance of their system with
15x – 300x speedups
• Tiered storage feature manages storage
resources including memory and HDD
• 200+ nodes deployment
• 6 billion logs (4.5 TB) daily
• Mix of Memory + HDD
ALLUXIO
Qunar uses real-time machine
learning for their website ads.
10
OUTLINE
• Alluxio Overview
• Alluxio + Spark Use Cases
• Using Alluxio with Spark
• Performance Evaluation
11
CONSOLIDATING MEMORY
12
Storage Engine &
Execution Engine
Same Process
• Two copies of data in memory – double the memory used
• Inter-process Sharing Slowed Down by Network / Disk I/O
Spark Compute
Spark
Storage
block 1
block 3
HDFS / Amazon S3
block 1
block 3
block 2
block 4
Spark Compute
Spark
Storage
block 1
block 3
CONSOLIDATING MEMORY
13
Storage Engine &
Execution Engine
Different process
• Half the memory used
• Inter-process Sharing Happens at Memory Speed
Spark Compute
Spark Storage
HDFS / Amazon S3
block 1
block 3
block 2
block 4
HDFS
disk
block 1
block 3
block 2
block 4
Alluxio
block 1
block 3 block 4
Spark Compute
Spark Storage
DATA RESILIENCE DURING CRASH
14
Spark Compute
Spark Storage
block 1
block 3
HDFS / Amazon S3
block 1
block 3
block 2
block 4
Storage Engine &
Execution Engine
Same Process
DATA RESILIENCE DURING CRASH
15
CRASH
Spark Storage
block 1
block 3
HDFS / Amazon S3
block 1
block 3
block 2
block 4
• Process Crash Requires Network and/or Disk I/O to Re-read
Data
Storage Engine &
Execution Engine
Same Process
DATA RESILIENCE DURING CRASH
16
CRASH
HDFS / Amazon S3
block 1
block 3
block 2
block 4
Storage Engine &
Execution Engine
Same Process
• Process Crash Requires Network and/or Disk I/O to Re-read
Data
DATA RESILIENCE DURING CRASH
17
Spark Compute
Spark Storage
HDFS / Amazon S3
block 1
block 3
block 2
block 4
HDFS
disk
block 1
block 3
block 2
block 4
Alluxio
block 1
block 3 block 4
Storage Engine &
Execution Engine
Different process
DATA RESILIENCE DURING CRASH
18
• Process Crash – Data is Re-read at Memory Speed
HDFS / Amazon S3
block 1
block 3
block 2
block 4
HDFS
disk
block 1
block 3
block 2
block 4
Alluxio
block 1
block 3 block 4
CRASH Storage Engine &
Execution Engine
Different process
ACCESSING ALLUXIO DATA FROM SPARK
19
Writing Data Write to an Alluxio file
Reading Data Read from an Alluxio file
CODE EXAMPLE FOR SPARK RDDS
20
Writing RDD
to Alluxio
rdd.saveAsTextFile(alluxioPath)
rdd.saveAsObjectFile(alluxioPath)
Reading RDD
from Alluxio
rdd = sc.textFile(alluxioPath)
rdd = sc.objectFile(alluxioPath)
CODE EXAMPLE FOR SPARK DATAFRAMES
21
Writing DataFrame
to Alluxio
df.write.parquet(alluxioPath)
Reading DataFrame
from Alluxio
df = sc.read.parquet(alluxioPath)
OUTLINE
• Alluxio Overview
• Alluxio + Spark Use Cases
• Using Alluxio with Spark
• Performance Evaluation
22
ENVIRONMENT
Spark 2.0.0 + Alluxio 1.2.0
Single r3.2xlarge instance (61GB RAM)
Comparisons:
• Alluxio
• Spark Storage Level: MEMORY_ONLY
• Spark Storage Level: MEMORY_ONLY_SER
• Spark Storage Level: DISK_ONLY
23
Reading Cached RDDTime[seconds]
0
75
150
225
300
RDD Size [GB]
0 13 25 38 50
Alluxio (textFile) Alluxio (objectFile) DISK_ONLY
MEMORY_ONLY_SER MEMORY_ONLY
24
25
New Context: Read 50 GB RDD (SSD)
No Alluxio
Alluxio
(objectFile)
Alluxio
(textFile)
Time [seconds]
0 55 110 165 220
2x speedup
4x speedup
26
New Context: Read 50 GB RDD (S3)
No Alluxio
Alluxio
(objectFile)
Alluxio
(textFile)
Time [seconds]
0 200 400 600 800
7x speedup
16x speedup
27
Reading CACHED DATAFRAME (parquet)Time[seconds]
0
62.5
125
187.5
250
DataFrame Size [GB]
0 13 25 38 50
Alluxio (textFile) DISK_ONLY MEMORY_ONLY_SER
MEMORY_ONLY
28
New Context: Read 50 GB DATAFRAME
(SSD)
No Alluxio
Alluxio
Time [seconds]
0 60 120 180 240
2.5x speedup
29
New CONTEXT: Read 50 GB DataFrame
(S3)
No Alluxio
Alluxio
Time [seconds]
0 250 500 750 1000 1250 1500 1750
10x speedup (avg), 17x speedup (peak)
CONCLUSION
30
• Easy to use with Spark
• Better Performance
• Predictable Performance
Contact: jiri@alluxio.com
Twitter: @jsimsa
Websites: www.alluxio.com and www.alluxio.org
Thank you!
31
Ad

More Related Content

What's hot (20)

Presentation by TachyonNexus & Intel at Strata Singapore 2015
Presentation by TachyonNexus & Intel at Strata Singapore 2015Presentation by TachyonNexus & Intel at Strata Singapore 2015
Presentation by TachyonNexus & Intel at Strata Singapore 2015
Tachyon Nexus, Inc.
 
Presentation by TachyonNexus & Baidu at Strata Singapore 2015
Presentation by TachyonNexus & Baidu at Strata Singapore 2015Presentation by TachyonNexus & Baidu at Strata Singapore 2015
Presentation by TachyonNexus & Baidu at Strata Singapore 2015
Tachyon Nexus, Inc.
 
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAdvancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Alluxio, Inc.
 
Getting Started with Alluxio + Spark + S3
Getting Started with Alluxio + Spark + S3Getting Started with Alluxio + Spark + S3
Getting Started with Alluxio + Spark + S3
Alluxio, Inc.
 
Atom: A cloud native deep learning platform at Supremind
Atom: A cloud native deep learning platform at SupremindAtom: A cloud native deep learning platform at Supremind
Atom: A cloud native deep learning platform at Supremind
Alluxio, Inc.
 
Hybrid data lake on google cloud with alluxio and dataproc
Hybrid data lake on google cloud  with alluxio and dataprocHybrid data lake on google cloud  with alluxio and dataproc
Hybrid data lake on google cloud with alluxio and dataproc
Alluxio, Inc.
 
Best Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+AlluxioBest Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+Alluxio
Alluxio, Inc.
 
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsGetting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Alluxio, Inc.
 
Speed up large-scale ML/DL offline inference job with Alluxio
Speed up large-scale ML/DL offline inference job with AlluxioSpeed up large-scale ML/DL offline inference job with Alluxio
Speed up large-scale ML/DL offline inference job with Alluxio
Alluxio, Inc.
 
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio, Inc.
 
Spark Pipelines in the Cloud with Alluxio with Gene Pang
Spark Pipelines in the Cloud with Alluxio with Gene PangSpark Pipelines in the Cloud with Alluxio with Gene Pang
Spark Pipelines in the Cloud with Alluxio with Gene Pang
Spark Summit
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Alluxio, Inc.
 
Lessons Learned From Running Spark On Docker
Lessons Learned From Running Spark On DockerLessons Learned From Running Spark On Docker
Lessons Learned From Running Spark On Docker
Spark Summit
 
Hybrid collaborative tiered storage with alluxio
Hybrid collaborative tiered storage with alluxioHybrid collaborative tiered storage with alluxio
Hybrid collaborative tiered storage with alluxio
Thai Bui
 
Query Anything, Anywhere with Kubernetes
Query Anything, Anywhere with KubernetesQuery Anything, Anywhere with Kubernetes
Query Anything, Anywhere with Kubernetes
Alluxio, Inc.
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio, Inc.
 
Fluid: When Alluxio Meets Kubernetes
Fluid: When Alluxio Meets KubernetesFluid: When Alluxio Meets Kubernetes
Fluid: When Alluxio Meets Kubernetes
Alluxio, Inc.
 
Improving Presto performance with Alluxio at TikTok
Improving Presto performance with Alluxio at TikTokImproving Presto performance with Alluxio at TikTok
Improving Presto performance with Alluxio at TikTok
Alluxio, Inc.
 
Powering Interactive Analytics with Alluxio and Presto
Powering Interactive Analytics with Alluxio and PrestoPowering Interactive Analytics with Alluxio and Presto
Powering Interactive Analytics with Alluxio and Presto
Alluxio, Inc.
 
RaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cacheRaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cache
Alluxio, Inc.
 
Presentation by TachyonNexus & Intel at Strata Singapore 2015
Presentation by TachyonNexus & Intel at Strata Singapore 2015Presentation by TachyonNexus & Intel at Strata Singapore 2015
Presentation by TachyonNexus & Intel at Strata Singapore 2015
Tachyon Nexus, Inc.
 
Presentation by TachyonNexus & Baidu at Strata Singapore 2015
Presentation by TachyonNexus & Baidu at Strata Singapore 2015Presentation by TachyonNexus & Baidu at Strata Singapore 2015
Presentation by TachyonNexus & Baidu at Strata Singapore 2015
Tachyon Nexus, Inc.
 
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAdvancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Alluxio, Inc.
 
Getting Started with Alluxio + Spark + S3
Getting Started with Alluxio + Spark + S3Getting Started with Alluxio + Spark + S3
Getting Started with Alluxio + Spark + S3
Alluxio, Inc.
 
Atom: A cloud native deep learning platform at Supremind
Atom: A cloud native deep learning platform at SupremindAtom: A cloud native deep learning platform at Supremind
Atom: A cloud native deep learning platform at Supremind
Alluxio, Inc.
 
Hybrid data lake on google cloud with alluxio and dataproc
Hybrid data lake on google cloud  with alluxio and dataprocHybrid data lake on google cloud  with alluxio and dataproc
Hybrid data lake on google cloud with alluxio and dataproc
Alluxio, Inc.
 
Best Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+AlluxioBest Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+Alluxio
Alluxio, Inc.
 
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsGetting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Alluxio, Inc.
 
Speed up large-scale ML/DL offline inference job with Alluxio
Speed up large-scale ML/DL offline inference job with AlluxioSpeed up large-scale ML/DL offline inference job with Alluxio
Speed up large-scale ML/DL offline inference job with Alluxio
Alluxio, Inc.
 
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio, Inc.
 
Spark Pipelines in the Cloud with Alluxio with Gene Pang
Spark Pipelines in the Cloud with Alluxio with Gene PangSpark Pipelines in the Cloud with Alluxio with Gene Pang
Spark Pipelines in the Cloud with Alluxio with Gene Pang
Spark Summit
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Alluxio, Inc.
 
Lessons Learned From Running Spark On Docker
Lessons Learned From Running Spark On DockerLessons Learned From Running Spark On Docker
Lessons Learned From Running Spark On Docker
Spark Summit
 
Hybrid collaborative tiered storage with alluxio
Hybrid collaborative tiered storage with alluxioHybrid collaborative tiered storage with alluxio
Hybrid collaborative tiered storage with alluxio
Thai Bui
 
Query Anything, Anywhere with Kubernetes
Query Anything, Anywhere with KubernetesQuery Anything, Anywhere with Kubernetes
Query Anything, Anywhere with Kubernetes
Alluxio, Inc.
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio, Inc.
 
Fluid: When Alluxio Meets Kubernetes
Fluid: When Alluxio Meets KubernetesFluid: When Alluxio Meets Kubernetes
Fluid: When Alluxio Meets Kubernetes
Alluxio, Inc.
 
Improving Presto performance with Alluxio at TikTok
Improving Presto performance with Alluxio at TikTokImproving Presto performance with Alluxio at TikTok
Improving Presto performance with Alluxio at TikTok
Alluxio, Inc.
 
Powering Interactive Analytics with Alluxio and Presto
Powering Interactive Analytics with Alluxio and PrestoPowering Interactive Analytics with Alluxio and Presto
Powering Interactive Analytics with Alluxio and Presto
Alluxio, Inc.
 
RaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cacheRaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cache
Alluxio, Inc.
 

Viewers also liked (20)

ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
Alluxio, Inc.
 
Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio, Inc.
 
Hadoop World 2011: Completing the Big Data Picture Understanding Why and Not ...
Hadoop World 2011: Completing the Big Data Picture Understanding Why and Not ...Hadoop World 2011: Completing the Big Data Picture Understanding Why and Not ...
Hadoop World 2011: Completing the Big Data Picture Understanding Why and Not ...
Cloudera, Inc.
 
Troubleshooting Kafka's socket server: from incident to resolution
Troubleshooting Kafka's socket server: from incident to resolutionTroubleshooting Kafka's socket server: from incident to resolution
Troubleshooting Kafka's socket server: from incident to resolution
Joel Koshy
 
Workshop mesos docker devoxx fr 2016
Workshop mesos docker devoxx fr 2016Workshop mesos docker devoxx fr 2016
Workshop mesos docker devoxx fr 2016
Julia Mateo
 
Kafkaesque days at linked in in 2015
Kafkaesque days at linked in in 2015Kafkaesque days at linked in in 2015
Kafkaesque days at linked in in 2015
Joel Koshy
 
JCConf 2015 - Google Dataflow 在雲端大資料處理的應用
JCConf 2015 - Google Dataflow 在雲端大資料處理的應用JCConf 2015 - Google Dataflow 在雲端大資料處理的應用
JCConf 2015 - Google Dataflow 在雲端大資料處理的應用
Simon Su
 
Tuning Kafka for Fun and Profit
Tuning Kafka for Fun and ProfitTuning Kafka for Fun and Profit
Tuning Kafka for Fun and Profit
Todd Palino
 
Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016
Jiří Šimša
 
Accessing Data Anywhere with Unified Namespace
Accessing Data Anywhere with Unified NamespaceAccessing Data Anywhere with Unified Namespace
Accessing Data Anywhere with Unified Namespace
Alluxio, Inc.
 
Consumer offset management in Kafka
Consumer offset management in KafkaConsumer offset management in Kafka
Consumer offset management in Kafka
Joel Koshy
 
Tachyon workshop 2015-07-19
Tachyon workshop 2015-07-19Tachyon workshop 2015-07-19
Tachyon workshop 2015-07-19
Tachyon Nexus, Inc.
 
Alluxio Presentation at AMPLab Summer Retreat 2016
Alluxio Presentation at AMPLab Summer Retreat 2016Alluxio Presentation at AMPLab Summer Retreat 2016
Alluxio Presentation at AMPLab Summer Retreat 2016
Alluxio, Inc.
 
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
Spark Summit
 
Kafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesKafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier Architectures
Todd Palino
 
Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017
Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017 Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017
Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017
Alluxio, Inc.
 
Introduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise Control
Jiangjie Qin
 
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
Spark Summit
 
No data loss pipeline with apache kafka
No data loss pipeline with apache kafkaNo data loss pipeline with apache kafka
No data loss pipeline with apache kafka
Jiangjie Qin
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
Allen (Xiaozhong) Wang
 
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
Alluxio, Inc.
 
Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio, Inc.
 
Hadoop World 2011: Completing the Big Data Picture Understanding Why and Not ...
Hadoop World 2011: Completing the Big Data Picture Understanding Why and Not ...Hadoop World 2011: Completing the Big Data Picture Understanding Why and Not ...
Hadoop World 2011: Completing the Big Data Picture Understanding Why and Not ...
Cloudera, Inc.
 
Troubleshooting Kafka's socket server: from incident to resolution
Troubleshooting Kafka's socket server: from incident to resolutionTroubleshooting Kafka's socket server: from incident to resolution
Troubleshooting Kafka's socket server: from incident to resolution
Joel Koshy
 
Workshop mesos docker devoxx fr 2016
Workshop mesos docker devoxx fr 2016Workshop mesos docker devoxx fr 2016
Workshop mesos docker devoxx fr 2016
Julia Mateo
 
Kafkaesque days at linked in in 2015
Kafkaesque days at linked in in 2015Kafkaesque days at linked in in 2015
Kafkaesque days at linked in in 2015
Joel Koshy
 
JCConf 2015 - Google Dataflow 在雲端大資料處理的應用
JCConf 2015 - Google Dataflow 在雲端大資料處理的應用JCConf 2015 - Google Dataflow 在雲端大資料處理的應用
JCConf 2015 - Google Dataflow 在雲端大資料處理的應用
Simon Su
 
Tuning Kafka for Fun and Profit
Tuning Kafka for Fun and ProfitTuning Kafka for Fun and Profit
Tuning Kafka for Fun and Profit
Todd Palino
 
Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016
Jiří Šimša
 
Accessing Data Anywhere with Unified Namespace
Accessing Data Anywhere with Unified NamespaceAccessing Data Anywhere with Unified Namespace
Accessing Data Anywhere with Unified Namespace
Alluxio, Inc.
 
Consumer offset management in Kafka
Consumer offset management in KafkaConsumer offset management in Kafka
Consumer offset management in Kafka
Joel Koshy
 
Alluxio Presentation at AMPLab Summer Retreat 2016
Alluxio Presentation at AMPLab Summer Retreat 2016Alluxio Presentation at AMPLab Summer Retreat 2016
Alluxio Presentation at AMPLab Summer Retreat 2016
Alluxio, Inc.
 
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
Spark Summit
 
Kafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesKafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier Architectures
Todd Palino
 
Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017
Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017 Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017
Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017
Alluxio, Inc.
 
Introduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise Control
Jiangjie Qin
 
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
Spark Summit
 
No data loss pipeline with apache kafka
No data loss pipeline with apache kafkaNo data loss pipeline with apache kafka
No data loss pipeline with apache kafka
Jiangjie Qin
 
Ad

Similar to Spark Summit EU talk by Jiri Simsa (20)

Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017
Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017
Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017
Alluxio, Inc.
 
Alluxio: Unify Data at Memory Speed at Strata and Hadoop World San Jose 2017
Alluxio: Unify Data at Memory Speed at Strata and Hadoop World San Jose 2017Alluxio: Unify Data at Memory Speed at Strata and Hadoop World San Jose 2017
Alluxio: Unify Data at Memory Speed at Strata and Hadoop World San Jose 2017
Alluxio, Inc.
 
Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...
Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...
Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...
Alluxio, Inc.
 
Alluxio Keynote at Strata+Hadoop World Beijing 2016
Alluxio Keynote at Strata+Hadoop World Beijing 2016Alluxio Keynote at Strata+Hadoop World Beijing 2016
Alluxio Keynote at Strata+Hadoop World Beijing 2016
Alluxio, Inc.
 
Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with Spark
Alluxio, Inc.
 
Improving Memory Utilization of Spark Jobs Using Alluxio
Improving Memory Utilization of Spark Jobs Using AlluxioImproving Memory Utilization of Spark Jobs Using Alluxio
Improving Memory Utilization of Spark Jobs Using Alluxio
Alluxio, Inc.
 
Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with Spark
Alluxio, Inc.
 
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...
Databricks
 
Spark Pipelines in the Cloud with Alluxio by Bin Fan
Spark Pipelines in the Cloud with Alluxio by Bin FanSpark Pipelines in the Cloud with Alluxio by Bin Fan
Spark Pipelines in the Cloud with Alluxio by Bin Fan
Data Con LA
 
Best Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene PangBest Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene Pang
Spark Summit
 
Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with Spark
Alluxio, Inc.
 
Alluxio: Unify Data at Memory Speed; 2016-11-18
Alluxio: Unify Data at Memory Speed; 2016-11-18Alluxio: Unify Data at Memory Speed; 2016-11-18
Alluxio: Unify Data at Memory Speed; 2016-11-18
Alluxio, Inc.
 
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
Alluxio, Inc.
 
Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017
Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017
Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017
Alluxio, Inc.
 
Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio
Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio
Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio
Ceph Community
 
Alluxio @ Uber Seattle Meetup
Alluxio @ Uber Seattle MeetupAlluxio @ Uber Seattle Meetup
Alluxio @ Uber Seattle Meetup
Alluxio, Inc.
 
Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio
Alluxio, Inc.
 
Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Interactive Analytics with the Starburst Presto + Alluxio stack for the CloudInteractive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Alluxio, Inc.
 
Building Fast SQL Analytics on Anything with Presto, Alluxio
Building Fast SQL Analytics on Anything with Presto, AlluxioBuilding Fast SQL Analytics on Anything with Presto, Alluxio
Building Fast SQL Analytics on Anything with Presto, Alluxio
Alluxio, Inc.
 
Inter connect2016 yss1841-cloud-storage-options-v4
Inter connect2016 yss1841-cloud-storage-options-v4Inter connect2016 yss1841-cloud-storage-options-v4
Inter connect2016 yss1841-cloud-storage-options-v4
Tony Pearson
 
Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017
Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017
Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017
Alluxio, Inc.
 
Alluxio: Unify Data at Memory Speed at Strata and Hadoop World San Jose 2017
Alluxio: Unify Data at Memory Speed at Strata and Hadoop World San Jose 2017Alluxio: Unify Data at Memory Speed at Strata and Hadoop World San Jose 2017
Alluxio: Unify Data at Memory Speed at Strata and Hadoop World San Jose 2017
Alluxio, Inc.
 
Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...
Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...
Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...
Alluxio, Inc.
 
Alluxio Keynote at Strata+Hadoop World Beijing 2016
Alluxio Keynote at Strata+Hadoop World Beijing 2016Alluxio Keynote at Strata+Hadoop World Beijing 2016
Alluxio Keynote at Strata+Hadoop World Beijing 2016
Alluxio, Inc.
 
Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with Spark
Alluxio, Inc.
 
Improving Memory Utilization of Spark Jobs Using Alluxio
Improving Memory Utilization of Spark Jobs Using AlluxioImproving Memory Utilization of Spark Jobs Using Alluxio
Improving Memory Utilization of Spark Jobs Using Alluxio
Alluxio, Inc.
 
Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with Spark
Alluxio, Inc.
 
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...
Databricks
 
Spark Pipelines in the Cloud with Alluxio by Bin Fan
Spark Pipelines in the Cloud with Alluxio by Bin FanSpark Pipelines in the Cloud with Alluxio by Bin Fan
Spark Pipelines in the Cloud with Alluxio by Bin Fan
Data Con LA
 
Best Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene PangBest Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene Pang
Spark Summit
 
Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with Spark
Alluxio, Inc.
 
Alluxio: Unify Data at Memory Speed; 2016-11-18
Alluxio: Unify Data at Memory Speed; 2016-11-18Alluxio: Unify Data at Memory Speed; 2016-11-18
Alluxio: Unify Data at Memory Speed; 2016-11-18
Alluxio, Inc.
 
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
Alluxio, Inc.
 
Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017
Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017
Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017
Alluxio, Inc.
 
Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio
Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio
Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio
Ceph Community
 
Alluxio @ Uber Seattle Meetup
Alluxio @ Uber Seattle MeetupAlluxio @ Uber Seattle Meetup
Alluxio @ Uber Seattle Meetup
Alluxio, Inc.
 
Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio
Alluxio, Inc.
 
Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Interactive Analytics with the Starburst Presto + Alluxio stack for the CloudInteractive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Alluxio, Inc.
 
Building Fast SQL Analytics on Anything with Presto, Alluxio
Building Fast SQL Analytics on Anything with Presto, AlluxioBuilding Fast SQL Analytics on Anything with Presto, Alluxio
Building Fast SQL Analytics on Anything with Presto, Alluxio
Alluxio, Inc.
 
Inter connect2016 yss1841-cloud-storage-options-v4
Inter connect2016 yss1841-cloud-storage-options-v4Inter connect2016 yss1841-cloud-storage-options-v4
Inter connect2016 yss1841-cloud-storage-options-v4
Tony Pearson
 
Ad

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Spark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Spark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

Recently uploaded (20)

Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Abodahab
 
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
gmuir1066
 
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnTemplate_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
cegiver630
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
PRE-NATAL GRnnnmnnnnmmOWTH seminar[1].pptx
PRE-NATAL GRnnnmnnnnmmOWTH seminar[1].pptxPRE-NATAL GRnnnmnnnnmmOWTH seminar[1].pptx
PRE-NATAL GRnnnmnnnnmmOWTH seminar[1].pptx
JayeshTaneja4
 
Flip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptxFlip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptx
mubashirkhan45461
 
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docxMASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
santosh162
 
brainstorming-techniques-infographics.pptx
brainstorming-techniques-infographics.pptxbrainstorming-techniques-infographics.pptx
brainstorming-techniques-infographics.pptx
maritzacastro321
 
Chromatography_Detailed_Information.docx
Chromatography_Detailed_Information.docxChromatography_Detailed_Information.docx
Chromatography_Detailed_Information.docx
NohaSalah45
 
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptxmd-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
fatimalazaar2004
 
computer organization and assembly language.docx
computer organization and assembly language.docxcomputer organization and assembly language.docx
computer organization and assembly language.docx
alisoftwareengineer1
 
Developing Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response ApplicationsDeveloping Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response Applications
VICTOR MAESTRE RAMIREZ
 
Digilocker under workingProcess Flow.pptx
Digilocker  under workingProcess Flow.pptxDigilocker  under workingProcess Flow.pptx
Digilocker under workingProcess Flow.pptx
satnamsadguru491
 
03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia
Alexander Romero Arosquipa
 
History of Science and Technologyandits source.pptx
History of Science and Technologyandits source.pptxHistory of Science and Technologyandits source.pptx
History of Science and Technologyandits source.pptx
balongcastrojo
 
shit yudh slideshare power likha point presen
shit yudh slideshare power likha point presenshit yudh slideshare power likha point presen
shit yudh slideshare power likha point presen
vishalgurjar11229
 
Introcomputerscienceand datascience.pptx
Introcomputerscienceand datascience.pptxIntrocomputerscienceand datascience.pptx
Introcomputerscienceand datascience.pptx
abdulrehmanbscsf22
 
Induction Program of MTAB online session
Induction Program of MTAB online sessionInduction Program of MTAB online session
Induction Program of MTAB online session
LOHITH886892
 
Shotgun detailed overview my this ppt formate
Shotgun detailed overview my this ppt formateShotgun detailed overview my this ppt formate
Shotgun detailed overview my this ppt formate
freefreefire0998
 
KNN_Logistic_Regression_Presentation_Styled.pptx
KNN_Logistic_Regression_Presentation_Styled.pptxKNN_Logistic_Regression_Presentation_Styled.pptx
KNN_Logistic_Regression_Presentation_Styled.pptx
sonujha1980712
 
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Abodahab
 
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
gmuir1066
 
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnTemplate_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
cegiver630
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
PRE-NATAL GRnnnmnnnnmmOWTH seminar[1].pptx
PRE-NATAL GRnnnmnnnnmmOWTH seminar[1].pptxPRE-NATAL GRnnnmnnnnmmOWTH seminar[1].pptx
PRE-NATAL GRnnnmnnnnmmOWTH seminar[1].pptx
JayeshTaneja4
 
Flip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptxFlip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptx
mubashirkhan45461
 
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docxMASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
santosh162
 
brainstorming-techniques-infographics.pptx
brainstorming-techniques-infographics.pptxbrainstorming-techniques-infographics.pptx
brainstorming-techniques-infographics.pptx
maritzacastro321
 
Chromatography_Detailed_Information.docx
Chromatography_Detailed_Information.docxChromatography_Detailed_Information.docx
Chromatography_Detailed_Information.docx
NohaSalah45
 
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptxmd-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
fatimalazaar2004
 
computer organization and assembly language.docx
computer organization and assembly language.docxcomputer organization and assembly language.docx
computer organization and assembly language.docx
alisoftwareengineer1
 
Developing Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response ApplicationsDeveloping Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response Applications
VICTOR MAESTRE RAMIREZ
 
Digilocker under workingProcess Flow.pptx
Digilocker  under workingProcess Flow.pptxDigilocker  under workingProcess Flow.pptx
Digilocker under workingProcess Flow.pptx
satnamsadguru491
 
History of Science and Technologyandits source.pptx
History of Science and Technologyandits source.pptxHistory of Science and Technologyandits source.pptx
History of Science and Technologyandits source.pptx
balongcastrojo
 
shit yudh slideshare power likha point presen
shit yudh slideshare power likha point presenshit yudh slideshare power likha point presen
shit yudh slideshare power likha point presen
vishalgurjar11229
 
Introcomputerscienceand datascience.pptx
Introcomputerscienceand datascience.pptxIntrocomputerscienceand datascience.pptx
Introcomputerscienceand datascience.pptx
abdulrehmanbscsf22
 
Induction Program of MTAB online session
Induction Program of MTAB online sessionInduction Program of MTAB online session
Induction Program of MTAB online session
LOHITH886892
 
Shotgun detailed overview my this ppt formate
Shotgun detailed overview my this ppt formateShotgun detailed overview my this ppt formate
Shotgun detailed overview my this ppt formate
freefreefire0998
 
KNN_Logistic_Regression_Presentation_Styled.pptx
KNN_Logistic_Regression_Presentation_Styled.pptxKNN_Logistic_Regression_Presentation_Styled.pptx
KNN_Logistic_Regression_Presentation_Styled.pptx
sonujha1980712
 

Spark Summit EU talk by Jiri Simsa

  • 1. EFFECTIVE SPARK WITH ALLUXIO Spark Summit EU Jiri Simsa, Alluxio, Inc.
  • 2. OUTLINE • Alluxio Overview • Alluxio + Spark Use Cases • Using Alluxio with Spark • Performance Evaluation 2
  • 4. BIG DATA ECOSYSTEM TODAY … … 3
  • 6. BIG DATA ECOSYSTEM WITH ALLUXIO … … FUSE Compatible File SystemHadoop Compatible File System Native Key-Value InterfaceNative File System GlusterFS InterfaceAmazon S3 Interface Swift InterfaceHDFS Interface 3
  • 7. BIG DATA ECOSYSTEM WITH ALLUXIO … … FUSE Compatible File SystemHadoop Compatible File System Native Key-Value InterfaceNative File System Unifying Data at Memory Speed GlusterFS InterfaceAmazon S3 Interface Swift InterfaceHDFS Interface 3
  • 8. WHY ALLUXIO 4 Co-located compute and data with memory-speed access to data Virtualized across different storage systems under a unified namespace Scale-out architecture File system API, software only
  • 9. 5 Unification New workflows across any data in any storage system Orders of magnitude improvement in run time Choice in compute and storage – grow each independently, buy only what is needed Performance Flexibility BENEFITS
  • 10. 6 Popular Open Source Projects’ Growth Spark Kafka Cassandra HDFS FASTEST GROWING BIG DATA PROJECTS
  • 11. 6 • Fastest growing open-source project in the big data ecosystem • 300+ contributors from 100+ organizations • Running in large production clusters Popular Open Source Projects’ Growth Alluxio Spark Kafka Cassandra HDFS FASTEST GROWING BIG DATA PROJECTS
  • 12. OUTLINE • Alluxio Overview • Alluxio + Spark Use Cases • Using Alluxio with Spark • Performance Evaluation 7
  • 13. ACCELERATE I/O TO/FROM REMOTE STORAGE Baidu’s PMs and analysts run 
 interactive queries to gain insights 
 into their products and business • 200+ nodes deployment • 2+ petabytes of storage • Mix of memory + HDD ALLUXIO Baidu File System 8
  • 14. ACCELERATE I/O TO/FROM REMOTE STORAGE The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds. - Baidu RESULTS • Data queries are now 30x faster with Alluxio • Alluxio cluster runs stably, providing over 50TB of RAM space • By using Alluxio, batch queries usually lasting over 15 minutes were transformed into an interactive query taking less than 30 seconds Baidu’s PMs and analysts run 
 interactive queries to gain insights 
 into their products and business • 200+ nodes deployment • 2+ petabytes of storage • Mix of memory + HDD ALLUXIO Baidu File System 8
  • 15. SHARE DATA ACROSS JOBS AT MEMORY SPEED Barclays uses query and machine learning to train models for risk management • 6 node deployment • 1TB of storage • Memory only ALLUXIO Relational Database 9
  • 16. SHARE DATA ACROSS JOBS AT MEMORY SPEED Thanks to Alluxio, we now have the raw data immediately available at every iteration and we can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity. - Barclays RESULTS • Barclays workflow iteration time decreased from hours to seconds • Alluxio enabled workflows that were impossible before • By keeping data only in memory, the I/O cost of loading and storing in Alluxio is now on the order of seconds Barclays uses query and machine learning to train models for risk management • 6 node deployment • 1TB of storage • Memory only ALLUXIO Relational Database 9
  • 17. MANAGE DATA ACROSS STORAGE SYSTEMS • 200+ nodes deployment • 6 billion logs (4.5 TB) daily • Mix of Memory + HDD ALLUXIO Qunar uses real-time machine learning for their website ads. 10
  • 18. MANAGE DATA ACROSS STORAGE SYSTEMS We’ve been running Alluxio in production for over 9 months, Alluxio’s unified namespace enable different applications and frameworks to easily interact with data from different storage systems. - Qunar RESULTS • Data sharing among Spark Streaming, Spark batch and Flink jobs provide efficient data sharing • Improved the performance of their system with 15x – 300x speedups • Tiered storage feature manages storage resources including memory and HDD • 200+ nodes deployment • 6 billion logs (4.5 TB) daily • Mix of Memory + HDD ALLUXIO Qunar uses real-time machine learning for their website ads. 10
  • 19. OUTLINE • Alluxio Overview • Alluxio + Spark Use Cases • Using Alluxio with Spark • Performance Evaluation 11
  • 20. CONSOLIDATING MEMORY 12 Storage Engine & Execution Engine Same Process • Two copies of data in memory – double the memory used • Inter-process Sharing Slowed Down by Network / Disk I/O Spark Compute Spark Storage block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 Spark Compute Spark Storage block 1 block 3
  • 21. CONSOLIDATING MEMORY 13 Storage Engine & Execution Engine Different process • Half the memory used • Inter-process Sharing Happens at Memory Speed Spark Compute Spark Storage HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block 1 block 3 block 2 block 4 Alluxio block 1 block 3 block 4 Spark Compute Spark Storage
  • 22. DATA RESILIENCE DURING CRASH 14 Spark Compute Spark Storage block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 Storage Engine & Execution Engine Same Process
  • 23. DATA RESILIENCE DURING CRASH 15 CRASH Spark Storage block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 • Process Crash Requires Network and/or Disk I/O to Re-read Data Storage Engine & Execution Engine Same Process
  • 24. DATA RESILIENCE DURING CRASH 16 CRASH HDFS / Amazon S3 block 1 block 3 block 2 block 4 Storage Engine & Execution Engine Same Process • Process Crash Requires Network and/or Disk I/O to Re-read Data
  • 25. DATA RESILIENCE DURING CRASH 17 Spark Compute Spark Storage HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block 1 block 3 block 2 block 4 Alluxio block 1 block 3 block 4 Storage Engine & Execution Engine Different process
  • 26. DATA RESILIENCE DURING CRASH 18 • Process Crash – Data is Re-read at Memory Speed HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block 1 block 3 block 2 block 4 Alluxio block 1 block 3 block 4 CRASH Storage Engine & Execution Engine Different process
  • 27. ACCESSING ALLUXIO DATA FROM SPARK 19 Writing Data Write to an Alluxio file Reading Data Read from an Alluxio file
  • 28. CODE EXAMPLE FOR SPARK RDDS 20 Writing RDD to Alluxio rdd.saveAsTextFile(alluxioPath) rdd.saveAsObjectFile(alluxioPath) Reading RDD from Alluxio rdd = sc.textFile(alluxioPath) rdd = sc.objectFile(alluxioPath)
  • 29. CODE EXAMPLE FOR SPARK DATAFRAMES 21 Writing DataFrame to Alluxio df.write.parquet(alluxioPath) Reading DataFrame from Alluxio df = sc.read.parquet(alluxioPath)
  • 30. OUTLINE • Alluxio Overview • Alluxio + Spark Use Cases • Using Alluxio with Spark • Performance Evaluation 22
  • 31. ENVIRONMENT Spark 2.0.0 + Alluxio 1.2.0 Single r3.2xlarge instance (61GB RAM) Comparisons: • Alluxio • Spark Storage Level: MEMORY_ONLY • Spark Storage Level: MEMORY_ONLY_SER • Spark Storage Level: DISK_ONLY 23
  • 32. Reading Cached RDDTime[seconds] 0 75 150 225 300 RDD Size [GB] 0 13 25 38 50 Alluxio (textFile) Alluxio (objectFile) DISK_ONLY MEMORY_ONLY_SER MEMORY_ONLY 24
  • 33. 25 New Context: Read 50 GB RDD (SSD) No Alluxio Alluxio (objectFile) Alluxio (textFile) Time [seconds] 0 55 110 165 220 2x speedup 4x speedup
  • 34. 26 New Context: Read 50 GB RDD (S3) No Alluxio Alluxio (objectFile) Alluxio (textFile) Time [seconds] 0 200 400 600 800 7x speedup 16x speedup
  • 35. 27 Reading CACHED DATAFRAME (parquet)Time[seconds] 0 62.5 125 187.5 250 DataFrame Size [GB] 0 13 25 38 50 Alluxio (textFile) DISK_ONLY MEMORY_ONLY_SER MEMORY_ONLY
  • 36. 28 New Context: Read 50 GB DATAFRAME (SSD) No Alluxio Alluxio Time [seconds] 0 60 120 180 240 2.5x speedup
  • 37. 29 New CONTEXT: Read 50 GB DataFrame (S3) No Alluxio Alluxio Time [seconds] 0 250 500 750 1000 1250 1500 1750 10x speedup (avg), 17x speedup (peak)
  • 38. CONCLUSION 30 • Easy to use with Spark • Better Performance • Predictable Performance
  • 39. Contact: [email protected] Twitter: @jsimsa Websites: www.alluxio.com and www.alluxio.org Thank you! 31