In-memory data pipeline and warehouse at scale
using Spark, Spark SQL, Tachyon and Parquet
Buzzwords Berlin - 2015
Radu Chilom, radu.chilom@gmail.com
Ema Iancuta, iorhian@gmail.com

• Big data analytics / machine learning
• 6+ years with the Hadoop ecosystem
• 2 years with Spark
• http://atigeo.com/
• A research group that focuses on the technical problems that exist in the big data industry and provides open source solutions
• http://bigdataresearch.io/

Agenda
• Intro
• Use Case
• Data pipeline with Spark
• Spark Job Rest Service
• Spark SQL Rest Service (Jaws)
• Parquet
• Tachyon
• Demo

Use Case
• Build an in-memory data pipeline for millions of financial transactions, used downstream by data scientists for detecting fraud
• Ingestion from S3 to our Tachyon/HDFS cluster
• Data transformation
• Data warehouse

Apache Spark
• “fast and general engine for large-scale data processing”
• Built around the concept of RDD
• API for Java/Scala/Python (80 operators)
• Powers a stack of high-level tools including Spark SQL, MLlib, Spark Streaming.

Public S3 Bucket: public-financial-transactions
public-financial-transactions (s3-bucket)
    scheme
        scheme.csv
    data
        input-0.csv
    data2
        input-1.csv
    . . .
        . . .

1. Ingestion
• Download from S3
• Resolving the wildcards means listing file metadata
• Listing the metadata for a large number of files from external sources can take a long time

Listing the metadata (distributed)
[Diagram: the Driver holds the folder list (folder1 to folder6) and hands the folders to three Workers, which list the files inside them]
folder1: file-11, file-12
folder2: file-21, file-22, file-23
folder3: file-31, file-32
folder4: file-41, file-42, file-43, file-44
folder5: file-51, file-52
folder6: file-61

Listing the metadata (distributed)
• For fine tuning, specify the number of partitions
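The idea on the two slides above, where the driver hands out folders and the workers list the files inside them, can be sketched without a cluster. What follows is a minimal illustration in plain Python, not the authors' Spark code (the transcript does not include it). A thread pool stands in for the workers, `list_folder`, `list_metadata` and `demo` are hypothetical names, and `num_partitions` plays the role of the partition count the slide suggests tuning.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def list_folder(folder):
    """Return (path, size in bytes) for every regular file directly inside one folder."""
    with os.scandir(folder) as entries:
        return sorted((e.path, e.stat().st_size) for e in entries if e.is_file())


def list_metadata(folders, num_partitions):
    """List file metadata for many folders in parallel.

    The thread pool is only a stand-in for Spark workers (an assumption of
    this sketch); num_partitions caps how many folders are listed at once.
    """
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        per_folder = pool.map(list_folder, folders)  # results keep the folder order
        return [meta for listing in per_folder for meta in listing]


def demo():
    """Build a small folder tree in a temp dir and return (file name, size) pairs."""
    with tempfile.TemporaryDirectory() as root:
        folders = []
        for i, count in enumerate([2, 3, 1], start=1):
            folder = os.path.join(root, f"folder{i}")
            os.mkdir(folder)
            folders.append(folder)
            for j in range(1, count + 1):
                with open(os.path.join(folder, f"file-{i}{j}"), "w") as f:
                    f.write("x" * j)  # file-ij is j bytes long
        return [(os.path.basename(p), size) for p, size in list_metadata(folders, 3)]
```

Calling `demo()` lists three folders concurrently. Raising or lowering `num_partitions` trades parallelism against the number of simultaneous requests to the storage system, which is the tuning knob the slide refers to.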

Download Files
• Unbalanced partitions

Unbalanced partitions
Partition 0: transactions.csv
Partition 1: input.csv, data.csv, values.csv, buzzwords.csv, buzzwords.txt

Balancing partitions
Partition 0: (0, transactions.csv), (2, data.csv), (4, buzzwords.csv)
Partition 1: (1, input.csv), (3, values.csv), (5, buzzwords.txt)
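One way to arrive at the balanced layout shown above is to pair every file with its index and send it to partition `index % num_partitions`. The sketch below does this in plain Python. The `balance` function is a hypothetical helper, and the modulo rule is an assumption that happens to reproduce the slide, since the transcript does not show the Spark calls the authors used.

```python
def balance(files, num_partitions):
    """Pair each file with its index and assign it to partition index % num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for index, name in enumerate(files):
        partitions[index % num_partitions].append((index, name))
    return partitions


# The six files from the slide, in the order implied by their indices.
files = [
    "transactions.csv",
    "input.csv",
    "data.csv",
    "values.csv",
    "buzzwords.csv",
    "buzzwords.txt",
]
balanced = balance(files, 2)
```

With two partitions, each one ends up with three files, instead of the one-versus-five split on the "Unbalanced partitions" slide.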

Balancing partitions
• Balancing partitions
Keep in mind that repartitioning your data is a fairly expensive operation.

2. Data Transformation
• Data cleaning is the first step in any data science project
• For this use-case:
- Remove lines that don't match the structure
- Remove “useless” columns
- Transform data to be in a consistent format
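The three cleaning steps listed above can be illustrated on a handful of lines. This is a sketch under stated assumptions, not the authors' pipeline: the four-column layout (id, date, amount, comment), the `dd/mm/yyyy` input date format and the `clean` function are all hypothetical, chosen only to show each step once.

```python
import csv
from datetime import datetime

EXPECTED_COLUMNS = 4  # hypothetical structure: id, date, amount, comment


def clean(lines):
    """Yield (id, ISO date, amount) for well-formed lines and skip the rest."""
    for row in csv.reader(lines):
        if len(row) != EXPECTED_COLUMNS:
            continue  # step 1: the line does not match the structure
        tx_id, raw_date, raw_amount, _comment = row  # step 2: drop the unused column
        try:
            # step 3: bring dates and amounts into one consistent format
            date = datetime.strptime(raw_date.strip(), "%d/%m/%Y").date().isoformat()
            amount = round(float(raw_amount), 2)
        except ValueError:
            continue  # unparseable values are treated as malformed lines
        yield tx_id.strip(), date, amount


raw = [
    "t1, 03/06/2015, 19.990, first purchase",
    "broken line",
    "t2,04/06/2015,5,refund",
    "t3,2015-06-05,7.5,wrong date format",
]
cleaned = list(clean(raw))
```

Only `t1` and `t2` survive: the second line has the wrong number of columns and the fourth uses a date format the sketch does not accept.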

Find Country char code
• Join
Numeric Format    Alpha 2 Format    Name
276               DE                Germany
• Problem with skew in the key distribution

Metrics for Join

Find Country char code
• Broadcast Country Codes Map
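Why a broadcast map sidesteps the skew problem can be shown with a toy data set. The sketch below is plain Python rather than Spark, and all names in it are hypothetical. `shuffle_partition_sizes` mimics a join that groups records by key, using a simple key-to-partition table as a stand-in for a hash partitioner, while `map_side_lookup` mimics looking codes up in a dictionary that every worker already holds.

```python
from collections import Counter

# Numeric to alpha-2 country codes (ISO 3166-1); 276 -> DE is the slide's example.
COUNTRY_CODES = {"276": "DE", "840": "US", "250": "FR"}


def shuffle_partition_sizes(records, num_partitions):
    """Count the records each partition receives when records are grouped by join key."""
    keys = sorted({numeric for _, numeric in records})
    owner = {key: i % num_partitions for i, key in enumerate(keys)}  # stand-in partitioner
    sizes = Counter(owner[numeric] for _, numeric in records)
    return [sizes[p] for p in range(num_partitions)]


def map_side_lookup(partition, codes):
    """Resolve codes locally from a broadcast-style dict; no record changes partition."""
    return [(tx_id, codes.get(numeric)) for tx_id, numeric in partition]


# Skewed input: 8 of the 10 transactions carry the same country code.
records = [(f"t{i}", "276") for i in range(8)] + [("t8", "840"), ("t9", "250")]
partitions = [records[0::2], records[1::2]]  # two evenly sized input partitions

after_shuffle = shuffle_partition_sizes(records, 2)
after_lookup = [map_side_lookup(p, COUNTRY_CODES) for p in partitions]
```

Grouping by key piles every record of the dominant country into one partition, whereas the lookup keeps the original five-and-five split. This only works when the lookup table is small enough to be copied to every worker, as a country-code map is.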

Metrics

[Chart: Transformation with Join vs Broadcasted Map (skewed key). Y axis: seconds (0 to 300); X axis: rows (1, 2 and 3 million); series: Join, Broadcasted Map]

Spark-Job-Rest
https://github.com/Atigeo/spark-job-rest
• Supports multiple contexts
• Launches a new process for each Spark context
• Inter-process communication with Akka actors
• Easy context creation & job runs
• Supports Java and Scala code
• Friendly UI

Build a data warehouse
• Hive
• Apache Pig
• Impala
• Presto
• Stinger (Hive on Tez)
• Spark SQL

Spark SQL
• Support for multiple input formats
• Rich language interfaces
• RDD-aware optimizer
[Diagram labels: RDD; DataFrame / SchemaRDD; JDBC; Hive QL; SQL]

Creating a data frame

Explore data
Perform a simple query:
> Directly on the data frame
> Registering a temporary table
- select
- filter
- join
- groupBy
- agg
- count
- sort
- where, etc.

Creating a data warehouse
https://github.com/Atigeo/xpatterns-spark-parquet

File Formats
• TextFile
• SequenceFile
• RCFile (RowColumnar)
• ORCFile (OptimizedRowColumnar)
• Avro
• Parquet
> columnar format
> good for aggregation queries
> only the required columns are read from disk
> nested data structures
> schema stored with the data
> Spark SQL supports schema evolution
> efficient compression
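Two of the Parquet properties above, reading only the required columns and compressing efficiently, follow from the columnar layout itself. The sketch below shows the principle on in-memory Python lists. It is not Parquet's actual on-disk encoding, and the sample rows, `to_columns` and `run_length_encode` are hypothetical names introduced for illustration.

```python
from itertools import groupby

# Row-oriented records, as they would arrive from a CSV file.
rows = [
    {"id": 1, "country": "DE", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 20.5},
    {"id": 3, "country": "US", "amount": 5.0},
]


def to_columns(records):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of equal neighbouring values into (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(rows)
total_amount = sum(columns["amount"])  # the aggregation touches a single column
encoded_country = run_length_encode(columns["country"])
```

The sum never looks at `id` or `country`, which is why aggregation queries benefit from a columnar format, and values of one column sit next to each other, so repeated values collapse into short runs.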

Tachyon
• Memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks
• Pluggable underlying file system: HDFS, S3,…

Caching in Spark SQL
• Caches data in columnar format
• Automatically tunes compression

Spark cache vs Tachyon
• Spark context might crash
• GC kicks in
• Share data between different applications

Jaws Spark SQL REST
https://github.com/Atigeo/jaws-spark-sql-rest
- Highly scalable and resilient data warehouse
- Submit queries concurrently and asynchronously
- RESTful alternative to Spark SQL JDBC, with an interactive UI
- Since Spark 0.9.1 with Shark
- Support for Spark SQL and Hive - MR (and more to come)

Jaws main features
- Akka actors for communication between instances
- Supports cancelling queries
- Supports large results retrieval
- Parquet in-memory warehouse
- Returns persisted logs, results, query history
- Provides a metadata browser
- Configuration file to fine-tune Spark

Code available at
https://github.com/big-data-research/in-memory-data-pipeline

Q & A
© 2013 Atigeo, LLC. All rights reserved. Atigeo and the xPatterns logo are trademarks of Atigeo. The information herein is for informational purposes only and represents the current view of Atigeo as of
the date of this presentation. Because Atigeo must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Atigeo, and Atigeo cannot guarantee the
accuracy of any information provided after the date of this presentation. ATIGEO MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
Ad

More Related Content

What's hot (20)

Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
Databricks
 
Getting started with SparkSQL - Desert Code Camp 2016
Getting started with SparkSQL  - Desert Code Camp 2016Getting started with SparkSQL  - Desert Code Camp 2016
Getting started with SparkSQL - Desert Code Camp 2016
clairvoyantllc
 
The SparkSQL things you maybe confuse
The SparkSQL things you maybe confuseThe SparkSQL things you maybe confuse
The SparkSQL things you maybe confuse
vito jeng
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Databricks
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Alex Zeltov
 
Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
 Lessons from the Field, Episode II: Applying Best Practices to Your Apache S... Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
Databricks
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
Databricks
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
Caserta
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
Spark Summit
 
Hadoop & Complex Systems Research
Hadoop & Complex Systems ResearchHadoop & Complex Systems Research
Hadoop & Complex Systems Research
Dr. Mirko Kämpf
 
Apache Spark Based Reliable Data Ingestion in Datalake with Gagan Agrawal
Apache Spark Based Reliable Data Ingestion in Datalake with Gagan AgrawalApache Spark Based Reliable Data Ingestion in Datalake with Gagan Agrawal
Apache Spark Based Reliable Data Ingestion in Datalake with Gagan Agrawal
Databricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
Rahul Jain
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Spark Summit
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
Whizlabs
 
3 CityNetConf - sql+c#=u-sql
3 CityNetConf - sql+c#=u-sql3 CityNetConf - sql+c#=u-sql
3 CityNetConf - sql+c#=u-sql
Łukasz Grala
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
Rahul Jain
 
SQL on Hadoop in Taiwan
SQL on Hadoop in TaiwanSQL on Hadoop in Taiwan
SQL on Hadoop in Taiwan
Treasure Data, Inc.
 
Performance of Spark vs MapReduce
Performance of Spark vs MapReducePerformance of Spark vs MapReduce
Performance of Spark vs MapReduce
Edureka!
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
Databricks
 
Getting started with SparkSQL - Desert Code Camp 2016
Getting started with SparkSQL  - Desert Code Camp 2016Getting started with SparkSQL  - Desert Code Camp 2016
Getting started with SparkSQL - Desert Code Camp 2016
clairvoyantllc
 
The SparkSQL things you maybe confuse
The SparkSQL things you maybe confuseThe SparkSQL things you maybe confuse
The SparkSQL things you maybe confuse
vito jeng
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Databricks
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Alex Zeltov
 
Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
 Lessons from the Field, Episode II: Applying Best Practices to Your Apache S... Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
Databricks
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
Databricks
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
Spark Summit
 
Hadoop & Complex Systems Research
Hadoop & Complex Systems ResearchHadoop & Complex Systems Research
Hadoop & Complex Systems Research
Dr. Mirko Kämpf
 
Apache Spark Based Reliable Data Ingestion in Datalake with Gagan Agrawal
Apache Spark Based Reliable Data Ingestion in Datalake with Gagan AgrawalApache Spark Based Reliable Data Ingestion in Datalake with Gagan Agrawal
Apache Spark Based Reliable Data Ingestion in Datalake with Gagan Agrawal
Databricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
Rahul Jain
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Spark Summit
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
Whizlabs
 
3 CityNetConf - sql+c#=u-sql
3 CityNetConf - sql+c#=u-sql3 CityNetConf - sql+c#=u-sql
3 CityNetConf - sql+c#=u-sql
Łukasz Grala
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
Rahul Jain
 
Performance of Spark vs MapReduce
Performance of Spark vs MapReducePerformance of Spark vs MapReduce
Performance of Spark vs MapReduce
Edureka!
 

Similar to In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015 (20)

Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
Spark Summit
 
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network Analytics
Yousun Jeong
 
xPatterns - Spark Summit 2014
xPatterns - Spark Summit   2014xPatterns - Spark Summit   2014
xPatterns - Spark Summit 2014
Claudiu Barbura
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Michael Rys
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Yao Yao
 
xPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonxPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, Tachyon
Claudiu Barbura
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Databricks
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Databricks
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
Paulo Gutierrez
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
Paco Nathan
 
.NET for Azure Synapse (and viceversa)
.NET for Azure Synapse (and viceversa).NET for Azure Synapse (and viceversa)
.NET for Azure Synapse (and viceversa)
Marco Parenzan
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
Dr. Mirko Kämpf
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
Dr. Mirko Kämpf
 
Azure Databricks - An Introduction 2019 Roadshow.pptx
Azure Databricks - An Introduction 2019 Roadshow.pptxAzure Databricks - An Introduction 2019 Roadshow.pptx
Azure Databricks - An Introduction 2019 Roadshow.pptx
pascalsegoul
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Jason Dai
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
Databricks
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
Amanda Casari
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017
Databricks
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache Spark
Miklos Christine
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
Spark Summit
 
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network Analytics
Yousun Jeong
 
xPatterns - Spark Summit 2014
xPatterns - Spark Summit   2014xPatterns - Spark Summit   2014
xPatterns - Spark Summit 2014
Claudiu Barbura
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Michael Rys
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Yao Yao
 
xPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonxPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, Tachyon
Claudiu Barbura
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Databricks
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Databricks
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
Paulo Gutierrez
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
Paco Nathan
 
.NET for Azure Synapse (and viceversa)
.NET for Azure Synapse (and viceversa).NET for Azure Synapse (and viceversa)
.NET for Azure Synapse (and viceversa)
Marco Parenzan
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
Dr. Mirko Kämpf
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
Dr. Mirko Kämpf
 
Azure Databricks - An Introduction 2019 Roadshow.pptx
Azure Databricks - An Introduction 2019 Roadshow.pptxAzure Databricks - An Introduction 2019 Roadshow.pptx
Azure Databricks - An Introduction 2019 Roadshow.pptx
pascalsegoul
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Jason Dai
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
Databricks
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
Amanda Casari
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017
Databricks
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache Spark
Miklos Christine
 
Ad

Recently uploaded (20)

03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia
Alexander Romero Arosquipa
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
How iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost FundsHow iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost Funds
ireneschmid345
 
Ch3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendencyCh3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendency
ayeleasefa2
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
Deloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit contextDeloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit context
Process mining Evangelist
 
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
gmuir1066
 
Geometry maths presentation for begginers
Geometry maths presentation for begginersGeometry maths presentation for begginers
Geometry maths presentation for begginers
zrjacob283
 
Digilocker under workingProcess Flow.pptx
Digilocker  under workingProcess Flow.pptxDigilocker  under workingProcess Flow.pptx
Digilocker under workingProcess Flow.pptx
satnamsadguru491
 
computer organization and assembly language.docx
computer organization and assembly language.docxcomputer organization and assembly language.docx
computer organization and assembly language.docx
alisoftwareengineer1
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
Developing Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response ApplicationsDeveloping Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response Applications
VICTOR MAESTRE RAMIREZ
 
Principles of information security Chapter 5.ppt
Principles of information security Chapter 5.pptPrinciples of information security Chapter 5.ppt
Principles of information security Chapter 5.ppt
EstherBaguma
 
183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag
fardin123rahman07
 
Stack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptxStack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptx
binduraniha86
 
Defense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptxDefense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptx
Greg Makowski
 
Cleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdfCleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdf
alcinialbob1234
 
VKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptxVKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptx
Vinod Srivastava
 
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.pptJust-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
ssuser5f8f49
 
Medical Dataset including visualizations
Medical Dataset including visualizationsMedical Dataset including visualizations
Medical Dataset including visualizations
vishrut8750588758
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
How iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost FundsHow iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost Funds
ireneschmid345
 
Ch3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendencyCh3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendency
ayeleasefa2
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
Deloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit contextDeloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit context
Process mining Evangelist
 
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
gmuir1066
 
Geometry maths presentation for begginers
Geometry maths presentation for begginersGeometry maths presentation for begginers
Geometry maths presentation for begginers
zrjacob283
 
Digilocker under workingProcess Flow.pptx
Digilocker  under workingProcess Flow.pptxDigilocker  under workingProcess Flow.pptx
Digilocker under workingProcess Flow.pptx
satnamsadguru491
 
computer organization and assembly language.docx
computer organization and assembly language.docxcomputer organization and assembly language.docx
computer organization and assembly language.docx
alisoftwareengineer1
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
Developing Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response ApplicationsDeveloping Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response Applications
VICTOR MAESTRE RAMIREZ
 
Principles of information security Chapter 5.ppt
Principles of information security Chapter 5.pptPrinciples of information security Chapter 5.ppt
Principles of information security Chapter 5.ppt
EstherBaguma
 
183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag
fardin123rahman07
 
Stack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptxStack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptx
binduraniha86
 
Defense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptxDefense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptx
Greg Makowski
 
Cleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdfCleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdf
alcinialbob1234
 
VKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptxVKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptx
Vinod Srivastava
 
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.pptJust-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
ssuser5f8f49
 
Medical Dataset including visualizations
Medical Dataset including visualizationsMedical Dataset including visualizations
Medical Dataset including visualizations
vishrut8750588758
 
Ad

In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015

  • 1. Radu Chilom [email protected] In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Buzzwords Berlin - 2015 Ema Iancuta [email protected]
  • 2. ‹#› • Big data analytics / machine learning • 6+ years with Hadoop ecosystem • 2 years with Spark • http://atigeo.com/ • A research group that focuses on the technical problems that exist in the big data industry and provides open source solutions • http://bigdataresearch.io/
  • 3. ‹#› • Intro • Use Case • Data pipeline with Spark • Spark Job Rest Service • Spark SQL Rest Service (Jaws) • Parquet • Tachyon • Demo Agenda
  • 4. Use Case: Build an in-memory data pipeline for millions of financial transactions, used downstream by data scientists for detecting fraud • Ingestion from S3 to our Tachyon/HDFS cluster • Data transformation • Data warehouse
  • 5. Apache Spark: “fast and general engine for large-scale data processing” • Built around the concept of the RDD • API for Java/Scala/Python (80 operators) • Powers a stack of high-level tools including Spark SQL, MLlib and Spark Streaming
  • 6. [Diagram] Public S3 Bucket: public-financial-transactions (s3-bucket), with folders scheme (scheme.csv), data (input-0.csv), data2 (input-1.csv), . . .
  • 7. 1. Ingestion: Download from S3 • Resolving the wildcards means listing the files' metadata • Listing the metadata for a large number of files from external sources can take a long time
  • 8. [Diagram] Listing the metadata (distributed): the Driver holds folder1 to folder6 and three Workers list their contents: file-11, file-12, file-21, file-22, file-23, file-31, file-32, file-41, file-42, file-43, file-44, file-51, file-52, file-61
  • 9. Listing the metadata (distributed): For fine tuning, specify the number of partitions
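The distributed listing idea from slides 8 and 9 can be sketched in plain Python. This is only an illustration, not the authors' Spark code: the `FAKE_STORE` dictionary stands in for the remote store, a thread pool stands in for the workers, and the helper names (`list_folder`, `split`, `list_distributed`) are hypothetical. The `num_partitions` argument plays the role of the tuning knob mentioned on slide 9.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for the remote store: folder -> file names.
FAKE_STORE = {
    "folder1": ["file-11", "file-12"],
    "folder2": ["file-21", "file-22", "file-23"],
    "folder3": ["file-31", "file-32"],
    "folder4": ["file-41", "file-42", "file-43", "file-44"],
    "folder5": ["file-51", "file-52"],
    "folder6": ["file-61"],
}


def list_folder(folder):
    """List one folder; a real pipeline would issue a remote metadata call here."""
    return [f"{folder}/{name}" for name in FAKE_STORE[folder]]


def split(items, num_partitions):
    """Split the items into num_partitions round-robin slices."""
    return [items[i::num_partitions] for i in range(num_partitions)]


def list_partition(folders):
    """Work done by one 'worker': list every folder in its slice."""
    return [path for folder in folders for path in list_folder(folder)]


def list_distributed(folders, num_partitions):
    """The 'driver' hands out folder slices and collects the listed paths."""
    partitions = split(folders, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        results = pool.map(list_partition, partitions)
    return sorted(path for part in results for path in part)


if __name__ == "__main__":
    for path in list_distributed(sorted(FAKE_STORE), num_partitions=3):
        print(path)
```

More partitions mean more listing calls in flight at once, at the cost of more scheduling overhead; the right value depends on the store and the cluster.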
  • 11. Unbalanced partitions: Partition 0: transactions.csv • Partition 1: input.csv, data.csv, values.csv, buzzwords.csv, buzzwords.txt
  • 12. Balancing partitions: Partition 0: (0, transactions.csv), (2, data.csv), (4, buzzwords.csv) • Partition 1: (1, input.csv), (3, values.csv), (5, buzzwords.txt)
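The layout on slide 12 is what you get when each file is paired with its index and then assigned to partition `index % num_partitions`. A minimal plain-Python sketch of that assignment follows; the function name `balance` is hypothetical. In Spark, one way to express the same idea would be `zipWithIndex` followed by `partitionBy` with a `HashPartitioner`, but whether the authors did it that way is an assumption.

```python
def balance(files, num_partitions):
    """Assign (index, file) pairs to partitions round-robin by index."""
    partitions = [[] for _ in range(num_partitions)]
    for index, name in enumerate(files):
        partitions[index % num_partitions].append((index, name))
    return partitions


if __name__ == "__main__":
    files = [
        "transactions.csv", "input.csv", "data.csv",
        "values.csv", "buzzwords.csv", "buzzwords.txt",
    ]
    for number, partition in enumerate(balance(files, 2)):
        print(f"Partition {number}: {partition}")
```

Note that this balances the number of files per partition, not their total size; files of very different sizes would still leave the partitions uneven.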
  • 13. Balancing partitions: Keep in mind that repartitioning your data is a fairly expensive operation.
  • 14. 2. Data Transformation: Data cleaning is the first step in any data science project • For this use case: remove lines that don't match the structure; remove “useless” columns; transform data to be in a consistent format
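The three cleaning steps on slide 14 can be illustrated with a small plain-Python sketch. The record layout is entirely hypothetical (four comma-separated columns: id, country code, amount, free-text note), as are the constants `EXPECTED_COLUMNS` and `KEEP`; the deck does not show the real schema.

```python
import csv

EXPECTED_COLUMNS = 4   # hypothetical layout: id, country code, amount, note
KEEP = (0, 1, 2)       # hypothetical: drop the free-text note column


def clean(lines):
    """Drop malformed lines, drop unwanted columns, normalise the rest."""
    cleaned = []
    for row in csv.reader(lines):
        # Step 1: remove lines that don't match the structure.
        if len(row) != EXPECTED_COLUMNS:
            continue
        # Step 2: keep only the useful columns, trimming stray whitespace.
        kept = [row[i].strip() for i in KEEP]
        # Step 3: bring the amount into one consistent format (two decimals).
        try:
            kept[2] = f"{float(kept[2]):.2f}"
        except ValueError:
            continue  # an unparseable amount also counts as malformed
        cleaned.append(kept)
    return cleaned


if __name__ == "__main__":
    raw = ["1, 276 ,10.5,free text", "broken line", "2,840,7,other"]
    for record in clean(raw):
        print(record)
```

In a Spark job the same per-line logic would typically live inside a `filter` and a `map` over the input RDD, so that it runs in parallel on each partition.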
  • 15. Find Country char code: Join • Lookup table columns: Numeric Format, Alpha 2 Format, Name (e.g. 276, DE, Germany) • Problem with skew in the key distribution
  • 17. Find Country char code: Broadcast Country Codes Map
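The broadcast approach replaces the join with a local dictionary lookup on every record. The plain-Python sketch below shows only that lookup; the record shape `(transaction id, numeric country code)` and the function name `add_alpha2` are assumptions. In Spark, the dictionary would be shipped to the executors once as a broadcast variable (`SparkContext.broadcast`) and read inside a `map`, so no shuffle by key is needed and a skewed key cannot overload a single task.

```python
# Small lookup table: ISO 3166-1 numeric code -> alpha-2 code.
# The 276 -> DE entry comes from the slides; the others are added for illustration.
COUNTRY_CODES = {"276": "DE", "840": "US", "250": "FR"}


def add_alpha2(records, codes):
    """Attach the alpha-2 code to each (transaction id, numeric code) record.

    Unknown codes map to None instead of dropping the record.
    """
    return [(tx_id, numeric, codes.get(numeric)) for tx_id, numeric in records]


if __name__ == "__main__":
    # A skewed input: most transactions share the same country code.
    transactions = [("t1", "276"), ("t2", "276"), ("t3", "276"), ("t4", "840")]
    for row in add_alpha2(transactions, COUNTRY_CODES):
        print(row)
```

This only works when the lookup table is small enough to fit in memory on every executor, which is the case for a country code table.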
  • 19. [Chart] Transformation with Join vs Broadcasted Map (skewed key). Axes: Seconds (0 to 300) and Rows (1 Million, 2 Million, 3 Million). Series: Join, Broadcasted Map.
  • 20. Spark-Job-Rest (https://github.com/Atigeo/spark-job-rest): Supports multiple contexts • Launches a new process for each Spark context • Inter-process communication with Akka actors • Easy context creation & job runs • Supports Java and Scala code • Friendly UI
  • 21. Build a data warehouse: Hive • Apache Pig • Impala • Presto • Stinger (Hive on Tez) • Spark SQL
  • 22. Spark SQL: Support for multiple input formats • Rich language interfaces • RDD-aware optimizer • [Diagram labels: RDD, DataFrame / SchemaRDD, JDBC, Hive QL, SQL]
  • 24. Explore data: Perform a simple query: > Directly on the data frame > Registering a temporary table • Operations: select, filter, join, groupBy, agg, count, sort, where, etc.
  • 25. Creating a data warehouse: https://github.com/Atigeo/xpatterns-spark-parquet
  • 26. File Formats: TextFile • SequenceFile • RCFile (RowColumnar) • ORCFile (OptimizedRowColumnar) • Avro • Parquet > columnar format > good for aggregation queries > only the required columns are read from disk > nested data structures > schema with the data > Spark SQL supports schema evolution > efficient compression
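Why a columnar format suits aggregation queries can be seen in a toy plain-Python sketch. It says nothing about Parquet's actual on-disk encoding; the sample rows and the helper names `to_columns` and `total` are hypothetical. The point is only that, once values are stored per column, an aggregate over one column never touches the others.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total(columns, name):
    """Aggregate a single column; the other columns are never read."""
    return sum(columns[name])


if __name__ == "__main__":
    rows = [
        {"id": 1, "country": "DE", "amount": 10.5},
        {"id": 2, "country": "US", "amount": 7.5},
    ]
    columns = to_columns(rows)
    print(total(columns, "amount"))
```

Storing similar values next to each other is also what makes the compression mentioned on the slide more effective than in a row-oriented layout.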
  • 27. Tachyon: memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks • Pluggable underlying file system: HDFS, S3,…
  • 28. Caching in Spark SQL: Caches data in columnar format • Automatically tunes compression
  • 29. Spark cache vs Tachyon: Spark context might crash • GC kicks in • Share data between different applications
  • 30. Jaws Spark SQL REST (https://github.com/Atigeo/jaws-spark-sql-rest): Highly scalable and resilient data warehouse • Submit queries concurrently and asynchronously • RESTful alternative to Spark SQL JDBC with an interactive UI • Since Spark 0.9.1 with Shark • Support for Spark SQL and Hive - MR (and more to come)
  • 31. Jaws main features: Akka actors to communicate between instances • Supports cancelling queries • Supports large results retrieval • Parquet in-memory warehouse • Returns persisted logs, results and query history • Provides a metadata browser • Configuration file to fine-tune Spark
  • 34. © 2013 Atigeo, LLC. All rights reserved. Atigeo and the xPatterns logo are trademarks of Atigeo. The information herein is for informational purposes only and represents the current view of Atigeo as of the date of this presentation. Because Atigeo must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Atigeo, and Atigeo cannot guarantee the accuracy of any information provided after the date of this presentation. ATIGEO MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.