SlideShare a Scribd company logo
Apache Spark
1
Agenda
• Introduction
• Features -
• Supports all major Big Data Environments
• General platform for major Big Data tasks
• Access diverse data sources
• Speed
• Ease of use
• Architecture
• Resilient Distributed Datasets
• In-memory Computing
• Performance
Apache Spark 2
Spark Introduction
• Apache Spark is an
– open source,
– parallel data processing framework
– master slave model
– complements Hadoop
to make it easy to develop fast, unified Big Data applications
• Cloudera offers commercial support for Spark with Cloudera
Enterprise.
• It has over 465 contributors making it most active Big Data Project in
Apache Foundation, started at UC Berkley in 2009
Apache Spark 3
Supports All major BigData Environments
Runs Everywhere -
• Standalone Cluster Mode
• Hadoop Yarn
• Apache Mesos
• Amazon EC2
• Spark runs on both Windows and UNIX-like systems (e.g. Linux,
Mac OS).
Apache Spark 4
General platform for all major Big Data tasks
• Common ETL (Sqoop)
• SQL and Analytics (Pig and Hive)
• Real Time Streaming (Storm)
• Machine Learning (Mahout)
• Graphs (Data Visualization)
• Both Interactive and Batch mode processing
• Reuse the same code for batch and stream processing, even joining
streaming data to historical data
Apache Spark 5
Access Diverse Data Sources
Read and write anywhere -
• HDFS
• Cassandra
• Hbase
• TextFile
• RDBMS
• Kafka
Apache Spark 6
Speed
• Run programs up to 100x faster than Hadoop MapReduce in
memory, or 10x faster on disk.
• Spark has an advanced DAG execution engine that supports cyclic
data flow and in-memory computing.
Logistic regression in Hadoop and Spark
Apache Spark 7
Ease of Use
• Write application in
– Java
– Scala
– Python
• 80 high-level operators that make it easy to build parallel apps
Apache Spark 8
Unified Analytics with Cloudera’s Enterprise Data Hub
Faster Decisions (Interactive) Better Decisions (Batch)
Real-Time Action (Streaming
and Applications)
Web Security
Why is my website slow? What are the common causes
of performance issues?
How can I detect and block
malicious attacks in real-time?
Retail
What are our top selling items
across channels?
What products and services to
customers buy together?
How can I deliver relevant
promotions to buyers at the
point of sale?
Financial
Services
Who opened multiple accounts
in the past 6 months?
What are the leading
indicators of fraudulent
activity?
How can I protect my
customers from identity theft in
real-time?
Apache Spark 9
ARCHITECTURE
Apache Spark 10
Apache Spark 11
Resilient Distributed Datasets
• A read-only collection of objects partitioned across a set of machines
• Immutable. You can modify an RDD with a transformation but the
transformation returns you a new RDD whereas the original RDD
remains the same.
• Can be rebuilt if a partition is lost.
• Fault tolerant because they can be recreated & recomputed
RDD supports two types of operations:
• Transformation
• Action
Apache Spark 12
• Transformation: don't return a single value, they return a new RDD.
Nothing gets evaluated when you call a Transformation function, it
just takes an RDD and return a new RDD.
– Some of the Transformation functions are map, filter, flatMap,
groupByKey, reduceByKey, aggregateByKey, pipe, and coalesce.
• Action: operation evaluates and returns a new value. When an
Action function is called on a RDD object, all the data processing
queries are computed at that time and the result value is returned.
– Some of the Action operations are reduce, collect, count, first, take,
countByKey, and foreach.
Apache Spark 13
How in-memory computing Improve Performance
• Intermediate data is not spilled to disk
Apache Spark 14
Performance improvement in DAG executions
• New shuffle implementation, sort-based shuffle uses single buffer
which reduces substantial memory overhead and can support huge
workloads in a single stage. Earlier concurrent buffers were used
• The revamped network module in Spark maintains its own pool of
memory, thus bypassing JVM’s memory allocator, reducing the
impact of garbage collection.
• New external shuffle service ensures that Spark can still serve
shuffle files even when the executors are in GC pauses.
Apache Spark 15
• Timsort, which is a derivation of quicksort and mergesort. It works
better than earlier quicksort in partially ordered datasets.
• Exploiting cache locality reduced sorting by factor of 5.
sort_key record sort_key location
The Spark cluster was able to sustain 3GB/s/node I/O activity during the
map phase, and 1.1 GB/s/node network activity during the reduce phase,
saturating the 10Gbps link available on these machines.
Apache Spark 16
10b 10b100b 4b
• The memory consumed by RDDs can be optimized per size of data
• Performance can be tuned by caching
• MR supports only Map and Reduce operations and everything (join,
groupby etc) has to be fit into the Map and Reduce model, which
might not be the efficient way. Spark supports 80 other
transformations and actions.
Apache Spark 17
References
• http://spark.apache.org/
• http://en.wikipedia.org/wiki/Apache_Spark
• http://opensource.com/business/15/1/apache-spark-new-world-
record
• http://www.cloudera.com/content/cloudera/en/products-and-
services/cdh/spark.html
• http://vision.cloudera.com/mapreduce-spark/
• http://thesohiljain.blogspot.in/2015/03/apache-spark-
introduction.html
Apache Spark 18
Thank You
Apache Spark 19
Ad

More Related Content

What's hot (20)

SPARK ARCHITECTURE
SPARK ARCHITECTURESPARK ARCHITECTURE
SPARK ARCHITECTURE
GauravBiswas9
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
Adarsh Pannu
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Samy Dindane
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
David Lauzon
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
Cloudera, Inc.
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Sachin Aggarwal
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
Paco Nathan
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Robert Sanders
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
Home
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
CloudxLab
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
Abdullah Çetin ÇAVDAR
 
Map reduce vs spark
Map reduce vs sparkMap reduce vs spark
Map reduce vs spark
Tudor Lapusan
 
Apache HAWQ Architecture
Apache HAWQ ArchitectureApache HAWQ Architecture
Apache HAWQ Architecture
Alexey Grishchenko
 
Ten tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache SparkTen tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache Spark
Will Du
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
Aakashdata
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and Spark
Evan Chan
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
Spark core
Spark coreSpark core
Spark core
Freeman Zhang
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs Pig
Lester Martin
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
Adarsh Pannu
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Samy Dindane
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
David Lauzon
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
Cloudera, Inc.
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Sachin Aggarwal
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
Paco Nathan
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
Home
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
CloudxLab
 
Ten tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache SparkTen tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache Spark
Will Du
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
Aakashdata
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and Spark
Evan Chan
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs Pig
Lester Martin
 

Viewers also liked (6)

Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Chris Fregly
 
Internals
InternalsInternals
Internals
Sandeep Purohit
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
datamantra
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Legacy Typesafe (now Lightbend)
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Helena Edelson
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Helena Edelson
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Chris Fregly
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
datamantra
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Legacy Typesafe (now Lightbend)
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Helena Edelson
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Helena Edelson
 
Ad

Similar to Spark introduction and architecture (20)

Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsfPyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
sasuke20y4sh
 
Pyspark presentationfsfsfjspfsjfsfsfjsfpsfsf
Pyspark presentationfsfsfjspfsjfsfsfjsfpsfsfPyspark presentationfsfsfjspfsjfsfsfjsfpsfsf
Pyspark presentationfsfsfjspfsjfsfsfjsfpsfsf
sasuke20y4sh
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
A Master Guide To Apache Spark Application And Versatile Uses.pdf
A Master Guide To Apache Spark Application And Versatile Uses.pdfA Master Guide To Apache Spark Application And Versatile Uses.pdf
A Master Guide To Apache Spark Application And Versatile Uses.pdf
DataSpace Academy
 
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
Slim Baltagi
 
Apache spark
Apache sparkApache spark
Apache spark
Hitesh Dua
 
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptxCLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
bhuvankumar3877
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
mahchiev
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)
Cloudera, Inc.
 
Apache spark
Apache sparkApache spark
Apache spark
Prashant Pranay
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
Chris Purrington
 
39.-Introduction-to-Sparkspark and all-1.pdf
39.-Introduction-to-Sparkspark and all-1.pdf39.-Introduction-to-Sparkspark and all-1.pdf
39.-Introduction-to-Sparkspark and all-1.pdf
ajajkhan16
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark example
ShidrokhGoudarzi1
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
cdmaxime
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
DeepaThirumurugan
 
Apache Spark
Apache SparkApache Spark
Apache Spark
masifqadri
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
MapR Technologies
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
Karan Alang
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
Cloudera, Inc.
 
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsfPyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
sasuke20y4sh
 
Pyspark presentationfsfsfjspfsjfsfsfjsfpsfsf
Pyspark presentationfsfsfjspfsjfsfsfjsfpsfsfPyspark presentationfsfsfjspfsjfsfsfjsfpsfsf
Pyspark presentationfsfsfjspfsjfsfsfjsfpsfsf
sasuke20y4sh
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
A Master Guide To Apache Spark Application And Versatile Uses.pdf
A Master Guide To Apache Spark Application And Versatile Uses.pdfA Master Guide To Apache Spark Application And Versatile Uses.pdf
A Master Guide To Apache Spark Application And Versatile Uses.pdf
DataSpace Academy
 
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
Slim Baltagi
 
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptxCLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
bhuvankumar3877
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
mahchiev
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)
Cloudera, Inc.
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
Chris Purrington
 
39.-Introduction-to-Sparkspark and all-1.pdf
39.-Introduction-to-Sparkspark and all-1.pdf39.-Introduction-to-Sparkspark and all-1.pdf
39.-Introduction-to-Sparkspark and all-1.pdf
ajajkhan16
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark example
ShidrokhGoudarzi1
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
cdmaxime
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
MapR Technologies
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
Karan Alang
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
Cloudera, Inc.
 
Ad

Recently uploaded (20)

Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia
Alexander Romero Arosquipa
 
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
James Francis Paradigm Asset Management
 
DPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdfDPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdf
inmishra17121973
 
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdfIAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
mcgardenlevi9
 
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbbEDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
JessaMaeEvangelista2
 
How to join illuminati Agent in uganda call+256776963507/0741506136
How to join illuminati Agent in uganda call+256776963507/0741506136How to join illuminati Agent in uganda call+256776963507/0741506136
How to join illuminati Agent in uganda call+256776963507/0741506136
illuminati Agent uganda call+256776963507/0741506136
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
C++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptxC++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptx
aquibnoor22079
 
Classification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptxClassification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptx
wencyjorda88
 
Ch3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendencyCh3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendency
ayeleasefa2
 
Stack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptxStack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptx
binduraniha86
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..
yuvarajreddy2002
 
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.pptJust-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
ssuser5f8f49
 
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptxPerencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
PareaRusan
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnTemplate_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
cegiver630
 
GenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.aiGenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.ai
Inspirient
 
VKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptxVKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptx
Vinod Srivastava
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
James Francis Paradigm Asset Management
 
DPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdfDPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdf
inmishra17121973
 
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdfIAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
mcgardenlevi9
 
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbbEDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
JessaMaeEvangelista2
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
C++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptxC++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptx
aquibnoor22079
 
Classification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptxClassification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptx
wencyjorda88
 
Ch3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendencyCh3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendency
ayeleasefa2
 
Stack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptxStack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptx
binduraniha86
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..
yuvarajreddy2002
 
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.pptJust-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
ssuser5f8f49
 
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptxPerencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
PareaRusan
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnTemplate_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
cegiver630
 
GenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.aiGenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.ai
Inspirient
 
VKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptxVKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptx
Vinod Srivastava
 

Spark introduction and architecture

  • 2. Agenda • Introduction • Features - • Supports all major Big Data Environments • General platform for major Big Data tasks • Access diverse data sources • Speed • Ease of use • Architecture • Resilient Distributed Datasets • In-memory Computing • Performance Apache Spark 2
  • 3. Spark Introduction • Apache Spark is an – open source, – parallel data processing framework – master slave model – complements Hadoop to make it easy to develop fast, unified Big Data applications • Cloudera offers commercial support for Spark with Cloudera Enterprise. • It has over 465 contributors making it most active Big Data Project in Apache Foundation, started at UC Berkley in 2009 Apache Spark 3
  • 4. Supports All major BigData Environments Runs Everywhere - • Standalone Cluster Mode • Hadoop Yarn • Apache Mesos • Amazon EC2 • Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). Apache Spark 4
  • 5. General platform for all major Big Data tasks • Common ETL (Sqoop) • SQL and Analytics (Pig and Hive) • Real Time Streaming (Storm) • Machine Learning (Mahout) • Graphs (Data Visualization) • Both Interactive and Batch mode processing • Reuse the same code for batch and stream processing, even joining streaming data to historical data Apache Spark 5
  • 6. Access Diverse Data Sources Read and write anywhere - • HDFS • Cassandra • Hbase • TextFile • RDBMS • Kafka Apache Spark 6
  • 7. Speed • Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. • Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing. Logistic regression in Hadoop and Spark Apache Spark 7
  • 8. Ease of Use • Write application in – Java – Scala – Python • 80 high-level operators that make it easy to build parallel apps Apache Spark 8
  • 9. Unified Analytics with Cloudera’s Enterprise Data Hub Faster Decisions (Interactive) Better Decisions (Batch) Real-Time Action (Streaming and Applications) Web Security Why is my website slow? What are the common causes of performance issues? How can I detect and block malicious attacks in real-time? Retail What are our top selling items across channels? What products and services to customers buy together? How can I deliver relevant promotions to buyers at the point of sale? Financial Services Who opened multiple accounts in the past 6 months? What are the leading indicators of fraudulent activity? How can I protect my customers from identity theft in real-time? Apache Spark 9
  • 12. Resilient Distributed Datasets • A read-only collection of objects partitioned across a set of machines • Immutable. You can modify an RDD with a transformation but the transformation returns you a new RDD whereas the original RDD remains the same. • Can be rebuilt if a partition is lost. • Fault tolerant because they can be recreated & recomputed RDD supports two types of operations: • Transformation • Action Apache Spark 12
  • 13. • Transformation: don't return a single value, they return a new RDD. Nothing gets evaluated when you call a Transformation function, it just takes an RDD and return a new RDD. – Some of the Transformation functions are map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, pipe, and coalesce. • Action: operation evaluates and returns a new value. When an Action function is called on a RDD object, all the data processing queries are computed at that time and the result value is returned. – Some of the Action operations are reduce, collect, count, first, take, countByKey, and foreach. Apache Spark 13
  • 14. How in-memory computing Improve Performance • Intermediate data is not spilled to disk Apache Spark 14
  • 15. Performance improvement in DAG executions • New shuffle implementation, sort-based shuffle uses single buffer which reduces substantial memory overhead and can support huge workloads in a single stage. Earlier concurrent buffers were used • The revamped network module in Spark maintains its own pool of memory, thus bypassing JVM’s memory allocator, reducing the impact of garbage collection. • New external shuffle service ensures that Spark can still serve shuffle files even when the executors are in GC pauses. Apache Spark 15
  • 16. • Timsort, which is a derivation of quicksort and mergesort. It works better than earlier quicksort in partially ordered datasets. • Exploiting cache locality reduced sorting by factor of 5. sort_key record sort_key location The Spark cluster was able to sustain 3GB/s/node I/O activity during the map phase, and 1.1 GB/s/node network activity during the reduce phase, saturating the 10Gbps link available on these machines. Apache Spark 16 10b 10b100b 4b
  • 17. • The memory consumed by RDDs can be optimized per size of data • Performance can be tuned by caching • MR supports only Map and Reduce operations and everything (join, groupby etc) has to be fit into the Map and Reduce model, which might not be the efficient way. Spark supports 80 other transformations and actions. Apache Spark 17
  • 18. References • http://spark.apache.org/ • http://en.wikipedia.org/wiki/Apache_Spark • http://opensource.com/business/15/1/apache-spark-new-world- record • http://www.cloudera.com/content/cloudera/en/products-and- services/cdh/spark.html • http://vision.cloudera.com/mapreduce-spark/ • http://thesohiljain.blogspot.in/2015/03/apache-spark- introduction.html Apache Spark 18

Editor's Notes

  • #4: I quickly added these four benefits, please keep extending this list. Put quantifiable numbers (example ??% improvement in development cycle) wherever possible.
  • #5: List all the features added so far in Release #1. One line per feature, similar to the “Features in upcoming releases” format.
  • #6: The architecture diagram goes here. See the example I sent from one of my oldframework development project.