Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
Data Science Company
Boosting Big Data with Apache Spark
Mathias Lavaert
April 2015
About Infofarm
Data Science
Big Data
Identifying, extracting and using data of all types
and origins; exploring, correlating and using it in new
and innovative ways in order to extract meaning
and business value from it.
Java
PHP
E-Commerce
Mobile
Web
Development
About me
Mathias Lavaert
Big Data Developer at InfoFarm since May 2014
Proud citizen of West Flanders
Outdoor enthusiast
Agenda
• What is Apache Spark?
• An in-depth overview
– Spark Core and Resilient Distributed Datasets
– Unified access to structured data with Spark SQL
– Machine Learning with Spark MLlib
– Scalable streaming applications with Spark Streaming
• Q&A
• Wrap-up & lunch
What is Apache Spark?
“Apache Spark is a fast and general engine for big data
processing, with built-in modules for streaming, SQL,
machine learning and graph processing”
History
• Created by Matei Zaharia at UC Berkeley in 2009
• Based on the 2007 Microsoft Dryad paper
• Donated to the Apache Software Foundation in 2013
• 465 contributors in 2014, making it the most active
Apache project
• Currently supported by Databricks, a company founded
by the creators of Apache Spark
Target users
● Data Scientists
○ Data exploration and data modelling using interactive
shells
○ Machine Learning
○ Ad hoc analysis to answer business questions or
discover new insights
● Engineers
○ Fault-tolerant production data applications
○ ‘Productizing’ the work of the data scientist
○ Integration with business applications
Where to situate Apache Spark?
Differences with MapReduce
• Faster: minimizes I/O by keeping data in
memory as much as possible
• Unified libraries
• Huge community effort, very fast
development pace.
• Ships with higher level tools included
Daytona GraySort Contest
Differences with Hive, Pig, others...
• One integrated framework that suits a
wide range of problems
• No need for a workflow application like
Oozie
• Only one language/framework to learn
Explosion of Specialized Systems
Architecture
Advantages of unified libraries
Advancements in higher-level libraries are pushed down into core and
vice-versa
● Spark Core
○ Highly-optimized, low overhead, network-saturating shuffle
● Spark Streaming
○ Garbage collection, memory management, cleanup
improvements
● Spark GraphX
○ IndexedRDD for random access within a partition vs scanning
entire partition
● Spark MLlib
○ Statistics (Correlations, sampling, heuristics)
Supported languages
Difference between Java and Scala
Cluster Resource Managers
● Spark Standalone
○ Suitable for a lot of production workloads
○ Runs Spark workloads only
● YARN
○ Allows hierarchies of resources
○ Kerberos integration
○ Multiple workloads from different execution frameworks
■ Hive, Pig, Spark, MapReduce, Cascading, etc…
● Mesos
○ Similar to YARN, but allows elastic allocation
○ Coarse-grained
■ A single, long-running Mesos task runs Spark mini tasks
○ Fine-grained
■ New Mesos task for each Spark task
■ Higher overhead, not good for long-running Spark jobs
(Streaming)
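
The choice of cluster manager mostly shows up in the master URL handed to Spark. A minimal sketch, assuming Spark 1.x conventions: the host names are hypothetical, `yarn-client` is the 1.x spelling of the YARN master, and `MasterUrls`/`confFor` are names made up for this illustration.

```scala
import org.apache.spark.SparkConf

object MasterUrls {
  // Builds a SparkConf for one of the cluster managers listed above.
  // Host names are hypothetical; ports are the usual defaults.
  def confFor(manager: String): SparkConf = {
    val base = new SparkConf().setAppName("demo")
    manager match {
      case "local"      => base.setMaster("local[4]")                 // 4 threads, no cluster
      case "standalone" => base.setMaster("spark://master-host:7077") // Spark Standalone
      case "yarn"       => base.setMaster("yarn-client")              // YARN, Spark 1.x style
      case "mesos"      =>
        base.setMaster("mesos://mesos-host:5050")
          .set("spark.mesos.coarse", "true")                          // coarse-grained mode
      case other        =>
        throw new IllegalArgumentException(s"Unknown cluster manager: $other")
    }
  }

  def main(args: Array[String]): Unit =
    println(confFor("mesos").toDebugString)
}
```

The application code itself stays the same across managers; only the configuration changes.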
Storage Layers for Spark
Spark can create distributed datasets from:
● Any file stored in the Hadoop Distributed File System (HDFS)
● Any storage system supported by the Hadoop APIs
○ Local filesystem
○ S3
○ Cassandra
○ Hive
○ HBase
Note that Apache Spark doesn’t require Hadoop, but it has support for
storage systems implementing the Hadoop APIs.
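
A rough sketch of what this means in practice: the same `textFile` call is used for every Hadoop-compatible store, and only the URI scheme differs. The HDFS and S3 URIs in the comments are hypothetical; the runnable part reads a local temporary file. `StorageExample` and `lineCount` are illustrative names.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object StorageExample {
  // Counts the lines of a text file, wherever the URI points.
  def lineCount(sc: SparkContext, uri: String): Long =
    sc.textFile(uri).count()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("StorageExample"))
    try {
      // Hypothetical URIs for other storage layers:
      //   hdfs://namenode:8020/logs/events.log
      //   s3n://my-bucket/logs/events.log
      val file = Files.createTempFile("spark-demo", ".txt")
      Files.write(file, "a\nb\nc\n".getBytes("UTF-8"))
      println(lineCount(sc, file.toUri.toString))   // local filesystem via a file: URI
    } finally {
      sc.stop()
    }
  }
}
```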
Short introduction to functional
programming
What is functional programming?
A programming paradigm where the
basic unit of abstraction is the function
Basic concepts
● Higher-order functions
○ Functions that take other functions as
arguments
○ or return a function as their result
● Pure functions
○ Purely functional expressions have no side effects
● Recursion
○ Iteration in functional languages is usually
accomplished via recursion.
● Immutable data structures
Small example with a functional
language: Scala
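
A minimal sketch of the four basic concepts (higher-order functions, pure functions, recursion, immutable data), using only the Scala standard library. The names `FunctionalBasics`, `square`, `applyTwice`, `adder` and `sumTo` are illustrative and not taken from the original demo.

```scala
import scala.annotation.tailrec

object FunctionalBasics {
  // Pure function: the result depends only on the argument, no side effects.
  def square(x: Int): Int = x * x

  // Higher-order function: takes another function as an argument.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Higher-order function: returns a function as its result.
  def adder(n: Int): Int => Int = x => x + n

  // Recursion instead of a mutable loop counter (tail-recursive).
  @tailrec
  def sumTo(n: Int, acc: Int = 0): Int =
    if (n <= 0) acc else sumTo(n - 1, acc + n)

  def main(args: Array[String]): Unit = {
    // Immutable data: map and filter return new lists, `numbers` is unchanged.
    val numbers = List(1, 2, 3, 4, 5)
    val evenSquares = numbers.map(square).filter(_ % 2 == 0)

    println(evenSquares)            // List(4, 16)
    println(applyTwice(square, 3))  // 81
    println(adder(10)(5))           // 15
    println(sumTo(100))             // 5050
    println(numbers)                // List(1, 2, 3, 4, 5)
  }
}
```

The `map` and `filter` calls on a local `List` look the same as the RDD operations Spark exposes on distributed data.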
Introduction to Spark concepts
Resilient Distributed Datasets (RDDs)
● Core Spark abstraction
● Immutable distributed collection of objects
● Split into multiple partitions
● May be computed on different nodes of the cluster
● Can contain any type of Scala, Java or Python object
including user-defined classes
“Distributed Scala collections”
Driver and context
● Driver
○ Shell
○ Standalone program
● Spark Context represents a connection to a computing cluster
RDD Operations
● Transformations
○ map
○ filter
○ flatMap
○ sample
○ groupByKey
○ reduceByKey
○ union
○ join
○ sort
● Actions
○ count
○ collect
○ reduce
○ lookup
○ save
● Transformations are lazy
● Actions force the computation of transformations
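
To illustrate the split between lazy transformations and actions, here is a small word-count sketch. It assumes Spark 1.3 or later (earlier versions also need `import org.apache.spark.SparkContext._` for `reduceByKey`); `LazyRddExample` and `wordCounts` are illustrative names.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyRddExample {
  // Counts word occurrences; every step except collect() is a lazy transformation.
  def wordCounts(sc: SparkContext, lines: Seq[String]): Map[String, Int] = {
    val words = sc.parallelize(lines)   // distribute a local collection as an RDD
      .flatMap(_.split("\\s+"))         // transformation: nothing runs yet
      .filter(_.nonEmpty)               // transformation: still lazy
    val counts = words
      .map(word => (word, 1))
      .reduceByKey(_ + _)               // transformation that requires a shuffle
    counts.collect().toMap              // action: triggers the whole chain
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("LazyRddExample"))
    try {
      println(wordCounts(sc, Seq("to be or not to be", "to spark")))
    } finally {
      sc.stop()
    }
  }
}
```

Until `collect()` is called, Spark only records the lineage of the RDDs, which is also what lets it recompute lost partitions.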
Narrow vs wide dependencies
Demo using only core operations
Specialized operations for specific
types of RDDs
Specialized operations for Key/Value pairs
● reduceByKey
● groupByKey
● combineByKey
● mapValues
● flatMapValues
● keys
● sortByKey
● subtractByKey
● join
● rightOuterJoin
● leftOuterJoin
● cogroup
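
A short sketch of three of the operations listed above (`reduceByKey`, `mapValues`, `join`), assuming Spark 1.3 or later. The sample data and the names `PairRddExample` and `totalsByName` are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PairRddExample {
  // Total order amount per customer name.
  def totalsByName(sc: SparkContext): Map[String, Double] = {
    // Hypothetical sample data: (customerId, amount) and (customerId, name).
    val orders = sc.parallelize(Seq((1, 10.0), (2, 5.0), (1, 2.5)))
    val names  = sc.parallelize(Seq((1, "alice"), (2, "bob")))

    val totals = orders.reduceByKey(_ + _)       // sums the amounts per customerId
    val upper  = names.mapValues(_.toUpperCase)  // changes values, keeps keys
    val joined = totals.join(upper)              // (id, (total, NAME))

    joined.values
      .map { case (total, name) => (name, total) }
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("PairRddExample"))
    try println(totalsByName(sc)) finally sc.stop()
  }
}
```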
Specialized operations for numeric RDDs
● count
● mean
● sum
● max
● min
● variance
● sampleVariance
● stdev
● sampleStdev
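
As a sketch of the numeric operations, `stats()` on an RDD of doubles returns most of these values in a single pass. This assumes Spark 1.3 or later; `NumericRddExample` and `summarize` are illustrative names.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.StatCounter

object NumericRddExample {
  // One pass over the data yields count, mean, sum, min, max and variance.
  def summarize(sc: SparkContext, values: Seq[Double]): StatCounter =
    sc.parallelize(values).stats()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("NumericRddExample"))
    try {
      val s = summarize(sc, Seq(1.0, 2.0, 3.0, 4.0))
      println(s"count=${s.count} mean=${s.mean} sum=${s.sum} min=${s.min} max=${s.max}")
      println(s"stdev=${s.stdev} sampleStdev=${s.sampleStdev}")
    } finally {
      sc.stop()
    }
  }
}
```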
And many more...
● HadoopRDD
● FilteredRDD
● MappedRDD
● PairRDD
● ShuffledRDD
● UnionRDD
● DoubleRDD
● JdbcRDD
● JsonRDD
● SchemaRDD
● VertexRDD
● EdgeRDD
● CassandraRDD
● GeoRDD
● EsSpark (Elasticsearch)
Spark SQL
Spark SQL Overview
● Newest component of Spark
● Tightly integrated to work with structured data
○ Tables with rows and columns
● Transform RDDs using SQL
● Data source integration: Hive, Parquet, JSON and more…
● Optimizes execution plan
Differences with Spark Core
● Spark + RDDs
○ Functional transformations on
collections of objects
● SQL + SchemaRDDs
○ Declarative transformations on
collections of tuples
Getting started with Spark SQL
● Create an instance of SQLContext or HiveContext
○ Entry point for all SQL functionality
○ Wraps/extends existing Spark Context (Decorator Pattern)
● If you’re using the shell, a SQLContext has been created for you
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
val sparkContext = new SparkContext("local[4]", "SQL")
val sqlContext = new SQLContext(sparkContext)
Language Integrated UDFs
● Write custom SQL functions in any of the languages
supported by Spark
● Another example of how Spark simplifies the big data stack
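
A minimal sketch of a language-integrated UDF, assuming Spark 1.3 or later (where SchemaRDD is called DataFrame). The `Person` record, the `strLen` function name and the sample rows are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object UdfExample {
  // Hypothetical record type used only for this illustration.
  case class Person(name: String, age: Int)

  def nameLengths(sqlContext: SQLContext): Map[String, Int] = {
    import sqlContext.implicits._

    val people = sqlContext.sparkContext
      .parallelize(Seq(Person("Ann", 31), Person("Pieter", 25)))
      .toDF()
    people.registerTempTable("people")   // make the data visible to SQL

    // Register a plain Scala function so it can be called from SQL.
    sqlContext.udf.register("strLen", (s: String) => s.length)

    sqlContext.sql("SELECT name, strLen(name) FROM people WHERE age > 20")
      .collect()
      .map(row => (row.getString(0), row.getInt(1)))
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("UdfExample"))
    try println(nameLengths(new SQLContext(sc))) finally sc.stop()
  }
}
```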
Parquet compatibility
Native support for reading data stored in Parquet:
● Columnar storage avoids reading unneeded data
● SchemaRDDs can be written to Parquet while preserving the schema
● Convert other slower formats like JSON to Parquet for repeated querying.
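
A rough sketch of a Parquet round trip. Note the assumption: it uses the `write`/`read` API introduced in Spark 1.4, which is newer than the SchemaRDD naming on these slides. The `Event` record and the temporary output path are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetExample {
  // Hypothetical record type used only for this illustration.
  case class Event(id: Int, kind: String)

  // Writes three events to Parquet, reads them back and counts the clicks.
  def roundTrip(sqlContext: SQLContext, path: String): Long = {
    import sqlContext.implicits._

    val events = sqlContext.sparkContext
      .parallelize(Seq(Event(1, "click"), Event(2, "view"), Event(3, "click")))
      .toDF()
    events.write.parquet(path)                    // the schema is stored with the data

    val restored = sqlContext.read.parquet(path)  // column names and types come back
    restored.filter(restored("kind") === "click").count()
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("ParquetExample"))
    try {
      // The target directory must not exist yet, so use a fresh sub-path.
      val path = Files.createTempDirectory("parquet-demo").resolve("events").toString
      println(roundTrip(new SQLContext(sc), path))
    } finally {
      sc.stop()
    }
  }
}
```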
Demo: Spark SQL
Spark MLlib
Machine Learning Algorithms
● Supervised
○ Prediction: Train a model with existing data + label, predict
label for new data
■ Classification (categorical)
■ Regression (continuous numeric)
○ Recommendation: recommend to similar users
■ User -> user, item -> item, user -> item similarity
● Unsupervised
○ Clustering: Find natural clusters in data based on similarities
Algorithms provided by Spark
● Classification and regression
○ Linear models (SVMs, logistic regression, linear regression)
○ Naive Bayes
○ Decision trees
○ Ensembles of trees (Random Forests and Gradient-Boosted trees)
○ Isotonic regression
● Recommendations
○ Alternating Least Squares (ALS)
○ FP-growth
● Clustering
○ K-Means
○ Gaussian mixture
○ Power Iteration clustering
○ Latent Dirichlet allocation
○ Streaming k-means
● Dimensionality reduction
○ Singular value decomposition (SVD)
○ Principal component analysis (PCA)
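
As an example of one algorithm from the list above, here is a small K-Means sketch using the RDD-based `org.apache.spark.mllib` API. The six sample points and the names `KMeansExample` and `predictTwo` are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansExample {
  // Trains on six 2-D points and returns the cluster index of two new points.
  def predictTwo(sc: SparkContext): (Int, Int) = {
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.2), Vectors.dense(0.2, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 8.9), Vectors.dense(8.8, 9.2)
    )).cache()                                  // K-Means iterates over the data

    val model = KMeans.train(points, 2, 20)     // k = 2, at most 20 iterations

    (model.predict(Vectors.dense(0.05, 0.05)),
     model.predict(Vectors.dense(9.0, 9.1)))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("KMeansExample"))
    try println(predictTwo(sc)) finally sc.stop()
  }
}
```

Cluster indices are arbitrary labels, so only the fact that the two points land in different clusters is meaningful.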
Tools provided by Spark
● Tools for basic statistics including
○ Summary statistics
○ Correlations
○ Sampling
○ Hypothesis testing
○ Random data generation
● Tools for feature extraction and transformation
○ Extracting features out of text
○ Uniform Vector format to store features
● Tools to build Machine Learning Pipelines
using Spark SQL
Why choose MLlib?
● One of the best documented machine learning
libraries available for the JVM
● Simple API, constructs are the same for different
algorithms
● Well integrated with other Spark components
Demo: Spark MLlib
Spark Streaming
Spark Streaming Overview
● Built around the concept of DStreams (discretized
streams)
● Long-running Spark application
● Micro-batch architecture
● Supports Flume, Kafka, Twitter, Amazon Kinesis,
Socket, File…
DStreams
● A sequence of RDDs
● Stateless transformations
● Stateful transformations
● Checkpointing
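
A minimal sketch of a DStream with stateless transformations, assuming Spark 1.3 or later and a hypothetical text source on `localhost:9999` (for example started with `nc -lk 9999`). `StreamingWordCount` and `tokenize` are illustrative names.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  // Splits a line into words; kept separate so it can be tested without a cluster.
  def tokenize(line: String): Seq[String] =
    line.split("\\s+").filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    // At least two local threads: one for the receiver, one for processing.
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))   // 1-second micro-batches

    // Each batch interval produces one RDD of received lines.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Stateless transformations, applied independently to every batch.
    val counts = lines
      .flatMap(tokenize)
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()                                     // output operation per batch

    ssc.start()                                        // start receiving and processing
    ssc.awaitTermination()                             // keep the application running
  }
}
```

The transformations are the same ones used on RDDs, which is what makes it practical to share code between batch and streaming jobs.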
Spark Streaming Use Cases
● ETL and enrichment of streaming data on ingestion
● Lambda Architecture
● Operational dashboards
Demo: Spark Streaming
Spark on Amazon EC2
Apache Spark runs easily on Amazon EC2
Apache Spark comes with a script to launch Spark clusters
on Amazon EC2.
So there is no need to invest in a cluster of servers...
Furthermore, it has support for multiple Amazon
components.
● Spark can read files from Amazon S3
● Spark Streaming can easily be integrated with Amazon
Kinesis
Conclusion
Why choose Apache Spark?
● Modern integrated full-stack Big Data framework
● Suitable for both batch and (near) real time applications
● Well supported by a very large community
● The Big Data landscape seems to be shifting toward Apache Spark
Questions?
Ali Hodroj
 

Boosting Big Data with Apache Spark

  • 1. Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be. Data Science Company. Boosting Big Data with Apache Spark. Mathias Lavaert, April 2015
  • 2. About Infofarm
  • 3. Data Science / Big Data: Identifying, extracting and using data of all types and origins; exploring, correlating and using it in new and innovative ways in order to extract meaning and business value from it.
  • 5. Java, PHP, E-Commerce, Mobile, Web Development
  • 7. About me: Mathias Lavaert. Big Data Developer at InfoFarm since May 2014. Proud citizen of West-Flanders. Outdoor enthusiast.
  • 8. Agenda
      • What is Apache Spark?
      • An in-depth overview
          – Spark Core and Resilient Distributed Datasets
          – Unified access to structured data with Spark SQL
          – Machine Learning with Spark MLlib
          – Scalable streaming applications with Spark Streaming
      • Q&A
      • Wrap-up & lunch
  • 9. What is Apache Spark?
  • 10. “Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing”
  • 11. History
      • Created by Matei Zaharia at UC Berkeley in 2009
      • Based on the 2007 Microsoft Dryad paper
      • Donated to the Apache Software Foundation in 2013
      • 465 contributors in 2014, making it the most active Apache project
      • Currently supported by Databricks, a company founded by the creators of Apache Spark
  • 12. Target users
      ● Data Scientists
          ○ Data exploration and data modelling using interactive shells
          ○ Machine Learning
          ○ Ad hoc analysis to answer business questions or discover new insights
      ● Engineers
          ○ Fault-tolerant production data applications
          ○ ‘Productizing’ the work of the data scientist
          ○ Integration with business applications
  • 13. Where to situate Apache Spark?
  • 14. Differences with MapReduce
      • Faster, by minimizing I/O and keeping data in memory as much as possible
      • Unified libraries
      • Huge community effort, very fast development pace
      • Ships with higher-level tools included
  • 15. Daytona GraySort Contest
  • 16. Differences with Hive, Pig, others...
      • One integrated framework that suits a wide range of problems
      • No need for a workflow application like Oozie
      • Only one language/framework to learn
  • 17. Explosion of Specialized Systems
  • 18. Architecture
  • 19. Advantages of unified libraries: advancements in higher-level libraries are pushed down into core and vice versa
      ● Spark Core
          ○ Highly optimized, low-overhead, network-saturating shuffle
      ● Spark Streaming
          ○ Garbage collection, memory management, cleanup improvements
      ● Spark GraphX
          ○ IndexedRDD for random access within a partition vs scanning the entire partition
      ● Spark MLlib
          ○ Statistics (correlations, sampling, heuristics)
  • 20. Supported languages
  • 21. Difference between Java and Scala
  • 22. Cluster Resource Managers
      ● Spark Standalone
          ○ Suitable for a lot of production workloads
          ○ Only suitable for Spark workloads
      ● YARN
          ○ Allows hierarchies of resources
          ○ Kerberos integration
          ○ Multiple workloads from different execution frameworks
              ■ Hive, Pig, Spark, MapReduce, Cascading, etc.
      ● Mesos
          ○ Similar to YARN, but allows elastic allocation
          ○ Coarse-grained
              ■ A single, long-running Mesos task runs Spark mini-tasks
          ○ Fine-grained
              ■ New Mesos task for each Spark task
              ■ Higher overhead, not good for long-running Spark jobs (Streaming)
  • 23. Storage Layers for Spark. Spark can create distributed datasets from:
      ● Any file stored in the Hadoop distributed filesystem (HDFS)
      ● Any storage system supported by the Hadoop APIs
          ○ Local filesystem
          ○ S3
          ○ Cassandra
          ○ Hive
          ○ HBase
      Note that Apache Spark doesn’t require Hadoop, but it has support for storage systems implementing the Hadoop APIs.
  • 24. Short introduction to functional programming
  • 25. What is functional programming? A programming paradigm where the basic unit of abstraction is the function
  • 26. Basic concepts
      ● Higher-order functions
          ○ Functions that take other functions as arguments
          ○ or return functions as their result
      ● Pure functions
          ○ Purely functional expressions have no side effects
      ● Recursion
          ○ Iteration in functional languages is usually accomplished via recursion
      ● Immutable data structures
  • 27. Small example with a functional language: Scala
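The code shown on this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of the four concepts from the previous slide in plain Scala. It is not the speaker's original example; the object and function names (`FunctionalBasics`, `applyTwice`, `adder`, `square`, `sumTo`) are made up for illustration.

```scala
object FunctionalBasics {
  // Higher-order function: takes another function as an argument.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Higher-order function: returns a function as its result.
  def adder(n: Int): Int => Int = (x: Int) => x + n

  // Pure function: the result depends only on the argument, no side effects.
  def square(x: Int): Int = x * x

  // Recursion instead of a loop; the accumulator keeps it tail-recursive.
  @annotation.tailrec
  def sumTo(n: Int, acc: Int = 0): Int =
    if (n <= 0) acc else sumTo(n - 1, acc + n)

  // Immutable data: map returns a new list and leaves the original untouched.
  val numbers: List[Int] = List(1, 2, 3, 4)
  val squares: List[Int] = numbers.map(square)
}
```

For instance, `FunctionalBasics.applyTwice(FunctionalBasics.adder(3), 1)` adds 3 twice, and `numbers` is unchanged after `squares` has been computed.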
  • 28. Introduction to Spark concepts
  • 29. Resilient Distributed Datasets (RDDs)
      ● Core Spark abstraction
      ● Immutable distributed collection of objects
      ● Split into multiple partitions
      ● May be computed on different nodes of the cluster
      ● Can contain any type of Scala, Java or Python object, including user-defined classes
      “Distributed Scala collections”
  • 30. Driver and context
      ● Driver
          ○ Shell
          ○ Standalone program
      ● Spark Context represents a connection to a computing cluster
  • 31. RDD Operations
      ● Transformations
          ○ map
          ○ filter
          ○ flatMap
          ○ sample
          ○ groupByKey
          ○ reduceByKey
          ○ union
          ○ join
          ○ sort
      ● Actions
          ○ count
          ○ collect
          ○ reduce
          ○ lookup
          ○ save
      ● Transformations are lazy
      ● Actions force the computation of transformations
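The split between lazy transformations and actions can be illustrated without a cluster. The sketch below uses a plain Scala `Iterator` as a local analogy, not the Spark API: `map` and `filter` only describe the work, and nothing runs until `toList` forces it, much as an action does for an RDD. The counter `evaluations` and the object name `LazyDemo` are assumptions made for the illustration.

```scala
object LazyDemo {
  // Counts how many times the mapping function actually runs.
  var evaluations = 0

  val data = List(1, 2, 3, 4, 5)

  // "Transformations": chaining map and filter on an iterator runs nothing yet.
  val transformed: Iterator[Int] = data.iterator
    .map { x => evaluations += 1; x * 2 }
    .filter(_ > 4)

  // Snapshot taken before anything forces the iterator.
  val evaluationsBeforeAction: Int = evaluations

  // "Action": materializing the result triggers the pending work.
  val result: List[Int] = transformed.toList

  // Snapshot taken after the whole pipeline has run.
  val evaluationsAfterAction: Int = evaluations
}
```

In Spark itself the same shape would be a chain of RDD transformations ending in `collect` or `count`; the point carried over here is only that the mapping function is not called until the result is demanded.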
  • 32. Narrow vs wide dependencies
  • 33. Demo using only core operations
  • 34. Specialized operations for specific types of RDDs
  • 35. Specialized operations for Key/Value pairs
      ● reduceByKey
      ● groupByKey
      ● combineByKey
      ● mapValues
      ● flatMapValues
      ● keys
      ● sortByKey
      ● subtractByKey
      ● join
      ● rightOuterJoin
      ● leftOuterJoin
      ● cogroup
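To make two of these operations concrete, the sketch below reimplements what `reduceByKey` and `join` compute, on ordinary local Scala collections. These helpers are hypothetical stand-ins that only mirror the result of the Spark operations; they say nothing about how Spark distributes or shuffles the data. The sample data (`sales`, `prices`) is invented.

```scala
object PairOps {
  // Hypothetical local stand-in for reduceByKey: merge all values sharing a key.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  // Hypothetical local stand-in for an inner join on the key.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for {
      (k, v) <- left
      (k2, w) <- right
      if k == k2
    } yield (k, (v, w))

  // Invented sample data: units sold per item and unit price per item.
  val sales: Seq[(String, Int)] = Seq("apple" -> 3, "pear" -> 1, "apple" -> 2)
  val prices: Seq[(String, Double)] = Seq("apple" -> 0.5, "pear" -> 0.8)

  // Total units per item, then each total paired with its price.
  val totals: Map[String, Int] = reduceByKey(sales)(_ + _)
  val joined: Seq[(String, (Int, Double))] = join(totals.toSeq, prices)
}
```

The quadratic `join` above is fine for a handful of records and is chosen for readability only.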
  • 36. Specialized operations for numeric RDDs
      ● count
      ● mean
      ● sum
      ● max
      ● min
      ● variance
      ● sampleVariance
      ● stdev
      ● sampleStdev
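The difference between `variance` and `sampleVariance` (and likewise `stdev` and `sampleStdev`) is the divisor: the population form divides by N, the sample form by N - 1. The sketch below shows that arithmetic on a local `Seq[Double]`; it is a plain-Scala illustration with function names chosen to match the list above, not the Spark implementation, and the sample values are invented.

```scala
object NumericStats {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Sum of squared deviations from the mean.
  private def squaredDeviations(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum
  }

  // Population variance: divide by N.
  def variance(xs: Seq[Double]): Double = squaredDeviations(xs) / xs.size

  // Sample variance: divide by N - 1 (needs at least two values).
  def sampleVariance(xs: Seq[Double]): Double =
    squaredDeviations(xs) / (xs.size - 1)

  def stdev(xs: Seq[Double]): Double = math.sqrt(variance(xs))
  def sampleStdev(xs: Seq[Double]): Double = math.sqrt(sampleVariance(xs))

  // Invented sample: mean 5, squared deviations sum to 32.
  val values: Seq[Double] = Seq(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)
}
```

With these eight values the population variance is 32 / 8 and the sample variance is 32 / 7, so the two only converge as the dataset grows.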
  • 37. And many more...
      ● HadoopRDD
      ● FilteredRDD
      ● MappedRDD
      ● PairRDD
      ● ShuffledRDD
      ● UnionRDD
      ● DoubleRDD
      ● JdbcRDD
      ● JsonRDD
      ● SchemaRDD
      ● VertexRDD
      ● EdgeRDD
      ● CassandraRDD
      ● GeoRDD
      ● EsSpark (Elasticsearch)
  • 38. Spark SQL
  • 39. Spark SQL Overview
      ● Newest component of Spark
      ● Tightly integrated to work with structured data
          ○ Tables with rows and columns
      ● Transform RDDs using SQL
      ● Data source integration: Hive, Parquet, JSON and more…
      ● Optimizes execution plan
  • 40. Differences with Spark Core
      ● Spark + RDDs
          ○ Functional transformations on collections of objects
      ● SQL + SchemaRDDs
          ○ Declarative transformations on collections of tuples
  • 41. Getting started with Spark SQL
      ● Create an instance of SQLContext or HiveContext
          ○ Entry point for all SQL functionality
          ○ Wraps/extends existing Spark Context (Decorator Pattern)
      ● If you’re using the shell, a SQLContext has been created for you

      import org.apache.spark.SparkContext
      import org.apache.spark.sql.SQLContext

      // Local master with 4 worker threads; "SQL" is the application name.
      val sparkContext = new SparkContext("local[4]", "SQL")
      // The SQLContext wraps the existing SparkContext.
      val sqlContext = new SQLContext(sparkContext)
  • 42. Language Integrated UDFs
      ● Ability to write custom SQL functions in any of the languages supported by Spark
      ● Another example of how Spark simplifies the big data stack
  • 43. Parquet compatibility. Native support for reading data stored in Parquet:
      ● Columnar storage avoids reading unneeded data
      ● SchemaRDDs can be written to Parquet while preserving the schema
      ● Convert other, slower formats like JSON to Parquet for repeated querying
  • 44. Demo: Spark SQL
  • 45. Spark MLlib
  • 46. Machine Learning Algorithms
      ● Supervised
          ○ Prediction: Train a model with existing data + label, predict label for new data
              ■ Classification (categorical)
              ■ Regression (continuous numeric)
          ○ Recommendation: recommend to similar users
              ■ User -> user, item -> item, user -> item similarity
      ● Unsupervised
          ○ Clustering: Find natural clusters in data based on similarities
  • 47. Algorithms provided by Spark
      ● Classification and regression
          ○ Linear models (SVMs, logistic regression, linear regression)
          ○ Naive Bayes
          ○ Decision trees
          ○ Ensembles of trees (Random Forests and Gradient-Boosted trees)
          ○ Isotonic regression
      ● Recommendations
          ○ Alternating Least Squares (ALS)
          ○ FP-growth
      ● Clustering
          ○ K-Means
          ○ Gaussian mixture
          ○ Power Iteration clustering
          ○ Latent Dirichlet allocation
          ○ Streaming k-means
      ● Dimensionality reduction
          ○ Singular value decomposition (SVD)
          ○ Principal component analysis (PCA)
  • 48. Tools provided by Spark
      ● Tools for basic statistics, including
          ○ Summary statistics
          ○ Correlations
          ○ Sampling
          ○ Hypothesis testing
          ○ Random data generation
      ● Tools for feature extraction and transformation
          ○ Extracting features out of text
          ○ Uniform Vector format to store features
      ● Tools to build Machine Learning Pipelines using Spark SQL
  • 49. Why choose MLlib?
      ● One of the best-documented machine learning libraries available for the JVM
      ● Simple API, constructs are the same for different algorithms
      ● Well integrated with other Spark components
  • 50. Demo: Spark MLlib
  • 51. Spark Streaming
  • 52. Spark Streaming Overview
      ● Built around the concept of DStreams, or discretized streams
      ● Long-running Spark application
      ● Micro-batch architecture
      ● Supports Flume, Kafka, Twitter, Amazon Kinesis, Socket, File…
  • 53. DStreams
      ● A sequence of RDDs
      ● Stateless transformations
      ● Stateful transformations
      ● Checkpointing
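The micro-batch idea behind DStreams can be modelled locally. In the sketch below, each inner `Seq` stands in for the RDD of one batch interval; a stateless transformation looks at one batch at a time, while a stateful one carries results forward across batches. This is a plain-Scala analogy with invented data and names (`MicroBatches`, `countsPerBatch`, `runningTotals`), not the Spark Streaming API, and it leaves out checkpointing entirely.

```scala
object MicroBatches {
  // Each inner Seq stands for the RDD of one batch interval.
  val batches: Seq[Seq[String]] = Seq(
    Seq("spark", "kafka"),
    Seq("spark"),
    Seq("flume", "spark")
  )

  // Stateless: every batch is counted independently of the others.
  val countsPerBatch: Seq[Map[String, Int]] =
    batches.map(batch => batch.groupBy(identity).map { case (w, ws) => w -> ws.size })

  // Stateful: running totals are carried from one batch to the next.
  val runningTotals: Seq[Map[String, Int]] =
    countsPerBatch.scanLeft(Map.empty[String, Int]) { (state, counts) =>
      counts.foldLeft(state) { case (acc, (w, n)) =>
        acc.updated(w, acc.getOrElse(w, 0) + n)
      }
    }.tail // drop the initial empty state
}
```

The state held between batches is what makes checkpointing relevant in a real streaming job: if the running totals are lost, they cannot be rebuilt from the current batch alone.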
  • 54. Spark Streaming Use Cases
      ● ETL and enrichment of streaming data on ingestion
      ● Lambda Architecture
      ● Operational dashboards
  • 55. Demo: Spark Streaming
  • 56. Spark on Amazon EC2
  • 57. Apache Spark runs easily on Amazon EC2. Apache Spark comes with a script to launch Spark clusters on Amazon EC2, so there is no need to invest in a cluster of servers. It also supports several Amazon components:
      ● Spark can read files from Amazon S3
      ● Spark Streaming can easily be integrated with Amazon Kinesis
  • 58. Conclusion
  • 59. Why choose Apache Spark?
      ● Modern, integrated, full-stack Big Data framework
      ● Suitable for both batch and (near) real-time applications
      ● Well supported by a very large community
      ● The Big Data landscape seems to be shifting towards Apache Spark
  • 60. Questions?