© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 1
Mohammed Guller
May 17, 2016
Introduction to
Big Data and Apache Spark
Agenda
• Big Data
• Big Data Technologies
• Hadoop
• Spark Core
• Spark Libraries
About Me
• Principal Architect at Glassbeam
• Founded two startups
• Passionate about building products, big data analytics, and
machine learning
• www.linkedin.com/in/mohammedguller
• @MohammedGuller
Big Data Analytics with Spark
Available on Amazon
• Hands-on guide with lots of examples
• Covers both fundamental and advanced topics such as machine learning
• Includes a primer on functional programming and Scala
• Introduces other important Big Data technologies such as HDFS, Parquet, Kafka, HBase, Cassandra, Mesos, and YARN
About Glassbeam
Glassbeam brings structure and meaning to data from any connected machine or device while providing actionable intelligence
Cloud-based analytics platform that helps organizations turn raw machine data into insights
Making sense of multi-structured machine data
• Data center devices
• Medical devices
• Sensors
• ATMs
• Automobiles
• Data from any machine
Providing a comprehensive set of apps & tools for machine data analysis
• 50,000+ systems being tracked today
• 1,500+ different software rev codes
• 1.2 billion sensor readings per day
• 1+ trillion sensor readings tracked
Big Data
Open-source Big Data Technologies
Hadoop
Spark Core
Spark SQL
Spark Streaming
Machine Learning
GraphX
What is Big Data
• Volume: scale of data
• Variety: diversity of data
• Velocity: speed of data
Data Growing At a Faster Pace Than Ever
Internet of Things (IoT)
• Network of objects embedded with
software for collecting and sending data
over the Internet
• 5x more connected things than people by
2020
Industrial IoT
• Manufacturing
• Automotive
• Medical
• Data Center
• EVC
• Smart Meter
Glassbeam target market is focused on driving operational & business analytics value for connected product companies in the Industrial IoT market
• IT & Networks
• Medical & Health Care
• Transportation
• EV Chargers & Smart Grid
• Industrial & Mfg
Big Data Comes With Big Challenges
• Storage
• Processing
• Value
Storage Challenges
• Legacy SAN / NAS storage devices are expensive
• Traditional RDBMS were not designed for Big Data
• Cannot handle volume, velocity, variety of Big Data
Processing Challenges
• Diverse processing
• Organizations want to do more than just BI / traditional analytics
• Go beyond SQL queries
• Timeliness
• Process data in a reasonable amount of time
• Value of data decreases over time
How Much Data Can a Standard Server Process
Scale-up with Powerful High-end Server
• Large number of CPUs / cores
• Faster cores
• Large amount of memory
• Faster memory bus
• High-performance architecture
Disadvantages of Scale-up Architecture
• Proprietary
• Expensive
• Limited scalability
Scale-out Architecture
• Cluster of servers
• Commodity machines
• Pool together resources
• CPU
• Memory
• Disk
Benefits of Scale-out Architecture
• Relatively inexpensive
• Economical to scale
• No huge upfront investment
• Start small and expand cluster as workload increases
Challenges With Scale-out Architecture
• Writing distributed applications is very hard
• Split job into chunks that can be distributed across a cluster
• Schedule compute resources among different jobs
• Manage inter-node communication
• Handle network and node failures
• Hardware failures are more common at a cluster level
• Probability of a single node failing is low
• Probability of any one node in a large cluster failing is high
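The last two bullets follow from basic probability. Here is a minimal sketch, assuming node failures are independent and every node has the same failure probability over some period. The 1% figure and the cluster sizes are illustrative assumptions, not numbers from the talk.

```python
def prob_any_node_fails(p_single: float, num_nodes: int) -> float:
    """Probability that at least one of num_nodes independent nodes fails."""
    # All nodes survive with probability (1 - p)^n; the complement is
    # the chance that one or more of them fail.
    return 1.0 - (1.0 - p_single) ** num_nodes


p = 0.01  # assumed per-node failure probability (hypothetical)
for n in (1, 10, 100, 1000):
    print(f"{n:>5} nodes: {prob_any_node_fails(p, n):.4f}")
```

With these assumed numbers, a single node rarely fails, while a 1,000-node cluster almost certainly sees at least one failure in the same period.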
Getting Value Out of Big Data
• Traditional analytics / BI
• Custom processing
• Machine Learning
• Predictive analytics
• Automate complex tasks
• Stream processing
• Analyze in real-time/near real-time
• React in real-time
Traditional Analytics / BI
• What
• Customer growth for the last month/quarter/year
• Segmentation of customers by demographics
• Average time spent by mobile app users
• Why
• Sales growth slowed
• regional issue
• supply issue
• Profit dropped
• revenue dropped
• expenses increased
Custom Processing
• Index web pages
• Google
• Bing
• Process genome data
• Identify mutations linked to cancer, Alzheimer's, and other diseases
• Click analysis
• Log analysis
• 360-degree real-time view of a customer
Predictive Analytics
• Advertisements that a visitor will most likely click
• Movies / songs / news that a customer will like
• Products that a customer will buy
• Whether a patient will have a heart attack
Automate Complex Tasks
• Virtual assistant
• Siri
• Google Now
• Autonomous machine
• Self-driving car
• Robots
• Tag Images
• Facebook
• Flickr
• Expert System
• Medical diagnosis
• Personalized medicine
• Security
• Fraud detection
• Network Security
• Music recognition
• Shazam
• SoundHound
File Formats
• Text
• CSV
• JSON
• XML
• Binary
• Sequence File
• Avro
• Parquet
• Optimized Row Columnar (ORC)
Distributed SQL Query Engine
• Hive
• Impala
• Presto
• Drill
• Phoenix
• HAWQ
• Tajo
• Spark SQL
[Diagram labels: Data Warehouse, Distributed Storage, Distributed Query Engine]
Messaging Systems
• Kafka
• RabbitMQ
• ActiveMQ
• ZeroMQ
General-purpose Data Processing Frameworks
• Batch
• Hadoop MapReduce
• HPCC
• Stream
• Storm
• Samza
• Kafka Streams
• Batch and Stream
• Spark
• Flink
• Beam
• Apex
• Ignite
Hadoop
Hadoop is Not a Single Product
Hadoop Core Components
Spark
Fast, easy-to-use, general-purpose cluster computing framework
for processing large datasets using a simple programming model
Benefits
• Scale
• Fault-tolerance
• Abstracts distributed computing
• Hides the messy details of writing distributed applications
• Allows developers to just focus on the data processing logic
• Same code works on a laptop or a cluster of servers
• Ease-of-use
• Speed
• Flexibility
Easy To Use
• Library with an expressive API
• Scala, Java, Python, R
• 80+ operators
• Interactive development
• spark-shell
• notebooks
Integrated Libraries For a Variety of Data Processing Tasks
• Batch processing
• Interactive analytics
• Stream analysis
• Machine learning
• Graph analytics
[Diagram: Spark Core with Spark SQL, Spark Streaming, MLlib, and GraphX]
Unified Platform for a Variety of Data Processing
• Solve a variety of problems with a single toolkit
• No need to learn different tools for each use case
• Avoid code and data duplication
• Achieve operational simplicity
Why is Spark Fast
• Advanced job execution engine
• Allows applications to cache data in memory
Advanced Job Execution Engine
• Directed Acyclic Graph (DAG) of stages
• simple job can contain just one stage
• complex job can contain many stages
• eliminates expensive operations between multiple jobs
• synchronization
• serialization/deserialization
• disk I/O
• Lazy operator evaluation
• Pipelined operations
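Lazy evaluation and pipelining can be illustrated without a cluster. The sketch below is a plain-Python analogy built on generators, not Spark code. The names `parse`, `keep_even`, and `log` are hypothetical and exist only for this illustration.

```python
from typing import Iterable, Iterator, List

log: List[str] = []  # records the order in which work actually happens


def parse(lines: Iterable[str]) -> Iterator[int]:
    """First operator: convert each text line to an integer."""
    for line in lines:
        log.append(f"parse {line}")
        yield int(line)


def keep_even(numbers: Iterable[int]) -> Iterator[int]:
    """Second operator: keep only even numbers."""
    for n in numbers:
        log.append(f"filter {n}")
        if n % 2 == 0:
            yield n


# Building the pipeline does no work yet (lazy evaluation).
pipeline = keep_even(parse(["1", "2", "3", "4"]))
assert log == []

# Consuming the result triggers execution. Each element flows through
# parse and then filter before the next element is read (pipelining),
# so no intermediate collection is materialized.
total = sum(pipeline)
print(total)     # 6
print(log[:4])   # ['parse 1', 'filter 1', 'parse 2', 'filter 2']
```

The interleaved log shows why chained operators can run in one pass over the data instead of writing intermediate results between steps.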
Why Caching Makes Applications Run Faster
[Figure: data transfer rates of 100 MB/s, 500 MB/s, and 10 GB/s]
Read Latency Comparison
[Chart: time in minutes (axis from 0 to 200) to read 1 TB of data from HDD, SSD, and RAM]
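The read-time comparison can be reproduced with simple arithmetic. This sketch assumes the three transfer rates on the caching slide correspond to HDD, SSD, and RAM in that order, and that 1 TB means 1,000,000 MB (decimal units). Both are assumptions, since the original figures did not survive extraction.

```python
# Assumed mapping of the slide's three rates to storage media.
RATES_MB_PER_S = {"HDD": 100, "SSD": 500, "RAM": 10_000}
DATA_MB = 1_000_000  # 1 TB expressed in decimal megabytes


def read_minutes(data_mb: float, rate_mb_per_s: float) -> float:
    """Minutes needed to read data_mb megabytes at a sustained rate."""
    return data_mb / rate_mb_per_s / 60


for medium, rate in RATES_MB_PER_S.items():
    print(f"{medium}: {read_minutes(DATA_MB, rate):.1f} min")
# HDD: 166.7 min
# SSD: 33.3 min
# RAM: 1.7 min
```

Under these assumptions, reading 1 TB from memory takes under two minutes versus well over two hours from a hard disk, which fits the 0 to 200 minute axis of the chart.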
Primary Data Abstractions
• Resilient Distributed Dataset (RDD)
• Dataset
Resilient Distributed Dataset (RDD)
• Represents data as an immutable partitioned collection of
elements, which can be processed in parallel
• Provides the methods / operations for processing data
• Conceptually similar to a Scala collection, except for two things:
• represents a partitioned dataset, which may be distributed over a
cluster
• operations are lazily evaluated
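The idea of an immutable, partitioned collection processed in parallel can be sketched in plain Python. This is an analogy, not Spark's implementation. Threads stand in for cluster nodes, tuples stand in for immutable partitions, and `partition` and `map_partitions` are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, Tuple

Partitions = Tuple[Tuple[int, ...], ...]


def partition(data: Sequence[int], num_partitions: int) -> Partitions:
    """Split data into immutable partitions (round-robin by position)."""
    return tuple(tuple(data[i::num_partitions]) for i in range(num_partitions))


def map_partitions(parts: Partitions, fn: Callable[[int], int]) -> Partitions:
    """Apply fn to every element, one worker per partition.

    The input is never modified; a new set of partitions is returned.
    """
    def work(part: Tuple[int, ...]) -> Tuple[int, ...]:
        return tuple(fn(x) for x in part)

    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        return tuple(pool.map(work, parts))


parts = partition(range(10), 3)
squared = map_partitions(parts, lambda x: x * x)
total = sum(sum(p) for p in squared)
print(parts)   # ((0, 3, 6, 9), (1, 4, 7), (2, 5, 8))
print(total)   # 285
```

Because each partition is transformed independently, the same logic works whether the partitions live in one process or on many machines.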
Dataset
• Introduced in Spark 1.6.0
• Represents data as an immutable, partitioned, strongly-typed
collection of elements, which can be processed in parallel
• Benefits from Spark SQL optimizations
• performance optimizations
• memory optimizations (cached dataset uses 5x less space)
• Provides compile-time type-safety similar to the RDD API
Dataset Uses a Specialized Encoder
• Highly optimized
• Runtime code generation for serialization and deserialization
• Significantly faster than Java or Kryo serialization
• Encoded data can be significantly smaller (up to 2x)
• Serialized data is in the Tungsten binary format
• many operations can be done in-place without deserializing
• Built-in support for automatically generating encoders for
• primitive types (e.g. String, Integer, Long)
• Scala case classes
• Java Beans
Relationship Between Dataset and DataFrame
• DataFrame = Dataset[Row]
• Dataset is a partitioned collection of strongly-typed elements
• DataFrame is a partitioned collection of generic Row objects
• DataFrame can be transformed into a Dataset by calling
df.as[ElementType]
• Dataset can be transformed into a DataFrame by calling
ds.toDF()
High-level Architecture
[Diagram: Driver Program, Master, and three Workers; each Worker is shown with an Executor, a Task, and a Cache]
Ideal Applications
• Complex data processing
• multi-step pipeline
• Iterative algorithm
• Machine Learning
• Graph analytics
• Ad hoc analysis
• Interactive
Spark Does Not Provide Storage
• Works with a variety of data sources
• No need to import data into Spark
• Scale compute and storage clusters independently
Process Data From a Variety Of Data Sources
And Many More
Spark Does Not Replace Hadoop
Hadoop is Optional
Spark SQL
• Spark library for processing structured data
• Provides more than just a SQL interface to Spark
• Increases developer productivity
• Makes applications run faster
• Uniform API for processing data from a variety of sources
[Diagram: Spark core, Spark SQL]
Increases Productivity with Higher-level API
• SQL
• Declarative language
• HiveQL
• SQL-like
• Hive not required
• Language integrated queries
• DataFrame
• Scala, Java, Python and R
Makes Applications Run Faster
• Query optimization
• Python, Scala, Java, and SQL code all use the same optimizer
• Reduced disk I/O
• skip partitions
• skip columns (columnar storage)
• skip rows (using statistical information)
• in-memory columnar caching
• Predicate pushdown
• push operations to a database / data source
• Code generation
• generate Java bytecode
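The "reduced disk I/O" bullets can be made concrete with a small plain-Python sketch of column skipping and statistics-based row skipping. It is not Spark SQL's implementation. `RowGroup`, `scan`, and the sample sensor columns are hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RowGroup:
    """A chunk of columnar data plus min/max statistics for the ts column."""
    columns: Dict[str, List[int]]  # column name -> values
    min_ts: int
    max_ts: int


def scan(groups: List[RowGroup], column: str,
         ts_from: int, ts_to: int) -> Tuple[List[int], int]:
    """Return `column` values for rows with ts_from <= ts <= ts_to,
    along with the number of row groups that had to be read."""
    result: List[int] = []
    groups_read = 0
    for g in groups:
        # Skip whole groups whose statistics rule out any match.
        if g.max_ts < ts_from or g.min_ts > ts_to:
            continue
        groups_read += 1
        # Only the two needed columns are touched; "rpm" is never read.
        for ts, value in zip(g.columns["ts"], g.columns[column]):
            if ts_from <= ts <= ts_to:
                result.append(value)
    return result, groups_read


groups = [
    RowGroup({"ts": [1, 2, 3], "temp": [20, 21, 22], "rpm": [900, 910, 905]}, 1, 3),
    RowGroup({"ts": [4, 5, 6], "temp": [23, 24, 25], "rpm": [915, 920, 925]}, 4, 6),
    RowGroup({"ts": [7, 8, 9], "temp": [26, 27, 28], "rpm": [930, 935, 940]}, 7, 9),
]
values, groups_read = scan(groups, "temp", ts_from=5, ts_to=6)
print(values, groups_read)  # [24, 25] 1
```

Only one of the three row groups is read, and within it only the columns the query needs, which is where the I/O savings come from.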
Uniform Data Access
Spark SQL
Primary Data Abstraction is DataFrame
• Represents a partitioned collection organized into named
columns
• Equivalent to a table in a relational database
• Inspired by DataFrames in Python / R
• Provides methods / operations for language integrated queries
in Scala, Java, Python, and R
• Constructed from structured data files, Hive tables, RDBMS,
NoSQL datastores, or RDDs
• Distributed SQL query engine
• JDBC / ODBC support
• BI and visualization app
• Tableau, Zoomdata, Qlik
• Run SQL / HiveQL queries
• No Scala, Java, R, or Python
[Diagram: BI / Visualization App, JDBC / ODBC, Thrift JDBC/ODBC server, Spark SQL, HDFS / S3, NoSQL, RDBMS]
Spark Streaming
• Extends Spark for data stream processing
• Higher-level operators for stream analytics
• Integrates with other Spark libraries
• Spark SQL
• MLlib
• GraphX
• High throughput
[Diagram: Spark Core, Spark Streaming]
High-level Architecture
• Data stream split into micro-batches
• Batch interval specified by application
• Stream processed as a sequence of micro-batch jobs
• Each batch is processed as an RDD by Spark core
• Results of RDD operations are returned in batches
[Diagram: input stream, Spark Streaming, batches of x seconds, Spark Core, result stream]
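The micro-batch model can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the Spark Streaming API. `micro_batches`, `process`, and the sample events are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, payload)


def micro_batches(events: Iterable[Event], interval: float) -> List[List[str]]:
    """Group events into consecutive batches of `interval` seconds."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for ts, payload in events:
        buckets[int(ts // interval)].append(payload)
    if not buckets:
        return []
    # Emit every interval in order, including intervals with no events.
    return [buckets.get(i, []) for i in range(min(buckets), max(buckets) + 1)]


def process(batch: List[str]) -> int:
    """Stand-in for a batch job: count the records in the batch."""
    return len(batch)


stream = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
results = [process(b) for b in micro_batches(stream, interval=1.0)]
print(results)  # [2, 1, 0, 1]
```

Each batch is handled by the same kind of job a batch system would run, and the per-batch results form the output stream.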
Primary Abstraction is DStream
• Represents a data stream
• Implemented as a sequence of RDDs
• Provides operators for processing streams
• Basic operators similar to RDD operators
• Window operations
• Stateful stream processing
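A window operation combines the results of several consecutive batches. The plain-Python sketch below assumes the window length and the slide are both expressed as a number of micro-batches. `sliding_window_counts` and the sample counts are hypothetical.

```python
from typing import List


def sliding_window_counts(batch_counts: List[int], window: int, slide: int) -> List[int]:
    """Sum per-batch counts over windows of `window` batches,
    advancing `slide` batches between consecutive windows."""
    return [
        sum(batch_counts[end - window:end])
        for end in range(window, len(batch_counts) + 1, slide)
    ]


counts = [2, 1, 0, 1, 4, 3]  # records per micro-batch
windowed = sliding_window_counts(counts, window=3, slide=2)
print(windowed)  # [3, 5]
```

The first window covers batches 1 to 3 and the second covers batches 3 to 5, so a batch can contribute to more than one window.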
Spark Streaming Sources And Destinations
Machine Learning Applications
• Regression
• Home price valuation
• Classification
• Image classification
• Clustering
• Customer segmentation
• Anomaly detection
• Fraud
• Recommendation
• Movie
• Dimensionality reduction
What is Machine Learning
• Program software to learn from data
• Logic is not explicitly programmed
• Infer patterns in a dataset
• Generalize beyond the training set
• Mathematics provides the foundation
Broad Categorization of ML Algorithms
• Supervised
• Labeled dataset
• Examples: Linear Regression, SVM, Neural Networks
• Unsupervised
• Unlabeled dataset
• Example: K-Means
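The difference between the two categories shows up in what the algorithm is given. The sketch below uses plain Python rather than MLlib: a least-squares line fit learns from labeled pairs, while a one-dimensional k-means with k = 2 groups unlabeled points. The data and the helper names `fit_line` and `two_means` are hypothetical.

```python
from statistics import mean
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Supervised: least-squares fit of y = slope * x + intercept
    from labeled (x, y) pairs."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


def two_means(points: List[float], iterations: int = 10) -> Tuple[float, float]:
    """Unsupervised: 1-D k-means with k = 2, seeded with the smallest
    and largest point. Assumes at least two distinct values."""
    c1, c2 = min(points), max(points)
    for _ in range(iterations):
        near_c1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
        near_c2 = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1, c2 = mean(near_c1), mean(near_c2)
    return c1, c2


slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # labels are given
centers = two_means([1.0, 1.2, 0.8, 9.0, 9.5, 8.5])       # no labels
print(slope, intercept)
print(centers)
```

The fitted line recovers the rule that generated the labels (here y = 2x + 1), whereas the clustering step discovers two groups, centered near 1 and 9, without being told they exist.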
MLlib
Machine Learning with Spark
[Diagram: Spark Core, spark.mllib, spark.ml]
spark.mllib
• Extends Spark for ML and statistical analysis
• Provides higher-level API than Spark Core for ML
• Dataset represented by an RDD
spark.ml
• Another machine learning library that runs on top of Spark
• Less mature than spark.mllib
• Dataset represented by a DataFrame
• Higher-level abstraction
• Feature engineering
• Model training, tuning and evaluation
• Makes it easy to create ML pipelines / workflows
MLlib Supports Machine Learning Workflow
• Statistical analysis
• Feature extraction
• Feature transformation
• Dimensionality reduction
• Machine learning algorithms
• Model tuning
• Model evaluation
Statistical Analysis
• Summary statistics
• Correlations
• Stratified sampling
• Hypothesis testing
• Random data generation
Feature Engineering
• Feature Extraction and Transformation
• TF-IDF
• Word2Vec
• StandardScaler
• Normalizer
• ChiSqSelector
• ElementwiseProduct
• PCA
• Dimensionality Reduction
• Singular value decomposition (SVD)
• Principal component analysis (PCA)
Machine Learning Algorithms
• Regression and Classification
• Linear regression, Isotonic regression
• Logistic regression, SVM
• Naive Bayes
• Random Forest, Gradient-Boosted Trees
• Multilayer perceptron (Feedforward NN)
• Recommendation
• Collaborative Filtering (ALS)
• Clustering
• K-means
• Gaussian mixture
• Power iteration clustering (PIC)
• Latent Dirichlet allocation (LDA)
• Bisecting k-means
• Streaming k-means
• Frequent pattern mining
• FP-growth
• Association rules
• PrefixSpan
Graph Intro
• Composed of vertices and edges
• Graphs are ubiquitous
• Web pages
• Social networks
• Transportation hubs
• Graph theory provides efficient algorithms for graph-oriented
data
GraphX
• Extends Spark for analyzing large-scale graph-oriented data
• Provides higher-level abstractions than Spark core for graph analytics
• Unifies collections and graphs as first-class composable objects
[Diagram: Spark core, GraphX]
Primary Data Abstraction is Graph
• Fundamental graph operators
• filter
• reverse
• subgraph
• joinVertices
• mapVertices
• mapEdges
• aggregateMessages
• Advanced graph operators
• pregel
• pageRank
• connectedComponents
• stronglyConnectedComponents
• triangleCount
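To give a feel for what an operator such as connectedComponents computes, here is a single-machine sketch in plain Python. It is not GraphX code and does not use a distributed algorithm. It labels every vertex with the smallest vertex id in its component, which is an assumed labeling convention for this illustration, and the function name and sample graph are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple


def connected_components(vertices: Iterable[int],
                         edges: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Label each vertex with the smallest vertex id in its component."""
    neighbors: Dict[int, Set[int]] = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)  # treat edges as undirected

    label: Dict[int, int] = {}
    for start in sorted(vertices):
        if start in label:
            continue
        # Breadth-first search from the smallest unlabeled vertex.
        queue = deque([start])
        label[start] = start
        while queue:
            v = queue.popleft()
            for n in neighbors[v]:
                if n not in label:
                    label[n] = start
                    queue.append(n)
    return label


labels = connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
print(labels)  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```

Vertices 1 to 3 form one component and vertices 4 and 5 form another, so the result maps each vertex to the id that names its group.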
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
DataStax Academy
 
Glassbeam: Ad-hoc Analytics on Internet of Complex Things with Apache Cassand...
Glassbeam: Ad-hoc Analytics on Internet of Complex Things with Apache Cassand...Glassbeam: Ad-hoc Analytics on Internet of Complex Things with Apache Cassand...
Glassbeam: Ad-hoc Analytics on Internet of Complex Things with Apache Cassand...
DataStax Academy
 
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Chris Fregly
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
Taras Matyashovsky
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Chris Fregly
 
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperHadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS Developer
DataWorks Summit
 
IoT Connected Brewery
IoT Connected BreweryIoT Connected Brewery
IoT Connected Brewery
Jason Hubbard
 
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Chris Fregly
 
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Imam Raza
 
Dev nexus 2017
Dev nexus 2017Dev nexus 2017
Dev nexus 2017
Roy Russo
 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platform
hadooparchbook
 
Hands-On Lab: Let's Build an ITSM Dashboard
Hands-On Lab: Let's Build an ITSM DashboardHands-On Lab: Let's Build an ITSM Dashboard
Hands-On Lab: Let's Build an ITSM Dashboard
CA Technologies
 
Ad

Introduction to big data and apache spark

  • 1. Introduction to Big Data and Apache Spark. Mohammed Guller, May 17, 2016. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE
  • 2. Agenda
      • Big Data
      • Big Data Technologies
      • Hadoop
      • Spark Core
      • Spark Libraries
  • 3. About Me
      • Principal Architect at Glassbeam
      • Founded two startups
      • Passionate about building products, big data analytics, and machine learning
      • www.linkedin.com/in/mohammedguller
      • @MohammedGuller
  • 4. Big Data Analytics with Spark (available on Amazon)
      • Hands-on guide with lots of examples
      • Covers both fundamental and advanced topics such as machine learning
      • Includes a primer on functional programming and Scala
      • Introduces other important Big Data technologies such as HDFS, Parquet, Kafka, HBase, Cassandra, Mesos, and YARN
  • 5. About Glassbeam
      • Glassbeam brings structure and meaning to data from any connected machine or device while providing actionable intelligence
      • Cloud based analytics platform that helps organizations turn raw machine data to insights
      • Making sense of multi structured machine data
          • Data center devices
          • Medical devices
          • Sensors
          • ATMs
          • Automobiles
          • Data from any machine
      • Providing comprehensive set of apps & tools for machine data analysis
          • 50,000+ systems being tracked today
          • 1,500+ different software rev codes
          • 1.2 Billion sensor readings per day
          • 1+ Trillion sensor readings tracked
  • 6. Big Data, Open-source Big Data Technologies, Hadoop, Spark Core, Spark SQL, Spark Streaming, Machine Learning, GraphX
  • 7. What is Big Data
      • Volume: Scale of Data
      • Variety: Diversity of Data
      • Velocity: Speed of Data
  • 8. Data Growing At a Faster Pace Than Ever
  • 9. Internet of Things (IoT)
      • Network of objects embedded with software for collecting and sending data over the Internet
      • 5x more connected things than people by 2020
  • 10. Industrial IoT
      • Manufacturing
      • Automotive
      • Medical
      • Data Center
      • EVC
      • Smart Meter
      • Glassbeam target market is focused on driving operational & business analytics value for connected product companies in Industrial IoT market
      • IT & Networks, Medical & Health Care, Transportation, EV Chargers & Smart Grid, Industrial & Mfg
  • 11. Big Data Comes With Big Challenges
      • Storage
      • Processing
      • Value
  • 12. Storage Challenges
      • Legacy SAN / NAS storage devices are expensive
      • Traditional RDBMS were not designed for Big Data
      • Cannot handle volume, velocity, variety of Big Data
  • 13. Processing Challenges
      • Diverse processing
          • Organizations want to do more than just BI / traditional analytics
          • Go beyond SQL queries
      • Timeliness
          • Process data in a reasonable amount of time
          • Value of data decreases over time
  • 14. How Much Data Can a Standard Server Process
  • 16. Scale-up with Powerful High-end Server
      • Large number of CPUs / cores
      • Faster cores
      • Large amount of memory
      • Faster memory bus
      • High-performance architecture
  • 17. Disadvantages of Scale-up Architecture
      • Proprietary
      • Expensive
      • Limited scalability
  • 18. Scale-out Architecture
      • Cluster of servers
      • Commodity machines
      • Pool together resources
          • CPU
          • Memory
          • Disk
  • 19. Benefits of Scale-out Architecture
      • Relatively inexpensive
      • Economical to scale
      • No huge upfront investment
      • Start small and expand cluster as workload increases
  • 20. Challenges With Scale-out Architecture
      • Writing distributed applications is very hard
          • Split job into chunks that can be distributed across a cluster
          • Schedule compute resources among different jobs
          • Manage inter-node communication
          • Handle network and node failures
      • Hardware failures are more common at a cluster level
          • Probability of a single node failing is low
          • Probability of any one node in a large cluster failing is high
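
The point about failure probability on slide 20 can be checked with a little arithmetic. The sketch below assumes node failures are independent and uses a hypothetical 1% chance that a given node fails during some period; neither the figure nor the function name comes from the slides. Under that assumption, the chance that at least one of n nodes fails is 1 - (1 - p)^n.

```python
def prob_any_node_fails(p_single: float, nodes: int) -> float:
    """Probability that at least one of `nodes` machines fails,
    assuming each fails independently with probability `p_single`."""
    return 1.0 - (1.0 - p_single) ** nodes


# Hypothetical per-node failure probability of 1% over some period.
for n in (1, 10, 100, 1000):
    print(f"{n:>5} nodes: {prob_any_node_fails(0.01, n):.4f}")
```

With these assumed numbers, a single node fails rarely, yet a 100-node cluster is more likely than not to see at least one failure, which is why a scale-out framework has to treat failure as a routine event.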
  • 21. Getting Value Out of Big Data
      • Traditional analytics / BI
      • Custom processing
      • Machine Learning
          • Predictive analytics
          • Automate complex tasks
      • Stream processing
          • Analyze in real-time/near real-time
          • React in real-time
  • 22. Traditional Analytics / BI
      • What
          • Customer growth for the last month/quarter/year
          • Segmentation of customers by demographics
          • Average time spent by mobile app users
      • Why
          • Sales growth slowed
              • regional issue
              • supply issue
          • Profit dropped
              • revenue dropped
              • expenses increased
  • 23. Custom Processing
      • Index web pages
          • Google
          • Bing
      • Process genome data
          • Identify mutations linked to cancer, Alzheimer's, and other diseases
      • Click analysis
      • Log analysis
      • 360-degree real-time view of a customer
  • 24. Predictive Analytics
      • Advertisements that a visitor will most likely click
      • Movies / songs / news that a customer will like
      • Products that a customer will buy
      • Whether a patient will have a heart attack
  • 25. Automate Complex Tasks
      • Virtual assistant
          • Siri
          • Google Now
      • Autonomous machine
          • Self-driving car
          • Robots
      • Tag Images
          • Facebook
          • Flickr
      • Expert System
          • Medical diagnosis
          • Personalized medicine
      • Security
          • Fraud detection
          • Network Security
      • Music recognition
          • Shazam
          • SoundHound
  • 26. Big Data, Open-source Big Data Technologies, Hadoop, Spark Core, Spark SQL, Spark Streaming, Machine Learning, GraphX
  • 29. File Formats
      • Text
          • CSV
          • JSON
          • XML
      • Binary
          • Sequence File
          • Avro
          • Parquet
          • Optimized Row Columnar (ORC)
  • 30. Distributed SQL Query Engine
      • Hive
      • Impala
      • Presto
      • Drill
      • Phoenix
      • HAWQ
      • Tajo
      • Spark SQL
      • Data Warehouse, Distributed Storage, Distributed Query Engine
  • 34. Messaging Systems
      • Kafka
      • RabbitMQ
      • ActiveMQ
      • ZeroMQ
  • 35. General-purpose Data Processing Frameworks
      • Batch
          • Hadoop MapReduce
          • HPCC
      • Stream
          • Storm
          • Samza
          • Kafka Streams
      • Batch and Stream
          • Spark
          • Flink
          • Beam
          • Apex
          • Ignite
  • 36. Big Data, Open-source Big Data Technologies, Hadoop, Spark Core, Spark SQL, Spark Streaming, Machine Learning, GraphX
  • 37. Hadoop
  • 43. Hadoop is Not a Single Product
  • 44. Hadoop Core Components
  • 45. Big Data, Open-source Big Data Technologies, Hadoop, Spark Core, Spark SQL, Spark Streaming, Machine Learning, GraphX
  • 46. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 48 • • • 48
  • 47. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 49 • • • • • • • 49
  • 48. Spark • Fast, easy-to-use, general-purpose cluster computing framework for processing large datasets using a simple programming model
  • 49. Benefits • Scale • Fault-tolerance • Abstracts distributed computing • Hides the messy details of writing distributed applications • Allows developers to just focus on the data processing logic • Same code works on a laptop or a cluster of servers • Ease-of-use • Speed • Flexibility
  • 50. Easy To Use • Library with an expressive API • Scala, Java, Python, R • 80+ operators • Interactive development • spark-shell • notebooks
  • 51. Integrated Libraries for a Variety of Data Processing Tasks • Batch processing • Interactive analytics • Stream analysis • Machine learning • Graph analytics [Diagram labels: Spark Core, Spark SQL, GraphX, Spark Streaming, MLlib]
  • 52. Unified Platform for a Variety of Data Processing • Solve a variety of problems with a single toolkit • No need to learn different tools for each use case • Avoid code and data duplication • Achieve operational simplicity
  • 53. Why is Spark Fast • Advanced job execution engine • Allows applications to cache data in memory
  • 54. Advanced Job Execution Engine • Directed Acyclic Graph (DAG) of stages • simple job can contain just one stage • complex job can contain many stages • eliminates expensive operations between multiple jobs • synchronization • serialization/deserialization • disk I/O • Lazy operator evaluation • Pipelined operations
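The "lazy operator evaluation" and "pipelined operations" bullets can be illustrated with a minimal plain-Python sketch. This is not Spark's API: the `LazyPipeline` class and its `map`, `filter`, and `collect` methods are hypothetical names chosen for illustration, and Python generators stand in for Spark's execution engine. The sketch shows that no work happens until an action (`collect`) is called, and that each element then flows through all operators in a single pass rather than one full pass per operator.

```python
from typing import Callable, Iterable, Iterator, List


class LazyPipeline:
    """Toy model of lazily evaluated, pipelined operators (not Spark's API)."""

    def __init__(self, source: Iterable[int], log: List[str]) -> None:
        self._source = source
        self._log = log  # records the order in which work actually happens

    def map(self, fn: Callable[[int], int]) -> "LazyPipeline":
        source, log = self._source, self._log

        def gen() -> Iterator[int]:
            for x in source:
                log.append(f"map({x})")
                yield fn(x)

        # Calling gen() only creates the generator; no element is processed yet.
        return LazyPipeline(gen(), log)

    def filter(self, pred: Callable[[int], bool]) -> "LazyPipeline":
        source, log = self._source, self._log

        def gen() -> Iterator[int]:
            for x in source:
                log.append(f"filter({x})")
                if pred(x):
                    yield x

        return LazyPipeline(gen(), log)

    def collect(self) -> List[int]:
        # The "action": pulling results forces the whole chain to run.
        return list(self._source)


log: List[str] = []
pipeline = LazyPipeline([1, 2, 3], log).map(lambda x: x * 10).filter(lambda x: x > 10)
log_before_action = list(log)  # snapshot taken before any action is called
result = pipeline.collect()

print(log_before_action)  # no operator has run yet
print(result)
print(log)  # map and filter calls are interleaved per element
```

The interleaved log is the point of the sketch: each element passes through `map` and then `filter` before the next element is read, so no intermediate collection is materialized between the two operators.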
  • 55. Why Caching Makes Applications Run Faster [Figure: throughput comparison; labels: 100 MB/s, 500 MB/s, 10 GB/s]
  • 56. Read Latency Comparison [Bar chart: time in minutes (axis 0 to 200) to read 1 TB of data from HDD, SSD, and RAM]
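The read-time comparison can be reproduced with simple arithmetic. The sketch below assumes that the three throughput figures on the caching slide (100 MB/s, 500 MB/s, 10 GB/s) correspond to HDD, SSD, and RAM in that order, that units are decimal (1 TB = 10^12 bytes), and that the read is purely sequential with no other overhead. These are assumptions, not statements from the slides.

```python
MB = 10**6
GB = 10**9
TB = 10**12


def minutes_to_read(size_bytes: float, throughput_bytes_per_s: float) -> float:
    """Time in minutes to read size_bytes at a constant sequential throughput."""
    return size_bytes / throughput_bytes_per_s / 60.0


# Assumed mapping of throughput figures to storage media.
throughputs = {"HDD": 100 * MB, "SSD": 500 * MB, "RAM": 10 * GB}

read_minutes = {medium: minutes_to_read(1 * TB, rate) for medium, rate in throughputs.items()}

for medium, minutes in read_minutes.items():
    print(f"{medium}: {minutes:.1f} min")
```

Under these assumptions, reading 1 TB takes roughly 167 minutes from HDD, 33 minutes from SSD, and under 2 minutes from RAM, which fits the 0 to 200 minute axis of the chart.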
  • 57. Primary Data Abstractions • Resilient Distributed Dataset (RDD) • Dataset
  • 58. Resilient Distributed Dataset (RDD) • Represents data as an immutable partitioned collection of elements, which can be processed in parallel • Provides the methods / operations for processing data • Conceptually similar to a Scala collection, except for two things: • represents a partitioned dataset, which may be distributed over a cluster • operations are lazily evaluated
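The RDD properties listed above (immutable, partitioned, processed in parallel, lazily evaluated) can be modelled in a few lines of plain Python. This is a conceptual sketch only: `ToyRDD` and its methods are hypothetical, a thread pool stands in for a cluster of executors, and fault tolerance is not modelled.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Sequence, Tuple


class ToyRDD:
    """Immutable, partitioned, lazily evaluated collection (illustrative only)."""

    def __init__(self, partitions: Sequence[Sequence[Any]],
                 ops: Tuple[Callable[[Any], Any], ...] = ()) -> None:
        self._partitions = tuple(tuple(p) for p in partitions)  # never mutated
        self._ops = ops  # recorded transformations, not yet applied

    @classmethod
    def parallelize(cls, data: Sequence[Any], num_partitions: int) -> "ToyRDD":
        # Split into contiguous chunks (at most num_partitions of them).
        size = max(1, -(-len(data) // num_partitions))  # ceiling division
        return cls([data[i:i + size] for i in range(0, len(data), size)])

    def map(self, fn: Callable[[Any], Any]) -> "ToyRDD":
        # A transformation returns a new object; the original is unchanged.
        return ToyRDD(self._partitions, self._ops + (fn,))

    def _run_partition(self, part: Tuple[Any, ...]) -> List[Any]:
        out = []
        for x in part:
            for fn in self._ops:
                x = fn(x)
            out.append(x)
        return out

    def collect(self) -> List[Any]:
        # The action: each partition is processed independently, in parallel.
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(self._run_partition, self._partitions))
        return [x for part in results for x in part]


numbers = ToyRDD.parallelize([1, 2, 3, 4, 5, 6], num_partitions=3)
squares = numbers.map(lambda x: x * x)  # nothing is computed here

print(squares.collect())
print(numbers.collect())  # the source collection is unaffected
```

Because every transformation returns a new object that only records what to do, the original collection stays usable and the recorded chain of operations can be replayed on any partition.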
  • 59. Dataset • Introduced in Spark 1.6.0 • Represents data as an immutable, partitioned, strongly-typed collection of elements, which can be processed in parallel • Benefits from Spark SQL optimizations • performance optimizations • memory optimizations (cached dataset uses 5x less space) • Provides compile-time type-safety similar to the RDD API
  • 60. Dataset Uses a Specialized Encoder • Highly optimized • Runtime code generation for serialization and deserialization • Significantly faster than Java or Kryo serialization • Encoded data can be significantly smaller (up to 2x) • Serialized data is in the Tungsten binary format • many operations can be done in-place without deserializing • Built-in support for automatically generating encoders for • primitive types (e.g. String, Integer, Long) • Scala case classes • Java Beans
  • 61. Relationship Between Dataset and DataFrame • DataFrame = Dataset[Row] • Dataset is a partitioned collection of strongly-typed elements • DataFrame is a partitioned collection of generic Row objects • DataFrame can be transformed into a Dataset by calling df.as[ElementType] • Dataset can be transformed into a DataFrame by calling ds.toDF()
  • 62. High-level Architecture [Diagram labels: Driver Program, Master, and three Workers, each with an Executor, Task, and Cache]
  • 63. Ideal Applications • Complex data processing • multi-step pipeline • Iterative algorithm • Machine Learning • Graph analytics • Ad hoc analysis • Interactive
  • 64. Spark Does Not Provide Storage • Works with a variety of data sources • No need to import data into Spark • Scale compute and storage clusters independently
  • 65. Process Data From a Variety of Data Sources And Many More
  • 66. Spark Does Not Replace Hadoop
  • 67. Hadoop is Optional
  • 69. Spark SQL • Spark library for processing structured data • Provides more than just a SQL interface to Spark • Increases developer productivity • Makes applications run faster • Uniform API for processing data from a variety of sources [Diagram labels: Spark Core, Spark SQL]
  • 70. Increases Productivity with Higher-level API • SQL • Declarative language • HiveQL • SQL-like • Hive not required • Language integrated queries • DataFrame • Scala, Java, Python and R
  • 71. Makes Applications Run Faster • Query optimization • Python, Scala, Java, and SQL code all use the same optimizer • Reduced disk I/O • skip partitions • skip columns (columnar storage) • skip rows (using statistical information) • in-memory columnar caching • Predicate pushdown • push operations to a database / data source • Code generation • generate Java bytecode
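The idea of skipping data by using statistical information can be sketched in plain Python. The sketch assumes a hypothetical storage layout in which each chunk of rows carries the minimum and maximum of one column; `Partition`, `make_partition`, and `scan_greater_than` are illustrative names, not part of Spark SQL or any file format library.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Partition:
    """A chunk of values plus min/max statistics (hypothetical layout)."""
    rows: Tuple[int, ...]
    min_value: int
    max_value: int


def make_partition(values: List[int]) -> Partition:
    return Partition(tuple(values), min(values), max(values))


def scan_greater_than(partitions: List[Partition], threshold: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of partitions actually read."""
    matches: List[int] = []
    partitions_read = 0
    for part in partitions:
        if part.max_value <= threshold:
            continue  # statistics prove no value can match, so skip the read
        partitions_read += 1
        matches.extend(v for v in part.rows if v > threshold)
    return matches, partitions_read


partitions = [
    make_partition([1, 5, 9]),
    make_partition([10, 14, 18]),
    make_partition([20, 25, 30]),
]
matches, partitions_read = scan_greater_than(partitions, threshold=15)

print(matches)
print(f"read {partitions_read} of {len(partitions)} partitions")
```

Only two of the three chunks are read in this example; the first is eliminated from its statistics alone, which is where the reduction in disk I/O comes from.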
  • 72. Uniform Data Access [Diagram label: Spark SQL]
  • 73. Primary Data Abstraction is DataFrame • Represents a partitioned collection organized into named columns • Conceptually equivalent to a table in a relational database • Inspired by DataFrames in Python / R • Provides methods / operations for language integrated queries in Scala, Java, Python, and R • Constructed from structured data files, Hive tables, RDBMS, NoSQL datastores, or RDDs
  • 74. Distributed SQL query engine • JDBC / ODBC support • BI and visualization app • Tableau, Zoomdata, Qlik • Run SQL / HiveQL queries • No Scala, Java, R, or Python required [Diagram labels: BI / Visualization App, JDBC / ODBC, Thrift JDBC/ODBC server, Spark SQL, HDFS / S3, NoSQL, RDBMS]
  • 76. Spark Streaming • Extends Spark for data stream processing • Higher-level operators for stream analytics • Integrates with other Spark libraries • Spark SQL • MLlib • GraphX • High throughput [Diagram labels: Spark Core, Spark Streaming]
  • 77. High-level Architecture • Data stream split into micro-batches • Batch interval specified by application • Stream processed as a sequence of micro-batch jobs • Each batch is processed as an RDD by Spark core • Results of RDD operations are returned in batches [Diagram labels: Input stream, Spark Streaming, Batches of x seconds, Spark Core, Result stream]
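The micro-batch model described above can be sketched in plain Python. This is a conceptual model rather than Spark Streaming code: the `micro_batches` and `process_batch` functions are hypothetical, events are assumed to be `(arrival time in seconds, payload)` pairs with non-negative times, and a simple count stands in for the per-batch job.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, payload)


def micro_batches(events: Iterable[Event], batch_interval: float) -> List[List[str]]:
    """Group events into consecutive batches of batch_interval seconds."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for timestamp, payload in events:
        buckets[int(timestamp // batch_interval)].append(payload)
    if not buckets:
        return []
    # Intervals with no events still produce an (empty) batch.
    return [buckets.get(i, []) for i in range(max(buckets) + 1)]


def process_batch(batch: List[str]) -> int:
    """Stand-in for the job run on each batch: here, a simple count."""
    return len(batch)


events = [(0.5, "a"), (1.2, "b"), (2.1, "c"), (2.9, "d"), (7.0, "e")]
batches = micro_batches(events, batch_interval=2.0)
counts = [process_batch(batch) for batch in batches]

print(batches)
print(counts)
```

Each batch is an ordinary finite collection, which is why the same batch-oriented operators can be reused for stream processing: the stream becomes a sequence of small jobs, one per interval.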
  • 78. Primary Abstraction is DStream • Represents a data stream • Implemented as a sequence of RDDs • Provides operators for processing streams • Basic operators similar to RDD operators • Window operations • Stateful stream processing
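A window operation combines the results of several consecutive batches. The plain-Python sketch below is an assumption-laden illustration, not the DStream API: `windowed_counts` is a hypothetical function, and the window length and slide interval are expressed as a number of batches rather than as durations.

```python
from typing import List, Sequence


def windowed_counts(batch_counts: Sequence[int], window_length: int,
                    slide_interval: int) -> List[int]:
    """Sum per-batch counts over a sliding window.

    A result is emitted every slide_interval batches and covers the most
    recent window_length batches (fewer at the start of the stream).
    """
    results = []
    for end in range(slide_interval, len(batch_counts) + 1, slide_interval):
        start = max(0, end - window_length)
        results.append(sum(batch_counts[start:end]))
    return results


per_batch_counts = [2, 2, 0, 1, 3, 4]  # one count per micro-batch
window_totals = windowed_counts(per_batch_counts, window_length=3, slide_interval=2)

print(window_totals)
```

With a window of three batches sliding by two, consecutive windows overlap by one batch, so some batches contribute to more than one result.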
  • 79. Spark Streaming Sources And Destinations
  • 81. Machine Learning Applications • Regression (home price valuation) • Classification (image classification) • Clustering (customer segmentation) • Anomaly detection (fraud) • Recommendation (movies) • Dimensionality reduction
  • 82. What is Machine Learning • Program software to learn from data • Logic is not explicitly programmed • Infer patterns in a dataset • Generalize beyond the training set • Mathematics provides the foundation
  • 83. Broad Categorization of ML Algorithms • Supervised • Labeled dataset • Examples: Linear Regression, SVM, Neural Networks • Unsupervised • Unlabeled dataset • Example: K-Means
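The difference between the two categories can be made concrete with a small plain-Python sketch that does not use MLlib. It assumes one-dimensional data and uses two hypothetical helper functions: `fit_line` learns from labeled pairs (supervised, ordinary least squares), while `k_means_1d` groups unlabeled points around centers (unsupervised), starting from initial centers chosen by hand.

```python
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Supervised: fit y = slope * x + intercept to labeled pairs (least squares)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def k_means_1d(points: Sequence[float], initial_centers: Sequence[float],
               iterations: int = 10) -> List[float]:
    """Unsupervised: group unlabeled points around k centers."""
    centers = list(initial_centers)
    for _ in range(iterations):
        clusters: List[List[float]] = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers


# Labeled dataset: every input x comes with a known output y.
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])

# Unlabeled dataset: only the points themselves are given.
centers = k_means_1d([1.0, 1.5, 2.0, 10.0, 10.5, 11.0], initial_centers=[0.0, 5.0])

print(slope, intercept)
print(centers)
```

The supervised function needs the `ys` labels to measure its error, whereas the unsupervised function receives only the points and discovers the two groups on its own.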
  • 84. MLlib: Machine Learning with Spark [Diagram labels: Spark Core, spark.mllib, spark.ml]
  • 85. spark.mllib • Extends Spark for ML and statistical analysis • Provides higher-level API than Spark Core for ML • Dataset represented by an RDD
  • 86. spark.ml • Another machine learning library that runs on top of Spark • Less mature than spark.mllib • Dataset represented by a DataFrame • Higher-level abstraction • Feature engineering • Model training, tuning and evaluation • Makes it easy to create ML pipelines / workflows
  • 87. MLlib Supports Machine Learning Workflow • Statistical analysis • Feature extraction • Feature transformation • Dimensionality reduction • Machine learning algorithms • Model tuning • Model evaluation
  • 88. Statistical Analysis • Summary statistics • Correlations • Stratified sampling • Hypothesis testing • Random data generation
  • 89. Feature Engineering • Feature Extraction and Transformation • TF-IDF • Word2Vec • StandardScaler • Normalizer • ChiSqSelector • ElementwiseProduct • PCA • Dimensionality Reduction • Singular value decomposition (SVD) • Principal component analysis (PCA)
  • 90. Machine Learning Algorithms • Regression and Classification • Linear regression, Isotonic regression • Logistic regression, SVM • Naive Bayes • Random Forest, Gradient-Boosted Trees • Multilayer perceptron (Feedforward NN) • Recommendation • Collaborative Filtering (ALS) • Clustering • K-means • Gaussian mixture • Power iteration clustering (PIC) • Latent Dirichlet allocation (LDA) • Bisecting k-means • Streaming k-means • Frequent pattern mining • FP-growth • Association rules • PrefixSpan
  • 92. Graph Intro • Composed of vertices and edges • Graphs are ubiquitous • Web pages • Social networks • Transportation hubs • Graph theory provides efficient algorithms for graph-oriented data
  • 93. GraphX • Extends Spark for analyzing large-scale graph-oriented data • Provides higher-level abstractions than Spark core for graph analytics • Unifies collections and graphs as first-class composable objects [Diagram labels: Spark Core, GraphX]
  • 94. Primary Data Abstraction is Graph • Fundamental graph operators • filter • reverse • subgraph • joinVertices • mapVertices • mapEdges • aggregateMessages • Advanced graph operators • pregel • pageRank • connectedComponents • stronglyConnectedComponents • triangleCount
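To give a feel for what an operator such as connectedComponents computes, here is a minimal single-machine sketch in plain Python. It is not GraphX code: the `connected_components` function is hypothetical, vertices are assumed to be integer IDs, edges are treated as undirected, and each component is labelled with its lowest vertex ID. Labels are propagated along edges until nothing changes, loosely in the spirit of an iterative, message-passing computation.

```python
from typing import Dict, Iterable, Tuple


def connected_components(vertices: Iterable[int],
                         edges: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Label every vertex with the lowest vertex ID in its component."""
    labels = {v: v for v in vertices}  # each vertex starts as its own component
    edge_list = list(edges)
    changed = True
    while changed:
        changed = False
        for src, dst in edge_list:
            smallest = min(labels[src], labels[dst])
            # Propagate the smaller label across the edge in both directions.
            if labels[src] != smallest:
                labels[src] = smallest
                changed = True
            if labels[dst] != smallest:
                labels[dst] = smallest
                changed = True
    return labels


vertices = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (4, 5)]
components = connected_components(vertices, edges)

print(components)
```

Vertices 1, 2, and 3 end up sharing one label, 4 and 5 another, and the isolated vertex 6 keeps its own, so the number of distinct labels equals the number of components.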