Building End to End Streaming Application on Spark
Streaming application development journey
https://github.com/Shasidhar/sensoranalytics
● Shashidhar E S
● Big data consultant and trainer at
datamantra.io
● www.shashidhare.com
Agenda
● Problem Statement
● Spark streaming
● Stage 1 : File Streams
● Stage 2 : Kafka as input source (Introduction to Kafka)
● Stage 3 : Cassandra as Output Store (Introduction to Cassandra)
● Stage 4 : Flume as data collection engine (Introduction to Flume)
● How to test streaming code?
● Next steps
Earlier System
Business model
● Providers of Wi-Fi hot spot devices in public spaces
● Ability to collect data from these devices and analyse it
Existing System
● Collect data and process in daily batches to generate the
required results
Existing System
[Diagram components: four servers, central directory, Splunk, downstream systems]
Need for real time engine
● Lots of failures in user login
● Need to analyse why user logins are dropping
● Need to analyse the data in real time rather than in daily batches
● As the company grew, the Splunk-based setup was not scaling, as it was not meant for horizontal scaling
New system requirement
● Ability to collect and process large amounts of data
● Ability to store results in persistent storage
● A reporting mechanism to view the insights obtained from the analysis
● Need to see the results in real time
● In simple terms, a real-time monitoring system
Why Spark Streaming?
● Easy to port a batch system to the streaming engine in Spark
● Spark Streaming handles large amounts of data with high throughput
● Well suited to near-real-time systems
● Future needs
○ Ability to ingest data from many sources
○ Good support for downstream stores such as NoSQL databases
○ And a lot more
Spark Streaming Architecture
[Diagram: Server → Source directory → Spark Streaming engine → Output directory → View in Zeppelin]
Data format
Log Data with the following format
● Timestamp
● Country
● State
● City
● SensorStatus
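The slide lists the fields but not the wire format. As a minimal sketch, assuming each log line is comma-separated and the timestamp is ISO 8601 (both assumptions, as are the names `SensorRecord` fields' types and `parse_record`), a record could be parsed like this:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SensorRecord:
    """One log line. Field names follow the slide; the types are assumed."""
    timestamp: datetime
    country: str
    state: str
    city: str
    sensor_status: str


def parse_record(line: str) -> SensorRecord:
    """Parse a comma-separated log line (hypothetical wire format)."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    ts, country, state, city, status = parts
    return SensorRecord(datetime.fromisoformat(ts), country, state, city, status)


record = parse_record("2016-03-01T10:15:00,India,Karnataka,Bangalore,Active")
print(record.city, record.timestamp.hour)
```

Rejecting malformed lines early keeps bad input out of the aggregations that follow.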
Required Results
● Country Wise Stats
○ Hourly, weekly and monthly views of the total count of records captured, country-wise.
● State Wise Stats
○ Hourly, weekly and monthly views of the total count of records captured, state-wise.
● City Wise Stats
○ Hourly, weekly and monthly views of the total count of records captured, city-wise, with respect to sensor status
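The required statistics are all counts keyed by a time bucket plus a location. The following is a plain-Python sketch of that shape, not the Spark job from the repository; the sample records, the bucket label formats and the helper names are all made up for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample records: (timestamp, country, state, city, sensor_status)
records = [
    (datetime(2016, 3, 1, 10, 5), "India", "Karnataka", "Bangalore", "Active"),
    (datetime(2016, 3, 1, 10, 40), "India", "Karnataka", "Mysore", "Inactive"),
    (datetime(2016, 3, 1, 11, 2), "India", "Kerala", "Kochi", "Active"),
    (datetime(2016, 3, 8, 9, 30), "USA", "Texas", "Austin", "Active"),
]


def hour_bucket(ts):
    return ts.strftime("%Y-%m-%d %H:00")


def week_bucket(ts):
    iso = ts.isocalendar()  # (ISO year, ISO week number, ISO weekday)
    return f"{iso[0]}-W{iso[1]:02d}"


def month_bucket(ts):
    return ts.strftime("%Y-%m")


def count_by(rows, bucket, key):
    """Count rows per (time bucket, key) pair."""
    return Counter((bucket(r[0]), key(r)) for r in rows)


country_hourly = count_by(records, hour_bucket, lambda r: r[1])
state_monthly = count_by(records, month_bucket, lambda r: r[2])
# City stats are additionally broken down by sensor status.
city_status_weekly = count_by(records, week_bucket, lambda r: (r[3], r[4]))

for key, count in sorted(country_hourly.items()):
    print(key, count)
```

Keeping the bucket function and the key function separate means the nine views (three granularities times three location levels) come from one helper.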
Data Analytics - Phase 1
● Receive data from servers
● Store the input data into files
● Use file as input and output
● Process the data, generate required statistics
● Store results into output files
[Diagram: Input files (directory) → Spark Streaming engine → Output files (directory)]
Spark streaming introduction
Spark Streaming is an extension of the core Spark API that enables scalable,
high-throughput, fault-tolerant stream processing of live data streams
Micro batch
● Spark Streaming is a fast batch processing system
● It collects stream data into small batches and runs batch processing on each
● The batch interval can be as small as 1 s or as large as multiple hours
● Spark job creation and execution overhead is low enough that a batch can complete in under a second
● The continuous sequence of these batches is called a DStream (discretized stream); each batch is an RDD
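To make the micro-batch idea concrete, here is a toy sketch in plain Python. It is not Spark code: the function name `micro_batches` is hypothetical, and it buckets by a timestamp carried on each event, whereas Spark Streaming batches by arrival time and also emits empty batches for quiet intervals.

```python
from collections import defaultdict


def micro_batches(events, batch_interval_ms):
    """Group (time_ms, payload) pairs into fixed-interval batches.

    Returns a time-ordered list of (batch_start_ms, [payloads]), a toy
    analogue of a DStream: a sequence of small batches.
    """
    batches = defaultdict(list)
    for time_ms, payload in events:
        batch_start = (time_ms // batch_interval_ms) * batch_interval_ms
        batches[batch_start].append(payload)
    return sorted(batches.items())


events = [(200, "a"), (900, "b"), (1100, "c"), (3500, "d")]
for start, payloads in micro_batches(events, batch_interval_ms=1000):
    # The "batch job" run on each micro batch is just a count here.
    print(start, len(payloads))
```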
Apache Zeppelin
● Web based notebook that allows interactive data analysis
● It allows
○ Data ingestion
○ Data Discovery
○ Data Analytics
○ Data Visualization and collaboration
● Built-in Spark integration
Data Model
● 4 models
○ SensorRecord - To read input records
○ CountryWiseStats - Store country wise aggregations
○ StateWiseStats - Store state wise aggregations
○ CityWiseStats - Store city wise aggregations
Phase 1 - Hands On
Git branch : Master
Problems with Phase 1
● Input and output are files
● Cannot detect new records as and when they are received
● Files add latency to the system
Solution : Replace the input file source with Apache Kafka
Data Analytics - Phase 2
● Receive data from servers
● Store the input data in Kafka
● Use Kafka as input
● Process the data, generate required statistics
● Store results into output files
[Diagram: Kafka → Spark Streaming engine → Output files (directory)]
Apache Kafka
● High-throughput, publish-subscribe messaging system
● Distributed, partitioned and replicated commit log
● Messages are persisted in the system, organised into topics
● Uses ZooKeeper for cluster management
● Written in Scala, with client APIs for many languages: Java, Ruby, Python etc.
● Originally developed at LinkedIn
High Level Architecture
Terminology
● Topics: named feeds in which messages are stored and partitioned
● Producers: processes that publish messages to a topic
● Consumers: processes that subscribe to a topic and read messages
● Brokers: the servers that make up a Kafka cluster
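The terms above can be illustrated with a toy in-memory model. This is a sketch for intuition only and does not use any Kafka client: the classes `Topic` and `Consumer` are hypothetical, brokers and replication are not modelled, and the key-to-partition rule is a simplification of what real clients do.

```python
class Topic:
    """Toy in-memory topic: a fixed number of append-only partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Producer side: choose a partition from the key, return (partition, offset)."""
        # Real clients hash the serialized key; a byte sum keeps this deterministic.
        partition = sum(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(value)
        return partition, len(self.partitions[partition]) - 1


class Consumer:
    """Consumer side: remembers the next offset to read in each partition."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def poll(self):
        """Return all unread messages and advance the stored offsets."""
        messages = []
        for p, log in enumerate(self.topic.partitions):
            messages.extend(log[self.offsets[p]:])
            self.offsets[p] = len(log)
        return messages


topic = Topic("sensor-events", num_partitions=2)
topic.append("India", "record-1")
topic.append("India", "record-2")
topic.append("USA", "record-3")

consumer = Consumer(topic)
print(sorted(consumer.poll()))
print(consumer.poll())
```

Because messages stay in the log after being read, a second consumer starting at offset 0 would see all three records again.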
Anatomy of Kafka Topic
Spark Streaming - Kafka
● Two ways to fetch data from Kafka into Spark
○ Receiver approach
■ Data is stored in receivers
■ Kafka topic partitions do not correlate with RDD partitions
■ Enable the WAL (write-ahead log) for zero data loss
■ To increase input speed, create multiple receivers
Spark Streaming - Kafka (cont.)
○ Receiver-less (direct) approach
■ No data is stored in receivers
■ Spark RDDs keep exactly the same partitioning as the Kafka topic
■ No WAL is needed: the data is already in Kafka, so older data can be re-fetched after a failure
■ More Kafka partitions increase the data fetching speed
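The receiver-less approach can be pictured as planning one offset range per Kafka partition for every batch. The sketch below is plain Python under stated assumptions: `OffsetRange`, `plan_batch` and `replay` are illustrative stand-ins, not the Spark API, and the partition logs are ordinary lists.

```python
from collections import namedtuple

# Hypothetical stand-in for the per-partition range a direct stream plans per batch.
OffsetRange = namedtuple("OffsetRange", "topic partition from_offset until_offset")


def plan_batch(topic, last_read, latest):
    """Plan one offset range per Kafka partition.

    last_read and latest map partition id -> offset. Each returned range
    corresponds to one RDD partition, which is why the partitioning matches.
    """
    return [
        OffsetRange(topic, p, last_read.get(p, 0), latest[p])
        for p in sorted(latest)
    ]


def replay(log_by_partition, ranges):
    """Re-read exactly the planned ranges, for example after a failed batch."""
    return {
        r.partition: log_by_partition[r.partition][r.from_offset:r.until_offset]
        for r in ranges
    }


logs = {0: ["a", "b", "c", "d", "e"], 1: ["x", "y", "z"]}
ranges = plan_batch("sensor-events", last_read={0: 2}, latest={0: 5, 1: 3})
print(ranges)
print(replay(logs, ranges))
```

Since a batch is fully described by its ranges, recomputing it only needs the ranges and the data still held in Kafka, which is the reason no write-ahead log is required.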
Phase 2 - Hands On
Git branch : Kafka
Problems with Phase 2
● Output is still a file
● A full file scan is always needed to retrieve results; no lookups
● Querying results is cumbersome
● A NoSQL database is a better option
Solution : Replace the output file with Cassandra
Data Analytics - Phase 3
● Receive data from servers
● Store the input data in Kafka
● Use Kafka as input
● Process the data, generate required statistics
● Store results into Cassandra
[Diagram: Kafka → Spark Streaming engine → Cassandra]
What is Cassandra
“Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database”
“Daughter of Dynamo and Big Table”
Key Components and Features
● Distributed
● System keyspace
● Peer to peer - No SPOF
● Read and write to any node
● Operational simplicity
● Gossip and Failure Detection
Overall Architecture
[Diagram components: Cassandra daemon, cassandra CLI, language drivers, JDBC drivers, memtable, SSTables, commit log]
Spark Cassandra Connector
● Loads data from Cassandra into Spark and vice versa
● Handles type conversions
● Maps tables to Spark RDDs
● Supports all Cassandra data types, collections and UDTs
● Spark SQL support
● Supports Spark SQL predicate pushdown
Phase 3 - Hands On
Git branch : Cassandra
Problems with Phase 3
● Servers cannot push directly to Kafka
● Manual intervention is needed to push data
● Need an automated way to push data
Solution : Add Flume as a data collection agent
Data Analytics - Phase 4
● Receive data from servers
● Stream data into Kafka through Flume
● Store the input data in Kafka
● Use Kafka as input
● Process the data, generate required statistics
● Store results into Cassandra
[Diagram: Flume → Kafka → Spark Streaming engine → Cassandra]
Apache Flume
● Distributed data collection service
● Solution for data collection of all formats
● Initially designed to transfer log data into HDFS frequently
and reliably
● It is horizontally scalable
● Configurable routing
Flume Architecture
Components
○ Event
○ Source
○ Sink
○ Channel
○ Agent
Flume Configuration
● Define Source, Sink and Channel names
● Configure Source
● Configure Sink
● Configure Channel
● Bind Source and Sink to Channel
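The five configuration steps map onto lines of a Flume agent properties file. As a sketch, the helper below (a hypothetical `flume_config` function, not part of Flume) renders such a file. It assumes a spooling-directory source, a memory channel and the Kafka sink with the property names used from Flume 1.7 onwards; the deck does not say which source or channel types the project uses, and the agent, component, path and topic names are placeholders.

```python
def flume_config(agent, source, channel, sink, spool_dir, brokers, topic):
    """Render a Flume agent properties file following the five steps above."""
    lines = [
        # 1. Define source, sink and channel names
        f"{agent}.sources = {source}",
        f"{agent}.channels = {channel}",
        f"{agent}.sinks = {sink}",
        # 2. Configure the source (assumed: spooling directory source)
        f"{agent}.sources.{source}.type = spooldir",
        f"{agent}.sources.{source}.spoolDir = {spool_dir}",
        # 3. Configure the sink (Kafka sink, property names as in Flume 1.7+)
        f"{agent}.sinks.{sink}.type = org.apache.flume.sink.kafka.KafkaSink",
        f"{agent}.sinks.{sink}.kafka.bootstrap.servers = {brokers}",
        f"{agent}.sinks.{sink}.kafka.topic = {topic}",
        # 4. Configure the channel (assumed: in-memory channel)
        f"{agent}.channels.{channel}.type = memory",
        # 5. Bind source and sink to the channel
        f"{agent}.sources.{source}.channels = {channel}",
        f"{agent}.sinks.{sink}.channel = {channel}",
    ]
    return "\n".join(lines)


print(flume_config(
    agent="agent", source="r1", channel="c1", sink="k1",
    spool_dir="/var/log/sensors", brokers="localhost:9092",
    topic="sensor-events",
))
```

Note the asymmetry in the binding step: a source takes a `channels` list, while a sink reads from a single `channel`.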
Phase 4 - Hands On
Git branch : Flume
Data Analytics - Redesign
● Why do we want to redesign / restructure?
● What do we want to test?
● How to test streaming applications
● Hack a bit on Spark's ManualClock
● Use ScalaTest for unit testing
● Introduce abstractions to decouple the code
● Write some tests
Manual Clock
● A clock whose time can be set and modified
● Its reported time does not change as real time elapses
● Only callers have control over it
● Specially used for testing
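The idea carries over to any language. Below is a minimal Python sketch of a manual clock driving a batch timer deterministically; the classes `ManualClock` and `BatchTimer` here are illustrative and are not Spark's implementation, whose details the slides do not show.

```python
class ManualClock:
    """A clock whose time only moves when the caller moves it."""

    def __init__(self, start_ms=0):
        self._now_ms = start_ms

    def now_ms(self):
        return self._now_ms

    def set_time(self, time_ms):
        self._now_ms = time_ms

    def advance(self, delta_ms):
        self._now_ms += delta_ms


class BatchTimer:
    """Fires once per batch interval, reading time only from the injected clock."""

    def __init__(self, clock, interval_ms):
        self.clock = clock
        self.interval_ms = interval_ms
        self.next_fire_ms = clock.now_ms() + interval_ms

    def due_batches(self):
        """Return the batch times that became due since the last call."""
        fired = []
        while self.clock.now_ms() >= self.next_fire_ms:
            fired.append(self.next_fire_ms)
            self.next_fire_ms += self.interval_ms
        return fired


clock = ManualClock()
timer = BatchTimer(clock, interval_ms=1000)
print(timer.due_batches())   # nothing is due until the test moves the clock
clock.advance(2500)
print(timer.due_batches())
```

Because the timer never reads the wall clock, a test can step through several batch intervals instantly and assert on each batch's output without sleeping.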
Phase 5 - Hands On
Git branch : unittest
Next steps
● Use better serialization frameworks like Avro
● Enable Checkpointing
● Integrate kafka monitoring tools
● Add support for multiple Kafka topics
● Write more tests for all functionality
Ad

More Related Content

What's hot (20)

Interactive workflow management using Azkaban
Interactive workflow management using AzkabanInteractive workflow management using Azkaban
Interactive workflow management using Azkaban
datamantra
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
datamantra
 
Anatomy of in memory processing in Spark
Anatomy of in memory processing in SparkAnatomy of in memory processing in Spark
Anatomy of in memory processing in Spark
datamantra
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 API
datamantra
 
A Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache SparkA Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache Spark
datamantra
 
Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2
datamantra
 
Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1
datamantra
 
Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Tellius
datamantra
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafka
datamantra
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset API
datamantra
 
Improving Mobile Payments With Real time Spark
Improving Mobile Payments With Real time SparkImproving Mobile Payments With Real time Spark
Improving Mobile Payments With Real time Spark
datamantra
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
datamantra
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streaming
datamantra
 
Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0
datamantra
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
datamantra
 
Anatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source APIAnatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source API
datamantra
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0
datamantra
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
DB Tsai
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
datamantra
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
datamantra
 
Interactive workflow management using Azkaban
Interactive workflow management using AzkabanInteractive workflow management using Azkaban
Interactive workflow management using Azkaban
datamantra
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
datamantra
 
Anatomy of in memory processing in Spark
Anatomy of in memory processing in SparkAnatomy of in memory processing in Spark
Anatomy of in memory processing in Spark
datamantra
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 API
datamantra
 
A Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache SparkA Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache Spark
datamantra
 
Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2
datamantra
 
Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1
datamantra
 
Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Tellius
datamantra
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafka
datamantra
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset API
datamantra
 
Improving Mobile Payments With Real time Spark
Improving Mobile Payments With Real time SparkImproving Mobile Payments With Real time Spark
Improving Mobile Payments With Real time Spark
datamantra
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
datamantra
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streaming
datamantra
 
Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0
datamantra
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
datamantra
 
Anatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source APIAnatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source API
datamantra
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0
datamantra
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
DB Tsai
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
datamantra
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
datamantra
 

Viewers also liked (19)

Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scala
datamantra
 
Apache spark with Machine learning
Apache spark with Machine learningApache spark with Machine learning
Apache spark with Machine learning
datamantra
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalyst
datamantra
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
datamantra
 
Anatomy of file write in hadoop
Anatomy of file write in hadoopAnatomy of file write in hadoop
Anatomy of file write in hadoop
Rajesh Ananda Kumar
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
datamantra
 
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Spark Summit
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
Databricks
 
Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2
datamantra
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
datamantra
 
Machine learning pipeline with spark ml
Machine learning pipeline with spark mlMachine learning pipeline with spark ml
Machine learning pipeline with spark ml
datamantra
 
Spark+flume seattle
Spark+flume seattleSpark+flume seattle
Spark+flume seattle
Hari Shreedharan
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
Databricks
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
datamantra
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
Databricks
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
Pietro Michiardi
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)
Uri Laserson
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
IBM
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scala
datamantra
 
Apache spark with Machine learning
Apache spark with Machine learningApache spark with Machine learning
Apache spark with Machine learning
datamantra
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalyst
datamantra
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
datamantra
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
datamantra
 
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Spark Summit
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
Databricks
 
Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2
datamantra
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
datamantra
 
Machine learning pipeline with spark ml
Machine learning pipeline with spark mlMachine learning pipeline with spark ml
Machine learning pipeline with spark ml
datamantra
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
Databricks
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
datamantra
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
Databricks
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
Pietro Michiardi
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)
Uri Laserson
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
IBM
 
Ad

Similar to Building end to end streaming application on Spark (20)

Streamsets and spark at SF Hadoop User Group
Streamsets and spark at SF Hadoop User GroupStreamsets and spark at SF Hadoop User Group
Streamsets and spark at SF Hadoop User Group
Hari Shreedharan
 
Streamsets and spark
Streamsets and sparkStreamsets and spark
Streamsets and spark
Hari Shreedharan
 
Build real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache KafkaBuild real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache Kafka
Hotstar
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
Knoldus Inc.
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Apache Apex
 
Architectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingArchitectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark Streaming
Apache Apex
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
hadooparchbook
 
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
Databricks
 
Introduction to Apache Apex
Introduction to Apache ApexIntroduction to Apache Apex
Introduction to Apache Apex
Apache Apex
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Databricks
 
Structured Streaming in Spark
Structured Streaming in SparkStructured Streaming in Spark
Structured Streaming in Spark
Digital Vidya
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
Ricardo Bravo
 
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingIntro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Apache Apex
 
Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
Girish Khanzode
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
hadooparchbook
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
Spark Summit
 
Operational Analytics on Event Streams in Kafka
Operational Analytics on Event Streams in KafkaOperational Analytics on Event Streams in Kafka
Operational Analytics on Event Streams in Kafka
confluent
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
datamantra
 
Analytic Insights in Retail Using Apache Spark with Hari Shreedharan
Analytic Insights in Retail Using Apache Spark with Hari ShreedharanAnalytic Insights in Retail Using Apache Spark with Hari Shreedharan
Analytic Insights in Retail Using Apache Spark with Hari Shreedharan
Databricks
 
Streamsets and spark in Retail
Streamsets and spark in RetailStreamsets and spark in Retail
Streamsets and spark in Retail
Hari Shreedharan
 
Streamsets and spark at SF Hadoop User Group
Streamsets and spark at SF Hadoop User GroupStreamsets and spark at SF Hadoop User Group
Streamsets and spark at SF Hadoop User Group
Hari Shreedharan
 
Build real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache KafkaBuild real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache Kafka
Hotstar
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
Knoldus Inc.
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Apache Apex
 
Architectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingArchitectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark Streaming
Apache Apex
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
hadooparchbook
 
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
Databricks
 
Introduction to Apache Apex
Introduction to Apache ApexIntroduction to Apache Apex
Introduction to Apache Apex
Apache Apex
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Databricks
 
Structured Streaming in Spark
Structured Streaming in SparkStructured Streaming in Spark
Structured Streaming in Spark
Digital Vidya
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
Ricardo Bravo
 
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingIntro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Apache Apex
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
hadooparchbook
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
Spark Summit
 
Operational Analytics on Event Streams in Kafka
Operational Analytics on Event Streams in KafkaOperational Analytics on Event Streams in Kafka
Operational Analytics on Event Streams in Kafka
confluent
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
datamantra
 
Analytic Insights in Retail Using Apache Spark with Hari Shreedharan
Analytic Insights in Retail Using Apache Spark with Hari ShreedharanAnalytic Insights in Retail Using Apache Spark with Hari Shreedharan
Analytic Insights in Retail Using Apache Spark with Hari Shreedharan
Databricks
 
Streamsets and spark in Retail
Streamsets and spark in RetailStreamsets and spark in Retail
Streamsets and spark in Retail
Hari Shreedharan
 
Ad

Building end to end streaming application on Spark

  • 1. Building End to End Streaming Application on Spark Streaming application development journey https://github.com/Shasidhar/sensoranalytics
  • 2. ● Shashidhar E S ● Big data consultant and trainer at datamantra.io ● www.shashidhare.com
  • 3. Agenda ● Problem Statement ● Spark streaming ● Stage 1 : File Streams ● Stage 2 : Kafka as input source (Introduction to Kafka) ● Stage 3 : Cassandra as Output Store (Introduction to Cassandra) ● Stage 4 : Flume as data collection engine (Introduction to Flume) ● How to test streaming code? ● Next steps
  • 4. Earlier System Business model ● Providers of Wi-Fi hot spot devices in public spaces ● Ability to collect data from these devices and analyse Existing System ● Collect data and process in daily batches to generate the required results
  • 6. Need for real time engine ● Many failures in user login ● Need to analyse why there is a drop in user logins ● Ability to analyse the data in real time rather than in daily batches ● As the company grew, Splunk did not scale, since it is not meant for horizontal scaling
  • 7. New system requirement ● Able to collect and process large amounts of data ● Ability to store results in persistent storage ● A reporting mechanism to view the insights obtained from the analysis ● Need to see the results in real time ● In simple terms, a real time monitoring system
  • 8. Why Spark Streaming ? ● Easy to port a batch system to the streaming engine in Spark ● Spark Streaming can handle large amounts of data with high throughput ● A good choice for near real time systems ● Room for future needs ○ Ability to ingest data from many sources ○ Good support for downstream stores like NoSQL ○ And a lot more
  • 10. Data format Log Data with the following format ● Timestamp ● Country ● State ● City ● SensorStatus
  • 11. Required Results ● Country Wise Stats ○ Hourly, weekly and monthly view of the total count of records captured country wise ● State Wise Stats ○ Hourly, weekly and monthly view of the total count of records captured state wise ● City Wise Stats ○ Hourly, weekly and monthly view of the total count of records captured city wise with respect to sensor status
  • 12. Data Analytics - Phase 1 ● Receive data from servers ● Store the input data into files ● Use files as input and output ● Process the data, generate required statistics ● Store results into output files ● Pipeline: Input files (Directory) → Spark Streaming engine → Output files (Directory)
  • 13. Spark streaming introduction Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams
  • 14. Micro batch ● Spark Streaming is a fast batch processing system ● It collects stream data into small batches and runs batch processing on each one ● The batch interval can range from about 1 s to multiple hours ● Spark job creation and execution overhead is low enough that a batch can complete in under a second ● The continuous sequence of these batches (RDDs) is called a DStream
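The micro-batch idea on this slide can be illustrated without Spark. The following is a minimal plain-Python sketch, not the Spark Streaming API: the function name `micro_batches` and the sample events are hypothetical, and events are grouped by an event timestamp for simplicity, whereas Spark Streaming forms batches by arrival time and also emits empty batches.

```python
from collections import defaultdict


def micro_batches(events, batch_interval_s):
    """Group (timestamp_s, payload) events into fixed-width time batches.

    Returns a list of (batch_start_s, [payloads]) sorted by batch start.
    """
    batches = defaultdict(list)
    for ts, payload in events:
        # Integer division assigns each event to the interval containing it.
        batch_start = (ts // batch_interval_s) * batch_interval_s
        batches[batch_start].append(payload)
    return sorted(batches.items())


# Hypothetical sample stream: (seconds since start, payload).
events = [(0, "a"), (1, "b"), (4, "c"), (5, "d"), (11, "e")]
result = micro_batches(events, batch_interval_s=5)
for start, payloads in result:
    print(start, payloads)
```

Each returned group plays the role of one small batch; a batch job (a count, an aggregation) would then run once per group.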
  • 15. Apache Zeppelin ● Web based notebook that allows interactive data analysis ● It allows ○ Data ingestion ○ Data Discovery ○ Data Analytics ○ Data Visualization and collaboration ● Built-in Spark integration
  • 16. Data Model ● 4 models ○ SensorRecord - To read input records ○ CountryWiseStats - Store country wise aggregations ○ StateWiseStats - Store state wise aggregations ○ CityWiseStats - Store city wise aggregations
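As a rough illustration of the data model, the sketch below parses a log line into a `SensorRecord` and computes the hourly country-wise count. It is plain Python rather than the project's actual code, and several details are assumptions not stated in the slides: the comma delimiter, the epoch-millisecond timestamp, the field names, the helper names (`parse_record`, `hour_bucket`, `country_wise_hourly`) and the sample values.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SensorRecord:
    timestamp_ms: int   # assumed: epoch milliseconds
    country: str
    state: str
    city: str
    sensor_status: str


def parse_record(line):
    """Parse one log line; the comma delimiter is an assumption."""
    ts, country, state, city, status = [f.strip() for f in line.split(",")]
    return SensorRecord(int(ts), country, state, city, status)


def hour_bucket(timestamp_ms):
    """Truncate an epoch-millisecond timestamp to its UTC hour."""
    dt = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:00")


def country_wise_hourly(records):
    """Count records per (country, hour): one CountryWiseStats-style view."""
    return Counter((r.country, hour_bucket(r.timestamp_ms)) for r in records)


# Hypothetical sample input.
lines = [
    "1451606400000,India,Karnataka,Bangalore,UP",
    "1451607000000,India,Karnataka,Mysore,DOWN",
    "1451610000000,India,Kerala,Kochi,UP",
    "1451606400000,USA,California,San Jose,UP",
]
stats = country_wise_hourly(parse_record(l) for l in lines)
```

State-wise and city-wise views would follow the same pattern with a different grouping key, for example `(r.city, r.sensor_status, hour)` for the city-wise statistics.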
  • 17. Phase 1 - Hands On Git branch : Master
  • 18. Problems with Phase 1 ● Input and output are files ● Cannot detect new records / new data as and when they are received ● File-based input causes high latency in the system Solution : Replace the input file source with Apache Kafka
  • 19. Data Analytics - Phase 2 ● Receive data from servers ● Store the input data in Kafka ● Use Kafka as input ● Process the data, generate required statistics ● Store results into output files ● Pipeline: Kafka → Spark Streaming engine → Output files (Directory)
  • 20. Apache Kafka ● High throughput publish-subscribe messaging system ● Distributed, partitioned and replicated commit log ● Messages are persisted in the system, organised into topics ● Uses ZooKeeper for cluster management ● Written in Scala, but supports many client APIs - Java, Ruby, Python etc ● Originally developed at LinkedIn
  • 22. Terminology ● Topics : Where messages are maintained and partitioned ● Producers : Processes that publish messages to a topic ● Consumers : Processes that subscribe to a topic and read messages ● Brokers : Every server that is part of the Kafka cluster
  • 24. Spark Streaming - Kafka ● Two ways to fetch data from Kafka into Spark ○ Receiver approach ■ Data is stored in receivers ■ Kafka topic partitions do not correlate with RDD partitions ■ Enable the write-ahead log (WAL) for zero data loss ■ To increase input speed, create multiple receivers
  • 25. Spark Streaming - Kafka cont ○ Receiver-less (direct) approach ■ No data is stored in receivers ■ The same partitioning is maintained in Spark RDDs as in Kafka topics ■ No WAL is needed: the data is already in Kafka, so older data can be re-fetched after a crash ■ More Kafka partitions increase the data fetching speed
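The reason the receiver-less approach needs no WAL can be sketched with a toy model: if messages stay in a partitioned, append-only log, a batch is fully described by an offset range per partition and can be re-read after a failure. The code below is an in-memory stand-in written for illustration only; `ToyTopic` and its methods are hypothetical and are not the Kafka client or Spark API, and `zlib.crc32` merely stands in for a stable partitioning hash.

```python
import zlib


class ToyTopic:
    """In-memory stand-in for a partitioned, append-only topic."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        # A stable hash keeps records with the same key in the same partition.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, from_offset, until_offset):
        # Messages stay in the log, so any offset range can be re-read.
        return self.partitions[partition][from_offset:until_offset]

    def end_offsets(self):
        return [len(p) for p in self.partitions]


topic = ToyTopic(num_partitions=2)
for i in range(6):
    topic.send(key=f"sensor-{i % 3}", value=f"msg-{i}")

# One batch = one offset range per topic partition.
committed = [0, 0]
end = topic.end_offsets()
batch = [topic.read(p, committed[p], end[p]) for p in range(2)]

# Simulated crash before the offsets were committed:
# reading the same ranges again yields exactly the same data.
replay = [topic.read(p, committed[p], end[p]) for p in range(2)]
```

In this model each offset range corresponds to one partition of the batch, which mirrors the one-to-one mapping between Kafka partitions and RDD partitions described on the slide.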
  • 26. Phase 2 - Hands On Git branch : Kafka
  • 27. Problems with Phase 2 ● Output is still a file ● A full file scan is always needed to retrieve results, no lookups ● Querying results is cumbersome ● A NoSQL database is the better option Solution : Replace the output file with Cassandra
  • 28. Data Analytics - Phase 3 ● Receive data from servers ● Store the input data in Kafka ● Use Kafka as input ● Process the data, generate required statistics ● Store results into Cassandra ● Pipeline: Kafka → Spark Streaming engine → Cassandra
  • 29. What is Cassandra “Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tunable consistency, column-oriented database” “Daughter of Dynamo and Big Table”
  • 30. Key Components and Features ● Distributed ● System keyspace ● Peer to peer - No SPOF ● Read and write to any node ● Operational simplicity ● Gossip and Failure Detection
  • 32. Spark Cassandra Connector ● Loads data from Cassandra to Spark and vice versa ● Handles type conversions ● Maps tables to Spark RDDs ● Supports all Cassandra data types, collections and UDTs ● Spark SQL support ● Supports Spark SQL predicate pushdown
  • 33. Phase 3 - Hands On Git branch : Cassandra
  • 34. Problems with Phase 3 ● Servers cannot push directly to Kafka ● Manual intervention is needed to push data ● Need for an automated way to push data Solution : Add Flume as a data collection agent
  • 35. Data Analytics - Phase 4 ● Receive data from servers ● Stream data into Kafka through Flume ● Store the input data in Kafka ● Use Kafka as input ● Process the data, generate required statistics ● Store results into Cassandra ● Pipeline: Flume → Kafka → Spark Streaming engine → Cassandra
  • 36. Apache Flume ● Distributed data collection service ● Solution for data collection of all formats ● Initially designed to transfer log data into HDFS frequently and reliably ● It is horizontally scalable ● Configurable routing
  • 37. Flume Architecture Components ○ Event ○ Source ○ Sink ○ Channel ○ Agent
  • 38. Flume Configuration ● Define Source, Sink and Channel names ● Configure Source ● Configure Sink ● Configure Channel ● Bind Source and Sink to Channel
  • 39. Phase 4 - Hands On Git branch : Flume
  • 40. Data Analytics - Re Design ● Why do we want to redesign / restructure? ● What do we want to test? ● How to test streaming applications ● Hack a bit on Spark's ManualClock ● Use ScalaTest for unit testing ● Introduce abstractions to decouple the code ● Write some tests
  • 41. Manual Clock ● A clock whose time can be set and modified ● Its reported time does not change as real time elapses ● Only callers have control over it ● Used mainly for testing
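The value of a manual clock for testing time-based code can be shown with a small plain-Python sketch. This is not Spark's ManualClock class: both classes below, their method names and the `BatchCounter` component under test are hypothetical, chosen only to show how a caller-controlled clock makes batch boundaries deterministic without sleeping.

```python
class ManualClock:
    """A clock whose time only changes when the caller says so."""

    def __init__(self, start_ms=0):
        self._now_ms = start_ms

    def now_ms(self):
        return self._now_ms

    def advance(self, delta_ms):
        self._now_ms += delta_ms

    def set_time(self, time_ms):
        self._now_ms = time_ms


class BatchCounter:
    """Counts events and emits one total each time a batch interval elapses."""

    def __init__(self, clock, batch_interval_ms):
        self.clock = clock
        self.batch_interval_ms = batch_interval_ms
        self.batch_start_ms = clock.now_ms()
        self.pending = 0
        self.emitted = []

    def on_event(self):
        self._roll_if_due()
        self.pending += 1

    def flush(self):
        self._roll_if_due()

    def _roll_if_due(self):
        # Close every interval that has fully elapsed, including empty ones.
        while self.clock.now_ms() - self.batch_start_ms >= self.batch_interval_ms:
            self.emitted.append(self.pending)
            self.pending = 0
            self.batch_start_ms += self.batch_interval_ms


clock = ManualClock()
counter = BatchCounter(clock, batch_interval_ms=1000)
counter.on_event()
counter.on_event()
clock.advance(1000)   # first batch closes deterministically, no sleeping
counter.on_event()
clock.advance(2000)   # two more intervals elapse; the second one is empty
counter.flush()
print(counter.emitted)
```

Because the component takes the clock as a constructor argument, production code can pass a real system clock while tests pass the manual one, which is the kind of decoupling the redesign slide calls for.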
  • 42. Phase 5 - Hands On Git branch : unittest
  • 43. Next steps ● Use better serialization frameworks like Avro ● Enable checkpointing ● Integrate Kafka monitoring tools ● Add support for multiple Kafka topics ● Write more tests for all functionality