SlideShare a Scribd company logo
Real time Analytics 
with Apache Kafka and Spark 
October 2014 Meetup 
Organized by Big Data Hyderabad. 
http://www.meetup.com/Big-Data-Hyderabad/ 
Rahul Jain 
@rahuldausa
About Me 
• Big data/Search Consultant 
• 8+ years of learning experience. 
• Worked (got a chance) on High volume 
distributed applications. 
• Still a learner (and beginner)
Quick Questionnaire 
How many people know/work on Scala ? 
How many people know/work on Apache Kafka? 
How many people know/heard/are using Apache Spark ?
What we are going to 
learn/see today ? 
• Apache Zookeeper (Overview) 
• Apache Kafka – Hands-on 
• Apache Spark – Hands-on 
• Spark Streaming (Explore)
Apache Zookeeper TM
What is Zookeeper 
• An Open source, High Performance coordination 
service for distributed applications 
• Centralized service for 
– Configuration Management 
– Locks and Synchronization for providing coordination 
between distributed systems 
– Naming service (Registry) 
– Group Membership 
• Features 
– hierarchical namespace 
– provides watcher on a znode 
– allows to form a cluster of nodes 
• Supports a large volume of request for data retrieval 
and update 
• http://zookeeper.apache.org/ 
Source : http://zookeeper.apache.org
Zookeeper Use cases 
• Configuration Management 
• Cluster member nodes Bootstrapping configuration from a 
central source 
• Distributed Cluster Management 
• Node Join/Leave 
• Node Status in real time 
• Naming Service – e.g. DNS 
• Distributed Synchronization – locks, barriers 
• Leader election 
• Centralized and Highly reliable Registry
Zookeeper Data Model 
 Hierarchical Namespace 
 Each node is called “znode” 
 Each znode has data(stores data in 
byte[] array) and can have children 
 znode 
– Maintains “Stat” structure with 
version of data changes , ACL 
changes and timestamp 
– Version number increases with each 
changes
Let’s recall basic concepts of 
Messaging System
Point to Point Messaging 
(Queue) 
Credit: http://fusesource.com/docs/broker/5.3/getting_started/FuseMBStartedKeyJMS.html
Publish-Subscribe Messaging 
(Topic) 
Credit: http://fusesource.com/docs/broker/5.3/getting_started/FuseMBStartedKeyJMS.html
Apache Kafka
Overview 
• An apache project initially developed at LinkedIn 
• Distributed publish-subscribe messaging system 
• Designed for processing of real time activity stream data e.g. logs, 
metrics collections 
• Written in Scala 
• Does not follow JMS Standards, neither uses JMS APIs 
• Features 
– Persistent messaging 
– High-throughput 
– Supports both queue and topic semantics 
– Uses Zookeeper for forming a cluster of nodes 
(producer/consumer/broker) 
and many more… 
• http://kafka.apache.org/
How it works 
Credit : http://kafka.apache.org/design.html
Real time transfer 
Broker does not Push messages to Consumer, Consumer Polls messages from Broker. 
Consumer1 
(Group1) 
Consumer3 
(Group2) 
Kafka 
Broker 
Consumer4 
(Group2) 
Producer 
Zookeeper 
Consumer2 
(Group1) 
Update Consumed 
Message offset 
Queue 
Topology 
Topic 
Topology 
Kafka 
Broker
Performance Numbers 
Producer Performance Consumer Performance 
Credit : http://research.microsoft.com/en-us/UM/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
About Apache Spark 
• Initially started at UC Berkeley in 2009 
• Fast and general purpose cluster computing system 
• 10x (on disk) - 100x (In-Memory) faster 
• Most popular for running Iterative Machine Learning Algorithms. 
• Provides high level APIs in 
• Java 
• Scala 
• Python 
• Integration with Hadoop and its eco-system and can read existing data. 
• http://spark.apache.org/
So Why Spark ? 
• Most of Machine Learning Algorithms are iterative because each iteration 
can improve the results 
• With Disk based approach each iteration’s output is written to disk making 
it slow 
Hadoop execution flow 
Spark execution flow 
http://www.wiziq.com/blog/hype-around-apache-spark/
Spark Stack 
• Spark SQL 
– For SQL and unstructured data 
processing 
• MLib 
– Machine Learning Algorithms 
• GraphX 
– Graph Processing 
• Spark Streaming 
– stream processing of live data 
streams 
http://spark.apache.org
Execution Flow 
http://spark.apache.org/docs/latest/cluster-overview.html
Terminology 
• Application Jar 
– User Program and its dependencies except Hadoop & Spark Jars bundled into a 
Jar file 
• Driver Program 
– The process to start the execution (main() function) 
• Cluster Manager 
– An external service to manage resources on the cluster (standalone manager, 
YARN, Apache Mesos) 
• Deploy Mode 
– cluster : Driver inside the cluster 
– client : Driver outside of Cluster
Terminology (contd.) 
• Worker Node : Node that run the application program in cluster 
• Executor 
– Process launched on a worker node, that runs the Tasks 
– Keep data in memory or disk storage 
• Task : A unit of work that will be sent to executor 
• Job 
– Consists multiple tasks 
– Created based on a Action 
• Stage : Each Job is divided into smaller set of tasks called Stages that is sequential 
and depend on each other 
• SparkContext : 
– represents the connection to a Spark cluster, and can be used to create RDDs, 
accumulators and broadcast variables on that cluster.
Resilient Distributed Dataset (RDD) 
• Resilient Distributed Dataset (RDD) is a basic Abstraction in Spark 
• Immutable, Partitioned collection of elements that can be operated in parallel 
• Basic Operations 
– map 
– filter 
– persist 
• Multiple Implementation 
– PairRDDFunctions : RDD of Key-Value Pairs, groupByKey, Join 
– DoubleRDDFunctions : Operation related to double values 
– SequenceFileRDDFunctions : Operation related to SequenceFiles 
• RDD main characteristics: 
– A list of partitions 
– A function for computing each split 
– A list of dependencies on other RDDs 
– Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned) 
– Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file) 
• Custom RDD can be also implemented (by overriding functions)
Cluster Deployment 
• Standalone Deploy Mode 
– simplest way to deploy Spark on a private cluster 
• Amazon EC2 
– EC2 scripts are available 
– Very quick launching a new cluster 
• Apache Mesos 
• Hadoop YARN
Monitoring
Monitoring – Stages
Let’s start getting hands dirty
Kafka Installation 
• Download 
– http://kafka.apache.org/downloads.html 
• Untar it 
> tar -xzf kafka_<version>.tgz 
> cd kafka_<version>
Start Servers 
• Start the Zookeeper server 
> bin/zookeeper-server-start.sh config/zookeeper.properties 
Pre-requisite: Zookeeper should be up and running. 
• Now Start the Kafka Server 
> bin/kafka-server-start.sh config/server.properties
Create/List Topics 
• Create a topic 
> bin/kafka-topics.sh --create --zookeeper localhost:2181 -- 
replication-factor 1 --partitions 1 --topic test 
• List down all topics 
> bin/kafka-topics.sh --list --zookeeper localhost:2181 
Output: test
Producer 
• Send some Messages 
> bin/kafka-console-producer.sh --broker-list 
localhost:9092 --topic test 
Now type on console: 
This is a message 
This is another message
Consumer 
• Receive some Messages 
> bin/kafka-console-consumer.sh --zookeeper 
localhost:2181 --topic test --from-beginning 
This is a message 
This is another message
Multi-Broker Cluster 
• Copy configs 
> cp config/server.properties config/server-1.properties 
> cp config/server.properties config/server-2.properties 
• Changes in the config files. 
config/server-1.properties: 
broker.id=1 
port=9093 
log.dir=/tmp/kafka-logs-1 
config/server-2.properties: 
broker.id=2 
port=9094 
log.dir=/tmp/kafka-logs-2
Start with New Nodes 
• Start other Nodes with new configs 
> bin/kafka-server-start.sh config/server-1.properties & 
> bin/kafka-server-start.sh config/server-2.properties & 
• Create a new topic with replication factor as 3 
> bin/kafka-topics.sh --create --zookeeper 
localhost:2181 --replication-factor 3 --partitions 
1 --topic my-replicated-topic 
• List down the all topics 
> bin/kafka-topics.sh --describe --zookeeper 
localhost:2181 --topic my-replicated-topic 
Topic:my-replicated-topic PartitionCount:1 ReplicationFactor:3 Configs: 
Topic: my-replicated-topic Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0
Let’s move to 
Apache Spark
Spark Shell 
./bin/spark-shell --master local[2] 
The --master option specifies the master URL for a distributed cluster, or local to run 
locally with one thread, or local[N] to run locally with N threads. You should start by 
using local for testing. 
./bin/run-example SparkPi 10 
This will run 10 iterations to calculate the value of Pi
Basic operations… 
scala> val textFile = sc.textFile("README.md") 
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3 
scala> textFile.count() // Number of items in this RDD 
ees0: Long = 126 
scala> textFile.first() // First item in this RDD 
res1: String = # Apache Spark 
scala> val linesWithSpark = textFile.filter(line => 
line.contains("Spark")) 
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09 
Simplier - Single liner: 
scala> textFile.filter(line => line.contains("Spark")).count() 
// How many lines contain "Spark"? 
res3: Long = 15
Map - Reduce 
scala> textFile.map(line => line.split(" ").size).reduce((a, b) 
=> if (a > b) a else b) 
res4: Long = 15 
scala> import java.lang.Math 
scala> textFile.map(line => line.split(" ").size).reduce((a, b) 
=> Math.max(a, b)) 
res5: Int = 15 
scala> val wordCounts = textFile.flatMap(line => line.split(" 
")).map(word => (word, 1)).reduceByKey((a, b) => a + b) 
wordCounts: spark.RDD[(String, Int)] = 
spark.ShuffledAggregatedRDD@71f027b8 
wordCounts.collect()
With Caching… 
scala> linesWithSpark.cache() 
res7: spark.RDD[String] = spark.FilteredRDD@17e51082 
scala> linesWithSpark.count() 
res8: Long = 15 
scala> linesWithSpark.count() 
res9: Long = 15
With HDFS… 
val lines = spark.textFile(“hdfs://...”) 
val errors = lines.filter(line => line.startsWith(“ERROR”)) 
println(Total errors: + errors.count())
Standalone (Scala) 
/* SimpleApp.scala */ 
import org.apache.spark.SparkContext 
import org.apache.spark.SparkContext._ 
import org.apache.spark.SparkConf 
object SimpleApp { 
def main(args: Array[String]) { 
val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your 
system 
val conf = new SparkConf().setAppName("Simple Application") 
.setMaster(“local") 
val sc = new SparkContext(conf) 
val logData = sc.textFile(logFile, 2).cache() 
val numAs = logData.filter(line => line.contains("a")).count() 
val numBs = logData.filter(line => line.contains("b")).count() 
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs)) 
} 
}
Standalone (Java) 
/* SimpleApp.java */ 
import org.apache.spark.api.java.*; 
import org.apache.spark.SparkConf; 
import org.apache.spark.api.java.function.Function; 
public class SimpleApp { 
public static void main(String[] args) { 
String logFile = "YOUR_SPARK_HOME/README.md"; // Should be some file on your system 
SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local"); 
JavaSparkContext sc = new JavaSparkContext(conf); 
JavaRDD<String> logData = sc.textFile(logFile).cache(); 
long numAs = logData.filter(new Function<String, Boolean>() { 
public Boolean call(String s) { return s.contains("a"); } 
}).count(); 
long numBs = logData.filter(new Function<String, Boolean>() { 
public Boolean call(String s) { return s.contains("b"); } 
}).count(); 
System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs); 
} 
}
Job Submission 
$SPARK_HOME/bin/spark-submit  
--class "SimpleApp"  
--master local[4]  
target/scala-2.10/simple-project_2.10-1.0.jar
Configuration 
val conf = new SparkConf() 
.setMaster("local") 
.setAppName("CountingSheep") 
.set("spark.executor.memory", "1g") 
val sc = new SparkContext(conf)
Let’s discuss 
Real-time Clickstream Analytics
Analytics High Level Flow 
A clickstream is the recording of the parts 
of the screen a computer user clicks on 
while web browsing. 
http://blog.cloudera.com/blog/2014/03/letting-it-flow-with-spark-streaming/
Spark Streaming 
Makes it easy to build scalable fault-tolerant streaming 
applications. 
– Ease of Use 
– Fault Tolerance 
– Combine streaming with batch and interactive queries.
Support Multiple Input/Outputs 
Credit : http://spark.apache.org
Streaming Flow 
Credit : http://spark.apache.org
Spark Streaming Terminology 
• Spark Context 
– An object with configuration parameters 
• Discretized Steams (DStreams) 
– Stream from input sources or generated after transformation 
– Continuous series of RDDs 
• Input DSteams 
– Stream of raw data from input sources
Spark Context 
SparkConf sparkConf = new SparkConf().setAppName(”PageViewsCount"); 
sparkConf.setMaster("local[2]");//running locally with 2 threads 
// Create the context with a 5 second batch size 
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(5000));
Reading from Kafka Stream 
int numThreads = Integer.parseInt(args[3]); 
Map<String, Integer> topicMap = new HashMap<String, Integer>(); 
String[] topics = args[2].split(","); 
for (String topic: topics) { 
topicMap.put(topic, numThreads); 
} 
JavaPairReceiverInputDStream<String, String> messages = 
KafkaUtils.createStream(jssc, <zkQuorum>, <group> , topicMap);
Processing of Stream 
/** 
* Incoming: 
// (192.168.2.82,1412977327392,www.example.com,192.168.2.82) 
// 192.168.2.82: key : tuple2._1() 
// 1412977327392,www.example.com,192.168.2.82: value: tuple2._2() 
*/ 
// Method signature: messages.map(InputArgs, OutputArgs) 
JavaDStream<Tuple2<String, String>> events = messages.map(new Function<Tuple2<String, String>, 
Tuple2<String, String>>() { 
@Override 
public Tuple2<String, String> call(Tuple2<String, String> tuple2) { 
String[] parts = tuple2._2().split(","); 
return new Tuple2<>(parts[2], parts[1]); 
} 
}); 
JavaPairDStream<String, Long> pageCountsByUrl = events.map(new Function<Tuple2<String,String>, String>(){ 
@Override 
public String call(Tuple2<String, String> pageView) throws Exception { 
return pageView._2(); 
} 
}).countByValue(); 
System.out.println("Page counts by url:"); 
pageCountsByUrl.print();
Thanks! 
@rahuldausa on twitter and slideshare 
http://www.linkedin.com/in/rahuldausa 
Join us @ For Solr, Lucene, Elasticsearch, Machine Learning, IR 
http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/ 
http://www.meetup.com/DataAnalyticsGroup/ 
Join us @ For Hadoop, Spark, Cascading, Scala, NoSQL, Crawlers and all cutting edge technologies. 
http://www.meetup.com/Big-Data-Hyderabad/
Ad

More Related Content

What's hot (20)

Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 
HDFS: Optimization, Stabilization and Supportability
HDFS: Optimization, Stabilization and SupportabilityHDFS: Optimization, Stabilization and Supportability
HDFS: Optimization, Stabilization and Supportability
DataWorks Summit/Hadoop Summit
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
Clement Demonchy
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
Knoldus Inc.
 
CQRS and Event Sourcing, An Alternative Architecture for DDD
CQRS and Event Sourcing, An Alternative Architecture for DDDCQRS and Event Sourcing, An Alternative Architecture for DDD
CQRS and Event Sourcing, An Alternative Architecture for DDD
Dennis Doomen
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Databricks
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
DataWorks Summit
 
SQOOP PPT
SQOOP PPTSQOOP PPT
SQOOP PPT
Dushhyant Kumar
 
Apache Kafka in the Transportation and Logistics
Apache Kafka in the Transportation and LogisticsApache Kafka in the Transportation and Logistics
Apache Kafka in the Transportation and Logistics
Kai Wähner
 
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
Shirshanka Das
 
Automation of Software Engineering with OCI DevOps Build and Deployment Pipel...
Automation of Software Engineering with OCI DevOps Build and Deployment Pipel...Automation of Software Engineering with OCI DevOps Build and Deployment Pipel...
Automation of Software Engineering with OCI DevOps Build and Deployment Pipel...
Lucas Jellema
 
Microservices, Containers, Docker and a Cloud-Native Architecture in the Midd...
Microservices, Containers, Docker and a Cloud-Native Architecture in the Midd...Microservices, Containers, Docker and a Cloud-Native Architecture in the Midd...
Microservices, Containers, Docker and a Cloud-Native Architecture in the Midd...
Kai Wähner
 
APACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka StreamsAPACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka Streams
Ketan Gote
 
From Zero to Hero with Kafka Connect
From Zero to Hero with Kafka ConnectFrom Zero to Hero with Kafka Connect
From Zero to Hero with Kafka Connect
confluent
 
Apache flink
Apache flinkApache flink
Apache flink
Ahmed Nader
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
mxmxm
 
kafka
kafkakafka
kafka
Amikam Snir
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
CI-CD Jenkins, GitHub Actions, Tekton
CI-CD Jenkins, GitHub Actions, Tekton CI-CD Jenkins, GitHub Actions, Tekton
CI-CD Jenkins, GitHub Actions, Tekton
Araf Karsh Hamid
 
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
confluent
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
Knoldus Inc.
 
CQRS and Event Sourcing, An Alternative Architecture for DDD
CQRS and Event Sourcing, An Alternative Architecture for DDDCQRS and Event Sourcing, An Alternative Architecture for DDD
CQRS and Event Sourcing, An Alternative Architecture for DDD
Dennis Doomen
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Databricks
 
Apache Kafka in the Transportation and Logistics
Apache Kafka in the Transportation and LogisticsApache Kafka in the Transportation and Logistics
Apache Kafka in the Transportation and Logistics
Kai Wähner
 
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
Shirshanka Das
 
Automation of Software Engineering with OCI DevOps Build and Deployment Pipel...
Automation of Software Engineering with OCI DevOps Build and Deployment Pipel...Automation of Software Engineering with OCI DevOps Build and Deployment Pipel...
Automation of Software Engineering with OCI DevOps Build and Deployment Pipel...
Lucas Jellema
 
Microservices, Containers, Docker and a Cloud-Native Architecture in the Midd...
Microservices, Containers, Docker and a Cloud-Native Architecture in the Midd...Microservices, Containers, Docker and a Cloud-Native Architecture in the Midd...
Microservices, Containers, Docker and a Cloud-Native Architecture in the Midd...
Kai Wähner
 
APACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka StreamsAPACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka Streams
Ketan Gote
 
From Zero to Hero with Kafka Connect
From Zero to Hero with Kafka ConnectFrom Zero to Hero with Kafka Connect
From Zero to Hero with Kafka Connect
confluent
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
mxmxm
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
CI-CD Jenkins, GitHub Actions, Tekton
CI-CD Jenkins, GitHub Actions, Tekton CI-CD Jenkins, GitHub Actions, Tekton
CI-CD Jenkins, GitHub Actions, Tekton
Araf Karsh Hamid
 
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
confluent
 

Viewers also liked (20)

Developing Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache Kafka
Joe Stein
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
Jeff Holoman
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - Verisign
Michael Noll
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
datamantra
 
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Spark Summit
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle management
datamantra
 
Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
Rahul Jain
 
Hortonworks Data In Motion Series Part 4
Hortonworks Data In Motion Series Part 4Hortonworks Data In Motion Series Part 4
Hortonworks Data In Motion Series Part 4
Hortonworks
 
Real-time streaming and data pipelines with Apache Kafka
Real-time streaming and data pipelines with Apache KafkaReal-time streaming and data pipelines with Apache Kafka
Real-time streaming and data pipelines with Apache Kafka
Joe Stein
 
Kinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-diveKinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-dive
Yifeng Jiang
 
Kafka and Avro with Confluent Schema Registry
Kafka and Avro with Confluent Schema RegistryKafka and Avro with Confluent Schema Registry
Kafka and Avro with Confluent Schema Registry
Jean-Paul Azar
 
Performance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data PlatformsPerformance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data Platforms
DataWorks Summit/Hadoop Summit
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
Cloudera, Inc.
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka Streams
Guozhang Wang
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFi
Manish Gupta
 
Kafka Tutorial - basics of the Kafka streaming platform
Kafka Tutorial - basics of the Kafka streaming platformKafka Tutorial - basics of the Kafka streaming platform
Kafka Tutorial - basics of the Kafka streaming platform
Jean-Paul Azar
 
Kafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesKafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier Architectures
Todd Palino
 
Apache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup SlidesApache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup Slides
Isheeta Sanghi
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
Rahul Jain
 
Developing Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache Kafka
Joe Stein
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
Jeff Holoman
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - Verisign
Michael Noll
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
datamantra
 
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Spark Summit
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle management
datamantra
 
Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks
 
Hortonworks Data In Motion Series Part 4
Hortonworks Data In Motion Series Part 4Hortonworks Data In Motion Series Part 4
Hortonworks Data In Motion Series Part 4
Hortonworks
 
Real-time streaming and data pipelines with Apache Kafka
Real-time streaming and data pipelines with Apache KafkaReal-time streaming and data pipelines with Apache Kafka
Real-time streaming and data pipelines with Apache Kafka
Joe Stein
 
Kinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-diveKinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-dive
Yifeng Jiang
 
Kafka and Avro with Confluent Schema Registry
Kafka and Avro with Confluent Schema RegistryKafka and Avro with Confluent Schema Registry
Kafka and Avro with Confluent Schema Registry
Jean-Paul Azar
 
Performance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data PlatformsPerformance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data Platforms
DataWorks Summit/Hadoop Summit
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
Cloudera, Inc.
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka Streams
Guozhang Wang
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFi
Manish Gupta
 
Kafka Tutorial - basics of the Kafka streaming platform
Kafka Tutorial - basics of the Kafka streaming platformKafka Tutorial - basics of the Kafka streaming platform
Kafka Tutorial - basics of the Kafka streaming platform
Jean-Paul Azar
 
Kafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesKafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier Architectures
Todd Palino
 
Apache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup SlidesApache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup Slides
Isheeta Sanghi
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
Rahul Jain
 
Ad

Similar to Real time Analytics with Apache Kafka and Apache Spark (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
MapR Technologies
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Ahmet Bulut
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Olalekan Fuad Elesin
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
Chris Fregly
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
Synergetics Learning and Cloud Consulting
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
whoschek
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Chris Fregly
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark example
ShidrokhGoudarzi1
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
Karan Alang
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
Evan Chan
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
Bojan Babic
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizations
Gal Marder
 
Spark core
Spark coreSpark core
Spark core
Prashant Gupta
 
실시간 Streaming using Spark and Kafka 강의교재
실시간 Streaming using Spark and Kafka 강의교재실시간 Streaming using Spark and Kafka 강의교재
실시간 Streaming using Spark and Kafka 강의교재
hkyoon2
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
Girish Khanzode
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan Chan
Spark Summit
 
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability MeetupApache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Hyderabad Scalability Meetup
 
Dec6 meetup spark presentation
Dec6 meetup spark presentationDec6 meetup spark presentation
Dec6 meetup spark presentation
Ramesh Mudunuri
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
Chris Purrington
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
MapR Technologies
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Ahmet Bulut
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Olalekan Fuad Elesin
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
Chris Fregly
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
whoschek
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Chris Fregly
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark example
ShidrokhGoudarzi1
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
Karan Alang
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
Evan Chan
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
Bojan Babic
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizations
Gal Marder
 
실시간 Streaming using Spark and Kafka 강의교재
실시간 Streaming using Spark and Kafka 강의교재실시간 Streaming using Spark and Kafka 강의교재
실시간 Streaming using Spark and Kafka 강의교재
hkyoon2
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan Chan
Spark Summit
 
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability MeetupApache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Hyderabad Scalability Meetup
 
Dec6 meetup spark presentation
Dec6 meetup spark presentationDec6 meetup spark presentation
Dec6 meetup spark presentation
Ramesh Mudunuri
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
Chris Purrington
 
Ad

More from Rahul Jain (12)

Flipkart Strategy Analysis and Recommendation
Flipkart Strategy Analysis and RecommendationFlipkart Strategy Analysis and Recommendation
Flipkart Strategy Analysis and Recommendation
Rahul Jain
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
Rahul Jain
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
Rahul Jain
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
Rahul Jain
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
Rahul Jain
 
Introduction to Scala
Introduction to ScalaIntroduction to Scala
Introduction to Scala
Rahul Jain
 
What is NoSQL and CAP Theorem
What is NoSQL and CAP TheoremWhat is NoSQL and CAP Theorem
What is NoSQL and CAP Theorem
Rahul Jain
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of Lucene
Rahul Jain
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
Rahul Jain
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
Rahul Jain
 
Introduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperIntroduction to Kafka and Zookeeper
Introduction to Kafka and Zookeeper
Rahul Jain
 
Hibernate tutorial for beginners
Hibernate tutorial for beginnersHibernate tutorial for beginners
Hibernate tutorial for beginners
Rahul Jain
 
Flipkart Strategy Analysis and Recommendation
Flipkart Strategy Analysis and RecommendationFlipkart Strategy Analysis and Recommendation
Flipkart Strategy Analysis and Recommendation
Rahul Jain
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
Rahul Jain
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
Rahul Jain
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
Rahul Jain
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
Rahul Jain
 
Introduction to Scala
Introduction to ScalaIntroduction to Scala
Introduction to Scala
Rahul Jain
 
What is NoSQL and CAP Theorem
What is NoSQL and CAP TheoremWhat is NoSQL and CAP Theorem
What is NoSQL and CAP Theorem
Rahul Jain
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of Lucene
Rahul Jain
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
Rahul Jain
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
Rahul Jain
 
Introduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperIntroduction to Kafka and Zookeeper
Introduction to Kafka and Zookeeper
Rahul Jain
 
Hibernate tutorial for beginners
Hibernate tutorial for beginnersHibernate tutorial for beginners
Hibernate tutorial for beginners
Rahul Jain
 

Recently uploaded (20)

Leading AI Innovation As A Product Manager - Michael Jidael
Leading AI Innovation As A Product Manager - Michael JidaelLeading AI Innovation As A Product Manager - Michael Jidael
Leading AI Innovation As A Product Manager - Michael Jidael
Michael Jidael
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Buckeye Dreamin' 2023: De-fogging Debug Logs
Buckeye Dreamin' 2023: De-fogging Debug LogsBuckeye Dreamin' 2023: De-fogging Debug Logs
Buckeye Dreamin' 2023: De-fogging Debug Logs
Lynda Kane
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
Learn the Basics of Agile Development: Your Step-by-Step Guide
Learn the Basics of Agile Development: Your Step-by-Step GuideLearn the Basics of Agile Development: Your Step-by-Step Guide
Learn the Basics of Agile Development: Your Step-by-Step Guide
Marcel David
 
Automation Dreamin': Capture User Feedback From Anywhere
Automation Dreamin': Capture User Feedback From AnywhereAutomation Dreamin': Capture User Feedback From Anywhere
Automation Dreamin': Capture User Feedback From Anywhere
Lynda Kane
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
Hands On: Create a Lightning Aura Component with force:RecordData
Hands On: Create a Lightning Aura Component with force:RecordDataHands On: Create a Lightning Aura Component with force:RecordData
Hands On: Create a Lightning Aura Component with force:RecordData
Lynda Kane
 
Network Security. Different aspects of Network Security.
Network Security. Different aspects of Network Security.Network Security. Different aspects of Network Security.
Network Security. Different aspects of Network Security.
gregtap1
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
Salesforce AI Associate 2 of 2 Certification.docx
Salesforce AI Associate 2 of 2 Certification.docxSalesforce AI Associate 2 of 2 Certification.docx
Salesforce AI Associate 2 of 2 Certification.docx
José Enrique López Rivera
 
Leading AI Innovation As A Product Manager - Michael Jidael
Leading AI Innovation As A Product Manager - Michael JidaelLeading AI Innovation As A Product Manager - Michael Jidael
Leading AI Innovation As A Product Manager - Michael Jidael
Michael Jidael
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Buckeye Dreamin' 2023: De-fogging Debug Logs
Buckeye Dreamin' 2023: De-fogging Debug LogsBuckeye Dreamin' 2023: De-fogging Debug Logs
Buckeye Dreamin' 2023: De-fogging Debug Logs
Lynda Kane
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
Learn the Basics of Agile Development: Your Step-by-Step Guide
Learn the Basics of Agile Development: Your Step-by-Step GuideLearn the Basics of Agile Development: Your Step-by-Step Guide
Learn the Basics of Agile Development: Your Step-by-Step Guide
Marcel David
 
Automation Dreamin': Capture User Feedback From Anywhere
Automation Dreamin': Capture User Feedback From AnywhereAutomation Dreamin': Capture User Feedback From Anywhere
Automation Dreamin': Capture User Feedback From Anywhere
Lynda Kane
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
Hands On: Create a Lightning Aura Component with force:RecordData
Hands On: Create a Lightning Aura Component with force:RecordDataHands On: Create a Lightning Aura Component with force:RecordData
Hands On: Create a Lightning Aura Component with force:RecordData
Lynda Kane
 
Network Security. Different aspects of Network Security.
Network Security. Different aspects of Network Security.Network Security. Different aspects of Network Security.
Network Security. Different aspects of Network Security.
gregtap1
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
Salesforce AI Associate 2 of 2 Certification.docx
Salesforce AI Associate 2 of 2 Certification.docxSalesforce AI Associate 2 of 2 Certification.docx
Salesforce AI Associate 2 of 2 Certification.docx
José Enrique López Rivera
 

Real time Analytics with Apache Kafka and Apache Spark

  • 1. Real time Analytics with Apache Kafka and Spark October 2014 Meetup Organized by Big Data Hyderabad. http://www.meetup.com/Big-Data-Hyderabad/ Rahul Jain @rahuldausa
  • 2. About Me • Big data/Search Consultant • 8+ years of learning experience. • Worked (got a chance) on High volume distributed applications. • Still a learner (and beginner)
  • 3. Quick Questionnaire How many people know/work on Scala ? How many people know/work on Apache Kafka? How many people know/heard/are using Apache Spark ?
  • 4. What we are going to learn/see today ? • Apache Zookeeper (Overview) • Apache Kafka – Hands-on • Apache Spark – Hands-on • Spark Streaming (Explore)
  • 6. What is Zookeeper • An Open source, High Performance coordination service for distributed applications • Centralized service for – Configuration Management – Locks and Synchronization for providing coordination between distributed systems – Naming service (Registry) – Group Membership • Features – hierarchical namespace – provides watcher on a znode – allows to form a cluster of nodes • Supports a large volume of request for data retrieval and update • http://zookeeper.apache.org/ Source : http://zookeeper.apache.org
  • 7. Zookeeper Use cases • Configuration Management • Cluster member nodes Bootstrapping configuration from a central source • Distributed Cluster Management • Node Join/Leave • Node Status in real time • Naming Service – e.g. DNS • Distributed Synchronization – locks, barriers • Leader election • Centralized and Highly reliable Registry
  • 8. Zookeeper Data Model  Hierarchical Namespace  Each node is called “znode”  Each znode has data(stores data in byte[] array) and can have children  znode – Maintains “Stat” structure with version of data changes , ACL changes and timestamp – Version number increases with each changes
  • 9. Let’s recall basic concepts of Messaging System
  • 10. Point to Point Messaging (Queue) Credit: http://fusesource.com/docs/broker/5.3/getting_started/FuseMBStartedKeyJMS.html
  • 11. Publish-Subscribe Messaging (Topic) Credit: http://fusesource.com/docs/broker/5.3/getting_started/FuseMBStartedKeyJMS.html
  • 13. Overview • An apache project initially developed at LinkedIn • Distributed publish-subscribe messaging system • Designed for processing of real time activity stream data e.g. logs, metrics collections • Written in Scala • Does not follow JMS Standards, neither uses JMS APIs • Features – Persistent messaging – High-throughput – Supports both queue and topic semantics – Uses Zookeeper for forming a cluster of nodes (producer/consumer/broker) and many more… • http://kafka.apache.org/
  • 14. How it works Credit : http://kafka.apache.org/design.html
  • 15. Real time transfer Broker does not Push messages to Consumer, Consumer Polls messages from Broker. Consumer1 (Group1) Consumer3 (Group2) Kafka Broker Consumer4 (Group2) Producer Zookeeper Consumer2 (Group1) Update Consumed Message offset Queue Topology Topic Topology Kafka Broker
  • 16. Performance Numbers Producer Performance Consumer Performance Credit : http://research.microsoft.com/en-us/UM/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
  • 17. About Apache Spark • Initially started at UC Berkeley in 2009 • Fast and general purpose cluster computing system • 10x (on disk) - 100x (In-Memory) faster • Most popular for running Iterative Machine Learning Algorithms. • Provides high level APIs in • Java • Scala • Python • Integration with Hadoop and its eco-system and can read existing data. • http://spark.apache.org/
  • 18. So Why Spark ? • Most of Machine Learning Algorithms are iterative because each iteration can improve the results • With Disk based approach each iteration’s output is written to disk making it slow Hadoop execution flow Spark execution flow http://www.wiziq.com/blog/hype-around-apache-spark/
  • 19. Spark Stack • Spark SQL – For SQL and unstructured data processing • MLib – Machine Learning Algorithms • GraphX – Graph Processing • Spark Streaming – stream processing of live data streams http://spark.apache.org
  • 21. Terminology • Application Jar – User Program and its dependencies except Hadoop & Spark Jars bundled into a Jar file • Driver Program – The process to start the execution (main() function) • Cluster Manager – An external service to manage resources on the cluster (standalone manager, YARN, Apache Mesos) • Deploy Mode – cluster : Driver inside the cluster – client : Driver outside of Cluster
  • 22. Terminology (contd.) • Worker Node : Node that run the application program in cluster • Executor – Process launched on a worker node, that runs the Tasks – Keep data in memory or disk storage • Task : A unit of work that will be sent to executor • Job – Consists multiple tasks – Created based on a Action • Stage : Each Job is divided into smaller set of tasks called Stages that is sequential and depend on each other • SparkContext : – represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
  • 23. Resilient Distributed Dataset (RDD) • Resilient Distributed Dataset (RDD) is a basic Abstraction in Spark • Immutable, Partitioned collection of elements that can be operated in parallel • Basic Operations – map – filter – persist • Multiple Implementation – PairRDDFunctions : RDD of Key-Value Pairs, groupByKey, Join – DoubleRDDFunctions : Operation related to double values – SequenceFileRDDFunctions : Operation related to SequenceFiles • RDD main characteristics: – A list of partitions – A function for computing each split – A list of dependencies on other RDDs – Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned) – Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file) • Custom RDD can be also implemented (by overriding functions)
  • 24. Cluster Deployment • Standalone Deploy Mode – simplest way to deploy Spark on a private cluster • Amazon EC2 – EC2 scripts are available – Very quick launching a new cluster • Apache Mesos • Hadoop YARN
  • 27. Let’s start getting hands dirty
  • 28. Kafka Installation • Download – http://kafka.apache.org/downloads.html • Untar it > tar -xzf kafka_<version>.tgz > cd kafka_<version>
  • 29. Start Servers • Start the Zookeeper server > bin/zookeeper-server-start.sh config/zookeeper.properties Pre-requisite: Zookeeper should be up and running. • Now Start the Kafka Server > bin/kafka-server-start.sh config/server.properties
  • 30. Create/List Topics • Create a topic > bin/kafka-topics.sh --create --zookeeper localhost:2181 -- replication-factor 1 --partitions 1 --topic test • List down all topics > bin/kafka-topics.sh --list --zookeeper localhost:2181 Output: test
  • 31. Producer • Send some Messages > bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test Now type on console: This is a message This is another message
  • 32. Consumer • Receive some Messages > bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning This is a message This is another message
  • 33. Multi-Broker Cluster • Copy configs > cp config/server.properties config/server-1.properties > cp config/server.properties config/server-2.properties • Changes in the config files. config/server-1.properties: broker.id=1 port=9093 log.dir=/tmp/kafka-logs-1 config/server-2.properties: broker.id=2 port=9094 log.dir=/tmp/kafka-logs-2
  • 34. Start with New Nodes • Start other Nodes with new configs > bin/kafka-server-start.sh config/server-1.properties & > bin/kafka-server-start.sh config/server-2.properties & • Create a new topic with replication factor as 3 > bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic • List down the all topics > bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic Topic:my-replicated-topic PartitionCount:1 ReplicationFactor:3 Configs: Topic: my-replicated-topic Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0
  • 35. Let’s move to Apache Spark
  • 36. Spark Shell ./bin/spark-shell --master local[2] The --master option specifies the master URL for a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing. ./bin/run-example SparkPi 10 This will run 10 iterations to calculate the value of Pi
  • 37. Basic operations… scala> val textFile = sc.textFile("README.md") textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3 scala> textFile.count() // Number of items in this RDD ees0: Long = 126 scala> textFile.first() // First item in this RDD res1: String = # Apache Spark scala> val linesWithSpark = textFile.filter(line => line.contains("Spark")) linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09 Simplier - Single liner: scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"? res3: Long = 15
  • 38. Map - Reduce scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b) res4: Long = 15 scala> import java.lang.Math scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b)) res5: Int = 15 scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b) wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8 wordCounts.collect()
  • 39. With Caching… scala> linesWithSpark.cache() res7: spark.RDD[String] = spark.FilteredRDD@17e51082 scala> linesWithSpark.count() res8: Long = 15 scala> linesWithSpark.count() res9: Long = 15
  • 40. With HDFS… val lines = spark.textFile(“hdfs://...”) val errors = lines.filter(line => line.startsWith(“ERROR”)) println(Total errors: + errors.count())
  • 41. Standalone (Scala) /* SimpleApp.scala */ import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf object SimpleApp { def main(args: Array[String]) { val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system val conf = new SparkConf().setAppName("Simple Application") .setMaster(“local") val sc = new SparkContext(conf) val logData = sc.textFile(logFile, 2).cache() val numAs = logData.filter(line => line.contains("a")).count() val numBs = logData.filter(line => line.contains("b")).count() println("Lines with a: %s, Lines with b: %s".format(numAs, numBs)) } }
  • 42. Standalone (Java) /* SimpleApp.java */ import org.apache.spark.api.java.*; import org.apache.spark.SparkConf; import org.apache.spark.api.java.function.Function; public class SimpleApp { public static void main(String[] args) { String logFile = "YOUR_SPARK_HOME/README.md"; // Should be some file on your system SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local"); JavaSparkContext sc = new JavaSparkContext(conf); JavaRDD<String> logData = sc.textFile(logFile).cache(); long numAs = logData.filter(new Function<String, Boolean>() { public Boolean call(String s) { return s.contains("a"); } }).count(); long numBs = logData.filter(new Function<String, Boolean>() { public Boolean call(String s) { return s.contains("b"); } }).count(); System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs); } }
  • 43. Job Submission $SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] target/scala-2.10/simple-project_2.10-1.0.jar
  • 44. Configuration val conf = new SparkConf() .setMaster("local") .setAppName("CountingSheep") .set("spark.executor.memory", "1g") val sc = new SparkContext(conf)
  • 45. Let’s discuss Real-time Clickstream Analytics
  • 46. Analytics High Level Flow A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing. http://blog.cloudera.com/blog/2014/03/letting-it-flow-with-spark-streaming/
  • 47. Spark Streaming Makes it easy to build scalable fault-tolerant streaming applications. – Ease of Use – Fault Tolerance – Combine streaming with batch and interactive queries.
  • 48. Support Multiple Input/Outputs Credit : http://spark.apache.org
  • 49. Streaming Flow Credit : http://spark.apache.org
  • 50. Spark Streaming Terminology • Spark Context – An object with configuration parameters • Discretized Steams (DStreams) – Stream from input sources or generated after transformation – Continuous series of RDDs • Input DSteams – Stream of raw data from input sources
  • 51. Spark Context SparkConf sparkConf = new SparkConf().setAppName(”PageViewsCount"); sparkConf.setMaster("local[2]");//running locally with 2 threads // Create the context with a 5 second batch size JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(5000));
  • 52. Reading from Kafka Stream int numThreads = Integer.parseInt(args[3]); Map<String, Integer> topicMap = new HashMap<String, Integer>(); String[] topics = args[2].split(","); for (String topic: topics) { topicMap.put(topic, numThreads); } JavaPairReceiverInputDStream<String, String> messages = KafkaUtils.createStream(jssc, <zkQuorum>, <group> , topicMap);
  • 53. Processing of Stream /** * Incoming: // (192.168.2.82,1412977327392,www.example.com,192.168.2.82) // 192.168.2.82: key : tuple2._1() // 1412977327392,www.example.com,192.168.2.82: value: tuple2._2() */ // Method signature: messages.map(InputArgs, OutputArgs) JavaDStream<Tuple2<String, String>> events = messages.map(new Function<Tuple2<String, String>, Tuple2<String, String>>() { @Override public Tuple2<String, String> call(Tuple2<String, String> tuple2) { String[] parts = tuple2._2().split(","); return new Tuple2<>(parts[2], parts[1]); } }); JavaPairDStream<String, Long> pageCountsByUrl = events.map(new Function<Tuple2<String,String>, String>(){ @Override public String call(Tuple2<String, String> pageView) throws Exception { return pageView._2(); } }).countByValue(); System.out.println("Page counts by url:"); pageCountsByUrl.print();
  • 54. Thanks! @rahuldausa on twitter and slideshare http://www.linkedin.com/in/rahuldausa Join us @ For Solr, Lucene, Elasticsearch, Machine Learning, IR http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/ http://www.meetup.com/DataAnalyticsGroup/ Join us @ For Hadoop, Spark, Cascading, Scala, NoSQL, Crawlers and all cutting edge technologies. http://www.meetup.com/Big-Data-Hyderabad/