Big Data and Hadoop Training
Session 5
Big Data - Pipeline
Big Data Pipeline
Lambda Architecture - Streaming (Real-Time) Layer
with Apache Kafka, Apache Hadoop, Apache Spark, and Apache Cassandra
on Amazon Web Services Cloud Platform
3 EC2 instances for the Kafka cluster
Repeat these commands on all 3 EC2 instances of the Kafka cluster
cat /etc/*-release
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
java -version
mkdir kafka
cd kafka
wget http://download.nextag.com/apache/kafka/0.10.0.0/kafka_2.11-0.10.0.0.tgz
tar -zxvf kafka_2.11-0.10.0.0.tgz
cd kafka_2.11-0.10.0.0
Cluster nodes (private IP / public IP):
ZooKeeper ==> 172.31.48.208 / 52.91.1.93
Kafka-datanode1 ==> 172.31.63.203 / 54.173.215.211
Kafka-datanode2 ==> 172.31.9.25 / 54.226.29.194
Kafka-datanode1 (set following properties for config/server.properties)
ubuntu@ip-172-31-63-203:~/kafka/kafka_2.11-0.10.0.0$ vi config/server.properties
broker.id=1
listeners=PLAINTEXT://172.31.63.203:9092
advertised.listeners=PLAINTEXT://54.173.215.211:9092
zookeeper.connect=52.91.1.93:2181
Kafka-datanode2 (set following properties for config/server.properties)
ubuntu@ip-172-31-9-25:~/kafka/kafka_2.11-0.10.0.0$ vi config/server.properties
broker.id=2
listeners=PLAINTEXT://172.31.9.25:9092
advertised.listeners=PLAINTEXT://54.226.29.194:9092
zookeeper.connect=52.91.1.93:2181
Modify config/server.properties for Kafka-datanode1 & Kafka-datanode2
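The two brokers differ only in their ID and addresses: `listeners` binds to the instance's private IP, while `advertised.listeners` publishes the public IP that clients outside the VPC connect to. As a rough sketch of that pattern, the hypothetical helper below (its name, the `BROKERS` table, and the default port argument are illustrative, not part of the slides) renders the same four overrides for each broker so they can be checked before editing the file by hand.

```python
# Hypothetical helper: renders the per-broker overrides listed on this slide.
# The IP addresses are the ones from the slide; everything else is illustrative.
ZOOKEEPER_PUBLIC_IP = "52.91.1.93"

BROKERS = {
    1: {"private_ip": "172.31.63.203", "public_ip": "54.173.215.211"},
    2: {"private_ip": "172.31.9.25", "public_ip": "54.226.29.194"},
}


def render_broker_properties(broker_id, private_ip, public_ip,
                             zookeeper_ip=ZOOKEEPER_PUBLIC_IP, port=9092):
    """Return the server.properties overrides for one broker as text."""
    lines = [
        "broker.id=%d" % broker_id,
        # Address the broker binds to inside the VPC.
        "listeners=PLAINTEXT://%s:%d" % (private_ip, port),
        # Address handed out to clients, reachable from outside the VPC.
        "advertised.listeners=PLAINTEXT://%s:%d" % (public_ip, port),
        "zookeeper.connect=%s:2181" % zookeeper_ip,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    for broker_id, ips in sorted(BROKERS.items()):
        print("# Kafka-datanode%d" % broker_id)
        print(render_broker_properties(broker_id, **ips))
```

Each `broker.id` must be unique within the cluster, which is why the table is keyed by it.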
Launch zookeeper / datanode1 / datanode2
1) Start zookeeper
bin/zookeeper-server-start.sh config/zookeeper.properties
2) Start server on Kafka-datanode1
bin/kafka-server-start.sh config/server.properties
3) Start server on Kafka-datanode2
bin/kafka-server-start.sh config/server.properties
To cap the broker's JVM heap, run this in the same shell before kafka-server-start.sh:
export KAFKA_HEAP_OPTS="-Xmx256M -Xms256M"
4) Create Topic & Start consumer
bin/kafka-topics.sh --zookeeper 52.91.1.93:2181 --create --topic data --partitions 1 --replication-factor 2
bin/kafka-console-consumer.sh --zookeeper 52.91.1.93:2181 --topic data --from-beginning
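The topic above is created with a replication factor of 2, which works here because two brokers are running; Kafka rejects a replication factor larger than the number of available brokers. A minimal sketch of that check, assuming a hypothetical helper `build_create_topic_command` (not part of Kafka or the slides) that assembles the same `kafka-topics.sh` arguments used in step 4:

```python
# Hypothetical helper: builds the step-4 topic creation command and checks
# the replication factor against the number of brokers before running it.
def build_create_topic_command(zookeeper, topic, partitions,
                               replication_factor, broker_count):
    """Return the kafka-topics.sh argument list, or raise ValueError."""
    if partitions < 1:
        raise ValueError("a topic needs at least one partition")
    if not 1 <= replication_factor <= broker_count:
        raise ValueError(
            "replication factor %d needs between 1 and %d brokers"
            % (replication_factor, broker_count)
        )
    return [
        "bin/kafka-topics.sh",
        "--zookeeper", zookeeper,
        "--create",
        "--topic", topic,
        "--partitions", str(partitions),
        "--replication-factor", str(replication_factor),
    ]


if __name__ == "__main__":
    command = build_create_topic_command(
        zookeeper="52.91.1.93:2181",
        topic="data",
        partitions=1,
        replication_factor=2,
        broker_count=2,  # Kafka-datanode1 and Kafka-datanode2
    )
    # Print rather than execute, so the sketch is safe to run anywhere.
    print(" ".join(command))
```

The argument list can be passed to `subprocess.check_call` from the Kafka install directory once the printed command looks right.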
Launch Kafka Cluster
(ZooKeeper / Kafka-datanode1 / Kafka-datanode2)
Execute Python / Kafka Spark Job
Sample data (CSV file) that the Java Kafka producer sends to the Kafka server
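The slides use a Java producer whose code is not reproduced here. As a stand-in, the Python sketch below shows the general idea of turning CSV lines into one Kafka message per row. The function names, the sample columns, and the pluggable `send` callable are all assumptions made for illustration; in a real run, `send` could be the `send(topic, value)` method of a producer object from a Kafka client library.

```python
# Illustrative stand-in for the Java producer: one message per CSV data row.
# Column names and values in the sample are made up for this sketch.
def csv_lines_to_messages(csv_text, has_header=True):
    """Yield each non-empty CSV data line as UTF-8 bytes."""
    lines = csv_text.splitlines()
    if has_header:
        lines = lines[1:]  # drop the column-name row
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield line.encode("utf-8")


def publish(messages, send, topic="data"):
    """Pass every message to send(topic, message); return how many were sent."""
    count = 0
    for message in messages:
        send(topic, message)
        count += 1
    return count


if __name__ == "__main__":
    sample = "id,amount\n1,10.5\n2,7.25\n"
    sent = []  # collects (topic, message) pairs instead of contacting Kafka
    total = publish(csv_lines_to_messages(sample),
                    lambda topic, message: sent.append((topic, message)))
    print(total, sent)
```

Keeping the transport behind a callable lets the row handling be checked locally before pointing it at the cluster.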
Python Spark Job Processing Data from AWS Kafka Cluster
Python Spark Streaming Application
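The application itself appears only as a screenshot in the original deck, so the following is not the author's code. It is a minimal sketch of what such a job might look like, assuming the older DStream-based Kafka integration (`pyspark.streaming.kafka`, available in Spark 1.x and 2.x and removed in Spark 3), the topic `data` created earlier, and CSV lines as message values. The app name, batch interval, `--run` flag, and `parse_record` helper are illustrative choices.

```python
import sys

# Public broker addresses and topic name taken from the earlier slides.
BROKER_LIST = "54.173.215.211:9092,54.226.29.194:9092"
TOPIC = "data"


def parse_record(line):
    """Split one CSV line into a list of stripped fields."""
    return [field.strip() for field in line.split(",")]


def run_streaming_job(batch_seconds=10):
    """Read the topic in micro-batches and print the parsed rows."""
    # Imported here so parse_record can be used without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils  # Spark 1.x / 2.x only

    sc = SparkContext(appName="KafkaCsvStream")
    ssc = StreamingContext(sc, batch_seconds)
    stream = KafkaUtils.createDirectStream(
        ssc, [TOPIC], {"metadata.broker.list": BROKER_LIST}
    )
    # Each element is a (key, value) pair; the CSV line is the value.
    rows = stream.map(lambda pair: parse_record(pair[1]))
    rows.pprint()
    ssc.start()
    ssc.awaitTermination()


# Only start Spark when asked to, so the file can be imported or run safely.
if __name__ == "__main__" and "--run" in sys.argv:
    run_streaming_job()
```

Submitting this with `spark-submit` and the `--run` argument would also require the Spark Streaming Kafka integration package that matches the installed Spark and Scala versions on the classpath.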
Thank You
hkbhadraa@gmail.com
