A Complete Hadoop Stack
Abhra Pal
April - 2016
Background
I have been working and experimenting with Hadoop for some time now. Looking back, I
remember how I struggled at the start to find the right direction for designing a
complete Hadoop stack. While Hadoop gives you a lot of flexibility, it also creates confusion,
because there are many directions you can choose and integrate with your core solution. So
here I present my version of it. This is also my first publication on SlideShare, so apologies if
you find mistakes.
I have used only Apache Hadoop (version 2.6.2) and open source projects. To follow the
technical solution, you need to understand Hadoop core concepts, NoSQL, columnar
databases, grid, CGI, Python, JSON and JavaScript.
Solution: My Hadoop Technology Stack
Data Acquisition: file-based data using the Hadoop shell; web service data using FLUME; structured data using SQOOP.
Data Transformation: PIG for loading; HIVE for query-based operations on structured data; HBASE for columnar data.
Data Exploration & Prediction: R (but there are challenges in using R with the full capacity of Hadoop); Python libraries like NUMPY / SCIPY for machine learning.
Data Lake Output: MONGODB with mongo-Hadoop connectors to store the output from data transformations; PIG will move the data from HIVE to MONGO.
Data Exploitation: Python CGI to create a web application; JQUERY to bind JSON responses with D3.
The whole stack / cluster can be built on Ubuntu 14.04 (64-bit) and Hadoop 2.
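As an illustration of the Data Exploitation stage, here is a minimal sketch of how a Python CGI script might return a JSON response for JQUERY and D3 to consume. The function name `build_cgi_response`, the response shape (`count`, `data`) and the sample rows are assumptions made for this example; they are not taken from the slides, and in the real stack the rows would come from MONGODB rather than a hard-coded list.

```python
import json
import sys


def build_cgi_response(records):
    """Return a complete CGI response: headers, a blank line, then a JSON body.

    `records` is a list of plain dicts (hypothetical dashboard rows).
    """
    body = json.dumps({"count": len(records), "data": records}, sort_keys=True)
    # A CGI script must emit its headers first, then an empty line, then the body.
    return "Content-Type: application/json\r\n\r\n" + body


if __name__ == "__main__":
    # Hypothetical sample rows standing in for a MongoDB query result.
    sample = [
        {"region": "north", "sales": 120},
        {"region": "south", "sales": 95},
    ]
    sys.stdout.write(build_cgi_response(sample))
```

On the browser side, a JQUERY AJAX call could fetch this URL and hand the `data` array to D3 for binding.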
Architecture / Interface Diagram
[Figure: architecture / interface diagram with three groups.
HADOOP Cluster: SQOOP, Flume, Hadoop Shell, HDFS, HIVE, PIG, HCATALOG, Python Machine Learning.
MongoDB Cluster: Router / Query Servers (≥2), Config Server (≥2), Shard / Replica (≥2).
Application Server: Python CGI, JQUERY, D3.
Connector labels between the groups: Mongo Hadoop (twice), Mongo REST, PyMongo.]
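The diagram shows at least two router / query servers in front of the MongoDB cluster. One way an application server could make use of that is to list every router in its connection string, so the client can fall back to another router if one is down. The sketch below only builds the standard `mongodb://` URI; the helper name `mongos_uri`, the host names and the database name are hypothetical, and nothing here is taken from the author's actual configuration.

```python
def mongos_uri(routers, database, port=27017):
    """Build a mongodb:// URI that lists every query router (mongos) host.

    Raises ValueError if fewer than two routers are given, mirroring the
    ">= 2" note in the diagram (an assumption about the intended layout).
    """
    if len(routers) < 2:
        raise ValueError("expected at least two router hosts")
    hosts = ",".join("{}:{}".format(host, port) for host in routers)
    return "mongodb://{}/{}".format(hosts, database)


if __name__ == "__main__":
    # Hypothetical host and database names, for illustration only.
    print(mongos_uri(["router1.example.com", "router2.example.com"], "analytics"))
```

The resulting string could then be passed to a MongoDB client such as PyMongo from the Python CGI layer.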
Conclusion, Alternatives & Further Reading
I wanted to keep this simple and straightforward, so I have left out the details. If you would
like to know more, I would love to explain with examples.
Spark is coming in a big way and should give us more options for designing the
stack, especially the SparkR libraries; Spark is also compatible with Python.
The various R-Hadoop adapters are improving, and I am sure this will get better over time.
The main idea of my solution is that HIVE alone cannot suffice for a real-time dashboard, and
hence the output of data processing and exploration needs to be taken out of Hadoop.
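To make the idea of taking processed output out of Hadoop concrete, here is a small sketch that turns tab-separated result rows into JSON-style documents of the kind a document store expects. This is only an illustration: the slides use PIG with the mongo-Hadoop connector for this step, whereas the sketch assumes the results were exported as tab-separated text. The function name `rows_to_documents` and the field names are hypothetical.

```python
import csv
import io


def rows_to_documents(tsv_text, field_names):
    """Convert tab-separated rows into a list of dicts keyed by field_names.

    Values are kept as strings; blank lines are skipped.
    """
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [dict(zip(field_names, row)) for row in reader if row]


if __name__ == "__main__":
    # Hypothetical exported query result: region and total sales per row.
    exported = "north\t120\nsouth\t95\n"
    for doc in rows_to_documents(exported, ["region", "sales"]):
        print(doc)
```

Documents in this shape could then be written to a collection with a MongoDB client, which is where the dashboard would read them from.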
The reason for using MONGODB in my stack is its flexibility and the fact that MONGO offers an
automatic REST API on all its collections.