Hadoop and OpenStack
Matthew Farrellee, @spinningmatt, Red Hat
Sumit Mohanty, @smohanty, Hortonworks
What is OpenStack?
OpenStack is
A cloud operating system that controls large
pools of compute, storage, and networking
resources throughout a datacenter, all
managed through a dashboard that gives
administrators control while empowering their
users to provision resources through a web
interface.
An ecosystem of projects
● Compute - Nova
● Networking - Neutron
● Object Storage - Swift
● Block Storage - Cinder
● Identity - Keystone
● Image Service - Glance
● Dashboard - Horizon
● Telemetry - Ceilometer
● Orchestration - Heat
● Data Processing - Sahara
Sahara is combining use cases
Trends
Hadoop
EC2
OpenStack
www.google.com/trends/explore#q=hadoop,ec2,openstack
EC2 beta Aug 25 2006 (http://aws.typepad.com/aws/2006/08/amazon_ec2_beta.html)
Data analysis is hard
Data analysis is hard...
● Come up w/ a relevant question
○ The question you answer won’t be the question you
set out to ask
○ Mine: Can I predict doctor specialty from what
procedures they perform?
● Find the data
○ Tons, little consistency, unknown origin, hoarded
○ Data w/o a dictionary is worse than code w/o
comments. Run away!
Data analysis is hard...
● Data usability
○ Acceptable license? (Even for Gov’t sets)
■ Mine: Metadata copyrighted by AMA!
○ Private is often highly protected, no/narrow DMZ
● Explore and clean
○ Two of the oldest people in the medical profession
working with Medicare
○ Stephen Glasser graduated in 1773
○ Cheryl Palma graduated in 1776
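Outliers like the two graduation years above are what a simple range check surfaces during exploration. A minimal sketch, assuming records are (name, graduation year) pairs and that 1930 is a sensible lower cutoff; both the record layout and the cutoff are illustrative assumptions, not details of the data set used in the talk.

```python
from datetime import date

# Hypothetical cutoff: assume nobody still practicing graduated before 1930.
EARLIEST_PLAUSIBLE_YEAR = 1930


def implausible_graduations(records, earliest=EARLIEST_PLAUSIBLE_YEAR, latest=None):
    """Return the (name, year) pairs whose year falls outside [earliest, latest]."""
    if latest is None:
        latest = date.today().year
    return [(name, year) for name, year in records if not earliest <= year <= latest]


records = [
    ("Stephen Glasser", 1773),
    ("Cheryl Palma", 1776),
    ("Example Doctor", 1995),  # made-up row for contrast
]

suspicious = implausible_graduations(records)
for name, year in suspicious:
    print(f"check {name}: graduated in {year}?")
```

Flagged rows still need a human decision: the check only says a value is implausible, not what the correct value is.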
Data analysis is hard...
● You got some answer to a question you
approximately asked
● You must refine the question and process
● Repeat
This is hard enough without having to manage
tools and infrastructure!
Sahara’s goal
Make managing Hadoop+ infrastructure and
tools so simple that they get out of your way
Sahara provides
● Apache Hadoop cluster and workload
management
○ Cluster - construct and manage the lifecycle of a
Hadoop cluster
○ Workload - workflow for big data processing with
Hadoop (AWS EMR-like)
● Through a Python library, REST API, Web
UI, command line interface
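As a rough illustration of the REST route, the sketch below builds (but does not send) a cluster-creation request using only the Python standard library. The endpoint, port, tenant ID, token, and all IDs are placeholders, and the path and field names are assumptions modeled on the Sahara v1.1 REST API; check them against the Sahara documentation before use.

```python
import json
import urllib.request

# Placeholders for illustration only; none of these values come from the talk.
SAHARA_ENDPOINT = "http://sahara.example.com:8386/v1.1"
TENANT_ID = "TENANT_ID"
AUTH_TOKEN = "KEYSTONE_TOKEN"


def build_create_cluster_request(name, plugin_name, hadoop_version,
                                 cluster_template_id, image_id):
    """Build (but do not send) a cluster-creation request."""
    # Field names are assumed from the v1.1 API; verify before relying on them.
    body = {
        "name": name,
        "plugin_name": plugin_name,
        "hadoop_version": hadoop_version,
        "cluster_template_id": cluster_template_id,
        "default_image_id": image_id,
    }
    return urllib.request.Request(
        url=f"{SAHARA_ENDPOINT}/{TENANT_ID}/clusters",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Auth-Token": AUTH_TOKEN},
        method="POST",
    )


request = build_create_cluster_request(
    name="demo-cluster",
    plugin_name="vanilla",
    hadoop_version="2.3.0",          # placeholder version string
    cluster_template_id="TEMPLATE_ID",
    image_id="IMAGE_ID",
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would submit it; the Python library, Web UI, and command line interface are other routes to the same operation.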
Sahara’s architecture
[Diagram: the Sahara service (REST API, Auth, Cluster Configuration Manager, Resources Orchestration Manager, Job Manager, Data Access Layer, Vendor Plugins) alongside the Sahara Python Client, Horizon (Sahara Pages), Keystone, Swift, Heat, Nova, Glance, Cinder, Neutron, Trove DB, Data Sources, Job Sources, and the Hadoop VMs]
Sahara’s features
● Plugin mechanism - distro choice
● Cluster scaling - elasticity
● Swift integration - data storage
● Cinder integration - persistent HDFS
● Network management with Nova and Neutron
● Anti-affinity, separate services on physical hardware
● Data locality with Swift
● Repeatable cluster creation w/ template mechanism
● http://docs.openstack.org/developer/sahara/userdoc/features.html
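The template and anti-affinity features can be pictured with a small model. This is a sketch under stated assumptions: the dictionary layout, the instance naming, and both helper functions are hypothetical and do not reproduce Sahara's template schema or scheduler. It shows a template expanding into the same set of instances every time, and a lower bound on the physical hosts needed when a process is listed for anti-affinity.

```python
from collections import Counter

# Hypothetical template layout for illustration; not the Sahara API schema.
cluster_template = {
    "name": "small-hadoop",
    "node_groups": [
        {"name": "master", "count": 1, "processes": ["namenode", "resourcemanager"]},
        {"name": "worker", "count": 3, "processes": ["datanode", "nodemanager"]},
    ],
    "anti_affinity": ["datanode"],
}


def expand(template, cluster_name):
    """Expand a template into one instance description per node."""
    instances = []
    for group in template["node_groups"]:
        for index in range(1, group["count"] + 1):
            instances.append({
                "name": f"{cluster_name}-{group['name']}-{index:03d}",
                "processes": list(group["processes"]),
            })
    return instances


def min_hosts_lower_bound(template):
    """Lower bound on physical hosts: instances that run the same
    anti-affinity process must each land on a different host."""
    per_process = Counter()
    for group in template["node_groups"]:
        for process in group["processes"]:
            if process in template["anti_affinity"]:
                per_process[process] += group["count"]
    return max(per_process.values(), default=1)


instances = expand(cluster_template, "demo")
print([instance["name"] for instance in instances])
print("hosts needed (at least):", min_hosts_lower_bound(cluster_template))
```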
Storage considerations
● Swift
○ Input/output through Swift HCFS plugin
○ Intermediate data stored in HDFS on cluster
○ Locality when co-locating swift & nova-compute
● HDFS
○ Local (long lived cluster) and remote (copy in)
● HDFS backed by ephemeral disk or Cinder
○ Ephemeral - /var/lib/nova/instances on compute host
○ Cinder - persistent block devices attached to instances
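Jobs reference Swift data by URL. The helper below is a minimal sketch that assumes the `swift://<container>.sahara/<path>` form, where the `sahara` suffix names the Swift provider configured in Hadoop; the helper name, container, and paths are made up for illustration.

```python
def swift_url(container, object_path, provider="sahara"):
    """Return a URL of the form swift://<container>.<provider>/<path>."""
    if not container or "/" in container:
        raise ValueError("container must be a non-empty name without '/'")
    return f"swift://{container}.{provider}/{object_path.lstrip('/')}"


# Hypothetical container and object names.
job_input = swift_url("demo-data", "input/procedures.csv")
job_output = swift_url("demo-data", "/output/run-1")
print(job_input)
print(job_output)
```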
Sahara’s plugin architecture
● This is important!
● It’s where Hadoop distribution vendors
integrate their management software
● It’s how users pick different software
versions
● Currently: Vanilla (reference impl. w/ Apache
versions), HDP (via Ambari), IDH (via Intel
Manager), and Spark (w/ minimal CDH)
HDP Plugin Overview
● Full support for all Sahara functionality
○ Nova and Neutron network
○ Cluster Scaling
○ Scale Up
○ Swift Integration
○ Cinder Support
○ Data Locality
○ EDP
● Apache Ambari REST APIs used for cluster
provisioning
● Monitoring/Management of clusters via Ambari
● Full support for multiple HDP stacks
● HDP pre-installed or generic VM images
HDP Plugin Stack Support
HDP 1.3 (Available)
● NameNode
● Secondary NameNode
● DataNode
● HDFS
● ZooKeeper
● Ambari Server/Agent
● HCatalog
● Sqoop
● Job Tracker
● Task Tracker
● MapReduce
● Hive
● MySQL
● Pig
● WebHCat Server
● Oozie
● Ganglia
● Nagios
● HBase
HDP 2.0 (Available)
● History Server
● MapReduce 2 / YARN
● Resource Manager
● YARN Client
HDP 2.1 (Coming Soon!)
● Storm
● Falcon
HDP 2.1 + (Roadmap)
● SOLR
● Cascading
Ambari Blueprints
● Two primary goals of Ambari Blueprints
○ Ability to export a complete description of a running
cluster
○ Provide API-based cluster installations based on a
self-contained cluster description
● Blueprints contain cluster topology and configuration
information
● Enables interesting use cases between physical and virtual,
including OpenStack/Sahara
Blueprint API
1. BLUEPRINT: POST /blueprints/my-blueprint
2. CLUSTER INSTANCE: POST /clusters/MyCluster
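The two-step flow can be sketched as a pair of requests built with the Python standard library. The requests are constructed but not sent. The host, port, and credentials are placeholders; the `/api/v1` prefix and the `X-Requested-By` header reflect how the Ambari REST API is commonly called and should be confirmed against the Ambari documentation. The payloads are trimmed stand-ins, not complete definitions.

```python
import base64
import json
import urllib.request

# Placeholders for illustration: host and credentials are assumptions.
AMBARI_API = "http://ambari.example.com:8080/api/v1"
USER, PASSWORD = "admin", "admin"


def ambari_post(path, payload):
    """Build (but do not send) a POST request against the Ambari REST API."""
    credentials = base64.b64encode(f"{USER}:{PASSWORD}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{AMBARI_API}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {credentials}",
            "X-Requested-By": "ambari",  # expected by Ambari on modifying requests
        },
        method="POST",
    )


# Trimmed stand-in payloads; a real blueprint lists components per host group.
blueprint = {
    "host_groups": [{"name": "uber-host", "components": [{"name": "NAMENODE"}]}],
    "Blueprints": {"stack_name": "HDP", "stack_version": "2.0"},
}
cluster_instance = {
    "blueprint": "my-blueprint",
    "host_groups": [{"name": "uber-host", "hosts": [{"fqdn": "c6401.ambari.apache.org"}]}],
}

steps = [
    ambari_post("/blueprints/my-blueprint", blueprint),    # step 1: register the blueprint
    ambari_post("/clusters/MyCluster", cluster_instance),  # step 2: instantiate a cluster from it
]
for number, step in enumerate(steps, start=1):
    print(number, step.get_method(), step.full_url)
```

The order matters: the cluster instance refers to the blueprint by name, so the blueprint has to be registered first.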
Example: Single-Node Definitions
BLUEPRINT
{
  "configurations" : [
    {
      "hdfs-site" : {
        "dfs.namenode.name.dir" : "/hadoop/nn"
      }
    }
  ],
  "host_groups" : [
    {
      "name" : "uber-host",
      "components" : [
        { "name" : "NAMENODE" },
        { "name" : "SECONDARY_NAMENODE" },
        { "name" : "DATANODE" },
        { "name" : "HDFS_CLIENT" },
        { "name" : "RESOURCEMANAGER" },
        { "name" : "NODEMANAGER" },
        { "name" : "YARN_CLIENT" },
        { "name" : "HISTORYSERVER" },
        { "name" : "MAPREDUCE2_CLIENT" }
      ],
      "cardinality" : "1"
    }
  ],
  "Blueprints" : {
    "blueprint_name" : "single-node-hdfs-yarn",
    "stack_name" : "HDP",
    "stack_version" : "2.0"
  }
}
CLUSTER INSTANCE
{
  "blueprint" : "single-node-hdfs-yarn",
  "host_groups" : [
    {
      "name" : "uber-host",
      "hosts" : [
        {
          "fqdn" : "c6401.ambari.apache.org"
        }
      ]
    }
  ]
}
Description
• Single-node cluster
• Use HDP 2.0 Stack
• HDFS + YARN + MR2
• Everything on c6401
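A blueprint and a cluster instance have to agree on the blueprint name and on host group names. The sketch below checks that pairing before anything is posted. The function is hypothetical, not part of Ambari, and treating a numeric "cardinality" as an exact host count is an assumption made for this example. The component list is shortened to keep the block small.

```python
def check_cluster_instance(blueprint, instance):
    """Return a list of problems found when matching an instance to its blueprint."""
    problems = []
    expected = blueprint["Blueprints"]["blueprint_name"]
    if instance["blueprint"] != expected:
        problems.append(
            f"instance references blueprint {instance['blueprint']!r}, expected {expected!r}"
        )

    hosts_by_group = {g["name"]: g.get("hosts", []) for g in instance["host_groups"]}
    for group in blueprint["host_groups"]:
        name = group["name"]
        if name not in hosts_by_group:
            problems.append(f"host group {name!r} has no hosts assigned")
            continue
        hosts = hosts_by_group.pop(name)
        # Assumption: a purely numeric cardinality is read as an exact host count.
        cardinality = str(group.get("cardinality", ""))
        if cardinality.isdigit() and len(hosts) != int(cardinality):
            problems.append(f"host group {name!r} expects {cardinality} host(s), got {len(hosts)}")

    for name in hosts_by_group:
        problems.append(f"host group {name!r} is not defined in the blueprint")
    return problems


blueprint = {
    "host_groups": [
        {
            "name": "uber-host",
            "components": [{"name": "NAMENODE"}, {"name": "DATANODE"}],  # shortened
            "cardinality": "1",
        }
    ],
    "Blueprints": {
        "blueprint_name": "single-node-hdfs-yarn",
        "stack_name": "HDP",
        "stack_version": "2.0",
    },
}
instance = {
    "blueprint": "single-node-hdfs-yarn",
    "host_groups": [{"name": "uber-host", "hosts": [{"fqdn": "c6401.ambari.apache.org"}]}],
}

print(check_cluster_instance(blueprint, instance) or "instance matches blueprint")
```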
Demo - youtu.be/vmry_kXqn4c
● http://jayunit100.github.io/bigpetstore/slides
● Bigpetstore
○ A full stack Hadoop application
○ Uses the main players in the Hadoop ecosystem
○ To demonstrate a single domain
○ Just accepted into the Bigtop project!
● Come by the Red Hat booth - G18
Q&A
● Status - Integrated for Juno (Oct 2014)
● Distro - RDO (Fedora/RHEL/CentOS), RHEL
OSP 5, ...
● Home - https://launchpad.net/sahara
● Docs - http://docs.openstack.org/developer/sahara
● Code - https://github.com/openstack/ *sahara*
● Email - openstack-dev w/ [sahara]
● IRC - #openstack-sahara on freenode
Hadoop and OpenStack - Hadoop Summit San Jose 2014

AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
Linux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdfLinux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdf
RHCSA Guru
 

Hadoop and OpenStack - Hadoop Summit San Jose 2014

  • 1. Hadoop and OpenStack Matthew Farrellee, @spinningmatt, Red Hat Sumit Mohanty, @smohanty, Hortonworks
  • 3. OpenStack is a cloud operating system that controls large pools of compute, storage, and networking resources throughout a datacenter, all managed through a dashboard that gives administrators control while empowering their users to provision resources through a web interface.
  • 4. An ecosystem of projects ● Compute - Nova ● Networking - Neutron ● Object Storage - Swift ● Block Storage - Cinder ● Identity - Keystone ● Image Service - Glance ● Dashboard - Horizon ● Telemetry - Ceilometer ● Orchestration - Heat ● Data Processing - Sahara
  • 6. Trends: Hadoop, EC2, OpenStack. www.google.com/trends/explore#q=hadoop,ec2,openstack EC2 beta Aug 25 2006 (http://aws.typepad.com/aws/2006/08/amazon_ec2_beta.html)
  • 8. Data analysis is hard... ● Come up w/ a relevant question ○ The question you answer won’t be the question you set out to ask ○ Mine: Can I predict doctor specialty from what procedures they perform? ● Find the data ○ Tons, little consistency, unknown origin, hoarded ○ Data w/o a dictionary is worse than code w/o comments. Run away!
  • 9. Data analysis is hard... ● Data usability ○ Acceptable license? (Even for Gov’t sets) ■ Mine: Metadata copyrighted by AMA! ○ Private data is often highly protected, no/narrow DMZ ● Explore and clean ○ Two of the oldest people in the medical profession working with Medicare ○ Stephen Glasser graduated in 1773 ○ Cheryl Palma graduated in 1776
  • 10. Data analysis is hard... ● You got some answer to a question you approximately asked ● You must refine the question and process ● Repeat This is hard enough without having to manage tools and infrastructure!
  • 11. Sahara’s goal Make managing Hadoop+ infrastructure and tools so simple that they get out of your way
  • 12. Sahara provides ● Apache Hadoop cluster and workload management ○ Cluster - construct and manage the lifecycle of a Hadoop cluster ○ Workload - workflow for big data processing with Hadoop (AWS EMR-like) ● Through a Python library, REST API, Web UI, command line interface
  • 14. Sahara’s features ● Plugin mechanism - distro choice ● Cluster scaling - elasticity ● Swift integration - data storage ● Cinder integration - persistent HDFS ● Network management with Nova and Neutron ● Anti-affinity, separate services on physical hardware ● Data locality with Swift ● Repeatable cluster creation w/ template mechanism ● http://docs.openstack.org/developer/sahara/userdoc/features.html
  • 15. Storage considerations ● Swift ○ Input/output through Swift HCFS plugin ○ Intermediate data stored in HDFS on cluster ○ Locality when co-locating Swift & nova-compute ● HDFS ○ Local (long lived cluster) and remote (copy in) ● HDFS backed by ephemeral disk or Cinder ○ Ephemeral - /var/lib/nova/instances on compute host ○ Cinder - persistent block devices attached to instances
  • 16. Sahara’s plugin architecture ● This is important! ● It’s where Hadoop distribution vendors integrate their management software ● It’s how users pick different software versions ● Currently: Vanilla (reference impl. w/ Apache versions), HDP (via Ambari), IDH (via Intel Manager), and Spark (w/ minimal CDH)
  • 17. HDP Plugin Overview ● Full support for all Sahara functionality ● Nova and Neutron network ● Cluster Scaling ● Scale Up ● Swift Integration ● Cinder Support ● Data Locality ● EDP ● Apache Ambari REST APIs used for cluster provisioning ● Monitoring/Management of clusters via Ambari ● Full support for multiple HDP stacks ● HDP pre-installed or generic VM images
  • 18. HDP Plugin Stack Support
    HDP 1.3 (Available) ● NameNode ● Secondary NameNode ● DataNode ● HDFS ● ZooKeeper ● Ambari Server/Agent ● HCatalog ● Sqoop ● Job Tracker ● Task Tracker ● MapReduce ● Hive ● MySQL ● Pig ● WebHCat Server ● Oozie ● Ganglia ● Nagios ● HBase
    HDP 2.0 (Available) ● History Server ● MapReduce 2 / YARN ● Resource Manager ● YARN Client
    HDP 2.1 (Coming Soon!) ● Storm ● Falcon
    HDP 2.1+ (Roadmap) ● SOLR ● Cascading
  • 19. Ambari Blueprints ● Two primary goals of Ambari Blueprints ○ Ability to export a complete description of a running cluster ○ Provide API-based cluster installations based on a self-contained cluster description ● Blueprints contain cluster topology and configuration information ● Enables interesting use cases between physical and virtual, including OpenStack/Sahara
  • 21. Example: Single-Node Definitions
    Description: single-node cluster; uses the HDP 2.0 stack; HDFS + YARN + MR2; everything on c6401.
    BLUEPRINT: { "configurations" : [ { "hdfs-site" : { "dfs.namenode.name.dir" : "/hadoop/nn" } } ], "host_groups" : [ { "name" : "uber-host", "components" : [ { "name" : "NAMENODE" }, { "name" : "SECONDARY_NAMENODE" }, { "name" : "DATANODE" }, { "name" : "HDFS_CLIENT" }, { "name" : "RESOURCEMANAGER" }, { "name" : "NODEMANAGER" }, { "name" : "YARN_CLIENT" }, { "name" : "HISTORYSERVER" }, { "name" : "MAPREDUCE2_CLIENT" } ], "cardinality" : "1" } ], "Blueprints" : { "blueprint_name" : "single-node-hdfs-yarn", "stack_name" : "HDP", "stack_version" : "2.0" } }
    CLUSTER INSTANCE: { "blueprint" : "single-node-hdfs-yarn", "host_groups" : [ { "name" : "uber-host", "hosts" : [ { "fqdn" : "c6401.ambari.apache.org" } ] } ] }
  • 22. Demo - youtu.be/vmry_kXqn4c ● http://jayunit100.github.io/bigpetstore/slides ● Bigpetstore ○ A full stack Hadoop application ○ Uses the main players in the Hadoop ecosystem ○ To demonstrate a single domain ○ Just accepted into the Bigtop project! ● Come by the Red Hat booth - G18
  • 23. Q&A ● Status - Integrated for Juno (Oct 2014) ● Distro - RDO (Fedora/RHEL/CentOS), RHEL OSP 5, ... ● Home - https://launchpad.net/sahara ● Docs - http://docs.openstack.org/developer/sahara ● Code - https://github.com/openstack/*sahara* ● Email - openstack-dev w/ [sahara] ● IRC - #openstack-sahara on freenode
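
The single-node example on slide 21 pairs a blueprint (topology and configuration) with a cluster instance (which hosts fill each host group). A minimal Python sketch of that pairing follows. The two dictionaries reproduce the slide's JSON; the helper `check_instance` is hypothetical, not part of Ambari or Sahara, and it only understands plain integer cardinalities such as "1".

```python
import json

# Component names exactly as listed on slide 21.
COMPONENTS = (
    "NAMENODE", "SECONDARY_NAMENODE", "DATANODE", "HDFS_CLIENT",
    "RESOURCEMANAGER", "NODEMANAGER", "YARN_CLIENT",
    "HISTORYSERVER", "MAPREDUCE2_CLIENT",
)

# Blueprint: one host group that carries every HDFS, YARN and MR2 component.
BLUEPRINT = {
    "configurations": [
        {"hdfs-site": {"dfs.namenode.name.dir": "/hadoop/nn"}}
    ],
    "host_groups": [
        {
            "name": "uber-host",
            "components": [{"name": name} for name in COMPONENTS],
            "cardinality": "1",
        }
    ],
    "Blueprints": {
        "blueprint_name": "single-node-hdfs-yarn",
        "stack_name": "HDP",
        "stack_version": "2.0",
    },
}

# Cluster instance: maps concrete hosts onto the blueprint's host groups.
CLUSTER_INSTANCE = {
    "blueprint": "single-node-hdfs-yarn",
    "host_groups": [
        {"name": "uber-host", "hosts": [{"fqdn": "c6401.ambari.apache.org"}]}
    ],
}


def check_instance(blueprint, instance):
    """Hypothetical helper: list mismatches between an instance and its blueprint."""
    problems = []
    if instance["blueprint"] != blueprint["Blueprints"]["blueprint_name"]:
        problems.append("blueprint name mismatch")

    declared = {group["name"]: group for group in blueprint["host_groups"]}
    mapped = {group["name"]: group for group in instance["host_groups"]}

    # Host groups the instance mentions but the blueprint never declared.
    for name in sorted(mapped.keys() - declared.keys()):
        problems.append(f"unknown host group: {name}")

    # Compare host counts against integer cardinalities only (assumption:
    # other cardinality forms are skipped rather than interpreted).
    for name, group in declared.items():
        hosts = mapped.get(name, {}).get("hosts", [])
        cardinality = group.get("cardinality", "")
        if cardinality.isdigit() and len(hosts) != int(cardinality):
            problems.append(
                f"host group {name} expects {cardinality} host(s), got {len(hosts)}"
            )
    return problems


if __name__ == "__main__":
    print(check_instance(BLUEPRINT, CLUSTER_INSTANCE))
    print(json.dumps(CLUSTER_INSTANCE, indent=2))
```

Both dictionaries serialize with `json.dumps` into the request bodies shown on the slide. How they are submitted to Ambari (endpoints, authentication, headers) is not covered in the deck, so that step is left out here.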