Hadoop and OpenStack - Hadoop Summit San Jose 2014

Hadoop and OpenStack
Matthew Farrellee, @spinningmatt, Red Hat
Sumit Mohanty, @smohanty, Hortonworks

OpenStack is
A cloud operating system that controls large
pools of compute, storage, and networking
resources throughout a datacenter, all
managed through a dashboard that gives
administrators control while empowering their
users to provision resources through a web
interface.

An ecosystem of projects
● Compute - Nova
● Networking - Neutron
● Object Storage - Swift
● Block Storage - Cinder
● Identity - Keystone
● Image Service - Glance
● Dashboard - Horizon
● Telemetry - Ceilometer
● Orchestration - Heat
● Data Processing - Sahara

Trends
[Chart: Google Trends search interest for Hadoop, EC2, and OpenStack]
Source: www.google.com/trends/explore#q=hadoop,ec2,openstack
EC2 beta Aug 25 2006 (http://aws.typepad.com/aws/2006/08/amazon_ec2_beta.html)

Data analysis is hard...
● Come up w/ a relevant question
○ The question you answer won’t be the question you
set out to ask
○ Mine: Can I predict doctor specialty from what
procedures they perform?
● Find the data
○ Tons of it, with little consistency, unknown origin, hoarded
○ Data w/o a dictionary is worse than code w/o
comments. Run away!

● Data usability
○ Acceptable license? (Even for Gov’t sets)
■ Mine: Metadata copyrighted by AMA!
○ Private is often highly protected, no/narrow DMZ
● Explore and clean
○ Two of the oldest people in the medical profession
working with Medicare:
○ Stephen Glasser graduated in 1773
○ Cheryl Palma graduated in 1776
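
The graduation years above are the kind of error a simple range check catches during exploration. A minimal sketch, assuming each record is a dict with hypothetical `name` and `grad_year` fields (the real data set's column names are not given in the slides), and an arbitrary 80-year career cutoff:

```python
from datetime import date


def implausible_graduation_years(records, today=None, max_career_years=80):
    """Return records whose graduation year is in the future or too long ago.

    The 80-year cutoff is an arbitrary assumption, not a rule from the data.
    """
    today = today or date.today()
    earliest = today.year - max_career_years
    return [r for r in records if not (earliest <= r["grad_year"] <= today.year)]


# Two rows from the slide plus one made-up plausible row for contrast.
records = [
    {"name": "Stephen Glasser", "grad_year": 1773},
    {"name": "Cheryl Palma", "grad_year": 1776},
    {"name": "Hypothetical Provider", "grad_year": 1998},
]

suspect = implausible_graduation_years(records, today=date(2014, 6, 3))
for row in suspect:
    print(f"check source data: {row['name']} graduated in {row['grad_year']}")
```

Flagged rows are better reviewed than silently dropped, since the fault may lie in how the data was entered or parsed.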

● You got some answer to a question you
approximately asked
● You must refine the question and process
● Repeat
This is hard enough without having to manage
tools and infrastructure!

Sahara’s goal
Make managing Hadoop+ infrastructure and
tools so simple that they get out of your way

Sahara provides
● Apache Hadoop cluster and workload
management
○ Cluster - construct and manage the lifecycle of a
Hadoop cluster
○ Workload - workflow for big data processing with
Hadoop (AWS EMR-like)
● Through a Python library, REST API, Web
UI, command line interface

Sahara’s architecture
[Architecture diagram. Components shown:
● Clients: Horizon (Sahara Pages), Sahara Python Client
● Sahara Service: REST API, Auth, Data Access Layer, Cluster Configuration Manager, Resources Orchestration Manager, Job Manager, Job Sources, Data Sources, Vendors Plugins
● OpenStack services: Keystone, Heat, Nova, Glance, Cinder, Neutron, Swift, Trove DB
● Provisioned resources: Hadoop VMs]

Sahara’s features
● Plugin mechanism - distro choice
● Cluster scaling - elasticity
● Swift integration - data storage
● Cinder integration - persistent HDFS
● Network management with Nova and Neutron
● Anti-affinity - separate services across physical hardware
● Data locality with Swift
● Repeatable cluster creation w/ template mechanism
● http://docs.openstack.org/developer/sahara/userdoc/features.html
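
To make the template mechanism concrete, here is a rough sketch of the kind of payloads a client might build for node group and cluster templates. The field names follow my understanding of the Sahara REST API of this era and should be checked against the documentation linked above; the plugin name, Hadoop version, flavor ID, process names, and template IDs are all illustrative assumptions, not values from the talk.

```python
import json

PLUGIN = "vanilla"        # assumption: reference plugin
HADOOP_VERSION = "1.2.1"  # assumption: a version that plugin offered at the time


def node_group_template(name, processes, flavor_id, volumes_per_node=0, volumes_size=0):
    """Describe one kind of node; volumes_* > 0 asks for Cinder-backed storage."""
    return {
        "name": name,
        "plugin_name": PLUGIN,
        "hadoop_version": HADOOP_VERSION,
        "flavor_id": flavor_id,
        "node_processes": list(processes),
        "volumes_per_node": volumes_per_node,
        "volumes_size": volumes_size,  # GB per volume
    }


def cluster_template(name, groups, anti_affinity=()):
    """Combine node groups; `groups` is a list of (name, template_id, count)."""
    return {
        "name": name,
        "plugin_name": PLUGIN,
        "hadoop_version": HADOOP_VERSION,
        "node_groups": [
            {"name": g, "node_group_template_id": tid, "count": n}
            for g, tid, n in groups
        ],
        # Processes listed here should land on different physical hosts.
        "anti_affinity": list(anti_affinity),
    }


master = node_group_template("master", ["namenode", "jobtracker"], flavor_id="2")
worker = node_group_template("worker", ["datanode", "tasktracker"], flavor_id="2",
                             volumes_per_node=2, volumes_size=10)
cluster = cluster_template(
    "demo-cluster",
    [("master", "<master-template-id>", 1), ("worker", "<worker-template-id>", 3)],
    anti_affinity=["datanode"],
)
print(json.dumps(cluster, indent=2))
```

Because the cluster is described as data, the same template can be launched repeatedly, and scaling amounts to changing a node group's count.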

Storage considerations
● Swift
○ Input/output through Swift HCFS plugin
○ Intermediate data stored in HDFS on cluster
○ Locality when co-locating swift & nova-compute
● HDFS
○ Local (long lived cluster) and remote (copy in)
● HDFS backed by ephemeral disk or Cinder
○ Ephemeral - /var/lib/nova/instances on compute host
○ Cinder - persistent block devices attached to instances
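
As a small illustration of the Swift input/output path, the sketch below builds the URL a job would use to reference a Swift object. It assumes the `swift://<container>.sahara/<object>` convention described in the Sahara documentation; the container and object names are made up.

```python
def swift_url(container, object_path):
    """Build a Swift data URL in the form Sahara's Hadoop-Swift integration expects.

    Assumption: the ".sahara" suffix on the container name is the provider
    name configured for the Swift filesystem on Sahara-built clusters.
    """
    if not container or "/" in container:
        raise ValueError("container must be a non-empty name without '/'")
    return f"swift://{container}.sahara/{object_path.lstrip('/')}"


# Hypothetical job input and output locations.
job_input = swift_url("demo", "input/claims.csv")
job_output = swift_url("demo", "/output/run-1")
print(job_input)
print(job_output)
```

Intermediate data still lands in HDFS on the cluster, so only the job's first read and final write go through Swift.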

Sahara’s plugin architecture
● This is important!
● It’s where Hadoop distribution vendors
integrate their management software
● It’s how users pick different software
versions
● Currently: Vanilla (reference impl. w/ Apache
versions), HDP (via Ambari), IDH (via Intel
Manager), and Spark (w/ minimal CDH)

HDP Plugin Overview
● Full support for all Sahara functionality
○ Nova and Neutron network
○ Cluster Scaling
○ Scale Up
○ Swift Integration
○ Cinder Support
○ Data Locality
○ EDP
● Apache Ambari REST APIs used for cluster provisioning
● Monitoring/management of clusters via Ambari
● Full support for multiple HDP stacks
● HDP pre-installed or generic VM images

HDP Plugin Stack Support
HDP 1.3 (Available)
● NameNode
● Secondary NameNode
● DataNode
● HDFS
● ZooKeeper
● Ambari Server/Agent
● HCatalog
● Sqoop
● Job Tracker
● Task Tracker
● MapReduce
● Hive
● MySQL
● Pig
● WebHCat Server
● Oozie
● Ganglia
● Nagios
● HBase
HDP 2.0 (Available)
● History Server
● MapReduce 2 / YARN
● Resource Manager
● YARN Client
HDP 2.1 (Coming Soon!)
● Storm
● Falcon
HDP 2.1+ (Roadmap)
● SOLR
● Cascading

Ambari Blueprints
● Two primary goals of Ambari Blueprints
○ Export a complete description of a running cluster
○ Provide API-based cluster installation from a self-contained
cluster description
● Blueprints contain cluster topology and configuration
information
● Enables interesting use cases spanning physical and virtual
environments, including OpenStack/Sahara
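
The export goal can be sketched as a single GET against a running cluster. The snippet below only builds the URL; it assumes Ambari's `/api/v1` prefix and the `format=blueprint` query parameter, and the host name is a placeholder.

```python
from urllib.parse import quote, urlencode


def blueprint_export_url(ambari_base, cluster_name):
    """URL that asks Ambari to render a running cluster as a blueprint.

    Assumptions: the /api/v1 prefix and the format=blueprint parameter,
    as documented for Ambari Blueprints; authentication is not shown.
    """
    query = urlencode({"format": "blueprint"})
    name = quote(cluster_name, safe="")
    return f"{ambari_base.rstrip('/')}/api/v1/clusters/{name}?{query}"


# Placeholder Ambari server address.
url = blueprint_export_url("http://ambari.example.com:8080/", "MyCluster")
print(url)
```

The response to such a request is what makes it possible to describe a cluster once and rebuild it elsewhere, including on OpenStack through Sahara.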

Blueprint API
1. BLUEPRINT: POST /blueprints/my-blueprint
2. CLUSTER INSTANCE: POST /clusters/MyCluster
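
The two calls can be sketched with the Python standard library. This builds the requests without sending them. The `/api/v1` prefix and the `X-Requested-By` header are assumptions based on Ambari's REST API, the server address is a placeholder, the request bodies are trimmed to a skeleton, and authentication is omitted.

```python
import json
from urllib.request import Request

AMBARI = "http://ambari.example.com:8080"  # placeholder address


def ambari_post(base, path, body):
    """Build (but do not send) a JSON POST to the Ambari REST API."""
    return Request(
        url=f"{base.rstrip('/')}/api/v1{path}",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-Requested-By": "ambari", "Content-Type": "application/json"},
        method="POST",
    )


# Step 1: register the blueprint (topology + configuration, no hosts yet).
blueprint = {
    "host_groups": [],
    "Blueprints": {"stack_name": "HDP", "stack_version": "2.0"},
}
step1 = ambari_post(AMBARI, "/blueprints/my-blueprint", blueprint)

# Step 2: create a cluster instance that maps the blueprint onto real hosts.
instance = {"blueprint": "my-blueprint", "host_groups": []}
step2 = ambari_post(AMBARI, "/clusters/MyCluster", instance)

for req in (step1, step2):
    print(req.get_method(), req.full_url)
# To actually send: urllib.request.urlopen(req), with credentials added.
```

Keeping the blueprint separate from the cluster instance is what lets one blueprint be reused for many clusters.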

Example: Single-Node Definitions
Description
• Single-node cluster
• Use HDP 2.0 Stack
• HDFS + YARN + MR2
• Everything on c6401

BLUEPRINT
{
  "configurations" : [
    {
      "hdfs-site" : {
        "dfs.namenode.name.dir" : "/hadoop/nn"
      }
    }
  ],
  "host_groups" : [
    {
      "name" : "uber-host",
      "components" : [
        { "name" : "NAMENODE" },
        { "name" : "SECONDARY_NAMENODE" },
        { "name" : "DATANODE" },
        { "name" : "HDFS_CLIENT" },
        { "name" : "RESOURCEMANAGER" },
        { "name" : "NODEMANAGER" },
        { "name" : "YARN_CLIENT" },
        { "name" : "HISTORYSERVER" },
        { "name" : "MAPREDUCE2_CLIENT" }
      ],
      "cardinality" : "1"
    }
  ],
  "Blueprints" : {
    "blueprint_name" : "single-node-hdfs-yarn",
    "stack_name" : "HDP",
    "stack_version" : "2.0"
  }
}

CLUSTER INSTANCE
{
  "blueprint" : "single-node-hdfs-yarn",
  "host_groups" : [
    {
      "name" : "uber-host",
      "hosts" : [
        {
          "fqdn" : "c6401.ambari.apache.org"
        }
      ]
    }
  ]
}
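
The blueprint and the cluster instance are linked only by host group names, so a quick consistency check before posting can save a failed install. This is a sketch of such a check written for this example, not something Ambari provides; the dictionaries are trimmed copies of the two definitions.

```python
def host_group_mismatches(blueprint, cluster_instance):
    """Compare host group names between a blueprint and a cluster instance."""
    declared = {g["name"] for g in blueprint["host_groups"]}
    mapped = {g["name"] for g in cluster_instance["host_groups"]}
    return {
        # Groups the blueprint defines but the instance gives no hosts for.
        "missing_hosts": sorted(declared - mapped),
        # Groups the instance maps hosts to but the blueprint never defines.
        "unknown_groups": sorted(mapped - declared),
    }


blueprint = {
    "host_groups": [
        {"name": "uber-host",
         "components": [{"name": "NAMENODE"}, {"name": "DATANODE"}],
         "cardinality": "1"},
    ],
    "Blueprints": {"blueprint_name": "single-node-hdfs-yarn",
                   "stack_name": "HDP", "stack_version": "2.0"},
}
cluster_instance = {
    "blueprint": "single-node-hdfs-yarn",
    "host_groups": [
        {"name": "uber-host", "hosts": [{"fqdn": "c6401.ambari.apache.org"}]},
    ],
}

result = host_group_mismatches(blueprint, cluster_instance)
print(result)
```

Both lists are empty when every host group in the blueprint has hosts assigned and no extra groups are referenced.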

Demo - youtu.be/vmry_kXqn4c
● http://jayunit100.github.io/bigpetstore/slides
● Bigpetstore
○ A full stack hadoop application
○ Uses the main players in the hadoop ecosystem
○ To demonstrate a single domain
○ Just accepted into the Bigtop project!
● Come by the Red Hat booth - G18

Q&A
● Status - Integrated for Juno (Oct 2014)
● Distro - RDO (Fedora/RHEL/CentOS), RHEL
OSP 5, ...
● Home - https://launchpad.net/sahara
● Docs - http://docs.openstack.org/developer/sahara
● Code - https://github.com/openstack/ *sahara*
● Email - openstack-dev w/ [sahara]
● IRC - #openstack-sahara on freenode
