DataOps with Project Amaterasu

Yaniv Rodenski
Karel Alfonso

What Data Pipelines Are Made Of
• Big Data applications:
• Ingestion
• Storage
• Processing
• Serving
• Workflows
• Machine learning
• Data Sources and Destinations
• Tests?
• Schemas??

Archetypes of Data Pipeline Builders
Data People (Data Scientists/Analysts/BI Devs)
• Exploratory workloads
• Data centric
• Simple deployment
Software Developers
• Code centric
• Heavy on methodologies
• Heavy tooling
• Very complex deployment

Making Big Data Teams Scale
• Scaling teams is hard
• Scaling Big Data teams is harder
• Different mindsets among data professionals and engineers
• Mixture of technologies
• Data as integration point
• Often schema-less
• Lack of tools

Continuous Delivery
• Keep software in a production-ready state
• Test all the changes: unit, integration
• Exercise deployments
• Faster feedback cycle

DevOps & Collaboration
• No silos
• Autonomous teams
• Feedback
• Automation
• Build quality in
• Shared responsibility

The case for CI/CD/DevOps in Big Data Projects
• Coordination: data engineers, analysts, business, ops
• Integrate and test critical jobs
• Complex infrastructure: multiple distributed systems
• Need to decouple cluster operation via APIs/DSLs
• DevOps team to manage cluster operations: scaling, monitoring, deployment
• Include CI/CD practices as part of the delivery process


How are these techniques applicable to Big Data applications?

What Do We Need for Deploying our apps?
• Source control system: Git, Hg, etc
• CI process to run tests and package app
• A repository to store packaged app
• A repository to store configuration
• An API/DSL to deploy to the cluster
• Mechanism to monitor the behaviour and performance of the app

Who are we?
Software developers with years of Big Data experience
What do we want?
Simple and robust way to deploy Big Data applications
How will we get it?
Write thousands of lines of code on top of Mesos

Amaterasu - Simple Continually Deployed Data Apps
• Amaterasu is the Shinto goddess of the sun
• In the Japanese manga series Naruto, Amaterasu is a supernatural power in the shape of a black flame that can only be put out by its sender
• Started as a framework to reliably execute Spark driver programs

Amaterasu - Simple Continually Deployed Data Apps
• Big Data apps in multiple frameworks (currently only Spark is supported)
• Multiple languages (soon)
• Workflow as YAML
• Simple to write, easy to deploy
• Reliable execution (via Mesos)
• Multiple environments

Big Data Pipeline Ops Requirements
• Support managing multiple distributed technologies: Apache Spark, HDFS, Kafka, Cassandra, etc.
• Treat the data center as the OS while providing resource isolation, scalability and fault tolerance
• Ability to run multiple tasks per machine to maximize utilization

Why Mesos?
• General-purpose, battle-tested cluster resource scheduler
• Can run major modern Big Data systems: Hadoop, Spark, Kafka, Cassandra
• Can deploy Spark as part of the execution
• Supports scheduled and long-running apps
• Improves resource management and efficiency
• Great APIs
• DC/OS provides an even richer environment

Amaterasu Repositories
• Jobs are defined in repositories
• Current implementation - git repositories
• Support for local directories is planned for a future release
• Repo structure:
• maki.yml - The workflow definition
• src - a folder containing the actions (spark scripts, etc.) to be executed
• env - a folder containing configuration per environment
• Benefits of using git:
• Branching
• Tooling
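
The repository layout above can be checked mechanically before a job is submitted. The following is a minimal sketch in plain Scala, assuming only the three entries named on the slide (maki.yml, src, env). The object and method names are hypothetical and are not part of Amaterasu.

```scala
import java.nio.file.{Files, Path}

// Hypothetical helper, not part of Amaterasu: reports which of the
// entries named on the slide are missing from a job repository.
object RepoLayoutSketch {
  // Each required entry is paired with the check it must pass.
  private val required: List[(String, Path => Boolean)] = List(
    ("maki.yml", (p: Path) => Files.isRegularFile(p)), // the workflow definition
    ("src", (p: Path) => Files.isDirectory(p)),        // the actions
    ("env", (p: Path) => Files.isDirectory(p))         // per-environment configuration
  )

  // Returns the names of the required entries that are absent under repoRoot.
  def missingEntries(repoRoot: Path): List[String] =
    required.collect {
      case (name, isPresent) if !isPresent(repoRoot.resolve(name)) => name
    }
}
```

An empty result means the checkout has the expected shape. A CI step could run such a check on every branch before packaging.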

Workflow DSL - maki.yml
---
job-name: amaterasu-test
flow:
  # Actions
  - name: start
    type: spark-scala
    file: file.scala
  - name: step2
    type: spark-scala
    file: file2.scala
    # Error handling action
    error:
      name: handle-error
      type: spark-scala
      file: cleanup.scala
...
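
To make the semantics of the maki.yml workflow above concrete, here is a toy model in plain Scala. It is a sketch under stated assumptions, not Amaterasu's implementation: the names Action, Job and WorkflowSketch are hypothetical, the execute function stands in for submitting a file to the cluster, and stopping the flow after the first failure is an assumption the slides do not confirm.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical model of one entry under `flow:`; not Amaterasu's API.
final case class Action(
    name: String,
    actionType: String,
    file: String,
    error: Option[Action] = None // optional error-handling action
)

final case class Job(jobName: String, flow: List[Action])

object WorkflowSketch {
  // The job from the slide, written out as plain Scala values.
  val job: Job = Job(
    "amaterasu-test",
    List(
      Action("start", "spark-scala", "file.scala"),
      Action("step2", "spark-scala", "file2.scala",
        error = Some(Action("handle-error", "spark-scala", "cleanup.scala")))
    )
  )

  // Runs the actions in order and returns the names of the actions attempted.
  // Assumption: on the first failure the action's error handler (if any)
  // runs once and the rest of the flow is skipped.
  def run(job: Job)(execute: Action => Unit): List[String] = {
    @annotation.tailrec
    def loop(remaining: List[Action], done: List[String]): List[String] =
      remaining match {
        case Nil => done.reverse
        case action :: rest =>
          Try(execute(action)) match {
            case Success(_) => loop(rest, action.name :: done)
            case Failure(_) =>
              val handled = action.error.map { handler =>
                Try(execute(handler)) // best-effort clean-up, its own failure is ignored
                handler.name
              }
              (handled.toList ::: action.name :: done).reverse
          }
      }
    loop(job.flow, Nil)
  }
}
```

With an execute function that fails on step2, run attempts start, step2 and then handle-error, which is the pairing of actions and error-handling actions that the slide annotates.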

Amaterasu is not a workflow engine, it’s a deployment tool that understands that Big Data applications are rarely deployed independently of other Big Data applications

Actions DSL
• Your Spark code, written in Scala (other languages in future)
• Few changes:
• Don’t create a new sc/sqlContext; use the one in scope, or access it via AmaContext.sc and AmaContext.sqlContext
• AmaContext.getDataFrame and AmaContext.getRDD are used to access data from previously executed actions

Actions DSL (in action)
Action 1 ("start"):
val data = Array(1, 2, 3, 4, 5)
val x = data.tail
val rdd = AmaContext.sc.parallelize(data)
val odd = rdd.filter(n => n % 2 != 0)
Action 2:
import io.shinto.amaterasu.runtime._
// Read the values exported by the "start" action
val oddRdd = AmaContext.getRDD[Int]("start", "rdd")
  .filter(x => x % 2 != 0) // keep the odd numbers
oddRdd.take(5).foreach(println)
val highNoDf = AmaContext.getDataFrame("start", "odd")
  .where("_1 > 3")
highNoDf.write.json("file:///tmp/test1")
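
The hand-off between the two actions can be pictured as a registry keyed by action name and variable name. The sketch below illustrates that idea in plain Scala, with ordinary collections in place of RDDs and DataFrames. ToyContext, publish and get are hypothetical names. How Amaterasu actually persists and shares values between actions is not described in the slides, so the in-memory map is purely an assumption.

```scala
import scala.collection.mutable

// Hypothetical stand-in for AmaContext: an in-memory registry keyed by
// (action name, variable name). Not Amaterasu's implementation.
object ToyContext {
  private val published = mutable.Map.empty[(String, String), Seq[Int]]

  // Makes a value from one action visible to the actions that run after it.
  def publish(action: String, variable: String, value: Seq[Int]): Unit =
    published((action, variable)) = value

  // Same call shape as AmaContext.getRDD("start", "rdd") on the slide.
  def get(action: String, variable: String): Seq[Int] =
    published.getOrElse(
      (action, variable),
      throw new NoSuchElementException(s"$action.$variable was not published"))
}

object ActionsSketch {
  // Action 1 ("start"): build the data and publish it under its own name.
  def start(): Unit = {
    val data = Seq(1, 2, 3, 4, 5)
    ToyContext.publish("start", "rdd", data)
    ToyContext.publish("start", "odd", data.filter(_ % 2 != 0))
  }

  // Action 2: read what "start" published and keep the values above 3.
  def step2(): Seq[Int] =
    ToyContext.get("start", "odd").filter(_ > 3)
}
```

The point of the design is that the second action refers to its input by the name of the producing action rather than by a storage path, so the two scripts stay decoupled from where the data physically lives.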

Environments
• Configuration is stored per environment
• Stored as JSON
• Contains:
• Spark master URI
• Input/output path
• Work dir
• User-defined key-values

production.json
{
  "name": "production",
  "sparkMasterUrl": "mesos://server1:5050",
  "inputPath": "hdfs://hdfsprd:9000/user/amaterasu/input",
  "outputPath": "hdfs://hdfsprd:9000/user/amaterasu/output",
  "workingDir": "alluxio://server3:19998/",
  "configuration": {
    "spark.cassandra.connection.host": "cassie-prod",
    "sourceTable": "documents"
  }
}

dev.json
{
  "name": "test",
  "sparkMasterUrl": "local[*]",
  "inputRootPath": "file:///tmp/input",
  "outputRootPath": "file:///tmp/output",
  "workingDir": "file:///tmp/work",
  "configuration": {
    "spark.cassandra.connection.host": "127.0.0.1",
    "sourceTable": "documents"
  }
}

Environments in the Actions DSL
val oddRdd = AmaContext.getRDD[Int]("start", "rdd").filter(x => x % 2 != 0)
oddRdd.take(5).foreach(println)
val highNoDf = AmaContext.getDataFrame("start", "x").where("_1 > 3")
// The output location comes from the active environment's configuration
highNoDf.write.json(Env.outputPath)
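
The following sketch shows why per-environment configuration keeps action code unchanged between dev and production. It models the two JSON files as plain Scala values rather than parsing them. Environment, EnvSketch, load and outputLocation are hypothetical names, and mapping dev.json's inputRootPath/outputRootPath onto the same fields as production.json's inputPath/outputPath is an assumption, since the two files on the slides use different key names.

```scala
// Hypothetical model of an environment file; field names follow production.json.
final case class Environment(
    name: String,
    sparkMasterUrl: String,
    inputPath: String,
    outputPath: String,
    workingDir: String,
    configuration: Map[String, String]
)

object EnvSketch {
  val production: Environment = Environment(
    name = "production",
    sparkMasterUrl = "mesos://server1:5050",
    inputPath = "hdfs://hdfsprd:9000/user/amaterasu/input",
    outputPath = "hdfs://hdfsprd:9000/user/amaterasu/output",
    workingDir = "alluxio://server3:19998/",
    configuration = Map(
      "spark.cassandra.connection.host" -> "cassie-prod",
      "sourceTable" -> "documents")
  )

  // Values taken from dev.json (assumption: its *RootPath keys map to these fields).
  val dev: Environment = Environment(
    name = "test",
    sparkMasterUrl = "local[*]",
    inputPath = "file:///tmp/input",
    outputPath = "file:///tmp/output",
    workingDir = "file:///tmp/work",
    configuration = Map(
      "spark.cassandra.connection.host" -> "127.0.0.1",
      "sourceTable" -> "documents")
  )

  // Keyed by file name without the .json extension.
  private val environments: Map[String, Environment] =
    Map("production" -> production, "dev" -> dev)

  def load(envName: String): Option[Environment] = environments.get(envName)

  // Action code only sees the selected environment, so it never hard-codes a path.
  def outputLocation(env: Environment, dataset: String): String = {
    val base = env.outputPath.stripSuffix("/")
    s"$base/$dataset"
  }
}
```

Calling outputLocation with the dev environment yields a path under file:///tmp/output, while the same call with production yields an HDFS path, with no change to the calling code.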

Future Development
• Continuous integration and test automation
• R, shell and Python support (R is already in progress)
• Extend environments to support:
• Full Spark configuration (spark-defaults.conf, etc.)
• Extendable configuration model
• Better tooling
• DC/OS Universe package
• Other frameworks: Flink, Vowpal Wabbit
• YARN?

Getting started
Amaterasu + demos: https://github.com/shintoio/
Slack: http://shintoio.slack.com
