The Automation Factory

nathan@milford.io
blog.milford.io
twitter.com/NathanMilford
github.com/nmilford

This is NOT strictly a
Cassandra talk.

♫ There's no earthly way of knowing ♫

This is an infrastructure talk.

♫ How your infrastructure's growing. ♫

Startups move fast.

Priorities change.

Infrastructure needs to be able to
pivot, too.

♫ Who knows where business is going.
Or which way the data's flowing. ♫

When you scale up,

so do your problems.

♫ Drives imploding?
IO plateauing? ♫

Not to mention unexpected
disasters.

We lost a whole data center during
Hurricane Sandy.
♫ Is a hurricane a'blowing? ♫

How do you keep up with growth?

♫ There's no earthly way of knowing ♫

How do you deal with failure?

♫ Are the status LEDs a'glowing?
Is the server reaper mowing? ♫

How do you deal with too much
success?

♫ Yes! The danger must be growing
For the data keeps on flowing. ♫

What do you do?

♫ And they're certainly not showing
any signs that they are slowing! ♫

Hold your breath.

Make a wish.

Automate!

♫ Come with me
And you'll be
In a world of systems automation ♫

♫ Take a look
And you’ll see
Into my Chef lucubrations

So log in, install, begin
With the Chef cookbook of my creation
What you'll see might require
Explanation ♫

♫ If you want to view paradise
Simply go to Github and view it
Pull requests welcome, go to it
Want to change the code
A merge will do it ♫

https://github.com/linkedin/glu/
https://github.com/octo/collectd/
https://github.com/opscode/chef/
https://github.com/saltstack/salt/
https://github.com/outbrain/onering/
https://github.com/nmilford/chef-cassandra/
https://github.com/rabbitmq/rabbitmq-server/

def discover_cassandra_schema
  require 'cassandra-cql'
  schema = {}
  server = "#{node[:ipaddress]}:#{node[:Cassandra][:rpc_port]}"

  db = CassandraCQL::Database.new("#{server}") rescue nil
  if db
    db.keyspaces.collect{|s| schema[s.name] =
      s.column_families.collect{|cfname, cfobj| cfname } }
    schema.delete("system")
    schema.delete("OpsCenter")
    return schema
  end
  return nil
end

♫ There is no life I know
To compare with writing automation
Write it once
You’ll be free ♫

*clickity*

*clickity*

*clickity*

♫ To play Diablo 3 ♫

♫ If you want to scale past a petabyte
Just install Chef, Salt and Graphite
If you want to sleep the whole night
Automate the world
It will be all right ♫

♫ There is no life I know
To compare with writing automation
Write it once
You’ll be free ♫

♫ If you truly wish to be. ♫

The Automation Factory
A Journey from Bare Metal to Active Cassandra Node

Cassandra NYC 2011

http://www.slideshare.net/nmilford/cassandra-for-sysadmins

2 Years Later
● 80 billion impressions a month.

● 4 clusters for disparate
use-cases, more in planning.

● 73 Cassandra nodes
across 3 data centers.

Mo' Servers,
Mo' Problems

We got multiple cages of servers.

So... yeah... you can see where
automation might help :)

Automation Attack Plan

● Provisioning!
● Orchestration!
● Command and Control!
● Config Management!
● Monitoring and Alerting!

Provisioning
● Started with Cobbler (which is awesome!)
● High-performance infrastructures are snowflakes and can get out of hand fast.
● No tool worked completely, end to end, and the tool won't write itself.

We Built Our Own: Onering

Note: I am only a moderate Lord of the Rings fan, and the guy who did most of the work on it, Gary Hetzel, is a
Star Trek fan. We are not responsible for any LotR puns.
https://github.com/outbrain/onering/

Onering: Provisioning &
Orchestration
● Initiates/manages provisioning and inventory.
● Acts as an orchestration layer in our automation.
● Keeps all metadata, which is searchable.
● Has a CLI tool and REST API to work with.
● Acts as our single point of truth & final authority on state.

Onering Provisioning Workflow
➔ Developers put in machine requests by role for quarterly order.
➔ Machines show up, get racked and powered on.
➔ Machines boot into the Razor microkernel and report to Onering.
➔ Appropriate nodes get kickstarted & bootstrapped into roles specified.
➔ Additional nodes sit idle in 'allocatable' state.
➔ Once OS is installed, configuration is handed off to...
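
One way to picture the workflow above is as a small state machine per node. This is only a sketch: `allocatable` is the one state the talk names, and every other state name, the `Node` struct and the `triage` helper are hypothetical, not Onering's actual model or API.

```ruby
# Hypothetical node lifecycle. Only 'allocatable' is a state named in the talk.
TRANSITIONS = {
  'racked'       => ['discovered'],                  # boots the microkernel, reports in
  'discovered'   => ['provisioning', 'allocatable'], # claimed for a role, or parked as a spare
  'allocatable'  => ['provisioning'],                # an idle spare gets claimed later
  'provisioning' => ['installed'],                   # kickstart finished
  'installed'    => ['configured']                   # handed off to config management
}.freeze

Node = Struct.new(:name, :role, :state) do
  # Move to new_state, refusing any jump the table does not allow.
  def advance(new_state)
    allowed = TRANSITIONS.fetch(state, [])
    raise ArgumentError, "#{state} -> #{new_state} not allowed" unless allowed.include?(new_state)
    self.state = new_state
    self
  end
end

# Nodes with a requested role get provisioned; the rest sit idle as spares.
def triage(node)
  node.advance(node.role ? 'provisioning' : 'allocatable')
end
```

Keeping the allowed transitions in one table is what makes a "final authority on state" cheap to enforce: anything not in the table is rejected rather than silently accepted.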

Config Management: Chef
● Onering bootstraps into a Chef run.
● Chef installs all the system stuff.
● Chef sets up Java and tunes the OS how we like.
● Chef runs the Cassandra Cookbook.

include_recipe "java"

package "apache-cassandra1" do
  action :install
end

template "/etc/cassandra/conf/cassandra.yaml" do
  owner "cassandra"
  group "cassandra"
  mode "0755"
  source "cassandra.yaml.erb"
end

https://github.com/opscode/chef/

Cassandra Cookbook does it all!
● Builds/mounts disks.
● Handles multiple clusters, different versions.
● Generates configs (in some cases automatically based on hardware profile).
● Connects to local instance and gets the schema.
● Generates collectd config and maintenance script.
● Schedules maintenance.
https://github.com/nmilford/chef-cassandra
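
As an illustration of "configs based on hardware profile", here is a minimal sketch that derives two cassandra.yaml settings from hardware facts. The helper name and its arguments are hypothetical, and the multipliers are the rule of thumb from the comments in the stock cassandra.yaml (16 per data drive for reads, 8 per core for writes). Whether the cookbook uses exactly these is an assumption.

```ruby
# Hypothetical helper, not taken from the cookbook.
# cores:       number of CPU cores on the node
# data_drives: number of drives backing the data directories
def cassandra_tuning(cores:, data_drives:)
  {
    'concurrent_reads'  => 16 * data_drives, # reads are bound by disk
    'concurrent_writes' => 8 * cores         # writes are bound by CPU
  }
end
```

In a recipe, values like these could be fed from Ohai attributes and passed to the cassandra.yaml template as variables, so a new hardware profile needs no hand edits.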

Glu: Continuous Deployment
● Not related to getting a C* node to production, but it's how we get apps there.
● Built at LinkedIn.
● Onering talks to it!
● Holds deployment metadata.
● Maven builds an RPM and dumps it to a repo.
● Glu-Agent yum installs it and performs checks.

https://github.com/linkedin/glu

Command & Control: Salt
Distributed commands:
salt '*ny*' cassandra.column_families
salt 'cass*' cassandra.compactionstats
salt '*stg*' cassandra.info
salt 'cass1.ny.*' cassandra.keyspaces
salt -E 'cass1-(stg|prod)' cassandra.netstats
salt '*' cassandra.tpstats

Scary commands:
salt '*' --batch-size 25% service.restart cassandra
salt '*' -b2 cmd.run 'nodetool -h $(hostname) -p 7199 snapshot'

We actually wrap salt in Onering to provide AAA, as well as to allow use of Onering
metadata for node targeting.

https://github.com/saltstack/salt
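
Why batch the scary commands? A simplified sketch of splitting the targeted hosts into groups so only part of a cluster restarts at once. The `batches` helper and its rounding are assumptions for illustration. Salt's own batch mode works as a rolling window rather than fixed groups.

```ruby
# Simplified illustration, not Salt's implementation.
# batch_size is either a count (2, "2") or a percentage string ("25%").
def batches(hosts, batch_size)
  size =
    if batch_size.to_s.end_with?('%')
      # Percentage of the targeted hosts, rounded up, never less than one.
      [(hosts.size * batch_size.to_f / 100).ceil, 1].max
    else
      Integer(batch_size)
    end
  hosts.each_slice(size).to_a
end
```

With eight nodes and '25%', at most two are restarting at any moment, so the rest keep serving traffic.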

Common Monitoring & Events Bus
● A single infrastructure-wide bus for systems data:
– Metrics
– Events
– Metadata
● Collectd as systems agent.
● RabbitMQ as message bus.
● Graphite as metrics endpoint.
● Working on an events mechanism.
● Each layer should be interchangeable.
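
Part of what keeps the layers interchangeable is that a metric is just a line of text. A minimal sketch of one sample in Graphite's plaintext format (path, value, Unix timestamp). The `graphite_line` helper is hypothetical, and the assumption here is that this is the shape of what travels over the bus.

```ruby
# Format one sample as "<path> <value> <timestamp>\n".
def graphite_line(path, value, time = Time.now)
  "#{path} #{value} #{time.to_i}\n"
end
```

Any agent that can emit lines like this, and any backend that can read them, can be swapped in without touching the other layers.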

Collectd
● Been around forever.
● Had to rebuild the JMX plugin to not use OpenJDK.
● Easy to write plugins and extend.
● Writes to RabbitMQ out of the box.
● Easy to templatize config for Chef.

<% @node[:Cassandra][:Keyspaces].each do |ks| -%>
<% ks[1].each do |cf| -%>
Collect "<%= ks[0] %>.<%= cf %>"
Collect "KeyCache.<%= ks[0] %>.<%= cf %>"
Collect "RowCache.<%= ks[0] %>.<%= cf %>"
<% end -%>
<% end -%>
https://github.com/octo/collectd
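
To see what the template above produces, here is a standalone sketch that renders the same loop with Ruby's stdlib ERB, outside of Chef. The keyspace and column family names are made up, and the schema is passed in as a local variable instead of a node attribute.

```ruby
require 'erb'

# Sample schema in the shape discover_cassandra_schema returns:
# keyspace name => list of column family names. These names are made up.
schema = { 'Recs' => ['Clicks', 'Views'] }

# Same loop as the cookbook template, with the hash pairs destructured.
template = <<~'ERB'
  <% schema.each do |ks, cfs| -%>
  <% cfs.each do |cf| -%>
  Collect "<%= ks %>.<%= cf %>"
  Collect "KeyCache.<%= ks %>.<%= cf %>"
  Collect "RowCache.<%= ks %>.<%= cf %>"
  <% end -%>
  <% end -%>
ERB

# trim_mode '-' makes the "-%>" tags swallow their trailing newline.
rendered = ERB.new(template, trim_mode: '-').result_with_hash(schema: schema)
puts rendered
```

Each column family yields three Collect lines, so adding a column family to the cluster changes the monitoring config on the next Chef run with no manual edit.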

RabbitMQ
● Lots of apps support AMQP.
● Shovel plugin for multi-site.
● Pretty stable.
● I'm not mad at it.

https://github.com/rabbitmq/rabbitmq-server

Graphite

● Plays well with RabbitMQ.
● Easy to get metrics into.
● Scads of functions.
● Easy to get meaningful data out of.

https://launchpad.net/graphite

Graphite Render, Activate!
http://graphite/render?
width=800
&height=600
&from=-2hours
&until=now
&target=sortByMaxima(highestCurrent(collectd.machines
.*.cass2*.GenericJMX.ReadStage.PendingTasks,5))
&target=sortByMaxima(highestCurrent(collectd.machines
.*.cass2*.GenericJMX.MutationStage.PendingTasks,5))
&hideLegend=false
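
Render URLs like the one above are easier to generate than to type. A minimal sketch that assembles the same query with Ruby's stdlib. The `render_url` helper is hypothetical, and the host name `graphite` is the one from the slide.

```ruby
require 'uri'

# Build a Graphite render URL; each target becomes its own target= parameter.
def render_url(host, targets, from: '-2hours', width: 800, height: 600)
  params = [
    ['width', width.to_s],
    ['height', height.to_s],
    ['from', from],
    ['until', 'now']
  ]
  params += targets.map { |t| ['target', t] }
  params << ['hideLegend', 'false']
  # encode_www_form percent-escapes the parentheses and commas in the targets.
  "http://#{host}/render?#{URI.encode_www_form(params)}"
end

stages = %w[ReadStage MutationStage]
targets = stages.map do |stage|
  "sortByMaxima(highestCurrent(collectd.machines.*.cass2*.GenericJMX.#{stage}.PendingTasks,5))"
end
url = render_url('graphite', targets)
```

Swapping the stage list or the host glob gives a new dashboard graph without hand-editing a long query string.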

Alerting: Nagios Self Serve
● Uses Onering for new node discovery.
● Developers add their own alerts based off of Graphite data.
● Ops get fewer alerts and are not a bottleneck.
● Devs are more engaged.
● Everyone is happy.
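
A self-serve alert can be as small as a threshold check over Graphite data. A minimal sketch, assuming the data comes from a render call made with format=json. The `check_series` helper and its thresholds are hypothetical, and only the exit codes are the standard Nagios plugin ones.

```ruby
require 'json'

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# json_body: response body of a Graphite render call made with format=json.
# warn/crit: thresholds chosen by the developer who owns the alert.
# Returns [exit_code, message].
def check_series(json_body, warn:, crit:)
  series = JSON.parse(json_body)
  # Datapoints are [value, timestamp] pairs; value is null when nothing was recorded.
  values = series.flat_map { |s| s['datapoints'].map(&:first) }.compact
  return [UNKNOWN, 'UNKNOWN: no datapoints'] if values.empty?

  peak = values.max
  if peak >= crit
    [CRITICAL, "CRITICAL: peak #{peak} >= #{crit}"]
  elsif peak >= warn
    [WARNING, "WARNING: peak #{peak} >= #{warn}"]
  else
    [OK, "OK: peak #{peak}"]
  end
end
```

A wrapper script would print the message and exit with the code, which is all Nagios needs from a check.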
