1
Big Data, Baby Steps
“What Every Leader Should Consider When Starting a Big Data Initiative”
April 12, 2014
Goal for this presentation
“Big data is like teenage sex: everyone talks about it,
nobody really knows how to do it, everyone thinks everyone
else is doing it, so everyone claims they are doing it...”
- Dan Ariely on Facebook Jan 6, 2013 and “others”
2
Why Me? Why Ancestry?
• Established consumer facing Web company looking to
leverage our data
• Started with Hadoop and HBase in 2012 on AncestryDNA
• When we started, I looked for guidance – it was missing
• Learn from us: what works, what didn’t, how to adjust
Agenda
• What to consider before you start
• Understand the Hadoop ecosystem
– What pieces is Ancestry using and why?
– Big Data architecture at Ancestry
• Hadoop distributions
• Big Data consultants
• How to build your team(s)
• Custom logs
– Other companies and Ancestry specifics
• Top three things to remember
3
Gartner's new technology hype cycle
Where is Big Data (and Big Data Analytics) on this curve?
4
Source: Gartner August 2013
What to consider before you start
• Big Data, Business Intelligence, and Analytics are tied
– Analytics is an umbrella term that represents the entire
ecosystem needed to turn data into actions
• Understand your “data”
– Web click stream data, sales transactions, advertising data,
fraud detection, sensor data, social data, etc.
• Visualize your final goal and work backwards
– Imagine (prototype) the dashboards, analytics, and actions
that will be available
• Deliver value to the business at each step
– “Goal of analytics is not to produce actionable insights; the
goal is to produce results.” – Ken Rudin
5
Understand the Hadoop ecosystem
• Hadoop 2.0 and HDFS (YARN)
• Workflow
• NoSQL
• Data Organization
• Log collection
• Near Real-Time Stream Processing
• NFS File System on HDFS
6
What are the pieces Ancestry is using?
We use or plan to use:
• YARN and Ambari (Hadoop 2.0), HBase, Azkaban, Hive with Stinger, Kafka, Samza, Cassandra, MongoDB, and OrangeFS
• Forensics on log data: ElasticSearch and Kibana
• Visualization: Tableau (Graphs + Deep Zoom)
7
Visualization
We were a company that used traditional “Cubes” and Excel
– Business Intelligence/Data Warehouse world has moved
beyond cubes
– Great product that didn’t work for us
– People went back to using Excel
– With Tableau, 30 people created 120+ dashboards and reports in two weeks
– Tableau tied to an MPP Data Warehouse is changing our company
– It also created the “Wild, wild west” – fixing that with a blessed portal
8
Hadoop distributions
• Apache Hadoop – open source, active community, large ecosystem
of projects; requires more internal knowledge and support
• Cloudera – first distribution, large “war chest” (cash investment),
Impala, and the Cloudera Console
• MapR – custom file system (API-equivalent to HDFS) that improves
performance, custom HBase implementation, high-availability features
• Hortonworks – closest to Apache Hadoop; tested on Yahoo!’s 7,000-node
cluster before being released
• Several cloud options, including Google and Amazon – quick and easy
to get going, and a great way to experiment and learn, but watch
your data storage costs
9
Typical Big Data architecture
[Diagram: user-facing stacks and services include a log forwarder/Kafka producer that feeds streams A, B, and C into Kafka (the stream repo). Samza stream processing, running on Hadoop, maintains user properties and user segments in a Cassandra repo, driven by rules that marketing's segmentation-and-targeting management defines. Raw data lands via simple ETL in Hadoop, the system of record, which also holds global properties and models; MapReduce ETL and simple ELT feed the EDW (MPP), which supports designs. Actions and feeds expose the results back to the web site.]
10
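To make the “Log Forwarder → Kafka Producer” hop concrete, here is a minimal sketch in Java. It uses today's org.apache.kafka.clients producer API (a 2014-era deck would have used the older kafka.javaapi classes), and the topic name, key choice, and settings are illustrative assumptions, not Ancestry's actual configuration.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Forwards one JSON log event per web request onto a Kafka topic. */
public class LogForwarder implements AutoCloseable {
    private final Producer<String, String> producer;
    private final String topic;

    public LogForwarder(String brokers, String topic) {
        Properties props = new Properties();
        props.put("bootstrap.servers", brokers);   // Kafka broker list
        props.put("acks", "1");                    // leader ack is enough for click logs
        props.put("compression.type", "gzip");     // batch and compress on the wire
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
        this.topic = topic;
    }

    /** Key by user id so one visitor's events stay ordered within a partition. */
    public void send(String userId, String jsonEvent) {
        producer.send(new ProducerRecord<>(topic, userId, jsonEvent));
    }

    @Override
    public void close() {
        producer.close(); // flush pending batches before shutdown
    }
}
```

Keying by user ID keeps one visitor's events ordered within a partition, which makes the downstream stitching described later in the deck easier.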
Ancestry system diagram
11
[Diagram: .NET, Java, JVM, Vert.X, Node.js, and Python stacks share a Kafka-producer logging aspect. A Kafka log forwarder (with a mirror) feeds streams A, B, and C to Samza stream processing and a notification service, and lands raw data in Hadoop, the system of record. Dogwood ELT and MapReduce ETL feed the EDW (ParAccel); aggregate ETL flows back through Kafka as actions and feeds. Around the core sit a Splunk-alternative initiative (ElasticSearch, Kibana), an operations monitoring/reporting initiative, a User 360 services initiative, a production Hadoop cluster, and Tableau for visualization.]
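Both diagrams route streams A/B/C through Samza. As a rough illustration of what such a job looks like, here is a minimal StreamTask using Samza's low-level API; the topic names, the string/JSON serde, and the naive substring filter are all assumptions for brevity, not Ancestry's code.

```java
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

/** Consumes raw log events from Kafka and forwards only "action" events to a
 *  downstream topic (e.g. for the notification service in the diagram). */
public class ActionFilterTask implements StreamTask {
    private static final SystemStream OUTPUT =
            new SystemStream("kafka", "user-actions"); // hypothetical output topic

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String json = (String) envelope.getMessage(); // assumes a string/JSON serde
        // Crude filter for brevity; real code would parse the JSON properly.
        if (json.contains("\"eventType\":\"action\"")) {
            collector.send(new OutgoingMessageEnvelope(OUTPUT, json));
        }
    }
}
```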
How to organize and build your team(s)
• Hiring vs. training smart developers in your organization
– Training
▫ Self-starters who can train themselves
▫ Online training that is free or with minimal cost
▫ Paid training for specific technologies
– Promote your technology and people will reach out to you
▫ Bit of a chicken and egg problem
• Key roles for the team
– Developers who understand operations
– Hadoop engineers
– Team leaders and managers
12
Big Data consultants
• Lots of them, charging lots of money
• Not all of them are created equal
• Prefer consultants who are vendor agnostic
• Find consultants who have experience in what you want
to do
• Check references
13
Companies working with custom logs
14
• Facebook – Scribe, Scuba, Hive, and Hadoop as the data
warehouse infrastructure. Runs over 10K Hive scripts daily
to crunch log data. An analyst on each team makes sure
logging is correct.
• Uses a very simple interface, similar to log4j, to log
data. How do they keep this accurate?
• LinkedIn – tried Scribe, then implemented Kafka and Avro to
collect log data. Uses a binary format with a schema registry.
• Netflix – recently open sourced their log-collection
infrastructure (Suro data pipeline).
“Used to be a web site that occasionally logged data. Now
we’re a logging engine that occasionally serves as a web
site.”
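To show what the LinkedIn-style “binary format with a schema registry” looks like in practice, here is a small Avro encoding sketch in Java. The event schema is hypothetical, and a real pipeline would fetch schemas from a registry and prepend a schema ID to each message rather than hard-coding the schema as shown.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroLogEncoder {
    // Hypothetical event schema; in the registry model only a schema id
    // travels with each message, not the schema itself.
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"LogEvent\",\"fields\":["
      + "{\"name\":\"userId\",\"type\":\"string\"},"
      + "{\"name\":\"eventType\",\"type\":\"string\"},"
      + "{\"name\":\"timestamp\",\"type\":\"long\"}]}");

    public static byte[] encode(String userId, String eventType, long ts)
            throws IOException {
        GenericRecord rec = new GenericData.Record(SCHEMA);
        rec.put("userId", userId);
        rec.put("eventType", eventType);
        rec.put("timestamp", ts);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(rec, enc);
        enc.flush(); // compact binary: no field names travel on the wire
        return out.toByteArray();
    }
}
```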
Collecting custom logs at Ancestry
• Framework piece with a “Logging Aspect” (a sketch follows this slide)
– Logging is a cross-cutting concern
– Avoid breaking changes
– Annotations for parameter names (normalization layer)
• Defined Big Data headers that must be present in every
log (User ID, Anonymous ID, Session, Request ID, Client)
– Stitch data together
– Partitioned in Hive by day/month/year
– JSON payload
– Validate messages sent vs. messages received
– Schema repository (long-term)
15
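A hedged sketch of the “Logging Aspect” idea using AspectJ annotations: every method marked with a (hypothetical) @Logged annotation emits one JSON event stamped with the five Big Data headers. The header values and the println stand-in for the Kafka forwarder are placeholders; a real framework would pull them from a per-request context.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

/** Hypothetical marker annotation applied to service methods that should log. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Logged {}

/** Cross-cutting logging sketch: every @Logged method emits one JSON event
 *  stamped with the five required Big Data headers. */
@Aspect
public class BigDataLoggingAspect {

    @Around("@annotation(Logged)")
    public Object logInvocation(ProceedingJoinPoint jp) throws Throwable {
        long start = System.currentTimeMillis();
        Object result = jp.proceed(); // run the intercepted method first

        String event = String.format(
            "{\"userId\":\"%s\",\"anonymousId\":\"%s\",\"sessionId\":\"%s\","
          + "\"requestId\":\"%s\",\"client\":\"%s\","
          + "\"method\":\"%s\",\"elapsedMs\":%d}",
            "u123", "anon-456", "s-789", "req-1", "web", // placeholders for the headers
            jp.getSignature().toShortString(),
            System.currentTimeMillis() - start);
        System.out.println(event); // stand-in for handing off to the Kafka forwarder
        return result;
    }
}
```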
Stitching data together
16
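The stitching diagram is described in the editor's notes: events always carry the permanent anonymous-ID cookie, gain a user ID only after login, and earlier anonymous-only sessions are then tied to the account. A toy in-memory illustration of that rule follows; in practice this runs as a batch join in Hive over the day/month/year-partitioned logs, and all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration of stitching: events always carry an anonymous id and
 *  gain a user id only after login; a login ties earlier anonymous-only
 *  events to the account. */
public class SessionStitcher {
    private final Map<String, String> anonToUser = new HashMap<>();          // learned links
    private final Map<String, List<String>> pendingByAnon = new HashMap<>(); // pre-login events

    /** @param userId null until the visitor logs in */
    public void observe(String anonymousId, String userId, String event) {
        if (userId != null) {
            anonToUser.put(anonymousId, userId); // login reveals the link
        }
        if (anonToUser.get(anonymousId) == null) {
            // No account known yet: hold the event until a later login
            // ties this anonymous id to a user.
            pendingByAnon.computeIfAbsent(anonymousId, k -> new ArrayList<>())
                         .add(event);
        }
    }

    /** Pre-login events from sessions later tied to this user. */
    public List<String> stitched(String userId) {
        List<String> out = new ArrayList<>();
        anonToUser.forEach((anon, user) -> {
            if (user.equals(userId)) {
                out.addAll(pendingByAnon.getOrDefault(anon, Collections.emptyList()));
            }
        });
        return out;
    }
}
```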
Ancestry log collection details
Each server:
• 10 rolling logs
• Scraper process
Validate your data collection infrastructure (sketched below):
• Auto-incrementing count in every log message
• Count on Framework side (sender) and count on Hadoop (receiver)
17
[Diagram: on each single server, the log sender writes 10 rolling files to the local hard drive; a scraper process ships them to Kafka, and the log receiver lands them in Hadoop.]
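A minimal sketch of the sender/receiver validation from this slide: each host stamps an auto-incrementing sequence number into every message, and the Hadoop side counts gaps. It assumes one monotonic counter per host and per-host in-order arrival (which Kafka provides within a partition); the class and method names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

/** Receiver-side validation: each host stamps an auto-incrementing sequence
 *  number into every message, so gaps on the Hadoop side reveal lost data. */
public class SequenceValidator {
    private final Map<String, Long> lastSeen = new HashMap<>(); // host -> last sequence seen
    private long lost = 0;

    /** Call once per received message, in arrival order per host. */
    public void record(String host, long sequence) {
        Long prev = lastSeen.put(host, sequence);
        if (prev != null && sequence > prev + 1) {
            lost += sequence - prev - 1; // skipped sequence numbers => lost messages
        }
    }

    /** Total messages the senders claim to have sent but we never received. */
    public long lostCount() {
        return lost;
    }
}
```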
Ancestry moving forward
• Ancestry is not “done” - the journey continues
– Still evolving and changing
– My thinking and understanding has also changed
• Means we will embrace new technologies in the future
– Keep our eyes open and experiment
• This is affecting the entire organization
– Becoming more involved with Open Source and the
communities that support it
18
Top three things to remember
• First and foremost, understand your needs
– No clear right or wrong way
– Keep it simple because simple scales
• This is about Analytics and impacting the business
• Find a company that fits you and follow them:
– Netflix (cloud architecture, code for survival, simian army)
– Facebook (HBase)
– LinkedIn (Kafka, Samza, Azkaban)
19
Bill’s contact information
Bill Yetman
VP of Engineering at Ancestry
byetman@ancestry.com
http://blogs.ancestry.com/techroots/
(Filter on Big Data or search for
“Adventures in Big Data” in the title)
20
Editor's Notes
  • #3: Quote: this was a quick search and the earliest reference I saw – pretty sure someone else said it first, but the idea is the key. Me: an established consumer-facing Web company looking to leverage our data; we started with Hadoop and HBase in 2012 on AncestryDNA; when we started, I looked for guidance and it was missing; learn from us – what works, what didn’t, how to adjust. Please understand, I had very little experience with Hadoop, Big Data, etc. before starting the AncestryDNA project, and that project was very specific and focused. Two of my engineers and someone outside our company went to lunch; that encounter gave us the confidence to use HBase in our DNA project. Best $45 lunch I ever approved. Moving to a general Big Data Analytics effort was very different – and yes, I’ve made mistakes. There were no real blog entries with someone saying “this is exactly what we did”, so I started writing Big Data posts on ACOM’s Technology Blog. Learn from what we’re going through.
  • #4: Understand what you are getting into; understand the Hadoop ecosystem. Hadoop is presented as a “turn-key”, “fully baked” technology – it is definitely not. The architecture diagrams (a general one and a more specific Ancestry one) should spark discussion. Build a team. Basics about Hadoop distributions and consultants. Then details about custom logs at Ancestry and other companies – this is an area where we have learned. I will end with the top three things to remember – if nothing else, these are the three things I want you to remember.
  • #5: This is an open question. I think the industry is starting up the “slope of enlightenment”, but I’m worried we are still dropping down to the “trough of disillusionment”. It is also possible that different companies are on different parts of this graph. The other question: are those of us in this room early adopters, or the early majority?
  • #6: Don’t be fooled – what you are doing in any Big Data project is “Analytics”, from basic data collection and aggregation to advanced analytics modeling/machine learning. Data: a story about a tax Web site that was going to look at how many times their users transferred money; what they found was that the average user went through 28 clicks before finding the “Transfer Funds” item. Other items: how quickly you need your insight (batch or near-real time; fraud vs. click stream). Visualize: if I could do anything differently, this is one of them – set the expectations and visualize the outcome. Ken Rudin (Zynga and now Facebook) gave a TDWI 2013 keynote in Chicago that is an eye opener. Someone to follow.
  • #7: Base Hadoop (how close to the “bleeding” edge do you want to be?); workflow; NoSQL; data organization; log (or any data) collection; near-real-time streaming; a file system on top of HDFS.
  • #8: Hadoop 2.0 (YARN, Ambari). HBase on top of HDFS. Azkaban – simple, easier to understand and use compared to Oozie. Stinger with Hive – we are a SQL shop, and HQL is close. Kafka – pub/sub at scale (guarantees message delivery at least once; you will get duplicate messages, and the app must handle that). Samza on top of Kafka for near-real-time stream processing (reversed query). Cassandra and MongoDB. OrangeFS on top of HDFS – a normal file system view on top of HDFS is valuable. Log collection: we store JSON logs, so we can get away with ElasticSearch (Lucene and Solr) and Kibana (time-series dashboards); LogStash would help if we needed to format the log data, but since we’re JSON we don’t need it. Tableau has been a huge win for us and spread like wildfire through the organization. More on the next slide.
  • #10: If you look, there are 12+ different Hadoop distributions; I’m going to talk about the main five. Just setting up a Hadoop development environment (a VMware VM with Hadoop to develop against) takes time – even with directions on a wiki, a new developer will take about a week to get it right (install, check, blow away, reinstall). Where are companies moving? (Hortonworks is gaining momentum.) One of the big three distributions approached us early on and really pitched their services, training, and licensing (about $5K per node). I was in a meeting with our CFO and he said to me, “Hey Bill, I hear that XYZ is saying they are about to close a multi-million dollar contract with Ancestry for their Hadoop distribution.” Well, they didn’t close the deal. That’s the environment you are in.
  • #11: When I first started, I talked to a consultant who drew a diagram very similar to this on a napkin. I still have it. First, collect your logs/data/etc. with Kafka. Stream processing gives you near-real time (we’re getting close to this). Raw data is moved to Hadoop – always store the raw data first, then process it. MapReduce the data and create Hive tables, then run other scripts or MapReduce jobs to create files that are pulled into the EDW. What/where is your data warehouse? It could be Hadoop (Facebook) or a more traditional EDW (like us); we moved away from a 10-year-old Microsoft DW on SQL 2008 R2 to an MPP solution on ParAccel (now Matrix). Finally, you need a way to expose the data you are collecting back out to the web site/applications to take action.
  • #12: Color means the piece is in place; white means it is coming. At the top we’re showing the logging aspect that is included in our Web applications and services. You’ve seen this before – we feed Kafka, feed Hadoop (initially we have ElasticSearch and Kibana on 5 nodes – this will change over time), feed aggregate data into the EDW, and visualize with Tableau (Hadoop or DW). Why have a production Hadoop cluster? LinkedIn generates “people we think you know” and “jobs you might be interested in” every night for all their users.
  • #13: Anyone here tried to hire Hadoop engineers? There are very few, they are in high demand, they usually love their current job, and they are very expensive. We went a different way: we identified smart developers in the company and trained them. This takes more time and is an investment. Recently a Boston Hadoop engineer connected with me through LinkedIn; he is looking to move to SF and found Ancestry because he read my blog entries.
  • #14: This is a dangerous area – you can invest a lot of time and money here. If you have a new team without much Hadoop experience, consultants can be a big help (this was the boat we were in). Prefer vendor-agnostic providers. Find consultants with experience in your area. We really like both of these companies. (Should we show them?)
  • #16: How do the big boys (Google, FB, and others) handle SW development? They have two distinct groups: infrastructure engineers and application engineers. Infrastructure engineers build the common cross-cutting concerns that every team uses, and logging is one of those cross-cutting concerns (along with SLAs, monitoring, automated deployment, virtualization, and A/B testing). We’re not exactly the same, but Ancestry is approaching logging the same way. The other item included in every entry is what we call our Big Data headers – we do not rely on date/time to stitch requests together.
  • #17: This is an example of how we might stitch data together. When the user id is not present, we use the permanent anonymous-id cookie. Once a user has logged in, we have their account id. In this example, the same user has visited with two different sessions, clearing their cookies in between; once they log in, we can tie the sessions together.
  • #18: On each server, we allocate a specific amount of disk space for the log files. Log files roll over once they hit a specific size, and we keep 10 active files before deleting old data. The goal is to hold about one day’s worth of logs on each host; if we haven’t picked up the logs by then, we lose data. We install a Kafka log scraper on each host that pulls the data written to these files. Kafka uses a very efficient transport mechanism for its producers: it sends multiple messages at a time and compresses them, so it is easy on the network. True confession: we designed this initially without validating how the data collection was going. Big miss. One simple fix is to use an auto-incrementing value in every log message and look for breaks in the sequence. Another is to keep track of messages sent in the logging aspect and messages received in Hadoop; you need a stream to send the message counts every minute or so, then count the messages received in Hadoop.
  • #19: We still have a lot to learn. The real challenge is rolling this out across a mature web site that produces a lot of data; it is easier to start when you’re small and grow/change your technology as needed. Ultimately, we are looking to change our company culture: allow the “data” to help us make decisions and get away from the “gut” reactions that drive a product decision. Put a feature out, make sure it moves a metric we care about (Google and FB say 80% of all features don’t move a key metric), or take that feature down. A/B testing is key. Add in the fact that we are moving from a Microsoft shop to a Java/Linux/open-source development company – this is a huge shift.
  • #20: There is no right way or wrong way – it has to be “your way”. You are doing Analytics, not “Big Data”, and that Analytics must impact the business (or why do it?). Other amazing companies have made this transition; find one that matches you and follow them. I love the Netflix architecture videos on YouTube (Adrian Cockcroft, older dude in a t-shirt, usually talking about Chaos Monkey). LinkedIn has been very open with us, and we’ve joined the Kafka and Samza projects. The popularity of HBase speaks for itself; Ancestry sponsored an HBase meetup in March – very successful, very interesting to see. Join the communities and contribute. Join InfoQ and attend QCon in SF (Nov 2014), a non-Microsoft, open-source conference with amazing presentations, or the Hadoop Summit in San Jose (June 3rd through 5th).
  • #21: I am very impressed with technology companies in Silicon Valley. They share infrastructure code and don’t feel it is part of their IP (their IP is their data and algorithms). Ancestry started attending the Yahoo! Big Data meetup and joined the HBase community; this has really opened us up, stimulated innovation, and provided direction for our teams. Companies in Utah can learn a lot from them – I believe we don’t share and collaborate nearly as much as our SF counterparts. Currently reading: “Secrets of Analytical Leaders: Insights from Information Insiders” by Wayne Eckerson. Willing to discuss Big Data projects and infrastructure with any company – the best way forward is to support each other.