1
Big Data, Baby Steps
“What Every Leader Should Consider When Starting a Big Data Initiative”
April 12, 2014
Goal for this presentation
“Big data is like teenage sex: everyone talks about it,
nobody really knows how to do it, everyone thinks everyone
else is doing it, so everyone claims they are doing it...”
- Dan Ariely on Facebook Jan 6, 2013 and “others”
2
Why Me? Why Ancestry?
• Established consumer facing Web company looking to
leverage our data
• Started with Hadoop and HBase in 2012 on AncestryDNA
• When we started, I looked for guidance – it was missing
• Learn from us: what works, what didn’t, how to adjust
Agenda
• What to consider before you start
• Understand the Hadoop ecosystem
– What pieces is Ancestry using and why?
– Big Data architecture at Ancestry
• Hadoop distributions
• Big Data consultants
• How to build your team(s)
• Custom logs
– Other companies and Ancestry specifics
• Top three things to remember
3
Gartner's new technology hype cycle
Where is Big Data (and Big Data Analytics) on this curve?
4
Source: Gartner August 2013
What to consider before you start
• Big Data, Business Intelligence, and Analytics are tied
– Analytics is an umbrella term that represents the entire
ecosystem needed to turn data into actions
• Understand your “data”
– Web click stream data, sales transactions, advertising data,
fraud detection, sensor data, social data, etc.
• Visualize your final goal and work backwards
– Imagine (prototype) the dashboards, analytics, and actions
that will be available
• Deliver value to the business at each step
– “Goal of analytics is not to produce actionable insights; the
goal is to produce results.” – Ken Rudin
5
Understand the Hadoop ecosystem
• Hadoop 2.0 and HDFS (YARN)
• Workflow
• NoSQL
• Data Organization
• Log collection
• Near Real-Time Stream Processing
• NFS File System on HDFS
6
What are the pieces Ancestry is using?
We use or plan to use:
• YARN and Ambari (Hadoop 2.0), HBase, Azkaban, Hive with Stinger, Kafka, Samza, Cassandra, MongoDB, and OrangeFS
• Forensics on log data: ElasticSearch and Kibana
• Visualization: Tableau (Graphs + Deep Zoom)
7
Visualization
We were a company that used traditional “Cubes” and Excel
– Business Intelligence/Data Warehouse world has moved
beyond cubes
– Great product that didn’t work for us
– People went back to using Excel
– With Tableau, 30 people created 120+ dashboards and reports in two weeks
– Tableau tied to an MPP Data Warehouse is changing our company
– It also created the “Wild, wild west” – fixing that with a blessed portal
8
Hadoop distributions
• Apache Hadoop – open source, active community, large ecosystem
of projects; requires more internal knowledge and support
• Cloudera – first distribution, large “war chest” (cash investment),
Impala, and the Cloudera Console
• MapR – custom file system (API-equivalent to HDFS) that improves
performance, custom HBase implementation, high-availability features
• Hortonworks – closest to Apache Hadoop; tested on Yahoo!’s 7,000-node
cluster before being released
• Several cloud options, including Google and Amazon – quick and easy
to get going, and a great way to experiment and learn, but watch
your data storage costs
9
Typical Big Data architecture
[Diagram: user-facing stacks and services include a log forwarder/Kafka producer that feeds streams A, B, and C into Kafka (the stream repo). Samza stream processing, running on Hadoop, maintains user properties and user segments in a Cassandra repo, driven by rules that marketing's segmentation-and-targeting management defines. Raw data lands via simple ETL in Hadoop, the system of record, which also holds global properties and models; MapReduce ETL and simple ELT feed the EDW (MPP), which supports designs. Actions and feeds expose the results back to the web site.]
10
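To make the “Log Forwarder → Kafka Producer” hop concrete, here is a minimal sketch in Java. It uses today's org.apache.kafka.clients producer API (a 2014-era deck would have used the older kafka.javaapi classes), and the topic name, key choice, and settings are illustrative assumptions, not Ancestry's actual configuration.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Forwards one JSON log event per web request onto a Kafka topic. */
public class LogForwarder implements AutoCloseable {
    private final Producer<String, String> producer;
    private final String topic;

    public LogForwarder(String brokers, String topic) {
        Properties props = new Properties();
        props.put("bootstrap.servers", brokers);   // Kafka broker list
        props.put("acks", "1");                    // leader ack is enough for click logs
        props.put("compression.type", "gzip");     // batch and compress on the wire
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
        this.topic = topic;
    }

    /** Key by user id so one visitor's events stay ordered within a partition. */
    public void send(String userId, String jsonEvent) {
        producer.send(new ProducerRecord<>(topic, userId, jsonEvent));
    }

    @Override
    public void close() {
        producer.close(); // flush pending batches before shutdown
    }
}
```

Keying by user ID keeps one visitor's events ordered within a partition, which makes the downstream stitching described later in the deck easier.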
Ancestry system diagram
11
[Diagram: .NET, Java, JVM, Vert.X, Node.js, and Python stacks share a Kafka-producer logging aspect. A Kafka log forwarder (with a mirror) feeds streams A, B, and C to Samza stream processing and a notification service, and lands raw data in Hadoop, the system of record. Dogwood ELT and MapReduce ETL feed the EDW (ParAccel); aggregate ETL flows back through Kafka as actions and feeds. Around the core sit a Splunk-alternative initiative (ElasticSearch, Kibana), an operations monitoring/reporting initiative, a User 360 services initiative, a production Hadoop cluster, and Tableau for visualization.]
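Both diagrams route streams A/B/C through Samza. As a rough illustration of what such a job looks like, here is a minimal StreamTask using Samza's low-level API; the topic names, the string/JSON serde, and the naive substring filter are all assumptions for brevity, not Ancestry's code.

```java
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

/** Consumes raw log events from Kafka and forwards only "action" events to a
 *  downstream topic (e.g. for the notification service in the diagram). */
public class ActionFilterTask implements StreamTask {
    private static final SystemStream OUTPUT =
            new SystemStream("kafka", "user-actions"); // hypothetical output topic

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String json = (String) envelope.getMessage(); // assumes a string/JSON serde
        // Crude filter for brevity; real code would parse the JSON properly.
        if (json.contains("\"eventType\":\"action\"")) {
            collector.send(new OutgoingMessageEnvelope(OUTPUT, json));
        }
    }
}
```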
How to organize and build your team(s)
• Hiring vs. training smart developers in your organization
– Training
▫ Self-starters who can train themselves
▫ Online training that is free or with minimal cost
▫ Paid training for specific technologies
– Promote your technology and people will reach out to you
▫ Bit of a chicken and egg problem
• Key roles for the team
– Developers who understand operations
– Hadoop engineers
– Team leaders and managers
12
Big Data consultants
• Lots of them, charging lots of money
• Not all of them are created equal
• Prefer consultants who are vendor agnostic
• Find consultants who have experience in what you want
to do
• Check references
13
Companies working with custom logs
14
• Facebook – Scribe, Scuba, Hive, and Hadoop as the data
warehouse infrastructure. Runs over 10K Hive scripts daily
to crunch log data. An analyst on each team makes sure
logging is correct.
• Uses a very simple interface, similar to log4j, to log
data. How do they keep this accurate?
• LinkedIn – tried Scribe, then implemented Kafka and Avro to
collect log data. Uses a binary format with a schema registry.
• Netflix – recently open sourced their log-collection
infrastructure (Suro data pipeline).
“Used to be a web site that occasionally logged data. Now
we’re a logging engine that occasionally serves as a web
site.”
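To show what the LinkedIn-style “binary format with a schema registry” looks like in practice, here is a small Avro encoding sketch in Java. The event schema is hypothetical, and a real pipeline would fetch schemas from a registry and prepend a schema ID to each message rather than hard-coding the schema as shown.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroLogEncoder {
    // Hypothetical event schema; in the registry model only a schema id
    // travels with each message, not the schema itself.
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"LogEvent\",\"fields\":["
      + "{\"name\":\"userId\",\"type\":\"string\"},"
      + "{\"name\":\"eventType\",\"type\":\"string\"},"
      + "{\"name\":\"timestamp\",\"type\":\"long\"}]}");

    public static byte[] encode(String userId, String eventType, long ts)
            throws IOException {
        GenericRecord rec = new GenericData.Record(SCHEMA);
        rec.put("userId", userId);
        rec.put("eventType", eventType);
        rec.put("timestamp", ts);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(rec, enc);
        enc.flush(); // compact binary: no field names travel on the wire
        return out.toByteArray();
    }
}
```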
Collecting custom logs at Ancestry
• Framework piece with a “Logging Aspect” (a sketch follows this slide)
– Logging is a cross-cutting concern
– Avoid breaking changes
– Annotations for parameter names (normalization layer)
• Defined Big Data headers that must be present in every
log (User ID, Anonymous ID, Session, Request ID, Client)
– Stitch data together
– Partitioned in Hive by day/month/year
– JSON payload
– Validate messages sent vs. messages received
– Schema repository (long-term)
15
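A hedged sketch of the “Logging Aspect” idea using AspectJ annotations: every method marked with a (hypothetical) @Logged annotation emits one JSON event stamped with the five Big Data headers. The header values and the println stand-in for the Kafka forwarder are placeholders; a real framework would pull them from a per-request context.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

/** Hypothetical marker annotation applied to service methods that should log. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Logged {}

/** Cross-cutting logging sketch: every @Logged method emits one JSON event
 *  stamped with the five required Big Data headers. */
@Aspect
public class BigDataLoggingAspect {

    @Around("@annotation(Logged)")
    public Object logInvocation(ProceedingJoinPoint jp) throws Throwable {
        long start = System.currentTimeMillis();
        Object result = jp.proceed(); // run the intercepted method first

        String event = String.format(
            "{\"userId\":\"%s\",\"anonymousId\":\"%s\",\"sessionId\":\"%s\","
          + "\"requestId\":\"%s\",\"client\":\"%s\","
          + "\"method\":\"%s\",\"elapsedMs\":%d}",
            "u123", "anon-456", "s-789", "req-1", "web", // placeholders for the headers
            jp.getSignature().toShortString(),
            System.currentTimeMillis() - start);
        System.out.println(event); // stand-in for handing off to the Kafka forwarder
        return result;
    }
}
```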
Stitching data together
16
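The stitching diagram is described in the editor's notes: events always carry the permanent anonymous-ID cookie, gain a user ID only after login, and earlier anonymous-only sessions are then tied to the account. A toy in-memory illustration of that rule follows; in practice this runs as a batch join in Hive over the day/month/year-partitioned logs, and all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration of stitching: events always carry an anonymous id and
 *  gain a user id only after login; a login ties earlier anonymous-only
 *  events to the account. */
public class SessionStitcher {
    private final Map<String, String> anonToUser = new HashMap<>();          // learned links
    private final Map<String, List<String>> pendingByAnon = new HashMap<>(); // pre-login events

    /** @param userId null until the visitor logs in */
    public void observe(String anonymousId, String userId, String event) {
        if (userId != null) {
            anonToUser.put(anonymousId, userId); // login reveals the link
        }
        if (anonToUser.get(anonymousId) == null) {
            // No account known yet: hold the event until a later login
            // ties this anonymous id to a user.
            pendingByAnon.computeIfAbsent(anonymousId, k -> new ArrayList<>())
                         .add(event);
        }
    }

    /** Pre-login events from sessions later tied to this user. */
    public List<String> stitched(String userId) {
        List<String> out = new ArrayList<>();
        anonToUser.forEach((anon, user) -> {
            if (user.equals(userId)) {
                out.addAll(pendingByAnon.getOrDefault(anon, Collections.emptyList()));
            }
        });
        return out;
    }
}
```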
Ancestry log collection details
Each server:
• 10 rolling logs
• Scraper process
Validate your data collection infrastructure (sketched below):
• Auto-incrementing count in every log message
• Count on Framework side (sender) and count on Hadoop (receiver)
17
[Diagram: on each single server, the log sender writes 10 rolling files to the local hard drive; a scraper process ships them to Kafka, and the log receiver lands them in Hadoop.]
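A minimal sketch of the sender/receiver validation from this slide: each host stamps an auto-incrementing sequence number into every message, and the Hadoop side counts gaps. It assumes one monotonic counter per host and per-host in-order arrival (which Kafka provides within a partition); the class and method names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

/** Receiver-side validation: each host stamps an auto-incrementing sequence
 *  number into every message, so gaps on the Hadoop side reveal lost data. */
public class SequenceValidator {
    private final Map<String, Long> lastSeen = new HashMap<>(); // host -> last sequence seen
    private long lost = 0;

    /** Call once per received message, in arrival order per host. */
    public void record(String host, long sequence) {
        Long prev = lastSeen.put(host, sequence);
        if (prev != null && sequence > prev + 1) {
            lost += sequence - prev - 1; // skipped sequence numbers => lost messages
        }
    }

    /** Total messages the senders claim to have sent but we never received. */
    public long lostCount() {
        return lost;
    }
}
```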
Ancestry moving forward
• Ancestry is not “done” - the journey continues
– Still evolving and changing
– My thinking and understanding has also changed
• Means we will embrace new technologies in the future
– Keep our eyes open and experiment
• This is affecting the entire organization
– Becoming more involved with Open Source and the
communities that support it
18
Top three things to remember
• First and foremost, understand your needs
– No clear right or wrong way
– Keep it simple because simple scales
• This is about Analytics and impacting the business
• Find a company that fits you and follow them:
– Netflix (cloud architecture, code for survival, simian army)
– Facebook (HBase)
– LinkedIn (Kafka, Samza, Azkaban)
19
Bill’s contact information
Bill Yetman
VP of Engineering at Ancestry
byetman@ancestry.com
http://blogs.ancestry.com/techroots/
(Filter on Big Data or search for
“Adventures in Big Data” in the title)
20
Editor's Notes
  • #3: Quote: this was a quick search and the earliest reference I saw – pretty sure someone else said it first, but the idea is the key. Me: an established consumer-facing Web company looking to leverage our data; we started with Hadoop and HBase in 2012 on AncestryDNA; when we started, I looked for guidance and it was missing; learn from us – what works, what didn’t, how to adjust. Please understand, I had very little experience with Hadoop, Big Data, etc. before starting the AncestryDNA project, and that project was very specific and focused. Two of my engineers and someone outside our company went to lunch; that encounter gave us the confidence to use HBase in our DNA project. Best $45 lunch I ever approved. Moving to a general Big Data Analytics effort was very different – and yes, I’ve made mistakes. There were no real blog entries with someone saying “this is exactly what we did”, so I started writing Big Data posts on ACOM’s Technology Blog. Learn from what we’re going through.
  • #4: Understand what you are getting into; understand the Hadoop ecosystem. Hadoop is presented as a “turn-key”, “fully baked” technology – it is definitely not. The architecture diagrams (a general one and a more specific Ancestry one) should spark discussion. Build a team. Basics about Hadoop distributions and consultants. Then details about custom logs at Ancestry and other companies – this is an area where we have learned. I will end with the top three things to remember – if nothing else, these are the three things I want you to remember.
  • #5: This is an open question. I think the industry is starting up the “slope of enlightenment”, but I’m worried we are still dropping down to the “trough of disillusionment”. It is also possible that different companies are on different parts of this graph. The other question: are those of us in this room early adopters, or the early majority?
  • #6: Don’t be fooled – what you are doing in any Big Data project is “Analytics”, from basic data collection and aggregation to advanced analytics modeling/machine learning. Data: a story about a tax Web site that was going to look at how many times their users transferred money; what they found was that the average user went through 28 clicks before finding the “Transfer Funds” item. Other items: how quickly you need your insight (batch or near-real time; fraud vs. click stream). Visualize: if I could do anything differently, this is one of them – set the expectations and visualize the outcome. Ken Rudin (Zynga and now Facebook) gave a TDWI 2013 keynote in Chicago that is an eye opener. Someone to follow.
  • #7: Base Hadoop (how close to the “bleeding” edge do you want to be?); workflow; NoSQL; data organization; log (or any data) collection; near-real-time streaming; a file system on top of HDFS.
  • #8: Hadoop 2.0 (YARN, Ambari). HBase on top of HDFS. Azkaban – simple, easier to understand and use compared to Oozie. Stinger with Hive – we are a SQL shop, and HQL is close. Kafka – pub/sub at scale (guarantees message delivery at least once; you will get duplicate messages, and the app must handle that). Samza on top of Kafka for near-real-time stream processing (reversed query). Cassandra and MongoDB. OrangeFS on top of HDFS – a normal file system view on top of HDFS is valuable. Log collection: we store JSON logs, so we can get away with ElasticSearch (Lucene and Solr) and Kibana (time-series dashboards); LogStash would help if we needed to format the log data, but since we’re JSON we don’t need it. Tableau has been a huge win for us and spread like wildfire through the organization. More on the next slide.
  • #10: If you look, there are 12+ different Hadoop distributions; I’m going to talk about the main five. Just setting up a Hadoop development environment (a VMware VM with Hadoop to develop against) takes time – even with directions on a wiki, a new developer will take about a week to get it right (install, check, blow away, reinstall). Where are companies moving? (Hortonworks is gaining momentum.) One of the big three distributions approached us early on and really pitched their services, training, and licensing (about $5K per node). I was in a meeting with our CFO and he said to me, “Hey Bill, I hear that XYZ is saying they are about to close a multi-million dollar contract with Ancestry for their Hadoop distribution.” Well, they didn’t close the deal. That’s the environment you are in.
  • #11: When I first started, I talked to a consultant who drew a diagram very similar to this on a napkin. I still have it. First, collect your logs/data/etc. with Kafka. Stream processing gives you near-real time (we’re getting close to this). Raw data is moved to Hadoop – always store the raw data first, then process it. MapReduce the data and create Hive tables, then run other scripts or MapReduce jobs to create files that are pulled into the EDW. What/where is your data warehouse? It could be Hadoop (Facebook) or a more traditional EDW (like us); we moved away from a 10-year-old Microsoft DW on SQL 2008 R2 to an MPP solution on ParAccel (now Matrix). Finally, you need a way to expose the data you are collecting back out to the web site/applications to take action.
  • #12: Color means the piece is in place; white means it is coming. At the top we’re showing the logging aspect that is included in our Web applications and services. You’ve seen this before – we feed Kafka, feed Hadoop (initially we have ElasticSearch and Kibana on 5 nodes – this will change over time), feed aggregate data into the EDW, and visualize with Tableau (Hadoop or DW). Why have a production Hadoop cluster? LinkedIn generates “people we think you know” and “jobs you might be interested in” every night for all their users.
  • #13: Anyone here tried to hire Hadoop engineers? There are very few, they are in high demand, they usually love their current job, and they are very expensive. We went a different way: we identified smart developers in the company and trained them. This takes more time and is an investment. Recently a Boston Hadoop engineer connected with me through LinkedIn; he is looking to move to SF and found Ancestry because he read my blog entries.
  • #14: This is a dangerous area – you can invest a lot of time and money here. If you have a new team without much Hadoop experience, consultants can be a big help (this was the boat we were in). Prefer vendor-agnostic providers. Find consultants with experience in your area. We really like both of these companies. (Should we show them?)
  • #16: How do the big boys (Google, FB, and others) handle SW development? They have two distinct groups: infrastructure engineers and application engineers. Infrastructure engineers build the common cross-cutting concerns that every team uses, and logging is one of those cross-cutting concerns (along with SLAs, monitoring, automated deployment, virtualization, and A/B testing). We’re not exactly the same, but Ancestry is approaching logging the same way. The other item included in every entry is what we call our Big Data headers – we do not rely on date/time to stitch requests together.
  • #17: This is an example of how we might stitch data together. When the user id is not present, we use the permanent anonymous-id cookie. Once a user has logged in, we have their account id. In this example, the same user has visited with two different sessions, clearing their cookies in between; once they log in, we can tie the sessions together.
  • #18: On each server, we allocate a specific amount of disk space for the log files. Log files roll over once they hit a specific size, and we keep 10 active files before deleting old data. The goal is to hold about one day’s worth of logs on each host; if we haven’t picked up the logs by then, we lose data. We install a Kafka log scraper on each host that pulls the data written to these files. Kafka uses a very efficient transport mechanism for its producers: it sends multiple messages at a time and compresses them, so it is easy on the network. True confession: we designed this initially without validating how the data collection was going. Big miss. One simple fix is to use an auto-incrementing value in every log message and look for breaks in the sequence. Another is to keep track of messages sent in the logging aspect and messages received in Hadoop; you need a stream to send the message counts every minute or so, then count the messages received in Hadoop.
  • #19: We still have a lot to learn. The real challenge is rolling this out across a mature web site that produces a lot of data; it is easier to start when you’re small and grow/change your technology as needed. Ultimately, we are looking to change our company culture: allow the “data” to help us make decisions and get away from the “gut” reactions that drive a product decision. Put a feature out, make sure it moves a metric we care about (Google and FB say 80% of all features don’t move a key metric), or take that feature down. A/B testing is key. Add in the fact that we are moving from a Microsoft shop to a Java/Linux/open-source development company – this is a huge shift.
  • #20: There is no right way or wrong way – it has to be “your way”. You are doing Analytics, not “Big Data”, and that Analytics must impact the business (or why do it?). Other amazing companies have made this transition; find one that matches you and follow them. I love the Netflix architecture videos on YouTube (Adrian Cockcroft, older dude in a t-shirt, usually talking about Chaos Monkey). LinkedIn has been very open with us, and we’ve joined the Kafka and Samza projects. The popularity of HBase speaks for itself; Ancestry sponsored an HBase meetup in March – very successful, very interesting to see. Join the communities and contribute. Join InfoQ and attend QCon in SF (Nov 2014), a non-Microsoft, open-source conference with amazing presentations, or the Hadoop Summit in San Jose (June 3rd through 5th).
  • #21: I am very impressed with technology companies in Silicon Valley. They share infrastructure code and don’t feel it is part of their IP (their IP is their data and algorithms). Ancestry started attending the Yahoo! Big Data meetup and joined the HBase community; this has really opened us up, stimulated innovation, and provided direction for our teams. Companies in Utah can learn a lot from them – I believe we don’t share and collaborate nearly as much as our SF counterparts. Currently reading: “Secrets of Analytical Leaders: Insights from Information Insiders” by Wayne Eckerson. Willing to discuss Big Data projects and infrastructure with any company – the best way forward is to support each other.