Big Data, Baby Steps
“What Every Leader Should Consider When Starting a Big Data Initiative”
April 12, 2014
Goal for this presentation
“Big data is like teenage sex: everyone talks about it,
nobody really knows how to do it, everyone thinks everyone
else is doing it, so everyone claims they are doing it...”
- Dan Ariely on Facebook Jan 6, 2013 and “others”
Why Me? Why Ancestry?
• Established consumer-facing Web company looking to
leverage our data
• Started with Hadoop and HBase in 2012 on AncestryDNA
• When we started, I looked for guidance – it was missing
• Learn from us: what worked, what didn’t, how to adjust
Agenda
• What to consider before you start
• Understand the Hadoop ecosystem
– What pieces is Ancestry using and why?
– Big Data architecture at Ancestry
• Hadoop distributions
• Big Data consultants
• How to build your team(s)
• Custom logs
– Other companies and Ancestry specifics
• Top three things to remember
Gartner new technology hype cycles
Where is Big Data (and Big Data Analytics) on this curve?
Source: Gartner August 2013
What to consider before you start
• Big Data, Business Intelligence, and Analytics are tied
– Analytics is an umbrella term that represents the entire
ecosystem needed to turn data into actions
• Understand your “data”
– Web click stream data, sales transactions, advertising data,
fraud detection, sensor data, social data, etc.
• Visualize your final goal and work backwards
– Imagine (prototype) the dashboards, analytics, and actions
that will be available
• Deliver value to the business at each step
– “Goal of analytics is not to produce actionable insights; the
goal is to produce results.” Ken Rudin
Understand the Hadoop ecosystem
• Hadoop 2.0 and HDFS (YARN)
• Workflow
• NoSQL
• Data Organization
• Log collection
• Near Real-Time Stream Processing
• NFS File System on HDFS
What are the pieces Ancestry is using?
We use or plan to use:
YARN and Ambari
Forensics on log data:
Visualization:
(Graphs + Deep Zoom)
Visualization
Company that used traditional “Cubes” and Excel
– Business Intelligence/Data Warehouse world has moved
beyond cubes
– Great product that didn’t work for us
– People went back to using Excel
– In two weeks, 30 people created 120+ dashboards and reports
– Tying it to an MPP data warehouse is changing our company
– Created a “Wild, Wild West”; we are fixing it with a blessed portal
Hadoop distributions
• Open Source, Active Community, Large Ecosystem of
Projects, requires more internal knowledge and support
• First Distribution, Large “War Chest” (Cash Investment),
Impala, and the Cloudera Console
• Custom file system (API equivalent to HDFS) that improves
performance, custom HBase implementation, High
Availability Features
• Closest to Apache Hadoop, tested on Yahoo!’s 7,000-node
cluster before being released.
• Several Cloud options: Google and Amazon. Quick and
easy to get going. Great way to experiment and learn.
Watch your data storage costs.
Typical Big Data architecture
[Architecture diagram; only its labels survive: User Facing Stacks and Services; Log Forwarder; Kafka Producer; Kafka Stream Repo; Samza Stream Processing (Stream A, Stream B, Stream C); Hadoop System of Records; Raw Data; Simple ETL; MapReduce ETL; EDW (MPP); Simple ELT; Cassandra Repo; User Properties; User Segments; Rules; Global Properties & Models; Marketing Segmentation and Targeting Management; Actions; Feeds. Connector labels: “Defines”, “Designs”, “Runs on”, “Expose to the Web Site”.]
Ancestry system diagram
[System diagram; only its labels survive: .Net Stack; Java Stack; JVM stack; Vert.X stack; Node.js Stack; Python Stack; Aspect; Kafka Producer; Kafka Log Forwarder; Stream Kafka; Mirror; Samza Stream Processing (Stream A, Stream B, Stream C); Notification Service; Hadoop System of Records; Production Hadoop; Dogwood; MapReduce ETL; Aggr ETL; ETL Kafka; ELT; EDW (ParAccel); Tableau; Elastic Search; Kibana; Actions; Feeds; User 360 Services Initiative; Splunk Alternative Initiative; Operation Monitoring Reporting Initiative.]
How to organize and build your team(s)
• Hiring vs. training smart developers in your organization
– Training
▫ Self-starters who can train themselves
▫ Online training that is free or with minimal cost
▫ Paid training for specific technologies
– Promote your technology and people will reach out to you
▫ Bit of a chicken and egg problem
• Key roles for the team
– Developers who understand operations
– Hadoop engineers
– Team leaders and managers
Big Data consultants
• Lots of them, charging lots of money
• Not all of them are created equal
• Prefer consultants who are vendor agnostic
• Find consultants who have experience in what you want
to do
• Check references
Companies working with custom logs
• Scribe, Scuba, Hive, and Hadoop as the data
warehouse infrastructure. Run over 10K Hive
scripts daily to crunch log data. Analyst on
each team to make sure logging is correct.
• Uses a very simple interface similar to log4j to
log data. How to keep this accurate?
• Tried Scribe. Implemented Kafka and Avro to
collect log data. Use a binary format with a
schema registry.
• Recently open sourced their log collecting
infrastructure (Suro – Data Pipeline).
“Used to be a web site that occasionally logged data. Now
we’re a logging engine that occasionally serves as a web
site.”
Collecting custom logs at Ancestry
• Framework piece with a “Logging Aspect”
– Logging is a cross-cutting concern
– Avoid breaking changes
– Annotations for parameter names (normalization layer)
• Defined Big Data headers that must be present in every
log (User ID, Anonymous ID, Session, Request ID, Client)
– Stitch data together
– Partitioned in Hive by day/month/year
– JSON payload
– Validate messages sent vs. messages received
– Schema repository (long-term)
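A minimal sketch of what one such log record could look like, assuming Python and hypothetical field names (user_id, anonymous_id, session_id, request_id, client); the slides name the five headers but not their wire format. It rejects records that lack a required header, wraps the rest as a JSON line, and derives a Hive-style partition directory from the record's date.

```python
import json
from datetime import datetime, timezone

# Hypothetical field names for the five required headers listed above.
REQUIRED_HEADERS = ("user_id", "anonymous_id", "session_id", "request_id", "client")


def build_log_record(headers, payload, timestamp):
    """Return one JSON log line, refusing records that lack a required header."""
    missing = [name for name in REQUIRED_HEADERS if name not in headers]
    if missing:
        raise ValueError(f"missing required headers: {missing}")
    record = {
        "timestamp": timestamp.isoformat(),
        "headers": {name: headers[name] for name in REQUIRED_HEADERS},
        "payload": payload,  # free-form JSON body supplied by the caller
    }
    return json.dumps(record, sort_keys=True)


def partition_path(timestamp):
    """Hive-style partition directory (assumed layout) for the day of a record."""
    return f"year={timestamp:%Y}/month={timestamp:%m}/day={timestamp:%d}"


if __name__ == "__main__":
    ts = datetime(2014, 4, 12, 9, 30, tzinfo=timezone.utc)
    headers = {
        "user_id": "u-1",
        "anonymous_id": "a-1",
        "session_id": "s-1",
        "request_id": "r-1",
        "client": "web",
    }
    print(partition_path(ts))
    print(build_log_record(headers, {"event": "search"}, ts))
```

Enforcing the headers at write time is one way to keep every record joinable later; a header may still carry a null value (for example, no user ID for an anonymous visitor) as long as the key is present.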
Stitching data together
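The slide's figure did not survive, so the following is only an illustrative sketch of the idea, assuming Python and records already parsed into dicts with hypothetical "headers" and "timestamp" keys. Because every log carries the same Big Data headers, records written by different stacks can be grouped on a shared header such as the session or request ID and then ordered in time.

```python
from collections import defaultdict


def stitch_by_header(records, header="session_id"):
    """Group parsed log records by one shared header, ordered by timestamp.

    Assumes timestamps are ISO 8601 strings in a single format, so that
    plain string ordering matches chronological ordering.
    """
    grouped = defaultdict(list)
    for record in records:
        grouped[record["headers"][header]].append(record)
    for events in grouped.values():
        events.sort(key=lambda r: r["timestamp"])
    return dict(grouped)


if __name__ == "__main__":
    logs = [
        {"timestamp": "2014-04-12T09:30:02", "headers": {"session_id": "s-1"}, "payload": {"stack": "java"}},
        {"timestamp": "2014-04-12T09:30:01", "headers": {"session_id": "s-1"}, "payload": {"stack": "dotnet"}},
        {"timestamp": "2014-04-12T09:31:00", "headers": {"session_id": "s-2"}, "payload": {"stack": "node"}},
    ]
    for session, events in stitch_by_header(logs).items():
        print(session, [e["payload"]["stack"] for e in events])
```

At production volume this grouping would more likely be a Hive or MapReduce job than an in-memory dictionary; the sketch only shows why the common headers matter.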
Ancestry log collection details
Each server
• 10 rolling logs
• Scraper process
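One way to get a fixed set of rolling log files per server is sketched below using Python's standard library. This is an assumption for illustration, not the Ancestry framework: the size threshold and the helper name make_rolling_logger are hypothetical, and the slides only state that there are 10 rolling logs.

```python
import logging
from logging.handlers import RotatingFileHandler


def make_rolling_logger(path, max_bytes=1_000_000, files=10):
    """Return a logger that keeps at most `files` log files on disk.

    The active file plus (files - 1) rotated backups gives `files` in total;
    the oldest backup is discarded on each rollover once the limit is reached.
    """
    logger = logging.getLogger(f"rolling.{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of the root logger
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=files - 1)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger
```

A separate scraper process can then read the rotated files and forward them, which keeps log shipping out of the request path.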
Validate your data
collection infrastructure
• Auto incrementing count in every log
message
• Count on Framework side (sender) and
count on Hadoop (receiver)
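The sender-versus-receiver check can be sketched as follows, assuming Python and hypothetical names (CountingSender, find_missing, the "server" and "seq" fields). Each server stamps its messages with an auto-incrementing count, and the receiving side compares the sequence numbers it saw against the count the sender reports.

```python
from itertools import count


class CountingSender:
    """Stamps every outgoing message with an auto-incrementing sequence number."""

    def __init__(self, server_id):
        self.server_id = server_id
        self._next = count(1)
        self.sent = 0  # total messages stamped so far (the sender-side count)

    def stamp(self, message):
        self.sent += 1
        return {"server": self.server_id, "seq": next(self._next), **message}


def find_missing(received, server_id, sent_count):
    """Return the sequence numbers the sender emitted that never arrived."""
    seen = {m["seq"] for m in received if m["server"] == server_id}
    return sorted(set(range(1, sent_count + 1)) - seen)


if __name__ == "__main__":
    sender = CountingSender("web-01")
    stamped = [sender.stamp({"event": f"e{i}"}) for i in range(5)]
    arrived = [m for m in stamped if m["seq"] != 3]  # simulate one lost message
    print(find_missing(arrived, "web-01", sender.sent))
```

Comparing totals alone shows that something was lost; keeping the per-message sequence number also shows which messages, which helps locate the failing hop.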
[Diagram; only its labels survive: Single Server; Log Sender; Local Server Hard Drive; 10 rolling files; Scraper; Kafka; Hadoop; Log Receiver.]
Ancestry moving forward
• Ancestry is not “done” - the journey continues
– Still evolving and changing
– My thinking and understanding have also changed
• Means we will embrace new technologies in the future
– Keep our eyes open and experiment
• This is affecting the entire organization
– Becoming more involved with Open Source and the
communities that support it
Top three things to remember
• First and foremost, understand your needs
– No clear right or wrong way
– Keep it simple because simple scales
• This is about Analytics and impacting the business
• Find a company that fits you and follow them:
– Netflix (cloud architecture, code for survival, Simian Army)
– Facebook (HBase)
– LinkedIn (Kafka, Samza, Azkaban)
Bill’s contact information
Bill Yetman
VP of Engineering at Ancestry
byetman@ancestry.com
http://blogs.ancestry.com/techroots/
(Filter on Big Data or search for
“Adventures in Big Data” in the title)

Utah Big Mountain Big Data Baby Steps (4-12-2014) Final
