BigData @ comScore
Michael Brown, CTO, comScore, Inc.
March 25, 2011
comScore is a Global Leader in Measuring the Digital World
NASDAQ SCOR
Clients 1600+ worldwide
Employees 1,000+
Headquarters Reston, VA
Global Coverage
170+ countries under measurement;
43 markets reported
Local Presence 30+ locations in 21 countries
2© comScore, Inc. Proprietary.
Broad Client Base and Deep Expertise Across Key Industries
Media Agencies Telecom/Mobile Financial Retail Travel CPG Pharma Technology
The Trusted Source for Digital Intelligence Across Vertical Markets
47 out of the top 50 ONLINE PROPERTIES
45 out of the top 50 ADVERTISING AGENCIES
4 out of the top 4 WIRELESS CARRIERS
9 out of the top 10 INVESTMENT BANKS
9 out of the top 10 INTERNET SERVICE PROVIDERS
9 out of the top 10 AUTO INSURERS
9 out of the top 10 MAJOR MEDIA COMPANIES
9 out of the top 10 PHARMACEUTICAL COMPANIES
9 out of the top 10 CONSUMER FINANCE COMPANIES
9 out of the top 10 CPG COMPANIES
comScore History of Leadership and Innovation
1st:
– To measure the search market
– To measure video streaming
– To provide behavioral ad effectiveness
– To meter mobile user behavior
– To unify census + panel measurement
– To build and project from a 2 million+ longitudinal panel
– To monitor and report e-commerce data
– To deliver a worldwide Internet audience measurement
Global Shaper Company 2010
Average Records Captured per Day (2005-2009)
[Chart: average records captured per day, 2005-2009; y-axis from 0 to 1,800,000,000 records]
Launching the 3rd Generation
In 2009, in the midst of the recession, comScore decided to build and release its 3rd Generation Product – Unified Digital Measurement (UDM or Hybrid)
Technology Goals
– Ramp up data collection
– Deploy new methodologies for data processing and analysis
– Be able to scale linearly to the environment to support growth
– Have yesterday's data available today
And one more thing … do it in 4 months or less.
Unified Digital Measurement™ (UDM) Establishes Platform For
Panel + Census Data Integration
Global PERSON Measurement
Global MACHINE Measurement
PANEL | PAGE TAGS
Unified Digital Measurement (UDM)
Patent-Pending Methodology
Adopted by 88% of Top U.S. Media Properties
How Does the Hybrid Process Work?
Collect Traffic from PCs and devices
Clean Traffic – remove non-human traffic and bots, apply edit rules
Apply comScore URL Dictionary
Total Traffic → Filtered Traffic
URL Dictionary (CFD): Advertising Industry “Currency”
Intelligent grouping of Properties with 7+ levels of detail
– Property (e.g., Yahoo! Properties, Microsoft Sites)
– Media Title (e.g., Yahoo!, MSN)
– Channel (e.g., Yahoo! Search, MSN Homepages)
– Subchannel (e.g., Yahoo! Image Search, MSNBC)
– Group/Subgroup (e.g., Yahoo! Calendar, Today)
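As a rough illustration of how a hierarchical dictionary like this might be applied, the sketch below maps host patterns to a tuple of levels (broadest first) and picks the longest matching pattern. The pattern strings, the `URL_DICTIONARY` table, and the `classify` function are hypothetical; the deck does not describe the dictionary's real format or matching rules.

```python
from typing import Optional, Tuple

# Hypothetical host patterns mapped to dictionary levels, ordered from the
# broadest level (Property) to the most specific. Illustration only.
URL_DICTIONARY = {
    "yahoo.com": ("Yahoo! Properties", "Yahoo!"),
    "search.yahoo.com": ("Yahoo! Properties", "Yahoo!", "Yahoo! Search"),
    "msn.com": ("Microsoft Sites", "MSN"),
}


def classify(host: str) -> Optional[Tuple[str, ...]]:
    """Return the levels of the longest pattern that the host equals or ends with."""
    best = None
    for pattern, levels in URL_DICTIONARY.items():
        matches = host == pattern or host.endswith("." + pattern)
        if matches and (best is None or len(pattern) > len(best[0])):
            best = (pattern, levels)
    return best[1] if best else None
```

Preferring the longest pattern lets a specific entry (a Channel) override the broader one (a Media Title) when both match the same host.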
URL Dictionary (CFD) Coverage Statistics
11MM Unique Domains Average/Month in 2010
• Over 80% of pages viewed came from the top 131K domains in 2010 vs. 392K in 2009
• 2,360K patterns in January 2011 represent 85% of all pages
• 1,254K syndicated entities in January 2010
• 41K patterns added/month in 2010.
Worldwide UDM™ Penetration
Percentage of Machines Included in UDM Measurement (July 2010 Penetration Data)
Europe: Austria 80%, Belgium 85%, Switzerland 84%, Germany 84%, Denmark 82%, Spain 90%, Finland 85%, France 91%, Ireland 91%, Italy 80%, Netherlands 88%, Norway 84%, Portugal 86%, Sweden 85%, United Kingdom 90%
Asia Pacific: Australia 91%, Hong Kong 88%, India 84%, Japan 73%, Malaysia 87%, New Zealand 88%, Singapore 91%
North America: Canada 94%, United States 91%
Latin America: Argentina 94%, Brazil 92%, Chile 94%, Colombia 95%, Mexico 93%, Puerto Rico 92%
Middle East & Africa: Israel 93%, South Africa 73%
Worldwide Tags per Day
[Chart: Beacon Records and Panel Records per day, July 2009 to February 2011; y-axis: # of records, 0 to 25,000,000,000]
Monthly Totals
[Chart: monthly totals of Beacon Records and Panel Records, July 2009 to February 2011; y-axis: # of records, 0 to 600,000,000,000]
High Level Data Flow
Panel ETL
Census ETL
Delivery
Enterprise Data Warehouse : Sybase IQ 15.2 Multiplex
EDW currently comprises 20 servers running Windows 2003 R2 x64
– Currently 220 Intel CPUs
– Dedicated EDW technical team of 3 DBAs and 1 Administrator
– Ability to grow compute capacity and storage capacity independently
EDW data repository housed on both EMC VMAX and CLARiiON
– 4 EDW instances (2 in Virginia and 2 in Illinois)
– One EDW instance is 147TB usable (approx. 200TB of raw data)
– Production EDW Drive Layout
416 x 1TB SATA, RAID6, 14+2
42 x 600GB 15K, RAID1
8 x 400GB Flash, RAID5, 7+1
Current Capacity and Performance Metrics
– 1,835,412,793,799 Rows loaded
– 140TB in 14,168 tables
– Capable of Loading 56 Billion rows per hour
Subsystem
System designed using multiple subsystems
Easily take out and replace different components as demands change
In some cases, such as first-stage tag processing, moved from a single server to a cluster of servers in a few months
Periodically redesign different subsystems to support increased processing demands
Many systems on their third generation of technology
Homegrown Distributed Processing
2002 – comScore distributed processing framework
– Reduced core aggregation from 48 hours to 7 hours
– Reduced final product creation from 24 hours to 2 hours
Scalability Wall
Open Source Hadoop framework
Greenplum
Greenplum MPP
– 80 Node Cluster: 1 Master; 6 ETL; 72 Workers
– Using Dell R510 with 12 x 600GB 15K drives (RAID), 64GB RAM, 24 cores (HT)
– Support analytic end users with access to record-level data through a SQL interface
– Ability to load over 400 billion rows in 8 hours
– Hourly data loading in place
– Allow the analysts to mine the data for business uses
– Use for quick analysis of raw event data and for the ideation and creation of new products
Hadoop
– Dev - 6x Dell 2950 w/6 1TB
– Prod - 10x Dell R710 w/ 6 600GB
– Prod in 2 weeks – 10x Dell R710 w/6 600GB & 20x Dell R510 w/12 2TB
– Moving large processing jobs that are constrained by our current framework to Hadoop. Some large analytical runs currently take over 40 hours on 32 servers, and we are re-engineering them to reduce processing time.
– We have found that the Fair Scheduler works well for our job loads
– We use a “homegrown” workflow system (BORG) that manages tasks inside and outside Hadoop.
Sharding
Sharding divides work across multiple systems using different mechanisms
Shard data as far upstream as possible
Breaking data into multiple chunks early in processing makes it possible to add compute capacity downstream to accommodate large volume increases in data ingest
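One common sharding mechanism is hashing a record key to a shard number. The minimal sketch below assumes that approach; the deck does not say which mechanisms comScore uses. The `shard_for` and `shard_records` helpers and the `(cookie_id, payload)` record shape are hypothetical.

```python
import hashlib
from collections import defaultdict


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard number using a stable hash (same key, same shard)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def shard_records(records, num_shards):
    """Split (cookie_id, payload) records into per-shard lists as early as possible."""
    shards = defaultdict(list)
    for cookie_id, payload in records:
        shards[shard_for(cookie_id, num_shards)].append((cookie_id, payload))
    return shards
```

Because every record for a given key lands in the same shard, each downstream worker can aggregate its shard independently, and capacity can be added by raising the shard count.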
Sorting
We use DMExpress from Syncsort across hundreds of servers; this allows for efficient data processing
We sort input data based on a column in advance
To calculate uniques, check whether the current value differs from the prior value and, if so, increment a counter
We now have aggregation systems that can process over 50 GB of data with 357 million rows in less than an hour on a Dell R710 2U server
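The uniques calculation on pre-sorted input can be sketched in a few lines. This is an illustration of the technique described on the slide, not comScore's implementation; the function name `count_uniques_sorted` is made up for the example.

```python
from typing import Iterable


def count_uniques_sorted(values: Iterable[str]) -> int:
    """Count distinct values in an already-sorted stream in one pass."""
    count = 0
    prior = object()  # sentinel that never equals a real value
    for value in values:
        if value != prior:  # value changed, so this is a new distinct value
            count += 1
            prior = value
    return count
```

Since equal values are adjacent after sorting, the pass needs only the prior value in memory, rather than a hash set of every value seen.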
Compression w/Sorting
Compress Log Files when processing large volumes of log data
Several advantages to Sorting Data First:
– Reduces the size of the data
– Improves application performance
Examples:
– 1 Hour of our data (313 GB raw, 815 million rows)
– Standard compression of time ordered data is 93GB (30% of original)
– Standard compression on a 2 key sorted set is 56GB (18% of original)
– For one day it saves 800GB
– For one month it saves 25 TB
– For 90 days it saves 75TB
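The effect of sorting before compressing can be reproduced on a small scale. The sketch below uses synthetic `cookie,site` rows and Python's `zlib`; the row format, the sizes, and the helper names are assumptions chosen for illustration, and the ratios it yields are not the ones reported on the slide.

```python
import random
import zlib


def compressed_size(rows):
    """Size in bytes of the concatenated rows after zlib compression."""
    return len(zlib.compress("".join(rows).encode("utf-8")))


def make_rows(num_rows=50_000, num_cookies=2_000, num_sites=20, seed=0):
    """Generate synthetic 'cookie,site' log rows in arrival (unsorted) order."""
    rng = random.Random(seed)
    cookies = [f"{rng.getrandbits(64):016x}" for _ in range(num_cookies)]
    sites = [f"site{i}.example" for i in range(num_sites)]
    return [f"{rng.choice(cookies)},{rng.choice(sites)}\n" for _ in range(num_rows)]


rows = make_rows()
arrival_order = compressed_size(rows)
# Sorting the row strings orders by cookie, then site: a two-key sort.
two_key_sorted = compressed_size(sorted(rows))
```

Sorting places rows with the same cookie next to each other, so the compressor finds long repeated runs within its window and `two_key_sorted` comes out smaller than `arrival_order`.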
Big data makes you think differently
Question: How many distinct cookies over 3 months?
Data: 3 monthly tables with distinct cookies, indexed
Size: 10B records per table
Platform: Sybase IQ
Attempt: SELECT COUNT of cookies over a UNION of the 3 monthly tables
– The UNION operator removes duplicates (DISTINCT)
Result: FAIL. Out of temp space. Out of luck.
– Failed after 30 minutes.
Why? UNION performs a SELECT and then a DISTINCT (sorting 30B rows)
Rethink the problem!
INNER joins are cheaper
No sort, they use existing indexes
Remember set theory? Of course you do!
Let months be {A, B, C}
[Venn diagram of A, B, and C]
|A ∪ B ∪ C| = |A| + |B| + |C| – |A ∩ B| – |A ∩ C| – |C ∩ B| + |A ∩ B ∩ C|
A ∩ B ∩ C = (A ∩ B) ∩ (A ∩ C) ∩ (C ∩ B)
INNER join on only 2 tables of data at a time
2-month intersections took 2 hours each and were less taxing on memory
Used intersection of intermediate (indexed!) results… 5 mins
Total query time: 6.5 hours
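The inclusion-exclusion identity above can be checked with ordinary Python sets standing in for the monthly cookie tables. This is a small-scale sketch; the function name is hypothetical and the set intersections stand in for the INNER joins run in Sybase IQ.

```python
def distinct_over_three(a: set, b: set, c: set) -> int:
    """Count distinct members of a, b, and c using pairwise intersections only."""
    ab = a & b  # stands in for the INNER join of months A and B
    ac = a & c
    bc = b & c
    abc = ab & ac & bc  # triple intersection from the intermediate results
    return len(a) + len(b) + len(c) - len(ab) - len(ac) - len(bc) + len(abc)


month_a = {1, 2, 3, 4}
month_b = {3, 4, 5}
month_c = {4, 5, 6, 7}
total = distinct_over_three(month_a, month_b, month_c)
```

The result matches `len(month_a | month_b | month_c)`, but no step ever has to deduplicate all three inputs at once.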
TCO with Large Cluster Systems
Examine replication factor and disk configuration for systems with
replication built into the framework to support redundancy and
concurrency
Example:
Hadoop cluster that supports 108TB of base compressed data
Hypothetical Configurations:
– Replication Factor of 3
R710 (6x drives, JBOD); requires 162 servers
R510 (12x drives, JBOD); requires 68 servers
– Replication Factor of 2
R710 (6x drives, RAID 5); requires 129 servers
R510 (12x drives, RAID 5); requires 54 servers
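The server counts follow from a simple capacity calculation: base data times replication factor, divided by the usable capacity of one server. The sketch below assumes 2.0 TB and 4.8 TB of usable HDFS space per R710 and R510 respectively. Those figures are back-solved from the slide's replication-factor-3 counts and are not stated in the deck. The RAID 5 rows additionally depend on parity overhead, which the sketch does not model.

```python
import math


def servers_needed(base_tb: float, replication_factor: int,
                   usable_tb_per_server: float) -> int:
    """Servers required to hold base_tb of data at the given replication factor."""
    raw_tb = base_tb * replication_factor
    return math.ceil(raw_tb / usable_tb_per_server)


# Assumed usable capacities per server (back-solved, not from the deck).
R710_JBOD_TB = 2.0
R510_JBOD_TB = 4.8

r710_servers = servers_needed(108, 3, R710_JBOD_TB)
r510_servers = servers_needed(108, 3, R510_JBOD_TB)
```

Dropping the replication factor from 3 to 2 cuts raw storage by a third, which is why the choice of disk configuration has such a large effect on total cost.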
Useful Factoids
Colorful, bite-sized graphical representations of the best discoveries we unearth.
Visit www.comscoredatamine.com or follow @datagems for the latest gems.
Thank You!
Michael Brown
CTO
comScore, Inc.
mbrown@comscore.com
