DOES MORE DATA MEAN BETTER DECISION MAKING?
(Assessing Data Quality with a Unified-Log and a bit of Stream-Processing)
Scott Krueger
Data Architect
A TALE OF DATA-DRIVEN DECISION MAKING - ACT ONE
Me: "@StreamEngine - how many user sessions have we had from
Illinois in the last 4 hours from those #ChicagoRocks tweets?"
StreamEngine: "In the last 4 hours, we have had 43,578 new user sessions
from Ohio as a result of the #ChicagoRocks tweets"
Me: "How confident are you about that?" (I need to make a
quick call here)
StreamEngine: "Sorry, what was that?"
WHY ARE WE TALKING ABOUT THIS
Analytical Sciences
Data Quality
Business and Culture
Tech
Talks at Big Data Week, London 2016
MOTIVATIONS
http://lemonly.com/work/the-cost-of-bad-data/
"From now on, our cars will more deeply understand that
buses (and other large vehicles) are less likely to yield to
us than other types of vehicles, and we hope to handle
situations like this more gracefully in the future."
http://www.bbc.co.uk/news/technology-35692845
SO WHAT'S GOING ON HERE? WE ARE CREATORS OF POOR QUALITY DATA
Machines (we make machines that consume / create data)
Software (we create software that consumes / creates data sets)
photo: Faruk Ates, https://flic.kr/p/stxXK
COMMON DATA PROBLEMS
https://github.com/Quartz/bad-data-guide
THE 2 (OR SO) PILLARS OF DATA QUALITY
Pillar 1: Data Integrity
Data completeness - is it there? Is it intact? Are the ‘required-to-be-of-value’ fields there?
Data interpretation - what is that thing? What does ‘cost’ mean?
Data change - we don't use this anymore so I’m not writing it anymore. Oh, you’re still reading it?
Pillar 2: Data Validity - what's in there?
Do the values make sense?
Are the values expected?
Data presence* - are the messages making it? Are there as many as there should be when the data was created? Is this an expected volume?
http://www.newyorkinternationallimousines.com/
DATA INTEGRITY
A definition of what the data is so it can be turned into a meaningful
piece or set of information
Varies with approach and ‘structure’ of data
Event Schemas (not to be confused with the relational DB term)
Examples: protocol buffers, thrift, avro
DATA INTEGRITY: THE GREAT TRADE-OFF
Somewhere between left and right, something has to prepare data for
usage elsewhere. There is a cost associated with every position.
Data-In -> Data-Out
Schema-on-Write -> Schema-Inbetween -> Schema-on-Read
A TALE OF DATA-DRIVEN DECISION MAKING - ACT TWO
Me: "@StreamEngine - I would like to measure how effective all of our
data-driven decision making is. I need a measure of quality. I think you
can help me with this."
StreamEngine: "Are you from the future?"
Me: "I'm not from it, but I'm thinking about it..."
WHAT ARE WE DOING ABOUT IT?
This requires a brief understanding of our 'unified log' approach at
Skyscanner
SKYSCANNER EVENT DATA PLATFORM
EXPLOIT THE PLATFORM - MAKE USE OF WHAT YOU HAVE
DATA INTEGRITY (SCHEMA VALIDATION)
Data definition - a message that doesn't fit throws an exception
Try... Catch... Log to the SchemaValidation failure “stream/topic” with the message
SchemaValidation
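The try / catch / log-to-failure-topic pattern above can be sketched roughly as follows. This is a minimal illustration only, not the platform's actual code: the schema, the field names, the `Topics` class and the in-memory lists standing in for log topics are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical event definition: required field names mapped to expected types.
SESSION_SCHEMA = {"session_id": str, "region": str, "timestamp_ms": int}


class SchemaValidationError(Exception):
    """Raised when a message does not fit the event definition."""


def validate(message: dict, schema: dict) -> dict:
    """Return the message unchanged if it fits the schema, otherwise raise."""
    for name, expected_type in schema.items():
        if name not in message:
            raise SchemaValidationError(f"missing field: {name}")
        if not isinstance(message[name], expected_type):
            raise SchemaValidationError(f"wrong type for field: {name}")
    return message


@dataclass
class Topics:
    """In-memory stand-ins for two unified-log topics (illustrative only)."""
    events: list = field(default_factory=list)
    schema_validation_failures: list = field(default_factory=list)


def log_event(message: dict, topics: Topics) -> bool:
    """Try to log an event; on failure, route it to the failure topic with the reason."""
    try:
        topics.events.append(validate(message, SESSION_SCHEMA))
        return True
    except SchemaValidationError as error:
        topics.schema_validation_failures.append(
            {"error": str(error), "message": message}
        )
        return False


topics = Topics()
log_event({"session_id": "abc", "region": "Illinois", "timestamp_ms": 1461931200000}, topics)
log_event({"session_id": "def", "region": "Ohio"}, topics)  # timestamp_ms is missing
```

Keeping the rejected message alongside the error means a consumer of the failure topic can see both what went wrong and the data that caused it.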
DATA VALIDATION: EVERYONE PLAYS A PART
Everywhere between left and right, everything has a data validation
opportunity
Data-In -> Data-Out
Validation-on-Write -> Validation-Inbetween -> Validation-on-Read
DataValidation Reference YAML configuration
DataValidation Flow
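The contents of the reference configuration are not reproduced in this transcript, so the following is only a guess at the general shape of rule-driven data validation. The `RULES` dictionary stands in for a YAML file, and every field name, rule key and threshold in it is hypothetical.

```python
# Hypothetical rules; a dict stands in for a YAML reference configuration.
RULES = {
    "region": {"allowed": {"Illinois", "Ohio"}},
    "session_seconds": {"min": 0, "max": 86400},
}


def validate_values(event: dict, rules: dict) -> list:
    """Return a list of readable issues; an empty list means the event passed."""
    issues = []
    for field_name, rule in rules.items():
        value = event.get(field_name)
        if value is None:
            issues.append(f"{field_name}: missing")
            continue
        if "allowed" in rule and value not in rule["allowed"]:
            issues.append(f"{field_name}: unexpected value {value!r}")
        if "min" in rule and value < rule["min"]:
            issues.append(f"{field_name}: below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            issues.append(f"{field_name}: above maximum {rule['max']}")
    return issues


def volume_is_expected(observed: int, expected: int, tolerance: float = 0.2) -> bool:
    """Data presence check: is the observed count within tolerance of the expected count?"""
    return abs(observed - expected) <= tolerance * expected


good = validate_values({"region": "Illinois", "session_seconds": 120}, RULES)
bad = validate_values({"region": "Atlantis", "session_seconds": -5}, RULES)
```

Because the rules are data rather than code, the same checks could in principle run on write, in between, or on read.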
WHAT DOES IT ALL LOOK LIKE?
alert!
alert!
TO IMPROVE QUALITY...
... IS TO CLOSE THE LOOP
EVERYONE PLAYS A PART
A shared repository for event structure and validation rules
Any service that logs events runs automated tests that use this repository
A generic stream service that assesses data quality and gives the heads up to consumers
Data consumers who find new data quality mishaps commit back to the repository
IF YOU CAN'T MEASURE IT, HOW DO YOU KNOW
YOU ARE IMPROVING THINGS?
Quality of decision making = 100 -
(((# of data issues detected + recent commits for improved detection)
/ # of high quality events logged) * 100)
example: 99.8%
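As a worked example of the formula above, here is a small sketch. The function name and the input counts are made up for illustration; they were chosen only so that the result lands near the 99.8% shown on the slide.

```python
def decision_quality(issues_detected: int, detection_commits: int,
                     high_quality_events: int) -> float:
    """Quality of decision making, as a percentage, per the slide's formula."""
    if high_quality_events <= 0:
        raise ValueError("need at least one high quality event")
    penalty = (issues_detected + detection_commits) / high_quality_events
    return 100 - penalty * 100


# Illustrative inputs: 150 issues, 50 detection commits, 100,000 good events.
score = decision_quality(150, 50, 100_000)
print(f"{score:.1f}%")
```

With these inputs the penalty is 200 / 100,000 = 0.2%, so the score comes out at roughly 99.8%.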
TIPS FOR IMPROVEMENT
Master Data Management
Metadata Management
Handling Data Change
Culture
MASTER DATA MANAGEMENT
(“ONE SOURCE OF REFERENCE/LOOKUP DATA”)
Simple rules to maintain consistency of reference data:
Use enums/constants in schemas for reference data sets you don't provide in your systems
Use authoritative data sources (your internal data services; industry standard sets, e.g. "IATA" for travel, ISO geography/timezones, etc.)
Bring this reference data as close to the processing as possible
API, CSV / JSON, tables, transaction logs -> Unified Log Topic
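One way to read "bring the reference data close to the processing" is to replay a reference-data topic into a local lookup table. The sketch below assumes a hypothetical topic of key/value updates where the latest update per key wins and a `None` value deletes the key; none of these conventions come from the talk.

```python
def build_lookup(reference_updates: list) -> dict:
    """Replay reference-data updates in order; the latest value per key wins."""
    lookup = {}
    for update in reference_updates:
        if update["value"] is None:
            lookup.pop(update["key"], None)  # a None value removes the key
        else:
            lookup[update["key"]] = update["value"]
    return lookup


def is_known_airport(code: str, lookup: dict) -> bool:
    """Check an event value against the locally held reference data."""
    return code in lookup


# Hypothetical reference topic keyed by IATA airport code.
reference_topic = [
    {"key": "ORD", "value": "Chicago O'Hare"},
    {"key": "EDI", "value": "Edinburh"},   # a typo, corrected by the next update
    {"key": "EDI", "value": "Edinburgh"},
]
airports = build_lookup(reference_topic)
```

A stream processor holding `airports` in memory can then validate each event locally instead of calling out to a lookup service per message.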
METADATA MANAGEMENT
(DATA PROVENANCE AND OTHER NICETIES)
Data Debugging
Data flow measurements - how long did it take for my message to go through this pipeline?
Historical records - transparency for everyone (you the business operator, and you the customer)
Governance and regulation - data quality laws? http://www.forbes.com/sites/forbestechcouncil/2016/04/29/how-companies-can-leverage-real-time-platforms-and-metadata-to-improve-healthcare-delivery/2/#53fbea89480b
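The data flow measurement mentioned above (how long a message took to go through a pipeline) could be supported by stamping each message with provenance metadata at every stage. This is a sketch under assumptions: the `provenance` field, the stage names and the millisecond timestamps are all invented for the example.

```python
def stamp(message: dict, stage: str, now_ms: int) -> dict:
    """Return a copy of the message with a provenance entry appended."""
    trail = message.get("provenance", []) + [{"stage": stage, "at_ms": now_ms}]
    return {**message, "provenance": trail}


def pipeline_latency_ms(message: dict) -> int:
    """Time between the first and last recorded stage."""
    trail = message["provenance"]
    return trail[-1]["at_ms"] - trail[0]["at_ms"]


def stage_durations(message: dict) -> list:
    """Per-hop durations as (from_stage, to_stage, milliseconds) tuples."""
    trail = message["provenance"]
    return [
        (a["stage"], b["stage"], b["at_ms"] - a["at_ms"])
        for a, b in zip(trail, trail[1:])
    ]


msg = {"payload": "session-start"}
msg = stamp(msg, "producer", 1_000)
msg = stamp(msg, "stream-processor", 1_250)
msg = stamp(msg, "warehouse", 1_900)
```

The same trail doubles as a historical record for debugging: it shows which stages a message passed through and in what order.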
DATA CHANGE: WE DON'T ALWAYS GET IT RIGHT
Event Definition Changes over Time
T1:
+ float device_diagonal_screen_size = 19;
T2:
+ float device_diagonal_screen_size = 19 [deprecated=true];
+ DisplayMeasurement diagonal_screen_size = 26;
T3:
+ // float device_diagonal_screen_size = 19 [deprecated=true];
+ DisplayMeasurement diagonal_screen_size = 26;
2 rules:
1. Maintain backwards compatibility.
2. Rebuild.
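On the consuming side, backwards compatibility might look like a reader that accepts events written under any of the definitions above. The sketch works on already-decoded events represented as plain dicts. The internal shape of `DisplayMeasurement` is not given in the talk, so the `value` and `unit` keys used here are assumptions.

```python
from typing import Optional


def diagonal_screen_size(event: dict) -> Optional[float]:
    """Prefer the new field; fall back to the deprecated one; None if neither is set."""
    new = event.get("diagonal_screen_size")
    if new is not None:
        return float(new["value"])  # assumed DisplayMeasurement shape
    old = event.get("device_diagonal_screen_size")
    return float(old) if old is not None else None


# Events as they might have been written at different points in time.
old_event = {"device_diagonal_screen_size": 5.5}
mixed_event = {
    "device_diagonal_screen_size": 5.5,
    "diagonal_screen_size": {"value": 5.7, "unit": "inch"},
}
new_event = {"diagonal_screen_size": {"value": 6.1, "unit": "inch"}}

sizes = [diagonal_screen_size(e) for e in (old_event, mixed_event, new_event)]
```

A reader like this keeps historical events usable until the data has been rebuilt under the new definition.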
DATA QUALITY CULTURE
photo: Scott Krueger
WHAT ARE YOU GOING TO DO ABOUT IT?
Understand the causes and details of data problems in your services
Unify and simplify - one source of truth for everything: reference data,
reports, archive, formulae, validation rules, data definitions, metadata
Start measuring - this is your baseline and allows you to measure
confidence in decision making
Work (or evolve) your tech
Fix the system, stop moaning about it
It’s never too late
A TALE OF DATA-DRIVEN DECISION MAKING - ACT THREE - THE FINALE
Me: "@StreamEngine - give me some decision quality numbers!"
StreamEngine: "Right now, decision quality is at 96.05%. This
time last week it was 92.4%. Well done! In 1 week sales are up
2% and this is positively correlated to the decisions you made
with this data. Would you like me to predict sales uplift over the
next month if you improve decision quality by 1%?"
Me: "You bet I would..."
THANKS FOR LISTENING

BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decision Making?
