OBSERVABILITY: NOT JUST AN OPS THING
@cyen
@honeycombio
OBSERVABILITY THE DEV PROCESS
The practice of understanding the internal state of a system via knowledge of its external outputs.
Wikipedia (paraphrased)
Twitter hive mind
The ability to answer questions about your system, using data
DECIDE IT
BUILD IT
VERIFY IT (ON MY MACHINE)
VERIFY IT (IN PRODUCTION)
WATCH IT
How do my error rates look these days?
Have things gotten slower over time?
What will the impact of this change we’re planning be?
What does "normal" look like these days? Does it line up with what I thought?
What does my service look like from that one huge, annoying customer’s perspective?
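As a rough illustration of what "answering questions using data" can mean, the sketch below computes an error rate, a per-day mean duration, and a single customer's view from a handful of request events. The event list, the `day` labels, and the helper names are all hypothetical and invented for this example; only field names such as `status` and `request_dur_ms` echo fields that appear later in the talk.

```python
from collections import defaultdict

# Hypothetical request events (illustrative data, not real measurements).
EVENTS = [
    {"day": "day-1", "customer_id": "acme", "status": 200, "request_dur_ms": 40},
    {"day": "day-1", "customer_id": "acme", "status": 500, "request_dur_ms": 900},
    {"day": "day-1", "customer_id": "bolt", "status": 200, "request_dur_ms": 35},
    {"day": "day-2", "customer_id": "acme", "status": 200, "request_dur_ms": 60},
    {"day": "day-2", "customer_id": "bolt", "status": 200, "request_dur_ms": 50},
]


def error_rate(events):
    """Fraction of events whose status is a 5xx server error."""
    if not events:
        return 0.0
    return sum(1 for e in events if e["status"] >= 500) / len(events)


def mean_duration_by_day(events):
    """Mean request_dur_ms per day, to check whether things got slower."""
    buckets = defaultdict(list)
    for e in events:
        buckets[e["day"]].append(e["request_dur_ms"])
    return {day: sum(vals) / len(vals) for day, vals in sorted(buckets.items())}


def for_customer(events, customer_id):
    """Restrict the view to one customer's traffic."""
    return [e for e in events if e["customer_id"] == customer_id]


if __name__ == "__main__":
    print("overall error rate:", error_rate(EVENTS))
    print("mean duration by day:", mean_duration_by_day(EVENTS))
    print("acme error rate:", error_rate(for_customer(EVENTS, "acme")))
```

The per-customer question only becomes answerable because each event carries a `customer_id`; an aggregate counter would not allow that slice after the fact.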
WHY DOES THIS MATTER SO MUCH TO ME?
▸ How’s our load? Is it spread reasonably evenly across our Kafka partitions?
▸ Did latency increase in our API server? How does our new batching endpoint compare to our old RESTy endpoint?
▸ How did those recent memory optimizations affect our query-serving capacity?
▸ How’s our load? Are high-volume customers spread reasonably evenly across our Kafka partitions?
▸ Did latency increase in our API server? Which customers were impacted the most? And who’ll benefit the most from batching?
▸ How did those recent memory optimizations affect our query-serving capacity for customers with string-heavy payloads?
Influx/Days 2017 San Francisco | Christine Yen
OK. SO WHAT DOES THIS LOOK LIKE?
DECIDE IT
BUILD IT
VERIFY IT (WFM🤘)
VERIFY IT (IN PROD)
WATCH IT
Like debug statements
in production data
Feature flags and flexible observability tools = manual canarying
… except we can do it for everything
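One way to picture "manual canarying" is to record the feature flag's state on every event and then compare a metric between the flag-on and flag-off groups. The sketch below assumes each event is a plain dict; the flag name `flag_batching`, the sample events, and the `compare_by_flag` helper are hypothetical and not taken from the talk.

```python
from statistics import median


def compare_by_flag(events, flag, metric="request_dur_ms"):
    """Split events on a boolean feature-flag field and summarize a metric.

    Events that lack the flag field are treated as flag-off.
    """
    groups = {True: [], False: []}
    for e in events:
        groups[bool(e.get(flag, False))].append(e[metric])
    return {
        enabled: {"count": len(vals), "median": median(vals) if vals else None}
        for enabled, vals in groups.items()
    }


if __name__ == "__main__":
    # Hypothetical events: three requests with the flag on, two with it off.
    sample = [
        {"flag_batching": True, "request_dur_ms": 10},
        {"flag_batching": True, "request_dur_ms": 20},
        {"flag_batching": True, "request_dur_ms": 30},
        {"flag_batching": False, "request_dur_ms": 50},
        {"request_dur_ms": 70},
    ]
    print(compare_by_flag(sample, "flag_batching"))
```

Because the flag is just another field on the event, the same comparison works for any flag or any metric, which is roughly the "we can do it for everything" point.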
BUILD ID
FEATURE FLAGS
CUSTOMER IDS
GRAPHS
ALERTS
CODE
RELEASES
1. HYPOTHESIS
2. INSTRUMENTATION (MAYBE)
3. VALIDATION (OR NOT)
4. ONWARD
TAKING THE FIRST FEW STEPS
CONCEPTUALLY
▸ Start at the edge with basic, common attributes (e.g. HTTP)
▸ Business-relevant or infrastructure-specific characteristics (e.g. customer ID, DB replica set)
▸ Temporary additional fields for validating hypotheses
▸ Prune stale fields (if necessary)
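The layering above can be sketched as a small event builder: start with HTTP attributes at the edge, merge in business or infrastructure context, add temporary fields while testing a hypothesis, and prune them once they are stale. This is a minimal sketch under assumptions; `build_event`, `prune`, and the `tmp_cache_hit` field are hypothetical names chosen for illustration.

```python
import time


def build_event(method, url, status, started_at, context=None, temp_fields=None):
    """Assemble one structured event for a handled request.

    started_at is a time.monotonic() reading taken when the request arrived.
    """
    event = {
        # Basic, common attributes captured at the edge.
        "method": method,
        "url": url,
        "status": status,
        "request_dur_ms": (time.monotonic() - started_at) * 1000.0,
    }
    # Business-relevant or infrastructure-specific characteristics.
    event.update(context or {})
    # Temporary fields added only while validating a hypothesis.
    event.update(temp_fields or {})
    return event


def prune(event, stale_fields):
    """Return a copy of the event without fields that are no longer useful."""
    return {k: v for k, v in event.items() if k not in stale_fields}


if __name__ == "__main__":
    start = time.monotonic()
    ev = build_event(
        "GET", "/1/events", 200, start,
        context={"customer_id": "c1", "db_replica_set": "rs0"},
        temp_fields={"tmp_cache_hit": True},
    )
    print(prune(ev, {"tmp_cache_hit"}))
```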
SOME BEST PRACTICES
▸ Contextual, structured data
▸ Common set of nouns and consistent naming
▸ Don't be dogmatic; let the use case dictate the ingest pattern
  ▸ e.g. instrumenting individual reads while batching writes
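To make "contextual, structured data" and "consistent naming" a little more concrete, here is a minimal sketch of an event object that collects fields for one unit of work and names every timing with the same `_dur_ms` suffix seen in the example schema later in the talk. The `Event` class and its methods are hypothetical and are not the API of any particular library.

```python
import json
import time
from contextlib import contextmanager


class Event:
    """Collects key/value context for one unit of work (e.g. one request)."""

    def __init__(self, **fields):
        self.fields = dict(fields)

    def add(self, **fields):
        """Attach more context as it becomes known."""
        self.fields.update(fields)

    @contextmanager
    def timer(self, name):
        """Time a block and store it as '<name>_dur_ms' for consistent naming."""
        start = time.monotonic()
        try:
            yield
        finally:
            self.fields[f"{name}_dur_ms"] = (time.monotonic() - start) * 1000.0

    def to_json(self):
        """Serialize as one structured JSON line."""
        return json.dumps(self.fields, sort_keys=True)


if __name__ == "__main__":
    ev = Event(request_id="r1", method="POST")
    with ev.timer("json_decoding"):
        json.loads('{"a": 1}')
    ev.add(status=200)
    print(ev.to_json())
```

Routing every timing through one helper keeps the suffix uniform, so a later query can rely on `*_dur_ms` fields being comparable.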
AN EXAMPLE SCHEMA EVOLUTION
first pass:
- server_hostname
- method
- url
- build_id
- remote_addr
- request_id
- status
- x_forwarded_for
- error
- event_time
- team_id
- payload_size
- sample_rate
then we added:
- dropped
- get_schema_dur_ms
- protobuf_encoding_dur_ms
- kafka_write_dur_ms
- request_dur_ms
- json_decoding_dur_ms +others
a couple of days later, we added:
- offset
- kafka_topic
- chosen_partition
after that:
- memory_inuse
- num_goroutines
a week after that:
- warning
- drop_reason
and on and on, adding 2-3 fields
every couple of weeks:
- user_agent
- unknown_columns
- dataset_partitions
- dataset_id
- dataset_name
- api_version
- create_marker_dur_ms
- marker_id
- nil_value_for_columns
- batch
- gzipped
- batch_datapoint_lens
- batch_num_datasets
- batch_process_datapoints_dur_ms
- batch_validate_datasets_dur_ms
- batch_dataset_names
- dataset_columns
- event_columns
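A schema that grows this way means older events simply lack the newer fields, so anything reading the data has to tolerate missing keys. The sketch below assumes events are plain dicts; the helper names and the sample values are hypothetical, while the field names are borrowed from the lists above.

```python
def columns_seen(events):
    """Return field names in order of first appearance across the events."""
    seen = []
    for e in events:
        for key in e:
            if key not in seen:
                seen.append(key)
    return seen


def column(events, name, default=None):
    """Read one field from every event, filling in a default where it is absent."""
    return [e.get(name, default) for e in events]


if __name__ == "__main__":
    # Hypothetical events written at three points in the schema's life.
    history = [
        {"method": "GET", "status": 200},
        {"method": "POST", "status": 200, "kafka_write_dur_ms": 3.2},
        {"method": "GET", "status": 500, "kafka_write_dur_ms": 4.0,
         "drop_reason": "rate_limited"},
    ]
    print(columns_seen(history))
    print(column(history, "drop_reason"))
```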
DEVELOPERS, OUR MISSION:
▸ Stop writing software based on intuition, start backing it up with data
▸ Teach observability tools to speak more than "Ops"
▸ ??? (← a.k.a., Ask lots of questions and validate hypotheses)
▸ Profit!
THANKS! @cyen
