SlideShare a Scribd company logo
bertjan@openvalue.eu
Debugging distributed systems
Bert Jan Schrijver
@bjschrijver
Debugging distributed systems: the good parts
bertjan@openvalue.eu
Bert Jan Schrijver
@bjschrijver
Networking 101
How the internet works
Why?
Bert Jan Schrijver
L e t ’ s m e e t
@bjschrijver
Why are distributed
systems difficult?
Networking 101
What?
Why?
✅
Demo
War stories
Conclusion
W h a t ‘ s n e x t ?
Outline
A structured approach
@bjschrijver
What is a distributed system?
A distributed system is a system whose
components are located on different
networked computers
which communicate and coordinate their
actions by passing messages to one
another.
• Concurrency of components
• Lack of a global clock
• Independent failure of components
➡ Distributed systems are harder to
reason about
Characteristics of distributed systems
Source: http://www.nasa.gov/images/content/218652main_STOCC_FS_img_lg.jpg
Working with distributed systems is
fundamentally different from writing
software on a single computer
- Martin Kleppmann
- and the main difference is that there are
lots of new and exciting ways for things to
go wrong.
“
”
Photo: Dave Lehl
”
Why do things go wrong?
“ ”
Photo: Dave Lehl
The fallacies of distributed computing
are a set of assertions made by L Peter
Deutsch and others at Sun Microsystems
describing false assumptions that
programmers new to distributed
applications invariably make.
1. The network is reliable;
2. Latency is zero;
3. Bandwidth is infinite;
4. The network is secure;
5. Topology doesn't change;
6. There is one administrator;
7. Transport cost is zero;
8. The network is homogeneous.
Fallacies of distributed computing
What could possibly go wrong?
“ ”
Photo: Dave Lehl
OSI & TCP/IP
Source: https://www.guru99.com/difference-tcp-ip-vs-osi-model.html
.. in your browser’s address bar and press Enter
What happens when you type google.com…
Source: https://github.com/alex/what-happens-when
16
Source: https://7216-presscdn-0-76-pagely.netdna-ssl.com/wp-content/uploads/2011/12/confused-man-single-good-men.jpg
Where do I start?
A structured approach
to debugging distributed systems
@bjschrijver
Check DNS & routing
Check connection
Debug client side
Create minimal reproducer
Debug server side
Observe & document
Wrap up & post mortem
Inspect traffic / messages
Step 1: Observe & document
• What do you know about the problem?
• Inspect logging, errors, metrics, tracing
• Draw the path from source to target - what’s
in between? Focus on details!
• Document what you know
• Can we reproduce in a test?
• By injecting errors, for example
Tools
Whiteboard,
documentation, logging,
metrics, tracing
(opentracing.io), tests,
jepsen.io
Step 1: Observe & document
Step 2: Create minimal reproducer
• Goal: maximise the amount of debugging
cycles
• Focus on short development iterations /
feedback loops
• Get close to the action!
Tools
IDE, Shell scripts,
SSH tunnels, Curl
Step 3: Debug client side
• Focus on eliminating anything that could be
wrong on the client side
• Are we connecting to the right host?
• Do we send the right message?
• Do we receive a response?
• Not much different from local
debugging
Tools
IDE, debugger,
logging
Step 4: Check DNS & routing
• DNS:
• Make sure you know what IP address the
hostname should resolve to
• Verify that this actually happens
at the client
• Routing:
• Verify you can reach the
target machine
Tools
host, nslookup,
dig, whois, ping,
traceroute,
nslookup.io,
dnschecker.org
Step 5: Check connection
• Can we connect to the port?
• If not, do we get a REJECT or a DROP?
• Does the connection open and stay open?
• Are we talking TLS?
• What is the connection speed
between us?
Tools
telnet, nc, curl,
iperf
Step 6: Inspect traffic / messages
• Do we send the right request?
• Do we receive the right response?
• How do we know?
• How do we handle TLS?
• Are there any load balancers
or proxies in between?
Tools
curl, wireshark,
tcpdump, network
tab in browser,
mitm/tls proxy
Step 7: Debug server side
• Inspect the remote host
• Can we attach a remote debugger?
• See https://youtube.com/OpenValue
• Profiling
• Java Flight Recorder
• Strace
Tools
SSH tunnels,
remote debugger,
profiler, strace,
JFR
Step 8: Wrap up & post mortem
• Document the issue:
• Timeline
• What did we see?
• Why did it happen?
• What was the impact?
• How did we find out?
• What did we do to mitigate and fix?
• What should we do to prevent
repetition?
Tools
Whiteboard,
documentation
If you really want a reliable system, you
have to understand what its failure modes
are. You have to actually have witnessed
it misbehaving.
- Jason Cahoon
“
”
Distributed systems war stories
The one where it worked half of the time…
The one at a school…
Arnhem JUG March 2023 - Debugging distributed systems
The one with breaking news…
Summary: a structured approach
to debugging distributed systems
@bjschrijver
Check DNS & routing
Check connection
Debug client side
Create minimal reproducer
Debug server side
Observe & document
Wrap up & post mortem
Inspect traffic / messages
Source: https://cdn2.vox-cdn.com/thumbor/J9OqPYS7FgI9fjGhnF7AFh8foVY=/148x0:1768x1080/1280x854/cdn0.vox-cdn.com/uploads/chorus_image/image/46147742/cute-success-kid-1920x1080.0.0.jpg
THAT’S IT.
NOW GO KICK SOME ASS!
Questions?
@bjschrijver
Thanks for your time.
Got feedback? Tweet it!
All pictures belong
to their respective
authors
@bjschrijver
Ad

More Related Content

Similar to Arnhem JUG March 2023 - Debugging distributed systems (20)

WTF is Penetration Testing v.2
WTF is Penetration Testing v.2WTF is Penetration Testing v.2
WTF is Penetration Testing v.2
Scott Sutherland
 
When Security Tools Fail You
When Security Tools Fail YouWhen Security Tools Fail You
When Security Tools Fail You
Michael Gough
 
Rewriting DevOps
Rewriting DevOpsRewriting DevOps
Rewriting DevOps
Matthew Boeckman
 
Derbycon - Passing the Torch
Derbycon - Passing the TorchDerbycon - Passing the Torch
Derbycon - Passing the Torch
Will Schroeder
 
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Brian Brazil
 
Troubleshooting: A High-Value Asset For The Service-Provider Discipline
Troubleshooting: A High-Value Asset For The Service-Provider DisciplineTroubleshooting: A High-Value Asset For The Service-Provider Discipline
Troubleshooting: A High-Value Asset For The Service-Provider Discipline
Sagi Brody
 
How To Start Your InfoSec Career
How To Start Your InfoSec CareerHow To Start Your InfoSec Career
How To Start Your InfoSec Career
Andrew McNicol
 
Push Functional Testing Further
Push Functional Testing FurtherPush Functional Testing Further
Push Functional Testing Further
Alan Richardson
 
Luncheon 2016-07-16 - Topic 2 - Advanced Threat Hunting by Justin Falck
Luncheon 2016-07-16 -  Topic 2 - Advanced Threat Hunting by Justin FalckLuncheon 2016-07-16 -  Topic 2 - Advanced Threat Hunting by Justin Falck
Luncheon 2016-07-16 - Topic 2 - Advanced Threat Hunting by Justin Falck
North Texas Chapter of the ISSA
 
Sophisticated Attacks - Can We Really Detect Them _v1.2.pdf
Sophisticated Attacks - Can We Really Detect Them _v1.2.pdfSophisticated Attacks - Can We Really Detect Them _v1.2.pdf
Sophisticated Attacks - Can We Really Detect Them _v1.2.pdf
Michael Gough
 
Heartbleed
HeartbleedHeartbleed
Heartbleed
Punit Goswami
 
THOTCON 0x6: Going Kinetic on Electronic Crime Networks
THOTCON 0x6: Going Kinetic on Electronic Crime NetworksTHOTCON 0x6: Going Kinetic on Electronic Crime Networks
THOTCON 0x6: Going Kinetic on Electronic Crime Networks
John Bambenek
 
Monitoring microservices
Monitoring microservicesMonitoring microservices
Monitoring microservices
William Brander
 
2023 NCIT: Introduction to Intrusion Detection
2023 NCIT: Introduction to Intrusion Detection2023 NCIT: Introduction to Intrusion Detection
2023 NCIT: Introduction to Intrusion Detection
APNIC
 
Scaling a Web Site - OSCON Tutorial
Scaling a Web Site - OSCON TutorialScaling a Web Site - OSCON Tutorial
Scaling a Web Site - OSCON Tutorial
duleepa
 
Heartbleed Bug Vulnerability: Discovery, Impact and Solution
Heartbleed Bug Vulnerability: Discovery, Impact and SolutionHeartbleed Bug Vulnerability: Discovery, Impact and Solution
Heartbleed Bug Vulnerability: Discovery, Impact and Solution
CASCouncil
 
When the internet bleeded : RootConf 2014
When the internet bleeded : RootConf 2014When the internet bleeded : RootConf 2014
When the internet bleeded : RootConf 2014
Anant Shrivastava
 
What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)
Brian Brazil
 
Can_We_Really_Detect_These_So_Called_Sophisticated_Attacks?
Can_We_Really_Detect_These_So_Called_Sophisticated_Attacks?Can_We_Really_Detect_These_So_Called_Sophisticated_Attacks?
Can_We_Really_Detect_These_So_Called_Sophisticated_Attacks?
Michael Gough
 
How an Attacker "Audits" Your Software Systems
How an Attacker "Audits" Your Software SystemsHow an Attacker "Audits" Your Software Systems
How an Attacker "Audits" Your Software Systems
Security Innovation
 
WTF is Penetration Testing v.2
WTF is Penetration Testing v.2WTF is Penetration Testing v.2
WTF is Penetration Testing v.2
Scott Sutherland
 
When Security Tools Fail You
When Security Tools Fail YouWhen Security Tools Fail You
When Security Tools Fail You
Michael Gough
 
Derbycon - Passing the Torch
Derbycon - Passing the TorchDerbycon - Passing the Torch
Derbycon - Passing the Torch
Will Schroeder
 
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Brian Brazil
 
Troubleshooting: A High-Value Asset For The Service-Provider Discipline
Troubleshooting: A High-Value Asset For The Service-Provider DisciplineTroubleshooting: A High-Value Asset For The Service-Provider Discipline
Troubleshooting: A High-Value Asset For The Service-Provider Discipline
Sagi Brody
 
How To Start Your InfoSec Career
How To Start Your InfoSec CareerHow To Start Your InfoSec Career
How To Start Your InfoSec Career
Andrew McNicol
 
Push Functional Testing Further
Push Functional Testing FurtherPush Functional Testing Further
Push Functional Testing Further
Alan Richardson
 
Luncheon 2016-07-16 - Topic 2 - Advanced Threat Hunting by Justin Falck
Luncheon 2016-07-16 -  Topic 2 - Advanced Threat Hunting by Justin FalckLuncheon 2016-07-16 -  Topic 2 - Advanced Threat Hunting by Justin Falck
Luncheon 2016-07-16 - Topic 2 - Advanced Threat Hunting by Justin Falck
North Texas Chapter of the ISSA
 
Sophisticated Attacks - Can We Really Detect Them _v1.2.pdf
Sophisticated Attacks - Can We Really Detect Them _v1.2.pdfSophisticated Attacks - Can We Really Detect Them _v1.2.pdf
Sophisticated Attacks - Can We Really Detect Them _v1.2.pdf
Michael Gough
 
THOTCON 0x6: Going Kinetic on Electronic Crime Networks
THOTCON 0x6: Going Kinetic on Electronic Crime NetworksTHOTCON 0x6: Going Kinetic on Electronic Crime Networks
THOTCON 0x6: Going Kinetic on Electronic Crime Networks
John Bambenek
 
Monitoring microservices
Monitoring microservicesMonitoring microservices
Monitoring microservices
William Brander
 
2023 NCIT: Introduction to Intrusion Detection
2023 NCIT: Introduction to Intrusion Detection2023 NCIT: Introduction to Intrusion Detection
2023 NCIT: Introduction to Intrusion Detection
APNIC
 
Scaling a Web Site - OSCON Tutorial
Scaling a Web Site - OSCON TutorialScaling a Web Site - OSCON Tutorial
Scaling a Web Site - OSCON Tutorial
duleepa
 
Heartbleed Bug Vulnerability: Discovery, Impact and Solution
Heartbleed Bug Vulnerability: Discovery, Impact and SolutionHeartbleed Bug Vulnerability: Discovery, Impact and Solution
Heartbleed Bug Vulnerability: Discovery, Impact and Solution
CASCouncil
 
When the internet bleeded : RootConf 2014
When the internet bleeded : RootConf 2014When the internet bleeded : RootConf 2014
When the internet bleeded : RootConf 2014
Anant Shrivastava
 
What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)
Brian Brazil
 
Can_We_Really_Detect_These_So_Called_Sophisticated_Attacks?
Can_We_Really_Detect_These_So_Called_Sophisticated_Attacks?Can_We_Really_Detect_These_So_Called_Sophisticated_Attacks?
Can_We_Really_Detect_These_So_Called_Sophisticated_Attacks?
Michael Gough
 
How an Attacker "Audits" Your Software Systems
How an Attacker "Audits" Your Software SystemsHow an Attacker "Audits" Your Software Systems
How an Attacker "Audits" Your Software Systems
Security Innovation
 

Recently uploaded (20)

Not So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java WebinarNot So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java Webinar
Tier1 app
 
Adobe Lightroom Classic Crack FREE Latest link 2025
Adobe Lightroom Classic Crack FREE Latest link 2025Adobe Lightroom Classic Crack FREE Latest link 2025
Adobe Lightroom Classic Crack FREE Latest link 2025
kashifyounis067
 
Landscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature ReviewLandscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature Review
Hironori Washizaki
 
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Lionel Briand
 
Adobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest VersionAdobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest Version
kashifyounis067
 
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
AxisTechnolabs
 
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
University of Hawai‘i at Mānoa
 
How can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptxHow can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptx
laravinson24
 
Download YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full ActivatedDownload YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full Activated
saniamalik72555
 
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdfMicrosoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
TechSoup
 
Douwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License codeDouwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License code
aneelaramzan63
 
Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)
Allon Mureinik
 
Kubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptxKubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptx
CloudScouts
 
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRYLEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
NidaFarooq10
 
Revolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptxRevolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptx
nidhisingh691197
 
Meet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Meet the Agents: How AI Is Learning to Think, Plan, and CollaborateMeet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Meet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Maxim Salnikov
 
Top 10 Client Portal Software Solutions for 2025.docx
Top 10 Client Portal Software Solutions for 2025.docxTop 10 Client Portal Software Solutions for 2025.docx
Top 10 Client Portal Software Solutions for 2025.docx
Portli
 
FL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full VersionFL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full Version
tahirabibi60507
 
Adobe Master Collection CC Crack Advance Version 2025
Adobe Master Collection CC Crack Advance Version 2025Adobe Master Collection CC Crack Advance Version 2025
Adobe Master Collection CC Crack Advance Version 2025
kashifyounis067
 
Solidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license codeSolidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license code
aneelaramzan63
 
Not So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java WebinarNot So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java Webinar
Tier1 app
 
Adobe Lightroom Classic Crack FREE Latest link 2025
Adobe Lightroom Classic Crack FREE Latest link 2025Adobe Lightroom Classic Crack FREE Latest link 2025
Adobe Lightroom Classic Crack FREE Latest link 2025
kashifyounis067
 
Landscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature ReviewLandscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature Review
Hironori Washizaki
 
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Lionel Briand
 
Adobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest VersionAdobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest Version
kashifyounis067
 
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
AxisTechnolabs
 
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
University of Hawai‘i at Mānoa
 
How can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptxHow can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptx
laravinson24
 
Download YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full ActivatedDownload YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full Activated
saniamalik72555
 
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdfMicrosoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
TechSoup
 
Douwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License codeDouwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License code
aneelaramzan63
 
Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)
Allon Mureinik
 
Kubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptxKubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptx
CloudScouts
 
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRYLEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
NidaFarooq10
 
Revolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptxRevolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptx
nidhisingh691197
 
Meet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Meet the Agents: How AI Is Learning to Think, Plan, and CollaborateMeet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Meet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Maxim Salnikov
 
Top 10 Client Portal Software Solutions for 2025.docx
Top 10 Client Portal Software Solutions for 2025.docxTop 10 Client Portal Software Solutions for 2025.docx
Top 10 Client Portal Software Solutions for 2025.docx
Portli
 
FL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full VersionFL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full Version
tahirabibi60507
 
Adobe Master Collection CC Crack Advance Version 2025
Adobe Master Collection CC Crack Advance Version 2025Adobe Master Collection CC Crack Advance Version 2025
Adobe Master Collection CC Crack Advance Version 2025
kashifyounis067
 
Solidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license codeSolidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license code
aneelaramzan63
 
Ad

Arnhem JUG March 2023 - Debugging distributed systems

  • 2. Debugging distributed systems: the good parts [email protected] Bert Jan Schrijver @bjschrijver Networking 101 How the internet works
  • 4. Bert Jan Schrijver L e t ’ s m e e t @bjschrijver
  • 5. Why are distributed systems difficult? Networking 101 What? Why? ✅ Demo War stories Conclusion W h a t ‘ s n e x t ? Outline A structured approach @bjschrijver
  • 6. What is a distributed system?
  • 7. A distributed system is a system whose components are located on different networked computers which communicate and coordinate their actions by passing messages to one another.
  • 8. • Concurrency of components • Lack of a global clock • Independent failure of components ➡ Distributed systems are harder to reason about Characteristics of distributed systems Source: http://www.nasa.gov/images/content/218652main_STOCC_FS_img_lg.jpg
  • 9. Working with distributed systems is fundamentally different from writing software on a single computer - Martin Kleppmann - and the main difference is that there are lots of new and exciting ways for things to go wrong. “ ” Photo: Dave Lehl ”
  • 10. Why do things go wrong? “ ” Photo: Dave Lehl
  • 11. The fallacies of distributed computing are a set of assertions made by L Peter Deutsch and others at Sun Microsystems describing false assumptions that programmers new to distributed applications invariably make.
  • 12. 1. The network is reliable; 2. Latency is zero; 3. Bandwidth is infinite; 4. The network is secure; 5. Topology doesn't change; 6. There is one administrator; 7. Transport cost is zero; 8. The network is homogeneous. Fallacies of distributed computing
  • 13. What could possibly go wrong? “ ” Photo: Dave Lehl
  • 14. OSI & TCP/IP Source: https://www.guru99.com/difference-tcp-ip-vs-osi-model.html
  • 15. .. in your browser’s address bar and press Enter What happens when you type google.com… Source: https://github.com/alex/what-happens-when
  • 16. 16
  • 18. A structured approach to debugging distributed systems @bjschrijver Check DNS & routing Check connection Debug client side Create minimal reproducer Debug server side Observe & document Wrap up & post mortem Inspect traffic / messages
  • 19. Step 1: Observe & document • What do you know about the problem? • Inspect logging, errors, metrics, tracing • Draw the path from source to target - what’s in between? Focus on details! • Document what you know • Can we reproduce in a test? • By injecting errors, for example Tools Whiteboard, documentation, logging, metrics, tracing (opentracing.io), tests, jepsen.io
  • 20. Step 1: Observe & document
  • 21. Step 2: Create minimal reproducer • Goal: maximise the amount of debugging cycles • Focus on short development iterations / feedback loops • Get close to the action! Tools IDE, Shell scripts, SSH tunnels, Curl
  • 22. Step 3: Debug client side • Focus on eliminating anything that could be wrong on the client side • Are we connecting to the right host? • Do we send the right message? • Do we receive a response? • Not much different from local debugging Tools IDE, debugger, logging
  • 23. Step 4: Check DNS & routing • DNS: • Make sure you know what IP address the hostname should resolve to • Verify that this actually happens at the client • Routing: • Verify you can reach the target machine Tools host, nslookup, dig, whois, ping, traceroute, nslookup.io, dnschecker.org
  • 24. Step 5: Check connection • Can we connect to the port? • If not, do we get a REJECT or a DROP? • Does the connection open and stay open? • Are we talking TLS? • What is the connection speed between us? Tools telnet, nc, curl, iperf
  • 25. Step 6: Inspect traffic / messages • Do we send the right request? • Do we receive the right response? • How do we know? • How do we handle TLS? • Are there any load balancers or proxies in between? Tools curl, wireshark, tcpdump, network tab in browser, mitm/tls proxy
  • 26. Step 7: Debug server side • Inspect the remote host • Can we attach a remote debugger? • See https://youtube.com/OpenValue • Profiling • Java Flight Recorder • Strace Tools SSH tunnels, remote debugger, profiler, strace, JFR
  • 27. Step 8: Wrap up & post mortem • Document the issue: • Timeline • What did we see? • Why did it happen? • What was the impact? • How did we find out? • What did we do to mitigate and fix? • What should we do to prevent repetition? Tools Whiteboard, documentation
  • 28. If you really want a reliable system, you have to understand what its failure modes are. You have to actually have witnessed it misbehaving. - Jason Cahoon “ ”
  • 30. The one where it worked half of the time…
  • 31. The one at a school…
  • 33. The one with breaking news…
  • 34. Summary: a structured approach to debugging distributed systems @bjschrijver Check DNS & routing Check connection Debug client side Create minimal reproducer Debug server side Observe & document Wrap up & post mortem Inspect traffic / messages
  • 37. Thanks for your time. Got feedback? Tweet it! All pictures belong to their respective authors @bjschrijver