Fault Tolerance
in Distributed Environment
Speaker
• Orkhan Gasimov, Software Engineer
• 15 years of software engineering;
• variety of technologies (languages & frameworks);
• solution design and implementation;
• Teaching training courses.
• Architecture.
• Java.
• JavaScript / TypeScript.
• Author of training courses.
• Spring Cloud.
• Akka for Java.
Fault Tolerance in Distributed Environment
• Forces:
• Network.
• High Load.
• RPC mechanics.
[Diagram: Web and Mobile clients, API instances, Service instances, DB]
Possible Issues
• Services are distributed across the network.
• Network will eventually fail.
• Services will be unavailable.
• Major issues come from, and are solved in, three areas:
• Network: it fails, and it heals.
• High load: more latency.
• RPC: internal mechanics.
Network
Service Coordination
Service Coordination
• Alternate Routes:
[Diagram: one Consumer with routes to three Producers]
Service Coordination
• Service Discovery:
• Dynamic service locations.
• Scale horizontally on demand.
• Client-side load balancing on the fly.
• Discovery Cluster.
[Diagram: a Consumer with a client-side load balancer (LB), three Producers, and a cluster of three Service Registries]
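The discovery and client-side load-balancing ideas above can be sketched in a few lines. This is a minimal in-memory illustration, not the API of any real registry: `ServiceRegistry`, `RoundRobinClient`, and the addresses are hypothetical names chosen for the example, and a real discovery cluster would replicate this state across several registry nodes.

```python
import itertools
from collections import defaultdict


class ServiceRegistry:
    """In-memory stand-in for a discovery service (hypothetical, not a real registry API)."""

    def __init__(self):
        self._instances = defaultdict(list)

    def register(self, service, address):
        if address not in self._instances[service]:
            self._instances[service].append(address)

    def deregister(self, service, address):
        if address in self._instances[service]:
            self._instances[service].remove(address)

    def lookup(self, service):
        # Return a copy so callers cannot mutate registry state.
        return list(self._instances[service])


class RoundRobinClient:
    """Client-side load balancer that re-reads the registry on every call."""

    def __init__(self, registry, service):
        self._registry = registry
        self._service = service
        self._counter = itertools.count()

    def next_address(self):
        addresses = self._registry.lookup(self._service)
        if not addresses:
            raise LookupError(f"no instances registered for {self._service}")
        return addresses[next(self._counter) % len(addresses)]


registry = ServiceRegistry()
registry.register("producer", "10.0.0.1:8080")
registry.register("producer", "10.0.0.2:8080")
client = RoundRobinClient(registry, "producer")
picked = [client.next_address() for _ in range(4)]
```

Because the client looks up locations on each call, producers that register or deregister later are picked up without reconfiguring the consumer.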
High Load
High Load
• Loaded services may get slower.
• Processing of requests slows down.
• Slow processing occupies threads longer.
• Common solutions:
• Horizontal Scalability.
• Load Balancing.
High Load
• Solution #1:
• Scale manually by deploying more instances.
• Drawbacks:
• Deployment requires time and resources. Not a real-time solution.
• Lack of automation.
• Down-scale?
• Who?
• When?
[Diagram: Web and Mobile clients, API instances, DB, and additional Service instances]
High Load
• Solution #2:
• Scale dynamically, using an auto-scaler service.
• Up-scale and down-scale services on demand.
• Drawbacks:
• What happens when the resource limit is reached?
• Auto-scaler requires support.
• May fail to down-scale.
• May fail to up-scale.
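One way to picture the scaling decision an auto-scaler makes, including the resource-limit question raised above, is a small sizing function. This is only a sketch under assumptions: the function name, the 60% target load, and the instance limits are illustrative and do not come from the slides.

```python
def desired_instances(current, load_pct, target_pct=60, min_instances=1, max_instances=10):
    """Return how many instances would bring average load near target_pct.

    All parameters are illustrative assumptions; load values are integer percentages.
    """
    if current < 1 or target_pct <= 0:
        raise ValueError("current must be >= 1 and target_pct must be > 0")
    # Integer ceiling of current * load_pct / target_pct.
    wanted = (current * load_pct + target_pct - 1) // target_pct
    # The upper clamp is the resource limit: beyond it, scaling cannot help.
    return max(min_instances, min(max_instances, wanted))


scale_up = desired_instances(current=4, load_pct=90)
scale_down = desired_instances(current=4, load_pct=30)
capped = desired_instances(current=8, load_pct=95)
```

When the result is clamped at `max_instances`, load stays above the target, which is where the other patterns in this deck take over.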
RPC Mechanics
RPC Mechanics
• Services are reusable.
• A cascading call stack delivers the final result.
[Diagram: a chain of Services calling one another]
RPC Mechanics
• If network or service fails, the whole stack may fail.
• Users may get undefined errors instead of meaningful results.
RPC Mechanics
• Sometimes you have to wait to get an error.
• Why do we have to wait to get an error?
• Can the system react to errors more quickly?
• Users should wait for results, not errors!
RPC Mechanics
• The system needs time to free up resources.
RPC Mechanics
• Possible solution:
• Set time-outs, consider slow calls erroneous.
• Set a threshold, e.g. error rate per interval.
• Set a recovery time, watch the threshold.
• Define a constant fallback value (no network calls).
• Break for recovery if the threshold is reached (use the fallback value).
• Cascade errors upward, so the top-most level can react faster.
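The first and fourth points above (treat slow calls as errors, answer with a constant fallback) can be sketched with the standard library. The helper name `call_with_timeout`, the two sample services, and the timeout values are assumptions made for this illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn in a worker thread; a slow or failing call yields the fallback."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker keeps running; only the caller stops waiting.
        return fallback
    except Exception:
        # Any error raised by fn is handled like a time-out.
        return fallback


def slow_service():
    time.sleep(0.3)  # simulates an overloaded remote service
    return "fresh data"


def fast_service():
    return "fresh data"


slow_result = call_with_timeout(slow_service, timeout_s=0.05, fallback="cached default")
fast_result = call_with_timeout(fast_service, timeout_s=0.05, fallback="cached default")
```

Note that the time-out frees the caller, not the worker thread, so a threshold that stops new calls is still needed to let the system recover.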
Circuit Breaker
Predict issues
Circuit Breaker
• Algorithm:
• Closed – watch requests, keep track of errors.
• Open – if there are too many errors, fast fail to a default value.
• Half-Open – try one request some time later.
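[State diagram: Closed -> Closed on success; Closed -> Open on too many fails; Open: fast fail; Open -> Half-Open to try one request; Half-Open -> Open on fail; Half-Open -> Closed on success]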
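The three states can be sketched as a small class. This is a minimal, single-threaded illustration under assumptions: the class and parameter names (`CircuitBreaker`, `max_failures`, `recovery_s`) are invented for the example, it counts consecutive failures rather than an error rate per interval, and it is not the API of any particular library.

```python
import time


class CircuitBreaker:
    """Closed / Open / Half-Open breaker; not thread-safe, for illustration only."""

    def __init__(self, max_failures=3, recovery_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.recovery_s = recovery_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is Closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_s:
            return "half-open"
        return "open"

    def call(self, fn, fallback):
        state = self.state
        if state == "open":
            return fallback  # fast fail: no remote call is made
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial request, or too many failures, (re)opens the circuit.
            if state == "half-open" or self._failures >= self.max_failures:
                self._opened_at = self._clock()
            return fallback
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def unavailable():
    raise ConnectionError("service down")


breaker = CircuitBreaker(max_failures=2, recovery_s=30.0)
first = breaker.call(unavailable, fallback="default")
second = breaker.call(unavailable, fallback="default")
state_after_failures = breaker.state
```

The injectable `clock` exists only so the recovery time can be tested without sleeping.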
Circuit Breaker
• Track Exceptions & watch for time-outs.
• Simplify:
• On time-out throw a Timeout Exception.
• Track Exceptions!
Circuit Breaker
• Usually applied at the client side.
[Diagram: the same system with circuit breakers (CB) attached to the API instances]
N-Modular Redundancy
Process faster if possible
N-Modular Redundancy
• Algorithm:
• Send N parallel calls.
• Vote for the first acceptable result.
[Diagram: Caller -> Executors 1..N -> Voter -> Receiver]
N-Modular Redundancy
• Caller is the Voter:
• A special case: an adaptation of the pattern for RPC.
• Speed up processing.
• Some executors may perform faster, while others may be overloaded.
• Lower probability of errors.
• More chances to get an acceptable response (in a timely manner).
[Diagram: Caller & Voter calls Executors 1..N in parallel and passes the result to the Receiver]
N-Modular Redundancy
• Be careful:
• Remote calls should be idempotent.
• Not all calls will be idempotent.
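The "caller is the voter" variant can be sketched with a thread pool: send the same request to every executor and keep the first acceptable answer. The function name `first_acceptable`, the sample executors, and the acceptance check are assumptions for this illustration; as noted above, it is only safe when the calls are idempotent.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def first_acceptable(executors, request, is_acceptable, timeout_s=1.0):
    """Call all executors in parallel and return the first acceptable result.

    Raises concurrent.futures.TimeoutError if timeout_s expires first.
    """
    pool = ThreadPoolExecutor(max_workers=len(executors))
    futures = [pool.submit(executor, request) for executor in executors]
    try:
        for future in as_completed(futures, timeout=timeout_s):
            try:
                result = future.result()
            except Exception:
                continue  # a failed executor simply loses the vote
            if is_acceptable(result):
                return result
        raise RuntimeError("no executor returned an acceptable result")
    finally:
        # Do not wait for slower executors; their results are discarded.
        pool.shutdown(wait=False)


def overloaded(request):
    time.sleep(0.3)
    return request.upper()


def broken(request):
    raise ConnectionError("executor down")


def healthy(request):
    return request.upper()


answer = first_acceptable([overloaded, broken, healthy], "ping", lambda r: r == "PING")
```

The slower executors still do the work after the answer is returned, which is the cost this pattern pays for lower latency.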
Recovery Blocks
Slow down if necessary
Recovery Blocks
• Pareto principle:
• 20% of the code does 80% of the job.
• 80% of the code does 20% of the job.
• Fast Path:
• The most important 20% of the logic.
• Protect from time-outs.
• Protection Rings:
• Layer error handling logic as rings.
• Inner rings work faster, outer rings cover more cases.
• Protect from errors in a modular way.
[Diagram: protection rings: Fast Path (inner), Basic Error Handling, Advanced Error Handling (outer)]
Recovery Blocks
• Error handling logic is never enough.
• But, it may be heavy enough to slow down processing.
• Work faster when everything is fine.
• Slow down to protect when necessary.
[Diagram: the Caller sends a Request through the rings (Fast Path, Basic Error Handling, Advanced Error Handling); the Response is passed as a Result to the Receiver]
Recovery Blocks
• Recovery Blocks:
• Recover from errors if possible (more attempts, longer processing).
• Sequential or asynchronous services, layered over the network.
• Sequentially layered methods.
[Diagram: the Caller sends a Request to Block 1; on Fail it passes to Block 2, and so on to Block N; the Response is delivered as a Result to the Receiver]
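The sequential form of recovery blocks can be sketched as a loop over alternatives, ordered from the fast path to the more careful ones. The names `run_recovery_blocks`, `fast_path`, and `careful_path`, and the number-parsing task, are hypothetical and chosen only to keep the example short.

```python
def run_recovery_blocks(blocks, request, is_acceptable):
    """Try each block in order and return the first acceptable result."""
    errors = []
    for block in blocks:
        name = getattr(block, "__name__", repr(block))
        try:
            result = block(request)
        except Exception as exc:
            errors.append(f"{name}: {exc!r}")
            continue  # fail over to the next, slower block
        if is_acceptable(result):
            return result
        errors.append(f"{name}: unacceptable result {result!r}")
    # Unrecoverable: what happens here is a per-project decision.
    raise RuntimeError(f"all {len(blocks)} blocks failed: {errors}")


def fast_path(text):
    return int(text)  # handles the common case only


def careful_path(text):
    return int(text.strip().replace(",", ""))  # slower, covers more cases


common = run_recovery_blocks([fast_path, careful_path], "42", lambda n: n >= 0)
recovered = run_recovery_blocks([fast_path, careful_path], "1,024", lambda n: n >= 0)
```

The common input never pays for the extra handling; only the unusual one falls through to the second block.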
Recovery Blocks
• In case of asynchronous flows:
• The Caller may stream to the first block.
• On error, processing is passed to the next block.
• The first acceptable result is sent to the Receiver.
Recovery Blocks
• In case of unrecoverable errors:
• Customize for each project depending on requirements.
Error Kernel
Actor Model
Actors
• Actors:
• Communicate by sending messages.
• Form into a hierarchy.
• Delegation:
• Split tasks into sub-tasks.
• Delegate sub-tasks to subordinates.
Actors
• Sub-tasks:
• Are the actual pieces of work.
• Mostly executed in parallel.
• May result in an error.
• Error handling:
• Actors may fail while executing a task.
• Failures are reported to the parent as an error message.
• Parent actors supervise and monitor subordinates.
• Parents may resume, restart, stop subordinates or escalate the error.
Error Kernel
• Error Kernel Pattern:
• The root actor is the Error Kernel of the whole actor system.
• Each parent actor is the local Error Kernel of its hierarchy.
• Being the Error Kernel:
• Split the task into sub-tasks and delegate the execution.
• Aggregate results, supervise and monitor faults.
• Do not execute tasks at the root, to avoid errors there.
• If the root fails, the whole task fails.
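The supervising role of an Error Kernel can be illustrated without an actor library. This sketch is plain Python, not Akka: `Supervisor`, `RESTART`, `ESCALATE`, and the sample worker are hypothetical names, sub-tasks run sequentially rather than in parallel, and only two of the four directives (restart and escalate) are modeled.

```python
RESTART, ESCALATE = "restart", "escalate"


class Supervisor:
    """Splits a task, delegates sub-tasks to a worker, and handles its failures."""

    def __init__(self, worker, decide, max_restarts=2):
        self._worker = worker
        self._decide = decide
        self._max_restarts = max_restarts

    def run(self, task, split, aggregate):
        # The supervisor does no risky work itself: it only splits and aggregates.
        return aggregate([self._supervise(sub_task) for sub_task in split(task)])

    def _supervise(self, sub_task):
        restarts = 0
        while True:
            try:
                return self._worker(sub_task)
            except Exception as exc:
                if self._decide(exc) == RESTART and restarts < self._max_restarts:
                    restarts += 1
                    continue
                raise  # escalate to whoever supervises this supervisor


attempts = {}


def flaky_square(n):
    attempts[n] = attempts.get(n, 0) + 1
    if n == 3 and attempts[n] == 1:
        raise ConnectionError("transient failure")
    return n * n


def decide(exc):
    return RESTART if isinstance(exc, ConnectionError) else ESCALATE


root = Supervisor(flaky_square, decide)
total = root.run([1, 2, 3], split=lambda task: task, aggregate=sum)
```

The transient failure on sub-task 3 is absorbed by a restart, so the root still produces a result.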
Instance Healers
Instance Healers
• Need a mechanism to watch for failed instances.
• Services are scaled by running several instances.
• Services may fail.
Instance Healers
• Instance Healer:
• Watch service instances using health checks.
• If necessary, heal instances (restart or spawn).
• May be the Error Kernel of the system.
[Diagram: a Healer watching instances marked Dead, Healed, and New]
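A single pass of a healer can be sketched as follows. The `Instance` class, its `restartable` flag, and the `heal` and `spawn` functions are hypothetical stand-ins for real health checks and deployment calls.

```python
class Instance:
    """Hypothetical service instance; `restartable` simulates whether a restart helps."""

    def __init__(self, name, healthy=True, restartable=True):
        self.name = name
        self.healthy = healthy
        self.restartable = restartable

    def health_check(self):
        return self.healthy

    def restart(self):
        self.healthy = self.restartable


def heal(instances, desired_count, spawn):
    """One healer pass: restart failed instances, drop dead ones, spawn replacements."""
    survivors = []
    for instance in instances:
        if not instance.health_check():
            instance.restart()
        if instance.health_check():
            survivors.append(instance)
    while len(survivors) < desired_count:
        survivors.append(spawn())
    return survivors


spawned = iter(range(1, 100))


def spawn():
    return Instance(f"new-{next(spawned)}")


fleet = [
    Instance("a"),
    Instance("b", healthy=False),
    Instance("c", healthy=False, restartable=False),
]
healed = heal(fleet, desired_count=3, spawn=spawn)
names = [instance.name for instance in healed]
```

In practice such a pass would run periodically, which is why any work an instance was doing when it died must be safe to repeat.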
Instance Healers
• Possible Issues:
• Services may be in the middle of processing.
• Stateful services may lose data, e.g. a service with in-memory data.
• Prefer stateless processing.
• Keep state in an external storage.
Instance Healers
• Event Sourcing:
• Keep track of state-changing events, rebuild system state from events.
• Take snapshots for faster rebuilds, remember last processed event.
• Services may re-read events after healing.
• Processing should be idempotent!
[Diagram: Command Service, Event Store, Event Processor, Data Store, Query Service]
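Rebuilding state from events, with a snapshot that remembers the last processed event, can be sketched as a fold. The `Event` and `Snapshot` types and the account-balance domain are assumptions made for this example; sequence numbers are one possible way to make re-reading idempotent.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    seq: int      # position in the event store
    account: str
    amount: int


@dataclass
class Snapshot:
    last_seq: int = 0
    balances: dict = field(default_factory=dict)


def apply_events(snapshot, events):
    """Fold events into the snapshot, skipping those already processed."""
    for event in sorted(events, key=lambda e: e.seq):
        if event.seq <= snapshot.last_seq:
            continue  # already applied: re-reading after healing is harmless
        balance = snapshot.balances.get(event.account, 0)
        snapshot.balances[event.account] = balance + event.amount
        snapshot.last_seq = event.seq
    return snapshot


log = [Event(1, "alice", 100), Event(2, "bob", 50), Event(3, "alice", -30)]

# Full rebuild from the event store.
state = apply_events(Snapshot(), log)

# Faster rebuild: start from a snapshot taken after event 2, then re-read the whole log.
snapshot = apply_events(Snapshot(), log[:2])
rebuilt = apply_events(snapshot, log)
```

Both paths end in the same state, and replaying the full log over a snapshot does not double-count the first two events.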
Thank You!
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
Linux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdfLinux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdf
RHCSA Guru
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 

Fault Tolerance in Distributed Environment

  • 2. Speaker • Orkhan Gasimov, Software Engineer • 15 years of software engineering; • variety of technologies (languages & frameworks); • solution design and implementation; • Teaching training courses. • Architecture. • Java. • JavaScript / TypeScript. • Author of training courses. • Spring Cloud. • Akka for Java.
  • 3. Fault Tolerance in Distributed Environment • Forces: • Network. • High Load. • RPC mechanics. [Diagram: Web API, Mobile API, API, DBService instances]
  • 4. Possible Issues • Services are distributed across the network. • Network will eventually fail. • Services will be unavailable.
  • 5. Possible Issues • Services are distributed across the network. • Network will eventually fail. • Services will be unavailable. • Major issues come from, and are solved by: • Network – it fails, it heals. • High load – more latency. • RPC – internal mechanics.
  • 7. Service Coordination • Alternate Routes. [Diagram: Consumer with routes to three Producers]
  • 8. Service Coordination • Service Discovery: • Dynamic service locations. [Diagram: Service Registry, Consumer, Producer]
  • 9. Service Coordination • Service Discovery: • Dynamic service locations. • Scale horizontally on demand. [Diagram: Service Registry, Consumer, three Producers]
  • 10. Service Coordination • Service Discovery: • Dynamic service locations. • Scale horizontally on demand. • Client-side load balancing on the fly. [Diagram: Service Registry, Consumer, LB, three Producers]
  • 11. Service Coordination • Service Discovery: • Dynamic service locations. • Scale horizontally on demand. • Client-side load balancing on the fly. • Discovery Cluster. [Diagram: three Service Registries, Consumer, LB, three Producers]
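The discovery and client-side load-balancing idea from the slides above can be sketched in a few lines. This is a minimal in-memory illustration, not the API of any real registry: `ServiceRegistry`, `RoundRobinClient`, the service name and the addresses are all hypothetical, and round-robin is just one assumed balancing policy.

```python
import itertools


class ServiceRegistry:
    """In-memory stand-in for a service registry (hypothetical, for illustration)."""

    def __init__(self):
        self._instances = {}  # service name -> list of addresses

    def register(self, name, address):
        self._instances.setdefault(name, []).append(address)

    def lookup(self, name):
        # Return a copy so callers cannot mutate registry state.
        return list(self._instances.get(name, []))


class RoundRobinClient:
    """Consumer-side balancer: looks up instances on every call and rotates over them."""

    def __init__(self, registry, service_name):
        self._registry = registry
        self._service_name = service_name
        self._counter = itertools.count()

    def next_instance(self):
        instances = self._registry.lookup(self._service_name)
        if not instances:
            raise LookupError(f"no instances registered for {self._service_name}")
        return instances[next(self._counter) % len(instances)]


registry = ServiceRegistry()
registry.register("producer", "10.0.0.1:8080")
registry.register("producer", "10.0.0.2:8080")
client = RoundRobinClient(registry, "producer")
picks = [client.next_instance() for _ in range(4)]
```

Because the consumer asks the registry on every call, a producer registered later is picked up without any change on the consumer side, which is the "scale horizontally on demand" point of the slides.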
  • 13. High Load • Loaded services may get slower. • Processing of requests slows down. • Slow processing occupies threads longer. • Common solutions: • Horizontal Scalability. • Load Balancing.
  • 14. High Load • Solution #1: • Scale manually by deploying more instances. • Drawbacks: • Deployment requires time and resources. Not a real-time solution. • Lack of automation. • Down-scale? • Who? • When? [Diagram: Web API, Mobile API, API, additional DBService instances]
  • 15. High Load • Solution #2: • Scale dynamically, using an auto-scaler service. • Up-scale and down-scale services on demand. • Drawbacks: • What happens when the resource limit is reached? • Auto-scaler requires support. • May fail to down-scale. • May fail to up-scale.
  • 17. RPC Mechanics • Services are reusable. • A cascading call stack delivers the final result. [Diagram: Services calling one another in a cascade]
  • 18. RPC Mechanics • If the network or a service fails, the whole stack may fail. • Users may get undefined errors instead of meaningful results.
  • 19. RPC Mechanics • Sometimes you have to wait to get an error. • Why do we have to wait to get an error? • Can the system react quicker to errors? • Users should wait for results, not errors!
  • 20. RPC Mechanics • The system needs time to free up resources.
  • 21. RPC Mechanics • Possible solution: • Set time-outs, consider slow calls erroneous. • Set a threshold, e.g. error rate per interval. • Set a recovery time, watch the threshold. • Define a constant fallback value (no network calls). • Break for recovery if the threshold is reached (use the fallback value). • Cascade errors, to speed up the reaction at the top-most level.
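As a rough illustration of the first and fourth bullets (treat slow calls as errors, answer with a constant fallback), the sketch below wraps a call in a time-out. The names `call_with_timeout` and `FALLBACK`, the thread-pool approach and the time-out values are assumptions made for this example; they do not come from the slides.

```python
import concurrent.futures
import time

FALLBACK = "default"  # constant fallback value, needs no network call


def call_with_timeout(fn, timeout_s, fallback=FALLBACK):
    """Run fn in a worker thread; a slow or failing call yields the fallback."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback  # slow call is considered erroneous
    except Exception:
        return fallback  # any other failure gets the same constant answer
    finally:
        # Do not block the caller; the abandoned worker finishes in the background.
        pool.shutdown(wait=False)


def slow_service():
    time.sleep(0.3)  # simulates an overloaded remote service
    return "late result"


fast_answer = call_with_timeout(lambda: "fresh result", timeout_s=1.0)
slow_answer = call_with_timeout(slow_service, timeout_s=0.05)
```

Note that the time-out only frees the caller. The abandoned call still occupies a thread until it finishes, which is why the slides go on to add a threshold and a recovery time.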
  • 23. Circuit Breaker • Algorithm: • Closed – watch requests, keep track of errors. • Open – if there are too many errors, fast fail to a default value. • Half-Open – try one request some time later. [State diagram: Closed, Open, Half-Open; transitions: Success, Too many fails, Fast Fail, Try one request, Fail, Success]
  • 24. Circuit Breaker • Track Exceptions & watch for time-outs. • Simplify: • On time-out throw a Timeout Exception. • Track Exceptions!
  • 25. Circuit Breaker • Usually applied at client side. [Diagram: Web API, Mobile API, API, DBService instances, with two CB (circuit breaker) markers]
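The three states can be sketched as a small state machine. This is a simplified illustration under stated assumptions: it counts consecutive failures rather than an error rate per interval, the class and parameter names (`CircuitBreaker`, `max_failures`, `recovery_time_s`, `clock`) are hypothetical, and it is not thread-safe.

```python
import time


class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half-open"

    def __init__(self, max_failures=3, recovery_time_s=5.0, fallback=None,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.recovery_time_s = recovery_time_s
        self.fallback = fallback
        self._clock = clock  # injectable so tests need not sleep
        self.state = self.CLOSED
        self._failures = 0
        self._opened_at = 0.0

    def call(self, fn):
        if self.state == self.OPEN:
            if self._clock() - self._opened_at < self.recovery_time_s:
                return self.fallback  # fast fail: no remote call is made
            self.state = self.HALF_OPEN  # recovery time elapsed: try one request
        try:
            result = fn()
        except Exception:  # time-outs are expected to surface as exceptions
            self._on_failure()
            return self.fallback
        self._on_success()
        return result

    def _on_success(self):
        self._failures = 0
        self.state = self.CLOSED

    def _on_failure(self):
        self._failures += 1
        # A failed trial request re-opens immediately; otherwise wait for the threshold.
        if self.state == self.HALF_OPEN or self._failures >= self.max_failures:
            self.state = self.OPEN
            self._opened_at = self._clock()


breaker = CircuitBreaker(max_failures=2, recovery_time_s=10.0, fallback="cached")
```

A caller would wrap each remote call as `breaker.call(lambda: remote_call())`, where `remote_call` stands for whatever client function is being protected.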
  • 27. N-Modular Redundancy • Algorithm: • Send N parallel calls. • Vote for the first acceptable result. [Diagram: Caller, Executor 1, Executor 2, Executor N, Voter, Receiver]
  • 28. N-Modular Redundancy • Caller is the Voter: • A special case, adaptation of pattern for RPC. • Speed up processing. • Some executors may perform faster, while others may be overloaded. • Lower probability of errors. • More chances to get an acceptable response (in a timely manner). [Diagram: Caller & Voter, Executor 1, Executor 2, Executor N, Receiver]
  • 29. N-Modular Redundancy • Be careful: • Remote calls should be idempotent. • Not all calls will be idempotent.
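A minimal sketch of the "caller is the voter" variant follows: the caller fans the request out to N executors and takes the first result that passes an acceptance check. The function name `first_acceptable`, the acceptance predicate and the sample executors are hypothetical. Since every executor actually runs, the calls are assumed to be idempotent, as the slide warns.

```python
import concurrent.futures
import time


def first_acceptable(executors, request, is_acceptable, timeout_s=1.0):
    """Send the request to all executors in parallel; return the first acceptable result."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(executors))
    futures = [pool.submit(executor, request) for executor in executors]
    try:
        # as_completed yields futures in completion order; if the time-out
        # expires first, its time-out error propagates to the caller.
        for future in concurrent.futures.as_completed(futures, timeout=timeout_s):
            try:
                result = future.result()
            except Exception:
                continue  # a failed executor simply loses the vote
            if is_acceptable(result):
                return result
        raise RuntimeError("no executor returned an acceptable result")
    finally:
        pool.shutdown(wait=False)  # do not wait for the slower executors


def overloaded(request):
    time.sleep(0.3)
    return request.upper()


def broken(request):
    raise ConnectionError("executor unavailable")


def healthy(request):
    return request.upper()


answer = first_acceptable([overloaded, broken, healthy], "ping",
                          is_acceptable=lambda r: isinstance(r, str))
```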
  • 31. Recovery Blocks • Pareto principle: • 20% of the code does 80% of the job. • 80% of the code does 20% of the job. • Fast Path: • The most important 20% of the logic. • Protect from time-outs. • Protection Rings: • Layer error handling logic as rings. • Inner rings work faster, outer rings cover more cases. • Protect from errors in a modular way. [Diagram: rings of Fast Path, Basic Error Handling, Advanced Error Handling]
  • 32. Recovery Blocks • Error handling logic is never enough. • But it may be heavy enough to slow down processing. • Work faster when everything is fine. • Slow down to protect when necessary. [Diagram: Caller sends a Request through Fast Path, Basic Error Handling, Advanced Error Handling; Response returns to Caller, Result goes to Receiver]
  • 33. Recovery Blocks • Recovery Blocks: • Recover from errors if possible (more attempts, longer processing). • Sequential or asynchronous services, layered over the network. • Sequentially layered methods. [Diagram: Caller, Block 1, Block 2, Block N, Receiver; flows: Request, Fail, Fail, Response, Result]
  • 34. Recovery Blocks • In case of asynchronous flows: • Caller may stream to the first block. • On error, processing is passed to the next block. • The first acceptable result is sent to the Receiver. [Diagram: Caller, Block 1, Block 2, Block N, Receiver; flows: Request, Fail, Fail, Response]
  • 35. Recovery Blocks • In case of unrecoverable errors: • Customize for each project depending on requirements. [Diagrams: the sequential and asynchronous block chains from the previous two slides]
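The sequential variant can be sketched as a loop over blocks ordered from the fast path outward, where each later block is slower but covers more cases. The helper `run_recovery_blocks` and the three number-parsing blocks are hypothetical examples chosen only to make the layering visible.

```python
def run_recovery_blocks(blocks, request, is_acceptable):
    """Try each block in order; return the first acceptable result."""
    errors = []
    for block in blocks:
        try:
            result = block(request)
        except Exception as exc:
            errors.append(exc)  # this ring failed, fall through to the next one
            continue
        if is_acceptable(result):
            return result
        errors.append(ValueError(f"{block.__name__} returned an unacceptable result"))
    # Unrecoverable: what happens here is project-specific; this sketch just raises.
    raise RuntimeError(f"all {len(blocks)} blocks failed: {errors}")


def fast_path(text):
    return int(text)  # cheapest logic, handles the common case


def basic_handling(text):
    return int(text.strip().replace(",", ""))  # tolerates thousands separators


def advanced_handling(text):
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits)  # slowest and most permissive ring


blocks = [fast_path, basic_handling, advanced_handling]
non_negative = lambda n: n >= 0
clean = run_recovery_blocks(blocks, "42", non_negative)
separated = run_recovery_blocks(blocks, "1,000", non_negative)
noisy = run_recovery_blocks(blocks, "approx. 1,000 items", non_negative)
```

When the input is clean only `fast_path` runs, so the heavier error handling costs nothing in the common case.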
  • 37. Actors • Actors: • Communicate by sending messages. • Form into a hierarchy. • Delegation: • Split tasks into sub-tasks. • Delegate sub-tasks to subordinates.
  • 38. Actors • Sub-tasks: • Are the actual pieces of work. • Mostly executed in parallel. • May result in an error. • Error handling: • Actors may fail while executing a task. • Failures are reported to the parent as error messages. • Parent actors supervise and monitor subordinates. • Parents may resume, restart, stop subordinates or escalate the error.
  • 39. Error Kernel • Error Kernel Pattern: • The root actor is the Error Kernel of the whole actor system. • Each parent actor is the local Error Kernel of its hierarchy. • Being the Error Kernel: • Split the task into sub-tasks and delegate the execution. • Aggregate results, supervise and monitor faults. • Do not execute tasks at root, avoid errors. • If root fails, the whole task fails.
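The delegation and supervision shape can be imitated without an actor runtime. The sketch below is not Akka and does not use message passing: it only shows a parent that does no risky work itself, delegates sub-tasks to workers on threads, restarts a failed worker a limited number of times, and escalates otherwise. The names `supervise`, `parent` and `Escalation`, and the restart limit, are assumptions for this example.

```python
from concurrent.futures import ThreadPoolExecutor


class Escalation(Exception):
    """Raised when a parent cannot handle a subordinate's failure itself."""


def supervise(subtask, worker, max_restarts=2):
    """Run one sub-task; restart the worker on failure, escalate when restarts run out."""
    for _attempt in range(max_restarts + 1):
        try:
            return worker(subtask)
        except Exception as exc:
            last_error = exc  # failure reported to the parent
    raise Escalation(f"sub-task {subtask!r} failed after {max_restarts} restarts") from last_error


def parent(task, worker, split, aggregate):
    """Error kernel: only splits, delegates, supervises and aggregates."""
    subtasks = split(task)
    with ThreadPoolExecutor(max_workers=max(1, len(subtasks))) as pool:
        results = list(pool.map(lambda st: supervise(st, worker), subtasks))
    return aggregate(results)


attempts = {}


def flaky_sum(chunk):
    key = tuple(chunk)
    attempts[key] = attempts.get(key, 0) + 1
    if key == (3, 4) and attempts[key] == 1:
        raise IOError("worker crashed")  # fails once, succeeds after a restart
    return sum(chunk)


def split_pairs(numbers):
    return [numbers[i:i + 2] for i in range(0, len(numbers), 2)]


total = parent([1, 2, 3, 4, 5, 6], flaky_sum, split_pairs, sum)
```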
  • 41. Instance Healers • Need a mechanism to watch for failed instances. • Services are scaled by running several instances. • Services may fail.
  • 42. Instance Healers • Instance Healer: • Watch the service instances using health checks. • If necessary, heal instances (restart or spawn). • May be the Error Kernel of the system. [Diagram: Healer, Dead, Healed, New]
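One monitoring pass of such a healer might look like the sketch below. `Instance`, `Healer`, the `healthy` flag standing in for a real health check, and the `spawn` callback are all hypothetical; a real healer would probe instances over the network and run the pass on a schedule.

```python
class Instance:
    """Hypothetical service instance; a flag stands in for a real health check."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.restarts = 0

    def health_check(self):
        return self.healthy

    def restart(self):
        self.restarts += 1
        self.healthy = True


class Healer:
    def __init__(self, instances, desired_count, spawn):
        self.instances = list(instances)
        self.desired_count = desired_count
        self._spawn = spawn  # callback that creates a new instance

    def heal_once(self):
        """Restart failed instances, then spawn new ones if too few are running."""
        healed = []
        for instance in self.instances:
            if not instance.health_check():
                instance.restart()
                healed.append(instance.name)
        while len(self.instances) < self.desired_count:
            new_instance = self._spawn(len(self.instances))
            self.instances.append(new_instance)
            healed.append(new_instance.name)
        return healed


healer = Healer([Instance("db-0"), Instance("db-1")], desired_count=3,
                spawn=lambda index: Instance(f"db-{index}"))
healer.instances[1].healthy = False  # simulate a dead instance
healed_names = healer.heal_once()
```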
  • 44. Instance Healers • Possible Issues: • Services may be in the middle of processing. • Stateful services may lose data, e.g. a service with in-memory data. • Prefer stateless processing. • Keep state in an external storage.
  • 45. Instance Healers • Event Sourcing: • Keep track of state-changing events, rebuild system state from events. • Take snapshots for faster rebuilds, remember last processed event. • Services may re-read events after healing. • Processing should be idempotent! [Diagram: Command Service, Query Service, Event Store, Data Store, Event Processor]
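The rebuild, snapshot and idempotency points can be shown together in a small sketch. The `Account` aggregate, the `(event_id, kind, amount)` event shape and the assumption that event ids are strictly increasing are illustrative choices, not something the slides specify.

```python
class Account:
    """State rebuilt from events; remembers the last processed event id."""

    def __init__(self, balance=0, last_event_id=0):
        self.balance = balance
        self.last_event_id = last_event_id

    def apply(self, event):
        event_id, kind, amount = event
        if event_id <= self.last_event_id:
            return  # already processed: re-reading an event must not change state
        if kind == "deposit":
            self.balance += amount
        elif kind == "withdraw":
            self.balance -= amount
        else:
            raise ValueError(f"unknown event kind: {kind}")
        self.last_event_id = event_id

    def snapshot(self):
        return {"balance": self.balance, "last_event_id": self.last_event_id}


def rebuild(events, snapshot=None):
    """Rebuild state from the event log, optionally starting from a snapshot."""
    account = Account(**snapshot) if snapshot else Account()
    for event in events:
        account.apply(event)
    return account


events = [(1, "deposit", 100), (2, "withdraw", 30), (3, "deposit", 5)]
from_scratch = rebuild(events)
saved = rebuild(events[:2]).snapshot()   # snapshot taken after event 2
from_snapshot = rebuild(events, saved)   # events 1 and 2 are skipped on replay
```

A healed service can therefore re-read the whole log safely: events at or below `last_event_id` are ignored, so replaying them does not double-count.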
  • 46. Thank You! Fault Tolerance in Distributed Environment