Fault Tolerance in Distributed Environment

Speaker
• Orkhan Gasimov, Software Engineer
• 15 years of software engineering:
  • variety of technologies (languages & frameworks);
  • solution design and implementation.
• Teaching training courses:
  • Architecture.
  • Java.
  • JavaScript / TypeScript.
• Author of training courses:
  • Spring Cloud.
  • Akka for Java.

Fault Tolerance in Distributed Environment
• Forces:
• Network.
• High Load.
• RPC mechanics.
[Diagram: Web and Mobile clients call API instances, which call Services backed by DBs]

Possible Issues
• Services are distributed across the network.
• Network will eventually fail.
• Services will be unavailable.
• Major issues come from, and are solved in terms of:
• Network – it fails, it heals.
• High load – more latency.
• RPC – internal mechanics.

Service Coordination
• Alternate Routes:
[Diagram: one Consumer connected to three Producers]

• Service Discovery:
• Dynamic service locations.
• Scale horizontally on demand.
• Client-side load balancing on the fly.
• Discovery Cluster.
[Diagram: a Consumer with a client-side load balancer (LB), several Producers, and a cluster of three Service Registry instances]
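
The discovery and client-side load-balancing bullets can be sketched in plain Java. This is a minimal illustration, not the API of any particular registry product: `ServiceRegistry`, `RoundRobinBalancer`, and the addresses used with them are hypothetical names, the registry is an in-memory map rather than a clustered service, and round-robin is just one possible balancing policy.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical in-memory stand-in for a service registry. */
class ServiceRegistry {
    private final Map<String, List<String>> instances = new ConcurrentHashMap<>();

    /** A producer announces its location when it starts. */
    void register(String service, String address) {
        instances.computeIfAbsent(service, k -> new CopyOnWriteArrayList<>()).add(address);
    }

    /** A producer is removed when it stops or is found dead. */
    void deregister(String service, String address) {
        List<String> known = instances.get(service);
        if (known != null) {
            known.remove(address);
        }
    }

    /** Returns a snapshot of the currently known locations. */
    List<String> lookup(String service) {
        List<String> known = instances.get(service);
        return known == null ? List.of() : List.copyOf(known);
    }
}

/** Client-side balancer: asks the registry on every call, so new instances are picked up on the fly. */
class RoundRobinBalancer {
    private final ServiceRegistry registry;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinBalancer(ServiceRegistry registry) {
        this.registry = registry;
    }

    String choose(String service) {
        List<String> candidates = registry.lookup(service);
        if (candidates.isEmpty()) {
            throw new IllegalStateException("no instances of " + service);
        }
        // floorMod keeps the index valid even after the counter overflows.
        int index = Math.floorMod(counter.getAndIncrement(), candidates.size());
        return candidates.get(index);
    }
}
```

Because the consumer resolves the producer on each call, scaling out only requires the new instance to register itself. A real client would likely cache lookups and refresh them periodically instead of querying the registry per request.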

High Load
• Loaded services may get slower.
• Processing of requests slows down.
• Slow processing occupies threads longer.
• Common solutions:
• Horizontal Scalability.
• Load Balancing.

High Load
• Solution #1:
• Scale manually by deploying more instances.
• Drawbacks:
• Deployment requires time and resources. Not a real-time solution.
• Lack of automation.
• Down-scale?
• Who?
• When?

High Load
• Solution #2:
• Scale dynamically, using an auto-scaler service.
• Up-scale and down-scale services on demand.
• Drawbacks:
• What happens when the resource limit is reached?
• The auto-scaler itself requires support.
• May fail to down-scale.
• May fail to up-scale.
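
The scaling decision and the resource-limit question can be made concrete with a small sketch. `AutoScalerPolicy` and its parameters are hypothetical names chosen for illustration. The sketch assumes load can be expressed as a single number and that every instance handles the same amount of it, which real auto-scalers do not necessarily assume.

```java
/** Hypothetical scaling policy: decides how many instances should run for a given load. */
class AutoScalerPolicy {
    private final int minInstances;
    private final int maxInstances;
    private final double capacityPerInstance;

    AutoScalerPolicy(int minInstances, int maxInstances, double capacityPerInstance) {
        this.minInstances = minInstances;
        this.maxInstances = maxInstances;
        this.capacityPerInstance = capacityPerInstance;
    }

    /** Up-scales and down-scales on demand, but never outside [minInstances, maxInstances]. */
    int desiredInstances(double currentLoad) {
        int needed = (int) Math.ceil(currentLoad / capacityPerInstance);
        return Math.max(minInstances, Math.min(maxInstances, needed));
    }

    /** True when demand exceeds what the resource limit allows: scaling alone no longer helps. */
    boolean limitReached(double currentLoad) {
        return Math.ceil(currentLoad / capacityPerInstance) > maxInstances;
    }
}
```

When `limitReached` returns true, the system is overloaded no matter what the auto-scaler does, so callers still need a way to protect themselves from slow or failing instances.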

RPC Mechanics
• Services are reusable.
• A cascading call stack delivers the final result.

RPC Mechanics
• If network or service fails, the whole stack may fail.
• Users may get undefined errors instead of meaningful results.

RPC Mechanics
• Sometimes you have to wait just to get an error.
• Why do we have to wait to get an error?
• Can the system react to errors more quickly?
• Users should wait for results, not errors!

RPC Mechanics
• The system needs time to free up resources.

RPC Mechanics
• Possible solution:
• Set time-outs, consider slow calls erroneous.
• Set a threshold, e.g. error rate per interval.
• Set a recovery time, watch the threshold.
• Define a constant fallback value (no network calls).
• Break for recovery if the threshold is reached (use the fallback value).
• Cascade errors upward, so that the top-most level can react faster.
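
The first and fourth bullets (time-outs and a constant fallback value) can be sketched with the JDK's `CompletableFuture`. `TimedCall` and `callWithFallback` are hypothetical names. The sketch treats both a slow call and a failed call as erroneous and answers with the fallback, and it assumes the remote call is wrapped in a `Supplier`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

class TimedCall {
    /**
     * Runs the call asynchronously and returns its result, or the fallback
     * if the call fails or does not finish within the time-out.
     */
    static <T> T callWithFallback(Supplier<T> remoteCall, long timeoutMillis, T fallback) {
        return CompletableFuture.supplyAsync(remoteCall)
                // A slow call is considered erroneous: answer with the fallback instead.
                .completeOnTimeout(fallback, timeoutMillis, TimeUnit.MILLISECONDS)
                // A failed call also answers with the fallback, without another network call.
                .exceptionally(error -> fallback)
                .join();
    }
}
```

Note that the time-out only frees the caller. The underlying task keeps running on the common pool until it finishes, so the callee's resources stay occupied for the duration of the slow call.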

Circuit Breaker
Predict issues

Circuit Breaker
• Algorithm:
• Closed – watch requests, keep track of errors.
• Open – if there are too many errors, fast fail to a default value.
• Half-Open – try one request some time later.
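State transitions:
• Closed → Closed: Success.
• Closed → Open: Too many fails.
• Open → Open: Fast Fail.
• Open → Half-Open: Try one request.
• Half-Open → Open: Fail.
• Half-Open → Closed: Success.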
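
The three states can be sketched as a small class. This is a simplified illustration, not the implementation of any specific library: `CircuitBreaker` and its parameters are hypothetical, it counts consecutive failures rather than an error rate per interval, it serializes calls with `synchronized`, and it takes the clock as a `LongSupplier` so the recovery time can be controlled in tests.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

class CircuitBreaker<T> {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures that open the circuit
    private final long recoveryMillis;    // how long to stay open before trying again
    private final T fallback;             // constant default value, no network call
    private final LongSupplier clock;     // e.g. System::currentTimeMillis

    private State state = State.CLOSED;
    private int failures;
    private long openedAt;

    CircuitBreaker(int failureThreshold, long recoveryMillis, T fallback, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.recoveryMillis = recoveryMillis;
        this.fallback = fallback;
        this.clock = clock;
    }

    synchronized T call(Supplier<T> remoteCall) {
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt < recoveryMillis) {
                return fallback;              // Open: fast fail to the default value
            }
            state = State.HALF_OPEN;          // recovery time passed: try one request
        }
        try {
            T result = remoteCall.get();
            state = State.CLOSED;             // success closes the circuit
            failures = 0;
            return result;
        } catch (RuntimeException error) {
            failures++;
            // A failed trial request, or too many failures while closed, opens the circuit.
            if (state == State.HALF_OPEN || failures >= failureThreshold) {
                state = State.OPEN;
                openedAt = clock.getAsLong();
            }
            return fallback;
        }
    }

    synchronized State state() {
        return state;
    }
}
```

A caller would create one breaker per remote dependency, for example `new CircuitBreaker<>(5, 10_000L, "n/a", System::currentTimeMillis)`, and route every call to that dependency through `call`.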

Circuit Breaker
• Track Exceptions & watch for time-outs.
• Simplify:
• On time-out throw a Timeout Exception.
• Track Exceptions!

Circuit Breaker
• Usually applied on the client side.

N-Modular Redundancy
Process faster if possible

• Algorithm:
• Send N parallel calls.
• Vote for the first acceptable result.
[Diagram: Caller → Executor 1 … Executor N → Voter → Receiver]

• Caller is the Voter:
• A special case: an adaptation of the pattern for RPC.
• Speed up processing.
• Some executors may perform faster, while others may be overloaded.
• Lower probability of errors.
• More chances to get an acceptable response (in a timely manner).
[Diagram: the Caller, acting as Voter, calls Executor 1 … Executor N]

• Be careful:
• Remote calls should be idempotent.
• Not all calls will be idempotent.
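
The "Caller is the Voter" variant can be sketched with `CompletableFuture`. `RedundantCall` and `firstAcceptable` are hypothetical names. The sketch assumes each executor is wrapped in a `Supplier` and that the acceptance vote is a simple `Predicate`. It runs on the JDK's common pool, so real parallelism depends on the number of available threads.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;
import java.util.function.Supplier;

class RedundantCall {
    /**
     * Sends the same call to every executor in parallel and completes with the
     * first result the voter accepts. Fails only if no executor produces one.
     * Every executor is invoked, so the remote call must be idempotent.
     */
    static <T> CompletableFuture<T> firstAcceptable(List<Supplier<T>> executors, Predicate<T> acceptable) {
        CompletableFuture<T> winner = new CompletableFuture<>();
        if (executors.isEmpty()) {
            winner.completeExceptionally(new IllegalStateException("no executors"));
            return winner;
        }
        AtomicInteger remaining = new AtomicInteger(executors.size());
        for (Supplier<T> executor : executors) {
            CompletableFuture.supplyAsync(executor).whenComplete((result, error) -> {
                if (error == null && acceptable.test(result)) {
                    winner.complete(result);   // the first acceptable result wins, later ones are ignored
                }
                if (remaining.decrementAndGet() == 0) {
                    // No effect if a winner was already chosen.
                    winner.completeExceptionally(new IllegalStateException("no acceptable result"));
                }
            });
        }
        return winner;
    }
}
```

The slower executors still complete their work after a winner is chosen, which is why the idempotency warning matters: a non-idempotent call would be applied N times.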

Recovery Blocks
Slow down if necessary

Recovery Blocks
• Pareto principle:
• 20% of the code does 80% of the job.
• 80% of the code does 20% of the job.
• Fast Path:
• The most important 20% of the logic.
• Protect from time-outs.
• Protection Rings:
• Layer error handling logic as rings.
• Inner rings work faster, outer rings cover more cases.
• Protect from errors in a modular way.
[Diagram: protection rings, from inside out: Fast Path, Basic Error Handling, Advanced Error Handling]

Recovery Blocks
• Error handling logic is never enough.
• But it may be heavy enough to slow down processing.
• Work faster when everything is fine.
• Slow down to protect when necessary.

Recovery Blocks
• Recovery Blocks:
• Recover from errors if possible (more attempts, longer processing).
• Sequential or asynchronous services, layered over the network.
• Sequentially layered methods.
[Diagram: the Caller sends a Request to Block 1; on Fail it passes to Block 2, then to Block N; the Response reaches the Receiver as the Result]
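
The "sequentially layered methods" case can be sketched as a loop over alternative blocks. `RecoveryBlocks` and `execute` are hypothetical names. The sketch assumes each block is a `Function` from request to result, that a block signals failure by throwing a runtime exception or by returning a result that fails the acceptance test, and that blocks are ordered from the fast path to the most defensive one.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

class RecoveryBlocks {
    /**
     * Tries each block in order and returns the first acceptable result.
     * Returns an empty Optional if every block fails (an unrecoverable error).
     */
    static <I, O> Optional<O> execute(I request, List<Function<I, O>> blocks, Predicate<O> acceptanceTest) {
        for (Function<I, O> block : blocks) {
            try {
                O result = block.apply(request);
                if (result != null && acceptanceTest.test(result)) {
                    return Optional.of(result);
                }
                // Unacceptable result: fall through to the next, slower block.
            } catch (RuntimeException error) {
                // Block failed: fall through to the next, slower block.
            }
        }
        return Optional.empty();
    }
}
```

When everything is fine, only the first block runs, so the extra error-handling logic costs nothing. The slower blocks are paid for only on failure.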

Recovery Blocks
• In case of asynchronous flows:
• Caller may stream to the first block.
• On error, processing is passed to the next block.
• The first acceptable result is sent to the Receiver.

Recovery Blocks
• In case of unrecoverable errors:
• Customize for each project depending on requirements.

Actors
• Actors:
• Communicate by sending messages.
• Form into a hierarchy.
• Delegation:
• Split tasks into sub-tasks.
• Delegate sub-tasks to subordinates.

Actors
• Sub-tasks:
• Are the actual pieces of work.
• Mostly executed in parallel.
• May result in an error.
• Error handling:
• Actors may fail while executing a task.
• Failures are reported to the parent as error messages.
• Parent actors supervise and monitor subordinates.
• Parents may resume, restart, stop subordinates or escalate the error.
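
Delegation and supervision can be illustrated without an actor library. The following is a plain-Java sketch, not Akka code: `Supervisor`, `Directive`, and `delegate` are hypothetical names, sub-tasks run sequentially rather than in parallel, and failures are exceptions rather than messages. The resume directive is left out because this sketch has no per-actor state or mailbox to resume.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical parent that delegates sub-tasks and decides what to do when one fails. */
class Supervisor<R> {
    enum Directive { RESTART, STOP, ESCALATE }

    private final Function<RuntimeException, Directive> decider;
    private final int maxRestarts;

    Supervisor(Function<RuntimeException, Directive> decider, int maxRestarts) {
        this.decider = decider;
        this.maxRestarts = maxRestarts;
    }

    /** Runs every sub-task and aggregates the results of those that succeed. */
    List<R> delegate(List<Supplier<R>> subTasks) {
        List<R> results = new ArrayList<>();
        for (Supplier<R> subTask : subTasks) {
            int restarts = 0;
            while (true) {
                try {
                    results.add(subTask.get());
                    break;
                } catch (RuntimeException failure) {
                    Directive directive = decider.apply(failure);
                    if (directive == Directive.ESCALATE) {
                        throw failure;        // let this parent's own supervisor handle it
                    }
                    if (directive == Directive.RESTART && restarts++ < maxRestarts) {
                        continue;             // run the sub-task again
                    }
                    break;                    // STOP, or restarts exhausted: drop the sub-task
                }
            }
        }
        return results;
    }
}
```

The parent itself does no risky work. It only splits, delegates, aggregates, and decides, which keeps the failure of one sub-task from taking down the whole task unless the decider chooses to escalate.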

Error Kernel
• Error Kernel Pattern:
• The root actor is the Error Kernel of the whole actor system.
• Each parent actor is the local Error Kernel of its hierarchy.
• Being the Error Kernel:
• Split the task into sub-tasks and delegate the execution.
• Aggregate results, supervise and monitor faults.
• Do not execute tasks at root, avoid errors.
• If root fails, the whole task fails.

Instance Healers
• Need a mechanism to watch for failed instances.
• Services are scaled by running several instances.
• Services may fail.

Instance Healers
• Instance Healer:
• Watch service instances using health checks.
• If necessary, heal instances (restart or spawn).
• May be the Error Kernel of the system.
[Diagram: a Healer watching instances labeled Dead, Healed, and New]
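
One pass of a healer can be sketched as follows. `InstanceHealer` and `sweep` are hypothetical names. The sketch identifies instances by a string id and takes the health check, the restart action, and the spawn action as functions supplied by the caller, since how those work depends on the deployment platform.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

class InstanceHealer {
    private final List<String> instances;
    private final Predicate<String> healthCheck;  // true if the instance answers its health check
    private final Predicate<String> restart;      // true if the restart succeeded
    private final Supplier<String> spawn;         // starts a new instance and returns its id

    InstanceHealer(List<String> instances, Predicate<String> healthCheck,
                   Predicate<String> restart, Supplier<String> spawn) {
        this.instances = new ArrayList<>(instances);
        this.healthCheck = healthCheck;
        this.restart = restart;
        this.spawn = spawn;
    }

    /** Checks every instance once; returns the ids that were healed (restarted or newly spawned). */
    List<String> sweep() {
        List<String> healed = new ArrayList<>();
        for (int i = 0; i < instances.size(); i++) {
            String instance = instances.get(i);
            if (healthCheck.test(instance)) {
                continue;                          // healthy: nothing to do
            }
            if (!restart.test(instance)) {
                instances.set(i, spawn.get());     // dead for good: replace with a new instance
            }
            healed.add(instances.get(i));
        }
        return healed;
    }

    List<String> instances() {
        return List.copyOf(instances);
    }
}
```

In practice `sweep` would run periodically, for example from a `ScheduledExecutorService`, and the healer itself would need to be monitored as well.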

Instance Healers
• Possible Issues:
• Services may be in the middle of processing.
• Stateful services may lose data, e.g. a service with in-memory data.
• Prefer stateless processing.
• Keep state in an external storage.

Instance Healers
• Event Sourcing:
• Keep track of state-changing events, rebuild system state from events.
• Take snapshots for faster rebuilds, remember last processed event.
• Services may re-read events after healing.
• Processing should be idempotent!
[Diagram components: Command Service, Event Store, Event Processor, Data Store, Query Service]
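
Rebuilding state from events, resuming from a snapshot, and idempotent processing can be shown together in one small sketch. `AccountProjection` and `Event` are hypothetical names, and the account balance is only an example domain. The sketch assumes every event carries a strictly increasing sequence number and that events are replayed in that order.

```java
import java.util.List;

/** Hypothetical state rebuilt from events: an account balance. */
class AccountProjection {
    /** A state-changing event with its position in the event store. */
    static final class Event {
        final long sequence;
        final long amount;

        Event(long sequence, long amount) {
            this.sequence = sequence;
            this.amount = amount;
        }
    }

    private long balance;
    private long lastProcessed;   // sequence of the last applied event, 0 if none

    /** Starts from a snapshot; pass (0, 0) to rebuild from scratch. */
    AccountProjection(long snapshotBalance, long snapshotSequence) {
        this.balance = snapshotBalance;
        this.lastProcessed = snapshotSequence;
    }

    /** Idempotent: an event that was already applied is skipped. */
    void apply(Event event) {
        if (event.sequence <= lastProcessed) {
            return;
        }
        balance += event.amount;
        lastProcessed = event.sequence;
    }

    /** Re-reading the whole event stream after healing is safe. */
    void replay(List<Event> events) {
        for (Event event : events) {
            apply(event);
        }
    }

    long balance() {
        return balance;
    }

    long lastProcessed() {
        return lastProcessed;
    }
}
```

Remembering the last processed sequence number is what makes re-reading safe: a healed service can replay from the start or from a snapshot and arrive at the same state either way.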

Thank You!