Addressing Failures

Cascading Failures

A cascading failure is a failure that grows over time as a result of positive feedback.^107^ It can occur when a portion of an overall system fails, increasing the probability that other portions of the system fail. For example, a single replica for a service can fail due to overload, increasing load on remaining replicas and increasing their probability of failing, causing a domino effect that takes down all the replicas for a service.
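The domino effect described above can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions that are not from the source: load is spread evenly across live replicas, every replica has the same fixed capacity, and an overloaded replica fails outright. The function name `simulate_cascade` and its parameters are hypothetical.

```python
def simulate_cascade(total_load, replicas, capacity_per_replica):
    """Track (live replicas, load per replica) until the system stabilizes
    or every replica has failed.

    Assumption: load is balanced evenly and an overloaded replica dies,
    pushing its share of the load onto the survivors.
    """
    history = []
    alive = replicas
    while alive > 0:
        per_replica = total_load / alive
        history.append((alive, per_replica))
        if per_replica <= capacity_per_replica:
            break  # the remaining replicas can absorb the load
        alive -= 1  # one overloaded replica fails; the rest inherit its load
    return history, alive


if __name__ == "__main__":
    # Ten replicas handle 1000 units of load comfortably (100 each, limit 110).
    print(simulate_cascade(1000, 10, 110))
    # Losing a single replica pushes the others past their limit,
    # and each further failure makes the next one more likely.
    print(simulate_cascade(1000, 9, 110))
```

With these illustrative numbers, ten replicas are stable at 100 units each, but once one is lost the remaining nine each see about 111 units, exceed the limit of 110, and fail one after another until none are left.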

Causes

Server Overload
Resource Exhaustion
CPU
- Increased number of in-flight requests
- Excessively long queue lengths
- Thread starvation
- CPU or request starvation
- Missed RPC deadlines
- Reduced CPU caching benefits
Memory
- Dying tasks
- Increased rate of garbage collection (GC), resulting in increased CPU usage
- Reduction in cache hit rates
Threads
File Descriptors
Dependencies among resources
Service Unavailability

Prevention

Load test the server's capacity limits, and test the failure mode for overload
Serve degraded results
Instrument the server to reject requests when overloaded
Instrument higher-level systems to reject requests, rather than overloading servers
Perform capacity planning
Queue Management
Load Shedding and Graceful Degradation

Load Shedding

The idea is to reject some requests when a server approaches overload, so that it keeps serving the rest instead of crashing and failing to serve any request at all.
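One common way to shed load is to cap the number of requests a server works on at once and reject the excess immediately. The following is a minimal sketch of that approach, not a method prescribed by the source; the `LoadShedder` class, the `handle` helper, and the use of HTTP-style status codes are all illustrative assumptions.

```python
import threading


class LoadShedder:
    """Admit at most `max_in_flight` concurrent requests (hypothetical helper)."""

    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        # Admit the request only if there is spare capacity.
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self):
        # Free the slot once the request has finished.
        with self._lock:
            self._in_flight -= 1


def handle(shedder, work):
    """Run `work()` if admitted; otherwise fail fast with a 503-style reply."""
    if not shedder.try_acquire():
        return 503, "overloaded"  # shed: cheap rejection instead of queuing
    try:
        return 200, work()
    finally:
        shedder.release()
```

Rejecting early is cheap compared with doing the work, so the server spends its capacity on requests it can actually finish. The right limit depends on the service and would normally come from load testing.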

Retries
Latency and Deadlines
Picking a Deadline
Missing Deadlines
Deadline Propagation
Cancellation Propagation
Bimodal Latency
Slow Startup and Cold Caching
Always Go Downward in the Stack
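Deadline propagation, listed above, means that a request carries one absolute deadline through every stage of the stack, so that no stage starts work the caller has already given up on. The sketch below shows the idea with a monotonic clock. It assumes a single process with sequential stages; the names `DeadlineExceeded`, `call_backend`, and `handle_request` are hypothetical and do not correspond to any particular RPC library.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when a stage would start after the request's deadline."""


def remaining(deadline):
    # Seconds left before the absolute deadline (negative if already past).
    return deadline - time.monotonic()


def call_backend(deadline, work):
    # Check the inherited deadline before doing any work, then pass it down
    # so the callee can apply the same check to its own calls.
    if remaining(deadline) <= 0:
        raise DeadlineExceeded("deadline passed before the call started")
    return work(deadline)


def handle_request(timeout_s, stages):
    # The deadline is set once, at the top of the stack, as an absolute time.
    deadline = time.monotonic() + timeout_s
    results = []
    for stage in stages:
        results.append(call_backend(deadline, stage))
    return results
```

Passing an absolute deadline rather than a fresh timeout per call keeps the total time bounded: each stage sees only what is left of the original budget.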

Triggering Conditions for Cascading Failures

Process Death
Process Updates
New Rollouts
Organic Growth
Planned Changes, Drains, or Turndowns
Request Profile Changes
Resource Limits

Testing for Cascading Failures

Test Until Failure and Beyond
Test Popular Clients
Test Noncritical Backends

Immediate Steps to Address Cascading Failures

Increase Resources
Stop Health Check Failures/Deaths
Restart Servers
Drop Traffic
Enter Degraded Modes
Eliminate Batch Load
Eliminate Bad Traffic

Reference

http://highscalability.com/blog/2018/4/25/google-addressing-cascading-failures.html
