
Your App Is Dying Under Heavy Traffic—Here’s How Big Tech Stops the Meltdown

by Tejas Ghadge, February 27th, 2025



Summary

This article explores advanced load shedding techniques as a means to manage overload in distributed systems. Drawing on my learnings from 14 years of working on large-scale systems at Amazon, with the last 5 years at AWS Lambda, I share one powerful technique that developers can reach for when dealing with extreme load that would otherwise lead to system failure. Along the way, we will also cover fundamental laws, namely Amdahl's Law, the Universal Scalability Law, and Little's Law, that hold for all systems and form the basis of the actions and behaviors required to maintain them.


Introduction

If you use AWS services, you get high availability and near-instantaneous scaling in response to bursts of requests without worrying about overload. Nonetheless, it helps to understand the fundamentals behind designing such systems. Overload occurs when a system receives more work than it can efficiently process, leading to increased latency, degraded throughput, and eventual system failure. To mitigate these issues, load shedding (intentionally rejecting excess requests) emerges as a vital technique that is often overlooked. This article builds on insights from my 14 years of experience building high-TPS systems at Amazon.


Understanding Overload in Distributed Systems

I will first introduce Amdahl's Law. Take a simple example where cars are lined up at a gas station with only one gas pump. Even if the cars form multiple parallel queues, the rate at which they exit the gas station depends on how quickly that single pump fills each tank. In simple terms, the overall speed-up you gain by adding more resources (queues) is limited by the part of the work that cannot be split up, i.e., the single gas pump. This is Amdahl's Law in plain words, without getting into technical details; the formula below states it precisely.
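Stated formally, with p the fraction of the work that can run in parallel and N the number of resources, the achievable speed-up is:

```latex
S(N) = \frac{1}{(1 - p) + \frac{p}{N}}
```

For example, if 90% of the work parallelizes (p = 0.9), then no matter how many queues are added, the speed-up can never exceed 1 / (1 - 0.9) = 10x; the serial part, the single pump, dominates.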


Now let us look at the Universal Scalability Law, which builds on Amdahl's Law by also accounting for inter-process communication delays. Assume we add another gas pump to our gas station. Forming multiple parallel queues now helps, and a person can redirect traffic to whichever pump is free. However, the system now has two potential bottlenecks: (1) the two gas pumps, which determine how quickly cars exit, and (2) the person redirecting the cars. In other words, the Universal Scalability Law accounts for both the serial components (the gas pumps) and the overhead of coordination between them (the person directing traffic). We will see these laws in action in the sections that follow.
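For reference, the Universal Scalability Law (due to Neil Gunther) is usually written as follows, where C(N) is the relative capacity with N resources, alpha models contention (the serial fraction, our gas pumps), and beta models coherency cost (the coordination overhead, our traffic director):

```latex
C(N) = \frac{N}{1 + \alpha (N - 1) + \beta N (N - 1)}
```

Because the beta term grows quadratically in N, adding resources can eventually make throughput worse, something Amdahl's model alone cannot predict.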


Distributed systems face inherent challenges as load increases. When a server is underutilized, latency remains low; however, as load increases, factors like thread contention, context switching, garbage collection, and I/O delays begin to hurt performance, exposing exactly the bottlenecks called out by Amdahl's Law (e.g., a hard CPU core limit) and the Universal Scalability Law (e.g., the coordination overhead of context switching and of threads handing off network I/O to one another).


Under heavy load, response times increase sharply, causing many requests to exceed client timeouts. This phenomenon turns a latency issue into an availability problem. Moreover, overloaded systems tend to enter a destructive feedback loop: timed-out requests lead to retries, which further compound the load and worsen performance. For example, if every client retries a timed-out request twice, the offered load on an already saturated system can triple.
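This article focuses on server-side defenses, but the standard client-side way to dampen this loop is a capped number of retries with jittered exponential backoff. A minimal sketch, where `callService` is a hypothetical stand-in for the remote call:

```java
import java.util.concurrent.ThreadLocalRandom;

public class RetryWithJitter {
    private static final int MAX_ATTEMPTS = 3;      // cap retries so load amplification is bounded
    private static final long BASE_DELAY_MS = 100;  // first backoff interval
    private static final long MAX_DELAY_MS = 2_000; // ceiling on any single backoff

    // Hypothetical remote call; throws on timeout or rejection.
    private static String callService() throws Exception {
        throw new Exception("simulated timeout");
    }

    public static String callWithBackoff() throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return callService();
            } catch (Exception e) {
                if (attempt >= MAX_ATTEMPTS) {
                    throw e;                        // give up instead of hammering the server
                }
                // "Full jitter": sleep a random duration in [0, min(max, base * 2^attempt)),
                // so synchronized clients do not retry in lockstep.
                long ceiling = Math.min(MAX_DELAY_MS, BASE_DELAY_MS << attempt);
                Thread.sleep(ThreadLocalRandom.current().nextLong(ceiling));
            }
        }
    }
}
```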

Strategies for Dealing with Overload

One way to handle the limits imposed by Amdahl's Law and the Universal Scalability Law is to shed excess traffic and avoid overloading the system. The first step, however, is to identify the point at which the system starts getting overloaded. We discuss that next.


Proactive Scaling and Testing

One of the key defenses against overload failures is automated load testing as part of the deployment release pipeline. With the breakpoints it reveals, systems can be designed to scale automatically before reaching critical thresholds. Consider the following example.



Assume the clients of our service have a timeout of 1 second. As traffic to our service increases from 100 TPS to 300 TPS, response latency climbs to 1 second. Beyond 300 TPS, client requests start timing out after 1 second, so the service does no useful work. If this 300 TPS threshold is known, we can set fleet autoscaling rules to scale the service up when it is handling close to 80% of capacity, i.e., 240 TPS; a minimal sketch of such a trigger follows. But let us not forget Amdahl's Law here: if the bottleneck beyond 300 TPS is an external limit like a database, then there is no alternative to load shedding, which we discuss next.
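As a rough illustration (not a production autoscaler), the trigger can be a one-second counter that fires once traffic crosses 80% of the measured breakpoint. The breakpoint value comes from the example above; the actual scale-up call to the fleet's autoscaling API is left as a stub, and the counting is deliberately approximate under concurrency:

```java
import java.util.concurrent.atomic.AtomicLong;

// Signals a scale-up once observed traffic crosses 80% of the
// measured breakpoint (240 of 300 TPS in the example above).
public class ScaleUpTrigger {
    private static final long BREAKPOINT_TPS = 300;
    private static final long SCALE_UP_TPS = BREAKPOINT_TPS * 80 / 100; // 240

    private final AtomicLong count = new AtomicLong();
    private volatile long windowStartMs = System.currentTimeMillis();

    // Call once per incoming request.
    public void record() {
        long now = System.currentTimeMillis();
        if (now - windowStartMs >= 1_000) {          // a one-second window has elapsed
            long tps = count.getAndSet(0);           // read and reset the window's count
            windowStartMs = now;
            if (tps >= SCALE_UP_TPS) {
                requestScaleUp(tps);
            }
        }
        count.incrementAndGet();
    }

    // Stub: in practice this would call the fleet's autoscaling API.
    private void requestScaleUp(long observedTps) {
        System.out.printf("TPS %d >= %d, requesting more capacity%n",
                observedTps, SCALE_UP_TPS);
    }
}
```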


Load Shedding Techniques

Once the service breakpoint is known, we can reject requests beyond a predetermined threshold. (Note that the breakpoint is a moving target as the service changes, so measuring it has to be part of the release cycle.) For instance, in our previous example, we could decide to reject requests once the service hits 80% of capacity, i.e., 240 TPS. In many cases, though, there is a better approach: load shedding works by strategically rejecting non-critical requests as resource limits are approached. The primary goal is to preserve low latency and high availability for critical requests. When executed effectively, load shedding enables a server to maintain its "goodput" (the number of requests processed successfully within acceptable latency thresholds) even as overall request volume increases; a minimal admission-control sketch appears below.

However, preparing for load shedding at the application level alone is not enough. One system I worked on would receive bursts of millions of requests in a second. At that scale, the application layer, i.e., the JVM, would be overloaded before it could even decide to reject requests. To guarantee robustness, a layered defense mechanism is required, which we discuss next.
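To make the idea concrete, here is a minimal admission-control sketch (an illustration, not the production design from the system described above). A semaphore is sized to the measured capacity, and requests are rejected immediately once all permits are taken, so admitted requests keep completing within their latency budget. The `Request`/`Response` types and the capacity value are hypothetical:

```java
import java.util.concurrent.Semaphore;

public class LoadShedder {
    record Request(String body) {}                  // hypothetical request/response types
    record Response(int status, String body) {}

    // Hypothetical sizing via Little's Law: shedding at 240 TPS with
    // ~0.5 s average latency implies roughly 240 * 0.5 = 120 in-flight requests.
    private static final int MAX_CONCURRENT = 120;
    private final Semaphore permits = new Semaphore(MAX_CONCURRENT);

    public Response handle(Request request) {
        // Reject immediately when at capacity: a cheap 503 costs the server
        // almost nothing and preserves goodput for requests already admitted.
        if (!permits.tryAcquire()) {
            return new Response(503, "shed");
        }
        try {
            return process(request);
        } finally {
            permits.release();                      // free the slot for the next request
        }
    }

    private Response process(Request request) {
        return new Response(200, "ok");             // stand-in for the real work
    }
}
```

The design choice worth noting is `tryAcquire()` rather than a blocking `acquire()`: queueing excess requests would only add latency and push more of them past their client timeouts.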


Layered Defense Mechanisms

Robustness is enhanced by a layered approach to overload management. In addition to load shedding at the application layer, lower layers, such as load balancers, proxies, and operating systems, can contribute. For example, HTTP proxies like NGINX can enforce maximum connection limits, while operating system tools (e.g., iptables) can restrict incoming connections at the network level. Such multi-tiered defenses ensure that no single layer becomes a single point of failure and that the service can proactively protect itself against a request surge. Each layer acts as a safeguard, contributing to an overall strategy that mitigates overload even if one mechanism temporarily fails.
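As an illustration of a proxy-level guard, NGINX's `limit_conn` and `limit_req` directives can shed excess traffic before it reaches the application. A minimal sketch; the zone names, limits, and backend address are made-up values, not recommendations:

```nginx
http {
    # Track concurrent connections and request rate per client IP.
    limit_conn_zone $binary_remote_addr zone=per_ip_conns:10m;
    limit_req_zone  $binary_remote_addr zone=per_ip_reqs:10m rate=100r/s;

    upstream app_backend {
        server 127.0.0.1:8080;                            # hypothetical application server
    }

    server {
        listen 80;
        location / {
            limit_conn per_ip_conns 50;                   # at most 50 open connections per IP
            limit_req  zone=per_ip_reqs burst=20 nodelay; # excess requests are rejected immediately
            proxy_pass http://app_backend;
        }
    }
}
```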



Prioritization and Timeout Management

Effective load shedding also involves prioritizing requests based on their importance. Critical API calls, such as those from human-facing interfaces, should be given precedence over background or less essential tasks. Enforcing per-request timeouts helps prevent wasted processing on requests that have exceeded their useful life. By incorporating timeout hints and monitoring elapsed processing time, servers can drop requests gracefully before they consume disproportionate resources.
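A minimal sketch of the timeout side of this, assuming the client propagates a deadline with each request (the `deadline` and `critical` fields here are hypothetical). Work whose budget has already expired is dropped, since the client has most likely timed out and retried:

```java
import java.time.Instant;

// Drops requests whose client-side deadline has already passed:
// finishing them would spend capacity on answers nobody is waiting for.
public class DeadlineFilter {
    record Request(Instant deadline, boolean critical) {} // hypothetical fields

    public boolean shouldProcess(Request request) {
        if (Instant.now().isAfter(request.deadline())) {
            return false;                                  // budget spent: drop gracefully
        }
        // Under pressure, request.critical() could also be checked here to shed
        // background work first and keep human-facing calls alive.
        return true;
    }
}
```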

Conclusion

In summary, here are the key lessons from this article:

  1. Proactive Overload Prevention: Systems should be designed to scale automatically, with load shedding as a safety net to preserve performance under unexpected spikes. Regular load testing, using both synthetic and real-world traffic patterns, is crucial for understanding system behavior and fine-tuning overload controls.

  2. Layered Protections: Combining load shedding with network-level controls and proactive monitoring creates a strong defense against overload.

  3. Prioritization: Not all requests are equal—prioritizing critical requests and managing client timeouts can significantly enhance system availability.


In conclusion, advanced load shedding techniques, when combined with proactive scaling and layered defenses, form the cornerstone of robust distributed systems. By shedding non-essential load gracefully, systems maintain critical performance levels, ensuring that key services remain available even under extreme conditions. All distributed systems are subject to Amdahl's Law and the Universal Scalability Law, and knowing these laws helps in designing robust systems. For readers who want to explore further, Little's Law is another powerful result: it explains the balance between throughput, wait times, and the number of tasks or requests in a system, and it is directly useful when reasoning about capacity during design. It is stated below.
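For reference, Little's Law states that in a stable system the average number of items in the system (L) equals the average arrival rate (lambda) multiplied by the average time each item spends in the system (W):

```latex
L = \lambda W
```

Applied to our earlier example: a service absorbing 300 requests per second, each taking a full second to answer, holds roughly 300 * 1 = 300 requests in flight, which is exactly the kind of concurrency bound used to size the admission limit in the load shedding sketch above.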