Thundering Herd Problem and How Not to Do API Retries






When a client makes an API call to the backend and the call fails due to a network issue, one common solution is to retry the call, assuming the API is idempotent.

However, when multiple clients simultaneously retry their calls, it can cause a “Thundering Herd” problem, overwhelming the server and making it difficult to recover. In this article, we will discuss how to address this issue by implementing Exponential Backoff and Jitter.

Retrying is brutal

The simplest way to retry failed API calls is to repeat the call immediately in a loop for a specified number of attempts. However, this approach can make the problem worse at scale.

Imagine the server is already overwhelmed with requests, and every connected client immediately retries each failed API call. This piles even more pressure on the server and leaves it no room to recover.
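A minimal sketch of this naive retry loop, assuming Python and the `requests` library; the function name, attempt count, and timeout are illustrative:

```python
import requests

MAX_ATTEMPTS = 3  # illustrative retry budget

def call_with_naive_retries(url):
    """Retry immediately, back-to-back, with no delay between attempts."""
    last_error = None
    for _ in range(MAX_ATTEMPTS):
        try:
            response = requests.get(url, timeout=2)
            response.raise_for_status()  # treat HTTP errors as failures too
            return response
        except requests.RequestException as err:
            last_error = err  # no sleep here: the next attempt fires instantly
    raise last_error
```

Every failed attempt immediately fires the next one, so a struggling server keeps absorbing the very load it could not handle in the first place.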

Exponential Backoff

Exponential Backoff is a retry strategy that introduces a delay between retries, increasing the delay exponentially with each retry attempt.

This approach gives the server breathing room and time to recover. Instead of repeating the API call back-to-back, we introduce a delay that grows with each retry attempt.
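A sketch of the same call with exponential backoff, again assuming Python and `requests`; the base delay and attempt count are arbitrary values chosen for illustration:

```python
import time
import requests

BASE_DELAY = 0.5   # seconds; illustrative starting delay
MAX_ATTEMPTS = 5

def call_with_exponential_backoff(url):
    """Retry with a delay that doubles after every failed attempt."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            response = requests.get(url, timeout=2)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == MAX_ATTEMPTS - 1:
                raise  # retry budget exhausted
            # Delays grow as 0.5s, 1s, 2s, 4s, ... giving the server time to recover.
            time.sleep(BASE_DELAY * (2 ** attempt))
```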

Adding exponential backoff eliminates immediate retries, but if many clients failed at the same moment, their backoff schedules stay aligned, and their retries can still hit the server in synchronized waves.

Jitter to the rescue

To further improve this approach, we can add jitter to the retry intervals. Jitter is a small random delay added to each retry interval so that retries from different clients do not coincide.

This helps distribute the retries and prevents them from adding to the problem. By introducing randomness, the retries are spread across a wider time window, which reduces the chance that many clients retry at the same instant.
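A sketch of backoff combined with jitter, here using the common "full jitter" variant where the sleep is drawn uniformly between zero and the exponential delay; all constants remain illustrative:

```python
import random
import time
import requests

BASE_DELAY = 0.5   # seconds; illustrative starting delay
MAX_ATTEMPTS = 5

def call_with_backoff_and_jitter(url):
    """Retry with an exponentially growing, randomized delay."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            response = requests.get(url, timeout=2)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == MAX_ATTEMPTS - 1:
                raise
            # Exponential ceiling for this attempt: 0.5s, 1s, 2s, 4s, ...
            ceiling = BASE_DELAY * (2 ** attempt)
            # Sleep a random duration within that ceiling so that clients which
            # failed at the same moment do not retry at the same moment.
            time.sleep(random.uniform(0, ceiling))
```

With plain backoff, two clients that failed together would sleep 0.5s, 1s, 2s in lockstep; with jitter, their next attempts land at different points inside each window.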

Caveats

When implementing Exponential Backoff and Jitter, there are two key things to keep in mind. First, add random jitter to every retry interval so that retries from different clients do not coincide. Second, space the retries exponentially, so that the interval between retries grows with each attempt.


Arpit Bhayani


