System Design Blueprint

Building a Rate Limiter

An original guide to engineering a service that protects your APIs from overuse and ensures stability. Learn the core algorithms and distributed architecture from the ground up.

1. The "Why" and "What"

Let's start by understanding the purpose of a rate limiter and defining our goals.

A rate limiter is like a bouncer at a club. It doesn't inspect who you are (that's authentication), but rather how often you're trying to get in. Its job is to control the flow of traffic to protect the services behind it.

Why Do We Need It?

  • Security: Prevent Denial-of-Service (DoS) attacks where bad actors flood your service with requests.
  • Cost Control: Limit usage of expensive, third-party API calls or resource-intensive computations.
  • Stability: Prevent a single user or runaway script from overloading the system and degrading performance for everyone else.
  • Fair Usage: Ensure no single user monopolizes a shared resource.

Our Requirements

  • It must accurately limit requests based on configurable rules (e.g., 100 requests per minute per user).
  • It must be highly performant and add minimal latency to overall request time.
  • It must work in a distributed environment (across multiple servers).
  • When a user is limited, the system should return a clear `429 Too Many Requests` error.

2. Where Does It Live?

The most common and effective place to put a rate limiter is at the edge of your system, typically within an API Gateway.

API Gateway Middleware

By implementing the rate limiter in the gateway, we create a single choke point for all incoming traffic. This protects every downstream service without them needing to know about rate limiting at all. The flow is simple:

1. Request arrives at the API Gateway.

2. Rate Limiter middleware checks if the request is allowed.

3. If YES: Forward request to the intended service.

4. If NO: Immediately reject with a `429` error.
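The four steps above can be sketched as gateway middleware. This is an illustrative single-process version; `SimpleLimiter`, `rate_limit_middleware`, and the request shape are our own hypothetical names, and a real gateway would plug in one of the algorithms from the next section:

```python
class SimpleLimiter:
    """Toy per-user limiter: at most `limit` requests per key, ever.
    Stands in for a real algorithm (token bucket, sliding window, ...)."""

    def __init__(self, limit):
        self.limit = limit
        self.counts = {}

    def allow(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key] <= self.limit


def rate_limit_middleware(handler, limiter, key_fn):
    """Wrap a request handler: consult the limiter before forwarding."""
    def wrapped(request):
        if limiter.allow(key_fn(request)):
            return handler(request)                           # step 3: forward
        return {"status": 429, "body": "Too Many Requests"}   # step 4: reject
    return wrapped


# Usage: limit each user to 2 requests.
app = rate_limit_middleware(
    handler=lambda req: {"status": 200, "body": "ok"},
    limiter=SimpleLimiter(limit=2),
    key_fn=lambda req: req["user"],
)
```

The key design point is that the wrapped handler never sees rejected traffic, which is exactly how the gateway shields downstream services.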

3. Core Algorithms

The "brain" of the rate limiter is the algorithm it uses to track and limit requests. Let's explore the most common ones.

Token Bucket

Imagine a bucket that holds a set number of tokens. To make a request, you must grab a token. The bucket is refilled at a constant rate. If no tokens are available, the request is rejected (or must wait for a refill).

✓ Pro: Great for handling bursts of traffic, as long as tokens are available.

✗ Con: Can be complex to tune the bucket size and refill rate.
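A minimal single-process sketch of the token bucket (class and parameter names are our own; the injectable `clock` is just there to make the behavior easy to test):

```python
import time


class TokenBucket:
    """Token bucket: holds up to `capacity` tokens, refilled at
    `refill_rate` tokens per second; each request consumes one token."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)  # start full, so bursts are allowed
        self.clock = clock
        self.last_refill = clock()

    def allow(self):
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Note how a full bucket lets a burst of `capacity` requests through at once, which is the property the "Pro" above refers to.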

Leaky Bucket

Imagine a bucket with a small hole. Requests are "poured" into the bucket. The bucket processes requests out of the hole at a fixed, constant rate. If the bucket is full when a new request arrives, it is rejected.

✓ Pro: Smooths out traffic into a steady stream, which is easy for downstream services to handle.

✗ Con: Bursts of requests are flattened, which might not be ideal for all use cases.
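The leaky bucket can be sketched as a meter: requests raise the water level, the hole drains it at a fixed rate, and a full bucket rejects new arrivals (again an illustrative single-process version with our own names):

```python
import time


class LeakyBucket:
    """Leaky bucket as a meter: requests fill the bucket, which drains
    at `leak_rate` requests per second; a full bucket rejects requests."""

    def __init__(self, capacity, leak_rate, clock=time.monotonic):
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.level = 0.0  # start empty
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Drain whatever has leaked out since the last check.
        self.level = max(0.0, self.level - (now - self.last) * self.leak_rate)
        self.last = now
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False
```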

Fixed Window Counter

The simplest method. We count requests from a user in a fixed time window (e.g., 1 minute). If the count exceeds the limit, we reject further requests until the window resets.

✓ Pro: Very easy to implement and memory-efficient.

✗ Con: A burst of traffic at the edge of a window (e.g., 1:59 and 2:01) can allow double the rate.
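The fixed window is short enough to show in full (illustrative names again; the injectable `clock` keeps it testable):

```python
import time


class FixedWindowCounter:
    """Fixed window: count requests per window of `window_seconds`;
    the count resets whenever a new window starts."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self.current_window = None
        self.count = 0

    def allow(self):
        window = int(self.clock() // self.window_seconds)
        if window != self.current_window:
            # A new window has started: reset the counter.
            self.current_window = window
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

The "edge" flaw is visible in the reset: a full burst just before the boundary and another just after are both allowed, doubling the effective rate around the boundary.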

Sliding Window Counter

A hybrid approach that fixes the flaw in the Fixed Window. It smooths the rate by considering a weighted count from the previous window in addition to the current window's count.

✓ Pro: Good balance of performance and accuracy. Prevents the "edge" problem.

✗ Con: More complex to implement than a fixed window.
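The weighted count works like this: if we are 50% of the way through the current window, we count the previous window's total at 50% weight. A single-process sketch (names are our own):

```python
import time


class SlidingWindowCounter:
    """Sliding window counter: estimate the rate over the trailing window
    as curr_count + prev_count * (fraction of prev window still in view)."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.curr_window = None
        self.curr_count = 0
        self.prev_count = 0

    def allow(self):
        now = self.clock()
        window = int(now // self.window)
        if self.curr_window is None:
            self.curr_window = window
        elif window != self.curr_window:
            # Slide: current becomes previous (zero if windows were skipped).
            self.prev_count = self.curr_count if window == self.curr_window + 1 else 0
            self.curr_count = 0
            self.curr_window = window
        # Weight the previous window by how much of it still overlaps
        # the trailing one-window interval ending now.
        overlap = 1.0 - (now % self.window) / self.window
        estimated = self.prev_count * overlap + self.curr_count
        if estimated < self.limit:
            self.curr_count += 1
            return True
        return False
```

Halfway into a new window, a maxed-out previous window still contributes half the limit, which is exactly what blocks the double-rate burst at the boundary.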

Recommended Choice: A Sliding Window Counter provides a great mix of performance, memory efficiency, and accuracy for most modern applications.

4. Distributed Architecture

A single rate limiter is a single point of failure. In a real system, we have multiple servers. How do they share rate limit data?

Centralized Data Store

The solution is to use a centralized, high-speed data store like Redis. Every server in our API gateway talks to this central Redis cluster to share and update request counts.

`user_id_123: { count: 98, window_start: 1678886400 }`

Handling Race Conditions

With multiple servers reading and writing simultaneously, we can get a race condition (e.g., two servers read a count of 99, both allow the request, and both set the count to 100). To prevent this, we must use atomic operations. Redis provides commands like `INCR` which are atomic, meaning they are guaranteed to complete as a single, uninterruptible operation.
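The same increment-then-compare pattern can be shown with an in-process stand-in for Redis (a hypothetical `AtomicStore` of our own; a lock plays the role of Redis's single-threaded atomicity guarantee). The point is that `incr` returns a value unique to each request, so two servers can never both observe 99 and both pass:

```python
import threading


class AtomicStore:
    """In-process stand-in for Redis: incr() is atomic under a lock,
    mirroring the guarantee Redis INCR gives across gateway servers."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    def incr(self, key):
        with self._lock:
            self._counts[key] = self._counts.get(key, 0) + 1
            return self._counts[key]


def is_allowed(store, user_id, limit):
    # Increment first, then compare the returned value. No separate
    # read-then-write step exists for two servers to interleave.
    return store.incr(f"rate:{user_id}") <= limit
```

In production this would be a Redis `INCR` (typically paired with an expiry on the key so the window resets), but the correctness argument is the same.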

Final Design Principles

  • Place the rate limiter at the API Gateway to protect all downstream services.
  • Choose an appropriate algorithm based on the trade-off between accuracy and performance. Sliding Window Counter is often a good default.
  • Use a centralized, fast data store like Redis to synchronize state across a distributed system.
  • Leverage atomic operations (like Redis `INCR`) to avoid race conditions.
  • Always provide clear feedback to the user by returning a 429 status code when a request is limited.
