CacheU
High Level Design

Rate Limiter

A detailed guide to Rate Limiting in High Level Design (HLD), covering concepts, algorithms, architectures, distributed implementations, and real-world scenarios for controlling traffic and protecting systems from overload.

Rate Limiter

Introduction

Modern internet systems must handle millions of requests per second from users, applications, bots, and automated services.

Without control, this traffic can easily overwhelm servers.

For example:

  • A malicious bot sends 100,000 requests per second
  • A misconfigured client repeatedly retries requests
  • A viral application suddenly receives millions of users

If a system processes all these requests blindly, it can experience:

  • CPU exhaustion
  • Database overload
  • service crashes
  • cascading failures

To prevent this, distributed systems use Rate Limiting.

A Rate Limiter controls how many requests a client can make within a specific time window.

It acts like a traffic controller that ensures no client consumes more resources than allowed.


Real World Analogy

Imagine a highway toll booth.

Only a limited number of cars can pass through every minute.

If too many cars arrive simultaneously:

  • traffic slows
  • vehicles are delayed
  • some may be turned away

This ensures the toll system does not collapse under pressure.

A rate limiter performs the same role for APIs and services.


Why Rate Limiting is Important

Rate limiting protects systems from several problems.

ProblemExplanation
System overloadToo many requests exhaust resources
Abuse and botsPrevents scraping or automated attacks
Fair usageEnsures all users receive fair access
Cost controlLimits expensive operations
StabilityPrevents cascading failures

Many modern APIs enforce strict rate limits.

Example:


100 requests per minute per user

If a client exceeds the limit, the system returns:


HTTP 429 Too Many Requests


Where Rate Limiters Are Used

Rate limiting appears across multiple layers of modern architectures.

LayerExample
API GatewayLimit requests per user
Load BalancerProtect backend services
DatabaseLimit query frequency
Authentication servicesPrevent brute-force login attacks
Messaging systemsControl producer/consumer throughput

Basic Rate Limiting Flow

Diagram
flowchart TD Client --> RateLimiter RateLimiter -->|Allowed| Service RateLimiter -->|Rejected| Error429

Steps:

  1. Client sends request
  2. Rate limiter checks quota
  3. If within limit → request proceeds
  4. If exceeded → request rejected

Key Concepts

1. Rate Limit

The maximum number of allowed requests.

Example:

100 requests per minute

2. Time Window

The duration during which requests are counted.

Examples:

  • per second
  • per minute
  • per hour

3. Identity

Rate limits must apply to a specific entity.

Common identifiers include:

IdentifierExample
User IDuser123
API keyapi_abc
IP address192.168.1.1
TokenOAuth token

Types of Rate Limiting Policies

Different policies apply limits differently.

PolicyDescription
User-basedLimit per user
IP-basedLimit per IP
API key basedLimit per API consumer
Global rate limitEntire system limit

Example:

100 requests/minute per user
1000 requests/minute globally

Rate Limiting Algorithms

Several algorithms exist to implement rate limiting.


1. Fixed Window Counter

This is the simplest rate limiting algorithm.

Idea

Divide time into fixed windows.

Example:

100 requests per minute

Requests are counted within each minute.


Example Timeline

TimeRequestsAllowed
10:00–10:01100yes
10:01–10:02counter resetsyes

Flow

Diagram
flowchart TD Request --> CheckCounter CheckCounter -->|< limit| Allow CheckCounter -->|>= limit| Reject

Problem

Fixed windows cause burst traffic at window boundaries.

Example:

100 requests at 10:00:59
100 requests at 10:01:00

Total:

200 requests in 2 seconds

This can overload systems.


2. Sliding Window Log

Instead of fixed windows, this method stores timestamps of requests.

When a new request arrives:

  1. Remove timestamps older than the time window
  2. Count remaining requests
  3. Compare with limit

Example

Limit: 5 requests per 10 seconds

Stored timestamps:

[1s, 3s, 4s, 8s, 9s]

If a request arrives at 10 seconds:

1s is removed
Remaining = 4 requests

Request is allowed.


Pros

AdvantageExplanation
AccurateNo burst problems
FairTrue rolling window

Cons

Requires storing many timestamps.

Memory usage grows with traffic.


3. Sliding Window Counter

An optimized version of sliding window.

Instead of storing timestamps, it tracks:

  • current window count
  • previous window count

The algorithm calculates a weighted count.


Example

If a window is 60 seconds:

previous window = 50 requests
current window = 20 requests

Weighted calculation determines if request should be allowed.


Advantage

More memory efficient than sliding logs.


4. Token Bucket Algorithm

The token bucket algorithm is widely used in networking.


Concept

A bucket contains tokens.

Each request consumes one token.

Tokens refill over time.


Example

Bucket size = 10
Refill rate = 1 token/sec

Flow

Diagram
flowchart TD RefillTokens --> Bucket ClientRequest --> CheckTokens CheckTokens -->|Token Available| ConsumeToken ConsumeToken --> Service CheckTokens -->|Empty| Reject

Behavior

ScenarioResult
burst trafficallowed until tokens exhausted
sustained high trafficthrottled

Token bucket allows controlled bursts.


5. Leaky Bucket Algorithm

Leaky bucket works like a water tank with a small hole.

Water enters at any rate but exits at a constant rate.


Concept

Requests are queued and processed at a fixed rate.


Flow

Diagram
flowchart TD Requests --> BucketQueue BucketQueue --> DrainRate DrainRate --> Service

Behavior

FeatureExplanation
Smooth trafficoutput rate constant
Queue overflowrequests dropped

Algorithm Comparison

AlgorithmBurst SupportAccuracyMemory
Fixed WindowPoorLowLow
Sliding LogExcellentHighHigh
Sliding CounterGoodMediumMedium
Token BucketExcellentHighLow
Leaky BucketSmooth outputMediumLow

Distributed Rate Limiting

In large systems, services run across multiple servers.

Example:

10 API servers

If each server maintains its own rate counter:

Limit bypass occurs

Example:

100 requests/server × 10 servers = 1000 requests

Instead of:

100 requests total

Distributed Architecture

Rate limit counters must be shared across servers.

Diagram
flowchart LR Clients --> API1 Clients --> API2 Clients --> API3 API1 --> Redis API2 --> Redis API3 --> Redis

Shared storage maintains a global counter.


Using Redis for Rate Limiting

A common distributed solution uses Redis.

Why Redis?

FeatureBenefit
In-memoryextremely fast
atomic operationssafe counters
distributedshared across servers

Example Redis Counter

INCR user123:rate
EXPIRE user123:rate 60

Meaning:

increment request count
expire after 60 seconds

Distributed Token Bucket

Diagram
flowchart TD Clients[Clients] --> APIServers[API Servers] APIServers --> TokenBucketStore[Token Bucket Store] TokenBucketStore --> RedisCluster[(Redis Cluster)]

All servers share token state.


Rate Limiting in API Gateways

API gateways frequently enforce rate limits.

Typical architecture:

Diagram
flowchart LR Client --> API_Gateway API_Gateway --> RateLimiter RateLimiter --> Microservices

Advantages:

  • centralized control
  • protects backend services
  • simplifies implementation

Multi-Level Rate Limiting

Large systems implement multiple rate limits simultaneously.

Example:

LimitValue
Per user100 req/min
Per IP200 req/min
Global10,000 req/min

Architecture

Diagram
flowchart TD Client --> Gateway Gateway --> UserLimiter UserLimiter --> IPLimiter IPLimiter --> GlobalLimiter GlobalLimiter --> Service

Handling Rate Limit Violations

When clients exceed limits:

Return HTTP status:

429 Too Many Requests

Often include headers:

X-RateLimit-Limit
X-RateLimit-Remaining
X-RateLimit-Reset

These inform clients when they can retry.


Rate Limiting vs Throttling

These concepts are related but slightly different.

FeatureRate LimitingThrottling
Goalreject extra requestsslow request processing
Behaviorimmediate rejectiondelay requests
ExampleAPI limitbandwidth control

Real World Systems Using Rate Limiters

Rate limiting is widely used across internet platforms.

Examples include systems operated by organizations such as:

  • Stripe APIs limiting payment requests
  • GitHub APIs limiting developer queries
  • Cloudflare protecting websites from bot traffic

These platforms rely heavily on distributed rate limiting to protect infrastructure.


Common Challenges

Rate limiting is simple conceptually but challenging at scale.

ChallengeExplanation
Distributed counterssynchronization across servers
clock skewinaccurate time windows
hot keyshigh traffic for single users
performancelimiter must be extremely fast

Best Practices

PracticeWhy
Use token bucketsupports bursts
Implement distributed countersconsistent limits
add jitter for retriesprevent synchronized retries
monitor rate-limit metricsdetect abuse
apply multi-level limitsprotect system globally

Summary

Rate limiting is a fundamental protection mechanism in modern distributed systems.

It ensures that:

  • systems remain stable
  • users receive fair access
  • infrastructure is protected from abuse

Core ideas include:

  • controlling request frequency
  • enforcing quotas
  • implementing efficient algorithms
  • scaling limits across distributed systems

When implemented correctly, a rate limiter acts as a safety valve that protects systems from overload while maintaining reliable service for legitimate users.