System Design Blueprint

How to Design a Real-Time Chat System

A detailed, end-to-end explanation of how a large-scale messaging system like WhatsApp works — from sending a message to delivery, storage, encryption, and global scaling.

How WhatsApp Works

This section gives a quick end-to-end view of how a WhatsApp message travels through the system.

1. User sends an encrypted message from the client.
2. Message reaches the nearest WebSocket gateway.
3. Gateway forwards the message to the Message Service.
4. Message is persisted in the database before acknowledgment.
5. Server checks whether the recipient is online.
6. If online, the message is delivered in real time.
7. If offline, the message is queued and a push notification is sent.
8. Delivery and read receipts are updated asynchronously.

Key idea: Messages are persisted before they are acknowledged, delivered as fast as possible, and removed from server storage once delivery succeeds.

1. Understanding the Problem

Before diving into the solution, we need to understand what a real-time chat system must accomplish and the constraints it operates under.

A real-time chat system enables instant message delivery between users across devices and networks. The system must handle one-to-one conversations, group chats, offline message delivery, and maintain message ordering while scaling to billions of daily messages.

Functional Requirements

→ One-to-one messaging between users
→ Group messaging with multiple participants
→ Message delivery to offline users
→ Read receipts and delivery confirmations
→ Online presence indicators
→ Message history sync across devices

Non-Functional Requirements

→ Low latency: Millisecond message delivery
→ High availability: 99.99% uptime
→ Consistency: Correct message ordering
→ Scalability: Billions of daily messages
→ Security: End-to-end encryption

Scale Estimation

Hypothetical Load:

• Monthly active users: 2 billion
• Daily active users: 500 million
• Messages per day: 100 billion
• Average message size: 100 bytes

Performance Calculations:

100B / 86,400s × 2 (peak factor) ≈ 2.3 million msg/sec

100B × 100 bytes ≈ 10 TB/day

500M DAU × 20% online ≈ 100 million connections

Key Insight: The system is write-heavy with massive concurrent connections. WebSocket efficiency and horizontal scaling are critical design considerations.
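The estimates above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch; the variable names are illustrative, and the inputs are the hypothetical load figures listed in this section.

```python
SECONDS_PER_DAY = 86_400

messages_per_day = 100e9      # 100 billion messages per day
avg_message_bytes = 100       # average message size in bytes
daily_active_users = 500e6    # 500 million daily active users
peak_factor = 2               # peak traffic assumed to be 2x the daily average
online_fraction = 0.20        # share of DAU assumed to be connected at once

avg_rate = messages_per_day / SECONDS_PER_DAY
peak_rate = avg_rate * peak_factor
storage_per_day_tb = messages_per_day * avg_message_bytes / 1e12
concurrent_connections = daily_active_users * online_fraction

print(f"average rate: {avg_rate / 1e6:.2f} million msg/sec")
print(f"peak rate:    {peak_rate / 1e6:.2f} million msg/sec")
print(f"storage:      {storage_per_day_tb:.0f} TB/day")
print(f"connections:  {concurrent_connections / 1e6:.0f} million")
```

Changing the peak factor or the online fraction shows how sensitive the connection and throughput targets are to those two assumptions.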

2. High-Level Architecture

The system consists of several interconnected services that handle different responsibilities.

System Components

┌─────────────┐     ┌──────────────┐     ┌─────────────────┐
│   Mobile    │────▶│    Load      │────▶│   WebSocket     │
│   Client    │◀────│   Balancer   │◀────│   Gateway       │
└─────────────┘     └──────────────┘     └────────┬────────┘
                                                   │
                    ┌──────────────────────────────┼──────────────┐
                    │                              │              │
              ┌─────▼─────┐              ┌─────────▼────────┐    │
              │  Message  │              │    Presence      │    │
              │  Service  │              │    Service       │    │
              └─────┬─────┘              └──────────────────┘    │
                    │                                            │
         ┌──────────┼──────────┐                    ┌────────────▼───┐
         │          │          │                    │     Group      │
    ┌────▼────┐ ┌───▼───┐ ┌────▼────┐              │    Service     │
    │ Message │ │ Redis │ │  Push   │              └────────────────┘
    │   DB    │ │ Cache │ │ Service │
    └─────────┘ └───────┘ └─────────┘

WebSocket Gateway

Maintains persistent connections with clients, authenticates users, routes messages between users and backend services, and manages connection heartbeats.

Message Service

Handles message processing, validation, storage, and delivery logic. Ensures messages are persisted before acknowledgment.

Presence Service

Tracks which users are online, manages last-seen timestamps, and propagates status changes to relevant contacts.

Group Service

Manages group membership, permissions, and message fan-out to all group participants.

3. WebSocket Connection Management

Real-time messaging requires persistent connections between clients and servers. HTTP polling wastes resources on repeated requests that mostly return nothing, so the system uses WebSocket connections for efficient bidirectional communication.

Connection Registry

The system must track which server holds each user's connection. A distributed cache like Redis stores this mapping:

User ID → {
    server_id: "gateway-server-42",
    connection_id: "conn-abc123",
    last_heartbeat: "2024-01-15T10:30:00Z"
}
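As a rough illustration of this mapping, the sketch below keeps the registry in a plain Python dictionary. In a real deployment the entries would live in Redis; the `ConnectionRegistry` class and its method names are hypothetical and exist only for this example.

```python
import time
from dataclasses import dataclass


@dataclass
class ConnectionEntry:
    server_id: str
    connection_id: str
    last_heartbeat: float  # Unix timestamp in seconds


class ConnectionRegistry:
    """In-memory stand-in for the Redis mapping (hypothetical helper)."""

    def __init__(self):
        self._entries = {}

    def register(self, user_id, server_id, connection_id, now=None):
        # Record (or overwrite) where this user's connection currently lives.
        timestamp = time.time() if now is None else now
        self._entries[user_id] = ConnectionEntry(server_id, connection_id, timestamp)

    def lookup(self, user_id):
        # Returns None when the user has no registered connection.
        return self._entries.get(user_id)

    def unregister(self, user_id):
        self._entries.pop(user_id, None)


registry = ConnectionRegistry()
registry.register("user-b", "gateway-server-42", "conn-abc123")
print(registry.lookup("user-b"))
```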

Message Routing Flow

When User A sends a message to User B:

1. Look up User B's connection location in Redis

2. Route the message to that specific gateway server

3. The gateway pushes the message through User B's WebSocket

4. User B's client acknowledges receipt
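Steps 1 to 3 of this flow might look like the following sketch. The registry is modeled as a dictionary and `FakeGateway` is a hypothetical stand-in for a gateway server; the client acknowledgment in step 4 is not modeled here.

```python
class FakeGateway:
    """Hypothetical gateway that records what it would push to clients."""

    def __init__(self):
        self.sent = []

    def push(self, connection_id, payload):
        self.sent.append((connection_id, payload))


def route_message(registry, gateways, recipient_id, payload):
    """Return True if the payload was pushed to an online recipient."""
    entry = registry.get(recipient_id)              # step 1: look up location
    if entry is None:
        return False                                # no connection: recipient is offline
    gateway = gateways[entry["server_id"]]          # step 2: pick the owning gateway
    gateway.push(entry["connection_id"], payload)   # step 3: push over the socket
    return True


registry = {"user-b": {"server_id": "gateway-server-42", "connection_id": "conn-abc123"}}
gateways = {"gateway-server-42": FakeGateway()}
print(route_message(registry, gateways, "user-b", b"ciphertext"))
```

A `False` result is the signal to fall back to the offline delivery path.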

Handling Connection Failures

Connections drop due to network issues, app switches, or device sleep. The system handles this through:

Heartbeat Mechanism

Clients send periodic pings; missing heartbeats trigger cleanup after 30 seconds.

Reconnection Logic

Clients automatically reconnect with exponential backoff and resume from last message.

Message Buffering

Undelivered messages queue until the recipient reconnects; if the recipient stays offline, they are delivered through the offline path instead.
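The heartbeat timeout and the reconnection backoff can be sketched as two small functions. The 30-second timeout comes from the text above; the 1-second base delay, the 60-second cap, and the random jitter are assumptions chosen for illustration.

```python
import random

HEARTBEAT_TIMEOUT_SECONDS = 30


def find_stale_connections(last_heartbeats, now):
    """Return user IDs whose last heartbeat is older than the timeout."""
    return [
        user_id
        for user_id, seen_at in last_heartbeats.items()
        if now - seen_at > HEARTBEAT_TIMEOUT_SECONDS
    ]


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Delay in seconds before reconnect attempt `attempt` (0-based).

    The ceiling doubles on every attempt up to `cap`; the actual delay is a
    random fraction of that ceiling so clients do not all reconnect at once.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


print(find_stale_connections({"user-a": 100.0, "user-b": 60.0}, now=105.0))
print([round(backoff_delay(n), 2) for n in range(5)])
```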

4. Message Storage and Delivery

Message Data Model

Message {
    message_id:      UUID           // Globally unique identifier
    conversation_id: UUID           // Groups messages in a chat
    sender_id:       UUID           // Who sent the message
    content:         encrypted_bytes // Encrypted message payload
    content_type:    enum           // text, image, video, audio
    timestamp:       datetime       // Server-assigned timestamp
    status:          enum           // sent, delivered, read
}
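One possible way to express this model in code is a dataclass with enums for the two enumerated fields. The class and enum names below are illustrative, and the timestamp default stands in for the server-assigned value.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class ContentType(Enum):
    TEXT = "text"
    IMAGE = "image"
    VIDEO = "video"
    AUDIO = "audio"


class Status(Enum):
    SENT = "sent"
    DELIVERED = "delivered"
    READ = "read"


@dataclass
class Message:
    conversation_id: uuid.UUID
    sender_id: uuid.UUID
    content: bytes  # encrypted payload, opaque to the server
    content_type: ContentType = ContentType.TEXT
    message_id: uuid.UUID = field(default_factory=uuid.uuid4)
    # Stand-in for the server-assigned timestamp, always in UTC.
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    status: Status = Status.SENT


message = Message(uuid.uuid4(), uuid.uuid4(), b"ciphertext")
print(message.message_id, message.status.value)
```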

Database Selection

Primary Message Store

Cassandra or ScyllaDB

• Handles high write throughput
• Scales horizontally across regions
• Optimized for time-series data retrieval
• Partition by conversation_id for locality

Message Queue

Apache Kafka

• Buffers messages during traffic spikes
• Replicates messages across brokers to guard against loss during failures
• Enables replay for debugging and recovery
• Decouples producers from consumers

Message Delivery Flow

When User A sends a message to User B:

Send Path:

1. Client sends encrypted message over WebSocket

2. Gateway authenticates and forwards to Message Service

3. Message Service validates content and rate limits

4. Message writes to primary database

5. Acknowledgment sent to User A (single checkmark)

Receive Path:

6. System checks User B's connection status

7. If online: push through WebSocket gateway

8. If offline: queue for push notification

9. Delivery confirmation updates status (double checkmark)

10. Read receipt notifies User A when viewed
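The ordering that matters in this flow is persist first, acknowledge second, deliver third. The sketch below captures steps 4 to 8 under simplifying assumptions: the database is a list, presence is a set of online user IDs, and the `recipient_id` field and event names are hypothetical.

```python
def handle_send(message, store, online_users, push_queue):
    """Persist the message, acknowledge the sender, then deliver or queue."""
    store.append(message)                           # step 4: write to the database
    events = ["ack_sender"]                         # step 5: single checkmark
    if message["recipient_id"] in online_users:     # step 6: check connection status
        events.append("push_websocket")             # step 7: real-time delivery
    else:
        push_queue.append(message["message_id"])    # step 8: queue a push notification
        events.append("queue_push_notification")
    return events


store, push_queue = [], []
message = {"message_id": "m1", "recipient_id": "user-b", "content": b"ciphertext"}
print(handle_send(message, store, {"user-b"}, push_queue))
print(handle_send(message, store, set(), push_queue))
```

Because the acknowledgment is only produced after the write, a sender never sees a checkmark for a message the server could lose.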

5. Handling Offline Users

Users frequently go offline. The system must reliably deliver messages when they return.

Push Notification Strategy

1. Message Service detects recipient is offline

2. Push Service receives delivery request

3. Platform-specific push sent (APNs/FCM)

4. Push contains minimal data (sender, preview)

5. Full message syncs when app opens

Message Synchronization

1. Client reports last received message ID

2. Server queries all messages after that ID

3. Messages batch and stream to client

4. Client acknowledges receipt

5. Delivery status propagates to senders
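Steps 1 to 3 of the synchronization flow can be sketched as a batched query. Note the assumption: "after that ID" needs an ordered identifier, so this example uses a per-conversation integer `seq` field rather than a random UUID. The function name and batch size are illustrative.

```python
from bisect import bisect_right


def messages_after(log, last_seq, batch_size=2):
    """Yield batches of messages whose sequence number is greater than last_seq.

    `log` must be sorted by the (assumed) integer `seq` field.
    """
    sequence_numbers = [message["seq"] for message in log]
    start = bisect_right(sequence_numbers, last_seq)
    for index in range(start, len(log), batch_size):
        yield log[index:index + batch_size]


log = [{"seq": n, "content": b"ciphertext"} for n in range(1, 6)]
for batch in messages_after(log, last_seq=2):
    print([message["seq"] for message in batch])
```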

6. Group Messaging

Group chats introduce fan-out complexity where one message reaches many recipients.

Fan-Out Strategies

Fan-out on Write

When a message arrives, immediately create copies for each recipient.

✓ Fast reads, simple recipient logic

✗ High write amplification, storage overhead

Best for: Small groups (under 100 members)

Fan-out on Read

Store message once, recipients query their groups.

✓ Efficient storage, single write

✗ Slower reads, complex query logic

Best for: Large groups or broadcast channels

Recommended Approach: Use a hybrid strategy: fan-out on write for small groups (faster UX) and fan-out on read for large groups (efficient storage). This pattern is common in large-scale messaging systems.
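A hybrid strategy comes down to one size check at write time. The sketch below uses the 100-member threshold mentioned above; the inbox dictionary, group log list, and function name are hypothetical stand-ins for real storage.

```python
SMALL_GROUP_LIMIT = 100  # groups under this size use fan-out on write


def fan_out(message, members, inboxes, group_log):
    """Store a group message using the strategy that fits the group size."""
    if len(members) < SMALL_GROUP_LIMIT:
        # Fan-out on write: one copy per recipient inbox.
        for member in members:
            inboxes.setdefault(member, []).append(message)
        return "write"
    # Fan-out on read: store once; recipients query the group log later.
    group_log.append(message)
    return "read"


inboxes, group_log = {}, []
print(fan_out("hello", ["alice", "bob"], inboxes, group_log))
print(fan_out("announcement", [f"user-{n}" for n in range(500)], inboxes, group_log))
```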

7. End-to-End Encryption

Message security requires encryption that prevents even the server from reading content.

Signal Protocol Overview

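1. Key Generation: Each device generates a long-term identity key pair, a signed prekey, and a batch of one-time prekeys

2. Key Exchange: Each device uploads its public keys to the server; a sender fetches the recipient's public prekey bundle when starting a conversation

3. Session Establishment: The devices derive an initial shared secret with the X3DH key agreement, which then seeds the Double Ratchet

4. Message Encryption: The Double Ratchet derives a unique message key for each message

5. Forward Secrecy: Because old keys are discarded as the ratchet advances, a key compromised later cannot decrypt past messages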
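Steps 4 and 5 rest on a one-way key-derivation chain. The sketch below shows only that symmetric chain step, using HMAC-SHA256 with separate one-byte constants for the message key and the next chain key. It is a teaching illustration, not a usable implementation: it omits the initial key agreement, the Diffie-Hellman ratchet, and the message encryption itself, and the starting secret is a placeholder.

```python
import hashlib
import hmac


def ratchet_step(chain_key: bytes):
    """Derive (next_chain_key, message_key) from the current chain key."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key


# Placeholder for the shared secret that session establishment would produce.
chain_key = hashlib.sha256(b"demo shared secret").digest()

message_keys = []
for _ in range(3):
    chain_key, message_key = ratchet_step(chain_key)
    message_keys.append(message_key)  # each message gets its own key

print(len(set(message_keys)))
```

Because HMAC is one-way, holding the current chain key does not reveal earlier chain keys or message keys, which is the property forward secrecy depends on once old keys are deleted.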

Server-Side Key Storage

The server stores only public keys; it never has access to private keys or message content:

DeviceKey {
    user_id:          UUID
    device_id:        UUID
    identity_key:     public_key    // Long-term identity
    signed_prekey:    public_key    // Medium-term, rotated
    one_time_prekeys: [public_key]  // Single-use keys
}

8. Scaling Strategies

Horizontal Scaling

Each component scales independently based on its specific bottleneck:

• WebSocket Gateways: Add servers as connection count grows (up to roughly 1M connections per server, depending on hardware and kernel tuning)
• Message Services: Partition by conversation ID ranges
• Databases: Shard by user ID or conversation ID
• Cache Layers: Distribute across Redis cluster nodes

Geographic Distribution

Global users require regional infrastructure:

• Deploy in multiple regions (US, Europe, Asia, etc.)
• Route users to nearest region via GeoDNS
• Replicate data across regions for disaster recovery
• Handle cross-region messaging with async replication

Database Sharding

Shard messages by conversation ID for optimal query performance:

Shard = hash(conversation_id) % number_of_shards

Benefits:
• Messages in one conversation reside together
• Queries for conversation history hit single shard
• Load distributes evenly across shards
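The shard formula can be sketched as follows. One assumption worth stating: the hash must be stable across processes, so this example uses SHA-256 rather than Python's built-in `hash()`, which is randomized per process for strings. The function name and the shard count of 16 are illustrative.

```python
import hashlib


def shard_for(conversation_id: str, number_of_shards: int) -> int:
    """Map a conversation ID to a shard index in [0, number_of_shards)."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % number_of_shards


for conversation_id in ("conv-1", "conv-2", "conv-3"):
    print(conversation_id, shard_for(conversation_id, 16))
```

With plain modulo placement, changing `number_of_shards` remaps most conversations, so the shard count is usually fixed up front or replaced with a scheme such as consistent hashing.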

Key Design Principles

Use WebSocket connections with a distributed connection registry for real-time message delivery.
Handle offline users with push notifications and message synchronization on reconnect.
Implement hybrid fan-out for group messaging based on group size.
Secure messages with end-to-end encryption using the Signal Protocol.
Scale globally with geographic distribution and database sharding.
Ensure reliability with message ordering, deduplication, and graceful failover.
