Complete System Design Guide

Design Dropbox / Google Drive

Build a cloud file storage system handling petabytes of data with 11 nines durability

Master file chunking, deduplication, sync mechanisms, conflict resolution, and offline support

All System Design

What You Will Learn

Problem Statement & Requirements

Scale & Storage Estimation

High-Level Architecture

File Chunking Strategy

Content-Addressable Storage

Deduplication Techniques

Metadata Service Design

Sync Mechanism & Notifications

Conflict Detection & Resolution

Large File & Resumable Uploads

File Sharing & Permissions

Version History & Offline Mode

1Understanding the Problem

A cloud file storage service like Dropbox or Google Drive allows users to store files in the cloud, sync them across devices, and share with others. The key challenge is handling massive scale (petabytes of data) while ensuring files are never lost (11 nines durability) and changes sync instantly across all devices.

What Makes Cloud Storage Challenging?

Unlike serving static web content, file storage requires handling uploads, downloads, sync, and sharing - all while maintaining consistency and durability.

Massive Scale

Billions of files, petabytes of storage

Real-time Sync

Changes appear on all devices instantly

Never Lose Data

11 nines = 1 file lost per billion years

Functional Requirements

→Upload and download files from any device
→Auto-sync file changes across all connected devices
→Share files and folders via links or permissions
→Access previous versions of files (version history)
→Work offline and sync when connected
→Support files up to 50GB

Non-Functional Requirements

→Durability: 99.999999999% (11 nines)
→Availability: 99.9% uptime for file access
→Sync Latency: Changes within seconds
→Scalability: Billions of files, PB storage
→Bandwidth: Minimize upload/download time
→Consistency: No data corruption or conflicts

Key Insight: The hardest part isn't storage - it's sync. When a user edits a file on their laptop, that change must propagate to their phone, tablet, and web browser within seconds, handling conflicts if someone else edited simultaneously.

2Scale & Storage Estimation

Understanding scale helps us make architectural decisions about storage, chunking, and deduplication.

Scale Assumptions

• Registered Users: 500 million
• Paying Subscribers: 15 million (3%)
• Daily Active Users: 100 million
• Files per User: ~100 files average
• Average File Size: 1 MB (many docs, some large media)
• Storage per Paying User: 2 TB quota

Storage & Traffic Calculations:

// Total Files

500M users × 100 files = 50 billion files

// Raw Storage (without dedup)

50B files × 1 MB = 50 PB (petabytes)

// With 50% Deduplication

50 PB × 0.5 = ~25 PB actual storage

// Daily Uploads (10% users upload 2 files)

50M × 2 files × 1 MB = 100 TB/day uploads

// Sync Events (file changes)

100M DAU × 10 changes = 1 billion sync events/day

Metadata vs Block Storage

• Metadata: ~1 KB per file (name, path, permissions)
• Total Metadata: 50B × 1KB = 50 TB
• Block Storage: 25 PB (actual file content)
• Ratio: Metadata is 0.2% of total storage

This separation allows different optimization strategies for each.

Traffic Patterns

• Read:Write Ratio: 3:1 (more downloads)
• Peak Hours: 9 AM - 6 PM (work hours)
• Sync Frequency: Every 30 seconds poll
• Concurrent Connections: 10M+ WebSocket

Desktop clients maintain persistent connections for real-time sync.

Key Insight: With 50% deduplication, we save 25 PB of storage. This happens because many users have the same files (popular documents, default system files, shared content). Deduplication is not optional at this scale - it's essential for cost efficiency.

3High-Level Architecture

The architecture separates metadata (file info) from block storage (actual content), enabling independent scaling and optimization.

System Architecture Overview


┌─────────────────────────────────────────────────────────────────────────────┐
│                          DROPBOX ARCHITECTURE                                │
└─────────────────────────────────────────────────────────────────────────────┘

     Desktop Client          Mobile App              Web Browser
     (Sync Daemon)           (iOS/Android)           (JavaScript)
          │                       │                       │
          └───────────────────────┼───────────────────────┘
                                  │
                                  ▼
                    ┌─────────────────────────┐
                    │      API Gateway        │ ── Auth, Rate Limit
                    │      (Load Balancer)    │
                    └────────────┬────────────┘
                                 │
          ┌──────────────────────┼──────────────────────┐
          │                      │                      │
          ▼                      ▼                      ▼
┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
│   Block Server  │   │ Metadata Service│   │ Notification    │
│   (Upload/      │   │ (File/Folder    │   │ Service         │
│    Download)    │   │  Hierarchy)     │   │ (Sync Events)   │
└────────┬────────┘   └────────┬────────┘   └────────┬────────┘
         │                     │                     │
         │                     │                     │
         ▼                     ▼                     ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                            DATA LAYER                                        │
│                                                                              │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐           │
│  │   Block Storage  │  │    Metadata DB   │  │   Message Queue  │           │
│  │   (S3/GCS)       │  │   (PostgreSQL    │  │   (Kafka/SQS)    │           │
│  │                  │  │    + Sharding)   │  │                  │           │
│  │  Content-        │  │                  │  │  Sync events,    │           │
│  │  Addressable     │  │  File tree,      │  │  notifications   │           │
│  │  Storage (CAS)   │  │  permissions,    │  │                  │           │
│  │                  │  │  sharing         │  │                  │           │
│  └──────────────────┘  └──────────────────┘  └──────────────────┘           │
│                                                                              │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐           │
│  │      Redis       │  │   Elasticsearch  │  │      CDN         │           │
│  │  (Cache, Locks)  │  │  (File Search)   │  │  (Static/Media)  │           │
│  └──────────────────┘  └──────────────────┘  └──────────────────┘           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Core Components

Block Server: Handles file upload/download, chunking, dedup
Metadata Service: File hierarchy, permissions, versions
Notification Service: Real-time sync events via WebSocket
Sync Service: Coordinates changes across devices
Sharing Service: Links, permissions, collaboration

Storage Layer

Block Storage (S3/GCS): Actual file chunks, 11 nines durability
Metadata DB: PostgreSQL sharded by user_id
Redis: Hot metadata cache, distributed locks
Kafka: Sync events, async processing
Elasticsearch: File search across user's files

Why Separate Metadata from Blocks? Metadata operations (list files, get permissions) are frequent and small. Block operations (upload/download) are infrequent but large. Separating them allows each to scale independently and use optimal storage (fast DB for metadata, cheap blob storage for blocks).

4File Chunking Strategy

Instead of storing files as single blobs, we split them into fixed-size chunks (4 MB). This enables delta sync, deduplication, parallel uploads, and resumable transfers.

How Chunking Works


File Upload with Chunking:

Original File: presentation.pptx (12 MB)
                     │
                     ▼
        ┌────────────────────────────────────────────────┐
        │              CHUNKING (4 MB each)              │
        └────────────────────────────────────────────────┘
                     │
      ┌──────────────┼──────────────┐
      │              │              │
      ▼              ▼              ▼
┌──────────┐  ┌──────────┐  ┌──────────┐
│ Chunk 1  │  │ Chunk 2  │  │ Chunk 3  │
│  4 MB    │  │  4 MB    │  │  4 MB    │
└────┬─────┘  └────┬─────┘  └────┬─────┘
     │             │             │
     ▼             ▼             ▼
  SHA-256       SHA-256       SHA-256
     │             │             │
     ▼             ▼             ▼
  abc123...    def456...    ghi789...
     │             │             │
     └─────────────┼─────────────┘
                   │
                   ▼
         ┌─────────────────┐
         │ File Metadata   │
         │ presentation.pptx
         │ chunks: [       │
         │   {idx:0, hash:"abc123..."},
         │   {idx:1, hash:"def456..."},
         │   {idx:2, hash:"ghi789..."}
         │ ]               │
         └─────────────────┘


Delta Sync (only changed chunk uploaded):

Before Edit:  [Chunk1] [Chunk2] [Chunk3]
                         │
              (user edits slide on page 5)
                         │
                         ▼
After Edit:   [Chunk1] [Chunk2'] [Chunk3]
                         │
               Only Chunk2' is uploaded!
               (75% bandwidth saved)

Benefits of Chunking

✓Delta Sync: Only upload changed chunks
✓Deduplication: Same chunk = same hash = store once
✓Parallel Upload: Upload all chunks simultaneously
✓Resumable: Failed upload continues from last chunk
✓Large Files: 50 GB file = 12,500 chunks (manageable)

Chunk Size Trade-offs

Small Chunks (1 MB)

Better dedup, but more metadata overhead

Medium Chunks (4 MB) ✓

Good balance - Dropbox uses this

Large Chunks (16 MB)

Less overhead, but poor delta sync

Chunking Implementation

class FileChunker:
    CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB

    def chunk_file(self, file_path):
        chunks = []
        with open(file_path, 'rb') as f:
            index = 0
            while True:
                data = f.read(self.CHUNK_SIZE)
                if not data:
                    break

                # SHA-256 hash of chunk content
                chunk_hash = hashlib.sha256(data).hexdigest()

                chunks.append({
                    'index': index,
                    'hash': chunk_hash,
                    'size': len(data),
                    'data': data  # For upload
                })
                index += 1

        return chunks

    def upload_file(self, file_path, user_id):
        chunks = self.chunk_file(file_path)
        uploaded_chunks = []

        for chunk in chunks:
            # Check if chunk already exists (deduplication)
            if not self.block_storage.exists(chunk['hash']):
                self.block_storage.upload(chunk['hash'], chunk['data'])

            uploaded_chunks.append({
                'index': chunk['index'],
                'hash': chunk['hash'],
                'size': chunk['size']
            })

        # Save file metadata
        self.metadata_service.create_file(
            user_id=user_id,
            file_name=os.path.basename(file_path),
            chunks=uploaded_chunks,
            total_size=sum(c['size'] for c in uploaded_chunks)
        )

5Content-Addressable Storage (CAS)

In CAS, data is stored and retrieved using its content hash (SHA-256) rather than a file path. This is the foundation for deduplication.

How CAS Works


Traditional Storage:              Content-Addressable Storage:
──────────────────────           ─────────────────────────────

/users/alice/doc.pdf  ──┐        Content Hash = Address
/users/bob/doc.pdf    ──┼──▶     sha256(chunk) → "abc123def..."
/users/carol/doc.pdf  ──┘                           │
                                                    ▼
3 copies stored                  ┌─────────────────────────────┐
(wasted space!)                  │   Block Storage             │
                                 │   Key: "abc123def..."       │
                                 │   Value: [chunk data]       │
                                 └─────────────────────────────┘

                                 Only 1 copy stored!


Storage Key Generation:
─────────────────────────────────────────────────────────────

chunk_data = bytes(4MB of content)

hash = SHA-256(chunk_data)
     = "a3f2b8c9d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1"

storage_key = hash[:64]  # Use full 256-bit hash

Block Storage Layout:
/blocks/a3/f2/a3f2b8c9d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1
       ^^  ^^
       First two chars as directory prefix (prevents too many files in one dir)

Why SHA-256?

→Collision Resistant: Virtually impossible for two different chunks to have the same hash
→Fast: ~500 MB/s hashing speed on modern CPUs
→256-bit: 2^256 possible hashes (more than atoms in universe)
→Deterministic: Same content always = same hash

Integrity Verification

CAS provides built-in data integrity. When downloading, we verify the hash matches.

def download_chunk(chunk_hash):
    data = block_storage.get(chunk_hash)

    # Verify integrity
    computed_hash = sha256(data).hexdigest()
    if computed_hash != chunk_hash:
        raise DataCorruptionError(
            "Chunk corrupted!"
        )

    return data

6Deduplication Techniques

Deduplication can save 50%+ storage by storing identical content only once, even across different users.

Types of Deduplication

Client-Side Dedup

Check hash before uploading. Skip if exists.

// Client checks first
for chunk in chunks:
    exists = api.check_chunk_exists(chunk.hash)
    if not exists:
        api.upload_chunk(chunk)

# Benefits:
# - Saves upload bandwidth
# - Instant "upload" for existing files

Server-Side Dedup

Server checks before writing to storage.

// Server checks before storing
def store_chunk(hash, data):
    if block_storage.exists(hash):
        # Already stored by another user
        increment_ref_count(hash)
        return "exists"
    else:
        block_storage.put(hash, data)
        set_ref_count(hash, 1)
        return "stored"

Cross-User Deduplication


Scenario: 1000 users upload the same PDF (company handbook)

Without Dedup:
──────────────
Storage used = 1000 × 10 MB = 10 GB
Upload bandwidth = 1000 × 10 MB = 10 GB

With Dedup:
───────────
First user: Uploads 10 MB, hash computed, stored
User 2-1000: Client computes hash, server says "exists", no upload needed!

Storage used = 10 MB (only 1 copy!)
Upload bandwidth = 10 MB (first user only)

Savings: 99.9% storage, 99.9% bandwidth


Reference Counting for Garbage Collection:
──────────────────────────────────────────

Block "abc123..." (company_handbook.pdf chunks)
├── ref_count: 1000
├── created_at: 2024-01-15
└── last_accessed: 2024-01-20

When user deletes file:
- Decrement ref_count
- If ref_count == 0: Schedule for garbage collection
- Delay actual deletion by 30 days (in case of recovery)

Common Dedup Candidates

• Default OS files (Windows DLLs, Mac frameworks)
• Popular software installers
• Shared documents within organizations
• Email attachments forwarded multiple times
• Stock photos and icons
• Unchanged portions of edited files

Security Consideration

Problem: Client-side dedup can reveal if a file exists (timing attack).

Solution: Always upload chunks, let server dedup. Or add random delay to "exists" response.

Real-world Impact: Dropbox reported 75% storage savings from deduplication in their early days. As more users join, the savings increase because common files become more likely to be duplicated.

7Metadata Service Design

The metadata service manages the file hierarchy, permissions, versions, and sharing - everything except the actual file content.

Metadata Schema

-- Users
CREATE TABLE users (
    user_id         UUID PRIMARY KEY,
    email           VARCHAR(255) UNIQUE,
    storage_quota   BIGINT DEFAULT 2147483648,  -- 2 GB free tier
    storage_used    BIGINT DEFAULT 0,
    created_at      TIMESTAMP DEFAULT NOW()
);

-- Files and Folders (Unified)
CREATE TABLE file_entries (
    entry_id        UUID PRIMARY KEY,
    user_id         UUID REFERENCES users(user_id),
    parent_id       UUID REFERENCES file_entries(entry_id),  -- NULL for root
    name            VARCHAR(255) NOT NULL,
    is_folder       BOOLEAN DEFAULT FALSE,
    size_bytes      BIGINT,
    content_hash    VARCHAR(64),     -- SHA-256 of chunk hashes
    version         INT DEFAULT 1,
    is_deleted      BOOLEAN DEFAULT FALSE,  -- Soft delete
    created_at      TIMESTAMP DEFAULT NOW(),
    modified_at     TIMESTAMP DEFAULT NOW(),

    -- Unique name within parent folder
    UNIQUE (user_id, parent_id, name, is_deleted),

    -- Index for listing folder contents
    INDEX idx_parent (user_id, parent_id, is_deleted)
);

-- File Chunks (linking file to blocks)
CREATE TABLE file_chunks (
    file_id         UUID REFERENCES file_entries(entry_id),
    chunk_index     INT NOT NULL,
    chunk_hash      VARCHAR(64) NOT NULL,  -- Reference to block storage
    chunk_size      INT NOT NULL,

    PRIMARY KEY (file_id, chunk_index)
);

-- File Versions (for version history)
CREATE TABLE file_versions (
    version_id      UUID PRIMARY KEY,
    file_id         UUID REFERENCES file_entries(entry_id),
    version_number  INT NOT NULL,
    content_hash    VARCHAR(64),
    size_bytes      BIGINT,
    modified_by     UUID REFERENCES users(user_id),
    created_at      TIMESTAMP DEFAULT NOW(),

    UNIQUE (file_id, version_number)
);

-- Sharing
CREATE TABLE shares (
    share_id        UUID PRIMARY KEY,
    entry_id        UUID REFERENCES file_entries(entry_id),
    shared_by       UUID REFERENCES users(user_id),
    shared_with     UUID REFERENCES users(user_id),  -- NULL for public link
    permission      VARCHAR(20) DEFAULT 'view',  -- view, edit
    link_token      VARCHAR(64) UNIQUE,  -- For shareable links
    expires_at      TIMESTAMP,
    created_at      TIMESTAMP DEFAULT NOW()
);

File Tree Representation

Root (user_id, parent_id=NULL)
├── Documents (folder)
│   ├── work (folder)
│   │   ├── report.pdf (file)
│   │   └── notes.txt (file)
│   └── personal (folder)
│       └── taxes.xlsx (file)
├── Photos (folder)
│   └── vacation.jpg (file)
└── video.mp4 (file)

Listing /Documents/work:
SELECT * FROM file_entries
WHERE user_id = :user_id
  AND parent_id = :work_folder_id
  AND is_deleted = FALSE

Sharding Strategy

→Shard by user_id: All user's files on same shard
→Reason: Most queries are user-scoped
→Shared files: Stored in owner's shard, linked to recipient
→Hot users: Enterprise users may need dedicated shards

8Sync Mechanism & Notifications

When a file changes on one device, all other connected devices must be notified and updated within seconds.

Sync Flow


File Change Sync Flow:

Device A (Laptop)                   Server                    Device B (Phone)
      │                               │                              │
      │  1. User saves file           │                              │
      │  (local change detected)      │                              │
      │                               │                              │
      │  2. Compute chunk hashes      │                              │
      │  3. Upload changed chunks ───▶│                              │
      │  4. Update metadata ─────────▶│                              │
      │                               │                              │
      │                               │  5. Store chunks             │
      │                               │  6. Update DB                │
      │                               │  7. Publish sync event       │
      │                               │           │                  │
      │                               │           ▼                  │
      │                               │    ┌─────────────┐           │
      │                               │    │   Kafka     │           │
      │                               │    │ sync.events │           │
      │                               │    └──────┬──────┘           │
      │                               │           │                  │
      │                               │           ▼                  │
      │                               │  8. Notification Service     │
      │                               │           │                  │
      │                               │           │  WebSocket       │
      │                               │           └─────────────────▶│
      │                               │                              │
      │                               │                 9. Receive notification
      │                               │                    {file_id, version}
      │                               │                              │
      │                               │◀──────── 10. Fetch metadata ─│
      │                               │                              │
      │                               │─ 11. Return file info ──────▶│
      │                               │                              │
      │                               │◀──── 12. Download chunks ────│
      │                               │                              │
      │                               │── 13. Return chunk data ────▶│
      │                               │                              │
      │                               │               14. Write to local disk
      │                               │                    (sync complete!)

Notification Methods

WebSocket (Primary)

Persistent connection, instant push. Used by desktop/mobile apps.

Long Polling (Fallback)

For networks blocking WebSocket. Request waits up to 60 sec.

Push Notification

For mobile when app is closed. FCM/APNs.

Cursor-based Sync

Each device maintains a cursor (last known sync state). On reconnect, fetch all changes since cursor.

// Client stores cursor
local_cursor = "2024-01-20T10:30:00Z"

// On reconnect or periodic sync
changes = api.get_changes(cursor=local_cursor)

for change in changes:
    if change.type == "create":
        download_and_create(change.file)
    elif change.type == "update":
        download_and_update(change.file)
    elif change.type == "delete":
        delete_local(change.file)

local_cursor = changes.new_cursor

Local Change Detection

The desktop client must detect local file changes efficiently without constantly scanning.

File System Watcher

• macOS: FSEvents API
• Windows: ReadDirectoryChangesW
• Linux: inotify

Change Detection

• File watcher triggers on modify/create/delete
• Debounce rapid changes (wait 2 sec after last change)
• Compare file hash to detect actual content change

9Conflict Detection & Resolution

When two users edit the same file simultaneously, we have a conflict. The system must detect and handle this gracefully.

Conflict Scenario


Conflict Example:

         Server State
         version: 5
              │
     ┌────────┴────────┐
     │                 │
 Alice (offline)   Bob (online)
 edits file        edits file
 version: 5→6      version: 5→6
     │                 │
     │                 │──▶ Upload succeeds
     │                      Server: version 6 (Bob's)
     │
     │ (comes online)
     │──▶ Upload fails!
         "Conflict: your version 6 != server version 6"


Resolution Options:
──────────────────

1. LAST WRITE WINS (simple, data loss possible)
   └── Bob's version kept, Alice's discarded
   └── Used by: Simple systems, non-critical files

2. CONFLICTED COPY (Dropbox approach)
   └── Keep both versions:
       - report.docx (Bob's version, on server)
       - report (Alice's conflicted copy 2024-01-20).docx
   └── User manually merges

3. OPERATIONAL TRANSFORM (Google Docs approach)
   └── Merge changes automatically at character level
   └── Complex, requires app-specific logic

4. VERSION BRANCHING (Git approach)
   └── Create branches, require explicit merge
   └── Too complex for file sync

Conflict Detection

def upload_file(file_id, new_hash, client_version):
    db_version = get_file_version(file_id)

    if client_version != db_version:
        # Conflict detected!
        return ConflictError(
            server_version=db_version,
            client_version=client_version
        )

    # No conflict, proceed
    update_file(file_id, new_hash, db_version + 1)
    return Success(new_version=db_version + 1)

Conflicted Copy Strategy

def handle_conflict(file_id, local_file, user):
    original = get_file(file_id)

    # Create conflicted copy
    conflict_name = f"{original.name} " \
        f"({user.name}'s conflicted copy " \
        f"{datetime.now().date()}).{original.ext}"

    create_file(
        parent_id=original.parent_id,
        name=conflict_name,
        content=local_file.content
    )

    # Notify user
    notify_user(user, "Conflict resolved: " \
        f"check {conflict_name}")

Best Practice: Dropbox's approach (conflicted copies) is most user-friendly for file sync. Users can see both versions and manually choose. For collaborative editing (Google Docs style), use Operational Transform.

10Large File & Resumable Uploads

Uploading a 50 GB file on unreliable network requires resumable uploads. If connection drops, resume from last successful chunk.

Resumable Upload Flow


50 GB Video Upload:

1. INITIATE UPLOAD
   Client: POST /uploads/init
           {file_name: "video.mp4", size: 50GB, chunks: 12500}
   Server: Returns upload_session_id, signed URLs for each chunk

2. UPLOAD CHUNKS (parallel)
   ┌─────────────────────────────────────────────────────────────┐
   │ Chunk 1 ──▶ S3 (success)     ✓                              │
   │ Chunk 2 ──▶ S3 (success)     ✓                              │
   │ Chunk 3 ──▶ S3 (success)     ✓                              │
   │ Chunk 4 ──▶ S3 (fail!)       ✗  ← Network disconnects       │
   │ Chunk 5 ──▶ (not sent)                                      │
   │ ...                                                         │
   │ Chunk 12500 ──▶ (not sent)                                  │
   └─────────────────────────────────────────────────────────────┘

3. RESUME (after reconnect)
   Client: GET /uploads/{session_id}/status
   Server: Returns {uploaded_chunks: [1,2,3], pending: [4,5,...,12500]}

4. CONTINUE FROM CHUNK 4
   ┌─────────────────────────────────────────────────────────────┐
   │ Chunk 4 ──▶ S3 (success)     ✓                              │
   │ Chunk 5 ──▶ S3 (success)     ✓                              │
   │ ...                                                         │
   │ Chunk 12500 ──▶ S3 (success) ✓                              │
   └─────────────────────────────────────────────────────────────┘

5. FINALIZE
   Client: POST /uploads/{session_id}/complete
   Server: Verify all chunks, create file metadata, return file_id


Session Management:
───────────────────
- Upload sessions expire after 7 days
- Incomplete chunks are garbage collected
- Progress persisted in Redis: upload:{session_id}:chunks = [1,2,3]

Parallel Upload Optimization

→Upload 4-8 chunks in parallel
→Use signed URLs to upload directly to S3
→Skip dedup-matched chunks (instant "upload")
→Retry failed chunks with exponential backoff

Bandwidth Throttling

Don't saturate user's network. Allow configurable limits.

Settings:
- Upload bandwidth: Auto / 1 Mbps / 5 Mbps
- Download bandwidth: Auto / 2 Mbps / 10 Mbps
- Pause sync on metered connection

Auto mode:
- Detect available bandwidth
- Use 80% for sync, leave 20% for user

11File Sharing & Permissions

Users can share files/folders with specific people or via public links, with different permission levels.

Sharing Options

Share with People

Invite by email, they get it in their Dropbox

Share via Link

Anyone with link can view/download

Password Protected

Link requires password to access

Permission Model


Permission Levels:
──────────────────
VIEWER:   Read-only, can download
EDITOR:   Can modify files, add new files
OWNER:    Full control, can share, delete, transfer

Folder Sharing Inheritance:
───────────────────────────
/Shared_Folder (shared with Bob as EDITOR)
├── file1.txt      ← Bob can edit
├── subfolder/     ← Bob can edit
│   └── file2.txt  ← Bob can edit
└── file3.txt      ← Bob can edit

Permission Check Flow:
──────────────────────
def can_access(user_id, file_id, required_permission):
    # 1. Check if user owns the file
    file = get_file(file_id)
    if file.user_id == user_id:
        return True  # Owner has all permissions

    # 2. Check direct share
    share = get_share(file_id, user_id)
    if share and share.permission >= required_permission:
        return True

    # 3. Check parent folder shares (inheritance)
    parent = file.parent
    while parent:
        share = get_share(parent.id, user_id)
        if share and share.permission >= required_permission:
            return True
        parent = parent.parent

    return False

Shareable Link Generation

def create_share_link(file_id, options):
    token = generate_secure_token(32)  # Random 256-bit

    share = Share(
        entry_id=file_id,
        link_token=token,
        permission=options.permission,  # view/edit
        password_hash=hash(options.password) if options.password else None,
        expires_at=options.expiry,
        download_limit=options.max_downloads
    )
    save(share)

    return f"https://dropbox.com/s/{token}"

# Accessing shared link
def access_share_link(token, password=None):
    share = get_share_by_token(token)

    if share.expires_at and share.expires_at < now():
        raise LinkExpiredError()

    if share.password_hash:
        if not password or hash(password) != share.password_hash:
            raise PasswordRequiredError()

    return get_file(share.entry_id)

Shared Folder Sync

When folder is shared with someone, it appears in their file tree.

• Shared folder syncs to all members
• Changes by any member propagate to all
• Storage counted against owner's quota
• Members can "leave" shared folder
• Owner can remove members anytime

12Version History & Offline Mode

Version History

Keep previous versions for recovery. Paid plans keep 180+ days.

File: report.docx
──────────────────
v5 (current) - Jan 20, 10:30 AM - Alice
v4           - Jan 20, 09:15 AM - Bob
v3           - Jan 19, 04:00 PM - Alice
v2           - Jan 18, 11:00 AM - Alice
v1           - Jan 17, 02:30 PM - Alice (created)

// Each version stores:
- content_hash (points to chunks)
- size_bytes
- modified_by
- created_at

// Restoring a version:
1. Create new version (v6)
2. Copy chunks from v3 to v6
3. Update current pointer

Offline Mode

Users mark files for offline access. Sync when connected.

Offline Flow:
─────────────
1. User marks folder "Available Offline"
2. Client downloads all files in folder
3. User goes offline, edits files
4. Client queues changes locally

On Reconnect:
─────────────
1. Upload queued changes
2. Fetch remote changes (may conflict)
3. Resolve conflicts (conflicted copies)
4. Sync complete

Local Database (SQLite):
- Tracks offline-enabled files
- Stores pending changes queue
- Caches file metadata

Deleted File Recovery

Soft delete with recovery period:

• File marked is_deleted = true
• Hidden from normal listings
• Visible in "Deleted Files" view
• Can restore within 30 days (free) / 180 days (paid)

Permanent deletion:

• After retention period, mark for GC
• Decrement chunk reference counts
• If ref_count = 0, delete chunk from S3
• For compliance: immediate permanent delete option

✓Key Design Principles

File Chunking: Split files into 4 MB chunks. Enables delta sync (only changed chunks), deduplication, parallel uploads, and resumable transfers.
Content-Addressable Storage: Store chunks by SHA-256 hash. Same content = same address. Foundation for deduplication and integrity verification.
Deduplication: Store identical chunks only once across all users. 50%+ storage savings. Check hash before upload to skip duplicate data.
Separate Metadata & Blocks: Metadata (file tree, permissions) in PostgreSQL. Blocks in S3/GCS. Each scales independently with optimal storage.
Real-time Sync: WebSocket for instant notifications. Cursor-based sync for catching up after offline. File system watcher for local change detection.
Conflict Resolution: Use version numbers to detect conflicts. Create conflicted copies rather than overwriting. Let users manually merge.
11 Nines Durability: Store chunks in S3/GCS with cross-region replication. Multiple copies ensure data survives any hardware failure.

Ready to Design Cloud Storage?

Practice with an AI interviewer and get instant feedback on your system design skills.