Build a cloud file storage system handling petabytes of data with 11 nines durability
Master file chunking, deduplication, sync mechanisms, conflict resolution, and offline support
A cloud file storage service like Dropbox or Google Drive allows users to store files in the cloud, sync them across devices, and share with others. The key challenge is handling massive scale (petabytes of data) while ensuring files are never lost (11 nines durability) and changes sync instantly across all devices.
Unlike serving static web content, file storage requires handling uploads, downloads, sync, and sharing - all while maintaining consistency and durability.
Massive Scale
Billions of files, petabytes of storage
Real-time Sync
Changes appear on all devices instantly
Never Lose Data
11 nines = 99.999999999% durability (on average, one object lost per 10 million stored every 10,000 years)
Key Insight: The hardest part isn't storage - it's sync. When a user edits a file on their laptop, that change must propagate to their phone, tablet, and web browser within seconds, handling conflicts if someone else edited simultaneously.
Understanding scale helps us make architectural decisions about storage, chunking, and deduplication.
Total files:    500M users × 100 files each = 50 billion files
Raw storage:    50B files × 1 MB average = 50 PB (petabytes)
After dedup:    50 PB × 0.5 = ~25 PB actual storage
Daily uploads:  50M uploading users × 2 files × 1 MB = 100 TB/day
Sync traffic:   100M DAU × 10 changes = 1 billion sync events/day
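The estimates above can be reproduced with a quick back-of-envelope script. All inputs are the assumptions stated in the text (decimal units: 1 PB = 10^9 MB), not measured values.

```python
# Capacity estimation with the assumptions from the text.
USERS = 500_000_000
FILES_PER_USER = 100
AVG_FILE_MB = 1
DEDUP_RATIO = 0.5            # assume half of all content is duplicated
DAILY_UPLOADERS = 50_000_000
UPLOADS_PER_USER = 2
DAU = 100_000_000
CHANGES_PER_DAU = 10

total_files = USERS * FILES_PER_USER                  # 50 billion files
raw_pb = total_files * AVG_FILE_MB / 1_000_000_000    # MB -> PB
stored_pb = raw_pb * DEDUP_RATIO                      # after deduplication
daily_tb = DAILY_UPLOADERS * UPLOADS_PER_USER * AVG_FILE_MB / 1_000_000
sync_events = DAU * CHANGES_PER_DAU

print(total_files, raw_pb, stored_pb, daily_tb, sync_events)
```

Changing any single assumption (average file size, dedup ratio) propagates through the whole estimate, which is the point of keeping the arithmetic explicit.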
This separation allows different optimization strategies for each.
Desktop clients maintain persistent connections for real-time sync.
Key Insight: With 50% deduplication, we save 25 PB of storage. This happens because many users have the same files (popular documents, default system files, shared content). Deduplication is not optional at this scale - it's essential for cost efficiency.
The architecture separates metadata (file info) from block storage (actual content), enabling independent scaling and optimization.
┌─────────────────────────────────────────────────────────────────────────────┐
│ DROPBOX ARCHITECTURE │
└─────────────────────────────────────────────────────────────────────────────┘
Desktop Client Mobile App Web Browser
(Sync Daemon) (iOS/Android) (JavaScript)
│ │ │
└───────────────────────┼───────────────────────┘
│
▼
┌─────────────────────────┐
│ API Gateway │ ── Auth, Rate Limit
│ (Load Balancer) │
└────────────┬────────────┘
│
┌──────────────────────┼──────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Block Server │ │ Metadata Service│ │ Notification │
│ (Upload/ │ │ (File/Folder │ │ Service │
│ Download) │ │ Hierarchy) │ │ (Sync Events) │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
│ │ │
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ DATA LAYER │
│ │
│ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Block Storage │ │ Metadata DB │ │ Message Queue │ │
│ │ (S3/GCS) │ │ (PostgreSQL │ │ (Kafka/SQS) │ │
│ │ │ │ + Sharding) │ │ │ │
│ │ Content- │ │ │ │ Sync events, │ │
│ │ Addressable │ │ File tree, │ │ notifications │ │
│ │ Storage (CAS) │ │ permissions, │ │ │ │
│ │ │ │ sharing │ │ │ │
│ └──────────────────┘ └──────────────────┘ └──────────────────┘ │
│ │
│ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Redis │ │ Elasticsearch │ │ CDN │ │
│ │ (Cache, Locks) │ │ (File Search) │ │ (Static/Media) │ │
│ └──────────────────┘ └──────────────────┘ └──────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Why Separate Metadata from Blocks? Metadata operations (list files, get permissions) are frequent and small. Block operations (upload/download) are infrequent but large. Separating them allows each to scale independently and use optimal storage (fast DB for metadata, cheap blob storage for blocks).
Instead of storing files as single blobs, we split them into fixed-size chunks (4 MB). This enables delta sync, deduplication, parallel uploads, and resumable transfers.
File Upload with Chunking:
Original File: presentation.pptx (12 MB)
│
▼
┌────────────────────────────────────────────────┐
│ CHUNKING (4 MB each) │
└────────────────────────────────────────────────┘
│
┌──────────────┼──────────────┐
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Chunk 1 │ │ Chunk 2 │ │ Chunk 3 │
│ 4 MB │ │ 4 MB │ │ 4 MB │
└────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ │
▼ ▼ ▼
SHA-256 SHA-256 SHA-256
│ │ │
▼ ▼ ▼
abc123... def456... ghi789...
│ │ │
└─────────────┼─────────────┘
│
▼
┌─────────────────┐
│ File Metadata │
│ presentation.pptx
│ chunks: [ │
│ {idx:0, hash:"abc123..."},
│ {idx:1, hash:"def456..."},
│ {idx:2, hash:"ghi789..."}
│ ] │
└─────────────────┘
Delta Sync (only changed chunk uploaded):
Before Edit: [Chunk1] [Chunk2] [Chunk3]
│
(user edits slide on page 5)
│
▼
After Edit: [Chunk1] [Chunk2'] [Chunk3]
│
Only Chunk2' is uploaded!
(75% bandwidth saved)
Small Chunks (1 MB)
Better dedup, but more metadata overhead
Medium Chunks (4 MB) ✓
Good balance - Dropbox uses this
Large Chunks (16 MB)
Less overhead, but poor delta sync
import hashlib
import os

class FileChunker:
    CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB

    def chunk_file(self, file_path):
        chunks = []
        with open(file_path, 'rb') as f:
            index = 0
            while True:
                data = f.read(self.CHUNK_SIZE)
                if not data:
                    break
                # SHA-256 hash of chunk content
                chunk_hash = hashlib.sha256(data).hexdigest()
                chunks.append({
                    'index': index,
                    'hash': chunk_hash,
                    'size': len(data),
                    'data': data  # For upload
                })
                index += 1
        return chunks

    def upload_file(self, file_path, user_id):
        chunks = self.chunk_file(file_path)
        uploaded_chunks = []
        for chunk in chunks:
            # Check if chunk already exists (deduplication)
            if not self.block_storage.exists(chunk['hash']):
                self.block_storage.upload(chunk['hash'], chunk['data'])
            uploaded_chunks.append({
                'index': chunk['index'],
                'hash': chunk['hash'],
                'size': chunk['size']
            })
        # Save file metadata
        self.metadata_service.create_file(
            user_id=user_id,
            file_name=os.path.basename(file_path),
            chunks=uploaded_chunks,
            total_size=sum(c['size'] for c in uploaded_chunks)
        )

In CAS, data is stored and retrieved by its content hash (SHA-256) rather than by a file path. This is the foundation for deduplication.
Traditional Storage: Content-Addressable Storage:
────────────────────── ─────────────────────────────
/users/alice/doc.pdf ──┐ Content Hash = Address
/users/bob/doc.pdf ──┼──▶ sha256(chunk) → "abc123def..."
/users/carol/doc.pdf ──┘ │
▼
3 copies stored ┌─────────────────────────────┐
(wasted space!) │ Block Storage │
│ Key: "abc123def..." │
│ Value: [chunk data] │
└─────────────────────────────┘
Only 1 copy stored!
Storage Key Generation:
─────────────────────────────────────────────────────────────
chunk_data = bytes(4MB of content)
hash = SHA-256(chunk_data)
= "a3f2b8c9d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1"
storage_key = hash  # use the full 64-hex-char (256-bit) digest as the key
Block Storage Layout:
/blocks/a3/f2/a3f2b8c9d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1
^^ ^^
First two chars as directory prefix (prevents too many files in one dir)
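The sharded layout above can be sketched in a few lines. The `root` prefix and the two-level split are illustrative choices, not a real service's layout.

```python
import hashlib

def block_path(chunk_data: bytes, root: str = "/blocks") -> str:
    """Build a content-addressed path: first two byte-pairs of the hex
    digest become directory prefixes so no single directory accumulates
    millions of block files."""
    digest = hashlib.sha256(chunk_data).hexdigest()  # 64 hex chars
    return f"{root}/{digest[:2]}/{digest[2:4]}/{digest}"

p = block_path(b"hello world")
# -> /blocks/b9/4d/b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9
```

With two hex characters per level, each level fans out into at most 256 directories, keeping directory listings small even at billions of blocks.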
CAS provides built-in data integrity. When downloading, we verify the hash matches.
import hashlib

def download_chunk(chunk_hash):
    data = block_storage.get(chunk_hash)
    # Verify integrity: the storage key IS the expected hash
    computed_hash = hashlib.sha256(data).hexdigest()
    if computed_hash != chunk_hash:
        raise DataCorruptionError("Chunk corrupted!")
    return data

Deduplication can save 50%+ storage by storing identical content only once, even across different users.
Check hash before uploading. Skip if exists.
# Client checks first
for chunk in chunks:
    exists = api.check_chunk_exists(chunk.hash)
    if not exists:
        api.upload_chunk(chunk)

# Benefits:
# - Saves upload bandwidth
# - Instant "upload" for existing files

Server checks before writing to storage.
# Server checks before storing
def store_chunk(hash, data):
    if block_storage.exists(hash):
        # Already stored by another user
        increment_ref_count(hash)
        return "exists"
    else:
        block_storage.put(hash, data)
        set_ref_count(hash, 1)
        return "stored"

Scenario: 1000 users upload the same 10 MB PDF (company handbook)

Without Dedup:
──────────────
Storage used     = 1000 × 10 MB = 10 GB
Upload bandwidth = 1000 × 10 MB = 10 GB

With Dedup:
───────────
First user:   Uploads 10 MB, hash computed, stored
Users 2-1000: Client computes hash, server says "exists", no upload needed!
Storage used     = 10 MB (only 1 copy!)
Upload bandwidth = 10 MB (first user only)

Savings: 99.9% storage, 99.9% bandwidth

Reference Counting for Garbage Collection:
──────────────────────────────────────────
Block "abc123..." (company_handbook.pdf chunks)
├── ref_count: 1000
├── created_at: 2024-01-15
└── last_accessed: 2024-01-20

When a user deletes a file:
- Decrement ref_count
- If ref_count == 0: schedule for garbage collection
- Delay actual deletion by 30 days (in case of recovery)
Problem: Client-side dedup can reveal if a file exists (timing attack).
Solution: Always upload chunks, let server dedup. Or add random delay to "exists" response.
Real-world Impact: Dropbox reported 75% storage savings from deduplication in their early days. As more users join, the savings increase because common files become more likely to be duplicated.
The metadata service manages the file hierarchy, permissions, versions, and sharing - everything except the actual file content.
-- Users
CREATE TABLE users (
user_id UUID PRIMARY KEY,
email VARCHAR(255) UNIQUE,
storage_quota BIGINT DEFAULT 2147483648, -- 2 GB free tier
storage_used BIGINT DEFAULT 0,
created_at TIMESTAMP DEFAULT NOW()
);
-- Files and Folders (Unified)
CREATE TABLE file_entries (
entry_id UUID PRIMARY KEY,
user_id UUID REFERENCES users(user_id),
parent_id UUID REFERENCES file_entries(entry_id), -- NULL for root
name VARCHAR(255) NOT NULL,
is_folder BOOLEAN DEFAULT FALSE,
size_bytes BIGINT,
content_hash VARCHAR(64), -- SHA-256 of chunk hashes
version INT DEFAULT 1,
is_deleted BOOLEAN DEFAULT FALSE, -- Soft delete
created_at TIMESTAMP DEFAULT NOW(),
modified_at TIMESTAMP DEFAULT NOW(),
  -- Unique name within parent folder
  UNIQUE (user_id, parent_id, name, is_deleted)
);

-- Index for listing folder contents
-- (PostgreSQL has no inline INDEX clause in CREATE TABLE)
CREATE INDEX idx_parent ON file_entries (user_id, parent_id, is_deleted);
-- File Chunks (linking file to blocks)
CREATE TABLE file_chunks (
file_id UUID REFERENCES file_entries(entry_id),
chunk_index INT NOT NULL,
chunk_hash VARCHAR(64) NOT NULL, -- Reference to block storage
chunk_size INT NOT NULL,
PRIMARY KEY (file_id, chunk_index)
);
-- File Versions (for version history)
CREATE TABLE file_versions (
version_id UUID PRIMARY KEY,
file_id UUID REFERENCES file_entries(entry_id),
version_number INT NOT NULL,
content_hash VARCHAR(64),
size_bytes BIGINT,
modified_by UUID REFERENCES users(user_id),
created_at TIMESTAMP DEFAULT NOW(),
UNIQUE (file_id, version_number)
);
-- Sharing
CREATE TABLE shares (
share_id UUID PRIMARY KEY,
entry_id UUID REFERENCES file_entries(entry_id),
shared_by UUID REFERENCES users(user_id),
shared_with UUID REFERENCES users(user_id), -- NULL for public link
permission VARCHAR(20) DEFAULT 'view', -- view, edit
link_token VARCHAR(64) UNIQUE, -- For shareable links
expires_at TIMESTAMP,
created_at TIMESTAMP DEFAULT NOW()
);

Example File Tree:

Root (user_id, parent_id=NULL)
├── Documents (folder)
│   ├── work (folder)
│   │   ├── report.pdf (file)
│   │   └── notes.txt (file)
│   └── personal (folder)
│       └── taxes.xlsx (file)
├── Photos (folder)
│   └── vacation.jpg (file)
└── video.mp4 (file)

Listing /Documents/work:

SELECT * FROM file_entries
WHERE user_id = :user_id
  AND parent_id = :work_folder_id
  AND is_deleted = FALSE;
When a file changes on one device, all other connected devices must be notified and updated within seconds.
File Change Sync Flow:
Device A (Laptop) Server Device B (Phone)
│ │ │
│ 1. User saves file │ │
│ (local change detected) │ │
│ │ │
│ 2. Compute chunk hashes │ │
│ 3. Upload changed chunks ───▶│ │
│ 4. Update metadata ─────────▶│ │
│ │ │
│ │ 5. Store chunks │
│ │ 6. Update DB │
│ │ 7. Publish sync event │
│ │ │ │
│ │ ▼ │
│ │ ┌─────────────┐ │
│ │ │ Kafka │ │
│ │ │ sync.events │ │
│ │ └──────┬──────┘ │
│ │ │ │
│ │ ▼ │
│ │ 8. Notification Service │
│ │ │ │
│ │ │ WebSocket │
│ │ └─────────────────▶│
│ │ │
│ │ 9. Receive notification
│ │ {file_id, version}
│ │ │
│ │◀──────── 10. Fetch metadata ─│
│ │ │
│ │─ 11. Return file info ──────▶│
│ │ │
│ │◀──── 12. Download chunks ────│
│ │ │
│ │── 13. Return chunk data ────▶│
│ │ │
│ │ 14. Write to local disk
│ │ (sync complete!)
WebSocket (Primary)
Persistent connection, instant push. Used by desktop/mobile apps.
Long Polling (Fallback)
For networks blocking WebSocket. Request waits up to 60 sec.
Push Notification
For mobile when app is closed. FCM/APNs.
Each device maintains a cursor (last known sync state). On reconnect, fetch all changes since cursor.
# Client stores cursor
local_cursor = "2024-01-20T10:30:00Z"

# On reconnect or periodic sync
changes = api.get_changes(cursor=local_cursor)
for change in changes:
    if change.type == "create":
        download_and_create(change.file)
    elif change.type == "update":
        download_and_update(change.file)
    elif change.type == "delete":
        delete_local(change.file)
local_cursor = changes.new_cursor

The desktop client must detect local file changes efficiently without constantly scanning.
File System Watcher
Change Detection
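OS file-watcher APIs (inotify on Linux, FSEvents on macOS, ReadDirectoryChangesW on Windows) push change events to the client. A portable fallback is to poll modification times. This stdlib-only sketch detects created, updated, and deleted files by comparing (mtime, size) snapshots between scans; it is an illustrative fallback, not how a production daemon watches files.

```python
import os

def snapshot(root):
    """Map every file under root to its (mtime_ns, size) fingerprint."""
    state = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            state[path] = (st.st_mtime_ns, st.st_size)
    return state

def diff(old, new):
    """Compare two snapshots and classify each change."""
    created = [p for p in new if p not in old]
    deleted = [p for p in old if p not in new]
    updated = [p for p in new if p in old and new[p] != old[p]]
    return created, updated, deleted
```

Polling is O(number of files) per scan, which is why real clients use OS events and fall back to periodic full scans only to catch missed notifications.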
When two users edit the same file simultaneously, we have a conflict. The system must detect and handle this gracefully.
Conflict Example:
Server State
version: 5
│
┌────────┴────────┐
│ │
Alice (offline) Bob (online)
edits file edits file
version: 5→6 version: 5→6
│ │
│ │──▶ Upload succeeds
│ Server: version 6 (Bob's)
│
│ (comes online)
│──▶ Upload fails!
"Conflict: base version 5 != server version 6"
Resolution Options:
──────────────────
1. LAST WRITE WINS (simple, data loss possible)
└── Bob's version kept, Alice's discarded
└── Used by: Simple systems, non-critical files
2. CONFLICTED COPY (Dropbox approach)
└── Keep both versions:
- report.docx (Bob's version, on server)
- report (Alice's conflicted copy 2024-01-20).docx
└── User manually merges
3. OPERATIONAL TRANSFORM (Google Docs approach)
└── Merge changes automatically at character level
└── Complex, requires app-specific logic
4. VERSION BRANCHING (Git approach)
└── Create branches, require explicit merge
└── Too complex for file sync
def upload_file(file_id, new_hash, client_version):
    db_version = get_file_version(file_id)
    if client_version != db_version:
        # Conflict detected!
        return ConflictError(
            server_version=db_version,
            client_version=client_version
        )
    # No conflict, proceed
    update_file(file_id, new_hash, db_version + 1)
    return Success(new_version=db_version + 1)

def handle_conflict(file_id, local_file, user):
    original = get_file(file_id)
    # Create conflicted copy alongside the original
    conflict_name = (f"{original.name} "
                     f"({user.name}'s conflicted copy "
                     f"{datetime.now().date()}).{original.ext}")
    create_file(
        parent_id=original.parent_id,
        name=conflict_name,
        content=local_file.content
    )
    # Notify user
    notify_user(user, f"Conflict resolved: check {conflict_name}")

Best Practice: Dropbox's approach (conflicted copies) is the most user-friendly for file sync. Users can see both versions and choose manually. For collaborative editing (Google Docs style), use Operational Transform.
Uploading a 50 GB file on unreliable network requires resumable uploads. If connection drops, resume from last successful chunk.
50 GB Video Upload:
1. INITIATE UPLOAD
Client: POST /uploads/init
{file_name: "video.mp4", size: 50GB, chunks: 12500}
Server: Returns upload_session_id, signed URLs for each chunk
2. UPLOAD CHUNKS (parallel)
┌─────────────────────────────────────────────────────────────┐
│ Chunk 1 ──▶ S3 (success) ✓ │
│ Chunk 2 ──▶ S3 (success) ✓ │
│ Chunk 3 ──▶ S3 (success) ✓ │
│ Chunk 4 ──▶ S3 (fail!) ✗ ← Network disconnects │
│ Chunk 5 ──▶ (not sent) │
│ ... │
│ Chunk 12500 ──▶ (not sent) │
└─────────────────────────────────────────────────────────────┘
3. RESUME (after reconnect)
Client: GET /uploads/{session_id}/status
Server: Returns {uploaded_chunks: [1,2,3], pending: [4,5,...,12500]}
4. CONTINUE FROM CHUNK 4
┌─────────────────────────────────────────────────────────────┐
│ Chunk 4 ──▶ S3 (success) ✓ │
│ Chunk 5 ──▶ S3 (success) ✓ │
│ ... │
│ Chunk 12500 ──▶ S3 (success) ✓ │
└─────────────────────────────────────────────────────────────┘
5. FINALIZE
Client: POST /uploads/{session_id}/complete
Server: Verify all chunks, create file metadata, return file_id
Session Management:
───────────────────
- Upload sessions expire after 7 days
- Incomplete chunks are garbage collected
- Progress persisted in Redis: upload:{session_id}:chunks = [1,2,3]
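The client side of the resume flow above can be sketched as follows. `api` is a stand-in for the upload service; its method names (`get_status`, `upload_chunk`, `complete`) are illustrative, not a real SDK.

```python
def resumable_upload(api, session_id, chunks):
    """Upload only the chunks the server has not yet received.
    `chunks` is the ordered list of chunk payloads, numbered from 1."""
    status = api.get_status(session_id)            # step 3: ask what's done
    done = set(status["uploaded_chunks"])
    for index, data in enumerate(chunks, start=1):
        if index in done:
            continue                               # already stored: skip
        api.upload_chunk(session_id, index, data)  # steps 2/4: send the rest
    return api.complete(session_id)                # step 5: finalize
```

Because each chunk is idempotent (keyed by session and index), retrying a chunk that actually succeeded before the disconnect is harmless.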
Don't saturate user's network. Allow configurable limits.
Settings:
- Upload bandwidth: Auto / 1 Mbps / 5 Mbps
- Download bandwidth: Auto / 2 Mbps / 10 Mbps
- Pause sync on metered connection

Auto mode:
- Detect available bandwidth
- Use 80% for sync, leave 20% for the user
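A common way to enforce such limits is a token bucket: the bucket refills at the configured rate, and each upload must "spend" its byte count before sending. This is an illustrative sketch, not any particular client's implementation.

```python
import time

class TokenBucket:
    def __init__(self, rate_bytes_per_sec, capacity=None):
        self.rate = rate_bytes_per_sec
        self.capacity = capacity or rate_bytes_per_sec  # default: 1 s of burst
        self.tokens = self.capacity
        self.last = time.monotonic()

    def consume(self, nbytes):
        """Block until nbytes of budget is available, then spend it."""
        while True:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)
```

A sync client would call `bucket.consume(len(chunk))` before each chunk upload; "Auto" mode maps to periodically re-measuring available bandwidth and resetting `rate` to 80% of it.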
Users can share files/folders with specific people or via public links, with different permission levels.
Share with People
Invite by email, they get it in their Dropbox
Share via Link
Anyone with link can view/download
Password Protected
Link requires password to access
Permission Levels:
──────────────────
VIEWER: Read-only, can download
EDITOR: Can modify files, add new files
OWNER: Full control, can share, delete, transfer
Folder Sharing Inheritance:
───────────────────────────
/Shared_Folder (shared with Bob as EDITOR)
├── file1.txt ← Bob can edit
├── subfolder/ ← Bob can edit
│ └── file2.txt ← Bob can edit
└── file3.txt ← Bob can edit
Permission Check Flow:
──────────────────────
def can_access(user_id, file_id, required_permission):
    # 1. Check if user owns the file
    file = get_file(file_id)
    if file.user_id == user_id:
        return True  # Owner has all permissions

    # 2. Check direct share
    share = get_share(file_id, user_id)
    if share and share.permission >= required_permission:
        return True

    # 3. Check parent folder shares (inheritance)
    parent = file.parent
    while parent:
        share = get_share(parent.id, user_id)
        if share and share.permission >= required_permission:
            return True
        parent = parent.parent

    return False

def create_share_link(file_id, options):
token = generate_secure_token(32) # Random 256-bit
share = Share(
entry_id=file_id,
link_token=token,
permission=options.permission, # view/edit
        password_hash=hash_password(options.password) if options.password else None,  # hash_password = a real KDF (bcrypt/argon2), not Python's built-in hash()
expires_at=options.expiry,
download_limit=options.max_downloads
)
save(share)
return f"https://dropbox.com/s/{token}"
# Accessing shared link
def access_share_link(token, password=None):
    share = get_share_by_token(token)
    if share.expires_at and share.expires_at < now():
        raise LinkExpiredError()
    if share.password_hash:
        # hash_password stands in for a real password KDF (bcrypt/argon2)
        if not password or hash_password(password) != share.password_hash:
            raise PasswordRequiredError()
    return get_file(share.entry_id)

When a folder is shared with someone, it appears in their file tree.
Keep previous versions for recovery. Paid plans keep 180+ days.
File: report.docx
──────────────────
v5 (current) - Jan 20, 10:30 AM - Alice
v4           - Jan 20, 09:15 AM - Bob
v3           - Jan 19, 04:00 PM - Alice
v2           - Jan 18, 11:00 AM - Alice
v1           - Jan 17, 02:30 PM - Alice (created)

Each version stores:
- content_hash (points to chunks)
- size_bytes
- modified_by
- created_at

Restoring a version:
1. Create new version (v6)
2. Copy chunk references from v3 to v6
3. Update current pointer
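The restore steps can be sketched in a few lines: restoring never rewrites history, it appends a new version whose chunk reference is copied from the old one. The in-memory dict stands in for the file_versions table.

```python
versions = {}  # file_id -> list of {"version": int, "content_hash": str}

def save_version(file_id, content_hash):
    """Append a new version; the latest entry is the current version."""
    history = versions.setdefault(file_id, [])
    history.append({"version": len(history) + 1,
                    "content_hash": content_hash})
    return history[-1]

def restore_version(file_id, version_number):
    """Restore an old version by creating a NEW version with its content."""
    history = versions[file_id]
    old = next(v for v in history if v["version"] == version_number)
    return save_version(file_id, old["content_hash"])
```

Because versions only reference chunks by hash, restoring is a metadata-only operation: no chunk data is copied or re-uploaded.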
Users mark files for offline access. Sync when connected.
Offline Flow:
─────────────
1. User marks folder "Available Offline"
2. Client downloads all files in folder
3. User goes offline, edits files
4. Client queues changes locally

On Reconnect:
─────────────
1. Upload queued changes
2. Fetch remote changes (may conflict)
3. Resolve conflicts (conflicted copies)
4. Sync complete

Local Database (SQLite):
- Tracks offline-enabled files
- Stores pending changes queue
- Caches file metadata
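The pending-changes queue can be sketched with the stdlib sqlite3 module, since the text suggests SQLite for the client's local database. The schema and column names are illustrative assumptions.

```python
import sqlite3

def open_queue(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS pending_changes (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        file_path TEXT NOT NULL,
        change_type TEXT NOT NULL
            CHECK (change_type IN ('create', 'update', 'delete')),
        queued_at TEXT DEFAULT CURRENT_TIMESTAMP)""")
    return db

def queue_change(db, file_path, change_type):
    """Called while offline: durably record the local edit."""
    db.execute(
        "INSERT INTO pending_changes (file_path, change_type) VALUES (?, ?)",
        (file_path, change_type))
    db.commit()

def drain_queue(db):
    """On reconnect: return queued changes oldest-first, clearing the queue."""
    rows = db.execute(
        "SELECT id, file_path, change_type FROM pending_changes "
        "ORDER BY id").fetchall()
    for row_id, _, _ in rows:
        db.execute("DELETE FROM pending_changes WHERE id = ?", (row_id,))
    db.commit()
    return [(path, ctype) for _, path, ctype in rows]
```

A real client would delete each row only after the corresponding upload succeeds, so a crash mid-sync leaves the remaining changes queued.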
Soft delete with recovery period:
is_deleted = true

Permanent deletion: after the recovery window (e.g. 30 days), decrement the chunk ref_counts and let garbage collection reclaim any blocks no longer referenced by any file.