
Design Web Crawler System

Build a distributed crawler that discovers and indexes billions of web pages

Master URL frontier design, politeness protocols, content deduplication, and distributed coordination for search engine indexing


What You Will Learn

Problem Statement & Requirements
Scale & Traffic Estimation
High-Level Architecture
URL Frontier Design
Politeness & robots.txt
URL Normalization
Content Deduplication
Spider Trap Detection
DNS Resolution Caching
Distributed Coordination
Storage & Database Design
Monitoring & Fault Tolerance

1. Understanding the Problem

A web crawler (also called a spider or bot) is a program that systematically browses the internet to discover and download web pages. Search engines like Google use crawlers to build their search index, which contains information about billions of web pages.

What Makes Web Crawling Challenging?

The internet is massive, constantly changing, and full of traps. A good crawler must be fast, polite, and intelligent enough to handle edge cases.

Massive Scale

Billions of pages, growing every day

Constant Change

Pages update, move, or disappear

Spider Traps

Infinite loops and malicious sites

Functional Requirements

  • Discover new URLs by extracting links from crawled pages
  • Download and store web page content (HTML, text, metadata)
  • Re-crawl pages periodically to detect updates
  • Respect robots.txt rules and crawl-delay directives
  • Handle various content types (HTML, PDF, images)
  • Avoid duplicate content and spider traps

Non-Functional Requirements

  • Throughput: Crawl billions of pages per month
  • Politeness: Max 1 request/second per domain
  • Freshness: Important pages re-crawled within hours
  • Scalability: Horizontally scalable crawler fleet
  • Robustness: Handle failures, malformed HTML
  • Extensibility: Easy to add new content handlers

Key Insight: A web crawler is essentially a graph traversal problem where web pages are nodes and hyperlinks are edges. However, this graph has billions of nodes, changes constantly, and we must traverse it politely without overwhelming any website.

2. Scale & Traffic Estimation

Understanding the scale helps us design appropriate data structures and choose the right technologies.

Internet Scale (Approximate)

  • Total Web Pages: 50+ billion indexed pages
  • Unique Domains: 350+ million registered domains
  • New Pages Daily: Millions of new pages created
  • Average Page Size: 2-3 MB (including resources)
  • HTML Content: ~100 KB average per page

Design Target: Crawl 1 Billion Pages/Month

// Pages Per Second

1B pages / 30 days / 86,400 sec ≈ ~400 pages/second

// Storage for HTML Content

1B pages × 100 KB = ~100 TB/month

// URLs to Track (discovered but not yet crawled)

~10 URLs per page × 1B = 10 billion URLs in frontier

// Network Bandwidth

400 pages/sec × 100 KB = ~40 MB/sec = 320 Mbps
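
These estimates can be sanity-checked with a few lines of Python (a quick sketch; the constants mirror the assumptions above, and rounding accounts for the small gaps to the headline numbers):

# Back-of-envelope check of the numbers above
PAGES_PER_MONTH = 1_000_000_000
AVG_HTML_KB = 100
URLS_PER_PAGE = 10

seconds_per_month = 30 * 86_400
pages_per_second = PAGES_PER_MONTH / seconds_per_month         # ≈ 386, rounded to ~400 in the text
storage_tb_per_month = PAGES_PER_MONTH * AVG_HTML_KB / 1e9     # 100 TB (1 TB = 1e9 KB)
frontier_urls = PAGES_PER_MONTH * URLS_PER_PAGE                # 10 billion URLs
bandwidth_mbps = pages_per_second * AVG_HTML_KB * 8 / 1000     # ≈ 310 Mbps (~320 at 400 pages/s)

print(round(pages_per_second), round(storage_tb_per_month), frontier_urls, round(bandwidth_mbps))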

Crawl Frequency Considerations

  • News sites: Crawl every few minutes
  • E-commerce: Crawl daily for price changes
  • Static pages: Crawl weekly or monthly
  • Low-priority: Crawl once, rarely revisit
  • Priority based on: Page rank, update frequency, importance

Infrastructure Requirements

  • Crawler Workers: 50-100 machines
  • URL Frontier: Distributed queue (Redis/Kafka)
  • Content Storage: Object storage (S3/GCS)
  • Metadata DB: Distributed database (Cassandra)
  • DNS Cache: Local DNS resolver

Politeness Constraint: With 350M domains and 1 req/sec/domain limit, we could theoretically make 350M requests/second. But with 1B pages/month target, we only need 400/sec. The real bottleneck is prioritization - deciding which 400 pages to crawl each second.

3. High-Level Architecture

The web crawler consists of several interconnected components working in a continuous loop: fetch, parse, extract, and queue new URLs.

System Architecture Overview


┌─────────────────────────────────────────────────────────────────────────────┐
│                           WEB CRAWLER ARCHITECTURE                           │
└─────────────────────────────────────────────────────────────────────────────┘

     ┌──────────────┐
     │  Seed URLs   │ (Starting points: popular sites, sitemaps)
     └──────┬───────┘
            │
            ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                              URL FRONTIER                                    │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐             │
│  │  Priority Queue │  │  Domain Queues  │  │  Politeness     │             │
│  │  (importance)   │  │  (per-host)     │  │  Scheduler      │             │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘             │
└─────────────────────────────────────────────────────────────────────────────┘
            │
            │ Next URL to crawl
            ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                            CRAWLER WORKERS                                   │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
│  │  Worker 1   │  │  Worker 2   │  │  Worker 3   │  │  Worker N   │        │
│  │  DNS→Fetch  │  │  DNS→Fetch  │  │  DNS→Fetch  │  │  DNS→Fetch  │        │
│  │  →Parse     │  │  →Parse     │  │  →Parse     │  │  →Parse     │        │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘        │
└─────────────────────────────────────────────────────────────────────────────┘
            │
            │ Extracted data
            ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                           PROCESSING PIPELINE                                │
│                                                                              │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐ │
│  │    Link      │   │   Content    │   │    URL       │   │   Duplicate  │ │
│  │  Extractor   │──▶│   Parser     │──▶│ Normalizer   │──▶│   Detector   │ │
│  └──────────────┘   └──────────────┘   └──────────────┘   └──────────────┘ │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
            │                                              │
            │ New URLs                                     │ Page Content
            ▼                                              ▼
     ┌──────────────┐                            ┌──────────────────┐
     │ URL Frontier │                            │  Content Store   │
     │   (loop)     │                            │  (for indexing)  │
     └──────────────┘                            └──────────────────┘

Crawl Loop Steps

  1. Pick URL from frontier (respecting politeness)
  2. Check robots.txt rules for the domain
  3. Resolve DNS (use cache if possible)
  4. Fetch page content via HTTP/HTTPS
  5. Parse HTML and extract text content
  6. Extract all hyperlinks from the page
  7. Normalize and deduplicate URLs
  8. Add new URLs to frontier
  9. Store page content for indexing
  10. Update crawl metadata (last crawl time, etc.)
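
A condensed sketch of this loop, assuming frontier, robots_cache, dns_cache, fetcher, and store objects for the components described below, plus parse_html, normalize_url, and seen_before helpers covered in later sections:

import time
from urllib.parse import urlparse

def crawl_loop(frontier, robots_cache, dns_cache, fetcher, store):
    while True:
        url = frontier.next_url()                       # 1. frontier enforces politeness
        if url is None:
            time.sleep(0.1)
            continue
        if not robots_cache.allowed(url):               # 2. robots.txt check
            continue
        ip = dns_cache.resolve(urlparse(url).netloc)    # 3. cached DNS lookup
        page = fetcher.fetch(url, ip)                   # 4. HTTP(S) GET with timeouts/retries
        if page is None:
            continue
        text, links = parse_html(page.body)             # 5-6. parse HTML, extract links
        for link in links:
            canonical = normalize_url(link)             # 7. normalize + dedup
            if not seen_before(canonical):
                frontier.add(canonical)                 # 8. enqueue new URLs
        store.save_content(url, page, text)             # 9. store content for indexing
        store.update_metadata(url, last_crawled=time.time())  # 10. crawl metadata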

Key Components

  • URL Frontier: Queue of URLs to be crawled, with priority and politeness controls
  • DNS Resolver: Cached DNS lookups to reduce latency
  • Fetcher: HTTP client with timeouts, retries, and robots.txt compliance
  • Parser: HTML parser that extracts links and content
  • Deduplicator: Detects duplicate URLs and content
  • Storage: Stores crawled content and metadata

Design Principle: Each component should be independently scalable. If fetching is the bottleneck, add more workers. If URL frontier is slow, distribute it across more nodes. This allows efficient resource utilization.

4. URL Frontier Design

The URL Frontier is the heart of a web crawler. It decides which URL to crawl next while ensuring politeness (not overwhelming any single website) and prioritization (crawling important pages first).

Two-Level Queue Architecture

The frontier uses a two-level design: Front Queues for priority and Back Queues for politeness.


┌─────────────────────────────────────────────────────────────────────────────┐
│                           URL FRONTIER                                       │
└─────────────────────────────────────────────────────────────────────────────┘

                    FRONT QUEUES (Priority-based)
         ┌─────────────────────────────────────────────┐
         │                                             │
         │  ┌─────────┐  ┌─────────┐  ┌─────────┐     │
         │  │ High    │  │ Medium  │  │  Low    │     │
         │  │Priority │  │Priority │  │Priority │     │
         │  │ Queue   │  │ Queue   │  │ Queue   │     │
         │  └────┬────┘  └────┬────┘  └────┬────┘     │
         │       │            │            │           │
         └───────┼────────────┼────────────┼───────────┘
                 │            │            │
                 └────────────┼────────────┘
                              │
                    ┌─────────▼─────────┐
                    │   Front Queue     │
                    │    Selector       │
                    │ (weighted random) │
                    └─────────┬─────────┘
                              │
                    BACK QUEUES (Politeness - per domain)
         ┌────────────────────┼───────────────────────┐
         │                    │                        │
         │  ┌─────────┐  ┌────▼────┐  ┌─────────┐     │
         │  │ Queue   │  │ Router  │  │ Queue   │     │
         │  │ google  │◀─│(by host)│─▶│ amazon  │     │
         │  │  .com   │  └─────────┘  │  .com   │     │
         │  └────┬────┘               └────┬────┘     │
         │       │                         │           │
         │  ┌────┴────┐               ┌────┴────┐     │
         │  │ Queue   │               │ Queue   │     │
         │  │ github  │               │ wikipedia│    │
         │  │  .com   │               │  .org   │     │
         │  └─────────┘               └─────────┘     │
         └────────────────────────────────────────────┘
                              │
                    ┌─────────▼─────────┐
                    │   Back Queue      │
                    │    Selector       │
                    │ (respect delay)   │
                    └─────────┬─────────┘
                              │
                              ▼
                    URL ready for crawling

Front Queues (Prioritization)

Assign URLs to priority queues based on importance signals:

  • PageRank: Higher ranked pages crawled more often
  • Update frequency: News sites get higher priority
  • Traffic: Popular pages are more important
  • Freshness: Not crawled recently? Higher priority
  • Depth: Pages closer to root often more important
// Priority calculation (all signals normalized to 0..1)
priority = 0.4 * pageRank
         + 0.3 * updateFrequency
         + 0.2 * timeSinceLastCrawl   // staler pages score higher
         + 0.1 * (1 / depth)

Back Queues (Politeness)

Each domain has its own queue to enforce rate limits:

  • Per-domain queue: All URLs for a domain in one queue
  • Crawl delay: Minimum time between requests to same domain
  • Timestamp tracking: Record when each domain was last accessed
  • Queue selection: Pick from domains ready to be crawled
// Politeness check
canCrawl(domain) {
  lastCrawl = getLastCrawlTime(domain)
  delay = getCrawlDelay(domain) // from robots.txt
  return now() - lastCrawl >= delay
}

Implementation Options

Redis + Sorted Sets

Priority queues using ZADD/ZPOPMIN. Fast, but limited by memory.

Apache Kafka

Partitioned by domain hash. Good for distributed crawlers.

RocksDB + Custom

Disk-based for huge frontier. Used by Apache Nutch.

Trade-off: More back queues = better parallelism across domains, but more memory overhead. A typical crawler might have 10,000-100,000 back queues mapping to active domains.
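
To make the two-level idea concrete, here is a minimal single-process sketch: the front queues collapse into one priority heap, each domain gets a back queue, and a per-domain timestamp enforces the crawl delay. A production frontier would distribute this over Redis or Kafka as noted above.

import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class SimpleFrontier:
    def __init__(self, default_delay=1.0):
        self.front = []                      # (-priority, seq, url): max-heap via negation
        self.back = defaultdict(deque)       # domain -> URLs waiting for that domain
        self.next_allowed = {}               # domain -> earliest allowed fetch time
        self.delay = default_delay
        self.seq = 0

    def add(self, url, priority=0.0):
        self.seq += 1
        heapq.heappush(self.front, (-priority, self.seq, url))

    def _refill_back_queues(self):
        # Move URLs from the priority heap into their per-domain queues
        while self.front:
            _, _, url = heapq.heappop(self.front)
            self.back[urlparse(url).netloc].append(url)

    def next_url(self):
        self._refill_back_queues()
        now = time.time()
        for domain, queue in self.back.items():
            if queue and now >= self.next_allowed.get(domain, 0):
                self.next_allowed[domain] = now + self.delay   # politeness window
                return queue.popleft()
        return None   # nothing is ready yet; caller should wait

frontier = SimpleFrontier()
frontier.add("https://example.com/a", priority=0.9)
frontier.add("https://example.com/b", priority=0.5)
print(frontier.next_url())   # /a first (higher priority)
print(frontier.next_url())   # None until the 1-second delay for example.com elapses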

5. Politeness & robots.txt

Being a good citizen of the web is crucial. Aggressive crawling can overwhelm servers, get your crawler blocked, or even have legal consequences. The robots.txt file tells crawlers what they can and cannot access.

Understanding robots.txt

Every website can have a /robots.txt file at its root that specifies crawling rules.

# Example robots.txt for example.com

# Rules for all crawlers
User-agent: *
Disallow: /private/
Disallow: /admin/
Disallow: /api/
Allow: /api/public/
Crawl-delay: 2          # Wait 2 seconds between requests

# Special rules for Googlebot
User-agent: Googlebot
Allow: /
Disallow: /internal/
Crawl-delay: 1

# Block bad bots entirely
User-agent: BadBot
Disallow: /

# Sitemap location (helps discover pages)
Sitemap: https://example.com/sitemap.xml

Politeness Rules

  1. Fetch robots.txt first before crawling any page on a domain
  2. Cache robots.txt (typically for 24 hours)
  3. Respect the Crawl-delay directive (default to 1 second if not specified)
  4. Check Disallow rules before fetching any URL
  5. Identify your crawler with a descriptive User-Agent
  6. Handle 429 (Too Many Requests) by backing off

robots.txt Parser Logic

class RobotsParser:
    def __init__(self, robots_txt):
        self.rules = parse(robots_txt)   # parse() groups rules by user-agent

    def can_fetch(self, user_agent, path):
        # 'path' is the URL path, e.g. '/private/page' (rules match on paths)
        rules = self.get_rules(user_agent)

        # Check Allow rules first (more specific exceptions)
        # (the full spec uses longest-match precedence; Allow wins ties)
        for allow in rules.allow:
            if path.startswith(allow):
                return True

        # Then check Disallow rules
        for disallow in rules.disallow:
            if path.startswith(disallow):
                return False

        return True  # Default: allowed

    def get_crawl_delay(self, user_agent):
        return (self.rules.get(user_agent, {})
                    .get('crawl-delay', 1.0))
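
For reference, Python's standard library ships a robots.txt parser (urllib.robotparser), so a Python crawler need not hand-roll the matching logic. A minimal usage sketch, with example.com as a placeholder (rp.read() performs a network fetch):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()                                   # fetches and parses the file

print(rp.can_fetch("MyCrawler/1.0", "https://example.com/private/page"))  # False if disallowed
print(rp.crawl_delay("MyCrawler/1.0"))      # Crawl-delay value, or None if not set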

Adaptive Politeness

Beyond robots.txt, a smart crawler adapts based on server response:

Response Time Based

If server is slow (> 2s response), increase delay to reduce load

Error Rate Based

If seeing 5xx errors, back off exponentially

Time of Day

Crawl less during peak hours for the site's region
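
One way these signals might be combined into a single per-domain delay; the thresholds and multipliers below are illustrative, not standard values:

def adaptive_delay(base_delay, last_response_ms, recent_5xx, got_429):
    """Increase the per-domain delay when the server shows signs of stress."""
    delay = base_delay
    if last_response_ms > 2000:        # slow responses -> ease off
        delay *= 2
    if got_429:                        # explicit rate-limit signal
        delay *= 4
    delay *= 2 ** min(recent_5xx, 5)   # exponential backoff on server errors (capped)
    return delay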

Warning: Ignoring robots.txt can get your IP blocked, your crawler banned, or in extreme cases, lead to legal action. Major search engines like Google respect robots.txt - your crawler should too.

6. URL Normalization

The same web page can be referenced by many different URLs. Normalization converts URLs to a canonical form to avoid crawling the same page multiple times.

URLs That Point to the Same Page

# All these URLs might be the same page:

http://Example.com/page          # Different case
http://example.com/page/         # Trailing slash
http://example.com/page?         # Empty query string
http://example.com/page?a=1&b=2  # Query param order
http://example.com/page?b=2&a=1  # Same params, different order
http://example.com/page#section  # Fragment identifier
http://example.com:80/page       # Default port
https://www.example.com/page     # www prefix
http://example.com/./page        # Dot segment
http://example.com/foo/../page   # Parent directory

# After normalization, all become:
https://example.com/page

Normalization Steps

Standard Normalization

  1. Convert scheme and host to lowercase
  2. Remove default port (80 for HTTP, 443 for HTTPS)
  3. Remove fragment identifier (#section)
  4. Decode unreserved percent-encoded characters
  5. Remove dot segments (. and ..)
  6. Add trailing slash to directory paths
  7. Sort query parameters alphabetically

Site-Specific Normalization

  1. Remove tracking parameters (utm_*, fbclid)
  2. Remove session IDs from URLs
  3. Handle www vs non-www (follow redirects)
  4. Handle HTTP to HTTPS upgrades
  5. Remove duplicate slashes
  6. Convert to lowercase paths (site-dependent)

URL Normalizer Implementation

from urllib.parse import urlparse, parse_qs, urlencode
from posixpath import normpath

def normalize_url(url):
    # Parse URL components
    parsed = urlparse(url)

    # Lowercase scheme and host
    scheme = parsed.scheme.lower()
    host = parsed.netloc.lower()

    # Remove default ports
    if scheme == 'http' and host.endswith(':80'):
        host = host[:-3]
    if scheme == 'https' and host.endswith(':443'):
        host = host[:-4]

    # Remove www prefix (optional, site-dependent)
    # host = host.removeprefix('www.')

    # Normalize path: resolve '.' and '..' segments
    path = parsed.path or '/'
    path = normpath(path)

    # Sort query parameters and drop tracking params
    query_params = parse_qs(parsed.query)
    tracking = {'utm_source', 'utm_medium', 'utm_campaign', 'fbclid'}
    query_params = {k: v for k, v in query_params.items()
                    if k not in tracking}
    sorted_query = urlencode(sorted(query_params.items()), doseq=True)

    # Rebuild URL (fragment dropped)
    return (f"{scheme}://{host}{path}"
            + (f"?{sorted_query}" if sorted_query else ""))

Important: URL normalization is not perfect. Some websites serve different content based on query parameters that look like tracking params. When in doubt, follow redirects to find the canonical URL.

7. Content Deduplication

Even after URL normalization, different URLs can serve identical or near-identical content. Detecting duplicates saves storage, bandwidth, and prevents index bloat.

Content Fingerprinting Techniques

Exact Duplicate: MD5/SHA Hash

Hash the entire page content. If hashes match, pages are identical.

hash1 = sha256(page1_content)
hash2 = sha256(page2_content)
is_duplicate = (hash1 == hash2)

Pros: Simple, fast. Cons: Misses near-duplicates.

Near Duplicate: SimHash

Creates a fingerprint where similar documents have similar hashes.

# SimHash: Locality Sensitive Hash
hash1 = simhash(page1_tokens)
hash2 = simhash(page2_tokens)
distance = hamming_distance(hash1, hash2)
is_near_duplicate = (distance <= 3)

Pros: Detects ~95% similar pages. Cons: More complex.

How SimHash Works


SimHash Algorithm (simplified):

1. Extract features (words/shingles) from document
   Document: "the quick brown fox"
   Features: ["the", "quick", "brown", "fox"]

2. Hash each feature to a 64-bit value
   hash("the")   = 0100110...
   hash("quick") = 1011001...
   hash("brown") = 0110100...
   hash("fox")   = 1001011...

3. For each bit position, sum up:
   - Add +1 if the bit is 1
   - Add -1 if the bit is 0

   Position 0 (bits 0,1,0,1): -1 + 1 - 1 + 1 = 0
   Position 1 (bits 1,0,1,0): +1 - 1 + 1 - 1 = 0
   ...

4. Final hash: 1 if sum > 0, else 0
   SimHash = 01101...

5. Compare SimHashes using Hamming distance
   If distance <= threshold, documents are similar

   Hamming distance = number of bits that differ
   Example: 3 bits different out of 64 → 95% similar
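
A compact 64-bit SimHash following the steps above; tokenization here is naive whitespace splitting, whereas real crawlers typically use shingles and term weights:

import hashlib

def simhash(tokens, bits=64):
    vector = [0] * bits
    for token in tokens:
        # Stable 64-bit hash of the feature
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    # Final fingerprint: bit i is 1 if the summed vector is positive
    fingerprint = 0
    for i in range(bits):
        if vector[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

doc1 = "the quick brown fox jumps over the lazy dog".split()
doc2 = "the quick brown fox jumped over the lazy dog".split()
print(hamming_distance(simhash(doc1), simhash(doc2)))  # distance is small when most tokens are shared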

URL-Based Deduplication

Before downloading, check if URL was seen:

# Using Bloom Filter for URL dedup
bloom_filter = BloomFilter(
    expected_items=10_000_000_000,   # ~10 billion URLs
    false_positive_rate=0.01
)

def should_crawl(url):
    normalized = normalize_url(url)
    if bloom_filter.contains(normalized):
        return False  # Probably seen
    bloom_filter.add(normalized)
    return True

Content-Based Deduplication

After downloading, check if content was seen:

# Store content fingerprints
fingerprint_store = {}  # SimHash → URL

def is_duplicate_content(url, content):
    simhash = compute_simhash(content)

    # Check for near-duplicates
    for stored_hash, stored_url in fingerprint_store.items():
        if hamming_distance(simhash, stored_hash) <= 3:
            return True, stored_url

    fingerprint_store[simhash] = url
    return False, None

Efficiency Tip: Use a Bloom Filter for URL deduplication (O(1) lookup, small memory). Use SimHash with LSH (Locality Sensitive Hashing) for content deduplication to avoid O(n) comparisons.

8. Spider Trap Detection

Spider traps are URLs that cause crawlers to get stuck in an infinite loop, wasting resources. They can be accidental (poor website design) or intentional (anti-crawling measures).

Common Spider Trap Types

Infinite Calendar

/calendar/2024/01/01 → /calendar/2024/01/02 → ... forever

Session ID in URL

/page?sid=abc123 creates infinite unique URLs

Soft 404 Pages

Invalid URLs return 200 OK with links to more invalid URLs

Symbolic Links

/a/b/a/b/a/b/... circular directory structure

Dynamic Pagination

/search?page=1, page=2, ... page=999999

Query Param Explosion

Every combination of filters creates new URLs

Detection Strategies


Spider Trap Detection Rules:

1. URL LENGTH LIMIT
   ─────────────────
   if len(url) > 2000:
       skip_url("URL too long - possible trap")

2. URL DEPTH LIMIT
   ─────────────────
   path_depth = url.path.count('/')
   if path_depth > 15:
       skip_url("Path too deep - possible trap")

3. REPEATED PATH SEGMENTS
   ─────────────────────────
   segments = url.path.split('/')
   if has_repeating_pattern(segments):
       skip_url("Repeating path pattern detected")

   Example: /a/b/c/a/b/c/a/b/c → Pattern [a,b,c] repeats

4. DOMAIN PAGE LIMIT
   ─────────────────────
   pages_from_domain = get_crawled_count(domain)
   if pages_from_domain > MAX_PAGES_PER_DOMAIN:
       skip_url("Domain limit reached")

5. SIMILAR URL PATTERN
   ─────────────────────
   # Track URL patterns (template with placeholders)
   pattern = extract_pattern(url)  # /user/*/posts/*
   if pattern_count[pattern] > threshold:
       deprioritize_pattern(pattern)

6. CONTENT SIMILARITY
   ─────────────────────
   if last_N_pages_from_domain_are_similar():
       reduce_crawl_priority(domain)
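
Rule 3 above relies on a has_repeating_pattern check. One simple sketch, which looks for a short segment sequence repeated back-to-back at the end of the path (the period and repeat thresholds are illustrative):

def has_repeating_pattern(segments, max_period=4, min_repeats=3):
    """True if the path ends in the same short segment sequence repeated several times."""
    segments = [s for s in segments if s]            # drop empty segments from split('/')
    for period in range(1, max_period + 1):
        window = segments[-period:]
        repeats = 0
        i = len(segments)
        while i >= period and segments[i - period:i] == window:
            repeats += 1
            i -= period
        if len(window) == period and repeats >= min_repeats:
            return True
    return False

print(has_repeating_pattern("a/b/c/a/b/c/a/b/c".split("/")))      # True
print(has_repeating_pattern("user/settings/profile".split("/")))  # False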

URL Pattern Detection

import re
from collections import Counter
from urllib.parse import urlparse

UUID_RE = re.compile(
    r'^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$', re.I)

def extract_pattern(url):
    """Convert URL to a pattern: /user/123/posts/456 → /user/*/posts/*"""
    parts = urlparse(url).path.split('/')
    pattern_parts = []

    for part in parts:
        # Replace numeric IDs and UUIDs with a wildcard
        if part.isdigit() or UUID_RE.match(part):
            pattern_parts.append('*')
        else:
            pattern_parts.append(part)

    return '/'.join(pattern_parts)

# Track pattern frequency
pattern_counts = Counter()
pattern_counts[extract_pattern(url)] += 1

# If the same pattern is seen 1000+ times,
# it's likely templated content

Crawl Budget Management

  • Per-domain limit: Max 50,000 pages per domain
  • Per-pattern limit: Max 100 URLs per pattern
  • Depth limit: Max 15 clicks from seed
  • Time budget: Don't spend more than X hours on one domain
  • Diminishing returns: Deprioritize domain after many similar pages

Real Example: A calendar widget that generates URLs like /events/2024/01/01 can create 365 URLs per year, going back centuries. Without detection, a crawler could waste millions of requests on a single widget.

9. DNS Resolution & Caching

Every HTTP request requires a DNS lookup to convert the domain name to an IP address. At 400 requests/second, DNS can become a significant bottleneck without proper caching.

DNS Lookup Flow


DNS Resolution in Web Crawler:

┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│   Crawler   │────▶│  Local DNS   │────▶│  ISP/Public  │
│   Worker    │     │    Cache     │     │  DNS Server  │
└─────────────┘     └──────────────┘     └──────────────┘
                           │                     │
                    Cache Hit?              Cache Miss?
                           │                     │
                           ▼                     ▼
                    Return IP            Query Root DNS
                    (< 1ms)                   │
                                              ▼
                                       Query TLD DNS (.com)
                                              │
                                              ▼
                                       Query Authoritative DNS
                                              │
                                              ▼
                                       Return IP (50-200ms)

Problem: Without caching, 400 req/sec × 100 ms = 40 seconds of cumulative DNS wait
         per second of crawling (≈40 lookups permanently in flight)
Solution: Local DNS cache with a high TTL

DNS Caching Strategy

  1. In-memory cache: Per crawler worker, fastest access
  2. Shared cache: Redis/Memcached for all workers
  3. Local DNS server: Run your own caching resolver
  4. Extended TTL: Cache longer than DNS TTL suggests
  5. Prefetch: Resolve DNS for URLs in frontier ahead of time

DNS Cache Implementation

import time

class DNSCache:
    def __init__(self):
        self.cache = {}  # domain → (ip, expiry)
        self.min_ttl = 3600  # 1 hour minimum

    def resolve(self, domain):
        if domain in self.cache:
            ip, expiry = self.cache[domain]
            if time.time() < expiry:
                return ip  # Cache hit

        # Cache miss - do the actual lookup
        # (dns_lookup / get_dns_ttl are resolver helpers, e.g. via dnspython)
        ip = dns_lookup(domain)
        ttl = max(get_dns_ttl(domain), self.min_ttl)
        self.cache[domain] = (ip, time.time() + ttl)
        return ip

    def prefetch(self, domains):
        """Resolve DNS for a batch of domains ahead of time"""
        for domain in domains:
            if domain not in self.cache:
                self.resolve(domain)

Performance Impact: With DNS caching, cache hit rate is typically 90%+. This reduces average DNS lookup time from ~100ms to ~10ms, saving significant crawl time.

10. Distributed Coordination

A single machine cannot crawl billions of pages. We need multiple crawler workers coordinating to avoid duplicate work while maintaining politeness.

Distributed Crawler Architecture


┌─────────────────────────────────────────────────────────────────────────────┐
│                      DISTRIBUTED WEB CRAWLER                                 │
└─────────────────────────────────────────────────────────────────────────────┘

                    ┌─────────────────────────────┐
                    │      Coordinator Service    │
                    │  - Worker health checks     │
                    │  - Load balancing           │
                    │  - Crawl policy management  │
                    └──────────────┬──────────────┘
                                   │
         ┌─────────────────────────┼─────────────────────────┐
         │                         │                         │
         ▼                         ▼                         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Worker Pod 1  │     │   Worker Pod 2  │     │   Worker Pod N  │
│  ┌───────────┐  │     │  ┌───────────┐  │     │  ┌───────────┐  │
│  │ Crawler 1 │  │     │  │ Crawler 1 │  │     │  │ Crawler 1 │  │
│  │ Crawler 2 │  │     │  │ Crawler 2 │  │     │  │ Crawler 2 │  │
│  │ Crawler 3 │  │     │  │ Crawler 3 │  │     │  │ Crawler 3 │  │
│  └───────────┘  │     │  └───────────┘  │     │  └───────────┘  │
└────────┬────────┘     └────────┬────────┘     └────────┬────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                    ┌────────────▼────────────┐
                    │    Shared Components    │
                    │                         │
                    │  ┌──────────────────┐   │
                    │  │   URL Frontier   │   │
                    │  │  (Kafka/Redis)   │   │
                    │  └──────────────────┘   │
                    │                         │
                    │  ┌──────────────────┐   │
                    │  │  Seen URL Store  │   │
                    │  │ (Bloom Filter)   │   │
                    │  └──────────────────┘   │
                    │                         │
                    │  ┌──────────────────┐   │
                    │  │  Content Store   │   │
                    │  │  (S3/HDFS)       │   │
                    │  └──────────────────┘   │
                    └─────────────────────────┘

Domain Partitioning

Assign each domain to a specific worker to maintain politeness:

import hashlib

# Simple hash partitioning for domain assignment
# (use a stable hash - Python's built-in hash() is randomized per process)
def get_worker_for_domain(domain, num_workers):
    hash_value = int(hashlib.md5(domain.encode()).hexdigest(), 16)
    return hash_value % num_workers

# Worker 0: amazon.com, facebook.com, ...
# Worker 1: google.com, twitter.com, ...
# Worker 2: github.com, linkedin.com, ...

# Benefits:
# 1. Each domain handled by one worker
# 2. Politeness naturally enforced
# 3. Efficient connection reuse
# 4. Easy to scale workers
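
Plain modulo hashing reassigns most domains whenever the worker count changes. A small consistent-hash ring keeps that remapping minimal; a sketch, with the virtual-node count and MD5 as illustrative choices:

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, workers, vnodes=100):
        self.ring = []                       # sorted list of (hash, worker)
        for worker in workers:
            for i in range(vnodes):
                h = self._hash(f"{worker}#{i}")
                self.ring.append((h, worker))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def worker_for(self, domain):
        # First ring position clockwise from the domain's hash
        idx = bisect.bisect(self.keys, self._hash(domain)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["worker-0", "worker-1", "worker-2"])
print(ring.worker_for("example.com"))    # same domain always maps to the same worker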

Geographic Distribution

Deploy crawlers close to target websites:

  • US East: Crawl .com, US-hosted sites
  • EU West: Crawl .eu, .de, .fr sites
  • Asia Pacific: Crawl .jp, .cn, .in sites
  • Benefit: Lower latency = faster crawling
  • Challenge: Coordinate across regions

Failure Handling

Worker Failure

Reassign domains to healthy workers. Pending URLs remain in frontier.

Frontier Failure

Use replicated Kafka/Redis. URLs persisted before ACK.

Storage Failure

Retry with exponential backoff. Dead-letter queue for failed writes.

11. Storage & Database Design

A web crawler generates massive amounts of data: crawled content, URL metadata, and crawl history. Choosing the right storage for each data type is crucial.

Data Storage Schema

-- URL Metadata (Cassandra/DynamoDB - high write throughput)
CREATE TABLE url_metadata (
    url_hash        BIGINT PRIMARY KEY,  -- Hash of normalized URL
    url             TEXT,
    domain          TEXT,
    first_seen      TIMESTAMP,
    last_crawled    TIMESTAMP,
    last_modified   TIMESTAMP,           -- From HTTP header
    http_status     INT,
    content_hash    BIGINT,              -- SimHash for dedup
    content_length  INT,
    content_type    TEXT,
    crawl_count     INT,
    priority_score  FLOAT,
    INDEX (domain, last_crawled)
);

-- Crawl History (Time-series DB or Cassandra)
CREATE TABLE crawl_log (
    url_hash        BIGINT,
    crawl_time      TIMESTAMP,
    http_status     INT,
    response_time   INT,                 -- milliseconds
    content_changed BOOLEAN,
    error_message   TEXT,
    PRIMARY KEY (url_hash, crawl_time)
);

-- Domain Stats (Redis or Cassandra)
CREATE TABLE domain_stats (
    domain          TEXT PRIMARY KEY,
    pages_crawled   BIGINT,
    last_crawl      TIMESTAMP,
    avg_response_ms INT,
    error_rate      FLOAT,
    robots_txt      TEXT,
    robots_expires  TIMESTAMP,
    crawl_delay     INT
);

Storage Technology Choices

  • URL Frontier: Kafka (distributed, persistent queues)
  • Seen URLs: Bloom filter in Redis (fast lookup)
  • URL Metadata: Cassandra (high write throughput)
  • Page Content: S3/GCS/HDFS (cheap object storage)
  • Domain Stats: Redis (fast read/write)
  • Search Index: Elasticsearch (for downstream)

Content Storage Format

// Store crawled content in S3/GCS
path: s3://crawl-data/{date}/{domain_hash}/{url_hash}

// Content file structure (compressed)
{
  "url": "https://example.com/page",
  "crawl_time": "2024-01-15T10:30:00Z",
  "http_status": 200,
  "headers": {
    "content-type": "text/html",
    "last-modified": "...",
    "etag": "..."
  },
  "content": "<html>...</html>",
  "extracted_text": "...",
  "outlinks": ["url1", "url2", ...]
}
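
A minimal write path for this layout using boto3 (one option among S3/GCS/HDFS clients); the bucket name comes from the example path above, and the record is assumed to carry crawl_time, domain_hash, and url_hash fields:

import gzip
import json
import boto3   # assumes AWS credentials are configured

def store_crawl_record(record, bucket="crawl-data"):
    """Write one crawl record, gzip-compressed, under {date}/{domain_hash}/{url_hash}."""
    key = f"{record['crawl_time'][:10]}/{record['domain_hash']}/{record['url_hash']}.json.gz"
    body = gzip.compress(json.dumps(record).encode("utf-8"))
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)
    return key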

12. Monitoring & Fault Tolerance

A crawler running 24/7 needs comprehensive monitoring to detect issues early and recover from failures automatically.

Key Metrics to Monitor

Pages/Second

Crawl throughput across all workers

Avg Response Time

Time to fetch and process pages

Error Rate

4xx, 5xx, timeouts, DNS failures

Frontier Size

URLs waiting to be crawled

Worker Health

CPU, memory, network per worker

Duplicate Rate

% of URLs/content already seen
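
Each worker can expose these metrics with the Prometheus Python client; a sketch using the metric names referenced by the alert rules below:

from prometheus_client import Counter, Gauge, start_http_server

PAGES_CRAWLED = Counter("pages_crawled_total", "Pages successfully crawled")
CRAWL_REQUESTS = Counter("crawl_requests_total", "Fetch attempts")
CRAWL_ERRORS = Counter("crawl_errors_total", "Failed fetches (4xx/5xx/timeouts/DNS)")
FRONTIER_SIZE = Gauge("frontier_size", "URLs waiting in the frontier")

start_http_server(8000)   # exposes /metrics for Prometheus to scrape

# Inside the crawl loop:
# CRAWL_REQUESTS.inc()
# PAGES_CRAWLED.inc() on success, CRAWL_ERRORS.inc() on failure
# FRONTIER_SIZE.set(frontier.size())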

Alerting Rules

# Prometheus alerting rules

- alert: CrawlThroughputLow
  expr: rate(pages_crawled_total[5m]) < 300
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Crawl throughput below 300 pages/sec"

- alert: HighErrorRate
  expr: rate(crawl_errors_total[5m]) / rate(crawl_requests_total[5m]) > 0.1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 10%"

- alert: FrontierGrowing
  expr: frontier_size > 100000000
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "URL frontier exceeding 100M URLs"

- alert: WorkerDown
  expr: up{job="crawler"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Crawler worker is down"

Key Design Principles

  • URL Frontier with Two-Level Queues: Front queues for priority, back queues for politeness. This ensures important pages are crawled first while respecting rate limits.
  • Politeness is Non-Negotiable: Always respect robots.txt, implement crawl delays, and adapt to server load. Getting blocked helps no one.
  • URL Normalization + Content Dedup: Normalize URLs before adding to frontier. Use SimHash for near-duplicate detection to avoid wasting resources.
  • Spider Trap Detection: Set limits on URL depth, length, patterns, and pages per domain. Better to miss some pages than get stuck forever.
  • DNS Caching is Critical: Cache DNS aggressively. At scale, DNS lookups can become a major bottleneck without proper caching.
  • Distribute by Domain: Assign domains to specific workers using consistent hashing. This naturally enforces politeness and enables connection reuse.
  • Monitor Everything: Track throughput, error rates, frontier size, and worker health. Set up alerts to catch issues before they become critical.
