Build a distributed crawler that discovers and indexes billions of web pages
Master URL frontier design, politeness protocols, content deduplication, and distributed coordination for search engine indexing
A web crawler (also called a spider or bot) is a program that systematically browses the internet to discover and download web pages. Search engines like Google use crawlers to build their search index, which contains information about billions of web pages.
The internet is massive, constantly changing, and full of traps. A good crawler must be fast, polite, and intelligent enough to handle edge cases.
Massive Scale
Billions of pages, growing every day
Constant Change
Pages update, move, or disappear
Spider Traps
Infinite loops and malicious sites
Key Insight: A web crawler is essentially a graph traversal problem where web pages are nodes and hyperlinks are edges. However, this graph has billions of nodes, changes constantly, and we must traverse it politely without overwhelming any website.
Understanding the scale helps us design appropriate data structures and choose the right technologies.
1B pages / 30 days / 86,400 sec/day ≈ 400 pages/second
1B pages × 100 KB = ~100 TB/month
~10 URLs per page × 1B = 10 billion URLs in frontier
400 pages/sec × 100 KB = ~40 MB/sec = 320 Mbps
Politeness Constraint: With 350M domains and 1 req/sec/domain limit, we could theoretically make 350M requests/second. But with 1B pages/month target, we only need 400/sec. The real bottleneck is prioritization - deciding which 400 pages to crawl each second.
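The arithmetic above is easy to check with a few lines (numbers taken from the estimates; the bandwidth figure comes out slightly under the rounded 400 pages/sec value):

```python
# Back-of-the-envelope numbers from the estimates above
PAGES = 1_000_000_000   # 1B pages per month
DAYS = 30
AVG_PAGE_KB = 100
URLS_PER_PAGE = 10

pages_per_sec = PAGES / (DAYS * 86_400)              # ≈ 386
storage_tb = PAGES * AVG_PAGE_KB / 1e9               # KB → TB, = 100
frontier_urls = PAGES * URLS_PER_PAGE                # 10 billion
bandwidth_mbps = pages_per_sec * AVG_PAGE_KB * 8 / 1000  # ≈ 309 Mbps

print(f"{pages_per_sec:.0f} pages/sec, {storage_tb:.0f} TB/month, "
      f"{frontier_urls / 1e9:.0f}B frontier URLs, {bandwidth_mbps:.0f} Mbps")
```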
The web crawler consists of several interconnected components working in a continuous loop: fetch, parse, extract, and queue new URLs.
┌─────────────────────────────────────────────────────────────────────────────┐
│ WEB CRAWLER ARCHITECTURE │
└─────────────────────────────────────────────────────────────────────────────┘
┌──────────────┐
│ Seed URLs │ (Starting points: popular sites, sitemaps)
└──────┬───────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ URL FRONTIER │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Priority Queue │ │ Domain Queues │ │ Politeness │ │
│ │ (importance) │ │ (per-host) │ │ Scheduler │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
│
│ Next URL to crawl
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ CRAWLER WORKERS │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Worker 1 │ │ Worker 2 │ │ Worker 3 │ │ Worker N │ │
│ │ DNS→Fetch │ │ DNS→Fetch │ │ DNS→Fetch │ │ DNS→Fetch │ │
│ │ →Parse │ │ →Parse │ │ →Parse │ │ →Parse │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
│
│ Extracted data
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ PROCESSING PIPELINE │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Link │ │ Content │ │ URL │ │ Duplicate │ │
│ │ Extractor │──▶│ Parser │──▶│ Normalizer │──▶│ Detector │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
│ │
│ New URLs │ Page Content
▼ ▼
┌──────────────┐ ┌──────────────────┐
│ URL Frontier │ │ Content Store │
│ (loop) │ │ (for indexing) │
└──────────────┘ └──────────────────┘
Design Principle: Each component should be independently scalable. If fetching is the bottleneck, add more workers. If URL frontier is slow, distribute it across more nodes. This allows efficient resource utilization.
The URL Frontier is the heart of a web crawler. It decides which URL to crawl next while ensuring politeness (not overwhelming any single website) and prioritization (crawling important pages first).
The frontier uses a two-level design: Front Queues for priority and Back Queues for politeness.
┌─────────────────────────────────────────────────────────────────────────────┐
│ URL FRONTIER │
└─────────────────────────────────────────────────────────────────────────────┘
FRONT QUEUES (Priority-based)
┌─────────────────────────────────────────────┐
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ High │ │ Medium │ │ Low │ │
│ │Priority │ │Priority │ │Priority │ │
│ │ Queue │ │ Queue │ │ Queue │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
└───────┼────────────┼────────────┼───────────┘
│ │ │
└────────────┼────────────┘
│
┌─────────▼─────────┐
│ Front Queue │
│ Selector │
│ (weighted random) │
└─────────┬─────────┘
│
BACK QUEUES (Politeness - per domain)
┌────────────────────┼───────────────────────┐
│ │ │
│ ┌─────────┐ ┌────▼────┐ ┌─────────┐ │
│ │ Queue │ │ Router │ │ Queue │ │
│ │ google │◀─│(by host)│─▶│ amazon │ │
│ │ .com │ └─────────┘ │ .com │ │
│ └────┬────┘ └────┬────┘ │
│ │ │ │
│ ┌────┴────┐ ┌────┴────┐ │
│ │ Queue │ │ Queue │ │
│ │ github │ │ wikipedia│ │
│ │ .com │ │ .org │ │
│ └─────────┘ └─────────┘ │
└────────────────────────────────────────────┘
│
┌─────────▼─────────┐
│ Back Queue │
│ Selector │
│ (respect delay) │
└─────────┬─────────┘
│
▼
URL ready for crawling
Assign URLs to priority queues based on importance signals:
// Priority calculation
priority = 0.4 * pageRank
+ 0.3 * updateFrequency
+ 0.2 * staleness          // normalized time since last crawl
+ 0.1 * (1 / depth)        // shallower pages first
Each domain has its own queue to enforce rate limits:
// Politeness check
canCrawl(domain) {
lastCrawl = getLastCrawlTime(domain)
delay = getCrawlDelay(domain) // from robots.txt
return now() - lastCrawl >= delay
}
Redis + Sorted Sets
Priority queues using ZADD/ZPOPMIN. Fast, but limited by memory.
Apache Kafka
Partitioned by domain hash. Good for distributed crawlers.
RocksDB + Custom
Disk-based for huge frontier. Used by Apache Nutch.
Trade-off: More back queues = better parallelism across domains, but more memory overhead. A typical crawler might have 10,000-100,000 back queues mapping to active domains.
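A minimal in-memory sketch of the two-level design (illustrative only — a real frontier is distributed and disk-backed; the queue weights 0.6/0.3/0.1 and the 1-second default delay are assumed values):

```python
import random
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class Frontier:
    """Front queues order by priority; back queues enforce per-host politeness."""

    def __init__(self, default_delay=1.0):
        self.front = {'high': deque(), 'medium': deque(), 'low': deque()}
        self.weights = {'high': 0.6, 'medium': 0.3, 'low': 0.1}
        self.back = defaultdict(deque)          # host → queue of URLs
        self.next_allowed = defaultdict(float)  # host → earliest fetch time
        self.default_delay = default_delay

    def add(self, url, priority='medium'):
        self.front[priority].append(url)

    def _refill(self):
        # Front-queue selector: weighted random pick among non-empty queues
        nonempty = [name for name, q in self.front.items() if q]
        if nonempty:
            name = random.choices(nonempty, [self.weights[n] for n in nonempty])[0]
            url = self.front[name].popleft()
            self.back[urlparse(url).netloc].append(url)  # route by host

    def next_url(self):
        self._refill()
        now = time.time()
        # Back-queue selector: only hand out a URL if its host's delay elapsed
        for host, queue in self.back.items():
            if queue and now >= self.next_allowed[host]:
                self.next_allowed[host] = now + self.default_delay
                return queue.popleft()
        return None  # all queued hosts are currently rate-limited
```

After `add()`-ing URLs, `next_url()` yields at most one URL per host per `default_delay` seconds, regardless of how many URLs that host has queued.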
Being a good citizen of the web is crucial. Aggressive crawling can overwhelm servers, get your crawler blocked, or even have legal consequences. The robots.txt file tells crawlers what they can and cannot access.
Every website can have a /robots.txt file at its root that specifies crawling rules.
# Example robots.txt for example.com

# Rules for all crawlers
User-agent: *
Disallow: /private/
Disallow: /admin/
Disallow: /api/
Allow: /api/public/
Crawl-delay: 2        # Wait 2 seconds between requests

# Special rules for Googlebot
User-agent: Googlebot
Allow: /
Disallow: /internal/
Crawl-delay: 1

# Block bad bots entirely
User-agent: BadBot
Disallow: /

# Sitemap location (helps discover pages)
Sitemap: https://example.com/sitemap.xml
from urllib.parse import urlparse

class RobotsParser:
    def __init__(self, robots_txt):
        # parse() is assumed to yield
        # {user_agent: {'allow': [...], 'disallow': [...], 'crawl-delay': ...}}
        self.rules = parse(robots_txt)

    def can_fetch(self, user_agent, url):
        # Find matching user-agent rules (fall back to the '*' group)
        rules = self.rules.get(user_agent, self.rules.get('*', {}))
        path = urlparse(url).path
        # Check Allow rules first (a simplification — the standard
        # picks the longest matching rule, with Allow winning ties)
        for allow in rules.get('allow', []):
            if path.startswith(allow):
                return True
        # Then check Disallow rules
        for disallow in rules.get('disallow', []):
            if path.startswith(disallow):
                return False
        return True  # Default: allowed

    def get_crawl_delay(self, user_agent):
        rules = self.rules.get(user_agent, self.rules.get('*', {}))
        return rules.get('crawl-delay', 1.0)
Beyond robots.txt, a smart crawler adapts based on server response:
Response Time Based
If server is slow (> 2s response), increase delay to reduce load
Error Rate Based
If seeing 5xx errors, back off exponentially
Time of Day
Crawl less during peak hours for the site's region
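The first two policies can be sketched as a per-domain controller (the thresholds here — the 2 s slow-response cutoff, doubling on 5xx, 60 s cap, and 0.9 recovery factor — are illustrative choices, not standards):

```python
class AdaptiveDelay:
    """Per-domain crawl delay that adapts to server behavior."""

    def __init__(self, base_delay=1.0, max_delay=60.0):
        self.base = base_delay
        self.max = max_delay
        self.delay = base_delay

    def record(self, status_code, response_seconds):
        if 500 <= status_code < 600:
            self.delay = min(self.delay * 2, self.max)    # exponential backoff
        elif response_seconds > 2.0:
            self.delay = min(self.delay * 1.5, self.max)  # slow server: ease off
        else:
            self.delay = max(self.delay * 0.9, self.base) # healthy: recover

d = AdaptiveDelay()
d.record(500, 0.1)   # server error → delay doubles to 2.0
d.record(500, 0.1)   # another error → 4.0
d.record(200, 0.1)   # healthy response → decays toward base
```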
Warning: Ignoring robots.txt can get your IP blocked, your crawler banned, or in extreme cases, lead to legal action. Major search engines like Google respect robots.txt - your crawler should too.
The same web page can be referenced by many different URLs. Normalization converts URLs to a canonical form to avoid crawling the same page multiple times.
# All these URLs might be the same page:
http://Example.com/page            # Different case
http://example.com/page/           # Trailing slash
http://example.com/page?           # Empty query string
http://example.com/page?a=1&b=2    # Query param order
http://example.com/page?b=2&a=1    # Same params, different order
http://example.com/page#section    # Fragment identifier
http://example.com:80/page         # Default port
https://www.example.com/page       # www prefix
http://example.com/./page          # Dot segment
http://example.com/foo/../page     # Parent directory

# After normalization, all become:
https://example.com/page
from urllib.parse import urlparse, parse_qs, urlencode

def normalize_url(url):
    # Parse URL components
    parsed = urlparse(url)
    # Lowercase scheme and host
    scheme = parsed.scheme.lower()
    host = parsed.netloc.lower()
    # Remove default ports
    if scheme == 'http' and host.endswith(':80'):
        host = host[:-3]
    if scheme == 'https' and host.endswith(':443'):
        host = host[:-4]
    # Remove www prefix (optional, site-dependent)
    # host = host.removeprefix('www.')
    # Normalize path
    path = parsed.path or '/'
    path = remove_dot_segments(path)  # helper per RFC 3986 §5.2.4
    # Sort query parameters
    query_params = parse_qs(parsed.query)
    # Remove tracking params
    tracking = {'utm_source', 'utm_medium', 'utm_campaign', 'fbclid'}
    query_params = {k: v for k, v in query_params.items()
                    if k not in tracking}
    sorted_query = urlencode(sorted(query_params.items()), doseq=True)
    # Rebuild URL (no fragment)
    return (f"{scheme}://{host}{path}"
            + (f"?{sorted_query}" if sorted_query else ""))
Important: URL normalization is not perfect. Some websites serve different content based on query parameters that look like tracking params. When in doubt, follow redirects to find the canonical URL.
Even after URL normalization, different URLs can serve identical or near-identical content. Detecting duplicates saves storage, bandwidth, and prevents index bloat.
Hash the entire page content. If hashes match, pages are identical.
hash1 = sha256(page1_content)
hash2 = sha256(page2_content)
is_duplicate = (hash1 == hash2)
Pros: Simple, fast. Cons: Misses near-duplicates.
Creates a fingerprint where similar documents have similar hashes.
# SimHash: Locality Sensitive Hash
hash1 = simhash(page1_tokens)
hash2 = simhash(page2_tokens)
distance = hamming_distance(hash1, hash2)
is_near_duplicate = (distance <= 3)
Pros: Detects ~95% similar pages. Cons: More complex.
SimHash Algorithm (simplified):
1. Extract features (words/shingles) from document
Document: "the quick brown fox"
Features: ["the", "quick", "brown", "fox"]
2. Hash each feature to a 64-bit value
hash("the") = 0100110...
hash("quick") = 1011001...
hash("brown") = 0110100...
hash("fox") = 1001011...
3. For each bit position, sum up:
- Add +1 if the bit is 1
- Add -1 if the bit is 0
Position 0: -1 + 1 - 1 + 1 = 0   (bits were 0, 1, 0, 1)
Position 1: +1 - 1 + 1 - 1 = 0   (bits were 1, 0, 1, 0)
...
4. Final hash: 1 if sum > 0, else 0
SimHash = 0100...   (illustrative — a real document has many more features)
5. Compare SimHashes using Hamming distance
If distance <= threshold, documents are similar
Hamming distance = number of bits that differ
Example: 3 bits different out of 64 → 95% similar
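The steps above as runnable code (a 64-bit feature hash derived from MD5 is an arbitrary choice here; any stable hash works):

```python
import hashlib

def simhash(tokens, bits=64):
    # Steps 2-3: hash each feature, then sum per bit position
    sums = [0] * bits
    for token in tokens:
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], 'big')
        for i in range(bits):
            sums[i] += 1 if (h >> i) & 1 else -1  # +1 for a 1-bit, -1 for a 0-bit
    # Step 4: the sign of each column sum becomes one output bit
    return sum(1 << i for i, s in enumerate(sums) if s > 0)

def hamming_distance(a, b):
    # Step 5: number of differing bits
    return bin(a ^ b).count('1')

doc1 = "the quick brown fox jumps over the lazy dog".split()
doc2 = "the quick brown fox leaps over the lazy dog".split()
distance = hamming_distance(simhash(doc1), simhash(doc2))  # small for near-dups
```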
Before downloading, check if URL was seen:
# Using Bloom Filter for URL dedup
bloom_filter = BloomFilter(
expected_items=10_billion,
false_positive_rate=0.01
)
def should_crawl(url):
normalized = normalize_url(url)
if bloom_filter.contains(normalized):
return False # Probably seen
bloom_filter.add(normalized)
return TrueAfter downloading, check if content was seen:
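The `BloomFilter` used above is left abstract; a minimal bit-array version looks like this (the sizing formulas are the standard m = −n·ln p / ln²2 and k = (m/n)·ln 2):

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, expected_items, false_positive_rate):
        # Standard sizing: m bits, k hash functions
        self.m = int(-expected_items * math.log(false_positive_rate)
                     / (math.log(2) ** 2))
        self.k = max(1, round(self.m / expected_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit indices from two halves of SHA-256
        h = hashlib.sha256(item.encode()).digest()
        a = int.from_bytes(h[:8], 'big')
        b = int.from_bytes(h[8:16], 'big')
        return [(a + i * b) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def contains(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter(expected_items=1_000_000, false_positive_rate=0.01)
bf.add("https://example.com/page")
```

At 10B URLs and 1% false positives this needs roughly 12 GB of bits, which is why large crawlers shard or partition the filter.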
# Store content fingerprints
fingerprint_store = {}  # SimHash → URL

def is_duplicate_content(url, content):
    simhash = compute_simhash(content)
    # Check for near-duplicates (O(n) scan — see tip below)
    for stored_hash, stored_url in fingerprint_store.items():
        if hamming_distance(simhash, stored_hash) <= 3:
            return True, stored_url
    fingerprint_store[simhash] = url
    return False, None
Efficiency Tip: Use a Bloom Filter for URL deduplication (O(1) lookup, small memory). Use SimHash with LSH (Locality Sensitive Hashing) for content deduplication to avoid O(n) comparisons.
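One common LSH trick for SimHash: split the 64-bit fingerprint into 4 bands of 16 bits. Any fingerprint within 3 bits of a query differs in at most 3 bands, so by pigeonhole it must match at least one band exactly — only those buckets need scanning. A sketch:

```python
from collections import defaultdict

BANDS = 4
BAND_BITS = 16  # 4 × 16 = one 64-bit SimHash

def bands(simhash):
    # Split the fingerprint into 4 disjoint 16-bit bands
    return [(i, (simhash >> (i * BAND_BITS)) & 0xFFFF) for i in range(BANDS)]

class NearDuplicateIndex:
    def __init__(self, max_distance=3):
        self.buckets = defaultdict(list)  # (band_idx, band_value) → fingerprints
        self.max_distance = max_distance

    def find(self, simhash):
        # A fingerprint within 3 bits shares at least one band exactly,
        # so only a few small buckets are scanned instead of everything
        for key in bands(simhash):
            for candidate in self.buckets[key]:
                if bin(candidate ^ simhash).count('1') <= self.max_distance:
                    return candidate
        return None

    def add(self, simhash):
        for key in bands(simhash):
            self.buckets[key].append(simhash)

idx = NearDuplicateIndex()
idx.add(0x123456789ABCDEF0)
near = idx.find(0x123456789ABCDEF0 ^ 0b11)  # 2 bits flipped → still found
```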
Spider traps are URLs that cause crawlers to get stuck in an infinite loop, wasting resources. They can be accidental (poor website design) or intentional (anti-crawling measures).
Infinite Calendar
/calendar/2024/01/01 → /calendar/2024/01/02 → ... forever
Session ID in URL
/page?sid=abc123 creates infinite unique URLs
Soft 404 Pages
Invalid URLs return 200 OK with links to more invalid URLs
Symbolic Links
/a/b/a/b/a/b/... circular directory structure
Dynamic Pagination
/search?page=1, page=2, ... page=999999
Query Param Explosion
Every combination of filters creates new URLs
Spider Trap Detection Rules:
1. URL LENGTH LIMIT
   ─────────────────
   if len(url) > 2000:
       skip_url("URL too long - possible trap")

2. URL DEPTH LIMIT
   ─────────────────
   path_depth = url.path.count('/')
   if path_depth > 15:
       skip_url("Path too deep - possible trap")

3. REPEATED PATH SEGMENTS
   ─────────────────────────
   segments = url.path.split('/')
   if has_repeating_pattern(segments):
       skip_url("Repeating path pattern detected")
   Example: /a/b/c/a/b/c/a/b/c → Pattern [a,b,c] repeats

4. DOMAIN PAGE LIMIT
   ─────────────────────
   pages_from_domain = get_crawled_count(domain)
   if pages_from_domain > MAX_PAGES_PER_DOMAIN:
       skip_url("Domain limit reached")

5. SIMILAR URL PATTERN
   ─────────────────────
   # Track URL patterns (template with placeholders)
   pattern = extract_pattern(url)  # /user/*/posts/*
   if pattern_count[pattern] > threshold:
       deprioritize_pattern(pattern)

6. CONTENT SIMILARITY
   ─────────────────────
   if last_N_pages_from_domain_are_similar():
       reduce_crawl_priority(domain)
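Rule 3's `has_repeating_pattern` check can be sketched simply: look for any run of path segments that repeats consecutively (the minimum of three repeats is an assumed threshold):

```python
def has_repeating_pattern(segments, min_repeats=3):
    """True if some run of path segments repeats consecutively
    at least min_repeats times, e.g. /a/b/a/b/a/b."""
    n = len(segments)
    for size in range(1, n // min_repeats + 1):
        for start in range(n - size * min_repeats + 1):
            window = segments[start:start + size]
            if all(segments[start + r * size:start + (r + 1) * size] == window
                   for r in range(min_repeats)):
                return True
    return False

# /a/b/c/a/b/c/a/b/c → the run [a, b, c] repeats three times
trap = "a/b/c/a/b/c/a/b/c".split('/')
ok = "products/electronics/phones".split('/')
```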
import re
from collections import Counter
from urllib.parse import urlparse

UUID_RE = re.compile(
    r'^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$', re.I)

def extract_pattern(url):
    """Convert URL to pattern:
    /user/123/posts/456 becomes /user/*/posts/*"""
    parts = urlparse(url).path.split('/')
    pattern_parts = []
    for part in parts:
        if part.isdigit() or UUID_RE.match(part):
            pattern_parts.append('*')
        else:
            pattern_parts.append(part)
    return '/'.join(pattern_parts)

# Track pattern frequency
pattern_counts = Counter()
pattern_counts[extract_pattern(url)] += 1  # for each discovered url

# If the same pattern is seen 1000+ times,
# it's likely templated content
Real Example: A calendar widget that generates URLs like /events/2024/01/01 can create 365 URLs per year, going back centuries. Without detection, a crawler could waste millions of requests on a single widget.
Every HTTP request requires a DNS lookup to convert the domain name to an IP address. At 400 requests/second, DNS can become a significant bottleneck without proper caching.
DNS Resolution in Web Crawler:
┌─────────────┐ ┌──────────────┐ ┌──────────────┐
│ Crawler │────▶│ Local DNS │────▶│ ISP/Public │
│ Worker │ │ Cache │ │ DNS Server │
└─────────────┘ └──────────────┘ └──────────────┘
│ │
Cache Hit? Cache Miss?
│ │
▼ ▼
Return IP Query Root DNS
(< 1ms) │
▼
Query TLD DNS (.com)
│
▼
Query Authoritative DNS
│
▼
Return IP (50-200ms)
Problem: Without caching, 400 req/sec × ~100 ms per lookup means 40 seconds of cumulative DNS latency every second — you would need hundreds of concurrent resolutions just for DNS.
Solution: Local DNS cache with high TTL
import time

class DNSCache:
    def __init__(self):
        self.cache = {}      # domain → (ip, expiry)
        self.min_ttl = 3600  # 1 hour minimum

    def resolve(self, domain):
        if domain in self.cache:
            ip, expiry = self.cache[domain]
            if time.time() < expiry:
                return ip  # Cache hit
        # Cache miss - do actual lookup
        ip = dns_lookup(domain)
        ttl = max(get_dns_ttl(domain), self.min_ttl)
        self.cache[domain] = (ip, time.time() + ttl)
        return ip

    def prefetch(self, domains):
        """Resolve DNS for a batch of domains"""
        for domain in domains:
            if domain not in self.cache:
                self.resolve(domain)
Performance Impact: With DNS caching, cache hit rate is typically 90%+. This reduces average DNS lookup time from ~100ms to ~10ms, saving significant crawl time.
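The ~10 ms figure follows from simple expected-value arithmetic (assuming roughly 1 ms for a cache hit):

```python
# Expected DNS latency at a given cache hit rate
hit_rate = 0.90
hit_ms, miss_ms = 1, 100   # assumed: ~1 ms cache hit, ~100 ms full resolution
avg_ms = hit_rate * hit_ms + (1 - hit_rate) * miss_ms
print(f"average lookup: {avg_ms:.1f} ms")
```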
A single machine cannot crawl billions of pages. We need multiple crawler workers coordinating to avoid duplicate work while maintaining politeness.
┌─────────────────────────────────────────────────────────────────────────────┐
│ DISTRIBUTED WEB CRAWLER │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────┐
│ Coordinator Service │
│ - Worker health checks │
│ - Load balancing │
│ - Crawl policy management │
└──────────────┬──────────────┘
│
┌─────────────────────────┼─────────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Worker Pod 1 │ │ Worker Pod 2 │ │ Worker Pod N │
│ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Crawler 1 │ │ │ │ Crawler 1 │ │ │ │ Crawler 1 │ │
│ │ Crawler 2 │ │ │ │ Crawler 2 │ │ │ │ Crawler 2 │ │
│ │ Crawler 3 │ │ │ │ Crawler 3 │ │ │ │ Crawler 3 │ │
│ └───────────┘ │ │ └───────────┘ │ │ └───────────┘ │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
│ │ │
└───────────────────────┼───────────────────────┘
│
┌────────────▼────────────┐
│ Shared Components │
│ │
│ ┌──────────────────┐ │
│ │ URL Frontier │ │
│ │ (Kafka/Redis) │ │
│ └──────────────────┘ │
│ │
│ ┌──────────────────┐ │
│ │ Seen URL Store │ │
│ │ (Bloom Filter) │ │
│ └──────────────────┘ │
│ │
│ ┌──────────────────┐ │
│ │ Content Store │ │
│ │ (S3/HDFS) │ │
│ └──────────────────┘ │
└─────────────────────────┘
Assign each domain to a specific worker to maintain politeness:
# Hash-based domain assignment
# (Note: simple modulo is NOT consistent hashing — adding or removing
# a worker reshuffles most domains. Also use a stable hash in production:
# Python's built-in hash() is randomized per process.)
def get_worker_for_domain(domain, num_workers):
    hash_value = hash(domain)
    return hash_value % num_workers

# Worker 0: amazon.com, facebook.com, ...
# Worker 1: google.com, twitter.com, ...
# Worker 2: github.com, linkedin.com, ...

# Benefits:
# 1. Each domain handled by one worker
# 2. Politeness naturally enforced
# 3. Efficient connection reuse
# 4. Easy to scale workers
Deploy crawlers close to target websites:
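Modulo assignment reshuffles almost every domain whenever the worker count changes; a consistent-hash ring moves only about 1/N of them. A sketch (MD5 as the stable hash and 100 virtual nodes per worker are assumed tuning choices):

```python
import bisect
import hashlib

def stable_hash(s):
    # Python's built-in hash() is salted per process — use a stable one
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], 'big')

class ConsistentHashRing:
    def __init__(self, workers, vnodes=100):
        # Each worker owns many points on the ring for even load
        self.ring = sorted(
            (stable_hash(f"{w}#{i}"), w)
            for w in workers for i in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    def worker_for(self, domain):
        # First ring point clockwise from the domain's hash
        i = bisect.bisect(self.keys, stable_hash(domain)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["worker-0", "worker-1", "worker-2"])
w = ring.worker_for("example.com")  # same answer in every process, every call
```

Adding a fourth worker only steals ring segments from its neighbors, so roughly a quarter of domains move instead of three quarters under modulo.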
Worker Failure
Reassign domains to healthy workers. Pending URLs remain in frontier.
Frontier Failure
Use replicated Kafka/Redis. URLs persisted before ACK.
Storage Failure
Retry with exponential backoff. Dead-letter queue for failed writes.
A web crawler generates massive amounts of data: crawled content, URL metadata, and crawl history. Choosing the right storage for each data type is crucial.
-- URL Metadata (Cassandra/DynamoDB - high write throughput)
CREATE TABLE url_metadata (
url_hash BIGINT PRIMARY KEY, -- Hash of normalized URL
url TEXT,
domain TEXT,
first_seen TIMESTAMP,
last_crawled TIMESTAMP,
last_modified TIMESTAMP, -- From HTTP header
http_status INT,
content_hash BIGINT, -- SimHash for dedup
content_length INT,
content_type TEXT,
crawl_count INT,
priority_score FLOAT,
INDEX (domain, last_crawled)
);
-- Crawl History (Time-series DB or Cassandra)
CREATE TABLE crawl_log (
url_hash BIGINT,
crawl_time TIMESTAMP,
http_status INT,
response_time INT, -- milliseconds
content_changed BOOLEAN,
error_message TEXT,
PRIMARY KEY (url_hash, crawl_time)
);
-- Domain Stats (Redis or Cassandra)
CREATE TABLE domain_stats (
domain TEXT PRIMARY KEY,
pages_crawled BIGINT,
last_crawl TIMESTAMP,
avg_response_ms INT,
error_rate FLOAT,
robots_txt TEXT,
robots_expires TIMESTAMP,
crawl_delay INT
);

// Store crawled content in S3/GCS
path: s3://crawl-data/{date}/{domain_hash}/{url_hash}
// Content file structure (compressed)
{
"url": "https://example.com/page",
"crawl_time": "2024-01-15T10:30:00Z",
"http_status": 200,
"headers": {
"content-type": "text/html",
"last-modified": "...",
"etag": "..."
},
"content": "<html>...</html>",
"extracted_text": "...",
"outlinks": ["url1", "url2", ...]
}
A crawler running 24/7 needs comprehensive monitoring to detect issues early and recover from failures automatically.
Pages/Second
Crawl throughput across all workers
Avg Response Time
Time to fetch and process pages
Error Rate
4xx, 5xx, timeouts, DNS failures
Frontier Size
URLs waiting to be crawled
Worker Health
CPU, memory, network per worker
Duplicate Rate
% of URLs/content already seen
# Prometheus alerting rules
- alert: CrawlThroughputLow
  expr: rate(pages_crawled_total[5m]) < 300
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Crawl throughput below 300 pages/sec"

- alert: HighErrorRate
  expr: rate(crawl_errors_total[5m]) / rate(crawl_requests_total[5m]) > 0.1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 10%"

- alert: FrontierGrowing
  expr: frontier_size > 100000000
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "URL frontier exceeding 100M URLs"

- alert: WorkerDown
  expr: up{job="crawler"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Crawler worker is down"