---
title: "Design a Web Crawler"
tags: [distributed-systems, queues, politeness, deduplication, dns, scraping]
difficulty: medium
estimated_time: 45min
companies: [Google, Bing, Amazon, Cloudflare]
---
## Problem Statement
Design a web crawler that systematically browses the internet, downloads web pages,
and stores their content — the foundational component of a search engine index.
## Clarifying Questions
Ask these before designing:
- Scope: general-purpose or domain-specific? (Assume general-purpose)
- How many pages to crawl? (Assume 1B pages, ~10 KB avg = ~10 TB raw per cycle)
- How often to re-crawl (freshness)? (Assume every 2–4 weeks for most pages)
- Handle JavaScript-rendered content? (No — defer to Phase 2)
- Store raw HTML or parsed content? (Both — raw for re-processing, parsed for indexing)
## Capacity Estimates
- Target: 1B pages crawled over 30 days
- Rate: ~400 pages/sec sustained
- Storage: 1B * 10 KB = 10 TB per crawl cycle
- DNS: ~400 DNS lookups/sec (cache aggressively)
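A quick sanity check of those numbers (a minimal sketch, assuming decimal units and a uniform 10 KB average page size):

```python
# Back-of-envelope check of the estimates above (decimal units assumed).
pages = 1_000_000_000            # 1B pages per crawl cycle
days = 30
avg_page_kb = 10

pages_per_sec = pages / (days * 24 * 3600)        # ~386 -> round up to ~400/sec
storage_tb = pages * avg_page_kb / 1_000_000_000  # 1 TB = 10^9 KB -> ~10 TB/cycle
print(f"{pages_per_sec:.0f} pages/sec, {storage_tb:.0f} TB per crawl cycle")
```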
## High-Level Design
```
[Seed URLs]
      |
      v
[URL Frontier / Priority Queue]
      |
      v
[Fetcher Workers] ──> [DNS Cache]
      |
      v
[HTML Parser / Link Extractor]
      |                   |
      v                   v
[Content Store]    [URL Dedup Filter]
  (S3 / HDFS)             |
                          v
                   [URL Frontier] (loop)
```
Three core loops:
1. Fetch: pull URL from frontier, download page, store raw HTML
2. Parse: extract links and text from stored HTML
3. Schedule: filter duplicate/already-seen URLs, re-enqueue new ones
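A minimal single-process sketch of how those three loops compose. The `frontier`, `seen`, and helper functions are hypothetical stand-ins for the components in the diagram; a real deployment splits the stages into separate services:

```python
# Minimal sketch of the fetch -> parse -> schedule loop (single process).
# frontier, seen, fetch_page, store_html, and extract_links are hypothetical
# stand-ins for the distributed components shown above.
def crawl_loop(frontier, seen, fetch_page, store_html, extract_links):
    while True:
        url = frontier.pop()                    # 1. Fetch: pull next URL
        if url is None:
            break                               # frontier drained
        html = fetch_page(url)
        if html is None:
            continue                            # fetch failed; retries omitted here
        store_html(url, html)                   # keep raw HTML for re-processing
        for link in extract_links(url, html):   # 2. Parse: extract outlinks
            if link not in seen:                # 3. Schedule: dedup, then enqueue
                seen.add(link)
                frontier.push(link)
```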
## URL Frontier
The frontier is more than a queue; it enforces two properties:
**Politeness:** respect robots.txt and crawl delays per domain.
- Per-domain back-queue with a minimum delay (e.g., 1 req/sec per host)
**Priority:** rank URLs by importance (PageRank estimate, freshness).
- High: news sites, frequently-changing pages
- Low: deep archive pages, rarely-linked content
Implementation: distributed priority queue (Kafka partitioned by host hash, or
Redis sorted sets for small scale).
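One way to sketch the politeness half on a single node: a heap of hosts keyed by the earliest time each host may be fetched again, with one FIFO back-queue per host. Priority ordering within and across hosts is omitted, and the 1-second delay is an illustrative default:

```python
import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit

class PoliteFrontier:
    """Single-node sketch: one back-queue per host, min delay between fetches."""

    def __init__(self, min_delay_per_host: float = 1.0):
        self.min_delay = min_delay_per_host
        self.host_queues = defaultdict(deque)   # host -> pending URLs
        self.ready_heap = []                    # (next_allowed_fetch_time, host)

    def push(self, url: str) -> None:
        host = urlsplit(url).netloc
        if not self.host_queues[host]:
            heapq.heappush(self.ready_heap, (time.time(), host))
        self.host_queues[host].append(url)

    def pop(self):
        while self.ready_heap:
            not_before, host = heapq.heappop(self.ready_heap)
            if not self.host_queues[host]:
                continue                        # host drained since scheduling
            delay = not_before - time.time()
            if delay > 0:
                time.sleep(delay)               # enforce per-host crawl delay
            url = self.host_queues[host].popleft()
            if self.host_queues[host]:          # reschedule host after min delay
                heapq.heappush(self.ready_heap,
                               (time.time() + self.min_delay, host))
            return url
        return None
```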
## Deduplication
Two levels of dedup are required:
**URL dedup:** don't re-fetch the same URL.
- Bloom filter in memory (~1.2 GB for 1B URLs at a 1% false-positive rate)
- Persistent seen-set in Redis or a key-value store for durability
**Content dedup:** don't store identical pages under different URLs.
- Exact hash (e.g., MD5/SHA-256) of page content catches byte-identical copies
- SimHash fingerprints catch near-duplicates: pages whose fingerprints differ by only a few bits
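A small sketch of the numbers and checks involved, assuming the standard Bloom filter sizing formulas and 64-bit SimHash fingerprints (the Hamming-distance threshold of 3 is a common heuristic, not a fixed rule):

```python
import math

# Bloom filter sizing for the URL seen-set (standard formulas).
n, p = 1_000_000_000, 0.01                        # 1B URLs, 1% false positives
m_bits = -n * math.log(p) / (math.log(2) ** 2)    # ~9.6 bits per URL
k = round((m_bits / n) * math.log(2))             # optimal number of hash functions
print(f"~{m_bits / 8 / 1e9:.1f} GB of memory, {k} hash functions")  # ~1.2 GB, 7

# Content dedup: near-duplicate test on 64-bit SimHash fingerprints.
def is_near_duplicate(simhash_a: int, simhash_b: int, threshold: int = 3) -> bool:
    return bin(simhash_a ^ simhash_b).count("1") <= threshold
```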
## robots.txt
Every crawler MUST obey robots.txt:
1. Fetch `https://host/robots.txt` once per domain, cache for 24h
2. Check Disallow rules before fetching any URL from that host
3. Respect Crawl-delay directive
4. Set a descriptive User-Agent identifying your crawler
Violating robots.txt risks IP bans and legal issues.
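A sketch of per-domain robots.txt caching using Python's standard-library parser; the 24-hour TTL and the user-agent string are illustrative choices:

```python
import time
from urllib import robotparser
from urllib.parse import urlsplit

ROBOTS_TTL = 24 * 3600                      # re-fetch robots.txt daily
USER_AGENT = "ExampleCrawler/1.0 (+https://example.com/bot-info)"  # hypothetical
_robots_cache = {}                          # host -> (parser, fetched_at)

def is_allowed(url: str) -> bool:
    host = urlsplit(url).netloc
    cached = _robots_cache.get(host)
    if cached is None or time.time() - cached[1] > ROBOTS_TTL:
        parser = robotparser.RobotFileParser(f"https://{host}/robots.txt")
        parser.read()                       # fetch and parse robots.txt
        _robots_cache[host] = (parser, time.time())
    return _robots_cache[host][0].can_fetch(USER_AGENT, url)
```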
## Deep Dives
### Distributed Fetcher Fleet
- Stateless fetcher workers — scale horizontally
- Each worker pulls a batch of URLs for a specific host subset (partitioned to
respect per-domain politeness)
- Health checks; failed fetches go back on the queue with a retry count
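A sketch of the retry path, assuming the `requests` library and a hypothetical `frontier.push_retry` method that re-enqueues the URL with an incremented count:

```python
import requests   # assumption: requests used as the HTTP client

MAX_RETRIES = 3   # illustrative; tune per failure class (DNS, timeout, 5xx)

def fetch(frontier, url: str, retry_count: int = 0, timeout: float = 10.0):
    try:
        resp = requests.get(url, timeout=timeout,
                            headers={"User-Agent": "ExampleCrawler/1.0"})
        resp.raise_for_status()
        return resp.text
    except requests.RequestException:
        if retry_count < MAX_RETRIES:
            # hypothetical frontier API: re-enqueue with an incremented count
            frontier.push_retry(url, retry_count + 1)
        return None
```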
### Handling Traps
- Spider traps: dynamically generated infinite URL spaces (e.g., calendars)
  - Limit URL path depth (e.g., max 6 levels deep)
  - Detect patterns: URLs with incrementing query params
- Duplicate content under canonical vs. non-canonical URLs
  - Follow `<link rel="canonical">` tags (and the equivalent HTTP `Link` header)
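A sketch of cheap pre-enqueue trap filters; the depth, length, and numeric-segment thresholds are illustrative heuristics:

```python
from urllib.parse import urlsplit

MAX_PATH_DEPTH = 6       # e.g., /a/b/c/d/e/f is the deepest allowed path
MAX_URL_LENGTH = 2000    # very long URLs are usually machine-generated

def looks_like_trap(url: str) -> bool:
    if len(url) > MAX_URL_LENGTH:
        return True
    path_segments = [seg for seg in urlsplit(url).path.split("/") if seg]
    if len(path_segments) > MAX_PATH_DEPTH:
        return True
    # crude calendar-trap heuristic: too many purely numeric path segments
    return sum(seg.isdigit() for seg in path_segments) > 3
```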
### Re-crawl Scheduling
- Track Last-Modified and ETag headers; use conditional GETs (If-Modified-Since / If-None-Match)
- Pages that change often → shorter re-crawl interval (exponential smoothing)
- Pages with 404 → back off exponentially then drop after N failures
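A sketch of the conditional re-fetch, assuming `requests` and that the ETag / Last-Modified values from the previous crawl are stored alongside the page:

```python
import requests   # assumption: requests used as the HTTP client

def refetch_if_changed(url: str, etag: str = None, last_modified: str = None):
    headers = {"User-Agent": "ExampleCrawler/1.0"}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 304:
        return None            # unchanged: lengthen the re-crawl interval
    resp.raise_for_status()
    return resp.text           # changed: shorten the re-crawl interval
```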
### DNS at Scale
- ~400 DNS lookups/sec requires a local caching DNS resolver
- Batch resolve hostnames ahead of fetch time
- Respect DNS TTL but cache aggressively for known stable domains
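A sketch of an in-process DNS cache in front of the OS resolver; a production crawler would typically run a local caching resolver (e.g., Unbound) instead, and the fallback TTL here is an assumption:

```python
import socket
import time

_dns_cache = {}   # hostname -> (ip_address, expires_at)

def resolve(hostname: str, fallback_ttl: int = 300) -> str:
    """Return a cached IP for hostname, resolving via the OS when stale."""
    cached = _dns_cache.get(hostname)
    if cached and cached[1] > time.time():
        return cached[0]
    # socket.getaddrinfo does not expose the record TTL, so a fixed fallback
    # TTL is used here; a real resolver honors per-record TTLs.
    ip = socket.getaddrinfo(hostname, 443)[0][4][0]
    _dns_cache[hostname] = (ip, time.time() + fallback_ttl)
    return ip
```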
## Key Decisions to Highlight
1. URL Frontier partitioned by host — enforces politeness without coordination overhead
2. Bloom filter for URL dedup — O(1) lookup, probabilistic but tunable
3. Separate fetch → parse → schedule pipeline — each stage scales independently
4. robots.txt compliance cached per domain — respect crawl delay, avoid bans