dw2md 0.2.1

Crawl a DeepWiki repository and compile all pages into a single, LLM-friendly markdown file
# `dw2md` — DeepWiki to Markdown Compile

**Version:** 0.2.0
**Language:** Rust
**Purpose:** Crawl a DeepWiki repository and compile all pages into a single, LLM-friendly markdown file.

---

## Motivation

DeepWiki generates excellent structured documentation for open-source repositories, but there's no good way to grab an entire wiki and feed it to an LLM as context. The content lives across many pages on `deepwiki.com`, each rendered client-side via Next.js, making naive scraping impractical. `dw2md` solves this by talking directly to the official DeepWiki MCP server — a free, no-auth JSON-RPC endpoint — to pull the wiki structure and all page contents, then compile them into a clean markdown document optimized for LLM consumption.

---

## Core Workflow

The tool operates in three phases:

**Phase 1 — Resolve the repository.** The user provides either a full DeepWiki URL (`https://deepwiki.com/owner/repo`) or a shorthand (`owner/repo`). The tool parses out the owner and repo name.

**Phase 2 — Fetch structure and contents via MCP.** The tool connects to `https://mcp.deepwiki.com/mcp` using the MCP Streamable HTTP protocol (JSON-RPC 2.0 over HTTP POST). It first calls `read_wiki_structure` to get the table of contents, then calls `read_wiki_contents` for each page (or in batches, if the tool supports it). Concurrency is bounded and configurable.

**Phase 3 — Compile and emit.** All pages are assembled into a single markdown document in table-of-contents order, with metadata headers, navigation aids, and clean formatting. The result is written to stdout or a file.

---

## MCP Client Implementation

The tool implements a minimal MCP client targeting the Streamable HTTP transport. No session management is needed — the DeepWiki MCP server is stateless.

### Protocol Details

- **Endpoint:** `https://mcp.deepwiki.com/mcp`
- **Transport:** HTTP POST with JSON-RPC 2.0 body
- **Content-Type:** `application/json`
- **Accept:** `application/json, text/event-stream`
- **Auth:** None required

### Initialization Handshake

Before calling tools, the client must perform the MCP initialization sequence:

```
POST /mcp
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "initialize",
  "params": {
    "protocolVersion": "2025-03-26",
    "capabilities": {},
    "clientInfo": {
      "name": "dw2md",
      "version": "0.2.0"
    }
  }
}
```

The server responds with its capabilities and may return a `Mcp-Session-Id` header. If present, include it in subsequent requests. Follow up with the `notifications/initialized` notification:

```
POST /mcp
{
  "jsonrpc": "2.0",
  "method": "notifications/initialized",
  "params": {}
}
```

### Tool Calls

After initialization, call tools via `tools/call`:

```
POST /mcp
{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "tools/call",
  "params": {
    "name": "read_wiki_structure",
    "arguments": {
      "repo": "owner/repo"
    }
  }
}
```

The three available tools are:

| Tool                  | Purpose                                                              | Key Arguments                                                                               |
| --------------------- | -------------------------------------------------------------------- | ------------------------------------------------------------------------------------------- |
| `read_wiki_structure` | Returns the wiki's table of contents (page titles, hierarchy, slugs) | `repo` (e.g. `"tinygrad/tinygrad"`)                                                         |
| `read_wiki_contents`  | Returns the markdown content for a specific page                     | `repo`, page identifier (slug or title — discover exact schema via `tools/list` at runtime) |
| `ask_question`        | Not used by `dw2md`, but available for future extensions             | `repo`, `question`                                                                          |

**Important:** The exact argument schemas for these tools should be discovered at runtime by calling `tools/list` during initialization. The schemas above are based on documentation and observed behavior, but the server is the source of truth. The tool should call `tools/list` on first run and cache the schemas, or at minimum handle schema mismatches gracefully.

### Response Handling

The server may respond with either `application/json` (a single JSON-RPC response) or `text/event-stream` (SSE). The client must handle both:

- **JSON response:** Parse directly as a JSON-RPC result.
- **SSE response:** Read `data:` lines, parse each as a JSON-RPC message, and concatenate text content blocks from the result.

Tool call results come back in the standard MCP content block format:

```json
{
  "jsonrpc": "2.0",
  "id": 2,
  "result": {
    "content": [
      {
        "type": "text",
        "text": "...the actual markdown content..."
      }
    ]
  }
}
```

---

## CLI Interface

```
dw2md [OPTIONS] <REPO>
```

### Arguments

- `<REPO>` — Repository identifier. Accepts any of:
  - `owner/repo` (e.g. `tinygrad/tinygrad`)
  - `https://deepwiki.com/owner/repo`
  - `https://deepwiki.com/owner/repo/3.1-some-page` (extracts `owner/repo`, ignores page path)

### Options

| Flag                 | Short | Default    | Description                                                                 |
| -------------------- | ----- | ---------- | --------------------------------------------------------------------------- |
| `--output <FILE>`    | `-o`  | stdout     | Write output to a file instead of stdout                                    |
| `--concurrency <N>`  | `-j`  | `4`        | Max concurrent page fetches                                                 |
| `--format <FMT>`     | `-f`  | `markdown` | Output format: `markdown` or `json`                                         |
| `--timeout <SECS>`   | `-t`  | `30`       | Per-request timeout in seconds                                              |
| `--pages <FILTER>`   | `-p`  | all        | Comma-separated page slugs to include (e.g. `1-overview,3.1-data-pipeline`) |
| `--exclude <FILTER>` | `-x`  | none       | Comma-separated page slugs to exclude                                       |
| `--no-toc`           |       | false      | Omit the structure tree from output                                         |
| `--no-metadata`      |       | false      | Omit the metadata header block                                              |
| `--quiet`            | `-q`  | false      | Suppress progress output on stderr                                          |
| `--verbose`          | `-v`  | false      | Show detailed progress and debug info                                       |

### Examples

```bash
# Basic usage — prints to stdout
dw2md tinygrad/tinygrad

# Save to file with progress
dw2md tinygrad/tinygrad -o react-wiki.md

# Just the architecture sections
dw2md AsyncFuncAI/deepwiki-open -p 3-architecture,3.1-data-pipeline,3.2-rag-system

# As JSON for programmatic use
dw2md tinygrad/tinygrad -f json -o react-wiki.json

# From a full URL
dw2md https://deepwiki.com/tokio-rs/tokio -o tokio.md
```

---

## Output Format

### Markdown (default)

The markdown output is designed for LLM and agent workflows. The two key design goals are:

1. **Fast structural scanning** — an agent should be able to read the table of contents and understand the document's hierarchy without processing the full content.
2. **Selective extraction** — an agent should be able to `grep` for a section delimiter and extract only the sections relevant to its current task, rather than stuffing the entire document into context.

The compiled document follows this structure:

```markdown
<!-- dw2md v0.2.0 | tinygrad/tinygrad | 2026-02-12T15:30:00Z | 47 pages -->

# tinygrad/tinygrad — DeepWiki

> Compiled from https://deepwiki.com/tinygrad/tinygrad
> Generated: 2026-02-12T15:30:00Z | Pages: 47

## Structure

├── 1 Overview
│   ├── 1.1 Key Features
│   └── 1.2 System Requirements
├── 2 Getting Started
│   ...
└── 8 API Reference

## Contents

<<< SECTION: 1 Overview [1-overview] >>>

[page content with original heading levels preserved]

<<< SECTION: 1.1 Key Features [1-1-key-features] >>>

[page content]

<<< SECTION: 2 Getting Started [2-getting-started] >>>

[page content]
```

#### Section delimiters

Each page is preceded by a delimiter line with the format:

```
<<< SECTION: {title} [{slug}] >>>
```

This is designed to be trivially grep-able by agents and scripts:

```bash
# List all sections
grep "^<<< SECTION:" wiki.md

# Extract a specific section (content between two delimiters)
sed -n '/^<<< SECTION: 1 Overview/,/^<<< SECTION:/p' wiki.md

# Regex to capture title and slug
# ^<<< SECTION: (.+?) \[(.+?)\] >>>$
```

The slug in `[brackets]` is the same identifier used by `--pages` and `--exclude` flags, so an agent can discover slugs from the structure, then re-invoke `dw2md` with `--pages` to fetch only what it needs.

#### Tree table of contents

The `## Structure` section uses ASCII tree characters (`├──`, `└──`, `│`) — the same visual language as the Unix `tree` command. This is more scannable than indented bullet lists and conveys hierarchy at a glance.

When `--no-toc` is passed, the structure tree and the `## Contents` header are both omitted; section delimiters go directly after the metadata.

#### Design decisions

- **Original heading levels preserved** — page content keeps its source heading structure. No heading-level bumping is performed because the `<<< SECTION >>>` delimiter (not markdown heading depth) is the structural boundary. This saves tokens and avoids information loss.
- **Token efficient** — compared to the previous format: no repeated horizontal rules (`---`), no anchor link markup in the TOC, no extra `#` characters from heading bumping.
- **HTML comment metadata on line 1** — machine-parseable but invisible to most renderers. Lets a tool quickly identify the document without reading the whole thing.
- **Source annotations preserved** — DeepWiki pages contain `Sources: file.py:1-50` annotations linking to GitHub. These provide useful code location context for LLMs.
- **Mermaid blocks preserved** — left as fenced code blocks. Many LLMs can interpret these.

### JSON format

When `--format json` is specified:

```json
{
  "repo": "tinygrad/tinygrad",
  "url": "https://deepwiki.com/tinygrad/tinygrad",
  "generated_at": "2026-02-12T15:30:00Z",
  "tool_version": "0.2.0",
  "page_count": 47,
  "pages": [
    {
      "slug": "1-overview",
      "title": "1 Overview",
      "depth": 0,
      "content": "...markdown content..."
    },
    {
      "slug": "1-1-key-features",
      "title": "1.1 Key Features",
      "depth": 1,
      "content": "..."
    }
  ]
}
```

The JSON format is useful for downstream tooling — feeding individual pages into different context windows, building a retrieval index, etc.

---

## Architecture

### Crate Dependencies

| Crate                  | Purpose                                                   |
| ---------------------- | --------------------------------------------------------- |
| `clap`                 | CLI argument parsing (derive API)                         |
| `reqwest`              | HTTP client (with `rustls-tls` for no OpenSSL dependency) |
| `tokio`                | Async runtime                                             |
| `serde` / `serde_json` | JSON serialization                                        |
| `futures`              | Stream combinators for SSE parsing                        |
| `indicatif`            | Progress bars on stderr                                   |
| `anyhow`               | Error handling                                            |

### Module Layout

```
src/
├── main.rs          # CLI entry point, clap parsing
├── mcp/
│   ├── mod.rs       # MCP client: init, tool calls, response parsing
│   ├── transport.rs # HTTP transport layer, SSE handling
│   └── types.rs     # JSON-RPC and MCP type definitions
├── compiler/
│   ├── mod.rs       # Orchestrates fetch + compile pipeline
│   ├── markdown.rs  # Markdown output assembly, tree TOC, section delimiters
│   └── json.rs      # JSON output assembly
└── wiki/
    ├── mod.rs       # Wiki types: Page, Structure, etc.
    └── filter.rs    # Page include/exclude filtering
```

### Concurrency Model

Page fetching uses a `tokio::sync::Semaphore` to bound concurrency. The fetch pipeline:

1. Call `read_wiki_structure` → get ordered list of pages.
2. Apply include/exclude filters.
3. Spawn one task per page, gated by the semaphore.
4. Collect results into a `Vec<Page>` preserving the original order.
5. Pass to the compiler for output assembly.

The semaphore default of 4 is conservative to be respectful of the free DeepWiki MCP endpoint. Users can increase it, but the tool should document that aggressive concurrency may result in rate limiting.

---

## Error Handling

The tool should handle these failure modes gracefully:

- **Network errors / timeouts** — retry each page up to 3 times with exponential backoff (1s, 2s, 4s). After exhausting retries, log the failure and continue with remaining pages. The final output should note which pages failed.
- **MCP protocol errors** — if initialization fails, exit with a clear error message suggesting the endpoint may be down. If `tools/list` returns unexpected schemas, log a warning and attempt to proceed with best-guess arguments.
- **Repository not indexed** — if `read_wiki_structure` returns an error indicating the repo isn't on DeepWiki, print a helpful message: `"Repository 'owner/repo' is not indexed on DeepWiki. Visit https://deepwiki.com to request indexing."`
- **Partial failures** — the tool should always produce output for whatever pages it successfully fetched, appending a summary of failures at the end.

---

## Future Extensions

These are explicitly out of scope for v0.2.0 but worth designing around:

- **`--ask <QUESTION>`** — pipe the compiled wiki to the `ask_question` tool and print the answer. Useful for one-shot queries.
- **Caching** — store fetched wikis in `~/.cache/dw2md/` keyed by `owner/repo` with a TTL. Skip fetching if fresh.
- **Multiple repos** — accept multiple repo arguments and compile them into one document or separate files.
- **Diff mode** — compare a cached version with the current wiki and show what changed.
- **Piping to clipboard** — `dw2md owner/repo | pbcopy` already works since the default output is stdout.

---

## Build & Install

```bash
# From source
cargo install --path .

# Or directly from crates.io (once published)
cargo install dw2md
```

The binary should be statically linkable (no OpenSSL dependency thanks to `rustls`) and produce a single ~5MB binary.