dci-tool 0.1.0

Direct Corpus Interaction: a sandboxed, ripgrep-backed corpus-search toolset and agent for cyber-focused LLM agents, built on rig.
# dci-tool

**Direct Corpus Interaction for Rig Agents**

[![Crates.io](https://img.shields.io/crates/v/dci-tool.svg)](https://crates.io/crates/dci-tool)
[![Documentation](https://docs.rs/dci-tool/badge.svg)](https://docs.rs/dci-tool)
[![CI](https://github.com/ForeverAngry/dci-tool/actions/workflows/ci.yml/badge.svg)](https://github.com/ForeverAngry/dci-tool/actions/workflows/ci.yml)
[![License: MIT OR Apache-2.0](https://img.shields.io/crates/l/dci-tool.svg)](#license)

`dci-tool` is a utility suite for [Rig](https://github.com/0xPlaygrounds/rig) that gives AI agents the ability to interact *directly* with raw textual corpora (codebases, logs, directory structures) using sandboxed commands like `search`, `find`, `read`, and `list`.

## Table of Contents

- [When should I use this?](#when-should-i-use-this)
- [Features](#features)
- [Installation](#installation)
- [Quickstart](#quickstart)
- [Feature Flags](#feature-flags)
- [CLI Usage](#cli-usage)
- [MCP Client Integration](#mcp-client-integration)
- [Safety Model](#safety-model)
- [Development](#development)
- [License](#license)

## When should I use this?

**Use this when vector embeddings are the wrong fit.**
Standard RAG embeds text into a vector database, which destroys structural context (like file organization) and struggles with precise lexical matches (like error codes, IP addresses, or exact variable names).
Use `dci-tool` when you need an agent to:
- **Investigate massive codebases or logs** by iterating with exact-match regex and path globs.
- **Produce deterministic evidence citations** mapped exactly to `path:line` format.
- **Examine local directory structures** directly without requiring a pre-indexing step.
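
To make the `path:line` citation format concrete, here is a minimal standard-library sketch. It does not use `dci-tool`'s API: `Citation` and `search_exact` are hypothetical names, the corpus is held in memory, and plain substring matching stands in for the regex engine.

```rust
/// One match, identified by where it was found (hypothetical type).
#[derive(Debug, PartialEq)]
struct Citation {
    path: String,
    line: usize, // 1-based line number
    text: String,
}

impl Citation {
    /// Render the citation in `path:line` form.
    fn render(&self) -> String {
        format!("{}:{}", self.path, self.line)
    }
}

/// Scan in-memory `(path, contents)` pairs for an exact substring.
fn search_exact(corpus: &[(&str, &str)], needle: &str) -> Vec<Citation> {
    let mut hits = Vec::new();
    for (path, contents) in corpus {
        for (idx, line) in contents.lines().enumerate() {
            if line.contains(needle) {
                hits.push(Citation {
                    path: path.to_string(),
                    line: idx + 1,
                    text: line.trim().to_string(),
                });
            }
        }
    }
    hits
}

fn main() {
    let corpus = [
        ("logs/auth.log", "ok user=alice\nERR_4012 denied ip=10.0.0.7\nok user=bob"),
        ("src/auth.rs", "fn check() {}\n// returns ERR_4012 on bad token"),
    ];
    // Each hit carries an exact location, so the same query always
    // yields the same citations.
    for hit in search_exact(&corpus, "ERR_4012") {
        println!("{}  {}", hit.render(), hit.text);
    }
}
```

Because the match is lexical, an error code such as `ERR_4012` is either on a line or it is not, which is what makes the resulting citations reproducible.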

## Features

- **In-process Ripgrep Engine**: Built on the `grep`, `ignore`, and `globset` crates. No subprocess is ever spawned, so there is no shell command to inject into.
- **Sandboxed `CorpusRoot`**: Path traversal, symlink escapes, and any operation outside the corpus root are denied. Each call is capped on duration, hits, files walked, and bytes returned.
- **Rig Tools**: Strongly typed `SearchTool`, `FindTool`, `ReadTool`, and `ListTool` implement `rig_core::tool::Tool`.
- **DCI Agent**: Brings everything together under `DciAgent` using any rig-compatible `CompletionModel`.
- **MCP Server** *(optional)*: Exposes the full agent over the Model Context Protocol as a stateful capability with session continuity.
- **Evaluation Harness** *(optional)*: Reuses `rig-retrieval-evals` to run deterministic head-to-head benchmarks of lexical DCI against semantic vector retrieval, scored with BEIR-style qrels files.
- **Telemetry** *(optional)*: Automatic latency, tool success, and token consumption observability via `rig-tap`.
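
The caps on duration, hits, files walked, and bytes returned can be pictured as a small budget object that a search loop consults as it goes. The sketch below is an illustration only: `Limits`, `Budget`, and `Stop` are hypothetical names, and the crate's actual types and defaults may differ.

```rust
use std::time::{Duration, Instant};

/// Hypothetical limits; the real crate's names and defaults may differ.
#[derive(Debug, Clone, Copy)]
struct Limits {
    max_duration: Duration,
    max_hits: usize,
    max_files: usize,
    max_bytes: usize,
}

/// Which cap ended the operation early.
#[derive(Debug, PartialEq)]
enum Stop {
    Timeout,
    Hits,
    Files,
    Bytes,
}

struct Budget {
    limits: Limits,
    started: Instant,
    hits: usize,
    files: usize,
    bytes: usize,
}

impl Budget {
    fn new(limits: Limits) -> Self {
        Budget { limits, started: Instant::now(), hits: 0, files: 0, bytes: 0 }
    }

    /// Call once per file, before opening it.
    fn enter_file(&mut self) -> Result<(), Stop> {
        if self.started.elapsed() > self.limits.max_duration {
            return Err(Stop::Timeout);
        }
        if self.files >= self.limits.max_files {
            return Err(Stop::Files);
        }
        self.files += 1;
        Ok(())
    }

    /// Call once per matching line, before adding it to the result.
    fn record_hit(&mut self, line_len: usize) -> Result<(), Stop> {
        if self.hits >= self.limits.max_hits {
            return Err(Stop::Hits);
        }
        if self.bytes + line_len > self.limits.max_bytes {
            return Err(Stop::Bytes);
        }
        self.hits += 1;
        self.bytes += line_len;
        Ok(())
    }
}

fn main() {
    let limits = Limits {
        max_duration: Duration::from_secs(5),
        max_hits: 2,
        max_files: 100,
        max_bytes: 4096,
    };
    let mut budget = Budget::new(limits);
    budget.enter_file().unwrap();
    for line in ["first match", "second match", "third match"] {
        match budget.record_hit(line.len()) {
            Ok(()) => println!("kept: {line}"),
            Err(stop) => {
                println!("stopped early: {stop:?}");
                break;
            }
        }
    }
}
```

Returning the reason for stopping, rather than silently truncating, lets a caller tell the agent that results were cut off and that it should narrow the query.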

## Installation

Add the library to your project:

```bash
cargo add dci-tool
```

Optional capabilities are gated behind [feature flags](#feature-flags). For example, to pull in the MCP server and evaluation harness:

```bash
cargo add dci-tool --features mcp,eval
```

To install the command-line binaries (`dci`, `dci-mcp`, `dci-eval`):

```bash
cargo install dci-tool --features cli,mcp,eval
```

## Quickstart

There are two main ways to use `dci-tool` with a Rig agent: use the prepackaged `DciAgent` wrapper, or bring the raw `CorpusTools` into your own custom agent.

### 1. The "Batteries Included" Agent (`DciAgent`)

The easiest way to get started is by using `DciAgent`. It automatically configures the agent with a pre-written investigation preamble (teaching it the search -> narrow -> read -> cite loop) and attaches the tools.

```rust
use dci_tool::{CorpusRoot, DciAgent};
use rig_core::providers::openai;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = openai::Client::from_env()?;
    let model = client.completion_model("gpt-4o");

    // 1. Define the safe root boundary that the agent is not allowed to escape
    let corpus = CorpusRoot::new("/path/to/my/project")?;

    // 2. Build the DCI agent
    let agent = DciAgent::builder(model, corpus)
        .max_turns(10) // Set tool iteration budget
        .build();

    // 3. Ask it a question
    let answer = agent.investigate("Where is the authentication bug?").await?;
    println!("{answer}");

    Ok(())
}
```

### 2. Manual Tool Integration (`CorpusTools`)

If you want complete control over your Rig agent's prompt, or if you just want to provide file system superpowers to an existing custom toolset, you can mount the `CorpusTools` manually onto an `AgentBuilder`:

```rust
use dci_tool::{CorpusRoot, tools::CorpusTools};
use rig_core::agent::AgentBuilder;
use rig_core::completion::Prompt; // brings `.prompt()` into scope
use rig_core::providers::openai;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = openai::Client::from_env()?;
    let model = client.completion_model("gpt-4o");

    let corpus = CorpusRoot::new("/path/to/my/project")?;
    
    // Instantiate the bundle of SearchTool, FindTool, ReadTool, and ListTool
    let tools = CorpusTools::new(corpus);

    // Build your standard rig Agent
    let agent = AgentBuilder::new(model)
        .preamble("You are an expert vulnerability scanner. Use the tools to inspect the local filesystem.")
        // Wire in the DCI tools manually
        .tool(tools.search)
        .tool(tools.find)
        .tool(tools.read)
        .tool(tools.list)
        .build();

    let response = agent.prompt("Find the auth bug")
        .max_turns(15) // Give it enough turns to search and read!
        .await?;
        
    println!("{}", response);

    Ok(())
}
```

## Feature Flags

All optional capabilities are off by default. Enable only what you need.

| Feature     | Default | Enables                                                                                          |
| ----------- | :-----: | ------------------------------------------------------------------------------------------------ |
| `cli`       |    No   | The `dci`, `dci-mcp`, and `dci-eval` command-line binaries.                                       |
| `mcp`       |    No   | Exposes the agent as a stateful Model Context Protocol server (`DciMcpService`).                  |
| `eval`      |    No   | Retrieval-quality benchmarking of lexical DCI vs. semantic vector retrieval via BEIR-style qrels. |
| `telemetry` |    No   | Latency, tool-success, and token-consumption observability through `rig-tap`.                    |

## CLI Usage

The crate includes `dci` and `dci-mcp` binaries (via the `cli` and `mcp` features).

```bash
# Ask a question across your local directory:
dci --corpus ./src --provider openai "Where is the authentication bug?"

# Serve as a stateful Model Context Protocol server over stdio:
dci-mcp --corpus ./src --provider anthropic --model claude-3-7-sonnet
```

## MCP Client Integration

Because `dci-mcp` implements the **Model Context Protocol**, you can plug it directly into AI clients like Claude Desktop, Cursor, or VS Code, giving them a stateful `dci_investigate` capability out of the box.

### Claude Desktop

Add this to your `claude_desktop_config.json`. Make sure to use an **absolute path** for your corpus directory:

```json
{
  "mcpServers": {
    "dci": {
      "command": "dci-mcp",
      "args": [
        "--corpus", "/absolute/path/to/project",
        "--provider", "openai"
      ],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}
```

### Agent Instructions (`.cursorrules` / `.github/copilot-instructions.md`)

Although an MCP server advertises its tool schema automatically, you can add this snippet to your workspace rules to aggressively nudge your IDE's agent to prefer `dci-tool` over its default search:

```md
When investigating massive logs, unknown directories, or complex codebase issues, DO NOT use your default workspace search. Instead, invoke the `dci_investigate` MCP tool to recursively interrogate the corpus. Always pass a stable `session_id` to maintain context across multiple turns during the same investigation.
```
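
The reason a stable `session_id` matters is that a stateful server can key its memory on it. The sketch below shows the idea with a plain map from session ID to earlier questions. `SessionStore` and `record` are hypothetical names, and what the real server keeps per session is not specified here.

```rust
use std::collections::HashMap;

/// Hypothetical per-session memory; the real server's state may differ.
#[derive(Default)]
struct SessionStore {
    sessions: HashMap<String, Vec<String>>,
}

impl SessionStore {
    /// Record a question under `session_id` and return how many earlier
    /// questions from the same session are available as context.
    fn record(&mut self, session_id: &str, question: &str) -> usize {
        let history = self.sessions.entry(session_id.to_string()).or_default();
        let prior = history.len();
        history.push(question.to_string());
        prior
    }
}

fn main() {
    let mut store = SessionStore::default();
    // Same ID: the second call sees the first one as context.
    println!("{}", store.record("incident-42", "Which hosts logged ERR_4012?"));
    println!("{}", store.record("incident-42", "Show the surrounding lines."));
    // Different ID: starts from an empty history.
    println!("{}", store.record("other", "List the top-level directories."));
}
```

If the client generated a fresh ID on every call, each request would land in a new, empty history and the investigation would restart from scratch.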

## Safety Model

1. Every caller-supplied path is canonicalized, and the result must stay under the initialized `CorpusRoot` (the root must be a prefix of the canonical path).
2. Symlinks are resolved before the boundary check, so a link that points outside the root is rejected.
3. All operations are read-only; no `fs::write` or `fs::remove_*` APIs are used anywhere.
4. The tools perform no network access.
5. Every operation is bounded by elapsed wall-clock time, total files processed, and maximum results gathered.
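
Points 1 and 2 can be sketched with the standard library alone: `std::fs::canonicalize` resolves `..` components and symlinks, and `Path::starts_with` then compares whole path components against the root. `Root` is a hypothetical stand-in, not the crate's `CorpusRoot`, and the demo writes a throwaway fixture under the system temp directory only so that there is something to resolve.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for a sandbox root; not the crate's `CorpusRoot`.
struct Root {
    canonical: PathBuf,
}

impl Root {
    fn new(path: impl AsRef<Path>) -> io::Result<Self> {
        Ok(Root { canonical: fs::canonicalize(path)? })
    }

    /// Resolve `relative` against the root, following symlinks and `..`,
    /// then accept it only if the result is still inside the root.
    fn resolve(&self, relative: impl AsRef<Path>) -> io::Result<PathBuf> {
        let resolved = fs::canonicalize(self.canonical.join(relative))?;
        if resolved.starts_with(&self.canonical) {
            Ok(resolved)
        } else {
            Err(io::Error::new(
                io::ErrorKind::PermissionDenied,
                "path escapes corpus root",
            ))
        }
    }
}

fn main() -> io::Result<()> {
    // Demo fixture only: a small tree with one file inside the root
    // and one file next to it.
    let base = std::env::temp_dir().join(format!("dci-sandbox-demo-{}", std::process::id()));
    fs::create_dir_all(base.join("corpus/src"))?;
    fs::write(base.join("corpus/src/lib.rs"), "fn main() {}")?;
    fs::write(base.join("secret.txt"), "outside the root")?;

    let root = Root::new(base.join("corpus"))?;
    println!("inside accepted: {}", root.resolve("src/lib.rs").is_ok());
    println!("escape accepted: {}", root.resolve("../secret.txt").is_ok());

    fs::remove_dir_all(&base)?;
    Ok(())
}
```

Checking the prefix after canonicalization, rather than on the raw string, is what closes the `..` and symlink routes: both are already resolved by the time the comparison runs.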

## Development

This repo uses [`just`](https://github.com/casey/just) as a task runner. Run `just` to list recipes.

```bash
just check   # fmt + clippy + tests + msrv + doc + publish dry-run
just test    # run the test suite across feature combinations
just doc     # build rustdoc with strict warnings
```

Releases are automated with [release-plz](https://release-plz.dev/), driven by
[Conventional Commits](https://www.conventionalcommits.org/). Merging to `master`
opens a release PR; merging that PR publishes to crates.io and tags the release.

## License

Licensed under either of

- Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or <http://www.apache.org/licenses/LICENSE-2.0>)
- MIT License ([LICENSE-MIT](LICENSE-MIT) or <http://opensource.org/licenses/MIT>)

at your option.

Unless you explicitly state otherwise, any contribution intentionally submitted
for inclusion in the work by you, as defined in the Apache-2.0 license, shall be
dual licensed as above, without any additional terms or conditions.