dci-tool 0.1.0

Direct Corpus Interaction: a sandboxed, ripgrep-backed corpus-search toolset and agent for cyber-focused LLM agents, built on rig.

dci-tool

Direct Corpus Interaction for Rig Agents

License: MIT OR Apache-2.0

dci-tool is a utility suite for Rig that gives AI agents the ability to interact directly with raw textual corpora (codebases, logs, directory structures) using sandboxed commands like search, find, read, and list.

When should I use this?

Use this when vector embeddings are the wrong fit. Standard RAG splits text into chunks and embeds them into a vector database, which discards structural context (such as file organization) and struggles with precise lexical matches (such as error codes, IP addresses, or exact variable names). Use dci-tool when you need an agent to:

  • Investigate massive codebases or logs by iterating with exact-match regex and path globs.
  • Produce deterministic evidence citations mapped exactly to path:line format.
  • Examine local directory structures directly without requiring a pre-indexing step.
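To make the path:line citation format concrete, here is a minimal sketch using only the standard library. It is not dci-tool's implementation (the crate searches with the ripgrep libraries); the `Citation` type and `cite_matches` function are hypothetical names chosen for illustration, and the match is a plain substring test rather than a regex.

```rust
/// One piece of evidence: where a match was found and the matching line.
/// (Hypothetical type for illustration, not part of dci-tool.)
#[derive(Debug, PartialEq)]
struct Citation {
    path: String,
    line: usize, // 1-based, as in editor and grep output
    text: String,
}

impl Citation {
    /// Render the citation in `path:line` form.
    fn label(&self) -> String {
        format!("{}:{}", self.path, self.line)
    }
}

/// Scan `contents` for lines containing the literal `needle` and
/// return one citation per matching line, in file order.
fn cite_matches(path: &str, contents: &str, needle: &str) -> Vec<Citation> {
    contents
        .lines()
        .enumerate()
        .filter(|(_, line)| line.contains(needle))
        .map(|(idx, line)| Citation {
            path: path.to_string(),
            line: idx + 1,
            text: line.trim().to_string(),
        })
        .collect()
}
```

Because the search is an exact lexical match, the same corpus and the same needle always produce the same citations, which is the property the list above refers to as deterministic evidence.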

Features

  • In-process Ripgrep Engine: Built on the grep, ignore, and globset crates and run in-process. No subprocess is spawned, so there is no shell command to inject into.
  • Sandboxed CorpusRoot: Path traversal, symlink escapes, and any operation outside the corpus root are denied. Caps are enforced on duration, hits, files walked, and bytes returned.
  • Rig Tools: Strongly typed SearchTool, FindTool, ReadTool, and ListTool implement rig_core::tool::Tool.
  • DCI Agent: Brings everything together under DciAgent using any rig-compatible CompletionModel.
  • MCP Server (optional): Exposes the full agent over the Model Context Protocol as a stateful capability with session continuity.
  • Evaluation Harness (optional): Re-uses rig-retrieval-evals to run deterministic head-to-head benchmarks of lexical DCI against semantic vector retrieval, scored with BEIR-style qrels files.
  • Telemetry (optional): Automatic latency, tool success, and token consumption observability via rig-tap.
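The caps mentioned under Sandboxed CorpusRoot (duration, hits, files walked, bytes returned) can be pictured as a budget that is charged as a search proceeds. The sketch below is an assumption about how such limits might be tracked, not the crate's actual code; `Budget`, `Exceeded`, and the method names are hypothetical.

```rust
use std::time::{Duration, Instant};

/// Which limit stopped the operation. (Hypothetical, for illustration.)
#[derive(Debug, PartialEq)]
enum Exceeded {
    Duration,
    Hits,
    Files,
    Bytes,
}

/// Remaining allowance for one search call.
struct Budget {
    deadline: Instant,
    hits_left: usize,
    files_left: usize,
    bytes_left: usize,
}

impl Budget {
    fn new(timeout: Duration, max_hits: usize, max_files: usize, max_bytes: usize) -> Self {
        Budget {
            deadline: Instant::now() + timeout,
            hits_left: max_hits,
            files_left: max_files,
            bytes_left: max_bytes,
        }
    }

    /// Fail once the wall-clock deadline has passed.
    fn check_time(&self) -> Result<(), Exceeded> {
        if Instant::now() >= self.deadline {
            Err(Exceeded::Duration)
        } else {
            Ok(())
        }
    }

    /// Call before opening each file during the walk.
    fn visit_file(&mut self) -> Result<(), Exceeded> {
        self.check_time()?;
        self.files_left = self.files_left.checked_sub(1).ok_or(Exceeded::Files)?;
        Ok(())
    }

    /// Call before returning a hit whose text is `len` bytes long.
    /// The budget is only charged if both the hit and byte caps allow it.
    fn record_hit(&mut self, len: usize) -> Result<(), Exceeded> {
        self.check_time()?;
        let hits = self.hits_left.checked_sub(1).ok_or(Exceeded::Hits)?;
        let bytes = self.bytes_left.checked_sub(len).ok_or(Exceeded::Bytes)?;
        self.hits_left = hits;
        self.bytes_left = bytes;
        Ok(())
    }
}
```

Returning an error value instead of silently truncating lets the caller tell the agent which cap was hit, so it can narrow the query rather than assume the results were complete.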

Installation

Add the library to your project:

cargo add dci-tool

Optional capabilities are gated behind feature flags. For example, to pull in the MCP server and evaluation harness:

cargo add dci-tool --features mcp,eval

To install the command-line binaries (dci, dci-mcp, dci-eval):

cargo install dci-tool --features cli,mcp,eval

Quickstart

There are two main ways to use dci-tool with a Rig agent: use the prepackaged DciAgent wrapper, or mount the raw CorpusTools on your own custom agent.

1. The "Batteries Included" Agent (DciAgent)

The easiest way to get started is by using DciAgent. It automatically configures the agent with a pre-written investigation preamble (teaching it the search -> narrow -> read -> cite loop) and attaches the tools.

use dci_tool::{CorpusRoot, DciAgent};
use rig_core::providers::openai;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = openai::Client::from_env()?;
    let model = client.completion_model("gpt-4o");

    // 1. Define the safe root boundary that the agent is not allowed to escape
    let corpus = CorpusRoot::new("/path/to/my/project")?;

    // 2. Build the DCI agent
    let agent = DciAgent::builder(model, corpus)
        .max_turns(10) // Set tool iteration budget
        .build();

    // 3. Ask it a question
    let answer = agent.investigate("Where is the authentication bug?").await?;
    println!("{answer}");

    Ok(())
}

2. Manual Tool Integration (CorpusTools)

If you want complete control over your Rig agent's prompt, or if you just want to provide file system superpowers to an existing custom toolset, you can mount the CorpusTools manually onto an AgentBuilder:

use dci_tool::{CorpusRoot, tools::CorpusTools};
use rig_core::agent::AgentBuilder;
use rig_core::completion::Prompt; // trait that provides `agent.prompt(...)`
use rig_core::providers::openai;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = openai::Client::from_env()?;
    let model = client.completion_model("gpt-4o");

    let corpus = CorpusRoot::new("/path/to/my/project")?;
    
    // Instantiate the bundle of SearchTool, FindTool, ReadTool, and ListTool
    let tools = CorpusTools::new(corpus);

    // Build your standard rig Agent
    let agent = AgentBuilder::new(model)
        .preamble("You are an expert vulnerability scanner. Use the tools to inspect the local filesystem.")
        // Wire in the DCI tools manually
        .tool(tools.search)
        .tool(tools.find)
        .tool(tools.read)
        .tool(tools.list)
        .build();

    let response = agent.prompt("Find the auth bug")
        .max_turns(15) // Give it enough turns to search and read!
        .await?;
        
    println!("{}", response);

    Ok(())
}

Feature Flags

All optional capabilities are off by default. Enable only what you need.

Feature   | Default | Enables
cli       | No      | The dci, dci-mcp, and dci-eval command-line binaries.
mcp       | No      | Exposes the agent as a stateful Model Context Protocol server (DciMcpService).
eval      | No      | Retrieval-quality benchmarking of lexical DCI vs. semantic vector retrieval via BEIR-style qrels.
telemetry | No      | Latency, tool-success, and token-consumption observability through rig-tap.
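For readers unfamiliar with qrels, the following sketch shows the kind of scoring the eval feature describes. It assumes the usual BEIR layout (a tab-separated file with a `query-id`, `corpus-id`, `score` header row) and computes recall@k; it does not use or reproduce the rig-retrieval-evals API, and `parse_qrels` and `recall_at_k` are hypothetical helper names.

```rust
use std::collections::{HashMap, HashSet};

/// Parse BEIR-style qrels: tab-separated `query-id`, `corpus-id`, `score`,
/// preceded by a header row. Only rows with a positive score are kept.
fn parse_qrels(tsv: &str) -> HashMap<String, HashSet<String>> {
    let mut qrels: HashMap<String, HashSet<String>> = HashMap::new();
    for line in tsv.lines().skip(1) {
        let cols: Vec<&str> = line.split('\t').collect();
        if cols.len() != 3 {
            continue; // skip blank or malformed rows
        }
        let relevant = cols[2].trim().parse::<i64>().map_or(false, |s| s > 0);
        if relevant {
            qrels
                .entry(cols[0].to_string())
                .or_default()
                .insert(cols[1].to_string());
        }
    }
    qrels
}

/// Fraction of the relevant documents that appear in the top `k` of `ranked`.
fn recall_at_k(relevant: &HashSet<String>, ranked: &[&str], k: usize) -> f64 {
    if relevant.is_empty() {
        return 0.0;
    }
    // Collect the top-k ids into a set so duplicates are counted once.
    let top: HashSet<&str> = ranked.iter().take(k).copied().collect();
    let found = relevant.iter().filter(|id| top.contains(id.as_str())).count();
    found as f64 / relevant.len() as f64
}
```

Running the same function over the ranked output of a lexical retriever and a vector retriever gives a like-for-like comparison against the same relevance judgments.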

CLI Usage

The crate ships the dci, dci-mcp, and dci-eval binaries. Install them with the cli feature, plus mcp and eval, as shown under Installation.

# Ask a question across your local directory:
dci --corpus ./src --provider openai "Where is the authentication bug?"

# Serve as a stateful Model Context Protocol server over stdio:
dci-mcp --corpus ./src --provider anthropic --model claude-3-7-sonnet

MCP Client Integration

Because dci-mcp implements the Model Context Protocol, you can plug it directly into AI clients like Claude Desktop, Cursor, or VS Code, giving them a stateful dci_investigate capability out of the box.

Claude Desktop

Add this to your claude_desktop_config.json. Make sure to use an absolute path for your corpus directory:

{
  "mcpServers": {
    "dci": {
      "command": "dci-mcp",
      "args": [
        "--corpus", "/absolute/path/to/project",
        "--provider", "openai"
      ],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

Agent Instructions (.cursorrules / .github/copilot-instructions.md)

Although an MCP server advertises its tool schema automatically, you can add this snippet to your workspace rules to nudge your IDE's agent to prefer dci-tool over its default search:

When investigating massive logs, unknown directories, or complex codebase issues, DO NOT use your default workspace search. Instead, invoke the `dci_investigate` MCP tool to recursively interrogate the corpus. Always pass a stable `session_id` to maintain context across multiple turns during the same investigation.
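The reason a stable `session_id` matters can be shown with a small sketch of session-keyed state. This is an illustration under assumptions, not how dci-mcp stores sessions internally; `SessionStore` and its methods are hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical per-session state: the questions asked so far,
/// keyed by the caller-supplied session id.
#[derive(Default)]
struct SessionStore {
    histories: HashMap<String, Vec<String>>,
}

impl SessionStore {
    /// Append `question` to the session's history and return
    /// the number of turns the session now holds.
    fn record(&mut self, session_id: &str, question: &str) -> usize {
        let history = self.histories.entry(session_id.to_string()).or_default();
        history.push(question.to_string());
        history.len()
    }

    /// Earlier turns for a session, or an empty slice for an unknown id.
    fn history(&self, session_id: &str) -> &[String] {
        self.histories
            .get(session_id)
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }
}
```

If the client sends a fresh id on every call, each request lands in an empty history and the earlier findings of the investigation are not available to the next turn.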

Safety Model

  1. Every caller-supplied path is canonicalized, and the canonical result must have the initialized CorpusRoot as a prefix.
  2. Symlinks are resolved before the boundary check, so a link that points outside the root is rejected.
  3. All operations are read-only. No fs::write or fs::remove_* APIs are used anywhere.
  4. Tools make no network calls.
  5. Every execution is bounded by elapsed wall-clock time, total files processed, and maximum results gathered.
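Points 1 and 2 can be sketched with the standard library alone. The function below is a minimal illustration of the canonicalize-then-check-prefix idea, not the crate's CorpusRoot code; `resolve_within` is a hypothetical name, and it assumes `root` has already been canonicalized.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Resolve `requested` relative to `root` and reject anything that ends up
/// outside it. `root` is assumed to be canonical already.
fn resolve_within(root: &Path, requested: &str) -> io::Result<PathBuf> {
    // canonicalize() resolves `..` components and follows symlinks, so the
    // prefix check below sees the real on-disk location. It also fails for
    // paths that do not exist.
    let resolved = fs::canonicalize(root.join(requested))?;

    // Path::starts_with compares whole components, so `/corpus-old`
    // is not treated as being inside `/corpus`.
    if resolved.starts_with(root) {
        Ok(resolved)
    } else {
        Err(io::Error::new(
            io::ErrorKind::PermissionDenied,
            "path escapes the corpus root",
        ))
    }
}
```

Doing the check on the canonical path rather than on the raw string is what defeats inputs such as `src/../../secret.txt` and symlinks that point outside the root. An absolute `requested` path replaces `root` in `join`, so it is rejected by the same prefix check unless it already lies inside the root.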

Development

This repo uses just as a task runner. Run just to list recipes.

just check   # fmt + clippy + tests + msrv + doc + publish dry-run
just test    # run the test suite across feature combinations
just doc     # build rustdoc with strict warnings

Releases are automated with release-plz, driven by Conventional Commits. Merging to master opens a release PR; merging that PR publishes to crates.io and tags the release.

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.