impactsense-parser 0.1.1

Multi-language static analysis: parse codebases into an in-memory dependency graph for impact analysis
# ImpactSense Parser

A multi-language static analysis tool written in Rust that parses source code using [Tree-Sitter](https://tree-sitter.github.io/tree-sitter/), extracts structural symbols (files, classes, functions, API endpoints), and builds a dependency graph in [Neo4j](https://neo4j.com/) for impact analysis.

Given a codebase, it answers questions like *"If I change this class, which functions and files are affected?"* by constructing a queryable graph of code relationships.

## Supported Languages

| Language   | Parsing | Classes | Call Graph | File Dependencies | API Endpoints |
|------------|---------|---------|------------|-------------------|---------------|
| Java       | Full AST | Yes    | Partial    | Yes (imports)     | Spring        |
| C#         | Full AST | Yes    | Partial    | Yes (`using`)     | ASP.NET       |
| Go         | Full AST | Yes    | Partial    | Yes (`import`)    | Chi/Gin/Echo  |
| Erlang     | AST      | Module | Yes        | Yes               | Cowboy        |
| JavaScript | Full AST | No     | Intra-file | Yes (imports)     | No            |
| TypeScript | Full AST | No     | Intra-file | Yes (imports)     | No            |
| Python     | Full AST | No     | Intra-file | Yes (imports)     | No            |
| Rust       | Full AST | No     | Intra-file | Yes (`use`)       | No            |

## Architecture

```
                          ┌──────────────────────┐
                          │   CLI  (main.rs)     │
                          │   clap arg parsing   │
                          └──────────┬───────────┘
                                     │
                                     ▼
                          ┌──────────────────────┐
                          │  scanner.rs          │
                          │  walkdir + rayon     │
                          │  parallel file parse │
                          └──────────┬───────────┘
                                     │  Vec<ParsedFile>
                    ┌────────────────┼────────────────┐
                    ▼                                 ▼
         ┌───────────────────┐             ┌────────────────────┐
         │  JSON output      │             │  graph.rs          │
         │  (--output-json)  │             │  Neo4j persistence │
         │  AST summaries    │             │  (--push-to-neo4j) │
         └───────────────────┘             └────────────────────┘
```

1. **Scan** — Recursively walks the target directory, identifies source files by extension, and filters by max file size.
2. **Parse** — Each file is parsed in parallel (via Rayon) using Tree-Sitter grammars, producing an AST per file.
3. **Extract** — Language-specific extractors pull out classes, functions, imports, call sites, API endpoints, and external API references.
4. **Persist** — Extracted symbols and relationships are written to Neo4j as a labeled property graph. Relationships are batched (3000 edges per flush) to reduce round-trips.
5. **Post-process** — `SAME_API` edges are created between internal `ApiEndpoint` nodes and `ExternalApi` nodes that share a normalized path.
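
The batching described in step 4 can be sketched roughly as follows. This is a minimal illustration and not the crate's actual code: `BATCH_SIZE` and `flush_in_batches` are hypothetical names, and the `write_batch` callback stands in for whatever issues the real Neo4j write.

```rust
/// Edges per flush, taken from the figure quoted in step 4.
pub const BATCH_SIZE: usize = 3000;

/// Hand `edges` to `write_batch` in groups of at most `BATCH_SIZE`.
/// Returns the number of flushes performed (hypothetical helper).
pub fn flush_in_batches<T>(edges: &[T], mut write_batch: impl FnMut(&[T])) -> usize {
    let mut flushes = 0;
    // `chunks` yields full batches followed by one shorter remainder, if any.
    for batch in edges.chunks(BATCH_SIZE) {
        write_batch(batch);
        flushes += 1;
    }
    flushes
}
```

With 7000 pending edges this performs three round-trips instead of 7000 single-edge writes.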

## Prerequisites

- **Rust** (edition 2024) — install via [rustup](https://rustup.rs/)
- **Neo4j 5** — run via Docker (see below)
- **C compiler** — required by `build.rs` to compile the vendored Erlang Tree-Sitter grammar

## Installation

```bash
git clone http://git.redbus.com/sujal.v/impactdependency.git
cd impactdependency/parser
cargo build --release
```

The build step compiles the vendored Erlang grammar from `vendor/tree-sitter-erlang/` via `build.rs`.

## Neo4j Setup

Start a Neo4j 5 instance with Docker:

```bash
docker run -d \
  --name neo4j-parser \
  -p 7474:7474 \
  -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/parser1234 \
  neo4j:5
```

The Neo4j Browser will be available at `http://localhost:7474/`.

## Usage

### Basic — parse and output JSON

```bash
cargo run -- /path/to/repo --output-json parsed_output.json
```

### Parse and push to Neo4j

```bash
cargo run -- /path/to/repo \
  --output-json parsed_output.json \
  --push-to-neo4j
```

### Full options with custom Neo4j credentials

```bash
cargo run -- /path/to/repo \
  --output-json parsed_output.json \
  --push-to-neo4j \
  --clean \
  --neo4j-uri bolt://localhost:7687 \
  --neo4j-user neo4j \
  --neo4j-password myStrongPass123
```

### CLI Reference

| Argument             | Type   | Default                  | Description                                |
|----------------------|--------|--------------------------|--------------------------------------------|
| `ROOT`               | path   | *(required)*             | Root directory to scan                     |
| `--output-json`      | path   | *(none)*                 | Write AST summaries to a JSON file         |
| `--push-to-neo4j`    | flag   | `false`                  | Push the parsed graph into Neo4j           |
| `--clean`            | flag   | `false`                  | Delete all existing nodes before pushing   |
| `--neo4j-uri`        | string | `bolt://localhost:7688`  | Neo4j Bolt URI (the Docker example above publishes Bolt on port 7687) |
| `--neo4j-user`       | string | `neo4j`                  | Neo4j username                             |
| `--neo4j-password`   | string | `parser1234`             | Neo4j password                             |
| `--follow-symlinks`  | flag   | `false`                  | Follow symbolic links during traversal     |
| `--max-file-size`    | bytes  | 2 MiB                    | Skip files larger than this                |
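
As a rough sketch of how the scan step might combine extension detection with the `--max-file-size` limit: the extension-to-language mapping and the names `language_for` and `should_parse` are assumptions for illustration, not the crate's language registry.

```rust
use std::path::Path;

/// Default size cap from the CLI table: 2 MiB.
pub const DEFAULT_MAX_FILE_SIZE: u64 = 2 * 1024 * 1024;

/// Map a file extension to a language name (assumed mapping).
pub fn language_for(path: &Path) -> Option<&'static str> {
    match path.extension()?.to_str()? {
        "java" => Some("java"),
        "cs" => Some("csharp"),
        "go" => Some("go"),
        "erl" | "hrl" => Some("erlang"),
        "js" | "jsx" => Some("javascript"),
        "ts" | "tsx" => Some("typescript"),
        "py" => Some("python"),
        "rs" => Some("rust"),
        _ => None,
    }
}

/// A file is parsed only if its language is recognized and its size
/// does not exceed the limit.
pub fn should_parse(path: &Path, size_bytes: u64, max_file_size: u64) -> bool {
    language_for(path).is_some() && size_bytes <= max_file_size
}
```

In a real walker, `size_bytes` would typically come from `std::fs::metadata(path)?.len()`.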

## Graph Schema

### Node Types

| Label          | Key Properties                                           |
|----------------|----------------------------------------------------------|
| `File`         | `path`, `language`, `framework?`, `project_name?`, `is_test?` |
| `Module`       | `name`, `path`, `language` (Erlang modules)              |
| `Class`        | `name`, `fqn`, `path`, `language?`, `project_name?`     |
| `Function`     | `name`, `fqn`, `path`, `language`, `arity?`, `return_type?`, `param_count?` |
| `ApiEndpoint`  | `methods[]`, `path`, `norm_path?`, `framework?`         |
| `ExternalApi`  | `name`, `base_url?`, `path?`, `norm_path?`, `provider?` |

### Relationships

```
(:File)-[:DECLARES_MODULE]->(:Module)
(:File)-[:DECLARES_CLASS]->(:Class)
(:File)-[:DECLARES_FUNCTION]->(:Function)
(:Class)-[:DECLARES_FUNCTION]->(:Function)
(:Module)-[:DECLARES_FUNCTION]->(:Function)
(:File)-[:DEPENDS_ON_FILE]->(:File)
(:Function)-[:CALLS_FUNCTION]->(:Function)
(:Function)-[:USES_CLASS]->(:Class)
(:ApiEndpoint)-[:HANDLED_BY]->(:Function)
(:Function)-[:CALLS_EXTERNAL_API]->(:ExternalApi)
(:ApiEndpoint)-[:SAME_API]->(:ExternalApi)
```
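
To illustrate how `SAME_API` edges could be derived from a shared `norm_path`, here is a small sketch. The normalization rules (lowercasing, dropping the query string and trailing slash, collapsing `{id}`, `:id`, and `<id>` segments into `{}`) are assumptions, since the actual rules are not documented here, and `normalize_path` and `same_api_edges` are hypothetical names.

```rust
use std::collections::HashMap;

/// Normalize a route so that different parameter syntaxes compare equal.
/// The rules here are assumed for illustration.
pub fn normalize_path(raw: &str) -> String {
    let without_query = raw.split('?').next().unwrap_or("");
    let segments: Vec<String> = without_query
        .split('/')
        .filter(|s| !s.is_empty())
        .map(|s| {
            let is_param = s.starts_with(':')
                || (s.starts_with('{') && s.ends_with('}'))
                || (s.starts_with('<') && s.ends_with('>'));
            if is_param { "{}".to_string() } else { s.to_lowercase() }
        })
        .collect();
    format!("/{}", segments.join("/"))
}

/// Pair each endpoint index with each external-API index that has the
/// same normalized path. Each pair represents one SAME_API edge.
pub fn same_api_edges(endpoints: &[&str], external: &[&str]) -> Vec<(usize, usize)> {
    // Index external APIs by normalized path for a single lookup per endpoint.
    let mut by_norm: HashMap<String, Vec<usize>> = HashMap::new();
    for (j, path) in external.iter().enumerate() {
        by_norm.entry(normalize_path(path)).or_default().push(j);
    }
    let mut edges = Vec::new();
    for (i, path) in endpoints.iter().enumerate() {
        if let Some(matches) = by_norm.get(&normalize_path(path)) {
            edges.extend(matches.iter().map(|&j| (i, j)));
        }
    }
    edges
}
```

Under these assumed rules, a Spring-style `/orders/{id}` endpoint and a client call to `/orders/:orderId` would be linked.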

## Example Queries

Once the graph is in Neo4j, you can run Cypher queries for impact analysis:

```cypher
// Which functions call OrderDetail.setAmenities?
MATCH (caller:Function)-[:CALLS_FUNCTION]->(target:Function {name: "setAmenities"})
WHERE target.fqn CONTAINS "OrderDetail"
RETURN caller.fqn, caller.path

// Which files depend on OrderDetail.java?
MATCH (f:File)-[:DEPENDS_ON_FILE]->(dep:File)
WHERE dep.path CONTAINS "OrderDetail.java"
RETURN f.path

// All functions reachable within 3 hops from a given function
MATCH path = (start:Function {name: "processOrder"})-[:CALLS_FUNCTION*1..3]->(downstream:Function)
RETURN downstream.fqn, length(path) AS depth

// API endpoints and their handler functions
MATCH (ep:ApiEndpoint)-[:HANDLED_BY]->(fn:Function)
RETURN ep.path, ep.methods, fn.fqn
```

## MCP Server Integration

The parser ships with a [FastMCP](https://github.com/jlowin/fastmcp) server so it can be invoked as a tool from Cursor IDE or any MCP-compatible client.

### Setup

```bash
cd parser/mcp
pip install -r requirements.txt
python main.py
```

The MCP server exposes a `parse_repository` tool with parameters matching the CLI arguments. It runs `cargo run` as a subprocess, pipes progress logs to stderr (to keep the JSON-RPC stdout channel clean), and returns the parse results.

### Tool: `parse_repository`

| Parameter          | Type   | Description                        |
|--------------------|--------|------------------------------------|
| `root_path`        | string | Directory to parse                 |
| `follow_symlinks`  | bool   | Follow symlinks                    |
| `max_file_size`    | int    | Max file size in bytes             |
| `push_to_neo4j`    | bool   | Push graph to Neo4j                |
| `neo4j_uri`        | string | Neo4j Bolt URI                     |
| `neo4j_user`       | string | Neo4j username                     |
| `neo4j_password`   | string | Neo4j password                     |

## Project Structure

```
parser/
├── Cargo.toml                 # Rust dependencies and build config
├── build.rs                   # Compiles vendored Erlang grammar (C → .a)
├── graph_schema.md            # Neo4j node/relationship schema reference
├── src/
│   ├── main.rs                # CLI entry point (clap)
│   ├── lib.rs                 # Language registry and Tree-Sitter wrapper
│   ├── scanner.rs             # Directory walker + parallel parser
│   ├── graph.rs               # Symbol extraction + Neo4j persistence
│   ├── edge.rs                # Relationship type enum
│   ├── schema.rs              # Node labels and property constants
│   ├── ir.rs                  # Intermediate representation for serialization
│   └── erlang.rs              # FFI binding for vendored Erlang grammar
├── vendor/
│   └── tree-sitter-erlang/    # Vendored Erlang Tree-Sitter grammar (C source)
├── mcp/
│   ├── main.py                # MCP server entry point
│   ├── app.py                 # FastMCP app definition
│   ├── services/
│   │   └── parser_service.py  # Subprocess runner for cargo
│   ├── tools/
│   │   └── parser_tools.py    # parse_repository tool definition
│   └── requirements.txt       # Python dependencies
└── prompts/                   # Prompt templates for MCP tool usage
```

## Known Limitations

- **Java imports** are filtered to `com.redbus.genai.*` by default — other internal packages are not tracked.
- **C# and Go** lack file-level dependency edges (`DEPENDS_ON_FILE`).
- **Erlang** uses regex-based text parsing instead of the Tree-Sitter AST for function extraction.
- **JS, TS, Python, Rust** only extract top-level functions (no classes); call graphs are intra-file and file dependencies are best-effort from imports/`use`.
- **Class inheritance** (`extends`/`implements`) is not tracked for any language.
- **Neo4j writes are sequential** per file, which can be slow for large codebases (10k+ files).
- **No incremental parsing in CLI** — the full codebase is re-parsed on every CLI run (MCP server supports incremental file-watcher updates).

See `shortcomings.txt` for a detailed analysis.

---

## Client-side library (in-memory graph)

Add from [crates.io](https://crates.io/crates/impactsense-parser):

```toml
[dependencies]
impactsense-parser = "0.1"
```

The `impactsense-parser` crate builds an **`InMemoryGraph`** in RAM with indexed queries for IDE/MCP use. Optional **RedCompressor** integration stores Zstd `code_bytes` on symbols in [`ProjectIr`](src/ir.rs) (same HTTP API as the Neo4j server path).

```rust
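use impactsense_parser::parse_project;
use impactsense_parser::pipeline::ScanOptions;
use impactsense_parser::store::GraphStore;

// Wrapped in `main` so `?` compiles; assumes the error type converts into Box<dyn Error>.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let scan = ScanOptions::default(); // compression on by default (RedCompressor HTTP API)

    let graph = parse_project("/path/to/repo", &scan)?;
    // Direct callers and transitive impact, both looked up by fully qualified name.
    let callers = graph.callers("com.example.OrderService.create");
    let impact = graph.impact("com.example.OrderService.create", Default::default());
    let _ = (callers, impact); // inspect or print these as needed
    Ok(())
}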
```

Export IR as JSON from the CLI:

```bash
cargo install impactsense-parser
impactsense-parser /path/to/repo --output-json project_ir.json
```

### Cargo features

| Feature | Default | Description |
|---------|---------|-------------|
| `neo4j` | yes | Neo4j persistence (`--push-to-neo4j`, webhook) |
| `compressor` | no | Feature flag placeholder (compressor is always available via `CompressorConfig`) |

---

## Cursor MCP setup

One install gives you both the CLI and the MCP server:

```bash
cargo install impactsense-parser
```

Binaries are placed in `~/.cargo/bin/`:

- `impactsense-parser` — CLI
- `impactsense-mcp` — MCP server for Cursor

Create `.cursor/mcp.json` in your project:

```json
{
  "mcpServers": {
    "impactsense": {
      "command": "/Users/YOUR_USER/.cargo/bin/impactsense-mcp",
      "args": ["--root", "${workspaceFolder}"]
    }
  }
}
```

Replace `YOUR_USER` with your username, or run `which impactsense-mcp` after install to get the exact path.

Restart Cursor. The server parses your open workspace once at startup, then keeps the graph updated as you edit files.

Compression is on by default for MCP and the library. Disable with `IMPACTSENSE_COMPRESS_CODEBLOCKS=0` or CLI `--no-compress-codeblocks`. Override the API URL with `REDCOMPRESSOR_URL` (default `http://10.166.1.220:8787`).
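
A sketch of how these two environment variables might be resolved into settings. The struct and function names are hypothetical, and it assumes that only the literal value `0` disables compression, which is the one case stated above.

```rust
/// Default RedCompressor endpoint quoted above.
pub const DEFAULT_REDCOMPRESSOR_URL: &str = "http://10.166.1.220:8787";

#[derive(Debug, PartialEq)]
pub struct CompressionSettings {
    pub enabled: bool,
    pub url: String,
}

/// Resolve settings from raw variable values (`None` means unset).
/// Assumption: only "0" turns compression off; anything else leaves it on.
pub fn resolve_compression(compress: Option<&str>, url: Option<&str>) -> CompressionSettings {
    CompressionSettings {
        enabled: compress.map(str::trim) != Some("0"),
        url: url
            .map(str::trim)
            .filter(|u| !u.is_empty())
            .unwrap_or(DEFAULT_REDCOMPRESSOR_URL)
            .to_string(),
    }
}

/// Read both variables from the process environment.
pub fn compression_from_env() -> CompressionSettings {
    let compress = std::env::var("IMPACTSENSE_COMPRESS_CODEBLOCKS").ok();
    let url = std::env::var("REDCOMPRESSOR_URL").ok();
    resolve_compression(compress.as_deref(), url.as_deref())
}
```

Keeping the pure `resolve_compression` separate from the environment lookup makes the defaulting logic easy to test without mutating process state.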

### MCP tools

| Tool | Description |
|------|-------------|
| `find_symbol` | Search by name or FQN substring |
| `callers` / `callees` | Direct call graph neighbors |
| `file_dependencies` | Import/file deps for a path |
| `symbols_in_file` | Declared symbols in one file |
| `impact_analysis` | Transitive callers (bounded depth) |
| `graph_stats` | Node/edge counts |
| `explain_symbol_logic` | Decompressed implementation source for a symbol FQN (use when you need to read what code *does*, not only who calls it) |

Use **`explain_symbol_logic`** when an agent needs the body of a function, class, module, or property. Use **`callers`** / **`callees`** / **`impact_analysis`** for dependency and blast-radius questions. Optional `include_callers` / `include_callees` on explain attach direct neighbors for functions.

The graph lives in MCP process memory. Restart MCP/Cursor to re-bootstrap after large branch switches.
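
For intuition, the bounded-depth transitive-caller walk behind `impact_analysis` can be sketched as a breadth-first search over reverse call edges. This is an illustrative stand-in, not the crate's `InMemoryGraph`: `CallGraph`, `add_call`, and this `impact` signature are hypothetical.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Minimal call graph keyed by FQN; `callers_of[x]` lists functions that call `x`.
#[derive(Default)]
pub struct CallGraph {
    callers_of: HashMap<String, Vec<String>>,
}

impl CallGraph {
    /// Record that `caller` calls `callee`.
    pub fn add_call(&mut self, caller: &str, callee: &str) {
        self.callers_of
            .entry(callee.to_string())
            .or_default()
            .push(caller.to_string());
    }

    /// Breadth-first walk over reverse call edges, stopping at `max_depth`.
    /// Returns (fqn, depth) pairs in discovery order, excluding `start`.
    pub fn impact(&self, start: &str, max_depth: usize) -> Vec<(String, usize)> {
        let mut seen: HashSet<&str> = HashSet::from([start]);
        let mut queue: VecDeque<(&str, usize)> = VecDeque::from([(start, 0)]);
        let mut out = Vec::new();
        while let Some((node, depth)) = queue.pop_front() {
            if depth == max_depth {
                continue; // depth budget exhausted on this branch
            }
            for caller in self.callers_of.get(node).into_iter().flatten() {
                // `insert` returns false for already-visited nodes, which guards against cycles.
                if seen.insert(caller.as_str()) {
                    out.push((caller.clone(), depth + 1));
                    queue.push_back((caller.as_str(), depth + 1));
                }
            }
        }
        out
    }
}
```

The visited set keeps recursive or mutually recursive functions from looping forever, and the depth bound limits how far the walk expands on large graphs.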