impactsense-parser 0.1.0

Multi-language static analysis: parse codebases into an in-memory dependency graph for impact analysis
# ImpactSense Parser

A multi-language static analysis tool written in Rust that parses source code using [Tree-Sitter](https://tree-sitter.github.io/tree-sitter/), extracts structural symbols (files, classes, functions, API endpoints), and builds a dependency graph in [Neo4j](https://neo4j.com/) for impact analysis.

Given a codebase, it answers questions like *"If I change this class, which functions and files are affected?"* by constructing a queryable graph of code relationships.

## Supported Languages

| Language   | Parsing | Classes | Call Graph | File Dependencies | API Endpoints |
|------------|---------|---------|------------|-------------------|---------------|
| Java       | Full AST | Yes    | Partial    | Yes (imports)     | Spring        |
| C#         | Full AST | Yes    | Partial    | No                | ASP.NET       |
| Go         | Full AST | Yes    | Partial    | No                | Chi/Gin/Echo  |
| Erlang     | Text     | Module | Approximate| Yes               | Cowboy        |
| JavaScript | Full AST | No     | No         | No                | No            |
| TypeScript | Full AST | No     | No         | No                | No            |
| Python     | Full AST | No     | No         | No                | No            |
| Rust       | Full AST | No     | No         | No                | No            |
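
As a rough illustration of how files might be routed to one of the languages above, the sketch below maps a file extension to a language. The extension list and the `Language` and `detect_language` names are assumptions for illustration; the parser's actual registry in `lib.rs` may differ.

```rust
use std::path::Path;

/// The eight languages listed in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Language {
    Java,
    CSharp,
    Go,
    Erlang,
    JavaScript,
    TypeScript,
    Python,
    Rust,
}

/// Map a path to a language by extension.
/// Assumption: conventional extensions only; the real registry may accept more.
fn detect_language(path: &Path) -> Option<Language> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "java" => Some(Language::Java),
        "cs" => Some(Language::CSharp),
        "go" => Some(Language::Go),
        "erl" => Some(Language::Erlang),
        "js" => Some(Language::JavaScript),
        "ts" => Some(Language::TypeScript),
        "py" => Some(Language::Python),
        "rs" => Some(Language::Rust),
        _ => None, // unknown extension: the file is skipped
    }
}

fn main() {
    for p in ["src/OrderDetail.java", "handler.go", "README.md"] {
        println!("{p}: {:?}", detect_language(Path::new(p)));
    }
}
```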

## Architecture

```
                          ┌──────────────────────┐
                          │   CLI  (main.rs)     │
                          │   clap arg parsing   │
                          └──────────┬───────────┘
                                     │
                                     ▼
                          ┌──────────────────────┐
                          │  scanner.rs          │
                          │  walkdir + rayon     │
                          │  parallel file parse │
                          └──────────┬───────────┘
                                     │
                              Vec<ParsedFile>
                                     │
                    ┌────────────────┴────────────────┐
                    ▼                                 ▼
         ┌───────────────────┐             ┌────────────────────┐
         │  JSON output      │             │  graph.rs          │
         │  (--output-json)  │             │  Neo4j persistence │
         │  AST summaries    │             │  (--push-to-neo4j) │
         └───────────────────┘             └────────────────────┘

1. **Scan** — Recursively walks the target directory, identifies source files by extension, and filters by max file size.
2. **Parse** — Each file is parsed in parallel (via Rayon) using Tree-Sitter grammars, producing an AST per file.
3. **Extract** — Language-specific extractors pull out classes, functions, imports, call sites, API endpoints, and external API references.
4. **Persist** — Extracted symbols and relationships are written to Neo4j as a labeled property graph. Relationships are batched (3000 edges per flush) to reduce round-trips.
5. **Post-process** — `SAME_API` edges are created between internal `ApiEndpoint` nodes and `ExternalApi` nodes that share a normalized path.
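
The batching in the Persist step can be pictured with a minimal sketch like the one below. It only shows the chunking idea: `Edge`, `flush_all`, and the closure standing in for a Neo4j round-trip are hypothetical names, not the code in `graph.rs`.

```rust
/// One relationship waiting to be written (hypothetical shape).
#[derive(Debug, Clone)]
struct Edge {
    from: String,
    rel: &'static str,
    to: String,
}

/// Batch size quoted in the Persist step.
const BATCH_SIZE: usize = 3000;

/// Hand the edges to `flush` in batches of at most `BATCH_SIZE`.
/// `flush` stands in for a single database round-trip.
/// Returns the number of flushes performed.
fn flush_all<F: FnMut(&[Edge])>(edges: &[Edge], mut flush: F) -> usize {
    let mut flushes = 0;
    for batch in edges.chunks(BATCH_SIZE) {
        flush(batch);
        flushes += 1;
    }
    flushes
}

fn main() {
    let edges: Vec<Edge> = (0..7000)
        .map(|i| Edge {
            from: format!("fn{i}"),
            rel: "CALLS_FUNCTION",
            to: format!("fn{}", i + 1),
        })
        .collect();

    let mut sizes = Vec::new();
    let flushes = flush_all(&edges, |batch| sizes.push(batch.len()));
    // 7000 edges -> batches of 3000, 3000 and 1000
    println!("{flushes} flushes, sizes {sizes:?}");
}
```

With 7000 edges this performs three flushes instead of 7000 individual writes, which is the round-trip saving the step describes.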

## Prerequisites

- **Rust** (edition 2024) — install via [rustup](https://rustup.rs/)
- **Neo4j 5** — run via Docker (see below)
- **C compiler** — required by `build.rs` to compile the vendored Erlang Tree-Sitter grammar

## Installation

```bash
git clone http://git.redbus.com/sujal.v/impactdependency.git
cd impactdependency/parser
cargo build --release
```

The build step compiles the vendored Erlang grammar from `vendor/tree-sitter-erlang/` via `build.rs`.

## Neo4j Setup

Start a Neo4j 5 instance with Docker:

```bash
docker run -d \
  --name neo4j-parser \
  -p 7474:7474 \
  -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/parser1234 \
  neo4j:5
```

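The Neo4j Browser will be available at `http://localhost:7474/`. With this port mapping, Bolt listens on `bolt://localhost:7687`; pass that value via `--neo4j-uri` if the CLI default points at a different port.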

## Usage

### Basic — parse and output JSON

```bash
cargo run -- /path/to/repo --output-json parsed_output.json
```

### Parse and push to Neo4j

```bash
cargo run -- /path/to/repo \
  --output-json parsed_output.json \
  --push-to-neo4j
```

### Full options with custom Neo4j credentials

```bash
cargo run -- /path/to/repo \
  --output-json parsed_output.json \
  --push-to-neo4j \
  --clean \
  --neo4j-uri bolt://localhost:7687 \
  --neo4j-user neo4j \
  --neo4j-password myStrongPass123
```

### CLI Reference

| Argument             | Type   | Default                  | Description                                |
|----------------------|--------|--------------------------|--------------------------------------------|
| `ROOT`               | path   | *(required)*             | Root directory to scan                     |
| `--output-json`      | path   | *(none)*                 | Write AST summaries to a JSON file         |
| `--push-to-neo4j`    | flag   | `false`                  | Push the parsed graph into Neo4j           |
| `--clean`            | flag   | `false`                  | Delete all existing nodes before pushing   |
| `--neo4j-uri`        | string | `bolt://localhost:7688`  | Neo4j Bolt URI                             |
| `--neo4j-user`       | string | `neo4j`                  | Neo4j username                             |
| `--neo4j-password`   | string | `parser1234`             | Neo4j password                             |
| `--follow-symlinks`  | flag   | `false`                  | Follow symbolic links during traversal     |
| `--max-file-size`    | bytes  | 2 MiB                    | Skip files larger than this                |
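
To make the `--max-file-size` default concrete, here is a small sketch of a size filter. The 2 MiB value comes from the table; the function names and the choice of an inclusive boundary are assumptions, not the scanner's actual behavior.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Default from the CLI table: 2 MiB, expressed in bytes.
const DEFAULT_MAX_FILE_SIZE: u64 = 2 * 1024 * 1024;

/// Pure size check. Treating the limit as inclusive is an assumption.
fn within_limit(size_bytes: u64, max_bytes: u64) -> bool {
    size_bytes <= max_bytes
}

/// Decide whether a path should be parsed: it must be a regular file
/// that is no larger than `max_bytes`.
fn should_parse(path: &Path, max_bytes: u64) -> io::Result<bool> {
    let meta = fs::metadata(path)?;
    Ok(meta.is_file() && within_limit(meta.len(), max_bytes))
}

fn main() -> io::Result<()> {
    // Write a tiny throwaway file so the example runs anywhere.
    let path = std::env::temp_dir().join("impactsense_size_demo.txt");
    fs::write(&path, b"fn main() {}")?;
    println!("parse? {}", should_parse(&path, DEFAULT_MAX_FILE_SIZE)?);
    fs::remove_file(&path)?;
    Ok(())
}
```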

## Graph Schema

### Node Types

| Label          | Key Properties                                           |
|----------------|----------------------------------------------------------|
| `File`         | `path`, `language`, `framework?`, `project_name?`, `is_test?` |
| `Module`       | `name`, `path`, `language` (Erlang modules)              |
| `Class`        | `name`, `fqn`, `path`, `language?`, `project_name?`     |
| `Function`     | `name`, `fqn`, `path`, `language`, `arity?`, `return_type?`, `param_count?` |
| `ApiEndpoint`  | `methods[]`, `path`, `norm_path?`, `framework?`         |
| `ExternalApi`  | `name`, `base_url?`, `path?`, `norm_path?`, `provider?` |

### Relationships

```
(:File)-[:DECLARES_MODULE]->(:Module)
(:File)-[:DECLARES_CLASS]->(:Class)
(:File)-[:DECLARES_FUNCTION]->(:Function)
(:Class)-[:DECLARES_FUNCTION]->(:Function)
(:Module)-[:DECLARES_FUNCTION]->(:Function)
(:File)-[:DEPENDS_ON_FILE]->(:File)
(:Function)-[:CALLS_FUNCTION]->(:Function)
(:Function)-[:USES_CLASS]->(:Class)
(:ApiEndpoint)-[:HANDLED_BY]->(:Function)
(:Function)-[:CALLS_EXTERNAL_API]->(:ExternalApi)
(:ApiEndpoint)-[:SAME_API]->(:ExternalApi)
```
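
The `SAME_API` relationship depends on two paths sharing a `norm_path`. The exact normalization rules are not documented here, so the sketch below assumes one plausible scheme: lowercase the static segments, drop empty segments, and replace `{param}`, `:param`, and purely numeric segments with a placeholder. `normalize_path` and `same_api_pairs` are hypothetical names.

```rust
use std::collections::HashMap;

/// Hypothetical normalization; the parser's real `norm_path` rules may differ.
fn normalize_path(path: &str) -> String {
    let segments: Vec<String> = path
        .split('/')
        .filter(|s| !s.is_empty())
        .map(|s| {
            let is_param = s.starts_with(':')
                || (s.starts_with('{') && s.ends_with('}'))
                || s.chars().all(|c| c.is_ascii_digit());
            if is_param {
                "{}".to_string()
            } else {
                s.to_ascii_lowercase()
            }
        })
        .collect();
    format!("/{}", segments.join("/"))
}

/// Pair each endpoint path with every external API path that has the same
/// normalized form. Each pair corresponds to one SAME_API edge.
fn same_api_pairs<'a>(
    endpoints: &[&'a str],
    externals: &[&'a str],
) -> Vec<(&'a str, &'a str)> {
    // Index external paths by normalized path for O(1) lookup.
    let mut by_norm: HashMap<String, Vec<&'a str>> = HashMap::new();
    for ext in externals {
        by_norm.entry(normalize_path(ext)).or_default().push(*ext);
    }
    let mut pairs = Vec::new();
    for ep in endpoints {
        if let Some(matches) = by_norm.get(&normalize_path(ep)) {
            for ext in matches {
                pairs.push((*ep, *ext));
            }
        }
    }
    pairs
}

fn main() {
    let endpoints = ["/orders/{orderId}/amenities"];
    let externals = ["/Orders/:id/amenities/", "/payments/42"];
    for (ep, ext) in same_api_pairs(&endpoints, &externals) {
        println!("SAME_API: {ep} <-> {ext}");
    }
}
```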

## Example Queries

Once the graph is in Neo4j, you can run Cypher queries for impact analysis:

```cypher
// Which functions call OrderDetail.setAmenities?
MATCH (caller:Function)-[:CALLS_FUNCTION]->(target:Function {name: "setAmenities"})
WHERE target.fqn CONTAINS "OrderDetail"
RETURN caller.fqn, caller.path

// Which files depend on OrderDetail.java?
MATCH (f:File)-[:DEPENDS_ON_FILE]->(dep:File)
WHERE dep.path CONTAINS "OrderDetail.java"
RETURN f.path

// All functions reachable within 3 hops from a given function
MATCH path = (start:Function {name: "processOrder"})-[:CALLS_FUNCTION*1..3]->(downstream:Function)
RETURN downstream.fqn, length(path) AS depth

// API endpoints and their handler functions
MATCH (ep:ApiEndpoint)-[:HANDLED_BY]->(fn:Function)
RETURN ep.path, ep.methods, fn.fqn
```
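
For intuition about the bounded-hop query, the following sketch walks `CALLS_FUNCTION` edges breadth-first over an in-memory adjacency map. It is not how Neo4j evaluates the pattern: Cypher returns one row per matching path, while this version reports each reachable function once, with its shortest hop count, and never reports the start node. The `reachable` function and the sample call graph are made up for illustration.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Breadth-first walk over call edges, at most `max_depth` hops from `start`.
/// Returns each reachable function with its shortest hop count.
fn reachable<'a>(
    calls: &HashMap<&'a str, Vec<&'a str>>,
    start: &'a str,
    max_depth: usize,
) -> Vec<(&'a str, usize)> {
    let mut seen: HashSet<&'a str> = HashSet::from([start]);
    let mut queue = VecDeque::from([(start, 0usize)]);
    let mut out = Vec::new();

    while let Some((node, depth)) = queue.pop_front() {
        if depth == max_depth {
            continue; // hop budget exhausted for this branch
        }
        for &callee in calls.get(node).into_iter().flatten() {
            // `insert` returns false for already-visited nodes, which
            // also protects against cycles in the call graph.
            if seen.insert(callee) {
                out.push((callee, depth + 1));
                queue.push_back((callee, depth + 1));
            }
        }
    }
    out
}

fn main() {
    let mut calls = HashMap::new();
    calls.insert("processOrder", vec!["validate", "save"]);
    calls.insert("validate", vec!["checkStock"]);
    calls.insert("checkStock", vec!["queryDb"]);
    calls.insert("queryDb", vec!["openSocket"]); // 4 hops away, not reported

    for (name, depth) in reachable(&calls, "processOrder", 3) {
        println!("{name} (depth {depth})");
    }
}
```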

## MCP Server Integration

The parser ships with a [FastMCP](https://github.com/jlowin/fastmcp) server so it can be invoked as a tool from Cursor IDE or any MCP-compatible client.

### Setup

```bash
cd parser/mcp
pip install -r requirements.txt
python main.py
```

The MCP server exposes a `parse_repository` tool with parameters matching the CLI arguments. It runs `cargo run` as a subprocess, pipes progress logs to stderr (to keep the JSON-RPC stdout channel clean), and returns the parse results.

### Tool: `parse_repository`

| Parameter          | Type   | Description                        |
|--------------------|--------|------------------------------------|
| `root_path`        | string | Directory to parse                 |
| `follow_symlinks`  | bool   | Follow symlinks                    |
| `max_file_size`    | int    | Max file size in bytes             |
| `push_to_neo4j`    | bool   | Push graph to Neo4j                |
| `neo4j_uri`        | string | Neo4j Bolt URI                     |
| `neo4j_user`       | string | Neo4j username                     |
| `neo4j_password`   | string | Neo4j password                     |
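
The real server is written in Python, but the mapping from tool parameters to CLI flags can be sketched in Rust for consistency with the rest of this README. Everything here is an assumption for illustration: the `ParseRequest` struct, which fields are optional, and the exact `cargo run --` invocation. The command is built but not spawned; stdout is captured for the result while stderr is inherited, so progress logs stay off the JSON-RPC channel.

```rust
use std::process::{Command, Stdio};

/// Mirrors the `parse_repository` parameters (optionality is assumed).
struct ParseRequest {
    root_path: String,
    follow_symlinks: bool,
    max_file_size: Option<u64>,
    push_to_neo4j: bool,
    neo4j_uri: Option<String>,
    neo4j_user: Option<String>,
    neo4j_password: Option<String>,
}

/// Translate tool parameters into the CLI arguments documented above.
fn to_cli_args(req: &ParseRequest) -> Vec<String> {
    let mut args = vec![req.root_path.clone()];
    if req.follow_symlinks {
        args.push("--follow-symlinks".to_string());
    }
    if let Some(size) = req.max_file_size {
        args.extend(["--max-file-size".to_string(), size.to_string()]);
    }
    if req.push_to_neo4j {
        args.push("--push-to-neo4j".to_string());
    }
    let optional = [
        ("--neo4j-uri", &req.neo4j_uri),
        ("--neo4j-user", &req.neo4j_user),
        ("--neo4j-password", &req.neo4j_password),
    ];
    for (flag, value) in optional {
        if let Some(v) = value {
            args.extend([flag.to_string(), v.clone()]);
        }
    }
    args
}

/// Build (but do not spawn) the subprocess command.
fn build_command(req: &ParseRequest) -> Command {
    let mut cmd = Command::new("cargo");
    cmd.args(["run", "--"]).args(to_cli_args(req));
    cmd.stdout(Stdio::piped()).stderr(Stdio::inherit());
    cmd
}

fn main() {
    let req = ParseRequest {
        root_path: "/path/to/repo".to_string(),
        follow_symlinks: false,
        max_file_size: None,
        push_to_neo4j: true,
        neo4j_uri: Some("bolt://localhost:7687".to_string()),
        neo4j_user: None,
        neo4j_password: None,
    };
    println!("{:?}", build_command(&req));
}
```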

## Project Structure

```
parser/
├── Cargo.toml                 # Rust dependencies and build config
├── build.rs                   # Compiles vendored Erlang grammar (C → .a)
├── graph_schema.md            # Neo4j node/relationship schema reference
├── src/
│   ├── main.rs                # CLI entry point (clap)
│   ├── lib.rs                 # Language registry and Tree-Sitter wrapper
│   ├── scanner.rs             # Directory walker + parallel parser
│   ├── graph.rs               # Symbol extraction + Neo4j persistence
│   ├── edge.rs                # Relationship type enum
│   ├── schema.rs              # Node labels and property constants
│   ├── ir.rs                  # Intermediate representation for serialization
│   └── erlang.rs              # FFI binding for vendored Erlang grammar
├── vendor/
│   └── tree-sitter-erlang/    # Vendored Erlang Tree-Sitter grammar (C source)
├── mcp/
│   ├── main.py                # MCP server entry point
│   ├── app.py                 # FastMCP app definition
│   ├── services/
│   │   └── parser_service.py  # Subprocess runner for cargo
│   ├── tools/
│   │   └── parser_tools.py    # parse_repository tool definition
│   └── requirements.txt       # Python dependencies
└── prompts/                   # Prompt templates for MCP tool usage
```

## Known Limitations

- **Java imports** are filtered to `com.redbus.genai.*` by default — other internal packages are not tracked.
- **C# and Go** lack file-level dependency edges (`DEPENDS_ON_FILE`).
- **Erlang** uses regex-based text parsing instead of the Tree-Sitter AST for function extraction.
- **JS, TS, Python, Rust** only extract top-level functions — no classes, call graphs, or dependency edges.
- **Class inheritance** (`extends`/`implements`) is not tracked for any language.
- **Neo4j writes are sequential** per file, which can be slow for large codebases (10k+ files).
- **No incremental parsing in CLI** — the full codebase is re-parsed on every CLI run (MCP server supports incremental file-watcher updates).

See `shortcomings.txt` for a detailed analysis.

---

## Client-side library (in-memory graph)

Add from [crates.io](https://crates.io/crates/impactsense-parser):

```toml
[dependencies]
impactsense-parser = "0.1"
```

The `impactsense-parser` crate builds an **`InMemoryGraph`** in RAM with indexed queries for IDE/MCP use.

```rust
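use impactsense_parser::parse_project;
use impactsense_parser::pipeline::ScanOptions;
use impactsense_parser::store::GraphStore;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Parse the repository into an in-memory graph with default scan options.
    let graph = parse_project("/path/to/repo", &ScanOptions::default())?;
    // Direct callers of a function, looked up by fully qualified name.
    let callers = graph.callers("com.example.OrderService.create");
    // Impact analysis for the same function, using default settings.
    let impact = graph.impact("com.example.OrderService.create", Default::default());
    Ok(())
}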
```
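
To illustrate what "indexed queries" can mean in practice, the sketch below keeps two hash maps so that both callers and callees of a function are a single lookup. `CallIndex` and its methods are hypothetical and use only the standard library; they are not the crate's `InMemoryGraph` internals.

```rust
use std::collections::HashMap;

/// Two-way index over call edges, keyed by fully qualified name.
#[derive(Default)]
struct CallIndex {
    by_caller: HashMap<String, Vec<String>>, // caller -> callees
    by_callee: HashMap<String, Vec<String>>, // callee -> callers
}

impl CallIndex {
    /// Record one call edge in both directions.
    fn add_call(&mut self, caller: &str, callee: &str) {
        self.by_caller
            .entry(caller.to_string())
            .or_default()
            .push(callee.to_string());
        self.by_callee
            .entry(callee.to_string())
            .or_default()
            .push(caller.to_string());
    }

    /// Functions that call `fqn` directly (empty if unknown).
    fn callers(&self, fqn: &str) -> &[String] {
        self.by_callee.get(fqn).map(Vec::as_slice).unwrap_or(&[])
    }

    /// Functions that `fqn` calls directly (empty if unknown).
    fn callees(&self, fqn: &str) -> &[String] {
        self.by_caller.get(fqn).map(Vec::as_slice).unwrap_or(&[])
    }
}

fn main() {
    let mut index = CallIndex::default();
    index.add_call("com.example.OrderController.post", "com.example.OrderService.create");
    index.add_call("com.example.BatchJob.run", "com.example.OrderService.create");
    index.add_call("com.example.OrderService.create", "com.example.OrderRepo.save");

    println!("callers: {:?}", index.callers("com.example.OrderService.create"));
    println!("callees: {:?}", index.callees("com.example.OrderService.create"));
}
```

Storing each edge twice costs memory but avoids scanning every edge to answer a reverse lookup, which matters for interactive IDE use.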

Export IR as JSON from the CLI:

```bash
cargo install impactsense-parser
impactsense-parser /path/to/repo --output-json project_ir.json
```

### Cargo features

| Feature | Default | Description |
|---------|---------|-------------|
| `neo4j` | yes | Neo4j persistence (`--push-to-neo4j`, webhook) |
| `compressor` | no | Reserved for RedCompressor integration |

---

## Cursor MCP setup

One install gives you both the CLI and the MCP server:

```bash
cargo install impactsense-parser
```

Binaries are placed in `~/.cargo/bin/`:

- `impactsense-parser` — CLI
- `impactsense-mcp` — MCP server for Cursor

Create `.cursor/mcp.json` in your project:

```json
{
  "mcpServers": {
    "impactsense": {
      "command": "/Users/YOUR_USER/.cargo/bin/impactsense-mcp",
      "args": ["--root", "${workspaceFolder}"]
    }
  }
}
```

Replace `YOUR_USER` with your username, or run `which impactsense-mcp` after install to get the exact path.

Restart Cursor. The server parses your open workspace once at startup, then keeps the graph updated as you edit files.
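
One way to picture "keeps the graph updated" is a per-file replace: when a file changes, everything previously recorded for that path is swapped for the fresh parse result, and a deleted file simply drops out. The sketch below assumes that model; `FileGraph` and its methods are hypothetical and do not describe the server's actual update logic.

```rust
use std::collections::HashMap;

/// Symbols (fully qualified names) grouped by the file that declares them.
#[derive(Default)]
struct FileGraph {
    symbols_by_file: HashMap<String, Vec<String>>,
}

impl FileGraph {
    /// Replace whatever was recorded for `path` with a fresh parse result.
    fn update_file(&mut self, path: &str, symbols: Vec<String>) {
        self.symbols_by_file.insert(path.to_string(), symbols);
    }

    /// Forget a file that was deleted from the workspace.
    fn remove_file(&mut self, path: &str) {
        self.symbols_by_file.remove(path);
    }

    /// Symbols declared in one file (empty if the file is unknown).
    fn symbols_in_file(&self, path: &str) -> &[String] {
        self.symbols_by_file
            .get(path)
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }

    /// Total number of symbols across all files.
    fn symbol_count(&self) -> usize {
        self.symbols_by_file.values().map(Vec::len).sum()
    }
}

fn main() {
    let mut graph = FileGraph::default();
    // Initial parse at startup.
    graph.update_file("OrderService.java", vec!["OrderService.create".to_string()]);
    // The file is edited: a second method appears, so the entry is replaced.
    graph.update_file(
        "OrderService.java",
        vec!["OrderService.create".to_string(), "OrderService.cancel".to_string()],
    );
    println!("{:?}", graph.symbols_in_file("OrderService.java"));
    println!("total symbols: {}", graph.symbol_count());
}
```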

### MCP tools

| Tool | Description |
|------|-------------|
| `find_symbol` | Search by name or FQN substring |
| `callers` / `callees` | Direct call graph neighbors |
| `file_dependencies` | Import/file deps for a path |
| `symbols_in_file` | Declared symbols in one file |
| `impact_analysis` | Transitive callers (bounded depth) |
| `graph_stats` | Node/edge counts |

The graph lives in MCP process memory. Restart MCP/Cursor to re-bootstrap after large branch switches.