
symtrace

A deterministic semantic diff engine written in Rust that compares two Git commits using AST-based structural analysis instead of traditional line-based text diff.

Where git diff shows you lines that changed, symtrace shows you what semantically changed — functions moved, classes deleted, variables renamed, code blocks inserted — at the AST node level, with no false positives from formatting or comment edits.

━━━ src/handler.rs
  + [INSERT] function_item 'handle_request' inserted (L42)
  ~ [MODIFY] function_item 'parse_body' modified (L10 → L10) [75% similarity, medium]
  ✎ [RENAME] function_item renamed from 'process' to 'execute' (L5 → L5) [98% similarity, low]
  - [DELETE] function_item 'deprecated_fn' deleted (L88)
  ↔ [MOVE]   function_item 'helper' moved (L20 → L35) [100% similarity, low]
  ── Refactor Patterns ──
    ▸ 'process' renamed to 'execute' (confidence: 100%)

Features

Semantic operations — MOVE, RENAME, MODIFY, INSERT, DELETE detected at the AST node level
5-phase matching algorithm — exact hash match → structural match → similarity scoring → leftover old nodes (DELETE) → leftover new nodes (INSERT)
4-hash BLAKE3 node identity — structural, content, identity, and context hashes per node
Refactor pattern detection — extract method, move method, rename variable
Cross-file symbol tracking — detects symbols that move, rename, or change API across files
Commit classification — automatically labels commits (feature, bugfix, refactor, cleanup, formatting_only, etc.)
Semantic similarity scoring — per-operation similarity percentage with intensity rating (low / medium / high)
Incremental parsing — tree-sitter tree reuse + BLAKE3 hash reuse for unchanged subtrees
AST caching — two-tier cache (in-memory LRU + on-disk) keyed by blob hash
Parallel processing — files parsed and diffed in parallel via rayon
Arena allocation — bumpalo arena for low-overhead, bump-allocated AST construction
Comment/whitespace filtering — --logic-only mode ignores non-logic changes
Machine-readable output — --json for CI/CD pipelines and tooling integration
Parser resource limits — configurable file size, node count, recursion depth, and timeout guards
Zero network access — fully offline, no telemetry, no data collection

Supported Languages

Language	Extensions
Rust	`.rs`
JavaScript	`.js`, `.jsx`, `.mjs`, `.cjs`
TypeScript	`.ts`, `.tsx`
Python	`.py`, `.pyi`
Java	`.java`

Files with unsupported extensions are silently skipped.
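
The extension table above can be read as a simple lookup. The following is a minimal sketch of that mapping, not the code in `language.rs`: the `Language` enum and `detect_language` function are hypothetical names, and case-sensitive matching is an assumption.

```rust
/// Languages from the table above (hypothetical enum, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    Rust,
    JavaScript,
    TypeScript,
    Python,
    Java,
}

/// Map a file path to a language by extension.
/// `None` means the file would be skipped.
/// Assumption: extensions are matched case-sensitively.
pub fn detect_language(path: &str) -> Option<Language> {
    let ext = std::path::Path::new(path).extension()?.to_str()?;
    match ext {
        "rs" => Some(Language::Rust),
        "js" | "jsx" | "mjs" | "cjs" => Some(Language::JavaScript),
        "ts" | "tsx" => Some(Language::TypeScript),
        "py" | "pyi" => Some(Language::Python),
        "java" => Some(Language::Java),
        _ => None,
    }
}
```

With this sketch, `detect_language("src/handler.rs")` yields `Some(Language::Rust)`, while `detect_language("README.md")` yields `None`.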

Quick Start

# Build
cargo build --release

# Compare the last two commits
symtrace . HEAD~1 HEAD

# Compare two specific commits
symtrace /path/to/repo a1b2c3d 9f8e7d6

# JSON output for scripting
symtrace . HEAD~1 HEAD --json

# Ignore comment/whitespace changes
symtrace . HEAD~1 HEAD --logic-only

Installation

Requires a Rust toolchain that supports the 2021 edition and a C compiler (for the libgit2 and tree-sitter native code).

# From source
git clone https://github.com/nicktretyakov/symtrace.git
cd symtrace
cargo install --path .

# Or build directly
cargo build --release
# Binary at target/release/symtrace (or .exe on Windows)

Build Scripts

# Production build (clean + fmt + lint + test + release)
./build.sh production            # macOS/Linux
.\build.ps1 -Target production   # Windows
make production                  # GNU Make

See DEVELOPMENT.md for full build system documentation.

Usage

symtrace <REPO_PATH> <COMMIT_A> <COMMIT_B> [OPTIONS]

Arguments

Argument	Description
`REPO_PATH`	Path to a local Git repository (the folder containing `.git/`)
`COMMIT_A`	Older commit reference — hash, `HEAD~1`, branch name, tag
`COMMIT_B`	Newer commit reference — hash, `HEAD`, branch name, tag

Options

Flag	Default	Description
`--logic-only`	off	Ignore comments and whitespace-only nodes
`--json`	off	Emit machine-readable JSON instead of colored CLI output
`--no-incremental`	off	Disable incremental parsing (force full re-parse)
`--max-file-size <BYTES>`	5242880	Skip files larger than this (5 MiB default)
`--max-ast-nodes <N>`	200000	Skip files with more AST nodes than this
`--max-recursion-depth <N>`	2048	Maximum parser recursion depth
`--parse-timeout-ms <MS>`	2000	Per-file parse timeout (0 = no timeout)
`--help`		Print help
`--version`		Print version
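
As an illustration of how the resource-limit defaults in the table fit together, here is a small sketch. `ParserLimits` and its methods are hypothetical names and not taken from symtrace's source; only the default values and the "0 = no timeout" rule come from the table.

```rust
use std::time::Duration;

/// Hypothetical container for the four resource limits in the options table.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ParserLimits {
    pub max_file_size: u64,         // bytes
    pub max_ast_nodes: usize,       // node count per file
    pub max_recursion_depth: usize, // parser recursion depth
    pub parse_timeout_ms: u64,      // 0 disables the timeout
}

impl Default for ParserLimits {
    fn default() -> Self {
        Self {
            max_file_size: 5 * 1024 * 1024, // 5242880 bytes = 5 MiB
            max_ast_nodes: 200_000,
            max_recursion_depth: 2048,
            parse_timeout_ms: 2000,
        }
    }
}

impl ParserLimits {
    /// Files strictly larger than the limit are skipped.
    pub fn file_too_large(&self, size_bytes: u64) -> bool {
        size_bytes > self.max_file_size
    }

    /// `None` when the timeout is disabled (0), otherwise the per-file budget.
    pub fn timeout(&self) -> Option<Duration> {
        if self.parse_timeout_ms == 0 {
            None
        } else {
            Some(Duration::from_millis(self.parse_timeout_ms))
        }
    }
}
```

Treating a file of exactly `max_file_size` bytes as acceptable follows the table's wording ("larger than this"); whether symtrace draws the boundary in the same place is an assumption.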

Examples

# Compare feature branch against main
symtrace /repos/project main feature/my-feature

# Compare two tags
symtrace /repos/project v1.0.0 v2.0.0

# JSON output piped to jq
symtrace . HEAD~1 HEAD --json | jq '.summary'

# Logic-only JSON for CI
symtrace . HEAD~5 HEAD --logic-only --json

# Strict resource limits for untrusted repos
symtrace . HEAD~1 HEAD --max-file-size 1048576 --max-ast-nodes 50000 --parse-timeout-ms 500

How It Works

Repository commits
       │
       ▼
   git layer          ← libgit2: resolve refs, extract file blobs
       │
       ▼
  blob hash check     ← short-circuit: skip files with identical content
       │
       ▼
  AST parsing         ← tree-sitter: parallel, cached, incremental
       │                  (arena-allocated, resource-guarded)
       ▼
   BLAKE3 hashing     ← 4-hash identity per node (with incremental reuse)
       │
       ▼
  tree diffing        ← 5-phase matching algorithm (parallel per file)
       │
       ▼
  symbol tracking     ← cross-file move/rename/API-change detection
       │
       ▼
  classification      ← auto-classify commit type
       │
       ▼
   output             ← colored CLI  OR  structured JSON

Architecture

Module	Responsibility
`main.rs`	Pipeline orchestration, parallel dispatch, timing
`cli.rs`	CLI argument definitions via clap
`git_layer.rs`	Opens repo with libgit2, resolves commits, reads blobs
`language.rs`	File extension → language mapping, tree-sitter grammar provider
`ast_builder.rs`	Tree-sitter parsing (full + incremental), arena-allocated AST construction
`ast_cache.rs`	Two-tier AST cache — in-memory LRU + on-disk with versioned envelope
`incremental_parse.rs`	TreeCache (in-memory LRU for tree-sitter Trees), edit computation
`node_identity.rs`	BLAKE3 4-hash computation per node (with incremental hash reuse)
`tree_diff.rs`	5-phase matching algorithm, produces operation records
`semantic_similarity.rs`	Composite similarity scoring with complexity analysis
`refactor_detection.rs`	Pattern matching for extract/move/rename refactors
`symbol_tracking.rs`	Cross-file symbol tracking (moves, renames, API changes)
`commit_classification.rs`	Automatic commit classification by type and confidence
`output.rs`	Colored CLI renderer and JSON serializer
`types.rs`	All shared types (AstNode, OperationRecord, DiffOutput, ...)

Hashing Strategy

Every AST node receives four independent BLAKE3 hashes:

Hash	Input	Purpose
`structural_hash`	`node_kind + child structural_hashes`	Tree shape — detects moves regardless of content
`content_hash`	`actual leaf tokens`	Real content — detects any text change
`identity_hash`	`node_kind + <IDENTIFIER> placeholders`	Shape sans names — detects renames
`context_hash`	`parent_structural_hash + depth`	Position in tree — detects re-parenting
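
To make the roles of the four hashes concrete, here is a toy sketch. It is not the code in `node_identity.rs`. It uses the standard library's `DefaultHasher` as a stand-in for BLAKE3 so that it needs no external crates. The `Node` type and all function bodies are illustrative assumptions. In particular, keeping non-identifier tokens in `identity_hash` is one possible reading of the table.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy AST node: a kind, an optional leaf token, and children.
pub struct Node {
    pub kind: &'static str,
    pub token: Option<&'static str>,
    pub children: Vec<Node>,
}

/// Convenience constructor for a leaf node.
pub fn leaf(kind: &'static str, token: &'static str) -> Node {
    Node { kind, token: Some(token), children: Vec::new() }
}

/// Stand-in for BLAKE3: hash any hashable value to a u64.
fn h<T: Hash>(value: &T) -> u64 {
    let mut s = DefaultHasher::new();
    value.hash(&mut s);
    s.finish()
}

/// node_kind + child structural hashes: tree shape only, tokens ignored.
pub fn structural_hash(n: &Node) -> u64 {
    let kids: Vec<u64> = n.children.iter().map(structural_hash).collect();
    h(&(n.kind, kids))
}

/// Leaf tokens in order: any text change alters this hash.
pub fn content_hash(n: &Node) -> u64 {
    let kids: Vec<u64> = n.children.iter().map(content_hash).collect();
    h(&(n.token, kids))
}

/// Kinds plus tokens, with identifier names replaced by a placeholder.
pub fn identity_hash(n: &Node) -> u64 {
    let tok = if n.kind == "identifier" { Some("<IDENTIFIER>") } else { n.token };
    let kids: Vec<u64> = n.children.iter().map(identity_hash).collect();
    h(&(n.kind, tok, kids))
}

/// Parent structural hash + depth: position in the tree.
pub fn context_hash(parent_structural: u64, depth: u32) -> u64 {
    h(&(parent_structural, depth))
}
```

For two functions that differ only in their name, this sketch gives equal `structural_hash` and `identity_hash` values but different `content_hash` values, which is the signature of a rename described in the table.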

Matching Phases

Phase 1 — EXACT MATCH  (structural + content hash)
  └─ same path → silent  /  different path → MOVE

Phase 2 — STRUCTURAL MATCH  (structural hash only, different content)
  ├─ same name → MODIFY
  ├─ only identifiers changed → RENAME
  └─ otherwise → MODIFY

Phase 3 — SIMILARITY SCORING
  3a. Same kind + name → MODIFY
  3b. identity_hash match + ≥90% → RENAME
  3c. Composite ≥70% → MODIFY

Phase 4 — LEFTOVER OLD  → DELETE
Phase 5 — LEFTOVER NEW  → INSERT
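
The phase order above can be sketched over a flat list of symbols. This is a simplified illustration, not the algorithm in `tree_diff.rs`: `Sym`, `Op`, `Verdict`, `run_phase`, and `diff` are hypothetical names, hashes are supplied as plain integers, node kind is ignored, matching is greedy first-fit, and the similarity score is passed in as a caller-provided function.

```rust
#[derive(Debug, Clone)]
pub struct Sym {
    pub name: &'static str,
    pub path: &'static str, // position in the tree, e.g. "root/2"
    pub structural: u64,
    pub content: u64,
    pub identity: u64,
}

#[derive(Debug, PartialEq)]
pub enum Op {
    Move(&'static str),
    Rename(&'static str, &'static str),
    Modify(&'static str),
    Delete(&'static str),
    Insert(&'static str),
}

enum Verdict {
    NoMatch,
    Silent, // matched, nothing to report
    Emit(Op),
}

/// Pair each unmatched old symbol with the first unmatched new symbol the judge accepts.
fn run_phase(
    old: &[Sym],
    new: &[Sym],
    used_old: &mut [bool],
    used_new: &mut [bool],
    ops: &mut Vec<Op>,
    judge: impl Fn(&Sym, &Sym) -> Verdict,
) {
    for i in 0..old.len() {
        if used_old[i] {
            continue;
        }
        for j in 0..new.len() {
            if used_new[j] {
                continue;
            }
            match judge(&old[i], &new[j]) {
                Verdict::NoMatch => continue,
                Verdict::Silent => {}
                Verdict::Emit(op) => ops.push(op),
            }
            used_old[i] = true;
            used_new[j] = true;
            break;
        }
    }
}

pub fn diff(old: &[Sym], new: &[Sym], score: impl Fn(&Sym, &Sym) -> f64) -> Vec<Op> {
    let mut ops = Vec::new();
    let (mut uo, mut un) = (vec![false; old.len()], vec![false; new.len()]);

    // Phase 1: exact match on structural + content hash.
    run_phase(old, new, &mut uo, &mut un, &mut ops, |o, n| {
        if o.structural != n.structural || o.content != n.content {
            Verdict::NoMatch
        } else if o.path == n.path {
            Verdict::Silent
        } else {
            Verdict::Emit(Op::Move(o.name))
        }
    });
    // Phase 2: structural hash only; content differs.
    run_phase(old, new, &mut uo, &mut un, &mut ops, |o, n| {
        if o.structural != n.structural {
            Verdict::NoMatch
        } else if o.name != n.name && o.identity == n.identity {
            Verdict::Emit(Op::Rename(o.name, n.name))
        } else {
            Verdict::Emit(Op::Modify(n.name))
        }
    });
    // Phase 3a: same name (kind is ignored in this toy).
    run_phase(old, new, &mut uo, &mut un, &mut ops, |o, n| {
        if o.name == n.name { Verdict::Emit(Op::Modify(n.name)) } else { Verdict::NoMatch }
    });
    // Phase 3b: identity hash match with score >= 90%.
    run_phase(old, new, &mut uo, &mut un, &mut ops, |o, n| {
        if o.identity == n.identity && score(o, n) >= 0.90 {
            Verdict::Emit(Op::Rename(o.name, n.name))
        } else {
            Verdict::NoMatch
        }
    });
    // Phase 3c: composite score >= 70%.
    run_phase(old, new, &mut uo, &mut un, &mut ops, |o, n| {
        if score(o, n) >= 0.70 { Verdict::Emit(Op::Modify(n.name)) } else { Verdict::NoMatch }
    });
    // Phase 4: leftover old symbols are deletions.
    ops.extend(old.iter().zip(&uo).filter(|(_, u)| !**u).map(|(o, _)| Op::Delete(o.name)));
    // Phase 5: leftover new symbols are insertions.
    ops.extend(new.iter().zip(&un).filter(|(_, u)| !**u).map(|(n, _)| Op::Insert(n.name)));
    ops
}
```

Running each phase to completion before the next one starts means a cheap hash match always takes priority over a similarity-based match, which is the ordering the phase list describes.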

Operation Types

Operation	Symbol	Meaning
MOVE	`↔`	Same content, different position in the tree
RENAME	`✎`	Same structure, different identifier names
MODIFY	`~`	Same kind/name, changed body
INSERT	`+`	Exists only in the new commit
DELETE	`-`	Exists only in the old commit

Similarity Scoring

Every matched operation carries a similarity breakdown:

$$\text{score} = 0.5 \times \text{structure} + 0.3 \times \text{tokens} + 0.2 \times \text{complexity}$$

Similarity	Intensity	Meaning
≥ 80%	`low`	Minor change — safe to auto-approve
50–79%	`medium`	Non-trivial — worth focused review
< 50%	`high`	Near-total rewrite — treat as new code
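
The weighted formula and the intensity bands translate directly into code. The sketch below uses hypothetical names (`composite_score`, `intensity`, `Intensity`) and assumes each component similarity is a fraction in [0, 1]. It also assumes that fractional percentages between 79 and 80 fall into the `medium` band.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Intensity {
    Low,
    Medium,
    High,
}

/// score = 0.5 * structure + 0.3 * tokens + 0.2 * complexity
/// Inputs and output are fractions in [0, 1].
pub fn composite_score(structure: f64, tokens: f64, complexity: f64) -> f64 {
    0.5 * structure + 0.3 * tokens + 0.2 * complexity
}

/// Map a similarity percentage to the intensity bands in the table.
pub fn intensity(similarity_percent: f64) -> Intensity {
    if similarity_percent >= 80.0 {
        Intensity::Low
    } else if similarity_percent >= 50.0 {
        Intensity::Medium
    } else {
        Intensity::High
    }
}
```

For example, with structure 0.84 and tokens 0.61, and a complexity similarity of 0.745 chosen purely for illustration, the score is 0.42 + 0.183 + 0.149 = 0.752, or 75.2%, which lands in the `medium` band.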

Incremental Parsing

When comparing commits, symtrace parses the old version of each file first, then uses tree-sitter's incremental parsing to reparse the new version:

Edit computation — common prefix/suffix byte comparison → minimal InputEdit
Tree reuse — tree-sitter internally reuses all unchanged subtrees
Hash reuse — BLAKE3 hashes copied from the old AST for nodes outside the changed region

This delivers up to 46% parse time reduction on files with localised changes. Disable with --no-incremental if needed.
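
The edit-computation step (common prefix/suffix comparison) can be sketched as follows. `ByteEdit` and `minimal_edit` are hypothetical names, and this version computes byte offsets only. tree-sitter's `InputEdit` additionally requires row/column positions, which are omitted here.

```rust
/// Byte range that differs between two buffers.
#[derive(Debug, PartialEq, Eq)]
pub struct ByteEdit {
    pub start: usize,   // first differing byte
    pub old_end: usize, // end of the changed region in the old buffer
    pub new_end: usize, // end of the changed region in the new buffer
}

/// Find the smallest single edit that turns `old` into `new`
/// by trimming the common prefix and the common suffix.
pub fn minimal_edit(old: &[u8], new: &[u8]) -> Option<ByteEdit> {
    if old == new {
        return None;
    }
    let prefix = old.iter().zip(new).take_while(|(a, b)| a == b).count();
    // The suffix must not overlap the prefix in the shorter buffer.
    let max_suffix = old.len().min(new.len()) - prefix;
    let suffix = old
        .iter()
        .rev()
        .zip(new.iter().rev())
        .take(max_suffix)
        .take_while(|(a, b)| a == b)
        .count();
    Some(ByteEdit {
        start: prefix,
        old_end: old.len() - suffix,
        new_end: new.len() - suffix,
    })
}
```

Changing `fn a() { 1 }` to `fn a() { 22 }` produces a single edit covering bytes 9..10 in the old buffer and 9..11 in the new one; everything outside that range is eligible for reuse.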

Output Formats

CLI (default)

━━━ symtrace  Semantic Diff ━━━
Repository : /repos/project
Comparing  : HEAD~1 → HEAD

━━━ src/server.rs
  + [INSERT] function_item 'handle_request' inserted (L42)
  ~ [MODIFY] function_item 'parse_body' modified (L10 → L10) [75% similarity, medium]
  ✎ [RENAME] function_item renamed from 'old_name' to 'new_name' (L5 → L5) [98% similarity, low]
  - [DELETE] function_item 'deprecated_fn' deleted (L88)
  ↔ [MOVE]   function_item 'helper' moved (L20 → L35) [100% similarity, low]
  ── Refactor Patterns ──
    ▸ 'old_name' renamed to 'new_name' (confidence: 100%)

━━━ Summary ━━━
  Files          : 1
  Moves          : 1
  Renames        : 1
  Inserts        : 1
  Deletes        : 1
  Modifications  : 1

━━━ Cross-File Symbol Tracking ━━━
  Symbols tracked : 42
  ↔ [cross_file_move] variable 'config' moved from 'old.js' to 'new.js' (similarity: 100%)

━━━ Commit Classification ━━━
  Class          : refactor
  Confidence     : 85%

━━━ Performance ━━━
  Files processed   : 1
  Nodes compared    : 312
  Parse time        : 2.14 ms
  Diff time         : 0.38 ms
  Total time        : 12.05 ms
  Incremental       : 1 file(s), 156 nodes reused

JSON (`--json`)

{
  "repository": "/repos/project",
  "commit_a": "HEAD~1",
  "commit_b": "HEAD",
  "files": [
    {
      "file_path": "src/server.rs",
      "operations": [
        {
          "type": "MODIFY",
          "entity_type": "function",
          "old_location": "L10",
          "new_location": "L10",
          "details": "function_item 'parse_body' modified",
          "similarity": {
            "structure_similarity": 0.84,
            "token_similarity": 0.61,
            "node_count_delta": 3,
            "cyclomatic_delta": 1,
            "control_flow_changed": true,
            "similarity_percent": 75.2,
            "change_intensity": "medium"
          }
        }
      ],
      "refactor_patterns": []
    }
  ],
  "summary": { "total_files": 1, "moves": 1, "renames": 1, "inserts": 1, "deletes": 1, "modifications": 1 },
  "cross_file_tracking": { ... },
  "commit_classification": { "classification": "refactor", "confidence": 0.85 },
  "performance": { "total_files_processed": 1, "total_nodes_compared": 312, "parse_time_ms": 2.14, "diff_time_ms": 0.38, "total_time_ms": 12.05 }
}

JSON notes:

old_location is omitted on INSERT; new_location is omitted on DELETE
similarity is omitted on INSERT and DELETE
entity_type is one of "function", "class", "variable", "block", "other"
change_intensity is one of "low", "medium", "high"
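
A consumer of the JSON output could encode the omission rules above as optional fields plus a consistency check. The sketch below is an assumption about how one might model this on the consuming side. `OpType`, `OperationView`, and `is_well_formed` are hypothetical names and do not mirror the types in `types.rs`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OpType {
    Move,
    Rename,
    Modify,
    Insert,
    Delete,
}

/// Hypothetical consumer-side view of one entry in "operations".
#[derive(Debug, Clone)]
pub struct OperationView {
    pub op_type: OpType,
    pub old_location: Option<String>,
    pub new_location: Option<String>,
    pub similarity_percent: Option<f64>,
}

impl OperationView {
    /// Check the omission rules from the JSON notes:
    /// INSERT has no old_location, DELETE has no new_location,
    /// and neither carries similarity; matched operations do.
    pub fn is_well_formed(&self) -> bool {
        match self.op_type {
            OpType::Insert => self.old_location.is_none() && self.similarity_percent.is_none(),
            OpType::Delete => self.new_location.is_none() && self.similarity_percent.is_none(),
            OpType::Move | OpType::Rename | OpType::Modify => self.similarity_percent.is_some(),
        }
    }
}
```

Modeling the omitted fields as `Option` keeps the INSERT and DELETE cases explicit instead of relying on empty strings or sentinel values.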

Performance

Benchmarks on expressjs/express (JavaScript, ~21k LOC), release build, 10-run average on Windows:

Scenario	`git diff`	`symtrace`	Ratio
2 JS files, 6k nodes	42.85 ms	40.61 ms	symtrace 1.06× faster
8 JS files, 21k nodes	44.57 ms	67.70 ms	git diff 1.52× faster
11 JS files, 26k nodes	44.55 ms	74.01 ms	git diff 1.66× faster

symtrace is marginally faster than git diff on the smallest file set (1.06×). On larger ranges git diff is faster: full AST parsing plus 5-phase matching is the cost of semantic understanding. symtrace's own scaling is sub-linear: 4.4× more nodes → only 1.8× more time.

See benchmarks_v5.md for complete data, historical progression (v1–v5), and internal timing breakdowns.

Security

No network access — zero HTTP/TCP/DNS dependencies
No telemetry — no analytics, tracking, or data collection
No unsafe Rust — unsafe_code = "deny" enforced in Cargo.toml
No external process spawning — no std::process::Command
Bounded deserialization — AST cache limited to 20 MiB with version/integrity checks
Pinned dependencies — all versions exactly pinned (=x.y.z)
Supply chain hardening — cargo-deny configuration in deny.toml

See SECURITY.md for the full security audit.

Dependencies

Crate	Purpose
`clap`	CLI argument parsing
`git2`	libgit2 bindings for repository access
`tree-sitter`	Parser framework
`tree-sitter-{rust,javascript,typescript,python,java}`	Language grammars
`blake3`	SIMD-optimised hashing for node identity
`serde` / `serde_json`	JSON serialization
`bincode`	Binary serialization for AST cache
`rayon`	Data parallelism
`lru`	In-memory LRU cache
`bumpalo`	Arena allocator
`colored`	Terminal colors
`anyhow`	Error handling

All versions are exactly pinned. See Cargo.toml for specifics.

Contributing

See CONTRIBUTING.md for guidelines.

License

MIT

symtrace 0.1.1