codegraph 0.1.1

A fast, reliable, and flexible graph database optimized for storing and querying code relationships
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
Product Requirements Document: codegraph
Version: 1.0
Date: 2025-01-01
Status: Draft
Owner: Engineering

Executive Summary
codegraph is a specialized graph database for code analysis, designed to store and query relationships between code elements across any programming language. Unlike general-purpose graph databases, codegraph provides code-specific node types, relationship semantics, and query patterns optimized for dependency analysis, call graphs, and impact assessment.
Key Differentiation: We provide the storage and query layer—users bring their own parsers (tree-sitter, syn, swc, etc.). This reduces scope while enabling maximum flexibility.
Target Timeline: 3 months to v0.1.0
Target Market: Tool builders, researchers, architects, refactoring tool developers
Pricing Model: Open source core + premium integrations

Vision & Goals
Vision Statement

"The Rust-native storage foundation for every code analysis tool."

Primary Goals

Performance: Sub-100ms queries on graphs with 100K+ nodes
Persistence: Reliable RocksDB-backed storage with incremental updates
Simplicity: Intuitive API that feels natural for code relationships
Flexibility: Parser-agnostic design supporting any language
Quality: Production-ready with comprehensive test coverage

Non-Goals (v0.1.0)

❌ Built-in language parsers (users provide parsed data)
❌ Semantic analysis or type inference
❌ LSP server implementation
❌ Real-time synchronization with editors
❌ Cloud-hosted service


Target Users
Primary Personas
1. Tool Builder (Tom)

Building a custom refactoring tool for their company's codebase
Needs persistent storage of code relationships
Doesn't want to deal with graph database internals
Pain: Existing solutions (Neo4j, Kuzu) are overkill or have FFI overhead

2. Researcher (Reese)

Analyzing architectural patterns across 100+ open-source projects
Needs to query large code graphs efficiently
Wants to export data for visualization and analysis
Pain: No good way to store and query code relationships at scale

3. Platform Engineer (Pat)

Building internal developer tools for a large engineering org
Needs dependency tracking and impact analysis
Requires persistence and incremental updates
Pain: Building graph infrastructure is undifferentiated heavy lifting


Product Requirements
Core Features
1. Graph Database Foundation
FR-1.1: Node Management

Nodes represent code entities (files, functions, classes, etc.)
Each node has:

Unique ID (u64)
Type discriminator (enum: CodeFile, Function, Class, etc.)
Key-value properties (strings, ints, floats, bools, lists)


CRUD operations: create, read, update, delete
Batch operations for bulk inserts

FR-1.2: Edge Management

Edges represent relationships (imports, calls, contains, etc.)
Each edge has:

Unique ID (u64)
Source and target node IDs
Type discriminator (enum: Imports, Calls, Contains, etc.)
Optional properties for metadata


Directed graph model
Support for multiple edges between same nodes

FR-1.3: Property System

Flexible schema-less properties on nodes and edges
Supported types: String, Int, Float, Bool, StringList, IntList, Null
Type-safe access methods
Fluent builder API for construction

FR-1.4: Storage Backends

RocksDB Backend: Persistent on-disk storage

Column families for nodes/edges/metadata
Atomic batch writes
Crash recovery


Memory Backend: For testing and temporary graphs
Pluggable backend interface for future extensions

FR-1.5: Graph Algorithms

Breadth-First Search (BFS) with depth limit
Depth-First Search (DFS) with depth limit
Strongly Connected Components (Tarjan's algorithm for cycle detection)
Path finding (all paths between two nodes)
Neighbor queries with edge type filtering

2. Code-Specific Helpers
FR-2.1: Code Node Constructors
rustpub fn add_file(&mut self, path: &Path, language: &str) -> Result<NodeId>
pub fn add_function(&mut self, file_id: NodeId, name: &str, line_start: usize, line_end: usize) -> Result<NodeId>
pub fn add_class(&mut self, file_id: NodeId, name: &str, line_start: usize, line_end: usize) -> Result<NodeId>
pub fn add_module(&mut self, name: &str) -> Result<NodeId>
FR-2.2: Relationship Helpers
rustpub fn add_call(&mut self, caller: NodeId, callee: NodeId, line: usize) -> Result<EdgeId>
pub fn add_import(&mut self, from_file: NodeId, to_file: NodeId, symbols: Vec<String>) -> Result<EdgeId>
pub fn link_to_file(&mut self, entity: NodeId, file: NodeId) -> Result<EdgeId>
FR-2.3: Query Conveniences
rustpub fn get_callers(&self, function: NodeId) -> Result<Vec<NodeId>>
pub fn get_callees(&self, function: NodeId) -> Result<Vec<NodeId>>
pub fn get_functions_in_file(&self, file: NodeId) -> Result<Vec<NodeId>>
pub fn get_file_dependencies(&self, file: NodeId) -> Result<Vec<NodeId>>
pub fn get_file_dependents(&self, file: NodeId) -> Result<Vec<NodeId>>
FR-2.4: Transitive Queries
rustpub fn transitive_dependencies(&self, file: NodeId, max_depth: i32) -> Result<Vec<NodeId>>
pub fn transitive_dependents(&self, file: NodeId, max_depth: i32) -> Result<Vec<NodeId>>
pub fn call_chain(&self, from: NodeId, to: NodeId, max_depth: i32) -> Result<Vec<Vec<NodeId>>>
3. Query Builder
FR-3.1: Fluent Query API
rustgraph.query()
    .node_type(NodeType::Function)
    .in_file("src/main.rs")
    .name_contains("test_")
    .property("is_public", true)
    .execute()?
```

**FR-3.2: Supported Filters**
- Node type filtering
- File path filtering (exact or pattern)
- Property existence and value matching
- Name substring matching
- Custom predicate functions

**FR-3.3: Query Results**
- Return Vec<NodeId> for further processing
- Lazy evaluation where possible
- Sorted by relevance (configurable)

#### 4. Export & Visualization

**FR-4.1: DOT Format Export**
- Generate Graphviz-compatible DOT files
- Configurable node labels and colors
- Edge styling by type
- Subgraph support for file grouping

**FR-4.2: JSON Export**
- Schema: `{ nodes: [...], edges: [...] }`
- Compatible with D3.js and similar libraries
- Include all properties or filtered subset

**FR-4.3: CSV Export**
- Nodes as CSV (id, type, properties...)
- Edges as CSV (id, from, to, type, properties...)
- Suitable for pandas/R analysis

**FR-4.4: Triple Export**
- RDF-style triples for semantic web tools
- Format: (subject, predicate, object)
- Property flattening

#### 5. Performance & Reliability

**FR-5.1: Performance Targets**
- Graph with 100K nodes, 500K edges:
  - Single node lookup: <1ms
  - Neighbor query: <10ms
  - BFS traversal (depth=5): <50ms
  - Full graph scan: <500ms
- Indexing throughput: 10K+ nodes/sec on modern hardware

**FR-5.2: Memory Efficiency**
- Memory usage: ~50 bytes/node + ~30 bytes/edge (in-memory)
- RocksDB: Compressed storage with LSM tree compaction
- Batch operations to reduce write amplification

**FR-5.3: Reliability**
- No data loss on process crash (RocksDB WAL)
- Atomic batch operations
- Graceful degradation on storage errors
- Comprehensive error types with context

---

## Technical Architecture

### High-Level Design
```
┌─────────────────────────────────────────┐
│         User Code (Parser Logic)        │
│   (tree-sitter, syn, swc, custom)      │
└─────────────────┬───────────────────────┘
                                    ┌─────────────────────────────────────────┐
│           codegraph Public API          │
│  • add_file(), add_function()          │
│  • query().execute()                    │
│  • get_callers(), transitive_deps()    │
└─────────────────┬───────────────────────┘
                                    ┌─────────────────────────────────────────┐
│            Core Graph Engine            │
│  • Node/Edge CRUD                       │
│  • Property management                  │
│  • Adjacency indexing                   │
│  • Algorithm implementations            │
└─────────────────┬───────────────────────┘
                                    ┌─────────────────────────────────────────┐
│          Storage Backend Trait          │
│  ┌──────────────┬──────────────────┐   │
│  │ RocksDB      │  Memory          │   │
│  │ Backend      │  Backend         │   │
│  └──────────────┴──────────────────┘   │
└─────────────────────────────────────────┘
```

### Module Structure
```
codegraph/
├── Cargo.toml
├── src/
│   ├── lib.rs              # Public API surface
│   ├── error.rs            # GraphError type
│   ├── property.rs         # PropertyMap, PropertyValue
│   ├── node.rs             # Node, NodeId, NodeType
│   ├── edge.rs             # Edge, EdgeId, EdgeType, Direction
│   ├── graph.rs            # CodeGraph implementation
│   ├── algorithms.rs       # BFS, DFS, SCC, path finding
│   ├── query.rs            # QueryBuilder API
│   ├── export.rs           # DOT, JSON, CSV, Triple exports
│   ├── code_helpers.rs     # add_file, add_function, etc.
│   └── storage/
│       ├── mod.rs
│       ├── backend.rs      # StorageBackend trait
│       ├── memory.rs       # MemoryBackend
│       └── rocksdb.rs      # RocksDBBackend
├── examples/
│   ├── call_graph.rs       # Build call graph from syn
│   ├── dependency_tree.rs  # Analyze file dependencies
│   ├── impact_analysis.rs  # Find affected files
│   ├── visualize.rs        # Export to Graphviz
│   └── rust_parser.rs      # Complete Rust parser integration
└── benches/
    └── graph_operations.rs # Performance benchmarks
```

### Data Model

**Node Storage:**
```
Key: "n:{node_id}"
Value: bincode-serialized Node struct
```

**Edge Storage:**
```
Key: "e:{edge_id}"
Value: bincode-serialized Edge struct
Adjacency Index (in-memory):
ruststruct AdjacencyIndex {
    outgoing: HashMap<NodeId, HashSet<EdgeId>>,
    incoming: HashMap<NodeId, HashSet<EdgeId>>,
    typed_index: HashMap<(NodeId, EdgeType, Direction), HashSet<EdgeId>>,
}

Development Approach
Test-Driven Development (TDD)
All features MUST be developed using TDD methodology:

Write tests first before implementing features
Red-Green-Refactor cycle:

RED: Write a failing test for the desired behavior
GREEN: Implement minimum code to make test pass
REFACTOR: Clean up code while keeping tests green


Test coverage targets:

Core graph operations: 95%+ coverage
Storage backends: 90%+ coverage
Code helpers: 85%+ coverage
Export functions: 80%+ coverage



Test Categories:
Unit Tests (fast, isolated)

Property system conversions
Node/Edge serialization
Individual graph operations
Query filter logic
Export format generation

Integration Tests (with backends)

RocksDB persistence and recovery
Batch operations across transactions
Index rebuilding from storage
Concurrent access patterns

Performance Tests (benchmarks)

Large graph operations (100K+ nodes)
Query performance under load
Memory usage profiling
Storage backend comparison

Example TDD Workflow:
rust// 1. Write the test first (RED)
#[test]
fn test_add_function_creates_contains_edge() {
    let mut graph = CodeGraph::in_memory();
    let file = graph.add_file(Path::new("test.rs"), "rust").unwrap();
    let func = graph.add_function(file, "main", 1, 10).unwrap();
    
    // Should create a Contains edge from file to function
    let edges = graph.get_edges_between(file, func).unwrap();
    assert_eq!(edges.len(), 1);
    assert_eq!(edges[0].edge_type, EdgeType::Contains);
}

// 2. Implement minimum code to pass (GREEN)
impl CodeGraph {
    pub fn add_function(&mut self, file_id: NodeId, name: &str, 
                        line_start: usize, line_end: usize) -> Result<NodeId> {
        let func_id = self.add_node(
            NodeType::Function,
            PropertyMap::new()
                .with("name", name)
                .with("line_start", line_start as i64)
                .with("line_end", line_end as i64)
        )?;
        self.add_edge(file_id, func_id, EdgeType::Contains, PropertyMap::new())?;
        Ok(func_id)
    }
}

// 3. Refactor as needed while keeping tests green
Reference Implementation:

Use existing test patterns from native/rust-core/src/codegraph/ tests
Follow test organization from forge-search/src/lib.rs tests
Adopt benchmark structure from workspace if available


Implementation Timeline
Month 1: Core Foundation
Week 1-2: Graph Engine Polish

Review and optimize existing graph.rs implementation
Add missing edge cases in adjacency indexing
Write comprehensive unit tests for all core operations
Benchmark and profile hot paths
Deliverable: Core graph with 95%+ test coverage

Week 3-4: Code Helpers

Implement code_helpers.rs module
Add convenience constructors (add_file, add_function, add_class)
Implement relationship helpers (add_call, add_import)
Write tests for each helper function
Deliverable: Code-specific API layer

Month 2: Query & Export
Week 5-6: Query Builder

Design and implement QueryBuilder API
Add filters (node type, file path, properties, name matching)
Implement result sorting and pagination
Write integration tests with real graph data
Deliverable: Fluent query interface

Week 7-8: Export Formats

Implement DOT export with Graphviz styling
Implement JSON export for web visualization
Implement CSV export for data analysis
Add triple export for RDF tools
Write tests and example usage for each format
Deliverable: Multiple export formats

Month 3: Polish & Release
Week 9-10: Documentation & Examples

Write comprehensive API documentation
Create 5+ working examples:

Rust call graph using syn
Dependency tree analyzer
Impact assessment tool
Graphviz visualization
Python parser using tree-sitter-python


Write integration guide for common parsers
Deliverable: Production-ready documentation

Week 11-12: Release Preparation

Performance optimization based on benchmarks
API stabilization (breaking changes now or never)
Prepare crates.io package
Write announcement blog post
Create GitHub release with examples
Deliverable: codegraph v0.1.0 on crates.io


Success Metrics
Technical Metrics (Month 1)

✅ 90%+ test coverage on core modules
✅ All benchmarks passing performance targets
✅ Zero unsafe code (or fully documented rationale)
✅ No clippy warnings with pedantic lints

Adoption Metrics (Month 4)

✅ 200+ GitHub stars
✅ 50+ downloads/week on crates.io
✅ 3+ community contributions (PRs or issues with good feedback)
✅ 2+ external projects using codegraph

Quality Metrics (Ongoing)

✅ <5 bugs per 1000 users
✅ <24hr response time on GitHub issues
✅ 95%+ test coverage maintained
✅ Documentation coverage for all public APIs


Out of Scope (v0.1.0)
Explicitly Not Included
Language Parsers

No built-in tree-sitter integration
No syn/swc wrappers
Users bring their own parser
Rationale: Reduces scope, avoids version conflicts, maximizes flexibility

Incremental Indexing

No file watcher integration
No delta updates from git diffs
Manual re-indexing required
Future: v0.2.0 feature

Semantic Analysis

No type inference
No scope resolution
No symbol table
Future: artifact-sitter project (separate)

Distributed/Cloud Features

No multi-machine clustering
No hosted service
No real-time sync
Future: Commercial offering possibility

Language-Specific Optimizations

No Rust-specific optimizations (macro expansion, trait resolution)
No JavaScript-specific features (module resolution)
Rationale: Keep core language-agnostic


API Design Examples
Basic Usage
rustuse codegraph::{CodeGraph, NodeType, EdgeType};

// Create or open a graph
let mut graph = CodeGraph::open("./my_project.graph")?;

// Add files (user parses, we store)
let main_rs = graph.add_file(Path::new("src/main.rs"), "rust")?;
let lib_rs = graph.add_file(Path::new("src/lib.rs"), "rust")?;

// Add functions
let main_fn = graph.add_function(main_rs, "main", 1, 10)?;
let helper = graph.add_function(lib_rs, "helper", 5, 15)?;

// Record relationships
graph.add_call(main_fn, helper, 7)?;
graph.add_import(main_rs, lib_rs, vec!["helper".to_string()])?;

// Query
let callers = graph.get_callers(helper)?;
println!("Functions calling helper: {:?}", callers);

// Export for visualization
std::fs::write("graph.dot", graph.export_dot()?)?;
Query Builder Usage
rust// Find all test functions in a specific file
let test_functions = graph.query()
    .node_type(NodeType::Function)
    .in_file("tests/integration_test.rs")
    .name_contains("test_")
    .execute()?;

// Find public functions that call private functions
let public_fns = graph.query()
    .node_type(NodeType::Function)
    .property("visibility", "public")
    .execute()?;

for func in public_fns {
    let callees = graph.get_callees(func)?;
    // ... analyze callees
}
Integration with Parsers
rustuse syn::{parse_file, Item};

// User's parsing logic
let source = std::fs::read_to_string("src/lib.rs")?;
let ast = parse_file(&source)?;

let mut graph = CodeGraph::open("./project.graph")?;
let file_id = graph.add_file(Path::new("src/lib.rs"), "rust")?;

for item in ast.items {
    match item {
        Item::Fn(func) => {
            let name = func.sig.ident.to_string();
            let line = func.sig.ident.span().start().line;
            graph.add_function(file_id, &name, line, line + 10)?;
        }
        // ... other items
    }
}

Dependencies
Required Crates
toml[dependencies]
# Core
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
bincode = "1.3"
thiserror = "1.0"

# Storage
rocksdb = "0.22"

# Utilities
indexmap = "2.0"  # Ordered maps for deterministic output

[dev-dependencies]
tempfile = "3.10"
criterion = "0.5"  # Benchmarking
```

### Reference Codebases

**Existing implementations to reference:**
1. `forge-search/src/graph.rs` - Graph structure and algorithms
2. `forge-search/src/chunker.rs` - Code parsing patterns (informational)
3. `native/rust-core/src/codegraph/` - Current foundation (starting point)
4. `mcp-context-server` - Integration patterns for tools

**Study but don't copy:**
- `forge-search/src/embeddings.rs` - Caching patterns
- `forge-search/src/indexer.rs` - File scanning patterns

---

## Risks & Mitigations

| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| Performance doesn't meet targets | High | Medium | Benchmark early, optimize hot paths, consider alternative algorithms |
| RocksDB stability issues | High | Low | Comprehensive testing, graceful degradation, document known issues |
| API design doesn't feel natural | Medium | Medium | Early user feedback, iterate on examples, dogfood in own tools |
| Adoption slower than expected | Medium | High | Focus on documentation quality, create compelling examples, engage community early |
| Scope creep (adding parsers) | High | High | Strict adherence to "no built-in parsers" principle, redirect to examples |

---

## Open Questions

1. **Graph schema versioning:** How do we handle schema evolution when users upgrade?
   - **Proposed:** Semantic versioning, migration guides, backwards compatibility for 0.x series

2. **Concurrency model:** Should we support concurrent reads/writes from multiple threads?
   - **Proposed:** v0.1.0 is single-writer, v0.2.0 adds read-write lock support

3. **Node ID stability:** Should IDs persist across graph rebuilds?
   - **Proposed:** IDs are ephemeral, use content-based keys (file path + line) for stability

4. **Property schema:** Should we enforce property schemas per node type?
   - **Proposed:** No enforcement in v0.1.0, but provide optional validation helpers

5. **Export size limits:** What's the largest graph we should support exporting to DOT/JSON?
   - **Proposed:** Warn above 10K nodes, error above 100K nodes, suggest chunking

---

## Appendix A: EdgeType Semantics

| EdgeType | Meaning | Example |
|----------|---------|---------|
| `Imports` | File A imports File B | `main.rs` imports `lib.rs` |
| `ImportsFrom` | File A imports symbols from File B | `main.rs` imports `{foo, bar}` from `lib.rs` |
| `Contains` | Parent contains child | File contains Function |
| `Calls` | Function A calls Function B | `main()` calls `helper()` |
| `Invokes` | Function invokes method | `obj.method()` |
| `Instantiates` | Function creates instance | `new Class()` |
| `Extends` | Class A extends Class B | `class Child extends Parent` |
| `Implements` | Class implements interface | `impl Trait for Struct` |
| `Uses` | Generic usage relationship | Variable uses Type |
| `Defines` | Module defines entity | Module defines Function |
| `References` | Generic reference | Variable references constant |

---

## Appendix B: Performance Benchmarks

All benchmarks run on: AMD Ryzen 9 / 32GB RAM / NVMe SSD

**Target Performance:**
```
Graph Size: 100K nodes, 500K edges

Node Operations:
  insert_node                 1.2 µs
  get_node                    800 ns
  update_node                 1.5 µs
  delete_node (with edges)    45 µs

Edge Operations:
  insert_edge                 1.8 µs
  get_edge                    900 ns
  get_edges_between           12 µs

Traversal:
  neighbors (outgoing)        8 µs
  neighbors_typed             10 µs
  bfs_traversal (depth=5)     45 ms
  dfs_traversal (depth=5)     38 ms

Algorithms:
  find_callers                15 µs
  transitive_deps (depth=5)   120 ms
  circular_deps (100 nodes)   250 ms

Storage:
  batch_insert (1K nodes)     25 ms
  flush_to_disk              50 ms
  load_from_disk             180 ms

Appendix C: Example Projects (Post-Release)
Community contributions we'd like to see:

codegraph-rust - First-party Rust parser using syn
codegraph-js - JavaScript/TypeScript parser using swc
codegraph-py - Python parser using tree-sitter
codegraph-viz - Web-based graph visualizer
codegraph-cli - Command-line tool for common queries
codegraph-lsp - LSP server showing code relationships in editors