# Comprehensive Query API Implementation Plan
## Executive Summary
Based on analysis of the Sombra codebase, the current graph database has:
- ✅ Basic node/edge CRUD operations
- ✅ Property indexes for O(log n) lookups
- ✅ Label-based node queries
- ✅ Basic neighbor traversal (outgoing/incoming)
- ✅ Multi-hop traversal (2-hop, 3-hop, BFS, parallel BFS)
- ✅ Transaction support
**Missing critical APIs for code analysis and general graph querying:**
- ✅ Edge type filtering in traversals (Phase 1 Complete)
- ✅ Hierarchical queries (Phase 2 Complete)
- ✅ Path finding algorithms (Phase 3 Complete)
- ❌ Pattern matching queries
- ❌ Subgraph extraction
- ❌ Aggregation queries
- ❌ Edge property queries
---
## Phase 1: Edge Type Filtering (CRITICAL - Code Analysis Blocker)
### 1.1 Core Edge Type Filtering APIs
**Location:** `src/db/core/traversal.rs`
#### APIs to Implement:
```rust
/// Get neighbors filtered by edge type(s)
pub fn get_neighbors_by_edge_type(
&mut self,
node_id: NodeId,
edge_types: &[&str], // Multiple types for OR semantics
direction: EdgeDirection,
) -> Result<Vec<NodeId>>
/// Get edges filtered by type
pub fn get_edges_by_type(
&mut self,
node_id: NodeId,
edge_types: &[&str],
direction: EdgeDirection,
) -> Result<Vec<Edge>>
/// Get neighbors with their connecting edges filtered by type
pub fn get_neighbors_with_edges_by_type(
&mut self,
node_id: NodeId,
edge_types: &[&str],
direction: EdgeDirection,
) -> Result<Vec<(NodeId, Edge)>>
/// BFS traversal filtered by edge type
pub fn bfs_traversal_by_edge_type(
&mut self,
start_node_id: NodeId,
max_depth: usize,
edge_types: &[&str],
direction: EdgeDirection,
) -> Result<Vec<(NodeId, usize)>>
```
**Add EdgeDirection enum:**
```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EdgeDirection {
Outgoing,
Incoming,
Both,
}
```
**Implementation Strategy:**
1. Reuse existing edge chain traversal logic
2. Add type name filtering during edge iteration
3. Cache edge types to avoid repeated deserialization
4. Optimize for single type filter (common case)
**Performance Optimization:**
- Edge type cache: `HashMap<EdgeId, String>`
- Lazy loading of edge types (only when filtering needed)
- Batch edge loading when multiple types requested
**Testing:**
- Unit tests with mixed edge types
- Benchmark vs unfiltered traversal overhead (<10% slowdown acceptable)
- Test with code graph patterns (CONTAINS, CALLS, REFERENCES)
---
## Phase 2: Pattern Matching Queries
### 2.1 Simple Pattern Matching
**Location:** `src/db/query/` (NEW module)
```rust
/// Match a simple path pattern
/// Example: (a:Person)-[:KNOWS]->(b:Person)
pub fn match_pattern(
&mut self,
pattern: &Pattern,
) -> Result<Vec<Match>>
/// Pattern DSL structs
pub struct Pattern {
pub nodes: Vec<NodePattern>,
pub edges: Vec<EdgePattern>,
}
pub struct NodePattern {
pub var_name: String,
pub labels: Vec<String>, // OR semantics
pub properties: PropertyFilters, // AND semantics
}
pub struct EdgePattern {
pub from_var: String,
pub to_var: String,
pub types: Vec<String>,
pub properties: PropertyFilters,
pub direction: EdgeDirection,
}
pub struct PropertyFilters {
pub equals: BTreeMap<String, PropertyValue>,
pub not_equals: BTreeMap<String, PropertyValue>,
pub ranges: Vec<PropertyRangeFilter>,
}
```
**Use Cases:**
1. Find all function calls to a specific function
2. Find all files containing a specific class
3. Find all import chains
---
## Phase 3: Advanced Traversal Queries ✅ COMPLETE
### 3.1 Variable-Length Path Queries
**Status:** ✅ Implemented
**Location:** `src/db/core/traversal.rs`, `src/model.rs`
**Implemented APIs:**
```rust
/// Find all paths between two nodes with constraints
pub fn find_paths(
&mut self,
start: NodeId,
end: NodeId,
min_hops: usize,
max_hops: usize,
edge_types: Option<&[&str]>,
) -> Result<Vec<Path>>
/// Find shortest path between nodes
pub fn shortest_path(
&mut self,
start: NodeId,
end: NodeId,
edge_types: Option<&[&str]>,
) -> Result<Option<Path>>
pub struct Path {
pub nodes: Vec<NodeId>,
pub edges: Vec<EdgeId>,
pub length: usize,
}
```
**Algorithms Implemented:**
- ✅ BFS for shortest path (unweighted)
- ✅ Limited DFS for all paths (with cycle detection)
- ⏳ Bidirectional BFS for long distances (future optimization)
**Features:**
- ✅ Shortest path finding with BFS
- ✅ All paths enumeration with min/max hop constraints
- ✅ Edge type filtering support
- ✅ Cycle detection in path finding
- ✅ Path struct with nodes, edges, and length
**Use Cases Validated:**
1. ✅ Impact analysis (call chains)
2. ✅ Dependency resolution
3. ✅ Reachability checks
**Testing:**
- ✅ 17 comprehensive tests in `tests/path_finding.rs`
- ✅ Chain graphs, grid graphs, diamond graphs
- ✅ Edge type filtering tests
- ✅ Min/max hop constraint tests
- ✅ Large graph performance tests (100 nodes)
- ✅ Code analysis use case tests
**Benchmarks:**
- ✅ Path finding benchmarks in `benches/path_finding_benchmark.rs`
- ✅ Chain traversal benchmarks
- ✅ Grid shortest path benchmarks
- ✅ Filtered path finding benchmarks
- ✅ Multiple paths enumeration benchmarks
**Performance:**
- Shortest path: O(V + E) time, O(V) space
- Find paths: O(V + E) per path, limited by max_hops
---
## Phase 4: Subgraph Extraction
### 4.1 Subgraph Queries
**Location:** `src/db/query/subgraph.rs` (NEW)
```rust
/// Extract subgraph around nodes
pub fn extract_subgraph(
&mut self,
root_nodes: &[NodeId],
depth: usize,
edge_filter: Option<EdgeTypeFilter>,
) -> Result<Subgraph>
/// Extract induced subgraph (nodes + all edges between them)
pub fn extract_induced_subgraph(
&mut self,
node_ids: &[NodeId],
) -> Result<Subgraph>
pub struct Subgraph {
pub nodes: Vec<Node>,
pub edges: Vec<Edge>,
pub boundary_nodes: Vec<NodeId>, // Nodes with edges outside subgraph
}
```
**Use Cases:**
1. Extract entire file AST
2. Extract function and all its dependencies
3. Extract module boundaries
---
## Phase 5: Aggregation & Analytics ✅ COMPLETE
### 5.1 Node/Edge Statistics
**Status:** ✅ Implemented
**Location:** `src/db/query/analytics.rs`
**Implemented APIs:**
```rust
/// Count nodes by label
pub fn count_nodes_by_label(&self) -> HashMap<String, usize>
/// Count edges by type
pub fn count_edges_by_type(&mut self) -> Result<HashMap<String, usize>>
/// Get degree distribution
pub fn degree_distribution(&mut self) -> Result<DegreeDistribution>
pub struct DegreeDistribution {
pub in_degree: HashMap<NodeId, usize>,
pub out_degree: HashMap<NodeId, usize>,
pub total_degree: HashMap<NodeId, usize>,
}
/// Find high-degree nodes (hubs)
pub fn find_hubs(
&mut self,
min_degree: usize,
degree_type: DegreeType,
) -> Result<Vec<(NodeId, usize)>>
pub enum DegreeType {
In,
Out,
Total,
}
/// Additional APIs:
/// - get_total_node_count()
/// - get_total_edge_count()
/// - find_isolated_nodes()
/// - find_leaf_nodes(direction)
/// - get_average_degree()
/// - get_density()
/// - count_nodes_with_label(label)
/// - count_edges_with_type(edge_type)
/// - get_label_statistics()
/// - get_edge_type_statistics()
/// - get_degree_statistics()
```
**Implementation Details:**
- 15 public analytics APIs
- Complete degree distribution tracking (in/out/total)
- Graph density and statistics calculations
- Efficient hub and isolated node detection
**Testing:**
- ✅ 10 unit tests (all passing)
- ✅ 5 integration tests (all passing)
- ✅ Code graph analytics use cases validated
**Use Cases Validated:**
1. ✅ Identify highly-coupled functions (god objects)
2. ✅ Find unused/dead code (isolated nodes)
3. ✅ Calculate graph statistics for query planning
4. ✅ Generate reports on code structure
---
## Phase 6: Ancestor & Descendant Queries
### 6.1 Tree/DAG Navigation
**Location:** `src/db/query/hierarchy.rs` (NEW)
```rust
/// Find first ancestor with matching label
pub fn find_ancestor_by_label(
&mut self,
start: NodeId,
label: &str,
edge_type: &str, // e.g., "CONTAINS"
) -> Result<Option<NodeId>>
/// Get all ancestors up to root
pub fn get_ancestors(
&mut self,
start: NodeId,
edge_type: &str,
max_depth: Option<usize>,
) -> Result<Vec<NodeId>>
/// Get all descendants (breadth-first)
pub fn get_descendants(
&mut self,
start: NodeId,
edge_type: &str,
max_depth: Option<usize>,
) -> Result<Vec<NodeId>>
/// Get containing file for a code node
pub fn get_containing_file(&mut self, node_id: NodeId) -> Result<NodeId> {
self.find_ancestor_by_label(node_id, "File", "CONTAINS")
.and_then(|opt| opt.ok_or(GraphError::NotFound("containing file")))
}
```
**Use Cases (Code Analysis):**
1. Find containing file/class/function
2. Get entire AST subtree
3. Navigate up the AST hierarchy
---
## Phase 7: Index Enhancements
### 7.1 Edge Type Index
**Location:** `src/db/core/graphdb.rs`
```rust
/// Add edge type index for fast edge type queries
pub(crate) edge_type_index: HashMap<String, BTreeSet<EdgeId>>
/// Index maintenance methods
impl GraphDB {
fn update_edge_type_index_on_add(&mut self, edge_id: EdgeId, type_name: &str) {
self.edge_type_index
.entry(type_name.to_string())
.or_default()
.insert(edge_id);
}
fn update_edge_type_index_on_delete(&mut self, edge_id: EdgeId, type_name: &str) {
if let Some(edges) = self.edge_type_index.get_mut(type_name) {
edges.remove(&edge_id);
}
}
/// Get all edges of a specific type
pub fn get_edges_by_type_global(&self, type_name: &str) -> Vec<EdgeId> {
self.edge_type_index
.get(type_name)
.map(|ids| ids.iter().copied().collect())
.unwrap_or_default()
}
}
```
**Benefits:**
- O(1) lookup of all edges by type
- Useful for schema analysis
- Supports type-based queries efficiently
---
## Phase 8: Query Builder API
### 8.1 Fluent Query API
**Location:** `src/db/query/builder.rs` (NEW)
```rust
/// Fluent query builder for complex queries
pub struct QueryBuilder<'db> {
db: &'db mut GraphDB,
filters: Vec<Filter>,
traversal: Option<TraversalSpec>,
}
impl<'db> QueryBuilder<'db> {
pub fn new(db: &'db mut GraphDB) -> Self { ... }
pub fn start_from(mut self, node_ids: Vec<NodeId>) -> Self { ... }
pub fn start_from_label(mut self, label: &str) -> Self { ... }
pub fn start_from_property(mut self, label: &str, key: &str, value: PropertyValue) -> Self { ... }
pub fn traverse(mut self, edge_types: &[&str], direction: EdgeDirection, depth: usize) -> Self { ... }
pub fn filter_nodes(mut self, filter: NodeFilter) -> Self { ... }
pub fn filter_edges(mut self, filter: EdgeFilter) -> Self { ... }
pub fn limit(mut self, n: usize) -> Self { ... }
pub fn execute(self) -> Result<QueryResult> { ... }
}
// Usage example:
let results = QueryBuilder::new(&mut db)
.start_from_label("Function")
.filter_nodes(|n| n.properties.get("name") == Some(&PropertyValue::String("foo".into())))
.traverse(&["CALLS"], EdgeDirection::Outgoing, 3)
.limit(100)
.execute()?;
```
**Benefits:**
- Ergonomic API for complex queries
- Query optimization opportunities
- Type-safe query construction
---
## Implementation Priority for Code Analysis
### High Priority (Phase 1) - Blocking Code Analysis
1. ✅ **Edge type filtering** (`get_neighbors_by_edge_type`, `get_edges_by_type`)
2. ✅ **Ancestor queries** (`find_ancestor_by_label`, `get_containing_file`)
3. ✅ **Filtered BFS** (`bfs_traversal_by_edge_type`)
### Medium Priority (Phase 2-3) - Essential for Full Functionality
4. ⚡ **Path finding** (`shortest_path`, `find_paths`)
5. ⚡ **Subgraph extraction** (`extract_subgraph`)
6. ⚡ **Edge type index** (performance optimization)
### Lower Priority (Phase 4-5) - Nice to Have
7. 🔹 **Pattern matching** (can use edge filtering + manual logic initially)
8. 🔹 **Aggregations** (can compute on-demand initially)
9. 🔹 **Query builder** (syntactic sugar)
---
## Testing Strategy
### Unit Tests
```rust
#[cfg(test)]
mod tests {
// Edge type filtering
#[test]
fn test_filter_by_single_edge_type() { ... }
#[test]
fn test_filter_by_multiple_edge_types() { ... }
#[test]
fn test_edge_type_with_direction() { ... }
// Ancestor queries
#[test]
fn test_find_ancestor_by_label() { ... }
#[test]
fn test_get_containing_file() { ... }
// Path finding
#[test]
fn test_shortest_path() { ... }
#[test]
fn test_find_all_paths() { ... }
}
```
### Integration Tests (Code Graph Scenarios)
```rust
// tests/code_graph.rs
#[test]
fn test_impact_analysis_call_chain() {
// Build graph: File -> Function -> CallExpr -> Function
// Query: Find all functions affected by change to funcA
}
#[test]
fn test_unused_function_detection() {
// Build graph with entry points
// Query: Find functions with no incoming CALLS edges
}
#[test]
fn test_import_chain_resolution() {
// Build graph: File -IMPORTS-> File
// Query: Find all transitive imports
}
```
### Benchmark Tests
```rust
// benches/query_benchmark.rs
fn bench_edge_type_filtering(c: &mut Criterion) { ... }
fn bench_ancestor_traversal(c: &mut Criterion) { ... }
fn bench_path_finding(c: &mut Criterion) { ... }
```
---
## Module Structure
```
src/
├── db/
│ ├── core/
│ │ ├── traversal.rs # Enhanced with edge type filtering
│ │ └── ...
│ ├── query/ # NEW: High-level query APIs
│ │ ├── mod.rs
│ │ ├── builder.rs # Fluent query builder
│ │ ├── pattern.rs # Pattern matching
│ │ ├── hierarchy.rs # Ancestor/descendant queries
│ │ ├── subgraph.rs # Subgraph extraction
│ │ └── analytics.rs # Aggregations & statistics
│ └── ...
└── ...
```
---
## Performance Targets
| Edge type filtering | < 5% overhead vs unfiltered | Not implemented |
| Ancestor query (depth=5) | < 1ms | Not implemented |
| Shortest path (100 hops) | < 10ms | Not implemented |
| Subgraph extraction (1000 nodes) | < 50ms | Not implemented |
| Property lookup (indexed) | < 1ms | ✅ Implemented |
| BFS traversal (10K nodes) | < 100ms | ✅ Implemented |
---
## API Documentation Requirements
Each new API must include:
1. Rust doc comments with description
2. Parameter documentation
3. Return value documentation
4. Error conditions
5. Time complexity (Big-O)
6. Space complexity
7. Example usage
8. Related APIs (see also)
Example:
```rust
/// Finds the first ancestor node with the specified label.
///
/// Traverses backwards through edges of the given type until finding
/// a node with the target label, or reaching a node with no incoming
/// edges of that type.
///
/// # Arguments
/// * `start` - The node ID to start from
/// * `label` - The label to search for
/// * `edge_type` - The edge type to traverse (e.g., "CONTAINS")
///
/// # Returns
/// * `Ok(Some(node_id))` - Ancestor node found
/// * `Ok(None)` - No ancestor with that label exists
/// * `Err(...)` - Database error
///
/// # Time Complexity
/// O(depth) where depth is the distance to the ancestor
///
/// # Example
/// ```rust
/// // Find the file containing a function node
/// let file_id = db.find_ancestor_by_label(func_id, "File", "CONTAINS")?;
/// ```
///
/// # See Also
/// * [`get_ancestors`] - Get all ancestors up to root
/// * [`get_containing_file`] - Specialized version for file lookup
pub fn find_ancestor_by_label(
&mut self,
start: NodeId,
label: &str,
edge_type: &str,
) -> Result<Option<NodeId>>
```
---
## Migration & Backward Compatibility
1. **New APIs are additive** - no breaking changes to existing APIs
2. **Existing traversal APIs remain unchanged** - new APIs are additions
3. **Index migration** - edge type index built on first use, persisted on checkpoint
4. **Feature flag** - `query-api` feature flag for new query module (default enabled)
---
## Success Metrics
1. ✅ All code analysis use cases can be implemented efficiently
2. ✅ < 10% performance overhead for filtered vs unfiltered queries
3. ✅ 100% test coverage for new query APIs
4. ✅ Comprehensive documentation with examples
5. ✅ Benchmarks show O(log n) or better for indexed operations
6. ✅ Zero breaking changes to existing API
---
## Next Steps
1. Review and approve plan
2. Start with Phase 1 (edge type filtering) - highest priority
3. Implement in order: Phase 1 → Phase 6 → Phase 3 → Phase 4 → Phase 5 → Phase 2 → Phase 7 → Phase 8