laburnum 1.17.1

# Laburnum

Laburnum is a Language Server Protocol (LSP) framework built on an incremental
query tree. It provides the foundation for implementing language servers whose
analyses — parsing, name resolution, type checking, diagnostics, completions —
are expressed as tasks over a content-addressed, incrementally recomputed
database. As the editor sends file changes, only the tasks whose inputs have
actually changed are re-run; everything else is reused from prior chunks.

The LSP layer (protocol types, request routing, position encoding, capability
negotiation) is part of the framework, but Laburnum itself is language-agnostic:
language-specific behaviour lives in the implementing crate.

## Incremental Query Tree

The big idea at the heart of Laburnum is to create an incrementally computed
tree, where updates are worked on in parallel then merged back into the tree.

To make this work a computation task has access to a read only view of the tree,
and can query the tree in a number of different ways. A task can wait for queries
to return records, or timeout.

New records are created inside a task, and stored in a `Chunk`. When the task
completes, the chunk is merged into the tree, and can be queried by other tasks.

The database is queryable as a HashMap of BTrees similar to DynamoDB. With each
`Record` being identified with a `partition_key` and a `sort_key`.

Each `Chunk` is immutable once created, and can contain records from many different
partitions.

It is important that Tasks are as deterministic as practical. For the same input
and query results, the same records should be returned. The values of the records
can be different, but the partition_key and sort_key should be the same.

This is a constraint to some systems, but a natural rule for incremental systems
like compilers. Where computation Tasks form a dataflow DAG.

Because our data ends up distributed across `Chunks` we can often parallelise
queries with decent performance improvements, see [benchmarks](#benchmarks).

Tasks can spawn other tasks, creating a task graph that is automatically tracked.
You can check for the results of a task, and spawn it if they're not there.

It's expected that Tasks form a graph, and will be recomputed from the start of
the graph each time. As all the expensive work is stored in the database, we can
walk over the graph and only recompute what's changed.

## Architecture Overview

### Task System

Laburnum's computation model is built around asynchronous tasks that incrementally
compute data:

- **LaburnumTask**: A struct that wraps an async future producing records. Each task
  executes via `poll_once()` and writes results as chunks to the database.

- **TaskContext**: Given to tasks during execution. Provides a `query_client()` method
  to read from the database and a `spawn_task()` method to create child tasks.

- **Parent Tracking**: Parent-child relationships between tasks are tracked via the
  `parent_task_id` field on chunks. When a task spawns another task via
  `TaskContext::spawn_task()`, the parent is recorded on the resulting chunk.

- **Scheduler**: Manages task execution. You add tasks with `queue()` and call
  `run()` to execute them. Tasks are executed by a worker pool according to lane
  priority.

The parent tracking enables incremental recompilation - when inputs change, only tasks
that depend on those inputs need to be recomputed.

### Chunks and Storage

All data in Laburnum is stored in immutable **Chunks**:

- Each chunk is identified by a content hash computed from its inputs and dependencies
- Chunks contain records organized by `partition_key` and `sort_key`
- Multiple chunks can contain records for the same partition key - this is expected
  and beneficial for parallelism
- Chunks are never modified - new computations create new chunks

The database is generic over storage via the **RecordStorage** trait, allowing
different backends for different use cases (compiler IR, metadata, etc.).

### Query System

Tasks query the database through **QueryClient**:

- **Direct queries**: `get_record(partition_key, sort_key)` for exact lookups
- **Fluent queries**: `query(partition_key).sort_key_begins_with(prefix).execute()`
  for range and prefix queries
- **Batch queries**: `batch_get_items()` for multiple lookups in one call

When a task queries for data, the QueryClient automatically:

1. Tracks which chunks were read (building dependency graph)
2. Records missing data as "pending dependencies"
3. Decides whether to execute sequentially or in parallel (see Parallelization)

If a query returns no results, the task blocks until another task produces that data.

### Supporting Components

**Identifiers (Ident):**

- Deterministic hash-based identifiers from strings (using rapidhash v3)
- Used for `partition_key`, `task_id`, and other keys
- Enables fast equality checks and hashing

**Spans (Span & SpanCache):**

- Track source code locations for error reporting
- SpanCache is a per-file cache (created with file_id and version)
- Supports creating rich, user-friendly diagnostics

## Key Design Features

### Content-Addressed Storage

Chunks are immutable and identified by content hash. New source changes create
new chunks, not replacements. Old chunks remain until garbage collected.

### Dependency Tracking

Automatic tracking of which chunks were read during computation. Forms a DAG
from source files through compilation stages. Enables precise incremental
recompilation.

### Query Patterns

- **Location-based**: Query by primary key + sort key (gets latest data)
- **Content-based**: Query by content hash (immutable lookup)
- **Prefix search**: Query by sort key prefix for hierarchical data
- **Range queries**: Query by sort key ranges (between, less than, greater than)

Queries naturally return all matching records across all chunks - this is the
correct behavior. The query system aggregates results from multiple chunks and
returns them sorted.

### Performance Optimizations

Parallel queries across chunks when beneficial. Adaptive thresholds based on
chunk count. No defragmentation needed - embraces distributed chunks as a
performance feature.

### Garbage Collection (partially implemented)

Mark-and-sweep from entry points (source files). Lazy GC during idle periods.
Maintains referential integrity across indexes.