laburnum 1.17.0

We need to design the data structures for the database for our content adressed compiler.

The database is composed of a series of Chunks.

A Chunk is the result of a Task, and contains any records that were created in that task.

The overarching design of this Database follows DynamoDB's HashMap of BTrees.

We have a type Ident(u64) that is a deterministic hash of a string. We use this in place of strings everwhere we have an identifier. This includes item/entity names (like the name of a function of struct), fields, properties or other data about an entity, and so on.

We want to run compilation such that as much as possible can be done in isolation. SO we break up the work into asyncTasks. The tasks have some extra machinery in them so that when a task wants to get some data (looking something by a hash), the Task brokers the call, and if the data is not available it prevents the task from running until it it. It does this by checking in the Future poll method if there is any pending dependencies.

NOTE: you do not need to implement the Task code.

We're looking at using Dashmap as it looks to allow us to query on, and update a hashmap of chunks, with minimal locking. Locking at the chunk level is ideal, in comparison to the whole map.

Now we get to the tricky bit. The Chunk returned from a computation isn't going to neatly cover everything about an item.

When querying the database, it's likely that we would make multiple calls to fetch data from many chunks.

A Chunk is therefore a bit of an implementation detail, or at least, it's an arena. Why an arena? Well for incremental compilation it's just as important to update data, as it is to know which data to drop. We don't want unbounded memory growth.

So it follows that we will need a number of indexes in the database, that help us navigate the chunks.

Finally, the table composition itself.

WE're building a library for use by multiple compilers, each with similar but different AST's and symbol table requirements.

We have a baseline requirement that we are only supporting Algol/C/Rust style languages. That is, modules, functions, structs, and so on. But these concepts will not be baked into the table.

We have found it's easier to build fault tolerant compilers and language servers when the semantics of the language are NOT baked into the host languages constructs and type system.

That is to say, representing the language as Data is the goal.

However, we still need to know the specific types of the columns in our database.

So, lets consider what a Chunk contains. As we know from the DynamoDB style, we have a primary key (Ident) and a sort key (for the Btree). A chunk contains zero or more records, that may have different primary keys.

It would be simple if every chunk was the whole partition for a primary key, but that's not useful.

## Your Task:

With the above context, collaborate with me to design how the database and Chunks can work.

Initially, we'll work with an idealised simple Chunk with a few fields declared in the Chunk struct. Later we'll expand on how to make that customisable per language, but for now we wont. Lets use the fields name: Ident, path: Vec<Ident> (ie, a function in a module might have a path of the module_name, function_name, kind(which we'll make an enum of function, struct, expression, record[ie key value pair]),

When we compute a chunk, the id is something we can deterministically calculate, as all computation starts from a file, which has an id and version number. If we have two chunks, with version numbers, we can always know the higher version is newer.

Show me options for how we can structure this database

how can we add indexes over the chunks?

how can we manage chunks, so that we remove older versions?

if we perhaps, track which chunks produced other chunks, we could know which ones to delete after computation.

we dont have to eagerly delete chunks, as we're optimising for LSP usage, where there will be moments of high activity followed by lull periods.

When we want to query a for data we query with similar operations to dynamodb, whereby we must have the primary_key and either the exact sort_key, no sort key at all (for all records), or a prefix search with a partial sort key

---

As our records are broken up into chunks, can we take advantage of this, by running
our queries over multiple threads?

We would probably need to do something where chunks are distributed across
threads.

---

The magic is that when a source file changes:

New chunk is computed with new ID
Any computation that depends on records from that file will:

Query by location (file_id + path)
Get the NEW records (because we updated the primary_index)
Produce a NEW chunk ID (because the dependency changed)

This cascades through the DAG naturally