tilezz 0.1.0 - Docs.rs

tilezz-dafsa JSON schema (version 1)
====================================

  Describes:  the single-JSON serialization of a generic DAFSA
              (`Dafsa::write_json`'s output; embedded as
              `dafsa.inner` inside a tilezz-rat-dafsa).
  Distinct from: tilezz-rat-dafsa-blocks (the multi-file blocked
                 asset format -- see blocks_schema.txt for that one)

A serialized DAFSA is a single JSON object with the following fields.
All arrays use 0-based indexing. State 0 is always the root. The
label alphabet is `i8`.

  {
    "format":      "tilezz-dafsa",       // discriminator
    "version":     1,                     // u32, currently 1
    "scalar":      "i8",                  // label element type, always "i8"
    "n_states":    <usize>,               // == edges_start.len()
                                          //  == counts.len()
    "n_edges":     <usize>,               // == labels.len()
                                          //  == targets.len()
    "edges_start": [u32, ...],            // CSR row pointers
    "labels":      [i8, ...],             // per edge (sorted within state)
    "targets":     [u32, ...],            // per edge
    "counts":      [u64, ...]             // per state: # accepted seqs
                                          //  reachable, including empty
                                          //  continuation if accepting
  }

Reading the automaton
---------------------

A state `s` (0 <= s < n_states) owns the edge slice
  labels[ edges_start[s] .. edges_start_or_n_edges(s+1) ]
  targets[ edges_start[s] .. edges_start_or_n_edges(s+1) ]
where `edges_start_or_n_edges(s+1)` is `edges_start[s+1]` if `s+1 <
n_states` and `n_edges` otherwise. Edges within a state appear sorted
ascending by `label`; binary search by label is the natural lookup.

Derived: is_accept and total accept count
-----------------------------------------

Two values are NOT stored on disk because they are redundant with
the data above; reconstruct them once at load time:

* `is_accept[s] := counts[s] > sum(counts[targets[k]]
                                   for k in edges_start[s] ..
                                           edges_start_or_n_edges(s+1))`
* `accept_count := counts[0]`  (the count at the root)

In any well-formed file the difference

  counts[s] - sum-over-children-of-counts

is exactly 0 (non-accepting) or 1 (accepting -- the +1 comes from
the empty continuation at s). Readers should reject files where the
difference falls outside {0, 1}, since that indicates a corrupt or
non-minimal automaton.

Set membership for a sequence `seq`:

  s = 0
  for symbol in seq:
      slice = edges of state s
      idx   = binary_search(slice.labels, symbol)
      if not found: return False
      s = slice.targets[idx]
  return is_accept[s]

Indexed get (the i-th accepted sequence in lex order):

  out = []; s = 0; rem = i
  while True:
      if is_accept[s]:
          if rem == 0: return out
          rem -= 1
      for label, target in edges of state s (in stored order):
          n = counts[target]
          if rem < n:
              out.append(label); s = target; break
          rem -= n

Lex-sorted iteration is just a depth-first walk that yields a copy of
the running prefix each time it enters an accepting state.

Scalars
-------

`scalar` is always `"i8"`. Readers should reject files whose
`scalar` value is anything else; future label widths would require a
format-version bump.