FSST (Fast Static Symbol Table) codec for string/log columns.
Builds a lightweight dictionary of common substrings (1-8 bytes) and encodes strings as sequences of symbol table indices. Unlike whole-string dictionary encoding, FSST handles partial overlap — strings sharing prefixes or suffixes compress well even if no two strings are identical.
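The encoding step can be illustrated with a minimal sketch. It assumes a greedy longest-match strategy over an already-built symbol table and uses the escape byte 255 from the wire format notes; `encode_greedy` and `ESCAPE` are hypothetical names for illustration, not this module's API, and the actual matching and table-building strategy may differ.

```rust
/// Escape marker: the following byte is a literal (taken from the wire format notes).
const ESCAPE: u8 = 255;

/// Hypothetical helper: greedy longest-match encoding of one string
/// against a fixed symbol table of at most 255 symbols (1-8 bytes each).
fn encode_greedy(symbols: &[&[u8]], input: &[u8]) -> Vec<u8> {
    assert!(symbols.len() <= 255, "indices must fit below the escape byte");
    let mut out = Vec::with_capacity(input.len());
    let mut pos = 0;
    while pos < input.len() {
        let rest = &input[pos..];
        // Find the longest symbol that is a prefix of the remaining input.
        let mut best: Option<(usize, usize)> = None; // (symbol index, symbol length)
        for (idx, sym) in symbols.iter().enumerate() {
            let longer = best.map_or(true, |(_, len)| sym.len() > len);
            if !sym.is_empty() && longer && rest.starts_with(sym) {
                best = Some((idx, sym.len()));
            }
        }
        match best {
            // A symbol matched: emit its one-byte index and skip the covered bytes.
            Some((idx, len)) => {
                out.push(idx as u8);
                pos += len;
            }
            // No symbol matched: emit the escape byte followed by the literal byte.
            None => {
                out.push(ESCAPE);
                out.push(rest[0]);
                pos += 1;
            }
        }
    }
    out
}
```

With a table of `http://`, `www.` and `.com`, the 16-byte input `http://www.a.com` becomes five bytes: three symbol indices plus one escaped literal. No two inputs need to be identical for this to pay off, only to share substrings.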
Compression: 3-5x on string columns before any terminal compressor. Combined with lz4_flex terminal: 8-15x total on structured log data.
Decompression: simple table lookup — fast enough to query directly over encoded data.
Wire format:
[2 bytes] symbol count (LE u16, max 255)
[symbol_count × (1 + len) bytes] symbol table: (len: u8, bytes: [u8; len])
[4 bytes] total encoded length (LE u32)
[4 bytes] string count (LE u32)
[string_count × 4 bytes] encoded string offsets (cumulative LE u32)
[N bytes] encoded data (symbol indices interleaved with escape+literal)
Escape mechanism: byte value 255 followed by a literal byte encodes bytes not covered by any symbol. Symbol indices are 0..=254.
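A minimal sketch of reading this layout follows. It is not the module's `decode` implementation: `take`, `read_u32` and `decode_sketch` are hypothetical names, and the sketch assumes that each cumulative offset marks the exclusive end of the corresponding string within the encoded data, which the format description does not state explicitly.

```rust
/// Read `n` bytes at `*pos` and advance the cursor; `None` if the buffer is too short.
fn take<'a>(buf: &'a [u8], pos: &mut usize, n: usize) -> Option<&'a [u8]> {
    let end = pos.checked_add(n)?;
    let slice = buf.get(*pos..end)?;
    *pos = end;
    Some(slice)
}

/// Read a little-endian u32 and widen it to usize.
fn read_u32(buf: &[u8], pos: &mut usize) -> Option<usize> {
    let b = take(buf, pos, 4)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]) as usize)
}

/// Hypothetical decoder for the layout above; returns `None` on malformed input.
fn decode_sketch(buf: &[u8]) -> Option<Vec<Vec<u8>>> {
    let mut pos = 0;

    // Header: symbol count, then (len, bytes) pairs.
    let b = take(buf, &mut pos, 2)?;
    let symbol_count = u16::from_le_bytes([b[0], b[1]]) as usize;
    if symbol_count > 255 {
        return None;
    }
    let mut symbols: Vec<&[u8]> = Vec::with_capacity(symbol_count);
    for _ in 0..symbol_count {
        let len = take(buf, &mut pos, 1)?[0] as usize;
        symbols.push(take(buf, &mut pos, len)?);
    }

    // Lengths and per-string offsets (ASSUMPTION: offset i = exclusive end of string i).
    let encoded_len = read_u32(buf, &mut pos)?;
    let string_count = read_u32(buf, &mut pos)?;
    let mut ends = Vec::new();
    for _ in 0..string_count {
        ends.push(read_u32(buf, &mut pos)?);
    }
    let data = take(buf, &mut pos, encoded_len)?;

    // Body: each code byte is either a symbol index or 255 + literal.
    let mut strings = Vec::new();
    let mut start = 0;
    for end in ends {
        let mut codes = data.get(start..end)?.iter();
        let mut s = Vec::new();
        while let Some(&code) = codes.next() {
            if code == 255 {
                s.push(*codes.next()?); // escaped literal byte
            } else {
                s.extend_from_slice(symbols.get(code as usize)?); // table lookup
            }
        }
        strings.push(s);
        start = end;
    }
    Some(strings)
}
```

Every slice access goes through `get`, so truncated or inconsistent input yields `None` rather than a panic. Decoding a single string only needs its offset pair and the symbol table, which is what makes querying individual values without decoding the whole column plausible.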
Functions§
- decode
- Decode FSST-compressed data back to strings.
- decode_delimited
- Convenience: decode and reassemble with delimiter.
- encode
- Encode a batch of strings using FSST compression.
- encode_delimited
- Convenience: encode a single contiguous byte buffer that contains multiple strings separated by a delimiter (e.g., newlines for log data).