Module jupiter::ig

source ·
Expand description

Provides a compact and highly efficient representation of in-memory databases.

infograph permits to generate or load structured data, which can then be queried in various ways. This could be anything from simple code lists (name value pair) to complex graphs of master data like nested lookup tables translated into several languages.

The work horse is the Doc struct which contains the root Node and the SymbolTable. Nodes as the internal representation of the supported data types (strings, ints, bools, lists and maps). The external interface used to access them is Element which carries the symbol table along.

Symbols are simply numbers (i32) which are managed by a SymbolTable. The reason is that many documents contain lots of inner objects (maps) which all share the same keys. To optimize memory consumption and also to improve access times, these keys are only put in them symbol table once and then referenced as Symbol.

If a Symbol or a Query (which is a path of symbols to access an inner object) is used several times, it can be compiled and then re-used, which further improves performance.

For lists of objects a Table can be used to further speed up querying objects and data from it. A table permits to add indices for common queries and also keeps a cache for both, pre-compiled queries and frequent lookups around.

To create a Doc programmatically, a DocBuilder can be used.

Also there are helpers to read common data formats:

Performance

As stated above, infograph is built to serve queries against large in-memory data structures which are read frequently and changed seldom.

To guarantee optimal performance and to also optimize memory consumption, three main techniques are applied:

  1. Small string optimizations: Strings which are rather short, are stored in place instead of using a classic String. Note that if the value doesn’t fit in place, we still only store a pointer to a [u8] instead of a whole Vec<u8> which saves 8 bytes for the skipped capacity field (which isn’t required as the strings are immutable anyway).

  2. Compression via SymbolTable: As the keys in objects are most probably repeated very often we store the strings in a symbol table and only carry an i32 along.Think if a list of 1000 object which all look like { Foo: 42, Bar: 24 }. Instead of storing the strings repeatedly, we have a symbol table: {'Foo': 1, 'Bar': 2} accompanied with a lookup table [ 'Foo', 'Bar' ]. The object itself is stored as {1: 42, 2: 24}.

  3. Optimized object representation: Objects are maps which map Symbols (i32) to nodes. As many objects only contain a few entries, using a HashMap would be quite an overkill (and slow). Therefore we use a Vec<(Symbol, Node)> with a small initial capacity. This provides quite a compact memory layout which is very fast due to its efficient usage of L1 cache. We also sort entries by their Symbol so that we can use a binary search when looking up an entry.

Data Layout

Using loaders (defined in the Repository) one can control for which columns and index is created. These lookup indices drastically improve the performance of lookups and searches. We support exact indices (given as indices) and search indices (specified as fulltextIndices). The former only store full field values, where the latter also store tokens (e.g. “hello”, “world”, for “hello world”).

Note that if a n index is used for a list or a map, all its child values are also mapped. Note that also, we create a sub-index - e.g. when this is indexed:

mappings:
    acme: "test"

Using an index for “mappings”, we’d index “text” and also create an index for mappings.acme with the same value.

Modules

  • Provides DocBuilder which used to create a Doc programmatically.
  • Permits to load CSV data into a Doc.
  • Contains the core elements to read / query an information graph.
  • Permits to load JSON data into a Doc.
  • Provides a SymbolTable and a SymbolMap for an efficient and compact representation of objects / inner maps.
  • Permits to load XML data using a pull principle.
  • Permits to load YAML data into a Doc.