Crate stam

§Introduction

STAM is a standalone data model for stand-off text annotation. This is a software library to work with the model from Rust, and is the primary library/reference implementation for STAM. It aims to implement the full model as per the STAM specification and most of the extensions.

What can you do with this library?

  • Keep, build and manipulate an efficient in-memory store of texts and annotations on texts
  • Search in annotations, data and text, either programmatically or via the STAM Query Language.
    • Search annotations by data, textual content, relations between text fragments (overlap, embedding, adjacency, etc).
    • Search in text (incl. via regular expressions) and find annotations targeting found text selections.
    • Elementary text operations with regard for text offsets (splitting text on a delimiter, stripping text).
    • Search in data (set,key,value) and find annotations that use the data.
    • Convert between different kinds of offsets (absolute, relative to other structures, UTF-8 bytes vs Unicode codepoints, etc.)
  • Read and write resources and annotations from/to STAM JSON, STAM CSV, or an optimised binary (CBOR) representation.
    • The underlying STAM model aims to be clear and simple. It is flexible and does not commit to any vocabulary or annotation paradigm other than stand-off annotation.
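
To make the offset conversion mentioned in the list above more concrete, here is a minimal sketch in plain Rust of converting between Unicode codepoint offsets and UTF-8 byte offsets. It uses only the standard library; the function names `codepoint_to_byte` and `byte_to_codepoint` are hypothetical and are not part of this crate's API.

```rust
/// Convert a Unicode codepoint offset into a UTF-8 byte offset.
/// Returns None if the offset lies beyond the end of the text.
/// (Hypothetical helper for illustration, not this crate's API.)
fn codepoint_to_byte(text: &str, cp_offset: usize) -> Option<usize> {
    text.char_indices()
        .map(|(byte, _)| byte)
        // The end of the text is also a valid (non-inclusive) end offset.
        .chain(std::iter::once(text.len()))
        .nth(cp_offset)
}

/// Convert a UTF-8 byte offset into a Unicode codepoint offset.
/// Returns None if the byte offset is out of range or falls inside a
/// multi-byte character.
fn byte_to_codepoint(text: &str, byte_offset: usize) -> Option<usize> {
    if byte_offset > text.len() || !text.is_char_boundary(byte_offset) {
        return None;
    }
    Some(text[..byte_offset].chars().count())
}

fn main() {
    let text = "héllo wörld";
    // Select codepoints 6..11; the end offset is non-inclusive.
    let begin = codepoint_to_byte(text, 6).unwrap();
    let end = codepoint_to_byte(text, 11).unwrap();
    println!("bytes {}..{} -> {}", begin, end, &text[begin..end]);
    println!("byte 7 is codepoint {:?}", byte_to_codepoint(text, 7));
}
```

Both helpers scan the text linearly, which is fine for a sketch but would be too slow for repeated lookups on large texts.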

This STAM library is intended as a foundation upon which further applications can be built that deal with stand-off annotations on text. We implement all the low-level logic for dealing with this, so you no longer have to and can focus on your actual application. The library is written with performance in mind.

This is the root module for the STAM library. The STAM library consists of two APIs: a low-level API and a high-level API. The latter is of most interest to end users and is implemented in api/*.rs.

§Table of Contents (abridged)

Structs§

  • Annotation represents a particular instance of annotation and is the central concept of the model. Annotations can be considered the primary nodes of the graph model. The instance of annotation is strictly decoupled from the data or key/value of the annotation (AnnotationData). After all, multiple instances can be annotated with the same label (multiple annotations may share the same annotation data). Moreover, an Annotation can have multiple annotation data items associated with it. The result is that multiple annotations with the exact same content require less storage space, and searching and indexing are facilitated.
  • This is the builder that builds Annotation. The actual building is done by passing this structure to AnnotationStore::annotate(), there is no build() method for this builder.
  • AnnotationData holds the actual content of an annotation: a key/value pair (the term feature is regularly seen for this in certain annotation paradigms). AnnotationData is deliberately decoupled from the actual Annotation instances so multiple annotation instances can point to the same content without causing any overhead in storage. Moreover, it facilitates indexing and searching. The annotation data is part of an AnnotationDataSet, which effectively defines a certain user-defined vocabulary.
  • This is the builder for AnnotationData. It contains public IDs or handles that will be resolved. This structure is usually not instantiated directly but via the AnnotationBuilder.with_data(), AnnotationDataSet.insert_data() or AnnotationDataSet.with_data() or AnnotationDataSet.build_insert_data() methods. It also does not have its own build() method but is resolved via the aforementioned methods.
  • Handle to an instance of AnnotationData in the store (AnnotationDataSet).
  • An AnnotationDataSet stores the keys DataKey and values AnnotationData (which in turn encapsulates DataValue) that are used by annotations. It effectively defines a certain vocabulary, i.e. key/value pairs. The AnnotationDataSet does not store the Annotation instances, those are in the AnnotationStore. The datasets themselves are also held by the AnnotationStore.
  • Handle to an instance of Annotation in the store.
  • An AnnotationStore is a collection of annotations, resources and annotation data sets. It can be seen as the root of the graph model and the glue that holds everything together. It is the entry point for any STAM model.
  • This holds the configuration. It is not limited to configuring a single part of the model, but unifies all in a single configuration.
  • The DataKey structure defines a vocabulary field or feature, as it is called in some annotation paradigms. It belongs to a certain AnnotationDataSet. An AnnotationData instance in turn makes reference to a DataKey and assigns it a value, producing a full key/value pair.
  • Handle to an instance of DataKey in the store (AnnotationDataSet).
  • An iterator that applies a filter to constrain annotations. This iterator implements AnnotationIterator and is itself produced by the various filter_*() methods on that trait.
  • An iterator that applies a filter to constrain annotation data. This iterator implements DataIterator and is itself produced by the various filter_*() methods on that trait.
  • An iterator that applies a filter to constrain keys. This iterator implements KeyIterator and is itself produced by the various filter_*() methods on that trait.
  • An iterator that applies a filter to constrain resources. This iterator implements ResourcesIterator and is itself produced by the various filter_*() methods on that trait.
  • An iterator that applies a filter to constrain text selections. This iterator implements TextSelectionIterator and is itself produced by the various filter_*() methods on that trait.
  • This iterator is produced by FindText::find_text_nocase() and searches a text for a single fragment, without regard for casing. It has more overhead than the exact (case sensitive) variant FindTextIter.
  • This iterator is produced by FindText::find_text_regex() and searches a text based on regular expressions.
  • This match structure is returned by the FindRegexIter iterator, which is in turn produced by FindText::find_text_regex() and searches a text based on regular expressions. This structure represents a single regular-expression match of the iterator on the text.
  • This iterator is produced by FindText::find_text() and searches a text for a single fragment. The search is case sensitive. See FindNoCaseTextIter for a case-insensitive variant. The iterator yields ResultTextSelection items (which encapsulates TextSelection).
  • Iterator that turns iterators over full handles into ResultItem<T>. It holds a reference to the AnnotationStore.
  • Holds a collection of items. The collection may be either owned or borrowed from the store (usually from a reverse index).
  • Text selection offset. Specifies begin and end offsets to select a range of a text, via two Cursor instances. The end-point is non-inclusive.
  • This represents a query that can be performed on an AnnotationStore via AnnotationStore::query() to obtain anything in the store. A query can be formulated in STAMQL, a dedicated query language (via Query::parse()), or it can be instantiated programmatically via Query::new().
  • Iterator over the results of a Query. Querying will be performed as the iterator is consumed (lazy evaluation). If it is not consumed, no actual querying will be done. See AnnotationStore::query() for an example.
  • This is a simple hashmap that can resolve all variable names used in the query to the internally used index numbers. See AnnotationStore::query() for an example.
  • Represents an entire result row. Each result stems from a query.
  • A compiled regular expression for searching Unicode haystacks.
  • Match multiple, possibly overlapping, regexes in a single search.
  • This is a smart pointer that encapsulates both the item and the store that owns it. It allows the item to have some more introspection as it knows who its immediate parent is. It is heavily used as a return type all throughout the higher-level API. Most API traits are implemented for a particular variant of this type.
  • An iterator that may be sorted or not and knows a priori whether it is or not. It may also be a completely empty iterator.
  • Iterator that turns iterators over ResultItem<TextSelection> into ResultTextSelection.
  • Iterator that returns the selector itself, plus all selectors under it (recursively).
  • This iterator is produced by FindText::split_text() and splits a text based on a delimiter. The iterator yields ResultTextSelection (which encapsulates TextSelection).
  • An iterator over the actual text of text selections. This iterator yields &str instances and is typically produced by a .text() method when there may be multiple text slices.
  • This holds the textual resource to be annotated. It holds the full text in memory.
  • This is a helper structure to build TextResource instances in a builder pattern.
  • Handle to an instance of TextResource in the store (AnnotationStore).
  • Corresponds to a slice of the text. This only contains minimal information, i.e. the begin offset, the end offset and optionally a handle, if the text selection is already known in the model. This is similar to Offset, but that one uses cursors which may be relative. TextSelection specifies an offset in more absolute terms.
  • Handle to an instance of TextSelection in the store (TextResource).
  • This iterator is used for iterating over TextSelections in a resource in a sorted fashion using the so-called position index.
  • A TextSelectionSet holds one or more TextSelection items and a reference to the TextResource from which they’re drawn. All text selections in a set must reference the same resource, which implies they are comparable.
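
The decoupling of annotations from their data, and the use of lightweight handles, both described in the list above, can be illustrated with a small standalone sketch. All names here (`MiniStore`, `Note`, `Data`, `DataHandle`) are hypothetical and only mirror the idea; this is not how the crate itself is implemented.

```rust
use std::collections::HashMap;

/// A lightweight, copyable handle: an index into the store's data vector.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct DataHandle(usize);

/// A key/value pair, stored once and shared between annotations.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Data {
    key: String,
    value: String,
}

/// An annotation instance: a text span plus handles to shared data.
struct Note {
    begin: usize,
    end: usize, // non-inclusive
    data: Vec<DataHandle>,
}

#[derive(Default)]
struct MiniStore {
    data: Vec<Data>,
    data_index: HashMap<Data, DataHandle>, // deduplicates identical data
    notes: Vec<Note>,
    notes_by_data: HashMap<DataHandle, Vec<usize>>, // reverse index
}

impl MiniStore {
    /// Return the handle of existing identical data, or store it anew.
    fn intern(&mut self, key: &str, value: &str) -> DataHandle {
        let item = Data { key: key.to_string(), value: value.to_string() };
        if let Some(&handle) = self.data_index.get(&item) {
            return handle;
        }
        let handle = DataHandle(self.data.len());
        self.data.push(item.clone());
        self.data_index.insert(item, handle);
        handle
    }

    /// Add an annotation on begin..end and return its index.
    fn annotate(&mut self, begin: usize, end: usize, key: &str, value: &str) -> usize {
        let handle = self.intern(key, value);
        let index = self.notes.len();
        self.notes.push(Note { begin, end, data: vec![handle] });
        self.notes_by_data.entry(handle).or_default().push(index);
        index
    }

    /// Find the spans of all annotations that use the given data.
    fn spans_with(&self, handle: DataHandle) -> Vec<(usize, usize)> {
        self.notes_by_data
            .get(&handle)
            .map(|ids| {
                ids.iter()
                    .map(|&i| (self.notes[i].begin, self.notes[i].end))
                    .collect::<Vec<_>>()
            })
            .unwrap_or_default()
    }
}

fn main() {
    let mut store = MiniStore::default();
    let a = store.annotate(0, 5, "pos", "noun");
    let b = store.annotate(6, 11, "pos", "noun");
    store.annotate(12, 15, "pos", "verb");
    for d in &store.data {
        println!("stored data: {}={}", d.key, d.value);
    }
    println!("shared data: {}", store.notes[a].data == store.notes[b].data);
    println!("spans with first data item: {:?}", store.spans_with(DataHandle(0)));
}
```

Three annotations end up referencing only two stored data items, and the reverse index answers "which annotations use this data?" without scanning all annotations.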

Enums§

  • This determines how far to look up or down in an annotation hierarchy tree formed by AnnotationSelectors.
  • BuildItem offers various ways of referring to a data structure of type T in the core STAM model. It abstracts over public IDs (both owned and borrowed), handles, and references.
  • A constraint is a part of a Query that poses specific selection criteria that must be met. A query can have multiple constraints which must all be satisfied. See the documentation for Query for examples.
  • A cursor points to a specific point in a text. It is used to select offsets. Units are Unicode codepoints (not bytes!) and are 0-indexed.
  • Data formats for serialisation and deserialisation supported by the library.
  • This type defines a test that can be done on a DataValue (via DataValue::test()). The operator does not merely consist of the operator-part, but also holds the value that is tested against, which may be one of various types, hence the many variants of this type.
  • This type encapsulates a value and its type. It is held by AnnotationData alongside a reference to a DataKey, resulting in a key/value pair.
  • This determines how a filter is applied when the filter is provided with multiple reference instances to match against. It determines if the filter requires a match with any of the instances (default), or with all of them.
  • The offset mode represents the ways in which the user can specify an Offset. It expresses whether the cursors (Cursor) for the begin and end positions of the offset are specified as begin-aligned or end-aligned.
  • Used as a parameter for TextResource::positions()
  • This structure encapsulates the different kinds of result items that can be returned from queries. See AnnotationStore::query() for an example of it in use.
  • This type abstracts over all the main iterators. This abstraction uses dynamic dispatch, so it comes with a small performance cost.
  • Holds the type of a Query.
  • This structure holds a TextSelection, along with references to its TextResource and the AnnotationStore and provides a high-level API on it.
  • This determines whether a query Constraint is applied normally or with a particular altered meaning.
  • A Selector identifies the target of an annotation and the part of the target that the annotation applies to. Selectors can be considered the labelled edges of the graph model, tying all nodes together. There are multiple types of selectors, all captured in this enum.
  • A SelectorBuilder is a recipe that, when applied, identifies the target of an annotation and the part of the target that the annotation applies to. It produces a Selector. You turn a SelectorBuilder into a Selector using AnnotationStore::selector.
  • See Selector, this is a simplified variant that carries only the type, not the target.
  • This enum groups the different kinds of errors that this STAM library can produce.
  • Determines whether a text search is exact (case sensitive) or case insensitive.
  • The TextSelectionOperator, simply put, allows comparison of two TextSelection instances. It allows testing for all kinds of spatial relations (as embodied by this enum) in which two TextSelection instances can be, such as overlap, embedding, adjacency, etc…
  • An enumeration of STAM data types. This is used for introspection via TypeInfo.
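
As an illustration of two concepts from the list above, begin-aligned versus end-aligned cursors and spatial relations between text selections, here is a minimal standalone sketch. The names `SketchCursor`, `resolve` and `Span` are hypothetical, and the assumption that end-aligned cursors are zero or negative (counted back from the end of the text) is made for this example only.

```rust
/// A point in a text, in Unicode codepoints (hypothetical sketch type).
#[derive(Clone, Copy, Debug)]
enum SketchCursor {
    /// Counted from the start of the text, 0-indexed.
    BeginAligned(usize),
    /// Counted from the end of the text; assumed zero or negative here.
    EndAligned(isize),
}

/// Resolve a cursor to an absolute position in a text of `len` codepoints.
/// Returns None if the cursor points outside the text.
fn resolve(cursor: SketchCursor, len: usize) -> Option<usize> {
    match cursor {
        SketchCursor::BeginAligned(n) if n <= len => Some(n),
        SketchCursor::EndAligned(n) if n <= 0 && n.unsigned_abs() <= len => {
            Some(len - n.unsigned_abs())
        }
        _ => None,
    }
}

/// An absolute span of text; the end is non-inclusive.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Span {
    begin: usize,
    end: usize,
}

impl Span {
    /// True if the two spans share at least one position.
    fn overlaps(self, other: Span) -> bool {
        self.begin < other.end && other.begin < self.end
    }
    /// True if `other` lies entirely within this span.
    fn embeds(self, other: Span) -> bool {
        self.begin <= other.begin && other.end <= self.end
    }
    /// True if this span ends exactly where `other` begins (adjacency).
    fn precedes(self, other: Span) -> bool {
        self.end == other.begin
    }
}

fn main() {
    let len = "Hello world".chars().count();
    // The last five codepoints, expressed with end-aligned cursors.
    let world = Span {
        begin: resolve(SketchCursor::EndAligned(-5), len).unwrap(),
        end: resolve(SketchCursor::EndAligned(0), len).unwrap(),
    };
    let hello = Span {
        begin: resolve(SketchCursor::BeginAligned(0), len).unwrap(),
        end: resolve(SketchCursor::BeginAligned(5), len).unwrap(),
    };
    let whole = Span { begin: 0, end: len };
    println!("world = {:?}", world);
    println!("whole embeds world: {}", whole.embeds(world));
    println!("hello overlaps world: {}", hello.overlaps(world));
    println!("hello precedes the space: {}", hello.precedes(Span { begin: 5, end: 6 }));
}
```

Resolving cursors to absolute positions first keeps the relation tests simple, since they then only compare plain integers.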

Traits§

  • Trait for iteration over annotations (ResultItem<Annotation>; encapsulation over Annotation). Implements numerous filter methods to further constrain the iterator, as well as methods to map from annotations to other items.
  • Trait for iteration over annotation data (ResultItem<AnnotationData>; encapsulation over AnnotationData). Implements numerous filter methods to further constrain the iterator, as well as methods to map from annotation data to other items.
  • This trait provides text-searching methods that operate on structures that hold or represent text content. It builds upon the lower-level Text trait.
  • The handle trait is implemented for various handle types. They have in common that they refer to the internal id of a Storable item in a struct implementing StoreFor by index. Types implementing this are lightweight and do not borrow anything; they can be passed and copied freely. This is a sealed trait, not implementable outside this crate.
  • Trait for iteration over keys (ResultItem<DataKey>; encapsulation over DataKey). Implements numerous filter methods to further constrain the iterator, as well as methods to map from keys to other items.
  • An iterator that may be sorted or not and knows a-priori whether it is or not.
  • This trait is implemented for types that can serve as a request for a specific item of type T from the store. It is typically implemented on strings (both owned and borrowed) in which case the request is for a particular public identifier, or it is implemented on handles.
  • Trait for iteration over resources (ResultItem<TextResource>; encapsulation over TextResource). Implements numerous filter methods to further constrain the iterator, as well as methods to map from resources to other items.
  • This trait is implemented by types that can return a Selector to themselves.
  • This trait allows sorting a collection in textual order, meaning that items are returned in the same order as they appear in the original text.
  • This trait defines the Self::or_fail method that is used to turn an Option<T> into Result<T,StamError>.
  • This is a low-level trait that is implemented on the various STAM data structures that are held in a store, such as Annotation, AnnotationData, TextResource, etc. All storable elements have a Handle, defined by the associated Self::HandleType. It corresponds directly to their index in a vector, so this type is a simple wrapper around usize. This is a sealed trait, not implementable outside this crate.
  • This trait is implemented on types that provide storage for a certain other generic type (T). It belongs to the low-level API. It is a sealed trait, not implementable outside this crate.
  • This trait defines the test() methods for testing relations between two text selections (or sets thereof).
  • This trait implements a simple .test() method that just checks whether an iterator is empty or yields results. It is implemented alongside traits like AnnotationIterator, DataIterator, etc…
  • This trait provides methods that operate on structures that hold or represent text content. They are fairly low-level methods but are exposed in the public API. The FindText trait subsequently builds upon this one with high-level search methods.
  • Trait for iteration over text selections (ResultTextSelection; encapsulation over TextSelection). Implements numerous filter methods to further constrain the iterator, as well as methods to map from text selections to other items.
  • This trait is implemented on iterators over ResultItem<T> and effectively collects these items, by only their handles and a reference to a store, as Handles<T>. It is implemented alongside traits like AnnotationIterator, DataIterator, etc…
  • This trait provides some introspection on STAM data types. It is a sealed trait that cannot be implemented outside this crate.
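
The or_fail pattern mentioned in the list above, turning an Option<T> into a Result, can be sketched as an extension trait on Option. The trait name `OrFailSketch`, the error type `SketchError` and the exact signature (taking the error as an argument) are assumptions made for illustration; consult the trait's own documentation for the real definition.

```rust
use std::fmt;

/// Hypothetical stand-in for the library's error type.
#[derive(Debug, PartialEq)]
enum SketchError {
    NotFound(&'static str),
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::NotFound(what) => write!(f, "not found: {}", what),
        }
    }
}

impl std::error::Error for SketchError {}

/// Extension trait that adds an `or_fail` method to `Option<T>`.
trait OrFailSketch<T> {
    fn or_fail(self, err: SketchError) -> Result<T, SketchError>;
}

impl<T> OrFailSketch<T> for Option<T> {
    fn or_fail(self, err: SketchError) -> Result<T, SketchError> {
        // `None` becomes the given error, `Some(x)` becomes `Ok(x)`.
        self.ok_or(err)
    }
}

/// Return the first whitespace-separated word, or an error if there is none.
fn first_word(text: &str) -> Result<&str, SketchError> {
    // The `?` operator propagates the error produced by `or_fail`.
    let word = text
        .split_whitespace()
        .next()
        .or_fail(SketchError::NotFound("word"))?;
    Ok(word)
}

fn main() {
    println!("{:?}", first_word("hello world"));
    match first_word("   ") {
        Ok(word) => println!("found {}", word),
        Err(e) => println!("error: {}", e),
    }
}
```

The benefit of such a trait is that lookups returning Option can be chained with `?` in functions that return a Result, without a match at every call site.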

Functions§

Type Aliases§