§Introduction
STAM is a standalone data model for stand-off text annotation. This is a software library to work with the model from Rust, and is the primary library/reference implementation for STAM. It aims to implement the full model as per the STAM specification and most of the extensions.
What can you do with this library?
- Keep, build and manipulate an efficient in-memory store of texts and annotations on texts
- Search in annotations, data and text, either programmatically or via the STAM Query Language.
- Search annotations by data, textual content, relations between text fragments (overlap, embedding, adjacency, etc).
- Search in text (incl. via regular expressions) and find annotations targeting found text selections.
- Elementary text operations with regard for text offsets (splitting text on a delimiter, stripping text).
- Search in data (set,key,value) and find annotations that use the data.
- Convert between different kinds of offsets (absolute, relative to other structures, UTF-8 bytes vs Unicode codepoints, etc.)
- Read and write resources and annotations from/to STAM JSON, STAM CSV, or an optimised binary (CBOR) representation.
- The underlying STAM model aims to be clear and simple. It is flexible and does not commit to any vocabulary or annotation paradigm other than stand-off annotation.
This STAM library is intended as a foundation upon which further applications can be built that deal with stand-off annotations on text. We implement all the low-level logic for dealing with these, so you no longer have to and can focus on your actual application. The library is written with performance in mind.
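To illustrate what converting between UTF-8 byte offsets and Unicode codepoint offsets involves, here is a minimal sketch using only the Rust standard library. The function names `char_to_byte_offset` and `byte_to_char_offset` are hypothetical and are not part of this library's API; the library provides its own methods for this.

```rust
/// Convert a Unicode codepoint offset into a UTF-8 byte offset.
/// Returns None if the offset lies beyond the end of the text.
/// (Hypothetical helper for illustration, not the library's API.)
fn char_to_byte_offset(text: &str, char_offset: usize) -> Option<usize> {
    text.char_indices()
        .map(|(byte_pos, _)| byte_pos)
        // The end of the text is a valid offset too (non-inclusive end points).
        .chain(std::iter::once(text.len()))
        .nth(char_offset)
}

/// Convert a UTF-8 byte offset into a Unicode codepoint offset.
/// Returns None if the offset is out of bounds or falls inside a
/// multi-byte character.
/// (Hypothetical helper for illustration, not the library's API.)
fn byte_to_char_offset(text: &str, byte_offset: usize) -> Option<usize> {
    if !text.is_char_boundary(byte_offset) {
        return None;
    }
    Some(text[..byte_offset].chars().count())
}
```

Both functions scan the text linearly, which is the straightforward approach; a real implementation would likely use an index to avoid rescanning long texts.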
This is the root module for the STAM library. The STAM library consists of two APIs, a low-level API and a high-level API. The latter is of most interest to end users and is implemented in api/*.rs.
§Table of Contents (abridged)
- AnnotationStore - The main annotation store that holds everything together.
- Result items:
  - These encapsulate the underlying primary structures and are the main way in which things are returned throughout the high-level API.
- Values and Operators:
  - DataValue - Encapsulates an actual value and its type.
  - DataOperator - Defines a test done on a DataValue.
  - TextSelectionOperator - Performs a particular comparison of text selections (e.g. overlap, embedding, adjacency, etc.).
- Iterators:
  - AnnotationIterator - Iterator trait to iterate over annotations, typically produced by an annotations() method.
  - DataIterator - Iterator trait to iterate over annotation data, typically produced by a data() method.
  - TextSelectionIterator - Iterator (trait), typically produced by a textselections() or related_text() method.
  - ResourcesIterator - Iterator (trait), typically produced by a resources() method.
  - KeyIterator - Iterator (trait), typically produced by a keys() method.
  - TextIter - Iterator over actual text, typically produced by a text() method.
- Text operations:
- Collections:
  - Annotations == Handles<Annotation> - Arbitrary collection of Annotation (by reference).
  - Data == Handles<AnnotationData> - Arbitrary collection of AnnotationData (by reference).
  - Resources == Handles<TextResource> - Arbitrary collection of TextResource (by reference).
  - Keys == Handles<DataKey> - Arbitrary collection of DataKey (by reference).
- Querying:
  - Query - Holds a query, may be parsed from STAMQL.
  - QueryResultItems
  - QueryResultItem
- Referencing Text (both high and low-level API):
- Primary structures (low level API):
Structs§
- Annotation represents a particular instance of annotation and is the central concept of the model. Annotations can be considered the primary nodes of the graph model. The instance of annotation is strictly decoupled from the data or key/value of the annotation (AnnotationData). After all, multiple instances can be annotated with the same label (multiple annotations may share the same annotation data). Moreover, an Annotation can have multiple annotation data associated. The result is that multiple annotations with the exact same content require less storage space, and searching and indexing is facilitated.
- This is the builder that builds Annotation. The actual building is done by passing this structure to AnnotationStore::annotate(); there is no build() method for this builder.
- AnnotationData holds the actual content of an annotation: a key/value pair (the term feature is regularly seen for this in certain annotation paradigms). Annotation data is deliberately decoupled from the actual Annotation instances so multiple annotation instances can point to the same content without causing any overhead in storage. Moreover, it facilitates indexing and searching. The annotation data is part of an AnnotationDataSet, which effectively defines a certain user-defined vocabulary.
- This is the builder for AnnotationData. It contains public IDs or handles that will be resolved. This structure is usually not instantiated directly but via the AnnotationBuilder.with_data(), AnnotationDataSet.insert_data(), AnnotationDataSet.with_data() or AnnotationDataSet.build_insert_data() methods. It also does not have its own build() method but is resolved via the aforementioned methods.
- An AnnotationDataSet stores the keys (DataKey) and values (AnnotationData, which in turn encapsulates DataValue) that are used by annotations. It effectively defines a certain vocabulary, i.e. key/value pairs. The AnnotationDataSet does not store the Annotation instances; those are in the AnnotationStore. The datasets themselves are also held by the AnnotationStore.
- Handle to an instance of Annotation in the store.
- An Annotation Store is a collection of annotations, resources and annotation data sets. It can be seen as the root of the graph model and the glue that holds everything together. It is the entry point for any STAM model.
- A substore is a sub-collection of annotations that is serialised as an independent AnnotationStore. The actual contents are still defined and kept by the parent AnnotationStore. This structure only holds references used for serialisation purposes.
- This holds the configuration. It is not limited to configuring a single part of the model, but unifies all in a single configuration.
- The DataKey structure defines a vocabulary field or feature, as it is called in some annotation paradigms. It belongs to a certain AnnotationDataSet. An AnnotationData instance in turn makes reference to a DataKey and assigns it a value, producing a full key/value pair.
- ISO 8601 combined date and time with time zone.
- An iterator that applies a filter to constrain annotations. This iterator implements AnnotationIterator and is itself produced by the various filter_*() methods on that trait.
- An iterator that applies a filter to constrain annotation data. This iterator implements DataIterator and is itself produced by the various filter_*() methods on that trait.
- An iterator that applies a filter to constrain keys. This iterator implements KeyIterator and is itself produced by the various filter_*() methods on that trait.
- An iterator that applies a filter to constrain keys. This iterator implements KeyIterator and is itself produced by the various filter_*() methods on that trait.
- An iterator that applies a filter to constrain resources. This iterator implements ResourcesIterator and is itself produced by the various filter_*() methods on that trait.
- An iterator that applies a filter to constrain text selections. This iterator implements TextSelectionIterator and is itself produced by the various filter_*() methods on that trait.
- This iterator is produced by FindText::find_text_nocase() and searches a text for a single fragment, without regard for casing. It has more overhead than the exact (case sensitive) variant FindTextIter.
- This iterator is produced by FindText::find_text_regex() and searches a text based on regular expressions.
- This match structure is returned by the FindRegexIter iterator, which is in turn produced by FindText::find_text_regex() and searches a text based on regular expressions. This structure represents a single regular-expression match of the iterator on the text.
- This iterator is produced by FindText::find_text() and searches a text for a single fragment. The search is case sensitive. See FindNoCaseTextIter for a case-insensitive variant. The iterator yields ResultTextSelection items (which encapsulate TextSelection).
- The time zone with fixed offset, from UTC-23:59:59 to UTC+23:59:59.
- Iterator that turns iterators over full handles into ResultItem<T>; holds a reference to the AnnotationStore.
- Holds a collection of items. The collection may be either owned or borrowed from the store (usually from a reverse index).
- The local timescale.
- Text selection offset. Specifies begin and end offsets to select a range of a text, via two Cursor instances. The end-point is non-inclusive.
- This represents a query that can be performed on an AnnotationStore via AnnotationStore::query() to obtain anything in the store. A query can be formulated in STAMQL, a dedicated query language (via Query::parse()), or it can be instantiated programmatically via Query::new().
- Iterator over the results of a Query. Querying will be performed as the iterator is consumed (lazy evaluation). If it is not consumed, no actual querying will be done. See AnnotationStore::query() for an example.
- Represents an entire result row; each result stems from a query.
- A compiled regular expression for searching Unicode haystacks.
- Match multiple, possibly overlapping, regexes in a single search.
- This is a smart pointer that encapsulates both the item and the store that owns it. It allows the item to have some more introspection as it knows who its immediate parent is. It is heavily used as a return type all throughout the higher-level API. Most API traits are implemented for a particular variant of this type.
- An iterator that may be sorted or not and knows a priori whether it is or not. It may also be a completely empty iterator.
- A TextSelectionSet holds one or more TextSelection items and a reference to the TextResource from which they’re drawn. This structure encapsulates such a TextSelectionSet and contains a reference to the underlying AnnotationStore.
- Iterator that turns iterators over ResultItem<TextSelection> into ResultTextSelection.
- Iterator that returns the selector itself, plus all selectors under it (recursively).
- This iterator is produced by FindText::split_text() and splits a text based on a delimiter. The iterator yields ResultTextSelection (which encapsulates TextSelection).
- An iterator over the actual text of text selections. This iterator yields &str instances and is typically produced by a .text() method when there may be multiple text slices.
- This holds the textual resource to be annotated. It holds the full text in memory.
- This is a helper structure to build TextResource instances in a builder pattern.
- Corresponds to a slice of the text. This only contains minimal information, i.e. the begin offset, end offset and optionally a handle, if the text selection is already known in the model. This is similar to Offset, but that one uses cursors which may be relative. TextSelection specifies an offset in more absolute terms.
- This iterator is used for iterating over TextSelections in a resource in a sorted fashion using the so-called position index.
- A TextSelectionSet holds one or more TextSelection items and a reference to the TextResource from which they’re drawn. All text selections in a set must reference the same resource, which implies they are comparable.
- The UTC time zone. This is the most efficient time zone when you don’t need the local time. It is also used as an offset (which is also a dummy type).
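The decoupling of annotation instances from annotation data, and the use of handles as plain vector indices, can be sketched with standard-library types only. This is a conceptual illustration under stated assumptions: `MiniStore`, `DataHandle` and their methods are hypothetical names, not this library's types, and the real store is considerably more elaborate.

```rust
use std::collections::HashMap;

/// Hypothetical handle: a lightweight, copyable index into the data vector.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct DataHandle(usize);

/// Hypothetical miniature store; not the library's AnnotationStore.
#[derive(Default)]
struct MiniStore {
    /// Each distinct key/value pair is stored exactly once.
    data: Vec<(String, String)>,
    /// Index to find an existing key/value pair quickly.
    data_index: HashMap<(String, String), DataHandle>,
    /// Each annotation only holds handles to the data it uses.
    annotations: Vec<Vec<DataHandle>>,
    /// Reverse index: which annotations use a given piece of data?
    annotations_by_data: HashMap<DataHandle, Vec<usize>>,
}

impl MiniStore {
    /// Return the handle for a key/value pair, inserting it only if it is new.
    fn insert_data(&mut self, key: &str, value: &str) -> DataHandle {
        let pair = (key.to_string(), value.to_string());
        if let Some(handle) = self.data_index.get(&pair) {
            return *handle;
        }
        let handle = DataHandle(self.data.len());
        self.data.push(pair.clone());
        self.data_index.insert(pair, handle);
        handle
    }

    /// Add an annotation carrying the given data; returns its index.
    fn annotate(&mut self, data: &[(&str, &str)]) -> usize {
        let handles: Vec<DataHandle> = data
            .iter()
            .map(|(key, value)| self.insert_data(key, value))
            .collect();
        let annotation = self.annotations.len();
        for handle in &handles {
            self.annotations_by_data
                .entry(*handle)
                .or_default()
                .push(annotation);
        }
        self.annotations.push(handles);
        annotation
    }
}
```

Two annotations created with the same key/value pair end up sharing a single `DataHandle`, so the content is stored once, and the reverse index answers "which annotations use this data?" without scanning all annotations.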
Enums§
- This determines how far to look up or down in an annotation hierarchy tree formed by AnnotationSelectors.
- An assignment is a part of an ADD Query that assigns data to a new annotation.
- BuildItem offers various ways of referring to a data structure of type T in the core STAM model. It abstracts over public IDs (both owned and borrowed), handles, and references.
- A cursor points to a specific point in a text. Used to select offsets. Units are Unicode codepoints (not bytes!) and are 0-indexed.
- Data formats for serialisation and deserialisation supported by the library.
- This type defines a test that can be done on a DataValue (via DataValue::test()). The operator does not merely consist of the operator-part, but also holds the value that is tested against, which may be one of various types, hence the many variants of this type.
- This type encapsulates a value and its type. It is held by AnnotationData alongside a reference to a DataKey, resulting in a key/value pair.
- This determines how a filter is applied when the filter is provided with multiple reference instances to match against. It determines if the filter requires a match with any of the instances (default), or with all of them.
- Used as a parameter for TextResource::positions().
- This determines whether a query is applied normally or with a particular altered meaning.
- This structure encapsulates the different kinds of result items that can be returned from queries. See AnnotationStore::query() for an example of it in use.
- This type abstracts over all the main iterators. This abstraction uses dynamic dispatch, so it comes with a small performance cost.
- Holds the type of a Query.
- This structure holds a TextSelection, along with references to its TextResource and the AnnotationStore, and provides a high-level API on it.
- This determines whether a query Constraint is applied normally or with a particular altered meaning.
- A Selector identifies the target of an annotation and the part of the target that the annotation applies to. Selectors can be considered the labelled edges of the graph model, tying all nodes together. There are multiple types of selectors, all captured in this enum.
- A SelectorBuilder is a recipe that, when applied, identifies the target of an annotation and the part of the target that the annotation applies to. It produces a Selector. You turn a SelectorBuilder into a Selector using AnnotationStore::selector.
- See Selector; this is a simplified variant that carries only the type, not the target.
- This enum groups the different kinds of errors that this STAM library can produce.
- Determines whether a text search is exact (case sensitive) or case insensitive.
- The TextSelectionOperator, simply put, allows comparison of two TextSelection instances. It allows testing for all kinds of spatial relations (as embodied by this enum) in which two TextSelection instances can be, such as overlap, embedding, adjacency, etc.
- An enumeration of STAM data types. This is used for introspection via TypeInfo.
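The spatial relations named above (overlap, embedding, adjacency) can be illustrated on half-open spans with a non-inclusive end point, matching the offset convention described earlier. This is a minimal sketch assuming non-empty spans; `Span` and its method names are hypothetical and do not mirror the variants of TextSelectionOperator.

```rust
/// Hypothetical half-open span [begin, end), measured in Unicode codepoints.
/// Not the library's TextSelection type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Span {
    begin: usize,
    end: usize, // non-inclusive
}

impl Span {
    /// True if the two (non-empty) spans share at least one codepoint.
    fn overlaps(&self, other: &Span) -> bool {
        self.begin < other.end && other.begin < self.end
    }

    /// True if `other` lies entirely within this span.
    fn embeds(&self, other: &Span) -> bool {
        self.begin <= other.begin && other.end <= self.end
    }

    /// True if this span ends exactly where `other` begins.
    fn directly_precedes(&self, other: &Span) -> bool {
        self.end == other.begin
    }
}
```

Because the end point is non-inclusive, two adjacent spans such as [0, 5) and [5, 8) do not overlap, which keeps the adjacency and overlap tests mutually exclusive.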
Traits§
- Trait for iteration over annotations (ResultItem<Annotation>; encapsulation over Annotation). Implements numerous filter methods to further constrain the iterator, as well as methods to map from annotations to other items.
- Trait for iteration over annotation data (ResultItem<AnnotationData>; encapsulation over AnnotationData). Implements numerous filter methods to further constrain the iterator, as well as methods to map from annotation data to other items.
- Trait for iteration over datasets (ResultItem<AnnotationDataSet>; encapsulation over AnnotationDataSet). Implements numerous filter methods to further constrain the iterator, as well as methods to map from keys to other items.
- This trait provides text-searching methods that operate on structures that hold or represent text content. It builds upon the lower-level Text trait.
- The handle trait is implemented for various handle types. They have in common that they refer to the internal id of a Storable item in a struct implementing StoreFor by index. Types implementing this are lightweight and do not borrow anything; they can be passed and copied freely. This is a sealed trait, not implementable outside this crate.
- An iterator that grabs the first available data value.
- Trait for iteration over data keys (ResultItem<DataKey>; encapsulation over DataKey). Implements numerous filter methods to further constrain the iterator, as well as methods to map from keys to other items.
- An iterator that can extract an arbitrary subrange, even with relative coordinates (at which point it will allocate a buffer).
- An iterator that may be sorted or not and knows a priori whether it is or not.
- This trait is implemented for types that can serve as a request for a specific item of type T from the store. It is typically implemented on strings (both owned and borrowed), in which case the request is for a particular public identifier, or it is implemented on handles.
- Trait for iteration over resources (ResultItem<TextResource>; encapsulation over TextResource). Implements numerous filter methods to further constrain the iterator, as well as methods to map from resources to other items.
- This trait is implemented by types that can return a Selector to themselves.
- This trait allows sorting a collection in textual order, meaning that items are returned in the same order as they appear in the original text.
- This trait defines the Self::or_fail method that is used to turn an Option<T> into Result<T,StamError>.
- This is a low-level trait that is implemented on the various STAM data structures that are held in a store, such as Annotation, AnnotationData, TextResource, etc. All storable elements have a Handle, defined by the associated Self::HandleType. It corresponds directly to their index in a vector, so this type is a simple wrapper around usize. This is a sealed trait, not implementable outside this crate.
- This trait is implemented on types that provide storage for a certain other generic type (T). It belongs to the low-level API. It is a sealed trait, not implementable outside this crate.
- This trait defines the test() methods for testing relations between two text selections (or sets thereof).
- This iterator implements a simple .test() method that just checks whether an iterator is empty or yields results. It is implemented alongside traits like AnnotationIterator, DataIterator, etc.
- This trait provides methods that operate on structures that hold or represent text content. They are fairly low-level methods but are exposed in the public API. The FindText trait subsequently builds upon this one with high-level search methods.
- Trait for iteration over text selections (ResultTextSelection; encapsulation over TextSelection). Implements numerous filter methods to further constrain the iterator, as well as methods to map from text selections to other items.
- This trait is implemented on iterators over ResultItem<T> and effectively collects these items, by only their handles and a reference to a store, as Handles<T>. It is implemented alongside traits like AnnotationIterator, DataIterator, etc.
- This trait provides some introspection on STAM data types. It is a sealed trait that cannot be implemented.
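The pattern behind an or_fail method, an extension trait that turns an Option<T> into a Result so a missing value can be propagated with `?`, can be sketched as follows. This is an illustration under assumptions: `SketchError`, `OrFail` and the `context` parameter are hypothetical, and the actual trait in this library uses StamError and may have a different signature.

```rust
/// Hypothetical stand-in for the library's error type.
#[derive(Debug, PartialEq)]
enum SketchError {
    NotFound(&'static str),
}

/// Hypothetical extension trait adding `or_fail()` to `Option<T>`.
/// The `context` parameter is an assumption made for this sketch.
trait OrFail<T> {
    fn or_fail(self, context: &'static str) -> Result<T, SketchError>;
}

impl<T> OrFail<T> for Option<T> {
    fn or_fail(self, context: &'static str) -> Result<T, SketchError> {
        self.ok_or(SketchError::NotFound(context))
    }
}

/// Count the codepoints of the first word, failing if there is no word.
fn first_word_len(text: &str) -> Result<usize, SketchError> {
    // `?` propagates the error produced by `or_fail` to the caller.
    let word = text.split_whitespace().next().or_fail("no words in text")?;
    Ok(word.chars().count())
}
```

The benefit of such a trait is that lookups returning Option can be chained in functions that return Result without repeating the conversion at every call site.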
Functions§
- Generate an ID with a random, 21-byte, ID/URI-safe component. This does no collision check (but collisions are extremely unlikely).
- Tests whether a string is a valid IRI
- Take an existing ID and apply an update strategy to create a derived new ID.
Type Aliases§
- Holds a collection of AnnotationDataSet (by reference to an AnnotationStore and handles). This structure is produced by calling ToHandles::to_handles(), which is available on all iterators over datasets (ResultItem<AnnotationDataSet>).
- Holds a collection of Annotation (by reference to an AnnotationStore and handles). This structure is produced by calling ToHandles::to_handles(), which is available on all iterators over annotations (ResultItem<Annotation>).
- Holds a collection of AnnotationData (by reference to an AnnotationStore and handles). This structure is produced by calling ToHandles::to_handles(), which is available on all iterators over data.
- Iterator over the handles in a Handles<T> collection.
- Holds a collection of DataKey (by reference to an AnnotationStore and handles). This structure is produced by calling ToHandles::to_handles(), which is available on all iterators over keys (ResultItem<DataKey>).
- This points to a particular subquery inside a query.
- Holds a collection of TextResource (by reference to an AnnotationStore and handles). This structure is produced by calling ToHandles::to_handles(), which is available on all iterators over resources (ResultItem<TextResource>).
- Type for Store elements. The struct that owns a field of this type should implement the trait StoreFor<T>. This is a low-level construct. Do not confuse with AnnotationStore.
- Holds a collection of TextSelection (by reference to an AnnotationStore and handles). This structure is produced by calling ToHandles::to_handles(), which is available on all iterators over text selections.