Struct graphannis::CorpusStorage
pub struct CorpusStorage { /* private fields */ }
A thread-safe API for managing corpora stored in a common location on the file system.
Multiple corpora can be part of a corpus storage, and each is identified by its unique name. Corpora are loaded from disk into main memory on demand; an internal main-memory cache avoids reloading a recently queried corpus from disk.
Implementations
impl CorpusStorage
pub fn with_cache_strategy( db_dir: &Path, cache_strategy: CacheStrategy, use_parallel_joins: bool ) -> Result<CorpusStorage>
Create a new instance with a maximum size for the internal corpus cache.
- db_dir - The path on the filesystem where the corpus storage content is located. Must be an existing directory.
- cache_strategy - A strategy for clearing the cache.
- use_parallel_joins - If true, parallel joins are used by the system, using all available cores.
pub fn with_auto_cache_size( db_dir: &Path, use_parallel_joins: bool ) -> Result<CorpusStorage>
Create a new instance with an automatically determined size for the internal corpus cache.
Currently, the maximum cache size is set to 25% of the available (free) memory at construction time. This behavior may change in the future.
- db_dir - The path on the filesystem where the corpus storage content is located. Must be an existing directory.
- use_parallel_joins - If true, parallel joins are used by the system, using all available cores.
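The 25% rule described above can be sketched as follows. This is only an illustration: the function name auto_cache_size_bytes and the idea of passing the free memory in as a parameter are assumptions made for this example, and how the library measures free memory is not shown.

```rust
/// Hypothetical helper (not part of the crate): derive a cache budget as
/// 25% of the free memory that the caller measured at construction time.
fn auto_cache_size_bytes(free_memory_bytes: u64) -> u64 {
    free_memory_bytes / 4
}
```

Because the documentation states that this behavior may change, callers that need a predictable budget may prefer with_cache_strategy.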
pub fn list(&self) -> Result<Vec<CorpusInfo>>
List all available corpora in the corpus storage.
pub fn info(&self, corpus_name: &str) -> Result<CorpusInfo>
Return detailed information about a specific corpus with a given name (corpus_name).
pub fn import_all_from_zip<R, F>( &self, zip_file: R, disk_based: bool, overwrite_existing: bool, progress_callback: F ) -> Result<Vec<String>> where R: Read + Seek, F: Fn(&str),
Import all corpora from a ZIP file.
This function unzips the file to a temporary location and finds all relANNIS and GraphML files it contains. The corpora can be in either the relANNIS or the GraphML format.
- zip_file - The content of the ZIP file.
- disk_based - If true, prefer disk-based annotation and graph storages instead of memory-only ones.
- overwrite_existing - If true, overwrite existing corpora; otherwise, ignore them.
- progress_callback - A callback function to which the import progress is reported.
Returns the names of the imported corpora.
pub fn import_from_fs<F>( &self, path: &Path, format: ImportFormat, corpus_name: Option<String>, disk_based: bool, overwrite_existing: bool, progress_callback: F ) -> Result<String> where F: Fn(&str),
Import a corpus from an external location on the file system into this corpus storage.
- path - The location on the file system where the corpus data is located.
- format - The format in which this corpus data is stored.
- corpus_name - Optionally override the name of the new corpus for file formats that already provide a corpus name. This only works if the imported file location contains exactly one corpus.
- disk_based - If true, prefer disk-based annotation and graph storages instead of memory-only ones.
- overwrite_existing - If true, overwrite existing corpora; otherwise, ignore them.
- progress_callback - A callback function to which the import progress is reported.
Returns the name of the imported corpus.
pub fn export_to_zip<W, F>( &self, corpus_name: &str, use_corpus_subdirectory: bool, zip: &mut ZipWriter<W>, progress_callback: F ) -> Result<()> where W: Write + Seek, F: Fn(&str),
Export a corpus to a ZIP file.
Unlike CorpusStorage::export_to_fs, this function takes the ZIP file writer as an argument and supports a custom progress callback function.
- corpus_name - The name of the corpus to write to the ZIP file.
- use_corpus_subdirectory - If true, the corpus is written into a sub-directory inside the ZIP file. This is useful when storing multiple corpora inside the same file.
- zip - A writer for the already created ZIP file.
- progress_callback - A callback function to which the export progress is reported.
pub fn export_to_fs<S: AsRef<str>>( &self, corpora: &[S], path: &Path, format: ExportFormat ) -> Result<()>
Export one or more corpora to an external location on the file system using the given format.
- corpora - The corpora to include in the exported file(s).
- path - The location on the file system where the corpus data should be written.
- format - The format in which this corpus data will be stored.
pub fn delete(&self, corpus_name: &str) -> Result<bool>
Delete a corpus from this corpus storage.
Returns true if the corpus was successfully deleted and false if no such corpus existed.
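Since the return type is a Result wrapping a bool, a caller has three outcomes to distinguish. The sketch below shows one way to handle them; describe_delete is a hypothetical helper, and it is generic over any displayable error type instead of naming the crate's own error type.

```rust
/// Hypothetical helper (not part of the crate): map the three possible
/// outcomes of a deletion result to a human-readable message.
fn describe_delete<E: std::fmt::Display>(corpus_name: &str, result: Result<bool, E>) -> String {
    match result {
        // The corpus existed and was removed.
        Ok(true) => format!("corpus '{}' was deleted", corpus_name),
        // Nothing to do: no corpus with that name existed.
        Ok(false) => format!("corpus '{}' did not exist", corpus_name),
        // The deletion itself failed.
        Err(e) => format!("deleting corpus '{}' failed: {}", corpus_name, e),
    }
}
```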
pub fn apply_update( &self, corpus_name: &str, update: &mut GraphUpdate ) -> Result<()>
Apply a sequence of updates (update parameter) to the graph of the corpus given by the corpus_name parameter.
The update process is guaranteed to be atomic, and the changes are persisted to disk if the result is Ok.
pub fn preload(&self, corpus_name: &str) -> Result<()>
Preloads all annotation and graph storages from the disk into a main memory cache.
pub fn validate_query<S: AsRef<str>>( &self, corpus_names: &[S], query: &str, query_language: QueryLanguage ) -> Result<bool>
Parses a query and checks if it is valid.
- corpus_names - The names of the corpora the query would be executed on (needed to catch certain corpus-specific semantic errors).
- query - The query as a string.
- query_language - The query language of the query (e.g. AQL).
Returns true if the query is valid and an error with the parser message if it is invalid.
pub fn plan<S: AsRef<str>>( &self, corpus_names: &[S], query: &str, query_language: QueryLanguage ) -> Result<String>
Returns a string representation of the execution plan for a query.
- corpus_names - The names of the corpora to execute the query on.
- query - The query as a string.
- query_language - The query language of the query (e.g. AQL).
pub fn count<S: AsRef<str>>(&self, query: SearchQuery<'_, S>) -> Result<u64>
Count the number of results for a query.
- query - The search query definition.
Returns the count as a number.
pub fn count_extra<S: AsRef<str>>( &self, query: SearchQuery<'_, S> ) -> Result<CountExtra>
Count the number of results for a query and return both the total number of matches and the number of documents in the result set.
- query - The search query definition.
pub fn find<S: AsRef<str>>( &self, query: SearchQuery<'_, S>, offset: usize, limit: Option<usize>, order: ResultOrder ) -> Result<Vec<String>>
Find all results for a query and return the match ID for each result.
The results are paginated: an offset and a limit can be specified.
- query - The search query definition.
- offset - Skip the first n results, where n is the offset.
- limit - Return at most n matches, where n is the limit. Use None to allow unlimited result sizes.
- order - Specify the order of the matches.
Returns a vector of match IDs, where each match ID consists of the matched node annotation identifiers separated by spaces. You can use the subgraph(…) method to get the subgraph for a single match described by the node annotation identifiers.
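The offset and limit semantics and the space-separated match ID format can be modelled with plain standard-library code. The sketch below is not the library's implementation: paginate and split_match_id are hypothetical helpers, and the identifier strings used with them would be made-up placeholders.

```rust
/// Hypothetical model of the offset/limit semantics: skip the first
/// `offset` items, then keep at most `limit` items (all of them for `None`).
fn paginate<T: Clone>(all: &[T], offset: usize, limit: Option<usize>) -> Vec<T> {
    let rest = all.iter().skip(offset).cloned();
    match limit {
        Some(n) => rest.take(n).collect(),
        None => rest.collect(),
    }
}

/// Split a match ID into its node annotation identifiers, which are
/// separated by spaces.
fn split_match_id(match_id: &str) -> Vec<&str> {
    match_id.split_whitespace().collect()
}
```

The identifiers obtained from one match ID could then be converted to owned strings and passed as the node_ids argument of subgraph.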
pub fn subgraph( &self, corpus_name: &str, node_ids: Vec<String>, ctx_left: usize, ctx_right: usize, segmentation: Option<String> ) -> Result<AnnotationGraph>
Return a copy of the subgraph that includes the nodes given by the list of node annotation identifiers, the nodes that cover the same tokens as the given nodes, and all nodes that cover the tokens that are part of the defined context.
- corpus_name - The name of the corpus from which the subgraph should be generated.
- node_ids - A set of node annotation identifiers describing the subgraph.
- ctx_left and ctx_right - Left and right context in token distance to be included in the subgraph.
- segmentation - The name of the segmentation which should be used as the base for the context. Use None to define the context in the default token layer.
Handling of gaps
The context definition can cause gaps in the returned subgraph, e.g. if the given node IDs are too far apart for their contexts to overlap. Since only edges for the nodes of the contexts are included, it is impossible to sort the results using the original ordering edges alone: there is no connection between the tokens of the non-overlapping context regions.
To allow sorting the non-overlapping context regions by their order in the datasource, an edge in the special Ordering/annis/datasource-gap component is added between the last token of each context region and the first token of the next one.
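The gap handling can be illustrated with a simplified model in which tokens are just numbered in datasource order. This is a sketch under that assumption, not the library's algorithm: context_regions and gap_edges are hypothetical helpers, and real subgraphs work on graph nodes and components instead of integer positions.

```rust
/// Hypothetical model: tokens are numbered 0..token_count in datasource
/// order. Returns the merged context regions as inclusive (first, last)
/// token ranges, sorted by position.
fn context_regions(
    match_tokens: &[usize],
    ctx_left: usize,
    ctx_right: usize,
    token_count: usize,
) -> Vec<(usize, usize)> {
    let mut sorted = match_tokens.to_vec();
    sorted.sort_unstable();
    let mut regions: Vec<(usize, usize)> = Vec::new();
    for t in sorted {
        let start = t.saturating_sub(ctx_left);
        let end = (t + ctx_right).min(token_count.saturating_sub(1));
        // Overlapping or directly adjacent contexts form a single region,
        // because the original ordering edges still connect their tokens.
        if let Some(last) = regions.last_mut() {
            if start <= last.1 + 1 {
                last.1 = last.1.max(end);
                continue;
            }
        }
        regions.push((start, end));
    }
    regions
}

/// One gap edge per pair of neighbouring regions: from the last token of a
/// region to the first token of the next one.
fn gap_edges(regions: &[(usize, usize)]) -> Vec<(usize, usize)> {
    regions.windows(2).map(|w| (w[0].1, w[1].0)).collect()
}
```

In this model, following the ordinary ordering inside each region and the gap edges between regions visits all tokens of the subgraph in datasource order.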
pub fn subgraph_for_query( &self, corpus_name: &str, query: &str, query_language: QueryLanguage, component_type_filter: Option<AnnotationComponentType> ) -> Result<AnnotationGraph>
Return a copy of the subgraph that includes all nodes matched by the given query.
- corpus_name - The name of the corpus from which the subgraph should be generated.
- query - The query which defines the included nodes.
- query_language - The query language of the query (e.g. AQL).
- component_type_filter - If set, only include edges that belong to a component of the given type.
pub fn subcorpus_graph( &self, corpus_name: &str, corpus_ids: Vec<String> ) -> Result<AnnotationGraph>
Return a copy of the subgraph that includes all nodes belonging to any of the given sub-corpus/document identifiers.
- corpus_name - The name of the corpus from which the subgraph should be generated.
- corpus_ids - A set of sub-corpus/document identifiers describing the subgraph.
pub fn corpus_graph(&self, corpus_name: &str) -> Result<AnnotationGraph>
Return a copy of the graph of the corpus structure given by corpus_name.
pub fn frequency<S: AsRef<str>>( &self, query: SearchQuery<'_, S>, definition: Vec<FrequencyDefEntry> ) -> Result<FrequencyTable<String>>
Execute a frequency query.
- query - The search query definition.
- definition - A list of frequency query definitions.
Returns a frequency table of strings.
pub fn node_descriptions( &self, query: &str, query_language: QueryLanguage ) -> Result<Vec<QueryAttributeDescription>>
Parses a query and returns a list of descriptions for its nodes.
- query - The query to be analyzed.
- query_language - The query language of the query (e.g. AQL).
pub fn list_components( &self, corpus_name: &str, ctype: Option<AnnotationComponentType>, name: Option<&str> ) -> Result<Vec<Component<AnnotationComponentType>>>
Returns a list of all components of a corpus given by corpus_name.
- ctype - Optionally filter by the component type.
- name - Optionally filter by the component name.
pub fn list_node_annotations( &self, corpus_name: &str, list_values: bool, only_most_frequent_values: bool ) -> Result<Vec<Annotation>>
Returns a list of all node annotations of a corpus given by corpus_name.
- list_values - If true, include the possible values in the result.
- only_most_frequent_values - If both this argument and list_values are true, only return the most frequent value for each annotation name.
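The effect of only_most_frequent_values can be pictured with a small model over plain (name, value) pairs. This is a sketch only: annotation_values is a hypothetical helper, it uses string tuples instead of the crate's Annotation type, and its tie-breaking rule (the lexicographically smaller value wins) is an assumption of the example.

```rust
use std::collections::BTreeMap;

/// Hypothetical model: `occurrences` holds one (annotation name, value)
/// pair per occurrence. With `only_most_frequent` set, a single value is
/// kept per name; otherwise every distinct (name, value) pair is listed.
fn annotation_values(occurrences: &[(&str, &str)], only_most_frequent: bool) -> Vec<(String, String)> {
    // Count how often each (name, value) combination occurs.
    let mut counts: BTreeMap<(&str, &str), usize> = BTreeMap::new();
    for &(name, value) in occurrences {
        *counts.entry((name, value)).or_insert(0) += 1;
    }
    if !only_most_frequent {
        return counts.keys().map(|(n, v)| (n.to_string(), v.to_string())).collect();
    }
    // Keep the value with the highest count for each annotation name.
    let mut best: BTreeMap<&str, (&str, usize)> = BTreeMap::new();
    for (&(name, value), &count) in &counts {
        let entry = best.entry(name).or_insert((value, count));
        if count > entry.1 {
            *entry = (value, count);
        }
    }
    best.into_iter().map(|(n, (v, _))| (n.to_string(), v.to_string())).collect()
}
```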
pub fn list_edge_annotations( &self, corpus_name: &str, component: &Component<AnnotationComponentType>, list_values: bool, only_most_frequent_values: bool ) -> Result<Vec<Annotation>>
Returns a list of all edge annotations of a corpus given by corpus_name and the component.
- list_values - If true, include the possible values in the result.
- only_most_frequent_values - If both this argument and list_values are true, only return the most frequent value for each annotation name.