§tantivy
Tantivy is a search engine library.
Think Lucene, but in Rust.
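use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::{doc, DocAddress, Index, Score};

// First we need to define a schema ...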
// `TEXT` means the field should be tokenized and indexed,
// along with its term frequency and term positions.
//
// `STORED` means that the field will also be saved
// in a compressed, row-oriented key-value store.
// This store is useful to reconstruct the
// documents that were selected during the search phase.
let mut schema_builder = Schema::builder();
let title = schema_builder.add_text_field("title", TEXT | STORED);
let body = schema_builder.add_text_field("body", TEXT);
let schema = schema_builder.build();
// Indexing documents
// `index_path` is assumed to be the path of an existing, empty directory.
let index = Index::create_in_dir(index_path, schema.clone())?;
// Here we use a buffer of 100MB that will be split
// between indexing threads.
let mut index_writer = index.writer(100_000_000)?;
// Let's index one document!
index_writer.add_document(doc!(
    title => "The Old Man and the Sea",
    body => "He was an old man who fished alone in a skiff in \
             the Gulf Stream and he had gone eighty-four days \
             now without taking a fish."
))?;
// We need to call .commit() explicitly to force the
// index_writer to finish processing the documents in the queue,
// flush the current index to the disk, and advertise
// the existence of new documents.
index_writer.commit()?;
// Searching
let reader = index.reader()?;
let searcher = reader.searcher();
let query_parser = QueryParser::for_index(&index, vec![title, body]);
// `QueryParser` may fail if the query is not in the right
// format. For user-facing applications, this can be a problem.
// A ticket has been opened regarding this problem.
let query = query_parser.parse_query("sea whale")?;
// Perform search.
// `top_docs` contains the 10 most relevant doc ids, sorted by decreasing scores...
let top_docs: Vec<(Score, DocAddress)> =
    searcher.search(&query, &TopDocs::with_limit(10))?;
for (_score, doc_address) in top_docs {
    // Retrieve the actual content of the document given its `doc_address`.
    let retrieved_doc = searcher.doc(doc_address)?;
    println!("{}", schema.to_json(&retrieved_doc));
}
A good place for you to get started is to check out the example code (literate programming / source code).
Re-exports§
pub use crate::error::TantivyError;
pub use crate::directory::Directory;
pub use crate::postings::Postings;
pub use crate::schema::DateOptions;
pub use crate::schema::DatePrecision;
pub use crate::schema::Document;
pub use crate::schema::Term;
pub use time;
Modules§
- aggregation
- Aggregations
- collector
- Collectors
- directory
- WORM (Write Once Read Many) directory abstraction.
- error
- Definition of Tantivy’s errors and results.
- fastfield
- Column oriented field storage for tantivy.
- fieldnorm
- The fieldnorm represents the length associated with a given Field of a given document.
- merge_policy
- Defines tantivy’s merging strategy
- positions
- Tantivy can (if instructed to do so in the schema) store the term positions in a given field. This position is expressed as a token ordinal. For instance, in “The beauty and the beast”, the term “the” appears in position 0 and position 3. This information is useful to run phrase queries.
- postings
- Postings module (also called inverted index)
- query
- Module containing the different query implementations.
- schema
- Schema definition for tantivy’s indices.
- space_usage
- Representations for the space usage of various parts of a Tantivy index.
- store
- Compressed/slow/row-oriented storage for documents.
- termdict
- The term dictionary's main role is to associate the sorted `Term`s to a `TermInfo` struct that contains some meta-information about the term.
- tokenizer
- Tokenizers are in charge of chopping text into a stream of tokens ready for indexing.
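The token-ordinal positions described under `positions` can be illustrated with a small standard-library sketch. It is not tantivy's tokenizer or postings format: the helpers `term_positions` and `contains_phrase` are hypothetical names, and tokenization is assumed to be a plain whitespace split followed by lowercasing.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: for each lowercased, whitespace-separated token,
/// record the token ordinals at which it occurs.
fn term_positions(text: &str) -> BTreeMap<String, Vec<u32>> {
    let mut positions: BTreeMap<String, Vec<u32>> = BTreeMap::new();
    for (ordinal, token) in text.split_whitespace().enumerate() {
        positions
            .entry(token.to_lowercase())
            .or_default()
            .push(ordinal as u32);
    }
    positions
}

/// Hypothetical helper: a two-term phrase matches if some occurrence of
/// `second` sits exactly one position after an occurrence of `first`.
fn contains_phrase(
    positions: &BTreeMap<String, Vec<u32>>,
    first: &str,
    second: &str,
) -> bool {
    match (positions.get(first), positions.get(second)) {
        (Some(a), Some(b)) => a.iter().any(|p| b.contains(&(p + 1))),
        _ => false,
    }
}
```

Calling `term_positions("The beauty and the beast")` records the term "the" at ordinals 0 and 3, and `contains_phrase` then accepts "the beast" because "beast" sits at ordinal 4. This adjacency check is why positions are useful for phrase queries.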
Macros§
- doc
- `doc!` is a shortcut that helps build `Document` objects.
Structs§
- DateTime
- A date/time value with microsecond precision.
- DemuxMapping
- DemuxMapping can be used to reorganize data from multiple segments.
- DocAddress
- `DocAddress` contains all the necessary information to identify a document given a `Searcher` object.
- DocIdToSegmentOrdinal
- DocIdToSegmentOrdinal maps from doc_id within a segment to the new segment ordinal for demuxing.
- FutureResult
- `FutureResult` is a handle that makes it possible to wait for the completion of an ongoing task.
- Index
- Search Index
- IndexBuilder
- IndexBuilder can be used to create an index.
- IndexMeta
- Meta information about the `Index`.
- IndexReader
- `IndexReader` is your entry point to read and search the index.
- IndexReaderBuilder
- `IndexReader` builder
- IndexSettings
- Search Index Settings.
- IndexSortByField
- Settings to presort the documents in an index
- IndexWriter
- `IndexWriter` is the user entry-point to add documents to an index.
- Inventory
- The `Inventory` registers and keeps track of all of the objects alive.
- InvertedIndexReader
- The inverted index reader is in charge of accessing the inverted index associated with a specific field.
- PreparedCommit
- A prepared commit
- Searcher
- Holds a list of `SegmentReader`s ready for search.
- SearcherGeneration
- Identifies the searcher generation accessed by a `Searcher`.
- Segment
- A segment is a piece of the index.
- SegmentId
- Uuid identifying a segment.
- SegmentMeta
- `SegmentMeta` contains simple meta information about a segment.
- SegmentReader
- Entry point to access all of the datastructures of the `Segment`
- Snippet
- `Snippet` contains a fragment of a document, and some highlighted parts inside it.
- SnippetGenerator
- `SnippetGenerator`
- TrackedObject
- Your tracked object.
- Version
- Structure version for the index.
Enums§
- Executor
- Search executor, which determines whether search requests are single-threaded or multithreaded.
- Order
- The order to sort by
- ReloadPolicy
- Defines when a new version of the index should be reloaded.
- SegmentComponent
- Enum describing each component of a tantivy segment. Each component is stored in its own file, using the pattern `segment_uuid`.`component_extension`, except the delete component that takes a `segment_uuid`.`delete_opstamp`.`component_extension`
- UserOperation
- UserOperation is an enum type that encapsulates other operation types.
Constants§
- TERMINATED
- Sentinel value returned when a `DocSet` has been entirely consumed.
Traits§
- DocSet
- Represents an iterable set of sorted doc ids.
- HasLen
- Has length trait
- SegmentAttributesMerger
- Allows implementing custom behaviour while merging `SegmentAttributes` of multiple segments.
- Warmer
- `Warmer` can be used to maintain segment-level state e.g. caches.
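The interplay between a sorted doc-id set and a sentinel such as `TERMINATED` can be sketched with the standard library alone. Everything below is illustrative: the trait `SketchDocSet`, its methods `doc` and `advance`, the type `VecDocSet`, and the sentinel value `u32::MAX` are assumptions made for this example and are not taken from tantivy's `DocSet` definition.

```rust
type SketchDocId = u32;

/// Hypothetical sentinel; the value tantivy actually uses is not stated here.
const SKETCH_TERMINATED: SketchDocId = u32::MAX;

/// Hypothetical, minimal stand-in for an iterable set of sorted doc ids.
trait SketchDocSet {
    /// Current doc id, or the sentinel once the set is exhausted.
    fn doc(&self) -> SketchDocId;
    /// Move to the next doc id and return it (or the sentinel).
    fn advance(&mut self) -> SketchDocId;
}

struct VecDocSet {
    docs: Vec<SketchDocId>,
    cursor: usize,
}

impl VecDocSet {
    /// Sorts and deduplicates so that iteration yields increasing doc ids.
    /// Assumes no input id equals the sentinel.
    fn new(mut docs: Vec<SketchDocId>) -> Self {
        docs.sort_unstable();
        docs.dedup();
        VecDocSet { docs, cursor: 0 }
    }
}

impl SketchDocSet for VecDocSet {
    fn doc(&self) -> SketchDocId {
        self.docs
            .get(self.cursor)
            .copied()
            .unwrap_or(SKETCH_TERMINATED)
    }

    fn advance(&mut self) -> SketchDocId {
        if self.cursor < self.docs.len() {
            self.cursor += 1;
        }
        self.doc()
    }
}

/// Drains a doc set by looping until the sentinel is returned.
fn collect_all(doc_set: &mut dyn SketchDocSet) -> Vec<SketchDocId> {
    let mut out = Vec::new();
    let mut doc = doc_set.doc();
    while doc != SKETCH_TERMINATED {
        out.push(doc);
        doc = doc_set.advance();
    }
    out
}
```

Using a sentinel rather than an `Option` keeps the consuming loop to a single integer comparison per step, which is one plausible motivation for this kind of design.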
Functions§
- demux
- Demux the segments according to `demux_mapping`. See `DemuxMapping`. The number of output_directories needs to match the max new segment ordinal from `demux_mapping`.
- f64_to_u64
- Maps an `f64` to `u64`
- i64_to_u64
- Maps an `i64` to `u64`
- u64_to_f64
- Reverse the mapping given by `f64_to_u64()`.
- u64_to_i64
- Reverse the mapping given by `i64_to_u64()`.
- version
- Expose the current version of tantivy as found in Cargo.toml during compilation, e.g. “0.11.0”, as well as the compression scheme used in the docstore.
- version_string
- Exposes the complete version of tantivy as found in Cargo.toml during compilation as a string, e.g. “tantivy v0.11.0, index_format v1, store_compression: lz4”.
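To make the idea of a reversible `i64`/`f64` to `u64` mapping concrete, here is a minimal sketch of one common order-preserving scheme. The summaries above do not say how tantivy defines its mapping, so the assumption that ordering is preserved, the `sketch_*` function names, and the bit manipulations are all illustrative rather than a description of the library's implementation.

```rust
const SIGN_BIT: u64 = 1 << 63;

/// Flipping the sign bit shifts the i64 range onto u64 while keeping order:
/// i64::MIN maps to 0 and i64::MAX maps to u64::MAX.
fn sketch_i64_to_u64(val: i64) -> u64 {
    (val as u64) ^ SIGN_BIT
}

/// Inverse of `sketch_i64_to_u64`.
fn sketch_u64_to_i64(val: u64) -> i64 {
    (val ^ SIGN_BIT) as i64
}

/// Positive floats get their sign bit set so they sort above negatives;
/// negative floats have all bits inverted so larger magnitudes sort lower.
fn sketch_f64_to_u64(val: f64) -> u64 {
    let bits = val.to_bits();
    if val.is_sign_positive() {
        bits ^ SIGN_BIT
    } else {
        !bits
    }
}

/// Inverse of `sketch_f64_to_u64`.
fn sketch_u64_to_f64(val: u64) -> f64 {
    f64::from_bits(if val & SIGN_BIT != 0 {
        val ^ SIGN_BIT
    } else {
        !val
    })
}
```

With such a mapping, comparing the `u64` images gives the same result as comparing the original numbers, so signed and floating-point values can share code paths written for unsigned integers.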
Type Aliases§
- DocId
- A `u32` identifying a document within a segment. Documents have their `DocId` assigned incrementally, as they are added in the segment.
- Opstamp
- A u64 assigned to every operation incrementally
- Result
- Tantivy result.
- Score
- A Score that represents the relevance of the document to the query
- SegmentOrdinal
- A `SegmentOrdinal` identifies a segment, within a `Searcher` or `Merger`.
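The relationship between `DocId`, `SegmentOrdinal` and `DocAddress` can be pictured with a small standard-library model. It assumes that an address is simply a (segment ordinal, doc id) pair and that a searcher holds an ordered list of segments; the types `SketchDocAddress` and `SketchSearcher` and their fields are hypothetical and only mirror the descriptions above.

```rust
/// Hypothetical model of an address: which segment, and which doc within it.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SketchDocAddress {
    segment_ord: u32,
    doc_id: u32,
}

/// Hypothetical searcher: each inner Vec is one segment, and a document's
/// index in that Vec plays the role of its incrementally assigned doc id.
struct SketchSearcher {
    segments: Vec<Vec<String>>,
}

impl SketchSearcher {
    /// Resolves an address to a stored document, if both parts are in range.
    fn doc(&self, address: SketchDocAddress) -> Option<&str> {
        self.segments
            .get(address.segment_ord as usize)?
            .get(address.doc_id as usize)
            .map(String::as_str)
    }
}
```

Because doc ids restart in every segment in this model, a doc id alone is ambiguous; pairing it with the segment ordinal is what makes the address unique within one searcher.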