Struct analiticcl::VariantModel

pub struct VariantModel {
    pub decoder: VocabDecoder,
    pub encoder: VocabEncoder,
    pub alphabet: Alphabet,
    pub index: AnaIndex,
    pub sortedindex: BTreeMap<u16, Vec<AnaValue>>,
    pub ngrams: HashMap<NGram, u32>,
    pub freq_sum: Vec<usize>,
    pub have_freq: bool,
    pub have_lm: bool,
    pub context_rules: Vec<ContextRule>,
    pub tags: Vec<String>,
    pub weights: Weights,
    pub lexicons: Vec<String>,
    pub confusables: Vec<Confusable>,
    pub confusables_before_pruning: bool,
    pub debug: u8,
}

The VariantModel is the highest-level model in analiticcl; it holds all data required for variant matching.

Fields§

§decoder: VocabDecoder

Maps Vocabulary IDs to their textual strings and other related properties

§encoder: VocabEncoder

Maps strings to vocabulary IDs

§alphabet: Alphabet

Defines the alphabet used for the variant model

§index: AnaIndex

The main index, mapping anagrams to instances
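As a rough illustration of an index that maps anagrams to instances, the sketch below groups words under an order-independent key. It assumes a prime-product encoding for the key and uses u128 as a stand-in for AnaValue; the names anagram_key and build_index are hypothetical and not part of analiticcl's API.

```rust
use std::collections::HashMap;

/// First 26 primes, one per lowercase ASCII letter (illustrative alphabet only).
const PRIMES: [u128; 26] = [
    2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89,
    97, 101,
];

/// Hypothetical stand-in for an anagram value: the product of the primes
/// assigned to each letter. Returns None for characters outside a-z or
/// when the product would overflow u128.
fn anagram_key(word: &str) -> Option<u128> {
    word.chars().try_fold(1u128, |acc, c| {
        if c.is_ascii_lowercase() {
            acc.checked_mul(PRIMES[(c as u8 - b'a') as usize])
        } else {
            None
        }
    })
}

/// Groups words by anagram key, so that anagrams of each other share one entry.
fn build_index<'a>(words: &[&'a str]) -> HashMap<u128, Vec<&'a str>> {
    let mut index: HashMap<u128, Vec<&'a str>> = HashMap::new();
    for &word in words {
        if let Some(key) = anagram_key(word) {
            index.entry(key).or_default().push(word);
        }
    }
    index
}
```

Because multiplication is commutative, "listen" and "silent" produce the same key and land in the same entry.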

§sortedindex: BTreeMap<u16, Vec<AnaValue>>

A secondary sorted index. Indices (keys) of the outer collection correspond to the length of an anagram (in chars) - 1. The inner vector is always sorted.
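The layout described above can be sketched with standard collections. This is only an illustration: u64 stands in for AnaValue, the helper names are hypothetical, and skipping duplicates on insert is an assumption of the sketch.

```rust
use std::collections::BTreeMap;

/// Inserts `value` into the bucket for anagrams of `length` characters
/// (stored under key length - 1), keeping the bucket sorted.
/// Skipping duplicates is an assumption of this sketch.
fn insert_sorted(index: &mut BTreeMap<u16, Vec<u64>>, length: u16, value: u64) {
    let bucket = index.entry(length.saturating_sub(1)).or_default();
    if let Err(pos) = bucket.binary_search(&value) {
        bucket.insert(pos, value);
    }
}

/// Looks up a value with a binary search, which relies on the bucket being sorted.
fn contains(index: &BTreeMap<u16, Vec<u64>>, length: u16, value: u64) -> bool {
    index
        .get(&length.saturating_sub(1))
        .map_or(false, |bucket| bucket.binary_search(&value).is_ok())
}
```

Keeping each inner vector sorted is what makes the binary search in contains valid.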

§ngrams: HashMap<NGram, u32>

Ngrams for simple context-sensitive language modelling when finding the most probable sequence of variants

§freq_sum: Vec<usize>

Total frequency per n-gram order. The index corresponds to n-1, so this holds the total count for unigrams, bigrams, etc.
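To make the n-1 indexing concrete, here is a minimal sketch that turns a raw count into a relative frequency. The function name and the choice to return None for missing or empty totals are assumptions of the sketch, not analiticcl behaviour.

```rust
/// Relative frequency of an n-gram of order `n` (1 = unigram, 2 = bigram, ...),
/// given per-order totals stored at index n - 1.
fn relative_frequency(freq_sum: &[usize], n: usize, count: u32) -> Option<f64> {
    // n = 0 is invalid; an order beyond the stored totals has no sum.
    let total = *freq_sum.get(n.checked_sub(1)?)?;
    if total == 0 {
        None
    } else {
        Some(count as f64 / total as f64)
    }
}
```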

§have_freq: bool

Do we have frequency information for variant matching?

§have_lm: bool

Do we have a language model (LM)?

§context_rules: Vec<ContextRule>

Context rules

§tags: Vec<String>

Tags used by the context rules

§weights: Weights

Weights used in distance scoring

§lexicons: Vec<String>

Stores the names of the loaded lexicons. Individual items reference them by index for provenance reasons.

§confusables: Vec<Confusable>

Holds weighted confusable recipes that can be used in scoring and ranking

§confusables_before_pruning: bool

Process confusables before pruning by max_matches

§debug: u8

Implementations§

impl VariantModel

pub fn new(alphabet_file: &str, weights: Weights, debug: u8) -> VariantModel

Instantiate a new variant model

pub fn new_with_alphabet(alphabet: Alphabet, weights: Weights, debug: u8) -> VariantModel

Instantiate a new variant model, explicitly passing an alphabet rather than loading one from file.

pub fn set_confusables_before_pruning(&mut self)

Configure the model to match against known confusables prior to pruning on maximum weight. This may lead to better results but may have a significant performance impact.

pub fn alphabet_size(&self) -> CharIndexType

Returns the size of the alphabet. This is typically one larger than the number of entries in the alphabet file, as it includes the UNKNOWN symbol.

pub fn get_or_create_index<'a, 'b>(&'a mut self, anahash: &'b AnaValue) -> &'a mut AnaIndexNode

Get an item from the index or insert it if it doesn’t exist yet
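The get-or-insert pattern can be sketched with the standard HashMap entry API. Node is a hypothetical, simplified stand-in for AnaIndexNode, and u64 stands in for AnaValue; the real index type may differ.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified node type; the real AnaIndexNode is not shown here.
#[derive(Debug, Default, PartialEq)]
struct Node {
    instances: Vec<u64>,
}

/// Returns a mutable reference to the node for `key`,
/// inserting an empty node first if the key is absent.
fn get_or_create<'a>(index: &'a mut HashMap<u64, Node>, key: &u64) -> &'a mut Node {
    index.entry(*key).or_default()
}
```

Calling get_or_create twice with the same key returns the same node, so repeated additions accumulate in one place.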

pub fn build(&mut self)

Build the anagram index (and secondary index) so the model is ready for variant matching

pub fn contains_key(&self, key: &AnaValue) -> bool

Tests if the anagram value exists in the index

pub fn get_anagram_instances(&self, text: &str) -> Vec<&VocabValue>

Get all anagram instances for a specific entry

pub fn get(&self, text: &str) -> Option<&VocabValue>

Get an exact item in the lexicon (if it exists)

pub fn has(&self, text: &str) -> bool

Tests if the lexicon has a specific entry, by text

pub fn get_vocab(&self, vocab_id: VocabId) -> Option<&VocabValue>

Resolves a vocabulary ID

pub fn decompose_anavalue(&self, av: &AnaValue) -> Vec<&str>

Decomposes and decodes an anagram value into the characters that make it up. Mostly intended for debugging purposes.
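A minimal sketch of such a decomposition, assuming the anagram value is a product of primes with one prime per alphabet entry. The five-letter ALPHABET table, the u128 value type and the decompose function are all hypothetical.

```rust
/// Illustrative alphabet: each entry is paired with a prime
/// (an assumption of this sketch).
const ALPHABET: [(&str, u128); 5] = [("a", 2), ("b", 3), ("c", 5), ("d", 7), ("e", 11)];

/// Splits a prime-product value back into the alphabet entries that make it up.
/// Entries come out in alphabet order, since the original order is not encoded.
fn decompose(mut value: u128) -> Vec<&'static str> {
    let mut chars = Vec::new();
    if value == 0 {
        // Zero is not a valid product of primes; avoid looping forever.
        return chars;
    }
    for &(symbol, prime) in ALPHABET.iter() {
        while value % prime == 0 {
            chars.push(symbol);
            value /= prime;
        }
    }
    chars
}
```

For example, 132 = 2 * 2 * 3 * 11 decomposes into "a", "a", "b", "e".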

pub fn read_alphabet(&mut self, filename: &str) -> Result<(), Error>

Reads the alphabet from a TSV file. The file contains one alphabet entry per line, but a line may consist of multiple tab-separated alphabet entries, which will be treated as identical. The alphabet is not limited to single characters but may contain longer strings. A greedy matching approach is used, so order matters (but only for this).
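The sketch below illustrates why order matters under greedy matching. It assumes "greedy" means taking the first entry, in file order, that matches at the current position, and it maps unmatched characters to alphabet.len() as a stand-in for the UNKNOWN symbol. Both functions are hypothetical and do not mirror analiticcl's internals.

```rust
/// Parses alphabet data: one entry per line; tab-separated alternatives on a
/// line are treated as identical (they share the same index).
fn parse_alphabet(data: &str) -> Vec<Vec<String>> {
    data.lines()
        .filter(|line| !line.is_empty())
        .map(|line| line.split('\t').map(str::to_string).collect::<Vec<String>>())
        .collect()
}

/// Encodes text as alphabet indices, trying entries in file order and taking
/// the first one that matches at the current position (assumed strategy).
/// Unmatched characters map to `alphabet.len()`.
fn encode(text: &str, alphabet: &[Vec<String>]) -> Vec<usize> {
    let mut out = Vec::new();
    let mut rest = text;
    'outer: while let Some(first) = rest.chars().next() {
        for (i, entry) in alphabet.iter().enumerate() {
            for alt in entry {
                if !alt.is_empty() && rest.starts_with(alt.as_str()) {
                    out.push(i);
                    rest = &rest[alt.len()..];
                    continue 'outer;
                }
            }
        }
        // No entry matched: emit the unknown index and skip one character.
        out.push(alphabet.len());
        rest = &rest[first.len_utf8()..];
    }
    out
}
```

With the entry "ij" listed before "i", the text "ij" is encoded as a single symbol. With "i" listed first, the "i" is consumed on its own and the "j" is left to be matched separately.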

pub fn read_confusablelist(&mut self, filename: &str) -> Result<(), Error>

Reads a confusable list from a TSV file. The file contains edit scripts in the first column (formatted in sesdiff style) and optionally a weight in the second column. Favourable confusables have a weight > 1.0, unfavourable ones have a weight < 1.0 (penalties). Weight values should be relatively close to 1.0, as they are applied to the entire score.
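As an illustration of the two-column layout, the following sketch parses a single line. The function name is hypothetical, the edit script is kept as an opaque string rather than being parsed, and the default weight of 1.0 (neutral) when the second column is absent is an assumption.

```rust
/// Parses one line of a confusable list: an edit script, optionally followed
/// by a tab and a weight. The default of 1.0 (neutral) is an assumption.
fn parse_confusable_line(line: &str) -> Result<(String, f64), String> {
    let mut fields = line.split('\t');
    let script = fields
        .next()
        .filter(|s| !s.is_empty())
        .ok_or("missing edit script")?;
    let weight = match fields.next() {
        Some(w) => w
            .trim()
            .parse::<f64>()
            .map_err(|e| format!("bad weight {:?}: {}", w, e))?,
        None => 1.0,
    };
    Ok((script.to_string(), weight))
}
```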

pub fn add_to_confusables(&mut self, editscript: &str, weight: f64) -> Result<(), Error>

Add a confusable

pub fn add_variant(&mut self, ref_id: VocabId, variant: &str, score: f64, freq: Option<u32>, params: &VocabParams) -> bool

Add a (weighted) variant to the model, referring to a reference that already exists in the model. Variants will be added to the lexicon automatically when necessary. Set VocabType::TRANSPARENT if you want variants to only be used as an intermediate towards items that have already been added previously through a more authoritative lexicon.

pub fn add_variant_by_id(&mut self, ref_id: VocabId, variantid: VocabId, score: f64) -> bool

Add a (weighted) variant to the model, referring to a reference that already exists in the model. Variants will be added to the lexicon automatically when necessary. Set VocabType::TRANSPARENT if you want variants to only be used as an intermediate towards items that have already been added previously through a more authoritative lexicon.

pub fn read_vocabulary(&mut self, filename: &str, params: &VocabParams) -> Result<(), Error>

Reads a vocabulary (a lexicon or corpus-derived lexicon) from a TSV file. The file may contain frequency information. The parameters define which value is read from which column.
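To illustrate column-driven parsing, the sketch below reads one line using a hypothetical Columns struct with zero-based indices. It is a stand-in for the idea only; the actual fields of VocabParams are not shown in this page and are not assumed here.

```rust
/// Hypothetical column settings for this sketch (zero-based column indices).
struct Columns {
    text: usize,
    freq: Option<usize>,
}

/// Parses one TSV line into a text and an optional frequency.
/// Returns None when the text column is missing or empty.
fn parse_vocab_line(line: &str, cols: &Columns) -> Option<(String, Option<u32>)> {
    let fields: Vec<&str> = line.split('\t').collect();
    let text = fields.get(cols.text)?.trim();
    if text.is_empty() {
        return None;
    }
    // A missing or unparsable frequency column simply yields no frequency.
    let freq = cols
        .freq
        .and_then(|c| fields.get(c))
        .and_then(|f| f.trim().parse::<u32>().ok());
    Some((text.to_string(), freq))
}
```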

pub fn read_contextrules(&mut self, filename: &str) -> Result<(), Error>

pub fn add_contextrule(&mut self, pattern: &str, score: f32, tag: Vec<&str>, tagoffset: Vec<&str>) -> Result<(), Error>

pub fn read_variants(&mut self, filename: &str, params: Option<&VocabParams>, transparent: bool) -> Result<(), Error>

Read a weighted variant list from a TSV file. Contains a canonical/reference form in the first column, and variants with score (two columns) in the following columns. May also contain frequency information (auto detected), in which case the first column has the canonical/reference form, the second column the frequency, and all further columns hold variants, their score and their frequency (three columns). Consumes much more memory than equally weighted variants.
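The score-only layout (reference form, then variant and score pairs) can be sketched as follows. The function name is hypothetical, and the layout with frequency columns and its auto-detection are deliberately left out.

```rust
/// Parses one line in the score-only layout: a reference form in the first
/// column, followed by (variant, score) column pairs.
fn parse_variant_line(line: &str) -> Result<(String, Vec<(String, f64)>), String> {
    let mut fields = line.split('\t');
    let reference = fields
        .next()
        .filter(|s| !s.is_empty())
        .ok_or("missing reference form")?
        .to_string();
    let rest: Vec<&str> = fields.collect();
    if rest.len() % 2 != 0 {
        return Err("variant without a score".to_string());
    }
    let mut variants = Vec::new();
    for pair in rest.chunks(2) {
        let score = pair[1]
            .trim()
            .parse::<f64>()
            .map_err(|e| format!("bad score {:?}: {}", pair[1], e))?;
        variants.push((pair[0].to_string(), score));
    }
    Ok((reference, variants))
}
```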

pub fn add_to_vocabulary(&mut self, text: &str, frequency: Option<u32>, params: &VocabParams) -> VocabId

Adds an entry in the vocabulary

pub fn find_variants(&self, input: &str, params: &SearchParameters) -> Vec<VariantResult>

Finds variants in the vocabulary for a given string (in its totality). Returns a vector of results, each holding a vocabulary ID, a distance score and a frequency score (VocabId, distance_score, freq_score). The resulting vocabulary IDs can be resolved through get_vocab().

pub fn learn_variants<'a, I>(&mut self, input: I, params: &SearchParameters, strict: bool, auto_build: bool) -> usize
where I: IntoParallelIterator<Item = &'a String> + IntoIterator<Item = &'a String>,

Processes input and finds variants (like find_variants()), but all variants that are found (and meet the set thresholds) are stored in the model rather than returned. Unlike find_variants(), this is invoked with an iterator over multiple inputs and does not return the variants themselves. It automatically applies parallelisation.

pub fn rescore_confusables(&self, results: &mut Vec<VariantResult>, input: &str)

Rescore results according to confusables

pub fn rank_results(&self, results: &mut Vec<VariantResult>, freq_weight: f32)

Sorts a result vector of (VocabId, distance_score, freq_score) in decreasing order (best result first)
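A minimal sketch of ranking in decreasing order. The Scored struct is a hypothetical stand-in for VariantResult, and combining the two scores as a weighted mean is an assumption made for illustration; the page does not state how freq_weight enters the ranking.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for a result (VocabId, distance_score, freq_score).
#[derive(Debug, Clone, PartialEq)]
struct Scored {
    vocab_id: u64,
    distance_score: f64,
    freq_score: f64,
}

/// Combined score (assumption): weighted mean of distance and frequency score.
fn combined(r: &Scored, freq_weight: f64) -> f64 {
    (r.distance_score + freq_weight * r.freq_score) / (1.0 + freq_weight)
}

/// Sorts results in decreasing order of combined score (best result first).
fn rank(results: &mut [Scored], freq_weight: f64) {
    results.sort_by(|a, b| {
        combined(b, freq_weight)
            .partial_cmp(&combined(a, freq_weight))
            .unwrap_or(Ordering::Equal)
    });
}
```

With freq_weight set to 0.0 the order depends on the distance score alone; larger values let frequent items overtake closer but rarer ones.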

pub fn expand_variants(&self, results: Vec<VariantResult>) -> Vec<VariantResult>

Expands variants, adding all references for variants. In case variants are ‘transparent’, only the references will be retained as results. The results list does not need to be sorted yet. This function may yield duplicates. For performance, call this only when you know there are variants that may be expanded.
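The expansion can be sketched on bare IDs. Here references maps a variant's ID to the IDs of its reference forms and transparent marks variants that should not appear in the output themselves. Both structures, and the omission of scores, are simplifications assumed for this sketch.

```rust
use std::collections::{HashMap, HashSet};

/// Expands each result with the references it points to. Transparent variants
/// are dropped and only their references are kept. No deduplication is done,
/// so the output may contain duplicates.
fn expand(
    results: Vec<u64>,
    references: &HashMap<u64, Vec<u64>>,
    transparent: &HashSet<u64>,
) -> Vec<u64> {
    let mut out = Vec::with_capacity(results.len());
    for id in results {
        if !transparent.contains(&id) {
            out.push(id);
        }
        if let Some(refs) = references.get(&id) {
            out.extend(refs.iter().copied());
        }
    }
    out
}
```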

pub fn compute_confusable_weight(&self, input: &str, candidate: VocabId) -> f64

Computes the weight over known confusables. Should return 1.0 when there are no known confusables, < 1.0 when there are unfavourable confusables, and > 1.0 when there are favourable confusables.
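The three cases can be illustrated with a small sketch that assumes the weights of all applicable confusables are multiplied together. Both function names are hypothetical, and how analiticcl decides which confusables apply is not modelled.

```rust
/// Multiplies the weights of all confusables that apply (assumed combination
/// rule). The empty product is 1.0, the neutral case.
fn confusable_weight(matching_weights: &[f64]) -> f64 {
    matching_weights.iter().product()
}

/// Applies the combined weight to an entire score.
fn rescore(score: f64, matching_weights: &[f64]) -> f64 {
    score * confusable_weight(matching_weights)
}
```

Since the weight scales the whole score, values far from 1.0 would dominate every other scoring component.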

pub fn add_to_reverse_index(&self, reverseindex: &mut ReverseIndex, input: &str, matched_vocab_id: VocabId, score: f64)

Adds the input item to the reverse index, as an instantiation of the given vocabulary ID

pub fn find_all_matches<'a>(&self, text: &'a str, params: &SearchParameters) -> Vec<Match<'a>>

Searches a text and returns all highest-ranking variants found in the text

pub fn test_context_rules<'a>(&self, sequence: &Sequence) -> (f64, Vec<Vec<PatternMatchResult>>)

Favours or penalizes certain combinations of lexicon matches. For example, matching words X and Y with lexicons A and B respectively might be favoured over other combinations. This returns either a bonus or a penalty score (a number slightly above or below 1.0) for the sequence as a whole.

pub fn lm_score<'a>(&self, sequence: &Sequence, boundaries: &[Match<'a>]) -> (f32, f64)

Computes the logprob and perplexity for a given sequence as produced in most_likely_sequence()

pub fn lm_score_tokens<'a>(&self, tokens: &Vec<Option<VocabId>>) -> (f32, f64)

Computes the logprob and perplexity for a given sequence of tokens. The tokens are either in the vocabulary or are None if out-of-vocabulary.
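For readers unfamiliar with the two quantities, the sketch below shows the textbook relation between them: the log probability is the sum of per-token log probabilities, and the perplexity is the exponential of the negated average. The use of natural logarithms, the function name and the input of ready-made per-token probabilities are assumptions. How analiticcl derives probabilities from its n-grams or handles out-of-vocabulary tokens is not shown.

```rust
/// Returns (logprob, perplexity) for a sequence of per-token probabilities,
/// using natural logarithms. Returns None for an empty sequence or for
/// probabilities outside (0, 1].
fn logprob_and_perplexity(token_probs: &[f64]) -> Option<(f64, f64)> {
    if token_probs.is_empty() || token_probs.iter().any(|&p| !(p > 0.0 && p <= 1.0)) {
        return None;
    }
    let logprob: f64 = token_probs.iter().map(|p| p.ln()).sum();
    let perplexity = (-logprob / token_probs.len() as f64).exp();
    Some((logprob, perplexity))
}
```

A sequence in which every token has probability 0.25 has a perplexity of about 4, matching the intuition of choosing uniformly among four options at each step.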

pub fn add_ngram(&mut self, ngram: NGram, frequency: u32)

Add an ngram for language modelling

pub fn match_to_str<'a>(&'a self, m: &Match<'a>) -> &'a str

Gives the text representation for this match. It always uses the solution (if any) and falls back to the input text only when no solution was found.

pub fn match_to_vocabvalue<'a>(&'a self, m: &Match<'a>) -> Option<&'a VocabValue>

Gives the vocabulary item for this match. It always uses the solution (if any) and falls back to the input text only when no solution was found.

pub fn ngram_to_str(&self, ngram: &NGram) -> String

Turns the ngram into a tokenised string; the tokens in the ngram will be separated by a space.

pub fn match_to_ngram<'a>(&'a self, m: &Match<'a>, boundaries: &[Match<'a>]) -> Result<NGram, String>

Converts a match to an NGram representation. This only works if all tokens in the ngram are in the vocabulary.

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T> Pointable for T

const ALIGN: usize = _

The alignment of pointer.

type Init = T

The type for initializers.

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer.

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer.

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer.

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer.

impl<T> Same for T

type Output = T

Should always be Self

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V