Struct nlprule::tokenizer::tag::Tagger
pub struct Tagger { /* fields omitted */ }
The lexical tagger. Created from a dictionary that looks like this:
actualize actualize VB
actualize actualize VBP
actualized actualize VBD
actualized actualize VBN
actualizes actualize VBZ
actualizing actualize VBG
actually actually RB
i.e. one word (left) associated with one or more pairs of lemma (middle) and POS (part-of-speech) tag (right). From this structure, the tagger must be able to look up:
- (1) lemma and POS by word: all lemmas and POS tags associated with a given word.
Lookup (1) is called extensively (at least once for every word), so it has to be as fast as possible.
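To make the dictionary format and lookup (1) concrete, here is a minimal sketch using only the standard library. The `Dictionary` alias and the `parse_dictionary` and `lookup` functions are hypothetical names chosen for illustration; they are not part of nlprule, which uses the ID-based layout described under Implementation rather than plain strings.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory form of the dictionary: word -> [(lemma, POS tag)].
type Dictionary = HashMap<String, Vec<(String, String)>>;

/// Parses lines of the form `word lemma TAG`, skipping lines with fewer than three fields.
fn parse_dictionary(source: &str) -> Dictionary {
    let mut dict = Dictionary::new();
    for line in source.lines() {
        let mut parts = line.split_whitespace();
        if let (Some(word), Some(lemma), Some(tag)) = (parts.next(), parts.next(), parts.next()) {
            dict.entry(word.to_string())
                .or_default()
                .push((lemma.to_string(), tag.to_string()));
        }
    }
    dict
}

/// Lookup (1): all (lemma, POS tag) pairs for a word, empty if the word is unknown.
fn lookup<'a>(dict: &'a Dictionary, word: &str) -> &'a [(String, String)] {
    dict.get(word).map(|pairs| pairs.as_slice()).unwrap_or(&[])
}
```

With the sample dictionary above, looking up "actualize" would yield two pairs (tags VB and VBP), and an unknown word would yield an empty slice.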
Implementation
The tagger stores two bidirectional maps:
- A POS bimap: A bimap assigning each POS tag a 16-bit ID. POS tags are a closed set, so there is an entry for every tag in the bimap. This allows e.g. storing a set of IDs of matching tags for a regex instead of actually evaluating it in the matcher logic.
- A word bimap: A bimap assigning each known word a 32-bit ID. Words are not a closed set, so if an entry in this map does not exist it only means that the word is not known to nlprule. Still, the map can be used for optimizations similar to the POS bimap. The word bimap also stores lemmas since there is often a large overlap between known words and known lemmas.
Together, these two maps support lookup (1) at a relatively low memory cost while keeping it fast.
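The POS bimap idea can be sketched as follows. `PosBiMap` and its methods are hypothetical and only illustrate the technique; the real data structure in nlprule may differ. Since no regex engine is used here, `matching_ids` takes an arbitrary predicate as a stand-in for a regex and evaluates it once over the closed tag set, so that later matching only needs a set membership test on IDs.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical bidirectional map between POS tag strings and 16-bit IDs.
#[derive(Default)]
struct PosBiMap {
    id_by_tag: HashMap<String, u16>,
    tag_by_id: Vec<String>,
}

impl PosBiMap {
    /// Returns the existing ID for `tag`, or assigns the next free one.
    fn insert(&mut self, tag: &str) -> u16 {
        if let Some(&id) = self.id_by_tag.get(tag) {
            return id;
        }
        assert!(self.tag_by_id.len() <= u16::MAX as usize, "too many POS tags");
        let id = self.tag_by_id.len() as u16;
        self.id_by_tag.insert(tag.to_string(), id);
        self.tag_by_id.push(tag.to_string());
        id
    }

    /// Tag -> ID direction.
    fn id(&self, tag: &str) -> Option<u16> {
        self.id_by_tag.get(tag).copied()
    }

    /// ID -> tag direction.
    fn tag(&self, id: u16) -> Option<&str> {
        self.tag_by_id.get(usize::from(id)).map(String::as_str)
    }

    /// Evaluates a pattern once against the closed tag set and keeps the matching IDs.
    fn matching_ids(&self, pattern: impl Fn(&str) -> bool) -> HashSet<u16> {
        self.id_by_tag
            .iter()
            .filter(|(tag, _)| pattern(tag.as_str()))
            .map(|(_, &id)| id)
            .collect()
    }
}
```

A word bimap could follow the same shape with `u32` IDs, with the difference that a missing entry is expected and simply means the word is unknown.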
There is a tags map which associates a Word ID with multiple pairs of (lemma_id, pos_id), where the ID for the lemma is a regular 32-bit Word ID.
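As a rough illustration of the tags map, the sketch below stores words and lemmas in one shared ID space and keeps only integer pairs per word. The names `WordIdInt`, `PosIdInt`, `Tags`, `intern` and `add_entry` are assumptions made for this example, not the library's actual definitions.

```rust
use std::collections::HashMap;

/// Hypothetical ID types mirroring the widths described above.
type WordIdInt = u32;
type PosIdInt = u16;

/// Word ID -> all (lemma ID, POS ID) pairs for that word.
type Tags = HashMap<WordIdInt, Vec<(WordIdInt, PosIdInt)>>;

/// Interns `text` in the word map, reusing the ID if the string is already present.
fn intern(words: &mut HashMap<String, WordIdInt>, text: &str) -> WordIdInt {
    let next = words.len() as WordIdInt;
    *words.entry(text.to_string()).or_insert(next)
}

/// Records one dictionary entry; words and lemmas share a single ID space.
fn add_entry(
    words: &mut HashMap<String, WordIdInt>,
    tags: &mut Tags,
    word: &str,
    lemma: &str,
    pos_id: PosIdInt,
) {
    let word_id = intern(words, word);
    let lemma_id = intern(words, lemma);
    tags.entry(word_id).or_default().push((lemma_id, pos_id));
}
```

Because "actualize" is both a word and the lemma of "actualized", it is interned only once, which is the overlap the word bimap is meant to exploit.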
Implementations
impl Tagger
pub fn id_tag<'a>(&self, tag: &'a str) -> PosId<'a>
Tags the given string representation of a part-of-speech tag. Part-of-speech tags are treated as a closed set so each valid part-of-speech tag will get a numerical id.
pub fn id_word<'a>(&self, text: Cow<'a, str>) -> WordId<'a>
Tags the given text. Unknown words will not get a numerical id.
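The "unknown words will not get a numerical id" behaviour can be pictured with a small sketch. `SketchWordId` and the free function `id_word` below are hypothetical stand-ins, assuming the returned value keeps the text and carries an optional ID; the real `WordId` type may be laid out differently.

```rust
use std::borrow::Cow;
use std::collections::HashMap;

/// Hypothetical stand-in for `WordId`: the text plus an ID only if the word is known.
#[derive(Debug, PartialEq)]
struct SketchWordId<'a> {
    text: Cow<'a, str>,
    id: Option<u32>,
}

/// Looks the text up in the known-word map; the text is kept either way.
fn id_word<'a>(known: &HashMap<String, u32>, text: Cow<'a, str>) -> SketchWordId<'a> {
    let id = known.get(&*text).copied();
    SketchWordId { text, id }
}
```

Taking a `Cow` lets callers pass either a borrowed slice of the input or an owned, modified string without forcing an allocation in the borrowed case.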
pub fn get_tags_with_options<'a>(
&'a self,
word: &'a str,
add_lower: Option<bool>,
use_compound_split_heuristic: Option<bool>
) -> TagIter<'a>
Get the tags and lemmas (as WordData) for the given word.
Arguments
- word: The word to look up data for.
- add_lower: Whether to add data for the lowercase variant of the word. If None, will be set according to the language options.
- use_compound_split_heuristic: Whether to use a heuristic to split compound words. If None, will be set according to the language options. If true, will attempt to find tags for words which are longer than some cutoff and unknown by looking up tags for substrings from left to right until tags are found or a minimum length is reached.
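One possible reading of the two options is sketched below on a plain string dictionary. Everything here is an assumption for illustration: the function name `get_tags_sketch`, the `Dictionary` alias, the concrete values of `COMPOUND_CUTOFF` and `MIN_PART_LEN` (the description only says "some cutoff" and "a minimum length"), and the choice to drop leading characters one at a time and look up the remainder. The actual implementation may differ in these details.

```rust
use std::collections::HashMap;

/// Hypothetical dictionary form: word -> [(lemma, POS tag)].
type Dictionary = HashMap<String, Vec<(String, String)>>;

/// Assumed values; the description does not state the real ones.
const COMPOUND_CUTOFF: usize = 7;
const MIN_PART_LEN: usize = 4;

fn get_tags_sketch(
    dict: &Dictionary,
    word: &str,
    add_lower: bool,
    use_compound_split_heuristic: bool,
) -> Vec<(String, String)> {
    let mut tags = dict.get(word).cloned().unwrap_or_default();

    // Optionally add the data of the lowercase variant.
    if add_lower {
        let lower = word.to_lowercase();
        if lower != word {
            tags.extend(dict.get(&lower).cloned().unwrap_or_default());
        }
    }

    // Only long, unknown words are candidates for the compound heuristic.
    let n_chars = word.chars().count();
    if tags.is_empty() && use_compound_split_heuristic && n_chars > COMPOUND_CUTOFF {
        // Drop one leading character at a time and look up what remains.
        for (taken, (start, _)) in word.char_indices().enumerate().skip(1) {
            if n_chars - taken < MIN_PART_LEN {
                break;
            }
            if let Some(found) = dict.get(&word[start..]) {
                tags = found.clone();
                break;
            }
        }
    }
    tags
}
```

Under these assumptions, a dictionary that knows only "ball" would still produce tags for "basketball" when the heuristic is enabled, and for "Ball" when `add_lower` is enabled.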
pub fn get_tags<'a>(&'a self, word: &'a str) -> TagIter<'a>
Trait Implementations
impl Clone for Tagger
impl Default for Tagger
impl<'de> Deserialize<'de> for Tagger
fn deserialize<__D>(__deserializer: __D) -> Result<Self, __D::Error> where
__D: Deserializer<'de>,
impl Serialize for Tagger
Auto Trait Implementations
impl !RefUnwindSafe for Tagger
impl Send for Tagger
impl Sync for Tagger
impl Unpin for Tagger
impl UnwindSafe for Tagger
Blanket Implementations
impl<T> Any for T where
T: 'static + ?Sized,
impl<T> Borrow<T> for T where
T: ?Sized,
impl<T> BorrowMut<T> for T where
T: ?Sized,
pub fn borrow_mut(&mut self) -> &mut T
impl<T> DeserializeOwned for T where
T: for<'de> Deserialize<'de>,
impl<T> From<T> for T
impl<T, U> Into<U> for T where
U: From<T>,
impl<T> Pointable for T
pub const ALIGN: usize
type Init = T
The type for initializers.
pub unsafe fn init(init: <T as Pointable>::Init) -> usize
pub unsafe fn deref<'a>(ptr: usize) -> &'a T
pub unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T
pub unsafe fn drop(ptr: usize)
impl<T> ToOwned for T where
T: Clone,
type Owned = T
The resulting type after obtaining ownership.
pub fn to_owned(&self) -> T
pub fn clone_into(&self, target: &mut T)
impl<T, U> TryFrom<U> for T where
U: Into<T>,
type Error = Infallible
The type returned in the event of a conversion error.
pub fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>
impl<T, U> TryInto<U> for T where
U: TryFrom<T>,