Module nlprule::tokenizer

A tokenizer to split raw text into tokens. Tokens are assigned lemmas and part-of-speech tags by lookup from a Tagger, and chunks containing information about noun / verb phrases and grammatical case by a statistical Chunker. Tokens are then disambiguated (i.e. information from the initial assignment is changed) in a rule-based way by DisambiguationRules.
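The pipeline above can be sketched with toy types. Note that `Token`, `split`, and `tag` here are invented for illustration and are not the crate's actual API; real tokenization, tagging, chunking and disambiguation are far more involved.

```rust
// Illustrative sketch of the split -> tag -> chunk -> disambiguate
// pipeline; all names here are invented, not the crate's real API.
#[derive(Debug, Clone)]
struct Token {
    text: String,
    lemma: String,
    pos: Vec<String>,
    chunk: Option<String>,
}

// Split raw text into bare tokens (real tokenization is more involved).
fn split(text: &str) -> Vec<Token> {
    text.split_whitespace()
        .map(|w| Token {
            text: w.to_string(),
            lemma: String::new(),
            pos: Vec::new(),
            chunk: None,
        })
        .collect()
}

// Stand-in for the tagger stage; a real tagger looks lemmas up
// in a dictionary instead of lowercasing.
fn tag(tokens: &mut [Token]) {
    for t in tokens.iter_mut() {
        t.lemma = t.text.to_lowercase();
    }
}

fn main() {
    let mut tokens = split("The dog barks");
    tag(&mut tokens);
    // The chunker and disambiguation rules would run here.
    assert_eq!(tokens.len(), 3);
    assert_eq!(tokens[0].lemma, "the");
}
```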

Modules

chunk

A Chunker ported from OpenNLP.

multiword

Checks if the input text contains multi-token phrases from a finite list (containing e.g. city names) and assigns lemmas and part-of-speech tags accordingly.
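A minimal sketch of such a lookup, assuming phrases are matched case-insensitively against a fixed list; the function name, phrase data, and tag are invented for illustration:

```rust
use std::collections::HashMap;

// Find occurrences of known multi-token phrases in a token sequence and
// return (start, end, tag) triples; phrase list and tags are invented.
fn find_multiword(
    tokens: &[&str],
    phrases: &HashMap<Vec<String>, String>,
) -> Vec<(usize, usize, String)> {
    let mut matches = Vec::new();
    for start in 0..tokens.len() {
        for (phrase, tag) in phrases {
            let end = start + phrase.len();
            if end <= tokens.len()
                && tokens[start..end]
                    .iter()
                    .map(|t| t.to_lowercase())
                    .eq(phrase.iter().cloned())
            {
                matches.push((start, end, tag.clone()));
            }
        }
    }
    matches
}

fn main() {
    let mut phrases = HashMap::new();
    phrases.insert(
        vec!["new".to_string(), "york".to_string()],
        "NNP".to_string(),
    );
    let tokens = ["I", "visited", "New", "York", "yesterday"];
    let matches = find_multiword(&tokens, &phrases);
    // "New York" matched at token positions 2..4.
    assert_eq!(matches, vec![(2, 4, "NNP".to_string())]);
}
```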

tag

A dictionary-based tagger. The raw format is tuples of the form (word, lemma, part-of-speech) where each word typically has multiple entries with different part-of-speech tags.
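One way such (word, lemma, part-of-speech) tuples could be loaded into a lookup table; this is a hedged sketch with invented data, not the crate's actual storage format (which is considerably more compact):

```rust
use std::collections::HashMap;

// Build a word -> [(lemma, pos)] map from raw dictionary tuples;
// a word with several part-of-speech readings gets several entries.
fn build_tagger(raw: &[(&str, &str, &str)]) -> HashMap<String, Vec<(String, String)>> {
    let mut map: HashMap<String, Vec<(String, String)>> = HashMap::new();
    for &(word, lemma, pos) in raw {
        map.entry(word.to_string())
            .or_default()
            .push((lemma.to_string(), pos.to_string()));
    }
    map
}

fn main() {
    // "barks" has two entries: plural noun and third-person verb.
    let raw = [("barks", "bark", "NNS"), ("barks", "bark", "VBZ")];
    let tagger = build_tagger(&raw);
    assert_eq!(tagger["barks"].len(), 2);
}
```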

Structs

Tokenizer

The complete Tokenizer doing tagging, chunking and disambiguation.

TokenizerOptions

Options for a tokenizer.

Functions

finalize

Finalizes the tokens, e.g. by adding a special UNKNOWN part-of-speech tag. After finalization, grammatical error correction rules can be run on the tokens.
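How finalization might look, assuming tokens that received no tag during tagging get an explicit UNKNOWN tag; the `Token` type and tag name here are illustrative:

```rust
#[derive(Debug)]
struct Token {
    text: String,
    pos: Vec<String>,
}

// Give untagged tokens an explicit UNKNOWN tag so that later grammar
// rules can match on "unknown word" directly.
fn finalize(tokens: &mut Vec<Token>) {
    for token in tokens.iter_mut() {
        if token.pos.is_empty() {
            token.pos.push("UNKNOWN".to_string());
        }
    }
}

fn main() {
    let mut tokens = vec![
        Token { text: "dog".into(), pos: vec!["NN".into()] },
        Token { text: "blargh".into(), pos: vec![] },
    ];
    finalize(&mut tokens);
    assert_eq!(tokens[1].pos, vec!["UNKNOWN".to_string()]);
}
```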