Struct caribon::Parser

pub struct Parser {
    // some fields omitted
}

A parser that loads a string, detects repetitions in it, and outputs an HTML file.


impl Parser

fn list_languages() -> Vec<&'static str>

Returns a vector containing all languages that are implemented.

These are the valid values to pass to Parser::new.

fn get_ignored_from_string(list: &str) -> Vec<String>

Returns a vector of ignored words from a string.


  • list – A space- or comma-separated string containing the words that should be ignored (i.e., repetitions of these words are not counted).


let v = caribon::Parser::get_ignored_from_string("some, words; to ignore");
assert_eq!(v.len(), 4);
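
The splitting behaviour shown in this example can be approximated with the standard library alone. The sketch below is not caribon's implementation: `split_ignored` is a hypothetical helper, and the exact separator set (whitespace, commas and semicolons) is an assumption inferred from the example above.

```rust
/// Hypothetical helper, not part of caribon: splits a list of ignored words
/// on whitespace, commas and semicolons (an assumed separator set),
/// dropping the empty fragments left between consecutive separators.
fn split_ignored(list: &str) -> Vec<String> {
    list.split(|c: char| c.is_whitespace() || c == ',' || c == ';')
        .filter(|w| !w.is_empty())
        .map(|w| w.to_string())
        .collect()
}
```

Under these assumptions, `split_ignored("some, words; to ignore")` yields the four words `some`, `words`, `to` and `ignore`.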

fn get_ignored_from_lang(lang: &str) -> Vec<String>

Returns a vector containing the default ignored words for this language.

fn new(lang: &str) -> Result<Parser>

Returns Ok(Parser) if the language is supported, and Err(Error) otherwise.


  • lang – The input text language. This is used to create the stemmer; it also determines which list of ignored words to use. If lang == "no_stemmer", stemming is disabled.


let result = caribon::Parser::new("english"); // Ok(Parser)
let result = caribon::Parser::new("incorrect language"); // Err(Error): unknown language
let result = caribon::Parser::new("no_stemmer"); // Ok(Parser), with stemming disabled

fn with_fuzzy(self, fuzzy: Option<f32>) -> Parser

Sets fuzzy string matching (default None)

If set to Some(x), the Parser compares strings using the Levenshtein distance instead of strict equality.


  • fuzzy – None to deactivate fuzzy matching, or Some(x) to activate it. x must be between 0.0 and 1.0, as it corresponds to a relative distance. For example, "Caribon" has a length of 7, so if fuzzy is set to Some(0.5), matching allows a maximal distance of 3 (actually 3.5, but the distance is an integer).


let mut parser = caribon::Parser::new("english").unwrap()
    .with_fuzzy(Some(0.5)); // enable fuzzy matching (0.5 is an illustrative value)
let mut ast = parser.tokenize("trust Rust").unwrap();
parser.detect_local(&mut ast, 1.9);
let result = parser.ast_to_markdown(&ast); // not the best output format, but easy to debug
assert_eq!(&result, "**trust** **Rust**"); // these two words do have some letters in common
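
The relative-distance rule described above can be sketched with a plain Levenshtein implementation. This is an illustration, not caribon's code: `levenshtein` and `fuzzy_eq` are hypothetical names, and measuring the relative distance against the longer of the two words is an assumption (the documentation only gives the "Caribon" example).

```rust
/// Classic dynamic-programming Levenshtein distance, computed over chars.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] is the distance between the processed prefix of `a`
    // and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for i in 1..=a.len() {
        let mut curr = vec![i; b.len() + 1];
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            curr[j] = (prev[j] + 1) // deletion
                .min(curr[j - 1] + 1) // insertion
                .min(prev[j - 1] + cost); // substitution or match
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Hypothetical fuzzy comparison: two words match when their distance does
/// not exceed `fuzzy` times the length of the longer word (assumption),
/// truncated to an integer (so 3.5 becomes 3).
fn fuzzy_eq(a: &str, b: &str, fuzzy: f32) -> bool {
    let len = a.chars().count().max(b.chars().count());
    let max_dist = (fuzzy * len as f32) as usize;
    levenshtein(a, b) <= max_dist
}
```

With this sketch, `fuzzy_eq("trust", "rust", 0.5)` holds because the distance is 1 and the allowed maximum is 2.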

fn with_max_distance(self, max_dist: u32) -> Parser

Sets max distance for repetitions (default 50).


  • max_dist – A distance expressed as a number of words. If two occurrences of the same word are separated by more than this distance, they are not counted as a repetition.


// Default max distance (50): the two occurrences are close enough.
let mut parser = caribon::Parser::new("english").unwrap();
let mut ast = parser.tokenize("This word is repeated in a few words").unwrap();
parser.detect_local(&mut ast, 1.9);
let result = parser.ast_to_markdown(&ast); // not the best output format, but easy to debug
assert_eq!(&result, "This **word** is repeated in a few **words**"); // repetition detected

// Very low max distance (2 is an illustrative value).
let mut parser = caribon::Parser::new("english").unwrap()
    .with_max_distance(2);
let mut ast = parser.tokenize("This word is repeated in a few words").unwrap();
parser.detect_local(&mut ast, 1.9);
let result = parser.ast_to_markdown(&ast); // not the best output format, but easy to debug
assert_eq!(&result, "This word is repeated in a few words"); // repetition not detected because of
                                                             // excessively low max_distance
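
To make the role of the maximum distance concrete, here is a minimal standalone sketch of distance-limited repetition counting. It is an assumption-laden simplification, not caribon's algorithm: `detect_local_sketch` and `flag_run` are hypothetical names, words are compared by lowercase equality (no stemming, no ignored list), and a run of occurrences is flagged when its length exceeds the threshold.

```rust
use std::collections::HashMap;

/// Flags every position of a run that is long enough to count as a repetition.
fn flag_run(run: &[usize], threshold: f32, flagged: &mut [bool]) {
    if run.len() as f32 > threshold {
        for &pos in run {
            flagged[pos] = true;
        }
    }
}

/// Hypothetical sketch: returns, for each word, whether it belongs to a run of
/// occurrences whose consecutive members are at most `max_dist` words apart.
fn detect_local_sketch(words: &[&str], max_dist: usize, threshold: f32) -> Vec<bool> {
    let mut flagged = vec![false; words.len()];
    // For each word, the positions of its current run of nearby occurrences.
    let mut runs: HashMap<String, Vec<usize>> = HashMap::new();
    for (pos, word) in words.iter().enumerate() {
        let run = runs.entry(word.to_lowercase()).or_default();
        if let Some(&last) = run.last() {
            // Too far from the previous occurrence: close the run, start a new one.
            if pos - last > max_dist {
                flag_run(run.as_slice(), threshold, &mut flagged);
                run.clear();
            }
        }
        run.push(pos);
    }
    // Close the runs that are still open at the end of the text.
    for run in runs.values() {
        flag_run(run, threshold, &mut flagged);
    }
    flagged
}
```

In this sketch, two occurrences six words apart are flagged with a maximum distance of 50 and left alone with a maximum distance of 2.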

fn with_html(self, html: bool) -> Parser

Sets HTML detection in input (default true).

Set it to false if the input is plain text, and to true if it contains HTML.

fn with_ignore_proper(self, proper: bool) -> Parser

Sets whether repetition detection should ignore proper nouns (default false).

Basically, if set to true, words that start with a capital letter and are not at the beginning of a sentence are not counted as repetitions. Currently, they are still counted if they are at the beginning of a sentence, but with most texts this is not enough to highlight them as repetitions.
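
The heuristic described here can be illustrated as follows. This is only a sketch of the idea, not caribon's code: `proper_noun_flags` is a hypothetical function, and treating a word ending in '.', '!' or '?' as the end of a sentence is an assumption.

```rust
/// Hypothetical sketch: marks words that look like proper nouns, i.e. words
/// that start with an uppercase letter and do not begin a sentence.
fn proper_noun_flags(words: &[&str]) -> Vec<bool> {
    let mut flags = Vec::with_capacity(words.len());
    let mut sentence_start = true; // the first word always starts a sentence
    for word in words {
        let capitalized = word.chars().next().map_or(false, |c| c.is_uppercase());
        flags.push(capitalized && !sentence_start);
        // Assumption: terminal punctuation on this word ends the sentence,
        // so the next word starts a new one.
        sentence_start = word.ends_with(|c: char| matches!(c, '.' | '!' | '?'));
    }
    flags
}
```

For the words "Paris is nice. Paris and London are cities.", only "London" is marked: both occurrences of "Paris" begin a sentence and are therefore not treated as proper nouns by this rule.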

fn with_ignored(self, list: &str) -> Parser

Sets the ignored list with a list of words contained in the argument string.

This method replaces the default list of ignored words. If you want to add ignored words to the default list of a language, use with_more_ignored instead.


  • list – A comma or whitespace separated list of words that should be ignored.

fn with_more_ignored(self, list: &str) -> Parser

Appends the words contained in the argument string to the list of ignored words.


  • list – A comma or whitespace separated list of words that should be ignored.

fn tokenize(&mut self, s: &str) -> Result<Ast>

Tokenize a string into a list of words.

This is the step that converts a string to some inner representation.


  • s – The string to tokenize.

fn detect_local(&self, ast: &mut Ast, threshold: f32)

Detect the local number of repetitions.

For each word, the repetition value is set to the number of occurrences of this word counted since two consecutive occurrences were last separated by at least self.max_distance.

It is the default algorithm, and probably the one you want to use.


  • ast – A mutable reference to the internal data structure returned by tokenize.
  • threshold – The threshold for considering a word a repetition (e.g. 1.9).


let mut parser = caribon::Parser::new("english").unwrap();
let mut ast = parser.tokenize("Testing whether this repetition detector works or does not work").unwrap();
parser.detect_local(&mut ast, 1.9);
let result = parser.ast_to_markdown(&ast); // not the most useful output format, but the easiest to debug
assert_eq!(&result, "Testing whether this repetition detector **works** or does not **work**");

fn words_stats(&self, ast: &Ast) -> (HashMap<String, f32>, u32)

Returns statistics about the words.


  • ast – A reference to a list of words.


This method returns a tuple:

  • the first element is a hashmap mapping each stemmed string to the number of occurrences of this word;
  • the second element is the total number of (valid) words in the list (not counting whitespace, HTML tags...).
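
The shape of the returned tuple can be illustrated with a standalone sketch. It does not use caribon's Ast: `words_stats_sketch` is a hypothetical function over plain string slices, lowercasing stands in for stemming, and a token is assumed to be a valid word when it contains at least one alphanumeric character.

```rust
use std::collections::HashMap;

/// Hypothetical sketch: counts occurrences per (lowercased) word and the
/// total number of valid words, skipping whitespace and punctuation tokens.
fn words_stats_sketch(words: &[&str]) -> (HashMap<String, f32>, u32) {
    let mut counts: HashMap<String, f32> = HashMap::new();
    let mut total = 0u32;
    for word in words {
        // Assumption: a token without any letter or digit is not a word.
        if !word.chars().any(|c| c.is_alphanumeric()) {
            continue;
        }
        *counts.entry(word.to_lowercase()).or_insert(0.0) += 1.0;
        total += 1;
    }
    (counts, total)
}
```

For the tokens `["The", " ", "cat", ",", "the", "dog"]`, this returns a count of 2 for "the" and a total of 4 valid words.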

fn detect_global(&self, ast: &mut Ast, threshold: f32)

Detect the global number of repetitions.

For each word, the repetition value is set to the total number of occurrences of this word in the whole text, divided by the total number of words in the text.


  • ast – A mutable reference to the Ast (a vector of Word) returned by tokenize.
  • threshold – A threshold to highlight repetitions (e.g. 0.01)
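
The frequency rule above (occurrences divided by the total number of words) can be sketched as follows. This is a simplified illustration, not caribon's implementation: `detect_global_sketch` is a hypothetical function, words are compared by lowercase equality rather than by stem, and a word is flagged when its frequency is strictly greater than the threshold (an assumption).

```rust
use std::collections::HashMap;

/// Hypothetical sketch: flags each word whose relative frequency in the
/// whole text exceeds `threshold`.
fn detect_global_sketch(words: &[&str], threshold: f32) -> Vec<bool> {
    // First pass: count the occurrences of each (lowercased) word.
    let mut counts: HashMap<String, f32> = HashMap::new();
    for word in words {
        *counts.entry(word.to_lowercase()).or_insert(0.0) += 1.0;
    }
    // Second pass: compare each word's frequency to the threshold.
    let total = words.len() as f32;
    words
        .iter()
        .map(|word| counts[&word.to_lowercase()] / total > threshold)
        .collect()
}
```

With the words `["a", "b", "a", "c"]` and a threshold of 0.3, only the two occurrences of "a" (frequency 0.5) are flagged.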

fn ast_to_terminal(&self, ast: &Ast) -> String

Display the words to terminal, highlighting the repetitions.

Uses terminal colour codes to highlight the repetitions.


  • ast – A reference to Ast, returned by tokenize and modified by detect_*

fn ast_to_markdown(&self, ast: &Ast) -> String

Display the Ast to markdown, emphasizing the repetitions.

This is more limited than HTML or even terminal output, as it completely discards the colour information set by the detect_* methods, but it might be useful if, for example, you want to generate other files later with Pandoc (or any other program).


  • ast – An Ast containing repetitions.
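
The emphasis format used in the examples on this page (`**word**`) can be reproduced with a few lines of standard Rust. The sketch below is independent of caribon's Ast: `to_markdown_sketch` is a hypothetical function taking the words and a parallel slice of flags, and it joins words with single spaces rather than preserving the original whitespace.

```rust
/// Hypothetical sketch: wraps each flagged word in `**` and joins the words
/// with single spaces.
fn to_markdown_sketch(words: &[&str], flagged: &[bool]) -> String {
    words
        .iter()
        .zip(flagged)
        .map(|(word, &is_repetition)| {
            if is_repetition {
                format!("**{}**", word)
            } else {
                word.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```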

fn ast_to_html(&self, ast: &mut Ast, standalone: bool) -> String

Display the Ast to HTML, highlighting the repetitions.

Uses some basic CSS/JS to underline repetitions and to highlight the other occurrences of the word under the mouse.


  • ast – An Ast containing repetitions.
  • standalone – If true, generate a standalone HTML file; otherwise, generate just an HTML fragment.