notmecab 0.1.0

Library for tokenizing text with mecab dictionaries. Not a mecab wrapper.
notmecab-rs is a very basic mecab clone, designed only to do parsing, not training.

This is meant to be used as a library by other tools such as frequency analyzers, not directly by people.
It also only works with UTF-8 dictionaries. (Stop using encodings other than UTF-8 for infrastructural software.)
Support for unk.dic is currently unimplemented, so in rare situations, the parse might be different from mecab.
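
As a rough illustration of the frequency-analyzer use case, the sketch below counts tokens in a separator-delimited string, such as the one tokenstream_to_string produces in the example further down. The helper name `count_tokens` is hypothetical and not part of notmecab; it only uses the standard library.

```rust
use std::collections::HashMap;

/// Counts how often each token occurs in a separator-delimited string.
/// Hypothetical helper for illustration; it is not part of notmecab.
fn count_tokens(split_up: &str, separator: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    // Skip empty pieces so leading or doubled separators do not count as tokens.
    for token in split_up.split(separator).filter(|t| !t.is_empty()) {
        *counts.entry(token.to_string()).or_insert(0) += 1;
    }
    counts
}
```

For instance, `count_tokens("これ|を|持っ|て|いけ|これ", "|")` maps "これ" to 2 and every other token to 1. A real analyzer would more likely key on fields of the feature string, whose layout depends on the dictionary.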

Licensed under the Apache License, Version 2.0.


Get unidic's sys.dic and matrix.bin and put them under a new folder next to src/ called data/. Then invoke tests from the repository root.
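
Because the tests open the dictionary files with unwrap(), a missing file shows up as a panic. The sketch below is one way to check the layout described above before running them. The function name `missing_dictionary_files` is hypothetical and not part of notmecab; the file names are the ones given above.

```rust
use std::path::{Path, PathBuf};

/// Returns the dictionary files that are missing from `data_dir`.
/// Hypothetical helper for illustration; it is not part of notmecab.
fn missing_dictionary_files(data_dir: &Path) -> Vec<PathBuf> {
    ["sys.dic", "matrix.bin"]
        .iter()
        .map(|name| data_dir.join(name))
        // Keep only the paths that do not point at an existing regular file.
        .filter(|path| !path.is_file())
        .collect()
}
```

Calling `missing_dictionary_files(Path::new("data"))` from the repository root returns an empty vector when both files are in place.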

Example (from tests):

    use std::fs::File;
    use std::io::BufReader;
    // Dict, parse and tokenstream_to_string come from this crate.

    let sysdic_raw = File::open("data/sys.dic").unwrap(); // you need to acquire a mecab dictionary and place its sys.dic file here manually
    let mut sysdic = BufReader::new(sysdic_raw);
    let matrix_raw = File::open("data/matrix.bin").unwrap(); // you need to acquire a mecab dictionary and place its matrix.bin file here manually
    let mut matrix = BufReader::new(matrix_raw);
    let dict = Dict::load(&mut sysdic, &mut matrix).unwrap();
    let result = parse(&dict, &"これを持っていけ".to_string());
    if let Some(result) = result {
        // print the feature string of each token
        for token in &result.0 {
            println!("{}", token.feature);
        }
        let split_up_string = tokenstream_to_string(&result.0, "|");
        println!("{}", split_up_string);
        assert_eq!(split_up_string, "これ|を|持っ|て|いけ"); // this test might fail if you're not testing with unidic (i.e. the parse might be different)
    }

Output of example:


You can also call parse_to_lexertoken, which does less string allocation, but you don't get the feature string as a string, and you need to feed it chars, not a string.
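
Feeding chars rather than a string means the text has to be split into characters first. The sketch below shows only that preparation step using the standard library. The exact parameter type of parse_to_lexertoken is not shown in this README, so the `to_chars` helper is a hypothetical illustration and not the crate's API.

```rust
/// Collects the characters of `text` into a vector, so that chars
/// (rather than a string) can be handed to a char-based API.
/// Hypothetical helper for illustration; it is not part of notmecab.
fn to_chars(text: &str) -> Vec<char> {
    text.chars().collect()
}
```

Positions in the resulting vector count characters, not UTF-8 bytes: the example sentence "これを持っていけ" is 8 chars long but 24 bytes long as a string.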

NOTE: This software is unusably slow if optimizations are disabled, so build and test it in release mode (e.g. cargo test --release).


Todo:

- implement unk.dic and its right/left context IDs