charabia 0.8.6
# Charabia
Library used by Meilisearch to tokenize queries and documents

## Role

The tokenizer's role is to take a sentence or phrase and split it into smaller units of language, called tokens. It finds and retrieves all the words in a string based on the language's particularities.

## Details

Charabia provides a simple API to segment, normalize, or tokenize (segment + normalize) text in a specific language by detecting its Script/Language and choosing the specialized pipeline for it.
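A minimal sketch of that relationship, using the crate's `Segment` and `Tokenize` traits (see the examples section): segmentation alone keeps the original form, while tokenization also normalizes it.

```rust
use charabia::{Segment, Tokenize};

let orig = "Thé quick fox";

// Segmentation splits the text but keeps the original form.
assert_eq!(orig.segment_str().next(), Some("Thé"));

// Tokenization segments *and* normalizes: `Thé` becomes `the`.
assert_eq!(orig.tokenize().next().unwrap().lemma(), "the");
```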

## Supported languages

**Charabia is multilingual**, featuring optimized support for:


|  Script / Language  | specialized segmentation | specialized normalization | Segmentation Performance level | Tokenization Performance level |
|---------------------|--------------------------|---------------------------|--------------------------------|--------------------------------|
| **Latin** | ✅ CamelCase segmentation | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟩 ~23MiB/sec | 🟨 ~9MiB/sec |
| **Greek** | ❌ | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase + final sigma normalization | 🟩 ~27MiB/sec | 🟨 ~8MiB/sec |
| **Cyrillic** - **Georgian** | ❌ | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase | 🟩 ~27MiB/sec | 🟨 ~9MiB/sec |
| **Chinese** **CMN** 🇨🇳 | ✅ [jieba](https://github.com/messense/jieba-rs) | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + pinyin conversion | 🟨 ~10MiB/sec | 🟧 ~5MiB/sec |
| **Hebrew** 🇮🇱 | ❌ | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟩 ~33MiB/sec | 🟨 ~11MiB/sec |
| **Arabic** | ✅ `ال` segmentation | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal + Tatweel, Alef, Yeh, and Taa Marbuta normalization | 🟩 ~36MiB/sec | 🟨 ~11MiB/sec |
| **Japanese** 🇯🇵 | ✅ [lindera](https://github.com/lindera-morphology/lindera) IPA-dict | ❌ [compatibility decomposition](https://unicode.org/reports/tr15/) | 🟧 ~3MiB/sec | 🟧 ~3MiB/sec |
| **Korean** 🇰🇷 | ✅ [lindera](https://github.com/lindera-morphology/lindera) KO-dict | ❌ [compatibility decomposition](https://unicode.org/reports/tr15/) | 🟥 ~2MiB/sec | 🟥 ~2MiB/sec |
| **Thai** 🇹🇭 | ✅ [dictionary based](https://github.com/PyThaiNLP/nlpo3) | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟩 ~22MiB/sec | 🟨 ~11MiB/sec |
| **Khmer** 🇰🇭 | ✅ dictionary based | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) | 🟧 ~7MiB/sec | 🟧 ~5MiB/sec |

We aim to provide global language support, and your feedback helps us [move closer to that goal](https://docs.meilisearch.com/learn/advanced/language.html#improving-our-language-support). If you notice inconsistencies in your search results or the way your documents are processed, please open an issue on our [GitHub repository](https://github.com/meilisearch/charabia/issues/new/choose).

If you have a particular need that charabia does not support, please share it in the product repository by creating a [dedicated discussion](https://github.com/meilisearch/product/discussions?discussions_q=label%3Aproduct%3Acore%3Atokenizer).

### About Performance level

Performance levels are based on the throughput (MiB/sec) of the tokenizer (computed on a [scaleway Elastic Metal server EM-A410X-SSD](https://www.scaleway.com/en/pricing/) - CPU: Intel Xeon E5 1650 - RAM: 64 GB) using jemalloc:
- 0️⃣⬛️:  0  ->  1  MiB/sec
- 1️⃣🟥:  1  ->  3  MiB/sec
- 2️⃣🟧:  3  ->  8  MiB/sec
- 3️⃣🟨:  8  -> 20  MiB/sec
- 4️⃣🟩: 20  -> 50  MiB/sec
- 5️⃣🟪: 50 MiB/sec or more
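For illustration, the bucketing above can be written as a tiny standalone helper (hypothetical, not part of Charabia's API):

```rust
// Hypothetical helper mapping a measured throughput (MiB/sec) to the
// performance level used in the table above; not part of charabia's API.
fn performance_level(mib_per_sec: f64) -> &'static str {
    match mib_per_sec {
        t if t < 1.0 => "0️⃣⬛️",
        t if t < 3.0 => "1️⃣🟥",
        t if t < 8.0 => "2️⃣🟧",
        t if t < 20.0 => "3️⃣🟨",
        t if t < 50.0 => "4️⃣🟩",
        _ => "5️⃣🟪",
    }
}

fn main() {
    // Latin segmentation runs at ~23 MiB/sec, i.e. level 4.
    assert_eq!(performance_level(23.0), "4️⃣🟩");
    // Korean tokenization runs at ~2 MiB/sec, i.e. level 1.
    assert_eq!(performance_level(2.0), "1️⃣🟥");
}
```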

## Examples

#### Tokenization

```rust
use charabia::Tokenize;

let orig = "Thé quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";

// tokenize the text.
let mut tokens = orig.tokenize();

let token = tokens.next().unwrap();
// the lemma of the token is normalized: `Thé` became `the`.
assert_eq!(token.lemma(), "the");
// the token is classified as a word
assert!(token.is_word());

let token = tokens.next().unwrap();
assert_eq!(token.lemma(), " ");
// the token is classified as a separator
assert!(token.is_separator());
```
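Building on the example above, a common pattern is to keep only the word tokens and drop the separators; a small sketch using the same `Tokenize` API:

```rust
use charabia::Tokenize;

let orig = "The quick brown fox";

// Keep the normalized lemma of every token classified as a word,
// skipping separator tokens.
let words: Vec<String> = orig
    .tokenize()
    .filter(|token| token.is_word())
    .map(|token| token.lemma().to_string())
    .collect();

assert_eq!(words, ["the", "quick", "brown", "fox"]);
```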

#### Segmentation

```rust
use charabia::Segment;

let orig = "The quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";

// segment the text.
let mut segments = orig.segment_str();

assert_eq!(segments.next(), Some("The"));
assert_eq!(segments.next(), Some(" "));
assert_eq!(segments.next(), Some("quick"));
```