# Charabia
Library used by Meilisearch to tokenize queries and documents
## Role
The tokenizer's role is to take a sentence or phrase and split it into smaller units of language, called tokens. It finds and retrieves all the words in a string based on the language's particularities.
## Details
Charabia provides a simple API to segment, normalize, or tokenize (segment + normalize) text in a specific language: it detects the text's Script/Language and picks the specialized pipeline for it.
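For reuse across many documents, charabia also exposes a `TokenizerBuilder`. The snippet below is a minimal sketch of configuring one with a stop-word list; it assumes the `fst` crate for the set, and the stop words chosen (`"a"`, `"the"`) are purely illustrative:

```rust
use charabia::TokenizerBuilder;
use fst::Set;

// Build a set of stop words (illustrative values; `fst::Set` requires
// its keys to be inserted in sorted order).
let stop_words: Set<Vec<u8>> = Set::from_iter(["a", "the"].iter()).unwrap();

// Configure the builder, then build a reusable tokenizer.
let mut builder = TokenizerBuilder::new();
builder.stop_words(&stop_words);
let tokenizer = builder.build();

// The tokenizer detects the Script/Language of the input and routes it
// through the matching specialized pipeline.
for token in tokenizer.tokenize("The quick brown fox") {
    println!("{:?}", token.lemma());
}
```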
## Supported languages
**Charabia is multilingual**, featuring optimized support for:
| Script / Language | Specialized segmentation | Specialized normalization | Segmentation performance level | Tokenization performance level |
|---|---|---|---|---|
| **Latin** | ✅ CamelCase segmentation | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟩 ~23MiB/sec | 🟨 ~9MiB/sec |
| **Greek** | ❌ | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase + final sigma normalization | 🟩 ~27MiB/sec | 🟨 ~8MiB/sec |
| **Cyrillic** - **Georgian** | ❌ | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase | 🟩 ~27MiB/sec | 🟨 ~9MiB/sec |
| **Chinese** **CMN** 🇨🇳 | ✅ [jieba](https://github.com/messense/jieba-rs) | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + pinyin conversion | 🟨 ~10MiB/sec | 🟧 ~5MiB/sec |
| **Hebrew** 🇮🇱 | ❌ | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟩 ~33MiB/sec | 🟨 ~11MiB/sec |
| **Arabic** | ✅ `ال` segmentation | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal + [Tatweel, Alef, Yeh, and Taa Marbuta normalization] | 🟩 ~36MiB/sec | 🟨 ~11MiB/sec |
| **Japanese** 🇯🇵 | ✅ [lindera](https://github.com/lindera-morphology/lindera) IPA-dict | ❌ [compatibility decomposition](https://unicode.org/reports/tr15/) | 🟧 ~3MiB/sec | 🟧 ~3MiB/sec |
| **Korean** 🇰🇷 | ✅ [lindera](https://github.com/lindera-morphology/lindera) KO-dict | ❌ [compatibility decomposition](https://unicode.org/reports/tr15/) | 🟥 ~2MiB/sec | 🟥 ~2MiB/sec |
| **Thai** 🇹🇭 | ✅ [dictionary based](https://github.com/PyThaiNLP/nlpo3) | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟩 ~22MiB/sec | 🟨 ~11MiB/sec |
| **Khmer** 🇰🇭 | ✅ dictionary based | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) | 🟧 ~7MiB/sec | 🟧 ~5MiB/sec |

We aim to provide global language support, and your feedback helps us [move closer to that goal](https://docs.meilisearch.com/learn/advanced/language.html#improving-our-language-support). If you notice inconsistencies in your search results or the way your documents are processed, please open an issue on our [GitHub repository](https://github.com/meilisearch/charabia/issues/new/choose).
If you have a particular need that charabia does not support, please share it in the product repository by creating a [dedicated discussion](https://github.com/meilisearch/product/discussions?discussions_q=label%3Aproduct%3Acore%3Atokenizer).
### About performance levels
Performance levels are based on the throughput (MiB/sec) of the tokenizer, computed on a [Scaleway Elastic Metal EM-A410X-SSD server](https://www.scaleway.com/en/pricing/) (CPU: Intel Xeon E5 1650, RAM: 64 GB) using jemalloc:
- 0️⃣⬛️: 0 -> 1 MiB/sec
- 1️⃣🟥: 1 -> 3 MiB/sec
- 2️⃣🟧: 3 -> 8 MiB/sec
- 3️⃣🟨: 8 -> 20 MiB/sec
- 4️⃣🟩: 20 -> 50 MiB/sec
- 5️⃣🟪: 50 MiB/sec or more
## Examples
### Tokenization
```rust
use charabia::Tokenize;
let orig = "Thé quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";
// tokenize the text.
let mut tokens = orig.tokenize();
let token = tokens.next().unwrap();
// the lemma of the token is normalized: `Thé` became `the`.
assert_eq!(token.lemma(), "the");
// the token is classified as a word
assert!(token.is_word());
let token = tokens.next().unwrap();
assert_eq!(token.lemma(), " ");
// the token is classified as a separator
assert!(token.is_separator());
```
### Segmentation
```rust
use charabia::Segment;
let orig = "The quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";
// segment the text.
let mut segments = orig.segment_str();
assert_eq!(segments.next(), Some("The"));
assert_eq!(segments.next(), Some(" "));
assert_eq!(segments.next(), Some("quick"));
```
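Building on the `Tokenize` trait shown above, a common follow-up is to keep only the normalized word lemmas, for example to feed them into an index. This short sketch is illustrative rather than one of charabia's documented examples:

```rust
use charabia::Tokenize;

// Keep only word tokens, dropping separators, and copy out their
// normalized lemmas.
let words: Vec<String> = "Thé quick fox"
    .tokenize()
    .filter(|token| token.is_word())
    .map(|token| token.lemma().to_string())
    .collect();

// Normalization lowercased `Thé` into `the`.
assert_eq!(words, ["the", "quick", "fox"]);
```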