ragit-korean
Ragit-korean is a very simple Korean tokenizer.
Ragit used to use charabia to tokenize CJK documents, but charabia has too many issues:
- Charabia bundles CJK dictionaries in the binary, which makes the binary 70 MiB larger.
- It silently converts precomposed Korean syllables (완성형) into decomposed jamo sequences (조합형), which in turn silently breaks TF-IDF searches.
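
To illustrate the second issue, here is a minimal sketch of what decomposing precomposed syllables does to a string. It is not charabia's code. It assumes the conversion is equivalent to the standard Hangul syllable decomposition arithmetic from the Unicode Standard, and the function name `decompose_hangul` is hypothetical. The point is that a term stored in one form no longer compares equal to the same term in the other form, which is one plausible way exact term matching in a TF-IDF index goes wrong.

```rust
// Hangul syllable decomposition constants from the Unicode Standard.
const S_BASE: u32 = 0xAC00; // first precomposed syllable, '가'
const L_BASE: u32 = 0x1100; // first leading consonant jamo
const V_BASE: u32 = 0x1161; // first vowel jamo
const T_BASE: u32 = 0x11A7; // one before the first trailing consonant jamo
const L_COUNT: u32 = 19;
const V_COUNT: u32 = 21;
const T_COUNT: u32 = 28;
const N_COUNT: u32 = V_COUNT * T_COUNT; // 588
const S_COUNT: u32 = L_COUNT * N_COUNT; // 11172

/// Hypothetical helper: replaces every precomposed Hangul syllable in `s`
/// with its leading consonant, vowel, and optional trailing consonant jamo.
/// All other characters are copied unchanged.
fn decompose_hangul(s: &str) -> String {
    let mut out = String::new();
    for c in s.chars() {
        let code = c as u32;
        if (S_BASE..S_BASE + S_COUNT).contains(&code) {
            let s_index = code - S_BASE;
            let l = L_BASE + s_index / N_COUNT;
            let v = V_BASE + (s_index % N_COUNT) / T_COUNT;
            let t = T_BASE + s_index % T_COUNT;
            // These code points are always valid scalar values, so the
            // conversions below never yield None.
            out.extend(char::from_u32(l));
            out.extend(char::from_u32(v));
            if t != T_BASE {
                // A syllable without a trailing consonant maps to T_BASE itself.
                out.extend(char::from_u32(t));
            }
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    let precomposed = "한글";
    let decomposed = decompose_hangul(precomposed);

    // Both strings render the same, but they are different character sequences.
    println!("precomposed: {} chars", precomposed.chars().count());
    println!("decomposed:  {} chars", decomposed.chars().count());
    println!("equal: {}", precomposed == decomposed);
}
```

With this sketch, `"한글"` goes from 2 characters to 6, and a plain string comparison between the two forms fails. A tokenizer that leaves the input form untouched avoids that mismatch between indexed terms and query terms.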