A library for the handling and analysis of Japanese text, particularly Kanji. It can be used to find the density of Kanji in given texts according to their Level classification, as defined by the Japan Kanji Aptitude Testing Foundation (日本漢字能力検定協会).
The Kanji data presented here matches the Foundation’s official February 2020 charts. Note that some Kanji had their levels changed (pdf) as of 2020.
§Usage
The two main types are Character and Kanji.
§Reading Japanese Text
To reinterpret every char of the input as a Character we can reason about:
use std::fs;
use kanji::Character;

let cs: Option<Vec<Character>> = fs::read_to_string("your-text.txt")
    .map(|content| content.chars().map(Character::new).collect())
    .ok();
But maybe we’re just interested in the Kanji:
use std::fs;
use kanji::Kanji;

let ks: Option<Vec<Kanji>> = fs::read_to_string("your-text.txt")
    .map(|content| content.chars().filter_map(Kanji::new).collect())
    .ok();
Alongside normal pattern matching, the Character::kanji method can also help us extract Kanji values.
§Filtering
In general, when we want to reduce a text to a single Character subtype, we can filter:
let orig = "そこで犬が寝ている";
let ks: String = orig.chars().filter(|c| kanji::is_kanji(*c)).collect();
assert_eq!("犬寝", ks);
let hs: String = orig.chars().filter(|c| kanji::is_hiragana(*c)).collect();
assert_eq!("そこでがている", hs);
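The same filtering idea can be sketched with the standard library alone, using the code point ranges quoted in the function descriptions on this page (U+4E00 to U+9FFC for Kanji, あ to ゟ for Hiragana). The helper names in_kanji_range, in_hiragana_range and keep are hypothetical and are not part of this crate; the crate's own is_kanji and is_hiragana may treat edge cases differently.

```rust
/// Hypothetical helper (not part of the crate): is `c` inside the range
/// quoted on this page for Kanji, U+4E00 to U+9FFC?
fn in_kanji_range(c: char) -> bool {
    ('\u{4e00}'..='\u{9ffc}').contains(&c)
}

/// Hypothetical helper (not part of the crate): is `c` between あ and ゟ?
fn in_hiragana_range(c: char) -> bool {
    ('あ'..='ゟ').contains(&c)
}

/// Keep only the chars of `text` that satisfy `pred`, preserving their order.
fn keep(text: &str, pred: fn(char) -> bool) -> String {
    text.chars().filter(|c| pred(*c)).collect()
}
```

Calling keep("そこで犬が寝ている", in_kanji_range) reduces the sentence to its Kanji, mirroring the example above.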
§Level Analysis
To find out how many Kanji of each exam level belong to some text:
let level_table = kanji::level_table();
let texts = vec![
    "非常に面白い文章",
    "誰でも読んだ事のある名作",
    "飛行機で空を飛ぶ",
];

for t in texts {
    let counts = kanji::kanji_counts(t, &level_table);
    println!("{:#?}", counts);
}
And if you want to know what the Kanji were from a particular level:
let level_table = kanji::level_table();
let text = "日常生活では、鮫に遭う事は基本的にない。";
let ks: String = text
    .chars()
    // Filter out all chars that aren't Kanji.
    .filter_map(kanji::Kanji::new)
    // Preserve only those that appear in Level 10.
    .filter_map(|k| match level_table.get(&k) {
        Some(kanji::Level::Ten) => Some(k.get()),
        _ => None,
    })
    // Fold them all back into a String.
    .collect();
assert_eq!("日生本", ks);
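To make the lookup-and-count idea in this section concrete, here is a minimal standard-library sketch. It does not use this crate's types: tiny_level_table and count_by_level are hypothetical names, levels are plain u8 values rather than the Level enum, and the table holds only the three Level 10 Kanji from the example above. How the crate's own kanji_counts is implemented is not shown on this page, so treat this as an illustration only.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical stand-in for a level lookup table. It lists only the three
/// Level 10 Kanji named in the example above, not the full exam data.
fn tiny_level_table() -> HashMap<char, u8> {
    [('日', 10u8), ('生', 10u8), ('本', 10u8)]
        .iter()
        .copied()
        .collect()
}

/// Count how many chars of `text` fall into each level of `table`.
/// Chars that are missing from the table are ignored.
fn count_by_level(text: &str, table: &HashMap<char, u8>) -> BTreeMap<u8, usize> {
    let mut counts = BTreeMap::new();
    for level in text.chars().filter_map(|c| table.get(&c)) {
        *counts.entry(*level).or_insert(0) += 1;
    }
    counts
}
```

With this reduced table, the sentence from the example above yields a single entry for level 10, because every other character is absent from the table.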
§Notes on Unicode
All Japanese characters, Kanji or otherwise, are a single Unicode Scalar Value, and thus can be safely represented by a single internal char.
Further, the ordering of Kanji in the official Foundation lists is in no way related to their ordering in Unicode, since in Unicode, Kanji are grouped by radical. So:
use kanji::exam_lists;
let same_as_uni = exam_lists::LEVEL_10.chars().max() < exam_lists::LEVEL_09.chars().min();
assert!(!same_as_uni);
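Both points can be checked with the standard library alone. The sketch below uses two hypothetical helpers, scalar_count and sorted_by_code_point, which are not part of this crate. It shows that each Kanji is one char even though it takes several bytes in UTF-8, and that sorting chars orders them by code point, which need not match the order in which they were listed.

```rust
/// Number of Unicode Scalar Values (chars) in `text`.
fn scalar_count(text: &str) -> usize {
    text.chars().count()
}

/// The chars of `text`, sorted by Unicode code point.
fn sorted_by_code_point(text: &str) -> String {
    let mut cs: Vec<char> = text.chars().collect();
    cs.sort_unstable();
    cs.into_iter().collect()
}
```

For example, "犬寝" holds two chars but six bytes, and sorting "日生本" by code point moves 本 ahead of 生.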
§Features
serde: Enable serde trait implementations.
§Resources
Modules§
- exam_lists - A complete list of all Kanji in every level of the exam.
Structs§
- ASCII - A standard ASCII character.
- AlphaNum - Japanese full-width alphanumeric characters and a few punctuation symbols.
- Hiragana - A Hiragana character, from あ to ん.
- Kanji - A single symbol of Kanji, also known as a CJK Unified Ideograph.
- Katakana - A Katakana character, from ア to ン.
- Punctuation - Japanese symbols and punctuation.
Enums§
- Character - General categories for characters, at least as is useful for thinking about Japanese.
- Level - A level or “kyuu” (級) of Japanese Kanji ranking.
Functions§
- all_kanji - All possible Kanji characters, as well as non-character radicals, in a heap-allocated UTF-8 String.
- is_alphanum - Does a given char belong to the set of Japanese alphanumeric characters and western punctuation?
- is_hiragana - Is a given char between あ and ゟ?
- is_japanese_punct - Does a given char belong to the set of Japanese symbols and punctuation?
- is_kanji - Kanji appear in the Unicode range 4e00 to 9ffc. The final Japanese Kanji is 9fef (鿯).
- is_katakana - Is a given char between ゠ and ヿ?
- kanji_counts - Determine how many Kanji of each exam level appear in some text, given a lookup table.
- level_table - Using the data stored in the LEVEL_* constants, generate a lookup table for Kanji levels.