Expand description
A library for the handling and analysis of Japanese text, particularly Kanji. It can be used to find the density of Kanji in given texts according to their Level classification, as defined by the Japan Kanji Aptitude Testing Foundation (日本漢字能力検定協会).
The Kanji data presented here matches the Foundation’s official 2020 February charts. Note that some Kanji had their levels changed (pdf) as of 2020.
Usage
Two main useful types are Character
and Kanji
.
Reading Japanese Text
To reinterpret every char
of the input into a Character
we can reason
about:
use std::fs;
use kanji::Character;
let cs: Option<Vec<Character>> = fs::read_to_string("your-text.txt")
.map(|content| content.chars().map(Character::new).collect())
.ok();
But maybe we’re just interested in the Kanji
:
use std::fs;
use kanji::Kanji;
let ks: Option<Vec<Kanji>> = fs::read_to_string("your-text.txt")
.map(|content| content.chars().filter_map(Kanji::new).collect())
.ok();
Alongside normal pattern matching, the Character::kanji
method can also
help us extract Kanji
values.
Filtering
In general, when we want to reduce a text to a single Character
subtype,
we can filter
:
let orig = "そこで犬が寝ている";
let ks: String = orig.chars().filter(|c| kanji::is_kanji(*c)).collect();
assert_eq!("犬寝", ks);
let hs: String = orig.chars().filter(|c| kanji::is_hiragana(*c)).collect();
assert_eq!("そこでがている", hs);
Level Analysis
To find out how many Kanji of each exam level belong to some text:
let level_table = kanji::level_table();
let texts = vec![
"非常に面白い文章",
"誰でも読んだ事のある名作",
"飛行機で空を飛ぶ",
];
for t in texts {
let counts = kanji::kanji_counts(t, &level_table);
println!("{:#?}", counts);
}
And if you want to know what the Kanji were from a particular level:
let level_table = kanji::level_table();
let text = "日常生活では、鮫に遭う事は基本的にない。";
let ks: String = text
.chars()
// Filter out all chars that aren't Kanji.
.filter_map(kanji::Kanji::new)
// Preserve only those that appear in Level 10.
.filter_map(|k| match level_table.get(&k) {
Some(kanji::Level::Ten) => Some(k.get()),
_ => None,
})
// Fold them all back into a String.
.collect();
assert_eq!("日生本", ks);
Notes on Unicode
All Japanese characters, Kanji or otherwise, are a single Unicode Scalar
Value, and thus can be safely represented by a single internal char
.
Further, the ordering of Kanji in the official Foundation lists is in no way related to their ordering in Unicode, since in Unicode, Kanji are grouped by radical. So:
use kanji::exam_lists;
let same_as_uni = exam_lists::LEVEL_10.chars().max() < exam_lists::LEVEL_09.chars().min();
assert!(!same_as_uni);
Features
serde
: Enableserde
trait implementations.
Resources
Modules
A complete list of all Kanji in every level of the exam.
Structs
A standard ASCII character.
Japanese full-width alphanumeric characters and a few punctuation symbols.
A Hiragana character, from あ to ん.
A single symbol of Kanji, also known as a CJK Unified Ideograph.
A Katakana character, from ア to ン.
Japanese symbols and punctuation.
Enums
General categories for characters, at least as is useful for thinking about Japanese.
A level or “kyuu” (級) of Japanese Kanji ranking.
Functions
All possible Kanji characters, as well as non-character radicals, in a
heap-allocated UTF-8 String
.
Does a given char
belong to the set of Japanese alphanumeric characters
and western punctuation?
Is a given char
betwen あ and ゟ?
Does a given char
belong to the set of Japanese symbols and punctuation?
Kanji appear in the Unicode range 4e00 to 9ffc. The final Japanese Kanji is 9fef (鿯).
Is a given char
between ゠ and ヿ?
Determine how many Kanji of each exam level appear in some text, given a lookup table.
Using the data stored in the LEVEL_*
constants, generate a lookup table
for Kanji levels.