Crate kanji

A library for the handling and analysis of Japanese text, particularly Kanji. It can be used to find the density of Kanji in given texts according to their Level classification, as defined by the Japan Kanji Aptitude Testing Foundation (日本漢字能力検定協会).

The Kanji data presented here matches the Foundation's official February 2020 charts. Note that some Kanji had their levels changed (pdf) as of 2020.

Usage

The two main types are Character and Kanji.

Reading Japanese Text

To reinterpret every char of the input as a Character that we can reason about:

use std::fs;
use kanji::Character;

let cs: Option<Vec<Character>> = fs::read_to_string("your-text.txt")
  .map(|content| content.chars().map(Character::new).collect())
  .ok();

But maybe we're just interested in the Kanji:

use std::fs;
use kanji::Kanji;

let ks: Option<Vec<Kanji>> = fs::read_to_string("your-text.txt")
  .map(|content| content.chars().filter_map(Kanji::new).collect())
  .ok();

Alongside normal pattern matching, the Character::kanji method can also help us extract Kanji values.
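
To make the contrast concrete, here is a minimal sketch of both approaches. It does not use the crate itself: `Ch` is a hypothetical, much smaller stand-in for Character, and its range check is an assumption made for the example, so treat it as an illustration of the idea rather than the crate's implementation.

```rust
/// Hypothetical, simplified stand-in for the crate's `Character` enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Ch {
    Kanji(char),
    Other,
}

impl Ch {
    /// Classify a char using only the basic CJK range (an assumption for this sketch).
    fn new(c: char) -> Ch {
        if ('\u{4e00}'..='\u{9ffc}').contains(&c) {
            Ch::Kanji(c)
        } else {
            Ch::Other
        }
    }

    /// Accessor in the spirit of `Character::kanji`: `Some` only for Kanji.
    fn kanji(&self) -> Option<char> {
        match self {
            Ch::Kanji(k) => Some(*k),
            Ch::Other => None,
        }
    }
}

/// Extract Kanji by explicit pattern matching.
fn by_matching(text: &str) -> String {
    text.chars()
        .map(Ch::new)
        .filter_map(|c| match c {
            Ch::Kanji(k) => Some(k),
            Ch::Other => None,
        })
        .collect()
}

/// Extract Kanji via the accessor method.
fn by_accessor(text: &str) -> String {
    text.chars().map(Ch::new).filter_map(|c| c.kanji()).collect()
}

fn main() {
    let text = "そこで犬が寝ている";
    // Both approaches keep exactly the same characters.
    assert_eq!(by_matching(text), by_accessor(text));
    println!("{}", by_accessor(text));
}
```

The accessor form tends to read better inside iterator chains, while an explicit match is handy when several variants need different handling.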

Filtering

In general, when we want to reduce a text to a single Character subtype, we can filter:

let orig = "そこで犬が寝ている";

let ks: String = orig.chars().filter(kanji::is_kanji).collect();
assert_eq!("犬寝", ks);

let hs: String = orig.chars().filter(kanji::is_hiragana).collect();
assert_eq!("そこでがている", hs);
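
For intuition about what such predicates check, the following sketch re-creates three of them with the standard library only. The function names are hypothetical, and the ranges are taken literally from the descriptions in the Functions listing on this page (4e00 to 9ffc for Kanji, あ to ん, ア to ン); the crate's actual implementations may differ.

```rust
/// Hypothetical range check modeled on the documented Kanji range.
fn in_kanji_range(c: &char) -> bool {
    ('\u{4e00}'..='\u{9ffc}').contains(c)
}

/// Hypothetical range check: Hiragana from あ to ん.
fn in_hiragana_range(c: &char) -> bool {
    ('あ'..='ん').contains(c)
}

/// Hypothetical range check: Katakana from ア to ン.
fn in_katakana_range(c: &char) -> bool {
    ('ア'..='ン').contains(c)
}

fn main() {
    let orig = "そこでポチが寝ている";

    // Each predicate takes `&char`, so it can be handed to `filter` directly.
    let ks: String = orig.chars().filter(in_kanji_range).collect();
    let hs: String = orig.chars().filter(in_hiragana_range).collect();
    let kat: String = orig.chars().filter(in_katakana_range).collect();

    assert_eq!("寝", ks);
    assert_eq!("そこでがている", hs);
    assert_eq!("ポチ", kat);
}
```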

Level Analysis

To find out how many Kanji of each exam level appear in some text:

let level_table = kanji::level_table();
let texts = vec![
    "非常に面白い文章",
    "誰でも読んだ事のある名作",
    "飛行機で空を飛ぶ",
];

for t in texts {
    let counts = kanji::kanji_counts(t, &level_table);
    println!("{:#?}", counts);
}
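
Conceptually, this kind of counting is a table lookup per character followed by a tally. The sketch below shows that idea with the standard library only. `count_by_level` and the `HashMap<char, u8>` table are hypothetical stand-ins, not the crate's types, and the tiny table contains only the three Kanji that a later example on this page identifies as Level 10.

```rust
use std::collections::HashMap;

/// Tally how many chars of `text` fall into each level of a lookup table.
/// Chars missing from the table (kana, punctuation, unlisted Kanji) are skipped.
fn count_by_level(text: &str, table: &HashMap<char, u8>) -> HashMap<u8, usize> {
    let mut counts = HashMap::new();
    for level in text.chars().filter_map(|c| table.get(&c)) {
        *counts.entry(*level).or_insert(0) += 1;
    }
    counts
}

fn main() {
    // Hand-built table for illustration: 日, 生 and 本 mapped to Level 10.
    let table: HashMap<char, u8> = "日生本".chars().map(|c| (c, 10)).collect();

    let counts = count_by_level("日常生活では、鮫に遭う事は基本的にない。", &table);

    // Only the three listed Kanji are counted; everything else is ignored.
    assert_eq!(counts.get(&10), Some(&3));
    assert_eq!(counts.len(), 1);
}
```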

And to see which Kanji in a text belong to a particular level:

let level_table = kanji::level_table();
let text = "日常生活では、鮫に遭う事は基本的にない。";

let ks: String = text
    .chars()
    // Filter out all chars that aren't Kanji.
    .filter_map(kanji::Kanji::new)
    // Preserve only those that appear in Level 10.
    .filter_map(|k| match level_table.get(&k) {
        Some(kanji::Level::Ten) => Some(k.get()),
        _ => None,
    })
    // Fold them all back into a String.
    .collect();

assert_eq!("日生本", ks);

Notes on Unicode

Every Japanese character, Kanji or otherwise, is a single Unicode Scalar Value, and thus can be safely represented by a single char internally.
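
A quick standard-library check illustrates the difference between chars and bytes here. The helper name `char_and_byte_counts` is made up for this sketch; the point is only that each character is one char even though it occupies several bytes in a UTF-8 string.

```rust
/// Return (number of chars, number of UTF-8 bytes) for a string slice.
fn char_and_byte_counts(text: &str) -> (usize, usize) {
    (text.chars().count(), text.len())
}

fn main() {
    // Four characters, each stored as three bytes in UTF-8.
    assert_eq!(char_and_byte_counts("犬が寝る"), (4, 12));

    // A single Kanji fits in one `char`, whose scalar value we can inspect.
    let inu: char = '犬';
    assert_eq!(inu.len_utf8(), 3);
    assert_eq!(inu as u32, 0x72ac);
}
```

This is why iterating with `chars()` rather than indexing bytes is the natural way to walk Japanese text.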

Further, the ordering of Kanji in the official Foundation lists is in no way related to their ordering in Unicode, since in Unicode, Kanji are grouped by radical. So:

use kanji::exam_lists;

let same_as_uni = exam_lists::LEVEL_10.chars().max() < exam_lists::LEVEL_09.chars().min();
assert!(!same_as_uni);

Resources

Modules

exam_lists

A complete list of all Kanji in every level of the exam.

Structs

ASCII

A standard ASCII character.

AlphaNum

Japanese full-width alphanumeric characters and a few punctuation symbols.

Hiragana

A Hiragana character, from あ to ん.

Kanji

A single symbol of Kanji, also known as a CJK Unified Ideograph.

Katakana

A Katakana character, from ア to ン.

Punctuation

Japanese symbols and punctuation.

Enums

Character

General categories for characters, at least as is useful for thinking about Japanese.

Level

A level or "kyuu" (級) of Japanese Kanji ranking.

Functions

all_kanji

All possible Kanji characters, as well as non-character radicals, in a heap-allocated UTF-8 String.

is_alphanum

Does a given char belong to the set of Japanese alphanumeric characters and western punctuation?

is_hiragana

Is a given char between あ and ん?

is_japanese_punct

Does a given char belong to the set of Japanese symbols and punctuation?

is_kanji

Is a given char Kanji? Kanji appear in the Unicode range 4e00 to 9ffc. The final Japanese Kanji is 9fef (鿯).

is_kanji_extended

Detect if a char is Kanji while accounting for all of the Unicode CJK extensions.

is_katakana

Is a given char between ア and ン?

kanji_counts

Determine how many Kanji of each exam level appear in some text, given a lookup table.

level_table

Using the data stored in the LEVEL_* constants, generate a lookup table for Kanji levels.