Japanese Codepoints
A high-performance, zero-dependency Rust library for Japanese character validation and code point handling based on JIS standards.
This library provides collections of Unicode code points for various Japanese character sets, with validation macros and static caching for performance. It is well suited to input validation in systems with strict character requirements (e.g., legacy mainframe integration, payment systems, government applications).
Rust port of terasoluna-gfw-codepoints (Java), reimagined with Rust's type system and zero-cost abstractions.
Features
- High Performance: Static caching via OnceLock eliminates repeated allocations (1900× faster than creating new instances)
- Zero Dependencies: No external dependencies for faster compile times and smaller binaries
- Type Safety: Leverages Rust's ownership system to prevent runtime errors
- Feature Flags: Compile only the character sets you need
- Validation Macros: Ergonomic macros for common validation patterns
- Zero-Copy Operations: Efficient set operations (union, intersection, difference)
Character Sets
The library is organized using feature flags to keep it lightweight. You only need to enable the character sets you require.
| Feature Flag | Characters | Description |
|---|---|---|
| default (codepoints) | 225 | Core CodePoints struct + ASCII control/printable |
| codepoints-jisx0201 | 158 | JIS X 0201: Half-width Katakana (63) + Latin letters (95) |
| codepoints-jisx0208 | ~700 | JIS X 0208: Hiragana, Katakana, Latin, Greek, Cyrillic, symbols, box-drawing |
| codepoints-jisx0208kanji | 6,355 | JIS X 0208: Level 1 & 2 Kanji |
| codepoints-jisx0213kanji | 10,050 | JIS X 0213: Level 1-4 Kanji (extends JIS X 0208) |
| full | ~17,500 | All character sets |
When to use which feature?
- Web forms with Japanese input: codepoints-jisx0208 (Hiragana/Katakana)
- Legacy system integration: codepoints-jisx0201 (halfwidth) + codepoints-jisx0208kanji
- Full Japanese text support: full or specific kanji features
- ASCII-only validation: default (no extra features needed)
Quick Start
Requirements
- Rust: 1.70+ (for std::sync::OnceLock)
- Edition: 2021
Installation
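Add one of the following entries to your Cargo.toml:
[dependencies]
# Minimal: ASCII validation only
japanese-codepoints = "0.1.0"
# Specific character sets (recommended)
japanese-codepoints = { version = "0.1.0", features = ["codepoints-jisx0208", "codepoints-jisx0208kanji"] }
# Everything (largest binary size)
japanese-codepoints = { version = "0.1.0", features = ["full"] }
# Disable default features for minimal build
japanese-codepoints = { version = "0.1.0", default-features = false, features = ["codepoints-jisx0208"] }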
Usage Examples
Basic Validation
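use japanese_codepoints::jisx0208::Hiragana;
// Use cached instance for best performance
let hiragana = Hiragana::cached();
// Validate string contains only hiragana
assert!(hiragana.contains("あいうえお"));
assert!(!hiragana.contains("あいうえおA")); // 'A' is not hiragana
// Get detailed error information (the returned value is assumed to implement Debug)
if let Some(excluded) = hiragana.codepoints().first_excluded_with_position("あいうえおA") {
    println!("first invalid character: {:?}", excluded);
}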
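To make position-aware error reporting concrete, here is a minimal standalone sketch that uses only the standard library. The names `jis_hiragana` and `find_first_excluded`, and the choice to report a character index rather than a byte offset, are assumptions made for illustration; they are not the crate's actual API.

```rust
use std::collections::HashSet;

/// Hypothetical helper: the 83 code points from ぁ (U+3041) through ん (U+3093).
fn jis_hiragana() -> HashSet<u32> {
    (0x3041..=0x3093).collect()
}

/// Returns the first character of `text` that is not in `allowed`,
/// together with its character index (not its byte offset).
fn find_first_excluded(allowed: &HashSet<u32>, text: &str) -> Option<(char, usize)> {
    text.chars()
        .enumerate()
        .find(|(_, c)| !allowed.contains(&(*c as u32)))
        .map(|(index, c)| (c, index))
}

fn main() {
    let hiragana = jis_hiragana();
    match find_first_excluded(&hiragana, "あいうえおA") {
        Some((ch, index)) => println!("invalid character {ch:?} at index {index}"),
        None => println!("all characters are hiragana"),
    }
}
```

Reporting a character index keeps the position meaningful to end users; a byte offset would be the better choice if the caller needs to slice the original `&str`.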
Validation Macros (Recommended)
use ;
// Simple validation
validate_hiragana!?; // OK
validate_katakana!?; // OK
// Mixed validation (Hiragana + Katakana + ASCII)
validate_japanese_mixed!?; // OK
validate_japanese_mixed!?; // Error: contains kanji
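For readers curious how a validation macro can cooperate with the `?` operator, the following is a rough sketch of the general technique. The macro `validate_with!`, the `ValidationError` fields, and the `is_hiragana` predicate are all hypothetical stand-ins; the crate's own macros and error type may be shaped differently.

```rust
/// Hypothetical error type; the crate's ValidationError may look different.
#[derive(Debug, PartialEq)]
struct ValidationError {
    ch: char,
    position: usize,
}

/// Simplified predicate: ぁ (U+3041) through ん (U+3093).
fn is_hiragana(c: char) -> bool {
    ('\u{3041}'..='\u{3093}').contains(&c)
}

/// Expands to a `Result<(), ValidationError>` expression, so callers can use `?`.
macro_rules! validate_with {
    ($text:expr, $pred:expr) => {
        match $text.chars().enumerate().find(|(_, c)| !$pred(*c)) {
            Some((position, ch)) => Err(ValidationError { ch, position }),
            None => Ok(()),
        }
    };
}

/// Example caller that propagates the first validation failure.
fn check_reading(input: &str) -> Result<(), ValidationError> {
    validate_with!(input, is_hiragana)?;
    Ok(())
}

fn main() {
    println!("{:?}", check_reading("ひらがな"));
    println!("{:?}", check_reading("ひらA"));
}
```

Because the macro expands to a plain `Result` expression, it composes with ordinary error handling and needs no runtime support.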
Multi-Set Validation
use ;
use ;
let allowed = ;
// Each character must be in at least one set
assert!; // ✓ Hiragana + ASCII
assert!; // ✓ Katakana
assert!; // ✗ Kanji not in any set
Set Operations
use ;
let hiragana = cached;
let katakana = cached;
// Union: Combine sets
let all_kana = hiragana.codepoints.union;
assert!;
// Custom character set
let no_a = hiragana.codepoints.difference;
assert!;
assert!;
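As a point of comparison, the same union and difference ideas can be expressed directly on `std::collections::HashSet`. This is a sketch under the assumption that a code point set is just a set of `u32` values; `union_of` and `difference_of` are hypothetical helpers, and unlike a borrowed view they allocate a new set for each result.

```rust
use std::collections::HashSet;

/// New set containing every code point in `a` or `b`.
fn union_of(a: &HashSet<u32>, b: &HashSet<u32>) -> HashSet<u32> {
    a.union(b).copied().collect()
}

/// New set containing the code points of `a` that are not in `b`.
fn difference_of(a: &HashSet<u32>, b: &HashSet<u32>) -> HashSet<u32> {
    a.difference(b).copied().collect()
}

fn main() {
    // Simplified ranges: ぁ..=ん and ァ..=ヶ
    let hiragana: HashSet<u32> = (0x3041..=0x3093).collect();
    let katakana: HashSet<u32> = (0x30A1..=0x30F6).collect();

    let all_kana = union_of(&hiragana, &katakana);
    assert!(all_kana.contains(&('あ' as u32)));
    assert!(all_kana.contains(&('ア' as u32)));

    // Custom set: hiragana without あ (U+3042)
    let only_a = HashSet::from(['あ' as u32]);
    let no_a = difference_of(&hiragana, &only_a);
    assert!(!no_a.contains(&0x3042));
    assert!(no_a.contains(&('い' as u32)));
}
```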
Comparison with Java Original
| Feature | Java (terasoluna-gfw) | Rust (this crate) |
|---|---|---|
| Caching | ConcurrentHashMap + Reflection | OnceLock (lock-free) |
| Memory Safety | Runtime checks | Compile-time guarantees |
| Dependencies | Jakarta Validation + Spring | Zero dependencies |
| Set Operations | Basic (union, intersect) | Full set algebra |
| Error Handling | Exceptions | Result<T, E> |
| Validation | Annotation-based | Macros + Functions |
| Binary Size | JVM + dependencies | Native, minimal |
Testing
# All tests
# Documentation tests
# Examples
# Check all feature combinations
Examples
# Comprehensive demo
# Individual character sets
Architecture
- Core Layer (always included)
  - CodePoints (HashSet wrapper)
  - Set operations (union, intersection, diff, etc.)
  - ASCII sets (control, printable, all, crlf)
- Character Set Modules (feature-gated)
  - jisx0201: Latin + Halfwidth Katakana
  - jisx0208: Hiragana, Katakana, Symbols, etc.
  - jisx0208kanji: 6,355 JIS X 0208 Kanji
  - jisx0213kanji: 10,050 JIS X 0213 Kanji
- Validation Layer
  - ValidationError (structured errors)
  - Macros (validate_hiragana!, etc.)
Common Pitfalls
UTF-8 vs Unicode Code Points
This library validates Unicode code points (scalar values), not bytes:
// ✓ Works: validates Unicode characters
let hiragana = Hiragana::cached();
hiragana.contains("あ"); // U+3042
// ✗ Not for byte validation
// For byte-level validation, use encoding_rs or similar
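The distinction is easy to see with the standard library alone. In this small sketch, `code_points` is a hypothetical helper that lists the scalar values of a string; the byte values shown are the UTF-8 encoding of あ.

```rust
/// Hypothetical helper: the Unicode scalar values of `text`, in order.
fn code_points(text: &str) -> Vec<u32> {
    text.chars().map(|c| c as u32).collect()
}

fn main() {
    let text = "あ";
    // One code point (U+3042)...
    assert_eq!(code_points(text), vec![0x3042]);
    // ...but three bytes in UTF-8.
    assert_eq!(text.len(), 3);
    assert_eq!(text.as_bytes().to_vec(), vec![0xE3u8, 0x81, 0x82]);
}
```

A set of code points therefore says nothing about how the text is encoded on the wire; a target encoding such as Shift_JIS has to be checked separately.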
Cached vs New
Always use cached() in production:
// ✗ Creates new HashSet every call
let h = Hiragana::new();
// ✓ Zero-allocation after first call
let h = Hiragana::cached();
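The difference between the two calls comes down to the `std::sync::OnceLock` pattern, sketched below. `HiraganaSet` is a hypothetical stand-in with a simplified code point range; it illustrates the caching technique rather than the crate's actual implementation.

```rust
use std::collections::HashSet;
use std::sync::OnceLock;

/// Hypothetical stand-in for a character set type.
struct HiraganaSet {
    code_points: HashSet<u32>,
}

impl HiraganaSet {
    /// Builds a fresh set, allocating a new HashSet on every call.
    fn new() -> Self {
        // Simplified range: ぁ (U+3041) through ん (U+3093)
        Self { code_points: (0x3041..=0x3093).collect() }
    }

    /// Builds the set on first use, then returns the same shared reference.
    fn cached() -> &'static Self {
        static INSTANCE: OnceLock<HiraganaSet> = OnceLock::new();
        INSTANCE.get_or_init(Self::new)
    }

    /// True if every character of `text` is in the set.
    fn contains(&self, text: &str) -> bool {
        text.chars().all(|c| self.code_points.contains(&(c as u32)))
    }
}

fn main() {
    let first = HiraganaSet::cached();
    let second = HiraganaSet::cached();
    // Both calls return the same instance; only the first one allocated.
    assert!(std::ptr::eq(first, second));
    assert!(first.contains("あいうえお"));
}
```

`OnceLock::get_or_init` runs the initializer at most once even when several threads race on the first call, so the cached reference is safe to share.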
Feature Flags
Don't forget to enable features:
# Cargo.toml
# ✗ This won't compile if you use JIS X 0208
japanese-codepoints = "0.1.0"
# ✓ Enable the features you need
japanese-codepoints = { version = "0.1.0", features = ["codepoints-jisx0208"] }
License
This project is licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Acknowledgments
This is a Rust port of the terasoluna-gfw-codepoints library. Special thanks to the original authors for the excellent Java implementation.