BudouX parser port in Rust.
This crate provides the core BudouX segmentation algorithm and optional
HTML processing utilities. The parser splits a sentence into semantic
chunks based on a trained model.
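To make the idea of model-driven splitting concrete, here is a minimal, self-contained sketch. It does not use this crate and is not its implementation. It assumes the scoring scheme of upstream BudouX as commonly described: each candidate boundary starts from a base score of minus half the sum of all model weights, feature weights are added, and a break is placed where the score is positive. Only one feature is modeled here, UW4 (the character right after the candidate boundary), and the helper name `split_with_uw4` is hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of the crate): splits `sentence` using only
/// the UW4 feature, i.e. the character immediately after a candidate boundary.
fn split_with_uw4(sentence: &str, uw4: &HashMap<String, i32>) -> Vec<String> {
    let chars: Vec<char> = sentence.chars().collect();
    if chars.is_empty() {
        return Vec::new();
    }
    // Assumed base score: minus half the sum of all weights in the model.
    let base: f64 = -0.5 * uw4.values().map(|&w| w as f64).sum::<f64>();
    // The first character always opens the first chunk.
    let mut chunks = vec![chars[0].to_string()];
    for &c in &chars[1..] {
        let weight = uw4.get(&c.to_string()).copied().unwrap_or(0) as f64;
        if base + weight > 0.0 {
            // Positive score: start a new chunk at this character.
            chunks.push(c.to_string());
        } else {
            // Otherwise extend the current chunk.
            chunks.last_mut().unwrap().push(c);
        }
    }
    chunks
}

fn main() {
    let uw4 = HashMap::from([("a".to_string(), 10_000)]);
    let chunks = split_with_uw4("abcdeabcd", &uw4);
    assert_eq!(chunks, vec!["abcde", "abcd"]);
    println!("{:?}", chunks);
}
```

With a single weight of 10,000 on "a", the base score is -5,000, so only a boundary directly before an "a" scores above zero. A real model combines many more features (unigrams, bigrams, and trigrams around the boundary), which this sketch leaves out.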
§Features
- std: Default feature for std-enabled builds.
- alloc: no_std-compatible build using alloc and hashbrown.
- vendored-models: Bundles default Japanese/Chinese/Thai models.
- html: Enables HTML processing utilities based on kuchikikiki (requires std).
- cli: Enables the budouy CLI (requires std, implies vendored-models).
- wasm: Enables WebAssembly bindings via wasm-bindgen (implies alloc and vendored-models).
Note: std and alloc are mutually exclusive.
§no_std
This crate supports no_std with alloc. Disable default features and enable alloc:
budouy = { version = "0.1", default-features = false, features = ["alloc"] }
The html and cli features require std.
§Examples
Parse a sentence with a custom model:
use std::collections::HashMap;
use budouy::{Model, Parser};
use budouy::model::FeatureKey;
let mut model: Model = HashMap::new();
model.insert(FeatureKey::UW4, HashMap::from([("a".to_string(), 10_000)]));
let parser = Parser::new(model);
let chunks = parser.parse("abcdeabcd");
assert_eq!(chunks, vec!["abcde", "abcd"]);
Use the default Japanese model (requires vendored-models):
use budouy::model::load_default_japanese_parser;
let parser = load_default_japanese_parser();
let chunks = parser.parse("今日は良い天気です");
println!("{:?}", chunks);
Process HTML (requires html + vendored-models):
use budouy::{HTMLProcessingParser, model::load_default_japanese_parser};
let parser = load_default_japanese_parser();
let html_parser = HTMLProcessingParser::new(parser, None);
let input = "今日は<strong>良い</strong>天気です";
let output = html_parser.translate_html_string(input);
println!("{}", output);
§WebAssembly
Build for web with wasm-pack:
wasm-pack build --target web --no-default-features --features wasm
Use from JavaScript:
import init, { BudouY } from './pkg/budouy.js';
await init();
const parser = BudouY.japanese();
const chunks = parser.parse("今日は良い天気です");
Modules§
- model: Model types and loaders.
Structs§
- HTMLProcessingParser (html): BudouX parser with HTML processing support.
- HTMLProcessor (html): HTML processor that applies BudouX boundaries to a DOM.
- HTMLProcessorOptions (html): Options for HTMLProcessor.
- Parser: BudouX parser for semantic line breaks.
Enums§
- Separator (html): Separator inserted at semantic boundaries.
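As a rough illustration of what inserting a separator at semantic boundaries amounts to, the sketch below joins already-split chunks with a separator string. It is self-contained and does not use this crate. The variants of `Separator` are not listed on this page, so the two separators shown (a zero-width space, U+200B, and a `<wbr>` tag) are assumptions, and `join_chunks` is a hypothetical helper.

```rust
/// Hypothetical helper (not part of the crate): joins chunks with the
/// given separator so a renderer may break lines only at chunk boundaries.
fn join_chunks(chunks: &[&str], separator: &str) -> String {
    chunks.join(separator)
}

fn main() {
    let chunks = ["abcde", "abcd"];
    // Assumed separator: U+200B ZERO WIDTH SPACE.
    let with_zwsp = join_chunks(&chunks, "\u{200B}");
    assert_eq!(with_zwsp, "abcde\u{200B}abcd");
    // Assumed alternative separator: the HTML <wbr> element.
    let with_wbr = join_chunks(&chunks, "<wbr>");
    assert_eq!(with_wbr, "abcde<wbr>abcd");
    println!("{}", with_wbr);
}
```

This only covers plain text. Per the descriptions above, the crate's HTML processor applies boundaries to a DOM, which also requires keeping existing markup intact.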
Type Aliases§
- Model: BudouX model data.