Crate budouy

BudouX parser port in Rust.

This crate provides the core BudouX segmentation algorithm and optional HTML processing utilities. The parser splits a sentence into semantic chunks based on a trained model.
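
To make the idea concrete, here is a rough sketch of score-based segmentation in the spirit of BudouX. It is not this crate's implementation: `ToyModel`, `segment`, and the two weight tables are hypothetical names, and only two single-character feature groups are used, whereas a real BudouX model has more groups, including multi-character ones. The bias of minus half the total model weight is assumed to follow the reference BudouX implementation.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified model (not a budouy type): two weight tables only.
struct ToyModel {
    /// Weights keyed by the character just before a candidate boundary.
    before: HashMap<char, i32>,
    /// Weights keyed by the character just after a candidate boundary.
    after: HashMap<char, i32>,
}

/// Splits `sentence` wherever the summed feature weights exceed the bias.
fn segment(model: &ToyModel, sentence: &str) -> Vec<String> {
    let chars: Vec<char> = sentence.chars().collect();
    if chars.is_empty() {
        return Vec::new();
    }

    // Bias: minus half the sum of every weight in the model.
    let total: i32 = model.before.values().chain(model.after.values()).sum();
    let base = -f64::from(total) / 2.0;

    let mut chunks = vec![chars[0].to_string()];
    for i in 1..chars.len() {
        // Score the candidate boundary between chars[i - 1] and chars[i].
        let mut score = base;
        score += f64::from(*model.before.get(&chars[i - 1]).unwrap_or(&0));
        score += f64::from(*model.after.get(&chars[i]).unwrap_or(&0));

        if score > 0.0 {
            // Positive score: start a new chunk at chars[i].
            chunks.push(chars[i].to_string());
        } else if let Some(last) = chunks.last_mut() {
            // Otherwise extend the current chunk.
            last.push(chars[i]);
        }
    }
    chunks
}
```

With a single weight of 10_000 on 'a' in `after`, `segment(&model, "abcdeabcd")` yields the chunks "abcde" and "abcd", the same split as the custom-model example under Examples.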

§Features

  • std: Default feature for std-enabled builds.
  • alloc: no_std-compatible build using alloc and hashbrown.
  • vendored-models: Bundles default Japanese/Chinese/Thai models.
  • html: Enables HTML processing utilities based on kuchikikiki (requires std).
  • cli: Enables the budouy CLI (requires std, implies vendored-models).
  • wasm: Enables WebAssembly bindings via wasm-bindgen (implies alloc and vendored-models).

Note: std and alloc are mutually exclusive.

§no_std

This crate supports no_std with alloc. Disable default features and enable alloc:

budouy = { version = "0.1", default-features = false, features = ["alloc"] }

The html and cli features require std.

§Examples

Parse a sentence with a custom model:

use std::collections::HashMap;
use budouy::{Model, Parser};
use budouy::model::FeatureKey;

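let mut model: Model = HashMap::new();
// UW4 keys on the character right after a candidate boundary, so a large
// weight on "a" puts a break before every non-initial "a".
model.insert(FeatureKey::UW4, HashMap::from([("a".to_string(), 10_000)]));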
let parser = Parser::new(model);
let chunks = parser.parse("abcdeabcd");
assert_eq!(chunks, vec!["abcde", "abcd"]);

Use the default Japanese model (requires vendored-models):

use budouy::model::load_default_japanese_parser;

let parser = load_default_japanese_parser();
let chunks = parser.parse("今日は良い天気です");
println!("{:?}", chunks);

Process HTML (requires html + vendored-models):

use budouy::{HTMLProcessingParser, model::load_default_japanese_parser};

let parser = load_default_japanese_parser();
let html_parser = HTMLProcessingParser::new(parser, None);
let input = "今日は<strong>良い</strong>天気です";
let output = html_parser.translate_html_string(input);
println!("{}", output);
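
For plain text with no markup, the effect of inserting a separator at each boundary can be approximated by joining the chunks returned by `parse` with U+200B ZERO WIDTH SPACE, which lets a renderer break lines only at those points. The sketch below assumes that separator; `join_with_zwsp` is a hypothetical helper, not part of budouy, and the crate's HTML processor works on a DOM rather than on a flat string.

```rust
/// Hypothetical helper (not part of budouy): joins chunks with U+200B
/// ZERO WIDTH SPACE so line breaks may occur only at chunk boundaries.
fn join_with_zwsp<S: AsRef<str>>(chunks: &[S]) -> String {
    chunks
        .iter()
        .map(|chunk| chunk.as_ref())
        .collect::<Vec<&str>>()
        .join("\u{200B}")
}
```

The helper accepts both `&[&str]` and `&[String]`, so it could be fed the output of a parser directly, whatever string type that output uses.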

§WebAssembly

Build for web with wasm-pack:

wasm-pack build --target web --no-default-features --features wasm

Use from JavaScript:

import init, { BudouY } from './pkg/budouy.js';

await init();
const parser = BudouY.japanese();
const chunks = parser.parse("今日は良い天気です");

Modules§

model
Model types and loaders.

Structs§

HTMLProcessingParser (requires html)
BudouX parser with HTML processing support.
HTMLProcessor (requires html)
HTML processor that applies BudouX boundaries to a DOM.
HTMLProcessorOptions (requires html)
Options for HTMLProcessor.
Parser
BudouX parser for semantic line breaks.

Enums§

Separator (requires html)
Separator inserted at semantic boundaries.

Type Aliases§

Model
BudouX model data.