sakurs-core
High-performance sentence boundary detection library using the Delta-Stack Monoid algorithm.
⚠️ API Stability Warning: This crate is in pre-release (v0.1.0). APIs may change significantly before v1.0.0. We recommend pinning to an exact version:
sakurs-core = "=0.1.0"
Features
- Parallel Processing: Scales across multiple cores by processing chunks independently with the Delta-Stack Monoid algorithm
- Language Support: Configurable rules for English and Japanese via TOML-based configuration
- Mathematically Sound: Based on monoid algebra, so parallel execution produces the same boundaries as sequential processing
- Complex Text Support: Handles nested quotes, abbreviations, and cross-chunk boundaries correctly
Quick Start
use ;
// Create processor with default configuration
let processor = with_language?;
// Process text
let text = "Hello world. This is a test.";
let output = processor.process?;
// Use the boundaries
for boundary in &output.boundaries
Advanced Usage
Custom Configuration
use ;
let config = builder
.language? // Japanese language rules
.threads // Use 4 threads
.chunk_size_kb // 512KB chunks
.build?;
let processor = with_config?;
Processing Files
use ;
let processor = new;
let output = processor.process?;
println!;
println!;
Streaming Large Files
use ;
// Use streaming configuration for memory-efficient processing
let config = streaming
.language?
.build?;
let processor = with_config?;
let output = processor.process?;
Language Support
Currently supported:
- English (en)
- Japanese (ja)
Language rules are configured via TOML files. See the main repository for documentation on adding new languages.
Algorithm
This library implements the Delta-Stack Monoid algorithm, which represents parsing state as a monoid: a value with an associative combine operation and an identity element. This mathematical property enables:
- Splitting text into chunks
- Processing chunks in parallel
- Combining partial results in any grouping, as long as chunk order is preserved
- Getting identical results to sequential processing
For detailed algorithm documentation, see the main repository.
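To make the monoid property concrete, here is a minimal, self-contained sketch. It is not the crate's actual Delta-Stack state: the `Depth` type, its fields, and the `scan` and `combine` functions are hypothetical names invented for illustration. The toy summary tracks only parenthesis nesting (the net depth change and the lowest depth reached in a chunk), which is enough to show why chunks can be scanned independently and merged in any grouping.

```rust
/// Toy chunk summary (hypothetical, not the sakurs-core state type):
/// net change in parenthesis depth and the lowest depth reached,
/// both relative to the depth at the start of the chunk.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Depth {
    delta: i32,
    min: i32,
}

impl Depth {
    /// Identity element: the summary of an empty chunk.
    const EMPTY: Depth = Depth { delta: 0, min: 0 };

    /// Summarize one chunk without knowing what came before it.
    fn scan(chunk: &str) -> Depth {
        let mut d = Depth::EMPTY;
        for c in chunk.chars() {
            match c {
                '(' => d.delta += 1,
                ')' => d.delta -= 1,
                _ => {}
            }
            d.min = d.min.min(d.delta);
        }
        d
    }

    /// Associative combine: `other` is shifted by the depth `self` ends at.
    fn combine(self, other: Depth) -> Depth {
        Depth {
            delta: self.delta + other.delta,
            min: self.min.min(self.delta + other.min),
        }
    }
}

fn main() {
    let text = "a (b (c) d) e) f (";
    let chunks = ["a (b (", "c) d) e", ") f ("];

    // Each chunk could be scanned on a different thread.
    let parts: Vec<Depth> = chunks.iter().map(|c| Depth::scan(c)).collect();

    // Two different groupings of the same ordered chunks.
    let left = parts[0].combine(parts[1]).combine(parts[2]);
    let right = parts[0].combine(parts[1].combine(parts[2]));

    // Both agree with a single sequential pass over the whole text.
    let sequential = Depth::scan(text);
    assert_eq!(left, sequential);
    assert_eq!(right, sequential);
    println!("{:?}", sequential);
}
```

The combine step is associative but not commutative: swapping two chunks changes the result, which is why chunk order must be preserved even though the grouping is free.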
License
MIT License. See LICENSE for details.