Crate srx

A simple and reasonably fast Rust implementation of the Segmentation Rules eXchange 2.0 standard for text segmentation. srx is not fully compliant with the standard.

This crate is intended for segmenting plain text, so markup information (the <formathandle> element and the segmentsubflows attribute) is ignored.

Contrary to the SRX specification, overlapping matches of the same <rule> are not found, which could lead to different behavior in a few edge cases.

Example

use std::{fs, str::FromStr};
use srx::SRX;

let srx = SRX::from_str(&fs::read_to_string("data/segment.srx").unwrap())?;
let english_rules = srx.language_rules("en");

assert_eq!(
    english_rules.split("e.g. U.K. and Mr. do not split. SRX is a rule-based format.").collect::<Vec<_>>(),
    vec!["e.g. U.K. and Mr. do not split. ", "SRX is a rule-based format."]
);

Features

serde: Serde serialization and deserialization support for SRX.
from_xml: SRX::from_reader method and std::str::FromStr implementation to load from an XML file in SRX format.

A note on regular expressions

This crate uses the regex crate for parsing and executing regular expressions. The regex crate is mostly compatible with the regular expression standard from the SRX specification. However, some metacharacters such as \Q and \E are not supported.

Some useful SRX files, such as segment.srx from LanguageTool, do not comply with the standard, for example by using look-ahead and look-behind. To still be able to load such files, as well as other files containing unsupported rules, srx ignores <rule> elements with invalid regular expressions and provides information about them via the SRX::errors function.
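
The skip-and-report behavior described above can be illustrated with a minimal, std-only sketch. Everything in it is hypothetical and is not part of the srx API: the names SkippedRule, check, and load are invented for illustration, and check is a crude substring test standing in for real regular expression compilation.

```rust
/// Hypothetical record of a rule that had to be skipped (not an srx type).
#[derive(Debug, PartialEq)]
struct SkippedRule {
    pattern: String,
    reason: &'static str,
}

/// Stand-in validity check (an assumption for this sketch): flags the
/// constructs named in the text as unsupported, using plain substring tests.
/// A real implementation would try to compile the pattern instead.
fn check(pattern: &str) -> Result<(), &'static str> {
    const LOOK_AROUND: [&str; 4] = ["(?=", "(?!", "(?<=", "(?<!"];
    if LOOK_AROUND.iter().any(|token| pattern.contains(token)) {
        return Err("look-around is not supported");
    }
    if pattern.contains("\\Q") || pattern.contains("\\E") {
        return Err("\\Q...\\E quoting is not supported");
    }
    Ok(())
}

/// Keeps the usable patterns and records the rest instead of failing outright.
fn load<'a>(patterns: &[&'a str]) -> (Vec<&'a str>, Vec<SkippedRule>) {
    let mut kept = Vec::new();
    let mut skipped = Vec::new();
    for &pattern in patterns {
        match check(pattern) {
            Ok(()) => kept.push(pattern),
            Err(reason) => skipped.push(SkippedRule {
                pattern: pattern.to_string(),
                reason,
            }),
        }
    }
    (kept, skipped)
}
```

Calling load on a mix of patterns returns the usable ones together with a list of the skipped ones, so a caller can still segment text and inspect what was dropped afterwards.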

Structs

Language
Newtype denoting a language (languagerulename attribute in SRX).
Rules
An ordered set of rules. Rules are executed in order. Once a rule matches at an index, no other rule can match at the same index. Each rule either breaks (i.e. splits the text at this index) or prevents breaking.
SRX
The SRX root. Does not execute rules on its own.
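
The ordering semantics described for Rules can be sketched without the crate. The following is a simplified, std-only illustration and not the crate's implementation: Rule, split, and example_rules are hypothetical names, and literal prefixes and suffixes stand in for the regular expressions that real SRX rules use.

```rust
/// One simplified rule: it applies at byte index `i` when the text before `i`
/// ends with `before` and the text from `i` onward starts with `after`.
/// Real SRX rules use regular expressions; literals keep this sketch std-only.
struct Rule {
    before: &'static str,
    after: &'static str,
    /// `true` splits at the index, `false` forbids a split there.
    do_break: bool,
}

/// Splits `text` using `rules` in order: the first rule that matches at an
/// index decides it, and later rules are not consulted for that index.
fn split<'a>(text: &'a str, rules: &[Rule]) -> Vec<&'a str> {
    let mut segments = Vec::new();
    let mut start = 0;
    for i in 1..text.len() {
        // Only consider indices that fall between characters.
        if !text.is_char_boundary(i) {
            continue;
        }
        let first_match = rules
            .iter()
            .find(|r| text[..i].ends_with(r.before) && text[i..].starts_with(r.after));
        if let Some(rule) = first_match {
            if rule.do_break {
                segments.push(&text[start..i]);
                start = i;
            }
        }
    }
    segments.push(&text[start..]);
    segments
}

/// Example rule set: the exception comes before the general rule.
fn example_rules() -> Vec<Rule> {
    vec![
        // Exception first: never split after "Mr. ".
        Rule { before: "Mr. ", after: "", do_break: false },
        // General rule: split after a period followed by a space.
        Rule { before: ". ", after: "", do_break: true },
    ]
}
```

With example_rules, the index after "Mr. " is claimed by the non-breaking rule, so the general rule never gets to split there. Swapping the two rules would let the general rule win at that index, which is why the order of rules matters.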

Enums

Error (available with the from_xml feature)
