
Trait TextEncoder

pub trait TextEncoder: Send + Sync {
    // Required methods
    fn encode(&self, text: &str) -> Result<EncoderOutput>;
    fn encode_batch(&self, texts: &[&str]) -> Result<(Vec<f32>, RaggedBatch)>;
    fn hidden_dim(&self) -> usize;
    fn max_length(&self) -> usize;
    fn architecture(&self) -> &'static str;
}

Text encoder trait for transformer-based encoders.

§Motivation

Modern NER systems require converting raw text into dense vector representations that capture semantic meaning. This trait abstracts the encoding step, allowing different transformer architectures to be used interchangeably.

§Supported Architectures

Architecture   Context (tokens)   Key Features                      Speed
ModernBERT     8,192              RoPE, GeGLU, unpadded inference   3x faster
DeBERTaV3      512                Disentangled attention            Baseline
BERT/RoBERTa   512                Classic, widely available         Baseline

§Research Alignment (ModernBERT, Dec 2024)

From the ModernBERT paper (arXiv:2412.13663):

“Pareto improvements to BERT… encoder-only models offer great performance-size tradeoff for retrieval and classification.”

Key innovations:

  • Alternating Attention: Global attention every third layer, local attention with a 128-token sliding window elsewhere. This avoids paying the quadratic attention cost at every layer, reducing compute for long sequences.
  • Unpadding: “ModernBERT unpads inputs before the token embedding layer and optionally repads model outputs leading to a 10-to-20 percent performance improvement over previous methods.”
  • RoPE: Rotary positional embeddings enable extrapolation to longer sequences.
  • GeGLU: Gated activation function improves over GELU.
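
The alternating-attention schedule above can be sketched as a simple layer-indexed rule. This is an illustrative reconstruction, not code from `anno`: the `Attention` enum and `attention_for_layer` function are made-up names for the example.

```rust
// Sketch of ModernBERT-style alternating attention: every third layer
// attends globally; the rest use a local 128-token sliding window.
#[derive(Debug, PartialEq)]
enum Attention {
    Global,
    Local { window: usize },
}

fn attention_for_layer(layer: usize) -> Attention {
    if layer % 3 == 0 {
        Attention::Global
    } else {
        Attention::Local { window: 128 }
    }
}

fn main() {
    for layer in 0..6 {
        println!("layer {}: {:?}", layer, attention_for_layer(layer));
    }
}
```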

§Example

use anno::TextEncoder;

fn process_document(encoder: &dyn TextEncoder, text: &str) {
    let output = encoder.encode(text).unwrap();
    println!("Encoded {} tokens into {} dimensions",
             output.num_tokens, output.hidden_dim);

    // Token offsets map back to character positions
    for (i, (start, end)) in output.token_offsets.iter().enumerate() {
        println!("Token {}: chars {}..{}", i, start, end);
    }
}

Required Methods§


fn encode(&self, text: &str) -> Result<EncoderOutput>

Encode text into token embeddings.

§Arguments
  • text - Input text to encode
§Returns
An EncoderOutput containing:
  • Token embeddings as a flattened [num_tokens, hidden_dim] buffer
  • An attention mask indicating valid tokens
  • Token offsets mapping each token back to character positions
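
Because the embeddings come back as one flat [num_tokens, hidden_dim] buffer, the vector for token i occupies the slice i*hidden_dim..(i+1)*hidden_dim. A minimal sketch of that indexing, using a plain Vec<f32> as a stand-in for the corresponding EncoderOutput field:

```rust
// Returns the embedding slice for one token out of a flattened
// [num_tokens, hidden_dim] buffer (row-major layout).
fn token_embedding(embeddings: &[f32], hidden_dim: usize, token: usize) -> &[f32] {
    let start = token * hidden_dim;
    &embeddings[start..start + hidden_dim]
}

fn main() {
    let hidden_dim = 4;
    // 3 tokens of fake values: 0.0, 1.0, ..., 11.0
    let embeddings: Vec<f32> = (0..12).map(|i| i as f32).collect();
    // Token 1 occupies indices 4..8 of the flat buffer.
    println!("{:?}", token_embedding(&embeddings, hidden_dim, 1)); // [4.0, 5.0, 6.0, 7.0]
}
```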

fn encode_batch(&self, texts: &[&str]) -> Result<(Vec<f32>, RaggedBatch)>

Encode a batch of texts.

§Arguments
  • texts - Batch of input texts
§Returns
  • Flattened embeddings for all texts, concatenated into one Vec<f32>
  • A RaggedBatch recording per-document boundaries within that buffer
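
To recover one document's embeddings from the shared buffer, the boundary information is combined with the hidden dimension. This is an illustrative sketch only: the real RaggedBatch layout in `anno` may differ, and `RaggedBatchSketch` with a `boundaries` field of cumulative token offsets is an assumption made for the example.

```rust
// Assumed layout: documents concatenated along the token axis, with
// boundaries[i]..boundaries[i + 1] giving document i's token range.
struct RaggedBatchSketch {
    boundaries: Vec<usize>, // cumulative token counts, starting at 0
}

// Slice out all embedding values belonging to one document.
fn doc_embeddings<'a>(
    embeddings: &'a [f32],
    batch: &RaggedBatchSketch,
    hidden_dim: usize,
    doc: usize,
) -> &'a [f32] {
    let start = batch.boundaries[doc] * hidden_dim;
    let end = batch.boundaries[doc + 1] * hidden_dim;
    &embeddings[start..end]
}

fn main() {
    let hidden_dim = 2;
    // Two documents: 2 tokens and 3 tokens -> 5 tokens, 10 values total.
    let batch = RaggedBatchSketch { boundaries: vec![0, 2, 5] };
    let embeddings: Vec<f32> = (0..10).map(|i| i as f32).collect();
    println!("doc 0: {:?}", doc_embeddings(&embeddings, &batch, hidden_dim, 0));
    println!("doc 1: {:?}", doc_embeddings(&embeddings, &batch, hidden_dim, 1));
}
```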

fn hidden_dim(&self) -> usize

Get the hidden dimension of the encoder.


fn max_length(&self) -> usize

Get the maximum sequence length.


fn architecture(&self) -> &'static str

Get the encoder architecture name.

Implementors§
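
As a hedged illustration of what an implementor provides, here is a self-contained mock. EncoderOutput, RaggedBatch, and Result are simplified stand-ins defined locally for the example (their field names are assumptions, not the crate's real definitions), and WhitespaceEncoder is a toy that emits one zero vector per whitespace-separated word.

```rust
type Result<T> = std::result::Result<T, String>;

// Simplified stand-ins for the crate's types (illustrative only).
struct EncoderOutput {
    embeddings: Vec<f32>,               // flattened [num_tokens, hidden_dim]
    token_offsets: Vec<(usize, usize)>, // character span of each token
    num_tokens: usize,
    hidden_dim: usize,
}

struct RaggedBatch {
    boundaries: Vec<usize>, // cumulative token offsets per document
}

trait TextEncoder: Send + Sync {
    fn encode(&self, text: &str) -> Result<EncoderOutput>;
    fn encode_batch(&self, texts: &[&str]) -> Result<(Vec<f32>, RaggedBatch)>;
    fn hidden_dim(&self) -> usize;
    fn max_length(&self) -> usize;
    fn architecture(&self) -> &'static str;
}

/// Toy implementor: one "token" per word, zero-valued embeddings.
struct WhitespaceEncoder {
    dim: usize,
}

impl TextEncoder for WhitespaceEncoder {
    fn encode(&self, text: &str) -> Result<EncoderOutput> {
        let mut offsets = Vec::new();
        let mut pos = 0;
        for word in text.split_whitespace() {
            let start = text[pos..].find(word).unwrap() + pos;
            offsets.push((start, start + word.len()));
            pos = start + word.len();
        }
        let n = offsets.len();
        Ok(EncoderOutput {
            embeddings: vec![0.0; n * self.dim],
            token_offsets: offsets,
            num_tokens: n,
            hidden_dim: self.dim,
        })
    }

    fn encode_batch(&self, texts: &[&str]) -> Result<(Vec<f32>, RaggedBatch)> {
        let mut all = Vec::new();
        let mut boundaries = vec![0];
        for t in texts {
            let out = self.encode(t)?;
            all.extend(out.embeddings);
            boundaries.push(boundaries.last().unwrap() + out.num_tokens);
        }
        Ok((all, RaggedBatch { boundaries }))
    }

    fn hidden_dim(&self) -> usize { self.dim }
    fn max_length(&self) -> usize { 512 }
    fn architecture(&self) -> &'static str { "whitespace-mock" }
}

fn main() {
    let enc = WhitespaceEncoder { dim: 8 };
    let out = enc.encode("hello world").unwrap();
    println!("{} tokens, {} dims", out.num_tokens, out.hidden_dim);
}
```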