pub trait TextEncoder: Send + Sync {
// Required methods
fn encode(&self, text: &str) -> Result<EncoderOutput>;
fn encode_batch(&self, texts: &[&str]) -> Result<(Vec<f32>, RaggedBatch)>;
fn hidden_dim(&self) -> usize;
fn max_length(&self) -> usize;
fn architecture(&self) -> &'static str;
}
Text encoder trait for transformer-based encoders.
§Motivation
Modern NER systems require converting raw text into dense vector representations that capture semantic meaning. This trait abstracts the encoding step, allowing different transformer architectures to be used interchangeably.
§Supported Architectures
| Architecture | Context (tokens) | Key Features | Speed |
|---|---|---|---|
| ModernBERT | 8,192 | RoPE, GeGLU, unpadded inference | ~3× baseline |
| DeBERTaV3 | 512 | Disentangled attention | Baseline |
| BERT/RoBERTa | 512 | Classic, widely available | Baseline |
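The context column matters when feeding long documents to a 512-token encoder. A minimal sketch of pre-chunking text to fit a context window, assuming a rough 4-characters-per-token heuristic (the heuristic and the `chunk_for_context` helper are illustrative, not part of this crate):

```rust
/// Split `text` into pieces that should each fit within `max_tokens`,
/// using a rough 4-chars-per-token budget (an assumption; a real
/// implementation would use the encoder's own tokenizer).
fn chunk_for_context(text: &str, max_tokens: usize) -> Vec<String> {
    let budget = max_tokens * 4; // chars per chunk, heuristic only
    text.chars()
        .collect::<Vec<_>>()
        .chunks(budget)
        .map(|c| c.iter().collect())
        .collect()
}

fn main() {
    // 20 chars with a 2-token (8-char) budget -> chunks of 8, 8, 4 chars.
    let text = "a".repeat(20);
    let chunks = chunk_for_context(&text, 2);
    assert_eq!(chunks.len(), 3);
    assert_eq!(chunks[0].len(), 8);
    assert_eq!(chunks[2].len(), 4);
    println!("split into {} chunks", chunks.len());
}
```

Chunking by characters is safe for UTF-8 here because the split happens on `char` boundaries, never mid-codepoint.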
§Research Alignment (ModernBERT, Dec 2024)
From ModernBERT paper (arXiv:2412.13663):
“Pareto improvements to BERT… encoder-only models offer great performance-size tradeoff for retrieval and classification.”
Key innovations:
- Alternating Attention: Global attention every 3 layers, local (128-token window) elsewhere. Reduces complexity for long sequences.
- Unpadding: “ModernBERT unpads inputs before the token embedding layer and optionally repads model outputs leading to a 10-to-20 percent performance improvement over previous methods.”
- RoPE: Rotary positional embeddings enable extrapolation to longer sequences.
- GeGLU: Gated activation function improves over GELU.
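The unpadding idea above can be sketched in a few lines: instead of padding every sequence to the batch maximum, concatenate the real tokens into one packed buffer and record per-sequence offsets. The `pack_unpadded` helper below is a hypothetical illustration of that layout, not this crate's implementation:

```rust
/// Pack variable-length token sequences into one contiguous buffer,
/// recording where each sequence starts (a RaggedBatch-style layout).
fn pack_unpadded(seqs: &[Vec<u32>]) -> (Vec<u32>, Vec<usize>) {
    let mut packed = Vec::new();
    let mut offsets = Vec::with_capacity(seqs.len() + 1);
    offsets.push(0);
    for s in seqs {
        packed.extend_from_slice(s);
        offsets.push(packed.len());
    }
    (packed, offsets)
}

fn main() {
    // Lengths 3, 1, 2: padded storage would cost 3 * 3 = 9 slots;
    // the packed buffer holds only the 6 real tokens.
    let seqs = vec![vec![1, 2, 3], vec![4], vec![5, 6]];
    let (packed, offsets) = pack_unpadded(&seqs);
    assert_eq!(packed.len(), 6);
    assert_eq!(offsets, vec![0, 3, 4, 6]);
    // Recover sequence 2 via its offset range.
    assert_eq!(&packed[offsets[2]..offsets[3]], &[5, 6]);
    println!("packed {} tokens instead of 9 padded slots", packed.len());
}
```

The savings grow with length variance, which is why unpadding pays off most on batches mixing short and long texts.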
§Example
use anno::TextEncoder;

fn process_document(encoder: &dyn TextEncoder, text: &str) {
    let output = encoder.encode(text).unwrap();
    println!(
        "Encoded {} tokens into {} dimensions",
        output.num_tokens, output.hidden_dim
    );
    // Token offsets map back to character positions.
    for (i, (start, end)) in output.token_offsets.iter().enumerate() {
        println!("Token {}: chars {}..{}", i, start, end);
    }
}
Required Methods§
fn encode(&self, text: &str) -> Result<EncoderOutput>
Encode a single text, returning token embeddings and character offsets.
fn encode_batch(&self, texts: &[&str]) -> Result<(Vec<f32>, RaggedBatch)>
Encode a batch of texts into a flat embedding buffer plus batch layout metadata.
fn hidden_dim(&self) -> usize
Get the hidden dimension of the encoder.
fn max_length(&self) -> usize
Get the maximum sequence length.
fn architecture(&self) -> &'static str
Get the encoder architecture name.