
Trait TextEncoder

pub trait TextEncoder: Send + Sync {
    // Required methods
    fn encode(&self, text: &str) -> Result<EncoderOutput>;
    fn encode_batch(&self, texts: &[&str]) -> Result<(Vec<f32>, RaggedBatch)>;
    fn hidden_dim(&self) -> usize;
    fn max_length(&self) -> usize;
    fn architecture(&self) -> &'static str;
}

Text encoder trait for transformer-based encoders.

§Motivation

Modern NER systems require converting raw text into dense vector representations that capture semantic meaning. This trait abstracts the encoding step, allowing different transformer architectures to be used interchangeably.

§Supported Architectures

Architecture   Context (tokens)   Key Features                      Speed
ModernBERT     8,192              RoPE, GeGLU, unpadded inference   3x faster
DeBERTaV3      512                Disentangled attention            Baseline
BERT/RoBERTa   512                Classic, widely available         Baseline

§Research Alignment (ModernBERT, Dec 2024)

From the ModernBERT paper (arXiv:2412.13663):

“Pareto improvements to BERT… encoder-only models offer great performance-size tradeoff for retrieval and classification.”

Key innovations:

  • Alternating Attention: Global attention every third layer, local attention with a 128-token sliding window elsewhere. This avoids paying the quadratic attention cost at every layer, reducing compute for long sequences.
  • Unpadding: “ModernBERT unpads inputs before the token embedding layer and optionally repads model outputs leading to a 10-to-20 percent performance improvement over previous methods.”
  • RoPE: Rotary positional embeddings enable extrapolation to longer sequences.
  • GeGLU: Gated activation function improves over GELU.
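
The alternating-attention schedule above can be sketched as a simple layer-indexed rule. This is an illustrative reconstruction, not code from `anno`: the `Attention` enum and `attention_for_layer` function are made-up names for the example.

```rust
// Sketch of ModernBERT-style alternating attention: every third layer
// attends globally; the rest use a local 128-token sliding window.
#[derive(Debug, PartialEq)]
enum Attention {
    Global,
    Local { window: usize },
}

fn attention_for_layer(layer: usize) -> Attention {
    if layer % 3 == 0 {
        Attention::Global
    } else {
        Attention::Local { window: 128 }
    }
}

fn main() {
    for layer in 0..6 {
        println!("layer {}: {:?}", layer, attention_for_layer(layer));
    }
}
```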

§Example

use anno::TextEncoder;

fn process_document(encoder: &dyn TextEncoder, text: &str) {
    let output = encoder.encode(text).unwrap();
    println!("Encoded {} tokens into {} dimensions",
             output.num_tokens, output.hidden_dim);

    // Token offsets map back to character positions
    for (i, (start, end)) in output.token_offsets.iter().enumerate() {
        println!("Token {}: chars {}..{}", i, start, end);
    }
}

Required Methods§


fn encode(&self, text: &str) -> Result<EncoderOutput>

Encode text into token embeddings.

§Arguments
  • text - Input text to encode
§Returns
An EncoderOutput containing:
  • Token embeddings as a flattened [num_tokens, hidden_dim] buffer
  • An attention mask indicating valid tokens
  • Token offsets mapping each token back to character positions
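
Because the embeddings come back as one flat [num_tokens, hidden_dim] buffer, the vector for token i occupies the slice i*hidden_dim..(i+1)*hidden_dim. A minimal sketch of that indexing, using a plain Vec<f32> as a stand-in for the corresponding EncoderOutput field:

```rust
// Returns the embedding slice for one token out of a flattened
// [num_tokens, hidden_dim] buffer (row-major layout).
fn token_embedding(embeddings: &[f32], hidden_dim: usize, token: usize) -> &[f32] {
    let start = token * hidden_dim;
    &embeddings[start..start + hidden_dim]
}

fn main() {
    let hidden_dim = 4;
    // 3 tokens of fake values: 0.0, 1.0, ..., 11.0
    let embeddings: Vec<f32> = (0..12).map(|i| i as f32).collect();
    // Token 1 occupies indices 4..8 of the flat buffer.
    println!("{:?}", token_embedding(&embeddings, hidden_dim, 1)); // [4.0, 5.0, 6.0, 7.0]
}
```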

fn encode_batch(&self, texts: &[&str]) -> Result<(Vec<f32>, RaggedBatch)>

Encode a batch of texts.

§Arguments
  • texts - Batch of input texts
§Returns
  • Flattened embeddings for all texts, concatenated into one Vec<f32>
  • A RaggedBatch recording per-document boundaries within that buffer
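
To recover one document's embeddings from the shared buffer, the boundary information is combined with the hidden dimension. This is an illustrative sketch only: the real RaggedBatch layout in `anno` may differ, and `RaggedBatchSketch` with a `boundaries` field of cumulative token offsets is an assumption made for the example.

```rust
// Assumed layout: documents concatenated along the token axis, with
// boundaries[i]..boundaries[i + 1] giving document i's token range.
struct RaggedBatchSketch {
    boundaries: Vec<usize>, // cumulative token counts, starting at 0
}

// Slice out all embedding values belonging to one document.
fn doc_embeddings<'a>(
    embeddings: &'a [f32],
    batch: &RaggedBatchSketch,
    hidden_dim: usize,
    doc: usize,
) -> &'a [f32] {
    let start = batch.boundaries[doc] * hidden_dim;
    let end = batch.boundaries[doc + 1] * hidden_dim;
    &embeddings[start..end]
}

fn main() {
    let hidden_dim = 2;
    // Two documents: 2 tokens and 3 tokens -> 5 tokens, 10 values total.
    let batch = RaggedBatchSketch { boundaries: vec![0, 2, 5] };
    let embeddings: Vec<f32> = (0..10).map(|i| i as f32).collect();
    println!("doc 0: {:?}", doc_embeddings(&embeddings, &batch, hidden_dim, 0));
    println!("doc 1: {:?}", doc_embeddings(&embeddings, &batch, hidden_dim, 1));
}
```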

fn hidden_dim(&self) -> usize

Get the hidden dimension of the encoder.


fn max_length(&self) -> usize

Get the maximum sequence length.


fn architecture(&self) -> &'static str

Get the encoder architecture name.

Implementors§
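
As a hedged illustration of what an implementor provides, here is a self-contained mock. EncoderOutput, RaggedBatch, and Result are simplified stand-ins defined locally for the example (their field names are assumptions, not the crate's real definitions), and WhitespaceEncoder is a toy that emits one zero vector per whitespace-separated word.

```rust
type Result<T> = std::result::Result<T, String>;

// Simplified stand-ins for the crate's types (illustrative only).
struct EncoderOutput {
    embeddings: Vec<f32>,               // flattened [num_tokens, hidden_dim]
    token_offsets: Vec<(usize, usize)>, // character span of each token
    num_tokens: usize,
    hidden_dim: usize,
}

struct RaggedBatch {
    boundaries: Vec<usize>, // cumulative token offsets per document
}

trait TextEncoder: Send + Sync {
    fn encode(&self, text: &str) -> Result<EncoderOutput>;
    fn encode_batch(&self, texts: &[&str]) -> Result<(Vec<f32>, RaggedBatch)>;
    fn hidden_dim(&self) -> usize;
    fn max_length(&self) -> usize;
    fn architecture(&self) -> &'static str;
}

/// Toy implementor: one "token" per word, zero-valued embeddings.
struct WhitespaceEncoder {
    dim: usize,
}

impl TextEncoder for WhitespaceEncoder {
    fn encode(&self, text: &str) -> Result<EncoderOutput> {
        let mut offsets = Vec::new();
        let mut pos = 0;
        for word in text.split_whitespace() {
            let start = text[pos..].find(word).unwrap() + pos;
            offsets.push((start, start + word.len()));
            pos = start + word.len();
        }
        let n = offsets.len();
        Ok(EncoderOutput {
            embeddings: vec![0.0; n * self.dim],
            token_offsets: offsets,
            num_tokens: n,
            hidden_dim: self.dim,
        })
    }

    fn encode_batch(&self, texts: &[&str]) -> Result<(Vec<f32>, RaggedBatch)> {
        let mut all = Vec::new();
        let mut boundaries = vec![0];
        for t in texts {
            let out = self.encode(t)?;
            all.extend(out.embeddings);
            boundaries.push(boundaries.last().unwrap() + out.num_tokens);
        }
        Ok((all, RaggedBatch { boundaries }))
    }

    fn hidden_dim(&self) -> usize { self.dim }
    fn max_length(&self) -> usize { 512 }
    fn architecture(&self) -> &'static str { "whitespace-mock" }
}

fn main() {
    let enc = WhitespaceEncoder { dim: 8 };
    let out = enc.encode("hello world").unwrap();
    println!("{} tokens, {} dims", out.num_tokens, out.hidden_dim);
}
```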