Unified byte/character/token offset handling.
§The Three Coordinate Systems
When working with text, different tools use different ways to count positions. This causes bugs when tools disagree on where an entity starts and ends.
┌──────────────────────────────────────────────────────────────────────────┐
│ THE OFFSET ALIGNMENT PROBLEM │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ Text: "The café costs €50" │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ BYTE INDEX (what regex/file I/O returns) │ │
│ │ │ │
│ │ T h e c a f [ é ] c o s t s │ │
│ │ 0 1 2 3 4 5 6 7-8 9 10 11 12 13 14 │ │
│ │ └─2 bytes─┘ │ │
│ │ │ │
│ │ [ € ] 5 0 │ │
│ │ 16-17-18 19 20 │ │
│ │ └─3 bytes──┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ CHAR INDEX (what humans count, what eval tools expect) │ │
│ │ │ │
│ │ T h e c a f é c o s t s € 5 │ │
│ │ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 │ │
│ │ └─1 char─┘ └─1 char─┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ TOKEN INDEX (what BERT/transformers return) │ │
│ │ │ │
│ │ [CLS] The café costs € 50 [SEP] │ │
│ │ 0 1 2 3 4 5 6 │ │
│ │ │ │
│ │ But wait! "café" might be split: │ │
│ │ [CLS] The ca ##fe costs € 50 [SEP] │ │
│ │ 0 1 2 3 4 5 6 7 │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ THE PROBLEM: │
│ • Regex finds "€50" at byte positions (16, 21) │
│ • Evaluation tool expects char positions (15, 18) │
│ • BERT returns token positions (4, 6) │
│ │
│ Without conversion, your F1 score will be WRONG. │
└──────────────────────────────────────────────────────────────────────────┘
§The Subword Problem
Transformer models split words into subword tokens. This breaks NER labels:
Text: "playing"
Tokenizer: WordPiece splits unknown words
"playing" → ["play", "##ing"]
Problem: Which token gets the NER label?
┌────────────────────────────────────────────────────┐
│ OPTION 1: First-only │
│ │
│ Tokens: ["play", "##ing"] │
│ Labels: [B-PER, O ] ← "##ing" ignored! │
│ │
│ Problem: Model never learns "##ing" is part of │
│ the entity. Loses signal. │
├────────────────────────────────────────────────────┤
│ OPTION 2: All tokens │
│ │
│ Tokens: ["play", "##ing"] │
│ Labels: [B-PER, I-PER ] ← Continuation! │
│ │
│ Better, but requires propagating labels during │
│ both training AND inference. │
└────────────────────────────────────────────────────┘
§Solution: Dual Representations
┌────────────────────────────────────────────────────┐
│ Use TextSpan at boundaries, TokenSpan for models │
├────────────────────────────────────────────────────┤
│ │
│ Entity: "John" in "Hello John!" │
│ │
│ TextSpan { │
│ byte_start: 6, byte_end: 10, │
│ char_start: 6, char_end: 10, // ASCII: same│
│ } │
│ │
│ TokenSpan { │
│ token_start: 2, // [CLS] Hello John [SEP] │
│ token_end: 3, // 0 1 2 3 │
│ } │
│ │
│ Store BOTH. Convert at boundaries. │
└────────────────────────────────────────────────────┘
This module provides:
- TextSpan: Stores both byte and char offsets together
- TokenSpan: Stores subword token indices
- OffsetMapping: Maps between token ↔ character positions
- CharOffset: Newtype wrapper for character offsets (type safety)
- ByteOffset: Newtype wrapper for byte offsets (type safety)
§API Boundary Conventions
Anno uses character offsets as the canonical representation at API boundaries:
| Type | Offset Convention | Notes |
|---|---|---|
| Entity.start/end | Character | Public API, evaluation, serialization |
| Signal with Location::Text | Character | Grounded document model |
| Span::Text | Character | Entity span representation |
| Backend internals | Often byte | Regex, JSON parsing, byte slicing |
| Token indices | Token | BERT/transformer models |
Rule of thumb: Convert to character offsets as early as possible (at the
backend boundary), and use the newtype wrappers (CharOffset, ByteOffset)
when you need to be explicit about which you’re working with.
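The underlying conversion can be sketched with the standard library alone. The helper names below are illustrative stand-ins, not the crate's actual `bytes_to_chars`/`chars_to_bytes` signatures:

```rust
/// Map a byte offset to its character offset by counting chars up to it.
/// Returns None if `byte` is out of range or not on a char boundary.
fn byte_to_char(text: &str, byte: usize) -> Option<usize> {
    if byte == text.len() {
        return Some(text.chars().count()); // end-of-text offset
    }
    // char_indices yields (byte_index, char); the iterator position of
    // the matching byte index is exactly the character offset.
    text.char_indices().position(|(b, _)| b == byte)
}

/// Map a character offset back to its byte offset.
fn char_to_byte(text: &str, ch: usize) -> Option<usize> {
    if ch == text.chars().count() {
        return Some(text.len()); // end-of-text offset
    }
    text.char_indices().nth(ch).map(|(b, _)| b)
}
```

On the running example, byte offset 16 (the start of "€" in "The café costs €50") maps to character offset 15, and a byte offset landing inside the 3-byte "€" maps to None rather than a wrong answer.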
§Type Safety with Newtypes
The most common source of Unicode bugs is accidentally mixing byte and character offsets. Use the newtype wrappers to make this impossible at compile time:
```rust
use anno::offset::{CharOffset, ByteOffset};

fn process_span(start: CharOffset, end: CharOffset) {
    // Can only receive CharOffset, not ByteOffset
}

let char_pos = CharOffset(5);
let byte_pos = ByteOffset(10);

process_span(char_pos, CharOffset(10)); // OK
// process_span(byte_pos, CharOffset(10)); // Compile error!
```

Structs§
- ByteOffset - A byte offset (raw byte index into UTF-8 string).
- ByteRange - A byte range (start and end as byte offsets).
- CharOffset - A character offset (Unicode scalar value index).
- CharRange - A character range (start and end as character offsets).
- OffsetMapping - Offset mapping from tokenizer.
- SpanConverter - Converter for efficiently handling many spans from the same text.
- TextSpan - A text span with both byte and character offsets.
- TokenSpan - Span in subword token space.
Functions§
- build_byte_to_char_map - Build an offset mapping table for efficient repeated conversions.
- build_char_to_byte_map - Build an offset mapping table from char to byte.
- bytes_to_chars - Convert byte offsets to character offsets.
- chars_to_bytes - Convert character offsets to byte offsets.
- is_ascii - Fast check if text is ASCII-only.
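The "all tokens" label-propagation option from the subword section can likewise be sketched as a standalone function. This is illustrative only; the function name is an assumption, not part of the `anno` API, which works through OffsetMapping and TokenSpan instead:

```rust
/// Spread a word-level BIO label across that word's subword tokens
/// (the "all tokens" option): the first subtoken keeps the label,
/// and continuations of a B-X or I-X word become I-X.
/// Assumes n_subtokens >= 1.
fn propagate_label(word_label: &str, n_subtokens: usize) -> Vec<String> {
    // Continuation label: B-PER / I-PER -> I-PER; "O" stays "O".
    let inside = match word_label
        .strip_prefix("B-")
        .or_else(|| word_label.strip_prefix("I-"))
    {
        Some(kind) => format!("I-{kind}"),
        None => word_label.to_string(),
    };
    let mut labels = vec![word_label.to_string()];
    labels.extend(std::iter::repeat(inside).take(n_subtokens.saturating_sub(1)));
    labels
}
```

For the earlier example, a B-PER word split into ["play", "##ing"] gets ["B-PER", "I-PER"], so the continuation token still carries entity signal.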