Unified byte/character/token offset handling.
§The Three Coordinate Systems
When working with text, different tools use different ways to count positions. This causes bugs when tools disagree on where an entity starts and ends.
┌──────────────────────────────────────────────────────────────────────────┐
│ THE OFFSET ALIGNMENT PROBLEM │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ Text: "The café costs €50" │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ BYTE INDEX (what regex/file I/O returns) │ │
│ │ │ │
│ │ T h e c a f [ é ] c o s t s │ │
│ │ 0 1 2 3 4 5 6 7-8 9 10 11 12 13 14 │ │
│ │ └─2 bytes─┘ │ │
│ │ │ │
│ │ [ € ] 5 0 │ │
│ │ 16-17-18 19 20 │ │
│ │ └─3 bytes──┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ CHAR INDEX (what humans count, what eval tools expect) │ │
│ │ │ │
│ │ T h e c a f é c o s t s € 5 │ │
│ │ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 │ │
│ │ └─1 char─┘ └─1 char─┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ TOKEN INDEX (what BERT/transformers return) │ │
│ │ │ │
│ │ [CLS] The café costs € 50 [SEP] │ │
│ │ 0 1 2 3 4 5 6 │ │
│ │ │ │
│ │ But wait! "café" might be split: │ │
│ │ [CLS] The ca ##fe costs € 50 [SEP] │ │
│ │ 0 1 2 3 4 5 6 7 │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ THE PROBLEM: │
│ • Regex finds "€50" at byte positions (16, 21) │
│ • Evaluation tool expects char positions (15, 18) │
│ • BERT returns token positions (4, 6) │
│ │
│ Without conversion, your F1 score will be WRONG. │
└──────────────────────────────────────────────────────────────────────────┘
§The Subword Problem
Transformer models split words into subword tokens. This breaks NER labels:
Text: "playing"
Tokenizer: WordPiece splits unknown words
"playing" → ["play", "##ing"]
Problem: Which token gets the NER label?
┌────────────────────────────────────────────────────┐
│ OPTION 1: First-only │
│ │
│ Tokens: ["play", "##ing"] │
│ Labels: [B-PER, O ] ← "##ing" ignored! │
│ │
│ Problem: Model never learns "##ing" is part of │
│ the entity. Loses signal. │
├────────────────────────────────────────────────────┤
│ OPTION 2: All tokens │
│ │
│ Tokens: ["play", "##ing"] │
│ Labels: [B-PER, I-PER ] ← Continuation! │
│ │
│ Better, but requires propagating labels during │
│ both training AND inference. │
└────────────────────────────────────────────────────┘
§Solution: Dual Representations
┌────────────────────────────────────────────────────┐
│ Use TextSpan at boundaries, TokenSpan for models │
├────────────────────────────────────────────────────┤
│ │
│ Entity: "John" in "Hello John!" │
│ │
│ TextSpan { │
│ byte_start: 6, byte_end: 10, │
│ char_start: 6, char_end: 10, // ASCII: same│
│ } │
│ │
│ TokenSpan { │
│ token_start: 2, // [CLS] Hello John [SEP] │
│ token_end: 3, // 0 1 2 3 │
│ } │
│ │
│ Store BOTH. Convert at boundaries. │
└────────────────────────────────────────────────────┘
This module provides:
- TextSpan: Stores both byte and char offsets together
- TokenSpan: Stores subword token indices
- OffsetMapping: Maps between token ↔ character positions
- CharOffset: Newtype wrapper for character offsets (type safety)
- ByteOffset: Newtype wrapper for byte offsets (type safety)
§API Boundary Conventions
Anno uses character offsets as the canonical representation at API boundaries:
| Type | Offset Convention | Notes |
|---|---|---|
| Entity.start/end | Character | Public API, evaluation, serialization |
| Signal with Location::Text | Character | Grounded document model |
| Span::Text | Character | Entity span representation |
| Backend internals | Often byte | Regex, JSON parsing, byte slicing |
| Token indices | Token | BERT/transformer models |
Rule of thumb: Convert to character offsets as early as possible (at the
backend boundary), and use the newtype wrappers (CharOffset, ByteOffset)
when you need to be explicit about which you’re working with.
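The underlying conversion can be sketched with the standard library alone. The helper names below are illustrative stand-ins, not the crate's actual `bytes_to_chars`/`chars_to_bytes` signatures:

```rust
/// Map a byte offset to its character offset by counting chars up to it.
/// Returns None if `byte` is out of range or not on a char boundary.
fn byte_to_char(text: &str, byte: usize) -> Option<usize> {
    if byte == text.len() {
        return Some(text.chars().count()); // end-of-text offset
    }
    // char_indices yields (byte_index, char); the iterator position of
    // the matching byte index is exactly the character offset.
    text.char_indices().position(|(b, _)| b == byte)
}

/// Map a character offset back to its byte offset.
fn char_to_byte(text: &str, ch: usize) -> Option<usize> {
    if ch == text.chars().count() {
        return Some(text.len()); // end-of-text offset
    }
    text.char_indices().nth(ch).map(|(b, _)| b)
}
```

On the running example, byte offset 16 (the start of "€" in "The café costs €50") maps to character offset 15, and a byte offset landing inside the 3-byte "€" maps to None rather than a wrong answer.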
§Type Safety with Newtypes
The most common source of Unicode bugs is accidentally mixing byte and character offsets. Use the newtype wrappers to make this impossible at compile time:
```rust
use anno::offset::{CharOffset, ByteOffset};

fn process_span(start: CharOffset, end: CharOffset) {
    // Can only receive CharOffset, not ByteOffset
}

let char_pos = CharOffset(5);
let byte_pos = ByteOffset(10);

process_span(char_pos, CharOffset(10)); // OK
// process_span(byte_pos, CharOffset(10)); // Compile error!
```

Structs§
- ByteOffset - A byte offset (raw byte index into UTF-8 string).
- ByteRange - A byte range (start and end as byte offsets).
- CharOffset - A character offset (Unicode scalar value index).
- CharRange - A character range (start and end as character offsets).
- OffsetMapping - Offset mapping from tokenizer.
- SpanConverter - Converter for efficiently handling many spans from the same text.
- TextSpan - A text span with both byte and character offsets.
- TokenSpan - Span in subword token space.
Functions§
- build_byte_to_char_map - Build an offset mapping table for efficient repeated conversions.
- build_char_to_byte_map - Build an offset mapping table from char to byte.
- bytes_to_chars - Convert byte offsets to character offsets.
- chars_to_bytes - Convert character offsets to byte offsets.
- is_ascii - Fast check if text is ASCII-only.
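The "all tokens" label-propagation option from the subword section can likewise be sketched as a standalone function. This is illustrative only; the function name is an assumption, not part of the `anno` API, which works through OffsetMapping and TokenSpan instead:

```rust
/// Spread a word-level BIO label across that word's subword tokens
/// (the "all tokens" option): the first subtoken keeps the label,
/// and continuations of a B-X or I-X word become I-X.
/// Assumes n_subtokens >= 1.
fn propagate_label(word_label: &str, n_subtokens: usize) -> Vec<String> {
    // Continuation label: B-PER / I-PER -> I-PER; "O" stays "O".
    let inside = match word_label
        .strip_prefix("B-")
        .or_else(|| word_label.strip_prefix("I-"))
    {
        Some(kind) => format!("I-{kind}"),
        None => word_label.to_string(),
    };
    let mut labels = vec![word_label.to_string()];
    labels.extend(std::iter::repeat(inside).take(n_subtokens.saturating_sub(1)));
    labels
}
```

For the earlier example, a B-PER word split into ["play", "##ing"] gets ["B-PER", "I-PER"], so the continuation token still carries entity signal.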