Module offset

Unified byte/character/token offset handling.

§The Three Coordinate Systems

When working with text, different tools use different ways to count positions. This causes bugs when tools disagree on where an entity starts and ends.

┌──────────────────────────────────────────────────────────────────────────┐
│                    THE OFFSET ALIGNMENT PROBLEM                          │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Text: "The café costs €50"                                              │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐ │
│  │ BYTE INDEX (what regex/file I/O returns)                            │ │
│  │                                                                     │ │
│  │   T   h   e       c   a   f   [  é  ]       c   o   s   t   s       │ │
│  │   0   1   2   3   4   5   6   7-8   9  10  11  12  13  14  15       │ │
│  │                               └─2 bytes─┘                           │ │
│  │                                                                     │ │
│  │   [     €     ]   5   0                                             │ │
│  │   16-17-18       19  20                                             │ │
│  │   └─3 bytes──┘                                                      │ │
│  └─────────────────────────────────────────────────────────────────────┘ │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐ │
│  │ CHAR INDEX (what humans count, what eval tools expect)              │ │
│  │                                                                     │ │
│  │   T   h   e       c   a   f   é       c   o   s   t   s       €   5 │ │
│  │   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16 │ │
│  │                               └─1 char─┘              └─1 char─┘    │ │
│  └─────────────────────────────────────────────────────────────────────┘ │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐ │
│  │ TOKEN INDEX (what BERT/transformers return)                         │ │
│  │                                                                     │ │
│  │   [CLS]  The  café  costs   €    50   [SEP]                         │ │
│  │     0     1    2      3     4     5     6                           │ │
│  │                                                                     │ │
│  │   But wait! "café" might be split:                                  │ │
│  │   [CLS]  The  ca  ##fe  costs   €    50   [SEP]                     │ │
│  │     0     1    2    3     4     5     6     7                       │ │
│  └─────────────────────────────────────────────────────────────────────┘ │
│                                                                          │
│  THE PROBLEM:                                                            │
│  • Regex finds "€50" at byte positions (16, 21)                          │
│  • Evaluation tool expects char positions (15, 18)                       │
│  • BERT returns token positions (4, 6)                                  │
│                                                                          │
│  Without conversion, your F1 score will be WRONG.                        │
└──────────────────────────────────────────────────────────────────────────┘
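The mismatch above can be reproduced with nothing but the standard library. This sketch (std only, no crate APIs) finds "€50" by byte offset, as `str::find` does, then converts to the char offsets an evaluation tool expects:

```rust
fn main() {
    let text = "The café costs €50";

    // str::find returns a BYTE offset into the UTF-8 string.
    let byte_start = text.find("€50").unwrap();
    let byte_end = byte_start + "€50".len(); // € is 3 bytes in UTF-8
    assert_eq!((byte_start, byte_end), (16, 21));

    // Convert to CHAR offsets by counting chars in the prefix.
    // O(n) per conversion; batch with a mapping table for many spans.
    let char_start = text[..byte_start].chars().count();
    let char_end = char_start + "€50".chars().count();
    assert_eq!((char_start, char_end), (15, 18));
}
```

Slicing the prefix at `byte_start` is safe here because `find` always returns a char boundary; slicing at an arbitrary byte index would panic inside a multi-byte character.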

§The Subword Problem

Transformer models split words into subword tokens. This breaks NER labels:

Text:      "playing"

Tokenizer: WordPiece splits unknown words
           "playing" → ["play", "##ing"]

Problem:   Which token gets the NER label?

┌────────────────────────────────────────────────────┐
│                 OPTION 1: First-only               │
│                                                    │
│   Tokens:  ["play", "##ing"]                       │
│   Labels:  [B-PER,    O    ]  ← "##ing" ignored!   │
│                                                    │
│   Problem: Model never learns "##ing" is part of   │
│            the entity. Loses signal.               │
├────────────────────────────────────────────────────┤
│                 OPTION 2: All tokens               │
│                                                    │
│   Tokens:  ["play", "##ing"]                       │
│   Labels:  [B-PER,  I-PER ]  ← Continuation!       │
│                                                    │
│   Better, but requires propagating labels during   │
│   both training AND inference.                     │
└────────────────────────────────────────────────────┘
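Option 2 can be sketched as a small helper. This is a hypothetical illustration (`propagate_label` is not part of this module): it expands one word-level BIO label across the word's subword pieces, turning `B-*` continuations into `I-*` and leaving `O` and `I-*` unchanged:

```rust
/// Hypothetical sketch of the "all tokens" scheme (Option 2 above):
/// expand one word-level BIO label across n_subwords pieces.
fn propagate_label(word_label: &str, n_subwords: usize) -> Vec<String> {
    let mut labels = vec![word_label.to_string()];
    // Continuation pieces get I-<type>; O and I-* labels repeat as-is.
    let cont = match word_label.strip_prefix("B-") {
        Some(ty) => format!("I-{ty}"),
        None => word_label.to_string(),
    };
    labels.extend(std::iter::repeat(cont).take(n_subwords.saturating_sub(1)));
    labels
}

fn main() {
    // "playing" → ["play", "##ing"] with word label B-PER
    assert_eq!(propagate_label("B-PER", 2), vec!["B-PER", "I-PER"]);
    assert_eq!(propagate_label("O", 2), vec!["O", "O"]);
}
```

At inference time the inverse step applies: subword predictions must be collapsed back to one label per word before evaluation.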

§Solution: Dual Representations

┌────────────────────────────────────────────────────┐
│  Use TextSpan at boundaries, TokenSpan for models  │
├────────────────────────────────────────────────────┤
│                                                    │
│  Entity: "John" in "Hello John!"                   │
│                                                    │
│  TextSpan {                                        │
│      byte_start: 6,   byte_end: 10,                │
│      char_start: 6,   char_end: 10,  // ASCII: same│
│  }                                                 │
│                                                    │
│  TokenSpan {                                       │
│      token_start: 2,  // [CLS] Hello John [SEP]    │
│      token_end: 3,    //   0     1     2     3     │
│  }                                                 │
│                                                    │
│  Store BOTH. Convert at boundaries.                │
└────────────────────────────────────────────────────┘
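The layout in the box can be sketched as plain Rust. The field names follow the diagram; the module's actual `TextSpan` and `TokenSpan` definitions may derive more traits or carry extra fields:

```rust
/// Sketch of the dual representation above (field names from the diagram;
/// the real types in this module may differ).
#[derive(Debug, Clone, Copy, PartialEq)]
struct TextSpan {
    byte_start: usize,
    byte_end: usize,
    char_start: usize,
    char_end: usize,
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct TokenSpan {
    token_start: usize,
    token_end: usize, // end-exclusive, matching the byte/char convention
}

fn main() {
    let text = "Hello John!";
    // "John" follows "Hello " (6 bytes = 6 chars, since the prefix is ASCII).
    let ts = TextSpan { byte_start: 6, byte_end: 10, char_start: 6, char_end: 10 };
    assert_eq!(&text[ts.byte_start..ts.byte_end], "John");

    // [CLS] Hello John [SEP] → "John" is token 2, so the span is 2..3.
    let tok = TokenSpan { token_start: 2, token_end: 3 };
    assert_eq!(tok.token_end - tok.token_start, 1);
}
```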

This module provides:

  • TextSpan: Stores both byte and char offsets together
  • TokenSpan: Stores subword token indices
  • OffsetMapping: Maps between token ↔ character positions
  • CharOffset: Newtype wrapper for character offsets (type safety)
  • ByteOffset: Newtype wrapper for byte offsets (type safety)

§API Boundary Conventions

Anno uses character offsets as the canonical representation at API boundaries:

| Type                        | Offset Convention | Notes                                 |
|-----------------------------|-------------------|---------------------------------------|
| Entity.start/end            | Character         | Public API, evaluation, serialization |
| Signal with Location::Text  | Character         | Grounded document model               |
| Span::Text                  | Character         | Entity span representation            |
| Backend internals           | Often byte        | Regex, JSON parsing, byte slicing     |
| Token indices               | Token             | BERT/transformer models               |

Rule of thumb: Convert to character offsets as early as possible (at the backend boundary), and use the newtype wrappers (CharOffset, ByteOffset) when you need to be explicit about which you’re working with.

§Type Safety with Newtypes

The most common source of Unicode bugs is accidentally mixing byte and character offsets. Use the newtype wrappers to make this impossible at compile time:

use anno::offset::{CharOffset, ByteOffset};

fn process_span(start: CharOffset, end: CharOffset) {
    // Can only receive CharOffset, not ByteOffset
}

let char_pos = CharOffset(5);
let byte_pos = ByteOffset(10);

process_span(char_pos, CharOffset(10));  // OK
// process_span(byte_pos, CharOffset(10));  // Compile error!
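The wrappers themselves follow the standard Rust newtype pattern. A minimal self-contained sketch (the crate's actual definitions may derive additional traits):

```rust
// Minimal sketch of the newtype pattern behind CharOffset/ByteOffset;
// each is a distinct type wrapping usize, so the compiler rejects mixing.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct CharOffset(pub usize);

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct ByteOffset(pub usize);

fn process_span(start: CharOffset, end: CharOffset) -> usize {
    end.0 - start.0 // span length in chars; .0 unwraps the newtype
}

fn main() {
    assert_eq!(process_span(CharOffset(5), CharOffset(10)), 5);
    // process_span(ByteOffset(5), CharOffset(10)); // ← would not compile
}
```

Because the wrappers are `Copy` over a single `usize`, they cost nothing at runtime; the distinction exists only in the type system.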

Structs§

ByteOffset
A byte offset (raw byte index into UTF-8 string).
ByteRange
A byte range (start and end as byte offsets).
CharOffset
A character offset (Unicode scalar value index).
CharRange
A character range (start and end as character offsets).
OffsetMapping
Offset mapping from tokenizer.
SpanConverter
Converter for efficiently handling many spans from the same text.
TextSpan
A text span with both byte and character offsets.
TokenSpan
Span in subword token space.

Functions§

build_byte_to_char_map
Build an offset mapping table for efficient repeated conversions.
build_char_to_byte_map
Build an offset mapping table from char to byte.
bytes_to_chars
Convert byte offsets to character offsets.
chars_to_bytes
Convert character offsets to byte offsets.
is_ascii
Fast check if text is ASCII-only.
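For repeated conversions, a precomputed table amortizes the per-span `chars().count()` cost. A std-only sketch of what such a byte→char table can look like (the signature of this module's `build_byte_to_char_map` may differ):

```rust
/// Std-only sketch of a byte→char mapping table: map[b] is the char index
/// at byte b, or None when b falls inside a multi-byte character.
fn byte_to_char_map(text: &str) -> Vec<Option<usize>> {
    let mut map = vec![None; text.len() + 1];
    for (char_idx, (byte_idx, _)) in text.char_indices().enumerate() {
        map[byte_idx] = Some(char_idx);
    }
    map[text.len()] = Some(text.chars().count()); // end-of-text sentinel
    map
}

fn main() {
    let map = byte_to_char_map("café!");
    // 'é' occupies bytes 3-4; only byte 3 is a valid char boundary.
    assert_eq!(map[3], Some(3)); // é
    assert_eq!(map[4], None);    // continuation byte: not a boundary
    assert_eq!(map[5], Some(4)); // '!'
}
```

Building the table is O(n) once; each subsequent lookup is O(1), which is the trade-off `SpanConverter` exists to exploit when many spans come from the same text.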