Module tokenization


§Tokenization Module

This module provides accurate token counting using OpenAI’s tiktoken tokenizer, replacing the simple character-based estimation used previously.
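To see what the replaced character-based estimation looks like, here is a minimal sketch of such a heuristic. The function name and the ratio of roughly 4 characters per token are assumptions for illustration, not taken from this crate:

```rust
/// Hypothetical legacy-style estimate: assumes ~4 characters per token.
/// This sketches the kind of character-based estimation this module
/// replaced; it is not the crate's actual implementation.
fn estimate_tokens_by_chars(content: &str) -> usize {
    // Ceiling division so short non-empty strings count as at least 1 token.
    (content.chars().count() + 3) / 4
}

fn main() {
    let code = "fn main() { println!(\"Hello, world!\"); }";
    // 40 characters -> estimate of 10 tokens. A real tokenizer can differ
    // noticeably on code, which is why content-aware counting matters.
    println!("estimated tokens: {}", estimate_tokens_by_chars(code));
}
```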

§Features

  • Accurate Token Counting: Uses tiktoken cl100k_base encoding (GPT-4 compatible)
  • Multiple Encoding Support: Supports different OpenAI encodings
  • Content-Aware Estimation: Handles code content more accurately than character counting
  • Budget Management: Token budget allocation and tracking

§Usage

use scribe_core::tokenization::{TokenCounter, TokenizerConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = TokenizerConfig::default();
    let counter = TokenCounter::new(config)?;

    let content = "fn main() { println!(\"Hello, world!\"); }";
    let token_count = counter.count_tokens(content)?;
    println!("Token count: {}", token_count);
    Ok(())
}

Modules§

utils
Utilities for working with tokens and content

Structs§

TokenBudget
Token budget tracker for selection algorithms
TokenCounter
Main tokenizer interface for accurate token counting
TokenizationComparison
Comparison between tiktoken and legacy tokenization
TokenizerConfig
Configuration for the tokenizer
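TokenBudget's exact interface is not shown on this page. As an illustration of what "token budget allocation and tracking for selection algorithms" means in practice, here is a self-contained sketch; all names below (`Budget`, `try_consume`, `remaining`) are hypothetical:

```rust
/// Hypothetical budget tracker sketching the idea behind a type like
/// `TokenBudget`; not its actual interface.
struct Budget {
    limit: usize,
    used: usize,
}

impl Budget {
    fn new(limit: usize) -> Self {
        Budget { limit, used: 0 }
    }

    /// Consume `tokens` from the budget if they fit; returns whether they did.
    fn try_consume(&mut self, tokens: usize) -> bool {
        if self.used + tokens <= self.limit {
            self.used += tokens;
            true
        } else {
            false
        }
    }

    fn remaining(&self) -> usize {
        self.limit - self.used
    }
}

fn main() {
    // Greedy selection: include items while their token cost fits the budget.
    let mut budget = Budget::new(100);
    for &cost in &[40, 35, 50, 20] {
        let included = budget.try_consume(cost);
        println!("cost {cost:>2}: included = {included}, remaining = {}", budget.remaining());
    }
}
```

A selection algorithm would pair this with `TokenCounter`: count each candidate's tokens, then include it only if the budget still has room.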

Enums§

ContentType
Content type for budget recommendations
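ContentType's variants are not listed on this page. As a hypothetical illustration of how a content type could drive budget recommendations, here is a sketch; the variant names and ratios below are assumptions, not the crate's values:

```rust
/// Hypothetical content categories; the real `ContentType` variants are
/// not shown on this page.
#[derive(Debug, Clone, Copy)]
enum Kind {
    Code,
    Prose,
}

/// Assumed chars-per-token ratios: code tends to tokenize more densely
/// than prose due to punctuation and identifiers. Numbers are illustrative.
fn chars_per_token(kind: Kind) -> f64 {
    match kind {
        Kind::Code => 3.0,
        Kind::Prose => 4.0,
    }
}

fn main() {
    for kind in [Kind::Code, Kind::Prose] {
        println!("{kind:?}: ~{} chars/token", chars_per_token(kind));
    }
}
```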