Module encoding

Character Encoding Detection and Conversion Module

This module detects and converts character encodings so that text from real-world PDF files, which use many different encodings, can be decoded reliably.

§Overview

Many PDFs contain text encoded in various character sets beyond UTF-8, including:

  • Latin-1 (ISO 8859-1) - Common in European documents
  • Windows-1252 - Microsoft’s extension of Latin-1
  • MacRoman - Apple’s legacy encoding
  • PDF-specific encodings, such as PDFDocEncoding and StandardEncoding

This module provides automatic detection and graceful conversion with fallback handling for unrecognized characters.
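
The detect-then-convert idea can be illustrated with a minimal, self-contained sketch. It does not use this module's API: the names `decode_latin1`, `decode_windows_1252`, and `decode_with_fallback` are hypothetical, and the detection heuristic (strict UTF-8 first, Windows-1252 otherwise) is an assumption chosen for brevity, not necessarily what the module does.

```rust
/// Windows-1252 mappings for bytes 0x80..=0x9F. `None` marks the five
/// bytes that the encoding leaves undefined.
const CP1252_HIGH: [Option<char>; 32] = [
    Some('\u{20AC}'), None, Some('\u{201A}'), Some('\u{0192}'),
    Some('\u{201E}'), Some('\u{2026}'), Some('\u{2020}'), Some('\u{2021}'),
    Some('\u{02C6}'), Some('\u{2030}'), Some('\u{0160}'), Some('\u{2039}'),
    Some('\u{0152}'), None, Some('\u{017D}'), None,
    None, Some('\u{2018}'), Some('\u{2019}'), Some('\u{201C}'),
    Some('\u{201D}'), Some('\u{2022}'), Some('\u{2013}'), Some('\u{2014}'),
    Some('\u{02DC}'), Some('\u{2122}'), Some('\u{0161}'), Some('\u{203A}'),
    Some('\u{0153}'), None, Some('\u{017E}'), Some('\u{0178}'),
];

/// Latin-1 (ISO 8859-1): every byte maps to the code point of the same value.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Windows-1252: identical to Latin-1 except for 0x80..=0x9F.
/// Undefined bytes become U+FFFD; the second value counts those replacements.
fn decode_windows_1252(bytes: &[u8]) -> (String, usize) {
    let mut text = String::with_capacity(bytes.len());
    let mut replaced = 0;
    for &b in bytes {
        let ch = match b {
            0x80..=0x9F => match CP1252_HIGH[(b - 0x80) as usize] {
                Some(c) => c,
                None => {
                    replaced += 1;
                    '\u{FFFD}'
                }
            },
            _ => b as char,
        };
        text.push(ch);
    }
    (text, replaced)
}

/// Hypothetical detection step: accept the input as UTF-8 if it validates,
/// otherwise fall back to Windows-1252. Returns the text and the encoding used.
fn decode_with_fallback(bytes: &[u8]) -> (String, &'static str) {
    match std::str::from_utf8(bytes) {
        Ok(s) => (s.to_owned(), "utf-8"),
        Err(_) => (decode_windows_1252(bytes).0, "windows-1252"),
    }
}
```

In this sketch, the bytes `[0x63, 0x61, 0x66, 0xE9]` are not valid UTF-8, so `decode_with_fallback` decodes them as Windows-1252 and yields "café". The replacement count from `decode_windows_1252` is the kind of information an issue report could carry.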

Structs§

EncodingIssue
Information about encoding issues encountered
EncodingOptions
Configuration for character encoding processing
EncodingResult
Character encoding detection and conversion result
EnhancedDecoder
Enhanced character decoder implementation

Enums§

EncodingType
Supported encoding types for PDF text
IssueResolution

Traits§

CharacterDecoder
Main character decoder trait

Functions§

decode_text
Convenience function to decode bytes with default settings
decode_text_with_encoding
Convenience function to decode bytes with specific encoding