Character Encoding Detection and Conversion Module
This module provides robust character encoding detection and conversion capabilities to handle the diverse encoding scenarios found in real-world PDF files.
§Overview
Many PDFs contain text encoded in various character sets beyond UTF-8, including:
- Latin-1 (ISO 8859-1) - Common in European documents
- Windows-1252 - Microsoft’s extension of Latin-1
- MacRoman - Apple’s legacy encoding
- PDF-specific encodings, such as PDFDocEncoding and StandardEncoding from the PDF specification
This module provides automatic detection and graceful conversion with fallback handling for unrecognized characters.
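The detect-then-fall-back idea can be sketched with the standard library alone. This is an illustrative sketch, not this module's implementation: `decode_with_fallback` and `GuessedEncoding` are hypothetical names, and the only "detection" performed is a strict UTF-8 validity check, with Latin-1 as the fallback because every byte value maps to a valid code point there.

```rust
use std::borrow::Cow;

/// Hypothetical label for the path that produced the text.
/// (Not the module's own `EncodingType`; used here for illustration only.)
#[derive(Debug, PartialEq)]
enum GuessedEncoding {
    Utf8,
    Latin1,
}

/// Try strict UTF-8 first. If the bytes are not valid UTF-8, fall back to
/// Latin-1 (ISO 8859-1), where each byte maps to the Unicode code point
/// with the same numeric value, so the fallback can never fail.
fn decode_with_fallback(bytes: &[u8]) -> (Cow<'_, str>, GuessedEncoding) {
    match std::str::from_utf8(bytes) {
        // Valid UTF-8: borrow the input without copying.
        Ok(text) => (Cow::Borrowed(text), GuessedEncoding::Utf8),
        // Invalid UTF-8: reinterpret every byte as a Latin-1 character.
        Err(_) => {
            let text: String = bytes.iter().map(|&b| b as char).collect();
            (Cow::Owned(text), GuessedEncoding::Latin1)
        }
    }
}
```

For example, the bytes `b"caf\xE9"` are not valid UTF-8, so they take the Latin-1 path and decode to `café`. A real detector would also have to distinguish Latin-1 from Windows-1252 and MacRoman, which assign different characters to some byte values, and a validity check alone cannot do that.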
Structs§
- EncodingIssue - Information about encoding issues encountered
- EncodingOptions - Configuration for character encoding processing
- EncodingResult - Character encoding detection and conversion result
- EnhancedDecoder - Enhanced character decoder implementation
Enums§
- EncodingType - Supported encoding types for PDF text
- IssueResolution
Traits§
- CharacterDecoder - Main character decoder trait
Functions§
- decode_text - Convenience function to decode bytes with default settings
- decode_text_with_encoding - Convenience function to decode bytes with a specific encoding