# PDF Spec Compliance Cleanup Roadmap
**Goal**: Remove all non-PDF-spec-compliant code from core extraction. Reorganize into:
- **Core**: Spec-compliant extraction only
- **Enhancements**: Optional, user-controlled features
- **Clear boundary**: No mixed concerns
**Current State**: ~2,000+ lines of non-spec code across 12 areas (Phase 10 status: 4.4/10)
**Target State**: Spec-strict core + optional enhancement layers (8.5+/10)
---
## PHASE 1: Removal (Lowest Priority - Just Delete)
These features have NO DEPENDENCY from core extraction. Safe to remove entirely.
### 1.1 Table Detection Module
**Location**: `src/layout/table_detector.rs` (300+ lines)
**Status**: Can be removed immediately
**Action**:
- [ ] Delete `src/layout/table_detector.rs`
- [ ] Remove from `src/layout/mod.rs` exports
- [ ] Remove from `src/extractors/mod.rs` if exported
- [ ] Update any tests that reference table detection
- [ ] Remove `TableDetector`, `DetectedTable`, `TableDetectorConfig` from public API
**Why**: Tables are semantic concepts NOT in PDF spec. Users should use structure tree (Section 14.7) if they want real table information.
**Impact**: ✅ No code breakage (optional feature)
**Effort**: 30 minutes
---
### 1.2 Heading Detection Heuristics
**Location**: `src/layout/heading_detector.rs` (300+ lines)
**Status**: Can be removed if not core to extraction
**Action**:
- [ ] Check if heading_detector is used in critical path
- [ ] If yes: Keep but add `spec_compliant: false` flag
- [ ] If no: Delete entire module
- [ ] Remove hardcoded font size thresholds (22pt, 18pt, etc.)
**Why**: Font-based heading detection is linguistic interpretation, not PDF spec feature. Use structure tree for real heading info.
**Impact**: Depends on usage - potentially breaks heading detection
**Effort**: 1 hour if deletable, 2 hours if keeping with annotations
---
### 1.3 ML Heading Classifier
**Location**: `src/ml/heading_classifier.rs` (200+ lines)
**Status**: Remove or move to optional module
**Action**:
- [ ] Delete `src/ml/heading_classifier.rs`
- [ ] Remove from `src/ml/mod.rs`
- [ ] If ML module becomes empty, consider removing entire `src/ml/`
- [ ] Remove all DistilBERT references from docs
**Why**: ML-based semantic analysis is antithetical to spec compliance. It's a proprietary classification layer.
**Impact**: ✅ No code breakage
**Effort**: 30 minutes
---
## PHASE 2: Migration to Optional (Medium Priority)
These are NEEDED for quality but NOT in PDF spec. Move to optional post-processing layer.
### 2.1 CamelCase Word Splitting
**Location**: `src/extractors/text.rs:1467-1475, 2057-2141, 3671-3989`
**Current State**: Already disabled but code still present
**Action**:
- [ ] Create new module: `src/post_processors/word_splitter.rs`
- [ ] Move `split_fused_words()`, `split_on_camelcase()` to new module
- [ ] Remove calls from main extraction pipeline
- [ ] Add post-processing layer to text extraction with opt-in flag
- [ ] Document: "Optional feature - not PDF spec-based"
- [ ] Add unit tests: verify "theGeneral" → "the General" works
- [ ] Remove dead code from text.rs: lines 3671-3989
**Configuration**:
```rust
pub struct TextExtractionConfig {
// ... existing spec-based config ...
// Optional enhancements (NOT spec-based)
pub enable_word_splitting: bool, // Default: false
}
```
**Impact**: ✅ +3 word fusions fixed when enabled
**Effort**: 2-3 hours
---
### 2.2 Document Type Detection & Profiles
**Location**: `src/extractors/gap_statistics.rs:154-248` (configuration), various `.policy_documents()`, `.academic()` methods
**Current State**: Active, controlling 1,623 spurious spaces
**Challenge**: Removing this LOWERS quality. Need to decide:
- **Option A**: Delete (pure spec-only, quality drops to 3.5/10)
- **Option B**: Keep but annotate as "empirical heuristic" (current approach)
- **Option C**: Move to optional module with better documentation
**Recommendation**: **Option B (Keep with Annotations)** - for now
- [ ] Add comments to all doc-type profiles explaining they're non-spec
- [ ] Create config flag: `use_adaptive_thresholds: bool` (default: true)
- [ ] Document why: "Empirical tuning for real-world PDFs"
- [ ] Create variant: `spec_strict_config()` that disables all adaptive features
- [ ] Later: Can move to optional module after implementing better spec-based solution
**Example Documentation**:
```rust
/// **NON-SPEC HEURISTIC**: Document-type-specific thresholds
/// These multipliers (1.3x for policy, 1.6x for academic) are empirically chosen
/// and NOT derived from ISO 32000-1:2008. They improve practical quality but
/// reduce spec compliance. Disable via: config.use_adaptive_thresholds = false
pub fn policy_documents() -> Self {
Self {
median_multiplier: 1.3, // Tight spacing in policy docs
// ...
}
}
```
**Impact**: Maintains current quality until we find better approach
**Effort**: 1-2 hours (annotation only)
---
### 2.3 Column Detection & Layout Analysis
**Location**: `src/layout/document_analyzer.rs:118-408` (bin sizes, gap ratios, Gaussian sigma)
**Current State**: Active, used for adaptive layout
**Decision**: KEEP but separate into "layout enhancement" module
**Action**:
- [ ] Move to new module: `src/enhancements/layout_analysis.rs`
- [ ] Mark all magic numbers with sources (ICDAR paper reference)
- [ ] Add config flag: `enable_layout_analysis: bool` (default: true)
- [ ] Document: "Uses ICDAR 2005 layout algorithm, not PDF spec-based"
- [ ] Keep in extractors but with clear separation
**Configuration**:
```rust
pub struct LayoutAnalysisConfig {
pub enabled: bool,
// Bin width for projection profile (ICDAR algorithm)
pub histogram_bin_width_pt: f32, // default: 10.0
// ... other ICDAR parameters ...
}
```
**Impact**: Maintains layout analysis, improves documentation
**Effort**: 2-3 hours
---
## PHASE 3: Annotation (High Priority - Quick Wins)
Add clear documentation to all non-spec code that stays.
### 3.1 Tag All Non-Spec Code
**Action**:
- [ ] Find all non-spec implementations (use analysis output)
- [ ] Add comment block:
```rust
```
**Locations to annotate**:
- [ ] `gap_statistics.rs`: All multiplier-based thresholds
- [ ] `geometric_spacing.rs`: Document the 0.25em ratio choice
- [ ] `document_analyzer.rs`: All ICDAR algorithm parameters
- [ ] `column_detector.rs`: XY-Cut algorithm parameters
- [ ] `bold_validation.rs`: Unicode whitespace handling
**Effort**: 3-4 hours
---
### 3.2 Create Spec Compliance Reference
**New file**: `docs/SPEC_COMPLIANCE_GUIDE.md`
**Content**:
- List all PDF spec sections used (9.3, 9.4.3, 9.4.4, etc.)
- List all non-spec features and justifications
- Configuration guide: How to enable/disable features
- Quality vs. Compliance trade-offs
- Comparison with pdfplumber, pdfminer.six
**Effort**: 2-3 hours
---
## PHASE 4: Create Optional Enhancement Layers
### 4.1 Post-Processor Framework
**New file**: `src/post_processors/mod.rs`
**Purpose**: Apply non-spec fixes AFTER spec-compliant extraction
```rust
pub trait PostProcessor {
fn process(&self, document: &mut ExtractedDocument) -> Result<()>;
}
pub struct TextRepairProcessor {
pub split_camelcase: bool,
pub fix_empty_markers: bool,
// ...
}
pub fn apply_post_processors(
document: &mut ExtractedDocument,
config: &PostProcessorConfig,
) -> Result<()> {
if config.word_splitting.enabled {
TextRepairProcessor::split_fused_words(document)?;
}
if config.bold_validation.enabled {
BoldMarkerValidator::fix_empty_markers(document)?;
}
// ...
}
```
**Effort**: 3-4 hours
---
## PHASE 5: Create Spec-Strict Mode
**New configuration**: `TextExtractionConfig::spec_strict()`
```rust
impl TextExtractionConfig {
/// Returns configuration that ONLY uses PDF spec features
/// - TJ array offsets (Section 9.4.3)
/// - Boundary whitespace (Section 9.4.3)
/// - Geometric gaps with fixed 0.25em threshold (Section 9.4.4)
/// - Font metrics (Section 9.3)
pub fn spec_strict() -> Self {
Self {
// Core spec features
use_tj_offsets: true,
use_geometric_gaps: true,
use_boundary_whitespace: true,
// Disable ALL non-spec features
use_adaptive_thresholds: false,
enable_word_splitting: false,
enable_layout_analysis: false,
enable_table_detection: false,
enable_heading_detection: false,
// Fixed thresholds (from pdfplumber)
geometric_gap_threshold_em: 0.25, // Standard 0.25em
..Default::default()
}
}
}
```
**Testing**:
- [ ] Add test: `test_spec_strict_mode_disabled()`
- [ ] Run regression suite with `spec_strict()`
- [ ] Expected: Lower quality (3.5-4.5/10) but spec-compliant
**Effort**: 1-2 hours
---
## Execution Order (Recommended)
### Stage 1: Quick Removals + Annotations
1. **Phase 1.1**: Delete table_detector.rs (30 min)
2. **Phase 1.2**: Delete heading_detector.rs or annotate (1-2 hrs)
3. **Phase 1.3**: Delete ML classifier (30 min)
4. **Phase 3.1**: Annotate all non-spec code (3-4 hrs)
5. **Total**: ~6-8 hours → Immediate clarity on what's non-spec
### Stage 2: Documentation + Refactoring
6. **Phase 3.2**: Create spec compliance guide (2-3 hrs)
7. **Phase 2.1**: Move CamelCase to post-processor (2-3 hrs)
8. **Phase 2.3**: Move layout analysis to enhancement module (2-3 hrs)
9. **Total**: ~6-9 hours → Clean separation of concerns
### Stage 3: Framework + Testing
10. **Phase 4.1**: Create post-processor framework (3-4 hrs)
11. **Phase 5**: Create spec-strict mode (1-2 hrs)
12. **Testing**: Regression suite + quality metrics (2-3 hrs)
13. **Total**: ~6-9 hours → Production-ready clean architecture
---
## File Structure After Cleanup
```
src/
├── core/ # SPEC-COMPLIANT ONLY
│ ├── text_extraction.rs # Core text extraction (TJ, boundaries, gaps)
│ ├── geometric_spacing.rs # Fixed 0.25em threshold (CURRENT geometric_spacing.rs)
│ └── font_metrics.rs # Font state parameters (Tc, Tw, Th)
│
├── enhancements/ # OPTIONAL, USER-CONTROLLED
│ ├── adaptive_thresholds.rs # Gap statistics multipliers (from gap_statistics.rs)
│ ├── layout_analysis.rs # Document analysis, column detection (ICDAR-based)
│ └── config.rs # Unified enhancement configuration
│
├── post_processors/ # APPLIED AFTER EXTRACTION (NON-SPEC)
│ ├── mod.rs # PostProcessor trait
│ ├── word_splitter.rs # CamelCase splitting (from split_fused_words)
│ ├── bold_validator.rs # Empty bold marker fixes (moved from converters)
│ └── spurious_space_fixer.rs # Fix double spaces (Issue #2)
│
├── converters/
│ └── markdown.rs # Markdown output (use post-processors)
│
└── [other modules unchanged]
docs/
├── PHASE10_PDF_SPEC_COMPLIANCE.md # Existing
├── CLEANUP_ROADMAP.md # This file
└── SPEC_COMPLIANCE_GUIDE.md # New - Comprehensive guide
```
---
## Quality & Compliance Matrix
| spec_strict | ❌ 3 | ✅ 0 | ❌ 2-3 | 3.5/10 | ✅ 100% |
| default | ❌ 3 | ✅ 0 | ❌ 2-3 | 4.4/10 | 🟡 70% |
| with_enhancements | ❌ 3 | ✅ 0 | ❌ 2-3 | 6.5/10 | 🟡 50% |
| with_all_fixes | ✅ 0 | ✅ 0 | ✅ 0 | 8.5/10 | 🟡 40% |
---
## Critical Notes
### What to KEEP (with justification)
1. **Geometric spacing 0.25em threshold** ✅
- Justified: pdfplumber standard, widely proven
- Spec: Section 9.4.4 supports this interpretation
- Config: Fixed (not adaptive)
2. **Boundary whitespace detection** ✅
- Justified: Directly in PDF spec (Section 9.4.3)
- Spec: "Spaces in text strings"
- Config: No option (always on)
3. **TJ offset signals** ✅
- Justified: Directly in PDF spec (Section 9.4.3)
- Spec: "TJ array offsets determine positioning"
- Config: No option (always on)
4. **Bold/italic detection from font flags** ✅
- Justified: Font properties in PDF spec (Section 5.3.3)
- Spec: Font.Flags, Font.FontWeight etc.
- Config: Always on (core feature)
### What to REMOVE (no spec justification)
1. ❌ Table detection (move to optional)
2. ❌ Heading detection heuristics (move to optional)
3. ❌ ML classifiers (delete)
4. ❌ CamelCase splitting (move to post-processor)
5. ❌ Document-type profiles (annotate as heuristic)
### What to ANNOTATE (keep but document)
1. 📝 Adaptive gap multipliers (empirical, non-spec)
2. 📝 ICDAR layout analysis (academic, non-spec)
3. 📝 Unicode whitespace handling (PDF-specific workaround)
---
## Success Criteria
- [ ] All non-spec code clearly marked with **NON-SPEC HEURISTIC** comments
- [ ] New modules: `core/`, `enhancements/`, `post_processors/`
- [ ] `spec_strict()` configuration works (3.5/10 quality, 100% compliant)
- [ ] Default configuration improved (4.4→5.0/10, ~70% compliant)
- [ ] All fixes as optional post-processors (8.5/10, ~40% compliant but user-controlled)
- [ ] Comprehensive spec compliance guide published
- [ ] Regression suite passes for all configurations
- [ ] Clear user documentation: When to enable/disable features
---
## Commands for Testing
```bash
# Test spec-strict mode
cargo test --test quality_metrics -- --spec-strict
# Test with all enhancements
cargo test --test quality_metrics -- --enable-all
# Test post-processors
cargo test --test quality_metrics -- --with-post-processors
# Full regression suite
cargo test --test regression_suite
```
---
## Questions Before Starting
1. **Question 1**: Should we delete table detection entirely, or keep it but move to optional module?
- **Recommended**: Delete (false positives, users have structure tree)
2. **Question 2**: For adaptive gaps, should we move to `enhancements/` or keep in core?
- **Recommended**: Keep in core but annotate heavily (needed for current quality)
3. **Question 3**: Should spec_strict_mode() be the default or opt-in?
- **Recommended**: Opt-in (users expect good quality by default)
---
## Timeline
- **Phase 1-2**: 8 hours → Remove/migrate non-spec code
- **Phase 3-4**: 8 hours → Create framework + documentation
- **Phase 5**: 3 hours → Testing + verification
- **Total**: ~19 hours → Production-ready clean architecture
Should we start with Phase 1 (quick removals)?