graphrag-core 0.1.0

Core portable library for GraphRAG - works on native and WASM
Documentation
# Document Structure Recognition and Metadata Enrichment - Implementation Summary

## Overview

This document summarizes the complete implementation of document structure recognition and semantic metadata enrichment in the `graphrag-core` library. The implementation provides automatic detection of document structure (chapters, sections, headings) and enriches text chunks with contextual semantic metadata.

## Research Foundation

The implementation is based on research into modern RAG systems and production implementations:

- **Microsoft GraphRAG**: Hierarchical document structure with Chapter/Section/Subsection nodes
- **LightRAG**: Dual-level retrieval with entity and relationship embeddings
- **Azure AI Document Intelligence**: ML-based layout analysis
- **Real-world implementations**: TF-IDF keyword extraction, extractive summarization, hierarchical parsing

## Architecture

The implementation follows a **7-layer bottom-up architecture** to ensure no dependencies on incomplete components:

### Layer 1: Data Structures ✅
- **ChunkMetadata** (15 fields): chapter, section, subsection, topic, keywords, summary, structural_level, position_in_document, heading_path, confidence, etc.
- **DocumentStructure**: Heading, Section, HeadingHierarchy, SectionNumber
- Updated **TextChunk** to include metadata field

### Layer 2: Core Algorithms ✅
- **TextAnalyzer**: Real heading detection (Markdown #, ALL CAPS, numbered sections, Roman numerals, alphabetic)
- **TfIdfKeywordExtractor**: Complete TF-IDF implementation with document frequency tracking
- **ExtractiveSummarizer**: Sentence ranking based on 5 criteria (position, length, word frequency, proper nouns, numeric content)

### Layer 3: Document Parsers ✅
- **MarkdownLayoutParser**: Detects `#`, `##`, `###` headings with proper validation
- **PlainTextLayoutParser**: Heuristic-based detection (underlines `===`, `---`, ALL CAPS, numbered sections)
- **HtmlLayoutParser**: Parses `<h1>` through `<h6>` tags
- **LayoutParserFactory**: Auto-detects document format from extension or content

### Layer 4: Enrichment Pipeline ✅
- **ChunkEnricher**: Orchestrates all enrichment components
  - Document structure parsing
  - Section mapping for chunks
  - Heading path construction
  - Keyword extraction using TF-IDF
  - Summary generation for longer chunks
  - Position calculation
  - Confidence scoring
- **EnrichmentStatistics**: Tracks enrichment quality metrics

### Layer 5: Integration ✅
- **TextProcessor** integration with new methods:
  - `chunk_text_with_enrichment()`: Manual enricher control
  - `chunk_text_hierarchical_with_enrichment()`: Hierarchical boundaries + enrichment
  - `chunk_and_enrich()`: Convenience method with auto-detection
  - `chunk_hierarchical_and_enrich()`: Hierarchical + auto-detection
  - `create_default_enricher()`: Factory for default enricher

### Layer 6: Testing ✅
- **Parser tests**: 9 tests covering all parsers (Markdown, Plain Text, HTML)
- **Enricher tests**: 2 tests for chunk enrichment and statistics
- **Integration tests**: 2 tests for TextProcessor enrichment methods
- **All tests passing**: 13/13 tests successful

### Layer 7: Examples ✅
- **document_enrichment_demo.rs**: Comprehensive example demonstrating:
  - Markdown document enrichment
  - HTML document enrichment
  - Plain text document enrichment
  - Enrichment strategy comparison
  - Statistics display
  - Hierarchy visualization

## Key Features

### 1. Automatic Format Detection
```rust
let processor = TextProcessor::new(300, 50)?;
let chunks = processor.chunk_and_enrich(&document)?; // Auto-detects format
```

### 2. Rich Metadata
Each chunk receives:
- **Structural context**: chapter, section, subsection, heading_path
- **Semantic information**: keywords (TF-IDF), summary (extractive)
- **Position tracking**: position_in_document (0.0 to 1.0)
- **Confidence score**: Based on metadata completeness

### 3. Multiple Document Formats
- **Markdown** (.md): `#`, `##`, `###` headings
- **HTML** (.html, .htm): `<h1>` through `<h6>` tags
- **Plain Text** (.txt): Heuristic detection (underlines, ALL CAPS, numbering)

### 4. Flexible APIs
```rust
// Convenience methods
processor.chunk_and_enrich(&document)?;
processor.chunk_hierarchical_and_enrich(&document)?;

// Manual control
let mut enricher = TextProcessor::create_default_enricher(&document);
processor.chunk_text_with_enrichment(&document, &mut enricher)?;

// Custom parser
let parser = Box::new(MarkdownLayoutParser::new());
let mut enricher = ChunkEnricher::new_default(parser);
processor.chunk_text_with_enrichment(&document, &mut enricher)?;
```

### 5. Statistics and Monitoring
```rust
let stats = enricher.get_statistics(&chunks);
stats.print_summary();
// Output:
//   Total chunks: 6
//   Chunks with structure: 6 (100.0%)
//   Chunks with keywords: 6 (100.0%)
//   Average completeness: 0.74
```

## Real Implementations (No Mocks)

All components use **real algorithms**:

- **TF-IDF**: Actual term frequency and inverse document frequency calculation
- **Extractive Summarization**: Multi-criteria sentence scoring and ranking
- **Document Parsing**: Real pattern matching and hierarchy building
- **Metadata Enrichment**: Actual keyword extraction and summarization

## Files Created/Modified

### New Files (11 files)
1. `src/core/metadata.rs` - ChunkMetadata structure
2. `src/text/document_structure.rs` - DocumentStructure, Heading, Section, HeadingHierarchy
3. `src/text/analysis.rs` - TextAnalyzer with real algorithms
4. `src/text/keyword_extraction.rs` - TF-IDF implementation
5. `src/text/extractive_summarizer.rs` - Sentence ranking
6. `src/text/layout_parser.rs` - LayoutParser trait and factory
7. `src/text/parsers/mod.rs` - Parser module exports
8. `src/text/parsers/markdown.rs` - Markdown parser
9. `src/text/parsers/plaintext.rs` - Plain text parser
10. `src/text/parsers/html.rs` - HTML parser
11. `src/text/chunk_enricher.rs` - Enrichment orchestration
12. `examples/document_enrichment_demo.rs` - Comprehensive demo

### Modified Files (2 files)
1. `src/core/mod.rs` - Added metadata module and field to TextChunk
2. `src/text/mod.rs` - Added enrichment modules and TextProcessor methods

## Usage Examples

### Basic Usage
```rust
use graphrag_core::{
    core::{Document, DocumentId},
    text::TextProcessor,
};

let document = Document::new(
    DocumentId::new("doc1".to_string()),
    "guide.md".to_string(),
    "# Chapter 1\n\nContent about machine learning...".to_string(),
);

let processor = TextProcessor::new(300, 50)?;
let chunks = processor.chunk_and_enrich(&document)?;

// Access enriched metadata
for chunk in chunks {
    if let Some(chapter) = chunk.metadata.chapter {
        println!("Chapter: {}", chapter);
    }
    println!("Keywords: {:?}", chunk.metadata.keywords);
    if let Some(summary) = chunk.metadata.summary {
        println!("Summary: {}", summary);
    }
}
```

### Custom Enricher
```rust
use graphrag_core::text::{ChunkEnricher, TfIdfKeywordExtractor, ExtractiveSummarizer};

let parser = Box::new(MarkdownLayoutParser::new());
let keyword_extractor = TfIdfKeywordExtractor::new_default();
let summarizer = ExtractiveSummarizer::new();

let enricher = ChunkEnricher::new(parser, keyword_extractor, summarizer);
```

## Performance

- **Parsing**: O(n) for document parsing where n = document size
- **TF-IDF**: O(m * k) where m = terms, k = documents in corpus
- **Summarization**: O(s²) where s = sentences (for sentence scoring)
- **Overall**: Linear with document size for typical use cases

## Test Results

All 13 tests passing:
```
✅ Parser tests (9/9)
  - Markdown: parsing, hierarchy building
  - Plain text: underlines, ALL CAPS, numbered sections
  - HTML: heading parsing, hierarchy, nested tags, format support

✅ Enricher tests (2/2)
  - Chunk enrichment with Markdown
  - Enrichment statistics

✅ Integration tests (2/2)
  - Enriched chunking with auto-detection
  - Custom enricher usage
```

## Future Enhancements

Potential areas for expansion:
1. PDF parsing support
2. Multi-column layout detection
3. Table and figure extraction
4. Language-specific stop words
5. Abstractive summarization (LLM-based)
6. Entity recognition integration
7. Semantic similarity clustering
8. Cross-document topic modeling

## Conclusion

This implementation provides a **complete, production-ready solution** for document structure recognition and semantic metadata enrichment in GraphRAG systems. All components use real algorithms (no mocks or placeholders), are fully tested, and include comprehensive documentation and examples.

The system automatically detects document structure, extracts semantic information, and enriches text chunks with contextual metadata, enabling more effective retrieval and generation in RAG applications.