Tree-sitter based markdown parser for structured content analysis.
This module provides robust markdown parsing capabilities using tree-sitter, which enables precise syntax analysis and structured extraction of headings, content blocks, and table of contents information.
§Features
- Hierarchical Structure: Builds nested heading structures matching document organization
- Error Resilience: Continues parsing even with malformed markdown syntax
- Diagnostics: Reports issues found during parsing for quality assurance
- Performance: Efficiently handles large documents (< 150ms per MB)
- Unicode Support: Full Unicode support including complex scripts and emoji
§Architecture
The parser uses tree-sitter for tokenization and syntax analysis, then builds structured representations:
- Tokenization: tree-sitter parses markdown into a syntax tree
- Structure Extraction: the tree is traversed to identify headings and content blocks
- Hierarchy Building: nested TOC and heading block structures are constructed from the extracted headings
- Validation: diagnostics are generated for quality issues
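The hierarchy-building step above can be sketched with the standard library alone. This is an illustration of one common stack-based approach, not the crate's actual implementation: it assumes headings have already been extracted as `(level, title)` pairs in document order, and `TocNode`, `build_toc`, and `attach` are hypothetical names that do not come from `blz_core`.

```rust
/// Hypothetical TOC node for illustration; not the `blz_core::TocEntry` type.
#[derive(Debug, PartialEq)]
struct TocNode {
    level: usize,
    title: String,
    children: Vec<TocNode>,
}

/// Builds a nested TOC from a flat, document-ordered list of (level, title) headings.
fn build_toc(headings: &[(usize, &str)]) -> Vec<TocNode> {
    let mut roots: Vec<TocNode> = Vec::new();
    // Stack of headings that are still "open", shallowest first.
    let mut stack: Vec<TocNode> = Vec::new();

    for &(level, title) in headings {
        // Close every open heading at the same or a deeper level.
        while stack.last().map_or(false, |top| top.level >= level) {
            let done = stack.pop().unwrap();
            attach(&mut stack, &mut roots, done);
        }
        stack.push(TocNode {
            level,
            title: title.to_string(),
            children: Vec::new(),
        });
    }
    // Close whatever is still open at the end of the document.
    while let Some(done) = stack.pop() {
        attach(&mut stack, &mut roots, done);
    }
    roots
}

/// Attaches a finished node to its parent, or to the root list if it has none.
fn attach(stack: &mut Vec<TocNode>, roots: &mut Vec<TocNode>, node: TocNode) {
    match stack.last_mut() {
        Some(parent) => parent.children.push(node),
        None => roots.push(node),
    }
}

fn main() {
    let toc = build_toc(&[(1, "Main"), (2, "Sub"), (2, "Sub2"), (1, "Other")]);
    println!("{:#?}", toc);
}
```

In this sketch a heading that skips a level (an `h3` directly under an `h1`) is attached to the nearest shallower heading. Whether the real parser behaves the same way is not stated in these docs.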
§Examples
§Basic parsing:
use blz_core::{MarkdownParser, Result};

fn main() -> Result<()> {
    let mut parser = MarkdownParser::new()?;
    let result = parser.parse(r#"
Welcome to the documentation.
# Installation
Run the following command:
cargo install blz
# Usage
Basic usage example.
"#)?;

    println!("Found {} heading blocks", result.heading_blocks.len());
    println!("TOC has {} entries", result.toc.len());
    println!("Total lines: {}", result.line_count);

    // Report each diagnostic under its severity label.
    for diagnostic in &result.diagnostics {
        match diagnostic.severity {
            blz_core::DiagnosticSeverity::Warn => {
                println!("Warning: {}", diagnostic.message);
            }
            blz_core::DiagnosticSeverity::Error => {
                println!("Error: {}", diagnostic.message);
            }
            blz_core::DiagnosticSeverity::Info => {
                println!("Info: {}", diagnostic.message);
            }
        }
    }
    Ok(())
}
§Working with structured results:
use blz_core::{MarkdownParser, Result};

// Recursively print the table of contents, indenting one step per nesting level.
fn print_toc(entries: &[blz_core::TocEntry], indent: usize) {
    for entry in entries {
        // Fall back to "Unknown" when the heading path is empty.
        let title = entry.heading_path.last().map(String::as_str).unwrap_or("Unknown");
        println!("{}{} ({})", " ".repeat(indent), title, entry.lines);
        print_toc(&entry.children, indent + 1);
    }
}

fn main() -> Result<()> {
    let mut parser = MarkdownParser::new()?;
    let result = parser.parse("# Main\n\nMain content\n\n## Sub\n\nSub content here.")?;

    // Examine heading blocks
    for block in &result.heading_blocks {
        println!("Section: {} (lines {}-{})",
            block.path.join(" > "),
            block.start_line,
            block.end_line);
    }

    // Examine table of contents
    print_toc(&result.toc, 0);
    Ok(())
}
§Performance Characteristics
- Parse Time: < 150ms per MB of markdown content
- Memory Usage: ~2x source document size during parsing
- Large Documents: Efficiently handles documents up to 100MB
- Complex Structure: Handles deeply nested headings (tested up to 50 levels)
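The per-MB figure above can be compared against a measured run with a small helper. The sketch below is an assumption-laden illustration: `ms_per_mb` and `time_it` are hypothetical helpers, "MB" is taken to mean 1024 × 1024 bytes (the docs do not say which definition is used), and the timed workload is a stand-in line count rather than a real parse.

```rust
use std::time::{Duration, Instant};

// Assumption: "MB" means 1024 * 1024 bytes; the docs do not specify.
const BYTES_PER_MB: f64 = 1024.0 * 1024.0;

/// Converts a measured duration into milliseconds per megabyte of input.
/// For a zero-byte input the result is not a meaningful number.
fn ms_per_mb(bytes: usize, elapsed: Duration) -> f64 {
    let megabytes = bytes as f64 / BYTES_PER_MB;
    elapsed.as_secs_f64() * 1000.0 / megabytes
}

/// Runs `work` once and returns its output together with the wall-clock time taken.
fn time_it<T>(work: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = work();
    (out, start.elapsed())
}

fn main() {
    let doc = "# Heading\n\nSome body text.\n".repeat(50_000);
    // Stand-in workload; replace the closure body with a real parse call.
    let (lines, elapsed) = time_it(|| doc.lines().count());
    println!("{} lines, {:.2} ms/MB", lines, ms_per_mb(doc.len(), elapsed));
}
```

A single run is noisy. Averaging several runs in a release build gives a fairer comparison against the stated 150 ms per MB.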
§Error Handling
The parser is designed to be resilient to malformed input:
- Syntax Errors: tree-sitter handles most malformed markdown gracefully
- Missing Headings: Creates a default “Document” block for content without structure
- Encoding Issues: Handles various text encodings and invalid UTF-8 sequences
- Memory Limits: Prevents excessive memory usage on pathological inputs
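Two of the behaviors listed above can be illustrated without the crate: decoding raw bytes lossily before parsing, and falling back to a single "Document" block when no headings are found. This is a simplified sketch, not the parser's actual logic. `Block`, `atx_title`, and `split_blocks` are hypothetical names, heading detection is a line-prefix check that ignores fenced code and setext headings, and content before the first heading is not assigned to any block.

```rust
/// Hypothetical block type for illustration; not a `blz_core` type.
#[derive(Debug, PartialEq)]
struct Block {
    title: String,
    start_line: usize, // 1-based, inclusive
    end_line: usize,   // 1-based, inclusive
}

/// Returns the title if `line` looks like an ATX heading (1 to 6 '#' then a space).
fn atx_title(line: &str) -> Option<String> {
    let hashes = line.chars().take_while(|&c| c == '#').count();
    if (1..=6).contains(&hashes) {
        // '#' is one byte in UTF-8, so the char count is a valid byte offset.
        let rest = &line[hashes..];
        if rest.starts_with(' ') {
            return Some(rest.trim().to_string());
        }
    }
    None
}

/// Splits text into heading blocks, falling back to one "Document" block.
fn split_blocks(text: &str) -> Vec<Block> {
    let mut blocks: Vec<Block> = Vec::new();
    let mut line_count = 0;
    for (idx, line) in text.lines().enumerate() {
        let line_no = idx + 1;
        line_count = line_no;
        if let Some(title) = atx_title(line) {
            // A new heading ends the previous block on the line before it.
            if let Some(prev) = blocks.last_mut() {
                prev.end_line = line_no - 1;
            }
            blocks.push(Block { title, start_line: line_no, end_line: line_no });
        }
    }
    // The last block runs to the end of the document.
    if let Some(last) = blocks.last_mut() {
        last.end_line = line_count;
    }
    // No headings at all: cover the whole input with a default block.
    if blocks.is_empty() {
        blocks.push(Block {
            title: "Document".to_string(),
            start_line: 1,
            end_line: line_count.max(1),
        });
    }
    blocks
}

fn main() {
    // Invalid UTF-8 bytes are replaced with U+FFFD instead of causing an error.
    let raw: &[u8] = b"plain text \xFF without headings\nsecond line\n";
    let text = String::from_utf8_lossy(raw);
    for block in split_blocks(&text) {
        println!("{} (lines {}-{})", block.title, block.start_line, block.end_line);
    }
}
```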
§Thread Safety
MarkdownParser is not thread-safe due to internal mutable state in tree-sitter.
Create separate parser instances for concurrent parsing:
use blz_core::{MarkdownParser, Result};
use std::thread;

fn parse_concurrently(documents: Vec<String>) -> Vec<Result<blz_core::ParseResult>> {
    documents
        .into_iter()
        .map(|doc| {
            // Each thread builds and owns its own parser instance.
            thread::spawn(move || {
                let mut parser = MarkdownParser::new()?;
                parser.parse(&doc)
            })
        })
        // Collect the handles first so every thread is started before any join.
        .collect::<Vec<_>>()
        .into_iter()
        .map(|handle| handle.join().unwrap())
        .collect()
}
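When there are many documents, a variation on the example above is to create one parser per worker rather than one per document, and let each worker process a chunk of the input. The sketch below shows that pattern with scoped threads from the standard library. `StubParser` is a hypothetical stand-in with a `&mut self` parse method, not `MarkdownParser`, and `parse_in_chunks` is an illustrative name.

```rust
use std::thread;

/// Stand-in for a parser with internal mutable state; hypothetical, not `MarkdownParser`.
struct StubParser {
    scratch: Vec<usize>, // reusable buffer, the reason `parse` takes `&mut self`
}

impl StubParser {
    fn new() -> Self {
        StubParser { scratch: Vec::new() }
    }

    /// Counts lines starting with '#', reusing the internal scratch buffer.
    fn parse(&mut self, text: &str) -> usize {
        self.scratch.clear();
        self.scratch
            .extend(text.lines().filter(|l| l.starts_with('#')).map(|l| l.len()));
        self.scratch.len()
    }
}

/// Parses documents on up to `workers` threads, one parser per thread.
/// Results are returned in the same order as the input.
fn parse_in_chunks(documents: &[String], workers: usize) -> Vec<usize> {
    let workers = workers.max(1);
    // Ceiling division, and at least 1 so `chunks` never receives zero.
    let chunk_size = ((documents.len() + workers - 1) / workers).max(1);

    thread::scope(|scope| {
        let handles: Vec<_> = documents
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    // The parser is created inside the thread and never shared.
                    let mut parser = StubParser::new();
                    chunk.iter().map(|doc| parser.parse(doc)).collect::<Vec<_>>()
                })
            })
            .collect();

        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}

fn main() {
    let docs = vec![
        "# A\n## B".to_string(),
        "text only".to_string(),
        "# C".to_string(),
    ];
    println!("{:?}", parse_in_chunks(&docs, 2));
}
```

Scoped threads let the workers borrow the document slice directly, so nothing needs to be cloned or moved into each thread. This relies on `std::thread::scope`, which requires Rust 1.63 or later.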
Structs§
- MarkdownParser - A tree-sitter based markdown parser.
- ParseResult - The result of parsing a markdown document.