Module parser


Tree-sitter based markdown parser for structured content analysis.

This module provides robust markdown parsing capabilities using tree-sitter, which enables precise syntax analysis and structured extraction of headings, content blocks, and table of contents information.

§Features

  • Hierarchical Structure: Builds nested heading structures matching document organization
  • Error Resilience: Continues parsing even with malformed markdown syntax
  • Diagnostics: Reports issues found during parsing for quality assurance
  • Performance: Efficiently handles large documents (< 150ms per MB)
  • Unicode Support: Full Unicode support including complex scripts and emoji

§Architecture

The parser uses tree-sitter for tokenization and syntax analysis, then builds structured representations:

  1. Tokenization: tree-sitter parses the markdown into a syntax tree
  2. Structure Extraction: The parser traverses the tree to identify headings and content blocks
  3. Hierarchy Building: It constructs the nested TOC and heading block structures
  4. Validation: It generates diagnostics for quality issues
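
The hierarchy-building step can be illustrated with a minimal, standard-library-only sketch. It assumes the headings have already been extracted as a flat, document-ordered list of `(level, title)` pairs. `TocNode`, `build_toc`, and `attach` are hypothetical names used for illustration only; they are not part of `blz_core`, and the crate's actual implementation may differ.

```rust
/// Hypothetical TOC node for illustration; not the crate's `TocEntry`.
#[derive(Debug, PartialEq)]
struct TocNode {
    level: usize,
    title: String,
    children: Vec<TocNode>,
}

/// Attach a finished node to its parent, or to the root list if none is open.
fn attach(stack: &mut Vec<TocNode>, roots: &mut Vec<TocNode>, node: TocNode) {
    match stack.last_mut() {
        Some(parent) => parent.children.push(node),
        None => roots.push(node),
    }
}

/// Nest a flat, document-ordered list of (level, title) headings.
fn build_toc(headings: &[(usize, &str)]) -> Vec<TocNode> {
    let mut roots: Vec<TocNode> = Vec::new();
    // Headings that are still "open", outermost first.
    let mut stack: Vec<TocNode> = Vec::new();

    for &(level, title) in headings {
        // A new heading closes every open heading at the same or deeper level.
        while stack.last().map_or(false, |top| top.level >= level) {
            let done = stack.pop().unwrap();
            attach(&mut stack, &mut roots, done);
        }
        stack.push(TocNode {
            level,
            title: title.to_string(),
            children: Vec::new(),
        });
    }

    // Close whatever is still open at the end of the document.
    while let Some(done) = stack.pop() {
        attach(&mut stack, &mut roots, done);
    }
    roots
}

fn main() {
    let toc = build_toc(&[(1, "Main"), (2, "Sub"), (1, "Appendix")]);
    println!("{:#?}", toc);
}
```

In this sketch a heading that skips a level (for example a level-3 heading directly under a level-1 heading) simply becomes a child of the nearest shallower heading.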

§Examples

§Basic parsing:

use blz_core::{MarkdownParser, Result};

let mut parser = MarkdownParser::new()?;
let result = parser.parse(r#"

Welcome to the documentation.

# Installation

Run the following command:
cargo install blz

# Usage

Basic usage example.
"#)?;

println!("Found {} heading blocks", result.heading_blocks.len());
println!("TOC has {} entries", result.toc.len());
println!("Total lines: {}", result.line_count);

for diagnostic in &result.diagnostics {
    match diagnostic.severity {
        blz_core::DiagnosticSeverity::Warn => {
            println!("Warning: {}", diagnostic.message);
        }
        blz_core::DiagnosticSeverity::Error => {
            println!("Error: {}", diagnostic.message);
        }
        blz_core::DiagnosticSeverity::Info => {
            println!("Info: {}", diagnostic.message);
        }
    }
}

§Working with structured results:

use blz_core::{MarkdownParser, Result};

let mut parser = MarkdownParser::new()?;
let result = parser.parse("# Main\n\nMain content\n\n## Sub\n\nSub content here.")?;

// Examine heading blocks
for block in &result.heading_blocks {
    println!("Section: {} (lines {}-{})",
        block.path.join(" > "),
        block.start_line,
        block.end_line);
}

// Examine table of contents
fn print_toc(entries: &[blz_core::TocEntry], indent: usize) {
    for entry in entries {
        println!("{}{} ({})",
            "  ".repeat(indent),
            entry.heading_path.last().map_or("Unknown", String::as_str),
            entry.lines);
        print_toc(&entry.children, indent + 1);
    }
}
print_toc(&result.toc, 0);

§Performance Characteristics

  • Parse Time: < 150ms per MB of markdown content
  • Memory Usage: ~2x source document size during parsing
  • Large Documents: Efficiently handles documents up to 100MB
  • Complex Structure: Handles deeply nested headings (tested up to 50 levels)

§Error Handling

The parser is designed to be resilient to malformed input:

  • Syntax Errors: tree-sitter handles most malformed markdown gracefully
  • Missing Headings: Creates a default “Document” block for content without structure
  • Encoding Issues: Handles various text encodings and invalid UTF-8 sequences
  • Memory Limits: Prevents excessive memory usage on pathological inputs
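
The "Missing Headings" fallback can be sketched as follows. This is an illustrative approximation only: it finds ATX headings with a naive line scan instead of tree-sitter, and `Block`, `atx_title`, and `split_blocks` are hypothetical names, not the crate's API. Content that precedes the first heading is not modeled here.

```rust
/// Hypothetical block type for illustration; not the crate's heading block.
#[derive(Debug, PartialEq)]
struct Block {
    title: String,
    start_line: usize, // 1-based, inclusive
    end_line: usize,   // 1-based, inclusive
}

/// Return the heading text if `line` looks like an ATX heading ("# Title").
fn atx_title(line: &str) -> Option<&str> {
    let rest = line.trim_start_matches('#');
    let hashes = line.len() - rest.len();
    if (1..=6).contains(&hashes) && rest.starts_with(' ') {
        Some(rest.trim())
    } else {
        None
    }
}

/// Split text into one block per heading, with a default block as fallback.
fn split_blocks(text: &str) -> Vec<Block> {
    let total = text.lines().count();
    let mut blocks: Vec<Block> = Vec::new();

    for (idx, line) in text.lines().enumerate() {
        if let Some(title) = atx_title(line) {
            let line_no = idx + 1;
            // The previous block ends on the line before this heading.
            if let Some(prev) = blocks.last_mut() {
                prev.end_line = line_no - 1;
            }
            blocks.push(Block {
                title: title.to_string(),
                start_line: line_no,
                end_line: total,
            });
        }
    }

    // Fallback: no headings at all, so wrap everything in one default block.
    // Empty input is treated as a single empty line.
    if blocks.is_empty() {
        blocks.push(Block {
            title: "Document".to_string(),
            start_line: 1,
            end_line: total.max(1),
        });
    }
    blocks
}

fn main() {
    for block in split_blocks("no headings here\njust plain text") {
        println!("{} (lines {}-{})", block.title, block.start_line, block.end_line);
    }
}
```

A caller that receives only the default block can treat the document as unstructured while still getting valid line ranges.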

§Thread Safety

MarkdownParser is not thread-safe due to internal mutable state in tree-sitter. Create separate parser instances for concurrent parsing:

use blz_core::{MarkdownParser, Result};
use std::thread;

fn parse_concurrently(documents: Vec<String>) -> Vec<Result<blz_core::ParseResult>> {
    documents
        .into_iter()
        .map(|doc| {
            thread::spawn(move || {
                let mut parser = MarkdownParser::new()?;
                parser.parse(&doc)
            })
        })
        .collect::<Vec<_>>()
        .into_iter()
        .map(|handle| handle.join().unwrap())
        .collect()
}

Structs§

MarkdownParser
A tree-sitter based markdown parser.
ParseResult
The result of parsing a markdown document.