Module parser

Document parsing module.

This module provides parsers for several document formats. Each parser extracts RawNode values from a document; those nodes can then be organized into a DocumentTree.
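
To illustrate what "organized into a tree" can mean for a flat list of nodes, here is a minimal, self-contained sketch that nests nodes by heading level using a stack. `FlatNode`, `TreeNode`, and `build_tree` are hypothetical names invented for this illustration; they are not the crate's `RawNode` or `DocumentTree`, whose actual fields and construction may differ. Only the `title` and `level` fields are taken from the example on this page.

```rust
/// Hypothetical stand-in for a parsed node: a title plus a heading level.
#[derive(Debug, Clone)]
struct FlatNode {
    title: String,
    level: usize,
}

/// Hypothetical tree node (not the crate's `DocumentTree`).
#[derive(Debug)]
struct TreeNode {
    title: String,
    level: usize,
    children: Vec<TreeNode>,
}

/// Nest a flat, document-ordered list of nodes by heading level.
fn build_tree(nodes: &[FlatNode]) -> Vec<TreeNode> {
    // A synthetic root at level 0 stays at the bottom of the stack.
    let mut stack = vec![TreeNode {
        title: String::new(),
        level: 0,
        children: Vec::new(),
    }];

    for node in nodes {
        // Close every open node that cannot be an ancestor of `node`.
        while stack.len() > 1 && stack.last().unwrap().level >= node.level {
            let done = stack.pop().unwrap();
            stack.last_mut().unwrap().children.push(done);
        }
        stack.push(TreeNode {
            title: node.title.clone(),
            level: node.level,
            children: Vec::new(),
        });
    }

    // Fold the nodes that are still open back into their parents.
    while stack.len() > 1 {
        let done = stack.pop().unwrap();
        stack.last_mut().unwrap().children.push(done);
    }
    stack.pop().unwrap().children
}
```

Each node becomes a child of the nearest preceding node with a strictly smaller level, which matches how nested headings are usually read.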

§Supported Formats

  • Markdown - Full support via MarkdownParser
  • PDF - Full support via PdfParser with TOC extraction
  • DOCX - Full support via DocxParser with heading detection
  • HTML - Planned (placeholder)

§Example

use vectorless::parser::{DocumentParser, MarkdownParser};

// Create a parser
let parser = MarkdownParser::new();

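// Parse content (`parse` is async and fallible, so this snippet must run
// inside an async function that returns a `Result`)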
let content = "# Title\n\nContent here.";
let result = parser.parse(content).await?;

println!("Extracted {} nodes", result.node_count());
for node in &result.nodes {
    println!("  - {} (level {})", node.title, node.level);
}

Re-exports§

pub use docx::DocxParser;
pub use markdown::MarkdownConfig;
pub use markdown::MarkdownParser;
pub use pdf::PdfParser;

Modules§

docx
DOCX document parsing module.
markdown
Production-ready Markdown parser module.
pdf
PDF document parsing module.
toc
Table of Contents (TOC) processing module.

Structs§

DocumentMeta
Document metadata.
ParseResult
Result of parsing a document.
ParserRegistry
Registry for document parsers.
RawNode
A raw node extracted from a document.
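
As a rough illustration of the registry idea, the sketch below maps a format name to a boxed parser and dispatches on lookup. Everything in it is hypothetical: `Parser`, `HeadingParser`, and `Registry` are invented for this example, the trait is synchronous for brevity (the module's own example awaits `parse`), and the real `ParserRegistry` API is not shown on this page and may look quite different.

```rust
use std::collections::HashMap;

/// Hypothetical, synchronous stand-in for a parser trait.
trait Parser {
    /// Returns the heading titles found in `content`.
    fn parse(&self, content: &str) -> Vec<String>;
}

/// Toy parser: treats every line starting with '#' as a heading.
struct HeadingParser;

impl Parser for HeadingParser {
    fn parse(&self, content: &str) -> Vec<String> {
        content
            .lines()
            .filter(|line| line.starts_with('#'))
            .map(|line| line.trim_start_matches('#').trim().to_string())
            .collect()
    }
}

/// Hypothetical registry keyed by a format name such as "markdown".
#[derive(Default)]
struct Registry {
    parsers: HashMap<String, Box<dyn Parser>>,
}

impl Registry {
    /// Register (or replace) the parser used for `format`.
    fn register(&mut self, format: &str, parser: Box<dyn Parser>) {
        self.parsers.insert(format.to_string(), parser);
    }

    /// Parse `content` with the parser registered for `format`, if any.
    fn parse(&self, format: &str, content: &str) -> Option<Vec<String>> {
        self.parsers.get(format).map(|p| p.parse(content))
    }
}
```

Returning `Option` keeps an unregistered format distinguishable from a document that simply contains no headings.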

Enums§

DocumentFormat
Supported document formats.

Traits§

DocumentParser
A parser for extracting content from documents.

Functions§

get_parser
Get a parser for the given format.
get_parser_for_file
Get a parser for a file based on its extension.
parse_content
Parse a document from content using the appropriate parser.
parse_file
Parse a document from a file.
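
Selecting a parser "based on its extension" can be sketched with the standard library alone. The `Format` enum and `format_from_path` function below are hypothetical and stand in for whatever `DocumentFormat` and `get_parser_for_file` actually do; the set of recognized extensions (`md`, `markdown`, `pdf`, `docx`, `html`, `htm`) is likewise an assumption, not something this page specifies.

```rust
use std::path::Path;

/// Hypothetical format enum mirroring the formats listed on this page.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Format {
    Markdown,
    Pdf,
    Docx,
    Html,
}

/// Guess a format from the file extension, case-insensitively.
/// Returns `None` for a missing, non-UTF-8, or unrecognized extension.
fn format_from_path(path: &Path) -> Option<Format> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "md" | "markdown" => Some(Format::Markdown),
        "pdf" => Some(Format::Pdf),
        "docx" => Some(Format::Docx),
        "html" | "htm" => Some(Format::Html),
        _ => None,
    }
}
```

Only the final extension is considered, so a name such as `archive.tar.gz` is classified by `gz` and is not recognized here.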