Document parsing module.
This module provides parsers for different document formats.
Each parser extracts RawNodes from documents that can then be
organized into a [DocumentTree].
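To illustrate how flat nodes might be organized into a tree, here is a minimal sketch that nests nodes by heading level. `FlatNode`, `TreeNode`, and `build_tree` are hypothetical stand-ins invented for this example; the crate's actual `RawNode` and `DocumentTree` types and its tree-building logic may differ.

```rust
// Hypothetical stand-ins: not the crate's real RawNode / DocumentTree types.
#[derive(Debug, Clone)]
struct FlatNode {
    title: String,
    level: usize,
}

#[derive(Debug)]
struct TreeNode {
    title: String,
    level: usize,
    children: Vec<TreeNode>,
}

/// Nest flat nodes by level: each node becomes a child of the nearest
/// preceding node with a strictly smaller level.
fn build_tree(nodes: &[FlatNode]) -> Vec<TreeNode> {
    // A virtual root at level 0 collects the top-level nodes.
    let mut stack = vec![TreeNode {
        title: String::new(),
        level: 0,
        children: Vec::new(),
    }];

    for n in nodes {
        // Close every open node that cannot be an ancestor of `n`.
        while stack.len() > 1 && stack.last().unwrap().level >= n.level {
            let done = stack.pop().unwrap();
            stack.last_mut().unwrap().children.push(done);
        }
        stack.push(TreeNode {
            title: n.title.clone(),
            level: n.level,
            children: Vec::new(),
        });
    }

    // Fold the remaining open nodes back into their parents.
    while stack.len() > 1 {
        let done = stack.pop().unwrap();
        stack.last_mut().unwrap().children.push(done);
    }
    stack.pop().unwrap().children
}

fn main() {
    let flat = vec![
        FlatNode { title: "Title".into(), level: 1 },
        FlatNode { title: "Intro".into(), level: 2 },
        FlatNode { title: "Details".into(), level: 2 },
        FlatNode { title: "Appendix".into(), level: 1 },
    ];
    for top in &build_tree(&flat) {
        println!("{} ({} children)", top.title, top.children.len());
    }
}
```

The stack holds the chain of currently open ancestors, so each node is attached in a single pass over the input.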
§Supported Formats
- Markdown - Full support via MarkdownParser
- PDF - Full support via PdfParser with TOC extraction
- DOCX - Full support via DocxParser with heading detection
- HTML - Full support via HtmlParser with heading hierarchy
§Example
use vectorless::parser::{DocumentParser, MarkdownParser, DocumentFormat};
// Create a parser
let parser = MarkdownParser::new();
// Parse content
let content = "# Title\n\nContent here.";
let result = parser.parse(content).await?;
println!("Extracted {} nodes", result.node_count());
for node in &result.nodes {
    println!(" - {} (level {})", node.title, node.level);
}
Re-exports§
pub use docx::DocxParser;
pub use html::HtmlConfig;
pub use html::HtmlParser;
pub use markdown::MarkdownConfig;
pub use markdown::MarkdownParser;
pub use pdf::PdfParser;
Modules§
- docx - DOCX document parsing module.
- html - HTML document parser.
- markdown - Production-ready Markdown parser module.
- pdf - PDF document parsing module.
- toc - Table of Contents (TOC) processing module.
Structs§
- DocumentMeta - Document metadata.
- ParseResult - Result of parsing a document.
- ParserRegistry - Registry for document parsers.
- RawNode - A raw node extracted from a document.
Enums§
- DocumentFormat - Supported document formats.
Traits§
- DocumentParser - A parser for extracting content from documents.
Functions§
- get_parser - Get a parser for the given format.
- get_parser_for_file - Get a parser for a file based on its extension.
- parse_content - Parse a document from content using the appropriate parser.
- parse_file - Parse a document from a file.
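
Extension-based parser selection can be sketched as follows. This is an illustration only: the `Format` enum, the `format_for_path` helper, and the extension list are assumptions made for this example, not the crate's `DocumentFormat` or `get_parser_for_file` implementation.

```rust
use std::path::Path;

// Hypothetical stand-in for a document format enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Format {
    Markdown,
    Pdf,
    Docx,
    Html,
}

/// Map a file path to a format using its extension, case-insensitively.
/// Returns `None` when the extension is missing or unrecognized.
/// The extension list below is an assumption for illustration.
fn format_for_path(path: &Path) -> Option<Format> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "md" | "markdown" => Some(Format::Markdown),
        "pdf" => Some(Format::Pdf),
        "docx" => Some(Format::Docx),
        "html" | "htm" => Some(Format::Html),
        _ => None,
    }
}

fn main() {
    for name in ["notes.md", "paper.PDF", "report.docx", "index.htm", "data.csv"] {
        println!("{name}: {:?}", format_for_path(Path::new(name)));
    }
}
```

Returning `Option` rather than panicking lets the caller decide how to handle unsupported files, for example by reporting an error or falling back to a default parser.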