§🛠️ HTML Parser Developer Guide
This guide is designed to help you quickly get started with developing and integrating oak-html.
§🚦 Quick Start
Add the dependency to your Cargo.toml:
[dependencies]
oak-html = { path = "..." }
§Basic Parsing Example
The following is a standard workflow for parsing a modern HTML5 document with attributes and nested elements:
use oak_html::{HtmlParser, SourceText, HtmlLanguage};

fn main() {
    // 1. Prepare source code
    let code = r#"
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Oak HTML Example</title>
</head>
<body>
    <div id="app" class="container">
        <h1>Hello, Oak!</h1>
        <img src="logo.png" alt="Oak Logo" />
    </div>
    <script src="app.js"></script>
</body>
</html>
"#;
    let source = SourceText::new(code);

    // 2. Initialize parser
    let config = HtmlLanguage::new();
    let parser = HtmlParser::new(&config);

    // 3. Execute parsing
    let result = parser.parse(&source);

    // 4. Handle results
    if result.is_success() {
        println!("Parsing successful! AST node count: {}", result.node_count());
    } else {
        eprintln!("Errors found during parsing.");
    }
}
§🔍 Core API Usage
§1. Syntax Tree Traversal
After a successful parse, you can use the built-in visitor pattern or manually traverse the Green/Red Tree to extract HTML-specific constructs like element tags, attribute values, text content, or specific script and style blocks.
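The visitor approach mentioned above can be sketched roughly as follows. This is a minimal illustration over a hypothetical, simplified tree: `Node`, `Visitor`, `walk`, `SrcCollector`, `sample_tree`, and `collect_sources` are all invented for this example and are not the types or functions that oak-html exports.

```rust
// Hypothetical, simplified tree types for illustration only;
// these are NOT the node types exported by oak-html.
enum Node {
    Element {
        tag: String,
        attrs: Vec<(String, String)>,
        children: Vec<Node>,
    },
    Text(String),
}

/// A visitor with default no-op hooks, so implementors override only what they need.
trait Visitor {
    fn visit_element(&mut self, _tag: &str, _attrs: &[(String, String)]) {}
    fn visit_text(&mut self, _text: &str) {}
}

/// Depth-first, pre-order walk that dispatches to the visitor hooks.
fn walk(node: &Node, visitor: &mut dyn Visitor) {
    match node {
        Node::Element { tag, attrs, children } => {
            visitor.visit_element(tag, attrs);
            for child in children {
                walk(child, visitor);
            }
        }
        Node::Text(text) => visitor.visit_text(text),
    }
}

/// Collects the value of every `src` attribute in document order.
#[derive(Default)]
struct SrcCollector {
    sources: Vec<String>,
}

impl Visitor for SrcCollector {
    fn visit_element(&mut self, _tag: &str, attrs: &[(String, String)]) {
        for (name, value) in attrs {
            if name.as_str() == "src" {
                self.sources.push(value.clone());
            }
        }
    }
}

/// Builds a small tree mirroring part of the Quick Start document.
fn sample_tree() -> Node {
    Node::Element {
        tag: "div".to_string(),
        attrs: vec![("id".to_string(), "app".to_string())],
        children: vec![
            Node::Element {
                tag: "h1".to_string(),
                attrs: vec![],
                children: vec![Node::Text("Hello, Oak!".to_string())],
            },
            Node::Element {
                tag: "img".to_string(),
                attrs: vec![("src".to_string(), "logo.png".to_string())],
                children: vec![],
            },
        ],
    }
}

/// Runs the collector over a tree and returns the gathered `src` values.
fn collect_sources(root: &Node) -> Vec<String> {
    let mut collector = SrcCollector::default();
    walk(root, &mut collector);
    collector.sources
}
```

Calling `collect_sources(&sample_tree())` gathers the `src` values found during the walk. Default no-op hooks keep each visitor focused on the one construct it cares about, which is the main appeal of this pattern over a hand-written recursive match at every call site.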
§2. Incremental Parsing
When only a small part of a large HTML document changes, you do not need to re-parse the whole document:
// Assuming you have an old parse result 'old_result' and new source text 'new_source'
let new_result = parser.reparse(&new_source, &old_result);
§3. Diagnostics
oak-html provides rich error contexts specifically tailored for web developers, handling complex scenarios like unclosed tags or malformed attribute syntax:
for diag in result.diagnostics() {
    println!("[{}:{}] {}", diag.line, diag.column, diag.message);
}
§🏗️ Architecture Overview
- Lexer: Tokenizes HTML source text into a stream of tokens, including support for tags, attributes, text nodes, and special handling for script/style content.
- Parser: Syntax analyzer based on the Pratt parsing algorithm to handle HTML’s hierarchical structure, void elements, and self-closing tags.
- AST: A strongly-typed syntax abstraction layer designed for high-performance HTML analysis tools, scrapers, and IDEs.
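To make the lexer stage above concrete, here is a deliberately tiny tokenizer sketch. The `Token` enum and `tokenize` function are hypothetical names chosen for this example; the crate's real token set (see `HtmlTokenType`) is presumably much richer, and this sketch ignores attributes, comments, doctype declarations, and raw-text elements such as script and style.

```rust
// Hypothetical token type for illustration; not oak-html's actual token set.
#[derive(Debug, PartialEq)]
enum Token {
    OpenTag(String),
    CloseTag(String),
    Text(String),
}

/// Splits input into tag and text tokens. Attributes are dropped, and
/// comments, doctype, and script/style content are not treated specially.
fn tokenize(input: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        if let Some(after_lt) = rest.strip_prefix('<') {
            // Read up to the closing '>'; an unterminated tag consumes the rest.
            let end = after_lt.find('>').unwrap_or(after_lt.len());
            let inner = after_lt[..end].trim();
            match inner.strip_prefix('/') {
                Some(name) => tokens.push(Token::CloseTag(name.trim().to_string())),
                None => {
                    // Keep only the tag name; drop attributes and a trailing '/'.
                    let name = inner
                        .split(|c: char| c.is_whitespace() || c == '/')
                        .next()
                        .unwrap_or("");
                    tokens.push(Token::OpenTag(name.to_string()));
                }
            }
            // Skip past '>' if it was present.
            rest = after_lt.get(end + 1..).unwrap_or("");
        } else {
            // Text runs until the next '<' or the end of input.
            let end = rest.find('<').unwrap_or(rest.len());
            let text = rest[..end].trim();
            if !text.is_empty() {
                tokens.push(Token::Text(text.to_string()));
            }
            rest = &rest[end..];
        }
    }
    tokens
}
```

For instance, `tokenize("<h1>Hello, Oak!</h1>")` yields an open tag, a text token, and a close tag, which a parser stage would then assemble into a tree.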
§🔗 Advanced Resources
- Full Examples: Check the examples/ folder in the project root.
- API Documentation: Run cargo doc --open for detailed type definitions.
- Test Cases: See tests/ for handling of various HTML5 edge cases and “tag soup.”
§🚀 Oak HTML Parser
Structuring the Web with Precision — A high-performance, incremental HTML parser built on the Oak framework. Optimized for web scraping, static analysis, and modern IDE support for web development.
§🎯 Project Vision
HTML is the backbone of the web, and its complexity often arises from its flexibility and real-world “tag soup.” oak-html aims to provide a robust, high-performance parsing solution that can handle modern HTML5 standards with industrial-grade reliability. By utilizing Oak’s incremental parsing capabilities, it enables the creation of highly responsive tools for web development—from real-time preview engines to intelligent code refactoring tools.
§✨ Core Features
- ⚡ Blazing Fast: Leverages Rust’s performance to deliver sub-millisecond parsing, essential for real-time web development tools and large-scale web analysis.
- 🔄 Incremental Parsing: Built-in support for partial updates—re-parse only the sections of the HTML that changed, significantly improving performance for complex web pages.
- 🌳 High-Fidelity AST: Generates a detailed and easy-to-traverse Abstract Syntax Tree capturing:
  - Elements, Attributes, and nested structures
  - Comments, Doctype declarations, and Text nodes
  - Support for modern HTML5 features
- 🛡️ Industrial-Grade Error Recovery: Engineered to handle malformed or “tag soup” HTML gracefully, providing precise diagnostics while maintaining a valid tree structure.
- 🧩 Ecosystem Integration: Part of the Oak family; integrates with oak-lsp for full LSP support and with other Oak-based web analysis utilities.
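The error-recovery feature listed above can be illustrated with a small stack-based nesting check. This is a sketch under stated assumptions, not the crate's algorithm: `Diagnostic`, `position`, `diagnostic_at`, and `check_nesting` are hypothetical, with field names borrowed from the `diag.line`, `diag.column`, and `diag.message` accesses in the Diagnostics example. Columns are counted in bytes, and comments or script content containing `<` or `>` are not handled.

```rust
// Illustrative only: a hypothetical diagnostic record and nesting check.
#[derive(Debug, PartialEq)]
struct Diagnostic {
    line: usize,
    column: usize,
    message: String,
}

// Elements that never take a closing tag in HTML5.
const VOID_ELEMENTS: [&str; 13] = [
    "area", "base", "br", "col", "embed", "hr", "img", "input", "link", "meta",
    "source", "track", "wbr",
];

/// Converts a byte offset into a 1-based (line, column) pair.
fn position(input: &str, offset: usize) -> (usize, usize) {
    let before = &input[..offset];
    let line = before.matches('\n').count() + 1;
    let column = before.rfind('\n').map_or(offset, |nl| offset - nl - 1) + 1;
    (line, column)
}

fn diagnostic_at(input: &str, offset: usize, message: String) -> Diagnostic {
    let (line, column) = position(input, offset);
    Diagnostic { line, column, message }
}

/// Reports unclosed and stray tags while continuing past each problem.
fn check_nesting(input: &str) -> Vec<Diagnostic> {
    let mut diagnostics = Vec::new();
    // Stack of currently open elements: (tag name, byte offset of '<').
    let mut open: Vec<(String, usize)> = Vec::new();
    let mut cursor = 0;
    while let Some(rel) = input[cursor..].find('<') {
        let start = cursor + rel;
        let end = match input[start..].find('>') {
            Some(len) => start + len,
            None => break, // unterminated tag: stop scanning
        };
        let inner = input[start + 1..end].trim();
        cursor = end + 1;
        if inner.starts_with('!') {
            continue; // doctype or comment: ignored in this sketch
        }
        if let Some(rest) = inner.strip_prefix('/') {
            let name = rest.trim().to_ascii_lowercase();
            match open.iter().rposition(|(n, _)| *n == name) {
                // Recovery: implicitly close everything opened after the match.
                Some(idx) => {
                    for (unclosed, at) in open.drain(idx..).skip(1) {
                        let msg = format!("unclosed <{}> element", unclosed);
                        diagnostics.push(diagnostic_at(input, at, msg));
                    }
                }
                // Recovery: report the stray closing tag and ignore it.
                None => {
                    let msg = format!("stray closing tag </{}>", name);
                    diagnostics.push(diagnostic_at(input, start, msg));
                }
            }
        } else {
            let name = inner
                .split(|c: char| c.is_whitespace() || c == '/')
                .next()
                .unwrap_or("")
                .to_ascii_lowercase();
            let self_closing = inner.ends_with('/');
            if !self_closing && !VOID_ELEMENTS.contains(&name.as_str()) {
                open.push((name, start));
            }
        }
    }
    // Anything still open at end of input was never closed.
    for (name, at) in open {
        let msg = format!("unclosed <{}> element", name);
        diagnostics.push(diagnostic_at(input, at, msg));
    }
    diagnostics
}
```

The key design choice is that a mismatch never aborts the scan: the checker either closes intervening elements implicitly or skips the offending tag, so later problems in the same document are still reported.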
§🏗️ Architecture
The parser follows the Green/Red Tree architecture (inspired by Roslyn), which allows for:
- Efficient Immutability: Share nodes across different versions of the tree without copying.
- Lossless Syntax Trees: Retains all trivia (whitespace and comments), enabling faithful code formatting and refactoring.
- Type Safety: Strongly-typed “Red” nodes provide a convenient and safe API for tree traversal and analysis.
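The three properties above can be seen in a minimal Green/Red sketch. All names here (`Green`, `Red`, `replace_child`, `sample`) are hypothetical and chosen for illustration; Oak's actual node types and layout may differ considerably.

```rust
use std::rc::Rc;

// Hypothetical minimal green/red tree; not Oak's actual node types.

/// Green node: immutable, position-independent, cheaply shareable via Rc.
enum Green {
    Token(String),
    Node { kind: &'static str, children: Vec<Rc<Green>> },
}

impl Green {
    /// Width in bytes of the source text covered by this subtree.
    fn text_len(&self) -> usize {
        match self {
            Green::Token(text) => text.len(),
            Green::Node { children, .. } => children.iter().map(|c| c.text_len()).sum(),
        }
    }

    /// Reassembles the exact source text, trivia included (losslessness).
    fn text(&self) -> String {
        match self {
            Green::Token(text) => text.clone(),
            Green::Node { children, .. } => children.iter().map(|c| c.text()).collect(),
        }
    }
}

/// Red node: a lightweight view pairing a green node with its absolute offset.
struct Red {
    green: Rc<Green>,
    offset: usize,
}

impl Red {
    fn root(green: Rc<Green>) -> Red {
        Red { green, offset: 0 }
    }

    /// Child views are computed on demand, accumulating offsets left to right.
    fn children(&self) -> Vec<Red> {
        let mut offset = self.offset;
        let mut out = Vec::new();
        if let Green::Node { children, .. } = &*self.green {
            for child in children {
                out.push(Red { green: Rc::clone(child), offset });
                offset += child.text_len();
            }
        }
        out
    }
}

/// Returns a new root with child `index` replaced; all other children are shared.
fn replace_child(root: &Green, index: usize, new_child: Rc<Green>) -> Green {
    match root {
        Green::Node { kind, children } => {
            let mut children = children.clone(); // clones Rc pointers, not subtrees
            children[index] = new_child;
            Green::Node { kind: *kind, children }
        }
        Green::Token(text) => Green::Token(text.clone()),
    }
}

/// Builds a green tree for "<h1>Hello</h1>\n<p>Oak</p>".
fn sample() -> Rc<Green> {
    let tok = |s: &str| Rc::new(Green::Token(s.to_string()));
    let h1 = Rc::new(Green::Node {
        kind: "element",
        children: vec![tok("<h1>"), tok("Hello"), tok("</h1>")],
    });
    let p = Rc::new(Green::Node {
        kind: "element",
        children: vec![tok("<p>"), tok("Oak"), tok("</p>")],
    });
    Rc::new(Green::Node { kind: "document", children: vec![h1, tok("\n"), p] })
}
```

Because green nodes store widths rather than positions, replacing the `<p>` subtree via `replace_child` leaves the `<h1>` subtree shared between the old and new roots. Absolute offsets live only in the red layer, which is rebuilt lazily for whichever version of the tree is being inspected.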
§🤝 Contributing
We welcome contributions of all kinds! If you find a bug, have a feature request, or want to contribute code, please check our issues or submit a pull request.
HTML support for the Oak language framework.
Re-exports§
pub use crate::ast::HtmlDocument;
pub use crate::builder::HtmlBuilder;
pub use crate::language::HtmlLanguage;
pub use crate::lexer::HtmlLexer;
pub use crate::parser::HtmlParser;
pub use crate::lsp::highlighter::HtmlHighlighter;
pub use crate::lsp::HtmlLanguageService;
pub use lexer::token_type::HtmlTokenType;
pub use parser::element_type::HtmlElementType;
Modules§
- ast: AST module for HTML nodes.
- builder: Builder module for constructing HTML trees.
- language: Kind module defining HTML syntax types. Language module for HTML configuration.
- lexer: Lexer module for HTML tokenization.
- lsp: LSP module for HTML language service features.
- mcp: MCP module.
- parser: Parser module for HTML syntax analysis.