html2json
A Rust port of cheerio-json-mapper.
Overview
- Input: HTML source + Extractor spec (JSON)
- Output: JSON matching the structure defined in the spec
- Available as: Rust crate, CLI tool, and WebAssembly npm package
Installation
npm / WebAssembly
From crates.io (Rust)
From source
# or from a git repository
Using just
Usage
JavaScript / TypeScript
import from "@qretaio/html2json";
const html = `
<article class="post">
<h2>My Article</h2>
<p class="author">John Doe</p>
<div class="tags">
<span>rust</span>
<span>wasm</span>
</div>
</article>
`;
const spec = ;
const result = await ;
console.log;
// {
// "title": "My Article",
// "author": "John Doe",
// "tags": [{"name": "rust"}, {"name": "wasm"}]
// }
CLI
# Extract from file
# Extract from stdin (pipe from curl)
|
# Extract from stdin (pipe from cat)
|
# Check output matches expected JSON (useful for testing/CI)
CLI Options
--spec, -s <FILE>- Path to JSON extractor spec file (required)--check, -c <FILE>- Compare output against expected JSON file. Exits with 0 if match, 1 if differ (with colored diff).
Spec Format
The spec is a JSON object where each key defines an output field and each value defines a CSS selector to extract that field.
Basic Selectors
Attributes
Pipes (Transformations)
Available pipes:
trim- Trim whitespacelower- Convert to lowercaseupper- Convert to uppercasesubstr:start:end- Extract substringregex:pattern- Regex capture (first group)parseAs:int- Parse as integerparseAs:float- Parse as floatattr:name- Get attribute valuevoid- Extract from void elements, useful for extracting xml
Collections (Arrays)
Scoping ($ selector)
Fallback Operators (||)
Optional Fields (?)
Optional fields that return null are removed from the output.
LICENSE
MIT