aozora 0.4.1

Aozora Bunko notation parser — single front door (Document + AozoraTree)
Documentation
# aozora

<p align="center">
  <a href="https://github.com/P4suta/aozora/actions/workflows/ci.yml"><img alt="ci" src="https://github.com/P4suta/aozora/actions/workflows/ci.yml/badge.svg"></a>
  <a href="https://github.com/P4suta/aozora/actions/workflows/docs.yml"><img alt="docs deploy" src="https://github.com/P4suta/aozora/actions/workflows/docs.yml/badge.svg"></a>
  <a href="https://github.com/P4suta/aozora/releases/latest"><img alt="latest release" src="https://img.shields.io/github/v/release/P4suta/aozora?display_name=tag&sort=semver"></a>
  <a href="./LICENSE-APACHE"><img alt="license" src="https://img.shields.io/badge/license-Apache--2.0%20OR%20MIT-blue"></a>
  <a href="./rust-toolchain.toml"><img alt="msrv" src="https://img.shields.io/badge/rust-1.95-orange"></a>
</p>

<p align="center">
  🎮 <a href="https://p4suta.github.io/aozora/playground/"><strong>Playground</strong></a>
  · 📚 <a href="https://p4suta.github.io/aozora/"><strong>Handbook (mdbook)</strong></a>
  · 📖 <a href="https://p4suta.github.io/aozora/api/aozora/"><strong>API reference (rustdoc)</strong></a>
  · 📦 <a href="https://github.com/P4suta/aozora/releases"><strong>Releases &amp; binaries</strong></a>
  · 🇯🇵 <a href="./README.ja.md"><strong>日本語</strong></a>
</p>

Pure-functional Rust parser for **青空文庫記法** (Aozora Bunko notation):
ruby (`|青梅《おうめ》`), bouten (`[#「X」に傍点]`), 縦中横, 外字
references (`※[#…、第3水準1-85-54]`), kunten / kaeriten,
indent / align-end containers (`[#ここから2字下げ]… [#ここで字下げ終わり]`),
and page / section breaks.

The parser is **CommonMark-free, Markdown-free** — this repository deals
only with the 青空文庫 notation itself. The renderer emits semantic HTML5;
the lexer reports structured diagnostics; the AST is a borrowed-arena
tree that can be walked in O(n) without copying source bytes.

## Installation

### Pre-built CLI

Pre-built `aozora` CLI binaries for **Linux x86_64**, **macOS arm64**,
and **Windows x86_64** are attached to every GitHub Release —
[the releases page](https://github.com/P4suta/aozora/releases) carries
`aozora-vX.Y.Z-<target>.{tar.gz,zip}` archives with `SHA256SUMS`.

### Build from source

```sh
cargo install --git https://github.com/P4suta/aozora --locked aozora-cli
```

(builds the latest `main`; pin to a release tag for reproducible builds —
see [the install chapter](https://p4suta.github.io/aozora/getting-started/install.html)
for the tag-pinned form.)

### As a Rust library

The `Cargo.toml` snippet (with the current release tag) lives in the
[install chapter](https://p4suta.github.io/aozora/getting-started/install.html#as-a-rust-library) —
keeping it in one place avoids version-pin drift across multiple READMEs.
crates.io publication tracks the 1.0 API freeze.

For WASM / C ABI / Python bindings see the
[Bindings chapters](https://p4suta.github.io/aozora/bindings/rust.html) of
the handbook.

## Quickstart

```rust
use aozora::Document;

let source = "|青梅《おうめ》".to_owned();
let doc = Document::new(source);
let tree = doc.parse();

let html: String = tree.to_html();
let canonical: String = tree.serialize();
let diagnostics = tree.diagnostics();

assert_eq!(canonical, "|青梅《おうめ》");
```

`Document` owns a [`bumpalo`](https://docs.rs/bumpalo) arena; `tree`
borrows from it for the lifetime of the `Document`. Dropping the
`Document` releases every node in a single `Bump::reset` step.

## CLI

```sh
aozora check FILE.txt           # lex + report diagnostics
aozora fmt --check FILE.txt     # round-trip parse ∘ serialize check
aozora render FILE.txt          # render to HTML on stdout
aozora check -E sjis FILE.txt   # Shift_JIS source from Aozora Bunko
```

All subcommands accept `-` (or no path argument) to read from stdin.
See the [CLI reference chapter](https://p4suta.github.io/aozora/ref/cli.html)
for the full subcommand reference.

## Crate layout

aozora is a 21-crate workspace.
[`crates/aozora`](./crates/aozora) is the public facade — library
consumers usually import only this one.

| Crate | Purpose |
|---|---|
| [`crates/aozora`]./crates/aozora | Top-level facade. `Document::parse() → AozoraTree<'_>`, structured `Diagnostic`s, `SLUGS` catalogue, `canonicalise_slug`. The single front door. |
| [`crates/aozora-spec`]./crates/aozora-spec | Single source of truth for shared types: `Span`, `TriggerKind`, `PairKind`, `Diagnostic`, PUA sentinel codepoints, `SLUGS` dispatch table. No internal dependency. |
| [`crates/aozora-syntax`]./crates/aozora-syntax | AST types (`AozoraNode` borrowed-arena variants, `ContainerKind`, `BoutenKind`, `Indent`). |
| [`crates/aozora-encoding`]./crates/aozora-encoding | Shift_JIS decoding + 外字 lookup (compile-time PHF, JIS X 0213 + UCS resolution). |
| [`crates/aozora-scan`]./crates/aozora-scan | SIMD-friendly multi-pattern scanner backends (Teddy / structural-bitmap / Hoehrmann DFA / naive fallback). |
| [`crates/aozora-veb`]./crates/aozora-veb | Eytzinger-layout sorted-set lookup (cache-friendly binary search). |
| [`crates/aozora-pipeline`]./crates/aozora-pipeline | 4-phase lexer (sanitize → events → pair → classify) plus the `lex_into_arena` orchestrator — pure `fn(&str, &Arena) -> BorrowedLexOutput<'_>`. |
| [`crates/aozora-render`]./crates/aozora-render | HTML and serialise renderers — `html::render_to_string`, `serialize::serialize`. |
| [`crates/aozora-cst`]./crates/aozora-cst | rowan-backed lossless concrete syntax tree. Editor/formatter surface. |
| [`crates/aozora-query`]./crates/aozora-query | Tree-sitter-style pattern DSL (`SyntaxKind` + capture) for queries over the CST. |
| [`crates/aozora-pandoc`]./crates/aozora-pandoc | Pandoc AST projection (`AozoraTree``pandoc_ast::Pandoc`); unlocks 50+ output formats via Pandoc writers. |
| [`crates/aozora-cli`]./crates/aozora-cli | `aozora` binary: `check` / `fmt` / `schema` / `kinds` / `explain` / `pandoc`. |
| [`crates/aozora-wasm`]./crates/aozora-wasm | `wasm32-unknown-unknown` target for `wasm-pack build --target web`. |
| [`crates/aozora-ffi`]./crates/aozora-ffi | C ABI driver (opaque handle, JSON-encoded structured data). |
| [`crates/aozora-py`]./crates/aozora-py | PyO3 bindings, distributed via `maturin`. |
| [`crates/aozora-bench`]./crates/aozora-bench | Criterion + corpus-driven probes (PGO profile source). |
| [`crates/aozora-conformance`]./crates/aozora-conformance | WPT-style conformance fixture runner (golden HTML / serialize / diagnostics / wire across 23 fixtures). |
| [`crates/aozora-corpus`]./crates/aozora-corpus | Corpus source abstraction for sweep tests (dev-only, set `AOZORA_CORPUS_ROOT`). |
| [`crates/aozora-proptest`]./crates/aozora-proptest | Shared proptest strategies (`aozora_fragment` / `pathological_aozora` / `unicode_adversarial` and friends; dev-only). |
| [`crates/aozora-trace`]./crates/aozora-trace | DWARF symbolicator for samply traces. |
| [`crates/aozora-xtask`]./crates/aozora-xtask | Repo automation (samply wrapper, trace analysis, corpus pack/unpack, schema dumps). |

See the [Architecture chapter](https://p4suta.github.io/aozora/arch/pipeline.html)
of the handbook for the layered design, the borrowed-arena AST, the
SIMD scanner backends, and the dependency graph between these
crates.

## Development

Everything runs inside Docker — the host toolchain is never invoked.
Bring up the dev image once, then drive every operation through `just`:

```sh
just                # list targets
just build          # cargo build --workspace --all-targets
just test           # cargo nextest run --workspace
just prop           # property-based sweep (128 cases per block)
just lint           # fmt + clippy pedantic+nursery + typos + strict-code
just deny           # cargo-deny licenses + advisories + bans
just coverage       # cargo llvm-cov branch coverage
just ci             # full CI replica
just book-build     # render the mdbook handbook
just book-serve     # live-preview the handbook at localhost:3000
```

Use `just run` to invoke the CLI inside the container:

```sh
just run check FILE.txt
just run render -E sjis FILE.txt > out.html
```

See [`CONTRIBUTING.md`](./CONTRIBUTING.md) for the contribution flow,
testing strategy, and lint policy.

## Documentation

- 📚 [**Handbook**]https://p4suta.github.io/aozora/ — the mdbook
  site: notation reference, architecture (borrowed-arena AST,
  SIMD scanner backends, encoding), bindings (Rust / WASM / C ABI /
  Python), performance (samply / bench / corpus sweep), CLI / API /
  env reference, and the contributor guide.
- 📖 [**API reference (rustdoc)**]https://p4suta.github.io/aozora/api/aozora/
  — auto-deployed alongside the handbook.
- [`CONTRIBUTING.md`]./CONTRIBUTING.md — dev setup, TDD flow,
  PR rules.
- [`SECURITY.md`]./SECURITY.md — vulnerability disclosure.
- [`CHANGELOG.md`]./CHANGELOG.md — release history.

## Related projects

| Repo | What it is |
|---|---|
| [`P4suta/afm`]https://github.com/P4suta/afm | CommonMark + GFM + 青空文庫記法 integrated Markdown dialect, built on top of this parser. |
| [`P4suta/aozora-tools`]https://github.com/P4suta/aozora-tools | Authoring tools: formatter, LSP server, tree-sitter grammar, VS Code extension. |

## License

Dual-licensed under [Apache-2.0](./LICENSE-APACHE) OR [MIT](./LICENSE-MIT)
at your option, matching Rust community convention. See
[`NOTICE`](./NOTICE) for third-party attribution (Aozora Bunko spec
snapshots and public-domain sample works used in tests).