lex_babel/
lib.rs

//! Multi-format interoperability for Lex documents
//!
//!     This crate provides a uniform interface for converting between the Lex AST and various
//!     document formats (Markdown, HTML, Pandoc JSON, etc.).
//!
//!     TLDR: For format authors:
//!         - Babel never parses or serializes a format itself; it relies on that format's libraries.
//!         - A conversion should go through the IR: convert to the IR, run the shared code in ./common where relevant (it usually is), then convert to the AST of the target format.
//!         - Use the testing harness (see lex-core/src/lex/testing.rs) to load documents and process them into ASTs.
//!         - Each element should be unit tested in isolation, using the harness above and the available file for isolated element testing (load with the lib, assert on the AST / IR).
//!         - Each format should have the trifecta unit tested in both directions, from the format to Lex and from Lex to the format.
//!         - Each format should have a kitchensink unit tested in both directions as well.
//!         - Read the README.lex for full details.
//!
//! Architecture
//!
//!     The goal is to move as much as possible of the logic shared by multiple format conversions
//!     into a format-agnostic layer. This is done through the IR (./ir/mod.rs), with the shared
//!     code living in ./common/mod.rs. Format-specific code can then focus on data format
//!     transformations, on top of a strong, focused core that can be well tested in isolation.
//!
//!     This is a pure lib: it powers the lexd CLI but is shell agnostic. No code here should
//!     assume a shell environment, be it printing to stdout, reading env vars, etc.
//!
//!     The file structure:
//!     .
//!     ├── error.rs
//!     ├── format.rs               # Format trait definition
//!     ├── registry.rs             # FormatRegistry for discovery and selection
//!     ├── formats
//!     │   └── <format>
//!     │       ├── parser.rs       # Parser implementation
//!     │       ├── serializer.rs   # Serializer implementation
//!     │       └── mod.rs
//!     ├── lib.rs
//!     ├── ir                      # Intermediate Representation
//!     │   ├── nodes.rs            # IR data model (DocNode, InlineContent, etc.)
//!     │   ├── events.rs           # Flat event stream representation
//!     │   ├── from_lex.rs         # Lex AST → IR conversion
//!     │   └── to_lex.rs           # IR → Lex AST conversion
//!     └── common                  # Common mapping code
//!         ├── flat_to_nested.rs   # Events → IR tree (with auto-closing)
//!         ├── nested_to_flat.rs   # IR tree → Events
//!         └── verbatim/           # Verbatim handler registry (doc.table, doc.image, etc.)
//!
//! Testing
//!
//!     tests
//!     └── <format>
//!         ├── <testname>.rs
//!         └── fixtures
//!             ├── <docname>.<format>
//!             ├── kitchensink.html
//!             ├── kitchensink.lex
//!             └── kitchensink.md
//!
//!     Note that Rust does not discover tests in subdirectories by default, so these need to be
//!     included explicitly in the mod.
//!
//!
//! Core Algorithms
//!
//!     The most complex part of the work is reconstructing a nested representation from a flat
//!     document, and the reverse operation. For this reason we have a common IR (./ir/mod.rs)
//!     that is used for all formats. Over this representation we implement both algorithms (see
//!     ./common/flat_to_nested.rs and ./common/nested_to_flat.rs).
//!     This means that all the heavy lifting is done by a core, well tested and maintained module,
//!     freeing format adaptations to focus on the simpler data format transformations.
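//!
//!     As a rough illustration of the two algorithms, here is a minimal, self-contained sketch.
//!     It is not the code in ./common: the Event, Node and Open types, the function signatures
//!     and the exact auto-closing rule (an End closes every element still open inside the one it
//!     names) are assumptions made for this example.
//!
//!     ```rust
//!     #[derive(Debug, Clone, PartialEq)]
//!     enum Event {
//!         Start(String),
//!         Text(String),
//!         End(String),
//!     }
//!
//!     #[derive(Debug, Clone, PartialEq)]
//!     enum Node {
//!         Element { name: String, children: Vec<Node> },
//!         Text(String),
//!     }
//!
//!     // An open element: its name and the children collected so far.
//!     type Open = (String, Vec<Node>);
//!
//!     // Pops the innermost open element and attaches it to its parent.
//!     // The root entry at the bottom of the stack is never popped here.
//!     fn close_top(stack: &mut Vec<Open>) {
//!         if stack.len() > 1 {
//!             if let Some((name, children)) = stack.pop() {
//!                 if let Some(parent) = stack.last_mut() {
//!                     parent.1.push(Node::Element { name, children });
//!                 }
//!             }
//!         }
//!     }
//!
//!     // Events -> tree. An End closes every element opened inside the one it
//!     // names; elements still open at the end of input are closed as well.
//!     fn flat_to_nested(events: &[Event]) -> Vec<Node> {
//!         let mut stack: Vec<Open> = vec![(String::new(), Vec::new())];
//!         for event in events {
//!             match event {
//!                 Event::Start(name) => stack.push((name.clone(), Vec::new())),
//!                 Event::Text(text) => {
//!                     if let Some(top) = stack.last_mut() {
//!                         top.1.push(Node::Text(text.clone()));
//!                     }
//!                 }
//!                 // Only act on an End that matches some open element.
//!                 Event::End(name) if stack.iter().skip(1).any(|(open, _)| open == name) => {
//!                     while stack.last().map_or(false, |(open, _)| open != name) {
//!                         close_top(&mut stack);
//!                     }
//!                     close_top(&mut stack);
//!                 }
//!                 // An End without a matching open element is ignored.
//!                 Event::End(_) => {}
//!             }
//!         }
//!         while stack.len() > 1 {
//!             close_top(&mut stack);
//!         }
//!         stack.pop().map(|(_, children)| children).unwrap_or_default()
//!     }
//!
//!     // Tree -> events: a depth-first walk that emits balanced Start/End pairs.
//!     fn nested_to_flat(nodes: &[Node], out: &mut Vec<Event>) {
//!         for node in nodes {
//!             match node {
//!                 Node::Text(text) => out.push(Event::Text(text.clone())),
//!                 Node::Element { name, children } => {
//!                     out.push(Event::Start(name.clone()));
//!                     nested_to_flat(children, out);
//!                     out.push(Event::End(name.clone()));
//!                 }
//!             }
//!         }
//!     }
//!     ```
//!
//!     Under this rule, the events Start(list), Start(item), Text(a), End(list) yield a list that
//!     contains an item that contains the text, and flattening that tree emits the End(item) that
//!     the input left out.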
//!
//!
//! Formats
//!
//!     Format-specific capabilities are implemented with the Format trait. Formats should have
//!     parse() and serialize() methods, a name and file extensions. See the trait definition in
//!     ./format.rs.
//!
//!     - Format trait: Uniform interface for all formats (parsing and/or serialization)
//!     - FormatRegistry: Centralized discovery and selection of formats
//!     - Format implementations: Concrete implementations for each supported format
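//!
//!     To make the relationship between these pieces concrete, here is a small standalone sketch.
//!     It does not reproduce the definitions in ./format.rs or ./registry.rs: the Doc type, the
//!     method signatures, the String error type, the Registry struct and the toy Shout format are
//!     all hypothetical, and the real trait may well differ.
//!
//!     ```rust
//!     // Hypothetical stand-in for a parsed document.
//!     #[derive(Debug, Clone, PartialEq)]
//!     struct Doc(String);
//!
//!     // Hypothetical trait shape: a name, file extensions, and parse/serialize.
//!     // Both directions default to "unsupported", so a format can offer one only.
//!     trait Format {
//!         fn name(&self) -> &str;
//!         fn file_extensions(&self) -> &[&str];
//!         fn parse(&self, _source: &str) -> Result<Doc, String> {
//!             Err(format!("{} does not support parsing", self.name()))
//!         }
//!         fn serialize(&self, _doc: &Doc) -> Result<String, String> {
//!             Err(format!("{} does not support serialization", self.name()))
//!         }
//!     }
//!
//!     // Toy serialize-only format that upper-cases the document text.
//!     struct Shout;
//!
//!     impl Format for Shout {
//!         fn name(&self) -> &str {
//!             "shout"
//!         }
//!         fn file_extensions(&self) -> &[&str] {
//!             &["shout"]
//!         }
//!         fn serialize(&self, doc: &Doc) -> Result<String, String> {
//!             Ok(doc.0.to_uppercase())
//!         }
//!     }
//!
//!     // Hypothetical registry: selection by format name or by file extension.
//!     #[derive(Default)]
//!     struct Registry {
//!         formats: Vec<Box<dyn Format>>,
//!     }
//!
//!     impl Registry {
//!         fn register(&mut self, format: Box<dyn Format>) {
//!             self.formats.push(format);
//!         }
//!         fn by_name(&self, name: &str) -> Option<&dyn Format> {
//!             self.formats.iter().map(|f| f.as_ref()).find(|f| f.name() == name)
//!         }
//!         fn by_extension(&self, ext: &str) -> Option<&dyn Format> {
//!             self.formats
//!                 .iter()
//!                 .map(|f| f.as_ref())
//!                 .find(|f| f.file_extensions().contains(&ext))
//!         }
//!     }
//!     ```
//!
//!     Because callers only see the trait object returned by the registry, every format is used
//!     through the same interface, and a one-directional format reports the missing direction as
//!     an error instead of needing a separate API.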
//!
//!
//! The Lex Format
//!
//!     The Lex format itself is implemented as a format, see ./formats/lex/mod.rs, which allows for
//!     a homogeneous API where all formats have identical interfaces.
//!
//!     Note that Lex is more expressive than most formats, which means that converting from Lex is
//!     simple, but always lossy. Conversely, converting to Lex requires some consideration of how
//!     to best represent the author's intent.
//!
//!     This means that lossless round-tripping between formats is not possible.
//!
//! Format Selection
//!
//!     The choice of formats is fairly straightforward:
//!
//!     - HTML: self-evident, as it's the most common format for publishing and viewing.
//!     - Markdown: both import and export, as Markdown is the universal format for plain text editing.
//!     - Tag (XML): serializing Lex to a structural XML representation is trivial and useful for storage.
//!     - RFC XML: parse-only import from IETF RFC XML (v3) documents.
//!
//!     These are table stakes: a format that can't export to HTML, convert to Markdown, or
//!     produce a good, purely semantic XML output is a non-starter.
//!
//!     For everything else, there are good arguments for a variety of formats. The one with the
//!     strongest fit and use case is LaTeX, as Lex can be very useful for scientific writing. But
//!     LaTeX is complicated, and having Pandoc in the pipeline allows us to serve pretty much any
//!     other format reasonably well.
//!
//! Library Choices
//!
//!     Since this is not Lex's core, we offload as much as possible to better, specialized crates
//!     for each format. The scope here is mainly to adapt the Lex AST to the format's AST or vice
//!     versa. For example, we never write the serializer for, say, Markdown, but pass the AST to the
//!     Markdown library. To support an inbound format, we write the format AST → Lex AST adapter.
//!     Likewise, for outbound formats we do the reverse, converting from the Lex AST to the
//!     format's.
//!
//!     As much as possible, we use Rust crates, and avoid shelling out or relying on outside dependencies.
//!
pub mod error;
pub mod format;
pub mod formats;
pub mod publish;
pub mod registry;
pub mod templates;
pub mod transforms;

pub mod common;
pub mod ir;

pub use error::FormatError;
pub use format::{Format, SerializedDocument};
pub use registry::FormatRegistry;

/// Converts a Lex document to the Intermediate Representation (IR).
///
/// # Information Loss
///
/// The IR is a simplified, semantic representation. The following
/// Lex information is lost during conversion:
/// - Blank line grouping (BlankLineGroup nodes)
/// - Source positions and token information
/// - Comment annotations at document level
///
/// For lossless Lex representation, use the AST directly.
pub fn to_ir(doc: &lex_core::lex::ast::elements::Document) -> ir::nodes::Document {
    ir::from_lex::from_lex_document(doc)
}

/// Converts an IR document back to Lex AST.
///
/// This is useful for round-trip conversions: Format → IR → Lex.
pub fn from_ir(doc: &ir::nodes::Document) -> lex_core::lex::ast::elements::Document {
    ir::to_lex::to_lex_document(doc)
}