lex_babel/lib.rs
//! Multi-format interoperability for Lex documents
//!
//! This crate provides a uniform interface for converting between the Lex AST and various document
//! formats (Markdown, HTML, Pandoc JSON, etc.).
//!
//! TL;DR for format authors:
//! - Babel never parses or serializes a format itself; it relies on that format's own libraries.
//! - A conversion goes from the source format to the IR, through the shared code in ./common where relevant (it usually is), then to the AST of the target format.
//! - Use the testing harness (see lex-core/src/lex/testing.rs) to load documents and process them into ASTs.
//! - Test each element in isolation with unit tests, using the harness above and the available files for isolated element testing (load with the lib, assert on the AST / IR).
//! - Each format should have trifecta unit tested in both directions (from and to Lex).
//! - Each format should have a kitchensink unit tested in both directions (from and to Lex).
//! - Read README.lex for full details.
//!
//! Architecture
//!
//! The goal is to split, as much as possible, the logic shared by multiple format conversions
//! into a format-agnostic layer. This is done with the IR (./ir/mod.rs) and the common code in
//! ./common/mod.rs. Format-specific code can then focus on data format transformations, backed
//! by a small, focused core that can be tested well in isolation.
//!
//! This is a pure library: it powers the lexd CLI but is shell agnostic. No code here should
//! assume a shell environment, whether by printing to stdout, reading env vars, etc.
//!
//! The file structure:
//! .
//! ├── error.rs
//! ├── format.rs               # Format trait definition
//! ├── registry.rs             # FormatRegistry for discovery and selection
//! ├── formats
//! │   └── <format>
//! │       ├── parser.rs       # Parser implementation
//! │       ├── serializer.rs   # Serializer implementation
//! │       └── mod.rs
//! ├── lib.rs
//! ├── ir                      # Intermediate Representation
//! │   ├── nodes.rs            # IR data model (DocNode, InlineContent, etc.)
//! │   ├── events.rs           # Flat event stream representation
//! │   ├── from_lex.rs         # Lex AST → IR conversion
//! │   └── to_lex.rs           # IR → Lex AST conversion
//! └── common                  # Common mapping code
//!     ├── flat_to_nested.rs   # Events → IR tree (with auto-closing)
//!     ├── nested_to_flat.rs   # IR tree → Events
//!     └── verbatim/           # Verbatim handler registry (doc.table, doc.image, etc.)
//!
//! Testing
//!
//! tests
//! └── <format>
//!     ├── <testname>.rs
//!     └── fixtures
//!         ├── <docname>.<format>
//!         ├── kitchensink.html
//!         ├── kitchensink.lex
//!         └── kitchensink.md
//!
//! Note that Rust does not discover tests in subdirectories by default, so these files need to
//! be included through explicit mod declarations.
//!
//!
//! Core Algorithms
//!
//! The most complex part of the work is reconstructing a nested representation from a flat
//! document, and the reverse operation. For this reason we have a common IR (./ir/mod.rs) that
//! is used for all formats. Both algorithms are implemented over this representation (see
//! ./common/flat_to_nested.rs and ./common/nested_to_flat.rs).
//! This means that all the heavy lifting is done by a core, well-tested and maintained module,
//! freeing format adapters to focus on the simpler data format transformations.
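//!
//! As a minimal sketch of the two directions (not the actual code in ./common): the Event and
//! Node types below are hypothetical stand-ins for the IR types, and auto-closing is assumed to
//! mean that an end event also closes any containers still open inside the one it names, and
//! that anything left open at the end of input is closed as well.
//!
//! ```rust
//! // Hypothetical flat event and tree types, for illustration only.
//! #[derive(Debug, Clone, PartialEq)]
//! enum Event {
//!     Start(String),
//!     Text(String),
//!     End(String),
//! }
//!
//! #[derive(Debug, Clone, PartialEq)]
//! enum Node {
//!     Element { name: String, children: Vec<Node> },
//!     Text(String),
//! }
//!
//! /// Walks the tree depth-first and emits one flat event stream.
//! fn nested_to_flat(nodes: &[Node], out: &mut Vec<Event>) {
//!     for node in nodes {
//!         match node {
//!             Node::Text(t) => out.push(Event::Text(t.clone())),
//!             Node::Element { name, children } => {
//!                 out.push(Event::Start(name.clone()));
//!                 nested_to_flat(children, out);
//!                 out.push(Event::End(name.clone()));
//!             }
//!         }
//!     }
//! }
//!
//! /// Rebuilds a tree from flat events using a stack of open containers.
//! fn flat_to_nested(events: &[Event]) -> Vec<Node> {
//!     // Index 0 is a synthetic root that collects the top-level nodes.
//!     let mut stack: Vec<(String, Vec<Node>)> = vec![(String::new(), Vec::new())];
//!
//!     // Pops the innermost open container and attaches it to its parent.
//!     fn close(stack: &mut Vec<(String, Vec<Node>)>) {
//!         if stack.len() > 1 {
//!             let (name, children) = stack.pop().unwrap();
//!             stack.last_mut().unwrap().1.push(Node::Element { name, children });
//!         }
//!     }
//!
//!     for event in events {
//!         match event {
//!             Event::Start(name) => stack.push((name.clone(), Vec::new())),
//!             Event::Text(t) => stack.last_mut().unwrap().1.push(Node::Text(t.clone())),
//!             Event::End(name) => {
//!                 // Only act if a matching container is open; stray ends are ignored.
//!                 if stack.iter().skip(1).any(|(n, _)| n == name) {
//!                     // Auto-close everything opened inside the named container.
//!                     while stack.last().map(|(n, _)| n != name).unwrap_or(false) {
//!                         close(&mut stack);
//!                     }
//!                     close(&mut stack);
//!                 }
//!             }
//!         }
//!     }
//!
//!     // Auto-close whatever is still open at the end of input.
//!     while stack.len() > 1 {
//!         close(&mut stack);
//!     }
//!     stack.pop().unwrap().1
//! }
//! ```
//!
//! In this sketch, feeding Start(list), Start(item), Text(a), End(list) to flat_to_nested yields
//! a list element containing an item element, because the unclosed item is closed automatically.
//! Flattening that tree again produces a balanced stream with the missing End(item) restored.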
//!
//!
//! Formats
//!
//! Format-specific capabilities are implemented with the Format trait. A format has parse() and
//! serialize() methods (either or both may be supported), a name and file extensions. See the
//! trait definition in ./format.rs.
//! - Format trait: Uniform interface for all formats (parsing and/or serialization)
//! - FormatRegistry: Centralized discovery and selection of formats
//! - Format implementations: Concrete implementations for each supported format
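//!
//! The relationship between the three pieces can be sketched as follows. This is a simplified,
//! hypothetical illustration and not the definitions in ./format.rs or ./registry.rs: the Doc
//! type, SketchError, the Upper format, the Registry type and every method signature are
//! assumptions made for the example.
//!
//! ```rust
//! use std::collections::HashMap;
//!
//! // Stand-in for the Lex AST document (the real type lives in lex_core).
//! #[derive(Debug, Clone, PartialEq)]
//! struct Doc(String);
//!
//! #[derive(Debug, PartialEq)]
//! enum SketchError {
//!     NotSupported(&'static str),
//!     UnknownFormat(String),
//! }
//!
//! // Hypothetical uniform interface: both directions default to "not supported",
//! // so a format only overrides what it can actually do.
//! trait Format {
//!     fn name(&self) -> &'static str;
//!     fn file_extensions(&self) -> &'static [&'static str];
//!     fn parse(&self, _source: &str) -> Result<Doc, SketchError> {
//!         Err(SketchError::NotSupported("parse"))
//!     }
//!     fn serialize(&self, _doc: &Doc) -> Result<String, SketchError> {
//!         Err(SketchError::NotSupported("serialize"))
//!     }
//! }
//!
//! // Toy serialize-only format used to exercise the registry.
//! struct Upper;
//!
//! impl Format for Upper {
//!     fn name(&self) -> &'static str {
//!         "upper"
//!     }
//!     fn file_extensions(&self) -> &'static [&'static str] {
//!         &["up"]
//!     }
//!     fn serialize(&self, doc: &Doc) -> Result<String, SketchError> {
//!         Ok(doc.0.to_uppercase())
//!     }
//! }
//!
//! // Hypothetical registry: lookup by format name or by file extension.
//! #[derive(Default)]
//! struct Registry {
//!     formats: HashMap<&'static str, Box<dyn Format>>,
//! }
//!
//! impl Registry {
//!     fn register(&mut self, format: Box<dyn Format>) {
//!         self.formats.insert(format.name(), format);
//!     }
//!
//!     fn by_name(&self, name: &str) -> Result<&dyn Format, SketchError> {
//!         self.formats
//!             .get(name)
//!             .map(|f| f.as_ref())
//!             .ok_or_else(|| SketchError::UnknownFormat(name.to_string()))
//!     }
//!
//!     fn by_extension(&self, ext: &str) -> Option<&dyn Format> {
//!         self.formats
//!             .values()
//!             .map(|f| f.as_ref())
//!             .find(|f| f.file_extensions().iter().any(|e| *e == ext))
//!     }
//! }
//! ```
//!
//! Defaulting both directions to an error is one way to model parse-only or serialize-only
//! formats behind a single trait; callers select a format through the registry and never
//! depend on a concrete implementation.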
//!
//!
//! The Lex Format
//!
//! Lex itself is implemented as a format (see ./formats/lex/mod.rs), which gives a
//! homogeneous API where all formats have identical interfaces.
//!
//! Note that Lex is more expressive than most formats, which means that converting from Lex is
//! simple, but always lossy. Conversely, converting to Lex requires some consideration of how
//! to best represent the author's intent.
//!
//! This means that full round-tripping across formats is not possible.
//!
//! Format Selection
//!
//! The rationale for the chosen formats:
//!
//! - HTML: self-evident, as it's the most common format for publishing and viewing.
//! - Markdown: both import and export, as Markdown is the universal format for plain text editing.
//! - Tag (XML): serializing Lex to a structural XML representation is trivial and useful for storage.
//! - RFC XML: parse-only import from IETF RFC XML (v3) documents.
//!
//! These are table stakes: a format that can't export to HTML, convert to Markdown, or
//! produce good, semantic, pure XML output is a non-starter.
//!
//! For everything else, there are good arguments for a variety of formats. The one with the
//! strongest fit and use case is LaTeX, as Lex can be very useful for scientific writing. But
//! LaTeX is complicated, and having Pandoc in the pipeline allows us to serve pretty much any
//! other format reasonably well.
//!
//! Library Choices
//!
//! Since format conversion is not Lex's core, we offload as much as possible to better,
//! specialized crates for each format. The scope here is mainly to adapt the Lex AST to the
//! format's AST or vice versa. For example, we never write the serializer for, say, Markdown,
//! but pass the AST to the Markdown library. To support an inbound format, we write the format
//! AST → Lex AST adapter. Likewise, for outbound formats we do the reverse, converting from the
//! Lex AST to the format's.
//!
//! As much as possible, we use Rust crates and avoid shelling out or relying on outside dependencies.
//!
pub mod error;
pub mod format;
pub mod formats;
pub mod publish;
pub mod registry;
pub mod templates;
pub mod transforms;

pub mod common;
pub mod ir;

pub use error::FormatError;
pub use format::{Format, SerializedDocument};
pub use registry::FormatRegistry;

/// Converts a Lex document to the Intermediate Representation (IR).
///
/// # Information Loss
///
/// The IR is a simplified, semantic representation. The following
/// Lex information is lost during conversion:
/// - Blank line grouping (BlankLineGroup nodes)
/// - Source positions and token information
/// - Comment annotations at document level
///
/// For lossless Lex representation, use the AST directly.
pub fn to_ir(doc: &lex_core::lex::ast::elements::Document) -> ir::nodes::Document {
    ir::from_lex::from_lex_document(doc)
}

/// Converts an IR document back to Lex AST.
///
/// This is useful for round-trip conversions: Format → IR → Lex.
pub fn from_ir(doc: &ir::nodes::Document) -> lex_core::lex::ast::elements::Document {
    ir::to_lex::to_lex_document(doc)
}