Crate markdown_ast

Parse a Markdown input string into a sequence of Markdown abstract syntax tree Blocks.

This crate is intentionally designed to interoperate well with the pulldown-cmark crate and the ecosystem around it. See Motivation and relation to pulldown-cmark for more information.

The AST types are designed to align with the structure defined by the CommonMark Specification.

§Quick Examples

Parse simple Markdown into an AST:

use markdown_ast::{markdown_to_ast, Block, Inline, Inlines};

let ast = markdown_to_ast("
Hello! This is a paragraph **with bold text**.
");

assert_eq!(ast, vec![
    Block::Paragraph(Inlines(vec![
        Inline::Text("Hello! This is a paragraph ".to_owned()),
        Inline::Strong(Inlines(vec![
            Inline::Text("with bold text".to_owned()),
        ])),
        Inline::Text(".".to_owned())
    ]))
]);

§API Overview

| Function | Input | Output |
|---|---|---|
| markdown_to_ast() | &str | Vec<Block> |
| ast_to_markdown() | &[Block] | String |
| ast_to_events() | &[Block] | Vec<Event> |
| events_to_ast() | &[Event] | Vec<Block> |
| events_to_markdown() | &[Event] | String |
| markdown_to_events() | &str | Vec<Event> |
| canonicalize() | &str | String |

§Terminology

This crate is able to process and manipulate Markdown in three different representations:

| Term | Type | Description |
|---|---|---|
| Markdown | String | Raw Markdown source / output string |
| Events | &[Event] | Markdown parsed by pulldown-cmark into a flat sequence of parser Events |
| AST | Block / &[Block] | Markdown parsed by markdown-ast into a hierarchical structure of Blocks |

§Processing Steps

    String => Events => Blocks => Events => String
    |_____ A ______|    |______ C _____|
              |______ B _____|    |______ D _____|
    |__________ E ___________|
                        |___________ F __________|
    |____________________ G _____________________|

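Reading the lettered spans against the signatures in the API overview, A is markdown_to_events(), B is events_to_ast(), C is ast_to_events(), D is events_to_markdown(), E is markdown_to_ast(), F is ast_to_markdown(), and G is canonicalize().

Note: A wraps pulldown_cmark::Parser, and D wraps pulldown_cmark_to_cmark::cmark().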
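
To make the difference between the flat Events representation and the hierarchical AST representation concrete, here is a minimal, std-only sketch of steps B and C from the diagram. Everything in it is a hypothetical stand-in: ToyEvent, ToyNode, toy_events_to_tree, and toy_tree_to_events are invented for illustration and are much simpler than the real pulldown_cmark::Event and markdown_ast::Block types. It only shows the general stack-based technique for folding Start/End pairs into a tree, and makes no claim about how markdown-ast implements the conversion.

```rust
// Hypothetical, simplified stand-ins for illustration only. These are NOT the
// types exported by pulldown-cmark or markdown-ast.
#[derive(Debug, Clone, PartialEq)]
enum ToyEvent {
    Start(&'static str), // opens a container such as "paragraph" or "strong"
    End,                 // closes the most recently opened container
    Text(&'static str),  // leaf text
}

#[derive(Debug, Clone, PartialEq)]
enum ToyNode {
    Container(&'static str, Vec<ToyNode>),
    Text(&'static str),
}

/// Fold a flat event sequence into a tree (step B in the diagram).
/// Returns `None` if the Start/End events are unbalanced.
fn toy_events_to_tree(events: &[ToyEvent]) -> Option<Vec<ToyNode>> {
    // Each stack frame holds a container name and the children seen so far.
    // The bottom frame collects the top-level nodes.
    let mut stack: Vec<(&'static str, Vec<ToyNode>)> = vec![("root", Vec::new())];

    for event in events {
        match event {
            ToyEvent::Start(name) => stack.push((*name, Vec::new())),
            ToyEvent::Text(text) => stack.last_mut()?.1.push(ToyNode::Text(*text)),
            ToyEvent::End => {
                // An End with only the root frame left has no matching Start.
                if stack.len() < 2 {
                    return None;
                }
                let (name, children) = stack.pop()?;
                stack.last_mut()?.1.push(ToyNode::Container(name, children));
            }
        }
    }

    // Only the root frame may remain; otherwise a Start was never closed.
    if stack.len() != 1 {
        return None;
    }
    stack.pop().map(|(_, nodes)| nodes)
}

/// Flatten a tree back into an event sequence (step C in the diagram).
fn toy_tree_to_events(nodes: &[ToyNode], out: &mut Vec<ToyEvent>) {
    for node in nodes {
        match node {
            ToyNode::Text(text) => out.push(ToyEvent::Text(*text)),
            ToyNode::Container(name, children) => {
                out.push(ToyEvent::Start(*name));
                toy_tree_to_events(children, out);
                out.push(ToyEvent::End);
            }
        }
    }
}
```

In this toy model, folding a balanced event sequence into a tree and flattening it again yields the original sequence, which mirrors the B then C round trip in the diagram. The tree form is the one that is convenient to build and edit by hand, since nesting is expressed by structure instead of by matching Start and End markers.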

§Detailed Examples

§Parse varied Markdown to an AST representation:

use markdown_ast::{
    markdown_to_ast, Block, HeadingLevel, Inline, Inlines, ListItem
};

let ast = markdown_to_ast("
# An Example Document

This is a paragraph that
is split across *multiple* lines.

* This is a list item
");

assert_eq!(ast, vec![
    Block::Heading(
        HeadingLevel::H1,
        Inlines(vec![
             Inline::Text("An Example Document".to_owned())
        ])
    ),
    Block::Paragraph(Inlines(vec![
        Inline::Text("This is a paragraph that".to_owned()),
        Inline::SoftBreak,
        Inline::Text("is split across ".to_owned()),
        Inline::Emphasis(Inlines(vec![
            Inline::Text("multiple".to_owned()),
        ])),
        Inline::Text(" lines.".to_owned()),
    ])),
    Block::List(vec![
        ListItem(vec![
            Block::Paragraph(Inlines(vec![
                Inline::Text("This is a list item".to_owned())
            ]))
        ])
    ])
]);

§Synthesize Markdown using programmatic construction of the document:

Note: This is a more user-friendly alternative to a “string builder” approach, in which the raw Markdown string is constructed piece by piece and the caller must do extra bookkeeping to track things like indent level and soft vs. hard breaks.

use markdown_ast::{
    ast_to_markdown, Block, Inline, Inlines, ListItem,
    HeadingLevel,
};

let tech_companies = vec![
    ("Apple", 1976, 164_000),
    ("Microsoft", 1975, 221_000),
    ("Nvidia", 1993, 29_600),
];

let ast = vec![
    Block::Heading(HeadingLevel::H1, Inlines::plain_text("Tech Companies")),
    Block::plain_text_paragraph("The following are major tech companies:"),
    Block::List(Vec::from_iter(
        tech_companies
            .into_iter()
            .map(|(company_name, founded, employee_count)| {
                ListItem(vec![
                    Block::paragraph(vec![Inline::plain_text(company_name)]),
                    Block::List(vec![
                        ListItem::plain_text(format!("Founded: {founded}")),
                        ListItem::plain_text(format!("Employee count: {employee_count}"))
                    ])
                ])
            })
    ))
];

let markdown: String = ast_to_markdown(&ast);

assert_eq!(markdown, "\
# Tech Companies

The following are major tech companies:

* Apple
  
  * Founded: 1976
  
  * Employee count: 164000

* Microsoft
  
  * Founded: 1975
  
  * Employee count: 221000

* Nvidia
  
  * Founded: 1993
  
  * Employee count: 29600\
");

§Known Issues

Currently markdown-ast does not escape Markdown content appearing in leaf inline text:

use markdown_ast::{ast_to_markdown, Block};

let ast = vec![
    Block::plain_text_paragraph("In the equation a*b*c ...")
];

let markdown = ast_to_markdown(&ast);

assert_eq!(markdown, "In the equation a*b*c ...");

which will render as:

In the equation abc … (with the "b" set in italics and both asterisks consumed)

with the asterisks interpreted as emphasis formatting markers, contrary to the intention of the author.

Fixing this robustly will require either:

  • Adding automatic escaping of Markdown characters in Inline::Text during rendering (not ideal)

  • Adding pre-construction validation checks for Inline::Text that prevent constructing an Inline with Markdown formatting characters that have not been escaped correctly by the user.

In either case, fixing this bug will be considered a semver-exempt change in the behavior of markdown-ast.
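
Until one of those fixes lands, callers can guard against this themselves. The following is a minimal, std-only sketch of both options, not part of markdown-ast: escape_markdown_text, first_unescaped_special, and the SPECIAL character set are all hypothetical names and choices made for illustration. It relies on the CommonMark rule that any ASCII punctuation character can be backslash-escaped. It handles only a handful of inline markers and ignores block-level markers that matter at the start of a line (such as "#", "-", or "1."), so treat it as a starting point.

```rust
/// Characters this sketch treats as Markdown-significant inside inline text.
/// Assumption: this set is chosen for illustration and is not exhaustive.
const SPECIAL: &[char] = &['\\', '`', '*', '_', '[', ']', '<', '>'];

/// Option 1: backslash-escape every special character before the text is
/// placed into an inline text node.
fn escape_markdown_text(text: &str) -> String {
    let mut escaped = String::with_capacity(text.len());
    for ch in text.chars() {
        if SPECIAL.contains(&ch) {
            escaped.push('\\');
        }
        escaped.push(ch);
    }
    escaped
}

/// Option 2: validate instead of rewriting. Returns the byte offset of the
/// first special character that is not preceded by an escaping backslash,
/// or `None` if the text is safe under the rules of this sketch.
fn first_unescaped_special(text: &str) -> Option<usize> {
    let mut chars = text.char_indices();
    while let Some((index, ch)) = chars.next() {
        if ch == '\\' {
            // Simplification: a backslash is treated as escaping whatever
            // character follows it, so that character is skipped. A trailing
            // backslash with nothing after it is reported as unescaped.
            match chars.next() {
                Some(_) => continue,
                None => return Some(index),
            }
        }
        if SPECIAL.contains(&ch) {
            return Some(index);
        }
    }
    None
}
```

With these helpers, a caller could pass escape_markdown_text("In the equation a*b*c ...") to Block::plain_text_paragraph so that the asterisks survive rendering as literal characters, or call first_unescaped_special first and reject the input. The trade-off matches the one described above: escaping is automatic but changes the stored text, while validation keeps the text as written and pushes the responsibility onto the caller.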

§Motivation and relation to pulldown-cmark

pulldown-cmark is a popular Markdown parser crate. It provides a streaming event (pull parsing) based representation of a Markdown document. That representation is useful for efficient transformation of a Markdown document into another format, often HTML.

However, a streaming parser representation is less amenable to programmatic construction or human-understandable transformations of Markdown documents.

markdown-ast provides an abstract syntax tree (AST) representation of Markdown that is easy to construct and work with.

Additionally, pulldown-cmark is widely used in the Rust crate ecosystem, for example in mdbook extensions. Interoperability with pulldown-cmark is an intentional design choice for the usability of markdown-ast. One could imagine markdown-ast instead abstracting over the underlying parser implementation, but my view is that this would limit the utility of markdown-ast.

Structs§

Inlines
A sequence of Inlines. (CommonMark: inlines)
ListItem
An item in a list. (CommonMark: list items)

Enums§

Block
A piece of structural Markdown content. (CommonMark: blocks, container blocks)
CodeBlockKind
HeadingLevel
Inline
An inline piece of atomic Markdown content. (CommonMark: inlines)

Functions§

ast_to_events
Convert AST Blocks into an Event sequence.
ast_to_markdown
Convert AST Blocks into a Markdown string.
canonicalize
Canonicalize (or format) a Markdown input by parsing and then converting back to a string.
events_to_ast
Parse Events into AST Blocks.
events_to_markdown
Convert Events into a Markdown string.
markdown_to_ast
Parse Markdown input string into AST Blocks.
markdown_to_events
Parse Markdown input string into Events.