Using parsuna
=============

This page is language-agnostic: it covers the generator CLI, the shape
of a generated parser, and how consumers drive one. Backend specifics
are called out only when they matter.

The CLI
-------

The parsuna executable takes a grammar file plus a subcommand::

    parsuna <grammar.parsuna> <subcommand> [options]

The useful subcommands for day-to-day work are:

``check``
  Load, parse, and analyze the grammar. Print a one-line summary
  (``grammar `NAME' OK: N tokens, M rules, LL(k)``) and exit 0, or
  print diagnostics and exit non-zero. Use this as a pre-commit or
  CI gate.

``generate <target> [-o OUT]``
  Emit a parser for ``<target>``. Valid targets are ``rust``,
  ``python``, ``typescript``, ``go``, ``java``, ``csharp``, ``c``,
  and the meta-target ``all`` which emits every backend. With
  ``-o OUT``, files are written under that directory (one
  sub-directory per backend when multiple are emitted). Without
  ``-o``, files are written into the current working directory.

``tree-sitter [-o OUT]``
  Emit a tree-sitter ``grammar.js`` for editor tooling. The emitted
  grammar is purely declarative; it does not share the pull-parser
  runtime. Useful for syntax highlighting and code folding in
  editors that speak tree-sitter.

``debug <sub>``
  Dump internal state. The sub-commands are ``stats``, ``tokens``,
  ``rules --format tree|dot``, ``analysis``, ``lowering``, and
  ``dfa [--full] [--format plain|dot]``. Pipe ``rules --format dot``
  into Graphviz to view railroad diagrams of the rules, and
  ``dfa --format dot`` to view the lexer DFA. These dumps are a
  debugging aid for grammar development; :doc:`pipeline/index`
  describes each layer in full.

The ``--name NAME`` option, accepted at any position, overrides the
identifier the backend uses for file and package names. By default
the name is the grammar file's stem (``foo.parsuna`` → ``foo``).
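
As an illustration of the CI-gate use of ``check``, the sketch below
wraps the command line in a small Python helper. It assumes a
``parsuna`` executable is on ``PATH``; the helper names
``parsuna_argv`` and ``grammar_is_ok`` are hypothetical and not part
of parsuna itself. Only the argument order and the exit-status
convention come from the description above.

.. code-block:: python

    import subprocess
    from pathlib import Path
    from typing import List, Union


    def parsuna_argv(grammar: Union[str, Path], subcommand: str,
                     *options: str, exe: str = "parsuna") -> List[str]:
        """Build argv in the documented order: grammar, subcommand, options."""
        return [exe, str(grammar), subcommand, *options]


    def grammar_is_ok(grammar: Union[str, Path], exe: str = "parsuna") -> bool:
        """Run ``check``; exit status 0 means the grammar was accepted."""
        # Output is not captured, so the summary line or the diagnostics
        # still reach the terminal or the CI log.
        result = subprocess.run(parsuna_argv(grammar, "check", exe=exe))
        return result.returncode == 0


    # Command lines only; nothing is executed here.
    check_cmd = parsuna_argv("foo.parsuna", "check")
    generate_cmd = parsuna_argv("foo.parsuna", "generate", "rust", "-o", "src")

A pre-commit hook or CI step could call ``grammar_is_ok`` and fail the
job when it returns ``False``.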

The shape of a generated parser
-------------------------------

Every backend produces the same five things, spelled in the idioms of
the target language:

* A **TokenKind** enumeration with one variant per declared token,
  plus the reserved ``EOF`` and ``ERROR`` sentinels. Skip tokens
  appear here like any other token; fragments do not.
* A **RuleKind** enumeration with one variant per non-fragment rule.
  Attached to every structural event so consumers can identify
  subtrees.
* A **parse_<rule>** entry point per non-fragment rule, accepting a
  source string or (where the target runtime supports it) a stream.
  The entry point returns a **Parser** object — the generated driver
  wrapped around the runtime's pull loop.
* The **Parser** object, which yields **Event** values one at a time.
  Each target exposes this through its idiomatic iteration mechanism
  (``Iterator`` in Rust, ``Iterable`` in Python, ``Iterator<T>`` in
  TypeScript, a ``NextEvent`` method in Go, etc.).
* **Event** itself: a tagged union with four cases (``Enter``,
  ``Exit``, ``Token``, ``Error``). See :doc:`event_model` for the
  full payload.

All of these come from the same state table, so whatever backend you
pick, the sequence of events you observe for a given input is the
same up to language-level encoding differences.

A minimal driver
----------------

The pattern is identical in every language: call the entry point,
iterate, switch on the event tag. In pseudocode::

    parser = parse_<rule>(source)
    for event in parser:
        match event.tag:
            case "enter":  # event.rule is a RuleKind
                on_enter(event.rule, event.pos)
            case "exit":
                on_exit(event.rule, event.pos)
            case "token":  # event.token.kind is a TokenKind
                on_token(event.token)
            case "error":
                on_error(event.error)

Two rules to keep in mind while writing the driver:

1. **Events are final and arrive in source order.** The parser never
   retracts or reorders events; once you have seen one, it will not be
   un-emitted.
2. **Error events do not stop the stream.** The parser recovers and
   keeps going. An application that wants to abort on the first error
   must do so in its own driver — the parser will happily continue.
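
The second rule can be made concrete with a small wrapper that stops at
the first error on the consumer side. This is a sketch only: ``Event``
here is a stand-in dataclass whose field names mirror the pseudocode
above, not the type a generated backend actually emits, and the event
list is made up for the example.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Iterable, Iterator, Optional


    @dataclass(frozen=True)
    class Event:
        """Stand-in for a generated Event (hypothetical field names)."""
        tag: str                      # "enter", "exit", "token", or "error"
        rule: Optional[str] = None    # set on enter/exit events
        token: Optional[str] = None   # set on token events
        error: Optional[str] = None   # set on error events


    class FirstError(Exception):
        """Raised by the driver, not by the parser, to abort early."""


    def until_first_error(events: Iterable[Event]) -> Iterator[Event]:
        """Pass events through unchanged, raising at the first error."""
        for event in events:
            if event.tag == "error":
                raise FirstError(event.error)
            yield event


    # Made-up stream: the events after the error show that the parser
    # itself would have kept going.
    stream = [
        Event("enter", rule="value"),
        Event("token", token="1"),
        Event("error", error="unexpected ']'"),
        Event("token", token="]"),
        Event("exit", rule="value"),
    ]

    seen = []
    message = None
    try:
        for event in until_first_error(stream):
            seen.append(event.tag)
    except FirstError as exc:
        message = str(exc)

After the loop, ``seen`` holds only the two events that preceded the
error, and ``message`` carries the diagnostic.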

Starting from a rule other than the default
-------------------------------------------

Every non-fragment rule has an entry point. The first rule declared
is the *default start*, but nothing stops you from calling
``parse_member`` or ``parse_number`` directly to parse a piece of
input as if that rule were the top. This is useful for tests, for
editor tooling that parses at the cursor, and for composing parsers
(parse a request body with one entry point, then parse its contents
with another).

Typical integration workflow
----------------------------

1. Write the grammar in a ``.parsuna`` file.
2. Run ``parsuna grammar.parsuna check`` until it reports OK. Fix
   undefined references, left recursion, or LL(k) conflicts as the
   checker reports them.
3. Run ``parsuna grammar.parsuna generate <target> -o src``. Commit
   the emitted files into your repository — they are plain source,
   and diffing them is how you notice grammar changes you did not
   intend.
4. In your application, call ``parse_<rule>`` and walk the event
   stream. Translate ``Enter``/``Exit`` pairs into whatever
   domain-specific tree you want; translate ``Token`` events into
   leaves; handle ``Error`` events by attaching a diagnostic to the
   surrounding construct.

Regenerating is cheap and should be fully automated — wire
``parsuna generate`` into your build system so the committed files
never drift from the grammar.
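
Step 4 of the workflow can be sketched with an explicit stack: push a
node on ``Enter``, pop on ``Exit``, and attach tokens and diagnostics to
whichever node is currently open. ``Event`` and ``Node`` below are
hypothetical stand-ins defined only for this example; a real consumer
would use the types from the generated module and its own tree classes.

.. code-block:: python

    from dataclasses import dataclass, field
    from typing import Iterable, List, Optional, Union


    @dataclass(frozen=True)
    class Event:
        """Stand-in for a generated Event (hypothetical field names)."""
        tag: str                      # "enter", "exit", "token", or "error"
        rule: Optional[str] = None
        token: Optional[str] = None
        error: Optional[str] = None


    @dataclass
    class Node:
        """Consumer-defined tree node; not something parsuna provides."""
        rule: str
        children: List[Union["Node", str]] = field(default_factory=list)
        diagnostics: List[str] = field(default_factory=list)


    def build_tree(events: Iterable[Event]) -> Node:
        root = Node("<root>")         # synthetic holder for top-level items
        stack = [root]
        for event in events:
            if event.tag == "enter":
                node = Node(event.rule)
                stack[-1].children.append(node)
                stack.append(node)
            elif event.tag == "exit":
                stack.pop()
            elif event.tag == "token":
                stack[-1].children.append(event.token)    # tokens are leaves
            elif event.tag == "error":
                stack[-1].diagnostics.append(event.error)
        return root


    # Made-up stream for an array that is missing its closing bracket.
    events = [
        Event("enter", rule="array"),
        Event("token", token="["),
        Event("enter", rule="number"),
        Event("token", token="1"),
        Event("exit", rule="number"),
        Event("error", error="expected ']'"),
        Event("exit", rule="array"),
    ]
    array = build_tree(events).children[0]

Here the diagnostic ends up on the ``array`` node because that is the
construct still open when the error event arrives.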

Tokens, skips, and whitespace
-----------------------------

Skip tokens (``?WS``, ``?COMMENT``) are re-attached to the event
stream just before the next structural event that follows them in
source order. Consumers who only care about structure can filter by
event tag; consumers building a formatter or a highlighter see the
skips in the correct positions.
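
A structure-only consumer might drop the skips with a small generator
like the one below. The sketch assumes that skips arrive as ordinary
token events whose kind is one of the skip tokens, and that the skip
tokens are named ``WS`` and ``COMMENT``. The ``Event`` dataclass is a
hypothetical stand-in for the generated type.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Iterable, Iterator, Optional

    # Assumed names of the grammar's skip tokens.
    SKIP_KINDS = frozenset({"WS", "COMMENT"})


    @dataclass(frozen=True)
    class Event:
        """Stand-in for a generated Event (hypothetical field names)."""
        tag: str                      # "enter", "exit", "token", or "error"
        rule: Optional[str] = None    # set on enter/exit events
        kind: Optional[str] = None    # token kind, set on token events
        text: str = ""                # raw lexeme, set on token events


    def without_skips(events: Iterable[Event]) -> Iterator[Event]:
        """Yield every event except tokens whose kind is a skip token."""
        for event in events:
            if event.tag == "token" and event.kind in SKIP_KINDS:
                continue
            yield event


    # Made-up stream with whitespace and a comment between two numbers.
    events = [
        Event("enter", rule="list"),
        Event("token", kind="NUMBER", text="1"),
        Event("token", kind="WS", text=" "),
        Event("token", kind="COMMENT", text="# one"),
        Event("token", kind="NUMBER", text="2"),
        Event("exit", rule="list"),
    ]
    structural = list(without_skips(events))
    kept_text = [e.text for e in structural if e.tag == "token"]

A formatter or highlighter would iterate over ``events`` directly
instead, so that the whitespace and comment tokens stay in place.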

An ``Error`` event does not consume the token it is attached to. The
parser still either consumes that token (if recovery synchronizes on
it) or skips it as part of recovery. Application code should treat
``Error`` as a diagnostic carrier, not a replacement for a token.

Interpreting token text
-----------------------

The parser does not post-process token text. ``STRING`` tokens are
delivered with their quotes and escapes intact; ``NUMBER`` tokens are
delivered as the raw lexeme. Un-escaping and numeric conversion are
the consumer's job. Leaving the text untouched keeps the event stream
faithful to the source, so tools like formatters and go-to-definition
work without losing information.
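
What the conversion looks like depends entirely on the grammar. As a
sketch, assume ``STRING`` and ``NUMBER`` follow JSON syntax; that is an
assumption made for this example, not something parsuna imposes. The
helper names ``string_value`` and ``number_value`` are hypothetical.

.. code-block:: python

    import json
    from typing import Union


    def string_value(lexeme: str) -> str:
        """Strip the quotes and resolve escapes of a JSON-style STRING."""
        # json.loads accepts exactly the quoted, escaped form assumed here.
        value = json.loads(lexeme)
        if not isinstance(value, str):
            raise ValueError(f"not a string lexeme: {lexeme!r}")
        return value


    def number_value(lexeme: str) -> Union[int, float]:
        """Convert a JSON-style NUMBER lexeme, preferring int over float."""
        try:
            return int(lexeme)
        except ValueError:
            return float(lexeme)


    # Raw lexemes as a token event would carry them.
    text = string_value(r'"a\nb"')    # the escape becomes a real newline
    whole = number_value("42")
    fractional = number_value("2.5")

Keeping these helpers in the consumer means the raw lexeme is still
available whenever the original spelling matters.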