lexpr 0.2.6

A representation for Lisp data
# Emacs Lisp Strings

Strings in Emacs Lisp are somewhat difficult to deal with, for the
following reasons:

- They can be either "unibyte" strings, which correspond to byte
  vectors in Scheme, or "multibyte" strings, which can handle
  Unicode. Whether a string is considered unibyte or multibyte depends
  on its contents; see Section 2.3.8.2, "Non-ASCII Characters in
  Strings" in the Emacs Lisp manual for details.

- Whether a string is considered unibyte or multibyte depends not only
  on its contents, but also on the source it is read from.

- A multibyte string can include characters outside the Unicode
  code point range. This happens, for instance, when the string
  includes a hexadecimal or octal escape interpreted as a single byte,
  potentially violating the encoding rules of the multibyte source.

- Emacs Lisp string syntax supports a multitude of escaping modes,
  some of which originate from representing keyboard event sequences
  in strings. Using these "keyboard-oriented" escapes inside strings
  is explicitly discouraged in the Emacs Lisp manual.
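
As an illustration of the third point, Emacs represents a raw byte in
a multibyte string as a "character" whose code lies above the Unicode
range. The following minimal Rust sketch shows the arithmetic. The
helper name `emacs_raw_byte_code` is hypothetical and not part of
`lexpr`; the offset `#x3FFF00` reflects the range `#x3FFF80` to
`#x3FFFFF` that Emacs reserves for raw bytes.

```rust
/// Largest valid Unicode code point.
const UNICODE_MAX: u32 = 0x10FFFF;

/// Hypothetical helper (not part of `lexpr`): the character code Emacs
/// uses for a byte that appears as a raw byte in a multibyte string.
/// ASCII bytes keep their value; bytes 0x80..=0xFF are mapped into
/// the range 0x3FFF80..=0x3FFFFF.
fn emacs_raw_byte_code(byte: u8) -> u32 {
    if byte < 0x80 {
        u32::from(byte)
    } else {
        0x3FFF00 + u32::from(byte)
    }
}

fn main() {
    // The escape `\xFC` from the example string "\xFC\N{U+203D}".
    let code = emacs_raw_byte_code(0xFC);
    assert_eq!(code, 0x3FFFFC);
    // The code is beyond the Unicode range, so Rust's `char` cannot hold it.
    assert!(code > UNICODE_MAX);
    assert!(char::from_u32(code).is_none());
    println!("{:#x}", code);
}
```

Because such a code cannot be stored in a Rust `char` or `String`, a
Rust-side representation has to either reject these strings or treat
them as raw bytes.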

`lexpr` deals with this complexity as follows:

- The input source is always considered to be "multibyte" using the
  UTF-8 encoding; other encodings are not supported.

- Mixing non-ASCII UTF-8 characters, either directly part of the input
  or represented using escape sequences, and hexadecimal or octal
  escape sequences resulting in a single byte outside of the ASCII
  range will result in a parse error. For instance, the following
  string cannot be parsed by `lexpr`:

  `"\xFC\N{U+203D}"`

  Emacs, however, would parse this as a string containing the
  "character" sequence `#x3ffffc`, `#x203d`. Note that the first
  "character" is not a valid Unicode code point.

- Strings containing only ASCII characters and at least one
  single-byte hexadecimal or octal escape will be parsed as byte
  vectors instead of strings. This mirrors the Emacs Lisp rules for
  when a string will be considered to be "unibyte".

  When producing S-expression text, byte vectors will always be
  represented as a sequence of octal-escaped bytes.

- The escaping styles supported by `lexpr` are:

  - Hexadecimal (`\xN...`) and octal (`\N...`)
  - Unicode (`\uNNNN`, `\U00NNNNNN`)
  - Named Unicode (`\N{U+X...}`). Note that the syntax that refers to
    codepoints using their full name (e.g. `\N{LATIN SMALL LETTER A
    WITH GRAVE}`) is deliberately not supported.
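
The string rules above can be sketched in a few lines of Rust. This
is a simplified model under stated assumptions, not the actual `lexpr`
parser: the types `Piece`, `Parsed`, and `MixedEscapes` and the
functions `classify` and `octal_escape` are hypothetical names, the
input is assumed to be already split into characters and single-byte
escapes, and the three-digit octal output format is an assumption
about how the bytes might be printed.

```rust
/// One element of an already-tokenized string literal (hypothetical model).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Piece {
    /// A literal character or a Unicode escape such as `\u203D`.
    Char(char),
    /// A hexadecimal or octal escape that denotes a single byte.
    ByteEscape(u8),
}

#[derive(Debug, PartialEq)]
enum Parsed {
    Text(String),
    Bytes(Vec<u8>),
}

/// Error for mixing non-ASCII characters with non-ASCII byte escapes.
#[derive(Debug, PartialEq)]
struct MixedEscapes;

fn classify(pieces: &[Piece]) -> Result<Parsed, MixedEscapes> {
    let non_ascii_char = pieces
        .iter()
        .any(|p| matches!(p, Piece::Char(c) if !c.is_ascii()));
    let high_byte = pieces
        .iter()
        .any(|p| matches!(p, Piece::ByteEscape(b) if *b > 0x7F));
    let any_byte_escape = pieces.iter().any(|p| matches!(p, Piece::ByteEscape(_)));

    // Rule: non-ASCII characters plus a byte escape above 0x7F is an error.
    if non_ascii_char && high_byte {
        return Err(MixedEscapes);
    }
    // Rule: only ASCII characters and at least one byte escape gives bytes.
    if any_byte_escape && !non_ascii_char {
        let bytes: Vec<u8> = pieces
            .iter()
            .map(|p| match *p {
                Piece::Char(c) => c as u8, // safe: all characters are ASCII here
                Piece::ByteEscape(b) => b,
            })
            .collect();
        return Ok(Parsed::Bytes(bytes));
    }
    // Otherwise an ordinary string; any byte escapes left are ASCII.
    let text: String = pieces
        .iter()
        .map(|p| match *p {
            Piece::Char(c) => c,
            Piece::ByteEscape(b) => char::from(b),
        })
        .collect();
    Ok(Parsed::Text(text))
}

/// Print a byte vector as octal escapes (assumed format: three digits each).
fn octal_escape(bytes: &[u8]) -> String {
    let mut out = String::from("\"");
    for b in bytes {
        out.push_str(&format!("\\{:03o}", b));
    }
    out.push('"');
    out
}

fn main() {
    // Models "\xFC\N{U+203D}": rejected because the two kinds are mixed.
    let mixed = [Piece::ByteEscape(0xFC), Piece::Char('\u{203D}')];
    println!("{:?}", classify(&mixed));

    // Models "a\xFC": ASCII plus a byte escape, so it becomes a byte vector.
    let unibyte = [Piece::Char('a'), Piece::ByteEscape(0xFC)];
    if let Ok(Parsed::Bytes(bytes)) = classify(&unibyte) {
        println!("{}", octal_escape(&bytes));
    }
}
```

The model leaves out edge cases the text does not specify, such as a
Unicode escape that denotes an ASCII character appearing next to a
non-ASCII byte escape.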

It is expected that these restrictions will not be an impediment when
using S-expressions as a data exchange format between Emacs Lisp and
Rust programs. In short, S-expressions produced by Rust should
always be parsable by Emacs, and the other direction should work as
long as no strings with non-Unicode "characters" are involved.