# Emacs Lisp Strings
Strings in Emacs Lisp are somewhat difficult to deal with, for the
following reasons:
- They can be either "unibyte" strings, which correspond to byte
vectors in Scheme, and "multibyte" strings, which can handle
unicode. Whether a string is considered unibyte or multibyte depends
on its contents; see Section 2.3.8.2, "Non-ASCII Characters in
Strings" in the Emacs Lisp manual for details.
- Whether a string is considered unibyte or multibyte not only depends
on its contents, but also the source it is read from.
- A multibyte string can include characters outside of the unicode
codepoint range. This happens for instance when the string includes
a hexadecimal or octal escape interpreted as a single byte,
potentially violating the encoding rules of the multibyte source.
- Emacs Lisp string syntax supports a multitude of escaping modes,
some of which originate from representing keyboard event sequences
in strings. Using these "keyboard-oriented" escapes inside strings
is explicitly discouraged in the Emacs Lisp manual.
The way `lexpr` deals with this complexity is the following:
- The input source is always considered to be "multibyte" using the
UTF-8 encoding; other encodings are not supported.
- Mixing non-ASCII UTF-8 characters, either directly part of the input
or represented using escape sequences, and hexadecimal or octal
escape sequences resulting in a single byte outside of the ASCII
range will result in a parse error. For instance, the following
string cannot be parsed by `lexpr`:
`"\xFC\N{U+203D}"`
Emacs, however, would parse this as a string containing the
"character" sequence `#x3ffffc`, `#x203d`. Note that the first
"character" is not a valid unicode codepoint.
- Strings containing only ASCII characters and at least one
single-byte hexadecimal or octal escape will be parsed as byte
vectors instead of strings. This mirrors the Emacs Lisp rules for
when a string will be considered to be "unibyte".
When producing S-expression text, byte vectors will always be
represented as a sequence of octal-escaped bytes.
- The escaping styles supported by `lexpr` are:
- Hexadecimal (`\xN...`) and octal (`\N...`)
- Unicode (`\uNNNN`, `\U00NNNNNN`)
- Named unicode (`\N{U+X...}`). Note that the syntax that refers to
codepoints using their full name (e.g. `\N{LATIN SMALL LETTER A
WITH GRAVE}`) is deliberately not supported.
It is expected that these restrictions will not be an impediment when
using S-expressions as a data exchange format between Emacs Lisp and
Rust programs. In short, S-expressions produced by Rust should be
always be parsable by Emacs, and the other direction should work as
long as there are no strings with non-unicode "characters" are
involved.