udv 0.3.0

Unambiguous Delimited Values. A smarter successor to CSV.
# UDV

Unambiguous Delimited Values. Similar to CSV, but consistent, unambiguous, and
predictable.


# Description

UDV uses leading delimiters and simple character escapes to provide unambiguous
introduction of units and records, unambiguous header declaration, unambiguous
concatenation of documents, the ability to distinguish zero fields from one
blank field, and the ability to carry arbitrary binary data.

The format is encoding-agnostic, but each delimiter must be a single codepoint.
The obvious canonical representations are utf-8 and binary, but any encoding is
possible.

The grammar in EBNF is as follows, where each all-caps name is a configurable
single-codepoint delimiter and `*` stands for any single codepoint:

```ebnf
stream = {garbage}, { message, {garbage} }, ENDSTREAM;
garbage = (* - (MESSAGE | HEADER | ENDSTREAM) );
message = [header], MESSAGE, { record }, ENDMESSAGE;
header = HEADER, units;
record = RECORD, units;
units = { UNIT, unit };
unit = { (* - control) | (ESCAPE, *) };
control = ENDSTREAM | HEADER | MESSAGE | ENDMESSAGE | RECORD | UNIT | ESCAPE;
```
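As a reading aid, the grammar above maps fairly directly onto a small recursive-descent parser. The following is a minimal Python sketch under stated assumptions, not this package's API: the names `Message`, `parse_units`, and `parse_stream` are hypothetical, the default delimiters are hard-coded, and the input is assumed to be an already-decoded `str`.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Assumption: default delimiters, hard-coded for brevity.
HEADER, MESSAGE, ENDMESSAGE = "#", ">", "<"
RECORD, UNIT, ESCAPE, ENDSTREAM = "\n", ",", "\\", "!"
CONTROL = {HEADER, MESSAGE, ENDMESSAGE, RECORD, UNIT, ESCAPE, ENDSTREAM}


@dataclass
class Message:
    header: Optional[List[str]] = None  # None means "no header at all"
    records: List[List[str]] = field(default_factory=list)


def parse_units(s: str, i: int) -> Tuple[List[str], int]:
    """units = { UNIT, unit }; returns the units and the index after them."""
    units = []
    while i < len(s) and s[i] == UNIT:
        i += 1
        chars = []
        while i < len(s) and (s[i] == ESCAPE or s[i] not in CONTROL):
            if s[i] == ESCAPE:
                i += 1  # ESCAPE makes the next codepoint literal
                if i >= len(s):
                    raise ValueError("dangling ESCAPE at end of input")
            chars.append(s[i])
            i += 1
        units.append("".join(chars))
    return units, i


def parse_stream(s: str, i: int = 0) -> Tuple[List[Message], int]:
    """Parse one stream starting at index i.

    Returns the messages and the index just past ENDSTREAM.
    """
    messages = []
    while True:
        # garbage: anything that is not MESSAGE, HEADER, or ENDSTREAM
        while i < len(s) and s[i] not in (MESSAGE, HEADER, ENDSTREAM):
            i += 1
        if i >= len(s):
            raise ValueError("input ended before ENDSTREAM")
        if s[i] == ENDSTREAM:
            return messages, i + 1
        msg = Message()
        if s[i] == HEADER:
            msg.header, i = parse_units(s, i + 1)
        if i >= len(s) or s[i] != MESSAGE:
            raise ValueError(f"expected MESSAGE at index {i}")
        i += 1
        while i < len(s) and s[i] == RECORD:
            units, i = parse_units(s, i + 1)
            msg.records.append(units)
        if i >= len(s) or s[i] != ENDMESSAGE:
            raise ValueError(f"expected ENDMESSAGE at index {i}")
        i += 1
        messages.append(msg)
```

In this sketch, a message with no records, a message with one record of zero units, and a message with one record holding one empty unit parse to `records` values of `[]`, `[[]]`, and `[['']]` respectively, and a missing header (`None`) stays distinct from an empty one (`[]`).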

The default delimiters:

```ebnf
HEADER = "#";
MESSAGE = ">";
ENDMESSAGE = "<";
RECORD = ? ASCII newline ?;
UNIT = ",";
ESCAPE = "\";
ENDSTREAM = "!";
```
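On the writing side, these defaults suggest a simple encoder: every unit is introduced by UNIT, every record by RECORD, and any control codepoint inside a unit is prefixed with ESCAPE. The sketch below is illustrative only; `escape_unit`, `encode_units`, and `encode_message` are hypothetical names, and escaping only the seven control codepoints is one possible policy (the grammar accepts ESCAPE in front of any codepoint).

```python
from typing import Iterable, Optional, Sequence

# Default delimiters as listed above.
HEADER, MESSAGE, ENDMESSAGE = "#", ">", "<"
RECORD, UNIT, ESCAPE, ENDSTREAM = "\n", ",", "\\", "!"
CONTROL = frozenset({HEADER, MESSAGE, ENDMESSAGE, RECORD, UNIT, ESCAPE, ENDSTREAM})


def escape_unit(value: str) -> str:
    """Prefix every control codepoint in value with ESCAPE."""
    return "".join(ESCAPE + ch if ch in CONTROL else ch for ch in value)


def encode_units(units: Iterable[str]) -> str:
    """Each unit is introduced by a leading UNIT delimiter."""
    return "".join(UNIT + escape_unit(u) for u in units)


def encode_message(records: Sequence[Sequence[str]],
                   header: Optional[Sequence[str]] = None) -> str:
    """Encode one message; header=None omits the header entirely."""
    out = []
    if header is not None:
        out.append(HEADER + encode_units(header))
    out.append(MESSAGE)
    for record in records:
        out.append(RECORD + encode_units(record))
    out.append(ENDMESSAGE)
    return "".join(out)
```

For example, `encode_message([])` yields `><`, while `encode_message([[""]])` yields a MESSAGE, a newline, a single `,`, and an ENDMESSAGE.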

For the most part, this is a prefix-oriented format.  The two exceptions are the
ENDMESSAGE and ENDSTREAM delimiters.  ENDMESSAGE allows a text UDV message file
to be edited in a text editor without a newline inserted at the end of the file
causing problems: because the point of this format is to be unambiguous, that
trailing newline would otherwise have to be treated either as part of the last
unit or as the start of an empty record.  ENDSTREAM means that reading UDV data
does not depend on knowing the length ahead of time or on the physical geometry
of the buffer that holds the stream.  Because any ENDSTREAM codepoint inside a
message must be explicitly escaped, any UDV parser can parse a UDV stream
unmodified at an arbitrary location.  This can also be leveraged to embed a UDV
stream at the end of a file and put the stream's offset after ENDSTREAM, so that
the beginning can be located from the trailing data.
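The trailing-offset idea can be sketched as follows. The source does not specify how the offset is stored, so this example assumes a hypothetical layout: arbitrary payload, then the UDV stream, then the stream's start offset as decimal digits (a codepoint index into a Python `str`). The helper names are made up for illustration, and `find_stream_end` only tracks enough state to find the closing ENDSTREAM; it does not validate the stream.

```python
# Assumption: default delimiters; only the ones needed for scanning.
HEADER, MESSAGE, ENDMESSAGE, ESCAPE, ENDSTREAM = "#", ">", "<", "\\", "!"


def find_stream_end(buf: str, start: int) -> int:
    """Return the index just past the ENDSTREAM closing the stream at start."""
    i, in_message = start, False
    while i < len(buf):
        ch = buf[i]
        if in_message:
            if ch == ESCAPE:
                i += 1  # skip the escaped codepoint
            elif ch == ENDMESSAGE:
                in_message = False
            elif ch == ENDSTREAM:
                raise ValueError(f"unescaped ENDSTREAM inside a message at {i}")
        elif ch in (HEADER, MESSAGE):
            in_message = True  # a header or message body has begun
        elif ch == ENDSTREAM:
            return i + 1
        # anything else outside a message is garbage, including ESCAPE
        i += 1
    raise ValueError("no ENDSTREAM found")


def append_stream(payload: str, stream: str) -> str:
    """Hypothetical layout: payload, stream, then the stream's start offset."""
    return payload + stream + str(len(payload))


def locate_stream(buf: str) -> str:
    """Recover the embedded stream from the trailing decimal offset."""
    body = buf.rstrip("0123456789")
    if len(body) == len(buf):
        raise ValueError("no trailing offset")
    start = int(buf[len(body):])
    end = find_stream_end(buf, start)
    if end != len(body):
        raise ValueError("trailing offset does not point at the final stream")
    return buf[start:end]
```

Because ENDSTREAM is always escaped inside messages, the digits that follow it cannot be mistaken for stream content, and the payload in front of the stream may contain any codepoints, including delimiter characters.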

## Examples, using default delimiters

### Single message with a header and two records

```
#,id,name,value>
,1,taylor,developer
,2,namewith\,comma,valuewith\
newline<
```

### Single message with no header and two records

```
>
,1,taylor,developer
,2,namewith\,comma,valuewith\
newline<
```

### Single message with a header and no records

```
#,id,name,value><
```

### Single message with a header and one empty record

```
#,id,name,value>
<
```

### Single message with a header with an empty unit, and a record of all empty units

```
#,id,name,,value>
,,,,<
```

### The shortest valid message

```
><
```

### Single message with no header and one record with one empty unit

```
>
,<
```

### Single message with no header and one record with zero units, one record with one empty unit, and one record with two empty units

```
>

,
,,<
```

### All the previous examples concatenated as a stream of messages, with an ENDSTREAM character to delimit the end

This takes advantage of the fact that any amount of garbage data may appear
before any MESSAGE, HEADER, or ENDSTREAM character, so the newlines that follow
each message cause no issues.

```
#,id,name,value>
,1,taylor,developer
,2,namewith\,comma,valuewith\
newline<
>
,1,taylor,developer
,2,namewith\,comma,valuewith\
newline<
#,id,name,value><
#,id,name,value>
<
#,id,name,,value>
,,,,<
><
>
,<
>

,
,,<
!
```

### The shortest valid stream

```
!
```

# Advantages over CSV

* Headers are explicitly delimited, so there is never any guessing about whether
  the first row constitutes a header.
* Parsing is extremely simple and unambiguous.  There is no equivalent of CSV's
  problems of determining whether a newline is part of a value, whether quotes
  around a value are part of the value, or how to escape commas and quotes
  inside a value.
* It is possible to differentiate between a message with no records, a message
  with a record that has no units, and a message with a record that has one
  empty unit.
* Multiple documents with or without headers can be concatenated in the same
  stream and parsed without any loss.  Because of the rules for garbage data
  around messages, it is usually possible to naively concatenate message files,
  as long as none of them ends with an ENDSTREAM.
* The ENDSTREAM delimiter allows a UDV stream to be embedded in the middle of
  other data and parsed without issues using just a pointer to its start
  codepoint.
* When using the binary encoding, you can even embed binary data without issues.
  You can use the C0 control codes for a self-describing binary stream of
  messages as well, using the following rules:

```ebnf
HEADER = SOH;
MESSAGE = STX;
ENDMESSAGE = ETX;
RECORD = RS;
UNIT = US;
ESCAPE = ESC;
ENDSTREAM = EOT;
```

If you have a stream of mostly string messages, these rules can help serialize
into a compact stream with as little escaping as possible.  There is also a
C0-utf8 mode that does this while maintaining valid utf-8.
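To illustrate why the C0 delimiters reduce escaping, the following sketch encodes the same record under both delimiter sets, working on raw bytes. It is an assumption-laden illustration rather than this package's API: `DEFAULT`, `C0`, and `encode_message` are hypothetical names, the byte values are the standard ASCII codes for SOH, STX, ETX, EOT, ESC, RS, and US, and only delimiter bytes are escaped.

```python
from typing import Dict, Optional, Sequence

DEFAULT: Dict[str, bytes] = {
    "HEADER": b"#", "MESSAGE": b">", "ENDMESSAGE": b"<", "RECORD": b"\n",
    "UNIT": b",", "ESCAPE": b"\\", "ENDSTREAM": b"!",
}
C0: Dict[str, bytes] = {
    "HEADER": b"\x01", "MESSAGE": b"\x02", "ENDMESSAGE": b"\x03",  # SOH STX ETX
    "RECORD": b"\x1e", "UNIT": b"\x1f",                            # RS US
    "ESCAPE": b"\x1b", "ENDSTREAM": b"\x04",                       # ESC EOT
}


def encode_message(records: Sequence[Sequence[bytes]],
                   delims: Dict[str, bytes],
                   header: Optional[Sequence[bytes]] = None) -> bytes:
    """Encode one message as bytes using the given single-byte delimiters."""
    control = {d[0] for d in delims.values()}  # delimiter byte values as ints
    esc = delims["ESCAPE"]

    def units(values: Sequence[bytes]) -> bytes:
        out = bytearray()
        for value in values:
            out += delims["UNIT"]
            for b in value:
                if b in control:
                    out += esc  # escape any byte that collides with a delimiter
                out.append(b)
        return bytes(out)

    out = bytearray()
    if header is not None:
        out += delims["HEADER"] + units(header)
    out += delims["MESSAGE"]
    for record in records:
        out += delims["RECORD"] + units(record)
    out += delims["ENDMESSAGE"]
    return bytes(out)
```

With the record `[b"Hello, world!", b"a < b > c"]`, the default delimiters require four escapes (for `,`, `!`, `<`, and `>`), while the C0 set requires none because ordinary text rarely contains those control codes. Arbitrary binary units still work under either set; any byte that matches a delimiter is simply escaped.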

# Disadvantages

* The concessions needed to make the format unambiguous can make it somewhat
  unwieldy to read and write by hand, particularly when managing streams of
  many messages.
* Because strings are not length-prefixed, parsing strings has to be done
  character-by-character, and escape sequences can interfere with no-copy
  reading.  This is a disadvantage compared to other binary formats, not CSV or
  other text-oriented formats.