Expand description
Zet’s overall flow is:
- Form a starting
ZetSet
from the lines of the first input file. Each line in the set is represented by anIndexMap
key. TheIndexMap
value associated with each key is not part of the abstract set value but is used for operational bookkeeping. The type of these bookkeeping values depends on the operation being calculated and whether we’re keeping track of the number of times each line occurs or the number of files it occurs in. - Read the lines of each subsequent operand, updating the bookkeeping value as needed in order to decide whether to insert lines into or delete lines from the set.
- Output the lines of the resulting set, possibly annotated with count of the number of times the line appears in the input or the number of files the line appears in.
Zet’s structure is due to the following design decisions:
- We read the entire contents of the first input file into memory, so we can
borrow the
IndexMap
key that represents each of its lines rather than allocating aVec<u8>
for each of them. This saves both time and memory, on the assumption that few lines in the first file are duplicates. - We do not read the entire contents of subsequent files. This can cost us
time in key allocation, but often saves both time and memory:
Intersect
andDiff
never allocate, since they only remove lines from the set, while the other operation won’t do extensive allocation in the fairly common case where the second and subsequent input files have few lines not already present in the first file. - We start output with a Unicode byte order mark if and only the first input file begins with a byte order mark.
- We strip the line terminator (either
\r\n
or\n
) from the end of each input line. On output, we use the line terminator found at the end of the first line of the first input file. - We process all input files before doing any output. (This is not
absolutely necessary for the
Union
operation — see the huniq command. But it is for all other Zet operations.)
The set
module provides the ZetSet
structure. The ZetSet::new
function
takes a &[u8]
slice and a bookkeeping item used by the calling operation.
The call ZetSet::new(slice, item)
returns an initialized ZetSet
with:
- An
IndexMap
whose keys (lines) are borrowed fromslice
and initial bookkeeping values equal toitem
, and possibly updated if seen multiple times in the slice. - A field that indicates whether
slice
started with a byte order mark. - A field that holds the line terminator to be used, taken from the first
line of
slice
.
For a ZetSet
z
,
z.insert_or_update(operand, item)
usesIndexMap
’sentry
method to insertitem
as the value for lines inoperand
that were not already present inz
, or to callv.update_with(item)
on the bookkeeping item of lines that were present. Inserted lines are allocated, not borrowed, sooperand
need not outlivez
.z.update_if_present(operand, item)
callsv.update_with(file_number)
on the bookkeeping item of lines in operand that are present inz
, ignoring lines that are not already present.- Finally,
z.retain(keep)
retains lines for whichkeep(item.retention_value())
is true of the line’s bookkeeping item.
Modules§
- args
- Code to parse the command line using
clap
, and definitions of the parsed result - help
- operands
- Provides the
first_and_rest
function, which returns aVec<u8>
containing the contents of the first operand and anExactSizeIterator
over the remaining operands. Note: this different treatment of the first and remaining operands has the unfortunate result of requiring different code paths for translating UTF16 files into UTF8. That currently seems worth the cost. - operations
- Houses the
calculate
function - set
- Provides the
ZetSet
structure, intended to be initialized from the contents of the first input file. - styles