Zet’s overall flow is:
- Form a starting ZetSet from the lines of the first input file. Each line in the set is represented by an IndexMap key. The IndexMap value associated with each key is not part of the abstract set value but is used for operational bookkeeping. The type of these bookkeeping values depends on the operation being calculated and whether we’re keeping track of the number of times each line occurs or the number of files it occurs in.
- Read the lines of each subsequent operand, updating the bookkeeping value as needed in order to decide whether to insert lines into or delete lines from the set.
- Output the lines of the resulting set, possibly annotated with the number of times each line appears in the input or the number of files it appears in.
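The three steps above can be sketched for an intersection, as a rough illustration rather than Zet's actual code. The names lines and intersect are hypothetical, and a std HashMap plus a Vec stands in for IndexMap (which keeps insertion order by itself). The bookkeeping value here is the number of the last file in which a line was seen.

```rust
use std::collections::HashMap;

/// Split `input` into lines, stripping a trailing `\n` or `\r\n` from each.
fn lines(input: &[u8]) -> impl Iterator<Item = &[u8]> {
    input.split_inclusive(|&b| b == b'\n').map(|line| {
        let line = line.strip_suffix(b"\n").unwrap_or(line);
        line.strip_suffix(b"\r").unwrap_or(line)
    })
}

/// Lines of `first` that also occur in every operand of `rest`,
/// in the order they first appear in `first`.
fn intersect<'a>(first: &'a [u8], rest: &[&[u8]]) -> Vec<&'a [u8]> {
    // Step 1: form the starting set. Keys borrow from `first`; the
    // bookkeeping value is the number of the last file the line was seen in.
    let mut order: Vec<&'a [u8]> = Vec::new();
    let mut last_seen: HashMap<&'a [u8], usize> = HashMap::new();
    for line in lines(first) {
        if last_seen.insert(line, 0).is_none() {
            order.push(line);
        }
    }

    // Step 2: read each subsequent operand, update the bookkeeping value,
    // then delete lines that were not seen in that operand.
    for (n, operand) in rest.iter().enumerate() {
        let file_number = n + 1;
        for line in lines(operand) {
            if let Some(seen) = last_seen.get_mut(line) {
                *seen = file_number;
            }
        }
        last_seen.retain(|_, seen| *seen == file_number);
    }

    // Step 3: output the surviving lines in their original order.
    order
        .into_iter()
        .filter(|line| last_seen.contains_key(*line))
        .collect()
}
```

Only the first input is held in memory as a whole; lines of later operands are looked up and then discarded, which is why an intersection never needs to allocate a key.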
Zet’s structure is due to the following design decisions:
- We read the entire contents of the first input file into memory, so we can borrow the IndexMap key that represents each of its lines rather than allocating a Vec<u8> for each of them. This saves both time and memory, on the assumption that few lines in the first file are duplicates.
- We do not read the entire contents of subsequent files. This can cost us time in key allocation, but often saves both time and memory: Intersect and Diff never allocate, since they only remove lines from the set, while the other operations won’t do extensive allocation in the fairly common case where the second and subsequent input files have few lines not already present in the first file.
- We start output with a Unicode byte order mark if and only if the first input file begins with a byte order mark.
- We strip the line terminator (either \r\n or \n) from the end of each input line. On output, we use the line terminator found at the end of the first line of the first input file.
- We process all input files before doing any output. (This is not absolutely necessary for the Union operation; see the huniq command. But it is for all other Zet operations.)
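The byte order mark and line terminator decisions above can be illustrated with a small sketch. It is not taken from Zet's source: the function names are hypothetical, only the UTF-8 byte order mark (EF BB BF) is handled, and falling back to \n when the first line has no terminator is an assumption made for this example.

```rust
/// The UTF-8 encoding of the Unicode byte order mark.
const BOM: &[u8] = b"\xEF\xBB\xBF";

/// Report whether `input` starts with a byte order mark and return the
/// input with that mark removed.
fn split_bom(input: &[u8]) -> (bool, &[u8]) {
    match input.strip_prefix(BOM) {
        Some(rest) => (true, rest),
        None => (false, input),
    }
}

/// The terminator that ends the first line of `input`.
/// Assumption: default to `\n` if the input contains no newline at all.
fn first_line_terminator(input: &[u8]) -> &'static [u8] {
    match input.iter().position(|&b| b == b'\n') {
        Some(i) if i > 0 && input[i - 1] == b'\r' => b"\r\n",
        _ => b"\n",
    }
}

/// Write `lines` using the byte order mark and terminator decisions made
/// from the first input file.
fn write_lines(lines: &[&[u8]], has_bom: bool, terminator: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    if has_bom {
        out.extend_from_slice(BOM);
    }
    for line in lines {
        out.extend_from_slice(line);
        out.extend_from_slice(terminator);
    }
    out
}
```

Because both decisions depend only on the first input file, they can be made once, before any other operand is read.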
The set module provides the ZetSet structure. The ZetSet::new function
takes a &[u8] slice and a bookkeeping item used by the calling operation.
The call ZetSet::new(slice, item) returns an initialized ZetSet with:
- An IndexMap whose keys (lines) are borrowed from slice and whose bookkeeping values are initially equal to item, possibly updated if a line is seen multiple times in the slice.
- A field that indicates whether slice started with a byte order mark.
- A field that holds the line terminator to be used, taken from the first line of slice.
For a ZetSet z,
- z.insert_or_update(operand, item) uses IndexMap’s entry method to insert item as the value for lines in operand that were not already present in z, or to call v.update_with(item) on the bookkeeping item of lines that were present. Inserted lines are allocated, not borrowed, so operand need not outlive z.
- z.update_if_present(operand, item) calls v.update_with(item) on the bookkeeping item of lines in operand that are present in z, ignoring lines that are not already present.
- Finally, z.retain(keep) retains lines for which keep(item.retention_value()) is true of the line’s bookkeeping item.
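A simplified model of this interface may make the borrowing and allocation behavior easier to see. Everything below is a sketch under stated assumptions, not Zet's implementation: SketchSet, Bookkeeping, Count and sorted_lines are hypothetical names, the retention value is assumed to be a u32, the byte order mark and line terminator fields are omitted, and a std HashMap stands in for IndexMap, so insertion order is not preserved. Where the real insert_or_update is described as using the entry method, the sketch uses get_mut followed by insert so that only new lines are allocated.

```rust
use std::borrow::Cow;
use std::collections::HashMap;

/// Hypothetical bookkeeping trait, modeled on the `update_with` and
/// `retention_value` calls described above.
trait Bookkeeping: Copy {
    fn update_with(&mut self, other: Self);
    fn retention_value(self) -> u32;
}

/// Example bookkeeping item: counts how many times a line occurs.
#[derive(Clone, Copy)]
struct Count(u32);

impl Bookkeeping for Count {
    fn update_with(&mut self, other: Self) {
        self.0 += other.0;
    }
    fn retention_value(self) -> u32 {
        self.0
    }
}

/// Split `input` into lines, stripping a trailing `\n` or `\r\n` from each.
fn lines(input: &[u8]) -> impl Iterator<Item = &[u8]> {
    input.split_inclusive(|&b| b == b'\n').map(|line| {
        let line = line.strip_suffix(b"\n").unwrap_or(line);
        line.strip_suffix(b"\r").unwrap_or(line)
    })
}

/// Stand-in for `ZetSet`: keys are borrowed from the first input or owned.
struct SketchSet<'a, B> {
    items: HashMap<Cow<'a, [u8]>, B>,
}

impl<'a, B: Bookkeeping> SketchSet<'a, B> {
    /// Keys borrow from `slice`; a repeated line updates its bookkeeping item.
    fn new(slice: &'a [u8], item: B) -> Self {
        let mut items: HashMap<Cow<'a, [u8]>, B> = HashMap::new();
        for line in lines(slice) {
            items
                .entry(Cow::Borrowed(line))
                .and_modify(|v| v.update_with(item))
                .or_insert(item);
        }
        SketchSet { items }
    }

    /// New lines are copied into owned keys, so `operand` need not outlive `self`.
    fn insert_or_update(&mut self, operand: &[u8], item: B) {
        for line in lines(operand) {
            match self.items.get_mut(line) {
                Some(v) => v.update_with(item),
                None => {
                    self.items.insert(Cow::Owned(line.to_vec()), item);
                }
            }
        }
    }

    /// Lines not already in the set are ignored, so nothing is allocated.
    fn update_if_present(&mut self, operand: &[u8], item: B) {
        for line in lines(operand) {
            if let Some(v) = self.items.get_mut(line) {
                v.update_with(item);
            }
        }
    }

    /// Keep only lines whose retention value satisfies `keep`.
    fn retain(&mut self, keep: impl Fn(u32) -> bool) {
        self.items.retain(|_, v| keep(v.retention_value()));
    }

    /// The lines of the set, sorted so the result is deterministic.
    fn sorted_lines(&self) -> Vec<Vec<u8>> {
        let mut out: Vec<Vec<u8>> = self.items.keys().map(|k| k.to_vec()).collect();
        out.sort();
        out
    }
}
```

With Count(1) as the item, building the set from a\nb\na\n gives a a count of 2 and b a count of 1; a later retain(|n| n >= 2) would then keep only a.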
Modules
- args: Code to parse the command line using clap, and definitions of the parsed result.
- help
- operands: Provides the first_and_rest function, which returns a Vec<u8> containing the contents of the first operand and an ExactSizeIterator over the remaining operands. Note: this different treatment of the first and remaining operands has the unfortunate result of requiring different code paths for translating UTF-16 files into UTF-8. That currently seems worth the cost.
- operations: Houses the calculate function.
- set: Provides the ZetSet structure, intended to be initialized from the contents of the first input file.