splitby

A high-performance Rust command-line tool that splits text by a regex delimiter and returns selected parts of the result. A powerful, multi-threaded alternative to cut with regex support.
How to use
The usage format is splitby <FLAGS> <DELIMITER> <SELECTIONS>.
The delimiter can be any regex string (wrapped in /.../) or a literal string, e.g. "/\\s+/" for regex or "," for literal.
The selection states which values you want. It can be a single number (e.g. 2) or a range (e.g. 2-3). Indexes are 1-based, as is standard for Unix text tools like cut and awk.
Negative numbers are valid and count from the end, e.g. -1 or -3--1. Mixing positive and negative indexes is allowed, but it causes an error if the starting index resolves to a position after the ending index.
You can also use special keywords: start or first (equivalent to 1), and end or last (equivalent to -1). These can be used in ranges like first-last or start-2.
Multiple indexes can be used, with the syntax 1 3 4-5. Selections are joined by the delimiter.
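The indexing rules above can be sketched in a few lines of Rust. This is only an illustrative model of the documented behaviour, not the tool's actual source; the function names resolve_index and split_range are hypothetical.

```rust
/// Split a selection token such as "2-3", "-3--1" or "first-last" into its
/// start and end parts. A single index like "2" or "-1" yields the same part twice.
fn split_range(token: &str) -> (&str, &str) {
    // Skip the first character (it may be the '-' of a negative start),
    // then split on the next '-'.
    match token.char_indices().skip(1).find(|&(_, c)| c == '-') {
        Some((i, _)) => (&token[..i], &token[i + 1..]),
        None => (token, token),
    }
}

/// Resolve one index token to a zero-based position in a record of `len` items.
/// Returns None if the token is not a valid index or falls out of bounds.
fn resolve_index(token: &str, len: usize) -> Option<usize> {
    let n: i64 = match token {
        "start" | "first" => 1,
        "end" | "last" => -1,
        other => other.parse().ok()?,
    };
    let len = len as i64;
    // Positive indexes are 1-based; negative indexes count back from the end.
    let pos = if n > 0 {
        n - 1
    } else if n < 0 {
        len + n
    } else {
        return None; // 0 is not a valid 1-based index
    };
    if (0..len).contains(&pos) {
        Some(pos as usize)
    } else {
        None
    }
}
```

With four fields, resolve_index("2", 4) gives position 1 and resolve_index("last", 4) gives position 3, while split_range("-3--1") yields the pair ("-3", "-1").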
Examples
Simple use case
|
> boo
|
> hoo
Regex
|
> boo
Range
|
> is,a,test
Negative index
|
> a
|
> is
Multiple indexes
|
> this
Whole-input mode
|
> line2
Character mode
|
> cé # Character mode selects specific characters, rather than fields
Special keywords
|
> this
|
> this
|
> test
Installation
Homebrew (macOS/Linux):
Or you can find binaries to install under releases.
Alternatively, you can build from source if you prefer:
- Install Rust via rustup
git clone https://github.com/serenacula/splitby
cd splitby
cargo build --release
mv ./target/release/splitby /usr/local/bin/
Useful Aliases
It's also suggested to add the following aliases to your .bashrc or .zshrc for some common use cases:
# Split on newline
# Split on whitespace (regex), skipping empty fields
These allow for fast and simple string processing:
|
> line1
Or quick table processing:
file.txt:
Item Value
Apple 1.5
Pear 1.3
Car 30000
|
> Item
> Apple
> Pear
> Car
Options
| Flag | Disable Flag | Description | Default Value |
|---|---|---|---|
| -h, --help | | Print help text | |
| -v, --version | | Print version number | |
| -i, --input=<FILE> | | Provide an input file | |
| -o, --output=<FILE> | | Write output to a file | |
| -d, --delimiter=<REGEX> | | Specify the delimiter to use (required for fields mode) | |
| -j, --join=<STRING\|HEX> | | Join each selection with a given string | |
| --placeholder=<STRING\|HEX> | | Inserts placeholder for invalid selections | |
| -p, --per-line | | Processes the input line by line (default) | Enabled |
| -w, --whole-string | | Processes the input as a single string, rather than each line separately | |
| -z, --zero-terminated | | Processes the input as zero-terminated strings | |
| -f, --fields | | Select fields split by delimiter (default) | Enabled |
| -b, --bytes | | Select bytes from the input | |
| -c, --characters | | Select characters from the input | |
| -a, --align[=MODE] | | Align fields to consistent column widths (left, right, squash) | left |
| --count | | Return the number of results after splitting | |
| --invert | | Inverts the chosen selection | |
| -e, --skip-empty | -E, --no-skip-empty | Skips empty fields when indexing or counting | Disabled |
| -s, --strict | --no-strict | Shorthand for all strict features | |
| --strict-bounds | --no-strict-bounds | Emit error if range is out of bounds | Disabled |
| --strict-return | --no-strict-return | Emit error if there is no result | Disabled |
| --strict-range-order | --no-strict-range-order | Emit error if start of a range is greater than the end | Enabled |
| --strict-utf8 | --no-strict-utf8 | Emit error on invalid UTF-8 sequences | Disabled |
By default the input string is taken from stdin, unless the --input flag is used.
Disable flags are available for making aliasing easier, allowing you to specify your preferred settings. Flags respect last-flag-wins logic.
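To make last-flag-wins concrete, here is a minimal Rust sketch of how a flag and its disable flag could be resolved. It is a model of the documented behaviour only, and the function name skip_empty_enabled is hypothetical.

```rust
/// Decide whether skip-empty ends up enabled after scanning all arguments.
/// Later flags override earlier ones, so an alias can set a preference
/// and a single invocation can still override it.
fn skip_empty_enabled(args: &[&str]) -> bool {
    let mut enabled = false; // documented default: disabled
    for arg in args {
        match *arg {
            "-e" | "--skip-empty" => enabled = true,
            "-E" | "--no-skip-empty" => enabled = false,
            _ => {} // unrelated arguments are ignored in this sketch
        }
    }
    enabled
}
```

For example, an alias that passes -e followed by a manual -E on the command line would leave skip-empty disabled.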
Delimiter
-d, --delimiter
This flag specifies the delimiter to use. It can be a literal string or a regex wrapped in /.../.
| > this,test
As shorthand, you can drop the -d flag if you use the format splitby <FLAGS> <DELIMITER> <SELECTIONS>, and the delimiter will be inferred from its position. Note that once the delimiter has been read, every remaining argument is parsed as a selection. To avoid relying on argument order, declare the delimiter explicitly with the -d flag. For example:
| > this,test
| > invalid
| > this,test
Input Modes
MODE: Per-line
-p, --per-line (default: enabled)
In this mode the tool runs once per line. Useful when dealing with a table of information.
For example:
staff.csv:
Name,Age
Bob,20
Alice,30
Alex,35
| > Name
> Bob
> Alice
> Alex
MODE: Whole-string
-w, --whole-string
This treats the input as a single string. It runs once over the entire input. Useful for situations where you want to treat the string as a single blob, or you wish to use \n as your delimiter.
| > b
MODE: Zero-terminated
-z, --zero-terminated
This mode treats the input as a sequence of zero-terminated strings. It runs once over the entire input. Useful for processing filenames from find -print0 or other tools that output null-terminated strings.
# split on /, join with \n, and get the last field
|
> file1.txt
> file2.txt
> file3.txt
Selection Modes
MODE: Fields
-f, --fields (default: enabled)
This mode treats the input as a sequence of fields, split by a delimiter.
|
> is
MODE: Chars
-c, --characters
This mode treats the input as a sequence of characters. It runs once over the entire input. Useful when you need specific characters rather than delimited fields.
Note: Unlike cut, this respects visible characters, rather than byte counts.
|
> fé
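The difference between character and byte selection can be illustrated with a short Rust sketch. It assumes "character" means a Unicode scalar value (Rust's char); the tool itself may work on larger units such as grapheme clusters. The function names and the input "café" are hypothetical.

```rust
/// Select the 1-based inclusive range start..=end of characters.
fn select_chars(input: &str, start: usize, end: usize) -> String {
    input
        .chars()
        .skip(start.saturating_sub(1))
        .take((end + 1).saturating_sub(start))
        .collect()
}

/// Select the 1-based inclusive range start..=end of raw bytes.
fn select_bytes(input: &[u8], start: usize, end: usize) -> Vec<u8> {
    input
        .iter()
        .copied()
        .skip(start.saturating_sub(1))
        .take((end + 1).saturating_sub(start))
        .collect()
}
```

On the input "café", selecting characters 3-4 yields "fé", whereas selecting bytes 3-4 yields the byte for f followed by only the first byte of the two-byte é, which is not valid UTF-8 on its own.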
MODE: Bytes
-b, --bytes
This mode treats the input as a sequence of bytes.
Note: Join is not supported in bytes mode.
|
> his
Selection Options
Invert
--invert
The invert option selects everything except what you choose.
|
> is
|
> this
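Conceptually, inverting a selection means keeping every index that was not chosen. A minimal Rust sketch, assuming the selection has already been resolved to positive 1-based indexes (the function name invert is hypothetical):

```rust
/// Return every 1-based index in 1..=len that is NOT in `selected`.
fn invert(selected: &[usize], len: usize) -> Vec<usize> {
    (1..=len).filter(|i| !selected.contains(i)).collect()
}
```

With four fields and a selection of 2, the inverted selection is 1, 3 and 4.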
Skip-empty
-e, --skip-empty | -E, --no-skip-empty (default: disabled)
By default the tool does not skip empty values. --skip-empty tells it to ignore empty fields when counting and indexing.
With indexes:
|
>
|
> hoo
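The effect of --skip-empty on indexing can be modelled as filtering out empty fields before counting positions. The Rust sketch below is illustrative only; nth_field and the sample fields are hypothetical.

```rust
/// Return the field at 1-based `index`, optionally ignoring empty fields
/// when counting positions.
fn nth_field<'a>(fields: &[&'a str], index: usize, skip_empty: bool) -> Option<&'a str> {
    fields
        .iter()
        .copied()
        .filter(|f| !skip_empty || !f.is_empty())
        .nth(index.checked_sub(1)?)
}
```

For the fields boo, an empty field, and hoo, index 2 is the empty field by default, but becomes hoo once empty fields are skipped.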
Transform Options
Align
-a, --align
This option pads fields so that columns line up across lines. Selections are optional; omitting them returns all fields.
It accepts an optional mode:
- left (default): fields are left-aligned within their column
- right: fields are right-aligned within their column
- squash: padding is placed after the delimiter, aligning the first character of each field
|
> apple,banana,cherry
> a
|
> apple,banana,cherry
> a,
|
> apple,banana,cherry
> a,
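The three alignment modes differ only in where the padding goes relative to the field and the delimiter. The following Rust sketch models that, assuming column width is the widest field in the column measured in characters and that the last field of a row gets no trailing padding; both are assumptions, and the names Align and align_rows are hypothetical.

```rust
#[derive(Clone, Copy)]
enum Align {
    Left,
    Right,
    Squash,
}

/// Pad every row so that columns line up, placing padding according to `mode`.
fn align_rows(rows: &[Vec<&str>], delim: &str, mode: Align) -> Vec<String> {
    // Column width = widest field seen in that column.
    let cols = rows.iter().map(|r| r.len()).max().unwrap_or(0);
    let mut widths = vec![0usize; cols];
    for row in rows {
        for (i, f) in row.iter().enumerate() {
            widths[i] = widths[i].max(f.chars().count());
        }
    }
    rows.iter()
        .map(|row| {
            let mut line = String::new();
            for (i, f) in row.iter().enumerate() {
                let pad = " ".repeat(widths[i] - f.chars().count());
                let last = i + 1 == row.len();
                match mode {
                    // field, padding, delimiter
                    Align::Left => {
                        line.push_str(f);
                        if !last {
                            line.push_str(&pad);
                            line.push_str(delim);
                        }
                    }
                    // padding, field, delimiter
                    Align::Right => {
                        line.push_str(&pad);
                        line.push_str(f);
                        if !last {
                            line.push_str(delim);
                        }
                    }
                    // field, delimiter, padding
                    Align::Squash => {
                        line.push_str(f);
                        if !last {
                            line.push_str(delim);
                            line.push_str(&pad);
                        }
                    }
                }
            }
            line
        })
        .collect()
}
```

In this model the widest row is left unchanged in every mode, and only shorter fields gain padding.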
Join
-j <STRING|HEX>, --join=<STRING|HEX>
This flag lets you control how selections are joined together.
By default, the joiner is the delimiter after the previous selection. If unavailable, the joiner is the delimiter before the next selection. If both are unavailable, the joiner is the first delimiter in the record.
|
> this
> a
|
> this,is
> a,test
The join flag also accepts hex values (with 0x or 0X prefix) for multi-byte joiners or non-printable characters:
|
> this,
> a,
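As a rough illustration of how a STRING|HEX argument could be interpreted, the Rust sketch below treats a 0x or 0X prefix as a run of hex byte pairs and anything else as a literal string. How the tool handles malformed hex (odd length, non-hex digits) is not documented here, so rejecting it is an assumption; parse_string_or_hex is a hypothetical name.

```rust
/// Interpret an argument as raw bytes: hex if prefixed with 0x/0X, literal otherwise.
fn parse_string_or_hex(arg: &str) -> Result<Vec<u8>, String> {
    let hex = match arg.strip_prefix("0x").or_else(|| arg.strip_prefix("0X")) {
        Some(h) => h,
        None => return Ok(arg.as_bytes().to_vec()), // plain string
    };
    // Assumption: require a non-empty, even-length run of hex digits.
    let valid = !hex.is_empty()
        && hex.len() % 2 == 0
        && hex.bytes().all(|b| b.is_ascii_hexdigit());
    if !valid {
        return Err(format!("invalid hex value: {}", arg));
    }
    Ok((0..hex.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&hex[i..i + 2], 16).expect("validated hex digits"))
        .collect())
}
```

Under this model 0x2C decodes to a single comma byte, and a multi-byte value such as 0xE29C93 decodes to the three bytes of one UTF-8 character.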
There are also a number of useful keywords you can use (only in fields mode):
| Keyword | Description |
|---|---|
| --join=auto | Automatically tries after-previous, then before-next, then space |
| --join=after-previous | Use the delimiter after the previous selection |
| --join=before-next | Use the delimiter before the next selection |
| --join=first | Use the first delimiter in the record |
| --join=last | Use the last delimiter in the record |
| --join=space | Use a space character |
| --join=none | No join (equivalent to "") |
Placeholder
--placeholder=<STRING|HEX>
This flag is useful when you need a reliable output format. Normally an invalid selection is skipped; with this flag, an invalid selection outputs the given placeholder string instead.
The placeholder accepts both string values and hex values (with 0x or 0X prefix). Hex values are useful for multi-byte placeholders or non-printable characters.
A join string is added here for clarity:
| > boo:hoo
|
> boo:?:hoo
|
> boo,,hoo # empty string placeholder
echo |
> boo,,
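The placeholder behaviour can be modelled as follows: each selection either resolves to a field, or is skipped, or is replaced by the placeholder when one is given. This Rust sketch is illustrative only; select_with_placeholder and the sample fields are hypothetical, and it handles positive 1-based indexes only.

```rust
/// Select fields by 1-based index and join them. Invalid indexes are skipped
/// unless a placeholder is supplied, in which case the placeholder is output.
fn select_with_placeholder(
    fields: &[&str],
    indexes: &[usize],
    placeholder: Option<&str>,
    join: &str,
) -> String {
    let parts: Vec<&str> = indexes
        .iter()
        .filter_map(|&i| match i.checked_sub(1).and_then(|p| fields.get(p)) {
            Some(f) => Some(*f),
            None => placeholder, // None here means "skip the invalid selection"
        })
        .collect();
    parts.join(join)
}
```

With the fields boo and hoo, selecting 1 3 2 with a ":" joiner gives boo:hoo by default and boo:?:hoo when the placeholder is "?".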
Count
--count
The count option allows you to get the number of results:
|
> 4
As with index selection, empty fields are counted unless you use the --skip-empty flag.
Options that only affect which selections are output, e.g. --invert and --placeholder, are ignored.
|
> 3
|
> 2
With count:
|
> 3
|
> 2
Strictness Options
Strict
--strict | --no-strict
The plain --strict flag is shorthand for all strictness options listed below.
Strict Bounds
--strict-bounds | --no-strict-bounds (default: disabled)
In normal operation, the tool silently clamps out-of-range bounds to the fields that exist. --strict-bounds tells it to emit an error instead.
For example, this is silently corrected to 2-3. With strict mode, it emits an error to stderr instead:
|
> hoo
|
> line )
This also applies to single indexes out of bounds.
|
> # Empty output (index out of bounds)
|
> line )
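The difference between the default clamping and --strict-bounds can be sketched in Rust as below. The sketch assumes indexes have already been resolved to positive 1-based values with start no greater than end; the function name clamp_range and the error text are hypothetical, not the tool's actual message.

```rust
/// Clamp a 1-based inclusive range to a record of `len` fields.
/// Returns Ok(None) when nothing is left to select, and an error when
/// strict bounds are requested and the range reaches outside the record.
fn clamp_range(
    start: usize,
    end: usize,
    len: usize,
    strict_bounds: bool,
) -> Result<Option<(usize, usize)>, String> {
    let in_bounds = start >= 1 && end <= len;
    if strict_bounds && !in_bounds {
        // Hypothetical wording; the real error message may differ.
        return Err(format!(
            "range {}-{} is out of bounds for {} fields",
            start, end, len
        ));
    }
    let (lo, hi) = (start.max(1), end.min(len));
    Ok(if lo <= hi { Some((lo, hi)) } else { None })
}
```

For a record of three fields, the range 2-5 is clamped to 2-3 by default and rejected in strict mode, while the single index 4 produces no output by default.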
Strict Return
--strict-return | --no-strict-return (default: disabled)
In situations where the selected result would be empty, the tool defaults to emitting nothing. --strict-return tells it to emit an error instead.
For example:
|
> # Empty output (field 1 is empty)
|
> line
Similarly, if you skip empty fields:
|
> # Empty output (all fields are empty)
|
> line
It has no effect when --count is used.
Strict Range Order
--strict-range-order | --no-strict-range-order (default: enabled)
This flag emits an error if the start of a range is after the end, e.g. 3-1.
|
> line ) )
|
> # No error emitted
Strict UTF-8
--strict-utf8 | --no-strict-utf8 (default: disabled)
By default, when the tool encounters invalid UTF-8 sequences, it replaces them with the Unicode replacement character (U+FFFD). When --strict-utf8 is enabled, the tool will emit an error instead of silently replacing invalid sequences.
This is particularly useful when processing binary data or when you need to ensure data integrity.
# Invalid UTF-8 sequence (example)
|
> hello # Replacement character used, but only first 5 characters returned
echo |
> line
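The two behaviours map naturally onto the Rust standard library's lossy and strict UTF-8 conversions. The sketch below shows that distinction; it is not taken from the tool's source, and the function name decode is hypothetical.

```rust
/// Decode bytes as UTF-8, either strictly or with U+FFFD replacement.
fn decode(bytes: &[u8], strict_utf8: bool) -> Result<String, String> {
    if strict_utf8 {
        // Fails on the first invalid sequence.
        String::from_utf8(bytes.to_vec()).map_err(|e| e.to_string())
    } else {
        // Replaces each invalid sequence with U+FFFD.
        Ok(String::from_utf8_lossy(bytes).into_owned())
    }
}
```

Given the bytes of "hello", a stray 0xFF byte, and "world", the lossy path returns the text with one replacement character in the middle, while the strict path returns an error.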