# cw - Count Words

A `wc` clone in Rust.

## Synopsis

```
-% cw --help
cw 0.2.0
Thomas Hurst <tom@hur.st>
Count Words - word, line, character and byte count

USAGE:
    cw [FLAGS] [input]...

FLAGS:
    -c, --bytes              Count bytes
    -m, --chars              Count UTF-8 characters instead of bytes
    -h, --help               Prints help information
    -l, --lines              Count lines
    -L, --max-line-length    Count bytes (default) or characters (-m) of the longest line
    -V, --version            Prints version information
    -w, --words              Count words

ARGS:
    <input>...    Input files

-% cw Dickens_Charles_Pickwick_Papers.xml
 3449440 51715840 341152640 Dickens_Charles_Pickwick_Papers.xml
```

## Performance

Line counts are optimized using the `bytecount` crate:

```
Benchmark #1: wc -l Dickens_Charles_Pickwick_Papers.xml
  Time (mean ± σ):     439.7 ms ±   2.0 ms    [User: 354.9 ms, System: 84.5 ms]
  Range (min … max):   435.3 ms … 441.4 ms

Benchmark #2: gwc -l Dickens_Charles_Pickwick_Papers.xml
  Time (mean ± σ):     533.0 ms ±   1.7 ms    [User: 388.8 ms, System: 144.0 ms]
  Range (min … max):   530.9 ms … 535.1 ms

Benchmark #3: cw -l Dickens_Charles_Pickwick_Papers.xml
  Time (mean ± σ):     127.9 ms ±   1.5 ms    [User: 24.1 ms, System: 103.7 ms]
  Range (min … max):   125.1 ms … 131.3 ms

Summary
  'cw -l Dickens_Charles_Pickwick_Papers.xml' ran
    3.44 ± 0.04 times faster than 'wc -l Dickens_Charles_Pickwick_Papers.xml'
    4.17 ± 0.05 times faster than 'gwc -l Dickens_Charles_Pickwick_Papers.xml'
```
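
As a rough illustration of what chunked newline counting looks like, here is a minimal sketch. It is not `cw`'s actual code: the function name `count_lines` and the 64 KiB buffer size are assumptions made for the example, and the inner loop uses plain `std` iterators where `bytecount::count(chunk, b'\n')` would go.

```rust
use std::io::{self, Read};

/// Count newline bytes in everything `reader` yields, one chunk at a time.
/// Hypothetical helper for illustration; not taken from cw's source.
fn count_lines<R: Read>(mut reader: R) -> io::Result<u64> {
    let mut buf = [0u8; 64 * 1024]; // chunk size is an arbitrary choice
    let mut lines = 0u64;
    loop {
        let n = match reader.read(&mut buf) {
            Ok(0) => return Ok(lines), // end of input
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        // Plain-std stand-in for `bytecount::count(&buf[..n], b'\n')`.
        lines += buf[..n].iter().filter(|&&b| b == b'\n').count() as u64;
    }
}
```

Because only newline bytes are counted, a final line without a trailing newline does not add to the total, which matches the usual `wc -l` behaviour.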

Line counts with line length are optimized using the `memchr` crate:

```
Benchmark #1: wc -lL Dickens_Charles_Pickwick_Papers.xml
  Time (mean ± σ):     441.6 ms ±   1.8 ms    [User: 354.7 ms, System: 86.5 ms]
  Range (min … max):   438.5 ms … 443.8 ms

Benchmark #2: gwc -lL Dickens_Charles_Pickwick_Papers.xml
  Time (mean ± σ):      3.851 s ±  0.005 s    [User: 3.710 s, System: 0.141 s]
  Range (min … max):    3.847 s …  3.864 s

Benchmark #3: cw -lL Dickens_Charles_Pickwick_Papers.xml
  Time (mean ± σ):     255.6 ms ±   1.1 ms    [User: 154.6 ms, System: 100.9 ms]
  Range (min … max):   253.3 ms … 256.9 ms

Summary
  'cw -lL Dickens_Charles_Pickwick_Papers.xml' ran
    1.73 ± 0.01 times faster than 'wc -lL Dickens_Charles_Pickwick_Papers.xml'
   15.07 ± 0.07 times faster than 'gwc -lL Dickens_Charles_Pickwick_Papers.xml'
```
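
The idea behind searching for newlines rather than inspecting every byte can be sketched as follows. This is an assumption-laden example, not `cw`'s implementation: `lines_and_max_len` is a hypothetical name, the input is a single in-memory slice (lines spanning buffer boundaries are not handled), and `Iterator::position` stands in for `memchr::memchr(b'\n', rest)`.

```rust
/// Return (line count, longest line length in bytes) for `data`.
/// Hypothetical helper for illustration; not taken from cw's source.
fn lines_and_max_len(data: &[u8]) -> (u64, usize) {
    let mut lines = 0u64;
    let mut longest = 0usize;
    let mut rest = data;
    // Plain-std stand-in for `memchr::memchr(b'\n', rest)`.
    while let Some(i) = rest.iter().position(|&b| b == b'\n') {
        lines += 1;
        longest = longest.max(i); // `i` is the length of the line before the newline
        rest = &rest[i + 1..]; // continue after the newline
    }
    // Assumption: a trailing fragment without a newline still counts towards
    // the longest line, but not towards the line count.
    (lines, longest.max(rest.len()))
}
```

Each search jumps straight to the next newline, so the line length falls out of the match offset instead of a per-byte counter.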

Note that without `-m`, `cw` operates only on bytes, and it never consults your locale.

```
Benchmark #1: wc Dickens_Charles_Pickwick_Papers.xml
  Time (mean ± σ):      2.708 s ±  0.002 s    [User: 2.612 s, System: 0.095 s]
  Range (min … max):    2.706 s …  2.712 s

Benchmark #2: gwc Dickens_Charles_Pickwick_Papers.xml
  Time (mean ± σ):      3.851 s ±  0.003 s    [User: 3.714 s, System: 0.136 s]
  Range (min … max):    3.847 s …  3.856 s

Benchmark #3: cw Dickens_Charles_Pickwick_Papers.xml
  Time (mean ± σ):      2.026 s ±  0.001 s    [User: 1.939 s, System: 0.087 s]
  Range (min … max):    2.024 s …  2.028 s

Summary
  'cw Dickens_Charles_Pickwick_Papers.xml' ran
    1.34 ± 0.00 times faster than 'wc Dickens_Charles_Pickwick_Papers.xml'
    1.90 ± 0.00 times faster than 'gwc Dickens_Charles_Pickwick_Papers.xml'
```
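
To make "operates only on bytes" concrete, a byte-oriented word count might look like the sketch below. The names `is_space` and `count_words` are hypothetical, and the definition of whitespace (the ASCII set recognised by C's `isspace`) is an assumption for the example rather than a statement of what `cw` does.

```rust
/// ASCII whitespace as in C's `isspace`: space, \t, \n, \r, form feed, vertical tab.
/// (`u8::is_ascii_whitespace` omits vertical tab, so it is added explicitly.)
fn is_space(b: u8) -> bool {
    b.is_ascii_whitespace() || b == 0x0b
}

/// Count maximal runs of non-whitespace bytes. No locale lookup and no UTF-8
/// decoding: bytes >= 0x80 are simply treated as part of a word.
fn count_words(data: &[u8]) -> u64 {
    let mut words = 0u64;
    let mut in_word = false;
    for &b in data {
        if is_space(b) {
            in_word = false;
        } else if !in_word {
            in_word = true; // first byte of a new word
            words += 1;
        }
    }
    words
}
```

The `in_word` flag is the only state, so the same loop can be carried across successive buffers when reading a file in chunks.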

`-m` enables UTF-8 processing, and currently has no fast paths.

```
Benchmark #1: wc -mLlw Dickens_Charles_Pickwick_Papers.xml
  Time (mean ± σ):      8.972 s ±  0.019 s    [User: 8.875 s, System: 0.096 s]
  Range (min … max):    8.958 s …  9.013 s

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark #2: gwc -mLlw Dickens_Charles_Pickwick_Papers.xml
  Time (mean ± σ):      3.852 s ±  0.008 s    [User: 3.700 s, System: 0.151 s]
  Range (min … max):    3.846 s …  3.867 s

Benchmark #3: cw -mLlw Dickens_Charles_Pickwick_Papers.xml
  Time (mean ± σ):      3.721 s ±  0.003 s    [User: 3.598 s, System: 0.123 s]
  Range (min … max):    3.715 s …  3.726 s

Summary
  'cw -mLlw Dickens_Charles_Pickwick_Papers.xml' ran
    1.04 ± 0.00 times faster than 'gwc -mLlw Dickens_Charles_Pickwick_Papers.xml'
    2.41 ± 0.01 times faster than 'wc -mLlw Dickens_Charles_Pickwick_Papers.xml'
```
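
For reference, one well-known way to count UTF-8 characters without fully decoding them is to count every byte that is not a continuation byte. The sketch below shows that general technique under the assumption of well-formed input; `count_chars` is a hypothetical name and this is not presented as how `cw -m` works.

```rust
/// Count UTF-8 code points by counting bytes that are not continuation bytes.
/// Continuation bytes have the bit pattern 10xxxxxx; every code point has
/// exactly one byte that does not. Assumes well-formed UTF-8.
fn count_chars(data: &[u8]) -> u64 {
    data.iter().filter(|&&b| (b & 0xC0) != 0x80).count() as u64
}
```

On valid input this agrees with `str::chars().count()`. On invalid input the two can differ, which is one reason a real implementation needs a policy for malformed sequences.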

These benchmarks were run on FreeBSD 12 on a 2.1 GHz Westmere Xeon.  `gwc` is
from GNU coreutils 8.30.

For best results, build with:

```
cargo build --release --features runtime-dispatch-simd
```

This enables SIMD optimizations for line counting.  It has no effect when `cw`
is asked to count anything else.


## Future

 * Test suite.
 * Refactor to reduce the code sprawl.
 * Improve `SIGINFO` support.
 * Factor internals out into a library. (#1)
 * Improve multibyte support.
 * Possibly implement locale.
 * Replace clap/structopt with something lighter.

## See Also

### [uwc]

[uwc] focuses on following Unicode rules as precisely as possible, taking into
account less-common newlines, counting graphemes as well as codepoints, and
following Unicode word-boundary rules precisely.

This precision currently costs a great deal of performance: counts on my
benchmark file take over a minute.


### [rwc]

cw was originally called [rwc] until I noticed this existed.  It's quite old and
doesn't appear to compile.


### [linecount]

A little library that only does plain newline counting, along with a binary
called `lc`.  Version 0.2 will use the same algorithm as `cw`.


[uwc]: https://crates.io/crates/uwc
[rwc]: https://crates.io/crates/rwc
[linecount]: https://crates.io/crates/linecount