cleanup-history 0.2.0

Filters my `.bash_history` through a set of regexes, deduplicates, and sorts it by most recently used.

cleanup-history-rs

Based on https://github.com/naggie/dotfiles/blob/master/scripts/cleanup-history
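
The filter, deduplicate, and sort steps can be sketched roughly as follows. This is a minimal illustration, not the crate's actual code: `Entry`, `is_noise`, and `cleanup` are hypothetical names, the prefix checks in `is_noise` stand in for the real regexes, and the oldest-first output order (so the most recently used command ends up last in the file) is an assumption.

```rust
use std::collections::HashMap;

/// One history entry: a Unix timestamp and the command text.
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    timestamp: u64,
    command: String,
}

/// Hypothetical filter. The real tool uses a set of regexes;
/// these simple checks only stand in for them.
fn is_noise(command: &str) -> bool {
    let trimmed = command.trim();
    trimmed.is_empty() || trimmed == "ls" || trimmed.starts_with("cd ")
}

/// Filter out noise, keep one copy of each command with its most
/// recent timestamp, and sort oldest-first (assumed order).
fn cleanup(entries: Vec<Entry>) -> Vec<Entry> {
    let mut latest: HashMap<String, u64> = HashMap::new();
    for Entry { timestamp, command } in entries {
        if is_noise(&command) {
            continue;
        }
        // Remember the newest timestamp seen for this command.
        let seen = latest.entry(command).or_insert(0);
        *seen = (*seen).max(timestamp);
    }
    let mut result: Vec<Entry> = latest
        .into_iter()
        .map(|(command, timestamp)| Entry { timestamp, command })
        .collect();
    // Break timestamp ties by command text so the output is deterministic.
    result.sort_by(|a, b| {
        a.timestamp
            .cmp(&b.timestamp)
            .then_with(|| a.command.cmp(&b.command))
    });
    result
}

fn main() {
    let entries = vec![
        Entry { timestamp: 5, command: "make".to_string() },
        Entry { timestamp: 9, command: "ls".to_string() },
        Entry { timestamp: 7, command: "git status".to_string() },
        Entry { timestamp: 8, command: "make".to_string() },
    ];
    for e in cleanup(entries) {
        println!("#{}\n{}", e.timestamp, e.command);
    }
}
```

Keying the map on the command text makes deduplication a single pass, and keeping only the largest timestamp per command is what makes the final sort reflect the most recent use.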

Notes on .bash_history

Format:

#1593575811
echo each command has a timestamp immediately before it
#2
#1593575811
echo 'after multiple timestamp lines, `history` will show the timestamp 1593575811'
#1593575811
#3
echo after multiple timestamp lines, this will show the timestamp 3
#1
echo 'when you run `history` this will show up with a timestamp long ago but still at the end of the list'
#1593575811
echo this will have the same timestamp as others above, duplicates don\'t matter
#1593575812
#1593575813
#1593575814
#1593575815
#1593575816
#1593575817
#1593575818
#1593575819
#1593575820
#1593575821
echo 'once you `history -w` all these extra timestamps will get removed'
#1593576854
for ((i=0;i<5;i++)); do echo $i; done
#1593576854
echo ^^ that was written on multiple lines
#1593576874
echo 'foo
bar'
#1593576874
echo ^^ that was also written on multiple lines, cmdhist=on, lithist=off
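
A reader for this format might look roughly like the sketch below. It is an illustration under stated assumptions rather than the crate's parser: `parse_timestamp` and `parse_history` are hypothetical names, the last timestamp line before a command is taken as that command's timestamp (as in the example above), consecutive non-timestamp lines are assumed to belong to one multi-line command, and a command with no preceding timestamp is assigned 0.

```rust
/// Returns the timestamp if the line starts with `#` followed by digits.
/// Only the leading run of digits is used, so `#1234 bar` yields 1234.
fn parse_timestamp(line: &str) -> Option<u64> {
    let rest = line.strip_prefix('#')?;
    let digits: String = rest.chars().take_while(|c| c.is_ascii_digit()).collect();
    // An empty digit string fails to parse, so `# comment` is not a timestamp.
    digits.parse().ok()
}

/// Splits history file text into (timestamp, command) pairs.
fn parse_history(text: &str) -> Vec<(u64, String)> {
    let mut entries = Vec::new();
    // Most recent timestamp line seen since the last command ended.
    let mut pending: Option<u64> = None;
    // Command currently being assembled, possibly across several lines.
    let mut current: Option<(u64, String)> = None;

    for line in text.lines() {
        if let Some(ts) = parse_timestamp(line) {
            // A timestamp line ends whatever command came before it.
            if let Some(entry) = current.take() {
                entries.push(entry);
            }
            // Later timestamp lines overwrite earlier ones.
            pending = Some(ts);
        } else if let Some((_, command)) = current.as_mut() {
            // Continuation line of a multi-line command.
            command.push('\n');
            command.push_str(line);
        } else {
            // First line of a new command; 0 is an assumed default.
            current = Some((pending.take().unwrap_or(0), line.to_string()));
        }
    }
    if let Some(entry) = current.take() {
        entries.push(entry);
    }
    entries
}

fn main() {
    let text = "#2\n#1593575811\necho a\n#1593575811\n#3\necho b\n#1593576874\necho 'foo\nbar'\n";
    for (timestamp, command) in parse_history(text) {
        println!("{} => {:?}", timestamp, command);
    }
}
```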

Gotchas

If a line starts with #\d+, it will be interpreted as a timestamp when the history file is read back, even when that line is part of a multi-line command.

$ export HISTFILE=./foo
$ history -c
$ echo 'this
#1234
that'
$ history -w
$ cat foo
#1594044806
echo 'this
#1234
that'
#1594044814
history -w
$ history -c
$ history -r
$ history
    1  2020-07-06 08:16.14 | history -r
    2  2020-07-06 08:15.15 | echo 'this
    3  1969-12-31 17:20.34 | that'
    4  2020-07-06 08:15.25 | history -w
    5  2020-07-06 08:16.16 | history
$ history -c
$ echo 'foo
#1234 bar
baz'
$ history -w
$ history # correct in memory
    1  2020-07-06 08:24.30 | echo 'foo
#1234 bar
baz'
    2  2020-07-06 08:24.38 | history -w
    3  2020-07-06 08:24.41 | history
$ cat foo
#1594045470
echo 'foo
#1234 bar
baz'
#1594045478
history -w
$ history -c # clear in-memory history
$ history -r # reread from file
$ history # now incorrectly interprets `#1234 bar` as a timestamp
    1  2020-07-06 08:19.49 | history -r
    2  2020-07-06 08:19.09 | echo 'foo
    3  1969-12-31 17:20.34 | baz'
    4  2020-07-06 08:19.31 | history -w
    5  2020-07-06 08:19.51 | history
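
One way to spot commands affected by this gotcha before writing them out is sketched below. The rule it encodes (a line counts as a timestamp when it starts with `#` immediately followed by a digit) is an assumption based on the behaviour shown above, and `looks_like_timestamp` and `hazardous_lines` are hypothetical helper names.

```rust
/// Assumed rule: a line is read back as a timestamp when it starts
/// with `#` immediately followed by an ASCII digit.
fn looks_like_timestamp(line: &str) -> bool {
    let mut chars = line.chars();
    chars.next() == Some('#') && chars.next().map_or(false, |c| c.is_ascii_digit())
}

/// Returns the lines of a (possibly multi-line) command that would be
/// misread as timestamps after a round trip through the history file.
fn hazardous_lines(command: &str) -> Vec<&str> {
    command
        .lines()
        .filter(|line| looks_like_timestamp(line))
        .collect()
}

fn main() {
    let command = "echo 'foo\n#1234 bar\nbaz'";
    for line in hazardous_lines(command) {
        println!("would be read back as a timestamp: {}", line);
    }
}
```

A cleanup tool could use a check like this to skip or flag such commands instead of silently splitting them.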

Benchmarks

The deduplicated line counts differ (73149 lines from the Python script versus 64638 from the Rust binary) because the two use slightly different regexes ¯\_(ツ)_/¯. I think the results are close enough to be informative.

$ wc -l bash_history.bak
86636 bash_history.bak
$ hyperfine --warmup=5 --prepare='cp bash_history.bak bash_history_python' \
    --export-markdown=bash-history-python.txt \
    --time-unit=millisecond \
    'python3 cleanup-history.py bash_history_python'
$ wc -l bash_history_python
73149 bash_history_python
$ hyperfine --warmup=5 --prepare='cp bash_history.bak bash_history_rust' \
    --export-markdown=bash-history-rust.txt \
    --time-unit=millisecond \
    'cleanup-history-rs/target/release/cleanup-history bash_history_rust'
$ wc -l bash_history_rust
64638 bash_history_rust

| Command | Mean [ms] | Min [ms] | Max [ms] |
|:---|---:|---:|---:|
| `python3 cleanup-history.py bash_history_python` | 2069.9 ± 112.4 | 1935.1 | 2356.4 |
| `cleanup-history-rs/target/release/cleanup-history bash_history_rust` | 653.5 ± 22.1 | 631.2 | 698.9 |