fcoreutils 0.0.6

High-performance GNU coreutils replacement with SIMD and parallelism
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
# Claude Code Instructions: fast-coreutils

## Project Vision

Build `fast-coreutils` - a high-performance Rust implementation of GNU coreutils that is:
- **10-30x faster** than GNU through SIMD and parallelism
- **100% GNU compatible** - passes all GNU test suites
- **Cross-platform** - Linux, macOS, Windows (x86_64 + ARM64)
- **Single binary** - no runtime dependencies
- **Memory safe** - pure Rust, no unsafe except for proven SIMD

---

## Phase 0: Project Setup and Research

### 0.1 Initialize Repository

```bash
# Create the project
cargo new fast-coreutils --lib
cd fast-coreutils
```

Create the following structure:
```
fast-coreutils/
├── Cargo.toml              # Workspace with lib + bins
├── README.md               # Project documentation
├── PROGRESS.md             # Track your progress here (CRITICAL)
├── ARCHITECTURE.md         # Document design decisions
├── BENCHMARKS.md           # Record all benchmark results
├── src/
│   ├── lib.rs              # Core library exports
│   ├── common/             # Shared utilities
│   │   ├── mod.rs
│   │   ├── io.rs           # I/O utilities (mmap, buffered read)
│   │   ├── simd.rs         # SIMD abstractions
│   │   ├── parallel.rs     # Rayon helpers
│   │   └── args.rs         # CLI argument parsing patterns
│   ├── wc/                 # Each tool in its own module
│   │   ├── mod.rs          # Public API
│   │   ├── core.rs         # Core algorithm
│   │   └── tests.rs        # Unit tests
│   ├── cut/
│   ├── base64/
│   ├── sha256sum/
│   ├── sort/
│   ├── tr/
│   ├── uniq/
│   ├── b2sum/
│   ├── tac/
│   └── md5sum/
├── src/bin/                # Binary entry points
│   ├── fwc.rs              # Prefixed to avoid conflicts
│   ├── fcut.rs
│   ├── fbase64.rs
│   ├── fsha256sum.rs
│   ├── fsort.rs
│   ├── ftr.rs
│   ├── funiq.rs
│   ├── fb2sum.rs
│   ├── ftac.rs
│   └── fmd5sum.rs
├── benches/
│   ├── wc_benchmark.rs     # Criterion benchmarks per tool
│   ├── comparison.rs       # vs GNU comparison
│   └── fixtures/           # Test data for benchmarks
├── tests/
│   ├── compat/             # GNU compatibility tests
│   │   ├── wc_compat.rs
│   │   └── ...
│   ├── integration/        # End-to-end tests
│   └── fixtures/           # Test files
├── .github/
│   └── workflows/
│       ├── test.yml        # Test on every push
│       ├── pre-merge.yml   # Build all platforms before merge
│       └── release.yml     # Publish to crates.io + GitHub releases
└── scripts/
    ├── gnu_compat_test.sh  # Run GNU test suite
    └── benchmark.sh        # Compare against GNU
```

### 0.2 Core Dependencies

```toml
[package]
name = "fast-coreutils"
version = "0.1.0"
edition = "2024"
license = "MIT"
description = "High-performance GNU coreutils replacement"
repository = "https://github.com/AiBrush/fast-coreutils"
keywords = ["coreutils", "cli", "performance", "simd"]
categories = ["command-line-utilities", "filesystem"]

[lib]
name = "fast_coreutils"
path = "src/lib.rs"

[[bin]]
name = "fwc"
path = "src/bin/fwc.rs"

# ... repeat for each tool

[dependencies]
# CLI
clap = { version = "4", features = ["derive", "cargo"] }

# SIMD byte operations (auto-detects AVX2/NEON)
memchr = "2"

# Memory-mapped I/O
memmap2 = "0.9"

# Parallelism
rayon = "1"

# Hashing (with hardware acceleration)
sha2 = "0.10"           # SHA-256 with SHA-NI
blake2 = "0.10"         # BLAKE2
md-5 = "0.10"           # MD5

# Base64 (SIMD-accelerated)
base64-simd = "0.8"

# Error handling
thiserror = "1"
anyhow = "1"

[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }
proptest = "1"          # Property-based testing
tempfile = "3"

[profile.release]
lto = "fat"
codegen-units = 1
panic = "abort"
strip = true

[profile.bench]
inherits = "release"
debug = true
```

### 0.3 Research Phase (CRITICAL)

Before writing any implementation, research deeply:

1. **Study GNU source code**
   ```bash
   git clone https://github.com/coreutils/coreutils
   # Read the implementation of each tool
   # Understand ALL flags and edge cases
   # Note any undocumented behaviors
   ```

2. **Study GNU test suite**
   ```bash
   cd coreutils/tests
   # Each tool has tests in tests/<toolname>/
   # Understand what behaviors are tested
   # Document edge cases in ARCHITECTURE.md
   ```

3. **Study existing Rust implementations**
   - uutils/coreutils - for compatibility approach
   - gnu-sort crate - for sort optimizations
   - NeoSmart tac - for SIMD tac

4. **Research SIMD techniques**
   - memchr crate internals
   - Daniel Lemire's blog on SIMD
   - Wojciech Muła's SIMD techniques

**Document all findings in ARCHITECTURE.md before coding.**

---

## Phase 1: Foundation (Week 1)

### 1.1 Common Utilities

Build the shared foundation that all tools will use:

```rust
// src/common/io.rs
// Research the best approach for each:
// - When to use mmap vs buffered read?
// - How to handle stdin vs files?
// - How to handle very large files (>RAM)?
// - How to handle binary vs text mode on Windows?
```

Think deeply about:
- **Memory mapping**: When is it faster? When does it hurt?
- **Buffer sizes**: What's optimal for different file sizes?
- **Cross-platform**: Windows line endings, path handling
- **Error handling**: What errors can occur? How to report them GNU-compatibly?

### 1.2 SIMD Abstraction Layer

```rust
// src/common/simd.rs
// Create abstractions that work across:
// - x86_64 with AVX2
// - x86_64 with SSE2 only (older CPUs)
// - ARM64 with NEON
// - Fallback for unknown platforms
```

Research questions:
- How does memchr handle runtime detection?
- Should we use std::simd (unstable) or explicit intrinsics?
- How to benchmark SIMD vs scalar to prove speedup?

---

## Phase 2: Implement Tools (Priority Order)

### Tool 1: wc (Word Count)

**This is your proof of concept. Get it perfect.**

#### Research
- Read GNU wc source completely
- Run `wc --help` and document EVERY flag
- Study fastlwc for SIMD approach
- Study how GNU handles multibyte characters (-m flag)

#### Flags to implement (ALL of them)
```
-c, --bytes            print the byte counts
-m, --chars            print the character counts
-l, --lines            print the newline counts
-L, --max-line-length  print the maximum display width
-w, --words            print the word counts
    --files0-from=F    read input from files in NUL-terminated list
    --total=WHEN       when to print a line with total counts
```

#### Algorithm research
- How does `-w` (word count) work exactly? What counts as a word?
- How does `-m` (char count) differ from `-c` (byte count) for UTF-8?
- What about `-L` (max line length) with wide characters?

#### Performance targets
- Line counting: 30x faster (proven achievable)
- Word counting: 10x faster
- Byte counting: Should be near-instant (just stat the file)

#### Testing strategy
```bash
# Generate test files
dd if=/dev/urandom bs=1M count=100 of=test_100mb.bin
python -c "print('hello world\\n' * 1000000)" > test_text.txt

# Run GNU tests
cd coreutils/tests && make check TESTS=wc

# Our compatibility tests
diff <(gnu-wc file) <(fwc file)
```

### Tool 2: cut (Field Extraction)

#### Research
- Delimiter handling (-d)
- Field selection (-f1,3-5)
- Byte selection (-b)
- Character selection (-c)
- Complement (--complement)

#### Edge cases to handle
- What if delimiter doesn't exist in line?
- What about empty fields?
- How does -s (suppress lines without delimiter) work?

### Tool 3: base64

#### Research
- Use base64-simd crate (don't reinvent)
- Wrap with CLI that matches GNU exactly
- Handle -d (decode), -w (wrap), -i (ignore garbage)

### Tool 4: sha256sum

#### Research
- SHA-NI hardware detection
- Multiple file handling
- Check mode (-c)
- Binary vs text mode (-b, -t)

### Tool 5: sort

**This is the most complex. Allocate extra time.**

#### Research deeply
- Numeric sort (-n)
- Human numeric sort (-h)
- Version sort (-V)
- Field/key sorting (-k)
- Stable sort (-s)
- Unique (-u)
- Parallel sort (--parallel)
- External sort for huge files (-T)

#### Algorithm choices
- Radix sort for numeric data
- Parallel merge sort for general data
- External merge sort for files > RAM

### Tool 6: tr (Translate)

#### Research
- Character class handling ([:alpha:], [:digit:])
- Range expansion (a-z)
- Delete mode (-d)
- Squeeze mode (-s)
- Complement (-c)

### Tool 7: uniq

#### Research
- Case insensitive (-i)
- Count (-c)
- Duplicate only (-d)
- Unique only (-u)
- Field skipping (-f)
- Character skipping (-s)

### Tool 8: b2sum

#### Research
- BLAKE2b vs BLAKE2s
- Length option (-l)
- Use blake2 crate

### Tool 9: tac

#### Research
- Study NeoSmart's SIMD tac
- Separator option (-s)
- Before flag (-b)

### Tool 10: md5sum

#### Research
- Same structure as sha256sum
- Binary/text mode handling

---

## Phase 3: Testing Strategy

### 3.1 Unit Tests

Every function must have unit tests:
```rust
#[cfg(test)]
mod tests {
    use super::*;
    
    #[test]
    fn test_count_lines_empty() { ... }
    
    #[test]
    fn test_count_lines_no_trailing_newline() { ... }
    
    #[test]
    fn test_count_lines_binary_data() { ... }
}
```

### 3.2 Property-Based Tests

Use proptest for fuzzing:
```rust
proptest! {
    #[test]
    fn wc_matches_gnu(input: Vec<u8>) {
        let our_result = fwc::count(&input);
        let gnu_result = run_gnu_wc(&input);
        prop_assert_eq!(our_result, gnu_result);
    }
}
```

### 3.3 GNU Compatibility Tests

Create a script that runs GNU test suite against our binaries:
```bash
#!/bin/bash
# scripts/gnu_compat_test.sh

# Clone GNU coreutils if not present
if [ ! -d "gnu-coreutils" ]; then
    git clone https://github.com/coreutils/coreutils gnu-coreutils
fi

# Build our tools
cargo build --release

# Run tests with our binaries
export PATH="$(pwd)/target/release:$PATH"
cd gnu-coreutils/tests
make check TESTS=wc
make check TESTS=cut
# ... etc
```

### 3.4 Performance Tests

Every tool must have benchmarks proving 10x+ speedup:
```rust
// benches/wc_benchmark.rs
use criterion::{criterion_group, criterion_main, Criterion};

fn benchmark_wc(c: &mut Criterion) {
    let data = std::fs::read("fixtures/100mb.txt").unwrap();
    
    let mut group = c.benchmark_group("wc_lines");
    
    group.bench_function("fast-coreutils", |b| {
        b.iter(|| fast_coreutils::wc::count_lines(&data))
    });
    
    // Compare with GNU (run external process)
    group.bench_function("gnu", |b| {
        b.iter(|| {
            std::process::Command::new("wc")
                .arg("-l")
                .arg("fixtures/100mb.txt")
                .output()
        })
    });
    
    group.finish();
}
```

---

## Phase 4: CI/CD

### 4.1 Test Workflow

```yaml
# .github/workflows/test.yml
name: Test

on:
  push:
    branches: [main]
  pull_request:
  merge_group:

concurrency:
  group: test-${{ github.head_ref || github.ref }}
  cancel-in-progress: true

jobs:
  test-rust:
    name: Test (Rust)
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
    steps:
      - uses: actions/checkout@v6
      - uses: dtolnay/rust-toolchain@stable
      - uses: Swatinem/rust-cache@v2
      - name: Run tests
        run: cargo test --release
      - name: Run clippy
        run: cargo clippy -- -D warnings

  gnu-compat:
    name: GNU Compatibility
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: dtolnay/rust-toolchain@stable
      - name: Build
        run: cargo build --release
      - name: Clone GNU coreutils
        run: git clone --depth 1 https://github.com/coreutils/coreutils
      - name: Run GNU tests
        run: ./scripts/gnu_compat_test.sh

  benchmarks:
    name: Benchmarks
    needs: [test-rust]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: dtolnay/rust-toolchain@stable
      - name: Run benchmarks
        run: cargo bench --bench comparison
      - name: Upload results
        uses: actions/upload-artifact@v6
        with:
          name: benchmark-results
          path: target/criterion
```

### 4.2 Pre-merge Workflow

Build for all platforms before allowing merge.

### 4.3 Release Workflow

- Automatic versioning from tags
- Build binaries for all platforms
- Create GitHub release with binaries
- Publish to crates.io

---

## Phase 5: Documentation

### README.md

Must include:
- Performance comparison table
- Installation instructions (cargo install, binaries)
- Usage examples
- Compatibility notes
- Link to benchmarks

### ARCHITECTURE.md

Document:
- Why we made each design decision
- SIMD techniques used
- Caching strategies
- Platform-specific considerations

### BENCHMARKS.md

Continuously updated with:
- Performance results for each tool
- Comparison methodology
- Hardware used for benchmarks

---

## Progress Tracking (CRITICAL)

### PROGRESS.md

Maintain this file with:
```markdown
# fast-coreutils Progress

## Current Status: Phase X - [description]

## Completed
- [ ] wc: lines, words, bytes, chars, max-line-length
- [ ] wc: GNU test suite passing
- [ ] wc: 30x speedup verified

## In Progress
- [ ] cut: implementing -f flag

## Blocked
- [ ] sort: need to research external sort algorithm

## Benchmarks
| Tool | vs GNU | Target | Status |
|------|--------|--------|--------|
| wc   | 32x    | 10x    ||
| cut  | 8x     | 10x    | 🔄     |

## Key Findings
- memchr 2.7+ has ARM NEON support
- Windows requires different path handling for...

## Questions to Resolve
- How does GNU handle invalid UTF-8 in -m mode?
```

### Memory Updates

After each significant milestone, update your memory with:
- Key architectural decisions
- Performance findings
- Compatibility gotchas
- What worked and what didn't

---

## Creative Problem Solving Guidelines

### Think Beyond the Obvious

1. **Don't just copy GNU's algorithm** - They optimized for 1990s hardware. Modern CPUs have SIMD, multiple cores, huge caches.

2. **Question everything** - Why does GNU do X? Is there a faster way that produces identical output?

3. **Measure, don't assume** - Before implementing an optimization, benchmark to prove it helps.

4. **Learn from failures** - If something is slower than expected, investigate why. Document findings.

### Research Sources

- Daniel Lemire's blog (lemire.me) - SIMD techniques
- Wojciech Muła's site (0x80.pl) - SIMD algorithms
- Rust Performance Book
- BurntSushi's blog (burntsushi.net) - memchr, ripgrep insights

### When Stuck

1. Read the GNU source code - understand their solution first
2. Search for academic papers on the algorithm
3. Check if there's a Rust crate that solves part of the problem
4. Benchmark different approaches, don't guess

---

## Success Criteria

Before declaring a tool "complete":

1. ✅ All GNU flags implemented
2. ✅ Passes 100% of GNU test suite
3. ✅ Works on Linux, macOS, Windows
4. ✅ Works on x86_64 and ARM64
5. ✅ At least 10x faster than GNU (proven with benchmarks)
6. ✅ No unsafe code (except SIMD intrinsics if needed)
7. ✅ Comprehensive unit tests
8. ✅ Documentation complete
9. ✅ Clippy passes with no warnings
10. ✅ CI/CD builds and tests passing

---

## Final Notes

This is an ambitious project. The key to success is:

1. **Deep research before coding** - Understand the problem completely
2. **Incremental progress** - Get one tool perfect before moving to the next
3. **Continuous benchmarking** - Prove every optimization works
4. **Rigorous testing** - 100% GNU compatibility is non-negotiable
5. **Document everything** - Future you will thank present you

Start with `wc`. Get it absolutely perfect. Use that as the template for all other tools.

Good luck! 🚀