pcap-toolkit 0.2.0

# Code Review: pcap-toolkit

## Architecture Overview

Well-structured streaming tool with clean separation of concerns: PCAP I/O (`pcap/`),
filtering (`filter.rs`, `bpf.rs`), two-pass sorting (`sort/`), multi-format export
(`export/`), and replay (`replay/`). The `Filter` → `FilterRule` trait-based composition
via `impl_filter_body!` is elegant. The two-thread export pipeline (bounded
`sync_channel` at `export/mod.rs:208`) and parallel first-pass via Rayon are the right
tools for those jobs.

---

## Correctness Issues

### ✅ 1. Sidecar file leak on partial first-pass failure — `sort/mod.rs:97–105`

```rust
let results: Vec<Result<...>> = inputs.par_iter().map(|p| first_pass(...)).collect();

let mut headers_and_stores = Vec::with_capacity(inputs.len());
for r in results {
    headers_and_stores.push(r?);  // ← exits early on first error
}
```

If N first passes succeed and one fails, the `?` short-circuits the loop. The sidecar
`.idx` files from the successful passes are never cleaned up. The existing cleanup block
at lines 112–122 only covers the link-type mismatch case. Fix: collect sidecar paths
before iterating results, clean them up in an error branch.

### ✅ 2. TCP data-offset < 5 not rejected in truncation — `transform.rs:543`

```rust
PROTO_TCP => {
    if data.len() < ts + 20 { return (origlen, false); }
    ((data[ts + 12] >> 4) * 4) as usize
}
```

A corrupt data-offset field < 5 (meaning < 20 bytes) produces a `transport_hdr_len`
smaller than the minimum TCP header, making `payload_start` point into the header
itself. The bounds check guards against total buffer underflow but not against invalid
header lengths. Add: `if transport_hdr_len < 20 { return (origlen, false); }`.

### ✅ 3. `src_port`/`dst_port` are `Some(0)` for ICMP in export records — `export/mod.rs:343–355`

`FlowKey::new` always fills `src_port = 0` / `dst_port = 0` for non-TCP/UDP
(see `filter.rs:402`). The export record ends up with `src_port: Some(0)` for ICMP
packets, which is misleading. Downstream consumers need out-of-band protocol knowledge
to know whether port fields are meaningful. Emit `None` when `protocol ∉ {6, 17}`:

```rust
let src_port = if matches!(key.protocol, 6 | 17) { Some(key.src_port) } else { None };
let dst_port = if matches!(key.protocol, 6 | 17) { Some(key.dst_port) } else { None };
```

### ✅ 4. Bare integer epoch silently treated as milliseconds — `filter.rs:300–302`

```rust
if let Ok(ms) = s.parse::<u64>() {
    return Ok(ms.saturating_mul(1_000_000));
}
```

A user who passes a Unix epoch in **seconds** (the dominant convention) gets a timestamp
1000× too far in the future with no error or warning. The behaviour is documented, but
it is surprising enough to warrant a louder note in the CLI help text and the error
message when `--from`/`--to` filters produce zero matches.

---

## Performance

### ✅ 5. `flow_ids` stored as `Vec`, O(n) scan per packet — `filter.rs:601`

```rust
if !b.flow_ids().contains(&id) {
    return false;
}
```

`resolve_min_flow_filter` builds a `HashSet<u64>` internally but then converts it to a
`Vec` at `main.rs:648`. For large allowlists (common with `--min-flow-packets`), every
packet pays a linear scan. Keep the qualifying set as a `HashSet<u64>` in `Filter`, or
at minimum sort the vec and use binary search.

### ✅ 6. `Pps` timing overflow at high rates — `replay/mod.rs:127`

```rust
let target_ns = sent_count * 1_000_000_000 / pps;
```

`sent_count * 1_000_000_000` overflows `u64` after ~18.4 billion packets. At 1 Mpps
that is roughly 5 hours of replay, after which the computed target wraps to zero and
pacing breaks silently. Use `u128` for the intermediate product or apply saturating
arithmetic.

---

## Code Quality

### ✅ 7. Three nearly-identical filter builders — `main.rs:177–606`

`build_filter`, `build_export_filter`, and `build_replay_filter` are ~120 lines of
copy-paste that differ only in their argument struct type. A future filter field added
to one will silently be missing from the others (this has nearly happened with
`min_len`/`max_len`). Unify with a shared trait or a free function taking the common
fields as parameters.

### ✅ 8. `--format` silently applies only to the first `--output` — `main.rs:339–347`

```rust
let format = if i == 0 {
    args.format.as_deref().and_then(ExportFormat::parse)
        .or_else(|| ExportFormat::from_extension(path))
} else {
    ExportFormat::from_extension(path)
};
```

If a user passes `--format parquet --output a.pq --output b.pq`, only `a.pq` receives
the explicit format. The behaviour is labelled "backward compat" but is not documented
in the CLI help. Add a note to the `--format` help text.

---

## Missing Tests

### 9. No test for `--min-flow-packets`

`resolve_min_flow_filter` (`main.rs:616`) is the most recently added complex feature.
It has no direct test. The pre-scan + intersection logic is non-trivial and deserves an
integration test that verifies flows with fewer than N packets are excluded from the
output.

### 10. No test for time-slice boundary and empty-slice cases

`SlicedWriter` rotates output files on time boundaries. There are no tests verifying
that a packet exactly on a boundary goes to the expected file, or that an empty time
window does not produce a zero-byte output file.

### 11. No integration tests for compound BPF expressions

`bpf.rs` implements a pure-Rust libpcap-compatible parser. Tests for compound
expressions (`tcp and port 443`, `not udp`, nested `or`/`and`) would catch regressions
in precedence or short-circuit evaluation.

---

## Security

### 12. `unsafe` in `open_raw_socket` is correct — `replay/mod.rs:222–235`

The `unsafe` block uses `SockAddr::try_init` to write a `SockAddrLl` into a zeroed
storage buffer. The struct is `#[repr(C)]`, all fields are initialised, and `*len` is
set to `size_of::<SockAddrLl>()`. This is the canonical pattern for AF_PACKET sockets
without a `libc` dependency. No issues.

### 13. Output paths accept arbitrary destinations

If `--output` can be influenced by untrusted input, packets could be written to
arbitrary paths. Acceptable in the current trusted-operator context, but worth noting
if the tool is ever wrapped in an API or web frontend.

---

## Summary Table

| Severity | Finding | Location |
|---|---|---|
| Bug | Sidecar `.idx` file leak on first-pass error | `sort/mod.rs:97–105` |
| Bug | TCP data-offset < 5 not rejected before truncation | `transform.rs:543` |
| Misleading | `src_port`/`dst_port` `Some(0)` for ICMP in export | `export/mod.rs:343` |
| Misleading | Bare integer epoch silently treated as milliseconds | `filter.rs:300` |
| Performance | `flow_ids` Vec gives O(n) lookup per packet | `filter.rs:601` |
| Performance | `Pps` timing overflows u64 after ~5 h at 1 Mpps | `replay/mod.rs:127` |
| Maintainability | Three identical filter builders (150 lines duplicated) | `main.rs:177–606` |
| Missing test | `--min-flow-packets` end-to-end behaviour | — |
| Missing test | Time-slice boundary and empty-window cases | — |
| Missing test | Compound BPF expression parsing and evaluation | — |

The codebase is well-structured and clearly written. Items 1 and 2 (sidecar leak and
data-offset check) are the most important to address before a 1.0 release.