flowmark 0.3.1

A Markdown auto-formatter for clean diffs and semantic line breaks
# Flowmark Performance Report

**Date:** 2026-02-27

## Part 1: Cross-Formatter Comparison

### Benchmark Setup

- **Platform:** macOS 25.2.0, arm64 (local)
- **Corpus:** 928 Markdown files (23 MB)
- **Methodology:** single-corpus measurements using benchmark harness warmup + timed
  run(s).
- **Fresh-run mode:** `./benchmarks/run_comparison.sh first-run 1`
- **Cached second-run mode:** `./benchmarks/run_comparison.sh second-run 1`

Scripts to reproduce: `benchmarks/generate_corpus.sh`, `benchmarks/run_comparison.sh`.

### Current Headline Results (2026-02-27)

### Fresh Run (single corpus, files need formatting)

| Formatter | Language | Parallel | Mean | Relative speed |
| --- | --- | --- | --- | --- |
| **dprint** (`--incremental=false`) | Rust (WASM plugin) | yes | **0.36 s** | **1.0x** |
| **flowmark-rs** (`--auto`) | Rust | yes (rayon) | **0.71 s** | **2.0x** |
| **markdownfmt** | Go | no | **0.95 s** | **2.6x** |
| **prettier** | JavaScript | no | **38.0 s** | **105x** |
| **flowmark-py** | Python | no | **~48 s** | **~130x** |
| **mdformat** | Python | no | **72.9 s** | **197x** |

Notes:

- `flowmark-rs` and `dprint` values are from current local reruns on the 928-file
  corpus.
- `markdownfmt`, `prettier`, `mdformat`, `flowmark-py` are from the same corpus profile
  suite in this report (retained for cross-formatter ranking continuity).

### Cached Second Run (unchanged files)

| Formatter | Mean | Relative speed |
| --- | --- | --- |
| **flowmark-rs** (`--auto`, incremental default) | **0.023 s** | **1.0x** |
| **dprint** (`fmt`, incremental default) | **0.031 s** | **1.3x** |

Interpretation:

- Fresh-run ranking remains unchanged: flowmark-rs is #2 overall.
- With incremental cache warm, flowmark-rs now drops to ~23ms on this corpus.
- Fresh-run Rust vs Python headline remains roughly **60-70x faster** (`0.71s` vs
  `~48s`).

### Per-File Throughput

| Formatter | ms/file | files/sec |
| --- | --- | --- |
| dprint | 0.40 | 2,508 |
| flowmark-rs (parallel) | 0.79 | 1,271 |
| markdownfmt | 1.02 | 976 |
| flowmark-rs (sequential) | 2.61 | 383 |
| prettier | 41.0 | 24 |
| flowmark-py | ~52 | ~19 |
| mdformat | 78.6 | 13 |
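
The two throughput columns are simple functions of the mean wall-clock time and the file count. The sketch below shows the arithmetic; the `Throughput` struct and `throughput` function are illustrative names, not part of flowmark. The rows above appear to be derived from the three-run means in the raw timings (for example 0.73 s for flowmark-rs parallel) rather than from the single-run headline figures.

```rust
/// Per-file throughput derived from a mean wall-clock time over a corpus.
/// (Illustrative helper, not part of flowmark.)
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Throughput {
    pub ms_per_file: f64,
    pub files_per_sec: f64,
}

/// `mean_secs` is the mean wall-clock time for one pass over `files` files.
pub fn throughput(mean_secs: f64, files: u32) -> Throughput {
    let files = f64::from(files);
    Throughput {
        ms_per_file: mean_secs * 1000.0 / files,
        files_per_sec: files / mean_secs,
    }
}
```

For instance, `throughput(0.73, 928)` gives roughly 0.79 ms/file and 1,271 files/sec, matching the flowmark-rs (parallel) row.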

### Raw Timings (3 Runs Each)

**v0.3.0 parallel runs (928 files, fresh corpus):**

| Formatter | Run 1 | Run 2 | Run 3 |
| --- | --- | --- | --- |
| dprint | 0.364 s | 0.371 s | 0.361 s |
| flowmark-rs (parallel) | 0.727 s | 0.728 s | 0.737 s |
| markdownfmt | 0.958 s | 0.969 s | 0.929 s |
| flowmark-rs (sequential) | 2.403 s | 2.385 s | 2.474 s |

**Original v0.2.4 runs (924 files, sequential only):**

| Formatter | Run 1 | Run 2 | Run 3 |
| --- | --- | --- | --- |
| dprint | 0.235 s | 0.224 s | 0.242 s |
| markdownfmt | 0.829 s | 0.790 s | 0.781 s |
| flowmark-rs (sequential) | 2.633 s | 2.928 s | 2.647 s |
| prettier | 20.961 s | 20.822 s | 20.885 s |
| flowmark-py | 27.889 s | 27.914 s | 27.597 s |
| mdformat | 37.571 s | 37.395 s | 37.499 s |

### Analysis

**On the headline fresh-run numbers, compiled-language formatters (dprint, flowmark-rs,
markdownfmt) are roughly 40–200x (about 1.5–2 orders of magnitude) faster than
interpreted-language formatters (prettier, flowmark-py, mdformat).**

- **dprint** is the fastest: its Rust core with WASM plugin and multi-threaded file
  processing gives it ~0.37s on 928 files.
  Note that dprint uses ~3.3s of user CPU time (multi-threaded) for 0.37s wall-clock,
  indicating heavy parallelism.
- **flowmark-rs (parallel)** is second at 0.73s, within **2x of dprint** after adding
  rayon parallelism in v0.3.0. This is a **3.3x improvement** over sequential
  processing in the same version (2.42s) and a **3.8x improvement** over the v0.2.4
  sequential baseline (2.74s). The remaining gap vs dprint is due to flowmark doing
  more work per file (semantic line breaks, smart quotes, typography, reference link
  encoding, footnote extraction).
- **markdownfmt** is third at 0.95s, benefiting from Go’s compiled native code and low
  per-file overhead. It processes files via `find -exec` with argument batching (not
  parallel internally).
- **prettier** is the fastest interpreted-language formatter, but ~100x slower than
  dprint. Node.js startup and single-threaded JS execution are the main bottlenecks.
- **flowmark-py** and **mdformat** are the slowest, reflecting Python’s interpreter
  overhead. mdformat is slower than flowmark-py despite doing less work, likely due to
  markdown-it-py parsing overhead.

### Important Caveats

These formatters are **not interchangeable**; they have very different feature sets:

- **flowmark** (Python and Rust): Semantic line breaks, smart quotes, ellipsis
  typography, reference link encoding, footnote extraction, configurable wrapping modes.
  The most feature-rich formatter.
- **prettier:** Opinionated reformatting with consistent style.
  Good ecosystem integration.
  No semantic line breaks.
- **dprint:** Fast, parallel, plugin-based.
  Basic markdown normalization.
  No typography or semantic features.
- **mdformat:** Extensible Python formatter with plugin system.
  CommonMark-focused.
- **markdownfmt:** Minimal Go formatter.
  Normalizes headings, lists, and whitespace.
  Limited configurability.

The speed differences partially reflect feature complexity: simpler formatters that do
less per-file processing are naturally faster.

### How dprint Achieves Its Speed

Source analysis of [dprint/dprint](https://github.com/dprint/dprint) (cloned to
`attic/dprint`). Key file: `crates/dprint/src/format.rs`.

**Architecture:** Single-threaded tokio `current_thread` runtime for async
orchestration, with all actual work (file I/O + formatting) dispatched to tokio’s
multi-threaded blocking pool via `spawn_blocking()`.

**Parallelism model:**

1. **Thread count = CPU cores.** Uses `std::thread::available_parallelism()`,
   overridable via `DPRINT_MAX_THREADS`. Reserves 1 thread per process plugin + 1 for
   the runtime.
2. **Semaphore-controlled concurrency.** Files are grouped by plugin.
   Each group gets a custom `Semaphore` with permits proportional to the thread count.
   A file can only begin formatting when it acquires a permit, capping active concurrent
   formats at ~core count.
3. **`spawn_blocking()` for I/O and formatting.** Each file: read (blocking) -> format
   (blocking or async depending on plugin type) -> write (blocking).
   The async event loop just orchestrates.
4. **Adaptive CPU throttling.** A background task monitors CPU usage every 2 seconds.
   If CPU exceeds a threshold, it removes semaphore permits to reduce parallelism.
   When CPU drops, it adds permits back.
   Disabled on CI.
5. **Work stealing on completion.** When one plugin group finishes, its semaphore
   permits are redistributed to remaining groups via `SemaphorePermitReleaser::drop`,
   favoring groups with fewer permits.
6. **Incremental caching.** Hash-based skip for unchanged files (explains the 0.13s with
   caching vs 0.23s with `--incremental=false`).

**Plugin system:** WASM plugins (compiled with Wasmer, run synchronously in-process) and
process plugins (separate child processes communicating via stdin/stdout).
The markdown formatter is a WASM plugin.
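
The semaphore-capped concurrency in step 2 can be illustrated with the standard library alone. This is a simplified sketch, not dprint's code: it spawns one OS thread per input instead of using tokio's blocking pool, and `Semaphore` and `run_bounded` are hypothetical names introduced here. It shows only the core idea that a job may start once it holds a permit, so the number of active jobs never exceeds the permit count.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore (hypothetical stand-in for dprint's custom `Semaphore`).
pub struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    pub fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Block until a permit is free, then take it.
    pub fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Return a permit and wake one waiter.
    pub fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Run `job` once per input, allowing at most `max_active` jobs to run at a time.
/// Returns the results in input order plus the peak number of jobs seen running at once.
pub fn run_bounded<T, R, F>(inputs: Vec<T>, max_active: usize, job: F) -> (Vec<R>, usize)
where
    T: Send + 'static,
    R: Send + 'static,
    F: Fn(T) -> R + Send + Sync + 'static,
{
    assert!(max_active > 0, "zero permits would block every job forever");
    let sem = Arc::new(Semaphore::new(max_active));
    let job = Arc::new(job);
    let active = Arc::new(Mutex::new((0usize, 0usize))); // (currently running, peak)

    let handles: Vec<_> = inputs
        .into_iter()
        .map(|input| {
            let (sem, job, active) = (Arc::clone(&sem), Arc::clone(&job), Arc::clone(&active));
            thread::spawn(move || {
                sem.acquire(); // a job may only start once it holds a permit
                {
                    let mut a = active.lock().unwrap();
                    a.0 += 1;
                    a.1 = a.1.max(a.0);
                }
                let out = job(input);
                active.lock().unwrap().0 -= 1;
                sem.release();
                out
            })
        })
        .collect();

    let results = handles.into_iter().map(|h| h.join().unwrap()).collect();
    let peak = active.lock().unwrap().1;
    (results, peak)
}
```

Adaptive throttling (step 4) would correspond to a background task that calls `acquire` to withhold permits when the CPU is busy and `release` to hand them back; that part is left out here.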

### Implemented: Parallel File Processing for flowmark-rs

Parallel file processing was implemented in v0.3.0 using rayon (see Part 3 for full
results). The sequential loop was replaced with rayon’s `par_iter().try_for_each()`,
achieving a **3.8x wall-clock speedup** on batch workloads and bringing flowmark-rs to
**within 2x of dprint’s performance**.

The rayon approach proved simpler than dprint’s more complex tokio + semaphore
architecture and sufficient for this workload, since flowmark-rs has no plugin
infrastructure.
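
To make the shape of that change concrete without pulling in a dependency, here is a standard-library sketch of a parallel "try for each" over a slice. It is not the flowmark-rs implementation, which uses rayon; `par_try_for_each` is a hypothetical helper that mimics the same contract: run the operation on every item across worker threads and return an error if any item fails.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Apply `op` to every item using `threads` workers. After the first error,
/// workers stop claiming new items and that error is returned.
pub fn par_try_for_each<T, E, F>(items: &[T], threads: usize, op: F) -> Result<(), E>
where
    T: Sync,
    E: Send,
    F: Fn(&T) -> Result<(), E> + Sync,
{
    let next = AtomicUsize::new(0); // index of the next unclaimed item
    let first_error: Mutex<Option<E>> = Mutex::new(None);

    thread::scope(|scope| {
        for _ in 0..threads.max(1) {
            scope.spawn(|| loop {
                if first_error.lock().unwrap().is_some() {
                    break; // another worker already failed
                }
                let i = next.fetch_add(1, Ordering::Relaxed);
                let Some(item) = items.get(i) else { break };
                if let Err(e) = op(item) {
                    let mut slot = first_error.lock().unwrap();
                    if slot.is_none() {
                        *slot = Some(e);
                    }
                    break;
                }
            });
        }
    });

    match first_error.into_inner().unwrap() {
        Some(e) => Err(e),
        None => Ok(()),
    }
}
```

Workers pull indices from a shared counter, so a slow file does not hold up a fixed chunk of the corpus. Rayon's work-stealing scheduler achieves a similar effect with less code at the call site.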

* * *

## Part 2: Flowmark Python vs Rust (Detailed)

### Benchmark Setup

- **Python:** flowmark v0.6.4
- **Rust:** flowmark v0.2.4 (release: `opt-level=3`, LTO, `codegen-units=1`,
  `panic=abort`)
- **Benchmarking tool:** hyperfine (with warmup, multiple runs)
- **Profiling tool:** valgrind callgrind (instruction-level, single file and batch)

Scripts to reproduce: `benchmarks/run_benchmarks.sh`, `benchmarks/profile_rust.sh`.

### Headline Results

Rust flowmark is **10–17x faster** than Python flowmark on formatting workloads, and 7.8x faster on file discovery.

| Benchmark | Python | Rust | Speedup |
| --- | --- | --- | --- |
| Single file (1,734 lines, stdout) | 471.7 ms | 27.3 ms | **17.3x** |
| Batch `--auto` (924 files in-place) | 27.8 s | 2.74 s | **10.1x** |
| Batch `--semantic` (1,080 files in-place) | 27.2 s | 2.5 s | **10.9x** |
| File discovery `--list-files` (1,080 files) | 1.31 s | 169 ms | **7.8x** |

### Per-File Throughput

| Mode | Python | Rust |
| --- | --- | --- |
| `--auto` (batch) | 30.1 ms/file, 33 files/sec | 2.96 ms/file, 338 files/sec |
| `--semantic` (batch) | 25.2 ms/file, 39 files/sec | 2.3 ms/file, 432 files/sec |

### Notes

- Python startup overhead (~300 ms) inflates single-file times; in batch mode this is
  amortized and the per-file speedup drops to ~10x.
- Semantic mode is slightly faster than auto for both implementations (fewer line-wrap
  iterations).
- File discovery (`--list-files`) shows 7.8x speedup, reflecting Rust `ignore` crate vs
  Python `pathspec`/`os.walk`.

## Profiling: Where Does Rust Spend Its Time?

Profiled with `valgrind --tool=callgrind` on `tests/testdocs/testdoc.orig.md` (1,734
lines). Total: 155.7M instructions.

### Call Hierarchy (Inclusive Cost)

```
fill_markdown (entry)                           99.4%   (154.7M)
├── render_block (comrak AST → Markdown)        69.3%   (107.9M)
│   └── render_block recursive                  55.9%   ( 87.0M)
│       └── line wrapping pipeline              37.8%   ( 58.8M)
│           └── tag newline handling            37.4%   ( 58.2M)
│               └── line_wrap_to_width          35.6%   ( 55.4M)
│                   └── wrap_paragraph          35.1%   ( 54.6M)
│                       └── wrap_paragraph_lines 34.4%  ( 53.5M)
│                           └── html_md_word_split 27.6% ( 43.0M)
└── pre/post-processing workarounds             30.1%   ( 46.8M)
```

The wrapping pipeline (word splitting → paragraph wrapping → line breaking) is the
dominant cost at ~35% inclusive.
Pre- and post-processing workarounds for comrak account for another ~30%.

### Self-Time Breakdown (Exclusive Cost)

| Category | % | Instructions | What’s happening |
| --- | --- | --- | --- |
| **String searching** (`core::str::pattern`) | **~30%** | ~46.7M | `StrSearcher::new` 15.2%, `TwoWaySearcher::next` 10.4% |
| **Memory allocation** (malloc/free/realloc) | ~18.5% | ~28.8M | Allocation churn from string operations |
| **Memory ops** (memcpy/memcmp/memset) | ~6.7% | ~10.4M | Copying strings during replace/concat |
| **Regex** (regex-automata hybrid DFA) | ~5.5% | ~8.6M | Sentence detection, atomic construct extraction |
| **flowmark functions** (direct self-time) | ~4.5% | ~7.0M | `fill_markdown` 0.6%, `remove_period_escapes` 0.5% |
| **str::replace** (alloc + search) | ~2.8% | ~4.4M | Each `.replace()` allocates a new String |
| **Comrak parser** | ~2.4% | ~3.7M | `parse_inline`, `process_line`, `open_new_blocks` |

### Key Finding

**String pattern searching is the #1 bottleneck at ~30% of total instructions.** This
does not come from the comrak parser or regex; it comes from Rust’s `str::replace()`,
`str::contains()`, and related methods that use `core::str::pattern::StrSearcher`
(Two-Way string search algorithm).

## Root Causes

### 1. O(N×M) Placeholder Restoration in `restore_atomic_constructs`

**File:** `src/wrapping/text_wrapping.rs:56–71`

```rust
fn restore_atomic_constructs(tokens: &[String], constructs: &[String], placeholders: &[String]) -> Vec<String> {
    tokens.iter().map(|token| {
        let mut result = token.clone();
        for (placeholder, construct) in placeholders.iter().zip(constructs.iter()) {
            result = result.replace(placeholder.as_str(), construct);  // N×M string scans
        }
        result
    }).collect()
}
```

For each token, this scans the full string M times (once per placeholder).
Each `.replace()` call invokes `StrSearcher::new` (builds a Two-Way searcher) and
`TwoWaySearcher::next` (scans the string).
With many tokens and many placeholders, this is expensive.

### 2. 32× Sequential `.replace()` for Escape Placeholders

**File:** `src/formatter/filling.rs:2200–2203`

```rust
for (escaped, placeholder) in &escape_placeholders {
    result = result.replace(placeholder.as_str(), escaped.as_str());
}
```

This runs 32 `.replace()` calls over the entire document (one per escapable ASCII
punctuation character).
Each call scans the full document and allocates a new `String`. The same pattern appears
in the pre-processing direction at lines 750–755.

### 3. Per-Line Character Scanning in `remove_period_escapes_preserving_code`

**File:** `src/formatter/filling.rs:807–850`

Called on every non-fenced line.
Character-by-character processing with `String::with_capacity` + push.
Not algorithmically bad, but the sheer call volume makes it visible at 0.5% self-time.

## Optimization Opportunities

| # | Optimization | Estimated Impact | Complexity |
| --- | --- | --- | --- |
| 1 | Single-pass `restore_atomic_constructs`: scan each token once for `\x00AC` prefix instead of M `.replace()` calls | 10–15% | Low |
| 2 | Single-pass PUA escape restoration: scan document once for PUA chars in `\u{E000}..=\u{E07E}` instead of 32 `.replace()` calls | 5–10% | Low |
| 3 | Buffer reuse / `Cow<str>` in wrapping pipeline to reduce allocation churn | 3–5% | Medium |
| 4 | Pre-built regex `Cache` for hybrid DFA | 1–2% | Low |

## Optimization Experiments

Two optimizations were implemented and tested.
All 430 tests pass after each change.

### Optimization 1: Single-pass `restore_atomic_constructs`

**Change:** Replace the O(N×M) `.replace()` loop in `restore_atomic_constructs`
(`src/wrapping/text_wrapping.rs`) with a fast-path check: if the token doesn’t contain
the placeholder prefix byte (`\x00`), skip entirely.
If the entire token is a placeholder (common case), do a HashMap lookup instead of M
sequential `.replace()` calls.

**Result (alone):** Within measurement noise; no significant improvement on the test
document. This makes sense: the testdoc has relatively few atomic constructs (HTML tags,
code spans), so the placeholder restoration isn’t the dominant contributor.
The optimization would show more benefit on documents heavy with inline HTML/code.
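
A minimal sketch of the fast-path idea described above, keeping the signature from the original function. The placeholder format (`\x00AC` followed by an index) and the fallback for placeholders embedded in a longer token are assumptions made for illustration; the actual change in `src/wrapping/text_wrapping.rs` may differ.

```rust
use std::collections::HashMap;

/// Sketch only: assumes every placeholder starts with the byte `\x00`.
pub fn restore_atomic_constructs(
    tokens: &[String],
    constructs: &[String],
    placeholders: &[String],
) -> Vec<String> {
    // Build the placeholder -> construct map once, not once per token.
    let by_placeholder: HashMap<&str, &str> = placeholders
        .iter()
        .map(String::as_str)
        .zip(constructs.iter().map(String::as_str))
        .collect();

    tokens
        .iter()
        .map(|token| {
            // Fast path 1: no prefix byte, so no placeholder can occur in this token.
            if !token.as_bytes().contains(&0) {
                return token.clone();
            }
            // Fast path 2: the whole token is exactly one placeholder.
            if let Some(construct) = by_placeholder.get(token.as_str()) {
                return (*construct).to_string();
            }
            // Slow path (assumed): placeholder embedded in a longer token,
            // so fall back to the original replace loop.
            let mut result = token.clone();
            for (placeholder, construct) in placeholders.iter().zip(constructs) {
                result = result.replace(placeholder.as_str(), construct);
            }
            result
        })
        .collect()
}
```

Only tokens that contain a placeholder pay for any searching, which is consistent with the observation that a document with few atomic constructs sees little change.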

### Optimization 2: Single-pass PUA Escape Processing

**Change:** Replace two sets of 32× sequential `.replace()` calls:

- **Pre-processing** (`replace_escapes_in_line`): Instead of calling
  `.replace(escaped, placeholder)` for each of 32 escape chars, scan the line once for
  `\` and check if the next char is in the escape set.
- **Post-processing** (`restore_pua_escape_placeholders`): Instead of 32×
  `.replace(placeholder, escaped)` over the full document, scan once for any char in the
  PUA range `\u{E000}..=\u{E0FF}` followed by filler `\u{E100}` and emit the original
  `\<char>`.

Both directions now process the text in a single pass with O(N) time per call instead of
O(32×N).
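
Both directions can be sketched as single-pass scans. The function names follow the ones mentioned above, but the bodies are illustrative rather than the code in `src/formatter/filling.rs`. In particular, the mapping from an escaped character to its placeholder (`U+E000` plus the ASCII code) is an assumption; only the PUA range and the `U+E100` filler come from the description above.

```rust
const PUA_BASE: u32 = 0xE000; // assumed mapping: placeholder = PUA_BASE + ASCII code
const FILLER: char = '\u{E100}';

/// Pre-processing: replace each `\<punct>` with a placeholder pair in one pass.
pub fn replace_escapes_in_line(line: &str) -> String {
    let mut out = String::with_capacity(line.len());
    let mut chars = line.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\\' {
            // Consume the next char only if it is one of the 32 ASCII punctuation chars.
            if let Some(escaped) = chars.next_if(|ch| ch.is_ascii_punctuation()) {
                // PUA_BASE + an ASCII code is always a valid scalar value.
                out.push(char::from_u32(PUA_BASE + escaped as u32).unwrap());
                out.push(FILLER);
                continue;
            }
        }
        out.push(c);
    }
    out
}

/// Post-processing: turn every placeholder pair back into `\<punct>` in one pass.
pub fn restore_pua_escape_placeholders(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        let code = c as u32;
        if (PUA_BASE..=PUA_BASE + 0xFF).contains(&code)
            && chars.next_if_eq(&FILLER).is_some()
        {
            out.push('\\');
            out.push((code - PUA_BASE) as u8 as char);
            continue;
        }
        out.push(c);
    }
    out
}
```

Each function walks the input once and builds one output `String`, instead of allocating a fresh `String` for each of the 32 escapable characters.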

### Combined Results (Optimizations 1+2)

Benchmarked with `hyperfine` (warmup + 10 runs for single file, 5 for batch).

**Single file (`testdoc.orig.md`, 1,734 lines):**

|  | Mean | Range |
| --- | --- | --- |
| Before | 31.5 ms +/- 2.2 ms | 28.4 – 39.6 ms |
| After | 27.3 ms +/- 2.5 ms | 24.2 – 34.9 ms |
| **Improvement** | **13.3% faster** |  |

Verified across 4 independent runs: 27.0, 27.2, 27.4, 27.8 ms (consistent).

**Batch `--auto` (1,080 files):**

|  | Mean | Range |
| --- | --- | --- |
| Before | 3.21 s +/- 0.11 s | 3.09 – 3.34 s |
| After | 2.69 s +/- 0.15 s | 2.58 – 3.02 s |
| **Improvement** | **16.2% faster** |  |

Verified across 3 independent runs: 2.71, 2.73, 2.63 s (consistent).

### Profiling After Optimization

Re-profiled with callgrind after optimizations:

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| **Total instructions** | 155.7M | 89.0M | **-42.8%** |
| String searching (`str::pattern`) | ~30% (46.7M) | ~7.4% (6.6M) | **-85.9%** |
| Memory allocation (malloc/free) | ~18.5% (28.8M) | ~19% (16.9M) | -41.3% |
| Regex (regex-automata) | ~5.5% (8.6M) | ~5.8% (5.2M) | -39.5% |
| Comrak parser | ~2.4% (3.7M) | ~2.6% (2.3M) | -37.8% |

The string searching cost dropped from the dominant bottleneck (30%) to a minor
contributor (7.4%). All other categories decreased in absolute terms by ~40%, reflecting
the removal of the unnecessary work that string-search-heavy replace loops were causing.

### What’s Left After Optimization

Post-optimization, the remaining cost is spread across:

1. **Memory allocation** (~19%): inherent to string manipulation; would require
   `Cow<str>` or arena allocation (medium complexity)
2. **String searching** (~7%): remaining uses are necessary `.contains()` and `.find()`
   calls
3. **Regex** (~6%): already well-optimized with `LazyLock`; hybrid DFA is the regex
   crate’s efficient path
4. **Comrak parser** (~3%): external dependency, not directly optimizable
5. **memcpy/memset** (~7%): inherent to string operations

Further optimization would yield diminishing returns for increasing complexity.

### Optimization 3: Allocation Reduction

**Status:** Not implemented. The profiling after optimizations 1+2 shows that allocation
cost dropped 41% in absolute terms (from 28.8M to 16.9M instructions) as a side effect
of eliminating the string-replace churn.
The remaining allocations are spread across many small sites in the wrapping pipeline,
and reducing them would require introducing `Cow<str>` throughout the call chain: medium
complexity for an estimated 3–5% further improvement.
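
For readers unfamiliar with the pattern, the following sketch shows what a `Cow<str>` step looks like. `collapse_spaces` is a hypothetical pipeline stage chosen for illustration, not a flowmark function. The point is that the common "nothing to change" case returns a borrow of the input and allocates nothing.

```rust
use std::borrow::Cow;

/// Collapse runs of spaces into a single space, allocating only when the
/// input actually contains a run. (Hypothetical example stage.)
pub fn collapse_spaces(line: &str) -> Cow<'_, str> {
    if !line.contains("  ") {
        return Cow::Borrowed(line); // common case: no allocation, no copy
    }
    let mut out = String::with_capacity(line.len());
    let mut prev_space = false;
    for c in line.chars() {
        if c == ' ' && prev_space {
            continue; // skip the second and later spaces in a run
        }
        prev_space = c == ' ';
        out.push(c);
    }
    Cow::Owned(out)
}
```

The cost is in the signatures: every caller in the chain has to accept and return `Cow<str>` for the saving to carry through, which is the medium-complexity refactor noted above.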

## Updated Headline Numbers (With Optimizations)

After applying optimizations 1+2:

| Benchmark | Python | Rust (before) | Rust (after) | Speedup vs Python |
| --- | --- | --- | --- | --- |
| Single file (1,734 lines) | 471.7 ms | 31.5 ms | 27.3 ms | **17.3x** |
| Batch `--auto` (1,080 files) | 32.1 s | 3.21 s | 2.69 s | **11.9x** |

Per-file throughput after optimization: **401 files/sec** in `--auto` mode (was 336, from 1,080 files in 3.21 s).

* * *

## Part 3: Parallel File Processing (v0.3.0)

### Changes

Two complementary improvements implemented in v0.3.0:

1. **Rayon parallel file processing.** Replaced the sequential `for` loop with
   rayon’s `par_iter().try_for_each()` for in-place formatting.
   Rayon’s work-stealing thread pool automatically sizes to `available_parallelism()`. A
   `--threads` CLI flag allows overriding (0 = all cores, default).
   Stdout output remains sequential to preserve file ordering.

2. **Skip-unchanged optimization.** After formatting, if the output matches the input
   exactly, the file write is skipped entirely.
   This preserves file modification times (important for build tools that use mtime) and
   eliminates I/O for already-formatted files.
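
The skip-unchanged check in item 2 amounts to a comparison before the write. A minimal sketch, assuming the formatter is available as a function from `&str` to `String`; `format_in_place` is an illustrative name, not the flowmark-rs API:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Format `path` in place with `format`, writing only if the output differs.
/// Returns `true` if the file was rewritten.
pub fn format_in_place<F>(path: &Path, format: F) -> io::Result<bool>
where
    F: FnOnce(&str) -> String,
{
    let original = fs::read_to_string(path)?;
    let formatted = format(&original);
    if formatted == original {
        return Ok(false); // leave the file, and its mtime, untouched
    }
    fs::write(path, formatted)?;
    Ok(true)
}
```

The file is still read and formatted on every run. Only the write is skipped, which is why the re-format numbers below improve over a fresh run without approaching the cached second-run figures from Part 1.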

### Benchmark Results (928 files, 8.8 MB)

Corpus: 928 Markdown files across a 4–5 level deep directory tree.
3 runs each.

#### Fresh Corpus (Files Need Formatting)

| Formatter | Run 1 | Run 2 | Run 3 | Mean | Relative speed |
| --- | --- | --- | --- | --- | --- |
| **dprint** | 0.364 s | 0.371 s | 0.361 s | **0.37 s** | **1.0x** |
| **flowmark-rs (parallel)** | 0.727 s | 0.728 s | 0.737 s | **0.73 s** | **2.0x** |
| **markdownfmt** | 0.958 s | 0.969 s | 0.929 s | **0.95 s** | **2.6x** |
| **flowmark-rs (sequential)** | 2.403 s | 2.385 s | 2.474 s | **2.42 s** | **6.5x** |

#### Already-Formatted Corpus (Re-format, Skip-Unchanged)

| Formatter | Run 1 | Run 2 | Run 3 | Mean | Relative speed |
| --- | --- | --- | --- | --- | --- |
| **dprint** | 0.247 s | 0.248 s | 0.247 s | **0.25 s** | **1.0x** |
| **flowmark-rs (parallel)** | 0.396 s | 0.367 s | 0.370 s | **0.38 s** | **1.5x** |

#### Thread Scaling (Fresh Corpus)

| Threads | Run 1 | Run 2 | Run 3 | Mean | Speedup vs 1 |
| --- | --- | --- | --- | --- | --- |
| 1 (sequential) | 2.521 s | 2.554 s | 2.498 s | 2.52 s | 1.0x |
| 2 | 2.673 s | 2.686 s | 2.761 s | 2.71 s | 0.9x |
| 4 | 1.733 s | 1.672 s | 1.774 s | 1.73 s | 1.5x |
| all cores (default) | 0.708 s | 0.670 s | 0.774 s | 0.72 s | 3.5x |

### Summary

| Metric | Before (v0.2.4) | After (v0.3.0) | Improvement |
| --- | --- | --- | --- |
| Batch formatting (928 files) | 2.74 s | 0.73 s | **3.8x faster** |
| Re-formatting (already done) | 2.74 s | 0.38 s | **7.2x faster** |
| vs dprint (fresh) | 11.7x slower | 2.0x slower | **Gap closed by 5.9x** |
| vs dprint (re-format) | N/A | 1.5x slower | **Nearly competitive** |
| Per-file throughput (fresh) | 338 files/sec | 1,271 files/sec | **3.8x** |
| Per-file throughput (re-format) | 338 files/sec | 2,442 files/sec | **7.2x** |

**flowmark-rs is now within 2x of dprint’s performance** on fresh formatting, and within
1.5x on re-formatting.
The remaining gap is primarily due to flowmark doing significantly more work per file
(semantic line breaks, smart quotes, typography, reference link encoding, footnote
extraction) versus dprint’s basic markdown normalization.

### Note on Thread Scaling

The `--threads 2` result (2.71s) is slower than sequential (2.52s). This is expected on
this benchmark machine: the overhead of rayon’s thread pool and synchronization exceeds
the benefit with only 2 threads and relatively fast per-file formatting (~2.7ms/file).
Scaling improves at 4+ threads.