
# Flowmark Performance Report

**Date:** 2026-02-27

## Part 1: Cross-Formatter Comparison

### Benchmark Setup

- **Platform:** macOS 25.2.0, arm64 (local)
- **Corpus:** 928 Markdown files (23 MB)
- **Methodology:** single-corpus measurements using benchmark harness warmup + timed run(s).
- **Fresh-run mode:** `./benchmarks/run_comparison.sh first-run 1`
- **Cached second-run mode:** `./benchmarks/run_comparison.sh second-run 1`

Scripts to reproduce: `benchmarks/generate_corpus.sh`, `benchmarks/run_comparison.sh`.

### Current Headline Results (2026-02-27)

#### Fresh Run (single corpus, files need formatting)

| Formatter | Language | Parallel | Mean | Relative speed |
| --- | --- | --- | --- | --- |
| **dprint** (`--incremental=false`) | Rust (WASM plugin) | yes | **0.36 s** | **1.0x** |
| **flowmark-rs** (`--auto`) | Rust | yes (rayon) | **0.71 s** | **2.0x** |
| **markdownfmt** | Go | no | **0.95 s** | **2.6x** |
| **prettier** | JavaScript | no | **38.0 s** | **105x** |
| **flowmark-py** | Python | no | **~48 s** | **~130x** |
| **mdformat** | Python | no | **72.9 s** | **197x** |

Notes:

- `flowmark-rs` and `dprint` values are from current local reruns on the 928-file corpus.
- `markdownfmt`, `prettier`, `mdformat`, `flowmark-py` are from the same corpus profile
  suite in this report (retained for cross-formatter ranking continuity).

#### Cached Second Run (unchanged files)

| Formatter | Mean | Relative speed |
| --- | --- | --- |
| **flowmark-rs** (`--auto`, incremental default) | **0.023 s** | **1.0x** |
| **dprint** (`fmt`, incremental default) | **0.031 s** | **1.3x** |

Interpretation:

- Fresh-run ranking remains unchanged: flowmark-rs is #2 overall.
- With incremental cache warm, flowmark-rs now drops to ~23ms on this corpus.
- Fresh-run Rust vs Python headline remains roughly **60-70x faster** (`0.71s` vs `~48s`).

### Per-File Throughput

| Formatter | ms/file | files/sec |
| --- | --- | --- |
| dprint | 0.40 | 2,508 |
| flowmark-rs (parallel) | 0.79 | 1,271 |
| markdownfmt | 1.02 | 976 |
| flowmark-rs (sequential) | 2.61 | 383 |
| prettier | 41.0 | 24 |
| flowmark-py | ~52 | ~19 |
| mdformat | 78.6 | 13 |
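
The per-file figures follow from dividing wall-clock time by file count. As a worked check, the sketch below recomputes two rows. It assumes the table was derived from the three-run means reported elsewhere in this report (0.37 s for dprint, 0.73 s for flowmark-rs parallel) over 928 files; the helper names are illustrative only.

```rust
/// Milliseconds spent per file for a run that took `total_secs` over `files` files.
fn ms_per_file(total_secs: f64, files: u32) -> f64 {
    total_secs * 1000.0 / f64::from(files)
}

/// Files processed per second for the same run.
fn files_per_sec(total_secs: f64, files: u32) -> f64 {
    f64::from(files) / total_secs
}
```

With these helpers, `files_per_sec(0.73, 928)` rounds to 1,271 and `files_per_sec(0.37, 928)` rounds to 2,508, which matches the table.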

### Raw Timings (3 Runs Each)

**v0.3.0 parallel runs (928 files, fresh corpus):**

| Formatter | Run 1 | Run 2 | Run 3 |
| --- | --- | --- | --- |
| dprint | 0.364 s | 0.371 s | 0.361 s |
| flowmark-rs (parallel) | 0.727 s | 0.728 s | 0.737 s |
| markdownfmt | 0.958 s | 0.969 s | 0.929 s |
| flowmark-rs (sequential) | 2.403 s | 2.385 s | 2.474 s |

**Original v0.2.4 runs (924 files, sequential only):**

| Formatter | Run 1 | Run 2 | Run 3 |
| --- | --- | --- | --- |
| dprint | 0.235 s | 0.224 s | 0.242 s |
| markdownfmt | 0.829 s | 0.790 s | 0.781 s |
| flowmark-rs (sequential) | 2.633 s | 2.928 s | 2.647 s |
| prettier | 20.961 s | 20.822 s | 20.885 s |
| flowmark-py | 27.889 s | 27.914 s | 27.597 s |
| mdformat | 37.571 s | 37.395 s | 37.499 s |

### Analysis

**Compiled-language formatters (dprint, flowmark-rs, markdownfmt) are one to two orders
of magnitude faster than interpreted-language formatters (prettier, flowmark-py,
mdformat): roughly 40x (markdownfmt vs prettier) to 200x (dprint vs mdformat) in the
headline results.**

- **dprint** is the fastest — its Rust core with WASM plugin and multi-threaded file
  processing gives it ~0.37s on 928 files. Note that dprint uses ~3.3s of user CPU time
  (multi-threaded) for 0.37s wall-clock, indicating heavy parallelism.
- **flowmark-rs (parallel)** is second at 0.73s, within **2x of dprint** after adding
  rayon parallelism in v0.3.0. This is a **3.3x improvement** over sequential mode on
  the same corpus (2.42s) and 3.8x over the v0.2.4 baseline (2.74s). The remaining gap
  vs dprint is due to flowmark doing more work per file (semantic line breaks, smart
  quotes, typography, reference link encoding, footnote extraction).
- **markdownfmt** is third at 0.95s, benefiting from running as a natively compiled Go
  binary with low per-file overhead. It processes files via `find -exec` with argument
  batching (not parallel internally).
- **prettier** is the fastest interpreted-language formatter, but ~100x slower than
  dprint. Node.js startup and single-threaded JS execution are the main bottlenecks.
- **flowmark-py** and **mdformat** are the slowest, reflecting Python's interpreter
  overhead. mdformat is slower than flowmark-py despite doing less work, likely due to
  markdown-it-py parsing overhead.

### Important Caveats

These formatters are **not interchangeable** — they have very different feature sets:

- **flowmark** (Python and Rust): Semantic line breaks, smart quotes, ellipsis
  typography, reference link encoding, footnote extraction, configurable wrapping modes.
  The most feature-rich formatter.
- **prettier**: Opinionated reformatting with consistent style. Good ecosystem
  integration. No semantic line breaks.
- **dprint**: Fast, parallel, plugin-based. Basic markdown normalization. No typography
  or semantic features.
- **mdformat**: Extensible Python formatter with plugin system. CommonMark-focused.
- **markdownfmt**: Minimal Go formatter. Normalizes headings, lists, and whitespace.
  Limited configurability.

The speed differences partially reflect feature complexity: simpler formatters that do
less per-file processing are naturally faster.

### How dprint Achieves Its Speed

Source analysis of [dprint/dprint](https://github.com/dprint/dprint) (cloned to
`attic/dprint`). Key file: `crates/dprint/src/format.rs`.

**Architecture:** Single-threaded tokio `current_thread` runtime for async
orchestration, with all actual work (file I/O + formatting) dispatched to tokio's
multi-threaded blocking pool via `spawn_blocking()`.

**Parallelism model:**

1. **Thread count = CPU cores.** Uses `std::thread::available_parallelism()`,
   overridable via `DPRINT_MAX_THREADS`. Reserves 1 thread per process plugin + 1 for
   the runtime.
1. **Semaphore-controlled concurrency.** Files are grouped by plugin. Each group gets a
   custom `Semaphore` with permits proportional to the thread count. A file can only
   begin formatting when it acquires a permit, capping active concurrent formats at
   ~core count.
1. **`spawn_blocking()` for I/O and formatting.** Each file: read (blocking) -> format
   (blocking or async depending on plugin type) -> write (blocking). The async event
   loop just orchestrates.
1. **Adaptive CPU throttling.** A background task monitors CPU usage every 2 seconds. If
   CPU exceeds a threshold, it removes semaphore permits to reduce parallelism. When CPU
   drops, it adds permits back. Disabled on CI.
1. **Work stealing on completion.** When one plugin group finishes, its semaphore
   permits are redistributed to remaining groups via `SemaphorePermitReleaser::drop`,
   favoring groups with fewer permits.
1. **Incremental caching.** Hash-based skip for unchanged files (explains the 0.13s with
   caching vs 0.23s with `--incremental=false`).
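
To make the permit mechanism in items 2 and 3 concrete, here is a minimal sketch using only the standard library. It is not dprint's code: dprint uses tokio with its own `Semaphore` type, whereas this stand-in builds a counting semaphore from `Mutex` + `Condvar`, spawns one OS thread per file, and uses whitespace trimming as a placeholder for real formatting. All names are hypothetical.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore (the standard library does not provide one).
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Block until a permit is free, then take it.
    fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Return a permit and wake one waiter.
    fn release(&self) {
        let mut permits = self.permits.lock().unwrap();
        *permits += 1;
        self.available.notify_one();
    }
}

/// Format every input with at most `max_concurrent` formats active at once.
/// Returns the outputs in input order plus the peak concurrency observed.
fn format_with_permits(inputs: Vec<String>, max_concurrent: usize) -> (Vec<String>, usize) {
    let sem = Arc::new(Semaphore::new(max_concurrent));
    let active = Arc::new(Mutex::new((0usize, 0usize))); // (current, peak)

    let handles: Vec<_> = inputs
        .into_iter()
        .map(|text| {
            let sem = Arc::clone(&sem);
            let active = Arc::clone(&active);
            thread::spawn(move || {
                sem.acquire(); // a file may only start once it holds a permit
                {
                    let mut counts = active.lock().unwrap();
                    counts.0 += 1;
                    let current = counts.0;
                    counts.1 = counts.1.max(current);
                }
                let out = format!("{}\n", text.trim_end()); // stand-in for formatting
                {
                    let mut counts = active.lock().unwrap();
                    counts.0 -= 1;
                }
                sem.release();
                out
            })
        })
        .collect();

    let outputs: Vec<String> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    let peak = active.lock().unwrap().1;
    (outputs, peak)
}
```

The adaptive throttling described in item 4 would correspond to changing the permit count at runtime; the sketch keeps it fixed.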

**Plugin system:** WASM plugins (compiled with Wasmer, run synchronously in-process) and
process plugins (separate child processes communicating via stdin/stdout). The markdown
formatter is a WASM plugin.

### Implemented: Parallel File Processing for flowmark-rs

Parallel file processing was implemented in v0.3.0 using rayon (see Part 3 for full
results). The sequential loop was replaced with rayon's `par_iter().try_for_each()`,
achieving a **3.8x wall-clock speedup** on batch workloads and bringing flowmark-rs to
**within 2x of dprint's performance**.

The rayon approach proved simpler than dprint's tokio + semaphore architecture and, for
this workload, comparably effective, since flowmark-rs has no plugin infrastructure.
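
For readers unfamiliar with the pattern, the sketch below mimics what a parallel `try_for_each` does, using only `std::thread::scope` so that it runs without extra crates. It is a stand-in for rayon, not flowmark-rs's code: workers pull the next index from a shared counter and stop once any file fails. The function name `par_try_for_each` and the `String` error type are hypothetical.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Apply `format_one` to every item using `threads` worker threads.
/// Returns the first error recorded, if any.
fn par_try_for_each<T, F>(items: &[T], threads: usize, format_one: F) -> Result<(), String>
where
    T: Sync,
    F: Fn(&T) -> Result<(), String> + Sync,
{
    let next = AtomicUsize::new(0);
    let first_error: Mutex<Option<String>> = Mutex::new(None);

    thread::scope(|scope| {
        for _ in 0..threads.max(1) {
            scope.spawn(|| loop {
                // Stop early once any worker has failed.
                if first_error.lock().unwrap().is_some() {
                    break;
                }
                // Claim the next unprocessed index; exit when none are left.
                let i = next.fetch_add(1, Ordering::Relaxed);
                let Some(item) = items.get(i) else { break };
                if let Err(e) = format_one(item) {
                    first_error.lock().unwrap().get_or_insert(e);
                    break;
                }
            });
        }
    });

    match first_error.into_inner().unwrap() {
        Some(e) => Err(e),
        None => Ok(()),
    }
}
```

With rayon itself, the equivalent call is a one-liner along the lines of `files.par_iter().try_for_each(|f| format_one(f))` after importing `rayon::prelude::*`.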

______________________________________________________________________

## Part 2: Flowmark Python vs Rust (Detailed)

### Benchmark Setup

- **Python:** flowmark v0.6.4
- **Rust:** flowmark v0.2.4 (release: `opt-level=3`, LTO, `codegen-units=1`,
  `panic=abort`)
- **Benchmarking tool:** hyperfine (with warmup, multiple runs)
- **Profiling tool:** valgrind callgrind (instruction-level, single file and batch)

Scripts to reproduce: `benchmarks/run_benchmarks.sh`, `benchmarks/profile_rust.sh`.

### Headline Results

Rust flowmark is **10–17x faster** than Python flowmark on formatting workloads, and 7.8x faster on file discovery.

| Benchmark | Python | Rust | Speedup |
| --- | --- | --- | --- |
| Single file (1,734 lines, stdout) | 471.7 ms | 27.3 ms | **17.3x** |
| Batch `--auto` (924 files in-place) | 27.8 s | 2.74 s | **10.1x** |
| Batch `--semantic` (1,080 files in-place) | 27.2 s | 2.5 s | **10.9x** |
| File discovery `--list-files` (1,080 files) | 1.31 s | 169 ms | **7.8x** |

### Per-File Throughput

| Mode | Python | Rust |
| --- | --- | --- |
| `--auto` (batch) | 30.1 ms/file, 33 files/sec | 2.96 ms/file, 338 files/sec |
| `--semantic` (batch) | 25.2 ms/file, 39 files/sec | 2.3 ms/file, 432 files/sec |

### Notes

- Python startup overhead (~300 ms) inflates single-file times; in batch mode this is
  amortized and the per-file speedup drops to ~10x.
- Semantic mode is slightly faster than auto for both implementations (fewer line-wrap
  iterations).
- File discovery (`--list-files`) shows 7.8x speedup, reflecting Rust `ignore` crate vs
  Python `pathspec`/`os.walk`.

## Profiling: Where Does Rust Spend Its Time?

Profiled with `valgrind --tool=callgrind` on `tests/testdocs/testdoc.orig.md` (1,734
lines). Total: 155.7M instructions.

### Call Hierarchy (Inclusive Cost)

```
fill_markdown (entry)                           99.4%   (154.7M)
├── render_block (comrak AST → Markdown)        69.3%   (107.9M)
│   └── render_block recursive                  55.9%   ( 87.0M)
│       └── line wrapping pipeline              37.8%   ( 58.8M)
│           └── tag newline handling            37.4%   ( 58.2M)
│               └── line_wrap_to_width          35.6%   ( 55.4M)
│                   └── wrap_paragraph          35.1%   ( 54.6M)
│                       └── wrap_paragraph_lines 34.4%  ( 53.5M)
│                           └── html_md_word_split 27.6% ( 43.0M)
└── pre/post-processing workarounds             30.1%   ( 46.8M)
```

The wrapping pipeline (word splitting → paragraph wrapping → line breaking) is the
dominant cost at ~35% inclusive. Pre- and post-processing workarounds for comrak account
for another ~30%.

### Self-Time Breakdown (Exclusive Cost)

| Category | % | Instructions | What's happening |
| --- | --- | --- | --- |
| **String searching** (`core::str::pattern`) | **~30%** | ~46.7M | `StrSearcher::new` 15.2%, `TwoWaySearcher::next` 10.4% |
| **Memory allocation** (malloc/free/realloc) | ~18.5% | ~28.8M | Allocation churn from string operations |
| **Memory ops** (memcpy/memcmp/memset) | ~6.7% | ~10.4M | Copying strings during replace/concat |
| **Regex** (regex-automata hybrid DFA) | ~5.5% | ~8.6M | Sentence detection, atomic construct extraction |
| **flowmark functions** (direct self-time) | ~4.5% | ~7.0M | `fill_markdown` 0.6%, `remove_period_escapes` 0.5% |
| **str::replace** (alloc + search) | ~2.8% | ~4.4M | Each `.replace()` allocates a new String |
| **Comrak parser** | ~2.4% | ~3.7M | `parse_inline`, `process_line`, `open_new_blocks` |

### Key Finding

**String pattern searching is the #1 bottleneck at ~30% of total instructions.** This is
not from the comrak parser or regex — it's from Rust's `str::replace()`,
`str::contains()`, and related methods that use `core::str::pattern::StrSearcher`
(Two-Way string search algorithm).

## Root Causes

### 1. O(N×M) Placeholder Restoration in `restore_atomic_constructs`

**File:** `src/wrapping/text_wrapping.rs:56–71`

```rust
fn restore_atomic_constructs(
    tokens: &[String],
    constructs: &[String],
    placeholders: &[String],
) -> Vec<String> {
    tokens
        .iter()
        .map(|token| {
            let mut result = token.clone();
            for (placeholder, construct) in placeholders.iter().zip(constructs.iter()) {
                // N×M string scans: one full scan and one allocation per placeholder
                result = result.replace(placeholder.as_str(), construct);
            }
            result
        })
        .collect()
}
```

For each token, this scans the full string M times (once per placeholder). Each
`.replace()` call invokes `StrSearcher::new` (builds a Two-Way searcher) and
`TwoWaySearcher::next` (scans the string). With many tokens and many placeholders, this
is expensive.

### 2. 32× Sequential `.replace()` for Escape Placeholders

**File:** `src/formatter/filling.rs:2200–2203`

```rust
for (escaped, placeholder) in &escape_placeholders {
    result = result.replace(placeholder.as_str(), escaped.as_str());
}
```

This runs 32 `.replace()` calls over the entire document (one per escapable ASCII
punctuation character). Each call scans the full document and allocates a new `String`.
The same pattern appears in the pre-processing direction at lines 750–755.

### 3. Per-Line Character Scanning in `remove_period_escapes_preserving_code`

**File:** `src/formatter/filling.rs:807–850`

Called on every non-fenced line. Character-by-character processing with
`String::with_capacity` + push. Not algorithmically bad, but the sheer call volume makes
it visible at 0.5% self-time.

## Optimization Opportunities

| # | Optimization | Estimated Impact | Complexity |
| --- | --- | --- | --- |
| 1 | Single-pass `restore_atomic_constructs`: scan each token once for `\x00AC` prefix instead of M `.replace()` calls | 10–15% | Low |
| 2 | Single-pass PUA escape restoration: scan document once for PUA chars in `\u{E000}..=\u{E07E}` instead of 32 `.replace()` calls | 5–10% | Low |
| 3 | Buffer reuse / `Cow<str>` in wrapping pipeline to reduce allocation churn | 3–5% | Medium |
| 4 | Pre-built regex `Cache` for hybrid DFA | 1–2% | Low |

## Optimization Experiments

Two optimizations were implemented and tested. All 430 tests pass after each change.

### Optimization 1: Single-pass `restore_atomic_constructs`

**Change:** Replace the O(N×M) `.replace()` loop in `restore_atomic_constructs`
(`src/wrapping/text_wrapping.rs`) with a fast-path check: if the token doesn't contain
the placeholder prefix byte (`\x00`), skip entirely. If the entire token is a
placeholder (common case), do a HashMap lookup instead of M sequential `.replace()`
calls.

**Result (alone):** Within measurement noise — no significant improvement on test
document. This makes sense: the testdoc has relatively few atomic constructs (HTML tags,
code spans), so the placeholder restoration isn't the dominant contributor. The
optimization would show more benefit on documents heavy with inline HTML/code.
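
A minimal sketch of the fast-path idea described above, under stated assumptions: placeholders are taken to look like `\0AC<n>\0` (the exact format is not given in this report), and tokens that embed a placeholder inside longer text fall back to the original replace loop. This illustrates the approach and is not the code in `text_wrapping.rs`.

```rust
use std::collections::HashMap;

/// Restore atomic constructs with a prefix-byte fast path and a map lookup.
fn restore_atomic_constructs(
    tokens: &[String],
    constructs: &[String],
    placeholders: &[String],
) -> Vec<String> {
    // Build the lookup once instead of scanning every token M times.
    let lookup: HashMap<&str, &str> = placeholders
        .iter()
        .map(String::as_str)
        .zip(constructs.iter().map(String::as_str))
        .collect();

    tokens
        .iter()
        .map(|token| {
            // Fast path: without the prefix byte no placeholder can be present.
            if !token.contains('\0') {
                return token.clone();
            }
            // Common case: the whole token is exactly one placeholder.
            if let Some(construct) = lookup.get(token.as_str()) {
                return (*construct).to_string();
            }
            // Assumed fallback: placeholder embedded in a longer token.
            let mut result = token.clone();
            for (placeholder, construct) in placeholders.iter().zip(constructs) {
                result = result.replace(placeholder.as_str(), construct);
            }
            result
        })
        .collect()
}
```

Tokens that never contained a placeholder now cost a single byte scan, which is why the gain depends on how many atomic constructs a document has.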

### Optimization 2: Single-pass PUA Escape Processing

**Change:** Replace two sets of 32× sequential `.replace()` calls:

- **Pre-processing** (`replace_escapes_in_line`): Instead of calling
  `.replace(escaped, placeholder)` for each of 32 escape chars, scan the line once for
  `\` and check if the next char is in the escape set.
- **Post-processing** (`restore_pua_escape_placeholders`): Instead of 32×
  `.replace(placeholder, escaped)` over the full document, scan once for any char in the
  PUA range `\u{E000}..=\u{E0FF}` followed by filler `\u{E100}` and emit the original
  `\<char>`.

Both directions now process the text in a single pass with O(N) time per call instead of
O(32×N).
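
The two single-pass scans can be sketched as follows. The mapping is an assumption made for illustration: each escaped ASCII punctuation character is encoded as the code point `U+E000 + <ASCII value>`, and the filler character mentioned above is omitted for brevity. The function names are hypothetical and do not correspond to `replace_escapes_in_line` or `restore_pua_escape_placeholders` as implemented.

```rust
/// Start of the Private Use Area block used for placeholders (assumed mapping).
const PUA_BASE: u32 = 0xE000;

/// Pre-processing: replace each `\<punct>` with one PUA placeholder char.
fn escapes_to_placeholders(line: &str) -> String {
    let mut out = String::with_capacity(line.len());
    let mut chars = line.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\\' {
            if let Some(&next) = chars.peek() {
                if next.is_ascii_punctuation() {
                    // PUA_BASE + ASCII value is always a valid scalar value.
                    out.push(char::from_u32(PUA_BASE + next as u32).unwrap());
                    chars.next(); // consume the escaped character
                    continue;
                }
            }
        }
        out.push(c);
    }
    out
}

/// Post-processing: turn every placeholder back into `\<punct>`.
fn placeholders_to_escapes(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        let original = (c as u32)
            .checked_sub(PUA_BASE)
            .and_then(char::from_u32)
            .filter(|ch| ch.is_ascii_punctuation());
        match original {
            Some(punct) => {
                out.push('\\');
                out.push(punct);
            }
            None => out.push(c),
        }
    }
    out
}
```

Each function visits every character once and allocates a single output `String`, instead of 32 scans and 32 allocations.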

### Combined Results (Optimizations 1+2)

Benchmarked with `hyperfine` (warmup + 10 runs for single file, 5 for batch).

**Single file (`testdoc.orig.md`, 1,734 lines):**

| | Mean | Range |
| --- | --- | --- |
| Before | 31.5 ms +/- 2.2 ms | 28.4 – 39.6 ms |
| After | 27.3 ms +/- 2.5 ms | 24.2 – 34.9 ms |
| **Improvement** | **13.3% faster** | |

Verified across 4 independent runs: 27.0, 27.2, 27.4, 27.8 ms (consistent).

**Batch `--auto` (1,080 files):**

| | Mean | Range |
| --- | --- | --- |
| Before | 3.21 s +/- 0.11 s | 3.09 – 3.34 s |
| After | 2.69 s +/- 0.15 s | 2.58 – 3.02 s |
| **Improvement** | **16.2% faster** | |

Verified across 3 independent runs: 2.71, 2.73, 2.63 s (consistent).

### Profiling After Optimization

Re-profiled with callgrind after optimizations:

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| **Total instructions** | 155.7M | 89.0M | **-42.8%** |
| String searching (`str::pattern`) | ~30% (46.7M) | ~7.4% (6.6M) | **-85.9%** |
| Memory allocation (malloc/free) | ~18.5% (28.8M) | ~19% (16.9M) | -41.3% |
| Regex (regex-automata) | ~5.5% (8.6M) | ~5.8% (5.2M) | -39.5% |
| Comrak parser | ~2.4% (3.7M) | ~2.6% (2.3M) | -37.8% |

The string searching cost dropped from the dominant bottleneck (30%) to a minor
contributor (7.4%). All other categories decreased in absolute terms by ~40%, reflecting
the removal of the unnecessary work that string-search-heavy replace loops were causing.

### What's Left After Optimization

Post-optimization, the remaining cost is spread across:

1. **Memory allocation** (~19%) — inherent to string manipulation; would require
   `Cow<str>` or arena allocation (medium complexity)
1. **String searching** (~7%) — remaining uses are necessary `.contains()` and `.find()`
   calls
1. **Regex** (~6%) — already well-optimized with `LazyLock`; hybrid DFA is the regex
   crate's efficient path
1. **Comrak parser** (~3%) — external dependency, not directly optimizable
1. **memcpy/memset** (~7%) — inherent to string operations

Further optimization would yield diminishing returns for increasing complexity.

### Optimization 3: Allocation Reduction

**Status:** Not implemented — the profiling after optimizations 1+2 shows that
allocation cost dropped 41% in absolute terms (from 28.8M to 16.9M instructions) as a
side effect of eliminating the string-replace churn. The remaining allocations are
spread across many small sites in the wrapping pipeline, and reducing them would require
introducing `Cow<str>` throughout the call chain — medium complexity for an estimated
3–5% further improvement.

## Updated Headline Numbers (With Optimizations)

After applying optimizations 1+2:

| Benchmark | Python | Rust (before) | Rust (after) | Speedup vs Python |
| --- | --- | --- | --- | --- |
| Single file (1,734 lines) | 471.7 ms | 31.5 ms | 27.3 ms | **17.3x** |
| Batch `--auto` (1,080 files) | 32.1 s | 3.21 s | 2.69 s | **11.9x** |

Per-file throughput after optimization: **401 files/sec** in `--auto` mode (was 336, computed from the 3.21 s baseline over the same 1,080 files).

______________________________________________________________________

## Part 3: Parallel File Processing (v0.3.0)

### Changes

Two complementary improvements implemented in v0.3.0:

1. **Rayon parallel file processing.** Replaced the sequential `for` loop with rayon's
   `par_iter().try_for_each()` for in-place formatting. Rayon's work-stealing thread
   pool automatically sizes to `available_parallelism()`. A `--threads` CLI flag
   allows overriding (0 = all cores, default). Stdout output remains sequential to
   preserve file ordering.

1. **Skip-unchanged optimization.** After formatting, if the output matches the input
   exactly, the file write is skipped entirely. This preserves file modification times
   (important for build tools that use mtime) and eliminates I/O for already-formatted
   files.
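
The skip-unchanged check in item 2 amounts to one string comparison before the write. A minimal sketch, assuming a hypothetical `format_in_place` helper that takes the formatter as a closure (flowmark-rs's actual function names are not shown in this report):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Format a file in place, writing only when the result differs.
/// Returns `true` if the file was rewritten.
fn format_in_place<F>(path: &Path, format: F) -> io::Result<bool>
where
    F: Fn(&str) -> String,
{
    let original = fs::read_to_string(path)?;
    let formatted = format(&original);
    if formatted == original {
        // Identical output: skip the write so the mtime stays untouched.
        return Ok(false);
    }
    fs::write(path, formatted)?;
    Ok(true)
}
```

On an already-formatted corpus every call takes the early return, so the run performs reads and formatting but no writes.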

### Benchmark Results (928 files, 8.8 MB)

Corpus: 928 Markdown files across a 4–5 level deep directory tree. 3 runs each.

#### Fresh Corpus (Files Need Formatting)

| Formatter | Run 1 | Run 2 | Run 3 | Mean | Relative speed |
| --- | --- | --- | --- | --- | --- |
| **dprint** | 0.364 s | 0.371 s | 0.361 s | **0.37 s** | **1.0x** |
| **flowmark-rs (parallel)** | 0.727 s | 0.728 s | 0.737 s | **0.73 s** | **2.0x** |
| **markdownfmt** | 0.958 s | 0.969 s | 0.929 s | **0.95 s** | **2.6x** |
| **flowmark-rs (sequential)** | 2.403 s | 2.385 s | 2.474 s | **2.42 s** | **6.5x** |

#### Already-Formatted Corpus (Re-format, Skip-Unchanged)

| Formatter | Run 1 | Run 2 | Run 3 | Mean | Relative speed |
| --- | --- | --- | --- | --- | --- |
| **dprint** | 0.247 s | 0.248 s | 0.247 s | **0.25 s** | **1.0x** |
| **flowmark-rs (parallel)** | 0.396 s | 0.367 s | 0.370 s | **0.38 s** | **1.5x** |

#### Thread Scaling (Fresh Corpus)

| Threads | Run 1 | Run 2 | Run 3 | Mean | Speedup vs 1 |
| --- | --- | --- | --- | --- | --- |
| 1 (sequential) | 2.521 s | 2.554 s | 2.498 s | 2.52 s | 1.0x |
| 2 | 2.673 s | 2.686 s | 2.761 s | 2.71 s | 0.9x |
| 4 | 1.733 s | 1.672 s | 1.774 s | 1.73 s | 1.5x |
| all cores (default) | 0.708 s | 0.670 s | 0.774 s | 0.72 s | 3.5x |

### Summary

| Metric | Before (v0.2.4) | After (v0.3.0) | Improvement |
| --- | --- | --- | --- |
| Batch formatting (928 files) | 2.74 s | 0.73 s | **3.8x faster** |
| Re-formatting (already done) | 2.74 s | 0.38 s | **7.2x faster** |
| vs dprint (fresh) | 11.7x slower | 2.0x slower | **Gap closed by 5.9x** |
| vs dprint (re-format) | N/A | 1.5x slower | **Nearly competitive** |
| Per-file throughput (fresh) | 338 files/sec | 1,271 files/sec | **3.8x** |
| Per-file throughput (re-format) | 338 files/sec | 2,442 files/sec | **7.2x** |

**flowmark-rs is now within 2x of dprint's performance** on fresh formatting, and within
1.5x on re-formatting. The remaining gap is primarily due to flowmark doing significantly
more work per file (semantic line breaks, smart quotes, typography, reference link
encoding, footnote extraction) versus dprint's basic markdown normalization.

### Note on Thread Scaling

The `--threads 2` result (2.71s) is slower than sequential (2.52s). This is expected on
this benchmark machine: the overhead of rayon's thread pool and synchronization exceeds
the benefit with only 2 threads and relatively fast per-file formatting (~2.7ms/file).
Scaling improves at 4+ threads.