PoW Buster
Table of Contents
- PoW Buster
- Table of Contents
- Motivation / Why?
- Features
- Building
- Usage
- Limitations
- Ethical Disclaimer (i.e. the "How Dare you Publish this?" question)
- Benchmark
- Security Implications and Responsible Reporting
- Future Work (i.e. Okay, so what would be a good PoW then?)
- Conflict of Interest Disclosure
- Contributing / Wishlist
- License and Acknowledgments
A fast, data-parallel, adversarially [^3] implemented mCaptcha/Anubis/Cerberus/go-away/Cap.js PoW solver, targeting AVX-512/SHA-NI/simd128. It computes solutions to these systems without disabling privacy-enhancing features and without wasting energy in the browser.
[^3]: Adversarial means that challenges are solved along the path of least resistance, sometimes by massaging the nonce space into favorable conditions or by partially inverting hash images into lower-latency internal states. Most supported schemes have no explicit specification and depend on the cryptographic guarantees of the hash function, which I did not break (at least not in any previously unknown way). This code follows the original code to the letter of the law and sometimes emits awkward but computationally or statistically favorable solutions (such as 10000000073377131, -10.00000141128212, etc.).
The benchmarks demonstrate a significant performance gap between browser-based JavaScript execution and native implementations, suggesting fundamental challenges for PoW-based browser CAPTCHA systems.
A public web demo runs on a Netcup(R) RS 2000 G12 8-vCPU server (MSRP 14.58 EUR/month). Ballpark throughput: 80-100 MH/s per thread for SHA-2 and 170 MH/s per thread for BLAKE3.
Motivation / Why?
See MOTIVATION.md for more details.
A longer blabbing post regarding this
Features
- SHA-2 and BLAKE3 hotstarting with round-level precomputation granularity
- Structure-of-Arrays hashing backed by register-resident SIMD
- 3 searching modes: Prefix greater than (mCaptcha), prefix less than (Anubis/go-away), mask test (Cerberus/Cap.js)
- Greedy padding logic with 64-bit integer and floating point nonce stretching
- Efficient outer loop and SIMD nonce encoding
- Fully unrolled and monomorphic core friendly to pipelining and ternary logic instruction lowering
- Short-circuiting comparison with $H_1 \to H_7$ feed-forward elision with optional 64-bit support
- Switch to octal nonces when the success rate is overwhelming
- Fuzzy Anubis challenge handling
- An API compatible with anubis_offload that does not need a GPU to run.
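The three search modes listed above differ only in the final acceptance test. As a minimal sketch, assuming the test is applied to the leading 32-bit big-endian word of the digest (real vendors derive thresholds, masks and word widths in their own ways), they reduce to the predicates below. SearchMode, accepts and leading_zero_nibbles are hypothetical names, not pow-buster's API.

```rust
/// Simplified model of the three search modes, applied to the leading
/// 32-bit word of a digest. Thresholds and masks are illustrative assumptions.
#[derive(Debug, Clone, Copy)]
enum SearchMode {
    /// Accept when the prefix is strictly greater than a threshold (mCaptcha-style).
    PrefixGreaterThan(u32),
    /// Accept when the prefix is strictly less than a threshold (Anubis/go-away-style).
    PrefixLessThan(u32),
    /// Accept when the masked prefix equals an expected pattern (Cerberus/Cap.js-style).
    MaskTest { mask: u32, expected: u32 },
}

impl SearchMode {
    fn accepts(self, leading_word: u32) -> bool {
        match self {
            SearchMode::PrefixGreaterThan(t) => leading_word > t,
            SearchMode::PrefixLessThan(t) => leading_word < t,
            SearchMode::MaskTest { mask, expected } => (leading_word & mask) == expected,
        }
    }
}

/// "N leading zero hex digits" expressed as a less-than threshold on one word:
/// the top 4*N bits must be zero, i.e. word < 2^(32 - 4N).
fn leading_zero_nibbles(n: u32) -> SearchMode {
    assert!((1..=8).contains(&n), "sketch only handles 1..=8 nibbles in one word");
    SearchMode::PrefixLessThan(1u32 << (32 - 4 * n))
}
```

Because every mode is a single compare (or AND plus compare) on one word, the check can short-circuit as soon as the first output word is known, which is what makes the feed-forward elision in the feature list possible.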
Building
MSRV: Rust 1.89+.
Requires a CPU with AVX-512 (cpuid: avx512f) or SHA-NI+SSE4.1 (cpuid: sha, sse4_1), or SIMD128 on WASM. If your platform supports none of these instruction sets, sorry, some "solutions" have "changed the way" of "security" (by paying with energy and battery life and making browsing on budget hardware hard). A pure Rust scalar fallback should still make the code compile and work regardless.
Recommended CPU feature flags in order of preference (Tagged releases will have Linux musl builds):
- -Ctarget-cpu=native
- -Ctarget-feature=+avx512vbmi (artifacts released on top of x86-64-v4)
- -Ctarget-feature=+avx512f (artifacts released on top of x86-64-v3)
- -Ctarget-cpu=x86-64-v2 (will auto-dispatch to accelerated solvers at runtime)
RUSTFLAGS="-Ctarget-cpu=x86-64-v2"
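The x86-64-v2 build relies on runtime dispatch. A minimal sketch of that kind of dispatch, using the standard library's is_x86_feature_detected! macro and the same order of preference as the Requirements above; the Backend enum and detect_backend are hypothetical names, and pow-buster's actual dispatch logic may differ.

```rust
/// Hypothetical backend tags for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Avx512,
    ShaNi,
    Scalar,
}

/// Pick the fastest backend the running CPU supports: AVX-512 first,
/// then SHA-NI + SSE4.1, then the portable scalar path.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn detect_backend() -> Backend {
    if is_x86_feature_detected!("avx512f") {
        Backend::Avx512
    } else if is_x86_feature_detected!("sha") && is_x86_feature_detected!("sse4.1") {
        Backend::ShaNi
    } else {
        Backend::Scalar
    }
}

/// Non-x86 targets fall back to the scalar path in this sketch.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn detect_backend() -> Backend {
    Backend::Scalar
}
```

Detecting once at startup and caching the result keeps the check out of the hot loop; the solver kernels themselves would then be compiled with #[target_feature] so they can use the wider instructions even when the crate-wide baseline is only x86-64-v2.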
Optional Features:
- compare-64bit: Compare 64-bit words instead of 32-bit words at ~5% penalty, almost never needed for realistic challenges. Not compatible with WASM.
- client: End-to-end solver client, required for most non-computational functionality.
- live-throughput-test: End-to-end multi-worker throughput benchmark.
- server: Solver-as-a-Service API.
- server-wasm: Solver-as-a-Service API (with WASM simd128 solver, build first with ./build_wasm.sh).
- tracing: Write tracing for debugging.
- tracing-subscriber: For binary releases only, writes tracing logs to console.
Usage
Using the CLI Solver
The most common use case is a non-browser client (for example a CLI downloader, feed checker or other automation tool) that needs a valid token. pow-buster has comprehensive coverage of the Anubis challenge workflow and basic coverage of Cerberus and go-away for edge-case non-JavaScript use cases.
> target/release/pow-buster
Custom User Agent
The default user agent is "pow-buster/x.x.x (NotAMozilla)". Some adopters may hard-block the default user agent or reject any non-browser-like UA, but regardless, it is good etiquette to identify yourself with your actual bot name, not just "pow-buster". I personally think it is okay to "impersonate" a browser if an overblocking ruleset makes it impossible to be "honest", but you are responsible for your own actions.
For example, TLNET blocks all UAs that do not start with "Mozilla". As a workaround, we can compromise by putting the magic word in front and then responsibly disclosing our actual UA:
> target/release/pow-buster
> USER_AGENT="Mozilla/5.0 MyDocumentationBot/0.1.0"
Using the browser extension
Browser addons will be released unsigned for reasons below. To install it you have to:
- Get it signed under your developer account, or
- On any Firefox browser that is not the release flavor (e.g. Nightly, LibreWolf, etc.), manually set xpinstall.signatures.required to false in about:config to install them.
For an easier-to-use (but less reliable) alternative, I highly recommend trying out the NoPoW extension (which is signed by Mozilla because it is just a UA changer). Main differences of this extension compared to NoPoW:
- ✅ Works on non-default or challenge-all config.
- ✅ Resistant to adopters who like to block specific request signatures because there is no request signature - vanilla vendor submission code is used.
- ✅ Does not try to circumvent any administrative rules, code is law.
- ❌ Will still "waste" CPU cycles, just (much) more efficiently.
- ❌ The extension may violate the Firefox submission guideline of being "self-contained". Although it can (and by default does) work without any server-side components, you can use the preferences page to have it connect to a native pow-buster server and execute solution scripts generated by the server. This keeps my workflow down to a single server-side integration and makes it quicker to move native acceleration from manual DevTools-based workflows to fully automated ones.
Limitations
We assume you have a relatively modern and powerful platform, specifically:
- A cold optimized build with end-to-end features may take up to 5 minutes as this program aggressively generates specialized kernels and build time isn't my priority.
- For Anubis target, this assumes the server is 64-bit (i.e. is able to accept a signed 64-bit nonce).
- AVX-512 build requires Rust 1.89 or later.
- All solvers are single-threaded and are intended to be scaled up using multiple workers optionally pinned to specific cores.
- This is designed for "low", practical-for-a-website difficulty settings. The worst-case chance of failure for any particular message offset is $1 - P_{geom}(8 \times 10^{8}, 1/\text{difficulty}) = (1 - 1/\text{difficulty})^{8 \times 10^{8}}$, with most offsets almost guaranteed to succeed eventually; for a difficulty of 1e8 (about 10 seconds in a browser for mCaptcha and an eternity for Anubis) this is about 0.03%. The go-away solver explores the full solution space and guarantees a solution if one exists.
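The 0.03% figure is the tail of a geometric distribution: with per-attempt success probability $1/D$ and a budget of $n = 8 \times 10^{8}$ attempts per message offset, the failure probability is $(1 - 1/D)^{n} \approx e^{-n/D}$, which for $D = 10^{8}$ is $e^{-8} \approx 0.034\%$. A minimal sketch of that calculation (failure_probability is a hypothetical helper name):

```rust
/// Probability that `budget` independent attempts, each succeeding with
/// probability 1/difficulty, all fail: (1 - 1/difficulty)^budget.
/// ln_1p keeps precision when 1/difficulty is tiny.
fn failure_probability(budget: f64, difficulty: f64) -> f64 {
    (budget * (-1.0 / difficulty).ln_1p()).exp()
}
```

Computing through ln_1p avoids the cancellation in 1.0 - 1e-8 that a naive powf would inherit.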
Ethical Disclaimer (i.e. the "How Dare you Publish this?" question)
This is neither a vulnerability nor anything previously unknown; it is a structural weakness that needs to be assessed. I didn't "skip" or somehow "simplify" any number of SHA-2 rounds; this is a materialized analysis of the performance characteristics of the system.
This is a structural limitation: PoW is meant for global consensus, not for maintaining a meaningful peer-to-peer "fair" hash rate margin, especially not against commodity hardware. The academic literature consistently notes that PoW systems lose protection margin to hardware or software optimizations. I implemented it, that's it.
Website operators deploying a PoW system bear the responsibility to understand the performance characteristics and security implications of their chosen PoW parameters, and whether they protect against the identified threat. The purpose of this research is to provide the statistical analysis and empirical validation data necessary for informed deployment decisions, including optimized CPU-only solutions.
Benchmark
Most of the formal comparison is done against mCaptcha, because they have a WASM solver and cannot be immediately dismissed as "that's JS overhead"/"we will do better later".
TL;DR: my extrapolated throughput for each approach, corroborated by empirical and formal benchmarks:

Formal Benchmark (mCaptcha only)
Solve times compared with the official solution, as reported by Criterion.rs. All runs are single-threaded except the "mCaptcha User Survey extrapolated" column, which uses all worker threads of the user's browser:
Results on an AMD Ryzen 9 7950X (32 hyperthreads). Where both are supported, the single-hash number comes first and the double-hash number second (there is a 90% chance your deployment is single-hash; this vagueness is IMO a design oversight by the mCaptcha team). All numbers are in milliseconds, compiled with -Ctarget-cpu=native unless otherwise specified.
| DFactor (equiv. Anubis difficulty) | AVX-512 | AVX-512 (32-byte salt) | Safe Optimized (+) [^1] | mCaptcha (+) | mCaptcha Generic x64 (+) | mCaptcha User Survey extrapolated [^2] |
|---|---|---|---|---|---|---|
| 50_000 (3.90) | 0.554/0.953 | 0.487 | 1.565 | 2.851/4.009 | 5.600/9.537 | 14.556 |
| 100_000 (4.15) | 1.105/1.903 | 0.978 | 3.172 | 5.698/7.817 | 11.152/18.575 | 29.11176 |
| 1_000_000 (4.98) | 11.138/18.515 | 9.707 | 31.622 | 54.931/80.029 | 117.34/188.41 | 291.118 |
| 4_000_000 (5.48) | 46.136/75.630 | 37.475 | 125.06 | 222.93/323.70 | 432.81/777.88 | 1164.471 |
| 10_000_000 (5.81) | 107.49/186.01 | 94.645 | 323.06 | 564.41/805.02 | DNS | 2911.18 |
(+) = SHA-NI and a standard SHA-256 implementation are used.
[^1]: Represents a custom implementation using safe, externally-validated cryptographic abstractions only and no platform-specific optimizations.
[^2]: Manivannan, A.; Sethuraman, S. C.; Vimala Sudhakaran, D. P. MCaptcha: Replacing Captchas with Rate Limiters to Improve Security and Accessibility. Communications of the ACM 2024, 67 (10), 70–80. https://doi.org/10.1145/3660628.
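The "equiv. Anubis difficulty" column is consistent with taking the base-16 logarithm of DFactor: each leading zero hex digit that Anubis requires multiplies the expected work by 16, so for example $\log_{16}(50\,000) \approx 3.90$. A one-line sketch of that conversion (equiv_anubis_difficulty is a hypothetical name):

```rust
/// Anubis difficulty d (leading zero hex digits) needs ~16^d expected hashes,
/// so an expected work of `dfactor` hashes corresponds to d = log16(dfactor).
fn equiv_anubis_difficulty(dfactor: f64) -> f64 {
    dfactor.log(16.0)
}
```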
End to End Benchmark
A default official docker-compose instance is used for the benchmark target (the default 33-byte salt was unchanged).
CPU only
The following were configured for difficulty 5_000_000 (default max tier).
Solve times for 17 consecutive solutions using the official Captcha widget: [0.105s, 1.69s, 1.06s, 1.89s, 1.91s, 1.09s, 1.80s, 0.97s, 0.71s, 1.15s, 3.59s, 1.09s, 0.14s, 3.98s, 1.26s, 1.05s, 1.26s]
> RUSTFLAGS="-Ctarget-cpu=native" \
Anubis "mild suspicion" (4, saturated Anubis Go runtime):
> target/release/pow-buster
Anubis "extreme suspicion" (6):
> target/release/pow-buster
All 32 hyperthreads of an AMD Ryzen 9 7950X are used for the end-to-end benchmark. We appear to be bottlenecked by the server's ability to record successful attempts, as further performance tuning only shows improvement in offline benchmarks.
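Because every solver is single-threaded (see Limitations), using all hyperthreads means running one solver per worker over disjoint slices of the nonce space. A minimal sketch of that pattern with scoped threads, assuming a strided nonce partition and a shared "first winner stops everyone" flag; toy_hash is a hypothetical stand-in for SHA-256/BLAKE3, and none of these names come from pow-buster.

```rust
use std::hash::{DefaultHasher, Hash, Hasher};
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::thread;

/// Hypothetical stand-in for the real PoW hash (NOT SHA-256 or BLAKE3).
fn toy_hash(challenge: &str, nonce: u64) -> u64 {
    let mut h = DefaultHasher::new();
    challenge.hash(&mut h);
    nonce.hash(&mut h);
    h.finish()
}

/// Toy acceptance test: roughly one nonce in `difficulty` passes.
fn meets_target(challenge: &str, nonce: u64, difficulty: u64) -> bool {
    toy_hash(challenge, nonce) < u64::MAX / difficulty
}

/// Runs `n_workers` single-threaded searches; worker i tries nonces
/// i, i + n_workers, i + 2 * n_workers, ... so no two workers overlap.
fn solve_parallel(challenge: &str, difficulty: u64, n_workers: u64) -> Option<u64> {
    let found = AtomicBool::new(false);
    let winner = AtomicU64::new(0);
    thread::scope(|s| {
        for worker_id in 0..n_workers {
            let (found, winner) = (&found, &winner);
            s.spawn(move || {
                let mut nonce = worker_id;
                // Stop as soon as any worker has published a solution.
                while !found.load(Ordering::Relaxed) {
                    if meets_target(challenge, nonce, difficulty) {
                        // Only the first worker to flip the flag records its nonce.
                        if !found.swap(true, Ordering::AcqRel) {
                            winner.store(nonce, Ordering::Relaxed);
                        }
                        return;
                    }
                    match nonce.checked_add(n_workers) {
                        Some(next) => nonce = next,
                        None => return, // this worker's slice is exhausted
                    }
                }
            });
        }
    });
    // thread::scope joins every worker, so both atomics are final here.
    found.into_inner().then(|| winner.into_inner())
}
```

Striding keeps the partition free of coordination; pinning each worker to a core, as suggested in Limitations, would be an OS-specific addition on top of this.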
Cap.js Browser Comparison
Cap.js is a good end-to-end comparison target since it:
- Uses multiple sub-goals instead of one big goal, resulting in an approximately normal rather than geometric distribution of solve times
- Has an official browser benchmark performed by BrowserStack
- Has a WASM solver written in Rust
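With k sub-goals, the total work is a sum of k geometric variables, so its relative spread shrinks as $1/\sqrt{k}$ and, by the central limit theorem, the total approaches a normal distribution. A minimal sketch comparing one big goal with 50 sub-goals of the same expected total work, assuming each difficulty-4 Cap.js sub-goal succeeds with probability $16^{-4}$ per attempt (an assumption about how Cap.js maps difficulty to hex digits); attempts_cv is a hypothetical name.

```rust
/// Coefficient of variation (std dev / mean) of the total number of attempts
/// when `k` independent sub-goals each succeed with probability `p` per attempt.
/// One geometric variable has mean 1/p and variance (1-p)/p^2; summing k of them
/// gives mean k/p and variance k(1-p)/p^2, so CV = sqrt((1-p)/k).
fn attempts_cv(p: f64, k: u32) -> f64 {
    ((1.0 - p) / k as f64).sqrt()
}
```

A single goal has a CV close to 1 (a geometric solve time is about as spread out as it is long), while 50 sub-goals bring it down to roughly 0.14.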
Cap.js published the following benchmark as of 09/12/2025:
| Tier | Device | Chrome | Safari |
|---|---|---|---|
| Low-end | Samsung Galaxy A11 | 4.583s | - |
| Low-end | iPhone SE (2020) | - | 1.282s |
| Mid-range | Google Pixel 7 | 1.027s | - |
| Mid-range | iPad (9th gen) | - | 1.401s |
| High-end | Google Pixel 9 | 0.894s | - |
| High-end | MacBook Air M3 | 0.312s | 0.423s |
Tested with BrowserStack using the following configuration:
- Challenge difficulty: 4
- Number of challenges: 50
- Salt/challenge size: 32
- Number of benchmarks: 50
We set up a local Cap.js server, set it to the same difficulty but 5000 subgoals instead of 50.
> target/release/pow-buster
{
}
We solved it in ~200 ms using 32 threads on a 7950X at 1.633 GH/s (faster because of hotstarting), about 150x faster than the MacBook Air M3. Accounting for the thread count advantage (32 threads on the 7950X vs 8 threads on the MacBook Air M3), that is about a 40x "net" speedup per thread.
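The ~150x and ~40x figures follow from linear scaling: the M3 needs 0.312 s for 50 sub-goals, so 5000 sub-goals would take about 31.2 s, versus ~0.2 s here. A small sketch of that arithmetic, assuming solve time scales linearly with the sub-goal count and that each difficulty-4 sub-goal costs $16^4$ expected hashes (which also lands close to the observed 1.633 GH/s); the function names are hypothetical.

```rust
/// Overall and per-thread speedup of "ours" over a reference run, assuming
/// solve time scales linearly with the number of sub-goals.
fn speedup(
    ref_secs: f64, ref_subgoals: f64, ref_threads: f64,
    our_secs: f64, our_subgoals: f64, our_threads: f64,
) -> (f64, f64) {
    // Time the reference device would need for our workload.
    let ref_projected = ref_secs / ref_subgoals * our_subgoals;
    let overall = ref_projected / our_secs;
    // Normalize by the thread count each side used.
    (overall, overall * ref_threads / our_threads)
}

/// Expected hash rate implied by solving `subgoals` sub-goals in `secs`.
fn implied_hash_rate(subgoals: f64, hashes_per_subgoal: f64, secs: f64) -> f64 {
    subgoals * hashes_per_subgoal / secs
}
```

For example, speedup(0.312, 50.0, 8.0, 0.2, 5000.0, 32.0) gives roughly (156, 39), and implied_hash_rate(5000.0, 65536.0, 0.2) gives about 1.64e9 hashes per second.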
Throughput Sanity Check
Just as a sanity check that we are actually getting effective data parallelism, and that the difference is not just due to implementation overhead, here are the numbers from OpenSSL with SHA-NI support:
The programs were built with -Ctarget-feature=+avx512f and -Ctarget-feature=+sha,+avx respectively and run on a 7950X with mitigations enabled.
Single Threaded
> openssl
The single-threaded throughput for OpenSSL with SHA-NI support is about 12.94 MH/s (828.2MB/s) single block, 42.00 MH/s (2.86 GB/s) continuous.
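The MB/s and MH/s figures above are related by the 64-byte SHA-256 block size: OpenSSL's 828.2 MB/s on single-block inputs corresponds to 828.2 / 64 ≈ 12.94 million compressions per second, and the same conversion relates the multi-threaded GB/s figures to GH/s. A one-line sketch (hashes_per_second is a hypothetical helper name):

```rust
/// SHA-256 consumes 64-byte blocks, so for single-block inputs one hash is one
/// compression: hashes/s = bytes/s / 64. For long "continuous" inputs this
/// counts compressions rather than complete messages.
fn hashes_per_second(bytes_per_second: f64) -> f64 {
    bytes_per_second / 64.0
}
```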
For us we have single thread:
| Workload | AVX-512 log | SHA-NI log | Chromium SIMD128 log |
|---|---|---|---|
| SingleBlock/Anubis | 89.16 MH/s | 62.19 MH/s | 14.74 MH/s |
| DoubleBlock (mCaptcha edge case) | 53.28 MH/s | 42.55 MH/s | Not Tested |
| go-away (32 bytes) | 98.42 MH/s | 78.10 MH/s | Not Tested |
| Cerberus (BLAKE3) | 205.98 MH/s | N/A | 49.86 MH/s |
On a mobile CPU (i7-11370H), similar performance can be achieved on AVX-512 (at a higher IPC due to Intel having faster register rotations):
| Workload | AVX-512 | SHA-NI |
|---|---|---|
| SingleBlock/Anubis | 72.30 MH/s | 21.87 MH/s |
| DoubleBlock (mCaptcha edge case) | 44.84 MH/s | 14.46 MH/s |
| go-away (32 bytes) | 80.53 MH/s | 20.42 MH/s |
| Cerberus (BLAKE3) | 179.07 MH/s | N/A |
On a 7950X, the throughput for Anubis and go-away is about 100 kH/s on Chromium and about 20% of that on Firefox; this is corroborated by Anubis's own code comments based on empirical testing on a 7950X3D. The empirical throughput of WASM-based mCaptcha is hard to pin down due to the lack of official benchmark tools, but it should be around 2-4 MH/s, consistent with the mCaptcha authors' CACM paper.
Multi Threaded
The peak throughput on 7950X reported by openssl speed -multi 32 sha256 is 239.76 MH/s (15.34 GB/s) single block, 1.14 GH/s (73.24 GB/s) continuous.
| Workload | AVX-512 log | SHA-NI | Vendor Official on Chromium [^4] |
|---|---|---|---|
| SingleBlock/Anubis | 1.465 GH/s | 1.143 GH/s | ~650kH/s |
| DoubleBlock (mCaptcha edge case) | 850.97 MH/s | 827.74 MH/s | N/A |
| go-away (32 bytes) | 1.564 GH/s | 1.291 GH/s | N/A |
| Cerberus (BLAKE3) | 3.426 GH/s | N/A | ~465MH/s (Cherry picked from this repo) |
[^4]: Due to the instability of WASM optimization, runtime throttling behavior and the lack of a vendor-provided benchmark harness, only approximate numbers can be provided.
On EPYC 9634 with better thermals, OpenSSL has 598.28 MH/s (38.29 GB/s) single block, 1.91 GH/s (122.54 GB/s) continuous.
| Workload | AVX-512 | SHA-NI |
|---|---|---|
| SingleBlock/Anubis | 3.387 GH/s | 2.09 GH/s |
| DoubleBlock (mCaptcha edge case) | 1.861 GH/s | 1.64 GH/s |
| go-away (32 bytes) | 3.826 GH/s | 3.15 GH/s |
| Cerberus (BLAKE3) | 8.874 GH/s | N/A |
Security Implications and Responsible Reporting
The performance gap between optimized native code and browser JavaScript (>100x) makes it impractical to set difficulty levels that are both:
- High enough to prevent automated solving on native hardware
- Low enough to be solvable in browsers within reasonable timeouts
These findings suggest that both designing and adopting a PoW-based CAPTCHA system may need additional verification mechanisms beyond empirical testing.
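To make the tension concrete: if the difficulty is tuned so that a browser hashing at rate $r_b$ spends $T$ seconds per challenge, a native solver at rate $r_n$ clears $r_n / (r_b T)$ challenges per second. A sketch using the Anubis multi-threaded figures from the Benchmark section (~650 kH/s for vendor code on Chromium vs 1.465 GH/s with AVX-512 on the 7950X) and an assumed 5-second browser budget; native_solves_per_second is a hypothetical name.

```rust
/// Given a browser hash rate, a native hash rate and the time budget a site
/// is willing to impose on a browser user, return the expected hashes per
/// challenge and how many such challenges the native solver clears per second.
fn native_solves_per_second(browser_hps: f64, native_hps: f64, browser_budget_secs: f64) -> (f64, f64) {
    let expected_hashes = browser_hps * browser_budget_secs;
    (expected_hashes, native_hps / expected_hashes)
}
```

With these inputs a challenge costs about 3.25 million expected hashes and the native solver clears roughly 450 of them per second; raising the difficulty to slow the native solver down scales the browser's wait by the same factor.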
A good-faith outreach was made to Anubis by email on 9/19/2025, asking for comments on similar concerns about lacking efficacy/transparency, but no reply had been received as of 11/04/2025.
Additionally, we have opened issues upstream where there are clear performance regressions (i.e. not just optimization by a factor but non-linear server-side performance degradation). Here are the current statuses:
| Project | Issue / PR | Reported Issue | Upstream response |
|---|---|---|---|
| mCaptcha | #186 | Difficulty inversion; Spin loop, stalls at ~200 rps | Pending Since 06/05/2025 |
| Anubis | #1103 | Lock-convoy on certain backend caps at 5-6 k rps | Fixed only for in-memory DB (pending algo tweak) |
| go-away | – | not evaluated | – |
| Cap.js | #97 | Difficulty inversion; Event-loop starvation, drops from 400 → 50 rps | Declined ("out-of-scope"/suggested IP RL) |
| GoToSocial | PR 4433 | Structural bias | Feature removed |
All load tests were performed using the live command with the following methodology:
- For projects with multiple difficulty presets, the highest difficulty preset was used
- All requests are strictly proof-of-work round-trips and contain valid proofs of work. No backend services were hooked up.
- Supplementary features that are irrelevant to this study such as traditional IP rate-limiting were disabled.
- These tests were performed on a 16-core (32-thread) AMD Ryzen 9 7950X; the actual ratio/capacity may vary depending on server capacity and topology.
- "Difficulty inversion" was defined as the server not being able to fully load all 32 hardware threads at a ratio of at least 1:1 in latency (as reported by the http_wait metric at the 50% cutoff) or in CPU usage, relative to pow-buster's throughput.
Future Work (i.e. Okay, so what would be a good PoW then?)
Conflict of Interest Disclosure
This work is not funded or sponsored by any external entity and is entirely my own research on my own time. While I personally hold a negative ethical and technical position against using Proof of Work on the open web in general, no vendor, competitor, professional affiliation or bounty program otherwise influenced its content.
For benchmarks conducted against vendor code, I am happy to rerun and/or clarify any part of the analysis on request if a vendor believes that particular performance characteristics demonstrated here no longer apply or require further clarification. Create an issue with:
- An official release or beta-release tag that you want to be benchmarked.
- Proposed methodology (commands to build your project, environment setup, etc).
I will rerun the benchmarks, update the relevant sections and keep all communications on record.
Contributing / Wishlist
Contributions are welcome; roughly in priority order, we want:
- General profiling and further optimization.
- Would be nice to have a WebGPU solution that can be used in a UserScript.
- An AVX2 solution and corresponding benchmark (low priority, as this isn't really a "product").
License and Acknowledgments
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
This project contains some copy-pasted or minimally modified/transpiled code from:
- the sha2 crate, in the core SHA-2 routine in sha256.rs.
- milakov's int_fastdiv, in strings.rs.