spider-tendril 0.5.1

Send-able tendril fork (atomic refcount) for high-concurrency HTML parsing
spider-tendril

A Send-by-default fork of tendril for high-concurrency HTML parsing.

Why this fork?

Upstream tendril declares Tendril<F, A = NonAtomic>, so refcounting is non-atomic by default. That makes StrTendril and ByteTendril !Send, and the !Send bound propagates through markup5ever::BufferQueue, html5ever::Tokenizer, html5ever::TreeBuilder, and html5ever::Parser. None of these types can cross thread boundaries, so a future that holds an html5ever parser across an .await point cannot be Send.
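The underlying rule can be seen with the standard library alone. The sketch below is an analogy, not tendril code: it uses Arc<str> and Rc<str> as stand-ins for atomically and non-atomically refcounted buffers, and the helper names (assert_send, check_bounds, len_on_worker) are made up for illustration.

```rust
use std::sync::Arc;
use std::thread;

/// Compiles only when `T` may be moved to another thread.
fn assert_send<T: Send>() {}

/// Compile-time check: an atomically refcounted buffer is `Send`.
pub fn check_bounds() {
    assert_send::<Arc<str>>();
    // assert_send::<std::rc::Rc<str>>(); // rejected by the compiler: `Rc<str>` is `!Send`
}

/// Moves a shared buffer into a worker thread and returns its length in bytes.
pub fn len_on_worker(buf: Arc<str>) -> usize {
    thread::spawn(move || buf.len())
        .join()
        .expect("worker thread panicked")
}
```

Uncommenting the Rc line produces a compile error. A multi-threaded async runtime applies the same kind of Send bound to any future it spawns onto its worker threads.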

This fork flips a single default: Tendril<F, A = Atomic>. The struct is otherwise unchanged. As a result:

  • StrTendril and ByteTendril are now Send + Sync by default.
  • The whole markup5ever / html5ever parser stack can be made Send simply by transitively depending on spider-tendril (see spider-markup5ever and spider-html5ever).
  • Parser state can move freely between tokio worker threads in a multi-threaded async runtime.
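How a single default type parameter can change thread-safety is easier to see in a toy model. The sketch below is not tendril's implementation: Buf, the Ptr associated type, and the use of Rc and Arc as backing storage are assumptions made for illustration, and only the names Atomicity, Atomic, and NonAtomic are borrowed from the text above.

```rust
use std::rc::Rc;
use std::sync::Arc;

/// Toy marker trait: picks the refcounted pointer used for storage.
pub trait Atomicity {
    type Ptr: Clone + AsRef<[u8]> + From<Vec<u8>>;
}

/// Non-atomic refcounting (backed by `Rc`, so `!Send`).
pub struct NonAtomic;
/// Atomic refcounting (backed by `Arc`, so `Send`).
pub struct Atomic;

impl Atomicity for NonAtomic {
    type Ptr = Rc<[u8]>;
}
impl Atomicity for Atomic {
    type Ptr = Arc<[u8]>;
}

/// Toy buffer whose refcount flavor defaults to `Atomic`.
pub struct Buf<A: Atomicity = Atomic> {
    ptr: A::Ptr,
}

impl<A: Atomicity> Buf<A> {
    pub fn new(bytes: Vec<u8>) -> Self {
        Buf { ptr: bytes.into() }
    }
    pub fn as_bytes(&self) -> &[u8] {
        self.ptr.as_ref()
    }
}

fn assert_send<T: Send>() {}

/// Compile-time check that the defaulted type is `Send`.
pub fn default_is_send() {
    assert_send::<Buf>(); // `Buf` here means `Buf<Atomic>`
    // assert_send::<Buf<NonAtomic>>(); // would not compile: `Rc<[u8]>` is `!Send`
}
```

Code that names the type without a second parameter picks up the default, which is why changing the default alone is enough to change which auto traits downstream types receive. Writing Buf<NonAtomic> explicitly opts back into the non-atomic flavor.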

The cost is a few extra atomic ops per refcount bump (≈5–10 ns each). Behavior, parse output, and the public API are identical to upstream. If you specifically need non-atomic refcounting for a performance-critical single-threaded use case, write Tendril<F, NonAtomic> explicitly.

Library name

The crate publishes as spider-tendril on crates.io but the library itself is still imported as tendril:

# Cargo.toml
[dependencies]
spider-tendril = "0.5"

// Rust source
use tendril::StrTendril;

This means existing code that uses tendril types compiles without changes — just swap the dependency.

Original tendril docs

Tendril is a compact string/buffer type optimized for zero-copy parsing. Tendrils have the semantics of owned strings, but are sometimes views into shared buffers. When you mutate a tendril, an owned copy is made if necessary; further mutations occur in-place until the string becomes shared (e.g. via clone() or subtendril()).
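The copy-on-write behavior described above has a close standard-library analogue in Arc::make_mut. The sketch below uses Arc<String> rather than a tendril, and append_cow is a hypothetical helper; it only illustrates the "copy if shared, otherwise mutate in place" rule.

```rust
use std::sync::Arc;

/// Appends `extra` to a possibly shared string.
/// Returns `true` if the buffer had to be copied because another handle existed.
pub fn append_cow(s: &mut Arc<String>, extra: &str) -> bool {
    let before = Arc::as_ptr(s);
    // `make_mut` clones the contents only when the refcount is greater than one.
    Arc::make_mut(s).push_str(extra);
    before != Arc::as_ptr(s)
}
```

With two handles to the same string, the first append through one of them copies and leaves the other handle untouched. Later appends through the now-unique handle happen in place.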

Tendril uses phantom types to track a buffer's format. This determines at compile time which operations are available on a given tendril — for example, Tendril<UTF8> and Tendril<Bytes> can be borrowed as &str and &[u8] respectively.
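The phantom-type idea can be sketched in a few lines. This is a simplified model, not tendril's API: Buf, Utf8, Bytes, from_text, from_bytes, and try_into_utf8 are hypothetical names, and the storage is a plain Vec<u8>.

```rust
use std::marker::PhantomData;

/// Format markers; they carry no data and exist only at the type level.
pub struct Utf8;
pub struct Bytes;

/// Toy buffer tagged with its format.
pub struct Buf<F> {
    data: Vec<u8>,
    _format: PhantomData<F>,
}

impl Buf<Bytes> {
    pub fn from_bytes(data: Vec<u8>) -> Self {
        Buf { data, _format: PhantomData }
    }
    pub fn as_bytes(&self) -> &[u8] {
        &self.data
    }
    /// Validates the bytes and, if they are UTF-8, re-tags the same buffer.
    pub fn try_into_utf8(self) -> Result<Buf<Utf8>, Buf<Bytes>> {
        if std::str::from_utf8(&self.data).is_ok() {
            Ok(Buf { data: self.data, _format: PhantomData })
        } else {
            Err(self)
        }
    }
}

impl Buf<Utf8> {
    pub fn from_text(s: &str) -> Self {
        Buf { data: s.as_bytes().to_vec(), _format: PhantomData }
    }
    /// Only available on UTF-8 buffers, which are valid by construction.
    pub fn as_str(&self) -> &str {
        std::str::from_utf8(&self.data).expect("validated at construction")
    }
}
```

Calling as_str on a Buf<Bytes> is a compile error rather than a runtime check, which is the property the paragraph above describes.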

Whereas String allocates on the heap for any non-empty string, Tendril can store small strings (up to 8 bytes) inline. Tendril is also smaller than String on 64-bit platforms — 16 bytes versus 24. Option<Tendril> is the same size as Tendril.
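The size figures can be explored with std::mem::size_of. The Handle struct below is a hypothetical 16-byte layout chosen to mirror the numbers in the text (one non-zero word plus two 32-bit fields); it is not tendril's actual definition, and the results assume a typical 64-bit target.

```rust
use std::mem::size_of;
use std::num::NonZeroUsize;

/// Hypothetical handle: a non-zero word plus 8 bytes that could hold
/// either (length, auxiliary) fields or a short inline payload.
#[allow(dead_code)]
pub struct Handle {
    ptr_or_tag: NonZeroUsize,
    len: u32,
    aux: u32,
}

/// Returns the sizes of `Handle`, `Option<Handle>`, and `String` in bytes.
pub fn sizes() -> (usize, usize, usize) {
    (
        size_of::<Handle>(),
        size_of::<Option<Handle>>(),
        size_of::<String>(),
    )
}
```

Because the first field can never be zero, the compiler can use the all-zero bit pattern to represent None, so wrapping the handle in Option need not add any bytes.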

The maximum length of a tendril is 4 GB. The library will panic if you attempt to go over the limit.

License

Licensed under either of Apache License, Version 2.0 or MIT license at your option, matching the upstream tendril license.