pub struct TokenTree { /* private fields */ }Expand description
Token-based radix tree for cache-aware routing.
Implementations§
Source§impl TokenTree
impl TokenTree
Sourcepub fn new() -> Self
pub fn new() -> Self
Create a new TokenTree with default settings (page_size=16, eviction_policy=LRU).
Sourcepub fn with_policy(policy: EvictionPolicy) -> Self
pub fn with_policy(policy: EvictionPolicy) -> Self
Create a new TokenTree with a specific eviction policy (uses default page_size=16). The policy should match the backend worker’s policy to stay in sync.
Sourcepub fn with_config(page_size: usize, policy: EvictionPolicy) -> Self
pub fn with_config(page_size: usize, policy: EvictionPolicy) -> Self
Create a new TokenTree with specific page size and eviction policy.
§Arguments
page_size- Token page size for grouping (must match backend worker’s page size). SGLang supports: 1, 16, 32, 64, 128 depending on attention backend. Note: Currently only page_size=16 is supported at compile time.policy- Eviction policy (should match the backend worker’s policy).
§Panics
Panics if page_size != 16 (compile-time limitation, future versions may support other sizes).
Sourcepub fn eviction_policy(&self) -> EvictionPolicy
pub fn eviction_policy(&self) -> EvictionPolicy
Get the current eviction policy.
Sourcepub fn insert_tokens(&self, tokens: &[u32], tenant: &str)
pub fn insert_tokens(&self, tokens: &[u32], tenant: &str)
Insert a token sequence with associated tenant.
Page-aligned: Input is aligned to PAGE_SIZE boundary. Sequences shorter than PAGE_SIZE are skipped (no cache benefit).
Sourcepub fn match_prefix_with_counts(&self, tokens: &[u32]) -> PrefixMatchResult
pub fn match_prefix_with_counts(&self, tokens: &[u32]) -> PrefixMatchResult
Find longest matching prefix with detailed counts.
Page-aligned: Input is aligned to PAGE_SIZE boundary before lookup. Sequences shorter than PAGE_SIZE return 0 matched tokens.
Sourcepub fn match_and_insert(
&self,
tokens: &[u32],
tenant: &str,
) -> PrefixMatchResult
pub fn match_and_insert( &self, tokens: &[u32], tenant: &str, ) -> PrefixMatchResult
Combined match + insert in a SINGLE tree descent.
Equivalent to calling Self::match_prefix_with_counts immediately
followed by Self::insert_tokens with the same tokens, but it
traverses the prefix only once. For long prefixes (e.g. 150K tokens)
this halves the number of page-key lookups, child-map probes, token
comparisons, and touch_tenant writes on the request hot path.
§Why a single descent is correct
insert_tokens descends through full-match children exactly the way
match_prefix_with_counts does (same page key, same page-aligned common
prefix, same “continue on full match” rule). Insert’s descent is a
(depth-wise) superset of match’s: match stops as soon as it hits a node
with no tenants (all evicted) or a partial match, whereas insert keeps
going on a full match and only stops when it must create or split a node.
So we drive the descent with insert’s logic and freeze the match result
the first time match’s stop condition is reached (tracked by
match_frozen). After freezing, deeper nodes only affect insert
accounting, never the returned match result — exactly as if the separate
match call had already returned.
§Preserved semantics (identical to match-then-insert)
- Node split: the vacant / prefix-of-child / diverge branches below
are copied verbatim from
insert_tokens(intermediate node inherits the child’s tenant map, hit_count, creation_time, priority; child is demoted to the suffix; parent pointers / page keys rewritten the same way). - Per-tenant token counting:
tokens_addedis accumulated with the same rules asinsert_tokens(full-match Continue countscommon_len; the split branches countcommon_lenonly when the tenant did not already own the path, plus any brand-new branch tokens) and folded intotenant_token_countonce at the end. touch_tenant/ timestamps: at every node we reproduce the exact touch sequence the two separate calls would have made, in the same order, on the same node identities:- The match side reads
child.get_any_tenant()before any insert mutation (so the all-evicted check and the routed tenant are computed against the pre-insert node), then touches that tenant — exactly whatmatch_prefix_with_countsdoes, and before any split so the split’s tenant-map clone inherits the fresh timestamp just like the original ordering (matchran fully beforeinsert). - The insert side then touches the inserting
tenanton the node it would have (the continued child, the newly created leaf, or the new intermediate). We deliberately do NOT deduplicate the two touches even when they land on the same node for the same tenant: the original match-then-insert pair already touched twice in that case (e.g. a full-match continuation whoseget_any_tenant()is the routed/inserting tenant), so keeping both touches reproduces the LFUhit_countand timestamp progression byte-for-byte. For the default LRU policy the second touch only advances the timestamp, which is immaterial to relative ordering.
- The match side reads
Sourcepub fn match_and_insert_with<'t, F>(
&self,
tokens: &[u32],
select: F,
) -> PrefixMatchResult
pub fn match_and_insert_with<'t, F>( &self, tokens: &[u32], select: F, ) -> PrefixMatchResult
Match in a single descent, then choose the insert tenant from the match result and insert for it — still a SINGLE top-to-bottom traversal of the prefix.
This is the variant the cache-aware router needs: the worker it inserts
for is derived from the match outcome (route to the matched worker on a
cache hit, else to the least-loaded worker), so the inserting tenant is
not known until the match completes. select is invoked exactly once,
after the match, with the PrefixMatchResult; returning Some(tenant)
inserts tokens for that tenant, and returning None skips the insert
entirely (mirroring the router’s “selected worker is gone” branch, which
performs no insert).
§How the single descent is achieved
The match phase walks the prefix exactly like
Self::match_prefix_with_counts, additionally recording the chain of
nodes it traverses as path (one (node, advance) per edge) and the
node/remaining slice where the walk fell off. After select yields the
tenant we replay insert’s per-node work directly on the recorded nodes
(no second tree navigation): touch_tenant on every traversed node plus
the same tokens_added accounting, then splice the remainder at the
fall-off point with the very branches insert_tokens uses (vacant leaf /
prefix-of-child split / diverge split). The only tree lookups are the
single descent and one entry() re-probe at the fall-off node for the
splice — never a second full walk.
§Preserved semantics
Identical to match_prefix_with_counts followed by
insert_tokens(tokens, tenant):
- match-side: same matched-token count, same routed tenant, same per-node
touch_tenant(get_any_tenant()), same “stop at an all-evicted node or a partial match” rule; - insert-side: same node-split structure, same
tenant_token_countaccounting (including the existing behavior of re-counting a fully matched path), sametouch_tenant(tenant)on every node on the path and on freshly created/split nodes.
The replay reorders insert’s per-node touches to after the match phase instead of interleaving them, which only changes the exact monotonic timestamp values written (immaterial to relative LRU/LFU ordering); the set of touched (node, tenant) pairs and the LFU hit-count increments are unchanged.
§Equivalence scope (single-threaded vs concurrent)
The “identical” guarantees above hold exactly single-threaded
(covered by test_match_and_insert_equiv_*). Under concurrent node
splits the equivalence is on routing and tenant_token_count, not on
per-node access timestamps: this path replays onto the nodes recorded
during the match rather than re-walking, so if another thread splits one
of those nodes mid-flight, a freshly-created intermediate can miss a
timestamp bump — it is then evicted slightly earlier (a bounded,
self-healing cache-hit-rate effect on an approximate tree). The token
count stays exact because the recorded per-node advances are frozen at
match time; test_concurrent_match_and_insert_count_and_route_integrity
proves both the exact count and route integrity under a concurrent-split
stress.
Sourcepub fn prefix_match_legacy(&self, tokens: &[u32]) -> (Vec<u32>, String)
pub fn prefix_match_legacy(&self, tokens: &[u32]) -> (Vec<u32>, String)
Legacy prefix_match API returning (matched_tokens, tenant_string).
Sourcepub fn get_tenant_token_counts(&self) -> HashMap<String, usize>
pub fn get_tenant_token_counts(&self) -> HashMap<String, usize>
Get token counts per tenant.
Sourcepub fn evict_tenant(&self, tenant: &TenantId, max_tokens: usize)
pub fn evict_tenant(&self, tenant: &TenantId, max_tokens: usize)
Evict entries for a tenant to reduce to max_tokens. Uses leaf-first eviction with the configured policy (matching SGLang’s behavior).
Sourcepub fn tenant_token_size(&self, tenant: &TenantId) -> usize
pub fn tenant_token_size(&self, tenant: &TenantId) -> usize
Get the token count for a specific tenant.
Sourcepub fn evict_tenant_by_size(&self, max_size: usize)
pub fn evict_tenant_by_size(&self, max_size: usize)
Evict cache entries by total token count to reduce memory usage. Convenience method matching the StringTree API.
For each tenant, reduces their token usage to max_size if they exceed it.
Sourcepub fn iter_entries(&self) -> EntriesIter
pub fn iter_entries(&self) -> EntriesIter
Lazily walk the tree in pre-order, yielding each node’s
(prefix_tokens, [(tenant, epoch)]) pair without
materializing the full tree as a single buffer.
Empty-tenant nodes are skipped (they’re routing
intermediates, not real entries). Children are visited in
lexicographic page-key order so callers paging the
iterator can resume by re-running and skipping the first
N items if the tree is unchanged.
The walk is not atomic — concurrent insert_tokens may
split or replace nodes mid-walk; the iterator may then
reflect a mix of pre- and post-split state. Acceptable
for mesh sync (eventual consistency).