StringTape
Memory-efficient collection classes for variable-length strings, co-located on a contiguous "tape".
- Convertible to Apache Arrow String/LargeString & Binary/LargeBinary arrays
- Compatible with UTF-8 & binary strings in Rust via CharsTape and BytesTape
- Usable in no_std and with custom allocators for GPU & embedded use cases
- Sliceable into zero-copy borrow-checked views with [i..n] range syntax
Why?
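let doc = read_to_string(path)?;                  // 1.0 GB
let words = doc.split_whitespace();               // ~ 160 M words
let buffers = words.clone().map(|w| w.to_string());
let _ = Vec::<String>::from_iter(buffers);        // + 7.1 GB copied ❌
let _ = CharsTapeAuto::from_iter(words.clone());  // + 1.3 GB copied ✅
let _ = Vec::<&str>::from_iter(words.clone());    // + 1.9 GB copy-less ⚠️
// Second argument (the source data) is assumed to be a borrowed byte slice
let _ = CharsCowsAuto::from_iter_and_data(words, Cow::Borrowed(doc.as_bytes()));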
"Tape" classes copy data into contiguous buffers for cache-friendly iteration. "Cows" classes reference existing data without copies.
Quick Start
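use std::borrow::Cow;
use stringtape::{CharsCowsAuto, CharsTapeAuto, CharsTapeI32};
// Create a new CharsTape with 32-bit offsets
let mut tape = CharsTapeI32::new();
tape.push("hello")?;
tape.push("world")?;
assert_eq!(tape.len(), 2);
assert_eq!(&tape[0], "hello");
assert_eq!(tape.get(1), Some("world"));
// Iterate over strings
for s in &tape {
    println!("{}", s);
}
// Build from iterator with auto type selection
let tape2 = CharsTapeAuto::from_iter(["a", "b", "c"]);
assert_eq!(tape2.len(), 3);
// Zero-copy slices referencing existing data
// (the form of the data argument is assumed here)
let data = "hello world";
let cows = CharsCowsAuto::from_iter_and_data(data.split_whitespace(), Cow::Borrowed(data.as_bytes()))?;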
Memory Layout
CharsTape and BytesTape use the same memory layout as Apache Arrow string and binary arrays:
Data buffer: [h,e,l,l,o,w,o,r,l,d]
Offset buffer: [0, 5, 10]
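The layout above can be sketched with the standard library alone. This is only an illustration of the data-plus-offsets idea, not the crate's implementation: MiniTape and its methods are hypothetical names, and the error type is simplified to a string.

```rust
use std::convert::TryFrom;

/// Hypothetical illustration type: one byte buffer plus N + 1 offsets.
pub struct MiniTape {
    data: Vec<u8>,
    offsets: Vec<i32>,
}

impl MiniTape {
    /// An empty tape still holds the leading 0 offset.
    pub fn new() -> Self {
        Self { data: Vec::new(), offsets: vec![0] }
    }

    /// Appends the bytes of `s` and records where the item ends.
    pub fn push(&mut self, s: &str) -> Result<(), &'static str> {
        let end = self.data.len().checked_add(s.len()).ok_or("offset overflow")?;
        let end = i32::try_from(end).map_err(|_| "offset overflow")?;
        self.data.extend_from_slice(s.as_bytes());
        self.offsets.push(end);
        Ok(())
    }

    /// Number of stored strings.
    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Item `i` lives in `data[offsets[i]..offsets[i + 1]]`.
    pub fn get(&self, i: usize) -> Option<&str> {
        let start = *self.offsets.get(i)? as usize;
        let end = *self.offsets.get(i + 1)? as usize;
        std::str::from_utf8(&self.data[start..end]).ok()
    }

    pub fn data(&self) -> &[u8] {
        &self.data
    }

    pub fn offsets(&self) -> &[i32] {
        &self.offsets
    }
}
```

Pushing "hello" and then "world" into this sketch leaves the data buffer holding the ten bytes of "helloworld" and the offset buffer holding [0, 5, 10], matching the diagram.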
API Overview
Basic Operations
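use stringtape::CharsTapeI32;
let mut tape = CharsTapeI32::new();
tape.push("hello")?;                     // Append one string
tape.extend(["world", "foo"])?;          // Append an array
assert_eq!(&tape[0], "hello");           // Direct indexing
assert_eq!(tape.get(1), Some("world"));  // Safe access
for s in &tape {
    println!("{}", s);
}
// Construct from iterator
let tape2: CharsTapeI32 = ["a", "b", "c"].into_iter().collect();
// Sort in-place
tape.sort();
tape.sort_by(|a, b| a.len().cmp(&b.len()));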
BytesTape and CharsCowsAuto/BytesCowsAuto provide the same interface.
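The Cows variants share this interface but, as noted in the introduction, reference existing data instead of copying it. A minimal std-only sketch of that idea follows. MiniCows is a hypothetical type, and the (u32, u16) offset and length widths are arbitrary choices for the example rather than what the crate selects.

```rust
use std::convert::TryFrom;

/// Hypothetical illustration type: borrows the source text and stores
/// only (byte offset, byte length) pairs into it.
pub struct MiniCows<'a> {
    data: &'a str,
    spans: Vec<(u32, u16)>,
}

impl<'a> MiniCows<'a> {
    /// Records where each whitespace-separated word sits inside `data`.
    /// Returns None if an offset or length does not fit the chosen widths.
    pub fn from_words(data: &'a str) -> Option<Self> {
        let base = data.as_ptr() as usize;
        let mut spans = Vec::new();
        for word in data.split_whitespace() {
            // Each word is a sub-slice of `data`, so pointer arithmetic
            // recovers its byte offset without scanning again.
            let offset = word.as_ptr() as usize - base;
            let offset = u32::try_from(offset).ok()?;
            let len = u16::try_from(word.len()).ok()?;
            spans.push((offset, len));
        }
        Some(Self { data, spans })
    }

    /// Number of referenced items.
    pub fn len(&self) -> usize {
        self.spans.len()
    }

    /// Returns a slice of the original data; no bytes are copied.
    pub fn get(&self, i: usize) -> Option<&'a str> {
        let (offset, len) = *self.spans.get(i)?;
        let start = offset as usize;
        self.data.get(start..start + len as usize)
    }
}
```

The returned slices point into the original string, so the per-item cost is only the stored offset and length, while the source data must outlive the collection.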
Views and Slicing
let view = tape.view();                // View entire tape
let subview = tape.subview(1, 3)?;     // Items [1, 3)
let nested = subview.subview(0, 1)?;   // Nested subviews
let raw_bytes = &tape.view()[1..3];    // Raw byte slice
// Views have same API as tapes
assert_eq!(subview.len(), 2);
assert_eq!(&subview[0], "world");
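One way to see why subviews can be zero-copy: a view only needs a borrowed window over the offsets, while the data buffer stays shared. The sketch below assumes that design; MiniView is a hypothetical std-only type, not the crate's view type.

```rust
use std::convert::TryFrom;

/// Hypothetical illustration type: borrows both buffers of a tape.
#[derive(Clone, Copy)]
pub struct MiniView<'a> {
    data: &'a [u8],
    offsets: &'a [i32], // n items need n + 1 offsets
}

impl<'a> MiniView<'a> {
    /// Wraps existing buffers; at least one offset is required.
    pub fn new(data: &'a [u8], offsets: &'a [i32]) -> Option<Self> {
        if offsets.is_empty() {
            return None;
        }
        Some(Self { data, offsets })
    }

    /// Number of items visible through this view.
    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Items [start, end): narrows the offsets window and copies nothing.
    pub fn subview(&self, start: usize, end: usize) -> Option<Self> {
        if start > end || end > self.len() {
            return None;
        }
        Some(Self { data: self.data, offsets: &self.offsets[start..=end] })
    }

    /// Offsets stay absolute, so lookups index the shared data buffer.
    pub fn get(&self, i: usize) -> Option<&'a str> {
        let lo = usize::try_from(*self.offsets.get(i)?).ok()?;
        let hi = usize::try_from(*self.offsets.get(i + 1)?).ok()?;
        std::str::from_utf8(self.data.get(lo..hi)?).ok()
    }
}
```

Because the offsets remain absolute positions in the shared buffer, nesting a subview inside a subview is just another narrowing of the same slice.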
Memory Management
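// Pre-allocate capacity
let mut tape = CharsTapeI32::with_capacity(1024, 100)?; // 1KB data, 100 strings
// Monitor usage
println!("{} strings stored", tape.len());
// Modify
tape.clear();      // Remove all items
tape.truncate(5);  // Keep first 5 items
// Custom allocators (import path assumed)
use allocator_api2::alloc::Global;
let tape = CharsTapeI32::new_in(Global);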
Apache Arrow Interop
True zero-copy conversion to/from Arrow arrays:
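use arrow::array::StringArray;
use arrow::buffer::{Buffer, OffsetBuffer, ScalarBuffer};
// CharsTape → Arrow (zero-copy)
let (data_slice, offsets_slice) = tape.arrow_slices();
let data_buffer = Buffer::from_slice_ref(data_slice);
let offsets_buffer = OffsetBuffer::new(ScalarBuffer::new(
    Buffer::from_slice_ref(offsets_slice), 0, offsets_slice.len()));
let arrow_array = StringArray::new(offsets_buffer, data_buffer, None);
// Arrow → CharsTapeView (zero-copy); constructor name and arguments assumed
let view = unsafe {
    CharsTapeViewI32::from_raw_parts(arrow_array.values(), arrow_array.offsets())
};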
BytesTape works the same way with Arrow BinaryArray/LargeBinaryArray types.
Auto Type Selection
Auto variants automatically select the most memory-efficient types based on data size:
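// CharsTapeAuto: selects I32/U32/U64 offset based on total data size
let tape = CharsTapeAuto::from_iter(["a", "b", "c"]);
// CharsCowsAuto: selects offset (U32/U64) and length (U8/U16/U32) types
// (the form of the data argument is assumed here)
let cows = CharsCowsAuto::from_iter_and_data(data.split_whitespace(), Cow::Borrowed(data.as_bytes()))?;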
Available: CharsTapeAuto, BytesTapeAuto, CharsCowsAuto, BytesCowsAuto.
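As a rough illustration of size-based selection, the sketch below picks the narrowest width that can represent the measured totals. The thresholds, enum names, and function names are assumptions made for this example; the crate's actual selection rules may differ.

```rust
/// Hypothetical offset widths, mirroring the I32/U32/U64 choice above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OffsetWidth {
    I32,
    U32,
    U64,
}

/// Hypothetical length widths, mirroring the U8/U16/U32 choice above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LengthWidth {
    U8,
    U16,
    U32,
}

/// Narrowest offset type whose maximum covers the total data size.
pub fn pick_offset_width(total_bytes: u64) -> OffsetWidth {
    if total_bytes <= i32::MAX as u64 {
        OffsetWidth::I32
    } else if total_bytes <= u32::MAX as u64 {
        OffsetWidth::U32
    } else {
        OffsetWidth::U64
    }
}

/// Narrowest length type that fits the longest item, if any does.
pub fn pick_length_width(longest_item: u64) -> Option<LengthWidth> {
    if longest_item <= u8::MAX as u64 {
        Some(LengthWidth::U8)
    } else if longest_item <= u16::MAX as u64 {
        Some(LengthWidth::U16)
    } else if longest_item <= u32::MAX as u64 {
        Some(LengthWidth::U32)
    } else {
        None
    }
}

/// One pass over the items yields (total bytes, longest item in bytes).
pub fn measure<'a, I>(items: I) -> (u64, u64)
where
    I: IntoIterator<Item = &'a str>,
{
    items.into_iter().fold((0u64, 0u64), |(total, longest), s| {
        let n = s.len() as u64;
        (total + n, longest.max(n))
    })
}
```

Measuring first and choosing afterwards means the input is walked twice (once to size, once to fill), which is the usual trade-off for picking a compact representation up front.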
Unsigned Offsets
Unsigned offsets (u32/u64) are available via CharsTapeU32, CharsTapeU64, BytesTapeU16, BytesTapeU32, BytesTapeU64 and corresponding views.
These cannot be converted to/from Arrow arrays, because Arrow string and binary arrays use signed 32-bit or 64-bit offsets.
no_std Support
StringTape can be used in no_std environments:
[dependencies]
stringtape = { version = "2", default-features = false }
In no_std mode:
- All functionality is preserved
- Requires alloc for dynamic allocation
- Error types implement Display but not std::error::Error
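The last point can be illustrated with an error type that depends only on core. DemoError is a hypothetical enum for this sketch, not the crate's error type, and its variants and messages are invented for the example.

```rust
/// Hypothetical error type that needs nothing beyond `core`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DemoError {
    OffsetOverflow,
    IndexOutOfBounds,
}

// `Display` lives in `core::fmt`, so this impl is available without `std`.
// `std::error::Error` is deliberately not implemented here.
impl core::fmt::Display for DemoError {
    fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
        match self {
            DemoError::OffsetOverflow => f.write_str("offset overflow"),
            DemoError::IndexOutOfBounds => f.write_str("index out of bounds"),
        }
    }
}
```

Callers in no_std code can still format such an error with any core::fmt::Write sink; they only lose the ability to pass it where a std::error::Error trait object is expected.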
Testing
Run tests for both std and no_std configurations:
To reproduce memory usage numbers mentioned above, run: