Serialize large streams of Serialize-able structs to disk from multiple threads, with a customizable on-disk sort order.

Data is written to sorted chunks. When reading, shardio merges the chunks on the fly into a single sorted view. You can also process disjoint subsets of the sorted data independently.

Additionally, you can iterate through the items in the order they were written to disk; in general these items will not follow the sort order. Such an iterator skips the merge-sort step and therefore avoids the memory overhead of keeping multiple items in memory to perform the merge.
use serde::{Deserialize, Serialize};
use shardio::*;
use anyhow::Error;

#[derive(Clone, Eq, PartialEq, Serialize, Deserialize, PartialOrd, Ord, Debug)]
struct DataStruct {
    a: u64,
    b: u32,
}

fn main() -> Result<(), Error> {
    let filename = "test.shardio";
    {
        // Open a shardio output file.
        // The parameters here control buffering and the size of disk chunks,
        // which affect how many disk chunks need to be read to
        // satisfy a range query when reading.
        // In this example the 'built-in' sort order given by #[derive(Ord)]
        // is used.
        let mut writer: ShardWriter<DataStruct> =
            ShardWriter::new(filename, 64, 256, 1 << 16)?;

        // Get a handle to send data to the file.
        let mut sender = writer.get_sender();

        // Generate some test data.
        for i in 0..(2 << 16) {
            sender.send(DataStruct { a: (i % 25) as u64, b: (i % 100) as u32 });
        }

        // Done sending items.
        sender.finished();

        // Write errors are accessible by calling the finish() method.
        writer.finish()?;
    }

    // Open the finished file & test chunked reads.
    let reader = ShardReader::<DataStruct>::open(filename)?;

    let mut all_items = Vec::new();

    // Shardio will divide the key space into 5 roughly equally sized chunks.
    // These chunks can be processed serially, in parallel in different threads,
    // or on different machines.
    let chunks = reader.make_chunks(5, &Range::all());

    for c in chunks {
        // Iterate over all the data in chunk c.
        let range_iter = reader.iter_range(&c)?;
        for i in range_iter {
            all_items.push(i?);
        }
    }

    // Data is returned in sorted order.
    let mut all_items_sorted = all_items.clone();
    all_items_sorted.sort();
    assert_eq!(all_items, all_items_sorted);

    // Iterating with UnsortedShardReader yields the items in the order
    // they were written to disk, without the merge-sort overhead.
    let unsorted_items: Vec<_> = UnsortedShardReader::<DataStruct>::open(filename)?.collect();
    assert_eq!(unsorted_items.len(), all_items.len());

    std::fs::remove_file(filename)?;
    Ok(())
}

Re-exports§
pub use crate::range::Range;
Modules§
Structs§
- DefaultSort - Marker struct for sorting types that implement `Ord` in the order defined by their `Ord` impl. This sort order is used by default when writing, unless an alternative sort order is provided.
- MergeIterator - Iterator over merged shardio files.
- RangeIter - Iterator of items from a single shardio reader.
- ShardReader - Read from a collection of shardio files. The input data is merged to give a single sorted view of the combined dataset. The input files must be created with the same sort order `S` as they are read with.
- ShardSender - A handle that is used to send data to a `ShardWriter`. Each thread that is producing data needs its own `ShardSender`. A `ShardSender` can be obtained with the `get_sender` method of `ShardWriter`. `ShardSender` implements `Clone` (see the sketch after this list).
- ShardWriter - Write a stream of data items of type `T` to disk, in the sort order defined by `S`.
- UnsortedShardReader - Read from a collection of shardio files in the order in which items were written, without considering the sort order.
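Since `ShardSender` implements `Clone` and each producing thread needs its own sender, a typical pattern is to clone one sender per worker thread. The sketch below is a minimal illustration using scoped threads; it assumes the `DataStruct` type from the example above, and the thread count and generated data are arbitrary placeholders.

```rust
use shardio::*;
use anyhow::Error;
use std::thread;

// Minimal sketch: write from several threads, assuming `DataStruct` is
// defined as in the example above.
fn write_parallel(filename: &str) -> Result<(), Error> {
    let mut writer: ShardWriter<DataStruct> = ShardWriter::new(filename, 64, 256, 1 << 16)?;

    {
        // One sender per producer thread; ShardSender implements Clone.
        let sender = writer.get_sender();
        thread::scope(|scope| {
            for t in 0..4u64 {
                let mut s = sender.clone();
                scope.spawn(move || {
                    for i in 0..10_000u32 {
                        s.send(DataStruct { a: t, b: i });
                    }
                    // Flush this producer's buffer.
                    s.finished();
                });
            }
        });
    } // the original sender is dropped here

    // Flush remaining chunks and write the index; write errors surface here.
    writer.finish()?;
    Ok(())
}
```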
Constants§
- SHARD_ITER_SZ - The size (in bytes) of a `ShardIter` object (mostly buffers).
Traits§
- SortKey - Specify a key function from data items of type `T` to a sort key of type `Key`. Implement this trait to create a custom sort order. The function `sort_key` returns a `Cow` so that we can abstract over owned or borrowed data.
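As a rough illustration of a custom sort order, the sketch below sorts `DataStruct` records (from the example above) by their `b` field. The marker type name `FieldBSort` is made up for this illustration, and the exact associated-type bounds and `sort_key` signature should be checked against the `SortKey` trait definition.

```rust
use shardio::*;
use std::borrow::Cow;

// Hypothetical marker type: sort DataStruct records by their `b` field only.
struct FieldBSort;

impl SortKey<DataStruct> for FieldBSort {
    type Key = u32;

    // Returning a Cow lets the key be borrowed from the item or computed
    // and owned, whichever is cheaper.
    fn sort_key(v: &DataStruct) -> Cow<'_, u32> {
        Cow::Borrowed(&v.b)
    }
}

// The sort order is selected via the second type parameter when writing:
// let mut writer: ShardWriter<DataStruct, FieldBSort> =
//     ShardWriter::new("by_b.shardio", 64, 256, 1 << 16)?;
```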