Crate shardio

Serialize large streams of Serialize-able structs to disk from multiple threads, with a customizable on-disk sort order. Data is written to sorted chunks. When reading, shardio merges the data on the fly into a single sorted view. You can also process disjoint subsets of the sorted data.

use serde::{Deserialize, Serialize};
use shardio::*;
use std::fs::File;
use failure::Error;

#[derive(Clone, Eq, PartialEq, Serialize, Deserialize, PartialOrd, Ord, Debug)]
struct DataStruct {
    a: u64,
    b: u32,
}

fn main() -> Result<(), Error>
{
    let filename = "test.shardio";
    {
        // Open a shardio output file
        // Parameters here control buffering, and the size of disk chunks
        // which affect how many disk chunks need to be read to
        // satisfy a range query when reading.
        // In this example the 'built-in' sort order given by #[derive(Ord)]
        // is used.
        let mut writer: ShardWriter<DataStruct> =
            ShardWriter::new(filename, 64, 256, 1<<16)?;

        // Get a handle to send data to the file
        let mut sender = writer.get_sender();

        // Generate some test data
        for i in 0..(2 << 16) {
            sender.send(DataStruct { a: (i%25) as u64, b: (i%100) as u32 });
        }

        // done sending items
        sender.finished();

        // Write errors are accessible by calling the finish() method
        writer.finish()?;
    }

    // Open finished file & test chunked reads
    let reader = ShardReader::<DataStruct>::open(filename)?;

     
    let mut all_items = Vec::new();

    // Shardio will divide the key space into 5 roughly equally sized chunks.
    // These chunks can be processed serially, in parallel in different threads,
    // or on different machines.
    let chunks = reader.make_chunks(5, &Range::all());

    for c in chunks {
        // Iterate over all the data in chunk c.
        let range_iter = reader.iter_range(&c)?;
        for i in range_iter {
            all_items.push(i?);
        }
    }

    // Data will be returned in sorted order
    let mut all_items_sorted = all_items.clone();
    all_items_sorted.sort();
    assert_eq!(all_items, all_items_sorted);
    std::fs::remove_file(filename)?;
    Ok(())
}

Re-exports

pub use crate::range::Range;

Modules

helper

Helper methods

range

Represent a range of key space

Structs

DefaultSort

Marker struct for sorting types that implement Ord in the order defined by their Ord impl. This sort order is used by default when writing, unless an alternative sort order is provided.
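
For example, the writer from the example above could name this default sort order explicitly. A minimal sketch, reusing DataStruct from the example and assuming DefaultSort is the default value of the writer's sort-order type parameter S:

use shardio::{DefaultSort, ShardWriter};

// Equivalent to ShardWriter::<DataStruct>::new(...); DefaultSort simply
// selects the order given by DataStruct's Ord impl.
let mut writer: ShardWriter<DataStruct, DefaultSort> =
    ShardWriter::new("default_sort.shardio", 64, 256, 1 << 16)?;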

MergeIterator

Iterator over merged shardio files

RangeIter

Iterator of items from a single shardio reader

ShardReader

Read from a collection of shardio files. The input data is merged on the fly to give a single sorted view of the combined dataset. The input files must have been written with the same sort order S used to read them.
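
The sketch below shows how several files might be combined into one sorted view. It reuses DataStruct from the example above and assumes an open_set constructor that takes a slice of paths; the file names are illustrative.

use shardio::{Range, ShardReader};

// Hypothetical input files; all must contain DataStruct items written
// with the same sort order.
let files = vec!["part1.shardio", "part2.shardio"];

// Assumption: open_set merges the inputs into a single sorted view.
let reader = ShardReader::<DataStruct>::open_set(&files)?;

// Iterate the merged view over the whole key space.
for item in reader.iter_range(&Range::all())? {
    let item: DataStruct = item?;
    // ... use item ...
}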

ShardSender

A handle used to send data to a ShardWriter. Each thread that is producing data needs its own ShardSender. A ShardSender can be obtained with the get_sender method of ShardWriter. ShardSender implements Clone.
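
A minimal sketch of writing from several threads, reusing DataStruct from the example above. It assumes ShardSender is Send so each handle can be moved into its own thread; std::thread::scope needs Rust 1.63 or later, and the file name is illustrative.

let mut writer: ShardWriter<DataStruct> =
    ShardWriter::new("multi_thread.shardio", 64, 256, 1 << 16)?;

// One sender handle per worker thread.
let senders: Vec<_> = (0..4).map(|_| writer.get_sender()).collect();

std::thread::scope(|scope| {
    for (t, mut sender) in senders.into_iter().enumerate() {
        scope.spawn(move || {
            for i in 0..1000u32 {
                sender.send(DataStruct { a: t as u64, b: i });
            }
            // Flush this thread's buffered items.
            sender.finished();
        });
    }
});

// Wait for all buffered data to reach disk and write the file index.
writer.finish()?;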

ShardWriter

Write a stream of data items of type T to disk, in the sort order defined by S.

Constants

SHARD_ITER_SZ

The size (in bytes) of a ShardIter object (mostly buffers)

Traits

SortKey

Specify a key function from data items of type T to a sort key of type Key. Implement this trait to create a custom sort order. The function sort_key returns a Cow so that it can abstract over owned or borrowed data.
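
As a sketch, a custom sort order over the b field of the DataStruct from the example above could look like the following. FieldBSort is an illustrative name, and the trait signature is assumed to match the description above (a Key associated type plus a sort_key function returning a Cow).

use std::borrow::Cow;
use shardio::{ShardWriter, SortKey};

// Marker type meaning "sort DataStruct by its b field".
struct FieldBSort;

impl SortKey<DataStruct> for FieldBSort {
    type Key = u32;

    fn sort_key(item: &DataStruct) -> Cow<'_, u32> {
        // Borrow the key straight from the item; no allocation needed.
        Cow::Borrowed(&item.b)
    }
}

// Pass the marker type as the sort-order parameter S when writing.
let mut writer: ShardWriter<DataStruct, FieldBSort> =
    ShardWriter::new("sorted_by_b.shardio", 64, 256, 1 << 16)?;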