Struct catalog::HashFile

pub struct HashFile<K: Display + FromStr + Hash, V: Display + FromStr> { /* fields omitted */ }

An implementation of a file-based map which stores key-value pairs in sorted order in files, and gets them using binary search and file seeking in O(log n) time.

Design

HashFile maintains two files - one for the keys and value indices, and the other for the values themselves. Since padding is used to keep the line length constant throughout the file, large values could lead to extremely huge, sparse files; keeping the values in a separate file eliminates this problem.

  • During insertion, the hash of the supplied key is obtained (using the built-in SipHasher), which acts as the key for sorting.

  • While flushing, two iterators (one over the key/values in the underlying map, another over the keys already in the file) and two writers (one for the key-index file, another for the values) are maintained. The readers yield key/values in ascending order, the hashes of the keys are compared, the values are appended to a temporary file, and the keys (along with the value indices) are written to another temporary file. In the end, both temporary files are renamed into place.

  • Basically, the files are in DSV format - one has keys and value indices separated by a null byte, while the other has values separated by \n. Each line in the "key" file is padded with null bytes so that every line has the same length, which is done by calling the finish method. The method also performs a cleanup and gets rid of unnecessary values from the "data" file.

  • While getting, the hash for the given key is computed, and a binary search is made by seeking through the "key" file. The value index is found in O(log n) time, and the value is then read from the "data" file in O(1) time (see the sketch below).
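
The following is a minimal, standalone sketch of that lookup - not HashFile's actual code (the function name, line format, and field order here are assumptions) - showing why constant-length lines let a reader seek directly to line i at byte offset i * line_len and binary-search the file:

use std::fs::File;
use std::io::{self, BufRead, BufReader, Seek, SeekFrom};

// Binary search over a file whose lines all have the same byte length
// (line_len includes the trailing newline). Each line is assumed to hold
// "<hash>\0<value index>" followed by null-byte padding.
fn find_value_index(path: &str, line_len: u64, num_lines: u64,
                    target_hash: u64) -> io::Result<Option<u64>> {
    let mut reader = BufReader::new(try!(File::open(path)));
    let (mut lo, mut hi) = (0u64, num_lines);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        // Constant line length means line `mid` starts at byte `mid * line_len`.
        try!(reader.seek(SeekFrom::Start(mid * line_len)));
        let mut line = String::new();
        try!(reader.read_line(&mut line));
        let mut fields = line.trim_right_matches('\n').split('\0');
        let hash: u64 = fields.next().unwrap_or("").parse().unwrap_or(0);
        if hash == target_hash {
            // The second field is the index into the "data" file.
            return Ok(fields.next().and_then(|s| s.parse().ok()));
        } else if hash < target_hash {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    Ok(None)
}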

Examples

Once you've added the package to your Cargo.toml

[dependencies]
catalog = "0.1.1"

... it can be used as follows,

extern crate catalog;

use catalog::HashFile;

// This will create a new file in the path (if it doesn't exist)
let mut hf: HashFile<usize, _> =
    try!(HashFile::new("/tmp/SAMPLE").map(|hf| hf.set_capacity(100)));

// We don't have to mention all the types explicitly, leaving it to type inference
// But, there's a reason why I mentioned `usize` (which we'll see in a moment).

// Insert some stuff into the map (in this case, integers and upper case alphabets)
for i in 0..1000 {
    try!(hf.insert(i, format!("{}", (65 + (i % 26)) as u8 as char)));
}

// The data is flushed to the file after every 100 key/value pairs, since
// we've set the capacity to 100.

// Call the finish method once you're done with insertion. It's necessary
// because it pads each line and ensures that all the lines have the same length.
try!(hf.finish());

// Now, we're ready to "get" the values.
let value = try!(hf.get(&0));
assert_eq!(Some(("A".to_owned(), 0)), value);
// Note that in addition to the value, there's a number.

// Let's try it again...
try!(hf.insert(0, format!("Z")));
// Don't forget to flush it to the file!
try!(hf.finish());

let value = try!(hf.get(&0));
assert_eq!(Some(("Z".to_owned(), 1)), value);

// So, the number is just a counter. HashFile keeps track of the number of
// times a value has been overridden (with insertion).

Now, let's have a quick peek inside the generated files.

$ head -5 /tmp/SAMPLE*
==> /tmp/SAMPLE <==
68600
18320
59540
50060
1580
==> /tmp/SAMPLE.dat <==
K
B
X
G
P
$ wc -l /tmp/SAMPLE*
 1000 /tmp/SAMPLE
 1000 /tmp/SAMPLE.dat
 2000 total
$ ls -l /tmp/SAMPLE*
-rw-rw-r-- 1 user user 11000 Jul 09 22:10 /tmp/SAMPLE
-rw-rw-r-- 1 user user 2000 Jul 09 22:10 /tmp/SAMPLE.dat

The file sizes will (and should!) always be a multiple of the number of key/value pairs, since each line is padded to the same length - here, 11000 / 1000 = 11 bytes per line in the "key" file and 2000 / 1000 = 2 bytes per line in the "data" file (a one-character value plus the newline). Now, we can have another program get the key/value pairs back.

// This will open the file in the path (if it exists)
let mut hf: HashFile<usize, String> = try!(HashFile::new("/tmp/SAMPLE"));

A couple of things to note here. Before getting, we need to spell out both type parameters, because rustc has no way of knowing what types are stored in the file (and it'll throw an error if we don't).

Moreover, if we hadn't explicitly mentioned usize during insertion, rustc would've fallen back to its default integer type (i32), and if we mention some other primitive now, the hashes won't match, i.e., hash(0u32) != hash(0usize).

For example, "2" can be parsed into any of the integer primitives (u8, u64, isize, etc.), but they all produce different hashes. In such a case, HashFile will most likely return None while getting the value corresponding to a key, even though it exists in the file. Hence, it's up to the user to handle such cases (by explicitly using the same type during insertion and getting).
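
To see why, here's a standalone sketch using std::hash::SipHasher (the hasher mentioned above; it has since been deprecated in std in favor of DefaultHasher) - none of this is HashFile's own code:

use std::hash::{Hash, Hasher, SipHasher};

// Hash a value with SipHash and take the 64-bit digest.
fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = SipHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // 0u32 and 0usize feed different byte sequences to the hasher, so
    // their digests will (almost certainly) differ - and a lookup keyed
    // on the wrong type would land on the wrong line, if any.
    println!("hash(0u32)   = {}", hash_of(&0u32));
    println!("hash(0usize) = {}", hash_of(&0usize));
    assert!(hash_of(&0u32) != hash_of(&0usize));
}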

// Now, we can get the values...
let value = try!(hf.get(&0));
assert_eq!(Some(("Z".to_owned(), 1)), value);

// ... as many as we want!
let value = try!(hf.get(&1));
assert_eq!(Some(("B".to_owned(), 0)), value);

We've used a lot of try! here, because each method invocation makes OS calls to manipulate the underlying file descriptor. Since all of the methods return a Result<T, E>, HashFile is guaranteed not to panic along the way - errors are handed back to the caller instead.
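
One more thing to note: try! only works inside a function that itself returns a Result, so in a real program the calls above would be wrapped in something like the sketch below (treating the error type as std::io::Error is an assumption here - substitute whatever the crate's methods actually return):

extern crate catalog;

use catalog::HashFile;
use std::io;

// Assumption: the crate's methods are treated as returning io::Result<_>
// purely for illustration.
fn run() -> io::Result<()> {
    let mut hf: HashFile<usize, String> = try!(HashFile::new("/tmp/SAMPLE"));
    let value = try!(hf.get(&0));
    println!("{:?}", value);
    Ok(())
}

fn main() {
    run().expect("HashFile operation failed");
}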

Advantages:

  • Control over memory: You're planning to put a great deal of "stuff" into a map, but you can't afford the memory it demands and want control over how much memory the map consumes. That said, you still want a map which can return the values for your requested keys in reasonable time.

  • Long term storage: You're sure that the large "stuff" won't change in the near future, and you don't want to pay for deallocating it whenever the program quits or re-inserting it whenever the program starts.

Drawbacks:

  • Sluggish insertion: When inserting into the usual in-memory maps, re-allocation is lightning fast, so its cost is barely noticeable while a program runs. That's not the case with a file: every flush makes OS calls, and since each flush rewrites the sorted "key" and "data" files, the cost of insertion grows with the amount of data already stored - which is very noticeable here.

Methods

impl<K: Display + FromStr + Hash, V: Display + FromStr> HashFile<K, V>

fn new - Create a new HashFile for mapping key/value pairs in the given path

fn set_capacity - Set the capacity of the HashFile (it flushes to the file whenever the buffered pairs exceed this value)

fn finish - Run this finally to flush the remaining values (if any) from the struct to the file

fn insert - Insert a key/value pair

fn get - Get the value corresponding to the key from the file