Struct catalog::HashFile
[−]
[src]
pub struct HashFile<K: Display + FromStr + Hash, V: Display + FromStr> { /* fields omitted */ }
An implementation of a "file-based" map which stores key-value pairs in sorted fashion in file(s), and gets them using binary search and file seeking in O(log-n) time.
Design
HashFile
maintains two files - one for the keys and value indices, and the other
for the values themselves. Since it uses padding to maintain constant line length
throughout the file, large values can lead to extremely huge and sparse files. Having two
files eliminates this problem.
During insertion, the hash of the supplied key is obtained (using the built-in
SipHasher
), which acts as the key for sorting.While flushing, two iterators (one for the key/values in the underlying map, another for the keys in the file) and two writers (one for key-index, another for the values) are maintained. The readers throw key/values in ascending order, the hashes of the keys are compared, the values are appended to a temporary file, and the keys (along with the value indices) are written to another temporary file. In the end, both the files are renamed.
Basically, the files are in DSV format - one has keys and value indices separated by a null byte, while the other has values separated by
\n
. Each line in the "key" file is ensured to have the same length, by properly padding it with null bytes, which is done by calling thefinish
method. The method also does a cleanup and gets rid of unnecessary values from the "data" file.While getting, the hash for the given key is computed, and a binary search is made by seeking through the file. The value index is found in O(log-n) time, and the value is obtained from the "data" file in O(1) time.
Examples
Once you've added the package to your Cargo.toml
catalog = "0.1.1"
... it can be used as follows,
extern crate catalog; use catalog::HashFile; // This will create a new file in the path (if it doesn't exist) let mut hash_file: HashFile<usize, _> = try!(HashFile::new("/tmp/SAMPLE").map(|hf| hf.set_capacity(100))); // We don't have to mention all the types explicitly, leaving it to type inference // But, there's a reason why I mentioned `usize` (which we'll see in a moment). // Insert some stuff into the map (in this case, integers and upper case alphabets) for i in 0..1000 { try!(hf.insert(i, format!("{}", (65 + (i % 26)) as u8 as char))); } // This flushes the data to the file for every 100 key/value pairs, since // we've set the capacity to 100. // Call the finish method once you're done with insertion. It's necessary // because it pads each line and ensures that all the lines have the same length. try!(hf.finish()); // Now, we're ready to "get" the values. let value = try!(hf.get(&0)); assert_eq!(Some(("A".to_owned(), 0)), value); // Note that in addition to the value, there's a number. // Let's try it again... try!(hf.insert(0, format!("Z"))); // Don't forget to flush it to the file! try!(hf.finish()); let value = try!(hf.get(&0)); assert_eq!(Some(("Z".to_owned(), 1)), value); // So, the number is just a counter. HashFile keeps track of the number of // times a value has been overridden (with insertion).
Now, let's have a quick peek inside the generated file.
$ head -5 /tmp/SAMPLE*
==> /tmp/SAMPLE <==
68600
18320
59540
50060
1580
==> /tmp/SAMPLE.dat <==
K
B
X
G
P
$ wc -l /tmp/SAMPLE*
1000 /tmp/SAMPLE
1000 /tmp/SAMPLE.dat
2000 total
$ ls -l /tmp/SAMPLE*
-rw-rw-r-- 1 user user 11000 Jul 09 22:10 /tmp/SAMPLE
-rw-rw-r-- 1 user user 2000 Jul 09 22:10 /tmp/SAMPLE.dat
The file sizes will (and should!) always be a multiple of the number of key/value pairs, since each line is padded to have the same length. Now, we can have another program to get the key/value pairs.
// This will open the file in the path (if it exists) let mut hf: HashFile<usize, String> = try!(HashFile::new("/tmp/SAMPLE"));
A couple of things to note here. Before getting, we need to mention the types, because rustc doesn't know what type we have in the file (and, it'll throw an error).
Moreover, if we hadn't explicitly mentioned usize
during insertion,
rustc
would've gone for some default type, and if we mention some other primitive
now, the hashes won't match i.e., hash(0u32) != hash(0usize)
.
For example, "2"
can be parsed to all the integer primitives (u8
, u64
, isize
, etc.),
but, they all produce different hashes. In such a case, it's more likely that HashFile
returns None
while getting the value corresponding to a key, even if it exists in
the file. Hence, it's up to the user to handle such cases
(by manually denoting the type during insertion and getting).
// Now, we can get the values... let value = try!(hf.get(&0)); assert_eq!(Some(("Z".to_owned(), 1)), value); // ... as many as we want! let value = try!(hf.get(&1)); assert_eq!(Some(("B".to_owned(), 0)), value);
We've used a lot of try!
here, because each method invocation involves making OS
calls for manipulating the underlying file descriptor. Since all the methods have been
ensured to return a Result<T, E>
, HashFile
can be guaranteed from
panicking along the run.
Advantages:
Control over memory: You're planning to put a great deal of "stuff" into a map, but you cannot afford the memory it demands. You wanna have control on how much memory your map can consume. That said, you still want a map which can throw the values for your requested keys in appreciable time.
Long term storage: You're sure that the large "stuff" won't change in the near future, and so you're not willing to risk the deallocation (when the program quits) or re-insertion (whenever the program starts).
Drawbacks:
- Sluggish insertion: Re-allocation in memory is lightning fast, while putting stuff into the usual maps, and so it won't be obvious during the execution of a program. But, that's not the case when it comes to file. Flushing to a file takes time (as it makes OS calls), and it increases exponentially as O(2n) during insertion, which would be very obvious in our case.
Methods
impl<K: Display + FromStr + Hash, V: Display + FromStr> HashFile<K, V>
[src]
fn new(path: &str) -> Result<HashFile<K, V>, String>
Create a new HashFile
for mapping key/value pairs in the given path
fn set_capacity(self, capacity: usize) -> HashFile<K, V>
Set the capacity of the HashFile
(to flush to the file whenever it exceeds this value)
fn finish(&mut self) -> Result<(), String>
Run this finally to flush the values (if any) from the struct to the file
fn insert(&mut self, key: K, value: V) -> Result<(), String>
Insert a key/value pair
fn get(&mut self, key: &K) -> Result<Option<(V, usize)>, String>
Get the value corresponding to the key from the file