Crate bindet
Fast binary file type detection
bindet provides fast and safe binary file type detection, even for large files.
The worst case for bindet is O(n), but several tricks are applied to bring the amortized
time complexity close to O(1), so in most cases detection does not take O(n) time.
Supported file types
- Zip
- Rar (4 and 5)
- Tar
- Png
- Jpg
- 7-zip
- Opus
- Vorbis
- Mp3
- Webp
- Flac
- Matroska (mkv, mka, mks, mk3d, webm)
- Wasm
First Step
File detection is a two-pass process. The first pass tries to find the magic number at the start
of the Read. If the magic number is found, a second pass may be done to ensure the detection is correct.
For example, FileType::Zip has a Local File Header, which starts
with a 4-byte signature, and an End of central directory record, which appears at the
end of non-empty zip files.
Some files can be detected just by looking at the start of the file with a fixed-size buffer,
which guarantees O(1) for simple detection and amortized O(1) for the correctness check. Other file
types, such as RAR SFX, state that the
magic number may be found anywhere from the start of the file up to the SFX module size (1 MB).
In the worst case, this means sliding a window over up to 1 MB of data to find the value.
This type of check happens in the second step.
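The fixed-size first pass described above can be sketched as follows. This is a minimal illustration using only the standard library, not bindet's actual implementation: the `Kind` enum, the `MAGIC` table, the 16-byte buffer size, and the `sniff_start` function are all hypothetical names chosen for this example. The signatures themselves are the well-known ones for each format.

```rust
use std::io::{self, Read};

/// Hypothetical, simplified stand-in for a file type enum (not bindet's `FileType`).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Kind {
    Zip,
    Png,
    Flac,
    Wasm,
}

/// Well-known signatures located at offset 0 of each format.
const MAGIC: &[(Kind, &[u8])] = &[
    (Kind::Zip, b"PK\x03\x04"),        // Zip Local File Header
    (Kind::Png, b"\x89PNG\r\n\x1a\n"), // PNG signature
    (Kind::Flac, b"fLaC"),             // FLAC stream marker
    (Kind::Wasm, b"\0asm"),            // WebAssembly binary magic
];

/// Reads at most 16 bytes from the start of `reader` and compares them
/// against the table above. The work is bounded by the buffer size, so it
/// does not depend on how long the input is.
fn sniff_start<R: Read>(mut reader: R) -> io::Result<Option<Kind>> {
    let mut buf = [0u8; 16];
    let mut filled = 0;
    // A single `read` call may return fewer bytes than requested,
    // so keep reading until the buffer is full or the input ends.
    while filled < buf.len() {
        let n = reader.read(&mut buf[filled..])?;
        if n == 0 {
            break;
        }
        filled += n;
    }
    let head = &buf[..filled];
    Ok(MAGIC
        .iter()
        .find(|(_, magic)| head.starts_with(magic))
        .map(|(kind, _)| *kind))
}
```

Calling `sniff_start` on a byte slice that begins with the PNG signature yields `Some(Kind::Png)`, while input that matches no entry yields `None`. A real detector would follow a Zip match with the check at the end of the file described in this section.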
Second Step
The first step uses a small buffer to store the initial bytes of the data and tries to detect
the file type. The second step uses a larger buffer, up to the size of the largest
lookup range (currently 1 MB, which matches the RAR5 specification), and
a sliding window to find a range that matches the magic number sequence.
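A forward sliding-window search of this kind might look like the sketch below. It is an assumption-laden illustration rather than bindet's code: `find_forward`, the 4096-byte chunk size, and the constant names are hypothetical. The RAR 4.x signature is used as the needle, and the 1 MB limit is taken from the text above.

```rust
use std::io::{self, Read};

/// RAR 4.x signature (RAR 5 uses `Rar!\x1A\x07\x01\x00`).
const RAR4_MAGIC: &[u8] = b"Rar!\x1A\x07\x00";
/// Lookup range from the text above: 1 MB.
const MAX_LOOKUP: usize = 1024 * 1024;

/// Returns the offset of the first occurrence of `needle`, provided it
/// starts at an offset no greater than `limit`.
fn find_forward<R: Read>(
    mut reader: R,
    needle: &[u8],
    limit: usize,
) -> io::Result<Option<u64>> {
    if needle.is_empty() {
        return Ok(None);
    }
    let mut chunk = [0u8; 4096];
    let mut window: Vec<u8> = Vec::new(); // carried-over tail plus the newest chunk
    let mut base: usize = 0; // absolute offset of window[0]
    loop {
        let n = reader.read(&mut chunk)?;
        if n == 0 {
            return Ok(None); // end of input, nothing found
        }
        window.extend_from_slice(&chunk[..n]);
        if let Some(pos) = window.windows(needle.len()).position(|w| w == needle) {
            let offset = base + pos;
            return Ok(if offset <= limit { Some(offset as u64) } else { None });
        }
        // Keep the last needle.len() - 1 bytes so that a match straddling
        // two chunks is still seen on the next iteration.
        let keep = (needle.len() - 1).min(window.len());
        let dropped = window.len() - keep;
        window.drain(..dropped);
        base += dropped;
        if base > limit {
            return Ok(None); // any later match would start past the limit
        }
    }
}
```

Because the search stops once the window has moved past `limit`, the cost is bounded by the lookup range rather than by the total size of the input.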
The same strategy is applied in the detect_at_end logic: it scans the
file backward, using a sliding window, to find a matching sequence of bytes. This logic is
used to ensure correctness for file types that have a sequence of bytes that appears at the end.
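The backward scan can be sketched in the same spirit. The function below is a hypothetical illustration, not the body of detect_at_end: `find_backward` and its `chunk` parameter are names invented here, and the needle used in the usage note is the Zip End of central directory signature `PK\x05\x06`.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Signature of the Zip End of central directory record.
const ZIP_EOCD: &[u8] = b"PK\x05\x06";

/// Scans `reader` from the end toward the start in steps of `chunk` bytes
/// and returns the offset of the last occurrence of `needle`, if any.
fn find_backward<R: Read + Seek>(
    reader: &mut R,
    needle: &[u8],
    chunk: usize,
) -> io::Result<Option<u64>> {
    if needle.is_empty() || chunk == 0 {
        return Ok(None);
    }
    let len = reader.seek(SeekFrom::End(0))?;
    let overlap = (needle.len() - 1) as u64;
    let mut end = len; // exclusive end of the region not searched yet
    let mut buf = Vec::new();
    while end > 0 {
        let start = end.saturating_sub(chunk as u64);
        // Read slightly past `end` so that a match straddling the boundary
        // with the previously searched region is not missed.
        let read_end = (end + overlap).min(len);
        buf.resize((read_end - start) as usize, 0);
        reader.seek(SeekFrom::Start(start))?;
        reader.read_exact(&mut buf)?;
        if let Some(pos) = buf.windows(needle.len()).rposition(|w| w == needle) {
            return Ok(Some(start + pos as u64));
        }
        end = start;
    }
    Ok(None)
}
```

Passing `ZIP_EOCD` as the needle over a `std::io::Cursor` holding a zip archive would typically return after reading only the final chunk, since the record sits close to the end. If the marker is absent, the loop walks back to offset 0, which is the linear worst case.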
Worst-case scenario
The detect function first reads from the start, and then does the backward sliding-window scan
at the end only for types that matched at the start. This improves the accuracy of file detection,
at a cost: if a marker is found at the start, the specification states that there is a
marker at the end, and no marker is actually present at the end, the backward scan
traverses the entire data stream, with a time complexity of O(n). The worst case
of file detection is therefore linear.
However, even with a linear worst case, we assume that in most scenarios the marker at the start will be enough to detect the file type. If it is not enough and we need to look at the end, we assume that in most cases the window will not have to slide all the way back to the start of the stream, because the marker will be found closer to the end than to the start.
Further benchmarks can be done to check whether bindet's amortized time complexity is really O(1), given
a set of files to be detected.
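One possible way to gather such measurements is to wrap the input in a reader that counts how many bytes a detection pass actually pulls from it. The wrapper below is a hypothetical helper written for this illustration (`CountingReader` is not part of bindet), and it relies only on the standard `Read` and `Seek` traits.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical helper: forwards reads and seeks to `inner` while
/// counting the total number of bytes returned by `read`.
struct CountingReader<R> {
    inner: R,
    bytes_read: u64,
}

impl<R> CountingReader<R> {
    fn new(inner: R) -> Self {
        CountingReader { inner, bytes_read: 0 }
    }
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.bytes_read += n as u64;
        Ok(n)
    }
}

impl<R: Seek> Seek for CountingReader<R> {
    // Seeking moves the cursor but transfers no data, so it is not counted.
    fn seek(&mut self, pos: SeekFrom) -> io::Result<u64> {
        self.inner.seek(pos)
    }
}
```

Comparing `bytes_read` with the file size across a set of sample files would show how often detection stays near the fixed-size buffer and how often it approaches the whole stream. Whether a detection function accepts such a wrapper depends on its trait bounds, so treat this as a starting point for an experiment.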
Examples
use std::fs::OpenOptions;
use std::io::BufReader;
use std::io::ErrorKind;
use bindet;
use bindet::types::FileType;
use bindet::FileTypeMatch;
use bindet::FileTypeMatches;

// Open the sample archive read-only and wrap it in a buffered reader.
let file = OpenOptions::new().read(true).open("files/test.tar").unwrap();
let buf = BufReader::new(file);
// Map the I/O error to its kind so the result can be compared with assert_eq!.
let detect = bindet::detect(buf).map_err(|e| e.kind());
// Expect Tar as the only detected type.
let expected: Result<Option<FileTypeMatches>, ErrorKind> = Ok(Some(FileTypeMatches::new(
    vec![FileType::Tar],
    vec![FileTypeMatch::new(FileType::Tar, true)],
)));
assert_eq!(detect, expected);
Modules
Structs
Functions
Detect a file type by looking at the start and at the end of the file (the end is checked only for applicable file types).
Detect a file type by using a backward sliding window. This approach has O(n) time complexity
and is not meant to be used directly.
Detect a file type by looking at the start of the file. Types that need a second check at the end
may be reported with FileTypeMatch.full_match = false, signaling a probable
match.