Fast binary file type detection
bindet provides fast and safe binary file detection, even for large files.
The worst case for bindet is O(n), but some tricks are applied to try to amortize the time complexity to O(1), so in most cases it does not take O(n) to execute.
§Supported file types
- Zip
- Rar (4 and 5)
- Tar (uncompressed)
- LZMA
- 7zXZ
- Zst
- Png
- Jpg
- 7-zip
- Opus
- Vorbis
- Mp3
- Webp
- Flac
- Matroska (mkv, mka, mks, mk3d, webm)
- Wasm
- Java Class
- Scala Tasty
- Mach-O
- Elf (Executable and Linkable Format)
- Wav
- Avi
- Aiff
- Tiff
- Sqlite3 (.db)
- Ico
- Dalvik
- Exe/Dll
- Gif
- Xcf
- Bmp
- Iso
- Swf/Swc
- (some may be missing, please refer to FileType)
§First Step
File detection is a two-pass process. First, it tries to find the magic number at the start of the Read; if the magic number is found, a second pass may be done to ensure the detection is correct.
For example, FileType::Zip has a Local File Header, which starts with a 4-byte descriptor, and an End of central directory record, which appears at the end of non-empty zip files.
Some files can be detected by looking only at the start of the file, using a fixed-size buffer, which guarantees O(1) for simple detection and an amortized O(1) for correctness. Also, some file types, such as RAR SFX, state that the magic number may be found anywhere from the start of the file up to the SFX module size (which is 1 MB). This means that in the worst case we need to slide a window over up to 1 MB to find this value. This type of check happens in the second step.
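The fixed-size first pass described above can be sketched as follows. This is a minimal illustration using only the standard library, not bindet's actual implementation: the `Kind` enum, the `SIGNATURES` table, the `sniff_start` function and the 16-byte `HEAD_LEN` are all hypothetical names and values chosen for the example.

```rust
use std::io::{self, Read};

/// Hypothetical subset of file types, for illustration only.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Kind {
    Zip,
    Png,
    Gif,
}

/// (magic number, type) pairs compared against the first bytes of the input.
const SIGNATURES: &[(&[u8], Kind)] = &[
    (b"PK\x03\x04", Kind::Zip),
    (b"\x89PNG\r\n\x1a\n", Kind::Png),
    (b"GIF8", Kind::Gif),
];

/// Reads at most `HEAD_LEN` bytes, so the cost does not depend on input size.
pub fn sniff_start<R: Read>(mut reader: R) -> io::Result<Option<Kind>> {
    const HEAD_LEN: usize = 16; // assumed buffer size, not bindet's value
    let mut head = [0u8; HEAD_LEN];
    let mut filled = 0;
    // `read` may return fewer bytes than requested, so loop until full or EOF.
    while filled < HEAD_LEN {
        match reader.read(&mut head[filled..])? {
            0 => break,
            n => filled += n,
        }
    }
    let head = &head[..filled];
    Ok(SIGNATURES
        .iter()
        .find(|(magic, _)| head.starts_with(magic))
        .map(|(_, kind)| *kind))
}
```

Because the buffer has a fixed length, the work done here is bounded by a constant no matter how large the input is, which is the O(1) property the first step relies on.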
§Second Step
In the first step, we use a small buffer to store the initial bytes of the data and try to detect the file type. In the second step, we use a larger buffer, up to the size of the largest lookup range (currently 1 MB, which matches the RAR5 specification), and use a sliding window to find a range that matches the magic number sequence.
The same strategy is applied in the detect_at_end logic: it scans the file backward, using a sliding window, to find a matching sequence of bytes. This logic is used to ensure correctness for file types that have a sequence of bytes that appears at the end.
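A backward sliding-window search of the kind described above might look like the sketch below. It is an assumption-laden illustration, not the code behind detect_at_end: `rfind_windowed` and its `window` parameter are hypothetical, and the example works on any `Read + Seek` source from the standard library.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Searches for `needle` starting at the end of `src` and moving toward the
/// start. Returns the absolute offset of the last occurrence, if any.
/// `window` is a hypothetical chunk size; it must be at least `needle.len()`.
pub fn rfind_windowed<R: Read + Seek>(
    src: &mut R,
    needle: &[u8],
    window: usize,
) -> io::Result<Option<u64>> {
    assert!(!needle.is_empty() && window >= needle.len());
    let len = src.seek(SeekFrom::End(0))?;
    let mut buf = vec![0u8; window];
    let mut end = len; // exclusive upper bound of the current window
    loop {
        let start = end.saturating_sub(window as u64);
        let chunk = &mut buf[..(end - start) as usize];
        src.seek(SeekFrom::Start(start))?;
        src.read_exact(chunk)?;
        // Scan this window from its end so the last occurrence wins.
        if let Some(pos) = chunk.windows(needle.len()).rposition(|w| w == needle) {
            return Ok(Some(start + pos as u64));
        }
        if start == 0 {
            return Ok(None);
        }
        // Overlap consecutive windows by `needle.len() - 1` bytes so a match
        // straddling a window boundary is not missed.
        end = start + (needle.len() as u64 - 1);
    }
}
```

When the marker sits near the end of the stream, only the first window or two are read; the whole stream is traversed only when the marker is missing or located near the start.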
§Worst-case scenario
The detect function reads from the start and then does the backward sliding only for types that matched at the start. This improves the accuracy of file detection, at a cost: if a marker is found at the start, the specification states that there is also a marker at the end, and the backward sliding window finds no marker at the end, we will have traversed the entire data stream, with a time complexity of O(n). So the worst case of file detection is linear.
However, even with a linear worst case, we assume that in most scenarios the marker at the start is enough to detect the file type. If it is not enough and we need to look at the end, we assume that in most cases the window will not need to slide all the way to the start of the stream, because the algorithm will find the marker closer to the end than to the start.
Further benchmarks can be done to check whether bindet's amortized time complexity is really O(1), given a set of files to detect.
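To make the cost argument concrete, here is a small in-memory sketch that counts how many candidate positions a backward scan tries. The `Outcome` enum and the `check_zip` function are hypothetical and exist only for this illustration; they are not part of bindet's API.

```rust
/// Outcome of a two-pass check: a start marker alone gives a probable match,
/// a start marker plus an end marker gives a full match.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    NoMatch,
    Probable,
    Full,
}

/// Returns the outcome and the number of positions inspected.
pub fn check_zip(data: &[u8]) -> (Outcome, usize) {
    const LOCAL_HEADER: &[u8] = b"PK\x03\x04";
    const END_RECORD: &[u8] = b"PK\x05\x06";
    if !data.starts_with(LOCAL_HEADER) {
        // Only the head was looked at: constant cost.
        return (Outcome::NoMatch, LOCAL_HEADER.len().min(data.len()));
    }
    // Walk backward from the end; `steps` counts candidate positions tried.
    let mut steps = 0;
    for w in data.windows(END_RECORD.len()).rev() {
        steps += 1;
        if w == END_RECORD {
            return (Outcome::Full, steps);
        }
    }
    // Start marker present, end marker absent: every position was visited.
    (Outcome::Probable, steps)
}
```

For a well-formed archive the end record is found after a handful of steps, while a truncated archive forces a scan of every position, which is the linear worst case described above.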
§Examples
use std::fs::OpenOptions;
use std::io::BufReader;
use std::io::ErrorKind;
use bindet;
use bindet::types::FileType;
use bindet::FileTypeMatch;
use bindet::FileTypeMatches;

let file = OpenOptions::new().read(true).open("files/test.tar").unwrap();
let buf = BufReader::new(file);
let detect = bindet::detect(buf).map_err(|e| e.kind());
let expected: Result<Option<FileTypeMatches>, ErrorKind> = Ok(Some(FileTypeMatches::new(
    vec![FileType::Tar],
    vec![FileTypeMatch::new(FileType::Tar, true)],
)));
assert_eq!(detect, expected);
§Features
§nightly
Uses Macro MetaVar Expression counting instead of repetition counting through tuple slice length.
Enabling the nightly feature flag has no impact on performance, either at runtime or at compile time: both the stable and the nightly approaches are optimized and inlined at compile time, and both have the same negligible compile-time cost.
This only exists because Macro MetaVar Expression is still being discussed [1] (and may not reach stable 1.63 or 1.64, and it is too late for 1.62). Even though $$ and ${ignore(_)} are targeting 1.62 [2] and will probably be delivered, ${count(_)} is one of the features left out of the stabilization because it may need more refinement and discussion.
bindet itself does not need this, since it does not have enough elements to cause a compiler crash, but keeping it in the source code helps us remember to make it the default when it reaches stable, and to reduce the number of hacky/tricky things we need to do with declarative macros (which already need a bunch of tricks).
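For readers unfamiliar with the stable technique mentioned above, repetition counting through the length of a slice of unit values can be sketched like this. The `unit!` and `count!` macros and the `N` and `NAMES` items are hypothetical names for the example; bindet's own macros may be structured differently.

```rust
/// Maps any token tree to `()`, so each repetition contributes one unit value.
macro_rules! unit {
    ($t:tt) => {
        ()
    };
}

/// Counts its arguments by building a slice of `()` and taking its length.
macro_rules! count {
    ($($x:tt),* $(,)?) => {
        <[()]>::len(&[$(unit!($x)),*])
    };
}

/// The count is usable in const context, for example as an array length.
pub const N: usize = count!(a, b, c);
pub static NAMES: [&str; N] = ["a", "b", "c"];
```

Since `()` is zero-sized and slice `len` is a `const fn`, the whole expression is evaluated at compile time, which is consistent with the claim that the stable approach carries no runtime cost.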
§mime
Enables conversion from FileType and FileRootType to Mime by implementing the TryInto<Mime> trait. There is no need to use any additional module; just enable the feature.
§mediatype
Enables conversion from FileType and FileRootType to MediaTypeBuf by implementing the TryInto<MediaTypeBuf> trait. There is no need to use any additional module; just enable the feature.
Re-exports§
pub use crate::types::FileRootType;
pub use crate::types::FileType;
Modules§
- description
- Description module
- matcher
- Matcher module
- types
- Enumerates all file types that bindet is able to detect.
Structs§
- FileTypeMatch
- Stores information about a specific FileType match result.
- FileTypeMatches
Functions§
- detect
- Detect a file type by looking at the start and at the end of the file (at the end only for applicable file types).
- detect_at_end
- Detect a file type by using a backward sliding window. This approach has O(n) time complexity and is not meant to be used directly.
- detect_at_end_from_ref
- Detect a file type by using a backward sliding window. This approach has O(n) time complexity and is not meant to be used directly.
- detect_at_start
- Detect a file type by looking at the start of the file. Types that need a second check at the end may be reported with FileTypeMatch.full_match = false, signaling a probable match.
- detect_at_start_from_ref
- Detect a file type by looking at the start of the file. Types that need a second check at the end may be reported with FileTypeMatch.full_match = false, signaling a probable match.
- detect_variants_at_end
- Detect a file type by using a backward sliding window. This approach has O(n) time complexity and is not meant to be used directly.
- detect_variants_at_end_from_ref
- Detect a file type by using a backward sliding window. This approach has O(n) time complexity and is not meant to be used directly.
- detect_variants_at_start
- Detect a file type by looking at the start of the file. Types that need a second check at the end may be reported with FileTypeMatch.full_match = false, signaling a probable match.
- detect_variants_at_start_from_ref
- Detect a file type by looking at the start of the file. Types that need a second check at the end may be reported with FileTypeMatch.full_match = false, signaling a probable match.