Crate bindet

Fast binary file type detection

bindet provides fast and safe binary file type detection, even for large files.

The worst case for bindet is O(n), but several tricks are applied to amortize the time complexity to O(1), so in most cases detection does not take O(n) time.

§Supported file types

  • Zip
  • Rar (4 and 5)
  • Tar (uncompressed)
  • LZMA
  • 7zXZ
  • Zst
  • Png
  • Jpg
  • 7-zip
  • Opus
  • Vorbis
  • Mp3
  • Webp
  • Flac
  • Matroska (mkv, mka, mks, mk3d, webm)
  • Wasm
  • Java Class
  • Scala Tasty
  • Mach-O
  • Elf (Executable and Linkable Format)
  • Wav
  • Avi
  • Aiff
  • Tiff
  • Sqlite3 (.db)
  • Ico
  • Dalvik
  • Pdf
  • Exe/Dll
  • Gif
  • Xcf
  • Bmp
  • Iso
  • Swf/Swc
  • (some may be missing; please refer to FileType)

§First Step

File detection is a two-pass process. First, bindet tries to find the magic number at the start of the Read. If the magic number is found, a second pass may be done to ensure the detection is correct. For example, FileType::Zip has a Local File Header that starts with a 4-byte signature and an End of central directory record that appears at the end of non-empty zip files.

Some files can be detected by looking only at the start of the file, using a fixed-size buffer, which guarantees O(1) for simple detection and amortized O(1) for correctness. Other file types need more work: the RAR SFX specification states that the magic number may be found anywhere from the start of the file up to the SFX module size (1 MB). In the worst case, this requires a sliding window over up to 1 MB of data to find the value. This type of check happens in the second step.
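The fixed-size-buffer check described above can be sketched as follows. This is a minimal illustration that uses only the standard library, not bindet's actual implementation: the `SIGNATURES` table, `read_prefix`, and `sniff_start` are hypothetical names, and the table covers only a few well-known magic numbers.

```rust
use std::io::{self, Read};

/// Hypothetical signature table (assumption: a tiny subset for illustration).
const SIGNATURES: &[(&str, &[u8])] = &[
    ("png", b"\x89PNG\r\n\x1a\n"),
    ("zip", b"PK\x03\x04"),
    ("pdf", b"%PDF-"),
    ("gif", b"GIF8"),
];

/// Fills `buf` as far as possible, tolerating short reads, and returns the
/// number of bytes actually read (less than `buf.len()` only at end of stream).
fn read_prefix<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        match reader.read(&mut buf[filled..]) {
            Ok(0) => break,
            Ok(n) => filled += n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(filled)
}

/// Reads a fixed 16-byte prefix and compares it against every signature.
/// The work is bounded by the buffer size, not by the length of the stream.
fn sniff_start<R: Read>(reader: &mut R) -> io::Result<Option<&'static str>> {
    let mut buf = [0u8; 16];
    let n = read_prefix(reader, &mut buf)?;
    let head = &buf[..n];
    Ok(SIGNATURES
        .iter()
        .find(|(_, magic)| head.starts_with(magic))
        .map(|(name, _)| *name))
}
```

Because the buffer size is a constant, the cost of this check does not grow with the file size, which is where the O(1) bound for simple detection comes from.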

§Second Step

The first step uses a small buffer to store the initial bytes of the data and tries to detect the file type. The second step uses a larger buffer, up to the size of the largest lookup range (currently 1 MB, which matches the RAR5 specification), and a sliding window to find a range that matches the magic number sequence.

The same strategy is applied by the detect_at_end logic: it scans the file backward, using a sliding window, to find a matching sequence of bytes. This logic ensures correctness for file types that have a sequence of bytes that appears at the end.
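A bounded forward sliding window of this kind might look like the sketch below. It is written against the standard library only and is not taken from bindet's source: `find_within`, the 4096-byte chunk size, and the `limit` parameter are assumptions made for illustration.

```rust
use std::io::{self, Read};

/// Searches for `needle` within the first `limit` bytes of `reader`.
/// Returns the stream offset of the first match, if any.
fn find_within<R: Read>(
    reader: &mut R,
    needle: &[u8],
    limit: usize,
) -> io::Result<Option<usize>> {
    if needle.is_empty() {
        return Ok(None);
    }
    let mut window: Vec<u8> = Vec::new();
    let mut base = 0usize; // stream offset of window[0]
    let mut consumed = 0usize; // bytes read from the stream so far
    let mut chunk = [0u8; 4096];

    while consumed < limit {
        let want = chunk.len().min(limit - consumed);
        let n = match reader.read(&mut chunk[..want]) {
            Ok(0) => break, // end of stream
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        consumed += n;
        window.extend_from_slice(&chunk[..n]);

        if let Some(pos) = window.windows(needle.len()).position(|w| w == needle) {
            return Ok(Some(base + pos));
        }

        // Keep only the last `needle.len() - 1` bytes so that a match
        // straddling two chunks is still found on the next iteration.
        let keep = needle.len() - 1;
        if window.len() > keep {
            let dropped = window.len() - keep;
            window.drain(..dropped);
            base += dropped;
        }
    }
    Ok(None)
}
```

With `limit` set to `1 << 20`, the search stops after 1 MB regardless of how large the stream is, so the cost stays bounded by the lookup range rather than by the file size.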
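The backward scan can be sketched in the same spirit. The code below is an illustrative approximation built on `Read + Seek` from the standard library, not bindet's detect_at_end itself: `rfind_marker` and the 4096-byte chunk size are hypothetical.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Scans `reader` from the end toward the start and returns the offset of
/// the last occurrence of `needle`, if any.
fn rfind_marker<R: Read + Seek>(reader: &mut R, needle: &[u8]) -> io::Result<Option<u64>> {
    if needle.is_empty() {
        return Ok(None);
    }
    const CHUNK: u64 = 4096;
    // `end` is the exclusive upper bound of the region not yet scanned.
    let mut end = reader.seek(SeekFrom::End(0))?;
    // First `needle.len() - 1` bytes of the previously scanned window, so a
    // marker that straddles two chunks is not missed.
    let mut tail: Vec<u8> = Vec::new();

    while end > 0 {
        let start = end.saturating_sub(CHUNK);
        let mut window = vec![0u8; (end - start) as usize];
        reader.seek(SeekFrom::Start(start))?;
        reader.read_exact(&mut window)?;
        window.extend_from_slice(&tail);

        // Later regions were already scanned without success, so the last
        // match in this window is the last match in the whole stream.
        if let Some(pos) = window.windows(needle.len()).rposition(|w| w == needle) {
            return Ok(Some(start + pos as u64));
        }

        let keep = window.len().min(needle.len() - 1);
        tail = window[..keep].to_vec();
        end = start;
    }
    Ok(None)
}
```

When the marker sits near the end of the file, as the ZIP End of central directory record usually does, the loop returns after one or two chunks. Only when the marker is absent does it walk all the way back to offset 0.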

§Worst-case scenario

The detect function first reads from the start, and then slides backward from the end only for the types that matched at the start. This improves the accuracy of file detection, but it has a cost. If a marker is found at the start, the specification states that there is also a marker at the end, and the backward sliding window finds no such marker, then the entire data stream has been traversed, with a time complexity of O(n). The worst case of file detection is therefore linear.

However, even with a linear worst case, we assume that in most scenarios the marker at the start is enough to detect the file type. When it is not enough and we need to look at the end, we assume that in most cases the window does not need to slide all the way back to the start of the stream, because the algorithm will find the marker closer to the end than to the start.

Further benchmarks over a large set of files could check whether bindet's amortized time complexity is really O(1).
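To make the trade-off concrete, here is a simplified sketch of the combined strategy for ZIP, working on an in-memory slice instead of a stream. `ZipMatch` and `classify_zip` are hypothetical names used only for this illustration; the two signatures are the standard ZIP Local File Header and End of central directory record signatures.

```rust
#[derive(Debug, PartialEq)]
enum ZipMatch {
    /// No ZIP signature at the start.
    None,
    /// Start signature found, but no end-of-archive marker (probable match).
    Probable,
    /// Both the start signature and the end marker were found.
    Full,
}

fn classify_zip(data: &[u8]) -> ZipMatch {
    const LOCAL_FILE_HEADER: &[u8] = b"PK\x03\x04";
    const END_OF_CENTRAL_DIR: &[u8] = b"PK\x05\x06";

    // Cheap check first: a constant number of bytes at the start.
    if !data.starts_with(LOCAL_FILE_HEADER) {
        return ZipMatch::None;
    }

    // Only now pay for the backward scan. It stops at the first hit when
    // walking from the end, and visits every position only when the end
    // marker is missing, which is the linear worst case.
    let found_end = data
        .windows(END_OF_CENTRAL_DIR.len())
        .rev()
        .any(|w| w == END_OF_CENTRAL_DIR);

    if found_end {
        ZipMatch::Full
    } else {
        ZipMatch::Probable
    }
}
```

Inputs that fail the start check never trigger the backward scan, so the linear cost is confined to data that begins like a ZIP file but has no end marker.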

§Examples

use std::fs::{OpenOptions};
use std::io::BufReader;
use std::io::ErrorKind;
use bindet;
use bindet::types::FileType;
use bindet::FileTypeMatch;
use bindet::FileTypeMatches;

let file = OpenOptions::new().read(true).open("files/test.tar").unwrap();
let buf = BufReader::new(file);

let detect = bindet::detect(buf).map_err(|e| e.kind());
let expected: Result<Option<FileTypeMatches>, ErrorKind> = Ok(Some(FileTypeMatches::new(
    vec![FileType::Tar],
    vec![FileTypeMatch::new(FileType::Tar, true)]
)));

assert_eq!(detect, expected);

§Features

§nightly

Uses Macro MetaVar Expression Counting instead of Repetition Counting through tuple slice length.
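For readers unfamiliar with the stable technique, the following is a minimal sketch of repetition counting through slice length. The macro names `replace_with_unit` and `count_tts` are hypothetical and are not bindet's own macros; the pattern itself is a well-known declarative-macro idiom.

```rust
/// Replaces any single token tree with the unit value `()`.
macro_rules! replace_with_unit {
    ($_t:tt) => {
        ()
    };
}

/// Counts token trees by building an array of `()` (one per token tree)
/// and asking for its length as a slice. `<[()]>::len` is a `const fn`,
/// so the result can be used in constant contexts.
macro_rules! count_tts {
    ($($tts:tt)*) => {
        <[()]>::len(&[$(replace_with_unit!($tts)),*])
    };
}

/// The count is computed at compile time.
const SUPPORTED: usize = count_tts!(zip rar tar);
```

The array holds only zero-sized values and its length is evaluated at compile time, which is consistent with the statement that the stable approach has no runtime cost. The nightly `${count(_)}` expression yields the same number without the helper macro.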

Enabling the nightly feature flag has no impact on performance, neither at runtime nor at compile time: both the stable and the nightly approaches are optimized and inlined at compile time, with the same negligible compile-time cost.

This only exists because Macro MetaVar Expressions are still being discussed (and may not reach stable in 1.63 or 1.64, and it is too late for 1.62). Even though $$ and ${ignore(_)} are targeting 1.62 and will probably be delivered, ${count(_)} is one of the features left out of the stabilization because it may need more refinement and discussion.

bindet itself does not need this, since it does not have enough elements to cause a compiler crash, but keeping it in the source code helps us remember to make it the default when it reaches stable, and to reduce the amount of hacky/tricky things we need to do with declarative macros (which already need a bunch of tricks).

§mime

Enables conversion from FileType and FileRootType to Mime by implementing the TryInto<Mime> trait. There is no need to use any additional module; just enable the feature.

§mediatype

Enables conversion from FileType and FileRootType to MediaTypeBuf by implementing the TryInto<MediaTypeBuf> trait. There is no need to use any additional module; just enable the feature.

Re-exports§

pub use crate::types::FileRootType;
pub use crate::types::FileType;

Modules§

description
Description module
matcher
Matcher module
types
Enumerates all file types that bindet is able to detect.

Structs§

FileTypeMatch
Stores information about a specific FileType match result.
FileTypeMatches

Functions§

detect
Detect a file type by looking at the start and at the end of the file (at the end only for applicable file types).
detect_at_end
Detect a file type by using a backward sliding window. This approach has O(n) time complexity and is not meant to be used directly.
detect_at_end_from_ref
Detect a file type by using a backward sliding window. This approach has O(n) time complexity and is not meant to be used directly.
detect_at_start
Detect a file type by looking at the start of the file. Types that need a second check at the end may be reported with FileTypeMatch.full_match = false, signaling a probable match.
detect_at_start_from_ref
Detect a file type by looking at the start of the file. Types that need a second check at the end may be reported with FileTypeMatch.full_match = false, signaling a probable match.
detect_variants_at_end
Detect a file type by using a backward sliding window. This approach has O(n) time complexity and is not meant to be used directly.
detect_variants_at_end_from_ref
Detect a file type by using a backward sliding window. This approach has O(n) time complexity and is not meant to be used directly.
detect_variants_at_start
Detect a file type by looking at the start of the file. Types that need a second check at the end may be reported with FileTypeMatch.full_match = false, signaling a probable match.
detect_variants_at_start_from_ref
Detect a file type by looking at the start of the file. Types that need a second check at the end may be reported with FileTypeMatch.full_match = false, signaling a probable match.