Crate file_crawler


A customisable, multithreaded (optionally async) file crawler for local file systems.

§Getting Started

It is recommended to

  • add it to your project:
cargo add file-crawler
  • use the prelude:
use file_crawler::prelude::*;
  • and read the examples below!

While working with the library, refer to the Crawler documentation.

§Examples

Below are some examples showing usage in different use cases. Reading these is enough to understand everything needed for most use cases.

§Example 1

Here’s how you create a synchronous, multithreaded Crawler that prints the file name of every file in a folder:

use file_crawler::prelude::*;

use std::path::PathBuf;

Crawler::new()
    .start_dir("C:\\user\\foo")
    .run(|_, path: PathBuf| {
        println!("{}", path.display());
        //placeholder error type for now
        Ok::<(), std::io::Error>(())
    })?;
Ok(())

§Example 2

Actually, we left one argument out: the Context! We didn’t need it so far, but if we want to know how many files are in our folder, we can do this:

use file_crawler::prelude::*;

use std::path::PathBuf;
use std::sync::{Arc, Mutex};

//the context is later returned as the exact same type from the Crawler::run function,
//so we can bind it to a variable if needed
let count =
Crawler::new()
    .start_dir("C:\\user\\foo")
    //you could of course use an atomic type instead, which makes more sense for plain numbers
    .context(Mutex::new(0))
    .run(|ctx: Arc<Mutex<u32>>, path: PathBuf| {
        *ctx.lock().unwrap() += 1;
        println!("{}", path.display());
        Ok::<(), std::io::Error>(())
    })?;
println!("Total number of files in \"C:\\user\\foo\": {}", count.lock().unwrap());
Ok(())

§Example 3

Until now, returning Ok(()) was more mandatory than useful. Let’s look at a use case where error propagation is a real benefit: counting occurrences of the letter ‘a’ (assuming the folder contains only text files).

use file_crawler::prelude::*;

use std::fs::File;
use std::io::Read;
use std::path::PathBuf;
use std::sync::Arc;
use std::sync::atomic::{AtomicU32, Ordering};

let a_count =
Crawler::new()
    .start_dir("C:\\user\\foo")
    .context(AtomicU32::new(0))
    .run(|ctx: Arc<AtomicU32>, path: PathBuf| {
        let mut contents = String::new();
        let mut file = File::open(path)?;
        //NOTE: reading fails for files that are not valid UTF-8,
        //which returns an error and therefore terminates the crawler
        file.read_to_string(&mut contents)?;
        contents.chars().for_each(|char| if char == 'a' { ctx.fetch_add(1, Ordering::Relaxed); });
        Ok::<(), std::io::Error>(())
    })?;
println!("Occurrences of the letter 'a' in \"C:\\user\\foo\": {}", a_count.load(Ordering::Relaxed));
Ok(())

§Example 4

Say you are looking for all .txt files in a folder that is probably very large and deeply nested, and you don’t want to spend all the computation power and time a full crawl would require. You can do something like this:

use file_crawler::prelude::*;

Crawler::new()
    .start_dir("C:\\user\\probably_very_deep_folder")
    //you can set a regex for every file / folder
    //the closure you specify is only executed for a file if its name matches the regex
    //this regex matches every single-line string ending in ".txt"
    .file_regex(r"^.*\.txt$")
    //sets a maximum depth (in terms of "folder layers" below the start directory)
    .search_depth(3)
    //you can also leave out the "PathBuf" type annotation; it was kept above for readability
    .run(|_, path| {
        println!("{}", path.display());
        Ok::<(), std::io::Error>(())
    })?;

You can also set a folder regex via Crawler::folder_regex. Checking against a file regex inside the closure is possible too, but declaring it on the Crawler may enable further optimisations in the future.
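As a sketch, combining both regexes might look like the following. The folder pattern, file pattern, and start directory here are made-up examples:

```rust
use file_crawler::prelude::*;

//hypothetical example: only descend into folders named "src" or "tests",
//and only run the closure for Rust source files
Crawler::new()
    .start_dir("C:\\user\\some_project")
    .folder_regex(r"^(src|tests)$")
    .file_regex(r"^.*\.rs$")
    .run(|_, path| {
        println!("{}", path.display());
        Ok::<(), std::io::Error>(())
    })?;
```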

§Example 5

A focus was also put on the laziness1 of the Crawler: you can create and store one or more Crawlers and use them later, mostly without any heavy computation before running2:

use file_crawler::prelude::*;

const START_DIR: &str="C:\\user\\foo";

//the file types we are interested in
let regexes = [
               r"^.*\.txt$",
               r"^.*\.elf$",
               r"^.*\.png$"
              ];

//constructing them
let crawlers = regexes.iter()
                .map(|regex|
                    Crawler::new()
                    .file_regex(regex)
                    .start_dir(START_DIR)
                );

//using them
for crawler in crawlers {
    crawler.run(|_, path| {
        println!("{}", path.display());
        Ok::<(), std::io::Error>(())
    })?;
}
Ok(())

§Example 6

Like with iterators in rayon, you can simply exchange the Crawler::new method with the Crawler::new_async method to get an async crawler.

use file_crawler::prelude::*;

//we're using the tokio from the prelude here, no need to add it as an extra dependency
use tokio::fs::File;
use tokio::io::AsyncReadExt;
use std::path::PathBuf;
use std::sync::Arc;
use std::sync::atomic::{AtomicU32, Ordering};

//basically the same as example 3!
let a_count =
//the only change required to make it async (apart from the run(..) code)
//don't forget to enable the 'async' feature!
Crawler::new_async()
    .start_dir("C:\\user\\foo")
    .context(AtomicU32::new(0))
    .run(async |ctx: Arc<AtomicU32>, path: PathBuf| {
        let mut contents = String::new();
        let mut file = File::open(path).await?;
        //NOTE: reading fails for files that are not valid UTF-8,
        //which returns an error and therefore terminates the crawler
        file.read_to_string(&mut contents).await?;
        contents.chars().for_each(|char| if char == 'a' { ctx.fetch_add(1, Ordering::Relaxed); });
        Ok::<(), std::io::Error>(())
    }).await?;
println!("Occurrences of the letter 'a' in \"C:\\user\\foo\": {}", a_count.load(Ordering::Relaxed));
Ok(())

§Features

  • parallel: enables non-async multithreaded Crawler execution via the rayon crate. Enabled by default.
  • async: enables asynchronous, multithreaded3 Crawler execution via tokio.
  • lazy_store: enables creation of async and non-async Crawlers for later usage or interfacing with other crates, but not running them so tokio/rayon do not need to be compiled4.
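As a sketch, a Cargo.toml entry enabling the async crawler alongside the default parallel one might look like this (the version is a placeholder; check the crate page for the current one):

```toml
[dependencies]
# default features (including "parallel") stay enabled; "async" is added on top
file-crawler = { version = "*", features = ["async"] }
```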

§Planned Features

§Panics

In general - especially given the focus on the Crawler’s laziness - it is desirable for as many potential panics as possible to occur at creation time rather than at runtime (i.e. when calling run on the Crawler). Panics can occur, for example, when setting the regex to an invalid pattern; this may be changed in the future. So if the creation of the Crawler succeeds, running it will most likely not panic.
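For example, assuming an invalid pattern does panic at creation as described above, the panic would surface here while building the Crawler, never inside run:

```rust
use file_crawler::prelude::*;

//"[" is not a valid regex, so this panics immediately when the
//pattern is compiled - before run is ever called
let crawler = Crawler::new()
    .start_dir("C:\\user\\foo")
    .file_regex(r"[");
```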


  1. lazy evaluation

  2. one exception is setting a regex, because it is compiled immediately so that an invalid pattern panics early.

  3. Currently, the async version requires a tokio runtime with at least 2 threads. Running it in a single-threaded runtime is theoretically possible, but causes the crawler to hang indefinitely, so this is not supported.

  4. Not necessary if both the parallel and async feature are enabled. 

Modules§

builder
Building the crawler via the builder pattern; currently the only way to construct one.
prelude
The prelude. It is - as you can see - very small, so if you include it, you don’t have to worry about importing items or about namespace pollution.