A customisable, multithreaded (optionally async) file crawler for local file systems
§Getting Started
It is recommended to
- add it to your project: `cargo add file-crawler`
- use the prelude: `use file_crawler::prelude::*;`
- and read the examples (or the `Crawler` docs)!
- While working with the library, refer to the `Crawler` documentation.
§Examples
Below are some examples showing usage in different situations. Reading these is enough to understand everything needed for most use cases.
§Example 1
Here’s how you create a synchronous, multithreaded Crawler that prints the file name of every file in a folder:
```rust
use file_crawler::prelude::*;
use std::path::PathBuf;

Crawler::new()
    .start_dir("C:\\user\\foo")
    .run(|_, path: PathBuf| {
        println!("{}", path.display());
        // placeholder error type for now
        Ok::<(), std::io::Error>(())
    })?;
Ok(())
```

§Example 2
Actually, we left one argument out: the Context!
We didn’t need it, but if we want to know how many files we have in our folder we can do this:
```rust
use file_crawler::prelude::*;
use std::path::PathBuf;
use std::sync::{Arc, Mutex};

// the context is later returned as the exact same type from the Crawler::run function,
// so we can bind it to a variable if needed
let count =
    Crawler::new()
        .start_dir("C:\\user\\foo")
        // you can of course use an atomic type instead, which makes more sense for plain numbers
        .context(Mutex::new(0))
        .run(|ctx: Arc<Mutex<u32>>, path: PathBuf| {
            *ctx.lock().unwrap() += 1;
            println!("{}", path.display());
            Ok::<(), std::io::Error>(())
        })?;
println!("Total number of files in \"C:\\user\\foo\": {}", count.lock().unwrap());
Ok(())
```

§Example 3
Until now, the `Ok(...)` was more mandatory than useful. Let's look at a use case where it is a real benefit, like counting the appearances of the letter 'a' (assuming the folder contains only text files):
```rust
use file_crawler::prelude::*;
use std::fs::File;
use std::io::Read;
use std::path::PathBuf;
use std::sync::Arc;
use std::sync::atomic::{AtomicU32, Ordering};

let a_count =
    Crawler::new()
        .start_dir("C:\\user\\foo")
        .context(AtomicU32::new(0))
        .run(|ctx: Arc<AtomicU32>, path: PathBuf| {
            let mut contents = String::new();
            let mut file = File::open(path)?;
            // NOTE: this can fail for files that are not readable as UTF-8,
            // which returns an error and therefore terminates the crawler
            file.read_to_string(&mut contents)?;
            contents.chars().for_each(|c| if c == 'a' { ctx.fetch_add(1, Ordering::Relaxed); });
            Ok::<(), std::io::Error>(())
        })?;
println!("Appearances of the letter 'a' in \"C:\\user\\foo\": {}", a_count.load(Ordering::Relaxed));
Ok(())
```
§Example 4
Say you are looking for all .txt files in a folder that is probably very big and deeply nested, and you don't want to spend all the computation power and time a full crawl would require. You can do something like this:
```rust
use file_crawler::prelude::*;

Crawler::new()
    .start_dir("C:\\user\\probably_very_deep_folder")
    // you can set a regex for every file / folder;
    // the closure you specify is only executed for a file if its name matches the regex.
    // this regex matches every single-line string ending in ".txt"
    .file_regex(r"^.*\.txt$")
    // sets a maximum depth (in terms of "folder layers" over each other)
    .search_depth(3)
    // you can also leave out the "PathBuf"; before, it was kept to make the examples easier to read
    .run(|_, path| {
        println!("{}", path.display());
        Ok::<(), std::io::Error>(())
    })?;
```

You can also set a folder regex via Crawler::folder_regex. Checking file names against a regex inside the closure is also possible, but declaring it on the Crawler may enable further optimisations in the future.
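As a sketch of what that could look like (hypothetical example; it assumes Crawler::folder_regex accepts a pattern string just like Crawler::file_regex, and the folder names are made up):

```rust
use file_crawler::prelude::*;

// hypothetical sketch: only descend into folders whose name is a year like "2024"
// (e.g. yearly archives), and only visit ".log" files inside them
Crawler::new()
    .start_dir("C:\\user\\archives")
    .folder_regex(r"^20\d\d$")
    .file_regex(r"^.*\.log$")
    .run(|_, path| {
        println!("{}", path.display());
        Ok::<(), std::io::Error>(())
    })?;
```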
§Example 5
A focus was also put on the laziness¹ of the Crawler, so it is possible to create and store one or more Crawlers and use them later, mostly without any heavy computation before running²:
```rust
use file_crawler::prelude::*;

const START_DIR: &str = "C:\\user\\foo";
// the file types we are interested in
let regexes = [
    r"^.*\.txt$",
    r"^.*\.elf$",
    r"^.*\.png$",
];
// constructing them (lazily, no heavy work happens here)
let crawlers = regexes.iter()
    .map(|regex|
        Crawler::new()
            .file_regex(regex)
            .start_dir(START_DIR)
    );
// using them
for crawler in crawlers {
    crawler.run(|_, path| {
        println!("{}", path.display());
        Ok::<(), std::io::Error>(())
    })?;
}
Ok(())
```

§Example 6
Like with iterators in rayon, you can simply exchange the Crawler::new method for the Crawler::new_async method to get an async crawler.
```rust
use file_crawler::prelude::*;
// we're using the tokio from the prelude here, no need to add it as an extra dependency
use tokio::fs::File;
use tokio::io::AsyncReadExt;
use std::path::PathBuf;
use std::sync::Arc;
use std::sync::atomic::{AtomicU32, Ordering};

// basically the same as example 3!
let a_count =
    // the only change required to make it async (apart from the run(..) code);
    // don't forget to enable the 'async' feature!
    Crawler::new_async()
        .start_dir("C:\\user\\foo")
        .context(AtomicU32::new(0))
        .run(async |ctx: Arc<AtomicU32>, path: PathBuf| {
            let mut contents = String::new();
            let mut file = File::open(path).await?;
            // NOTE: this can fail for files that are not readable as UTF-8,
            // which returns an error and therefore terminates the crawler
            file.read_to_string(&mut contents).await?;
            contents.chars().for_each(|c| if c == 'a' { ctx.fetch_add(1, Ordering::Relaxed); });
            Ok::<(), std::io::Error>(())
        }).await?;
println!("Appearances of the letter 'a' in \"C:\\user\\foo\": {}", a_count.load(Ordering::Relaxed));
Ok(())
```

§Features
- `parallel`: enables non-async, multithreaded Crawler execution via the `rayon` crate. Enabled by default.
- `async`: enables asynchronous, multithreaded³ Crawler execution via `tokio`.
- `lazy_store`: enables creating async and non-async Crawlers for later usage or for interfacing with other crates, without running them, so `tokio`/`rayon` do not need to be compiled⁴.
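Feature selection happens in Cargo.toml as usual; a minimal sketch (the version requirement is a placeholder):

```toml
[dependencies]
# async crawler only, without compiling the default rayon backend
file-crawler = { version = "*", default-features = false, features = ["async"] }
```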
§Planned Features
- `chili`: `chili` as an optional backend (instead of `rayon`; see the GitHub issue)
§Panics
In general, and especially given the focus on the Crawler's laziness, it is desirable for as many potential panics as possible to occur at creation, not at runtime (that is, when calling run on the Crawler).
Panics can, for example, occur when setting the regex to an invalid string; this may change in the future. So if creating the Crawler succeeds, running it will most likely not cause a panic.
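As a sketch of that early-panic behaviour (hypothetical code, assuming the pattern is compiled eagerly when set, as described above):

```rust
use file_crawler::prelude::*;

// "(" is not a valid regex, so this is expected to panic here at construction,
// not later when run(..) is called
let crawler = Crawler::new()
    .start_dir("C:\\user\\foo")
    .file_regex("(");
```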
2. One exception is setting a regex, because the pattern is compiled when it is set in order to emit an early panic.
3. Currently, the async version demands a tokio runtime with at least 2 threads. Running it in a single-threaded runtime is theoretically possible, but causes indefinite execution, so a single-threaded runtime won't work.
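A sketch of such a non-working setup (hypothetical code, assuming the documented API):

```rust
use file_crawler::prelude::*;
use std::path::PathBuf;

// a current-thread (single-threaded) runtime: the async crawler needs at least
// 2 worker threads, so this run would block indefinitely
#[tokio::main(flavor = "current_thread")]
async fn main() -> Result<(), std::io::Error> {
    Crawler::new_async()
        .start_dir("C:\\user\\foo")
        .run(async |_, path: PathBuf| {
            println!("{}", path.display());
            Ok::<(), std::io::Error>(())
        }).await?;
    Ok(())
}
```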
4. Not necessary if both the parallel and async features are enabled.