Crate fail

A fail point implementation for Rust.

Fail points are code instrumentations that allow errors and other behavior to be injected dynamically at runtime, primarily for testing purposes. Fail points are flexible and can be configured to exhibit a variety of behavior, including panics, early returns, and sleeping. They can be controlled both programmatically and via the environment, and can be triggered conditionally and probabilistically.

This crate is inspired by FreeBSD’s failpoints.

Usage

First, add this to your Cargo.toml:

[dependencies]
fail = "0.2"

Now you can import the fail_point! macro from the fail crate and use it to inject dynamic failures.

As an example, here’s a simple program that uses a fail point to simulate an I/O panic:

#[macro_use]
extern crate fail;

fn do_fallible_work() {
    fail_point!("read-dir");
    let _dir: Vec<_> = std::fs::read_dir(".").unwrap().collect();
    // ... do some work on the directory ...
}

fn main() {
    fail::setup();
    do_fallible_work();
    fail::teardown();
    println!("done");
}

Here, the program calls unwrap on the result of read_dir, a function that returns a Result. In other words, this particular program expects this call to read_dir to always succeed. And in practice it almost always will, which makes the behavior of this program when read_dir fails difficult to test. By instrumenting the program with a fail point, we can make it panic just before the call to read_dir, standing in for the panic that unwrap would raise if read_dir failed, and allowing us to observe the program’s behavior under failure conditions.

When the program is run normally it just prints “done”:

$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/failpointtest`
done

But now, by setting the FAILPOINTS environment variable, we can simulate a read_dir failure:

$ FAILPOINTS=read-dir=panic cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/failpointtest`
thread 'main' panicked at 'failpoint read-dir panic', /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/fail-0.2.0/src/lib.rs:286:25
note: Run with `RUST_BACKTRACE=1` for a backtrace.
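
The FAILPOINTS value above is a single name=action pair. As a rough illustration of how a value of that shape could be split up, the sketch below parses such a string using only the standard library. It assumes that multiple entries are separated by `;`, which the text above does not state (check the documentation for setup), and parse_failpoints is a hypothetical helper, not part of the fail crate.

```rust
/// Split a `FAILPOINTS`-style string into `(name, action)` pairs.
///
/// Assumption: entries are separated by `;` and each entry is `name=action`.
/// `parse_failpoints` is a hypothetical helper, not part of the `fail` crate.
fn parse_failpoints(spec: &str) -> Result<Vec<(String, String)>, String> {
    let mut pairs = Vec::new();
    for entry in spec.split(';') {
        let entry = entry.trim();
        if entry.is_empty() {
            continue; // tolerate empty input and trailing separators
        }
        // Split on the first `=` only, so the action may itself contain `=`.
        match entry.split_once('=') {
            Some((name, action)) if !name.trim().is_empty() => {
                pairs.push((name.trim().to_string(), action.trim().to_string()));
            }
            _ => return Err(format!("invalid fail point entry: {:?}", entry)),
        }
    }
    Ok(pairs)
}
```

With this sketch, "read-dir=panic" yields one pair, ("read-dir", "panic"), and an entry without `=` is rejected with an error.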

Usage in tests

The previous example triggers a fail point by setting the FAILPOINTS environment variable. In practice, you’ll often want to trigger fail points programmatically, in unit tests. Unfortunately, unit testing with fail points is complicated by concurrency concerns, so it requires some careful setup. First, let’s see the intuitive, but wrong, way to test with fail points.

This next example is like the previous, except instead of controlling fail points with an environment variable, it does so with the fail::cfg function, and instead of having a main function, it has a test case:

#[macro_use]
extern crate fail;

fn do_fallible_work() {
    fail_point!("read-dir");
    let _dir: Vec<_> = std::fs::read_dir(".").unwrap().collect();
    // ... do some work on the directory ...
}

#[test]
#[should_panic]
fn test_fallible_work() {
    fail::setup();
    fail::cfg("read-dir", "panic").unwrap();
    do_fallible_work();
    fail::teardown();
}

So this is a test that sets up the fail point to panic, and the test is expected to panic because it has the #[should_panic] attribute.

And this works fine.

But only in this simple case. It is not correct in general, because fail points are global resources that can be accessed from any thread, setup and teardown have global effect, and Rust runs tests in parallel on multiple threads. As a result, if more than one test calls setup or teardown, or configures the same fail point, the results are non-deterministic.

To account for this we need to serialize the execution of tests by holding a global lock, and only running a single fail point test at a time.

Here’s the correct way to write this test, and the basic pattern for writing tests with fail points:

#[macro_use]
extern crate lazy_static;
#[macro_use]
extern crate fail;

use std::sync::{Mutex, MutexGuard};

fn do_fallible_work() {
    fail_point!("read-dir");
    let _dir: Vec<_> = std::fs::read_dir(".").unwrap().collect();
    // ... do some work on the directory ...
}

lazy_static! {
    static ref LOCK: Mutex<()> = Mutex::new(());
}

fn setup<'a>() -> MutexGuard<'a, ()> {
    let guard = LOCK.lock().unwrap_or_else(|e| e.into_inner());
    fail::teardown();
    fail::setup();
    guard
}

#[test]
#[should_panic]
fn test_fallible_work() {
    let _guard = setup();
    fail::cfg("read-dir", "panic").unwrap();
    do_fallible_work();
}

With this arrangement, any test that calls setup and holds the resulting guard for the duration will not run in parallel with other tests. It depends on the lazy_static crate to initialize a global mutex.
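
If you would rather not depend on lazy_static, a plain static may be enough, because Mutex::new has been a const fn since Rust 1.63. The sketch below uses only the standard library and shows just the locking part. serialize_test is a hypothetical name, and a real test would still call fail::teardown and fail::setup after taking the lock, as the setup function above does.

```rust
use std::sync::{Mutex, MutexGuard};

// Since Rust 1.63, `Mutex::new` is a `const fn`, so a plain `static` can be
// initialized without `lazy_static`.
static LOCK: Mutex<()> = Mutex::new(());

/// Hypothetical helper: take the global test lock, ignoring poisoning.
///
/// A `#[should_panic]` test panics while holding the guard, which poisons
/// the mutex. `into_inner` recovers the guard so later tests can still run.
fn serialize_test() -> MutexGuard<'static, ()> {
    // In a real fail point test, reset the fail point state after locking,
    // as the `setup` function in the example above does.
    LOCK.lock().unwrap_or_else(|e| e.into_inner())
}
```

The unwrap_or_else call matters here: without it, the first test that panics while holding the guard would cause every later call to lock().unwrap() to panic as well.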

Note that this guard is necessary not only for test cases that configure fail points. If any test case in the crate enables a fail point, the guard is also necessary for every test that executes code containing that fail point, even if the test doesn’t call fail::cfg itself. In our example, consider what happens if we have two test cases that test do_fallible_work: one configures the fail point and expects the function to fail, while the other does not configure it and expects the function to succeed. If those tests execute in parallel, the result is not deterministic and there will be spurious test failures.

Because of this it is a best practice to put all fail point unit tests into their own binary. Here’s an example of a snippet from Cargo.toml that creates a fail-point-specific test binary:

[[test]]
name = "failpoints"
path = "tests/failpoints/mod.rs"

Early return

The previous examples illustrate injecting panics via fail points, but panics aren’t the only — or even the most common — error pattern in Rust. The more common type of error is propagated by Result return values, and fail points can inject those as well with “early returns”. That is, when configuring a fail point as “return” (as opposed to “panic”), the fail point will immediately return from the function, optionally with a configurable value.

The setup for early return requires a slightly different invocation of the fail_point! macro. To illustrate this, let’s modify the do_fallible_work function we used earlier to return a Result:

#[macro_use]
extern crate fail;

use std::io;

fn do_fallible_work() -> io::Result<()> {
    fail_point!("read-dir");
    let _dir: Vec<_> = std::fs::read_dir(".")?.collect();
    // ... do some work on the directory ...
    Ok(())
}

fn main() -> io::Result<()> {
    fail::setup();
    do_fallible_work()?;
    fail::teardown();
    println!("done");
    Ok(())
}

This example has more idiomatic Rust error handling, with no unwraps anywhere. Instead, it uses ? to propagate errors through Result return values, which is more realistic Rust code.

The “read-dir” fail point, though, is not yet configured to support early return, so if we attempt to configure it to “return”, we’ll see an error like this:

$ FAILPOINTS=read-dir=return cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.13s
     Running `target/debug/failpointtest`
thread 'main' panicked at 'Return is not supported for the fail point "read-dir"', src/main.rs:7:5
note: Run with `RUST_BACKTRACE=1` for a backtrace.

This error tells us that the “read-dir” fail point is not defined correctly to support early return, and gives us the line number of that fail point. What we’re missing in the fail point definition is code describing how to return an error value, and the way we do this is by passing fail_point! a closure that returns the same type as the enclosing function.

Here’s a variation that does so:

fn do_fallible_work() -> io::Result<()> {
    fail_point!("read-dir", |_| {
        Err(io::Error::new(io::ErrorKind::PermissionDenied, "error"))
    });
    let _dir: Vec<_> = std::fs::read_dir(".")?.collect();
    // ... do some work on the directory ...
    Ok(())
}

And now if the “read-dir” fail point is configured to “return” we get a different result:

$ FAILPOINTS=read-dir=return cargo run
   Compiling failpointtest v0.1.0 (/home/brian/pingcap/failpointtest)
    Finished dev [unoptimized + debuginfo] target(s) in 2.38s
     Running `target/debug/failpointtest`
Error: Custom { kind: PermissionDenied, error: StringError("error") }

This time, do_fallible_work returned the error defined in our closure, which propagated all the way up and out of main, then Rust’s default error handler printed the error. All as expected.
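
To make the control flow concrete, here is a small standalone sketch of how a macro can return from the enclosing function with a closure’s result. It does not use the fail crate and is not a description of how the crate is implemented. REGISTRY, sketch_cfg_return, sketch_eval, sketch_fail_point! and do_work_sketch are all hypothetical names invented for this illustration.

```rust
use std::sync::Mutex;

// Hypothetical global registry: (fail point name, optional return argument).
// A stand-in for illustration only, not how the `fail` crate stores state.
static REGISTRY: Mutex<Vec<(String, Option<String>)>> = Mutex::new(Vec::new());

/// Configure `name` to return early, optionally carrying `arg`.
fn sketch_cfg_return(name: &str, arg: Option<&str>) {
    let mut reg = REGISTRY.lock().unwrap_or_else(|e| e.into_inner());
    reg.retain(|(n, _)| n.as_str() != name); // replace any earlier setting
    reg.push((name.to_string(), arg.map(str::to_string)));
}

/// If `name` is configured, run the closure on its argument.
fn sketch_eval<R, F>(name: &str, f: F) -> Option<R>
where
    F: FnOnce(Option<String>) -> R,
{
    let reg = REGISTRY.lock().unwrap_or_else(|e| e.into_inner());
    let configured = reg
        .iter()
        .find(|(n, _)| n.as_str() == name)
        .map(|(_, arg)| arg.clone());
    drop(reg); // release the lock before running user code
    configured.map(f)
}

/// Hypothetical macro: because `return` is expanded inside the caller's body,
/// it leaves the *enclosing* function with the closure's result.
macro_rules! sketch_fail_point {
    ($name:expr, $e:expr) => {
        if let Some(res) = sketch_eval($name, $e) {
            return res;
        }
    };
}

fn do_work_sketch() -> Result<u32, String> {
    sketch_fail_point!("read-dir", |arg: Option<String>| {
        Err(arg.unwrap_or_else(|| "error".to_string()))
    });
    Ok(42) // reached only when the fail point is not configured
}
```

This also shows why the closure must return the same type as the enclosing function: its result is what the function returns.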

There’s one other thing to understand about this closure used for early return, and that’s the purpose of the argument. Notice that in the previous example our closure accepted an argument, but only with the placeholder _ — it didn’t do anything with it.

The purpose of this argument is to customize the return value dynamically: when configuring a fail point for return, you can also provide a string representing what should be returned, e.g. “return(true)” or “return(false)”. The closure receives that string inside an Option<String> and is responsible for converting it into the proper return type.

So here’s one final variation that accepts that string and incorporates it into the return value:

fn do_fallible_work() -> io::Result<()> {
    fail_point!("read-dir", |err| {
        let err = err.unwrap_or_else(|| "error".to_string());
        Err(io::Error::new(io::ErrorKind::PermissionDenied, err))
    });
    let _dir: Vec<_> = std::fs::read_dir(".")?.collect();
    // ... do some work on the directory ...
    Ok(())
}

And running it with a custom value:

$ FAILPOINTS="read-dir=return(kablooey)" cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.10s
     Running `target/debug/failpointtest`
Error: Custom { kind: PermissionDenied, error: StringError("kablooey") }
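
The same conversion idea applies to return types other than io::Result. As a hedged sketch, the two functions below show what the body of such a closure might do for a function returning bool and for one returning Result<u64, String>. Both names are hypothetical, and the fallback values are choices made for this example rather than behavior of the fail crate.

```rust
/// Hypothetical converter for a fail point in a function returning `bool`.
///
/// `arg` is the text inside `return(...)`, or `None` for a bare `return`.
/// Falling back to `false` is this sketch's choice, not crate behavior.
fn bool_from_arg(arg: Option<String>) -> bool {
    arg.and_then(|s| s.trim().parse::<bool>().ok())
        .unwrap_or(false)
}

/// The same idea for a function returning `Result<u64, String>`: a numeric
/// argument becomes `Ok`, any other text becomes an `Err` carrying that text.
fn count_from_arg(arg: Option<String>) -> Result<u64, String> {
    match arg {
        Some(s) => s.trim().parse::<u64>().map_err(|_| s.clone()),
        None => Err("error".to_string()),
    }
}
```

A fail point could then pass one of these where a closure is expected, since each takes an Option<String> and returns the enclosing function’s type.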

Advanced usage

That’s the basics of fail points: defining them with fail_point!, configuring them with FAILPOINTS and fail::cfg, and configuring them to panic and return early. But that’s not all they can do. To learn more see the documentation for cfg and fail_point!.
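
As a taste of what richer configurations can look like, the sketch below parses a single action of the shape [p%][cnt*]task[(arg)], where p is a probability and cnt a maximum trigger count. That grammar is an assumption based on my reading of the cfg documentation, so check it against the version you use. ActionSketch and parse_action are hypothetical and are not part of the fail crate.

```rust
/// Hypothetical parsed form of one action, e.g. `20%3*print(still alive!)`.
#[derive(Debug, PartialEq)]
struct ActionSketch {
    percent: Option<f32>,     // `p%`: chance that the action triggers
    max_count: Option<usize>, // `cnt*`: maximum number of triggers
    task: String,             // e.g. `panic`, `return`, `print`
    arg: Option<String>,      // text inside the parentheses, if any
}

/// Parse one action. The `[p%][cnt*]task[(arg)]` grammar is an assumption.
fn parse_action(s: &str) -> Result<ActionSketch, String> {
    let s = s.trim();
    // Separate the optional `(arg)` suffix first, so that a `%` or `*`
    // inside the argument is not mistaken for a prefix.
    let (head, arg) = match s.find('(') {
        Some(open) => {
            if !s.ends_with(')') {
                return Err(format!("missing `)` in {:?}", s));
            }
            (&s[..open], Some(s[open + 1..s.len() - 1].to_string()))
        }
        None => (s, None),
    };
    let mut rest = head;
    let mut percent = None;
    if let Some((p, tail)) = rest.split_once('%') {
        percent = Some(p.parse::<f32>().map_err(|e| format!("bad percent: {}", e))?);
        rest = tail;
    }
    let mut max_count = None;
    if let Some((c, tail)) = rest.split_once('*') {
        max_count = Some(c.parse::<usize>().map_err(|e| format!("bad count: {}", e))?);
        rest = tail;
    }
    if rest.is_empty() {
        return Err(format!("missing task in {:?}", s));
    }
    Ok(ActionSketch { percent, max_count, task: rest.to_string(), arg })
}
```

The sketch only parses. Deciding whether an action fires, for example by drawing a random number against percent or by decrementing max_count, is left out.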

Usage considerations

For most effective fail point usage, keep in mind the following:

Enable the no_fail feature in your release build. This will remove all the code for individual fail points, though not the code for calls to setup and teardown.
Carefully consider complex, concurrent, non-deterministic combinations of fail points. Put test cases exercising fail points into their own test crate and protect each test case with a mutex guard.
Use self-describing fail point names.
Fail points might have the same name, in which case they take the same actions. Be careful about duplicating fail point names, either within a single crate, or across multiple crates.

Macros

fail_point

Define a fail point.

Functions

cfg

Configure the actions for a fail point at runtime.

list

Get all registered fail points.

remove

Remove a fail point.

setup

Set up the fail point system.

teardown

Tear down the fail point system.