A fail point implementation for Rust.
Fail points are code instrumentations that allow errors and other behavior to be injected dynamically at runtime, primarily for testing purposes. Fail points are flexible and can be configured to exhibit a variety of behavior, including panics, early returns, and sleeping. They can be controlled both programmatically and via the environment, and can be triggered conditionally and probabilistically.
This crate is inspired by FreeBSD’s failpoints.
Usage
First, add this to your Cargo.toml:
[dependencies]
fail = "0.2"
Now you can import the fail_point! macro from the fail crate and use it to inject dynamic failures.
As an example, here’s a simple program that uses a fail point to simulate an I/O panic:
#[macro_use]
extern crate fail;

fn do_fallible_work() {
    fail_point!("read-dir");
    let _dir: Vec<_> = std::fs::read_dir(".").unwrap().collect();
    // ... do some work on the directory ...
}

fn main() {
    fail::setup();
    do_fallible_work();
    fail::teardown();
    println!("done");
}
Here, the program calls unwrap on the result of read_dir, a function that returns a Result. In other words, this particular program expects this call to read_dir to always succeed. And in practice it almost always will, which makes the behavior of this program when read_dir fails difficult to test. By instrumenting the program with a fail point we can inject a panic just before the call, standing in for the panic that unwrap would raise if read_dir failed, and observe the program’s behavior under failure conditions.
When the program is run normally it just prints “done”:
$ cargo run
Finished dev [unoptimized + debuginfo] target(s) in 0.01s
Running `target/debug/failpointtest`
done
But now, by setting the FAILPOINTS environment variable we can see what happens if read_dir fails:
$ FAILPOINTS=read-dir=panic cargo run
Finished dev [unoptimized + debuginfo] target(s) in 0.01s
Running `target/debug/failpointtest`
thread 'main' panicked at 'failpoint read-dir panic', /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/fail-0.2.0/src/lib.rs:286:25
note: Run with `RUST_BACKTRACE=1` for a backtrace.
Usage in tests
The previous example triggers a fail point by modifying the FAILPOINTS environment variable. In practice, you’ll often want to trigger fail points programmatically, in unit tests. Unfortunately, unit testing with fail points is complicated by concurrency concerns, so it requires some careful setup. First, let’s see the intuitive (but wrong) way to test with fail points.
This next example is like the previous, except instead of controlling fail points with an environment variable, it does so with the fail::cfg function, and instead of having a main function, it has a test case:
#[macro_use]
extern crate fail;

fn do_fallible_work() {
    fail_point!("read-dir");
    let _dir: Vec<_> = std::fs::read_dir(".").unwrap().collect();
    // ... do some work on the directory ...
}

#[test]
#[should_panic]
fn test_fallible_work() {
    fail::setup();
    fail::cfg("read-dir", "panic").unwrap();
    do_fallible_work();
    fail::teardown();
}
So this is a test that sets up the fail point to panic, and the test is expected to panic because it has the #[should_panic] attribute.
And this works fine.
But only in this simple case. It is not correct generally. This is because fail points are global resources that can be accessed from any thread, setup and teardown are operations that have global effect, and Rust tests are run in multiple threads, in parallel. As a result, if more than one test calls setup or teardown, or configures the same fail point, then the result is non-deterministic.
To account for this we need to serialize the execution of tests by holding a global lock, and only running a single fail point test at a time.
Here’s the correct way to write this test, and the basic pattern for writing tests with fail points:
#[macro_use]
extern crate lazy_static;
#[macro_use]
extern crate fail;

use std::sync::{Mutex, MutexGuard};

fn do_fallible_work() {
    fail_point!("read-dir");
    let _dir: Vec<_> = std::fs::read_dir(".").unwrap().collect();
    // ... do some work on the directory ...
}

lazy_static! {
    static ref LOCK: Mutex<()> = Mutex::new(());
}

fn setup<'a>() -> MutexGuard<'a, ()> {
    let guard = LOCK.lock().unwrap_or_else(|e| e.into_inner());
    fail::teardown();
    fail::setup();
    guard
}

#[test]
#[should_panic]
fn test_fallible_work() {
    let _guard = setup();
    fail::cfg("read-dir", "panic").unwrap();
    do_fallible_work();
}
With this arrangement, any test that calls setup and holds the resulting guard for its duration will not run in parallel with any other test that does the same. The pattern depends on the lazy_static crate to initialize a global mutex.
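The unwrap_or_else(|e| e.into_inner()) call in setup matters because a #[should_panic] test panics while it holds the guard, which poisons the mutex; with a plain unwrap() every later test would then fail at the lock. The following std-only sketch isolates that behavior. It assumes Rust 1.63 or later, where Mutex::new is a const fn and the static needs no lazy_static. The helper names fail_point_guard and panicking_test_poisons_lock are made up for illustration, and the calls into the fail crate are only indicated by a comment.

```rust
use std::sync::{Mutex, MutexGuard};
use std::thread;

// Assumption: Rust 1.63+, where Mutex::new is const and can initialize a static.
static LOCK: Mutex<()> = Mutex::new(());

// Hypothetical helper playing the role of `setup` in the example above.
fn fail_point_guard() -> MutexGuard<'static, ()> {
    // lock() returns Err once a previous holder has panicked (a poisoned mutex);
    // into_inner() recovers the guard so the remaining tests can still run.
    let guard = LOCK.lock().unwrap_or_else(|e| e.into_inner());
    // A real test helper would call fail::teardown() and fail::setup() here.
    guard
}

// Simulates a #[should_panic] test: it panics while holding the guard.
fn panicking_test_poisons_lock() -> bool {
    let worker = thread::spawn(|| {
        let _guard = fail_point_guard();
        panic!("simulated fail point panic");
    });
    // join() reports the panic as Err, and the mutex is poisoned afterwards.
    worker.join().is_err() && LOCK.is_poisoned()
}
```

After panicking_test_poisons_lock() has run, fail_point_guard() still hands out a guard, which is the property the test pattern relies on.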
Note that this type of guard is not only necessary for test cases that configure fail points. If any test case in the crate enables a fail point, then the guard is also necessary for every test that executes the code containing that fail point, even if that test doesn’t call fail::cfg itself. In our example, consider what happens if we have two test cases that test do_fallible_work, and one of them configures the fail point, expecting the function to fail, while the other does not configure the fail point, expecting it to succeed. If those tests execute in parallel, the result is not deterministic and there will be spurious test failures.
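The two-test scenario can be sketched without the fail crate by letting a global flag stand in for the fail point. Everything here is an assumption made for illustration: READ_DIR_PANICS replaces the crate's global fail point configuration, the flag check replaces fail_point!("read-dir"), and storing false in setup plays the role of fail::teardown() followed by fail::setup(). Rust 1.63 or later is assumed for the const Mutex::new.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Mutex, MutexGuard};

// Stand-in for the fail point configuration: one process-wide switch.
static READ_DIR_PANICS: AtomicBool = AtomicBool::new(false);
static LOCK: Mutex<()> = Mutex::new(());

fn do_fallible_work() -> &'static str {
    // Stand-in for fail_point!("read-dir") configured to "panic".
    if READ_DIR_PANICS.load(Ordering::SeqCst) {
        panic!("simulated read-dir failure");
    }
    "ok"
}

fn setup() -> MutexGuard<'static, ()> {
    let guard = LOCK.lock().unwrap_or_else(|e| e.into_inner());
    // Reset the global state left behind by whichever test ran before.
    READ_DIR_PANICS.store(false, Ordering::SeqCst);
    guard
}

#[test]
#[should_panic]
fn test_work_fails() {
    let _guard = setup();
    READ_DIR_PANICS.store(true, Ordering::SeqCst);
    do_fallible_work();
}

#[test]
fn test_work_succeeds() {
    // Configures nothing, but still takes the guard: without it this test
    // could observe the flag set by test_work_fails running in parallel.
    let _guard = setup();
    assert_eq!(do_fallible_work(), "ok");
}
```

If test_work_succeeds skipped the call to setup, its outcome would depend on thread scheduling, which is the spurious failure described above.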
Because of this it is a best practice to put all fail point unit tests into their own binary. Here’s an example of a snippet from Cargo.toml that creates a fail-point-specific test binary:
[[test]]
name = "failpoints"
path = "tests/failpoints/mod.rs"
Early return
The previous examples illustrate injecting panics via fail points, but panics aren’t the only (or even the most common) error pattern in Rust. The more common type of error is propagated by Result return values, and fail points can inject those as well with “early returns”. That is, when configuring a fail point as “return” (as opposed to “panic”), the fail point will immediately return from the function, optionally with a configurable value.
The setup for early return requires a slightly different invocation of the fail_point! macro. To illustrate this, let’s modify the do_fallible_work function we used earlier to return a Result:
#[macro_use]
extern crate fail;

use std::io;

fn do_fallible_work() -> io::Result<()> {
    fail_point!("read-dir");
    let _dir: Vec<_> = std::fs::read_dir(".")?.collect();
    // ... do some work on the directory ...
    Ok(())
}

fn main() -> io::Result<()> {
    fail::setup();
    do_fallible_work()?;
    fail::teardown();
    println!("done");
    Ok(())
}
So this example has more idiomatic Rust error handling, with no unwraps anywhere. Instead it uses ? to propagate errors via Result return values. This is more realistic Rust code.
The “read-dir” fail point, though, is not yet configured to support early return, so if we attempt to configure it to “return”, we’ll see an error like this:
$ FAILPOINTS=read-dir=return cargo run
Finished dev [unoptimized + debuginfo] target(s) in 0.13s
Running `target/debug/failpointtest`
thread 'main' panicked at 'Return is not supported for the fail point "read-dir"', src/main.rs:7:5
note: Run with `RUST_BACKTRACE=1` for a backtrace.
This error tells us that the “read-dir” fail point is not defined correctly to support early return, and gives us the line number of that fail point. What we’re missing in the fail point definition is code describing how to return an error value, and the way we do this is by passing fail_point! a closure that returns the same type as the enclosing function.
Here’s a variation that does so:
fn do_fallible_work() -> io::Result<()> {
    fail_point!("read-dir", |_| {
        Err(io::Error::new(io::ErrorKind::PermissionDenied, "error"))
    });
    let _dir: Vec<_> = std::fs::read_dir(".")?.collect();
    // ... do some work on the directory ...
    Ok(())
}
And now if the “read-dir” fail point is configured to “return” we get a different result:
$ FAILPOINTS=read-dir=return cargo run
Compiling failpointtest v0.1.0 (/home/brian/pingcap/failpointtest)
Finished dev [unoptimized + debuginfo] target(s) in 2.38s
Running `target/debug/failpointtest`
Error: Custom { kind: PermissionDenied, error: StringError("error") }
This time, do_fallible_work returned the error defined in our closure, which propagated all the way up and out of main, then Rust’s default error handler printed the error. All as expected.
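To see why the closure has to return the enclosing function's type, it can help to look at a toy macro with the same shape. The sketch below is not the fail crate's implementation: toy_eval, toy_fail_point! and do_fallible_work_toy are hypothetical names, and the runtime configuration is passed in as an ordinary argument instead of being looked up globally.

```rust
use std::io;

// Toy stand-in for illustration only; NOT how the fail crate is implemented.
// `configured` models the runtime state of one fail point:
//   None       -> the fail point is not set to "return"
//   Some(arg)  -> the fail point is set to "return", with an optional value
fn toy_eval<R>(
    configured: Option<Option<String>>,
    on_return: impl FnOnce(Option<String>) -> R,
) -> Option<R> {
    configured.map(on_return)
}

macro_rules! toy_fail_point {
    ($configured:expr, $on_return:expr) => {
        if let Some(value) = toy_eval($configured, $on_return) {
            // The macro expands inside the caller, so this `return` leaves the
            // enclosing function; `value` must therefore have its return type.
            return value;
        }
    };
}

fn do_fallible_work_toy(configured: Option<Option<String>>) -> io::Result<()> {
    toy_fail_point!(configured, |_| {
        Err(io::Error::new(io::ErrorKind::PermissionDenied, "error"))
    });
    // Reached only when the toy fail point is not configured to return.
    Ok(())
}
```

Because the return statement is generated inside do_fallible_work_toy, a closure producing any type other than io::Result<()> would be rejected by the compiler.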
There’s one other thing to understand about this closure used for early return, and that’s the purpose of the argument. Notice that in the previous example our closure accepted an argument, but only with the placeholder _; it didn’t do anything with it.
The purpose of this argument is to customize the return value dynamically: when configuring a fail point for return, you can also provide a string representing what should be returned, e.g. “return(true)” or “return(false)”. The closure receives that string inside an Option<String> and is responsible for converting it into the proper return type.
So here’s one final variation that accepts that string and incorporates it into the return value:
fn do_fallible_work() -> io::Result<()> {
    fail_point!("read-dir", |err| {
        let err = err.unwrap_or_else(|| "error".to_string());
        Err(io::Error::new(io::ErrorKind::PermissionDenied, err))
    });
    let _dir: Vec<_> = std::fs::read_dir(".")?.collect();
    // ... do some work on the directory ...
    Ok(())
}
And running it with a custom value:
$ FAILPOINTS="read-dir=return(kablooey)" cargo run
Finished dev [unoptimized + debuginfo] target(s) in 0.10s
Running `target/debug/failpointtest`
Error: Custom { kind: PermissionDenied, error: StringError("kablooey") }
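When the configured string stands for a typed value, as in the “return(true)” case mentioned earlier, the closure has to parse it. The following is a minimal sketch of such a conversion, written as a standalone function so it can be tried without the fail crate. The name early_return_value, the choice of io::Result<bool> as the return type, and the error kinds are assumptions made for illustration.

```rust
use std::io;

// Hypothetical helper: logic one might place in the closure passed to
// fail_point!, written as a plain function so it can be tried on its own.
fn early_return_value(arg: Option<String>) -> io::Result<bool> {
    match arg {
        // Configured as plain "return": fall back to a fixed error.
        None => Err(io::Error::new(io::ErrorKind::PermissionDenied, "error")),
        // Configured as "return(<text>)": interpret the text as a bool.
        Some(text) => text.trim().parse::<bool>().map_err(|e| {
            io::Error::new(
                io::ErrorKind::InvalidInput,
                format!("bad fail point value {:?}: {}", text, e),
            )
        }),
    }
}
```

In a function that returns io::Result<bool>, this could be wired in as fail_point!("some-name", |arg| early_return_value(arg)), where the fail point name is a placeholder. Rejecting unparseable values with an explicit error makes a mistyped configuration visible rather than silently treating it as false.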
Advanced usage
That’s the basics of fail points: defining them with fail_point!, configuring them with FAILPOINTS and fail::cfg, and configuring them to panic and return early. But that’s not all they can do. To learn more see the documentation for cfg and fail_point!.
Usage considerations
For most effective fail point usage, keep in mind the following:
- Enable the no_fail feature in your release build. This will remove all the code for individual fail points, though not the code for calls to setup and teardown.
- Carefully consider complex, concurrent, non-deterministic combinations of fail points. Put test cases exercising fail points into their own test crate and protect each test case with a mutex guard.
- Use self-describing fail point names.
- Fail points might have the same name, in which case they take the same actions. Be careful about duplicating fail point names, either within a single crate, or across multiple crates.