§Glommio - asynchronous thread-per-core applications in Rust.
§What is Glommio
Glommio is a library providing a safe Rust interface for asynchronous, thread-local I/O, based on the Linux io_uring interface and Rust's async support. Glommio also provides support for pinning threads to CPUs, allowing thread-per-core applications in Rust.
This library depends on Linux's io_uring interface, so it is Linux-only, with a kernel version of 5.8 or newer recommended.
This library provides abstractions for timers, file I/O and networking, plus support for multiple queues and an internal scheduler, all without using helper threads.
A more detailed exposition of Glommio's architecture is available in this blog post.
§Rust async
Using Glommio is not hard if you are familiar with Rust async. All you have to do is:
use glommio::LocalExecutorBuilder;

LocalExecutorBuilder::default()
    .spawn(|| async move {
        // your code here
    })
    .unwrap();
§Pinned threads
Although pinned threads are not required for use of glommio, by creating N executors and binding each to a specific CPU, one can use this crate to implement a thread-per-core system where context switches essentially never happen, allowing much higher efficiency.
You can easily bind an executor to a CPU by adjusting the LocalExecutorBuilder in the example above:
use glommio::{LocalExecutorBuilder, Placement};

// This will now never leave CPU 0
LocalExecutorBuilder::new(Placement::Fixed(0))
    .spawn(|| async move {
        // your code here
    })
    .unwrap();
Note that you can only have one executor per thread, so if you need more executors, you will have to create more threads. A more ergonomic interface for that is planned but not yet available.
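Until then, a minimal hand-rolled sketch: spawn one pinned executor per CPU thread and join the handles. Placement::Fixed and LocalExecutorBuilder are the same APIs used above; the CPU range, thread names, and join pattern here are illustrative only.

use glommio::{LocalExecutorBuilder, Placement};

// Spawn one executor thread per CPU, each pinned to its own core.
let handles: Vec<_> = (0..2)
    .map(|cpu| {
        LocalExecutorBuilder::new(Placement::Fixed(cpu))
            .name(&format!("worker-{}", cpu))
            .spawn(|| async move {
                // per-core work goes here
            })
            .unwrap()
    })
    .collect();

// Wait for every executor thread to finish.
for handle in handles {
    handle.join().unwrap();
}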
§Scheduling
For a thread-per-core system to work well, it is paramount that some form of scheduling can happen within the thread. Traditional applications use many threads to divide the many aspects of their workload and rely on the operating system and runtime to schedule those threads fairly and switch between them as necessary. For a thread-per-core system, each thread must handle its own scheduling at the application level.
Glommio provides extensive abstractions for handling scheduling, allowing multiple tasks to proceed on the same thread. Task scheduling can be handled broadly through static shares, or more dynamically through the use of controllers:
use glommio::{executor, Latency, LocalExecutorBuilder, Placement, Shares};

LocalExecutorBuilder::new(Placement::Fixed(0))
    .spawn(|| async move {
        let tq1 =
            executor().create_task_queue(Shares::Static(2), Latency::NotImportant, "test1");
        let tq2 =
            executor().create_task_queue(Shares::Static(1), Latency::NotImportant, "test2");

        let t1 = glommio::spawn_local_into(
            async move {
                // your code here
            },
            tq1,
        )
        .unwrap();
        let t2 = glommio::spawn_local_into(
            async move {
                // your code here
            },
            tq2,
        )
        .unwrap();

        t1.await;
        t2.await;
    })
    .unwrap();
This example creates two task queues: tq1 has 2 shares, tq2 has 1 share. This means that if both want to use the CPU to its maximum, tq1 will have 2/3 of the CPU time (2 / (1 + 2)) and tq2 will have 1/3 of the CPU time. Those shares are dynamic and can be changed at any time. Notice that this scheduling method doesn't prevent either tq1 or tq2 from using 100% of the CPU time at times in which they are the only task queue running: the shares are only considered when multiple queues need to run.
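Shares can also be computed dynamically, through the SharesManager trait (listed under Traits below). A minimal sketch, assuming shares() is the trait's one required method and that Shares::Dynamic accepts an Rc of the manager; the TunableShares type and the numbers are invented for illustration:

use glommio::{executor, Latency, LocalExecutorBuilder, Placement, Shares, SharesManager};
use std::{cell::Cell, rc::Rc};

// A manager whose share count can be changed at runtime; the scheduler
// re-reads shares() periodically and rebalances the queues accordingly.
struct TunableShares {
    shares: Cell<usize>,
}

impl SharesManager for TunableShares {
    fn shares(&self) -> usize {
        self.shares.get()
    }
}

LocalExecutorBuilder::new(Placement::Fixed(0))
    .spawn(|| async move {
        let manager = Rc::new(TunableShares { shares: Cell::new(100) });
        let _tq = executor().create_task_queue(
            Shares::Dynamic(manager.clone()),
            Latency::NotImportant,
            "dynamic",
        );
        // Later, raise this queue's priority without recreating it:
        manager.shares.set(500);
    })
    .unwrap();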
§Direct I/O
Glommio makes Direct I/O a first-class citizen, although Buffered I/O is present as well for situations where it may make sense.
This rides the trend of devices getting faster over the years and tries to bridge the software gap between fast devices and fast storage applications. You can read more about it in this article.
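A minimal sketch of the Direct I/O path using glommio::io::DmaFile; the file path and the 4 KiB size are illustrative, and, as with all Direct I/O, buffer size and position should be aligned to the device's block size:

use glommio::{io::DmaFile, LocalExecutor};

let ex = LocalExecutor::default();
ex.run(async {
    // Create a file opened for Direct I/O and write one 4 KiB block.
    let file = DmaFile::create("/tmp/glommio-example").await.unwrap();
    let mut buf = file.alloc_dma_buffer(4096);
    buf.as_bytes_mut().fill(42);
    file.write_at(buf, 0).await.unwrap();
    file.close().await.unwrap();
});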
§Controlled processes
Glommio ships with embedded controllers. You can read more about them in the Controllers module documentation. Controllers allow one to automatically adjust the scheduler shares to control how fast a particular process should happen given a user-provided criterion.
For a real-life application of such technology, I recommend reading this post from Glauber.
§Prior work
This work is heavily inspired (with some code respectfully imported) by the great work by Stjepan Glavina, in particular several of his async crates.
Aside from Stjepan’s work, this is also inspired greatly by the Seastar Framework for C++ that powers I/O intensive systems that are pushing the performance envelope, like ScyllaDB.
§Why is this its own crate?
Cooperative thread-per-core is a very specific programming model. Because only one task is executing per thread, the programmer never needs to hold locks. Atomic operations are therefore rare, relegated to only a handful of corner cases.
As atomic operations are costlier than their non-atomic counterparts, this improves efficiency by itself. However, it comes with the added benefits that context switches are virtually non-existent (they only occur for kernel threads and interrupts) and no time is ever wasted waiting on locks.
§Why is this a single monolith instead of many crates?
Take as an example the async-io crate. It has park() and unpark() methods. One can park() the current executor, and a helper thread will unpark it. This allows one to effectively use that crate with very little need for anything else for the simpler cases. Combined with synchronization primitives like Condvar and other thread-pool-based future crates, it excels in conjunction with others, but it is useful on its own.
Now contrast that to the equivalent bits in this crate: once you park() the thread, you can't unpark it. I/O never gets dispatched without explicitly calling into the reactor, which makes for a very weird programming model, and it is very hard to integrate with the outside world since most external I/O-related crates have threads that sooner or later will require Send + Sync.
A single crate is a way to minimize friction.
§io_uring
This crate depends heavily on Linux's io_uring. The reactor will register 3 rings per CPU:
- Main ring: The main ring, as its name implies, is where most operations will be placed. Once the reactor is parked, it only returns if the main ring has events to report.
- Latency ring: Operations that are latency sensitive can be put in the latency ring. The crate has a function called yield_if_needed() that efficiently checks if there are events pending in the latency ring. Because this crate uses cooperative programming, tasks run until they either complete or decide to yield, which means they can run for a very long time before latency-sensitive tasks have a chance to run. Every time you fire a long-running operation (usually a loop) it is good practice to check yield_if_needed() periodically (for example, after x iterations of the loop; see the sketch after this list). In particular, when a new priority class is registered, one can specify whether it contains latency-sensitive tasks or not. If the queue is marked as latency sensitive, the Latency enum takes a duration parameter that determines for how long other tasks can run even if there are no external events (by registering a timer with the io_uring). If no runnable tasks in the system are latency sensitive, this timer is not registered. Because io_uring allows for polling on the ring file descriptor, it is safe to park() even if work is present in the latency ring: before going to sleep, the latency ring's file descriptor is registered with the main ring and any events it sees will also wake up the main ring.
- Poll ring: Read and write operations on NVMe devices are put in the poll ring. The poll ring does not rely on interrupts, so the system has to keep constantly polling for pending work. By not relying on interrupts we can be even more efficient with I/O in high-IOPS scenarios.
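To make the cooperative pattern above concrete, here is a sketch of a queue marked latency sensitive via Latency::Matters, and a long loop that checks yield_if_needed() periodically. The iteration counts and the 10ms preemption budget are arbitrary illustrative values:

use glommio::{executor, Latency, LocalExecutor, Shares};
use std::time::Duration;

let ex = LocalExecutor::default();
ex.run(async {
    // A queue marked latency sensitive: the scheduler may preempt other
    // queues after roughly 10ms so tasks placed here get a chance to run.
    let _latency_tq = executor().create_task_queue(
        Shares::default(),
        Latency::Matters(Duration::from_millis(10)),
        "latency",
    );

    // In a long-running loop, yield periodically so latency-sensitive
    // queues are not starved by this cooperative task.
    for i in 0..1_000_000u64 {
        // ... work ...
        if i % 1024 == 0 {
            glommio::yield_if_needed().await;
        }
    }
});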
§Before using Glommio
Please note Glommio requires at least 512 KiB of locked memory for io_uring to work. You can increase the memlock resource limit (rlimit) as follows:
$ vi /etc/security/limits.conf
* hard memlock 512
* soft memlock 512
To make the new limits effective, you need to log in to the machine again. You can verify that the limits are updated by running the following:
$ ulimit -l
512
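If you prefer to check the limit from code, here is a small sketch using the libc crate (not part of Glommio itself); note that getrlimit reports bytes while ulimit -l reports KiB:

use std::mem;

// Illustrative helper: read the current RLIMIT_MEMLOCK soft limit, in bytes.
fn memlock_limit_bytes() -> u64 {
    unsafe {
        let mut rl: libc::rlimit = mem::zeroed();
        libc::getrlimit(libc::RLIMIT_MEMLOCK, &mut rl);
        rl.rlim_cur as u64
    }
}

// Glommio needs at least 512 KiB of lockable memory.
assert!(memlock_limit_bytes() >= 512 * 1024);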
Glommio also requires a kernel with recent enough io_uring support, at least recent enough to run discovery probes. The minimum version at this time is 5.8.
§Examples
Connect to example.com:80, or time out after 10 seconds:
use futures_lite::{future::FutureExt, io};
use glommio::{net::TcpStream, timer::Timer, LocalExecutor};
use std::time::Duration;

let local_ex = LocalExecutor::default();
local_ex.run(async {
    let timeout = async {
        Timer::new(Duration::from_secs(10)).await;
        Err(io::Error::new(io::ErrorKind::TimedOut, "").into())
    };
    let stream = TcpStream::connect("example.com:80").or(timeout).await?;

    // Read or write from stream

    std::io::Result::Ok(())
});
Modules§
- glommio::channels is a module that provides glommio channel-like abstractions.
- Provides constructs to automatically control the shares, and in consequence the proportion of resources, that a task uses.
- glommio::io provides data structures targeted towards File I/O.
- This module provides glommio's networking support.
- Provides common imports that almost all Glommio applications will need.
- Set of synchronization primitives.
- Task abstraction for building executors.
- glommio::timer is a module that provides timing-related primitives.
Macros§
- Mark context for task operations.
- Macro to create a ScopeGuard (always run).
- Macro for cloning values to closures.
- Converts a Nix error into a native ErrorKind.
Structs§
- A description of the CPU's location in the machine topology.
- Used to specify a set of permitted CPUs on which executors created by a LocalExecutorPoolBuilder are run.
- Default settings for signal number, threshold and stall handler. By default, the threshold to consider a task queue stalled is set to 10ms over the expected run time. The default handler will log a stack trace of the currently executing task queue. The default signal number is nix::libc::SIGUSR1.
- A wrapper around a std::thread::JoinHandle.
- A proxy struct to the underlying LocalExecutor. It is accessible from anywhere within a Glommio context using executor().
- Allows information about the current state of this executor to be consumed by applications.
- Stores information about IO.
- Single-threaded executor.
- A factory that can be used to configure and create a LocalExecutor.
- A factory to configure and create a pool of LocalExecutors.
- Holds a collection of JoinHandles.
- Stores information about IO performed in a specific ring.
- A spawned future that cannot be detached, and has a predictable lifetime.
- A spawned future that can be detached.
- An opaque handle indicating in which queue a group of tasks will execute. Tasks in the same group will execute in FIFO order but no guarantee is made about ordering on different task queues.
- Allows information about the current state of a particular task queue to be consumed by applications.
Enums§
- Error types that can be created when building executors.
- Error types that can be created by the executor.
- Composite error type to encompass all error types glommio produces.
- An attribute of a TaskQueue, passed during its creation.
- Specifies a policy by which LocalExecutorBuilder selects CPUs.
- Specifies a policy by which LocalExecutorPoolBuilder selects CPUs.
- Error variants for executor queues.
- Errors coming from the reactor.
- Resource type used for errors that WouldBlock and includes extra diagnostic data for richer error messages.
- Represents how many shares a TaskQueue should receive.
Traits§
- Utility methods for working with byte slices/buffers.
- Utility methods for working with mutable byte slices/buffers.
- The SharesManager allows the user to implement dynamic shares for a TaskQueue.
- Trait describing what signal to use to trigger stall detection, how far past expected execution time to trigger a stall, and how to handle a stall detection once triggered.
Functions§
- Allocates a buffer that is suitable for writing to a Direct Memory Access (DMA) file. Please note that this implementation uses an embedded buddy allocator to speed up allocation of memory chunks, but the same allocator is used to serve memory needed to write/read data from uring, so it is probably not a good idea to keep allocated memory for a long time. If you want to keep the allocated buffer for a long time, please use crate::allocate_dma_buffer_global instead. Be careful when you use this buffer with a DMA file: the size and position of the buffer should be properly aligned to the block size of the device where the file is located.
- Allocates a buffer that is suitable for writing to a Direct Memory Access (DMA) file. If you do not plan to keep the allocated buffer for a long time, please use crate::allocate_dma_buffer instead. Be careful when you use this buffer with a DMA file: the size and position of the buffer should be properly aligned to the block size of the device where the file is located.
- Returns a proxy struct to the LocalExecutor.
- Spawns a task onto the current single-threaded executor.
- Spawns a task onto the current single-threaded executor, in a particular task queue.
- Spawns a task onto the current single-threaded executor.
- Spawns a task onto the current single-threaded executor, in a particular task queue.
- Conditionally yields the current task queue. The scheduler may then process other task queues according to their latency requirements. If a call to this function results in the current queue yielding, the calling task is moved to the back of the yielded task queue.
Type Aliases§
- Result type alias that all Glommio public API functions can use.