slitter 0.1.0 - Docs.rs

The allocation and release fast paths
=====================================

Slitter attempts to use as much Rust as possible; while this does
imply a certain loss of control, the structure of a thread-caching
slab allocator means that we can easily pinpoint what little code
is really hot, and rewrite that in C or assembly language.

Allocations
-----------

The core of object allocation is `slitter_allocate`.

```
void *
slitter_allocate(struct slitter_class class)
{
	struct magazine *restrict mag;
	size_t next_index;
	uint32_t id = class.id;

	if (__builtin_expect(id >= slitter_cache.n, 0))
		return slitter__allocate_slow(class);

	mag = &slitter_cache.mags[id].alloc;
	if (__builtin_usubl_overflow(mag->top_of_stack, 2, &next_index)) {
		next_index++;
	}

	if (__builtin_expect(slitter__magazine_is_exhausted(mag), 0))
		return slitter__allocate_slow(class);

	/*
	 * The magazine was not empty, so next_index did not overflow
	 * by more than 1.
	 */
	__builtin_prefetch(mag->storage->allocations[next_index], 1);
	return slitter__magazine_get_non_empty(mag);
}
```

Which assembles to something like the following on x86-64:

```
   0:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 7 <slitter_allocate+0x7>
                        3: R_X86_64_GOTTPOFF    slitter_cache-0x4
   7:   89 f8                   mov    %edi,%eax
   9:   64 48 3b 02             cmp    %fs:(%rdx),%rax
   d:   73 39                   jae    48 <slitter_allocate+0x48>  ; Check if the cache must be (re-)initialised
   f:   48 c1 e0 05             shl    $0x5,%rax
  13:   64 48 03 42 08          add    %fs:0x8(%rdx),%rax
  18:   48 8b 10                mov    (%rax),%rdx
  1b:   48 83 fa 02             cmp    $0x2,%rdx
  1f:   48 89 d6                mov    %rdx,%rsi
  22:   48 83 d6 fe             adc    $0xfffffffffffffffe,%rsi    ; Generate the prefetch index
  26:   48 85 d2                test   %rdx,%rdx
  29:   74 1d                   je     48 <slitter_allocate+0x48>  ; Is the magazine empty?
  2b:   48 8b 48 08             mov    0x8(%rax),%rcx              ; Load the magazine's array
  2f:   48 83 ea 01             sub    $0x1,%rdx
  33:   48 8b 34 f1             mov    (%rcx,%rsi,8),%rsi
  37:   0f 18 0e                prefetcht0 (%rsi)                  ; Prefetch the next allocation
  3a:   48 89 10                mov    %rdx,(%rax)                 ; Update the allocation index
  3d:   48 8b 04 d1             mov    (%rcx,%rdx,8),%rax          ; Grab our allocation
  41:   c3                      retq   
  42:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  48:   e9 00 00 00 00          jmpq   4d <slitter_allocate+0x4d>
                        49: R_X86_64_PLT32      slitter__allocate_slow-0x4
```

The first branch is taken ~once per thread, and the second once per
30-allocation magazine.  Most of the remaining instructions are used
to prefetch the next allocation; the prefetch isn't part of any
dependency chain, and code that allocates memory typically doesn't
saturate execution units, so that's not a problem.

When we must refill the magazine, the slow path isn't that slow
either.  At a high level, we first check if there's a full magazine in
the thread cache's deallocation buffer; if there isn't, we try to pop
some non-empty magazines off the allocation class's linked stack.  When
both stacks are empty, that's just a quick load and comparison with 0.
Otherwise, we call `slitter__stack_pop` or `slitter__stack_try_pop`,
straightforward double-wide CAS jobs (we avoid ABA with a generation
counter, and don't have to worry about reclamation races because
magazines are immortal).  The plain `pop` looks like:

```
   0:   4c 8b 4f 08             mov    0x8(%rdi),%r9
   4:   4c 8b 07                mov    (%rdi),%r8
   7:   4d 85 c0                test   %r8,%r8
   a:   74 34                   je     40 <slitter__stack_pop+0x40>
   c:   53                      push   %rbx
   d:   4c 89 c0                mov    %r8,%rax
  10:   4c 89 ca                mov    %r9,%rdx
  13:   49 8d 49 01             lea    0x1(%r9),%rcx
  17:   49 8b 98 f0 00 00 00    mov    0xf0(%r8),%rbx
  1e:   f0 48 0f c7 0f          lock cmpxchg16b (%rdi)
  23:   49 39 d1                cmp    %rdx,%r9
  26:   75 20                   jne    48 <slitter__stack_pop+0x48>
  28:   49 c7 80 f0 00 00 00    movq   $0x0,0xf0(%r8)
  2f:   00 00 00 00 
  33:   b8 01 00 00 00          mov    $0x1,%eax
  38:   5b                      pop    %rbx
  39:   4c 89 06                mov    %r8,(%rsi)
  3c:   c3                      retq   
  3d:   0f 1f 00                nopl   (%rax)
  40:   31 c0                   xor    %eax,%eax
  42:   c3                      retq   
  43:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  48:   49 89 c0                mov    %rax,%r8
  4b:   49 89 d1                mov    %rdx,%r9
  4e:   48 85 c0                test   %rax,%rax
  51:   75 c0                   jne    13 <slitter__stack_pop+0x13>
  53:   31 c0                   xor    %eax,%eax
  55:   5b                      pop    %rbx
  56:   c3                      retq   
```

If we found a non-empty magazine, we must push our currenty empty one
to a freelist for recycling.  That's another compare-and-swap.

If the local buffer and both class-global stacks are empty, we must
allocate more objects.  We implement that in `press.rs` with a single
atomic increment, which does not fail under contention, but only when
we notice (after the fact) that we must find a new bump allocation
region.

At a high level, we expect the slow path to incur one atomic increment
during bulk allocation phases, when no allocation is cached in
magazines.  When mixing allocation and deallocation (steady state),
this turns into two atomics, one CAS to acquire a new magazine of
cahed allocations, and another to recycle our empty magazine.

However, we will enter an even slower path whenever we exhaust the
current bump allocation reigon. When that happens (roughly once per
megabyte), we take a lock and grab another piece of address space.
Finally, we can also run out of pre-reserved address space, in which
case we must `mmap` to ask the kernel for more address space; we only
do that in 1 GB increments (and never release memory to the OS), so
that's a rare occasion.

Release
-------

Releasing allocations is special because it's a fundamentally
asynchronous operation: `slitter_release`, like `free` doesn't return
anything, so nothing can wait on it.  That's why we try to pack more
safety checks in the release code, as long as we can avoid
(unpredictable) control flow.

The code for `slitter_release` is

```
void
slitter_release(struct slitter_class class, void *ptr)
{
	uintptr_t address = (uintptr_t)ptr;
	uintptr_t chunk_base = address & -SLITTER__DATA_ALIGNMENT;
	uintptr_t chunk_offset = address % SLITTER__DATA_ALIGNMENT;
	size_t span_index = chunk_offset / SLITTER__SPAN_ALIGNMENT;
	uintptr_t meta_base = chunk_base -
	    (SLITTER__GUARD_PAGE_SIZE + SLITTER__METADATA_PAGE_SIZE);
	struct magazine *restrict mag;
	uint32_t id = class.id;

	if (ptr == NULL)
		return;

	/* Check the span metadata. */
	{
		const struct span_metadata *meta = (void *)meta_base;
		const struct span_metadata *span = &meta[span_index];

		assert(class.id == span->class_id && "class mismatch");
	}

	if (__builtin_expect(id >= slitter_cache.n, 0))
		return slitter__release_slow(class, ptr);

	mag = &slitter_cache.mags[id].release;
	if (__builtin_expect(slitter__magazine_is_exhausted(mag), 0))
		return slitter__release_slow(class, ptr);

	return slitter__magazine_put_non_full(mag, ptr);
}
```

All the shifting and masking above are there to help detect
mismatching releases.  The real work starts at
`if (__builtin_expect(id >= slitter_cache.n, 0))`.

Again, the majority of instructions aren't the deallocation itself:
that's just two range checks followed by a store and a stack index
update.

```
   0:   48 89 f2                mov    %rsi,%rdx
   3:   48 89 f0                mov    %rsi,%rax
   6:   48 81 e2 00 00 00 c0    and    $0xffffffffc0000000,%rdx
   d:   48 c1 e8 0e             shr    $0xe,%rax
  11:   48 81 ea 00 00 40 00    sub    $0x400000,%rdx
  18:   48 85 f6                test   %rsi,%rsi
  1b:   74 41                   je     5e <slitter_release+0x5e>  ; check for NULL
  1d:   0f b7 c0                movzwl %ax,%eax
  20:   48 8d 04 40             lea    (%rax,%rax,2),%rax
  24:   39 3c c2                cmp    %edi,(%rdx,%rax,8)
  27:   75 3c                   jne    65 <slitter_release+0x65>  ; Assert out on class mismatch
  29:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 30 <slitter_release+0x30>
                        2c: R_X86_64_GOTTPOFF   slitter_cache-0x4
  30:   89 f8                   mov    %edi,%eax
  32:   64 48 3b 02             cmp    %fs:(%rdx),%rax
  36:   73 28                   jae    60 <slitter_release+0x60>  ; Maybe (re-)initialise the cache
  38:   48 c1 e0 05             shl    $0x5,%rax
  3c:   64 48 03 42 08          add    %fs:0x8(%rdx),%rax
  41:   48 8b 50 10             mov    0x10(%rax),%rdx
  45:   48 85 d2                test   %rdx,%rdx                  ; Is the target magazine full?
  48:   74 16                   je     60 <slitter_release+0x60>
  4a:   48 8b 48 18             mov    0x18(%rax),%rcx
  4e:   48 8d 7a 01             lea    0x1(%rdx),%rdi
  52:   48 89 78 10             mov    %rdi,0x10(%rax)            ; Update the release index
  56:   48 89 b4 d1 f0 00 00    mov    %rsi,0xf0(%rcx,%rdx,8)     ; Store the freed object
  5d:   00 
  5e:   c3                      retq   
  5f:   90                      nop
  60:   e9 00 00 00 00          jmpq   65 <slitter_release+0x65>
                        61: R_X86_64_PLT32      slitter__release_slow-0x4
  65:   50                      push   %rax
  66:   48 8d 0d 00 00 00 00    lea    0x0(%rip),%rcx        # 6d <slitter_release+0x6d>
                        69: R_X86_64_PC32       .rodata.__PRETTY_FUNCTION__.2337-0x4
  6d:   ba 4e 00 00 00          mov    $0x4e,%edx
  72:   48 8d 35 00 00 00 00    lea    0x0(%rip),%rsi        # 79 <slitter_release+0x79>
                        75: R_X86_64_PC32       .LC0-0x4
  79:   48 8d 3d 00 00 00 00    lea    0x0(%rip),%rdi        # 80 <slitter_release+0x80>
                        7c: R_X86_64_PC32       .LC1-0x4
  80:   e8 00 00 00 00          callq  85 <slitter_release+0x85>
                        81: R_X86_64_PLT32      __assert_fail-0x4
```

Here as well, the first slow path branch is taken ~once per thread,
and the second once per 30-allocation magazine.

When a magazine is full, we must push it to the allocation class's
stack (`slitter__stack_push`).  That's another double-wide CAS loop.

After that, we must find an empty one.  We hope to find one from the
global cache of empty magazines (an atomic stack pop); otherwise, we
create a new one... by calling the system allocator for now.

In total, that's two atomics, or one atomic and a malloc... a bit
worse than allocation, but the application for which we wrote slitter
cares more about the performance of parallel allocation during startup
than anything else.