tcmalloc

Restartable Sequence Mechancism for TCMalloc

per-CPU Caches

TCMalloc implements its per-CPU caches using restartable sequences (man rseq(2)) on Linux. This kernel feature was developed by Paul Turner and Andrew Hunter at Google and Mathieu Desnoyers at EfficiOS. Restartable sequences let us execute a region to completion (atomically with respect to other threads on the same CPU) or to be aborted if interrupted by the kernel by preemption, interrupts, or signal handling.

Choosing to restart on migration across cores or preemption allows us to optimize the common case - we stay on the same core - by avoiding atomics, over the more rare case - we are actually preempted. As a consequence of this tradeoff, we need to make our code paths actually support being restarted. The entire sequence, except for its final store to memory which commits the change, must be capable of starting over.

This carries a few implementation challenges:

Structure of the TcmallocSlab

In per-CPU mode, we allocate an array of N TcmallocSlab::Slabs. For all operations, we index into the array with the logical CPU ID.

Each slab is has a header region of control data (one 8-byte header per-size class). These index into the remainder of the slab, which contains pointers to free listed objects.

Memory layout of per-cpu data structures

In C++ code, these are represented as:

struct Slabs {
  std::atomic<int64_t> header[NumClasses];
  void* mem[((1ul << Shift) - sizeof(header)) / sizeof(void*)];
};

// Slab header (packed, atomically updated 64-bit).
struct Header {
  // All values are word offsets from per-CPU region start.
  // The array is [begin, end).
  uint16_t current;
  // Copy of end. Updated by Shrink/Grow, but is not overwritten by Drain.
  uint16_t end_copy;
  // Lock updates only begin and end with a 32-bit write.
  uint16_t begin;
  uint16_t end;

  // Lock is used by Drain to stop concurrent mutations of the Header.
  // Lock sets begin to 0xffff and end to 0, which makes Push and Pop fail
  // regardless of current value.
  bool IsLocked() const;
  void Lock();
};

The atomic header allows us to read the state (esp. for telemetry purposes) of a core without undefined behavior.

The fields in Header are indexed in sizeof(void*) strides into the slab. For the default value of Shift=18, this allows us to cache nearly 32K objects per CPU.

We have allocated capacity for end-begin objects for a given size-class. begin is chosen via static partitioning at initialization time. end is chosen dynamically at a higher-level (in tcmalloc::CPUCache), as to:

Usage: Allocation

As the first operation, we can look at allocation, which needs to read the pointer at index current-1, return that object, and decrement current. Decrementing current is the commit operation.

In psuedo-C++, this looks like:

void* TcmallocSlab_Pop(
    void *slabs,
    size_t cl,
    UnderflowHandler f) {
  // Expanded START_RSEQ macro...
restart:
  __rseq_abi.rseq_cs = &__rseq_cs_TcmallocSlab_Pop;
start:
  // Actual sequence
  uint64_t cpu_id = __rseq_abi.cpu_id;
  Header* hdr = &slabs[cpu_id].header[cl];
  uint64_t current = hdr->current;
  uint64_t begin = hdr->begin;
  if (ABSL_PREDICT_FALSE(current <= begin)) {
    goto underflow;
  }

  void* next = *(&slabs[cpu_id] + current * sizeof(void*) - 2 *sizeof(void*))
  prefetcht0(next);

  void* ret = *(&slabs[cpu_id] + current * sizeof(void*) - sizeof(void*));
  current--;
  hdr->current = current;
commit:
  return ret;
underflow:
  return f(cpu_id, cl);
}

// This is implemented in assembly, but for exposition.
ABSL_CONST_INIT kernel_rseq_cs __rseq_cs_TcmallocSlab_Pop = {
  .version = 0,
  .flags = 0,
  .start_ip = &&start,
  .post_commit_offset = &&commit - &&start,
  .abort_ip = &&abort,
};

__rseq_cs_TcmallocSlab_Pop is a read-only data structure, which contains metadata about this particular restartable sequence. When the kernel preempts the current thread, it examines this data structure. If the current instruction pointer is between [start, commit), it returns control to a specified, per-sequence restart header at abort.

Since the next object is frequently allocated soon after the current object, so the allocation path prefetches the pointed-to object. To avoid prefetching a wild address, we populate slabs[cpu][begin] for each CPU/size-class with a pointer-to-self.

This sequence terminates with the single committing store to hdr->current. If we are migrated or otherwise interrupted, we restart the preparatory steps, as the values of cpu_id, current, begin may have changed.

As these operations work on a single core’s data and are executed on that core. From a memory ordering perspective, loads and stores need to appear on that core in program order.

Restart Handling

The abort label is distinct from restart. The rseq API provided by the kernel (see below) requires a “signature” (typically an intentionally invalid opcode) in the 4 bytes prior to the restart handler, we form a small trampoline - properly signed - to jump back to abort.

In x86 assembly, this looks like:

  // Encode nop with RSEQ_SIGNATURE in its padding.
  .byte 0x0f, 0x1f, 0x05
  .long RSEQ_SIGNATURE
  .local TcmallocSlab_Push_trampoline
  .type TcmallocSlab_Push_trampoline,@function
  TcmallocSlab_Push_trampoline:
abort:
  jmp restart

This ensures that the 4 bytes prior to abort match up with the signature that was configured with the rseq syscall.

On x86, we can represent this with a nop which would allow for interleaving in the main implementation. On other platforms - with fixed width instructions - the signature is often chosen to be an illegal/trap instruction, so it has to be disjoint from the function’s body.

Usage: Deallocation

Deallocation uses two stores, one to store the deallocated object and another to update current. This is still compatible with the restartable sequence technique, as there is a single commit step, updating current. Any preempted sequences will overwrite the value of the deallocated object until a successful sequence commits it by updating current.

int TcmallocSlab_Push(
    void *slab,
    size_t cl,
    void* item,
    OverflowHandler f) {
  // Expanded START_RSEQ macro...
abort:
  __rseq_abi.rseq_cs = &__rseq_cs_TcmallocSlab_Push;
start:
  // Actual sequence
  uint64_t cpu_id = __rseq_abi.cpu_id;
  Header* hdr = &slabs[cpu_id].header[cl];
  uint64_t current = hdr->current;
  uint64_t end = hdr->end;
  if (ABSL_PREDICT_FALSE(current >= end)) {
    goto overflow;
  }

  *(&slabs[cpu_id] + current * sizeof(void*) - sizeof(void*)) = item;
  current++;
  hdr->current = current;
commit:
  return;
overflow:
  return f(cpu_id, cl, item);
}

Initialization of the Slab

To reduce metadata demands, we lazily initialize the slabs, relying on the kernel to provide zeroed pages from the mmap call to obtain memory for the slab metadata.

At startup, this leaves the Header of each initialized to current = begin = end = 0. The initial push or pop will trigger the overflow or underflow paths (respectively), so that we can populate these values.

More Complex Operations: Batches

When the cache under or overflows, we populate or remove a full batch of objects obtained from inner caches. This amortizes some of the lock acquisition/logic for those caches. Using a similar approach to push and pop, we update a batch of N items and we update `current to commit the update.

Kernel API and implementation

This section contains notes on the rseq API provided by the kernel, which is not well documented, and code pointers for how it is implemented.

The rseq syscall is implemented by sys_rseq. It starts by handling the case where the thread wants to unregister, implementing that by clearing the rseq information out of the task_struct for the thread runnong on the current CPU. It then moves on to return an error if the thread is already registered for rseq. Then it validates and saves the input from the user, and sets the TIF_NOTIFY_RESUME flag for the thread.

Restarts

Among other things, the user’s input to the rseq syscall is used by rseq_ip_fixup to decide whether we’re in a critical section and if so restart at the abort point. That function is called by __rseq_handle_notify_resume, which is documented as needing to be called after preemption or signal delivery before returning to the user. That in turn is called by rseq_handle_notify_resume, a simple wrapper that bails if rseq is not enabled for the thread.

Here is one path that causes us to wind up here on x86:

So the choke point is the code that returns to user space. Here are some notes on how the restart logic varies based on user input:

It’s also worth noting that signals and preemption always result in clearing rseq::rseq_cs::ptr64 from user space memory on the way out, except in error cases that cause a segfault.

CPU IDs

The other thing rseq.c takes care of is writing CPU IDs to user space memory.

There are two fields in user space that get this information: rseq::cpu_id_start and rseq::cpu_id. The difference between the two is that cpu_id_start is always in range, whereas cpu_id may contain error values. The kernel provides both in order to support computation of values derived from the CPU ID that happens before entering the critical section. We could do this with one CPU ID, but it would require an extra branch to distinguish “not initialized” from “CPU ID changed after fetching it”. On the other hand if (like tcmalloc) you only fetch the CPU Id within a critical section, then you need only one field because you have only one branch: am I initialized. There is no such thing as a CPU mismatch because instead you are just restarted when the CPU ID changes.

The two CPU ID fields are maintained as follows: