tcmalloc

Temeraire: Hugepage-Aware Allocator

Andrew Hunter, Chris Kennelly

Notes on the name1: the french word for “reckless” or “rash” :), and also the name of several large and powerful English warships. So: giant and powerful, but maybe a little dangerous. :)

This is a description of the design of the Hugepage-Aware Allocator. We have also published “Beyond malloc efficiency to fleet efficiency: a hugepage-aware memory allocator” at OSDI 2021, which provides further details on the design, implementation, and rollout of Temeraire.

GOALS

What do we want out of this redesign?

DESIGN

The algorithm – as usual here, really, the data structures, which neatly determine our algorithm – are nicely divided into components. Essentially, the path of an allocation goes like this:

  1. If it is sufficiently small and we have the space we take an existing, backed, partially empty hugepage and fit our allocation within it.
  2. If it is too large to fit in a single hugepage, but too small to simply round up to an integral number of hugepages, we best-fit it into one of several larger slabs (whose allocations can cross hugepage boundaries). We will back hugepages as needed for the allocation.
  3. Sufficiently large allocations are rounded up to the nearest hugepage; the extra space may be used for smaller allocations.

Deallocation simply determines which of 1), 2), or 3) happened, and marks the corresponding object we allocated from as free.

We will sketch the purpose and approach of each important part. Note that we have fairly detailed unit tests for each of these; one consequence on the implementations is that most components are templated on the tcmalloc::SystemRelease functions[^templated] as we make a strong attempt to be zero initializable where possible (sadly not everywhere).

RangeTracker

RangeTracker and Bitmap, its underlying implementation, are helper class used throughout the components below. They are both quite simple: Bitmap is a fixed-size (templated) bitmap with fast operations to set and clear bits and ranges of bits, with extensive support for searching and iterating. (Search and iteration support is why std::bitset is not usable here.)

RangeTracker is essentially a Bitmap augmented with statistics on usage, in particular the longest range of contiguous free (false) bits. It provides methods to do best-fit allocation from free ranges (keeping the statistics correct).

Both of these need to be quite fast as they’re on nearly every allocation/deallocation path in HugePageAwareAllocator (in multiple ways)! They are reasonably optimized but probably still have more headroom.

HugeAllocator/HugeCache (the backing…)

This is a set of classes that fulfills requests for backed (or unbacked) aligned hugepage ranges. We use this for sufficiently large (or nicely sized) requests, and to provide memory for the other components to break up into smaller chunks.

HugeAllocator

HugeAllocator is (nearly) trivial: it requests arbitrarily large hugepage-sized chunks from SysAllocator, keeps them unbacked, and tracks the available (unbacked) regions. Note that we do not need to be perfectly space efficient here: we only pay virtual memory and metadata, since none of the contents are backed. (We do make our best efforts to be relatively frugal, however, since there’s no need to inflate VSS by large factors.) Nor do we have to be particularly fast; this is well off any hot path, and we’re going to incur non-trivial backing costs as soon as we’re done assigning a range.

The one tricky bit here is that we have to write some fiddly data structures by hand. We would have liked to implement this by grabbing large (gigabyte+) ranges from SysAllocator and using bitmaps or the like within them; however, too many tests have brittle reliance on details of SysAllocator that break if TCMalloc consistently requests (any considerable amount) more than the minimum needed to back current usage. So instead we need to track relatively small ranges. We’ve implemented a balanced tree that merges adjacent ranges; it is, as we said, fiddly, but reasonably efficient and not stunningly complicated.

HugeCache

This is a very simple wrapper on top of HugeAllocator. Its only purpose is to store some number of backed single hugepage ranges as a hot cache (in case we rapidly allocate and deallocate a 2 MiB chunk).

It is not clear whether the cache is necessary, but we have it and it’s not costing us much in complexity, and will help significantly in some potential antagonistic scenarios, so we favor keeping it.

It currently attempts to estimate the optimal cache size based on past behavior. This may not really be needed, but it’s a very minor feature to keep or drop.

HugePageFiller (the core…)

HugePageFiller takes small requests (less than a hugepage) and attempts to pack them efficiently into hugepages. The vast majority of binaries use almost entirely small allocations[^conditional], so this is the dominant consumer of space and the most important component.

Our goal here is to make our live allocations fit within the smallest set of hugepages possible, so that we can afford to keep all used hugepages fully backed (and aggressively free empty ones).

The key challenge is avoiding fragmentation of free space within a hugepage: requests for 1 page are (usually) the most common, but 4, 8, or even 50+ page requests aren’t unheard of. Many 1-page free regions won’t be useful here, and we’ll have to request enormous numbers of new hugepages for anything large.

Our solution is to build a heap-ordered data structure on fragmentation, not total amount free, in each hugepage. We use the longest free range (the biggest allocation a hugepage can fulfill!) as a measurement of fragmentation. In other words: if a hugepage has a free range of length 8, we never allocate from it for a smaller request (unless all hugepages available have equally long ranges). This carefully husbands long ranges for the requests that need them, and allows them to grow (as neighboring allocations are freed).

Inside each equal-longest-free-range group, we order our heap by the number of allocations (chunked logarithmically). This helps favor allocating from fuller hugepages (of equal fragmented status). Number of allocations handily outperforms the total number of allocated pages here; our hypothesis is that since allocations of any size are equally likely[^radioactive] to become free at any given time, and we need all allocations on a hugepage to become free to make the hugepage empty, we’re better off hoping for 1 10-page allocation to become free (with some probability P) than 5 1-page allocations (with probability P^5).

The HugePageFiller contains support for releasing parts of mostly-empty hugepages as a last resort.

The actual implementation uses a fixed set of lists and a bitmap for acceleration.

HugeRegion (big but not enormous…)

HugeAllocator covers very large requests and HugePageFiller tiny ones; what about the middle? In particular, requests that cannot fit into a hugepage, but should not be rounded to multiples? (For instance, 2.1 MiB.) These are woefully common.

In any case, we certainly have to do something with “2.1 MiB”-type allocations, and rounding them to 4 will produce unacceptable slack (see below for what we can do with the filler here; it is wildly insufficient in current binaries which have the majority of their allocation in these large chunks.)

The solution is a much larger “region” that best-fits these chunks into a large range of hugepages (i.e. allows them to cross a hugepage boundary). We keep a set of these regions, and allocate from the most fragmented one (much as with Filler above)! The main difference is that these regions are kept un-backed by default (whereas the Filler deals almost entirely with backed hugepages). We back hugepages on demand when they are used by a request hitting the region (and aggressively _unback _them when they become empty again).

A few important details:

Additional details on the design goals/tradeoffs are in the Regions Are Not Optional design doc.

HugePageAwareAllocator (putting it all together…)

This class houses the above components and routes between them, in addition to interfacing with the rest of TCMalloc (the above classes don’t need or use Spans, for instance). This is mostly straightforward; two points are worth discussing.

The changeover point between 1) and 2) is just a tuning decision (any choice would produce a usable binary). Half a hugepage was picked arbitrarily; this seems to work well.

Allocations from HugeAllocator or HugeRegion (some of the time) need to be backed; so do hugepages that grow the HugePageFiller. This isn’t free. Page heap allocation isn’t hugely expensive in practice, but it is under a lock and contention matters. We currently rely on access by the application to back memory, and assume returned memory has been backed.

For accounting purposes, we do a bit of tracking whether a given allocation is being fulfilled from previously-unbacked memory.

We do wire that information to the point we drop the pageheap lock; we then back it without producing lock contention. This made a noticeable performance difference when explicitly backing memory before returning it to the application.

Leveraging hot/cold hints

TCMalloc provides a hot_cold variant that applications may use to provide hints as to how frequently an allocation would be accessed. It’s more likely that a larger-sized page may be considered hot by kernel even if some allocations are frequently accessed and placed together with infrequently-accessed data. By specifying hot and cold hints, TCMalloc may separate allocations to improve data locality for hot heap, and the effectiveness of memory tiering by placing cold heap on a slower, but cheaper, memory tier. In the backend, TCMalloc bifurcates the page heap to store cold allocations separately from the hot allocations. This avoids TCMalloc to place these allocations together on the same hugepage. TCMalloc encodes hot/cold signal into size classes, that is then used to decide which page heap to use. It marks the separate cold heap as MADV_NOHUGEPAGE, such that cold allocations are placed on native-sized pages. By avoiding placing cold allocations on hugepages, we intend to improve hugepage availability for the hot heap.

Notes

  1. Also the name of this cutie, the real reason for the choice. [^templated]: It will be possible, given recent improvements in constexpr usage, to eliminate this in followups. [^conditional]: Here we mean “requests to the pageheap as filtered through sampling, the central cache, etc” [^radioactive]: Well, no, this is false in our empirical data, but to first order.