There are three user-accessible controls that we can use to performance-tune TCMalloc:
None of these tuning parameters is a clear win; otherwise it would be the default. We’ll discuss the advantages and disadvantages of changing them.
This is determined at compile time by linking in the appropriate version of TCMalloc. The page size indicates the unit in which TCMalloc manages memory. The default is 8KiB chunks; there are larger options of 32KiB and 256KiB. There is also the 4KiB page size used by the small-but-slow allocator.
A smaller page size allows TCMalloc to provide memory to an application with less waste. Waste comes about through two issues:
The second of these points is worth elucidating. For small allocations TCMalloc will fit multiple objects onto a single page.
So if you request 512 bytes, then an entire page will be devoted to 512 byte objects. If the size of that page is 4KiB we get 8 objects; if the size of that page is 256KiB we get 512 objects. That page can only be used for 512 byte objects until all the objects on the page have been freed.
If you have 8 objects on a page, there’s a reasonable chance that all 8 will become free at the same time, and we can repurpose the page for objects of a different size. If there are 512 objects on that page, then it is very unlikely that all of them will be free at the same time, so that page will probably never become entirely free and will hang around, potentially containing only a few in-use objects.
The consequence of this is that large pages tend to lead to a larger memory footprint. There’s also the issue that if you want just one object of a given size, you need to allocate a whole page for it.
The advantages of managing objects using larger page sizes are:
TCMalloc maintains a PageMap, which enables it to look up information about any allocated memory. If we use large pages, the pagemap needs fewer entries and can be much smaller. This makes it more likely to be cache resident. However, sized delete has substantially reduced the number of times that we need to consult the pagemap, so the benefit from larger pages is reduced.
Suggestion: The default of 8KiB page sizes is probably good enough for most applications. However, if an application has a heap measured in GiB it may be worth looking at using large page sizes.
Suggestion: Consider small-but-slow if minimising memory footprint is more important than performance.
Note: Size-classes are determined on a per-page-size basis. So changing the page size will implicitly change the size-classes used. Size-classes are selected to be memory-efficient for the applications using that page size. If an application changes page size, there may be a performance or memory impact from the different selection of size-classes.
The default is for TCMalloc to run in per-cpu mode as this is faster; however, there are few applications which have not yet transitioned. The plan is to move these across at some point soon.
Increasing the size of the cache is an obvious way to improve performance. The larger the cache the less frequently memory needs to be fetched from the central caches. Returning memory from the cache is substantially faster than fetching from the central cache.
The size of the per-cpu caches is controlled by
tcmalloc::MallocExtension::SetMaxPerCpuCacheSize. This controls the limit for
each CPU, so the total amount of memory for the application could be much larger
than this. Memory on CPUs where the application is no longer able to run can be
freed by calling
Releasing memory held by unusable CPU caches is handled by
the total size of all thread caches in the application.
Suggestion: The default cache size is typically sufficient, but cache size can be increased (or decreased) depending on the amount of time spent in TCMalloc code, and depending on the overall size of the application (a larger application can afford to cache more memory without noticeably increasing its overall size).
tcmalloc::MallocExtension::ReleaseMemoryToSystem asks TCMalloc to release
n bytes of memory to the system. This can keep the memory footprint of the
application down to a minimal amount; however, it only brings the application
down from its peak memory footprint over time, and does not make that peak
memory footprint smaller.
Using a background thread running
tcmalloc::MallocExtension::ProcessBackgroundActions(), memory will be released
from the page heap at the specified rate.
There are two disadvantages of releasing memory aggressively:
Note: Release rate is not a panacea for memory usage. Jobs should be provisioned for peak memory usage to avoid OOM errors. Setting a release rate may enable an application to exceed the memory limit for short periods of time without triggering an OOM. A release rate is also good-citizen behavior, as it enables the system to use spare memory capacity for applications which are under-provisioned. However, it is not a substitute for setting appropriate memory requirements for the job.
Note: Memory is released from the
PageHeap and stranded per-cpu caches.
It is not possible to release memory from other internal structures, like
Suggestion: The default release rate is probably appropriate for most applications. In situations where it is tempting to set a faster rate it is worth considering why there are memory spikes, since those spikes are likely to cause an OOM at some point.
/sys/kernel/mm/transparent_hugepage/enabled: [always] madvise never
/sys/kernel/mm/transparent_hugepage/defrag: always defer [defer+madvise] madvise never
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none: 0
TCMalloc is built and tested in certain ways. These build-time options can improve performance:
Enabling sized deallocation from
reduces deallocation costs when the size can be determined. Sized deallocation
is enabled with the
-fsized-deallocation flag. This behavior is enabled by
default in GCC, but as of early 2020, is not enabled by default on Clang even
when compiling for C++14/C++17.
Some standard C++ libraries (such as libc++) will take advantage of sized deallocation for their allocators as well, improving deallocation performance in C++ containers.
Aligning raw storage allocated with
::operator new to 8 bytes by compiling with
__STDCPP_DEFAULT_NEW_ALIGNMENT__ <= 8. This smaller alignment minimizes
wasted memory for many common allocation sizes (24, 40, etc.) which are
otherwise rounded up to a multiple of 16 bytes. On many compilers, this
behavior is controlled by the
__STDCPP_DEFAULT_NEW_ALIGNMENT__ is not specified (or is larger than 8
bytes), we use standard 16 byte alignments for
::operator new. However, for
allocations under 16 bytes, we may return an object with a lower alignment, as
no object with a larger alignment requirement can be allocated in the space.
Optimizing failures of
operator new by directly failing instead of throwing
exceptions. Because TCMalloc does not throw exceptions when
fails, this can be used as a performance optimization for many move
Within Abseil code, these direct allocation failures are enabled with the
Abseil build-time configuration macro