API synopsis / quick reference
High-level overview
Highway is a collection of ‘ops’: platform-agnostic pure functions that operate on tuples (multiple values of the same type). These functions are implemented using platform-specific intrinsics, which map to SIMD/vector instructions.
Your code calls these ops and uses them to implement the desired
algorithm. Alternatively, hwy/contrib
also includes higher-level
algorithms such as FindIf
or VQSort
implemented using these ops.
Static vs. dynamic dispatch
Highway supports two ways of deciding which instruction sets to use: static or dynamic dispatch.
Static means targeting a single instruction set, typically the best one enabled by the given compiler flags. This has no runtime overhead and only compiles your code once, but because compiler flags are typically conservative, you will not benefit from more recent instruction sets. Conversely, if you run the binary on a CPU that does not support this instruction set, it will crash.
Dynamic dispatch means compiling your code multiple times and choosing the best available implementation at runtime. Highway supports three ways of doing this:
- Highway can take care of everything including compilation (by re-#including your code), setting the required compiler #pragmas, and dispatching to the best available implementation. The only changes to your code relative to static dispatch are adding #define HWY_TARGET_INCLUDE, #include "third_party/highway/hwy/foreach_target.h" (which must come before any inclusion of highway.h), and calling HWY_DYNAMIC_DISPATCH instead of HWY_STATIC_DISPATCH.
- Some build systems (e.g. Apple) support the concept of ‘fat’ binaries, which contain code for multiple architectures or instruction sets. Then, the operating system or loader typically takes care of calling the appropriate code. Highway interoperates with this by using the instruction set requested by the current compiler flags during each compilation pass. Your code is the same as with static dispatch. Note that this method replicates the entire binary, whereas the Highway-assisted dynamic dispatch method only replicates your SIMD code, which is typically a small fraction of the total size.
- Because Highway is a library (as opposed to a code generator or compiler), the dynamic dispatch method can be inspected, and made to interoperate with existing systems. For compilation, you can replace foreach_target.h if your build system supports compiling for multiple targets. For choosing the best available target, you can replace Highway’s CPU detection and decision with your own.
HWY_DYNAMIC_DISPATCH calls into a table of function pointers with a zero-based index indicating the desired target. Instead of calling it immediately, you can also save the function pointer returned by HWY_DYNAMIC_POINTER. Note that HWY_DYNAMIC_POINTER returns the same pointer that HWY_DYNAMIC_DISPATCH would. When either of them is first invoked, the function pointer first detects the CPU, then calls your actual function. You can call GetChosenTarget().Update(SupportedTargets()); to ensure future dynamic dispatch avoids the overhead of CPU detection. You can also replace the table lookup with your own choice of index, or even call e.g. N_AVX2::YourFunction directly.
Examples of both static and dynamic dispatch are provided in examples/. Typically, the function that does the dispatch receives a pointer to one or more arrays. Due to differing ABIs, we recommend only passing vector arguments to functions that are inlined, and in particular not the top-level function that does the dispatch.
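For reference, here is a minimal sketch of the Highway-assisted dynamic dispatch method, following the pattern of examples/skeleton.cc; the file name mycode.cc and the functions MulAddLoop/CallMulAddLoop are placeholders, and the loop assumes count is a multiple of the lane count.

// mycode.cc
#include <stddef.h>

#undef HWY_TARGET_INCLUDE
#define HWY_TARGET_INCLUDE "mycode.cc"  // this file, re-included per target
#include "hwy/foreach_target.h"         // must come before highway.h
#include "hwy/highway.h"

HWY_BEFORE_NAMESPACE();
namespace project {
namespace HWY_NAMESPACE {
namespace hn = hwy::HWY_NAMESPACE;

// Per-target implementation.
void MulAddLoop(const float* HWY_RESTRICT a, const float* HWY_RESTRICT b,
                float* HWY_RESTRICT out, size_t count) {
  const hn::ScalableTag<float> d;
  for (size_t i = 0; i < count; i += hn::Lanes(d)) {
    const auto mul = hn::Load(d, a + i);
    const auto add = hn::Load(d, b + i);
    hn::Store(hn::MulAdd(mul, add, hn::Load(d, out + i)), d, out + i);
  }
}

// NOLINTNEXTLINE(google-readability-namespace-comments)
}  // namespace HWY_NAMESPACE
}  // namespace project
HWY_AFTER_NAMESPACE();

#if HWY_ONCE
namespace project {
HWY_EXPORT(MulAddLoop);  // table of per-target function pointers

void CallMulAddLoop(const float* a, const float* b, float* out, size_t count) {
  HWY_DYNAMIC_DISPATCH(MulAddLoop)(a, b, out, count);
}
}  // namespace project
#endif  // HWY_ONCE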
Note that if your compiler is pre-configured to generate code only for a
specific architecture, or your build flags include -m flags that specify
a baseline CPU architecture, then this can interfere with dynamic
dispatch, which aims to build code for all attainable targets. One
example is specializing for a Raspberry Pi CPU that lacks AES, by
specifying -march=armv8-a+crc
. When we build the HWY_NEON
target
(which would only be used if the CPU actually does have AES), there is a
conflict between the arch=armv8-a+crypto
that is set via pragma only
for the vector code, and the global -march
. This results in a
compile error, see #1460, #1570, and #1707. As a workaround, we
recommend avoiding -m flags if possible, and otherwise defining
HWY_COMPILE_ONLY_STATIC
or HWY_SKIP_NON_BEST_BASELINE
when
building Highway as well as any user code that includes Highway headers.
As a result, only the baseline target, or targets at least as good as
the baseline, will be compiled. Note that it is fine for user code to
still call HWY_DYNAMIC_DISPATCH
. When Highway is only built for a
single target, HWY_DYNAMIC_DISPATCH
results in the same direct call
that HWY_STATIC_DISPATCH
would produce.
Headers
The public headers are:
- hwy/highway.h: main header, included from source AND/OR header files that use vector types. Note that including it in headers may increase compile time, but allows declaring functions implemented out of line.
- hwy/base.h: included from headers that only need compiler/platform-dependent definitions (e.g. PopCount) without the full highway.h.
- hwy/foreach_target.h: re-includes the translation unit (specified by HWY_TARGET_INCLUDE) once per enabled target to generate code from the same source code. highway.h must still be included.
- hwy/aligned_allocator.h: defines functions for allocating memory with alignment suitable for Load/Store.
- hwy/cache_control.h: defines stand-alone functions to control caching (e.g. prefetching), independent of actual SIMD.
- hwy/nanobenchmark.h: library for precisely measuring elapsed time (under varying inputs) for benchmarking small/medium regions of code.
- hwy/print-inl.h: defines Print() for writing vector lanes to stderr.
- hwy/tests/test_util-inl.h: defines macros for invoking tests on all available targets, plus per-target functions useful in tests.
Highway provides helper macros to simplify your vector code and ensure support for dynamic dispatch. To use these, add the following to the start and end of any vector code:
#include "hwy/highway.h"
HWY_BEFORE_NAMESPACE(); // at file scope
namespace project { // optional
namespace HWY_NAMESPACE {
// implementation
// NOLINTNEXTLINE(google-readability-namespace-comments)
} // namespace HWY_NAMESPACE
} // namespace project - optional
HWY_AFTER_NAMESPACE();
If you choose not to use the BEFORE/AFTER
lines, you must prefix any
function that calls Highway ops such as Load
with HWY_ATTR
.
Either of these will set the compiler #pragma required to generate
vector code.
The HWY_NAMESPACE
lines ensure each instantiation of your code (one
per target) resides in a unique namespace, thus preventing ODR
violations. You can omit this if your code will only ever use static
dispatch.
Notation in this doc
By vector ‘lanes’, we mean the ‘elements’ of that vector. Analogous to the lanes of a highway or swimming pool, most operations act on each lane independently, but it is possible for lanes to interact and change order via ‘swizzling’ ops.
- T denotes the type of a vector lane (integer or floating-point);
- N is a size_t value that governs (but is not necessarily identical to) the number of lanes;
- D is shorthand for a zero-sized tag type Simd<T, N, kPow2>, used to select the desired overloaded function (see next section). Use aliases such as ScalableTag instead of referring to this type directly;
- d is an lvalue of type D, passed as a function argument e.g. to Zero;
- V is the type of a vector, which may be a class or built-in type;
- v[i] is analogous to C++ array notation, with zero-based index i from the starting address of the vector v.
Vector and tag types
Highway vectors consist of one or more ‘lanes’ of the same built-in type
T
: uint##_t, int##_t
for ## = 8, 16, 32, 64
, or
float##_t
for ## = 16, 32, 64
and bfloat16_t
. T
may be
retrieved via TFromD<D>
. IsIntegerLaneType<T>
evaluates to true
for these int
or uint
types.
Beware that char
may differ from these types, and is not supported
directly. If your code loads from/stores to char*
, use T=uint8_t
for Highway’s d
tags (see below) or T=int8_t
(which may enable
faster less-than/greater-than comparisons), and cast your char*
pointers to your T*
.
In Highway, float16_t
(an IEEE binary16 half-float) and
bfloat16_t
(the upper 16 bits of an IEEE binary32 float) only
support load, store, and conversion to/from float32_t
. The behavior
of infinity and NaN in float16_t
is implementation-defined due to
Armv7. To ensure binary compatibility, these types are always wrapper
structs and cannot be initialized with values directly. You can
initialize them via BitCastScalar
or ConvertScalarTo
.
On RVV/SVE, vectors are sizeless and cannot be wrapped inside a class.
The Highway API allows using built-in types as vectors because
operations are expressed as overloaded functions. Instead of
constructors, overloaded initialization functions such as Set
take a
zero-sized tag argument called d
of type D
and return an actual
vector of unspecified type.
The actual lane count (used to increment loop counters etc.) can be
obtained via Lanes(d)
. This value might not be known at compile
time, thus storage for vectors should be dynamically allocated, e.g. via
AllocateAligned(Lanes(d))
.
Note that Lanes(d)
could potentially change at runtime. This is
currently unlikely, and will not be initiated by Highway without user
action, but could still happen in other circumstances:
- upon user request in future via special CPU instructions (switching to ‘streaming SVE’ mode for Arm SME), or
- via system software (prctl(PR_SVE_SET_VL, ...) on Linux for Arm SVE). When the vector length is changed using this mechanism, all but the lower 128 bits of vector registers are invalidated.
Thus we discourage caching the result; it is typically used inside a
function or basic block. If the application anticipates that one of the
above circumstances could happen, it should ensure by some out-of-band
mechanism that such changes will not happen during the critical section
(the vector code which uses the result of the previously obtained
Lanes(d)
).
MaxLanes(d)
returns a (potentially loose) upper bound on
Lanes(d)
, and is implemented as a constexpr function.
The actual lane count is guaranteed to be a power of two, even on SVE.
This simplifies alignment: remainders can be computed as
count & (Lanes(d) - 1)
instead of an expensive modulo. It also
ensures loop trip counts that are a large power of two (at least
MaxLanes
) are evenly divisible by the lane count, thus avoiding the
need for a second loop to handle remainders.
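As an illustration, a sketch of a typical loop (the kernel, adding 1.0f, is arbitrary; HWY_ATTR is used instead of the BEFORE/AFTER_NAMESPACE macros):

namespace hn = hwy::HWY_NAMESPACE;

// Process whole vectors; Lanes(d) is a power of two, so the remainder can be
// computed with a mask instead of a modulo.
HWY_ATTR void AddOne(const float* HWY_RESTRICT in, float* HWY_RESTRICT out,
                     size_t count) {
  const hn::ScalableTag<float> d;
  const size_t N = hn::Lanes(d);  // query inside the function, do not cache
  size_t i = 0;
  for (; i + N <= count; i += N) {
    hn::StoreU(hn::Add(hn::LoadU(d, in + i), hn::Set(d, 1.0f)), d, out + i);
  }
  const size_t remainder = count & (N - 1);  // == count - i
  for (size_t r = 0; r < remainder; ++r) out[i + r] = in[i + r] + 1.0f;
}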
d lvalues (a tag, NOT actual vector) are obtained using aliases:

- Most common: ScalableTag<T[, kPow2=0]> d; or the macro form HWY_FULL(T[, LMUL=1]) d;. With the default value of the second argument, these both select full vectors which utilize all available lanes.
  Only for targets (e.g. RVV) that support register groups, the kPow2 (-3..3) and LMUL (1, 2, 4, 8) arguments specify LMUL, the number of registers in the group. This effectively multiplies the lane count in each operation by LMUL, or left-shifts by kPow2 (negative values are understood as right-shifting by the absolute value). These arguments will eventually be optional hints that may improve performance on 1-2 wide machines (at the cost of reducing the effective number of registers), but the RVV target does not yet support fractional LMUL. Thus, mixed-precision code (e.g. demoting float to uint8_t) currently requires LMUL to be at least the ratio of the sizes of the largest and smallest type, and the smaller d to be obtained via Half<DLarger>.
  For other targets, kPow2 must lie within [HWY_MIN_POW2, HWY_MAX_POW2]. The *Tag aliases clamp to the upper bound, but your code should ensure the lower bound is not exceeded, typically by specializing compile-time recursions for kPow2 = HWY_MIN_POW2 (this avoids compile errors when kPow2 is low enough that it is no longer a valid shift count).
- Less common: CappedTag<T, kCap> d or the macro form HWY_CAPPED(T, kCap) d;. These select vectors or masks where no more than the largest power of two not exceeding kCap lanes have observable effects such as loading/storing to memory, or being counted by CountTrue. The number of lanes may also be less; for the HWY_SCALAR target, vectors always have a single lane. For example, CappedTag<T, 3> will use up to two lanes.
- For applications that require fixed-size vectors: FixedTag<T, kCount> d; will select vectors where exactly kCount lanes have observable effects. These may be implemented using full vectors plus additional runtime cost for masking in Load etc. kCount must be a power of two not exceeding HWY_LANES(T), which is one for HWY_SCALAR. This tag can be used when the HWY_SCALAR target is anyway disabled (superseded by a higher baseline) or unusable (due to use of ops such as TableLookupBytes). As a convenience, we also provide the Full128<T>, Full64<T> and Full32<T> aliases, which are equivalent to FixedTag<T, 16 / sizeof(T)>, FixedTag<T, 8 / sizeof(T)> and FixedTag<T, 4 / sizeof(T)>.
- The result of UpperHalf/LowerHalf has half the lanes. To obtain a corresponding d, use Half<decltype(d)>; the opposite is Twice<>.
- BlockDFromD<D> returns a d with a lane type of TFromD<D> and HWY_MIN(HWY_MAX_LANES_D(D), 16 / sizeof(TFromD<D>)) lanes.
User-specified lane counts or tuples of vectors could cause spills on targets with fewer or smaller vectors. By contrast, Highway encourages vector-length agnostic code, which is more performance-portable.
For mixed-precision code (e.g. uint8_t
lanes promoted to float
),
tags for the smaller types must be obtained from those of the larger
type (e.g. via Rebind<uint8_t, ScalableTag<float>>
).
Using unspecified vector types
Vector types are unspecified and depend on the target. Your code could
define vector variables using auto
, but it is more readable (due to
making the type visible) to use an alias such as Vec<D>
, or
decltype(Zero(d))
. Similarly, the mask type can be obtained via
Mask<D>
. Often your code will first define a d
lvalue using
ScalableTag<T>
. You may wish to define an alias for your vector
types such as using VecT = Vec<decltype(d)>
. Do not use undocumented
types such as Vec128
; these may work on most targets, but not all
(e.g. SVE).
Vectors are sizeless types on RVV/SVE. Therefore, vectors must not be
used in arrays/STL containers (use the lane type T
instead), class
members, static/thread_local variables, new-expressions (use
AllocateAligned
instead), and sizeof/pointer arithmetic (increment
T*
by Lanes(d)
instead).
Initializing constants requires a tag type D
, or an lvalue d
of
that type. The D
can be passed as a template argument or obtained
from a vector type V
via DFromV<V>
. TFromV<V>
is equivalent
to TFromD<DFromV<V>>
.
Note: Let DV = DFromV<V>
. For builtin V
(currently necessary
on RVV/SVE), DV
might not be the same as the D
used to create
V
. In particular, DV
must not be passed to Load/Store
functions because it may lack the limit on N
established by the
original D
. However, Vec<DV>
is the same as V
.
Thus a template argument V
suffices for generic functions that do
not load from/store to memory:
template<class V> V Mul4(V v) { return Mul(v, Set(DFromV<V>(), 4)); }
.
Example of mixing partial vectors with generic functions:
CappedTag<int16_t, 2> d2;
auto v = Mul4(Set(d2, 2));
Store(v, d2, ptr); // Use d2, NOT DFromV<decltype(v)>()
Targets
Let Target
denote an instruction set, one of SCALAR/EMU128
,
RVV
, SSE2/SSSE3/SSE4/AVX2/AVX3/AVX3_DL/AVX3_ZEN4/AVX3_SPR
(x86),
PPC8/PPC9/PPC10/Z14/Z15
(POWER), WASM/WASM_EMU256
(WebAssembly),
NEON_WITHOUT_AES/NEON/NEON_BF16/SVE/SVE2/SVE_256/SVE2_128
(Arm).
Note that x86 CPUs are segmented into dozens of feature flags and
capabilities, which are often used together because they were introduced
in the same CPU (example: AVX2 and FMA). To keep the number of targets
and thus compile time and code size manageable, we define targets as
‘clusters’ of related features. To use HWY_AVX2
, it is therefore
insufficient to pass -mavx2. For definitions of the clusters, see
kGroup*
in targets.cc
. The corresponding Clang/GCC compiler
options to enable them (without -m prefix) are defined by
HWY_TARGET_STR*
in set_macros-inl.h
, and also listed as comments
in https://gcc.godbolt.org/z/rGnjMevKG.
Targets are only used if enabled (i.e. not broken nor disabled). Baseline targets are those for which the compiler is unconditionally allowed to generate instructions (implying the target CPU must support them).
- HWY_STATIC_TARGET is the best enabled baseline HWY_Target, and matches HWY_TARGET in static dispatch mode. This is useful even in dynamic dispatch mode for deducing and printing the compiler flags.
- HWY_TARGETS indicates which targets to generate for dynamic dispatch, and which headers to include. It is determined by configuration macros and always includes HWY_STATIC_TARGET.
- HWY_SUPPORTED_TARGETS is the set of targets available at runtime. Expands to a literal if only a single target is enabled, or SupportedTargets().
- HWY_TARGET: which HWY_Target is currently being compiled. This is initially identical to HWY_STATIC_TARGET and remains so in static dispatch mode. For dynamic dispatch, this changes before each re-inclusion and finally reverts to HWY_STATIC_TARGET. Can be used in #if expressions to provide an alternative to functions which are not supported by HWY_SCALAR.
  In particular, for x86 we sometimes wish to specialize functions for AVX-512 because it provides many new instructions. This can be accomplished via #if HWY_TARGET <= HWY_AVX3, which means AVX-512 or better (e.g. HWY_AVX3_DL). This is because numerically lower targets are better, and no other platform has targets numerically less than those of x86. See the sketch after this list.
- HWY_WANT_SSSE3, HWY_WANT_SSE4: add SSSE3 and SSE4 to the baseline even if they are not marked as available by the compiler. On MSVC, the only ways to enable SSSE3 and SSE4 are defining these, or enabling AVX.
- HWY_WANT_AVX3_DL: opt-in for dynamic dispatch to HWY_AVX3_DL. This is unnecessary if the baseline already includes AVX3_DL.
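A sketch of per-target specialization via HWY_TARGET. CountNegative is a hypothetical example; here the scalar branch is merely a convenience, but the same structure is used to avoid ops that HWY_SCALAR lacks, or to specialize for AVX-512 via #if HWY_TARGET <= HWY_AVX3.

namespace hn = hwy::HWY_NAMESPACE;

HWY_ATTR size_t CountNegative(const float* HWY_RESTRICT p, size_t count) {
  size_t num = 0;
#if HWY_TARGET == HWY_SCALAR
  // Single-lane target: a plain scalar loop avoids per-vector overhead.
  for (size_t i = 0; i < count; ++i) num += (p[i] < 0.0f);
#else
  const hn::ScalableTag<float> d;
  const size_t N = hn::Lanes(d);
  size_t i = 0;
  for (; i + N <= count; i += N) {
    num += hn::CountTrue(d, hn::IsNegative(hn::LoadU(d, p + i)));
  }
  for (; i < count; ++i) num += (p[i] < 0.0f);  // remainder lanes
#endif
  return num;
}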
You can detect and influence the set of supported targets:
- TargetName(t) returns a string literal identifying the single target t, where t is typically HWY_TARGET.
- SupportedTargets() returns an int64_t bitfield of enabled targets that are supported on this CPU. The return value may change after calling DisableTargets, but will never be zero.
- HWY_SUPPORTED_TARGETS is equivalent to SupportedTargets() but more efficient if only a single target is enabled.
- DisableTargets(b) causes subsequent SupportedTargets() to not return target(s) whose bits are set in b. This is useful for disabling specific targets if they are unhelpful or undesirable, e.g. due to memory bandwidth limitations. The effect is not cumulative; each call overrides the effect of all previous calls. Calling with b == 0 restores the original behavior. Use SetSupportedTargetsForTest instead of this function for iteratively enabling specific targets for testing.
- SetSupportedTargetsForTest(b) causes subsequent SupportedTargets to return b, minus those disabled via DisableTargets. b is typically derived from a subset of SupportedTargets(), e.g. each individual bit in order to test each supported target. Calling with b == 0 restores the normal SupportedTargets behavior. See the sketch after this list.
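A sketch of both functions; the test macros in hwy/tests/test_util-inl.h provide a ready-made version of the per-target loop, so consider those first.

// Exclude the AVX3 family, e.g. if it is undesirable for this workload.
// Subsequent SupportedTargets() and dispatch will skip these targets.
hwy::DisableTargets(HWY_AVX3 | HWY_AVX3_DL | HWY_AVX3_ZEN4 | HWY_AVX3_SPR);

// For testing: pretend only one supported target at a time is available.
for (int64_t remaining = hwy::SupportedTargets(); remaining != 0;
     remaining &= remaining - 1) {                   // clear lowest set bit
  const int64_t current = remaining & (-remaining);  // isolate one target bit
  hwy::SetSupportedTargetsForTest(current);
  hwy::GetChosenTarget().Update(hwy::SupportedTargets());  // refresh dispatch
  // ... invoke the code under test via HWY_DYNAMIC_DISPATCH ...
}
hwy::SetSupportedTargetsForTest(0);  // restore normal behavior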
Operations
In the following, the argument or return type V
denotes a vector
with N
lanes, and M
a mask. Operations limited to certain vector
types begin with a constraint of the form V
: {prefixes}[{bits}]
.
The prefixes u,i,f
denote unsigned, signed, and floating-point
types, and bits indicates the number of bits per lane: 8, 16, 32, or 64.
Any combination of the specified prefixes and bits are allowed.
Abbreviations of the form u32 = {u}{32}
may also be used.
Note that Highway functions reside in hwy::HWY_NAMESPACE
, whereas
user-defined functions reside in project::[nested]::HWY_NAMESPACE
.
Highway functions generally take either a D
or vector/mask argument.
For targets where vectors and masks are defined in namespace hwy
,
the functions will be found via Argument-Dependent Lookup. However, this
does not work for function templates, and RVV and SVE both use built-in
vectors. Thus portable code must use one of the three following options,
in descending order of preference:
- namespace hn = hwy::HWY_NAMESPACE; alias used to prefix ops, e.g. hn::LoadDup128(..);
- using hwy::HWY_NAMESPACE::LoadDup128; declarations for each op used;
- using hwy::HWY_NAMESPACE; directive. This is generally discouraged, especially for SIMD code residing in a header.
Note that overloaded operators were not supported on RVV
and SVE
until recently. Unfortunately, clang’s SVE
comparison operators
return integer vectors instead of the svbool_t
type which exists for
this purpose. To ensure your code works on all targets, we recommend
instead using the corresponding equivalents mentioned in our description
of each overloaded operator, especially for comparisons, for example
Lt
instead of operator<
.
Initialization
- V Zero(D): returns N-lane vector with all bits set to 0.
- V Set(D, T): returns N-lane vector with all lanes equal to the given value of type T.
- V Undefined(D): returns uninitialized N-lane vector, e.g. for use as an output parameter.
- V Iota(D, T2): returns N-lane vector where the lane with index i has the given value of type T2 (the op converts it to T) + i. The least significant lane has index 0. This is useful in tests for detecting lane-crossing bugs; see also the sketch after this list.
- V SignBit(D): returns N-lane vector with all lanes set to a value whose representation has only the most-significant bit set.
- V Dup128VecFromValues(D d, T t0, .., T tK): Creates a vector from K+1 values, broadcasted to each 128-bit block if Lanes(d) >= 16/sizeof(T) is true, where K is 16/sizeof(T) - 1. Dup128VecFromValues returns the following values in each 128-bit block of the result, with t0 in the least-significant (lowest-indexed) lane of each 128-bit block and tK in the most-significant (highest-indexed) lane of each 128-bit block: {t0, t1, ..., tK}
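For example, Iota plus an offset writes consecutive integers. This is a sketch of a loop fragment; out and count are assumed to be defined by the caller, with count a multiple of Lanes(d).

namespace hn = hwy::HWY_NAMESPACE;
const hn::ScalableTag<int32_t> d;
for (size_t i = 0; i < count; i += hn::Lanes(d)) {
  // Each store writes i, i+1, ..., i+Lanes(d)-1.
  hn::StoreU(hn::Iota(d, static_cast<int32_t>(i)), d, out + i);
}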
Getting/setting lanes
- T GetLane(V): returns lane 0 within V. This is useful for extracting SumOfLanes results.

The following may be slow on some platforms (e.g. x86) and should not be used in time-critical code:

- T ExtractLane(V, size_t i): returns lane i within V. i must be in [0, Lanes(DFromV<V>())). Potentially slow; it may be better to store an entire vector to an array and then operate on its elements (see the sketch after this list).
- V InsertLane(V, size_t i, T t): returns a copy of V whose lane i is set to t. i must be in [0, Lanes(DFromV<V>())). Potentially slow; it may be better to set all elements of an aligned array and then Load it.
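A sketch of the store-then-index alternative to repeated ExtractLane; v and d are assumed to be defined as in the notation above, and DoSomething is a hypothetical per-lane consumer.

namespace hn = hwy::HWY_NAMESPACE;
// AllocateAligned is declared in hwy/aligned_allocator.h.
auto lanes = hwy::AllocateAligned<float>(hn::Lanes(d));
hn::Store(v, d, lanes.get());
for (size_t i = 0; i < hn::Lanes(d); ++i) {
  DoSomething(lanes[i]);  // hypothetical scalar processing of each lane
}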
Getting/setting blocks
- Vec<BlockDFromD<DFromV<V>>> ExtractBlock<int kBlock>(V): returns block kBlock of V, where kBlock is an index to a block that is HWY_MIN(DFromV<V>().MaxBytes(), 16) bytes. kBlock must be in [0, DFromV<V>().MaxBlocks()).
- V InsertBlock<int kBlock>(V v, Vec<BlockDFromD<DFromV<V>>> blk_to_insert): Inserts blk_to_insert, with blk_to_insert[i] inserted into lane kBlock * (16 / sizeof(TFromV<V>)) + i of the result vector, if kBlock * 16 < Lanes(DFromV<V>()) * sizeof(TFromV<V>) is true. Otherwise, returns v if kBlock * 16 is greater than or equal to Lanes(DFromV<V>()) * sizeof(TFromV<V>). kBlock must be in [0, DFromV<V>().MaxBlocks()).
- size_t Blocks(D d): Returns the number of 16-byte blocks if Lanes(d) * sizeof(TFromD<D>) is greater than or equal to 16; otherwise returns 1.
Printing
V Print(D, const char* caption, V [, size_t lane][, size_t max_lanes]): prints caption followed by up to max_lanes comma-separated lanes from the vector argument, starting at index lane. Defined in hwy/print-inl.h, also available if hwy/tests/test_util-inl.h has been included.
Tuples
As a partial workaround to the “no vectors as class members” compiler
limitation mentioned in “Using unspecified vector types”, we provide
special types able to carry 2, 3 or 4 vectors, denoted Tuple{2-4}
below. Their type is unspecified, potentially built-in, so use the
aliases Vec{2-4}<D>
. These can (only) be passed as arguments or
returned from functions, and created/accessed using the functions in
this section.
- Tuple2 Create2(D, V v0, V v1): returns tuple such that Get2<1>(tuple) returns v1.
- Tuple3 Create3(D, V v0, V v1, V v2): returns tuple such that Get3<2>(tuple) returns v2.
- Tuple4 Create4(D, V v0, V v1, V v2, V v3): returns tuple such that Get4<3>(tuple) returns v3.

The following take a size_t template argument indicating the zero-based index, from left to right, of the arguments passed to Create{2-4}.

- V Get2<size_t>(Tuple2): returns the i-th vector passed to Create2.
- V Get3<size_t>(Tuple3): returns the i-th vector passed to Create3.
- V Get4<size_t>(Tuple4): returns the i-th vector passed to Create4.
- Tuple2 Set2<size_t>(Tuple2 tuple, Vec v): sets the i-th vector.
- Tuple3 Set3<size_t>(Tuple3 tuple, Vec v): sets the i-th vector.
- Tuple4 Set4<size_t>(Tuple4 tuple, Vec v): sets the i-th vector.
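A sketch of returning two vectors from one helper via a tuple; MinMax is a hypothetical name.

namespace hn = hwy::HWY_NAMESPACE;

template <class D, class V = hn::Vec<D>>
HWY_ATTR hn::Vec2<D> MinMax(D d, V a, V b) {
  // Index 0 holds the first argument (the minimum), index 1 the maximum.
  return hn::Create2(d, hn::Min(a, b), hn::Max(a, b));
}

// Usage:
// const auto mm = MinMax(d, a, b);
// const auto lo = hn::Get2<0>(mm);  // Min(a, b)
// const auto hi = hn::Get2<1>(mm);  // Max(a, b)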
Arithmetic
V operator+(V a, V b): returns
a[i] + b[i]
(mod 2^bits). Currently unavailable on SVE/RVV; use the equivalentAdd
instead.V operator-(V a, V b): returns
a[i] - b[i]
(mod 2^bits). Currently unavailable on SVE/RVV; use the equivalentSub
instead.V AddSub(V a, V b): returns
a[i] - b[i]
in the even lanes anda[i] + b[i]
in the odd lanes.AddSub(a, b)
is equivalent toOddEven(Add(a, b), Sub(a, b))
orAdd(a, OddEven(b, Neg(b)))
, butAddSub(a, b)
is more efficient thanOddEven(Add(a, b), Sub(a, b))
orAdd(a, OddEven(b, Neg(b)))
on some targets.V
:{i,f}
V Neg(V a): returns-a[i]
.V
:i
V SaturatedNeg(V a): returnsa[i] == LimitsMin<T>() ? LimitsMax<T>() : -a[i]
.SaturatedNeg(a)
is usually more efficient thanIfThenElse(Eq(a, Set(d, LimitsMin<T>())), Set(d, LimitsMax<T>()), Neg(a))
.V
:{i,f}
V Abs(V a) returns the absolute value ofa[i]
; for integers,LimitsMin()
maps toLimitsMax() + 1
.V
:i
V SaturatedAbs(V a) returnsa[i] == LimitsMin<T>() ? LimitsMax<T>() : (a[i] < 0 ? (-a[i]) : a[i])
.SaturatedAbs(a)
is usually more efficient thanIfThenElse(Eq(a, Set(d, LimitsMin<T>())), Set(d, LimitsMax<T>()), Abs(a))
.V AbsDiff(V a, V b): returns
|a[i] - b[i]|
in each lane.V
:{i,u}{8,16,32},f{16,32}
,VW
:Vec<RepartitionToWide<DFromV<V>>>
VW SumsOf2(V v) returns the sums of 2 consecutive lanes, promoting each sum into a lane ofTFromV<VW>
.V
:{i,u}{8,16}
,VW
:Vec<RepartitionToWideX2<DFromV<V>>>
VW SumsOf4(V v) returns the sums of 4 consecutive lanes, promoting each sum into a lane ofTFromV<VW>
.V
:{i,u}8
,VW
:Vec<RepartitionToWideX3<DFromV<V>>>
VW SumsOf8(V v) returns the sums of 8 consecutive lanes, promoting each sum into a lane ofTFromV<VW>
. This is slower on RVV/WASM.V
:{i,u}8
,VW
:Vec<RepartitionToWideX3<DFromV<V>>>
VW SumsOf8AbsDiff(V a, V b) returns the same result asSumsOf8(AbsDiff(a, b))
, but is more efficient on x86.V
:{i,u}8
,VW
:Vec<RepartitionToWide<DFromV<V>>>
VW SumsOfAdjQuadAbsDiff<int kAOffset, int kBOffset>(V a, V b) returns the sums of the absolute differences of 32-bit blocks of 8-bit integers, widened toMakeWide<TFromV<V>>
.kAOffset
must be between0
andHWY_MIN(1, (HWY_MAX_LANES_D(DFromV<V>) - 1)/4)
.kBOffset
must be between0
andHWY_MIN(3, (HWY_MAX_LANES_D(DFromV<V>) - 1)/4)
.SumsOfAdjQuadAbsDiff computes
|a[a_idx] - b[b_idx]| + |a[a_idx+1] - b[b_idx+1]| + |a[a_idx+2] - b[b_idx+2]| + |a[a_idx+3] - b[b_idx+3]|
for each lanei
of the result, wherea_idx
is equal tokAOffset*4+((i/8)*16)+(i&7)
and whereb_idx
is equal tokBOffset*4+((i/8)*16)
.If
Lanes(DFromV<V>()) < (8 << kAOffset)
is true, then SumsOfAdjQuadAbsDiff returns implementation-defined values in any lanes past the first (lowest-indexed) lane of the result vector.SumsOfAdjQuadAbsDiff is only available if
HWY_TARGET != HWY_SCALAR
.V
:{i,u}8
,VW
:Vec<RepartitionToWide<DFromV<V>>>
VW SumsOfShuffledQuadAbsDiff<int kIdx3, int kIdx2, int kIdx1, int kIdx0>(V a, V b) first shufflesa
as if by thePer4LaneBlockShuffle<kIdx3, kIdx2, kIdx1, kIdx0>(BitCast( RepartitionToWideX2<DFromV<V>>(), a))
operation, and then computes the sum of absolute differences of 32-bit blocks of 8-bit integers taken from the shuffleda
vector and theb
vector.kIdx0
,kIdx1
,kIdx2
, andkIdx3
must be between 0 and 3.SumsOfShuffledQuadAbsDiff computes
|a_shuf[a_idx] - b[b_idx]| + |a_shuf[a_idx+1] - b[b_idx+1]| + |a_shuf[a_idx+2] - b[b_idx+2]| + |a_shuf[a_idx+3] - b[b_idx+3]|
for each lanei
of the result, wherea_shuf
is equal toBitCast(DFromV<V>(), Per4LaneBlockShuffle<kIdx3, kIdx2, kIdx1, kIdx0>(BitCast(RepartitionToWideX2<DFromV<V>>(), a))
,a_idx
is equal to(i/4)*8+(i&3)
, andb_idx
is equal to(i/2)*4
.If
Lanes(DFromV<V>()) < 16
is true, SumsOfShuffledQuadAbsDiff returns implementation-defined results in any lanes where(i/4)*8+(i&3)+3 >= Lanes(d)
.The results of SumsOfAdjQuadAbsDiff are implementation-defined if
kIdx0 >= Lanes(DFromV<V>()) / 4
.The results of any lanes past the first (lowest-indexed) lane of SumsOfAdjQuadAbsDiff are implementation-defined if
kIdx1 >= Lanes(DFromV<V>()) / 4
.SumsOfShuffledQuadAbsDiff is only available if
HWY_TARGET != HWY_SCALAR
.V
:{u,i}{8,16}
V SaturatedAdd(V a, V b) returnsa[i] + b[i]
saturated to the minimum/maximum representable value.V
:{u,i}{8,16}
V SaturatedSub(V a, V b) returnsa[i] - b[i]
saturated to the minimum/maximum representable value.
V: {u,i} V AverageRound(V a, V b): returns (a[i] + b[i] + 1) >> 1; see the sketch at the end of this section.
V Clamp(V a, V lo, V hi): returns
a[i]
clamped to[lo[i], hi[i]]
.V operator/(V a, V b): returns
a[i] / b[i]
in each lane. Currently unavailable on SVE/RVV; use the equivalentDiv
instead.For integer vectors,
Div(a, b)
returns an implementation-defined value in any lanes whereb[i] == 0
.For signed integer vectors,
Div(a, b)
returns an implementation-defined value in any lanes wherea[i] == LimitsMin<T>() && b[i] == -1
.V
:{u,i}
V operator%(V a, V b): returnsa[i] % b[i]
in each lane. Currently unavailable on SVE/RVV; use the equivalentMod
instead.Mod(a, b)
returns an implementation-defined value in any lanes whereb[i] == 0
.For signed integer vectors,
Mod(a, b)
returns an implementation-defined value in any lanes wherea[i] == LimitsMin<T>() && b[i] == -1
.V
:{f}
V Sqrt(V a): returnssqrt(a[i])
.V
:{f}
V ApproximateReciprocalSqrt(V a): returns an approximation of1.0 / sqrt(a[i])
.sqrt(a) ~= ApproximateReciprocalSqrt(a) * a
. x86 and PPC provide 12-bit approximations but the error on Arm is closer to 1%.V
:{f}
V ApproximateReciprocal(V a): returns an approximation of1.0 / a[i]
.
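A sketch using AverageRound, e.g. a 50% blend of two uint8_t image buffers; assumes size is a multiple of the lane count.

namespace hn = hwy::HWY_NAMESPACE;

HWY_ATTR void Average(const uint8_t* HWY_RESTRICT a,
                      const uint8_t* HWY_RESTRICT b,
                      uint8_t* HWY_RESTRICT out, size_t size) {
  const hn::ScalableTag<uint8_t> d;
  for (size_t i = 0; i < size; i += hn::Lanes(d)) {
    // Rounded average avoids the bias of truncating (a + b) / 2.
    hn::StoreU(hn::AverageRound(hn::LoadU(d, a + i), hn::LoadU(d, b + i)),
               d, out + i);
  }
}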
Min/Max
Note: Min/Max corner cases are target-specific and may change. If either argument is qNaN, x86 SIMD returns the second argument, Armv7 Neon returns NaN, Wasm is supposed to return NaN but does not always, but other targets actually uphold IEEE 754-2019 minimumNumber: returning the other argument if exactly one is qNaN, and NaN if both are.
V Min(V a, V b): returns
min(a[i], b[i])
.V Max(V a, V b): returns
max(a[i], b[i])
.
All other ops in this section are only available if
HWY_TARGET != HWY_SCALAR
:
V
:u64
V Min128(D, V a, V b): returns the minimum of unsigned 128-bit values, each stored as an adjacent pair of 64-bit lanes ( e.g. indices 1 and 0, where 0 is the least-significant 64-bits).V
:u64
V Max128(D, V a, V b): returns the maximum of unsigned 128-bit values, each stored as an adjacent pair of 64-bit lanes ( e.g. indices 1 and 0, where 0 is the least-significant 64-bits).V
:u64
V Min128Upper(D, V a, V b): for each 128-bit key-value pair, returnsa
if it is considered less thanb
by Lt128Upper, elseb
.V
:u64
V Max128Upper(D, V a, V b): for each 128-bit key-value pair, returnsa
if it is considered >b
by Lt128Upper, elseb
.
Multiply
V operator*(V a, V b): returns
r[i] = a[i] * b[i]
, truncating it to the lower half for integer inputs. Currently unavailable on SVE/RVV; use the equivalentMul
instead.V
:f
,VI
:Vec<RebindToSigned<DFromV<V>>>
V MulByPow2(V a, VI b): Multipliesa[i]
by2^b[i]
.MulByPow2(a, b)
is equivalent tostd::ldexp(a[i], HWY_MIN(HWY_MAX(b[i], LimitsMin<int>()), LimitsMax<int>()))
.V
:f
V MulByFloorPow2(V a, V b): Multipliesa[i]
by2^floor(b[i])
.It is implementation-defined if
MulByFloorPow2(a, b)
returns zero or NaN in any lanes wherea[i]
is NaN andb[i]
is equal to negative infinity.It is implementation-defined if
MulByFloorPow2(a, b)
returns positive infinity or NaN in any lanes wherea[i]
is NaN andb[i]
is equal to positive infinity.If
a[i]
is a non-NaN value andb[i]
is equal to negative infinity,MulByFloorPow2(a, b)
is equivalent toa[i] * 0.0
.If
b[i]
is NaN or ifa[i]
is non-NaN andb[i]
is positive infinity,MulByFloorPow2(a, b)
is equivalent toa[i] * b[i]
.If
b[i]
is a finite value,MulByFloorPow2(a, b)
is equivalent toMulByPow2(a, FloorInt(b))
.V
:{u,i}
V MulHigh(V a, V b): returns the upper half ofa[i] * b[i]
in each lane.V
:i16
V MulFixedPoint15(V a, V b): returns the result of multiplying two Q1.15 fixed-point numbers. This corresponds to doubling the multiplication result and storing the upper half. Results are implementation-defined iff both inputs are -32768.V
:{u,i}
V2 MulEven(V a, V b): returns double-wide result ofa[i] * b[i]
for every eveni
, in lanesi
(lower) andi + 1
(upper).V2
is a vector with double-width lanes, or the same asV
for 64-bit inputs (which are only supported ifHWY_TARGET != HWY_SCALAR
).V
:{u,i}
V MulOdd(V a, V b): returns double-wide result ofa[i] * b[i]
for every oddi
, in lanesi - 1
(lower) andi
(upper). Only supported ifHWY_TARGET != HWY_SCALAR
.V
:{bf,u,i}16
,D
:RepartitionToWide<DFromV<V>>
Vec<D> WidenMulPairwiseAdd(D d, V a, V b): widensa
andb
toTFromD<D>
and computesa[2*i+1]*b[2*i+1] + a[2*i+0]*b[2*i+0]
.VI
:i8
,VU
:Vec<RebindToUnsigned<DFromV<VI>>>
,DI
:RepartitionToWide<DFromV<VI>>
Vec<DI> SatWidenMulPairwiseAdd(DI di, VU a_u, VI b_i) : widensa_u
andb_i
toTFromD<DI>
and computesa_u[2*i+1]*b_i[2*i+1] + a_u[2*i+0]*b_i[2*i+0]
, saturated to the range ofTFromD<D>
.DW
:i32
,D
:Rebind<MakeNarrow<TFromD<DW>>, DW>
,VW
:Vec<DW>
,V
:Vec<D>
Vec<D> SatWidenMulPairwiseAccumulate(DW, V a, V b, VW sum) : widensa[i]
andb[i]
toTFromD<DI>
and computesa[2*i]*b[2*i] + a[2*i+1]*b[2*i+1] + sum[i]
, saturated to the range ofTFromD<DW>
.DW
:i32
,D
:Rebind<MakeNarrow<TFromD<DW>>, DW>
,VW
:Vec<DW>
,V
:Vec<D>
VW SatWidenMulAccumFixedPoint(DW, V a, V b, VW sum)**: First, widensa
andb
toTFromD<DW>
, then addsa[i] * b[i] * 2
tosum[i]
, saturated to the range ofTFromD<DW>
.If
a[i] == LimitsMin<TFromD<D>>() && b[i] == LimitsMin<TFromD<D>>()
, it is implementation-defined whethera[i] * b[i] * 2
is first saturated toTFromD<DW>
prior to the addition ofa[i] * b[i] * 2
tosum[i]
.V
:{bf,u,i}16
,DW
:RepartitionToWide<DFromV<V>>
,VW
:Vec<DW>
VW ReorderWidenMulAccumulate(DW d, V a, V b, VW sum0, VW& sum1): widensa
andb
toTFromD<DW>
, then addsa[i] * b[i]
to eithersum1[j]
or lanej
of the return value, wherej = P(i)
andP
is a permutation. The only guarantee is thatSumOfLanes(d, Add(return_value, sum1))
is the sum of alla[i] * b[i]
. This is useful for computing dot products and the L2 norm. The initial value ofsum1
before any call toReorderWidenMulAccumulate
must be zero (because it is unused on some platforms). It is safe to set the initial value ofsum0
to any vectorv
; this has the effect of increasing the total sum byGetLane(SumOfLanes(d, v))
and may be slightly more efficient than later addingv
tosum0
.VW
:{f,u,i}32
VW RearrangeToOddPlusEven(VW sum0, VW sum1): returns in each 32-bit lane with indexi
a[2*i+1]*b[2*i+1] + a[2*i+0]*b[2*i+0]
.sum0
must be the return value of a priorReorderWidenMulAccumulate
, andsum1
must be its last (output) argument. In other words, this strengthens the invariant ofReorderWidenMulAccumulate
such that each 32-bit lane is the sum of the widened products whose 16-bit inputs came from the top and bottom halves of the 32-bit lane. This is typically called after a series of calls toReorderWidenMulAccumulate
, as opposed to after each one. Exception: ifHWY_TARGET == HWY_SCALAR
, returnsa[0]*b[0]
. Note that the initial value ofsum1
must be zero, seeReorderWidenMulAccumulate
.VN
:{u,i}{8,16}
,D
:RepartitionToWideX2<DFromV<VN>>
Vec<D> SumOfMulQuadAccumulate(D d, VN a, VN b, Vec<D> sum): widensa
andb
toTFromD<D>
and computessum[i] + a[4*i+3]*b[4*i+3] + a[4*i+2]*b[4*i+2] + a[4*i+1]*b[4*i+1] + a[4*i+0]*b[4*i+0]
VN_I
:i8
,VN_U
:Vec<RebindToUnsigned<DFromV<VN_I>>>
,DI
:Repartition<int32_t, DFromV<VN_I>>
Vec<DI> SumOfMulQuadAccumulate(DI di, VN_U a_u, VN_I b_i, Vec<DI> sum): widensa
andb
toTFromD<DI>
and computessum[i] + a[4*i+3]*b[4*i+3] + a[4*i+2]*b[4*i+2] + a[4*i+1]*b[4*i+1] + a[4*i+0]*b[4*i+0]
V
:{u,i}{8,16,32},{f}16
,VW
:Vec<RepartitionToWide<DFromV<V>>
:VW WidenMulAccumulate(D, V a, V b, VW low, VW& high)
: widensa
andb
, multiplies them together, then adds them to the concatenated vectors high:low. Returns the lower half of the result, and sets high to the upper half.
Fused multiply-add
When implemented using special instructions, these functions are more
precise and faster than separate multiplication followed by addition.
The *Sub
variants are somewhat slower on Arm, and unavailable for
integer inputs; if the c
argument is a constant, it would be better
to negate it and use MulAdd
.
V MulAdd(V a, V b, V c): returns
a[i] * b[i] + c[i]
.V NegMulAdd(V a, V b, V c): returns
-a[i] * b[i] + c[i]
.V MulSub(V a, V b, V c): returns
a[i] * b[i] - c[i]
.V NegMulSub(V a, V b, V c): returns
-a[i] * b[i] - c[i]
V MulAddSub(V a, V b, V c): returns a[i] * b[i] - c[i] in the even lanes and a[i] * b[i] + c[i] in the odd lanes. MulAddSub(a, b, c) is equivalent to OddEven(MulAdd(a, b, c), MulSub(a, b, c)) or MulAdd(a, b, OddEven(c, Neg(c))), but MulAddSub(a, b, c) is more efficient on some targets (including AVX2/AVX3).V
:bf16
,D
:RepartitionToWide<DFromV<V>>
,VW
:Vec<D>
VW MulEvenAdd(D d, V a, V b, VW c): equivalent to and potentially more efficient thanMulAdd(PromoteEvenTo(d, a), PromoteEvenTo(d, b), c)
.V
:bf16
,D
:RepartitionToWide<DFromV<V>>
,VW
:Vec<D>
VW MulOddAdd(D d, V a, V b, VW c): equivalent to and potentially more efficient thanMulAdd(PromoteOddTo(d, a), PromoteOddTo(d, b), c)
.
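A common use of MulAdd is Horner's scheme for polynomial evaluation; this is a sketch with arbitrary example coefficients.

namespace hn = hwy::HWY_NAMESPACE;

// Evaluates c2*x^2 + c1*x + c0 per lane as (c2*x + c1)*x + c0.
template <class D, class V = hn::Vec<D>>
HWY_ATTR V Polynomial2(D d, V x) {
  const V c0 = hn::Set(d, 1.0f);
  const V c1 = hn::Set(d, -0.5f);
  const V c2 = hn::Set(d, 0.25f);
  return hn::MulAdd(hn::MulAdd(c2, x, c1), x, c0);
}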
Masked arithmetic
All ops in this section return no
for mask=false
lanes, and
suppress any exceptions for those lanes if that is supported by the ISA.
When exceptions are not a concern, these are equivalent to, and
potentially more efficient than, IfThenElse(m, Add(a, b), no);
etc.
V MaskedMinOr(V no, M m, V a, V b): returns
Min(a, b)[i]
orno[i]
ifm[i]
is false.V MaskedMaxOr(V no, M m, V a, V b): returns
Max(a, b)[i]
orno[i]
ifm[i]
is false.V MaskedAddOr(V no, M m, V a, V b): returns
a[i] + b[i]
orno[i]
ifm[i]
is false.V MaskedSubOr(V no, M m, V a, V b): returns
a[i] - b[i]
orno[i]
ifm[i]
is false.V MaskedMulOr(V no, M m, V a, V b): returns
a[i] * b[i]
orno[i]
ifm[i]
is false.V MaskedDivOr(V no, M m, V a, V b): returns
a[i] / b[i]
orno[i]
ifm[i]
is false.V
:{u,i}
V MaskedModOr(V no, M m, V a, V b): returnsa[i] % b[i]
orno[i]
ifm[i]
is false.V
:{u,i}{8,16}
V MaskedSatAddOr(V no, M m, V a, V b): returnsa[i] + b[i]
saturated to the minimum/maximum representable value, orno[i]
ifm[i]
is false.V
:{u,i}{8,16}
V MaskedSatSubOr(V no, M m, V a, V b): returns a[i] - b[i] saturated to the minimum/maximum representable value, or no[i] if m[i] is false.
Shifts
Note: Counts not in [0, sizeof(T)*8)
yield
implementation-defined results. Left-shifting signed T
and
right-shifting positive signed T
is the same as shifting
MakeUnsigned<T>
and casting to T
. Right-shifting negative signed
T
is the same as an unsigned shift, except that 1-bits are shifted
in.
Compile-time constant shifts: the amount must be in [0, sizeof(T)*8).
Generally the most efficient variant, but 8-bit shifts are potentially
slower than other lane sizes, and RotateRight
is often emulated with
shifts:
V
:{u,i}
V ShiftLeft<int>(V a) returnsa[i] << int
.V
:{u,i}
V ShiftRight<int>(V a) returnsa[i] >> int
.V
:{u,i}
V RoundingShiftRight<int>(V a) returns((int == 0) ? a[i] : (((a[i] >> (int - 1)) + 1) >> 1)
.V
:{u,i}
V RotateLeft<int>(V a) returns(a[i] << int) | (static_cast<TU>(a[i]) >> (sizeof(T)*8 - int))
.V
:{u,i}
V RotateRight<int>(V a) returns(static_cast<TU>(a[i]) >> int) | (a[i] << (sizeof(T)*8 - int))
.
Shift all lanes by the same (not necessarily compile-time constant) amount:
V
:{u,i}
V ShiftLeftSame(V a, int bits) returnsa[i] << bits
.V
:{u,i}
V ShiftRightSame(V a, int bits) returnsa[i] >> bits
.V
:{u,i}
V RoundingShiftRightSame<int kShiftAmt>(V a, int bits) returns((bits == 0) ? a[i] : (((a[i] >> (bits - 1)) + 1) >> 1)
.V
:{u,i}
V RotateLeftSame(V a, int bits) returns(a[i] << shl_bits) | (static_cast<TU>(a[i]) >> (sizeof(T)*8 - shl_bits))
, whereshl_bits
is equal tobits & (sizeof(T)*8 - 1)
.V
:{u,i}
V RotateRightSame(V a, int bits) returns (static_cast<TU>(a[i]) >> shr_bits) | (a[i] << (sizeof(T)*8 - shr_bits)), where shr_bits is equal to bits & (sizeof(T)*8 - 1).
Per-lane variable shifts (slow if SSSE3/SSE4, or 16-bit, or Shr i64 on AVX2):
V
:{u,i}
V operator<<(V a, V b) returnsa[i] << b[i]
. Currently unavailable on SVE/RVV; use the equivalentShl
instead.V
:{u,i}
V operator>>(V a, V b) returnsa[i] >> b[i]
. Currently unavailable on SVE/RVV; use the equivalentShr
instead.V
:{u,i}
V RoundingShr(V a, V b) returns((b[i] == 0) ? a[i] : (((a[i] >> (b[i] - 1)) + 1) >> 1)
.V
:{u,i}
V Rol(V a, V b) returns(a[i] << (b[i] & shift_amt_mask)) | (static_cast<TU>(a[i]) >> ((sizeof(T)*8 - b[i]) & shift_amt_mask))
, whereshift_amt_mask
is equal tosizeof(T)*8 - 1
.V
:{u,i}
V Ror(V a, V b) returns(static_cast<TU>(a[i]) >> (b[i] & shift_amt_mask)) | (a[i] << ((sizeof(T)*8 - b[i]) & shift_amt_mask))
, whereshift_amt_mask
is equal tosizeof(T)*8 - 1
.
Floating-point rounding
- V: {f} V Round(V v): returns v[i] rounded towards the nearest integer, with ties to even.
- V: {f} V Trunc(V v): returns v[i] rounded towards zero (truncate).
- V: {f} V Ceil(V v): returns v[i] rounded towards positive infinity (ceiling).
- V: {f} V Floor(V v): returns v[i] rounded towards negative infinity.
Floating-point classification
- V: {f} M IsNaN(V v): returns mask indicating whether v[i] is “not a number” (unordered). See the sketch after this list.
- V: {f} M IsEitherNaN(V a, V b): equivalent to Or(IsNaN(a), IsNaN(b)), but IsEitherNaN(a, b) is more efficient than Or(IsNaN(a), IsNaN(b)) on x86.
- V: {f} M IsInf(V v): returns mask indicating whether v[i] is positive or negative infinity.
- V: {f} M IsFinite(V v): returns mask indicating whether v[i] is neither NaN nor infinity, i.e. normal, subnormal or zero. Equivalent to Not(Or(IsNaN(v), IsInf(v))).
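For example, a sketch of cleaning up NaN lanes before further processing; v and d are assumed to be defined as in the notation above (IfThenZeroElse and IfThenElse are described under Masks below).

// Zero out NaN lanes: mask ? 0 : v[i].
const auto cleaned = hn::IfThenZeroElse(hn::IsNaN(v), v);
// Or replace NaN lanes with a sentinel value:
const auto repaired = hn::IfThenElse(hn::IsNaN(v), hn::Set(d, -1.0f), v);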
Logical
V
:{u,i}
V PopulationCount(V a): returns the number of 1-bits in each lane, i.e.PopCount(a[i])
.V
:{u,i}
V LeadingZeroCount(V a): returns the number of leading zeros in each lane. For any lanes wherea[i]
is zero,sizeof(TFromV<V>) * 8
is returned in the corresponding result lanes.V
:{u,i}
V TrailingZeroCount(V a): returns the number of trailing zeros in each lane. For any lanes wherea[i]
is zero,sizeof(TFromV<V>) * 8
is returned in the corresponding result lanes.V
:{u,i}
V HighestSetBitIndex(V a): returns the index of the highest set bit of each lane. For any lanes of a signed vector type wherea[i]
is zero, an unspecified negative value is returned in the corresponding result lanes. For any lanes of an unsigned vector type wherea[i]
is zero, an unspecified value that is greater thanHighestValue<MakeSigned<TFromV<V>>>()
is returned in the corresponding result lanes.
The following operate on individual bits within each lane. Note that the
non-operator functions (And
instead of &
) must be used for
floating-point types, and on SVE/RVV.
V
:{u,i}
V operator&(V a, V b): returnsa[i] & b[i]
. Currently unavailable on SVE/RVV; use the equivalentAnd
instead.V
:{u,i}
V operator|(V a, V b): returnsa[i] | b[i]
. Currently unavailable on SVE/RVV; use the equivalentOr
instead.V
:{u,i}
V operator^(V a, V b): returnsa[i] ^ b[i]
. Currently unavailable on SVE/RVV; use the equivalentXor
instead.V
:{u,i}
V Not(V v): returns~v[i]
.V AndNot(V a, V b): returns
~a[i] & b[i]
.
The following three-argument functions may be more efficient than assembling them from 2-argument functions:
V Xor3(V x1, V x2, V x3): returns
x1[i] ^ x2[i] ^ x3[i]
. This is more efficient thanOr3
on some targets. When inputs are disjoint (no bit is set in more than one argument),Xor3
andOr3
are equivalent and you should use the former.V Or3(V o1, V o2, V o3): returns
o1[i] | o2[i] | o3[i]
. This is less efficient thanXor3
on some targets; use that where possible.V OrAnd(V o, V a1, V a2): returns
o[i] | (a1[i] & a2[i])
.V BitwiseIfThenElse(V mask, V yes, V no): returns
((mask[i] & yes[i]) | (~mask[i] & no[i]))
.BitwiseIfThenElse
is equivalent to, but potentially more efficient thanOr(And(mask, yes), AndNot(mask, no))
.
Special functions for signed types:
V
:{f}
V CopySign(V a, V b): returns the number with the magnitude ofa
and sign ofb
.V
:{f}
V CopySignToAbs(V a, V b): as above, but potentially slightly more efficient; requires the first argument to be non-negative.V
:{i}
V BroadcastSignBit(V a) returnsa[i] < 0 ? -1 : 0
.V
:{i,f}
V ZeroIfNegative(V v): returnsv[i] < 0 ? 0 : v[i]
.V
:{i,f}
V IfNegativeThenElse(V v, V yes, V no): returnsv[i] < 0 ? yes[i] : no[i]
. This may be more efficient thanIfThenElse(Lt..)
.V
:{i,f}
V IfNegativeThenElseZero(V v, V yes): returnsv[i] < 0 ? yes[i] : 0
.IfNegativeThenElseZero(v, yes)
is equivalent to but more efficient thanIfThenElseZero(IsNegative(v), yes)
orIfNegativeThenElse(v, yes, Zero(d))
on some targets.V
:{i,f}
V IfNegativeThenZeroElse(V v, V no): returnsv[i] < 0 ? 0 : no
.IfNegativeThenZeroElse(v, no)
is equivalent to but more efficient thanIfThenZeroElse(IsNegative(v), no)
orIfNegativeThenElse(v, Zero(d), no)
on some targets.V
:{i,f}
V IfNegativeThenNegOrUndefIfZero(V mask, V v): returnsmask[i] < 0 ? (-v[i]) : ((mask[i] > 0) ? v[i] : impl_defined_val)
, whereimpl_defined_val
is an implementation-defined value that is equal to either 0 orv[i]
.IfNegativeThenNegOrUndefIfZero(mask, v)
is more efficient thanIfNegativeThenElse(mask, Neg(v), v)
for I8/I16/I32 vectors that are 32 bytes or smaller on SSSE3/SSE4/AVX2/AVX3 targets.
Masks
Let M
denote a mask capable of storing a logical true/false for each
lane (the encoding depends on the platform).
Create mask
- M FirstN(D, size_t N): returns mask with the first N lanes (those with index < N) true. N >= Lanes(D()) results in an all-true mask. N must not exceed LimitsMax<SignedFromSize<HWY_MIN(sizeof(size_t), sizeof(TFromD<D>))>>(). Useful for implementing “masked” stores by loading prev followed by IfThenElse(FirstN(d, N), what_to_store, prev); see the sketch after this list.
- M MaskFromVec(V v): returns false in lane i if v[i] == 0, or true if v[i] has all bits set. The result is implementation-defined if v[i] is neither zero nor all bits set.
- M LoadMaskBits(D, const uint8_t* p): returns a mask indicating whether the i-th bit in the array is set. Loads bytes and bits in ascending order of address and index. At least 8 bytes of p must be readable, but only (Lanes(D()) + 7) / 8 need be initialized. Any unused bits (happens if Lanes(D()) < 8) are treated as if they were zero.
- M Dup128MaskFromMaskBits(D d, unsigned mask_bits): returns a mask with lane i set to ((mask_bits >> (i & (16 / sizeof(T) - 1))) & 1) != 0.
- M MaskFalse(D): returns an all-false mask. MaskFalse(D()) is equivalent to MaskFromVec(Zero(D())), but MaskFalse(D()) is more efficient than MaskFromVec(Zero(D())) on AVX3, RVV, and SVE. MaskFalse(D()) is also equivalent to FirstN(D(), 0) or Dup128MaskFromMaskBits(D(), 0), but MaskFalse(D()) is usually more efficient.
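A sketch of the “masked store” idiom mentioned under FirstN, for handling the remainder of a loop. d, p, i, count and what_to_store are assumed to come from the surrounding loop, and at least Lanes(d) elements at p + i must be readable and writable.

namespace hn = hwy::HWY_NAMESPACE;
const size_t remaining = count - i;       // remaining < Lanes(d)
const auto prev = hn::LoadU(d, p + i);    // existing contents beyond the end
const auto mask = hn::FirstN(d, remaining);
// Keep what_to_store in the first `remaining` lanes, prev elsewhere.
hn::StoreU(hn::IfThenElse(mask, what_to_store, prev), d, p + i);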
Convert mask
M1 RebindMask(D, M2 m): returns same mask bits as
m
, but reinterpreted as a mask for lanes of typeTFromD<D>
.M1
andM2
must have the same number of lanes.V VecFromMask(D, M m): returns 0 in lane
i
ifm[i] == false
, otherwise all bits set.uint64_t BitsFromMask(D, M m): returns bits
b
such that(b >> i) & 1
indicates whetherm[i]
was set, and any remaining bits in theuint64_t
are zero. This is only available ifHWY_MAX_BYTES <= 64
, because 512-bit vectors are the longest for which there are no more than 64 lanes and thus mask bits.size_t StoreMaskBits(D, M m, uint8_t* p): stores a bit array indicating whether
m[i]
is true, in ascending order ofi
, filling the bits of each byte from least to most significant, then proceeding to the next byte. Returns the number of bytes written:(Lanes(D()) + 7) / 8
. At least 8 bytes ofp
must be writable.Mask<DTo> PromoteMaskTo(DTo d_to, DFrom d_from, Mask<DFrom> m): Promotes
m
to a mask with a lane type ofTFromD<DTo>
,DFrom
isRebind<TFrom, DTo>
.PromoteMaskTo(d_to, d_from, m)
is equivalent toMaskFromVec(BitCast(d_to, PromoteTo(di_to, BitCast(di_from, VecFromMask(d_from, m)))))
, wheredi_from
isRebindToSigned<DFrom>()
anddi_from
isRebindToSigned<DFrom>()
, butPromoteMaskTo(d_to, d_from, m)
is more efficient on some targets.PromoteMaskTo requires that
sizeof(TFromD<DFrom>) < sizeof(TFromD<DTo>)
be true.Mask<DTo> DemoteMaskTo(DTo d_to, DFrom d_from, Mask<DFrom> m): Demotes
m
to a mask with a lane type ofTFromD<DTo>
,DFrom
isRebind<TFrom, DTo>
.DemoteMaskTo(d_to, d_from, m)
is equivalent toMaskFromVec(BitCast(d_to, DemoteTo(di_to, BitCast(di_from, VecFromMask(d_from, m)))))
, wheredi_from
isRebindToSigned<DFrom>()
anddi_from
isRebindToSigned<DFrom>()
, butDemoteMaskTo(d_to, d_from, m)
is more efficient on some targets.DemoteMaskTo requires that
sizeof(TFromD<DFrom>) > sizeof(TFromD<DTo>)
be true.M OrderedDemote2MasksTo(DTo, DFrom, M2, M2): returns a mask whose
LowerHalf
is the first argument and whoseUpperHalf
is the second argument;M2
isMask<Half<DFrom>>
;DTo
isRepartition<TTo, DFrom>
.OrderedDemote2MasksTo requires that
sizeof(TFromD<DTo>) == sizeof(TFromD<DFrom>) * 2
be true.OrderedDemote2MasksTo(d_to, d_from, a, b)
is equivalent toMaskFromVec(BitCast(d_to, OrderedDemote2To(di_to, va, vb)))
, whereva
isBitCast(di_from, MaskFromVec(d_from, a))
,vb
isBitCast(di_from, MaskFromVec(d_from, b))
,di_to
isRebindToSigned<DTo>()
, anddi_from
isRebindToSigned<DFrom>()
, butOrderedDemote2MasksTo(d_to, d_from, a, b)
is more efficient on some targets.OrderedDemote2MasksTo is only available if
HWY_TARGET != HWY_SCALAR
is true.
Combine mask
M2 LowerHalfOfMask(D d, M m): returns the lower half of mask
m
, whereM
isMFromD<Twice<D>>
andM2
isMFromD<D>
.LowerHalfOfMask(d, m)
is equivalent toMaskFromVec(LowerHalf(d, VecFromMask(d, m)))
, butLowerHalfOfMask(d, m)
is more efficient on some targets.M2 UpperHalfOfMask(D d, M m): returns the upper half of mask
m
, whereM
isMFromD<Twice<D>>
andM2
isMFromD<D>
.UpperHalfOfMask(d, m)
is equivalent toMaskFromVec(UpperHalf(d, VecFromMask(d, m)))
, butUpperHalfOfMask(d, m)
is more efficient on some targets.UpperHalfOfMask is only available if
HWY_TARGET != HWY_SCALAR
is true.M CombineMasks(D, M2, M2): returns a mask whose
UpperHalf
is the first argument and whoseLowerHalf
is the second argument;M2
isMask<Half<D>>
.CombineMasks(d, hi, lo)
is equivalent toMaskFromVec(d, Combine(d, VecFromMask(Half<D>(), hi), VecFromMask(Half<D>(), lo)))
, butCombineMasks(d, hi, lo)
is more efficient on some targets.CombineMasks is only available if
HWY_TARGET != HWY_SCALAR
is true.
Slide mask across blocks
M SlideMaskUpLanes(D d, M m, size_t N): Slides
m
upN
lanes.SlideMaskUpLanes(d, m, N)
is equivalent toMaskFromVec(SlideUpLanes(d, VecFromMask(d, m), N))
, butSlideMaskUpLanes(d, m, N)
is more efficient on some targets.The results of SlideMaskUpLanes is implementation-defined if
N >= Lanes(d)
.M SlideMaskDownLanes(D d, M m, size_t N): Slides
m
downN
lanes.SlideMaskDownLanes(d, m, N)
is equivalent toMaskFromVec(SlideDownLanes(d, VecFromMask(d, m), N))
, butSlideMaskDownLanes(d, m, N)
is more efficient on some targets.The results of SlideMaskDownLanes is implementation-defined if
N >= Lanes(d)
.M SlideMask1Up(D d, M m): Slides
m
up 1 lane.SlideMask1Up(d, m)
is equivalent toMaskFromVec(Slide1Up(d, VecFromMask(d, m)))
, butSlideMask1Up(d, m)
is more efficient on some targets.M SlideMask1Down(D d, M m): Slides
m
down 1 lane.SlideMask1Down(d, m)
is equivalent toMaskFromVec(Slide1Down(d, VecFromMask(d, m)))
, butSlideMask1Down(d, m)
is more efficient on some targets.
Test mask
- bool AllTrue(D, M m): returns whether all m[i] are true.
- bool AllFalse(D, M m): returns whether all m[i] are false.
- size_t CountTrue(D, M m): returns how many of m[i] are true [0, N]. This is typically more expensive than AllTrue/False.
- intptr_t FindFirstTrue(D, M m): returns the index of the first (i.e. lowest index) m[i] that is true, or -1 if none are. See the sketch after this list.
- size_t FindKnownFirstTrue(D, M m): returns the index of the first (i.e. lowest index) m[i] that is true. Requires !AllFalse(d, m), otherwise results are undefined. This is typically more efficient than FindFirstTrue.
- intptr_t FindLastTrue(D, M m): returns the index of the last (i.e. highest index) m[i] that is true, or -1 if none are.
- size_t FindKnownLastTrue(D, M m): returns the index of the last (i.e. highest index) m[i] that is true. Requires !AllFalse(d, m), otherwise results are undefined. This is typically more efficient than FindLastTrue.
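A sketch of a vectorized search using Eq and FindFirstTrue; IndexOf is a hypothetical name and remainder handling is omitted for brevity.

namespace hn = hwy::HWY_NAMESPACE;

// Returns the index of the first element equal to key, or -1 if not found.
HWY_ATTR intptr_t IndexOf(const int32_t* HWY_RESTRICT p, size_t count,
                          int32_t key) {
  const hn::ScalableTag<int32_t> d;
  const size_t N = hn::Lanes(d);
  const auto vkey = hn::Set(d, key);
  for (size_t i = 0; i + N <= count; i += N) {
    const intptr_t pos =
        hn::FindFirstTrue(d, hn::Eq(hn::LoadU(d, p + i), vkey));
    if (pos >= 0) return static_cast<intptr_t>(i) + pos;
  }
  return -1;  // remaining count % N elements would be checked separately
}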
Ternary operator for masks
For IfThen*
, masks must adhere to the invariant established by
MaskFromVec
: false is zero, true has all bits set:
V IfThenElse(M mask, V yes, V no): returns
mask[i] ? yes[i] : no[i]
.V IfThenElseZero(M mask, V yes): returns
mask[i] ? yes[i] : 0
.V IfThenZeroElse(M mask, V no): returns
mask[i] ? 0 : no[i]
.V IfVecThenElse(V mask, V yes, V no): equivalent to and possibly faster than
IfVecThenElse(MaskFromVec(mask), yes, no)
. The result is implementation-defined ifmask[i]
is neither zero nor all bits set.
Logical mask
M Not(M m): returns mask of elements indicating whether the input mask element was false.
M And(M a, M b): returns mask of elements indicating whether both input mask elements were true.
M AndNot(M not_a, M b): returns mask of elements indicating whether
not_a
is false andb
is true.M Or(M a, M b): returns mask of elements indicating whether either input mask element was true.
M Xor(M a, M b): returns mask of elements indicating whether exactly one input mask element was true.
M ExclusiveNeither(M a, M b): returns mask of elements indicating
a
is false andb
is false. Undefined if both are true. We choose not to provide NotOr/NotXor because x86 and SVE only define one of these operations. This op is for situations where the inputs are known to be mutually exclusive.M SetOnlyFirst(M m): If none of
m[i]
are true, returns all-false. Otherwise, only lanek
is true, wherek
is equal toFindKnownFirstTrue(m)
. In other words, sets to false any lanes with index greater than the first true lane, if it exists.M SetBeforeFirst(M m): If none of
m[i]
are true, returns all-true. Otherwise, returns mask with the firstk
lanes true and all remaining lanes false, wherek
is equal toFindKnownFirstTrue(m)
. In other words, if at least one ofm[i]
is true, sets to true any lanes with index less than the first true lane and all remaining lanes to false.M SetAtOrBeforeFirst(M m): equivalent to
Or(SetBeforeFirst(m), SetOnlyFirst(m))
, butSetAtOrBeforeFirst(m)
is usually more efficient thanOr(SetBeforeFirst(m), SetOnlyFirst(m))
.M SetAtOrAfterFirst(M m): equivalent to
Not(SetBeforeFirst(m))
.
Compress
V Compress(V v, M m): returns
r
such thatr[n]
isv[i]
, withi
the n-th lane index (starting from 0) wherem[i]
is true. Compacts lanes whose mask is true into the lower lanes. For targets and lane typeT
whereCompressIsPartition<T>::value
is true, the upper lanes are those whose mask is false (thusCompress
corresponds to partitioning according to the mask). Otherwise, the upper lanes are implementation-defined. Potentially slow with 8 and 16-bit lanes. Use this form when the input is already a mask, e.g. returned by a comparison.V CompressNot(V v, M m): equivalent to
Compress(v, Not(m))
but possibly faster ifCompressIsPartition<T>::value
is true.V
:u64
V CompressBlocksNot(V v, M m): equivalent toCompressNot(v, m)
whenm
is structured as adjacent pairs (both true or false), e.g. as returned byLt128
. This is a no-op for 128 bit vectors. Unavailable ifHWY_TARGET == HWY_SCALAR
.size_t CompressStore(V v, M m, D d, T* p): writes lanes whose mask
m
is true intop
, starting from lane 0. ReturnsCountTrue(d, m)
, the number of valid lanes. May be implemented asCompress
followed byStoreU
; lanes after the valid ones may still be overwritten! Potentially slow with 8 and 16-bit lanes.size_t CompressBlendedStore(V v, M m, D d, T* p): writes only lanes whose mask
m
is true intop
, starting from lane 0. ReturnsCountTrue(d, m)
, the number of lanes written. Does not modify subsequent lanes, but there is no guarantee of atomicity because this may be implemented asCompress, LoadU, IfThenElse(FirstN), StoreU
.V CompressBits(V v, const uint8_t* HWY_RESTRICT bits): Equivalent to, but often faster than
Compress(v, LoadMaskBits(d, bits))
.bits
is as specified forLoadMaskBits
. If called multiple times, thebits
pointer passed to this function must also be markedHWY_RESTRICT
to avoid repeated work. Note that if the vector has less than 8 elements, incrementingbits
will not work as intended for packed bit arrays. As withCompress
,CompressIsPartition
indicates the mask=false lanes are moved to the upper lanes. Potentially slow with 8 and 16-bit lanes.size_t CompressBitsStore(V v, const uint8_t* HWY_RESTRICT bits, D d, T* p): combination of
CompressStore
andCompressBits
, see remarks there.
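As a usage sketch (ours, not from the Highway sources): copying only the positive elements of an array via CompressBlendedStore, with LoadN/FirstN handling the final partial vector; static dispatch is assumed.

```cpp
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// Writes the positive elements of in[0, num) to out, returning their count.
HWY_ATTR size_t CopyPositive(const float* HWY_RESTRICT in, size_t num,
                             float* HWY_RESTRICT out) {
  const hn::ScalableTag<float> d;
  const size_t N = hn::Lanes(d);
  size_t written = 0;
  for (size_t i = 0; i < num; i += N) {
    const size_t remaining = num - i;
    const auto v = hn::LoadN(d, in + i, remaining);  // zero-pads the remainder
    auto m = hn::Gt(v, hn::Zero(d));                 // positive lanes
    m = hn::And(m, hn::FirstN(d, remaining));        // exclude padding lanes
    written += hn::CompressBlendedStore(v, m, d, out + written);
  }
  return written;
}
```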
Expand
V Expand(V v, M m): returns r such that r[i] is zero where m[i] is false, and otherwise v[s], where s is the number of m[0, i) which are true. Scatters inputs in ascending index order to the lanes whose mask is true and zeros all other lanes. Potentially slow with 8 and 16-bit lanes.
V LoadExpand(M m, D d, const T* p): returns r such that r[i] is zero where m[i] is false, and otherwise p[s], where s is the number of m[0, i) which are true. May be implemented as LoadU followed by Expand. Potentially slow with 8 and 16-bit lanes.
Comparisons
These return a mask (see above) indicating whether the condition is true.
M operator==(V a, V b): returns a[i] == b[i]. Currently unavailable on SVE/RVV; use the equivalent Eq instead.
M operator!=(V a, V b): returns a[i] != b[i]. Currently unavailable on SVE/RVV; use the equivalent Ne instead.
M operator<(V a, V b): returns a[i] < b[i]. Currently unavailable on SVE/RVV; use the equivalent Lt instead.
M operator>(V a, V b): returns a[i] > b[i]. Currently unavailable on SVE/RVV; use the equivalent Gt instead.
M operator<=(V a, V b): returns a[i] <= b[i]. Currently unavailable on SVE/RVV; use the equivalent Le instead.
M operator>=(V a, V b): returns a[i] >= b[i]. Currently unavailable on SVE/RVV; use the equivalent Ge instead.
V: {i,f}; M IsNegative(V v): returns v[i] < 0. IsNegative(v) is equivalent to MaskFromVec(BroadcastSignBit(v)) or Lt(v, Zero(d)), but IsNegative(v) is more efficient on some targets.
V: {u,i}; M TestBit(V v, V bit): returns (v[i] & bit[i]) == bit[i]. bit[i] must have exactly one bit set.
V: u64; M Lt128(D, V a, V b): for each adjacent pair of 64-bit lanes (e.g. indices 1,0), returns whether a[1]:a[0] concatenated to an unsigned 128-bit integer (least significant bits in a[0]) is less than b[1]:b[0]. For each pair, the mask lanes are either both true or both false. Unavailable if HWY_TARGET == HWY_SCALAR.
V: u64; M Lt128Upper(D, V a, V b): for each adjacent pair of 64-bit lanes (e.g. indices 1,0), returns whether a[1] is less than b[1]. For each pair, the mask lanes are either both true or both false. This is useful for comparing 64-bit keys alongside 64-bit values. Only available if HWY_TARGET != HWY_SCALAR.
V: u64; M Eq128(D, V a, V b): for each adjacent pair of 64-bit lanes (e.g. indices 1,0), returns whether a[1]:a[0] concatenated to an unsigned 128-bit integer (least significant bits in a[0]) equals b[1]:b[0]. For each pair, the mask lanes are either both true or both false. Unavailable if HWY_TARGET == HWY_SCALAR.
V: u64; M Ne128(D, V a, V b): for each adjacent pair of 64-bit lanes (e.g. indices 1,0), returns whether a[1]:a[0] concatenated to an unsigned 128-bit integer (least significant bits in a[0]) differs from b[1]:b[0]. For each pair, the mask lanes are either both true or both false. Unavailable if HWY_TARGET == HWY_SCALAR.
V: u64; M Eq128Upper(D, V a, V b): for each adjacent pair of 64-bit lanes (e.g. indices 1,0), returns whether a[1] equals b[1]. For each pair, the mask lanes are either both true or both false. This is useful for comparing 64-bit keys alongside 64-bit values. Only available if HWY_TARGET != HWY_SCALAR.
V: u64; M Ne128Upper(D, V a, V b): for each adjacent pair of 64-bit lanes (e.g. indices 1,0), returns whether a[1] differs from b[1]. For each pair, the mask lanes are either both true or both false. This is useful for comparing 64-bit keys alongside 64-bit values. Only available if HWY_TARGET != HWY_SCALAR.
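A short sketch (ours) showing how comparison masks combine with CountTrue; it assumes num is a multiple of Lanes(d) and static dispatch:

```cpp
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// Counts elements of in[0, num) that lie within [lo, hi).
HWY_ATTR size_t CountInRange(const float* HWY_RESTRICT in, size_t num,
                             float lo, float hi) {
  const hn::ScalableTag<float> d;
  const size_t N = hn::Lanes(d);
  const auto vlo = hn::Set(d, lo);
  const auto vhi = hn::Set(d, hi);
  size_t count = 0;
  for (size_t i = 0; i < num; i += N) {
    const auto v = hn::LoadU(d, in + i);
    const auto m = hn::And(hn::Ge(v, vlo), hn::Lt(v, vhi));
    count += hn::CountTrue(d, m);
  }
  return count;
}
```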
Memory
Memory operands are little-endian, otherwise their order would depend on the lane configuration. Pointers are the addresses of N consecutive T values, either aligned (address is a multiple of the vector size) or possibly unaligned (denoted p).
Even unaligned addresses must still be a multiple of sizeof(T), otherwise StoreU may crash on some platforms (e.g. RVV and Armv7). Note that C++ ensures automatic (stack) and dynamically allocated (via new or malloc) variables of type T are aligned to sizeof(T), hence such addresses are suitable for StoreU. However, casting pointers to char* and adding arbitrary offsets (not a multiple of sizeof(T)) can violate this requirement.
Note: computations with low arithmetic intensity (FLOP/s per memory traffic bytes), e.g. dot product, can be 1.5 times as fast when the memory operands are aligned to the vector size. An unaligned access may require two load ports.
Load
Vec<D> Load(D, const T* aligned): returns aligned[i]. May fault if the pointer is not aligned to the vector size (using aligned_allocator.h is safe). Using this whenever possible improves codegen on SSSE3/SSE4: unlike LoadU, Load can be fused into a memory operand, which reduces register pressure.
Requires only element-aligned vectors (e.g. from malloc/std::vector, or aligned memory at indices which are not a multiple of the vector length):
Vec<D> LoadU(D, const T* p): returns p[i].
Vec<D> LoadDup128(D, const T* p): returns one 128-bit block loaded from p and broadcasted into all 128-bit block[s]. This may be faster than broadcasting single values, and is more convenient than preparing constants for the actual vector length. Only available if HWY_TARGET != HWY_SCALAR.
Vec<D> MaskedLoadOr(V no, M mask, D, const T* p): returns mask[i] ? p[i] : no[i]. May fault even where mask is false #if HWY_MEM_OPS_MIGHT_FAULT. If p is aligned, faults cannot happen unless the entire vector is inaccessible. Assuming no faults, this is equivalent to, and potentially more efficient than, IfThenElse(mask, LoadU(D(), p), no).
Vec<D> MaskedLoad(M mask, D d, const T* p): equivalent to MaskedLoadOr(Zero(d), mask, d, p), but potentially slightly more efficient.
Vec<D> LoadN(D d, const T* p, size_t max_lanes_to_load): loads HWY_MIN(Lanes(d), max_lanes_to_load) lanes from p to the first (lowest-index) lanes of the result vector and zeroes out the remaining lanes. LoadN does not fault if all of the elements in [p, p + max_lanes_to_load) are accessible, even if HWY_MEM_OPS_MIGHT_FAULT is 1 or max_lanes_to_load < Lanes(d) is true.
Vec<D> LoadNOr(V no, D d, const T* p, size_t max_lanes_to_load): loads HWY_MIN(Lanes(d), max_lanes_to_load) lanes from p to the first (lowest-index) lanes of the result vector and fills the remaining lanes with no. Like LoadN, this does not fault.
Store
void Store(Vec<D> v, D, T* aligned): copies v[i] into aligned[i], which must be aligned to the vector size. Writes exactly N * sizeof(T) bytes.
void StoreU(Vec<D> v, D, T* p): as Store, but the alignment requirement is relaxed to element-aligned (multiple of sizeof(T)).
void BlendedStore(Vec<D> v, M m, D d, T* p): as StoreU, but only updates p where m is true. May fault even where mask is false #if HWY_MEM_OPS_MIGHT_FAULT. If p is aligned, faults cannot happen unless the entire vector is inaccessible. Equivalent to, and potentially more efficient than, StoreU(IfThenElse(m, v, LoadU(d, p)), d, p). "Blended" indicates this may not be atomic; other threads must not concurrently update [p, p + Lanes(d)) without synchronization.
void SafeFillN(size_t num, T value, D d, T* HWY_RESTRICT to): sets to[0, num) to value. If num exceeds Lanes(d), the behavior is target-dependent (either filling all, or no more than one vector). Potentially more efficient than a scalar loop, but will not fault, unlike BlendedStore. No alignment requirement. Potentially non-atomic, like BlendedStore.
void SafeCopyN(size_t num, D d, const T* HWY_RESTRICT from, T* HWY_RESTRICT to): copies from[0, num) to to. If num exceeds Lanes(d), the behavior is target-dependent (either copying all, or no more than one vector). Potentially more efficient than a scalar loop, but will not fault, unlike BlendedStore. No alignment requirement. Potentially non-atomic, like BlendedStore.
void StoreN(Vec<D> v, D d, T* HWY_RESTRICT p, size_t max_lanes_to_store): stores the first (lowest-index) HWY_MIN(Lanes(d), max_lanes_to_store) lanes of v to p. StoreN does not modify any memory past p + HWY_MIN(Lanes(d), max_lanes_to_store) - 1.
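A sketch (ours) of the common strip-mining pattern: full vectors via LoadU/StoreU, then LoadN/StoreN for the final partial vector so that no memory past the end is touched; static dispatch is assumed.

```cpp
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// out[i] = in[i] * factor for i in [0, num).
HWY_ATTR void Scale(const float* HWY_RESTRICT in, float* HWY_RESTRICT out,
                    size_t num, float factor) {
  const hn::ScalableTag<float> d;
  const size_t N = hn::Lanes(d);
  const auto vfactor = hn::Set(d, factor);
  size_t i = 0;
  for (; i + N <= num; i += N) {
    hn::StoreU(hn::Mul(hn::LoadU(d, in + i), vfactor), d, out + i);
  }
  const size_t remaining = num - i;  // 0 <= remaining < N
  if (remaining != 0) {
    const auto v = hn::LoadN(d, in + i, remaining);
    hn::StoreN(hn::Mul(v, vfactor), d, out + i, remaining);
  }
}
```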
Interleaved
void LoadInterleaved2(D, const T* p, Vec<D>& v0, Vec<D>& v1): equivalent to LoadU into v0, v1 followed by shuffling, such that v0[0] == p[0], v1[0] == p[1].
void LoadInterleaved3(D, const T* p, Vec<D>& v0, Vec<D>& v1, Vec<D>& v2): as above, but for three vectors (e.g. RGB samples).
void LoadInterleaved4(D, const T* p, Vec<D>& v0, Vec<D>& v1, Vec<D>& v2, Vec<D>& v3): as above, but for four vectors (e.g. RGBA).
void StoreInterleaved2(Vec<D> v0, Vec<D> v1, D, T* p): equivalent to shuffling v0, v1 followed by two StoreU(), such that p[0] == v0[0], p[1] == v1[0].
void StoreInterleaved3(Vec<D> v0, Vec<D> v1, Vec<D> v2, D, T* p): as above, but for three vectors (e.g. RGB samples).
void StoreInterleaved4(Vec<D> v0, Vec<D> v1, Vec<D> v2, Vec<D> v3, D, T* p): as above, but for four vectors (e.g. RGBA samples).
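For example (a sketch of ours, not from Highway), swapping the R and B channels of interleaved RGB pixels; it assumes num_pixels is a multiple of Lanes(d) and static dispatch.

```cpp
#include <stdint.h>

#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// Converts packed RGB to BGR by swapping the first and third channel.
HWY_ATTR void RgbToBgr(const uint8_t* HWY_RESTRICT rgb, size_t num_pixels,
                       uint8_t* HWY_RESTRICT bgr) {
  const hn::ScalableTag<uint8_t> d;
  const size_t N = hn::Lanes(d);
  for (size_t i = 0; i < num_pixels; i += N) {
    auto r = hn::Undefined(d), g = hn::Undefined(d), b = hn::Undefined(d);
    hn::LoadInterleaved3(d, rgb + 3 * i, r, g, b);
    hn::StoreInterleaved3(b, g, r, d, bgr + 3 * i);  // note the swapped order
  }
}
```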
Scatter/Gather
Note: Offsets/indices are of type VI = Vec<RebindToSigned<D>>
and need not be unique. The results are implementation-defined for
negative offsets, because behavior differs between x86 and RVV (signed
vs. unsigned).
Note: Where possible, applications should
Load/Store/TableLookup*
entire vectors, which is much faster than
Scatter/Gather
. Otherwise, code of the form
dst[tbl[i]] = F(src[i])
should when possible be transformed to
dst[i] = F(src[tbl[i]])
because Scatter
may be more expensive
than Gather
.
Note: We provide *Offset
functions for the convenience of users
that have actual byte offsets. However, the preferred interface is
*Index
, which takes indices. To reduce the number of ops, we do not
intend to add Masked*
ops for offsets. If you have offsets, you can
convert them to indices via ShiftRight
.
D: {u,i,f}{32,64}; void ScatterOffset(Vec<D> v, D, T* base, VI offsets): stores v[i] to the base address plus byte offsets[i].
D: {u,i,f}{32,64}; void ScatterIndex(Vec<D> v, D, T* base, VI indices): stores v[i] to base[indices[i]].
D: {u,i,f}{32,64}; void ScatterIndexN(Vec<D> v, D, T* base, VI indices, size_t max_lanes_to_store): stores HWY_MIN(Lanes(d), max_lanes_to_store) lanes v[i] to base[indices[i]].
D: {u,i,f}{32,64}; void MaskedScatterIndex(Vec<D> v, M m, D, T* base, VI indices): stores v[i] to base[indices[i]] if mask[i] is true. Does not fault for lanes whose mask is false.
D: {u,i,f}{32,64}; Vec<D> GatherOffset(D, const T* base, VI offsets): returns elements of base selected by byte offsets[i].
D: {u,i,f}{32,64}; Vec<D> GatherIndex(D, const T* base, VI indices): returns vector of base[indices[i]].
D: {u,i,f}{32,64}; Vec<D> GatherIndexN(D, const T* base, VI indices, size_t max_lanes_to_load): loads HWY_MIN(Lanes(d), max_lanes_to_load) lanes of base[indices[i]] to the first (lowest-index) lanes of the result vector and zeroes out the remaining lanes.
D: {u,i,f}{32,64}; Vec<D> MaskedGatherIndexOr(V no, M mask, D d, const T* base, VI indices): returns vector of base[indices[i]] where mask[i] is true, otherwise no[i]. Does not fault for lanes whose mask is false. This is equivalent to, and potentially more efficient than, IfThenElse(mask, GatherIndex(d, base, indices), no).
D: {u,i,f}{32,64}; Vec<D> MaskedGatherIndex(M mask, D d, const T* base, VI indices): equivalent to MaskedGatherIndexOr(Zero(d), mask, d, base, indices). Use this when the desired default value is zero; it may be more efficient on some targets, and on others merely requires generating a zero constant.
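A gather sketch (ours): computing out[i] = table[idx[i]], i.e. the dst[i] = F(src[tbl[i]]) form recommended in the note above; assumes num is a multiple of Lanes(d), non-negative in-range indices, and static dispatch.

```cpp
#include <stdint.h>

#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

HWY_ATTR void GatherLookup(const float* HWY_RESTRICT table,
                           const int32_t* HWY_RESTRICT idx, size_t num,
                           float* HWY_RESTRICT out) {
  const hn::ScalableTag<float> d;
  const hn::RebindToSigned<decltype(d)> di;  // i32 indices, same lane count
  const size_t N = hn::Lanes(d);
  for (size_t i = 0; i < num; i += N) {
    const auto indices = hn::LoadU(di, idx + i);
    hn::StoreU(hn::GatherIndex(d, table, indices), d, out + i);
  }
}
```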
Cache control
All functions except Stream are defined in cache_control.h.
void Stream(Vec<D> a, D d, T* aligned): copies a[i] into aligned[i] with non-temporal hint if available (useful for write-only data; avoids cache pollution). May be implemented using a CPU-internal buffer. To avoid partial flushes and unpredictable interactions with atomics (for example, see Intel SDM Vol 4, Sec. 8.1.2.2), call this consecutively for an entire cache line (typically 64 bytes, aligned to its size). Each call may write a multiple of HWY_STREAM_MULTIPLE bytes, which can exceed Lanes(d) * sizeof(T). The new contents of aligned may not be visible until FlushStream is called.
void FlushStream(): ensures values written by previous Stream calls are visible on the current core. This is NOT sufficient for synchronizing across cores; when Stream outputs are to be consumed by other core(s), the producer must publish availability (e.g. via mutex or atomic_flag) after FlushStream.
.void FlushCacheline(const void* p): invalidates and flushes the cache line containing “p”, if possible.
void Prefetch(const T* p): optionally begins loading the cache line containing “p” to reduce latency of subsequent actual loads.
void Pause(): when called inside a spin-loop, may reduce power consumption.
Type conversion
Vec<D> BitCast(D, V): returns the bits of V reinterpreted as type Vec<D>.
Vec<D> ResizeBitCast(D, V): resizes V to a vector of Lanes(D()) * sizeof(TFromD<D>) bytes, and then returns the bits of the resized vector reinterpreted as type Vec<D>. If Vec<D> is a larger vector than V, then the contents of any bytes past the first Lanes(DFromV<V>()) * sizeof(TFromV<V>) bytes of the result vector are unspecified.
Vec<DTo> ZeroExtendResizeBitCast(DTo, DFrom, V): resizes V, which is a vector of type Vec<DFrom>, to a vector of Lanes(DTo()) * sizeof(TFromD<DTo>) bytes, and then returns the bits of the resized vector reinterpreted as type Vec<DTo>. If Lanes(DTo()) * sizeof(TFromD<DTo>) is greater than Lanes(DFrom()) * sizeof(TFromD<DFrom>), then any bytes past the first Lanes(DFrom()) * sizeof(TFromD<DFrom>) bytes of the result vector are zeroed out.
V, V8: (u32, u8); V8 U8FromU32(V): special-case u32 to u8 conversion when all lanes of V are already clamped to [0, 256).
D: {f}; Vec<D> ConvertTo(D, V): converts a signed/unsigned integer value to same-sized floating point.
V: {f}; Vec<D> ConvertTo(D, V): rounds floating point towards zero and converts the value to same-sized signed/unsigned integer. Returns the closest representable value if the input exceeds the destination range.
V: {f}; Vec<D> ConvertInRangeTo(D, V): rounds floating point towards zero and converts the value to same-sized signed/unsigned integer. Returns an implementation-defined value if the input exceeds the destination range.
V: f; Ret: Vec<RebindToSigned<DFromV<V>>>; Ret NearestInt(V a): returns the integer nearest to a[i]; results are undefined for NaN.
V: f; Ret: Vec<RebindToSigned<DFromV<V>>>; Ret CeilInt(V a): equivalent to ConvertTo(RebindToSigned<DFromV<V>>(), Ceil(a)), but CeilInt(a) is more efficient on some targets, including SSE2, SSSE3, and AArch64 NEON.
V: f; Ret: Vec<RebindToSigned<DFromV<V>>>; Ret FloorInt(V a): equivalent to ConvertTo(RebindToSigned<DFromV<V>>(), Floor(a)), but FloorInt(a) is more efficient on some targets, including SSE2, SSSE3, and AArch64 NEON.
D: i32, V: f64; Vec<D> DemoteToNearestInt(D d, V v): converts v[i] to TFromD<D>, rounding to nearest (with ties to even). DemoteToNearestInt(d, v) is equivalent to DemoteTo(d, Round(v)), but DemoteToNearestInt(d, v) is more efficient on some targets, including x86 and RVV.
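A conversion sketch (ours): float to int32 with round-to-nearest via NearestInt; assumes num is a multiple of Lanes(df), finite inputs, and static dispatch.

```cpp
#include <stdint.h>

#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

HWY_ATTR void RoundToInt(const float* HWY_RESTRICT in, size_t num,
                         int32_t* HWY_RESTRICT out) {
  const hn::ScalableTag<float> df;
  const hn::RebindToSigned<decltype(df)> di;  // i32 with the same lane count
  const size_t N = hn::Lanes(df);
  for (size_t i = 0; i < num; i += N) {
    hn::StoreU(hn::NearestInt(hn::LoadU(df, in + i)), di, out + i);
  }
}
```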
Single vector demotion
These functions demote a full vector (or parts thereof) into a vector of half the size. Use Rebind<MakeNarrow<T>, D> or Half<RepartitionToNarrow<D>> to obtain the D that describes the return type.
V, D: (u64, u32), (u64, u16), (u64, u8), (u32, u16), (u32, u8), (u16, u8); Vec<D> TruncateTo(D, V v): returns v[i] truncated to the smaller type indicated by T = TFromD<D>, with the same result as if the more-significant input bits that do not fit in T had been zero. Example: ScalableTag<uint32_t> du32; Rebind<uint8_t, decltype(du32)> du8; TruncateTo(du8, Set(du32, 0xF08F)) is the same as Set(du8, 0x8F).
V, D: (i16, i8), (i32, i8), (i64, i8), (i32, i16), (i64, i16), (i64, i32), (u16, i8), (u32, i8), (u64, i8), (u32, i16), (u64, i16), (u64, i32), (i16, u8), (i32, u8), (i64, u8), (i32, u16), (i64, u16), (i64, u32), (u16, u8), (u32, u8), (u64, u8), (u32, u16), (u64, u16), (u64, u32), (f64, f32); Vec<D> DemoteTo(D, V v): returns v[i] after packing with signed/unsigned saturation to MakeNarrow<T>.
V, D: f64, {u,i}32; Vec<D> DemoteTo(D, V v): rounds floating point towards zero and converts the value to 32-bit integers. Returns the closest representable value if the input exceeds the destination range.
V, D: f64, {u,i}32; Vec<D> DemoteInRangeTo(D, V v): rounds floating point towards zero and converts the value to 32-bit integers. Returns an implementation-defined value if the input exceeds the destination range.
V, D: {u,i}64, f32; Vec<D> DemoteTo(D, V v): converts 64-bit integer to float.
V, D: (f32, f16), (f64, f16), (f32, bf16); Vec<D> DemoteTo(D, V v): narrows float to half (for bf16, it is unspecified whether this truncates or rounds).
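A demotion sketch (ours): packing i32 to u8 with saturation via DemoteTo. The Rebind keeps the lane count, so each store writes Lanes(d32) bytes; assumes num is a multiple of Lanes(d32) and static dispatch.

```cpp
#include <stdint.h>

#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

HWY_ATTR void PackToBytes(const int32_t* HWY_RESTRICT in, size_t num,
                          uint8_t* HWY_RESTRICT out) {
  const hn::ScalableTag<int32_t> d32;
  const hn::Rebind<uint8_t, decltype(d32)> d8;  // u8 lanes, same lane count
  const size_t N = hn::Lanes(d32);
  for (size_t i = 0; i < num; i += N) {
    hn::StoreU(hn::DemoteTo(d8, hn::LoadU(d32, in + i)), d8, out + i);
  }
}
```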
Single vector promotion
These functions promote a half vector to a full vector. To obtain halves, use LowerHalf or UpperHalf, or load them using a half-sized D.
Unsigned V to wider signed/unsigned D; signed to wider signed, f16 to f32, f16 to f64, bf16 to f32, f32 to f64: Vec<D> PromoteTo(D, V part): returns part[i] zero- or sign-extended to the integer type MakeWide<T>, or widened to the floating-point type MakeFloat<MakeWide<T>>.
{u,i}32 to f64: Vec<D> PromoteTo(D, V part): returns part[i] widened to double.
f32 to i64 or u64: Vec<D> PromoteTo(D, V part): rounds part[i] towards zero and converts the rounded value to a 64-bit signed or unsigned integer. Returns the closest representable value if the input exceeds the destination range.
f32 to i64 or u64: Vec<D> PromoteInRangeTo(D, V part): rounds part[i] towards zero and converts the rounded value to a 64-bit signed or unsigned integer. Returns an implementation-defined value if the input exceeds the destination range.
The following may be more convenient or efficient than also calling LowerHalf / UpperHalf:
Unsigned V to wider signed/unsigned D; signed to wider signed, f16 to f32, bf16 to f32, f32 to f64: Vec<D> PromoteLowerTo(D, V v): returns v[i] widened to MakeWide<T>, for i in [0, Lanes(D())). Note that V has twice as many lanes as D and the return value.
{u,i}32 to f64: Vec<D> PromoteLowerTo(D, V v): returns v[i] widened to double, for i in [0, Lanes(D())). Note that V has twice as many lanes as D and the return value.
f32 to i64 or u64: Vec<D> PromoteLowerTo(D, V v): rounds v[i] towards zero and converts the rounded value to a 64-bit signed or unsigned integer, for i in [0, Lanes(D())). Note that V has twice as many lanes as D and the return value.
f32 to i64 or u64: Vec<D> PromoteInRangeLowerTo(D, V v): rounds v[i] towards zero and converts the rounded value to a 64-bit signed or unsigned integer, for i in [0, Lanes(D())). Note that V has twice as many lanes as D and the return value. Returns an implementation-defined value if the input exceeds the destination range.
Unsigned V to wider signed/unsigned D; signed to wider signed, f16 to f32, bf16 to f32, f32 to f64: Vec<D> PromoteUpperTo(D, V v): returns v[i] widened to MakeWide<T>, for i in [Lanes(D()), 2 * Lanes(D())). Note that V has twice as many lanes as D and the return value. Only available if HWY_TARGET != HWY_SCALAR.
{u,i}32 to f64: Vec<D> PromoteUpperTo(D, V v): returns v[i] widened to double, for i in [Lanes(D()), 2 * Lanes(D())). Note that V has twice as many lanes as D and the return value. Only available if HWY_TARGET != HWY_SCALAR.
f32 to i64 or u64: Vec<D> PromoteUpperTo(D, V v): rounds v[i] towards zero and converts the rounded value to a 64-bit signed or unsigned integer, for i in [Lanes(D()), 2 * Lanes(D())). Note that V has twice as many lanes as D and the return value. Only available if HWY_TARGET != HWY_SCALAR.
f32 to i64 or u64: Vec<D> PromoteInRangeUpperTo(D, V v): rounds v[i] towards zero and converts the rounded value to a 64-bit signed or unsigned integer, for i in [Lanes(D()), 2 * Lanes(D())). Note that V has twice as many lanes as D and the return value. Returns an implementation-defined value if the input exceeds the destination range. Only available if HWY_TARGET != HWY_SCALAR.
The following may be more convenient or efficient than also calling ConcatEven or ConcatOdd followed by PromoteLowerTo:
V: {u,i}{8,16,32},f{16,32},bf16; D: RepartitionToWide<DFromV<V>>; Vec<D> PromoteEvenTo(D, V v): promotes the even lanes of v to TFromD<D>. Note that V has twice as many lanes as D and the return value. PromoteEvenTo(d, v) is equivalent to, but potentially more efficient than, PromoteLowerTo(d, ConcatEven(Repartition<TFromV<V>, D>(), v, v)).
V: {u,i}{8,16,32},f{16,32},bf16; D: RepartitionToWide<DFromV<V>>; Vec<D> PromoteOddTo(D, V v): promotes the odd lanes of v to TFromD<D>. Note that V has twice as many lanes as D and the return value. PromoteOddTo(d, v) is equivalent to, but potentially more efficient than, PromoteLowerTo(d, ConcatOdd(Repartition<TFromV<V>, D>(), v, v)). Only available if HWY_TARGET != HWY_SCALAR.
V: f32; D: {u,i}64; Vec<D> PromoteInRangeEvenTo(D, V v): promotes the even lanes of v to TFromD<D>. Note that V has twice as many lanes as D and the return value. PromoteInRangeEvenTo(d, v) is equivalent to, but potentially more efficient than, PromoteInRangeLowerTo(d, ConcatEven(Repartition<TFromV<V>, D>(), v, v)).
V: f32; D: {u,i}64; Vec<D> PromoteInRangeOddTo(D, V v): promotes the odd lanes of v to TFromD<D>. Note that V has twice as many lanes as D and the return value. PromoteInRangeOddTo(d, v) is equivalent to, but potentially more efficient than, PromoteInRangeLowerTo(d, ConcatOdd(Repartition<TFromV<V>, D>(), v, v)).
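A promotion sketch (ours): widening u8 to u16 with PromoteLowerTo/PromoteUpperTo. Because d16 has half as many lanes as d8, each input vector yields two output vectors; assumes num is a multiple of Lanes(d8), static dispatch, and HWY_TARGET != HWY_SCALAR (required by PromoteUpperTo).

```cpp
#include <stdint.h>

#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

HWY_ATTR void WidenBytes(const uint8_t* HWY_RESTRICT in, size_t num,
                         uint16_t* HWY_RESTRICT out) {
  const hn::ScalableTag<uint8_t> d8;
  const hn::RepartitionToWide<decltype(d8)> d16;  // u16, half the lanes of d8
  const size_t N8 = hn::Lanes(d8);
  for (size_t i = 0; i < num; i += N8) {
    const auto v = hn::LoadU(d8, in + i);
    hn::StoreU(hn::PromoteLowerTo(d16, v), d16, out + i);
    hn::StoreU(hn::PromoteUpperTo(d16, v), d16, out + i + hn::Lanes(d16));
  }
}
```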
Two-vector demotion
V, D: (i16, i8), (i32, i16), (i64, i32), (u16, i8), (u32, i16), (u64, i32), (i16, u8), (i32, u16), (i64, u32), (u16, u8), (u32, u16), (u64, u32), (f32, bf16); Vec<D> ReorderDemote2To(D, V a, V b): as above, but converts two inputs, D and the output have twice as many lanes as V, and the output order is some permutation of the inputs. Only available if HWY_TARGET != HWY_SCALAR.
V, D: (i16, i8), (i32, i16), (i64, i32), (u16, i8), (u32, i16), (u64, i32), (i16, u8), (i32, u16), (i64, u32), (u16, u8), (u32, u16), (u64, u32), (f32, bf16); Vec<D> OrderedDemote2To(D d, V a, V b): as above, but converts two inputs, D and the output have twice as many lanes as V, and the output order is the result of demoting the elements of a in the lower half of the result followed by the result of demoting the elements of b in the upper half of the result. OrderedDemote2To(d, a, b) is equivalent to Combine(d, DemoteTo(Half<D>(), b), DemoteTo(Half<D>(), a)), but OrderedDemote2To(d, a, b) is typically more efficient. Only available if HWY_TARGET != HWY_SCALAR.
V, D: (u16, u8), (u32, u16), (u64, u32); Vec<D> OrderedTruncate2To(D d, V a, V b): as above, but converts two inputs, D and the output have twice as many lanes as V, and the output order is the result of truncating the elements of a in the lower half of the result followed by the result of truncating the elements of b in the upper half of the result. OrderedTruncate2To(d, a, b) is equivalent to Combine(d, TruncateTo(Half<D>(), b), TruncateTo(Half<D>(), a)), but OrderedTruncate2To(d, a, b) is typically more efficient. Only available if HWY_TARGET != HWY_SCALAR.
Combine
V2 LowerHalf([D, ] V): returns the lower half of the vector V. The optional D (provided for consistency with UpperHalf) is Half<DFromV<V>>.
All other ops in this section are only available if HWY_TARGET != HWY_SCALAR:
V2 UpperHalf(D, V): returns upper half of the vector V, where D is Half<DFromV<V>>.
V ZeroExtendVector(D, V2): returns vector whose UpperHalf is zero and whose LowerHalf is the argument; D is Twice<DFromV<V2>>.
V Combine(D, V2, V2): returns vector whose UpperHalf is the first argument and whose LowerHalf is the second argument; D is Twice<DFromV<V2>>.
Note: the following operations cross block boundaries, which is typically more expensive on AVX2/AVX-512 than per-block operations.
V ConcatLowerLower(D, V hi, V lo): returns the concatenation of the lower halves of hi and lo without splitting into blocks. D is DFromV<V>.
V ConcatUpperUpper(D, V hi, V lo): returns the concatenation of the upper halves of hi and lo without splitting into blocks. D is DFromV<V>.
V ConcatLowerUpper(D, V hi, V lo): returns the inner half of the concatenation of hi and lo without splitting into blocks. Useful for swapping the two blocks in 256-bit vectors. D is DFromV<V>.
V ConcatUpperLower(D, V hi, V lo): returns the outer quarters of the concatenation of hi and lo without splitting into blocks. Unlike the other variants, this does not incur a block-crossing penalty on AVX2/3. D is DFromV<V>.
V ConcatOdd(D, V hi, V lo): returns the concatenation of the odd lanes of hi and the odd lanes of lo.
V ConcatEven(D, V hi, V lo): returns the concatenation of the even lanes of hi and the even lanes of lo.
V InterleaveWholeLower([D, ] V a, V b): returns alternating lanes from the lower halves of a and b (a[0] in the least-significant lane). The optional D (provided for consistency with InterleaveWholeUpper) is DFromV<V>.
V InterleaveWholeUpper(D, V a, V b): returns alternating lanes from the upper halves of a and b (a[N/2] in the least-significant lane). D is DFromV<V>.
Blockwise
Note: if vectors are larger than 128 bits, the following operations split their operands into independently processed 128-bit blocks.
V Broadcast<int i>(V): returns individual blocks, each with lanes set to input_block[i], i = [0, 16/sizeof(T)).
All other ops in this section are only available if HWY_TARGET != HWY_SCALAR:
V: {u,i}; VI TableLookupBytes(V bytes, VI indices): returns bytes[indices[i]]. Uses byte lanes regardless of the actual vector types. Results are implementation-defined if indices[i] < 0 or indices[i] >= HWY_MIN(Lanes(DFromV<V>()), 16). VI are integers, possibly of a different type than those in V. The number of lanes in V and VI may differ, e.g. a full-length table vector loaded via LoadDup128, plus partial vector VI of 4-bit indices.
V: {u,i}; VI TableLookupBytesOr0(V bytes, VI indices): returns bytes[indices[i]], or 0 if indices[i] & 0x80. Uses byte lanes regardless of the actual vector types. Results are implementation-defined for indices[i] < 0 or in [HWY_MIN(Lanes(DFromV<V>()), 16), 0x80). The zeroing behavior has zero cost on x86 and Arm. For vectors of >= 256 bytes (can happen on SVE and RVV), this will set all lanes after the first 128 to 0. VI are integers, possibly of a different type than those in V. The number of lanes in V and VI may differ.
V: {u,i}64, VI: {u,i}8; V BitShuffle(V vals, VI indices): returns a vector with (vals[i] >> indices[i*8+j]) & 1 in bit j of r[i] for each j between 0 and 7. BitShuffle(vals, indices) zeroes out the upper 56 bits of r[i]. If indices[i*8+j] is less than 0 or greater than 63, bit j of r[i] is implementation-defined. VI must be either Vec<Repartition<int8_t, DFromV<V>>> or Vec<Repartition<uint8_t, DFromV<V>>>. BitShuffle(v, indices) is equivalent to the following loop (where N is equal to Lanes(DFromV<V>())):
  for (size_t i = 0; i < N; i++) {
    uint64_t shuf_result = 0;
    for (int j = 0; j < 8; j++) {
      shuf_result |= ((v[i] >> indices[i*8+j]) & 1) << j;
    }
    r[i] = shuf_result;
  }
Interleave
Ops in this section are only available if HWY_TARGET != HWY_SCALAR:
V InterleaveLower([D, ] V a, V b): returns blocks with alternating lanes from the lower halves of a and b (a[0] in the least-significant lane). The optional D (provided for consistency with InterleaveUpper) is DFromV<V>.
V InterleaveUpper(D, V a, V b): returns blocks with alternating lanes from the upper halves of a and b (a[N/2] in the least-significant lane). D is DFromV<V>.
V InterleaveEven([D, ] V a, V b): returns alternating lanes from the even lanes of a and b (a[0] in the least-significant lane, followed by b[0], followed by a[2], followed by b[2], and so on). The optional D (provided for consistency with InterleaveOdd) is DFromV<V>. Note that no lanes move across block boundaries. InterleaveEven(a, b) and InterleaveEven(d, a, b) are both equivalent to OddEven(DupEven(b), a), but InterleaveEven(a, b) is usually more efficient than OddEven(DupEven(b), a).
V InterleaveOdd(D, V a, V b): returns alternating lanes from the odd lanes of a and b (a[1] in the least-significant lane, followed by b[1], followed by a[3], followed by b[3], and so on). D is DFromV<V>. Note that no lanes move across block boundaries. InterleaveOdd(d, a, b) is equivalent to OddEven(b, DupOdd(a)), but InterleaveOdd(d, a, b) is usually more efficient than OddEven(b, DupOdd(a)).
Zip
Ret: MakeWide<T>; V: {u,i}{8,16,32}; Ret ZipLower([DW, ] V a, V b): returns the same bits as InterleaveLower, but repartitioned into double-width lanes (required in order to use this operation with scalars). The optional DW (provided for consistency with ZipUpper) is RepartitionToWide<DFromV<V>>.
Ret: MakeWide<T>; V: {u,i}{8,16,32}; Ret ZipUpper(DW, V a, V b): returns the same bits as InterleaveUpper, but repartitioned into double-width lanes (required in order to use this operation with scalars). DW is RepartitionToWide<DFromV<V>>. Only available if HWY_TARGET != HWY_SCALAR.
Shift within blocks
Ops in this section are only available if HWY_TARGET != HWY_SCALAR:
V: {u,i}; V ShiftLeftBytes<int>([D, ] V): returns the result of shifting independent blocks left by int bytes [1, 15]. The optional D (provided for consistency with ShiftRightBytes) is DFromV<V>.
V ShiftLeftLanes<int>([D, ] V): returns the result of shifting independent blocks left by int lanes. The optional D (provided for consistency with ShiftRightLanes) is DFromV<V>.
V: {u,i}; V ShiftRightBytes<int>(D, V): returns the result of shifting independent blocks right by int bytes [1, 15], shifting in zeros even for partial vectors. D is DFromV<V>.
V ShiftRightLanes<int>(D, V): returns the result of shifting independent blocks right by int lanes, shifting in zeros even for partial vectors. D is DFromV<V>.
V: {u,i}; V CombineShiftRightBytes<int>(D, V hi, V lo): returns a vector of blocks, each the result of shifting two concatenated blocks hi[i] || lo[i] right by int bytes [1, 16). D is DFromV<V>.
V CombineShiftRightLanes<int>(D, V hi, V lo): returns a vector of blocks, each the result of shifting two concatenated blocks hi[i] || lo[i] right by int lanes [1, 16/sizeof(T)). D is DFromV<V>.
Other fixed-pattern permutations within blocks
V OddEven(V a, V b): returns a vector whose odd lanes are taken from a and the even lanes from b.
V DupEven(V v): returns r, the result of copying even lanes to the next higher-indexed lane. For each even lane index i, r[i] == v[i] and r[i + 1] == v[i].
V DupOdd(V v): returns r, the result of copying odd lanes to the previous lower-indexed lane. For each odd lane index i, r[i] == v[i] and r[i - 1] == v[i]. Only available if HWY_TARGET != HWY_SCALAR.
Ops in this section are only available if HWY_TARGET != HWY_SCALAR:
V: {u,i,f}{32}; V Shuffle1032(V): returns blocks with 64-bit halves swapped.
V: {u,i,f}{32}; V Shuffle0321(V): returns blocks rotated right (toward the lower end) by 32 bits.
V: {u,i,f}{32}; V Shuffle2103(V): returns blocks rotated left (toward the upper end) by 32 bits.
The following are equivalent to Reverse2 or Reverse4, which should be used instead because they are more general:
V: {u,i,f}{32}; V Shuffle2301(V): returns blocks with 32-bit halves swapped inside 64-bit halves.
V: {u,i,f}{64}; V Shuffle01(V): returns blocks with 64-bit halves swapped.
V: {u,i,f}{32}; V Shuffle0123(V): returns blocks with lanes in reverse order.
Swizzle
Reverse
V Reverse(D, V a) returns a vector with lanes in reversed order (out[i] == a[Lanes(D()) - 1 - i]).
V ReverseBlocks(V v): returns a vector with blocks in reversed order.
The following ReverseN must not be called if Lanes(D()) < N:
V Reverse2(D, V a) returns a vector with each group of 2 contiguous lanes in reversed order (out[i] == a[i ^ 1]).
V Reverse4(D, V a) returns a vector with each group of 4 contiguous lanes in reversed order (out[i] == a[i ^ 3]).
V Reverse8(D, V a) returns a vector with each group of 8 contiguous lanes in reversed order (out[i] == a[i ^ 7]).
V: {u,i}{16,32,64}; V ReverseLaneBytes(V a) returns a vector where the bytes of each lane are swapped.
V: {u,i}; V ReverseBits(V a) returns a vector where the bits of each lane are reversed.
User-specified permutation across blocks
V TableLookupLanes(V a, unspecified) returns a vector of a[indices[i]], where unspecified is the return value of SetTableIndices(D, &indices[0]) or IndicesFromVec. The indices are not limited to blocks, hence this is slower than TableLookupBytes* on AVX2/AVX-512. Results are implementation-defined unless 0 <= indices[i] < Lanes(D()) and indices[i] <= LimitsMax<TFromD<RebindToUnsigned<D>>>(). Note that the latter condition is only a (potential) limitation for 8-bit lanes on the RVV target; otherwise, Lanes(D()) <= LimitsMax<..>(). indices are always integers, even if V is a floating-point type.
V TwoTablesLookupLanes(D d, V a, V b, unspecified) returns a vector of indices[i] < N ? a[indices[i]] : b[indices[i] - N], where unspecified is the return value of SetTableIndices(d, &indices[0]) or IndicesFromVec and N is equal to Lanes(d). The indices are not limited to blocks. Results are implementation-defined unless 0 <= indices[i] < 2 * Lanes(d) and indices[i] <= LimitsMax<TFromD<RebindToUnsigned<D>>>(). Note that the latter condition is only a (potential) limitation for 8-bit lanes on the RVV target; otherwise, Lanes(D()) <= LimitsMax<..>(). indices are always integers, even if V is a floating-point type.
V TwoTablesLookupLanes(V a, V b, unspecified) returns TwoTablesLookupLanes(DFromV<V>(), a, b, indices), see above. Note that the results of TwoTablesLookupLanes(d, a, b, indices) may differ from TwoTablesLookupLanes(a, b, indices) on RVV/SVE if Lanes(d) < Lanes(DFromV<V>()).
unspecified IndicesFromVec(D d, V idx) prepares for TableLookupLanes with integer indices in idx, which must be the same bit width as TFromD<D> and in the range [0, 2 * Lanes(d)), but need not be unique.
unspecified SetTableIndices(D d, TI* idx) prepares for TableLookupLanes by loading Lanes(d) integer indices from idx, which must be in the range [0, 2 * Lanes(d)) but need not be unique. The index type TI must be an integer of the same size as TFromD<D>.
V Per4LaneBlockShuffle<size_t kIdx3, size_t kIdx2, size_t kIdx1, size_t kIdx0>(V v) does a per-4-lane block shuffle of v if Lanes(DFromV<V>()) is greater than or equal to 4, or a shuffle of the full vector if Lanes(DFromV<V>()) is less than 4. kIdx0, kIdx1, kIdx2, and kIdx3 must all be between 0 and 3. Per4LaneBlockShuffle is equivalent to doing a TableLookupLanes with the following indices (but Per4LaneBlockShuffle is more efficient than TableLookupLanes on some platforms): {kIdx0, kIdx1, kIdx2, kIdx3, kIdx0+4, kIdx1+4, kIdx2+4, kIdx3+4, ...}. If Lanes(DFromV<V>()) is less than 4 and kIdx0 >= Lanes(DFromV<V>()) is true, Per4LaneBlockShuffle returns an unspecified value in the first lane of the result. Otherwise, Per4LaneBlockShuffle returns v[kIdx0] in the first lane of the result. If Lanes(DFromV<V>()) is equal to 2 and kIdx1 >= 2 is true, Per4LaneBlockShuffle returns an unspecified value in the second lane of the result. Otherwise, Per4LaneBlockShuffle returns v[kIdx1] in the second lane of the result.
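A sketch (ours) of the SetTableIndices/TableLookupLanes workflow, here building a rotate-by-one permutation at runtime; the indices must lie in [0, Lanes(d)) and have the same width as the lane type. Static dispatch is assumed.

```cpp
#include <stdint.h>

#include "hwy/aligned_allocator.h"
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// out[i] = in[(i + 1) % Lanes(d)] for one vector's worth of floats.
HWY_ATTR void RotateLanesByOne(const float* HWY_RESTRICT in,
                               float* HWY_RESTRICT out) {
  const hn::ScalableTag<float> d;
  const size_t N = hn::Lanes(d);
  auto idx = hwy::AllocateAligned<int32_t>(N);  // index type matches lane size
  for (size_t i = 0; i < N; ++i) idx[i] = static_cast<int32_t>((i + 1) % N);
  const auto table = hn::SetTableIndices(d, idx.get());
  hn::StoreU(hn::TableLookupLanes(hn::LoadU(d, in), table), d, out);
}
```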
Slide across blocks
V SlideUpLanes(D d, V v, size_t N): slides up v by N lanes. If N < Lanes(d) is true, returns a vector with the first (lowest-index) Lanes(d) - N lanes of v shifted up to the upper (highest-index) Lanes(d) - N lanes of the result vector and the first (lowest-index) N lanes of the result vector zeroed out. In other words, result[0..N-1] would be zero, result[N] = v[0], result[N+1] = v[1], and so on until result[Lanes(d)-1] = v[Lanes(d)-1-N]. The result of SlideUpLanes is implementation-defined if N >= Lanes(d).
V SlideDownLanes(D d, V v, size_t N): slides down v by N lanes. If N < Lanes(d) is true, returns a vector with the last (highest-index) Lanes(d) - N lanes of v shifted down to the first (lowest-index) Lanes(d) - N lanes of the result vector and the last (highest-index) N lanes of the result vector zeroed out. In other words, result[0] = v[N], result[1] = v[N + 1], and so on until result[Lanes(d)-1-N] = v[Lanes(d)-1], and then result[Lanes(d)-N..Lanes(d)-1] would be zero. The result of SlideDownLanes is implementation-defined if N >= Lanes(d).
V Slide1Up(D d, V v): slides up v by 1 lane. If Lanes(d) == 1 is true, returns Zero(d). If Lanes(d) > 1 is true, Slide1Up(d, v) is equivalent to SlideUpLanes(d, v, 1), but Slide1Up(d, v) is more efficient than SlideUpLanes(d, v, 1) on some platforms.
V Slide1Down(D d, V v): slides down v by 1 lane. If Lanes(d) == 1 is true, returns Zero(d). If Lanes(d) > 1 is true, Slide1Down(d, v) is equivalent to SlideDownLanes(d, v, 1), but Slide1Down(d, v) is more efficient than SlideDownLanes(d, v, 1) on some platforms.
V SlideUpBlocks<int kBlocks>(D d, V v) slides up v by kBlocks blocks. kBlocks must be between 0 and d.MaxBlocks() - 1. Equivalent to SlideUpLanes(d, v, kBlocks * (16 / sizeof(TFromD<D>))), but SlideUpBlocks<kBlocks>(d, v) is more efficient than SlideUpLanes(d, v, kBlocks * (16 / sizeof(TFromD<D>))) on some platforms. The result of SlideUpBlocks<kBlocks>(d, v) is implementation-defined if kBlocks >= Blocks(d) is true.
V SlideDownBlocks<int kBlocks>(D d, V v) slides down v by kBlocks blocks. kBlocks must be between 0 and d.MaxBlocks() - 1. Equivalent to SlideDownLanes(d, v, kBlocks * (16 / sizeof(TFromD<D>))), but SlideDownBlocks<kBlocks>(d, v) is more efficient than SlideDownLanes(d, v, kBlocks * (16 / sizeof(TFromD<D>))) on some platforms. The result of SlideDownBlocks<kBlocks>(d, v) is implementation-defined if kBlocks >= Blocks(d) is true.
Other fixed-pattern across blocks
V BroadcastLane<int kLane>(V v): returns a vector with all of the lanes set to v[kLane]. kLane must be in [0, MaxLanes(DFromV<V>())).
V BroadcastBlock<int kBlock>(V v): broadcasts the 16-byte block of vector v at index kBlock to all of the blocks of the result vector if Lanes(DFromV<V>()) * sizeof(TFromV<V>) > 16 is true. Otherwise, if Lanes(DFromV<V>()) * sizeof(TFromV<V>) <= 16 is true, returns v. kBlock must be in [0, DFromV<V>().MaxBlocks()).
V OddEvenBlocks(V a, V b): returns a vector whose odd blocks are taken from a and the even blocks from b. Returns b if the vector has no more than one block (i.e. is 128 bits or scalar).
V SwapAdjacentBlocks(V v): returns a vector where blocks of index 2*i and 2*i+1 are swapped. Results are undefined for vectors with less than two blocks; callers must first check that via Lanes. Only available if HWY_TARGET != HWY_SCALAR.
Reductions
Note: Horizontal operations (across lanes of the same vector) such as reductions are slower than normal SIMD operations and are typically used outside critical loops.
The following broadcast the result to all lanes. To obtain a scalar, you can call GetLane on the result, or instead use Reduce* below.
V SumOfLanes(D, V v): returns the sum of all lanes in each lane.
V MinOfLanes(D, V v): returns the minimum-valued lane in each lane.
V MaxOfLanes(D, V v): returns the maximum-valued lane in each lane.
The following are equivalent to GetLane(SumOfLanes(d, v)) etc., but potentially more efficient on some targets.
T ReduceSum(D, V v): returns the sum of all lanes.
T ReduceMin(D, V v): returns the minimum of all lanes.
T ReduceMax(D, V v): returns the maximum of all lanes.
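A reduction sketch (ours): a dot product that accumulates with MulAdd inside the loop and calls ReduceSum once at the end, per the note above; assumes num is a multiple of Lanes(d) and static dispatch.

```cpp
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

HWY_ATTR float DotProduct(const float* HWY_RESTRICT a,
                          const float* HWY_RESTRICT b, size_t num) {
  const hn::ScalableTag<float> d;
  const size_t N = hn::Lanes(d);
  auto sum = hn::Zero(d);
  for (size_t i = 0; i < num; i += N) {
    sum = hn::MulAdd(hn::LoadU(d, a + i), hn::LoadU(d, b + i), sum);
  }
  return hn::ReduceSum(d, sum);  // single horizontal reduction at the end
}
```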
Crypto
Ops in this section are only available if HWY_TARGET != HWY_SCALAR:
V: u8; V AESRound(V state, V round_key): one round of AES encryption: MixColumns(SubBytes(ShiftRows(state))) ^ round_key. This matches x86 AES-NI. The latency is independent of the input values.
V: u8; V AESLastRound(V state, V round_key): the last round of AES encryption: SubBytes(ShiftRows(state)) ^ round_key. This matches x86 AES-NI. The latency is independent of the input values.
V: u8; V AESRoundInv(V state, V round_key): one round of AES decryption using the AES Equivalent Inverse Cipher: InvMixColumns(InvShiftRows(InvSubBytes(state))) ^ round_key. This matches x86 AES-NI. The latency is independent of the input values.
V: u8; V AESLastRoundInv(V state, V round_key): the last round of AES decryption: InvShiftRows(InvSubBytes(state)) ^ round_key. This matches x86 AES-NI. The latency is independent of the input values.
V: u8; V AESInvMixColumns(V state): the InvMixColumns operation of the AES decryption algorithm. AESInvMixColumns is used in the key expansion step of the AES Equivalent Inverse Cipher algorithm. The latency is independent of the input values.
V: u8; V AESKeyGenAssist<uint8_t kRcon>(V v): AES key generation assist operation. The AESKeyGenAssist operation is equivalent to doing the following, which matches the behavior of the x86 AES-NI AESKEYGENASSIST instruction:
Applying the AES SubBytes operation to each byte of v.
Doing a TableLookupBytes operation on each 128-bit block of the result of the SubBytes(v) operation with the following indices (which are broadcast to each 128-bit block in the case of vectors with 32 or more lanes): {4, 5, 6, 7, 5, 6, 7, 4, 12, 13, 14, 15, 13, 14, 15, 12}
Doing a bitwise XOR operation with the following vector (where kRcon is the round constant that is the first template argument of the AESKeyGenAssist function and where the below vector is broadcast to each 128-bit block in the case of vectors with 32 or more lanes): {0, 0, 0, 0, kRcon, 0, 0, 0, 0, 0, 0, 0, kRcon, 0, 0, 0}
V: u64; V CLMulLower(V a, V b): carryless multiplication of the lower 64 bits of each 128-bit block into a 128-bit product. The latency is independent of the input values (assuming that is true of normal integer multiplication) so this can safely be used in crypto. Applications that wish to multiply upper with lower halves can Shuffle01 one of the operands; on x86 that is expected to be latency-neutral.
V: u64; V CLMulUpper(V a, V b): as CLMulLower, but multiplies the upper 64 bits of each 128-bit block.
Preprocessor macros
HWY_ALIGN: prefix for stack-allocated (i.e. automatic storage duration) arrays to ensure they have suitable alignment for Load()/Store(). This is specific to HWY_TARGET and should only be used inside HWY_NAMESPACE. Arrays should also only be used for partial (<= 128-bit) vectors, or LoadDup128, because full vectors may be too large for the stack and should be heap-allocated instead (see aligned_allocator.h). Example: HWY_ALIGN float lanes[4];
HWY_ALIGN_MAX: as HWY_ALIGN, but independent of HWY_TARGET and may be used outside HWY_NAMESPACE.
Advanced macros
Beware that these macros describe the current target being compiled.
Imagine a test (e.g. sort_test) with SIMD code that also uses dynamic
dispatch. There we must test the macros of the target we will call,
e.g. via hwy::HaveFloat64()
instead of HWY_HAVE_FLOAT64
, which
describes the current target.
HWY_IDE is 0 except when parsed by IDEs; adding it to conditions such as #if HWY_TARGET != HWY_SCALAR || HWY_IDE avoids code appearing greyed out.
The following indicate full support for certain lane types and expand to 1 or 0.
HWY_HAVE_INTEGER64: support for 64-bit signed/unsigned integer lanes.
HWY_HAVE_FLOAT16: support for 16-bit floating-point lanes.
HWY_HAVE_FLOAT64: support for double-precision floating-point lanes.
The above were previously known as HWY_CAP_INTEGER64
,
HWY_CAP_FLOAT16
, and HWY_CAP_FLOAT64
, respectively. Those
HWY_CAP_*
names are DEPRECATED.
Even if HWY_HAVE_FLOAT16
is 0, the following ops generally support
float16_t
and bfloat16_t
:
Lanes, MaxLanes
Zero, Set, Undefined
BitCast
Load, LoadU, LoadN, LoadNOr, LoadInterleaved[234], MaskedLoad, MaskedLoadOr
Store, StoreU, StoreN, StoreInterleaved[234], BlendedStore
PromoteTo, DemoteTo
PromoteUpperTo, PromoteLowerTo
PromoteEvenTo, PromoteOddTo
Combine, InsertLane, ZeroExtendVector
RebindMask, FirstN
IfThenElse, IfThenElseZero, IfThenZeroElse
Exception: UpperHalf, PromoteUpperTo, PromoteOddTo and Combine are not supported for the HWY_SCALAR target. Neg also supports float16_t and *Demote2To also supports bfloat16_t.
HWY_HAVE_SCALABLE indicates vector sizes are unknown at compile time, and determined by the CPU.
HWY_HAVE_TUPLE indicates Vec{2-4}, Create{2-4} and Get{2-4} are usable. This is already true #if !HWY_HAVE_SCALABLE, and for SVE targets, and the RVV target when using Clang 16. We anticipate it will also become, and then remain, true starting with GCC 14.
HWY_MEM_OPS_MIGHT_FAULT is 1 iff MaskedLoad may trigger a (page) fault when attempting to load lanes from unmapped memory, even if the corresponding mask element is false. This is the case on ASAN/MSAN builds, AMD x86 prior to AVX-512, and Arm NEON. If so, users can prevent faults by ensuring memory addresses are aligned to the vector size or at least padded (allocation size increased by at least Lanes(d)).
HWY_NATIVE_FMA expands to 1 if the MulAdd etc. ops use native fused multiply-add for floating-point inputs. Otherwise, MulAdd(f, m, a) is implemented as Add(Mul(f, m), a). Checking this can be useful for increasing the tolerance of expected results (around 1E-5 or 1E-6).
HWY_IS_LITTLE_ENDIAN expands to 1 on little-endian targets and to 0 on big-endian targets.
HWY_IS_BIG_ENDIAN expands to 1 on big-endian targets and to 0 on little-endian targets.
The following were used to signal the maximum number of lanes for certain operations, but this is no longer necessary (nor possible on SVE/RVV), so they are DEPRECATED:
HWY_CAP_GE256: the current target supports vectors of >= 256 bits.
HWY_CAP_GE512: the current target supports vectors of >= 512 bits.
Detecting supported targets
SupportedTargets() returns a cached (initialized on-demand) bitfield of the targets supported on the current CPU, detected using CPUID on x86 or equivalent. This may include targets that are not in HWY_TARGETS, and vice versa. If there is no overlap the binary will likely crash. This can only happen if:
the specified baseline is not supported by the current CPU, which contradicts the definition of baseline, so the configuration is invalid; or
the baseline does not include the enabled/attainable target(s), which are also not supported by the current CPU, and baseline targets (in particular HWY_SCALAR) were explicitly disabled.
Advanced configuration macros
The following macros govern which targets to generate. Unless specified
otherwise, they may be defined per translation unit, e.g. to disable
>128 bit vectors in modules that do not benefit from them (if
bandwidth-limited or only called occasionally). This is safe because
HWY_TARGETS
always includes at least one baseline target which
HWY_EXPORT
can use.
HWY_DISABLE_CACHE_CONTROL makes the cache-control functions no-ops.
HWY_DISABLE_BMI2_FMA prevents emitting BMI/BMI2/FMA instructions. This allows using AVX2 in VMs that do not support the other instructions, but only if defined for all translation units.
The following *_TARGETS
are zero or more HWY_Target
bits and can
be defined as an expression, e.g.
-DHWY_DISABLED_TARGETS=(HWY_SSE4|HWY_AVX3)
.
HWY_BROKEN_TARGETS defaults to a blocklist of known compiler bugs. Defining to 0 disables the blocklist.
HWY_DISABLED_TARGETS defaults to zero. This allows explicitly disabling targets without interfering with the blocklist.
HWY_BASELINE_TARGETS defaults to the set whose predefined macros are defined (i.e. those for which the corresponding flag, e.g. -mavx2, was passed to the compiler). If specified, this should be the same for all translation units, otherwise the safety check in SupportedTargets (that all enabled baseline targets are supported) may be inaccurate.
Zero or one of the following macros may be defined to replace the
default policy for selecting HWY_TARGETS
:
HWY_COMPILE_ONLY_EMU128 selects only HWY_EMU128, which avoids intrinsics but implements all ops using standard C++.
HWY_COMPILE_ONLY_SCALAR selects only HWY_SCALAR, which implements single-lane-only ops using standard C++.
HWY_COMPILE_ONLY_STATIC selects only HWY_STATIC_TARGET, which effectively disables dynamic dispatch.
HWY_COMPILE_ALL_ATTAINABLE selects all attainable targets (i.e. enabled and permitted by the compiler, independently of autovectorization), which maximizes coverage in tests. Defining HWY_IS_TEST, which CMake does for the Highway tests, has the same effect.
HWY_SKIP_NON_BEST_BASELINE compiles all targets at least as good as the baseline. This is also the default if nothing is defined. By skipping targets older than the baseline, this reduces binary size and may resolve compile errors caused by conflicts between dynamic dispatch and -m flags.
At most one HWY_COMPILE_ONLY_*
may be defined.
HWY_COMPILE_ALL_ATTAINABLE
may also be defined even if one of
HWY_COMPILE_ONLY_*
is, but will then be ignored because the flags
are tested in the order listed. As an exception,
HWY_SKIP_NON_BEST_BASELINE
overrides the effect of
HWY_COMPILE_ALL_ATTAINABLE
and HWY_IS_TEST
.
Compiler support
Clang and GCC require opting into SIMD intrinsics, e.g. via -mavx2
flags. However, the flag enables AVX2 instructions in the entire
translation unit, which may violate the one-definition rule (that all
versions of a function such as std::abs
are equivalent, thus the
linker may choose any). This can cause crashes if non-SIMD functions are
defined outside of a target-specific namespace, and the linker happens
to choose the AVX2 version, which means it may be called without
verifying AVX2 is indeed supported.
To prevent this problem, we use target-specific attributes introduced via #pragma. Functions using SIMD must reside between HWY_BEFORE_NAMESPACE and HWY_AFTER_NAMESPACE. Conversely, non-SIMD functions and, in particular, #include of normal or standard library headers must NOT reside between HWY_BEFORE_NAMESPACE and HWY_AFTER_NAMESPACE. Alternatively, individual functions may be prefixed with HWY_ATTR, which is more verbose, but ensures that #include-d functions are not covered by target-specific attributes.
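A minimal static-dispatch skeleton (ours) illustrating the placement described above: standard headers outside the pragmas, SIMD code inside the per-target namespace. Names such as project and MulBy2 are placeholders.

```cpp
#include <stddef.h>  // standard headers must NOT be between the pragmas

#include "hwy/highway.h"

HWY_BEFORE_NAMESPACE();  // enables target-specific attributes
namespace project {
namespace HWY_NAMESPACE {  // per-target namespace, e.g. N_AVX2

namespace hn = hwy::HWY_NAMESPACE;

// Doubles each element; assumes num is a multiple of Lanes(d).
void MulBy2(const float* HWY_RESTRICT in, float* HWY_RESTRICT out,
            size_t num) {
  const hn::ScalableTag<float> d;
  for (size_t i = 0; i < num; i += hn::Lanes(d)) {
    const auto v = hn::LoadU(d, in + i);
    hn::StoreU(hn::Add(v, v), d, out + i);
  }
}

}  // namespace HWY_NAMESPACE
}  // namespace project
HWY_AFTER_NAMESPACE();
```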
WARNING: avoid non-local static objects (namespace scope ‘global
variables’) between HWY_BEFORE_NAMESPACE
and
HWY_AFTER_NAMESPACE
. We have observed crashes on PPC because the
compiler seems to have generated an initializer using PPC10 code to
splat a constant to all vector lanes, see #1739. To prevent this, you
can replace static constants with a function returning the desired
value.
If you know the SVE vector width and are using static dispatch, you can
specify -march=armv9-a+sve2-aes -msve-vector-bits=128
and Highway
will then use HWY_SVE2_128
as the baseline. Similarly,
-march=armv8.2-a+sve -msve-vector-bits=256
enables the
HWY_SVE_256
specialization for Neoverse V1. Note that these flags
are unnecessary when using dynamic dispatch. Highway will automatically
detect and dispatch to the best available target, including
HWY_SVE2_128
or HWY_SVE_256
.
Immediates (compile-time constants) are specified as template arguments to avoid constant-propagation issues with Clang on Arm.
Type traits
IsFloat<T>() returns true if T is a floating-point type.
IsSigned<T>() returns true if T is a signed or floating-point type.
LimitsMin/Max<T>() return the smallest/largest value representable in integer T.
SizeTag<N> is an empty struct, used to select overloaded functions appropriate for N bytes.
MakeUnsigned<T> is an alias for an unsigned type of the same size as T.
MakeSigned<T> is an alias for a signed type of the same size as T.
MakeFloat<T> is an alias for a floating-point type of the same size as T.
MakeWide<T> is an alias for a type with twice the size of T and the same category (unsigned/signed/float).
MakeNarrow<T> is an alias for a type with half the size of T and the same category (unsigned/signed/float).
Memory allocation
AllocateAligned<T>(items)
returns a unique pointer to newly
allocated memory for items
elements of POD type T
. The start
address is aligned as required by Load/Store
. Furthermore,
successive allocations are not congruent modulo a platform-specific
alignment. This helps prevent false dependencies or cache conflicts. The
memory allocation is analogous to using malloc()
and free()
with
a std::unique_ptr
since the returned items are not initialized or
default constructed and it is released using FreeAlignedBytes()
without calling ~T()
.
MakeUniqueAligned<T>(Args&&... args)
creates a single object in
newly allocated aligned memory as above but constructed passing the
args
argument to T
’s constructor and returning a unique
pointer to it. This is analogous to using std::make_unique
with
new
but for aligned memory since the object is constructed and later
destructed when the unique pointer is deleted. Typically this type T
is a struct containing multiple members with HWY_ALIGN
or
HWY_ALIGN_MAX
, or arrays whose lengths are known to be a multiple of
the vector size.
MakeUniqueAlignedArray<T>(size_t items, Args&&... args)
creates an
array of objects in newly allocated aligned memory as above and
constructs every element of the new array using the passed constructor
parameters, returning a unique pointer to the array. Note that only the
first element is guaranteed to be aligned to the vector size; because
there is no padding between elements, the alignment of the remaining
elements depends on the size of T
.
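A brief sketch (ours) of these allocation helpers; the struct and sizes are arbitrary, and the memory is released automatically when the unique pointers go out of scope.

```cpp
#include "hwy/aligned_allocator.h"

void AllocationExamples() {
  // Aligned array of 1024 floats, suitable for Load/Store.
  auto data = hwy::AllocateAligned<float>(1024);
  data[0] = 1.0f;

  // Single aligned object, constructed from the given argument.
  struct Params { int iters; float scale; };
  auto params = hwy::MakeUniqueAligned<Params>(Params{10, 0.5f});
  (void)params;
}
```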
Speeding up code for older x86 platforms
Thanks to @dzaima for inspiring this section.
It is possible to improve the performance of your code on older x86 CPUs while remaining portable to all platforms. These older CPUs might indeed be the ones for which optimization is most impactful, because modern CPUs are usually faster and thus likelier to meet performance expectations.
For those without AVX3, preferably avoid Scatter*
; some algorithms
can be reformulated to use Gather*
instead. For pre-AVX2, it is also
important to avoid Gather*
.
It is typically much more efficient to pad arrays and use Load
instead of MaskedLoad
and Store
instead of BlendedStore
.
If possible, use signed 8..32 bit types instead of unsigned types for
comparisons and Min
/Max
.
Other ops which are considerably more expensive, especially on SSSE3, and preferably avoided if possible: MulEven, i32 Mul, Shl/Shr, Round/Trunc/Ceil/Floor, float16 PromoteTo/DemoteTo, AESRound.
Ops which are moderately more expensive on older CPUs: 64-bit Abs/ShiftRight/ConvertTo, i32->u16 DemoteTo, u32->f32 ConvertTo, Not, IfThenElse, RotateRight, OddEven, BroadcastSignBit.
It is likely difficult to avoid all of these ops (about a fifth of the
total). Apps usually also cannot more efficiently achieve the same
result as any op without using it - this is an explicit design goal of
Highway. However, sometimes it is possible to restructure your code to
avoid Not
, e.g. by hoisting it outside the SIMD code, or fusing with
AndNot
or CompressNot
.