# Floating-point routines

XLS provides implementations of several floating-point operations and may add more at any time. Here are listed notable details of our implementations or of floating-point operations in general. Unless otherwise specified, not possible or out-of-scope, all operations and types should be IEEE-754 compliant.

For example, floating-point exceptions have not been implemented; they're outside our current scope. The numeric results of multiplication, on the other hand, should exactly match those of any other compliant implementation.

## APFloat

Floating-point operations are, in general, defined by the same sequence of steps regardless of their underlying bit widths: fractional parts must be expanded then aligned, then an operation (add, multiply, etc.) must be performed, interpreting the fractions as integral types, followed by rounding, special case handling, and reconstructing an output FP type.

This observation leads to the possibility of generic floating-point routines: a fully parameterized add, for example, which can be instantiated with and 8-bit exponent and 23-bit fractional part for binary32 types, and an 11-bit exponent and 52-bit fractional part for binary64 types. Even more interesting, a hypothetical bfloat32 type could immediately be supported by, say, instantiating that adder with, say, 15 exponent bits and 16 fractional ones.

As much as possible, XLS implements FP operations in terms of its APFloat (arbitrary-precision floating-point) type. APFloat is a parameterized floating-point structure with a fixed one-bit sign and specifiable exponent and fractional part size. For ease of use, common types, such as float32, are defined in terms of those APFloat types.

For example, the generic "is X infinite" operation is defined in apfloat.x as:

// Returns whether or not the given APFloat represents an infinite quantity.
pub fn is_inf<EXP_SZ:u32, FRACTION_SZ:u32>(x: APFloat<EXP_SZ, FRACTION_SZ>) -> bool {
(x.bexp == std::mask_bits<EXP_SZ>() && x.fraction == bits[FRACTION_SZ]:0)
}


Whereas in float32.x, F32 is defined as:

pub type F32 = apfloat::APFloat<u32:8, u32:23>;


and is_inf is exposed as:

pub fn is_inf(f: F32) -> bool { apfloat::is_inf<u32:8, u32:23>(f) }


In this way, users can refer to F32 types and can use them as and with float32::is_inf(f), giving them simplified access to a generic operation.

## Supported operations

Here are listed the routines so far implemented in XLS. Unless otherwise specified, operations are implemented in terms of APFloats such that they can support any precisions (aside from corner cases, such as a zero-byte fractional part).

### apfloat::tag

pub fn tag<EXP_SZ:u32, FRACTION_SZ:u32>(input_float: APFloat<EXP_SZ, FRACTION_SZ>) -> APFloatTag


Returns the type of float as one of APFloatTag::ZERO, APFloatTag::SUBNORMAL, APFloatTag::INFINITY, APFloatTag::NAN and APFloatTag::NORMAL.

### apfloat::qnan

pub fn qnan<EXP_SZ:u32, FRACTION_SZ:u32>() -> APFloat<EXP_SZ, FRACTION_SZ>


Returns a quiet NaN.

### apfloat::is_nan

pub fn is_nan<EXP_SZ:u32, FRACTION_SZ:u32>(x: APFloat<EXP_SZ, FRACTION_SZ>) -> bool


Returns whether or not the given APFloat represents NaN.

### apfloat::inf

pub fn inf<EXP_SZ:u32, FRACTION_SZ:u32>(sign: bits[1]) -> APFloat<EXP_SZ, FRACTION_SZ>


Returns a positive or a negative infinity depending upon the given sign parameter.

### apfloat::is_inf

pub fn is_inf<EXP_SZ:u32, FRACTION_SZ:u32>(x: APFloat<EXP_SZ, FRACTION_SZ>) -> bool


Returns whether or not the given APFloat represents an infinite quantity.

### apfloat::is_pos_inf

pub fn is_pos_inf<EXP_SZ:u32, FRACTION_SZ:u32>(x: APFloat<EXP_SZ, FRACTION_SZ>) -> bool


Returns whether or not the given APFloat represents a positive infinite quantity.

### apfloat::is_neg_inf

pub fn is_neg_inf<EXP_SZ:u32, FRACTION_SZ:u32>(x: APFloat<EXP_SZ, FRACTION_SZ>) -> bool


Returns whether or not the given APFloat represents a negative infinite quantity.

### apfloat::zero

pub fn zero<EXP_SZ:u32, FRACTION_SZ:u32>(sign: bits[1]) -> APFloat<EXP_SZ, FRACTION_SZ>


Returns a positive or negative zero depending upon the given sign parameter.

### apfloat::one

pub fn one<EXP_SZ:u32, FRACTION_SZ:u32>(sign: bits[1]) -> APFloat<EXP_SZ, FRACTION_SZ>


Returns one or minus one depending upon the given sign parameter.

### apfloat::negate

pub fn negate<EXP_SZ:u32, FRACTION_SZ:u32>(x: APFloat<EXP_SZ, FRACTION_SZ>) -> APFloat<EXP_SZ, FRACTION_SZ>


Returns the negative of x unless it is a NaN, in which case it will change it from a quiet to signaling NaN or from signaling to a quiet NaN.

### apfloat::max_normal_exp

pub fn max_normal_exp<EXP_SZ : u32>() -> sN[EXP_SZ]


Maximum value of the exponent for normal numbers with EXP_SZ bits in the exponent field. For single precision floats this value is 127.

### apfloat::min_normal_exp

pub fn min_normal_exp<EXP_SZ : u32>() -> sN[EXP_SZ]


Minimum value of the exponent for normal numbers with EXP_SZ bits in the exponent field. For single precision floats this value is -126.

### apfloat::unbiased_exponent

pub fn unbiased_exponent<EXP_SZ:u32, FRACTION_SZ:u32>(f: APFloat<EXP_SZ, FRACTION_SZ>) -> sN[EXP_SZ]


Returns the unbiased exponent. For normal numbers it is bexp - 2^EXP_SZ + 1and for subnormals it is,2 - 2^EXP_SZ. For infinity and NaN, there are no guarantees, as the unbiased exponent has no meaning in that case.

For example, for single precision normal numbers the unbiased exponent is bexp - 127and for subnormal numbers it is-126.

### apfloat::bias

pub fn bias<EXP_SZ: u32, FRACTION_SZ: u32>(unbiased_exponent: sN[EXP_SZ]) -> bits[EXP_SZ]


Returns the biased exponent which is equal to unbiased_exponent + 2^EXP_SZ - 1

Since the function only takes as input the unbiased exponent, it cannot distinguish between normal and subnormal numbers, as a result it assumes that the input is the exponent for a normal number.

### apfloat::flatten

pub fn flatten<EXP_SZ:u32, FRACTION_SZ:u32,
TOTAL_SZ:u32 = {u32:1+EXP_SZ+FRACTION_SZ}>(
x: APFloat<EXP_SZ, FRACTION_SZ>)
-> bits[TOTAL_SZ]


Returns a bit string of size 1 + EXP_SZ + FRACTION_SZ where the first bit is the sign bit, the next EXP_SZ bit encode the biased exponent and the last FRACTION_SZ are the significand without the hidden bit.

### apfloat::unflatten

pub fn unflatten<EXP_SZ:u32, FRACTION_SZ:u32,
TOTAL_SZ:u32 = {u32:1+EXP_SZ+FRACTION_SZ}>(x: bits[TOTAL_SZ])
-> APFloat<EXP_SZ, FRACTION_SZ>


Returns a APFloat struct whose flattened version would be the input string x.

### apfloat:ldexp

pub fn ldexp<EXP_SZ:u32, FRACTION_SZ:u32>(
fraction: APFloat<EXP_SZ, FRACTION_SZ>,
exp: s32) -> APFloat<EXP_SZ, FRACTION_SZ>


ldexp (load exponent) computes fraction * 2^exp. Note that:

• Input denormals are treated as/flushed to 0. (denormals-are-zero / DAZ). Similarly, denormal results are flushed to 0.
• No exception flags are raised/reported.
• We emit a single, canonical representation for NaN (qnan) but accept all NaN representations as input.

### apfloat::cast_from_fixed_using_rne

pub fn cast_from_fixed_using_rne<EXP_SZ:u32, FRACTION_SZ:u32, NUM_SRC_BITS:u32>(
to_cast: sN[NUM_SRC_BITS])
-> APFloat<EXP_SZ, FRACTION_SZ> {


Casts the fixed point number to a floating point number using RNE (Round to Nearest Even) as the rounding mode.

### apfloat::cast_from_fixed_using_rz

pub fn cast_from_fixed_using_rz<EXP_SZ:u32, FRACTION_SZ:u32, NUM_SRC_BITS:u32>(
to_cast: sN[NUM_SRC_BITS])
-> APFloat<EXP_SZ, FRACTION_SZ> {


Casts the fixed point number to a floating point number using RZ (Round to Zero) as the rounding mode.

### apfloat::upcast

fn upcast<TO_EXP_SZ: u32, TO_FRACTION_SZ: u32, FROM_EXP_SZ: u32, FROM_FRACTION_SZ: u32>
(f: APFloat<FROM_EXP_SZ, FROM_FRACTION_SZ>) -> APFloat<TO_EXP_SZ, TO_FRACTION_SZ> {


Upcast the given apfloat to another (larger) apfloat representation. Note: denormal inputs get flushed to zero.

### apfloat::normalize

pub fn normalize<EXP_SZ:u32, FRACTION_SZ:u32,
WIDE_FRACTION:u32 = {FRACTION_SZ + u32:1}>(
sign: bits[1], exp: bits[EXP_SZ],
fraction_with_hidden: bits[WIDE_FRACTION])
-> APFloat<EXP_SZ, FRACTION_SZ>


Returns a normalized APFloat with the given components. fraction_with_hidden is the fraction (including the hidden bit). This function only normalizes in the direction of decreasing the exponent. Input must be a normal number or zero. Subnormals/Denormals are flushed to zero in the result.

### apfloat::is_zero_or_subnormal

pub fn is_zero_or_subnormal<EXP_SZ: u32, FRACTION_SZ: u32>(
x: APFloat<EXP_SZ, FRACTION_SZ>) -> bool


Returns true if x == 0 or x is a subnormal number.

### apfloat::cast_to_fixed

pub fn cast_to_fixed<NUM_DST_BITS:u32, EXP_SZ:u32, FRACTION_SZ:u32>(
to_cast: APFloat<EXP_SZ, FRACTION_SZ>)
-> sN[NUM_DST_BITS]


Casts the floating point number to a fixed point number. Unrepresentable numbers are cast to the minimum representable number (largest magnitude negative number).

### apfloat::eq_2

pub fn eq_2<EXP_SZ: u32, FRACTION_SZ: u32>(
x: APFloat<EXP_SZ, FRACTION_SZ>,
y: APFloat<EXP_SZ, FRACTION_SZ>) -> bool


Returns true if x == y. Denormals are Zero (DAZ). Always returns false if x or y is NaN.

### apfloat::gt_2

pub fn gt_2<EXP_SZ: u32, FRACTION_SZ: u32>(
x: APFloat<EXP_SZ, FRACTION_SZ>,
y: APFloat<EXP_SZ, FRACTION_SZ>) -> bool


Returns true if x > y. Denormals are Zero (DAZ). Always returns false if x or y is NaN.

### apfloat::gte_2

pub fn gte_2<EXP_SZ: u32, FRACTION_SZ: u32>(
x: APFloat<EXP_SZ, FRACTION_SZ>,
y: APFloat<EXP_SZ, FRACTION_SZ>) -> bool


Returns true if x >= y. Denormals are Zero (DAZ). Always returns false if x or y is NaN.

### apfloat::lte_2

pub fn lte_2<EXP_SZ: u32, FRACTION_SZ: u32>(
x: APFloat<EXP_SZ, FRACTION_SZ>,
y: APFloat<EXP_SZ, FRACTION_SZ>) -> bool


Returns true if x <= y. Denormals are Zero (DAZ). Always returns false if x or y is NaN.

### apfloat::lt_2

pub fn lt_2<EXP_SZ: u32, FRACTION_SZ: u32>(
x: APFloat<EXP_SZ, FRACTION_SZ>,
y: APFloat<EXP_SZ, FRACTION_SZ>) -> bool


Returns true if x < y. Denormals are Zero (DAZ). Always returns false if x or y is NaN.

### apfloat::round_towards_zero

pub fn round_towards_zero<EXP_SZ:u32, FRACTION_SZ:u32>(
x: APFloat<EXP_SZ, FRACTION_SZ>)
-> APFloat<EXP_SZ, FRACTION_SZ>


Returns an APFloat with all its bits past the decimal point set to 0.

### apfloat::to_int

pub fn to_int<EXP_SZ: u32, FRACTION_SZ: u32, RESULT_SZ:u32>(
x: APFloat<EXP_SZ, FRACTION_SZ>) -> sN[RESULT_SZ]


Returns the signed integer part of the input float, truncating any fractional bits if necessary.

### apfloat::add/sub

pub fn add<EXP_SZ: u32, FRACTION_SZ: u32>(x: APFloat<EXP_SZ, FRACTION_SZ>,
y: APFloat<EXP_SZ, FRACTION_SZ>)
-> APFloat<EXP_SZ, FRACTION_SZ>

pub fn sub<EXP_SZ: u32, FRACTION_SZ: u32>(x: APFloat<EXP_SZ, FRACTION_SZ>,
y: APFloat<EXP_SZ, FRACTION_SZ>)
-> APFloat<EXP_SZ, FRACTION_SZ>


Returns the sum/difference of x and y based on a generalization of IEEE 754 single-precision floating-point addition, with the following exceptions:

• Both input and output denormals are treated as/flushed to 0.
• Only round-to-nearest mode is supported.
• No exception flags are raised/reported.

In all other cases, results should be identical to other conforming implementations (modulo exact fraction values in the NaN case.

#### Implementation details

Floating-point addition, like any FP operation, is much more complicated than integer addition, and has many more steps. Being the first operation described, we'll take extra care to explain floating-point addition:

1. Expand fractions: Floating-point operations are computed with bits beyond that in their normal representations for increased precision. For IEEE 754 numbers, there are three extra, called the guard, rounding and sticky bits. The first two behave normally, but the last, the "sticky" bit, is special. During shift operations (below), if a "1" value is ever shifted into the sticky bit, it "sticks" - the bit will remain "1" through any further shift operations. In this step, the fractions are expanded by these three bits.
2. Align fractions: To ensure that fractions are added with appropriate magnitudes, they must be aligned according to their exponents. To do so, the smaller significant needs to be shifted to the right (each right shift is equivalent to increasing the exponent by one).

• The extra precision bits are populated in this shift.
• As part of this step, the leading 1 bit... and a sign bit Note: The sticky bit is calculated and applied in this step.
3. Sign-adjustment: if the fractions differ in sign, then the fraction with the smaller initial exponent needs to be (two's complement) negated.

4. Add the fractions and capture the carry bit. Note that, if the signs of the fractions differs, then this could result in higher bits being cleared.

5. Normalize the fractions: Shift the result so that the leading '1' is present in the proper space. This means shifting right one place if the result set the carry bit, and to the left some number of places if high bits were cleared.

• The sticky bit must be preserved in any of these shifts!
6. Rounding: Here, the extra precision bits are examined to determine if the result fraction's last bit should be rounded up. IEEE 754 supports five rounding modes:

• Round towards 0: just chop off the extra precision bits.
• Round towards +infinity: round up if any extra precision bits are set.
• Round towards -infinity: round down if any extra precision bits are set.
• Round to nearest, ties away from zero: Rounds to the nearest value. In cases where the extra precision bits are halfway between values, i.e., 0b100, then the result is rounded up for positive numbers and down for negative ones.
• Round to nearest, ties to even: Rounds to the nearest value. In cases where the extra precision bits are halfway between values, then the result is rounded in whichever direction causes the LSB of the result significant to be 0.

• This is the most commonly-used rounding mode.
• This is [currently] the only supported mode by the DSLX implementation.
7. Special case handling: The results are examined for special cases such as NaNs, infinities, or (optionally) subnormals.

##### Result sign determination

The sign of the result will normally be the same as the sign of the operand with the greater exponent, but there are two extra cases to consider. If the operands have the same exponent, then the sign will be that of the greater fraction, and if the result is 0, then we favor positive 0 vs. negative 0 (types are as for a C float implementation):

  let fraction = (addend_x as s29) + (addend_y as s29);
let fraction_is_zero = fraction == s29:0;
let result_sign = match (fraction_is_zero, fraction < s29:0) {
(true, _) => u1:0,
(false, true) => !greater_exp.sign,
_ => greater_exp.sign,
};

##### Rounding

As complicated as rounding is to describe, its implementation is relatively straightforward (types are as for a C float implementation):

  let normal_chunk = shifted_fraction[0:3];
let half_way_chunk = shifted_fraction[2:4];
let do_round_up =
if (normal_chunk > u3:0x4) | (half_way_chunk == u2:0x3) { u1:1 }
else { u1:0 };

// We again need an extra bit for carry.
let rounded_fraction = if do_round_up { (shifted_fraction as u28) + u28:0x8 }
else { shifted_fraction as u28 };
let rounding_carry = rounded_fraction[-1:];


### apfloat::mul

pub fn mul<EXP_SZ: u32, FRACTION_SZ: u32>(x: APFloat<EXP_SZ, FRACTION_SZ>,
y: APFloat<EXP_SZ, FRACTION_SZ>)
-> APFloat<EXP_SZ, FRACTION_SZ>


Returns the product of x and y, with the following exceptions:

• Both input and output denormals are treated as/flushed to 0.
• Only round-to-nearest mode is supported.
• No exception flags are raised/reported.

In all other cases, results should be identical to other conforming implementations (modulo exact fraction values in the NaN case).

### apfloat::fma

pub fn fma<EXP_SZ: u32, FRACTION_SZ: u32>(a: APFloat<EXP_SZ, FRACTION_SZ>,
b: APFloat<EXP_SZ, FRACTION_SZ>,
c: APFloat<EXP_SZ, FRACTION_SZ>)
-> APFloat<EXP_SZ, FRACTION_SZ> {


Returns the Fused Multiply and Add (FMA) result of computing a*b + c.

The FMA operation is a three-operand operation that computes the product of the first two and the sum of that with the third. The IEEE 754-2008 description of the operation states that the operation should be performed "as if with unbounded range and precision", limited only by rounding of the final result. In other words, this differs from a sequence of a separate multiply followed by an add in that there is only a single rounding step (instead of the two involved in separate operations).

In practice, this means A) that the precision of an FMA is higher than individual ops, and thus that B) an FMA requires significantly more internal precision bits than naively expected.

For binary32 inputs, to achieve the standard-specified precision, the initial mul requires the usual 48 ((23 fraction + 1 "hidden") * 2) fraction bits. When performing the subsequent add step, though, it is necessary to maintain 72 fraction bits ((23 fraction + 1 "hidden") * 3). Fortunately, this sum includes the guard, round, and sticky bits (so we don't need 75). The mathematical derivation of the exact amount will not be given here (as I've not done it), but the same calculated size would apply for other data types (i.e., 54 * 2 = 108 and 54 * 3 = 162 for binary64).

Aside from determining the necessary precision bits, the FMA implementation is rather straightforward, especially after reviewing the adder and multiplier.

## float64

To help with the float64 or double precision floating point numbers, float32.x defines the following type aliases.

pub type F64 = apfloat::APFloat<11, 52>;
pub type FloatTag = apfloat::APFloatTag;
pub type TaggedF64 = (FloatTag, F64);


Besides float64 specializations of the functions in apfloat.x, the following functions are defined just for float64.

### float64::to_int64

pub fn to_int64(x: F64) -> s64;


Convert the F64 struct into a 64 bit integer.

## float32

To help with the float32 or single precision floating point numbers, float32.x defines the following type aliases.

pub type F32 = apfloat::APFloat<8, 23>;
pub type FloatTag = apfloat::APFloatTag;
pub type TaggedF32 = (FloatTag, F32);
pub const F32_ONE_FLAT = u32:0x3f800000;


Besides float32 specializations of the functions in apfloat.x, the following functions are defined just for float32.

### float32::to_int32, float32::from_int32

pub fn to_int32(x: F32) -> s32
pub fn from_int32(x: s32) -> F32

Convert the F32 struct to and from a 32bit integer.

# float32::fixed_fraction

pub fn fixed_fraction(input_float: F32) -> u23


TBD

# float32::fast_rsqrt_config_refinements/float32::fast_rsqrt

pub fn fast_rsqrt_config_refinements<NUM_REFINEMENTS: u32 = {u32:1}>(x: F32) -> F32
pub fn fast_rsqrt(x: F32) -> F32


Floating point (fast (approximate) inverse square root)[ https://en.wikipedia.org/wiki/Fast_inverse_square_root]. This should be able to compute 1.0 / sqrt(x) using fewer hardware resources than using a sqrt and division module, although this hasn't been benchmarked yet. Latency is expected to be lower as well. The tradeoff is that this offers slighlty less precision (error is < 0.2% in worst case). Note that:

• Input denormals are treated as/flushed to 0. (denormals-are-zero / DAZ).
• Only round-to-nearest mode is supported.
• No exception flags are raised/reported.
• We emit a single, canonical representation for NaN (qnan) but accept all NaN respresentations as input.

fast_rsqrt_config_refinements allows the user to specify the number of Computes an approximation of 1.0 / sqrt(x). NUM_REFINEMENTS can be increased to tradeoff more hardware resources for more accuracy.

fast_rsqrt does exactly one refinement.

## bfloat16

To help with the bfloat16 floating point numbers, bfloat16.x defines the following type aliases.

pub type BF16 = apfloat::APFloat<8, 7>;
pub type FloatTag = apfloat::APFloatTag;
pub type TaggedBF16 = (FloatTag, BF16);

Besides bfloat16 specializations of the functions in apfloat.x, the following functions are defined just for bfloat16.

### bfloat16:to_int16

pub fn to_int16(x: BF16) -> s16

Convert the the BF16 struct into a 16 bit integer.

### bfloat16:increment_fraction

pub fn increment_fraction(input: BF16) -> BF16


Increments the fraction of the input BF16 by one and returns the normalized result. Input must be a normal non-zero number.

## Testing

Several different methods are used to test these routines, depending on applicability. These are:

• Reference comparison: exhaustive testing
• Reference comparison: space-sampling
• Formal proving

When comparing to a reference, a natural question is the stability of the reference, i.e., is the reference answer the same across all versions or environments? Will the answer given by glibc/libm on AArch64 be the same as one given by a hardware FMA unit on a GPU? Fortunately, all "correct" implementations will give the same results for the same inputs. 1 In addition, POSIX has the same result-precision language. It's worth noting that -ffast-math doesn't currently affect FMA emission/fusion/fission/etc.

differ between implementations due to the table maker's dilemma.

### Exhaustive testing

This is the happiest case - where the input space is so small that we can iterate over every possible input, effectively treating the input as a binary iteration counter. Sadly, this is uncommon (except, perhaps for ML math), as binary32 is the precision floor for most problems, and a 64-bit input space is well beyond our current abilities. Still - if your problem can be exhaustively tested (with respect to a trusted reference), it should be exhaustively tested!

None of our current ops are tested in this way, although the bf16 cases could/should be.

### Space-sampling

When the problem input space is too large for exhaustive testing, then random samples can be tested instead. This approach can't give complete verification of an implementation, but, given enough samples, it can yield a high degree of confidence.

The existing modules are tested in this way. This could be improved by preventing re-testing of any given sample (at the cost of memory and, perhaps, atomic/locking costs) and by identifying interesting "corner cases" of the input space and focusing on those.

### Formal verification

This sort of testing utilizes our formal solver infrastructure to prove correctness with the solver's internal FP implementation. This is fully described in the solvers documentation.

1. There are operations for which this is not true. Transcendental ops may