# Floating-point routines

XLS provides implementations of several floating-point operations and may add more at any time. Here are listed notable details of our implementations or of floating-point operations in general. Unless otherwise specified, not possible or out-of-scope, all operations and types should be IEEE-754 compliant.

For example, floating-point exceptions have not been implemented; they're
outside our current scope. The numeric results of multiplication, on the other
hand, should *exactly* match those of any other compliant implementation.

## APFloat

Floating-point operations are, in general, defined by the same sequence of steps regardless of their underlying bit widths: fractional parts must be expanded then aligned, then an operation (add, multiply, etc.) must be performed, interpreting the fractions as integral types, followed by rounding, special case handling, and reconstructing an output FP type.

This observation leads to the possibility of *generic* floating-point routines:
a fully parameterized add, for example, which can be instantiated with and 8-bit
exponent and 23-bit fractional part for binary32 types, and an 11-bit exponent
and 52-bit fractional part for binary64 types. Even more interesting, a
hypothetical bfloat32 type could *immediately* be supported by, say,
instantiating that adder with, say, 15 exponent bits and 16 fractional ones.

As much as possible, XLS implements FP operations in terms of its
`APFloat`

(arbitrary-precision floating-point) type. `APFloat`

is a parameterized
floating-point structure with a fixed one-bit sign and specifiable exponent and
fractional part size. For ease of use, common types, such as
`float32`

, are
defined in terms of those `APFloat`

types.

For example, the generic "is X infinite" operation is defined in `apfloat.x`

as:

```
// Returns whether or not the given APFloat represents an infinite quantity.
pub fn is_inf<EXP_SZ:u32, FRACTION_SZ:u32>(x: APFloat<EXP_SZ, FRACTION_SZ>) -> bool {
(x.bexp == std::mask_bits<EXP_SZ>() && x.fraction == bits[FRACTION_SZ]:0)
}
```

Whereas in `float32.x`

, `F32`

is defined as:

```
pub type F32 = apfloat::APFloat<u32:8, u32:23>;
```

and `is_inf`

is exposed as:

```
pub fn is_inf(f: F32) -> bool { apfloat::is_inf<u32:8, u32:23>(f) }
```

In this way, users can refer to `F32`

types and can use them as and with
`float32::is_inf(f)`

, giving them simplified access to a generic operation.

## Supported operations

Here are listed the routines so far implemented in XLS. Unless otherwise
specified, operations are implemented in terms of `APFloat`

s such that they can
support any precisions (aside from corner cases, such as a zero-byte fractional
part).

`apfloat::tag`

```
pub fn tag<EXP_SZ:u32, FRACTION_SZ:u32>(input_float: APFloat<EXP_SZ, FRACTION_SZ>) -> APFloatTag
```

Returns the type of float as one of `APFloatTag::ZERO`

, `APFloatTag::SUBNORMAL`

,
`APFloatTag::INFINITY`

, `APFloatTag::NAN`

and `APFloatTag::NORMAL`

.

`apfloat::qnan`

```
pub fn qnan<EXP_SZ:u32, FRACTION_SZ:u32>() -> APFloat<EXP_SZ, FRACTION_SZ>
```

Returns a `quiet NaN`

.

`apfloat::is_nan`

```
pub fn is_nan<EXP_SZ:u32, FRACTION_SZ:u32>(x: APFloat<EXP_SZ, FRACTION_SZ>) -> bool
```

Returns whether or not the given `APFloat`

represents `NaN`

.

`apfloat::inf`

```
pub fn inf<EXP_SZ:u32, FRACTION_SZ:u32>(sign: bits[1]) -> APFloat<EXP_SZ, FRACTION_SZ>
```

Returns a positive or a negative infinity depending upon the given sign parameter.

`apfloat::is_inf`

```
pub fn is_inf<EXP_SZ:u32, FRACTION_SZ:u32>(x: APFloat<EXP_SZ, FRACTION_SZ>) -> bool
```

Returns whether or not the given `APFloat`

represents an infinite quantity.

`apfloat::is_pos_inf`

```
pub fn is_pos_inf<EXP_SZ:u32, FRACTION_SZ:u32>(x: APFloat<EXP_SZ, FRACTION_SZ>) -> bool
```

Returns whether or not the given `APFloat`

represents a positive infinite quantity.

`apfloat::is_neg_inf`

```
pub fn is_neg_inf<EXP_SZ:u32, FRACTION_SZ:u32>(x: APFloat<EXP_SZ, FRACTION_SZ>) -> bool
```

Returns whether or not the given `APFloat`

represents a negative infinite quantity.

`apfloat::zero`

```
pub fn zero<EXP_SZ:u32, FRACTION_SZ:u32>(sign: bits[1]) -> APFloat<EXP_SZ, FRACTION_SZ>
```

Returns a positive or negative zero depending upon the given sign parameter.

`apfloat::one`

```
pub fn one<EXP_SZ:u32, FRACTION_SZ:u32>(sign: bits[1]) -> APFloat<EXP_SZ, FRACTION_SZ>
```

Returns one or minus one depending upon the given sign parameter.

`apfloat::negate`

```
pub fn negate<EXP_SZ:u32, FRACTION_SZ:u32>(x: APFloat<EXP_SZ, FRACTION_SZ>) -> APFloat<EXP_SZ, FRACTION_SZ>
```

Returns the negative of `x`

unless it is a `NaN`

, in which case it will change it
from a quiet to signaling `NaN`

or from signaling to a quiet `NaN`

.

`apfloat::max_normal_exp`

```
pub fn max_normal_exp<EXP_SZ : u32>() -> sN[EXP_SZ]
```

Maximum value of the exponent for normal numbers with EXP_SZ bits in the exponent field. For single precision floats this value is 127.

`apfloat::min_normal_exp`

```
pub fn min_normal_exp<EXP_SZ : u32>() -> sN[EXP_SZ]
```

Minimum value of the exponent for normal numbers with EXP_SZ bits in the exponent field. For single precision floats this value is -126.

`apfloat::unbiased_exponent`

```
pub fn unbiased_exponent<EXP_SZ:u32, FRACTION_SZ:u32>(f: APFloat<EXP_SZ, FRACTION_SZ>) -> sN[EXP_SZ]
```

Returns the unbiased exponent. For normal numbers it is ```
bexp - 2^EXP_SZ +
1``and for subnormals it is,
```

2 - 2^EXP_SZ`. For infinity and `NaN`

, there are
no guarantees, as the unbiased exponent has no meaning in that case.

For example, for single precision normal numbers the unbiased exponent is
`bexp - 127``and for subnormal numbers it is`

-126`.

`apfloat::bias`

```
pub fn bias<EXP_SZ: u32, FRACTION_SZ: u32>(unbiased_exponent: sN[EXP_SZ]) -> bits[EXP_SZ]
```

Returns the biased exponent which is equal to `unbiased_exponent + 2^EXP_SZ - 1`

Since the function only takes as input the unbiased exponent, it cannot distinguish between normal and subnormal numbers, as a result it assumes that the input is the exponent for a normal number.

`apfloat::flatten`

```
pub fn flatten<EXP_SZ:u32, FRACTION_SZ:u32,
TOTAL_SZ:u32 = {u32:1+EXP_SZ+FRACTION_SZ}>(
x: APFloat<EXP_SZ, FRACTION_SZ>)
-> bits[TOTAL_SZ]
```

Returns a bit string of size `1 + EXP_SZ + FRACTION_SZ`

where the first bit is
the sign bit, the next `EXP_SZ`

bit encode the biased exponent and the last
`FRACTION_SZ`

are the significand without the hidden bit.

`apfloat::unflatten`

```
pub fn unflatten<EXP_SZ:u32, FRACTION_SZ:u32,
TOTAL_SZ:u32 = {u32:1+EXP_SZ+FRACTION_SZ}>(x: bits[TOTAL_SZ])
-> APFloat<EXP_SZ, FRACTION_SZ>
```

Returns a `APFloat`

struct whose flattened version would be the input string
`x`

.

`apfloat:ldexp`

```
pub fn ldexp<EXP_SZ:u32, FRACTION_SZ:u32>(
fraction: APFloat<EXP_SZ, FRACTION_SZ>,
exp: s32) -> APFloat<EXP_SZ, FRACTION_SZ>
```

`ldexp`

(load exponent) computes `fraction * 2^exp`

. Note that:

- Input denormals are treated as/flushed to 0. (denormals-are-zero / DAZ). Similarly, denormal results are flushed to 0.
- No exception flags are raised/reported.
- We emit a single, canonical representation for NaN (qnan) but accept all
`NaN`

representations as input.

`apfloat::cast_from_fixed_using_rne`

```
pub fn cast_from_fixed_using_rne<EXP_SZ:u32, FRACTION_SZ:u32, NUM_SRC_BITS:u32>(
to_cast: sN[NUM_SRC_BITS])
-> APFloat<EXP_SZ, FRACTION_SZ> {
```

Casts the fixed point number to a floating point number using RNE (Round to Nearest Even) as the rounding mode.

`apfloat::cast_from_fixed_using_rz`

```
pub fn cast_from_fixed_using_rz<EXP_SZ:u32, FRACTION_SZ:u32, NUM_SRC_BITS:u32>(
to_cast: sN[NUM_SRC_BITS])
-> APFloat<EXP_SZ, FRACTION_SZ> {
```

Casts the fixed point number to a floating point number using RZ (Round to Zero) as the rounding mode.

`apfloat::normalize`

```
pub fn normalize<EXP_SZ:u32, FRACTION_SZ:u32,
WIDE_FRACTION:u32 = {FRACTION_SZ + u32:1}>(
sign: bits[1], exp: bits[EXP_SZ],
fraction_with_hidden: bits[WIDE_FRACTION])
-> APFloat<EXP_SZ, FRACTION_SZ>
```

Returns a normalized APFloat with the given components. `fraction_with_hidden`

is the fraction (including the hidden bit). This function only normalizes in the
direction of decreasing the exponent. Input must be a normal number or zero.
Subnormals/Denormals are flushed to zero in the result.

`apfloat::is_zero_or_subnormal`

```
pub fn is_zero_or_subnormal<EXP_SZ: u32, FRACTION_SZ: u32>(
x: APFloat<EXP_SZ, FRACTION_SZ>) -> bool
```

Returns `true`

if `x == 0`

or `x`

is a subnormal number.

`apfloat::cast_to_fixed`

```
pub fn cast_to_fixed<NUM_DST_BITS:u32, EXP_SZ:u32, FRACTION_SZ:u32>(
to_cast: APFloat<EXP_SZ, FRACTION_SZ>)
-> sN[NUM_DST_BITS]
```

Casts the floating point number to a fixed point number. Unrepresentable numbers are cast to the minimum representable number (largest magnitude negative number).

`apfloat::eq_2`

```
pub fn eq_2<EXP_SZ: u32, FRACTION_SZ: u32>(
x: APFloat<EXP_SZ, FRACTION_SZ>,
y: APFloat<EXP_SZ, FRACTION_SZ>) -> bool
```

Returns `true`

if `x == y`

. Denormals are Zero (DAZ). Always returns `false`

if
`x`

or `y`

is `NaN`

.

`apfloat::gt_2`

```
pub fn gt_2<EXP_SZ: u32, FRACTION_SZ: u32>(
x: APFloat<EXP_SZ, FRACTION_SZ>,
y: APFloat<EXP_SZ, FRACTION_SZ>) -> bool
```

Returns `true`

if `x > y`

. Denormals are Zero (DAZ). Always returns `false`

if
`x`

or `y`

is `NaN`

.

`apfloat::gte_2`

```
pub fn gte_2<EXP_SZ: u32, FRACTION_SZ: u32>(
x: APFloat<EXP_SZ, FRACTION_SZ>,
y: APFloat<EXP_SZ, FRACTION_SZ>) -> bool
```

Returns `true`

if `x >= y`

. Denormals are Zero (DAZ). Always returns `false`

if
`x`

or `y`

is `NaN`

.

`apfloat::lte_2`

```
pub fn lte_2<EXP_SZ: u32, FRACTION_SZ: u32>(
x: APFloat<EXP_SZ, FRACTION_SZ>,
y: APFloat<EXP_SZ, FRACTION_SZ>) -> bool
```

Returns `true`

if `x <= y`

. Denormals are Zero (DAZ). Always returns `false`

if
`x`

or `y`

is `NaN`

.

`apfloat::lt_2`

```
pub fn lt_2<EXP_SZ: u32, FRACTION_SZ: u32>(
x: APFloat<EXP_SZ, FRACTION_SZ>,
y: APFloat<EXP_SZ, FRACTION_SZ>) -> bool
```

Returns `true`

if `x < y`

. Denormals are Zero (DAZ). Always returns `false`

if
`x`

or `y`

is `NaN`

.

`apfloat::round_towards_zero`

```
pub fn round_towards_zero<EXP_SZ:u32, FRACTION_SZ:u32>(
x: APFloat<EXP_SZ, FRACTION_SZ>)
-> APFloat<EXP_SZ, FRACTION_SZ>
```

Returns an `APFloat`

with all its bits past the decimal point set to `0`

.

`apfloat::to_int`

```
pub fn to_int<EXP_SZ: u32, FRACTION_SZ: u32, RESULT_SZ:u32>(
x: APFloat<EXP_SZ, FRACTION_SZ>) -> sN[RESULT_SZ]
```

Returns the signed integer part of the input float, truncating any fractional bits if necessary.

`apfloat::add/sub`

```
pub fn add<EXP_SZ: u32, FRACTION_SZ: u32>(x: APFloat<EXP_SZ, FRACTION_SZ>,
y: APFloat<EXP_SZ, FRACTION_SZ>)
-> APFloat<EXP_SZ, FRACTION_SZ>
pub fn sub<EXP_SZ: u32, FRACTION_SZ: u32>(x: APFloat<EXP_SZ, FRACTION_SZ>,
y: APFloat<EXP_SZ, FRACTION_SZ>)
-> APFloat<EXP_SZ, FRACTION_SZ>
```

Returns the sum/difference of `x`

and `y`

based on a generalization of IEEE 754
single-precision floating-point addition, with the following exceptions:

- Both input and output denormals are treated as/flushed to 0.
- Only round-to-nearest mode is supported.
- No exception flags are raised/reported.

In all other cases, results should be identical to other conforming
implementations (modulo exact fraction values in the `NaN`

case.

#### Implementation details

Floating-point addition, like any FP operation, is much more complicated than integer addition, and has many more steps. Being the first operation described, we'll take extra care to explain floating-point addition:

**Expand fractions:**Floating-point operations are computed with bits beyond that in their normal representations for increased precision. For IEEE 754 numbers, there are three extra, called the guard, rounding and sticky bits. The first two behave normally, but the last, the "sticky" bit, is special. During shift operations (below), if a "1" value is ever shifted into the sticky bit, it "sticks" - the bit will remain "1" through any further shift operations. In this step, the fractions are expanded by these three bits.**Align fractions:**To ensure that fractions are added with appropriate magnitudes, they must be aligned according to their exponents. To do so, the smaller significant needs to be shifted to the right (each right shift is equivalent to increasing the exponent by one). - The extra precision bits are populated in this shift. - As part of this step, the leading 1 bit... and a sign bit Note: The sticky bit is calculated and applied in this step.**Sign-adjustment:**if the fractions differ in sign, then the fraction with the smaller initial exponent needs to be (two's complement) negated.**Add the fractions and capture the carry bit.**Note that, if the signs of the fractions differs, then this could result in higher bits being cleared.**Normalize the fractions:**Shift the result so that the leading '1' is present in the proper space. This means shifting right one place if the result set the carry bit, and to the left some number of places if high bits were cleared. - The sticky bit must be preserved in any of these shifts!**Rounding:**Here, the extra precision bits are examined to determine if the result fraction's last bit should be rounded up. IEEE 754 supports five rounding modes: - Round towards 0: just chop off the extra precision bits. - Round towards +infinity: round up if any extra precision bits are set. - Round towards -infinity: round down if any extra precision bits are set. - Round to nearest, ties away from zero: Rounds to the nearest value. In cases where the extra precision bits are halfway between values, i.e., 0b100, then the result is rounded up for positive numbers and down for negative ones. - Round to nearest, ties to even: Rounds to the nearest value. In cases where the extra precision bits are halfway between values, then the result is rounded in whichever direction causes the LSB of the result significant to be 0. - This is the most commonly-used rounding mode. - This is [currently] the only supported mode by the DSLX implementation.**Special case handling:**The results are examined for special cases such as NaNs, infinities, or (optionally) subnormals.

##### Result sign determination

The sign of the result will normally be the same as the sign of the operand with
the greater exponent, but there are two extra cases to consider. If the operands
have the same exponent, then the sign will be that of the greater fraction, and
if the result is 0, then we favor positive 0 vs. negative 0 (types are as for a
C `float`

implementation):

```
let fraction = (addend_x as s29) + (addend_y as s29);
let fraction_is_zero = fraction == s29:0;
let result_sign = match (fraction_is_zero, fraction < s29:0) {
(true, _) => u1:0,
(false, true) => !greater_exp.sign,
_ => greater_exp.sign,
};
```

##### Rounding

As complicated as rounding is to describe, its implementation is relatively
straightforward (types are as for a C `float`

implementation):

```
let normal_chunk = shifted_fraction[0:3];
let half_way_chunk = shifted_fraction[2:4];
let do_round_up =
u1:1 if (normal_chunk > u3:0x4) | (half_way_chunk == u2:0x3)
else u1:0;
// We again need an extra bit for carry.
let rounded_fraction = (shifted_fraction as u28) + u28:0x8 if do_round_up
else (shifted_fraction as u28);
let rounding_carry = rounded_fraction[-1:];
```

`apfloat::mul`

```
pub fn mul<EXP_SZ: u32, FRACTION_SZ: u32>(x: APFloat<EXP_SZ, FRACTION_SZ>,
y: APFloat<EXP_SZ, FRACTION_SZ>)
-> APFloat<EXP_SZ, FRACTION_SZ>
```

Returns the product of `x`

and `y`

, with the following exceptions:

- Both input and output denormals are treated as/flushed to 0.
- Only round-to-nearest mode is supported.
- No exception flags are raised/reported.

In all other cases, results should be identical to other conforming implementations (modulo exact fraction values in the NaN case).

`apfloat::fma`

```
pub fn fma<EXP_SZ: u32, FRACTION_SZ: u32>(a: APFloat<EXP_SZ, FRACTION_SZ>,
b: APFloat<EXP_SZ, FRACTION_SZ>,
c: APFloat<EXP_SZ, FRACTION_SZ>)
-> APFloat<EXP_SZ, FRACTION_SZ> {
```

Returns the Fused Multiply and Add (FMA) result of computing `a*b + c`

.

The FMA operation is a three-operand operation that computes the product of the first two and the sum of that with the third. The IEEE 754-2008 description of the operation states that the operation should be performed "as if with unbounded range and precision", limited only by rounding of the final result. In other words, this differs from a sequence of a separate multiply followed by an add in that there is only a single rounding step (instead of the two involved in separate operations).

In practice, this means A) that the precision of an FMA is higher than individual ops, and thus that B) an FMA requires significantly more internal precision bits than naively expected.

For binary32 inputs, to achieve the standard-specified precision, the initial
mul requires the usual 48 ((23 fraction + 1 "hidden") * 2) fraction bits. When
performing the subsequent add step, though, it is necessary to maintain *72*
fraction bits ((23 fraction + 1 "hidden") * 3). Fortunately, this sum includes
the guard, round, and sticky bits (so we don't need 75). The mathematical
derivation of the exact amount will not be given here (as I've not done it), but
the same calculated size would apply for other data types (i.e., 54 * 2 = 108
and 54 * 3 = 162 for binary64).

Aside from determining the necessary precision bits, the FMA implementation is rather straightforward, especially after reviewing the adder and multiplier.

## float64

To help with the `float64`

or
double precision
floating point numbers, `float32.x`

defines the following type aliases.

```
pub type F64 = apfloat::APFloat<11, 52>;
pub type FloatTag = apfloat::APFloatTag;
pub type TaggedF64 = (FloatTag, F64);
```

Besides `float64`

specializations of the functions in `apfloat.x`

, the following
functions are defined just for `float64`

.

`float64::to_int64`

```
pub fn to_int64(x: F64) -> s64;
```

`F64`

struct into a 64 bit integer.
## float32

To help with the `float32`

or
single precision
floating point numbers, `float32.x`

defines the following type aliases.

```
pub type F32 = apfloat::APFloat<8, 23>;
pub type FloatTag = apfloat::APFloatTag;
pub type TaggedF32 = (FloatTag, F32);
pub const F32_ONE_FLAT = u32:0x3f800000;
```

Besides `float32`

specializations of the functions in `apfloat.x`

, the following
functions are defined just for `float32`

.

`float32::to_int32`

, `float32::from_int32`

```
pub fn to_int32(x: F32) -> s32
pub fn from_int32(x: s32) -> F32
```

`F32`

struct to and from a 32bit integer.
`float32::fixed_fraction`

```
pub fn fixed_fraction(input_float: F32) -> u23
```

TBD

`float32::fast_rsqrt_config_refinements`

/`float32::fast_rsqrt`

```
pub fn fast_rsqrt_config_refinements<NUM_REFINEMENTS: u32 = {u32:1}>(x: F32) -> F32
pub fn fast_rsqrt(x: F32) -> F32
```

Floating point (fast (approximate) inverse square root)[
https://en.wikipedia.org/wiki/Fast_inverse_square_root]. This should be able to
compute `1.0 / sqrt(x)`

using fewer hardware resources than using a `sqrt`

and
division module, although this hasn't been benchmarked yet. Latency is expected
to be lower as well. The tradeoff is that this offers slighlty less precision
(error is < 0.2% in worst case). Note that:

- Input denormals are treated as/flushed to 0. (denormals-are-zero / DAZ).
- Only round-to-nearest mode is supported.
- No exception flags are raised/reported.
- We emit a single, canonical representation for NaN (qnan) but accept all NaN respresentations as input.

`fast_rsqrt_config_refinements`

allows the user to specify the number of
Computes an approximation of 1.0 / sqrt(x). `NUM_REFINEMENTS`

can be increased
to tradeoff more hardware resources for more accuracy.

`fast_rsqrt`

does exactly one refinement.

## bfloat16

To help with the
`bfloat16`

floating point numbers, `bfloat16.x`

defines the following type aliases.

```
pub type BF16 = apfloat::APFloat<8, 7>;
pub type FloatTag = apfloat::APFloatTag;
pub type TaggedBF16 = (FloatTag, BF16);
```

`bfloat16`

specializations of the functions in `apfloat.x`

, the following
functions are defined just for `bfloat16`

.
`bfloat16:to_int16`

```
pub fn to_int16(x: BF16) -> s16
```

`BF16`

struct into a 16 bit integer.
`bfloat16:increment_fraction`

```
pub fn increment_fraction(input: BF16) -> BF16
```

Increments the fraction of the input BF16 by one and returns the normalized
result. Input must be a normal *non-zero* number.

## Testing

Several different methods are used to test these routines, depending on applicability. These are:

- Reference comparison: exhaustive testing
- Reference comparison: space-sampling
- Formal proving

When comparing to a reference, a natural question is the stability of the reference, i.e., is the reference answer the same across all versions or environments? Will the answer given by glibc/libm on AArch64 be the same as one given by a hardware FMA unit on a GPU? Fortunately, all "correct" implementations will give the same results for the same inputs.* In addition, POSIX has the same result-precision language. It's worth noting that -ffast-math doesn't currently affect FMA emission/fusion/fission/etc.

* - There are operations for which this is not true. Transcendental ops may
differ between implementations due to the
*table maker's dilemma*.

### Exhaustive testing

This is the happiest case - where the input space is so small that we can
iterate over every possible input, effectively treating the input as a binary
iteration counter. Sadly, this is uncommon (except, perhaps for ML math), as
binary32 is the precision floor for most problems, and a 64-bit input space is
well beyond our current abilities. Still - if your problem *can* be exhaustively
tested (with respect to a trusted reference), it *should* be exhaustively
tested!

None of our current ops are tested in this way, although the bf16 cases could/should be.

### Space-sampling

When the problem input space is too large for exhaustive testing, then random samples can be tested instead. This approach can't give complete verification of an implementation, but, given enough samples, it can yield a high degree of confidence.

The existing modules are tested in this way. This could be improved by preventing re-testing of any given sample (at the cost of memory and, perhaps, atomic/locking costs) and by identifying interesting "corner cases" of the input space and focusing on those.

### Formal verification

This sort of testing utilizes our formal solver infrastructure to prove correctness with the solver's internal FP implementation. This is fully described in the solvers documentation.