
Optimization Passes List

This is an automatically generated list of all optimization passes available with the default opt_main, produced from comments in the header files.

If the opt level is set below 'Min opt level' the pass will act as a no-op.

If the opt level is set above 'Cap opt level' the pass (or passes within the compound pass) will be executed with the opt level capped to the specified value.

Warning: Many of these passes have descriptions generated by the Gemini LLM that may not accurately reflect the behavior of the passes. Over time, the descriptions may be manually verified or edited to improve accuracy.

default_pipeline - The default pipeline.

Text-proto

Invoked Passes

arith_simp - Arithmetic Simplifications

This pass performs various arithmetic optimizations such as replacement of divide by a constant with non-divide operations.

Note: What follows was generated by the Gemini LLM. Not human verified.

The ArithSimplificationPass is an optimization pass in XLS that applies a wide range of arithmetic transformations to simplify expressions, reduce hardware complexity, and improve overall efficiency. It utilizes a StatelessQueryEngine to identify constant values and bit patterns, which enables many of its powerful transformations. This pass operates in an iterative fixed-point manner, meaning it repeatedly applies simplifications until no further changes can be made, ensuring a thorough optimization.

Key optimizations performed by this pass include:

  1. Division and Modulo by Constants:

    * Division by Power of Two: udiv(x, 2^K) is replaced with a logical right shift by K bits; sdiv(x, 2^K) is handled with an arithmetic right shift plus correction logic so the result rounds toward zero. This is a fundamental hardware optimization.

    * Modulo by Power of Two: umod(x, 2^K) is replaced with a bit slice to extract the K least significant bits. For smod(x, 2^K), it's handled by a select operation based on the sign of x to ensure correct signed modulo semantics.

    * Division/Modulo by Other Constants (Unsigned): udiv(x, K) where K is a constant (not a power of two) is transformed using "magic multiplication" algorithms, which replace division with a combination of shifts and multiplies. This is often significantly more efficient in hardware than a dedicated division unit.

    * Division/Modulo by Other Constants (Signed): sdiv(x, K) where K is a constant, also employs magic multiplication, with additional logic to correctly handle signed semantics and rounding towards zero.

    * Constant Dividend Optimization: If the dividend is a small constant (e.g., udiv(13, x) or sdiv(13, x)), the operation is replaced by a lookup table (PrioritySelect over a Literal array). This is particularly effective for small constant dividends where a full division circuit is overkill.
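
The power-of-two rewrites above can be checked outside the IR. The following Python model (helper names like udiv_pow2 are illustrative, not XLS APIs) verifies the unsigned cases exhaustively on 8-bit values:

```python
# Sketch (not XLS code): the power-of-two division/modulo rewrites,
# modeled on unsigned N-bit integers.

N = 8
MASK = (1 << N) - 1

def udiv_pow2(x, k):
    # udiv(x, 2^k) => logical right shift by k bits.
    return (x & MASK) >> k

def umod_pow2(x, k):
    # umod(x, 2^k) => bit_slice of the k least-significant bits.
    return x & ((1 << k) - 1)

# Exhaustive check against ordinary division/modulo.
for x in range(1 << N):
    for k in range(N):
        assert udiv_pow2(x, k) == x // (1 << k)
        assert umod_pow2(x, k) == x % (1 << k)
```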

  2. Shift Operations:

    * Logical Shift by Constant: shll(x, K) or shrl(x, K) where K is a constant, is replaced by a combination of bit_slice and concat operations. This avoids generating a potentially larger barrel shifter for fixed-amount shifts.

      ```
      // Original: shrl(x, literal(2, bits=N_shift_amount_width))
      // Example for shift by 2
      // Optimized: concat(literal(0, bits=2),
      //                   bit_slice(x, start=2, width=N-2))
      ```
    

    * Arithmetic Shift Right by Constant: shra(x, K) is replaced by a sign_extend of a bit_slice of x, effectively performing an arithmetic right shift without a barrel shifter.

    * Removal of Zero-Extended Shift Amounts: If a shift amount is the result of a concat with leading zeros (a common pattern for zero-extension), the leading zeros are removed from the shift amount, simplifying the shift input.

    * Removal of Shift Guards: If a shift amount is clamped by a select operation (e.g., select(UGt(amt, LIMIT), {LIMIT, amt})) where LIMIT is greater than or equal to the bit width of the value being shifted, the clamping logic is removed as it is redundant (any shift amount beyond the bit width produces the same result).
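
The shra rewrite can be modeled the same way. This Python sketch (to_signed and shra_via_slice are illustrative names) mimics sign_extend(bit_slice(x, start=K, width=N-K)) and checks it against a true arithmetic shift:

```python
# Sketch: shra(x, k) as sign-extension of a bit slice, on N-bit
# two's-complement values.

N = 8

def to_signed(v):
    v &= (1 << N) - 1
    return v - (1 << N) if v & (1 << (N - 1)) else v

def shra_via_slice(x, k):
    # bit_slice(x, start=k, width=N-k), then sign-extend back to N bits.
    hi = (x >> k) & ((1 << (N - k)) - 1)
    sign = (hi >> (N - k - 1)) & 1
    ext = ((1 << k) - 1) << (N - k) if sign else 0
    return (ext | hi) & ((1 << N) - 1)

# Exhaustive check: Python's >> on a signed int is an arithmetic shift.
for v in range(1 << N):
    for k in range(N):
        assert to_signed(shra_via_slice(v, k)) == to_signed(v) >> k
```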

  3. Comparisons with Negated Operands:

    * eq(-lhs, -rhs) => eq(lhs, rhs) (and similar for ne).

    * Signed comparisons like slt(-lhs, -rhs): These are transformed into an xor of the reversed comparison (sgt(lhs, rhs)) and additional terms to correctly handle MIN_VALUE edge cases (where neg(MIN_VALUE) == MIN_VALUE).

    * Signed comparison of negation to literal eq(-expr, K): These are transformed into eq(expr, -K), with similar xor logic for inequalities to correctly handle MIN_VALUE. These transformations are only applied if the negate operation has no other users to avoid introducing logic replication.
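
Why the equality rewrites are safe, including at MIN_VALUE, can be checked directly: negation modulo 2^N is a bijection, so it preserves equality. A small Python sketch:

```python
# Sketch: eq(-a, -b) == eq(a, b) on N-bit values, including the
# MIN_VALUE edge case where neg(MIN_VALUE) == MIN_VALUE.

N = 8
MASK = (1 << N) - 1

def neg(v):
    return (-v) & MASK

MIN_VALUE = 1 << (N - 1)
assert neg(MIN_VALUE) == MIN_VALUE  # the edge case the signed rewrites must handle

for a in range(1 << N):
    for b in range(1 << N):
        assert (neg(a) == neg(b)) == (a == b)
```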

  4. Multiply Operations:

    * Multiply by Power of Two: umul(x, 2^K) or smul(x, 2^K) are replaced with a left shift by K bits.

    * Multiply by Small Constant (Sum-of-Shifts): mul(x, K) where K is a small constant with a few set bits can be replaced by a sum of shifted x values (e.g., x * 5 => x + (x << 2)). This is applied if the number of adders required is below a kAdderLimit. A complementary optimization uses subtraction from a power-of-two shift for constants like 7 (e.g., x * 7 => (x << 3) - x).

    * Zero-Width Multiply Operands: Multiplies where one or both operands have a zero bit width are replaced by a literal zero, as the result will always be zero.

    * Multiply used by Narrowing Slice: If a multiply operation's result is only used by bit_slice operations that extract a narrower result, the multiply itself can be performed at the required output width, reducing the size of the multiplier hardware.
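
The sum-of-shifts and subtract-from-shift rewrites can be sketched in Python (mul5/mul7 are illustrative names) and checked exhaustively under N-bit wrapping arithmetic:

```python
# Sketch: constant-multiply rewrites on N-bit wrapping values.

N = 8
MASK = (1 << N) - 1

def mul5(x):
    # x * 5 => x + (x << 2), since 5 = 0b101
    return (x + (x << 2)) & MASK

def mul7(x):
    # x * 7 => (x << 3) - x, since 7 = 8 - 1
    return ((x << 3) - x) & MASK

for x in range(1 << N):
    assert mul5(x) == (x * 5) & MASK
    assert mul7(x) == (x * 7) & MASK
```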

  5. Decode Operations:

    * 1-bit Wide Decode: A decode operation with a 1-bit output is simplified to an eq(index, 0) comparison, as only the index 0 will produce a true output.

    * 1-bit Operand Decode: A decode operation with a 1-bit input and a 2-bit output is simplified to concat(operand, not(operand)), which is its direct one-hot representation.

    * Removal of Zero-Extended Index: Similar to shifts, if a decode index is formed by a concat with leading zeros (zero-extension), the leading zeros are removed from the index.

  6. decode(N) - 1 Pattern: The common pattern for creating a mask with the N least-significant bits set (e.g., sub(decode(N), 1)) is rewritten to not(shll(all_ones, N)), eliminating an adder and often resulting in more efficient hardware.
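
Both forms of the N-low-bits mask compute the same value; a Python sketch (with decode(n) modeled as 1 << n) confirms the rewrite:

```python
# Sketch: sub(decode(n), 1) vs. not(shll(all_ones, n)) on N-bit values.

N = 8
MASK = (1 << N) - 1

def mask_via_sub(n):
    # sub(decode(n), 1)
    return ((1 << n) - 1) & MASK

def mask_via_not_shll(n):
    # not(shll(all_ones, n))
    return (~(MASK << n)) & MASK

for n in range(N):
    assert mask_via_sub(n) == mask_via_not_shll(n)
```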

  7. Comparison of Injective Operations: The pass identifies patterns where the result of an injective arithmetic operation (like add or sub) is compared against a constant (e.g., (X + C0) == C1). These are simplified by performing the inverse arithmetic operation on the constants to isolate X (e.g., X == C1 - C0).
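
The injectivity argument is just modular arithmetic: adding a constant is a bijection on N-bit values, so the comparison can be moved onto the constants. A Python sketch with two arbitrary example constants:

```python
# Sketch: (x + c0) == c1 simplifies to x == (c1 - c0) modulo 2^N.

N = 8
MASK = (1 << N) - 1
c0, c1 = 37, 200  # arbitrary constants chosen for the demonstration

for x in range(1 << N):
    lhs = ((x + c0) & MASK) == c1
    rhs = x == ((c1 - c0) & MASK)
    assert lhs == rhs
```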

Because the pass runs to a fixed point, these simplifications compose: the output of one rewrite can expose opportunities for another, resulting in a more efficient IR for hardware synthesis.

Header

array_simp - Array Simplification

Pass which simplifies or eliminates some array-type operations such as ArrayIndex.

Note: What follows was generated by the Gemini LLM. Not human verified.

The ArraySimplificationPass is a comprehensive optimization pass designed to simplify and, where possible, eliminate various array-related operations within the XLS IR. The primary objective is to reduce the complexity of array manipulation, leading to more efficient and compact hardware implementations. This pass employs a combination of pattern matching and range analysis to identify and transform redundant or inefficient array constructs.

The pass operates in an iterative, fixed-point manner, continually applying transformations until no further simplifications can be made. It focuses on Array, ArrayIndex, ArrayUpdate, ArraySlice, and Select (or PrioritySelect) operations when they involve array types.

Key simplifications performed by this pass include:

  1. Clamping Out-of-Bounds ArrayIndex Indices: If range analysis (QueryEngine) can definitively determine that an ArrayIndex operation accesses an out-of-bounds element, the index is clamped to the maximum in-bounds index (i.e., the last element of the array). This ensures correct behavior in the generated hardware and can simplify subsequent logic by removing checks for illegal accesses.

  2. Simplifying ArrayIndex Operations:

    * Empty Indices: An array_index(A, {}) operation (where the index list is empty) is replaced directly by the array A itself, as it logically references the entire array.

    * ArrayIndex of ArrayUpdate with Matching Indices: If an array_index operation accesses the same index that was previously updated by an array_update operation (e.g., array_index(array_update(A, V, {idx}), {idx})), and the index is known to be in bounds, it is replaced directly with the updated value V. This bypasses the array storage entirely.

    * Trivial Dimensions: If an ArrayIndex accesses an array dimension of size 1 (e.g., bits[32][1]), the index for that dimension is replaced with a literal 0, as there is only one possible element to access. This simplifies the indexing logic.

    * ArrayIndex of Array with Constant First Index: An array_index(array(E0, E1, ...), {C, ...}) operation, where C is a constant index, is transformed to directly index into the corresponding element Ec. This bypasses the array construction, reducing IR nodes.

    * ArrayIndex of ArrayConcat with Constant First Index: Similar to Array, if an ArrayIndex accesses an array_concat with a constant first index, it bypasses the array_concat and directly indexes into the appropriate sub-array within the concatenated structure.

    * Consecutive ArrayIndex Operations: A sequence of nested array_index operations (e.g., array_index(array_index(A, {idx0}), {idx1})) is flattened into a single array_index(A, {idx0, idx1}), combining the indices for greater efficiency.

    * ArrayIndex of Select (or PrioritySelect): An operation of the form array_index(select(P, cases=[A0, A1]), {idx}) is transformed into select(P, cases=[array_index(A0, {idx}), array_index(A1, {idx})]). This effectively pushes the select operation inside the array access, potentially reducing the overall multiplexer (mux) width if the array elements are smaller than the entire array. This is particularly beneficial for "small" arrays or when the outer array_index is the sole user of the select operation.
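
The ArrayIndex-of-ArrayUpdate fold above can be modeled with Python lists standing in for IR arrays (array_update/array_index here are illustrative helpers, not XLS APIs):

```python
# Sketch: array_index(array_update(A, V, {i}), {i}) folds to V when the
# index is known to be in bounds.

def array_update(a, v, i):
    # Pure (non-mutating) update, matching IR dataflow semantics.
    return a[:i] + [v] + a[i + 1:]

def array_index(a, i):
    return a[i]

A = [10, 20, 30, 40]
for i in range(len(A)):
    assert array_index(array_update(A, 99, i), i) == 99  # folds to V
    assert A == [10, 20, 30, 40]  # the original array is untouched
```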

  3. Simplifying ArrayUpdate Operations:

    * Empty Indices: An array_update(A, V, {}) operation (with no indices) is replaced directly by the update value V, as it logically replaces the entire array.

    * Redundant ArrayUpdate: If an array_update operation updates an element with its current value (e.g., array_update(A, array_index(A, {idx}), {idx})) and the index is known to be in bounds, it is replaced with the original array A, as no actual change occurs.

    * ArrayUpdate on Unit-Dimension Arrays: An array_update(A[1], V, {idx}) (updating a 1-element array, possibly within a multi-dimensional array) is replaced with a select operation. This select conditionally chooses between an array-packed version of the update value V and the original array A, based on whether the index idx matches the single valid index. This allows for explicit conditional updates for these small arrays.

    * Hoisting ArrayUpdate above Array: An array_update applied to an array literal (e.g., array_update(array(A,B,C), V, {idx})) can be transformed into a new array where the element at idx is replaced by V, potentially after applying sub-updates if the array is multi-dimensional. This is particularly beneficial for constant arrays.

    * Flattening Sequential ArrayUpdate Chains: The pass identifies and flattens sequences of array_update operations that modify the same array. If a later update necessarily overwrites elements updated by an earlier array_update, the earlier, redundant update can be elided.

    * ArrayUpdate of ArrayIndex: An array_update of the form array_update(A, array_update(array_index(A, {i}), V, {j}), {i}) can be simplified to array_update(A, V, {i, j}), effectively merging nested update operations for increased efficiency.

  4. Simplifying Array Operations:

    * Decomposition and Recomposition Elimination: The pass can detect a pattern where an array is first decomposed into its individual elements (using array_index operations) and then immediately recomposed back into an array (using an array operation). If the indices are sequential and cover the entire range, this redundant decomposition-recomposition pair is eliminated, replacing the array operation with the original source array or a simple array_index on the source.

  5. Simplifying ArraySlice Operations:

    * Literal Start Index: An array_slice operation with a literal start index is replaced by an array literal constructed from the directly indexed elements of the source array. This eliminates dynamic slicing operations in hardware.

  6. Simplifying Select of Array-typed Values:

    * Conditional Array Assignment: A select of the form select(P, cases=[A, array_update(A, V, {idx})]) is transformed into array_update(A, select(P, cases=[array_index(A, {idx}), V]), {idx}). This optimization hoists the select inside the array_update, reducing the multiplexer width to operate only on the single array element being conditionally updated.

    * Select of Arrays to Array of Selects: A select operation where all its cases (and optionally default value) are array-typed values (e.g., select(P, cases=[array_0, array_1])) can be converted to an array of select operations (e.g., array(select(P, {array_0[0], array_1[0]}), select(P, {array_0[1], array_1[1]}), ...)). This transformation can expose smaller select operations for further optimization but also has the potential to increase logic replication. Therefore, it is applied judiciously based on heuristics.
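
The conditional-array-assignment rewrite can be checked with lists standing in for IR arrays (array_update and sel are illustrative helpers):

```python
# Sketch: select(P, cases=[A, array_update(A, V, {idx})])
#     ==  array_update(A, select(P, cases=[A[idx], V]), {idx})

def array_update(a, v, i):
    return a[:i] + [v] + a[i + 1:]

def sel(p, cases):
    return cases[p]

A = [5, 6, 7]
V = 99
for p in (0, 1):
    for idx in range(len(A)):
        before = sel(p, [A, array_update(A, V, idx)])
        after = array_update(A, sel(p, [A[idx], V]), idx)
        assert before == after
```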

This pass is essential for translating high-level array abstractions into efficient hardware structures by aggressively simplifying and optimizing array access and manipulation patterns, contributing significantly to improved hardware QoR.

Header

array_untuple - Array UnTuple

Pass which changes any (non-external) array-of-tuple into a tuple-of-arrays. We can see through tuples quite well but can't see through arrays to anywhere near the same extent. Therefore the struct-of-array representation is always superior.

Note that this pass makes no attempt to unpack or repack arrays which escape the function-base. This means that anything which comes in through a function param or a proc's receive, or escapes through a function return or a proc send, is not untupled.

TODO(allight): We could do this at the cost of a significant number of ir nodes. We should experiment to see if this is worth doing.

Note: What follows was generated by the Gemini LLM. Not human verified.

The ArrayUntuplePass is an optimization pass in XLS that transforms arrays of tuples into tuples of arrays. This conversion, often referred to as the "struct-of-arrays" (SoA) transformation, is generally superior for hardware implementation. The XLS IR and its optimization passes can "see through" tuples much more effectively than through arrays. By converting an array of structs into a struct of arrays, this pass exposes significantly more opportunities for subsequent optimizations (such as bit-slicing, dead code elimination, and common subexpression elimination), ultimately leading to more efficient and compact hardware designs.

Core Principle: The pass aims to convert a data structure of the form Array<Tuple<A, B, C>> into Tuple<Array<A>, Array<B>, Array<C>>.
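
The core principle is the classic AoS-to-SoA transposition, which a Python sketch (lists/tuples standing in for IR arrays/tuples) makes concrete:

```python
# Sketch: Array<Tuple<A, B>> => Tuple<Array<A>, Array<B>>.

aos = [(1, 'a'), (2, 'b'), (3, 'c')]          # array of tuples
soa = tuple(list(col) for col in zip(*aos))   # tuple of arrays

assert soa == ([1, 2, 3], ['a', 'b', 'c'])

# Element accesses commute with the transformation:
# tuple_index(array_index(aos, i), j) == array_index(tuple_index(soa, j), i)
for i in range(len(aos)):
    for j in range(2):
        assert aos[i][j] == soa[j][i]
```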

How it Works:

  1. FindUntupleGroups (Union-Find for Related Nodes): This function utilizes a UnionFind data structure to group nodes that represent parts of the same array-of-tuples structure and should be transformed together. It essentially identifies equivalence classes of nodes that are interconnected within the array-of-tuples dataflow. For instance, if an array_update modifies an array-of-tuples, both the original array and the array_update result are considered part of the same equivalence group. Operations considered for grouping include: ArrayUpdate, ArraySlice, ArrayConcat, Eq, Ne, Sel (and its variants PrioritySel, OneHotSel), Gate, and Next.

  2. FindExternalGroups (Identifying External Visibility): This is a crucial preliminary step that identifies "external" uses of arrays-of-tuples. If an array-of-tuples either originates from outside the current function/proc (e.g., as a function parameter or a proc's receive operand) or escapes its boundary (e.g., as a function return value or a proc's send operand), it cannot be untupled because doing so would alter the external interface of the component. Such groups of nodes are marked as "excluded" from transformation. Additional exclusions include:

    * Arrays of empty tuples (to avoid potential infinite loops in fuzzer-generated code).

    * Arrays of arrays (this pass currently does not recursively unwrap them).

    * The update_value operand of an array_update (as it represents an external input to the update operation).

    * Operands of unhandled operations that are arrays of tuples.

  3. UntupleVisitor (Applying Transformations): This DfsVisitorWithDefault traverses the function or proc nodes in topological order and applies the necessary transformations if a node represents an array-of-tuples and its equivalence group is not marked as excluded. The visitor handles various IR operations:

    * Literal: Converts an array-of-tuples literal into a tuple of array literals.

    * StateRead: For Procs, it untuples a StateRead operation by creating new StateElements for each tuple element, with each new element holding an array corresponding to that specific tuple field.

    * Next: Transforms a next operation on an array-of-tuples state element into a series of next operations, one for each tuple-of-arrays state element.

    * ArrayIndex: Converts an array_index operation on an array-of-tuples into a tuple of array_index operations, where each accesses a corresponding element from the untupled array structure.

    * Array: Transforms an array construction that creates an array-of-tuples into a tuple of array constructions, effectively building the tuple-of-arrays representation.

    * ArrayUpdate, ArraySlice, ArrayConcat: Similar transformations are applied to these operations, distributing them across the individual tuple elements to operate on the corresponding arrays.

    * Gate, Select (and its variants PrioritySel, OneHotSel): These operations are distributed across the elements of the tuple, creating new operations for each element of the untupled structure.

    * Eq, Ne (Comparisons): Comparisons on arrays of tuples are distributed across the tuple elements. For eq, the individual comparison results are and-reduced; for ne, they are or-reduced.

Benefits:

  • Improved Optimization Opportunities: By converting arrays of tuples to tuples of arrays, the pass exposes the individual array components. This allows other subsequent passes (e.g., BitSliceSimplificationPass, DcePass, CsePass) to operate more effectively, as they are generally more adept at optimizing bit-level operations and tuple structures.
  • Reduced Hardware Complexity: Can lead to a more modular and efficient hardware implementation by breaking down complex array-of-tuple structures into simpler, parallel arrays.
  • Enhanced IR Clarity: The transformed IR can be easier to understand and reason about, as logical components are grouped more coherently and less opaque data structures are used.

Example (tuple_index(array_index(array_of_tuples, idx), element_idx)): Consider an array of tuples where we first index into the array, and then extract an element from the resulting tuple:

// Original IR snippet
fn my_func(my_array: (bits[32], bits[16])[4], idx: bits[2]) -> bits[16] {
  array_element: (bits[32], bits[16]) = array_index(my_array, indices=[idx])
  ret tuple_element: bits[16] = tuple_index(array_element, index=1)
}

After ArrayUntuplePass, my_array would conceptually be transformed into separate arrays for each tuple element, e.g., (my_array_elem0: bits[32][4], my_array_elem1: bits[16][4]). The access pattern would then become simplified:

// Optimized IR (conceptually, after ArrayUntuplePass and DCE)
fn my_func(my_array_elem0: bits[32][4],
           my_array_elem1: bits[16][4],
           idx: bits[2]) -> bits[16] {
  ret array_index_elem1: bits[16] = array_index(my_array_elem1,
                                                indices=[idx])
}

The pass effectively streamlines data access by creating direct arrays for each tuple element, allowing for more direct and optimized indexing into the relevant array, eliminating the intermediate tuple operations.

Header

basic_simp - Basic Simplifications

This pass does simple pattern-matching optimizations which are ~always a good idea to do (replacing a node with a constant, removing operands of nodes, etc). They improve QoR, do not increase the number of nodes in the graph, preserve the same abstraction level, and do not impede later optimizations via obfuscation. These optimizations require no analyses beyond looking at the node and its operands. Examples include: not(not(x)) => x, x + 0 => x, etc.

Note: What follows was generated by the Gemini LLM. Not human verified.

The BasicSimplificationPass applies a collection of straightforward, pattern-matching optimizations to the XLS IR. These optimizations are designed to improve the Quality of Results (QoR) without significantly increasing the number of nodes in the graph, preserving the current abstraction level, and avoiding obfuscation that might hinder subsequent optimization passes. The pass primarily relies on local analysis of a node and its immediate operands.

The pass iteratively traverses the nodes in the function (or proc) and attempts to apply various simplification patterns until a fixed point is reached (no further changes can be made).

Key simplifications performed by this pass include:

  1. Identity Operations:

    * add(x, 0) => x (and similar for sub, shll, shrl, shra with 0)

    * and(x, x) => x (and or(x, x) => x)

    * nand(x, x, x) => not(x)

    * xor(x, x) => 0 (for an even number of identical x operands)

    * xor(x, x, x) => x (for an odd number of identical x operands)

    * not(not(x)) => x

    * neg(neg(x)) => x

    * and_reduce(x) => x (if x is 1-bit)

    * or_reduce(x) => x (if x is 1-bit)

    * xor_reduce(x) => x (if x is 1-bit)

    * priority_sel(bits[0], cases=[], default=X) => X (a priority select with a zero-width selector always selects its default value)

  2. Constant Propagation and Folding:

    * umul(x, 0) or smul(x, 0) => 0

    * or(x, -1) => -1 (all-ones literal)

    * nor(x, -1) => 0

    * and(x, 0) => 0

    * nand(x, 0) => -1 (all ones)

    * xor(x, -1) => not(x)

    * and_reduce([]) => 1 (identity for AND reduction on an empty set)

    * or_reduce([]) => 0 (identity for OR reduction on an empty set)

    * xor_reduce([]) => 0 (identity for XOR reduction on an empty set)

    * Consolidation of multiple literal operands in associative/commutative N-ary operations (e.g., and(x, literal_A, y, literal_B) => and(x, y, literal_A_and_B))

  3. Boolean Algebra Simplifications:

    * and(x, not(x)) => 0

    * or(x, not(x)) => -1 (all ones)

    * add(x, not(x)) => -1 (all ones, i.e., 2^N - 1 for bits[N])

    * eq(x: bits[1], 1) => x

    * eq(x: bits[1], 0) => not(x)

  4. Redundant Operand Removal:

    * or(x, 0, y) => or(x, y)

    * and(x, -1, y) => and(x, y)

    * Removal of duplicate operands in N-ary and, or, nand, nor, and xor operations (e.g., and(x, y, y, x) => and(x, y))

  5. Flattening Nested Associative N-ary Operations: If an outer N-ary operation has an inner N-ary operation of the same opcode as one of its operands, and the inner operation has only one user (the outer operation), the inner operation's operands are flattened directly into the outer operation.

    Example:

    or.5: bits[8] = or(w, x)
    or.6: bits[8] = or(y, z)
    ret result: bits[8] = or(or.5, or.6)
    
    becomes:

    ret result: bits[8] = or(w, x, y, z)
    
  6. Comparison Simplifications:

    * eq(x, x) => 1

    * ne(x, x) => 0

    * eq(x, y) => 1 (if x and y are zero-width values)

    * ne(x, y) => 0 (if x and y are zero-width values)
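
Several of the identities above can be checked mechanically. This Python sketch models them with bitwise operators on 8-bit values (masking stands in for the fixed bit width of IR values):

```python
# Sketch: a handful of the basic simplification identities, verified
# exhaustively on 8-bit values.

N = 8
MASK = (1 << N) - 1

for x in range(1 << N):
    assert (x & x) == x                        # and(x, x) => x
    assert (x ^ x) == 0                        # xor(x, x) => 0
    assert (~(~x & MASK) & MASK) == x          # not(not(x)) => x
    assert (x & (~x & MASK)) == 0              # and(x, not(x)) => 0
    assert (x | (~x & MASK)) == MASK           # or(x, not(x)) => all ones
    assert ((x + (~x & MASK)) & MASK) == MASK  # add(x, not(x)) => 2^N - 1
```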

This pass is designed to be efficient and broadly applicable, serving as a foundational optimization before more complex analyses and transformations are applied.

Header

bdd_cse - BDD-based Common Subexpression Elimination

Pass which commons equivalent expressions in the graph using binary decision diagrams.

Note: What follows was generated by the Gemini LLM. Not human verified.

The BddCsePass (BDD-based Common Subexpression Elimination) is an optimization pass in XLS that identifies and merges logically equivalent expressions within a function or proc by leveraging Binary Decision Diagrams (BDDs). Unlike the purely syntactic CsePass, BddCsePass can detect equivalences even when the expressions have different syntactic forms, as it relies on the canonical representation of Boolean functions provided by BDDs. This capability leads to more aggressive common subexpression elimination and significantly reduces hardware redundancy.

How it Works:

  1. BDD Representation: The core of this pass is the BddQueryEngine. For each bit-typed node in the function, the BddQueryEngine constructs a Binary Decision Diagram. A BDD is a directed acyclic graph that represents a Boolean function. Crucially, for any given Boolean function, its BDD is canonical (unique) if a fixed variable ordering is used. This property allows the pass to determine if two different expressions compute the same Boolean function simply by comparing their BDDs.

  2. Node Ordering for Determinism and Performance: The pass first determines a specific order in which to process the nodes (GetNodeOrder). This order is a topological sort (necessary to avoid introducing cycles) and prioritizes nodes on less critical paths. This prioritization helps ensure that CSE replacements do not inadvertently increase the overall critical path delay of the design.

  3. Equivalence Bucketing: The pass iterates through the nodes in the determined order. For each bit-typed node, it computes a hash value based on the BDD node indices of all its constituent bits. Nodes with the same hash are grouped together into "node buckets," as they are potentially logically equivalent.

  4. Logical Equivalence Check: Within each bucket, the pass performs a pairwise comparison of nodes using the is_same_value function. This function directly compares the BDDs of the two nodes. If their BDDs are identical, the nodes are confirmed to be logically equivalent.

  5. Replacement: When two or more logically equivalent nodes are found (e.g., node_A and node_B, where node_A was encountered earlier in the node_order and is chosen as the representative):

    * The later encountered node (node_B) is replaced by the earlier node (node_A). This means all users of node_B are rewired to use node_A instead.

    * The pass records that a change occurred.

  6. Post-DCE: After BddCsePass has completed its execution, a DeadCodeEliminationPass is typically run. This is essential to remove any nodes that became dead code as a result of being replaced by an equivalent node, further cleaning up the IR.

Benefits:

  • Semantic Equivalence: BddCsePass excels at identifying and commoning subexpressions that are semantically (logically) equivalent, even if their textual representation in the IR is different. This is a significant advantage over purely syntactic CSE approaches.
  • Hardware Reduction: By eliminating redundant computations, the pass directly reduces the amount of hardware required, leading to smaller area, lower power consumption, and potentially faster circuits.
  • Improved IR Quality: The resulting IR is more compact, canonical, and easier for subsequent optimization and synthesis tools to process.

Example: Consider two expressions that compute the same Boolean function but have different IR structures: x_eq_42: bits[1] = eq(x, literal(42)) and forty_two_not_ne_x: bits[1] = not(ne(literal(42), x)). Syntactically, these expressions are different.

// Original IR snippet
fn FuzzTest(x: bits[16]) -> (bits[1], bits[1]) {
  forty_two: bits[16] = literal(value=42)
  x_eq_42: bits[1] = eq(x, forty_two)
  forty_two_not_ne_x: bits[1] = not(ne(forty_two, x))
  ret tuple.0: (bits[1], bits[1]) = tuple(x_eq_42, forty_two_not_ne_x)
}

BddCsePass would build BDDs for both x_eq_42 and forty_two_not_ne_x. Since these two expressions are logically equivalent, their BDDs would be identical. The pass would then replace one with the other (e.g., forty_two_not_ne_x would be replaced by x_eq_42), resulting in shared logic:

// Optimized IR (after BddCsePass and a subsequent DCE pass)
fn FuzzTest(x: bits[16]) -> (bits[1], bits[1]) {
  forty_two: bits[16] = literal(value=42)
  x_eq_42: bits[1] = eq(x, forty_two)
  ret tuple.0: (bits[1], bits[1]) = tuple(x_eq_42, x_eq_42)
}

This effectively eliminates the redundant computation of the second comparison, leading to a more optimized design.
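
As a stand-in for BDD canonicalization (which scales far better than enumeration), an exhaustive truth table is also a canonical form for small input widths. This Python sketch shows why the two expressions land in the same equivalence bucket; the 8-bit width (narrower than the bits[16] in the IR example) is an assumption made to keep the table small:

```python
# Sketch: two syntactically different expressions with identical canonical
# forms, so a BDD-style CSE would merge them.

W = 8  # assumed input width for the demonstration

def truth_table(f):
    # Canonical form: the function's output over every possible input.
    return tuple(f(x) for x in range(1 << W))

x_eq_42 = lambda x: int(x == 42)             # eq(x, 42)
not_ne_42 = lambda x: int(not (42 != x))     # not(ne(42, x))

# Different syntax, identical canonical form => same bucket, CSE fires.
assert truth_table(x_eq_42) == truth_table(not_ne_42)
```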

Header

bdd_simp - BDD-based Simplification

Runs BDD-based simplifications on the function.

Note: What follows was generated by the Gemini LLM. Not human verified.

The BddSimplificationPass leverages Binary Decision Diagrams (BDDs) to perform advanced logical simplifications on the IR. BDDs provide a canonical representation of Boolean functions, enabling powerful, logic-level optimizations that are challenging for traditional syntactic or dataflow-based passes.

The pass currently focuses on the following key areas:

  1. Replacement of Statically Known Values with Literals: Through BDD analysis, if a node's value (or specific bits of it) can be determined to be constant under all reachable conditions, it is replaced with a literal.

    Example: If my_and: bits[1] = and(param_x, param_y) and BDD analysis proves param_x is always 0b0, then my_and can be replaced with literal(0, bits=1).

  2. Removal of Redundant Operands in Boolean Operations: For N-ary Boolean operations (and, nand, or, nor), the pass identifies and removes operands that are logically redundant given the values or relationships of other operands. For instance, in and(A, B, C), if C is always true whenever A and B are true, C is redundant.

    Example: If C is a predicate a_and_b: bits[1] = and(A, B)

    ret my_and: bits[1] = and(A, B, C)

    This can be simplified to:

    ret my_and: bits[1] = and(A, B)

  3. Collapse of Select Chains into OneHotSelects (at opt_level >= 2): Chains of binary select operations, particularly where one case of a select feeds into another select (e.g., sel(pred_a, {val_a, sel(pred_b, {val_b, default_val})})), are transformed into a single one_hot_select if the selectors are provably disjoint (at most one can be true at any given time). This simplifies the IR structure and can lead to more efficient hardware implementations.

  4. Replacement of Two-Way OneHotSelect with Select (at opt_level >= 2): A one_hot_select with a 2-bit selector that is known to be one-hot can be replaced with a simpler select operation, as the one-hot property means the decision effectively depends on a single bit. This reduces the complexity of the select operation.

  5. Simplification of PrioritySelect Operations (at opt_level >= 1): If BDD analysis determines that the selector of a priority_select is guaranteed to have at least one bit set within a certain range, the priority_select can be narrowed. This involves trimming higher-priority cases that are guaranteed not to be selected, effectively reducing the number of cases the hardware needs to evaluate.
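
The disjoint-selector collapse (point 3 above) can be modeled directly. In this Python sketch, one_hot_sel ORs together the selected cases, and a synthesized "neither fired" bit carries the default; the helper names are illustrative, not XLS APIs:

```python
# Sketch: a chain of binary selects with disjoint predicates collapsed
# into a single one-hot select.

def one_hot_sel(bits, cases):
    # ORs every case whose selector bit is set (0 when no bit is set).
    acc = 0
    for b, c in zip(bits, cases):
        if b:
            acc |= c
    return acc

def chain(pred_a, pred_b, val_a, val_b, default):
    inner = val_b if pred_b else default
    return val_a if pred_a else inner

val_a, val_b, default = 0x11, 0x22, 0x33
for pred_a in (0, 1):
    for pred_b in (0, 1):
        if pred_a and pred_b:
            continue  # the rewrite requires provably disjoint predicates
        neither = int(not (pred_a or pred_b))
        got = one_hot_sel((pred_a, pred_b, neither), (val_a, val_b, default))
        assert got == chain(pred_a, pred_b, val_a, val_b, default)
```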

As BDD-based analysis can be computationally intensive, some optimizations are applied conditionally based on the optimization level (opt_level).

Header

bdd_simp(2) - BDD-based Simplification with opt_level <= 2

Runs BDD-based simplifications on the function.

Note: What follows was generated by the Gemini LLM. Not human verified.

The BddSimplificationPass leverages Binary Decision Diagrams (BDDs) to perform advanced logical simplifications on the IR. BDDs provide a canonical representation of Boolean functions, enabling powerful, logic-level optimizations that are challenging for traditional syntactic or dataflow-based passes.

The pass currently focuses on the following key areas:

  1. Replacement of Statically Known Values with Literals: Through BDD analysis, if a node's value (or specific bits of it) can be determined to be constant under all reachable conditions, it is replaced with a literal.

    Example: If my_and: bits[1] = and(param_x, param_y) and BDD analysis proves param_x is always 0b0, then my_and can be replaced with literal(0, bits=1).

  2. Removal of Redundant Operands in Boolean Operations: For N-ary Boolean operations (and, nand, or, nor), the pass identifies and removes operands that are logically redundant given the values or relationships of other operands. For instance, in and(A, B, C), if C is always true whenever A and B are true, C is redundant.

    Example: if C is itself the predicate a_and_b: bits[1] = and(A, B), then

    ret my_and: bits[1] = and(A, B, C)

    can be simplified to:

    ret my_and: bits[1] = and(A, B)

  3. Collapse of Select Chains into OneHotSelects (at opt_level >= 2): Chains of binary select operations, particularly where one case of a select feeds into another select (e.g., sel(pred_a, {val_a, sel(pred_b, {val_b, default_val})})), are transformed into a single one_hot_select if the selectors are provably disjoint (at most one can be true at any given time). This simplifies the IR structure and can lead to more efficient hardware implementations.

  4. Replacement of Two-Way OneHotSelect with Select (at opt_level >= 2): A one_hot_select with a 2-bit selector that is known to be one-hot can be replaced with a simpler select operation, as the one-hot property means the decision effectively depends on a single bit. This reduces the complexity of the select operation.

  5. Simplification of PrioritySelect Operations (at opt_level >= 1): If BDD analysis determines that the selector of a priority_select is guaranteed to have at least one bit set within a certain range, the priority_select can be narrowed. This involves trimming higher-priority cases that are guaranteed not to be selected, effectively reducing the number of cases the hardware needs to evaluate.

As BDD-based analysis can be computationally intensive, some optimizations are applied conditionally based on the optimization level (opt_level).

Header

bdd_simp(3) - BDD-based Simplification with opt_level <= 3

Runs BDD-based simplifications on the function.

Note: What follows was generated by the Gemini LLM. Not human verified.

The BddSimplificationPass leverages Binary Decision Diagrams (BDDs) to perform advanced logical simplifications on the IR. BDDs provide a canonical representation of Boolean functions, enabling powerful, logic-level optimizations that are challenging for traditional syntactic or dataflow-based passes.

The pass currently focuses on the following key areas:

  1. Replacement of Statically Known Values with Literals: Through BDD analysis, if a node's value (or specific bits of it) can be determined to be constant under all reachable conditions, it is replaced with a literal.

    Example: If my_and: bits[1] = and(param_x, param_y) and BDD analysis proves param_x is always 0b0, then my_and can be replaced with literal(0, bits=1).

  2. Removal of Redundant Operands in Boolean Operations: For N-ary Boolean operations (and, nand, or, nor), the pass identifies and removes operands that are logically redundant given the values or relationships of other operands. For instance, in and(A, B, C), if C is always true whenever A and B are true, C is redundant.

    Example: if C is itself the predicate a_and_b: bits[1] = and(A, B), then

    ret my_and: bits[1] = and(A, B, C)

    can be simplified to:

    ret my_and: bits[1] = and(A, B)

  3. Collapse of Select Chains into OneHotSelects (at opt_level >= 2): Chains of binary select operations, particularly where one case of a select feeds into another select (e.g., sel(pred_a, {val_a, sel(pred_b, {val_b, default_val})})), are transformed into a single one_hot_select if the selectors are provably disjoint (at most one can be true at any given time). This simplifies the IR structure and can lead to more efficient hardware implementations.

  4. Replacement of Two-Way OneHotSelect with Select (at opt_level >= 2): A one_hot_select with a 2-bit selector that is known to be one-hot can be replaced with a simpler select operation, as the one-hot property means the decision effectively depends on a single bit. This reduces the complexity of the select operation.

  5. Simplification of PrioritySelect Operations (at opt_level >= 1): If BDD analysis determines that the selector of a priority_select is guaranteed to have at least one bit set within a certain range, the priority_select can be narrowed. This involves trimming higher-priority cases that are guaranteed not to be selected, effectively reducing the number of cases the hardware needs to evaluate.

As BDD-based analysis can be computationally intensive, some optimizations are applied conditionally based on the optimization level (opt_level).

Header

bitslice_simp - Bit-slice simplification

Pass which simplifies bit-slices. This includes collapsing sequential bit-slices, eliminating degenerate full-width slices, and others.

Note: What follows was generated by the Gemini LLM. Not human verified.

The BitSliceSimplificationPass is a crucial optimization pass that targets and simplifies operations related to bit manipulation: bit_slice, dynamic_bit_slice, and bit_slice_update. The overarching goal is to reduce logical complexity, enhance hardware efficiency, and generate cleaner IR by eliminating redundancies and transforming patterns into more canonical forms that are easier for hardware synthesis tools to process.

The pass employs a combination of local pattern matching and advanced query engine analysis (utilizing LazyTernaryQueryEngine or PartialInfoQueryEngine for higher optimization levels) to identify and capitalize on simplification opportunities.

Key simplifications performed by this pass include:

  1. Simplifying BitSlice Operations:

    * Full-Width Slices: A bit_slice that extracts the entire bit width of its operand (e.g., bit_slice(x, start=0, width=N)) is a no-operation and is directly replaced by its operand x.

    * Consecutive BitSlice Operations: Nested bit_slice operations (e.g., bit_slice(bit_slice(x, s0, w0), s1, w1)) are collapsed into a single bit_slice(x, s0 + s1, w1), effectively combining the slicing operations.

    * Hoisting BitSlice above Concat: A bit_slice of a concat operation can often be rewritten as a concat of smaller bit_slices of the original concat operands. This allows for more precise extraction of relevant bits and can eliminate unnecessary concatenations.

    * Hoisting BitSlice above Bitwise/Arithmetic Operations: If a bit_slice is the sole consumer of a bitwise operation (e.g., and, or, xor) or targets the low bits of an arithmetic operation (e.g., add, sub, neg), the bit_slice can be pushed down into the operands. This transforms bit_slice(op(A, B), ...) into op(bit_slice(A, ...), bit_slice(B, ...)), reducing the width of the intermediate op result.

    * Hoisting BitSlice above SignExt: Handles various cases where a bit_slice operates on a sign_ext operation. It simplifies this by transforming it into a direct slice of the original value or a sign_ext of a smaller slice, depending on where the slice falls relative to the sign bit of the extended value.

    * Hoisting BitSlice above Shifts and Decodes: If all users of a shift or decode operation are bit-slices entirely contained within a larger slice, the bit_slice can be hoisted above the shift/decode. This effectively performs the shift/decode on a smaller bit-width operand, reducing computation.

    * Hoisting BitSlice above Selects: If all users of a select (or one_hot_select/priority_select) are bit-slices entirely contained within a larger slice, the bit_slice can be hoisted above the select. This results in the select operating on smaller, sliced operands, which can reduce the multiplexer complexity.

  2. Simplifying DynamicBitSlice Operations:

    * Out-of-Bounds Slices: A dynamic_bit_slice whose start index is provably out of bounds (meaning it attempts to slice entirely beyond the operand's bit width) is replaced with a literal zero, as no data can be extracted.

    * Literal Start Index: A dynamic_bit_slice with a known constant start index is converted into a static bit_slice operation. This eliminates dynamic addressing logic in the hardware.

    * Hoisting DynamicBitSlice above Select of Literals: If the start index of a dynamic_bit_slice is the result of a select operation where all the cases are literal constant values, the dynamic_bit_slice can be hoisted. This replaces the original select with a new select whose cases are static bit_slice operations, each corresponding to one of the literal start indices.

      ```
      // Original IR
      start1: bits[32] = literal(value=5)
      start2: bits[32] = literal(value=25)
      p: bits[32] = select(x, cases=[start1, start2])
      q: bits[45] = dynamic_bit_slice(to_slice, p, width=45)
    
      // Optimized IR (conceptually)
      slice1: bits[45] = bit_slice(to_slice, start=5, width=45)
      slice2: bits[45] = bit_slice(to_slice, start=25, width=45)
      q: bits[45] = select(x, cases=[slice1, slice2])
      ```
    

    * Scaled Dynamic Bit Slices: Optimizes dynamic_bit_slice operations where the start index is a known multiple of the slice width. This involves conceptually treating the to_slice operand as an array of smaller bit elements and then using a select operation to choose the correct element from this conceptual array.

  3. Simplifying BitSliceUpdate Operations:

    * Out-of-Bounds Updates: A bit_slice_update whose start index is provably out of bounds is replaced by its to_update operand, as it would have no effect on the target bit vector.

    * Literal Start Index: A bit_slice_update with a known constant start index is replaced by an equivalent concat and bit_slice structure. This eliminates the need for dynamic update logic in the hardware.

    * Hoisting BitSliceUpdate above Select of Literals: Similar to DynamicBitSlice with select of literals, if the start index of a bit_slice_update is derived from a select of literals, the bit_slice_update can be hoisted, resulting in static bit_slice_update operations for each case of the original select.

    * Scaled Bit Slice Updates: Optimizes bit_slice_update operations where the start index is a known multiple of the update value's width. This involves converting the to_update operand into an array of smaller bit elements, performing an array_update on this conceptual array, and then flattening the result back into a bit vector using concat.

The pass operates in a worklist-driven fashion, particularly for bit_slice nodes, ensuring that simplifications are iteratively applied until a fixed point is reached. This comprehensive approach guarantees a thorough optimization of bit manipulation logic, crucial for high-quality hardware generation.
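A few of these rewrites can be sanity-checked with a small Python model of bit_slice (shift and mask); this is an illustrative sketch, not the XLS implementation:

```python
def bit_slice(x, start, width):
    """Static bit_slice: extract `width` bits of x beginning at bit `start`."""
    return (x >> start) & ((1 << width) - 1)

x = 0b1101_0110_1011  # a 12-bit value

# Consecutive slices collapse: slicing a slice equals one slice at the
# summed start offset (the outer slice must fit inside the inner one).
assert bit_slice(bit_slice(x, 2, 8), 3, 4) == bit_slice(x, 2 + 3, 4)

# A full-width slice is the identity.
assert bit_slice(x, 0, 12) == x

# A slice whose start is entirely out of bounds yields zero, matching the
# dynamic_bit_slice out-of-bounds rule.
assert bit_slice(x, 100, 4) == 0
```

The same shift-and-mask view also explains the literal-start-index rule: once `start` is a known constant, the dynamic addressing logic degenerates to fixed wiring.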

Header

bool_simp - Boolean simplification

Attempts to simplify bitwise / boolean expressions (e.g. of multiple variables).

Note: What follows was generated by the Gemini LLM. Not human verified.

The BooleanSimplificationPass focuses on optimizing logical expressions within the IR. It achieves this by analyzing the truth table generated from a local cluster of Boolean operations and replacing them with a simpler, logically equivalent expression. This is particularly effective for reducing complex combinations of and, or, not, nand, nor, and xor operations into more fundamental and efficient forms.

The core mechanism involves:

  1. Truth Table Generation: For a given Boolean expression, the pass traces its inputs up to a small, manageable set of "frontier" nodes (typically parameters or results of non-Boolean operations). It then conceptually "flows" all possible combinations of true/false values from these frontier nodes through the intermediate Boolean operations to determine the resulting truth table of the entire expression. This is currently performed for up to 3 frontier nodes (X, Y, Z), resulting in an 8-bit truth table.

    Example (conceptual truth table for inputs X, Y, Z):

    X: 00001111   // Represents the value of X for all 8 input combinations
    Y: 00110011   // Represents the value of Y for all 8 input combinations
    Z: 01010101   // Represents the value of Z for all 8 input combinations
    // ... The pass computes the output bit vector based on the logic of the
    //     expression e.g., for `and(X, Y)`, the result would be `00000001`
    
  2. Truth Table Matching and Replacement: The computed N-bit truth table is then compared against a pre-computed set of canonical truth tables for various simple Boolean operations (e.g., and, or, not, nand). If a match is found and the canonical replacement is simpler (based on a cost heuristic), the original complex Boolean expression is replaced by this simpler form. This can also detect expressions that are always true or always false and replace them with literal values.

    Example (original expression involving not and and):

    fn f(x: bits[42], y: bits[42]) -> bits[42] {
      and_op: bits[42] = and(x, y)
      ret not_op: bits[42] = not(and_op)
    }
    
    After BooleanSimplificationPass, this could be transformed into:

    fn f(x: bits[42], y: bits[42]) -> bits[42] {
      ret nand_op: bits[42] = nand(x, y)
    }
    
  3. Handling of Redundancy and Symmetry: The pass incorporates knowledge of Boolean algebra rules, including symmetric properties of operations (e.g., and(A, B) is equivalent to and(B, A)) and the identification of redundant inputs (e.g., in and(X, X, Y), one X is redundant).

This pass is fundamental for logic optimization, as it can significantly reduce the complexity of Boolean logic, leading to more compact and faster hardware implementations.
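The matching step can be sketched in Python using the 8-bit truth-table vectors shown above; the canonical table set here is a toy subset chosen for illustration, not the pass's actual table:

```python
# One bit per input combination, mirroring the X/Y/Z rows above.
X, Y, Z = 0b00001111, 0b00110011, 0b01010101
MASK = 0xFF  # 8 input combinations -> 8-bit truth tables

canonical = {
    "and(X, Y)":  X & Y,
    "or(X, Y)":   X | Y,
    "nand(X, Y)": ~(X & Y) & MASK,
    "not(X)":     ~X & MASK,
}

# Truth table of the expression not(and(x, y)) from the example above.
table = ~(X & Y) & MASK
matches = [name for name, t in canonical.items() if t == table]
assert matches == ["nand(X, Y)"]
```

Because the table is canonical, any expression over the same frontier nodes with this bit pattern can be replaced by the cheaper nand, regardless of how it was originally written.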

Header

canon - Canonicalization

IR Canonicalization.

Note: What follows was generated by the Gemini LLM. Not human verified.

The CanonicalizationPass performs various transformations to bring the IR into a standardized, canonical form. This simplifies subsequent optimization passes by ensuring consistent patterns for common operations. By having a predictable structure, other passes can focus on more complex logical transformations without needing to handle numerous syntactic variations of the same underlying operation.

Key canonicalizations include:

  1. Ordering of Literals in Commutative/Comparison Operations: For commutative operations (like add, mul, and, or, xor) and comparison operations, literals are consistently moved to the right-hand side. This means an expression like add(constant, x) becomes add(x, constant). For comparisons, ult(constant, x) becomes ugt(x, constant).

    Example:

    one:bits[2] = literal(value=1)
    ret addval: bits[2] = add(one, x)
    

    becomes:

    ret addval: bits[2] = add(x, one)
    
  2. Conversion of Subtraction to Addition with Negation: sub(x, literal) is transformed into add(x, negate(literal)).

    Example:

    one:bits[2] = literal(value=1)
    ret subval: bits[2] = sub(x, one)
    

    becomes (assuming a 2-bit unsigned system where -1 is 3):

    ret addval: bits[2] = add(x, literal(value=3))
    
  3. Removal of No-op Zero-extends: If a zero_ext operation extends to the same bit width as its operand, it's a no-op and is removed.

    Example:

    zero_ext: bits[16] = zero_ext(x, new_bit_count=16)
    

    becomes:

    x
    
  4. Replacement of Zero-extend with Concat: A zero_ext operation is replaced by a concat operation with a zero literal to explicitly show the bit extension.

    Example:

    zero_ext: bits[42] = zero_ext(x: bits[33], new_bit_count=42)
    

    becomes:

    concat: bits[42] = concat(literal(0, bits=9), x)
    
  5. Removal of Useless Bitwise Reductions: Bitwise reductions (and_reduce, or_reduce, xor_reduce) on single-bit operands are replaced by the operand itself.

    Example:

    and_reduce.1: bits[1] = and_reduce(x: bits[1])
    

    becomes:

    x
    
  6. Canonicalization of Clamp Operations: Various forms of clamping operations (min/max) implemented with select and comparison operations are canonicalized to a consistent select form.

  7. Inversion of Selector for Select Operations: If a 2-way select has a not operation as its selector, the not is removed and the cases are swapped.

    Example:

    not.1: bits[1] = not(p)
    ret sel.2: bits[8] = sel(not.1, cases=[x, y])
    

    becomes:

    ret sel.2: bits[8] = sel(p, cases=[y, x])
    
  8. Simplification of Select with Full Case Coverage: A select operation with a default value is transformed into a select without a default if the number of cases plus the default covers all possible selector values.

  9. Simplification of Next/StateRead Predicates: For next_value and state_read operations, predicates that are always true are removed, making the operation unconditional. If a next_value predicate is always false, the next_value is removed.

These canonicalizations aim to simplify the graph, make patterns more recognizable for other passes, and improve the overall efficiency of downstream optimizations.
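Rule 2's example can be verified with 2-bit modular arithmetic in Python (an illustrative check, not XLS code):

```python
WIDTH = 2
MOD = 1 << WIDTH  # 2-bit values wrap mod 4

def negate(v):
    """Two's-complement negation at the given bit width."""
    return (-v) % MOD

# sub(x, 1) == add(x, negate(1)) == add(x, 3) for every 2-bit x.
for x in range(MOD):
    assert (x - 1) % MOD == (x + negate(1)) % MOD == (x + 3) % MOD
```

This is why the documentation notes that in a 2-bit system -1 is 3: negation is taken modulo 2^width.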

Header

channel_legalization - Legalize multiple send/recvs per channel

Pass that legalizes multiple send/receive operations per channel.

This pass adds cross-activation tokens to guarantee that later activations of a proc cannot send or receive on a channel until all previous activations have completed working with that channel.

The ChannelLegalizationPass is an optimization pass designed to ensure the correctness and determinism of hardware designs that contain multiple send or receive operations targeting the same channel within a single Proc activation. In hardware, simultaneous, unordered access to a shared channel can lead to non-deterministic behavior, race conditions, or resource conflicts. This pass addresses these issues by introducing explicit token dependencies to enforce a well-defined ordering of these operations.

The pass operates by analyzing the communication patterns within Procs and enforcing strictness rules based on the channel's strictness attribute. This attribute, defined in the channel configuration, dictates how access to a shared channel should be managed:

  • kTotalOrder: All operations on the channel must adhere to a strict total order. The pass verifies if the existing token graph already enforces this order; if not, it introduces mechanisms (cross-activation tokens) to explicitly enforce it.
  • kRuntimeOrdered: Similar to kTotalOrder, but allows for dynamic ordering at runtime as long as correctness is maintained. The pass inserts assertions to verify this at runtime if violations occur.
  • kProvenMutuallyExclusive: Operations on the channel are expected to be mutually exclusive at compile time (i.e., only one operation can be active in any given activation). The pass uses a Z3 solver to formally prove this mutual exclusivity; if it cannot be proven, a compile-time error is raised.
  • kRuntimeMutuallyExclusive: Operations are expected to be mutually exclusive at runtime. The pass adds runtime assertions that will trigger an error during simulation or execution if simultaneous access occurs.
  • kArbitraryStaticOrder: Allows for any static ordering of operations. The pass will introduce token dependencies to impose an arbitrary but deterministic static order if multiple operations are detected.

The core mechanism for enforcing ordering when multiple operations on a channel exist is by introducing "cross-activation tokens." For each send or receive operation that is part of a multiple-access group on a given channel:

  1. New State Elements for Tokens: A new token-typed StateElement is added to the Proc for each send or receive operation that requires legalization. These new state elements act as implicit tokens, tracking the completion of operations across different activations of the Proc.

  2. Next Operations for New Tokens: For each newly added token state element, a next_value operation is introduced. This next_value is configured to update the token state with the token output of the corresponding send or receive operation, conditioned by the send/receive predicate (if it exists). This ensures that the token for the next activation is only available after the current activation's operation has completed.

  3. Modifying Original Tokens: The original token input to each send or receive operation is replaced with an after_all operation. This after_all operation combines the original incoming token with all the newly created implicit tokens from other operations on the same channel. This effectively creates a dependency chain, ensuring that a given send/receive cannot proceed until all other send/receive operations on that channel from the previous activation have completed, and its own incoming token is ready.

This entire process ensures that even when multiple send/receive operations target the same channel, their execution is properly sequenced, preventing data hazards and ensuring deterministic behavior in the generated hardware. The pass also includes checks to ensure that channels of type kStreaming (the default) are legalized, while other channel kinds are skipped.

Example (simplified conceptual view for multiple receives on channel in with kTotalOrder strictness, where no explicit token dependency exists initially):

// Original IR snippet for a proc with multiple receives on 'in'
chan in(bits[32], id=0, kind=streaming, ops=receive_only,
        flow_control=ready_valid, strictness=total_order)

top proc my_proc() {
  tok: token = literal(value=token)
  recv0: (token, bits[32]) = receive(tok, channel=in)
  recv0_tok: token = tuple_index(recv0, index=0)
  recv0_data: bits[32] = tuple_index(recv0, index=1)

  recv1: (token, bits[32]) = receive(tok, channel=in)
  // Also uses original 'tok', creating a conflict
  recv1_tok: token = tuple_index(recv1, index=0)
  recv1_data: bits[32] = tuple_index(recv1, index=1)

  // ... other logic and next state
}

For this scenario, where recv0 and recv1 initially use the same tok without explicit ordering, ChannelLegalizationPass would perform transformations similar to:

// Optimized IR snippet (simplified, showing key changes)
chan in(bits[32], id=0, kind=streaming, ops=receive_only,
        flow_control=ready_valid, strictness=total_order)

top proc my_proc() {
  original_tok_input: token = literal(value=token) // Original token input

  // New state elements to track tokens across activations
  implicit_token__recv0_state: token = state_element(init=token)
  implicit_token__recv1_state: token = state_element(init=token)

  // recv0 now waits on implicit_token__recv1_state from previous activation
  recv0: (token, bits[32]) =
         receive(
           after_all(
             original_tok_input,
             state_read(implicit_token__recv1_state)),
           channel=in)
  recv0_tok: token = tuple_index(recv0, index=0)
  recv0_data: bits[32] = tuple_index(recv0, index=1)
  // Update implicit_token__recv0_state
  next (implicit_token__recv0_state, recv0_tok)

  // recv1 now waits on implicit_token__recv0_state from previous activation
  recv1: (token, bits[32]) =
         receive(
           after_all(
             original_tok_input,
             state_read(implicit_token__recv0_state)),
           channel=in)
  recv1_tok: token = tuple_index(recv1, index=0)
  recv1_data: bits[32] = tuple_index(recv1, index=1)
  // Update implicit_token__recv1_state
  next (implicit_token__recv1_state, recv1_tok)

  // ... other logic and next state
}

This introduces a circular dependency through the state, ensuring that recv0 in activation N-1 must complete before recv1 in activation N can proceed, and vice-versa, thereby enforcing a total order of access across activations and preventing simultaneous access.

Header

comparison_simp - Comparison Simplification

Simplifies logical operations on the results of comparison operations. For example:

eq(x, 0) && ne(x, 1) => eq(x, 0)
eq(x, 0) && ne(x, 0) => 0
eq(x, 0) || ne(x, 0) => 1

Note: What follows was generated by the Gemini LLM. Not human verified.

The ComparisonSimplificationPass is an optimization pass that focuses on simplifying logical operations that consume the results of comparison operations. Its primary goal is to reduce redundant comparisons and streamline complex Boolean expressions derived from equality, inequality, and ordering checks, thereby leading to more efficient hardware implementations.

The pass operates in an iterative, fixed-point manner, continuously applying simplifications until no further changes can be made. It performs two main types of optimizations:

  1. Range-Based Simplification of Comparison/Logical Trees:

    * ComputeRangeEquivalence: This function analyzes single-bit nodes (which typically represent the results of comparisons or logical operations) and determines their equivalent range of values. A RangeEquivalence struct is used to store the original Node and an IntervalSet representing the precise set of values for which the node evaluates to true.

      * For basic comparison operations (e.g., eq(x, 42), ult(x, 10)), it directly computes the IntervalSet that makes the comparison true.

      * For logical operations (and, or, not, nand, nor) whose operands are also comparison results (or other logical operations with known ranges), it combines their IntervalSets using set intersection (for and), set union (for or), or set complement (for not, nand, nor).

    * Replacing with Constants or Simpler Comparisons: Once the RangeEquivalence for a node is accurately determined, the pass attempts to simplify the node itself:

      * Maximal Range (Always True): If the computed IntervalSet covers the entire possible range of values for x, the node (e.g., a complex Boolean expression) is replaced with a literal 1 (representing true).

      * Empty Range (Always False): If the IntervalSet is empty, indicating that the condition can never be met, the node is replaced with a literal 0 (representing false).

      * Precise Value: If the IntervalSet reduces to a single, precise value C (e.g., x must be 42), the node is replaced with an eq(x, C) operation.

      * Complementary Precise Value: If the IntervalSet represents all values except a single value C, the node is replaced with a ne(x, C) operation.

      * ULt/UGt Interval: If the IntervalSet corresponds to a contiguous range starting from 0 up to C (e.g., [0, 42]), it's replaced with ult(x, C+1). Conversely, if it's a range from C to the maximum value (e.g., [42, MAX]), it's replaced with ugt(x, C-1).

    * Guards against De-optimization: The pass incorporates checks to prevent replacing existing simple comparison operations (like eq, ne, ult, ugt, etc.) or inversions of multi-user comparisons with new, potentially more complex, comparisons. This guards against unintended logic duplication and an increase in hardware area.

  2. TransformDerivedComparisons: This function identifies and eliminates redundant comparison operations by recognizing fundamental Boolean identities and relationships:

    * Commuted Operands: If a comparison ult(x, y) exists in the IR, and a redundant ugt(y, x) is found, the latter is replaced by the former.

    * Inverted Comparison: If ult(x, y) exists, and uge(x, y) is found, the latter can be replaced by not(ult(x, y)), leveraging the existing comparison.

    * Inverted and Commuted Comparison: If ult(x, y) exists, and ule(y, x) is found, it can be replaced by not(ult(x, y)), combining both the inversion and commutation optimizations.

By aggressively simplifying logical operations over comparisons, this pass significantly reduces the complexity of control logic, which is crucial for generating smaller and faster hardware designs.
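The identities used by TransformDerivedComparisons can be checked exhaustively at a small bit width; this Python sketch uses ordinary unsigned integer comparisons to stand in for the IR operations:

```python
from itertools import product

WIDTH = 3  # small enough to enumerate every unsigned pair

for x, y in product(range(1 << WIDTH), repeat=2):
    # Commuted operands: ugt(y, x) is the same predicate as ult(x, y).
    assert (y > x) == (x < y)
    # Inverted comparison: uge(x, y) == not(ult(x, y)).
    assert (x >= y) == (not (x < y))
    # Inverted and commuted: ule(y, x) == not(ult(x, y)).
    assert (y <= x) == (not (x < y))
```

Each identity lets the pass reuse one existing comparator plus, at most, an inverter instead of synthesizing a second comparator.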

Example: Consider the expression eq(x, 0) && ne(x, 1).

// Original IR snippet
fn f(x: bits[32]) -> bits[1] {
  literal.0: bits[32] = literal(value=0)
  literal.1: bits[32] = literal(value=1)
  x_eq_0: bits[1] = eq(x, literal.0)
  x_ne_1: bits[1] = ne(x, literal.1)
  ret and.final: bits[1] = and(x_eq_0, x_ne_1)
}

The ComputeRangeEquivalence would determine that:

  • x_eq_0 is equivalent to x being in the range [[0, 0]].
  • x_ne_1 is equivalent to x being in [[0, 0], [2, MAX]].

When these two ranges are combined by the and operation, the resulting effective range for x is [[0, 0]]. Since this represents x being precisely equal to 0, the pass will simplify and.final to eq(x, 0):

// Optimized IR (simplified)
fn f(x: bits[32]) -> bits[1] {
  literal.0: bits[32] = literal(value=0)
  ret eq.final: bits[1] = eq(x, literal.0)
}

This effectively removes the redundant ne(x, 1) comparison, leading to simpler logic.
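The range combination can be mimicked with plain Python sets standing in for IntervalSets (illustrative only; the real pass uses compact interval representations, not enumerated sets):

```python
WIDTH = 8  # small width so explicit sets stay manageable
ALL = set(range(1 << WIDTH))

x_eq_0 = {0}        # values of x for which eq(x, 0) is true
x_ne_1 = ALL - {1}  # values of x for which ne(x, 1) is true

# `and` intersects the two ranges. The result is the single value 0,
# so the whole expression simplifies to eq(x, 0).
combined = x_eq_0 & x_ne_1
assert combined == {0}
```

The same intersection/union/complement machinery handles arbitrarily nested and/or/not trees over comparisons of the same variable.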

Header

concat_simp - Concat simplification

Pass which simplifies concats. This includes removing single-operand concats, flattening trees of dependent concats, and others.

Note: What follows was generated by the Gemini LLM. Not human verified.

The ConcatSimplificationPass is an optimization pass in XLS that focuses on simplifying concat operations in the Intermediate Representation (IR). Concatenation is a fundamental operation in hardware description languages, used to combine smaller bit vectors into larger ones. This pass aims to reduce the complexity, improve the readability, and optimize the hardware implementation of concat operations and related patterns by applying various rewrite rules.

The pass operates by iteratively processing concat nodes using a worklist algorithm. This ensures that newly created or modified concat nodes are re-evaluated for further simplification. Additionally, it performs a second pass over other operations that interact with concats to identify further optimization opportunities.

Here's a detailed breakdown of the key optimizations performed:

  1. Removing Trivial Concats (Single Operand): If a concat operation has only one operand, it effectively acts as an identity operation. The pass replaces such a concat with its single operand, eliminating the redundant node and simplifying the dataflow.

    concat(x) => x
    
  2. Flattening Trees of Concats: Nested concat operations (e.g., concat(A, concat(B, C))) are flattened into a single concat operation with all the original leaf operands. This reduces the depth of the IR graph, simplifies subsequent analyses, and can lead to more compact hardware.

    concat(a, b, concat(c, concat(d, e))) => concat(a, b, c, d, e)
    
  3. Merging Consecutive Literal Operands: If a concat operation has two or more consecutive operands that are known literals, these literals are merged into a single, wider literal. This reduces both the number of operands to the concat and the total number of literal nodes in the graph.

    concat(literal_1, literal_2, x, literal_3, literal_4)
    // => concat(merged_literal_1_2, x, merged_literal_3_4)
    
  4. Eliminating Zero-Bit Operands: Any operand to a concat that has a bit width of zero (e.g., bits[0]) is functionally a no-op. The pass identifies and removes such operands from the concat's operand list, cleaning up the IR.

  5. Hoisting Reverse Operations above Concats: If a concat operation is used by a single reverse operation (and has no other non-reducible users), the reverse operation is "hoisted" above the concat. This means the reverse is applied individually to each operand of the concat, and then those reversed operands are concatenated in reverse order. This can expose more optimization opportunities for the individual reverse operations.

    reverse(concat(a, b, c)) => concat(reverse(c), reverse(b), reverse(a))
    
  6. Merging Consecutive Bit Slices: If a concat has consecutive operands that are bit_slice operations from the same source node and slice consecutive bits, these two bit_slices are merged into a single, wider bit_slice. This simplifies the slicing logic and reduces the number of intermediate nodes.

    concat(bit_slice(x, start=2, width=2), bit_slice(x, start=0, width=2))
    // => concat(bit_slice(x, start=0, width=4))
    
  7. Hoisting Bitwise Operations above Concats (TryHoistBitWiseOperation): If a bitwise operation (e.g., and, or, xor) has all its operands as concat operations, the bitwise operation can be "hoisted" above the concatenations. This means the operands of the original concats are sliced into corresponding bit ranges, the bitwise operation is applied to these slices, and then the results are re-concatenated. This often exposes opportunities for common subexpression elimination or further simplification on the smaller bitwise operations.

    xor(concat(a, b), concat(c, d)) => concat(xor(a, c), xor(b, d))
    
  8. Hoisting Bitwise Operations with Constants above Concats (TryHoistBitWiseWithConstant): This is a specialized version of hoisting bitwise operations. If a binary bitwise operation has one constant operand and one concat operand, the bitwise operation can be distributed across the concat's operands and combined with corresponding slices of the constant. This effectively narrows the operations.

    // Given x: bits[4], y: bits[4]
    and(0b10101011, concat(x, y))
    // => concat(and(x, 0b1010), and(y, 0b1011))
    
  9. Narrowing and Hoisting Bitwise Operations (TryNarrowAndHoistBitWiseOperation): This optimization specifically targets scenarios where a concat is used as an operand to a bitwise operation (like and, or, xor), and the concat itself has one constant operand (0 or all ones) and one variable operand. The goal is to narrow the bitwise operation to only the "interesting" bits (those from the variable operand) and hoist it above the concat. This can significantly reduce the bit width of the affected operations.

    // Given x: bits[10], b: bits[4]
    or(x, concat(0b111111, b)) // concat(6 ones, b)
    // => concat(0b111111, or(bit_slice(x, 0, 4), b))
    
  10. Bypassing Reduction of Concatenation (TryBypassReductionOfConcatenation): If a bitwise reduction operation (e.g., or_reduce, and_reduce, xor_reduce) is applied to a concat operation, it can be bypassed. The reduction is distributed to each operand of the concat, and then a new non-reductive bitwise operation (e.g., or, and, xor) combines the results. This can simplify complex reduction paths.

    or_reduce(concat(a, b)) => or(or_reduce(a), or_reduce(b))
    
  11. Distributing Reducible Operations (TryDistributeReducibleOperation): This optimization distributes eq and ne operations across concats. If eq(concat(A, B), C) is found, it's transformed into and(eq(A, slice(C)), eq(B, slice(C))). For ne, the transformation uses an or of nes. This helps "see through" concatenations to simplify comparisons with composite values.

    // Given x: bits[5], y: bits[10]
    eq(concat(x, y), literal(0, bits=15))
    // => and(eq(x, literal(0, bits=5)), eq(y, literal(0, bits=10)))
    

The ConcatSimplificationPass is a powerful and multi-faceted pass that significantly contributes to the overall quality of the generated hardware by producing a more compact, efficient, and readable IR.
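Several of the rewrites above (numbered as in the list) can be sanity-checked with a small bit-level model that treats bit vectors as Python ints with explicit widths. This is an illustrative sketch, not XLS code; the values and widths are arbitrary:

```python
def concat(pairs):
    """pairs: list of (value, width); the first entry is most significant."""
    out = 0
    for value, width in pairs:
        out = (out << width) | (value & ((1 << width) - 1))
    return out

def reverse(value, width):
    # Reverse the bit order of a width-bit value.
    return int(format(value, f"0{width}b")[::-1], 2)

a, b, c, d = 0b1010, 0b0110, 0b1100, 0b0011  # arbitrary 4-bit values

# 5. reverse(concat(a, b)) == concat(reverse(b), reverse(a))
assert reverse(concat([(a, 4), (b, 4)]), 8) == \
       concat([(reverse(b, 4), 4), (reverse(a, 4), 4)])

# 7. xor(concat(a, b), concat(c, d)) == concat(xor(a, c), xor(b, d))
assert concat([(a, 4), (b, 4)]) ^ concat([(c, 4), (d, 4)]) == \
       concat([(a ^ c, 4), (b ^ d, 4)])

# 10. or_reduce(concat(a, b)) == or(or_reduce(a), or_reduce(b))
or_reduce = lambda v: int(v != 0)
assert or_reduce(concat([(a, 4), (b, 4)])) == (or_reduce(a) | or_reduce(b))

# 11. eq(concat(a, b), 0) == and(eq(a, 0), eq(b, 0))
assert int(concat([(a, 4), (b, 4)]) == 0) == (int(a == 0) & int(b == 0))
```

Each assertion corresponds to one rewrite: the left-hand side evaluates the original expression and the right-hand side the transformed one, so equality over these sample values illustrates why the transformations are behavior-preserving.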

Header

cond_spec(Bdd) - Conditional specialization

Pass which specializes arms of select operations based on their selector value. When analyzing a particular arm of a sel, the value of the selector is known. This knowledge can be used to simplify the expression within the arm.

Note: What follows was generated by the Gemini LLM. Not human verified.

For example, consider the following IR:

fn f(a: bits[1], b: bits[31], z: bits[32]) -> bits[32] {
  concat: bits[32] = concat(a, b)
  ret sel.2: bits[32] = sel(a, cases=[z, concat])
}

In the sel operation, the selector is a. When considering case[1] (the concat operation), we know that a must have the value 1. Therefore, the concat(a, b) can be simplified to concat(1, b). The pass will transform the sel to:

sel.2: bits[32] = sel(a, cases=[z, concat(literal(1), b)])

This optimization applies to sel, priority_sel, and one_hot_sel. It can also use Binary Decision Diagrams (BDDs) for more powerful analysis of the conditions.
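A toy model of this specialization (assumed semantics for illustration, not the pass's actual implementation) shows why substituting the known selector value inside an arm preserves behavior: the specialized arm is only ever selected when the selector actually has that value.

```python
def sel(s, cases):
    return cases[s]

def f_original(a, b, z):
    # Arm 1 computes concat(a, b), with a in the high bit of a 32-bit value.
    return sel(a, [z, (a << 31) | b])

def f_specialized(a, b, z):
    # Inside arm 1, a is known to be 1, so concat(a, b) => concat(1, b).
    return sel(a, [z, (1 << 31) | b])

# The two functions agree for every selector value.
for a in (0, 1):
    for b in (0, 12345, (1 << 31) - 1):
        for z in (0, 99):
            assert f_original(a, b, z) == f_specialized(a, b, z)
```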

Header

cond_spec(false) - Conditional specialization

Pass which specializes arms of select operations based on their selector value. When analyzing a particular arm of a sel, the value of the selector is known. This knowledge can be used to simplify the expression within the arm.

Note: What follows was generated by the Gemini LLM. Not human verified.

For example, consider the following IR:

fn f(a: bits[1], b: bits[31], z: bits[32]) -> bits[32] {
  concat: bits[32] = concat(a, b)
  ret sel.2: bits[32] = sel(a, cases=[z, concat])
}

In the sel operation, the selector is a. When considering case[1] (the concat operation), we know that a must have the value 1. Therefore, the concat(a, b) can be simplified to concat(1, b). The pass will transform the sel to:

sel.2: bits[32] = sel(a, cases=[z, concat(literal(1), b)])

This optimization applies to sel, priority_sel, and one_hot_sel. It can also use Binary Decision Diagrams (BDDs) for more powerful analysis of the conditions.

Header

cond_spec(noBdd) - Conditional specialization

Pass which specializes arms of select operations based on their selector value. When analyzing a particular arm of a sel, the value of the selector is known. This knowledge can be used to simplify the expression within the arm.

Note: What follows was generated by the Gemini LLM. Not human verified.

For example, consider the following IR:

fn f(a: bits[1], b: bits[31], z: bits[32]) -> bits[32] {
  concat: bits[32] = concat(a, b)
  ret sel.2: bits[32] = sel(a, cases=[z, concat])
}

In the sel operation, the selector is a. When considering case[1] (the concat operation), we know that a must have the value 1. Therefore, the concat(a, b) can be simplified to concat(1, b). The pass will transform the sel to:

sel.2: bits[32] = sel(a, cases=[z, concat(literal(1), b)])

This optimization applies to sel, priority_sel, and one_hot_sel. It can also use Binary Decision Diagrams (BDDs) for more powerful analysis of the conditions.

Header

cond_spec(true) - Conditional specialization

Pass which specializes arms of select operations based on their selector value. When analyzing a particular arm of a sel, the value of the selector is known. This knowledge can be used to simplify the expression within the arm.

Note: What follows was generated by the Gemini LLM. Not human verified.

For example, consider the following IR:

fn f(a: bits[1], b: bits[31], z: bits[32]) -> bits[32] {
  concat: bits[32] = concat(a, b)
  ret sel.2: bits[32] = sel(a, cases=[z, concat])
}

In the sel operation, the selector is a. When considering case[1] (the concat operation), we know that a must have the value 1. Therefore, the concat(a, b) can be simplified to concat(1, b). The pass will transform the sel to:

sel.2: bits[32] = sel(a, cases=[z, concat(literal(1), b)])

This optimization applies to sel, priority_sel, and one_hot_sel. It can also use Binary Decision Diagrams (BDDs) for more powerful analysis of the conditions.

Header

const_fold - Constant folding

Pass which performs constant folding. Every op with only literal operands is replaced by an equivalent literal. Runs DCE after constant folding.

Note: What follows was generated by the Gemini LLM. Not human verified.

The ConstantFoldingPass is an optimization pass in XLS that performs the crucial task of constant folding across the Intermediate Representation (IR). Its primary function is to identify and evaluate operations where all operands are compile-time constants, and then replace these operations with a single literal node that represents the computed result. This process significantly simplifies the IR, reduces the number of active operations in the data path, and ultimately contributes to smaller, faster, and more efficient hardware.

The pass operates by traversing the nodes of a function or proc in topological order. For each node, it determines if it is "constant foldable" based on the following criteria:

  1. Not Already a Literal: The node itself must not already be a literal node, as there would be no further simplification.

  2. Has Users (or Implicit Use): The node must be either consumed by at least one other node, or have an implicit use (e.g., being the return value of a function, a state element of a proc, or an I/O operation). Nodes without users are considered dead code and are typically removed by a subsequent Dead Code Elimination (DCE) pass.

  3. Not Token-Typed: Nodes with a token type cannot be folded into a literal, as tokens represent ordering constraints rather than data values.

  4. Not Side-Effecting (with exception for gate): Operations that have side effects (e.g., send, receive) generally cannot be constant-folded, as their execution order and effects are crucial for correctness. An exception is made for gate operations, which can be folded if their condition and data inputs are constant, as the gate effectively becomes a simple pass-through or a zero.

  5. All Operands are Known Constants: Critically, all of the node's operands must either be literal values themselves, or be other nodes whose values are provably constant at compile time (as determined by a StatelessQueryEngine).

If a node successfully meets these criteria, the pass proceeds to:

  1. Collect the Value of each of its constant operands.

  2. Interpret the operation using these constant Values to compute its result.

  3. Replace the original operation node with a new literal node containing the computed Value.

After the ConstantFoldingPass is applied, a DeadCodeEliminationPass is typically run to remove any operations that became dead as a result of their uses being replaced by literals.

Example: Consider a function with an add operation whose operands are both literals:

// Original IR
fn IdenticalLiterals() -> bits[8] {
  literal.1: bits[8] = literal(value=42)
  literal.2: bits[8] = literal(value=123)
  ret add.3: bits[8] = add(literal.1, literal.2)
}

The ConstantFoldingPass would identify add.3 as constant foldable. It would evaluate 42 + 123 to 165 and replace add.3 with a new literal node:

// Optimized IR (after ConstantFolding and a subsequent DCE pass)
fn IdenticalLiterals() -> bits[8] {
  ret literal.new: bits[8] = literal(value=165)
}

This optimization is foundational for many other passes, as it reduces complex expressions to their simplest forms, making subsequent analyses and transformations more effective.
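The foldability check and replacement can be sketched with a toy evaluator. The node shapes here — ("literal", value), ("param", name), and (op, operands...) tuples — are assumptions for the demo, not the XLS IR representation:

```python
OPS = {"add": lambda a, b: (a + b) & 0xFF}  # 8-bit add, enough for the demo

def fold(node):
    # Literals and parameters cannot be folded further.
    if node[0] in ("literal", "param"):
        return node
    op, *operands = node
    operands = [fold(o) for o in operands]  # fold operands bottom-up
    # If every operand is now a known constant, evaluate and replace.
    if all(o[0] == "literal" for o in operands):
        return ("literal", OPS[op](*(o[1] for o in operands)))
    return (op, *operands)

# add(literal(42), literal(123)) folds to literal(165), as in the example.
print(fold(("add", ("literal", 42), ("literal", 123))))  # ('literal', 165)
```

A node with a non-constant operand is left in place, which mirrors criterion 5 above: folding only fires when all operands are provably constant.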

Header

cse - Common subexpression elimination

Pass which performs common subexpression elimination. Equivalent ops with the same operands are commoned. The pass can find arbitrarily large common expressions.

Note: What follows was generated by the Gemini LLM. Not human verified.

The CsePass (Common Subexpression Elimination) is an optimization pass in XLS that identifies and merges equivalent computations within a function or proc. The primary goal is to eliminate redundant calculations, thereby reducing the size of the generated hardware, improving performance, and making the IR more compact.

The pass operates by building a canonical representation (CseNode) for each operation in the IR. Two operations are considered equivalent if they have:

  • The same opcode (Op).
  • The same operand list (with a canonical ordering applied for commutative operations like add or and).
  • The same additional configuration data (e.g., start and width for bit_slice, index for tuple_index).
  • The same result type (Type).

Here's a breakdown of how the pass works:

  1. CseNode Representation: A CseNode object is created for each Node in the IR. This CseNode captures the essential characteristics that define equivalence, including the opcode, a sorted list of its operands (canonicalized for commutative operations), any non-operand configuration data, and the result type. For side-effecting operations (other than gate) and certain control-flow-related operations (param, next, receive, send, assert, cover, etc.), a unique identifier is assigned to prevent them from being considered equivalent, even if their underlying data appears identical.

  2. CseNodeArena: A CseNodeArena manages the creation and storage of CseNode objects. It ensures that only one CseNode instance exists for each unique canonical operation. When a Node is processed, the arena either returns an existing CseNode (if an equivalent one has already been seen) or creates a new one.

  3. Equivalence Grouping: The pass traverses the IR in topological order. For each Node, it obtains its CseNode representation. All Nodes that map to the same CseNode are grouped together into an "equivalence bucket."

  4. Replacement: After all nodes have been processed and grouped:

    * For each equivalence bucket containing more than one Node, a "representative" node is chosen from that bucket. The selection criteria for the representative prioritize nodes with shorter assigned names, and then by node ID for determinism.

    * All other Nodes in the bucket (the "duplicates") are then replaced by the representative node. This means all users of the duplicate nodes are rewired to use the single representative node instead.

    * The pass can optionally common literal nodes. If common_literals is false, each literal is treated as unique, preventing them from being merged. This can be useful if preserving distinct literal nodes is important for some later pass.

  5. Post-DCE: After CsePass runs, a DeadCodeEliminationPass is typically executed. This is because replacing duplicate nodes with a representative leaves the original duplicate nodes with no remaining users, making them dead code that can then be safely removed from the IR.

Example: Consider a function with two identical and operations, followed by identical neg and or operations:

// Original IR snippet
fn nontrivial(x: bits[8], y: bits[8], z: bits[8]) -> bits[8] {
  and.1: bits[8] = and(x, y)
  neg.2: bits[8] = neg(and.1)
  or.3: bits[8] = or(neg.2, z)

  and.4: bits[8] = and(x, y) // Identical to and.1
  neg.5: bits[8] = neg(and.4) // Identical to neg.2
  or.6: bits[8] = or(neg.5, z) // Identical to or.3

  ret add.7: bits[8] = add(or.3, or.6)
}

The CsePass would identify that and.1 and and.4 are equivalent; neg.2 and neg.5 are equivalent; and or.3 and or.6 are equivalent. It would choose one of each pair as a representative and replace the duplicates. After a subsequent DCE pass, the IR would be simplified to:

// Optimized IR (after CsePass and DCE)
fn nontrivial(x: bits[8], y: bits[8], z: bits[8]) -> bits[8] {
  and.1: bits[8] = and(x, y)
  neg.2: bits[8] = neg(and.1)
  or.3: bits[8] = or(neg.2, z)

  ret add.7: bits[8] = add(or.3, or.3) // or.6 replaced by or.3
}

This effectively shares the computation of the redundant and, neg, and or operations, significantly reducing redundant logic and hardware requirements.
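The equivalence bucketing at the heart of the pass can be sketched as follows. This is a simplified model with assumed node shapes, not the actual CseNode machinery:

```python
COMMUTATIVE = {"add", "and", "or", "xor"}

def cse_key(opcode, operand_ids, config=()):
    # Sort operands for commutative ops so and(x, y) and and(y, x) match.
    ops = tuple(sorted(operand_ids)) if opcode in COMMUTATIVE else tuple(operand_ids)
    return (opcode, ops, config)

# Group nodes into equivalence buckets by canonical key.
buckets = {}
for node_id, (opcode, operands) in {
    1: ("and", ["x", "y"]),
    4: ("and", ["y", "x"]),  # commutative duplicate of node 1
    7: ("sub", ["x", "y"]),
    8: ("sub", ["y", "x"]),  # NOT a duplicate: sub is not commutative
}.items():
    buckets.setdefault(cse_key(opcode, operands), []).append(node_id)

print(sorted(len(v) for v in buckets.values()))  # [1, 1, 2]
```

Any bucket with more than one member would then be collapsed onto a single representative, as described in step 4 above.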

Header

dataflow - Dataflow Optimization

An optimization which uses a lattice-based dataflow analysis to find equivalent nodes in the graph and replace them with a simpler form. The analysis traces through tuples, arrays, and select operations. Optimizations which can be performed by this pass:

tuple_index(tuple(x, y), index=1) => y

select(selector, {z, z}) => z

array_index(array_update(A, x, index={42}), index={42}) => x

Note: What follows was generated by the Gemini LLM. Not human verified.

The DataflowSimplificationPass is an optimization pass in XLS that leverages a lattice-based dataflow analysis to identify and replace logically equivalent nodes in the Intermediate Representation (IR) with simpler, more direct forms. This pass is highly effective at "seeing through" structural operations such as tuples, arrays, and select statements to eliminate redundant computations and expose values that are directly available from their original source. The overarching goal is to reduce the overall complexity and size of the IR, leading to more compact and efficient hardware designs.

How it Works:

  1. NodeSourceAnalysis: The core mechanism of this pass is the NodeSourceAnalysis (implemented via NodeSourceDataflowVisitor). This analysis precisely determines the "source" of every elemental component (individual bits, tuple elements, array elements) within each node in the IR graph. A "source" is defined as the original node or literal from which a particular piece of data originates, tracing through intermediate operations like tuple_index, array_index, bit_slice, and concat. For example, if an operation is tuple_index(tuple(x, y), 1), the effective source of this operation's output is directly y.

  2. Equivalence Mapping (source_map): The pass performs a topological sort of the function's nodes. During this traversal, it constructs a hash map (source_map). The keys of this map are LeafTypeTreeView<NodeSource> objects, which uniquely identify the "source" of a node's data. The values are pointers to the Node* that represents that canonical source. If two distinct nodes are found to have the same LeafTypeTreeView<NodeSource>, they are semantically equivalent.

  3. Simplification via Replacement:

    * Direct Replacement of Equivalent Nodes: As the pass traverses the graph in topological order, if the NodeSource of the current node is already present in the source_map (indicating that a logically equivalent node has already been processed), the current node is directly replaced by that existing node. This effectively eliminates redundant computations where the same value is derived through different paths.

      ```
      // Original:
      tuple.1: (bits[2], bits[42]) = tuple(x, y)
      ret tuple_index.2: bits[42] = tuple_index(tuple.1, index=1)
      // After optimization:
      ret y: bits[42] = param(name=y) // tuple_index.2 replaced by y
      ```
    

    * Special Handling for Empty Types: Nodes representing empty types (e.g., zero-width bits, empty tuples) are handled with care. While they are technically equivalent to each other, replacing them indiscriminately can lead to infinite loops if, for example, two parameters of empty types repeatedly replace each other's uses. To prevent this, the pass explicitly skips replacing nodes of empty types with other equivalent empty-type nodes.

  4. MaybeReplaceWithArrayOfExistingNodes (Array Optimization): This auxiliary function specifically targets array nodes. If an array node is constructed entirely from elements that are themselves existing nodes in the graph (and whose NodeSources are already in the source_map), the array node can be replaced by a more direct array literal or a reference to an existing array operation. This is useful for:

    * Unboxing Arrays: Simplifying array_index(array(x), index=0) to just x.

    * Reconstructing Arrays from Updates: If a series of array_update operations effectively reconstructs an array from individual elements (some of which might be initial values), and the pass can track these sources, it can replace the array_update chain with a direct array operation.

    * Eliminating Redundant Array Construction: If an array is constructed whose elements are simply array_index operations from another existing array, the entire array construction can be replaced by the original array.

      ```
      // Original:
      element_0: bits[32] = array_index(a, indices=[zero])
      element_1: bits[32] = array_index(a, indices=[one])
      element_2: bits[32] = array_index(a, indices=[two])
      ret array: bits[32][3] = array(element_0, element_1, element_2)
      // After optimization:
      ret a: bits[32][3] = param(name=a) // Replaced by original array 'a'
      ```
    

Benefits:

  • Semantic CSE: Performs a powerful form of common subexpression elimination that identifies logically equivalent computations even if their IR structures are syntactically different.
  • IR Simplification: Reduces the number of nodes in the IR by eliminating redundant computations and simplifying dataflow paths through aggregate types (tuples and arrays).
  • Hardware Efficiency: Leads directly to more compact and efficient hardware by removing redundant logic and simplifying complex data structures, which translates to better area and performance metrics.
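The source-tracing idea can be sketched for the tuple case. The node shapes here are assumptions for the demo; the real analysis also traces through arrays, selects, bit slices, and concats:

```python
def source(node):
    """Trace a node to the value it ultimately forwards."""
    if node[0] == "tuple_index":
        _, tup, index = node
        tup = source(tup)
        if tup[0] == "tuple":
            # tuple_index(tuple(...), i) forwards the i-th tuple element.
            return source(tup[1 + index])
    return node

x, y = ("param", "x"), ("param", "y")
# tuple_index(tuple(x, y), index=1) traces directly to y.
print(source(("tuple_index", ("tuple", x, y), 1)))  # ('param', 'y')
```

Two nodes with the same traced source are semantically equivalent, which is the condition the pass uses to replace one with the other.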

Header

dce - Dead Code Elimination

The DeadCodeEliminationPass iterates up from a function's result nodes and marks all visited nodes. After that, all unvisited nodes are considered dead.

Note: What follows was generated by the Gemini LLM. Not human verified.

The DeadCodeEliminationPass (DCE) is a fundamental optimization pass in XLS that removes "dead" or "unreachable" code from the Intermediate Representation (IR). Dead code refers to any operation whose result is not used, either directly or indirectly, to produce the final output of a function or any side-effecting operation. By removing such code, DCE simplifies the IR graph, reduces the size of the generated hardware, and can improve the performance of subsequent optimization passes by reducing the amount of code they need to analyze.

How it Works:

The pass operates by identifying "live" code—operations that are essential for the computation—and then removing everything else. This is achieved through a simple and efficient graph traversal algorithm:

  1. Identifying Live Roots: The pass starts by identifying the initial set of "live" nodes. These are nodes that are inherently live because they represent the essential outputs or side effects of the function or proc. This set includes:

    * The function's return value.

    * Any side-effecting operations (e.g., assert, cover, trace, send).

    * Proc next state update nodes.

    * Nodes with implicit uses (e.g., output ports in a block).

  2. Worklist-based Traversal: A worklist is initialized with all nodes that are initially considered dead (i.e., those with no users and which are deletable). The pass then iteratively removes a node from the worklist and, for each of its operands, checks if that operand has become dead as a result. If so, the operand is added to the worklist.

  3. Determining Deletability: A node is considered "deletable" if it is not a root and does not have any side effects. The pass includes a specific is_deletable check which ensures that:

    * Nodes with implicit uses (like block ports) are not removed.

    * Most side-effecting operations (assert, cover, trace, send, etc.) are not removed, as they are considered observable effects. gate is a special-case that is considered removable.

    * invoke nodes are generally not removed by DCE. This is a conservative approach, as the invoked function might have side effects that are not immediately apparent. InliningPass is responsible for handling the removal of invoke nodes.

  4. Node Removal: As the pass identifies dead nodes, it removes them from the function's IR graph. This process continues until the worklist is empty, at which point only live code remains.

Benefits:

  • Reduced Hardware Size: The most direct benefit is a reduction in the amount of logic that needs to be synthesized, leading to smaller and more area-efficient hardware.
  • Improved Performance: By removing unnecessary computations, the pass can sometimes shorten critical paths, leading to improved timing performance.
  • Enhanced IR Quality: A cleaner, more compact IR is easier for both humans and subsequent optimization passes to analyze and transform, which can lead to better overall optimization results.

Example: Consider a function with several operations, some of which do not contribute to the final result:

// Original IR snippet
fn some_dead_code(x: bits[42], y: bits[42]) -> bits[42] {
  neg.1: bits[42] = neg(x)
  add.2: bits[42] = add(x, y)      // Dead: not used by any live node
  neg.3: bits[42] = neg(add.2)     // Dead: only user is also dead
  ret sub.4: bits[42] = sub(neg.1, y)
}

The DeadCodeEliminationPass would perform the following steps:

  1. Start with the live root, sub.4 (the function's return value).

  2. The live set becomes {sub.4, neg.1, y}. x is also live since it is used by neg.1.

  3. Nodes add.2 and neg.3 are identified as having no users in the live set. Since they have no other users and are not side-effecting, they are considered dead.

  4. The pass removes add.2 and neg.3 from the graph.

The final, optimized IR would be:

// Optimized IR (after DCE)
fn some_dead_code(x: bits[42], y: bits[42]) -> bits[42] {
  neg.1: bits[42] = neg(x)
  ret sub.4: bits[42] = sub(neg.1, y)
}

The pass has successfully eliminated the unused add and neg operations, resulting in a more efficient implementation.
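The worklist algorithm and the example above can be sketched together. The graph representation here (operand lists, user sets, a set of live roots) is an assumption for the demo, not the XLS data structures:

```python
def dce(operands, users, roots):
    users = {n: set(u) for n, u in users.items()}  # work on a copy
    dead = set()
    # Seed the worklist with nodes that have no users and are not roots.
    worklist = [n for n in users if not users[n] and n not in roots]
    while worklist:
        n = worklist.pop()
        dead.add(n)
        # Removing n may leave its operands without users.
        for op in operands[n]:
            users[op].discard(n)
            if not users[op] and op not in roots:
                worklist.append(op)
    return dead

# The graph from the example above: add.2 feeds only neg.3, which is unused.
operands = {"x": [], "y": [], "neg.1": ["x"], "add.2": ["x", "y"],
            "neg.3": ["add.2"], "sub.4": ["neg.1", "y"]}
users = {"x": {"neg.1", "add.2"}, "y": {"add.2", "sub.4"},
         "neg.1": {"sub.4"}, "add.2": {"neg.3"}, "neg.3": set(),
         "sub.4": set()}
print(sorted(dce(operands, users, roots={"sub.4", "x", "y"})))
# ['add.2', 'neg.3']
```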

Header

dfe - Dead Function Elimination

Dead Function Elimination.

Note: What follows was generated by the Gemini LLM. Not human verified.

This pass removes unreachable functions, procs, and blocks (collectively FunctionBases) from the package. It is crucial for reducing the size and complexity of the IR by discarding any components that do not contribute to the final output of the top-level entity.

The pass requires that a top entity (function, proc, or block) be set in the package to determine the starting point of liveness analysis.

The process involves:

  1. Identifying Live Roots: The pass first identifies the initial set of "live" FunctionBases. This typically includes:

    * The explicitly designated top-level function, proc, or block.

    * For old-style procs, any proc that communicates with external channels (e.g., send_only or receive_only) is considered live. Liveness then propagates to other procs that communicate with these live procs over internal channels.

    * For new-style procs and blocks, liveness is determined by which other procs or blocks are instantiated by the top-level proc/block.

  2. Transitive Closure Analysis: A depth-first search (DFS) is performed starting from these live roots. During this traversal, all directly or indirectly referenced FunctionBases (e.g., functions invoked by other functions, body functions of counted_for loops, instantiated blocks) are marked as "reached" or "live".

  3. Elimination of Dead Code: After the traversal:

    * Any FunctionBase that was not marked as reached is considered dead and is removed from the package.

    * Any channels that are no longer associated with any live proc are also removed.

This pass ensures that the final IR contains only the necessary logic, simplifying subsequent optimization and synthesis steps.

Example: Consider a package with three functions: f_live, f_dead, and f_top. f_top is set as the top entity and invokes f_live. f_dead is defined but never invoked by f_top or f_live.

package my_package

fn f_dead() -> bits[32] {
  ret literal.0: bits[32] = literal(value=42)
}

fn f_live(x: bits[32]) -> bits[32] {
  ret add.1: bits[32] = add(x, literal(value=1))
}

top fn f_top(y: bits[32]) -> bits[32] {
  ret invoke.0: bits[32] = invoke(y, to_apply=f_live)
}

After DeadFunctionEliminationPass runs, f_dead will be removed, and the package will contain only f_live and f_top.
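The reachability analysis can be sketched as a simple traversal of the call graph. The adjacency-list representation here is an assumption for the demo, not the XLS CallGraph API:

```python
def live_functions(call_graph, top):
    """Return all functions reachable from the top entity."""
    live, stack = set(), [top]
    while stack:
        fn = stack.pop()
        if fn in live:
            continue  # already visited; also handles call cycles
        live.add(fn)
        stack.extend(call_graph.get(fn, []))
    return live

# The example above: f_top invokes f_live; f_dead is never referenced.
call_graph = {"f_top": ["f_live"], "f_live": [], "f_dead": []}
print(sorted(live_functions(call_graph, "f_top")))  # ['f_live', 'f_top']
```

Everything outside the returned live set — here f_dead — would then be deleted from the package.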

Header

fixedpoint_proc_state_flattening - Proc State Flattening

Prepare proc state for further analysis by removing arrays and tuples.

Text-proto

Options Set

Run to a fixedpoint.

Invoked Passes

fixedpoint_simp - Fixed-point Simplification

Standard simplification pipeline.

This is run to a fixedpoint and avoids many time-consuming analyses.

Text-proto

Options Set

Run to a fixedpoint.

Invoked Passes

fixedpoint_simp(2) - Max-2 Fixed-point Simplification

Standard simplification pipeline.

Opt level is capped at 2

Text-proto

Options Set

Run to a fixedpoint.

Cap opt level: 2

Invoked Passes

fixedpoint_simp(3) - Max-3 Fixed-point Simplification

Standard simplification pipeline.

Opt level is capped at 3

Text-proto

Options Set

Run to a fixedpoint.

Cap opt level: 3

Invoked Passes

fixedpoint_simp(>=1,<=2) - Min-1 Max-2 Fixedpoint Simplification

Standard simplification pipeline.

Opt level is capped at 2 and skipped if less than 1

Text-proto

Options Set

Run to a fixedpoint.

Min opt level: 1

Cap opt level: 2

Invoked Passes

full-inlining - full function inlining passes

Fully inline all functions in a single step.

Text-proto

Invoked Passes

ident_remove - Identity Removal

Identity Removal - eliminate all identity() expressions.

Note: What follows was generated by the Gemini LLM. Not human verified.

The IdentityRemovalPass is an optimization pass in XLS that focuses on simplifying the Intermediate Representation (IR) by eliminating all identity() expressions. An identity() operation simply passes its single operand through unchanged. While syntactically valid, these operations introduce unnecessary nodes into the graph, increasing its complexity without contributing any new computational value.

How it Works:

  1. Traversal: The pass performs a forward traversal through all nodes in the function or proc.

  2. Identification: For each node encountered, it checks if the node's opcode (op()) is Op::kIdentity.

  3. Replacement: If an identity() node is found, all its uses are directly replaced with its single operand. This means any node that was consuming the output of the identity() operation will now directly consume the input to the identity() operation.

  4. Removal (Implicitly by DCE): After all uses of an identity() node have been replaced, that identity() node itself becomes dead code (it no longer has any consumers). A subsequent DeadCodeEliminationPass is typically run after IdentityRemovalPass to remove these now-unused identity() nodes, further cleaning up the IR.

Benefits:

  • IR Simplification: Directly reduces the number of nodes in the IR, making the graph simpler and easier for subsequent passes to analyze and transform.
  • Hardware Reduction: Eliminates unnecessary wires and buffers that might otherwise be generated for identity() operations in hardware, contributing to smaller area.
  • Improved Performance: Removing redundant operations can slightly improve simulation and synthesis performance by reducing the total amount of logic.

Example: Consider a function with a chain of identity() operations that pass through intermediate results:

// Original IR snippet
fn simple_neg(x: bits[8]) -> bits[8] {
  one: bits[8] = literal(value=1)
  v1:  bits[8] = identity(x)
  add: bits[8] = add(v1, one)
  v2:  bits[8] = identity(add)
  v3:  bits[8] = identity(v2)
  ret add2:bits[8] = sub(v3, one)
}

The IdentityRemovalPass would process v1, v2, and v3, replacing their uses with their respective operands. After this pass and a subsequent DeadCodeEliminationPass, the IR would be simplified to:

// Optimized IR (after IdentityRemovalPass and DCE)
fn simple_neg(x: bits[8]) -> bits[8] {
  one: bits[8] = literal(value=1)
  add: bits[8] = add(x, one)          // v1 replaced by x
  // v3 replaced by add (which was v2's replacement)
  ret add2:bits[8] = sub(add, one)
}

This effectively removes all the intermediate identity() nodes, streamlining the dataflow and simplifying the overall design.
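The rewiring step can be sketched over a toy node table (assumed shapes for the demo, not XLS data structures); chains of identities resolve to the first non-identity node:

```python
def remove_identities(operands):
    """operands maps node name -> (opcode, operand names...)."""
    def resolve(n):
        # Follow chains like v3 -> v2 -> add until a non-identity node.
        while operands.get(n, ("",))[0] == "identity":
            n = operands[n][1]
        return n
    # Rewire every operand reference and drop the identity nodes themselves.
    return {n: (op[0],) + tuple(resolve(o) for o in op[1:])
            for n, op in operands.items() if op[0] != "identity"}

# The chain from the example: v1 = identity(x), v2 = identity(add),
# v3 = identity(v2).
operands = {"v1": ("identity", "x"), "add": ("add", "v1", "one"),
            "v2": ("identity", "add"), "v3": ("identity", "v2"),
            "add2": ("sub", "v3", "one")}
print(remove_identities(operands))
# {'add': ('add', 'x', 'one'), 'add2': ('sub', 'add', 'one')}
```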

Header

inlining - Inlines invocations

Inlines a package toward the top function/proc.

If full then all functions are inlined into the top.

If leaf then only leaf functions are inlined into their caller. This allows other passes to optimize on smaller graphs.

Note: What follows was generated by the Gemini LLM. Not human verified.

The InliningPass is a fundamental optimization pass in XLS that transforms invoke operations (function calls) by replacing them with the actual body of the invoked function. This is a powerful compiler optimization that eliminates the overhead associated with function calls, exposes more optimization opportunities for subsequent passes (such as Common Subexpression Elimination or Dead Code Elimination), and can ultimately lead to more monolithic and potentially faster hardware designs.

The pass supports two primary modes of inlining, controlled by the InlineDepth enum:

  1. kFull (Full Inlining): In this mode, the pass attempts to recursively inline all functions into their callers, working its way up the call graph towards the top-level function or proc. The process continues until only the top-level entity remains (or until it encounters foreign functions, which are external hardware blocks and cannot be inlined).

  2. kLeafOnly (Leaf-Only Inlining): This mode selectively inlines only "leaf" functions—those that do not call any other functions themselves (or only call foreign functions). This can be beneficial when other optimization passes are more effective on smaller, already-inlined graphs, or when fine-grained control over inlining is desired. It also inlines functions that have a single caller and only call leaf functions or other functions that are already being inlined.

How it Works:

  1. Call Graph Analysis: The pass initially constructs a CallGraph for the entire package. This graph precisely depicts the calling relationships between different functions.

  2. Inlining Order: Functions are processed for inlining in a post-order traversal of the call graph (i.e., leaf functions are processed first). This strategic order ensures that when a function Foo is inlined into its callers, any functions called by Foo have already been inlined (if eligible), thereby preventing redundant work and simplifying the overall inlining process.

  3. InlineInvoke Function: This template function is responsible for the actual inlining of a single invoke operation:

    * Operand Mapping: It establishes a mapping from the parameters of the invoked function to the corresponding actual arguments (operands) of the invoke instruction.

    * Node Cloning and Rewiring: It traverses the body of the invoked function in topological order. For each node within the invoked function, it clones the node into the caller function. During this cloning process, any references to parameters of the invoked function are transparently replaced with their corresponding arguments from the invoke operation.

    * Source Location Propagation: Source location information is meticulously merged from the original nodes and the invoke instruction to maintain traceability and debugging information.

    * Name Propagation: Node names are propagated intelligently to preserve readability after inlining:

      * If the invoked function's parameter name is a prefix of a node's name, and the corresponding invoke operand also has an assigned name, the inlined node's name is derived by substituting the parameter name with the operand's name.

      * If the invoke instruction itself has an assigned name, the return value of the inlined function inherits that name.

      * If neither of the above applies, and the inlined node had a name derived from a parameter, that derived name is used.

    * Cover and Assert Label Deduplication: For cover and assert operations, if inlining would result in multiple instances with identical labels, the labels are modified to include a unique prefix (derived from the caller function and a unique inline_count) to ensure uniqueness in the generated Verilog.

    * Replacement and Removal: Finally, the original invoke instruction is replaced by the return value of the inlined function (or a tuple if the function returns multiple values), and the invoke node itself is removed from the caller function as it is no longer needed.

By aggressively inlining functions, InliningPass creates larger, flatter graphs that provide subsequent optimization passes with a broader scope for analysis and transformation, ultimately leading to more optimized hardware designs.
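The post-order processing described in step 2 can be sketched as a plain depth-first traversal (a minimal sketch over a hypothetical dict-based call graph, not the actual CallGraph API):

```python
# Minimal post-order traversal over a hypothetical call graph represented as
# a dict mapping each function name to the list of functions it calls.

def inline_order(call_graph, top):
    """Return function names with every callee ordered before its callers."""
    order, seen = [], set()

    def visit(fn):
        if fn in seen:
            return
        seen.add(fn)
        for callee in call_graph.get(fn, []):
            visit(callee)
        order.append(fn)  # post-order: appended after all callees

    visit(top)
    return order
```

Because callees appear first, by the time a function is inlined into its callers its own invokes have already been flattened.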

Example (Full Inlining): Consider a caller function that invokes a callee function:

// Original IR
package some_package

fn callee(x: bits[32], y: bits[32]) -> bits[32] {
  ret add.1: bits[32] = add(x, y)
}

fn caller() -> bits[32] {
  literal.2: bits[32] = literal(value=2)
  ret invoke.3: bits[32] = invoke(literal.2, literal.2, to_apply=callee)
}

After InliningPass (with kFull depth), the invoke.3 in caller would be replaced by the body of callee, resulting in:

// Optimized IR (after Inlining and a subsequent DCE pass)
package some_package

fn caller() -> bits[32] {
  literal.2: bits[32] = literal(value=2)
  // callee's body is inlined here
  add.1: bits[32] = add(literal.2, literal.2)
  ret add.1
}

This effectively eliminates the function call overhead and exposes the add.1 node to further optimizations within the caller function. With kLeafOnly, callee would be inlined into caller, but if caller itself were called by another function, caller would not be inlined into that outer function. If caller called another function callee2, which itself called a third function callee3, then callee3 would be inlined into callee2, but callee2 would not be inlined into caller.

Header

label-recovery - LabelRecovery

At the end of the pass pipeline (when inlining and optimizations have been performed) attempts to recover original names for coverpoints and assertions to whatever degree possible so they're more human-readable -- we mangle them for inlining to ensure they're unique, but often those names are way overqualified.

Note: What follows was generated by the Gemini LLM. Not human verified.

The LabelRecoveryPass is an optimization pass in XLS designed to enhance the human-readability of labels associated with cover points and assert statements. These labels are crucial for debugging and formal verification. During earlier optimization stages, particularly InliningPass, these labels are deliberately mangled (e.g., by prepending caller function names and invocation IDs). This mangling ensures their uniqueness in the flattened IR, which is critical for generating correct Verilog. However, these mangled names can become excessively long and difficult for users to interpret.

The primary purpose of LabelRecoveryPass is to revert these mangled labels back to their original, more concise, user-defined forms wherever possible, without reintroducing naming collisions that would break downstream tools.

How it Works:

  1. Collecting Original Labels: The pass iterates through all nodes in the function or proc. For each cover and assert node, it checks for the presence of an original_label attribute. This attribute stores the name that the user originally provided before any mangling occurred during inlining.

  2. Grouping by Original Label: It constructs a map where the keys are the original_label strings, and the values are lists of all cover or assert nodes that share that specific original_label. This grouping is essential for identifying potential naming collisions.

  3. Collision Check and Recovery: For each unique original_label found in the map:

    * No Collision: If a particular original_label is associated with only one cover or assert node in the entire function, it means that no naming collision occurred (e.g., the original statement was inlined only once, or it was already unique). In this safe scenario, the pass restores the node's label attribute to its original_label.

    * Collision Present: If an original_label is associated with multiple cover or assert nodes, it signifies that inlining has created multiple instances of what was originally a single, uniquely named statement. In this case, the pass does not attempt to recover the original label for these collided nodes. Their mangled names are retained to ensure uniqueness and prevent errors in downstream tools like Verilog generation.
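The grouping and collision check can be sketched in Python (dict-based nodes with hypothetical "label"/"original_label" keys standing in for the real cover/assert attributes):

```python
from collections import defaultdict

# Each node is a dict with hypothetical "label" / "original_label" keys
# standing in for the real cover/assert attributes in the XLS IR.

def recover_labels(nodes):
    """Restore original labels wherever doing so introduces no collision."""
    by_original = defaultdict(list)
    for node in nodes:
        if node.get("original_label"):
            by_original[node["original_label"]].append(node)
    for original, group in by_original.items():
        if len(group) == 1:
            # Unique: safe to restore the user's original name.
            group[0]["label"] = original
        # Collided labels keep their mangled names to stay unique.
```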

Benefits:

  • Improved Debuggability: By restoring original labels, the generated Verilog becomes significantly more understandable for hardware designers, making it easier to pinpoint issues related to cover points and assert statements during simulation and debugging.
  • Enhanced Readability: Reduces the verbosity of the generated IR and Verilog code, contributing to overall better readability.
  • Preserves Intent: Ensures that the designer's original intent and chosen naming convention for cover and assert labels are reflected in the final output wherever possible.

Limitations (as noted in the code):

  • The current implementation only recovers labels if there are no collisions for a given original label. It does not attempt to find minimal distinguishing prefixes or other sophisticated renaming schemes for collided labels, although this is noted as a potential future improvement.

Example: Consider a callee function with a cover point that is invoked only once by a caller function:

// Original IR
package p

fn callee(p: bits[1]) -> () {
  ret cover.10: () = cover(p, label="my_cover_label")
}

top fn caller(p: bits[1]) -> () {
  ret invoke.20: () = invoke(p, to_apply=callee)
}

After InliningPass, the cover.10 in callee might be inlined into caller and its label mangled (e.g., to "caller_0_callee_my_cover_label").

The LabelRecoveryPass would then perform the following steps:

  1. It would locate the inlined cover node within the caller function.

  2. It would retrieve its original_label, which is "my_cover_label".

  3. It would determine that no other cover or assert node in caller shares the original_label "my_cover_label".

  4. Consequently, it would restore the label of the inlined cover node to "my_cover_label".

// Optimized IR (after InliningPass, DCEPass, and LabelRecoveryPass)
package p

top fn caller(p: bits[1]) -> () {
  // Label recovered to original
  ret cover.new: () = cover(p, label="my_cover_label")
}

If callee were invoked multiple times, leading to multiple cover nodes with the same original_label (e.g., in a loop), their labels would remain mangled to ensure uniqueness, as per the pass's collision avoidance strategy.

Header

leaf-inlining - Inlines invocations

Inlines a package toward the top function/proc.

If full then all functions are inlined into the top.

If leaf then only leaf functions are inlined into their caller. This allows other passes to optimize on smaller graphs.

Note: What follows was generated by the Gemini LLM. Not human verified.

The InliningPass is a fundamental optimization pass in XLS that transforms invoke operations (function calls) by replacing them with the actual body of the invoked function. This is a powerful compiler optimization that eliminates the overhead associated with function calls, exposes more optimization opportunities for subsequent passes (such as Common Subexpression Elimination or Dead Code Elimination), and can ultimately lead to more monolithic and potentially faster hardware designs.

The pass supports two primary modes of inlining, controlled by the InlineDepth enum:

  1. kFull (Full Inlining): In this mode, the pass attempts to recursively inline all functions into their callers, working its way up the call graph towards the top-level function or proc. The process continues until only the top-level entity remains (or until it encounters foreign functions, which are external hardware blocks and cannot be inlined).

  2. kLeafOnly (Leaf-Only Inlining): This mode selectively inlines only "leaf" functions—those that do not call any other functions themselves (or only call foreign functions). This can be beneficial when other optimization passes are more effective on smaller, already-inlined graphs, or when fine-grained control over inlining is desired. It also inlines functions that have a single caller and only call leaf functions or other functions that are already being inlined.

How it Works:

  1. Call Graph Analysis: The pass initially constructs a CallGraph for the entire package. This graph precisely depicts the calling relationships between different functions.

  2. Inlining Order: Functions are processed for inlining in a post-order traversal of the call graph (i.e., leaf functions are processed first). This strategic order ensures that when a function Foo is inlined into its callers, any functions called by Foo have already been inlined (if eligible), thereby preventing redundant work and simplifying the overall inlining process.

  3. InlineInvoke Function: This template function is responsible for the actual inlining of a single invoke operation:

    * Operand Mapping: It establishes a mapping from the parameters of the invoked function to the corresponding actual arguments (operands) of the invoke instruction.

    * Node Cloning and Rewiring: It traverses the body of the invoked function in topological order. For each node within the invoked function, it clones the node into the caller function. During this cloning process, any references to parameters of the invoked function are transparently replaced with their corresponding arguments from the invoke operation.

    * Source Location Propagation: Source location information is meticulously merged from the original nodes and the invoke instruction to maintain traceability and debugging information.

    * Name Propagation: Node names are propagated intelligently to preserve readability after inlining:

      * If the invoked function's parameter name is a prefix of a node's name, and the corresponding invoke operand also has an assigned name, the inlined node's name is derived by substituting the parameter name with the operand's name.

      * If the invoke instruction itself has an assigned name, the return value of the inlined function inherits that name.

      * If neither of the above applies, and the inlined node had a name derived from a parameter, that derived name is used.

    * Cover and Assert Label Deduplication: For cover and assert operations, if inlining would result in multiple instances with identical labels, the labels are modified to include a unique prefix (derived from the caller function and a unique inline_count) to ensure uniqueness in the generated Verilog.

    * Replacement and Removal: Finally, the original invoke instruction is replaced by the return value of the inlined function (or a tuple if the function returns multiple values), and the invoke node itself is removed from the caller function as it is no longer needed.

By aggressively inlining functions, InliningPass creates larger, flatter graphs that provide subsequent optimization passes with a broader scope for analysis and transformation, ultimately leading to more optimized hardware designs.

Example (Full Inlining): Consider a caller function that invokes a callee function:

// Original IR
package some_package

fn callee(x: bits[32], y: bits[32]) -> bits[32] {
  ret add.1: bits[32] = add(x, y)
}

fn caller() -> bits[32] {
  literal.2: bits[32] = literal(value=2)
  ret invoke.3: bits[32] = invoke(literal.2, literal.2, to_apply=callee)
}

After InliningPass (with kFull depth), the invoke.3 in caller would be replaced by the body of callee, resulting in:

// Optimized IR (after Inlining and a subsequent DCE pass)
package some_package

fn caller() -> bits[32] {
  literal.2: bits[32] = literal(value=2)
  // callee's body is inlined here
  add.1: bits[32] = add(literal.2, literal.2)
  ret add.1
}

This effectively eliminates the function call overhead and exposes the add.1 node to further optimizations within the caller function. With kLeafOnly, callee would be inlined into caller, but if caller itself were called by another function, caller would not be inlined into that outer function. If caller called another function callee2, which itself called a third function callee3, then callee3 would be inlined into callee2, but callee2 would not be inlined into caller.
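As a rough model of kLeafOnly eligibility, the basic leaf test can be sketched as follows (a hypothetical dict call graph; this simplification omits the single-caller extension described above):

```python
# Hypothetical dict call graph; "foreign" holds FFI functions that can never
# be inlined. This sketch omits the single-caller extension described above.

def is_leaf(fn, call_graph, foreign=frozenset()):
    """A leaf calls no other functions, or calls only foreign functions."""
    return all(callee in foreign for callee in call_graph.get(fn, []))

def leaf_inline_targets(call_graph, foreign=frozenset()):
    """Functions eligible for inlining under the leaf-only policy."""
    return {fn for fn in call_graph if is_leaf(fn, call_graph, foreign)}
```

On the caller/callee2/callee3 chain above, only callee3 qualifies in the first round; callee2 becomes a leaf only after callee3 has been inlined into it.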

Header

loop_unroll - Unroll counted loops

Note: What follows was generated by the Gemini LLM. Not human verified.

Pass which unrolls counted_for loops. Each iteration of the counted_for loop is replaced with an explicit invoke operation.

For example, a loop like this:

fn body(i: bits[4], accum: bits[32], zero: bits[32]) -> bits[32] {
  zero_ext.3: bits[32] = zero_ext(i, new_bit_count=32)
  add.4: bits[32] = add(zero_ext.3, accum)
  ret add.5: bits[32] = add(add.4, zero)
}

fn unrollable() -> bits[32] {
  literal.1: bits[32] = literal(value=0)
  ret counted_for.2: bits[32] = counted_for(literal.1, trip_count=2,
  stride=2, body=body, invariant_args=[literal.1])
}

would be transformed into:

invoke.0: bits[32] = invoke(literal(0), literal(0), literal(0), to_apply=body)
invoke.1: bits[32] = invoke(literal(2), invoke.0, literal(0), to_apply=body)
ret invoke.1
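The equivalence of the rolled and unrolled forms can be checked with a small Python model of the counted_for semantics (function names here are illustrative, not XLS APIs):

```python
MASK32 = 0xFFFFFFFF

def body(i, accum, zero):
    # Mirrors the IR body: zero_ext(i) + accum, then + zero, in bits[32].
    return (i + accum + zero) & MASK32

def counted_for(init, trip_count, stride, loop_body, invariant_args):
    """Rolled form: the induction variable takes 0, stride, 2*stride, ..."""
    accum = init
    for iteration in range(trip_count):
        accum = loop_body(iteration * stride, accum, *invariant_args)
    return accum

def unrolled(init, invariant):
    """Unrolled form: one explicit call per iteration, as in the IR above."""
    invoke_0 = body(0, init, invariant)      # iteration 0, i = 0
    invoke_1 = body(2, invoke_0, invariant)  # iteration 1, i = 2
    return invoke_1
```

With trip_count=2 and stride=2, both forms accumulate 0 + 2 = 2.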

Header

lut_conversion - LUT Conversion

Pass which opportunistically converts nodes to lookup tables (selects) where we can prove it's beneficial.

Note: What follows was generated by the Gemini LLM. Not human verified.

The LutConversionPass is an optimization pass in XLS that intelligently converts complex combinational logic into more efficient lookup tables (LUTs). These LUTs are represented in the IR as select operations. This is particularly beneficial for hardware synthesis, especially for FPGA targets where LUTs are fundamental configurable logic blocks, capable of implementing arbitrary Boolean functions very efficiently within a fixed-size resource. The pass aims to improve hardware area and potentially delay by replacing complex gate-level logic with a single, equivalent select operation.

How it Works:

  1. Identifying Select Operations and Selectors: The pass iterates through nodes in reverse topological order, focusing on select operations. For each select, it analyzes its selector input.

  2. Minimum Cut Analysis for Selector: The core of this pass involves a "minimum cut" analysis performed on the selector node, facilitated by DataflowGraphAnalysis. This analysis identifies the minimal set of ancestor nodes (min_cut) of the selector that, when their values are known, completely determine the value of the selector.

    * A crucial parameter, max_unknown_bits, is used with GetMinCutFor. This sets an upper limit on the number of "unknown" bits (bits whose values are not fixed to 0 or 1) allowed in the min_cut. If the number of unknown bits exceeds this limit, the cut is considered too large to be efficiently represented as a LUT, and the optimization is not performed. This prevents the generation of excessively large LUTs that might consume too many hardware resources.

  3. Constructing the Lookup Table (Truth Table): If a sufficiently small min_cut is found, the pass then constructs a conceptual lookup table. This involves:

    * Enumerating all min_cut values: It systematically iterates through all possible combinations of values that the nodes in the min_cut can take.

    * Simulating the selector: For each combination of min_cut values, it uses an IrInterpreter to simulate the sub-graph that computes the selector and determines its resulting value.

    * Mapping Selector Values to Case Results: With the selector values determined for each min_cut combination, the pass then knows precisely which case of the original select operation would be chosen. It fetches the corresponding output value from that case (or the default value of the original select). This effectively creates a mapping from each min_cut combination to the final output value of the original select operation.

  4. Replacing with a New Select (LUT): Finally, the original select operation is replaced by a new select operation that directly implements the lookup table:

    * New Selector: A new selector for this LUT is constructed by concatenating the "unknown" bits of the min_cut nodes. These are the bits that actually differentiate the behavior of the selector.

    * New Cases: The cases of the new select are the computed output values (from step 3) for each possible combination of the new selector bits. The order of these cases is determined by the enumeration of min_cut values.

Benefits:

  • Area Optimization: By replacing complex gate-level logic with efficient LUT structures, this pass can significantly reduce the overall hardware area, especially for FPGA-based designs.
  • Performance Improvement: LUTs typically have predictable and often low delay characteristics, which can lead to improved timing performance for the converted logic.
  • IR Simplification: Can simplify complex logical structures into a single select node, making the IR more abstract and easier to manage for subsequent optimization stages.

Example: Consider a select operation whose selector is derived from a complex arithmetic expression of x, for instance, selector = add(x, x) (effectively 2*x). The cases of the select are literals.

// Original IR snippet
fn simple_select(x: bits[3]) -> bits[3] {
  literal.0: bits[3] = literal(value=0)
  literal.1: bits[3] = literal(value=1)
  literal.2: bits[3] = literal(value=2)
  literal.3: bits[3] = literal(value=3)
  doubled_x: bits[3] = add(x, x) // Selector for the final select
  selector: bits[3] = identity(doubled_x)
  ret result: bits[3] =
              sel(selector,
                  cases=[literal.0, literal.1, literal.2, literal.3],
                  default=x)
}

The LutConversionPass would identify selector (which is doubled_x) as a target. It would perform a minimum cut on selector and find that x is the minimal set of inputs. It would then enumerate all possible values for x (0 to 7 for bits[3]), compute the selector value and the select result for each x, creating a new select with x as its selector and the computed results as cases:

// Optimized IR (simplified after LutConversionPass and a DCE pass)
fn simple_select(x: bits[3]) -> bits[3] {
  // ... (literals 0-3 remain if still used elsewhere, otherwise removed by
  //     DCE)
  ret result: bits[3] = sel(x, cases=[
    literal(value=0), // x=0, doubled_x=0 -> cases[0]
    literal(value=2), // x=1, doubled_x=2 -> cases[2]
    x,                // x=2, doubled_x=4 -> default
    x,                // x=3, doubled_x=6 -> default
    literal(value=0), // x=4, doubled_x=0 -> cases[0]
    literal(value=2), // x=5, doubled_x=2 -> cases[2]
    x,                // x=6, doubled_x=4 -> default
    x                 // x=7, doubled_x=6 -> default
  ]) // Cases derived from evaluating the original select for each possible x
}

The resulting select directly maps x to the output, effectively converting the add and original select into a single, more efficient LUT implementation.
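The truth-table construction can be modeled in Python for this example (an illustrative sketch, not the pass's implementation):

```python
def original_select(x):
    """Evaluate the original simple_select for a given 3-bit x."""
    doubled_x = (x + x) & 0b111          # add(x, x) wraps modulo 8 in bits[3]
    cases = [0, 1, 2, 3]
    return cases[doubled_x] if doubled_x < len(cases) else x  # default = x

def build_lut():
    """Enumerate the min cut ({x}) to produce the new select's cases."""
    return [original_select(x) for x in range(8)]
```

The resulting table [0, 2, 2, 3, 0, 2, 6, 7] corresponds exactly to the cases of the optimized sel above (with x substituted where the default is chosen).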

Header

map_inlining - Inline map operations

A pass to convert map nodes to in-line Invoke nodes. We don't directly lower maps to Verilog.

Note: What follows was generated by the Gemini LLM. Not human verified.

The MapInliningPass is an optimization pass that transforms map operations in the XLS IR into a series of explicit invoke operations. In the XLS compilation flow, map operations represent a high-level construct that applies a function to each element of an array. Since target hardware description languages like Verilog do not directly support map operations, this pass is crucial for lowering this abstraction to a more hardware-friendly representation.

The pass operates as follows:

  1. Identification of map nodes: It traverses the IR to identify all map instructions.

  2. Transformation: For each map instruction, it replaces the map node with an array operation. This array operation is composed of multiple invoke operations, where each invoke corresponds to one application of the mapped function to an element of the input array.

    * For an input array of size N, the map operation map(input_array, to_apply=map_fn) is replaced by an array of N elements.

    * Each element of this newly constructed array is an invoke operation.

    * Each invoke operation is explicitly passed a single element from the input_array (accessed via array_index using a literal index) and applies the map_fn to it.

This effectively "unrolls" the map operation, making each application of the map_fn to an array element an explicit operation in the IR. This is analogous to how a loop unrolling pass might transform a counted_for loop.

Example: Consider a map operation that applies map_fn to an array a:

fn map_fn(x: bits[32]) -> bits[16] {
  ret bit_slice.1: bits[16] = bit_slice(x, start=0, width=16)
}

fn main(a: bits[32][4]) -> bits[16][4] {
  ret result: bits[16][4] = map(a, to_apply=map_fn)
}

After MapInliningPass, the map operation would be replaced by an array of four invoke operations, each applying map_fn to a specific element of a:

fn map_fn(x: bits[32]) -> bits[16] {
  ret bit_slice.1: bits[16] = bit_slice(x, start=0, width=16)
}

fn main(a: bits[32][4]) -> bits[16][4] {
  array_index.0: bits[32] = array_index(a, indices=[literal(0)])
  invoke.0: bits[16] = invoke(array_index.0, to_apply=map_fn)
  array_index.1: bits[32] = array_index(a, indices=[literal(1)])
  invoke.1: bits[16] = invoke(array_index.1, to_apply=map_fn)
  array_index.2: bits[32] = array_index(a, indices=[literal(2)])
  invoke.2: bits[16] = invoke(array_index.2, to_apply=map_fn)
  array_index.3: bits[32] = array_index(a, indices=[literal(3)])
  invoke.3: bits[16] = invoke(array_index.3, to_apply=map_fn)
  ret array.4: bits[16][4] = array(invoke.0, invoke.1, invoke.2, invoke.3)
}

This transformation is crucial for enabling the subsequent stages of hardware synthesis.
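The transformation is behavior-preserving, which a small Python model makes easy to check (illustrative names, not XLS APIs):

```python
MASK16 = 0xFFFF

def map_fn(x):
    return x & MASK16  # mirrors bit_slice(x, start=0, width=16)

def map_op(a, fn):
    """The high-level map: apply fn to every element of the array."""
    return [fn(e) for e in a]

def inlined_map(a, fn):
    """After inlining: one explicit index + call per element, rebuilt as an array."""
    return [fn(a[0]), fn(a[1]), fn(a[2]), fn(a[3])]
```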

Header

narrow - Narrowing

A pass which reduces the width of operations eliminating redundant or unused bits.

Note: What follows was generated by the Gemini LLM. Not human verified.

The NarrowingPass is a powerful optimization pass in XLS designed to reduce the bit width of bits-typed operations and values within a function or proc. This is a critical optimization for hardware generation, as a reduction in bit widths directly translates to smaller, more efficient, and often faster hardware (fewer wires, smaller functional units, and smaller registers). The pass employs various advanced analysis techniques to determine the minimum required bit width for each value without altering its functional correctness.

The pass is configured with an AnalysisType enum, which dictates the sophistication of the analysis used:

  • kTernary: Utilizes a simpler ternary logic-based query engine. This analysis determines for each bit if it is always zero, always one, or unknown.
  • kRange: Employs a more powerful range analysis to determine the possible numerical bounds (minimum and maximum values) that a bit-vector can take. This provides more precise information for narrowing.
  • kRangeWithContext: Extends range analysis to be context-sensitive. This means it considers predicates (conditions) of select operations and other conditional control flow to derive more precise ranges for values within specific execution paths.
  • kRangeWithOptionalContext: Allows the choice between kRange and kRangeWithContext based on the options.use_context_narrowing_analysis flag, providing flexibility in the trade-off between analysis time and precision.

The pass operates by iteratively traversing the IR and applying transformations to nodes where bit widths can be safely reduced. Key optimizations include:

  1. Replacing Precise Values with Literals: If an analysis (ternary or range) can definitively determine that a node's value is precisely known (i.e., it always evaluates to a constant), that node is replaced with a literal node. This simplifies the graph and can enable further optimizations.

  2. Narrowing Comparison Operations:

    * Stripping Matched Leading/Trailing Bits: For comparison operations (eq, ne, ult, ugt, etc.), if leading or trailing bits of both operands are known to be identical, these matching bits can be logically removed, and the comparison is performed on the narrower remaining bits.

      // Original: UGt(0b0110_0XXX_0011, 0b0110_0YYY_0011) // bit strings
      // Leading 0110 and trailing 0011 match for both operands.
      // Optimized: UGt(0bXXX, 0bYYY)

    * Signed Comparison Simplification: Signed comparisons with zero-extended operands (or operands whose MSBs are known to be equal) can often be converted to unsigned comparisons on narrower bit widths, as the sign information becomes redundant.

  3. Narrowing Negate Operations:

    * Trailing Zeros: If an operand to a negate operation has known trailing zeros, the negate can be performed on the more significant bits, and the trailing zeros re-concatenated, effectively reducing the width of the negate operation.

    * Sign Extension Hoisting: For negate(sign_ext(x)), if the range analysis shows that x cannot be the minimum representable signed value, this can be simplified to sign_ext(negate(x)), reducing the width of the negate operation. More generally, it finds the minimal bit width for the inner negate and then re-extends.

  4. Narrowing Shift Operations (shll, shra, shrl):

    * Zero Shift Amount: If a shift amount is provably zero, the shift operation is replaced directly by its value operand (a no-op).

    * Known Leading/Trailing Zeros/Sign Bits: For shift operations, if parts of the shift_value are known to be zero or sign bits, the shift_value can be sliced to remove these redundant bits before the shift, and the result re-extended (zero-extend or sign-extend) if necessary. The shift_amount itself can also be narrowed if it has leading zeros.

    * Out-of-Bounds Shifts: Shift operations that are guaranteed to shift all bits off the end of the operand (resulting in a zero value) are replaced by a literal zero.

  5. Narrowing Add and Sub Operations:

    * Common Leading Zeros/Sign Bits: For add and sub operations, if both operands share a common number of leading zeros (for unsigned results) or sign bits (for signed results), the operation can be performed on a narrower bit width. The result is then zero-extended or sign-extended back to the original width. This includes careful handling of potential overflows for signed additions.

  6. Narrowing Multiply Operations (umul, smul, umulp, smulp):

    * Result Wider Than Sum of Operands: If the declared result width of a multiply operation is greater than the sum of the bit widths of its operands, the multiplication can be performed at the sum-of-operand widths, and the result then extended.

    * Operands Wider Than Result: If the operands are wider than the required result width, they can be narrowed (sliced) before the multiplication, reducing the complexity of the multiplier.

    * Extended Operands (Zero-extend/Sign-extend): When operands are the results of zero_ext or sign_ext operations, they can often be narrowed back to their original non-extended width, with the multiplication type adjusted (e.g., umul or smul) if necessary to maintain correctness.

    * Trailing Zeros: Multiplies where one or both operands have known trailing zeros can be optimized by performing the multiplication on the non-zero bits and concatenating the result with the appropriate number of trailing zeros.

    * Partial Product Optimization: For umulp/smulp (partial product multiplies), if the two partial products are immediately summed (e.g., add(tuple_index(mulp_op, 0), tuple_index(mulp_op, 1))), the overall operation can be narrowed, and the final sum then extended.

  7. Narrowing ArrayIndex Operations:

    * Literal Index Values: If an index to an array_index is a known literal, its bit width can be narrowed to the minimum required to represent the actual array size, saving bits in the index computation.

    * Known Zero Index: If an index is provably zero, it's replaced with a literal zero of appropriate width.

    * Known Out-of-Bounds Index: Similar to BitSliceSimplificationPass, if an index is provably out of bounds, it's clamped to the maximum valid index.

    * Convert ArrayIndex to Select: For array_index operations with a small, discrete set of possible index values (based on range analysis), the operation can be converted to a select chain, making explicit checks for each possible index value. This can be beneficial for small arrays by removing complex indexing logic.

  8. Narrowing Decode Operations:

    * Leading Zeros in Index: If the index of a decode operation has leading zeros, the index can be sliced to remove them, and the decode performed on the narrower index, with the result zero-extended if needed.

    * Zero Index: A decode with a provably zero index is replaced by a literal one, as decode(0) always results in a one-hot encoding with the least significant bit set.
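Both decode rules rest on simple bit-level identities, modeled here in Python (the decode helper and the 6-bit/64-bit widths are illustrative assumptions, not XLS APIs):

```python
def decode(index, width):
    # XLS-style decode: one-hot result with bit `index` set, or zero
    # if the index is out of range of the result width.
    return (1 << index) if index < width else 0

# Zero index: the result is always the literal 1 (one-hot LSB).
assert decode(0, 8) == 1

# Leading zeros in the index: a 6-bit index with three known-zero top
# bits decodes identically via its sliced 3-bit low part into a
# narrower 8-entry decode (zero-extension to 64 bits is implicit).
for idx in range(8):
    assert decode(idx, 64) == decode(idx & 0b111, 8)
```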

The NarrowingPass is typically run repeatedly within an optimization pipeline until a fixed point is reached. This iterative process is crucial for achieving maximal bit width reduction across the entire IR, as narrowing one node can often enable further narrowing of its consumers or producers.

Header

narrow(Context) - Narrowing

A pass which reduces the width of operations, eliminating redundant or unused bits.

Note: What follows was generated by the Gemini LLM. Not human verified.

The NarrowingPass is a powerful optimization pass in XLS designed to reduce the bit width of bits-typed operations and values within a function or proc. This is a critical optimization for hardware generation, as a reduction in bit widths directly translates to smaller, more efficient, and often faster hardware (fewer wires, smaller functional units, and smaller registers). The pass employs various advanced analysis techniques to determine the minimum required bit width for each value without altering its functional correctness.

The pass is configured with an AnalysisType enum, which dictates the sophistication of the analysis used:

  • kTernary: Utilizes a simpler ternary logic-based query engine. This analysis determines for each bit if it is always zero, always one, or unknown.
  • kRange: Employs a more powerful range analysis to determine the possible numerical bounds (minimum and maximum values) that a bit-vector can take. This provides more precise information for narrowing.
  • kRangeWithContext: Extends range analysis to be context-sensitive. This means it considers predicates (conditions) of select operations and other conditional control flow to derive more precise ranges for values within specific execution paths.
  • kRangeWithOptionalContext: Selects between kRange and kRangeWithContext based on the options.use_context_narrowing_analysis flag, providing flexibility in the trade-off between analysis time and precision.

The pass operates by iteratively traversing the IR and applying transformations to nodes where bit widths can be safely reduced. Key optimizations include:

  1. Replacing Precise Values with Literals: If an analysis (ternary or range) can definitively determine that a node's value is precisely known (i.e., it always evaluates to a constant), that node is replaced with a literal node. This simplifies the graph and can enable further optimizations.

  2. Narrowing Comparison Operations:

    * Stripping Matched Leading/Trailing Bits: For comparison operations (eq, ne, ult, ugt, etc.), if leading or trailing bits of both operands are known to be identical, these matching bits can be logically removed, and the comparison is performed on the narrower remaining bits.

      ```
      // Original: UGt(0b0110_0XXX_0011, 0b0110_0YYY_0011) // bit strings
      // Leading `0110` and trailing `0011` match for both operands.
      // Optimized: UGt(0bXXX, 0bYYY)
      ```
    

    * Signed Comparison Simplification: Signed comparisons with zero-extended operands (or operands whose MSBs are known to be equal) can often be converted to unsigned comparisons on narrower bit widths, as the sign information becomes redundant.
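The signed-to-unsigned rewrite can be sanity-checked exhaustively in Python; the to_signed helper and 4-bit width below are our own illustrative choices, not XLS APIs:

```python
def to_signed(value, width):
    # Interpret a width-bit value as two's complement.
    return value - (1 << width) if value & (1 << (width - 1)) else value

W = 4  # original operand width; operands are zero-extended to W + 1 bits
for a in range(1 << W):
    for b in range(1 << W):
        # slt on the zero-extended operands (MSB known zero) ...
        slt = to_signed(a, W + 1) < to_signed(b, W + 1)
        # ... equals ult on the original, narrower operands.
        assert slt == (a < b)
```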

  3. Narrowing Negate Operations:

    * Trailing Zeros: If an operand to a negate operation has known trailing zeros, the negate can be performed on the more significant bits, and the trailing zeros re-concatenated, effectively reducing the width of the negate operation.

    * Sign Extension Hoisting: For negate(sign_ext(x)), if the range analysis shows that x cannot be the minimum representable signed value, this can be simplified to sign_ext(negate(x)), reducing the width of the negate operation. More generally, it finds the minimal bit width for the inner negate and then re-extends.

  4. Narrowing Shift Operations (shll, shra, shrl):

    * Zero Shift Amount: If a shift amount is provably zero, the shift operation is replaced directly by its value operand (a no-op).

    * Known Leading/Trailing Zeros/Sign Bits: For shift operations, if parts of the shift_value are known to be zero or sign bits, the shift_value can be sliced to remove these redundant bits before the shift, and the result re-extended (zero-extend or sign-extend) if necessary. The shift_amount itself can also be narrowed if it has leading zeros.

    * Out-of-Bounds Shifts: Shift operations that are guaranteed to shift all bits off the end of the operand (resulting in a zero value) are replaced by a literal zero.

  5. Narrowing Add and Sub Operations:

    * Common Leading Zeros/Sign Bits: For add and sub operations, if both operands share a common number of leading zeros (for unsigned results) or sign bits (for signed results), the operation can be performed on a narrower bit width. The result is then zero-extended or sign-extended back to the original width. This includes careful handling of potential overflows for signed additions.
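The unsigned case of this add narrowing can be verified exhaustively in Python; the widths N and K are illustrative assumptions:

```python
N = 8  # original add width
K = 3  # leading zero bits known in both operands

for a in range(1 << (N - K)):
    for b in range(1 << (N - K)):
        full = (a + b) % (1 << N)              # the original N-bit add
        # Adding at width N - K + 1 (one extra bit for the carry) and
        # zero-extending back gives the identical result.
        narrow = (a + b) % (1 << (N - K + 1))
        assert full == narrow
```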

  6. Narrowing Multiply Operations (umul, smul, umulp, smulp):

    * Result Wider Than Sum of Operands: If the declared result width of a multiply operation is greater than the sum of the bit widths of its operands, the multiplication can be performed at the sum-of-operand widths, and the result then extended.

    * Operands Wider Than Result: If the operands are wider than the required result width, they can be narrowed (sliced) before the multiplication, reducing the complexity of the multiplier.

    * Extended Operands (Zero-extend/Sign-extend): When operands are the results of zero_ext or sign_ext operations, they can often be narrowed back to their original non-extended width, with the multiplication type adjusted (e.g., umul or smul) if necessary to maintain correctness.

    * Trailing Zeros: Multiplies where one or both operands have known trailing zeros can be optimized by performing the multiplication on the non-zero bits and concatenating the result with the appropriate number of trailing zeros.

    * Partial Product Optimization: For umulp/smulp (partial product multiplies), if the two partial products are immediately summed (e.g., add(tuple_index(mulp_op, 0), tuple_index(mulp_op, 1))), the overall operation can be narrowed, and the final sum then extended.

  7. Narrowing ArrayIndex Operations:

    * Literal Index Values: If an index to an array_index is a known literal, its bit width can be narrowed to the minimum required to represent the actual array size, saving bits in the index computation.

    * Known Zero Index: If an index is provably zero, it's replaced with a literal zero of appropriate width.

    * Known Out-of-Bounds Index: As in BitSliceSimplificationPass, an index that is provably out of bounds is clamped to the maximum valid index.

    * Convert ArrayIndex to Select: For array_index operations with a small, discrete set of possible index values (based on range analysis), the operation can be converted to a select chain, making explicit checks for each possible index value. This can be beneficial for small arrays by removing complex indexing logic.

  8. Narrowing Decode Operations:

    * Leading Zeros in Index: If the index of a decode operation has leading zeros, the index can be sliced to remove them, and the decode performed on the narrower index, with the result zero-extended if needed.

    * Zero Index: A decode with a provably zero index is replaced by a literal one, as decode(0) always results in a one-hot encoding with the least significant bit set.

The NarrowingPass is typically run repeatedly within an optimization pipeline until a fixed point is reached. This iterative process is crucial for achieving maximal bit width reduction across the entire IR, as narrowing one node can often enable further narrowing of its consumers or producers.

Header

narrow(OptionalContext) - Narrowing

A pass which reduces the width of operations, eliminating redundant or unused bits.

Note: What follows was generated by the Gemini LLM. Not human verified.

The NarrowingPass is a powerful optimization pass in XLS designed to reduce the bit width of bits-typed operations and values within a function or proc. This is a critical optimization for hardware generation, as a reduction in bit widths directly translates to smaller, more efficient, and often faster hardware (fewer wires, smaller functional units, and smaller registers). The pass employs various advanced analysis techniques to determine the minimum required bit width for each value without altering its functional correctness.

The pass is configured with an AnalysisType enum, which dictates the sophistication of the analysis used:

  • kTernary: Utilizes a simpler ternary logic-based query engine. This analysis determines for each bit if it is always zero, always one, or unknown.
  • kRange: Employs a more powerful range analysis to determine the possible numerical bounds (minimum and maximum values) that a bit-vector can take. This provides more precise information for narrowing.
  • kRangeWithContext: Extends range analysis to be context-sensitive. This means it considers predicates (conditions) of select operations and other conditional control flow to derive more precise ranges for values within specific execution paths.
  • kRangeWithOptionalContext: Selects between kRange and kRangeWithContext based on the options.use_context_narrowing_analysis flag, providing flexibility in the trade-off between analysis time and precision.

The pass operates by iteratively traversing the IR and applying transformations to nodes where bit widths can be safely reduced. Key optimizations include:

  1. Replacing Precise Values with Literals: If an analysis (ternary or range) can definitively determine that a node's value is precisely known (i.e., it always evaluates to a constant), that node is replaced with a literal node. This simplifies the graph and can enable further optimizations.

  2. Narrowing Comparison Operations:

    * Stripping Matched Leading/Trailing Bits: For comparison operations (eq, ne, ult, ugt, etc.), if leading or trailing bits of both operands are known to be identical, these matching bits can be logically removed, and the comparison is performed on the narrower remaining bits.

      ```
      // Original: UGt(0b0110_0XXX_0011, 0b0110_0YYY_0011) // bit strings
      // Leading `0110` and trailing `0011` match for both operands.
      // Optimized: UGt(0bXXX, 0bYYY)
      ```
    

    * Signed Comparison Simplification: Signed comparisons with zero-extended operands (or operands whose MSBs are known to be equal) can often be converted to unsigned comparisons on narrower bit widths, as the sign information becomes redundant.

  3. Narrowing Negate Operations:

    * Trailing Zeros: If an operand to a negate operation has known trailing zeros, the negate can be performed on the more significant bits, and the trailing zeros re-concatenated, effectively reducing the width of the negate operation.

    * Sign Extension Hoisting: For negate(sign_ext(x)), if the range analysis shows that x cannot be the minimum representable signed value, this can be simplified to sign_ext(negate(x)), reducing the width of the negate operation. More generally, it finds the minimal bit width for the inner negate and then re-extends.
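The sign-extension hoisting identity, including the excluded minimum-value case, can be checked exhaustively in Python. The helpers below (to_signed, neg, sign_ext) are hand-written two's-complement models, not XLS APIs, and the 4-to-8-bit widths are illustrative:

```python
def to_signed(v, w):
    # Interpret a w-bit value as two's complement.
    return v - (1 << w) if v & (1 << (w - 1)) else v

def neg(v, w):
    # w-bit two's-complement negation.
    return (-v) % (1 << w)

def sign_ext(v, w, new_w):
    # Sign-extend a w-bit value to new_w bits.
    return to_signed(v, w) % (1 << new_w)

W, NEW_W = 4, 8
for x in range(1 << W):
    if to_signed(x, W) == -(1 << (W - 1)):
        continue  # the excluded case: x is the minimum signed value
    # negate(sign_ext(x)) == sign_ext(negate(x))
    assert neg(sign_ext(x, W, NEW_W), NEW_W) == sign_ext(neg(x, W), W, NEW_W)
```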

  4. Narrowing Shift Operations (shll, shra, shrl):

    * Zero Shift Amount: If a shift amount is provably zero, the shift operation is replaced directly by its value operand (a no-op).

    * Known Leading/Trailing Zeros/Sign Bits: For shift operations, if parts of the shift_value are known to be zero or sign bits, the shift_value can be sliced to remove these redundant bits before the shift, and the result re-extended (zero-extend or sign-extend) if necessary. The shift_amount itself can also be narrowed if it has leading zeros.

    * Out-of-Bounds Shifts: Shift operations that are guaranteed to shift all bits off the end of the operand (resulting in a zero value) are replaced by a literal zero.

  5. Narrowing Add and Sub Operations:

    * Common Leading Zeros/Sign Bits: For add and sub operations, if both operands share a common number of leading zeros (for unsigned results) or sign bits (for signed results), the operation can be performed on a narrower bit width. The result is then zero-extended or sign-extended back to the original width. This includes careful handling of potential overflows for signed additions.

  6. Narrowing Multiply Operations (umul, smul, umulp, smulp):

    * Result Wider Than Sum of Operands: If the declared result width of a multiply operation is greater than the sum of the bit widths of its operands, the multiplication can be performed at the sum-of-operand widths, and the result then extended.

    * Operands Wider Than Result: If the operands are wider than the required result width, they can be narrowed (sliced) before the multiplication, reducing the complexity of the multiplier.

    * Extended Operands (Zero-extend/Sign-extend): When operands are the results of zero_ext or sign_ext operations, they can often be narrowed back to their original non-extended width, with the multiplication type adjusted (e.g., umul or smul) if necessary to maintain correctness.

    * Trailing Zeros: Multiplies where one or both operands have known trailing zeros can be optimized by performing the multiplication on the non-zero bits and concatenating the result with the appropriate number of trailing zeros.

    * Partial Product Optimization: For umulp/smulp (partial product multiplies), if the two partial products are immediately summed (e.g., add(tuple_index(mulp_op, 0), tuple_index(mulp_op, 1))), the overall operation can be narrowed, and the final sum then extended.
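The trailing-zeros multiply rewrite reduces to a modular-arithmetic identity, checked here in Python (the widths N, J, K are illustrative assumptions):

```python
N = 8        # result width of the multiply
J, K = 2, 1  # known trailing-zero counts of the two operands

for a_hi in range(1 << 4):
    for b_hi in range(1 << 4):
        a, b = a_hi << J, b_hi << K   # operands with trailing zeros
        full = (a * b) % (1 << N)
        # Multiply only the nonzero slices at width N - J - K, then
        # concatenate J + K trailing zeros onto the result.
        narrow = ((a_hi * b_hi) % (1 << (N - J - K))) << (J + K)
        assert full == narrow
```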

  7. Narrowing ArrayIndex Operations:

    * Literal Index Values: If an index to an array_index is a known literal, its bit width can be narrowed to the minimum required to represent the actual array size, saving bits in the index computation.

    * Known Zero Index: If an index is provably zero, it's replaced with a literal zero of appropriate width.

    * Known Out-of-Bounds Index: As in BitSliceSimplificationPass, an index that is provably out of bounds is clamped to the maximum valid index.

    * Convert ArrayIndex to Select: For array_index operations with a small, discrete set of possible index values (based on range analysis), the operation can be converted to a select chain, making explicit checks for each possible index value. This can be beneficial for small arrays by removing complex indexing logic.

  8. Narrowing Decode Operations:

    * Leading Zeros in Index: If the index of a decode operation has leading zeros, the index can be sliced to remove them, and the decode performed on the narrower index, with the result zero-extended if needed.

    * Zero Index: A decode with a provably zero index is replaced by a literal one, as decode(0) always results in a one-hot encoding with the least significant bit set.

The NarrowingPass is typically run repeatedly within an optimization pipeline until a fixed point is reached. This iterative process is crucial for achieving maximal bit width reduction across the entire IR, as narrowing one node can often enable further narrowing of its consumers or producers.

Header

narrow(Range) - Narrowing

A pass which reduces the width of operations, eliminating redundant or unused bits.

Note: What follows was generated by the Gemini LLM. Not human verified.

The NarrowingPass is a powerful optimization pass in XLS designed to reduce the bit width of bits-typed operations and values within a function or proc. This is a critical optimization for hardware generation, as a reduction in bit widths directly translates to smaller, more efficient, and often faster hardware (fewer wires, smaller functional units, and smaller registers). The pass employs various advanced analysis techniques to determine the minimum required bit width for each value without altering its functional correctness.

The pass is configured with an AnalysisType enum, which dictates the sophistication of the analysis used:

  • kTernary: Utilizes a simpler ternary logic-based query engine. This analysis determines for each bit if it is always zero, always one, or unknown.
  • kRange: Employs a more powerful range analysis to determine the possible numerical bounds (minimum and maximum values) that a bit-vector can take. This provides more precise information for narrowing.
  • kRangeWithContext: Extends range analysis to be context-sensitive. This means it considers predicates (conditions) of select operations and other conditional control flow to derive more precise ranges for values within specific execution paths.
  • kRangeWithOptionalContext: Selects between kRange and kRangeWithContext based on the options.use_context_narrowing_analysis flag, providing flexibility in the trade-off between analysis time and precision.

The pass operates by iteratively traversing the IR and applying transformations to nodes where bit widths can be safely reduced. Key optimizations include:

  1. Replacing Precise Values with Literals: If an analysis (ternary or range) can definitively determine that a node's value is precisely known (i.e., it always evaluates to a constant), that node is replaced with a literal node. This simplifies the graph and can enable further optimizations.

  2. Narrowing Comparison Operations:

    * Stripping Matched Leading/Trailing Bits: For comparison operations (eq, ne, ult, ugt, etc.), if leading or trailing bits of both operands are known to be identical, these matching bits can be logically removed, and the comparison is performed on the narrower remaining bits.

      ```
      // Original: UGt(0b0110_0XXX_0011, 0b0110_0YYY_0011) // bit strings
      // Leading `0110` and trailing `0011` match for both operands.
      // Optimized: UGt(0bXXX, 0bYYY)
      ```
    

    * Signed Comparison Simplification: Signed comparisons with zero-extended operands (or operands whose MSBs are known to be equal) can often be converted to unsigned comparisons on narrower bit widths, as the sign information becomes redundant.

  3. Narrowing Negate Operations:

    * Trailing Zeros: If an operand to a negate operation has known trailing zeros, the negate can be performed on the more significant bits, and the trailing zeros re-concatenated, effectively reducing the width of the negate operation.

    * Sign Extension Hoisting: For negate(sign_ext(x)), if the range analysis shows that x cannot be the minimum representable signed value, this can be simplified to sign_ext(negate(x)), reducing the width of the negate operation. More generally, it finds the minimal bit width for the inner negate and then re-extends.

  4. Narrowing Shift Operations (shll, shra, shrl):

    * Zero Shift Amount: If a shift amount is provably zero, the shift operation is replaced directly by its value operand (a no-op).

    * Known Leading/Trailing Zeros/Sign Bits: For shift operations, if parts of the shift_value are known to be zero or sign bits, the shift_value can be sliced to remove these redundant bits before the shift, and the result re-extended (zero-extend or sign-extend) if necessary. The shift_amount itself can also be narrowed if it has leading zeros.

    * Out-of-Bounds Shifts: Shift operations that are guaranteed to shift all bits off the end of the operand (resulting in a zero value) are replaced by a literal zero.
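The zero-amount and out-of-bounds shift rules can be checked directly in Python; the shll helper and 8-bit width are illustrative assumptions, not XLS APIs:

```python
N = 8  # operand width

def shll(x, amt, n=N):
    # n-bit logical shift left: bits shifted past the top drop off.
    return (x << amt) % (1 << n)

for x in range(1 << N):
    assert shll(x, 0) == x           # provably-zero shift amount: a no-op
    for amt in range(N, N + 4):
        assert shll(x, amt) == 0     # amount >= width: a literal zero
```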

  5. Narrowing Add and Sub Operations:

    * Common Leading Zeros/Sign Bits: For add and sub operations, if both operands share a common number of leading zeros (for unsigned results) or sign bits (for signed results), the operation can be performed on a narrower bit width. The result is then zero-extended or sign-extended back to the original width. This includes careful handling of potential overflows for signed additions.

  6. Narrowing Multiply Operations (umul, smul, umulp, smulp):

    * Result Wider Than Sum of Operands: If the declared result width of a multiply operation is greater than the sum of the bit widths of its operands, the multiplication can be performed at the sum-of-operand widths, and the result then extended.

    * Operands Wider Than Result: If the operands are wider than the required result width, they can be narrowed (sliced) before the multiplication, reducing the complexity of the multiplier.

    * Extended Operands (Zero-extend/Sign-extend): When operands are the results of zero_ext or sign_ext operations, they can often be narrowed back to their original non-extended width, with the multiplication type adjusted (e.g., umul or smul) if necessary to maintain correctness.

    * Trailing Zeros: Multiplies where one or both operands have known trailing zeros can be optimized by performing the multiplication on the non-zero bits and concatenating the result with the appropriate number of trailing zeros.

    * Partial Product Optimization: For umulp/smulp (partial product multiplies), if the two partial products are immediately summed (e.g., add(tuple_index(mulp_op, 0), tuple_index(mulp_op, 1))), the overall operation can be narrowed, and the final sum then extended.
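The "result wider than sum of operands" rule follows from the fact that a U-bit by V-bit product always fits in U + V bits, checked exhaustively below (the widths are illustrative):

```python
U, V = 4, 3           # operand widths
RESULT = U + V + 5    # declared result width, wider than U + V

for a in range(1 << U):
    for b in range(1 << V):
        full = (a * b) % (1 << RESULT)
        # A (U + V)-bit multiply zero-extended to RESULT bits is
        # identical, because the product never overflows U + V bits.
        narrow = (a * b) % (1 << (U + V))
        assert full == narrow
```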

  7. Narrowing ArrayIndex Operations:

    * Literal Index Values: If an index to an array_index is a known literal, its bit width can be narrowed to the minimum required to represent the actual array size, saving bits in the index computation.

    * Known Zero Index: If an index is provably zero, it's replaced with a literal zero of appropriate width.

    * Known Out-of-Bounds Index: As in BitSliceSimplificationPass, an index that is provably out of bounds is clamped to the maximum valid index.

    * Convert ArrayIndex to Select: For array_index operations with a small, discrete set of possible index values (based on range analysis), the operation can be converted to a select chain, making explicit checks for each possible index value. This can be beneficial for small arrays by removing complex indexing logic.

  8. Narrowing Decode Operations:

    * Leading Zeros in Index: If the index of a decode operation has leading zeros, the index can be sliced to remove them, and the decode performed on the narrower index, with the result zero-extended if needed.

    * Zero Index: A decode with a provably zero index is replaced by a literal one, as decode(0) always results in a one-hot encoding with the least significant bit set.

The NarrowingPass is typically run repeatedly within an optimization pipeline until a fixed point is reached. This iterative process is crucial for achieving maximal bit width reduction across the entire IR, as narrowing one node can often enable further narrowing of its consumers or producers.

Header

narrow(Ternary) - Narrowing

A pass which reduces the width of operations, eliminating redundant or unused bits.

Note: What follows was generated by the Gemini LLM. Not human verified.

The NarrowingPass is a powerful optimization pass in XLS designed to reduce the bit width of bits-typed operations and values within a function or proc. This is a critical optimization for hardware generation, as a reduction in bit widths directly translates to smaller, more efficient, and often faster hardware (fewer wires, smaller functional units, and smaller registers). The pass employs various advanced analysis techniques to determine the minimum required bit width for each value without altering its functional correctness.

The pass is configured with an AnalysisType enum, which dictates the sophistication of the analysis used:

  • kTernary: Utilizes a simpler ternary logic-based query engine. This analysis determines for each bit if it is always zero, always one, or unknown.
  • kRange: Employs a more powerful range analysis to determine the possible numerical bounds (minimum and maximum values) that a bit-vector can take. This provides more precise information for narrowing.
  • kRangeWithContext: Extends range analysis to be context-sensitive. This means it considers predicates (conditions) of select operations and other conditional control flow to derive more precise ranges for values within specific execution paths.
  • kRangeWithOptionalContext: Selects between kRange and kRangeWithContext based on the options.use_context_narrowing_analysis flag, providing flexibility in the trade-off between analysis time and precision.

The pass operates by iteratively traversing the IR and applying transformations to nodes where bit widths can be safely reduced. Key optimizations include:

  1. Replacing Precise Values with Literals: If an analysis (ternary or range) can definitively determine that a node's value is precisely known (i.e., it always evaluates to a constant), that node is replaced with a literal node. This simplifies the graph and can enable further optimizations.

  2. Narrowing Comparison Operations:

    * Stripping Matched Leading/Trailing Bits: For comparison operations (eq, ne, ult, ugt, etc.), if leading or trailing bits of both operands are known to be identical, these matching bits can be logically removed, and the comparison is performed on the narrower remaining bits.

      ```
      // Original: UGt(0b0110_0XXX_0011, 0b0110_0YYY_0011) // bit strings
      // Leading `0110` and trailing `0011` match for both operands.
      // Optimized: UGt(0bXXX, 0bYYY)
      ```
    

    * Signed Comparison Simplification: Signed comparisons with zero-extended operands (or operands whose MSBs are known to be equal) can often be converted to unsigned comparisons on narrower bit widths, as the sign information becomes redundant.
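The bit-stripping rule above can be checked exhaustively in Python using the same 0b0110/0b0011 prefix and suffix as the snippet; the widen helper is our own illustrative construction:

```python
W_MID = 3                        # width of the differing middle bits
PREFIX, SUFFIX = 0b0110, 0b0011  # matched leading and trailing bits

def widen(mid):
    # Build the 0b0110_mmm_0011 bit pattern from a 3-bit middle.
    return (PREFIX << (W_MID + 4)) | (mid << 4) | SUFFIX

for x in range(1 << W_MID):
    for y in range(1 << W_MID):
        # UGt on the full values equals UGt on just the middle bits.
        assert (widen(x) > widen(y)) == (x > y)
```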

  3. Narrowing Negate Operations:

    * Trailing Zeros: If an operand to a negate operation has known trailing zeros, the negate can be performed on the more significant bits, and the trailing zeros re-concatenated, effectively reducing the width of the negate operation.

    * Sign Extension Hoisting: For negate(sign_ext(x)), if the range analysis shows that x cannot be the minimum representable signed value, this can be simplified to sign_ext(negate(x)), reducing the width of the negate operation. More generally, it finds the minimal bit width for the inner negate and then re-extends.

  4. Narrowing Shift Operations (shll, shra, shrl):

    * Zero Shift Amount: If a shift amount is provably zero, the shift operation is replaced directly by its value operand (a no-op).

    * Known Leading/Trailing Zeros/Sign Bits: For shift operations, if parts of the shift_value are known to be zero or sign bits, the shift_value can be sliced to remove these redundant bits before the shift, and the result re-extended (zero-extend or sign-extend) if necessary. The shift_amount itself can also be narrowed if it has leading zeros.

    * Out-of-Bounds Shifts: Shift operations that are guaranteed to shift all bits off the end of the operand (resulting in a zero value) are replaced by a literal zero.

  5. Narrowing Add and Sub Operations:

    * Common Leading Zeros/Sign Bits: For add and sub operations, if both operands share a common number of leading zeros (for unsigned results) or sign bits (for signed results), the operation can be performed on a narrower bit width. The result is then zero-extended or sign-extended back to the original width. This includes careful handling of potential overflows for signed additions.

  6. Narrowing Multiply Operations (umul, smul, umulp, smulp):

    * Result Wider Than Sum of Operands: If the declared result width of a multiply operation is greater than the sum of the bit widths of its operands, the multiplication can be performed at the sum-of-operand widths, and the result then extended.

    * Operands Wider Than Result: If the operands are wider than the required result width, they can be narrowed (sliced) before the multiplication, reducing the complexity of the multiplier.

    * Extended Operands (Zero-extend/Sign-extend): When operands are the results of zero_ext or sign_ext operations, they can often be narrowed back to their original non-extended width, with the multiplication type adjusted (e.g., umul or smul) if necessary to maintain correctness.

    * Trailing Zeros: Multiplies where one or both operands have known trailing zeros can be optimized by performing the multiplication on the non-zero bits and concatenating the result with the appropriate number of trailing zeros.

    * Partial Product Optimization: For umulp/smulp (partial product multiplies), if the two partial products are immediately summed (e.g., add(tuple_index(mulp_op, 0), tuple_index(mulp_op, 1))), the overall operation can be narrowed, and the final sum then extended.

  7. Narrowing ArrayIndex Operations:

    * Literal Index Values: If an index to an array_index is a known literal, its bit width can be narrowed to the minimum required to represent the actual array size, saving bits in the index computation.

    * Known Zero Index: If an index is provably zero, it's replaced with a literal zero of appropriate width.

    * Known Out-of-Bounds Index: As in BitSliceSimplificationPass, an index that is provably out of bounds is clamped to the maximum valid index.

    * Convert ArrayIndex to Select: For array_index operations with a small, discrete set of possible index values (based on range analysis), the operation can be converted to a select chain, making explicit checks for each possible index value. This can be beneficial for small arrays by removing complex indexing logic.

  8. Narrowing Decode Operations:

    * Leading Zeros in Index: If the index of a decode operation has leading zeros, the index can be sliced to remove them, and the decode performed on the narrower index, with the result zero-extended if needed.

    * Zero Index: A decode with a provably zero index is replaced by a literal one, as decode(0) always results in a one-hot encoding with the least significant bit set.

The NarrowingPass is typically run repeatedly within an optimization pipeline until a fixed point is reached. This iterative process is crucial for achieving maximal bit width reduction across the entire IR, as narrowing one node can often enable further narrowing of its consumers or producers.

Header

next_value_opt - Next Value Optimization

Pass which tries to optimize next_value nodes within a Proc.

Note: What follows was generated by the Gemini LLM. Not human verified.

This pass performs several key optimizations to simplify the logic related to state updates and expose further opportunities for other passes.

Optimizations include:

  1. Removing Literal Predicates: If a next_value node has a predicate that evaluates to a constant false, the next_value operation is dead and thus removed. If the predicate is a constant true, the predicate is removed, making the next_value unconditional.

    Example (dead next_value):

    x: bits[32]
    next_value(x, value=my_value, predicate=literal(value=0, bits=1))
    

    This next_value would be entirely removed, as it will never update x.

  2. Splitting Select-based Values: When a next_value operation's value operand is a select, priority_sel, or one_hot_sel operation, this pass can split the single next_value into multiple next_value nodes. Each new next_value corresponds to one of the select's cases, and its predicate is the condition under which that specific case would be chosen. This can lead to more granular control over state updates and enable further optimizations on individual branches.

    Example (splitting a select):

    x: bits[2]
    select_val: bits[2] = select(x,
                                 cases=[literal(2), literal(1)],
                                 default=literal(3))
    next_value(x, value=select_val)
    

    This could be transformed into (simplified representation):

    next_value(x, value=literal(2), predicate=eq(x, literal(0)))
    next_value(x, value=literal(1), predicate=eq(x, literal(1)))
    next_value(x, value=literal(3), predicate=ugt(x, literal(1)))
    

    This splitting is controlled by the max_split_depth parameter to prevent excessive IR growth.

  3. Handling Non-Synthesizable State Elements: The pass identifies state-elements that are marked as non-synthesizable and are performing an identity update. In such cases, it removes the read of the synthesizable version, allowing subsequent passes (like conditional specialization) to more effectively predicate the state read.

For best results, this pass should be run after any transformations that modernize old-style next (...) lines into explicit next_value nodes.
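The select-splitting rule in optimization 2 can be sketched in Python. This is an illustrative model, not the XLS implementation: the predicate strings merely describe the IR nodes the pass would build, and the function name is invented for this example.

```python
# Illustrative sketch (not the XLS implementation): derive the predicated
# next_value updates produced by splitting a select-based next value.

def split_select_next_value(cases, default):
    """Return (predicate, value) pairs, one per new next_value node."""
    updates = [(f"eq(sel, {i})", value) for i, value in enumerate(cases)]
    if default is not None:
        # The default case fires when the selector is past the last case.
        updates.append((f"ugt(sel, {len(cases) - 1})", default))
    return updates

# Mirrors the 2-bit example above: cases [literal(2), literal(1)], default 3.
print(split_select_next_value([2, 1], 3))
# [('eq(sel, 0)', 2), ('eq(sel, 1)', 1), ('ugt(sel, 1)', 3)]
```

In the real pass this splitting is bounded by max_split_depth; the sketch ignores that limit.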

Header

non_synth_separation - Non-Synthesizable Separation

Separates out non-synthesizable nodes like assert/cover/trace from the main function into a cloned function. Every function effectively has two versions, one with synthesizable nodes and one without. The synthesizable version invokes the non-synthesizable version of its function. This ensures that non-synthesizable uses of values do not affect the optimization of the synthesizable parts of the function.

Note: What follows was generated by the Gemini LLM. Not human verified.

The NonSynthSeparationPass is a crucial optimization pass in XLS that isolates non-synthesizable operations (such as assert, cover, and trace) from the main, synthesizable logic of a function or proc. This separation is essential because non-synthesizable operations can introduce data dependencies that might prevent optimizations on the synthesizable parts of the design. For instance, a value used in an assert could be seen as "live" by the Dead Code Elimination (DCE) pass, even if that value is not used in any synthesizable computation. This would prevent the logic generating that value from being removed.

By separating these concerns, the pass allows the synthesizable parts of the design to be aggressively optimized without being constrained by the needs of verification and debugging constructs.

How it Works:

  1. Identification: The pass first traverses each function and proc in the package to determine if it contains any non-synthesizable nodes (assert, cover, trace). If a function or proc is purely synthesizable, it is left untouched.

  2. Cloning: For each function or proc that contains non-synthesizable operations, a new, separate "non-synthesizable" version is created:

    * For Functions: The original function is cloned to create a new function (e.g., f is cloned to non_synth_f).

    * For Procs: The original proc is cloned, but as a Function rather than a Proc. This is because the non-synthesizable operations are stateless from a hardware perspective (they don't generate registers). Any proc-specific nodes (like StateRead, Next, Send, Receive) are handled specially during this cloning:

      * `StateRead` operations become Params in the new function.

      * `Next` operations are handled to update non-synthesizable state, but do not become part of the new function's logic.

      * `Send` and `Receive` operations are similarly abstracted away, with their data payloads passed as parameters to the new function.

  3. Separation:

    * Original (Synthesizable) Version: All non-synthesizable nodes are removed from the original function or proc. It is then modified to invoke the newly created non-synthesizable function, passing it all the necessary data (e.g., function parameters, proc state, received data) that the non-synthesizable operations depend on. This invoke itself is non-synthesizable but has a minimal impact on the synthesizable logic.

    * New (Non-Synthesizable) Version: The cloned function is stripped of all its synthesizable outputs (its return value is made a useless empty tuple). It retains only the logic necessary to support the non-synthesizable operations.

  4. Gate to Select Conversion: Gate operations, which are special power-optimization nodes, are typically not removed by DCE. To prevent their unnecessary duplication in the non-synthesizable function, they are converted to equivalent select operations, which can be more easily optimized away if they become redundant.

Benefits:

  • Improved Optimization: Frees the synthesizable logic from dependencies related to verification and debugging, enabling more aggressive optimizations like DCE, constant folding, and strength reduction.
  • Clear Separation of Concerns: Creates a clean architectural separation between synthesizable hardware logic and non-synthesizable verification/debugging logic, improving the clarity and maintainability of the IR.
  • Correctness: Ensures that verification constructs like assert, cover, and trace are preserved and correctly reflect the behavior of the design, while not interfering with the synthesis of the core functional logic.

Example: Consider a function that contains both a synthesizable computation and an assert:

// Original IR snippet
fn my_func(x: bits[32]) -> bits[32] {
  tok: token = literal(value=token)
  intermediate: bits[32] = add(x, literal(1))
  is_zero: bits[1] = eq(intermediate, literal(0))
  assert.1: token = assert(tok, is_zero, message="intermediate is zero!")
  ret result: bits[32] = mul(intermediate, literal(2))
}

After NonSynthSeparationPass, the IR would be split into two functions:

// Synthesizable version (original function, modified)
fn my_func(x: bits[32]) -> bits[32] {
  intermediate: bits[32] = add(x, literal(1))
  is_zero: bits[1] = eq(intermediate, literal(0))

  // Invoke the non-synthesizable part, passing necessary values
  invoke_non_synth: () = invoke(is_zero, to_apply=non_synth_my_func)

  ret result: bits[32] = mul(intermediate, literal(2))
}

// Non-synthesizable version (newly created function)
fn non_synth_my_func(is_zero: bits[1]) -> () {
  tok: token = literal(value=token)
  assert.1: token = assert(tok, is_zero, message="intermediate is zero!")

  ret useless_return: () = tuple() // Return value is made useless
}

In this transformed IR, the main my_func is now free of any assert operations, and optimizations like DCE can proceed on is_zero if result does not depend on it. The non_synth_my_func encapsulates the verification logic, ensuring it is preserved without hindering the optimization of the synthesizable path.
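A rough model of the separation step is computing which values the synthesizable function must pass to its non-synthesizable clone: walk backwards from each non-synthesizable node, and stop at the first synthesizable producer, which becomes a parameter. This Python sketch is a simplification (the real pass clones the whole function and relies on DCE to trim it), and the node/operation encoding is invented for illustration.

```python
# Illustrative sketch (not the XLS implementation): find the synthesizable
# values the non-synthesizable clone needs as parameters.

NON_SYNTH_OPS = {"assert", "cover", "trace"}

def non_synth_params(nodes, deps):
    """nodes: {name: op}; deps: {name: [operand names]}.
    Returns the synthesizable values the clone takes as parameters."""
    params = set()
    stack = [n for n, op in nodes.items() if op in NON_SYNTH_OPS]
    seen = set(stack)
    while stack:
        n = stack.pop()
        for d in deps.get(n, []):
            if d in seen:
                continue
            seen.add(d)
            if nodes[d] in NON_SYNTH_OPS:
                stack.append(d)  # keep walking through non-synth nodes
            else:
                params.add(d)    # produced by synthesizable logic: a param
    return params

# Mirrors the my_func example: the assert depends only on is_zero.
nodes = {"x": "param", "intermediate": "add", "is_zero": "eq",
         "assert.1": "assert", "result": "mul"}
deps = {"intermediate": ["x"], "is_zero": ["intermediate"],
        "assert.1": ["is_zero"], "result": ["intermediate"]}
print(sorted(non_synth_params(nodes, deps)))  # ['is_zero']
```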

Header

one-leaf-inlining - leaf function inlining passes

inline one level of functions.

Text-proto

Invoked Passes

post-inlining - Post-inlining passes

Passes performed after inlining

Text-proto

Invoked Passes

post-inlining-opt - post-inlining optimization passes

Passes performed after inlining

Text-proto

Invoked Passes

post-inlining-opt(>=1) - min-1 post-inlining optimization passes

Passes performed after inlining

Text-proto

Options Set

Min opt level: 1

Invoked Passes

pre-inlining - pre-inlining passes

Passes performed before each inlining.

Text-proto

Invoked Passes

proc_state_array_flat - Proc State Array Flattening

Pass which flattens array elements of the proc state into their constituent elements. Tuples are flattened in a different pass. Flattening improves optimizability because each state element can be considered and transformed in isolation. Flattening also gives the scheduler more flexibility; without flattening, each element in the aggregate must have the same lifetime.

Note: What follows was generated by the Gemini LLM. Not human verified.

The ProcStateArrayFlatteningPass is an optimization pass designed to enhance the optimizability and scheduling flexibility of Procs in XLS. Its core function is to transform one-dimensional, array-typed state elements within the proc state into a tuple of their individual constituent elements. (Note: A distinct pass, ProcStateTupleFlatteningPass, is responsible for handling the flattening of tuple-typed state elements.)

The rationale behind this flattening is multifaceted:

  1. Improved Optimizability: By breaking down arrays into their individual elements, subsequent optimization passes (such as narrowing, constant folding, or dead code elimination) can consider and transform each element in isolation. This often enables more aggressive and effective optimizations that might be inhibited when elements are tightly coupled within an array structure.

  2. Increased Scheduler Flexibility: When a state element is represented as an array, all elements within that array are typically treated as a single, atomic unit by the scheduler. This implies that they must all have the same lifetime and be updated or accessed simultaneously. By flattening the array into a tuple of individual elements, the scheduler gains greater flexibility. Each element can then potentially have an independent lifetime, enabling more efficient scheduling and resource allocation in the generated hardware.

The pass identifies array-typed state elements that are suitable for flattening. Currently, the heuristic for flattening is to unconditionally flatten "small" arrays, specifically those with a size of 2 or less. While a kMaxArrayFlattenSize constant is defined, the current implementation applies a stricter size limit.

The transformation process for an array-typed state element (e.g., state: bits[8][2]) involves:

  1. Determining Flattening: The ShouldFlattenStateElement function checks if a given state element is an array and meets the predefined size heuristic for flattening.

  2. Transforming StateElement Type: The original array-typed StateElement is replaced by a new StateElement of a tuple type, where each element of the tuple corresponds to an element of the original array. For example, an ArrayType of bits[8][2] becomes a TupleType of (bits[8], bits[8]).

  3. Transforming initial_value: The initial value of the StateElement is converted from an array literal to a tuple literal with the corresponding element values.

  4. Transforming StateRead Operations: Any StateRead operation that previously read the array-typed state element is modified. Instead of directly reading the array, it now reads the new tuple-typed state element. Its uses are then replaced by an array operation that reconstructs the original array from tuple_index operations on the new tuple-typed state read. This ensures that any downstream logic consuming the original array continues to see the expected array structure.

  5. Transforming Next Operations: Any next_value operation that computed the next state for the array-typed element is also modified. Its value operand (which would have been an array) is transformed into a tuple by creating array_index operations for each element of the original array and then wrapping these in a tuple operation.

Example: Consider a Proc with a state element state that is a 2-element array of bits[8]:

// Original IR snippet
package my_package

proc my_proc(state: bits[8][2], init={0, 1}) {
  // ...
  state_read.0: bits[8][2] = state_read(state)
  idx0: bits[32] = literal(0)
  elem0: bits[8] = array_index(state_read.0, indices=[idx0])
  // ... further usage of elem0
  // computes next state as an array
  next_val: bits[8][2] = array(literal(1), literal(2))
  next (state, next_val)
}

After ProcStateArrayFlatteningPass, the state element would be flattened into a tuple (bits[8], bits[8]):

// Optimized IR snippet (simplified)
package my_package

// state is now a tuple
proc my_proc(state: (bits[8], bits[8]), init={(0, 1)}) {
  // ...
  state_read.0: (bits[8], bits[8]) = state_read(state) // Reads the tuple
  elem0_from_tuple: bits[8] = tuple_index(state_read.0, index=0)
  elem1_from_tuple: bits[8] = tuple_index(state_read.0, index=1)
  // Any original uses of `state_read.0` would be replaced by an `array` of
  // these tuple_indexed elements.
  reconstructed_array: bits[8][2] = array(elem0_from_tuple, elem1_from_tuple)

  // next_value now produces a tuple (assuming original `next_val` was an
  // array of literals)
  next_val_tuple: (bits[8], bits[8]) = tuple(literal(1), literal(2))
  next (state, next_val_tuple)
}

This transformation facilitates creating more fine-grained control and optimization opportunities for individual elements of the original array, which is highly beneficial for hardware synthesis.
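The flattening decision and the initial-value rewrite can be sketched as follows. This is an illustrative model, not XLS code: MAX_FLATTEN_SIZE and the type encoding are invented for this example, reflecting the stated heuristic of flattening arrays of size 2 or less.

```python
# Illustrative sketch (not the XLS implementation): the size heuristic and
# the array-to-tuple initial-value rewrite.

MAX_FLATTEN_SIZE = 2  # mirrors the "size of 2 or less" heuristic above

def should_flatten(element_type):
    """element_type: ('array', size, elem_type) or ('bits', width)."""
    return element_type[0] == "array" and element_type[1] <= MAX_FLATTEN_SIZE

def flatten_init(init_array):
    """An array initial value [v0, v1] becomes the tuple (v0, v1)."""
    return tuple(init_array)

print(should_flatten(("array", 2, ("bits", 8))))  # True
print(flatten_init([0, 1]))                       # (0, 1)
```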

Header

proc_state_bits_shatter - Proc State Bits Shattering

Pass which transforms Bits-type elements of the proc state into tuples of components. Only flattens where it can show that doing so will enable dynamic state feedback opportunities later (assuming it's followed by a ProcStateTupleFlatteningPass).

Note: What follows was generated by the Gemini LLM. Not human verified.

The ProcStateBitsShatteringPass is an optimization pass that refactors bits-typed elements within a Proc's state. It breaks down (or "shatters") these wide bit vectors into a tuple of smaller bits components.

The primary motivation for this transformation is to expose opportunities for subsequent passes (such as ProcStateTupleFlatteningPass) to enable more granular and dynamic control over individual state bits, which can lead to more efficient hardware implementations.

The pass intelligently identifies bits-typed state elements where splitting would be beneficial. It only performs the transformation if it can demonstrate that doing so will likely enable dynamic state feedback. This often involves detecting specific patterns within the next_value expressions that define the next state, particularly when those next_value expressions involve concat operations whose operands are select or priority_select operations.

Here's a breakdown of its operation:

  1. Identification of Splitting Candidates: The pass iterates through all bits-typed state elements within a Proc.

  2. Analysis of next_value Expressions: For each state element, it examines its associated next_value operations:

    * Simple pass-through next_values (where the next value is just the current state read) do not offer new splitting opportunities.

    * If a next_value is defined by a concat operation, the pass analyzes the bit widths of the concat's operands to identify potential "split points" within the overall bit vector. For example, for concat(hi, mid, lo), split points would be after lo and after mid.

    * A key heuristic for determining if splitting is beneficial is when one or more of these concat operands are themselves select or priority_select operations (or small select operations, if enabled by options.split_next_value_selects). This pattern suggests that different parts of the state element are updated based on different conditions, and splitting them allows for more fine-grained conditional updates to individual components.

  3. Intersection of Split Points: If multiple next_value operations update the same state element, the pass calculates the intersection of their suggested split points. This ensures that any performed splitting is mutually beneficial for all updates to that state element.

  4. Transformation: If a beneficial set of split points (resulting in more than one component) is identified, the bits-typed state element is transformed into a tuple of smaller bits types.

    * The StateRead operation for the original state element is replaced with a concat of tuple_index operations, effectively reconstructing the original bit vector from the new tuple components when the full value is needed.

    * Each next_value operation for the original state element is replaced with a tuple of bit_slice operations, extracting the appropriate bit ranges for each component of the new tuple state.

    * The initial_value of the state element is also converted into a tuple literal matching the new structure.

Example: Consider a 16-bit state element x where its next_value is a concat of three smaller bit groups (x_hi, new_x_mid, x_lo), and new_x_mid is derived from a select operation.

// Original state and next value
x: bits[16] = state_element(init=0)
// ...
x_lo: bits[6] = bit_slice(x, start=0, width=6)
x_mid: bits[1] = bit_slice(x, start=6, width=1)
x_hi: bits[9] = bit_slice(x, start=7, width=9)
// Conditional update
new_x_mid: bits[1] = select(or_reduce(x_lo), cases={x_mid, not(x_mid)})
next_value(x, value=concat(x_hi, new_x_mid, x_lo))

The ProcStateBitsShatteringPass would identify that x can be beneficially split into three components (e.g., 6 bits, 1 bit, 9 bits) because new_x_mid is a select operation, indicating a conditional update to a sub-part of x. It would transform x into a tuple state element:

// Transformed state element and its updates (simplified)
x: (bits[6], bits[1], bits[9]) = state_element(init=(0,0,0))
// ...
x_read_tuple: (bits[6], bits[1], bits[9]) = state_read(x)
x_new_lo: bits[6] = tuple_index(x_read_tuple, 0)
x_new_mid: bits[1] = tuple_index(x_read_tuple, 1)
x_new_hi: bits[9] = tuple_index(x_read_tuple, 2)

// Original `StateRead` users now reconstruct the full bit vector
original_x_usage: bits[16] = concat(x_new_hi, x_new_mid, x_new_lo)

// `next_value` now operates on the tuple components; here `concat_value`
// stands for the original concat(x_hi, new_x_mid, x_lo) next value
next_value(x, value=tuple(bit_slice(concat_value, 0, 6),
                          bit_slice(concat_value, 6, 1),
                          bit_slice(concat_value, 7, 9)))

This transformation enables subsequent passes to optimize the updates to x_lo, x_mid, and x_hi independently, particularly when conditional logic can be localized to specific components, ultimately leading to more efficient hardware.
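Steps 2 and 3 above can be sketched in Python: derive candidate split points from a concat's operand widths, then intersect across all next_value updates so only mutually beneficial splits survive. This is an illustrative model, not the XLS implementation; the second update below is hypothetical.

```python
# Illustrative sketch (not the XLS implementation): split-point derivation
# and intersection.

def split_points(operand_widths_msb_first):
    """Bit boundaries (counted from bit 0) implied by concat(hi, ..., lo)."""
    points, acc = set(), 0
    for w in reversed(operand_widths_msb_first):  # walk from the LSB end
        acc += w
        points.add(acc)
    points.discard(acc)  # the full width is not an interior split point
    return points

# The 16-bit example above: concat(x_hi:9, new_x_mid:1, x_lo:6).
a = split_points([9, 1, 6])  # {6, 7}
# A second, hypothetical update to the same element, splitting only at 6:
b = split_points([10, 6])    # {6}
print(sorted(a & b))         # [6]: the only split every update agrees on
```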

Header

proc_state_narrow - Proc State Narrowing

Pass which tries to minimize the size and total number of elements of the proc state. The optimizations include removal of dead state elements and zero-width elements.

Note: What follows was generated by the Gemini LLM. Not human verified.

The ProcStateNarrowingPass is an optimization pass designed to reduce the bit width of bits-typed state elements within a Proc. This is a critical optimization for hardware generation, as a smaller bit width for state elements directly translates to smaller and more efficient registers in the synthesized hardware. The pass achieves this by employing a ProcStateRangeQueryEngine to analyze the possible range of values a state element can take over multiple cycles, and then safely narrows its bit width if a smaller range is identified.

The pass focuses on two primary types of bit width reduction:

  1. Removal of Known Leading Bits (Zero or One Extension): If range analysis can prove that the most significant bits (MSBs) of a state element are always constant (e.g., always zero or always one), these leading redundant bits can be "removed." The state element's type is then truncated to exclude these bits. When the state is read, these bits are re-concatenated as literals to reconstruct the full width for any operations that depend on the original wider value. When a next_value is computed, only the non-constant lower bits are extracted.

    Example: If a 32-bit state element foo is always updated such that its 29 MSBs are consistently zero, the pass can narrow its type to bits[3].

    // Original state element
    state_element(foo, bits[32], init=0)
    
    // Example next_value (simplified, assuming MSBs are always 0)
    next_value(foo, zero_extend(add(literal(1, bits=3),
                                    bit_slice(state_read(foo), 0, 3)), 32))
    

    After narrowing, foo might become bits[3]:

    // Narrowed state element
    state_element(foo, bits[3], init=0)
    
    // Read of foo now re-extends it for compatibility
    extended_foo: bits[32] = concat(literal(0, bits=29), state_read(foo))
    
    // Next value is computed on the narrowed bits
    next_value(foo, bit_slice(new_value_for_foo, 0, 3))
    
  2. Signed Bit Width Narrowing: For state elements representing signed values, the pass utilizes interval analysis (interval_ops::MinimumSignedBitCount) to determine the smallest number of bits strictly necessary to represent the full range of signed values a state element can take. If the actual bit width is larger than this minimum, it is narrowed. To preserve signed semantics, the pass uses sign_ext to reconstruct the original width when the state element is read.

    Example: If a 32-bit state element bar is known to consistently hold signed values between -8 and 7, it can be narrowed to bits[4] (as 4 bits can represent signed values from -8 to 7).

    // Original state element
    state_element(bar, bits[32], init=0)
    
    // Example next_value (simplified, range stays within [-8, 7])
    next_value(bar, add(state_read(bar), literal(1)))
    

    After narrowing:

    // Narrowed state element
    state_element(bar, bits[4], init=0)
    
    // Read of bar now sign-extends it for compatibility
    extended_bar: bits[32] = sign_ext(state_read(bar), 32)
    
    // Next value is computed on the narrowed bits
    next_value(bar, bit_slice(new_value_for_bar, 0, 4))
    

This pass is highly integrated with ProcStateRangeQueryEngine, which performs an inter-cycle analysis to precisely track the evolution of state element values, enabling robust and correct narrowing decisions. The transformations involve updating the StateElement definition, and modifying StateRead and Next operations to slice/extend or concatenate as appropriate, ensuring functional equivalence with the original wider representation.
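The two width calculations described above can be sketched in Python. This is an illustrative model, not the XLS implementation: `known_leading_zeros` models case 1 (leading bits proven constant by range analysis) and `minimum_signed_bit_count` models case 2; both function names are invented for this example.

```python
# Illustrative sketch (not the XLS implementation) of the two narrowing
# width computations.

def known_leading_zeros(max_value, width):
    """MSBs provably zero when the element never exceeds max_value."""
    return width - max(max_value.bit_length(), 1)

def minimum_signed_bit_count(lo, hi):
    """Smallest n such that every v in [lo, hi] fits in n-bit two's
    complement."""
    n = 1
    while not (-(1 << (n - 1)) <= lo and hi <= (1 << (n - 1)) - 1):
        n += 1
    return n

print(known_leading_zeros(7, 32))       # 29: bits[32] narrows to bits[3]
print(minimum_signed_bit_count(-8, 7))  # 4: matches the [-8, 7] example
```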

Header

proc_state_opt - Proc State Optimization

Pass which tries to minimize the size and total number of elements of the proc state. The optimizations include removal of dead state elements and zero-width elements.

Note: What follows was generated by the Gemini LLM. Not human verified.

The ProcStateOptimizationPass is an optimization pass in XLS that focuses on minimizing the size and total number of state elements within a Proc. This is a critical optimization for hardware generation, as fewer and smaller state elements directly translate to reduced register count and memory usage, leading to more compact, faster, and more efficient hardware designs.

The pass achieves its goals through several key transformations, often operating in an iterative fashion until no further improvements are possible:

  1. Removal of Zero-Width State Elements: State elements that have a bit width of zero (and are not token-typed) are inherently useless as they cannot store any information. This pass identifies and removes such elements. Their corresponding StateRead operations are replaced with their initial_value (which will be a zero-width literal tuple), and Next operations are replaced with empty tuples.

  2. Removal of Constant State Elements: If a state element's value is provably constant throughout the entire execution of the Proc (i.e., its initial_value is a constant, and all its next_value operations consistently assign that same constant value or are no-ops), then the state element is redundant. The pass replaces all reads of such a state element with its constant value and removes the element and its associated Next operations. This process is run to a fixed point, meaning it will iterate until no more constant state elements can be found.

  3. Removal of Unobservable State Elements: This is a more advanced optimization that identifies state elements whose values do not affect any observable output of the Proc (e.g., outputs to channels, assertions, traces). A state element X is considered observable if:

    * A side-effecting operation (e.g., send, assert, cover) directly or indirectly depends on X.

    * The next-state value of another observable state element depends on X.

    To achieve this, the pass performs:

    * State Dependency Analysis: It computes a dependency map showing which state elements each node in the Proc depends upon. This is a forward dataflow analysis.

    * Transitive Closure: It constructs an adjacency matrix for state element dependencies and then computes its transitive closure. This determines which state elements ultimately influence which others, and importantly, which ones influence observable outputs.

    * Removal: Any state element that does not transitively affect an observable output is deemed unobservable and removed. Reads of such elements are replaced with a zero-valued literal, and their Next operations are removed.

  4. Conversion of Constant Chains to State Machines: This optimization identifies sequences of state elements that effectively form a counter or a simple state machine driven by a constant input. For example, if next[C[i+1]] is semantically equivalent to state_read[C[i]] and next[C[0]] is a constant, this sequence can be converted into a single, compact state machine. Instead of using a one-hot encoding (which ProcStateArrayFlatteningPass might produce), this pass converts it into a binary-encoded state machine using log2(chain_length) bits, which is more area-efficient. The reads of the original state elements are then replaced by a select operation that decodes the state machine's value to produce the corresponding original state element value.
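The fixed-point constant-state check from optimization 2 can be sketched in Python. This is an illustrative model, not the XLS implementation: the value encoding (`('const', c)`, `('read', elem)`, `('other',)`) is invented for this example.

```python
# Illustrative sketch (not the XLS implementation): find state elements
# that provably hold their initial constant value forever.

def constant_state_elements(inits, next_values):
    """inits: {elem: constant init}; next_values: {elem: [encoded values]}.
    Returns the provably constant elements, mapped to their value."""
    constant = dict(inits)
    changed = True
    while changed:  # iterate to a fixed point, as the pass does
        changed = False
        for elem in list(constant):
            for v in next_values.get(elem, []):
                ok = (v == ("read", elem)  # no-op update
                      or (v[0] == "const" and v[1] == constant[elem])
                      or (v[0] == "read" and v[1] in constant
                          and constant[v[1]] == constant[elem]))
                if not ok:
                    del constant[elem]
                    changed = True
                    break
    return constant

inits = {"a": 1, "b": 0}
nexts = {"a": [("const", 1)],               # always rewritten with its init
         "b": [("read", "b"), ("other",)]}  # sometimes arbitrarily updated
print(constant_state_elements(inits, nexts))  # {'a': 1}
```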

Benefits:

  • Area Reduction: Directly reduces the number of registers and memory required for state, leading to smaller hardware footprints.
  • Power Reduction: Fewer registers and less active logic contribute to lower power consumption in the synthesized design.
  • IR Simplification: Removes redundant state and associated logic, making the IR cleaner and easier for subsequent compilation stages to process.

Example (Removing a constant state element):

// Original IR snippet
package my_package

chan out(bits[32], kind=streaming, ops=send_only, flow_control=ready_valid)

top proc my_proc() {
  tkn: token = literal(value=token)
  zero_val: bits[32] = literal(value=0)
  one_val: bits[32] = literal(value=1)

  // This state element `constant_state` is always `1`
  constant_state: bits[32] = state_element(init=1)
  next (constant_state, one_val) // Always updates to 1

  // `not_constant_state` changes based on its LSB
  not_constant_state: bits[32] = state_element(init=0)
  not_constant_state_read: bits[32] = state_read(not_constant_state)
  not_constant_lsb: bits[1] =
      bit_slice(not_constant_state_read, start=0, width=1)
  not_constant_next_lsb: bits[1] = not(not_constant_lsb)
  next_not_constant_state: bits[32] =
      concat(bit_slice(not_constant_state_read, start=1, width=31),
             not_constant_next_lsb)
  next (not_constant_state, next_not_constant_state)

  // Usage: sum of constant_state and the LSB of not_constant_state
  state_usage: bits[32] = add(state_read(constant_state),
                              bit_slice(state_read(not_constant_state),
                                        start=0, width=1))
  send_tok: token = send(tkn, state_usage, channel=out)
  next (send_tok)
}

The ProcStateOptimizationPass would identify constant_state as provably constant (always 1). It would then replace all uses of state_read(constant_state) with literal(1) and remove the constant_state element and its next_value:

// Optimized IR (simplified after ProcStateOptimizationPass and a subsequent
// DCE pass)
package my_package

chan out(bits[32], id=0, kind=streaming, ops=send_only,
         flow_control=ready_valid)

top proc my_proc() {
  tkn: token = literal(value=token)
  zero_val: bits[32] = literal(value=0)
  one_val: bits[32] = literal(value=1)

  // `constant_state` has been removed by the pass

  not_constant_state: bits[32] = state_element(init=0)
  not_constant_state_read: bits[32] = state_read(not_constant_state)
  not_constant_lsb: bits[1] = bit_slice(not_constant_state_read, start=0,
                                         width=1)
  not_constant_next_lsb: bits[1] = not(not_constant_lsb)
  next_not_constant_state: bits[32] =
    concat(bit_slice(not_constant_state_read, start=1, width=31),
           not_constant_next_lsb)
  next (not_constant_state, next_not_constant_state)

  // `state_read(constant_state)` is replaced by `literal(1)`
  state_usage: bits[32] = add(literal(1, bits=32),
                              bit_slice(state_read(not_constant_state),
                                        start=0, width=1))
  send_tok: token = send(tkn, state_usage, channel=out)
  next (send_tok)
}

This effectively reduces the number of state elements and simplifies the logic by replacing constant state accesses with their known values.

Header

proc_state_provenance_narrow - Proc State Provenance Narrowing

Pass which tries to minimize the size and total number of elements of the proc state. This pass works by examining the provenance of the bits making up the next value to determine which (if any) bits are never actually modified.

NB This is a separate pass from ProcStateNarrowing for simplicity of implementation. That pass mostly assumes we'll have a range-analysis which this does not need.

Note: What follows was generated by the Gemini LLM. Not human verified.

The ProcStateProvenanceNarrowingPass is an optimization pass designed to reduce the bit-width of state elements within a Proc. It achieves this by performing a detailed "provenance" analysis on the bits that constitute the next_value of each state element. The goal is to identify which specific bits of a state element are never actually modified from their initial value, regardless of the control flow or data inputs to the Proc. These unchanging bits can then be removed from the state element, leading to a narrower, more efficient register in hardware.

This pass is distinct from ProcStateNarrowingPass (which primarily relies on range analysis) because ProcStateProvenanceNarrowingPass does not require complex range information. Instead, it directly inspects the bit-level origins and updates.

How it Works:

  1. Bit Provenance Analysis (BitProvenanceAnalysis): The pass utilizes BitProvenanceAnalysis to determine, for every bit of every node, its origin. This analysis can trace whether a bit is derived from a specific input bit, a constant, or a combination of sources. It provides a precise understanding of how each bit of a next_value is influenced by the current state and inputs.

  2. Ternary Query Engine (LazyTernaryQueryEngine): Alongside provenance, a LazyTernaryQueryEngine is used to provide ternary information about bits (known zero, known one, or unknown). This helps in identifying bits that are effectively constant or always take a certain value.

  3. UnchangedBits Function: This is the core logic for identifying bits that can be narrowed. For each state element, it examines all its Next operations. For a given Next(state_read, next_value, predicate):

    * If next_value is the same as state_read, those bits are trivially unchanged.

    * Otherwise, it compares the bits of next_value with the initial_bits of the state element. Using both ternary and provenance analysis, it identifies bits in next_value that are provably identical to their corresponding initial_bits or are known to be constants that match the initial_bits.

    * The unchanged_bits mask is iteratively refined (using bitwise AND) across all Next operations for a given state element. A bit is deemed "unchanged" only if all Next operations either preserve its initial value or are predicated off (i.e., don't update it under conditions where it could change).

  4. Bit Segment Extraction (ExtractBitSegments): Once the final unchanged_bits mask is determined, this function breaks down the state element's bit range into contiguous BitSegments. Each segment is marked as either a constant segment (if it corresponds to bits that are always unchanged and match the initial value) or a variable segment (for bits that can change).

  5. NarrowTransform (State Element Transformation): This custom Proc::StateElementTransformer is used to apply the narrowing transformation:

    * New State Element: The original state element is replaced by a new StateElement with a narrower bit-width, comprising only the variable segments identified in step 4.

    * TransformStateRead: When the narrowed StateElement is read by a StateRead operation, it is reconstructed into its original, wider form. This is done by concatenating the new, narrower StateRead output with Literal nodes for all the constant segments. This ensures that downstream logic continues to receive the expected full-width value.

    * TransformNextValue: The next_value computation is also narrowed. Only the variable segments from the original next_value are extracted (using bit_slice operations) and concatenated to form the new, narrower next_value for the transformed StateElement.
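The mask refinement (step 3) and segment extraction (step 4) can be sketched in Python. This is an illustrative model, not the XLS implementation; the function names and mask encoding are invented for this example.

```python
# Illustrative sketch (not the XLS implementation): AND together the
# per-update "unchanged" masks, then extract contiguous constant/variable
# bit segments from the result.

def combine_unchanged(masks, width):
    acc = (1 << width) - 1  # start with every bit assumed unchanged
    for m in masks:
        acc &= m            # a bit survives only if every update keeps it
    return acc

def extract_segments(unchanged, width):
    """Return (start, length, is_constant) runs, LSB first."""
    segments, start = [], 0
    while start < width:
        const = bool((unchanged >> start) & 1)
        end = start
        while end < width and bool((unchanged >> end) & 1) == const:
            end += 1
        segments.append((start, end - start, const))
        start = end
    return segments

# 8-bit element: one update preserves bits 2-5, another preserves bits 2-7.
mask = combine_unchanged([0b0011_1100, 0b1111_1100], 8)
print(extract_segments(mask, 8))
# [(0, 2, False), (2, 4, True), (6, 2, False)]
```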

Benefits:

  • Significant Area Reduction: Directly reduces the number of flip-flops (registers) required for state elements, leading to smaller hardware area.
  • Power Efficiency: Fewer registers typically consume less dynamic and static power.
  • Improved Optimizability: A narrower state element simplifies its uses and can expose further optimization opportunities for operations that consume it.

Example: Consider a 128-bit state element foo in a Proc, where a portion of its bits (e.g., bits 32-63) are known to always hold a specific constant value (e.g., 0xFFFF_0000) and never change.

// Original IR snippet
package my_package

proc my_proc(foo: bits[128],
             init=0x0000_0000_0000_0000_FFFF_0000_0000_0000) {
  // ...
  // A next_value operation where bits 32-63 (0xFFFF_0000) are always
  // preserved, but other bits might change based on inputs.
  next_val_update: bits[128] = some_complex_logic(state_read(foo), input_val)
  next (foo, next_val_update)
}

ProcStateProvenanceNarrowingPass would identify that bits 32-63 of foo are unchanging. It would then transform foo into a narrower state element, effectively removing those constant bits from its storage:

// Optimized IR snippet (simplified, after narrowing and cleanup)
package my_package

proc my_proc(foo: bits[96],
             init=0x0000_0000_0000_0000_0000_0000) { // Foo is now 96 bits
  // ...
  foo_read_narrow: bits[96] = state_read(foo)

  // Reconstruct original 128-bit value by concatenating narrowed state with
  // constant bits
  foo_reconstructed: bits[128] = concat(
    bit_slice(foo_read_narrow, 32, 64), // High variable bits (original bits 64-127)
    literal(0xFFFF_0000, bits=32),      // Constant bits that were removed (bits 32-63)
    bit_slice(foo_read_narrow, 0, 32)   // Low variable bits (original bits 0-31)
  )
  // ... downstream logic now uses `foo_reconstructed`

  // Next value also narrowed
  next_val_update_narrowed: bits[96] = some_complex_logic_narrowed(...)
  next (foo, next_val_update_narrowed)
}

This reduces the physical register size for foo by 32 bits, leading to a more compact and efficient hardware implementation while maintaining functional equivalence.

Header

proc_state_tuple_flat - Proc State Tuple Flattening

Pass which flattens tuple elements of the proc state into their constituent components. Array elements are flattened in a different pass. Flattening improves optimizability because each state element can be considered and transformed in isolation. Flattening also gives the scheduler more flexibility; without flattening, each element in the aggregate must have the same lifetime.

Header

ram_rewrite - RAM Rewrite

Pass that rewrites RAMs of one type to a new type. Generally this will be some kind of lowering from more abstract to concrete RAMs.

Note: What follows was generated by the Gemini LLM. Not human verified.

The RamRewritePass facilitates the transformation of RAM (Random Access Memory) interfaces within an XLS package. Its primary role is to lower more abstract RAM models to concrete RAM types that are directly implementable by target hardware. This is a crucial step in the hardware generation flow.

The pass is driven by RamRewrite configurations, which specify the original RAM type (from_config), the target RAM type (to_config), and the mapping between logical channel names (e.g., read_req) and their corresponding physical channel names in the IR.

Key aspects of the RamRewritePass include:

  1. RAM Abstraction Levels:

     * kAbstract: A high-level, generic RAM model that provides a simplified interface.
     * k1RW (One Read/Write Port): A concrete RAM with a single port that can perform either a read or a write operation per cycle. Its request channel typically includes address, data, write enable, and read enable signals bundled together.
     * k1R1W (One Read, One Write Port): A concrete RAM with separate, independent read and write ports, enabling simultaneous read and write operations.

  2. Channel Rewiring and Repacking: The core function of the pass is to replace existing send and receive operations associated with the from_config RAM channels with new send and receive operations connected to the to_config RAM channels. This often necessitates "repacking" the data payloads to match the new channel types.

    * Repacking for send operations: The RepackPayload function transforms the structure of an old send operation's operand to match the expected input format of the new RAM channel. This can involve adding literal values (e.g., write_enable or read_enable bits for k1RW RAMs), dropping unused fields, or simply passing through data if the formats are compatible.

      Example (Abstract Read Request to 1RW Request):
      An abstract read request might be a tuple `(address, mask)`. When
      rewritten to a `k1RW` RAM, its request channel expects `(address,
      data, write_mask, read_mask, write_enable, read_enable)`. The pass
      will construct a new tuple, filling `data` and `write_mask` with zero
      literals, setting `write_enable` to `0b0` and `read_enable` to `0b1`.
    
      ```
      // Original abstract read request send
      send_abstract_read_req: token = send(tok,
                                           (addr, mask),
                                           channel=abstract_read_req)
    
      // Transformed 1RW request send (simplified)
      send_1rw_req: token =
                    send(tok,
                         (addr, data, wmask, mask, we, re),
                         channel=1rw_req)
      ```
    

    * Repacking for receive operations: Similarly, the output of the new receive operation (which adheres to the to_config channel type) is repacked to conform to the expected output format of the original from_config channel. This ensures that downstream logic that consumes the received data continues to function correctly.

  3. Channel Creation and Removal: The pass dynamically creates new channels that correspond to the to_config RAM type and subsequently removes the original channels associated with the from_config RAM type. This ensures a clean transition to the new RAM interface.

  4. Proc-Scoped Channels: The pass properly handles both global and proc-scoped channels. If a proc_name is specified in the RamRewrite configuration, the channels are treated as local to that particular proc.

  5. Error Handling: The pass includes robust checks for potential issues such as type mismatches (e.g., inconsistencies in address width between from_config and to_config) and unsupported transformation scenarios, returning informative errors where applicable.

This pass is a critical element in the XLS compilation flow, enabling the compiler to generate optimized hardware for diverse RAM architectures while preserving the functional correctness of the original design.
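The repacking of an abstract read request into a 1RW request can be sketched in Python. The field order and names below are illustrative assumptions, not the exact XLS channel layout.

```python
def repack_abstract_read_to_1rw(addr, mask):
    """Sketch of repacking an abstract read request (addr, mask) into a
    1RW request tuple (addr, data, write_mask, read_mask, we, re).
    For a read, data and write_mask become zero literals, write_enable
    is 0b0, and read_enable is 0b1."""
    return (
        addr,
        0,     # data: zero literal, unused for a read
        0,     # write_mask: zero literal
        mask,  # read_mask passes through
        0,     # write_enable = 0b0
        1,     # read_enable = 0b1
    )

req = repack_abstract_read_to_1rw(addr=0x2A, mask=0xF)
assert req == (0x2A, 0, 0, 0xF, 0, 1)
```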

Header

reassociation - Reassociation

Reassociates associative operations to reduce delay by transforming chains of operations to a balanced tree of operations, and gathering together constants in the expression for folding.

Note: What follows was generated by the Gemini LLM. Not human verified.

The ReassociationPass is an optimization pass in XLS that reorders associative operations such as addition (add), subtraction (sub), and signed/unsigned multiplication (smul, umul). Its primary objectives are to reduce the critical path delay of the synthesized hardware and to facilitate further constant folding by grouping literal operands.

This pass is crucial for transforming deep, unbalanced chains of operations (e.g., (((A + B) + C) + D)) into more balanced tree structures (e.g., ((A + B) + (C + D))). It also effectively gathers together constant operands within an associative expression, making them amenable to subsequent constant folding optimizations.

Key Concepts:

  • Associativity: The pass exploits the associative property of operations like addition and multiplication. This property states that for an associative operation op, (A op B) op C is mathematically equivalent to A op (B op C). Reassociation leverages this to restructure the computation graph.
  • Critical Path Delay Reduction: A deep, unbalanced chain of operations in the IR directly translates to a long critical path in the hardware, which limits the maximum clock frequency. By rebalancing these chains into a shallower tree, the pass effectively reduces the overall critical path delay.
  • Constant Folding Facilitation: Grouping constant operands together within an associative expression allows the ConstantFoldingPass to combine them into a single literal. This simplifies the expression and reduces dynamic computation.
  • Overflow Handling: The pass is designed to carefully consider potential overflow conditions, particularly for signed arithmetic and operations involving different bit widths. This ensures that the semantic correctness of the original computation is rigorously preserved during reassociation.

How it Works:

  1. AssociativeElements Class: This internal helper class is central to the pass's operation. For each node identified as part of an associative chain, it recursively collects and models:

     * variables_: Non-constant leaf nodes (e.g., parameters or results of other complex expressions).
     * constants_: Constant leaf nodes (i.e., literal operations).
     * op_: The primary associative operation (Op::kAdd, Op::kSMul, etc.) governing the combination of these elements.
     * needs_negate: A boolean flag indicating if a variable needs to be logically negated, typically arising from subtraction operations (e.g., A - B is conceptually treated as A + (-B)).
     * overflows_: A flag indicating whether the sub-expression represented by this AssociativeElements object can potentially overflow at its current bit-width. This information is crucial for safe combination.
     * depth_: Tracks the depth of the longest chain of associative operations leading to this particular node.

  2. OneShotReassociationVisitor: This visitor traverses the IR in a single pass to compute and collect the AssociativeElements for each relevant node. It includes logic to handle various operations:

     * Associative Operations (Add, Sub, SMul, UMul): For these operations, it recursively combines the AssociativeElements of its operands, managing negation for subtraction and tracking potential overflow conditions.
     * Neg: Handles negation by logically flipping the needs_negate flag of its operand's AssociativeElements representation.
     * Extensions (ZeroExtend, SignExtend) and Concat operations that effectively act as ZeroExtend: These operations can widen a value without fundamentally changing its underlying numerical value. The pass attempts to "see through" these by propagating the AssociativeElements of the inner operand if no overflow occurs and signedness compatibility is maintained.

  3. ReassociationCache: This caching mechanism stores the computed AssociativeElements for each node. This avoids redundant computations during iterative passes and helps ensure that analyses are kept up-to-date even after transformations have altered the graph structure.

  4. Reassociate Function (Main Loop):

     * Fixed-Point Iteration: The pass operates in a fixed-point manner. It identifies nodes that are prime candidates for reassociation and repeatedly applies the transformation until no further beneficial changes can be made across the entire function.
     * Candidate Selection: Nodes are selected for reassociation if they meet specific criteria, including:
       * Not already being a leaf node (i.e., they are part of a larger associative chain).
       * Not being composed entirely of constants (these are handled by the separate ConstantFoldingPass).
       * Having users that require materialization (i.e., the reassociation is not trivially hidden by other optimizations).
       * Having a current depth in the IR graph greater than the minimum possible depth for a balanced tree of their elements, or containing multiple constants that can be meaningfully grouped.
     * ReassociateOneOperation: This function is responsible for actually reconstructing the selected associative operation into a balanced tree:
       * Grouping Constants: It first groups all constant operands together to be combined by ConstantFoldingPass.
       * Balancing Tree: It then constructs a balanced binary tree using the variable elements, minimizing the depth of the overall computation. The tree is typically biased to the left to ensure a consistent output structure, which can further aid subsequent Common Subexpression Elimination (CSE).
       * Negation and Extension Handling: It correctly applies necessary negations and extensions (sign-extend/zero-extend) during the reconstruction process to strictly preserve semantic equivalence.
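The tree-balancing step can be sketched in Python. This is a conceptual model that ignores bit-widths, overflow, and negation handling; it only shows how a left-biased balanced tree reduces depth.

```python
def balance(op, leaves):
    """Rebuild a chain `leaves[0] op leaves[1] op ...` as a balanced tree.
    Returns (expression string, depth). Conceptual sketch only."""
    if len(leaves) == 1:
        return leaves[0], 0
    mid = (len(leaves) + 1) // 2  # bias the split left, as the pass does
    left, dl = balance(op, leaves[:mid])
    right, dr = balance(op, leaves[mid:])
    return f"({left} {op} {right})", 1 + max(dl, dr)

# A 4-element chain (((a + b) + c) + d) has depth 3; the balanced form
# has depth 2.
expr, depth = balance("+", ["a", "b", "c", "d"])
assert expr == "((a + b) + (c + d))"
assert depth == 2
```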

Benefits:

  • Reduced Critical Path: By transforming deep chains of operations into balanced trees, the pass significantly reduces the critical path delay in the generated hardware, allowing for higher clock frequencies.
  • Improved Constant Folding: Grouping constant operands together within associative expressions enables ConstantFoldingPass to reduce them to single literals, thereby simplifying the IR and reducing dynamic computation.
  • Enhanced IR Structure: Creates a more canonical and optimized structure for associative expressions, which can benefit subsequent optimization passes by providing a clearer and more efficient representation of the computation.

Example (Rebalancing additions): Consider a deep, left-leaning chain of addition operations:

// Original IR snippet (deep left-leaning chain)
fn deep_add(a: bits[32], b: bits[32], c: bits[32], d: bits[32]) -> bits[32] {
  add.1: bits[32] = add(a, b)
  add.2: bits[32] = add(add.1, c)
  ret add.3: bits[32] = add(add.2, d)
}

The ReassociationPass would transform this into a balanced tree structure:

// Optimized IR (after ReassociationPass and a subsequent DCE pass)
fn deep_add(a: bits[32], b: bits[32], c: bits[32], d: bits[32]) -> bits[32] {
  add.left: bits[32] = add(a, b)
  add.right: bits[32] = add(c, d)
  ret add.final: bits[32] = add(add.left, add.right)
}

This transformation reduces the maximum depth of the additions from 3 to 2, directly improving potential timing performance.

Header

recv_default - Receive default value simplification

Optimization which removes useless selects between the data value of a conditional or non-blocking receive and the default value of the receive (all zeros).

Note: What follows was generated by the Gemini LLM. Not human verified.

The ReceiveDefaultValueSimplificationPass is an optimization pass tailored for Procs in XLS. Its purpose is to identify and eliminate redundant select or priority_select operations that are used to explicitly handle the default zero value produced by conditional (receive_if) or non-blocking (receive_non_blocking) receive operations when no data is available.

In XLS, when a conditional or non-blocking receive operation does not successfully yield data (either its predicate is false, or it's non-blocking and no data is present on the channel), the data output of the receive operation defaults to an all-zeros value of the appropriate type. It's a common pattern to see IR where this default zero behavior is explicitly reinforced by a select or priority_select operation.

The pass targets two primary patterns:

  1. Conditional Receive (receive_if) Pattern: If a receive_if(token, predicate, channel_name) operation has its data output (tuple_index(receive_op, 1)) fed into a select operation like select(predicate, cases=[literal(0), receive_data_output]), this select is redundant. When predicate is false, receive_if already produces zero data, and the select chooses literal(0). When predicate is true, receive_if produces the actual receive_data_output, and the select chooses receive_data_output. In both cases, the select simply passes through the effective receive_data_output.

  2. Non-Blocking Receive (receive_non_blocking) Pattern: A receive_non_blocking(token, channel_name) operation typically returns a tuple (token, data_output, valid_bit). If the data_output (tuple_index(receive_op, 1)) is subsequently fed into a select like select(valid_bit, cases=[literal(0), receive_data_output]), this select is also redundant. When valid_bit is false, receive_data_output is implicitly zero, and the select chooses literal(0). When valid_bit is true, receive_data_output is the actual received data, and the select chooses receive_data_output. Again, the select is effectively a passthrough.

The ReceiveDefaultValueSimplificationPass identifies these specific redundant patterns and replaces the select or priority_select node directly with the actual data output from the receive operation, thereby eliminating unnecessary logic and simplifying the IR.
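The redundancy argument can be checked with a small Python model of the conditional-receive semantics (a sketch, not the XLS API):

```python
def receive_if(pred, channel_value):
    """Model: a conditional receive yields all-zeros data when its
    predicate is false."""
    return channel_value if pred else 0

def sel(selector, cases):
    """Model of a two-way select: picks cases[selector]."""
    return cases[selector]

# The select wrapped around the receive data is a passthrough for both
# predicate values, so the pass can replace it with the data directly.
for pred in (0, 1):
    data = receive_if(pred, channel_value=0xABCD)
    assert sel(pred, [0, data]) == data
```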

Example (Conditional Receive with redundant select):

// Original IR snippet
chan in(bits[32], kind=streaming, ops=receive_only, flow_control=ready_valid)

proc p(pred: bits[1], s: bits[32], init={1, 42}) {
  tkn: token = literal(value=token)
  // data is 0 if pred is false
  receive.1: (token, bits[32]) = receive(tkn, predicate=pred, channel=in)
  receive_data: bits[32] = tuple_index(receive.1, index=1)
  // This select is redundant: it effectively always yields 'receive_data'
  select.2: bits[32] = sel(pred, cases=[literal(0), receive_data])
  next (pred, select.2)
}

After ReceiveDefaultValueSimplificationPass:

// Optimized IR snippet
chan in(bits[32], kind=streaming, ops=receive_only, flow_control=ready_valid)

proc p(pred: bits[1], s: bits[32], init={1, 42}) {
  tkn: token = literal(value=token)
  receive.1: (token, bits[32]) = receive(tkn, predicate=pred, channel=in)
  receive_data: bits[32] = tuple_index(receive.1, index=1)
  // select.2 has been removed; receive_data is used directly
  next (pred, receive_data)
}

This simplification reduces the number of nodes in the IR, contributing to smaller and potentially more efficient hardware implementations.

Header

resource_sharing - Resource Sharing

Note: What follows was generated by the Gemini LLM. Not human verified.

The ResourceSharingPass is a critical optimization pass in XLS that aims to reduce hardware area by sharing common computational resources (e.g., adders, multipliers) among different parts of the design. This is achieved by identifying mutually exclusive operations—operations that are guaranteed not to execute at the same time—and then transforming the IR to allow them to share the same physical hardware resource through multiplexing.

The pass is configured with a ProfitabilityGuard enum, which determines the heuristic used to select which sharing opportunities to exploit:

  • kInDegree: Prioritizes sharing based on the in-degree of nodes in a conceptual "folding graph." This heuristic favors nodes that are potential "destinations" for folding, especially those with more distinct "source" operations that can be multiplexed into them. This approach is often effective for asymmetric folding scenarios where one complex operation can subsume several simpler ones.
  • kCliques: Selects sharing opportunities based on identifying cliques (groups of mutually exclusive operations) in the folding graph. This approach is particularly suitable for symmetric folding scenarios where multiple operations of similar complexity can all share a single resource.
  • kRandom: Randomly selects a subset of potential folding actions. This is typically used for testing, fuzzing, or exploration purposes rather than for deterministic optimization.
  • kAlways: Attempts to perform all possible resource sharing that is identified as legal. This is generally used for aggressive optimization benchmarking or during specific testing phases.

The core mechanism of ResourceSharingPass involves several stages:

  1. Mutual Exclusion Analysis:

     * ComputeMutualExclusionAnalysis: This is a central function that identifies pairs or groups of operations that are mutually exclusive. It builds a "mutual exclusivity relation" between nodes, proving that two operations are guaranteed not to be active simultaneously.
     * It leverages NodeForwardDependencyAnalysis to trace dataflow dependencies and BddQueryEngine (Binary Decision Diagrams) for precise logical analysis. The BddQueryEngine is critical for proving that a particular operation's value does not influence another under specific conditions (e.g., when a particular arm of a select operation is chosen).
     * The analysis primarily focuses on PrioritySelect operations, as these explicitly define conditional execution paths, which naturally create clear mutual exclusivity between their cases. The analysis is conservative and avoids making assumptions through Invoke nodes before function inlining to ensure correctness.

  2. Folding Graph Construction: Based on the results of the mutual exclusivity analysis, a FoldingGraph is constructed. This graph represents potential "folding actions," where a "source" operation can be "folded" into a "destination" operation, implying that they can share the same underlying hardware resource. This sharing is controlled by a select operation whose condition is derived from the mutual exclusivity proof.

  3. Profitability Estimation:

     * The pass utilizes AreaEstimator and DelayEstimator tools to assess the profitability of each potential folding action. This involves a cost-benefit analysis.
     * EstimateAreaForSelectingASingleInput: Calculates the area overhead associated with adding multiplexers (select logic) required to steer the inputs to a shared resource.
     * EstimateAreaForNegatingNode: Estimates the area cost if an operand needs to be negated (e.g., when folding a sub operation into an add operation that operates on the negated operand).
     * The area_saved for a folding action is calculated as the estimated area of the "from" node minus the area overhead of the multiplexers and any additional required logic.
     * DelayAnalysis is used to consider the impact of sharing on the critical path delay. The kMaxDelaySpread constant defines an acceptable limit for the increase in delay due to multiplexing.

  4. Selection of Folding Actions:

     * SelectFoldingActions: This function, guided by the chosen ProfitabilityGuard heuristic, selects a subset of the legal and profitable folding actions to perform.
     * It prioritizes actions that offer significant area savings while keeping delay impacts within acceptable bounds.
     * The selected actions are sorted (e.g., by descending area savings) and then "legalized" (LegalizeSequenceOfFolding) to prevent overlapping or conflicting transformations from being applied.

  5. IR Transformation: Once a final set of NaryFoldingActions is selected, the IR is transformed to implement the sharing. This involves:

     * Creating new select or priority_select operations (or chains of them) to multiplex the inputs of the shared resource based on the conditions derived from the original mutual exclusivity.
     * Replacing the original "from" operations with the output of the newly created multiplexing structure, which then feeds into the "to" (shared) operation.
     * Removing the now redundant "from" operations.
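The area_saved arithmetic from the profitability stage can be sketched as follows. The units and numbers are illustrative assumptions, not values from the real estimators.

```python
def area_saved(from_node_area, mux_area_per_operand, num_operands,
               extra_logic_area=0.0):
    """Sketch of the cost-benefit arithmetic for one folding action:
    area of the removed "from" node minus the multiplexer overhead and
    any additional required logic (e.g., operand negation)."""
    mux_overhead = mux_area_per_operand * num_operands
    return from_node_area - mux_overhead - extra_logic_area

# Folding away a large multiplier (say ~500 area units) behind two operand
# muxes (~40 units each) is profitable; folding a small adder would not be.
assert area_saved(500.0, 40.0, 2) == 420.0
assert area_saved(60.0, 40.0, 2) < 0  # not profitable
```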

Example: Consider two umul operations, mul0(a, b) and mul1(c, d), which are mutually exclusive (e.g., they are in different arms of a priority_select). The ResourceSharingPass might identify this and transform the IR to use a single shared umul hardware resource:

// Original IR snippet (simplified)
selector: bits[1] = ...
mul0: bits[32] = umul(a, b, width=32)
mul1: bits[32] = umul(c, d, width=32)
ret result: bits[32] = priority_sel(selector, cases=[mul0], default=mul1)

After ResourceSharingPass (conceptual representation, as actual IR is more complex with explicit multiplexing):

// Optimized IR snippet (simplified)
selector: bits[1] = ...

// Multiplexed inputs to the shared multiplier
input_a_mux: bits[32] = priority_sel(selector, cases=[a], default=c)
input_b_mux: bits[32] = priority_sel(selector, cases=[b], default=d)

// Shared multiplier operation
umul_shared: bits[32] = umul(input_a_mux, input_b_mux, width=32)

ret result: bits[32] = umul_shared // The shared multiplier is now the result

This pass is vital for achieving high-quality hardware designs by efficiently utilizing available hardware resources, particularly in designs with significant conditional logic.
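The operand-multiplexing rewrite can be checked with a small Python model of priority_sel and umul (a sketch of the semantics, not the XLS API):

```python
def priority_sel(selector_bits, cases, default):
    """Minimal model of XLS priority_sel: the lowest set selector bit wins;
    the default is produced when no bit is set."""
    for bit, case in zip(selector_bits, cases):
        if bit:
            return case
    return default

def umul(x, y, width=32):
    """Unsigned multiply truncated to `width` bits."""
    return (x * y) & ((1 << width) - 1)

# Original: select between two independent multipliers.
# Shared:   multiplex the operands into one multiplier.
a, b, c, d = 3, 5, 7, 11
for s in (0, 1):
    original = priority_sel([s], cases=[umul(a, b)], default=umul(c, d))
    shared = umul(priority_sel([s], cases=[a], default=c),
                  priority_sel([s], cases=[b], default=d))
    assert original == shared
```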

Header

scheduling-opt - scheduling opt passes

Passes performed immediately before starting scheduling.

These optimizations are used to fix up any changes the mutual-exclusion pass and proc-state legalization passes perform before the initial scheduling.

TODO(allight): We might wish to move these into opt only. sched running optimization passes is an artifact of how these tools evolved which no longer makes too much sense. To do this will require making a decision on how to handle 'mutual-exclusion opt' however.

Text-proto

Options Set

Min opt level: 1

Invoked Passes

select_lifting - Select Lifting

Pass which replaces the pattern v = sel(c, array[i], array[j], ...), where all cases of the select reference the same array, with the code z = sel(c, i, j, ...); v = array[z].

Note: What follows was generated by the Gemini LLM. Not human verified.

The SelectLiftingPass is an optimization pass in XLS that restructures select and priority_select operations to improve hardware efficiency. It identifies scenarios where a select's cases are all derived from a common "shared" operation (such as an array_index or a binary arithmetic operation) and "lifts" that shared operation out of the select. This transformation effectively moves the multiplexing (the select) from the output of the shared operation to its input, which often leads to significant hardware savings and can sometimes improve performance.

This pass is particularly effective in two main scenarios:

  1. Lifting ArrayIndex Operations: When all cases of a select are array_index operations on the same array, the pass can lift the array_index out of the select. It creates a new select that chooses between the indices of the original array_index operations. The result of this new select (the chosen index) is then used in a single, shared array_index operation. This replaces multiple array reads followed by a wide data multiplexer with a smaller index multiplexer followed by a single array read, which is generally much more area-efficient.

    // Original IR snippet
    v = sel(c, array[i], array[j])
    
    // Optimized IR (after lifting)
    z = sel(c, i, j)
    v = array[z]
    
  2. Lifting Binary Operations: When all cases of a select are binary operations of the same kind (e.g., all adds, subs, or muls) and share one common operand, the binary operation can be lifted out of the select. A new select is created to choose between the "other" (non-shared) operands. The result of this new select is then used in a single, shared binary operation with the common operand.

    // Original IR snippet
    v = sel(c, x + a, x + b)
    
    // Optimized IR (after lifting)
    z = sel(c, a, b)
    v = x + z
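Both lifting patterns can be checked with a small Python model of select semantics (a sketch, not the XLS API):

```python
def sel(c, cases):
    """Minimal model of a select: picks cases[c]."""
    return cases[c]

# Lifting a shared add: sel(c, x + a, x + b) == x + sel(c, a, b)
x, a, b = 10, 3, 4
for c in (0, 1):
    assert sel(c, [x + a, x + b]) == x + sel(c, [a, b])

# Lifting an array_index: sel(c, array[i], array[j]) == array[sel(c, i, j)]
array = [5, 6, 7, 8]
i, j = 1, 3
for c in (0, 1):
    assert sel(c, [array[i], array[j]]) == array[sel(c, [i, j])]
```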
    

Profitability and Applicability Guards:

The pass includes several "guards" to ensure that the transformation is both applicable and profitable:

  • CanLiftSelect: This function checks if a select is a candidate for lifting. For ArrayIndex lifting, it verifies that all cases are indeed array_index operations on the same array and that their index types are compatible. For binary operation lifting, it identifies a common shared_node across all cases and ensures the operation type is consistent and supported (e.g., add, sub, and, or, mul, shifts). It also handles "identity" cases, where one of the select's cases is the shared operand itself.
  • ShouldLiftSelect: This function determines if a liftable select should actually be transformed, based on profitability heuristics:

    * Area Estimation: For binary operations, it performs a simple cost-benefit analysis based on the bit-widths of the original and transformed operations. The transformation is generally considered profitable if the total bit-width of the new, narrower select and the shared operation is less than or equal to the total bit-width of the original select and all the individual binary operations (if they were single-use).
    * Delay Estimation: If a delay_model is provided, the pass uses CriticalPathDelayAnalysis to estimate the impact of the transformation on the critical path. Lifting is avoided if it is predicted to increase the latency. This is particularly important for scenarios like converting multiple constant-shifts into a single variable-shift, which is often more expensive in hardware.
    * Heuristics for High-Latency Operations: For high-latency operations like multiplication, lifting is more cautiously applied, especially if it involves identity cases, to avoid potentially degrading performance.
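The bit-width comparison described under Area Estimation can be sketched as a simple cost model (a simplified model of the described comparison, not the actual ShouldLiftSelect implementation):

```python
def should_lift_binary(select_width, case_op_widths, new_sel_width,
                       shared_op_width):
    """Lift when the new, narrower select plus the single shared op are no
    wider in total than the original select plus all the per-case binary
    ops (assuming each of those ops is single-use)."""
    original_cost = select_width + sum(case_op_widths)
    new_cost = new_sel_width + shared_op_width
    return new_cost <= original_cost

# A 32-bit select over two 32-bit adds (cost 96) vs. a 32-bit select of
# operands plus one shared 32-bit add (cost 64): lifting is profitable.
assert should_lift_binary(32, [32, 32], 32, 32)
```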

How it Works:

  1. Worklist-based Iteration: The pass maintains a worklist of select and priority_select nodes that are candidates for optimization. It iterates through this worklist, applying transformations until no more profitable lifting opportunities can be found.

  2. Applicability and Profitability Checks: For each select in the worklist, it calls CanLiftSelect and then ShouldLiftSelect to determine if a valid and beneficial transformation exists.

  3. IR Transformation: If a transformation is approved:

     * LiftSelectForArrayIndex: Rewires the IR for array_index lifting by creating a new select for the indices and a new shared array_index.
     * LiftSelectForBinaryOperation: Rewires the IR for binary operation lifting by creating a new select for the non-shared operands and a new shared binary operation. It correctly handles identity cases by inserting the appropriate identity literal (e.g., 0 for add, -1 for and) into the new select.

  4. Cleanup: The original select node, now redundant, is marked for deletion, which is typically handled by a subsequent DeadCodeEliminationPass. Any new select nodes created during the transformation are added to the worklist, allowing for potential further optimizations.

By systematically applying these transformations, SelectLiftingPass plays a crucial role in optimizing multiplexer-heavy designs, leading to more efficient use of hardware resources and improved overall performance.

Header

select_merge - Select Merging

Note: What follows was generated by the Gemini LLM. Not human verified.

The SelectMergingPass is an optimization pass in XLS that focuses on merging consecutive select-like operations (PrioritySelect and OneHotSelect) into a single, larger select operation. This transformation can be highly beneficial for hardware synthesis, as it often allows the synthesis tool to infer more efficient and compact multiplexer structures from a single, unified select rather than from a cascade of smaller ones.

The core principle of this pass is to identify patterns where the output of one select operation is used as an input (a case or a default value) to another select operation. When such a pattern is found, and certain conditions are met (primarily that the inner select has only one user), the pass merges the two selects into a single, equivalent select.

How it Works:

The pass operates by traversing the IR and applying merging logic to each PrioritySelect and OneHotSelect node it encounters. The transformation logic is tailored to the specific type of select operation:

  1. Merging OneHotSelect Operations:

     * Pattern: one_hot_sel(selector_A, cases=[..., one_hot_sel(selector_B, cases=[...]), ...])
     * Transformation: The pass constructs a new, larger one_hot_sel.
       * New Selector: The new selector is a concat of several components:
         * The original selector bits of selector_A that did not correspond to the inner one_hot_sel.
         * A new set of selector bits created by performing a bitwise AND of selector_B with the bit from selector_A that originally selected the inner one_hot_sel. This effectively qualifies the inner selector with the outer selection condition.
       * New Cases: The new set of cases is a flattened list comprising the original cases of both the outer and inner one_hot_sels, in the correct order.

  2. Merging PrioritySelect Operations:

    * Pattern: priority_sel(selector_A, cases=[..., priority_sel(selector_B, ...), ...], default=...)
    * Transformation: Similar to OneHotSelect, a new, larger priority_sel is constructed.
      * New Selector: The new selector is also a concat, but the logic for combining the selectors is more complex to correctly preserve the priority encoding. It involves ANDing the inner selector bits with the outer selector bits and also considering the conditions under which the inner priority_sel would fall through to its default value.
      * New Cases and Default: The new cases are a flattened list of the original cases. The new default value is typically the default value of the inner priority_sel if it was part of the default path of the outer priority_sel.
      * Selector Swapping for Single-Bit PrioritySelects: A special heuristic is applied for single-bit priority_sels. If the default value is the one that is a select, the pass can effectively negate the selector and swap the case and default values to expose a more direct merging opportunity.

Key Conditions for Merging:

  • Single Use: The inner select operation must have only one user (the outer select). This is a critical constraint that prevents the pass from increasing the amount of logic by duplicating the inner select's computation. If the inner select were used elsewhere, merging it would require either duplicating it or introducing complex logic, which would likely be counterproductive.
  • Matching select Types: The pass is designed to merge selects of the same kind (e.g., OneHotSelect into OneHotSelect, PrioritySelect into PrioritySelect).

Benefits:

  • Improved Hardware Synthesis: Synthesis tools are often better at optimizing a single, larger multiplexer structure than a cascade of smaller ones. Merging selects can lead to more efficient hardware with reduced area and potentially better timing.
  • IR Simplification: Reduces the depth of the IR graph by collapsing chained select operations, making the IR easier for subsequent passes to analyze.

Example (OneHotSelect merging): Consider a scenario where one one_hot_sel is used as a case in another:

// Original IR snippet
fn f(p0: bits[2], p1: bits[2], x: bits[32], y: bits[32],
     z: bits[32]) -> bits[32] {
  one_hot_sel.1: bits[32] = one_hot_sel(p0, cases=[x, y])
  ret one_hot_sel.2: bits[32] = one_hot_sel(p1, cases=[one_hot_sel.1, z])
}

If one_hot_sel.1 has only one user (one_hot_sel.2), the SelectMergingPass can merge them. Let's say p1 selects one_hot_sel.1 with its 0-th bit.

// Optimized IR (simplified)
fn f(p0: bits[2], p1: bits[2], x: bits[32], y: bits[32],
     z: bits[32]) -> bits[32] {
  // New selector combines p0 and p1
  new_selector: bits[3] = concat(
    bit_slice(p1, start=1, width=1), // Original bit 1 of p1
    and(bit_slice(p1, start=0, width=1), bit_slice(p0, start=1, width=1)),
    // p1[0] AND p0[1]
    and(bit_slice(p1, start=0, width=1), bit_slice(p0, start=0, width=1))
    // p1[0] AND p0[0]
  )
  ret one_hot_sel.merged: bits[32] = one_hot_sel(new_selector,
                                                 cases=[x, y, z])
}

This transformation creates a single one_hot_sel that can be more efficiently synthesized, as it avoids the intermediate multiplexer represented by one_hot_sel.1.
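As a sanity check, the selector rewrite from the example above can be modeled in plain Python (a behavioral sketch, not XLS code or the pass implementation; `one_hot_sel` below follows XLS's OR-of-selected-cases semantics):

```python
def one_hot_sel(selector, cases):
    """XLS one_hot_sel semantics: OR of every case whose selector bit is set."""
    result = 0
    for i, case in enumerate(cases):
        if (selector >> i) & 1:
            result |= case
    return result

def original(p0, p1, x, y, z):
    inner = one_hot_sel(p0, [x, y])     # one_hot_sel.1
    return one_hot_sel(p1, [inner, z])  # one_hot_sel.2

def merged(p0, p1, x, y, z):
    # New selector bits, per the concat in the optimized IR:
    #   bit 0: p1[0] & p0[0]  -> case x
    #   bit 1: p1[0] & p0[1]  -> case y
    #   bit 2: p1[1]          -> case z
    bit0 = (p1 & 1) & (p0 & 1)
    bit1 = (p1 & 1) & ((p0 >> 1) & 1)
    bit2 = (p1 >> 1) & 1
    return one_hot_sel(bit0 | (bit1 << 1) | (bit2 << 2), [x, y, z])

# Exhaustive check over all 2-bit selector values.
for p0 in range(4):
    for p1 in range(4):
        assert merged(p0, p1, 0xA, 0xB, 0xC) == original(p0, p1, 0xA, 0xB, 0xC)
```

The exhaustive loop confirms that the merged selector preserves the semantics of the cascaded one_hot_sels for every selector combination.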

Header

select_range_simp - Select Range Simplification

Pass which simplifies selects and one-hot-selects. Example optimizations include removing dead arms and eliminating selects with constant selectors. Uses range analysis to determine possible values.

Header

select_simp - Select Simplification

Pass which simplifies selects and one-hot-selects. Example optimizations include removing dead arms and eliminating selects with constant selectors. Uses ternary analysis to determine possible values.

Header

simp - Simplification

Standard simplification pipeline.

This is run a large number of times and avoids many time-consuming analyses.

Text-proto

Invoked Passes

simp(2) - Max-2 Simplification

Standard simplification pipeline.

Opt level is capped at 2

Text-proto

Options Set

Cap opt level: 2

Invoked Passes

simp(3) - Max-3 Simplification

Standard simplification pipeline.

Opt level is capped at 3

Text-proto

Options Set

Cap opt level: 3

Invoked Passes

simp(>=1,<=2) - Min-1 Max-2 Simplification

Standard simplification pipeline.

Opt level is capped at 2 and skipped if less than 1

Text-proto

Options Set

Min opt level: 1

Cap opt level: 2

Invoked Passes

simplify-and-inline - Iteratively inline and simplify

Inlining segment of the pipeline

This is run to a fixed point; each iteration inlines away a single layer of the function hierarchy.

Text-proto

Options Set

Run to a fixed point.

Invoked Passes

sparsify_select - Sparsify Select

The SparsifySelectPass is a type of range analysis-informed dead code elimination that removes cases from selects when range analysis proves that they can never occur. It does this by splitting a select into many selects, each of which covers a single interval from the selector interval set.

Note: What follows was generated by the Gemini LLM. Not human verified.

The SparsifySelectPass is an optimization pass that utilizes range analysis to perform a specialized form of dead code elimination on select operations. Its core function is to identify and remove or simplify cases within a select instruction that are provably unreachable based on the possible range of values its selector can take. This leads to reduced hardware complexity and improved efficiency.

Here's a detailed breakdown of its operation:

  1. Range Analysis: The pass begins by employing a PartialInfoQueryEngine to perform a precise range analysis on the selector operand of each select operation. This analysis determines the IntervalSet – the complete set of possible values – that the selector can attain during execution.

  2. Identification of Dead Cases: If the IntervalSet of the selector reveals that it can only take on a subset of all possible values (i.e., the size of the IntervalSet is less than the total number of cases in the select), then certain cases of the select are deemed unreachable or "dead code."

  3. Transformation into a Chain of Selects: The primary transformation involves replacing the original select operation with a new, potentially nested, structure composed of multiple select operations. Each new select in this chain is meticulously designed to cover a specific Interval from the selector_intervals.

    * The helper function IntervalsSortedBySize sorts the reachable intervals by their size (number of points covered), typically processing smaller intervals first. This heuristic aims to produce a more compact or efficient select decision tree.
    * For each Interval, a new select is constructed. This select first determines if the value of the original selector falls within the bounds of the current Interval using comparison operations.
    * If the selector is found to be within the Interval, a nested select (or a direct reference to the corresponding case value if the interval is a single precise value) is used to choose among the relevant cases of the original select that correspond to values within this Interval.
    * If the selector is not within the current Interval, control flows to the "next" select in the chain. If it's the last interval being considered, it defaults to a zero value (or the original default value of the select, if one exists).

    The net result is a decision tree that efficiently evaluates only the cases that are genuinely reachable given the selector's known range.

Example: Consider a select operation where the selector add.3 is a bits[4] type. Although bits[4] can represent 16 values (0-15), range analysis might prove that add.3 can only ever take on values within the interval [8, 11]. The original select has 16 cases, one for each possible value from 0 to 15.

// Original IR snippet
fn func(x: bits[2]) -> bits[4] {
  zero_ext.1: bits[4] = zero_ext(x, new_bit_count=4)
  literal.2: bits[4] = literal(value=8)
  // This is the selector for sel.20
  add.3: bits[4] = add(zero_ext.1, literal.2)
  // ... many literals for the 16 cases (literal.4 through literal.19)
  ret sel.20: bits[4] = sel(add.3, cases=[
    literal.4, literal.5, literal.6, literal.7, literal.8, literal.9,
    literal.10, literal.11, literal.12, literal.13, literal.14,
    literal.15, literal.16, literal.17, literal.18, literal.19
  ])
}

The SparsifySelectPass will recognize that only selector values 8, 9, 10, and 11 are possible. It will then rewrite sel.20 into a nested structure that effectively covers only these reachable cases:

// Optimized IR snippet (simplified conceptual representation)
fn func(x: bits[2]) -> bits[4] {
  // ... zero_ext.1, literal.2, add.3 remain as before
  selector_val: bits[4] = add.3 // The original selector value

  ret main_select: bits[4] = sel(
      // Check if selector_val is in [8, 11]
      and(uge(selector_val, literal(8)), ule(selector_val, literal(11))),
      cases=[
          // If in range [8, 11], dispatch to a select that handles these
          // specific values
          sel(sub(selector_val, literal(8)), cases=[
              literal(8),  // for selector_val == 8
              literal(9),  // for selector_val == 9
              literal(10), // for selector_val == 10
              literal(11)  // for selector_val == 11
          ]),
          // If not in range, default to 0 (or the original select's default
          // if it had one)
          literal(0)
      ]
  )
}

This transformation significantly reduces the logic required in the hardware for the select operation by eliminating the evaluation of impossible conditions, thereby contributing to smaller and faster hardware designs.
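The example above can be modeled in plain Python (a behavioral sketch, not XLS IR): the selector `x + 8` only reaches the interval [8, 11], so the 16-case select can be replaced by an in-range check plus a 4-case select.

```python
# Stand-ins for the case values literal.4 .. literal.19 of the original select.
CASES = list(range(4, 20))

def original_select(selector):
    """The full 16-case select over a bits[4] selector."""
    return CASES[selector]

def sparsified_select(selector):
    """Range-informed rewrite: dispatch only within the interval [8, 11]."""
    if 8 <= selector <= 11:
        return CASES[8:12][selector - 8]  # inner select over the interval
    return 0  # unreachable given the selector's proven range

# The rewrite agrees with the original on every reachable selector value.
for x in range(4):  # x: bits[2], so selector = x + 8 is in [8, 11]
    selector = x + 8
    assert sparsified_select(selector) == original_select(selector)
```

Only the reachable interval needs real multiplexing logic; everything outside it collapses to a constant.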

Header

strength_red - Strength Reduction

Replaces operations with equivalent cheaper operations. For example, multiply by a power-of-two constant may be replaced with a shift left.

Note: What follows was generated by the Gemini LLM. Not human verified.

The StrengthReductionPass is a crucial optimization pass in XLS that aims to improve hardware efficiency by replacing computationally expensive operations with equivalent but cheaper alternatives. This transformation is fundamental for reducing the area, delay, and power consumption of the generated hardware. The pass intelligently identifies patterns where a complex operation can be implemented using simpler primitives, often leveraging constant operands or known bit patterns.

The pass operates by traversing the IR and applying a variety of strength reduction techniques. It utilizes a UnionQueryEngine (combining StatelessQueryEngine and LazyTernaryQueryEngine) to analyze the properties of nodes, such as whether a value is fully known (constant), its leading/trailing zeros, or known sign bits.

Key strength reductions performed by this pass include:

  1. Replacement of Known Values with Literals: If the QueryEngine can prove that a node (especially a non-literal operation) always evaluates to a constant value, it's replaced with a literal node. This is a general simplification that can occur across various operation types, reducing dynamic computation to a static value.

  2. Sinking Operations into Selects: If an operation has one operand that is a select (or priority_select) with all its cases being known constants, and other operands of the operation are also known constants, the operation can be "sunk" into the select. This means the operation is duplicated for each case of the select, performed on the constant values, and then a new select is formed with the results. This often allows for further constant folding or narrowing of the operations.

    // Original IR
    v1 := Select(selector, [const1, const2])
    v2 := Add(v1, const3)
    
    // Optimized IR (conceptually)
    v1_c1 := Add(const1, const3)
    v1_c2 := Add(const2, const3)
    v2 := Select(selector, [v1_c1, v1_c2])
    
  3. Additions (add):

    * Addition with Zero: add(x, 0) is replaced directly by x.
    * Addition to OR (No Carry): If add(A, B) is such that there are no possible carries between any bit positions (i.e., for any bit position i, it's not possible for both A[i] and B[i] to be 1 simultaneously, as determined by ternary analysis), the add can be replaced by a bitwise or operation, which is typically cheaper.
    * Splitting Adders (No Carry Propagation): At higher optimization levels (SplitsEnabled), if the QueryEngine can determine a bit position in add(A, B) where a carry can provably not propagate (e.g., if the sum bit A[i] | B[i] is always 0), the adder can be split into two smaller adders. The lower-order bits are added independently, and the higher-order bits are added independently, with the results concatenated. This can reduce the critical path delay.

  4. Subtractions (sub):

    * Subtraction by Zero: sub(x, 0) is replaced directly by x.
    * Subtracting from All-Ones: sub(all_ones, x) is replaced by not(x).
    * Splitting Subtractors (No Borrow Propagation): Similar to adders, at higher optimization levels (SplitsEnabled), if a bit position in sub(A, B) can be identified where a borrow can provably not propagate (e.g., if A[i] is always 1 and B[i] is always 0), the subtractor can be split into two smaller subtractors, with results concatenated.

  5. Bitwise AND with Mask (and): and(x, mask) where mask is a constant with a single contiguous run of set bits (e.g., 0b011100) can be strength-reduced into a concat of zeros, a bit_slice of x (extracting the bits corresponding to the set bits in the mask), and more zeros. This avoids a full bitwise AND for simple masking operations.

  6. Select to SignExt Transformation: A select operation that effectively acts as a sign extension (e.g., if the selector is 0 it picks 0, if 1 it picks all ones, and then extended to a wider width) can be replaced by a sign_ext operation. This is common when a 1-bit predicate controls the sign extension of a value.

  7. SignExt to ZeroExt Transformation: If sign_ext(x, ...) is such that the MSB of x is known to be 0 (as determined by QueryEngine), the sign_ext can be replaced with a zero_ext, which is generally simpler to implement in hardware.

  8. Gate Operations:

    * If the condition of a gate(cond, data) is a known 0, the gate is replaced with a literal 0.
    * If the condition is a known 1, the gate is replaced with its data operand.
    * If the data operand is a known 0, the gate is replaced with a literal 0 (regardless of the condition).

  9. Single-bit Add and Ne to Xor: For 1-bit operands, add(x, y) and ne(x, y) are equivalent to xor(x, y), and are transformed accordingly, using the typically cheaper xor operation.

  10. Comparisons against Powers of Two (uge, ult): Comparisons like x >= 2^K or x < 2^K where 2^K is a power of two constant, can be simplified to a comparison of a bit_slice of x (extracting the relevant leading bits) against 0.

  11. Eq(x, 0) for bits[2] to Not(Or(bit_slice(x,0,1), bit_slice(x,1,1))): This specifically optimizes equality checks against zero for 2-bit values into a more fundamental Boolean logic expression.

  12. Arithmetic Operations with a Single Unknown Bit: For expensive arithmetic operations (smul, umul, sdiv, udiv, smod, umod) with two operands where one operand is a constant and the other has only one unknown bit (all other bits are known constants), the operation is transformed into a select operation. The select's selector is the single unknown bit, and its cases are the two possible results of the arithmetic operation when that unknown bit is 0 or 1. This replaces a complex arithmetic unit with a simple multiplexer for two pre-computed literal results.

The pass is run repeatedly until a fixed point is reached, allowing these local transformations to propagate and enable further simplifications across the entire IR.
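Several of the reductions above are easy to verify exhaustively on small bit widths. The following plain-Python sketch (not the XLS implementation) checks the add-to-OR rewrite (3), the contiguous-mask slice rewrite (5), the single-bit add/ne-to-xor rewrite (9), and the power-of-two comparison rewrite (10):

```python
W = 4                 # check all values of a small bit width exhaustively
MASK = (1 << W) - 1

for a in range(1 << W):
    for b in range(1 << W):
        # (3) add -> or when no bit position can carry (a & b == 0).
        if a & b == 0:
            assert (a + b) & MASK == a | b
        # (9) For 1-bit operands, add and ne are both xor.
        assert ((a & 1) + (b & 1)) & 1 == (a ^ b) & 1
        assert ((a & 1) != (b & 1)) == bool((a ^ b) & 1)

    # (5) and with a contiguous mask (0b0110) == a shifted bit slice of a.
    assert a & 0b0110 == ((a >> 1) & 0b11) << 1
    # (10) a >= 2^K  <=>  the bits of a above position K-1 are not all zero.
    K = 2
    assert (a >= (1 << K)) == ((a >> K) != 0)
```

Each assertion mirrors one of the numbered transformations, restricted to 4-bit values for an exhaustive check.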

Header

table_switch - Table switch conversion

TableSwitchPass converts chains of Select nodes into ArrayIndex ops. These chains have the form:

sel.(N)(eq.X, literal.A, literal.B)
sel.(N+1)(eq.Y, sel.(N), literal.C)
sel.(N+2)(eq.Z, sel.(N+1), literal.D)

And so on. In these chains, eq.X, eq.Y, and eq.Z must all be comparisons of the same value against different literals.

Current limitations:

  • Either the start or end index in the chain must be 0.
  • The increment between indices must be positive or negative 1.
  • There can be no "gaps" between indices.
  • The Select ops have to be binary (i.e., selecting between only two cases).

Note: What follows was generated by the Gemini LLM. Not human verified.

The TableSwitchPass is an optimization pass in XLS that transforms specific chains of select (or priority_select) nodes into more efficient array_index operations. This is a crucial optimization for hardware synthesis, as it converts a series of conditional branches into a direct table lookup mechanism, which can be implemented more compactly and with lower delay in hardware, especially when the conditions are based on comparing a single value against a set of constants.

The pass specifically targets select chains that conform to the following pattern:

  • Chained Select or PrioritySelect Operations: The pass identifies sequences of select or priority_select nodes where:

    * The selector of each select is a comparison (e.g., eq or ne) of a common "index" value against a distinct literal "key."
    * The result of one select operation feeds into a case (or the default value) of the next select operation in the chain, creating a cascaded lookup structure.
    * The "value" chosen by each matching comparison in the chain is a literal.

The pass then conceptually constructs a lookup table (represented as an array of literal values) where the common "index" value from the select chain is used to directly access an element in this array, and the elements of the array are the "value" literals from the original select chain.

How it Works:

  1. MatchLinks: This is the core pattern-matching function. It recursively traverses the select chain, extracting essential information about each "link" (each select node) in the chain. This information includes the common index node, the literal key it's compared against, the value literal chosen if the comparison is true, and the next node in the chain if the comparison is false. It also handles PrioritySelect operations where the selector is a concat of single-bit selectors, identifying individual bits of the selector as corresponding to specific cases.

  2. LinksToTable: Once a complete chain of Links is identified, this function attempts to convert it into a Value that represents a literal array. It maps the keys from the Links to corresponding indices in this array, with the associated value literals becoming the array elements. The function intelligently handles situations where the index space is not fully covered by explicit Links, filling in the gaps with the "else" value (the default or fall-through value from the original chain).

  3. Replacement: If LinksToTable successfully generates a valid literal array, the entire select chain (or the part of a priority_select chain that can be converted) is replaced by an array_index operation. The index for this array_index is the common index node from the original select chain, and the array being indexed is the newly created literal array.

Current Limitations (as described in the code):

  • Either the start or end index in the chain must be 0.
  • The increment between indices must be positive or negative 1.
  • There can be no "gaps" between indices.
  • The Select ops must be binary (i.e., selecting between only two cases).

Example (simplification of a binary tree of sel operations performing a lookup):

// Original IR snippet
fn main(index: bits[32]) -> bits[32] {
  literal.0: bits[32] = literal(value=0)
  literal.1: bits[32] = literal(value=1)
  literal.2: bits[32] = literal(value=2)
  literal.3: bits[32] = literal(value=3)
  literal.4: bits[32] = literal(value=4)
  eq.10: bits[1] = eq(index, literal.0)
  eq.11: bits[1] = eq(index, literal.1)
  eq.12: bits[1] = eq(index, literal.2)
  eq.13: bits[1] = eq(index, literal.3)
  sel.20: bits[32] = sel(eq.10, cases=[literal.0, literal.1])
  sel.21: bits[32] = sel(eq.11, cases=[sel.20, literal.2])
  sel.22: bits[32] = sel(eq.12, cases=[sel.21, literal.3])
  ret sel.23: bits[32] = sel(eq.13, cases=[sel.22, literal.4])
}

The TableSwitchPass would recognize this pattern and convert it into (conceptually):

// Optimized IR (simplified)
fn main(index: bits[32]) -> bits[32] {
  // ...
  // Constructed array of values from the selects; the last element is the
  // default/else value (assumed to be 0)
  array_literal: bits[32][5] = literal(value={1, 2, 3, 4, 0})
  ret array_index.new: bits[32] = array_index(array_literal, indices=[index])
}

This transformation significantly simplifies the hardware required for such lookup-like structures by replacing complex multiplexing logic with a direct memory access or a simpler decoder for the array index, leading to improved hardware QoR.
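The equivalence between the select chain and the table lookup can be sketched in plain Python (a behavioral model, not XLS IR; it assumes XLS's array_index clamps out-of-bounds indices to the last element):

```python
# Values for indices 0..3, plus the chain's fall-through ("else") value.
TABLE = [1, 2, 3, 4, 0]

def select_chain(index):
    """The original cascade: sel.23(sel.22(sel.21(sel.20(...))))."""
    result = 0                  # literal.0: value when nothing matches
    if index == 0: result = 1   # sel.20: eq(index, 0) -> 1
    if index == 1: result = 2   # sel.21: eq(index, 1) -> 2
    if index == 2: result = 3   # sel.22: eq(index, 2) -> 3
    if index == 3: result = 4   # sel.23: eq(index, 3) -> 4
    return result

def table_switch(index):
    """The rewritten array_index; clamp models out-of-bounds behavior."""
    return TABLE[min(index, len(TABLE) - 1)]

for index in range(16):
    assert table_switch(index) == select_chain(index)
```

The loop checks that the lookup table reproduces the chain for every index, including out-of-range values that fall through to the else entry.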

Header

token_dependency - Convert data dependencies between effectful operations into token dependencies

Pass which turns data dependencies between certain effectful operations into token dependencies. In particular, transitive data dependencies between receives and other effectful ops are turned into token dependencies whenever no such token dependency already exists.

Note: What follows was generated by the Gemini LLM. Not human verified.

For example, consider a scenario where a send operation uses data produced by a receive operation, but due to other logic between the send and receive, the data produced by the receive is not observable by the send. Other passes would then be free to remove the send's data dependency (since the data from the receive is irrelevant). This would leave the send and receive disconnected, so without an explicit token dependency the scheduler might reorder them, leading to incorrect behavior.

chan test_channel(
  bits[32], id=0, kind=streaming, ops=send_receive,
  flow_control=ready_valid)

top proc main(__state: (), init={()}) {
  __token: token = literal(value=token, id=1000)
  receive.1: (token, bits[32]) = receive(__token, channel=test_channel)
  tuple_index.2: token = tuple_index(receive.1, index=0)
  tuple_index.3: bits[32] = tuple_index(receive.1, index=1)
  send.4: token = send(__token, tuple_index.3, channel=test_channel)
  after_all.5: token = after_all(send.4, tuple_index.2)
  tuple.6: () = tuple()
  next (tuple.6)
}

In this example, send.4 uses tuple_index.3, which comes from receive.1, but send.4's token operand (__token) does not depend on receive.1's token. The pass identifies this transitive data dependency and rewires send.4's token operand to also consume tuple_index.2 (the token from receive.1), for example:

send.4: token = send(after_all(__token, tuple_index.2), tuple_index.3,
                     channel=test_channel)

This ensures that receive.1 completes before send.4 can proceed.

Header

token_simp - Simplify token networks

Pass that simplifies token networks. For example, if an AfterAll node has operands where one operand is an ancestor of another in the token graph, then the ancestor can be omitted. Similarly, duplicate operands can be removed and AfterAlls with one operand can be replaced with their operand.

Note: What follows was generated by the Gemini LLM. Not human verified.

The TokenSimplificationPass is a crucial optimization pass in XLS that focuses on simplifying the token network within a function or proc. Token networks are essential for correctly enforcing ordering constraints between side-effecting operations (such as send and receive) in a dataflow graph. This pass aims to reduce the complexity and redundancy in these networks, leading to more compact and efficient control logic in the generated hardware.

The pass performs several types of simplifications on token-typed operations, primarily after_all and min_delay:

  1. Simplifying after_all Operations:

    * Single Operand after_all: An after_all operation with only one operand (e.g., after_all(tok)) is redundant and is replaced directly by its single operand tok.

      ```
      // Original IR
      after_all.1: token = after_all(tok)
      next (after_all.1, ...)
    
      // Optimized IR
      next (tok, ...)
      ```
    

    * Duplicate Operands in after_all: If an after_all operation has multiple identical operands (e.g., after_all(tok, tok, tok)), the duplicates are removed, leaving only one instance of each unique token.

      ```
      // Original IR
      after_all.1: token = after_all(tok, tok, tok)
      next (after_all.1, ...)
    
      // Optimized IR
      next (tok, ...)
      ```
    

    * Nested after_all Operations: If an after_all operation has another after_all as an operand, and all users of the inner after_all are also after_all operations, they can be collapsed. This effectively flattens the token dependency chain.

      ```
      // Original IR
      after_all.1: token = after_all(tok_a, tok_b)
      after_all.2: token = after_all(after_all.1, tok_c)
      next (after_all.2, ...)
    
      // Optimized IR
      after_all.new: token = after_all(tok_a, tok_b, tok_c)
      next (after_all.new, ...)
      ```
    

    * Replacing Overlapping after_all Operands: If an after_all operation has multiple operands where one operand is an ancestor (directly or indirectly) of another in the token graph, the ancestor token can be removed from the after_all's operand list. This is because the dependency is already implied by the descendant token. The pass is conservative and avoids analyzing through invoke nodes to ensure correctness before function inlining.

      ```
      // Original IR
      send.2: token = send(tok, literal(10), channel=test_channel)
      send.3: token = send(send.2, literal(10), channel=test_channel)
      // send.3 depends on send.2
      after_all.4: token = after_all(tok, send.2, send.3)
      next (after_all.4, ...)
    
      // Optimized IR (simplified to the latest token in the chain)
      // send.2 and tok are removed from after_all.4 because send.3 implies
      // them
      next (send.3, ...)
      ```
    
  2. Simplifying min_delay Operations:

    * Zero Delay min_delay: A min_delay operation with a delay of 0 (e.g., min_delay(tok, delay=0)) is a no-operation and is replaced by its operand tok. Similarly, min_delay(after_all(), delay=X) is replaced by after_all(), as there are no actual events to delay.

      ```
      // Original IR
      min_delay.1: token = min_delay(tok, delay=0)
      next (min_delay.1, ...)
    
      // Optimized IR
      next (tok, ...)
      ```
    

    * Nested min_delay Operations: Consecutive min_delay operations can be collapsed into a single min_delay with the sum of their delays (e.g., min_delay(min_delay(tok, delay=1), delay=2) becomes min_delay(tok, delay=3)).

      ```
      // Original IR
      min_delay.1: token = min_delay(tok, delay=1)
      min_delay.2: token = min_delay(min_delay.1, delay=2)
      next (min_delay.2, ...)
    
      // Optimized IR
      min_delay.new: token = min_delay(tok, delay=3)
      next (min_delay.new, ...)
      ```
    

    * Pushing Down min_delay through after_all: If all operands of an after_all are min_delay operations, the smallest common delay can be factored out and applied to the entire after_all operation. This can reduce the total number of min_delay operations and simplify scheduling.

      ```
      // Original IR
      min_delay.1: token = min_delay(tok_a, delay=1)
      min_delay.2: token = min_delay(tok_b, delay=2)
      after_all.3: token = after_all(min_delay.1, min_delay.2)
      next (after_all.3, ...)
    
      // Optimized IR (factoring out common delay of 1)
      min_delay.new: token = min_delay(
          after_all(tok_a, min_delay(tok_b, delay=1)), delay=1)
      next (min_delay.new, ...)
      ```
    

This pass iteratively applies these simplifications to the token network until a fixed point is reached. By reducing redundant token dependencies and optimizing delay annotations, the pass contributes to more efficient scheduling and resource utilization in the generated hardware.
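The min_delay rewrites above can be checked with a small timing model in plain Python (an illustrative sketch under the assumption that a token's "ready time" is its producer's time, min_delay adds its delay as a lower bound, and after_all takes the maximum of its operands):

```python
def min_delay(t, delay):
    """Ready time of min_delay: at least `delay` after the operand token."""
    return t + delay

def after_all(*ts):
    """Ready time of after_all: when the last operand token is ready."""
    return max(ts)

for tok_a in range(8):
    for tok_b in range(8):
        # Nested min_delays collapse into the sum of the delays.
        assert min_delay(min_delay(tok_a, 1), 2) == min_delay(tok_a, 3)
        # Factoring the common delay (1) out of an after_all of min_delays.
        before = after_all(min_delay(tok_a, 1), min_delay(tok_b, 2))
        after = min_delay(after_all(tok_a, min_delay(tok_b, 1)), 1)
        assert before == after
```

Under this model, both rewrites preserve the ready time of the resulting token for all operand timings checked.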

Header

useless_assert_remove - Remove useless (always true) asserts

Pass which removes asserts that have a literal 1 as the condition, meaning they are never triggered. Rewires tokens to ensure nothing breaks.

Header

useless_io_remove - Remove useless send/receive

Pass which removes sends/receives that have literal false as their condition. Also removes the condition from sends/receives that have literal true as their condition.

Note: What follows was generated by the Gemini LLM. Not human verified.

This pass optimizes the IR by eliminating or simplifying I/O operations (send and receive) based on their predicate (condition) values. This helps reduce dead code and simplify the control flow of the hardware.

Specifically, it handles two main cases:

  1. Removal of Useless Conditional I/O: If a send or receive operation has a predicate that is a literal false, and there are other active I/O operations on the same channel, the useless operation is removed. If it's the only remaining I/O operation on that channel, its data input is replaced with a literal zero to avoid leaving the channel unused.

    Example (SendIf with literal false predicate):

    send.1: token = send(tkn, data, predicate=literal(value=0))
    // This send will never execute
    // ... other operations, potentially another send on the same channel
    

    This send.1 would be replaced by tkn (its token operand), effectively removing the send operation.

  2. Simplification of Unconditional I/O: If a send or receive operation has a predicate that is a literal true, the predicate itself is removed. The operation then becomes an unconditional send or receive.

    Example (ReceiveIf with literal true predicate):

    receive.1: (token, bits[32]) = receive(tkn, predicate=literal(value=1),
    channel=my_channel)
    

    This receive.1 would be transformed into:

    receive.1: (token, bits[32]) = receive(tkn, channel=my_channel)
    

Header