Optimization Passes List
This is an automatically generated list of all optimization passes available
with the default opt_main. This is generated automatically based on comments
in the header files.
If the opt level is set below 'Min opt level' the pass will act as a no-op.
If the opt level is set above 'Cap opt level' the pass (or passes within the compound pass) will be executed with the opt level capped to the specified value.
Warning: Many of these passes have descriptions generated by the Gemini LLM and may not accurately reflect the behavior of the passes. Over time, manual verification and editing of the pass descriptions may improve their accuracy.
default_pipeline - The default pipeline.
Invoked Passes
arith_simp - Arithmetic Simplifications
This pass performs various arithmetic optimizations such as replacement of divide by a constant with non-divide operations.
Note: What follows was generated by the Gemini LLM. Not human verified.
The ArithSimplificationPass is an optimization pass in XLS that applies a
wide range of arithmetic transformations to simplify expressions, reduce
hardware complexity, and improve overall efficiency. It utilizes a
StatelessQueryEngine to identify constant values and bit patterns, which
enables many of its powerful transformations. This pass operates in an
iterative fixed-point manner, meaning it repeatedly applies simplifications
until no further changes can be made, ensuring a thorough optimization.
Key optimizations performed by this pass include:
- Division and Modulo by Constants:
  * Division by Power of Two: `udiv(x, 2^K)` or `sdiv(x, 2^K)` is replaced with a logical or arithmetic right shift by `K` bits, respectively. This is a fundamental hardware optimization.
  * Modulo by Power of Two: `umod(x, 2^K)` is replaced with a bit slice extracting the `K` least significant bits. `smod(x, 2^K)` is handled with a `select` operation based on the sign of `x` to preserve correct signed modulo semantics.
  * Division/Modulo by Other Constants (Unsigned): `udiv(x, K)`, where `K` is a constant that is not a power of two, is transformed using "magic multiplication" algorithms, which replace division with a combination of shifts and multiplies. This is often significantly more efficient in hardware than a dedicated division unit.
  * Division/Modulo by Other Constants (Signed): `sdiv(x, K)`, where `K` is a constant, also employs magic multiplication, with additional logic to correctly handle signed semantics and rounding towards zero.
  * Constant Dividend Optimization: If the dividend is a small constant (e.g., `udiv(13, x)` or `sdiv(13, x)`), the operation is replaced by a lookup table (a `PrioritySelect` over a `Literal` array). This is particularly effective for small constant dividends, where a full division circuit would be overkill.
- Shift Operations:
  * Logical Shift by Constant: `shll(x, K)` or `shrl(x, K)`, where `K` is a constant, is replaced by a combination of `bit_slice` and `concat` operations. This avoids generating a potentially larger barrel shifter for fixed-amount shifts.
    ```
    // Original:  shrl(x, literal(2, bits=N_shift_amount_width))  // Example for shift by 2
    // Optimized: concat(literal(0, bits=2), bit_slice(x, start=2, width=N-2))
    ```
  * Arithmetic Shift Right by Constant: `shra(x, K)` is replaced by a `sign_extend` of a `bit_slice` of `x`, effectively performing an arithmetic right shift without a barrel shifter.
  * Removal of Zero-Extended Shift Amounts: If a shift amount is the result of a `concat` with leading zeros (a common pattern for zero-extension), the leading zeros are removed from the shift amount, simplifying the shift input.
  * Removal of Shift Guards: If a shift amount is clamped by a `select` operation (e.g., `select(UGt(amt, LIMIT), {LIMIT, amt})`) where `LIMIT` is greater than or equal to the bit width of the value being shifted, the clamping logic is removed as redundant: any shift amount at or beyond the bit width produces the same result.
- Comparisons with Negated Operands:
  * `eq(-lhs, -rhs)` => `eq(lhs, rhs)` (and similarly for `ne`).
  * Signed comparisons like `slt(-lhs, -rhs)`: These are transformed into an `xor` of the reversed comparison (`sgt(lhs, rhs)`) and additional terms that correctly handle `MIN_VALUE` edge cases (where `neg(MIN_VALUE) == MIN_VALUE`).
  * Signed comparison of a negation to a literal, `eq(-expr, K)`: These are transformed into `eq(expr, -K)`, with similar `xor` logic for inequalities to correctly handle `MIN_VALUE`. These transformations are only applied if the `negate` operation has no other users, to avoid introducing logic replication.
- Multiply Operations:
  * Multiply by Power of Two: `umul(x, 2^K)` or `smul(x, 2^K)` is replaced with a left shift by `K` bits.
  * Multiply by Small Constant (Sum-of-Shifts): `mul(x, K)`, where `K` is a small constant with a few set bits, can be replaced by a sum of shifted `x` values (e.g., `x * 5` => `x + (x << 2)`). This is applied if the number of adders required is below `kAdderLimit`. A complementary optimization uses subtraction from a power-of-two shift for constants like 7 (e.g., `x * 7` => `(x << 3) - x`).
  * Zero-Width Multiply Operands: Multiplies where one or both operands have a zero bit width are replaced by a literal zero, as the result is always zero.
  * Multiply Used by Narrowing Slice: If a multiply's result is only used by `bit_slice` operations that extract a narrower result, the multiply itself can be performed at the required output width, reducing the size of the multiplier hardware.
- Decode Operations:
  * 1-bit Wide Decode: A `decode` operation with a 1-bit output is simplified to an `eq(index, 0)` comparison, as only index 0 can produce a true output.
  * 1-bit Operand Decode: A `decode` operation with a 1-bit input and a 2-bit output is simplified to `concat(operand, not(operand))`, which is its direct one-hot representation.
  * Removal of Zero-Extended Index: Similar to shifts, if a decode index is formed by a `concat` with leading zeros (zero-extension), the leading zeros are removed from the index.
- `decode(N) - 1` Pattern: The common pattern for creating a mask with the `N` least-significant bits set (e.g., `sub(decode(N), 1)`) is rewritten to `not(shll(all_ones, N))`, eliminating an adder and often resulting in more efficient hardware.
- Comparison of Injective Operations: The pass identifies patterns where the result of an injective arithmetic operation (such as `add` or `sub`) is compared against a constant (e.g., `(X + C0) == C1`). These are simplified by performing the inverse arithmetic operation on the constants to isolate `X` (e.g., `X == C1 - C0`).
The ArithSimplificationPass operates to a fixed point, meaning it
repeatedly applies these simplifications until no further changes are
possible. This iterative approach ensures that all possible arithmetic
optimizations are applied, resulting in a more efficient and optimized IR for
hardware synthesis.
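The shift, mask, and sum-of-shifts rewrites above are ordinary integer identities, which makes them easy to sanity-check. Below is a hedged Python sketch (a value-level model, not XLS IR and not the pass's implementation) that exhaustively verifies a few of them on unsigned 8-bit values:

```python
# Python model of several arith_simp rewrites on unsigned N-bit values.
# Illustrative only: the pass rewrites IR nodes; this checks the underlying math.
N = 8
MASK = (1 << N) - 1  # all-ones value for N bits

for x in range(1 << N):
    # udiv(x, 2^K) => shrl(x, K)
    assert x // 8 == x >> 3
    # umod(x, 2^K) => bit_slice of the K least significant bits
    assert x % 8 == x & 0b111
    # mul(x, 5) => x + (x << 2)   (sum-of-shifts)
    assert (x * 5) & MASK == (x + (x << 2)) & MASK
    # mul(x, 7) => (x << 3) - x   (subtraction from a power-of-two shift)
    assert (x * 7) & MASK == ((x << 3) - x) & MASK

# decode(n) - 1 pattern: a mask with the n least-significant bits set
# equals not(shll(all_ones, n)).
for n in range(N):
    assert (1 << n) - 1 == ~(MASK << n) & MASK
```

Masking with `MASK` models the fixed-width wraparound of hardware arithmetic; without it, Python's unbounded integers would hide overflow behavior.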
array_simp - Array Simplification
Pass which simplifies or eliminates some array-type operations such as ArrayIndex.
Note: What follows was generated by the Gemini LLM. Not human verified.
The ArraySimplificationPass is a comprehensive optimization pass designed
to simplify and, where possible, eliminate various array-related operations
within the XLS IR. The primary objective is to reduce the complexity of array
manipulation, leading to more efficient and compact hardware implementations.
This pass employs a combination of pattern matching and range analysis to
identify and transform redundant or inefficient array constructs.
The pass operates in an iterative, fixed-point manner, continually applying
transformations until no further simplifications can be made. It focuses on
Array, ArrayIndex, ArrayUpdate, ArraySlice, and Select (or
PrioritySelect) operations when they involve array types.
Key simplifications performed by this pass include:
- Clamping Out-of-Bounds `ArrayIndex` Indices: If range analysis (via the `QueryEngine`) can definitively determine that an `ArrayIndex` operation accesses an out-of-bounds location, the index is clamped to the maximum in-bounds index (i.e., the last element of the array). This ensures correct behavior in the generated hardware and can simplify subsequent logic by removing checks for illegal accesses.
- Simplifying `ArrayIndex` Operations:
  * Empty Indices: An `array_index(A, {})` operation (where the index list is empty) is replaced directly by the array `A` itself, as it logically references the entire array.
  * `ArrayIndex` of `ArrayUpdate` with Matching Indices: If an `array_index` accesses the same index that was previously updated by an `array_update` (e.g., `array_index(array_update(A, V, {idx}), {idx})`), and the index is known to be in bounds, it is replaced directly with the updated value `V`. This bypasses the array storage entirely.
  * Trivial Dimensions: If an `ArrayIndex` accesses an array dimension of size 1 (e.g., `bits[32][1]`), the index for that dimension is replaced with a literal `0`, as there is only one possible element to access. This simplifies the indexing logic.
  * `ArrayIndex` of `Array` with Constant First Index: An `array_index(array(E0, E1, ...), {C, ...})` operation, where `C` is a constant index, is transformed to directly index into the corresponding element `Ec`. This bypasses the array construction, reducing IR nodes.
  * `ArrayIndex` of `ArrayConcat` with Constant First Index: Similarly, if an `ArrayIndex` accesses an `array_concat` with a constant first index, it bypasses the `array_concat` and directly indexes into the appropriate sub-array within the concatenated structure.
  * Consecutive `ArrayIndex` Operations: A sequence of nested `array_index` operations (e.g., `array_index(array_index(A, {idx0}), {idx1})`) is flattened into a single `array_index(A, {idx0, idx1})`, combining the indices for greater efficiency.
  * `ArrayIndex` of `Select` (or `PrioritySelect`): An operation of the form `array_index(select(P, cases=[A0, A1]), {idx})` is transformed into `select(P, cases=[array_index(A0, {idx}), array_index(A1, {idx})])`. This pushes the `select` inside the array access, potentially reducing the overall multiplexer (mux) width if the array elements are smaller than the entire array. This is particularly beneficial for "small" arrays, or when the outer `array_index` is the sole user of the `select`.
- Simplifying `ArrayUpdate` Operations:
  * Empty Indices: An `array_update(A, V, {})` operation (with no indices) is replaced directly by the update value `V`, as it logically replaces the entire array.
  * Redundant `ArrayUpdate`: If an `array_update` updates an element with its current value (e.g., `array_update(A, array_index(A, {idx}), {idx})`) and the index is known to be in bounds, it is replaced with the original array `A`, as no actual change occurs.
  * `ArrayUpdate` on Unit-Dimension Arrays: An `array_update(A[1], V, {idx})` (updating a 1-element array, possibly within a multi-dimensional array) is replaced with a `select` operation. The `select` conditionally chooses between an array-packed version of the update value `V` and the original array `A`, based on whether the index `idx` matches the single valid index. This allows explicit conditional updates for these small arrays.
  * Hoisting `ArrayUpdate` above `Array`: An `array_update` applied to an `array` literal (e.g., `array_update(array(A, B, C), V, {idx})`) can be transformed into a new `array` in which the element at `idx` is replaced by `V`, potentially after applying sub-updates if the array is multi-dimensional. This is particularly beneficial for constant arrays.
  * Flattening Sequential `ArrayUpdate` Chains: The pass identifies and flattens sequences of `array_update` operations that modify the same array. If a later update necessarily overwrites elements written by an earlier `array_update`, the earlier, redundant update can be elided.
  * `ArrayUpdate` of `ArrayIndex`: An `array_update` of the form `array_update(A, array_update(array_index(A, {i}), V, {j}), {i})` can be simplified to `array_update(A, V, {i, j})`, effectively merging nested update operations.
- Simplifying `Array` Operations:
  * Decomposition and Recomposition Elimination: The pass can detect a pattern where an array is first decomposed into its individual elements (using `array_index` operations) and then immediately recomposed into an array (using an `array` operation). If the indices are sequential and cover the entire range, this redundant decomposition-recomposition pair is eliminated, replacing the `array` operation with the original source array or a simple `array_index` on the source.
- Simplifying `ArraySlice` Operations:
  * Literal Start Index: An `array_slice` operation with a literal start index is replaced by an `array` constructed from the directly indexed elements of the source array. This eliminates dynamic slicing operations in hardware.
- Simplifying `Select` of Array-Typed Values:
  * Conditional Array Assignment: A `select` of the form `select(P, cases=[A, array_update(A, V, {idx})])` is transformed into `array_update(A, select(P, cases=[array_index(A, {idx}), V]), {idx})`. This hoists the `select` inside the `array_update`, reducing the multiplexer width to operate only on the single array element being conditionally updated.
  * `Select` of Arrays to Array of `Select`s: A `select` operation whose cases (and optional default value) are all array-typed (e.g., `select(P, cases=[array_0, array_1])`) can be converted to an `array` of `select` operations (e.g., `array(select(P, {array_0[0], array_1[0]}), select(P, {array_0[1], array_1[1]}), ...)`). This can expose smaller `select` operations for further optimization but can also increase logic replication, so it is applied judiciously based on heuristics.
This pass is essential for translating high-level array abstractions into efficient hardware structures by aggressively simplifying and optimizing array access and manipulation patterns, contributing significantly to improved hardware QoR.
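Several of the `ArrayIndex`/`ArrayUpdate` rewrites above are pure dataflow identities that hold for any functional array. A hedged Python sketch, modeling XLS arrays as Python lists (the helper name `array_update` is illustrative, not the pass's API):

```python
# Python model of array_simp identities; names are illustrative only.

def array_update(a, v, idx):
    """Functional update: returns a copy of list a with a[idx] = v."""
    out = list(a)
    out[idx] = v
    return out

A = [10, 20, 30, 40]

# array_index(array_update(A, V, {idx}), {idx}) => V   (index known in bounds)
assert array_update(A, 99, 2)[2] == 99

# Redundant update: array_update(A, array_index(A, {idx}), {idx}) => A
assert array_update(A, A[1], 1) == A

# Consecutive ArrayIndex flattening on a 2-D array:
# array_index(array_index(B, {idx0}), {idx1}) == array_index(B, {idx0, idx1})
B = [[1, 2], [3, 4]]
assert B[1][0] == 3
```

The functional-update helper mirrors the IR semantics: `array_update` produces a new array value rather than mutating storage, which is what makes these local rewrites sound.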
array_untuple - Array UnTuple
Pass which changes any (non-external) array-of-tuples into a tuple-of-arrays. We can see through tuples quite well but can't see through arrays to anywhere near the same extent, so the struct-of-arrays representation is always superior.
Note that this pass makes no attempt to unpack or repack arrays which escape the function base. This means that anything which comes in through a function param or a proc's recv, or escapes through a function return or a proc's send, is not untupled.
TODO(allight): We could do this at the cost of a significant number of IR nodes. We should experiment to see if this is worth doing.
Note: What follows was generated by the Gemini LLM. Not human verified.
The ArrayUntuplePass is an optimization pass in XLS that transforms arrays
of tuples into tuples of arrays. This conversion, often referred to as the
"struct-of-arrays" (SoA) transformation, is generally superior for hardware
implementation. The XLS IR and its optimization passes can "see through"
tuples much more effectively than through arrays. By converting an array of
structs into a struct of arrays, this pass exposes significantly more
opportunities for subsequent optimizations (such as bit-slicing, dead code
elimination, and common subexpression elimination), ultimately leading to
more efficient and compact hardware designs.
Core Principle:
The pass aims to convert a data structure of the form Array<Tuple<A, B, C>>
into Tuple<Array<A>, Array<B>, Array<C>>.
How it Works:
- FindUntupleGroups (Union-Find for Related Nodes): This function uses a `UnionFind` data structure to group nodes that represent parts of the same array-of-tuples structure and should be transformed together. It essentially identifies equivalence classes of nodes that are interconnected within the array-of-tuples dataflow. For instance, if an `array_update` modifies an array-of-tuples, both the original array and the `array_update` result are considered part of the same equivalence group. Operations considered for grouping include: `ArrayUpdate`, `ArraySlice`, `ArrayConcat`, `Eq`, `Ne`, `Sel` (and its variants `PrioritySel`, `OneHotSel`), `Gate`, and `Next`.
- FindExternalGroups (Identifying External Visibility): This is a crucial preliminary step that identifies "external" uses of arrays-of-tuples. If an array-of-tuples either originates from outside the current function/proc (e.g., as a function parameter or a proc's `receive` operand) or escapes its boundary (e.g., as a function return value or a proc's `send` operand), it cannot be untupled, because doing so would alter the external interface of the component. Such groups of nodes are marked as excluded from transformation. Additional exclusions include:
  * Arrays of empty tuples (to avoid potential infinite loops in fuzzer-generated code).
  * Arrays of arrays (this pass currently does not recursively unwrap them).
  * The `update_value` operand of an `array_update` (as it represents an external input to the update operation).
  * Operands of unhandled operations that are arrays of tuples.
- UntupleVisitor (Applying Transformations): This `DfsVisitorWithDefault` traverses the function or proc nodes in topological order and applies the necessary transformations when a node represents an array-of-tuples and its equivalence group is not excluded. The visitor handles various IR operations:
  * `Literal`: Converts an array-of-tuples literal into a tuple of array literals.
  * `StateRead`: For procs, untuples a `StateRead` operation by creating new `StateElement`s for each tuple element, with each new element holding an array corresponding to that specific tuple field.
  * `Next`: Transforms a `next` operation on an array-of-tuples state element into a series of `next` operations, one for each tuple-of-arrays state element.
  * `ArrayIndex`: Converts an `array_index` operation on an array-of-tuples into a `tuple` of `array_index` operations, each accessing the corresponding element from the untupled array structure.
  * `Array`: Transforms an `array` construction that creates an array-of-tuples into a `tuple` of `array` constructions, effectively building the tuple-of-arrays representation.
  * `ArrayUpdate`, `ArraySlice`, `ArrayConcat`: Similar transformations are applied to these operations, distributing them across the individual tuple elements to operate on the corresponding arrays.
  * `Gate`, `Select` (and its variants `PrioritySel`, `OneHotSel`): These operations are distributed across the elements of the tuple, creating new operations for each element of the untupled structure.
  * `Eq`, `Ne` (Comparisons): Comparisons on arrays of tuples are distributed across the tuple elements. For `eq`, the individual comparison results are `and`-reduced; for `ne`, they are `or`-reduced.
Benefits:
- Improved Optimization Opportunities: By converting arrays of tuples to tuples of arrays, the pass exposes the individual array components. This allows subsequent passes (e.g., `BitSliceSimplificationPass`, `DcePass`, `CsePass`) to operate more effectively, as they are generally more adept at optimizing bit-level operations and tuple structures.
- Reduced Hardware Complexity: Can lead to a more modular and efficient hardware implementation by breaking down complex array-of-tuple structures into simpler, parallel arrays.
- Enhanced IR Clarity: The transformed IR can be easier to understand and reason about, as logical components are grouped more coherently and less opaque data structures are used.
Example (tuple_index(array_index(array_of_tuples, idx), element_idx)):
Consider an array of tuples where we first index into the array, and then
extract an element from the resulting tuple:
```
// Original IR snippet
fn my_func(my_array: (bits[32], bits[16])[4], idx: bits[2]) -> bits[16] {
  array_element: (bits[32], bits[16]) = array_index(my_array, indices=[idx])
  ret tuple_element: bits[16] = tuple_index(array_element, index=1)
}
```
After ArrayUntuplePass, my_array would conceptually be transformed into
separate arrays for each tuple element, e.g., (my_array_elem0: bits[32][4],
my_array_elem1: bits[16][4]). The access pattern would then become
simplified:
```
// Optimized IR (conceptually, after ArrayUntuplePass and DCE)
fn my_func(my_array_elem0: bits[32][4],
           my_array_elem1: bits[16][4],
           idx: bits[2]) -> bits[16] {
  ret array_index_elem1: bits[16] = array_index(my_array_elem1,
                                                indices=[idx])
}
```
The pass effectively streamlines data access by creating direct arrays for each tuple element, allowing for more direct and optimized indexing into the relevant array, eliminating the intermediate tuple operations.
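The AoS-to-SoA layout change in the example above can be illustrated with plain Python data. This is a sketch of the data-layout transformation only; the actual pass rewrites IR nodes, not runtime values:

```python
# Array-of-tuples (AoS) vs. tuple-of-arrays (SoA), as plain Python data.
# Conceptually: (bits[32], bits[16])[4] becomes (bits[32][4], bits[16][4]).
aos = [(1, 10), (2, 20), (3, 30), (4, 40)]

# The untupled form: one array per tuple element.
soa = tuple(list(col) for col in zip(*aos))
# soa == ([1, 2, 3, 4], [10, 20, 30, 40])

idx = 2
# tuple_index(array_index(aos, idx), 1) becomes array_index(soa[1], idx):
assert aos[idx][1] == soa[1][idx]
```

Note how the SoA form lets an access to element 1 touch only the `bits[16]` array, which is exactly the narrowing the optimized IR above exploits.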
basic_simp - Basic Simplifications
This pass does simple pattern-matching optimizations which are ~always a good idea to do (replacing a node with a constant, removing operands of nodes, etc). They improve QoR, do not increase the number of nodes in the graph, preserve the same abstraction level, and do not impede later optimizations via obfuscation. These optimizations require no analyses beyond looking at the node and its operands. Examples include: not(not(x)) => x, x + 0 => x, etc.
Note: What follows was generated by the Gemini LLM. Not human verified.
The BasicSimplificationPass applies a collection of straightforward,
pattern-matching optimizations to the XLS IR. These optimizations are
designed to improve the Quality of Results (QoR) without significantly
increasing the number of nodes in the graph, preserving the current
abstraction level, and avoiding obfuscation that might hinder subsequent
optimization passes. The pass primarily relies on local analysis of a node
and its immediate operands.
The pass iteratively traverses the nodes in the function (or proc) and attempts to apply various simplification patterns until a fixed point is reached (no further changes can be made).
Key simplifications performed by this pass include:
- Identity Operations:
  * `add(x, 0)` => `x` (and similar for `sub`, `shll`, `shrl`, `shra` with 0)
  * `and(x, x)` => `x` (and `or(x, x)` => `x`)
  * `nand(x, x, x)` => `not(x)`
  * `xor(x, x)` => `0` (for an even number of identical `x` operands)
  * `xor(x, x, x)` => `x` (for an odd number of identical `x` operands)
  * `not(not(x))` => `x`
  * `neg(neg(x))` => `x`
  * `and_reduce(x)` => `x` (if `x` is 1-bit)
  * `or_reduce(x)` => `x` (if `x` is 1-bit)
  * `xor_reduce(x)` => `x` (if `x` is 1-bit)
  * `priority_sel(bits[0], cases=[], default=X)` => `X` (a priority select with a zero-width selector always selects its default value).
- Constant Propagation and Folding:
  * `umul(x, 0)` or `smul(x, 0)` => `0`
  * `or(x, -1)` => `-1` (all-ones literal)
  * `nor(x, -1)` => `0`
  * `and(x, 0)` => `0`
  * `nand(x, 0)` => `-1`
  * `xor(x, -1)` => `not(x)`
  * `and_reduce([])` => `1` (identity for AND reduction on an empty set)
  * `or_reduce([])` => `0` (identity for OR reduction on an empty set)
  * `xor_reduce([])` => `0` (identity for XOR reduction on an empty set)
  * Consolidation of multiple literal operands in associative/commutative N-ary operations (e.g., `and(x, literal_A, y, literal_B)` => `and(x, y, literal_A_and_B)`).
- Boolean Algebra Simplifications:
  * `and(x, not(x))` => `0`
  * `or(x, not(x))` => `-1`
  * `add(x, not(x))` => `-1` (for `bits` types, effectively `2^N - 1`)
  * `eq(x: bits[1], 1)` => `x`
  * `eq(x: bits[1], 0)` => `not(x)`
- Redundant Operand Removal:
  * `or(x, 0, y)` => `or(x, y)`
  * `and(x, -1, y)` => `and(x, y)`
  * Removal of duplicate operands in N-ary `and`, `or`, `nand`, `nor`, and `xor` operations (e.g., `and(x, y, y, x)` => `and(x, y)`).
- Flattening Nested Associative N-ary Operations: If an outer N-ary operation has an inner N-ary operation of the same opcode as one of its operands, and the inner operation has only one user (the outer operation), the inner operation's operands are flattened directly into the outer operation.
  Example:
  ```
  or.5: bits[8] = or(w, x)
  or.6: bits[8] = or(y, z)
  ret result: bits[8] = or(or.5, or.6)
  ```
  becomes:
  ```
  ret result: bits[8] = or(w, x, y, z)
  ```
- Comparison Simplifications:
  * `eq(x, x)` => `1`
  * `ne(x, x)` => `0`
  * `eq(x, y)` => `1` (if `x` and `y` are zero-width values)
  * `ne(x, y)` => `0` (if `x` and `y` are zero-width values)
This pass is designed to be efficient and broadly applicable, serving as a foundational optimization before more complex analyses and transformations are applied.
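Because these identities are local and bit-width-bounded, they can be checked exhaustively on small widths. A hedged Python sketch over 8-bit values (fixed-width semantics modeled by masking; the pass itself matches IR patterns rather than evaluating values):

```python
# Exhaustive check of a few basic_simp identities on 8-bit values.
N = 8
MASK = (1 << N) - 1  # the all-ones value, i.e. the "-1" literal for bits[8]

for x in range(1 << N):
    assert (x + 0) & MASK == x                # add(x, 0) => x
    assert x & x == x                         # and(x, x) => x
    assert x ^ x == 0                         # xor(x, x) => 0
    assert ~(~x) & MASK == x                  # not(not(x)) => x
    assert x & (~x & MASK) == 0               # and(x, not(x)) => 0
    assert (x | (~x & MASK)) == MASK          # or(x, not(x)) => all ones
    assert (x + (~x & MASK)) & MASK == MASK   # add(x, not(x)) => 2^N - 1
```

The last assertion is why the folded literal is the all-ones value: `x + ~x` sets every bit regardless of `x`.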
bdd_cse - BDD-based Common Subexpression Elimination
Pass which commons equivalent expressions in the graph using binary decision diagrams.
Note: What follows was generated by the Gemini LLM. Not human verified.
The BddCsePass (BDD-based Common Subexpression Elimination) is an
optimization pass in XLS that identifies and merges logically equivalent
expressions within a function or proc by leveraging Binary Decision Diagrams
(BDDs). Unlike the purely syntactic CsePass, BddCsePass can detect
equivalences even when the expressions have different syntactic forms, as it
relies on the canonical representation of Boolean functions provided by BDDs.
This capability leads to more aggressive common subexpression elimination and
significantly reduces hardware redundancy.
How it Works:
- BDD Representation: The core of this pass is the `BddQueryEngine`. For each bit-typed node in the function, the `BddQueryEngine` constructs a Binary Decision Diagram. A BDD is a directed acyclic graph that represents a Boolean function. Crucially, for any given Boolean function, its BDD is canonical (unique) if a fixed variable ordering is used. This property allows the pass to determine whether two different expressions compute the same Boolean function simply by comparing their BDDs.
- Node Ordering for Determinism and Performance: The pass first determines a specific order in which to process the nodes (`GetNodeOrder`). This order is a topological sort (necessary to avoid introducing cycles) and prioritizes nodes on less critical paths. This prioritization helps ensure that CSE replacements do not inadvertently increase the overall critical path delay of the design.
- Equivalence Bucketing: The pass iterates through the nodes in the determined order. For each bit-typed node, it computes a hash value based on the BDD node indices of all its constituent bits. Nodes with the same hash are grouped together into "node buckets," as they are potentially logically equivalent.
- Logical Equivalence Check: Within each bucket, the pass performs a pairwise comparison of nodes using the `is_same_value` function, which directly compares the BDDs of the two nodes. If their BDDs are identical, the nodes are confirmed to be logically equivalent.
- Replacement: When two or more logically equivalent nodes are found (e.g., `node_A` and `node_B`, where `node_A` was encountered earlier in the node order and is chosen as the representative), the later node (`node_B`) is replaced by the earlier node (`node_A`): all users of `node_B` are rewired to use `node_A` instead, and the pass records that a change occurred.
- Post-DCE: After `BddCsePass` has completed, a `DeadCodeEliminationPass` is typically run. This removes any nodes that became dead code as a result of being replaced by an equivalent node, further cleaning up the IR.
Benefits:
- Semantic Equivalence: `BddCsePass` excels at identifying and commoning subexpressions that are semantically (logically) equivalent, even if their textual representation in the IR differs. This is a significant advantage over purely syntactic CSE approaches.
- Hardware Reduction: By eliminating redundant computations, the pass directly reduces the amount of hardware required, leading to smaller area, lower power consumption, and potentially faster circuits.
- Improved IR Quality: The resulting IR is more compact, canonical, and easier for subsequent optimization and synthesis tools to process.
Example:
Consider two expressions that compute the same Boolean function but have
different IR structures: `x_eq_42: bits[1] = eq(x, literal(42))` and
`forty_two_not_ne_x: bits[1] = not(ne(literal(42), x))`.
Syntactically, these expressions are different.
```
// Original IR snippet
fn FuzzTest(x: bits[16]) -> (bits[1], bits[1]) {
  forty_two: bits[16] = literal(value=42)
  x_eq_42: bits[1] = eq(x, forty_two)
  forty_two_not_ne_x: bits[1] = not(ne(forty_two, x))
  ret tuple.0: (bits[1], bits[1]) = tuple(x_eq_42, forty_two_not_ne_x)
}
```
BddCsePass would build BDDs for both x_eq_42 and forty_two_not_ne_x.
Since these two expressions are logically equivalent, their BDDs would be
identical. The pass would then replace one with the other (e.g.,
forty_two_not_ne_x would be replaced by x_eq_42), resulting in shared
logic:
```
// Optimized IR (after BddCsePass and a subsequent DCE pass)
fn FuzzTest(x: bits[16]) -> (bits[1], bits[1]) {
  forty_two: bits[16] = literal(value=42)
  x_eq_42: bits[1] = eq(x, forty_two)
  ret tuple.0: (bits[1], bits[1]) = tuple(x_eq_42, x_eq_42)
}
```
This effectively eliminates the redundant computation of the second comparison, leading to a more optimized design.
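The key idea, keying CSE on a canonical representation instead of on syntax, can be mimicked for tiny input spaces with exhaustive truth tables. The Python sketch below is only an illustrative stand-in for BDDs (real BDDs are canonical without enumerating all inputs and scale far better); it commons the two expressions from the example above:

```python
# CSE keyed on a canonical form: here an exhaustive truth table over a small
# input space stands in for a BDD. Illustrative only, not the pass's code.

def truth_table(fn, bits=8):
    """Canonical form of a 1-bit function of an unsigned `bits`-wide input."""
    return tuple(fn(x) for x in range(1 << bits))

exprs = {
    "x_eq_42": lambda x: int(x == 42),
    "forty_two_not_ne_x": lambda x: int(not (42 != x)),
}

canonical = {}    # canonical form -> representative expression name
replacement = {}  # expression name -> representative name
for name, fn in exprs.items():
    key = truth_table(fn)
    # First expression with this canonical form becomes the representative.
    replacement[name] = canonical.setdefault(key, name)

# The syntactically different second expression maps onto the first.
assert replacement["forty_two_not_ne_x"] == "x_eq_42"
```

As in the pass, earlier-encountered nodes act as representatives and later equivalent nodes are rewired onto them; DCE would then remove the orphaned computation.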
bdd_simp - BDD-based Simplification
Runs BDD-based simplifications on the function.
Note: What follows was generated by the Gemini LLM. Not human verified.
The BddSimplificationPass leverages Binary Decision Diagrams (BDDs) to
perform advanced logical simplifications on the IR. BDDs provide a canonical
representation of Boolean functions, enabling powerful, logic-level
optimizations that are challenging for traditional syntactic or
dataflow-based passes.
The pass currently focuses on the following key areas:
- Replacement of Statically Known Values with Literals: Through BDD analysis, if a node's value (or specific bits of it) can be determined to be constant under all reachable conditions, it is replaced with a literal.
  Example: If `my_and: bits[1] = and(param_x, param_y)` and BDD analysis proves `param_x` is always `0b0`, then `my_and` can be replaced with `literal(0, bits=1)`.
- Removal of Redundant Operands in Boolean Operations: For N-ary Boolean operations (`and`, `nand`, `or`, `nor`), the pass identifies and removes operands that are logically redundant given the values or relationships of other operands. For instance, in `and(A, B, C)`, if `C` is always true whenever `A` and `B` are true, `C` is redundant.
  Example: If `C` is the predicate `a_and_b`:
  ```
  a_and_b: bits[1] = and(A, B)
  ret my_and: bits[1] = and(A, B, a_and_b)
  ```
  This can be simplified to:
  ```
  ret my_and: bits[1] = and(A, B)
  ```
- Collapse of Select Chains into OneHotSelects (at `opt_level >= 2`): Chains of binary `select` operations, particularly where one case of a `select` feeds into another `select` (e.g., `sel(pred_a, {val_a, sel(pred_b, {val_b, default_val})})`), are transformed into a single `one_hot_select` if the selectors are provably disjoint (at most one can be true at any given time). This simplifies the IR structure and can lead to more efficient hardware implementations.
- Replacement of Two-Way OneHotSelect with Select (at `opt_level >= 2`): A `one_hot_select` with a 2-bit selector that is known to be one-hot can be replaced with a simpler `select` operation, as the one-hot property means the decision effectively depends on a single bit. This reduces the complexity of the select operation.
- Simplification of PrioritySelect Operations (at `opt_level >= 1`): If BDD analysis determines that the selector of a `priority_select` is guaranteed to have at least one bit set within a certain range, the `priority_select` can be narrowed by trimming higher-priority cases that are guaranteed not to be selected, effectively reducing the number of cases the hardware needs to evaluate.
As BDD-based analysis can be computationally intensive, some optimizations
are applied conditionally based on the optimization level (opt_level).
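The select-chain collapse hinges on the selectors being provably disjoint. A hedged Python sketch of the value-level equivalence only (the `one_hot_select` helper uses simplified XLS-like OR-of-selected-cases semantics; the BDD disjointness proof itself is not modeled):

```python
# Value-level model: a chain of binary selects with disjoint predicates
# equals a single one-hot select. Illustrative, not the pass's implementation.

def one_hot_select(selector_bits, cases):
    """OR together every case whose selector bit is set (XLS-like semantics)."""
    acc = 0
    for i, case in enumerate(cases):
        if (selector_bits >> i) & 1:
            acc |= case
    return acc

val_a, val_b, default_val = 0xA0, 0x0B, 0x55

for pred_a in (0, 1):
    for pred_b in (0, 1):
        if pred_a and pred_b:
            continue  # the collapse requires provably disjoint selectors
        # Chain: pick val_a if pred_a, else val_b if pred_b, else default_val.
        chained = val_a if pred_a else (val_b if pred_b else default_val)
        # Collapsed: one-hot selector over {pred_a, pred_b, neither}.
        neither = int(not (pred_a or pred_b))
        selector = (neither << 2) | (pred_b << 1) | pred_a
        collapsed = one_hot_select(selector, [val_a, val_b, default_val])
        assert chained == collapsed
```

With disjoint predicates exactly one selector bit is set, so the OR degenerates to a plain case pick; if two predicates could be true simultaneously, the OR would merge cases and the rewrite would be unsound, which is why the BDD disjointness proof is required.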
bdd_simp(2) - BDD-based Simplification with opt_level <= 2
Runs BDD-based simplifications on the function.
Note: What follows was generated by the Gemini LLM. Not human verified.
The BddSimplificationPass leverages Binary Decision Diagrams (BDDs) to
perform advanced logical simplifications on the IR. BDDs provide a canonical
representation of Boolean functions, enabling powerful, logic-level
optimizations that are challenging for traditional syntactic or
dataflow-based passes.
The pass currently focuses on the following key areas:
- Replacement of Statically Known Values with Literals: Through BDD analysis, if a node's value (or specific bits of it) can be determined to be constant under all reachable conditions, it is replaced with a literal.
  Example: If `my_and: bits[1] = and(param_x, param_y)` and BDD analysis proves `param_x` is always `0b0`, then `my_and` can be replaced with `literal(0, bits=1)`.
- Removal of Redundant Operands in Boolean Operations: For N-ary Boolean operations (`and`, `nand`, `or`, `nor`), the pass identifies and removes operands that are logically redundant given the values or relationships of other operands. For instance, in `and(A, B, C)`, if `C` is always true whenever `A` and `B` are true, `C` is redundant.
  Example: If `C` is the predicate `a_and_b`:
  ```
  a_and_b: bits[1] = and(A, B)
  ret my_and: bits[1] = and(A, B, a_and_b)
  ```
  This can be simplified to:
  ```
  ret my_and: bits[1] = and(A, B)
  ```
- Collapse of Select Chains into OneHotSelects (at `opt_level >= 2`): Chains of binary `select` operations, particularly where one case of a `select` feeds into another `select` (e.g., `sel(pred_a, {val_a, sel(pred_b, {val_b, default_val})})`), are transformed into a single `one_hot_select` if the selectors are provably disjoint (at most one can be true at any given time). This simplifies the IR structure and can lead to more efficient hardware implementations.
- Replacement of Two-Way OneHotSelect with Select (at `opt_level >= 2`): A `one_hot_select` with a 2-bit selector that is known to be one-hot can be replaced with a simpler `select` operation, as the one-hot property means the decision effectively depends on a single bit. This reduces the complexity of the select operation.
- Simplification of PrioritySelect Operations (at `opt_level >= 1`): If BDD analysis determines that the selector of a `priority_select` is guaranteed to have at least one bit set within a certain range, the `priority_select` can be narrowed by trimming higher-priority cases that are guaranteed not to be selected, effectively reducing the number of cases the hardware needs to evaluate.
As BDD-based analysis can be computationally intensive, some optimizations
are applied conditionally based on the optimization level (opt_level).
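The redundant-operand rule can be sanity-checked by brute force. The Python sketch below is a truth-table stand-in for the BDD implication proof the pass performs; the choice `C = or(A, B)` is a hypothetical predicate that happens to be true whenever both `A` and `B` are.

```python
from itertools import product

# C = or(A, B) is true whenever A and B are both true, so C is redundant
# in and(A, B, C). Enumerate all input combinations to confirm:
for A, B in product([0, 1], repeat=2):
    C = A | B
    assert (A & B & C) == (A & B)  # and(A, B, C) == and(A, B)
```

A BDD reaches the same conclusion symbolically, without enumerating inputs, which is what makes the analysis scale past a handful of variables.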
bdd_simp(3) - BDD-based Simplification with opt_level <= 3
Runs BDD-based simplifications on the function.
Note: What follows was generated by the Gemini LLM. Not human verified.
The BddSimplificationPass leverages Binary Decision Diagrams (BDDs) to
perform advanced logical simplifications on the IR. BDDs provide a canonical
representation of Boolean functions, enabling powerful, logic-level
optimizations that are challenging for traditional syntactic or
dataflow-based passes.
The pass currently focuses on the following key areas:
- Replacement of Statically Known Values with Literals: Through BDD analysis, if a node's value (or specific bits of it) can be determined to be constant under all reachable conditions, it is replaced with a literal.

  Example: If `my_and: bits[1] = and(param_x, param_y)` and BDD analysis proves `param_x` is always `0b0`, then `my_and` can be replaced with `literal(0, bits=1)`.

- Removal of Redundant Operands in Boolean Operations: For N-ary Boolean operations (`and`, `nand`, `or`, `nor`), the pass identifies and removes operands that are logically redundant given the values or relationships of other operands. For instance, in `and(A, B, C)`, if `C` is always true whenever `A` and `B` are true, `C` is redundant.

  Example: If `C` is such a predicate:

  ```
  a_and_b: bits[1] = and(A, B)
  ret my_and: bits[1] = and(A, B, C)
  ```

  This can be simplified to:

  ```
  ret my_and: bits[1] = and(A, B)
  ```

- Collapse of Select Chains into OneHotSelects (at `opt_level >= 2`): Chains of binary `select` operations, particularly where one case of a `select` feeds into another `select` (e.g., `sel(pred_a, {val_a, sel(pred_b, {val_b, default_val})})`), are transformed into a single `one_hot_select` if the selectors are provably disjoint (at most one can be true at any given time). This simplifies the IR structure and can lead to more efficient hardware implementations.

- Replacement of Two-Way OneHotSelect with Select (at `opt_level >= 2`): A `one_hot_select` with a 2-bit selector that is known to be one-hot can be replaced with a simpler `select` operation, as the one-hot property means the decision effectively depends on a single bit. This reduces the complexity of the select operation.

- Simplification of PrioritySelect Operations (at `opt_level >= 1`): If BDD analysis determines that the selector of a `priority_select` is guaranteed to have at least one bit set within a certain range, the `priority_select` can be narrowed. This involves trimming higher-priority cases that are guaranteed not to be selected, effectively reducing the number of cases the hardware needs to evaluate.
As BDD-based analysis can be computationally intensive, some optimizations
are applied conditionally based on the optimization level (opt_level).
bitslice_simp - Bit-slice simplification
Pass which simplifies bit-slices. This includes collapsing sequential bit-slices, eliminating degenerate full-width slices, and others.
Note: What follows was generated by the Gemini LLM. Not human verified.
The BitSliceSimplificationPass is a crucial optimization pass that targets
and simplifies operations related to bit manipulation: bit_slice,
dynamic_bit_slice, and bit_slice_update. The overarching goal is to
reduce logical complexity, enhance hardware efficiency, and generate cleaner
IR by eliminating redundancies and transforming patterns into more canonical
forms that are easier for hardware synthesis tools to process.
The pass employs a combination of local pattern matching and advanced query
engine analysis (utilizing LazyTernaryQueryEngine or
PartialInfoQueryEngine for higher optimization levels) to identify and
capitalize on simplification opportunities.
Key simplifications performed by this pass include:
- Simplifying `BitSlice` Operations:
  * Full-Width Slices: A `bit_slice` that extracts the entire bit width of its operand (e.g., `bit_slice(x, start=0, width=N)`) is a no-operation and is directly replaced by its operand `x`.
  * Consecutive `BitSlice` Operations: Nested `bit_slice` operations (e.g., `bit_slice(bit_slice(x, s0, w0), s1, w1)`) are collapsed into a single `bit_slice(x, s0 + s1, w1)`, effectively combining the slicing operations.
  * Hoisting `BitSlice` above `Concat`: A `bit_slice` of a `concat` operation can often be rewritten as a `concat` of smaller `bit_slice`s of the original `concat` operands. This allows for more precise extraction of relevant bits and can eliminate unnecessary concatenations.
  * Hoisting `BitSlice` above Bitwise/Arithmetic Operations: If a `bit_slice` is the sole consumer of a bitwise operation (e.g., `and`, `or`, `xor`) or targets the low bits of an arithmetic operation (e.g., `add`, `sub`, `neg`), the `bit_slice` can be pushed down into the operands. This transforms `bit_slice(op(A, B), ...)` into `op(bit_slice(A, ...), bit_slice(B, ...))`, reducing the width of the intermediate `op` result.
  * Hoisting `BitSlice` above `SignExt`: Handles various cases where a `bit_slice` operates on a `sign_ext` operation. It simplifies this by transforming it into a direct slice of the original value or a `sign_ext` of a smaller slice, depending on where the slice falls relative to the sign bit of the extended value.
  * Hoisting `BitSlice` above Shifts and Decodes: If all users of a shift or decode operation are bit slices entirely contained within a larger slice, the `bit_slice` can be hoisted above the shift/decode. This effectively performs the shift/decode on a smaller bit-width operand, reducing computation.
  * Hoisting `BitSlice` above Selects: If all users of a `select` (or `one_hot_select`/`priority_select`) are bit slices entirely contained within a larger slice, the `bit_slice` can be hoisted above the `select`. This results in the `select` operating on smaller, sliced operands, which can reduce the multiplexer complexity.

- Simplifying `DynamicBitSlice` Operations:
  * Out-of-Bounds Slices: A `dynamic_bit_slice` whose start index is provably out of bounds (meaning it attempts to slice entirely beyond the operand's bit width) is replaced with a literal zero, as no data can be extracted.
  * Literal Start Index: A `dynamic_bit_slice` with a known constant start index is converted into a static `bit_slice` operation. This eliminates dynamic addressing logic in the hardware.
  * Hoisting `DynamicBitSlice` above `Select` of Literals: If the start index of a `dynamic_bit_slice` is the result of a `select` operation where all the cases are literal constant values, the `dynamic_bit_slice` can be hoisted. This replaces the original `select` with a new `select` whose cases are static `bit_slice` operations, each corresponding to one of the literal start indices.

    ```
    // Original IR
    start1: bits[32] = literal(value=5)
    start2: bits[32] = literal(value=25)
    p: bits[32] = select(x, cases=[start1, start2])
    q: bits[45] = dynamic_bit_slice(to_slice, p, width=45)

    // Optimized IR (conceptually)
    slice1: bits[45] = bit_slice(to_slice, start=5, width=45)
    slice2: bits[45] = bit_slice(to_slice, start=25, width=45)
    q: bits[45] = select(x, cases=[slice1, slice2])
    ```
  * Scaled Dynamic Bit Slices: Optimizes `dynamic_bit_slice` operations where the start index is a known multiple of the slice width. This involves conceptually treating the `to_slice` operand as an array of smaller bit elements and then using a `select` operation to choose the correct element from this conceptual array.

- Simplifying `BitSliceUpdate` Operations:
  * Out-of-Bounds Updates: A `bit_slice_update` whose start index is provably out of bounds is replaced by its `to_update` operand, as it would have no effect on the target bit vector.
  * Literal Start Index: A `bit_slice_update` with a known constant start index is replaced by an equivalent `concat` and `bit_slice` structure. This eliminates the need for dynamic update logic in the hardware.
  * Hoisting `BitSliceUpdate` above `Select` of Literals: As with `DynamicBitSlice` above a `select` of literals, if the start index of a `bit_slice_update` is derived from a `select` of literals, the `bit_slice_update` can be hoisted, resulting in static `bit_slice_update` operations for each case of the original `select`.
  * Scaled Bit Slice Updates: Optimizes `bit_slice_update` operations where the start index is a known multiple of the update value's width. This involves converting the `to_update` operand into an array of smaller bit elements, performing an `array_update` on this conceptual array, and then flattening the result back into a bit vector using `concat`.
The pass operates in a worklist-driven fashion, particularly for bit_slice
nodes, ensuring that simplifications are iteratively applied until a fixed
point is reached. This comprehensive approach guarantees a thorough
optimization of bit manipulation logic, crucial for high-quality hardware
generation.
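The two simplest rules above, full-width slices and consecutive slices, can be illustrated with plain integer arithmetic. This Python sketch is an illustrative model of the rewrite, not the pass implementation; the `bit_slice` helper is an invented stand-in for the IR operation.

```python
def bit_slice(value: int, start: int, width: int) -> int:
    """Model of bit_slice on a non-negative integer bit vector."""
    return (value >> start) & ((1 << width) - 1)

x = 0b1101_0110_1011  # a 12-bit value

# Consecutive slices collapse:
# bit_slice(bit_slice(x, s0, w0), s1, w1) == bit_slice(x, s0 + s1, w1)
nested = bit_slice(bit_slice(x, 2, 8), 3, 4)
collapsed = bit_slice(x, 2 + 3, 4)
assert nested == collapsed

# A full-width slice is a no-op:
assert bit_slice(x, 0, 12) == x
```

(The collapse is valid when the inner slice covers the bits requested by the outer one, which the pass checks before rewriting.)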
bool_simp - boolean simplification
Attempts to simplify bitwise / boolean expressions (e.g. of multiple variables).
Note: What follows was generated by the Gemini LLM. Not human verified.
The BooleanSimplificationPass focuses on optimizing logical expressions
within the IR. It achieves this by analyzing the truth table generated from
a local cluster of Boolean operations and replacing them with a simpler,
logically equivalent expression. This is particularly effective for reducing
complex combinations of and, or, not, nand, nor, and xor
operations into more fundamental and efficient forms.
The core mechanism involves:
- Truth Table Generation: For a given Boolean expression, the pass traces its inputs up to a small, manageable set of "frontier" nodes (typically parameters or results of non-Boolean operations). It then conceptually "flows" all possible combinations of true/false values from these frontier nodes through the intermediate Boolean operations to determine the resulting truth table of the entire expression. This is currently performed for up to 3 frontier nodes (X, Y, Z), resulting in an 8-bit truth table.

  Example (conceptual truth table for inputs X, Y, Z):

  ```
  X: 00001111 // Represents the value of X for all 8 input combinations
  Y: 00110011 // Represents the value of Y for all 8 input combinations
  Z: 01010101 // Represents the value of Z for all 8 input combinations
  // The pass computes the output bit vector based on the logic of the
  // expression; e.g., for and(X, Y), the result would be 00000011
  ```

- Truth Table Matching and Replacement: The computed N-bit truth table is then compared against a pre-computed set of canonical truth tables for various simple Boolean operations (e.g., `and`, `or`, `not`, `nand`). If a match is found and the canonical replacement is simpler (based on a cost heuristic), the original complex Boolean expression is replaced by this simpler form. This can also detect expressions that are always true or always false and replace them with literal values.

  Example (original expression involving `not` and `and`):

  ```
  fn f(x: bits[42], y: bits[42]) -> bits[42] {
    and_op: bits[42] = and(x, y)
    ret not_op: bits[42] = not(and_op)
  }
  ```

  After the BooleanSimplificationPass, this could be transformed into:

  ```
  fn f(x: bits[42], y: bits[42]) -> bits[42] {
    ret nand_op: bits[42] = nand(x, y)
  }
  ```

- Handling of Redundancy and Symmetry: The pass incorporates knowledge of Boolean algebra rules, including symmetric properties of operations (e.g., `and(A, B)` is equivalent to `and(B, A)`) and the identification of redundant inputs (e.g., in `and(X, X, Y)`, one `X` is redundant).

This pass is fundamental for logic optimization, as it can significantly reduce the complexity of Boolean logic, leading to more compact and faster hardware implementations.
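The truth-table mechanism can be reproduced with plain integer bitwise operations. This Python sketch is illustrative only (the pass operates on IR nodes, not integers); it flows the 8-bit input tables from the conceptual example above through `not(and(x, y))` and shows the result matches the canonical `nand` table.

```python
X, Y = 0b00001111, 0b00110011  # 8-row truth tables of frontier inputs X and Y
MASK = 0xFF                    # keep results to 8 bits

# Flow the input tables through the expression not(and(x, y)):
and_op = X & Y            # table for and(x, y): 0b00000011
not_op = ~and_op & MASK   # table for not(and(x, y)): 0b11111100

# Canonical truth table for nand(x, y):
nand_table = ~(X & Y) & MASK

# Match found: the expression can be replaced with a single nand(x, y).
assert not_op == nand_table
```

In the real pass this comparison runs against a pre-computed set of canonical tables, and the cheapest matching form wins.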
canon - Canonicalization
IR Canonicalization.
Note: What follows was generated by the Gemini LLM. Not human verified.
The CanonicalizationPass performs various transformations to bring the IR
into a standardized, canonical form. This simplifies subsequent optimization
passes by ensuring consistent patterns for common operations. By having a
predictable structure, other passes can focus on more complex logical
transformations without needing to handle numerous syntactic variations of
the same underlying operation.
Key canonicalizations include:
- Ordering of Literals in Commutative/Comparison Operations: For commutative operations (like `add`, `mul`, `and`, `or`, `xor`) and comparison operations, literals are consistently moved to the right-hand side. This means an expression like `add(constant, x)` becomes `add(x, constant)`. For comparisons, `ult(constant, x)` becomes `ugt(x, constant)`.

  Example:

  ```
  one: bits[2] = literal(value=1)
  ret addval: bits[2] = add(one, x)
  ```

  becomes:

  ```
  ret addval: bits[2] = add(x, one)
  ```

- Conversion of Subtraction to Addition with Negation: `sub(x, literal)` is transformed into `add(x, negate(literal))`.

  Example:

  ```
  one: bits[2] = literal(value=1)
  ret subval: bits[2] = sub(x, one)
  ```

  becomes (in 2-bit arithmetic, where -1 wraps around to 3):

  ```
  ret addval: bits[2] = add(x, literal(value=3))
  ```

- Removal of No-op Zero-extends: If a `zero_ext` operation extends to the same bit width as its operand, it is a no-op and is removed.

  Example:

  ```
  zero_ext: bits[16] = zero_ext(x, new_bit_count=16)
  ```

  becomes:

  ```
  x
  ```

- Replacement of Zero-extend with Concat: A `zero_ext` operation is replaced by a `concat` operation with a zero literal to explicitly show the bit extension.

  Example:

  ```
  zero_ext: bits[42] = zero_ext(x: bits[33], new_bit_count=42)
  ```

  becomes:

  ```
  concat: bits[42] = concat(literal(0, bits=9), x)
  ```

- Removal of Useless Bitwise Reductions: Bitwise reductions (`and_reduce`, `or_reduce`, `xor_reduce`) on single-bit operands are replaced by the operand itself.

  Example:

  ```
  and_reduce.1: bits[1] = and_reduce(x: bits[1])
  ```

  becomes:

  ```
  x
  ```

- Canonicalization of Clamp Operations: Various forms of clamping operations (min/max) implemented with `select` and comparison operations are canonicalized to a consistent `select` form.

- Inversion of Selector for Select Operations: If a 2-way `select` has a `not` operation as its selector, the `not` is removed and the cases are swapped.

  Example:

  ```
  not.1: bits[1] = not(p)
  ret sel.2: bits[8] = sel(not.1, cases=[x, y])
  ```

  becomes:

  ```
  ret sel.2: bits[8] = sel(p, cases=[y, x])
  ```

- Simplification of Select with Full Case Coverage: A `select` operation with a `default` value is transformed into a `select` without a `default` if the number of `cases` plus the `default` covers all possible selector values.

- Simplification of Next/StateRead Predicates: For `next_value` and `state_read` operations, predicates that are always true are removed, making the operation unconditional. If a `next_value` predicate is always false, the `next_value` is removed.
These canonicalizations aim to simplify the graph, make patterns more recognizable for other passes, and improve the overall efficiency of downstream optimizations.
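The subtraction-to-addition rewrite is correct because bit-vector arithmetic wraps around. A quick Python check at 2 bits (illustrative only; the `negate` helper is an invented model of two's-complement negation):

```python
WIDTH = 2
MASK = (1 << WIDTH) - 1  # 0b11

def negate(v: int) -> int:
    """Two's-complement negation at WIDTH bits; negate(1) == 3 when WIDTH == 2."""
    return (-v) & MASK

for x in range(1 << WIDTH):
    sub_result = (x - 1) & MASK          # sub(x, literal(value=1))
    add_result = (x + negate(1)) & MASK  # add(x, literal(value=3))
    assert sub_result == add_result
```

Because `sub` becomes `add` of a (constant-folded) negated literal, later passes only need to reason about one addition pattern instead of two.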
channel_legalization - Legalize multiple send/recvs per channel
Pass that legalizes multiple send/receive operations per channel.
This pass adds cross-activation tokens to guarantee that later activations of a proc cannot send or receive on a channel until all previous activations have completed working with that channel.
The ChannelLegalizationPass is an optimization pass designed to ensure the
correctness and determinism of hardware designs that contain multiple send
or receive operations targeting the same channel within a single Proc
activation. In hardware, simultaneous, unordered access to a shared channel
can lead to non-deterministic behavior, race conditions, or resource
conflicts. This pass addresses these issues by introducing explicit token
dependencies to enforce a well-defined ordering of these operations.
The pass operates by analyzing the communication patterns within Procs and
enforcing strictness rules based on the channel's strictness attribute.
This attribute, defined in the channel configuration, dictates how access to
a shared channel should be managed:
- `kTotalOrder`: All operations on the channel must adhere to a strict total order. The pass verifies whether the existing token graph already enforces this order; if not, it introduces mechanisms (cross-activation tokens) to explicitly enforce it.
- `kRuntimeOrdered`: Similar to `kTotalOrder`, but allows for dynamic ordering at runtime as long as correctness is maintained. The pass inserts assertions that fail at runtime if ordering violations occur.
- `kProvenMutuallyExclusive`: Operations on the channel are expected to be mutually exclusive at compile time (i.e., only one operation can be active in any given activation). The pass uses a Z3 solver to formally prove this mutual exclusivity; if it cannot be proven, a compile-time error is raised.
- `kRuntimeMutuallyExclusive`: Operations are expected to be mutually exclusive at runtime. The pass adds runtime assertions that will trigger an error during simulation or execution if simultaneous access occurs.
- `kArbitraryStaticOrder`: Allows for any static ordering of operations. The pass will introduce token dependencies to impose an arbitrary but deterministic static order if multiple operations are detected.
The core mechanism for enforcing ordering when multiple operations on a
channel exist is by introducing "cross-activation tokens." For each send or
receive operation that is part of a multiple-access group on a given
channel:
- New State Elements for Tokens: A new token-typed `StateElement` is added to the Proc for each `send` or `receive` operation that requires legalization. These new state elements act as implicit tokens, tracking the completion of operations across different activations of the Proc.

- `Next` Operations for New Tokens: For each newly added token state element, a `next_value` operation is introduced. This `next_value` is configured to update the token state with the token output of the corresponding `send` or `receive` operation, conditioned on the `send`/`receive` predicate (if it exists). This ensures that the token for the next activation is only available after the current activation's operation has completed.

- Modifying Original Tokens: The original token input to each `send` or `receive` operation is replaced with an `after_all` operation. This `after_all` operation combines the original incoming token with all the newly created implicit tokens from other operations on the same channel. This effectively creates a dependency chain, ensuring that a given `send`/`receive` cannot proceed until all other `send`/`receive` operations on that channel from the previous activation have completed, and its own incoming token is ready.
This entire process ensures that even when multiple send/receive
operations target the same channel, their execution is properly sequenced,
preventing data hazards and ensuring deterministic behavior in the generated
hardware. The pass also includes checks to ensure that channels of type
kStreaming (the default) are legalized, while other channel kinds are
skipped.
Example (simplified conceptual view for multiple receives on channel in
with kTotalOrder strictness, where no explicit token dependency exists
initially):
// Original IR snippet for a proc with multiple receives on 'in'
chan in(bits[32], id=0, kind=streaming, ops=receive_only,
flow_control=ready_valid, strictness=total_order)
top proc my_proc() {
tok: token = literal(value=token)
recv0: (token, bits[32]) = receive(tok, channel=in)
recv0_tok: token = tuple_index(recv0, index=0)
recv0_data: bits[32] = tuple_index(recv0, index=1)
recv1: (token, bits[32]) = receive(tok, channel=in)
// Also uses original 'tok', creating a conflict
recv1_tok: token = tuple_index(recv1, index=0)
recv1_data: bits[32] = tuple_index(recv1, index=1)
// ... other logic and next state
}
For this scenario, where recv0 and recv1 initially use the same tok
without explicit ordering, ChannelLegalizationPass would perform
transformations similar to:
// Optimized IR snippet (simplified, showing key changes)
chan in(bits[32], id=0, kind=streaming, ops=receive_only,
flow_control=ready_valid, strictness=total_order)
top proc my_proc() {
original_tok_input: token = literal(value=token) // Original token input
// New state elements to track tokens across activations
implicit_token__recv0_state: token = state_element(init=token)
implicit_token__recv1_state: token = state_element(init=token)
// recv0 now waits on implicit_token__recv1_state from previous activation
recv0: (token, bits[32]) =
receive(
after_all(
original_tok_input,
state_read(implicit_token__recv1_state)),
channel=in)
recv0_tok: token = tuple_index(recv0, index=0)
recv0_data: bits[32] = tuple_index(recv0, index=1)
// Update implicit_token__recv0_state
next (implicit_token__recv0_state, recv0_tok)
// recv1 now waits on implicit_token__recv0_state from previous activation
recv1: (token, bits[32]) =
receive(
after_all(
original_tok_input,
state_read(implicit_token__recv0_state)),
channel=in)
recv1_tok: token = tuple_index(recv1, index=0)
recv1_data: bits[32] = tuple_index(recv1, index=1)
// Update implicit_token__recv1_state
next (implicit_token__recv1_state, recv1_tok)
// ... other logic and next state
}
This introduces a circular dependency through the state, ensuring that
recv0 in activation N-1 must complete before recv1 in activation N
can proceed, and vice-versa, thereby enforcing a total order of access across
activations and preventing simultaneous access.
comparison_simp - Comparison Simplification
Simplifies logical operations on the results of comparison operations. For example:
```
eq(x, 0) && ne(x, 1)  =>  eq(x, 0)
eq(x, 0) && ne(x, 0)  =>  0
eq(x, 0) || ne(x, 0)  =>  1
```
Note: What follows was generated by the Gemini LLM. Not human verified.
The ComparisonSimplificationPass is an optimization pass that focuses on
simplifying logical operations that consume the results of comparison
operations. Its primary goal is to reduce redundant comparisons and
streamline complex Boolean expressions derived from equality, inequality, and
ordering checks, thereby leading to more efficient hardware implementations.
The pass operates in an iterative, fixed-point manner, continuously applying simplifications until no further changes can be made. It performs two main types of optimizations:
- Range-Based Simplification of Comparison/Logical Trees:
  * `ComputeRangeEquivalence`: This function analyzes single-bit nodes (which typically represent the results of comparisons or logical operations) and determines their equivalent range of values. A `RangeEquivalence` struct stores the original `Node` and an `IntervalSet` representing the precise set of values for which the node evaluates to true.
    * For basic comparison operations (e.g., `eq(x, 42)`, `ult(x, 10)`), it directly computes the `IntervalSet` that makes the comparison true.
    * For logical operations (`and`, `or`, `not`, `nand`, `nor`) whose operands are also comparison results (or other logical operations with known ranges), it combines their `IntervalSet`s using set intersection (for `and`), set union (for `or`), or set complement (for `not`, `nand`, `nor`).
  * Replacing with Constants or Simpler Comparisons: Once the `RangeEquivalence` for a node is accurately determined, the pass attempts to simplify the node itself:
    * Maximal Range (Always True): If the computed `IntervalSet` covers the entire possible range of values for `x`, the node (e.g., a complex Boolean expression) is replaced with a literal `1` (representing true).
    * Empty Range (Always False): If the `IntervalSet` is empty, indicating that the condition can never be met, the node is replaced with a literal `0` (representing false).
    * Precise Value: If the `IntervalSet` reduces to a single, precise value `C` (e.g., `x` must be `42`), the node is replaced with an `eq(x, C)` operation.
    * Complementary Precise Value: If the `IntervalSet` represents all values except a single value `C`, the node is replaced with a `ne(x, C)` operation.
    * ULt/UGt Interval: If the `IntervalSet` corresponds to a contiguous range starting from 0 up to `C` (e.g., `[0, 42]`), it is replaced with `ult(x, C+1)`. Conversely, if it is a range from `C` to the maximum value (e.g., `[42, MAX]`), it is replaced with `ugt(x, C-1)`.
  * Guards against De-optimization: The pass incorporates checks to prevent replacing existing simple comparison operations (like `eq`, `ne`, `ult`, `ugt`, etc.) or inversions of multi-user comparisons with new, potentially more complex, comparisons. This guards against unintended logic duplication and an increase in hardware area.

- `TransformDerivedComparisons`: This function identifies and eliminates redundant comparison operations by recognizing fundamental Boolean identities and relationships:
  * Commuted Operands: If a comparison `ult(x, y)` exists in the IR and a redundant `ugt(y, x)` is found, the latter is replaced by the former.
  * Inverted Comparison: If `ult(x, y)` exists and `uge(x, y)` is found, the latter can be replaced by `not(ult(x, y))`, reusing the existing comparison.
  * Inverted and Commuted Comparison: If `ult(x, y)` exists and `ule(y, x)` is found, it can be replaced by `not(ult(x, y))`, combining both the inversion and commutation optimizations.

By aggressively simplifying logical operations over comparisons, this pass significantly reduces the complexity of control logic, which is crucial for generating smaller and faster hardware designs.
Example: Consider the expression eq(x, 0) && ne(x, 1).
// Original IR snippet
fn f(x: bits[32]) -> bits[1] {
literal.0: bits[32] = literal(value=0)
literal.1: bits[32] = literal(value=1)
x_eq_0: bits[1] = eq(x, literal.0)
x_ne_1: bits[1] = ne(x, literal.1)
ret and.final: bits[1] = and(x_eq_0, x_ne_1)
}
The ComputeRangeEquivalence would determine that:
* x_eq_0 is equivalent to x being in the range [[0, 0]].
* x_ne_1 is equivalent to x being in [[0, 0], [2, MAX]].
When these two ranges are combined by the and operation, the resulting
effective range for x is [[0, 0]]. Since this represents x being
precisely equal to 0, the pass will simplify and.final to eq(x, 0):
// Optimized IR (simplified)
fn f(x: bits[32]) -> bits[1] {
literal.0: bits[32] = literal(value=0)
ret eq.final: bits[1] = eq(x, literal.0)
}
This effectively removes the redundant ne(x, 1) comparison, leading to
simpler logic.
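The interval reasoning in the example can be mimicked with a toy interval-set intersection. This Python sketch is not the pass's `IntervalSet` implementation; the `intersect` helper and the list-of-pairs representation are invented for illustration.

```python
# Interval sets are lists of inclusive (lo, hi) pairs over a 32-bit domain.
MAX = 2**32 - 1

def intersect(a, b):
    """Pairwise intersection of two interval sets."""
    out = []
    for lo1, hi1 in a:
        for lo2, hi2 in b:
            lo, hi = max(lo1, lo2), min(hi1, hi2)
            if lo <= hi:
                out.append((lo, hi))
    return out

x_eq_0 = [(0, 0)]             # values of x for which eq(x, 0) is true
x_ne_1 = [(0, 0), (2, MAX)]   # values of x for which ne(x, 1) is true

combined = intersect(x_eq_0, x_ne_1)  # range for and(eq(x, 0), ne(x, 1))
assert combined == [(0, 0)]           # a single precise value: rewrite to eq(x, 0)
```

Because the combined set collapses to the single value 0, the "Precise Value" rule fires and the whole `and` tree becomes one `eq` comparison.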
concat_simp - Concat simplification
Pass which simplifies concats. This includes removing single-operand concats, flattening trees of dependent concats, and others.
Note: What follows was generated by the Gemini LLM. Not human verified.
The ConcatSimplificationPass is an optimization pass in XLS that focuses
on simplifying concat operations in the Intermediate Representation (IR).
Concatenation is a fundamental operation in hardware description languages,
used to combine smaller bit vectors into larger ones. This pass aims to
reduce the complexity, improve the readability, and optimize the hardware
implementation of concat operations and related patterns by applying
various rewrite rules.
The pass operates by iteratively processing concat nodes using a worklist
algorithm. This ensures that newly created or modified concat nodes are
re-evaluated for further simplification. Additionally, it performs a second
pass over other operations that interact with concats to identify further
optimization opportunities.
Here's a detailed breakdown of the key optimizations performed:
-
Removing Trivial Concats (Single Operand): If a
concatoperation has only one operand, it effectively acts as an identity operation. The pass replaces such aconcatwith its single operand, eliminating the redundant node and simplifying the dataflow.concat(x) => x -
Flattening Trees of Concats: Nested
concatoperations (e.g.,concat(A, concat(B, C))) are flattened into a singleconcatoperation with all the original leaf operands. This reduces the depth of the IR graph, simplifies subsequent analyses, and can lead to more compact hardware.concat(a, b, concat(c, concat(d, e))) => concat(a, b, c, d, e) -
Merging Consecutive Literal Operands: If a
concatoperation has two or more consecutive operands that are known literals, these literals are merged into a single, wider literal. This reduces both the number of operands to theconcatand the total number of literal nodes in the graph.concat(literal_1, literal_2, x, literal_3, literal_4) // => concat(merged_literal_1_2, x, merged_literal_3_4) -
Eliminating Zero-Bit Operands: Any operand to a
concatthat has a bit width of zero (e.g.,bits[0]) is functionally a no-op. The pass identifies and removes such operands from theconcat's operand list, cleaning up the IR. -
Hoisting
ReverseOperations aboveConcats: If aconcatoperation is used by a singlereverseoperation (and has no other non-reducible users), thereverseoperation is "hoisted" above theconcat. This means thereverseis applied individually to each operand of theconcat, and then those reversed operands are concatenated in reverse order. This can expose more optimization opportunities for the individualreverseoperations.reverse(concat(a, b, c)) => concat(reverse(c), reverse(b), reverse(a)) -
Merging Consecutive Bit Slices: If a
concathas consecutive operands that arebit_sliceoperations from the same source node and slice consecutive bits, these twobit_slices are merged into a single, widerbit_slice. This simplifies the slicing logic and reduces the number of intermediate nodes.concat(bit_slice(x, start=2, width=2), bit_slice(x, start=0, width=2)) // => concat(bit_slice(x, start=0, width=4)) -
Hoisting Bitwise Operations above
Concats (TryHoistBitWiseOperation): If a bitwise operation (e.g.,and,or,xor) has all its operands asconcatoperations, the bitwise operation can be "hoisted" above the concatenations. This means the operands of the originalconcats are sliced into corresponding bit ranges, the bitwise operation is applied to these slices, and then the results are re-concatenated. This often exposes opportunities for common subexpression elimination or further simplification on the smaller bitwise operations.xor(concat(a, b), concat(c, d)) => concat(xor(a, c), xor(b, d)) -
Hoisting Bitwise Operations with Constants above
Concats (TryHoistBitWiseWithConstant): This is a specialized version of hoisting bitwise operations. If a binary bitwise operation has one constant operand and oneconcatoperand, the bitwise operation can be distributed across theconcat's operands and combined with corresponding slices of the constant. This effectively narrows the operations.// Given x: bits[8], y: bits[8] and(0b10101011, concat(x, y)) // => concat(and(x, 0b1010), and(y, 0b1011)) -
Narrowing and Hoisting Bitwise Operations (
TryNarrowAndHoistBitWiseOperation): This optimization specifically targets scenarios where aconcatis used as an operand to a bitwise operation (likeand,or,xor), and theconcatitself has one constant operand (0 or all ones) and one variable operand. The goal is to narrow the bitwise operation to only the "interesting" bits (those from the variable operand) and hoist it above theconcat. This can significantly reduce the bit width of the affected operations.// Given x: bits[10], b: bits[4] or(x, concat(0b111111, b)) // concat(6 ones, b) // => concat(0b111111, or(bit_slice(x, 0, 4), b)) -
Bypassing Reduction of Concatenation (
TryBypassReductionOfConcatenation): If a bitwise reduction operation (e.g.,or_reduce,and_reduce,xor_reduce) is applied to aconcatoperation, it can be bypassed. The reduction is distributed to each operand of theconcat, and then a new non-reductive bitwise operation (e.g.,or,and,xor) combines the results. This can simplify complex reduction paths.or_reduce(concat(a, b)) => or(or_reduce(a), or_reduce(b)) -
Distributing Reducible Operations (
TryDistributeReducibleOperation): This optimization distributes eq and ne operations across concats. If eq(concat(A, B), C) is found, it is transformed into and(eq(A, slice(C)), eq(B, slice(C))). For ne, the transformation uses an or of nes. This helps "see through" concatenations to simplify comparisons with composite values. // Given x: bits[5], y: bits[10] eq(concat(x, y), literal(0, bits=15)) // => and(eq(x, literal(0, bits=5)), eq(y, literal(0, bits=10)))
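The eq-over-concat distribution can be sketched in Python on plain integers (an illustrative model, not the XLS implementation; eq_over_concat and eq_direct are hypothetical names):

```python
def eq_over_concat(x, y, x_bits, y_bits, k):
    """Compare concat(x, y) against constant k piecewise, mirroring
    TryDistributeReducibleOperation: each operand is compared with the
    matching slice of k."""
    k_hi = (k >> y_bits) & ((1 << x_bits) - 1)  # slice aligned with x
    k_lo = k & ((1 << y_bits) - 1)              # slice aligned with y
    return x == k_hi and y == k_lo

def eq_direct(x, y, x_bits, y_bits, k):
    """Reference semantics: materialize the concatenation, then compare."""
    return ((x << y_bits) | y) == k
```

The two functions agree for all in-range inputs; the distributed form replaces one wide comparison with two narrower ones that downstream passes can often simplify further.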
The ConcatSimplificationPass is a powerful and multi-faceted pass that
significantly contributes to the overall quality of the generated hardware by
producing a more compact, efficient, and readable IR.
cond_spec(Bdd) - Conditional specialization
Pass which specializes arms of select operations based on their selector
value. When analyzing a particular arm of a sel, the value of the
selector is known. This knowledge can be used to simplify the expression
within the arm.
Note: What follows was generated by the Gemini LLM. Not human verified.
For example, consider the following IR:
fn f(a: bits[1], b: bits[31], z: bits[32]) -> bits[32] {
concat: bits[32] = concat(a, b)
ret sel.2: bits[32] = sel(a, cases=[z, concat])
}
In the sel operation, the selector is a. When considering case[1] (the
concat operation), we know that a must have the value 1. Therefore, the
concat(a, b) can be simplified to concat(1, b). The pass will transform
the sel to:
sel.2: bits[32] = sel(a, cases=[z, concat(literal(1), b)])
This optimization applies to sel, priority_sel, and one_hot_sel. It can
also use Binary Decision Diagrams (BDDs) for more powerful analysis of the
conditions.
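The specialization described above can be mimicked in Python (an illustrative model with hypothetical names, not the pass itself):

```python
def sel(selector, cases):
    """Model of the XLS sel operation: pick a case by selector value."""
    return cases[selector]

def f(a, b, z):
    # a: 1 bit, b: 31 bits, z: 32 bits
    concat = (a << 31) | b            # concat(a, b)
    return sel(a, [z, concat])

def f_specialized(a, b, z):
    # Inside case[1] the selector a must be 1, so concat(a, b)
    # can be rewritten as concat(1, b).
    return sel(a, [z, (1 << 31) | b])
```

Both functions return identical results for every input, which is exactly the equivalence the pass relies on when it rewrites the arm.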
cond_spec(false) - Conditional specialization
Pass which specializes arms of select operations based on their selector
value. When analyzing a particular arm of a sel, the value of the
selector is known. This knowledge can be used to simplify the expression
within the arm.
Note: What follows was generated by the Gemini LLM. Not human verified.
For example, consider the following IR:
fn f(a: bits[1], b: bits[31], z: bits[32]) -> bits[32] {
concat: bits[32] = concat(a, b)
ret sel.2: bits[32] = sel(a, cases=[z, concat])
}
In the sel operation, the selector is a. When considering case[1] (the
concat operation), we know that a must have the value 1. Therefore, the
concat(a, b) can be simplified to concat(1, b). The pass will transform
the sel to:
sel.2: bits[32] = sel(a, cases=[z, concat(literal(1), b)])
This optimization applies to sel, priority_sel, and one_hot_sel. It can
also use Binary Decision Diagrams (BDDs) for more powerful analysis of the
conditions.
cond_spec(noBdd) - Conditional specialization
Pass which specializes arms of select operations based on their selector
value. When analyzing a particular arm of a sel, the value of the
selector is known. This knowledge can be used to simplify the expression
within the arm.
Note: What follows was generated by the Gemini LLM. Not human verified.
For example, consider the following IR:
fn f(a: bits[1], b: bits[31], z: bits[32]) -> bits[32] {
concat: bits[32] = concat(a, b)
ret sel.2: bits[32] = sel(a, cases=[z, concat])
}
In the sel operation, the selector is a. When considering case[1] (the
concat operation), we know that a must have the value 1. Therefore, the
concat(a, b) can be simplified to concat(1, b). The pass will transform
the sel to:
sel.2: bits[32] = sel(a, cases=[z, concat(literal(1), b)])
This optimization applies to sel, priority_sel, and one_hot_sel. It can
also use Binary Decision Diagrams (BDDs) for more powerful analysis of the
conditions.
cond_spec(true) - Conditional specialization
Pass which specializes arms of select operations based on their selector
value. When analyzing a particular arm of a sel, the value of the
selector is known. This knowledge can be used to simplify the expression
within the arm.
Note: What follows was generated by the Gemini LLM. Not human verified.
For example, consider the following IR:
fn f(a: bits[1], b: bits[31], z: bits[32]) -> bits[32] {
concat: bits[32] = concat(a, b)
ret sel.2: bits[32] = sel(a, cases=[z, concat])
}
In the sel operation, the selector is a. When considering case[1] (the
concat operation), we know that a must have the value 1. Therefore, the
concat(a, b) can be simplified to concat(1, b). The pass will transform
the sel to:
sel.2: bits[32] = sel(a, cases=[z, concat(literal(1), b)])
This optimization applies to sel, priority_sel, and one_hot_sel. It can
also use Binary Decision Diagrams (BDDs) for more powerful analysis of the
conditions.
const_fold - Constant folding
Pass which performs constant folding. Every op with only literal operands is replaced by an equivalent literal. Runs DCE after constant folding.
Note: What follows was generated by the Gemini LLM. Not human verified.
The ConstantFoldingPass is an optimization pass in XLS that performs the
crucial task of constant folding across the Intermediate Representation (IR).
Its primary function is to identify and evaluate operations where all
operands are compile-time constants, and then replace these operations with a
single literal node that represents the computed result. This process
significantly simplifies the IR, reduces the number of active operations in
the data path, and ultimately contributes to smaller, faster, and more
efficient hardware.
The pass operates by traversing the nodes of a function or proc in topological order. For each node, it determines if it is "constant foldable" based on the following criteria:
-
Not Already a Literal: The node itself must not already be a
literal node, as no further simplification would be possible. -
Has Users (or Implicit Use): The node must be either consumed by at least one other node, or have an implicit use (e.g., being the return value of a function, a state element of a proc, or an I/O operation). Nodes without users are considered dead code and are typically removed by a subsequent Dead Code Elimination (DCE) pass.
-
Not Token-Typed: Nodes with a
token type cannot be folded into a literal, as tokens represent ordering constraints rather than data values. -
Not Side-Effecting (with exception for
gate): Operations that have side effects (e.g., send, receive) generally cannot be constant-folded, as their execution order and effects are crucial for correctness. An exception is made for gate operations, which can be folded if their condition and data inputs are constant, as the gate effectively becomes a simple pass-through or a zero. -
All Operands are Known Constants: Critically, all of the node's operands must either be
literal values themselves, or be other nodes whose values are provably constant at compile time (as determined by a StatelessQueryEngine).
If a node successfully meets these criteria, the pass proceeds to:
-
Collect the
Value of each of its constant operands. -
Interpret the operation using these constant
Values to compute its result. -
Replace the original operation node with a new
literal node containing the computed Value.
After the ConstantFoldingPass is applied, a DeadCodeEliminationPass is
typically run to remove any operations that became dead as a result of their
uses being replaced by literals.
Example:
Consider a function with an add operation whose operands are both literals:
// Original IR
fn IdenticalLiterals() -> bits[8] {
literal.1: bits[8] = literal(value=42)
literal.2: bits[8] = literal(value=123)
ret add.3: bits[8] = add(literal.1, literal.2)
}
The ConstantFoldingPass would identify add.3 as constant foldable. It
would evaluate 42 + 123 to 165 and replace add.3 with a new literal
node:
// Optimized IR (after ConstantFolding and a subsequent DCE pass)
fn IdenticalLiterals() -> bits[8] {
ret literal.new: bits[8] = literal(value=165)
}
This optimization is foundational for many other passes, as it reduces complex expressions to their simplest forms, making subsequent analyses and transformations more effective.
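The folding rule can be sketched in Python over a toy expression tree (a hedged model; the tuple encoding and 8-bit wrap are assumptions for illustration, not the XLS IR):

```python
def const_fold(node):
    """Fold ('add', lhs, rhs) nodes whose operands are all literals.
    Nodes are tuples: ('literal', value), ('param', name), or (op, *operands)."""
    op = node[0]
    if op in ('literal', 'param'):
        return node
    operands = [const_fold(o) for o in node[1:]]
    if op == 'add' and all(o[0] == 'literal' for o in operands):
        # bits[8] semantics: addition wraps modulo 2**8.
        return ('literal', (operands[0][1] + operands[1][1]) & 0xFF)
    return (op, *operands)

expr = ('add', ('literal', 42), ('literal', 123))
# const_fold(expr) -> ('literal', 165), matching the example above
```

Note that an add with a non-literal operand is left untouched, mirroring the "all operands are known constants" criterion.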
cse - Common subexpression elimination
Pass which performs common subexpression elimination. Equivalent ops with the same operands are commoned. The pass can find arbitrarily large common expressions.
Note: What follows was generated by the Gemini LLM. Not human verified.
The CsePass (Common Subexpression Elimination) is an optimization pass in
XLS that identifies and merges equivalent computations within a function or
proc. The primary goal is to eliminate redundant calculations, thereby
reducing the size of the generated hardware, improving performance, and
making the IR more compact.
The pass operates by building a canonical representation (CseNode) for each
operation in the IR. Two operations are considered equivalent if they have:
- The same opcode (Op).
- The same operand list (with a canonical ordering applied for commutative operations like add or and).
- The same additional configuration data (e.g., start and width for bit_slice, index for tuple_index).
- The same result type (Type).
Here's a breakdown of how the pass works:
-
CseNode Representation: A CseNode object is created for each Node in the IR. This CseNode captures the essential characteristics that define equivalence, including the opcode, a sorted list of its operands (canonicalized for commutative operations), any non-operand configuration data, and the result type. For side-effecting operations (other than gate) and certain control-flow-related operations (param, next, receive, send, assert, cover, etc.), a unique identifier is assigned to prevent them from being considered equivalent, even if their underlying data appears identical. -
CseNodeArena: A CseNodeArena manages the creation and storage of CseNode objects. It ensures that only one CseNode instance exists for each unique canonical operation. When a Node is processed, the arena either returns an existing CseNode (if an equivalent one has already been seen) or creates a new one. -
Equivalence Grouping: The pass traverses the IR in topological order. For each
Node, it obtains its CseNode representation. All Nodes that map to the same CseNode are grouped together into an "equivalence bucket." -
Replacement: After all nodes have been processed and grouped:
* For each equivalence bucket containing more than one
Node, a "representative" node is chosen from that bucket. The selection criteria for the representative prioritize nodes with shorter assigned names, and then lower node IDs for determinism. * All other
Nodes in the bucket (the "duplicates") are then replaced by the representative node. This means all users of the duplicate nodes are rewired to use the single representative node instead. * The pass can optionally common
literal nodes. If common_literals is false, each literal is treated as unique, preventing them from being merged. This can be useful if preserving distinct literal nodes is important for some later pass. -
Post-DCE: After
CsePass runs, a DeadCodeEliminationPass is typically executed. This is because replacing duplicate nodes with a representative leaves the original duplicate nodes with no remaining users, making them dead code that can then be safely removed from the IR.
Example:
Consider a function with two identical and operations, followed by
identical neg and or operations:
// Original IR snippet
fn nontrivial(x: bits[8], y: bits[8], z: bits[8]) -> bits[8] {
and.1: bits[8] = and(x, y)
neg.2: bits[8] = neg(and.1)
or.3: bits[8] = or(neg.2, z)
and.4: bits[8] = and(x, y) // Identical to and.1
neg.5: bits[8] = neg(and.4) // Identical to neg.2
or.6: bits[8] = or(neg.5, z) // Identical to or.3
ret add.7: bits[8] = add(or.3, or.6)
}
The CsePass would identify that and.1 and and.4 are equivalent; neg.2
and neg.5 are equivalent; and or.3 and or.6 are equivalent. It would
choose one of each pair as a representative and replace the duplicates.
After a subsequent DCE pass, the IR would be simplified to:
// Optimized IR (after CsePass and DCE)
fn nontrivial(x: bits[8], y: bits[8], z: bits[8]) -> bits[8] {
and.1: bits[8] = and(x, y)
neg.2: bits[8] = neg(and.1)
or.3: bits[8] = or(neg.2, z)
ret add.7: bits[8] = add(or.3, or.3) // or.6 replaced by or.3
}
This effectively shares the computation of the redundant and, neg, and
or operations, significantly reducing redundant logic and hardware
requirements.
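The bucketing by canonical key can be sketched in Python (an illustrative model; the node tuples and the cse name are assumptions, not the CseNode machinery):

```python
COMMUTATIVE = {'add', 'and', 'or', 'xor'}

def cse(nodes):
    """nodes: list of (name, op, operands) in topological order.
    Returns a map from each node name to its representative's name."""
    rep = {}    # node name -> representative name
    seen = {}   # canonical key -> first name seen with that key
    for name, op, operands in nodes:
        ops = tuple(rep.get(o, o) for o in operands)   # canonicalize operands
        if op in COMMUTATIVE:
            ops = tuple(sorted(ops))                   # canonical operand order
        rep[name] = seen.setdefault((op, ops), name)
    return rep
```

Running this over the example above yields rep['and.4'] == 'and.1', rep['neg.5'] == 'neg.2', and rep['or.6'] == 'or.3', so the duplicates cascade onto their representatives in one topological sweep.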
dataflow - Dataflow Optimization
An optimization which uses a lattice-based dataflow analysis to find equivalent nodes in the graph and replace them with a simpler form. The analysis traces through tuples, arrays, and select operations. Optimizations which can be performed by this pass:
tuple_index(tuple(x, y), index=1) => y
select(selector, {z, z}) => z
array_index(array_update(A, x, index={42}), index={42}) => x
Note: What follows was generated by the Gemini LLM. Not human verified.
The DataflowSimplificationPass is an optimization pass in XLS that
leverages a lattice-based dataflow analysis to identify and replace logically
equivalent nodes in the Intermediate Representation (IR) with simpler, more
direct forms. This pass is highly effective at "seeing through" structural
operations such as tuples, arrays, and select statements to eliminate
redundant computations and expose values that are directly available from
their original source.
The overarching goal is to reduce the overall complexity and size of the IR,
leading to more compact and efficient hardware designs.
How it Works:
-
NodeSourceAnalysis: The core mechanism of this pass is the NodeSourceAnalysis (implemented via NodeSourceDataflowVisitor). This analysis precisely determines the "source" of every elemental component (individual bits, tuple elements, array elements) within each node in the IR graph. A "source" is defined as the original node or literal from which a particular piece of data originates, tracing through intermediate operations like tuple_index, array_index, bit_slice, and concat. For example, if an operation is tuple_index(tuple(x, y), 1), the effective source of this operation's output is directly y. -
Equivalence Mapping (
source_map): The pass performs a topological sort of the function's nodes. During this traversal, it constructs a hash map (source_map). The keys of this map are LeafTypeTreeView<NodeSource> objects, which uniquely identify the "source" of a node's data. The values are pointers to the Node* that represents that canonical source. If two distinct nodes are found to have the same LeafTypeTreeView<NodeSource>, they are semantically equivalent. -
Simplification via Replacement: * Direct Replacement of Equivalent Nodes: As the pass traverses the graph in topological order, if the
NodeSource of the current node (node) is already present in the source_map (indicating that a logically equivalent node (it->second) has already been processed), then the current node is directly replaced by it->second. This effectively eliminates redundant computations where the same value is derived through different paths.
// Original:
tuple.1: (bits[2], bits[42]) = tuple(x, y)
ret tuple_index.2: bits[42] = tuple_index(tuple.1, index=1)
// After optimization:
ret y: bits[42] = param(name=y) // tuple_index.2 replaced by y
* Special Handling for Empty Types: Nodes representing empty types (e.g., zero-width bits, empty tuples) are handled with care. While they are technically equivalent to each other, replacing them indiscriminately can lead to infinite loops if, for example, two parameters of empty types repeatedly replace each other's uses. To prevent this, the pass explicitly skips replacing nodes of empty types with other equivalent empty-type nodes.
-
MaybeReplaceWithArrayOfExistingNodes (Array Optimization): This auxiliary function specifically targets array nodes. If an array node is constructed entirely from elements that are themselves existing nodes in the graph (and whose NodeSources are already in the source_map), the array node can be replaced by a more direct array literal or a reference to an existing array operation. This is useful for: * Unboxing Arrays: Simplifying array_index(array(x), index=0) to just x. * Reconstructing Arrays from Updates: If a series of
array_update operations effectively reconstructs an array from individual elements (some of which might be initial values), and the pass can track these sources, it can replace the array_update chain with a direct array operation. * Eliminating Redundant Array Construction: If an
array is constructed whose elements are simply array_index operations from another existing array, the entire array construction can be replaced by the original array.
// Original:
element_0: bits[32] = array_index(a, indices=[zero])
element_1: bits[32] = array_index(a, indices=[one])
element_2: bits[32] = array_index(a, indices=[two])
ret array: bits[32][3] = array(element_0, element_1, element_2)
// After optimization:
ret a: bits[32][3] = param(name=a) // Replaced by original array 'a'
Benefits:
- Semantic CSE: Performs a powerful form of common subexpression elimination that identifies logically equivalent computations even if their IR structures are syntactically different.
- IR Simplification: Reduces the number of nodes in the IR by eliminating redundant computations and simplifying dataflow paths through aggregate types (tuples and arrays).
- Hardware Efficiency: Leads directly to more compact and efficient hardware by removing redundant logic and simplifying complex data structures, which translates to better area and performance metrics.
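The source-tracing idea can be sketched for tuples in Python (a toy model; the node encoding and the source_of name are assumptions, not NodeSourceAnalysis itself):

```python
def source_of(node):
    """Trace a node to its originating source(s), seeing through
    tuple construction and tuple_index."""
    op = node[0]
    if op == 'tuple':
        return tuple(source_of(e) for e in node[1:])
    if op == 'tuple_index':
        return source_of(node[1])[node[2]]   # select the indexed element
    return node  # params and literals are their own source

x, y = ('param', 'x'), ('param', 'y')
ti = ('tuple_index', ('tuple', x, y), 1)
# source_of(ti) -> ('param', 'y'), so ti can be replaced by y directly
```

Two nodes with equal traced sources are semantically equivalent, which is what the source_map lookup exploits.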
dce - Dead Code Elimination
The DeadCodeEliminationPass iterates up from a function's result nodes and marks all visited nodes. After that, all unvisited nodes are considered dead.
Note: What follows was generated by the Gemini LLM. Not human verified.
The DeadCodeEliminationPass (DCE) is a fundamental optimization pass in XLS
that removes "dead" or "unreachable" code from the Intermediate
Representation (IR). Dead code refers to any operation whose result is not
used, either directly or indirectly, to produce the final output of a
function or any side-effecting operation. By removing such code, DCE
simplifies the IR graph, reduces the size of the generated hardware, and can
improve the performance of subsequent optimization passes by reducing the
amount of code they need to analyze.
How it Works:
The pass operates by identifying "live" code—operations that are essential for the computation—and then removing everything else. This is achieved through a simple and efficient graph traversal algorithm:
-
Identifying Live Roots: The pass starts by identifying the initial set of "live" nodes. These are nodes that are inherently live because they represent the essential outputs or side effects of the function or proc. This set includes: * The function's return value.
* Any side-effecting operations (e.g.,
assert, cover, trace, send). * Proc
next state update nodes. * Nodes with implicit uses (e.g., output ports in a block).
-
Worklist-based Traversal: A worklist is initialized with all nodes that are initially considered dead (i.e., those with no users and which are deletable). The pass then iteratively removes a node from the worklist and, for each of its operands, checks if that operand has become dead as a result. If so, the operand is added to the worklist.
-
Determining Deletability: A node is considered "deletable" if it is not a root and does not have any side effects. The pass includes a specific
is_deletable check which ensures that: * Nodes with implicit uses (like block ports) are not removed. * Most side-effecting operations (
assert, cover, trace, send, etc.) are not removed, as they are considered observable effects. gate is a special case that is considered removable. *
invoke nodes are generally not removed by DCE. This is a conservative approach, as the invoked function might have side effects that are not immediately apparent. InliningPass is responsible for handling the removal of invoke nodes. -
Node Removal: As the pass identifies dead nodes, it removes them from the function's IR graph. This process continues until the worklist is empty, at which point only live code remains.
Benefits:
- Reduced Hardware Size: The most direct benefit is a reduction in the amount of logic that needs to be synthesized, leading to smaller and more area-efficient hardware.
- Improved Performance: By removing unnecessary computations, the pass can sometimes shorten critical paths, leading to improved timing performance.
- Enhanced IR Quality: A cleaner, more compact IR is easier for both humans and subsequent optimization passes to analyze and transform, which can lead to better overall optimization results.
Example: Consider a function with several operations, some of which do not contribute to the final result:
// Original IR snippet
fn some_dead_code(x: bits[42], y: bits[42]) -> bits[42] {
neg.1: bits[42] = neg(x)
add.2: bits[42] = add(x, y) // Dead: not used by any live node
neg.3: bits[42] = neg(add.2) // Dead: only user is also dead
ret sub.4: bits[42] = sub(neg.1, y)
}
The DeadCodeEliminationPass would perform the following steps:
1. Start with the live root, sub.4 (the function's return value).
2. The live set becomes {sub.4, neg.1, y}. x is also live since it is used by neg.1.
3. Nodes add.2 and neg.3 are identified as having no users in the live set. Since they have no other users and are not side-effecting, they are considered dead.
4. The pass removes add.2 and neg.3 from the graph.
The final, optimized IR would be:
// Optimized IR (after DCE)
fn some_dead_code(x: bits[42], y: bits[42]) -> bits[42] {
neg.1: bits[42] = neg(x)
ret sub.4: bits[42] = sub(neg.1, y)
}
The pass has successfully eliminated the unused add and neg operations,
resulting in a more efficient implementation.
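The mark phase can be sketched as a worklist traversal in Python (a simplified model; side effects and implicit uses are folded into the roots argument, and live_nodes is an illustrative name):

```python
def live_nodes(operands, roots):
    """operands: dict mapping node name -> list of operand names.
    roots: initially live names (return value, side-effecting ops).
    Returns the set of live node names; everything else is dead."""
    live = set()
    worklist = list(roots)
    while worklist:
        n = worklist.pop()
        if n in live:
            continue
        live.add(n)
        worklist.extend(operands.get(n, []))  # operands of a live node are live
    return live

graph = {'neg.1': ['x'], 'add.2': ['x', 'y'], 'neg.3': ['add.2'],
         'sub.4': ['neg.1', 'y'], 'x': [], 'y': []}
# live_nodes(graph, ['sub.4']) -> {'sub.4', 'neg.1', 'x', 'y'};
# add.2 and neg.3 are dead, matching the example above
```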
dfe - Dead Function Elimination
Dead Function Elimination.
Note: What follows was generated by the Gemini LLM. Not human verified.
This pass removes unreachable functions, procs, and blocks (collectively
FunctionBases) from the package. It is crucial for reducing the size and
complexity of the IR by discarding any components that do not contribute to
the final output of the top-level entity.
The pass requires that a top entity (function, proc, or block) be set in
the package to determine the starting point of liveness analysis.
The process involves:
-
Identifying Live Roots: The pass first identifies the initial set of "live"
FunctionBases. This typically includes: * The explicitly designated top-level function, proc, or block. * For old-style procs, any proc that communicates with external channels (e.g.,
send_only or receive_only) is considered live. Liveness then propagates to other procs that communicate with these live procs over internal channels. * For new-style procs and blocks, liveness is determined by which other procs or blocks are instantiated by the top-level proc/block.
-
Transitive Closure Analysis: A depth-first search (DFS) is performed starting from these live roots. During this traversal, all directly or indirectly referenced
FunctionBases (e.g., functions invoked by other functions, body functions of counted_for loops, instantiated blocks) are marked as "reached" or "live". -
Elimination of Dead Code: After the traversal: * Any
FunctionBase that was not marked as reached is considered dead and is removed from the package. * Any channels that are no longer associated with any live proc are also removed.
This pass ensures that the final IR contains only the necessary logic, simplifying subsequent optimization and synthesis steps.
Example:
Consider a package with three functions: f_live, f_dead, and f_top.
f_top is set as the top entity and invokes f_live. f_dead is defined
but never invoked by f_top or f_live.
package my_package
fn f_dead() -> bits[32] {
ret literal.0: bits[32] = literal(value=42)
}
fn f_live(x: bits[32]) -> bits[32] {
ret add.1: bits[32] = add(x, literal(value=1))
}
top fn f_top(y: bits[32]) -> bits[32] {
invoke.0: bits[32] = invoke(y, to_apply=f_live)
ret invoke.0
}
After DeadFunctionEliminationPass runs, f_dead will be removed, and the
package will contain only f_live and f_top.
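The transitive-closure step can be sketched in Python (an illustrative model over a plain call-graph dict, not the XLS CallGraph API; reachable_functions is a hypothetical name):

```python
def reachable_functions(call_graph, top):
    """call_graph: dict mapping function name -> list of callees.
    Returns the set of functions reachable from top; the rest are dead."""
    reached = set()
    stack = [top]
    while stack:
        fn = stack.pop()
        if fn in reached:
            continue
        reached.add(fn)
        stack.extend(call_graph.get(fn, []))
    return reached

graph = {'f_top': ['f_live'], 'f_live': [], 'f_dead': []}
# reachable_functions(graph, 'f_top') -> {'f_top', 'f_live'};
# f_dead is never reached and is removed, matching the example above
```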
fixedpoint_proc_state_flattening - Proc State Flattening
Prepare proc state for further analysis by removing arrays and tuples.
Options Set
Run to a fixedpoint.
Invoked Passes
fixedpoint_simp - Fixed-point Simplification
Standard simplification pipeline.
This is run to a fixedpoint and avoids many time-consuming analyses.
Options Set
Run to a fixedpoint.
Invoked Passes
fixedpoint_simp(2) - Max-2 Fixed-point Simplification
Standard simplification pipeline.
Opt level is capped at 2
Options Set
Run to a fixedpoint.
Cap opt level: 2
Invoked Passes
fixedpoint_simp(3) - Max-3 Fixed-point Simplification
Standard simplification pipeline.
Opt level is capped at 3
Options Set
Run to a fixedpoint.
Cap opt level: 3
Invoked Passes
fixedpoint_simp(>=1,<=2) - Min-1 Max-2 Fixedpoint Simplification
Standard simplification pipeline.
Opt level is capped at 2 and skipped if less than 1
Options Set
Run to a fixedpoint.
Min opt level: 1
Cap opt level: 2
Invoked Passes
full-inlining - full function inlining passes
Fully inline all functions in a single step.
Invoked Passes
ident_remove - Identity Removal
Identity Removal - eliminate all identity() expressions.
Note: What follows was generated by the Gemini LLM. Not human verified.
The IdentityRemovalPass is an optimization pass in XLS that focuses on
simplifying the Intermediate Representation (IR) by eliminating all
identity() expressions. An identity() operation simply passes its single
operand through unchanged. While syntactically valid, these operations
introduce unnecessary nodes into the graph, increasing its complexity without
contributing any new computational value.
How it Works:
-
Traversal: The pass performs a forward traversal through all nodes in the function or proc.
-
Identification: For each node encountered, it checks if the node's opcode (
op()) is Op::kIdentity. -
Replacement: If an
identity() node is found, all its uses are directly replaced with its single operand. This means any node that was consuming the output of the identity() operation will now directly consume the input to the identity() operation. -
Removal (Implicitly by DCE): After all uses of an
identity() node have been replaced, that identity() node itself becomes dead code (it no longer has any consumers). A subsequent DeadCodeEliminationPass is typically run after IdentityRemovalPass to remove these now-unused identity() nodes, further cleaning up the IR.
Benefits:
- IR Simplification: Directly reduces the number of nodes in the IR, making the graph simpler and easier for subsequent passes to analyze and transform.
- Hardware Reduction: Eliminates unnecessary wires and buffers that
might otherwise be generated for
identity() operations in hardware, contributing to smaller area.
- Improved Performance: Removing redundant operations can slightly improve simulation and synthesis performance by reducing the total amount of logic.
Example:
Consider a function with a chain of identity() operations that pass through
intermediate results:
// Original IR snippet
fn simple_neg(x: bits[8]) -> bits[8] {
one: bits[8] = literal(value=1)
v1: bits[8] = identity(x)
add: bits[8] = add(v1, one)
v2: bits[8] = identity(add)
v3: bits[8] = identity(v2)
ret add2: bits[8] = sub(v3, one)
}
The IdentityRemovalPass would process v1, v2, and v3, replacing their
uses with their respective operands. After this pass and a subsequent
DeadCodeEliminationPass, the IR would be simplified to:
// Optimized IR (after IdentityRemovalPass and DCE)
fn simple_neg(x: bits[8]) -> bits[8] {
one: bits[8] = literal(value=1)
add: bits[8] = add(x, one) // v1 replaced by x
// v3 replaced by add (which was v2's replacement)
ret add2: bits[8] = sub(add, one)
}
This effectively removes all the intermediate identity() nodes,
streamlining the dataflow and simplifying the overall design.
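The rewiring can be sketched in Python (a toy model; the node-tuple encoding and the remove_identities name are assumptions for illustration):

```python
def remove_identities(nodes):
    """nodes: list of (name, op, operands) in topological order.
    Drops identity nodes and rewires their users to the forwarded operand."""
    fwd = {}   # identity node name -> the operand it forwards
    out = []
    for name, op, operands in nodes:
        operands = tuple(fwd.get(o, o) for o in operands)
        if op == 'identity':
            fwd[name] = operands[0]   # chains of identities collapse transitively
        else:
            out.append((name, op, operands))
    return out
```

Applied to the simple_neg example, the three identity nodes vanish and the add/sub nodes consume x and add directly, matching the optimized IR shown above.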
inlining - Inlines invocations
Inlines a package toward the top function/proc.
If full then all functions are inlined into the top.
If leaf then only leaf functions are inlined into their caller. This allows
other passes to optimize on smaller graphs.
Note: What follows was generated by the Gemini LLM. Not human verified.
The InliningPass is a fundamental optimization pass in XLS that transforms
invoke operations (function calls) by replacing them with the actual body
of the invoked function. This is a powerful compiler optimization that
eliminates the overhead associated with function calls, exposes more
optimization opportunities for subsequent passes (such as Common
Subexpression Elimination or Dead Code Elimination), and can ultimately lead
to more monolithic and potentially faster hardware designs.
The pass supports two primary modes of inlining, controlled by the
InlineDepth enum:
-
kFull (Full Inlining): In this mode, the pass attempts to recursively inline all functions into their callers, working its way up the call graph towards the top-level function or proc. The process continues until only the top-level entity remains (or until it encounters foreign functions, which are external hardware blocks and cannot be inlined). -
kLeafOnly (Leaf-Only Inlining): This mode selectively inlines only "leaf" functions, i.e., those that do not call any other functions themselves (or only call foreign functions). This can be beneficial when other optimization passes are more effective on smaller, already-inlined graphs, or when fine-grained control over inlining is desired. It also inlines functions that have a single caller and only call leaf functions, or another function which is already being inlined.
How it Works:
-
Call Graph Analysis: The pass initially constructs a
CallGraph for the entire package. This graph precisely depicts the calling relationships between different functions. -
Inlining Order: Functions are processed for inlining in a post-order traversal of the call graph (i.e., leaf functions are processed first). This strategic order ensures that when a function
Foo is inlined into its callers, any functions called by Foo have already been inlined (if eligible), thereby preventing redundant work and simplifying the overall inlining process. -
InlineInvokeFunction: This template function is responsible for the actual inlining of a single invoke operation: * Operand Mapping: It establishes a mapping from the parameters of the invoked function to the corresponding actual arguments (operands) of the invoke instruction. * Node Cloning and Rewiring: It traverses the body of the invoked function in topological order. For each node within the invoked function, it clones the node into the caller function. During this cloning process, any references to parameters of the invoked function are transparently replaced with their corresponding arguments from the
invoke operation. * Source Location Propagation: Source location information is meticulously merged from the original nodes and the
invoke instruction to maintain traceability and debugging information. * Name Propagation: Node names are propagated intelligently to preserve readability after inlining: * If the invoked function's parameter name is a prefix of a node's name, and the corresponding
invoke operand also has an assigned name, the inlined node's name is derived by substituting the parameter name with the operand's name. * If the invoke instruction itself has an assigned name, the return value of the inlined function inherits that name. * If neither of the above applies, and the inlined node had a name derived from a parameter, that derived name is used. * Cover and Assert Label Deduplication: For
cover and assert operations, if inlining would result in multiple instances with identical labels, the labels are modified to include a unique prefix (derived from the caller function and a unique inline_count) to ensure uniqueness in the generated Verilog. * Replacement and Removal: Finally, the original
invoke instruction is replaced by the return value of the inlined function (or a tuple if the function returns multiple values), and the invoke node itself is removed from the caller function as it is no longer needed.
By aggressively inlining functions, InliningPass creates larger, flatter
graphs that provide subsequent optimization passes with a broader scope for
analysis and transformation, ultimately leading to more optimized hardware
designs.
Example (Full Inlining):
Consider a caller function that invokes a callee function:
```
// Original IR
package some_package

fn callee(x: bits[32], y: bits[32]) -> bits[32] {
  ret add.1: bits[32] = add(x, y)
}

fn caller() -> bits[32] {
  literal.2: bits[32] = literal(value=2)
  ret invoke.3: bits[32] = invoke(literal.2, literal.2, to_apply=callee)
}
```
After InliningPass (with `kFull` depth), the `invoke.3` in `caller` would
be replaced by the body of `callee`, resulting in:

```
// Optimized IR (after Inlining and a subsequent DCE pass)
package some_package

fn caller() -> bits[32] {
  literal.2: bits[32] = literal(value=2)
  // callee's body is inlined here
  add.1: bits[32] = add(literal.2, literal.2)
  ret add.1
}
```

This effectively eliminates the function call overhead and exposes the
`add.1` to further optimizations within the caller function. For
`kLeafOnly`, `callee` would be inlined into `caller`, but if `caller` itself
were called by another function, that outer function would not have `caller`
inlined into it. If `caller` called another function `callee2` which itself
called a third function `callee3`, then `callee3` would be inlined into
`callee2`, but `callee2` would not be inlined into `caller`.
label-recovery - LabelRecovery
At the end of the pass pipeline (after inlining and the other optimizations have run), this pass attempts to recover the original names for coverpoints and assertions, to whatever degree possible, so that they are more human-readable. The names are mangled during inlining to ensure uniqueness, but the mangled names are often over-qualified.
Note: What follows was generated by the Gemini LLM. Not human verified.
The LabelRecoveryPass is an optimization pass in XLS designed to enhance
the human-readability of labels associated with cover points and assert
statements. These labels are crucial for debugging and formal verification.
During earlier optimization stages, particularly InliningPass, these labels
are deliberately mangled (e.g., by prepending caller function names and
invocation IDs). This mangling ensures their uniqueness in the flattened IR,
which is critical for generating correct Verilog. However, these mangled
names can become excessively long and difficult for users to interpret.
The primary purpose of LabelRecoveryPass is to revert these mangled labels
back to their original, more concise, user-defined forms wherever possible,
without reintroducing naming collisions that would break downstream tools.
How it Works:
1.  Collecting Original Labels: The pass iterates through all nodes in the function or proc. For each `cover` and `assert` node, it checks for the presence of an `original_label` attribute. This attribute stores the name that the user originally provided before any mangling occurred during inlining.
2.  Grouping by Original Label: It constructs a map where the keys are the `original_label` strings, and the values are lists of all `cover` or `assert` nodes that share that specific `original_label`. This grouping is essential for identifying potential naming collisions.
3.  Collision Check and Recovery: For each unique `original_label` found in the map:
    *   No Collision: If a particular `original_label` is associated with only one `cover` or `assert` node in the entire function, it means that no naming collision occurred (e.g., the original statement was inlined only once, or it was already unique). In this safe scenario, the pass restores the node's `label` attribute to its `original_label`.
    *   Collision Present: If an `original_label` is associated with multiple `cover` or `assert` nodes, it signifies that inlining has created multiple instances of what was originally a single, uniquely named statement. In this case, the pass does not attempt to recover the original label for these collided nodes. Their mangled names are retained to ensure uniqueness and prevent errors in downstream tools like Verilog generation.
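The grouping and collision check described above can be sketched in Python. This is a hypothetical model: the dict-based nodes stand in for the actual XLS node API.

```python
# Hypothetical sketch of label recovery: restore a node's original label
# only when no other cover/assert node shares that original label.
from collections import defaultdict


def recover_labels(nodes):
    """nodes: dicts with 'label' (mangled) and 'original_label' keys."""
    by_original = defaultdict(list)
    for node in nodes:
        if node.get("original_label"):
            by_original[node["original_label"]].append(node)

    for original, group in by_original.items():
        if len(group) == 1:
            # No collision: safe to restore the user's original label.
            group[0]["label"] = original
        # Collision: keep mangled labels so downstream tools stay unique.
    return nodes


nodes = [
    {"label": "caller_0_callee_my_cover", "original_label": "my_cover"},
    {"label": "caller_0_f_dup", "original_label": "dup"},
    {"label": "caller_1_f_dup", "original_label": "dup"},
]
recover_labels(nodes)
# "my_cover" is unique and gets restored; the "dup" pair stays mangled.
```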
Benefits:
- Improved Debuggability: By restoring original labels, the generated Verilog becomes significantly more understandable for hardware designers, making it easier to pinpoint issues related to `cover` points and `assert` statements during simulation and debugging.
- Enhanced Readability: Reduces the verbosity of the generated IR and Verilog code, contributing to overall better readability.
- Preserves Intent: Ensures that the designer's original intent and chosen naming convention for `cover` and `assert` labels are reflected in the final output wherever possible.
Limitations (as noted in the code):
- The current implementation only recovers labels if there are no collisions for a given original label. It does not attempt to find minimal distinguishing prefixes or other sophisticated renaming schemes for collided labels, although this is noted as a potential future improvement.
Example:
Consider a `callee` function with a cover point that is invoked only once
by a `caller` function:

```
// Original IR
package p

fn callee(p: bits[1]) -> () {
  ret cover.10: () = cover(p, label="my_cover_label")
}

top fn caller(p: bits[1]) -> () {
  ret invoke.20: () = invoke(p, to_apply=callee)
}
```
After InliningPass, the `cover.10` in `callee` might be inlined into
`caller` and its label mangled (e.g., to
`"caller_0_callee_my_cover_label"`).

The LabelRecoveryPass would then perform the following steps:

1.  It would locate the inlined `cover` node within the `caller` function.
2.  It would retrieve its `original_label`, which is `"my_cover_label"`.
3.  It would determine that no other `cover` or `assert` node in `caller` shares the `original_label` `"my_cover_label"`.
4.  Consequently, it would restore the label of the inlined `cover` node to `"my_cover_label"`.
```
// Optimized IR (after InliningPass, DCEPass, and LabelRecoveryPass)
package p

top fn caller(p: bits[1]) -> () {
  // Label recovered to original
  ret cover.new: () = cover(p, label="my_cover_label")
}
```
If `callee` were invoked multiple times, leading to multiple `cover` nodes
with the same `original_label` (e.g., in a loop), their labels would remain
mangled to ensure uniqueness, as per the pass's collision avoidance strategy.
leaf-inlining - Inlines invocations
Inlines a package toward the top function/proc.
If full, all functions are inlined into the top. If leaf, only leaf
functions are inlined into their callers. This allows other passes to
optimize on smaller graphs.
Note: What follows was generated by the Gemini LLM. Not human verified.
The InliningPass is a fundamental optimization pass in XLS that transforms
`invoke` operations (function calls) by replacing them with the actual body
of the invoked function. This is a powerful compiler optimization that
eliminates the overhead associated with function calls, exposes more
optimization opportunities for subsequent passes (such as Common
Subexpression Elimination or Dead Code Elimination), and can ultimately lead
to more monolithic and potentially faster hardware designs.
The pass supports two primary modes of inlining, controlled by the
`InlineDepth` enum:

-   `kFull` (Full Inlining): In this mode, the pass attempts to recursively inline all functions into their callers, working its way up the call graph towards the top-level function or proc. The process continues until only the top-level entity remains (or until it encounters foreign functions, which are external hardware blocks and cannot be inlined).
-   `kLeafOnly` (Leaf-Only Inlining): This mode selectively inlines only "leaf" functions, those that do not call any other functions themselves (or only call foreign functions). This can be beneficial when other optimization passes are more effective on smaller, already-inlined graphs, or when fine-grained control over inlining is desired. It also inlines functions that have a single caller and only call leaf functions, or another function which is already being inlined.
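The basic leaf test underlying `kLeafOnly` can be sketched as follows. This is a simplified model: the real pass also applies the single-caller rule described above and works on the actual call graph data structure.

```python
# Hypothetical sketch: a function is a "leaf" if it calls no other
# functions, or only calls foreign functions (which cannot be inlined).
def leaf_functions(call_graph, foreign=frozenset()):
    """call_graph: dict mapping each function name to its set of callees."""
    return {
        fn
        for fn, callees in call_graph.items()
        if all(callee in foreign for callee in callees)
    }


graph = {"caller": {"callee2"}, "callee2": {"callee3"}, "callee3": set()}
leaf_functions(graph)  # -> {"callee3"}
```

Under `kLeafOnly`, only the functions in this set (plus the single-caller cases) are inlined into their callers on a given pass invocation.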
How it Works:
1.  Call Graph Analysis: The pass initially constructs a `CallGraph` for the entire package. This graph precisely depicts the calling relationships between different functions.
2.  Inlining Order: Functions are processed for inlining in a post-order traversal of the call graph (i.e., leaf functions are processed first). This strategic order ensures that when a function `Foo` is inlined into its callers, any functions called by `Foo` have already been inlined (if eligible), thereby preventing redundant work and simplifying the overall inlining process.
3.  `InlineInvokeFunction`: This template function is responsible for the actual inlining of a single `invoke` operation:
    *   Operand Mapping: It establishes a mapping from the parameters of the invoked function to the corresponding actual arguments (operands) of the `invoke` instruction.
    *   Node Cloning and Rewiring: It traverses the body of the invoked function in topological order. For each node within the invoked function, it clones the node into the caller function. During this cloning process, any references to parameters of the invoked function are transparently replaced with their corresponding arguments from the `invoke` operation.
    *   Source Location Propagation: Source location information is meticulously merged from the original nodes and the `invoke` instruction to maintain traceability and debugging information.
    *   Name Propagation: Node names are propagated intelligently to preserve readability after inlining:
        *   If the invoked function's parameter name is a prefix of a node's name, and the corresponding `invoke` operand also has an assigned name, the inlined node's name is derived by substituting the parameter name with the operand's name.
        *   If the `invoke` instruction itself has an assigned name, the return value of the inlined function inherits that name.
        *   If neither of the above applies, and the inlined node had a name derived from a parameter, that derived name is used.
    *   Cover and Assert Label Deduplication: For `cover` and `assert` operations, if inlining would result in multiple instances with identical labels, the labels are modified to include a unique prefix (derived from the caller function and a unique `inline_count`) to ensure uniqueness in the generated Verilog.
    *   Replacement and Removal: Finally, the original `invoke` instruction is replaced by the return value of the inlined function (or a tuple if the function returns multiple values), and the `invoke` node itself is removed from the caller function as it is no longer needed.
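The operand-mapping and cloning steps can be illustrated with a toy IR model. The `inline_invoke` helper and the tuple-based node encoding are inventions for this sketch, not the XLS API.

```python
# Toy model of inlining: a function body is a list of
# (name, op, operand_names) tuples in topological order. Cloning walks the
# body, rewiring parameter references to the invoke's actual arguments.
def inline_invoke(callee_params, callee_body, invoke_args, prefix):
    env = dict(zip(callee_params, invoke_args))  # param -> actual argument
    cloned = []
    for name, op, operands in callee_body:
        new_operands = [env.get(o, o) for o in operands]  # rewire refs
        new_name = f"{prefix}_{name}"  # unique name in the caller
        env[name] = new_name  # later nodes refer to the clone
        cloned.append((new_name, op, new_operands))
    # The last node's clone stands in for the invoke's return value.
    return cloned, env[callee_body[-1][0]]


# Mirrors the example below: callee(x, y) { ret add(x, y) } invoked with
# (literal.2, literal.2).
body = [("add.1", "add", ["x", "y"])]
cloned_nodes, ret = inline_invoke(
    ["x", "y"], body, ["literal.2", "literal.2"], "inl0"
)
# cloned_nodes: [("inl0_add.1", "add", ["literal.2", "literal.2"])]
```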
By aggressively inlining functions, InliningPass creates larger, flatter
graphs that provide subsequent optimization passes with a broader scope for
analysis and transformation, ultimately leading to more optimized hardware
designs.
Example (Full Inlining):
Consider a caller function that invokes a callee function:
```
// Original IR
package some_package

fn callee(x: bits[32], y: bits[32]) -> bits[32] {
  ret add.1: bits[32] = add(x, y)
}

fn caller() -> bits[32] {
  literal.2: bits[32] = literal(value=2)
  ret invoke.3: bits[32] = invoke(literal.2, literal.2, to_apply=callee)
}
```
After InliningPass (with `kFull` depth), the `invoke.3` in `caller` would
be replaced by the body of `callee`, resulting in:

```
// Optimized IR (after Inlining and a subsequent DCE pass)
package some_package

fn caller() -> bits[32] {
  literal.2: bits[32] = literal(value=2)
  // callee's body is inlined here
  add.1: bits[32] = add(literal.2, literal.2)
  ret add.1
}
```

This effectively eliminates the function call overhead and exposes the
`add.1` to further optimizations within the caller function. For
`kLeafOnly`, `callee` would be inlined into `caller`, but if `caller` itself
were called by another function, that outer function would not have `caller`
inlined into it. If `caller` called another function `callee2` which itself
called a third function `callee3`, then `callee3` would be inlined into
`callee2`, but `callee2` would not be inlined into `caller`.
loop_unroll - Unroll counted loops
Note: What follows was generated by the Gemini LLM. Not human verified.
Pass which unrolls `counted_for` loops. Each iteration of the `counted_for`
loop is replaced with an explicit `invoke` operation.
For example, a loop like this:
```
fn body(i: bits[4], accum: bits[32], zero: bits[32]) -> bits[32] {
  zero_ext.3: bits[32] = zero_ext(i, new_bit_count=32)
  add.4: bits[32] = add(zero_ext.3, accum)
  ret add.5: bits[32] = add(add.4, zero)
}

fn unrollable() -> bits[32] {
  literal.1: bits[32] = literal(value=0)
  ret counted_for.2: bits[32] = counted_for(literal.1, trip_count=2,
                                            stride=2, body=body,
                                            invariant_args=[literal.1])
}
```
would be transformed into:
```
invoke.0: bits[32] = invoke(literal(0), literal(0), literal(0),
                            to_apply=body)
invoke.1: bits[32] = invoke(literal(2), invoke.0, literal(0), to_apply=body)
ret invoke.1
```
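Interpreting the `counted_for` semantics shown above as ordinary Python makes the unrolling explicit. This is a behavioral sketch of the loop semantics, not the pass implementation:

```python
# Behavioral model of counted_for: iteration i receives the induction value
# i * stride, the running accumulator, and the invariant args -- mirroring
# the explicit invoke chain in the unrolled IR above.
def unroll_counted_for(init, trip_count, stride, body_fn, invariant_args):
    accum = init
    for i in range(trip_count):
        accum = body_fn(i * stride, accum, *invariant_args)  # one "invoke"
    return accum


def body(i, accum, zero):
    # Mirrors the IR body: zero_ext(i) + accum, then + zero.
    return i + accum + zero


unroll_counted_for(0, trip_count=2, stride=2, body_fn=body,
                   invariant_args=(0,))
# -> body(2, body(0, 0, 0), 0) == 2
```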
lut_conversion - LUT Conversion
Pass which opportunistically converts nodes to lookup tables (selects) where we can prove it's beneficial.
Note: What follows was generated by the Gemini LLM. Not human verified.
The LutConversionPass is an optimization pass in XLS that intelligently
converts complex combinational logic into more efficient lookup tables
(LUTs). These LUTs are represented in the IR as `select` operations. This is
particularly beneficial for hardware synthesis, especially for FPGA targets
where LUTs are fundamental configurable logic blocks, capable of implementing
arbitrary Boolean functions very efficiently within a fixed-size resource.
The pass aims to improve hardware area and potentially delay by replacing
complex gate-level logic with a single, equivalent select operation.
How it Works:
1.  Identifying `Select` Operations and Selectors: The pass iterates through nodes in reverse topological order, focusing on `select` operations. For each `select`, it analyzes its `selector` input.
2.  Minimum Cut Analysis for Selector: The core of this pass involves a "minimum cut" analysis performed on the `selector` node, facilitated by `DataflowGraphAnalysis`. This analysis identifies the minimal set of ancestor nodes (`min_cut`) of the `selector` that, when their values are known, completely determine the value of the `selector`.
    *   A crucial parameter, `max_unknown_bits`, is used with `GetMinCutFor`. This sets an upper limit on the number of "unknown" bits (bits whose values are not fixed to 0 or 1) allowed in the `min_cut`. If the number of unknown bits exceeds this limit, the cut is considered too large to be efficiently represented as a LUT, and the optimization is not performed. This prevents the generation of excessively large LUTs that might consume too many hardware resources.
3.  Constructing the Lookup Table (Truth Table): If a sufficiently small `min_cut` is found, the pass then constructs a conceptual lookup table. This involves:
    *   Enumerating all `min_cut` values: It systematically iterates through all possible combinations of values that the nodes in the `min_cut` can take.
    *   Simulating the `selector`: For each combination of `min_cut` values, it uses an `IrInterpreter` to simulate the sub-graph that computes the `selector` and determines its resulting value.
    *   Mapping Selector Values to Case Results: With the `selector` values determined for each `min_cut` combination, the pass then knows precisely which case of the original `select` operation would be chosen. It fetches the corresponding output value from that case (or the default value of the original `select`). This effectively creates a mapping from each `min_cut` combination to the final output value of the original `select` operation.
4.  Replacing with a New `Select` (LUT): Finally, the original `select` operation is replaced by a new `select` operation that directly implements the lookup table:
    *   New Selector: A new selector for this LUT is constructed by concatenating the "unknown" bits of the `min_cut` nodes. These are the bits that actually differentiate the behavior of the `selector`.
    *   New Cases: The cases of the new `select` are the computed output values (from step 3) for each possible combination of the new selector bits. The order of these cases is determined by the enumeration of `min_cut` values.
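Steps 2-4 amount to building a truth table over the cut's unknown bits. A minimal Python sketch, with invented helper names standing in for the interpreter-driven simulation:

```python
# Hypothetical truth-table construction: enumerate all values of the cut's
# bits, evaluate the selector for each, and read off the chosen case (or
# the default) to form the cases of the replacement select.
def build_lut(selector_fn, cases, default, input_bits):
    table = []
    for value in range(2**input_bits):  # every combination of cut bits
        sel = selector_fn(value)  # stands in for IrInterpreter simulation
        table.append(cases[sel] if sel < len(cases) else default(value))
    return table  # replacement IR: sel(x, cases=table)


# Mirrors the worked example below: selector = 2*x on bits[3] (mod 8),
# cases are literals 0..3, and the default passes x through.
lut = build_lut(lambda x: (2 * x) % 8, [0, 1, 2, 3], lambda x: x, 3)
# lut == [0, 2, 2, 3, 0, 2, 6, 7]
```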
Benefits:
- Area Optimization: By replacing complex gate-level logic with efficient LUT structures, this pass can significantly reduce the overall hardware area, especially for FPGA-based designs.
- Performance Improvement: LUTs typically have predictable and often low delay characteristics, which can lead to improved timing performance for the converted logic.
- IR Simplification: Can simplify complex logical structures into a single `select` node, making the IR more abstract and easier to manage for subsequent optimization stages.
Example:
Consider a `select` operation whose selector is derived from a complex
arithmetic expression of `x`, for instance, `selector = add(x, x)`
(effectively `2*x`). The cases of the `select` are literals.
```
// Original IR snippet
fn simple_select(x: bits[3]) -> bits[3] {
  literal.0: bits[3] = literal(value=0)
  literal.1: bits[3] = literal(value=1)
  literal.2: bits[3] = literal(value=2)
  literal.3: bits[3] = literal(value=3)
  doubled_x: bits[3] = add(x, x)  // Selector for the final select
  selector: bits[3] = doubled_x
  ret result: bits[3] = sel(selector,
                            cases=[literal.0, literal.1, literal.2, literal.3],
                            default=x)
}
```
The LutConversionPass would identify `selector` (which is `doubled_x`) as a
target. It would perform a minimum cut on `selector` and find that `x` is the
minimal set of inputs. It would then enumerate all possible values for `x` (0
to 7 for `bits[3]`), compute the `selector` value and the `select` result
for each `x`, creating a new `select` with `x` as its selector and the
computed results as cases:
```
// Optimized IR (simplified after LutConversionPass and a DCE pass)
fn simple_select(x: bits[3]) -> bits[3] {
  // ... (literals 0-3 remain if still used elsewhere, otherwise removed by
  // DCE)
  ret result: bits[3] = sel(x, cases=[
      literal(value=0),  // x=0, doubled_x=0 -> cases[0]
      literal(value=2),  // x=1, doubled_x=2 -> cases[2]
      x,                 // x=2, doubled_x=4 -> default
      x,                 // x=3, doubled_x=6 -> default
      literal(value=0),  // x=4, doubled_x=0 -> cases[0]
      literal(value=2),  // x=5, doubled_x=2 -> cases[2]
      x,                 // x=6, doubled_x=4 -> default
      x                  // x=7, doubled_x=6 -> default
  ])  // Cases derived from evaluating the original select for each possible x
}
```
The resulting `select` directly maps `x` to the output, effectively
converting the `add` and original `select` into a single, more efficient LUT
implementation.
map_inlining - Inline map operations
A pass to convert `map` nodes to inline `invoke` nodes. We don't directly lower maps to Verilog.
Note: What follows was generated by the Gemini LLM. Not human verified.
The MapInliningPass is an optimization pass that transforms `map`
operations in the XLS IR into a series of explicit `invoke` operations. In
the XLS compilation flow, `map` operations represent a high-level construct
that applies a function to each element of an array. Since target hardware
description languages like Verilog do not directly support map operations,
this pass is crucial for lowering this abstraction to a more
hardware-friendly representation.
The pass operates as follows:
1.  Identification of `map` nodes: It traverses the IR to identify all `map` instructions.
2.  Transformation: For each `map` instruction, it replaces the `map` node with an `array` operation. This `array` operation is composed of multiple `invoke` operations, where each `invoke` corresponds to one application of the mapped function to an element of the input array.
    *   For an input array of size `N`, the `map` operation `map(input_array, to_apply=map_fn)` is replaced by an `array` of `N` elements.
    *   Each element of this newly constructed `array` is an `invoke` operation.
    *   Each `invoke` operation is explicitly passed a single element from the `input_array` (accessed via `array_index` using a literal index) and applies the `map_fn` to it.
This effectively "unrolls" the `map` operation, making each application of
the `map_fn` to an array element an explicit operation in the IR. This is
analogous to how a loop unrolling pass might transform a `counted_for` loop.
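The rewrite is behaviorally equivalent to this Python sketch. It is a toy model: real XLS arrays and invokes are IR nodes, not Python lists and calls.

```python
# Toy model of map inlining: one explicit call per array element replaces
# the high-level map, i.e.
#   map(a, to_apply=f)  =>  array(f(a[0]), f(a[1]), ..., f(a[N-1]))
def inline_map(input_array, map_fn):
    return [map_fn(input_array[i]) for i in range(len(input_array))]


def low16(x):
    # Mirrors the example body: bit_slice(x, start=0, width=16).
    return x & 0xFFFF


inline_map([0x12345678, 0xDEADBEEF], low16)  # -> [0x5678, 0xBEEF]
```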
Example:
Consider a `map` operation that applies `map_fn` to an array `a`:

```
fn map_fn(x: bits[32]) -> bits[16] {
  ret bit_slice.1: bits[16] = bit_slice(x, start=0, width=16)
}

fn main(a: bits[32][4]) -> bits[16][4] {
  ret result: bits[16][4] = map(a, to_apply=map_fn)
}
```
After MapInliningPass, the `map` operation would be replaced by an `array`
of four `invoke` operations, each applying `map_fn` to a specific element of
`a`:
```
fn map_fn(x: bits[32]) -> bits[16] {
  ret bit_slice.1: bits[16] = bit_slice(x, start=0, width=16)
}

fn main(a: bits[32][4]) -> bits[16][4] {
  array_index.0: bits[32] = array_index(a, indices=[literal(0)])
  invoke.0: bits[16] = invoke(array_index.0, to_apply=map_fn)
  array_index.1: bits[32] = array_index(a, indices=[literal(1)])
  invoke.1: bits[16] = invoke(array_index.1, to_apply=map_fn)
  array_index.2: bits[32] = array_index(a, indices=[literal(2)])
  invoke.2: bits[16] = invoke(array_index.2, to_apply=map_fn)
  array_index.3: bits[32] = array_index(a, indices=[literal(3)])
  invoke.3: bits[16] = invoke(array_index.3, to_apply=map_fn)
  ret array.4: bits[16][4] = array(invoke.0, invoke.1, invoke.2, invoke.3)
}
```
This transformation is crucial for enabling the subsequent stages of hardware synthesis.
narrow - Narrowing
A pass which reduces the width of operations, eliminating redundant or unused bits.
Note: What follows was generated by the Gemini LLM. Not human verified.
The NarrowingPass is a powerful optimization pass in XLS designed to reduce
the bit width of bits-typed operations and values within a function or
proc. This is a critical optimization for hardware generation, as a
reduction in bit widths directly translates to smaller, more efficient, and
often faster hardware (fewer wires, smaller functional units, and smaller
registers). The pass employs various advanced analysis techniques to
determine the minimum required bit width for each value without altering its
functional correctness.
The pass is configured with an `AnalysisType` enum, which dictates the
sophistication of the analysis used:

-   `kTernary`: Utilizes a simpler ternary logic-based query engine. This analysis determines, for each bit, whether it is always zero, always one, or unknown.
-   `kRange`: Employs a more powerful range analysis to determine the possible numerical bounds (minimum and maximum values) that a bit-vector can take. This provides more precise information for narrowing.
-   `kRangeWithContext`: Extends range analysis to be context-sensitive. This means it considers predicates (conditions) of `select` operations and other conditional control flow to derive more precise ranges for values within specific execution paths.
-   `kRangeWithOptionalContext`: Allows the choice between `kRange` and `kRangeWithContext` based on the `options.use_context_narrowing_analysis` flag, providing flexibility in the trade-off between analysis time and precision.
The pass operates by iteratively traversing the IR and applying transformations to nodes where bit widths can be safely reduced. Key optimizations include:
1.  Replacing Precise Values with Literals: If an analysis (ternary or range) can definitively determine that a node's value is precisely known (i.e., it always evaluates to a constant), that node is replaced with a `literal` node. This simplifies the graph and can enable further optimizations.
2.  Narrowing Comparison Operations:
    *   Stripping Matched Leading/Trailing Bits: For comparison operations (`eq`, `ne`, `ult`, `ugt`, etc.), if leading or trailing bits of both operands are known to be identical, these matching bits can be logically removed, and the comparison is performed on the narrower remaining bits.

        ```
        // Original: UGt(0b0110_0XXX_0011, 0b0110_0YYY_0011)  // bit strings
        // Leading 0110 and trailing 0011 match for both operands.
        // Optimized: UGt(0bXXX, 0bYYY)
        ```

    *   Signed Comparison Simplification: Signed comparisons with zero-extended operands (or operands whose MSBs are known to be equal) can often be converted to unsigned comparisons on narrower bit widths, as the sign information becomes redundant.
3.  Narrowing `Negate` Operations:
    *   Trailing Zeros: If an operand to a `negate` operation has known trailing zeros, the `negate` can be performed on the more significant bits, and the trailing zeros re-concatenated, effectively reducing the width of the `negate` operation.
    *   Sign Extension Hoisting: For `negate(sign_ext(x))`, if the range analysis shows that `x` cannot be the minimum representable signed value, this can be simplified to `sign_ext(negate(x))`, reducing the width of the `negate` operation. More generally, it finds the minimal bit width for the inner `negate` and then re-extends.
4.  Narrowing Shift Operations (`shll`, `shra`, `shrl`):
    *   Zero Shift Amount: If a shift amount is provably zero, the shift operation is replaced directly by its value operand (a no-op).
    *   Known Leading/Trailing Zeros/Sign Bits: For shift operations, if parts of the `shift_value` are known to be zero or sign bits, the `shift_value` can be sliced to remove these redundant bits before the shift, and the result re-extended (zero-extend or sign-extend) if necessary. The `shift_amount` itself can also be narrowed if it has leading zeros.
    *   Out-of-Bounds Shifts: Shift operations that are guaranteed to shift all bits off the end of the operand (resulting in a zero value) are replaced by a literal zero.
5.  Narrowing `Add` and `Sub` Operations:
    *   Common Leading Zeros/Sign Bits: For `add` and `sub` operations, if both operands share a common number of leading zeros (for unsigned results) or sign bits (for signed results), the operation can be performed on a narrower bit width. The result is then zero-extended or sign-extended back to the original width. This includes careful handling of potential overflows for signed additions.
6.  Narrowing `Multiply` Operations (`umul`, `smul`, `umulp`, `smulp`):
    *   Result Wider Than Sum of Operands: If the declared result width of a multiply operation is greater than the sum of the bit widths of its operands, the multiplication can be performed at the sum-of-operand widths, and the result then extended.
    *   Operands Wider Than Result: If the operands are wider than the required result width, they can be narrowed (sliced) before the multiplication, reducing the complexity of the multiplier.
    *   Extended Operands (Zero-extend/Sign-extend): When operands are the results of `zero_ext` or `sign_ext` operations, they can often be narrowed back to their original non-extended width, with the multiplication type adjusted (e.g., `umul` or `smul`) if necessary to maintain correctness.
    *   Trailing Zeros: Multiplies where one or both operands have known trailing zeros can be optimized by performing the multiplication on the non-zero bits and concatenating the result with the appropriate number of trailing zeros.
    *   Partial Product Optimization: For `umulp`/`smulp` (partial product multiplies), if the two partial products are immediately summed (e.g., `add(tuple_index(mulp_op, 0), tuple_index(mulp_op, 1))`), the overall operation can be narrowed, and the final sum then extended.
7.  Narrowing `ArrayIndex` Operations:
    *   Literal Index Values: If an index to an `array_index` is a known literal, its bit width can be narrowed to the minimum required to represent the actual array size, saving bits in the index computation.
    *   Known Zero Index: If an index is provably zero, it's replaced with a literal zero of appropriate width.
    *   Known Out-of-Bounds Index: Similar to BitSliceSimplificationPass, if an index is provably out of bounds, it's clamped to the maximum valid index.
    *   Convert `ArrayIndex` to `Select`: For `array_index` operations with a small, discrete set of possible index values (based on range analysis), the operation can be converted to a `select` chain, making explicit checks for each possible index value. This can be beneficial for small arrays by removing complex indexing logic.
8.  Narrowing `Decode` Operations:
    *   Leading Zeros in Index: If the index of a `decode` operation has leading zeros, the index can be sliced to remove them, and the `decode` performed on the narrower index, with the result zero-extended if needed.
    *   Zero Index: A `decode` with a provably zero index is replaced by a literal one, as `decode(0)` always results in a one-hot encoding with the least significant bit set.
The NarrowingPass is typically run repeatedly within an optimization
pipeline until a fixed point is reached. This iterative process is crucial
for achieving maximal bit width reduction across the entire IR, as narrowing
one node can often enable further narrowing of its consumers or producers.
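As an illustration of the add/sub narrowing rule, the width arithmetic can be sketched with a hypothetical helper, assuming the one-extra-carry-bit handling described above:

```python
# Hypothetical width helper: with k common known leading zeros on both
# operands of an n-bit add, each value fits in n-k bits, so the sum fits in
# n-k+1 bits; the narrow sum is then zero-extended back to n bits.
def narrowed_add_width(n, leading_zeros_a, leading_zeros_b):
    common = min(leading_zeros_a, leading_zeros_b)
    return min(max(n - common + 1, 1), n)  # never wider than the original


# Two 32-bit operands each with >= 24 known leading zeros: a 9-bit adder
# suffices, and the result is zero-extended back to 32 bits.
narrowed_add_width(32, 24, 26)  # -> 9
```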
narrow(Context) - Narrowing
A pass which reduces the width of operations, eliminating redundant or unused bits.
Note: What follows was generated by the Gemini LLM. Not human verified.
The NarrowingPass is a powerful optimization pass in XLS designed to reduce
the bit width of bits-typed operations and values within a function or
proc. This is a critical optimization for hardware generation, as a
reduction in bit widths directly translates to smaller, more efficient, and
often faster hardware (fewer wires, smaller functional units, and smaller
registers). The pass employs various advanced analysis techniques to
determine the minimum required bit width for each value without altering its
functional correctness.
The pass is configured with an `AnalysisType` enum, which dictates the
sophistication of the analysis used:

-   `kTernary`: Utilizes a simpler ternary logic-based query engine. This analysis determines, for each bit, whether it is always zero, always one, or unknown.
-   `kRange`: Employs a more powerful range analysis to determine the possible numerical bounds (minimum and maximum values) that a bit-vector can take. This provides more precise information for narrowing.
-   `kRangeWithContext`: Extends range analysis to be context-sensitive. This means it considers predicates (conditions) of `select` operations and other conditional control flow to derive more precise ranges for values within specific execution paths.
-   `kRangeWithOptionalContext`: Allows the choice between `kRange` and `kRangeWithContext` based on the `options.use_context_narrowing_analysis` flag, providing flexibility in the trade-off between analysis time and precision.
The pass operates by iteratively traversing the IR and applying transformations to nodes where bit widths can be safely reduced. Key optimizations include:
1.  Replacing Precise Values with Literals: If an analysis (ternary or range) can definitively determine that a node's value is precisely known (i.e., it always evaluates to a constant), that node is replaced with a `literal` node. This simplifies the graph and can enable further optimizations.
2.  Narrowing Comparison Operations:
    *   Stripping Matched Leading/Trailing Bits: For comparison operations (`eq`, `ne`, `ult`, `ugt`, etc.), if leading or trailing bits of both operands are known to be identical, these matching bits can be logically removed, and the comparison is performed on the narrower remaining bits.

        ```
        // Original: UGt(0b0110_0XXX_0011, 0b0110_0YYY_0011)  // bit strings
        // Leading 0110 and trailing 0011 match for both operands.
        // Optimized: UGt(0bXXX, 0bYYY)
        ```

    *   Signed Comparison Simplification: Signed comparisons with zero-extended operands (or operands whose MSBs are known to be equal) can often be converted to unsigned comparisons on narrower bit widths, as the sign information becomes redundant.
3.  Narrowing `Negate` Operations:
    *   Trailing Zeros: If an operand to a `negate` operation has known trailing zeros, the `negate` can be performed on the more significant bits, and the trailing zeros re-concatenated, effectively reducing the width of the `negate` operation.
    *   Sign Extension Hoisting: For `negate(sign_ext(x))`, if the range analysis shows that `x` cannot be the minimum representable signed value, this can be simplified to `sign_ext(negate(x))`, reducing the width of the `negate` operation. More generally, it finds the minimal bit width for the inner `negate` and then re-extends.
4.  Narrowing Shift Operations (`shll`, `shra`, `shrl`):
    *   Zero Shift Amount: If a shift amount is provably zero, the shift operation is replaced directly by its value operand (a no-op).
    *   Known Leading/Trailing Zeros/Sign Bits: For shift operations, if parts of the `shift_value` are known to be zero or sign bits, the `shift_value` can be sliced to remove these redundant bits before the shift, and the result re-extended (zero-extend or sign-extend) if necessary. The `shift_amount` itself can also be narrowed if it has leading zeros.
    *   Out-of-Bounds Shifts: Shift operations that are guaranteed to shift all bits off the end of the operand (resulting in a zero value) are replaced by a literal zero.
5.  Narrowing `Add` and `Sub` Operations:
    *   Common Leading Zeros/Sign Bits: For `add` and `sub` operations, if both operands share a common number of leading zeros (for unsigned results) or sign bits (for signed results), the operation can be performed on a narrower bit width. The result is then zero-extended or sign-extended back to the original width. This includes careful handling of potential overflows for signed additions.
6.  Narrowing `Multiply` Operations (`umul`, `smul`, `umulp`, `smulp`):
    *   Result Wider Than Sum of Operands: If the declared result width of a multiply operation is greater than the sum of the bit widths of its operands, the multiplication can be performed at the sum-of-operand widths, and the result then extended.
    *   Operands Wider Than Result: If the operands are wider than the required result width, they can be narrowed (sliced) before the multiplication, reducing the complexity of the multiplier.
    *   Extended Operands (Zero-extend/Sign-extend): When operands are the results of `zero_ext` or `sign_ext` operations, they can often be narrowed back to their original non-extended width, with the multiplication type adjusted (e.g., `umul` or `smul`) if necessary to maintain correctness.
    *   Trailing Zeros: Multiplies where one or both operands have known trailing zeros can be optimized by performing the multiplication on the non-zero bits and concatenating the result with the appropriate number of trailing zeros.
    *   Partial Product Optimization: For `umulp`/`smulp` (partial product multiplies), if the two partial products are immediately summed (e.g., `add(tuple_index(mulp_op, 0), tuple_index(mulp_op, 1))`), the overall operation can be narrowed, and the final sum then extended.
7.  Narrowing `ArrayIndex` Operations:
    *   Literal Index Values: If an index to an `array_index` is a known literal, its bit width can be narrowed to the minimum required to represent the actual array size, saving bits in the index computation.
    *   Known Zero Index: If an index is provably zero, it's replaced with a literal zero of appropriate width.
* Known Out-of-Bounds Index: Similar to
BitSliceSimplificationPass, if an index is provably out of bounds, it's clamped to the maximum valid index.* Convert
ArrayIndextoSelect: Forarray_indexoperations with a small, discrete set of possible index values (based on range analysis), the operation can be converted to aselectchain, making explicit checks for each possible index value. This can be beneficial for small arrays by removing complex indexing logic. -
Narrowing
DecodeOperations: * Leading Zeros in Index: If the index of adecodeoperation has leading zeros, the index can be sliced to remove them, and thedecodeperformed on the narrower index, with the result zero-extended if needed.* Zero Index: A
decodewith a provably zero index is replaced by a literal one, asdecode(0)always results in a one-hot encoding with the least significant bit set.
The NarrowingPass is typically run repeatedly within an optimization
pipeline until a fixed point is reached. This iterative process is crucial
for achieving maximal bit width reduction across the entire IR, as narrowing
one node can often enable further narrowing of its consumers or producers.
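As an illustration of the bit-stripping rule described above, the following Python model (a sketch, not XLS code; `strip_matched_bits` and `ugt` are hypothetical helpers invented for this example) checks that removing matched leading/trailing bits preserves an unsigned greater-than comparison:

```python
def strip_matched_bits(a: str, b: str) -> tuple[str, str]:
    """Drop leading and trailing bit positions where the equal-length,
    MSB-first bit strings a and b agree."""
    lead = 0
    while lead < len(a) and a[lead] == b[lead]:
        lead += 1
    trail = 0
    while trail < len(a) - lead and a[len(a) - 1 - trail] == b[len(b) - 1 - trail]:
        trail += 1
    return a[lead:len(a) - trail], b[lead:len(b) - trail]

def ugt(a: str, b: str) -> bool:
    """Unsigned > on bit strings; empty strings mean the operands were equal."""
    if not a:
        return False
    return int(a, 2) > int(b, 2)

# The narrowed comparison agrees with the full-width one for all 6-bit pairs.
for x in range(64):
    for y in range(64):
        sa, sb = strip_matched_bits(f"{x:06b}", f"{y:06b}")
        assert ugt(sa, sb) == (x > y)
```

The exhaustive loop mirrors why the transformation is safe: an unsigned comparison is decided at the first differing bit, which always survives the stripping.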
narrow(OptionalContext) - Narrowing
A pass which reduces the width of operations, eliminating redundant or unused bits.
Note: What follows was generated by the Gemini LLM. Not human verified.
The NarrowingPass is a powerful optimization pass in XLS designed to reduce
the bit width of bits-typed operations and values within a function or
proc. This is a critical optimization for hardware generation, as a
reduction in bit widths directly translates to smaller, more efficient, and
often faster hardware (fewer wires, smaller functional units, and smaller
registers). The pass employs various advanced analysis techniques to
determine the minimum required bit width for each value without altering its
functional correctness.
The pass is configured with an AnalysisType enum, which dictates the
sophistication of the analysis used:
- `kTernary`: Utilizes a simpler ternary logic-based query engine, which determines for each bit whether it is always zero, always one, or unknown.
- `kRange`: Employs a more powerful range analysis to determine the possible numerical bounds (minimum and maximum values) that a bit vector can take. This provides more precise information for narrowing.
- `kRangeWithContext`: Extends range analysis to be context-sensitive: it considers predicates (conditions) of `select` operations and other conditional control flow to derive more precise ranges for values within specific execution paths.
- `kRangeWithOptionalContext`: Chooses between `kRange` and `kRangeWithContext` based on the `options.use_context_narrowing_analysis` flag, providing flexibility in the trade-off between analysis time and precision.
The pass operates by iteratively traversing the IR and applying transformations to nodes where bit widths can be safely reduced. Key optimizations include:
- Replacing Precise Values with Literals: If an analysis (ternary or range) can definitively determine that a node's value is precisely known (i.e., it always evaluates to a constant), that node is replaced with a `literal` node. This simplifies the graph and can enable further optimizations.
- Narrowing Comparison Operations:
    * Stripping Matched Leading/Trailing Bits: For comparison operations (`eq`, `ne`, `ult`, `ugt`, etc.), if leading or trailing bits of both operands are known to be identical, these matching bits can be logically removed, and the comparison is performed on the narrower remaining bits.

      ```
      // Original: UGt(0b0110_0XXX_0011, 0b0110_0YYY_0011)  // bit strings
      // Leading `0110` and trailing `0011` match for both operands.
      // Optimized: UGt(0bXXX, 0bYYY)
      ```

    * Signed Comparison Simplification: Signed comparisons with zero-extended operands (or operands whose MSBs are known to be equal) can often be converted to unsigned comparisons on narrower bit widths, as the sign information becomes redundant.
- Narrowing `Negate` Operations:
    * Trailing Zeros: If an operand to a `negate` operation has known trailing zeros, the `negate` can be performed on the more significant bits and the trailing zeros re-concatenated, effectively reducing the width of the `negate` operation.
    * Sign Extension Hoisting: For `negate(sign_ext(x))`, if range analysis shows that `x` cannot be the minimum representable signed value, this can be simplified to `sign_ext(negate(x))`, reducing the width of the `negate` operation. More generally, the pass finds the minimal bit width for the inner `negate` and then re-extends.
- Narrowing Shift Operations (`shll`, `shra`, `shrl`):
    * Zero Shift Amount: If a shift amount is provably zero, the shift operation is replaced directly by its value operand (a no-op).
    * Known Leading/Trailing Zeros/Sign Bits: If parts of the shifted value are known to be zero or sign bits, the value can be sliced to remove these redundant bits before the shift, and the result re-extended (zero-extend or sign-extend) if necessary. The shift amount itself can also be narrowed if it has leading zeros.
    * Out-of-Bounds Shifts: Shift operations that are guaranteed to shift all bits off the end of the operand (resulting in a zero value) are replaced by a literal zero.
- Narrowing `Add` and `Sub` Operations:
    * Common Leading Zeros/Sign Bits: For `add` and `sub` operations, if both operands share a common number of leading zeros (for unsigned results) or sign bits (for signed results), the operation can be performed at a narrower bit width. The result is then zero-extended or sign-extended back to the original width. This includes careful handling of potential overflows for signed additions.
- Narrowing Multiply Operations (`umul`, `smul`, `umulp`, `smulp`):
    * Result Wider Than Sum of Operands: If the declared result width of a multiply is greater than the sum of the bit widths of its operands, the multiplication can be performed at the sum-of-operand-widths width and the result then extended.
    * Operands Wider Than Result: If the operands are wider than the required result width, they can be narrowed (sliced) before the multiplication, reducing the complexity of the multiplier.
    * Extended Operands (Zero-extend/Sign-extend): When operands are the results of `zero_ext` or `sign_ext` operations, they can often be narrowed back to their original non-extended width, with the multiplication type adjusted (e.g., `umul` or `smul`) if necessary to maintain correctness.
    * Trailing Zeros: Multiplies where one or both operands have known trailing zeros can be optimized by performing the multiplication on the non-zero bits and concatenating the result with the appropriate number of trailing zeros.
    * Partial Product Optimization: For `umulp`/`smulp` (partial product multiplies), if the two partial products are immediately summed (e.g., `add(tuple_index(mulp_op, 0), tuple_index(mulp_op, 1))`), the overall operation can be narrowed and the final sum then extended.
- Narrowing `ArrayIndex` Operations:
    * Literal Index Values: If an index to an `array_index` is a known literal, its bit width can be narrowed to the minimum required to represent the actual array size, saving bits in the index computation.
    * Known Zero Index: If an index is provably zero, it's replaced with a literal zero of the appropriate width.
    * Known Out-of-Bounds Index: Similar to `BitSliceSimplificationPass`, if an index is provably out of bounds, it's clamped to the maximum valid index.
    * Convert `ArrayIndex` to `Select`: For `array_index` operations with a small, discrete set of possible index values (based on range analysis), the operation can be converted to a `select` chain that makes explicit checks for each possible index value. This can be beneficial for small arrays by removing complex indexing logic.
- Narrowing `Decode` Operations:
    * Leading Zeros in Index: If the index of a `decode` operation has leading zeros, the index can be sliced to remove them and the `decode` performed on the narrower index, with the result zero-extended if needed.
    * Zero Index: A `decode` with a provably zero index is replaced by a literal one, as `decode(0)` always results in a one-hot encoding with the least significant bit set.
The NarrowingPass is typically run repeatedly within an optimization
pipeline until a fixed point is reached. This iterative process is crucial
for achieving maximal bit width reduction across the entire IR, as narrowing
one node can often enable further narrowing of its consumers or producers.
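The add-narrowing rule above can be sanity-checked with a small Python model (a sketch with invented names, not the pass's implementation): if both operands of an N-bit add are known to have at least k leading zeros, performing the add at width N - k + 1 and zero-extending matches the full-width result.

```python
def narrowed_add(x: int, y: int, n: int, k: int) -> int:
    """Model of unsigned add narrowing: both n-bit operands have at least
    k >= 1 known leading zero bits, so the sum fits in n - k + 1 bits and
    the add can be performed at that narrower width."""
    assert k >= 1 and x < (1 << (n - k)) and y < (1 << (n - k))
    w = n - k + 1                      # narrow width: one extra bit for the carry
    narrow = (x + y) & ((1 << w) - 1)  # add performed at the narrow width
    return narrow                      # zero-extension back to n bits is a no-op

# Agrees with the full-width n-bit add for every pair of narrow operands.
n, k = 8, 3
for x in range(1 << (n - k)):
    for y in range(1 << (n - k)):
        assert narrowed_add(x, y, n, k) == (x + y) % (1 << n)
```

The extra carry bit (`w = n - k + 1` rather than `n - k`) is the "careful handling of potential overflows" the description alludes to.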
narrow(Range) - Narrowing
A pass which reduces the width of operations, eliminating redundant or unused bits.
Note: What follows was generated by the Gemini LLM. Not human verified.
The NarrowingPass is a powerful optimization pass in XLS designed to reduce
the bit width of bits-typed operations and values within a function or
proc. This is a critical optimization for hardware generation, as a
reduction in bit widths directly translates to smaller, more efficient, and
often faster hardware (fewer wires, smaller functional units, and smaller
registers). The pass employs various advanced analysis techniques to
determine the minimum required bit width for each value without altering its
functional correctness.
The pass is configured with an AnalysisType enum, which dictates the
sophistication of the analysis used:
- `kTernary`: Utilizes a simpler ternary logic-based query engine, which determines for each bit whether it is always zero, always one, or unknown.
- `kRange`: Employs a more powerful range analysis to determine the possible numerical bounds (minimum and maximum values) that a bit vector can take. This provides more precise information for narrowing.
- `kRangeWithContext`: Extends range analysis to be context-sensitive: it considers predicates (conditions) of `select` operations and other conditional control flow to derive more precise ranges for values within specific execution paths.
- `kRangeWithOptionalContext`: Chooses between `kRange` and `kRangeWithContext` based on the `options.use_context_narrowing_analysis` flag, providing flexibility in the trade-off between analysis time and precision.
The pass operates by iteratively traversing the IR and applying transformations to nodes where bit widths can be safely reduced. Key optimizations include:
- Replacing Precise Values with Literals: If an analysis (ternary or range) can definitively determine that a node's value is precisely known (i.e., it always evaluates to a constant), that node is replaced with a `literal` node. This simplifies the graph and can enable further optimizations.
- Narrowing Comparison Operations:
    * Stripping Matched Leading/Trailing Bits: For comparison operations (`eq`, `ne`, `ult`, `ugt`, etc.), if leading or trailing bits of both operands are known to be identical, these matching bits can be logically removed, and the comparison is performed on the narrower remaining bits.

      ```
      // Original: UGt(0b0110_0XXX_0011, 0b0110_0YYY_0011)  // bit strings
      // Leading `0110` and trailing `0011` match for both operands.
      // Optimized: UGt(0bXXX, 0bYYY)
      ```

    * Signed Comparison Simplification: Signed comparisons with zero-extended operands (or operands whose MSBs are known to be equal) can often be converted to unsigned comparisons on narrower bit widths, as the sign information becomes redundant.
- Narrowing `Negate` Operations:
    * Trailing Zeros: If an operand to a `negate` operation has known trailing zeros, the `negate` can be performed on the more significant bits and the trailing zeros re-concatenated, effectively reducing the width of the `negate` operation.
    * Sign Extension Hoisting: For `negate(sign_ext(x))`, if range analysis shows that `x` cannot be the minimum representable signed value, this can be simplified to `sign_ext(negate(x))`, reducing the width of the `negate` operation. More generally, the pass finds the minimal bit width for the inner `negate` and then re-extends.
- Narrowing Shift Operations (`shll`, `shra`, `shrl`):
    * Zero Shift Amount: If a shift amount is provably zero, the shift operation is replaced directly by its value operand (a no-op).
    * Known Leading/Trailing Zeros/Sign Bits: If parts of the shifted value are known to be zero or sign bits, the value can be sliced to remove these redundant bits before the shift, and the result re-extended (zero-extend or sign-extend) if necessary. The shift amount itself can also be narrowed if it has leading zeros.
    * Out-of-Bounds Shifts: Shift operations that are guaranteed to shift all bits off the end of the operand (resulting in a zero value) are replaced by a literal zero.
- Narrowing `Add` and `Sub` Operations:
    * Common Leading Zeros/Sign Bits: For `add` and `sub` operations, if both operands share a common number of leading zeros (for unsigned results) or sign bits (for signed results), the operation can be performed at a narrower bit width. The result is then zero-extended or sign-extended back to the original width. This includes careful handling of potential overflows for signed additions.
- Narrowing Multiply Operations (`umul`, `smul`, `umulp`, `smulp`):
    * Result Wider Than Sum of Operands: If the declared result width of a multiply is greater than the sum of the bit widths of its operands, the multiplication can be performed at the sum-of-operand-widths width and the result then extended.
    * Operands Wider Than Result: If the operands are wider than the required result width, they can be narrowed (sliced) before the multiplication, reducing the complexity of the multiplier.
    * Extended Operands (Zero-extend/Sign-extend): When operands are the results of `zero_ext` or `sign_ext` operations, they can often be narrowed back to their original non-extended width, with the multiplication type adjusted (e.g., `umul` or `smul`) if necessary to maintain correctness.
    * Trailing Zeros: Multiplies where one or both operands have known trailing zeros can be optimized by performing the multiplication on the non-zero bits and concatenating the result with the appropriate number of trailing zeros.
    * Partial Product Optimization: For `umulp`/`smulp` (partial product multiplies), if the two partial products are immediately summed (e.g., `add(tuple_index(mulp_op, 0), tuple_index(mulp_op, 1))`), the overall operation can be narrowed and the final sum then extended.
- Narrowing `ArrayIndex` Operations:
    * Literal Index Values: If an index to an `array_index` is a known literal, its bit width can be narrowed to the minimum required to represent the actual array size, saving bits in the index computation.
    * Known Zero Index: If an index is provably zero, it's replaced with a literal zero of the appropriate width.
    * Known Out-of-Bounds Index: Similar to `BitSliceSimplificationPass`, if an index is provably out of bounds, it's clamped to the maximum valid index.
    * Convert `ArrayIndex` to `Select`: For `array_index` operations with a small, discrete set of possible index values (based on range analysis), the operation can be converted to a `select` chain that makes explicit checks for each possible index value. This can be beneficial for small arrays by removing complex indexing logic.
- Narrowing `Decode` Operations:
    * Leading Zeros in Index: If the index of a `decode` operation has leading zeros, the index can be sliced to remove them and the `decode` performed on the narrower index, with the result zero-extended if needed.
    * Zero Index: A `decode` with a provably zero index is replaced by a literal one, as `decode(0)` always results in a one-hot encoding with the least significant bit set.
The NarrowingPass is typically run repeatedly within an optimization
pipeline until a fixed point is reached. This iterative process is crucial
for achieving maximal bit width reduction across the entire IR, as narrowing
one node can often enable further narrowing of its consumers or producers.
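The trailing-zeros multiply rule can be modeled in a few lines of Python (a behavioral sketch with an invented helper name, not the pass's implementation): factor the trailing zeros out of one operand, multiply the remaining bits, and concatenate the zeros back on.

```python
def mul_trailing_zeros(x: int, y: int, n: int) -> int:
    """Model of an n-bit umul where x has known trailing zeros: multiply
    the non-zero upper bits of x by y, then shift the zeros back in."""
    mask = (1 << n) - 1
    if x == 0:
        return 0
    t = (x & -x).bit_length() - 1  # count of trailing zero bits in x
    narrow = (x >> t) * y          # multiply on the non-zero part only
    return (narrow << t) & mask    # re-concatenate t trailing zeros, mod 2^n

# Matches the plain n-bit product for all 6-bit operand pairs.
for x in range(64):
    for y in range(64):
        assert mul_trailing_zeros(x, y, 6) == (x * y) % 64
```

In hardware terms, the narrowed multiply uses a smaller multiplier array; the shift is free wiring.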
narrow(Ternary) - Narrowing
A pass which reduces the width of operations, eliminating redundant or unused bits.
Note: What follows was generated by the Gemini LLM. Not human verified.
The NarrowingPass is a powerful optimization pass in XLS designed to reduce
the bit width of bits-typed operations and values within a function or
proc. This is a critical optimization for hardware generation, as a
reduction in bit widths directly translates to smaller, more efficient, and
often faster hardware (fewer wires, smaller functional units, and smaller
registers). The pass employs various advanced analysis techniques to
determine the minimum required bit width for each value without altering its
functional correctness.
The pass is configured with an AnalysisType enum, which dictates the
sophistication of the analysis used:
- `kTernary`: Utilizes a simpler ternary logic-based query engine, which determines for each bit whether it is always zero, always one, or unknown.
- `kRange`: Employs a more powerful range analysis to determine the possible numerical bounds (minimum and maximum values) that a bit vector can take. This provides more precise information for narrowing.
- `kRangeWithContext`: Extends range analysis to be context-sensitive: it considers predicates (conditions) of `select` operations and other conditional control flow to derive more precise ranges for values within specific execution paths.
- `kRangeWithOptionalContext`: Chooses between `kRange` and `kRangeWithContext` based on the `options.use_context_narrowing_analysis` flag, providing flexibility in the trade-off between analysis time and precision.
The pass operates by iteratively traversing the IR and applying transformations to nodes where bit widths can be safely reduced. Key optimizations include:
- Replacing Precise Values with Literals: If an analysis (ternary or range) can definitively determine that a node's value is precisely known (i.e., it always evaluates to a constant), that node is replaced with a `literal` node. This simplifies the graph and can enable further optimizations.
- Narrowing Comparison Operations:
    * Stripping Matched Leading/Trailing Bits: For comparison operations (`eq`, `ne`, `ult`, `ugt`, etc.), if leading or trailing bits of both operands are known to be identical, these matching bits can be logically removed, and the comparison is performed on the narrower remaining bits.

      ```
      // Original: UGt(0b0110_0XXX_0011, 0b0110_0YYY_0011)  // bit strings
      // Leading `0110` and trailing `0011` match for both operands.
      // Optimized: UGt(0bXXX, 0bYYY)
      ```

    * Signed Comparison Simplification: Signed comparisons with zero-extended operands (or operands whose MSBs are known to be equal) can often be converted to unsigned comparisons on narrower bit widths, as the sign information becomes redundant.
- Narrowing `Negate` Operations:
    * Trailing Zeros: If an operand to a `negate` operation has known trailing zeros, the `negate` can be performed on the more significant bits and the trailing zeros re-concatenated, effectively reducing the width of the `negate` operation.
    * Sign Extension Hoisting: For `negate(sign_ext(x))`, if range analysis shows that `x` cannot be the minimum representable signed value, this can be simplified to `sign_ext(negate(x))`, reducing the width of the `negate` operation. More generally, the pass finds the minimal bit width for the inner `negate` and then re-extends.
- Narrowing Shift Operations (`shll`, `shra`, `shrl`):
    * Zero Shift Amount: If a shift amount is provably zero, the shift operation is replaced directly by its value operand (a no-op).
    * Known Leading/Trailing Zeros/Sign Bits: If parts of the shifted value are known to be zero or sign bits, the value can be sliced to remove these redundant bits before the shift, and the result re-extended (zero-extend or sign-extend) if necessary. The shift amount itself can also be narrowed if it has leading zeros.
    * Out-of-Bounds Shifts: Shift operations that are guaranteed to shift all bits off the end of the operand (resulting in a zero value) are replaced by a literal zero.
- Narrowing `Add` and `Sub` Operations:
    * Common Leading Zeros/Sign Bits: For `add` and `sub` operations, if both operands share a common number of leading zeros (for unsigned results) or sign bits (for signed results), the operation can be performed at a narrower bit width. The result is then zero-extended or sign-extended back to the original width. This includes careful handling of potential overflows for signed additions.
- Narrowing Multiply Operations (`umul`, `smul`, `umulp`, `smulp`):
    * Result Wider Than Sum of Operands: If the declared result width of a multiply is greater than the sum of the bit widths of its operands, the multiplication can be performed at the sum-of-operand-widths width and the result then extended.
    * Operands Wider Than Result: If the operands are wider than the required result width, they can be narrowed (sliced) before the multiplication, reducing the complexity of the multiplier.
    * Extended Operands (Zero-extend/Sign-extend): When operands are the results of `zero_ext` or `sign_ext` operations, they can often be narrowed back to their original non-extended width, with the multiplication type adjusted (e.g., `umul` or `smul`) if necessary to maintain correctness.
    * Trailing Zeros: Multiplies where one or both operands have known trailing zeros can be optimized by performing the multiplication on the non-zero bits and concatenating the result with the appropriate number of trailing zeros.
    * Partial Product Optimization: For `umulp`/`smulp` (partial product multiplies), if the two partial products are immediately summed (e.g., `add(tuple_index(mulp_op, 0), tuple_index(mulp_op, 1))`), the overall operation can be narrowed and the final sum then extended.
- Narrowing `ArrayIndex` Operations:
    * Literal Index Values: If an index to an `array_index` is a known literal, its bit width can be narrowed to the minimum required to represent the actual array size, saving bits in the index computation.
    * Known Zero Index: If an index is provably zero, it's replaced with a literal zero of the appropriate width.
    * Known Out-of-Bounds Index: Similar to `BitSliceSimplificationPass`, if an index is provably out of bounds, it's clamped to the maximum valid index.
    * Convert `ArrayIndex` to `Select`: For `array_index` operations with a small, discrete set of possible index values (based on range analysis), the operation can be converted to a `select` chain that makes explicit checks for each possible index value. This can be beneficial for small arrays by removing complex indexing logic.
- Narrowing `Decode` Operations:
    * Leading Zeros in Index: If the index of a `decode` operation has leading zeros, the index can be sliced to remove them and the `decode` performed on the narrower index, with the result zero-extended if needed.
    * Zero Index: A `decode` with a provably zero index is replaced by a literal one, as `decode(0)` always results in a one-hot encoding with the least significant bit set.
The NarrowingPass is typically run repeatedly within an optimization
pipeline until a fixed point is reached. This iterative process is crucial
for achieving maximal bit width reduction across the entire IR, as narrowing
one node can often enable further narrowing of its consumers or producers.
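To make the `kTernary` analysis concrete, here is a minimal Python sketch (a hypothetical helper for illustration, not the XLS query engine) that computes, for each bit position, whether it is always 0, always 1, or unknown (`X`) across a set of possible values:

```python
def ternary_bits(values: set[int], n: int) -> str:
    """Return an MSB-first string over {'0', '1', 'X'} for an n-bit value:
    a bit is a known constant iff it is identical across all of `values`."""
    out = []
    for i in range(n - 1, -1, -1):  # walk bit positions MSB first
        bits = {(v >> i) & 1 for v in values}
        out.append("X" if len(bits) > 1 else str(bits.pop()))
    return "".join(out)

# 0b0100 and 0b0110 agree everywhere except bit 1.
assert ternary_bits({0b0100, 0b0110}, 4) == "01X0"
```

Known-zero leading bits in such a string are exactly what licenses the slicing transformations above: the narrowed operation only needs the `X` span and any known-one bits.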
next_value_opt - Next Value Optimization
Pass which tries to optimize next_value nodes within a Proc.
Note: What follows was generated by the Gemini LLM. Not human verified.
This pass performs several key optimizations to simplify the logic related to state updates and expose further opportunities for other passes.
Optimizations include:
- Removing Literal Predicates: If a `next_value` node has a predicate that evaluates to a constant `false`, the `next_value` operation is dead and thus removed. If the predicate is a constant `true`, the predicate is removed, making the `next_value` unconditional.

  Example (dead `next_value`):

  ```
  x: bits[32]
  next_value(x, value=my_value, predicate=literal(value=0, bits=1))
  ```

  This `next_value` would be entirely removed, as it will never update `x`.

- Splitting Select-based Values: When a `next_value` operation's `value` operand is a `select`, `priority_sel`, or `one_hot_sel` operation, this pass can split the single `next_value` into multiple `next_value` nodes. Each new `next_value` corresponds to one of the select's cases, and its predicate is the condition under which that specific case would be chosen. This can lead to more granular control over state updates and enable further optimizations on individual branches.

  Example (splitting a `select`):

  ```
  x: bits[2]
  select_val: bits[2] = select(x, cases=[literal(2), literal(1)], default=literal(3))
  next_value(x, value=select_val)
  ```

  This could be transformed into (simplified representation):

  ```
  next_value(x, value=literal(2), predicate=eq(x, literal(0)))
  next_value(x, value=literal(1), predicate=eq(x, literal(1)))
  next_value(x, value=literal(3), predicate=ugt(x, literal(1)))
  ```

  This splitting is controlled by the `max_split_depth` parameter to prevent excessive IR growth.

- Handling Non-Synthesizable State Elements: The pass identifies state elements that are marked as non-synthesizable and are performing an identity update. In such cases, it removes the read of the synthesizable version, allowing subsequent passes (like conditional specialization) to more effectively predicate the state read.
For best results, this pass should be run after any transformations that modernize old-style `next (...)` lines into explicit `next_value` nodes.
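The select-splitting transformation can be modeled behaviorally in Python (a sketch with invented names, not the pass itself): each case of the select becomes a predicated update, and applying the predicated updates reproduces the original select semantics.

```python
def select_update(state: int, cases: list[int], default: int) -> int:
    """Original form: next state is select(state, cases, default)."""
    return cases[state] if state < len(cases) else default

def split_updates(state: int, cases: list[int], default: int) -> int:
    """Split form: one predicated next_value per case, plus the default."""
    nexts = [(lambda s, i=i: s == i, v) for i, v in enumerate(cases)]
    nexts.append((lambda s, n=len(cases): s >= n, default))
    for predicate, value in nexts:
        if predicate(state):
            return value
    return state  # no predicate fired: state unchanged

# Matches the doc's example (cases=[2, 1], default=3) over a 2-bit state.
for s in range(4):
    assert split_updates(s, [2, 1], 3) == select_update(s, [2, 1], 3)
```

The point of the split in the real pass is that each predicated `next_value` can then be simplified independently (e.g., a case whose predicate is provably false is removed outright).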
non_synth_separation - Non-Synthesizable Separation
Separates out non-synthesizable nodes like assert/cover/trace from the main function into a cloned function. Every function effectively has two versions, one with synthesizable nodes and one without. The synthesizable version invokes the non-synthesizable version of its function. This ensures that non-synthesizable uses of values do not affect the optimization of the synthesizable parts of the function.
Note: What follows was generated by the Gemini LLM. Not human verified.
The NonSynthSeparationPass is a crucial optimization pass in XLS that
isolates non-synthesizable operations (such as assert, cover, and
trace) from the main, synthesizable logic of a function or proc. This
separation is essential because non-synthesizable operations can introduce
data dependencies that might prevent optimizations on the synthesizable parts
of the design. For instance, a value used in an assert could be seen as
"live" by the Dead Code Elimination (DCE) pass, even if that value is not
used in any synthesizable computation. This would prevent the logic
generating that value from being removed.
By separating these concerns, the pass allows the synthesizable parts of the design to be aggressively optimized without being constrained by the needs of verification and debugging constructs.
How it Works:
- Identification: The pass first traverses each function and proc in the package to determine whether it contains any non-synthesizable nodes (`assert`, `cover`, `trace`). If a function or proc is purely synthesizable, it is left untouched.
- Cloning: For each function or proc that contains non-synthesizable operations, a new, separate "non-synthesizable" version is created:
    * For Functions: The original function is cloned to create a new function (e.g., `f` is cloned to `non_synth_f`).
    * For Procs: The original proc is cloned, but as a `Function` rather than a `Proc`, because the non-synthesizable operations are stateless from a hardware perspective (they don't generate registers). Proc-specific nodes (like `StateRead`, `Next`, `Send`, `Receive`) are handled specially during this cloning:
        * `StateRead` operations become `Param`s in the new function.
        * `Next` operations are handled to update non-synthesizable state, but do not become part of the new function's logic.
        * `Send` and `Receive` operations are similarly abstracted away, with their data payloads passed as parameters to the new function.
- Separation:
    * Original (Synthesizable) Version: All non-synthesizable nodes are removed from the original function or proc. It is then modified to `invoke` the newly created non-synthesizable function, passing it all the necessary data (e.g., function parameters, proc state, received data) that the non-synthesizable operations depend on. This `invoke` itself is non-synthesizable but has a minimal impact on the synthesizable logic.
    * New (Non-Synthesizable) Version: The cloned function is stripped of all its synthesizable outputs (its return value is made a useless empty tuple). It retains only the logic necessary to support the non-synthesizable operations.
- `Gate` to `Select` Conversion: `Gate` operations, which are special power-optimization nodes, are typically not removed by DCE. To prevent their unnecessary duplication in the non-synthesizable function, they are converted to equivalent `select` operations, which can be more easily optimized away if they become redundant.
Benefits:
- Improved Optimization: Frees the synthesizable logic from dependencies related to verification and debugging, enabling more aggressive optimizations like DCE, constant folding, and strength reduction.
- Clear Separation of Concerns: Creates a clean architectural separation between synthesizable hardware logic and non-synthesizable verification/debugging logic, improving the clarity and maintainability of the IR.
- Correctness: Ensures that verification constructs like `assert`, `cover`, and `trace` are preserved and correctly reflect the behavior of the design, while not interfering with the synthesis of the core functional logic.
Example:
Consider a function that contains both a synthesizable computation and an `assert`:
// Original IR snippet
fn my_func(x: bits[32]) -> bits[32] {
tok: token = literal(value=token)
intermediate: bits[32] = add(x, literal(1))
is_zero: bits[1] = eq(intermediate, literal(0))
assert.1: token = assert(tok, is_zero, message="intermediate is zero!")
ret result: bits[32] = mul(intermediate, literal(2))
}
After NonSynthSeparationPass, the IR would be split into two functions:
// Synthesizable version (original function, modified)
fn my_func(x: bits[32]) -> bits[32] {
intermediate: bits[32] = add(x, literal(1))
is_zero: bits[1] = eq(intermediate, literal(0))
// Invoke the non-synthesizable part, passing necessary values
invoke_non_synth: () = invoke(is_zero, to_apply=non_synth_my_func)
ret result: bits[32] = mul(intermediate, literal(2))
}
// Non-synthesizable version (newly created function)
fn non_synth_my_func(is_zero: bits[1]) -> () {
tok: token = literal(value=token)
assert.1: token = assert(tok, is_zero, message="intermediate is zero!")
ret useless_return: () = tuple() // Return value is made useless
}
In this transformed IR, the main my_func is now free of any assert
operations, and optimizations like DCE can proceed on is_zero if result
does not depend on it. The non_synth_my_func encapsulates the verification
logic, ensuring it is preserved without hindering the optimization of the
synthesizable path.
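The gate-to-select conversion this pass performs rests on a simple equivalence: in XLS, `gate(cond, data)` yields `data` when the condition is set and zero otherwise, which is exactly a two-way `select` with a zero case. A minimal Python model of that equivalence (illustrative semantics only, not XLS APIs):

```python
def gate(cond: bool, data: int) -> int:
    """Model of an XLS `gate` op: passes data through when cond is set,
    otherwise produces zero (for power gating)."""
    return data if cond else 0

def select(selector: int, cases: list[int]) -> int:
    """Model of an XLS `select` op: picks cases[selector]."""
    return cases[selector]

# gate(cond, data) behaves like select(cond, cases=[0, data]), which is
# why the rewrite preserves semantics while being easier to optimize away.
for data in (0, 1, 0xABCD):
    for cond in (False, True):
        assert gate(cond, data) == select(int(cond), [0, data])
```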
one-leaf-inlining - leaf function inlining passes
Inlines one level of function calls.
Invoked Passes
post-inlining - Post-inlining passes
Passes performed after inlining
Invoked Passes
post-inlining-opt - post-inlining optimization passes
Passes performed after inlining
Invoked Passes
- fixedpoint_simp(2)
- cond_spec(noBdd)
- dce
- bdd_simp(2)
- dce
- basic_simp
- dce
- bdd_cse
- dce
- cond_spec(Bdd)
- dce
- fixedpoint_simp(2)
- narrow(OptionalContext)
- dce
- basic_simp
- dce
- arith_simp
- dce
- cse
- sparsify_select
- dce
- useless_assert_remove
- ram_rewrite
- useless_io_remove
- dce
- cond_spec(Bdd)
- channel_legalization
- token_dependency
- fixedpoint_simp(2)
- fixedpoint_proc_state_flattening
- proc_state_bits_shatter
- proc_state_tuple_flat
- ident_remove
- dataflow
- next_value_opt
- dce
- proc_state_narrow
- dce
- proc_state_opt
- dce
- proc_state_provenance_narrow
- dce
- proc_state_opt
- dce
- bdd_simp(3)
- dce
- basic_simp
- dce
- bdd_cse
- select_lifting
- dce
- lut_conversion
- dce
- cond_spec(Bdd)
- dce
- fixedpoint_simp(3)
- select_range_simp
- dce
- fixedpoint_simp(3)
- bdd_simp(3)
- dce
- basic_simp
- dce
- bdd_cse
- dce
- proc_state_bits_shatter
- proc_state_tuple_flat
- fixedpoint_simp(3)
- useless_assert_remove
- useless_io_remove
- next_value_opt
- proc_state_opt
- dce
- cond_spec(Bdd)
- dce
- select_merge
- dce
- fixedpoint_simp(3)
post-inlining-opt(>=1) - min-1 post-inlining optimization passes
Passes performed after inlining
Options Set
Min opt level: 1
Invoked Passes
pre-inlining - pre-inlining passes
Passes performed before each inlining.
Invoked Passes
proc_state_array_flat - Proc State Array Flattening
Pass which flattens array elements of the proc state into their constituent elements. Tuples are flattened in a different pass. Flattening improves optimizability because each state element can be considered and transformed in isolation. Flattening also gives the scheduler more flexibility; without flattening, each element in the aggregate must have the same lifetime.
Note: What follows was generated by the Gemini LLM. Not human verified.
The ProcStateArrayFlatteningPass is an optimization pass designed to
enhance the optimizability and scheduling flexibility of Procs in XLS. Its
core function is to transform one-dimensional, array-typed state elements
within the proc state into a tuple of their individual constituent elements.
(Note: A distinct pass, ProcStateTupleFlatteningPass, is responsible for
handling the flattening of tuple-typed state elements.)
The rationale behind this flattening is multifaceted:
- Improved Optimizability: By breaking down arrays into their individual elements, subsequent optimization passes (such as narrowing, constant folding, or dead code elimination) can consider and transform each element in isolation. This often enables more aggressive and effective optimizations that might be inhibited when elements are tightly coupled within an array structure.
- Increased Scheduler Flexibility: When a state element is represented as an array, all elements within that array are typically treated as a single, atomic unit by the scheduler. This implies that they must all have the same lifetime and be updated or accessed simultaneously. By flattening the array into a tuple of individual elements, the scheduler gains greater flexibility. Each element can then potentially have an independent lifetime, enabling more efficient scheduling and resource allocation in the generated hardware.
The pass identifies array-typed state elements that are suitable for
flattening. Currently, the heuristic for flattening is to unconditionally
flatten "small" arrays, specifically those with a size of 2 or less. While a
kMaxArrayFlattenSize constant is defined, the current implementation
applies a stricter size limit.
The transformation process for an array-typed state element (e.g., `state: bits[8][2]`) involves:
- Determining Flattening: The `ShouldFlattenStateElement` function checks if a given state element is an array and meets the predefined size heuristic for flattening.
- Transforming the `StateElement` Type: The original array-typed `StateElement` is replaced by a new `StateElement` of a tuple type, where each element of the tuple corresponds to an element of the original array. For example, an `ArrayType` of `bits[8][2]` becomes a `TupleType` of `(bits[8], bits[8])`.
- Transforming the `initial_value`: The initial value of the `StateElement` is converted from an array literal to a tuple literal with the corresponding element values.
- Transforming `StateRead` Operations: Any `StateRead` operation that previously read the array-typed state element is modified. Instead of directly reading the array, it now reads the new tuple-typed state element. Its uses are then replaced by an `array` operation that reconstructs the original array from `tuple_index` operations on the new tuple-typed state read. This ensures that any downstream logic consuming the original array continues to see the expected array structure.
- Transforming `Next` Operations: Any `next_value` operation that computed the next state for the array-typed element is also modified. Its `value` operand (which would have been an array) is transformed into a tuple by creating `array_index` operations for each element of the original array and then wrapping these in a `tuple` operation.
Example:
Consider a Proc with a state element state that is a 2-element array of
bits[8]:
// Original IR snippet
package my_package
proc my_proc(state: bits[8][2], init={0, 1}) {
// ...
state_read.0: bits[8][2] = state_read(state)
idx0: bits[32] = literal(0)
elem0: bits[8] = array_index(state_read.0, indices=[idx0])
// ... further usage of elem0
// computes next state as an array
next_val: bits[8][2] = array(literal(1), literal(2))
next (state, next_val)
}
After ProcStateArrayFlatteningPass, the state element would be flattened
into a tuple (bits[8], bits[8]):
// Optimized IR snippet (simplified)
package my_package
// state is now a tuple
proc my_proc(state: (bits[8], bits[8]), init={(0, 1)}) {
// ...
state_read.0: (bits[8], bits[8]) = state_read(state) // Reads the tuple
elem0_from_tuple: bits[8] = tuple_index(state_read.0, index=0)
elem1_from_tuple: bits[8] = tuple_index(state_read.0, index=1)
// Any original uses of `state_read.0` would be replaced by an `array` of
// these tuple_indexed elements.
reconstructed_array: bits[8][2] = array(elem0_from_tuple, elem1_from_tuple)
// next_value now produces a tuple (assuming original `next_val` was an
// array of literals)
next_val_tuple: (bits[8], bits[8]) = tuple(literal(1), literal(2))
next (state, next_val_tuple)
}
This transformation enables more fine-grained control and optimization opportunities for individual elements of the original array, which is highly beneficial for hardware synthesis.
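The flattening steps above can be modeled in a few lines of Python (a sketch of the semantics, not the XLS implementation; the size-2 heuristic comes from the prose above):

```python
kFlattenSizeLimit = 2  # heuristic: unconditionally flatten only "small" arrays

def should_flatten(array_state: list) -> bool:
    """Mirror of the ShouldFlattenStateElement size heuristic."""
    return len(array_state) <= kFlattenSizeLimit

def flatten_state(array_state: list) -> tuple:
    """An array-typed state element becomes a tuple of its elements."""
    return tuple(array_state)

def reconstruct_array(tuple_state: tuple) -> list:
    """StateRead users see an `array` rebuilt from tuple_index ops."""
    return [tuple_state[i] for i in range(len(tuple_state))]

state = [0, 1]  # models `state: bits[8][2], init={0, 1}`
assert should_flatten(state)
flat = flatten_state(state)              # the new tuple-typed state
assert reconstruct_array(flat) == state  # downstream still sees the array
```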
proc_state_bits_shatter - Proc State Bits Shattering
Pass which transforms Bits-type elements of the proc state into tuples of components. Only flattens where it can show that doing so will enable dynamic state feedback opportunities later (assuming it's followed by a ProcStateTupleFlatteningPass).
Note: What follows was generated by the Gemini LLM. Not human verified.
The ProcStateBitsShatteringPass is an optimization pass that refactors
bits-typed elements within a Proc's state. It breaks down (or "shatters")
these wide bit vectors into a tuple of smaller bits components.
The primary motivation for this transformation is to expose opportunities for
subsequent passes (such as ProcStateTupleFlatteningPass) to enable more
granular and dynamic control over individual state bits, which can lead to
more efficient hardware implementations.
The pass intelligently identifies bits-typed state elements where splitting
would be beneficial. It only performs the transformation if it can
demonstrate that doing so will likely enable dynamic state feedback. This
often involves detecting specific patterns within the next_value
expressions that define the next state, particularly when those next_value
expressions involve concat operations whose operands are select or
priority_select operations.
Here's a breakdown of its operation:
- Identification of Splitting Candidates: The pass iterates through all bits-typed state elements within a Proc.
- Analysis of `next_value` Expressions: For each state element, it examines its associated `next_value` operations:
  * Simple pass-through `next_value`s (where the next value is just the current state read) do not offer new splitting opportunities.
  * If a `next_value` is defined by a `concat` operation, the pass analyzes the bit widths of the `concat`'s operands to identify potential "split points" within the overall bit vector. For example, for `concat(hi, mid, lo)`, split points would be after `lo` and after `mid`.
  * A key heuristic for determining if splitting is beneficial is when one or more of these `concat` operands are themselves `select` or `priority_select` operations (or small `select` operations, if enabled by `options.split_next_value_selects`). This pattern suggests that different parts of the state element are updated based on different conditions, and splitting them allows for more fine-grained conditional updates to individual components.
- Intersection of Split Points: If multiple `next_value` operations update the same state element, the pass calculates the intersection of their suggested split points. This ensures that any performed splitting is mutually beneficial for all updates to that state element.
- Transformation: If a beneficial set of split points (resulting in more than one component) is identified, the bits-typed state element is transformed into a `tuple` of smaller `bits` types:
  * The `StateRead` operation for the original state element is replaced with a `concat` of `tuple_index` operations, effectively reconstructing the original bit vector from the new tuple components when the full value is needed.
  * Each `next_value` operation for the original state element is replaced with a `tuple` of `bit_slice` operations, extracting the appropriate bit ranges for each component of the new tuple state.
  * The `initial_value` of the state element is also converted into a tuple literal matching the new structure.
Example:
Consider a 16-bit state element x where its next_value is a concat of
three smaller bit groups (x_hi, new_x_mid, x_lo), and new_x_mid is
derived from a select operation.
// Original state and next value
x: bits[16] = state_element(init=0)
// ...
x_lo: bits[6] = bit_slice(x, start=0, width=6)
x_mid: bits[1] = bit_slice(x, start=6, width=1)
x_hi: bits[9] = bit_slice(x, start=7, width=9)
// Conditional update
new_x_mid: bits[1] = select(or_reduce(x_lo), cases={x_mid, not(x_mid)})
next_value(x, value=concat(x_hi, new_x_mid, x_lo))
The ProcStateBitsShatteringPass would identify that x can be beneficially
split into three components (e.g., 6 bits, 1 bit, 9 bits) because
new_x_mid is a select operation, indicating a conditional update to a
sub-part of x. It would transform x into a tuple state element:
// Transformed state element and its updates (simplified)
x: (bits[6], bits[1], bits[9]) = state_element(init=(0,0,0))
// ...
x_read_tuple: (bits[6], bits[1], bits[9]) = state_read(x)
x_new_lo: bits[6] = tuple_index(x_read_tuple, 0)
x_new_mid: bits[1] = tuple_index(x_read_tuple, 1)
x_new_hi: bits[9] = tuple_index(x_read_tuple, 2)
// Original `StateRead` users now reconstruct the full bit vector
original_x_usage: bits[16] = concat(x_new_hi, x_new_mid, x_new_lo)
// `next_value` now operates on the tuple components; `concat_value` here
// stands for the original concat(x_hi, new_x_mid, x_lo)
next_value(x, value=tuple(bit_slice(concat_value, 0, 6),
                          bit_slice(concat_value, 6, 1),
                          bit_slice(concat_value, 7, 9)))
This transformation enables subsequent passes to optimize the updates to
x_lo, x_mid, and x_hi independently, particularly when conditional
logic can be localized to specific components, ultimately leading to more
efficient hardware.
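The split-point bookkeeping described above can be sketched in Python (a hypothetical model of the heuristic; operand widths are given LSB-first, whereas the real pass works on IR nodes):

```python
def split_points(operand_widths_lsb_to_msb: list[int]) -> set[int]:
    """Split points implied by one concat: the cumulative widths of its
    operands, excluding 0 and the total width."""
    points, total = set(), 0
    for w in operand_widths_lsb_to_msb[:-1]:
        total += w
        points.add(total)
    return points

def common_split_points(all_next_value_widths: list[list[int]]) -> set[int]:
    """Intersect split points across every next_value of the element, so
    splitting is mutually beneficial for all updates."""
    sets = [split_points(ws) for ws in all_next_value_widths]
    result = sets[0]
    for s in sets[1:]:
        result &= s
    return result

# For next_value = concat(x_hi, new_x_mid, x_lo) with widths 9, 1, 6
# (LSB-first: [6, 1, 9]), the split points fall after bit 6 and bit 7.
assert split_points([6, 1, 9]) == {6, 7}
# A second update that only splits at bit 6 leaves {6} as the common point.
assert common_split_points([[6, 1, 9], [6, 10]]) == {6}
```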
proc_state_narrow - Proc State Narrowing
Pass which tries to minimize the size and total number of elements of the proc state. The optimizations include removal of dead state elements and zero-width elements.
Note: What follows was generated by the Gemini LLM. Not human verified.
The ProcStateNarrowingPass is an optimization pass designed to reduce the
bit width of bits-typed state elements within a Proc. This is a critical
optimization for hardware generation, as a smaller bit width for state
elements directly translates to smaller and more efficient registers in the
synthesized hardware. The pass achieves this by employing a
ProcStateRangeQueryEngine to analyze the possible range of values a state
element can take over multiple cycles, and then safely narrows its bit width
if a smaller range is identified.
The pass focuses on two primary types of bit width reduction:
- Removal of Known Leading Bits (Zero or One Extension): If range analysis can prove that the most significant bits (MSBs) of a state element are always constant (e.g., always zero or always one), these leading redundant bits can be "removed." The state element's type is then truncated to exclude these bits. When the state is read, these bits are re-concatenated as literals to reconstruct the full width for any operations that depend on the original wider value. When a `next_value` is computed, only the non-constant lower bits are extracted.
  Example: If a 32-bit state element `foo` is always updated such that its 29 MSBs are consistently zero, the pass can narrow its type to `bits[3]`.
    // Original state element
    state_element(foo, bits[32], init=0)
    // Example next_value (simplified, assuming MSBs are always 0)
    next_value(foo, zero_extend(add(literal(1, bits=3),
                                    bit_slice(state_read(foo), 0, 3)), 32))
  After narrowing, `foo` might become `bits[3]`:
    // Narrowed state element
    state_element(foo, bits[3], init=0)
    // Read of foo now re-extends it for compatibility
    extended_foo: bits[32] = concat(literal(0, bits=29), state_read(foo))
    // Next value is computed on the narrowed bits
    next_value(foo, bit_slice(new_value_for_foo, 0, 3))
- Signed Bit Width Narrowing: For state elements representing signed values, the pass utilizes interval analysis (`interval_ops::MinimumSignedBitCount`) to determine the smallest number of bits strictly necessary to represent the full range of signed values a state element can take. If the actual bit width is larger than this minimum, it is narrowed. To preserve signed semantics, the pass uses `sign_ext` to reconstruct the original width when the state element is read.
  Example: If a 32-bit state element `bar` is known to consistently hold signed values between -8 and 7, it can be narrowed to `bits[4]` (as 4 bits can represent signed values from -8 to 7).
    // Original state element
    state_element(bar, bits[32], init=0)
    // Example next_value (simplified, range stays within [-8, 7])
    next_value(bar, add(state_read(bar), literal(1)))
  After narrowing:
    // Narrowed state element
    state_element(bar, bits[4], init=0)
    // Read of bar now sign-extends it for compatibility
    extended_bar: bits[32] = sign_ext(state_read(bar), 32)
    // Next value is computed on the narrowed bits
    next_value(bar, bit_slice(new_value_for_bar, 0, 4))
This pass is highly integrated with ProcStateRangeQueryEngine, which
performs an inter-cycle analysis to precisely track the evolution of state
element values, enabling robust and correct narrowing decisions. The
transformations involve updating the StateElement definition, and modifying
StateRead and Next operations to slice/extend or concatenate as
appropriate, ensuring functional equivalence with the original wider
representation.
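The signed case can be illustrated with a small helper that computes the minimum two's-complement width for a value range (a Python sketch of what `interval_ops::MinimumSignedBitCount` provides; not the actual XLS code):

```python
def minimum_signed_bit_count(lo: int, hi: int) -> int:
    """Smallest two's-complement width that can hold every value in
    [lo, hi]. Width w covers the range [-2**(w-1), 2**(w-1) - 1]."""
    w = 1
    while not (-(1 << (w - 1)) <= lo and hi <= (1 << (w - 1)) - 1):
        w += 1
    return w

# The `bar` example above: values in [-8, 7] fit in 4 bits.
assert minimum_signed_bit_count(-8, 7) == 4
# One value outside that range forces a 5-bit representation.
assert minimum_signed_bit_count(-9, 7) == 5
```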
proc_state_opt - Proc State Optimization
Pass which tries to minimize the size and total number of elements of the proc state. The optimizations include removal of dead state elements and zero-width elements.
Note: What follows was generated by the Gemini LLM. Not human verified.
The ProcStateOptimizationPass is an optimization pass in XLS that focuses
on minimizing the size and total number of state elements within a Proc.
This is a critical optimization for hardware generation, as fewer and smaller
state elements directly translate to reduced register count and memory usage,
leading to more compact, faster, and more efficient hardware designs.
The pass achieves its goals through several key transformations, often operating in an iterative fashion until no further improvements are possible:
- Removal of Zero-Width State Elements: State elements that have a bit width of zero (and are not token-typed) are inherently useless as they cannot store any information. This pass identifies and removes such elements. Their corresponding `StateRead` operations are replaced with their `initial_value` (which will be a zero-width literal tuple), and `Next` operations are replaced with empty tuples.
- Removal of Constant State Elements: If a state element's value is provably constant throughout the entire execution of the Proc (i.e., its `initial_value` is a constant, and all its `next_value` operations consistently assign that same constant value or are no-ops), then the state element is redundant. The pass replaces all reads of such a state element with its constant value and removes the element and its associated `Next` operations. This process is run to a fixed point, meaning it will iterate until no more constant state elements can be found.
- Removal of Unobservable State Elements: This is a more advanced optimization that identifies state elements whose values do not affect any observable output of the Proc (e.g., outputs to channels, assertions, traces). A state element `X` is considered observable if:
  * A side-effecting operation (e.g., `send`, `assert`, `cover`) directly or indirectly depends on `X`.
  * The next-state value of another observable state element depends on `X`.

  To achieve this, the pass performs:
  * State Dependency Analysis: It computes a dependency map showing which state elements each node in the Proc depends upon. This is a forward dataflow analysis.
  * Transitive Closure: It constructs an adjacency matrix for state element dependencies and then computes its transitive closure. This determines which state elements ultimately influence which others, and importantly, which ones influence observable outputs.
  * Removal: Any state element that does not transitively affect an observable output is deemed unobservable and removed. Reads of such elements are replaced with a zero-valued literal, and their `Next` operations are removed.
- Conversion of Constant Chains to State Machines: This optimization identifies sequences of state elements that effectively form a counter or a simple state machine driven by a constant input. For example, if `next[C[i+1]]` is semantically equivalent to `state_read[C[i]]` and `next[C[0]]` is a constant, this sequence can be converted into a single, compact state machine. Instead of using a one-hot encoding (which `ProcStateArrayFlatteningPass` might produce), this pass converts it into a binary-encoded state machine using `log2(chain_length)` bits, which is more area-efficient. The reads of the original state elements are then replaced by a `select` operation that decodes the state machine's value to produce the corresponding original state element value.
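The constant-state check described above reduces to: the initial value is a constant, and every `next_value` either re-assigns that same constant or passes the state through unchanged. A Python sketch of the decision (hypothetical representation; the real pass inspects IR nodes, and the `"passthrough"` sentinel here models `next(state, state_read)`):

```python
def is_constant_state(init_value, next_values: list) -> bool:
    """The element is constant iff every update keeps the init value:
    either a pass-through no-op or a re-assignment of the same constant."""
    return all(v == "passthrough" or v == init_value for v in next_values)

# `constant_state` from the example: init=1, always updated to 1.
assert is_constant_state(1, [1])
# `not_constant_state`: init=0, but its updates write a computed value.
assert not is_constant_state(0, ["computed"])
# Pass-through-only updates also leave the element constant.
assert is_constant_state(5, ["passthrough"])
```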
Benefits:
- Area Reduction: Directly reduces the number of registers and memory required for state, leading to smaller hardware footprints.
- Power Reduction: Fewer registers and less active logic contribute to lower power consumption in the synthesized design.
- IR Simplification: Removes redundant state and associated logic, making the IR cleaner and easier for subsequent compilation stages to process.
Example (Removing a constant state element):
// Original IR snippet
package my_package
chan out(bits[32], kind=streaming, ops=send_only, flow_control=ready_valid)
top proc my_proc() {
tkn: token = literal(value=token)
zero_val: bits[32] = literal(value=0)
one_val: bits[32] = literal(value=1)
// This state element `constant_state` is always `1`
constant_state: bits[32] = state_element(init=1)
next (constant_state, one_val) // Always updates to 1
// `not_constant_state` changes based on its LSB
not_constant_state: bits[32] = state_element(init=0)
not_constant_state_read: bits[32] = state_read(not_constant_state)
not_constant_lsb: bits[1] =
bit_slice(not_constant_state_read, start=0, width=1)
not_constant_next_lsb: bits[1] = not(not_constant_lsb)
next_not_constant_state: bits[32] =
concat(bit_slice(not_constant_state_read, start=1, width=31),
not_constant_next_lsb)
next (not_constant_state, next_not_constant_state)
// Usage: sum of constant_state and the LSB of not_constant_state
state_usage: bits[32] = add(state_read(constant_state),
bit_slice(state_read(not_constant_state),
start=0, width=1))
send_tok: token = send(tkn, state_usage, channel=out)
next (send_tok)
}
The ProcStateOptimizationPass would identify constant_state as provably
constant (always 1). It would then replace all uses of
state_read(constant_state) with literal(1) and remove the
constant_state element and its next_value:
// Optimized IR (simplified after ProcStateOptimizationPass and a subsequent
// DCE pass)
package my_package
chan out(bits[32], id=0, kind=streaming, ops=send_only,
flow_control=ready_valid)
top proc my_proc() {
tkn: token = literal(value=token)
zero_val: bits[32] = literal(value=0)
one_val: bits[32] = literal(value=1)
// `constant_state` has been removed by the pass
not_constant_state: bits[32] = state_element(init=0)
not_constant_state_read: bits[32] = state_read(not_constant_state)
not_constant_lsb: bits[1] = bit_slice(not_constant_state_read, start=0,
width=1)
not_constant_next_lsb: bits[1] = not(not_constant_lsb)
next_not_constant_state: bits[32] =
concat(bit_slice(not_constant_state_read, start=1, width=31),
not_constant_next_lsb)
next (not_constant_state, next_not_constant_state)
// `state_read(constant_state)` is replaced by `literal(1)`
state_usage: bits[32] = add(literal(1, bits=32),
bit_slice(state_read(not_constant_state),
start=0, width=1))
send_tok: token = send(tkn, state_usage, channel=out)
next (send_tok)
}
This effectively reduces the number of state elements and simplifies the logic by replacing constant state accesses with their known values.
proc_state_provenance_narrow - Proc State Provenance Narrowing
Pass which tries to minimize the size and total number of elements of the proc state. This pass works by examining the provenance of the bits making up the next value to determine which (if any) bits are never actually modified.
NB This is a separate pass from ProcStateNarrowing for simplicity of implementation. That pass mostly assumes we'll have a range-analysis which this does not need.
Note: What follows was generated by the Gemini LLM. Not human verified.
The ProcStateProvenanceNarrowingPass is an optimization pass designed to
reduce the bit-width of state elements within a Proc. It achieves this by
performing a detailed "provenance" analysis on the bits that constitute the
next_value of each state element. The goal is to identify which specific
bits of a state element are never actually modified from their initial
value, regardless of the control flow or data inputs to the Proc. These
unchanging bits can then be removed from the state element, leading to a
narrower, more efficient register in hardware.
This pass is distinct from ProcStateNarrowingPass (which primarily relies
on range analysis) because ProcStateProvenanceNarrowingPass does not
require complex range information. Instead, it directly inspects the
bit-level origins and updates.
How it Works:
- Bit Provenance Analysis (`BitProvenanceAnalysis`): The pass utilizes `BitProvenanceAnalysis` to determine, for every bit of every node, its origin. This analysis can trace whether a bit is derived from a specific input bit, a constant, or a combination of sources. It provides a precise understanding of how each bit of a `next_value` is influenced by the current state and inputs.
- Ternary Query Engine (`LazyTernaryQueryEngine`): Alongside provenance, a `LazyTernaryQueryEngine` is used to provide ternary information about bits (known zero, known one, or unknown). This helps in identifying bits that are effectively constant or always take a certain value.
- `UnchangedBitsFunction`: This is the core logic for identifying bits that can be narrowed. For each state element, it examines all its `Next` operations. For a given `Next(state_read, next_value, predicate)`:
  * If `next_value` is the same as `state_read`, those bits are trivially unchanged.
  * Otherwise, it compares the bits of `next_value` with the `initial_bits` of the state element. Using both ternary and provenance analysis, it identifies bits in `next_value` that are provably identical to their corresponding `initial_bits` or are known to be constants that match the `initial_bits`.
  * The `unchanged_bits` mask is iteratively refined (using bitwise AND) across all `Next` operations for a given state element. A bit is deemed "unchanged" only if all `Next` operations either preserve its initial value or are predicated off (i.e., don't update it under conditions where it could change).
- Bit Segment Extraction (`ExtractBitSegments`): Once the final `unchanged_bits` mask is determined, this function breaks down the state element's bit range into contiguous `BitSegment`s. Each segment is marked as either a constant segment (if it corresponds to bits that are always unchanged and match the initial value) or a variable segment (for bits that can change).
- `NarrowTransform` (State Element Transformation): This custom `Proc::StateElementTransformer` is used to apply the narrowing transformation:
  * New State Element: The original state element is replaced by a new `StateElement` with a narrower bit-width, comprising only the variable segments identified in the previous step.
  * `TransformStateRead`: When the narrowed `StateElement` is read by a `StateRead` operation, it is reconstructed into its original, wider form. This is done by concatenating the new, narrower `StateRead` output with `Literal` nodes for all the constant segments. This ensures that downstream logic continues to receive the expected full-width value.
  * `TransformNextValue`: The `next_value` computation is also narrowed. Only the variable segments from the original `next_value` are extracted (using `bit_slice` operations) and concatenated to form the new, narrower `next_value` for the transformed `StateElement`.
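The mask refinement and segment extraction steps can be sketched in Python, assuming fully-known next values for simplicity (the real pass uses ternary and provenance analysis to reason about partially-known bits):

```python
def unchanged_mask(init_bits: int, next_values: list, width: int) -> int:
    """AND together per-update 'unchanged' masks. None models a
    pass-through next_value (all bits trivially unchanged); an int models
    a fully-known next value, whose unchanged bits are those equal to the
    corresponding init bits."""
    mask = (1 << width) - 1
    for nv in next_values:
        if nv is None:
            continue  # pass-through: changes nothing
        mask &= ~(nv ^ init_bits) & ((1 << width) - 1)
    return mask

def extract_segments(mask: int, width: int) -> list:
    """Break [0, width) into contiguous (start, length, is_constant)
    segments according to the unchanged-bits mask."""
    segments, start = [], 0
    for i in range(1, width + 1):
        if i == width or bool(mask >> i & 1) != bool(mask >> start & 1):
            segments.append((start, i - start, bool(mask >> start & 1)))
            start = i
    return segments

# 8-bit element with init=0b1111_0000; one update rewrites exactly the
# init value, so every bit is unchanged.
assert unchanged_mask(0b11110000, [0b11110000], 8) == 0b11111111
# An update flipping bit 0 leaves only bits 1..7 unchanged:
m = unchanged_mask(0b11110000, [0b11110001], 8)
assert m == 0b11111110
assert extract_segments(m, 8) == [(0, 1, False), (1, 7, True)]
```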
Benefits:
- Significant Area Reduction: Directly reduces the number of flip-flops (registers) required for state elements, leading to smaller hardware area.
- Power Efficiency: Fewer registers typically consume less dynamic and static power.
- Improved Optimizability: A narrower state element simplifies its uses and can expose further optimization opportunities for operations that consume it.
Example:
Consider a 128-bit state element foo in a Proc, where a portion of its bits
(e.g., bits 32-63) are known to always hold a specific constant value (e.g.,
0xFFFF_0000) and never change.
// Original IR snippet
package my_package
proc my_proc(foo: bits[128],
             init=0x0000_0000_0000_0000_FFFF_0000_0000_0000) {
// ...
// A next_value operation where bits 32-63 (0xFFFF_0000) are always
// preserved, but other bits might change based on inputs.
next_val_update: bits[128] = some_complex_logic(state_read(foo), input_val)
next (foo, next_val_update)
}
ProcStateProvenanceNarrowingPass would identify that bits 32-63 of foo
are unchanging. It would then transform foo into a narrower state element,
effectively removing those constant bits from its storage:
// Optimized IR snippet (simplified, after narrowing and cleanup)
package my_package
proc my_proc(foo: bits[96],
init=0x0000_0000_0000_0000_0000_0000) { // Foo is now 96 bits
// ...
foo_read_narrow: bits[96] = state_read(foo)
// Reconstruct original 128-bit value by concatenating narrowed state with
// constant bits
foo_reconstructed: bits[128] = concat(
    bit_slice(foo_read_narrow, 32, 64),  // High variable bits (orig. bits 64-127)
    literal(0xFFFF_0000, bits=32),       // Constant bits that were removed (32-63)
    bit_slice(foo_read_narrow, 0, 32)    // Low variable bits (orig. bits 0-31)
)
// ... downstream logic now uses `foo_reconstructed`
// Next value also narrowed
next_val_update_narrowed: bits[96] = some_complex_logic_narrowed(...)
next (foo, next_val_update_narrowed)
}
This reduces the physical register size for foo by 32 bits, leading to a
more compact and efficient hardware implementation while maintaining
functional equivalence.
proc_state_tuple_flat - Proc State Tuple Flattening
Pass which flattens tuple elements of the proc state into their constituent components. Array elements are flattened in a different pass. Flattening improves optimizability because each state element can be considered and transformed in isolation. Flattening also gives the scheduler more flexibility; without flattening, each element in the aggregate must have the same lifetime.
ram_rewrite - RAM Rewrite
Pass that rewrites RAMs of one type to a new type. Generally this will be some kind of lowering from more abstract to concrete RAMs.
Note: What follows was generated by the Gemini LLM. Not human verified.
The RamRewritePass facilitates the transformation of RAM (Random Access
Memory) interfaces within an XLS package. Its primary role is to lower more
abstract RAM models to concrete RAM types that are directly implementable by
target hardware. This is a crucial step in the hardware generation flow.
The pass is driven by RamRewrite configurations, which specify the original
RAM type (from_config), the target RAM type (to_config), and the mapping
between logical channel names (e.g., read_req) and their corresponding
physical channel names in the IR.
Key aspects of the RamRewritePass include:
- RAM Abstraction Levels:
  * `kAbstract`: A high-level, generic RAM model that provides a simplified interface.
  * `k1RW` (One Read/Write Port): A concrete RAM with a single port that can perform either a read or a write operation per cycle. Its request channel typically includes address, data, write enable, and read enable signals bundled together.
  * `k1R1W` (One Read, One Write Port): A concrete RAM with separate, independent read and write ports, enabling simultaneous read and write operations.
- Channel Rewiring and Repacking: The core function of the pass is to replace existing `send` and `receive` operations associated with the `from_config` RAM channels with new `send` and `receive` operations connected to the `to_config` RAM channels. This often necessitates "repacking" the data payloads to match the new channel types.
  * Repacking for `send` operations: The `RepackPayload` function transforms the structure of an old `send` operation's operand to match the expected input format of the new RAM channel. This can involve adding literal values (e.g., `write_enable` or `read_enable` bits for `k1RW` RAMs), dropping unused fields, or simply passing through data if the formats are compatible.

    Example (Abstract Read Request to 1RW Request): An abstract read request might be a tuple `(address, mask)`. When rewritten to a `k1RW` RAM, its request channel expects `(address, data, write_mask, read_mask, write_enable, read_enable)`. The pass will construct a new tuple, filling `data` and `write_mask` with zero literals, setting `write_enable` to `0b0` and `read_enable` to `0b1`.

      // Original abstract read request send
      send_abstract_read_req: token = send(tok, (addr, mask), channel=abstract_read_req)
      // Transformed 1RW request send (simplified)
      send_1rw_req: token = send(tok, (addr, data, wmask, mask, we, re), channel=1rw_req)

  * Repacking for `receive` operations: Similarly, the output of the new `receive` operation (which adheres to the `to_config` channel type) is repacked to conform to the expected output format of the original `from_config` channel. This ensures that downstream logic that consumes the received data continues to function correctly.
- Channel Creation and Removal: The pass dynamically creates new channels that correspond to the `to_config` RAM type and subsequently removes the original channels associated with the `from_config` RAM type. This ensures a clean transition to the new RAM interface.
- Proc-Scoped Channels: The pass properly handles both global and proc-scoped channels. If a `proc_name` is specified in the `RamRewrite` configuration, the channels are treated as local to that particular proc.
- Error Handling: The pass includes robust checks for potential issues such as type mismatches (e.g., inconsistencies in address width between `from_config` and `to_config`) and unsupported transformation scenarios, returning informative errors where applicable.
This pass is a critical element in the XLS compilation flow, enabling the compiler to generate optimized hardware for diverse RAM architectures while preserving the functional correctness of the original design.
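The repacking step above can be sketched in Python. The payload shapes and the helper name below (`AbstractReadReq`, `Ram1RWReq`, `repack_abstract_read_to_1rw`) are hypothetical illustrations of the tuple layouts described in the example, not the actual XLS `RepackPayload` API:

```python
from collections import namedtuple

# Hypothetical payload shapes; the field names follow the description
# above, not the actual XLS RamRewrite implementation.
AbstractReadReq = namedtuple("AbstractReadReq", ["addr", "mask"])
Ram1RWReq = namedtuple("Ram1RWReq", [
    "addr", "data", "write_mask", "read_mask", "write_enable", "read_enable"])

def repack_abstract_read_to_1rw(req):
    """Repack an abstract read request into a 1RW request tuple.

    Unused write-side fields are filled with zero literals; the enable
    bits mark the transaction as read-only.
    """
    return Ram1RWReq(
        addr=req.addr,
        data=0,              # zero literal: nothing is written
        write_mask=0,        # zero literal: no write lanes enabled
        read_mask=req.mask,  # the original read mask passes through
        write_enable=0,      # 0b0: not a write
        read_enable=1,       # 0b1: a read
    )

req = repack_abstract_read_to_1rw(AbstractReadReq(addr=5, mask=0xF))
print(req)
```

The symmetric direction (repacking the `receive` response) drops or zero-fills fields in the same way.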
reassociation - Reassociation
Reassociates associative operations to reduce delay by transforming chains of operations to a balanced tree of operations, and gathering together constants in the expression for folding.
Note: What follows was generated by the Gemini LLM. Not human verified.
The ReassociationPass is an optimization pass in XLS that reorders
associative operations such as addition (add), subtraction (sub), and
signed/unsigned multiplication (smul, umul). Its primary objectives are
to reduce the critical path delay of the synthesized hardware and to
facilitate further constant folding by grouping literal operands.
This pass is crucial for transforming deep, unbalanced chains of operations
(e.g., (((A + B) + C) + D)) into more balanced tree structures
(e.g., ((A + B) + (C + D))). It also effectively gathers together constant
operands within an associative expression, making them amenable to subsequent
constant folding optimizations.
Key Concepts:
- Associativity: The pass exploits the associative property of operations like addition and multiplication. This property states that for an associative operation `op`, `(A op B) op C` is mathematically equivalent to `A op (B op C)`. Reassociation leverages this to restructure the computation graph.
- Critical Path Delay Reduction: A deep, unbalanced chain of operations in the IR directly translates to a long critical path in the hardware, which limits the maximum clock frequency. By rebalancing these chains into a shallower tree, the pass effectively reduces the overall critical path delay.
- Constant Folding Facilitation: Grouping constant operands together within an associative expression allows the `ConstantFoldingPass` to combine them into a single literal. This simplifies the expression and reduces dynamic computation.
- Overflow Handling: The pass is designed to carefully consider potential overflow conditions, particularly for signed arithmetic and operations involving different bit widths. This ensures that the semantic correctness of the original computation is rigorously preserved during reassociation.
How it Works:
- `AssociativeElements` Class: This internal helper class is central to the pass's operation. For each node identified as part of an associative chain, it recursively collects and models:
  * `variables_`: Non-constant leaf nodes (e.g., parameters or results of other complex expressions).
  * `constants_`: Constant leaf nodes (i.e., `literal` operations).
  * `op_`: The primary associative operation (`Op::kAdd`, `Op::kSMul`, etc.) governing the combination of these elements.
  * `needs_negate`: A boolean flag indicating if a variable needs to be logically negated, typically arising from subtraction operations (e.g., `A - B` is conceptually treated as `A + (-B)`).
  * `overflows_`: A flag indicating whether the sub-expression represented by this `AssociativeElements` object can potentially overflow at its current bit-width. This information is crucial for safe combination.
  * `depth_`: Tracks the depth of the longest chain of associative operations leading to this particular node.
- `OneShotReassociationVisitor`: This visitor traverses the IR in a single pass to compute and collect the `AssociativeElements` for each relevant node. It includes logic to handle various operations:
  * Associative Operations (`Add`, `Sub`, `SMul`, `UMul`): For these operations, it recursively combines the `AssociativeElements` of its operands, managing negation for subtraction and tracking potential overflow conditions.
  * `Neg`: Handles negation by logically flipping the `needs_negate` flag of its operand's `AssociativeElements` representation.
  * Extensions (`ZeroExtend`, `SignExtend`) and `Concat` operations that effectively act as `ZeroExtend`: These operations can widen a value without fundamentally changing its underlying numerical value. The pass attempts to "see through" these by propagating the `AssociativeElements` of the inner operand if no overflow occurs and signedness compatibility is maintained.
- `ReassociationCache`: This caching mechanism stores the computed `AssociativeElements` for each node. This avoids redundant computations during iterative passes and helps ensure that analyses are kept up-to-date even after transformations have altered the graph structure.
- `ReassociateFunction` (Main Loop):
  * Fixed-Point Iteration: The pass operates in a fixed-point manner. It identifies nodes that are prime candidates for reassociation and repeatedly applies the transformation until no further beneficial changes can be made across the entire function.
  * Candidate Selection: Nodes are selected for reassociation if they meet specific criteria, including:
    * Not already a leaf node (i.e., they are part of a larger associative chain).
    * Not composed entirely of constants (these are handled by the separate `ConstantFoldingPass`).
    * Having users that require materialization (i.e., the reassociation is not trivially hidden by other optimizations).
    * Their current depth in the IR graph is greater than the minimum possible depth for a balanced tree of its elements, or they contain multiple constants that can be meaningfully grouped.
  * `ReassociateOneOperation`: This function is responsible for actually reconstructing the selected associative operation into a balanced tree:
    * Grouping Constants: It first groups all constant operands together to be combined by `ConstantFoldingPass`.
    * Balancing Tree: It then constructs a balanced binary tree using the variable elements, minimizing the depth of the overall computation. The tree is typically biased to the left to ensure a consistent output structure, which can further aid subsequent Common Subexpression Elimination (CSE).
    * Negation and Extension Handling: It correctly applies necessary negations and extensions (sign-extend/zero-extend) during the reconstruction process to strictly preserve semantic equivalence.
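The balanced-tree construction at the heart of the rebalancing step can be sketched in Python. This is a simplified illustration of the idea (combine operands pairwise with a left-biased split), not the XLS implementation:

```python
def balanced_tree(operands, op):
    """Combine a list of operands into a balanced binary tree.

    `op` is a two-argument associative combiner. The split is biased to
    the left, mirroring the consistent output structure described above.
    """
    assert operands
    if len(operands) == 1:
        return operands[0]
    mid = (len(operands) + 1) // 2  # left-biased split point
    return op(balanced_tree(operands[:mid], op),
              balanced_tree(operands[mid:], op))

def depth(expr):
    """Depth of a nested ('add', lhs, rhs) expression tree; leaves are 0."""
    if not isinstance(expr, tuple):
        return 0
    return 1 + max(depth(expr[1]), depth(expr[2]))

add = lambda l, r: ("add", l, r)

# A left-leaning chain ((a + b) + c) + d has depth 3 ...
chain = add(add(add("a", "b"), "c"), "d")
# ... while the rebalanced tree (a + b) + (c + d) has depth 2.
tree = balanced_tree(["a", "b", "c", "d"], add)
print(depth(chain), depth(tree))  # 3 2
```

For `n` operands, the chain has depth `n - 1` while the balanced tree has depth `ceil(log2(n))`, which is where the critical-path reduction comes from.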
Benefits:
- Reduced Critical Path: By transforming deep chains of operations into balanced trees, the pass significantly reduces the critical path delay in the generated hardware, allowing for higher clock frequencies.
- Improved Constant Folding: Grouping constant operands together within associative expressions enables `ConstantFoldingPass` to reduce them to single literals, thereby simplifying the IR and reducing dynamic computation.
- Enhanced IR Structure: Creates a more canonical and optimized structure for associative expressions, which can benefit subsequent optimization passes by providing a clearer and more efficient representation of the computation.
Example (Rebalancing additions): Consider a deep, left-leaning chain of addition operations:
// Original IR snippet (deep left-leaning chain)
fn deep_add(a: bits[32], b: bits[32], c: bits[32], d: bits[32]) -> bits[32] {
add.1: bits[32] = add(a, b)
add.2: bits[32] = add(add.1, c)
ret add.3: bits[32] = add(add.2, d)
}
The ReassociationPass would transform this into a balanced tree structure:
// Optimized IR (after ReassociationPass and a subsequent DCE pass)
fn deep_add(a: bits[32], b: bits[32], c: bits[32], d: bits[32]) -> bits[32] {
add.left: bits[32] = add(a, b)
add.right: bits[32] = add(c, d)
ret add.final: bits[32] = add(add.left, add.right)
}
This transformation reduces the maximum depth of the additions from 3 to 2, directly improving potential timing performance.
recv_default - Receive default value simplification
Optimization which removes useless selects between the data value of a conditional or non-blocking receive and the default value of the receive (all zeros).
Note: What follows was generated by the Gemini LLM. Not human verified.
The ReceiveDefaultValueSimplificationPass is an optimization pass tailored
for Procs in XLS. Its purpose is to identify and eliminate redundant select
or priority_select operations that are used to explicitly handle the
default zero value produced by conditional (receive_if) or non-blocking
(receive_non_blocking) receive operations when no data is available.
In XLS, when a conditional or non-blocking receive operation does not
successfully yield data (either its predicate is false, or it's non-blocking
and no data is present on the channel), the data output of the receive
operation defaults to an all-zeros value of the appropriate type. It's a
common pattern to see IR where this default zero behavior is explicitly
reinforced by a select or priority_select operation.
The pass targets two primary patterns:
- Conditional Receive (`receive_if`) Pattern: If a `receive_if(token, predicate, channel_name)` operation has its data output (`tuple_index(receive_op, 1)`) fed into a `select` operation like `select(predicate, cases=[literal(0), receive_data_output])`, this `select` is redundant. When `predicate` is false, `receive_if` already produces zero data, and the `select` chooses `literal(0)`. When `predicate` is true, `receive_if` produces the actual `receive_data_output`, and the `select` chooses `receive_data_output`. In both cases, the `select` simply passes through the effective `receive_data_output`.
- Non-Blocking Receive (`receive_non_blocking`) Pattern: A `receive_non_blocking(token, channel_name)` operation typically returns a tuple `(token, data_output, valid_bit)`. If the `data_output` (`tuple_index(receive_op, 1)`) is subsequently fed into a `select` like `select(valid_bit, cases=[literal(0), receive_data_output])`, this `select` is also redundant. When `valid_bit` is false, `receive_data_output` is implicitly zero, and the `select` chooses `literal(0)`. When `valid_bit` is true, `receive_data_output` is the actual received data, and the `select` chooses `receive_data_output`. Again, the `select` is effectively a passthrough.
The ReceiveDefaultValueSimplificationPass identifies these specific
redundant patterns and replaces the select or priority_select node
directly with the actual data output from the receive operation, thereby
eliminating unnecessary logic and simplifying the IR.
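The redundancy argument can be checked with a small Python model. `receive_if` and `sel` below are simplified models of the IR semantics described above (conditional receive defaulting to zero, two-way select), not XLS APIs:

```python
def receive_if(pred, data):
    """Model of a conditional receive: yields the channel data when the
    predicate holds, otherwise the all-zeros default value."""
    return data if pred else 0

def sel(selector, cases):
    """Model of a two-way select: cases[0] when the selector is false."""
    return cases[1] if selector else cases[0]

# The explicit select around the receive's data output is a no-op:
# when pred is false the receive already produced 0, and when pred is
# true the select passes the data through unchanged.
for pred in (False, True):
    recv_data = receive_if(pred, data=42)
    assert sel(pred, [0, recv_data]) == recv_data
print("select is redundant for both predicate values")
```

The same argument applies verbatim with `valid_bit` in place of `pred` for the non-blocking pattern.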
Example (Conditional Receive with redundant select):
// Original IR snippet
chan in(bits[32], kind=streaming, ops=receive_only, flow_control=ready_valid)
proc p(pred: bits[1], s: bits[32], init={1, 42}) {
tkn: token = literal(value=token)
// data is 0 if pred is false
receive.1: (token, bits[32]) = receive(tkn, predicate=pred, channel=in)
receive_data: bits[32] = tuple_index(receive.1, index=1)
// This select is redundant: it effectively always yields 'receive_data'
select.2: bits[32] = sel(pred, cases=[literal(0), receive_data])
next (pred, select.2)
}
After ReceiveDefaultValueSimplificationPass:
// Optimized IR snippet
chan in(bits[32], kind=streaming, ops=receive_only, flow_control=ready_valid)
proc p(pred: bits[1], s: bits[32], init={1, 42}) {
tkn: token = literal(value=token)
receive.1: (token, bits[32]) = receive(tkn, predicate=pred, channel=in)
receive_data: bits[32] = tuple_index(receive.1, index=1)
// select.2 has been removed; receive_data is used directly
next (pred, receive_data)
}
This simplification reduces the number of nodes in the IR, contributing to smaller and potentially more efficient hardware implementations.
resource_sharing - Resource Sharing
Note: What follows was generated by the Gemini LLM. Not human verified.
The ResourceSharingPass is a critical optimization pass in XLS that aims to
reduce hardware area by sharing common computational resources (e.g., adders,
multipliers) among different parts of the design. This is achieved by
identifying mutually exclusive operations—operations that are guaranteed not
to execute at the same time—and then transforming the IR to allow them to
share the same physical hardware resource through multiplexing.
The pass is configured with a ProfitabilityGuard enum, which determines the
heuristic used to select which sharing opportunities to exploit:
- `kInDegree`: Prioritizes sharing based on the in-degree of nodes in a conceptual "folding graph." This heuristic favors nodes that are potential "destinations" for folding, especially those with more distinct "source" operations that can be multiplexed into them. This approach is often effective for asymmetric folding scenarios where one complex operation can subsume several simpler ones.
- `kCliques`: Selects sharing opportunities based on identifying cliques (groups of mutually exclusive operations) in the folding graph. This approach is particularly suitable for symmetric folding scenarios where multiple operations of similar complexity can all share a single resource.
- `kRandom`: Randomly selects a subset of potential folding actions. This is typically used for testing, fuzzing, or exploration purposes rather than for deterministic optimization.
- `kAlways`: Attempts to perform all possible resource sharing that is identified as legal. This is generally used for aggressive optimization benchmarking or during specific testing phases.
The core mechanism of ResourceSharingPass involves several stages:
- Mutual Exclusion Analysis:
  * `ComputeMutualExclusionAnalysis`: This is a central function that identifies pairs or groups of operations that are mutually exclusive. It builds a "mutual exclusivity relation" between nodes, proving that two operations are guaranteed not to be active simultaneously.
  * It leverages `NodeForwardDependencyAnalysis` to trace dataflow dependencies and `BddQueryEngine` (Binary Decision Diagrams) for precise logical analysis. The `BddQueryEngine` is critical for proving that a particular operation's value does not influence another under specific conditions (e.g., when a particular arm of a `select` operation is chosen).
  * The analysis primarily focuses on `PrioritySelect` operations, as these explicitly define conditional execution paths, which naturally create clear mutual exclusivity between their cases. The analysis is conservative and avoids making assumptions through `Invoke` nodes before function inlining to ensure correctness.
- Folding Graph Construction: Based on the results of the mutual exclusivity analysis, a `FoldingGraph` is constructed. This graph represents potential "folding actions," where a "source" operation can be "folded" into a "destination" operation, implying that they can share the same underlying hardware resource. This sharing is controlled by a `select` operation whose condition is derived from the mutual exclusivity proof.
- Profitability Estimation:
  * The pass utilizes `AreaEstimator` and `DelayEstimator` tools to assess the profitability of each potential folding action. This involves a cost-benefit analysis.
  * `EstimateAreaForSelectingASingleInput`: Calculates the area overhead associated with adding multiplexers (`select` logic) required to steer the inputs to a shared resource.
  * `EstimateAreaForNegatingNode`: Estimates the area cost if an operand needs to be negated (e.g., when folding a `sub` operation into an `add` operation that operates on the negated operand).
  * The `area_saved` for a folding action is calculated as the estimated area of the "from" node minus the area overhead of the multiplexers and any additional required logic.
  * `DelayAnalysis` is used to consider the impact of sharing on the critical path delay. The `kMaxDelaySpread` constant defines an acceptable limit for the increase in delay due to multiplexing.
- Selection of Folding Actions:
  * `SelectFoldingActions`: This function, guided by the chosen `ProfitabilityGuard` heuristic, selects a subset of the legal and profitable folding actions to perform.
  * It prioritizes actions that offer significant area savings while keeping delay impacts within acceptable bounds.
  * The selected actions are sorted (e.g., by descending area savings) and then "legalized" (`LegalizeSequenceOfFolding`) to prevent overlapping or conflicting transformations from being applied.
- IR Transformation: Once a final set of `NaryFoldingAction`s is selected, the IR is transformed to implement the sharing. This involves:
  * Creating new `select` or `priority_select` operations (or chains of them) to multiplex the inputs of the shared resource based on the conditions derived from the original mutual exclusivity.
  * Replacing the original "from" operations with the output of the newly created multiplexing structure, which then feeds into the "to" (shared) operation.
  * Removing the now redundant "from" operations.
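The input-multiplexing transformation can be checked with a small Python model. `priority_sel` below is a simplified one-bit model of the IR operation, and the two functions mirror the before/after shapes of the multiplier example that follows:

```python
def priority_sel(selector, cases, default):
    """Model of a one-bit priority select: cases[0] when the selector
    bit is set, otherwise the default value."""
    return cases[0] if selector else default

def original(selector, a, b, c, d):
    # Two multipliers, only one of which is ever selected per cycle.
    mul0 = a * b
    mul1 = c * d
    return priority_sel(selector, cases=[mul0], default=mul1)

def shared(selector, a, b, c, d):
    # Multiplex the operands instead, so a single multiplier suffices.
    lhs = priority_sel(selector, cases=[a], default=c)
    rhs = priority_sel(selector, cases=[b], default=d)
    return lhs * rhs

for sel_bit in (0, 1):
    assert original(sel_bit, 3, 5, 7, 11) == shared(sel_bit, 3, 5, 7, 11)
print("shared multiplier matches the original for both selector values")
```

The trade is two wide multipliers plus one result mux for one multiplier plus two operand muxes, which is profitable whenever the multiplier's area dominates the mux overhead.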
Example:
Consider two umul operations, mul0(a, b) and mul1(c, d), which are
mutually exclusive (e.g., they are in different arms of a priority_select).
The ResourceSharingPass might identify this and transform the IR to use a
single shared umul hardware resource:
// Original IR snippet (simplified)
selector: bits[1] = ...
mul0: bits[32] = umul(a, b, width=32)
mul1: bits[32] = umul(c, d, width=32)
ret result: bits[32] = priority_sel(selector, cases=[mul0], default=mul1)
After ResourceSharingPass (conceptual representation, as actual IR is more
complex with explicit multiplexing):
// Optimized IR snippet (simplified)
selector: bits[1] = ...
// Multiplexed inputs to the shared multiplier
input_a_mux: bits[32] = priority_sel(selector, cases=[a], default=c)
input_b_mux: bits[32] = priority_sel(selector, cases=[b], default=d)
// Shared multiplier operation
umul_shared: bits[32] = umul(input_a_mux, input_b_mux, width=32)
ret result: bits[32] = umul_shared // The shared multiplier is now the result
This pass is vital for achieving high-quality hardware designs by efficiently utilizing available hardware resources, particularly in designs with significant conditional logic.
scheduling-opt - scheduling opt passes
Passes performed immediately before starting scheduling.
These optimizations are used to fix up any changes the mutual-exclusion and proc-state-legalization passes make before the initial scheduling.
TODO(allight): We might wish to move these into opt only. sched running optimization passes is an artifact of how these tools evolved which no longer makes too much sense. To do this will require making a decision on how to handle 'mutual-exclusion opt' however.
Options Set
Min opt level: 1
Invoked Passes
select_lifting - Select Lifting
Pass which replaces the pattern `v = sel(c, array[i], array[j], ...)`, where all cases of the select reference the same array, with the code `z = sel(c, i, j, ...); v = array[z]`.
Note: What follows was generated by the Gemini LLM. Not human verified.
The SelectLiftingPass is an optimization pass in XLS that restructures
select and priority_select operations to improve hardware efficiency.
It identifies scenarios where a select's cases are all derived from a
common "shared" operation (such as an array_index or a binary arithmetic
operation) and "lifts" that shared operation out of the select. This
transformation effectively moves the multiplexing (the select) from the
output of the shared operation to its input, which often leads to
significant hardware savings and can sometimes improve performance.
This pass is particularly effective in two main scenarios:
- Lifting `ArrayIndex` Operations: When all cases of a `select` are `array_index` operations on the same array, the pass can lift the `array_index` out of the `select`. It creates a new `select` that chooses between the indices of the original `array_index` operations. The result of this new `select` (the chosen index) is then used in a single, shared `array_index` operation. This replaces multiple array reads followed by a wide data multiplexer with a smaller index multiplexer followed by a single array read, which is generally much more area-efficient.

  ```
  // Original IR snippet
  v = sel(c, array[i], array[j])
  // Optimized IR (after lifting)
  z = sel(c, i, j)
  v = array[z]
  ```
- Lifting Binary Operations: When all cases of a `select` are binary operations of the same kind (e.g., all `add`s, `sub`s, or `mul`s) and share one common operand, the binary operation can be lifted out of the `select`. A new `select` is created to choose between the "other" (non-shared) operands. The result of this new `select` is then used in a single, shared binary operation with the common operand.

  ```
  // Original IR snippet
  v = sel(c, x + a, x + b)
  // Optimized IR (after lifting)
  z = sel(c, a, b)
  v = x + z
  ```
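The equivalence behind the array-index case can be demonstrated with a small Python model. `sel` below is a simplified model of a select with an integer selector, not an XLS API:

```python
def sel(c, cases):
    """Model of a select with an integer selector."""
    return cases[c]

def lifted_array_index(c, array, indices):
    """Lifted form: mux the (narrow) indices, then do one array read."""
    return array[sel(c, indices)]

array = [10, 20, 30, 40]
i, j = 1, 3
for c in (0, 1):
    # Original form: read the array twice, then mux the wide data values.
    original = sel(c, [array[i], array[j]])
    assert lifted_array_index(c, array, [i, j]) == original
print("lifted select matches the original form")
```

The hardware win comes from the mux operating on index-width values instead of data-width values, and from needing only one array read port.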
Profitability and Applicability Guards:
The pass includes several "guards" to ensure that the transformation is both applicable and profitable:
- `CanLiftSelect`: This function checks if a `select` is a candidate for lifting. For `ArrayIndex` lifting, it verifies that all cases are indeed `array_index` operations on the same array and that their index types are compatible. For binary operation lifting, it identifies a common `shared_node` across all cases and ensures the operation type is consistent and supported (e.g., `add`, `sub`, `and`, `or`, `mul`, shifts). It also handles "identity" cases, where one of the `select`'s cases is the shared operand itself.
- `ShouldLiftSelect`: This function determines if a liftable `select` should actually be transformed, based on profitability heuristics:
  * Area Estimation: For binary operations, it performs a simple cost-benefit analysis based on the bit-widths of the original and transformed operations. The transformation is generally considered profitable if the total bit-width of the new, narrower `select` and the shared operation is less than or equal to the total bit-width of the original `select` and all the individual binary operations (if they were single-use).
  * Delay Estimation: If a `delay_model` is provided, the pass uses `CriticalPathDelayAnalysis` to estimate the impact of the transformation on the critical path. Lifting is avoided if it is predicted to increase the latency. This is particularly important for scenarios like converting multiple constant-shifts into a single variable-shift, which is often more expensive in hardware.
  * Heuristics for High-Latency Operations: For high-latency operations like multiplication, lifting is more cautiously applied, especially if it involves identity cases, to avoid potentially degrading performance.
How it Works:
- Worklist-based Iteration: The pass maintains a worklist of `select` and `priority_select` nodes that are candidates for optimization. It iterates through this worklist, applying transformations until no more profitable lifting opportunities can be found.
- Applicability and Profitability Checks: For each `select` in the worklist, it calls `CanLiftSelect` and then `ShouldLiftSelect` to determine if a valid and beneficial transformation exists.
- IR Transformation: If a transformation is approved:
  * `LiftSelectForArrayIndex`: Rewires the IR for `array_index` lifting by creating a new `select` for the indices and a new shared `array_index`.
  * `LiftSelectForBinaryOperation`: Rewires the IR for binary operation lifting by creating a new `select` for the non-shared operands and a new shared binary operation. It correctly handles identity cases by inserting the appropriate identity literal (e.g., `0` for `add`, `-1` for `and`) into the new `select`.
- Cleanup: The original `select` node, now redundant, is marked for deletion, which is typically handled by a subsequent `DeadCodeEliminationPass`. Any new `select` nodes created during the transformation are added to the worklist, allowing for potential further optimizations.
By systematically applying these transformations, SelectLiftingPass plays a
crucial role in optimizing multiplexer-heavy designs, leading to more
efficient use of hardware resources and improved overall performance.
select_merge - Select Merging
Note: What follows was generated by the Gemini LLM. Not human verified.
The SelectMergingPass is an optimization pass in XLS that focuses on
merging consecutive select-like operations (PrioritySelect and
OneHotSelect) into a single, larger select operation. This transformation
can be highly beneficial for hardware synthesis, as it often allows the
synthesis tool to infer more efficient and compact multiplexer structures
from a single, unified select rather than from a cascade of smaller ones.
The core principle of this pass is to identify patterns where the output of
one select operation is used as an input (a case or a default value) to
another select operation. When such a pattern is found, and certain
conditions are met (primarily that the inner select has only one user),
the pass merges the two selects into a single, equivalent select.
How it Works:
The pass operates by traversing the IR and applying merging logic to each
PrioritySelect and OneHotSelect node it encounters. The transformation
logic is tailored to the specific type of select operation:
- Merging `OneHotSelect` Operations:
  * Pattern: `one_hot_sel(selector_A, cases=[..., one_hot_sel(selector_B, cases=[...]), ...])`
  * Transformation: The pass constructs a new, larger `one_hot_sel`.
    * New Selector: The new selector is a `concat` of several components:
      * The original selector bits of `selector_A` that did not correspond to the inner `one_hot_sel`.
      * A new set of selector bits created by performing a bitwise `AND` of `selector_B` with the bit from `selector_A` that originally selected the inner `one_hot_sel`. This effectively qualifies the inner selector with the outer selection condition.
    * New Cases: The new set of cases is a flattened list comprising the original cases of both the outer and inner `one_hot_sel`s, in the correct order.
- Merging `PrioritySelect` Operations:
  * Pattern: `priority_sel(selector_A, cases=[..., priority_sel(selector_B, ...), ...], default=...)`
  * Transformation: Similar to `OneHotSelect`, a new, larger `priority_sel` is constructed.
    * New Selector: The new selector is also a `concat`, but the logic for combining the selectors is more complex to correctly preserve the priority encoding. It involves `AND`ing the inner selector bits with the outer selector bits and also considering the conditions under which the inner `priority_sel` would fall through to its default value.
    * New Cases and Default: The new cases are a flattened list of the original cases. The new default value is typically the default value of the inner `priority_sel` if it was part of the default path of the outer `priority_sel`.
    * Selector Swapping for Single-Bit `PrioritySelect`s: A special heuristic is applied for single-bit `priority_sel`s. If the default value is the one that is a `select`, the pass can effectively negate the selector and swap the case and default values to expose a more direct merging opportunity.
Key Conditions for Merging:
- Single Use: The inner `select` operation must have only one user (the outer `select`). This is a critical constraint that prevents the pass from increasing the amount of logic by duplicating the inner `select`'s computation. If the inner `select` were used elsewhere, merging it would require either duplicating it or introducing complex logic, which would likely be counterproductive.
- Matching `select` Types: The pass is designed to merge `select`s of the same kind (e.g., `OneHotSelect` into `OneHotSelect`, `PrioritySelect` into `PrioritySelect`).
Benefits:
- Improved Hardware Synthesis: Synthesis tools are often better at optimizing a single, larger multiplexer structure than a cascade of smaller ones. Merging `select`s can lead to more efficient hardware with reduced area and potentially better timing.
- IR Simplification: Reduces the depth of the IR graph by collapsing chained `select` operations, making the IR easier for subsequent passes to analyze.
Example (OneHotSelect merging):
Consider a scenario where one one_hot_sel is used as a case in another:
// Original IR snippet
fn f(p0: bits[2], p1: bits[2], x: bits[32], y: bits[32],
z: bits[32]) -> bits[32] {
one_hot_sel.1: bits[32] = one_hot_sel(p0, cases=[x, y])
ret one_hot_sel.2: bits[32] = one_hot_sel(p1, cases=[one_hot_sel.1, z])
}
If one_hot_sel.1 has only one user (one_hot_sel.2), the
SelectMergingPass can merge them. Let's say p1 selects one_hot_sel.1
with its 0-th bit.
// Optimized IR (simplified)
fn f(p0: bits[2], p1: bits[2], x: bits[32], y: bits[32],
z: bits[32]) -> bits[32] {
// New selector combines p0 and p1
new_selector: bits[3] = concat(
bit_slice(p1, start=1, width=1), // Original bit 1 of p1
and(bit_slice(p1, start=0, width=1), bit_slice(p0, start=1, width=1)),
// p1[0] AND p0[1]
and(bit_slice(p1, start=0, width=1), bit_slice(p0, start=0, width=1))
// p1[0] AND p0[0]
)
ret one_hot_sel.merged: bits[32] = one_hot_sel(new_selector,
cases=[x, y, z])
}
This transformation creates a single one_hot_sel that can be more
efficiently synthesized, as it avoids the intermediate multiplexer
represented by one_hot_sel.1.
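The selector arithmetic in this example can be checked with a small Python model. `one_hot_sel` below is a simplified model (OR together every case whose selector bit is set; with a one-hot selector this picks exactly one case), and the merged selector follows the bit layout shown in the optimized IR above:

```python
def one_hot_sel(selector, cases):
    """Model of one_hot_sel: OR of the cases whose selector bit is set."""
    result = 0
    for i, case in enumerate(cases):
        if (selector >> i) & 1:
            result |= case
    return result

def original(p0, p1, x, y, z):
    inner = one_hot_sel(p0, [x, y])
    return one_hot_sel(p1, [inner, z])

def merged(p0, p1, x, y, z):
    # Qualify the inner selector bits (p0) with the outer bit (p1[0])
    # that selected the inner one_hot_sel; p1[1] keeps selecting z.
    bit = lambda v, i: (v >> i) & 1
    new_selector = ((bit(p1, 0) & bit(p0, 0))
                    | ((bit(p1, 0) & bit(p0, 1)) << 1)
                    | (bit(p1, 1) << 2))
    return one_hot_sel(new_selector, [x, y, z])

# Check equivalence over all one-hot selector combinations.
for p0 in (0b01, 0b10):
    for p1 in (0b01, 0b10):
        assert original(p0, p1, 11, 22, 33) == merged(p0, p1, 11, 22, 33)
print("merged one_hot_sel matches the cascade")
```

The model only exercises one-hot selector values, which is the precondition under which `one_hot_sel` semantics (and therefore the merge) are defined.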
select_range_simp - Select Range Simplification
Pass which simplifies selects and one-hot-selects. Example optimizations include removing dead arms and eliminating selects with constant selectors. Uses range analysis to determine possible values.
select_simp - Select Simplification
Pass which simplifies selects and one-hot-selects. Example optimizations include removing dead arms and eliminating selects with constant selectors. Uses ternary analysis to determine possible values.
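The constant-selector case mentioned above can be sketched in Python. This is a hypothetical illustration of the simplification rule (a select whose selector is a known constant collapses to the single reachable case), not the XLS pass's API:

```python
def simplify_constant_selector(selector, cases, default=None):
    """Sketch: with a constant selector, a select collapses to the one
    reachable case, or to the default when the selector is out of range."""
    if selector < len(cases):
        return cases[selector]
    return default

# sel(2, cases=[a, b, c]) with a constant selector is just c; the other
# arms are dead and can be removed.
assert simplify_constant_selector(2, ["a", "b", "c"]) == "c"
# An out-of-range constant selector yields the default value.
assert simplify_constant_selector(5, ["a", "b", "c"], default="d") == "d"
print("constant-selector selects collapse to a single case")
```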
simp - Simplification
Standard simplification pipeline.
This is run a large number of times and avoids many time-consuming analyses.
Invoked Passes
- ident_remove
- const_fold
- dce
- canon
- dce
- basic_simp
- dce
- arith_simp
- dce
- comparison_simp
- dce
- table_switch
- dce
- recv_default
- dce
- select_simp
- dce
- dataflow
- dce
- reassociation
- dce
- const_fold
- dce
- narrow(Ternary)
- dce
- bitslice_simp
- dce
- concat_simp
- dce
- array_untuple
- dce
- dataflow
- dce
- strength_red
- dce
- array_simp
- dce
- cse
- dce
- basic_simp
- dce
- arith_simp
- dce
- narrow(Ternary)
- dce
- bool_simp
- dce
- token_simp
- dce
simp(2) - Max-2 Simplification
Standard simplification pipeline.
Opt level is capped at 2
Options Set
Cap opt level: 2
Invoked Passes
simp(3) - Max-3 Simplification
Standard simplification pipeline.
Opt level is capped at 3
Options Set
Cap opt level: 3
Invoked Passes
simp(>=1,<=2) - Min-1 Max-2 Simplification
Standard simplification pipeline.
Opt level is capped at 2 and skipped if less than 1
Options Set
Min opt level: 1
Cap opt level: 2
Invoked Passes
simplify-and-inline - Iteratively inline and simplify
Inlining segment of the pipeline
This is performed to fixedpoint and each run a single layer of the function hierarchy is inlined away.
Options Set
Run to a fixedpoint.
Invoked Passes
sparsify_select - Sparsify Select
The SparsifySelectPass is a type of range analysis-informed dead code elimination that removes cases from selects when range analysis proves that they can never occur. It does this by splitting a select into many selects, each of which covers a single interval from the selector interval set.
Note: What follows was generated by the Gemini LLM. Not human verified.
The SparsifySelectPass is an optimization pass that utilizes range analysis
to perform a specialized form of dead code elimination on select
operations. Its core function is to identify and remove or simplify cases
within a select instruction that are provably unreachable based on the
possible range of values its selector can take. This leads to reduced
hardware complexity and improved efficiency.
Here's a detailed breakdown of its operation:
- Range Analysis: The pass begins by employing a `PartialInfoQueryEngine` to perform a precise range analysis on the `selector` operand of each `select` operation. This analysis determines the `IntervalSet` – the complete set of possible values – that the `selector` can attain during execution.
- Identification of Dead Cases: If the `IntervalSet` of the `selector` reveals that it can only take on a subset of all possible values (i.e., the size of the `IntervalSet` is less than the total number of cases in the `select`), then certain cases of the `select` are deemed unreachable or "dead code."
- Transformation into a Chain of Selects: The primary transformation involves replacing the original `select` operation with a new, potentially nested, structure composed of multiple `select` operations. Each new `select` in this chain is meticulously designed to cover a specific `Interval` from the `selector_intervals`.
  * The helper function `IntervalsSortedBySize` sorts the reachable intervals by their size (number of points covered), typically processing smaller intervals first. This heuristic aims to produce a more compact or efficient `select` decision tree.
  * For each `Interval`, a new `select` is constructed. This `select` first determines if the value of the original `selector` falls within the bounds of the current `Interval` using comparison operations.
  * If the `selector` is found to be within the `Interval`, a nested `select` (or a direct reference to the corresponding case value if the interval is a single precise value) is used to choose among the relevant cases of the original `select` that correspond to values within this `Interval`.
  * If the `selector` is not within the current `Interval`, control flows to the "next" `select` in the chain. If it's the last interval being considered, it defaults to a zero value (or the original default value of the `select`, if one exists).

  The net result is a decision tree that efficiently evaluates only the cases that are genuinely reachable given the `selector`'s known range.
Example:
Consider a `select` operation where the selector `add.3` is a `bits[4]` type. Although `bits[4]` can represent 16 values (0-15), range analysis might prove that `add.3` can only ever take on values within the interval [8, 11]. The original `select` has 16 cases, one for each possible value from 0 to 15.
// Original IR snippet
fn func(x: bits[2]) -> bits[4] {
zero_ext.1: bits[4] = zero_ext(x, new_bit_count=4)
literal.2: bits[4] = literal(value=8)
// This is the selector for sel.20
add.3: bits[4] = add(zero_ext.1, literal.2)
// ... many literals for the 16 cases (literal.4 through literal.19)
ret sel.20: bits[4] = sel(add.3, cases=[
literal.4, literal.5, literal.6, literal.7, literal.8, literal.9,
literal.10, literal.11, literal.12, literal.13, literal.14,
literal.15, literal.16, literal.17, literal.18, literal.19
])
}
The SparsifySelectPass will recognize that only selector values 8, 9, 10,
and 11 are possible. It will then rewrite sel.20 into a nested structure
that effectively covers only these reachable cases:
// Optimized IR snippet (simplified conceptual representation)
fn func(x: bits[2]) -> bits[4] {
// ... zero_ext.1, literal.2, add.3 remain as before
selector_val: bits[4] = add.3 // The original selector value
ret main_select: bits[4] = sel(
// Check if selector_val is in [8, 11]
and(uge(selector_val, literal(8)), ule(selector_val, literal(11))),
cases=[
// If in range [8, 11], dispatch to a select that handles these
// specific values
sel(sub(selector_val, literal(8)), cases=[
literal(8), // for selector_val == 8
literal(9), // for selector_val == 9
literal(10), // for selector_val == 10
literal(11) // for selector_val == 11
]),
// If not in range, default to 0 (or the original select's default
// if it had one)
literal(0)
]
)
}
This transformation significantly reduces the logic required in the hardware
for the select operation by eliminating the evaluation of impossible
conditions, thereby contributing to smaller and faster hardware designs.
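The semantics of this rewrite can be checked outside the IR. The sketch below is illustrative Python (not XLS code, and not the pass's implementation): it models the 16-case select from the example, with a selector proven to lie in [8, 11], and verifies that the interval-restricted form agrees with the full select over the reachable range:

```python
def full_select(selector, cases, default=0):
    """The original select: one case per possible selector value."""
    return cases[selector] if selector < len(cases) else default

def sparsified_select(selector, cases, lo, hi, default=0):
    """Interval-restricted form produced by the pass (conceptually):
    only the cases inside the proven interval [lo, hi] are kept."""
    if lo <= selector <= hi:
        sub_cases = cases[lo:hi + 1]     # only the reachable cases
        return sub_cases[selector - lo]  # sub(selector, lo) indexes the inner select
    return default                       # provably unreachable: default to zero

# 16 cases as in the example; range analysis proved the selector lies in [8, 11].
cases = list(range(16))
for x in range(4):        # x: bits[2], so selector = x + 8 is always in [8, 11]
    selector = x + 8
    assert full_select(selector, cases) == sparsified_select(selector, cases, 8, 11)
```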
strength_red - Strength Reduction
Replaces operations with equivalent cheaper operations. For example, multiply by a power-of-two constant may be replaced with a shift left.
Note: What follows was generated by the Gemini LLM. Not human verified.
The StrengthReductionPass is a crucial optimization pass in XLS that aims
to improve hardware efficiency by replacing computationally expensive
operations with equivalent but cheaper alternatives. This transformation is
fundamental for reducing the area, delay, and power consumption of the
generated hardware. The pass intelligently identifies patterns where a
complex operation can be implemented using simpler primitives, often
leveraging constant operands or known bit patterns.
The pass operates by traversing the IR and applying a variety of strength
reduction techniques. It utilizes a UnionQueryEngine (combining
StatelessQueryEngine and LazyTernaryQueryEngine) to analyze the
properties of nodes, such as whether a value is fully known (constant), its
leading/trailing zeros, or known sign bits.
Key strength reductions performed by this pass include:
- Replacement of Known Values with Literals: If the `QueryEngine` can prove that a node (especially a non-literal operation) always evaluates to a constant value, it is replaced with a `literal` node. This is a general simplification that can occur across many operation types, reducing dynamic computation to a static value.
- Sinking Operations into `Select`s: If an operation has one operand that is a `select` (or `priority_select`) with all its cases being known constants, and the operation's other operands are also known constants, the operation can be "sunk" into the `select`. The operation is duplicated for each case of the `select`, performed on the constant values, and a new `select` is formed with the results. This often enables further constant folding or narrowing of the operations.

  // Original IR
  v1 := Select(selector, [const1, const2])
  v2 := Add(v1, const3)
  // Optimized IR (conceptually)
  v1_c1 := Add(const1, const3)
  v1_c2 := Add(const2, const3)
  v2 := Select(selector, [v1_c1, v1_c2])

- Additions (`add`):
  * Addition with Zero: `add(x, 0)` is replaced directly by `x`.
  * Addition to OR (No Carry): If `add(A, B)` can produce no carries between any bit positions (i.e., for every bit position `i`, ternary analysis shows that `A[i]` and `B[i]` cannot both be `1` simultaneously), the `add` can be replaced by a bitwise `or` operation, which is typically cheaper.
  * Splitting Adders (No Carry Propagation): At higher optimization levels (`SplitsEnabled`), if the `QueryEngine` can identify a bit position in `add(A, B)` across which a carry provably cannot propagate (e.g., if the sum bit `A[i] | B[i]` is always 0), the adder can be split into two smaller adders: the lower-order and higher-order bits are added independently and the results concatenated. This can reduce the critical path delay.
- Subtractions (`sub`):
  * Subtraction by Zero: `sub(x, 0)` is replaced directly by `x`.
  * Subtracting from All-Ones: `sub(all_ones, x)` is replaced by `not(x)`.
  * Splitting Subtractors (No Borrow Propagation): Similar to adders, at higher optimization levels (`SplitsEnabled`), if a bit position in `sub(A, B)` can be identified across which a borrow provably cannot propagate (e.g., if `A[i]` is always 1 and `B[i]` is always 0), the subtractor can be split into two smaller subtractors, with the results concatenated.
- Bitwise AND with Mask (`and`): `and(x, mask)` where `mask` is a constant with a single contiguous run of set bits (e.g., `0b011100`) can be strength-reduced into a `concat` of zeros, a `bit_slice` of `x` (extracting the bits corresponding to the set bits in the mask), and more zeros. This avoids a full bitwise AND for simple masking operations.
- `Select` to `SignExt` Transformation: A `select` operation that effectively acts as a sign extension (e.g., if the selector is 0 it picks 0, if 1 it picks all ones, extended to a wider width) can be replaced by a `sign_ext` operation. This is common when a 1-bit predicate controls the sign extension of a value.
- `SignExt` to `ZeroExt` Transformation: If the MSB of `x` in `sign_ext(x, ...)` is known to be `0` (as determined by the `QueryEngine`), the `sign_ext` can be replaced with a `zero_ext`, which is generally simpler to implement in hardware.
- `Gate` Operations:
  * If the `condition` of a `gate(cond, data)` is a known `0`, the `gate` is replaced with a `literal` `0`.
  * If the `condition` is a known `1`, the `gate` is replaced with its `data` operand.
  * If the `data` operand is a known `0`, the `gate` is replaced with a `literal` `0` (regardless of the condition).
- Single-bit `Add` and `Ne` to `Xor`: For 1-bit operands, `add(x, y)` and `ne(x, y)` are equivalent to `xor(x, y)` and are transformed accordingly, using the typically cheaper `xor` operation.
- Comparisons against Powers of Two (`uge`, `ult`): Comparisons like `x >= 2^K` or `x < 2^K`, where `2^K` is a power-of-two constant, can be simplified to a comparison of a `bit_slice` of `x` (extracting the relevant leading bits) against `0`.
- `Eq(x, 0)` for `bits[2]` to `Not(Or(bit_slice(x, 0, 1), bit_slice(x, 1, 1)))`: This specifically optimizes equality checks against zero for 2-bit values into a more fundamental Boolean logic expression.
- Arithmetic Operations with a Single Unknown Bit: For expensive arithmetic operations (`smul`, `umul`, `sdiv`, `udiv`, `smod`, `umod`) where one operand is a constant and the other has only one unknown bit (all other bits are known constants), the operation is transformed into a `select` operation. The `select`'s selector is the single unknown bit, and its cases are the two possible results of the arithmetic operation when that unknown bit is 0 or 1. This replaces a complex arithmetic unit with a simple multiplexer between two pre-computed literal results.
The pass is run repeatedly until a fixed point is reached, allowing these local transformations to propagate and enable further simplifications across the entire IR.
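Several of these identities are easy to validate numerically. The sketch below is illustrative Python (not the pass's implementation) that checks three of them on 8-bit values: add-to-OR when no bit position can carry, subtract-from-all-ones as bitwise NOT, and a constant multiply with a single unknown bit rewritten as a two-way select. The bit position `K`, the known bits, and the constant are arbitrary values chosen for the demonstration:

```python
MASK = 0xFF  # model bits[8] arithmetic

# 1. If A and B share no set bits, no carry can occur: add(A, B) == or(A, B).
for a in range(256):
    for b in range(256):
        if a & b == 0:
            assert (a + b) & MASK == a | b

# 2. Subtracting from all-ones is bitwise NOT: sub(all_ones, x) == not(x).
for x in range(256):
    assert (0xFF - x) & MASK == ~x & MASK

# 3. umul with a single unknown bit K: precompute both outcomes and select.
K = 3                    # hypothetical position of the single unknown bit
known = 0b10100001       # all other bits are known constants
const = 13               # the constant operand
case0 = ((known & ~(1 << K)) * const) & MASK  # result if the unknown bit is 0
case1 = ((known | (1 << K)) * const) & MASK   # result if the unknown bit is 1
for bit in (0, 1):
    x = (known & ~(1 << K)) | (bit << K)
    assert (x * const) & MASK == (case0, case1)[bit]
```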
table_switch - Table switch conversion
TableSwitchPass converts chains of Select nodes into ArrayIndex ops. These chains have the form:
  sel.(N)(eq.X, literal.A, literal.B)
  sel.(N+1)(eq.Y, sel.(N), literal.C)
  sel.(N+2)(eq.Z, sel.(N+1), literal.D)
And so on. In these chains, eq.X, eq.Y, and eq.Z must all be comparisons of the same value against different literals.
Current limitations:
- Either the start or end index in the chain must be 0.
- The increment between indices must be positive or negative 1.
- There can be no "gaps" between indices.
- The Select ops have to be binary (i.e., selecting between only two cases).
Note: What follows was generated by the Gemini LLM. Not human verified.
The TableSwitchPass is an optimization pass in XLS that transforms specific
chains of select (or priority_select) nodes into more efficient
array_index operations. This is a crucial optimization for hardware
synthesis, as it converts a series of conditional branches into a direct
table lookup mechanism, which can be implemented more compactly and with
lower delay in hardware, especially when the conditions are based on
comparing a single value against a set of constants.
The pass specifically targets select chains that conform to the following
pattern:
- Chained `Select` or `PrioritySelect` Operations: The pass identifies sequences of `select` or `priority_select` nodes where:
  * The selector of each `select` is a comparison (e.g., `eq` or `ne`) of a common "index" value against a distinct literal "key."
  * The result of one `select` operation feeds into a case (or the default value) of the next `select` operation in the chain, creating a cascaded lookup structure.
  * The "value" chosen by each matching comparison in the chain is a literal.
The pass then conceptually constructs a lookup table (represented as an array
of literal values) where the common "index" value from the select chain is
used to directly access an element in this array, and the elements of the
array are the "value" literals from the original select chain.
How it Works:
- `MatchLinks`: This is the core pattern-matching function. It recursively traverses the `select` chain, extracting essential information about each "link" (each `select` node) in the chain. This information includes the common `index` node, the literal `key` it's compared against, the `value` literal chosen if the comparison is true, and the `next` node in the chain if the comparison is false. It also handles `PrioritySelect` operations where the selector is a `concat` of single-bit selectors, identifying individual bits of the selector as corresponding to specific cases.
- `LinksToTable`: Once a complete chain of `Link`s is identified, this function attempts to convert it into a `Value` that represents a literal array. It maps the `key`s from the `Link`s to corresponding indices in this array, with the associated `value` literals becoming the array elements. The function handles situations where the index space is not fully covered by explicit `Link`s, filling in the gaps with the "else" value (the default or fall-through value from the original chain).
- Replacement: If `LinksToTable` successfully generates a valid literal array, the entire `select` chain (or the part of a `priority_select` chain that can be converted) is replaced by an `array_index` operation. The index for this `array_index` is the common `index` node from the original `select` chain, and the array being indexed is the newly created literal array.
Current Limitations (as described in the code):
- Either the start or end index in the chain must be 0.
- The increment between indices must be positive or negative 1.
- There can be no "gaps" between indices.
- The `Select` ops must be binary (i.e., selecting between only two cases).
Example (simplification of a binary tree of sel operations performing a lookup):
// Original IR snippet
fn main(index: bits[32]) -> bits[32] {
literal.0: bits[32] = literal(value=0)
literal.1: bits[32] = literal(value=1)
literal.2: bits[32] = literal(value=2)
literal.3: bits[32] = literal(value=3)
literal.4: bits[32] = literal(value=4)
eq.10: bits[1] = eq(index, literal.0)
eq.11: bits[1] = eq(index, literal.1)
eq.12: bits[1] = eq(index, literal.2)
eq.13: bits[1] = eq(index, literal.3)
sel.20: bits[32] = sel(eq.10, cases=[literal.0, literal.1])
sel.21: bits[32] = sel(eq.11, cases=[sel.20, literal.2])
sel.22: bits[32] = sel(eq.12, cases=[sel.21, literal.3])
ret sel.23: bits[32] = sel(eq.13, cases=[sel.22, literal.4])
}
The TableSwitchPass would recognize this pattern and convert it into
(conceptually):
// Optimized IR (simplified)
fn main(index: bits[32]) -> bits[32] {
// ...
// Constructed array of values from the selects; 0 is assumed to be the
// default/else value
array_literal: bits[32][5] = literal(value={1, 2, 3, 4, 0})
ret array_index.new: bits[32] = array_index(array_literal, indices=[index])
}
This transformation significantly simplifies the hardware required for such lookup-like structures by replacing complex multiplexing logic with a direct memory access or a simpler decoder for the array index, leading to improved hardware QoR.
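The effect of the rewrite can be modeled directly: the cascaded chain of equality-driven selects computes the same function as indexing the precomputed table. The sketch below is illustrative Python mirroring the example above, where index values 0-3 map to 1-4 and everything else falls through to 0; it assumes the clamping out-of-bounds semantics of `array_index` (an out-of-range index reads the last element):

```python
# The cascaded selects from the original IR, written as Python conditionals.
def select_chain(index):
    result = 0                             # literal.0: the fall-through value
    result = 1 if index == 0 else result   # sel.20: eq(index, 0)
    result = 2 if index == 1 else result   # sel.21: eq(index, 1)
    result = 3 if index == 2 else result   # sel.22: eq(index, 2)
    result = 4 if index == 3 else result   # sel.23: eq(index, 3)
    return result

# The table the pass constructs from the chain's keys and values.
TABLE = [1, 2, 3, 4, 0]

def table_switch(index):
    # Model array_index clamping an out-of-bounds index to the last element,
    # which is why the "else" value is placed at the end of the table.
    return TABLE[min(index, len(TABLE) - 1)]

for i in range(10):
    assert select_chain(i) == table_switch(i)
```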
token_dependency - Convert data dependencies between effectful operations into token dependencies
Pass which turns data dependencies between certain effectful operations into token dependencies. In particular, transitive data dependencies between receives and other effectful ops are turned into token dependencies whenever no such token dependency already exists.
Note: What follows was generated by the Gemini LLM. Not human verified.
For example, consider a scenario where a send operation uses data produced
by a receive operation, but, due to other logic between the send and the
receive, the data produced by the receive is not observable by the send. Other
passes would then be free to remove the send's data dependency (since the
data from the receive is irrelevant). This would leave the send and receive
disconnected, so, without an explicit token dependency, the scheduler might
reorder them, leading to incorrect behavior.
chan test_channel(
bits[32], id=0, kind=streaming, ops=send_receive,
flow_control=ready_valid)
top proc main(__state: (), init={()}) {
__token: token = literal(value=token, id=1000)
receive.1: (token, bits[32]) = receive(__token, channel=test_channel)
tuple_index.2: token = tuple_index(receive.1, index=0)
tuple_index.3: bits[32] = tuple_index(receive.1, index=1)
send.4: token = send(__token, tuple_index.3, channel=test_channel)
after_all.5: token = after_all(send.4, tuple_index.2)
tuple.6: () = tuple()
next (tuple.6)
}
In this example, send.4 uses tuple_index.3 which comes from receive.1.
The pass will identify this data dependency and add tuple_index.2 (the
token from receive.1) as an operand to an after_all operation that also
consumes the token from send.4, thus enforcing the order:
after_all.5: token = after_all(send.4, after_all_new.X)
// where after_all_new.X is effectively the token from receive.1
This ensures that receive.1 completes before send.4 can proceed.
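At its core, the analysis is a reachability question over two graphs: if a receive reaches an effectful node through data edges but not through token edges, a token edge must be added. The sketch below is a hypothetical Python model using plain adjacency sets (not XLS data structures); the node names mirror the example above:

```python
def reachable(edges, start):
    """All nodes reachable from start via the given adjacency map."""
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        for m in edges.get(n, ()):
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return seen

def missing_token_deps(data_edges, token_edges, receives, effectful):
    """(receive, op) pairs connected by data but not by tokens."""
    out = []
    for r in receives:
        data_reach = reachable(data_edges, r)
        token_reach = reachable(token_edges, r)
        for op in effectful:
            if op in data_reach and op not in token_reach:
                out.append((r, op))
    return out

# Mirrors the example: send.4 consumes receive.1's data but not its token.
data_edges = {"receive.1": {"tuple_index.3"}, "tuple_index.3": {"send.4"}}
token_edges = {"receive.1": {"tuple_index.2"}, "tuple_index.2": {"after_all.5"},
               "send.4": {"after_all.5"}}
assert missing_token_deps(data_edges, token_edges,
                          ["receive.1"], ["send.4"]) == [("receive.1", "send.4")]
```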
token_simp - Simplify token networks
Pass that simplifies token networks. For example, if an AfterAll node has operands where one operand is an ancestor of another in the token graph, then the ancestor can be omitted. Similarly, duplicate operands can be removed and AfterAlls with one operand can be replaced with their operand.
Note: What follows was generated by the Gemini LLM. Not human verified.
The TokenSimplificationPass is a crucial optimization pass in XLS that
focuses on simplifying the token network within a function or proc. Token
networks are essential for correctly enforcing ordering constraints between
side-effecting operations (such as send and receive) in a dataflow graph.
This pass aims to reduce the complexity and redundancy in these networks,
leading to more compact and efficient control logic in the generated
hardware.
The pass performs several types of simplifications on token-typed
operations, primarily after_all and min_delay:
- Simplifying `after_all` Operations:
  * Single Operand `after_all`: An `after_all` operation with only one operand (e.g., `after_all(tok)`) is redundant and is replaced directly by its single operand `tok`.

    ```
    // Original IR
    after_all.1: token = after_all(tok)
    next (after_all.1, ...)
    // Optimized IR
    next (tok, ...)
    ```

  * Duplicate Operands in `after_all`: If an `after_all` operation has multiple identical operands (e.g., `after_all(tok, tok, tok)`), the duplicates are removed, leaving only one instance of each unique token.

    ```
    // Original IR
    after_all.1: token = after_all(tok, tok, tok)
    next (after_all.1, ...)
    // Optimized IR
    next (tok, ...)
    ```

  * Nested `after_all` Operations: If an `after_all` operation has another `after_all` as an operand, and all users of the inner `after_all` are also `after_all` operations, they can be collapsed. This effectively flattens the token dependency chain.

    ```
    // Original IR
    after_all.1: token = after_all(tok_a, tok_b)
    after_all.2: token = after_all(after_all.1, tok_c)
    next (after_all.2, ...)
    // Optimized IR
    after_all.new: token = after_all(tok_a, tok_b, tok_c)
    next (after_all.new, ...)
    ```

  * Replacing Overlapping `after_all` Operands: If an `after_all` operation has multiple operands where one operand is an ancestor (directly or indirectly) of another in the token graph, the ancestor token can be removed from the `after_all`'s operand list, because the dependency is already implied by the descendant token. The pass is conservative and avoids analyzing through `invoke` nodes to ensure correctness before function inlining.

    ```
    // Original IR
    send.2: token = send(tok, literal(10), channel=test_channel)
    send.3: token = send(send.2, literal(10), channel=test_channel)  // send.3 depends on send.2
    after_all.4: token = after_all(tok, send.2, send.3)
    next (after_all.4, ...)
    // Optimized IR (simplified to the latest token in the chain)
    // send.2 and tok are removed from after_all.4 because send.3 implies them
    next (send.3, ...)
    ```

- Simplifying `min_delay` Operations:
  * Zero Delay `min_delay`: A `min_delay` operation with a delay of 0 (e.g., `min_delay(tok, delay=0)`) is a no-operation and is replaced by its operand `tok`. Similarly, `min_delay(after_all(), delay=X)` is replaced by `after_all()`, as there are no actual events to delay.

    ```
    // Original IR
    min_delay.1: token = min_delay(tok, delay=0)
    next (min_delay.1, ...)
    // Optimized IR
    next (tok, ...)
    ```

  * Nested `min_delay` Operations: Consecutive `min_delay` operations can be collapsed into a single `min_delay` with the sum of their delays (e.g., `min_delay(min_delay(tok, delay=1), delay=2)` becomes `min_delay(tok, delay=3)`).

    ```
    // Original IR
    min_delay.1: token = min_delay(tok, delay=1)
    min_delay.2: token = min_delay(min_delay.1, delay=2)
    next (min_delay.2, ...)
    // Optimized IR
    min_delay.new: token = min_delay(tok, delay=3)
    next (min_delay.new, ...)
    ```

  * Pushing Down `min_delay` through `after_all`: If all operands of an `after_all` are `min_delay` operations, the smallest common delay can be factored out and applied to the entire `after_all` operation. This can reduce the total number of `min_delay` operations and simplify scheduling.

    ```
    // Original IR
    min_delay.1: token = min_delay(tok_a, delay=1)
    min_delay.2: token = min_delay(tok_b, delay=2)
    after_all.3: token = after_all(min_delay.1, min_delay.2)
    next (after_all.3, ...)
    // Optimized IR (factoring out the common delay of 1)
    min_delay.new: token = min_delay(after_all(tok_a, min_delay(tok_b, delay=1)), delay=1)
    next (min_delay.new, ...)
    ```
This pass iteratively applies these simplifications to the token network until a fixed point is reached. By reducing redundant token dependencies and optimizing delay annotations, the pass contributes to more efficient scheduling and resource utilization in the generated hardware.
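Two of these rewrites are simple enough to sketch as tree rewrites. The illustrative Python below (not the pass's actual data structures) represents token expressions as nested tuples and applies the nested-`min_delay` collapse plus the duplicate-operand and single-operand `after_all` simplifications:

```python
def simplify(node):
    """Recursively simplify a token expression.
    Nodes are ('min_delay', operand, delay) or ('after_all', op1, op2, ...);
    plain strings are leaf tokens."""
    if isinstance(node, str):
        return node
    if node[0] == 'min_delay':
        op, delay = simplify(node[1]), node[2]
        if delay == 0:
            return op                                    # min_delay(tok, 0) -> tok
        if isinstance(op, tuple) and op[0] == 'min_delay':
            return ('min_delay', op[1], op[2] + delay)   # sum nested delays
        return ('min_delay', op, delay)
    if node[0] == 'after_all':
        ops = []
        for op in node[1:]:
            s = simplify(op)
            if s not in ops:                             # drop duplicate operands
                ops.append(s)
        if len(ops) == 1:
            return ops[0]                                # single operand -> operand
        return ('after_all', *ops)
    raise ValueError(node)

assert simplify(('min_delay', ('min_delay', 'tok', 1), 2)) == ('min_delay', 'tok', 3)
assert simplify(('after_all', 'tok', 'tok', 'tok')) == 'tok'
```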
useless_assert_remove - Remove useless (always true) asserts
Pass which removes asserts that have a literal 1 as the condition, meaning they are never triggered. Rewires tokens to ensure nothing breaks.
useless_io_remove - Remove useless send/receive
Pass which removes sends/receives that have literal false as their condition. Also removes the condition from sends/receives that have literal true as their condition.
Note: What follows was generated by the Gemini LLM. Not human verified.
This pass optimizes the IR by eliminating or simplifying I/O operations (send and receive) based on their predicate (condition) values. This helps reduce dead code and simplify the control flow of the hardware.
Specifically, it handles two main cases:
- Removal of Useless Conditional I/O: If a `send` or `receive` operation has a predicate that is a literal `false`, and there are other active I/O operations on the same channel, the useless operation is removed. If it's the only remaining I/O operation on that channel, its data input is replaced with a literal zero to avoid leaving the channel unused.

  Example (SendIf with a literal false predicate):

  send.1: token = send(tkn, data, predicate=literal(value=0))  // this send will never execute
  // ... other operations, potentially another send on the same channel

  This `send.1` would be replaced by `tkn` (its token operand), effectively removing the send operation.

- Simplification of Unconditional I/O: If a `send` or `receive` operation has a predicate that is a literal `true`, the predicate itself is removed. The operation then becomes an unconditional `send` or `receive`.

  Example (ReceiveIf with a literal true predicate):

  receive.1: (token, bits[32]) = receive(tkn, predicate=literal(value=1), channel=my_channel)

  This `receive.1` would be transformed into:

  receive.1: (token, bits[32]) = receive(tkn, channel=my_channel)
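The two rewrites amount to a case analysis on the predicate. A minimal sketch in hypothetical Python follows; `Send` here is a stand-in record for illustration, not an XLS class:

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class Send:
    token: str
    data: str
    predicate: Optional[bool]  # None means unconditional

def simplify_send(op: Send):
    """Returns the replacement: the bare token (send removed) or a Send."""
    if op.predicate is False:
        return op.token                      # never fires: drop it, keep the token
    if op.predicate is True:
        return replace(op, predicate=None)   # always fires: strip the predicate
    return op                                # genuinely conditional: unchanged

assert simplify_send(Send("tkn", "data", False)) == "tkn"
assert simplify_send(Send("tkn", "data", True)) == Send("tkn", "data", None)
```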