Codegen Options
This document outlines some useful knobs for running codegen on an XLS design. Codegen is the process of generating RTL from IR and is where operations are scheduled and mapped into RTL constructs. The output of codegen is suitable for simulation or implementation via standard tools that understand Verilog or SystemVerilog.
Input specification
<input.ir>is a positional argument giving the path to the ir file.--topspecifies the top function or proc to codegen.--codegen_options_proto=...specifies the filename of a protobuf containing the arguments to supply codegen other than the scheduling arguments. Details can be found in codegen_flags.cc--scheduling_options_proto=...specifies the filename of a protobuf containing the scheduling arguments. Details can be found in scheduling_options_flags.cc
Output locations
The following flags control where output files are put. In addition to Verilog, codegen can generate files useful for understanding or integrating the RTL.
--output_verilog_pathis the path to the output Verilog file.--output_schedule_pathis the path to a textproto that shows into which pipeline stage the scheduler put IR ops.--output_schedule_ir_pathis the path to the "IR" representation of the design, a post-scheduling IR that includes any optimizations or transforms during the scheduling pipeline.--output_block_ir_pathis the path to the "block IR" representation of the design, a post-scheduling IR that is timed and includes registers, ports, etc.--output_signature_pathis the path to the signature textproto. The signature describes the ports, channels, external memories, etc.--output_verilog_line_map_pathis the path to the verilog line map associating lines of verilog to lines of IR.--codegen_options_used_textproto_fileis the path to write a textproto containing the actual configuration used for codegen.--block_metrics_fileis the path to write a textproto containing the block metrics of the generated Verilog.
Pipelining and Scheduling Options
The following flags control how XLS maps IR operations to RTL, and if applicable control the scheduler.
--generator=...controls which generator to use. The options arepipelineandcombinational. Thepipelinegenerator runs a scheduler that partitions the IR ops into pipeline stages.
--opt_level=...controls the optimization level to apply when using thepipelinegenerator, to take advantage of any discovered optimization opportunities.
--delay_model=...selects the delay model to use when scheduling. See the page here for more detail.
--clock_period_ps=...sets the target clock period. See scheduling for more details on how scheduling works. Note that this option is optional, without specifying clock period XLS will estimate what the clock period should be.
--pipeline_stages=...sets the number of pipeline stages to use when--generator=pipeline.
--clock_margin_percent=...sets the percentage to reduce the target clock period before scheduling. See scheduling for more details.
--period_relaxation_percent=...sets the percentage that the computed minimum clock period is increased. May not be specified with--clock_period_ps.
--minimize_clock_on_failureis enabled by default. If enabled, when--clock_period_psis given with an infeasible clock (in the sense that XLS cannot pipeline this input for this clock, even with other constraints relaxed), XLS will find and report the minimum feasible clock period if one exists. If disabled, XLS will report only that the clock period was infeasible, potentially saving time.
--recover_after_minimizing_clockis disabled by default. If both this and--minimize_clock_on_failureare enabled, when--clock_period_psis given with an infeasible clock, XLS will print a warning, find and report the minimum feasible clock period (if one exists), and then continue generating Verilog as if this had been the specified clock period.
--minimize_worst_case_throughputis disabled by default. If enabled, when--worst_case_throughputis not specified (or disabled by setting it to 0 or a negative value), XLS will find & report the best possible worst-case throughput of the circuit (subject to all other constraints), and then proceed with codegen using that worst-case throughput.NOTE: If
--clock_period_psis not set, XLS will first optimize for clock speed, and then find the best possible worst-case throughput within that constraint.
--worst_case_throughput=...sets the worst-case throughput bound to use when--generator=pipeline. If set, allows scheduling a pipeline with worst-case throughput no slower than once per N cycles (assuming no stallingrecvs). If not set, defaults to 1.NOTE: If set to 0 or a negative value, no throughput minimum will be enforced.
--dynamic_throughput_objective_weight=...is disabled by default. If set, the scheduler will attempt to optimize for dynamic throughput as well as for area; the value controls how strongly this is prioritized. e.g., if set to 1024.0 (the default value), the scheduler will consider improving the dynamic throughput of one state element by 1 cycle (assuming that all data-dependent feedback paths are equally likely) to be worth adding up to 1024 flops. Only relevant if using the SDC scheduler with --worst_case_throughput set to a value != 1.
--additional_input_delay_ps=...adds additional input delay to the inputs. This can be helpful to meet timing when integrating XLS designs with other RTL. Note that flow-controlled channel operations all have inputs and outputs, so this adds delay to both sends and receives.
--additional_output_delay_ps=...adds additional output delay to the outputs. This can be helpful to meet timing when integrating XLS designs with other RTL. Note that flow-controlled channel operations all have inputs and outputs, so this adds delay to both sends and receives.
--additional_channel_delay_ps=...adds additional delay to operations on specific external channels. The flag takes a comma-separated list ofchannel[:direction]=delaypairs, which means that the operations onchannel(and if specified, in the givendirection[either recv or send]) will be modelled as having an additionaldelaypicoseconds of delay. This can be helpful to meet timing when integrating XLS designs with other RTL. Note that since all flow-controlled channel operations have both inputs & outputs, the actual delay applied will be the sum of this value and the greater ofadditional_input_delay_psoradditional_output_delay_ps, if provided.
--ffi_fallback_delay_ps=...Delay of foreign function calls if not otherwise specified. If there is no measurement or configuration for the delay of an invoked modules, this is the value used in the scheduler.
--io_constraints=...adds constraints to the scheduler. The flag takes a comma-separated list of constraints of the formfoo:send:bar:recv:3:5which means that sends on channelfoomust occur between 3 and 5 cycles (inclusive) before receives on channelbar. Note that for a constraint likefoo:send:foo:send:3:5, no constraint will be applied between a node and itself; i.e.: this means all different pairs of nodes sending onfoomust be in cycles that differ by between 3 and 5. If the special minimum/maximum valuenoneis used, then the minimum latency will be the lowest representableint64_t, and likewise for maximum latency. For an example of the use of this, see this example and the associated BUILD rule.
explain_infeasibilityconfigures what to do if scheduling fails. If set, the scheduling problem is reformulated with extra slack variables in an attempt to explain why scheduling failed.
infeasible_per_state_backedge_slack_poolIf specified, the specified value must be > 0. Setting this configures how the scheduling problem is reformulated in the case that scheduling fails. If specified, this value will cause the reformulated problem to include per-state backedge slack variables, which increases the complexity. This value scales the objective such that adding slack to the per-state backedge is preferred up until total slack reaches the pool size, after which adding slack to the shared backedge slack variable is preferred. Increasing this value should give more specific information about how much slack each failing backedge needs at the cost of less actionable and harder to understand output.
--scheduling_options_used_textproto_fileis the path to write a textproto containing the actual configuration used for scheduling.
--codegen_versionis the version of codegen pipeline to use. Either 2 (refactored codegen), 1 (original codegen path), or 0 for default. Currently default means the 1 (original).
--merge_on_mutual_exclusionruns a mutual-exclusion analysis and attempts to merge any I/O operations on the same channel that can be proven to be mutually exclusive. If disabled, XLS will instead schedule the operations independently (subject to correctness constraints); this can sometimes let the scheduler take advantage of the worst-case throughput bound to improve the scheduled results, or make scheduling possible. Enabled by default.
--output_scheduling_pass_metrics_pathdumps metrics about the scheduling pass pipeline to file as aPassPipelineMetricsProtoproto.dev_tools/pass_metrics_maincan be used to visualize the data.
--output_codegen_pass_metrics_pathdumps metrics about the scheduling pass pipeline to file as aPassPipelineMetricsProtoproto.dev_tools/pass_metrics_maincan be used to visualize the data.
--output_residual_data_pathdumps a CodegenResidualData textproto providing a reference node order and other metadata which can be used to minimize the differences between generated Verilog from different invocations of codegen.
--reference_residual_data_pathaccepts a CodegenResidualData textproto (usually dumped by an earlier invocation of codegen) providing a reference node order and other metadata which can be used to minimize the differences between generated Verilog from different invocations of codegen.
Feedback-driven Optimization (FDO) Options
The following flags control the feedback-driven optimizations in XLS. For now,
an iterative SDC scheduling method is implemented, which can take low-level
feedbacks (typically from downstream tools, e.g., OpenROAD) to guide the delay
estimation refinements in XLS. For now, FDO is disabled by default
(--use_fdo=false).
--use_fdo=true/falseEnable FDO. If false, then the--fdo_*options are ignored.--fdo_iteration_number=...The number of FDO iterations during the pipeline scheduling. Must be an integer >= 2.--fdo_delay_driven_path_number=...The number of delay-driven subgraphs in each FDO iteration. Must be a non-negative integer.--fdo_fanout_driven_path_number=...The number of fanout-driven subgraphs in each FDO iteration. Must be a non-negative integer.--fdo_refinement_stochastic_ratio=...*path_number over refinement_stochastic_ratio paths are extracted and *path_number paths are randomly selected from them for synthesis in each FDO iteration. Must be a positive float \<= 1.0.--fdo_path_evaluate_strategy=...Path evaluation strategy for FDO. Supports path, cone, and window.--fdo_synthesizer_name=...Name of synthesis backend for FDO. Only supports yosys.--fdo_yosys_path=...Absolute path of yosys.--fdo_sta_path=...Absolute path of OpenSTA.--fdo_synthesis_libraries=...Synthesis and STA libraries.--fdo_default_driver_cell=...Cell to assume is driving primary inputs.--fdo_default_load=...Cell to assume is being driven by primary outputs.
Naming
Some names can be set at codegen via the following flags:
--module_name=...sets the name of the generated verilog module.--output_port_name=....sets the name of the output port for functions.- For functions,
--input_valid_signal=...and--output_valid_signal=...adds and sets the name of valid signals when--generatoris set topipeline. The flag--output_valid_signalrequires that--input_valid_signalis set. For pipelined blocks,--output_valid_signalalso requires that the block has a reset signal to avoid garbage from being driving on the output valid port. --manual_load_enable_signal=...adds and sets the name of an input that sets the load-enable signals of each pipeline stage.- For procs,
--streaming_channel_data_suffix=...,--streaming_channel_valid_suffix=..., and--streaming_channel_ready_suffix=...set suffixes to be used on their respective signals in ready/valid channels. For example,--streaming_channel_valid_suffix=_vldfor a channel namedABCwould result in a valid port calledABC_vld. --[no]emit_sv_typessets whether the sv type names set for DSLX structures by the#[sv_type(NAME)]attribute are honored or not.
Reset Signal Configuration
--reset=...sets the name of the reset signal. If not specified, no reset signal is used.--reset_active_lowsets if the reset is active low or high. Active high by default.--reset_asynchronoussets if the reset is synchronous or asynchronous (synchronous by default).--reset_data_pathsets if the datapath should also be reset. True by default.
Codegen Mapping
--use_system_verilogsets if the output should use SystemVerilog constructs such as SystemVerilog array assignments,@always_comb,@always_ff, asserts, covers, etc. True by default.--separate_linescauses every subexpression to be emitted on a separate line. False by default.--max_inline_depth=Nputs a bound on how deeply subexpressions can be nested in a single line. 5 by default; overridden ifseparate_linesis set, which functionally forces this flag to 1.--multi_proccauses every proc to be codegen'd, not just the "top" proc. True by default.--max_trace_verbosity=Nis the maximum verbosity allowed for traces. Traces with higher verbosity are stripped from codegen output. 0 by default.--simulation_macro_name=...sets the name of the Verilog macro used to guard simulation-only constructs.assertion_macro_names=...sets the names of the Verilog macros used to guard assertions. If prefixed with!the polarity of the guard is inverted.
Format Strings
For some XLS ops, flags can override their default codegen behavior via format string. These format strings use placeholders to fill in relevant information.
--gate_format=...sets the format string forgate!ops. Supported placeholders are:-
{condition}: Identifier (or expression) of the gate. -{input}: Identifier (or expression) for the data input of the gate. -{output}: Identifier for the output of the gate. -{width}: The bit width of the gate operation.For example, consider a format string which instantiates a particular custom AND gate for gating:
my_and gated_{output} [{width}-1:0] (.Z({output}), .A({condition}), .B({input}))And the IR gate operation is:
the_result: bits[32] = gate(the_cond, the_data)This results in the following emitted Verilog:
my_and gated_the_result [32-1:0] (.Z(the_result), .A(the cond), .B(the_data));To ensure valid Verilog, the instantiated template must declare a value named
{output}(e.g.the_resultin the example).
--assert_format=...sets the format string for assert statements. Supported placeholders are:-
{message}: Message of the assert operation. -{condition}: Condition of the assert. -{label}: Label of the assert operation. It is an error not to use thelabelplaceholder. -{clk}: Name of the clock signal. It is an error not to use theclkplaceholder. -{rst}: Name of the reset signal. It is an error not to use therstplaceholder.For example, the format string:
{label}: `MY_ASSERT({condition}, "{message}")could result in the following emitted Verilog:
my_label: `MY_ASSERT(foo < 8'h42, "Oh noes!");
--smulp_format=...and--umulp_format=...set the format strings forsmulpandumulpops respectively. These ops perform partial (or split) multiplies. Supported placeholders are:-
{input0}and{input1}: The two inputs. -{input0_width}and{input1_width}: The width of the two inputs -{output}: Name of the output. Partial multiply IP generally produces two outputs with the property that the sum of the two outputs is the product of the inputs.{output}should be the concatenation of these two outputs. -{output_width}: Width of the output.For example, the format string:
multp #( .x_width({input0_width}), .y_width({input1_width}), .z_width({output_width}>>1) ) {output}_inst ( .x({input0}), .y({input1}), .z0({output}[({output_width}>>1)-1:0]), .z1({output}[({output_width}>>1)*2-1:({output_width}>>1)})]) );could result in the following emitted Verilog:
multp #( .x_width(16), .y_width(16), .z_width(32>>1) ) multp_out_inst ( .x(lhs), .y(rhs), .z0(multp_out[(32>>1)-1:0]), .z1(multp_out[(32>>1)*2-1:(32>>1)]) );Note the arithmetic performed on
output_widthto make the two-outputmultpblock fill the concatenated output expected by XLS.
I/O Behavior
--flop_inputsand--flop_outputscontrol if inputs and outputs should be flopped respectively. These flags are only used by the pipeline generator.For procs, inputs and outputs are channels with ready/valid signaling and have additional options controlling how inputs and outputs are registered.
--flop_inputs_kind=...and--flop_outputs_kind=...flags control what the logic around the outputs and inputs look like respectively. The list below enumerates the possible kinds of output flopping and shows what logic is generated in each case:-
flop: Adds a pipeline stage at the beginning or end of the block to hold inputs or outputs. This is essentially a single-element FIFO.-
skid: Adds a skid buffer at the inputs or outputs of the block. The skid buffer can hold 2 entries.-
zerolatency: Adds a zero-latency buffer at the beginning or end of the block. This is essentially a single-element FIFO with bypass.
--flop_single_value_channelscontrols if single-value channels should be flopped.
--add_idle_outputadds an additional output port namedidle.idleis the NOR of:1. Pipeline registers storing the valid bit for each pipeline stage.
1. All valid registers stored for the input/output buffers.
1. All valid signals for the input channels.
- For functions, when
--generatoris set topipeline, optional 'valid' logic can be added by using--input_valid_signal=...and--output_valid_signal=..., which also set the names for the valid I/O signals. This logic has no 'ready' signal and thus provides no backpressure. See also Naming.
- See also Reset Signal Configuration.
--fifo_moduleprovides the name of a Verilog module that will be used to implement FIFOs with data signals between procs in the same network; this defaults to "xls_fifo_wrapper". If passed an empty string, these FIFOs will be materialized using XLS-provided logic (which is not necessarily optimized for PPA). This module must take the following parameters:- Width: a positive integer; the width of the datapath in bits. - Depth: a non-negative integer; the number of entries in the FIFO. - RegisterPushOutputs: a bit; if set, the
push_*output signals must be driven directly from registers. Will never be set ifDepthis 0. - RegisterPopOutputs: a bit; if set, thepop_*output signals must be driven directly from registers. Will never be set ifDepthis less than 2. - EnableBypass: a bit; if set, the storage must be bypassed if the FIFO is empty and a push & pop occur in the same clock cycle, reducing the minimum push-to-pop (cut-through) latency. Necessitates a combinational timing path between the pop & push controllers. Will always be set ifDepthis 0.and provide the following interface:
-
input logic clk, -input logic rst, -output logic push_ready, -input logic push_valid, -input logic [Width-1:0] push_data, -input logic pop_ready, -output logic pop_valid, -output logic [Width-1:0] pop_data.
--nodata_fifo_moduleprovides the name of a Verilog module that will be used to implement FIFOs with no data signal between procs in the same network. If not present, these FIFOs will be materialized using XLS-provided logic (which is not necessarily optimized for PPA). This module must take the following parameters:- Depth: a non-negative integer; the number of entries in the FIFO. - RegisterPushOutputs: a bit; if set, the
push_*output signals must be driven directly from registers. Will never be set ifDepthis 0. - RegisterPopOutputs: a bit; if set, thepop_*output signals must be driven directly from registers. Will never be set ifDepthis less than 2. - EnableBypass: a bit; if set, the storage must be bypassed if the FIFO is empty and a push & pop occur in the same clock cycle, reducing the minimum push-to-pop (cut-through) latency. Necessitates a combinational timing path between the pop & push controllers. Will always be set ifDepthis 0.and provide the following interface:
-
input logic clk, -input logic rst, -output logic push_ready, -input logic push_valid, -input logic pop_ready, -output logic pop_valid.
RAMs (experimental)
XLS has experimental support for using proc channels to drive an external RAM. For an example usage, see this delay implemented with a single-port RAM (modeled here). Note that receives on the response channel must be conditioned on performing a read, otherwise there will be deadlock.
The codegen option --ram_configurations takes a comma-separated list of
configurations in the format ram_name:ram_kind[:kind-specific-configuration].
For a 1RW RAM, the format is
ram_name:1RW:req_channel_name:resp_channel_name:write_comp_name[:latency],
where latency is 1 if unspecified. For a 1RW RAM, there are several
requirements these channels must satisfy:
- The request channel must be a tuple type with 4 entries corresponding to
(addr, wr_data, we, re). All entries must have typebits, andweandremust be a single bit. - The response channel must be a tuple type with a single entry corresponding
to
(rd_data).rd_datamust have the same width aswr_data.
Instead of the normal channel ports, the codegen option will produce the following ports:
{ram_name}_addr{ram_name}_wr_data{ram_name}_we{ram_name}_re{ram_name}_rd_data
Note that there are no ready/valid signals as RAMs have fixed latency. There is an internal buffer to catch the response and apply backpressure on requests if needed.
When using --ram_configurations, you should generally add a scheduling
constraint via --io_constraints to ensure the request-send and
response-receive are scheduled to match the RAM's latency.
Optimization
--gate_recvsemits logic to gate the data value of a receive operation in Verilog. In the XLS IR, the receive operation has the semantics that the data value is zero when the predicate isfalse. Moreover, for a non-blocking receive, the data value is zero when the data is invalid. When set to true, the data is gated and has the previously described semantics. However, the latter does utilize more resource/area. Setting this value to false may reduce the resource/area utilization, but may also result in mismatches between IR-level evaluation and Verilog simulation.
--add_invariant_assertions(default: true) controls whether runtime assertions are emitted in the generated RTL to check IR-level invariants (such as one-hot selectors for one-hot selects). Disabling this flag omits these assertions from the output, which may be desirable for production builds or when such checks are not needed; e.g. auto-generated assertions can sometimes interfere with coverage closure metrics.
--array_index_bounds_checking: With this option set, an out of bounds array access returns the maximal index element in the array. If this option is not set, the result relies on the semantics of out-of-bounds array access in Verilog which is not well-defined. Setting this option totruemay result in more resource/area. Setting this value tofalsemay reduce the resource/area utilization, but may also result in mismatches between IR-level evaluation and Verilog simulation.
--mutual_exclusion_z3_rlimitcontrols how hard the mutual exclusion pass will work to attempt to prove that sends and receives are mutually exclusive. Concretely, this roughly limits the number ofmalloccalls done by the Z3 solver, so the output should be deterministic across machines for a given rlimit.
--default_next_value_z3_rlimitcontrols how hard our scheduling passes will work to prove that state params are fully covered by theirnext_valuenodes, so that we can skip special handling for the case where nonext_valuenode triggers. This is purely an optimization; everything will work correctly even if this is disabled (omitted, or set to -1). Concretely, this roughly limits the number ofmalloccalls done by the Z3 solver, so the output should be deterministic across machines for a given rlimit.
--register_merge_strategycontrols how we merge registers between stages which may be shared. The options areidentitywhich merges registers which can be shared and contain exactly the same value andnonewhich disables register merging. Registers are eligible for merging if the stages they are read in are not simultaneously activatable and the registers are the same type.
Miscellaneous
--randomize_order_seed, if provided, controls the seed used to randomize the order of lines in the output. This is useful for creating multiple equivalent Verilog outputs to exercise the rest of the pipeline.