Legalize Multiple Channel Ops Per Channel

Written on 2023-05-02.

Context

Objective

Whenever an XLS channel has multiple sends/receives, the XLS compiler currently relies on the mutual exclusion pass to merge these operations. If the operations cannot be proven to be mutually exclusive or if they are not in the same token level, it is an error and many legal programs fail to compile. The goal is to compile more programs that have multiple sends/receives on the same channel.

Background

One common way for the mutual exclusion pass to fail is with a pipelined loop:

if (p) {
  send(chan, value0);
} else {
  #pragma hls_pipeline_init_interval 1
  for (int i = 0; i < 4; i++) {
    send(chan, value1);
  }
}

The XLSCC frontend¹ will make the for loop a separate proc with channels that pass necessary state in and out. The "outside" proc will pause forward progress until the loop proc completes, so the two sends are mutually exclusive. However, after proc inlining, the predicates to the sends will be a function of the proc state that requires induction to prove mutual exclusion. To see why this is, it's helpful to understand how proc inlining works. The proc inlining pass creates a new container proc that allows all the inlined procs to tick simultaneously. This is implemented via an "activation network"- a series of boolean values encoded in the proc state that follow the token network of each proc. Each proc has a root activation that corresponds to the proc token parameter- this root activation is initialized as true, and activations flow down the network until something like a waiting blocking receive on an internal channel occurs. When a side-effecting operation does not occur within a container tick, all dependent operations will not be activated and will wait until the next container tick. This allows inlined procs to independently make progress on their execution. It also means that inlined proc ticks are decoupled from container proc ticks.

Channel operations in different procs will have activations be a portion of their predicates, but proving properties about these activations in general requires reasoning across multiple proc ticks. The mutual exclusion pass is not able to prove properties over time and will fail to merge the sends in the example above.

Currently, XLS treats multiple channel operations on the same channel as an error as a design choice to enforce that full throughput at II=1 is achievable. As a result, compilation will fail. It is important to note that although this problem exists for any channel, it is especially common when working with RAMs, especially single-port RAMs. The example above could come from code like

if (p) {
  mem[x] = value0;
} else {
  #pragma hls_pipeline_init_interval 1
  for (int i = 0; i < 4; i++) {
    value1 += mem[(x + 1)%MEM_SIZE];
  }
}

Each memory access gets lowered to a pair of send/receives on a request and response channel. For single-port RAMs, read and write channels will also be merged. It is therefore critical for programs using RAMs that multiple sends/receives work.

Design

Overview

Ultimately, we are asking the mutual exclusion pass to do too much. To merge operations like the sends in the example above, analysis must prove arbitrarily complex properties of the program, including those that change over time. Instead of relying on mutual exclusion analysis, it seems better to allow multiple sends and treat the mutually exclusive operations as a special case that allows for optimization. Concretely, the proposal is:

Do not merge mutually exclusive channel operations before scheduling.
When multiple channel operations occur on a channel, IR lowering should replace the channel for those operations with new internal channels. An adapter proc should be added that accepts multiple channel operations on the new "generated by the compiler pass" side and forwards them to the original IO channel.

1. One adapter variant will be an arbiter. If multiple channel operations occur in a cycle, the pipeline stalls until they all complete. Priority for the adapter will be determined by token level. If operations that are not ordered with respect to each other activate at the same time, it will be an assertion failure. Optionally, the compiler can impose an arbitrary order to unordered operations (following the token ordering where one exists) via a topo sort.

1. Another adapter variant will be a "mutex assumed" variant used only when the channel operations are mutually exclusive. The XLS→outside signals will be OR'd and the outside→XLS signals will be fanned out. An assertion will be added checking that at most one channel operation is active in a cycle².
These adapters should be separate procs and codegen'd as their own Verilog module. This hierarchy is useful for tracking the cost of the arbiter variant and facilitating later replacement of arbiter variants with mutex assumed variants. The MVP can initially inline the adapters until multi-proc codegen is supported.

Minimum Viable Product

The goal is to support the example in the Background section (and similar examples) ASAP. The steps roughly follow:

Write a proc generator for the arbiter variant of adapter.
When there are multiple channel operations on the same channel, a scheduling pass³ will add new channels and an adapter. The pass will move the channel operations' channel ids over to the new channels and connect the adapter's send/recv to the original channel. New channels to send the predicate before the operation and receive a completion after the operation will be added and connected.

1. The adapter will arbitrate between multiple operations in priority order.

2. Priority order will be determined by token dependencies. The adapter will have runtime checks that all active receives are ordered with respect to each other. Optionally, impose an arbitrary order to unordered operations via a topo sort.

3. Channel operations from different procs will be required to be mutually exclusive, enforced by a runtime check.

4. Initially, use proc inlining on the adapter.
After the adapter is added and connected by the pass, scheduling constraints will only apply to the operation in the adapter. The original send/receives will be freely scheduled, although after inlining the predicate sends (before the operation) and completion receives (after the operation) will implicitly order the operations with respect to the scheduling constraint.

Details

This proposal suggests extending the current XLS behavior which are, roughly, "multiple channel operations must be formally proven to be mutually exclusive and on the same token level, and then must be merged into a single operation". There are trade-offs in choosing how far to extend.

Single Channel Operation per Tick with True Predicate

One case is to allow multiple channel operations per proc tick, but require that only one of them have a true predicate. The current XLS behavior requires the operations to be on the same token level and to be formally proven to have mutually exclusive predicates. This could be extended to not require formal proof and instead provide a runtime check (assertion). This corresponds to the mutex assumed adapter variant described earlier and is a very incremental extension of the existing behavior. One problem with assuming all channel operations are mutually exclusive (even if you can't prove it) is that they aren't mutually exclusive. We could make it an error in the interpreter and JIT to have multiple channel operations per tick, but it currently isn't an error and seems potentially useful to allow eventually. There it seems like the mutually exclusive case should be an optimization of the more general case where multiple operations fire, and only applied when desired.

Multiple Channel Operations per Tick with True Predicate

When there are multiple channel operations per tick firing, there is a new problem to consider: how to order the operations. There is no ambiguity when there is a total ordering on operations. It would be safe for the MVP to require a total ordering. However, the XLSCC frontend specifies token dependencies as loosely as possible⁴, so it is a normal occurrence for there not to be a total ordering on channel operations.

If multiple channel operations aren't totally ordered, you can choose a static or dynamic order. On a subset of operations with no ordering with respect to each other, you could imagine choosing randomly or going round robin. However, choosing a static order seems simplest for MVP, and dynamic options seem of little utility. One way to choose this static order is to use the order given by XLS's TopoSort()⁵.

An alternative to worrying about the order is to legalize channel operations by adding a new channel per extra operation. This is equivalent to the current approach without the adapter- instead, the adapter (if needed) becomes the responsibility of the external integrator. This option might be useful in cases where the adapter orders things based on the data (for memory traffic, you could imagine performing all writes before all reads, or vice versa, and forwarding data written in the previous cycle if read in the current cycle).

Channel Operations in Different Procs

Channel operations in different procs have no direct token relationship. There is an ordering imposed by channel operations over control channels, but this relationship is indirect. TopoSort() only operates on one proc or function at a time.

We could require that multiple sends/receives occur within the same proc, although XLSCC produces channel operations in different procs for the example given at the start and many other interesting examples.

We could build a global token graph, where each proc's token parameter has an edge to some root token and each proc's next token feeds into a global next token (analogous to proc inlining). Edges will be added where procs send/receive to each other on internal channels. Note that the analysis may be tricky in cases where multiple parallel send/receives occur on these internal channels. It may be desirable to disallow multiple channel operations on internal channels, or have a variant of channel that disallows multiple channel operations to keep analysis sane.

Scheduling

For MVP, it seems simplest to require all channel operations on a given channel be scheduled in the same cycle. This will sometimes cause scheduling to fail and may interact poorly with II>1 passes, but is a useful starting point. Putting channel ops on the same channel in different cycles could allow for better QoR (especially with II>1) than could be realized if channel ops were required to be merged.

If channel operations on the same channel are scheduled in different cycles, the adapters need to delay earlier channel operations until the last cycle containing one of the multiple channel operations⁶⁷.

Long-Term Enhancements

Multi-Proc Codegen

The adapter procs should ideally be codegen'd as separate Verilog modules, which will require multi-proc codegen.

Optimization for Mutually Exclusive Operations

Post-codegen analysis may be needed to determine if operations are mutually exclusive. For example, mutual exclusion properties may be data dependent and depend on the context a block is instantiated in. Third party formal tools may be useful in proving these properties.

Multi-proc codegen makes it straightforward to switch an adapter from arbitrating to mutex assumed- generate both variants and have a Verilog config set which variant should be used for each instance after the fact.

Arbitrary Scheduling of Multiple Channel Operations

Wavefront matching channel operations across cycles seems to mean that scheduling channel operations in different cycles will not save anything. This is because the extra state you need to delay a channel operation. If analysis can determine that the wavefronts don't need to be matched (in general, this could be data dependent!), then you could remove this extra state.

If wavefronts do need to be matched, care needs to be taken to drive channel operations in the correct order in the presence of stalls.

IR-level testing should run on scheduled IR

The pass described here is a scheduling pass, and hence performed during codegen. The priority chosen by the pass when there is not a total order on the channel operations will impact program behavior, so ideally it would be consistent when running IR JIT.

One way to resolve this is to combine IR opt + codegen into a single driver. That can produce opt IR, scheduled IR, or RTL. Alternatively, you could make scheduling its own distinct step, or move scheduling passes into opt and only do block conversion and RTL conversion in codegen. Regardless, the IR jit should be running the scheduled IR to ensure equivalent program behavior.

Alternatives considered

Improve Proc Inlining

In practice, many of the failures we've seen in identifying mutually exclusive channel operations are a result of proc inlining. Proc inlining moves channel operation predicates into the proc state, but this doesn't seem necessary for many of the examples that fail to compile.

The XLSCC frontend generally produces procs that transfer execution to one proc at a time. They generally look something like:

arguments = receive(input_channel, pred=is_init_state);
state = WORK_STATE;
// do stuff
send(output_channel, pred=is_done_state);
if (is_done_state) {
  state = INIT_STATE;
}

In this case, the proc activation state can be condensed into something like a program counter. Each value of the program counter would correspond to a different inlined proc's activation and channel operations would have their predicates be strengthened with p' = p & (pc == MY_PROC_ID). As a result, channel operations in different inlined procs would be trivially mutually exclusive.

The analysis to determine if this optimization can be performed safely could be as simple as trying it and performing a deadlock analysis. However, even if the optimization is safe, it may not be an optimization- multiple procs ticking in parallel may be desired for performance reasons.

Analysis to check that execution is transferred could look like:

Find places where procs give control away. These are found by unrolling by 2 and looking for all blocking (send, recv) pairs where the receive is downstream of the send in the token graph (with nothing in-between). If the send's predicate implies the receive's predicate, the proc gives control to another proc on the send's channel until it is returned via the receive's channel. If this property holds for every proc→proc channel within a proc, we can say the proc "always extinguishes".
Find procs that come out of initialization "extinguished". There must be at least one proc (generally, the top proc) where this is not true. Other procs that begin with a blocking receive with predicate true on initialization are initialized as "extinguished". Every other side-effecting operation should be downstream of this receive.

The subset of procs that are initialized as extinguished and always extinguish can be inlined as described above.

It's unclear that this approach will allow mutual exclusion to succeed for more complicated scenarios, e.g. multiple sends/recvs in nested loops in different orders. It also seems generally unscalable to rely on passes being aware of mutual exclusion.

It is worth noting that this may be a generically useful improvement to proc inlining. Having predicates that are easier to reason about could assist optimization, and the state representation of proc activation would be more compact.

It is also worth noting that non-blocking channel operations will not be defined as extinguishing. The notion of extinguishing could be generalized to include procs whose state isn't updated observably unless a receive completes. Similarly, in our discussions Remy described some potentially simpler ways of identifying procs that can be run one at a time with a PC. Regardless of what analysis is used to identify situations where proc inlining can execute procs mutually exclusively, the shortcomings of the approach seem to be that 1) proc inlining and other passes need to be aware of how powerful the mutual exclusion pass is, which doesn't scale well; and 2) it’s not clear that this resolves all the cases where mutual exclusion can fail, even in proc inlining (proc inlining adds state to track multiple proc↔proc channel operations, so mutual exclusion could fail within an inlined proc if this obfuscates the predicate).

Use a proof assistant that supports induction

A more powerful proof assistant has the danger of increasing runtime or timing out unpredictably (Z3 is already pushing our boundaries in this regard). Even though the unrolling required to prove mutual exclusion seems bounded in most cases, it isn’t clear what sort of runtime we should expect.

It seems better to treat powerful proof assistants as an infrequent, high-effort final optimization step. For example, in the proposed design the arbiter variants could be replaced with mutex assumed variants if an expensive proof of mutual exclusion can be found.

The example presented here is particular to the XLSCC frontend. DSLX scopes channels to procs, so a similar example in DSLX would not have multiple procs sending/receiving the same channel. Instead, an internal channel would communicate proc→proc and the top proc would forward it to the IO channel. If the top proc manages the send predicate, it is likely that mutual exclusion will be able to merge the channel operations. However, if the child proc manages the predicate, DSLX could have the same problems with proc inlining that XLSCC does.

Of course, non-mutually-exclusive multiple sends/receives are not allowed in either frontend currently and this proposal would enable that for both frontends. ↩
In the general case where channel operations are scheduled in different cycles, this will need to be a temporal assertion. ↩
A "scheduling pass" is a pass that operates on the IR after scheduling but before block conversion. Importantly, we want this pass to run after the mutual exclusion pass has a chance to run and after scheduling for II>1 has occurred. ↩
As of writing, XLSCC does not specify any token relationships, it emits IR with every side effecting operation taking the proc’s root token and gathers all resulting tokens into an after-all for the next token. The TokenDependencyPass uses data dependencies to add minimal token relationships. ↩
It is often desirable for DSLX interpretation, IR interpretation/JIT, and RTL simulation to all have identical output. One reason is to support fast verification at higher levels of abstraction. Then a LEC chain from IR to RTL can prove that the high-level coverage applies to the RTL. TopoSort() (and many other ways of choosing an order) will be unstable across the different levels of abstraction, which is a potential problem. The DSLX interpreter should have a "safe" mode or throw a warning when multiple channel operations happen in a single tick. IR interpretation/JIT should be performed on IR with the token graph reflecting the order chosen by the channel legalization pass. ↩
If you have send(c0) -> recv(c1) -> send(c10) each scheduled in different cycles and recv(c1) blocks until the first send(c0) goes through, you’ll get deadlock. ↩
Like mutual exclusion analysis, a proof that a send in (tick, cycle) (t[n], c[n]) implies the predicate in (t[n-i], c[n+j]) is false could allow for early sends and simpler logic. ↩