Software pipelining is an optimization for improving ILP by overlapping loop iterations. Swing Modulo Scheduling (SMS) is an implementation of software pipelining that attempts to reduce register pressure and generating efficient pipelines with a low compile-time cost.

This implementation of SMS is a target-independent back-end pass. When enabled, the pass runs just prior to the register allocation pass, while the machine IR is in SSA form. If software pipelining is successful, then the original loop is replaced by the optimized loop. The optimized loop contains one or more prolog blocks, the pipelined kernel, and one or more epilog blocks.

The SMS implementation is an extension of the ScheduleDAGInstrs class. We represent loop carried dependences in the DAG as order edges to the Phi nodes. We also perform several passes over the DAG to eliminate unnecessary edges that inhibit the ability to pipeline. The implementation uses the DFAPacketizer class to compute the minimum initiation interval and the check where an instruction may be inserted in the pipelined schedule.

In order for the SMS pass to work, several target specific hooks need to be implemented to get information about the loop structure and to rewrite instructions. This patch implements the target hooks for Hexagon.

The pipeliner should be easily extendable to work with compare-and-branch loops instead of just hardware loops. I assume that may require some additional support from a target. Also, the implementation assumes the target uses the DFA code for scheduling, but I think it could also be changed to work with the scoreboard structure (or some other mechanism).

The SMS algorithm consists of three main steps after computing the minimal initiation interval (MII).

- Analyze the dependence graph and compute information about each instruction in the graph.
- Order the nodes (instructions) by priority based upon the heuristics described in the algorithm.
- Attempt to schedule the nodes in the specified order using the MII.

If all the instructions can be scheduled in the specified order, then the algorithm is successful. Otherwise, we increase the MII by one and try again. When the algorithm is successful, we need to replace the original loop with one or more prolog blocks, the optimized kernel, and one or more epilog blocks. The number of prolog and epilog blocks depends on the number of stages in the scheduled loop. When creating the new blocks, we need to generate new SSA names and generate Phis for the new kernel and each epilog block. The process of creating the new, pipelined loop is quite complex and hard to understand. This part of the code can certainly use some additional work to simplify and to improve the readability.