This paper explores one such an optimization scheme for the Garp Gate Array RISC Processor [HAWA97]. Garp, a Berkeley research effort, is a reconfigurable device which includes a conventional MIPS processor core and coarse-grained FPGA on the same die. Garp's reconfigurable array is designed to support wide-bit arithmetic and logic operations, typically using rows (or parts thereof) of logic blocks as bit slices. Hence Garp's horizontal interconnect is plentiful to support bit shifting, while its vertical interconnect is restricted to a hierarchy of lengths. All vertical wires span some power of two in inter-row distance, the shortest wires spanning 4 rows, and the longest spanning the entire height of the array.
In Garp's sequential timing model, "long" vertical wires which span 16
or more rows require an entire cycle for transmission and buffering.
"Short" vertical wires which span 8 or 4 rows are free of such
restrictions and may comprise data paths which pass through several
logic blocks in one cycle. The start- and end- points of wires are
staggered throughout the array as shown in Figure 1 (the configuration
shown is available identically in each column of logic blocks).
Note that Garp's interconnect is not hierarchic in the sense of
HFPGAs [AGLE94], as it does not geometrically
partition the array into distinct localized parts.
Span-4 wires are the most plentiful and originate from every row. Longer wires are fewer in quantity for any given span and originate only from every 4th, 8th, 16th row, etc. Hence it should be the objective of any logic designer or automated placement/routing tool to use the fast and plentiful span-4 wires whenever possible, or the fast span-8 wires when available. Alternatively, one may avoid vertical wires altogether and communicate via horizontal wires from any row to the next row below it, provided those wires are not already needed for a shifting operation. The use of span-16 or longer vertical wires incurs the penalty of requiring a full cycle and a buffer register at the destination.
Two interrelated problems arise in logic synthesis for an interconnect hierarchy such as Garp's. During logic retiming, it is important to know whether a communications line will be mapped to a long vertical wire, necessitating a cycle delay and a buffer. Once timing has been fixed, there remains the problem of allocating short and long wires from a physically available configuration (as in Fig.1) such that Garp's interconnect timing restrictions are not violated. A circular dependency arises, since in order to minimize data-path delay, the retiming mechanism needs access to interconnect information that is traditionally handled by a placement and routing mechanism.
At present, the suite of automated synthesis tools for Garp is such the logic mapping, retiming, and placement facility (Gama, the Garp Mapper) is entirely separate from the final routing tool (gatoconfig, the Garp Configurator). The logic and placement phase, either a manual design or using Gama, specifies vertical wiring requirements by source row, destination row, and a "short" or "long" attribute. It remains the job of the Configurator program to allocate wires from the physically available pool, given a specified row placement. Because Gama has no detailed model of vertical wire positioning, and because there do not exist wires of arbitrary length between arbitrary rows, Gama defaults to specifying a long wire for any communication that spans more than 5 rows. Although such restrained use of short wires makes the Configurator's job rather easy, the Configurator can do no more than report an error if it fails to allocate enough wires for the given row placement.
The disparity between the mapping, placement, and routing phases can be alleviated by a row-placement engine which places communicating rows as close as possible, so as to use the shortest wires possible. For optimal mapping, such a placement engine should have complete knowledge of a device's physical wire availability, and could give feedback to a logic retiming facility in some kind of iterative mapping optimization. In practice, however, such an iterative process may be too slow for comfort, necessitating some simplifications. One approach is to simplify the placement engine's knowledge of wiring resources so that it can give faster feedback to the retiming facility. Alternatively, one could implement the placement engine as a post-pass optimizer, to affect only placement for a design with fixed timing. The latter method is explored in this paper, with the optimizer being coded into the Configurator.
The objective of the post-pass placement/routing optimizer presented herein is to assist the Configurator's wire allocation phase, namely by moving the rows of a configuration until a suitable set of short wires can be assigned to it. Ideally, such an optimizer would free the mapping and retiming mechanism from worrying about the endpoints of wires, and enable it to confidently specify a short wire for communications spanning 8 (or possibly more) logic rows. In addition, analysis of the optimizer's actions and performance can help determine the potential value of implementing a similar placement engine to interact directly with Gama's retiming facility. If enough of Gama's needlessly long wires can be converted to short wires, then it would be valuable to add a similar optimizer into Gama.
FORMULATION
A precise formulation of the placement optimization problem must
capture the intent that communication between rows is to be
"localized" to use the shortest wires possible. In the case of Garp,
the problem is thankfully reduced to a one-dimensional case, as only
vertical placement of rows is in question. There has been much
success in the literature for placement using min-cut partitioning
techniques [FIMA82],
[HABO96]. Because Garp's interconnect structure
does not naturally partition the array into parts, it is not clear
that partitioning -- including the common recursive formulation
of bipartitioning -- would effect appropriate uniform localization
throughout the array.
Instead, we formulate the placement optimization problem as finding a series of row motions to minimize total vertical wire length. This formulation captures in a very intuitive manner the intent of localization for Garp - to use the shortest wire possible for any communication. The formulation also supports the conversion of length-1 vertical wires into horizontal wires between rows, an optimization which essentially frees up a span-4 vertical wire. Comparing this formulation of the placement optimization problem to min-cut techniques would be an interesting endeavor, though it is beyond the scope of this paper.
The optimization problem uses several simplifying assumptions. First, the array configuration is collapsed into a single representative row. This simplification is appropriate for a one-dimensional optimization problem, where most of the logic blocks on a row are identically- routed as bit slices of a wide-bit operation. This simplification, however, is inaccurate for the case where different columns of a row drive unrelated vertical wires in different directions. Because in Garp a logic block can drive only one vertical wire at a time, the unrelated wires appear in the collapsed model as a single wire emanating in both directions from its source. Although this representative wire will effect valid length minimization for all its constituent wires, its incorrect length could make it seem as an illegally- long short wire. Hence at present, the optimizer ignores such cases and does not guarantee that wires designated "short" will in fact remain span-8 or shorter through row motions.
A second simplification to the optimizer is that it essentially ignores the physical start- and end- points of Garp's wires. The total wire length to be minimized is taken to be the sum of spans of all vertical communication lines, irrespective of the endpoints of physically available wires. This idealization, as well as the simplification described in the previous paragraph, gives the optimizer a much-needed freedom in moving rows of an array configuration through physically illegal stages. However, these simplifications have the undesirable effect of allowing the optimizer to reach a minimum-length configuration that still fails wire allocation. A possible work-around is to attempt wire allocation after every row motion, though this can severely slow the minimization process. In the analyses of this paper, we always allow the optimizer to complete length minimization before attempting to allocate wires. The analyses are pessimistic in that the optimizer may reach an illegal final configuration after missing a legal intermediate configuration, but the analyses capture the minimum performance to be expected from the optimizer.
ALGORITHM
Given a cost function such as total vertical wire length, there are a
variety of approaches to minimizing cost. The analogous graph problem
defines vertices for rows, directed edges for vertical communication
lines, and edges weights by inter-row distance. Row motions correspond
to transformations of edge weights which do not affect connectivity.
However, the function to be minimized is non-trivial (not simply the
sum of edge weights), and the author is not aware of any polynomial
time solutions. Polynomial time approximations for cost minimization
include gradient descent methods, greedy heuristic methods, and
simulated annealing [KIGE83].
The algorithm devised for this paper is a greedy one of order approximately O(N^4) in the number of rows. The cost function to be minimized is the straightforward sum of (idealized) vertical wire lengths. It may be a good idea to weight the length of designated "short" wires more than "long" wires, since the availability of short wires is crucial for timing. However, no weighting is done in this particular implementation, which (as noted above) largely ignores the difference between short and long wires.
The basis of the greedy algorithm is to move contiguous blocks of rows so as to minimize total wire length. The notion of moving blocks of rows is an attempt to capture the natural spatial localization of logic computation. An array configuration tends to have subcomputation regions with dense local communication but only sparse communication to other blocks (in the extreme, a memory access in the Garp array requires no vertical i/o, so a subcomputation may have zero connections to the rest of the array). Hence there can arise cases where moving a block of rows will reduce total wire length when moving a single row will only increase it.
The greedy algorithm attempts to move increasingly large blocks of
rows. At each step, the algorithm performs the best possible
row-block motion with the current block size, i.e. the motion which
reduces total wire length most drastically. The algorithm continues
using the same block size until no length-reducing motions are
possible, at which time it tries the next largest block size. Note
that it is not necessary to check block sizes larger than half the
array's rows, since a motion of any block larger than half corresponds
precisely to a motion in the opposite direction of a block smaller
than half. Figure 2 depicts the downward motion of a block in blue
and the corresponding upward motion of a block in pink. Note that the
motion effectively switches the relative positions of the two blocks.
Pseudo-code for the greedy algorithm follows:
for blockSize = 1 .. (number of rows) / 2
{
repeat until exhaust length-reducing motions at this block size
{
for sourceRow = 1 .. (number of rows)
{
for destRow = 1 .. (number of rows)
{
evaluate the change in total wire length
for moving blockSize rows from sourceRow to destRow
}
}
perform best length-reducing motion discovered above
}
}
Evaluating the change in total wire length for a given row motion (in
the innermost loop) requires examining the length change of every
vertical wire which passes the blocks being switched. Since each
logic cell in Garp can drive only one vertical wire at a time, the
evaluation of a motion can be performed in linear time in the number
of rows, by maintaining and perturbing a list of vertical wire
endpoints. Specifically, the evaluation is linear in the number of
vertical wires used, which may be significantly smaller than the
number of rows. The cost of the repeat-until loop is not obvious,
though experience suggests that it is smaller than linear and is quite
small. Thus the algorithm run-time is between O(N^4) and O(N^5) in
the number of rows.
The number of iterations in the loop structure is actually lower than the pseudo-code suggests, because not all motions are legal. For instance, the destination row cannot be inside the source block of rows being moved. Also, a row that receives input via horizontal wires from the row above it (a fused row) may not be separated from that row by any motion.
Note that this greedy algorithm is not guaranteed to find a global wire-length minimum. The local minimum to which it settles is sensitive to the choice of "best row motion" among motions with equal reductions in wire length.
RESULTS - Optimization of Gama Output
We test the performance of the placement and routing optimizer on array
configurations produced by Gama from three well-known programs:
espresso 2.3, eqntott 9, and gzip 1.2.4. The first two are
SPEC integer benchmarks, while the latter is popular tool for
compression/decompression. These programs represent a good mix of
general purpose applications, featuring boolean logic, addition, loops
structures, and memory accesses. The Suif C compiler and Gama are
used to extract 139 computational kernels from the C source code of
these 3 programs, and to convert them into ".ga" format Garp array
descriptions. The processing done herein includes logic mapping from
C source to Garp logic blocks, as well as placement, but not wire
allocation. The array descriptions are subsequently passed through
the optimizing Garp Configurator for verification and wire allocation.
Of the 139 computational kernels, we find that only 4 array configurations fail wire allocation without the optimizer, and only 1 fails after optimization. Thus we see that the optimizer finds little use in improving the mappability of Gama output. We do nevertheless wish to study the performance of the optimizer in reducing total wire length, as well as in reducing long wires to potentially short ones. Of the 139 kernels, we analyze the 30 largest array configurations, from 14 to 35 rows long, plus one 94 row design.
Reductions in total wire length of up to 82% are seen in the 30
largest configurations, and up to 100% in the smaller configurations.
Note that 100% reduction means that all vertical wire requirements
were converted to use horizontal wires between adjacent rows. The
optimizer typically needs less than 10 row-block motions to settle on
a minimum total wire length, moving mostly single rows at a time, but
occasionally moves 10 or more rows in a single block. The scatter
plot of Figure 3 reveals an approximately linear relationship between
array configuration size and the number of motions performed during
optimization. This finding indicates that the repeat-until loop of
the pseudo-code requires far less than linear time, and that the
optimizer run-time is nearly O(N^4) in the number of rows.
A sample run of the optimizer is shown below, with an ASCII representation of vertical wiring requirements between rows. Rows are listed in physical order, labeled by their original numerical position. Note that all single fanout wires are reduced to horizontal wires, and that the long, high-fanout wires are all reduced to short-wire lengths. A symbol legend for the ASCII graph follows:
+ | signal source |
| | span of wire |
L | signal destination, reached by long wire |
S | signal destination, reached by short wire |
^ | signal destination, reached by horizontal wire |
design is initially mappable
5 of 33 rows fused (15.15%)
initial totalWireLength: 50
v-wire graph:
0: + L
1: ^ |
2: |
3: |
4: L
5: |
6: |
7: |
8: + |
9: ^ + |
10: | |
11: L + |
12: | ^ |
13: | |
14: | |
15: | |
16: | |
17: | | L
18: | L |
19: | | L |
20: | + | |
21: | L | |
22: | + |
23: | +
24: | +
25: | ^ +
26: | ^
27: | +
28: | ^
29: L
30:
31: +
32: ^
performing motion: { srcRow=17 numRowsToMove=1 destRow=24 } saving 7
performing motion: { srcRow=18 numRowsToMove=1 destRow=22 } saving 4
performing motion: { srcRow=13 numRowsToMove=1 destRow=0 } saving 2
performing motion: { srcRow=14 numRowsToMove=1 destRow=0 } saving 2
performing motion: { srcRow=15 numRowsToMove=1 destRow=0 } saving 2
performing motion: { srcRow=16 numRowsToMove=1 destRow=0 } saving 2
performing motion: { srcRow=6 numRowsToMove=1 destRow=0 } saving 1
performing motion: { srcRow=11 numRowsToMove=1 destRow=0 } saving 1
performing motion: { srcRow=17 numRowsToMove=1 destRow=6 } saving 1
performing motion: { srcRow=18 numRowsToMove=1 destRow=6 } saving 1
performing motion: { srcRow=19 numRowsToMove=1 destRow=6 } saving 8
performing motion: { srcRow=11 numRowsToMove=2 destRow=6 } saving 2
performing motion: { srcRow=20 numRowsToMove=2 destRow=0 } saving 2
performing motion: { srcRow=22 numRowsToMove=2 destRow=0 } saving 2
performing motion: { srcRow=24 numRowsToMove=3 destRow=0 } saving 3
performing motion: { srcRow=27 numRowsToMove=4 destRow=22 } saving 1
v-wire graph:
24: +
25: ^ +
26: ^
23: +
17: ^
22: +
19: ^
7:
2:
16:
15:
14:
13:
3:
4: L
21: L
20: +
18: L
0: + L
1: ^
5:
6:
27: +
28: ^
29: L
30: |
8: + |
9: ^ +
10: |
11: L +
12: ^
31: +
32: ^
final totalWireLength: 9 (saved 41, 82.00%)
optimized design is mappable
In order to study how the optimizer behaves for different kinds of
array configurations, we identify several characteristic parameters:
| - Configuration size: | Number of rows |
| - Wire density: | Percentage of rows which output to a vertical wire.
Array configurations with high wire density are more prone to running out of wires during wire allocation. |
| - Fused row density: | Percentage of rows receiving input via horizontal wires from
the row above (hence "fused" to that row).
A high fused row density may hinder optimizer performance, since there are fewer possible row motions. |
Figures 4(a) - 4(c) scatter-plot the percent reduction in total wire length versus each of the above three parameters. We notice a slight positive correlation between configuration size and percent length reduction. This point suggests that larger configurations have more needlessly long wires, perhaps due to long lines between distant computational blocks. Automated row shuffling that would not be intuitive to a manual designer or simple placement tool can have a big effect in such cases. Also, there appears to be a weak negative correlation between fused row density and percent length reduction. As expected, when more rows are fused by horizontal wires and cannot be separated, fewer row motions are possible, and the optimizer cannot do as good a job. Interestingly, there does not appear to be any meaningful correlation between wire density and percent length reduction.
Figures 5(a) - 5(c) scatter-plot the percentage of vertical wires shrunk to a shorter type versus the above three parameters. These plots depict wires shrunk to fit in span-8 and span-4 wires, as well as wires shrunk to length 1 (span 2). Although there are no length-1 vertical wires, most length-1 connections are in fact converted into horizontal wires, saving the use of one vertical wire. Wires shrunk down to span-8 are particularly interesting because they represent long wires that might have been retimed to be short. We see that up to 50% of a configuration's wires can be shrunk from long to span-8, and some may be shrunk further. If these long wires are in the critical path of a logic computation, then they could be retimed as short wires to save cycles.
The correlations in figures 5(a) - 5(c) are not so clear. There appears to be a weak negative correlation in all three cases for wires shrunk to length-1. That is to say, more wires can be shrunk to length-1 in smaller as well as sparser configurations. A similar very weak correlation seems to exist for wires reduced to span-8. These trends suggest that the optimizer's effectiveness is limited for designs with dense communication (high wire density, high fused row density).
EXTRAPOLATION - Optimization of Random Configurations
In the above analysis, we found that nearly all array configurations
produced by Gama are mappable, i.e. pass wire allocation. Thus the
question remains unanswered: how well does wire length minimization
fare in making unwirable placements legally wirable? In the above
analysis we saw a weak trend to suggested that this optimization is
less effective as inter-row communication grows denser. Yet Gama does
not provide enough data points to verify these notions. In this
section, we generate pseudo-random array configurations with denser
communications in order to evaluate the optimizer.
The pseudo-random array configurations are characterized by four parameters pertaining to inter-row communication:
| - Wire density: | Percentage of rows which output to a vertical wire |
| - Exponential fanout distribution: | Most vertical wires in Gama-emitted array configurations fan out to one or two destination rows. A few long wires fan out to 3, 4, or more rows. We model fanout by an exponential random variable with a settable exponential parameter "fe". Thus the probability density for fanout is given by: p(x)==Cexp{-fe*x} for fanout x>0 (disallow x=0; C is a normalizing constant). The mean fanout for such a distribution is (1 + 1/fe). A fanout exponent fe=2 (mean fanout 1.5) models the fanout of Gama configurations fairly well. |
| - Exponential wire length distribution: | Most vertical wires in Gama-emitted array configurations travel only a few rows to their destination, while a few long wires may span the entire array. We model inter-row communications distance by an exponential random variable with a settable exponential parameter "we". Thus the probability density for inter-row communications distance is given by: p(x)==Cexp{-we*x} for distance x>0 (disallow x=0; C is a normalizing constant). An expression for mean wire length is difficult to derive from such a distribution of inter-row communications distances, but we may estimate it at 2(1 + 1/fe). Note that a wire length exponent of 0 signifies a uniform distribution of wire lengths over all possible lengths. |
| - Fused row density: | The percentage of rows receiving input from the row above via horizontal wire, so that they cannot be separated from that row by row motions. The Gama-emitted designs of the previous section have a mean fused row density of approximately 30%. |
Figures 6(a) and 6(b) scatter-plot the results of varying wire density and wire length exponent with other parameters fixed to match Gama output: a fanout exponent of 2, and fused row density of 30%. Figure 6(a) plots the percentage of pseudo-random designs that pass wire-allocation before optimization is applied. Notice that designs with low enough wire density and short enough wires (i.e. high wire length exponent) have 100% success in initial mapping. Figure 6(b) plots the percentage of initially unmappable designs which were converted by the optimizer into valid, wirable configurations. We see that the optimizer succeeds in correcting designs with reasonably small wire densities (mostly below 50%) but fails for denser designs. This is precisely the trend suggested at the end of the previous section.
To make further use of the parametric pseudo-random array configuration model, we examine the optimizer's performance in the presence of higher fanout. By reducing the fanout exponent at 1, we increase the mean fanout to 2 and allow a wider variance of fanouts. Figures 7(a) and 7(b) plot the mappability results for this case in a manner analogous to figure 6. We see in Figure 7(a) that the percentage of initially mappable (wirable) designs drops off earlier than before in the presence of high wire densities and longer wires (smaller wire length exponent). The added fanout clearly makes it more difficult to allocate wires for these designs. In Figure 7(b), we see that the optimizer does not fare as well as before in recovering unwirable designs -- it succeeds only for sparse designs with wire density under 30%, and even there, no more than 60% of unwirable designs can be made wirable. In addition, we see some negative values in this graph, indicating that the optimization process made an initially mappable design unwirable in the end.
Another point to explore in the parametric model is the effect of higher fused row densities. We are likely to find higher usage of horizontal wires for shifting in arithmetic applications, especially in managing the partial products of a multiplier. Gama currently does not support general purpose multipliers, so such data points are missing from the analysis of the previous section. Figures 8(a) and 8(b) scatter-plot wirability results with a fused row density of 60% and fanout exponent of 2. Here we find that the optimizer can recover unwirable designs at all wire densities, though with a limited success rate of 40%-80%. Thus in the presence of many horizontally fused rows, it is something of a gamble whether or not the optimizer can emit a wirable design.
CONCLUSION
A one-dimensional row placement and routing optimizer based on
minimizing total vertical wire length has been presented. We find
that the optimizer is effective in assisting a wire-allocation routing
phase only for logic designs with reasonably low communication
requirements. The optimizer fails at making an unwirable design
wirable as the density and fanout of logic outputs grows. Optimizer
performance also degrades in the presence of horizontally-fused rows
which restrict its selection of row motions. These results are somewhat
pessimistic since the optimizer was allowed to skip over intermediate
wirable array configurations in its search for a minimum-wire-length
configuration. It is likely that the limited benefits of the
optimizer to improving wirability stem directly from its ignorance
of Garp's vertical wire positioning. The minimization of total wire
length may be better suited to work in combination with other post-pass
mechanisms which have a more specific model of Garp wire availability.
The analysis of Gama-emitted configurations suggests that there may be some benefit for retiming in the use of a length-minimizing placement engine early in the logic mapping phase. Such an engine could reduce critical path delays and buffers associated with needlessly long wires. The reliability of this conclusion is not so clear, since the optimizer used in this paper is coded into a late-phase Configurator and has no knowledge of the retiming decisions made by Gama.
ACKNOWLEDGEMENTS
Thanks to John Hauser
for continued guidance on
Garp
and for providing source code to the Garp Configurator.
Thanks to Timothy Callahan
for enhancements to
Gama
and for providing benchmark test cases.
Thanks to Andre' DeHon
for guidance in the formulation of this project.
Comments to Eylon Caspi.
May 6, 1997