Resource Allocation Methodology for the
Segmented Bus Platform
Tiberiu Seceleanu, Ville Leppänen, Jari Suomi, Olli Nevalainen
University of Turku, Finland
Email: {tiberiu.seceleanu,ville.leppanen,jsuomi,olli.nevalainen}@utu.fi
Abstract— Consider a system-on-chip platform realized around
the concept of segmented bus structure. The bus is segmented in
such a way that modules connected to a particular segment of the
bus can communicate in parallel with the data transfer operations
going on in the other segments. Given the frequency of data
transfer operations between the modules, our task is to determine
an efficient segmentation and segment-to-module assignment of
this kind of system organization. We consider several different
optimization methods for the problem and demonstrate their use
for sample cases, both theoretically and practically.
I. INTRODUCTION
The segmented bus system approach is one of the solutions
to current problems (performance, power consumption, IP
utilization, etc.) facing the design of modern system-on-a-chip
(SOC), as reflected in recent studies [2], [3], [6], [10]. The
segmented bus platform that we analyze [7] fits into the larger
concept of globally asynchronous locally synchronous (GALS)
[1] system architectures. In this approach, each distinct module
of the system is synchronized to a local clock running at an
optimal frequency, whereas interactions between these modules are arranged asynchronously. Every segment is identified
with a (possibly) different clock domain.
In the present study, we consider the problem of allocating
hardware units to segments of the bus in such a way that the
traffic across segment borders is minimized and the potential
for parallel transfers is maximized. The goal is to obtain an
optimal distribution of the components on the segments, so that
the performance is maximally increased. Here, performance is
viewed as the overall time required by an application to finish
a set of tasks, on the platform. The objective here is to keep the
external data transfers of each segment as low as possible. In
order to avoid trivial solutions, we consider a fixed number of
segments. The inter-component traffic is defined by a unit-to-unit matrix of transfer frequencies. Our approach differs from
the one taken by Jone et al. [4] in three aspects. First, our
objective is to maximize the parallelization and, at the same
time, minimize the frequency of inter-segment transactions.
Second, we do not fix (by a relaxation of the problem) the
device topology, but allow a free search for the orientation of
the devices. Third, we are interested in a linear organization
of bus segments only and, therefore do not allow a (more
complex) tree-like segment organization.
Due to the higher level of the approach, we also do not
take into account technical aspects of the actual implementation, such as different clock domains, synchronization issues,
or segment-to-segment throughput. Therefore, the results are
somewhat idealized, but the implementation experiments show that
they strongly impact the performance of the whole system.
0-7803-9264-7/05/$20.00 ©2005 IEEE
II. PROBLEM STATEMENT
We consider an on-chip system organization where the bus
is divided into several segments so that data transfers between
devices (masters and slaves) reserve the segments between the
source and target device, only. As an example, consider the
case of a 3 segment bus and 8 devices (Fig. 1).
[Figure: a bus with three segments and eight attached devices, D1 to D8.]
Fig. 1. Implementation example.
A data transfer between D4 and D6 reserves segment 2
only. A transfer between D2 and D8 reserves all three segments. The system has a control mechanism which identifies
the source and target devices and the set of needed segments, and
allocates them for the transfer [7]. It also allows parallel use of
the bus if the sets of segments allocated by two or more transfers
are disjoint. The traffic between devices is defined by a device-to-device matrix $C$ ($c_{i,j}$; $1 \le i, j \le n$) giving the number of
data transfer requests per time unit between each device pair
$(i, j)$, see Fig. 2. In this example we consider that $D_1$, $D_3$, $D_6$
and $D_8$ are masters and the other devices are slaves. We denote
the total traffic with $C_{sum} = \sum_{i,j} c_{i,j}$.

i\j    1    2    3    4    5    6    7    8
1      0   50    5    2   70    3    0    0
2     60    0    8    5   90    7    1    2
3      4    6    0   40    5   60    5    7
4      6    3   50    0    2   60    4    3
5     50   80    6    6    0    4    0    1
6      2    4   40   50    4    0    5    3
7      0    1    5    4    0    6    0   90
8      1    0    8    5    1    4   80    0
Fig. 2. A communication matrix.

For each segment $k$
($k = 1, 2, \ldots, n_s$) we can calculate the total amount of data
transfers over that segment as the sum of transfers which have
1. source and target device in the segment $k$ ($t_{k,1}$),
2. source in segment $k$, target elsewhere ($t_{k,2}$),
3. target in segment $k$, source elsewhere ($t_{k,3}$), or
4. source in segment $i$ and target in segment $j$, where $i < k < j$ or
$i > k > j$ ($t_{k,4}$).
Here, $t_{k,j}$, with $j = 1, \ldots, 4$, denotes the number of data
transfers per time unit in each case. Fig. 3 shows the different
cases of data transfers for the segment $k = 2$.

[Figure: four segments S1 to S4 with transfers labeled 1, 2a, 2b, 3a, 3b, 4a and 4b.]
Fig. 3. Data transfers reserving the segment $k = 2$. The numbers 1 to 4
refer to the indices $j$ of $t_{k,j}$.

Let $T_k$ ($k = 1, 2, \ldots, n_s$) denote a weighted ($w_j$, $j = 1, \ldots, 4$)
sum of transfers for segment $k$ as defined above:

$$T_k = \sum_{j=1}^{4} w_j \, t_{k,j}$$

Suppose further that there are $n_d$ devices, $D_1, \ldots, D_{n_d}$, and
let $A_i$ be the segment number ($1 \le A_i \le n_s$) to which
device $i$ is allocated. Thus, in Fig. 1 we have the device-segment
allocation $\bar{A} = (A_1, \ldots, A_8) = (1, 1, 2, 2, 1, 2, 3, 3)$,
which actually is an optimal solution (with cost 489) for the
example in Fig. 2. Here, $T_k(\bar{A})$ denotes the sum of data
transfers for segment $k$ with the device-segment allocation
$\bar{A} = (A_1, A_2, \ldots, A_{n_d})$.
We want to find, for a fixed number of segments $n_s$, a
segment allocation $\bar{A}^*$ for which the maximum of the weighted
sum of data transfers of the segment is minimal:

$$T^*(\bar{A}^*) = \min_{\bar{A}} \; \max_{1 \le k \le n_s} T_k(\bar{A}) \qquad (1)$$

Problem 2.1: The multisegmented bus device allocation
problem is to solve $\bar{A}^*$ in (1).
What is still needed is an expression for $T_k(\bar{A})$ in terms of
known access frequencies $c_{i,j}$ ($1 \le i, j \le n_d$). We then have

$$t_{k,1}(\bar{A}) = \sum_{A_i = A_j = k} c_{i,j}$$
$$t_{k,2}(\bar{A}) = \sum_{A_i = k,\; A_j \ne k} c_{i,j}$$
$$t_{k,3}(\bar{A}) = \sum_{A_i \ne k,\; A_j = k} c_{i,j}$$
$$t_{k,4}(\bar{A}) = \sum_{A_i < k < A_j \text{ or } A_i > k > A_j} c_{i,j}$$

III. SEGMENT TRAFFIC LOAD
A simple form of the traffic load of each segment follows
if we suppose that the device-segment allocation is given by
the vector $\bar{A}$, and let $w_i = 1$ for all transfer types. We can then
calculate, from the device-to-device transfer matrix $C$ and the
device-to-segment allocation vector $\bar{A}$, a segment traffic load
matrix $Q$ consisting of elements $q_{ij}$:

$$q_{ij} = \sum_{A_k = i,\; A_l = j,\; 1 \le k, l \le n_d} c_{k,l}$$

This gives the segment load of the segment $k$ as

$$T_k = \sum_{i=1}^{k} \sum_{j=k}^{n_s} q_{ij} + \sum_{i=k}^{n_s} \sum_{j=1}^{k} q_{ij} - q_{kk}
      = \sum_{i=1}^{k} \sum_{j=k}^{n_s} (q_{ij} + q_{ji}) - q_{kk}$$

Let us make the simplifying assumption $q_{ij} = v$ (a constant
value) for all $i, j$. We then have:

$$T_k = \sum_{i=1}^{k} \sum_{j=k}^{n_s} 2v - v
      = 2v \, k (n_s - k + 1) - v
      = v (2 k n_s - 2 k^2 + 2k - 1)$$

We observe that the segmentation induces higher traffic load
to the middle-most segments in case the traffic is evenly
distributed among the different segments, see Fig. 4.

[Figure: traffic load per segment for buses with 3, 4 and 10 segments.]
Fig. 4. Traffic distribution.

It is interesting to note that the traffic load of the middle
segment ($k = n_s/2$) is

$$T_k \cong 2v \left( \frac{n_s}{2} \cdot \frac{n_s}{2} \right) - v = \left( \frac{n_s^2}{2} - 1 \right) v$$

This indicates that, for a fixed $v$, the load of the middle
segment increases with the square of $n_s$. One should
however notice that $v = O(n_s^{-2})$, because the overall traffic is
constant and there are $n_s^2$ different segment-to-segment routes
in the bus. In the limit, $T_{n_s/2}$ approaches a nonzero
constant. In the same way we observe that $\lim_{n_s \to \infty} T_1 =
\lim_{n_s \to \infty} T_{n_s} = 0$.
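As a concrete check of the definitions of $t_{k,1}, \ldots, t_{k,4}$ and $T_k$, the following Python sketch evaluates the segment loads for the communication matrix of Fig. 2 and the allocation of Fig. 1. The function name `segment_loads`, the 0-based device indexing and the default weights $w_j = 1$ are assumptions made for this illustration; they are not taken from SBTool.

```python
from typing import List, Sequence

# Communication matrix C of Fig. 2 (row = source device, column = target device).
C_FIG2 = [
    [0, 50, 5, 2, 70, 3, 0, 0],
    [60, 0, 8, 5, 90, 7, 1, 2],
    [4, 6, 0, 40, 5, 60, 5, 7],
    [6, 3, 50, 0, 2, 60, 4, 3],
    [50, 80, 6, 6, 0, 4, 0, 1],
    [2, 4, 40, 50, 4, 0, 5, 3],
    [0, 1, 5, 4, 0, 6, 0, 90],
    [1, 0, 8, 5, 1, 4, 80, 0],
]


def segment_loads(c: Sequence[Sequence[int]], alloc: Sequence[int],
                  n_s: int, w: Sequence[int] = (1, 1, 1, 1)) -> List[int]:
    """Return [T_1, ..., T_ns], where T_k = sum_j w_j * t_{k,j}.

    alloc[i] is the segment number (1..n_s) of device i (0-based index).
    """
    loads = []
    for k in range(1, n_s + 1):
        t = [0, 0, 0, 0]  # t_{k,1} .. t_{k,4}
        for i, a_i in enumerate(alloc):
            for j, a_j in enumerate(alloc):
                if a_i == k and a_j == k:      # case 1: both ends in segment k
                    t[0] += c[i][j]
                elif a_i == k:                 # case 2: source in k, target elsewhere
                    t[1] += c[i][j]
                elif a_j == k:                 # case 3: target in k, source elsewhere
                    t[2] += c[i][j]
                elif a_i < k < a_j or a_i > k > a_j:  # case 4: transfer passes through k
                    t[3] += c[i][j]
        loads.append(sum(w_j * t_j for w_j, t_j in zip(w, t)))
    return loads


loads = segment_loads(C_FIG2, (1, 1, 2, 2, 1, 2, 3, 3), 3)
print(loads, max(loads))  # [489, 448, 236] 489
```

The largest load, 489, is the cost quoted in Section II for this allocation.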
Now consider three extreme cases for $n_s$: (a) $n_s = 1$, (b)
$n_s = 2$, and (c) $n_s = n$. Assume that all segments have an
equal number of nodes and that in all cases there is a fixed
traffic $c_{i,j} = v$. Case (a) means that the whole traffic of load
$n^2 v$ happens in one segment. In case (b), the traffic load within
each segment is $(n/2)(n/2)v$ and the traffic load crossing
the segment border is $n(n/2)v$. Thus, in case (b) the overall
traffic load of a segment is 75% of that in (a). In case (c) all nodes are
in their own segment, and the traffic load of a segment in
the middle is $2(n/2)(n/2)v = n^2 v/2$. Hence, in the case of even
traffic, segmenting the bus can decrease the traffic load by at
most 50%, and in the case $n_s = 2$ by 25%. Notice that, when the
traffic pattern is not even, the benefit can be much greater.
Therefore, we try to achieve an “uneven” situation regarding
the inter-segment communication profile.
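The closed form for uniform segment-to-segment traffic derived in this section can be checked numerically. The sketch below compares $v(2kn_s - 2k^2 + 2k - 1)$ with a direct evaluation of the double sum and lists the loads for the segment counts shown in Fig. 4. The function names are hypothetical and $v = 1$ is chosen only to keep the numbers integral.

```python
def uniform_load_closed_form(k: int, n_s: int, v: int = 1) -> int:
    """Closed form T_k = v(2*k*n_s - 2*k^2 + 2*k - 1) for q_ij = v."""
    return v * (2 * k * n_s - 2 * k * k + 2 * k - 1)


def uniform_load_by_summation(k: int, n_s: int, v: int = 1) -> int:
    """Direct evaluation of sum_{i<=k} sum_{j>=k} (q_ij + q_ji) - q_kk with q_ij = v."""
    total = 0
    for i in range(1, k + 1):
        for j in range(k, n_s + 1):
            total += 2 * v  # q_ij + q_ji
    return total - v        # q_kk was counted twice in the sum


for n_s in (3, 4, 10):
    print(n_s, [uniform_load_closed_form(k, n_s) for k in range(1, n_s + 1)])
# 3 [5, 7, 5]
# 4 [7, 11, 11, 7]
# 10 [19, 35, 47, 55, 59, 59, 55, 47, 35, 19]
```

The loads are symmetric and peak at the middle segments, which is the behavior the text attributes to evenly distributed traffic.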
IV. ALGORITHMS FOR SOLVING SEGMENTATION
It has been shown in [8] that the bus segmentation Problem
2.1 is NP-complete. Next, we propose algorithms for solving
this device allocation problem. The algorithms described in
the following paragraphs form the basis for the development
of SBTool, a command line application designed to solve
allocation and segmentation problems
for the SB platform. We present experimental results of the
heuristics in Section V, based on the output from SBTool.
Related problems. Jone et al. [4] consider finding a device allocation for a related
segmented bus system. The two
major differences are that (1) they allow the bus system to form
an arbitrary tree, and (2) they minimize the overall power consumed
by the bus segments instead of the usage
of a single bus segment. Perhaps surprisingly, such a relaxed
problem turns out to be optimally solvable in polynomial time.
Although the motivation in [4] and in our work is similar,
the problems differ so much that adapting their algorithmic
solution to Problem 2.1 does not seem possible.
A similar problem is also studied by Srinivasan et
al. [9]. Minimizing power consumption is an essential part of
their optimization problem (partitioning and device allocation),
which they approach with a genetic algorithm heuristic.
We approached the problem by using three different methods [8]. First, an exhaustive search algorithm was applied. This
is a feasible approach, especially considering that, in practice, $n_s$
can be rather modest. However, this approach may also involve
unnecessary work, since within a segment the order of devices
makes no difference, and it also makes no difference whether
segments are ordered from left to right or vice versa. Secondly,
we looked for heuristic solutions, since solving Problem 2.1
optimally is NP-complete. The basic step in the proposed
heuristic methods is to have some initial device-to-segment
allocation, which in our case is a completely random solution
(random order of devices and randomly set segment borders).
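The exhaustive search mentioned above can be sketched in a few lines of Python for small instances such as the eight-device example of Fig. 2. This is a minimal illustration and not the SBTool implementation: the names `max_segment_load` and `exhaustive_search` are hypothetical, all weights are taken as $w_j = 1$, and the requirement that no segment is left empty is an assumption. The sketch enumerates all $n_s^n$ assignments and does not exploit the mirror symmetry noted in the text.

```python
from itertools import product

# Communication matrix C of Fig. 2 (row = source device, column = target device).
C_FIG2 = [
    [0, 50, 5, 2, 70, 3, 0, 0],
    [60, 0, 8, 5, 90, 7, 1, 2],
    [4, 6, 0, 40, 5, 60, 5, 7],
    [6, 3, 50, 0, 2, 60, 4, 3],
    [50, 80, 6, 6, 0, 4, 0, 1],
    [2, 4, 40, 50, 4, 0, 5, 3],
    [0, 1, 5, 4, 0, 6, 0, 90],
    [1, 0, 8, 5, 1, 4, 80, 0],
]


def max_segment_load(c, alloc, n_s):
    """Return max_k T_k, computed through the segment load matrix Q."""
    q = [[0] * n_s for _ in range(n_s)]
    for i, a_i in enumerate(alloc):
        for j, a_j in enumerate(alloc):
            q[a_i - 1][a_j - 1] += c[i][j]
    loads = []
    for k in range(n_s):  # 0-based segment index
        load = -q[k][k]   # q_kk is counted twice in the double sum below
        for i in range(k + 1):
            for j in range(k, n_s):
                load += q[i][j] + q[j][i]
        loads.append(load)
    return max(loads)


def exhaustive_search(c, n_s):
    """Try every device-to-segment assignment that uses all n_s segments."""
    best_cost, best_alloc = None, None
    for alloc in product(range(1, n_s + 1), repeat=len(c)):
        if len(set(alloc)) < n_s:  # assumption: no empty segments
            continue
        cost = max_segment_load(c, alloc, n_s)
        if best_cost is None or cost < best_cost:
            best_cost, best_alloc = cost, alloc
    return best_cost, best_alloc


print(max_segment_load(C_FIG2, (1, 1, 2, 2, 1, 2, 3, 3), 3))  # 489
print(exhaustive_search(C_FIG2, 3))
```

For eight devices and three segments this visits $3^8 = 6561$ assignments. Because the count grows as $n_s^n$, the approach is only practical for small instances.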
Greedy local search methods. Next we propose two heuristics
for solving Problem 2.1. Algorithm 4.1 is a basic greedy
local search algorithm. Besides the communication matrix $C$ and the number of segments
$n_s$, it receives as its parameters the iteration bound $b$
and a method MODIFYFUNC to generate a new allocation.
New allocations are generated until $b$ consecutive non-improving
allocations have been generated. Algorithm 4.1 returns a device-to-segment
mapping.
Algorithm 4.1: (Greedy local search with iteration bound)
SB-GREEDY-LOCAL-SEARCH(C[1..n][1..n], n_s, b, MODIFYFUNC)
  A := GENERATERANDOMALLOCATION(C, n_s);
  g := GOODNESS(A, C, n_s);
  i := 0;
  while (i < b)
    A' := MODIFYFUNC(A, n_s);
    g' := GOODNESS(A', C, n_s);
    if (g' < g) A, g, i := A', g', 0;
    else i := i + 1;
  return A;
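A runnable counterpart of Algorithm 4.1 may look as follows. It is a sketch under stated assumptions rather than the SBTool code: the goodness value is taken to be the maximum segment load with all weights equal to 1, the initial allocation follows the "random order, random borders" description given earlier, `move_randomly` is a hypothetical stand-in for MODIFYFUNC, and an explicit `rng` argument is added to make runs reproducible.

```python
import random
from typing import Callable, List, Sequence

Matrix = Sequence[Sequence[int]]
Allocation = List[int]  # alloc[i] = segment number (1..n_s) of device i


def goodness(alloc: Sequence[int], c: Matrix, n_s: int) -> int:
    """Maximum segment load max_k T_k with w_j = 1 (lower is better)."""
    worst = 0
    for k in range(1, n_s + 1):
        load = 0
        for i, a_i in enumerate(alloc):
            for j, a_j in enumerate(alloc):
                # A transfer i -> j reserves every segment between its end points.
                if min(a_i, a_j) <= k <= max(a_i, a_j):
                    load += c[i][j]
        worst = max(worst, load)
    return worst


def generate_random_allocation(n: int, n_s: int, rng: random.Random) -> Allocation:
    """Random device order cut by n_s - 1 random borders (requires n >= n_s)."""
    order = list(range(n))
    rng.shuffle(order)
    borders = sorted(rng.sample(range(1, n), n_s - 1))
    alloc = [0] * n
    segment = 1
    for pos, device in enumerate(order):
        if segment <= len(borders) and pos == borders[segment - 1]:
            segment += 1  # a border starts the next segment at this position
        alloc[device] = segment
    return alloc


def move_randomly(alloc: Sequence[int], n_s: int, rng: random.Random) -> Allocation:
    """Hypothetical MODIFYFUNC: reassign one random device to another segment."""
    new = list(alloc)
    device = rng.randrange(len(new))
    choices = [s for s in range(1, n_s + 1) if s != new[device]]
    if choices:
        new[device] = rng.choice(choices)
    return new


def sb_greedy_local_search(c: Matrix, n_s: int, b: int,
                           modify_func: Callable[..., Allocation],
                           rng: random.Random) -> Allocation:
    """Accept improving candidates; stop after b consecutive non-improving ones."""
    alloc = generate_random_allocation(len(c), n_s, rng)
    g = goodness(alloc, c, n_s)
    i = 0
    while i < b:
        candidate = modify_func(alloc, n_s, rng)
        g_new = goodness(candidate, c, n_s)
        if g_new < g:
            alloc, g, i = candidate, g_new, 0
        else:
            i += 1
    return alloc
```

Since a candidate is accepted only when it strictly improves the goodness value, the returned allocation is never worse than the random starting point.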
Algorithm 4.2: (Random swaps and/or moves)
SWAPS-MOVES-RANDOMLY_x(A[1..n], n_s): [1..n]
  Number A'[1..n] := A;
  for (i = 1 to x) do
    if (RANDOM(0...1) == 1)
      then A' := SWAP-RANDOMLY(A', n_s);
      else A' := MOVE-RANDOMLY(A', n_s);
  return A';
Algorithm 4.2 performs a sequence of $x$ random swap/move
operations for a given device-to-segment allocation. The type
of operation (swap or move) is chosen randomly with equal
probability in each iteration round. In our experiments, we
use SWAPS-MOVES-RANDOMLY_1, which performs a single
random swap or move.
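Algorithm 4.2 can be sketched in Python as shown below. The text does not define SWAP-RANDOMLY and MOVE-RANDOMLY in detail, so their behavior here is an assumption: a swap exchanges the segment numbers of two randomly chosen devices, and a move reassigns one randomly chosen device to a different segment. The `rng` parameter is also an addition for reproducibility.

```python
import random
from typing import List, Sequence


def swap_randomly(alloc: Sequence[int], n_s: int, rng: random.Random) -> List[int]:
    """Assumed SWAP-RANDOMLY: exchange the segments of two random devices."""
    new = list(alloc)
    if len(new) >= 2:
        i, j = rng.sample(range(len(new)), 2)
        new[i], new[j] = new[j], new[i]
    return new


def move_randomly(alloc: Sequence[int], n_s: int, rng: random.Random) -> List[int]:
    """Assumed MOVE-RANDOMLY: put one random device into a different segment."""
    new = list(alloc)
    device = rng.randrange(len(new))
    choices = [s for s in range(1, n_s + 1) if s != new[device]]
    if choices:
        new[device] = rng.choice(choices)
    return new


def swaps_moves_randomly(alloc: Sequence[int], n_s: int,
                         rng: random.Random, x: int = 1) -> List[int]:
    """Apply x operations, each a swap or a move with equal probability."""
    new = list(alloc)
    for _ in range(x):
        if rng.randint(0, 1) == 1:
            new = swap_randomly(new, n_s, rng)
        else:
            new = move_randomly(new, n_s, rng)
    return new
```

A swap keeps the number of devices per segment unchanged, whereas a move changes the sizes of two segments, so combining both lets the search reach allocations that neither operation reaches as directly on its own.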
V. EXPERIMENTAL RESULTS
We performed experiments with two heuristic methods:
• LocalGreedy_M: Algorithm 4.1 is applied with INITRANDOMLY and MOVE-RANDOMLY. These experiments use the
parameters #a, the number of attempts, which tells how many
randomly chosen starting points are used, and b, which denotes
the iteration bound.
• LocalGreedy_{M,S}: The same as above, with the difference
that SWAPS-MOVES-RANDOMLY_1 is used instead of MOVE-RANDOMLY. Again #a is applied.
For experimenting with the algorithms, we chose a set
of test cases with a fixed number of segments, briefly described in
Fig. 5. In [8], a larger set of situations is analyzed. The
cases fixed-1 and fixed-2 are so small that they can
be solved optimally with an exhaustive search method, with
no difference compared to the heuristic method. Both heuristic
methods, LocalGreedy_M and LocalGreedy_{M,S}, found the global
optimum in all the cases, with very modest values for b (10
to 40) and #a (5 to 200). This also means that the methods
performed very quickly.
Name      Csum   Description
fixed-1   100    a fixed test case for n = 6
fixed-2   100    a fixed test case for n = 8
fixed-3   1000   a fixed test case for n = 16
Fig. 5. Test cases with traffic distribution.
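The role of the number of attempts #a in the two LocalGreedy methods can be illustrated with a small multi-start wrapper: the local search is run from #a random starting points and the best result is kept. The helper name `best_of_attempts` and its callable parameters are hypothetical and chosen for this sketch only.

```python
from typing import Callable, Optional, Tuple, TypeVar

S = TypeVar("S")


def best_of_attempts(run_once: Callable[[], S],
                     cost: Callable[[S], float],
                     attempts: int) -> Tuple[Optional[S], Optional[float]]:
    """Run a randomized search `attempts` times and keep the cheapest result."""
    best, best_cost = None, None
    for _ in range(attempts):
        candidate = run_once()           # one local search from a fresh random start
        candidate_cost = cost(candidate)
        if best_cost is None or candidate_cost < best_cost:
            best, best_cost = candidate, candidate_cost
    return best, best_cost
```

In this sketch `run_once` would wrap one call of the greedy local search with iteration bound b, and `cost` would be the goodness function of the allocation.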
Implementation. Consider a situation where there are 16 devices ($D_0, \ldots, D_{15}$), the fixed-3 case, and the communication matrix $C$ is as shown in Fig. 6.
We solved the segmentation problem of fixed-3 by the
exhaustive search and the heuristic greedy local search algorithm
LocalGreedy_{M,S}; see Fig. 7 for results with 2, ..., 8
segments. In cases $n_s = 2, \ldots, 4$ (exhaustive search), the result is the
global optimum. In cases $n_s = 5, \ldots, 8$, heuristic methods
were applied. The parameters (iteration bound and number
of attempts) were set so that computations took around one
minute. For cases $n_s = 2, \ldots, 4$, the heuristics also found solutions
as good as the global optimum.
       D0   D1   D2   D3   D4   D5   D6   D7   D8   D9  D10  D11  D12  D13  D14  D15
D0    180    0  100    0    0   10   60  200  200    0    0  100    0    0    0  150
D1      0  300  100    0   50    0    0  150    0  100  100    0  100    0  100    0
D2    100    0  200   50   50  180  100    0    0   70  140   60    0    0    0   50
D3    200    0  100  150  200   50    0  100    0    0    0    0  100    0    0  100
D4    140    0    0    0  150   50  150    0   70   70  150  140   80    0    0    0
D5      0    0  100  100    0  300    0    0    0  200    0  100    0  150    0   50
D6     85  100    0   25    0   90  400    0  100    0    0    0   50   50    0  100
D7      0  100    0   50    0   50    0  150    0  250  100  120   80    0    0  100
D8    150   50    0    0    0  150  150   50  200    0    0    0  150    0   50   50
D9      0  150  150   50  100  100    0    0  100  200    0    0   80   70    0    0
D10     0    0    0    0    0  200  200  100    0   50  300    0  150    0    0    0
D11   100  150    0    0  100    0   50   50  150    0    0  200    0    0  100  100
D12   100    0    0    0  100  200    0    0    0    0  150    0  250  200    0    0
D13     0  100  150    0  200    0    0  200    0    0    0    0   50  200    0  100
D14    50   50    0    0    0   50  100  100  150  100    0    0    0    0  300  100
D15    50   50    0    0    0    0  200    0  150  150    0    0    0    0  100  300
Fig. 6. Relative communication loads.

ns   cost     solution (indexes)
1    160000   0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2    104000   0 3 4 6 8 11 14 15 | 1 2 5 7 9 10 12 13
3    82800    0 6 8 11 14 15 | 1 3 7 9 | 2 4 5 10 12 13
4    71600    0 6 8 14 15 | 1 7 11 | 3 4 9 | 2 5 10 12 13
5    69150    3 4 10 12 13 | 2 5 | 7 9 | 1 6 | 0 8 11 14 15
6    65450    2 5 12 13 | 4 10 | 3 7 | 1 9 | 0 14 | 6 8 11 15
7    62300    6 8 15 | 1 14 | 0 11 | 7 | 3 4 | 2 9 | 5 10 12 13
8    59450    5 10 12 | 1 13 | 2 | 9 | 7 | 3 4 | 6 11 | 0 8 14 15
Fig. 7. Solutions for fixed-3.

Discussion and practical aspects. The methodological development considered, at this moment, only the relative communication loads between system devices. Ongoing research
addresses power consumption issues in the framework of the
same platform, while guiding results and illustrative benefits
of a segmented bus platform may be derived from [10].
The 3-segment, 16-device solution was implemented on
an ALTERA Stratix II device. The comparison is performed
against a similar single-bus version of the sample system (same
number of masters, same number of slaves, same relative
communication loads, same device, etc.). The clock
frequencies were assigned as follows. For the SB implementation, segment 0 runs at 91 MHz, segment 1 at 98 MHz, and
segment 2 at 89 MHz, while the central arbitration unit
operates at a 90 MHz clock frequency. For the
single-bus clock we assigned the fastest of the above frequencies, 98 MHz.
In these conditions, the results show a 26% increase in
performance of the segmented bus solution over the execution on the single-bus implementation (2.23 ms compared
to 2.82 ms, the time required for all the masters to send
their data packages; 2.82/2.23 ≈ 1.26). The whole system was simulated at post-synthesis level in the ModelSim environment [11]. The slight
differences between these results and the data specified in
Fig. 6 originate from the fact that the introduced algorithms
analyze an ideal situation, where there is no inter-segment
delivery latency. This is motivated, on the one hand, by the fact
that we cannot ensure a fixed value for this latency, but only
a bounded one [7]; on the other hand, these latency bounds
are fixed for a given platform instance, hence they will only
affect the performance comparison with the single-bus option.
Moreover, both the communication loads and the size of
the data packages affect the performance results more than
this segment-to-segment delay. However, these
values are dictated by the application level. In addition,
the application also determines the exact sequence of transfers, in
order to fulfill the required overall functionality; this aspect,
however, will affect the performance figures obtained in our
example only if idling periods are necessary.
Thus, another important parameter in this architecture is the
size of the data package. Its dimension dictates the size of the
inter-segment buffers and it may largely affect the system behavior in terms of performance, area and power consumption.
The upper bound of the delivery latency corresponding to one
inter-segment package is computed based on the package size.
For the example discussed above, we have decided on
a package size of 25 + 2 (data + address locations).
VI. CONCLUSIONS
We considered the resource allocation on a segmented
bus platform. The optimal solution was formalized as an
organizational problem, where the objective was to minimize
the maximal weighted traffic between the system devices.
A local search algorithm was proposed for minimizing the
expressions derived from the estimated traffic frequencies
between different devices. The algorithm relied on local
search operations, performed in a greedy manner.
Future work. Application-level issues, while not within the
scope of the present study, are the main factors in deciding the
communication matrix and, hence, affect both segmentation
and allocation results. Decisions taken at this level would
also greatly influence the choice of the package size.
This is the focus of present and forthcoming research;
combined with the results presented in this study, a well-motivated
decision can be taken regarding the selection of the
architecture, as a single-bus system or as a multi-segmented
one.
REFERENCES
[1] D. M. Chapiro. Globally-Asynchronous Locally-Synchronous Systems.
PhD thesis, Stanford University, 1984.
[2] J. Y. Chen et al. Segmented Bus Design for Low-Power Systems. IEEE
Transactions on VLSI Systems, Vol. 7, No. 1, 1999, pp. 25-29.
[3] J. Jeon, K. Choi. An Effective Synthesis Algorithm for Partitioned Bus
Architecture. Institute of Electrical Engineers, Electronics Letters, Vol.
35, No. 6, 1999, pp. 440-441.
[4] W.-B. Jone et al. Design Theory and Implementation for Low-Power
Segmented Bus Systems. ACM Transactions on Design Automation
of Electronic Systems, Vol. 8, No. 1, 2003, pp. 38-54.
[5] D. L. Kreher, D. R. Stinson. Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1999.
[6] S. Lee and K. Choi. Partitioned-bus architecture synthesis based on data
transfer model. The Sixth Asia Pacific Conference on Chip Design
Language, Fukuoka, Japan, 1999.
[7] T. Seceleanu. Communication on a Segmented Bus Platform. Proc. of the
IEEE International SOC Conference, 2004, pp. 205-208.
[8] T. Seceleanu et al. On the Organization of Multisegmented Bus. TUCS
technical report no. 647, 2004.
[9] S. Srinivasan et al. Simultaneous Partitioning and Frequency Assignment
for On-chip Bus Architectures. Proc. of DATE 2005.
[10] H. Wang et al. A global bus power optimization methodology for physical
design of memory dominated systems by coupling bus segmentation and
activity driven block placement. Proc. of ASP-DAC 2004, pp. 759-761.
[11] ModelSim Simulator. http://www.model.com.