Resource allocation methodology for the segmented bus platform

2005

Consider a system-on-chip platform realized around the concept of segmented bus structure. The bus is segmented in such a way that modules connected to a particular segment of the bus can communicate in parallel with the data transfer operations going on in the other segments. Given the frequency of data transfer operations between the modules, our task is to determine an efficient segmentation and segment-to-module assignment of this kind of system organization. We consider several different optimization methods for the problem and demonstrate their use for sample cases, both theoretically and practically.

Tiberiu Seceleanu, Ville Leppänen, Jari Suomi, Olli Nevalainen
University of Turku, Finland
Email: {tiberiu.seceleanu,ville.leppanen,jsuomi,olli.nevalainen}@utu.fi

I. INTRODUCTION

The segmented bus system approach is one of the solutions to current problems (performance, power consumption, IP utilization, etc.) facing the design of modern systems-on-a-chip (SOC), as reflected in recent studies [2], [3], [6], [10]. The segmented bus platform that we analyze [7] fits into the larger concept of globally asynchronous locally synchronous (GALS) [1] system architectures. In this approach, each distinct module of the system is synchronized to a local clock running at an optimal frequency, whereas interactions between these modules are arranged asynchronously. Every segment is identified with a (possibly) different clock domain.

In the present study, we consider the problem of allocating hardware units to segments of the bus in such a way that the traffic across segment borders is minimized and the potential for parallel transfers is maximized. The goal is to obtain an optimal distribution of the components on the segments, so that performance is maximized. Here, performance is viewed as the overall time required by an application to finish a set of tasks on the platform.
The objective here is to keep the external data transfers of each segment as low as possible. In order to avoid trivial solutions, we consider a fixed number of segments. The inter-component traffic is defined by a unit-to-unit matrix of transfer frequencies.

Our approach differs from the one taken by Jone et al. [4] in three aspects. First, our objective is to maximize the parallelization and, at the same time, minimize the frequency of inter-segment transactions. Second, we do not fix (by a relaxation of the problem) the device topology, but allow a free search for the orientation of the devices. Third, we are interested in a linear organization of bus segments only and therefore do not allow a (more complex) tree-like segment organization. Due to the higher level of the approach, we also do not take into account technical aspects of the actual implementation, such as different clock domains, synchronization issues, or segment-to-segment throughput. Therefore, the results are somewhat idealized, but the implementation experiments show that they strongly impact the performance of the whole system.

0-7803-9264-7/05/$20.00 ©2005 IEEE

II. PROBLEM STATEMENT

We consider an on-chip system organization where the bus is divided into several segments so that data transfers between devices (masters and slaves) reserve only the segments between the source and target device. As an example, consider the case of a 3-segment bus and 8 devices (Fig. 1).

Fig. 1. Implementation example (devices D1 to D8 on a three-segment bus).

A data transfer between D4 and D6 reserves segment 2 only. A transfer between D2 and D8 reserves all three segments. The system has a control mechanism which identifies the source and target devices and the set of needed segments, and allocates them for the transfer [7]. It also allows parallel use of the bus if the sets of segments allocated by two or more transfers are disjoint.
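The reservation rule can be illustrated with a short Python sketch. It assumes the device placement that the paper later gives for Fig. 1 (D1, D2, D5 on segment 1; D3, D4, D6 on segment 2; D7, D8 on segment 3); the function names are hypothetical and only model the rule, not the actual arbitration hardware of [7].

```python
from itertools import combinations

# Segment of each device D1..D8 (segments numbered 1..3), as in Fig. 1.
FIG1_SEGMENT = {1: 1, 2: 1, 3: 2, 4: 2, 5: 1, 6: 2, 7: 3, 8: 3}


def reserved_segments(src, dst, segment_of=FIG1_SEGMENT):
    """Segments a transfer between devices src and dst has to reserve."""
    lo, hi = sorted((segment_of[src], segment_of[dst]))
    return set(range(lo, hi + 1))


def can_run_in_parallel(transfers, segment_of=FIG1_SEGMENT):
    """True if the reserved segment sets of all transfers are pairwise disjoint."""
    sets = [reserved_segments(s, d, segment_of) for s, d in transfers]
    return all(a.isdisjoint(b) for a, b in combinations(sets, 2))


print(reserved_segments(4, 6))                          # {2}
print(reserved_segments(2, 8))                          # {1, 2, 3}
print(can_run_in_parallel([(1, 5), (4, 6), (7, 8)]))    # True
print(can_run_in_parallel([(2, 8), (4, 6)]))            # False
```

Three transfers that each stay inside their own segment can proceed simultaneously, whereas the D2 to D8 transfer blocks every other transfer because it reserves the whole bus.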
The traffic between devices is defined by a device-to-device matrix $C$ ($c_{i,j}$; $1 \le i, j \le n$) giving the number of data transfer requests per time unit between each device pair $(i, j)$, see Fig. 2. In this example we consider that $D_1$, $D_3$, $D_6$ and $D_8$ are masters and the other devices are slaves. We denote the total traffic by $C_{sum} = \sum_{i,j} c_{i,j}$.

i\j    1    2    3    4    5    6    7    8
1      0   50    5    2   70    3    0    0
2     60    0    8    5   90    7    1    2
3      4    6    0   40    5   60    5    7
4      6    3   50    0    2   60    4    3
5     50   80    6    6    0    4    0    1
6      2    4   40   50    4    0    5    3
7      0    1    5    4    0    6    0   90
8      1    0    8    5    1    4   80    0

Fig. 2. A communication matrix.

For each segment $k$ ($k = 1, 2, \ldots, n_s$) we can calculate the total amount of data transfers over that segment as the sum of transfers which have
1. source and target device in the segment $k$ ($t_{k,1}$),
2. source in segment $k$, target elsewhere ($t_{k,2}$),
3. target in segment $k$, source elsewhere ($t_{k,3}$), or
4. source in segment $i$ and target in $j$, where $i < k < j$ or $i > k > j$ ($t_{k,4}$).
Here, $t_{k,j}$, with $j = 1, \ldots, 4$, denotes the number of data transfers per time unit in each case. Fig. 3 shows the different cases of data transfers for the segment $k = 2$.

Fig. 3. Data transfers reserving the segment k = 2. The numbers 1 to 4 refer to the indices j of $t_{k,j}$.

Let $T_k$ ($k = 1, 2, \ldots, n_s$) denote a weighted ($w_j$, $j = 1, \ldots, 4$) sum of transfers for segment $k$ as defined above:

$$T_k = \sum_{j=1}^{4} w_j t_{k,j}$$

Suppose further that there are $n_d$ devices, $D_1, \ldots, D_{n_d}$, and let $A_i$ be the segment number ($1 \le A_i \le n_s$) to which device $i$ is allocated. Thus, in Fig. 1 we have the device-segment allocation $\bar{A} = (A_1, \ldots, A_8) = (1, 1, 2, 2, 1, 2, 3, 3)$, which actually is an optimal solution (with cost 489) for the example in Fig. 2. Here, $T_k(\bar{A})$ denotes the sum of data transfers for segment $k$ with the device-segment allocation $\bar{A} = (A_1, A_2, \ldots, A_{n_d})$. We want to find, for a fixed number of segments $n_s$, a segment allocation $\bar{A}^*$ for which the maximum of the weighted sum of data transfers of a segment is minimal:

$$T^*(\bar{A}^*) = \min_{\bar{A}} \max_{1 \le k \le n_s} T_k(\bar{A}) \qquad (1)$$

Problem 2.1: The multisegmented bus device allocation problem is to solve $\bar{A}^*$ in (1).

What is still needed is an expression for $T_k(\bar{A})$ in terms of the known access frequencies $c_{i,j}$ ($1 \le i, j \le n_d$). We then have

$$t_{k,1}(\bar{A}) = \sum_{A_i = A_j = k} c_{i,j}$$

$$t_{k,2}(\bar{A}) = \sum_{A_i = k,\ A_j \ne k} c_{i,j}$$

$$t_{k,3}(\bar{A}) = \sum_{A_i \ne k,\ A_j = k} c_{i,j}$$

$$t_{k,4}(\bar{A}) = \sum_{A_i < k < A_j \text{ or } A_i > k > A_j} c_{i,j}$$

III. SEGMENT TRAFFIC LOAD

A simple form of the traffic load of each segment follows if we suppose that the device-segment allocation is given by the vector $\bar{A}$, and let $w_i = 1$ for all transfer types. We can then calculate, from the device-to-device transfer matrix $C$ and the device-to-segment allocation vector $\bar{A}$, a segment traffic load matrix $Q$ consisting of elements $q_{ij}$:

$$q_{ij} = \sum_{A_k = i,\ A_l = j,\ 1 \le k, l \le n_d} c_{k,l}$$

This gives the segment load of the segment $k$ as

$$T_k = \sum_{i=1}^{k} \sum_{j=k}^{n_s} q_{ij} + \sum_{i=k}^{n_s} \sum_{j=1}^{k} q_{ij} - q_{kk} = \sum_{i=1}^{k} \sum_{j=k}^{n_s} (q_{ij} + q_{ji}) - q_{kk}$$

Let us make the simplifying assumption $q_{ij} = v$ (a constant value) for all $i, j$. We then have:

$$T_k = \sum_{i=1}^{k} \sum_{j=k}^{n_s} 2v - v = 2v\,k(n_s - k + 1) - v = v(2kn_s - 2k^2 + 2k - 1)$$

We observe that the segmentation induces a higher traffic load on the middle-most segments when the traffic is evenly distributed among the different segments, see Fig. 4.

Fig. 4. Traffic distribution (panels: 3 segments, 4 segments, 10 segments).

It is interesting to note that the traffic load of the middle segment is

$$T_{n_s/2} \cong 2v\left(\frac{n_s}{2} \cdot \frac{n_s}{2}\right) - v = \left(\frac{n_s^2}{2} - 1\right) v$$

This indicates that, for a fixed $v$, the load of the middle segment increases with the square of $n_s$. One should however notice that $v = O(n_s^{-2})$, because the overall traffic is constant and there are $n_s^2$ different segment-to-segment routes in the bus. In the limit, $T_{n_s/2}$ approaches a nonzero constant.
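The definitions of $t_{k,1}, \ldots, t_{k,4}$ translate directly into code. The following Python sketch (the function name `segment_loads` is illustrative, and reading rows of $C$ as sources and columns as targets is an assumption) computes $T_k$ for the matrix of Fig. 2 under the allocation $(1, 1, 2, 2, 1, 2, 3, 3)$ and checks the closed form for evenly distributed traffic.

```python
# Communication matrix of Fig. 2 (row = source device, column = target device;
# this orientation is an assumption and does not matter when all weights are 1).
C = [
    [0, 50, 5, 2, 70, 3, 0, 0],
    [60, 0, 8, 5, 90, 7, 1, 2],
    [4, 6, 0, 40, 5, 60, 5, 7],
    [6, 3, 50, 0, 2, 60, 4, 3],
    [50, 80, 6, 6, 0, 4, 0, 1],
    [2, 4, 40, 50, 4, 0, 5, 3],
    [0, 1, 5, 4, 0, 6, 0, 90],
    [1, 0, 8, 5, 1, 4, 80, 0],
]
A = [1, 1, 2, 2, 1, 2, 3, 3]  # segment (1-based) of devices D1..D8


def segment_loads(C, A, ns, w=(1, 1, 1, 1)):
    """Return [T_1, ..., T_ns] as the weighted sums of t_{k,1}..t_{k,4}."""
    loads = []
    for k in range(1, ns + 1):
        t = [0, 0, 0, 0]
        for i, row in enumerate(C):
            for j, c in enumerate(row):
                a, b = A[i], A[j]
                if a == k and b == k:
                    t[0] += c          # case 1: both ends in segment k
                elif a == k:
                    t[1] += c          # case 2: source in k, target elsewhere
                elif b == k:
                    t[2] += c          # case 3: target in k, source elsewhere
                elif a < k < b or a > k > b:
                    t[3] += c          # case 4: transfer passes through k
        loads.append(sum(wj * tj for wj, tj in zip(w, t)))
    return loads


loads = segment_loads(C, A, 3)
print(loads, max(loads))  # [489, 448, 236] 489

# Even traffic: one device per segment and c_ij = v gives q_ij = v for all i, j.
ns, v = 5, 1
uniform = segment_loads([[v] * ns for _ in range(ns)], list(range(1, ns + 1)), ns)
closed_form = [v * (2 * k * ns - 2 * k * k + 2 * k - 1) for k in range(1, ns + 1)]
print(uniform, closed_form)  # [9, 15, 17, 15, 9] [9, 15, 17, 15, 9]
```

The maximum load of 489 occurs on segment 1 and matches the cost quoted for this allocation; the uniform case shows the load peaking in the middle segment.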
In the same way we observe that $\lim_{n_s \to \infty} T_1 = \lim_{n_s \to \infty} T_{n_s} = 0$.

Now consider three extreme cases for $n_s$: (a) $n_s = 1$, (b) $n_s = 2$, and (c) $n_s = n$. Assume that all segments have an equal number of nodes and that in all cases there is a fixed traffic $c_{i,j} = v$. Case (a) means that the whole traffic of load $n^2 v$ happens in one segment. In case (b), the traffic load within each of the two segments is $(n/2)(n/2)v$ and the traffic load crossing the segment border is $n(n/2)v$. Thus, in case (b) the load of a segment is $n^2 v/4 + n^2 v/2$, that is, 75% of the load in (a). In case (c) all nodes are in their own segment, and the traffic load of a segment in the middle is $2(n/2)(n/2)v = n^2 v/2$. Hence, in the case of even traffic, segmenting the bus can decrease the traffic load by at most 50%, and in the case $n_s = 2$ by 25%. Notice that, when the traffic pattern is not even, the benefit can be much greater. Therefore, we try to achieve an "uneven" situation regarding the inter-segment communication profile.

IV. ALGORITHMS FOR SOLVING SEGMENTATION

It has been shown in [8] that the bus segmentation Problem 2.1 is NP-complete. Next, we propose algorithms for solving this device allocation problem. The algorithms described in the following paragraphs form the basis for the development of SBTool, a command line application designed to solve allocation and segmentation problems for the SB platform. We present experimental results of the heuristics in Section V, based on the output from SBTool.

Related problems. In [4], Jone et al. consider finding a device allocation for a related segmented bus system. The two major differences are that (1) they allow the bus system to form an arbitrary tree, and (2) the overall power consumed by the bus segments is minimized instead of the usage of a single bus segment. Perhaps surprisingly, such a relaxed problem turns out to be optimally solvable in polynomial time.
Although the motivation in [4] and in our work is similar, the problems differ so much that adapting their algorithmic solution to Problem 2.1 does not seem possible. A similar problem is also studied in [9] by Srinivasan et al. Minimizing power consumption is an essential part of their optimization problem (partitioning and device allocation), which is solved with a genetic algorithm heuristic.

We approached the problem by using three different methods [8]. First, an exhaustive search algorithm was applied. This is feasible especially considering that, in practice, $n_s$ can be rather modest. However, this approach may also involve unnecessary work, since within a segment the order of devices makes no difference, and it also makes no difference whether segments are ordered from left to right or vice versa. Secondly, we looked for heuristic solutions, since solving Problem 2.1 optimally is NP-complete. The basic step in the proposed heuristic methods is to have some initial device-to-segment allocation, which in our case is a completely random solution (random order of devices and randomly set segment borders).

Greedy local search methods. Next we propose two heuristics for solving Problem 2.1. Algorithm 4.1 is a basic greedy local search algorithm. Besides the load matrix $C$ and the number of segments $n_s$, it receives as its parameters the iteration bound $b$ and a method MODIFYFUNC to generate a new allocation. New allocations are generated as long as they improve the current setting, until $b$ non-improving allocations have been generated in sequence. Algorithm 4.1 returns a device-to-segment mapping.
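As a rough illustration of the greedy local search just described, the following Python sketch uses the maximum unweighted segment load as the goodness value. It is a minimal sketch under stated assumptions, not the SBTool implementation: segments are numbered from 0, empty segments are allowed, and `move_randomly` (reassigning one random device to a random segment) is a hypothetical stand-in for the modification method.

```python
import random

# Communication matrix of Fig. 2, used here only as sample input.
C = [
    [0, 50, 5, 2, 70, 3, 0, 0],
    [60, 0, 8, 5, 90, 7, 1, 2],
    [4, 6, 0, 40, 5, 60, 5, 7],
    [6, 3, 50, 0, 2, 60, 4, 3],
    [50, 80, 6, 6, 0, 4, 0, 1],
    [2, 4, 40, 50, 4, 0, 5, 3],
    [0, 1, 5, 4, 0, 6, 0, 90],
    [1, 0, 8, 5, 1, 4, 80, 0],
]


def goodness(A, C, ns):
    """Maximum segment load max_k T_k with all weights set to 1."""
    # q[i][j] accumulates the traffic from segment i to segment j.
    q = [[0] * ns for _ in range(ns)]
    for i, row in enumerate(C):
        for j, c in enumerate(row):
            q[A[i]][A[j]] += c
    worst = 0
    for k in range(ns):
        load = sum(q[i][j] + q[j][i]
                   for i in range(k + 1) for j in range(k, ns)) - q[k][k]
        worst = max(worst, load)
    return worst


def move_randomly(A, ns, rng):
    """Assumed modification step: put one random device on a random segment."""
    B = list(A)
    B[rng.randrange(len(B))] = rng.randrange(ns)
    return B


def greedy_local_search(C, ns, b, modify, rng, start=None):
    """Accept only improving allocations; stop after b failures in a row."""
    A = list(start) if start is not None else [rng.randrange(ns) for _ in C]
    g = goodness(A, C, ns)
    i = 0
    while i < b:
        A2 = modify(A, ns, rng)
        g2 = goodness(A2, C, ns)
        if g2 < g:
            A, g, i = A2, g2, 0
        else:
            i += 1
    return A, g


rng = random.Random(0)
# Several random starting points; keep the best result found.
best = min((greedy_local_search(C, 3, 40, move_randomly, rng) for _ in range(50)),
           key=lambda result: result[1])
print(best)
```

Because only strictly improving allocations are accepted, a single run can stall in a local optimum, which is why the sketch repeats the search from several random starting points and keeps the best result.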
Algorithm 4.1: (Greedy local search with iteration bound)

SB-GREEDY-LOCAL-SEARCH(C[1..n][1..n], ns, b, MODIFYFUNC)
    A := GENERATERANDOMALLOCATION(C, ns);
    g := GOODNESS(A, C, ns);
    i := 0;
    while (i < b)
        A' := MODIFYFUNC(A, ns);
        g' := GOODNESS(A', C, ns);
        if (g' < g)
            A, g, i := A', g', 0;
        else
            i := i + 1;
    return A;

Algorithm 4.2: (Random swaps and/or moves)

SWAPS-MOVES-RANDOMLY_x(A[1..n], ns): [1..n] Number
    A'[1..n] := A;
    for (i = 1 to x) do
        if (RANDOM(0...1) == 1) then
            A' := SWAP-RANDOMLY(A', ns);
        else
            A' := MOVE-RANDOMLY(A', ns);
    return A';

Algorithm 4.2 performs a sequence of x random swap/move operations on a given device-to-segment allocation. The type of operation (swap or move) is chosen randomly with equal probability in each iteration round. In our experiments, we use SWAPS-MOVES-RANDOMLY_1, which performs a single random swap or move.

V. EXPERIMENTAL RESULTS

We performed experiments with 2 heuristic methods:
• LocalGreedy_M: Algorithm 4.1 is applied with INITRANDOMLY and MOVE-RANDOMLY. These experiments use the parameters #a, the number of attempts, which tells how many randomly chosen starting points are used, and b, which denotes the iteration bound.
• LocalGreedy_M,S: The same as above, with the difference that SWAPS-MOVES-RANDOMLY_1 is used instead of MOVE-RANDOMLY. Again #a is applied.

For experimenting with the algorithms, we chose a set of test cases with a fixed number of segments, briefly described in Fig. 5. In [8], a larger set of situations is analyzed. The cases fixed-1 and fixed-2 are so small that they can be solved optimally with an exhaustive search method, with no difference compared to the heuristic methods. Both heuristic methods, LocalGreedy_M and LocalGreedy_M,S, found the global optimum in all the cases, with very modest values for b (10 to 40) and #a (5 to 200). This also means that the methods ran very quickly.

Name      Csum   Description
fixed-1   100    a fixed test case for n = 6
fixed-2   100    a fixed test case for n = 8
fixed-3   1000   a fixed test case for n = 16

Fig. 5. Test cases with traffic distribution.

Implementation. Consider a situation where there are 16 devices (D0, ..., D15), a fixed-3 case, and the communication matrix C is as shown in Fig. 6. We solved the segmentation problem of fixed-3 by the exhaustive search and by the heuristic greedy local search algorithm LocalGreedy_M,S; see Fig. 7 for results with 2, ..., 8 segments. In cases ns = 2, ..., 4 (exhaustive search), the result is the global optimum. In cases ns = 5, ..., 8, heuristic methods were applied. The parameters (iteration bound and number of attempts) were set so that computations took around one minute. For cases ns = 2, ..., 4, the heuristics also found a solution as good as the global optimum.

      D0   D1   D2   D3   D4   D5   D6   D7   D8   D9  D10  D11  D12  D13  D14  D15
D0   180    0  100    0    0   10   60  200  200    0    0  100    0    0    0  150
D1     0  300  100    0   50    0    0  150    0  100  100    0  100    0  100    0
D2   100    0  200   50   50  180  100    0    0   70  140   60    0    0    0   50
D3   200    0  100  150  200   50    0  100    0    0    0    0  100    0    0  100
D4   140    0    0    0  150   50  150    0   70   70  150  140   80    0    0    0
D5     0    0  100  100    0  300    0    0    0  200    0  100    0  150    0   50
D6    85  100   25    0   90    ?  400    0  100    0    0    0   50   50    0  100
D7     0  100    0   50    0   50    0  150    0  250  100  120   80    0    0  100
D8   150   50    0    0    0  150  150   50  200    0    0    0  150    0   50   50
D9     0  150  150   50  100  100    0    0  100  200    0    0   80   70    0    0
D10    0    0    0    0    0  200  200  100    0   50  300    0  150    0    0    0
D11  100  150    0    0  100    0   50   50  150    0    0  200    0    0  100  100
D12  100    0    0    0  100  200    0    0    0    0  150    0  250  200    0    0
D13    0  100  150    0  200    0    0  200    0    0    0    0   50  200    0  100
D14   50   50    0    0    0   50  100  100  150  100    0    0    0    0  300  100
D15   50   50    0    0    0    0  200    0  150  150    0    0    0    0  100  300

Fig. 6. Relative communication loads. (In row D6, one entry among columns D2 to D5 is missing in the source; the surviving values 25, 0, 90 are shown in their original order and the position of the gap, marked "?", is uncertain.)

ns   cost     solution (indexes)
1    160000   0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2    104000   0 3 4 6 8 11 14 15 | 1 2 5 7 9 10 12 13
3    82800    0 6 8 11 14 15 | 1 3 7 9 | 2 4 5 10 12 13
4    71600    0 6 8 14 15 | 1 7 11 | 3 4 9 | 2 5 10 12 13
5    69150    3 4 10 12 13 | 2 5 | 7 9 | 1 6 | 0 8 11 14 15
6    65450    2 5 12 13 | 4 10 | 3 7 | 1 9 | 0 14 | 6 8 11 15
7    62300    6 8 15 | 1 14 | 0 11 | 7 | 3 4 | 2 9 | 5 10 12 13
8    59450    5 10 12 | 1 13 | 2 | 9 | 7 | 3 4 | 6 11 | 0 8 14 15

Fig. 7. Solutions for fixed-3.

Discussion and practical aspects. The methodological development considered, at this moment, only the relative communication loads between system devices. On-going research addresses power consumption issues in the framework of the same platform, while guiding results and illustrative benefits of a segmented bus platform may be derived from [10].

The 3-segment, 16-device solution was implemented on an ALTERA Stratix II device. The comparison is performed against a similar single bus version of the sample system (same number of masters, same number of slaves, same relative communication loads, same device, etc.). The clock frequencies were assigned as follows. For the SB implementation, segment 0 runs at 91 MHz, segment 1 at 98 MHz and segment 2 at 89 MHz, while the central arbitration unit operates at a 90 MHz clock frequency. For the single bus clock we assigned the fastest of the above frequencies, 98 MHz. Under these conditions, the results show a 26% increase in performance of the segmented bus solution over the execution on the single bus implementation (2.23 ms compared to 2.82 ms, the time required for all the masters to send their data packages). The whole system was simulated at post-synthesis level, in the Modelsim environment [11].

The slight differences between these results and the data specified in Figure 6 originate from the fact that the introduced algorithms analyze an ideal situation, where there is no inter-segment delivery latency. This is motivated, on one hand, by the fact that we cannot ensure a fixed value for this latency, but only a bounded one [7]; on the other hand, these latency bounds are fixed for a given platform instance, hence they will only affect the performance comparison with the single bus option. Moreover, both the communication loads and the size of the data packages affect the performance results more than this segment-to-segment delay. However, these corresponding values are dictated by the application level. In addition, the application also determines the exact sequence of transfers, in order to fulfill the required overall functionality; this aspect, however, will affect the performance figures obtained in our example only if idling periods are necessary.

Thus, another important parameter in this architecture is the size of the data package. Its dimension dictates the size of the inter-segment buffers and it may largely affect the system behavior in terms of performance, area and power consumption. The upper bound of the delivery latency corresponding to one inter-segment package is computed based on the package size. For the example discussed above, we have decided on a package size of 25 + 2 (data + address locations).

VI. CONCLUSIONS

We considered the resource allocation on a segmented bus platform. The optimal solution was formalized as an organizational problem, where the objective was to minimize the maximal weighted traffic between the system devices. A local search algorithm was proposed for minimizing the expressions derived from the estimated traffic frequencies between different devices. The algorithm relied on local search operations, performed in a greedy manner.

Future work. Application level issues, while not within the scope of the present study, are the main factors in deciding the communication matrix and, hence, affect both segmentation and allocation results.
Decisions taken at this level would greatly influence the decision on the size of the packages, too. This is the focus of present and forthcoming research; combined with the results presented in this study, a fully motivated decision can be taken regarding the selection of the architecture, as a single bus system or as a multi-segmented one.

REFERENCES

[1] D. M. Chapiro. Globally-Asynchronous Locally-Synchronous Systems. PhD thesis, Stanford University, 1984.
[2] J. Y. Chen et al. Segmented Bus Design for Low-Power Systems. IEEE Transactions on VLSI Systems, Vol. 7, No. 1, 1999, pp. 25-29.
[3] J. Jeon, K. Choi. An Effective Synthesis Algorithm for Partitioned Bus Architecture. Institute of Electrical Engineers, Electronics Letters, Vol. 35, No. 6, 1999, pp. 440-441.
[4] W.-B. Jone et al. Design Theory and Implementation for Low-Power Segmented Bus Systems. ACM Transactions on Design Automation of Electronic Systems, Vol. 8, No. 1, 2003, pp. 38-54.
[5] D. L. Kreher, D. R. Stinson. Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1999.
[6] S. Lee, K. Choi. Partitioned-bus architecture synthesis based on data transfer model. The Sixth Asia Pacific Conference on Chip Design Language, Fukuoka, Japan, 1999.
[7] T. Seceleanu. Communication on a Segmented Bus Platform. Proc. of the IEEE International SOC Conference, 2004, pp. 205-208.
[8] T. Seceleanu et al. On the Organization of Multisegmented Bus. TUCS technical report no. 647, 2004.
[9] S. Srinivasan et al. Simultaneous Partitioning and Frequency Assignment for On-chip Bus Architectures. Proc. of DATE 2005.
[10] H. Wang et al. A global bus power optimization methodology for physical design of memory dominated systems by coupling bus segmentation and activity driven block placement. Proc. of ASP-DAC 2004, pp. 759-761.
[11] ModelSim Simulator. http://www.model.com.