Academia.eduAcademia.edu

Global instruction scheduling for superscalar machines

1991, ACM SIGPLAN Notices

To improve the utilization of machine resources in superscalar processors, the instructions have to be carefully scheduled by the compiler. As internal parallelism and pipelining increases, it becomes evident that scheduling should be done beyond the basic block level. A scheme for global (intra-loop) scheduling is proposed, which uses the control and data dependence information summarized in a Program Dependence Graph, to move instructions well beyond basic block boundaries. This novel scheduling framework is based on the parametric description of the machine architecture, which spans a range of superscakis and VLIW machines, and exploits speculative execution of instructions to further enhance the performance of the general code. We have implemented our algorithms in the IBM XL family of compilers and have evaluated them on the IBM RISC System/6000 machines.

Global Instruction Scheduling David Machines Bernslein Michael IBM for SuperScalar Rodeh Israel ScientKlc Technion Haifa Center City 32000 ISRAEL 1. Introduction Abstract To improve the utilization superscalar carefully processors, scheduled parallelism evident level. scheduling information Dependence well beyond Graph, basic block uses the control to move boundaries. scheduling framework description of the machine exploits further code. speculative execution We have implemented XL family them on the IBM of compilers RISC which and to scheduling so as to improve of such transformations, has been placed on algorithms at the instruction with functional machines units BRG89, [BJR89], While pipelined HG83, GM86, Word for machines W90] (VLIW) each cycle, for pipelined issue a new instruction eliminating fee ell or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the otherwise, 0-89791-428-7/91/0005/0241 or to feature required from allowing the generation [EIEJ units the idea the goal is to every cycle, effectively NOPS (No Operations). types of machines, the code instructions the machine . ..$1.50 I for both and Very as n instructions machines the so-called However, Permission to copy without several machines n functional with is to be able to execute as many @ 1991 ACM or assembly level were suggested for processors Large Instruction machines. Association for Computing Machinery. To copy republish, reauirea a fee end/or aDecific oermiasion. out that in order language scheduling, Previously, architecture have to be rearranged, The burden compilers. [BG89, in the and have evaluated System/6000 It turned of pipelining optimizing instructions; spans of the general our algorithms [P85J for emerged which in computer instructions called instruction a new approach of program this direction at the intermediate code level. instructions of instructions enhance the performance IBM usually in a machines, streamlining performance, and is based on the parametric a range of superscakis and VLIW high speed processors to take advantage This novel architecture, building was called RISC the (intra-loop) summarized in the late seventies, subsequently be done beyond which Starting emphasizes it becomes A scheme for global is proposed, data dependence As internal increases, should in have to be by the compiler. and pipelining basic block resources the instructions that scheduling Program of machine the common the compiler is to discover that are data independent, of code that better in utilizes resources. i Proceedings of the Programming Toronto, ACM Language Ontario, SIGPLAN Design Canada, June ’91 Conference on It was a common view that such data independent and Implementation. 26-28, instructions 1991. can be found within basic blocks, there is no need to move instructions 241 beyond and basic block boundaries. Virtually, work on the implementation scheduling for pipelined scheduling within W90]. many unpredictable scientific Dependence concentrated [HG83, -type programs small basic blocks programs terminated since there, basic blocks compilers to expose parallelism with by extends RISC by the ability challenges and generation to compilers, the parametric for global instruction is not so severe, optimizing compilers. is evolving that This type of high or serious since instruction generation machine resources computations, symbolic or Udx-t not depend scheduling scheduling branch cases not beyond superscalar machines code, processor small number approaches machines scheduling was reported in [GR90], Also, instruction machine There of instructions within time is likely in on such assumption. However, global of taking advantage whenever available scheduling of the (e.g. As for the enhanced scheduling, our opinion towards a machine of computational units, is that it is with a large like VLIW between instructions have to be duplicated scheduled. Since we are currently machines the scheduling. for global with the scope of the enclosed speculative the movement loop. boundaries The method 242 and speculative we identify a small number System/6000 a conservative interested of functional approach First we try to exploit instructions, execution in order machines), useful instructions, we the cases where to be in units we to instruction the machine next we consider whose effect on performance depends on the probability to be taken, and scheduling might Bell Labs useful Also, (like the RISC established basic blocks in a PDG, a [EN89]. permits available distinguish of instructions. with in the literature: of AT&T (which of a but may not be true in of code for the VLIW which assumes the existence by proffig). E851 and the enhanced well beyond I Unix is a trademark by does Using the information are two main we present a technique scheduling scheduling scheduling) resources with In this paper, a powerful machines. one can view a as a VLIW that were reported [F81, number were investigated, of running of resources. scheduling providing of for scheduling techniques for compiling scheduling percolation of a family percolation probabilities, percolation [JW89]. instruction in fair improvements the compiled trace to pursue the scope of basic blocks resulting the PDG global more targeted code ,replication of ype programs), is capable computed of code that utilizes to a desired extent scientific (as well as enhanced poses more to allow trace scheduling main trace in the program to issue more than one sufficient where description framework level is in many for superscalar for the purposes We suggest combining hand, for at the basic block One recent effort to be used in of code for thereby called superscalar architecture, superpipelined et. al [FOW87] tend to be larger. per cycle [G089]. speed processors, called the Program that was recently by Ferrante multiprocessors. that (PDG), machines, a new type of architecture instruction Graph proposed While Recently, data structure, superscalar On the other the problem a novel vectorization such may result in code with Unixl branches. on GM86, architectures type of scheduling many of instruction basic blocks NOPS for certain include employs machines Even for basic RISC restricted all of the previous with ~f branches duplication, increase the code size incurring which additional costs in terms of instruction do not overlap belong cache misses. the execution to different of instructions iterations is often called sofware for future that of the loop. more aggressive type of instruction which Also, we of functional the machine This executed units of m types, where has nl, nz, ....n~ units of each type. Each instruction scheduling, pipelining a collection in the code can be potentially by any of the units of a speci.tied type. is left [J-X8], work. For the instruction scheduling that there is an unbounded For speculative instructions, previously-it suggested that they have to be supported machine architecture architectural [ESS, SLH90]. support carries a si~lcant evaluating with run-time compile-time retaining overhead, XL family System/6000 preliminary machine we are (coloring) of the code, still effect promised of compilers (RS/6K for the IBM for short) computers. of The rest of the paper is organized and show how it is applicable Section Then, in Section that will in Section 3 we bring serve as a running this paper we and register allocation A program instruction of machine cycles to be executed at all. between For the instruction see [BEH89]. units of its type. in Section execution, 6 we bring results and conclude in Section imposed which are modelled the execution machine Our model of a superscalar In of the PDG, description of a typical RISC that reference memory while We view a superscalar all the computations Let I (t > 1) be that if Zz is scheduled as 243 if 11 is constraints be (by the compiler) above, this would of the program, to guarantee info~ation BRG89]. are machine purposes, to start no earlier than k + t+ d. Notice, pipelined whose only are load and store instructions, edge. scheduled More is based on the done in registers. such that the edge to start at time k, then L should assume that the machine description edges of the time of 11 and d (d z O) be the delay affect the correctness 7. of by the integer graph. start earlier than mentioned some performance processor on the execution scheduled however, are presented. machine there are assigned to (11,14. For performance a small number by one of the Also, constraints (11,12) is a data dependence model interlocks 2. Parametric an integral Let 11 and L be two instructions In example. requires delays assigned to the data dependence 5 several levels of scheduling, speculative instructions Throughout register allocation scheduling instructions to the RS/6K 4 we discuss the usefulness including Finally, as follows. 2 we describe our generic machine program the onto the real on the relationships pipelined The results for our scheduling were based on a set of SPEC benchmarks machines. the using one of the standard algorithms. computational Section during phase of the compiler, discussion functional RISC [ss9]. while registers, of symbolic Subsequently, registers are mapped will not deal with our scheme in the context performance prototype symbolic number we assume execution. We have implemented the IBM register allocation such support most of the performance by speculative by the execution for replacing analysis registers in the machine. Since for speculative techniques was purposes, implements to not since we hardware the delays at run time. about the notion can be found of delays due to in [BG8!J, 2.1 The RS/6K model the second types of the above mentioned Here we show how our generic model superscalar machine machine. is cotilgured The RS/6K be considered. of a to fit the RS/6K processor is modelled 3. A program as Next, follows: we present that computes ● “ m = 3, there are three types fixed point, floating ni= l,n3= 1, nz= point unit, Most l,there instructions, point are four main in one Next, etc. a load instruction Figure a floating point compare instruction instruction comprises a floating Section that uses the result of that delays in the In this paper we concentrate computations 2 of notation, only. problem of future Therefore, in the with of before discussion. as was mentioned the global the register allocation to activate allocation 2 the registers mentioned 2, we prefer to invoke the register on fixed point them of the program However, in the code), even though whose effect is secondary. 2. the code of Figure for the purposes algorithm XL-C the instructions this stage there is an unbounded are a few additional machine in Figure statements in the code are real. and the branch compare. There that corresponds we mark the ten basic blocks of which For simplicity a delay of five cycles between for the loop, the The 2 (I 1-120) and annotate 1. Also, (BL1-BL1O) that updating if needed. we number the corresponding ; uses its result; — they are and the minimum, code of Figure and the instruction one to maximum For convenience, and the branch that uses the result of that instruction of a are compared , is presented is of of the loop. to the max and mi n variables, compiler3 a fixed point which every iteration (zfiu > v)) , and subsequently pseudo-code 1 and that two elements to the real code created by the IBM Zoad); a delay of one cycle between on the loop compared RS/6K that uses its in Figure of example. 1, we notice these elements in C) and the maximum concentrating in Figure (written is shown serve us as a running another division, a delay of three cycles between point This program marked types of delays: result register (delayed compare2 a small program the array a are fetched and the instruction instruction example the minimum In this program, there are also multi-cycle instruction – will unit and a are executed like multiplication, compare an array. types. isa single fixed a delay of one cycle between – units: unit. cycle, however, – and branch of the instructions s There of functional a single floating single branch ● point delays will number conceptually the instruction in scheduling is done (at of registers there is no scheduling after is completed. only the first and More precisely, usually the three cycle delay between a fixed point compare and the respective branch instruction encountered only when the branch is taken. However, here for simplicity is we assume that such delay exists whether the branch is taken or not. 3 The only feature of the machine in a special counter register. zero in a single instruction, that was disabled in this example is that of keeping the iteration variable of the loop Keeping the iteration variable in this register allows it to be decremented and tested for effectively reducing the overhead for loop control instructions. 244 ~ find the largest and the smal lest number in a given array minmax(a,n) { int i,u,v,min,max,n,a[SIZE]; min=a[O]; max=min; i=l; /****************** LOOP STARTS while max is kept in r30 min is kept in r28 i is kept in r29 n is kept in r27 address of a[i] is kept in r31 . . . more instructions here . . . *************** LfjOfJ STARTS ******************* ‘/ ************* / (i <n) { u=a[i]; v=a[i+l]; if (u>v) { if (u>max) max=u; if (v<min) min=v; } else { if if CL.0: (11) (12) } j= i+p. Loop printf(’’min=%d ENDS *************** / max=%d\n’’,min,max); } 1. A program Figure computing the minimum and the (Ill) maximum Every instruction branches, of an array 2, except for one cycle inthetixed point while the branches unit. There is a one cycle delay between 12and13, dueto RS/6K. Notice update instruction of a load with in 12: in addition (r31) + 8, it also increments r31 by 8 each compare corresponding consideration branch Also, unit branch address there is a three cycle delay instruction instruction. that the fixed point run in parallel, and the Taking unit we estimate into and the instructions) END BL1 u > max END BL2 max = u END BL3 v < min END BL4 min = v CL.9 END BL5 CL.4: cr6=r0,r30 (112) C (113) BF CL.ll,cr6,0x2/gt --------------------------------------(114) LR r30=r0 ~-------------------------------------CL. 11: (115) C cr7=r12,r28 (116) BF CL.9,cr7,0xl/lt --------------------------------------- v > max END BL6 max = v END BL7 u < min ... more instructions here .. . that the code executes in 20, 21 or 22 cycles, depending O, 1 or 2 updates index END BL8 min = u (117) LR r28=r12 --------------------------------------END BL9 CL.9: (118) AI r29=r29,2 i =i+2 i<n (119) C cr4=r29,r27 (120) BT CL.0,cr4,0xl/lt --------------------------------------END BL1O *************** LOfjp ENDS ********************** to assigning to locational between instruction the delayed load feature of the rO the value of the memory (post-increment). unit, take one cycle in the branch the special form B 1oad u load v and increment U>v --------------------------------------- in the code ofFigure requires r12=a(r31,4) rO, r31=a(r31,8) (13) C cr7=r12, r0 (14) BF CL.4, cr7,0x2/gt --------------------------------------(15) C cr6=r12, r30 (16) BF CL.6,cr6,0x2/gt --------------------------------------(17) LR r30=r12 --------------------------------------CL.6: (18) c cr7=r0,r28 (19) BF CL.9,cr7,0xl/lt --------------------------------------(110) LR r28=r0 (v>max) max=v; (u<min) min=u; }’ p****** ********* L LU of max and mi n variables on if Figure (LR 2. The RS/6Kpseudo-code Figure 1 are done, respectively. 245 forthe program of 4. The Program Dependence The program dependence to summarize both data While graph is a convenient the control among dependence the concept of data dependence, instruction of control dependence was introduced of control 4.1. Control a data using this value, was a long time notions that carries computing in compilers [FOW87]. ago, the notion In what follows quite we discuss the and data dependence separately. dependence We describe the idea of control dependence the program example 1. In Figure control flow described, of Figure graph of the loop of Figure in the loop. circles denote 3 the to a single The numbers inside the the indices of the ten basic blocks BL1-BL1O. We augment with ENTRY unique using 2 is where each node corresponds basic block ENTRY the code instructions, employed recently way and dependence the basic idea of one instruction value and another Graph convenience. the graph of Figure and EXIT Throughout 3 nodes for this discussion we assume a single entry node in the control flow graph, i.e., there is a single node (in our case BL1) which is connected to ENTRY. However exit nodes that have the edges leading exist. In our case BL1O is a (single) the strongIy connected in this context), graph having regions to EXIT of a control flow 3. The control block of the program. flow graph of the loop of Figure Here, a solid edge from node A to a node B has the following a meaning: For (that represent the assumption that the control may exit node. a single entry corresponds assumption several Figure loops 1. there is a condition is evaluated flow 2. if COND to the definitely graph is reducible. COND to either in the end of A that TRUE is evaluated or FALSE, to TRUE, be executed, othenvise and B will B will not be executed. The meaning of an edge from in a control program flow B. the conditions from The control loop that control 3 however, will the basic block (UsuaUy, one basic block of Figure block graph is that the control may flow from basic block be executed subgraph of Figure a node A to a node B of the A to the edges are annotated the flow to another.) it is not apparent under which of the PDG graph. with edges emanating basic (CSPDG) 2 is shown in Figure 3, each node of the graph corresponds BLl of the 4. As in Figure to a basic 246 will be evaluated BL8 will be executed FALSE. flow control dashed edges will be For example, from with as for the control 4 solid edges designate BL4 will be executed condition. edges are annotated conditions edges, while discussed below. the graph which dependence In Figure dependence of the program From The control the corresponding in Figure BL 1 indicate that BL2 and if the condition to TRUE, while 4 the while at the end of BL6 and the same condition is 2 Definition _— 1 — 2. B postdominates appears on every path from --@ A if and only if B A to EXIT. F T TF --+4 2 -- 6 T T T 5 3 *O 8 Definition 3. A and B are equivalent dominates B and B postdorninates A. Definition 4. We say that moving an instruction if and only if A T 9 7 B to A is useful if and only from if.4 and B are equivalent. Figure 4. The forward the loop control of Figure subgraph of the PDG of Definition 2 B to A is speculative from As was mentioned in the introduction, schedule instructions loop. scheduling, the for-war-d control that result from Figure the following control graphs only. dependence The usefulness stems from control of the control 4) can be executed B. of graph. useful In scheduling. are helpful To fmd equivalent search a CSPDG control of PDG that have the up to the existing dependent, i.e. they depend they do not depend on any node. BL4 are equivalent, since both mark the equivalent the instructions diiection relation of BLI For example, between of them nodes with the nodes. For example, dominates “the degree of speculativeness” graph from graph instructions scheduling is a path A to B. “gamble” from one block B if and only on every path fi-om ENTRY if A appears correctly, to B. CSPDG 247 4 we for a speculative on the outcome the moved provides that scheduling. for moving to another. instruction, instruction When we always of one or more only when we guess the direction 1. A dominates on BL1O. provides flow depend nodes BL 1 and BLI O, we conclude framework. there and the dominance our scheduling A, i.e., BL2 dashed edges, the is useful also for speculative from since In Figure CSPDG Let A and B be two nodes of a control Definition of “the same set Also, condition. of these edges provides equivalent together. nodes, we 4, BL 1 and B L 10 are equivalent, BL1 under the TRUE data doing for nodes that are identically of nodes under the same conditions. in Figure while that are such that B is reachable flow if A does not several deffitions to understand in the control an instruction that forward can be scheduled let us introduce duplication It turns out that CSPDGS (like BL 1 and For our purposes, such basic blocks dominate or BL6 and BL8 in Figure in parallel 6. We say that moving control subgraph dependence BL1O, or BL2 and BL4, A. B to A requires from graphs are acyclic. same set of control required The CSPDG the fact that basic blocks dependence. and the back dependence Notice if B does not dependence through we discuss forward dependence Definition [CHH89] the control flow graph. 4 is a forward postdorninate an instruction of a dependence graph only, or propagate edges in the control we of this type of we follow i.e. we do not compute Now a single iteration So, for the purposes instruction build within currently 5. We say that moving branches; of these branches becomes profitable. for every pair of nodes the It number of branches speculative moving scheduling). instructions on the outcome moving from For example, from is not obvious 3.) BL5 to BL 1 gambles branches, delays. when constrain since when Similarly, on the outcome B to A is n-branch 4. To compute in CSPDG from instructions from if there exists a path specula~ive A to B of length that useful scheduling n. Data dependence However, compilation we take advantage block level, data dependencies instruction dependence by instruction intrablock and interlock data dependence are computed at a basic are computed basis. time, observation. Then, data dependencies. if we discover edge from from a to c. To use this observation, are traversed pairs A block. may be caused by the usage of and (b,c), (Actually, the basic the dependency considered for every possible we compute for the data dependence the edge in an order such a and c, we have already (a,b) dependence edge from in one of the following ● the transitive relation closure in a basic block.) locations. in the code. a to b is inserted into A data that B is reachable PDG computed. in a is used in b (’jlow A register used in a is defined in b A register defined in a is defined Both (loads, it is not proven locations that are considered computation of the intrablock dependence that touch stores, calls to subroutines) instructions and (memory the data dependence as edges leading from a of a register to its use carry a (potentially non-zero) delay, which is a characteristic as was mentioned for BL1; of data we will reference by their numbers from from the Figure 2. There (I 1) to (12), since (I 1) uses r31 and (12) defines a new value for r31. di.rarnbiguation). deftition machine, of pairs of during the computation is an anti-dependence that they address different the data dependence underlying in the previous instructions is a flow Ordy are helps to reduce the number Let us demonstrate a and b we instructions memory flow in b (output dependence); ✎ data dependence such well. (anti-dependence); ✎ A in the control The observation paragraph dependence); ✎ from graph, the intrablock cases: A register defined the b in a basic Next for each pair A and B of basic blocks Let a and b be two instructions that a to b and b to c, there is no need to compute between both of the Let a, b and c be three in the code. instructions there has to reduce the that when we come to determine on an We compute registers or by accessing memory form in a basic from block control is si.rnih every pair of instructions there is a data dependence is O-branch While compiler which to be considered. following speculative. 4.2. of registers, all the data dependence essentially instructions Notice renaming process, the XL of two block, % We say that moving may unnecessarily [CFRWZ]. from since we cross two edges of Figure Definition which of anti and output to the effect of the static single assignment the control moving edges carry zero the number the scheduling does certain 4, we cross a from To minimize data dependence, BL8 to BL 1, we gamble BL8 to BL1 in Figure graph of Figure The rest of the data dependence on (in case of of a single branch, single edge. (This flow we gamble from both (I 1) and (12) to (13), since (13) uses r12 and rO defined (12), respectively. of the in Section data dependence The edge ((12),(13)) 248 in (I 1) and carries a one cycle delay, since (12) is a load instruction 2. There (delayed load), ((I 1),(13)) is not computed while transitive. There is a flow since it is data dependence from (13) to (14), since (13) sets cr7 which (14). This edge has a three cycle delay, a compare branch instruction instruction. ((12),(14)) Finally, to notice PDG convenient both that, is used in since (13) is duection, top-level since both Also, the control are acyclic, which the which is framework framework of the in the program will scheduling. This includes: “ NO duplication for global ● Only of code is allowed 6 in Section l-branch (see 4.1). speculative (see Deftition s No new basic blocks their relative ordering at hand. should These limitations be scheduled While instructions are 7 in Section 4.1). are created in the control the scheduling process. be tuned We schedule instructions dkcussed its basic blocks and in future in a region one at a time. visited in the topological for a specflc We present the top-level S,1, while the heuristics will be removed in the control process flow processed before are discussed in by processing The basic blocks are order, i.e., if there is a path graph from A to B, A is B. 5.2. Let A be the basic block to be scheduled The top-level process We schedule instructions region work. the top-level here, it is suggested that the set of heuristics 5.1. connected to a loop (which loops (which a region component without to A and are dominated Deftition 3). difkent by A (see a set C(A) contribute instructions of candidate which to A. Currently can there are two levels of scheduling: the has no back edges at all). we do not overlap iterations We maintain next, and that are for A, i.e., a set of basic blocks 1. Useful Since currently be the set of blocks equivalent blocks has at least one back edge) or a body of a subroutine enclosed on a In our terminology either a strongly that corresponds let EQUZV(A) in the program by region basis. represents of a loop, the execution of 2. l-branch in the process of scheduling instructions only: speculative: C(A)= C(A) EQUIV(A); includes the there is no difference following the body of a loop and blocks: a. the blocks the body of a subroutine. of EQUIV(A); b. AU the immediate successors of A in CSPDG; Innermost is that characterize status of our implementation flow graph during process is suitable for a range of machines Section order of branches cycle by cycle, and of a set of heuristics next, in case there is a choice. in Section edges. the current supported consists tries to schedule decide what instruction machine against the there are several limitations Deftition process, which flow in the upward This facilitates of instructions scheduling instructions of the control The original ✎ are moved preserved. is acyclic as well. 5. The scheduling out or into a i.e, they are moved direction discussed next. The global All the instructions ✎ of ((I 1),(14)) and we compute scheduling are never moved region. edges. and data dependence resultant edge and (14) is the corresponding are transitive It is important Instructions ● regions are scheduled few principles that govern f~st. There our scheduling are a c. All the immediate process: EQUW(A) 249 successors of blocks in CSPDG. in Once we initialize compute the set of candidate blocks, for A. the set of candidate instructions I is a candidate for scheduling instruction A if it belongs to one of the following we 9 The parametric machine description 2 does not cover all the secondary An of Section features of the machine; in block categories: ● The global decisions are not necessarily optimal in a local context. ● 1 belonged to A in the fust place. ~ 1 belongs to one of the blocks 1. f belongs to one of the blocks To solve this problem, and: in C(A) applied in EQUIV(A) and it may be moved basic block boundaries. instructions that are never moved basic block boundaries, (There beyond its are to every single basic block scheduling block has a more detailed scheduler which reordering like calls to allows scheduler is completed. The basic model of the more precise decisions the instructions within is of a program after the global machine beyond the basic block for the basic blocks. subroutines.) 2. 1 does not belong EQUIV(XI) to one of the blocks and it is allowed speculatively. (There are never scheduled to memory speculatively, heuristics that ready instructions, data dependence from process we maintain i.e., candidate are fulfilled. the ready list as many scheduled next as required a list of instructions functions a basic block) instructions instructions to be the parametric D(l), measure of how many instructions, we choose the %est” ones based on O for every K in B. code, and its data dependence instructions enabling are marked to become ready. in A are reordered instructions external immediate d(lJ1), and there might Initially, is set to Assume ... . Then, D(l) D(l) that .Jl,JZ, ... are the successors of Z in B, by visiting successors, D(I) = max((D(J1) I after visiting is computed as + d(Z,J1)),(D(JJ + d(l,JJ), ... ) be moved The second function provides take to complete does not depend an unbounded basic block. E(l) due to the two following reasons: called critical number path of how long it will the execution be the execution initialized 250 CP(l), a measure on 1 in B, including always create the best schedule for each individual It is mainly a on a data dependence d(l,Jz), heuristic, scheduler provides we move to to A that are physically out that the global fust The follows: A. It turns of delay slots may occur 1 to the end of B. its data dependence Once The net result is that the instructions into in the and let the delays on those edges be potentiality of A are scheduled, the next basic block. place in the to the following as fulfilled, new instructions all the instructions is picked up to to the proper B. called delay heuristic, path from Once an instruction machine in a block function If there are too many ready it is moved locally are used to set the priority Let 1 be an instruction by the machine description. criteria. are two that are computed in the program. by consulting priority There of an Every cycle we pick architecture, be scheduled, next. priority for every instruction code, these functions whose scheme is a set of the relative to be scheduled integer-valued instructions.) the scheduling that provide instruction like store (within During heuristics The heart of the scheduling to schedule it are instructions Scheduling 5.2. in of instructions 1 itself, and assuming of computational time of 1. First, to E(l) for every 1 in B. that units. CP(Z) Then, Let is again by visiting 1 after visiting successors, CP(l) 5.3, Speculative its data dependence is computed as follows: In the global scheduling non-speculative CT(l) = max((CP(J1) i- d(l,J1)), correctness scheduling scheduling, Section ... ) + l?(f) 4.2. scheduling During the decision instructions before speculative class of instructions an instruction D). (useful with ones. or speculative) (CP). Finally, ( order let A be a block scheduled, be executed of a C program: x); flow graph of this piece of code looks by a unit of the same type and) are ready at process, and one of them Also, has to be scheduled U(A) = A lJ EQUIV(A), the basic blocks the decision 1. If B(l) next. and let B(l) to which let and B(J) be 1 and J belong. is made in the following Then, Instruction orden e U(A) and B(J)# U(A), then pick ~ 2. If B(J) e U(A) and 13(1)# U(A), 3. If D(Z)> D(J), then pick ~ 4, If D(J)> D(f), 5. If CP(l) B3. Each of them into B 1, but it is apparent be printed x=3 belongs can be (speculatively) that both in B4. Data dependence of these instructions to moved of them are to move there, since a wrong the movement To solve this problem, then pick % that occurred about in the code from frost, the (symbolic) a basic block. considered that the current ordering is tuned towards of resources. preferring to B2, while value may do not prevent into B 1. > CP(J), then pick L 7. Pick an instruction functions x=5 belongs not allowed then pick J then pick< 6. If CP(J) > Cl’(l), number the that is the same time in the scheduling Notice excerpt Examine and let 1 and J be two that (should functional in as follows: To make it formally, instructions has to be maintained. The control of instructions. currently as they were defined ... path heuristic we try to preserve the original to respect this is not true, and a new type of i f (cond) x=5; else x=3; print. f(’’x=%d”, of the same class and delay we pick one that has a biggest critical it is sufficient ... we pick has the. biggest delay heuristic For the instructions following For the same doing It turns out that for speculative information process, we schedule useful while to preserve the of the program the data dependence (CP(J2) + d(l,JJ), framework, to schedule of the heuristic a machine with a small exit from B, such speculative a useful instmction to be updated a speculative instruction may cause longer delay. and tuning before a speculative In any case, updated. are needed for better Then, results. 251 speculatively that is being to a block B a new value for a register that is live on dka.llowed. one, even though the information registers that are Ibe on exit If an instruction to be moved This is the reason for always speculative experimentation computes we maintain Notice Thus, is that this type of information dynamically, motion movement i.e., after each this information has to be let us say, x=5 is fwst moved x (or actually has a symbolic register that to B 1. . . . more instructions here . . . ********** LOOp STARTS ************ . . . more instructions here . . . *********** Loop STARTS ************* CL.0: (11) (12) (118) (13) (119) (14) (15) (18) (16) (17) CL.6: (19) (110) (Ill) CL.4: (112) (115) (113) (114) CL.0: (11) (12) (118) (13) L LU AI C C ;F BF LR r12=a(r31,4) r0,r31=a(r31,8) r29=r29,2 cr7=r12,r0 cr4=r29,r27 CL.4,cr7,0x2/gt cr6=r12,r30 cr7=r0,r28 CL.6,cr6,0x2/gt r30=r12 BF LR B CL.9,cr7,0xl/lt r28=r0 CL.9 C C BF LR cr6=r0,r30 cr7=r12,r28 CL.ll,cr6,0x2/gt r30=r0 c more instructions 5. The results here of applying to the program correspondsto x) CL.9,cr7,(3xl/lt r28=r0 CL.9 detailed scheduling ofx=3to the useful scheduling and its relationship the effect of useful and on the example The result ofscheduling is presented ofBLl, considered tobe in Figure two instructions ofBL10(118 6 shows the result ofapplying (l-branch) and 119) were moved 252 12- 13 program both speculative In addition above, from BL8toBL6, the original Figure ffl Theresultisthat Figure2 of in 20-22 cycles per iteration. the schedulingto to the motions two additional (15 and 112) were moved there were those ofBLIO, since only BLIO~EQUIV(BLl). while 2 was executing were described and specula- moved inFigure5takes Figure the same program. that were 18was 15wasmovedfrom useful a.ndthe the the useful the programof Similarly, cycles per iteration, 2. ... delay slots of the there. Theresultantprogram onlyto 5. During the ordyinstmctions moved of applying ftiginthe BL4toBL2,andI of Figure useful instructions BL1, instructions to the PDG-based Let us demonstrate scheduling into speculative examples this program B 1, is out of the scope of this paper. scheduling here tive schedulingto 5.4. Scheduling speculative 6. The results 2 live onexitfrom ofthe more instructions Figure B1 will be prevented. description scheduling ... ... of Figure becomes and the movement global BF LR B CL.4: (115) C cr7=r12,r28 (113) BF CL.ll,cr5,(3x2/gt (114) LR r30=r0 CL.11: (116) BF CL.9,cr7,0xl/lt (117) LR r28=r12 CL.9: (120) BT CL.13,cr4,0xl/lt ********** Loop ENDS *************** BF CL.9,cr7,0xl/lt (117) LR r28=r12 CL.9: (120) BT CL.0,cr4,(3xl/lt ********** Loop ENDS ************** More C C C BF c BF LR (Ill) CL.11: Figure (119) (15) (112) (14) (18) (16) (17) CL.6: (19) r12=a(r31,4) r0,r31=a(r31,8) r29=r29,2 cr7=r12,r13 cr4=r29,r27 cr6=r12,r30 cr5=r0,r3Cl CL.4,cr7,0x2/gt cr7=r0,r28 CL.6,cr6,EJx2/gt r30=r12 (110) (116) ... L LU AI C instructions speculatively in the three cycle delay between since 15and that to BL1, to 13 and 14. Interestingly enough, 112 belong basic blocks that are never executed together to in any single execution of the program, two instructions will the program iteration, only one of these carry a useful result. in Figure in Figure was cor@ured All in all, 6 takes 11-12 cycles per a one cycle improvement program Next we describe how the global over the compile-time overhead improvement to a maximum design decisions 5. the global 6. Performance evaluation of the global scheme was done on the IBM whose abstract model For experimentation scheduling scheduling RS/6K is presented purposes, has been embedded of compilers. several high-level etc.; however, in Section into the IBM support like C, Fortran, we concentrate Only “small” “Small” Pascal, suite [SS9]. in unrolled EQNTOTT programs and ESPRESSO C Compiler, manipulation of Boolean functions (denoted by BASE After with and the global scheduling XL was disabled. that the base compiler includes possible scheduling machine optimization) the body two types of that of [W90], ● loop-closing scheduling techniques overlap to the that represent are rotated, loops by after the end of the global inner scheduling loops, the we effect of the software of the loop of the previous are executed of the within iteration. The general flow of the global scheduling is as inner loops are unrolled; to scheduling is applied time to the inner regions the fust only; 3. certain inner loops are rotated; techniques delay problems So, in some sense certain improvements global similar and a set of code replication certain scheduler are of one). i.e., some of the instructions 1. certain basic block that follows: and peephole as follows: a sophisticated 64 they include is applied such regions By applying 2. the global ● scheduling their fust basic block next iteration Please notice regions instead up to 4 basic blocks pipelining, in which on its own (aside of all the independent and only are scheduled. the inner of a loop achieve the partial . instruction regions second time to the rotated comparisons C compiler regions) that include once (i.e., after unrolling the global copying and equations. in the sequel) is the performance results of the same IBM other (i.e., step, before the global is applied, the loop. The basis for all the following inner regions loops with up to 4 basic blocks inner regions, are two that are related to minimization are scheduled. and 256 instructions. two iterations LI denotes the Lisp Interpreter . while status of are those that have at most In a preparation represent In the following stands for the GNU reducible basic blocks only on the C was done on the four C programs the SPEC benchmark between (i.e. regions regions scheduling GCC the current inner regions). XL programs. benchmark, The following Only two inner levels of regions outer regions ● discussion extent. regions that do not include 2.1. the global These compilers languages, The evaluation of the prototype: So, we distinguish machine ● family the trade-off and the run-time characterize scheduling scheme results 9 A preliminary so as to exploit scheduling that solve 4. the global [GR90]. scheduling time to the rotated is applied inner loops the second and the outer regions. due to the those of the scheduling The compile-time that were already part of the base overhead scheme is shown in Figure compiler. 253 of the above described 7. The column marked BASE gives the compilation in seconds as measured machine, model column marked provides on the IBM (Compile-Time percents. above mentioned rotation, only, time comes from and GCC, (Actually, performance BASE benchmarks, LI EQNTOTT ESPRESSO GCC improvement CTO 206 13% 78 465 2457 17% 12% 13% towards the existing at the moment, useful and speculative improvement (RTI) in Figure namely scheduling. for both overhead, which especially is shown of the measurements is about 0.5 of instructions by an opt imizing utilization is of machine superscalar to the processors, the base structure proposed The accuracy compilers (PDG), and a flexible 10/0. RTI USEFUL SPECULATIVE work scheduling, 312 EQNTOTT ESPRESSO GCC 45 1(36 76 2.0% 6.9% 7.1% 7.3% 0% (3% -0.5% -1.5% many helpful Figure 8. Run-time improvements Vladimir for the global sched- 254 description that employs a RS/6K machine We are going to extend our more aggressive speculative with would Krawczyk discussions, Rainish implementation. uling machine The results of evaluating and scheduling Hugo for better for a range of framework Acknowledgements. We Ebcioglu, compiler scheme on the IBM by supporting scheduling It is based on a data a parametric are quite encouraging. LI over the size for parallel/parallelizing scheduling the scheduling the global resources set of useful heuristics. BASE steps were that are being scheduled. The run-time 0/0 - it as since no major scheme allows in seconds. a larger As for the we consider The proposed with with units. usefid only and relative time of the code compiled in machines We may expect 7. Summary that we types of scheduling 8 in percents has already been optimized of computational of the regions the global due to the fact taken to reduce it except of the control overheads for the global sched- are two levels of scheduling with is modest architecture. even bigger payoffs number uling PROGRAM only.) that the achieved in run-time compile-time 7. Compile-time compiler in when the global our short experience we notice reasonable, running is to useful scheduling that the base compiler presented is etc.). PROGRAM distinguish scheduling no improvement is restricted scheduling, There the useful scheduling there is a slight degradation for both scheduling To summarize Figure most of On the other hand, for both observed. unrolling, 8 that for EQNTOTT for LI, the speculative ESPRESSO all of the loop while dominant. times in to perform steps (including The Overhead) This increase in the compilation the time required in Figure the improvement 530 whose cycle time is 40ns. CTO We notice RS/6K the increase in the compilation includes loop times of the programs duplication of code. like to thank Kemal and Ron Y. Pinter and Irit Boldo for their help in the and for References Transactions [BG89] Systems, Vol. 319-349. [BRG89] [BJR89j Bernstein, D., and Gertner, I., “Scheduling expressions on a pipelined processor with a maximal delay of one cycle”, ACM Transactions on Prog. Lang. and Systems, Vol. 11, Num. 1 (Jan. 1989), 57-66, Bernstein, D., Rodeh, M., and Gertner, I., “Approximation algorithms for scheduling arithmetic expressions on pipelined machines”, Journa[ of AZgorit/vns, 10 (Mar. 1989), 120-139. Bernstein, D., Jaffe, J. M., and Rodeh, M,, “Scheduling arithmetic and load operations in parallel with no spilling”, SIAM Journa[ of Computing, (Dec. 1989), 1098-1127. on Prog. Lang. and 9, Nurn. 3 (July 1987), [F81] Fisher, J., “Trace scheduling: A technique for global microcode compaction”, IEEE Trans. on Computers, C-30, No. 7 (July 1981), 478-490. [GM$6] Gibbons, P.B. and Muchnick, S. S., “Efficient instruction scheduling for a pipelined architecture”, Proc. of the SIGPLAN Annual Symposium, (June 1986), 11-16. [GR90] Golumblc, lM.C. and Rainish, V., “Instruction scheduling beyond basic blocks”, IBM J, Res. Dev.,(Jan. 1990), 93-98. [BEH89] Bradlee, D. G., Eggers, S.J., and Henry, R. R., “Integrating register allocation and instruction scheduling for RISCS”, to appear in Proc. of the Fourth ASPLOS Conference, (April 199 1). [G089] Groves, R. D., and Oehler, R., “An second generation RISC processor architecture”, Proc. of the IEEE Conference on Computer Design, (October 1989), 134-137. [CHH89] Cytron, R., Hind, M., and Wilson, H., “Automatic generation of DAG parallelism”, Proc. of the SIGPLAN Annual Symposium, (June 1989), 54-68, [HG83] [CFRWZ] Cytron, R., Ferrante, J., Rosen, B. K., Wegman, M. N., and Zadeck, F. K., “An efficient method for computing static single assignment form”, Proc, of the Annual ACM Symposium on Principles of Programming Languages, (Jan. 1989), 25-35. Hennessy, J,L. and Gross, T., “Postpass code optimization of pipeline constraints”, ACM Trans. on Programming Languages and Systems 5 (July 1983), 422-448. [JW89] Jouppi, N. P., and Wall, D.W., “Available instruction-level parallelism for superscalar and superpipelined machines”, Proc. of the Third A SPLOS Conference, (April 1989), 272-282. [L881 Lam M, “Software Pipelining: An effective scheduling technique for VLIW machines”, Proc. of the SIGPLAN Annual Symposium, (June 1988), 318-328. [P851 Patterson, D. A., “Reduced instruction set computers”, Comm. of A CM, (Jan. 1985), 8-21. [SLH90] Smith, M.D, Lam M. S., and Horowitz M.A., “Boosting beyond static scheduling in a superscalar processor”, Proc. of the Computer Architecture Conference, (May 1990), 344-354. [s89] “SPEC Newsletter”, Systems Performance Evaluation Cooperative, Vol. 1, Issue 1, (Sep. 1989). p-v!xy Warren, H., “Instruction scheduling for the IBM RISC System/6K processor”, IBit4 J. Res. Z)W., (J~. 1990), 85-92. [E88] Ebcioglu, K., “Some design ideas for a VLIW architecture for sequential-natured software”, Proc. of the IFIP Conference on Paral!el Processing, (April 1988), Italy. [EN89] Ebcioglu, K., and Nakatani, T., “A new compilation technique for paralleliziig regions with unpredictable branches on a VLIW architecture”, Proc. of the Workshop on Languages and Compilers fm-bm-aalle[ Computing, (August 1989), [E851 Ellis, J. R., “Bulldog: A compiler for VLIW architectures”, Ph.D. thesis, Yale U/DCS/RR-364, Yale University, Feb. 1985. [FOW87] Ferrante, J., Ottenstein, K.J., and Warren, J. D., “The program dependence graph and its use in optimization”, ACM 255 IBM