Global Instruction Scheduling for SuperScalar Machines

David Bernstein          Michael Rodeh

IBM Israel Scientific Center
Technion City
Haifa 32000, ISRAEL
Abstract

To improve the utilization of machine resources in superscalar processors, the instructions have to be carefully scheduled by the compiler. As internal parallelism and pipelining increases, it becomes evident that scheduling should be done beyond the basic block level. A scheme for global (intra-loop) scheduling is proposed, which uses the control and data dependence information summarized in a Program Dependence Graph to move instructions well beyond basic block boundaries. This novel scheduling framework is based on the parametric description of the machine, which spans a range of superscalar and VLIW architectures, and exploits speculative execution of instructions to further enhance the performance of the general code. We have implemented our algorithms in the IBM XL family of compilers and have evaluated them on the IBM RISC System/6000 machines.

1. Introduction

Starting in the late seventies, a new approach for building high speed processors emerged which emphasizes streamlining of program instructions; subsequently this direction in computer architecture was called RISC [P85]. It turned out that in order to take advantage of pipelining so as to improve performance, instructions have to be rearranged, usually at the intermediate code or assembly language level. The burden of such transformations, called instruction scheduling, has been placed on optimizing compilers.

Previously, scheduling algorithms at the instruction level were suggested for processors with several functional units [BJR89, BRG89], pipelined machines [BG89, HG83, GM86, W90] and Very Large Instruction Word (VLIW) machines [E85]. While for pipelined machines the goal is to issue a new instruction every cycle, effectively eliminating NOPS (No Operations), for machines with n functional units the idea is to be able to execute as many as n instructions every cycle. However, for both types of machines, the common feature required from the compiler is to discover in the code instructions that are data independent, allowing the generation of code that better utilizes the machine resources.

It was a common view that such data independent instructions can be found within basic blocks, and that there is no need to move instructions beyond basic block boundaries.
Virtually, all of the previous work on the implementation of instruction scheduling for pipelined machines concentrated on scheduling within basic blocks [HG83, GM86, W90]. Even for basic RISC architectures such a restricted type of scheduling may result in code with many NOPS for certain Unix¹-type programs that include many unpredictable branches and small basic blocks, while for scientific programs, whose basic blocks tend to be larger, the problem is not so severe.

Recently, a new type of architecture is evolving that extends RISC by the ability to issue more than one instruction per cycle [GO89]. This type of high speed processors, the so-called superscalar (or superpipelined) machines, poses more serious challenges to optimizing compilers, since instruction scheduling at the basic block level is in many cases not sufficient to allow the generation of code that utilizes machine resources to a desired extent [JW89].

One recent effort to pursue instruction scheduling for superscalar machines beyond the scope of basic blocks was reported in [GR90], resulting in fair improvements of the running time of the compiled code. Also, one can view a superscalar processor as a VLIW machine with a small number of resources, so that scheduling techniques established for VLIW machines become relevant here as well.

In this paper, we present a technique for global instruction scheduling that permits the movement of instructions well beyond basic block boundaries within the scope of the enclosed loop. The method employs a novel data structure, called the Program Dependence Graph (PDG), that was recently proposed by Ferrante et al. [FOW87] to be used in compilers for parallel machines and multiprocessors (e.g. for the purposes of vectorization and generation of code for multiprocessors). We suggest combining the framework of the PDG with a parametric description of a family of superscalar machines, thereby providing a scheduling framework that is more targeted towards such machines than the previous approaches.

¹ Unix is a trademark of AT&T Bell Labs.
For compiling code for VLIW machines, which have a large number of computational units, two main scheduling techniques were reported in the literature: trace scheduling [F81, E85] and the enhanced percolation scheduling [EN89]. Trace scheduling gambles on a main trace in the program, which is computed with the help of branch probabilities (e.g. by profiling); while the assumption of a dominant trace is likely to hold for scientific computations, it may not be true for symbolic or Unix-type programs, whose branches are unpredictable. Also, instructions that are moved beyond basic block boundaries by such techniques may have to be duplicated, and this code replication increases the code size, incurring additional costs in terms of instruction cache misses. As for the enhanced percolation scheduling, in our opinion it is targeted towards machines with a large number of computational units, like VLIW machines.

Since we are currently interested in superscalar machines with a small number of functional units (like the RISC System/6000 machines), we take a more conservative approach to global scheduling. Using the information available in the PDG, we distinguish between useful and speculative execution of instructions. First we try to exploit the machine resources with useful instructions; next we consider speculative instructions, whose effect on performance depends on the probability of branches to be taken. Also, we identify the cases where instructions have to be duplicated in order to be scheduled.

For speculative instructions, it was previously suggested that they have to be supported by the machine architecture [E88, SLH90]. Since such architectural support carries a significant run-time overhead, we are evaluating compile-time techniques for speculative scheduling that do not require special architectural support, still retaining most of the performance effect promised by speculative execution.

Currently, we do not overlap the execution of instructions that belong to different iterations of a loop; this more aggressive type of instruction scheduling, which is often called software pipelining [L88], is left for future work.

We have implemented a preliminary prototype of our scheme in the context of the IBM XL family of compilers for the IBM RISC System/6000 (RS/6K for short) computers. Performance results for our scheduling prototype were based on a set of SPEC benchmarks [S89].
The rest of the paper is organized as follows. In Section 2 we describe our generic machine model and show how it is applicable to the RS/6K machines. Then, in Section 3 we bring a small program that will serve as a running example. In Section 4 we discuss the usefulness of the PDG for the purposes of global instruction scheduling, while in Section 5 several levels of scheduling, including speculative execution of instructions, are presented. Finally, in Section 6 we bring performance results and conclude in Section 7.
Our model
of a superscalar
In
of the PDG,
description
of a typical
RISC
that reference memory
while
We view a superscalar
all the computations
Let I (t > 1) be
that if Zz is scheduled
as
243
if 11 is
constraints
be
(by the compiler)
above, this would
of the program,
to guarantee
info~ation
BRG89].
are
machine
purposes,
to start no earlier than k + t+ d. Notice,
pipelined
whose only
are load and
store instructions,
edge.
scheduled
More
is based on the
done in registers.
such that the edge
to start at time k, then L should
assume that the machine
description
edges of the
time of 11 and d (d z O) be the delay
affect the correctness
7.
of
by the integer
graph.
start earlier than mentioned
some performance
processor
on the execution
scheduled
however,
are presented.
machine
there are
assigned to (11,14. For performance
a small
number
by one of the
Also,
constraints
(11,12) is a data dependence
model
interlocks
2. Parametric
an integral
Let 11 and L be two instructions
In
example.
requires
delays assigned to the data dependence
5 several levels of scheduling,
speculative
instructions
Throughout
register allocation
scheduling
instructions
to the RS/6K
4 we discuss the usefulness
including
Finally,
as follows.
2 we describe our generic machine
program
the
onto the real
on the relationships
pipelined
The
results for our scheduling
were based on a set of SPEC benchmarks
machines.
the
using one of the standard
algorithms.
computational
Section
during
phase of the compiler,
discussion
functional
RISC
[ss9].
while
registers,
of symbolic
Subsequently,
registers are mapped
will not deal with
our scheme in the context
performance
prototype
symbolic
number
we assume
execution.
We have implemented
the IBM
register allocation
such support
most of the performance
by speculative
by the
execution
for replacing
analysis
registers in the machine.
Since
for speculative
techniques
was
purposes,
implements
to
not
since we
hardware
the delays at run time.
about
the notion
can be found
of delays due to
in [BG8!J,
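The parametric model can be captured in a small data structure. The following C sketch is only an illustration of the model as described above (the type and field names are ours, not part of the XL implementation); it records the number of unit types, the number of units of each type, and applies the earliest-start rule k + t + d to a data dependence edge.

#include <stdio.h>

#define MAX_TYPES 8

/* Parametric description of a superscalar machine:
   m functional unit types, n[i] units of type i.            */
typedef struct {
    int m;                 /* number of unit types            */
    int n[MAX_TYPES];      /* n[i] = number of units of type i */
} MachineDesc;

/* A scheduled instruction: its unit type, execution time t,
   and the cycle k at which the compiler scheduled it.        */
typedef struct {
    int unit_type;
    int exec_time;         /* t (t >= 1) */
    int start_cycle;       /* k          */
} Instr;

/* Earliest start of I2 given a data dependence edge (I1,I2)
   with delay d: k + t + d.  Scheduling I2 earlier is still
   correct (hardware interlocks), but loses performance.      */
int earliest_start(const Instr *i1, int delay)
{
    return i1->start_cycle + i1->exec_time + delay;
}

int main(void)
{
    MachineDesc md = { 3, { 1, 1, 1 } };    /* three unit types, one unit each          */
    Instr i1 = { 0, 1, 5 };                 /* a one-cycle instruction issued at cycle 5 */
    printf("types=%d, dependent may start at cycle %d\n",
           md.m, earliest_start(&i1, 1));   /* 5 + 1 + 1 = 7 */
    return 0;
}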
2.1 The RS/6K machine

Here we show how our generic model of a superscalar machine is configured to fit the RS/6K processor. The RS/6K processor is modelled as follows:

- m = 3: there are three types of functional units — fixed point, floating point and branch units.
- n1 = 1, n2 = 1, n3 = 1: there is a single fixed point unit, a single floating point unit and a single branch unit.
- Most of the instructions are executed in one cycle; however, there are also multi-cycle instructions, like multiplication, division, etc.
- There are a few additional types of delays, among them:
  - a delay of one cycle between a load instruction and the instruction that uses its result register (delayed load);
  - a delay of three cycles between a fixed point compare and the branch instruction that uses the result of that compare²;
  - a delay of five cycles between a floating point compare and the branch instruction that uses the result of that compare.

In this paper we concentrate on fixed point computations only. Therefore, the floating point delays will not be considered further.

Conceptually, we prefer to invoke the instruction scheduling before the register allocation of the program is done, since at that stage there is still an unbounded number of symbolic registers in the code. However, for the purposes of this paper and of the example that follows, the scheduling algorithm is activated after the register allocation is completed (so the registers mentioned in the code are real), an arrangement whose effect on the discussion is secondary.

² More precisely, usually the three cycle delay between a fixed point compare and the respective branch instruction is encountered only when the branch is taken. However, here for simplicity we assume that such delay exists whether the branch is taken or not.
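Continuing the sketch above, the RS/6K instance of the parametric description and the delays listed in this section could be written down as a small standalone program; the unit names and enum identifiers are of our choosing, not the XL compiler's.

#include <stdio.h>

/* RS/6K instance of the parametric model: three unit types
   (fixed point, floating point, branch), one unit of each.  */
enum { FXU = 0, FPU = 1, BRU = 2 };

typedef struct { int m; int n[3]; } MachineDesc;

/* Delays of this section, as data dependence edge annotations
   (in cycles).                                               */
enum {
    DELAY_LOAD_USE      = 1,  /* delayed load                     */
    DELAY_FX_CMP_BRANCH = 3,  /* fixed point compare -> branch    */
    DELAY_FP_CMP_BRANCH = 5   /* floating point compare -> branch */
};

int main(void)
{
    const MachineDesc rs6k = { 3, { 1, 1, 1 } };
    /* A one-cycle fixed point compare issued at cycle 0: the
       dependent branch should not start before 0 + 1 + 3 = 4. */
    int cmp_start = 0, cmp_time = 1;
    printf("unit types: %d, branch may start at cycle %d\n",
           rs6k.m, cmp_start + cmp_time + DELAY_FX_CMP_BRANCH);
    return 0;
}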
3. A program example

Next, we present a small program that computes the maximum and the minimum of an array; this program will serve us as a running example. The pseudo-code of the program (written in C) is shown in Figure 1. In this program, two elements of the array a are fetched in every iteration of the loop, they are compared one to another (if (u > v)), and subsequently the minimum and the maximum of these two elements are compared to the min and max variables, updating them if needed.

The code of Figure 1, compiled to the real code created by the IBM XL-C compiler³ for the RS/6K machine, is presented in Figure 2. For convenience, we number the instructions of the loop of Figure 2 (I1-I20) and annotate them with the corresponding statements of Figure 1. Also, we mark the ten basic blocks (BL1-BL10) of which the loop code of Figure 2 comprises. Notice that the instructions in the code are real.

³ The only feature of the machine that was disabled in this example is that of keeping the iteration variable of the loop in a special counter register. Keeping the iteration variable in this register allows it to be decremented and tested for zero in a single instruction, effectively reducing the overhead for loop control instructions.
/* find the largest and the smallest number in a given array */

minmax(a,n)
{
  int i,u,v,min,max,n,a[SIZE];

  min=a[0]; max=min; i=1;

  /****************** LOOP STARTS ******************/
  while (i < n)
  {
    u=a[i];
    v=a[i+1];
    if (u>v) {
      if (u>max) max=u;
      if (v<min) min=v;
    }
    else {
      if (v>max) max=v;
      if (u<min) min=u;
    }
    i=i+2;
  }
  /****************** LOOP ENDS ********************/

  printf("min=%d max=%d\n",min,max);
}

Figure 1. A program computing the minimum and the maximum of an array
        max is kept in r30
        min is kept in r28
        i   is kept in r29
        n   is kept in r27
        address of a[i] is kept in r31

        ... more instructions here ...
*************** LOOP STARTS *******************
CL.0:
(I1)   L    r12=a(r31,4)          | load u
(I2)   LU   r0,r31=a(r31,8)       | load v and increment
(I3)   C    cr7=r12,r0            | u > v
(I4)   BF   CL.4,cr7,0x2/gt       |
---------------------------------------- END BL1
(I5)   C    cr6=r12,r30           | u > max
(I6)   BF   CL.6,cr6,0x2/gt       |
---------------------------------------- END BL2
(I7)   LR   r30=r12               | max = u
---------------------------------------- END BL3
CL.6:
(I8)   C    cr7=r0,r28            | v < min
(I9)   BF   CL.9,cr7,0x1/lt       |
---------------------------------------- END BL4
(I10)  LR   r28=r0                | min = v
(I11)  B    CL.9                  |
---------------------------------------- END BL5
CL.4:
(I12)  C    cr6=r0,r30            | v > max
(I13)  BF   CL.11,cr6,0x2/gt      |
---------------------------------------- END BL6
(I14)  LR   r30=r0                | max = v
---------------------------------------- END BL7
CL.11:
(I15)  C    cr7=r12,r28           | u < min
(I16)  BF   CL.9,cr7,0x1/lt       |
---------------------------------------- END BL8
(I17)  LR   r28=r12               | min = u
---------------------------------------- END BL9
CL.9:
(I18)  AI   r29=r29,2             | i = i+2
(I19)  C    cr4=r29,r27           | i < n
(I20)  BT   CL.0,cr4,0x1/lt       |
---------------------------------------- END BL10
*************** LOOP ENDS *********************
        ... more instructions here ...

Figure 2. The RS/6K pseudo-code for the program of Figure 1

Every instruction in the code of Figure 2, except for the branches, requires one cycle in the fixed point unit, while the branches take one cycle in the branch unit. There is a one cycle delay between I2 and I3, due to the delayed load feature of the RS/6K. Notice the special form of a load with update instruction (LU) in I2: in addition to assigning to r0 the value of the memory location (r31)+8, it also increments r31 by 8 (post-increment). Also, there is a three cycle delay between each compare instruction and the corresponding branch instruction that uses its result. Taking into consideration that the fixed point unit and the branch unit run in parallel, we estimate that the code of Figure 2 executes in 20, 21 or 22 cycles per iteration of the loop, depending on whether 0, 1 or 2 updates of the max and min variables are done, respectively.
4. The Program Dependence Graph

The program dependence graph (PDG) is a convenient way to summarize both the control dependences and the data dependences among the code instructions. While the concept of data dependence, which carries the basic idea of one instruction computing a data value and another instruction using this value, was employed in compilers quite a long time ago, the notion of control dependence was introduced only recently [FOW87]. In what follows we discuss the notions of control dependence and data dependence separately.

4.1. Control dependence

We describe the idea of control dependence using the program example of Figure 1. In Figure 3 the control flow graph of the loop of Figure 2 is described, where each node corresponds to a single basic block in the loop. The numbers inside the circles denote the indices of the ten basic blocks BL1-BL10. We augment the graph with unique ENTRY and EXIT nodes for convenience.

Throughout this discussion we assume a single entry node in the control flow graph, i.e., there is a single node (in our case BL1) which is connected to ENTRY. However, several exit nodes that have edges leading to EXIT may exist; in our case BL10 is a (single) exit node. For the strongly connected regions of a control flow graph (that represent loops in this context), the assumption of a single entry node corresponds to the assumption that the control flow graph is reducible.

Figure 3. The control flow graph of the loop of Figure 2

Usually, the meaning of an edge from a node A to a node B in a control flow graph is that the control may flow from the basic block A to the basic block B, and the edges are annotated with the conditions under which the control flows from one basic block to another. In the control flow graph of Figure 3, however, it is not apparent which basic block will definitely be executed under which conditions.

The control dependence subgraph (CSPDG) of the PDG of the program of Figure 2 is shown in Figure 4. As in Figure 3, each node of the graph corresponds to a basic block of the program. Here, a solid edge from a node A to a node B has the following meaning:

1. there is a condition COND in the end of A that is evaluated to either TRUE or FALSE, and
2. if COND is evaluated to TRUE, B will definitely be executed, otherwise B will not be executed.

In Figure 4 the solid edges designate control dependences and are annotated with the corresponding conditions, while the dashed edges will be discussed below. For example, the edges emanating from BL1 indicate that BL2 and BL4 will be executed if the condition at the end of BL1 is evaluated to TRUE, while BL6 and BL8 will be executed if the same condition is evaluated to FALSE.
Figure 4. The forward control dependence subgraph (CSPDG) of the PDG of the loop of Figure 2

To explain how the CSPDG is used in our scheduling framework, let us introduce several definitions. Let A and B be two nodes of a control flow graph such that B is reachable from A.

Definition 1. A dominates B if and only if A appears on every path from ENTRY to B.

Definition 2. B postdominates A if and only if B appears on every path from A to EXIT.

Definition 3. A and B are equivalent if and only if A dominates B and B postdominates A.

Definition 4. We say that moving an instruction from B to A is useful if and only if A and B are equivalent.

Definition 5. We say that moving an instruction from B to A is speculative if B does not postdominate A.

Definition 6. We say that moving an instruction from B to A requires duplication if A does not dominate B.

It turns out that CSPDGs provide a convenient framework for finding equivalent nodes. To find equivalent nodes, we search a CSPDG for nodes that are identically control dependent, i.e., nodes that depend on the same set of nodes under the same conditions. For example, in Figure 4, BL1 and BL10 are equivalent, since both of them do not depend on any node, while BL2 and BL4 are equivalent, since both of them depend on BL1 under the TRUE condition. For our purposes, we mark such pairs of equivalent blocks (like BL1 and BL10, BL2 and BL4, or BL6 and BL8 in Figure 4) with dashed edges, directed according to the existing dominance relation between the nodes; for example, since BL1 dominates BL10, the dashed edge leads from BL1 to BL10. The usefulness of this information stems from the fact that instructions of equivalent basic blocks can be scheduled together; the CSPDG is helpful for speculative scheduling as well, as discussed next.
When doing speculative scheduling, we always "gamble" on the outcome of one or more branches: the instructions moved from one block to another carry a useful result only when we guess the direction of these branches correctly. The CSPDG provides "the degree of speculativeness" of moving instructions from one block to another.

Definition 7. We say that moving an instruction from B to A is n-branch speculative if there exists a path from A to B in the CSPDG of length n.

Notice that useful scheduling is essentially 0-branch speculative. For example, moving instructions from BL5 to BL1 gambles on the outcome of two branches, since in Figure 4 we cross two edges on the path from BL1 to BL5, while moving instructions from BL8 to BL1 gambles on the outcome of a single branch, since in Figure 4 we cross a single edge. (The degree of speculativeness is not obvious when looking at the control flow graph of Figure 3 only.)

As was mentioned in the introduction, currently we schedule instructions within a single iteration of the enclosed loop. So, for the purposes of scheduling, we build the forward control dependence graph only, i.e., we do not compute or propagate control dependence information through the back edges of the control flow graph; for computing the control dependences we follow [CHH89]. The graph of Figure 4 is such a forward control dependence graph. It turns out that forward control dependence graphs are acyclic, a fact that facilitates our scheduling framework.
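The definitions above can be checked mechanically once dominator and postdominator information is available. The following C sketch (our own illustration, not code from the XL compiler) classifies a proposed motion of an instruction from block B up to block A; the dominator and postdominator tables are assumed to have been filled by a standard algorithm.

#include <stdbool.h>
#include <stdio.h>

#define MAX_BLOCKS 64

/* dom[a][b]  means "a dominates b";
   pdom[a][b] means "a postdominates b".                    */
typedef struct {
    int  nblocks;
    bool dom[MAX_BLOCKS][MAX_BLOCKS];
    bool pdom[MAX_BLOCKS][MAX_BLOCKS];
} FlowInfo;

/* Definition 3: A and B are equivalent iff A dominates B
   and B postdominates A.                                    */
bool equivalent(const FlowInfo *fi, int a, int b)
{
    return fi->dom[a][b] && fi->pdom[b][a];
}

/* Classify moving an instruction from B to A
   (Definitions 4, 5 and 6).                                 */
const char *classify_motion(const FlowInfo *fi, int a, int b)
{
    if (equivalent(fi, a, b))  return "useful";
    if (!fi->dom[a][b])        return "requires duplication";
    /* A dominates B, but B does not postdominate A. */
    return "speculative";
}

int main(void)
{
    FlowInfo fi = {0};
    fi.nblocks = 2;
    fi.dom[0][1] = true;                         /* block 0 dominates block 1      */
    printf("%s\n", classify_motion(&fi, 0, 1));  /* prints "speculative"           */
    return 0;
}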
4.2. Data dependence

While the control dependences are computed for the region as a whole, the data dependences are computed at a basic block level, on an instruction by instruction basis. First we compute the intrablock data dependences. Let a and b be two instructions in a basic block. A data dependence edge from a to b is inserted into the PDG in one of the following cases:

- a register defined in a is used in b (flow dependence);
- a register used in a is defined in b (anti-dependence);
- a register defined in a is defined in b (output dependence);
- both a and b are instructions that touch memory (loads, stores, calls to subroutines) and it is not proven that they address different memory locations (memory disambiguation).

To minimize the number of anti and output dependences, the XL compiler does certain renaming of registers, which is essentially similar in its effect to the static single assignment form [CFRWZ].

To reduce the compilation time, we take advantage of the following observation. Let a, b and c be three instructions in a basic block. If we discover that there is a data dependence edge from a to b and from b to c, there is no need to compute the dependency between a and c, since the dependence relation is transitive. To use this observation, the instructions of the basic block are traversed in an order such that when we come to determine the dependence between a and c, the pairs (a,b) and (b,c) have already been considered. (Actually, we compute the transitive closure of the data dependence relation in a basic block.) This helps to reduce the number of pairs of instructions that have to be considered.

Next, for each pair A and B of basic blocks such that B is reachable from A in the control flow graph, the interblock data dependences are computed for every possible pair of instructions, one from A and one from B, essentially in the same manner; the observation above helps to reduce the number of pairs to be considered here as well.

Only the data dependence edges leading from the definition of a register to its use carry a (potentially non-zero) delay, which is a characteristic of the underlying machine, as was mentioned in Section 2; the rest of the data dependence edges carry zero delay.

Let us demonstrate the computation of the intrablock data dependences for BL1; we will reference the instructions by their numbers from Figure 2. There is an anti-dependence from (I1) to (I2), since (I1) uses r31 and (I2) defines a new value for r31. There is a flow dependence from both (I1) and (I2) to (I3), since (I3) uses r12 and r0 defined in (I1) and (I2), respectively; the edge ((I2),(I3)) carries a one cycle delay, since (I2) is a load instruction (delayed load), while the edge ((I1),(I3)) is not computed, since it is transitive. There is a flow dependence from (I3) to (I4), since (I3) sets cr7 which is used in (I4); this edge carries a three cycle delay, since (I3) is a compare instruction and (I4) is the corresponding branch instruction. Finally, the dependences ((I1),(I4)) and ((I2),(I4)) are not computed, since they are transitive as well.
It is important to notice that, since both the control dependence and the data dependence subgraphs we build are acyclic, the resultant PDG is acyclic as well, which is convenient for the top-level scheduling process described next.
5. The scheduling framework

In this section we present the global scheduling framework in its current implementation status. There are several limitations that characterize the framework at hand:

- No duplication of code is allowed (see Definition 6 in Section 4.1).
- Only 1-branch speculative instructions are scheduled (see Definition 7 in Section 4.1).
- No new basic blocks are created in the control flow graph during the scheduling process.

These limitations will be removed in future work.

There are a few principles that govern our scheduling process:

- All the instructions are moved in the upward direction only, i.e., against the direction of the control flow edges.
- The original order of the branches in the program is preserved, so their relative ordering never changes.
- Instructions are never moved out of or into a region. In our terminology, a region represents either a strongly connected component of the control flow graph (which corresponds to a loop and has at least one back edge) or a body of a subroutine without the enclosed loops (which has no back edges at all). Since currently we do not overlap the execution of instructions that belong to different iterations of a loop, there is no difference between the process of scheduling the body of a loop and the body of a subroutine.

We schedule instructions on a region by region basis; innermost regions are scheduled first. Within a region, we schedule instructions by processing its basic blocks one at a time. The basic blocks are visited in the topological order of the control flow graph, i.e., if there is a path in the control flow graph from A to B, then A is processed before B.

The global scheduling process consists of a top-level process, which tries to schedule instructions cycle by cycle, and of a set of heuristics that decide which instruction should be scheduled next in case there is a choice. We present the top-level process in Section 5.1, while the heuristics are discussed in Section 5.2. While the top-level process is suitable for a range of machines supported by the parametric machine description, it is suggested that the set of heuristics be tuned for the specific machine at hand.

5.1. The top-level process

Let A be the basic block to be scheduled next, and let EQUIV(A) be the set of blocks that are equivalent to A and are dominated by A (see Definition 3). We maintain a set C(A) of candidate blocks for A, i.e., a set of basic blocks which can contribute instructions to A. Currently there are two levels of scheduling:

1. Useful instructions only: C(A) = EQUIV(A);
2. 1-branch speculative: C(A) includes the following blocks:
   a. the blocks of EQUIV(A);
   b. all the immediate successors of A in the CSPDG;
   c. all the immediate successors of the blocks of EQUIV(A) in the CSPDG.

Once we initialize the set of candidate blocks, we compute the set of candidate instructions for A. An instruction I is a candidate for scheduling in block A if I belonged to A in the first place, or if it belongs to one of the blocks in C(A) and falls into one of the following categories:

1. I belongs to one of the blocks of EQUIV(A) and it may be moved beyond its basic block boundaries. (There are instructions, like calls to subroutines, that are never moved beyond basic block boundaries.)
2. I does not belong to one of the blocks of EQUIV(A) and it is allowed to be scheduled speculatively. (There are instructions, like store instructions to memory, that are never scheduled speculatively.)
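The candidate block set C(A) described above can be assembled directly from the CSPDG and the equivalence information. The sketch below is our own rendering of the two scheduling levels, using plain adjacency matrices and names of our choosing.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_BLOCKS 64

typedef struct {
    int  nblocks;
    bool equiv[MAX_BLOCKS][MAX_BLOCKS];    /* equiv[a][b]: b is in EQUIV(a)       */
    bool cs_succ[MAX_BLOCKS][MAX_BLOCKS];  /* cs_succ[a][b]: edge a -> b in CSPDG */
} Cspdg;

/* Compute C(A): level 0 is "useful only" (C(A) = EQUIV(A));
   level 1 adds the immediate CSPDG successors of A and of the
   blocks of EQUIV(A), i.e., the 1-branch speculative blocks.   */
void candidate_blocks(const Cspdg *g, int a, int level,
                      bool cand[MAX_BLOCKS])
{
    memset(cand, 0, sizeof(bool) * MAX_BLOCKS);

    for (int b = 0; b < g->nblocks; b++)
        if (g->equiv[a][b])
            cand[b] = true;                        /* EQUIV(A)               */

    if (level >= 1) {
        for (int b = 0; b < g->nblocks; b++) {
            if (g->cs_succ[a][b])
                cand[b] = true;                    /* successors of A        */
            for (int e = 0; e < g->nblocks; e++)
                if (g->equiv[a][e] && g->cs_succ[e][b])
                    cand[b] = true;                /* successors of EQUIV(A) */
        }
    }
}

int main(void)
{
    Cspdg g = {0};
    g.nblocks = 4;
    g.equiv[0][3] = true;       /* EQUIV(BL0) = { BL3 }            */
    g.cs_succ[0][1] = true;     /* BL1 control dependent on BL0    */
    g.cs_succ[3][2] = true;     /* BL2 control dependent on BL3    */

    bool cand[MAX_BLOCKS];
    candidate_blocks(&g, 0, 1, cand);
    printf("BL1:%d BL2:%d BL3:%d\n", cand[1], cand[2], cand[3]);  /* 1 1 1 */
    return 0;
}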
During the scheduling process we maintain a list of ready instructions, i.e., candidate instructions whose data dependences are fulfilled. Every cycle we pick from the ready list as many instructions to be scheduled next as allowed by the machine architecture. If there are too many ready instructions, we choose the "best" ones based on priority criteria. The heart of the scheduling scheme is a set of heuristics: integer-valued priority functions that provide the relative priority of every instruction in the code, computed by consulting the parametric machine description and the data dependence edges of the PDG. Once an instruction is picked up to be scheduled, it is moved to its proper place in the code, its data dependence edges are marked as fulfilled, and new instructions may thereby become ready. Once all the instructions of A are scheduled, we move to the next basic block. The net result is that the instructions of A are reordered, and instructions external to A may be moved into A.

The parametric machine description of Section 2 does not cover all the secondary features of the machine, so the global scheduling decisions are not necessarily optimal in a local context. To solve this problem, a basic block scheduler, which has a more detailed model of the machine and allows reordering of the instructions within a basic block, is applied to every single basic block of the program after the global scheduling is completed; this facilitates more precise local decisions.

5.2. Scheduling heuristics

There are two main priority functions that are computed for every instruction in the program. Let I be an instruction in a basic block B, and let d(I,J) be the delay assigned to the data dependence edge (I,J).

The first function, called the delay heuristic, D(I), provides a measure of how many delay slots may occur on a data dependence path from I to the end of B. Initially, D(I) is set to 0 for every I in B. Assume that J1, J2, ... are the immediate data dependence successors of I in B, and let the delays on those edges be d(I,J1), d(I,J2), .... Then, D(I) is computed by visiting I after visiting its data dependence successors, as follows:

    D(I) = max( D(J1) + d(I,J1), D(J2) + d(I,J2), ... )

The second function, called the critical path heuristic, CP(I), provides a measure of how long it will take to complete the execution of the instructions that depend on I in B, including I itself, assuming an unbounded number of computational units. Let E(I) be the execution time of I. First, CP(I) is initialized to E(I) for every I in B. Then, CP(I) is computed, again by visiting I after visiting its data dependence successors, as follows:

    CP(I) = max( CP(J1) + d(I,J1), CP(J2) + d(I,J2), ... ) + E(I)
The delay and critical path heuristics are used to set the relative priority of the ready instructions. To make it formal, let A be the block that is being scheduled, let U(A) = A ∪ EQUIV(A), and let I and J be two instructions that (should be executed by a functional unit of the same type and) are ready at the same time in the scheduling process, so that one of them has to be scheduled next. Also, let B(I) and B(J) be the basic blocks to which I and J currently belong. Then, the decision which instruction is scheduled next is made in the following order:

1. If B(I) ∈ U(A) and B(J) ∉ U(A), then pick I;
2. If B(J) ∈ U(A) and B(I) ∉ U(A), then pick J;
3. If D(I) > D(J), then pick I;
4. If D(J) > D(I), then pick J;
5. If CP(I) > CP(J), then pick I;
6. If CP(J) > CP(I), then pick J;
7. Pick the instruction that occurred first in the original order of the code.

Notice that the current ordering of the heuristic functions is tuned towards a machine with a small number of resources: rules 1 and 2 prefer useful instructions over speculative ones; among instructions of the same class we first pick the one that has the biggest delay heuristic, and only then the one that has the biggest critical path heuristic; finally, we try to preserve the original order of the instructions in the code. In any case, more experimentation and tuning of the heuristics are needed for better results.
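The decision order above is essentially a tie-breaking comparison between two ready instructions. A compact C rendering of it (our own, with a simplified record per ready instruction) could look like this:

#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool in_uA;      /* does the instruction's block belong to U(A)?  */
    int  D;          /* delay heuristic                               */
    int  CP;         /* critical path heuristic                       */
    int  orig_pos;   /* position in the original code order           */
} Ready;

/* Return 1 if instruction i should be scheduled before j,
   following the seven rules of Section 5.2.                 */
int pick_first(const Ready *i, const Ready *j)
{
    if (i->in_uA && !j->in_uA) return 1;   /* rule 1 */
    if (j->in_uA && !i->in_uA) return 0;   /* rule 2 */
    if (i->D  > j->D)          return 1;   /* rule 3 */
    if (j->D  > i->D)          return 0;   /* rule 4 */
    if (i->CP > j->CP)         return 1;   /* rule 5 */
    if (j->CP > i->CP)         return 0;   /* rule 6 */
    return i->orig_pos < j->orig_pos;      /* rule 7 */
}

int main(void)
{
    Ready a = { true, 2, 5, 0 }, b = { false, 7, 9, 1 };
    printf("%s first\n", pick_first(&a, &b) ? "a" : "b");  /* "a": rule 1 wins */
    return 0;
}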
5.3. Speculative scheduling

In the global scheduling process, we schedule useful (non-speculative) instructions before speculative ones. For useful instructions it is sufficient to respect the data dependences as they were defined in Section 4.2 in order to preserve the correctness of the program. It turns out that for speculative instructions this is not true, and a new type of information has to be maintained. Examine the following excerpt of a C program:

    ...
    if (cond) x=5; else x=3;
    printf("x=%d", x);
    ...

The control flow graph of this piece of code looks as follows: the condition is evaluated in a block B1, x=5 belongs to B2, x=3 belongs to B3, and the call to printf belongs to B4. Each of the two assignments can be (speculatively) moved into B1, but it is apparent that both of them are not allowed to move there, since then a wrong value of x may be printed in B4. The data dependences of these instructions do not prevent the movement of x (or actually of the symbolic register that holds x) into B1.

To solve this problem, for every basic block B we maintain the information about the (symbolic) registers that are live on exit from B. If an instruction that is being considered for speculative movement into a block B computes a new value for a register that is live on exit from B, such speculative movement is disallowed. Thus, if, let us say, x=5 is first moved into B1, the information about the registers that are live on exit from B1 is updated, and the subsequent movement of x=3 into B1 will be prevented. Notice that this type of information has to be updated dynamically, i.e., after each movement of a speculative instruction; this is one more reason for always scheduling useful instructions before speculative ones.
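The live-on-exit rule can be expressed as a small guard that is consulted (and updated) for every speculative motion. The sketch below uses a plain per-block bitmap for the live-on-exit registers; as with the other sketches, the data structure and names are ours, not the XL compiler's.

#include <stdbool.h>
#include <stdio.h>

#define MAX_REGS 64

typedef struct {
    bool live_on_exit[MAX_REGS];   /* registers live on exit from the block */
} BlockLiveness;

/* Returns true (and records the new definition) if an instruction
   defining register def_reg may be moved speculatively into the
   block; returns false if the register is live on exit, in which
   case the motion is disallowed.                                   */
bool try_speculative_move(BlockLiveness *b, int def_reg)
{
    if (b->live_on_exit[def_reg])
        return false;                  /* would clobber a live value     */
    b->live_on_exit[def_reg] = true;   /* dynamic update after the move  */
    return true;
}

int main(void)
{
    BlockLiveness b1 = { { false } };
    int rx = 7;                        /* the register holding x         */
    printf("x=5 moved: %d\n", try_speculative_move(&b1, rx));   /* 1 */
    printf("x=3 moved: %d\n", try_speculative_move(&b1, rx));   /* 0 */
    return 0;
}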
5.4. Scheduling examples

Let us demonstrate the effect of useful and speculative scheduling on the example program of Figure 2. The result of applying the useful scheduling to the program of Figure 2 is presented in Figure 5, while Figure 6 shows the result of applying both the useful and the (1-branch) speculative scheduling to the same program. A more detailed description of the scheduled code is out of the scope of this paper.

Figure 5. The result of applying the useful scheduling to the program of Figure 2 (the rescheduled loop code)

During the useful scheduling of BL1, the only instructions that could be moved into BL1 were those of BL10, since only BL10 ∈ EQUIV(BL1); two instructions of BL10 (I18 and I19) were moved there, filling the delay slots of the original program. Similarly, I8 was moved from BL4 to BL2, and I15 was moved from BL8 to BL6. The resultant program of Figure 5 takes 12-13 cycles per iteration, while the original program of Figure 2 was executing in 20-22 cycles per iteration.

Figure 6. The result of applying both the useful and the speculative scheduling to the program of Figure 2 (the rescheduled loop code)

In addition to the motions described above, the speculative scheduling moved two additional instructions (I5 and I12) into BL1, where they fill in the three cycle delay between the compare and branch instructions I3 and I4. Interestingly enough, since I5 and I12 belong to basic blocks that are never executed together in any single iteration of the loop, only one of these two instructions will carry a useful result. All in all, the program of Figure 6 takes 11-12 cycles per iteration, a one cycle improvement over the program of Figure 5.
6. Performance evaluation

Next we describe how the global scheduling scheme was configured and evaluated on the IBM RS/6K machines, whose abstract machine model is presented in Section 2.1, and discuss several of the design decisions. For experimentation purposes, the global scheduling has been embedded into the IBM XL family of compilers. These compilers support several high-level languages, like C, Fortran, Pascal, etc.; however, here we concentrate on the C programs only. The evaluation was done on the four C programs of the SPEC benchmark suite [S89]: GCC stands for the GNU C Compiler, LI denotes the Lisp Interpreter, while EQNTOTT and ESPRESSO represent programs for the manipulation of Boolean functions and equations.

The basis for all the following comparisons (denoted by BASE in the sequel) is the performance results of the same IBM XL C compiler with the global scheduling disabled. Please notice that the base compiler already includes a sophisticated basic block instruction scheduler, similar to that of [W90], as well as machine-level (peephole) optimization.

In a preparation step, before the global scheduling is applied, two types of code replication techniques are employed:

- certain inner loops are unrolled: inner loops with up to 4 basic blocks are unrolled once, i.e., two iterations of the loop are executed within the body of the unrolled loop instead of one;
- certain inner loops are rotated, i.e., some of the instructions that belong to the first basic block of a loop are copied after the end of the loop. By rotating loops we achieve a partial effect of software pipelining, in which instructions of the next iteration of the loop overlap with the execution of instructions of the current iteration. So, in some sense, certain improvements due to the global scheduling are similar to those of the scheduling techniques that solve the loop-closing delay problems, e.g. [GR90].

The general flow of the global scheduling is as follows:

1. certain inner loops are unrolled;
2. the global scheduling is applied the first time, to the inner regions only;
3. certain inner loops are rotated;
4. the global scheduling is applied the second time, to the rotated inner loops and to the outer regions.

There are several additional decisions that characterize the current status of the prototype:

- Only "small" regions are scheduled; "small" regions are those that have at most 64 basic blocks and 256 instructions.
- Only reducible regions are considered (see Section 4.1).
- Only two levels of regions are scheduled: we distinguish between inner regions (i.e., regions that do not include other regions) and outer regions (i.e., regions that include the inner regions).
The compile-time overhead of the above described scheme is shown in Figure 7. The column marked BASE gives the compilation times in seconds, as measured on an IBM RS/6K machine model 530, whose cycle time is 40ns, and the column marked CTO (Compile-Time Overhead) gives the increase in the compilation time in percents. This increase includes the time required to perform all of the above described steps (loop unrolling, loop rotation, duplication of code, the global scheduling itself, etc.). The accuracy of the measurements is about 0.5%-1%.

    PROGRAM     BASE (sec)    CTO
    LI              206       13%
    EQNTOTT          78       17%
    ESPRESSO        465       12%
    GCC            2457       13%

    Figure 7. Compile-time overheads for the global scheduling

We consider this compile-time overhead to be reasonable, especially since no major steps were taken to reduce it, except of the control over the size of the regions that are being scheduled.

The run-time improvement (RTI) for both types of scheduling, namely useful only and useful plus speculative scheduling, is shown in Figure 8, in percents relative to the running time of the code compiled by the base compiler.

    PROGRAM     BASE (sec)    USEFUL    SPECULATIVE
    LI              312         0%          6.9%
    EQNTOTT          45         7.1%        7.3%
    ESPRESSO        106        -0.5%        2.0%
    GCC              76        -1.5%        0%

    Figure 8. Run-time improvements for the global scheduling
We notice from Figure 8 that for EQNTOTT most of the improvement is due to the useful scheduling, while for LI the speculative scheduling is dominant. On the other hand, for both ESPRESSO and GCC, when the global scheduling is restricted to useful scheduling only, there is a slight degradation in the running time, and only a small improvement (if any) is observed for the full scheme. We notice also that the achieved improvement in run time is due to the global scheduling on its own, aside of all the optimizations (basic block scheduling, peephole optimization, etc.) that were already part of the base compiler, i.e., it is relative to code that has already been optimized by an optimizing compiler. To summarize our short experience with the prototype, the compile-time overhead is modest and the run-time improvement, while still preliminary, is encouraging.

7. Summary

We presented a global instruction scheduling scheme that allows the movement of instructions well beyond basic block boundaries, so as to improve the utilization of machine resources in superscalar processors. It is based on the Program Dependence Graph (PDG), a data structure proposed for parallel/parallelizing compilers, on a parametric machine description that covers a range of superscalar (and VLIW) machines, and on a flexible set of useful heuristics. There are two levels of scheduling, useful only and useful plus speculative. The results of evaluating the global scheduling scheme on the IBM RS/6K machines, within the IBM XL compiler family, are quite encouraging. We are going to extend our framework by supporting more aggressive speculative scheduling, scheduling with duplication of code, and better tuning of the heuristics, and we may expect even bigger payoffs in machines with a larger number of computational units.

Acknowledgements. We would like to thank Kemal Ebcioglu, Hugo Krawczyk and Ron Y. Pinter for many helpful discussions, and Vladimir Rainish and Irit Boldo for their help in the implementation.
References

[BEH89] Bradlee, D.G., Eggers, S.J., and Henry, R.R., "Integrating register allocation and instruction scheduling for RISCs", to appear in Proc. of the Fourth ASPLOS Conference, (April 1991).

[BG89] Bernstein, D., and Gertner, I., "Scheduling expressions on a pipelined processor with a maximal delay of one cycle", ACM Transactions on Prog. Lang. and Systems, Vol. 11, Num. 1 (Jan. 1989), 57-66.

[BJR89] Bernstein, D., Jaffe, J.M., and Rodeh, M., "Scheduling arithmetic and load operations in parallel with no spilling", SIAM Journal of Computing, (Dec. 1989), 1098-1127.

[BRG89] Bernstein, D., Rodeh, M., and Gertner, I., "Approximation algorithms for scheduling arithmetic expressions on pipelined machines", Journal of Algorithms, 10 (Mar. 1989), 120-139.

[CFRWZ] Cytron, R., Ferrante, J., Rosen, B.K., Wegman, M.N., and Zadeck, F.K., "An efficient method for computing static single assignment form", Proc. of the Annual ACM Symposium on Principles of Programming Languages, (Jan. 1989), 25-35.

[CHH89] Cytron, R., Hind, M., and Hsieh, W., "Automatic generation of DAG parallelism", Proc. of the SIGPLAN Annual Symposium, (June 1989), 54-68.

[E85] Ellis, J.R., "Bulldog: A compiler for VLIW architectures", Ph.D. thesis, Yale U/DCS/RR-364, Yale University, Feb. 1985.

[E88] Ebcioglu, K., "Some design ideas for a VLIW architecture for sequential-natured software", Proc. of the IFIP Conference on Parallel Processing, (April 1988), Italy.

[EN89] Ebcioglu, K., and Nakatani, T., "A new compilation technique for parallelizing regions with unpredictable branches on a VLIW architecture", Proc. of the Workshop on Languages and Compilers for Parallel Computing, (August 1989).

[F81] Fisher, J., "Trace scheduling: A technique for global microcode compaction", IEEE Trans. on Computers, C-30, No. 7 (July 1981), 478-490.

[FOW87] Ferrante, J., Ottenstein, K.J., and Warren, J.D., "The program dependence graph and its use in optimization", ACM Transactions on Prog. Lang. and Systems, Vol. 9, Num. 3 (July 1987), 319-349.

[GM86] Gibbons, P.B., and Muchnick, S.S., "Efficient instruction scheduling for a pipelined architecture", Proc. of the SIGPLAN Annual Symposium, (June 1986), 11-16.

[GO89] Groves, R.D., and Oehler, R., "An IBM second generation RISC processor architecture", Proc. of the IEEE Conference on Computer Design, (October 1989), 134-137.

[GR90] Golumbic, M.C., and Rainish, V., "Instruction scheduling beyond basic blocks", IBM J. Res. Dev., (Jan. 1990), 93-98.

[HG83] Hennessy, J.L., and Gross, T., "Postpass code optimization of pipeline constraints", ACM Trans. on Programming Languages and Systems 5 (July 1983), 422-448.

[JW89] Jouppi, N.P., and Wall, D.W., "Available instruction-level parallelism for superscalar and superpipelined machines", Proc. of the Third ASPLOS Conference, (April 1989), 272-282.

[L88] Lam, M., "Software pipelining: An effective scheduling technique for VLIW machines", Proc. of the SIGPLAN Annual Symposium, (June 1988), 318-328.

[P85] Patterson, D.A., "Reduced instruction set computers", Comm. of ACM, (Jan. 1985), 8-21.

[S89] "SPEC Newsletter", Systems Performance Evaluation Cooperative, Vol. 1, Issue 1, (Sep. 1989).

[SLH90] Smith, M.D., Lam, M.S., and Horowitz, M.A., "Boosting beyond static scheduling in a superscalar processor", Proc. of the Computer Architecture Conference, (May 1990), 344-354.

[W90] Warren, H., "Instruction scheduling for the IBM RISC System/6000 processor", IBM J. Res. Dev., (Jan. 1990), 85-92.