Increasing Effective IPC by Exploiting Distant Parallelism
Ivan Martel, Daniel Ortega, Eduard Ayguade and Mateo Valero
Departament d'Arquitectura de Computadors,
Universitat Politecnica de Catalunya, Barcelona, Spain
e-mail: {imartel,dortega,eduard,mateo}@ac.upc.es
Abstract
The main objective of compiler and processor designers is to effectively exploit the instruction-level parallelism (ILP) available in applications. Although their research activities have mostly been conducted separately, we believe that a stronger co-operation between them will be needed to make effective the increase in potential ILP offered by future architectures. Nowadays, most computer architecture achievements aim at overcoming the hurdle imposed by dependencies in the code by extracting parallelism from large instruction windows. However, implementation constraints limit the size of this window and therefore the visibility of the program structure at run-time.
In this paper we show the existence of distant parallelism that future compilers could detect. By distant parallelism we mean parallelism that cannot be captured by the processor instruction window and that can produce threads suitable for parallel execution in a multithreaded processor. Although this parallelism also exists in numerical applications (going far beyond classical loop parallelism and usually known as task parallelism), we focus on non-numerical applications, where the data and computation structures make the detection of concurrent threads of execution difficult.
Some preliminary but encouraging results are presented in the paper, reporting speed-ups in the range of 1.2 to 2.65. These results seem promising and offer a new insight into the detection of threads for current and future multithreaded architectures. It is important to notice at this point that the benefits described herein are totally orthogonal to any other architectural techniques targeting a single thread.
1 Introduction
The parallelism exhibited by programs depends not only on
the program execution model but also on the architecture
under which they are executed. Many theoretical or limit
studies have focused on analysing the available parallelism in
programs under different architectural constraints. One of the first studies [20] reported average IPCs (instructions per cycle) between 2 and 3. However, that study did not speculate on control in any way, reducing the window from which to extract parallelism to a single basic block.
Later studies [15] showed the importance of branch prediction in increasing IPC. Branch prediction allows the exploitation of large amounts of ILP (instruction-level parallelism) by looking for parallelism across basic block boundaries. Doing so implies speculating on control, which relies on the effectiveness of the branch prediction scheme.
More recent limit studies have focused on removing false
dependencies (either between registers or memory locations)
[7, 26], analysing their effects and discussing the feasibility of removing them. Register renaming is crucial for taking advantage of the parallelism in programs [21].
Recent papers [14] show that false dependencies due to the reuse of storage locations are not the only limit on the parallelism that can be exploited. That work also investigates other compiler-induced dependencies. These dependencies, dynamically detected as true data dependencies by the architecture, are introduced not by the algorithm itself but by the way the compiler expresses the computation; some of them can be avoided by using different code generation techniques.
Sometimes these limit studies have motivated architectural proposals that try to achieve the theoretical IPCs observed. This is not always possible, however, because such studies totally relax certain architectural constraints (e.g. perfect branch prediction or unbounded resources). Even without assuming limit conditions, the proposals may not be worth implementing, as is the case for very large instruction windows [11].
The main way of increasing IPC, and therefore of speeding up applications, has always been the exploitation of the inherent parallelism in programs, either through software techniques or hardware mechanisms. Although the majority of previous research in ILP has focused on the performance of a single thread of execution, a more effective increase in ILP can be achieved by executing multiple threads from the same program. We strongly believe that this increment in IPC should come from a combined effort in the design of algorithms, compiler techniques and computer architecture. In any case, the extraction of parallelism from programs is not an easy task, and it rests on the analysis and detection of data and control dependencies.
There have been several proposals aimed at overcoming data and control dependencies. As said before, register renaming can efficiently remove false data dependencies across registers. Recent proposals try to predict values in order to break true data dependence chains and thereby expose more parallelism [8]. To exploit higher degrees of ILP it is necessary to look for parallelism across basic block boundaries, with support from effective branch prediction schemes [17, 28]. This control speculation allows the simultaneous execution of instructions from different basic blocks and has given rise to novel architectures such as the multiscalar [18] and trace [16] processors.
In order to further increase the number of instructions from which to exploit parallelism, multithreaded architectures [27, 23] have been proposed. Threads coming from the same application are usually found in parallel loops detected by the compiler [22]. Hardware mechanisms have been proposed to detect dependence violations when loops whose data dependence patterns cannot be determined at compile time are executed speculatively as parallel loops [19]. Other proposals try to detect these loops dynamically and extract their semantic information at run-time [9]. Notice that this level of loop detection seems to be the frontier between hardware and software mechanisms: parallel loops are much better recognised by software, while other loop structures, more complex or with unpredictable dependence patterns, have to be detected or speculated via hardware. The work presented in [24] goes even a bit further by speculating on data dependencies between a loop and its continuation.
Programs usually have much more parallelism than the hardware can dynamically detect. The hardware 'sees' only a tiny portion of the program being executed, because of the limitations of its instruction window and the limited semantics of the instructions. In our work, we intend to exploit non-structured thread parallelism that could be statically extracted from the source code by the compiler. In addition to the loop-level parallelism detected by current parallelising compilers (such as POLARIS [3] or SUIF [6]), non-structured parallelism can be detected by accurately combining the analysis of control and data dependencies in a hierarchical task graph [12] (as in the Parafrase-2 [13] and PROMIS [4] compilers). However, applications sometimes exhibit parallelism between zones of code that are very far apart (in terms of the number of instructions executed between them), which cannot be detected automatically by the compiler because of the limited scope of its analysis techniques, or because the parallelism is hidden by the data and computation structures used in the application. Many numerical applications also show multiple levels of parallelism, combining both task parallelism (usually at the coarser level) and loop-level parallelism [1, 2].
Non-numerical applications also have non-homogeneous parallelism between zones that are far apart. Exploiting this parallelism, however, may require parallelising techniques analogous to the ones already used for loops, applied in a more elaborate way. Therefore, the first objective of this paper is to demonstrate that non-numerical applications show high degrees of thread-level parallelism and that remarkable benefits can be obtained by exploiting it. The second objective is to show some of the compiler transformations (similar to the ones currently applied in the parallelisation of numerical applications) that would be required to exploit this parallelism. Four benchmarks from SPEC95int are used (compress, m88ksim, go and ijpeg). We manually generate threads for them (using standard thread creation and synchronisation system calls) and use an execution-driven environment to simulate the execution of these threads on an ideal processor. The preliminary performance figures reported for this ideal processor try to reveal sources of parallelism in non-numerical applications that will probably never be discovered at run-time and that are worth detecting by future parallelising compilers.
The organisation of the paper is as follows. Section 2
presents a description of the types of parallelism that can
be found in both numerical and non-numerical applications.
Section 3 describes in detail the analysis of each of the four
benchmarks from SPEC95int. Section 4 summarizes the
compiler transformations used for their parallelisation. Section 5 describes the simulation environment and presents
results obtained from this simulation on an ideal architecture. The paper ends with the conclusions in Section 6.
2 Parallelism in programs
The parallelism existing in programs can be classified according to the quantity of instructions it covers: instruction-level parallelism (ILP) and thread-level parallelism (TLP). ILP is exploited via the concurrent execution of instructions belonging to the same flow of control. TLP differs from ILP in that the units considered parallel are groups of instructions, even though the instructions within a particular group may have dependencies among them. TLP offers the advantage that different types of threads can co-exist in the processor at the same time, balancing the demands on different resources; however, inter-thread conflicts in the memory hierarchy can also reduce the final performance. Nevertheless, the benefits derived from TLP are orthogonal to those coming from ILP, which makes the combination of both techniques advisable and desirable.
In contrast to ILP, TLP is hard to obtain via hardware. Some researchers have proposed speculating on zones of execution with a high probability of being parallel, thus achieving TLP; nevertheless, these zones must be locally adjacent and completely parallel at the instruction level, still leaving higher levels of parallelism to be found at compile time. The amount of TLP found exclusively by compilers is very small, and it appears mainly in numerical applications at the level of loops. Non-numerical applications are, in general, considered non-parallelisable because of the data and computation structures they use. Normally, compilers need the help of programmers, by means of directives and assertions, multithreading libraries, or restructuring of the source code, to make parallelism available to them.
In the following subsections we analyse different forms of TLP that are not usually detected by current parallelising compilers, in both numerical and non-numerical programs. The threads detected encapsulate regions of code that can be executed in parallel and that are distant in terms of number of instructions (statically in the original source code and/or dynamically when they are executed). In the next section we focus on the parallelisation of some non-numerical applications from SPEC. Additional results for some numerical SPEC applications can be found in the extended version of this paper [10].
2.1 Numerical applications
Existing compiler techniques for finding parallelism in numerical applications refer primarily to loops. Different techniques have been proposed to analyse and transform codes in order to make loops totally parallel. Although other levels of parallelism can exist in numerical applications (usually at the level of parallel tasks), compilers usually fail to detect them automatically. Accurate interprocedural analysis and optimisation techniques, applied across procedure boundaries and data equivalences, are needed to successfully find these sources of parallelism.

Figure 1: Task graph for the SPEC95fp Turb3D program. Nodes 1-6: calls to ENR, with loop-level parallelism inside. Nodes 7-12 and 26-31: calls to ZFFT, with two levels of loop-level parallelism inside. Nodes 13-18 and 20-25: calls to XYFFT, with two levels of loop-level parallelism inside. Node 19: call to UXW, with loop-level parallelism inside. Node 32: sequential calls to DCOPY, LIN, LINAVG and MIXAVG, with loop-level parallelism inside.
The exploitation of multiple levels of parallelism in numerical applications is also an issue to consider. Loop-level parallelism sometimes yields poor scalability, either because the amount of computation is too small or because, even when the theoretical computation is high, data movement overheads tend to hide the benefits of parallel execution. Exploiting multiple levels of parallelism may distribute other sources of parallelism among groups of processors and therefore avoid these negative effects on scalability. In addition, the overheads related to thread creation and joining could be reduced considerably if higher (coarser) levels of parallelism were exploited.
Although certain combinations of application and architecture do not need new sources of parallelism, other combinations can benefit from the exploitation of multiple levels of parallelism (for example, clustered architectures in which the processors in a cluster exploit loop-level parallelism while outer levels of parallelism are exploited among clusters).
Notice that, in any case, the parallelism found in numerical applications is usually well structured (in the sense that all threads are assigned the same kind of computation) and synchronisation occurs at global points (involving all threads or a group of them) by means of barriers. Most current proposals for parallel constructs (like the ones in the OpenMP [5] extensions to Fortran and C/C++) are designed with this kind of parallelism in mind.
For example, Turb3D is a program from the SPEC95fp benchmark suite that simulates isotropic homogeneous turbulence in a three-dimensional cube. The application works primarily with six different three-dimensional matrices, called u, v, w, ox, oy and oz. Its main function, turb3d, contains an iterative time-step loop that consists of four loops, a call to uxw and some additional calls that implement the time-stepping scheme. The four loops are parallel and perform the same operation over the six matrices, differing in the way each loop accesses them. The functions called from each of the loops, zfft and xyfft, perform the FFT transformation over the different matrices, from the time domain to the frequency domain and vice versa. Between the first two loops and the last two, function uxw combines contributions from all matrices in order to produce new versions of some of them. Although it has no parallelism at the level of different matrices, uxw has parallelism at the level of loops. Similarly, the routines invoked to implement the time-stepping scheme also have loop-level parallelism. Figure 1 summarises the parallelism structure of this application.
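As an illustration of how these two levels could be expressed with the parallel constructs mentioned above, the following sketch opens one level of parallelism across the six matrices and a second, loop-level one inside each FFT call. It is our own hypothetical rendering of the structure of Figure 1 in C with OpenMP, not actual Turb3D code, and the helper names (zfft_all, fft_plane) stand in for the real routines.

    /* Hypothetical sketch of two-level parallelism in the style of
     * Figure 1: one thread per matrix at the task level, loop-level
     * parallelism inside each call. Nested parallelism must be
     * enabled (omp_set_nested). Not actual Turb3D code. */
    #include <omp.h>

    void fft_plane(double *a, int plane, int n);  /* stand-in for real work */

    void zfft(double *a, int n) {
        int i;
        #pragma omp parallel for          /* inner, loop-level parallelism */
        for (i = 0; i < n; i++)
            fft_plane(a, i, n);
    }

    void zfft_all(double *m[6], int n) {  /* e.g. nodes 7-12 of Figure 1 */
        int k;
        omp_set_nested(1);
        #pragma omp parallel for          /* outer, task-level parallelism */
        for (k = 0; k < 6; k++)
            zfft(m[k], n);
    }

In a clustered architecture, the outer loop would be spread among clusters while each inner loop is exploited by the processors of one cluster.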
In the extended version of this paper [10] we show the possible speed-ups for this application when combining different levels of parallelism. The highest speed-up achieved is 391.05, when all levels of parallelism are opened. Nevertheless, most of this speed-up (309.78) is already obtained when only two levels are opened: the highest one, at the level of sections, and the outer loop level. This behaviour has also been observed in the other application reported in the extended version of this paper.
2.2 Non-numerical applications
Numerical applications usually have regular control and data structures that can easily be understood by the compiler. Non-numerical applications, in contrast, use dynamically allocated data structures (such as pointers, lists and trees) accessed through one or several levels of indirection, which complicates the task of the compiler. Therefore, small amounts of parallelism are expected from non-numerical applications, and in general they are considered to be single threaded.
The main constraint we imposed on ourselves when looking for distant parallelism in the non-numerical applications was to avoid any change to the algorithms originally implemented in SPEC95 that would make them more amenable to parallelisation. It is also important to remark at this point that our purpose is to show the existence of distant parallelism in these applications; this has usually meant skipping any parallelism that could easily be captured by current processor execution windows.
The exploitation of distant parallelism is based on finding zones of code whose contents are semantically parallel. As we will see in Section 3, the parallelism found in non-numerical applications tends to be non-homogeneous and non-structured. False data dependencies among zones are eliminated by applying techniques similar to the ones currently used in the parallelisation of numerical codes (for instance, privatisation of scalar and structured variables). True data dependencies sometimes require a producer-consumer(s) co-ordination scheme, reduction operations around complex data structures and, very frequently, synchronisation between the threads at non-regular points of execution.
For example, Figure 2 shows one of the strategies used to parallelise some of the benchmarks. It is based on a thread that produces data to be consumed by a set of consumer threads; a set of queues buffers the data being transmitted from the producer to the consumer threads. This strategy has been used in the parallelisation of compress and ijpeg. Other strategies, based on the simultaneous execution of (data and control) dependent or independent threads, have been used in m88ksim, go and ijpeg.

Figure 2: Parallelisation strategy based on the One Producer-Many Consumers paradigm.
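To make the paradigm concrete, the following sketch shows a minimal bounded queue of the kind that decouples the producer from its consumers. It is our own illustration using standard POSIX threads, not the code used in the experiments, and all names and sizes are hypothetical.

    /* Minimal sketch of the One Producer-Many Consumers queue of
     * Figure 2. Illustrative only; names and sizes are hypothetical. */
    #include <pthread.h>

    #define QUEUE_SIZE 1024

    typedef struct {
        int             buf[QUEUE_SIZE];   /* codes in transit */
        int             head, tail, count;
        pthread_mutex_t lock;
        pthread_cond_t  not_empty, not_full;
    } queue_t;

    void queue_init(queue_t *q) {
        q->head = q->tail = q->count = 0;
        pthread_mutex_init(&q->lock, 0);
        pthread_cond_init(&q->not_empty, 0);
        pthread_cond_init(&q->not_full, 0);
    }

    /* Producer side: blocks only when the queue is full, so the
     * producer can work ahead of the consumers. */
    void queue_push(queue_t *q, int code) {
        pthread_mutex_lock(&q->lock);
        while (q->count == QUEUE_SIZE)
            pthread_cond_wait(&q->not_full, &q->lock);
        q->buf[q->tail] = code;
        q->tail = (q->tail + 1) % QUEUE_SIZE;
        q->count++;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->lock);
    }

    /* Consumer side: blocks only when the queue is empty. */
    int queue_pop(queue_t *q) {
        int code;
        pthread_mutex_lock(&q->lock);
        while (q->count == 0)
            pthread_cond_wait(&q->not_empty, &q->lock);
        code = q->buf[q->head];
        q->head = (q->head + 1) % QUEUE_SIZE;
        q->count--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->lock);
        return code;
    }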
3 Particular parallelisations
All the programs whose parallelisation is described in this section belong to the SPEC95int benchmark suite. Although we describe only four benchmarks, we believe that the techniques presented are representative and applicable to the rest of the suite. For each benchmark we describe the application itself, the structure of the parallelisation and its potential benefits.

3.1 M88ksim

Description of the program
The m88ksim program is a simulator of the Motorola 88100 processor. It implements the simulation of the processor datapath and a user-oriented environment for debugging, including the use of core files. It spends most of the execution time in the simulation phase.
The function that begins the execution of the simulated program is named go; it iteratively calls the Data_path function for every simulated instruction. Data_path has a complex structure: it simulates the data cache, the instruction cache, the memory management units, the functional units and the timing, and it verifies the triggering of breakpoints, all in an interlaced fashion. First, the function ckbrkpts checks whether the current program counter corresponds to a code breakpoint (if it does, Data_path returns). Then the cmmu function simulates the behaviour of the instruction cache memory management unit, even though the actual instruction is obtained later by calling getmemptr.
Once the simulator has a new instruction, it analyses the availability of the operands and resources; then a large case statement evaluates the operation code and performs the corresponding action. If the instruction is a memory reference, the data cache is simulated and a breakpoint check is performed. Finally, killtime updates the time in every structure according to the instruction latency, and the program counter is modified.
Structure of the parallelisation
Basically, there are three code sections that can be executed simultaneously with minimal data conflicts among them. These sections belong mainly to the Data_path function and are generated after applying code motion within and across procedure boundaries. They are shown in Figure 3 under the names timing, exe and fetch, representing the timing, the execution and the fetch of the next instruction, respectively.
Three threads are used to simulate the timing, two for the execution and one for the fetch mechanism. The timing threads primarily execute the section of code belonging to the function killtime. They assume that the variable cmmutime equals zero, which is nearly always the case except when a memory instruction misses in the data cache; such a miss increases the timing, and is represented in the figure by the dotted arrow labelled cmmutime. Breakpoints are checked during the execution phase; as breakpoints rarely occur, this thread is not always executed.
Figure 3: Structure of the threaded parallelisation in m88ksim, showing the TIMING, EXE and FETCH sections.
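The point-to-point synchronisation between the exe and timing threads could look like the following sketch: the timing thread proceeds assuming cmmutime is zero and only applies a correction when the exe thread has reported a data cache miss. This is a hypothetical illustration; all names other than cmmutime are ours, not m88ksim's.

    /* Hypothetical sketch of the exe/timing point-to-point
     * synchronisation. Names other than cmmutime are illustrative. */
    #include <pthread.h>

    static int             cmmutime;       /* extra cycles on a miss  */
    static int             miss_pending;   /* set by the exe thread   */
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    /* exe thread, after simulating a memory instruction that missed */
    void report_dcache_miss(int penalty) {
        pthread_mutex_lock(&m);
        cmmutime = penalty;
        miss_pending = 1;
        pthread_mutex_unlock(&m);
    }

    /* timing thread, once per simulated instruction */
    int timing_correction(void) {
        int extra = 0;
        pthread_mutex_lock(&m);
        if (miss_pending) {        /* rare: correct the assumption */
            extra = cmmutime;
            miss_pending = 0;
        }
        pthread_mutex_unlock(&m);
        return extra;              /* nearly always zero */
    }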
Potential benefits
The execution of function Data_path represents 90% of the total execution time when this program is executed with the test input. Every simulated Motorola instruction takes around 1,200 machine instructions on the host processor. The critical path that limits the performance is the timing simulation, which takes approximately 360 instructions. Therefore, the theoretical speed-up that can be achieved is 2.70.
3.2 Compress
Description of the program
The program that comes with the SPEC95 benchmark suite is a modification of the original UNIX compression utility. Its main loop consists of 25 iterations of successive compressions and decompressions of nearly the same data. The data is slightly changed each iteration by adding characters at the end, but the number of characters added is very small in comparison with the total, representing less than 2 per ten thousand.
During the compression it is possible to detect repetitions of the same pattern of computation. The compression algorithm is an implementation of the LZW algorithm, which uses a translation table built on the fly in both the compression and the decompression phases. The compression algorithm is mainly made up of two different functions, compress and output. Compress takes one character from the input and merges it with the previous one; if this conjunction is found in the table, it takes its code and scans another character from the input. Sooner or later the conjunction of characters will not appear in the table, meaning that it has not yet appeared in the input data, and the compress function will introduce it into the table and produce a unique output code for it. The significant bits of these output codes range from 9 to 16, and to benefit more from the compression they are packed by the function output. Eventually the output codes will run out; at that point the program enters a repetitive task of checking the compression rate every 10,000 output codes. When this rate decreases, the current table is considered useless and a special code meaning table cleaning is produced. The process then starts again with a new table.
The decompression phase also has mainly two different functions, decompress and getcode. Getcode takes the input of the decompression phase (that is, the compressed data) and unpacks the codes, handing them to decompress, which performs the inverse of compress: it looks these codes up in the table, reproducing the original input.
Structure of the parallelisation
We have parallelised both the compression phase and the decompression phase. In each of them two threads have been created, in a producer-consumer fashion (Figure 2). In the compression phase, function compress acts as the producer of codes and the output function acts as the consumer. The producer thread passes the codes to the consumer thread through an intermediate queue, thus allowing the producer to work ahead of the consumer. The parallelisation also required some code motion and privatisation of variables.
In the decompression phase, the function getcode has been made the producer of codes and the function decompress the consumer. We will see later that this makes the parallelisation work better, because the smaller thread is the producer and the larger one is the consumer.
Potential benefits
In the sequential compression there are 81,396 calls to the function output before the first cleaning of the table occurs. Knowing that this sequential compression is 24,703,595 instructions long, the cycle of compression is, roughly, 303.49 instructions on average. We call a cycle of compression the amount of work done by the function compress in order to obtain a code and pass it to output, plus the work done by output to pack it.
After generating the threads, the number of instructions executed in function compress is 19,995,496 and in function output 5,059,462; therefore, the average number of instructions per cycle of compression is 245.65 for function compress and 62.15 for function output. If both threads could execute in parallel without any problem, the critical path would be reduced from 303.49 to 245.65 instructions, thus achieving a theoretical speed-up of 1.24.
An equivalent analysis has been done for the decompression phase. The cycle of decompression is analogous to the cycle of compression; from the number of instructions executed (13,778,302) and the number of cycles (81,396, the same as in the compression phase), the average length of a cycle of decompression is 169.27 instructions.
After generating the threads, the number of instructions executed in function decompress is 8,018,357 and in function getcode 5,366,955. This makes 98.51 instructions per cycle of decompression for decompress and 65.93 for getcode. If both threads could be executed in parallel without any problem, the critical path would be reduced from 169.27 to 98.51 instructions per cycle, thus achieving a theoretical speed-up of 1.72.
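In both phases the theoretical speed-up is simply the sequential cycle length divided by the length of the longest thread, since the queue lets the shorter thread run in the shadow of the longer one:

    S_{comp} = \frac{303.49}{\max(245.65,\ 62.15)} \approx 1.24,
    \qquad
    S_{dec} = \frac{169.27}{\max(98.51,\ 65.93)} \approx 1.72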
3.3 Go
Description of the program
The program that comes with the SPEC95 benchmark suite is a modified version of a go-playing program. The modified version allows the selection of the skill level, the size of the board, and the introduction of a set of moves to start with.
At the highest level, the program spends the biggest portion of its time executing function life, which analyses characteristics of a particular distribution of stones on the board. Life calls different functions that gather heuristic information to help decide the best move in every turn. Many of these functions call iscaptured, which performs a tactical analysis of a group and accounts for around 80-90% of the total execution time. A group is a set of stones which potentially controls a portion of the board.
Iscaptured modifies a large amount of the data stored in lists. The underlying code implements a general artificial intelligence algorithm that evaluates a speculative tree of moves. As the function gets deeper into the tree, it must update the structures locally in order to reflect the new game state. In the same way, it must restore the state when returning from any node to its father. Finally, the algorithm goes back to its initial state, returning a condition that is used to rearrange group armies or to detect critical spots.
Structure of the parallelisation
At a low level, the program spends most of its time in functions that manage lists. Nevertheless, we have focused our parallelisation at a coarser level, on the functions that gather heuristic information. In particular, we have parallelised functions bdead and findcaptured, which mainly contain a loop that calls iscaptured.
Several instances of iscaptured could be executed in parallel if local structures were privatised and if exclusive access to some lists could be guaranteed. The loops that contain the call to iscaptured also contain instructions that prevent the parallelisation, so loop distribution is applied in order to separate the parallel zone from the sequential one. The parallel part is the one containing a variable number of calls to iscaptured, which are executed in separate threads with private structures.
To overcome the problem of the accesses to lists, all modifications to lists are done locally in iscaptured. In the sequential part of the loop, these local modifications are then made global in a reduction scheme.
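A minimal sketch of the resulting structure follows: the distributed parallel loop runs the calls to iscaptured in separate threads over privatised lists, and the sequential part merges the local lists in a reduction. All types and names here (list_t, iscaptured_private and so on) are hypothetical stand-ins for the SPEC go structures.

    /* Hypothetical sketch of the go parallelisation: loop distribution
     * plus privatised lists merged in a sequential reduction. */
    #include <pthread.h>

    #define MAXG 64

    typedef struct list list_t;
    extern list_t *list_new(void);
    extern void    list_merge(list_t *global, list_t *local);
    extern int     iscaptured_private(int group, list_t *local);

    static int     groups[MAXG];       /* group ids to analyse  */
    static int     result[MAXG];
    static list_t *local_mods[MAXG];   /* privatised lists      */
    static int     idx[MAXG];

    static void *worker(void *arg) {
        int i = *(int *)arg;
        local_mods[i] = list_new();    /* private structures    */
        result[i] = iscaptured_private(groups[i], local_mods[i]);
        return 0;
    }

    void bdead_parallel(int n, list_t *global_lists) {
        pthread_t t[MAXG];
        int i;
        /* parallel part: a variable number of calls to iscaptured */
        for (i = 0; i < n; i++) {
            idx[i] = i;
            pthread_create(&t[i], 0, worker, &idx[i]);
        }
        for (i = 0; i < n; i++)
            pthread_join(t[i], 0);
        /* sequential part (after loop distribution): make the local
         * list modifications global in a reduction */
        for (i = 0; i < n; i++)
            list_merge(global_lists, local_mods[i]);
    }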
Potential benefits
The parallelised loops are the main ones in findcaptured and bdead. They take, respectively, 13% and 37% of the program time, mainly spent in iscaptured. As an upper bound we could expect a local speed-up equal to the mean number of calls to iscaptured in each function, 9.7 and 5.7 respectively, although this number varies between games. This ideal case would yield a global speed-up of 1.7. There is little penalty in the parallelisation because the threads are large enough. Furthermore, the second phase of the loops (the sequential part) is very small in comparison with the time consumed by iscaptured. However, the local speed-ups achieved are smaller than in the ideal case. The reason is the load imbalance due to the variation in the execution time of iscaptured, which will be discussed later.
3.4 Ijpeg
Description of the program
The program that comes with the SPEC95 benchmark suite is a version of the IJG JPEG application that compresses and decompresses, at multiple settings, a bitmap image previously loaded into memory, and produces statistics about the whole process. Conceptually it can be seen as a search for the optimal compression parameters, although no attempt is made to determine any quality/size trade-off.
The image is represented by three colour matrices defining RGB colours. The image is converted from this colour space to a luminance-chrominance colour space and then transformed into a frequency space via discrete cosine transforms. This new image is compressed with a Huffman encoding. This process constitutes the JPEG compression; an inverse transformation constitutes the decompression.
Structure of the parallelisation
Four different parts of the program have been analysed and parallelised. They cover over 60% of a standard execution of the benchmark. Two of these parts have been parallelised using a fixed number of threads, while the other two work in a producer-consumers fashion and thus have a parametrisable number of threads.
One of the parallelised zones is the conversion from RGB to YCC in the function rgb_ycc_convert. Three threads have been extracted from this code, each of them working on a particular colour. Similarly, function h2v2_merged_upsample uses the same type of parallelisation, although the colour conversion is done in the opposite direction. In both parallelisations only two threads are created, leaving the rest of the work to the main thread.
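The fixed-thread decomposition could be sketched as follows, with two helper threads converting one component each while the main thread converts the third. The signatures are illustrative only, not the actual IJG interfaces.

    /* Hypothetical sketch of the rgb_ycc_convert decomposition: one
     * thread per colour component, with the main thread taking the
     * third component. Signatures are illustrative. */
    #include <pthread.h>

    typedef struct {
        const unsigned char *rgb;   /* interleaved RGB input      */
        unsigned char       *out;   /* one output component plane */
        int                  npix;
        int                  comp;  /* 0 = Y, 1 = Cb, 2 = Cr      */
    } conv_arg_t;

    extern void convert_component(const conv_arg_t *a); /* per-pixel maths */

    static void *conv_thread(void *p) {
        convert_component((conv_arg_t *)p);
        return 0;
    }

    void rgb_ycc_convert_par(const unsigned char *rgb,
                             unsigned char *y, unsigned char *cb,
                             unsigned char *cr, int npix) {
        pthread_t t1, t2;
        conv_arg_t a0 = { rgb, y,  npix, 0 };
        conv_arg_t a1 = { rgb, cb, npix, 1 };
        conv_arg_t a2 = { rgb, cr, npix, 2 };
        pthread_create(&t1, 0, conv_thread, &a1);  /* Cb in a thread */
        pthread_create(&t2, 0, conv_thread, &a2);  /* Cr in a thread */
        convert_component(&a0);                    /* Y in the main thread */
        pthread_join(t1, 0);
        pthread_join(t2, 0);
    }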
A different parallelisation strategy has been used in function forward_DCT. This function iterates through all the blocks in the image performing the DCT, after which it descales the coefficients and stores them in the appropriate structures. We have used the producer-consumers paradigm (Figure 2) in this parallelisation, with the main thread in charge of distributing the blocks among the consumer threads. Similarly, function jpeg_idct_islow also consists of a main thread and a parametrisable number of consumer threads that are in charge of performing the DCT computation.
Potential benefits
We have theoretically analysed the potential benefits of the parallelisation of this program according to profile information and our knowledge of the different zones. If we assume that the execution time of each parallelised zone decreases by a factor equal to the number of threads, then we can estimate the total obtainable speed-up from the profile information.
The potential benefit of the parallelisation of the first two zones described, those dealing with colour transformation, is found by dividing their critical path by three. The other two zones have more parallel threads, thus potentially achieving a bigger reduction of their critical path.
We have analysed profile information from two different runs of this benchmark. The first one uses the standard input file (penguin.ppm), while the second one uses the test input file (specmun.ppm). The profile information differs noticeably between the two: the bigger input offers much more potential than the small one. The amount of code parallelised covers 63% and 57% of the execution, respectively. The time spent in each of the parallelised zones also decreases with the smaller input file; therefore, the potential benefits differ between the bigger and the smaller input.
With the bigger input we have calculated a potential speed-up of up to 2.04 with 16 threads, while the smaller input only shows a potential speed-up of 1.70 with the same number of threads. Nevertheless, both analyses show that most of the improvement is achieved when using up to 8 threads. With the bigger input, the speed-up values for 4, 8, 12 and 16 threads are 1.796, 1.953, 2.012 and 2.043, respectively.
4 Compiler techniques
In this section we summarize the compiler transformations applied in the parallelisation of the programs described in Section 3. All of them assume an accurate interprocedural analysis able to disambiguate memory references and efficiently derive alias information. In addition, the compiler framework should be able to apply some transformations that have been thoroughly studied in the field of parallelising compilers for numerical codes.
Most of the programs required some kind of code motion, both within the scope of a procedure and across procedure boundaries. Sometimes this has been applied simply to balance the amount of work done by the threads; this requires the estimation of execution costs, either through program profiling or static estimation. In other situations code motion has been applied to isolate sequential parts from parallel ones. For instance, loop distribution has been applied to partially parallelise some of the loops that appear in go.
Variable privatisation has been used extensively in all the programs. This privatisation involved scalar as well as structured variables (e.g. lists). The detection of reduction operations has also been applied in some of the programs to break recurrences. This implied the generation of private copies of the variables involved in the reduction and a sequential update to make their effect global. These reductions are usually applied to lists and other more complex data structures, which adds difficulty to the process.
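As a simple illustration of privatisation plus reduction, the hedged before/after sketch below uses a scalar accumulation; in the benchmarks the private copies are lists and the final sequential update is a merge rather than an addition. All names are illustrative.

    /* Before/after sketch of scalar privatisation plus reduction. */
    #define T 4                    /* number of threads (illustrative) */

    extern int cost(int item);

    /* Before: the accumulation into `total` is a recurrence that
     * serialises the loop. */
    int total_seq(const int *item, int n) {
        int total = 0, i;
        for (i = 0; i < n; i++)
            total += cost(item[i]);
        return total;
    }

    /* After: each thread accumulates into a private copy (thread t's
     * share is iterations [lo, hi)), and a short sequential update
     * makes the effect global. The per-thread loop would run in
     * parallel; it is written sequentially here for brevity. */
    int total_par(const int *item, int n) {
        int total_priv[T] = {0}, total = 0, t, i;
        for (t = 0; t < T; t++) {              /* conceptually parallel */
            int lo = t * n / T, hi = (t + 1) * n / T;
            for (i = lo; i < hi; i++)
                total_priv[t] += cost(item[i]);
        }
        for (t = 0; t < T; t++)                /* sequential reduction */
            total += total_priv[t];
        return total;
    }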
The semantic parallelism manually detected would require the construction of a task graph combining control and data dependences in the form of task precedences. A hierarchical definition of this graph, together with the above-mentioned code motion and cost estimation, would enable the detection of efficient distant parallelism. Satisfying the dependences among the threads executing these tasks has been accomplished by means of point-to-point synchronisation (as in m88ksim), by guaranteeing exclusive access to some data structures (as in go), or by using the producer-consumers paradigm (as in compress or ijpeg). For the latter, the compiler would have to realise that the task graph has 'narrow' zones (in terms of precedences), which represent the places where the producer thread passes information to the consumer thread. The compiler should introduce there the structures that allow the two threads to communicate and run asynchronously.
5 Experimental results
5.1 Simulation environment
In this section we describe the environment used to simulate the parallel execution of the non-numerical applications parallelised in Section 3. These parallelisations have been done using standard UNIX thread creation and synchronisation system calls. We have used the MINT execution-driven simulator [25] running on top of an SGI Origin2000 system.
Our simulation environment assumes that all instructions execute in one cycle and that memory is perfect (i.e. all loads and stores hit in cache). We do not take into account possible interferences that may adversely affect cache performance when running multiple threads.
MINT allows the definition of the cost, in terms of instructions, of all system calls, such as the ones used for thread creation, synchronisation and communication. In this paper we assume that they execute in one cycle, because we believe that the multithreaded architectures that will exploit this kind of parallelism will have instructions and architectural support to execute them.
Our current work targets the simulation of these parallel codes on a detailed processor simulator where all these
parameters are taken into account.
5.2 Analysis of results

The results presented in this section always refer to speed-ups, where the speed-up of a program is the total number of cycles the sequential version takes to complete divided by the total number of cycles of the parallelised version. The results always refer to complete executions.

5.2.1 M88ksim

The speed-up reported for the parallelisation of m88ksim is 2.65. The simulation was done using the test input, which consists of 500K Motorola instructions, mainly logic and memory instructions. Similar execution profiles are obtained with other input files corresponding to integer Motorola applications.
The difference with respect to the theoretical speed-up (reported in Section 3) is due to the variance of the execution phase. Although its mean size is smaller than the mean size of the timing phase, it becomes dominant for certain kinds of instructions. For example, when floating point instructions are simulated (10% of the time with the test input set used) the execution thread grows up to 15,000 instructions, making it impossible to find more tasks to perform in parallel. Memory instructions also tend to make the execution phase longer than the timing, reducing the potential speed-up. Sometimes, certain conditions prevent the execution of all the threads in parallel (e.g. when a trap occurs).

5.2.2 Compress

The speed-ups reported for the parallelisation of the compression and decompression parts of the compress program are 1.22 and 1.52, respectively.
The difference relative to the theoretical values reported in Section 3 is due to the large variance in the length of the threads, especially the producer thread in the compression. This variance correlates with the particular moment in the construction of the code translation table. At the start of the compression phase few codes have been introduced, so the threads are small; later, when more codes have been introduced into the table, the threads become larger. The overhead introduced by the parallelisation itself (queue management) also reduces the achievable speed-up.

5.2.3 Go

As a result of the parallelisation described in Section 3, the speed-ups obtained for functions findcaptured and bdead are 2.2 and 2.4, respectively. The parallelisation of these two functions results in a global speed-up of 1.4 for the whole go application.
Notice that the speed-up obtained is smaller than the one predicted in Section 3. The reason for these unexpected results is the large variance in the execution time of function iscaptured. For example, Figure 4 shows the probability of executing a particular number of instructions in this function when called from routine bdead (which performs 150,377 calls to iscaptured when executed with play level 40 and a board size of 19). The shape of the plot is similar for invocations from other routines. As a consequence, some threads take longer than others, reducing the potential benefits that a balanced execution could return.

Figure 4: Probability of executing a particular number of instructions (x10^5, horizontal axis) when calling procedure iscaptured.

5.2.4 Ijpeg

The speed-up reported for the parallelisation of program ijpeg depends on the number of threads (4, 8 or 12) devoted to the execution of the consumer parts in two of the four parallelised functions. The other two parts are always executed with the same number of threads (three). The different versions evaluated are labelled according to the number of consumer threads.
We have run simulations with both the test input file, specmun.ppm, and the standard input file, penguin.ppm. Figure 5 shows the speed-up achieved for these two input files and for the three configurations mentioned above. Notice that the speed-up ranges from 1.37 to 1.44 with the test input and from 1.48 to 1.57 with the standard one.
Notice also that the speed-ups reported are smaller than expected. This is due to excessive redundancy in the computations done in some threads: several threads read the same values and do the same computation to produce the same intermediate result. This redundancy has been introduced to avoid the overhead that the direct synchronisation of dependent threads would introduce.

Figure 5: Speed-up for ijpeg with 4, 8 and 12 consumer threads, for the test and standard inputs.
6 Conclusions
The main way of increasing IPC, and therefore of speeding up applications, has always been the exploitation of the inherent parallelism of programs, either through software techniques or hardware mechanisms. The majority of previous research in ILP has focused on the performance of a single thread of execution; however, a more effective increase in ILP can be achieved by executing multiple threads belonging to the same application. Although several previous proposals have focused on the dynamic detection of these threads (around loops), we argue for a combined effort, from both compiler and architecture, towards obtaining higher effective increments in IPC. The compiler should be able to detect distant parallelism (not captured by the hardware mechanisms included in the processor), and the processor should be able to exploit intra-thread parallelism and manage the multiple threads efficiently.
Parallel loops, which are currently at the frontier between hardware and software detection mechanisms, are the main source of threads in numerical applications. The limited visibility of hardware mechanisms does not allow the exploitation of the more distant parallelism that exists in both numerical and non-numerical applications. In this paper we have demonstrated that non-numerical applications can benefit from thread-level parallelism. We have used four SPEC95int applications (compress, m88ksim, go and ijpeg) to present different parallelisation strategies that require minimal changes to the applications. We have also shown that the compiler transformations applied are similar to the ones available in current parallelising compilers for numerical applications. The additional difficulty comes from the use of dynamically allocated data structures accessed through one or several levels of indirection; efficient and accurate interprocedural analysis techniques are required to overcome it.
The speed-ups reported by our simulations on an ideal processor (perfect memory and one-cycle execution for all instructions) show promising increases, in the range of 1.20 to 2.65. The results obtained are not to be seen as unreachable limits but as a new approach to the extraction of parallelism. The benefits described are orthogonal to any other architectural techniques focused on a single thread.
We consider this paper to be the first stage of a long-term research effort in combining software and hardware techniques for effectively increasing IPC. We expect architectural proposals to derive from our current investigation.
7 Acknowledgments
This work was supported by the Ministry of Education of Spain under contracts CICYT TIC98-0511 and TIC97-1445-CE and grant AP98-42879678, by the Direccio General de Recerca under grant 1998FI-00292-APTIND, and by CEPBA. The authors wish to thank Jesus Labarta, Jesus Corbal and Xavier Martorell for the time devoted to fruitful discussions and for their help in understanding some of the benchmarks.
References
[1] E. Ayguade, X. Martorell, J. Labarta, M. Gonzalez, and N. Navarro. Exploiting parallelism through directives on the nano-threads programming model. 10th International Workshop on Languages and Compilers for Parallel Computing, August 1997.
[2] H.E. Bal and M. Haines. Approaches for integrating task and data parallelism. IEEE Concurrency, July-September 1998.
[3] W. Blume, R. Eigenmann, J. Hoeflinger, D. Padua, P. Petersen, L. Rauchwerger, and P. Tu. Automatic detection of parallelism: A grand challenge for high performance computing. IEEE Parallel and Distributed Technology, Fall 1994.
[4] C.J. Brownhill, A. Nicolau, S. Novack, and C.D. Polychronopoulos. The PROMIS compiler prototype. 1997 Conference on Parallel Architectures and Compilation Techniques, June 1997.
[5] OpenMP Organization. Fortran Language Specification, v. 1.0. www.openmp.org/openmp/mp-documents/fspec.ps, October 1997.
[6] M.W. Hall, J.M. Anderson, S.P. Amarasinghe, B.R. Murphy, S.W. Liao, E. Bugnion, and M.S. Lam. Maximizing multiprocessor performance with the SUIF compiler. IEEE Computer, December 1996.
[7] N.P. Jouppi and D.W. Wall. Available instruction-level parallelism for superscalar and superpipelined machines. 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, May 1989.
[8] M.H. Lipasti and J.P. Shen. Exceeding the dataflow limit via value prediction. 29th Annual International Symposium on Microarchitecture, December 1996.
[9] P. Marcuello and A. Gonzalez. Speculative multithreaded processors. ACM International Conference on Supercomputing, 1998.
[10] I. Martel, D. Ortega, E. Ayguade, and M. Valero. Increasing effective IPC by exploiting distant parallelism. Technical Report UPC-DAC-1998-59, Departamento de Arquitectura de Computadores, Universidad Politecnica de Cataluña, Barcelona, December 1998.
[11] S. Palacharla, N. Jouppi, and J.E. Smith. Complexity-effective superscalar processors. 24th Annual International Symposium on Computer Architecture, June 1997.
[12] C.D. Polychronopoulos. Nano-threads: Compiler driven multithreading. 4th International Workshop on Compilers for Parallel Computing, November 1993.
[13] C.D. Polychronopoulos, M. Girkar, M.R. Haghighat, C.L. Lee, B. Leung, and D. Schouten. Parafrase-2: An environment for parallelizing, partitioning, and scheduling programs on multiprocessors. International Journal of High Speed Computing, 1989.
[14] M.A. Postiff, D. Greene, G. Tyson, and T. Mudge. The limits of instruction level parallelism in SPEC95 applications. 3rd Workshop on Interaction between Compilers and Computer Architectures, October 1998.
[15] E.M. Riseman and C.C. Foster. The inhibition of potential parallelism by conditional jumps. IEEE Transactions on Computers, December 1972.
[16] E. Rotenberg, Q. Jacobson, Y. Sazeides, and J.E. Smith. Trace processors. 30th International Symposium on Microarchitecture, December 1997.
[17] J.E. Smith. A study of branch prediction strategies. 8th Annual International Symposium on Computer Architecture, 1981.
[18] G.S. Sohi, S.E. Breach, and T.N. Vijaykumar. Multiscalar processors. 22nd Annual International Symposium on Computer Architecture, June 1995.
[19] J.G. Steffan and T.C. Mowry. The potential for using thread-level data speculation to facilitate automatic parallelization. Fourth International Symposium on High-Performance Computer Architecture, February 1998.
[20] G.S. Tjaden and M.J. Flynn. Detection and parallel execution of independent instructions. IEEE Transactions on Computers, October 1970.
[21] R.M. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM Journal of Research and Development, January 1967.
[22] J.Y. Tsai and P.C. Yew. The superthreaded architecture: Thread pipelining with run-time data dependence checking and control speculation. International Conference on Parallel Architectures and Compilation Techniques, October 1996.
[23] D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, and R.L. Stamm. Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor. 23rd Annual International Symposium on Computer Architecture, June 1996.
[24] S. Vajapeyam, T. Mitra, P.J. Joseph, and A. Mukherjee. Dynamic vectorization: The potential of exploiting repetitive control flow. Technical Report IISc-CSA-98-08, Dept. of Computer Science and Automation, Indian Institute of Science, August 1998.
[25] J.E. Veenstra and R.J. Fowler. MINT tutorial and user manual. Technical Report 452, Computer Science Department, The University of Rochester, June 1993.
[26] D.W. Wall. Limits of instruction-level parallelism. 4th International Conference on Architectural Support for Programming Languages and Operating Systems, April 1991.
[27] W. Yamamoto and M. Nemirovsky. Increasing superscalar performance through multistreaming. International Conference on Parallel Architectures and Compilation Techniques, October 1995.
[28] T-Y. Yeh and Y.N. Patt. Alternative implementations of two-level adaptive branch predictors. 19th Annual International Symposium on Computer Architecture, May 1992.