
Jacob with Berry

Charanjit Kandola, Daniel Fagerlie


Due: June 2, 2017

1 Introduction
Jacobi iteration is a nifty solution to Laplace's equation. It is easy to implement in the
concurrency paradigm, so greater performance can be achieved on a
multi-processor/core system. We describe the structure of our implementation
from the viewpoint of an object-oriented language and the automated experiment
trials, which were processed in R to produce the results. We saw a good
climb in speedup from one thread to two threads, but each additional
thread had less of an impact. A peak was reached when using all the
cores of the system.

2 Implementation
We used Java to implement the algorithm. We have a Main class that accepts
the matrix to be processed, a Jacobi class that divides up the matrix and creates
an array of threads, and Work objects, created by the Jacobi class, that
actually implement the Jacobi algorithm and its rendezvous.
The Main class assigns the size of the matrix and the number of threads to
use to implement the algorithm. It is passed the name of the text file holding the
matrix, loads the matrix, and creates a Jacobi object that receives the number
of threads to use. The Main class then keeps track of how long the Jacobi takes
to run using Java's nanoTime() function and determines how long each thread
took by calculating an average.
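The timing in Main can be sketched as follows. Only System.nanoTime() is taken from our code; the class name, runAll(), and the thread count are illustrative stand-ins for starting and joining the Jacobi threads.

```java
public class TimingSketch {
    // Hypothetical stand-in for Jacobi starting and joining its threads.
    static void runAll(int numThreads) throws InterruptedException {
        Thread[] threads = new Thread[numThreads];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> { /* Jacobi work would go here */ });
            threads[t].start();
        }
        for (Thread th : threads) th.join();
    }

    // Returns elapsed wall-clock seconds, measured as described above:
    // nanoTime() captured before starting and after joining the threads.
    public static double timeSeconds(int numThreads) throws InterruptedException {
        long start = System.nanoTime();
        runAll(numThreads);
        long elapsed = System.nanoTime() - start;
        return elapsed / 1e9;
    }
}
```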
The Jacobi class receives the matrix, its size, and the number of threads
to use. It divides up the matrix based on its size and the number of threads,
producing starting and ending i and j values for use in a for loop. The matrix
is broken up horizontally so each thread can take advantage of a row-major-order
language. Once the i and j values are calculated, the Jacobi class creates
a thread from a Work object and adds it to its array of threads.
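A minimal sketch of the horizontal split, assuming rows are divided as evenly as possible; the method name and the exact remainder handling are our assumptions, not necessarily what Jacobi does:

```java
public class Partition {
    // Returns {startRow, endRow} (end exclusive) for each thread, splitting
    // the matrix horizontally so each thread scans contiguous rows in
    // row-major order.
    static int[][] partitionRows(int size, int numThreads) {
        int[][] bounds = new int[numThreads][2];
        int rowsPerThread = size / numThreads;
        for (int t = 0; t < numThreads; t++) {
            bounds[t][0] = t * rowsPerThread;
            // The last thread absorbs any remainder rows.
            bounds[t][1] = (t == numThreads - 1) ? size : (t + 1) * rowsPerThread;
        }
        return bounds;
    }
}
```

For a 10-row matrix and 3 threads this yields the ranges [0,3), [3,6), and [6,10).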
Work is a nested class within the Jacobi class that implements the Runnable
interface. It receives the starting and ending i and j values, which are different
for every Work object. It uses these to implement the Jacobi algorithm, but
experiences a rendezvous at three different points. The threads each calculate
a difference in values from the original matrix and store the new values in a new
matrix (called newMatrix) before entering the first rendezvous. Each thread's difference then
gets compared to the differences from the other threads to calculate a maximum
difference in the second rendezvous. When the last thread has arrived at the
second rendezvous, the maximum difference is compared to epsilon. If the
difference is not less than epsilon, each thread rewrites its portion of the
original matrix with the corresponding portion of newMatrix and enters
the third rendezvous. The entire process repeats until the maximum difference
is found to be less than epsilon.
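The loop above can be sketched compactly with java.util.concurrent.CyclicBarrier standing in for our hand-rolled synchronized rendezvous. The three await() calls mark the three rendezvous points described above; all names, the boundary conditions, and the placement of the max reduction are our assumptions, not the code as written.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

public class JacobiSketch {
    static double[][] matrix, newMatrix;
    static double[] localMax;                 // one slot per thread
    static volatile double maxDiff;           // global maximum difference
    static final AtomicInteger iterations = new AtomicInteger();
    static final double EPSILON = 1e-4;

    public static int solve(int size, int numThreads) throws InterruptedException {
        matrix = new double[size][size];
        newMatrix = new double[size][size];
        for (int j = 0; j < size; j++) matrix[0][j] = 1.0;  // fixed hot top edge
        localMax = new double[numThreads];
        iterations.set(0);

        CyclicBarrier b1 = new CyclicBarrier(numThreads);
        // Barrier 2's action runs once per iteration, after every thread has
        // arrived: it reduces the per-thread maxima into the global maximum.
        CyclicBarrier b2 = new CyclicBarrier(numThreads, () -> {
            double m = 0;
            for (double d : localMax) m = Math.max(m, d);
            maxDiff = m;
            iterations.incrementAndGet();
        });
        CyclicBarrier b3 = new CyclicBarrier(numThreads);

        int rows = size / numThreads;
        Thread[] threads = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            final int id = t;
            final int startI = t * rows;
            final int endI = (t == numThreads - 1) ? size : (t + 1) * rows;
            threads[t] = new Thread(() -> {
                try {
                    do {
                        double local = 0;
                        for (int i = Math.max(startI, 1); i < Math.min(endI, size - 1); i++)
                            for (int j = 1; j < size - 1; j++) {
                                newMatrix[i][j] = 0.25 * (matrix[i - 1][j] + matrix[i + 1][j]
                                        + matrix[i][j - 1] + matrix[i][j + 1]);
                                local = Math.max(local, Math.abs(newMatrix[i][j] - matrix[i][j]));
                            }
                        localMax[id] = local;
                        b1.await();  // rendezvous 1: all differences stored
                        b2.await();  // rendezvous 2: global max reduced, checked below
                        if (maxDiff >= EPSILON)
                            for (int i = Math.max(startI, 1); i < Math.min(endI, size - 1); i++)
                                System.arraycopy(newMatrix[i], 1, matrix[i], 1, size - 2);
                        b3.await();  // rendezvous 3: copy-back complete
                    } while (maxDiff >= EPSILON);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            threads[t].start();
        }
        for (Thread th : threads) th.join();
        return iterations.get();
    }
}
```

Using a barrier action for the max reduction keeps the comparison serialized in exactly one thread per iteration, mirroring the trickle-through-a-synchronized-block behavior of the hand-rolled version.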

3 Experiment
Each trial was timed in the code itself by capturing System.nanoTime() before
starting the threads and after joining them, then taking the difference.
Because of this setup, thread creation, matrix loading, and other program
setup were excluded from the timings, so only the Jacobi algorithm itself
was timed.
In total, there were 1020 trials conducted (102 for each number of threads
used). We felt this would minimize errors introduced by other processes running
on the system during execution. During the trials, only Jacobi and its tester
program were run so as to limit background processes.
The Java compiler was javac 1.8.0_131, and the system was Linux (Ubuntu
16.04.2 LTS), running the i3 window manager.
The processor was an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz, with the
turbo feature turned off in order to limit processor frequency variation. This
was to help prevent the frequency from reaching 4.20GHz with 1 and 2 threads
while staying closer to 3.60GHz as more threads were added, possibly due
to Intel's scheme for managing heat and power consumption. During the
trials, the processor was observed running at a steady 3.60GHz
for all thread counts.

4 Results

Each point on the Mean line was created by averaging the 102 trials
in each thread test set. The ideal speedup (i.e., the theoretical limit) would occur if
100% of the code were parallelizable (e.g., two threads, twice the speedup).
During the 5- to 10-thread tests, processor usage hovered around 90%.

    n    Speedup_n    p (%)
    1    -            -
    2    1.475        64.4
    3    1.715        62.5
    4    1.748        57.1

The above table was created using the formula

    Speedup_n = 1 / ((1 - p/100) + (p/100)/n)

where n is the number of threads and p is the percentage of the running code done
in parallel. In this case, p was calculated from the collected Speedup_n and
n values from the trials.
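Solving the formula above for p gives p = 100 * (1 - 1/Speedup_n) / (1 - 1/n). A small sketch (method and class names are ours) reproduces the table's p column from its speedup column:

```java
public class Amdahl {
    // Inverts Speedup_n = 1 / ((1 - p/100) + (p/100)/n) to recover p,
    // the percentage of the running code done in parallel.
    static double parallelPercent(int n, double speedup) {
        return 100.0 * (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n);
    }
}
```

For example, parallelPercent(2, 1.475) gives about 64.4, matching the table.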

5 Discussion and Conclusion
Proportion Discrepancy
Since only the Jacobi algorithm was timed and the code appeared to be
mostly parallelizable, the p results from the table were surprising. They showed
around 60% of the time was spent in concurrent computation.
This didn't seem correct. The three main parts of Jacobi are the interior
calculations, max finding, and matrix copying. Each part is fully parallelized,
followed only by some sort of rendezvous. The rendezvous are where the threads
are forced to serialize, each having to trickle through a synchronized block
and perform simple check statements, with the exception of the epsilon check,
which requires only two more simple statements.
In each of the three parts, there is a statement inside a nested loop that
runs about m^2/n times, where m is the matrix size and n the number of threads.
Assuming about 5 statements per thread in each of the three rendezvous,
3 * (5 * 4) = 60 gives a rough estimate of the number of serial code statements
for 4 threads. Plugging in numbers for the parallel part, we get about
3 * 2048^2 / 4 = 3.14x10^6 statements per thread. Since the serial statements
are simple compared to the parallel statements, we should be safe in assuming
3.14x10^6 / (3.14x10^6 + 60) ≈ 99.9%, a high-nineties percentage of
possible concurrency.
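The estimate above can be checked directly; the constants are the ones used in the text.

```java
public class ConcurrencyEstimate {
    public static void main(String[] args) {
        long parallel = 3L * 2048 * 2048 / 4;  // 3 parts x m^2 / n statements per thread
        long serial = 3 * (5 * 4);             // 3 rendezvous x 5 statements x 4 threads
        double percent = 100.0 * parallel / (parallel + serial);
        System.out.printf("parallel=%d serial=%d concurrency=%.3f%%%n",
                parallel, serial, percent);
    }
}
```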


The most probable cause is that the compiler and hardware are doing some sort
of optimization on the matrix loops, causing the concurrent portions to take much
less time. Since the rendezvous would most likely receive little optimization, they
would take the same amount of time, bringing the ratio of the two much closer
together.
One piece of evidence seems to support this conclusion. The amount of time
for a single thread to run through the matrix on our system was about 8.77
seconds. In previously reported experiments, the time was around 6 minutes, a
significant difference. More information on the other system would be required.

Anomaly
Unexpected behavior appeared after the number of threads was increased beyond
4. The steady decline in speedup between 4 and 7 was probably due to
hyperthreading, with two contexts sharing some of the same resources on the
processor. We have no idea why the trend of decreasing speedup reversed at
8 threads, which showed a larger speedup than 7, and again at 10 compared with
9. Since each point is an average of 102 trials, this shouldn't be the result of some
anomaly. Further understanding of the OS and hardware is required.

Overall
If this were run on a system that served other users/applications and concurrency
were desired, then the most efficient use of resources looks to be 2 or 3
threads of execution. Using 4 might not offer enough advantage to justify tying
up processor time.

Concurrent programming seems to offer a great performance advantage to
those who know how to use it, especially when the algorithm can
be made concurrent as easily as Jacobi.
