Parallel DYNARE
FP7 Funded
Project MONFISPOL Grant no.: 225149
Marco Ratto
European Commission, Joint Research Centre, Ispra, ITALY
March 22, 2023
Contents
1 The ideas implemented in Parallel DYNARE
5 Conclusions
Abstract
In this document, we describe the basic ideas and the methodology identified to realize the parallel package within the DYNARE project (called the “Parallel DYNARE” hereafter). The package has been developed taking into account two different perspectives: the “User perspective” and the “Developers perspective”. The fundamental requirement of the “User perspective” is to allow DYNARE users to use the parallel routines easily, quickly and appropriately. Under the “Developers perspective”, on the other hand, we need to build a core of parallelizing routines that are sufficiently abstract and modular to be used as a sort of ‘parallel paradigm’, for application to any DYNARE routine or portion of code containing a suitable computational loop.
1 The ideas implemented in Parallel DYNARE
The basic idea behind “Parallel Dynare” is to build a framework to parallelize portions
of code that require a minimal (i.e. start-end communication) or no communications
between different processes, denoted in the literature as “embarrassingly parallel” (Goffe
and Creel, 2008; Barney, 2009). In more complicated cases there are different and more
sophisticated solutions to write (or re-write) parallel codes using, for example, OpenMP
or MPI. Within DYNARE, we can find many portions of code with the above features:
loops of computational sequences with no interdependency that are coded sequentially.
Clearly, this does not make optimal use of computers with 2, 4, 8 or more cores or CPUs. The basic idea is to assign the different and independent computational sequences to different cores, CPUs or computers, and to coordinate this new distributed computational environment according to the following criteria:
• provide the necessary input data to each sequence, possibly including results obtained from other sources;
• ensure the coherence of the results with the original sequential execution.
...
n=2;
m=10^6;
Matrix = zeros(n,m);
for i=1:n,
    Matrix(i,:) = rand(1,m);
end,
Mse = Matrix;
...
Example 1
With one CPU this cycle is executed in sequence: first for i=1, and then for i=2.
Nevertheless, these two iterations are completely independent, so, from a theoretical point of view, if we have two CPUs (cores) we can rewrite the above code as:
...
n=2;
m=10^6;
<provide to CPU1 and CPU2 the input data m>
<CPU1 executes> Matrix1(1,:)=rand(1,m); <CPU2 executes> Matrix2(1,:)=rand(1,m);
<retrieve the partial results and concatenate them> Mpa=[Matrix1; Matrix2];
...
Example 2
The for cycle has disappeared: it has been split into two separate sequences that can be executed in parallel on two CPUs. We obtain the same result (Mpa=Mse), but the computational time can be reduced by up to 50%.
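To make the pseudo-code concrete, the following plain MATLAB sketch emulates the same split within a single session (Matrix1 and Matrix2 stand for the slices computed by the two hypothetical CPUs); it only illustrates the structure of the decomposition, not an actual parallel execution:

n = 2;
m = 10^6;
% slice computed by (the hypothetical) CPU1:
Matrix1(1,:) = rand(1,m);
% slice computed by (the hypothetical) CPU2:
Matrix2(1,:) = rand(1,m);
% the master retrieves the two slices and concatenates them:
Mpa = [Matrix1; Matrix2];
% Mpa has the same size and statistical content as the serial result Mse of Example 1
disp(size(Mpa))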
1. the cycle looping over the parallel chains of the Random Walk Metropolis Hastings algorithm (random_walk_metropolis_hastings.m), which can be distributed over multiple cores/CPUs/Computer Network easily;
2. a number of procedures performed after the completion of Metropolis, that use the posterior MC sample:
(a) the diagnostic tests for the convergence of the Markov Chain (McMCDiagnostics.m);
(b) the function that computes the posterior IRF's (posteriorIRF.m);
(c) the function that computes posterior statistics for filtered and smoothed variables, forecasts, smoothed shocks, etc. (prior_posterior_statistics.m);
(d) the utility function that loads matrices of results and produces plots for posterior statistics (pm3.m).
Unfortunately, MATLAB does not provide commands to write parallel code as simply as in Example 2 (i.e. the pseudo-commands <provide inputs>, <execute on CPU> and <retrieve>). In other words, MATLAB does not allow concurrent programming: it does not support multi-threading without the use (and purchase) of the MATLAB Distributed Computing Toolbox. Thus, to obtain the behavior described in Example 2, we had to find an alternative solution.
The solution that we have found can be synthesized as follows:

When the execution of the code should start in parallel (as in Example 2), instead of running it inside the active MATLAB session, the following steps are performed:
1. the control of the execution is passed to the operating system, which launches a number of new MATLAB instances (the parallel ‘agents’ or slaves) on the available cores/CPUs/computers;
2. each of these instances executes its share of the independent computations and stores the results;
3. when the parallel computations are concluded, the control is given back to the original MATLAB session, which collects the results from all the parallel ‘agents’ involved and coherently continues along the sequential computation.
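As a purely illustrative sketch of step 1 (the command shown is an assumption and run_my_slave_task is a hypothetical script, not part of the package), launching an independent MATLAB process from the master session boils down to an operating-system call of this kind:

% Unix flavour: spawn an independent MATLAB process that runs a hypothetical slave
% script and exits, while the master session continues; the parallel package issues
% analogous calls through PsTools (Windows) or ssh (Unix), locally or remotely
system('matlab -nosplash -nodesktop -r "run_my_slave_task; exit" &');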
Three core functions have been developed to implement this behavior, namely masterParallel.m, slaveParallel.m and fParallel.m. The first function (masterParallel.m) operates at the level of the ‘master’ (original) thread: it acts as a wrapper of the portion of code to be distributed in parallel, distributes the tasks and collects the results of the parallel computation. The other functions (slaveParallel.m and fParallel.m) operate at the level of each individual ‘slave’ thread: they collect the jobs distributed by the ‘master’, execute them and make the final results available to the master. The two different implementations of the slave operation come from the fact that, in a single DYNARE session, there may be a number of parallelized sessions launched by the master thread. Therefore, those two routines reflect two different versions of the parallel package:
1. the ‘slave’ MATLAB sessions are closed after the completion of each single job, and new instances are called for any subsequent parallelized task (fParallel.m);
2. once opened, the ‘slave’ MATLAB sessions are kept open during the DYNARE session, waiting for the jobs to be executed, and are only closed upon completion of the DYNARE session on the ‘master’ (slaveParallel.m).
We will see that neither of the two options is always superior to the other: the best choice depends on the model size.
Here we describe how to run parallel sessions in DYNARE and, for the developers' community, how to apply the package to parallelize any suitable piece of code that may be deemed necessary.
3.1 Requirements
For a Windows grid:
1. a standard Windows network (SMB) must be in place;
2. PsTools (Russinovich, 2009) must be installed in the path of the master Windows machine;
3. the Windows user on the master machine has to be user of any other slave machine in the cluster, and that user will be used for the remote computations.

For a UNIX grid:
1. SSH must be installed on the master and on the slave machines;
2. the UNIX user on the master machine has to be user of any other slave machine in the cluster, and that user will be used for the remote computations;
3. SSH keys must be installed so that the SSH connection from the master to the slaves can be done without passwords, or using an SSH agent.
We assume here that the reader has some familiarity with DYNARE and its use. For the
DYNARE users, the parallel routines are fully integrated and hidden inside the DYNARE
environment.
The general idea is to put all the configuration of the cluster in a config file different from the MOD file, and to trigger the parallel computation with option(s) on the dynare command line. The configuration file is designed as follows:
• it should be in a standard location:
  – $HOME/.dynare under Unix;
  – c:\Documents and Settings\<username>\Application Data\dynare.ini on Windows;
• it can define one or more clusters, while the dynare command line only selects the cluster to be used for the parallel computation;
• for each cluster, it specifies a list of slaves with a list of options for each slave [if not explicitly specified by the configuration file, the preprocessor sets the options to default];
UserName : required for remote login; in order to assure proper communications be-
tween the master and the slave threads, it must be the same user name actu-
ally logged on the ‘master’ machine. On a Windows network, this is in the form
DOMAIN\username, like DEPT\JohnSmith, i.e. user JohnSmith in windows group
DEPT;
Password : required for remote login (only under Windows): it is the user password on
DOMAIN and ComputerName;
RemoteDrive : Drive to be used on remote computer (only for Windows, for example
the drive C or drive D);
[1] In Windows XP it is possible to find this name in 'My Computer' -> mouse right click -> 'Properties' -> 'Computer Name'.
RemoteDirectory : Directory to be used on remote computer, the parallel toolbox will
create a new empty temporary subfolder which will act as remote working directory;
The syntax of the configuration file will take the following form (the order in which
the clusters and nodes are listed is not significant):
[cluster]
Name = c1
Members = n1 n2 n3
[cluster]
Name = c2
Members = n2 n3
[node]
Name = n1
ComputerName = localhost
CPUnbr = 1
[node]
Name = n2
ComputerName = karaba.cepremap.org
CPUnbr = 5
UserName = houtanb
RemoteDirectory = /home/houtanb/Remote
DynarePath = /home/houtanb/dynare/matlab
MatlabOctavePath = matlab
[node]
Name = n3
ComputerName = hal.cepremap.ens.fr
CPUnbr = 3
UserName = houtanb
RemoteDirectory = /home/houtanb/Remote
DynarePath = /home/houtanb/dynare/matlab
MatlabOctavePath = matlab
parallel: triggers the parallel computation, using the first cluster specified in the config file;
parallel_test: just tests the cluster, without actually running the MOD file.
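For example, assuming a MOD file named mymodel.mod (the file name is just a placeholder), the two options are passed on the dynare command line issued at the MATLAB prompt:

% run the computations in parallel, using the first cluster in the configuration file
dynare mymodel.mod parallel
% only check the cluster configuration, without running the MOD file
dynare mymodel.mod parallel_test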
3.2.2 Preprocessing cluster settings
options_.parallel =
    struct('Local', Value,
           'ComputerName', Value,
           'CPUnbr', Value,
           'UserName', Value,
           'Password', Value,
           'RemoteDrive', Value,
           'RemoteFolder', Value,
           'MatlabOctavePath', Value,
           'DynarePath', Value);
All these fields correspond to the slave options except Local, which is set by the
pre-processor according to the value of ComputerName:
Local: the variable Local is binary, so it can take only two values, 0 and 1. If ComputerName is set to localhost, the preprocessor sets Local = 1 and the parallel computation is executed on the local machine, i.e. on the same computer (and working directory) where the DYNARE project is placed. For any other value of ComputerName, we will have Local = 0;
In addition to the parallel structure, which can be in a vector form to allow specific entries for each slave machine in the cluster, there is another options_ field, called parallel_info, which stores the options that are common to all clusters. In particular, according to the parallel_slave_open_mode option on the command line, its leaveSlaveOpen field takes the value 1 ('Always-open' mode: the slave MATLAB sessions are kept open until the end of the DYNARE session) or 0 (the default 'Open/Close' mode, where the slave sessions are closed after each parallelized task).
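As an illustration (the option names come from the text above; mymodel.mod is a placeholder MOD file), the opening mode can be selected directly on the dynare command line:

% default 'Open/Close' strategy: slave sessions are opened and closed for each parallelized task
dynare mymodel.mod parallel
% 'Always-open' strategy: slave sessions stay alive until the end of the DYNARE session
dynare mymodel.mod parallel parallel_slave_open_mode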
3.2.3 Example syntax for Windows and Unix, for local parallel runs (assuming quad-core)
In this case, the only slave options are ComputerName and CPUnbr.
[cluster]
Name = local
Members = n1
[node]
Name = n1
ComputerName = localhost
CPUnbr = 4
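As a sketch of the preprocessing step described in Section 3.2.2 (the values shown, in particular the empty defaults, are indicative and not the literal preprocessor output), the node n1 above would be translated into an options_.parallel entry of the form:

options_.parallel = struct('Local', 1, ...
                           'ComputerName', 'localhost', ...
                           'CPUnbr', 4, ...
                           'UserName', '', ...
                           'Password', '', ...
                           'RemoteDrive', '', ...
                           'RemoteFolder', '', ...
                           'MatlabOctavePath', '', ...
                           'DynarePath', '');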
3.2.4 Example Windows syntax for remote runs

In this case:
• for UserName, ALSO the group has to be specified, like DEPT\JohnSmith, i.e. user JohnSmith in the Windows group DEPT;
• ComputerName is the name of the computer in the Windows network (see footnote 1 on how to retrieve it), i.e. the output of the hostname command or the full IP address.
Example 1 Parallel codes are run on a remote computer named vonNeumann with eight cores, using only cores 4, 5 and 6, working on drive C and folder dynare_calcs\Remote. The computer vonNeumann is in a network domain of the CompuTown university, with user John logged in with password *****:
[cluster]
Name = vonNeumann
Members = n2
[node]
Name = n2
ComputerName = vonNeumann
CPUnbr = [4:6]
UserName = COMPUTOWN\John
Password = *****
RemoteDrive = C
RemoteDirectory = dynare_calcs\Remote
DynarePath = c:\dynare\matlab
MatlabOctavePath = matlab
Example 2 We can build a cluster combining local and remote runs. For example, the following configuration file includes the two previous configurations, but also gives the possibility (with the cluster name c2) to build a grid with a total number of 7 CPU's:
[cluster]
Name = local
Members = n1
[cluster]
Name = vonNeumann
Members = n2
[cluster]
Name = c2
Members = n1 n2
[node]
Name = n1
ComputerName = localhost
CPUnbr = 4
[node]
Name = n2
ComputerName = vonNeumann
CPUnbr = [4:6]
UserName = COMPUTOWN\John
Password = *****
RemoteDrive = C
RemoteDirectory = dynare_calcs\Remote
DynarePath = c:\dynare\matlab
MatlabOctavePath = matlab
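Assuming this configuration file is in place, and assuming the parallel=NAME form for selecting a named cluster (mymodel.mod is again a placeholder), the combined grid c2 would be used as follows:

% run the parallel computations on the combined local+remote cluster c2
dynare mymodel.mod parallel=c2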
Example 3 We can build a cluster, combining many remote machines. For example the
following commands build a grid of four machines with a total number of 14 CPU’s:
[cluster]
Name = c4
Members = n1 n2 n3 n4
[node]
Name = n1
ComputerName = vonNeumann1
CPUnbr = 4
UserName = COMPUTOWN\John
Password = *****
RemoteDrive = C
RemoteDirectory = dynare_calcs\Remote
DynarePath = c:\dynare\matlab
MatlabOctavePath = matlab
[node]
Name = n2
ComputerName = vonNeumann2
CPUnbr = 4
UserName = COMPUTOWN\John
Password = *****
RemoteDrive = C
RemoteDirectory = dynare_calcs\Remote
DynarePath = c:\dynare\matlab
MatlabOctavePath = matlab
[node]
Name = n3
ComputerName = vonNeumann3
CPUnbr = 2
UserName = COMPUTOWN\John
Password = *****
RemoteDrive = D
RemoteDirectory = dynare_calcs\Remote
DynarePath = c:\dynare\matlab
MatlabOctavePath = matlab
[node]
Name = n4
ComputerName = vonNeumann4
CPUnbr = 4
UserName = COMPUTOWN\John
Password = *****
RemoteDrive = C
RemoteDirectory = John\dynare_calcs\Remote
DynarePath = c:\dynare\matlab
MatlabOctavePath = matlab
3.2.5 Example Unix syntax for remote runs
One remote slave: the following command defines remote runs on the machine name.domain.org.
[cluster]
Name = unix1
Members = n2
[node]
Name = n2
ComputerName = name.domain.org
CPUnbr = 4
UserName = JohnSmith
RemoteDirectory = /home/john/Remote
DynarePath = /home/john/dynare/matlab
MatlabOctavePath = matlab
Combining local and remote runs: the following commands define a cluster of local and remote CPU's.
[cluster]
Name = unix2
Members = n1 n2
[node]
Name = n1
ComputerName = localhost
CPUnbr = 4
[node]
Name = n2
ComputerName = name.domain.org
CPUnbr = 4
UserName = JohnSmith
RemoteDirectory = /home/john/Remote
DynarePath = /home/john/dynare/matlab
MatlabOctavePath = matlab
In this section we describe what happens when the user omits a mandatory entry or provides bad values for it, and how DYNARE reacts in these cases. In the parallel package there is a utility (AnalyseComputationalEnvironment.m) devoted to this task (it is triggered by the command line option parallel_test). When necessary during the discussion, we refer to the parallel entries used in the previous examples.
CPUnbr: a value for this variable must be in the form [s:d] with d>=s. If the user types values with s>d, their order is flipped and a warning message is issued. When the user provides a correct value for this field, DYNARE checks whether d CPUs (or cores) are available on the computer. Suppose that this check returns an integer nC. We can have three possibilities:
1. nC = d: all the available CPU's are used, and no warning message is generated by DYNARE;
2. nC > d: only the d declared CPU's are used, and the remaining ones stay idle;
3. nC < d: DYNARE alerts the user that there are fewer CPU's than those declared. The parallel tasks will run in any case, but some CPU's will have multiple instances assigned, with no gain in computational time.
UserName & Password: if Local = 1, no information about user name and password is necessary: "I am working on this computer". When remote computations on a Windows network are required, DYNARE checks if the user name and password are correct, otherwise execution is stopped with an error; for a Unix network, the user name and the proper operation of SSH are checked;
RemoteDrive & RemoteDirectory: if Local = 1, these fields are not required, since the working directory of the 'slaves' will be the same as that of the 'master'. If Local = 0, DYNARE tries to copy a file (Tracing.txt) to this remote location. If this operation fails, the DYNARE execution is stopped with an error;
MatlabOctavePath & DynarePath: MATLAB instances are tried on slaves and the
DYNARE path is checked.
In this section we describe the DYNARE parallel routines in some detail.
Windows: Under the Windows operating system, the parallel package requires the installation of a free software suite called PsTools (Russinovich, 2009). PsTools is a resource kit with a number of command line tools that mimic administrative features available under the Unix environment. PsTools can be downloaded from Russinovich (2009) and extracted in a Windows directory of your computer: to make PsTools work properly, it is mandatory to add this directory to the Windows path. After this step it is possible to invoke and use the PsTools commands from any location in the Windows file system. PsTools, MATLAB and DYNARE have to be installed and work properly on all the machines in the grid for parallel computation.
Unix: Under the Unix operating system, SSH must be installed on the master and on the slave machines. Moreover, SSH keys must be installed so that the SSH connections from the master to the slaves can be done without passwords.
masterParallel.m: it is called from the master computer, at the point where the parallelization system should be activated. Its main arguments are the name of the function containing the task to be run on every slave computer, the inputs to that function stored in two structures (one for local and the other for global variables), and the configuration of the cluster; this function exits when the task has finished on all the computers of the cluster, and returns the output in a structure vector (one entry per slave);
all the file exchanges between the master and the slaves are concentrated in this routine: it prepares and sends the input information for the slaves, it retrieves from the slaves the information about the status of the remote computations, stored on the remote machines by the remote processes, and finally it retrieves the outputs stored on the remote machines by the slave processes;
fMessageStatus.m: provides the core for simple message passing during slave execution: using this routine, slave processes can store locally on the remote machine basic information on the progress of the computations; such information is retrieved by the master process (i.e. masterParallel.m), allowing it to echo the progress of the remote computations on the master; the routine fMessageStatus.m is also the entry-point where a signal of interruption sent by the master can be checked and executed; this routine typically replaces calls to waitbar.m;
closeSlave.m is the utility that sends a signal to remote slaves to close themselves. In the
standard operation, this is only needed with the ‘Always-Open’ mode and it is called
when DYNARE computations are completed. At that point, slaveParallel.m will
get a signal to terminate and no longer wait for new jobs. However, this utility is
also useful in any parallel mode if, for any reason, the master needs to interrupt the
remote computations which are running;
AnalyseComputationalEnvironment.m: checks that the cluster works properly and echoes error messages when problems are detected;
utility routines that create and clean the temporary working directories on the remote machines;
distributeJobs.m: distributes the jobs evenly across the available CPU's;
a number of generalized routines that properly perform delete, copy, mkdir, rmdir commands through the network file-system (i.e. used from the master to operate on slave machines); the routines are adaptive to the actual environment (Windows or Unix);
In Table 1 we have synthesized the main steps for parallelizing MATLAB codes.
So far, we have parallelized the following functions, by selecting the most computa-
tionally intensive loops:
4. the Monte Carlo cycle looping over posterior parameter subdraws performing the
IRF simulations (<*>_core1) and the cycle looping over exogenous shocks plotting
IRF’s charts (<*>_core2):
posteriorIRF.m,
posteriorIRF_core1.m, posteriorIRF_core2.m;
5. the Monte Carlo cycle looping over posterior parameter subdraws, that computes
filtered, smoothed, forecasted variables and shocks:
prior_posterior_statistics.m,
prior_posterior_statistics_core.m;
6. the cycle looping over endogenous variables making posterior plots of filter, smoother,
forecasts: pm3.m, pm3_core.m.
Table 1: The main steps for parallelizing MATLAB codes within DYNARE.
1. locate within DYNARE the portion of code suitable to be parallelized, i.e. an expensive for cycle;
2. suppose that the function tuna.m contains a for cycle that is suitable for parallelization: this cycle has to be extracted from tuna.m and put in a new MATLAB function named tuna_core.m;
3. at the point where the expensive cycle should start, the function tuna.m invokes the utility masterParallel.m, passing to it the options_.parallel structure, the name of the function to be run in parallel (tuna_core.m), the local and global variables needed, and all the information about the files (MATLAB functions *.m; data files *.mat) that will be handled by tuna_core.m;
4. the function masterParallel.m reads the input arguments provided by tuna.m and:
• decides how to distribute the tasks evenly across the available CPU's (using the utility routine distributeJobs.m), and prepares and initializes the computational environment (i.e. copies files/data) for each slave machine;
• uses PsTools and the operating system commands to launch new MATLAB instances, synchronize the computations, monitor the progress of the slave tasks through a simple message passing system (see later) and collect the results upon completion of the slave threads;
5. when all the slave tasks are completed, masterParallel.m returns the collected outputs to tuna.m, which then continues the sequential execution;
6. the utility fMessageStatus.m can be used within the core routine tuna_core.m to send information to the master regarding the progress of the slave thread;
7. when all DYNARE computations are completed, closeSlave.m closes all the open remote MATLAB/Octave instances waiting for new jobs to be run.
Using a MATLAB pseudo (but very realistic) code, we now describe in detail how to use the above step-by-step procedure to parallelize the random walk Metropolis Hastings algorithm. Any other function can be parallelized in the same way.
It is obvious that most of the computational time of the random_walk_metropolis_hastings.m function is spent in the cycle looping over the parallel chains performing the Metropolis:
function random_walk_metropolis_hastings(TargetFun, ProposalFun, ..., varargin)
[...]
for b = fblck:nblck,
...
end
[...]
Since those chains are totally independent, the obvious way to reduce the computational time is to parallelize this loop, executing the nblck-fblck+1 chains on different computers/CPUs/cores.
To do so, we remove the for cycle and put it in a new function named <*>_core.m:
function myoutput = random_walk_metropolis_hastings_core(myinputs, fblck, nblck, ...)
[...]
% just list the global variables needed (they are set up properly by fParallel or slaveParallel)
TargetFun=myinputs.TargetFun;
ProposalFun=myinputs.ProposalFun;
xparam1=myinputs.xparam1;
[...]
for b = fblck:nblck,
...
end
[...]
myoutput.record = record;
[...]
The split of the for cycle has to be performed in such a way that the new <*>_core function can work in both serial and parallel mode. In the latter case, such a function will be invoked by the slave threads and executed for the number of iterations assigned by masterParallel.m.
The modified random_walk_metropolis_hastings.m is therefore:
function random_walk_metropolis_hastings(TargetFun, ProposalFun, ..., varargin)
[...]
% here we wrap all local variables needed by the <*>_core function
localVars = struct('TargetFun', TargetFun, ...
[...]
    'd', d);
[...]
% here we put the switch between serial and parallel computation:
if isnumeric(options_.parallel) || (nblck-fblck)==0,
% serial computation
fout = random_walk_metropolis_hastings_core(localVars, fblck,nblck, 0);
record = fout.record;
else
% parallel computation
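    % --- sketch of the parallel branch: the argument list of masterParallel.m shown
    % here is indicative and may differ across DYNARE versions ---
    % wrap the global variables needed by the <*>_core function:
    globalVars = struct('M_', M_, ...
                        'options_', options_, ...
                        'oo_', oo_);
    % model files that have to be copied to the working directories of the slaves:
    NamFileInput(1,:) = {'', [M_.fname '_static.m']};
    NamFileInput(2,:) = {'', [M_.fname '_dynamic.m']};
    % distribute the chains fblck:nblck across the cluster and wait for all outputs:
    [fout, nBlockPerCPU, totCPU] = masterParallel(options_.parallel, fblck, nblck, ...
        NamFileInput, 'random_walk_metropolis_hastings_core', localVars, ...
        globalVars, options_.parallel_info);
    % fout is a structure vector with one entry per slave; the actual code merges
    % here the records of all the chains:
    record = fout(1).record;
end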
Finally, in order to allow the master thread to monitor the progress of the slave threads,
some message passing elements have to be introduced in the <*>_core.m file. The utility
function fMessageStatus.m has been designed as an interface for this task, and can be seen as a generalized form of the MATLAB utility waitbar.m.
In the following example, we show a typical use of this utility, again from the random
walk Metropolis routine:
for j = 1:nruns
[...]
    % define the progress of the loop:
    prtfrc = j/nruns;
    % inform the master about the progress (schematic call: the message string and
    % the exact condition differ in the actual DYNARE code):
    if mod(j, 3) == 0 && whoiam
        fMessageStatus(prtfrc, whoiam, waitbarString, waitbarTitle, options_.parallel(ThisMatlab));
    end
[...]
end
In the previous example, a number of arguments are used to identify which CPU and which computer in the cluster is sending the message, namely:
% whoiam [int] index number of this CPU among all CPUs in the
% cluster
% ThisMatlab [int] index number of this slave machine in the cluster
% (entry in options_.parallel)
The message is stored as a MATLAB data file *.mat saved in the working directory of the remote slave computer. The master will periodically check for those messages, retrieve the files from the remote computers and produce an advanced monitoring plot.
So, assuming that we run two Metropolis chains, under the standard serial implementation a first waitbar pops up in MATLAB, corresponding to the first chain, followed by a second waitbar when the first chain is completed. On the other hand, under the parallel implementation, a parallel monitoring plot is produced by masterParallel.m.
We checked the new parallel platform for DYNARE by performing a number of tests, using different models and computer architectures. We present here all the tests performed with Windows XP/MATLAB. However, similar tests were performed successfully under the Linux/Ubuntu environment. In the Bayesian estimation of DSGE models with DYNARE, most of the computing time is devoted to the posterior parameter estimation with the Metropolis algorithm. The first and second tests are therefore focused on the parallelization of the Random Walk Metropolis Hastings algorithm (Sections 4.1-4.2). In addition, further tests (Sections 4.3-4.4) are devoted to testing all the parallelized functions in DYNARE.
4.1 Test 1.
The main goal here was to evaluate the parallel package on a fixed hardware platform
and using chains of variable length. The model used for testing is a modification of
Hradisky et al. (2006). This is a small scale open economy DSGE model with 6 observed
variables, 6 endogenous variables and 19 parameters to be estimated. We estimated the model on a bi-processor machine (Fujitsu-Siemens Celsius R630) powered with Intel® Xeon™ CPUs at 2.80GHz with Hyper-Threading Technology; first with the original serial Metropolis and subsequently using the parallel solution, to take advantage of the two-processor technology. We ran chains of increasing length: 2,500, 5,000, 10,000, 50,000, 100,000, 250,000, 1,000,000.
Figure 1: Computational time (in minutes) versus chain length (number of Metropolis runs), for one processor (serial) and two processors (parallel).
Figure 2: Reduction of computational time (i.e. the ‘time gain’) using the parallel coding, versus chain length. The time gain is computed as (Ts − Tp)/Ts, where Ts and Tp denote the computing time of the serial and parallel implementations respectively.
Overall results are given in Figure 1, showing the computational time versus chain length, and Figure 2, showing the reduction of computational time (or the time gain) with respect to the serial implementation provided by the parallel coding. The gain in computing time is about 45% on this test case, reducing the cost of running 1,000,000 Metropolis iterations from about 11.4 hours to about 6 hours (the ideal gain would be 50% in this case).
Machine               Single-processor   Bi-processor   Dual core
Parallel              8:01:21            7:02:19        5:39:38
Serial                10:12:22           13:38:30       11:02:14
Speed-up rate         1.2722             1.9381         1.9498
Ideal speed-up rate   ∼1.5               2              2
Table 2: Trial results with normal PC operation. Computing time expressed in h:m:s. The speed-up rate is computed as Ts/Tp, where Ts and Tp are the computing times for the serial and parallel implementations.
4.2 Test 2.
The scope of the second test was to verify whether the results were robust across different hardware platforms. We estimated the model with chain lengths of 1,000,000 runs on the following hardware platforms:
• Single-processor machine: Intel® Pentium 4® CPU 3.40GHz with Hyper-Threading Technology (Fujitsu-Siemens Scenic Esprimo);
• Bi-processor machine: two Intel® Xeon™ CPUs at 2.80GHz with Hyper-Threading Technology (Fujitsu-Siemens Celsius R630);
• Dual-core machine: Intel Centrino T2500 2.00GHz Dual Core (Fujitsu-Siemens LifeBook S Series).
We first ran the tests with the normal configuration. However, since (i) a dissimilar software environment on the machines can influence the computation and (ii) Windows services (network, hard disk writing, daemons, software updating, antivirus, etc.) can start during the simulation, we also ran the tests without allowing any other process to start during the estimation. Table 2 gives the results for the ordinary software environment, where the process priority is set to low/normal.
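As a quick check of these definitions (the numbers are taken from the Dual core column of Table 2), the speed-up and time-gain formulas can be evaluated directly in MATLAB:

% Dual core column of Table 2: serial 11:02:14, parallel 5:39:38
Ts = 11*3600 + 2*60 + 14;        % serial computing time in seconds
Tp = 5*3600 + 39*60 + 38;        % parallel computing time in seconds
speedup  = Ts/Tp;                % about 1.95, as reported in Table 2
timegain = (Ts - Tp)/Ts;         % about 0.49, i.e. close to the ideal 50% reduction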
Environment                                                                          Computing time   Speed-up rate w.r.t. Table 2
Parallel, waitbar not visible                                                        5:06:00          1.06
Parallel, waitbar not visible, real-time process priority, network cable unplugged   4:40:49          1.22
Table 3: Trial results with different software configurations (operating environment optimized for the computational requirements).
Results showed that the Dual-core technology provides a gain similar to the bi-processor results, again about 45%. The striking result was that the Dual-core processor clocked at 2.0GHz was about 30% faster than the Bi-processor clocked at 2.8GHz. Interesting gains were also obtained via multi-threading on the Single-processor machine, with a speed-up of about 1.27 (i.e. a time gain of about 21%). However, beware that we burned a number of processors performing tests on single processors with hyper-threading and using very long chains (1,000,000 runs)! We re-ran the tests on the Dual-core machine, cleaning the PC operation from any interference by other programs, and show the results in Table 3. A speed-up rate of 1.06 (i.e. a 5.6% time gain) can be obtained simply by hiding the MATLAB waitbar. The speed-up rate can be pushed to 1.22 (i.e. an 18% time gain) by disconnecting the network and setting the priority of the process to real time. It can be noted that, from the original configuration taking 11:02 hours to run the two parallel chains, the computational time can be reduced to 4:40 hours (i.e. a total time gain of over 60% with respect to the serial computation) by parallelizing and optimally configuring the operating environment. These results are somewhat surprising and show how it is possible to reduce the computational time dramatically with slight modifications to the software configuration.
Given the excellent results reported above, we have parallelized many other DYNARE functions. This implies that parallel instances can be invoked many times during a single DYNARE session. Under the basic parallel toolbox implementation, which we call the ‘Open/Close’ strategy, this implies that MATLAB instances are opened and closed many times by system calls, possibly slowing down the computation, especially for ‘entry-level’ computer resources. As mentioned before, this suggested implementing an alternative strategy for the parallel toolbox, which we call the ‘Always-open’ strategy, where the slave MATLAB threads, once opened, stay alive and wait for new tasks assigned by the master until the full DYNARE procedure is completed. We show next the tests of these latest implementations.
4.3 Test 3
In this Section we use the Lubik (2003) model as the test function [2] and a very simple class of computer, quite widespread nowadays: the netbook personal computer. In particular, we used a Dell Mini 10 with an Intel® Atom Z520 processor (1.33 GHz, 533 MHz FSB) and 1 GB of RAM (with Hyper-Threading). First, we tested the computational gain of running a full Bayesian estimation: Metropolis (two parallel chains), MCMC diagnostics, posterior IRF's and filtered and smoothed variables, forecasts, etc. In other words, we designed DYNARE sessions that invoke all the parallelized functions. Results are shown in Figures 3-4. In Figure 3 we show the computational time versus the length of the Metropolis chains in the serial and parallel setting (the ‘Open/Close’ strategy). With very short chain lengths, the parallel setting obviously slows down the performance of the computations (due to the delays in opening/closing MATLAB sessions and in the synchronization), while, increasing the chain length, we can get speed-up rates of up to 1.41 on this ‘entry-level’ portable computer (single processor with Hyper-Threading). In order to appreciate the gain of parallelizing all the functions invoked after the Metropolis, in Figure 4 we show the results of the same experiment, but without running the Metropolis, i.e. using the DYNARE options load_mh_files = 1 and mh_replic = 0 (so that the Metropolis and the MCMC diagnostics are not invoked). The parallelization of the functions invoked after the Metropolis attains speed-up rates of 1.14 (i.e. a time gain of about 12%). Note that the computational cost of these functions is proportional to the chain length only when the latter is relatively small: in fact, the number of sub-draws taken by posteriorIRF.m or prior_posterior_statistics.m is proportional to the total number of MH draws only up to a maximum threshold of 500 sub-draws (for the IRF's) and 1,200 sub-draws (for the smoother). This is reflected in the shape of the plots, which attain a plateau when these thresholds are reached.
[2] The Lubik (2003) model is also selected as the ‘official’ test model for the parallel toolbox in DYNARE.
MH runs   Serial (s)   Parallel (s)
10005     1246         948
15005     1647         1250
20005     2068         1502
25005     2366         1675
(Computational time using all the parallelized functions in DYNARE and the ‘Open/Close’ strategy; data underlying Figure 3.)

Figure 3: Computational time (s) versus Metropolis chain length (‘Complete Parallel’ experiment: full Bayesian estimation), serial vs parallel, ‘Open/Close’ strategy.

Figure 4: Computational time (s) versus Metropolis chain length (‘Partial Parallel’ experiment: without the Metropolis), serial vs parallel, ‘Open/Close’ strategy.
Figure 5: ‘Open/Close’ vs ‘Always-open’ strategy: computational time (s) versus Metropolis chain length, ‘Complete Parallel’ experiment.

Figure 6: ‘Open/Close’ vs ‘Always-open’ strategy: computational time (s) versus Metropolis chain length, ‘Partial Parallel’ experiment.
In Figures 5-6 we plot results of the same type of tests just described, but comparing the ‘Open/Close’ and the ‘Always-open’ strategies. We can see in both graphs that the more sophisticated ‘Always-open’ approach provides some reduction in computational time. When the entire Bayesian analysis is performed (including Metropolis and MCMC diagnostics, Figure 5), the gain is on average about 5%, but it can be more than 10% for short chains. When the Metropolis is not performed, the gain rises on average to about 10%. As expected, the gain of the ‘Always-open’ strategy is especially visible when the computational time spent in a single parallel session is not too long compared to the cost of opening and closing new MATLAB sessions under the ‘Open/Close’ approach.
4.4 Test 4
Here we increase the dimension of the test model, using the QUEST III model (Ratto et al., 2009), and a more powerful notebook, a Samsung Q45 with an Intel Centrino dual-core processor. In Figures 7-8 we show the computational gain of the parallel coding with the ‘Open/Close’ strategy. When the Metropolis is included in the analysis (Figure 7),
the computational gain increases with the chain length. For 50,000 MH iterations, the speed-up rate is about 1.42 (i.e. a 30% time gain), but pushing the computation up to 1,000,000 runs provides an almost ideal speed-up of 1.9 (i.e. a gain of about 50%, similar to Figure 1). It is also interesting to note that, for this medium/large size model, even at very short chain lengths the parallel coding is always winning over the serial. Excluding the Metropolis from the DYNARE execution (Figure 8), we can see that the speed-up rate of running the posterior analysis in parallel on two cores reaches 1.6 (i.e. 38% of time gain).

Chain length   Serial (s)   Parallel (s)
105            98           95
1005           398          255
5005           1463         890
10005          2985         1655
20005          4810         2815
30005          6630         4022
40005          7466         5246
50000          9263         6565
(Computational time using all the parallelized functions and the ‘Open/Close’ strategy; data underlying Figure 7.)
Figure 7: Computational Time (s) versus Metropolis length, running all the parallelized
functions in DYNARE and the basic parallel implementation (the ‘Open/Close’ strategy).
(Ratto et al., 2009).
We also checked the efficacy of the ‘Always-open’ approach with respect to the ‘Open/Close’ one (Figures 9 and 10). We can see in Figure 9 that, running the entire Bayesian analysis, no advantage can be appreciated from the more sophisticated ‘Always-open’ approach. On the other hand, in Figure 10, we can see that the ‘Always-open’ approach still provides a small speed-up rate of about 1.03. These results confirm the previous comment that the gain of the ‘Always-open’ strategy is especially visible when the computational time spent in a single parallel session is not too long, and therefore, the bigger the model size, the smaller the advantage of this strategy.
5 Conclusions
The methodology identified for parallelizing MATLAB codes within DYNARE proved to be effective in reducing the computational time of the most extensive loops. This methodology is suitable for ‘embarrassingly parallel’ codes, requiring only a minimal communication flow between slave and master threads. The parallel DYNARE is built around a few ‘core’ routines, that act as a sort of ‘parallel paradigm’. Based on those routines, the parallelization of expensive loops is made quite simple for DYNARE developers. A basic message passing system is also provided, which allows the master thread to monitor the progress of slave threads. The test model ls2003.mod, which allows running parallel examples, is available in the folder \tests\parallel of the DYNARE distribution.
Chain length   Serial (s)   Parallel (s)
105            62           63
1005           285          198
5005           498          318
10005          798          488
20005          799          490
30005          781          518
40005          768          503
50005          823          511
100000         801          530
(Computational time without the Metropolis and with the ‘Open/Close’ strategy; data underlying Figure 8.)
Figure 8: Computational Time (s) versus Metropolis length, loading previously performed
MH runs and running only the parallelized functions after Metropolis (Ratto et al., 2009).
Basic parallel implementation (the ‘Open/Close’ strategy).
Chain length   Complete Parallel (s)   Partial Parallel (s)
50000          6624                    498
(Computational time with the ‘Always-open’ strategy.)
Figure 9: Comparison of the ‘Open/Close’ strategy and the ‘Always-open’ strategy. Computational Time (s) versus Metropolis length, running all the parallelized functions in DYNARE (Ratto et al., 2009).
Figure 10: Comparison of the ‘Open/Close’ strategy and the ‘Always-open’ strategy. Computational Time (s) versus Metropolis length, running only the parallelized functions after Metropolis (QUEST III model, Ratto et al., 2009).

Test 5. The strong reduction in computational time allows us to compare, within DSGE modelling, two distinct implementations of the Metropolis Hastings algorithm: the Independent and the Random Walk versions. Specifically, we execute the QUEST III model with:
References
W.L. Goffe and M. Creel. Multi-core CPUs, clusters, and grid computing: A tutorial. Computational Economics, 32(4):353–382, 2008.
M. Hradisky, R. Liska, M. Ratto, and R. Girardi. Exchange Rate Versus Inflation Targeting in a Small Open Economy SDGE Model, for European Union New Member States. In DYNARE Conference, Paris, September 4-5, 2006.
Marco Ratto, Werner Roeger, and Jan in 't Veld. QUEST III: An estimated open-economy DSGE model of the euro area with fiscal and monetary policy. Economic Modelling, 26(1):222–233, 2009. doi: 10.1016/j.econmod.2008.06.014. URL http://www.sciencedirect.com/science/article/B6VB1-4TC8J5F-1/2/7f22da17478841ac5d7a77d06f13d13e.
Figure 11: Prior (grey lines) and posterior density of estimated parameters (black =
100,000 runs; red = 1,000,000 runs) using the RWMH algorithm (QUEST III model
Ratto et al., 2009).
Figure 12: Prior (grey lines) and posterior density of estimated parameters (black =
100,000 runs; red = 1,000,000 runs) using the RWMH algorithm (QUEST III model
Ratto et al., 2009).
Figure 13: Prior (grey lines) and posterior density of estimated parameters (black =
100,000 runs; red = 1,000,000 runs) using the RWMH algorithm (QUEST III model
Ratto et al., 2009).
Figure 14: Prior (grey lines) and posterior density of estimated parameters (black =
100,000 runs; red = 1,000,000 runs) using the RWMH algorithm (QUEST III model
Ratto et al., 2009).
Figure 15: Prior (grey lines) and posterior density of estimated parameters (black =
100,000 runs; red = 1,000,000 runs) using the RWMH algorithm (QUEST III model
Ratto et al., 2009).
Figure 16: Prior (grey lines) and posterior density of estimated parameters (black =
100,000 runs; red = 1,000,000 runs) using the RWMH algorithm (QUEST III model
Ratto et al., 2009).
Figure 17: Prior (grey lines) and posterior density of estimated parameters (black =
100,000 runs; red = 1,000,000 runs) using the RWMH algorithm (QUEST III model
Ratto et al., 2009).
A A tale on parallel computing
This is a general introduction to Parallel Computing. Readers can skip it, provided they have a basic knowledge of DYNARE and computer programming (Goffe and Creel, 2008; Azzini et al., 2007; ParallelDYNARE, 2009). There exists an ample scientific literature, as well as an enormous quantity of information on the Web, about parallel computing. Sometimes, this amount of information may be ambiguous and confusing in the notation adopted and in the description of technologies. The main goal here is therefore to provide a very simple introduction to this subject, referring the reader to Brookshear (2009) for a more extensive and clear introduction to computer science.
Modern computer systems (hardware and software) are conceptually identical to the first computer developed by J. von Neumann. Nevertheless, over time, hardware, software, but most importantly hardware & software together, have acquired an ever increasing ability to perform incredibly complex and intensive tasks. Given this complexity, we like to explain modern computer systems by means of the “avenue paradigm”, which we summarize in the following tale.
Nowadays there is a small but lovely town called “CompuTown”. In CompuTown
there are many roads, which are all very similar to each other, and also many gardens.
The most important road in CompuTown is the Von Neumann Avenue. The first building
in Von Neumann Avenue has three floors (this is a computer system: PC, workstation,
etc.; see Figure 18 and Brookshear (2009)). Floors communicate between them only with
a single staircase. On each floor there are people coming from the same country, with the same language, culture and habits. People living, moving and interacting with each other on the first and second floors are the programs or software agents or, more generally speaking, the algorithms (see chapters 3, 5, 6 and 7 in Brookshear (2009)). Examples of the latter are the software packages MATLAB and Octave, and a particular program called the operating system (Windows, Linux, Mac OS, etc.).
People at the ground floor are the transistors, the RAM, the CPU, the hard disk,
etc. (i.e. the Computer Architecture, see chapters 1 and 2 in Brookshear). People at the second floor communicate with people at the first floor using the only existing staircase: to do so, they define a set of words, fixed and understood by all: the Programming Languages. More specifically, we call these high-level programming languages (Java, C, MATLAB, …), because they relate to the people who are on the upper floors of the building!

Figure 18: The first building in Von Neumann Avenue. First floor: the Operating System …; Ground floor: the Hardware …. People in the building also use pictures to communicate: the icons and the graphical user interface.
In a similar way, people at the first floor communicate with people at the ground floor.
Not surprisingly, in this case, people use low-level programming languages to communicate with each other (assembler, binary code, machine language, etc.). More importantly,
however, people at the first floor must also manage and coordinate the requests from
people on the second floor to people at the ground floor, since there is no direct commu-
nication between the ground and the second floor. For example, they need to translate high-level programming languages into binary code [3]: the Operating System performs this task.
Sometimes, people at the second floor try to talk directly with people at the ground
floor, via the system calls. In the parallelizing software presented in this document, we will
use frequently these system calls, to distribute the jobs between the available hardware
[3] The process of transforming a high-level programming language into binary code is called the compilation process.
resources, and to coordinate the overall parallel computational process. If only a single person without a family lives on the ground floor, such as the porter, we have a single-core CPU. In this case, the porter can only do one task at a time for the people on the first or second floor (the main characteristic of the Von Neumann architecture). For example, in the morning he first collects and sorts the mail for the people in the building, and only after completing this task can he take care of the garden. If the porter has to do many jobs, he needs to write on a piece of paper the list of things to do: the memory and the CPU load. Furthermore, to properly perform his tasks, sometimes the porter has to move some objects through the passageways at the ground floor (the System Bus). If the passageways have a standard width, we will have a 32-bit CPU architecture (or bus). If the passageways are very large, we will have, for example, a 64-bit CPU architecture (or bus). In this scenario, there will be very busy days when many tasks have to be done and many things have to be moved around: the porter will be very tired, although he will be able to ‘survive’. The most afflicted are always the people at the first floor. Every day they have a lot of new, complex requests from the people at the second floor. These requests must be translated in a correct way and passed on to the porter. The people at the second floor (the highest floor) “live in cloud cuckoo land”. These people want everything to be done easily and promptly: artificial intelligence, robotics, etc. The activity in the building increases over time, so the porter decides to get some help in order to reduce the execution time of a single job. There are two ways to do this:
• the municipality of CompuTown interconnects all the buildings in the city using roads, so that the porters can share and distribute the jobs (the Computer Networks): if the porters involved have the same nationality and language we have a Computer Cluster, otherwise we have a Grid. Nevertheless, in both cases, it is necessary to define a correct way in which the porters can manage, share and complete a shared job: the communication protocol (TCP/IP, internet protocol, etc.);
• the owners of the building can hire a second porter, obtaining a bi-processor computer. In other cases, the porter may get married, producing a dual-core CPU. In this case, the wife can help the porter to perform his tasks, or even take over some jobs entirely (for example doing the accounting, taking care of the apartment, etc.). If the couple has children, they can have some further little helpers: the threads and hence the Hyper-Threading technology.
Now a problem arises: who should coordinate the activities between the porters (and their families) and between the other buildings? Or, in other words, should we refurbish the first and second floors to take advantage of the innovations on the ground floor and of the new roads in CompuTown? First, we can lodge new people at the first floor: operating systems with a set of network tools and multi-processor support, as well as new people at the second floor with new programming paradigms (MPI, OpenMP, Parallel DYNARE, etc.). Second, a more complex communication scheme between the first and the ground floor is necessary, building a new set of stairs. So, for example, if we have two stairs between the ground and the first floor and two porters, using multi-processors and a new parallel programming paradigm, we can assign jobs to each porter directly and independently, and then coordinate the overall work. In parallel DYNARE we use this kind of ‘refurbishing’ to reduce the computational time and to meet the requests of the people at the second floor.
Unfortunately, this is only an idealized scenario, where all the citizens of CompuTown live in peace and cooperate with each other. In reality, some building occupants argue with each other, and this can cause their jobs to stop: these kinds of conflicts may be linked to software and hardware compatibility (between the ground and first floors), or to different software versions (between the second and first floors). The building administration or the municipality of CompuTown have to take care of these problems and fix them, to make the computer system operate properly.
This tale (which could also be called The Programs' Society) has covered in a few pages the fundamental ideas of computer science.