Parallel DYNARE Toolbox
FP7 Funded Project MONFISPOL Grant no.: 225149
Marco Ratto
European Commission, Joint Research Centre, Ispra, ITALY
July 24, 2017
1 The ideas implemented in Parallel DYNARE
• ensure the coherence of the results with the original sequential execution.
Generally, during a program execution, the largest share of the computational time is spent executing nested cycles. For simplicity, and without loss of generality, we consider here only for cycles (it is possible to show that any while cycle admits an equivalent for cycle). Then, after identifying the most computationally expensive for cycles, we can split their execution (i.e. the iterations) among different cores, CPUs or computers. For example, consider the following simple MATLAB piece of code:
...
n=2;
m=10^6;
Matrix=zeros(n,m);
for i=1:n,
    Matrix(i,:)=rand(1,m);
end,
Mse=Matrix;
...
Example 1
With one CPU this cycle is executed sequentially: first for i=1 and then for i=2. Nevertheless, these two iterations are completely independent, so, from a theoretical point of view, if we have two CPUs (cores) we can rewrite the above code as:
...
n=2;
m=10^6;
<provide to CPU1 and CPU2 the input data m>
<CPU1 executes>  Matrix1(1,:)=rand(1,m);
<CPU2 executes>  Matrix2(1,:)=rand(1,m);
<the master retrieves the results and stacks them>
Mpa=[Matrix1(1,:); Matrix2(1,:)];
...
Example 2
The for cycle has disappeared and has been split into two separate sequences that can be executed in parallel on two CPUs. We obtain the same result (Mpa=Mse), but the computational time can be reduced by up to 50%.
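A key concern in any such split is the coherence of the results with the serial run (the first bullet above). The following minimal, single-session MATLAB sketch is not part of the toolbox; the per-iteration seeding is just an assumption used for illustration. It shows how re-seeding the random number generator at each iteration makes every block of iterations exactly reproducible, so that the 'split' result matches the serial one:

% serial run, seeding the generator at each iteration
n=2; m=10^6;
Mse=zeros(n,m);
for i=1:n
    rng(i);                    % per-iteration seed
    Mse(i,:)=rand(1,m);
end
% 'CPU1' would compute row 1 and 'CPU2' row 2; here both blocks are
% emulated in the same session, each re-seeding exactly as above
Mpa=zeros(n,m);
for i=1:n
    rng(i);
    Mpa(i,:)=rand(1,m);
end
isequal(Mse,Mpa)               % returns true: the split is coherent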
Within DYNARE, examples of such independent loops that have been parallelized include:
(a) the diagnostic tests for the convergence of the Markov Chain (McMCDiagnostics.m);
(c) the function that computes posterior statistics for filtered and smoothed variables, forecasts, smoothed shocks, etc. (prior_posterior_statistics.m);
(d) the utility function that loads matrices of results and produces plots for posterior statistics (pm3.m).
When the execution of the code should start in parallel (as in Example 2), instead of running it inside the active MATLAB session, the following steps are performed:
3. when the parallel computations are concluded, control is given back to the original MATLAB session, which collects the results from all the parallel 'agents' involved and coherently continues along the sequential computation.
Two alternative implementations of this mechanism are available:
1. the 'slave' MATLAB sessions are closed after completion of each single job, and new instances are launched for any subsequent parallelized task (fParallel.m);
2. once opened, the 'slave' MATLAB sessions are kept open for the whole DYNARE session, waiting for the jobs to be executed, and are only closed upon completion of the DYNARE session on the 'master' (slaveParallel.m).
We will see that neither of the two options is always superior to the other: which one performs better depends on the model size.
3 Installation and utilization
Here we describe how to run parallel sessions in DYNARE and, for the developer community, how to apply the package to parallelize any suitable piece of code that may be deemed necessary.
3.1 Requirements
3. the Windows user on the master machine has to be a user of any other slave machine in the cluster, and that user will be used for the remote computations.
2. the UNIX user on the master machine has to be a user of any other slave machine in the cluster, and that user will be used for the remote computations;
3. SSH keys must be installed so that the SSH connection from the master to the slaves can be done without passwords, or using an SSH agent.
3.2 The user perspective
We assume here that the reader has some familiarity with DYNARE and its
use. For the DYNARE users, the parallel routines are fully integrated and
hidden inside the DYNARE environment.
The general idea is to put all the configuration of the cluster in a config file
different from the MOD file, and to trigger the parallel computation with
option(s) on the dynare command line. The configuration file is designed as
follows:
• the file should be in a standard location;
• for each cluster, it specifies a list of slaves, with a list of options for each slave (if an option is not explicitly specified in the configuration file, the preprocessor sets it to its default value).
CPUnbr : the number of CPUs to be used on that computer; if CPUnbr is a vector of integers, the syntax is [s:d], with d>=s (s and d integers); the first core has number 1, so on a quad-core machine use 4 to exploit all cores, but use [3:4] to specify just the last two cores (this is particularly relevant under Windows, where it is possible to assign jobs to specific processors);
Password : required for remote login (only under Windows): it is the user password on DOMAIN and ComputerName;
DynarePath : the path to the 'matlab' directory within the DYNARE installation directory;
The syntax of the configuration file takes the following form (the order in which the clusters and nodes are listed is not significant):
[cluster]
Name = c1
Members = n1 n2 n3
[cluster]
Name = c2
Members = n2 n3
[node]
Name = n1
ComputerName = localhost
CPUnbr = 1
[node]
Name = n2
ComputerName = karaba.cepremap.org
CPUnbr = 5
UserName = houtanb
RemoteDirectory = /home/houtanb/Remote
DynarePath = /home/houtanb/dynare/matlab
MatlabOctavePath = matlab
[node]
Name = n3
ComputerName = hal.cepremap.ens.fr
CPUnbr = 3
UserName = houtanb
RemoteDirectory = /home/houtanb/Remote
DynarePath = /home/houtanb/dynare/matlab
MatlabOctavePath = matlab
• parallel: trigger the parallel computation using the first cluster specified in the configuration file;
• parallel=<clustername>: trigger the parallel computation using the given cluster;
• parallel_test: just test the cluster, do not actually run the MOD file.
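For example, assuming the configuration file above and a MOD file named ls2003.mod (the file name is purely illustrative), the corresponding calls from the MATLAB command line would look like:

dynare ls2003.mod parallel
dynare ls2003.mod parallel=c2
dynare ls2003.mod parallel_test

When the MOD file is processed, the preprocessor translates the node options of the configuration file into the following options_ field: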
options_.parallel=
    struct('Local', Value,
           'ComputerName', Value,
           'CPUnbr', Value,
           'UserName', Value,
           'Password', Value,
           'RemoteDrive', Value,
           'RemoteFolder', Value,
           'MatlabOctavePath', Value,
           'DynarePath', Value);
All these fields correspond to the slave options, except Local, which is set by the preprocessor according to the value of ComputerName:
Local: this variable is binary, i.e. it can only take the values 0 and 1. If ComputerName is set to localhost, the preprocessor sets Local = 1 and the parallel computation is executed on the local machine, i.e. on the same computer (and working directory) where the DYNARE project is placed. For any other value of ComputerName, we will have Local = 0;
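As an illustration, for the local quad-core configuration shown in Section 3.2.3 below, the populated structure would look roughly as follows (the field values used here are assumed, for illustration only):

options_.parallel = struct('Local', 1, ...
                           'ComputerName', 'localhost', ...
                           'CPUnbr', 4, ...
                           'UserName', '', ...
                           'Password', '', ...
                           'RemoteDrive', '', ...
                           'RemoteFolder', '', ...
                           'MatlabOctavePath', 'matlab', ...
                           'DynarePath', '');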
In addition to the parallel structure, which can be in vector form to allow specific entries for each slave machine in the cluster, there is another options_ field, called parallel_info, which stores all the options that are common to the whole cluster. In particular, depending on whether parallel_slave_open_mode is passed on the dynare command line, the leaveSlaveOpen field is set so as to select either the 'Always-open' or the 'Open/Close' strategy.
3.2.3 Example syntax for Windows and Unix, for local parallel runs (assuming quad-core)
In this case, the only slave options are ComputerName and CPUnbr.
[cluster]
Name = local
Members = n1
[node]
Name = n1
ComputerName = localhost
CPUnbr = 4
• for UserName, the group also has to be specified, as in DEPT\JohnSmith, i.e. user JohnSmith in the Windows group DEPT;
Example 1 Parallel codes run on a remote computer named vonNeumann with eight cores, using only cores 4, 5 and 6, working on drive 'C' and folder 'dynare_calcs\Remote'. The computer vonNeumann is in the net domain of the CompuTown university, with user John logged in with the password *****:
[cluster]
Name = vonNeumann
Members = n2
[node]
Name = n2
ComputerName = vonNeumann
CPUnbr = [4:6]
UserName = COMPUTOWN\John
Password = *****
RemoteDrive = C
RemoteDirectory = dynare_calcs\Remote
DynarePath = c:\dynare\matlab
MatlabOctavePath = matlab
Example 2 We can build a cluster combining local and remote runs. For example, the following configuration file includes the two previous configurations, but also gives the possibility (with the cluster name c2) to build a grid with a total of 7 CPUs:
[cluster]
Name = local
Members = n1
[cluster]
Name = vonNeumann
Members = n2
[cluster]
Name = c2
Members = n1 n2
[node]
Name = n1
ComputerName = localhost
CPUnbr = 4
[node]
Name = n2
ComputerName = vonNeumann
CPUnbr = [4:6]
UserName = COMPUTOWN\John
Password = *****
RemoteDrive = C
RemoteDirectory = dynare_calcs\Remote
DynarePath = c:\dynare\matlab
MatlabOctavePath = matlab
The following configuration file defines a larger grid (cluster c4), combining four remote computers:
[cluster]
Name = c4
Members = n1 n2 n3 n4
[node]
Name = n1
ComputerName = vonNeumann1
CPUnbr = 4
UserName = COMPUTOWN\John
Password = *****
RemoteDrive = C
RemoteDirectory = dynare_calcs\Remote
DynarePath = c:\dynare\matlab
MatlabOctavePath = matlab
[node]
Name = n2
ComputerName = vonNeumann2
CPUnbr = 4
UserName = COMPUTOWN\John
Password = *****
RemoteDrive = C
RemoteDirectory = dynare_calcs\Remote
DynarePath = c:\dynare\matlab
MatlabOctavePath = matlab
[node]
Name = n3
ComputerName = vonNeumann3
CPUnbr = 2
UserName = COMPUTOWN\John
Password = *****
RemoteDrive = D
RemoteDirectory = dynare_calcs\Remote
DynarePath = c:\dynare\matlab
MatlabOctavePath = matlab
[node]
Name = n4
ComputerName = vonNeumann4
CPUnbr = 4
UserName = COMPUTOWN\John
Password = *****
RemoteDrive = C
RemoteDirectory = John\dynare_calcs\Remote
DynarePath = c:\dynare\matlab
MatlabOctavePath = matlab
3.2.5 Example Unix syntax for remote runs
One remote slave: the following configuration defines remote runs on the machine name.domain.org.
[cluster]
Name = unix1
Members = n2
[node]
Name = n2
ComputerName = name.domain.org
CPUnbr = 4
UserName = JohnSmith
RemoteDirectory = /home/john/Remote
DynarePath = /home/john/dynare/matlab
MatlabOctavePath = matlab
Combining local and remote runs: the following configuration defines a cluster of local and remote CPUs.
[cluster]
Name = unix2
Members = n1 n2
[node]
Name = n1
ComputerName = localhost
CPUnbr = 4
[node]
Name = n2
ComputerName = name.domain.org
CPUnbr = 4
UserName = JohnSmith
RemoteDirectory = /home/john/Remote
DynarePath = /home/john/dynare/matlab
MatlabOctavePath = matlab
In this section we describe what happens when the user omits a mandatory entry or provides invalid values for it, and how DYNARE reacts in these cases. The parallel package includes a utility (AnalyseComputationalEnvironment.m) devoted to this task (it is triggered by the command-line option parallel_test). When necessary during the discussion, we refer to the parallel entries used in the previous examples.
CPUnbr: a value for this variable must be in the form [s:d] with d>=s. If the user types values with s>d, their order is flipped and a warning message is issued. When the user provides a correct value for this field, DYNARE checks whether d CPUs (or cores) are actually available on the computer. Suppose that this check returns an integer nC. We can have three possibilities:
1. nC = d: all the available CPUs are used and no warning message is generated by DYNARE;
3. nC < d: DYNARE warns the user that fewer CPUs are available than declared. The parallel tasks will run in any case, but some CPUs will have multiple instances assigned, with no gain in computational time.
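The following MATLAB sketch illustrates the kind of check described above. It is not the code of AnalyseComputationalEnvironment.m: the variable names and the use of feature('numcores') to detect the available cores are assumptions made for the example only.

% s and d as parsed from the CPUnbr = [s:d] entry of the configuration file
s = 4; d = 3;                      % deliberately mis-ordered, for illustration
if s > d
    warning('CPUnbr: s > d, flipping the range');
    tmp = s; s = d; d = tmp;
end
nC = feature('numcores');          % cores detected on this node (illustrative call)
if nC < d
    warning('Only %d cores available but %d declared: some cores will receive multiple jobs.', nC, d);
end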
3.3 The developers' perspective
In this section we describe the DYNARE parallel routines in some detail.
Unix: under the Unix operating system, SSH must be installed on the master and on the slave machines. Moreover, SSH keys must be installed so that the SSH connections from the master to the slaves can be done without passwords.
masterParallel is the entry point to the parallelization system:
• It is called from the master computer, at the point where the parallelization system should be activated. Its main arguments are the name of the function containing the task to be run on every slave computer, the inputs to that function stored in two structures (one for local and the other for global variables), and the configuration of the cluster; this function exits when the task has finished on all computers of the cluster, and returns the output in a structure vector (one entry per slave);
The parallel toolbox also includes a number of utility routines.
In practice, parallelizing a portion of DYNARE code requires the following steps:
1. locate within DYNARE the portion of code suitable to be parallelized, i.e. an expensive for cycle;
2. suppose that the function tuna.m contains a for cycle that is suitable for parallelization: this cycle has to be extracted from tuna.m and put in a new MATLAB function named tuna_core.m;
3. at the point where the expensive cycle should start, the function tuna.m invokes the utility masterParallel.m, passing to it the options_.parallel structure, the name of the function to be run in parallel (tuna_core.m), the local and global variables needed, and all the information about the files (MATLAB functions *.m; data files *.mat) that will be handled by tuna_core.m (a schematic sketch of this call is given below).
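The pattern just described can be sketched as follows. The snippet is purely schematic: tuna.m and tuna_core.m are the hypothetical names used above, and the argument list of masterParallel (including the names fblck, nblck, globalVars and NamFileInput) is illustrative rather than its exact signature.

% inside tuna.m, at the point where the expensive for cycle used to start
localVars = struct('x', x, 'y', y);      % the local variables needed by the cycle
fOutVar = masterParallel(options_.parallel, fblck, nblck, NamFileInput, ...
                         'tuna_core', localVars, globalVars, options_.parallel_info);
% fOutVar is a structure vector, one entry per slave: tuna.m then loops over
% its entries to recombine the partial results exactly as in the serial run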
So far, we have parallelized the following functions, by selecting the most computationally intensive loops:
4. the Monte Carlo cycle looping over posterior parameter subdraws performing the IRF simulations (<*>_core1) and the cycle looping over exogenous shocks plotting the IRF charts (<*>_core2): posteriorIRF.m, posteriorIRF_core1.m, posteriorIRF_core2.m;
5. the Monte Carlo cycle looping over posterior parameter subdraws that computes filtered, smoothed and forecasted variables and shocks: prior_posterior_statistics.m, prior_posterior_statistics_core.m;
3.3.1 Writing parallel code: an example
Using MATLAB pseudo-code (but very close to the real code), we now describe in detail how to use the above step-by-step procedure to parallelize the random walk Metropolis-Hastings algorithm. Any other function can be parallelized in the same way.
Most of the computational time spent by the random_walk_metropolis_hastings.m function is taken by the cycle looping over the parallel chains performing the Metropolis:
function random_walk_metropolis_hastings(TargetFun, ProposalFun, ..., varargin)
[...]
for b = fblck:nblck,
    ...
end
[...]
Since those chains are totally independent, the obvious way to reduce the computational time is to parallelize this loop, executing the (nblck-fblck+1) chains on different computers/CPUs/cores.
To do so, we remove the for cycle and put it in a new function named <*>_core.m:
function myoutput = random_walk_metropolis_hastings_core(myinputs, fblck, nblck, ...)
[...]
% here we just list the global variables needed (they are set up properly
% by fParallel or slaveParallel)
TargetFun=myinputs.TargetFun;
ProposalFun=myinputs.ProposalFun;
xparam1=myinputs.xparam1;
[...]
for b = fblck:nblck,
    ...
end
[...]
myoutput.record = record;
[...]
The split of the for cycle has to be performed in such a way that the new
<*>_core function can work in both serial and parallel mode. In the latter
case, such a function will be invoked by the slave threads and executed for
the number of iterations assigned by masterParallel.m.
The modified random_walk_metropolis_hastings.m is therefore:
function random_walk_metropolis_hastings(TargetFun, ProposalFun, ..., varargin)
[...]
% here we wrap all the local variables needed by the <*>_core function
localVars = struct('TargetFun', TargetFun, ...
    [...]
    'd', d);
[...]
% here we put the switch between serial and parallel computation:
if isnumeric(options_.parallel) || (nblck-fblck)==0,
    % serial computation
    fout = random_walk_metropolis_hastings_core(localVars, fblck, nblck, 0);
    record = fout.record;
else
    % parallel computation: build the global variables and the list of files
    % needed on the remote machines, then invoke the master parallelizing
    % utility and recollect its output
    [...]
    fout = masterParallel(options_.parallel, fblck, nblck, [...], ...
        'random_walk_metropolis_hastings_core', localVars, [...], ...
        options_.parallel_info);
    [...]
end
Finally, in order to allow the master thread to monitor the progress of the slave threads, some message-passing elements have to be introduced in the <*>_core.m file. The utility function fMessageStatus.m has been designed as an interface for this task, and can be seen as a generalized form of the MATLAB utility waitbar.m.
In the following example, we show a typical use of this utility, again from the random walk Metropolis routine:
for j = 1:nruns
    [...]
    % define the progress of the loop:
    prtfrc = j/nruns;
    % occasionally (e.g. every 50 iterations) send the progress status to the
    % master (waitbarString and waitbarTitle are defined earlier in the routine):
    if mod(j, 50)==0 && whoiam
        fMessageStatus(prtfrc, whoiam, waitbarString, waitbarTitle, options_.parallel(ThisMatlab));
    end
    [...]
end
where:
% whoiam     [int]  index number of this CPU among all CPUs in the cluster
% ThisMatlab [int]  index number of this slave machine in the cluster
%                   (entry in options_.parallel)
The message is stored as a MATLAB data file (*.mat) saved in the working directory of the remote slave computer. The master periodically checks for those messages, retrieves the files from the remote computers and produces an advanced monitoring plot.
So, assuming we run two Metropolis chains: under the standard serial implementation, a first waitbar pops up in MATLAB, corresponding to the first chain.
4 Parallel DYNARE: testing
4.1 Test 1
The main goal here was to evaluate the parallel package on a fixed hardware platform, using chains of variable length. The model used for testing is a modification of Hradisky et al. (2006). This is a small-scale open economy DSGE model with 6 observed variables, 6 endogenous variables and 19 parameters to be estimated. We estimated the model on a bi-processor machine (Fujitsu Siemens, Celsius R630) powered with an Intel(R) Xeon(TM) 2.80GHz CPU with Hyper-Threading Technology; first with the original serial Metropolis and subsequently using the parallel solution, to take advantage of the two-processor technology. We ran chains of increasing length: 2500, 5000, 10,000, 50,000, 100,000, 250,000, 1,000,000.
Figure 1: Computational time (in minutes) versus chain length, for the serial and parallel implementations (Metropolis with two chains).

Figure 2: Reduction of computational time (the 'time gain') obtained with the parallel coding, versus chain length. The time gain is computed as (Ts − Tp)/Ts, where Ts and Tp denote the computing time of the serial and parallel implementations respectively.
Overall results are given in Figure 1, showing the computational time versus chain length, and Figure 2, showing the reduction of computational time (the time gain) with respect to the serial implementation provided by the parallel coding. The gain in computing time is about 45% on this test case, reducing the cost of running 1,000,000 Metropolis iterations from about 11.40 hours to about 6 hours (the ideal gain would be 50% in this case).
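As a rough cross-check of these figures (taking the reported values of roughly 11.4 hours for the serial run and 6 hours for the parallel run):

\[ \text{time gain} \;=\; \frac{T_s - T_p}{T_s} \;\approx\; \frac{11.4 - 6}{11.4} \;\approx\; 0.47, \]

which is consistent with the reported gain of about 45%, the ideal value with two CPUs being 0.5.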
4.2 Test 2
The scope of the second test was to verify whether the results were robust across different hardware platforms. We estimated the model with chain lengths of 1,000,000 runs on the following hardware platforms:
• Dual-core machine: Intel Centrino T2500 2.00GHz Dual Core (Fujitsu-Siemens, LifeBook S Series).
We first ran the tests with the normal configuration. However, since (i) a dissimilar software environment on the machines can influence the computation and (ii) Windows services (network, hard-disk writing, daemons, software updates, antivirus, etc.) can start during the simulation, we also ran the tests without allowing any other process to start during the estimation. Table 2 gives the results for the ordinary software environment, with process priority set to low/normal.

Machine               Single-processor   Bi-processor   Dual-core
Parallel              8:01:21            7:02:19        5:39:38
Serial                10:12:22           13:38:30       11:02:14
Speed-up rate         1.2722             1.9381         1.9498
Ideal speed-up rate   ~1.5               2              2
Results showed that the Dual-core technology provides a gain similar to the bi-processor results, again about 45%. The striking result was that the Dual-core processor clocked at 2.0GHz was about 30% faster than the Bi-processor clocked at 2.8GHz. Interesting gains were also obtained via multi-threading on the Single-processor machine, with a speed-up of about 1.27 (i.e. a time gain of about 21%). However, beware that we burned a number of processors performing tests on single processors with hyper-threading and using very long chains (1,000,000 runs)! We re-ran the tests on the Dual-core machine, cleaning the PC operation from any interference by other programs, and show the results in Table 3. A speed-up rate of 1.06 (i.e. a 5.6% time gain) can be obtained simply by hiding the MATLAB waitbar. The speed-up rate can be pushed to 1.22 (i.e. an 18% time gain) by disconnecting the network and setting the priority of the process to real time. It can be noted that, starting from the original configuration, which took 11:02 hours to run the two parallel chains, the computational time can be reduced to 4:40 hours (i.e. a total time gain of close to 60% with respect to the serial computation) by parallelizing and optimally configuring the operating environment.
Environment                                                 Computing time   Speed-up rate w.r.t. Table 2
Parallel, waitbar not visible                               5:06:00          1.06
Parallel, waitbar not visible, real-time process priority,
  unplugged network cable                                   4:40:49          1.22
These results are somewhat surprising and show how it is possible to reduce the computational time dramatically with slight modifications of the software configuration.
Given the excellent results reported above, we have parallelized many other DYNARE functions. This implies that parallel instances can be invoked many times during a single DYNARE session. Under the basic parallel toolbox implementation, which we call the 'Open/Close' strategy, this implies that MATLAB instances are opened and closed many times by system calls, possibly slowing down the computation, especially for 'entry-level' computer resources. As mentioned before, this suggested implementing an alternative strategy for the parallel toolbox, which we call the 'Always-open' strategy, where the slave MATLAB threads, once opened, stay alive and wait for new tasks assigned by the master until the full DYNARE procedure is completed. We show next the tests of these latest implementations.
4.3 Test 3
In this section we use the Lubik (2003) model as the test function² and a very simple, nowadays quite widespread, class of computer: the netbook personal computer. In particular, we used a Dell Mini 10 with an Intel Atom Z520 processor (1.33 GHz, 533 MHz FSB) and 1 GB of RAM (with Hyper-threading). First, we tested the computational gain of running a full Bayesian estimation: Metropolis (two parallel chains), MCMC diagnostics, posterior IRFs and filtered, smoothed and forecast variables, etc. In other words, we designed DYNARE sessions that invoke all the parallelized functions. Results are shown in Figures 3-4. In Figure 3 we show the computational time versus the length of the Metropolis chains in the serial and parallel settings ('Open/Close' strategy). With very short chain lengths the parallel setting obviously slows down the computations (due to the delays in opening/closing MATLAB sessions and in synchronization), while, as the chain length increases, we can get speed-up rates of up to 1.41 on this 'entry-level' portable computer (single processor with Hyper-threading).
In order to appreciate the gain from parallelizing all the functions invoked after Metropolis, in Figure 4 we show the results of the same experiment but without running Metropolis, i.e. using the load_mh_files = 1 and mh_replic = 0 DYNARE options (so that Metropolis and the MCMC diagnostics are not invoked). The parallelization of the functions invoked after Metropolis allows us to attain speed-up rates of 1.14 (i.e. a time gain of about 12%). Note that the computational cost of these functions is proportional to the chain length only when the latter is relatively small: beyond a certain chain length, the number of sub-draws used by these functions no longer increases and their computational cost levels off (see Table 4).
² The Lubik (2003) model is also selected as the 'official' test model for the parallel toolbox in DYNARE.
Chain length   Comp. time serial   Comp. time parallel
1005           246                 287
5005           755                 599
10005          1246                948
15005          1647                1250
20005          2068                1502
25005          2366                1675

Table 3: Computational time (s) using all the parallelized functions in DYNARE and the 'Open/Close' strategy (Lubik, 2003). We call this situation 'complete parallel'.

The data in Table 3 are plotted in Figure 3.

Figure 3: Computational time (s) versus Metropolis length, running all the parallelized functions in DYNARE with the basic parallel implementation (the 'Open/Close' strategy) (Lubik, 2003).

We then test the computational time for the same model without the Metropolis-Hastings, i.e. running only the functions invoked after it:

Chain length   Comp. time serial   Comp. time parallel
105            84                  117
1005           121                 165
5005           252                 239
10005          353                 330
15005          366                 339
20005          383                 335
25005          357                 314

Table 4: Computational time (s) without the Metropolis-Hastings algorithm, using the 'Open/Close' strategy (Lubik, 2003). We call this situation 'partial parallel'.

The data in Table 4 are plotted in Figure 4.

Figure 4: Computational time (s) versus Metropolis length, loading previously performed MH runs and running only the parallelized functions after Metropolis (Lubik, 2003). Basic parallel implementation (the 'Open/Close' strategy).
Chain length   Complete parallel   Partial parallel
5005           504                 205
10005          915                 306
15005          1203                320
20005          1506                334
25005          1611                322

Table 5: Computational time (s) with the 'Always-open' strategy, for the 'complete parallel' and 'partial parallel' experiments (Lubik, 2003).

We can now compare the results in Table 5 with those in Tables 3 and 4.

Figure 5: Comparison of the 'Open/Close' and the 'Always-open' strategies. Computational time (s) versus Metropolis length, running all the parallelized functions in DYNARE (Lubik, 2003).
Figure 6: Comparison of the 'Open/Close' and the 'Always-open' strategies. Computational time (s) versus Metropolis length, running only the parallelized functions after Metropolis (Lubik, 2003).
4.4 Test 4
In this section we use the QUEST III model (Ratto et al., 2009) and the same netbook as before: a Dell Mini 10 with an Intel Atom Z520 processor (1.33 GHz, 533 MHz FSB) and 1 GB of RAM.
Chain length   Time serial   Time parallel
105            98            95
1005           398           255
5005           1463          890
10005          2985          1655
20005          4810          2815
30005          6630          4022
40005          7466          5246
50000          9263          6565

Table 6: Computational time (s) using all the parallelized functions involved in DYNARE and the 'Open/Close' strategy (QUEST III).
Figure 7: Computational time (s) versus Metropolis length, running all the parallelized functions in DYNARE with the basic parallel implementation (the 'Open/Close' strategy) (Ratto et al., 2009).

Chain length   Comp. time serial   Comp. time parallel
105            62                  63
1005           285                 198
5005           498                 318
10005          798                 488
20005          799                 490
30005          781                 518
40005          768                 503
50005          823                 511
100000         801                 530

Table 7: Computational time (s) without the Metropolis-Hastings, using the 'Open/Close' strategy (QUEST III).

Figure 8: Computational time (s) versus Metropolis length, loading previously performed MH runs and running only the parallelized functions after Metropolis (Ratto et al., 2009). Basic parallel implementation (the 'Open/Close' strategy).

Considering only the parallelized functions invoked after Metropolis within the DYNARE execution (Figure 8), we can see that the speed-up rate of running the posterior analysis in parallel on two cores reaches 1.6 (i.e. a 38% time gain).
Chain length   Computational time: complete parallel   Computational time: partial parallel
1005           273                                      117
5005           871                                      332
10005          1588                                     460
20005          2791                                     470
30005          3963                                     492
40005          5292                                     479
50000          6624                                     498

Table 8: Computational time (s) with the 'Always-open' strategy (QUEST III).
Figure 9: Comparison of the 'Open/Close' and the 'Always-open' strategies for the 'complete parallel' experiment: computational time (s) versus MH runs (QUEST III).

Figure 10: Comparison of the 'Open/Close' and the 'Always-open' strategies for the 'partial parallel' experiment: computational time (s) versus MH runs (QUEST III).
On the other hand, in Figure 10 we can see that the 'Always-open' approach still provides a small speed-up rate of about 1.03. These results confirm the previous comment that the gain from the 'Always-open' strategy is especially visible when the computational time spent in a single parallel session is not too long; therefore, the bigger the model, the smaller the advantage of this strategy.
5 Conclusions
References
W.L. Goffe and M. Creel. Multi-core CPUs, clusters, and grid computing: A tutorial. Computational Economics, 32(4):353–382, 2008.
ParallelDYNARE. http://www.dynare.org/dynarewiki/paralleldynare,
2009.
M. Russinovich. PsTools v2.44, 2009. available at Microsoft TechNet,
http://technet.microsoft.com/en-us/sysinternals/bb896649.aspx.
Figures 11-17: Prior (grey lines) and posterior densities of the estimated parameters (black = 100,000 runs; red = 1,000,000 runs) using the RWMH algorithm (QUEST III model, Ratto et al., 2009).
A A tale on parallel computing
[Figure: the 'building' of CompuTown. Upper floors: the programs, communicating through high-level programming languages (Java, C, MATLAB, ...); first floor: the Operating System; ground floor: the Hardware.]
People at the ground floor are the transistors, the RAM, the CPU, the hard disk, etc. (i.e. the Computer Architecture, see chapters 1 and 2 in Brookshear). People at the second floor communicate with people at the first floor using the only existing staircase (the pipe). In these communications, people speak two different languages, and therefore do not understand each other. To remove this problem, people define a set of words, fixed and understood by everybody: the Programming Languages. More specifically, these languages are called high-level programming languages (Java, C/C++, FORTRAN, MATLAB, etc.), because they are related to the people living on the upper floors of the building! Sometimes people in the building also use pictures to communicate: the icons and the graphical user interface.
In a similar way, people at the first floor communicate with people at the ground floor. Not surprisingly, in this case people use low-level programming languages to communicate with each other (assembler, binary code, machine language, etc.). More importantly, however, people at the first floor must also manage and coordinate the requests from people on the second floor to people at the ground floor, since there is no direct communication between the ground and the second floor. For example, they need to translate high-level programming languages into binary code (the process of transforming a high-level programming language into binary code is called compilation): the Operating System performs this task.
Sometimes, people at the second floor try to talk directly with people at the ground floor, via the system calls. In the parallelizing software presented in this document, we will use these system calls frequently, to distribute the jobs among the available hardware resources and to coordinate the overall parallel computational process. If only a single person without a family lives on the ground floor, such as the porter, we have a single-core CPU. In this case, the porter can only do one task at a time for the people on the first or second floor (the main characteristic of the Von Neumann architecture). For example, in the morning he first collects and sorts the mail for the people in the building, and only after completing this task can he take care of the garden. If the porter has to do many jobs, he needs to write on a piece of paper the list of things to do: the memory and the CPU load. Furthermore, to properly
perform his tasks, sometimes the porter has to move some objects through the passageways at the ground floor (the System Bus). If the passageways have a standard width, we have a 32-bit CPU architecture (or bus). If the passageways are very large, we have, for example, a 64-bit CPU architecture (or bus). In this scenario there will be very busy days when many tasks have to be done and many things have to be moved around: the porter will be very tired, although he will be able to 'survive'. The most afflicted are always the people at the first floor. Every day they receive a lot of new, complex requests from the people at the second floor. These requests must be translated correctly and passed to the porter. The people at the second floor (the highest floor) 'live in cloud cuckoo land'. These people want everything to be done easily and promptly: artificial intelligence, robotics, etc. The activity in the building increases over time, so the porter decides to get some help in order to reduce the execution time of a single job. There are two ways to do this:
One possibility is to hire a second porter: the Bi-Processor Computer. In the other case, the porter may get married, producing a dual-core CPU. In this case, the wife can help the porter perform his tasks, or even take over some jobs entirely (for example, doing the accounting or taking care of the apartment). If the couple has children, they can get a further little helper: the thread, and hence the Hyper-threading technology.
Now a problem arises: who should coordinate the activities between the porters (and their families) and between the other buildings? Or, in other words, should we refurbish the first and second floors to take advantage of the innovations on the ground floor and of the new roads in CompuTown? First, we can lodge new people at the first floor: operating systems with a set of network tools and multi-processor support, as well as new people at the second floor with new programming paradigms (MPI, OpenMP, Parallel DYNARE, etc.). Second, a more complex communication scheme between the first and the ground floor is necessary, building a new set of stairs. So, for example, if we have two stairs between the ground and the first floor and two porters, using multi-processors and a new parallel programming paradigm we can assign jobs to each porter directly and independently, and then coordinate the overall work. In Parallel DYNARE we use this kind of 'refurbishing' to reduce the computational time and to meet the requests of the people at the second floor.
Unfortunately, this is only an idealized scenario, in which all the citizens of CompuTown live in peace and cooperate with each other. In reality, some building occupants argue with each other, and this can bring their work to a halt: these kinds of conflicts may be linked to software and hardware compatibility (between the ground and first floors) or to different software versions (between the second and first floors). The building administration, or the municipality of CompuTown, has to take care of these problems and fix them, to make the computer system operate properly.
This tale (which could also be called The Programs' Society) covers in a few pages the fundamental ideas of computer science.