TephraProb on a cluster
08 Jun 2018 · tephraprob · parallel · codes
There is a relatively easy way to parallelise TephraProb on a computer cluster without having to struggle with Matlab’s Parallel Computing Toolbox. Matlab is not even needed on the cluster: it is just a matter of sending individual Tephra2 runs to different nodes. Here is how to do so:
- Generate your input files and eruption scenarios locally
- Send the required files to the cluster
- Run the scenario on the cluster
- Retrieve the output files from the cluster to the local computer
- Post-process the output files (e.g. probability calculations) locally
Setting up remote files
Following the procedure in the TephraProb manual, all tasks should be completed up to section 5.3, meaning you should be ready to hit the Run Tephra2 function. The main file used for the parallelisation is T2_stor.txt, which contains the Tephra2 command lines and sits at the root of the TephraProb folder on the cluster (see the file tree below).
Transfer your run, grid and wind files to the cluster. Unless you decide to customize the paths, the file structure on the cluster should look like this:
ROOT
├── MODEL/
├── RUNS/
│   └── runName/
│       └── runNumber/
│           ├── CONF/*.*
│           ├── GS*.*
│           └── OUT/*.*
├── WIND/
│   └── windName/
│       └── pathToAscii/*.*
├── GRID/
│   └── gridName/
│       └── *.utm
├── T2_stor.txt
└── runTephraProb.sh
Note that not all local files are needed on the cluster. The run folder can be transferred with rsync, excluding the unnecessary files:
rsync -arvz --exclude '*.mat' --exclude 'FIG' --exclude 'KML' --exclude 'LOG' --exclude 'SUM' RUNS/run_name host@server:~/TephraProb/RUNS/
Compiling Tephra2
Tephra2 needs to be recompiled for the cluster’s architecture. On the cluster, from the root of TephraProb, navigate to the MODEL/ folder and type:
make
Now, navigate back to the root of TephraProb and type:
chmod 755 MODEL/tephra2-2012
That should get Tephra2 running on the cluster. Getting a Permission denied error when launching runs usually means the executable flag was not set with the chmod command above.
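A quick way to check for that usual culprit before submitting anything is a small sketch like this (the message strings are illustrative):

```shell
# Check that the recompiled binary exists and carries the executable flag
if [ -x MODEL/tephra2-2012 ]; then
    echo "tephra2-2012 is ready"
else
    echo "tephra2-2012 is missing or not executable; run: chmod 755 MODEL/tephra2-2012"
fi
```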
Running in parallel
The parallelization is achieved using job arrays. Conceptually:
- T2_stor.txt is cut into smaller files named T2_stor.txt00, T2_stor.txt01 … T2_stor.txtXX using the split Unix command
- The subset of Tephra2 commands in each sub-file is sent to a different node using a job array
On the cluster, split T2_stor.txt:
split -l 1000 -a 2 -d T2_stor.txt T2_stor.txt
where -l 1000 sets the number of lines per sub-file, -a 2 the length of the suffix, and -d requests numeric rather than alphabetic suffixes.
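As a quick illustration of how split names its output, here is the same command applied to a throwaway 25-line file instead of the real T2_stor.txt:

```shell
# Create a 25-line dummy file and split it into 10-line chunks
seq 25 > demo.txt
split -l 10 -a 2 -d demo.txt demo.txt
ls demo.txt??        # demo.txt00  demo.txt01  demo.txt02
wc -l < demo.txt02   # 5 -- the last chunk holds the remaining lines
rm demo.txt demo.txt??
```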
Let’s say that this created 10 files named T2_stor.txt00 to T2_stor.txt09.
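The number of sub-files, and hence the job-array range to submit, can be derived automatically rather than counted by hand. A minimal sketch, assuming the two-digit numeric suffixes produced above:

```shell
# Count the chunk files and print the matching job-array bounds
nchunks=$(ls T2_stor.txt[0-9][0-9] | wc -l)
echo "array range: 0-$((nchunks - 1))"
```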
SLURM
On a SLURM cluster, the bash script runTephraProb.sh contains:
module load GCC/4.9.3-2.25
module load OpenMPI/1.10.2
module load parallel
chunk=`printf "%02d" $SLURM_ARRAY_TASK_ID`
srun parallel -j 16 -a T2_stor.txt$chunk
The job can then be submitted using:
sbatch --array=0-9 runTephraProb.sh
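The script above only shows the commands; depending on the cluster’s configuration, a few #SBATCH directives may need to precede them. The values below are purely illustrative and must be adapted to the cluster at hand:

```shell
#!/bin/bash
#SBATCH --job-name=tephra2      # illustrative values, adjust to your cluster
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16    # matches the -j 16 passed to parallel
#SBATCH --time=02:00:00
```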
OpenPBS
On an OpenPBS cluster, the bash script runTephraProb.sh contains:
module load openmpi/1.4.5-gnu
module load parallel
cd $PBS_O_WORKDIR
chunk=`printf "%02d" $PBS_ARRAYID`
mpirun -np 12 -machinefile $PBS_NODEFILE parallel -j 12 -a T2_stor.txt$chunk
The job can then be submitted using:
qsub -t 0-9 runTephraProb.sh
Post-processing
Once the modelling is finished, copy the remote version of the run folder back to the local computer:
rsync -arvz --ignore-existing host@server:~/TephraProb/RUNS/run_name/ RUNS/run_name/
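Before post-processing, it can be worth checking that every Tephra2 command actually produced an output file. A minimal sketch, assuming T2_stor.txt holds one command per run and using the illustrative path RUNS/run_name:

```shell
# Compare the number of Tephra2 commands with the number of retrieved outputs
expected=$(wc -l < T2_stor.txt)
produced=$(find RUNS/run_name -path '*/OUT/*' -type f | wc -l)
if [ "$expected" -eq "$produced" ]; then
    echo "all $expected outputs retrieved"
else
    echo "missing $((expected - produced)) output file(s)"
fi
```

Post-processing (e.g. the probability calculations) can then proceed locally in TephraProb as usual.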