
    TephraProb on a cluster

    There is a relatively easy way to parallelise TephraProb on a computer cluster without having to struggle with Matlab’s Parallel Computing Toolbox. Matlab is not even needed on the cluster: it is just a matter of dispatching individual Tephra2 runs to different nodes. Here is how to do so:

    1. Generate your input files and eruption scenarios locally
    2. Send the required files to the cluster
    3. Run the scenario on the cluster
    4. Retrieve the output files from the cluster to the local computer
    5. Post-process the output files (e.g. probability calculations) locally

    Setting up remote files

    Following the procedure in the TephraProb manual, complete all tasks up to section 5.3 - meaning you should be set to hit the Run Tephra2 function. The main file used for the parallelisation is T2_stor.txt, located in RUNS/runName/runNumber/T2_stor.txt, which contains the Tephra2 commands for all single model runs of the scenario.

    Transfer your run, grid and wind files to the cluster. Unless you decide to customize T2_stor.txt, the directory tree on the cluster should look like the following. Note that not all files of the RUNS/, GRID/ and WIND/ folders need to be transferred.

    ROOT
    ├── MODEL/
    ├── RUNS/
    │   └── runName/
    │       └── runNumber/
    │           ├── CONF/*.*
    │           ├── GS*.*
    │           └── OUT/*.*
    ├── WIND/
    │   └── windName/
    │           └── pathToAscii/*.*
    ├── GRID/
    │   └── gridName/
    │           └── *.utm
    ├── T2_stor.txt
    └── runTephraProb.sh
    

    Note that T2_stor.txt should be at the root, and all commands specified in it should point to locations that have been uploaded to the cluster. Keep in mind that, by default, all paths defined in T2_stor.txt are relative.

    The rsync command comes in handy here. To copy RUNS/ folders, you can use something like the following command, which ignores all unnecessary files and folders:

    rsync -arvz --exclude *.mat --exclude FIG --exclude KML --exclude LOG --exclude SUM  RUNS/run_name host@server:~/TephraProb/RUNS/
    
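The WIND/ and GRID/ folders can be trimmed the same way. A minimal sketch, where windName, gridName and host@server are placeholders taken from the tree above (only the ASCII wind profiles and the .utm grid are needed remotely):

```shell
# Transfer the wind profiles; windName and host@server are placeholders.
rsync -arvz WIND/windName host@server:~/TephraProb/WIND/

# Transfer only the *.utm grid file(s), keeping the directory structure:
# the include/exclude filters let directories and *.utm through, skip the rest.
rsync -arvz --include '*/' --include '*.utm' --exclude '*' \
      GRID/gridName host@server:~/TephraProb/GRID/
```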

    Compiling Tephra2

    Tephra2 needs to be recompiled for the cluster’s architecture. On the cluster, from the root of TephraProb, navigate to MODEL/forward_src/ and enter:

    make
    

    Now, navigate back to the root of TephraProb and type:

    chmod 755 MODEL/tephra2-2012
    

    That should get Tephra2 running on the cluster. Getting a File format not recognized error? The binary was most likely compiled for a different architecture, so recompile it on the cluster as shown above.
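A quick way to diagnose this is to compare the binary's format against the node's own architecture. A minimal check, assuming the default binary name from the tree above:

```shell
# "File format not recognized" usually means the tephra2 binary was built
# on another machine. Compare its format with this node's architecture.
bin=MODEL/tephra2-2012           # path per the tree above; adjust if needed
if [ -f "$bin" ]; then
    file "$bin"                  # e.g. "ELF 64-bit LSB executable, x86-64 ..."
fi
uname -m                         # should match, e.g. "x86_64"
```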

    Running in parallel

    The parallelization is achieved using job arrays. Conceptually:

    1. T2_stor.txt is cut into smaller files named T2_stor.txt00, T2_stor.txt01, …, T2_stor.txtXX using the split Unix command

    2. The subset of Tephra2 commands in each sub-file is sent to a different node using a job array

    On the cluster, split T2_stor.txt:

    split -l 1000 -a 2 -d T2_stor.txt T2_stor.txt
    

    where -l is the number of lines in each sub-file (here T2_stor.txt will be split into subsets of 1000 lines) and -a is the number of digits appended to the sub-file name (e.g. -a 2 together with -d produces 00, 01, etc…). The last argument is the prefix used for the sub-file names.
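The job-array range can be derived from the number of chunks rather than counted by hand. A throwaway sketch, run in an empty directory with a stand-in run file of 10000 commands:

```shell
# Stand-in for a real T2_stor.txt: 10000 dummy commands split into
# 10 chunks of 1000 lines each, named T2_stor.txt00 ... T2_stor.txt09.
cd "$(mktemp -d)"
seq 1 10000 > T2_stor.txt
split -l 1000 -a 2 -d T2_stor.txt T2_stor.txt

# The glob matches only the two-digit chunks, not T2_stor.txt itself.
nchunks=$(ls T2_stor.txt[0-9][0-9] | wc -l)
echo "submit with --array=0-$((nchunks - 1))"   # prints: submit with --array=0-9
```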

    Let’s say that this created 10 files named T2_stor.txt00 to T2_stor.txt09. We need to adapt the job array to account for the range 0-9. We then use a handy little piece of code called GNU Parallel to send single-CPU jobs to the nodes.

    SLURM

    On a SLURM cluster, the bash script runTephraProb.sh might look like this:

    #!/bin/bash
    module load GCC/4.9.3-2.25
    module load OpenMPI/1.10.2
    module load parallel
    chunk=`printf "%02d" $SLURM_ARRAY_TASK_ID`
    srun parallel -j 16 -a T2_stor.txt$chunk
    

    The job can then be submitted using:

    sbatch --array=0-9 runTephraProb.sh
    
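Note the printf "%02d" in the script: split wrote two-digit, zero-padded suffixes, so the task id must be padded the same way before being appended to the file name. For example:

```shell
# Map job-array task ids to the chunk names produced by split -a 2 -d.
for id in 0 3 9; do
    printf "task %d -> T2_stor.txt%02d\n" "$id" "$id"
done
# prints:
# task 0 -> T2_stor.txt00
# task 3 -> T2_stor.txt03
# task 9 -> T2_stor.txt09
```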

    OpenPBS

    On an OpenPBS cluster, the bash script runTephraProb.sh might look like this:

    #!/bin/bash
    module load openmpi/1.4.5-gnu
    module load parallel
    cd $PBS_O_WORKDIR
    chunk=`printf "%02d" $PBS_ARRAYID`
    mpirun -np 12 -machinefile $PBS_NODEFILE parallel -j 12 -a T2_stor.txt$chunk
    

    The job can then be submitted using:

    qsub -t 0-9 runTephraProb.sh 
    

    Post-processing

    Once the modelling is finished, copy the remote version of RUNS/runName/runNumber/OUT/ back to your local RUNS/runName/runNumber/OUT/ and go on with the post-processing. Again, the rsync command can help you achieve something like this:

    rsync -arvz --ignore-existing host@server:~/TephraProb/RUNS/run_name/ RUNS/run_name/
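Before launching the post-processing, it can be worth checking that every run actually produced an output. A minimal sketch, assuming one output file per line of T2_stor.txt and the placeholder paths from the tree above:

```shell
# Completeness check: each Tephra2 command in T2_stor.txt should have
# written one file to OUT/. runName/runNumber are placeholders.
out=RUNS/runName/runNumber/OUT
if [ -f T2_stor.txt ] && [ -d "$out" ]; then
    expected=$(wc -l < T2_stor.txt)
    produced=$(find "$out" -type f | wc -l)
    echo "retrieved $produced of $expected outputs"
fi
```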