Running on SLURM HPC
Running the pipeline on an HPC cluster that uses the Slurm workload manager.
🧱 Setting up your environment (only once per user)
You will need to load some modules in order to run the code. These include `gcc`, `anaconda`, and `git`. You should be able to do this using the `module` commands.
First purge the current modules:
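On a module-based system this is simply:

```bash
# Clear any currently loaded modules
module purge
```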
Now find the name of the available gcc 9.x module and load that:
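A sketch, assuming the cluster names its compiler modules in the usual `gcc/<version>` style; the exact 9.x version string will differ on your system:

```bash
# List the available gcc modules, then load a 9.x version
module avail gcc
module load gcc/9.3.0   # replace 9.3.0 with the 9.x version listed on your cluster
```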
Now load anaconda:
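The module name varies between clusters (for example `anaconda`, `anaconda3`, or `miniconda3`), so check `module avail` if the name below does not exist:

```bash
module load anaconda
```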
Now you need to create the conda environment. You will create the environment with two shorter commands, installing the Python and R dependencies separately; solving everything in a single command can take an extremely long time, so splitting it in two helps. These commands are still quite long, so you'll have time to brew some nice coffee ☕️:
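A hedged sketch of the two-step creation, assuming the environment is called `flepimop-env`; the package lists depend on your flepiMoP version, so treat them as placeholders:

```bash
# Step 1: create the environment with the Python dependencies
conda create -y -n flepimop-env -c conda-forge python=3.10 pip numpy pandas

# Step 2: add the R dependencies to the same environment
conda install -y -n flepimop-env -c conda-forge r-base r-essentials r-arrow
```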
The next step in preparing your environment is to install the necessary R packages. First, activate your environment, launch R and then install the following packages.
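A sketch of this step; the package list below is only illustrative (install whichever packages your flepiMoP version requires), and `Rscript -e` is used here in place of an interactive R session:

```bash
# Activate the environment and install R packages from within it
conda activate flepimop-env
Rscript -e 'install.packages(c("devtools", "tidyverse", "data.table"), repos = "https://cloud.r-project.org")'
```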
You are now ready to run using SLURM!
🗂️ Files and folder organization
HPC administrators are likely to provide different partitions with different properties for your use. We recommend a partition that supports a shared environment and storage-intensive needs.
For example, we use a scratch partition with 20T of space, which has a primary user and is shared with other users. In our setup this looks like `/scratch4/primary-user/`. We will describe this setup as an example, but note that your HPC setup might be different (if so, change the relevant paths).
We recommend setting up two folders: one containing the code and one for storing the model output. The helper scripts are set up to use this folder structure.
Code folder: `/scratch4/primary-user/flepimop-code`
Check the "Before any run" page for how to set up the appropriate folders or repositories. For our purposes, a subfolder is also set up for each user, which allows users to manage their own code and launch their own runs. For example, for a given user this might look like:

`/scratch4/primary-user/flepimop-code/$USER/flepiMoP`
`/scratch4/primary-user/flepimop-code/$USER/flepimop-sample`

Note that the repositories should be cloned flat, i.e. the `flepiMoP` repository is at the same level as the data repository, not inside it. A hedged sketch of this layout is shown below.
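This sketch assumes the flepiMoP repository is cloned from the HopkinsIDD GitHub organization; substitute the URL of your own project repository:

```bash
# Create your per-user code folder and clone both repositories side by side (flat, not nested)
mkdir -p /scratch4/primary-user/flepimop-code/$USER
cd /scratch4/primary-user/flepimop-code/$USER
git clone https://github.com/HopkinsIDD/flepiMoP.git
git clone <url-of-your-project-repository> flepimop-sample
```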
Output folder: `/scratch4/primary-user/flepimop-runs`
After an inference run finishes, its output and log files are copied from the project folder where the model is run to `/scratch4/primary-user/flepimop-runs/$JOB_NAME`, where `JOB_NAME` is an environment variable set up within the submission script (described below; it is usually of the form `USA-DATE`).
🚀 Run inference using Slurm (do every time)
In your HPC system, enter the following command:
This will prepare the environment and set up variables for the validation date, the location of the model output from which you want to resume (this can be an S3 bucket or a local path), and the run index for this run. If you don't want to set a variable, just hit enter.
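If you prefer to set these values yourself, a minimal sketch is below; the variable names are assumptions based on common flepiMoP setups, so check what your submission scripts actually read:

```bash
# Illustrative only -- variable names and values are assumptions; check your flepiMoP submission scripts
export VALIDATION_DATE="2024-01-12"                  # date up to which ground truth data are used
export RESUME_LOCATION="s3://my-bucket/previous-run" # S3 bucket or local path to resume from; leave unset for a fresh run
export FLEPI_RUN_INDEX="my_run_1"                    # run index identifying this run
```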
Check that the conda environment is activated: you should see `(flepimop-env)` at the left of your command-line prompt.
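If it is not active, you can activate it manually (assuming the environment name from the setup above):

```bash
conda activate flepimop-env
```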
Then prepare the pipeline directory. If you have already done this and the pipeline hasn't been updated (`git pull` says it's up to date), you can skip these steps.
Define environment variables
Create environment variables for the paths to the flepiMoP code folder and the project folder:
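A sketch, using the example paths above and assuming the variable names `FLEPI_PATH` and `PROJECT_PATH` (your helper scripts may expect different names):

```bash
export FLEPI_PATH=/scratch4/primary-user/flepimop-code/$USER/flepiMoP
export PROJECT_PATH=/scratch4/primary-user/flepimop-code/$USER/flepimop-sample
```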
Go into the code directory and install the R and Python packages:
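A hedged sketch of the installation, assuming the package layout of recent flepiMoP versions (the paths and the helper script name may differ in yours):

```bash
cd $FLEPI_PATH

# Install the Python package (gempyor) into the active conda environment
pip install --no-deps -e flepimop/gempyor_pkg/

# Install the R packages (the helper script name is an assumption; check your flepiMoP checkout)
Rscript build/local_install.R
```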
Each installation step may take a few minutes to run.
Run the code
Everything is now ready. 🎉 The next step depends on what sort of simulation you want to run: one that includes inference (fitting the model to data) or only a forward simulation (non-inference). Inference is run from R, while forward-only simulations are run directly from the Python package `gempyor`.
In either case, navigate to the project directory and make sure to delete any old model output files that are there. Note that in the example config provided, the output is saved to `model_output`, but this may be defined otherwise via `config::model_output_dirname`.
Set the path to your config
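For example, assuming the variable name `CONFIG_PATH` and a config file at the root of your project folder (use your own file name):

```bash
export CONFIG_PATH=$PROJECT_PATH/config.yml   # replace config.yml with your config file
```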
You may want to test that it works before launching a full batch:
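A sketch of a small local test run, assuming the R inference entry point used by recent flepiMoP versions (the script path and flags may differ in yours; here `-j`, `-n`, and `-k` limit the run to one job, one slot, and one iteration):

```bash
cd $PROJECT_PATH
Rscript $FLEPI_PATH/flepimop/main_scripts/inference_main.R -c $CONFIG_PATH -j 1 -n 1 -k 1
```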
If this fails, you may want to investigate the error. If it succeeds, you can proceed (but remember to delete the model output it produced).
Launch your inference batch job
When an inference batch job is launched, a few postprocessing scripts are set up to run automatically via `postprocessing-scripts.sh`. You can manually change what you want to run by editing this script.
To launch the whole inference batch job, type the following command:
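Depending on your flepiMoP version, this is a batch launcher along the lines below (the script name is an assumption; the part after the `2` only captures the output in a log file, as noted next):

```bash
python $FLEPI_PATH/batch/inference_job_launcher.py --slurm 2>&1 | tee "${FLEPI_RUN_INDEX}_submission.log"
```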
This command infers everything from your environment variables: whether or not there is a resume, what the run ID is, and so on. The part after the `2` makes sure the output is also redirected to a log file, but has no impact on your submission.
This launches a batch job on your HPC, with each slot on a separate node.
If you'd like to have more control, you can specify the arguments manually:
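The exact flags depend on your flepiMoP version; assuming the same launcher script as above, a safe first step is to list them:

```bash
# List the launcher's supported arguments before overriding any defaults
python $FLEPI_PATH/batch/inference_job_launcher.py --help
```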
Commit files to GitHub. After the job is successfully submitted, you will be in a new branch of the project repository. For documentation purposes, we recommend committing the ground truth data files to the branch on GitHub:
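For example (the data path and commit message are placeholders; adjust them to where your ground truth files live):

```bash
cd $PROJECT_PATH
git add data/                                             # placeholder: your ground truth files
git commit -m "Ground truth data for this inference run"
git push --set-upstream origin $(git branch --show-current)
```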
DO NOT move to a different git branch after this step, as the run will use data in the current directory.
🛠 Helpful tools and other notes
Monitor your run
During an inference batch run, log files will show the progress of each array/slot. These log files will show up in your project directory and have the file name structure:
To view these as they are being written, you can use `tail`:
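(The log-file name pattern below is an assumption; match whatever pattern appears in your project directory.)

```bash
# Follow a log file as new lines are appended
tail -f log_*
```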
Other commands that are helpful for monitoring the status of your runs (note that `<Job ID>` here is the SLURM job ID, not the `JOB_NAME` set by flepiMoP):
| SLURM command | What does it do? |
| --- | --- |
| `squeue -u $USER` | Displays the names and statuses of all jobs submitted by the user. Job status might be: R: running, PD: pending. |
| `seff <Job ID>` | Displays information related to the efficiency of resource usage by the job. |
| `sacct` | Displays accounting data for all jobs and job steps. |
| `scancel <Job ID>` | Cancels a job. If you want to cancel/kill all jobs submitted by a user, you can type `scancel -u $USER`. |
Running an interactive session
To check your code prior to submitting a large batch job, it's often helpful to run an interactive session to debug your code and check that everything works as you want. On 🪨🐠 this can be done using `interact`, as in the line below, which requests an interactive session with 4 cores, 24 GB of memory, for 12 hours.
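A sketch of such a request (the exact flag syntax for `interact` may differ slightly on your cluster):

```bash
# Request an interactive session with 4 cores, 24 GB of memory, for 12 hours
interact -n 4 -m 24G -t 12:00:00
```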
The options here are `[-n tasks or cores]`, `[-t walltime]`, `[-p partition]`, and `[-m memory]`, though other options can also be included or modified to your requirements. More details can be found in the ARCH User Guide.
Moving files to your local computer
Often you'll need to move files back and forth between your HPC and your local computer. To do this, your HPC might suggest FileZilla or the Globus file manager. You can also use the commands `scp` or `rsync` (check what works for your HPC).
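For example, run from a terminal on your local machine (the username, hostname, and job name are placeholders):

```bash
# Copy a run's output from the HPC to the current local directory
# (replace JOB_NAME with the name of the run you want, e.g. the USA-DATE job name)
scp -r your-username@your-hpc-hostname:/scratch4/primary-user/flepimop-runs/JOB_NAME .

# Or with rsync, which can resume interrupted transfers
rsync -avz your-username@your-hpc-hostname:/scratch4/primary-user/flepimop-runs/JOB_NAME .
```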
Other helpful commands
If your system is approaching a file number quota, you can find subfolders that contain a large number of files by typing:
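One way to do this with standard shell tools:

```bash
# Count files (recursively) under each immediate subfolder and sort by count, largest first
for d in */; do echo "$(find "$d" -type f | wc -l) $d"; done | sort -rn
```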