Running On A HPC With Slurm

Tutorial on how to install and run flepiMoP on a supported HPC with slurm.


These details cover how to install and initialize flepiMoP in an HPC environment and submit a job with slurm.

Currently only JHU's Rockfish and UNC's Longleaf HPC clusters are supported. If you need support for a new HPC cluster, please file an issue in the flepiMoP GitHub repository.

To get access to one of the supported HPC environments, please refer to the following documentation before continuing:

  • UNC's Longleaf Cluster for UNC users, or

  • JHU's Rockfish Cluster for JHU users.

External users will need to consult with their PI contact at the respective institution.

Installing flepiMoP

This task only needs to be run once to do the initial install of flepiMoP.

On JHU's Rockfish you'll need to run these steps in a slurm interactive job. This can be launched with /data/apps/helpers/interact -n 4 -m 12GB -t 4:00:00, but please consult the Rockfish user guide for up-to-date information.

Obtain a temporary clone of the flepiMoP repository. The install script will place a permanent clone in the correct location once run. You may need to set up git on the HPC cluster being used before running this step.
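
If git has not been configured on the cluster yet, a minimal setup might look like the following (the name, email, and key type shown are placeholders; adjust them to your own and add the generated public key to your GitHub account):

$ git config --global user.name "Your Name"
$ git config --global user.email "you@example.org"
$ ssh-keygen -t ed25519 -C "you@example.org"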

$ git clone git@github.com:HopkinsIDD/flepiMoP.git --depth 1
Cloning into 'flepiMoP'...
remote: Enumerating objects: 487, done.
remote: Counting objects: 100% (487/487), done.
remote: Compressing objects: 100% (424/424), done.
remote: Total 487 (delta 59), reused 320 (delta 34), pack-reused 0 (from 0)
Receiving objects: 100% (487/487), 84.04 MiB | 41.45 MiB/s, done.
Resolving deltas: 100% (59/59), done.
Updating files: 100% (411/411), done.

Run the hpc_install_or_update script, substituting <cluster-name> with either rockfish or longleaf. This script will prompt for the location to place the flepiMoP clone and the name of the conda environment it will create. If this is your first time using this script, accepting the defaults is the quickest way to get started. Also, expect this script to take a while the first time you run it.

$ ./flepiMoP/build/hpc_install_or_update <cluster-name>

Remove the temporary clone of the flepiMoP repository created earlier. This step is not required, but it does help avoid confusion later.

$ rm -rf flepiMoP/

Updating flepiMoP

Updating flepiMoP is designed to work the same way as installing it. First, make sure that your clone of the flepiMoP repository is on the branch you're working with (if doing development or operations work; see the example below), and then run the hpc_install_or_update script, substituting <cluster-name> with either rockfish or longleaf.
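
For example, to switch the permanent clone to the branch you want before updating (the clone location and branch name below are placeholders):

$ git -C <path to your flepiMoP clone> checkout <branch>
$ git -C <path to your flepiMoP clone> pull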

$ ./flepiMoP/build/hpc_install_or_update <cluster-name>

Initialize The Created flepiMoP Environment

These steps to initialize the environment need to be run on a per-run or as-needed basis.

Change directory to where the full clone of the flepiMoP repository was placed (the install script above states the location in its output), and then run the hpc_init script, substituting <cluster-name> with either rockfish or longleaf. This script assumes the same defaults as the install script for where the flepiMoP clone is and the name of the conda environment. It will also ask about a project directory and config; if this is your first time initializing flepiMoP, it might be helpful to use a config from the flepiMoP/examples/tutorials directory as a test.

$ ./batch/hpc_init <cluster-name>

Upon completion, this script will output a sample set of commands to run to quickly test whether the installation/initialization has gone okay.
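
As an additional quick sanity check (independent of whatever commands hpc_init prints, which may differ), you can confirm that the conda environment is active and the flepimop CLI is available:

$ which flepimop    # should point into the flepiMoP conda environment
$ flepimop --help   # should print the available subcommands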

Submitting A Batch Inference Job To Slurm

The main entry point for submitting batch inference jobs is the flepimop batch-calibrate action. This CLI tool lets you submit a job to slurm once logged into a cluster. For details on the available options please refer to flepimop batch-calibrate --help. As a quick example, let's submit an R inference job and an EMCEE inference job. For the R inference run, execute the following once logged into either longleaf or rockfish:

$ export PROJECT_PATH="$FLEPI_PATH/examples/tutorials/"
$ cd $PROJECT_PATH
$ flepimop batch-calibrate \
    --blocks 1 \
    --chains 4 \
    --samples 20 \
    --simulations 100 \
    --time-limit 30min \
    --slurm \
    --nodes 4 \
    --cpus 1 \
    --memory 1G \
    --extra 'partition=<your partition, if relevant>' \
    --extra 'email=<your email, if relevant>' \
    --skip-checkout \
    -vvv \
    config_sample_2pop_inference.yml

This command will produce a large amount of output due to -vvv. If you want to try the command without actually submitting the job, you can pass the --dry-run option. The command submits a job to calibrate the sample two-population configuration, which uses R inference. R inference supports array jobs, so each chain will be run on an individual node with 1 CPU and 1GB of memory apiece. Additionally, the --extra option allows you to provide additional information to the batch system; in this case, which partition to submit the jobs to, but email is also supported with slurm for notifications. After running this command you should notice the following output:

  • config_sample_2pop-YYYYMMDDTHHMMSS.yml: This file contains the compiled config that is actually submitted for inference,

  • manifest.json: This file contains a description of the submitted job with the command used, the job name, and flepiMoP and project git commit hashes,

  • slurm-*_*.out: These files contain output from slurm for each of the array jobs submitted,

  • tmp*.sbatch: Contains the generated file submitted to slurm with sbatch.

For operational runs, these files should be committed to the checked-out branch for archival/reproducibility reasons. Since this is just a test, you can safely remove these files after inspecting them.
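
While the array job runs, you can monitor it with standard slurm tools, for example:

$ squeue -u $USER         # list your queued and running jobs
$ tail -f slurm-*_*.out   # follow the output of the array tasks as they run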

Now, let's submit an EMCEE inference job with the same tool. Importantly, the options we'll use won't change much because flepimop batch-calibrate is designed to provide a unified, implementation-independent interface.

$ export PROJECT_PATH="$FLEPI_PATH/examples/simple_usa_statelevel/"
$ cd $PROJECT_PATH
$ flepimop batch-calibrate \
    --blocks 1 \
    --chains 4 \
    --samples 20 \
    --simulations 100 \
    --time-limit 30min \
    --slurm \
    --nodes 1 \
    --cpus 4 \
    --memory 8G \
    --extra 'partition=<your partition, if relevant>' \
    --extra 'email=<your email, if relevant>' \
    --skip-checkout \
    -vvv \
    simple_usa_statelevel.yml

One notable difference is that, unlike R inference, EMCEE inference only supports running on 1 node, so resources for this command are adjusted accordingly:

  • Swapping 4 nodes with 1 CPU each for 1 node with 4 CPUs, and

  • Doubling the memory usage from 4 nodes with 1GB each for 4GB total to 1 node with 8GB for 8GB total.

The extra increase in memory is needed because this configuration is slightly more resource-intensive than the previous example. This command will also produce a similar set of record-keeping files as before, which you can safely remove after inspecting them.

Estimating Required Resources For A Batch Inference Job

When inspecting the output of flepimop batch-calibrate --help, you may have noticed several options named --estimate-*. While not required for the smaller jobs above, this tool can estimate the resources required to run a larger batch inference job. It does this by running smaller jobs and then projecting the required resources for the large job from them. To use this feature, provide the --estimate flag, the job size of the targeted job, resources for the test jobs, and the following estimation settings:

  • --estimate-runs: The number of smaller jobs to run to estimate the required resources from,

  • --estimate-interval: The size of the prediction interval to use for estimating the resource/time limit upper bounds,

  • --estimate-vary: The job size elements to vary when generating smaller jobs,

  • --estimate-factors: The factors to use in projecting the larger scale estimation job,

  • --estimate-measurements: The resources to estimate,

  • --estimate-scale-upper: The scale factor to use to determine the largest sample job to generate, and

  • --estimate-scale-lower: The scale factor to use to determine the smallest sample job to generate.

Effectively using these options requires some knowledge of the underlying inference method. Sticking with the simple USA state-level example above, try submitting the following command (after cleaning up the output from the previous example):

$ flepimop batch-calibrate \
    --blocks 1 \
    --chains 4 \
    --samples 20 \
    --simulations 500 \
    --time-limit 2hr \
    --slurm \
    --nodes 1 \
    --cpus 4 \
    --memory 24GB \
    --extra 'partition=<your partition, if relevant>' \
    --extra 'email=<your email, if relevant>' \
    --skip-checkout \
    --estimate \
    --estimate-runs 6 \
    --estimate-interval 0.8 \
    --estimate-vary simulations \
    --estimate-factors simulations \
    --estimate-measurements time \
    --estimate-measurements memory \
    --estimate-scale-upper 5 \
    --estimate-scale-lower 10 \
    -vvv \
    simple_usa_statelevel.yml > simple_usa_statelevel_estimation.log 2>&1 & disown

In short, this command will submit 6 test jobs that vary the number of simulations and measure time and memory. The number of simulations is then used to project the required resources. The test jobs will range from 1/10 to 1/5 of the target job size. This command will take a while to run because it needs to wait for these test jobs to finish before it can do the analysis, so you can check on the progress via the simple_usa_statelevel_estimation.log file.
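
For example, to follow the estimation progress while the test jobs run:

$ tail -f simple_usa_statelevel_estimation.log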

Once this command finishes running, you should notice a file called USA_influpaint_resources.json. This JSON file contains the estimated resources required to run the target job. You can submit the target job with the estimated resources by using the same command as before, dropping the --estimate-* options and adding the --from-estimate option to pull the information from the outputted file:

$ flepimop batch-calibrate \
    --blocks 1 \
    --chains 4 \
    --samples 20 \
    --simulations 500 \
    --time-limit 2hr \
    --slurm \
    --nodes 1 \
    --cpus 4 \
    --memory 24GB \
    --from-estimate USA_influpaint_resources.json \
    --extra 'partition=<your partition, if relevant>' \
    --extra 'email=<your email, if relevant>' \
    --skip-checkout \
    -vvv \
    simple_usa_statelevel.yml
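
If you want to review the estimates before (or after) submitting, the JSON file can be pretty-printed, for example:

$ python -m json.tool USA_influpaint_resources.json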