Running on Rockfish/MARCC - JHU 🪨🐠
or any HPC using the Slurm workload manager
🗂️ Files and folder organization
The Rockfish administrators provide several partitions with different properties. For our needs (storage-intensive work in a shared environment), we work in the /scratch4/struelo1/ partition, where we have 20 TB of space. Our folders are organized as follows:
code-folder:
/scratch4/struelo1/flepimop-code/
where each user has their own subfolder, in which the repositories are cloned and from which the runs are launched. E.g., for user chadi, we'll find:
/scratch4/struelo1/flepimop-code/chadi/covidsp/Flu_USA
/scratch4/struelo1/flepimop-code/chadi/COVID19_USA
/scratch4/struelo1/flepimop-code/chadi/flepiMoP
...
We keep separate repositories per user so that different versions of the pipeline are not mixed when we run several runs in parallel. Don't hesitate to create other subfolders in the code folder (/scratch4/struelo1/flepimop-code/chadi-flusight, ...) if you need them.
Note that the repositories are cloned flat, i.e., the flepiMoP repository is at the same level as the data repository, not inside it!
output folder:
/scratch4/struelo1/flepimop-runs
stores the run outputs. After an inference run finishes, its output and the log files are copied from $DATA_PATH/model_output to /scratch4/struelo1/flepimop-runs/THISRUNJOBNAME, where the job name is usually of the form USA-DATE.
When you log in, you'll see two folders, data_struelo1 and scr4_struelo1, which are shortcuts to /data/struelo1 and /scratch4/struelo1. We don't use /data/struelo1.
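The layout described above can be sketched as a few shell variables. This is only an illustration: FLEPI_PATH, JOB_NAME and the other variable names (apart from DATA_PATH) are assumptions, and the timestamp format is inferred from the example run folder USA-20230130T163847 that appears later on this page.

```shell
#!/usr/bin/env bash
# Sketch of the folder layout described above.
# FLEPI_PATH, JOB_NAME, CODE_ROOT and RUNS_ROOT are illustrative names only.
USER_NAME="${USER:-chadi}"                       # falls back to the example user
CODE_ROOT="/scratch4/struelo1/flepimop-code"     # one subfolder per user
RUNS_ROOT="/scratch4/struelo1/flepimop-runs"     # finished runs are copied here

# Repositories are cloned flat: flepiMoP sits next to the data repository.
FLEPI_PATH="${CODE_ROOT}/${USER_NAME}/flepiMoP"
DATA_PATH="${CODE_ROOT}/${USER_NAME}/COVID19_USA"

# Job names usually look like USA-DATE (format assumed from the example above).
JOB_NAME="USA-$(date +%Y%m%dT%H%M%S)"
RUN_OUTPUT_DIR="${RUNS_ROOT}/${JOB_NAME}"

echo "pipeline code : ${FLEPI_PATH}"
echo "data repo     : ${DATA_PATH}"
echo "run outputs   : ${DATA_PATH}/model_output -> ${RUN_OUTPUT_DIR}"
```

Both repositories share the same parent folder, which is what "cloned flat" means here.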
Log in to Rockfish
Using ssh from your terminal, type in:
and enter your password when prompted. You'll land on Rockfish's login node, a remote computer whose only purpose is to prepare and launch computations on so-called compute nodes.
🧱 Setup (to be done only once per USER)
Load the right modules for the setup:
Now, type the following line so git remembers your credentials and you don't have to enter your token 6 times per day:
Now create the conda environment. You will do this in two shorter commands, installing the Python and R packages separately; a single command can take extremely long, so splitting it helps. Each command still takes a while, so you'll have time to brew some nice coffee ☕️:
Clone the flepiMoP and other model repositories
Use the following commands to have git clone the flepiMoP repository and any other model repositories you'd like to work on through https. In the code below, $USER is a variable that contains your username.
You will be prompted to provide your GitHub username and password. Note that since 2021 GitHub has replaced password authentication with personal access tokens, so the prompted "password" is not the password you use to log in. Instead, we recommend using the more secure SSH protocol to clone GitHub repositories. To do so, first generate an SSH private-public key pair on the Rockfish cluster, then copy the generated public key from the Rockfish cluster to your local computer by opening a terminal on your local computer and typing,
scp -r <username>@rfdtn1.rockfish.jhu.edu:/home/<username>/.ssh/<key_name.pub> .
Then add the public key to your GitHub account. Next, create the file ~/.ssh/config with the command vi ~/.ssh/config. Press 'I' to go into insert mode and paste the following chunk of code,
Press 'esc' to exit insert mode, followed by ':x' to save and exit the file. This configuration file makes sure Rockfish doesn't forget your SSH key when you log out. Now clone the GitHub repositories as follows,
and you will not be prompted for credentials.
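The SSH steps above can be sketched as follows. This is a hedged illustration rather than the exact configuration used by the team: the key name id_rockfish_github and the helper function write_github_ssh_config are hypothetical, and the repository address is left as a placeholder.

```shell
#!/usr/bin/env bash
# Sketch of the SSH setup. The key name "id_rockfish_github" and the helper
# function name are hypothetical; substitute your own key name.

# 1. On Rockfish, generate a key pair (run once, interactively):
#      ssh-keygen -t ed25519 -f ~/.ssh/id_rockfish_github
# 2. Add the contents of ~/.ssh/id_rockfish_github.pub to your GitHub account.

# 3. Append a GitHub entry to an SSH config file.
write_github_ssh_config() {
    local key_path="$1"       # private key to offer to GitHub
    local config_path="$2"    # normally ~/.ssh/config
    mkdir -p "$(dirname "$config_path")"
    cat >> "$config_path" <<EOF
Host github.com
    User git
    IdentityFile ${key_path}
EOF
    chmod 600 "$config_path"  # keep the config readable and writable by you only
}

# Usage on Rockfish:
#   write_github_ssh_config ~/.ssh/id_rockfish_github ~/.ssh/config
#   git clone git@github.com:<organization>/<repository>.git
```

Writing the entry through a small function instead of vi gives the same file content and makes the step easy to repeat on another cluster.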
Set up your Amazon Web Services (AWS) credentials
This can be done later, but it is necessary in order to push data to and pull data from Amazon Simple Storage Service (S3). Set up AWS by running,
Then run ./aws-cli/bin/aws configure to set up your credentials,
To get the (secret) access key, ask the AWS administrator (Shaun Truelove) to generate one for you.
🚀 Run inference using Slurm (do every time)
Log in to Rockfish via ssh, then type:
which will prepare the environment and set up variables for the validation date (choose the day after end_date_groundtruth), the resume location, and the run index for this run. If you don't want to set a variable, just hit enter.
Note that the run-id of the run we resume from is now inferred automatically by the batch script :)
Check that the conda environment is activated: you should see (flepimop-env) on the left of your command-line prompt.
Then prepare the pipeline directory. If you have already done that and the pipeline hasn't been updated (git pull says it's up to date), you can skip these steps:
Now flepiMoP is ready 🎉. If the R command doesn't work, try r, and if that doesn't work, run module load r/4.0.2.
The next step is to set up the data. First, set $DATA_PATH to your data folder, and set any data options. If you are using the Delphi Epidata API, first register for a key here: https://cmu-delphi.github.io/delphi-epidata/. Once you have a key, add it below where you see [YOUR API KEY]. Alternatively, you can put the key in the inference section of your config file as gt_api_key: "YOUR API KEY".
For a COVID-19 run, do:
For a flu run, do:
Now for any type of run:
Do some clean-up before your run. The fast way is to restore the $DATA_PATH git repository to its blank state (⚠️ removes everything that does not come from git):
Run the preparatory script for the data and you are good,
If you want to profile how the model is using your memory resources during the run:
Now you may want to test that it works:
If this fails, investigate the error before going further. If it succeeds, proceed by first deleting the model_output:
Launch your inference batch job
When an inference batch job is launched, a few post-processing scripts are run automatically through postprocessing-scripts.sh.
You can change what is run by editing this script.
Now you're fully set to go 🎉
To launch the whole inference batch job, type the following command:
This command infers everything from your environment variables: whether there is a resume, what the run_id is, etc. The part after the "2" saves the output of this command for logging and has no impact on your submission.
If you'd like to have more control, you can specify the arguments manually:
If you want to send any post-processing outputs to #flepibot-test instead of csp-production, add the flag --slack-channel debug.
Commit files to GitHub. After the job is successfully submitted, you will be in a new branch of the data repo. Commit the ground truth data files to the branch on GitHub:
but DO NOT finish by checking out main as in the AWS instructions, as the run will use the data in the current folder.
Monitor your run
There are two types of log files: slurm-JOBID_SLOTID.out files in `$DATA_PATH`, and filter_MC logs, which you can follow with:
```
tail -f /scratch4/struelo1/flepimop-runs/USA-20230130T163847/log_FCH_R16_lowBoo_modVar_ContRes_blk4_Jan29_tsvacc_100.txt
```
Helpful commands
When approaching the file number quota, type
to find how many files each subfolder contains.
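One possible way to get such a per-subfolder count is sketched below. It is not necessarily the command the team uses; count_files_per_subfolder is a hypothetical helper name, and hidden subfolders are not listed by this version.

```shell
#!/usr/bin/env bash
# Print the number of regular files under each immediate subfolder of a
# directory, largest first. count_files_per_subfolder is a hypothetical name.
# Hidden subfolders (names starting with a dot) are not matched by the glob.
count_files_per_subfolder() {
    local root="${1:-.}"
    local dir
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue            # skip when there are no subfolders
        printf '%s\t%s\n' "$(find "$dir" -type f | wc -l | tr -d ' ')" "${dir%/}"
    done | sort -rn
}

# Usage: count_files_per_subfolder /scratch4/struelo1/flepimop-code/$USER
```

Counts include files in nested folders, so a large number points to the top-level subfolder worth cleaning first.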
Common errors
If strange missing-package errors appear, check that python comes from conda with which python. Sometimes conda magically disappears.
Don't use ipython, as it breaks click's flags.
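The python check above can be automated with a small sketch like the one below. It assumes that the path of a conda-provided python contains the environment name flepimop-env as a folder; check_python_path is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Report whether a python executable path lives inside the flepimop-env conda
# environment. check_python_path is a hypothetical helper name, and the
# "/flepimop-env/" path component is an assumption about where conda puts it.
check_python_path() {
    case "$1" in
        */flepimop-env/*) echo "ok: python comes from flepimop-env" ;;
        "")               echo "warning: no python found on PATH"; return 1 ;;
        *)                echo "warning: python is $1, not from flepimop-env"; return 1 ;;
    esac
}

# Usage: check the python that the shell would actually run.
check_python_path "$(command -v python)" || true
```

If the warning appears, reactivating the conda environment is the first thing to try.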
cleanup:
Get a notification on your phone/mail when a run is done
We use ntfy.sh for notifications. Install ntfy on your iPhone or Android device, then subscribe to the channel ntfy.sh/flepimop_alerts, where you'll receive notifications when runs are done.
End-of-job notifications are sent with urgent priority.
How to use Slurm
Check your running jobs:
where job_id shows your full array job ID followed by each slot after the underscore. You can see each job's status (R: running, PD: pending), how long it has been running, and so on.
To cancel a job:
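A minimal sketch of both steps, using the standard Slurm commands squeue and scancel. The job ID 1234567_12 is a made-up example, and the Slurm calls only run on a machine where Slurm is installed.

```shell
#!/usr/bin/env bash
# Split a Slurm array task ID such as "1234567_12" into its two parts.
# The ID below is a made-up example, not a real job.
task_id="1234567_12"
array_job_id="${task_id%%_*}"   # part before the underscore: the whole array job
slot_id="${task_id#*_}"         # part after the underscore: one slot of the array

echo "array job: ${array_job_id}, slot: ${slot_id}"

# On a machine where Slurm is installed, list your jobs and (optionally) cancel.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER"                 # all of your pending and running jobs
    # scancel "${task_id}"            # cancel only this slot
    # scancel "${array_job_id}"       # cancel every slot of the array job
fi
```

The scancel lines are left commented out so that nothing is cancelled by accident when trying the snippet.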
Running an interactive session
To check your code before submitting a large batch job, it's often helpful to run an interactive session to debug it and check that everything works as you want. On 🪨🐠 this can be done using interact, as in the line below, which requests an interactive session with 4 cores and 24 GB of memory for 12 hours.
The options here are [-n tasks or cores], [-t walltime], [-p partition] and [-m memory], though other options can also be included or modified to suit your requirements. More details can be found in the ARCH User Guide.
Moving files to your local computer
Often you'll need to move files back and forth between Rockfish and your local computer. To do this, you can use Open OnDemand or any command-line tool, for example:
scp -r <user>@rfdtn1.rockfish.jhu.edu:"<file path of what you want>" <where you want to put it in your local>
Installation notes
These steps have already been done and affect all users, but they may be of interest if you'd like to run on another cluster.
Install slack integration
So our 🤖-friend can send us some notifications once a run is done.