
US specific How to Run

Running with Docker locally (outdated/US specific) 🛳

Short internal tutorial on running locally using a "Docker" container.

There are more comprehensive directions in the How to run -> Running with Docker locally section, but this section has some specifics required for US-specific COVID-19 and flu runs.

Setup

Run Docker Image

Current Docker image: /hopkinsidd/flepimop:latest-dev

Docker is a software platform that allows you to build, test, and deploy applications quickly. Docker packages software into standardized units called containers that have everything the software needs to run including libraries, system tools, code, and runtime. This means you can run and install software without installing the dependencies in the system.

A Docker container is an environment isolated from the rest of the operating system: you can create files, install programs, and delete things inside it without affecting your OS. It behaves like a local virtual OS within your OS.

For flepiMoP, we have a Docker container that will help you get running quickly:

docker pull hopkinsidd/flepimop:latest-dev
docker run -it \
  -v <dir1>:/home/app/flepimop \
  -v <dir2>:/home/app/drp \
hopkinsidd/flepimop:latest-dev

This command runs the Docker image hopkinsidd/flepimop. The -v option mounts a directory from your machine at the given location inside the container.

Here, <dir1> (your flepiMoP code folder) is mounted at flepimop within the Docker environment, and <dir2> (your data folder) is mounted at drp.
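As an illustration of the two mounts, the sketch below assembles the same docker run command for two host folders and prints it instead of running it, so the paths can be checked first. The host locations ($HOME/flepiMoP and $HOME/Flu_USA) and the function name are assumptions for the example; use wherever you cloned the repositories.

```shell
# Hypothetical host locations of the code and data clones; adjust to your machine.
FLEPI_DIR="$HOME/flepiMoP"
DATA_DIR="$HOME/Flu_USA"

# Print the docker run command with both mounts filled in (nothing is executed).
# $1: host folder for the code, $2: host folder for the data
print_docker_run() {
  printf 'docker run -it -v %s:/home/app/flepimop -v %s:/home/app/drp hopkinsidd/flepimop:latest-dev\n' "$1" "$2"
}

print_docker_run "$FLEPI_DIR" "$DATA_DIR"
```

Each -v argument has the form host_path:container_path, so the left side of the colon always refers to your machine and the right side to the container.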

🚀 Run inference

Fill the environment variables (do this every time)

First, populate the folder name variables:

export FLEPI_PATH=/home/app/flepimop/
export PROJECT_PATH=/home/app/drp/

Then, export variables for some flags and the census API key (you can use your own):

export FLEPI_STOCHASTIC_RUN=false
export FLEPI_RESET_CHIMERICS=TRUE
export CENSUS_API_KEY="6a98b751a5a7a6fc365d14fa8e825d5785138935"
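Because these variables have to be set again in every new session, a quick check before going further can save a failed run. The sketch below is not part of flepiMoP; the function name is made up for illustration, and it only tests that the three variables are non-empty.

```shell
# Report any of the variables this tutorial relies on that are still empty.
check_flepi_env() {
  missing=0
  for name in FLEPI_PATH PROJECT_PATH CENSUS_API_KEY; do
    # Look up the variable whose name is stored in $name
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}

check_flepi_env || echo "Set the variables listed above before continuing."
```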

Where do I get a Census API key?

The Census Data Application Programming Interface (API) gives the public access to raw statistical data from various Census Bureau data programs. To acquire your own API key, sign up for one on the Census Bureau website.

After you enter your details, you should receive an email that lets you activate your key before using it.

Note: Do not enter the API Key in quotes, copy the key as it is.

Go into the Pipeline repo (making sure it is up to date on your favorite branch) and do the installation required of the repository:

cd $FLEPI_PATH   # it'll move to the flepiMoP/ directory
Rscript local_install.R               # Install the R stuff
pip install --no-deps -e gempyor_pkg/ # install gempyor
git lfs install
git lfs pull

Note: These installations take place in the Docker container, not in your operating system. They need to be done once when starting the container and do not need to be repeated for every run or test.

Run the code

Everything is now ready. 🎉 Let's do some clean-up in the data folder (these files might not exist, but it's good practice to make sure your simulation isn't re-using some old files):

cd $PROJECT_PATH       # goes to Flu_USA
git restore data/
rm -rf data/mobility_territories.csv data/geodata_territories.csv data/us_data.csv
rm -r model_output/ # delete the outputs of past runs, if any

Stay in $PROJECT_PATH, select a config, and build the setup. The setup creates the population seeding file (geodata) and the population mobility file (mobility). Then, run inference:

export CONFIG_PATH=config_SMH_R1_lowVac_optImm_2022.yml
Rscript $FLEPI_PATH/datasetup/build_US_setup.R
Rscript $FLEPI_PATH/datasetup/build_flu_data.R
flepimop-inference-main -j 1 -n 1 -k 1

where:

n is the number of parallel inference slots,
j is the number of CPU cores it'll use in your machine,
k is the number of iterations per slot.

It should run successfully and create a lot of files in model_output/.

The last few lines visible on the command prompt should be:

[[1]]
[[1]][[1]]
[[1]][[1]][[1]]
NULL

Other helpful tools

To understand the basics of docker refer to the following: Docker Basics

To install docker for windows refer to the following link: Installing Docker

The following is a good tutorial for introduction to docker: Docker Tutorial

To run the entire pipeline we use the command prompt. To open it, type "Command Prompt" in the search bar and open the application. Here is a tutorial video for navigating through the command prompt: Command Prompt Tutorial

To test, we use the test folder (test_documentation_inference_us in this case) in the COVIDScenarioPipeline as the data repository folder. We run the Docker container and set the paths.

Running on Rockfish/MARCC - JHU 🪨🐠

or any HPC using the slurm workload manager

🗂️ Files and folder organization

Rockfish administrators provided several partitions with different properties. For our needs (storage intensive and shared environment), we work in the /scratch4/struelo1/ partition, where we have 20T of space. Our folders are organized as:

code-folder: /scratch4/struelo1/flepimop-code/ where each user has their own subfolder, from which the repos are cloned and the runs are launched. E.g., for user chadi, we'll find:
- /scratch4/struelo1/flepimop-code/chadi/covidsp/Flu_USA
- /scratch4/struelo1/flepimop-code/chadi/COVID19_USA
- /scratch4/struelo1/flepimop-code/chadi/flepiMoP
- ...
- (we keep separate repositories per user so that different versions of the pipeline are not mixed when we run several runs in parallel. Don't hesitate to create other subfolders in the code folder (/scratch4/struelo1/flepimop-code/chadi-flusight, ...) if you need them.)

Note that the repository is cloned flat, i.e., the flepiMoP repository is at the same level as the data repository, not inside it!

output folder: /scratch4/struelo1/flepimop-runs stores the run outputs. After an inference run finishes, its output and log files are copied from $PROJECT_PATH/model_output to /scratch4/struelo1/flepimop-runs/THISRUNJOBNAME, where the job name is usually of the form USA-DATE.

When logging on you'll see two folders data_struelo1 and scr4_struelo1, which are shortcuts to /data/struelo1 and /scratch4/struelo1. We don't use data/struelo1.
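To make the layout concrete, here is a small sketch that builds the two kinds of paths described above. The helper names and the CODE_ROOT/RUNS_ROOT variables are invented for this example; only the directory locations come from this page.

```shell
CODE_ROOT=/scratch4/struelo1/flepimop-code
RUNS_ROOT=/scratch4/struelo1/flepimop-runs

# Where a given user's clones live, and where a finished run's outputs end up.
user_code_dir() { printf '%s/%s\n' "$CODE_ROOT" "$1"; }
run_output_dir() { printf '%s/%s\n' "$RUNS_ROOT" "$1"; }

user_code_dir chadi
run_output_dir USA-20230130T163847
```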

Using ssh from your terminal, type in:

ssh {YOUR ROCKFISH USERNAME}@login.rockfish.jhu.edu

and enter your password when prompted. You'll be on Rockfish's login node, which is a remote computer whose only purpose is to prepare and launch computations on so-called compute nodes.

🧱 Setup (to be done only once per USER)

Load the right modules for the setup:

module purge
module load gcc/9.3.0
module load anaconda3/2022.05  # very important to pin this version as others are buggy
module load git                # needed for git
module load git-lfs            # git-lfs (do we still need it?)

Now, type the following lines so git remembers your credentials and you don't have to enter your token 6 times per day:

git config --global credential.helper store
git config --global user.name "{NAME SURNAME}"
git config --global user.email YOUREMAIL@EMAIL.COM
git config --global pull.rebase false # so you use merge as the default reconciliation method

Now you need to create the conda environment. Doing this in a single command can take extremely long, so you will install the Python and R packages in two separate, shorter commands. Each is still quite slow, so you'll have time to brew some nice coffee ☕️:

# install all python stuff first
conda create -c conda-forge -n flepimop-env numba pandas numpy seaborn tqdm matplotlib click confuse pyarrow sympy dask pytest scipy graphviz emcee xarray boto3 slack_sdk

# activate the environment and install the R stuff
conda activate flepimop-env
conda install -c conda-forge r-readr r-sf r-lubridate r-tigris r-tidyverse r-gridextra r-reticulate r-truncnorm r-xts r-ggfortify r-flextable r-doparallel r-foreach r-arrow r-optparse r-devtools r-tidycensus r-cdltools r-cowplot

Clone the FlepiMoP and other model repositories

Use the following commands to have git clone the FlepiMoP repository and any other model repositories you'd like to work on through https. In the code below, $USER is a variable that contains your username.

cd /scratch4/struelo1/flepimop-code/
mkdir $USER
cd $USER
git clone https://github.com/HopkinsIDD/flepiMoP.git
git clone https://github.com/HopkinsIDD/Flu_USA.git
# or any other model repositories

You will be prompted to provide your GitHub username and password. Note that since 2021, GitHub has replaced passwords with personal access tokens for git operations, so the prompted "password" is not the password you use to log in. Instead, we recommend using the more secure ssh protocol to clone GitHub repositories. To do so, first generate an ssh private-public keypair on the Rockfish cluster and then copy the generated public key from the Rockfish cluster to your local computer by opening a terminal and typing,

scp -r <username>@rfdtn1.rockfish.jhu.edu:/home/<username>/.ssh/<key_name.pub> .

Then add the public key to your GitHub account. Next, make a file ~/.ssh/config by using the command vi ~/.ssh/config. Press 'i' to go into insert mode and paste the following chunk of code,

Host github.com
    User git
    IdentityFile ~/.ssh/<key_name>

Press 'esc' to exit INSERT mode followed by ':x' to save and exit the file. By adding this configuration file, you make sure Rockfish doesn't forget your ssh key when you log out. Now clone the GitHub repositories as follows,

cd /scratch4/struelo1/flepimop-code/
mkdir $USER
cd $USER
git clone git@github.com:HopkinsIDD/flepiMoP.git
git clone git@github.com:HopkinsIDD/Flu_USA.git
# or any other model repositories

and you will not be prompted for credentials.
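If you would rather not edit ~/.ssh/config in vi, the same entry can be appended from the command line. This is a minimal sketch: the function name is made up, and id_ed25519 in the usage comment is only a placeholder for whatever you named your key.

```shell
# Append a GitHub host entry to an ssh config file without opening an editor.
# $1: private key file name inside ~/.ssh
# $2: config file to write to (defaults to ~/.ssh/config)
add_github_ssh_config() {
  key_name="$1"
  config_file="${2:-$HOME/.ssh/config}"
  mkdir -p "$(dirname "$config_file")"
  printf 'Host github.com\n    User git\n    IdentityFile ~/.ssh/%s\n' "$key_name" >> "$config_file"
  # ssh refuses config files that other users can write to
  chmod 600 "$config_file"
}

# Example (placeholder key name):
#   add_github_ssh_config id_ed25519
#   ssh -T git@github.com   # checks that GitHub accepts the key
```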

Setup your Amazon Web Services (AWS) credentials

This can be done in a second step -- but is necessary in order to push and pull data to Amazon Simple Storage Service (S3). Setup AWS by running,

cd ~ # go in your home directory
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
./aws/install -i ~/aws-cli -b ~/aws-cli/bin
./aws-cli/bin/aws --version

Then run ./aws-cli/bin/aws configure to set up your credentials,

# AWS Access Key ID [None]: Access key
# AWS Secret Access Key [None]: Secret Access key
# Default region name [None]: us-west-2
# Default output format [None]: json

To get the (secret) access key, ask the AWS administrator (Shaun Truelove) to generate them for you.
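Once the credentials are entered, you can confirm that the configuration was saved by reading the region back with aws configure get. The wrapper below is a sketch with an invented function name; it assumes the CLI was installed to ~/aws-cli/bin as in the commands above.

```shell
# Print the configured default region, or explain why it cannot be read.
# $1: path to the aws executable (defaults to the install location used above)
show_aws_region() {
  aws_bin="${1:-$HOME/aws-cli/bin/aws}"
  if [ ! -x "$aws_bin" ]; then
    echo "aws CLI not found at $aws_bin"
    return 1
  fi
  "$aws_bin" configure get region
}

show_aws_region || echo "Install or configure the AWS CLI first."
```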

🚀 Run inference using slurm (do every time)

Log in to Rockfish via ssh, then type:

source /scratch4/struelo1/flepimop-code/$USER/flepiMoP/batch/slurm_init.sh

which will prepare the environment and setup variables for the validation date (choose as day after end_date_groundtruth), the resume location and the run index for this run. If you don't want to set a variable, just hit enter.

Note that now the run-id of the run we resume from is automatically inferred by the batch script :)

What does this do? (Or: it returns an error)

This script runs the following commands to set up the environment, which you can also run individually.

module purge
module load gcc/9.3.0
module load git
module load git-lfs
module load slurm
module load anaconda3/2022.05
conda activate flepimop-env
export CENSUS_API_KEY={A CENSUS API KEY}
export FLEPI_STOCHASTIC_RUN=false
export FLEPI_RESET_CHIMERICS=TRUE
export FLEPI_PATH=/scratch4/struelo1/flepimop-code/$USER/flepiMoP

# And then it asks you some questions to set up some environment variables

and then it prompts you to set the following 3 environment variables. You can skip this part and do it later manually.

export VALIDATION_DATE="2023-01-29"
export RESUME_LOCATION=s3://idd-inference-runs/USA-20230122T145824
export FLEPI_RUN_INDEX=FCH_R16_lowBoo_modVar_ContRes_blk4_Jan29_tsvacc

Check that the conda environment is activated: you should see (flepimop-env) on the left of your command-line prompt.
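The same check can be scripted. conda sets the CONDA_DEFAULT_ENV variable to the name of the active environment, so a sketch like the following (the function name is made up) tells you whether flepimop-env is active and which python is on your PATH.

```shell
# Succeeds only when the flepimop-env conda environment is active.
in_flepimop_env() {
  [ "${CONDA_DEFAULT_ENV:-}" = "flepimop-env" ]
}

if in_flepimop_env; then
  echo "flepimop-env is active; python is $(command -v python)"
else
  echo "flepimop-env is not active; run: conda activate flepimop-env"
fi
```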

Then prepare the pipeline directory. If you have already done that and the pipeline hasn't been updated (git pull says it's up to date), you can skip these steps:

cd /scratch4/struelo1/flepimop-code/$USER
export FLEPI_PATH=$(pwd)/flepiMoP
cd $FLEPI_PATH
git checkout main
git pull

# install dependencies ggraph and tidygraph
R
> install.packages(c("ggraph","tidygraph"))
> quit()

# install the R module
Rscript build/local_install.R # warnings are ok; there should be no error.

# install gempyor
pip install --no-deps -e flepimop/gempyor_pkg/

Now flepiMoP is ready 🎉. If the R command doesn't work, try r and if that doesn't work run module load r/4.0.2.

The next step is to set up the data. First set $PROJECT_PATH to your data folder, and set any data options. If you are using the Delphi Epidata API, first register for a key here: https://cmu-delphi.github.io/delphi-epidata/. Once you have a key, add it below where you see [YOUR API KEY]. Alternatively, you can put that key in your config file in the inference section as gt_api_key: "YOUR API KEY".

For a COVID-19 run, do:

cd /scratch4/struelo1/flepimop-code/$USER
export PROJECT_PATH=$(pwd)/COVID19_USA
export GT_DATA_SOURCE="csse_case, fluview_death, hhs_hosp"
export DELPHI_API_KEY="[YOUR API KEY]"

For a flu run, do:

cd /scratch4/struelo1/flepimop-code/$USER
export PROJECT_PATH=$(pwd)/Flu_USA

Now for any type of run:

cd $PROJECT_PATH
git pull 
git checkout main

Do some clean-up before your run. The fast way is to restore the $PROJECT_PATH git repository to its blank state (⚠️ removes everything that does not come from git):

git reset --hard && git clean -f -d  # this deletes everything that is not on github in this repo !!!

I want more control over what is deleted

If you still want to use git to clean the repo but want finer control, or want to understand how dangerous the command is, read this.

If you prefer to have more control, delete the files you like, e.g.:

rm -rf model_output data/us_data.csv data-truth &&
   rm -rf data/mobility_territories.csv data/geodata_territories.csv &&
   rm -rf data/seeding_territories.csv && 
   rm -rf data/seeding_territories_Level5.csv data/seeding_territories_Level67.csv

# don't delete model_output if you have another run in parallel
rm -rf $PROJECT_PATH/model_output
# delete log files from previous runs
rm *.out
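A middle path between the blanket git clean and deleting files by hand is git's own dry-run mode, which lists what would be removed without touching anything. The two helper names below are invented for this sketch; the git options themselves are standard.

```shell
# List what `git clean -f -d` would delete, without deleting anything.
preview_clean() {
  git clean -n -d
}

# List tracked files with local modifications, i.e. what `git reset --hard` would discard.
preview_reset() {
  git status --short --untracked-files=no
}

# Usage, from inside $PROJECT_PATH:
#   preview_clean
#   preview_reset
```

If the preview shows something you want to keep, move it out of the repository (or commit it) before running the destructive command.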

Run the preparatory script for the data and you are good,

export CONFIG_PATH=config_FCH_R16_lowBoo_modVar_ContRes_blk4_Jan29_tsvacc.yml
Rscript $FLEPI_PATH/datasetup/build_US_setup.R

# For covid do
Rscript $FLEPI_PATH/datasetup/build_covid_data.R

# For Flu (do not do this for the scenariohub!)
R
> install.packages(c("RSocrata"))
> quit()
Rscript $FLEPI_PATH/datasetup/build_flu_data.R

# build seeding
Rscript $FLEPI_PATH/datasetup/build_initial_seeding.R

If you want to profile how the model is using your memory resources during the run:

export FLEPI_MEM_PROFILE=TRUE
export FLEPI_MEM_PROF_ITERS=50

Now you may want to test that it works:

flepimop-inference-main -c $CONFIG_PATH -j 1 -n 1 -k 1

If this fails, investigate the error before going further. If it succeeds, you can proceed by first deleting the model_output:

rm -r model_output

Launch your inference batch job

When an inference batch job is launched, a few post-processing scripts are run automatically from postprocessing-scripts.sh. You can change what is run by editing this script.

Now you're fully set to go 🎉

To launch the whole inference batch job, type the following command:

python $FLEPI_PATH/batch/inference_job_launcher.py --slurm 2>&1 | tee ${FLEPI_RUN_INDEX}_submission.log

This command infers everything from your environment variables: whether there is a resume or not, what the run_id is, etc. The part starting at "2>&1" copies the launcher's output, including error messages, to a log file while still showing it on screen; it has no impact on your submission.
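The logging part of that command can be tried in isolation. In the sketch below a stand-in function plays the role of the launcher (fake_launcher and the demo_run value are made up), and the log goes to a temporary folder. Note the braces in ${FLEPI_RUN_INDEX}_submission.log: underscores are valid in variable names, so without braces the shell would look for a variable called FLEPI_RUN_INDEX_submission.

```shell
FLEPI_RUN_INDEX="demo_run"   # placeholder value for the demonstration
log_dir=$(mktemp -d)
log_file="$log_dir/${FLEPI_RUN_INDEX}_submission.log"

# Stand-in for the launcher: one line on stdout, one on stderr.
fake_launcher() {
  echo "submitted job"
  echo "a warning" >&2
}

# 2>&1 merges stderr into stdout; tee prints the stream and saves a copy.
fake_launcher 2>&1 | tee "$log_file"
```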

If you'd like to have more control, you can specify the arguments manually:

python $FLEPI_PATH/batch/inference_job_launcher.py --slurm \
                    -c $CONFIG_PATH \
                    -p $FLEPI_PATH \
                    --data-path $PROJECT_PATH \
                    --upload-to-s3 True \
                    --id $FLEPI_RUN_INDEX \
                    --fs-folder /scratch4/struelo1/flepimop-runs \
                    --restart-from-location $RESUME_LOCATION

If you want to send any post-processing outputs to #flepibot-test instead of csp-production, add the flag --slack-channel debug

Commit files to Github. After the job is successfully submitted, you will now be in a new branch of the data repo. Commit the ground truth data files to the branch on github and then return to the main branch:

git add data/ 
git commit -m"scenario run initial" 
branch=$(git branch | sed -n -e 's/^\* \(.*\)/\1/p')
git push --set-upstream origin $branch

but DO NOT finish up by checking out main as in the AWS instructions, as the run will use data in the current folder.
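The sed expression in the commands above picks the current branch out of the git branch listing, where it is marked with a leading "* ". The sketch below applies the same expression to a sample listing; the branch name run_USA-20230130 is invented for the demonstration.

```shell
# Sample of what `git branch` prints: the current branch is marked with "* ".
sample_git_branch_output() {
  printf '  main\n* run_USA-20230130\n'
}

# -n suppresses normal output; the p flag prints only lines where the substitution matched.
branch=$(sample_git_branch_output | sed -n -e 's/^\* \(.*\)/\1/p')
echo "$branch"
```

On Git 2.22 or newer, git branch --show-current returns the same name directly.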

Monitor your run

TODO JPSEH WRITE UP TO HERE

There are two types of log files: slurm-JOBID_SLOTID.out in `$PROJECT_PATH`, and the filter_MC logs:

```
tail -f /scratch4/struelo1/flepimop-runs/USA-20230130T163847/log_FCH_R16_lowBoo_modVar_ContRes_blk4_Jan29_tsvacc_100.txt
```

Helpful commands

When approaching the file number quota, type

find . -maxdepth 1 -type d | while read -r dir
 do printf "%s:\t" "$dir"; find "$dir" -type f | wc -l; done

to find how many files each subfolder contains.

Common errors

If some weird package-missing errors appear, check with which python that python comes from conda. Sometimes conda magically disappears.
Don't use ipython as it breaks click's flags

cleanup:

rm -r /scratch4/struelo1/flepimop-runs/
rm -r model_output
cd $COVID_PATH;git pull;cd $PROJECT_PATH
rm *.out

Get a notification on your phone/mail when a run is done

We use ntfy.sh for notifications. Install ntfy on your iPhone or Android device. Then subscribe to the channel ntfy.sh/flepimop_alerts, where you'll receive notifications when runs are done.

End-of-job notifications are sent with urgent priority.

How to use slurm

Check your running jobs:

squeue -u $USER

where job_id has your full array job_id, with each slot after the underscore. You can see their status (R: running, PD: pending), how long they have been running, and so on.
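With a large array job the squeue listing gets long, and a per-state count is often easier to read. The sketch below defines a small filter (the name summarize_states is made up) that counts state names read from standard input; on the cluster you would feed it from squeue with the header suppressed (-h) and only the state column printed (-o "%T").

```shell
# Count how many lines of each job state arrive on stdin, e.g. "RUNNING 12".
summarize_states() {
  sort | uniq -c | awk '{print $2, $1}'
}

# Usage on the cluster:
#   squeue -u "$USER" -h -o "%T" | summarize_states
```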

To cancel a job

scancel JOB_ID

Running an interactive session

To check your code prior to submitting a large batch job, it's often helpful to run an interactive session to debug your code and check everything works as you want. On 🪨🐠 this can be done using interact like the below line, which requests an interactive session with 4 cores, 24GB of memory, for 12 hours.

interact -p defq -n 4 -m 24G -t 12:00:00

The options here are [-n tasks or cores], [-t walltime], [-p partition] and [-m memory], though other options can also be included or modified to your requirements. More details can be found on the ARCH User Guide.

Moving files to your local computer

Often you'll need to move files back and forth between Rockfish and your local computer. To do this, you can use Open OnDemand or a command-line tool such as scp:

scp -r <user>@rfdtn1.rockfish.jhu.edu:"<file path of what you want>" <where you want to put it in your local>

Installation notes

These steps are already done and affect all users, but they might be interesting in case you'd like to run on another cluster.

Install slack integration

So our 🤖-friend can send us some notifications once a run is done.

cd /scratch4/struelo1/flepimop-code/
nano slack_credentials.sh
# and fill the file:
export SLACK_WEBHOOK="{THE SLACK WEBHOOK FOR CSP_PRODUCTION}"
export SLACK_TOKEN="{THE SLACK TOKEN}"

Running with docker on AWS - OLD probably outdated

This page, along with the other AWS run guides, is not deprecated, in case we need to run flepiMoP on AWS again in the future, but it is also not maintained, as other platforms (such as longleaf and rockfish) are preferred for running production jobs.

For large simulations, running the model on a cluster or cloud computing is required. AWS provides a good solution for this, if funding or credits are available (AWS can get very expensive).

Introduction

This document contains instructions for setting up and running the two different kinds of SEIR modeling jobs supported by the COVIDScenarioPipeline repository on AWS:

Inference jobs, using AWS Batch to coordinate hundreds/thousands of jobs across a fleet of servers, and
Planning jobs, using a single relatively large EC2 instance (usually an r5.24xlarge) to run one or more planning scenarios on a single high-powered machine.

Most of the steps required to setup and run the two different types of jobs on AWS are identical, and I will explicitly call out the places where different steps are required. Throughout the document, we assume that your client machine is a UNIX-like environment (e.g., OS X, Linux, or WSL).

Local Client Setup

I need a few things to be true about the local machine that you will be using to connect to AWS that I'll outline here:

You have created and downloaded a .pem file for connecting to an EC2 instance to your ~/.ssh directory. When we provision machines, you'll need to use the .pem file as a secret key for connecting. You may need to change the permission of the .pem file:
```
chmod 400 ~/.ssh/<your .pem file goes here>
```

You have created a ~/.ssh/config file that contains an entry that looks like this so we can use staging as an alias for your provisioned EC2 instance in the rest of the runbook:

host staging
HostName <IP address of provisioned server goes here>
IdentityFile ~/.ssh/<your .pem file goes here>
User ec2-user
IdentitiesOnly yes
StrictHostKeyChecking no

You can connect to Github via SSH. This is important because we will need to use your Github SSH key to interact with private repositories from the staging server on EC2.

Provisioning The Staging Server

If you are running an Inference job, you should use a small instance type for your staging server (e.g., an m5.xlarge will be more than enough.) If you are running a Planning job, you should provision a beefy instance type (I am especially partial to the memory-and-CPU heavy r5.24xlarge, but given how fast the planning code has become, an r5.8xlarge should be perfectly adequate.)

If you have access to the jh-covid account, you should use the IDD Staging AMI (ami-03641dd0c8554e5d0) to provision and launch new staging servers; it is already set up with all of the dependencies described in this section. However, you will need to alter its default network settings, IAM role, and security group (please refer to this page for details). You can find the AMI here, select it, and press the Launch button to walk you through the Launch Wizard to choose your instance type and .pem file to provision your staging server. When going through the Launch Wizard, be sure to select Next: Configure Instance details instead of Review and Launch. You will need to continue selecting the option that is not Review and Launch until you have selected a security group. In these screens, most of the default options are fine, but you will want to set the HPC VPC network, choose a public subnet (it will say public or private in the name), and set the IAM role to EC2S3FullAccess on the first screen. You can also name the machine by providing a Name tag in the tags screen. Finally, you will need to set your security group to dcv_usa and/or dcv_usa2. You can then finalize the machine initialization with Review and Launch. Once your instance is provisioned, be sure to put its IP address into the HostName section of the ~/.ssh/config file on your local client so that you can connect to it from your client by simply typing ssh staging in your terminal window.

If you are having connection timeout issues when trying to ssh into the AWS machine, you should check that you have SSH TCP Port 22 permissions in the dcv_usa/ security group.

If you do not have access to the jh-covid account, you should walk through the regular EC2 Launch Wizard flow and be sure to choose the Amazon Linux 2 AMI (HVM), SSD Volume Type (ami-0e34e7b9ca0ace12d, the 64-bit x86 version) AMI. Once the machine is up and running and you can SSH to it, you will need to run the following code to install the software you will need for the rest of the run:

sudo yum -y update
sudo yum -y install awscli 
sudo yum -y install git 
sudo yum -y install docker 
sudo yum -y install pbzip2 

curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | sudo bash
sudo yum -y install git-lfs
git lfs install

Connect to Github

Once your staging server is provisioned and you can connect to it, you should scp the private key file that you use for connecting to Github to the /home/ec2-user/.ssh directory on the staging server (e.g., if the local file is named ~/.ssh/id_rsa, then you should run scp ~/.ssh/id_rsa staging:/home/ec2-user/.ssh to do the copy). For convenience, you should create a /home/ec2-user/.ssh/config file on the staging server that has the following entry:

host github.com
 HostName github.com
 IdentityFile ~/.ssh/id_rsa
 User git

This way, the git clone, git pull, etc. operations that you run on the staging server will use your SSH key without constantly prompting you to login. Be sure to chmod 600 ~/.ssh/config to give the new file the correct permissions. You should now be able to clone a COVID19 data repository into your home directory on the staging server to do work against. For this example, to use the COVID19_Minimal repo, run:

git clone git@github.com:HopkinsIDD/COVID19_Minimal.git

to get it onto the staging server. By convention, we do runs with the COVIDScenarioPipeline repository nested inside of the data repository, so we then do:

cd COVID19_Minimal
git clone git@github.com:HopkinsIDD/COVIDScenarioPipeline.git

to clone the modeling code itself into a child directory of the data repository.

Getting and Launching the Docker Container

The previous section is only for getting a minimal set of dependencies setup on your staging server. To do an actual run, you will need to download the Docker container that contains the more extensive set of dependencies we need for running the code in the COVIDScenarioPipeline repository. To get the development container on your staging server, please run:

sudo docker pull hopkinsidd/covidscenariopipeline:latest-dev

There are multiple versions of the container published on DockerHub, but latest-dev contains the latest-and-greatest dependencies and can support both Inference and Planning jobs. In order to launch the container and run a job, we need to make our local COVID19_Minimal directory visible to the container's runtime. For Inference jobs, we do this by running:

sudo docker run \
  -v /home/ec2-user/COVID19_Minimal:/home/app/src \
  -v /home/ec2-user/.ssh:/home/app/.ssh \
  -it hopkinsidd/covidscenariopipeline:latest-dev

The -v option to docker run maps a file in the host filesystem (i.e., the path on the left side of the colon) to a file in the container's filesystem. Here, we are mapping the /home/ec2-user/COVID19_Minimal directory on the staging server where we checked out our data repo to the /home/app/src directory in the container (by convention, we run commands inside of the container as a user named app.) We also map our .ssh directory from the host filesystem into the container so that we can interact with Github if need be using our SSH keys. Once the container is launched, we can cd src; ls -ltr to look around and ensure that our directory mapping was successful and we see the data and code files that we are expecting to run with.

Once you are in the src directory, there are a few final steps required to install the R packages and Python modules contained within the COVIDScenarioPipeline repository. First, checkout the correct branch of COVIDScenarioPipeline. Then, assuming that you created a COVIDScenarioPipeline directory within the data repo in the previous step, you should be able to run:

Rscript COVIDScenarioPipeline/local_install.R
(cd COVIDScenarioPipeline/; python setup.py install)

to install the local R packages and then install the Python modules.

Once this step is complete, your machine is properly provisioned to run Planning jobs using the tools you normally use (e.g., make_makefile.R or running simulate.py and hospdeath.R directly, depending on the situation.) Running Inference jobs requires some extra steps that are covered in the next two sections.

Running Inference Jobs

Once the container is setup from the previous step, we are ready to test out and then launch an inference job against a configuration file (I will use the example of config.yml for the rest of this document.) First, I setup and run the build_US_setup.R script against my configuration file to ensure that the mobility data is up to date:

export CENSUS_API_KEY=<your census api key>
cd COVIDScenarioPipeline
git lfs pull
cd ..
Rscript COVIDScenarioPipeline/R/scripts/build_US_setup.R -c config.yml

Next, I kick off a small local run of the full_filter.R script. This serves two purposes: first, we can verify that the configuration file is in good shape and can support a few small iterations of the inference calculations before we kick off hundreds/thousands of jobs via AWS Batch. Second, it downloads the case data that we need for inference calculations to the staging server so that it can be cached locally and used by the batch jobs on AWS. If we do not have a local cache of this data at the start of the run, then every job will try to download the data itself, which will force the upstream server to deny service to the worker jobs, which will cause all of the jobs to fail. My small runs usually look like:

Rscript COVIDScenarioPipeline/R/scripts/full_filter.R -c config.yml -k 2 -n 1 -j 1 -p COVIDScenarioPipeline

This will run two sequential simulations (-k 2) for a single slot (-n 1) using a single CPU core (-j 1), looking for the modeling source code in the COVIDScenarioPipeline directory (-p COVIDScenarioPipeline). (We need to use the command line arguments here to explicitly override the settings of these parameters inside of config.yml since this run is only for local testing.) Assuming that this run succeeds, we are ready to kick off a batch job on the cluster.

The COVIDScenarioPipeline/batch/inference_job.py script will use the contents of the current directory and the values of the config file and any commandline arguments we pass it to launch a run on AWS Batch via the AWS API. To run this script, you need to have access to your AWS access keys so that you can enable access to the API by running aws configure at the command line, which will prompt you to enter your access key, secret, and preferred region, which should always be us-west-2 for jh-covid runs. (You can leave the Default format entry blank by simply hitting Enter.) IMPORTANT REMINDER: Do not give anyone your access key and secret. If you lose it, deactivate it on the AWS console and create a new one. Keep it safe.

The simplest way to launch an inference job is to simply run

./COVIDScenarioPipeline/batch/inference_job.py -c config.yml

This will use the contents of the config file to determine how many slots to run, how many simulations to run for each slot, and how to break those simulations up into blocks of batch jobs that run sequentially. If you need to override any of those settings at the command line, you can run

./COVIDScenarioPipeline/batch/inference_job.py --help

to see the full list of command line arguments the script takes and how to set them.

One particular type of command line argument cannot be specified in the config: arguments to resume a run from a previously submitted run. This takes two arguments based on the previous run:

./COVIDScenarioPipeline/batch/inference_job.py --restart-from-s3-bucket=s3://idd-inference-runs/USA-20210131T170334/ --restart-from-run-id=2021.01.31.17:03:34

Both the s3 bucket and run id are printed as part of the output for the previous submission. We store that information on a slack channel #csp-production, and suggest other groups find similar storage.

Inference jobs are parallelized by NPI scenarios and hospitalization rates, so if your config file defines more than one top-level scenario or more than one set of hospitalization parameters, the inference_job.py script will kick off a separate batch job for the cross product of scenarios * hospitalizations. The script will announce that it is launching each job and will print out the path on S3 where the final output for the job will be written. You can monitor the progress of the running jobs using either the AWS Batch Dashboard or by running:

./COVIDScenarioPipeline/batch/inference_job_status.py

which will show you the running status of the jobs in each of the queues.
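
To make the cross product concrete, here is a minimal sketch that enumerates one batch job per (scenario, hospitalization) pair. The function name and the example scenario and parameter-set names are hypothetical and are not taken from inference_job.py; the real script reads both lists from the config file.

```shell
# Hypothetical illustration: one batch job per scenario x hospitalization pair.
# $1: space-separated NPI scenario names
# $2: space-separated hospitalization parameter set names
list_batch_jobs() {
  local n s h
  n=0
  for s in $1; do
    for h in $2; do
      echo "$s x $h"
      n=$((n + 1))
    done
  done
  echo "total: $n"
}

# Made-up names: 2 scenarios and 3 parameter sets give 6 separate batch jobs.
list_batch_jobs "inference_low inference_med" "low med high"
```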

Operating Inference Jobs

By default, the AWS Batch system will usually run around 50% of your desired simultaneously executable jobs concurrently for a given inference run. For example, if you are running 300 slots, Batch will generally run about 150 of those 300 tasks at a given time. If you need to force Batch to run more tasks concurrently, this section provides instructions for how to cajole Batch into running harder.

You can see how many tasks are running within each of the different Batch Compute Environments corresponding to the Batch Job Queues via the Elastic Container Service (ECS) Dashboard. There is a one-to-one correspondence between Job Queues, Compute Environments, and ECS Clusters (the matching ones all end with the same numeric identifier). You can force Batch to scale up the number of CPUs available for running tasks by selecting the radio button corresponding to the compute environment that you want to scale on the Batch Compute Environment dashboard, clicking Edit, increasing the Desired CPU (and possibly the Minimum CPU, see below), and clicking the Save button. You will be able to see new containers and tasks coming online via the ECS Dashboard after a few minutes.

If you want to force new tasks to come online ASAP, you should consider increasing the Minimum CPU for the Compute Environment as well as the Desired CPU. The Desired CPU is not allowed to be lower than the Minimum CPU, so if you increase the Minimum you must increase the Desired as well to match it. This will cause Batch to spin new containers up quickly and get them populated with running tasks. There are two downsides to doing this. First, it overrides the allocation algorithm that makes cost/performance tradeoff decisions, in favor of spending more money in order to get more tasks running. Second, you must remember to update the Compute Environment towards the end of the job run to set the Minimum CPU to zero again so that the ECS cluster can spin down when the job is finished; if you do not do this, ECS will simply leave the machines up and running, wasting money without doing any actual work. (Note that you should never manually try to lower the value of the Desired CPU setting for the Compute Environment; the Batch system will throw an error if you attempt to do this.)
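
The same change can also be made with the AWS CLI. The sketch below only prints an aws batch update-compute-environment command after checking that the Desired CPU is not below the Minimum CPU; it does not call AWS. The function name and the compute environment name in the example are placeholders.

```shell
# Sketch: build (but do not run) the AWS CLI call that raises the Minimum and
# Desired CPU of a Batch compute environment.
# $1: compute environment name, $2: minimum vCPUs, $3: desired vCPUs
scale_up_cmd() {
  if [ "$3" -lt "$2" ]; then
    echo "Desired CPU ($3) cannot be lower than Minimum CPU ($2)" >&2
    return 1
  fi
  echo "aws batch update-compute-environment" \
    "--compute-environment $1" \
    "--compute-resources minvCpus=$2,desiredvCpus=$3"
}

# Placeholder environment name; review the printed command before running it.
scale_up_cmd my-compute-env 64 128
```

Printing the command first keeps a record of what was changed, which helps when the Minimum CPU has to be set back to zero at the end of the run.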

Provisioning AWS EC2 instance

Signing in to AWS Management Console

Click on below:

In the next view, check that the region is "Oregon" (the default) and that "user@Account ID" is what you expect.

If you have already accessed the AWS console, views like these can be seen. In that case select "EC2" to go to the "EC2 Dashboard" (if not, skip it).

EC2 Dashboard

In the EC2 Dashboard, we can manage EC2 boxes from creation to deletion. This section shows how to create an EC2 instance from an AMI image that has already been registered.

Select "Images > AMIs" in the navigation pane.

Select the AMI named "IDD Staging AMI" in "Amazon Machine Images (AMIs)" by clicking the corresponding AMI checkbox on the left, then push the "Launch instance from AMI" button (colored in orange).

Launch an instance

To create an EC2 instance, fill out the items as below (example):

Name and tags
- input an appropriate name (e.g., "sample_box01")
Application and OS image
- check that "AMI from catalog" is "IDD Staging AMI" (for example; select the one you want)
Instance type
- select one from the drop-down list (e.g., m5.xlarge)
Key pair (login)
- you can generate a new key pair if you want to connect to the instance securely (by clicking "Create new key pair" on the right side), but usually select "ams__ks_ED25519__keypair" from the drop-down list so that you can get help with local client setup (recommended).
  - If you use your own key, you will be the only person who can log in, so be careful with key management.
Network settings (push the "Edit" button on the right to expand the configuration; see below)
- VPC - required
  - select the "HPC VPC" item from the drop-down menu
- Subnet
  - select "HPC Public Subnet" among "us-west-2*"
- Firewall (security groups)
  - select the "Select existing security group" toggle, then
  - Common security groups
    select "dvc_usa" and "dvc__usa2" from the drop-down menu

Advanced details
- "EC2S3FullAccess" should be selected in IAM instance profile, but to do so, an authorization (IAM role or policy) must be attached to the working IAM account

Then push the "Launch Instance" button, located at the bottom right of the screen.

AWS Submission Instructions: Influenza

Step 1. Create the configuration file.

see Building a configuration file

Step 2. Start and access AWS submission box

Spin up an Ubuntu submission box if not already running. To do this, log onto AWS Console and start the EC2 instance.

Update IP address in .ssh/config file. To do this, open a terminal and type the command below. This will open your config file, where you can change the IP to the IPv4 address assigned to the AWS EC2 instance (see AWS Console for this):

notepad .ssh/config
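
If you prefer not to edit the file by hand, the following sketch rewrites the HostName of the staging entry and prints the result. It assumes the entry is named staging and is laid out like the ~/.ssh/config example in the Local Client Setup section; the function name is made up for illustration and the IP address in the usage comment is a documentation placeholder.

```shell
# Sketch: print an ssh config with the HostName of the "staging" entry replaced.
# $1: new IPv4 address, $2: path to the ssh config file
set_staging_ip() {
  awk -v ip="$1" '
    tolower($1) == "host"                 { in_block = (tolower($2) == "staging") }
    in_block && tolower($1) == "hostname" { sub(/[^ \t]+$/, ip) }
    { print }
  ' "$2"
}

# Write to a temporary file first, check it, then replace the original:
#   set_staging_ip 203.0.113.10 ~/.ssh/config > /tmp/ssh_config.new
```

Only the HostName line inside the staging entry is touched; other entries, such as one for github.com, are printed unchanged.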

SSH into the box. In the terminal, SSH into your box. Typically we name these instances "staging", so usually the command is:

ssh staging

Step 3. Setup the environment

Now you should be logged onto the AWS submission box.

Update the GitHub repositories. In the example below we assume you are running the main branch in Flu_USA and the main branch in COVIDScenarioPipeline. This assumes you have already loaded the appropriate repositories on your EC2 instance. Have your GitHub ssh key passphrase handy so you can paste it when prompted (possibly multiple times) with the git pull command. Alternatively, you can add your GitHub key to your batch box so you do not have to log in repeatedly (see X).

cd Flu_USA
git config --global credential.helper cache
git pull 

cd COVIDScenarioPipeline
git pull	
git checkout main
git pull
cd ..

Initiate the docker. Start up and log into the docker container, pull the repos from GitHub, and run setup scripts to setup the environment. This setup code links the docker directories to the existing directories on your box. As this is the case, you should not run job submission simultaneously using this setup, as one job submission might modify the data for another job submission.

sudo docker pull hopkinsidd/covidscenariopipeline:latest-dev
sudo docker run -it \
  -v /home/ec2-user/Flu_USA:/home/app/drp \
  -v /home/ec2-user/Flu_USA/COVIDScenarioPipeline:/home/app/drp/COVIDScenarioPipeline \
  -v /home/ec2-user/.ssh:/home/app/.ssh \
hopkinsidd/covidscenariopipeline:latest-dev  
    
cd ~/drp 
git config credential.helper store 
git pull 
git checkout main
git config --global credential.helper 'cache --timeout 300000'

cd ~/drp/COVIDScenarioPipeline 
git pull 
git checkout main
git pull 

Rscript local_install.R && 
   python -m pip install --upgrade pip &&
   pip install -e gempyor_pkg/ && 
   pip install boto3 && 
   cd ~/drp

Step 4. Model Setup

To run the model via AWS, we first do a setup run locally (in Docker on the submission EC2 box).

Setup environment variables. Modify the code chunk below and submit in the terminal. We also clear certain files and model output that get generated in the submission process. If these files exist in the repo, they may not get cleared and could cause issues. You need to modify the variable values in the first 4 lines below. These include the SCENARIO, VALIDATION_DATE, COVID_MAX_STACK_SIZE, and COMPUTE_QUEUE. If submitting multiple jobs, it is recommended to split jobs between 2 queues: Compartment-JQ-1588569569 and Compartment-JQ-1588569574.

If not resuming off previous run:

export SCENARIO=FCH_R1_highVac_pesImm_2022_Oct30 && 
   export VALIDATION_DATE="2022-10-16" && 
   export COVID_MAX_STACK_SIZE=1000 && 
   export COMPUTE_QUEUE="Compartment-JQ-1588569574" &&
   export CENSUS_API_KEY=c235e1b5620232fab506af060c5f8580604d89c1 && 
   export COVID_RESET_CHIMERICS=TRUE &&
   rm -rf model_output data/us_data.csv data-truth &&
   rm -rf data/mobility_territories.csv data/geodata_territories.csv &&
   rm -rf data/seeding_territories.csv

If resuming from a previous run, there are a couple of additional variables to set. This is the same for a regular resume or continuation resume. Specifically:

RESUME_ID - the COVID_RUN_INDEX of the run being resumed.
RESUME_S3 - the S3 bucket where this previous run is stored.

export SCENARIO=FCH_R1_highVac_pesImm_2022_Nov27 && 
   export VALIDATION_DATE="2022-11-27" && 
   export COVID_MAX_STACK_SIZE=1000 && 
   export COMPUTE_QUEUE="Compartment-JQ-1588569574" &&
   export CENSUS_API_KEY=c235e1b5620232fab506af060c5f8580604d89c1 && 
   export COVID_RESET_CHIMERICS=TRUE &&
   rm -rf model_output data/us_data.csv data-truth &&
   rm -rf data/mobility_territories.csv data/geodata_territories.csv &&
   rm -rf data/seeding_territories.csv
   
export RESUME_ID=FCH_R1_highVac_pesImm_2022_Nov20 &&
  export RESUME_S3=USA-20221120T194228
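
Because a missing or mistyped variable only shows up later as a failed setup run, it can help to check the variables first. This is a minimal sketch and not part of the pipeline: the function name is hypothetical, and the date check only tests the YYYY-MM-DD shape used above.

```shell
# Sketch: verify that the variables edited by hand above are set.
check_setup_vars() {
  local rc name value
  rc=0
  for name in SCENARIO VALIDATION_DATE COVID_MAX_STACK_SIZE COMPUTE_QUEUE; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      rc=1
    fi
  done
  # Shape check only; it does not validate the calendar date.
  case "${VALIDATION_DATE:-}" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
    *) echo "VALIDATION_DATE should look like 2022-10-16" >&2; rc=1 ;;
  esac
  return "$rc"
}

# Usage: check_setup_vars && echo "variables look OK"
```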

Preliminary model run. We do a setup run with 1 to 2 iterations to make sure the model runs and to set up input data. This takes several minutes to complete, depending on how complex the simulation will be. To do this, run the following code chunk; no modification of the code is required:

export COVID_RUN_INDEX=$SCENARIO && 
   export CONFIG_NAME=config_$SCENARIO.yml && 
   export CONFIG_PATH=/home/app/drp/$CONFIG_NAME && 
   export COVID_PATH=/home/app/drp/COVIDScenarioPipeline && 
   export PROJECT_PATH=/home/app/drp && 
   export INTERVENTION_NAME="med" && 
   export COVID_STOCHASTIC=FALSE && 
   rm -rf $PROJECT_PATH/model_output $PROJECT_PATH/us_data.csv &&
   rm -rf $PROJECT_PATH/seeding_territories.csv && 
   cd $PROJECT_PATH && Rscript $COVID_PATH/R/scripts/build_US_setup.R -c $CONFIG_NAME && 
   Rscript $COVID_PATH/R/scripts/build_flu_data.R -c $CONFIG_NAME && 
   Rscript $COVID_PATH/R/scripts/full_filter.R -c $CONFIG_NAME -j 1 -n 1 -k 1 && 
   printenv CONFIG_NAME

Step 5. Launch job on AWS batch

Configure AWS. Assuming that the simulations finish successfully, you will now enter credentials and submit your job to AWS Batch. Enter the following command into the terminal:

aws configure

You will be prompted to enter the following items. The keys can be found in a file called new_user_credentials.csv (the Access Key ID and Secret Access Key will be given to you once in a file).

Access key ID when prompted
Secret access key when prompted
Default region name: us-west-2
Default output: leave blank when prompted and press Enter

Launch the job. To launch the job, use the appropriate setup based on the type of job you are doing. No modification of these code chunks should be required.

export CONFIG_PATH=$CONFIG_NAME &&
cd $PROJECT_PATH &&
$COVID_PATH/batch/inference_job.py -c $CONFIG_PATH -q $COMPUTE_QUEUE --non-stochastic &&
printenv CONFIG_NAME

export CONFIG_PATH=$CONFIG_NAME &&
cd $PROJECT_PATH &&
$COVID_PATH/batch/inference_job.py -c $CONFIG_PATH -q $COMPUTE_QUEUE --non-stochastic -j 1 -k 1 &&
printenv CONFIG_NAME

NOTE: Resume and Continuation Resume runs are currently submitted the same way, resuming from an S3 that was generated manually. Typically we will also submit any Continuation Resume run specifying --resume-carry-seeding as starting seeding conditions will be manually constructed and put in the S3.

Carrying seeding (do this to use seeding fits from resumed run):

export CONFIG_PATH=$CONFIG_NAME &&
cd $PROJECT_PATH &&
$COVID_PATH/batch/inference_job.py -c $CONFIG_PATH -q $COMPUTE_QUEUE --non-stochastic --resume-carry-seeding --restart-from-location=s3://idd-inference-runs/$RESUME_S3 --restart-from-run-id=$RESUME_ID &&
printenv CONFIG_NAME

Discarding seeding (do this to refit seeding again):

export CONFIG_PATH=$CONFIG_NAME &&  
cd $PROJECT_PATH &&
$COVID_PATH/batch/inference_job.py -c $CONFIG_PATH -q $COMPUTE_QUEUE --non-stochastic --resume-discard-seeding --restart-from-location=s3://idd-inference-runs/$RESUME_S3 --restart-from-run-id=$RESUME_ID &&
printenv CONFIG_NAME

Single Iteration + Carry seeding (do this to produce additional scenarios where no fitting is required):

export CONFIG_PATH=$CONFIG_NAME &&
cd $PROJECT_PATH &&
$COVID_PATH/batch/inference_job.py -c $CONFIG_PATH -q $COMPUTE_QUEUE --non-stochastic --resume-carry-seeding --restart-from-location=s3://idd-inference-runs/$RESUME_S3 --restart-from-run-id=$RESUME_ID -j 1 -k 1 &&
printenv CONFIG_NAME
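
The three resume variants differ only in a few flags, which the sketch below makes explicit. It prints the extra arguments for inference_job.py rather than launching anything. The function name and the mode names (carry, discard, single) are invented here; the flags themselves are the ones used in the commands above.

```shell
# Sketch: print the resume-related flags for one of the three variants above.
# $1: carry | discard | single; RESUME_S3 and RESUME_ID must already be set.
resume_flags() {
  local from
  if [ -z "${RESUME_S3:-}" ] || [ -z "${RESUME_ID:-}" ]; then
    echo "set RESUME_S3 and RESUME_ID first" >&2
    return 1
  fi
  from="--restart-from-location=s3://idd-inference-runs/$RESUME_S3 --restart-from-run-id=$RESUME_ID"
  case "$1" in
    carry)   echo "--resume-carry-seeding $from" ;;
    discard) echo "--resume-discard-seeding $from" ;;
    single)  echo "--resume-carry-seeding $from -j 1 -k 1" ;;
    *)       echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# Usage with the variables from Step 4 (the substitution is left unquoted on
# purpose so the flags split into separate arguments):
#   $COVID_PATH/batch/inference_job.py -c $CONFIG_PATH -q $COMPUTE_QUEUE --non-stochastic $(resume_flags carry)
```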

NOTE: A Resume and Continuation Resume are currently submitted the same way, but with --resume-carry-seeding specified and resuming from an S3 that was generated manually.

Step 6. Document the Submission

Commit files to GitHub. After the job is successfully submitted, you will be in a new branch of the population repo. Commit the ground truth data files to the branch on GitHub and then return to the main branch:

git add data/ 
git config --global user.email "[email]" 
git config --global user.name "[github username]" 
git commit -m"scenario run initial" 
branch=$(git branch | sed -n -e 's/^\* \(.*\)/\1/p')
git push --set-upstream origin $branch

git checkout main
git pull

Save submission info to Slack. We use a Slack channel to save the submission information that is printed. Copy this to Slack so you can identify the job later. Example output:

Setting number of output slots to 300 [via config file]
Launching USA-20220923T160106_inference_med...
Resuming from run id is SMH_R1_lowVac_optImm_2018 located in s3://idd-inference-runs/USA-20220913T000opt
Discarding seeding results
Final output will be: s3://idd-inference-runs/USA-20220923T160106/model_output/
Run id is SMH_R1_highVac_optImm_2022
Switched to a new branch 'run_USA-20220923T160106'
config_SMH_R1_highVac_optImm_2022.yml
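
When a later run resumes from this one, the values needed for RESUME_S3 and RESUME_ID are in this output. As a sketch, assuming the output has exactly the format of the example above, they can be extracted as follows (the function name and the file name in the usage comment are hypothetical):

```shell
# Sketch: read saved submission output on stdin and print export lines
# for the two resume variables.
resume_vars_from_output() {
  awk '
    /^Final output will be: s3:\/\// { split($5, part, "/"); print "export RESUME_S3=" part[4] }
    /^Run id is /                    { print "export RESUME_ID=" $4 }
  '
}

# Usage: resume_vars_from_output < submission_output.txt
```

The "Resuming from run id is ..." line is deliberately not matched, since it describes the run that was resumed from, not the run just submitted.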

AWS Submission Instructions: COVID-19

Step 1. Create the configuration file.

see Building a configuration file

Step 2. Start and access AWS submission box

Spin up an Ubuntu submission box if not already running. To do this, log onto AWS Console and start the EC2 instance.

Update IP address in .ssh/config file. To do this, open a terminal and type the command below. This will open your config file, where you can change the IP to the IPv4 address assigned to the AWS EC2 instance (see AWS Console for this):

notepad .ssh/config

SSH into the box. In the terminal, SSH into your box. Typically we name these instances "staging", so usually the command is:

ssh staging

Step 3. Setup the environment

Now you should be logged onto the AWS submission box.

Update the GitHub repositories. In the example below we assume you are running the main branch in COVID19_USA and the main branch in flepiMoP. This assumes you have already loaded the appropriate repositories on your EC2 instance. Have your GitHub ssh key passphrase handy so you can paste it when prompted (possibly multiple times) with the git pull command. Alternatively, you can add your GitHub key to your batch box so you do not have to log in repeatedly (see X).

cd COVID19_USA
git config --global credential.helper cache
git pull 
git checkout main
git pull

cd flepiMoP
git pull	
git checkout main
git pull
cd ..

Initiate the docker. Start up and log into the docker container, pull the repos from GitHub, and run setup scripts to set up the environment. This setup code links the docker directories to the existing directories on your box. Because of this, you should not run job submissions simultaneously using this setup, as one job submission might modify the data for another job submission.

sudo docker pull hopkinsidd/flepimop:latest-dev
sudo docker run -it \
  -v /home/ec2-user/COVID19_USA:/home/app/drp/COVID19_USA \
  -v /home/ec2-user/flepiMoP:/home/app/drp/flepiMoP \
  -v /home/ec2-user/.ssh:/home/app/.ssh \
hopkinsidd/flepimop:latest-dev  
    
cd ~/drp/COVID19_USA
git config credential.helper store 
git pull 
git checkout main
git pull
git config --global credential.helper 'cache --timeout 300000'

cd ~/drp/flepiMoP 
git pull 
git checkout main
git pull 

Rscript build/local_install.R && 
   python -m pip install --upgrade pip &&
   pip install -e flepimop/gempyor_pkg/ && 
   pip install boto3 && 
   cd ..

Step 4. Model Setup

To run the model via AWS, we first do a setup run locally (in Docker on the submission EC2 box).

If not resuming off previous run:

export FLEPI_RUN_INDEX=FCH_R16_lowBoo_modVar_ContRes_blk4_FCH_Dec11_tsvacc && 
   export VALIDATION_DATE="2022-12-11" && 
   export COVID_MAX_STACK_SIZE=1000 && 
   export COMPUTE_QUEUE="Compartment-JQ-1588569574" &&
   export CENSUS_API_KEY=c235e1b5620232fab506af060c5f8580604d89c1 && 
   export FLEPI_RESET_CHIMERICS=TRUE &&
   rm -rf model_output data/us_data.csv data-truth &&
   rm -rf data/mobility_territories.csv data/geodata_territories.csv &&
   rm -rf data/seeding_territories.csv && 
   rm -rf data/seeding_territories_Level5.csv data/seeding_territories_Level67.csv

If resuming from a previous run, there are a couple of additional variables to set. This is the same for a regular resume or continuation resume. Specifically:

RESUME_ID - the FLEPI_RUN_INDEX of the run being resumed.
RESUME_S3 - the S3 bucket where this previous run is stored.

export FLEPI_RUN_INDEX=FCH_R16_lowBoo_modVar_ContRes_blk4_Dec18_tsvacc && 
   export VALIDATION_DATE="2022-12-18" && 
   export COVID_MAX_STACK_SIZE=1000 && 
   export COMPUTE_QUEUE="Compartment-JQ-1588569574" &&
   export CENSUS_API_KEY=c235e1b5620232fab506af060c5f8580604d89c1 && 
   export FLEPI_RESET_CHIMERICS=TRUE &&
   rm -rf model_output data/us_data.csv data-truth &&
   rm -rf data/mobility_territories.csv data/geodata_territories.csv &&
   rm -rf data/seeding_territories.csv && 
   rm -rf data/seeding_territories_Level5.csv data/seeding_territories_Level67.csv
   
export RESUME_LOCATION=s3://idd-inference-runs/USA-20230423T235232

export CONFIG_NAME=config_$SCENARIO.yml && 
   export CONFIG_PATH=/home/app/drp/COVID19_USA/$CONFIG_NAME && 
   export FLEPI_PATH=/home/app/drp/flepiMoP && 
   export PROJECT_PATH=/home/app/drp/COVID19_USA && 
   export INTERVENTION_NAME="med" && 
   export FLEPI_STOCHASTIC=FALSE && 
   rm -rf $PROJECT_PATH/model_output $PROJECT_PATH/us_data.csv && 
   cd $PROJECT_PATH && 
   Rscript $FLEPI_PATH/R/scripts/build_US_setup.R -c $CONFIG_NAME && 
   Rscript $FLEPI_PATH/R/scripts/build_covid_data.R -c $CONFIG_NAME && 
   Rscript $FLEPI_PATH/R/scripts/full_filter.R -c $CONFIG_NAME -j 1 -n 1 -k 1 && 
   printenv CONFIG_NAME

Step 5. Launch job on AWS batch

Configure AWS. Assuming that the simulations finish successfully, you will now enter credentials and submit your job to AWS Batch. Enter the following command into the terminal:

aws configure

You will be prompted to enter the following items. The keys can be found in a file called new_user_credentials.csv (the Access Key ID and Secret Access Key will be given to you once in a file).

Access key ID when prompted
Secret access key when prompted
Default region name: us-west-2
Default output: leave blank when prompted and press Enter

Launch the job. To launch the job, use the appropriate setup based on the type of job you are doing. No modification of these code chunks should be required.

export CONFIG_PATH=$CONFIG_NAME &&
cd $PROJECT_PATH &&
$FLEPI_PATH/batch/inference_job.py -c $CONFIG_PATH -q $COMPUTE_QUEUE --non-stochastic &&
printenv CONFIG_NAME

export CONFIG_PATH=$CONFIG_NAME &&
cd $PROJECT_PATH &&
$FLEPI_PATH/batch/inference_job.py -c $CONFIG_PATH -q $COMPUTE_QUEUE --non-stochastic -j 1 -k 1 &&
printenv CONFIG_NAME

NOTE: Resume and Continuation Resume runs are currently submitted the same way, resuming from an S3 that was generated manually. Typically we will also submit any Continuation Resume run specifying --resume-carry-seeding as starting seeding conditions will be manually constructed and put in the S3.

Carrying seeding (do this to use seeding fits from resumed run):

export CONFIG_PATH=$CONFIG_NAME &&
cd $PROJECT_PATH &&
$FLEPI_PATH/batch/inference_job.py -c $CONFIG_PATH -q $COMPUTE_QUEUE --non-stochastic --resume-carry-seeding --restart-from-location=s3://idd-inference-runs/$RESUME_S3 --restart-from-run-id=$RESUME_ID &&
printenv CONFIG_NAME

Discarding seeding (do this to refit seeding again):

export CONFIG_PATH=$CONFIG_NAME &&  
cd $PROJECT_PATH &&
$FLEPI_PATH/batch/inference_job.py -c $CONFIG_PATH -q $COMPUTE_QUEUE --non-stochastic --resume-discard-seeding --restart-from-location=s3://idd-inference-runs/$RESUME_S3 --restart-from-run-id=$RESUME_ID &&
printenv CONFIG_NAME

Single Iteration + Carry seeding (do this to produce additional scenarios where no fitting is required):

export CONFIG_PATH=$CONFIG_NAME &&
cd $PROJECT_PATH &&
$FLEPI_PATH/batch/inference_job.py -c $CONFIG_PATH -q $COMPUTE_QUEUE --non-stochastic --resume-carry-seeding --restart-from-location=s3://idd-inference-runs/$RESUME_S3 --restart-from-run-id=$RESUME_ID -j 1 -k 1 &&
printenv CONFIG_NAME

Step 6. Document the Submission

Commit files to GitHub. After the job is successfully submitted, you will now be in a new branch of the population repo. Commit the ground truth data files to the branch on GitHub and then return to the main branch:

git add data/ 
git config --global user.email "[email]" 
git config --global user.name "[github username]" 
git commit -m"scenario run initial" 
branch=$(git branch | sed -n -e 's/^\* \(.*\)/\1/p')
git push --set-upstream origin $branch

git checkout main
git pull

Save submission info to Slack. We use a Slack channel to save the submission information that is printed. Copy this to Slack so you can identify the job later. Example output:

Setting number of output slots to 300 [via config file]
Launching USA-20220923T160106_inference_med...
Resuming from run id is SMH_R1_lowVac_optImm_2018 located in s3://idd-inference-runs/USA-20220913T000opt
Discarding seeding results
Final output will be: s3://idd-inference-runs/USA-20220923T160106/model_output/
Run id is SMH_R1_highVac_optImm_2022
Switched to a new branch 'run_USA-20220923T160106'
config_SMH_R1_highVac_optImm_2022.yml

Running with RStudio Server on AWS EC2

Introduction

As a computational environment, you can use an AWS EC2 instance with RStudio Server, either as your personal space or shared among multiple users, via a GUI as well as the CLI over ssh. The EC2 instance type was selected as appropriate for running programs as of now (2023/1), in terms of both computational resources and cost, so that you can use a cloud-based computing environment with GUI access, including from the web, without a difficult setup. The details below may change.

Versions

The current installed versions of software or additional information related to AWS EC2 are as follows:

R/RStudio Server
- R version: 4.2.2
- RStudio Server version: v2022.07.02+576
AWS EC2 instance configurations
- instance-type: r6i.4xlarge (16 cores, 128GB mem)
- Storage: 2TB x 1 (gp3)
- OS: ubuntu 22.04 (Jammy)

Provisioning a server on AWS EC2

To be written. Talk to someone who would be able to do that.

EC2 instance initialization with specific AMI
Configuring network settings, including port openings
Registration of the user in the EC2 instance
Configuring the shared directory via SMB and account

Starting an EC2 instance

The procedure is the same as starting a normal EC2 instance. One way is to select the EC2 instance and start it in the EC2 Management Console.

Once the instance has started, RStudio Server can be accessed without being invoked manually.

Accessing RStudio Server

By default RStudio Server runs on port 8787 and accepts connections from all remote clients. After invoking an EC2 box you should therefore be able to navigate a web browser to the following address to access the server:

http://<ip-addr>:8787/

The authentication dialog will then be shown. Log in by entering the username and password already registered on the box and pushing the "Sign In" button:

The RStudio view then appears as below:

Accessing the Linux server(Ubuntu) using RDP

To access the Linux server with a GUI, RDP software can be used, in addition to the usual way via ssh on the command line.

Using Windows

By using the "Remote Desktop Connection" app in Windows, you can log in to the Linux server box from a remote environment.

Using Mac

For Mac users, the RDP software below is recommended:

Microsoft Remote Desktop

Accessing the shared space on Linux server

As a shared space, the directory named:

/home/shared

is shared among multiple server boxes using EFS (Elastic File System), which supports the NFSv4 protocol.

Accessing the shared space on Linux server using Samba(obsolete)

Common

On the Linux box, the Samba (SMB) service is on by default for file exchange. The area that is readable and writable under the specific user privileges is:

/home/share

When accessing the area via SMB, enter the username and its password in the dialog window that is shown. The username is:

smbshare

(ask for the password for the above user in advance if you want to access via SMB)

Using Windows

Enter an address of the form \\<ip-addr>\share in Windows Explorer.

Using Mac

From Finder you can access the shared space using SMB.

From the Finder menu, choose "Go" then "Connect to Server".
When a dialog appears, fill in the username and password as a registered user.
After pushing the "Connect" button, the designated area will be shown in Finder if no errors occur.

Notes

When you are inside the university networks, e.g. in labs or in the office, you may not be able to access the server box with SMB because the networks may be blocking the ports related to the service.

If you are using a Mac as your local PC, there is a workaround for this situation, but for Windows it is not yet clear whether there is a solution (now under investigation). If you want the related information, currently for Mac users only, please get in touch. For Windows users, I recommend using the "Local devices and resources" setting of Remote Desktop Connection.

Running with docker on AWS - OLD probably outdated

For large simulations, running the model on a cluster or cloud computing is required. AWS provides a good solution for this, if funding or credits are available (AWS can get very expensive).

Introduction

This document contains instructions for setting up and running the two different kinds of SEIR modeling jobs supported by the COVIDScenarioPipeline repository on AWS:

Inference jobs, using AWS Batch to coordinate hundreds/thousands of jobs across a fleet of servers, and
Planning jobs, using a single relatively large EC2 instance (usually an r5.24xlarge) to run one or more planning scenarios on a single high-powered machine.

Local Client Setup

I need a few things to be true about the local machine that you will be using to connect to AWS that I'll outline here:

You have created and downloaded a .pem file for connecting to an EC2 instance to your ~/.ssh directory. When we provision machines, you'll need to use the .pem file as a secret key for connecting. You may need to change the permission of the .pem file:
```
chmod 400 ~/.ssh/<your .pem file goes here>
```

You have created a ~/.ssh/config file that contains an entry that looks like this so we can use staging as an alias for your provisioned EC2 instance in the rest of the runbook:

host staging
HostName <IP address of provisioned server goes here>
IdentityFile ~/.ssh/<your .pem file goes here>
User ec2-user
IdentitiesOnly yes
StrictHostKeyChecking no

You can connect to GitHub via SSH. This is important because we will need to use your GitHub SSH key to interact with private repositories from the staging server on EC2.

Provisioning The Staging Server

If you are having connection timeout issues when trying to ssh into the AWS machine, you should check that you have SSH TCP Port 22 permissions in the dcv_usa/ security group.

sudo yum -y update
sudo yum -y install awscli 
sudo yum -y install git 
sudo yum -y install docker.io 
sudo yum -y install pbzip2 

curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | sudo bash
sudo yum -y install git-lfs
git lfs install

Connect to Github

host github.com
 HostName github.com
 IdentityFile ~/.ssh/id_rsa
 User git

git clone git@github.com:HopkinsIDD/COVID19_Minimal.git

to get it onto the staging server. By convention, we do runs with the COVIDScenarioPipeline repository nested inside of the data repository, so we then do:

cd COVID19_Minimal
git clone git@github.com:HopkinsIDD/COVIDScenarioPipeline.git

to clone the modeling code itself into a child directory of the data repository.

Getting and Launching the Docker Container

sudo docker pull hopkinsidd/covidscenariopipeline:latest-dev

sudo docker run \
  -v /home/ec2-user/COVID19_Minimal:/home/app/src \
  -v /home/ec2-user/.ssh:/home/app/.ssh \
  -it hopkinsidd/covidscenariopipeline:latest-dev

Rscript COVIDScenarioPipeline/local_install.R
(cd COVIDScenarioPipeline/; python setup.py install)

to install the local R packages and then install the Python modules.

Running Inference Jobs

export CENSUS_API_KEY=<your census api key>
cd COVIDScenarioPipeline
git lfs pull
cd ..
Rscript COVIDScenarioPipeline/R/scripts/build_US_setup.R -c config.yml

Rscript COVIDScenarioPipeline/R/scripts/full_filter.R -c config.yml -k 2 -n 1 -j 1 -p COVIDScenarioPipeline

The simplest way to launch an inference job is to simply run

./COVIDScenarioPipeline/batch/inference_job.py -c config.yml

./COVIDScenarioPipeline/batch/inference_job.py --help

to see the full list of command line arguments the script takes and how to set them.

One particular type of command line argument cannot be specified in the config: arguments to resume a run from a previously submitted run. This takes two arguments based on the previous run:

./COVIDScenarioPipeline/batch/inference_job.py --restart-from-s3-bucket=s3://idd-inference-runs/USA-20210131T170334/ --restart-from-run-id=2021.01.31.17:03:34.

./COVIDScenarioPipeline/batch/inference_job_status.py

which will show you the running status of the jobs in each of the queues.

Operating Inference Jobs

Running on Rockfish/MARCC - JHU 🪨🐠

or any HPC using the slurm workload manager

🗂️ Files and folder organization

code-folder: /scratch4/struelo1/flepimop-code/ where each user has their own subfolder, from which the repos are cloned and the runs are launched. E.g., for user chadi, we'll find:
- /scratch4/struelo1/flepimop-code/chadi/covidsp/Flu_USA
- /scratch4/struelo1/flepimop-code/chadi/COVID19_USA
- /scratch4/struelo1/flepimop-code/chadi/flepiMoP
- ...
- (we keep separate repositories per user so that different versions of the pipeline are not mixed when we run several runs in parallel. Don't hesitate to create other subfolders in the code folder (/scratch4/struelo1/flepimop-code/chadi-flusight, ...) if you need them.)

Note that the repository is cloned flat, i.e the flepiMoP repository is at the same level as the data repository, not inside it!

output folder: /scratch4/struelo1/flepimop-runs stores the run outputs. After an inference run finishes, its output and the log files are copied from $PROJECT_PATH/model_output to /scratch4/struelo1/flepimop-runs/THISRUNJOBNAME, where the job name is usually of the form USA-DATE.

When logging on you'll see two folders, data_struelo1 and scr4_struelo1, which are shortcuts to /data/struelo1 and /scratch4/struelo1. We don't use /data/struelo1.

Using ssh from your terminal, type in:

ssh {YOUR ROCKFISH USERNAME}@login.rockfish.jhu.edu

and enter your password when prompted. You'll be on Rockfish's login node, which is a remote computer whose only purpose is to prepare and launch computations on so-called compute nodes.

🧱 Setup (to be done only once per USER)

Load the right modules for the setup:

module purge
module load gcc/9.3.0
module load anaconda3/2022.05  # very important to pin this version as others are buggy
module load git                # needed for git
module load git-lfs            # git-lfs (do we still need it?)

Now, type the following lines so git remembers your credentials and you don't have to enter your token 6 times per day:

git config --global credential.helper store
git config --global user.name "{NAME SURNAME}"
git config --global user.email YOUREMAIL@EMAIL.COM
git config --global pull.rebase false # so you use merge as the default reconciliation method

# install all python stuff first
conda create -c conda-forge -n flepimop-env numba pandas numpy seaborn tqdm matplotlib click confuse pyarrow sympy dask pytest scipy graphviz emcee xarray boto3 slack_sdk

# activate the environment and install the R stuff
conda activate flepimop-env
conda install -c conda-forge r-readr r-sf r-lubridate r-tigris r-tidyverse r-gridextra r-reticulate r-truncnorm r-xts r-ggfortify r-flextable r-doparallel r-foreach r-arrow r-optparse r-devtools r-tidycensus r-cdltools r-cowplot

Clone the FlepiMoP and other model repositories

cd /scratch4/struelo1/flepimop-code/
mkdir $USER
cd $USER
git clone https://github.com/HopkinsIDD/flepiMoP.git
git clone https://github.com/HopkinsIDD/Flu_USA.git
# or any other model repositories

scp -r <username>@rfdtn1.rockfish.jhu.edu:/home/<username>/.ssh/<key_name.pub> .

Then add the public key to your GitHub account. Next, make a file ~/.ssh/config by using the command vi ~/.ssh/config. Press 'i' to go into insert mode and paste the following chunk of code:

Host github.com
    User git
    IdentityFile ~/.ssh/<key_name>

cd /scratch4/struelo1/flepimop-code/
mkdir $USER
cd $USER
git clone git@github.com:HopkinsIDD/flepiMoP.git
git clone git@github.com:HopkinsIDD/Flu_USA.git
# or any other model repositories

Cloning over SSH this way means you will not be prompted for credentials.

Setup your Amazon Web Services (AWS) credentials

This can be done later, but it is necessary in order to push and pull data to Amazon Simple Storage Service (S3). Set up AWS by running:

cd ~ # go in your home directory
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
./aws/install -i ~/aws-cli -b ~/aws-cli/bin
./aws-cli/bin/aws --version

Then run ./aws-cli/bin/aws configure to set up your credentials:

# AWS Access Key ID [None]: Access key
# AWS Secret Access Key [None]: Secret Access key
# Default region name [None]: us-west-2
# Default output format [None]: json

To get the access key and the secret access key, ask the AWS administrator (Shaun Truelove) to generate them for you.
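The commands above call the CLI through its full path. As a convenience, and assuming the install location used above (`~/aws-cli/bin`), that directory can be put on `PATH` so that a plain `aws` works. The helper name `prepend_path` is hypothetical.

```shell
# Sketch: add a directory to the front of PATH unless it is already listed.
prepend_path() {
  case ":$PATH:" in
    *":$1:"*) ;;              # already present, leave PATH unchanged
    *) PATH="$1:$PATH" ;;
  esac
  export PATH
}

# Assumes the CLI was installed with -b ~/aws-cli/bin as shown above.
prepend_path "$HOME/aws-cli/bin"
```

Adding these lines to a shell startup file would make the change persistent; whether to do so is a matter of preference.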

🚀 Run inference using slurm (do every time)

Log in to Rockfish via ssh, then type:

source /scratch4/struelo1/flepimop-code/$USER/flepiMoP/batch/slurm_init.sh

Note that now the run-id of the run we resume from is automatically inferred by the batch script :)

What does this do? What if it returns an error?

This script runs the following commands to set up the environment, which you can also run individually.

module purge
module load gcc/9.3.0
module load git
module load git-lfs
module load slurm
module load anaconda3/2022.05
conda activate flepimop-env
export CENSUS_API_KEY={A CENSUS API KEY}
export FLEPI_STOCHASTIC_RUN=false
export FLEPI_RESET_CHIMERICS=TRUE
export FLEPI_PATH=/scratch4/struelo1/flepimop-code/$USER/flepiMoP

# It then asks you some questions to set up some environment variables

It then prompts you to set the following 3 environment variables. You can skip this part and set them manually later.

export VALIDATION_DATE="2023-01-29"
export RESUME_LOCATION=s3://idd-inference-runs/USA-20230122T145824
export FLEPI_RUN_INDEX=FCH_R16_lowBoo_modVar_ContRes_blk4_Jan29_tsvacc

Check that the conda environment is activated: you should see (flepimop-env) on the left of your command-line prompt.
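The same check can be scripted. `conda activate` sets the `CONDA_DEFAULT_ENV` variable to the name of the active environment, so a minimal sketch (with the hypothetical helper name `require_conda_env`) is:

```shell
# Sketch: fail early when the expected conda environment is not active.
require_conda_env() {
  expected="$1"
  if [ "${CONDA_DEFAULT_ENV:-}" != "$expected" ]; then
    echo "conda environment '$expected' is not active (found: '${CONDA_DEFAULT_ENV:-none}')" >&2
    return 1
  fi
}

# usage: require_conda_env flepimop-env && which python
```

Chaining it with `which python` also shows whether the Python on your path is the one from conda.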

Then prepare the pipeline directory (if you have already done that and the pipeline hasn't been updated, i.e. git pull says it's up to date, you can skip these steps):

cd /scratch4/struelo1/flepimop-code/$USER
export FLEPI_PATH=$(pwd)/flepiMoP
cd $FLEPI_PATH
git checkout main
git pull

# install dependencies ggraph and tidygraph
R
> install.packages(c("ggraph","tidygraph"))
> quit()

# install the R module
Rscript build/local_install.R # warnings are ok; there should be no error.

# install gempyor
pip install --no-deps -e flepimop/gempyor_pkg/

Now flepiMoP is ready 🎉. If the R command doesn't work, try r, and if that doesn't work either, run module load r/4.0.2.

For a COVID-19 run, do:

cd /scratch4/struelo1/flepimop-code/$USER
export PROJECT_PATH=$(pwd)/COVID19_USA
export GT_DATA_SOURCE="csse_case, fluview_death, hhs_hosp"
export DELPHI_API_KEY="[YOUR API KEY]"

For a flu run, do:

cd /scratch4/struelo1/flepimop-code/$USER
export PROJECT_PATH=$(pwd)/Flu_USA

Now for any type of run:

cd $PROJECT_PATH
git pull 
git checkout main

Do some clean-up before your run. The fast way is to restore the $PROJECT_PATH git repository to its blank state (⚠️ removes everything that does not come from git):

git reset --hard && git clean -f -d  # this deletes everything that is not on github in this repo !!!
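Because this clean-up cannot be undone, it may be worth previewing it first. `git clean -n` is git's dry-run mode: it lists what would be removed without deleting anything. The wrapper name `preview_clean` is hypothetical.

```shell
# Sketch: show what the reset-and-clean command would discard, without changing anything.
preview_clean() {
  # Tracked files with local changes, which `git reset --hard` would revert.
  git status --short --untracked-files=no
  # Untracked files and directories, which `git clean -f -d` would delete.
  git clean -n -d
}
```

Run it from inside $PROJECT_PATH; lines starting with "Would remove" are the untracked paths that the real command would delete.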

I want more control over what is deleted

If you still want to use git to clean the repo but want finer control, or want to understand how dangerous the command is, read this.

If you prefer to have more control, delete the files you like, e.g.:

rm -rf model_output data/us_data.csv data-truth &&
   rm -rf data/mobility_territories.csv data/geodata_territories.csv &&
   rm -rf data/seeding_territories.csv && 
   rm -rf data/seeding_territories_Level5.csv data/seeding_territories_Level67.csv

# don't delete model_output if you have another run in parallel
rm -rf $PROJECT_PATH/model_output
# delete log files from previous runs
rm *.out

Run the preparatory scripts for the data and you are good to go:

export CONFIG_PATH=config_FCH_R16_lowBoo_modVar_ContRes_blk4_Jan29_tsvacc.yml
Rscript $FLEPI_PATH/datasetup/build_US_setup.R

# For covid do
Rscript $FLEPI_PATH/datasetup/build_covid_data.R

# For Flu (do not do this for the scenariohub!)
R
> install.packages(c("RSocrata"))
> quit()
Rscript $FLEPI_PATH/datasetup/build_flu_data.R

# build seeding
Rscript $FLEPI_PATH/datasetup/build_initial_seeding.R

If you want to profile how the model is using your memory resources during the run:

export FLEPI_MEM_PROFILE=TRUE
export FLEPI_MEM_PROF_ITERS=50

Now you may want to test that it works:

flepimop-inference-main -c $CONFIG_PATH -j 1 -n 1 -k 1

If this fails, investigate the error before going further. If it succeeds, proceed by first deleting the model_output folder:

rm -r model_output

Launch your inference batch job

Now you're fully set to go 🎉

To launch the whole inference batch job, type the following command:

python $FLEPI_PATH/batch/inference_job_launcher.py --slurm 2>&1 | tee ${FLEPI_RUN_INDEX}_submission.log
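A shell detail matters for the log file name. Underscores are valid in variable names, so when a variable is followed directly by `_submission.log` the shell needs braces to know where the name ends. A minimal sketch, with the hypothetical helper name `submission_log`:

```shell
# Sketch: build the submission log name from a run index.
submission_log() {
  run_index="$1"
  # Braces end the variable name before the underscore; without them the
  # shell would look up a variable called run_index_submission instead.
  printf '%s\n' "${run_index}_submission.log"
}

# usage: python ... 2>&1 | tee "$(submission_log "$FLEPI_RUN_INDEX")"
```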

If you'd like to have more control, you can specify the arguments manually:

python $FLEPI_PATH/batch/inference_job_launcher.py --slurm \
                    -c $CONFIG_PATH \
                    -p $FLEPI_PATH \
                    --data-path $PROJECT_PATH \
                    --upload-to-s3 True \
                    --id $FLEPI_RUN_INDEX \
                    --fs-folder /scratch4/struelo1/flepimop-runs \
                    --restart-from-location $RESUME_LOCATION
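The manual form relies on several environment variables being set. As a sketch, a pre-flight check can report any that are empty before the job is submitted; the function name `check_required_vars` is hypothetical, and the variable list in the usage line is simply the set referenced in the command above.

```shell
# Sketch: return non-zero and name each variable that is unset or empty.
check_required_vars() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"   # indirect lookup that also works in plain sh
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# usage: check_required_vars FLEPI_PATH PROJECT_PATH CONFIG_PATH FLEPI_RUN_INDEX RESUME_LOCATION
```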

If you want to send any post-processing outputs to #flepibot-test instead of csp-production, add the flag --slack-channel debug.

Then commit the data files used for this run and push them to GitHub:

git add data/ 
git commit -m"scenario run initial" 
branch=$(git branch | sed -n -e 's/^\* \(.*\)/\1/p')
git push --set-upstream origin $branch

DO NOT finish by checking out main as in the AWS instructions, since the run will use the data in the current folder.

Monitor your run


There are two types of log files: the slurm-JOBID_SLOTID.out files in $PROJECT_PATH, and the filter_MC logs, e.g.:

tail -f /scratch4/struelo1/flepimop-runs/USA-20230130T163847/log_FCH_R16_lowBoo_modVar_ContRes_blk4_Jan29_tsvacc_100.txt

Helpful commands

When approaching the file number quota, type

find . -maxdepth 1 -type d | while read -r dir
 do printf "%s:\t" "$dir"; find "$dir" -type f | wc -l; done

to find how many files each subfolder contains.
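The loop above prints counts in directory order. A variant that lists the fullest subfolders first can make the culprit easier to spot. This is a sketch; the function name `count_files_per_dir` is hypothetical.

```shell
# Sketch: print "<file count><TAB><subfolder>" for each immediate subfolder,
# largest count first.
count_files_per_dir() {
  root="${1:-.}"
  find "$root" -mindepth 1 -maxdepth 1 -type d | while IFS= read -r dir; do
    n=$(find "$dir" -type f | wc -l | tr -d ' ')   # strip padding some wc versions add
    printf '%s\t%s\n' "$n" "$dir"
  done | sort -rn
}

# usage: count_files_per_dir /scratch4/struelo1/flepimop-runs | head
```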

Common errors

If odd missing-package errors appear, check with which python that Python comes from conda. Sometimes conda magically disappears.
Don't use ipython, as it breaks click's flags.

cleanup:

rm -r /scratch4/struelo1/flepimop-runs/
rm -r model_output
cd $COVID_PATH;git pull;cd $PROJECT_PATH
rm *.out

Get a notification on your phone/mail when a run is done

We use ntfy.sh for notifications. Install ntfy on your iPhone or Android device. Then subscribe to the channel ntfy.sh/flepimop_alerts, where you'll receive notifications when runs are done.

End-of-job notifications are sent with urgent priority.
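ntfy topics accept plain HTTP POSTs, so a message can also be sent by hand, for example at the end of a wrapper script. The sketch below is an assumption about usage, not the pipeline's own notification code. `notify_done` is a hypothetical name, and the `CURL` override exists only to allow a dry run.

```shell
# Sketch: post an urgent-priority message to an ntfy topic.
notify_done() {
  topic="$1"      # e.g. flepimop_alerts
  run_name="$2"   # e.g. the value of FLEPI_RUN_INDEX
  "${CURL:-curl}" -H "Priority: urgent" \
    -d "Run ${run_name} finished" \
    "https://ntfy.sh/${topic}"
}

# usage:   notify_done flepimop_alerts "$FLEPI_RUN_INDEX"
# dry run: CURL=echo notify_done flepimop_alerts test_run
```

The dry run prints the arguments that would be passed to curl instead of contacting the server, which avoids alerting everyone subscribed to the channel while testing.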

How to use slurm

Check your running jobs:

squeue -u $USER

where the job id column shows your full array job id, followed by each slot after the underscore. You can see each job's status (R: running, PD: pending), how long it has been running, and so on.

To cancel a job:
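For large array jobs the listing gets long. A sketch that condenses it to a count per state code follows, relying on squeue's `-h` (no header) and `-o "%t"` (compact state code) options. The function name `summarize_states` is hypothetical.

```shell
# Sketch: read one state code per line on stdin and print "<code>=<count>".
summarize_states() {
  sort | uniq -c | awk '{print $2 "=" $1}'
}

# usage: squeue -u "$USER" -h -o "%t" | summarize_states
```

With two running slots and one pending slot, this prints `PD=1` and `R=2` on separate lines.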

scancel JOB_ID

Running an interactive session

interact -p defq -n 4 -m 24G -t 12:00:00

Moving files to your local computer

Often you'll need to move files back and forth between Rockfish and your local computer. To do this, you can use Open OnDemand or any command-line tool, for example scp, run from your local computer:

scp -r <user>@rfdtn1.rockfish.jhu.edu:"<file path of what you want>" <where you want to put it in your local>

Installation notes

These steps have already been done and affect all users, but they might be useful if you'd like to run on another cluster.

Install slack integration

So our 🤖-friend can send us some notifications once a run is done.

cd /scratch4/struelo1/flepimop-code/
nano slack_credentials.sh
# and fill the file:
export SLACK_WEBHOOK="{THE SLACK WEBHOOK FOR CSP_PRODUCTION}"
export SLACK_TOKEN="{THE SLACK TOKEN}"
