Modeling infectious disease dynamics

Within flepiMoP, gempyor is an open-source Python package that constructs and simulates compartmental infectious disease dynamics models. gempyor is meant to be used within flepiMoP, where it integrates with parameter inference and data processing scripts, but can also be run standalone with a command-line interface, generating simulations of disease incidence under different scenario assumptions.

To simulate an infectious disease dynamics problem, the following building blocks need to be defined:

  • The population structure over which the disease is transmitted

  • The transmission model, defining the compartments and the transitions between compartments

  • An observation model, defining different observable outcomes (serology, hospitalization, deaths, cases) from the transmission model

  • The parameters and modifiers that apply to them

Generalized compartmental infection model

At the core of our pipeline is a dynamic mathematical model that categorizes individuals in the population into a discrete set of states ('compartments') and describes the rates at which transitions between states can occur. Our modeling approach was developed to describe classic infectious disease transmission models, like the SIR model, but is much more general. It can encode any compartmental model in which transitions between states are of the form

$$X \xrightarrow{b X Z^{a}} Y,$$

where $X$, $Y$, and $Z$ are time-dependent variables describing the number of individuals in each state, $b$ is a rate parameter (units of time$^{-1}$) and $a$ is a scaling parameter (unitless). $Z$ may be $X$, $Y$, a different variable, or 1, and the rate may also be the sum of terms of this form. Rates that involve non-linear functions or more than two variables are currently not possible. For simplicity, we omitted the time dependencies on parameters (e.g., $X$ is in fact $X(t)$, and $a$, $b$ are $a(t)$, $b(t)$).
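
As a worked instance of this general form (using the SEIR notation introduced below, so this mapping is illustrative rather than part of the original text), the infection step of an SEIR-type model corresponds to choosing $X = S$, $Y = E$, $Z = I$, $b = \beta/N$, and $a = 1$:

$$S \xrightarrow{(\beta/N)\, S\, I} E.$$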

The model can be simulated as a continuous-time, deterministic process (i.e., a set of ordinary differential equations), which in this example would be in the form

$$\frac{dX}{dt} = b X Z^a.$$

Details on the numerical integration procedure for simulating such equations are given in the Advanced section.

Alternatively, the model can be simulated as a discrete-time stochastic process, where the number of individuals transitioning between states $X$ and $Y$ at time $t$ is a binomial random variable

$$N_{X\rightarrow Y}(t) = \textrm{Binom}\!\left(X,\,1-e^{-\Delta{t} \cdot b Z(t)^a}\right),$$

where the second term is the expected fraction of individuals in the state $X$ at time $t$ who would transition to $Y$ by time $t+\Delta t$ if there were no other changes to $X$ in this time, and the time step $\Delta t$ is a chosen parameter that must be small for equivalence between continuous- and discrete-time versions of the model.

SEIR model

For example, an SEIR model – which describes an infection for which susceptible individuals ($S$) who are infected first pass through a latent or exposed ($E$) phase before becoming infectious ($I$) and that confers perfect lifelong immunity after recovery ($R$) – could be encoded as

$$S \xrightarrow{\beta S I/N} E \xrightarrow{\sigma E} I \xrightarrow{\gamma I} R,$$

where $\beta$ is the transmission rate (rate of infectious contact per infectious individual), $\sigma$ is the rate of progression ($1/\sigma$ is the average latent/incubation period), $\gamma$ is the recovery rate ($1/\gamma$ is the average duration of the infectious period), and $N$ is the total population size ($N=S+E+I+R$). In differential equation form, this model is

$$\frac{dS}{dt} = - \beta S \frac{I}{N},$$
$$\frac{dE}{dt} = \beta S \frac{I}{N} - \sigma E,$$
$$\frac{dI}{dt} = \sigma E - \gamma I,$$
$$\frac{dR}{dt} = \gamma I,$$

and as a stochastic process, it is

$$N_{S\rightarrow E}(t) = \textrm{Binom}\!\left(S(t),\,1-e^{-\Delta{t} \cdot \beta I(t)/N}\right),$$
$$N_{E\rightarrow I}(t) = \textrm{Binom}\!\left(E(t),\,1-e^{-\Delta{t} \cdot \sigma}\right),$$
$$N_{I\rightarrow R}(t) = \textrm{Binom}\!\left(I(t),\,1-e^{-\Delta{t} \cdot \gamma }\right).$$
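
The following is a minimal Python sketch of this discrete-time stochastic SEIR process for a single population; it is illustrative only (not gempyor code), and the step size, parameter values, and initial state are arbitrary choices for the example.

    import numpy as np

    def stochastic_seir(beta, sigma, gamma, N, I0=1, days=180, dt=0.1, seed=0):
        """Simulate the discrete-time stochastic SEIR model with binomial transitions."""
        rng = np.random.default_rng(seed)
        S, E, I, R = N - I0, 0, I0, 0
        trajectory = []
        for step in range(int(days / dt)):
            # Per-capita transition probabilities over one time step dt
            p_SE = 1 - np.exp(-dt * beta * I / N)   # infection
            p_EI = 1 - np.exp(-dt * sigma)          # progression to infectious
            p_IR = 1 - np.exp(-dt * gamma)          # recovery
            n_SE = rng.binomial(S, p_SE)
            n_EI = rng.binomial(E, p_EI)
            n_IR = rng.binomial(I, p_IR)
            S, E, I, R = S - n_SE, E + n_SE - n_EI, I + n_EI - n_IR, R + n_IR
            trajectory.append((step * dt, S, E, I, R))
        return trajectory

    # Example: roughly R0 ~ 2.5, 4-day latent period, 5-day infectious period
    traj = stochastic_seir(beta=0.5, sigma=0.25, gamma=0.2, N=100_000)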

A common COVID-19 model is a variation of this SEIR model that incorporates:

  1. multiple identical stages of the infectious period, which allows us to model gamma-distributed durations of infectiousness, and

  2. an infection rate modified by a 'mixing coefficient', $\alpha \in [0,1]$, which is a rough heuristic for the slowdown in disease spread that occurs in realistically heterogeneous populations where more well-connected individuals are infected first.

A three-stage infectious period model is given by

$$S \xrightarrow{\beta S (I_1+I_2+I_3)^\alpha/N} E \xrightarrow{\sigma E} I_1 \xrightarrow{3\gamma I_1} I_2 \xrightarrow{3\gamma I_2} I_3 \xrightarrow{3\gamma I_3} R.$$

The flepiMoP model structure is specifically designed to make it simple to encode the type of more complex "stratified" models that often arise in infectious disease dynamics. The following are some examples of possible stratifications.

Age groups

To describe an SEIR-type disease that spreads and progresses differently among children versus adults, one may want to repeat each compartment of the model for each of the two age groups (C – Children, A – Adults), creating an age-stratified model

$$S_C \xrightarrow{S_C (\beta_{CC} I_C/N_C + \beta_{AC} I_A/N_A)} E_C \xrightarrow{\sigma_C E_C} I_C \xrightarrow{\gamma_C I_C} R_C,$$
$$S_A \xrightarrow{S_A (\beta_{AA} I_A/N_A + \beta_{CA} I_C/N_C)} E_A \xrightarrow{\sigma_A E_A} I_A \xrightarrow{\gamma_A I_A} R_A,$$

where $\beta_{XY}$ is the transmission rate between age group $X$ and age group $Y$, and we have assumed individuals do not age on the timescale relevant to the model.

Vaccination status

Vaccination status could influence disease progression and infectiousness, and could also change over time as individuals choose to get the vaccine (V – vaccinated, U – unvaccinated)

$$S_U \xrightarrow{\beta S_U (I_U + I_V)/N} E_U \xrightarrow{\sigma_U E_U} I_U \xrightarrow{\gamma_U I_U} R_U,$$
$$S_V \xrightarrow{\beta (1-\theta) S_V (I_U + I_V)/N} E_V \xrightarrow{\sigma_V E_V} I_V \xrightarrow{\gamma_V I_V} R_V,$$
$$S_U \xrightarrow{\nu S_U} S_V,$$
$$R_U \xrightarrow{\nu R_U} R_V,$$

where $\nu$ is the vaccination rate (we assume that individuals do not receive the vaccine while they are exposed or infectious) and $\theta$ is the vaccine efficacy against infection. Similar structures could be used for other sources of prior immunity or other dynamic risk groups.

Pathogen strain

Another common stratification would be pathogen strain, such as COVID-19 variants. Individuals may be infected with one of several variants, strains, or serotypes. Our framework can easily create multistrain models, for example

$$S_A \xrightarrow{\beta_A S_A I_A/N_A} E_A \xrightarrow{\sigma_A E_A} I_A \xrightarrow{\gamma_A I_A} R_A,$$
$$S_B \xrightarrow{\beta_B S_B I_B/N_B} E_B \xrightarrow{\sigma_B E_B} I_B \xrightarrow{\gamma_B I_B} R_B,$$
$$R_{A} \xrightarrow{\beta_B(1-\phi_{AB}) R_A I_B/N_B} E_{AB} \xrightarrow{\sigma_{AB} E_{AB}} I_{AB} \xrightarrow{\gamma_{AB} I_{AB}} R_{AB},$$
$$R_{B} \xrightarrow{\beta_A (1-\phi_{BA}) R_B I_A/N_A} E_{AB} \xrightarrow{\sigma_{AB} E_{AB}} I_{AB} \xrightarrow{\gamma_{AB} I_{AB}} R_{AB},$$

where $\phi_{AB}$ is the immune cross-protection conferred from infection with strain A to subsequent infection with strain B. Co-infection is ignored. All individuals are assumed to be initially equally susceptible to both infections and are just categorized as $S_A$ (vs $S_B$) for convenience.

All combinations of these situations can be quickly specified in flepiMoP. Details on how to encode these models are provided in the Model Implementation section, with examples given in the Tutorials section.

Clinical outcomes and observations model

The pipeline allows for an additional type of dynamic state variable beyond those included in the mathematical model. We refer to these extra variables as "Outcomes" or "Observations". Outcome variables can be functions of model variables, but do not feed back into the model by influencing other state variables. Typically, we use outcome variables to describe the process through which some subset of individuals in a compartment are "observed" and become part of the data to which models are compared and attempt to predict. For example, in the context of a model for an infectious disease like COVID-19, outcome variables include reported cases, hospitalizations, and deaths.

An outcome variable $H(t)$ can be generated from a state variable $X(t)$ of the mathematical model using the following properties:

  • The proportion of all individuals in $X$ who will be observed as $H$, $p$

  • The delay between when an individual enters state $X$ and when they are observed as $H$, which can follow a class of probability distributions $f(\Delta t;\theta)$, where $\theta$ are the parameters of the distribution (e.g., the mean and standard deviation of a normal distribution)

  • (optional) the duration spent in observable $H$, in which case the output will also contain the prevalence (number of individuals currently in $H$) in addition to the incidence into $H$

In addition to single values (drawn from a distribution), the duration and delay can be inputted as distributions, producing a convolution of the output.

The number of individuals in $X$ at time $t_1$ who become part of the outcome variable $H(t_2)$ is a random variable, and individuals who are observed in $H$ at time $t$ could have entered $X$ at different times in the past.

Formally, for a deterministic, continuous-time model

$$H(t) = \int_{\tau} p\, X(\tau) f(t-\tau, \theta)\, d\tau.$$

For a discrete-time, stochastic model

$$H(t) = \sum_{\tau_i=0}^{t}\text{Multinomial}\!\left(\text{Binomial}(X(\tau_i), p), \{f(t-\tau_i, \theta)\}\right).$$

Note that outcomes constructed in this way always represent incidence values, meaning they describe the number of individuals newly entering this state at time $t$. If the model state $X(t)$ is also an incidence, then $p$ is a unitless probability, whereas if $X(t)$ is a prevalence (the number of individuals currently in state $X$ at time $t$), then $p$ is instead a probability per time unit.
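
The discrete-time construction above can be sketched in Python as follows; this is an illustration (not gempyor code), the incidence series and the choice of a gamma-shaped delay with mean 7 and standard deviation 3 days are placeholder assumptions.

    import numpy as np
    from scipy import stats

    def draw_outcome(incidence, p, delay_mean=7.0, delay_sd=3.0, seed=0):
        """Convolve incidence X(t) with observation probability p and a delay pmf f."""
        rng = np.random.default_rng(seed)
        T = len(incidence)
        # Discretize a gamma delay distribution onto whole days (an arbitrary choice here)
        shape = (delay_mean / delay_sd) ** 2
        scale = delay_sd ** 2 / delay_mean
        days = np.arange(T)
        pmf = stats.gamma.cdf(days + 1, shape, scale=scale) - stats.gamma.cdf(days, shape, scale=scale)
        pmf /= pmf.sum()
        H = np.zeros(T, dtype=int)
        for tau, x in enumerate(incidence):
            observed = rng.binomial(x, p)             # Binomial(X(tau), p)
            delays = rng.multinomial(observed, pmf)   # spread over delays f(t - tau)
            for d, n in enumerate(delays):
                if tau + d < T:
                    H[tau + d] += n
        return H

    # Example: 120 days of made-up infection incidence, 5% observed as hospitalizations
    H = draw_outcome(incidence=np.random.default_rng(1).poisson(50, size=120), p=0.05)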

Outcomes can also be constructed as functions of other outcomes. For example, a fraction of hospitalized patients may end up in the intensive care unit (ICU).

There are several benefits to separating outcome variables from the mathematical model. Firstly, these variables can be calculated after the model is run, and only at the timepoints of interest, which can dramatically reduce the memory needed during model simulation. Secondly, outcome variables can be fully stochastic even when the mathematical model is simulated deterministically. This becomes useful when an infection might be at high enough prevalence that a deterministic simulation is appropriate, but when there is a rare and therefore quite stochastic outcome reported in the data (e.g., severe cases) that the model is tasked with predicting. Thirdly, outcome variables can have arbitrary delay distributions, to take into account the complexities of health reporting practices, whereas our mathematical modeling framework is designed mainly for exponentially distributed delays and only easily permits extensions to gamma-distributed delays. Finally, this separation keeps the pipeline modular and allows for easy editing of one component of the model without disrupting the other.

Details on how to specify these outcomes in the model configuration files are provided in the Model Implementation section, with examples given in the Tutorials section.

Population structure

The pipeline was designed specifically to simulate infection dynamics in a set of connected subpopulations. These subpopulations could represent geographic divisions, like countries, states, provinces, or neighborhoods, or demographic groups, or potentially even different host species. The equations and parameters of the transmission and outcomes models are repeated for each subpopulation, but the values of the parameters can differ by location. Within each subpopulation, infection is equally likely to spread between any pair of susceptible/infected individuals after accounting for their infection class, whereas between subpopulations there may be varying levels of mixing.

Formally, this type of population structure is often referred to as a “metapopulation”, and each subpopulation may be called a “deme”.

The following properties may be different between subpopulations:

  • the population size

  • the parameters of the transmission model (see LINK)

  • the parameters of the outcomes model (see LINK)

  • the amount of transmission that occurs within this subpopulation versus from any other subpopulation (see LINK)

  • the timing and extent of any interventions that modify these parameters (see LINK)

  • the initial timing and number of external introductions of infections into the population (see LINK)

  • the ground truth timeseries data used to compare to model output and infer model parameters (see LINK)

Currently, the following properties must be the same across all subpopulations:

  • the compartmental model structure

  • the form of the likelihood function used to estimate parameters by fitting the model to data (LINK)

  • ...

Mixing between subpopulations

The generalized compartmental model allows for second order "interaction" terms that describe transitions between model states that depend on interactions between pairs of individuals. For example, in the context of a classical SIR model, the rate of new infections depends on interactions between susceptible and infectious individuals and the transmission rate $\beta$

$$\frac{dI}{dt} = \beta S I - \gamma I.$$

For a model with multiple subpopulations, each of these interactions can occur either between individuals in the same or different subpopulations, with specific rate parameters for each combination of individual locations

$$\frac{dI_i}{dt} = \sum_j \beta_{ji} I_j S_i - \gamma I_i,$$

where $\beta_{ji}$ is the per-contact per-time rate of disease transmission between an infected individual residing in subpopulation $j$ and a susceptible individual from subpopulation $i$.

In general, for infection models in connected subpopulations, the transmission rates $\beta_{ji}$ can take on arbitrary values. In this pipeline, however, we impose an additional structure on these terms. We assume that interactions between subpopulations occur when individuals temporarily relocate to another subpopulation, where they interact with locals. We call this movement "mobility", and it could be due to regular commuting, special travel, etc. There is a transmission rate ($\beta_j$) associated with each subpopulation $j$, and individuals physically in that subpopulation – permanently or temporarily – are exposed and infected with this local rate whenever they encounter local susceptible individuals.

The transmission matrix is then

$$\beta_{ji} = \begin{cases} p_a \dfrac{M_{ij}}{N_i} \beta_j &\text{if } j \neq i \\[2mm] \left( 1- \sum_{j \neq i} p_a \dfrac{M_{ij}}{N_i} \right) \beta_i &\text{if } j = i \end{cases}$$

where $\beta_j$ is the onward transmission rate from infected individuals in subpopulation $j$, $M_{ij}$ is the number of individuals in subpopulation $i$ who are interacting with individuals in subpopulation $j$ at any given time (for example, the fraction who commute each day), and $p_a$ is a fractional scaling factor for the strength of inter-population contacts (for example, representing the fraction of hours in a day commuting individuals spend outside vs. inside their subpopulation).
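
A small Python sketch (not gempyor's implementation) of how a transmission matrix of this form could be assembled from a mobility matrix, population sizes, local transmission rates, and the scaling factor; the three-subpopulation numbers below are invented purely for illustration.

    import numpy as np

    def transmission_matrix(beta, M, N, p_a):
        """Build beta_ji from local rates beta_j, mobility M[i, j], populations N_i, and scaling p_a."""
        n = len(N)
        B = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                if j != i:
                    B[j, i] = p_a * M[i, j] / N[i] * beta[j]
            # Remaining "at home" contacts of residents of i occur at the local rate beta_i
            B[i, i] = (1 - p_a * sum(M[i, j] / N[i] for j in range(n) if j != i)) * beta[i]
        return B

    beta = np.array([0.5, 0.6, 0.4])            # local transmission rates beta_j
    N = np.array([100_000, 50_000, 20_000])     # population sizes N_i
    M = np.array([[0, 5_000, 1_000],            # M[i, j]: residents of i interacting in j
                  [4_000, 0, 500],
                  [800, 300, 0]])
    B = transmission_matrix(beta, M, N, p_a=0.5)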

The list of all pairwise mobility values and the interaction scaling factor are model input parameters. Details on how to specify them are given in the Model Implementation section.

If an alternative compartmental disease model is created that has other interactions (second order terms), then the same mobility values are used to determine the degree of interaction between each pair of subpopulations.

Initial conditions

Initial conditions can be specified by setting the values of the compartments in the disease transmission model at time zero, or the start of the simulation. For example, we might assume that for day zero of an outbreak the whole population is susceptible except for one single infected individual, i.e., $S(0) = N-1$ and $I(0) = 1$. Alternatively, we might assume that a certain proportion of the population has prior immunity from previous infection or vaccination.

It might also be necessary to model instantaneous changes in values of model variables at any time during a simulation. We call this 'seeding'. For example, individuals may import infection from other external populations, or instantaneous mutations may occur, leading to new variants of the pathogen. These processes can be modeled with seeding, allowing individuals to change state at specified times independently of model equations.

We also note that seeding can be used as a convenient way to specify initial conditions, particularly early in an outbreak where the outbreak is triggered by a few 'seedings'.

Time-dependent interventions

Parameters in the disease transmission model or the observation model may change over time. These changes could be, for example: environmental drivers of disease seasonality; “non-pharmaceutical interventions” like social distancing, isolation policies, or wearing of personal protective equipment; “pharmaceutical interventions” like vaccination, prophylaxis, or therapeutics; changes in healthcare seeking behavior like testing and diagnosis; changes in case reporting, etc.

The model allows for any parameter of the disease transmission model or the observation model to change to a new value for a time interval specified by start and end times (or multiple start and end times, for interventions that are recurring). Each change may be subpopulation-specific or apply to the entire population. Changes may be overlapping in time.

The magnitudes of these changes are themselves model parameters, and thus may be inferred along with other parameters when the model is fit to data. Currently, the start and end times of interventions must be fixed and cannot be varied or inferred.

For example, the rate of transmission in subpopulation $j$, $\beta_j$, may be reduced by an intervention of strength $r_k$ that acts between times $t_{k,\text{start}}$ and $t_{k,\text{end}}$, and by another intervention of strength $r_l$ that acts between times $t_{l,\text{start}}$ and $t_{l,\text{end}}$

$$\beta_j'(t) = (1-r_k(t;t_{k,\text{start}},t_{k,\text{end}}))\,(1-r_l(t;t_{l,\text{start}},t_{l,\text{end}}))\,\beta_j^0.$$

In this case, $r_k(t)$ and $r_l(t)$ are both considered simple SinglePeriodModifier interventions. There are four possible types of interventions that can be included in the model (a small sketch of how such modifiers act on a parameter follows the list below)

  • SinglePeriodModifier – an intervention that leads to a fractional reduction of size $r_j$ in a parameter value (e.g., $\beta_j$) in subpopulation $j$ between two timepoints

    $$\beta_j'(t) = (1-r_j(t;t_{j,\text{start}},t_{j,\text{end}}))\,\beta_j^0$$

    $$r_j(t;t_{j,\text{start}},t_{j,\text{end}}) = \begin{cases} r_j &\text{if } t_{j,\text{start}} < t <t_{j,\text{end}} \\ 0 &\text{otherwise} \end{cases}$$

  • MultiPeriodModifier – an intervention that leads to a fractional reduction of size $r_j$ in a parameter value (e.g., $\beta_j$) in subpopulation $j$ between multiple sets of timepoints

    $$\beta_j'(t) = (1-r_j(t; \{t_{j,k,\text{start}},t_{j,k,\text{end}}\}_k))\,\beta_j^0$$

    $$r_j(t;\{t_{j,k,\text{start}},t_{j,k,\text{end}}\}_k) = \begin{cases} r_j &\text{if } t_{j,k_1,\text{start}} < t <t_{j,k_1,\text{end}} \\ r_j &\text{if } t_{j,k_2,\text{start}} < t <t_{j,k_2,\text{end}} \\ \quad \vdots \\ r_j &\text{if } t_{j,k_n,\text{start}} < t <t_{j,k_n,\text{end}} \\ 0 &\text{otherwise} \end{cases}$$

  • ModifierModifier – an intervention that leads to a fractional reduction $\pi_j$ in the value of another intervention $r_j$ between two timepoints

    $$\beta_j'(t) = (1-r_j(t;t_{j,\text{start}},t_{j,\text{end}})\,(1-\pi_{r,j}(t;t_{r,j,\text{start}},t_{r,j,\text{end}})))\,\beta_j^0$$

    $$\pi_{r,j}(t;t_{r,j,\text{start}},t_{r,j,\text{end}}) = \begin{cases} \pi_{r,j} &\text{if } t_{r,j,\text{start}} < t <t_{r,j,\text{end}} \\ 0 &\text{otherwise} \end{cases}$$

  • StackedModifier - TBA
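
As referenced above, here is a small Python sketch (illustrative only, not flepiMoP code) of how a SinglePeriodModifier-style piecewise reduction acts on a baseline parameter; the dates and reduction values are arbitrary.

    def single_period_modifier(t, r, t_start, t_end):
        """Piecewise reduction r_j(t): equal to r inside (t_start, t_end), zero otherwise."""
        return r if t_start < t < t_end else 0.0

    def modified_beta(t, beta0, modifiers):
        """Apply one or more (r, t_start, t_end) modifiers multiplicatively to beta0."""
        beta = beta0
        for r, t_start, t_end in modifiers:
            beta *= 1 - single_period_modifier(t, r, t_start, t_end)
        return beta

    # A 40% reduction from day 30 to day 60, and an overlapping 20% reduction from day 50 to day 90
    beta_t = [modified_beta(t, beta0=0.5, modifiers=[(0.4, 30, 60), (0.2, 50, 90)]) for t in range(120)]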



    Specifying initial conditions

    This section describes how to specify the values of each model state at the time the simulation starts, and how to make instantaneous changes to state values at other times (e.g., due to importations)

    Overview

    In order for the models specified previously to be dynamically simulated, the user must provide initial conditions, in addition to the model structure and parameter values. Initial conditions describe the value of each variable in the model at the time point that the simulation is to start. For example, on day zero of an outbreak, we may assume that the entire population is susceptible except for one single infected individual. Alternatively, we could assume that some portion of the population already has prior immunity due to vaccination or previous infection. Different initial conditions lead to different model trajectories.

    The initial_conditions section of the configuration file is detailed below. Note that in some cases the seeding section can replace or complement the initial conditions; the table below provides a quick comparison of these sections.

    Feature | initial_conditions | seeding
    ------- | ------------------ | -------
    Config section optional or required? | Optional | Optional
    Function of section | Specify number of individuals in each compartment at time zero | Allow for instantaneous changes in individuals' states
    Default | Entire population in first compartment, zero in all other compartments | No seeding events
    Requires input file? | Yes, .csv | Yes, .csv
    Input description | A list of compartment names, location names, and amounts of individuals in that compartment and location. All compartments must be listed unless a setting to default missing compartments to zero is turned on. | A list of seeding events defined by source compartment, destination compartment, number of individuals transitioning, and date of movement. Compartments without seeding events don't need to be listed.
    Specifies an incidence or prevalence? | Amounts specified are prevalence values | Amounts specified are instantaneous incidence values
    Useful for? | Specifying initial conditions, especially if the simulation does not start with a single infection introduced into a naive population. | Modeling importations, evolution of new strains, and specifying initial conditions

    Specifying model initial conditions

    The configuration items in the initial_conditions section of the config file are

    initial_conditions::method – Must be either "Default", "SetInitialConditions", "SetInitialConditionsFolderDraw", "FromFile", or "FromFileFolderDraw".

    initial_conditions::initial_conditions_file – Required for methods "SetInitialConditions" and "FromFile". Path to a .csv or .parquet file containing the list of initial conditions for each compartment.

    initial_conditions::initial_file_type – Only required for the "FolderDraw" methods. Description TBA

    initial_conditions::allow_missing_subpops – Optional for all methods; determines what will happen if initial_conditions_file is missing values for some subpopulations. If FALSE (the default behavior) or unspecified, an error will occur if subpopulations are missing. If TRUE, then for subpopulations missing from the initial_conditions file, it will be assumed that all individuals begin in the first compartment (the "first" compartment depends on how the model was specified, and will be the compartment that contains the first named category in each compartment group), unless another compartment is designated to hold the rest of the individuals.

    initial_conditions::allow_missing_compartments – Optional for all methods. If FALSE (the default behavior) or unspecified, an error will occur if any compartments are missing for any subpopulation. If TRUE, then it will be assumed there are zero individuals in compartments missing from the initial_conditions file.

    initial_conditions::proportional – If TRUE, assume that the user has specified all input initial conditions as fractions of the population, instead of numbers of individuals (the default behavior, or if set to FALSE). The code will check that initial values in all compartments sum to 1.0 and throw an error if not, and then will multiply all values by the total population size for that subpopulation.

    Details on implementing each initial conditions method and the options that go along with it are below.

    initial_conditions::method

    Default

    The default initial conditions are that the initial value of all compartments for each subpopulation will be zero, except for the first compartment, whose value will be the population size. The “first” compartment depends on how the model was specified, and will be the compartment that contains the first named category in each compartment group.

    For example, a model with the following compartments

    with the accompanying geodata file

    will be started with 1,000 individuals in the S_child_unvaxxed compartment in the "small province" and 10,000 in that compartment in the "large province".

    SetInitialConditions

    With this method, users can specify arbitrary initial conditions in a conveniently formatted input .csv or .parquet file.

    For example, for a model with the following compartments and initial_conditions sections

    with the accompanying geodata file

    where initial_conditions.csv contains

    the model will be started with half of the population of each subpopulation being children and the other half adults, with everyone unvaccinated, and with 5 infections (in the exposed-but-not-yet-infectious class) among the unvaccinated adults in the large province; the remaining 4,995 unvaccinated adults in the large province are susceptible. All other compartments will contain zero individuals initially.

    initial_conditions::initial_conditions_file must contain the following columns:

    • subpop – the name of the subpopulation for which the initial condition is being specified. By default, all subpopulations must be listed in this file, unless the allow_missing_subpops option is set to TRUE.

    • mc_name – the concatenated name of the compartment for which an initial condition is being specified. The order of the compartment groups in the name must be the same as the order in which these groups are defined in the config for the model, e.g., you cannot say unvaccinated_S.

    • amount – the value of the initial condition; either a numeric value or the string "rest".

    For each subpopulation, if there are compartments that are not listed in SetInitialConditions, an error will be thrown unless allow_missing_compartments is set to TRUE, in which case it will be assumed there are zero individuals in them. If the sum of the values of the initial conditions in all compartments in a location does not add up to the total population of that location (specified in the geodata file), an error will be thrown. To allocate all remaining individuals in a subpopulation (the difference between the total population size and those allocated by defined initial conditions) to a single pre-specified compartment, include this compartment in the initial_conditions_file but, instead of a number in the amount column, put the word "rest".
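
    The consistency check described above (compartment totals must match the geodata population unless "rest" absorbs the remainder) can be sketched with pandas as follows. This is an illustration, not the flepiMoP implementation; the file name initial_conditions.csv mirrors the example below, while geodata.csv is a hypothetical name for the geodata file.

        import pandas as pd

        ic = pd.read_csv("initial_conditions.csv", skipinitialspace=True)
        geodata = pd.read_csv("geodata.csv", skipinitialspace=True)
        pop = geodata.set_index("subpop")["population"]

        for subpop, rows in ic.groupby("subpop"):
            numeric = pd.to_numeric(rows["amount"], errors="coerce")
            has_rest = numeric.isna().any()          # rows whose amount is the string "rest"
            total = numeric.fillna(0).sum()
            if has_rest:
                assert total <= pop[subpop], f"{subpop}: allocations exceed population"
            else:
                assert total == pop[subpop], f"{subpop}: allocations do not sum to population"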

    If allow_missing_subpops is FALSE or unspecified, an error will occur if initial conditions for some subpopulations are missing. If TRUE, then for subpopulations missing from the initial_conditions file, it will be assumed that all individuals begin in the first compartment. (The “first” compartment depends on how the model was specified, and will be the compartment that contains the first named category in each compartment group.)

    FromFile

    Similar to "SetInitialConditions", with this method users can specify arbitrary initial conditions in a formatted .csv or .parquet input file. However, the format of the input file is different. The required file format is consistent with the from the compartmental model, so the user could take output from one simulation and use it as input into another simulation with the same model structure ;

    For example, for an input configuration file containing

    with the accompanying geodata file

    where initial_conditions_from_previous.csv contains

    The simulation would be initiated on 2021-06-01 with these values in each compartment (no children vaccinated, only adults in the small province vaccinated, and some past and current infection in both subpopulations).

    initial_conditions::initial_conditions_file must contain the following columns:

    • mc_value_type – in model output files, this is either prevalence or incidence. Only prevalence values are selected to be used as initial conditions, since compartmental models describe the prevalence (number of individuals at any given time) in each compartment. Prevalence is taken to be the value measured instantaneously at the start of the day.

    • mc_name – The name of the compartment for which the value is reported, which is a concatenation of the compartment status in each state type, e.g., "S_adult_unvaxxed", and must be in the same order as these groups are defined in the config for the model, e.g., you cannot say unvaxxed_S_adult.

    • subpop_1, subpop_2, etc. – one column for each different subpopulation, containing the value of the number of individuals in the described compartment in that subpopulation at the given date. Note that these are named after the nodenames defined by the user in the geodata file.

    • date – The calendar date in the simulation, in YYYY-MM-DD format. Only values with a date that matches the simulation start_date will be used.

    SetInitialConditionsFolderDraw, FromFileFolderDraw

    The way that initial conditions are specified with SetInitialConditions and FromFile results in a single value for each compartment and does not easily allow the user to instead specify a distribution (as is possible for compartmental or outcome model parameters). If a user wants to use different possible initial condition values each time the model is run, the way to do this is to instead specify a folder containing a set of files with initial condition values for each simulation that will be run. To do this with files in the format described in initial_conditions::method::SetInitialConditions, use method: SetInitialConditionsFolderDraw. Similarly, to provide a folder of initial condition files in the format described in initial_conditions::method::FromFile, use method: FromFileFolderDraw.

    Each file in the folder needs to be named according to the same naming conventions as the model output files: run_number.runID.file_type.[csv or parquet] where ... [DESCRIBE], as it is now taking the place of the seeding files the model would normally output.

    Only one additional config argument is needed to use a FolderDraw method for initial conditions:

    initial_file_type: either seir or seed

    When using FolderDraw methods, initial_conditions_file should now be the path to the directory that contains the folder with all the initial conditions files. For example, if you are using output from another model run and the files are in an seir folder within a model_output folder, which is itself within your project directory, you would use initial_conditions_file: model_output.


     compartments:
       infection_stage: ["S", "I", "R"]
       age_group: ["child", "adult"]
       vaccination_status: ["unvaxxed", "vaxxed"]
     
     initial_conditions:
       method: default
    subpop,          population
    large_province, 10000
    small_province, 1000
     compartments:
       infection_stage: ["S", "I", "R"]
       age_group: ["child", "adult"]
       vaccination_status: ["unvaxxed", "vaxxed"]
       
    initial_conditions:
        method: SetInitialConditions
        initial_conditions_file: initial_conditions.csv
        allow_missing_subpops: TRUE
        allow_missing_compartments: TRUE
    subpop,          population
    large_province, 10000
    small_province, 1000
    subpop, mc_name, amount
    small_province, S_child_unvaxxed, 500
    small_province, S_adult_unvaxxed, 500
    large_province, S_child_unvaxxed, 5000
    large_province, E_adult_unvaxxed, 5
    large_province, S_adult_unvaxxed, "rest"
    name: test_simulation
    start_date: 2021-06-01
    
     compartments:
       infection_stage: ["S", "I", "R"]
       age_group: ["child", "adult"]
       vaccination_status: ["unvaxxed", "vaxxed"]
       
    initial_conditions:
        method: FromFile
        initial_conditions_file: initial_conditions_from_previous.csv
        allow_missing_compartments: FALSE
        allow_missing_subpops: FALSE
    subpop,          population
    large_province, 10000
    small_province, 1000
    mc_value_type, mc_infection_stage, mc_age, mc_vaccination_status, mc_name, small_province, large_province, date
    ....
    prevalence, S, child, unvaxxed, 400, 900, 2021-06-01
    prevalence, S, child, vaxxed, 0, 0, 2021-06-01
    prevalence, I, child, unvaxxed, 5, 100, 2021-06-01
    prevalence, I, child, vaxxed, 0, 0, 2021-06-01
    prevalence, R, child, unvaxxed, 95, 4000, 2021-06-01
    prevalence, R, child, vaxxed, 0, 0, 2021-06-01
    prevalence, S, adult, unvaxxed, 50, 900, 2021-06-01
    prevalence, S, adult, vaxxed, 400, 0, 2021-06-01
    prevalence, I, adult, unvaxxed, 4, 100, 2021-06-01
    prevalence, I, adult, vaxxed, 1, 0, 2021-06-01
    prevalence, R, adult, unvaxxed, 75, 4000, 2021-06-01
    prevalence, R, adult, vaxxed, 20, 0, 2021-06-01
    ...

    Distributions

    This page describes the configuration schema for specifying distributions

    Distribution | Parameters | Type/Format | Description
    ------------ | ---------- | ----------- | -----------
    fixed | value | Any real number | Draws all values exactly equal to value
    uniform | low | Any real number | Draws all values randomly from a uniform distribution with range [low, high]
     | high | Any real number greater than low | 
    poisson | lam | Any positive real number | Draws all values randomly from a Poisson distribution with rate parameter (mean) lam (lambda)
    binomial | size | Any non-negative integer | Draws all values randomly from a binomial distribution with number of trials (n) = size and probability of success on each trial (p) = prob
     | prob | Any number in [0, 1] | 
    lognormal | meanlog | Any real number | Draws all values randomly from a lognormal distribution (natural log, base e) with mean on a log scale of meanlog and standard deviation on a log scale of sdlog
     | sdlog | Any non-negative real number | 
    truncnorm | mean | Any real number | Draws all values randomly from a truncated normal distribution with mean mean and standard deviation sd, truncated to have a maximum value of a and a minimum value of b
     | sd | Any non-negative real number | 
     | a | Any real number, or -Inf | 
     | b | Any real number greater than a, or Inf | 
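
    For orientation, the following Python snippet sketches how draws matching these parameterizations could be generated with numpy/scipy (this is illustrative, not the code flepiMoP itself uses, and the numeric values are arbitrary). Note that scipy's truncnorm expects the truncation bounds rescaled into standard-deviation units, and that meanlog/sdlog are on the natural-log scale.

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)

        fixed = np.full(5, 2.5)                                   # "fixed": every draw equals value
        uniform = rng.uniform(low=1.0, high=3.0, size=5)          # "uniform" on [low, high]
        poisson = rng.poisson(lam=4.0, size=5)                    # "poisson" with rate lam
        binomial = rng.binomial(n=10, p=0.3, size=5)              # "binomial" with size (n) and prob (p)
        lognormal = rng.lognormal(mean=0.0, sigma=0.5, size=5)    # "lognormal" with meanlog, sdlog

        # "truncnorm": mean, sd, truncated to [a, b]; scipy wants bounds in sd units
        mean, sd, a, b = 1.0, 0.4, 0.0, 2.0
        truncnorm = stats.truncnorm.rvs((a - mean) / sd, (b - mean) / sd,
                                        loc=mean, scale=sd, size=5, random_state=rng)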

    Code structure

    Inference Description

    Methods for fitting model to data

    flepiMoP can be used to conduct forward simulations of a model with user-defined parameter values, or, it can be used to iteratively run a model with some unknown parameters, compare the model output to ground truth data, and find parameter values that optimize the fit of the model to data (i.e., conduct model "inference"). We have developed a custom model inference method that is based on standard Markov Chain Monte Carlo (MCMC)-based approaches to Bayesian inference for dynamic models, but is adapted to deal with some of the particular challenges of large-scale epidemic models, including i) long times and high computational resources required to simulate single model runs, ii) multiple subpopulations with location-specific parameters but inter-location transmission, iii) a high-dimensional parameter space, iv) the need to produce real-time epidemic projections, and v) the availability of parallel computing resources.

    Notation

    • $\Theta$ – A set of unknown model parameters to be estimated by fitting the model output to data. For a model with $i$ subpopulations each with their own parameters, this set includes all location-specific parameters $\Theta_i$.

    • $Z(\Theta)$ – The timeseries output of one or more of the state variables of the model under parameters $\Theta$. For simplicity, we will often just use the notation $Z$. The value at a timepoint $t$ is $Z_t$. For a model with $i$ subpopulations for which there are $j$ different state variables, this becomes $Z_{i,j,t}(\Theta)$. (Note that for the general case when the dynamics in one location can affect the dynamics in another, the model state in one location depends on the full set of parameters, not just the location-specific parameters.)

    • $D_t$ – The timeseries for the observed data (also referred to as "ground truth") that the model attempts to recreate. For a model with $i$ subpopulations each with their own observed data for variable $j$, this becomes $D_{i,j,t}$.

    • $\mathcal{L}(D|\Theta)$ – The likelihood of the observed data $D$ being produced by the model for an input parameter set $\Theta$. This is a probability density function over all possible values of the data being produced by the model, conditional on a fixed model parameter value $\Theta$.

    • $p(\Theta)$ – The prior probability distribution, which in Bayesian inference encodes beliefs about the possible values of the unknown parameter $\Theta$ before any observed data is formally compared to the model.

    • $P(\Theta|D)$ – The posterior probability distribution, which in Bayesian inference describes the updated probability of the parameters $\Theta$ conditional on the observed data $D$.

    • $g(\Theta^*|\Theta)$ – The proposal density, used in Metropolis-Hastings algorithms for Markov Chain Monte Carlo (MCMC) techniques for sampling the posterior distribution, which describes the probability of proposing a new parameter set $\Theta^*$ from a current accepted parameter set $\Theta$.

    Background

    This section can be skipped by those familiar with Markov Chain Monte Carlo approaches to Bayesian inference.

    Bayesian inference

    Our model fitting framework is based on the principles of Bayesian inference. Instead of estimating a single "best-fit" value of the unknown model parameters, our goal is to evaluate the consistency of every possible parameter value with the observed data, or in other words, to construct a distribution that describes the probability that a parameter set $\Theta$ has a certain value given the observations $D$. This output is referred to as the posterior probability $P(\Theta|D)$. This framework assumes that the model structure accurately describes the underlying generative process which created the data, but that the underlying parameters are unknown and that there can be some error in the observation of the data.

    Bayes' Rule states that the posterior probability $P(\Theta|D)$ of a set of model parameters $\Theta$ given observed data $D$ can be expressed as a function of the likelihood of observing the data under the model with those parameters ($\mathcal{L}(D|\Theta)$) and the prior probability ascribed to those parameters before any data was observed ($p(\Theta)$)

    $$P(\Theta|D) = \frac{\mathcal{L}(D|\Theta)\,p(\Theta)}{P(D)}$$

    where the denominator $P(D) = \int_\Theta \mathcal{L}(D|\Theta)\,p(\Theta)\, d\Theta$ is a constant factor – independent of $\Theta$ – that only serves to normalize the posterior and thus can be ignored.

    The likelihood function can be defined for a model/data combination based on an understanding of both a) the distribution of model outcomes for a given set of input parameters (if output is stochastic), and b) the nature of the measurement error in observing the data (if relevant).

    For complex models with many parameters like those used to simulate epidemic spread, it is generally impossible to construct the full posterior distribution either analytically or numerically. Instead, we rely on a class of methods called "Markov Chain Monte Carlo" (MCMC) that allows us to draw a random sample of parameters from the posterior distribution. Ideally, the statistics of the parameters drawn from this sample should be an unbiased estimate of those from the complete posterior.

    Markov Chain Monte Carlo methods

    In many Bayesian inference problems that arise in scientific model fitting, it is impossible to directly evaluate the full posterior distribution, since there are many parameters to be inferred (high dimensionality) and it is computationally costly to evaluate the model at any individual parameter set. Instead, it is common to employ Markov Chain Monte Carlo (MCMC) methods, which provide a way to iteratively construct a sequence of values that when taken together represent a sample from a desired probability distribution. In the limit of infinitely long sequences ("chains") of values, these methods are mathematically proven to converge to an unbiased sample from the distribution. There are many different MCMC algorithms, but each of them relies on some type of rule for generating a new "sampled" parameter set from an existing one. Our parameter inference method is based on the popular Metropolis-Hastings algorithm. Briefly, at every step of this iterative algorithm, a new set of parameters is jointly proposed, the model is evaluated at that proposed set, the value of the posterior (e.g., likelihood and prior) is evaluated at the proposed set, and if the posterior is improved compared to the previous step, the proposed parameters are "accepted" and become the next entry in the sequence, whereas if the value of the posterior is decreased, the proposed parameters are only accepted with some probability and otherwise rejected (in which case the next entry in the sequences becomes a repeat of the previous parameter set).

    The full algorithm for Metropolis-Hastings Markov Chain Monte Carlo is:

    • Generate an initial set of parameters $\Theta_0$

    • Evaluate the likelihood ($\mathcal{L}(D|\Theta)$) and prior ($p(\Theta)$) at this parameter set

    • For $k = 1 \cdots K$, where $K$ is the length of the MCMC chain, add to the sequence of parameter values $\Theta$:

      • Generate a proposed set of parameters $\Theta^*$ based on an arbitrary proposal distribution $g(\Theta^*|\Theta_{k-1})$

      • Evaluate the likelihood and prior at the proposed parameter set

      • Generate a uniform random number $u \sim \mathcal{U}[0,1]$

      • Calculate the acceptance ratio

        $$\alpha=\frac{\mathcal{L}(D|\Theta^*)\, p(\Theta^*)\, g(\Theta_{k-1}|\Theta^*)}{\mathcal{L}(D|\Theta_{k-1})\, p(\Theta_{k-1})\, g(\Theta^*|\Theta_{k-1})}$$

      • If $\alpha > u$, ACCEPT the proposed parameters to the parameter chain. Set $\Theta_k=\Theta^*$.

      • Else, REJECT the proposed parameters. Set $\Theta_k = \Theta_{k-1}$.
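
    A minimal, generic Metropolis-Hastings sketch in Python (not the flepiMoP implementation), assuming a user-supplied log-likelihood, log-prior, and a symmetric proposal so the g terms cancel; the toy target at the bottom is only for demonstration.

        import numpy as np

        def metropolis_hastings(log_likelihood, log_prior, theta0, propose, n_steps, seed=0):
            """Generic Metropolis-Hastings with a symmetric proposal (so g terms cancel)."""
            rng = np.random.default_rng(seed)
            chain = [np.asarray(theta0, dtype=float)]
            log_post = log_likelihood(chain[0]) + log_prior(chain[0])
            for _ in range(n_steps):
                proposal = propose(chain[-1], rng)
                log_post_new = log_likelihood(proposal) + log_prior(proposal)
                # Accept with probability min(1, exp(new - old)); otherwise repeat the previous value
                if np.log(rng.uniform()) < log_post_new - log_post:
                    chain.append(proposal)
                    log_post = log_post_new
                else:
                    chain.append(chain[-1])
            return np.array(chain)

        # Toy example: posterior of a normal mean with a flat prior
        data = np.random.default_rng(1).normal(3.0, 1.0, size=50)
        chain = metropolis_hastings(
            log_likelihood=lambda th: -0.5 * np.sum((data - th[0]) ** 2),
            log_prior=lambda th: 0.0,
            theta0=[0.0],
            propose=lambda th, rng: th + rng.normal(0, 0.3, size=th.shape),
            n_steps=5000,
        )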

    Inference algorithm

    Likelihood

    In our algorithm, model fitting involves comparing timeseries of variables produced by the model (either transmission model state variables or observable outcomes constructed from those variables) to timeseries of observed "ground truth" data with the same time points. For timeseries data that arises from a deterministic, dynamic model, the overall likelihood can be calculated as the product of the likelihood of the model output at each timepoint (since we assume the data at each timepoint was measured independently). If there are multiple observed datastreams corresponding to multiple model outputs (e.g., cases and deaths), the overall likelihood also includes the product over datastreams.

    For each subpopulation $i$ in the model, the likelihood of observing the "ground truth" data given the model parameters is

    $$\mathcal{L}_i(D_i|\Theta) = \mathcal{L}(D_i|Z_i(\Theta)) = \prod_j \prod_t p^{\text{obs}}_j(D_{i,j,t}|Z_{i,j,t}(\Theta)),$$

    where $p^{\text{obs}}(D|Z)$ describes the process by which the data is assumed to be observed/measured from the underlying true values. For example, observations may be assumed to be normally distributed around the truth with a known variance, or count data may be assumed to be generated by a Poisson process.

    And the overall likelihood, taking into account all subpopulations, is the product of the individual likelihoods

    $$\mathcal{L}(D|\Theta) = \mathcal{L}(D|Z(\Theta)) = \prod_i \mathcal{L}_i(D_i|Z_i(\Theta)).$$

    Note that the likelihood for each subpopulation depends not only on the parameter values $\Theta_i$ that act within that subpopulation, but on the entire parameter set $\Theta$, since in general the infection dynamics in one subpopulation are also affected by those in each other region. Also note that we assume that the parameters $\Theta$ only impact the likelihood through the single model output timeseries $Z_t$. While this is exactly true for a deterministic model, we make the simplifying assumption that it is also true for stochastic models, instead of attempting to calculate the full distribution of possible trajectories for a given parameter set and include that in the likelihood as well.
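
    A small Python sketch of this likelihood structure (illustrative only, not flepiMoP's implementation), assuming an independent Poisson observation model at each timepoint and summing log-likelihoods over subpopulations, variables, and times; the data and model values are made up.

        import numpy as np
        from scipy import stats

        def total_log_likelihood(data, model_output):
            """data[i][j] and model_output[i][j] are timeseries for subpopulation i, variable j."""
            logl = 0.0
            for d_i, z_i in zip(data, model_output):
                for d_ij, z_ij in zip(d_i, z_i):
                    # Poisson observation model: each observed point is Poisson around the model value
                    logl += stats.poisson.logpmf(np.asarray(d_ij), np.asarray(z_ij) + 1e-9).sum()
            return logl

        # Two subpopulations, one observed variable (e.g., daily cases)
        model_output = [[np.array([10.0, 12.0, 15.0])], [np.array([3.0, 4.0, 6.0])]]
        data = [[np.array([8, 14, 15])], [np.array([2, 5, 7])]]
        print(total_log_likelihood(data, model_output))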

    Fitting algorithm

    The method we use for estimating model parameters is based on the Metropolis-Hastings algorithm, which is a class of Markov Chain Monte Carlo (MCMC) methods for obtaining samples from a posterior probability distribution. We developed a custom version of this algorithm to deal with some of the particular mathematical properties and computational challenges of fitting large disease transmission models.

    There are two major unique features of our adapted algorithm:

    • Parallelization – Generally, MCMC methods start from a single initial parameter set and generate an extremely long sequence of parameter samples, such that the Markov process is acceptably close to a stationary state where it represents an unbiased sample from the posterior. Instead, we simulate multiple shorter chains in parallel, starting from different initial conditions, and pool the results. Due to the computational time required to simulate the epidemic model, and the timescale on which forecasts of epidemic trajectories are often needed (~weeks), it is not possible to sequentially simulate the model millions of times. However, modern supercomputers allow massively parallel computation. The hope of this algorithm is that the parallel chains sample different subspaces of the posterior distribution, and together represent a reasonable sample from the full posterior. To maximize the chance of at least local stationarity of these subsamples, we pool only the final values of each of the parallel chains.

    • Multi-level – Our pipeline, and the fitting algorithm in particular, were designed to be able to simulate disease dynamics in a collection of linked subpopulations. This population structure creates challenges for model fitting. We want the model to be able to recreate the dynamics in each subpopulation, not just the overall summed dynamics. Each subpopulation has unique parameters, but due to the coupling between them, the model outcomes in one subpopulation also depend on the parameter values in other subpopulations. For some subpopulations, this coupling may effectively be weak and have little impact on dynamics, but for others, spillover from another closely connected subpopulation may be the primary driver of the local dynamics. Thus, the model cannot be separately fit to each subpopulation, but must consider the combined likelihood. However, such an algorithm may be very slow to find parameters that optimize fits in all locations simultaneously, and may be predominantly drawn to fitting to the largest/most connected subpopulations. To avoid these issues, we simultaneously generate two communicating parameter chains: a "chimeric" chain that allows the parameters for each subpopulation to evolve quasi-independently based on local fit quality, and a "global" chain that evolves only based on the overall fit quality (for all subpopulations combined).

    Note that while the traditional Metropolis-Hastings algorithm for MCMC will provably converge to a stationary distribution where the sequence of parameters represents a sample from the posterior distribution, no such claim has been mathematically proven for our method.

  • For $m = 1 \dots M$, where $M$ is the number of parallel MCMC chains (also known as slots)

    • Generate initial state

      • Generate an initial set of parameters $\Theta_{m,0}$, and copy this to both the global ($\Theta^G_{m,0}$) and chimeric ($\Theta^C_{m,0}$) parameter chains (sequences)

      • Generate an initial epidemic trajectory $Z(\Theta_{m,0})$

      • Calculate and record the initial likelihood for each subpopulation, $\mathcal{L}_i(D_i|Z_i(\Theta_{m,0}))$

    • For $k = 1 \dots K$, where $K$ is the length of the MCMC chain, add to the sequence of parameter values:

      • Generate a proposed set of parameters $\Theta^*$ from the current chimeric parameters using the proposal distribution $g(\Theta^*|\Theta^C_{m,k-1})$

      • Generate an epidemic trajectory with these proposed parameters, $Z(\Theta^*)$

      • Calculate the likelihood of the data given the proposed parameters for each subpopulation, $\mathcal{L}_i(D_i|Z_i(\Theta^*))$

      • Calculate the overall likelihood with the proposed parameters, $\mathcal{L}(D|Z(\Theta^*))$

      • Make the "global" decision about the proposed parameters

        • Generate a uniform random number $u \sim \mathcal{U}[0,1]$

        • Calculate the overall likelihood with the current global parameters, $\mathcal{L}(D|Z(\Theta^G_{m,k-1}))$

        • Calculate the acceptance ratio

      • End making global decision

    • End for $K$ iterations of each MCMC chain

  • End for $M$ parallel MCMC chains

  • Collect the final global parameter values for each parallel chain, $\theta_m = \{\Theta^G_{m,K}\}_m$

    We consider the sequence $\{\theta_m\}$ to represent a sample from the posterior probability distribution, and use it to calculate statistics about the inferred parameter values and the epidemic trajectories resulting from them (e.g., mean, median, 95% credible intervals).

    There are a few important notes/limitations about our method currently:

    • All parameters to be fit must be location-specific. There is currently no way to fit a parameter that has the identical value across all locations.

    • The pipeline currently does not allow for fitting of the basic parameters of the compartmental epidemic model. Instead, these must be fixed, and the value of location-specific "interventions" acting to increase/reduce these parameters can be fit. All parameters related to the observational/outcomes model can be fit, as well as "interventions" acting to increase or reduce them.

    • At no point is the parameter fitting optimizing the fit of the summed total population data to total population model predictions. The "overall" likelihood function used to make "global" parameter acceptance decisions is the product of the individual subpopulations' likelihoods (which are based on comparing location-specific data to location-specific model output), which is not equivalent to the likelihood for the total population. For example, if overestimates of the model in some subpopulations were exactly balanced by underestimates in others, the total population estimate could be very accurate and the total population likelihood high, but the overall likelihood we use here would still be low.

    • There is no model simulation run or record that corresponds to the combined parameters recorded in the chimeric parameter chain ($\Theta^C_{m}$). For entry $m$ in the chain, some of these parameter values were recently accepted from the last proposal and were used in the simulation produced by that proposal, while for other subpopulations, the most recent proposed parameters were rejected, so $\Theta^C_{m}$ contains parameters accepted – and part of the simulations produced – in a previous iteration.

    • It is currently not possible to infer parameters of the measurement process encoded in the likelihood function. For example, if the likelihood is chosen to be a normal distribution, which implies an assumption that the observed data is generated from the underlying truth according to a normal distribution with mean zero, then the standard deviation must be specified, and cannot be inferred along with the other model parameters.

    • There is an option to use a slightly different version of our algorithm, in which globally accepted parameter values are not pushed back into the chimeric chain, but the chimeric chain is instead allowed to continue to evolve independently. In this variation, the chimeric acceptance decision is always made, not only if a global rejection happens.

    • The proposal distribution $g(\Theta^*|\Theta)$ for generating new parameter sets is currently constrained to be a joint distribution in which the value of each new proposed parameter is chosen independently of any other parameters.

    • While in general in Metropolis-Hastings algorithms the formula for the acceptance ratio includes the proposal distribution $g(\Theta^*|\Theta)$, those terms cancel out if the proposal distribution is symmetric. Our algorithm assumes such symmetry and thus does not include $g$ in the formula, so the user must be careful to only select symmetric proposal distributions.

    Hierarchical parameters

    The baseline likelihood function used in the fitting algorithm described above allows for parameter values to differ arbitrarily between different subpopulations. However, it may be desired to instead impose constraints on the best-fit parameters, such that subpopulations that are similar in some way, or belong to some pre-defined group, have parameters that are close to one another. Formally, this is typically done with group-level or hierarchical models that fit meta-parameters from which individual subpopulation parameters are assumed to be drawn. Here, we instead impose this group-level structure by adding an additional term to the likelihood that describes the probability that the set of parameters proposed for a group of subpopulations comes from a normal distribution. This term of the likelihood will be larger when the variance of this parameter set is smaller. Formally,

    $$\mathcal{L}(D|\Theta) \rightarrow \prod_i \mathcal{L}_i(D_i|Z_i(\Theta)) \cdot \prod_g \prod_{i \in g} \prod_l \phi(\Theta_{l,i}; \mu_{l,g}, \sigma_{l,g})$$

    where $g$ is a group of subpopulations, $l$ is one of the parameters in the set $\Theta$, $\phi(x;\mu,\sigma)$ is the probability density function of the normal distribution, and $\mu_{l,g}$ and $\sigma_{l,g}$ are the mean and standard deviation of all values of the parameter $\Theta_l$ in the group $g$. There is also the option to use a logit-normal distribution instead of a standard normal, which may be more appropriate if the parameter is a proportion bounded in [0,1].
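
    A small Python sketch of this added group-level term (illustrative only, not the flepiMoP implementation), assuming a normal density on each subpopulation's value of a given parameter around its group mean and standard deviation; a logit-normal variant would simply transform the values first.

        import numpy as np
        from scipy import stats

        def hierarchical_log_penalty(param_values_by_group):
            """Sum of normal log-densities of each subpopulation's parameter around its group mean/sd."""
            total = 0.0
            for values in param_values_by_group.values():
                values = np.asarray(values, dtype=float)
                mu, sigma = values.mean(), values.std(ddof=0) + 1e-9
                total += stats.norm.logpdf(values, loc=mu, scale=sigma).sum()
            return total

        # Example: one parameter, two hypothetical groups of subpopulations
        penalty = hierarchical_log_penalty({
            "group_a": [0.41, 0.43, 0.40],
            "group_b": [0.55, 0.60, 0.52, 0.58],
        })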

    Generate a proposed set of parameters Θ∗\Theta^*Θ∗ based on an arbitrary proposal distribution g(Θ∗∣Θk−1)g(\Theta^*|\Theta_{k-1})g(Θ∗∣Θk−1​)

  • Evaluate the likelihood and prior at the proposed parameter set

  • Generate a uniform random number u∼U[0,1]u \sim \mathcal{U}[0,1]u∼U[0,1]

  • Calculate the acceptance ratio α=L(D∣Θ∗)p(Θ∗)g(Θk−1∣Θ∗)L(D∣Θk−1))p(Θk−1)g(Θ∗∣Θk−1)\alpha=\frac{\mathcal{L}(D|\Theta^*) p(\Theta^*) g(\Theta_{k-1}|\Theta^*)}{\mathcal{L}(D|\Theta_{k-1})) p(\Theta_{k-1}) g(\Theta^*|\Theta_{k-1})}α=L(D∣Θk−1​))p(Θk−1​)g(Θ∗∣Θk−1​)L(D∣Θ∗)p(Θ∗)g(Θk−1​∣Θ∗)​

  • If α>u\alpha> uα>u, ACCEPT the proposed parameters to the parameter chain. Set Θk=Θ∗\Theta_k=\Theta^*Θk​=Θ∗ ;

  • Else, REJECT the proposed parameters for the chimeric parameter chain. Set Θk=Θk−1\Theta_k = \Theta_{k-1}Θk​=Θk−1​

  • simultaneously generate two communicating parameter chains
    : a "chimeric" chain that allows the parameters for each subpopulation to evolve quasi-independently based on local fit quality, and a "global" chain that evolves only based on the overall fit quality (for all subpopulations combined).
    ) and chimeric (
    ) parameter chain (sequence ;
  • Generate an initial epidemic trajectory Z(Θm,0)Z(\Theta_{m,0})Z(Θm,0​)

  • Calculate and record the initial likelihood for each subpopulation, $$\mathcal{L_i}(D_i|Z_i(\Theta_{m,0}))$ ;

  • For k=1...Kk= 1 ... Kk=1...K where KKK is the length of the MCMC chain, add to the sequence of parameter values :

    • Generate a proposed set of parameters Θ∗\Theta^*Θ∗from the current chimeric parameters using the proposal distribution $$g(\Theta^*|\Theta^C_{m,k-1})$ ;

    • Generate an epidemic trajectory with these proposed parameters, Z(Θ∗)Z(\Theta^*)Z(Θ∗)

    • Calculate the likelihood of the data given the proposed parameters for each subpopulation,

    • Calculate the overall likelihood with the proposed parameters,

    • Make "global" decision about proposed parameters

      • Generate a uniform random number

      • Calculate the overall likelihood with the current global parameters,

      • Calculate the acceptance ratio

    • End making global decision

  • End for KKK iterations of each MCMC chain

  • End for MMM parallel MCMC chains

  • Collect the final global parameter values for each parallel chain θm={Θm,KG}m\theta_m = \{\Theta^G_{m,K}\}_mθm​={Θm,KG​}m​

• There is no model simulation run or record that corresponds to the combined parameters recorded in the chimeric parameter chain ($\Theta^C_{m}$). For entry $m$ in the chain, some of these parameter values were recently accepted from the last proposal and were used in the simulation produced by that proposal, while for other subpopulations the most recent proposed parameters were rejected, so $\Theta^C_{m}$ contains parameters accepted – and part of the simulations produced – in a previous iteration.

• It is currently not possible to infer parameters of the measurement process encoded in the likelihood function. For example, if the likelihood is chosen to be a normal distribution, which implies an assumption that the observed data is generated from the underlying truth according to a normal distribution with mean zero, then the standard deviation must be specified and cannot be inferred along with the other model parameters.

• There is an option to use a slightly different version of our algorithm, in which globally accepted parameter values are not pushed back into the chimeric likelihood, but the chimeric likelihood is instead allowed to continue to evolve independently. In this variation, the chimeric acceptance decision is always made, not only when a global rejection happens.

• The proposal distribution $g(\Theta^*|\Theta)$ for generating new parameter sets is currently constrained to be a joint distribution in which the value of each new proposed parameter is chosen independently of any other parameters.

• While in general in Metropolis-Hastings algorithms the formula for the acceptance ratio includes the proposal distribution $g(\Theta^*|\Theta)$, those terms cancel out if the proposal distribution is symmetric. Our algorithm assumes such symmetry and thus does not include $g$ in the formula, so the user must be careful to only select symmetric distributions.

For reference, the posterior distribution targeted by this algorithm follows from Bayes' rule,

$$P(\Theta|D) = \frac{\mathcal{L}(D|\Theta)\,p(\Theta)}{P(D)}, \qquad P(D) = \int_\Theta \mathcal{L}(D|\Theta)\,p(\Theta)\, d\Theta$$

where the likelihood of the data for subpopulation $i$ is

$$\mathcal{L}_i(D_i|\Theta) = \mathcal{L}(D_i|Z_i(\Theta)) = \prod_j \prod_t p^{\text{obs}}_j(D_{i,j,t}|Z_{i,j,t}(\Theta))$$

and the overall likelihood is the product over all subpopulations,

$$\mathcal{L}(D|\Theta) = \mathcal{L}(D|Z(\Theta)) = \prod_i \mathcal{L}_i(D_i|Z_i(\Theta)).$$

Figure: Diagram of the custom multi-level MCMC method used for parameter inference in flepiMoP. Each square represents a single subpopulation, which has a set of associated parameter values. The diagram is for a single MCMC chain; multiple parallel chains are typically run and combined to form a posterior distribution of parameter values.
• If $\alpha^G > u^G$: ACCEPT the proposed parameters to the global and chimeric parameter chains

  • Set $\Theta^G_{m,k} = \Theta^*$

  • Set $\Theta^C_{m,k} = \Theta^*$

  • Update the recorded subpopulation-specific likelihood values (chimeric and global) with the likelihoods calculated using the proposed parameters

• Else: REJECT the proposed parameters for the global chain and make subpopulation-specific decisions for the chimeric chain

  • Set $\Theta^G_{m,k} = \Theta^G_{m,k-1}$

  • Make "chimeric" decision:

    • For each subpopulation $i = 1 \dots N$:

      • Generate a uniform random number $u_i^C \sim \mathcal{U}[0,1]$

      • Calculate the acceptance ratio $\alpha_i^C=\frac{\mathcal{L}_i(D_i|Z_i(\Theta^*))\, p(\Theta^*)}{\mathcal{L}_i(D_i|Z_i(\Theta^C_{m,k-1}))\, p(\Theta^C_{m,k-1})}$

      • If $\alpha_i^C > u_i^C$: ACCEPT the proposed parameters to the chimeric parameter chain for this location

        • Set $\Theta^C_{m,k,i} = \Theta^*_{i}$

        • Update the recorded chimeric likelihood value for subpopulation $i$ to that calculated with the proposed parameters

      • Else: REJECT the proposed parameters for the chimeric parameter chain for this location

        • Set $\Theta^C_{m,k,i} = \Theta^C_{m,k-1,i}$

      • End if

    • End for $N$ subpopulations

  • End making chimeric decisions

• End if
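To make the accept/reject logic above concrete, the sketch below implements one iteration of the global and chimeric decisions in Python. It is a minimal illustration only: the function names, the normal observation likelihood with a fixed standard deviation, and the flat prior are assumptions made for the example, not flepiMoP's actual implementation.

```python
import numpy as np
from scipy import stats

def subpop_loglik(data_i, simulated_i, sd=10.0):
    # Illustrative observation model: data ~ Normal(simulated, sd), combined over time points
    return stats.norm.logpdf(data_i, loc=simulated_i, scale=sd).sum()

def mcmc_step(theta_global, theta_chimeric, ll_global, ll_chimeric, data, simulate, propose, rng):
    """One iteration of the global/chimeric update for N subpopulations.

    theta_*: arrays of shape (N,), one parameter value per subpopulation.
    ll_global, ll_chimeric: recorded per-subpopulation log-likelihoods from earlier iterations.
    simulate(theta) -> list of simulated trajectories, one per subpopulation (hypothetical helper).
    propose(theta) -> proposed parameter array (a symmetric proposal is assumed).
    """
    theta_star = propose(theta_chimeric)          # propose from the chimeric state
    sim_star = simulate(theta_star)
    ll_star = np.array([subpop_loglik(d, s) for d, s in zip(data, sim_star)])

    # Global decision: overall (summed) log-likelihood; a flat prior is assumed here
    if np.log(rng.uniform()) < min(0.0, ll_star.sum() - ll_global.sum()):
        # Global acceptance: push the proposal and its likelihoods into both chains
        return theta_star, theta_star.copy(), ll_star, ll_star.copy()

    # Global rejection: make independent per-subpopulation chimeric decisions
    accept_i = np.log(rng.uniform(size=len(theta_star))) < (ll_star - ll_chimeric)
    new_theta_chimeric = np.where(accept_i, theta_star, theta_chimeric)
    new_ll_chimeric = np.where(accept_i, ll_star, ll_chimeric)
    return theta_global, new_theta_chimeric, ll_global, new_ll_chimeric
```

Note that, as described in the caveats above, the chimeric chain is updated from recorded per-subpopulation likelihoods; no simulation is run for the combined chimeric parameter set.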

    Quick Start Guide

Quick instructions on how to install the prerequisites, install flepiMoP itself, and then run through a quick example of how to use flepiMoP.

flepiMoP is a flexible pipeline for modeling epidemics. It has functionality for simulating epidemics as well as doing inference for simulation parameters and post-processing of simulation/inference outputs. It is written in a combination of Python and R and uses conda (Anaconda) to manage installations, which allows flepiMoP to enforce version constraints across both languages.

    Prerequisites

    flepiMoP requires the following:

• git, and

• conda.

If you do not have git installed you can go to the downloads page to find the appropriate installation for your system. It's also recommended, but not required, to have a GitHub account. If you're totally new to git and GitHub, GitHub has a very nice introduction to the basics of git that is worth reading before continuing.

If you do not have conda installed you can go to the downloads page to find the appropriate installation for your system. We would recommend selecting the Anaconda Distribution installer of conda.

    Installing flepiMoP

    Navigate to the parent location of where you would like to install flepiMoP, a subdirectory called flepiMoP will be created there. For example, if you navigate to ~/Desktop then flepiMoP will be installed to ~/Desktop/flepiMoP.

This installation script is currently only designed for Linux/MacOS operating systems or Linux shells for Windows. If you need a Windows-native installation please reach out for assistance.

This installation script will guide you through a series of prompts to determine how and where to install flepiMoP. Loosely, this script:

    1. Determines what directory flepiMoP is being installed into,

    2. Optionally gets a clone of flepiMoP if it is not present at the install location,

    3. Creates a conda environment to house the installation,

4. Installs flepiMoP's dependencies and custom packages to this conda environment, and

5. Finally prints out a summary of the installation with helpful debugging information.

For more help on how to use the installation script you can run ./flepimop-install -h to get help information. For first-time users, accepting the default prompts will be the best choice (as shown below):

    Once the prompts are done the installer will output information about the installations that it is doing. After the installation has completed you should see an installation summary similar to:

    This summary gives a brief overview of the R/python/package versions installed. If you encounter any issues with your installation please include this information with your issue report.

    Activating A flepiMoP Installation

    To activate flepiMoP you need to activate the conda environment that it is installed to with:

Replace flepimop-env with the appropriate conda environment name if you decided on a non-default conda environment. Once you do this you should have the flepimop CLI available to you with:

    Defining Environment Variables (Optional)

    If you choose not to define environment variables, remember to use the full or relative path names for navigating to the right directories and provide appropriate flepi/project path arguments in future steps.

    flepiMoP frequently uses two environment variables to refer to specific directories both as a default for CLI arguments and throughout the documentation:

    1. FLEPI_PATH: Refers to the directory where flepiMoP is installed to, and

2. PROJECT_PATH: Refers to the directory where flepiMoP is being run from.

    Furthermore, you'll likely be navigating between these directories frequently in production usage so having these environment variables set can save some typing.

Continuing with the same paths from the installation example, flepiMoP was installed to /Users/example/Desktop/flepiMoP. On Linux/MacOS or in Linux shells on Windows, setting an environment variable can be done by:

    Where /your/path/to is the directory containing flepiMoP. If you have already navigated to your flepiMoP directory you can just do:

    You can check that the variables have been set by either typing env to see all defined environment variables, or typing echo $FLEPI_PATH and echo $PROJECT_PATH to see the values of FLEPI_PATH and PROJECT_PATH.

    However, if you're on Windows:

    Where /your/path/to is the directory containing flepiMoP. If you have already navigated to your flepiMoP directory you can just do:

You can check that the variables have been set by either typing set to see all defined environment variables, or typing echo %FLEPI_PATH% and echo %PROJECT_PATH% to see the values of FLEPI_PATH and PROJECT_PATH.

For more information on the usage of environment variables with flepiMoP please refer to the Environment Variables documentation.

    Run flepiMoP

    Now that flepiMoP has been successfully installed on your system you will be able to use the tool to model epidemics.

    First, navigate to the PROJECT_PATH folder and make sure to delete any old model output files that are there:

    Non-Inference Run

Stay in the PROJECT_PATH folder, and run a simulation directly from the forward-simulation Python package gempyor. Call flepimop simulate providing the name of the configuration file you want to run. For example here, we use config_sample_2pop.yml.

    This will produce a model_output folder, which you can look at using provided post-processing functions and scripts.

    We recommend using model_output_notebook.Rmd as a starting point to interact with your model outputs. First, modify the YAML preamble in the notebook (make sure the configuration file listed matches the one used in simulation), then knit this markdown. This will produce plots of the prevalence of infection states over time. You can edit this markdown to produce any figures you'd like to explore your model output.

    For your first flepiMoP run, it's better to run each command individually as described above to be sure each exits successfully. However, eventually you can put all these steps together in a script, seen below:

    Note that you only have to re-run the installation steps once each time you update any of the files in the flepimop repository (either by pulling changes made by the developers and stored on Github, or by changing them yourself). If you're just running the same or different configuration file, just repeat the final steps:

    Inference Run

    An inference run requires a configuration file that has the inference section. Stay in the $PROJECT_PATH folder, and run the inference script, providing the name of the configuration file you want to run. For example here, we use config_sample_2pop_inference.yml.

    This will run the model and create a lot of output files in $PROJECT_PATH/model_output/.

    The last few lines visible on the command prompt should be:

    [[1]]

    [[1]][[1]]

    [[1]][[1]][[1]]

    NULL

    If you want to quickly do runs with options different from those encoded in the configuration file, you can do that from the command line, for example

    where:

    • n is the number of parallel inference slots,

• j is the number of CPU cores to use on your machine (if j > n, only n cores will actually be used; if j < n, some cores will run multiple slots in sequence), and

• k is the number of iterations per slot.

Again, it is helpful to run the model output notebook (model_output_notebook.Rmd) to explore your model outputs. Knitting this file for an inference run will also provide an analysis of your fits: the acceptance probabilities, likelihoods over time, and the fits against the provided ground truth.

    For your first flepiMoP inference run, it's better to run each command individually as described above to be sure each exits successfully. However, eventually you can put all these steps together in a script, seen below:

    Note that you only have to re-run the installation steps once each time you update any of the files in the flepimop repository (either by pulling changes made by the developers and stored on Github, or by changing them yourself). If you're just running the same or different configuration file, just repeat the final steps

    Examining Model Output

If your run is successful, you should see your output files in the model_output folder. The structure of the files in this folder is described in the Model Output section. By default, all the output files are in .parquet format (a compressed format which can be imported as dataframes using R's arrow package, arrow::read_parquet, or using the free desktop application Tad for quick viewing). However, you can add the option --write-csv to the end of the commands to run the code (e.g., flepimop simulate --write-csv config.yml) to have everything saved as .csv files instead.
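As a complement to the R and Tad options above, the parquet output files can also be inspected directly in Python. The sketch below is a minimal example; the file path is a hypothetical placeholder, since actual file names depend on your configuration name, run id, and slot number.

```python
import pandas as pd

# Hypothetical path to one output file; substitute a file that your run actually produced.
path = "model_output/your_config_name/your_run_id/seir/000000001.000000001.global.final.seir.parquet"

df = pd.read_parquet(path)  # requires the pyarrow (or fastparquet) package
print(df.head())            # inspect the columns, e.g., dates and compartment values by subpopulation
```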

    Updating flepiMoP

    You can use the flepimop-install script provided by the flepiMoP repository to update your install of flepiMoP with:

    Or to reinstall flepiMoP from scratch (say if your conda environment is very out of date or in a bad state) you can do so with:

    Next Steps

These configs and notebooks should be a good starting point for getting started with flepiMoP. To explore other running options, see How to run: Advanced.

    $ curl -LsSf -o flepimop-install "https://raw.githubusercontent.com/HopkinsIDD/flepiMoP/refs/heads/main/bin/flepimop-install"
    $ chmod +x flepimop-install
    $ ./flepimop-install
    An explicit $USERDIR was not provided, please set one (or press enter to use '/Users/example/Desktop'):
    Using '/Users/example/Desktop' for $USERDIR.
    An explicit $FLEPI_PATH was not provided, please set one (or press enter to use '/Users/example/Desktop/flepiMoP'):
    Using '/Users/example/Desktop/flepiMoP' for $FLEPI_PATH.
    Did not find flepiMoP at '/Users/example/Desktop/flepiMoP', do you want to clone the repo? (y/n) y
    Cloning on your behalf.
    Cloning into '/Users/example/Desktop/flepiMoP'...
    remote: Enumerating objects: 28513, done.
    remote: Counting objects: 100% (3424/3424), done.
    remote: Compressing objects: 100% (845/845), done.
    remote: Total 28513 (delta 2899), reused 2786 (delta 2576), pack-reused 25089 (from 2)
    Receiving objects: 100% (28513/28513), 145.99 MiB | 26.32 MiB/s, done.
    Resolving deltas: 100% (14831/14831), done.
    An explicit $FLEPI_CONDA was not provided, please set one (or press enter to use 'flepimop-env'):
    Using 'flepimop-env' name for $FLEPI_CONDA.
    ...
    flepiMoP installation summary:
    > flepiMoP version: ec707d36cd9f8675466c05cbaba295cc4f4a7112
    > flepiMoP path: /Users/example/Desktop/flepiMoP
    > flepiMoP conda env: flepimop-env
    > conda: 24.9.2
    > R 4.3.3: /opt/anaconda3/envs/flepimop-env/bin/R
    > Python 3.11.12: /opt/anaconda3/envs/flepimop-env/bin/python
    > gempyor version: 2.1
    > R flepicommon version: 0.0.1
    > R flepiconfig version: 3.0.0
    > R inference version: 0.0.1
    
    To activate the flepimop conda environment, run:
        conda activate flepimop-env
    $ conda activate flepimop-env
    $ flepimop --help
    Usage: flepimop [OPTIONS] COMMAND [ARGS]...
    
      Flexible Epidemic Modeling Platform (FlepiMoP) Command Line Interface
    
    Options:
      --help  Show this message and exit.
    
    Commands:
      batch-calibrate  Submit a calibration job to a batch system.
      compartments     Add commands for working with FlepiMoP compartments.
      modifiers
      patch            Merge configuration files.
      simulate         Forward simulate a model using gempyor.
      sync             Sync flepimop files between local and remote locations.
    export FLEPI_PATH=/Users/example/Desktop/flepiMoP
    export PROJECT_PATH=/Users/example/Desktop/flepiMoP/examples/tutorials
    export FLEPI_PATH=$(pwd)
    export PROJECT_PATH=$(pwd)/examples/tutorials
    set FLEPI_PATH=C:\your\path\to\flepiMoP
    set PROJECT_PATH=C:\your\path\to\flepiMoP\examples\tutorials
    set FLEPI_PATH=%CD%
    set PROJECT_PATH=%CD%\examples\tutorials
    $ cd $PROJECT_PATH
    $ rm -r model_output/
    flepimop simulate config_sample_2pop.yml
    export FLEPI_PATH=/Users/YourName/Github/flepiMoP
    export PROJECT_PATH=/Users/YourName/Github/flepiMoP/examples/tutorials
    cd $PROJECT_PATH
    rm -rf model_output
    flepimop simulate config.yml
    rm -rf model_output
    flepimop simulate new_config.yml
    flepimop-inference-main -c config_sample_2pop_inference.yml
    flepimop-inference-main -j 1 -n 1 -k 1 -c config_inference.yml
    export FLEPI_PATH=/Users/YourName/Github/flepiMoP
    export PROJECT_PATH=/Users/YourName/Github/flepiMoP/examples/tutorials
    cd $FLEPI_PATH
    pip install --no-deps -e flepimop/gempyor_pkg/
    Rscript build/local_install.R
    cd $PROJECT_PATH
    rm -rf model_output
    flepimop-inference-main -c config_inference.yml
    rm -rf model_output
    flepimop-inference-main -c config_inference_new.yml
    $ cd $FLEPI_PATH
    $ ./bin/flepimop-install -u
    $ ./bin/flepimop-install -r -u

    Running On A HPC With Slurm

    Tutorial on how to install and run flepiMoP on a supported HPC with slurm.

    These details cover how to install and initialize flepiMoP on an HPC environment and submit a job with slurm.

    Currently only JHU's Rockfish and UNC's Longleaf HPC clusters are supported. If you need support for a new HPC cluster please file an issue in the flepiMoP GitHub repository.

    For getting access to one of the supported HPC environments please refer to the following documentation before continuing:

• UNC's Longleaf Cluster for UNC users, or

• JHU's Rockfish Cluster for JHU users.

    External users will need to consult with their PI contact at the respective institution.

    Installing flepiMoP

This task needs to be run once to do the initial install of flepiMoP.

On JHU's Rockfish you'll need to run these steps in a slurm interactive job. This can be launched with /data/apps/helpers/interact -n 4 -m 12GB -t 4:00:00, but please consult the Rockfish user guide for up-to-date information.

Download and run the appropriate installation script with the following command:

    Substituting <cluster-name> with either rockfish or longleaf. This script will install flepiMoP to the correct locations on the cluster. Once the installation is done the conda environment can be activated and the script can be removed with:

    Updating flepiMoP

    Updating flepiMoP is designed to work just the same as installing flepiMoP. First change directory to your flepiMoP installation and then make sure that your clone of the flepiMoP repository is set to the branch you are working with (if doing development or operations work) and then run the flepimop-install-<cluster-name> script, substituting <cluster-name> with either rockfish or longleaf.

    Initialize The Created flepiMoP Environment

These steps to initialize the environment need to be run on a per-run or as-needed basis.

Change directory to where a full clone of the flepiMoP repository was placed (it will state the location in the output of the script above). Then run the hpc_init script, substituting <cluster-name> with either rockfish or longleaf. This script will assume the same defaults as the script before for where the flepiMoP clone is and the name of the conda environment. This script will also ask about the path to your flepiMoP installation and project directory. It will also ask if you would like to set a default configuration file; if you plan to use the flepimop batch-calibrate command below we recommend pressing enter to skip setting this environment variable. If this is your first time initializing flepiMoP it might be helpful to use configs out of the flepiMoP/examples/tutorials directory as a test.

    Upon completing this script it will output a sample set of commands to run to quickly test if the installation/initialization has gone okay.

    Submitting A Batch Inference Job To Slurm

    The main entry point for submitting batch inference jobs is the flepimop batch-calibrate action. This CLI tool will let you submit a job to slurm once logged into a cluster. For details on the available options please refer to flepimop batch-calibrate --help. As a quick example let's submit an R inference and EMCEE inference job. For the R inference run execute the following once logged into either longleaf or rockfish:

This command will produce a large amount of output, due to -vvv. If you want to try the command without actually submitting the job you can pass the --dry-run option. This command will submit a job to calibrate the sample 2 population configuration which uses R inference. The R inference supports array jobs so each chain will be run on an individual node with 1 CPU and 1GB of memory apiece. Additionally, the --extra option allows you to provide additional info to the batch system, in this case what partition to submit the jobs to, but email is also supported with slurm for notifications. After running this command you should notice the following output:

    • config_sample_2pop-YYYYMMDDTHHMMSS.yml: This file contains the compiled config that is actually submitted for inference,

    • manifest.json: This file contains a description of the submitted job with the command used, the job name, and flepiMoP and project git commit hashes,

• slurm-*_*.out: These files contain output from slurm for each of the array jobs submitted, and

• tmp*.sbatch: Contains the generated file submitted to slurm with sbatch.

    For operational runs these files should be committed to the checked out branch for archival/reproducibility reasons. Since this is just a test you can safely remove these files after inspecting them.

    Now, let's submit an EMCEE inference job with the same tool. Importantly, the options we'll use won't change much because flepimop batch-calibrate is designed to provide a unified implementation independent interface.

One notable difference is that, unlike R inference, EMCEE inference only supports running on 1 node, so resources for this command are adjusted accordingly:

    • Swapping 4 nodes with 1 cpu each to 1 node with 4 cpus, and

    • Doubling the memory usage from 4 nodes with 1GB each for 4GB total to 1 node with 8GB for 8GB total.

The extra increase in memory is to run a configuration that is slightly more resource-intensive than the previous example. This command will also produce a similar set of record keeping files like before that you can safely remove after inspecting.

    Estimating Required Resources For A Batch Inference Job

When inspecting the output of flepimop batch-calibrate --help you may have noticed several options named --estimate-*. While not required for the smaller jobs above, this tool has the ability to estimate the required resources to run a larger batch estimation job. The tool does this by running smaller jobs and then projecting the required resources for a large job from those smaller jobs. To use this feature, provide the --estimate flag, a job size of the targeted job, resources for test jobs, and the following estimation settings:

    • --estimate-runs: The number of smaller jobs to run to estimate the required resources from,

    • --estimate-interval: The size of the prediction interval to use for estimating the resource/time limit upper bounds,

• --estimate-vary: The job size elements to vary when generating smaller jobs,

• --estimate-factors: The factors to use in projecting the larger scale estimation job,

• --estimate-measurements: The resources to estimate,

• --estimate-scale-upper: The scale factor to use to determine the largest sample job to generate, and

• --estimate-scale-lower: The scale factor to use to determine the smallest sample job to generate.

Effectively using these options requires some knowledge of the underlying inference method. Sticking with the simple USA state-level example above, try submitting the following command (after cleaning up the output from the previous example):

    In short, this command will submit 6 test jobs that will vary simulations and measure time and memory. The number of simulations will be used to project the required resources. The test jobs will range from 1/5 to 1/10 of the target job size. This command will take a bit to run because it needs to wait on these test jobs to finish running before it can do the analysis, so you can check on the progress by checking the output of the simple_usa_statelevel_estimation.log file.

    Once this command finishes running you should notice a file called USA_influpaint_resources.json. This JSON file contains the estimated resources required to run the target job. You can submit the target job with the estimated resources by using the same command as before without the --estimate-* options and using the --from-estimate option to pull the information from the outputted file:
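The resource projection itself is conceptually simple: measure time and memory on a handful of small test jobs and extrapolate to the target job size. The sketch below illustrates that idea in Python with a linear fit and a prediction-interval-style upper bound; it is only a conceptual illustration with made-up numbers, not the estimation code that flepimop batch-calibrate actually runs.

```python
import numpy as np

# Made-up measurements from small test jobs: (number of simulations, peak memory in GB)
sims = np.array([50, 60, 75, 85, 100])
mem_gb = np.array([2.1, 2.4, 2.9, 3.2, 3.8])

# Fit a simple linear trend of memory vs. simulations
slope, intercept = np.polyfit(sims, mem_gb, deg=1)

# Project to the target job size and pad the estimate with the residual spread
target_sims = 500
residual_sd = np.std(mem_gb - (slope * sims + intercept))
projected = slope * target_sims + intercept
upper_bound = projected + 1.28 * residual_sd  # upper bound of an ~80% interval, mirroring --estimate-interval 0.8

print(f"Projected memory: {projected:.1f} GB (request roughly {upper_bound:.1f} GB)")
```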

    Saving Model Outputs On Batch Inference Job Finish

For production runs it is particularly helpful to save the calibration results after a successful run to long-term storage for safe keeping. To accomplish this flepimop batch-calibrate can chain a call to flepimop sync after a successful run via the --sync-protocol option. For more details on the flepimop sync command in general please refer to the Synchronizing files: Syntax and Applications guide.

    For a quick demonstration of how to use this option start with the config_sample_2pop_inference.yml configuration file and add the following section:

Where /path/to/an/example-folder and s3://my-bucket/and-sub-bucket are placeholders for paths to your desired location. Importantly, note that there is no trailing slash on the model_output directory name. This will cause flepimop sync to sync the model_output directory itself and not just its contents. You can also apply additional filters to the sync protocols here, say to limit the backed up model outputs to certain folders or exclude llik outputs, but the --sync-protocol option will add filters to limit the synced directories to those corresponding to the run submitted. Note that users do not need to specify run/job ids or configuration file names in the sync protocol. The flepimop batch-calibrate CLI will take advantage of flepimop sync's options to set paths appropriately to accommodate run/job ids.

    Modifying the first flepimop batch-calibrate command from before:

This command will submit an array job just like before, but will also add a dependent job with the same name prefixed with 'sync_'. This should look like:

    After those jobs finish the results can be found in a subdirectory named after the job and whose contents will look like:

    Note that this contains the model_output directory but only limited to the batch run named 'sample_2pop-20250521T190823_Ro_all_test_limits' as well as a file called manifest.json which can be used to reproduce the run from scratch if needed.

    Saving Model Outputs To S3 For Hopkins Users

    For Hopkins affiliated users there is a configuration file patch included with flepiMoP that can be used to add S3 syncing for model outputs to s3://idd-inference-runs. Taking the example before of running the config_sample_2pop_inference.yml configuration we can slightly modify the command to:

This will take advantage of the patching abilities of flepimop batch-calibrate to add a sync protocol named s3-idd-inference-runs that will save the results to the s3://idd-inference-runs bucket.

    $ curl -LsSf -o flepimop-install-<cluster-name> https://raw.githubusercontent.com/HopkinsIDD/flepiMoP/refs/heads/main/bin/flepimop-install-<cluster-name>
    $ chmod +x flepimop-install-<cluster-name>
    $ ./flepimop-install-<cluster-name>
    $ conda activate flepimop-env
    $ rm flepimop-install-<cluster-name> flepimop-install
    $ ./bin/flepimop-install-<cluster-name>
    $ ./batch/hpc_init <cluster-name>
    $ export PROJECT_PATH="$FLEPI_PATH/examples/tutorials/"
    $ cd $PROJECT_PATH
    $ flepimop batch-calibrate \
        --blocks 1 \
        --chains 4 \
        --samples 20 \
        --simulations 100 \
        --time-limit 30min \
        --slurm \
        --nodes 4 \
        --cpus 1 \
        --memory 1G \
        --extra 'partition=<your partition, if relevant>' \
        --extra 'email=<your email, if relevant>' \
        --skip-checkout \
        -vvv \
        config_sample_2pop_inference.yml
    $ export PROJECT_PATH="$FLEPI_PATH/examples/simple_usa_statelevel/"
    $ cd $PROJECT_PATH
    $ flepimop batch-calibrate \
        --blocks 1 \
        --chains 4 \
        --samples 20 \
        --simulations 100 \
        --time-limit 30min \
        --slurm \
        --nodes 1 \
        --cpus 4 \
        --memory 8G \
        --extra 'partition=<your partition, if relevant>' \
        --extra 'email=<your email, if relevant>' \
        --skip-checkout \
        -vvv \
        simple_usa_statelevel.yml
    $ flepimop batch-calibrate \
        --blocks 1 \
        --chains 4 \
        --samples 20 \
        --simulations 500 \
        --time-limit 2hr \
        --slurm \
        --nodes 1 \
        --cpus 4 \
        --memory 24GB \
        --extra 'partition=<your partition, if relevant>' \
        --extra 'email=<your email, if relevant>' \
        --skip-checkout \
        --estimate \
        --estimate-runs 6 \
        --estimate-interval 0.8 \
        --estimate-vary simulations \
        --estimate-factors simulations \
        --estimate-measurements time \
        --estimate-measurements memory \
        --estimate-scale-upper 5 \
        --estimate-scale-lower 10 \
        -vvv \
        simple_usa_statelevel.yml > simple_usa_statelevel_estimation.log 2>&1 & disown
    $ flepimop batch-calibrate \
        --blocks 1 \
        --chains 4 \
        --samples 20 \
        --simulations 500 \
        --time-limit 2hr \
        --slurm \
        --nodes 1 \
        --cpus 4 \
        --memory 24GB \
        --from-estimate USA_influpaint_resources.json \
        --extra 'partition=<your partition, if relevant>' \
        --extra 'email=<your email, if relevant>' \
        --skip-checkout \
        -vvv \
        simple_usa_statelevel.yml
    sync:
      rsync-model-output:
        type: rsync
        source: model_output
        target: /path/to/an/example-folder
      s3-model-output:
        type: s3sync
        source: model_output
        target: s3://my-bucket/and-sub-bucket
    $ export PROJECT_PATH="$FLEPI_PATH/examples/tutorials/"
    $ cd $PROJECT_PATH
    $ flepimop batch-calibrate \
        --blocks 1 \
        --chains 4 \
        --samples 20 \
        --simulations 100 \
        --time-limit 30min \
        --slurm \
        --nodes 4 \
        --cpus 1 \
        --memory 1G \
        --extra 'partition=<your partition, if relevant>' \
        --extra 'email=<your email, if relevant>' \
        --skip-checkout \
        --sync-protocol <your sync protocol, either rsync-model-output or s3-model-output in this case> \
        -vvv \
        config_sample_2pop_inference.yml
    [twillard@longleaf-login6 tutorials]$ squeue -p jlessler
                 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               2374868  jlessler sync_sam twillard PD       0:00      1 (Dependency)
             2374867_1  jlessler sample_2 twillard  R       2:26      1 g1803jles01
             2374867_2  jlessler sample_2 twillard  R       2:26      1 g1803jles01
             2374867_3  jlessler sample_2 twillard  R       2:26      1 g1803jles01
             2374867_4  jlessler sample_2 twillard  R       2:26      1 g1803jles01
    [twillard@longleaf-login6 sample_2pop-20250521T190823_Ro_all_test_limits]$ tree -L 4
    .
    ├── manifest.json
    └── model_output
        └── sample_2pop_Ro_all_test_limits
            └── sample_2pop-20250521T190823_Ro_all_test_limits
                ├── hnpi
                ├── hosp
                ├── hpar
                ├── init
                ├── llik
                ├── seir
                ├── snpi
                └── spar
    
    11 directories, 1 file
    $ flepimop batch-calibrate \
        --blocks 1 \
        --chains 4 \
        --samples 20 \
        --simulations 100 \
        --time-limit 30min \
        --slurm \
        --nodes 4 \
        --cpus 1 \
        --memory 1G \
        --extra 'partition=<your partition, if relevant>' \
        --extra 'email=<your email, if relevant>' \
        --skip-checkout \
        --sync-protocol s3-idd-inference-runs \
        -vvv \
        config_sample_2pop_inference.yml $FLEPI_PATH/common/s3-idd-inference-runs.yml

    Running with Docker locally 🛳

Short tutorial on running flepiMoP on your personal computer using a "Docker" container

    Access model files

    See the Before any run section to ensure you have access to the correct files needed to run. On your local machine, determine the file paths to:

    • the directory containing the flepimop code (likely the folder you cloned from Github), which we'll call <FLEPI_PATH>

    • the directory containing your project code including input configuration file and population structure, which we'll call <PROJECT_PATH>

For example, if you clone your Github repositories into a local folder called Github and are using flepimop_sample as a project repository, your directory names could be:

On Mac:

<FLEPI_PATH> = /Users/YourName/Github/flepiMoP

<PROJECT_PATH> = /Users/YourName/Github/flepiMoP/examples/tutorials

On Windows:

<FLEPI_PATH> = C:\Users\YourName\Github\flepiMoP

<PROJECT_PATH> = C:\Users\YourName\Github\flepiMoP\examples\tutorials

    Note that Docker file and directory names are case sensitive

    🧱 Set up Docker

Docker is a software platform that allows you to build, test, and deploy applications quickly. Docker packages software into standardized units called containers that have everything the software needs to run including libraries, system tools, code, and runtime. This means you can run and install software without installing the dependencies in the local operating system.

A Docker container is an environment which is isolated from the rest of the operating system, i.e., you can create files and programs, delete them, and so on, but that will not affect your OS. It is a local virtual OS within your OS.

For flepiMoP, we have a Docker container that will help you get running quickly.

Make sure you have the Docker software installed, and then open your command prompt or terminal application.

Helpful tools

To understand the basics of Docker, refer to Docker Basics. The Docker Tutorial may also be helpful.

To find the Apple Terminal, type "Terminal" in the search bar or go to Applications -> Utilities -> Terminal.

To install Docker for Mac, refer to the following link: Installing Docker for Mac. Pay special attention to the specific chip your Mac has (Apple Silicon vs Intel), as installation files and directions differ.

To install Docker for Windows, refer to the following link: Installing Docker for Windows.

To find the Windows Command Prompt, type "Command Prompt" in the search bar and open it. This Command Prompt Video Tutorial may be helpful for new users.

    ⚠️ Getting errors on a Mac?

    If you have a newer Mac computer that runs with an Apple Silicon chip, you may encounter errors. Here are a few tips to avoid them:

    • Make sure you have Mac OS 11 or above

• Install any minor updates to the operating system

• Install Rosetta 2 for Mac:

  • In terminal type softwareupdate --install-rosetta

• Make sure you've installed the Docker version that matches with the chip your Mac has (Intel vs Apple Silicon).

• Update Docker to the latest version

  • On Mac, updating Docker may require you to uninstall Docker before installing a newer version. To do this, open the Docker Desktop application and click the Troubleshoot icon (the small icon that looks like an insect at the top right corner of the window). Click the Uninstall button. Once this process is completed, open Applications in Finder and move Docker to the Trash. If you get an error message that says Docker cannot be deleted because it is open, then open Activity Monitor and stop all Docker processes. Then put Docker in the Trash. Once Docker is deleted, install the new Docker version appropriate for your Mac chip. After reinstallation is complete, restart your computer.

    Run the Docker image

    First, make sure you have the latest version of the flepimop Docker (hopkinsidd/flepimop) downloaded on your machine by opening your terminal application and entering:

    Next, run the Docker image by entering the following, replace <FLEPI_PATH> and <PROJECT_PATH> with the path names for your machine (no quotes or brackets, just the path text):

    On Windows: If you get an error, you may need to delete the "\" line breaks and submit as a single continuous line of code.

In this command, we run the Docker container, creating a volume and mounting (-v) your code and project directories into the container. Creating a volume and mounting it to a container basically allocates space in Docker for it to mirror – and have read and write access to – files on your local machine.

The folder with the flepiMoP code <FLEPI_PATH> will be at the path /home/app/flepimop within the Docker environment, while the project folder <PROJECT_PATH> will be at the path /home/app/drp.

{% hint style="success" %} You now have a local Docker container installed, which includes the R and Python versions required to run flepiMoP with all the required packages already installed. {% endhint %}

{% hint style="info" %} You don't need to re-run the above steps every time you want to run the model. When you're done using Docker for the day, you can simply "detach" from the container and pause it, without deleting it from your machine. Then you can re-attach to it when you next want to run the model. {% endhint %}

    Define environment variables

    Create environmental variables for the paths to the flepimop code folder and the project folder:

```bash
export FLEPI_PATH=/home/app/flepimop/
export PROJECT_PATH=/home/app/drp/
```

    Each installation step may take a few minutes to run.

    Note: These installations take place in the Docker container and not the local operating system. They must be made once while starting the container and need not be done for every time you run a model, provided they have been installed once. You will need an active internet connection for pulling the Docker image and installing the R packages (since some are hosted online), but not for other steps of running the model

    Run the code

    Everything is now ready 🎉 The next step depends on what sort of simulation you want to run: One that includes inference (fitting model to data) or only a forward simulation (non-inference). Inference is run from R, while forward-only simulations are run directly from the Python package gempyor.

    In either case, navigate to the project folder and make sure to delete any old model output files that are there

    Inference run

An inference run requires a configuration file that has the inference section. Stay in the $PROJECT_PATH folder, and run the inference script, providing the name of the configuration file you want to run (e.g., config.yml):

This will run the model and create a lot of output files in $PROJECT_PATH/model_output/.

    The last few lines visible on the command prompt should be:

    [[1]]

    [[1]][[1]]

    [[1]][[1]][[1]]

    NULL

    If you want to quickly do runs with options different from those encoded in the configuration file, you can do that from the command line, for example

    where:

    • n is the number of parallel inference slots,

• j is the number of CPU cores to use on your machine (if j > n, only n cores will actually be used; if j < n, some cores will run multiple slots in sequence), and

• k is the number of iterations per slot.

You can put all of this together into a single script that can be run all at once:

    Non-inference run

Stay in the $PROJECT_PATH folder, and run a simulation directly from the forward-simulation Python package gempyor. Call flepimop simulate providing the name of the configuration file you want to run (e.g., config.yml):

It is currently required that all configuration files have an interventions section. There is currently no way to simulate a model with no interventions, though this functionality is expected soon. For now, simply create an intervention that has value zero.

You can put all of this together into a single script that can be run all at once:

    Finishing up

You can avoid repeating all the above steps every time you want to run the code. When the docker run command creates a container, it is stored locally on your computer with all the installed packages/variables/etc. you created. You can leave this container and come back to it whenever you want, without having to redo all this set up.

    When you're in the Docker container, figure out the name Docker has given to the container you created by typing

    the output will be something silly like

write this down for later reference. You can also see the container name in the Docker Desktop app's Containers tab.

To "detach" from the Docker container and stop it, type CTRL + C.

The command prompt for your terminal application is now just running locally, not in the Docker container.

    Next time you want to re-start and "attach" the container, type

at the command line or hit the play button ▶️ beside the container's name in the Docker app. Replace container_name with the name of your old container.

Then "attach" to the container by typing

The reason that stopping/starting a container is separate from detaching/attaching is that technically you can leave a container (and any processes within it) running in the background and exit it. In case you want to do that, detach and leave it running by typing CTRL + P then quickly CTRL + Q. Then when you want to attach to it again, you don't need to do the part about starting the container.

If the core model code within the flepimop repository (flepimop/flepimop/gempyor_pkg/ or flepimop/flepimop/R_packages) has been edited since you created the container, or if the R or Python package requirements have changed, then you'll have to re-run the steps to install the packages; but otherwise, you can just start running model code!

    docker pull hopkinsidd/flepimop:latest-dev
    docker run -it \
      -v <FLEPI_PATH>:/home/app/flepimop \
      -v <PROJECT_PATH>:/home/app/drp \
    hopkinsidd/flepimop:latest-dev
    
# Go into the code directory and install the R and Python code packages
    cd $FLEPI_PATH # move to the flepimop directory
    Rscript build/local_install.R # Install R packages
    pip install --no-deps -e flepimop/gempyor_pkg/ # Install Python package gempyor
    cd $PROJECT_PATH       # goes to your project repository
    rm -r model_output/ # delete the outputs of past run if there are
    flepimop-inference-main -c config.yml
    flepimop-inference-main -j 1 -n 1 -k 1 -c config.yml
    docker pull hopkinsidd/flepimop:latest-dev
    docker run -it \
      -v <FLEPI_PATH>:/home/app/flepimop \
      -v <PROJECT_PATH>:/home/app/drp \
    hopkinsidd/flepimop:latest-dev
    export FLEPI_PATH=/home/app/flepimop/
    export PROJECT_PATH=/home/app/drp/
    cd $FLEPI_PATH
    Rscript build/local_install.R
    pip install --no-deps -e flepimop/gempyor_pkg/
    cd $PROJECT_PATH
    rm -rf model_output
    flepimop-inference-main -j 1 -n 1 -k 1 -c config.yml
    flepimop simulate config.yml
    docker pull hopkinsidd/flepimop:latest-dev
    docker run -it \
      -v <FLEPI_PATH>:/home/app/flepimop \
      -v <PROJECT_PATH>:/home/app/drp \
    hopkinsidd/flepimop:latest-dev
    export FLEPI_PATH=/home/app/flepimop/
    export PROJECT_PATH=/home/app/drp/
    cd $FLEPI_PATH
    Rscript build/local_install.R
    pip install --no-deps -e flepimop/gempyor_pkg/
    cd $PROJECT_PATH
    rm -rf model_output
    flepimop simulate config.yml
    docker ps
    > festive_feistel
    docker start container_name
    docker attach container_name

    Running on AWS 🌳

Using a Docker container

    🖥 Start and access AWS submission box

    Spin up an Ubuntu submission box if not already running. To do this, log onto AWS Console and start the EC2 instance.

    Update IP address in .ssh/config file. To do this, open a terminal and type the command below. This will open your config file where you can change the IP to the IP4 assigned to the AWS EC2 instance (see AWS Console for this):

    SSH into the box. In the terminal, SSH into your box. Typically we name these instances "staging", so usually the command is:

    🧱 Setup

    Now you should be logged onto the AWS submission box. If you haven't yet, set up your directory structure.

    🗂 Create the directory structure (ONCE PER USER)

    Type the following commands:

Note that the repository is cloned nested, i.e., the flepiMoP repository is INSIDE the data repository.

    Have your Github ssh key passphrase handy so you can paste it when prompted (possibly multiple times) with the git pull command. Alternatively, you can add your github key to your batch box so you don't have to enter your token 6 times per day.

🚀 Run inference using AWS (do every time)

    🛳 Initiate the Docker

    Start up and log into the docker container, and run setup scripts to setup the environment. This setup code links the docker directories to the existing directories on your box. As this is the case, you should not run job submission simultaneously using this setup, as one job submission might modify the data for another job submission.

    Setup environment

    To set up the environment for your run, run the following commands. These are specific to your run, i.e., change VALIDATION_DATE, FLEPI_RUN_INDEX and RESUME_LOCATION as required. If submitting multiple jobs, it is recommended to split jobs between 2 queues: Compartment-JQ-1588569569 and Compartment-JQ-1588569574.

    NOTE: If you are not running a resume run, DO NOT export the environmental variable RESUME_LOCATION.

    Additionally, if you want to profile how the model is using your memory resources during the run, run the following commands

Then prepare the pipeline directory (you can skip this if you have already done it and the pipeline hasn't been updated, i.e., git pull says it's up to date). You need to set $PROJECT_PATH to your data folder. For a COVID-19 run, do:

For Flu, do:

    Now for any type of run:

For now, just in case: update the arrow package from 8.0.0 in the docker to 11.0.3.

Now flepiMoP is ready 🎉.

    Do some clean-up before your run. The fast way is to restore the $PROJECT_PATH git repository to its blank states (⚠️ removes everything that does not come from git):

    I want more control over what is deleted

If you prefer to have more control, delete the files you like, e.g.:

If you still want to use git to clean the repo but want finer control or to understand how dangerous the command is, read this.

Then run the preparatory data building scripts and you are good to go.

Now you may want to test that it works:

    If this fails, you may want to investigate this error. In case this succeeds, then you can proceed by first deleting the model_output:

    Launch your inference batch job on AWS

    Assuming that the initial test simulation finishes successfully, you will now enter credentials and submit your job onto AWS batch. Enter the following command into the terminal:

    You will be prompted to enter the following items. These can be found in a file you received from Shaun called new_user_credentials.csv.

    • Access key ID when prompted

    • Secret access key when prompted

    • Default region name: us-west-2

    • Default output: Leave blank when this is prompted and press enter (The Access Key ID and Secret Access Key will be given to you once in a file)

    Now you're fully set to go 🎉

    To launch the whole inference batch job, type the following command:

This command infers everything from your environment variables: whether there is a resume or not, what the run_id is, etc., and the default is to carry seeding if it is a resume (see below for alternative options).

    If you'd like to have more control, you can specify the arguments manually:

We allow for a number of different jobs, with different setups, e.g., you may not want to carry seeding. Some examples of appropriate setups are given below. No modification of these code chunks should be required.

    NOTE: Resume and Continuation Resume runs are currently submitted the same way, resuming from an S3 that was generated manually. Typically we will also submit any Continuation Resume run specifying --resume-carry-seeding as starting seeding conditions will be manually constructed and put in the S3.

    Carrying seeding (do this to use seeding fits from resumed run):

    Discarding seeding (do this to refit seeding again):

Single Iteration + Carry seeding (do this to produce additional scenarios where no fitting is required):

    Document the submission

    After the job is successfully submitted, you will now be in a new branch of the data repo. Commit the ground truth data files to the branch on github and then return to the main branch:

    Send the submission information to slack so we can identify the job later. Example output:

    notepad .ssh/config
    ssh staging
    git clone https://github.com/HopkinsIDD/flepiMoP.git
    git clone https://github.com/HopkinsIDD/Flu_USA.git
    git clone https://github.com/HopkinsIDD/COVID19_USA.git
    cd COVID19_USA
    git clone https://github.com/HopkinsIDD/flepiMoP.git
    cd ..
    # or any other data directories
    git config --global credential.helper store
    git config --global user.name "{NAME SURNAME}"
    git config --global user.email YOUREMAIL@EMAIL.COM
    git config --global pull.rebase false # so you use merge as the default reconciliation method
    cd COVID19_USA
    git config --global credential.helper cache
    git pull 
    git checkout main
    git pull
    
    cd flepiMoP
    git pull	
    git checkout main
    git pull
    cd .. 
    sudo docker pull hopkinsidd/flepimop:latest
    sudo docker run -it \
      -v /home/ec2-user/COVID19_USA:/home/app/drp/COVID19_USA \
      -v /home/ec2-user/flepiMoP:/home/app/drp/flepiMoP \
      -v /home/ec2-user/.ssh:/home/app/.ssh \
    hopkinsidd/flepimop:latest 
    cd ~/drp
    export CENSUS_API_KEY={A CENSUS API KEY}
    export FLEPI_RESET_CHIMERICS=TRUE
    export COMPUTE_QUEUE="Compartment-JQ-1588569574"
    
    export VALIDATION_DATE="2023-01-29"
    export RESUME_LOCATION=s3://idd-inference-runs/USA-20230122T145824
    export FLEPI_RUN_INDEX=FCH_R16_lowBoo_modVar_ContRes_blk4_Jan29_tsvacc
    
    export CONFIG_PATH=config_FCH_R16_lowBoo_modVar_ContRes_blk4_Jan29_tsvacc.yml
    export FLEPI_MEM_PROFILE=TRUE
    export FLEPI_MEM_PROF_ITERS=50
    cd ~/drp
    export PROJECT_PATH=$(pwd)/COVID19_USA
    export GT_DATA_SOURCE="csse_case, fluview_death, hhs_hosp"
    cd ~/drp
    export PROJECT_PATH=$(pwd)/Flu_USA
    cd $PROJECT_PATH
    export FLEPI_PATH=$(pwd)/flepiMoP
    cd $FLEPI_PATH
    git checkout main
    git pull
    git config --global credential.helper 'cache --timeout 300000'
    
    #install gempyor and the R modules. There should be no error, please report if not.
    # Sometimes you might need to run the next line two times because inference depends
    # on report.generation, which is installed later, in alphabetical order.
    # (or if you know R well enough to fix that 😊)
    
    Rscript build/local_install.R # warnings are ok; there should be no error.
       python -m pip install --upgrade pip &
       pip install -e flepimop/gempyor_pkg/ &
       pip install boto3 &
       cd ..
    
    cd $PROJECT_PATH
    git pull 
    git checkout main
    git reset --hard && git clean -f -d  # this deletes everything that is not on github in this repo !!!
    rm -rf model_output data/us_data.csv data-truth &&
       rm -rf data/mobility_territories.csv data/geodata_territories.csv &&
       rm -rf data/seeding_territories.csv && 
       rm -rf data/seeding_territories_Level5.csv data/seeding_territories_Level67.csv
    
    # don't delete model_output if you have another run in //
    rm -rf $PROJECT_PATH/model_output
    export CONFIG_PATH=config_FCH_R16_lowBoo_modVar_ContRes_blk4_Jan29_tsvacc.yml # if you haven't already done this
    Rscript $FLEPI_PATH/datasetup/build_US_setup.R
    
    # For covid do
    Rscript $FLEPI_PATH/datasetup/build_covid_data.R
    
    # For Flu do
    Rscript $FLEPI_PATH/datasetup/build_flu_data.R
    flepimop-inference-main -c $CONFIG_PATH -j 1 -n 1 -k 1 
    rm -r model_output
    aws configure
    python $FLEPI_PATH/batch/inference_job_launcher.py --aws -c $CONFIG_PATH -q $COMPUTE_QUEUE 
    python $FLEPI_PATH/batch/inference_job_launcher.py --aws \ ## FIX THIS TO REFLECT AWS OPTIONS
                        -c $CONFIG_PATH \
                        -p $FLEPI_PATH \
                        --data-path $PROJECT_PATH \
                        --upload-to-s3 True \
                        --id $FLEPI_RUN_INDEX \
                        --restart-from-location $RESUME_LOCATION
    cd $PROJECT_PATH 
    
    $FLEPI_PATH/batch/inference_job_launcher.py --aws -c $CONFIG_PATH -q $COMPUTE_QUEUE
    cd $PROJECT_PATH 
    
    $FLEPI_PATH/batch/inference_job_launcher.py --aws -c $CONFIG_PATH -q $COMPUTE_QUEUE -j 1 -k 1
    cd $PROJECT_PATH
    
    $FLEPI_PATH/batch/inference_job_launcher.py --aws -c $CONFIG_PATH -q $COMPUTE_QUEUE --resume-carry-seeding --restart-from-location $RESUME_LOCATION
    cd $PROJECT_PATH 
    
    $COVID_PATH/batch/inference_job_launcher.py --aws -c $CONFIG_PATH -q $COMPUTE_QUEUE --resume-discard-seeding --restart-from-location $RESUME_LOCATION
    git add data/ 
    git config --global user.email "[email]" 
    git config --global user.name "[github username]" 
    git commit -m"scenario run initial" 
    branch=$(git branch | sed -n -e 's/^\* \(.*\)/\1/p')
    git push --set-upstream origin $branch
    
    git checkout main
    git pull
    Launching USA-20230426T135628_inference_med on aws...
     >> Job array: 300 slot(s) X 5 block(s) of 55 simulation(s) each.
     >> Final output will be: s3://idd-inference-runs/USA-20230426T135628/model_output/
     >> Run id is SMH_R17_noBoo_lowIE_phase1_blk1
     >> config is config_SMH_R17_noBoo_lowIE_phase1_blk1.yml
     >> FLEPIMOP branch is main with hash 3773ed8a20186e82accd6914bfaf907fd9c52002
     >> DATA branch is R17 with hash 6f060fefa9784d3f98d88a313af6ce433b1ac913
    cd $PROJECT_PATH 
    
$FLEPI_PATH/batch/inference_job_launcher.py -c $CONFIG_PATH -q $COMPUTE_QUEUE --resume-carry-seeding --restart-from-location $RESUME_LOCATION

    Installing flepiMoP For Development

Instructions on how to install `flepiMoP` for development purposes, using a utility script that installs extras and force reinstalls.

When developing flepiMoP it is helpful to install it in a way that gives developers more control over the development environment. To assist with this, there is the bin/flepimop-install-dev helper script, which wraps the bin/flepimop-install script with some defaults.

    1. First obtain a clone of flepiMoP where you plan on doing development work:

    $ git clone git@github.com:HopkinsIDD/flepiMoP.git
    Cloning into 'flepiMoP'...
    remote: Enumerating objects: 28538, done.
    remote: Counting objects: 100% (3449/3449), done.
    remote: Compressing objects: 100% (857/857), done.
    Receiving objects: 100% (28538/28538), 146.00 MiB | 37.49 MiB/s, done.
    remote: Total 28538 (delta 2915), reused 2808 (delta 2589), pack-reused 25089 (from 2)
    Resolving deltas: 100% (14847/14847), done.

    Or replace git@github.com:HopkinsIDD/flepiMoP.git with the appropriate URI for a fork if applicable.

1. Then change directory into this clone, check out a branch to do development on, and then run the bin/flepimop-install-dev script. For more details on branch naming and GitHub usage please refer to the Git and GitHub Usage documentation.

$ cd flepiMoP
$ git checkout -b feature/XYZ/my-cool-new-thing
$ ./bin/flepimop-install-dev

The bin/flepimop-install-dev script is a thin wrapper around the standard bin/flepimop-install script that:

• Installs flepiMoP to a conda environment located inside of this clone in a directory named venv/,

• Force reinstalls flepiMoP, which wipes out any existing conda environment and replaces it with a new one, and

• Installs all extra/optional dependencies for gempyor (like pytest, black, pylint, etc.) that are particularly useful for Python development.

1. Then you can activate this conda environment with:

conda activate venv/

You can verify that the installation was made to this local conda environment with:

$ which flepimop
/path/to/your/dev/flepiMoP/venv/bin/flepimop

1. If you need to refresh your development install, for example if you edit the R packages or add a new dependency, you can simply rerun the bin/flepimop-install-dev script, which will wipe the previous conda environment and create a fresh new one.

    Module specification

    THIS IS DEPRECATED. GO TO HopkinsIDD/COVID19_Minimal

    R interface basics

The python code will call your R scripts, setting some variables in the environment:

• from_python: a truthy boolean; test for this to know whether your code is being run automatically.

• ti_str, tf_str: the model start and end dates, as strings.

• foldername: the folder that contains everything related to the setup. You'll have to load geodata.csv from there. It includes the / at the end.

    The code is run from the root folder of the repository.

    Setup

A setup has a name, and this name is also a folder that contains the file geodata.csv (see below).

    Modules

(and the status of whether the current R implementation respects the specification)

    Mobility (WIP)

• From R: a dataframe named mobility with columns: from, to, amount. Relationships not specified will be set to zero. You can set different values for A -> B and B -> A (if you only specify A -> B, we'll assume B -> A = 0).

• From file: a matrix to be imported with numpy as-is. Dimensions: (nnodes, nnodes) (may have a third dimension if time varying). The first index is from, the second is to, and the diagonal is zero (mobility[ori, dest]).

    Population (DONE)

• From file: geodata.csv: a specification of the spatial nodes, with at least columns for the zero-based index, the geoid or name, and the population.

    Importation (TODO)

• From R: a dataframe named importation with columns date, to, amount, where date is a string, to contains a geoid, and amount contains an integer.

    NPI (DONE)

Different R scripts define the Nonpharmaceutical Intervention (NPI) to apply in the simulation. Based on the following system arguments, an R script will be called that generates the appropriate intervention. The start and end dates for each NPI need to be specified (YYYY-MM-DD).

    • None: No intervention, R0 reduction is 0

    • SchoolClosure: School closure, counties randomly assigned an R0 reduction ranging from 16-30% (Jackson, M. et al., medRxiv, 2020)

• Influenza1918: Influenza social distancing as observed during the 1918 influenza pandemic. Counties are randomly assigned an R0 reduction value ranging from 44-65% (the most intense social distancing R0 reduction values from Milwaukee) (Bootsma & Ferguson, PNAS, 2007)

• Wuhan: Counties randomly assigned an R0 reduction based on values reported in Wuhan before and after the travel ban during the COVID-19 outbreak (R0 reduction of 81-88%) (Zhang, B., Zhou, H., & Zhou F. medRxiv, 2020; Mizumoto, R., Kagaya, K., & Chowell, G., medRxiv, 2020)

• TestIsolate: This intervention represents rapid testing and isolation of cases, similar to what was done in Wuhan at the beginning of the outbreak. It reduces R0 by 45-96%.

• Mild: This intervention has two sequential interventions: school closures, followed by a period of Wuhan-style lockdown, followed by nothing.

• Mid: This intervention has three sequential interventions: school closures, followed by a period of Wuhan-style lockdown, followed by the social distancing practices used during the 1918 influenza pandemic.

• Severe: This intervention has three sequential interventions: school closures, followed by a Wuhan-style lockdown, followed by rapid testing and isolation.

Transmission parameters (TODO)

• From python: numpy matrix as file.

An example of testing for from_python at the top of an R script:

if (!from_python) {         # or whatever values you use to test.
    ti_str <- '2020-01-31'
    tf_str <- '2020-08-31'
    foldername <- 'west-coast/'
}
# write code here that uses what is above and can load more files.


    Model Implementation

    Specifying population structure

    This page describes how users specify the names, sizes, and connectivities of the different subpopulations comprising the total population to be modeled

    Overview

    The subpop_setup section of the configuration file is where users can input the information required to define a population structure on which to simulate the model. The options allow the user to determine the population size of each subpopulation that makes up the overall population, and to specify the amount of mixing that occurs between each pair of subpopulations.

An example configuration file with the global header and the subpop_setup section is below:

    Items and options

| Config Item | Required? | Type/Format | Description |
| --- | --- | --- | --- |
| geodata | required | path to file | path to geodata file |
| mobility | required | path to file | path to mobility file |
| selected | optional | string or list of strings | name(s) of selected location(s) in geodata |

    geodata file and selected option

    • geodata is a .csv with column headers, with at least two columns: subpop and population.

• selected, if provided, is the subset of locations in the geodata file (as determined by the subpop column) to be modeled. Requesting subpopulation(s) that are not present will lead to an error.

Example geodata file format:

subpop,population
10001,1000
20002,2000

    mobility file

The mobility file is a .csv file (it has to contain .csv as extension) with long-form comma-separated values. Columns have to be named ori, dest, amount, with amount being the average number of individuals moving from the origin subpopulation ori to destination subpopulation dest on any given day. Details on the mathematics of this model of contact are explained in the Model Description section. Unassigned relations are assumed to be zero. The location entries in the ori and dest columns should correspond to an entry in the subpop column in geodata.csv. When using selected, the mobility data will also be filtered.

Example mobility file format:

ori, dest, amount
10001, 20002, 3
20002, 10001, 3

It is also possible, but not recommended, to specify the mobility file as a .txt with space-separated values in the shape of a matrix. This matrix is symmetric and of size K x K, with K being the number of rows in geodata. The above example corresponds to

0 3
3 0

    Examples

    Example 1

To simulate a simple population structure with two subpopulations, a large province with 10,000 individuals and a small province with only 1,000 individuals, where every day 100 residents of the large province travel to the small province and interact with residents there, and 50 residents of the small province visit the large province, the subpop_setup section and input files could be

subpop_setup:
  geodata: model_input/geodata.csv
  mobility: model_input/mobility.csv

geodata.csv contains the population structure (with columns subpop and population):

subpop,          population
large_province, 10000
small_province, 1000

mobility.csv contains:

ori,            dest,           amount
large_province, small_province, 100
small_province, large_province, 50

    Specifying seeding

    This section describes how to specify the values of each model state at the time the simulation starts, and how to make instantaneous changes to state values at other times (e.g., due to importations)

    Overview

    flepiMoP allows users to specify instantaneous changes in values of model variables, at any time during the simulation. We call this "seeding". For example, some individuals in the population may travel or otherwise acquire infection from outside the population throughout the epidemic, and this importation of infection could be specified with the seeding option. As another example, new genetic variants of the pathogen may arise due to mutation and selection that occurs within infected individuals, and this generation of new strains can also be modeled with seeding. Seeding allows individuals to change state at specified times in ways that do not depend on the model equations. In the first example, the individuals would be "seeded" into the infected compartment from the susceptible compartment, and in the second example, individuals would be seeded into the "infected with new variant" compartment from the "infected with wild type" compartment.

    flepiMoP's configuration file

    About configuration files

flepiMoP is set up so that all parameters and other options for running the pipeline can be specified in a single "configuration" file (aka "config"). Users do not need to edit any other code files, or even be aware of their contents, to create and run complex model scenarios. Configuration files also provide a convenient record of model options and promote reproducibility of model results.

    We use the YAML language syntax to write config files, which are typically named something like config.yml. The file has simple plain text contents and follows a tabbed outline structure. When config files are read by the model code, a data structure encoding the model options is created.

    name: test_simulation
    model_output_dirname: model_output
    start_date: 2020-01-01
    end_date: 2020-12-31
    nslots: 100
    
    subpop_setup:
      geodata: model_input/geodata.csv
      mobility: model_input/mobility.csv


    The seeding option can also be used as a convenient alternative way to specify initial conditions. By default, flepiMoP initiates models by putting the entire population size (specified in the geodata file) in the first model compartment. If the desired initial condition is only slightly different than the default state, it may be more convenient to specify it with a few "seedings" that occur on the first day of the simulation. For example, for a simple SIR model where the desired initial condition is just a small number of infected individuals, this could be specified by a single seeding into the infected compartment from the susceptible compartment at time zero, instead of specifying the initial values of three separate compartments. For larger models, the difference becomes more relevant.

    Specifying model seeding

    The configuration items in the seeding section of the config file are

seeding::method Must be either "NoSeeding", "FromFile", "PoissonDistributed", "NegativeBinomialDistributed", or "FolderDraw".

    seeding::seeding_file Only required for method: “FromFile”. Path to a .csv file containing the list of seeding events

    seeding::lambda_file Only required for methods "PoissonDistributed" or "NegativeBinomialDistributed". Path to a .csv file containing the list of the events from which the actual seeding will be randomly drawn.

    seeding::seeding_file_type Only required for method "FolderDraw". Either seir or seed

    Details on implementing each seeding method and the options that go along with it are below.

    seeding::method

    NoSeeding

If there is no seeding, then the number of individuals in each compartment will be initiated using the values specified in the initial_conditions section and will only be changed at later times based on the equations defined in the seir section. No other arguments are needed in the seeding section in this case.

Example:

seeding:
    method: "NoSeeding"

    FromFile

This seeding method reads in a user-defined file with a list of seeding events (instantaneous transitions of individuals between compartments) including the time of the event and subpopulation where it occurs, and the source and destination compartment of the individuals. For example, for the simple two-subpopulation SIR model where the outbreak starts with 5 individuals in the small province being infected from a source outside the population, the seeding section of the config could be specified as

seeding:
  method: "FromFile"
  seeding_file: seeding_2pop.csv

Where seeding_2pop.csv contains

subpop, date, amount, source_infection_stage, destination_infection_stage
small_province, 2020-02-01, 5, S, E

    seeding::seeding_file must contain the following columns:

    • subpop – the name of the subpopulation in which the seeding event takes place. Seeding cannot move individuals between different subpopulations.

    • date – the date the seeding event occurs, in YYYY-MM-DD format

    • amount – an integer value for the amount of individuals who transition between states in the seeding event

• source_* and destination_* – For each compartment group (i.e., infection stage, vaccination stage, age group), a different column describes the status of individuals before and after the transition described by the seeding event. For example, for a model where individuals are stratified by age and vaccination status, and a 1-day vaccination campaign for young children and the elderly moves a large number of individuals into a vaccinated state, this file could be something like

subpop, date, amount, source_infection_stage, source_vaccine_doses, source_age_group, destination_infection_stage, destination_vaccine_doses, destination_age_group
anytown, 1950-03-15, 452, S, 0dose, under5years, S, 1dose, under5years
anytown, 1950-03-16, 527, S, 0dose, 5_10years, S, 1dose, 5_10years
anytown, 1950-03-17, 1153, S, 0dose, over65years, S, 1dose, over65years

    PoissonDistributed or NegativeBinomialDistributed

These methods are very similar to FromFile, except the seeding value used in the simulation is randomly drawn from the seeding value specified in the file, with an average value equal to the file value. These methods can be useful when the true seeded value is unknown, and only an observed value is available which is assumed to be observed with some uncertainty. The input requirements are the same for both distributions

seeding:
  method: "PoissonDistributed"
  lambda_file: seeding.csv

or

seeding:
  method: "NegativeBinomialDistributed"
  lambda_file: seeding.csv

    and the lambda_file has the same format requirements as the seeding_file for the FromFile method described above.

For method::PoissonDistributed, the seeding value for each seeding event is drawn from a Poisson distribution with mean and variance equal to the value in the amount column. For method::NegativeBinomialDistributed, seeding is drawn from a negative binomial distribution with mean amount and variance amount+5 (so identical to "PoissonDistributed" for large values of amount but with higher variance for small values).

    FolderDraw

TBD.


    Comments can be added to the config file by starting with the hash key (#) then a space. Comments can start anywhere on a line and continue until the end, but if they run over to a new line, a new # must be used at the start of the new line.
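For example (using keys from the sample config above):

name: test_simulation  # this comment runs to the end of the line
# this comment continues on its own line, so it starts with a new #
start_date: 2020-01-01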

    Example

(A simple configuration for a toy model with two subpopulations, an SEIR model, a single "cases" outcome, a single seeded infection, and a single NPI that starts after some time is planned here; this page is currently under development, please see our example repo for some simple configurations.)

When referring to config items (individual parameters), we use their full position in the outline. For example, in the sample config file above, we denote

subpop_setup:
  ...
  geodata: minimal

as subpop_setup::geodata having a value of minimal.

    Notation

    Parameters and other options specified in the configuration files can take on a variety of types of values, using the following notations:

    • dates are specified as [year]-[month]-[day]. (e.g., 2020-01-31)

    • boolean values are either "TRUE" or "FALSE"

• file names are strings

    • probability is a float between 0 and 1

• distribution is a probability distribution from which a random value for the parameter is drawn each time a new simulation is run (or chain, if doing inference). See the Distributions section for the required schema; a sketch is shown below.
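As an illustration only (sigma is a hypothetical parameter name; the exact distribution names and keys are defined in the Distributions section), a parameter drawn from a uniform distribution might look like:

sigma:
  value:
    distribution: uniform
    low: 0.18
    high: 0.22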

    Configuration files sections

    Global header

    Required section

    These global configuration options typically sit at the top of the configuration file.

| Item | Required? | Type/Format | Description |
| --- | --- | --- | --- |
| name | required | string | Name of this configuration. Will be used in file names created to store model output. |
| start_date | required | date | model simulation start date |
| end_date | required | date | model simulation end date |
| start_date_groundtruth | optional for non-inference runs, required for inference runs | date | start date for comparing model to data |
| end_date_groundtruth | optional for non-inference runs, required for inference runs | date | end date for comparing model to data |
| nslots | optional (can also be defined by an environmental variable) | int | number of independent simulations to run |
| setup_name | optional | string | setup name used to describe the run, used in setting up file names |
| model_output_dirname | optional | folder path | path to folder where all the outputs created by the model are stored; if not specified, the default is model_output |

For example, for a configuration file to simulate the spread of COVID-19 in the US during 2020 and compare to data from March 1 onwards, with 1000 independent simulations, the header of the config might read:

name: USA_covid19_2020
model_output_dirname: model_output
start_date: 2020-01-01
end_date: 2020-12-31
start_date_groundtruth: 2020-03-01
end_date_groundtruth: 2020-12-31
nslots: 1000

    subpop_setup section

    Required section

    This section specifies the population structure on which the model will be simulated, including the names and sizes of each subpopulation and the connectivity between them. More details here.

    compartments section

    Required section

This section is where users can specify the variables (infection states) that will be tracked in the infectious disease transmission model. More details can be found here. The other details of the model are specified in the seir section, including transitions between these compartments (seir::transitions), the names of the parameters governing the transitions (seir::parameters), and the numerical method used to simulate the equations over time (seir::integration). The initial conditions of the model can be specified in the initial_conditions section, and any other inputs into the model from external populations or instantaneous transitions between states that occur at later times can be specified in the seeding section.

    seir section

    Required section

This section is where users can specify the details of the infectious disease transmission model they wish to simulate (e.g., SEIR). This model describes the allowed transitions (seir::transitions) between the compartments that were specified in the compartments section, the values of the parameters involved in these transitions (seir::parameters), and the numerical method used to simulate the equations over time (seir::integration). More details here. The initial conditions of the model can be specified in the separate initial_conditions section, and any other inputs into the model from external populations or instantaneous transitions between states that occur at later times can be specified in the seeding section.

    initial_conditions section

    Optional section

This section is used to specify the initial conditions of the model, which define how individuals are distributed between the model compartments at the time the model simulation begins. Importantly, the initial conditions specify the time and location where infection is first introduced. If this section is omitted, default values are used. If users want to add infections to the population at later times, or add or remove individuals from compartments separately from the model rules, they can do so via the related seeding section. More details here.

    seeding section

    Optional section

This section is used to specify how individuals are instantaneously "seeded" from one compartment to another, where they then continue to be governed by the model equations. For example, this seeding could be used to represent importations of infected individuals from an outside population, mutation events that create new strains, or vaccinations that alter disease susceptibility. Seeding events can occur at any time in the simulation. The seeding section specifies the numeric values added to or removed from any compartment of the model. More details here.

    outcomes section

    Optional section

This section is where users can define new variables representing the observed quantities and how they are related to the underlying state variables in the model (e.g., the fraction of infections that are detected as cases). More details here.

    interventions section

    Required section

This section is where users can specify time-varying changes to parameters governing either the infectious disease model or the observational model. More details here.

    inference section

    Optional section

This section is where users can specify the details of how the model is fit to data, including which data streams will be included, which outcome variables they represent, and the likelihood functions describing the probability of the data given the model. More details here.

    (OLD) Configuration options

    filtering section

    The filtering section configures the settings for the inference algorithm. The below example shows the settings for some typical default settings, where the model is calibrated to the weekly incident deaths and weekly incident confirmed cases for each subpop. Statistics, hierarchical_stats_geo, and priors each have scenario names (e.g., sum_deaths, local_var_hierarchy, and local_var_prior, respectively).
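As a hedged sketch only (the key names below are those listed in the tables that follow; the specific values, and the dist/param form of the prior likelihood, are illustrative assumptions rather than a definitive schema), such a section could look like:

filtering:
  simulations_per_slot: 350                      # illustrative value
  do_filtering: TRUE
  data_path: data/us_data.csv                    # observed ground truth data
  likelihood_directory: importation/likelihood/  # illustrative path
  statistics:
    sum_deaths:                                  # scenario name, user defined
      name: sum_deaths
      aggregator: sum                            # aggregate data over the period
      period: "1 weeks"                          # weekly aggregation
      sim_var: incidD                            # illustrative model variable name
  hierarchical_stats_geo:
    local_var_hierarchy:                         # scenario name, user defined
      name: local_variance                       # illustrative parameter to group
      module: seir
      geo_group_col: USPS                        # illustrative grouping column
  priors:
    local_var_prior:                             # scenario name, user defined
      name: local_variance
      module: seir
      likelihood:                                # assumed format for the prior
        dist: normal
        param: [0, 1]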

    Home

    Welcome to flepiMoP documentation!

The “FLexible EPIdemic MOdeling Pipeline” (flepiMoP; formerly known as the COVID Scenario Modeling Pipeline or CSP) is an open-source software suite designed by researchers in the Johns Hopkins Infectious Disease Dynamics Group and at UNC Chapel Hill to simulate a wide range of compartmental models of infectious disease transmission. The disease transmission and observation models are defined by a no-code configuration file, which allows models of varying complexity to be specified quickly and consistently, from simple problems described by SIR-style models in a single population to more complicated models of multiple pathogen strains transmitting between thousands of connected spatial divisions and age groups.

It was initially designed in early 2020 and was routinely used to provide projections of the emerging COVID-19 epidemic to health authorities worldwide. Currently, flepiMoP provides COVID-19 projections to the US CDC-funded model aggregation sites, the COVID-19 Forecast Hub and the COVID-19 Scenario Modeling Hub, influenza projections to FluSight and to the Flu Scenario Modeling Hub, and RSV projections to the RSV Scenario Modeling Hub.

However, the pipeline is much more general and can be used to simulate the dynamics of any infection that can be expressed as a compartmental epidemic model. Such models also have applications in chemical reaction kinetics, pharmacokinetics, within-host disease dynamics, and the social sciences.

    Specifying compartmental model

    This section describes how to specify the compartmental model of infectious disease transmission.

    We want to allow users to work with a wide variety of infectious diseases or, one infectious disease under a wide variety of modeling assumptions. To facilitate this, we allow the user to specify their compartmental model of disease dynamics via the configuration file.

    We originally considered asking users to specify each compartment and transition manually. However, we quickly found that this created long, confusing configuration files, and so we created a shorthand to more succinctly specify both compartments and transitions between them. This works especially well for models where individuals are stratified by other properties (like age, vaccination status, etc.) in addition to their infection status.

    The model is specified in two separate sections of the configuration file. In the compartments section, users define the possible states individuals can be categorized into. Then in the seir section, users define the possible transitions between states, the values of parameters that govern the rates of these transitions, and the numerical method used to simulate the model.

    An example section of a configuration file defining a simple SIR model is below.
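As a rough, hedged sketch of what such a section could look like (parameter values are arbitrary, and the exact nesting of the transitions fields is described authoritatively in the sections below):

compartments:
  infection_stage: ["S", "I", "R"]

seir:
  integration:
    method: rk4
    dt: 1
  parameters:
    beta:
      value: 0.1          # transmission rate
    gamma:
      value: 0.2          # recovery rate
  transitions:
    # infection: S -> I, proportional to S * I
    - source: ["S"]
      destination: ["I"]
      rate: ["beta"]
      proportional_to: [["S"], ["I"]]
      proportion_exponent: ["1", "1"]
    # recovery: I -> R at a constant per-capita rate
    - source: ["I"]
      destination: ["R"]
      rate: ["gamma"]
      proportional_to: ["source"]
      proportion_exponent: ["1"]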


    filtering settings

    With inference model runs, the number of simulations nsimulations refers to the number of final model simulations that will be produced. The filtering$simulations_per_slot setting refers to the number of iterative simulations that will be run in order to produce a single final simulation (i.e., number of simulations in a single MCMC chain).

| Item | Required? | Description |
| --- | --- | --- |
| simulations_per_slot | required | number of iterations in a single MCMC inference chain |
| do_filtering | required | TRUE if inference should be performed |
| data_path | required | file path where observed data are saved |
| likelihood_directory | required | |

    filtering::statistics

    The statistics specified here are used to calibrate the model to empirical data. If multiple statistics are specified, this inference is performed jointly and they are weighted in the likelihood according to the number of data points and the variance of the proposal distribution.

| Item | Required? | Description |
| --- | --- | --- |
| name | required | name of statistic, user defined |
| aggregator | required | function used to aggregate data over the period, usually sum or mean |
| period | required | duration over which data should be aggregated prior to use in the likelihood; may be specified in any number of days, weeks, or months |
| sim_var | required | |

    filtering::hierarchical_stats_geo

The hierarchical settings specified here are used to group the inference of certain parameters together (similar to inference in "hierarchical" or "fixed/group effects" models). For example, users may desire to group all counties in a given state because they are geographically proximate and impacted by the same statewide policies. The effect should be to make these inferred parameters follow a normal distribution and to observe shrinkage among the variance in these grouped estimates.

| Item | Required? | Description |
| --- | --- | --- |
| scenario name | required | name of hierarchical scenario, user defined |
| name | required | name of the estimated parameter that will be grouped (e.g., the NPI scenario name or a standardized, combined health outcome name like probability_incidI_incidC) |
| module | required | name of the module where this parameter is estimated (important for finding the appropriate files) |
| geo_group_col | required | |

    filtering::priors

    It is now possible to specify prior values for inferred parameters. This will have the effect of speeding up model convergence.

| Item | Required? | Description |
| --- | --- | --- |
| scenario name | required | name of prior scenario, user defined |
| name | required | name of NPI scenario or parameter that will have the prior |
| module | required | name of the module where this parameter is estimated |
| likelihood | required | |



    In addition to producing forward simulations given a specified model and parameter values, the pipeline can also attempt to optimize unknown parameters (e.g., transmission rate, case detection rate, intervention efficacy) to fit the model to datasets the user provides (e.g., hospitalizations due to severe disease) using a Bayesian inference framework. This feature allows the pipeline to be utilized for short-term forecasting or longer-term scenario projections for ongoing epidemics, since it can simultaneously be fit to data for dates in the past and then use best-fit parameters to make projections into the future.

    General description of flepiMoP

    The main features of flepiMoP are:

• Open-source (GPL v3.0) infectious disease dynamics modeling software, written in R and Python

    • Versatile, no-code design applicable for most compartmental models and outcome observation models, allowing for quick iteration in reaction to epidemic events (e.g., emergence of new variants, vaccines, non-pharmaceutical interventions (NPIs))

    • Powerful, just-in-time compiled disease transmission model and distributed inference engine ready for large scale simulations on high-performance computing clusters or cloud workflows

    • Adapted to small- and large-scale problems, from a simple SIR model to a complex model structure with hundreds of compartments on thousands of connected populations

    • Strong emphasis on mechanistic processes, with a design aimed at leveraging domain knowledge in conjunction with statistical inference

    • Portable for Windows WSL, MacOS, and Linux with the provided Docker image and an Anaconda environment

    Overview of the pipeline organization

    The mathematical model within the pipeline is a compartmental epidemic model embedded within a well-mixed metapopulation. A compartmental epidemic model is a model that divides all individuals in a population into a discrete set of states (e.g. “infected”, “recovered”) and tracks – over time – the number of individuals in each state and the rates at which individuals transition between these states. The well-known SIR model is a classic example of such a model, and much more complex versions of this model type have been simulated with this framework (for example, an SEIR-style model in which individuals are further subdivided into multiple age groups and vaccination statuses).

    The structure of the desired model, as well as the parameter values and initial conditions, can be specified flexibly by the user in a no-code fashion. The pipeline allows for parameter values to change over time at discrete intervals, which can be used to specify time-dependent aspects of disease transmission and control (such as seasonality or vaccination campaigns).

The model is embedded within a meta-population structure, which consists of a series of distinct subpopulations (e.g. states, provinces, or other communities) in which the model structure is repeated, albeit with potentially different parameter values. The subpopulations can interact, either through the movement of individuals or the influence of individuals in one subpopulation on the transition rate of individuals in another.

Within each subpopulation, the population is assumed to be well-mixed, meaning that interactions are assumed to be equally likely between any pair of individuals (since unique identities of individuals are not explicitly tracked). The same model structure can be simulated in a continuous-time deterministic or discrete-time stochastic manner.

In addition to the variables described by the compartmental model, the model can track other observable variables (“outcomes”) that are functions of the basic model variables but do not themselves influence the dynamics (e.g., some portion of infections are reported as cases, depending on a testing rate). The model can be run iteratively to tune the values of certain parameters so that these outcome variables best match timeseries data provided by the user for a certain time period.

    Fitting is done using a Bayesian-like framework, where the user can specify the likelihood of observed outcomes in data given modeled outcomes, and the priors on any parameters to be fit. Multiple data streams (e.g., cases and deaths) can be fit simultaneously. A custom Markov Chain Monte Carlo method is used to sequentially propose and accept or reject parameter values based on the model fit to data, in a way that balances fit quality within each individual subpopulation with that of the total aggregate population, and that takes advantage of parallel computing environments.

The code is written in a combination of R and Python, and the vast majority of users only need to interact with the pipeline via the components written in R. It is structured in a modular fashion, such that individual components – such as the epidemic model, the observable variables, the population structure, or the parameters – can be edited or completely replaced without any handling of other parts of the code.

When model simulation is combined with fitting to data, the code is designed to run most efficiently on a supercomputing cluster with many cores. We most commonly run the code on Amazon Web Services or on high-performance computers using SLURM. However, even relatively large models can be run efficiently on most personal computers. Typically, the memory of the machine will limit the number of compartments (i.e., variables) that can be included in the epidemic model, while the machine's CPU will determine the speed at which each model run is completed and the number of iterations of the model that can be run during parameter searches when fitting the model to data. While the pipeline can be installed on any computer, it is sometimes easier to use an Anaconda environment or the provided Docker container, where all the software dependencies (e.g., standardized R and Python versions along with required packages) are included, independent of the user's local machine. All the code is maintained on our GitHub and shared under the GNU General Public License v3.0 license. It is built on top of a fully open-source software stack.

    This documentation is organized as follows. The Model Description section describes the mathematical framework for the compartmental epidemic models that can be simulated forward in time by the pipeline. The Model Inference section describes the statistical framework for fitting the model to data. The Data and Parameter section describes the inputs the user must provide to the pipeline, in terms of the model structure and parameters, the population characteristics, the initial conditions, time-varying interventions, data to be fit, and more. The How to Run section provides concrete guidance on setting up and running the model and analyzing the output. The Quick Start Guide provides a simple example model setup. The Advanced section goes into more detail on specific features of the model and the code that are likely to only be of interest to users who want to run more complex models or data fitting routines or substantially edit the code. It includes a subsection describing each file and package used in the pipeline and their interactions during a model run.

    Users who wish to jump to running the model themselves can see Quick Start Guide.

    For questions about the pipeline or to report a bug, please use the “Issues” or "Discussions" feature on our GitHub.

    Acknowledgments

flepiMoP is actively developed by its current contributors, including Joseph C Lemaitre, Sara L Loo, Emily Przykucki, Clifton McKee, Claire Smith, Sung-mok Jung, Koji Sato, Pengcheng Fang, Erica Carcelen, Alison Hill, Justin Lessler, and Shaun Truelove, affiliated with the following institutions:

• Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA (JCL, JL)

• Johns Hopkins University International Vaccine Access Center, Department of International Health, Baltimore, MD, USA (SLL, KJ, EC, ST)

• Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA (CM, CS, JL, ST)

• Carolina Population Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA (S-m.J, JL)

• Institute for Computational Medicine, Johns Hopkins University, Baltimore, MD, USA (AH).

The development of this model was supported by funds from the National Science Foundation (2127976; ST, CPS, JK, ECL, AH), Centers for Disease Control and Prevention (200-2016-91781; ST, CPS, JK, AH, JL, JCL, SL, CM, EC, KS, S-m.J), US Department of Health and Human Services / Department of Homeland Security (ST, CPS, JK, ECL, AH, JL), California Department of Public Health (ST, CPS, JK, ECL, JL), Johns Hopkins University (ST, CPS, JK, ECL, JL), Amazon Web Services (ST, CPS, JK, ECL, AH, JL, JCL), National Institutes of Health (R01GM140564; JL, 5R01AI102939; JCL), and the Swiss National Science Foundation (200021-172578; JCL).

    We need to also acknowledge past contributions to the development of the COVID Scenario Pipeline, which evolved into flepiMoP. These include contributions by Heramb Gupta, Kyra H. Grantz, Hannah R. Meredith, Stephen A. Lauer, Lindsay T. Keegan, Sam Shah, Josh Wills, Kathryn Kaminsky, Javier Perez-Saez, Joshua Kaminsky, and Elizabeth C. Lee.

    Specifying model compartments (compartments)

    The first stage of specifying the model is to define the infection states (variables) that the model will track. These "compartments" are defined first in the compartments section of the config file, before describing the processes that lead to transitions between them. The compartments are defined separately from the rest of the model because they are also used by the seeding section that defines initial conditions and importations.

For simple disease models, the compartments can simply be listed with whatever notation the user chooses. For example, for a simple SIR model, the compartments could be ["S", "I", "R"]. The config also requires that there be a variable name for the property of the individual that these compartments describe, which for example in this case could be infection_stage.
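For instance, a minimal compartments section along these lines could be used (a sketch):

compartments:
  infection_stage: ["S", "I", "R"]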

    Our syntax allows for more complex models to be specified without much additional notation. For example, consider a model of a disease that followed SIR dynamics but for which individuals could receive vaccination, which might change how they experience infection.

    In this case we can specify compartments as the cross product of multiple states of interest. For example:
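A sketch of this, reusing infection_stage and adding a vaccination axis (the key and label names are illustrative):

compartments:
  infection_stage: ["S", "I", "R"]
  vaccination_status: ["unvaccinated", "vaccinated"]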

This corresponds to 6 compartments, which the code internally converts to a data frame with one row per compartment (S_unvaccinated, S_vaccinated, I_unvaccinated, I_vaccinated, R_unvaccinated, R_vaccinated).

    In order to more easily describe transitions, we want to be able to refer to a compartment by its components, but then use it by its compartment name.

    If the user wants to specify a model in which some compartments are repeated across states but others are not, there will be pros and cons of how the model is specified. Specifying it using the cross product notation is simpler, less error prone, and makes config files easier to read, and there is no issue with having compartments that have zero individuals in them throughout the model. However, for very large models, extra compartments increase the memory required to conduct the simulation, and so having unnecessary compartments tracked may not be desired.

    For example, consider a model of a disease that follows SI dynamics in two separate age groups (children and adults), but for which only adults receive vaccination, with one or two doses of vaccine. With the simplified notation, this model could be specified as:
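A sketch of the cross-product version (the age and dose labels are illustrative):

compartments:
  infection_stage: ["S", "I"]
  age_group: ["child", "adult"]
  vaccination_status: ["unvaccinated", "1dose", "2dose"]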

    corresponding to 12 compartments, 4 of which are unnecessary to the model

    Or, it could be specified with the less concise notation

    which does not result in any unnecessary compartments being included.

    These compartments are referenced in multiple different subsequent sections of the config. In the seeding (LINK TBA) section the user can specify how the initial (or later imported) infections are distributed across compartments; in the seir section the user can specify the form and rate of the transitions between these compartments encoded by the model; in the outcomes section the user can specify how the observed variables are generated from the underlying model states.

    Notation must be consistent between these sections.

    Specifying compartmental model transitions (seir::transitions)

    The way we specify transitions between compartments in the model is a bit more complicated than how the compartments themselves are specified, but allows users to specify complex stratified infectious disease models with minimal code. This makes checking, sharing, and updating models more efficient and less error-prone.

    We specify one or more transition globs, each of which corresponds to one or more transitions. Since transition globs are shorthand for collections of transitions, we will first explain how to specify a single transition before discussing transition globs.

    A transition has 5 pieces of associated information that a user can specify:

    • source

    • destination

    • rate

    • proportional_to

    • proportion_exponent

    For more details on the mathematical forms possible for transitions in our models, read the Model Description section.

    We first consider a simple example of an SI model where individuals may either be vaccinated (v) or unvaccinated (u), but the vaccine does not change the susceptibility to infection nor the infectiousness of infected individuals.

    We will focus on describing the first transition of this model, the rate at which unvaccinated individuals move from the susceptible to infected state.

    Specifying a single transition

    Source

The compartment the transition moves individuals out of (i.e., the source compartment) is an array. For example, to describe a transition that moves unvaccinated susceptible individuals to another state, we would write
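a sketch, using the compartment naming above:

source: ["S", "unvaccinated"]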

    which corresponds to the compartment S_unvaccinated

    Destination

The compartment the transition moves individuals into (i.e., the destination compartment) is an array. For example, to describe a transition that moves individuals into the unvaccinated but infected state, we would write
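a sketch, using the same naming:

destination: ["I", "unvaccinated"]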

    which corresponds to the compartment I_unvaccinated

    Rate

    The rate constant specifies the probability per time that an individual in the source compartment changes state and moves to the destination compartment. For example, to describe a transition that occurs with rate 5/time, we would write:
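A sketch of this field:

rate: 5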

Instead, we could describe the rate using a parameter beta, which can be given a numeric value later:
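For instance (a sketch):

rate: beta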

    The interpretation and unit of the rate constant depend on the model details, as the rate may potentially also be per number (or proportion) of individuals in other compartments (see below).

    Proportional to

    A vector of groups of compartments (each of which is an array) that modify the overall rate of transition between the source and destination compartment. Each separate group of compartments in the vector are first summed, and then all entries of the vector are multiplied to get the rate modifier. For example, to specify that the transition rate depends on the product of the number of unvaccinated susceptible individuals and the total infected individuals (vaccinated and unvaccinated), we would write:

    To understand this term, consider the compartments written out as strings

    and then sum the terms in each group

    From here, we can say that the transition we are describing is proportional to S_unvaccinated and I_unvaccinated + I_vaccinated, i.e., the rate depends on the product S_unvaccinated * (I_unvaccinated + I_vaccinated).

For transitions that occur at a constant per-capita rate (i.e., E -> I at rate γ in an SEIR model), it is possible to simply write proportional_to: ["source"].

    Proportion exponent

    This is an exponent modifying each group of compartments that contribute to the rate. It is equivalent to the "order" term in chemical kinetics. For example, if the reaction rate for the model above depends linearly on the number of unvaccinated susceptible individuals but on the total infected individuals sub-linearly, for example to a power 0.9, we would write:

    or a power parameter alpha, which can be given a numeric value later:

    The (top level) length of the proportion_exponent vector must be the same as the (top level) length of the proportional_to vector, even if the desire of the user is to have the same exponent for all terms being multiplied together to get the rate.

    Summary

    Putting it all together, the model transition is specified as

    would correspond to the following model if expressed as an ordinary differential equation
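Using the example rate parameter beta and exponent alpha from above, and writing u and v for the unvaccinated and vaccinated strata, this single transition would contribute roughly

$$\frac{dS_u}{dt} = -\beta \, S_u \, (I_u + I_v)^{\alpha}, \qquad \frac{dI_u}{dt} = \beta \, S_u \, (I_u + I_v)^{\alpha}$$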

with rate parameter beta and exponent parameter alpha (we will describe how to use parameter symbols in the transitions and specify their numeric values separately in the section Specifying compartmental model parameters).

    Transition globs

We now explain a shorthand we have developed for specifying multiple transitions that have similar forms all at once, via transition globs. The basic idea is that for each component of the single transitions described above where a term corresponded to a single model compartment, we can instead specify one or more compartments. Similarly, multiple rate values can be specified at once, one for each involved compartment. From one transition glob, multiple individual transitions are created by broadcasting across the specified compartments.

    For transition globs, any time you could specify multiple arguments as a list, you may instead specify one argument as a non-list, which will be used for every broadcast. So [1,1,1] is equivalent to 1 if the dimension of that broadcast is 3.

    We continue with the same SI model example, where individuals are stratified by vaccination status, but expand it to allow infection to occur at different rates in vaccinated and unvaccinated individuals:

    A stratified SI model including vaccination

    Source

We allow one or more arguments to be specified for each compartment. So to specify the transitions out of both susceptible compartments (S_unvaccinated and S_vaccinated), we would use
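a sketch of this source field (one inner list per compartment axis, broadcasting over vaccination status):

source: [["S"], ["unvaccinated", "vaccinated"]]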

    Destination

    The destination variable should be the same shape as the source, and in the same relative order. So to specify a transition from S_unvaccinated to I_unvaccinated and S_vaccinated to I_vaccinated, we would write the destination as:
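Correspondingly, a sketch of the destination:

destination: [["I"], ["unvaccinated", "vaccinated"]]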

    If instead we wrote:

    we would have a transition from S_unvaccinated to I_vaccinated and S_vaccinated to I_unvaccinated.

    Rate

    The rate vector allows users to specify the rate constant for all the source -> destination transitions that are defined in a shorthand way, by instead specifying how the rate is altered depending on the compartment type. For example, the rate of transmission between a susceptible (S) and an infected (I) individual may vary depending on whether the susceptible individual is vaccinated or not and whether the infected individual is vaccinated or not. The overall rate constant is constructed by multiplying together or "broadcasting" all the compartment type-specific terms that are relevant to a given compartment.

    For example,
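a sketch consistent with the rates described below (an overall factor of 3 multiplied by a vaccination-specific factor):

rate: [[3], [0.6, 0.5]]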

    This would mean our transition from S_unvaccinated to I_unvaccinated would have a rate of 3 * 0.6 while our transition from S_vaccinated to I_vaccinated would have a rate of 3 * 0.5.

    The rate vector should be the same shape as source and destination and in the same relative order.

Note that if the desire is to make a model where the difference in the rate constants varies between compartment types in a more complicated way than multiplicatively, it would be better to specify separate transitions for each compartment type instead of using this shorthand.

    Proportional to

    The broadcasting here is a bit more complicated. In other cases, each broadcast is over a single component. However, in this case, we have a broadcast over a group of components. We allow a different group to be chosen for each broadcast.

    Again, let's unpack what it says. Since the broadcast is over groups, let's split the config back up

    into those groups

    From here, we can say that we are describing two transitions. Both occur proportionally to the same compartments: S_unvaccinated and the total number of infections (I_unvaccinated+I_vaccinated).

    If, for example, we want to model a situation where vaccinated susceptibles cannot be infected by unvaccinated individuals, we would instead write:

    Proportion exponent

    Similarly to rate and proportional_to, we provide an exponent for each component and every group across the broadcast. So we could for example use:

    The (top level) length of the proportion_exponent vector must be the same as the (top level) length of the proportional_to vector, even if the desire of the user is to have the same exponent for all terms being multiplied together to get the rate. Within each vector entry, the arrays must have the same length as the source and destination vectors.

    Summary

    Putting it all together, the transition glob

    is equivalent to the following transitions

    Warning

    We warn the user that with this shorthand, it is possible to specify large models with few lines of code in the configuration file. The more compartments and transitions you specify, the longer the model will take to run, and the more memory it will require.

    Specifying compartmental model parameters (seir::parameters)

    When the transitions of the compartmental model are specified as described above, they can either be entered as numeric values (e.g., 0.1) or as strings which can be assigned numeric values later (e.g., beta). We recommend the latter method for all but the simplest models, since parameters may recur in multiple transitions and so that parameter values may be edited without risk of editing the model structure itself. It also improves readability of the configuration files.

    Parameters can take on three types of values:

    • Fixed values

    • Value drawn from distributions

    • Values read from timeseries specified in a data file

    Specifying fixed parameter values

Parameters can be assigned values by using the value argument after their name and then simply stating their numeric value. For example, consider a config describing a simple SIR model with transmission rate β (beta) = 0.1/day and recovery rate γ (gamma) = 0.2/day. This could be specified as
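a minimal sketch of the corresponding parameters block:

seir:
  parameters:
    beta:
      value: 0.1
    gamma:
      value: 0.2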

    The full model section of the config could then read

    For the stratified SI model described above, this portion of the config would read

    If there are no parameter values that need to be specified (all rates given numeric values when defining model transitions), the seir::parameters section of the config can be left blank or omitted.

    Specifying parameters values from distributions

Parameter values can also be specified as random values drawn from a distribution, as a way of including uncertainty in parameters in the model output. In this case, every time the model is run independently, a new random value of the parameter is drawn. For example, to choose the same value of beta = 0.1 each time the model is run but to choose a random value of gamma with a mean on the log scale of e^(-1.6) = 0.2 and a standard deviation on the log scale of e^(0.2) = 1.2 (e.g., 1.2-fold variation):
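A hedged sketch of how this could be written (the lognorm distribution name and its meanlog/sdlog parameter keys are assumptions to be checked against the Distributions section):

seir:
  parameters:
    beta:
      value: 0.1
    gamma:
      value:
        distribution: lognorm
        meanlog: -1.6
        sdlog: 0.2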

Details on the possible distributions that are currently available, and how to specify their parameters, are provided in the Distributions section.

Note that understanding when a new parameter value is drawn from this distribution becomes more complicated when the model is run in Inference mode. In Inference mode, we distinguish model runs as occurring in different "slots" – i.e., completely independent model instances that could be run on different processing cores in a parallel computing environment – and different "iterations" of the model that occur sequentially when the model is being fit to data and update fitted parameters each time based on the fit quality found in the previous iteration. A new parameter value is only drawn from the above distribution once per slot. Within a slot, at each iteration during an inference run, the parameter is only changed if it is being fit and the inference algorithm decides to perturb it to test a possible improved fit. Otherwise, it maintains the same value no matter how many times the model is run within a slot.

    Specifying parameter values as timeseries from data files

    Sometimes, we want to be able to specify model parameters that have different values at different timepoints. For example, the relative transmissibility may vary throughout the year based on the weather conditions, or the rate at which individuals are vaccinated may vary as vaccine programs are rolled out. One way to do this is to instead specify the parameter values as a timeseries.

This can be done by providing a data file in .csv or .parquet format that has a list of values of the parameter for a corresponding timepoint and subpopulation name. One column should be date, which should have an entry for every calendar day of the simulation, with the first and last date corresponding to the start_date and end_date for the simulation specified in the header of the config. There should be another column for each subpopulation, where the column name is the subpop name used in other files and the values are the desired parameter values for that subpopulation for the corresponding day. If any day or subpopulation is missing, an error will occur. However, if you want all subpopulations to have the same parameter value for every day, then only a single column in addition to date is needed, which can have any name, and will be applied to every subpop.

    For example, for an SIR model with a simple two-province population structure where the relative transmissibility peaks on January 1 then decreases linearly to a minimal value on June 1 then increases linearly again, but varies more in the small province than the large province, the theta parameter could be constructed from the file seasonal_transmission_2pop.csv with contents including

    as a part of a configuration file with the model sections:

Note that there is an alternative way to specify time dependence in parameter values that is described in the Specifying time-varying parameter modifications section. That method allows the user to define intervention parameters that apply specific additive or multiplicative shifts to other parameter values for a defined time interval. Interventions are useful if the parameter doesn't vary frequently and if the value of the shift is unknown and it is desired to either sample over uncertainty in it or try to estimate its value by fitting the model to data. If the parameter varies frequently and its value or relative value over time is known, specifying it as a timeseries is more efficient.

Compartmental model parameters can have an additional attribute beyond value or timeseries, which is called stacked_modifier_method. This value is explained in the section on coding time-dependent parameter modifications (also known as "modifiers"), as it determines what happens when two different modifiers act on the same parameter at the same time (are they combined additively or multiplicatively?).
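For instance, a minimal sketch (reusing the beta and gamma parameters from the SIR examples on this page, and assuming the attribute is set directly alongside value) of how simultaneous modifiers of beta could be combined additively rather than multiplicatively:

seir:
  parameters:
    beta:
      value: 0.1
      stacked_modifier_method: "sum"   # simultaneous modifiers of beta are added
    gamma:
      value: 0.2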

| Config item | Required? | Type/Format | Description |
| --- | --- | --- | --- |
| value | either value or timeseries is required | numerical, or distribution | This defines the value of the parameter, as described above. |
| timeseries | either value or timeseries is required | path to a csv file | This defines a timeseries for each day, as above. |
| stacked_modifier_method | optional | string: sum, product, reduction_product | Determines how multiple modifiers acting on the same parameter at the same time are combined. |

    Specifying model simulation method (seir::integration)

    A compartmental model defined using the notation in the previous sections describes rules for classifying individuals in the population based on infection state dynamically, but does not uniquely specify the mathematical framework that should be used to simulate the model.

    Our framework allows for two major methods for implementing compartmental models of disease transmission:

    • ordinary differential equations, which are completely deterministic, operate in continuous time (consider infinitesimally small timesteps), and allow for arbitrary fractions of the population (i.e., not just discrete individuals) to move between model compartments

• discrete-time stochastic process, which tracks discrete individuals and produces random variation in the number of individuals transitioning between states for any given rate, and which allows transitions between states to occur only at discrete time intervals

The mathematics behind each implementation is described in the Model Description section.

| Config item | Required? | Type/format | Description |
| --- | --- | --- | --- |
| method | optional | string: rk4 (default), euler, stochastic | The algorithm used to simulate the model equations. If rk4, the model is simulated deterministically by numerical integration using a 4th order Runge-Kutta algorithm. If euler or stochastic, a discrete-time process is used, with steps proceeding either deterministically (at the average rate) or stochastically. In both of these cases, the algorithm ensures no compartment goes below zero for the requested time step. The --method (-m) command-line option can be used (see Command line inputs) to override this configuration option. |
| dt | optional | positive real number (default: 2) | The timestep used for the numerical integration or discrete-time stochastic update; for the rk4 method this is a reasonable value, but for other options it should be 0.2 or less. |

    For example, to simulate a model deterministically using the 4th order Runge-Kutta algorithm for numerical integration with a timestep of 1 day:

Alternatively, to simulate a model stochastically with a timestep of 0.1 days:

For any method, the results of the model will be more accurate when the timestep is smaller (i.e., output will more precisely match the mathematics of the model description and be invariant to the choice of timestep). However, the computing time required to simulate the model for a certain time range of interest increases with the number of timesteps required (i.e., with smaller timesteps). In our experience, the 4th order Runge-Kutta algorithm (for details see the Advanced section) is a very accurate method of numerically integrating such models and can handle timesteps as large as roughly a day for models whose maximum per capita transition rates are of a similar order of magnitude. However, both of the discrete-time engines require smaller timesteps to be accurate (around 0.1 for COVID-19-like dynamics in our experience).

$$\frac{\delta \text{S}_\text{unvaccinated}}{\delta t} = - \beta \, \text{S}_\text{unvaccinated}^1 \, (\text{I}_\text{unvaccinated}+\text{I}_\text{vaccinated})^{\alpha}$$

$$\frac{\delta \text{I}_\text{unvaccinated}}{\delta t} = \beta \, \text{S}_\text{unvaccinated}^1 \, (\text{I}_\text{unvaccinated}+\text{I}_\text{vaccinated})^{\alpha}$$

    Specifying data source and fitted variables

    inference settings

    iterations_per_slot

    do_inference

    gt_data_path

    With inference model runs, the number of simulations nsimulations refers to the number of final model simulations that will be produced. The filtering$simulations_per_slot setting refers to the number of iterative simulations that will be run in order to produce a single final simulation (i.e., number of simulations in a single MCMC chain).
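A minimal sketch of what this section of the config might look like, borrowing values from the older filtering example shown later on this page (the exact set of required items may vary):

inference:
  iterations_per_slot: 350              # number of iterations in each MCMC chain (slot)
  do_inference: TRUE                    # if FALSE, a single unperturbed run per slot
  gt_data_path: data/observed_data.csv  # "ground truth" data the model is calibrated to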

    Item
    Required?
    Type/Format
    Description


    inference::statistics options

    required options

    name

    aggregator

    period

    sim_var

    data_var

    likelihood

    The statistics specified here are used to calibrate the model to empirical data. If multiple statistics are specified, this inference is performed jointly and they are weighted in the likelihood according to the number of data points and the variance of the proposal distribution.

    Item
    Required?
    Type/Format
    Description


    optional options ?

    remove_na

    add_one

    gt_start_date

    gt_end_date

    Optional sections

    inference::hierarchical_stats_geo

The hierarchical settings specified here are used to group the inference of certain parameters together (similar to inference in "hierarchical" or "fixed/group effects" models). For example, users may desire to group all counties in a given state because they are geographically proximate and impacted by the same statewide policies. The effect should be to make these inferred parameters follow a normal distribution and to observe shrinkage among the variance in these grouped estimates.

    Item
    Required?
    Type/Format

    inference::priors

    It is now possible to specify prior values for inferred parameters. This will have the effect of speeding up model convergence.

    Item
    Required?
    Type/Format

    Ground truth data

    name

    module

    geo_group_col

    transform

inference::priors

    inference::

    Specifying observational model

This page describes how to specify the outcomes section of the configuration file.

    Thinking about outcomes variables

    Our pipeline allows users to encode state variables describing the infection status of individuals in the population in two different ways. The first way is via the state variables and transitions of the compartmental model of disease transmission, which are specified in the compartments and seir sections of the config. This model should include all variables that influence the natural course of the epidemic (i.e., all variables that feed back into the model by influencing the rate of change of other variables). For example, the number of infected individuals influences the rate at which new infections occur, and the number of immune individuals influences the number of individuals at risk of acquiring infection.

However, these intrinsic model variables may be difficult to observe in the real world and so directly comparing model predictions about the values of these variables to data might not make sense. Instead, the observable outcomes of infection may include only a subset of individuals in any state, and may only be observed with a time delay. Thus, we allow users to define new outcome variables that are functions of the underlying model variables. Commonly used examples include detected cases or hospitalizations.

    Variables should not be included as outcomes if they influence the infection trajectory. The choice of what variables to include in the compartmental disease model vs. the outcomes section may be very model specific. For example, hospitalizations due to infection could be encoded as an outcome variable that is some fraction of infections, but if we believe hospitalized individuals are isolated from the population and don't contribute to onward infection, or that the number of hospitalizations feeds back into the population's perception of risk of infection and influences everyone's contact behavior, this would not be the best choice. Similarly, we could include deaths due to infection as an outcome variable that is also some fraction of infections, but unless death is a very rare outcome of infection and we aren't worried about actually removing deceased individuals from the modeled populations, deaths should be in the compartmental model instead.

The outcomes section is not required in the config. However, there are benefits to including it, even if the only outcome variable is set to be equivalent to one of the infection model variables. If the compartmental model is complicated but you only want to visualize a few output variables, the outcomes output file will be much easier to work with. Outcome variables always occur with some fixed delay from their source infection model variable, which can be more convenient than the exponential distribution underlying the infection model. Outcome variables can be created to automatically sum over multiple compartments of the infection model, removing the need for post-processing code to do this. If the model is being fit to data, then the outcomes section is required, as only outcome variables can be compared to data.

As an example, imagine we are simulating an SIR-style model and want to compare it to real epidemic data in which cases of infection and hospital admissions are reported. Our model doesn't explicitly include hospitalization, but suppose we know that 1% of all infections eventually lead to hospitalization, and that hospitalization occurs on average 1 week after infection. We know that not all infections are reported as cases, and assume that only 50% are detected and are reported 2 days after infection begins. The model and outcomes sections of the config for these outcomes, which we call incidC (daily incidence of cases) and incidH (daily incidence of hospital admission), would be

In the following sections, we describe in more detail how this specification works.

    Specifying outcomes in the configuration file

The outcomes config section consists of a list of defined outcome variables (observables), which are defined by a user-created name (e.g., "incidH"). For each of these outcome variables, the user defines the source compartment(s) in the infectious disease model that they draw from and whether they draw from the incidence (new individuals entering into that compartment) or prevalence (total current individuals in that compartment). Each new outcome variable is always associated with two mandatory parameters:

    • probability of being counted in this outcome variable if in the source compartment

• delay between when an individual enters the source compartment and when they are counted in the outcome variable

    and one optional parameter

    • duration after entering that an individual is counted as part of the outcome variable

The value of the probability, delay, and duration parameters can be a single value or come from a distribution.

Outcome model parameters probability, delay, and duration can have an additional attribute beyond value, called modifier_key. This value is explained in the section on coding time-dependent parameter modifications (also known as "modifiers"), as it provides a way to have the same modifier act on multiple different outcomes.

Just like the case for compartment model parameters, when outcome parameters are drawn from a distribution, each time the model is run, a different value for this parameter will be drawn from this distribution, but that value will be used for all calculations within this model run. Note that understanding when a new parameter value is drawn from this distribution becomes more complicated when the model is run in Inference mode. In Inference mode, we distinguish model runs as occurring in different "slots" – i.e., completely independent model instances that could be run on different processing cores in a parallel computing environment – and different "iterations" of the model that occur sequentially when the model is being fit to data and update fitted parameters each time based on the fit quality found in the previous iteration. A new parameter value is only drawn from the above distribution once per slot. Within a slot, at each iteration during an inference run, the parameter is only changed if it is being fit and the inference algorithm decides to perturb it to test a possible improved fit. Otherwise, it would maintain the same value no matter how many times the model was run within a slot.

    Example

    Config item
    Required?
    Type/format
    Description

source

Required, unless the sum option is used instead. This sub-section describes the compartment(s) in the infectious disease model from which this outcome variable is drawn. Outcome variables can be drawn from the incidence of a variable - meaning that some fraction of new individuals entering the infection model state each day are chosen to contribute to the outcome variable - or from the prevalence, meaning that each day some fraction of individuals currently in the infection state are chosen to contribute to the outcome variable. Note that whatever the source type, the named outcome variable itself is always a measure of incidence.

To specify which compartment(s) contribute, the user must specify the state(s) within each model stratification. For stratifications not mentioned, the outcome will sum over the states in all strata.

For example, consider a configuration in which the compartmental model was constructed to track infection status stratified by vaccination status and age group. The following code would be used to create an outcome called incidH_child (incidence of hospitalization for children) and incidH_adult (incidence of hospitalization for adults) where some fraction of infected individuals would become hospitalized and we wanted to separately track pediatric vs adult hospitalizations, but did not care about tracking the vaccination status of hospitalized individuals as in reality it was not tracked by the hospitals:

To instead create an outcome variable for cases where on each day of infection there is some probability of testing positive (for example, for the situation of an asymptomatic infection where testing is administered totally randomly), the following code would be used:

The source of an outcome variable can also be a previously defined outcome variable. For example, to create a new variable for the number of individuals recruited to be part of a contact tracing program (incidT), which is just some fraction of diagnosed cases:

probability

Required, unless the sum option is used instead. Probability is the fraction of individuals in the source compartment who are counted as part of this outcome variable (if the source is incidence; if the source is prevalence, it is the fraction of individuals per day). It must be between 0 and 1.

Specifying the probability creates a parameter called outcome_name::probability that can be referred to in the outcome_modifiers section of the config. The value of this parameter can be changed using the probability::intervention_param_name option.

For example, to track the incidence of hospitalization when 5% of children but only 1% of adults infected require hospitalization, and to create a modifier_key such that both of these rates could be modified by the same amount during some time period using the outcome_modifiers section:

To track the incidence of diagnosed cases, iterating over uncertainty in the case detection rate (ranging from 20% to 30%), and naming this parameter "case_detect_rate":

Each time the model is run, a new random value for the probability of case detection will be chosen.

    Delay

Required, unless the sum option is used instead. delay is the time delay between when individuals are chosen from the source compartment and when they are counted as part of this outcome variable.

    For example, to track the incidence of hospitalization when 5% of children are hospitalized and hospitalization occurs 7 days after infection:

To iterate over uncertainty in the exact delay time, we could include some variation between simulations in the delay time using a normal distribution with a standard deviation of 2 (truncated to make sure the delay does not become negative). Note that a delay distribution here does not mean that the delay time varies between individuals – within a single simulation, the delay is identical for all individuals.

    Duration

By default, all outcome variables describe incidence (new individuals entering each day). However, they can also track an associated "prevalence" if the user specifies how long individuals remain classified in the state the outcome variable describes. This is the duration parameter.

When the duration parameter is set, a new outcome variable is automatically created and named with the name of the original outcome variable + "_curr". This name can be changed using the duration::name option.

    For example, to track the incidence and prevalence of hospitalization when 5% of children are hospitalized, hospitalization occurs 7 days after infection, and the duration of hospitalization is 3 days:

    which creates the variable "incidH_child_curr" to track all currently hospitalized children. Since it doesn't make sense to call this new outcome variable an incidence, as it is a prevalence, we could instead rename it:

    Sum

Optional. sum is used to create new outcome variables that are sums over other previously defined outcome variables.

If sum is included, then source, probability, delay, and duration will be ignored.

For example, to track new hospital admissions and current hospitalizations separately for children and adults, as well as for all ages combined:

    outcomes::settings

    There are other required and optional configuration items for the outcomes section which can be specified under outcomes::settings:

method: delayframe. This is the mathematical method used to create the outcome variable values from the transmission model variables. Currently, the only method supported is delayframe.

param_from_file: Optional, TRUE or FALSE. It is possible to allow any of the outcome variables to have values that vary across the subpopulations. For example, disease severity rates or diagnosis rates may differ by demographic group. In this case, all the outcome parameter values defined in outcomes::outcomes will represent baseline values, and then you can define a relative change from this baseline for any particular subpopulation using the file specified in param_subpop_file. If params_from_file: TRUE is specified, then these relative values will be read from the param_subpop_file. Otherwise, if params_from_file: FALSE or is not listed at all, all subpopulations will have the same values for the outcome parameters, defined below.

    param_subpop_file: Required if params_from_file: TRUE. The path to a .csv or .parquet file that contains the relative amount by which a given outcome variable is shifted relative to baseline in each subpopulation. File must contain the following columns:

• subpop: The subpopulation for which the parameter change applies. Must be a subpopulation defined in the geodata file. For example, small_province

    • parameter: The outcomes parameter which will be altered for this subpopulation. For example, incidH_child: probability

    • value: The amount by which the baseline value will be multiplied, for example, 0.75 or 1.1
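As an illustration, a hypothetical param_subpop_file for the two-province example used elsewhere on this page, in which the hospitalization probability for children is 25% lower than baseline in the small province but unchanged in the large province, might contain:

subpop, parameter, value
small_province, incidH_child: probability, 0.75
large_province, incidH_child: probability, 1.0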

    Examples

    Consider a disease described by an SIR model in a population that is divided into two age groups, adults and children, which experience the disease separately. We are interested in comparing the predictions of the model to real world data, but we know we cannot observe every infected individual. Instead, we have two types of outcomes that are observed.

    First, via syndromic surveillance, we have a database that records how many individuals in the population are experiencing symptoms from the disease at any given time. Suppose careful cohort studies have shown that 50% of infected adults and 80% of infected children will develop symptoms, and that symptoms occur in both age groups around 3 days after infection (following a log-normal distribution with log mean X and log standard deviation of Y). The duration that symptoms persist is also a variable, following a ...

    Secondly, via laboratory surveillance we have a database of every positive test result for the infection. We assume the test is 100% sensitive and specific. Only individuals with symptoms are tested, and they are always tested exactly 1 day after their symptom onset. We are unsure what portion of symptomatic individuals are seeking out testing, but are interested in considering two extreme scenarios: 95% of symptomatic individuals are tested, or only 75% of individuals are tested.

    The configuration file we could use to model this situation includes
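A sketch of what the outcomes section might look like for this situation is shown below. The outcome names, the log-normal delay parameters (log mean and log sd), and the symptom duration are illustrative placeholders (the original values were left unspecified above); only the child stratum and the 95% testing scenario are written out, and analogous incidSym_adult/incidTest_adult blocks (with a symptom probability of 0.5) and a 75% testing variant would be added in the same way:

outcomes:
  settings:
    method: delayframe
  outcomes:
    incidSym_child:                 # symptomatic children, from syndromic surveillance
      source:
        incidence:
          infection_state: "I"
          age_group: "child"
      probability:
        value: 0.8                  # 80% of infected children develop symptoms
      delay:
        value:
          distribution: lognorm
          logmean: 1.1              # placeholder: ~3-day average delay to symptom onset
          logsd: 0.2                # placeholder
      duration:
        value: 5                    # placeholder duration of symptoms, in days
        name: "sym_child_curr"
    incidTest_child:                # positive tests among symptomatic children
      source: incidSym_child
      probability:
        value: 0.95                 # testing scenario 1: 95% of symptomatic individuals tested
      delay:
        value: 1                    # tested exactly 1 day after symptom onset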

    Other configuration options

    Command line inputs

    flepiMoP allows some input parameters/options to be specified in the command line at the time of model submission, in addition to or instead of in the configuration file. This can be helpful for users who want to quickly run different versions of the model – typically a different number of simulations or a different intervention scenario from among all those specified in the config – without having to edit or create a new configuration file every time. In addition, some arguments can only be specified via the command line.

    In addition to the configuration file and the command line, the inputs described below can also be specified as environmental variables.

    In all cases, command line arguments override configuration file entries which override environmental variables. The order of command line arguments does not matter.

Details on how to run the model, including how to add command line arguments or environmental variables, are in the How to Run section.

    Command-line only inputs

    Argument
    Env. Variable
    Value type
    Description
    Required?
    Default

    Command-line versions of configuration file inputs

    Argument
    Config item
    Env. Variable
    Value type
    Description
    Required?
    Default

    Example

    As an example, consider running the following configuration file

To run this model directly in Python (it can alternatively be run from R; for all details see the How to Run section), we could use the command line entry:
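For instance, a minimal sketch of such a call, assuming the configuration file is saved as config.yml and that the command-line entry point is named gempyor-simulate (both names are assumptions here; the -c/--config flag is taken from the tables below):

gempyor-simulate -c config.yml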

    Alternatively, to run 100 simulations using only 4 of the available processors on our computer, but only running the "" scenario with a deterministic model, and to save the files as .csv (since the model is relatively simple), we could call the model using the command line entry
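A sketch of such a call, built only from the command-line options documented in the tables below; the entry point name and the scenario name are placeholders:

gempyor-simulate -c config.yml -n 100 -j 4 -s <scenario_name> --method rk4 --write-csv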

    Environmental variables

    TBA

    US-specific configuration file options

Things below here are very out of date. Put here as a placeholder but not updated recently.

    global: smh_round, setup_name, disease

    spatial_setup: census_year, modeled_states, state_level

    For US-specific population structures

    For creating US-based population structures using the helper script build_US_setup.R which is run before the main model simulation script, the following extra parameters can be specified

    Config Item
    Required?
    Type/Format
    Description

    Example 2

    To simulate an epidemic across all 50 states of the US or a subset of them, users can take advantage of built in machinery to create geodata and mobility files for the US based on the population size and number of daily commuting trips reported in the US Census.

    Before running the simulation, the script build_US_setup.R can be run to get the required population data files from online census data and filter out only states/territories of interest for the model. More details are provided in the How to Run section.

    This example simulates COVID-19 in the New England states, assuming no transmission from other states, using 2019 census data for the population sizes and a pre-created file for estimated interstate commutes during the 2011-2015 period.

    geodata.csv contains

    mobility_2011-2015_statelevel.csv contains

    importation section (optional)

This section is optional. It is used by the covidImportation package to import global air importation data for seeding infections into the United States.

    If you wish to include it, here are the options.

    Config Item
    Required?
    Type/Format
    Description

    importation::param_list

    Config Item
    Required?
    Type/Format
    Description

    report section

The report section is completely optional and provides settings for making an R Markdown report. For an example of a report, see the Supplementary Material of our preprint.

    If you wish to include it, here are the options.

    Config Item
    Required?
    Type/Format
    Description
    filtering:
      simulations_per_slot: 350
      do_filtering: TRUE
      gt_data_path: data/observed_data.csv
      likelihood_directory: importation/likelihood/
      statistics:
        sum_deaths:
          name: sum_deaths
          aggregator: sum ## function applied over the period
          period: "1 weeks"
          sim_var: incidD
          data_var: death_incid
          remove_na: TRUE
          add_one: FALSE
          likelihood:
            dist: sqrtnorm
            param: [.1]
        sum_confirmed:
          name: sum_confirmed
          aggregator: sum
          period: "1 weeks"
          sim_var: incidC
          data_var: confirmed_incid
          remove_na: TRUE
          add_one: FALSE
          likelihood:
            dist: sqrtnorm
            param: [.2]
      hierarchical_stats_geo:
        local_var_hierarchy:
          name: local_variance
          module: seir
          geo_group_col: USPS
          transform: none
        local_conf:
          name: probability_incidI_incidC
          module: hospitalization
          geo_group_col: USPS
          transform: logit
      priors:
        local_var_prior:
          name: local_variance
          module: seir
          likelihood:
            dist: normal
            param:
            - 0
            - 1
    compartments:
      infection_stage: ["S", "I", "R"]
      
    seir:
      transitions:
        # infection
        - source: [S]
          destination: [I]
          proportional_to: [[S], [I]]
          rate: [beta]
          proportion_exponent: 1
        # recovery
        - source: [I]
          destination: [R]
          proportional_to: [[I]]
          rate: [gamma]
          proportion_exponent: 1
      parameters:
        beta: 0.1
        gamma: 0.2
      integration:
         method: rk4
         dt: 1.00
    compartments:
      infection_stage: ["S", "I", "R"]
     compartments:
       infection_stage: ["S", "I", "R"]
       vaccination_status: ["unvaccinated", "vaccinated"]
    infection_stage, vaccination_status, compartment_name
    S,               unvaccinated,       S_unvaccinated
    I,               unvaccinated,       I_unvaccinated
    R,               unvaccinated,       R_unvaccinated
    S,               vaccinated,         S_vaccinated
    I,               vaccinated,         I_vaccinated
    R,               vaccinated,         R_vaccinated
     compartments:
       infection_stage: ["S", "I"]
       age_group: ["child", "adult"]
       vaccination_status: ["unvaccinated", "1dose", "2dose"]
    infection_stage, age_group, vaccination_status, compartment_name
    S,		 child,	    unvaccinated,	S_child_unvaccinated	
    I,		 child,	    unvaccinated,	I_child_unvaccinated
    S,		 adult,	    unvaccinated,	S_adult_unvaccinated
    I,		 adult,	    unvaccinated,	I_adult_unvaccinated
    S,		 child,	    1dose,		S_child_1dose
    I,		 child,	    1dose,		I_child_1dose
    S,		 adult,     1dose,		S_adult_1dose
    I,		 adult,     1dose,		I_adult_1dose
    S,		 child,     2dose,		S_child_2dose	
    I,		 child,     2dose,		I_child_2dose
    S,		 adult,	    2dose,		S_adult_2dose
    I,		 adult,	    2dose,		I_adult_2dose
    compartments:
       overall_state: ["S_child", "I_child", "S_adult_unvaccinated", "I_adult_unvaccinated", "S_adult_1dose", "I_adult_1dose", "S_adult_2dose", "I_adult_2dose"]
    [S,unvaccinated]
    [I,unvaccinated]
    5
    beta
    [[[S,unvaccinated]], [[I,unvaccinated], [I, vaccinated]]]
    [[S_unvaccinated], [I_unvaccinated, I_vaccinated]]
    [S_unvaccinated, I_unvaccinated + I_vaccinated]
    [1, 0.9]
    [1, alpha]
    source: [S, unvaccinated]
    destination: [I, unvaccinated]
    proportional_to: [[[S,unvaccinated]], [[I,unvaccinated], [I,vaccinated]]]
    rate: [5]
    proportion_exponent: [1, 0.9]
    [[S], [unvaccinated,vaccinated]]
    [[I], [unvaccinated,vaccinated]]
    [[I], [vaccinated,unvaccinated]]
    rate: [[3], [0.6,0.5]]
    [
      [[S,unvaccinated], [S,vaccinated]],
      [[I,unvaccinated],[I, vaccinated]], [[I,unvaccinated],[I, vaccinated]]
    ]
    [
      [S,unvaccinated],
      [[I,unvaccinated],[I, vaccinated]]
    ]
    [
      [S,vaccinated],
      [[I,unvaccinated],[I, vaccinated]]
    ]
    [
      [[S,unvaccinated], [S,vaccinated]],
      [[I,unvaccinated],[I, vaccinated]], [[I, vaccinated]]
    ]
    [[1,1], [0.9,0.8]]
    seir:
      transitions:
        source: [[S],[unvaccinated,vaccinated]]
        destination: [[I],[unvaccinated,vaccinated]]
        proportional_to: [
                           [[S,unvaccinated], [S,vaccinated]],
                           [[I,unvaccinated],[I, vaccinated]], [[I, vaccinated]]
                         ]
        rate: [[3], [0.6,0.5]]
        proportion_exponent: [[1,1], [0.9,0.8]]
    seir:
      transitions:
        - source: [S,unvaccinated]
          destination: [I,unvaccinated]
          proportional_to: [[[S,unvaccinated]], [[I,unvaccinated],[I, vaccinated]]]
          proportion_exponent: [1 * 0.9]
          rate: [3*0.6]
        - source: [S,vaccinated]
          destination: [I,vaccinated]
          proportional_to: [[[S,vaccinated]], [[I, vaccinated]]]
          proportion_exponent: [1 * 0.8]
          rate: [3*0.5]
    seir:
      parameters:
        beta: 
          value: 0.1
        gamma: 
          value: 0.2
    compartments:
      infection_state: ["S", "I", "R"]
      
    seir:
      transitions:
        # infection
        - source: [S]
          destination: [I]
          proportional_to: [[S], [I]]
          rate: [beta]
          proportion_exponent: 1
        # recovery
        - source: [I]
          destination: [R]
          proportional_to: [[I]]
          rate: [gamma]
          proportion_exponent: [1,1]
      parameters:
        beta: 
          value: 0.1
        gamma: 
          value: 0.2
    compartments:
      infection_stage: ["S", "I", "R"]
      vaccination_status: ["unvaccinated", "vaccinated"]
      
    seir:
      transitions:
        source: [[S],[unvaccinated,vaccinated]]
        destination: [[I],[unvaccinated,vaccinated]]
        proportional_to: [
                           [[S,unvaccinated], [S,vaccinated]],
                           [[I,unvaccinated],[I, vaccinated]], [[I, vaccinated]]
                         ]
        rate: [[beta], [theta_u,theta_v]]
        proportion_exponent: [[1,1], [alpha_u,alpha_v]]
      parameters:
        beta: 
          value: 0.1
        theta_u: 
          value: 0.6
        theta_v: 
          value: 0.5
        alpha_u: 
          value: 0.9
        alpha_v: 
          value: 0.8
    seir:
      parameters:
        beta: 
          value:
            distribution: fixed
            value: 0.1
        gamma: 
          value:
            distribution: lognorm
            logmean: -1.6
            logsd: 0.2
    date,        small_province,    large_province
    2022-01-01,  1.5,               1.3
    .....
    2022-05-01,  0.5,               0.7 
    ....
    2022-12-31,  1.5,               1.3
    compartments:
      infection_stage: ["S", "I", "R"]
    
    seir:
      transitions:
        # infection
        - source: [S]
          destination: [I]
          proportional_to: [[S], [I]]
          rate: [beta*theta]
          proportion_exponent: 1
        # recovery
        - source: [I]
          destination: [R]
          proportional_to: [[I]]
          rate: [gamma]
          proportion_exponent: 1
      parameters:
        beta: 
          value: 0.1
        gamma: 
          value: 0.2
        theta:
           timeseries: data/seasonal_transmission.csv
    seir:
      integration:
         method: rk4
         dt: 1.00
    seir:
      integration:
         method: stochastic
         dt: 0.1

    folder path where likelihood evaluations will be stored as the inference algorithm runs

    statistics

    required

    specifies which data will be used to calibrate the model. see filtering::statistics for details

    hierarchical_stats_geo

    optional

    specifies whether a hierarchical structure should be applied to any inferred parameters. See filtering::hierarchical_stats_geo for details.

    priors

    optional

    specifies prior distributions on inferred parameters. See filtering::priors for details

    column name where model data can be found, from the hospitalization outcomes files

    data_var

    required

    column where data can be found in data_path file

    remove_na

    required

    logical

    add_one

    required

    logical, TRUE if evaluating the log likelihood

    likelihood::dist

    required

    distribution of the likelihood

    likelihood::param

    required

    parameter value(s) for the likelihood distribution. These differ by distribution so check the code in inference/R/functions.R/logLikStat function.

    geodata column name that should be used to group parameter estimation

    transform

    required

    type of transform that should be applied to the likelihood: "none" or "logit"

    specifies the distribution of the prior

    This option defines the method used when modifiers are applied. The default is product.

    rolling_mean_windows

    optional

    integer

    The size of the rolling mean window if a rolling mean is applied.

    Other Configuration Options

    Code structure

    Files where these algorithms are contained

    COVIDScenarioPipeline

    /R/scripts

    • filter_MC.R

    • full_filter.R

    • build_US_setup.R

    • build_nonUS_setup.R

    /R/pkgs

    • inference

      • groundtruth.R

      • functions.R

      • filter_MC_runner_funcs.R

    Inference Implementation

    Reporting

    Advanced

    (Any more advanced mathematical or computational methods used, or possible configuration options, that only specialized users would need to change)

    Setting up the model and post-processing

    Setting up the model and post-processing data

    required

    config subsection

    Specifies details of how each model output variable will be compared to data during fitting. See inference::statistics section.

    hierarchical_stats_geo

    optional

    config subsection

    Specifies whether a hierarchical structure should be applied the likelihood function for any of the fitted parameters. See inference::hierarchical_stats_geo for details.

    priors

    optional

    config subsection

    Specifies prior distributions on fitted parameters. See inference::priors for details

    sim_var

    required

    string

Name of the outcome variable - as defined in the outcomes section of the config - that will be compared to data when calculating the likelihood. This will also be the column name of this variable in the hosp files in the model_output directory

    data_var

    required

    string

Name of the data variable that will be compared to the model output variable when calculating the likelihood. This should be the name of a column in the file specified in the inference::gt_data_path config option

    remove_na

    required

    logical

Logical; whether NA values in the data should be removed before the likelihood is calculated (TRUE) or kept (FALSE)

    add_one

    required

    logical

Logical; whether a value of 1 should be added to the data and model values before the likelihood is calculated (TRUE) or not (FALSE). Will be overwritten to TRUE if the likelihood distribution is chosen to be log

    likelihood::dist

    required

    Distribution of the likelihood

    likelihood::param

    required

    parameter value(s) for the likelihood distribution. These differ by distribution so check the code in inference/R/functions.R/logLikStat function.

    transform

    required

    type of transform that should be applied to the likelihood: "none" or "logit"

    iterations_per_slot

    required

Integer ≥ 1

    Number of iterations in a single MCMC inference chain

    do_inference

    required

    TRUE/FALSE

    TRUE if inference should be performed. If FALSE, just runs a single run per slot, without perturbing parameters

    gt_data_path

    required

    file path

    Path to files containing "ground truth" data to which model output will be compared

    name

    required

    string

    name of statistic, user defined

    period

    required

    days, weeks, or months

    Duration of time over which data and model output should be aggregated before being used in the likelihood. If weeks, epiweeks are used

    aggregator

    required

    string, name of any R function

    scenario name

    required

    name of hierarchical scenario, user defined

    name

    required

    name of the estimated parameter that will be grouped (e.g., the NPI scenario name or a standardized, combined health outcome name like probability_incidI_incidC)

    module

    required

    name of the module where this parameter is estimated (important for finding the appropriate files)

    geo_group_col

    required

    geodata column name that should be used to group parameter estimation

    scenario name

    required

    name of prior scenario, user defined

    name

    required

    name of NPI scenario or parameter that will have the prior

    module

    required

    name of the module where this parameter is estimated

    likelihood

    required

    specifies the distribution of the prior

    statistics

Function used to aggregate data over the period, usually sum or mean

    duration

    No

    value or distribution

The duration of time an individual remains counted within the named outcome variable

    sum

    No

    List

    A list of other outcome variables to sum into the current outcome variable

    source

    Yes

    Varies

    The infection model variable or outcome variable from which the named outcome variable is created

    probability

    Yes, unless sum option is used instead

    value or distribution

    The probability that an individual in the source variable appears in the named outcome variable

    delay

    Yes, unless sum option is used instead

    value or distribution


The time delay between an individual's appearance in the source variable and their appearance in the named outcome variable

    compartments:
      infection_stage: ["S", "I", "R"]
      
    seir:
      transitions:
        # infection
        - source: [S]
          destination: [I]
          proportional_to: [[S], [I]]
          rate: [beta]
          proportion_exponent: 1
        # recovery
        - source: [I]
          destination: [R]
          proportional_to: [[I]]
          rate: [gamma]
          proportion_exponent: 1
      parameters:
        beta: 
          value: 0.2
        gamma: 
          value: 0.1
    
    outcomes:
      settings:
        method: delayframe
      outcomes:
        incidC:
          source:
            incidence:
              infection_stage: "I"
          probability: 
            value: 0.5
          delay: 
            value: 2
        incidH:
          source:
            incidence:
              infection_stage: "I"
          probability: 
            value: 0.01
          delay: 
            value: 21 
     compartments:
       infection_state: ["S", "I", "R"]
       age_group: ["child", "adult"]
       vaccination_status: ["unvaxxed", "vaxxed"]
       
    outcomes:
      incidH_child:
        source:
          incidence:
            infection_state: "I"
            age_group: "child"
        ...
      incidH_adult:
        source:
          incidence:
            infection_state: "I"
            age_group: "adult"
        ...
      incidH_all:
        source:
          incidence:
            infection_state: "I"
        ...
     compartments:
       infection_state: ["S", "I", "R"]
       age_group: ["child", "adult"]
       vaccination_status: ["unvaxxed", "vaxxed"]
       
    outcomes:
      incidC:
        source:
          prevalence:
            infection_state: "I"
        ...
    outcomes:
      incidC:
        source:
          prevalence:
            infection_state: "I"
        ...
      incidT:
        source: incidC
        ...
    outcomes:
      incidH_child:
        source:
          incidence:
            infection_state: "I"
            age_group: "child"
        probability: 
          value: 0.05
          modifier_key: hosp_rate
      incidH_adult:
        source:
          incidence:
            infection_state: "I"
            age_group: "adult"
        probability: 
          value: 0.01
          modifier_key: hosp_rate
    outcomes:
      incidC:
        source:
          prevalence:
            infection_state: "I"
        probability:
          value:
            distribution: uniform
            low: 
              value: 0.2
            high: 
              value: 0.3
          intervention_param_name: "case_detect_rate"
    outcomes:
      incidH_child:
        source:
          incidence:
            infection_state: "I"
            age_group: "child"
        probability: 
          value: 0.05
        delay: 
          value: 7
    outcomes:
      incidH_child:
        source:
          incidence:
            infection_state: "I"
            age_group: "child"
        probability: 
          value: 0.05
        delay: 
          value: 
            distribution: truncnorm
            mean: 7
            sd: 2
            a: 0
            b: Inf
    outcomes:
      incidH_child:
        source:
          incidence:
            infection_state: "I"
            age_group: "child"
        probability: 
          value: 0.05
        delay: 
          value: 7
        duration: 
          value: 3
    outcomes:
      incidH_child:
        source:
          incidence:
            infection_state: "I"
            age_group: "child"
        probability: 
          value: 0.05
        delay: 
          value: 7
        duration: 
          value: 3
          name: "hosp_child_curr"
    outcomes:
      incidH_child:
        source:
          incidence:
            infection_state: "I"
            age_group: "child"
        probability: 0.05
        delay: 6
        duration: 
          value: 14
          name: "hosp_child_curr"
      incidH_adult:
        source:
          incidence:
            infection_state: "I"
            age_group: "adult"
        probability: 0.01
        delay: 8
        duration:
          value: 7
          name: "hosp_adult_curr"
      incidH_total: 
        sum: ["incidH_child","incidH_adult"]
      hosp_curr_total:   
        sum: ["hosp_child_curr","hosp_adult_curr"]
    

    No

    1

    -j or --jobs

    FLEPI_NJOBS

integer ≥ 1

Number of parallel processors used to run the simulation. If there are more slots than jobs, slots will be divided up between processors and run in series on each.

    No

    Number of processors on the computer used to run the simulation

--interactive or --batch

    NA

    Choose either option

    Run simulation in interactive or batch mode

    No

    batch

    --write-csv or --no-write-csv

    NA

    Choose either option

    Whether model output will be saved as .csv files

    No

    no_write_csv

    --write-parquet or --no-write-parquet

    NA

    Choose either option

Whether model output will be saved as .parquet files (a compressed representation that can be opened and manipulated with minimal memory; may be required for large simulations).

    No

    write_parquet

    FLEPI_NUM_SLOTS

integer ≥ 1

    Number of independent simulations of the model to be run

    No

    Config value

    --method or -m

    seir: integration: method

    `rk4`, `euler`, or `stochastic`

    If provided, will override the `seir::integration::method` (including the default, if unspecified in the configuration file)

    No

    Config value if present, otherwise `rk4`

    --in-id

    FLEPI_RUN_INDEX

    string

    Unique ID given to the model runs. If the same config is run multiple times, you can avoid the output being overwritten by using unique model run IDs.

    No

    Constructed from current date and time as YYYY.MM.DD.HH/MM/SS

    --out-id

    FLEPI_RUN_INDEX

    string

    Unique ID given to the model runs. If the same config is run multiple times, you can avoid the output being overwritten by using unique model run IDs.

    No

    Constructed from current date and time as YYYY.MM.DD.HH/MM/SS

    dest_type

    required

    categorical

    location type

    dest_country

    required

    string (Country)

    ISO3 code for country of importation. Currently only USA is supported

    aggregate_to

    required

    categorical

    location type to aggregate to

    cache_work

    required

    boolean

    whether to save case data

    update_case_data

    required

    boolean

deprecated; whether to update the case data or use saved data

    draw_travel_from_distribution

    required

    boolean

    whether to add additional stochasticity to travel data; default is FALSE

    print_progress

    required

    boolean

    whether to print progress of importation model simulations

    travelers_threshold

    required

    integer

    include airports with at least the travelers_threshold mean daily number of travelers

    airport_cluster_distance

    required

    numeric

    cluster airports within airport_cluster_distance km

    param_list

    required

    See section below

    see below

    inf_period_nohosp_sd

    required

    numeric

    infectious period, non-hospitalized, sd

    inf_period_hosp_mean_log

    required

    numeric

    infectious period, hospitalized, log-normal mean

    inf_period_hosp_sd_log

    required

    numeric

    infectious period, hospitalized, log-normal sd

    p_report_source

    required

    numeric

    reporting probability, Hubei and elsewhere

    shift_incid_days

    required

    numeric

    mean delay from infection to reporting of cases; default = -10

    delta

    required

    numeric

days per estimation period

    formatting::scenario_labels

    list of strings; one for each scenario in interventions::scenarios

    formatting::scenario_colors

    list of strings; one for each scenario in interventions::scenarios

    formatting::pdeath_labels

    list of strings

    formatting::display_dates

    list of dates

    formatting::display_dates2

    optional

    list of dates

    a 2nd string of display dates that can optionally be supplied to specific report functions

    -c or --config

    CONFIG_PATH

    file path

    Name of configuration file. Must be located in the current working directory, or else relative or absolute file path must be provided.

    Yes

    NA

    -i or --first_sim_index

    FIRST_SIM_INDEX

integer ≥ 1

    -s or --npi_scenario

    interventions: scenarios

    FLEPI_NPI_SCENARIOS

    list of strings

    Names of the intervention scenarios described in the config file that will be run. Must be a subset of scenarios defined.

    No

    All scenarios described in config

    -n or --nslots

    census_year

    optional

    integer (year)

    Determines the year for which census population size data is pulled.

    state_level

    optional

    boolean

    Determines whether county-level population-size data is instead grouped into state-level data (TRUE). Default FALSE

    modeled_states

    optional

    list of location codes

    census_api_key

    required

    string

    get an API key

    travel_dispersion

    required

    number

how dispersed daily travel data is; default = 3.

    maximum_destinations

    required

    integer

    incub_mean_log

    required

    numeric

    incubation period, log mean

    incub_sd_log

    required

    numeric

    incubation period, log standard deviation

    inf_period_nohosp_mean

    required

    numeric

    data_settings::pop_year

    integer

    plot_settings::plot_intervention

    boolean

    formatting::scenario_labels_short

    list of strings; one for each scenario in interventions::scenarios


    The index of the first simulation

    nslots

    A vector of locations that will be modeled; others will be ignored

    number of airports to limit importation to

    infectious period, non-hospitalized, mean

    (OLD) Configuration setup

    Need to add MultiPeriodModifier and hospitalization interventions

    Overview

This documentation describes the new YAML configuration file options that may be used when performing inference on model runs. As compared to previous model releases, there are additions to the seeding and interventions sections, the outcomes section replaces the hospitalization section, and a filtering section is added to the file.

Importantly, we now name our pipeline modules: seeding, seir, hospitalization; this becomes relevant to some of the new filtering specifications.

    Models may be calibrated to any available time series data that is also an outcome of the model (COVID-19 confirmed cases, deaths, hospitalization or ICU admissions, hospital or ICU occupancy, and ventilator use). Our typical usage has calibrated the model to deaths, confirmed cases, or both. We can also perform inference on intervention effectiveness, county-specific baseline R0, and the risk of specific health outcomes.

    We describe these options below and present default values in the example configuration sections.

    Modifications to seeding

    The model can perform inference on the seeding date and initial number of seeding infections in each subpop. An example of this new config section is:

    Config Item
    Required?
    Type/Format
    Description

The method for determining the proposal distribution for the seeding amount is hard-coded in the inference package (R/pkgs/inference/R/functions/perturb_seeding.R). It is perturbed with a normal distribution where the mean of the distribution is 10 times the number of confirmed cases on a given date and the standard deviation is 1.

    Modifications to interventions

    The model can perform inference on the effectiveness of interventions as long as there is at least some calibration health outcome data that overlaps with the intervention period. For example, if calibrating to deaths, there should be data from time points where it would be possible to observe deaths from infections that occurred during the intervention period (e.g., assuming 10-18 day delay between infection and death, on average).

    An example configuration file where inference is performed on scenario planning interventions is as follows:

    interventions::settings::[setting_name]

    This configuration allows us to infer subpop-level baseline R0 estimates by adding a local_variance intervention. The baseline subpop-specific R0 estimate may be calculated as where R0 is the baseline simulation R0 value, and local_variance is an estimated subpop-specific value.

    Interventions may be specified in the same way as before, or with an added perturbation section that indicates that inference should be performed on a given intervention's effectiveness. As previously, interventions with perturbations may be specified for all modeled locations or for explicit subpop. In this setup, both the prior distribution and the range of the support of the final inferred value are specified by the value section. In the configuration above, the inference algorithm will search 0 to 0.9 for all subpops to estimate the effectiveness of the stayhome intervention period. The prior distribution on intervention effectiveness follows a truncated normal distribution with a mean of 0.6 and a standard deviation of 0.3. The perturbation section specifies the perturbation/step size between the previously-accepted values and the next proposal value.

    Item
    Required?
    Type/Format

    New outcomes section

    This section is now structured more like the interventions section of the config, in that it has scenarios and settings. We envision that separate scenarios will be specified for each IFR assumption.

    Item
    Required?
    Type/Format

    outcomes::settings::[setting_name]

    The settings for each scenario correspond to a set of different health outcome risks, most often just differences in the probability of death given infection (Pr(incidD|incidI)) and the probability of hospitalization given infection (Pr(incidH|incidI)). Each health outcome risk is referenced in relation to the outcome indicated in source. For example, the probability and delay in becoming a confirmed case (incidC) is most likely to be indexed off of the number and timing of infection (incidI).

    Importantly, we note that incidI is automatically defined from the SEIR transmission model outputs, while the other compartment sources must be defined in the config before they are used.

Users must specify two metrics for each health outcome, probability and delay, while a duration is optional (e.g., duration of time spent in the hospital). It is also optional to specify a perturbation section (similar to perturbations specified in the NPI section) for a given health outcome and metric. If you want to perform inference (i.e., if perturbation is specified) on a given metric, that metric must be specified as a distribution (i.e., not fixed) and the range of support for the distribution represents the range of parameter space explored in the inference.

    Item
    Required?
    Type/Format

    New filtering section

    This section configures the settings for the inference algorithm. The below example shows the settings for some typical default settings, where the model is calibrated to the weekly incident deaths and weekly incident confirmed cases for each subpop. Statistics, hierarchical_stats_geo, and priors each have scenario names (e.g., sum_deaths, local_var_hierarchy, and local_var_prior, respectively).

    filtering settings

    With inference model runs, the number of simulations nsimulations refers to the number of final model simulations that will be produced. The filtering$simulations_per_slot setting refers to the number of iterative simulations that will be run in order to produce a single final simulation (i.e., number of simulations in a single MCMC chain).

    Item
    Required?
    Type/Format

    filtering::statistics

    The statistics specified here are used to calibrate the model to empirical data. If multiple statistics are specified, this inference is performed jointly and they are weighted in the likelihood according to the number of data points and the variance of the proposal distribution.

    Item
    Required?
    Type/Format

    filtering::hierarchical_stats_geo

    The hierarchical settings specified here are used to group the inference of certain parameters together (similar to inference in "hierarchical" or "fixed/group effects" models). For example, users may desire to group all counties in a given state because they are geographically proximate and impacted by the same statewide policies. The effect should be to make these inferred parameters follow a normal distribution and to observe shrinkage among the variance in these grouped estimates.
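
    As a sketch, the following groups the inferred local_variance values of all subpopulations within the same US state (using the USPS column of the geodata file):

    filtering:
      hierarchical_stats_geo:
        local_var_hierarchy:
          name: local_variance
          module: seir
          geo_group_col: USPS
          transform: none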

    Item
    Required?
    Type/Format

    filtering::priors

    It is now possible to specify prior values for inferred parameters. This will have the effect of speeding up model convergence.
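
    For example, a normal(0, 1) prior on the local_variance parameter could be specified as:

    filtering:
      priors:
        local_var_prior:
          name: local_variance
          module: seir
          likelihood:
            dist: normal
            param:
              - 0
              - 1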

    Item
    Required?
    Type/Format

    Model Output

    (This section describes the location and contents of each of the output files produced during a non-inference model run)

    The model will output 2–6 different types of files depending on whether the configuration file contains optional sections (such as interventions, outcomes, and outcome interventions) and whether model inference is conducted.

    These files contain the values of the variables for both the infection and (if included) observational model at each point in time and for each subpopulation. A new file of the same type is produced for each independent simulation and each intervention scenario. Other files report the values of the initial conditions, seeding, and model parameters for each subpopulation and independent simulation (since parameters may be chosen to vary randomly between simulations). When model inference is run, there are also file types reporting the model likelihood (relative to the provided data) and files for each iteration of the inference algorithm.

    Within the model_output directory in the project's directory, the files will be organized into folders named for the file types: seir, spar, snpi, hpar, hnpi, seed, init, or llik (see descriptions below). Within each file type folder, files will further be organized by the simulation name (setup_name in config), the modifier scenario names - if scenarios exist for either seir or outcome parameters (specified with seir_modifiers::scenarios and outcome_modifiers::scenarios in config), and the run_id (the date and time of the simulation, by default). For example:
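
    A sketch of this folder layout (names in braces are filled in from the config):

    flepiMoP/examples/tutorials
    ├── model_output
    │   ├── {setup_name}_{seir_modifier_scenario}_{outcome_modifier_scenario}
    │   │   └── run_id
    │   │       └── seir
    │   │           └── 000000001.run_id.seir.parquet
    │   ├── spar
    │   ├── snpi
    │   └── ...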

    The name of each individual file contains (in order) the slot, run_id, and file type. The first index indicates the slot (chain, in MCMC language). If multiple iterations or blocks are run, the filename will look like 000000001.000000001.000000001.run_id.seir.parquet, indicating slot.block.iteration.

    Each file is a data table that is by default saved as a parquet file (a compressed representation that can be opened and manipulated with minimal memory), but can alternatively be saved as a csv file. See options for specifying the output file type in Other Configuration Options.

    The example file outputs we show were generated with the following configuration file:

    The types and contents of the model output files change slightly depending on whether the model is run as a forward simulation only, or is run in inference mode, in which parameter values are estimated by comparing the model to data. Output specific to model inference is described in a separate section below.

    SEIR (infection model output)

    Files in the seir folder contain the output of the infection model over time. They contain the value of every variable for each day of the simulation for every subpopulation.

    For the example configuration file shown above, the seir file is

    The meanings of the columns are:

    mc_value_type – either prevalence or incidence. Variable values are reported both as a prevalence (number of individuals in that state measured instantaneously at the start of the day, equivalent to the meaning of the S, I, or R variable in the differential equations or their stochastic representation) and as incidence (total number of individuals who newly entered this state, from all other states, over the course of the 24-hour period comprising that calendar day).

    mc_infection_stage, mc_vaccination_status, etc. – The name of the compartment for which the value is reported, broken down into one column per state type (e.g., infection stage, vaccination status, age).

    mc_name – The name of the compartment for which the value is reported, which is a concatenation of the compartment status in each state type.

    subpop_1, subpop_2, etc. – one column for each different subpopulation, containing the value of the number of individuals in the described compartment in that subpopulation at the given date. Note that these are named after the nodenames defined by the user in the geodata file.

    date – The calendar date in the simulation, in YYYY-MM-DD format.

    There will be a separate seir file output for each slot (independent simulation) and for each iteration of the simulation if model inference is conducted.

    SPAR (infection model parameter values)

    The files in the spar folder contain the parameters that define the transitions in the compartmental model of disease transmission, defined in the seir::parameters section of the config.

    The value column gives the numerical values of the parameters defined in the corresponding column parameter.

    SNPI (infection model parameter intervention values)

    Files in the snpi folder contain the time-dependent modifications to the transmission model parameter values (defined in seir_modifiers section of the config) for each subpopulation. They contain the modifiers that apply to a given subpopulation and the dates within which they apply, and the value of the reduction to the given parameter.

    The meanings of the columns are:

    subpop – The subpopulation to which this intervention parameter applies.

    modifier_name – The name of the intervention parameter.

    start_date – The start date of this intervention, as defined in the configuration file.

    end_date – The end date of this intervention, as defined in the configuration file.

    parameter – The parameter to which the intervention applies, as defined in the configuration file.

    value – The size of the modifier to the parameter either from the config, or fit by inference if that is run.

    HPAR (observation model parameter values)

    Files in the hpar folder contain the output parameters of the observational model. They contain the values of the probabilities, delays or durations for each outcome in a given subpopulation.

    The meanings of the columns are:

    subpop – Values in this column are the names of the nodes as defined in the geodata file given by the user.

    quantity – The values in this column are the types of parameter values described in the config. The options are probability, delay, and duration. These are the quantities to which there is some parameter defined in the config.

    outcome – The values here are the outcomes to which this parameter applies. These are names of the outcome compartments defined in the model.

    value – The values in this column are the parameter values of the quantity that apply to the given subpopulation and outcome.

    HOSP (observation model output)

    Files in the hosp folder contain the output of the observational (outcomes) model over time. They contain the value of every outcome variable for each day of the simulation for every subpopulation.

    Columns are:

    date – The calendar date in the simulation, in YYYY-MM-DD format.

    subpop – Values in this column are the names of the nodes as defined in the geodata file given by the user.

    outcome_variable_1, outcome_variable_2, ... - one column for each different outcome variable as defined in the config, containing the value of that outcome variable in that subpopulation at the given date.

    HNPI (observation model parameter intervention values)

    Files in the hnpi folder contain any parameter modifier values that apply to the outcomes model, defined in the outcome_modifiers section of the config. They contain the values of the outcome parameter modifiers, and the dates to which they apply in a given subpopulation.

    The meanings of the columns are:

    subpop – The values of this column are the names of the nodes from the geodata file.

    modifier_name – The names/labels of the modifier parameters, defined by the user in the config file, which applies to the given node and time period.

    start_date – The start date of this intervention, as defined in the configuration file.

    end_date – The end date of this intervention, as defined in the configuration file.

    parameter – The outcome parameter to which the intervention applies.

    value – The values in this column are the modifier values of the intervention parameters, which apply to the given parameter in a given subpopulation. Note that these are strictly reductions; thus a negative value corresponds to an increase in the parameter, while a positive value corresponds to a decrease in the parameter.

    SEED (model seeding values)

    Files in the seed folder contain the seeded values of the infection model. They contain the amounts seeded into each variable, the variable they are seeded from, and the time at which the seeding occurs. The user can provide a single seeding file (which will be used across all simulations), or, if multiple simulations are being run, a separate file for each simulation.

    The meanings of the columns are:

    subpop - The values of this column are the names of the nodes from the geodata file.

    date - The values in this column are the dates of seeding.

    amount - The amount seeded in the given subpopulation from source variables to destination variables, at the given date.

    source_infection_stage, source_vaccination_status, etc. - The name of the compartment from which the amount is seeded, broken down into one column per state type (e.g., infection stage, vaccination status, age).

    destination_infection_stage, destination_vaccination_status, etc. - The name of the compartment into which the amount is seeded, broken down into one column per state type (e.g., infection stage, vaccination status, age).

    no_perturb - The values in this column can be either true or false. If true, the amount and date of this seeding event will not be perturbed during an inference run; otherwise, whether the amount or date is perturbed is defined in the config using perturb_amount and perturb_date.
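
    A minimal illustrative seeding file (values and subpopulation names are hypothetical; the source/destination columns depend on the compartment structure of your model) might look like:

    subpop, date, amount, source_infection_stage, destination_infection_stage, no_perturb
    large_province, 2020-02-01, 5, S, E, FALSE
    small_province, 2020-02-12, 2, S, E, FALSE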

    INIT (model initial conditions)

    Files in the init folder contain the initial values of the infection model. Either seed or init files will be present, depending on the configuration of the model. These files contain the initial conditions of the infection model at the start date defined in the configuration file. As with seeding, the user can provide a single initial conditions file (which will be used across all simulations), or, if multiple simulations are being run, a separate file for each simulation.

    The meanings of the columns are:

    subpop - The values of this column are the names of the nodes from the geodata file.

    mc_infection_stage, mc_vaccination_status, etc. - The name of the compartment for which the value is reported, broken down into one column per state type (e.g., infection stage, vaccination status, age).

    amount - The number of individuals initialized in the given compartment and subpopulation at the start date defined in the configuration file.

    Inference Model Output

    (This section describes the location and contents of the additional output files produced during an inference model run)

    Updates to other files

    LLIK (inference runs only)

    During inference runs, an additional file type, llik, is created, which is described below.

    These files contain the log-likelihoods of the model simulation for each subpopulation, as well as some diagnostics on acceptance.

    The meanings of the columns are:

    ll - These values are the log-likelihoods of the data given the model and parameter values for a single subpopulation (given in the subpop column).

    filename - ...

    subpop - The values of this column are the names of the nodes from the geodata file.

    accept - Either 0 or 1, depending on whether the parameters during this iteration of the simulation were accepted (1) or rejected (0) in that subpopulation.

    accept_avg - ...

    accept_prob - ...

    For inference runs, ... flepiMoP produces one file per parallel slot, for both global and chimeric outputs...

    Specifying time-varying parameter modifications

    This section describes how to specify modifications to any of the parameters of the transmission model or observational model during certain time periods.

    Modifiers are a powerful feature in flepiMoP that enable users to modify any of the parameters specified in the model during particular time periods. They can be used, for example, to mirror public health control interventions, like non-pharmaceutical interventions (NPIs) or increased access to diagnosis or care, or annual seasonal variations in disease parameters. Modifiers can act on any of the transmission model parameters or observation model parameters.

    In the seir_modifiers and outcome_modifiers sections of the configuration file the user can specify several possible types of modifiers which will then be implemented in the model. Each modifier changes a parameter during one or multiple time periods and for one or multiple specified subpopulations.

    We currently support the following intervention types. Each of these is described in detail below:

    • "SinglePeriodModifier" – Modifies a parameter during a single time period

    • "MultiPeriodModifier" – Modifies a parameter by the same amount during a multiple time periods

    • "ModifierModifier" – Modifies another intervention during a single time period

    • "StackedModifier" – Combines two or more interventions additively or multiplicatively, and is used to be able to turn on and off groups of interventions easily for different runs ;

    Note that if you want a parameter to vary continuously over time (for example, a daily transmission rate that is influenced by temperature and humidity), it is easier to do this using a "timeseries" parameter value than by combining many separate modifiers. Timeseries parameter values are described in the section on specifying parameters. Timeseries values for outcome parameters (e.g., a testing rate that fluctuates rapidly due to test availability) are in development but not currently available.

    Within flepiMoP, modifiers can be run as "scenarios". With scenarios, we can use the same configuration file to run multiple versions of the model where only the modifiers applied differ.

    The modifiers section contains two sub-sections: modifiers::scenarios, which lists the name of the modifiers that will run in each separate scenario, and modifiers::modifiers, where the details of each modifier are specified (e.g., the parameter it acts on, the time it is active, and the subpopulation it is applied to). An example is outlined below

    In this example, each scenario runs a single intervention, but more complicated examples are possible.

    The major benefit of specifying both "scenarios" and "modifiers" is that the user can use the "StackedModifier" option to combine other modifiers in different ways, and then run either the individual or combined modifiers as scenarios. This way, each scenario may consist of one or more individual parameter modifications, and each modification may be part of multiple scenarios. This provides a shorthand to quickly consider multiple different versions of a model that have different combinations of parameter modifications. For example, during an outbreak we could evaluate the impact of school closures, case isolation, and masking, or of any one or two of these three measures. An example of a configuration file combining modifiers to create new scenarios is given below:

    The seir_modifiers::scenarios and outcome_modifiers::scenarios sections are optional. If the scenarios section is not included, the model will run with all of the modifiers turned "on".

    If the scenarios section is included for either seir or outcomes, then each time a configuration file is run, the user must specify which modifier scenarios will be run. If not specified, the model will be run one time for each combination of seir and outcome scenario.

    Example

    [Give a configuration file that tries to use all the possible options available. Based on a simple SIR model with parameters beta and gamma in 2 subpopulations. Maybe a SinglePeriodModifier on beta for a lockdown and gamma for isolation, one having a fixed value and one from a distribution, MultiPeriodModifier for school year in different places, ModifierModifier for ..., StackedModifier for .... ]

    modifiers::scenarios

    An optional list consisting of a subset of the modifiers described in modifiers::settings, each of which will be run as a separate scenario. For example:

    or

    modifiers::settings

    A formatted list consisting of the description of each modifier, including its name, the parameter it acts on, the duration and amount of the change to that parameter, and the subset of subpopulations in which the parameter modification takes place. The list items are summarized in the table below and detailed in the sections below.

    Config item
    Required
    Type/format
    Description

    SinglePeriodModifier

    SinglePeriodModifier interventions enable the user to specify a multiplicative reduction to a parameter of interest. It takes a parameter and reduces its value by value (new = (1 - value) * old) for the subpopulations listed in subpop during the time interval [period_start_date, period_end_date].

    For example, if you would like to create an SEIR modifier called lockdown that reduces transmission by 70% in the state of California and the District of Columbia between two dates, you could specify this with a SinglePeriodModifier, as in the example below

    Example

    Or, to create an outcome variable modifier called enhanced_testing during which the case detection rate doubles:

    Configuration options

    method: SinglePeriodModifier

    parameter: The name of the parameter that will be modified. This could be a parameter defined for the transmission model in seir::parameters or for the observational model in outcomes. If the parameter is used in multiple transitions in the model, then all those transitions will be modified by this amount.

    period_start_date: The date when the modification starts, in YYYY-MM-DD format. The modification will only reduce the value of the parameter after (inclusive of) this date.

    period_end_date: The date when the modification ends, in YYYY-MM-DD format. The modification will only reduce the value of the parameter before (inclusive of) this date.

    subpop: A list of subpopulation names/ids in which the specified modification will be applied. This can be a single subpop, a list, or the word "all" (specifying that the modification applies to all existing subpopulations in the model). The modification will do nothing for any subpopulations not listed here.

    value: The fractional reduction of the parameter during the time period the modification is active. This can be a scalar number, or a distribution using the notation described in the Distributions section. The new parameter value will be

    subpop_groups: An optional list of lists specifying which subsets of the subpopulations in subpop should share parameter values when parameters are drawn from a distribution or fit to data. See the subpop_groups section below for more details.

    MultiPeriodModifier

    MultiPeriodModifier interventions enable the user to specify a multiplicative reduction to the parameter of interest by value (new = (1-value) * old) for the subpopulations listed in subpop during multiple different time intervals each defined by a start_date and end_date.

    For example, if you would like to describe the impact that transmission in schools has on overall disease spread, you could create a modifier that increases transmission by 30% during the dates that K-12 schools are in session in different regions (e.g., Massachusetts and Florida):

    Example

    Configuration options

    method: MultiPeriodModifier

    parameter: The name of the parameter that will be modified. This could be a parameter defined for the transmission model in seir::parameters or for the observational model in outcomes. If the parameter is used in multiple transitions in the model, then all those transitions will be modified by this amount.

    groups: A list of subpopulations (subpops), or groups of them, along with the time periods during which the modification will be active in each.

    • groups::subpop: A list of subpopulation names/ids in which the specified modification will be applied. This can be a single subpop, a list, or the word "all" (specifying that the modification applies to all existing subpopulations in the model). The modification will do nothing for any subpopulations not listed here.

    • groups::periods: A list of time periods, each defined by a start and end date, during which the modification will be applied.

    value: The fractional reduction of the parameter during the time period the modification is active. This can be a scalar number, or a distribution using the notation described in the Distributions section. The new parameter value will be

    subpop_groups: An optional list of lists specifying which subsets of the subpopulations in subpop should share parameter values when parameters are drawn from a distribution or fit to data. See the subpop_groups section below for more details.

    ModifierModifier

    ModifierModifier interventions allow the user to specify an intervention that acts to modify the value of another intervention, as opposed to modifying a baseline parameter value. The intervention multiplicatively reduces the modifier of interest by value (new = (1-value) * old) for the subpopulations listed in subpop during the time interval [period_start_date, period_end_date].

    Example

    For example, ModifierModifier could be used to describe a social distancing policy that is in effect between two dates and reduces transmission by 60% if followed by the whole population, but part way through this period, adherence to the policy drops to only 50% in one of the subpopulations:

    Note that this configuration is identical to the following alternative specification

    However, there are situations when the ModifierModifier notation is more convenient, especially when doing parameter fitting.

    Configuration options

    method: ModifierModifier

    baseline_scenario: The name of the original parameter modification which will be further modified.

    parameter: The name of the parameter in the baseline_scenario that will be modified.

    period_start_date: The date when the intervention modifier starts, in YYYY-MM-DD format. The intervention modifier will only reduce the value of the other intervention after (inclusive of) this date.

    period_end_date: The date when the intervention modifier ends, in YYYY-MM-DD format. The intervention modifier will only reduce the value of the other intervention before (inclusive of) this date.

    subpop: A list of subpopulation names/ids in which the specified intervention modifier will be applied. This can be a single subpop, a list, or the word "all" (specifying that the intervention applies to all existing subpopulations in the model). The intervention will do nothing for any subpopulations not listed here.

    value: The fractional reduction of the baseline intervention during the time period the modifier intervention is active. This can be a scalar number, or a distribution using the notation described in the Distributions section. The new intervention value will be

    and so the value of the underlying parameter that was modified by the baseline intervention will be

    subpop_groups: An optional list of lists specifying which subsets of the subpopulations in subpop should share parameter values when parameters are drawn from a distribution or fit to data. See the subpop_groups section below for more details.

    StackedModifier

    Combine two or more modifiers into a scenario, so that they can easily be singled out to be run together without the other modifiers. If multiple modifiers act during the same time period in the same subpopulation, their effects are combined multiplicatively. Modifiers of different types (i.e., SinglePeriodModifier, MultiPeriodModifier, ModifierModifier, other StackedModifiers) can be combined.

    Examples

    or

    Configuration options

    method: StackedModifier

    modifiers: A list of names of the other modifiers (specified above) that will be combined to create the new modifier (which we typically refer to as a "scenario")

    modifiers::modifiers::groups

    subpop_groups: For any of the modifier types, subpop_groups is an optional list of lists specifying which subsets of the subpopulations in subpop should share parameter values when parameters are drawn from a distribution or fit to data. All other subpopulations not listed will have unique modifier values, unlinked to other areas. If the value is 'all', then all subpopulations will be assumed to have the same modifier value. When the subpop_groups option is not specified, all subpopulations will be assumed to have unique values of the modifier.

    For example, for a model of disease spread in Canada where we want to specify that the (to be varied) value of a modification to the transmission rate should be the same in all the Atlantic provinces (Nova Scotia, Newfoundland, Prince Edward Island, and New Brunswick), the same in all the prairie provinces (Manitoba, Saskatchewan, Alberta), the same in the three territories (Nunavut, Northwest Territories, and Yukon), and yet take unique values in Ontario, Quebec, and British Columbia, we could write:
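
    A sketch of such a grouping is given here (the modifier name, dates, and value distribution are illustrative, and we assume the geodata file identifies provinces and territories by their two-letter codes):

    seir_modifiers:
      modifiers:
        transmission_modifier:
          method: SinglePeriodModifier
          parameter: beta
          period_start_date: 2021-01-01
          period_end_date: 2021-06-30
          subpop: "all"
          value:
            distribution: truncnorm
            mean: 0.3
            sd: 0.1
            a: 0
            b: 1
          subpop_groups:
            - ["NS", "NL", "PE", "NB"]
            - ["MB", "SK", "AB"]
            - ["NU", "NT", "YT"]

    Ontario, Quebec, and British Columbia are simply not listed in subpop_groups, so each retains its own independently drawn (or fitted) modifier value.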

    Config writer

    The model needs a configuration file to run (described in previous sections). These configs can become lengthy and are sometimes difficult to write by hand. The config writer helps generate configs, provided the relevant files are present.

    Print Functions:

    These functions are used to print specific sections of the configuration files.

    print_header

    Used to generate the global header. For more information on global headers, click here.

    Variable name
    Required (default value if optional)
    Description

    print_spatial_setup

    Used to generate the spatial setup section of the configuration. For more information on spatial setup, click here.

    Variable name
    Required (default value if optional)
    Description

    print_compartments

    Used to generate the compartment list for each way a population can be divided.

    Variable Name
    Required (default value if optional)
    Description

    Parts of the configuration files that are printed but not needed for FlepiMop runs (need to be mentioned for US or COVID-19 specific runs??)

    Spatial Setup:

    • census year: year of geodata files

    • modeled states (sim_states): This has US state abbreviations. Do we include the names of the sub-populations in the geodata file? Eg: small_province, large_province

    • state_level: Specifies if the runs are run for US states

    Diagnostic plotting scripts

    We provide helper scripts to aid users in understanding model outputs and diagnosing simulations and iterations. These scripts may be set to run automatically after a model run, and they depend on the model defined in the user's config file.

    The script postprocess_snapshot.R requires the following command line inputs:

    • a user-defined config, $CONFIG_PATH

    • a run index, $FLEPI_RUN_INDEX

    • a path to the model output results, $FS_RESULTS_PATH

    • a path to the flepiMoP repository, $FLEPI_PATH; and

    • a list of outputs to plot, $OUTPUTS; by default the script provides diagnostics for the following model output files: hosp, hpar, snpi, hnpi, and llik.

    Plots of hosp output files show confidence intervals of model runs against the provided ground truth data (for inference runs) for each metapopulation node. hnpi and snpi plots provide violin plots of parameter values for each slot.

    Other scripts are included as more specific examples of post-processing and diagnostic tools. The processing_diagnostics.R script provides a detailed diagnosis of inference model runs and fits.

    Inference with EMCEE

    Config Changes Relative To Classical Inference

    The major changes are:

    1. Under the 'inference' section, add a method: emcee entry, and

    2. Under the 'statistics' section, move the resample-specific configuration under a 'resample' subsection, as shown below:
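
    A minimal sketch of these two changes is shown here; the statistic shown and the key names inside resample are illustrative and should be adapted to your existing statistics configuration:

    inference:
      method: emcee

    statistics:
      sum_deaths:
        name: sum_deaths
        sim_var: incidD
        data_var: death_incid
        likelihood:
          dist: pois
        resample:
          # aggregation settings that previously sat directly under the
          # statistic are moved under this 'resample' subsection
          freq: "1 weeks"   # illustrative key/value
          aggregator: sum   # illustrative key/value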

    In addition to those configuration changes, there are now new likelihood statistics offered: pois, norm/norm_homoskedastic, norm_cov/norm_heteroskedastic, nbinom, rmse, and absolute_error, as well as new regularizations: forecast and allsubpops.

    Running Locally

    You can test your updated config by running:
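
    For example (the config file name and option values here are illustrative; the options themselves are described in the HPC section below):

    flepimop-calibrate --config config_emcee.yml \
      --nwalkers 200 --jobs 8 \
      --niterations 500 --nsamples 100 \
      --id test_run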

    If it works, it should produce:

    • Plots of simulation directly from your config,

    • Plots after the fits with the fits and the parameter chains,

    • An h5 file with all the chains, and

    • The usual model_output/ directory.

    It will also immediately produce standard output similar to the following (depending on your config):

    Here, it says the config fits 92 parameters; keep that in mind and choose a number of walkers greater than (ideally 2 times) this number of parameters.

    Running On An HPC Environment With Slurm

    First, install flepiMoP on the cluster following the guide. Then manually create a batch file to submit to slurm like so:

    Breaking down what each of these lines does:

    • #SBATCH --ntasks 1: Requests that this be run as a single job,

    • #SBATCH --nodes 1: Requests that the job be run on 1 node, as of right now EMCEE only supports single nodes,

    • #SBATCH --mem 450g: Requests that the whole job get 450 GB of memory; this should be roughly 2-3 GB per walker,

    For more details on other options provided by gempyor for calibration please see flepimop-calibrate --help.

    Postprocessing EMCEE

    At this stage postprocessing for EMCEE outputs is fairly manual. A good starting point can be found in postprocessing/emcee_postprocess.ipynb which plots the chains and can run forward projections from the sample drawn from calibration.

    Create a post-processing script

    These scripts are run automatically after an inference run

    Some information to consider if you'd like your script to be run automatically after an inference run:

    • Most R/Python packages are already installed. Try to run your script in the conda environment defined on the submission page (or, easier, ask me if you are not set up on MARCC).

    • There will be some variables set in the environment. These variables are:

      • $CONFIG_PATH the path to the configuration file

      • $FLEPI_RUN_INDEX the run id for this run (e.g., `CH_R3_highVE_pesImm_2022_Jan29`)

      • $JOB_NAME this job name (e.g., USA-20230130T163847_inference_med)

      • $FS_RESULTS_PATH the path where the model results are stored; it is a folder that contains model_output/ as a subfolder

      • $FLEPI_PATH path of the flepiMoP repository.

      • $PROJECT_PATH path of the Data directory (e.g., Flu_USA or COVID19_USA).

      • Anything you ask can theoretically be provided here.

    • The script must run without any user intervention.

    • The script is run from $PROJECT_PATH.

    • Your script should live in the flepiMoP repository (preferably), though it is OK for it to live in a project data directory if that makes more sense.

    • It is run on a multicore machine with 64 GB of RAM. All scripts combined must complete in under 4 hours, and you can use multiprocessing (48 cores).

    • Outputs (pdf, csv, html, txt, png ...) must be saved in a directory named pplot/ (you can assume that it exists) in order to be sent to slack by FlepiBot 🤖 after the run.

    • An example postprocessing script (in Python) is available.

    • You can test your script on MARCC on a run that is already saved in /data/struelo1/flepimop-runs or I can do it for you.

    • Once your script works, add (or ask us to add) the command line to run it to the file batch/postprocessing_scripts.sh, between the START and END lines, with a short comment about what your script does.

    name: sir
    setup_name: minimal
    start_date: 2020-01-31
    end_date: 2020-05-31
    nslots: 1
    
    subpop_setup:
      geodata: geodata_sample_1pop.csv
      mobility: mobility_sample_1pop.csv
      popnodes: population
      nodenames: name
    
    seeding:
      method: FromFile
      seeding_file: data/seeding_1pop.csv
    
    compartments:
      infection_stage: ["S", "I", "R"]
    
    seir:
      integration:
        method: stochastic
        dt: 1 / 10
      parameters:
        gamma:
          value:
            distribution: fixed
            value: 1 / 5
        Ro:
          value:
            distribution: uniform
            low: 2
            high: 3
      transitions:
        - source: ["S"]
          destination: ["I"]
          rate: ["Ro * gamma"]
          proportional_to: [["S"],["I"]]
          proportion_exponent: ["1","1"]
        - source: ["I"]
          destination: ["R"]
          rate: ["gamma"]
          proportional_to: ["I"]
          proportion_exponent: ["1"]
    
    interventions:
      scenarios:
        - None
        - Lockdown
      modifiers:
        None:
          method: SinglePeriodModifier
          parameter: r0
          period_start_date: 2020-04-01
          period_end_date: 2020-05-15
          value:
            distribution: fixed
            value: 0
        Lockdown:
          method: SinglePeriodModifier
          parameter: r0
          period_start_date: 2020-04-01
          period_end_date: 2020-05-15
          value:
            distribution: fixed
            value: 0.7
    > flepimop simulate sir_control.yml
    > flepimop simulate -n 100 -j 4 -npi_scenario None -m euler --write_csv sir_control.yml
    subpop_setup:
      census_year: 2010
      state_level: TRUE
      geodata: geodata_2019_statelevel.csv
      mobility: mobility_2011-2015_statelevel.csv
      modeled_states:
        - CT
        - MA
        - ME
        - NH
        - RI
        - VT
      
    USPS	subpop	population
    AL	01000	4876250
    AK	02000	737068
    AZ	04000	7050299
    AR	05000	2999370
    CA	06000	39283497
    .....
    ori	dest	amount
    01000	02000	198
    01000	04000	292
    01000	05000	570
    01000	06000	1030
    01000	08000	328
    .....
    importation:
      census_api_key: "fakeapikey00000"
      travel_dispersion: 3
      maximum_destinations: Inf
      dest_type: state
      dest_county: USA
      aggregate_to: airport
      cache_work: TRUE
      update_case_data: TRUE
      draw_travel_from_distribution: FALSE
      print_progress: FALSE
      travelers_threshold: 10000
      airport_cluster_distance: 80
      param_list:
        incub_mean_log: log(5.89)
        incub_sd_log: log(1.74)
        inf_period_nohosp_mean: 15
        inf_period_nohosp_sd: 5
        inf_period_hosp_mean_log: 1.23
        inf_period_hosp_sd_log: 0.79
        p_report_source: [0.05, 0.25]
        shift_incid_days: -10
        delta: 1
    report:
      data_settings:
        pop_year: 2018
      plot_settings:
        plot_intervention: TRUE
      formatting:
        scenario_labels_short: ["UC", "S1"]
        scenario_labels:
          - Uncontrolled
          - Scenario 1
        scenario_colors: ["#D95F02", "#1B9E77"]
        pdeath_labels: ["0.25% IFR", "0.5% IFR", "1% IFR"]
        display_dates: ["2020-04-15", "2020-05-01", "2020-05-15", "2020-06-01", "2020-06-15"]
        display_dates2: ["2020-04-15", "2020-05-15", "2020-06-15"]

    Multiple Configuration Files

    Additional parameter options

    Aka "magic numbers" - fixed parameters that may or may not be in the config, like MCMC step size, dt, etc.

    • MCMC step size

    • Numerical integration step size

    • Mobility proportion

    Swapping model modules

    (Ie using a totally different compartmental model or outcomes model)

    Numerical methods

    model_output_dir_name

    Optional (model_output)

    Folder path where the outputs of the simulated model are stored

    sim_start_date

    Required

    Start date for model simulation

    sim_end_date

    Required

    End date for model simulation

    start_date_groundtruth

    Optional (NA)

    Start date for fitting data for inference runs

    end_date_groundtruth

    Optional (NA)

    End date for fitting data for inference runs

    nslots

    Required

    Number of independent simulations to run

    popnodes

    Optional (pop2019est)

    Name of a column in the geodata file that specifies the population of every subpopulation column

    nodenames

    Optional (subpop)

    Name of a column in the geodata file that specifies the name of the subpopulation

    state_level

    Optional (TRUE)

    Specifies if the subpopulations are US states

    sim_name

    Required

    Name of the configuration file to be generated. Generally based on the type of simulation

    setup_name

    Optional (SMH)

    Type of run - a Scenario Modeling Hub ("SMH") or Forecasting Hub ("FCH") Simulation.

    disease

    Optional (covid19)

    Pathogen or disease being simulated

    smh_round

    Optional (NA)

    census_year

    Optional (2019)

    The year of data used to generate the geodata files for US simulations ?? [Unsure about this]

    sim_states

    Required

    Vector of locations that will be modeled (US Specific?)

    geodata_file

    Optional (geodata.csv)

    Name of the geodata file which is imported

    mobility_file

    Optional (mobility.csv)

    inf_stages

    Optional (S,E,I1,I2,I3,R,W)

    Various infection stages an individual can be in

    vaccine_compartments

    Optional (unvaccinated, 1dose, 2dose, waned)

    Various levels of vaccinations an individual can have

    variant_compartments

    Optional (WILD, ALPHA, DELTA, OMICRON)

    Variants of the pathogen

    age_strata

    Optional (age0to17, age18to64, age65to100)

    smh_round - Round number for Scenario Modeling Hub submission

    mobility_file - Name of the mobility file which is imported

    age_strata - Different age groups into which the population has been stratified

    US specific How to Run

    lambda_file

    required

    path to seeding file

    perturbation_sd

    required

    standard deviation for the proposal value of the seeding date, in number of days

    perturbation

    optional for SinglePeriodModifierR0

    this option indicates whether inference will be performed on this setting and how the proposal value will be identified from the last accepted value

    subpop

    optional for SinglePeriodModifierR0

    list of subpops, which must be in geodata

    settings

    required

    See details below

    probability::perturbation

    optional

    inference settings for the probability metric

    delay

    required

    time delay between source and the specified health outcome

    delay::value

    required

    specifies whether the value is fixed or distributional and the parameters specific to that metric and distribution

    delay::perturbation

    optional

    inference settings for the time delay metric (coming soon)

    duration

    optional

    duration that health outcome status endures

    duration::value

    required

    specifies whether the value is fixed or distributional and the parameters specific to that metric and distribution

    duration::perturbation

    optional

    inference settings for the duration metric (coming soon)

    statistics

    required

    specifies which data will be used to calibrate the model. see filtering::statistics for details

    hierarchical_stats_geo

    optional

    specifies whether a hierarchical structure should be applied to any inferred parameters. See filtering::hierarchical_stats_geo for details.

    priors

    optional

    specifies prior distributions on inferred parameters. See filtering::priors for details

    data_var

    required

    column where data can be found in data_path file

    remove_na

    required

    logical

    add_one

    required

    logical, TRUE if evaluating the log likelihood

    likelihood::dist

    required

    distribution of the likelihood

    likelihood::param

    required

    parameter value(s) for the likelihood distribution. These differ by distribution, so check the logLikStat function in inference/R/functions.R.

    transform

    required

    type of transform that should be applied to the likelihood: "none" or "logit"

    method

    required

    "FolderDraw"

    seeding_file_type

    required for FolderDraw

    "seed" or "impa"

    indicates which seeding file type the SEIR model will look for, "seed", which is generated from create_seeding.R, or "impa", which refers to importation

    folder_path

    required

    path to folder where importation inference files will be saved

    R0 * (1 - local_variance)

    template

    Required

    "SinglePeriodModifierR0" or "StackedModifier"

    period_start_date

    optional for SinglePeriodModifierR0

    date between global start_date and end_date; default is global start_date

    period_end_date

    optional for SinglePeriodModifierR0

    date between global start_date and end_date; default is global end_date

    value

    required for SinglePeriodModifierR0

    method

    required

    "delayframe"

    param_from_file

    required

    if TRUE, will look for param_subpop_file

    param_subpop_file

    optional

    path to subpop-params parquet file, which indicates location specific risk values. Values in this file will override values in the config if there is overlap.

    scenarios

    required

    (health outcome metric)

    required

    "incidH", "incidD", "incidICU", "incidVent", "incidC", corresponding to variable names

    source

    required

    name of health outcome metric that is used as the reference point

    probability

    required

    health outcome risk

    probability::value

    required

    simulations_per_slot

    required

    number of iterations in a single MCMC inference chain

    do_filtering

    required

    TRUE if inference should be performed

    data_path

    required

    file path where observed data are saved

    likelihood_directory

    required

    name

    required

    name of statistic, user defined

    aggregator

    required

    function used to aggregate data over the period, usually sum or mean

    period

    required

    duration over which data should be aggregated prior to use in the likelihood, may be specified in any number of days, weeks, months

    sim_var

    required

    scenario name

    required

    name of hierarchical scenario, user defined

    name

    required

    name of the estimated parameter that will be grouped (e.g., the NPI scenario name or a standardized, combined health outcome name like probability_incidI_incidC)

    module

    required

    name of the module where this parameter is estimated (important for finding the appropriate files)

    geo_group_col

    required

    scenario name

    required

    name of prior scenario, user defined

    name

    required

    name of NPI scenario or parameter that will have the prior

    module

    required

    name of the module where this parameter is estimated

    likelihood

    required

    specifies both the prior distribution and range of support for the final inferred values

    user-defined scenario name

    specifies whether the value is fixed or distributional and the parameters specific to that metric and distribution

    folder path where likelihood evaluations will be stored as the inference algorithm runs

    column name where model data can be found, from the hospitalization outcomes files

    geodata column name that should be used to group parameter estimation

    specifies the distribution of the prior

    flepiMoP/examples/tutorials
    ├── model_output
    │   ├── seir
    │   ├── spar
    │   ├── snpi
    │   └── llik
    │       └── sample_2pop
    │           └── None
    │               └── 2023.05.24.02/12/48.
    │                   ├── chimeric
    │                   └── global
    │                       ├── final
    │                       │   └── 000000001.2023.05.24.02/12/48..llik.parquet
    │                       └── intermediate
    │                           └── 000000001.000000001.2023.05.24.02/12/48..llik.parquet
    Inference Model Output

    period_end_date or periods::end_date

    required

    numeric, YYYY-MM-DD

    The date when the modification ends. Notation depends on value of method.

    subpop

    required

    String, or list of strings

    The subpopulations to which the modifications will be applied, or "all" . Subpopulations must appear in the geodata file.

    value

    required

    Distribution, or single value

    The relative amount by which a modification reduces the value of a parameter.

    subpop_groups

    optional

    string or a list of lists of strings

    A list of lists defining groupings of subpopulations, which defines how modification values should be shared between them, or 'all' in which case all subpopulations are put into one group with identical modification values. By default, if parameters are chosen randomly from a distribution or fit based on data, they can have unique values in each subpopulation.

    baseline_scenario

    Used only for ModifierModifier

    String

    Name of the original modification which will be further modified

    modifiers

    Used only for StackedModifier

    List of strings

    List of modifier names to be grouped into the new combined modifier/scenario name

  • groups::periods::start_date: The date when the modification starts, in YYYY-MM-DD format. The modification will only reduce the value of the parameter after (inclusive of) this date.

  • groups::periods::end_date: The date when the modification ends, in YYYY-MM-DD format. The modification will only reduce the value of the parameter before (inclusive of) this date.

  • method

    required

    string

    one of SinglePeriodModifier, MultiPeriodModifier, ModifierModifier, or StackedModifier

    parameter

    required

    string

    The parameter on which the modification is acting. Must be a parameter defined in seir::parameters or outcomes

    period_start_date or periods::start_date

    required

    numeric, YYYY-MM-DD


    The date when the modification starts. Notation depends on value of method.

    "hosp, hpar, snpi, hnpi, llik"

    #SBATCH --cpus-per-task 256: Requests that the whole job get 256 CPUs (technically 256 per task, but ntasks should be set to 1 for EMCEE),

  • #SBATCH --time 20:00:00: Specifies a time limit of 20hrs for this job to complete in, and

  • flepimop-calibrate ...:

    • --config config_NC_emcee.yml: Use the config_NC_emcee.yml for this calibration run,

    • --nwalkers 500: Use 500 walkers (or chains) for this calibration, should be about 2x the number of parameters,

    • --jobs 256: The number of parallel walkers to run, should be either 1x or 0.5x the number of cpus,

    • --niterations: The number of iterations to run for each walker,

    • --nsamples: The number of posterior samples (taken from the end of each walker) to save to the model_output/ directory, and

    • --id: An optional short but unique job name, if not explicitly provided one will be generated from the config.

  • Running On A HPC With Slurm
    left: classical inference config, right: new EMCEE config

    Using plug-ins 🧩[experimental]

    How to plug-in your code/data directly into flepiMoP

    Sometimes, the default modules, such as seeding or initial conditions, do not provide the desired functionality. Thankfully, it is possible to replace a gempyor module with your own code using plug-ins. This currently works only for initial conditions and seeding; reach out to us if you are interested in having it work for parameters, modifiers, and so on.

    Here is an example that sets a random initial condition, where in each subpopulation a random proportion of individuals is infected. To do this, simply set the method of a block to plugin and provide the path to your file.

    initial_conditions:
      method: plugin
      plugin_file_path: model_input/my_initial_conditions.py
      # you can also include some configuration for your plugin:
      ub_prop_infected: 0.001 # upper bound of the uniform distribution

    This file contains a class that inherits from a gempyor class, which means that everything already defined in gempyor is available, but you can overwrite any single method. Here, we overwrite the get_from_config and get_from_file methods of the initial conditions class:

    import gempyor.seeding_ic
    import numpy as np
    
    class InitialConditions(gempyor.seeding_ic.InitialConditions):
    
        def get_from_config(self, sim_id: int, setup) -> np.ndarray:
            # array of shape (number of compartments) x (number of subpopulations)
            y0 = np.zeros((setup.compartments.compartments.shape[0], setup.nsubpops))
            # indices of the S and I compartments
            S_idx = setup.compartments.get_comp_idx({"infection_stage": "S"})
            I_idx = setup.compartments.get_comp_idx({"infection_stage": "I"})
            # draw a random infected proportion for each subpopulation, bounded
            # above by the ub_prop_infected value passed through the config
            prop_inf = np.random.uniform(low=0, high=self.config["ub_prop_infected"].get(), size=setup.nsubpops)
            y0[S_idx, :] = setup.subpop_pop * (1 - prop_inf)
            y0[I_idx, :] = setup.subpop_pop * prop_inf
    
            return y0
    
        def get_from_file(self, sim_id: int, setup) -> np.ndarray:
            return self.get_from_config(sim_id=sim_id, setup=setup)

    You can use any code within these functions, as long as the return object has the shape and type that gempyor expects (this is undocumented and still subject to change, but as you can see, in this case gempyor expects an array (a matrix) of shape: number of compartments x number of subpopulations). You can, e.g., call bash functions or execute R scripts, as below:

    import gempyor.seeding_ic
    import numpy as np
    
    class InitialConditions(gempyor.seeding_ic.InitialConditions):
    
        def get_from_config(self, sim_id: int, setup) -> np.ndarray:
            import rpy2.robjects as robjects
            robjects.r.source("path_to_your_Rscript.R", encoding="utf-8")
            y0 = robjects.r["initial_condition_fromR"]
            return y0
        
        def get_from_file(self, sim_id: int, setup) -> np.ndarray:
            return self.get_from_config(sim_id=sim_id, setup=setup)
    seeding:
      method: FolderDraw
      seeding_file_type: seed
      folder_path: importation/minimal/
      lambda_file: data/minimal/seeding.csv
      perturbation_sd: 3
    interventions:
      scenarios:
        - Scenario1
      settings:
        local_variance:
          template: SinglePeriodModifierR0
          value:
            distribution: truncnorm
            mean: 0
            sd: .1
            a: -1
            b: 1
          perturbation:
            distribution: truncnorm
            mean: 0
            sd: .1
            a: -1
            b: 1
        stayhome:
          template: SinglePeriodModifierR0
          period_start_date: 2020-04-04
          period_end_date: 2020-04-30
          value:
            distribution: truncnorm
            mean: 0.6
            sd: 0.3
            a: 0
            b: 0.9
          perturbation:
            distribution: truncnorm
            mean: 0
            sd: .1
            a: -1
            b: 1
        Scenario1:
          template: StackedModifier
          scenarios: 
            - local_variance
            - stayhome
    outcomes:
      method: delayframe
      param_from_file: TRUE
      param_subpop_file: "usa-subpop-params-output.parquet" ## ../../Outcomes/data/usa-subpop-params-output.parquet
      scenarios:
        - med
      settings:
        med:
          incidH:
            source: incidI
            probability:
              value:
                distribution: fixed
                value: .035
            delay:
              value:
                distribution: fixed
                value: 7
            duration:
              value:
                distribution: fixed
                value: 7
              name: hosp_curr
          incidD:
            source: incidI
            probability:
              value:
                distribution: fixed
                value: .01
            delay:
              value:
                distribution: fixed
                value: 20
          incidICU:
            source: incidH
            probability: 
              value:
                distribution: fixed
                value: 0.167
            delay:
              value:
                distribution: fixed
                value: 3
            duration:
              value:
                distribution: fixed
                value: 8
          incidVent:
            source: incidICU
            probability: 
              value:
                distribution: fixed
                value: 0.463
            delay:
              value:
                distribution: fixed
                value: 1
            duration:
              value:
                distribution: fixed
                value: 7
          incidC:
            source: incidI
            probability:
              value:
                distribution: truncnorm
                mean: .1
                sd: .1
                a: 0
                b: 10
              perturbation:
                distribution: truncnorm
                mean: 0
                sd: .1
                a: -1
                b: 1
            delay:
              value:
                distribution: fixed
                value: 7
    filtering:
      simulations_per_slot: 350
      do_filtering: TRUE
      data_path: data/observed_data.csv
      likelihood_directory: importation/likelihood/
      statistics:
        sum_deaths:
          name: sum_deaths
          aggregator: sum ## function applied over the period
          period: "1 weeks"
          sim_var: incidD
          data_var: death_incid
          remove_na: TRUE
          add_one: FALSE
          likelihood:
            dist: sqrtnorm
            param: [.1]
        sum_confirmed:
          name: sum_confirmed
          aggregator: sum
          period: "1 weeks"
          sim_var: incidC
          data_var: confirmed_incid
          remove_na: TRUE
          add_one: FALSE
          likelihood:
            dist: sqrtnorm
            param: [.2]
      hierarchical_stats_geo:
        local_var_hierarchy:
          name: local_variance
          module: seir
          geo_group_col: USPS
          transform: none
        local_conf:
          name: probability_incidI_incidC
          module: hospitalization
          geo_group_col: USPS
          transform: logit
      priors:
        local_var_prior:
          name: local_variance
          module: seir
          likelihood:
            dist: normal
            param:
            - 0
            - 1
    flepiMoP/examples/tutorials
    ├── model_output
    │   ├── {setup_name}_{seir_modifier_scenario}_{outcome_modifier_scenario}
    │   │   └── run_id
    │   │       └── seir
    │   │           └── 000000001.run_id.seir.parquet
    │   ├── spar
    │   ├── snpi
    TBA
    // Some code
    seir_modifiers:
      scenarios:
        - NameOfIntervention1
        - NameOfIntervention2
      modifiers:
        NameOfIntervention1:
          ...
        NameOfIntervention2:
          ...
    seir_modifiers:
      scenarios:
        - SchoolClosures
        - AllNPIs
      modifiers:
        SchoolClosures:
          method: SinglePeriodModifier
          ...
        CaseIsolation:
          method: SinglePeriodModifier
          ...
        Masking:
          method: SinglePeriodModifier
          ...
        AllNPIs:
          method: StackedModifier
          modifiers: ["SchoolClosures","CaseIsolation","Masking"]
    seir_modifiers:
      scenarios:
        - SchoolClosures
        - AllNPIs
    outcome_modifiers:
      scenarios:
        - BaselineTesting
        - TestShortage
    seir_modifiers:
      modifiers:
        lockdown: 
          method: SinglePeriodModifier
          parameter: beta
          period_start_date: 2020-03-15
          period_end_date: 2020-05-01
          subpop: ['06000', '11000']
          value: 0.7
    outcome_modifiers:
      modifiers:
        enhanced_testing: 
          method: SinglePeriodModifier
          parameter: incidC::probability
          period_start_date: 2020-03-15
          period_end_date: 2020-05-01
          subpop: ['06000', '11000']
          value: -1.0
    new_parameter_value = old_parameter_value * (1 - value)
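    As a quick worked example with the values above (the baseline number here is illustrative, not from a real config): applying the lockdown modifier (value: 0.7) to a baseline beta of 0.5 gives 0.5 * (1 - 0.7) = 0.15 during the modified period, while the enhanced_testing modifier (value: -1.0) rescales incidC::probability to probability * (1 - (-1.0)) = 2 * probability. Negative values therefore increase the parameter they modify.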
    school_year:
      method: MultiPeriodModifier
      parameter: beta
      groups:
        - subpop: ["25000"] 
          periods:
            - start_date: 2021-09-09
              end_date: 2021-12-23
            - start_date: 2022-01-04
              end_date: 2022-06-22
        - subpop: ["12000"]
          periods:
            - start_date: 2021-08-10
              end_date: 2021-12-17
            - start_date: 2022-01-04
              end_date: 2022-05-27
      value: -0.3
    new_parameter_value = old_parameter_value * (1 - value)
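    For instance, with value: -0.3 as above, beta becomes beta * (1 - (-0.3)) = 1.3 * beta during each listed school-term period, i.e. a 30% increase in transmission while school is in session.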
    seir_modifiers:
      modifiers:
        social_distancing: 
          method: SinglePeriodModifier
          parameter: beta
          period_start_date: 2020-03-15
          period_end_date: 2020-06-30
          subpop: ['all']
          value: 0.6
        fatigue: 
          method: ModifierModifier
          baseline_scenario: social_distancing
          parameter: beta
          period_start_date: 2020-05-01
          period_end_date: 2020-06-30
          subpop: ['large_province']
          value: 0.5
    seir_modifiers:
      modifiers:
        social_distancing_initial: 
          method: SinglePeriodModifier
          parameter: beta
          period_start_date: 2020-03-15
          period_end_date: 2020-04-30
          subpop: ['all']
          value: 0.6
        social_distancing_fatigue_sp: 
          method: SinglePeriodModifier
          parameter: beta
          period_start_date: 2020-05-01
          period_end_date: 2020-06-30
          subpop: ['small_province']
          value: 0.6
        social_distancing_fatigue_lp: 
          method: SinglePeriodModifier
          parameter: beta
          period_start_date: 2020-05-01
          period_end_date: 2020-06-30
          subpop: ['large_province']
          value: 0.3
    new_intervention_value = old_intervention_value * (1 - value)
    new_parameter_value = original_parameter_value * (1 - baseline_intervention_value * (1 - value) )
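    As a worked example using the configs above: the fatigue modifier halves the social_distancing effect in the large province, so the effective reduction becomes 0.6 * (1 - 0.5) = 0.3 and beta is multiplied by (1 - 0.3) = 0.7 from May 1 onward, matching the social_distancing_fatigue_lp modifier (value: 0.3) in the equivalent SinglePeriodModifier-only specification.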
    seir_modifiers:
      scenarios:
        - SchoolClosures
        - AllNPIs
      modifiers:
        SchoolClosures:
          method: SinglePeriodModifier
          parameter: beta
          period_start_date: 2020-03-15
          period_end_date: 2020-05-01
          subpop: 'all'
          value: 0.7
        CaseIsolation:
          method: SinglePeriodModifier
          parameter: gamma
          period_start_date: 2020-04-01
          period_end_date: 2020-05-01
          subpop: 'all'
          value: -1.0
        Masking:
          method: SinglePeriodModifier
          parameter: beta
          period_start_date: 2020-04-15
          period_end_date: 2020-05-01
          subpop: 'all'
          value: 0.5
        AllNPIs:
          method: StackedModifier
          modifiers: ["SchoolClosures","CaseIsolation","Masking"]
    outcome_modifiers:
      scenarios:
        - ReducedTesting
        - AllDelays
      modifiers:
        DelayedTesting:
          method: SinglePeriodModifier
          parameter: incidC::probability
          period_start_date: 2020-03-15
          period_end_date: 2020-05-01
          subpop: 'all'
          value: 0.5
        DelayedHosp:
          method: SinglePeriodModifier
          parameter: incidD::delay
          period_start_date: 2020-04-01
          period_end_date: 2020-05-01
          subpop: 'all'
          value: -1.0
        LongerHospStay:
          method: SinglePeriodModifier
          parameter: incidH::duration
          period_start_date: 2020-04-15
          period_end_date: 2020-05-01
          subpop: 'all'
          value: -0.5
    seir_modifiers:
      modifiers:
        lockdown: 
          method: SinglePeriodModifier
          parameter: beta
          period_start_date: 2020-03-15
          period_end_date: 2020-05-01
          subpop: 'all'
          subpop_groups: [['NS','NB','PE','NF'],['MB','SK','AB'],['NV','NW','YK']]
          value: 
            distribution: uniform
            low: 0.3
            high: 0.7
            
    flepimop-calibrate -c config_emcee.yml --nwalkers 5  --jobs 5 --niterations 10 --nsamples 5 --id my_run_id
      gempyor >> Running ***DETERMINISTIC*** simulation;
      gempyor >> ModelInfo USA_inference_all; index: 1; run_id: SMH_Rdisparity_phase_one_phase1_blk1_fixprojnpis_CA-NC_emcee,
      gempyor >> prefix: USA_inference_all/SMH_Rdisparity_phase_one_phase1_blk1_fixprojnpis_CA-NC_emcee/;
    Loaded subpops in loaded relative probablity file: 51 Intersect with seir simulation:  2 kept
    Running Gempyor Inference
    
    LogLoss: 6 statistics and 92 data points,number of NA for each statistic: 
    incidD_latino    46
    incidD_other      0
    incidD_asian      0
    incidD_black      0
    incidD_white      0
    incidC_white     24
    incidC_black     24
    incidC_other     24
    incidC_asian     24
    incidC_latino    61
    incidC           24
    incidD            0
    dtype: int64
    InferenceParameters: with 92 parameters: 
        seir_modifiers: 84 parameters
        outcome_modifiers: 8 parameters
    #!/bin/bash
    #SBATCH --ntasks 1
    #SBATCH --nodes 1
    #SBATCH --mem 450g
    #SBATCH --cpus-per-task 256
    #SBATCH --time 20:00:00
    flepimop-calibrate --config config_NC_emcee.yml \
      --nwalkers 500  \
      --jobs 256 \
      --niterations 2000 \
      --nsamples 250 \
      --id my_id  > out_fit256.out 2>&1
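    Assuming the script above is saved as, say, calibrate_emcee.sbatch (an illustrative filename), it would be submitted to SLURM in the usual way and can then be monitored with squeue:

    # Submit the calibration batch script to SLURM (filename is illustrative)
    sbatch calibrate_emcee.sbatch
    # Check on the job while it runs
    squeue -u $USER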

    Advanced run guides

    For running the model locally, especially for testing, non-inference runs, and short chains, we provide a guide for setting up and running in a conda environment, and we provide a Docker container for use. A Docker container is an environment isolated from the rest of the operating system: you can create files, install programs, and delete things inside it without affecting your OS; it is essentially a local virtual OS within your OS. We recommend Docker for users who are not familiar with setting up environments and who want a containerized environment to quickly launch jobs.

    For longer inference runs across multiple slots, we provide instructions and scripts for two methods to launch on SLURM HPC and on AWS using Docker. These methods are best for launching large jobs (long inference chains, multi-core and computationally expensive model runs), but not the best methods for debugging model setups.

    Running locally

    Running with Docker locally 🛳https://github.com/HopkinsIDD/flepiMoP/blob/documentation-gitbook/documentation/gitbook/how-to-run/advanced-run-guides/quick-start-guide-conda.md

    Running longer inference runs across multiple slots

    File descriptions

    flepiMoP

    https://github.com/HopkinsIDD/flepiMoP

    Current branch: main

    This repository contains all the code underlying the mathematical model and the data fitting procedure, as well as ...

    To actually run the model, this repository folder must be located inside a project folder (e.g. COVID19_USA) which contains additional files describing the specifics of the model to be run (i.e. the config file), all the necessary input data (e.g. the population structure), and any data to which the model will be fit (e.g. daily case and death counts).

    /gempyor_pkg

    This directory contains the core Python code that creates and simulates generic compartmental models and additionally simulates observed variables. This code is called gempyor, for General Epidemics Modeling Pipeline with Ynterventions and Outcome Reporting. The code in gempyor is called from R scripts (see /main_scripts and /R sections below) that read the config, run the model simulation via gempyor as required, read in data, and run the model inference algorithms.

    • pyproject.toml - contains the build system requirements and dependencies for the gempyor package; used during package installation

    • setup.cfg - contains information used by Python's setuptools to build the gempyor package. Contains the definitions of command line shortcuts for running simulations directly from gempyor (bypassing R interface) if desired

    /gempyor_pkg/src/gempyor/

    • seir.py - Contains the core code for simulating the mathematical model. Takes in the model definition and parameters from the config, and outputs a file with a timeseries of the value of each state variable (# of individuals in each compartment)

    • simulate_seir.py -

    • steps_rk.py -

    • steps_source.py -

    /gempyor_pkg/docs

    Contains notebooks with some gempyor-specific documentation and examples

    • Rinterface.Rmd - An R notebook that provides some background on gempyor and describes how to run it as a standalone package in Python, without the R wrapper scripts or Docker.

    • Rinterface.html - HTML output of Rinterface.Rmd

    /R

    /main_scripts

    This directory contains the R scripts that take the specifications in the configuration file and set up the model simulation, read the data, and perform inference.

    • inference_main.R - This is the master R script used to run the model. It distributes the model runs across computer cores, setting up runs for all the scenarios specified in the config, and for each model iteration used in the parameter inference. Note that despite the name "inference" in this file, this script must be used to run the model even if no parameter inference is conducted

    • inference_slot.R - This script contains the main code of the inference algorithm.

    • create_seeding.R -

    /R_packages

    This directory contains the core R code - organized into functions within packages - that handle the model setup, data pulling and processing, conducting parameter inference for the model, and manipulating model output.

    • flepicommon

      • config.R

      • DataUtils.R

      • file_paths.R

    /test

    /data

    Deprecated? Should be removed

    /vignettes

    Deprecated? Should be removed

    /doc

    Deprecated? Should be removed

    /batch

    /slurm_batch

    COVID19_USA Repository

    https://github.com/HopkinsIDD/COVID19_USA

    Current branch: main

    /R

    Contains R scripts for generating model input parameters from data, writing config files, or processing model output. Most of the files in here are historic (specific to a particular model run) and not frequently used. Important scripts include:

    • get_vacc_rate_and_outcomes_R13.R - this pulls vaccination coverage and variant prevalence data specific to rounds (either empirical, or specified by the scenario), and adjusts these data to the formats required for the model. Several data files are created in this process: variant proportions for each scenario, and vaccination rates by age and dose. A file is also generated that defines the outcome ratios (taking into account immune escape, cross protection, and VE).

    /R/scripts/config_writers

    Scripts to generate config files for particular submissions to the Scenario Modeling Hub. Most of this functionality has now been replaced by the config writer package ()

    R/scripts/postprocess

    Scripts to process the output of model runs into data formats and plots used for Scenario Modeling Hub and Forecast Hub. These scripts pull runs from AWS S3 buckets and process and format them to the specifications for submissions to Scenario Modeling Hubs, Forecast Hubs and FluSight. These formatted files are saved and the results visualized. These scripts use functions defined in /COVIDScenarioPipeline/R/scripts/postprocess.

    • run_sum_processing.R

    /data

    Contains data files used in parameterizing the model for COVID-19 in the US (such as creating the population structure, describing vaccine efficacy, describing parameter alterations due to variants, etc). Some data files are re-downloaded frequently using scripts in the pipeline (us_data.csv) while others are more static (geodata, mobility)

    Important files and folders include

    • geodata.csv

    • geodata_2019_statelevel.csv

    • mobility.csv

    • mobility_territories_2011-2015_statelevel.csv

    /data/shp

    "Shape-files" (.shp) that .....

    /data/outcomes

    • usa-subpop-params-output_V2.parquet

    /data/intervention_tracking

    Data files containing the dates that different non-pharmaceutical interventions (like mask mandates, stay-at-home orders, school closures) were implemented by state

    /data/vaccination

    Files used to create the config elements related to vaccination, such as vaccination rates by state by age and vaccine efficacy by dose

    /data/variant

    Files created in the process of downloading and analyzing data on variant proportions

    /manuscripts

    Contains files for scientific manuscripts using results from the pipeline. Not up to date

    /config

    Contains an archive of configuration files used for previous model runs

    /old_configs

    Same as above. Contains an archive of configuration files used for previous model runs

    /scripts

    Deprecated - to be removed? - contains rarely used scripts

    /notebook

    Deprecated - to be removed? - contains rarely used notebooks to check model input. Might be used in some unit tests?

    /NPI

    empty?

    /ScenarioHub

    Deprecated - to be removed?

    Environment Variables


    A library of environment variables in the flepiMoP codebase. These variables may be updated or deprecated as the project evolves.

    Environment Variables

    Below you will find a list of environment variables (envvars) defined throughout the flepiMoP codebase. Often, these variables are set in response to command-line argument input, though some are set by flepiMoP without direct user input (these are denoted by a 'Not a CLI option' note in the Argument column).

    For each variable, the table lists: Envvar, Argument, Description, Default, Valid values, and Key file locations (inexhaustive).
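    As a quick sketch of how these are typically used (the config filename here is purely illustrative): many envvars mirror a CLI flag, so a config path, for instance, can either be exported once or passed explicitly:

    # Export the envvar read by the data-setup scripts ...
    export CONFIG_PATH=config_sample_2pop.yml
    Rscript $FLEPI_PATH/datasetup/build_US_setup.R
    # ... or pass the same path as a CLI flag where one is available
    flepimop-inference-main -c config_sample_2pop.yml -j 1 -n 1 -k 1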

    Common errors

    Common error messages and how to debug them

    Docker Issues

    Docker Issues - storage full
    • Storage full on submission box

      • there can be many different error patterns when storage runs short

    In this case, a common problem is that multiple unused docker containers remain on the submission box. A simple solution is to prune unused containers:

    or the following:

    Docker volumes cannot be changed even if the container is updated and relaunched
    • A Docker volume will not be changed and remains once it has been created (except mounts with -v). That is to say, it "exists independently" of the container.

    • This is observed as well when using docker-compose. If you define and use volumes: in a docker-compose.yml file, be cautious that the created volume will not be removed after invoking docker compose down.

    Synchronizing Files

    The flepimop pipeline typically requires a large set of input files and can produce a large set of output files, particularly for calibration runs, which can be challenging to move around to different machines/storage systems for analysis or backup. To address this need the flepimop sync tool can move files to and from the working location. Currently the flepimop sync command supports three underlying tools:

    • rsync: Generally for use on inputs and outputs between a local machine and an HPC,

    • aws s3 sync: Generally for long term record of outputs or external sharing, and

    • git: For version controlled elements, like pre-/post-processing scripts, configuration files, model inputs, etc.

    Used directly, these underlying tools are flexible, but complex. When abstracted by flepimop sync these tools can be used indirectly with a simplified, but limited interface. This trade off makes it easy to have reproducible and cross environment tooling support but fails to address more complicated use cases.

    For a particular project multiple sync "protocols" can be defined, associated with different tasks. By default the first protocol will be used, but users can also specify a sync "protocol" to use explicitly. A sync section is defined by a top-level sync key, mapping to any number of keys (which name the protocols). Each "protocol" has a type key, indicating the underlying tool, and any necessary configuration options (see following sections for the necessary fields by type). An example of what a sync section in a configuration file might look like is:

    All of these tools have push vs pull modes, so protocols need to define that direction. We distinguish these by setting source and target locations for rsync and s3sync or by setting mode for git. The directionality can be flipped with the --reverse flag.
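    For example (a sketch, using the generic config filename from the examples below):

    # Run the default (first) protocol in its configured direction
    flepimop sync myconfig.yml
    # Flip source and target for that same protocol with the --reverse flag described above
    flepimop sync --reverse myconfig.yml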

    Both rsync and s3sync protocols support filters to include or exclude files. By default, all sync actions will include everything within the source definition, and if the source is a directory, it will be recursively crawled. For the git protocol, which does not support filters, users can take advantage of the .gitignore files provided by git. To modify this behavior for other protocols, you can use filters either as part of the protocol definition OR with options provided when invoking sync. See the [Filters](#filters) section below for details about filtering, but in general sync uses the rsync conventions for including/excluding files.

    Protocols

    rsync Protocol

    The rsync mode is intended to be used on a local machine to sync with another machine running an rsync server, for example an HPC system. Without special setup, you won't be able to initiate sync from that "other" system back to your local machine. Typically personal laptops will not be running an rsync server, but shared resources like HPCs will. You can still get files from the HPC to that local machine, you just have to run the flepimop sync command on your machine possibly with the --reverse flag depending on how the protocol is configured.

    A template for configuring an rsync protocol is:

    While the source and target values should have the same format, they should not be the same value, as this would result in a no-op. Typically one of source or target will be a remote directory on an HPC environment and the other will be a local directory, so the sync protocol can be used to move model inputs/outputs from the compute environment to local environments, or vice versa. The examples below will help guide you through the details of how to set this up with some concrete applications.

    Example: Pushing Inputs

    Let's say you have some inputs that you generate by hand on your personal machine, that you need to push to an HPC ahead of running work on it. You might define a protocol as:

    Note: we are assuming you have set up your .ssh/config file to define your username, credentials location, host details, etc - so here longleaf corresponds to the host name of an HPC system. If you haven't done that, flepimop makes no guarantees about handling prompts for username, password, etc when using sync.

    When the files were ready, you could then run from your local project folder:

    Or, if necessary, using the -(-p)rotocol=pushlongleaf option to identify pushlongleaf as the sync protocol to execute (when you have multiple protocols specified and pushlongleaf wasn't the first / default protocol).
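    For instance, a sketch of selecting that protocol explicitly:

    # Explicitly run the 'pushlongleaf' protocol rather than the default one
    flepimop sync --protocol=pushlongleaf myconfig.yml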

    Example: Pulling Outputs

    Now imagine you have run your flepimop pipeline on the HPC, and you want to pull the results back to your local machine to do some plotting or analysis. You might define a protocol as:

    Then

    would pull the results from "model_output" matching SOME_RUN_INDEX. If you're iterating on some model specification, you might be working through a series of run indices. Rather than revising the configuration file repeatedly, you could instead call:

    This would fetch only the results associated with the defined $FLEPI_RUN_INDEX and group them together in a corresponding subfolder of model_output.

    Of course, if your local machine still had earlier results, sync will automatically understand that those files haven't changed and that it only needs to fetch new run results.

    s3sync Protocol

    The aws s3 sync mode is intended to be used to get results to and from long term storage on AWS S3. That should generally be snapshotting a "final" analysis run, rather than troubleshooting results during development towards such a run. Use of this tool assumes that you have already taken two steps. First, that aws s3 sync is available on the command line, which might require e.g. module load s3 or adjusting your $PATH such that the aws command line interface is available without having to provide a fully qualified location. This should be handled for you automatically by batch/hpc_init on either the Longleaf or Rockfish HPCs. Second, that your credentials are set up such that you can directly invoke aws s3 sync without having to provide a username, etc.

    Example: Pushing Results

    Imagine that you've got some final results and it's time to send them to S3 (e.g. for a dashboard to pull from). You could define a protocol as:

    Note the distinction here where target starts with s3:// - that defines that this end is the s3 bucket. A valid s3sync protocol requires at least one end to be an s3 bucket, and thus to start with s3://. Furthermore note that there is no trailing slash for the model_output directory; similarly to the rsync protocol, this tells the flepimop sync command to sync the whole model_output directory to s3://idd-inference-runs/myproject. If there had been a trailing slash (i.e. model_output/) then the contents of that directory would have been synced to s3://idd-inference-runs/myproject instead.

    You could then

    To send your outputs to the s3 bucket.

    git Protocol

    Though git is fairly straightforward to use directly, we also provide a simplified sync mode associated with git to ensure a model has the latest code elements associated with it. Practically, this can be used on either a local machine or an HPC setup to ensure that you have the latest version, or that if you have made changes, those have been pushed to the authoritative reference.

    In general, git mode is much simpler to specify and use than the other two options, since it's for different concerns. An example configuration protocol looks like:

    The git mode is simply a wrapper around normal git operations and expects that you are dealing with a normal git flow for staging files, making commits, marking files to be ignored, etc. It will issue a warning and take the following actions when various conditions are met:

    • halt: there are staged-but-not-committed files (e.g. git add/rm ... operations, but not yet committed)

    • halt: there are unstaged changes to tracked files

    • warn: changes to files which are untracked, but are also unignored

    If there are no issues with the repository, sync will fetch the authoritative repository version, attempting to update the local repository. If there are any merge conflicts, the sync operation will fail and refer you to the normal process for resolving such conflicts.

    Filters

    Filtering happens by applying include or exclude filters in sequence. A filter is a string that starts either with a "- " for an exclude filter, "+ " for an include filter, "s " for a substring filter, or none of those, which defaults to an include filter. Filters can include * or ** for file or file/directory globs - see the particular tool documentation for more supported patterns. We adopt the rsync convention where earlier filters in the sequence have precedence over filters applied later, which flepiMoP translates to other tools' conventions as necessary. So an "+ *" as the first filter means "include everything" and has precedence over subsequent filters. Similarly, an initial "- *" filter, meaning exclude everything, would block all subsequent inclusions specified. Substring filters are resolved by the protocol to only include paths with a user specified substring in them.

    For convenience, when users provide only include filters (after resolving all configuration file(s) and any command line options), this is interpreted as "include whatever matches this filter, and then exclude everything else". This happens by automatically adding a - ** as the final (lowest precedence) filter.

    In configuration files, the filter key is filters within a supporting protocol type. The value of that key can be a single string or a list of strings (in either square-bracket or bullet form). The left-to-right (or top-to-bottom) order determines which filter is first vs last.
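    For example, a minimal sketch (the protocol name and paths are hypothetical) of a list-form filters key that relies on the implicit trailing - ** described above:

    sync:
      pullhpc:                                          # hypothetical protocol name
        type: rsync
        source: "longleaf:~/flepiproject/model_output"  # hypothetical remote directory
        target: model_output
        filters:
          - "+ *.parquet"   # earlier filters take precedence ...
          - "+ *.csv"       # ... over later ones
          # only include filters are given here, so an implicit '- **'
          # (exclude everything else) is appended automatically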

    When invoked on the command line, you can also specify changes to the filters in a few ways:

    • -(-f)ilter option(s) to override any configuration file filters. To provide multiple stages of filters, simply provide the option multiple times: -f'+ include.me' -f'- *.me' would include include.me and exclude all other .me files.

    • -e|--fsuffix and -a|--fprefix option(s) to prefix and/or suffix filter(s) to the core filter (which can be from the configuration file, or via -f override(s)). If there are no configuration-based filters, these are equivalent to just using -f filters.

    • --no-filter overrides specified configuration filter(s) to be an empty list; cannot be combined with -f|a|e options.

    Overrides And Appends

    Similarly to the filtering options above, flepimop sync provides the --source and --target options for overriding or appending to the source and target of a protocol. Providing --source or --target with a new path will outright override the value provided by the sync protocol configuration. If the value provided to --source or --target starts with '+ ' then this override will instead be appended. The append will respect whether the source/target being appended to ends or does not end with a separator. Consider the following example where the columns correspond to the effect of --source or --target and the rows correspond to the value of the source or target in the configuration:

    Troubleshooting

    Before running a flepimop sync command for the first time it is helpful to take advantage of the --dry-run flag to see what the command would do without actually running the command. The output of this can be quite verbose, especially when using -vvv for full verbosity, so it can be helpful to pipe the output of the dry run to a text file for inspection.
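    For example, a sketch of such a dry run with full verbosity, piped to a file for later inspection (filenames are illustrative):

    # Preview what the default protocol would transfer, without actually copying anything
    flepimop sync --dry-run -vvv myconfig.yml > sync_dry_run.log 2>&1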

    Applications: gempyor resume and continue operations

    The gempyor approaches to projection and inference support resuming from previously completed work.

    Resuming Inference

    Communication Between Iterations

    The pipeline uses files to communicate between different iterations. Currently, the following file types exist:

    • seed

    • init

    • snpi

    • spar

    • seir

    • hpar

    • hnpi

    • hosp

    • llik

    During each iteration, inference uses these files to communicate with the compartmental model and outcomes. The intent is that inference should only need to read and write these files, and that the compartmental model can handle everything else. In addition to the global versions of these files actually passed to the compartmental/reporting model, there exist chimeric versions used internally by inference and stored in memory. These copies are what inference interacts with when it needs to perturb values. While this design was chosen primarily to support modularity (a fixed communication boundary makes it easy to swap out the compartmental model), it has had a number of additional benefits.

    Bootstrapping

    The first iteration of an MCMC algorithm is a special case, because we need to pull initial conditions for our parameters. We originally developed the model without inference in mind, so the compartmental model is already set up to read parameter distributions from the configuration file, and to draw values from those distributions, and record those parameters to file. We take advantage of this to bootstrap our MCMC parameters by running the model one time, and reading the parameters it generated from file.

    Resume from previous run

    Instead of bootstrapping our first iteration, flepiMoP supports reading in the final values of a previous run. This allows us to resume from previous runs to save computational time and effectively continue iterating on the same chain. We call these resumes, in which inferred parameters are taken from a previous run and allowed to continue being inferred.

    Resumes take the following files, if they exist, from previous runs and use them as the starting point of a new run:

    • hnpi

    • snpi

    • seed

    So a resume protocol for sync, to fetch previously computed results, might look something like:

    Continuing projection

    In addition to resuming parameters, we can also perform a continuation resume. In addition to resuming parameters and seeding, continuations also use the compartmental fits from previous runs. For a config starting at a given time and continuing/resuming from a previous run, the compartmental states of the previous run at that time are used as the initial conditions of the continuation resume.

    Saving Model Outputs To AWS S3 With flepimop batch-calibrate

    For details on how to do this please refer to the Saving Model Outputs On Batch Inference Job Finish guide for the latest information.

    Tips, tricks, FAQ

    All the little things to save you time on the clusters

    Deleting model_output/ (or any big folder) is too long on the cluster

    Yes, it takes ages because IO can be so slow, and there are many small files. If you are in a hurry, you can do

    The first command renames/moves model_output and is instantaneous; you can now re-run something. To delete the renamed folder, run the second command. The & at the end makes it execute in the background.

    Guidelines for contributors

    All are welcome to contribute to flepiMoP! The easiest way is to open an issue on GitHub if you encounter a bug or if you have an issue with the framework. We would be very happy to help you out.

    If you want to contribute code, fork the flepiMoP repository, modify it, and submit your Pull Request (PR). In order to be merged, a pull request needs:

    • the approval of two reviewers AND

    • the continuous integration (CI) tests passing.

    Useful commands

    Git setup

    Type the following line so git remembers your credential and you don't have to enter your token 6 times per day:

    Get a notification on your phone/mail when a run is done

    Running on Rockfish/MARCC - JHU 🪨🐠

    or any HPC using the slurm workload manager

    🗂️ Files and folder organization

    Rockfish administrators provided several partitions with different properties. For our needs (storage intensive and shared environment), we work in the /scratch4/struelo1/ partition, where we have 20T of space. Our folders are organized as:

    Git and GitHub Usage

    We now use a modified gitflow style workflow for working with git and GitHub. For a detailed overview of this topic please refer to Atlassian's article on Gitflow workflow.

    New Features

    New features should be developed in a new branch checked out from the dev branch and then merged back into the dev branch via a PR on GitHub when ready for review. These feature branches can be deleted after merging into dev, unless someone from operations requests that they be kept around. For example, operations may want to merge the feature into their operational branch to get new functionality in advance of a release. By convention feature branches should be prefixed with feature/<GitHub issue>/, i.e. feature/99/cool-new-thing. Feature branches should also include edits to the GitBook documentation that describe their changes.

    Running with docker on AWS - OLD probably outdated

    This page, along with the other AWS run guides, are not deprecated in case we need to run flepiMoP on AWS again in the future, but also are not maintained as other platforms (such as longleaf and rockfish) are preferred for running production jobs.

    For large simulations, running the model on a cluster or cloud computing is required. AWS provides a good solution for this, if funding or credits are available (AWS can get very expensive).

    Running on AWS 🌳
    Running On A HPC With Slurm

    outcomes.py - Contains the core code for generating the outcome variables. Takes in the output of the mathematical model and parameters from the config, and outputs a file with a timeseries of the value of each outcome (observed) variable

  • simulate_outcomes.py -

  • setup.py

  • file_paths.py -

  • compartments.py

  • parameters.py

  • results.py

  • seeding_ic.py

  • /NPI/

    • base.py -

    • SinglePeriodModifier.py -

    • MultiPeriodModifier.py -

    • SinglePeriodModifierInterven.py -

  • /dev - contains functions that are still in development

  • /data - ?

  • safe_eval.R

  • compartments.R

  • inference - contains code to

    • groundtruth.R - contains functions for pulling ground truth data from various sources. Calls functions in the flepicommon package

    • functions.R - contains many functions used in running the inference algorithm

    • inference_slot_runner_funcs.R - contains many functions used in running the inference algorithm

    • inference_to_forecast.R -

    • documentation.Rmd - Summarizes the documentation relevant to the inference package, including the configuration file options relevant to model fitting

    • InferenceTest.R -

    • /tests/ -

  • config.writer

    • create_config_data.R

    • process_npi_list.R

    • yaml_utils.R

  • report.generation

    • DataLoadFuncs.R

    • ReportBuildUtils.R

    • ReportLoadData.R

    • setup_testing_environment.R

  • outcomes_ratios.csv

  • US_CFR_shift_dates_v3.csv

  • US_hosp_ratio_corrections.csv

  • seeding_agestrat_RX.csv


    Hot Fixes

    Hot fixes should be developed in a new branch checked out from the main branch and merged back into the main branch via a PR on GitHub when ready for review. After successfully merging into main the hot fix branch should then be merged into dev, making appropriate adjustments to stabilize the feature. The priority for hot fixes is to correct a major issue quickly, so it is okay to delay detailed testing/documentation until merging into dev. By convention hot fix branches should be prefixed with hotfix/, i.e. hotfix/important-fix-to-something, and then converted into a feature branch after merging into main. Hot fixes do not have to include edits to the GitBook documentation, but if the hotfix conflicts with what is described in the GitBook documentation, updating the documentation is strongly recommended.

    Creating Releases

    Periodically releases will be created by merging the dev branch into main via a PR on GitHub and creating a new release from the main branch after merging. These PRs should avoid discussion of individual feature changes; those discussions should be reserved for and handled in the feature PRs. If there is a feature that poses a significant problem in the process of creating a new release those changes should be treated like a new feature. The main purpose of this PR is to:

    1. Resolve merge conflicts generated by hot fixes,

    2. Make minor edits to documentation to make it clearer or more cohesive, and

    3. Update the NEWS.md file with a summary of the changes included in the release.

    Operations

    Operational work should be developed in a new branch checked out from the main branch if there are modifications needed to the flepiMoP codebase. Pre-released features can be merged directly into this operational branch from the corresponding feature branch as needed via a git merge or rebase, not a GitHub PR. After the operational cycle is over, the operations branch should not be deleted; instead it should be kept around for archival reasons. Operational work needs to move quickly and usually does not involve documenting or testing code and is therefore unsuitable for merging into dev or main directly. Instead, potential features should be extracted from an operations branch into a feature branch using git cherry-pick and then modified into an appropriate state for merging into dev like a feature branch. By convention operations branch names should be prefixed with operations/, i.e. operations/flu-SMH-2023-24.


    A workaround is to delete the Docker volume beforehand:

    # Prune unused Docker data (stopped containers, dangling images, etc.)
    docker system prune
    # Remove all stopped containers
    docker rm $(docker ps -a -q)
    # or force-remove all containers
    docker rm -f $(docker ps -a -q)
    # Remove a specific volume
    docker volume rm <specified volume name>
    # or remove all unused volumes
    docker volume prune
    | Configuration value | No --source or --target | --source /g/h/i | --source '+ g/h/i' | --source '+ g/h/i/' | --target /j/k/l | --target '+ j/k/l' | --target '+ j/k/l/' |
    | --- | --- | --- | --- | --- | --- | --- | --- |
    | source: '/a/b/c' | /a/b/c | /g/h/i | /a/b/c/g/h/i | /a/b/c/g/h/i | /a/b/c | /a/b/c | /a/b/c |
    | source: '/a/b/c/' | /a/b/c/ | /g/h/i | /a/b/c/g/h/i/ | /a/b/c/g/h/i/ | /a/b/c/ | /a/b/c/ | /a/b/c/ |
    | target: '/d/e/f' | /d/e/f | /d/e/f | /d/e/f | /d/e/f | /j/k/l | /d/e/f/j/k/l | /d/e/f/j/k/l |
    | target: '/d/e/f/' | /d/e/f/ | /d/e/f/ | /d/e/f/ | /d/e/f/ | /j/k/l | /d/e/f/j/k/l/ | /d/e/f/j/k/l/ |

    Use seff to analyze a job

    After a job has run (either to completion, or after it was terminated or failed), you may run:

    to see how many resources your job used on its node, what caused it to terminate, and so on. If you don't remember the JOB_ID, look for the number in the filename of the slurm log (slurm_{JOB_ID}.out).

    mv model_output/ model_output_old
    rm -r model_output_old &
    seff JOB_ID

    We use ntfy.sh for notifications. Install ntfy on your iPhone or Android device, then subscribe to the channel ntfy.sh/flepimop_alerts where you'll receive notifications when runs are done.

    • End-of-job notifications go out with urgent priority.

    Install slack integration

    Within included example postprocessing scripts, we include a helper script that sends a slack message with some output snapshots of our model output. So our 🤖-friend can send us some notifications once a run is done.

    Delphi Epidata API

    If you are using the Delphi Epidata API, first register for a key. Once you have a key, add it below where you see [YOUR API KEY]. Alternatively, you can put that key in your config file in the inference section as gt_api_key: "YOUR API KEY".
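    A minimal sketch of the config-file alternative mentioned above:

    inference:
      gt_api_key: "YOUR API KEY"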

    🚀 Run inference using slurm (do everytime)

    TODO: add how to run test, and everything

    Don't paste them if you don't know what they do

    Filepaths structure

    in configs with a setup name: USA

    where, eg:

    • the index is 1

    • the run_id is 2021.12.14.23:56:12.CET

    • the prefix is USA/inference/med/2021.12.14.23:56:12.CET/global/intermediate/000000001.

    Steps to first local run

    where:

    • nnn is slots

    • jjj is core

    • kkk is iteration per slot

    Launch the docker locally

    Pipeline git-fu (dealing with the commute_data)

    because a big file gets changed and added automatically. Since Git 2.13 (Q2 2017), you can stash individual files with git stash push. One of these should work.

    code-folder:
    /scratch4/struelo1/flepimop-code/
    where each user has their own subfolder, from which the repos are cloned and the runs are launched, e.g. for user chadi, we'll find:
    • /scratch4/struelo1/flepimop-code/chadi/covidsp/Flu_USA

    • /scratch4/struelo1/flepimop-code/chadi/COVID19_USA

    • /scratch4/struelo1/flepimop-code/chadi/flepiMoP

    • ...

    • (We keep separate repositories per user so that different versions of the pipeline are not mixed when we run several runs in parallel.) Don't hesitate to create other subfolders in the code folder (/scratch4/struelo1/flepimop-code/chadi-flusight, ...) if you need them.

    Note that the repository is cloned flat, i.e the flepiMoP repository is at the same level as the data repository, not inside it!

    • output folder: /scratch4/struelo1/flepimop-runs stores the run outputs. After an inference run finishes, its output and the log files are copied from $PROJECT_PATH/model_output to /scratch4/struelo1/flepimop-runs/THISRUNJOBNAME where the jobname is usually of the form USA-DATE.

    When logging on you'll see two folders data_struelo1 and scr4_struelo1, which are shortcuts to /data/struelo1 and /scratch4/struelo1. We don't use data/struelo1.

    Login on rockfish

    Using ssh from your terminal, type in:

    and enter your password when prompted. You'll be on Rockfish's login node, which is a remote computer whose only purpose is to prepare and launch computations on the so-called compute nodes.

    🧱 Setup (to be done only once per USER )

    Load the right modules for the setup:

    Now, type the following line so git remembers your credential and you don't have to enter your token 6 times per day:

    Now you need to create the conda environment. You will create the environment in two shorter commands, installing the Python and R dependencies separately. This can take extremely long if done in one command, so doing it in two helps. These commands are still quite long, so you'll have time to brew some nice coffee ☕️:

    Clone the FlepiMoP and other model repositories

    Use the following commands to have git clone the FlepiMoP repository and any other model repositories you'd like to work on through https. In the code below, $USER is a variable that contains your username.

    You will be prompted to provide your GitHub username and password. Note that since 2021, GitHub has replaced the use of passwords with personal access tokens, so the prompted "password" is not the password you use to log in. Instead, we recommend using the safer ssh protocol to clone GitHub repositories. To do so, first generate an ssh private-public keypair on the Rockfish cluster and then copy the generated public key from the Rockfish cluster to your local computer by opening a terminal and running,

    scp -r <username>@rfdtn1.rockfish.jhu.edu:/home/<username>/.ssh/<key_name.pub> .

    Then add the public key to your GitHub account. Next, make a file ~/.ssh/config by using the command vi ~/.ssh/config. Press 'I' to go into insert mode and paste the following chunk of code,

    Press 'esc' to exit INSERT mode, followed by ':x' to save and exit the file. By adding this configuration file, you make sure Rockfish doesn't forget your ssh key when you log out. Now clone the GitHub repositories as follows,

    and you will not be prompted for credentials.

    Setup your Amazon Web Services (AWS) credentials

    This can be done in a second step -- but it is necessary in order to push and pull data to Amazon Simple Storage Service (S3). Set up AWS by running,

    Then run ./aws-cli/bin/aws configure to set up your credentials,

    To get the (secret) access key, ask the AWS administrator (Shaun Truelove) to generate them for you.

    🚀 Run inference using slurm (do everytime)

    log-in to rockfish via ssh, then type:

    which will prepare the environment and set up variables for the validation date (chosen as the day after end_date_groundtruth), the resume location and the run index for this run. If you don't want to set a variable, just hit enter.

    Note that now the run-id of the run we resume from is automatically inferred by the batch script :)

    what does this do || it returns an error

    This script runs the following commands to setup up the environment, which you can run individually as well.

    and then it prompts you to set the following 3 environment variables. You can skip this part and do it later manually.

    Check that the conda environment is activated: you should see (flepimop-env) on the left of your command-line prompt.

    Then prepare the pipeline directory (if you have already done that, and the pipeline hasn't been updated (git pull says it's up to date), then you can skip these steps).

    Now flepiMoP is ready 🎉. If the R command doesn't work, try r, and if that doesn't work run module load r/4.0.2.

    The next step is to set up the data. First set $PROJECT_PATH to your data folder, and set any data options. If you are using the Delphi Epidata API, first register for a key here: https://cmu-delphi.github.io/delphi-epidata/. Once you have a key, add it below where you see [YOUR API KEY]. Alternatively, you can put that key in your config file in the inference section as gt_api_key: "YOUR API KEY".

    For a COVID-19 run, do:

    for Flu do:

    Now for any type of run:

    Do some clean-up before your run. The fast way is to restore the $PROJECT_PATH git repository to its blank state (⚠️ removes everything that does not come from git):

    I want more control over what is deleted

    if you prefer to have more control, delete the files you like, e.g

    If you still want to use git to clean the repo but want finer control, or want to understand how dangerous the command is, read this.

    Run the preparatory script for the data and you are good,

    If you want to profile how the model is using your memory resources during the run:

    Now you may want to test that it works :

    If this fails, you may want to investigate the error. If it succeeds, you can proceed by first deleting the model_output:

    Launch your inference batch job

    When an inference batch job is launched, a few postprocessing scripts (postprocessing-scripts.sh) are called to run automatically. You can manually change what you want to run by editing this script.

    Now you're fully set to go 🎉

    To launch the whole inference batch job, type the following command:

    This command infers everything from your environment variables: whether there is a resume or not, what the run_id is, etc. The part after the "2" makes sure the output is redirected to a file for logging, but has no impact on your submission.

    If you'd like to have more control, you can specify the arguments manually:

    If you want to send any post-processing outputs to #flepibot-test instead of csp-production, add the flag --slack-channel debug

    Commit files to Github. After the job is successfully submitted, you will now be in a new branch of the data repo. Commit the ground truth data files to the branch on github and then return to the main branch:

    but DO NOT finish up by checking out main as in the AWS instructions, as the run will use data in the current folder.

    Monitor your run

    TODO JPSEH WRITE UP TO HERE

    Two types of logfiles exist: in `$PROJECT_PATH`, slurm-JOBID_SLOTID.out files, and filter_MC logs:

    ```
    tail -f /scratch4/struelo1/flepimop-runs/USA-20230130T163847/log_FCH_R16_lowBoo_modVar_ContRes_blk4_Jan29_tsvacc_100.txt
    ```

    Helpful commands

    When approaching the file number quota, type

    to find which subfolders contains how many files

    Common errors

    • Check that the python comes from conda with which python if some weird missing-package errors appear. Sometimes conda magically disappears.

    • Don't use ipython as it breaks click's flags

    cleanup:

    Get a notification on your phone/mail when a run is done

    We use ntfy.sh for notifications. Install ntfy on your iPhone or Android device, then subscribe to the channel ntfy.sh/flepimop_alerts where you'll receive notifications when runs are done.

    • End-of-job notifications go out with urgent priority.

    How to use slurm

    Check your running jobs:

    where job_id has your full array job_id and each slot after the underscore. You can see their status (R: running, P: pending), how long they have been running, and so on.

    To cancel a job

    Running an interactive session

    To check your code prior to submitting a large batch job, it's often helpful to run an interactive session to debug your code and check everything works as you want. On 🪨🐠 this can be done using interact like the below line, which requests an interactive session with 4 cores, 24GB of memory, for 12 hours.

    The options here are [-n tasks or cores], [-t walltime], [-p partition] and [-m memory], though other options can also be included or modified to your requirements. More details can be found on the ARCH User Guide.

    Moving files to your local computer

    Often you'll need to move files back and forth between Rockfish and your local computer. To do this, you can use Open-On-Demand, or any other command line tool.

    scp -r <user>@rfdtn1.rockfish.jhu.edu:"<file path of what you want>" <where you want to put it in your local>

    Installation notes

    These steps are already done and affect all users, but might be interesting in case you'd like to run on another cluster.

    Install slack integration

    So our 🤖-friend can send us some notifications once a run is done.

    sync:
      protocolA: # the default protocol: fetch hpc results to local directory
        type: rsync
        source: host:someproj/model_output
        target: some/local/dir
      protocolB: # a protocol to store results to S3
        type: s3sync
        source: some/hpc/model_output
        target: s3://s3bucket/someproj/model_output
    sync:
      <protocol name>:              # User supplied name for referencing protocol explicitly
        type: rsync
        source: <path to source>   # A path to a source, can be a local directory like `/abc/def` or `~/ghi` or a remote directory like `user@machine:~/xyz`
        target: <path to target>   # A path to a target with the same format as source
        filters:                   # An optional set of filters to apply in order, if not provided then no filters are used.
          - <optional filter one>
          - <optional filter two>
          ...
    sync:
      pushlongleaf: # defines the protocol name
        type: rsync # defines this is an rsync protocol
        source: model_input # what *local* folder to sync from
        target: longleaf:~/flepiproject/model_input # what *remote* project folder to sync to
    $ flepimop sync myconfig.yml
    sync:
      pulllongleaf: # defines the protocol name
        type: rsync # defines this is an rsync protocol
        source: "longleaf:~/flepiproject/model_output" # what *remote* project folder to sync from
        target: model_output # what *local* folder to sync to
        filters: '+ *.SOME_RUN_INDEX.*' # an optional match-only SOME_RUN_INDEX filter
    $ flepimop sync myconfig.yml
    $ flepimop sync -f'+ $FLEPI_RUN_INDEX' --target=model_output/$FLEPI_RUN_INDEX myconfig.yml
    sync:
      snapshots3: # defines the protocol name
        type: s3sync # defines this is an aws s3 sync protocol
        source: model_output # what folder to sync from
        target: s3://idd-inference-runs/myproject # what *remote* project folder to sync to
    $ flepimop sync myconfig.yml
    sync:
      checkcode:
        type: git
    sync:
      resumerun:
        type: s3sync
        source: s3://mybucket/myproject/
        target: model_output
        filters: ["*hnpi*", "*snpi*", "*seed*", "- *"]
    export DELPHI_API_KEY="[YOUR API KEY]"
    git config --global credential.helper store
    git config --global user.name "{NAME SURNAME}"
    git config --global user.email YOUREMAIL@EMAIL.COM
    git config --global pull.rebase false # so you use merge as the default reconciliation method
    cd /scratch4/struelo1/flepimop-code/
    nano slack_credentials.sh
    # and fill the file:
    export SLACK_WEBHOOK="{THE SLACK WEBHOOK FOR CSP_PRODUCTION}"
    export SLACK_TOKEN="{THE SLACK TOKEN}"
    
    model_output/{FileType}/{Prefix}{Index}.{run_id}.{FileType}.{Extension}
                               ^ 
                              setup name(USA)/scenario(inference/med)/run_id/{Inference stuff}
                                                                               ^ global/{final, intermediate}/slot#.
    export COVID_PATH=$(pwd)/COVIDScenarioPipeline
    export PROJECT_PATH=$(pwd)/COVID19_USA
    conda activate covidSP
    cd $COVID_PATH
    Rscript local_install.R
    pip install --no-deps -e gempyor_pkg # before: python setup.py develop --no-deps
    git lfs install
    git lfs pull
    export CENSUS_API_KEY=YOUR_KEY
    cd $PROJECT_PATH
    git restore data/
    export CONFIG_PATH=config_smh_r11_optsev_highie_base_deathscases_blk1.yml
    Rscript $COVID_PATH/R/scripts/build_US_setup.R
    Rscript $COVID_PATH/R/scripts/create_seeding.R
    Rscript $COVID_PATH/R/scripts/full_filter.R -j 1 -n 1 -k 1
    docker pull hopkinsidd/covidscenariopipeline:latest-dev
    docker run -it -v "$(pwd)":/home/app/covidsp hopkinsidd/covidscenariopipeline:latest-dev
    git restore --staged sample_data/united-states-commutes/commute_data.csv
    git stash push sample_data/united-states-commutes/commute_data.csv
    git reset sample_data/united-states-commutes/commute_data.csv
    module purge
    module load gcc/9.3.0
    module load git
    module load git-lfs
    module load slurm
    module load anaconda3/2022.05
    conda activate flepimop-env
    export CENSUS_API_KEY={A CENSUS API KEY}
    export FLEPI_RESET_CHIMERICS=TRUE
    export FLEPI_PATH=/scratch4/struelo1/flepimop-code/$USER/flepiMoP
    
    # And then it asks you some questions to set up some environment variables
    export VALIDATION_DATE="2023-01-29"
    export RESUME_LOCATION=s3://idd-inference-runs/USA-20230122T145824
    export FLEPI_RUN_INDEX=FCH_R16_lowBoo_modVar_ContRes_blk4_Jan29_tsvacc
    rm -rf model_output data/us_data.csv data-truth &&
       rm -rf data/mobility_territories.csv data/geodata_territories.csv &&
       rm -rf data/seeding_territories.csv && 
       rm -rf data/seeding_territories_Level5.csv data/seeding_territories_Level67.csv
    
    # don't delete model_output if you have another run in //
    rm -rf $PROJECT_PATH/model_output
    # delete log files from previous runs
    rm *.out
    ssh {YOUR ROCKFISH USERNAME}@login.rockfish.jhu.edu
    module purge
    module load gcc/9.3.0
    module load anaconda3/2022.05  # very important to pin this version as others are buggy
    module load git                # needed for git
    module load git-lfs            # git-lfs (do we still need it?)
    git config --global credential.helper store
    git config --global user.name "{NAME SURNAME}"
    git config --global user.email YOUREMAIL@EMAIL.COM
    git config --global pull.rebase false # so you use merge as the default reconciliation method
    # install all python stuff first
    conda create -c conda-forge -n flepimop-env numba pandas numpy seaborn tqdm matplotlib click confuse pyarrow sympy dask pytest scipy graphviz emcee xarray boto3 slack_sdk
    
    # activate the enviromnment and install the R stuff
    conda activate flepimop-env
    conda install -c conda-forge r-readr r-sf r-lubridate r-tigris r-tidyverse r-gridextra r-reticulate r-truncnorm r-xts r-ggfortify r-flextable r-doparallel r-foreach r-arrow r-optparse r-devtools r-tidycensus r-cdltools r-cowplot 
    cd /scratch4/struelo1/flepimop-code/
    mkdir $USER
    cd $USER
    git clone https://github.com/HopkinsIDD/flepiMoP.git
    git clone https://github.com/HopkinsIDD/Flu_USA.git
    # or any other model repositories
    Host github.com
        User git
        IdentityFile ~/.ssh/<key_name>
    cd /scratch4/struelo1/flepimop-code/
    mkdir $USER
    cd $USER
    git clone git@github.com:HopkinsIDD/flepiMoP.git
    git clone git@github.com:HopkinsIDD/Flu_USA.git
    # or any other model repositories
    cd ~ # go in your home directory
    curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
    unzip awscliv2.zip
    ./aws/install -i ~/aws-cli -b ~/aws-cli/bin
    ./aws-cli/bin/aws --version
    # AWS Access Key ID [None]: Access key
    # AWS Secret Access Key [None]: Secret Access key
    # Default region name [None]: us-west-2
    # Default output format [None]: json
    source /scratch4/struelo1/flepimop-code/$USER/flepiMoP/batch/slurm_init.sh
    cd /scratch4/struelo1/flepimop-code/$USER
    export FLEPI_PATH=$(pwd)/flepiMoP
    cd $FLEPI_PATH
    git checkout main
    git pull
    
    # install dependencies ggraph and tidy graph
    R
    > install.packages(c("ggraph","tidygraph"))
    > quit()
    
    # install the R module
    Rscript build/local_install.R # warnings are ok; there should be no error.
    
    # install gempyor
    pip install --no-deps -e flepimop/gempyor_pkg/
    cd /scratch4/struelo1/flepimop-code/$USER
    export PROJECT_PATH=$(pwd)/COVID19_USA
    export GT_DATA_SOURCE="csse_case, fluview_death, hhs_hosp"
    export DELPHI_API_KEY="[YOUR API KEY]"
    cd /scratch4/struelo1/flepimop-code/$USER
    export PROJECT_PATH=$(pwd)/Flu_USA
    cd $PROJECT_PATH
    git pull 
    git checkout main
    git reset --hard && git clean -f -d  # this deletes everything that is not on github in this repo !!!
    export CONFIG_PATH=config_FCH_R16_lowBoo_modVar_ContRes_blk4_Jan29_tsvacc.yml
    Rscript $FLEPI_PATH/datasetup/build_US_setup.R
    
    # For covid do
    Rscript $FLEPI_PATH/datasetup/build_covid_data.R
    
    # For Flu (do not do this for the scenariohub!)
    R
    > install.packages(c("RSocrata"))
    Rscript $FLEPI_PATH/datasetup/build_flu_data.R
    
    # build seeding
    Rscript $FLEPI_PATH/datasetup/build_initial_seeding.R
    export FLEPI_MEM_PROFILE=TRUE
    export FLEPI_MEM_PROF_ITERS=50
    flepimop-inference-main -c $CONFIG_PATH -j 1 -n 1 -k 1 
    rm -r model_output
python $FLEPI_PATH/batch/inference_job_launcher.py --slurm 2>&1 | tee ${FLEPI_RUN_INDEX}_submission.log
    python $FLEPI_PATH/batch/inference_job_launcher.py --slurm \
                        -c $CONFIG_PATH \
                        -p $FLEPI_PATH \
                        --data-path $PROJECT_PATH \
                        --upload-to-s3 True \
                        --id $FLEPI_RUN_INDEX \
                        --fs-folder /scratch4/struelo1/flepimop-runs \
                        --restart-from-location $RESUME_LOCATION
    git add data/ 
    git commit -m"scenario run initial" 
    branch=$(git branch | sed -n -e 's/^\* \(.*\)/\1/p')
    git push --set-upstream origin $branch
    find . -maxdepth 1 -type d | while read -r dir
     do printf "%s:\t" "$dir"; find "$dir" -type f | wc -l; done 
    rm -r /scratch4/struelo1/flepimop-runs/
    rm -r model_output
    cd $COVID_PATH;git pull;cd $PROJECT_PATH
    rm *.out
    squeue -u $USER
    scancel JOB_ID
    interact -p defq -n 4 -m 24G -t 12:00:00
    cd /scratch4/struelo1/flepimop-code/
    nano slack_credentials.sh
    # and fill the file:
    export SLACK_WEBHOOK="{THE SLACK WEBHOOK FOR CSP_PRODUCTION}"
    export SLACK_TOKEN="{THE SLACK TOKEN}"

    slurm_init.sh, build_US_setup.R

    CONFIG_PATH

    -c, --config

    Path to a configuration file.

    --

    your/path/to/config_file

    build_covid_data.R, build_US_setup.R, build_initial_seeding.R, build_flu_data.R, config.R, preprocessing/ files

    DELPHI_API_KEY

    -d, --delhpi_api_key

    Your personalized key for the Delphi Epidata API. Alternatively, this key can go in the config inference section as gt_api_key.

    --

    build_covid_data.R

    DIAGNOSTICS

    -n, --run-diagnostics

    Flag for whether or not diagnostic tests should be run during execution.

    TRUE

    --run-diagnostics FALSE for FALSE, --run-diagnostics or no mention for TRUE

    run_sim_processing_SLURM.R

    DISEASE

    -i, --disease

Which disease is being simulated in the present run.

    flu

    e.g., rsv, covid

run_sim_processing_SLURM.R

    DVC_OUTPUTS

    Not a CLI option, but defined using --output

    The names of the directories with outputs to save in S3 (separated by a space).

    model_output model_parameters importation hospitalization

    e.g., model_output model_parameters importation hospitalization

    scenario_job.py, AWS_scenario_runner.sh

    FILENAME

    Not a CLI option.

    Filenames for output files, determined dynamically during inference.

    N/A

    file.parquet, plot.pdf

    AWS_postprocess_runner.sh, SLURM_inference_job.run, AWS_inference_runner.sh

    FIRST_SIM_INDEX

    -i, --first_sim_index

    The index of the first simulation.

    1

    int

    shared_cli.py

    FLEPI_BLOCK_INDEX

    -b, --this_block

    Index of current block.

    1

    int

    flepimop-inference-main.R, utils.py, AWS_postprocess_runner.sh, AWS_inference_runner.sh, SLURM_inference_job.run, inference_job_launcher.py

    FLEPI_CONTINUATION

    --continuation/--no-continuation

    Flag for whether or not to use the resumed run seir files (or provided initial files bucket) as initial conditions for the next run.

    FALSE

    --continuation TRUE for TRUE, --continuation or no mention for FALSE

    SLURM_inference_job.run, inference_job_launcher.py

    FLEPI_CONTINUATION_FTYPE

    Not a CLI option.

    If running a continuation, the file type of the initial condition files.

    config['initial_conditions']['initial_file_type']

    e.g., .csv

    SLURM_inference_job.run, inference_job_launcher.py

    FLEPI_CONTINUATION_LOCATION

    --continuation-location

    The location (folder or an S3 bucket) from which to pull the /init/ files (if not set, uses the resume location seir files).

    --

    path/to/your/location

    SLURM_inference_job.run, inference_job_launcher.py

    FLEPI_CONTINUATION_RUN_ID

    --continuation-run-id

    The ID of run to continue at, if doing a continuation.

    --

    int

    SLURM_inference_job.run, inference_job_launcher.py

    FLEPI_INFO_PATH

    Not a CLI option.

    pending

    pending

    pending

    info.py

    FLEPI_ITERATIONS_PER_SLOT

    -k, --iterations_per_slot

    Number of iterations to run per slot.

    --

    int

    flepimop-inference-slot.R, flepimop-inference-main.R, SLURM_inference_job.run, inference_job_launcher.py

    FLEPI_MAX_STACK_SIZE

    --stacked-max

Maximum number of interventions to allow in a stacked intervention.

    5000

    int >=350

    StackedModifier.py, inference_job_launcher.py

    FLEPI_MEM_PROFILE

    -M, --memory_profiling

    Flag for whether or not memory profile should be run during iterations.

    FALSE

    --memory_profiling TRUE for TRUE, --memory_profiling or no mention for FALSE

    flepimop-inference-slot.R, flepimop-inference-main.R, inference_job_launcher.py

    FLEPI_MEM_PROF_ITERS

    -P, --memory_profiling_iters

    If doing memory profiling, after every X iterations, run the profiling.

    100

    int

    flepimop-inference-slot.R, flepimop-inference-main.R, inference_job_launcher.py

    FLEPI_NJOBS

    -j, --jobs

    Number of parallel processors used to run the simulation. If there are more slots than jobs, slots will be divided up between processors and run in series on each.

    Number of cores detected as available at computing cluster.

    int

    flepimop-inference-slot.R, flepimop-inference-main.R, calibrate.py

    FLEPI_NUM_SLOTS

    -n, --slots

    Number of independent simulations of the model to be run.

    --

    int >=1

    flepimop-inference-slot.R, flepimop-inference-main.R, calibrate.py, inference_job_launcher.py

    FLEPI_OUTCOME_SCENARIOS

    -d, --outcome_modifiers_scenarios

    Name of the outcome scenario to run.

    'all'

    pending

    flepimop-inference-slot.R, flepimop-inference-main.R, SLURM_inference_job.run, inference_job_launcher.py

    FLEPI_PATH

    -p, --flepi_path

    Path to the flepiMoP directory.

    'flepiMoP'

    path/to/flepiMoP

    several postprocessing/ files, several batch/ files, several preprocessing/ files, info.py, utils.py, _cli.py

    FLEPI_PREFIX

    --in-prefix

    Unique name for the run.

    --

    e.g., project_scenario1_outcomeA, etc.

    SLURM_inference_job.run, inference_job_launcher.py, AWS_postprocess_runner.sh, calibrate.py, several preprocessing/ files, several postprocessing/ files, several batch/ files

    FLEPI_RESET_CHIMERICS

    -L, --reset_chimeric_on_accept

Flag for whether or not chimeric parameters should be reset to global parameters when a global acceptance occurs.

    TRUE

    --reset_chimeric_on_accept FALSE for FALSE, --reset_chimeric_on_accept or no mention for TRUE

    flepimop-inference-slot.R, flepimop-inference-main.R, slurm_init.sh, hpc_init, inference_job_launcher.py

    FLEPI_RESUME

    --resume/--no-resume

    Flag for whether or not to resume the current calibration.

    FALSE

    --resume TRUE for TRUE, --resume or no mention for FALSE

    flepimop-inference-slot.R, flepimop-inference-main.R, slurm_init.sh, hpc_init, inference_job_launcher.py

    FLEPI_RUN_INDEX

    -u, --run_id

    Unique ID given to the model run. If the same config is run multiple times, you can avoid the output being overwritten by using unique model run IDs.

    Auto-assigned run ID

    int

    copy_for_continuation.py, flepimop-inference-slot.R, flepimop-inference-main.R, shared_cli.py, base.py, calibrate.py, several batch/ files, several postprocessing/ files

    FLEPI_SEIR_SCENARIOS

    -s, --seir_modifier_scenarios

    Names of the intervention scenarios to run.

    'all'

    pending

    flepimop-inference-slot.R, flepimop-inference-main.R, inference_job_launcher.py

    FLEPI_SLOT_INDEX

    -i, --this_slot

    Index for current slots.

    1

    int

    flepimop-inference-slot.R, several batch/ files

    FS_RESULTS_PATH

    -R, --results-path

    A path to the model results.

    --

    your/path/to/model_results

    prune_by_llik.py, prune_by_llik_and_proj.py, several postprocessing/ files, several batch/ files, model_output_notebook.Rmd

    FULL_FIT

    -F, --full-fit

    Whether or not to process the full fit.

    FALSE

    --full-fit TRUE for TRUE, --full-fit or no mention for FALSE

    run_sim_processing_SLURM.R

    GT_DATA_SOURCE

    -s, --gt_data_source

    Sources of groundtruth data.

    'csse_case, fluview_death, hhs_hosp'

    See default

    build_covid_data.R

    GT_END_DATE

    --ground_truth_end

    Last date to include ground truth for.

    --

    YYYY-MM-DD format

    flepimop-inference-slot.R, flepimop-inference-main.R

    GT_START_DATE

    --ground_truth_start

    First date to include ground truth for.

    --

    YYYY-MM-DD format

    flepimop-inference-slot.R, flepimop-inference-main.R

    IMM_ESC_PROP

    --imm_esc_prop

    Annual percent of immune escape.

    0.35

    float between 0.00 - 1.00

    several preprocessing/ files

    INCL_AGGR_LIKELIHOOD

    -a, --incl_aggr_likelihood

    Whether or not the likelihood should be calculated with aggregate estimates.

    FALSE

    --incl_aggr_likelihood TRUE for TRUE, --incl_aggr_likelihood or no mention for FALSE

    flepimop-inference-slot.R

    IN_FILENAME

    Not a CLI option.

    Name of input files.

    N/A

    file_1.csv file_2.csv, etc.

    several batch/ files

    INIT_FILENAME

    --init_file_name

    Initial file global intermediate name.

    --

    file.csv

    seir_init_immuneladder.R, inference_job.run, several preprocessing/ files

    INTERACTIVE_RUN

    -I, --is-interactive

    Whether or not the current run is interactive.

    FALSE

    --is-interactive TRUE for TRUE, --is-interactive or no mention for FALSE

    flepimop-inference-slot.R, flepimop-inference-main.R

    JOB_NAME

    --job-name

    Unique job name (intended for use when submitting to SLURM).

    --

    Convention: {config['name']}-{timestamp} (str)

    several batch/ files

    LAST_JOB_OUTPUT

    Not a CLI option.

    Path to output of last job.

    N/A

    path/to/last_job/output

    utils.py, several batch/ files

    OLD_FLEPI_RUN_INDEX

    Not a CLI option.

    Run ID of old flepiMoP run.

    N/A

    int

    several batch/ files

    OUT_FILENAME

    Not a CLI option.

    Name of output files.

    N/A

    file_1.csv file_2.csv, etc.

    several batch/ files

    OUT_FILENAME_DIR

    Not a CLI option.

    Directory for output files.

    N/A

    path/to/output/files

    SLURM_inference_job.run

    OUTPUTS

    -o, --select-outputs

    A list of outputs to plot.

    'hosp, hnpi, snpi, llik'

    hosp, hnpi, snpi, llik

    postprocess_snapshot.R

    PARQUET_TYPES

    Not a CLI option.

    Parquet files.

    'seed spar snpi seir hpar hnpi hosp llik init'

    seed spar snpi seir hpar hnpi hosp llik init

    AWS_postprocess_runner.sh, SLURM_inference_job.run, AWS_inference_runner.sh

    PATH

    Not a CLI option.

    Path relating to AWS installation. Used during SLURM runs.

    N/A

    set with export PATH=~/aws-cli/bin:$PATH in SLURM_inference_job.run

    schema.yml, utils.py, info.py, AWS_postprocess_runner.sh, SLURM_inference_job.run

    PROCESS

    -r, --run-processing

    Whether or not to process the run.

    FALSE

    --run-processing TRUE for TRUE, --run-processing or no mention for FALSE

    run_sim_processing_SLURM.R

    PROJECT_PATH

    -d, --data_path

    Path to the folder with configs and model output.

    --

    path/to/configs_and_model-output

    base.py, _cli.py, calibrate.py, several postprocessing/ files, several batch/ files

    PULL_GT

    -g, --pull-gt

    Whether or not to pull ground truth data.

    FALSE

    --pull-gt TRUE for TRUE, --pull-gt or no mention for FALSE

run_sim_processing_SLURM.R

    PYTHON_PATH

    -y, --python

    Path to Python executable.

    'python3'

    path/to/your_python

    flepimop-inference-slot.R, flepimop-inference-main.R

    RESUMED_CONFIG_PATH

    --res_config

    Path to previous config file, if using resumes.

    NA

    path/to/past_config

    seir_init_immuneladder.R, several preprocessing/ files

    RESUME_DISCARD_SEEDING

    --resume-discard-seeding, --resume-carry-seeding

    Whether or not to keep seeding in resume runs.

    FALSE

    --resume-carry-seeding TRUE for TRUE, --resume-carry-seeding or no mention for FALSE

    several batch/ files

    RESUME_LOCATION

    -r, --restart-from-location

    The location (folder or an S3 bucket) where the previous run is stored.

    --

    path/to/last_job/output

    built_initial_seeding.R, calibrate.py, slurm_init.sh, hpc_init, inference_job_launcher.py

    RESUME_RUN

    -R, --is-resume

    Whether or not this run is a resume.

    FALSE

    --is-a-resume TRUE for TRUE, --is-a-resume or no mention for FALSE

    flepimop-inference-slot.R, flepimop-inference-main.R

    RESUME_RUN_INDEX

    Not a CLI option.

    Index of resumed run.

    set by OLD_FLEPI_RUN_INDEX

    int

    SLURM_inference_job.run

    RSCRIPT_PATH

    -r, --rpath

    Path to R executable.

    'Rscript'

    path/to/your_R

    build_initial_seeding.R, flepimop-inference-slot.R, flepimop-inference-main.R

    RUN_INTERACTIVE

    -I, --is-interactive

    Whether or not the current run is interactive.

    FALSE

--is-interactive TRUE for TRUE, --is-interactive or no mention for FALSE

    flepimop-inference-slot.R, flepimop-inference-main.R

    SAVE_HOSP

    -H, --save_hosp

    Whether or not the HOSP output files should be saved for each iteration.

    TRUE

    --save_hosp FALSE for FALSE, --save_hosp or no mention for TRUE

    flepimop-inference-slot.R, flepimop-inference-main.R

    SAVE_SEIR

    -S, --save_seir

    Whether or not the SEIR output files should be saved for each iteration.

    FALSE

    --save_seir TRUE for TRUE, --save_seir or no mention for FALSE

    flepimop-inference-slot.R, flepimop-inference-main.R

    SEED_VARIANTS

    -s, --seed_variants

    Whether or not to add variants/subtypes to outcomes in seeding.

    --

    FALSE, TRUE

    create_seeding.R

    SIMS_PER_JOB

    Not a CLI option.

    Simulations per job.

    N/A

    int >=1

    AWS_postprocess_runner.sh, inference_job_launcher.py, AWS_inference_runner.sh

    SLACK_CHANNEL

    -s, --slack-channel

    Slack channel, either 'csp-production' or 'debug'; or 'noslack' to disable slack.

    --

    csp-production, debug, or noslack

postprocess_auto.py, postprocessing-scripts.sh, inference_job_launcher.py

    SLACK_TOKEN

    -s, --slack-token

    Slack token.

    --

    postprocess_auto.py, SLURM_postprocess_runner.run

    SUBPOP_LENGTH

    -g, --subpop_len

    Number of digits in subpops.

    5

    int

    flepimop-inference-slot.R, flepimop-inference-main.R

    S3_MODEL_PROJECT_PATH

    Not a CLI option.

    Location in S3 bucket with the code, data, and dvc pipeline.

    N/A

    path/to/code_data_dvc

    several batch/ files

    S3_RESULTS_PATH

    Not a CLI option.

    Location in S3 to store results.

    N/A

    path/to/s3/results

    several batch/ files

    S3_UPLOAD

    Not a CLI option.

Whether or not we also save runs to S3 for SLURM runs.

    TRUE

    TRUE, FALSE

    SLURM_postprocess_runner.run, SLURM_inference_job.run, inference_job_launcher.py

    VALIDATION_DATE

    --validation-end-date

    First date of projection/forecast (first date without ground truth data).

    date.today()

    YYYY-MM-DD format

    data_setup_source.R, DataUtils.R, groundtruth_source.R, slurm_init.sh, hpc_init, inference_job_launcher.py

    BATCH_SYSTEM

    Not a CLI option.

    System you are running on (e.g., aws, SLURM, local).

    N/A

    e.g., aws, slurm

    inference_job_launcher.py

    CENSUS_API_KEY

    Not a CLI option.

    A unique key to the API for census data.

    N/A
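
Many of these variables are simply exported in the shell before launching a run. As a minimal sketch (all paths and values below are placeholders; substitute your own):

export FLEPI_PATH=/path/to/flepiMoP            # location of the flepiMoP repository
export PROJECT_PATH=/path/to/your_project      # folder with configs and model output
export CONFIG_PATH=config_example.yml          # configuration file to run
export FLEPI_RUN_INDEX=example_run_v1          # unique ID for this model run
export VALIDATION_DATE="2023-01-29"            # first date without ground truth data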

    Contributing to the Python code

    The "heart" of the pipeline, gempyor, is written in Python taking advantage of just-in-time compilation (via numba) and existing optimized libraries (numpy, pandas). If you would like to help us build gempyor, here is some useful information.

    Frameworks

    We make extensive use of the following packages:

    • click for managing the command-line arguments

    • confuse for accessing the configuration file

    • numba to just-in-time compile the core model

    • sympy to parse the model equations

    • pyarrow as parquet is our main data storage format

• xarray, which provides labels in the form of dimensions, coordinates, and attributes on top of raw NumPy multidimensional arrays, for performance and convenience

    • emcee for inference, as an option

    • graphviz to export transition graph between compartments

    • pandas, numpy, scipy, seaborn, matplotlib and tqdm like many Python projects

One of the current focuses is to switch internal data types from dataframes and numpy arrays to xarrays!

    Tests and build dependencies

To run the test suite locally, you'll need to install the gempyor package with build dependencies:

    which installs the pytest and mock packages in addition to all other gempyor dependencies so that one can run tests.

    If you are running from a conda environment and installing with `--no-deps`, then you should make sure that these two packages are installed.
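
If they are missing, a minimal way to add them (assuming the flepimop-env conda environment used elsewhere in this documentation) is:

conda activate flepimop-env
pip install pytest mock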

    Now you can try to run the gempyor test suite by running, from the flepimop/gempyor_pkg folder:

    If that works, then you are ready to develop gempyor. Feel free to open your first pull request.

If you want more output on tests, e.g., capturing standard output (print), you can use:

and to run just some subset of the tests (e.g., here just the outcome tests), use:

    For more details on how to use pytest please refer to their usage guide.

    Formatting and Linting

We try to remain close to Python conventions and to follow up-to-date rules and best practices. For formatting, we use black, the Uncompromising Code Formatter, before submitting pull requests. It provides a consistent style, which is useful when diffing. To get started with black, please refer to their Getting Started guide. We use a custom line length of 92 characters, as the default is a bit short for scientific code. Here is the line to use to format your code:

    To identify instances of poor Python practices within gempyor, we use pylint. pylint checks for these instances in the code, then produces a list of labeled errors. Again, we use a custom length of 92 characters as the recommended max line length. To lint your code with these settings, you can run the following line from the flepiMoP directory:

    For those using a Mac or Linux system for development, these commands are also available for execution by calling ./bin/lint. Similarly, you can take advantage of the formatting pre-commit hook found at bin/pre-commit. To start using it copy this file to your git hooks folder:

    flepiMoP repository
    Introduction

    This document contains instructions for setting up and running the two different kinds of SEIR modeling jobs supported by the COVIDScenarioPipeline repository on AWS:

    1. Inference jobs, using AWS Batch to coordinate hundreds/thousands of jobs across a fleet of servers, and

    2. Planning jobs, using a single relatively large EC2 instance (usually an r5.24xlarge) to run one or more planning scenarios on a single high-powered machine.

Most of the steps required to set up and run the two different types of jobs on AWS are identical, and I will explicitly call out the places where different steps are required. Throughout the document, we assume that your client machine is a UNIX-like environment (e.g., OS X, Linux, or WSL).

    Local Client Setup

A few things need to be true about the local machine that you will be using to connect to AWS, outlined here:

    1. You have created and downloaded a .pem file for connecting to an EC2 instance to your ~/.ssh directory. When we provision machines, you'll need to use the .pem file as a secret key for connecting. You may need to change the permission of the .pem file:

    2. You have created a ~/.ssh/config file that contains an entry that looks like this so we can use staging as an alias for your provisioned EC2 instance in the rest of the runbook:

3. You can connect to GitHub via SSH. This is important because we will need to use your GitHub SSH key to interact with private repositories from the staging server on EC2.

    Provisioning The Staging Server

    If you are running an Inference job, you should use a small instance type for your staging server (e.g., an m5.xlarge will be more than enough.) If you are running a Planning job, you should provision a beefy instance type (I am especially partial to the memory-and-CPU heavy r5.24xlarge, but given how fast the planning code has become, an r5.8xlarge should be perfectly adequate.)

If you have access to the jh-covid account, you should use the IDD Staging AMI (ami-03641dd0c8554e5d0) to provision and launch new staging servers; it is already set up with all of the dependencies described in this section, however you will need to alter its default network settings, IAM role, and security group (please refer to this page for details). You can find the AMI here, select it, and press the Launch button to walk you through the Launch Wizard to choose your instance type and .pem file to provision your staging server. When going through the Launch Wizard, be sure to select Next: Configure Instance details instead of Review and Launch. You will need to continue selecting the option that is not Review and Launch until you have selected a security group. In these screens, most of the default options are fine, but you will want to set the HPC VPC network, choose a public subnet (it will say public or private in the name), and set the IAM role to EC2S3FullAccess on the first screen. You can also name the machine by providing a Name tag in the tags screen. Finally, you will need to set your security group to dcv_usa and/or dcv_usa2. You can then finalize the machine initialization with Review and Launch. Once your instance is provisioned, be sure to put its IP address into the HostName section of the ~/.ssh/config file on your local client so that you can connect to it from your client by simply typing ssh staging in your terminal window.

    If you are having connection timeout issues when trying to ssh into the AWS machine, you should check that you have SSH TCP Port 22 permissions in the dcv_usa/ security group.

    If you do not have access to the jh-covid account, you should walk through the regular EC2 Launch Wizard flow and be sure to choose the Amazon Linux 2 AMI (HVM), SSD Volume Type (ami-0e34e7b9ca0ace12d, the 64-bit x86 version) AMI. Once the machine is up and running and you can SSH to it, you will need to run the following code to install the software you will need for the rest of the run:

    Connect to Github

    Once your staging server is provisioned and you can connect to it, you should scp the private key file that you use for connecting to Github to the /home/ec2-user/.ssh directory on the staging server (e.g., if the local file is named ~/.ssh/id_rsa, then you should run scp ~/.ssh/id_rsa staging:/home/ec2-user/.ssh to do the copy). For convenience, you should create a /home/ec2-user/.ssh/config file on the staging server that has the following entry:

    This way, the git clone, git pull, etc. operations that you run on the staging server will use your SSH key without constantly prompting you to login. Be sure to chmod 600 ~/.ssh/config to give the new file the correct permissions. You should now be able to clone a COVID19 data repository into your home directory on the staging server to do work against. For this example, to use the COVID19_Minimal repo, run:

    to get it onto the staging server. By convention, we do runs with the COVIDScenarioPipeline repository nested inside of the data repository, so we then do:

    to clone the modeling code itself into a child directory of the data repository.

    Getting and Launching the Docker Container

    The previous section is only for getting a minimal set of dependencies setup on your staging server. To do an actual run, you will need to download the Docker container that contains the more extensive set of dependencies we need for running the code in the COVIDScenarioPipeline repository. To get the development container on your staging server, please run:

    There are multiple versions of the container published on DockerHub, but latest-dev contains the latest-and-greatest dependencies and can support both Inference and Planning jobs. In order to launch the container and run a job, we need to make our local COVID19_Minimal directory visible to the container's runtime. For Inference jobs, we do this by running:

    The -v option to docker run maps a file in the host filesystem (i.e., the path on the left side of the colon) to a file in the container's filesystem. Here, we are mapping the /home/ec2-user/COVID19_Minimal directory on the staging server where we checked out our data repo to the /home/app/src directory in the container (by convention, we run commands inside of the container as a user named app.) We also map our .ssh directory from the host filesystem into the container so that we can interact with Github if need be using our SSH keys. Once the container is launched, we can cd src; ls -ltr to look around and ensure that our directory mapping was successful and we see the data and code files that we are expecting to run with.

    Once you are in the src directory, there are a few final steps required to install the R packages and Python modules contained within the COVIDScenarioPipeline repository. First, checkout the correct branch of COVIDScenarioPipeline. Then, assuming that you created a COVIDScenarioPipeline directory within the data repo in the previous step, you should be able to run:

    to install the local R packages and then install the Python modules.

    Once this step is complete, your machine is properly provisioned to run Planning jobs using the tools you normally use (e.g., make_makefile.R or running simulate.py and hospdeath.R directly, depending on the situation.) Running Inference jobs requires some extra steps that are covered in the next two sections.

    Running Inference Jobs

    Once the container is setup from the previous step, we are ready to test out and then launch an inference job against a configuration file (I will use the example of config.yml for the rest of this document.) First, I setup and run the build_US_setup.R script against my configuration file to ensure that the mobility data is up to date:

    Next, I kick off a small local run of the full_filter.R script. This serves two purposes: first, we can verify that the configuration file is in good shape and can support a few small iterations of the inference calculations before we kick off hundreds/thousands of jobs via AWS Batch. Second, it downloads the case data that we need for inference calculations to the staging server so that it can be cached locally and used by the batch jobs on AWS- if we do not have a local cache of this data at the start of the run, then every job will try to download the data itself, which will force the upstream server to deny service to the worker jobs, which will cause all of the jobs to fail. My small runs usually look like:

    This will run two sequential simulations (-k 2) for a single slot (-n 1) using a single CPU core (-j 1), looking for the modeling source code in the COVIDScenarioPipeline directory (-p COVIDScenarioPipeline). (We need to use the command line arguments here to explicitly override the settings of these parameters inside of config.yml since this run is only for local testing.) Assuming that this run succeeds, we are ready to kick off a batch job on the cluster.

The COVIDScenarioPipeline/batch/inference_job.py script will use the contents of the current directory and the values of the config file and any command-line arguments we pass it to launch a run on AWS Batch via the AWS API. To run this script, you need to have access to your AWS access keys so that you can enable access to the API by running aws configure at the command line, which will prompt you to enter your access key, secret, and preferred region, which should always be us-west-2 for jh-covid runs. (You can leave the Default format entry blank by simply hitting Enter.) IMPORTANT REMINDER: Do not give anyone your access key and secret. If you lose it, deactivate it on the AWS console and create a new one. Keep it safe.
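
The prompts look like the following (the region should always be us-west-2 for jh-covid runs; leave the output format blank by pressing Enter):

aws configure
# AWS Access Key ID [None]: <your access key>
# AWS Secret Access Key [None]: <your secret access key>
# Default region name [None]: us-west-2
# Default output format [None]: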

The simplest way to launch an inference job is to run

    This will use the contents of the config file to determine how many slots to run, how many simulations to run for each slot, and how to break those simulations up into blocks of batch jobs that run sequentially. If you need to override any of those settings at the command line, you can run

    to see the full list of command line arguments the script takes and how to set them.

    One particular type of command line argument cannot be specified in the config: arguments to resume a run from a previously submitted run. This takes two arguments based on the previous run:

    Both the s3 bucket and run id are printed as part of the output for the previous submission. We store that information on a slack channel #csp-production, and suggest other groups find similar storage.

    Inference jobs are parallelized by NPI scenarios and hospitalization rates, so if your config file defines more than one top-level scenario or more than one set of hospitalization parameters, the inference_job.py script will kick off a separate batch job for the cross product of scenarios * hospitalizations. The script will announce that it is launching each job and will print out the path on S3 where the final output for the job will be written. You can monitor the progress of the running jobs using either the AWS Batch Dashboard or by running:

    which will show you the running status of the jobs in each of the queues.

    Operating Inference Jobs

    By default, the AWS Batch system will usually run around 50% of your desired simultaneously executable jobs concurrently for a given inference run. For example, if you are running 300 slots, Batch will generally run about 150 of those 300 tasks at a given time. If you need to force Batch to run more tasks concurrently, this section provides instructions for how to cajole Batch into running harder.

    You can see how many tasks are running within each of the different Batch Compute Environments corresponding to the Batch Job Queues via the Elastic Container Service (ECS) Dashboard. There is a one-to-one correspondence between Job Queues, Compute Environments, and ECS Clusters (the matching ones all end with the same numeric identifier.) You can force Batch to scale up the number of CPUs available for running tasks by selecting the radio button corresponding to the compute environment that you want to scale on the Batch Compute Environment dashboard, clicking Edit, increasing the Desired CPU (and possibly the Minimum CPU, see below), and clicking the Save button. You will be able to see new containers and tasks coming online via the ECS Dashboard after a few minutes.

    If you want to force new tasks to come online ASAP, you should consider increasing the Minimum CPU for the Compute Environment as well as the Desired CPU (the Desired CPU is not allowed to be lower than the Minimum CPU, so if you increase the Minimum you must increase the Desired as well to match it.) This will cause Batch to spin new containers up quickly and get them populated with running tasks. There are two downsides to doing this: first, it overrides the allocation algorithm that makes cost/performance tradeoff decisions in favor of spending more money in order to get more tasks running. Second, you must remember to update the Compute Environment towards the end of the job run to set the Minimum CPU to zero again so that the ECS cluster can spin down when the job is finished; if you do not do this, ECS will simply leave the machines up and running, wasting money without doing any actual work. (Note that you should never manually try to lower the value of the Desired CPU setting for the Compute Environment- the Batch system will throw an error if you attempt to do this.)
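
If you prefer the command line to the console, the same adjustment can be sketched with the AWS CLI; the compute environment name below is a placeholder, and the CPU counts should be adjusted to your run:

# scale a compute environment up (and force new containers online quickly)
aws batch update-compute-environment \
    --compute-environment <your-compute-environment-name> \
    --compute-resources minvCpus=64,desiredvCpus=64

# when the run is winding down, set the minimum back to zero so the ECS cluster can spin down
aws batch update-compute-environment \
    --compute-environment <your-compute-environment-name> \
    --compute-resources minvCpus=0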

    Running with Docker locally (outdated/US specific) 🛳

    Short internal tutorial on running locally using a "Docker" container.

There are more comprehensive directions in the How to run -> Running with Docker locally section, but this section has some specifics required to do US-specific, COVID-19- and flu-specific runs.

    Setup

    Run Docker Image

Current Docker image: hopkinsidd/flepimop:latest-dev

    Docker is a software platform that allows you to build, test, and deploy applications quickly. Docker packages software into standardized units called containers that have everything the software needs to run including libraries, system tools, code, and runtime. This means you can run and install software without installing the dependencies in the system.

A docker container is an environment that is isolated from the rest of the operating system, i.e., you can create and delete files and programs without affecting your OS. It is like a local virtual OS within your OS.

For flepiMoP, we have a docker container that will help you get running quickly.

In this command we run the docker image hopkinsidd/flepimop. The -v option is used to mount a directory from the host machine into the container at the given location.

This mounts the data folder <dir1> to a path called drp within the docker environment, and the COVIDScenarioPipeline folder <dir2> to flepimop.

    🚀 Run inference

    Fill the environment variables (do this every time)

    First, populate the folder name variables:

    Then, export variables for some flags and the census API key (you can use your own):

    Where do I get a census key API?

The Census Data Application Programming Interface (API) is an API that gives the public access to raw statistical data from various Census Bureau data programs. To acquire your own API Key, follow the Census Bureau's "Get your own API key" link.

After you enter your details, you should receive an email with which you can activate your key and then use it.

Note: Do not enter the API Key in quotes; copy the key as it is.
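
For example (the key below is a placeholder, not a real key):

export CENSUS_API_KEY=0123456789abcdef0123456789abcdef01234567   # no quotes around the key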

Go into the Pipeline repo (making sure it is up to date on your favorite branch) and do the required installation for the repository:

Note: These installations take place in the docker container and not the operating system. They must be made once when starting the container and need not be repeated every time you run tests, provided they have been installed once.

    Run the code

Everything is now ready. 🎉 Let's do some clean-up in the data folder (these files might not exist, but it's good practice to make sure your simulation isn't re-using some old files).

    Stay in $PROJECT_PATH, select a config, and build the setup. The setup creates the population seeding file (geodata) and the population mobility file (mobility). Then, run inference:

    where:

    • n is the number of parallel inference slots,

    • j is the number of CPU cores it'll use in your machine,

• k is the number of iterations per slot.
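
For a quick test with a single slot, core, and iteration, the call looks like this (the same invocation appears in the command listings later in this document):

flepimop-inference-main -j 1 -n 1 -k 1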

It should run successfully and create a lot of files in model_output/.

    The last few lines visible on the command prompt should be:

    [[1]]

    [[1]][[1]]

    [[1]][[1]][[1]]

    NULL

    Other helpful tools

To understand the basics of docker, refer to Docker Basics.

To install docker for Windows, refer to the following link: Installing Docker.

The following is a good tutorial for an introduction to docker: Docker Tutorial.

To run the entire pipeline we use the command prompt. To open the command prompt, type “Command Prompt" in the search bar and open it. Here is a tutorial video for navigating through the command prompt: Command Prompt Tutorial.

To test, we use the test folder (test_documentation_inference_us in this case) in the COVIDScenarioPipeline as the data repository folder. We run the docker container and set the paths.

    AWS Submission Instructions: COVID-19

    This page, along with the other AWS run guides, are not deprecated in case we need to run flepiMoP on AWS again in the future, but also are not maintained as other platforms (such as longleaf and rockfish) are preferred for running production jobs.

    Step 1. Create the configuration file.

    see Building a configuration file

    Step 2. Start and access AWS submission box

    Spin up an Ubuntu submission box if not already running. To do this, log onto AWS Console and start the EC2 instance.

    Update IP address in .ssh/config file. To do this, open a terminal and type the command below. This will open your config file where you can change the IP to the IP4 assigned to the AWS EC2 instance (see AWS Console for this):

    SSH into the box. In the terminal, SSH into your box. Typically we name these instances "staging", so usually the command is:

    Step 3. Setup the environment

    Now you should be logged onto the AWS submission box.

Update the github repositories. In the below example we assume you are running the main branch in Flu_USA and the main branch in COVIDScenarioPipeline. This assumes you have already loaded the appropriate repositories on your EC2 instance. Have your Github ssh key passphrase handy so you can paste it when prompted (possibly multiple times) with the git pull command. Alternatively, you can add your github key to your batch box so you do not have to log in repeatedly (see X).

    Initiate the docker. Start up and log into the docker container, pull the repos from Github, and run setup scripts to setup the environment. This setup code links the docker directories to the existing directories on your box. As this is the case, you should not run job submission simultaneously using this setup, as one job submission might modify the data for another job submission.

    Step 4. Model Setup

To run the model via AWS, we first run a setup run locally (in docker on the submission EC2 box).

    Setup environment variables. Modify the code chunk below and submit in the terminal. We also clear certain files and model output that get generated in the submission process. If these files exist in the repo, they may not get cleared and could cause issues. You need to modify the variable values in the first 4 lines below. These include the SCENARIO, VALIDATION_DATE, COVID_MAX_STACK_SIZE, and COMPUTE_QUEUE. If submitting multiple jobs, it is recommended to split jobs between 2 queues: Compartment-JQ-1588569569 and Compartment-JQ-1588569574.

    If not resuming off previous run:

    If resuming from a previous run, there are an additional couple variables to set. This is the same for a regular resume or continuation resume. Specifically:

    • RESUME_ID - the COVID_RUN_INDEX from the run resuming from.

    • RESUME_S3 - the S3 bucket where this previous run is stored

    Preliminary model run. We do a setup run with 1 to 2 iterations to make sure the model runs and setup input data. This takes several minutes to complete, depending on how complex the simulation will be. To do this, run the following code chunk, with no modification of the code required:

    Step 5. Launch job on AWS batch

Configure AWS. Assuming that the simulations finish successfully, you will now enter credentials and submit your job onto AWS batch. Enter the following command into the terminal:

You will be prompted to enter the following items. These can be found in a file called new_user_credentials.csv:

    • Access key ID when prompted

    • Secret access key when prompted

    • Default region name: us-west-2

    • Default output: Leave blank when this is prompted and press enter (The Access Key ID and Secret Access Key will be given to you once in a file)

    Launch the job. To launch the job, use the appropriate setup based on the type of job you are doing. No modification of these code chunks should be required.

    NOTE: Resume and Continuation Resume runs are currently submitted the same way, resuming from an S3 that was generated manually. Typically we will also submit any Continuation Resume run specifying --resume-carry-seeding as starting seeding conditions will be manually constructed and put in the S3.
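
As an illustration only (not the exact production code chunk), a carry-seeding resume launch might look like the sketch below, using the batch-script flags documented elsewhere on this page; RESUME_S3 and RESUME_ID are the variables described above:

./COVIDScenarioPipeline/batch/inference_job.py -c $CONFIG_PATH \
    --resume-carry-seeding \
    --restart-from-s3-bucket=$RESUME_S3 \
    --restart-from-run-id=$RESUME_ID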

    Carrying seeding (do this to use seeding fits from resumed run):

    Discarding seeding (do this to refit seeding again):

    Single Iteration + Carry seeding (do this to produce additional scenarios where no fitting is required):

    Step 6. Document the Submission

    Commit files to GitHub. After the job is successfully submitted, you will now be in a new branch of the population repo. Commit the ground truth data files to the branch on GitHub and then return to the main branch:

    Save submission info to slack. We use a slack channel to save the submission information that gets outputted. Copy this to slack so you can identify the job later. Example output:

    Inference scratch

    This is just a place to play around with different inference algorithms. Gitbook markdown is very application-specific so can't copy this algorithm text into other apps to play around with!

    Current inference algorithm

• For $m = 1 \dots M$, where $M$ is the number of parallel MCMC chains (also known as slots)

  • Generate initial state

    • Generate an initial set of parameters $\Theta_{m,0}$, and copy this to both the global ($\Theta^G_{m,0}$) and chimeric ($\Theta^C_{m,0}$) parameter chains (sequences)

    • Generate an initial epidemic trajectory $Z(\Theta_{m,0})$

    • Calculate and record the initial likelihood for each subpopulation, $\mathcal{L}_i(D_i|Z_i(\Theta_{m,0}))$

  • For $k = 1 \dots K$, where $K$ is the length of the MCMC chain, add to the sequence of parameter values $\Theta_{m,k}$:

    • Generate a proposed set of parameters $\Theta^*$ from the current chimeric parameters using the proposal distribution $g(\Theta^*|\Theta^C_{m,k-1})$

    • Generate an epidemic trajectory with these proposed parameters, $Z(\Theta^*)$

  • End for $K$ iterations of each MCMC chain

• End for $M$ parallel MCMC chains

• Collect the final global parameter values for each parallel chain

    Making chimeric decision first

    Provisioning AWS EC2 instance

    This page, along with the other AWS run guides, are not deprecated in case we need to run flepiMoP on AWS again in the future, but also are not maintained as other platforms (such as longleaf and rockfish) are preferred for running production jobs.

Signing in to the AWS Management Console

Click on the link below:

Sign in as an IAM user with your given Account ID, username, and password.

Then the next view appears; check that the "Region" is "Oregon" (the default) and that "user@Account ID" is as you expected.

If you have already accessed the AWS console, a view like this may appear. In that case, select "EC2" to go to the "EC2 Dashboard" (if not, skip this step).

    EC2 Dashboard

In the EC2 Dashboard, we can manage EC2 boxes from creation to deletion. This section shows how to create an EC2 instance from an AMI image that has already been registered.

Select "Images > AMIs" in the right pane (navigation pane).

Select the AMI named "IDD Staging AMI" in the "Amazon Machine Images (AMIs)" list by clicking the corresponding checkbox on the left, then push the "Launch instance from AMI" button (colored orange).

    Launch an instance

    To create an EC2 instance, fill out the items as below (example):

    • Name and tags

      • input an appropriate name (e.g., "sample_box01")

    • Application and OS image

  • check that "AMI from catalog" is "IDD Staging AMI" (for example; select the one you want)

    • Advanced details

      • "EC2S3FullAccess" should be setected in IAM instance profile, but to do it an authentication (IAM role or policy) must be set on to the working IAM account

    then push "Launch Instance" button which is located at the bottom right side of the scree ;

    Running with RStudio Server on AWS EC2

    This page, along with the other AWS run guides, are not deprecated in case we need to run flepiMoP on AWS again in the future, but also are not maintained as other platforms (such as longleaf and rockfish) are preferred for running production jobs.

    Introduction

As a computational environment, you can use an RStudio Server integrated AWS EC2 instance, either as your personal space or shared among multiple users, via the GUI as well as the CLI using ssh. The EC2 instance type was selected to be appropriate for running our programs as of now (2023/1), in view of both computational resources and cost, so that you can use a cloud-based computing environment accessible with a GUI (including from the web) without any difficulty setting it up. The details hereinafter may change.

    Versions

    The current installed versions of software or additional information related to AWS EC2 are as follows:

    • R/RStudio Server

      • R version: 4.2.2

      • RStudio Server version: v2022.07.02+576

    • AWS EC2 instance configurations

    Provisioning a server on AWS EC2

To be written. Talk to someone who would be able to do that.

• EC2 instance initialization with a specific AMI

• Configuring network-related settings, including port openings

• Registration of the user in the EC2 instance

• Configuring a shared directory via SMB and accounts

Starting an EC2 instance

The procedure is the same as starting a normal EC2 instance. One way is to select the EC2 instance and start it in the EC2 Management Console.

Once the instance has started, RStudio Server can be accessed without having to invoke it manually.

    Accessing RStudio Server

    By default RStudio Server runs on port 8787 and accepts connections from all remote clients. After invoking an EC2 box you should therefore be able to navigate a web browser to the following address to access the server:

    http://<ip-addr>:8787/

Then the authentication dialog will be shown; try to log in by entering the username and password already registered in the box and pushing the "Sign In" button:

The RStudio view will appear as below:

    Accessing the Linux server(Ubuntu) using RDP

To access the Linux server with a GUI, RDP software can be used, in addition to the usual way via ssh on the command line.

    Using Windows

    By using "Remote Desktop Connection" app in Windows, you can log in to the Linux server box from remote environment.

    Using Mac

For Mac users, the RDP software below is recommended:

    Microsoft Remote Desktop

    Accessing the shared space on Linux server

    As a shared space, the directory named:

is deployed among multiple server boxes using EFS (Elastic File System), which supports the NFSv4 protocol.

    Accessing the shared space on Linux server using Samba(obsolete)

    Common

In the Linux box, the Samba (SMB) service is on by default for file exchange. The area which is readable and writable under the specific user privileges is:

When accessing the area via SMB, you can input the username and its password in the dialog window that will be shown. The username is:

(ask for the above user's password in advance if you want to access via SMB)

    Using Windows

Input a path such as \\<ip-addr>\share in Windows Explorer.

    Using Mac

    From Finder you can access the shared space using SMB.

1. From the Finder menu, choose "Go" then "Connect to Server"

2. When a dialog appears, fill in the username and password of a registered user.

3. After pushing the "Connect" button, the designated area will be shown in Finder if no errors occur.

    Notes

When you are inside the university network (e.g., in a lab or office), you may not be able to access the server box with SMB because the network may block the ports related to these services.

If you are using a Mac as your local PC, there is a workaround to avoid this situation, but for Windows it is not yet clear whether a solution exists (now under investigation). If you want to know the related information (currently for Mac users only, though), please make contact. In the case of a Windows user, I recommend using the "Local devices and resources" setting of Remote Desktop Connection.

    AWS Submission Instructions: Influenza

    This page, along with the other AWS run guides, are not deprecated in case we need to run flepiMoP on AWS again in the future, but also are not maintained as other platforms (such as longleaf and rockfish) are preferred for running production jobs.

    Step 1. Create the configuration file.

    see Building a configuration file

    Step 2. Start and access AWS submission box

    Spin up an Ubuntu submission box if not already running. To do this, log onto AWS Console and start the EC2 instance.

    Update IP address in .ssh/config file. To do this, open a terminal and type the command below. This will open your config file where you can change the IP to the IP4 assigned to the AWS EC2 instance (see AWS Console for this):

    SSH into the box. In the terminal, SSH into your box. Typically we name these instances "staging", so usually the command is:

    Step 3. Setup the environment

    Now you should be logged onto the AWS submission box.

Update the github repositories. In the below example we assume you are running the main branch in Flu_USA and the main branch in COVIDScenarioPipeline. This assumes you have already loaded the appropriate repositories on your EC2 instance. Have your GitHub ssh key passphrase handy so you can paste it when prompted (possibly multiple times) with the git pull command. Alternatively, you can add your github key to your batch box so you do not have to log in repeatedly (see X).

    Initiate the docker. Start up and log into the docker container, pull the repos from GitHub, and run setup scripts to setup the environment. This setup code links the docker directories to the existing directories on your box. As this is the case, you should not run job submission simultaneously using this setup, as one job submission might modify the data for another job submission.

    Step 4. Model Setup

To run the model via AWS, we first run a setup run locally (in docker on the submission EC2 box).

    Setup environment variables. Modify the code chunk below and submit in the terminal. We also clear certain files and model output that get generated in the submission process. If these files exist in the repo, they may not get cleared and could cause issues. You need to modify the variable values in the first 4 lines below. These include the SCENARIO, VALIDATION_DATE, COVID_MAX_STACK_SIZE, and COMPUTE_QUEUE. If submitting multiple jobs, it is recommended to split jobs between 2 queues: Compartment-JQ-1588569569 and Compartment-JQ-1588569574.

    If not resuming off previous run:

    If resuming from a previous run, there are an additional couple variables to set. This is the same for a regular resume or continuation resume. Specifically:

    • RESUME_ID - the COVID_RUN_INDEX from the run resuming from.

    • RESUME_S3 - the S3 bucket where this previous run is stored

    Preliminary model run. We do a setup run with 1 to 2 iterations to make sure the model runs and setup input data. This takes several minutes to complete, depending on how complex the simulation will be. To do this, run the following code chunk, with no modification of the code required:

    Step 5. Launch job on AWS batch

Configure AWS. Assuming that the simulations finish successfully, you will now enter credentials and submit your job onto AWS batch. Enter the following command into the terminal:

You will be prompted to enter the following items. These can be found in a file called new_user_credentials.csv:

    • Access key ID when prompted

    • Secret access key when prompted

    • Default region name: us-west-2

    • Default output: Leave blank when this is prompted and press enter (The Access Key ID and Secret Access Key will be given to you once in a file)

    Launch the job. To launch the job, use the appropriate setup based on the type of job you are doing. No modification of these code chunks should be required.

    NOTE: Resume and Continuation Resume runs are currently submitted the same way, resuming from an S3 that was generated manually. Typically we will also submit any Continuation Resume run specifying --resume-carry-seeding as starting seeding conditions will be manually constructed and put in the S3.

    Carrying seeding (do this to use seeding fits from resumed run):

    Discarding seeding (do this to refit seeding again):

    Single Iteration + Carry seeding (do this to produce additional scenarios where no fitting is required):

    NOTE: A Resume and Continuation Resume are currently submitted the same way, but with --resume-carry-seeding specified and resuming from an S3 that was generated manually.

    Step 6. Document the Submission

    Commit files to Github. After the job is successfully submitted, you will now be in a new branch of the population repo. Commit the ground truth data files to the branch on github and then return to the main branch:

    Save submission info to slack. We use a slack channel to save the submission information that gets outputted. Copy this to slack so you can identify the job later. Example output:

    pip install "flepimop/gempyor_pkg[test]"
    pytest
    pytest -vvvv
    pytest -vvvv -k outcomes
    black --line-length 92 \
        --extend-exclude 'flepimop/gempyor_pkg/src/gempyor/steps_rk4.py' \
        --verbose .
    pylint flepimop/gempyor_pkg/src/gempyor/ \
        --fail-under 5 \
        --rcfile flepimop/gempyor_pkg/.pylintrc \
        --verbose
    cp -f bin/pre-commit .git/hooks/
    chmod 400 ~/.ssh/<your .pem file goes here>
    host staging
    HostName <IP address of provisioned server goes here>
    IdentityFile ~/.ssh/<your .pem file goes here>
    User ec2-user
    IdentitiesOnly yes
    StrictHostKeyChecking no 
    sudo yum -y update
    sudo yum -y install awscli 
    sudo yum -y install git 
sudo yum -y install docker 
    sudo yum -y install pbzip2 
    
    curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | sudo bash
    sudo yum -y install git-lfs
    git lfs install
    host github.com
     HostName github.com
     IdentityFile ~/.ssh/id_rsa
     User git
    git clone git@github.com:HopkinsIDD/COVID19_Minimal.git
    cd COVID19_Minimal
    git clone git@github.com:HopkinsIDD/COVIDScenarioPipeline.git
    sudo docker pull hopkinsidd/covidscenariopipeline:latest-dev
    sudo docker run \
      -v /home/ec2-user/COVID19_Minimal:/home/app/src \
      -v /home/ec2-user/.ssh:/home/app/.ssh \
      -it hopkinsidd/covidscenariopipeline:latest-dev
    Rscript COVIDScenarioPipeline/local_install.R
    (cd COVIDScenarioPipeline/; python setup.py install)
    export CENSUS_API_KEY=<your census api key>
    cd COVIDScenarioPipeline
    git lfs pull
    cd ..
    Rscript COVIDScenarioPipeline/R/scripts/build_US_setup.R -c config.yml
    Rscript COVIDScenarioPipeline/R/scripts/full_filter.R -c config.yml -k 2 -n 1 -j 1 -p COVIDScenarioPipeline
    ./COVIDScenarioPipeline/batch/inference_job.py -c config.yml
    ./COVIDScenarioPipeline/batch/inference_job.py --help
    ./COVIDScenarioPipeline/batch/inference_job.py --restart-from-s3-bucket=s3://idd-inference-runs/USA-20210131T170334/ --restart-from-run-id=2021.01.31.17:03:34.
    ./COVIDScenarioPipeline/batch/inference_job_status.py
    docker pull hopkinsidd/flepimop:latest-dev
    docker run -it \
      -v <dir1>:/home/app/flepimop \
      -v <dir2>:/home/app/drp \
    hopkinsidd/flepimop:latest-dev  
    export FLEPI_PATH=/home/app/flepimop/
    export PROJECT_PATH=/home/app/drp/
    export FLEPI_RESET_CHIMERICS=TRUE
    export CENSUS_API_KEY="6a98b751a5a7a6fc365d14fa8e825d5785138935"
    cd $FLEPI_PATH   # it'll move to the flepiMoP/ directory
    Rscript local_install.R               # Install the R stuff
    pip install --no-deps -e gempyor_pkg/ # install gempyor
    git lfs install
    git lfs pull
    cd $PROJECT_PATH       # goes to Flu_USA
    git restore data/
    rm -rf data/mobility_territories.csv data/geodata_territories.csv data/us_data.csv
    rm -r model_output/ # delete the outputs of past run if there are
    export CONFIG_PATH=config_SMH_R1_lowVac_optImm_2022.yml
    Rscript $FLEPI_PATH/datasetup/build_US_setup.R
    Rscript $FLEPI_PATH/datasetup/build_flu_data.R
    flepimop-inference-main -j 1 -n 1 -k 1
    notepad .ssh/config
    ssh staging
    cd COVID19_USA
    git config --global credential.helper cache
    git pull 
    git checkout main
    git pull
    
    cd flepiMoP
    git pull	
    git checkout main
    git pull
    cd .. 
    sudo docker pull hopkinsidd/flepimop:latest-dev
    sudo docker run -it \
      -v /home/ec2-user/COVID19_USA:/home/app/drp/COVID19_USA \
      -v /home/ec2-user/flepiMoP:/home/app/drp/flepiMoP \
      -v /home/ec2-user/.ssh:/home/app/.ssh \
    hopkinsidd/flepimop:latest-dev  
        
    cd ~/drp/COVID19_USA
    git config credential.helper store 
    git pull 
    git checkout main
    git pull
    git config --global credential.helper 'cache --timeout 300000'
    
    cd ~/drp/flepiMoP 
    git pull 
    git checkout main
    git pull 
    
    Rscript build/local_install.R && 
       python -m pip install --upgrade pip &&
       pip install -e flepimop/gempyor_pkg/ && 
       pip install boto3 && 
       cd ..
    export FLEPI_RUN_INDEX=FCH_R16_lowBoo_modVar_ContRes_blk4_FCH_Dec11_tsvacc && 
       export VALIDATION_DATE="2022-12-11" && 
       export COVID_MAX_STACK_SIZE=1000 && 
       export COMPUTE_QUEUE="Compartment-JQ-1588569574" &&
       export CENSUS_API_KEY=<your census api key> && 
       export FLEPI_RESET_CHIMERICS=TRUE &&
       rm -rf model_output data/us_data.csv data-truth &&
       rm -rf data/mobility_territories.csv data/geodata_territories.csv &&
       rm -rf data/seeding_territories.csv && 
       rm -rf data/seeding_territories_Level5.csv data/seeding_territories_Level67.csv
    export CONFIG_NAME=config_$SCENARIO.yml && 
       export CONFIG_PATH=/home/app/drp/COVID19_USA/$CONFIG_NAME && 
       export FLEPI_PATH=/home/app/drp/flepiMoP && 
       export PROJECT_PATH=/home/app/drp/COVID19_USA && 
       export INTERVENTION_NAME="med" && 
       export FLEPI_STOCHASTIC=FALSE && 
       rm -rf $PROJECT_PATH/model_output $PROJECT_PATH/us_data.csv && 
       cd $PROJECT_PATH && 
       Rscript $FLEPI_PATH/R/scripts/build_US_setup.R -c $CONFIG_NAME && 
       Rscript $FLEPI_PATH/R/scripts/build_covid_data.R -c $CONFIG_NAME && 
       Rscript $FLEPI_PATH/R/scripts/full_filter.R -c $CONFIG_NAME -j 1 -n 1 -k 1 && 
       printenv CONFIG_NAME
    aws configure
    export CONFIG_PATH=$CONFIG_NAME &&
    cd $PROJECT_PATH &&
    $FLEPI_PATH/batch/inference_job.py -c $CONFIG_PATH -q $COMPUTE_QUEUE &&
    printenv CONFIG_NAME
    export CONFIG_PATH=$CONFIG_NAME &&
    cd $PROJECT_PATH &&
    $FLEPI_PATH/batch/inference_job.py -c $CONFIG_PATH -q $COMPUTE_QUEUE -j 1 -k 1 &&
    printenv CONFIG_NAME
    export CONFIG_PATH=$CONFIG_NAME &&
    cd $PROJECT_PATH &&
    $FLEPI_PATH/batch/inference_job.py -c $CONFIG_PATH -q $COMPUTE_QUEUE --resume-carry-seeding --restart-from-location=s3://idd-inference-runs/$RESUME_S3 --restart-from-run-id=$RESUME_ID &&
    printenv CONFIG_NAME
    export CONFIG_PATH=$CONFIG_NAME &&  
    cd $PROJECT_PATH &&
    $FLEPI_PATH/batch/inference_job.py -c $CONFIG_PATH -q $COMPUTE_QUEUE --resume-discard-seeding --restart-from-location=s3://idd-inference-runs/$RESUME_S3 --restart-from-run-id=$RESUME_ID &&
    printenv CONFIG_NAME
    export CONFIG_PATH=$CONFIG_NAME &&
    cd $PROJECT_PATH &&
    $FLEPI_PATH/batch/inference_job.py -c $CONFIG_PATH -q $COMPUTE_QUEUE --resume-carry-seeding --restart-from-location=s3://idd-inference-runs/$RESUME_S3 --restart-from-run-id=$RESUME_ID -j 1 -k 1 &&
    printenv CONFIG_NAME
    git add data/ 
    git config --global user.email "[email]" 
    git config --global user.name "[github username]" 
    git commit -m"scenario run initial" 
    branch=$(git branch | sed -n -e 's/^\* \(.*\)/\1/p')
    git push --set-upstream origin $branch
    
    git checkout main
    git pull
    Setting number of output slots to 300 [via config file]
    Launching USA-20220923T160106_inference_med...
    Resuming from run id is SMH_R1_lowVac_optImm_2018 located in s3://idd-inference-runs/USA-20220913T000opt
    Discarding seeding results
    Final output will be: s3://idd-inference-runs/USA-20220923T160106/model_output/
    Run id is SMH_R1_highVac_optImm_2022
    Switched to a new branch 'run_USA-20220923T160106'
    config_SMH_R1_highVac_optImm_2022.yml
    export FLEPI_RUN_INDEX=FCH_R16_lowBoo_modVar_ContRes_blk4_Dec18_tsvacc && 
       export VALIDATION_DATE="2022-12-18" && 
       export COVID_MAX_STACK_SIZE=1000 && 
       export COMPUTE_QUEUE="Compartment-JQ-1588569574" &&
       export CENSUS_API_KEY=<your census api key> && 
       export FLEPI_RESET_CHIMERICS=TRUE &&
       rm -rf model_output data/us_data.csv data-truth &&
       rm -rf data/mobility_territories.csv data/geodata_territories.csv &&
       rm -rf data/seeding_territories.csv && 
       rm -rf data/seeding_territories_Level5.csv data/seeding_territories_Level67.csv
       
    export RESUME_LOCATION=s3://idd-inference-runs/USA-20230423T235232
    notepad .ssh/config
    ssh staging
    cd Flu_USA
    git config --global credential.helper cache
    git pull 
    
    cd COVIDScenarioPipeline
    git pull	
    git checkout main
    git pull
    cd ..
    sudo docker pull hopkinsidd/covidscenariopipeline:latest-dev
    sudo docker run -it \
      -v /home/ec2-user/Flu_USA:/home/app/drp \
      -v /home/ec2-user/Flu_USA/COVIDScenarioPipeline:/home/app/drp/COVIDScenarioPipeline \
      -v /home/ec2-user/.ssh:/home/app/.ssh \
    hopkinsidd/covidscenariopipeline:latest-dev  
        
    cd ~/drp 
    git config credential.helper store 
    git pull 
    git checkout main
    git config --global credential.helper 'cache --timeout 300000'
    
    cd ~/drp/COVIDScenarioPipeline 
    git pull 
    git checkout main
    git pull 
    
    Rscript local_install.R && 
       python -m pip install --upgrade pip &&
       pip install -e gempyor_pkg/ && 
       pip install boto3 && 
       cd ~/drp
    export SCENARIO=FCH_R1_highVac_pesImm_2022_Oct30 && 
       export VALIDATION_DATE="2022-10-16" && 
       export COVID_MAX_STACK_SIZE=1000 && 
       export COMPUTE_QUEUE="Compartment-JQ-1588569574" &&
       export CENSUS_API_KEY=<your census api key> && 
       export COVID_RESET_CHIMERICS=TRUE &&
       rm -rf model_output data/us_data.csv data-truth &&
       rm -rf data/mobility_territories.csv data/geodata_territories.csv &&
       rm -rf data/seeding_territories.csv
    export COVID_RUN_INDEX=$SCENARIO && 
       export CONFIG_NAME=config_$SCENARIO.yml && 
       export CONFIG_PATH=/home/app/drp/$CONFIG_NAME && 
       export COVID_PATH=/home/app/drp/COVIDScenarioPipeline && 
       export PROJECT_PATH=/home/app/drp && 
       export INTERVENTION_NAME="med" && 
       export COVID_STOCHASTIC=FALSE && 
       rm -rf $PROJECT_PATH/model_output $PROJECT_PATH/us_data.csv &&
       rm -rf $PROJECT_PATH/seeding_territories.csv && 
       cd $PROJECT_PATH && Rscript $COVID_PATH/R/scripts/build_US_setup.R -c $CONFIG_NAME && 
       Rscript $COVID_PATH/R/scripts/build_flu_data.R -c $CONFIG_NAME && 
       Rscript $COVID_PATH/R/scripts/full_filter.R -c $CONFIG_NAME -j 1 -n 1 -k 1 && 
       printenv CONFIG_NAME
    aws configure
    export CONFIG_PATH=$CONFIG_NAME &&
    cd $PROJECT_PATH &&
    $COVID_PATH/batch/inference_job.py -c $CONFIG_PATH -q $COMPUTE_QUEUE &&
    printenv CONFIG_NAME
    export CONFIG_PATH=$CONFIG_NAME &&
    cd $PROJECT_PATH &&
    $COVID_PATH/batch/inference_job.py -c $CONFIG_PATH -q $COMPUTE_QUEUE -j 1 -k 1 &&
    printenv CONFIG_NAME
    export CONFIG_PATH=$CONFIG_NAME &&
    cd $PROJECT_PATH &&
    $COVID_PATH/batch/inference_job.py -c $CONFIG_PATH -q $COMPUTE_QUEUE --resume-carry-seeding --restart-from-location=s3://idd-inference-runs/$RESUME_S3 --restart-from-run-id=$RESUME_ID &&
    printenv CONFIG_NAME
    export CONFIG_PATH=$CONFIG_NAME &&  
    cd $PROJECT_PATH &&
    $COVID_PATH/batch/inference_job.py -c $CONFIG_PATH -q $COMPUTE_QUEUE --resume-discard-seeding --restart-from-location=s3://idd-inference-runs/$RESUME_S3 --restart-from-run-id=$RESUME_ID &&
    printenv CONFIG_NAME
    export CONFIG_PATH=$CONFIG_NAME &&
    cd $PROJECT_PATH &&
    $COVID_PATH/batch/inference_job.py -c $CONFIG_PATH -q $COMPUTE_QUEUE --resume-carry-seeding --restart-from-location=s3://idd-inference-runs/$RESUME_S3 --restart-from-run-id=$RESUME_ID -j 1 -k 1 &&
    printenv CONFIG_NAME
    git add data/ 
    git config --global user.email "[email]" 
    git config --global user.name "[github username]" 
    git commit -m"scenario run initial" 
    branch=$(git branch | sed -n -e 's/^\* \(.*\)/\1/p')
    git push --set-upstream origin $branch
    
    git checkout main
    git pull
    Setting number of output slots to 300 [via config file]
    Launching USA-20220923T160106_inference_med...
    Resuming from run id is SMH_R1_lowVac_optImm_2018 located in s3://idd-inference-runs/USA-20220913T000opt
    Discarding seeding results
    Final output will be: s3://idd-inference-runs/USA-20220923T160106/model_output/
    Run id is SMH_R1_highVac_optImm_2022
    Switched to a new branch 'run_USA-20220923T160106'
    config_SMH_R1_highVac_optImm_2022.yml
    export SCENARIO=FCH_R1_highVac_pesImm_2022_Nov27 && 
       export VALIDATION_DATE="2022-11-27" && 
       export COVID_MAX_STACK_SIZE=1000 && 
       export COMPUTE_QUEUE="Compartment-JQ-1588569574" &&
       export CENSUS_API_KEY=<your census api key> && 
       export COVID_RESET_CHIMERICS=TRUE &&
       rm -rf model_output data/us_data.csv data-truth &&
       rm -rf data/mobility_territories.csv data/geodata_territories.csv &&
       rm -rf data/seeding_territories.csv
       
    export RESUME_ID=FCH_R1_highVac_pesImm_2022_Nov20 &&
      export RESUME_S3=USA-20221120T194228
  • Calculate the likelihood of the data given the proposed parameters for each subpopulation, $\mathcal{L}_i(D_i|Z_i(\Theta^*))$

  • Calculate the overall likelihood with the proposed parameters, $\mathcal{L}(D|Z(\Theta^*))$

  • Make "global" decision about proposed parameters

    • Generate a uniform random number $u^G \sim \mathcal{U}[0,1]$

    • Calculate the overall likelihood with the current global parameters, $\mathcal{L}(D|Z(\Theta^G_{m,k-1}))$

    • Calculate the acceptance ratio $\alpha^G=\min \left(1, \frac{\mathcal{L}(D|Z(\Theta^*)) \, p(\Theta^*)}{\mathcal{L}(D|Z(\Theta^G_{m,k-1})) \, p(\Theta^G_{m,k-1})} \right)$ (a numerical illustration follows this list)

    • If $\alpha^G > u^G$: ACCEPT the proposed parameters to the global and chimeric parameter chains

      • Set $\Theta^G_{m,k} = \Theta^*$

      • Set $\Theta^C_{m,k} = \Theta^*$

      • Update the recorded subpopulation-specific likelihood values (chimeric and global) with the likelihoods calculated using the proposed parameters

    • Else: REJECT the proposed parameters for the global chain and make subpopulation-specific decisions for the chimeric chain

      • Set $\Theta^G_{m,k} = \Theta^G_{m,k-1}$

      • Make "chimeric" decision (see the subpopulation loop below)

    • End if

  • End making global decision
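
  As a made-up numerical illustration of the acceptance ratio: if the proposed parameters make $\mathcal{L}(D|Z(\Theta)) \, p(\Theta)$ twice as large as under the current global parameters, then $\alpha^G = \min(1, 2) = 1$ and the proposal is always accepted; if they make it half as large, $\alpha^G = 0.5$ and the proposal is accepted only when the uniform draw satisfies $u^G < 0.5$.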

  • Instance type

    • Select the instance type you want from the drop-down list (e.g., m5.xlarge)

  • Key pair (login)

    • You can generate a new key pair if you want to connect to the instance with your own key (by clicking "Create new key pair" on the right side), but usually select "ams__ks_ED25519__keypair" from the drop-down list so that you can get help with the local client setup (recommended).

      • If you use your own key, you will of course be the only person able to log in, so be careful with key management.

  • Network settings (click the "Edit" button on the right to expand the configuration; see below)

    • VPC - required

      • select the "HPC VPC" item from the drop-down menu

    • Subnet

      • select an "HPC Public Subnet" among the "us-west-2*" options

    • Firewall (security groups)

      • select the "Select existing security groups" toggle, then

      • Common security groups

        • select "dvc_usa" and "dvc__usa2" from the drop-down menu

  (Screenshots in the original walk through the console workflow: Sign in as IAM user; Console Home; EC2 Dashboard; Select an AMI image among Amazon Machine Images (AMIs); Network settings; Advanced details; Launch Instance in Summary; launch success. An AWS CLI sketch of the same launch follows.)
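
    For reference only, the same launch can be sketched with the AWS CLI; every ID below (AMI, subnet, security groups) is a placeholder that you would read off the console, and the console workflow above remains the recommended path:

    # rough AWS CLI equivalent of the console launch -- fill in the IDs shown in the console
    aws ec2 run-instances \
      --image-id ami-xxxxxxxxxxxxxxxxx \
      --instance-type m5.xlarge \
      --key-name ams__ks_ED25519__keypair \
      --subnet-id subnet-xxxxxxxx \
      --security-group-ids sg-xxxxxxxx sg-yyyyyyyy \
      --count 1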

  • Instance type: r6i.4xlarge (16 cores, 128 GB memory)

  • Storage: 2TB x 1 (gp3)

  • OS: ubuntu 22.04 (Jammy)

  • https://apps.apple.com/us/app/microsoft-remote-desktop/id1295203466
    (Screenshots in the original: the authentication dialog, the RStudio view, and a dialog window shown in Japanese, which will appear in your OS's locale. Referenced paths and share name: /home/shared, /home/share, smbshare.)
  • For each subpopulation $i = 1 \dots N$:

    • Generate a uniform random number $u_i^C \sim \mathcal{U}[0,1]$

    • Calculate the acceptance ratio $\alpha_i^C=\frac{\mathcal{L}_i(D_i|Z_i(\Theta^*)) \, p(\Theta^*)}{\mathcal{L}_i(D_i|Z_i(\Theta^C_{m,k-1})) \, p(\Theta_{m,k-1})}$

    • If $\alpha_i^C > u_i^C$: ACCEPT the proposed parameters to the chimeric parameter chain for this location

      • Set $\Theta_{m,k,i}^C = \Theta^*_{i}$

      • Update the recorded chimeric likelihood value for subpopulation $i$ to that calculated with the proposed parameters

    • Else: REJECT the proposed parameters for the chimeric parameter chain for this location

      • Set $\Theta_{m,k,i}^C = \Theta^C_{m,k-1,i}$

    • End if

  • End for $N$ subpopulations

  • End making chimeric decisions