How to simulate multiple virus strains

In this tutorial, we are going to simulate the spread of Covid-19 with two virus strains.

[1]:
%matplotlib inline

import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import sid
from sid.config import INDEX_NAMES

warnings.filterwarnings("ignore")

Preparation

For the simulation we need to prepare several objects which are identical to the ones from the general tutorial on the simulation.

[2]:
available_ages = [
    "0-9",
    "10-19",
    "20-29",
    "30-39",
    "40-49",
    "50-59",
    "60-69",
    "70-79",
    "80-100",
]

ages = np.random.choice(available_ages, size=10_000)
regions = np.random.choice(["North", "South"], size=10_000)

initial_states = pd.DataFrame({"age_group": ages, "region": regions}).astype("category")
initial_states.head(5)
[2]:
age_group region
0 0-9 North
1 50-59 North
2 50-59 South
3 30-39 South
4 20-29 South
[3]:
def meet_distant(states, params, seed):
    possible_nr_contacts = np.arange(10)
    contacts = np.random.choice(possible_nr_contacts, size=len(states))
    return pd.Series(contacts, index=states.index)


def meet_close(states, params, seed):
    possible_nr_contacts = np.arange(5)
    contacts = np.random.choice(possible_nr_contacts, size=len(states))
    return pd.Series(contacts, index=states.index)


assort_by = ["age_group", "region"]

contact_models = {
    "distant": {"model": meet_distant, "assort_by": assort_by, "is_recurrent": False},
    "close": {"model": meet_close, "assort_by": assort_by, "is_recurrent": False},
}
[4]:
epidemiological_parameters = pd.read_csv("infection_probs.csv", index_col=INDEX_NAMES)
epidemiological_parameters
[4]:
value note source
category subcategory name
infection_prob close close 0.05 NaN NaN
distant distant 0.03 NaN NaN
household household 0.20 NaN NaN
[5]:
assort_probs = pd.read_csv("assort_by_params.csv", index_col=INDEX_NAMES)
assort_probs
[5]:
value note source
category subcategory name
assortative_matching close age_group 0.5 NaN NaN
region 0.9 NaN NaN
distant age_group 0.5 NaN NaN
region 0.9 NaN NaN
[6]:
disease_params = sid.load_epidemiological_parameters()
disease_params.head(6).round(2)
[6]:
value
category subcategory name
health_system icu_limit_relative icu_limit_relative 50.00
cd_infectious_true all 1 0.39
2 0.35
3 0.22
5 0.04
cd_infectious_false all 3 0.10
[7]:
immunity_params = pd.read_csv("immunity_params.csv", index_col=INDEX_NAMES)
immunity_params
[7]:
value note source
category subcategory name
immunity immunity_level from_infection 0.9900 NaN NaN
from_vaccination 0.8000 NaN NaN
immunity_waning time_to_reach_maximum_infection 7.0000 NaN NaN
time_to_reach_maximum_vaccination 28.0000 NaN NaN
slope_after_maximum_infection -0.0001 NaN NaN
slope_after_maximum_vaccination -0.0002 NaN NaN
[8]:
params = pd.concat(
    [disease_params, epidemiological_parameters, assort_probs, immunity_params]
)

Additional objects to simulate multiple virus strains

To implement multiple virus strains, we have to make the following extensions to the model.

  1. Add a multiplier for the contagiousness of each virus to the parameters.

  2. Add a multiplier for the immunity resistance factor for each virus to the parameters.

  3. Prepare a DataFrame for the initial conditions.

Here we set the immunity resistance factor to 0, which (de facto) removes its influence on the simulation. A detailed discussion on how to use this parameter is given in the tutorial on immunity.

[9]:
for virus, cf in [("base", 1), ("b117", 1.3)]:
    params.loc[("virus_strain", virus, "contagiousness_factor"), "value"] = cf
    params.loc[("virus_strain", virus, "immunity_resistance_factor"), "value"] = 0

For the initial conditions, we assume a two-day burn-in period. On the first day, 50 people are infected with the base virus, on the second day one halve of 50 people has the old and the other halve the new variant.

Each column in the DataFrame is a categorical. Infected individuals have a code for the variant, all others have NaNs.

[10]:
infected_first_day = set(np.random.choice(10_000, size=50, replace=False))
first_day = pd.Series([pd.NA] * 10_000)
first_day.iloc[list(infected_first_day)] = "base"
[11]:
infected_second_day_old_variant = set(
    np.random.choice(
        list(set(range(10_000)) - infected_first_day), size=25, replace=False
    )
)
infected_second_day_new_variant = set(
    np.random.choice(
        list(set(range(10_000)) - infected_first_day - infected_second_day_old_variant),
        size=25,
        replace=False,
    )
)

second_day = pd.Series([pd.NA] * 10_000)
second_day.iloc[list(infected_second_day_old_variant)] = "base"
second_day.iloc[list(infected_second_day_new_variant)] = "b117"
[12]:
initial_infections = pd.DataFrame(
    {
        pd.Timestamp("2020-02-25"): pd.Categorical(
            first_day, categories=["base", "b117"]
        ),
        pd.Timestamp("2020-02-26"): pd.Categorical(
            second_day, categories=["base", "b117"]
        ),
    }
)
[13]:
initial_conditions = {"initial_infections": initial_infections, "initial_immunity": 50}

Run the simulation

We are going to simulate this population for 200 periods.

[14]:
simulate = sid.get_simulate_func(
    initial_states=initial_states,
    contact_models=contact_models,
    params=params,
    initial_conditions=initial_conditions,
    duration={"start": "2020-02-27", "periods": 365},
    virus_strains=["base", "b117"],
    seed=0,
)
result = simulate(params=params)
Start the simulation...
2021-02-25: 100%|██████████| 365/365 [01:18<00:00,  4.68it/s]
[15]:
result["time_series"].head()
[15]:
date region symptomatic ever_vaccinated is_tested_positive_by_rapid_test newly_vaccinated needs_icu new_known_case infectious newly_deceased cd_infectious_false knows_infectious knows_immune ever_infected virus_strain n_has_infected newly_infected immunity age_group dead
0 2020-02-27 North False False False False False False False False -10002 False False False NaN 0 False 0.0 0-9 False
1 2020-02-27 North False False False False False False False False -10002 False False False NaN 0 False 0.0 50-59 False
2 2020-02-27 South False False False False False False False False -10002 False False False NaN 0 False 0.0 50-59 False
3 2020-02-27 South False False False False False False False False -10002 False False False NaN 0 False 0.0 30-39 False
4 2020-02-27 South False False False False False False False False -10002 False False False NaN 0 False 0.0 20-29 False
[16]:
result["last_states"].head()
[16]:
age_group region ever_infected infectious symptomatic needs_icu dead pending_test received_test_result knows_immune ... cd_needs_icu_true_draws cd_received_test_result_true_draws cd_symptoms_false_draws cd_symptoms_true_draws group_codes_close group_codes_distant date period n_contacts_close n_contacts_distant
0 0-9 North True False False False False False False False ... -1 2 27 -1 0 0 2021-02-25 786 2 2
1 50-59 North True False False False False False False False ... -1 4 18 2 10 10 2021-02-25 786 2 9
2 50-59 South True False False False False False False False ... -1 2 18 1 11 11 2021-02-25 786 0 9
3 30-39 South False False False False False False False False ... -1 1 27 -1 7 7 2021-02-25 786 3 8
4 20-29 South True False False False False False False False ... -1 1 27 -1 5 5 2021-02-25 786 4 8

5 rows × 51 columns

The return of simulate is a dictionary with containing the time series data and the last states as a Dask DataFrame. This allows to load the data lazily.

The last_states can be used to resume the simulation. We will inspect the time_series data. If data data fits your working memory, do the following to convert it to a pandas DataFrame.

[17]:
df = result["time_series"].compute()

Let us take a look at various statistics of the sample.

[18]:
fig, axs = plt.subplots(3, 2, figsize=(12, 8))
fig.subplots_adjust(bottom=0.15, wspace=0.2, hspace=0.4)

axs = axs.flatten()

df.resample("D", on="date")["ever_infected"].mean().plot(ax=axs[0])
df.resample("D", on="date")["infectious"].mean().plot(ax=axs[1])
df.resample("D", on="date")["dead"].sum().plot(ax=axs[2])

r_zero = sid.statistics.calculate_r_zero(df, window_length=7)
r_zero.plot(ax=axs[3])

r_effective = sid.statistics.calculate_r_effective(df, window_length=7)
r_effective.plot(ax=axs[4])

df.query("newly_infected").groupby([pd.Grouper(key="date", freq="D"), "virus_strain"])[
    "newly_infected"
].count().unstack().plot(ax=axs[5])

for ax in axs:
    ax.set_xlabel("")
    ax.spines["right"].set_visible(False)
    ax.spines["top"].set_visible(False)

axs[0].set_title("Share of Infected People")
axs[1].set_title("Share of Infectious People in the Population")
axs[2].set_title("Total Number of Deaths")
axs[3].set_title("$R_0$ (Basic Reproduction Number)")
axs[4].set_title("$R_t$ (Effective Reproduction Number)")
axs[5].set_title("Distribution of virus strains among newly infected")

plt.show()
../_images/tutorials_how_to_simulate_multiple_virus_strains_24_0.png