How to simulate multiple virus strains¶

In this tutorial, we are going to simulate the spread of Covid-19 with two virus strains.

[1]:

%matplotlib inline

import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import sid
from sid.config import INDEX_NAMES

warnings.filterwarnings("ignore")

Preparation¶

For the simulation we need to prepare several objects which are identical to the ones from the general tutorial on the simulation.

[2]:

available_ages = [
    "0-9",
    "10-19",
    "20-29",
    "30-39",
    "40-49",
    "50-59",
    "60-69",
    "70-79",
    "80-100",
]

ages = np.random.choice(available_ages, size=10_000)
regions = np.random.choice(["North", "South"], size=10_000)

initial_states = pd.DataFrame({"age_group": ages, "region": regions}).astype("category")
initial_states.head(5)

[2]:

	age_group	region
0	0-9	North
1	50-59	North
2	50-59	South
3	30-39	South
4	20-29	South

[3]:

def meet_distant(states, params, seed):
    possible_nr_contacts = np.arange(10)
    contacts = np.random.choice(possible_nr_contacts, size=len(states))
    return pd.Series(contacts, index=states.index)


def meet_close(states, params, seed):
    possible_nr_contacts = np.arange(5)
    contacts = np.random.choice(possible_nr_contacts, size=len(states))
    return pd.Series(contacts, index=states.index)


assort_by = ["age_group", "region"]

contact_models = {
    "distant": {"model": meet_distant, "assort_by": assort_by, "is_recurrent": False},
    "close": {"model": meet_close, "assort_by": assort_by, "is_recurrent": False},
}

[4]:

epidemiological_parameters = pd.read_csv("infection_probs.csv", index_col=INDEX_NAMES)
epidemiological_parameters

[4]:

			value	note	source
category	subcategory	name
infection_prob	close	close	0.05	NaN	NaN
	distant	distant	0.03	NaN	NaN
	household	household	0.20	NaN	NaN

[5]:

assort_probs = pd.read_csv("assort_by_params.csv", index_col=INDEX_NAMES)
assort_probs

[5]:

			value	note	source
category	subcategory	name
assortative_matching	close	age_group	0.5	NaN	NaN
	close	region	0.9	NaN	NaN
	distant	age_group	0.5	NaN	NaN
	distant	region	0.9	NaN	NaN

[6]:

disease_params = sid.load_epidemiological_parameters()
disease_params.head(6).round(2)

[6]:

			value
category	subcategory	name
health_system	icu_limit_relative	icu_limit_relative	50.00
cd_infectious_true	all	1	0.39
		2	0.35
		3	0.22
		5	0.04
cd_infectious_false	all	3	0.10

[7]:

immunity_params = pd.read_csv("immunity_params.csv", index_col=INDEX_NAMES)
immunity_params

[7]:

			value	note	source
category	subcategory	name
immunity	immunity_level	from_infection	0.9900	NaN	NaN
	immunity_level	from_vaccination	0.8000	NaN	NaN
	immunity_waning	time_to_reach_maximum_infection	7.0000	NaN	NaN
		time_to_reach_maximum_vaccination	28.0000	NaN	NaN
		slope_after_maximum_infection	-0.0001	NaN	NaN
		slope_after_maximum_vaccination	-0.0002	NaN	NaN

[8]:

params = pd.concat(
    [disease_params, epidemiological_parameters, assort_probs, immunity_params]
)

Additional objects to simulate multiple virus strains¶

To implement multiple virus strains, we have to make the following extensions to the model.

Add a multiplier for the contagiousness of each virus to the parameters.
Add a multiplier for the immunity resistance factor for each virus to the parameters.
Prepare a DataFrame for the initial conditions.

Here we set the immunity resistance factor to 0, which (de facto) removes its influence on the simulation. A detailed discussion on how to use this parameter is given in the tutorial on immunity.

[9]:

for virus, cf in [("base", 1), ("b117", 1.3)]:
    params.loc[("virus_strain", virus, "contagiousness_factor"), "value"] = cf
    params.loc[("virus_strain", virus, "immunity_resistance_factor"), "value"] = 0

For the initial conditions, we assume a two-day burn-in period. On the first day, 50 people are infected with the base virus, on the second day one halve of 50 people has the old and the other halve the new variant.

Each column in the DataFrame is a categorical. Infected individuals have a code for the variant, all others have NaNs.

[10]:

infected_first_day = set(np.random.choice(10_000, size=50, replace=False))
first_day = pd.Series([pd.NA] * 10_000)
first_day.iloc[list(infected_first_day)] = "base"

[11]:

infected_second_day_old_variant = set(
    np.random.choice(
        list(set(range(10_000)) - infected_first_day), size=25, replace=False
    )
)
infected_second_day_new_variant = set(
    np.random.choice(
        list(set(range(10_000)) - infected_first_day - infected_second_day_old_variant),
        size=25,
        replace=False,
    )
)

second_day = pd.Series([pd.NA] * 10_000)
second_day.iloc[list(infected_second_day_old_variant)] = "base"
second_day.iloc[list(infected_second_day_new_variant)] = "b117"

[12]:

initial_infections = pd.DataFrame(
    {
        pd.Timestamp("2020-02-25"): pd.Categorical(
            first_day, categories=["base", "b117"]
        ),
        pd.Timestamp("2020-02-26"): pd.Categorical(
            second_day, categories=["base", "b117"]
        ),
    }
)

[13]:

initial_conditions = {"initial_infections": initial_infections, "initial_immunity": 50}

Run the simulation¶

We are going to simulate this population for 200 periods.

[14]:

simulate = sid.get_simulate_func(
    initial_states=initial_states,
    contact_models=contact_models,
    params=params,
    initial_conditions=initial_conditions,
    duration={"start": "2020-02-27", "periods": 365},
    virus_strains=["base", "b117"],
    seed=0,
)
result = simulate(params=params)

Start the simulation...
2021-02-25: 100%|██████████| 365/365 [01:18<00:00,  4.68it/s]

[15]:

result["time_series"].head()

[15]:

	date	region	symptomatic	ever_vaccinated	is_tested_positive_by_rapid_test	newly_vaccinated	needs_icu	new_known_case	infectious	newly_deceased	cd_infectious_false	knows_infectious	knows_immune	ever_infected	virus_strain	newly_infected	age_group	dead
0	2020-02-27	North	False	False	False	False	False	False	False	False	-10002	False	False	False	NaN	False	0-9	False
1	2020-02-27	North	False	False	False	False	False	False	False	False	-10002	False	False	False	NaN	False	50-59	False
2	2020-02-27	South	False	False	False	False	False	False	False	False	-10002	False	False	False	NaN	False	50-59	False
3	2020-02-27	South	False	False	False	False	False	False	False	False	-10002	False	False	False	NaN	False	30-39	False
4	2020-02-27	South	False	False	False	False	False	False	False	False	-10002	False	False	False	NaN	False	20-29	False

[16]:

result["last_states"].head()

[16]:

	age_group	region	ever_infected	infectious	symptomatic	needs_icu	dead	pending_test	received_test_result	knows_immune	...	cd_needs_icu_true_draws	cd_received_test_result_true_draws	cd_symptoms_false_draws	cd_symptoms_true_draws	group_codes_close	group_codes_distant	date	period	n_contacts_close	n_contacts_distant
0	0-9	North	True	False	False	False	False	False	False	False	...	-1	2	27	-1	0	0	2021-02-25	786	2	2
1	50-59	North	True	False	False	False	False	False	False	False	...	-1	4	18	2	10	10	2021-02-25	786	2	9
2	50-59	South	True	False	False	False	False	False	False	False	...	-1	2	18	1	11	11	2021-02-25	786	0	9
3	30-39	South	False	False	False	False	False	False	False	False	...	-1	1	27	-1	7	7	2021-02-25	786	3	8
4	20-29	South	True	False	False	False	False	False	False	False	...	-1	1	27	-1	5	5	2021-02-25	786	4	8

5 rows × 51 columns

The return of simulate is a dictionary with containing the time series data and the last states as a Dask DataFrame. This allows to load the data lazily.

The last_states can be used to resume the simulation. We will inspect the time_series data. If data data fits your working memory, do the following to convert it to a pandas DataFrame.

[17]:

df = result["time_series"].compute()

Let us take a look at various statistics of the sample.

[18]:

fig, axs = plt.subplots(3, 2, figsize=(12, 8))
fig.subplots_adjust(bottom=0.15, wspace=0.2, hspace=0.4)

axs = axs.flatten()

df.resample("D", on="date")["ever_infected"].mean().plot(ax=axs[0])
df.resample("D", on="date")["infectious"].mean().plot(ax=axs[1])
df.resample("D", on="date")["dead"].sum().plot(ax=axs[2])

r_zero = sid.statistics.calculate_r_zero(df, window_length=7)
r_zero.plot(ax=axs[3])

r_effective = sid.statistics.calculate_r_effective(df, window_length=7)
r_effective.plot(ax=axs[4])

df.query("newly_infected").groupby([pd.Grouper(key="date", freq="D"), "virus_strain"])[
    "newly_infected"
].count().unstack().plot(ax=axs[5])

for ax in axs:
    ax.set_xlabel("")
    ax.spines["right"].set_visible(False)
    ax.spines["top"].set_visible(False)

axs[0].set_title("Share of Infected People")
axs[1].set_title("Share of Infectious People in the Population")
axs[2].set_title("Total Number of Deaths")
axs[3].set_title("$R_0$ (Basic Reproduction Number)")
axs[4].set_title("$R_t$ (Effective Reproduction Number)")
axs[5].set_title("Distribution of virus strains among newly infected")

plt.show()

../_images/tutorials_how_to_simulate_multiple_virus_strains_24_0.png