How to simulate¶

This tutorial helps you to get started with sid by building a simple model. First, it focuses on the core objects which are needed for simulation. Then, the spread of the disease in the artificial population is simulated.

Are we all set? Let’s go!

[1]:

%matplotlib inline

import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import sid
from sid.config import INDEX_NAMES

warnings.filterwarnings(
    "ignore", message="indexing past lexsort depth may impact performance."
)

The core objects¶

This section explains the core objects necessary to run a simulation with sid.

The `initial_states`¶

Since sid is an agent-based simulator, it needs information on agents or individuals in the model. This information is captured in the initial_states. The initial_states are a pandas.DataFrame where rows represent individuals and columns are characteristics, for example, age, geographic information, etc..

In this context, a state is a single row in the data frame representing an individual at a certain point in time. The characteristics contain all the information needed for the individual to make decisions such as going to work, demanding a test, or for policies to target the individual.

The following states include information on the age of individuals who fall into different age brackets. The age brackets are necessary for some stochastic processes in the model. For example, the probability of staying asymptomatic after an infection is age-dependent since elderly are more likely to experience strong symptoms.

The states also include very broad geographic information of the individuals which only there to speed up simulation.

Note

To give a little bit of background information on the matching algorithm: Later, the tutorial will allow for meetings between two randomly selected people in the model. Instead of searching for a contact partner among all people, it is easier to, first, sample a subset where the partner is in, such as a geographic cluster, and, after that, sample the specific individual.

Once there is an detailed information on the matching algorithm, it will be referenced here. Until then, please consult the API documentation on calculate_infections_by_contacts() and related functions.

We start with 10,000 individuals. Note that all characteristics are converted to categoricals which ensures that values are self-explanatory and save memory.

[2]:

n_individuals = 10_000
available_ages = [
    "0-9",
    "10-19",
    "20-29",
    "30-39",
    "40-49",
    "50-59",
    "60-69",
    "70-79",
    "80-100",
]

ages = np.random.choice(available_ages, size=n_individuals)
regions = np.random.choice(["North", "East", "South", "West"], size=n_individuals)

initial_states = pd.DataFrame({"age_group": ages, "region": regions}).astype("category")
initial_states.head(5)

[2]:

	age_group	region
0	10-19	East
1	70-79	West
2	10-19	South
3	10-19	North
4	40-49	East

See also

You can read more about the states in the reference guides.

The `contact_models`¶

The contact models are key to how contacts between individuals can be modeled with sid. There are two types of contact models, random and recurrent models:

Recurrent contact models are used when people belong to a certain group and if they meet this group, they meet all participating members. You would use recurrent models to model school classes, households, and colleagues/work groups.
Random contact models are used when people do not always meet the same predefined groups, but contacts happen more randomly. Usually, people are more likely to meet a person from their own group, but there is also a lower probability of meeting someone from another group. Groups can be formed by combining any number of characteristics. If we would use age groups, we would say contacts are assortative by age. Each person also has a budget for the number of people she can meet. You would use random contact models, for example, to model meetings during freetime activities, infrequent contacts at work, or acquaintances.

In the following, we define a random contact model which lets each individual in the population meet 0-9 other people each day. The contacts are assortative by age and region since we want to make contacts more likely within the same regions and in the same age groups.

[3]:

def meet_people(states, params, seed):
    np.random.seed(seed)
    contacts = np.random.choice(np.arange(10), size=len(states))
    return pd.Series(contacts, index=states.index)

[4]:

contact_models = {
    "meet_people": {
        "model": meet_people,
        "assort_by": ["age_group", "region"],
        "is_recurrent": False,
    }
}

Contact models are stored as a dictionary. Our contact model is called "meet_people" and we define it in another dictionary. The dictionary has the following components.

"model" holds the function which maps individual characteristics with parameters to a number of contacts which can be an integer or a float – internally the number of contacts is rounded over the whole population while preserving the sum of contacts.
"assort_by" specifies the variables by which people are more likely to meet each other.
"is_recurrent" is an indicator for a whether the models represents recurrent or random contacts.

See also

You can read more about contact_models in the references guides.

The `params`¶

The parameters are a pandas.DataFrame and include medical and other relevant information. sid offers a prepared database which can be loaded and easily adjusted. The following code can be used for loading.

[5]:

params = sid.load_epidemiological_parameters()
params.head(5)

[5]:

			value
category	subcategory	name
health_system	icu_limit_relative	icu_limit_relative	50.00
cd_infectious_true	all	1	0.39
		2	0.35
		3	0.22
		5	0.04

The index is a three dimensional multi index which allows to organize parameters hierarchically. If lower levels are not necessary, it is common to repeat the previous level.

For our model, we need to specify some additional parameters. First, we add the infection probability for our contact model. If a contact happens between an infectious and a susceptible person, the susceptible person becomes infected according to this probability.

[6]:

params.loc[("infection_prob", "meet_people", "meet_people"), "value"] = 0.05

Similarly, we specify for each contact model how the strength of the assortative matching.

We assume that 90% of contacts are with people from the same region and 50% with contacts from the same age group. It means that the probability of meeting someone from the same age group and region is 0.45. The rest of the probability mass will be split equally between the remaining groups formed by regions and age groups.

[7]:

params.loc[("assortative_matching", "meet_people", "age_group"), "value"] = 0.5
params.loc[("assortative_matching", "meet_people", "region"), "value"] = 0.9

We further have to specify parameters governing the immunity level and its waning. For a detailed explaination of these parameters see the tutorial notebook on how to model immunity.

[8]:

immunity_params = pd.read_csv("immunity_params.csv", index_col=INDEX_NAMES)

params = pd.concat((params, immunity_params))

See also

You can read more about params in the references guides.

The `initial_conditions`¶

Finally, we need to set the initial_conditions which describes the progression of the disease at the start of the simulation like who is infected or already immune, etc.. sid gives you the ability to start the progression of the infectious disease from patient zero. It also offers a simple interface to come up with complex patterns like individuals in all stages of the disease.

For now, we assume that there are 100 infected individuals and 50 with pre-existing immunity.

[9]:

initial_conditions = {"initial_infections": 100, "initial_immunity": 50}

See also

More information on the initial_conditions can be found in a dedicated how-to guide.

The simulation¶

Now, we are going to simulate this population for 200 periods. For that, we define a duration which is a dictionary with a "start" and a "periods" key or a "start" and "end" keys.

[10]:

duration = {"start": "2020-02-27", "periods": 200}

sid differentiates between the setup of the simulation and the actual simulation. That is why we need to build the simulation() function first.

[11]:

simulate = sid.get_simulate_func(
    initial_states=initial_states,
    contact_models=contact_models,
    params=params,
    initial_conditions=initial_conditions,
    duration=duration,
    seed=0,
)

After that, the function to simulate the data only depends on the parameters. Now, let us simulate our model!

[12]:

result = simulate(params=params)

Start the simulation...
2020-09-13: 100%|██████████| 200/200 [00:28<00:00,  7.07it/s]

The result object is a dictionary with two keys:

"time_series" is a dask.dataframe which holds the data generated by the simulation. Since it is not feasible to hold all the data from the simulation in memory, each day is stored on disk in compressed .parquet files. dask provides a convenient interface to load all files and perform calculation on the data. If you want to work with pandas.DataFrames instead, convert the dask.dataframe to pandas with time_series.compute().
"last_states" holds a dask.dataframe of the lastly simulated states including all data columns which are usually dropped to save memory. The last_states can be used to resume a simulation.

[13]:

result.keys()

[13]:

dict_keys(['time_series', 'last_states'])

Here is a preview for each of the two objects.

[14]:

result["time_series"].head()

[14]:

	symptomatic	is_tested_positive_by_rapid_test	infectious	age_group	newly_vaccinated	cd_infectious_false	newly_infected	ever_infected	date	virus_strain	ever_vaccinated	knows_immune	newly_deceased	region	knows_infectious	needs_icu	new_known_case	dead
0	False	False	False	10-19	False	-10001	False	False	2020-02-27	NaN	False	False	False	East	False	False	False	False
1	False	False	False	70-79	False	-10001	False	False	2020-02-27	NaN	False	False	False	West	False	False	False	False
2	False	False	False	10-19	False	-10001	False	False	2020-02-27	NaN	False	False	False	South	False	False	False	False
3	False	False	False	10-19	False	-10001	False	False	2020-02-27	NaN	False	False	False	North	False	False	False	False
4	False	False	False	40-49	False	-10001	False	False	2020-02-27	NaN	False	False	False	East	False	False	False	False

[15]:

result["last_states"].head()

[15]:

	age_group	region	ever_infected	infectious	symptomatic	needs_icu	dead	pending_test	received_test_result	knows_immune	...	cd_knows_infectious_false_draws	cd_needs_icu_false_draws	cd_needs_icu_true_draws	cd_received_test_result_true_draws	cd_symptoms_false_draws	cd_symptoms_true_draws	group_codes_meet_people	date	period	n_contacts_meet_people
0	10-19	East	True	False	False	False	False	False	False	False	...	-1	-1	-1	2	27	-1	4	2020-09-13	621	5
1	70-79	West	False	False	False	False	False	False	False	False	...	-1	-1	10	4	18	1	31	2020-09-13	621	3
2	10-19	South	False	False	False	False	False	False	False	False	...	-1	15	-1	2	18	-1	6	2020-09-13	621	1
3	10-19	North	False	False	False	False	False	False	False	False	...	-1	-1	-1	1	27	-1	5	2020-09-13	621	0
4	40-49	East	False	False	False	False	False	False	False	False	...	-1	25	-1	1	27	1	16	2020-09-13	621	6

5 rows × 49 columns

Let us take a look at various statistics of the sample.

[16]:

df = result["time_series"].compute()

[17]:

fig, axs = plt.subplots(3, 2, figsize=(12, 8))
fig.subplots_adjust(bottom=0.15, wspace=0.2, hspace=0.4)

axs = axs.flatten()

df.resample("D", on="date")["ever_infected"].mean().plot(ax=axs[0])
df.resample("D", on="date")["infectious"].mean().plot(ax=axs[1])
df.resample("D", on="date")["dead"].sum().plot(ax=axs[2])

r_zero = sid.statistics.calculate_r_zero(df, window_length=7)
r_zero.plot(ax=axs[3])

r_effective = sid.statistics.calculate_r_effective(df, window_length=7)
r_effective.plot(ax=axs[4])

for ax in axs:
    ax.set_xlabel("")
    ax.spines["right"].set_visible(False)
    ax.spines["top"].set_visible(False)

axs[0].set_title("Share of Infected People")
axs[1].set_title("Share of Infectious People in the Population")
axs[2].set_title("Total Number of Deaths")
axs[3].set_title("$R_0$ (Basic Reproduction Number)")
axs[4].set_title("$R_t$ (Effective Reproduction Number)")

axs[5].set_visible(False)

plt.show()

../_images/tutorials_how_to_simulate_37_0.png

How to simulate¶

The core objects¶

The initial_states¶

The contact_models¶

The params¶

The initial_conditions¶

The simulation¶

The `initial_states`¶

The `contact_models`¶

The `params`¶

The `initial_conditions`¶