4.2 Generating candidate models: general principle
The approach used in regional_analysis.Rmd to generate many candidate models is to build character strings matching trending model definitions (see the previous section) and then use the eval(parse(text = my_text)) trick to turn these strings into actual models. The simplest approach is to:

1. generate text matching the right-hand side of the formula of the different models (we later refer to this as the model content)
2. build character strings matching model calls using the terms in 1); sprintf() is particularly handy for this part
3. transform these character strings into actual model calls within a lapply()
Here is a toy example illustrating the approach:
library(trending)
# step 1
mod_content <- c("1", "tests", "date", "date + tests")
# step 2
models_txt <- sprintf(
"glm_model(cases ~ %s, family = poisson)",
mod_content)
# step 3
models <- lapply(models_txt, function(e) eval(parse(text = e)))
class(models) # this is a list
## [1] "list"
length(models) # each component is a model
## [1] 4
lapply(models, class) # check classes of each model
## [[1]]
## [1] "trending_glm" "trending_model"
##
## [[2]]
## [1] "trending_glm" "trending_model"
##
## [[3]]
## [1] "trending_glm" "trending_model"
##
## [[4]]
## [1] "trending_glm" "trending_model"
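As an aside, the same list can be built without eval(parse(...)): formula objects can be constructed programmatically with base R's reformulate() and then supplied to glm_model() directly. The code below is only a sketch of this alternative, not the approach used in regional_analysis.Rmd:

```r
# Alternative sketch: build formula objects directly, without eval(parse(...))
mod_content <- c("1", "tests", "date", "date + tests")

formulas <- lapply(mod_content, function(rhs) {
  # split the model content into individual terms, then rebuild a formula
  terms <- trimws(strsplit(rhs, "\\+")[[1]])
  reformulate(terms, response = "cases")
})

formulas[[4]]
## cases ~ date + tests
```

Each of these formulas could then be passed to glm_model(formula, family = poisson) as before; the main difference is that no text is ever evaluated as code.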
As the main thing that changes across models is the model content, the main task boils down to generating combinations of predictors to capture different trends in the data. To this end, we will use expand.grid, which creates all possible combinations of a given set of variables. For instance, to generate all models which:

- include a date effect
- may include a tests effect
- may include a weekday effect
- may include the previous day's incidence as a predictor (cases_lag_1), i.e. an autoregressive model

we can use:
# generate all combinations
mod_content_grid <- expand.grid(c("", "tests"),
"date",
c("", "weekday"),
c("", "cases_lag_1"))
mod_content_grid
##    Var1 Var2    Var3        Var4
## 1       date
## 2 tests date
## 3       date weekday
## 4 tests date weekday
## 5       date         cases_lag_1
## 6 tests date         cases_lag_1
## 7       date weekday cases_lag_1
## 8 tests date weekday cases_lag_1
# concatenate the columns
mod_content <- apply(mod_content_grid, 1, paste, collapse = " + ")
mod_content
## [1] " + date +  + "
## [2] "tests + date +  + "
## [3] " + date + weekday + "
## [4] "tests + date + weekday + "
## [5] " + date +  + cases_lag_1"
## [6] "tests + date +  + cases_lag_1"
## [7] " + date + weekday + cases_lag_1"
## [8] "tests + date + weekday + cases_lag_1"
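The empty strings have left stray + signs behind. One way to remove them is with base-R regular expressions; the patterns below are a sketch, one possible choice among several:

```r
# Remove stray "+" signs left by empty predictors:
# 1. collapse runs of "+" separated only by spaces into a single "+"
# 2. strip leading/trailing "+" and whitespace
clean_terms <- function(x) {
  x <- gsub("\\+( *\\+)+", "+", x)
  gsub("^[ +]+|[ +]+$", "", x)
}

clean_terms(" + date +  + ")
## [1] "date"
clean_terms("tests + date +  + cases_lag_1")
## [1] "tests + date + cases_lag_1"
```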
We see that mod_content contains the relevant model content, but with stray + signs left over from the empty strings. While regular expressions can handle this clean-up, a tidier alternative is to use NA (rather than "") for absent effects, and to combine the columns with tidyr's unite(), which can drop missing values:
# load packages
pacman::p_load(tidyverse)
# combine effects to generate all possible models
# note the use of 'NA' to have models with/without effects
mod_content_grid <- expand.grid(
c(NA, "tests"),
"date",
c(NA, "weekday"),
c(NA, "cases_lag_1"))
mod_content_grid
##    Var1 Var2    Var3        Var4
## 1  <NA> date    <NA>        <NA>
## 2 tests date    <NA>        <NA>
## 3  <NA> date weekday        <NA>
## 4 tests date weekday        <NA>
## 5  <NA> date    <NA> cases_lag_1
## 6 tests date    <NA> cases_lag_1
## 7  <NA> date weekday cases_lag_1
## 8 tests date weekday cases_lag_1
# Use unite() to combine all columns, with sep = " + " and na.rm = TRUE
mod_content <- mod_content_grid %>%
unite(
col = "models", # name of the new united column
1:ncol(mod_content_grid), # columns to unite
sep = " + ", # separator to use in united column
remove = TRUE, # if TRUE, removes input cols from the data frame
na.rm = TRUE # if TRUE, missing values are removed before uniting
) %>%
pull(models) # extract column into a character vector
# Check results
mod_content
## [1] "date"
## [2] "tests + date"
## [3] "date + weekday"
## [4] "tests + date + weekday"
## [5] "date + cases_lag_1"
## [6] "tests + date + cases_lag_1"
## [7] "date + weekday + cases_lag_1"
## [8] "tests + date + weekday + cases_lag_1"
We now have clean model content which can be turned into trending models using the approach illustrated before. In the following sections, we highlight tricks for capturing specific trends in the data, but all ultimately rely on the principle illustrated here.
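Putting the pieces together, the cleaned mod_content can be fed back through the sprintf() and eval(parse(...)) steps shown at the start of this section. The sketch below assumes the trending package is available and that the data frame (here called dat, a placeholder name) contains the columns cases, date, tests, weekday and cases_lag_1:

```r
library(trending)

# step 2: build model calls as text from the cleaned model content
models_txt <- sprintf(
  "glm_model(cases ~ %s, family = poisson)",
  mod_content
)

# step 3: evaluate the text to obtain actual trending models
models <- lapply(models_txt, function(e) eval(parse(text = e)))

# the models can then be fitted to the data, e.g.:
# fits <- lapply(models, fit, dat)
```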