4.2 Generating candidate models: general principle
The approach used in regional_analysis.Rmd to generate many candidate models is to build character strings matching trending model definitions (see the previous section) and then use the eval(parse(text = my_text)) trick to turn these strings into actual models. The simplest approach is to:

1. generate text matching the right-hand side of the formula of the different models (we later refer to this as the model content)
2. build character strings matching model calls using the terms in 1); sprintf() is particularly handy for this part
3. transform these character strings into actual model calls within a lapply()
Here is a toy example illustrating the approach:
library(trending)
# step 1
mod_content <- c("1", "tests", "date", "date + tests")
# step 2
models_txt <- sprintf(
"glm_model(cases ~ %s, family = poisson)",
mod_content)
# step 3
models <- lapply(models_txt, function(e) eval(parse(text = e)))
class(models) # this is a list
## [1] "list"
length(models) # each component is a model
## [1] 4
lapply(models, class) # check classes of each model
## [[1]]
## [1] "trending_glm" "trending_model"
##
## [[2]]
## [1] "trending_glm" "trending_model"
##
## [[3]]
## [1] "trending_glm" "trending_model"
##
## [[4]]
## [1] "trending_glm" "trending_model"
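As an aside, the same list can be built without eval(parse(...)): formula objects can be constructed programmatically with base R's reformulate() and then supplied to glm_model() directly. The code below is only a sketch of this alternative, not the approach used in regional_analysis.Rmd:

```r
# Alternative sketch: build formula objects directly, without eval(parse(...))
mod_content <- c("1", "tests", "date", "date + tests")

formulas <- lapply(mod_content, function(rhs) {
  # split the model content into individual terms, then rebuild a formula
  terms <- trimws(strsplit(rhs, "\\+")[[1]])
  reformulate(terms, response = "cases")
})

formulas[[4]]
## cases ~ date + tests
```

Each of these formulas could then be passed to glm_model(formula, family = poisson) as before; the main difference is that no text is ever evaluated as code.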
As the main thing that changes across models is the model content, the main task boils down to generating combinations of predictors to capture different trends in the data. To this end, we will use expand.grid, which creates all possible combinations of a given set of variables. For instance, to generate all models which:

- include a date effect
- may include a tests effect
- may include a weekday effect
- may include the previous day's incidence as a predictor (cases_lag_1), i.e. an autoregressive model

we can use:
# generate all combinations
mod_content_grid <- expand.grid(c("", "tests"),
"date",
c("", "weekday"),
c("", "cases_lag_1"))
mod_content_grid
##    Var1 Var2    Var3        Var4
## 1       date
## 2 tests date
## 3       date weekday
## 4 tests date weekday
## 5       date         cases_lag_1
## 6 tests date         cases_lag_1
## 7       date weekday cases_lag_1
## 8 tests date weekday cases_lag_1
# concatenate the columns
mod_content <- apply(mod_content_grid, 1, paste, collapse = " + ")
mod_content
## [1] " + date +  + "
## [2] "tests + date +  + "
## [3] " + date + weekday + "
## [4] "tests + date + weekday + "
## [5] " + date +  + cases_lag_1"
## [6] "tests + date +  + cases_lag_1"
## [7] " + date + weekday + cases_lag_1"
## [8] "tests + date + weekday + cases_lag_1"
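The empty strings have left stray + signs behind. One way to remove them is with base-R regular expressions; the patterns below are a sketch, one possible choice among several:

```r
# Remove stray "+" signs left by empty predictors:
# 1. collapse runs of "+" separated only by spaces into a single "+"
# 2. strip leading/trailing "+" and whitespace
clean_terms <- function(x) {
  x <- gsub("\\+( *\\+)+", "+", x)
  gsub("^[ +]+|[ +]+$", "", x)
}

clean_terms(" + date +  + ")
## [1] "date"
clean_terms("tests + date +  + cases_lag_1")
## [1] "tests + date + cases_lag_1"
```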
We see that mod_content contains the relevant model content, but with stray + signs left over from the empty strings. While regular expressions can handle this clean-up, a tidier alternative is to use NA (rather than "") for absent effects, and to combine the columns with tidyr's unite(), which can drop missing values:
# load packages
pacman::p_load(tidyverse)
# combine effects to generate all possible models
# note the use of 'NA' to have models with/without effects
mod_content_grid <- expand.grid(
c(NA, "tests"),
"date",
c(NA, "weekday"),
c(NA, "cases_lag_1"))
mod_content_grid
##    Var1 Var2    Var3        Var4
## 1  <NA> date    <NA>        <NA>
## 2 tests date    <NA>        <NA>
## 3  <NA> date weekday        <NA>
## 4 tests date weekday        <NA>
## 5  <NA> date    <NA> cases_lag_1
## 6 tests date    <NA> cases_lag_1
## 7  <NA> date weekday cases_lag_1
## 8 tests date weekday cases_lag_1
# Use unite() to combine all columns, with sep = " + " and na.rm = TRUE
mod_content <- mod_content_grid %>%
unite(
col = "models", # name of the new united column
1:ncol(mod_content_grid), # columns to unite
sep = " + ", # separator to use in united column
remove = TRUE, # if TRUE, removes input cols from the data frame
na.rm = TRUE # if TRUE, missing values are removed before uniting
) %>%
pull(models) # extract column into a character vector
# Check results
mod_content
## [1] "date"
## [2] "tests + date"
## [3] "date + weekday"
## [4] "tests + date + weekday"
## [5] "date + cases_lag_1"
## [6] "tests + date + cases_lag_1"
## [7] "date + weekday + cases_lag_1"
## [8] "tests + date + weekday + cases_lag_1"
We now have clean model content which can be turned into trending models using the approach illustrated before. In the following sections, we highlight tricks for capturing specific trends in the data, but all ultimately rely on the principle illustrated here.
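Putting the pieces together, the cleaned mod_content can be fed back through the sprintf() and eval(parse(...)) steps shown at the start of this section. The sketch below assumes the trending package is available and that the data frame (here called dat, a placeholder name) contains the columns cases, date, tests, weekday and cases_lag_1:

```r
library(trending)

# step 2: build model calls as text from the cleaned model content
models_txt <- sprintf(
  "glm_model(cases ~ %s, family = poisson)",
  mod_content
)

# step 3: evaluate the text to obtain actual trending models
models <- lapply(models_txt, function(e) eval(parse(text = e)))

# the models can then be fitted to the data, e.g.:
# fits <- lapply(models, fit, dat)
```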