Chapter 2 Models Family Choice

In this chapter, we discuss the appropriate models’ family to take into account data characteristics.

2.1 Negative Binomial

Externalizing problems are computed as the sum of 10 items of the SDQ, obtaining discrete scores that range from 0 to 20. Thus, we should use appropriate discrete distribution such as the Poisson distribution or the Negative Binomial. In the Poisson distribution mean and variance are defined according to the same parameter \(\lambda\). On the contrary, Negative Binomial has an extra parameter to adjust the variance allowing more flexibility. Considering data distribution (see Figure~1.8), we can observe that data have high dispersion with a long right tail. In this case, the Poisson distribution would be a poor choice and we prefer Negative Binomial instead.

Again, considering data distribution (see Figure~1.8), we can observe a high peak of values at zero. Remember that this is not a clinical sample, thus it is expected that the majority of children have no problems or really few problems. We could question ourselves, however, whether a Zero-Inflated model may be appropriate

2.2 Zero Inflated Negative Binomial

To evaluate the presence of zero inflation in our data, we compare the number of observed zeros and expected zeros in a Negative Binomial mixed-effects model. We consider in the model the role of gender and the interaction between mother attachment and father attachment. Moreover, we consider the children’s classroom ID as a random effect to account for teachers’ different ability to evaluate children’s problems. Using R formula syntax, we have

# model formula
externalizing_sum ~ gender + mother * father + (1|ID_class)

The model is fitted using the function glmmTMB() from the glmmTMB R-package (Brooks et al., 2017). Next, we compare the number of observed zero and expected zeros using an adapted version of the function check_zeroinflation() from the R-package performance (Lüdecke et al., 2021) that solves a small bug (see issue https://github.com/easystats/performance/issues/367).

my_check_zeroinflation(fit_ext_nb)
## # Check for zero-inflation
## 
##    Observed zeros: 276
##   Predicted zeros: 245
##             Ratio: 0.89
## Model is underfitting zeros (probable zero-inflation).

Results indicate that the model is slightly under-fitting the number of zeros. Now, we can try to fit a Zero Inflated Negative Binomial (ZINB) model and compare the performance of the two models. ZINB models are defined as \[ y_{ij} \sim ZINB(p_{ij}, \mu_{ij}, \phi), \] where \(p_{ij}\) is the probability of an observation \(y_{ij}\) being an extra zero (i.e., a zero not coming from the Negative Binomial distribution) and \(1-p_{ij}\) indicates the probability of a given observation \(y_{ij}\) being generated form a Negative Binomial distribution with mean \(\mu_{ij}\) and variance \(\sigma_{ij}^2 = \mu_{ij} + \frac{\mu_{ij}^2}{\phi}\). Moreover, we have that \[ p_{ij} = \text{logit}^{-1}(X_i^T\beta_p+ Z_j^Tu_p),\\ \mu_{ij} = \text{exp}(X_i^T\beta_{\mu}+ Z_j^Tu_{\mu}). \] That is, both \(p\) and \(\mu\) are modelled separately according to (possibly) different variables. In our case, we consider only the role of gender for \(p\) (i.e., the probability of having externalizing problems depends on gender), whereas for \(\mu\) we also consider the interaction between mother attachment and father attachment. In both cases, we consider the children’s classroom ID as a random effect (teachers may differ in the ability to detect children’s problems and quantify them). Using R formula syntax, we have

# formula for p
p ~ gender + (1|ID_class)

# formula for mu
mu ~ gender + mother * father + (1|ID_class)

The ZINB model is fitted using the function glmmTMB(). To compare the ZINB model and the Negative Binomial model we conduct an analysis of Deviance. Note that, in the case of generalized linear models (GLM), the deviance is the corresponding of the residual variance used in the traditional ANOVA in the case of linear models.

anova(fit_ext_nb, fit_ext_zinb)
## Data: data_cluster
## Models:
## fit_ext_nb: externalizing_sum ~ gender + mother * father + (1 | ID_class), zi=~0, disp=~1
## fit_ext_zinb: externalizing_sum ~ gender + mother * father + (1 | ID_class), zi=~gender + (1 | ID_class), disp=~1
##              Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
## fit_ext_nb   19 3910.1 4000.2 -1936.1   3872.1                             
## fit_ext_zinb 22 3867.6 3971.9 -1911.8   3823.6 48.556      3  1.622e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Overall, results indicate that the ZINB model performs better than the Negative Binomial model. Thus, in the following analyses, we decide to use ZINB models.

References

Brooks, M. E., Kristensen, K., van Benthem, K. J., Magnusson, A., Berg, C. W., Nielsen, A., Skaug, H. J., Maechler, M., & Bolker, B. M. (2017). glmmTMB balances speed and flexibility among packages for zero-inflated generalized linear mixed modeling. The R Journal, 9(2), 378–400. https://journal.r-project.org/archive/2017/RJ-2017-066/index.html

Lüdecke, D., Ben-Shachar, M. S., Patil, I., Waggoner, P., & Makowski, D. (2021). performance: An R package for assessment, comparison and testing of statistical models. Journal of Open Source Software, 6(60), 3139. https://doi.org/10.21105/joss.03139