  • Research note
  • Open access

Robustness of zero-augmented models over generalized linear models in analysing fertility data in Nigeria

Abstract

Objective

Fertility is a count outcome that is usually right-skewed and exhibits more zeros than the distributional assumptions of generalized linear models (GLMs) allow. This study examined the robustness of zero-augmented models over GLMs in fitting fertility data across regions in Nigeria. The 2013 Nigeria Demographic and Health Survey data were used. The fertility models fitted were: Poisson, negative binomial, zero-inflated Poisson, zero-inflated negative binomial, hurdle Poisson and hurdle negative binomial. The Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) were used to identify the model with the best fit (α = 0.05).

Results

The percentages of zero counts in the fertility responses were 21.3, 23.9, 31.1, 30.7, 37.6 and 42.4 in the North West, North East, North Central, South West, South South and South East regions, respectively. In all six regions of Nigeria, the zero-augmented models fitted better than the generalized linear models, except in North Central. Overall, the negative binomial-based zero-augmented models fitted better than their Poisson-based counterparts or, in rare cases, were indistinguishable from them. However, a specific family of zero-augmented model is recommended for each region in Nigeria.

Introduction

Count events occur frequently in all disciplines. In demography, count data such as the number of children ever born, the number of deaths and the number of migrations have previously been modelled by Poisson regression [1]. One of the important assumptions guiding the use of the Poisson distribution is the equality of the mean and variance, which may not hold in practice. If this assumption is violated, the estimation method will produce biased estimates, inefficient standard errors, and misleading confidence intervals and p-values [2]. Because of this limitation, researchers have recommended the negative binomial distribution, which has an additional parameter that accounts for the over-dispersion commonly seen in count outcomes, thus relaxing the constraint of equal mean and variance [3].

Researchers have also argued that count events are often characterized by a large number of zeros [4,5,6,7], which makes modelling count data with either the Poisson or the negative binomial model inappropriate. Although the Poisson and negative binomial distributions allow for zero counts, data may contain more zero responses than either distribution implies; this violation of the distributional assumptions is often referred to as the excess zero problem. Several studies have modelled fertility experience based on the distribution of the fertility pattern in different countries [3, 8,9,10,11,12,13,14] with a view to identifying factors influencing fertility. In Nigeria, the determinants of fertility have been examined using Poisson regression to account for the count nature of the variable [9, 11] and negative binomial regression to account for over-dispersion or heterogeneity [3, 8]. Aside from the limitations of the Poisson and negative binomial models for fertility data in Nigeria, the analysis is often conducted at the national level, thus neglecting some of the consequences of cultural diversity at the regional level.

Nigeria has six regions defined by sociocultural differences that have implications for fertility. Striking variation exists in fertility across these regions, with the total fertility rate (TFR) ranging from 4.3 in South South to 6.7 in North West [15]. Nigeria is the most populous country in Africa, with a population of about 200 million; the population of each of its six regions exceeds that of some countries such as Togo, Republic of Benin, Liberia and Malawi, to mention a few [16]. Thus, modelling fertility data at the national level with a single model is likely to be fraught with hidden errors, because the number of zeros and the level of skewness differ across regional data structures. Different models may therefore be suitable for fertility in different regions. The current study extends [7] by modelling fertility data in each region of Nigeria with six different distributions and evaluating the suitability of each model in each region.

Main Text

Methods

Data collection and utilization

The 2013 Nigeria Demographic and Health Survey (NDHS) dataset was used to fit the models. Data collection involved a multi-stage cluster sampling technique. Prior to the survey, Nigeria was demarcated into smaller units called enumeration areas (EAs), which served as clusters. This demarcation took state boundaries into consideration to prevent merging of clusters within states. The respondents were selected from each cluster based on rural–urban allocation of specific numbers of clusters in the country. The current study used the individual recode data, which contain the information provided by women of childbearing age (15–49 years). Further information about the sampling strategy used for data collection can be found on the data originator’s website [15].

Data management

The outcome variable of interest was fertility, measured by the number of children ever born (CEB) and obtained from a total sample of 38,948 women. The data were weighted and the clustering effect was adjusted for in the various count models, but the skewness test and the descriptive summaries of children were unweighted (Additional file 1). To examine the correlation between CEB and the background characteristics of women, a pairwise correlation test with Bonferroni correction [17] was conducted for each region. Twelve variables were used for the model fit: residence, women's educational level, religion, ethnicity, wealth index, contraceptive use, currently residing with partner, number of other wives, age at first sex, husband's educational level, women's working status and husband's/partner's age. All of these independent variables were retained for North Central and North West. For South East, South South and South West, residing with partner, number of other wives and partner's education were removed; for North East, women's working status was additionally excluded because of collinearity. All analyses were performed using Stata 15.0 at the 0.05 level of significance.
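The Bonferroni step can be illustrated with a minimal Python sketch. It assumes one test per candidate variable (12 tests at an overall level of 0.05); the function names and the raw p-values are hypothetical, and the study itself ran the analysis in Stata, so this is only an illustration of the arithmetic.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Assumption: 12 correlation tests, one per candidate variable.
threshold = bonferroni_threshold(0.05, 12)

# Hypothetical raw p-values, for illustration only.
adjusted = bonferroni_adjust([0.001, 0.02, 0.5])

print(f"per-test threshold: {threshold:.5f}")
print("adjusted p-values:", adjusted)
```

A variable would then be judged significantly correlated with CEB only if its raw p-value falls below the per-test threshold (equivalently, if its adjusted p-value falls below 0.05).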

Generalized linear models

Poisson model

The most common technique used to model count data is Poisson regression. Its defining feature is the equality of the mean and variance. Its probability mass function is given as:

$$\Pr\left(Y = y_{i} \mid \mu\right) = \frac{e^{-\mu}\mu^{y_{i}}}{y_{i}!}, \quad y_{i} = 0, 1, 2, \ldots$$
(1)

where \(y_{i}\) denotes the count response, that is, the number of children ever born, and \(\mu\) is its mean [18, 19].
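To see why the Poisson model struggles with these data, the following Python sketch compares the share of zeros that a Poisson distribution implies with the share observed. It uses the North West figures reported in this article (mean CEB of 3.89, 21.3% zeros) and deliberately ignores covariates and survey weights, so it is a rough, intercept-only illustration rather than the fitted model.

```python
import math


def poisson_pmf(y: int, mu: float) -> float:
    """Poisson probability mass function, computed on the log scale."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))


# Figures reported for North West in this article.
mean_ceb_north_west = 3.89
observed_zero_share = 0.213

# Share of zeros implied by a Poisson distribution with the same mean.
expected_zero_share = poisson_pmf(0, mean_ceb_north_west)

print(f"Poisson-implied zeros: {expected_zero_share:.3f}")
print(f"Observed zeros:        {observed_zero_share:.3f}")
```

Under these assumptions the Poisson distribution implies roughly 2% zeros, about a tenth of the observed share, which is the excess zero problem in miniature.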

Negative binomial model

The negative binomial (NB) distribution is a two-parameter distribution combining the Poisson distribution and the Gamma distribution (Gamma–Poisson mixture). It relaxes the assumption of equality of mean and variance, thus accounting for unobserved heterogeneity in count data [19,20,21,22]. Its probability mass function is given as:

$$\Pr\left(y_{i} \mid \mu, \alpha\right) = \frac{\Gamma\left(\alpha^{-1} + y_{i}\right)}{\Gamma\left(\alpha^{-1}\right)\Gamma\left(y_{i} + 1\right)}\left(\frac{\alpha^{-1}}{\alpha^{-1} + \mu}\right)^{\alpha^{-1}}\left(\frac{\mu}{\mu + \alpha^{-1}}\right)^{y_{i}}$$
(2)

The mean and variance of the negative binomial distribution are E[y|µ, α] = µ and V[y|µ, α] = µ(1 + αµ), where α > 0 is the dispersion parameter and µ > 0. Special cases of the negative binomial include the Poisson (the limit α → 0) and the geometric (α = 1) [19].
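The negative binomial mass function and its moments can be checked numerically with a short Python sketch. The dispersion value used here is a crude method-of-moments figure, α = (s² − m)/m², computed from the North West mean and standard deviation reported in this article (3.89 and 3.36); it is an assumption for illustration, not the estimate from the fitted regression.

```python
import math


def nb_pmf(y: int, mu: float, alpha: float) -> float:
    """Negative binomial pmf in the (mu, alpha) parameterisation of Eq. (2)."""
    inv_alpha = 1.0 / alpha
    log_prob = (
        math.lgamma(inv_alpha + y)
        - math.lgamma(inv_alpha)
        - math.lgamma(y + 1)
        + inv_alpha * math.log(inv_alpha / (inv_alpha + mu))
        + y * math.log(mu / (mu + inv_alpha))
    )
    return math.exp(log_prob)


# Reported North West summary statistics (mean and SD of CEB).
mean_ceb, sd_ceb = 3.89, 3.36

# Illustrative method-of-moments dispersion: solve s^2 = m (1 + alpha m).
alpha_mom = (sd_ceb ** 2 - mean_ceb) / mean_ceb ** 2

# Truncate the infinite support; the tail beyond 500 is negligible here.
probs = [nb_pmf(y, mean_ceb, alpha_mom) for y in range(500)]
total = sum(probs)
nb_mean = sum(y * p for y, p in enumerate(probs))
nb_var = sum((y - nb_mean) ** 2 * p for y, p in enumerate(probs))

print(f"alpha (moments): {alpha_mom:.3f}")
print(f"sum={total:.6f} mean={nb_mean:.3f} variance={nb_var:.3f}")
```

The numerical mean and variance reproduce µ and µ(1 + αµ), which is the sense in which the extra parameter relaxes the Poisson equality of mean and variance.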

Zero-inflated models

In the zero-inflated Poisson (ZIP) model, the first process is a Poisson distribution that generates counts, some of which may be zero (sampling zeros), and the second process is governed by a binary distribution (logit or probit) that generates structural zeros [23]. Given a variable \(y_{i}\), the ZIP probability mass function has two components:

$$\Pr\left(y_{i} \mid \mu_{i}\right) = \begin{cases} p_{i} + \left(1 - p_{i}\right)\exp\left(-\mu_{i}\right), & y_{i} = 0 \\ \left(1 - p_{i}\right)\dfrac{\exp\left(-\mu_{i}\right)\mu_{i}^{y_{i}}}{y_{i}!}, & y_{i} \ge 1 \end{cases} \qquad 0 \le p_{i} \le 1$$
(3)

The outcome variable \(y_{i}\) is a non-negative integer, \(\mu_{i}\) is the expected Poisson count for the ith individual, and \(p_{i}\) is the probability of an extra (structural) zero.
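A minimal Python sketch of the ZIP mass function shows how the zero probability splits into a structural part and a sampling part. The parameter values (µ = 3.0, p = 0.25) are hypothetical and chosen only for illustration.

```python
import math


def zip_pmf(y: int, mu: float, p: float) -> float:
    """Zero-inflated Poisson pmf: structural zeros with probability p."""
    poisson = math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))
    if y == 0:
        # Structural zero, or a sampling zero from the Poisson process.
        return p + (1.0 - p) * poisson
    return (1.0 - p) * poisson


# Hypothetical parameter values, for illustration only.
mu, p = 3.0, 0.25

structural_zeros = p
sampling_zeros = (1.0 - p) * math.exp(-mu)
zip_total = sum(zip_pmf(y, mu, p) for y in range(200))

print(f"P(y=0) = {zip_pmf(0, mu, p):.4f} "
      f"(structural {structural_zeros:.4f} + sampling {sampling_zeros:.4f})")
print(f"probabilities sum to {zip_total:.6f}")
```

Because both processes can produce a zero, a zero-inflated model can only add zeros relative to the base count distribution; it cannot represent fewer zeros than the Poisson implies.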

Like the ZIP, the zero-inflated negative binomial (ZINB) model accounts for excess zeros, and it additionally accounts for over-dispersion. For a dependent variable \(y_{i}\) with many zeros, the ZINB probability mass function is given as:

$$\Pr\left(y_{i} \mid \mu_{i}, \alpha\right) = \begin{cases} p_{i} + \left(1 - p_{i}\right)\left(1 + \alpha\mu_{i}\right)^{-\alpha^{-1}}, & y_{i} = 0 \\ \left(1 - p_{i}\right)\dfrac{\Gamma\left(y_{i} + \frac{1}{\alpha}\right)\left(\alpha\mu_{i}\right)^{y_{i}}}{y_{i}!\,\Gamma\left(\frac{1}{\alpha}\right)\left(1 + \alpha\mu_{i}\right)^{y_{i} + \frac{1}{\alpha}}}, & y_{i} > 0 \end{cases} \qquad 0 < p_{i} < 1$$
(4)

where α ≥ 0 is an over-dispersion parameter [22].
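As a supplementary note, the first two moments implied by the ZINB model can be written out explicitly. These are standard results for a mixture of a point mass at zero (probability \(p_{i}\)) and a negative binomial with mean \(\mu_{i}\) and variance \(\mu_{i}(1 + \alpha\mu_{i})\); they are not stated in the original text and are included only to show how zero inflation and over-dispersion both enlarge the variance.

$$E\left[y_{i}\right] = \left(1 - p_{i}\right)\mu_{i}$$

$$E\left[y_{i}^{2}\right] = \left(1 - p_{i}\right)\left[\mu_{i}\left(1 + \alpha\mu_{i}\right) + \mu_{i}^{2}\right]$$

$$V\left[y_{i}\right] = E\left[y_{i}^{2}\right] - \left(E\left[y_{i}\right]\right)^{2} = \left(1 - p_{i}\right)\mu_{i}\left[1 + \mu_{i}\left(p_{i} + \alpha\right)\right]$$

Setting α = 0 gives the ZIP variance \(\left(1 - p_{i}\right)\mu_{i}\left(1 + p_{i}\mu_{i}\right)\), so zero inflation alone already induces over-dispersion relative to the mean, and the ZINB adds a further term through α.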

Hurdle models

In the hurdle Poisson (HP) model, the first part is the hurdle at zero, which accommodates fewer or more zeros than the Poisson model implies, and the second part governs the positive outcomes through a zero-truncated distribution [2, 19, 23]. Given a variable \(y_{i}\), the HP probability distribution is given as:

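$$\Pr\left(y_{i} = 0\right) = 1 - p, \quad 0 \le p \le 1$$
$$\Pr\left(Y = y_{i}\right) = \frac{p}{1 - \exp\left(-\mu_{i}\right)}\,\frac{\exp\left(-\mu_{i}\right)\mu_{i}^{y_{i}}}{y_{i}!}, \quad \mu_{i} > 0;\quad y_{i} = 1, 2, \ldots$$
(5)

where \(\mu_{i}\) is the mean of the untruncated Poisson model and \(p\) is the probability of a positive count. The factor \(1 - \exp\left(-\mu_{i}\right)\) renormalises the Poisson probabilities over the positive integers so that the distribution sums to one. When \(\left(1 - p\right) > \exp\left(-\mu_{i}\right)\), the data contain more zeros than the Poisson model implies.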
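The hurdle construction can be sketched in Python as follows. The sketch assumes that positive counts follow a zero-truncated Poisson, that is, Poisson probabilities divided by 1 − exp(−µ) so that the whole distribution sums to one. The hurdle probability is set from the North West zero share reported in this article (21.3%), and µ = 3.89 is used purely as an illustrative value.

```python
import math


def hurdle_poisson_pmf(y: int, mu: float, p: float) -> float:
    """Hurdle Poisson pmf; p is the probability of crossing the hurdle (y > 0)."""
    if y == 0:
        return 1.0 - p
    log_poisson = -mu + y * math.log(mu) - math.lgamma(y + 1)
    # -expm1(-mu) equals 1 - exp(-mu), the Poisson mass on positive counts.
    return p * math.exp(log_poisson) / -math.expm1(-mu)


# Illustrative values: zero share from North West, mu chosen for illustration.
mu, p = 3.89, 1.0 - 0.213

hp_total = sum(hurdle_poisson_pmf(y, mu, p) for y in range(200))
more_zeros_than_poisson = (1.0 - p) > math.exp(-mu)

print(f"probabilities sum to {hp_total:.6f}")
print(f"more zeros than the Poisson implies: {more_zeros_than_poisson}")
```

Unlike the zero-inflated models, all zeros here come from a single process, so the same structure can also describe data with fewer zeros than the Poisson implies (when 1 − p is below exp(−µ)).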

The hurdle negative binomial (HNB) is used when the hurdle model is appropriate and the data exhibit over-dispersion [19, 24]. The HNB model is given as:

$$\Pr\left(y = 0\right) = 1 - p, \quad 0 \le p \le 1$$
$$\Pr\left(Y = y\right) = \frac{p}{1 - \left(\frac{r}{\mu + r}\right)^{r}}\,\frac{\Gamma\left(y + r\right)}{\Gamma\left(r\right)y!}\left[\frac{\mu}{\mu + r}\right]^{y}\left[\frac{r}{\mu + r}\right]^{r}, \quad r, \mu > 0;\quad y = 1, 2, \ldots$$
(6)

The mean and variance of the underlying (untruncated) negative binomial distribution are µ and µ(1 + µ/r), respectively, where r = α⁻¹ in the notation of Eq. (2); the factor (1 + µ/r) measures the over-dispersion relative to the Poisson [22].
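A short Python sketch of the HNB mass function in Eq. (6) is given below. The parameter values are hypothetical (µ = 3.89, r = 2, p = 0.787), and the check at the end uses the standard result that the mean of a hurdle model equals p times the mean of the zero-truncated count distribution.

```python
import math


def nb_log_pmf(y: int, mu: float, r: float) -> float:
    """Log pmf of the untruncated negative binomial with mean mu and size r."""
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + y * math.log(mu / (mu + r))
        + r * math.log(r / (mu + r))
    )


def hurdle_nb_pmf(y: int, mu: float, r: float, p: float) -> float:
    """Hurdle negative binomial pmf; p is the probability of a positive count."""
    if y == 0:
        return 1.0 - p
    nb_zero = (r / (mu + r)) ** r  # zero probability of the untruncated NB
    return p * math.exp(nb_log_pmf(y, mu, r)) / (1.0 - nb_zero)


# Hypothetical parameter values, for illustration only.
mu, r, p = 3.89, 2.0, 0.787

probs = [hurdle_nb_pmf(y, mu, r, p) for y in range(500)]
hnb_total = sum(probs)
hurdle_mean = sum(y * q for y, q in enumerate(probs))
expected_mean = p * mu / (1.0 - (r / (mu + r)) ** r)

print(f"sum={hnb_total:.6f} mean={hurdle_mean:.4f} (closed form {expected_mean:.4f})")
```

The closed-form mean shows that µ is the mean of the untruncated negative binomial rather than of the hurdle distribution itself, since the hurdle mean also depends on p and on the truncation.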

Model assessment and evaluation

Model selection was based on the maximum likelihood estimates of the model parameters, using the log-likelihood and two information criteria (IC): Akaike (AIC) and Bayesian (BIC). A lower IC value implies a better-fitting model [25, 26]. A difference in IC values greater than 10 implies that the model with the smaller IC is superior, a difference of 4 to 10 suggests moderate superiority of one model over the other, and a difference of less than 4 implies that the competing models are indistinguishable [26].
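The decision rule above can be sketched in Python. The formulas AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L are the standard definitions; the log-likelihoods, parameter counts and sample size in the example are hypothetical and are not taken from Table 2.

```python
import math


def aic(log_lik: float, k: int) -> float:
    """Akaike Information Criterion for k estimated parameters."""
    return 2 * k - 2 * log_lik


def bic(log_lik: float, k: int, n: int) -> float:
    """Bayesian Information Criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * log_lik


def compare_ic(ic_a: float, ic_b: float) -> str:
    """Apply the >10 / 4 to 10 / <4 rule to two IC values."""
    diff = abs(ic_a - ic_b)
    if diff < 4:
        return "indistinguishable"
    better = "A" if ic_a < ic_b else "B"
    if diff > 10:
        return f"model {better} superior"
    return f"model {better} moderately superior"


# Hypothetical fits: model A has more parameters and a slightly higher likelihood.
n = 500
aic_a, aic_b = aic(-1000.0, 10), aic(-1003.0, 8)
bic_a, bic_b = bic(-1000.0, 10, n), bic(-1003.0, 8, n)

print("AIC verdict:", compare_ic(aic_a, aic_b))
print("BIC verdict:", compare_ic(bic_a, bic_b))
```

In this made-up example the AIC treats the two models as indistinguishable while the BIC moderately favours the smaller model, because the BIC penalty per parameter (ln n) exceeds the AIC penalty (2) whenever n is larger than about 7. The same mechanism can make the two criteria disagree on real data.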

Results

Socio-economic and demographic characteristics of respondents

In Nigeria, 29.5% of women aged 15 to 49 years had no child; this percentage was highest in South South (42.4) and lowest in North West (21.3) (Fig. 1). The mean number of children ever born was highest in North West (3.89 ± 3.36) and lowest in South South (2.32 ± 2.58). As presented in Table 1, the age at first sex was lower in the Northern regions than in the Southern regions, namely South East (18.96 ± 4.35), South West (18.69 ± 3.6) and South South (17.27 ± 3.22), except for North Central (18.06 ± 3.78). More women with no education were recorded in the Northern regions, and women's wealth quintiles were higher in the Southern regions than in the Northern regions. About 16% of women in Nigeria used any method of contraception, and this varied across regions.

Table 1 Descriptive statistics of background characteristics by region

Fig. 1 Percentage distribution of zero and non-zero count of children ever born by region (NDHS 2013)

Model selection criteria for the fitted model

The model assessments for each region are presented in Table 2, using the AIC and BIC values as the basis for evaluation. The hurdle negative binomial model was the best fit for North West (AIC = 45,421.19, BIC = 45,775.64) and South East (AIC = 13,767.37, BIC = 14,026.82), while the zero-inflated negative binomial provided a better fit for North East (AIC = 24,565.28, BIC = 24,828.33). The zero-inflated negative binomial had a moderate superiority over the hurdle negative binomial in South South (AIC = 16,138.5, BIC = 16,411.23). For the South West region, both AIC and BIC suggest that ZINB and ZIP are indistinguishable as the best fit (\(ZINB \le ZIP < HNB \le HP < NB < Poisson\)), and no superiority exists between the zero-inflated models and their hurdle model analogs. In all cases, the zero-modified models were better than the GLMs, except for North Central, where the BIC suggests that NB is the best fit (\(NB < HNB < ZINB < HP < ZIP < Poisson\)), contrary to the AIC and the log-likelihood (\(HNB < ZINB < HP < ZIP < NB < Poisson\)). Similarly, the models that include an over-dispersion parameter were better than the corresponding models that do not.

Table 2 Model assessment for alternative models

Discussion

This study examined the effectiveness of zero-augmented models compared to the standard Poisson and negative binomial models widely used for modelling fertility in Nigeria [3, 9, 11]. The current analysis was conducted separately in each of the six regions in Nigeria.

The results, using the AIC and BIC as model selection criteria, revealed that both the hurdle negative binomial and the zero-inflated negative binomial provide a better fit for fertility data with a large number of zeros and over-dispersion. Overall, the AIC and BIC values showed that the negative binomial-based zero-augmented models (HNB and ZINB) fitted better than their Poisson-based counterparts or, in rare cases, were indistinguishable from them. Consequently, accounting for both excess zeros and over-dispersion is recommended for fertility modelling, not only at the national level but also at the regional level. These findings are similar to those of other studies with a similar data-generating mechanism containing a large number of zeros [24, 27, 28]. Previous studies have noted that zero-inflated models are statistically appropriate in studies of low-fertility populations, especially when there is a large number of women with no children [13, 29].

The adjudged best model for each region was used to identify the determinants of fertility peculiar to that region. For North Central, women having at least secondary education, partners having secondary education and women not working are factors driving low fertility. Secondary education, Igbo ethnicity and higher age at first sex are factors determining low fertility in the North East. Residing in rural areas, secondary education, tertiary education, poorer women compared to poor women, no other wives, higher age at first sex and women not working are factors determining low fertility in the North West. Urban residence, women not working and increasing women's educational level are factors responsible for low fertility in the South East. Increasing level of women's education, wealth index, higher age at first sex and women not working are drivers of low fertility in South South. Secondary and higher education, urban residence and women not working are factors contributing to low fertility in the South West (Additional file 2).

In conclusion, the assessment in this paper provides evidence that fertility count data, which are usually right-skewed with excess zeros, should be modelled using zero-augmented models of the negative binomial variant.

Limitation

Children ever born (CEB) was captured in the NDHS based on the reported full birth history of women of reproductive age. CEB is likely to be grossly under-reported because of cultural beliefs and norms around reporting the actual number of births.

Availability of data and materials

This study used a secondary dataset from the Measure DHS program. The dataset can be accessed with permission from the DHS program archive and downloaded at https://dhsprogram.com/data/dataset/Nigeria_Standard-DHS_2013.cfm?flag=0.

Abbreviations

AIC: Akaike Information Criterion

BIC: Bayesian Information Criterion

CEB: children ever born

DF: degree of freedom

EAs: enumeration areas

HNB: hurdle negative binomial

HP: hurdle Poisson

IC: information criterion

NB: negative binomial

SD: standard deviation

TFR: total fertility rate

ZIP: zero-inflated Poisson

ZINB: zero-inflated negative binomial

References

  1. Hilbe JM. Modeling count data. In: Lovric M, editor. International Encyclopedia of Statistical Science. Berlin: Springer; 2011.

  2. Winkelmann R, Zimmermann KF. Recent developments in count data modelling: theory and application. J Econ Surv. 1995;9(1):1–24.

  3. Alaba OO, Olubusoye OE, Olaomi JO. Spatial patterns and determinants of fertility levels among women of childbearing age in Nigeria. South Afr Fam Pract. 2017;59(4):143–7.

  4. Hur K, Hedeker D, Henderson W, Khuri S, Daley J. Modeling clustered count data with excess zeros in health care outcomes research. Heal Serv Outcomes Res Methodol. 2002;3(1):5–20.

  5. Yusuf OB, Afolabi RF, Ayoola AS. Modelling excess zeros in count data with application to antenatal care utilisation. Int J Stat Probab. 2018;7(3):22.

  6. Samsudin S, Moffatt PG. Modelling count data with excess zeros: an application to health care utilisation data. Malaysian J Econ Stud. 2014;51(2):201–15.

  7. Kareem YO, Yusuf OB. Statistical modeling of fertility experience among women of reproductive age in Nigeria. J Stat Appl. 2018;8(1):23–33.

  8. Adebowale AS. Ethnic disparities in fertility and its determinants in Nigeria. Fertil Res Pract. 2019;5(3):1–16.

  9. Akpa OM, Ikpotokin O. Modeling the determinants of fertility among women of childbearing age in Nigeria: analysis using generalized linear modeling approach. Int J Humanit Soc Sci. 2012;2(18):7–11.

  10. Dana DD. Binary logistic regression analysis of identifying demographic, socioeconomic, and cultural factors that affect fertility among women of child bearing age in Ethiopia. Sci J Appl Math Stat. 2019;6(3):65.

  11. Fagbamigbe AF, Adebowale AS. Current and predicted fertility using Poisson regression model: evidence from 2008 Nigerian demographic health survey. Afr J Reprod Heal. 2014;18(1):71–83.

  12. Pandey R, Kaur C. Modelling fertility: an application of count regression models. Chin J Popul Resour Environ. 2015;13(4):349–57.

  13. Poston DLJ, McKibben SL. Using zero-inflated count regression models to estimate the fertility of U.S. women. J Mod Appl Stat Methods. 2003;2(2):10.

  14. Silva JMCS, Covas F. A modified hurdle model for completed fertility. J Popul Econ. 2000;13:173–88.

  15. National Population Commission (NPC) [Nigeria] and ICF International. Nigeria Demographic and Health Survey 2013. Abuja, Nigeria, and Rockville, Maryland: NPC and ICF International; 2014.

  16. United Nations. World Population Prospects: The 2012 Revision. New York: Population Division; 2013.

  17. Armstrong RA. When to use the Bonferroni correction. Ophthalmic Physiol Opt. 2014;34(5):502–8.

  18. Rodriguez G. Poisson Models for Count Data, Chapter 4. 2007. p. 1–50. http://data.princeton.edu/wws509/notes/c4.pdf. Accessed 30 Dec 2017.

  19. Cameron AC, Trivedi PK. Essentials of Count Data Regression (Chapter 15). A Companion to Theoretical Econometrics. Malden: Blackwell Publishing Ltd.; 1999.

  20. Reese R. The Poisson and negative binomial distributions. 2016. Available from: http://stats.stackexchange.com/questions/37814/poisson-is-to-exponential-as-gamma-poisson-is-to-what and http://math.usu.edu/jrstevens/biostat/PoissonNB.pdf

  21. Baum CF. Models for count data and categorical response data. Adelaide: Boston College and DIW Berlin, University of Adelaide; 2010.

  22. Yesilova A, Kaydan MB, Kaya Y. Modeling insect-egg data with excess zeros using zero-inflated regression models. Hacettepe J Math Stat. 2010;39(2):273–82.

  23. Lam KF, Xue H, Cheung YB. Semiparametric analysis of zero-inflated count data. Biometrics. 2006;62:996–1003.

  24. Chipeta MG, Ngwira BM, Simoonga C, Kazembe LN. Zero adjusted models with applications to analysing helminths count data. BMC Res Notes. 2014;7:856.

  25. Vrieze SI. Model selection and psychological theory: a discussion of the differences between the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Psychol Methods. 2012;17(2):228–43. https://doi.org/10.1037/a0027127.

  26. Pan W. Akaike’s information criterion in generalized estimating equations. Biometrics. 2001;57(1):120–5.

  27. Hu M-C, Pavlicova M, Nunes EV. Zero-inflated and hurdle models of count data with extra zeros: examples from an HIV-risk reduction intervention trial. Am J Drug Alcohol Abuse. 2011;37(5):367–75.

  28. Desjardins CD. Modeling zero-inflated and overdispersed count data: an empirical study of school suspensions. J Exp Educ. 2016;84(3):449–72.

  29. Melkersson M, Rooth D-O. Modeling female fertility using inflated count data models. J Popul Econ. 2000;13(2):189–203. http://www.jstor.org/stable/20007710.

Acknowledgements

The authors acknowledge the kind permission of Measure Demographic and Health Survey to use the data for this study.

Funding

This research received no grant from any funding agency in public, commercial or not-for-profit sectors.

Author information

Contributions

YOK conceived the original idea of the study, designed the study, analyzed the data and drafted the manuscript. IMB contributed to the design of the study, interpretation of findings and revision of the manuscript. ASA contributed to the conception of the study, interpretation, and revision of the manuscript. JOA contributed to statistical analysis, interpretation and revision of the manuscript. OBY contributed to the conception and design of the study, statistical analysis and revision of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Yusuf Olushola Kareem.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1.

Summary statistics of children ever born by region.

Additional file 2.

Determinant of fertility by regions based on the adjudged best model.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Kareem, Y.O., Morhason-Bello, I.O., Adebowale, A.S. et al. Robustness of zero-augmented models over generalized linear models in analysing fertility data in Nigeria. BMC Res Notes 12, 815 (2019). https://doi.org/10.1186/s13104-019-4852-5

DOI: https://doi.org/10.1186/s13104-019-4852-5
