Regression analysis on under-five child mortality

What is this about?

The composition of child mortality has changed over time. In particular, a larger proportion of deaths are now classed as neonatal, meaning deaths that occur within one month of birth.

This analysis builds a linear regression model to estimate the average neonatal mortality rate (NMR).

Why am I doing this project? (Motivation)

Through this project, I would like to apply concepts from statistics and machine learning (e.g., B-splines, regression, cross-validation, prediction intervals) and give my own interpretation of the results.

Why this topic?

Life expectancy data from the UN has been widely analysed, but the under-five mortality rate (U5MR, the number of children under 5 who die per 1000 live births) has received less attention.

One of this century’s global goals, as defined by the United Nations, is the reduction of childhood mortality across all countries, which makes U5MR a natural topic to study.

Description of the dataset and Questions of the analysis

In this analysis, we look at the data on neonatal mortality published by the UN Inter-agency Group for Child Mortality Estimation (IGME), available as neonatal_mortality.csv.

The data contains the following columns:

country_name: The (short) name of the country

year: The year the data was measured

region: The name of one of seven large global regions

nmr: The observed number of neonatal deaths per thousand live births (the neonatal mortality rate). This is measured either using a country’s vital registration system (births and deaths register) or using some sort of high-quality survey.

u5mr: The estimated under-five mortality rate.

Note on Neonatal mortality rate

Neonatal mortality rate is a strictly positive variable, so we model it on a transformed scale (otherwise there is very little hope of Gaussian residuals). Alexander and Alkema (2018) recommend instead modelling the log ratio

$$\log\left(\frac{\mathrm{NMR}}{\mathrm{U5MR} - \mathrm{NMR}}\right)$$

which is the log of the number of neonatal deaths per 1000 live births divided by the number of non-neonatal deaths per 1000 live births.
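As a minimal sketch of this transformation and its inverse, the snippet below assumes a data frame with nmr and u5mr columns as described above. The numbers are invented for illustration and the helper names to_log_ratio and from_log_ratio are hypothetical, not taken from the report. It also assumes nmr is strictly smaller than u5mr, otherwise the log is undefined.

```r
# Toy values, invented purely for illustration (not IGME data).
dat <- data.frame(
  u5mr = c(120, 60, 25, 8),
  nmr  = c(40, 25, 13, 5)
)

# Hypothetical helper: log of neonatal deaths over non-neonatal
# under-five deaths, both per 1000 live births.
to_log_ratio <- function(nmr, u5mr) {
  log(nmr / (u5mr - nmr))
}

# Hypothetical inverse: solving r = log(nmr / (u5mr - nmr)) for nmr
# gives nmr = u5mr * exp(r) / (1 + exp(r)) = u5mr * plogis(r).
from_log_ratio <- function(log_ratio, u5mr) {
  u5mr * plogis(log_ratio)
}

dat$log_ratio <- to_log_ratio(dat$nmr, dat$u5mr)
dat$nmr_back  <- from_log_ratio(dat$log_ratio, dat$u5mr)
print(dat)
```

The inverse is what lets predictions made on the transformed scale be reported as NMR on its natural scale.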

Task 1: Estimating neonatal mortality

One of this century’s global goals has been the reduction of childhood mortality across all countries. There has been enormous effort put into this goal at all levels, from the United Nations down to local interventions.

While there is still a lot of variation between countries, the rate of childhood mortality (measured as the number of deaths of children under five years old per 1000 live births) has decreased over the decades.

The following plot shows the under-five mortality rate (U5MR, the number of children under 5 who die per 1000 live births) estimates published by the UN Inter-agency Group for Child Mortality Estimation (IGME).

As under-5 mortality has decreased, the composition of child mortality has changed over time. In particular a larger proportion of deaths are classed as neonatal mortality, which is a death that occurs within one month of birth.

Our task is to produce a linear regression model to estimate the average neonatal mortality rate (NMR).

Actions we did:

Explain the choice of variables in our model (we didn’t use country_name, but we did use U5MR on the log scale). In particular, we consider whether an interaction effect should be used.

(An interaction effect between a (possibly continuous) covariate x and a discrete covariate u can be fit using lm(y ~ x * u). This models the main effects of x and u, plus a separate slope of x for each level of u. See the relevant chapter of Statistical Thinking for the 21st Century.)

Assess our linear model and comment on its fit.

This should be done a) for all data simultaneously; b) for data in each region; and c) for data in a maximum of 3 countries that should be chosen to highlight different aspects of the fit diagnostics.

Estimate the root mean square error and the mean absolute error on a test set. The test set should be produced using the argument strata = region.

Produce a prediction, with prediction intervals, of the NMR on its natural scale (aka not on the log-scale) and plot these a) for all data simultaneously; b) for data in each region; and c) for data in a maximum of 3 countries that show different aspects of the fit.
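The steps above can be sketched end to end on synthetic data. This is an illustration under stated assumptions, not the model from the report: the regions A, B, C and all coefficients are made up, the response is the log ratio of neonatal to non-neonatal deaths, and the errors are computed on the natural NMR scale (the text does not say which scale was used). The strata = region argument mentioned above matches rsample::initial_split(); to stay dependency-free, the sketch does the stratified split by hand in base R.

```r
set.seed(1)

# Synthetic stand-in for neonatal_mortality.csv (all values invented):
# the response is linear in log(U5MR) with a region-specific slope.
n <- 300
dat <- data.frame(
  region = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  u5mr   = exp(runif(n, log(5), log(200)))
)
slope <- c(A = -0.4, B = -0.6, C = -0.8)[as.character(dat$region)]
dat$log_ratio <- 1.5 + slope * log(dat$u5mr) + rnorm(n, sd = 0.2)
dat$nmr <- dat$u5mr * plogis(dat$log_ratio)  # deaths per 1000 live births

# Stratified 75/25 split by hand: sample 75% of the rows within each region.
train_idx <- unlist(lapply(split(seq_len(n), dat$region), function(idx) {
  idx[sample.int(length(idx), size = floor(0.75 * length(idx)))]
}))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Interaction model: separate intercept and log(U5MR) slope per region.
fit <- lm(log_ratio ~ log(u5mr) * region, data = train)

# 95% prediction intervals on the transformed scale, mapped back to NMR.
# The back-transform is monotone, so interval endpoints carry across.
pred <- predict(fit, newdata = test, interval = "prediction", level = 0.95)
nmr_pred <- data.frame(
  fit = test$u5mr * plogis(pred[, "fit"]),
  lwr = test$u5mr * plogis(pred[, "lwr"]),
  upr = test$u5mr * plogis(pred[, "upr"])
)

# Test-set errors on the natural scale, plus empirical interval coverage.
rmse     <- sqrt(mean((test$nmr - nmr_pred$fit)^2))
mae      <- mean(abs(test$nmr - nmr_pred$fit))
coverage <- mean(test$nmr >= nmr_pred$lwr & test$nmr <= nmr_pred$upr)
print(c(rmse = rmse, mae = mae, coverage = coverage))
```

Splitting within each region keeps the regional proportions roughly equal in the training and test sets, so a small region cannot end up entirely in one of them.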

In order to improve our model fit, we also consider the non-linear effect of one variable. We will use B-splines, which are a better method than polynomial regression for this problem. A B-spline basis is a set of smooth functions, each of which is non-zero on only a small part of the domain, that are designed to add together to make smooth functions.

The following graph plots B-spline basis functions:
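A plot of this kind can be drawn with a few lines of base R. The sketch below is illustrative only: the domain [0, 1] and the choice of 8 basis functions are arbitrary assumptions, not values from the report.

```r
library(splines)  # ships with R; provides bs()

x <- seq(0, 1, length.out = 200)

# Cubic B-spline basis with 8 basis functions (df = 8 is an arbitrary choice).
# intercept = TRUE keeps the full basis, so the columns sum to 1 across x.
B <- bs(x, df = 8, intercept = TRUE)

# One curve per basis function; each is non-zero on only part of [0, 1].
matplot(x, B, type = "l", lty = 1,
        xlab = "x", ylab = "basis function value",
        main = "Cubic B-spline basis (df = 8)")
```

A fitted spline is a weighted sum of these curves, with the weights estimated by lm() like any other regression coefficients.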

B-splines can be added to a regression using the splines::bs() function. This function can be used in an lm formula like an ordinary variable in our tibble. An illustrative example, which fits an interaction model and uses 15 basis functions (df = 15), is below.


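library(splines)    # bs()
library(broom)      # augment()
library(tidyverse)  # tibble(), mutate(), ggplot2

# Simulated data: a tanh curve whose amplitude depends on the two-level factor z
dat <- tibble(
  x = seq(-5, 5, length.out = 500),
  z = sample(c(-2, 2), size = 500, replace = TRUE),
  y = (1 + z) * tanh(3 * x) + rnorm(500, sd = 0.3)
) %>%
  mutate(z = factor(z))

# B-spline in x (15 basis functions) interacted with z: one smooth curve per level
fit <- lm(y ~ bs(x, df = 15) * z, data = dat)

# Overlay the fitted curves on the data, one panel per level of z
fit %>%
  augment(data = dat) %>%
  ggplot(aes(x, y)) +
  geom_point() +
  geom_line(aes(x, .fitted), colour = "blue") +
  facet_wrap(~z)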

Modify the linear model we produced to incorporate an appropriate non-linear effect. We did the following:

Explain our choice of model, using appropriate visualisations to support our choice.

Use cross-validation to select an appropriate number of basis functions for bs()

For our final model, assess our linear model and comment on its fit. This should be done a) for all data simultaneously; b) for data in each region; and c) for data in a maximum of 3 countries that should be chosen to highlight different aspects of the fit diagnostics.

Estimate the root mean square error and the mean absolute error using the same test set as before.

Produce a prediction, with prediction intervals, of the NMR on its natural scale (aka not on the log-scale) and plot these a) for all data simultaneously; b) for data in each region; and c) for data in a maximum of 3 countries that show different aspects of the fit.

Write a paragraph or two describing the differences between the two models and explaining which we think is a more appropriate model of the data.
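The cross-validation step for choosing the number of basis functions can be sketched as follows. This is a hand-rolled 5-fold loop on synthetic data, not the procedure from the report: the signal, the grid of df values from 3 to 12, and the number of folds are all assumptions made for illustration.

```r
library(splines)
set.seed(42)

# Synthetic data with a smooth non-linear signal (not the mortality data).
n <- 400
dat <- data.frame(x = runif(n, -3, 3))
dat$y <- tanh(2 * dat$x) + rnorm(n, sd = 0.3)

# Fix the boundary knots from the full data so that held-out x values
# never fall outside the range the spline was built on.
bk <- range(dat$x)

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))  # random fold labels
df_grid <- 3:12

# For each candidate df, average the held-out squared error over the folds.
cv_rmse <- sapply(df_grid, function(d) {
  fold_mse <- sapply(seq_len(k), function(i) {
    train <- dat[fold != i, ]
    test  <- dat[fold == i, ]
    fit <- lm(y ~ bs(x, df = d, Boundary.knots = bk), data = train)
    mean((test$y - predict(fit, newdata = test))^2)
  })
  sqrt(mean(fold_mse))
})

best_df <- df_grid[which.min(cv_rmse)]
print(data.frame(df = df_grid, cv_rmse = round(cv_rmse, 4)))
print(best_df)
```

When several values of df give nearly the same cross-validated error, picking the smallest of them is a common way to favour the simpler model.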

Full version of the report

Note: the embedded PDF preview omits some pages. The complete report and its R Markdown source can be downloaded from the files below.

national_mortality_model.pdf (4015.8KB)

national_mortality_model.Rmd (28.2KB)