type
Post
status
Published
slug
mortality
summary
The composition of child mortality has changed over time. In particular a larger proportion of deaths are classed as neonatal mortality, which is a death that occurs within one month of birth.
category
Data Analytics
tags
Data Analysis
R
Data Modelling
Statistics
Predictive Analytics
date
Dec 11, 2022 09:45 AM
password
icon
Priority
Low
URL
national_mortality_model
jasonxsiu · Updated Oct 9, 2021

What is this about?

The composition of child mortality has changed over time. In particular, a larger proportion of deaths are now classed as neonatal, meaning deaths that occur within one month of birth.
This analysis builds a linear regression model to estimate the average neonatal mortality rate (NMR).

Why am I doing this project? (Motivation)

Through this project, I want to apply concepts from statistics and machine learning (e.g., B-splines, regression, cross-validation, prediction intervals) and give my own interpretation of the results.

Why this topic?

Life expectancy data from the UN has been widely analysed, but the under-five mortality rate (U5MR, the number of children under 5 who die per 1000 live births) has not.
One of this century’s global goals, as defined by the United Nations, has been the reduction of childhood mortality across all countries, which is why we are interested in this topic.

Description of the dataset and Questions of the analysis

In this analysis, we look at the data on neonatal mortality published by the IGME, available as neonatal_mortality.csv.
The data contains the following columns:
  • country_name: The (short) name of the country
  • year: The year the data was measured
  • region: The name of one of seven large global regions
  • nmr: The observed number of neonatal deaths per thousand live births (the neonatal mortality rate). This is measured either using a country’s vital registration system (births and deaths register) or using some sort of high-quality survey.
  • u5mr: The estimated under-five mortality rate.

Note on Neonatal mortality rate

Neonatal mortality rate is a strictly positive variable, so we model it on a transformed scale (otherwise there is very little hope of Gaussian residuals). Alexander and Alkema (2018) recommend instead modelling the quantity
$$\log\left(\frac{\mathrm{NMR}}{\mathrm{U5MR} - \mathrm{NMR}}\right)$$
which is the log of the number of neonatal deaths per 1000 live births divided by the number of non-neonatal deaths per 1000 live births.
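
As a minimal sketch of this transformation in R: the three rows below are made-up toy values, not taken from neonatal_mortality.csv, and the column names log_ratio and nmr_back are our own. The non-neonatal rate is taken to be u5mr minus nmr, and the back-transform uses the inverse logit, since NMR = U5MR × exp(r) / (1 + exp(r)) when r is the log ratio.

```r
# Toy values only (assumption); the real data would come from neonatal_mortality.csv.
dat <- data.frame(
  nmr  = c(5, 12, 30),   # neonatal deaths per 1000 live births
  u5mr = c(8, 30, 110)   # under-five deaths per 1000 live births
)

# Forward transform: log of neonatal deaths over non-neonatal deaths.
dat$log_ratio <- log(dat$nmr / (dat$u5mr - dat$nmr))

# Back-transform to the natural scale: plogis() is the inverse logit.
dat$nmr_back <- dat$u5mr * plogis(dat$log_ratio)

# The round trip should recover the original NMR values.
stopifnot(isTRUE(all.equal(dat$nmr, dat$nmr_back)))
```

The same back-transform is what lets predictions made on the transformed scale be reported as NMR values.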
 

Task 1: Estimating neonatal mortality

One of this century’s global goals has been the reduction of childhood mortality across all countries. There has been enormous effort put into this goal at all levels, from the United Nations down to local interventions.
While there is still a lot of variation between countries, the rate of childhood mortality (measured as the number of deaths of children under five years old per 1000 live births) has decreased over the decades.
The following plot shows the under-five mortality rate (U5MR, the number of children under 5 who die per 1000 live births) estimates published by the UN Inter-agency Group for Child Mortality Estimation (IGME).
As under-5 mortality has decreased, the composition of child mortality has changed over time. In particular, a larger proportion of deaths are classed as neonatal mortality, which is a death that occurs within one month of birth.
Our task is to produce a linear regression model to estimate the average neonatal mortality rate (NMR).

What we did:

  • Explain the choice of variables in our model (we did not use country_name, but did use U5MR on the log scale). In particular, we consider whether an interaction effect should be used.
    • (An interaction effect between a (possibly continuous) covariate x and a discrete covariate u can be fit using lm(y ~ x * u). This models the main effects of x and u, plus a separate slope of x for each level of u. See the relevant chapter in Statistical Thinking for the 21st Century.)
  • Assess our linear model and comment on its fit.
    • This should be done a) for all data simultaneously; b) for data in each region; and c) for data in a maximum of 3 countries that should be chosen to highlight different aspects of the fit diagnostics.
  • Estimate the root mean square error and the mean absolute error on a test set. The test set should be produced using the argument strata = region.
  • Produce a prediction, with prediction intervals, of the NMR on its natural scale (i.e., not on the log scale) and plot these a) for all data simultaneously; b) for data in each region; and c) for data in a maximum of 3 countries that show different aspects of the fit.
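
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the report: the data are simulated with the same column names as neonatal_mortality.csv, the 75/25 split proportion is our choice, the base-R stratified split stands in for a splitting function that takes strata = region (for example rsample::initial_split()), and the errors are computed on the natural NMR scale.

```r
set.seed(1)

# Simulated stand-in for neonatal_mortality.csv (made-up values, 3 regions).
n <- 600
dat <- data.frame(
  region = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  u5mr   = exp(runif(n, log(5), log(250)))
)
slope <- unname(c(A = -0.6, B = -0.4, C = -0.8)[as.character(dat$region)])
dat$log_ratio <- 1.5 + slope * log(dat$u5mr) + rnorm(n, sd = 0.3)
dat$nmr <- dat$u5mr * plogis(dat$log_ratio)

# Stratified split: hold out 25% of the rows within each region.
test_idx <- unlist(
  lapply(split(seq_len(n), dat$region),
         function(i) sample(i, size = round(0.25 * length(i)))),
  use.names = FALSE
)
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

# Interaction model: a separate slope of log(U5MR) for each region.
fit <- lm(log_ratio ~ log(u5mr) * region, data = train)

# 95% prediction intervals on the transformed scale.
pred <- predict(fit, newdata = test, interval = "prediction", level = 0.95)

# Back-transform to NMR. The map is increasing in the log ratio for a fixed
# U5MR, so interval endpoints map to interval endpoints.
to_nmr <- function(r, u5mr) u5mr * plogis(r)
test$nmr_fit <- to_nmr(pred[, "fit"], test$u5mr)
test$nmr_lwr <- to_nmr(pred[, "lwr"], test$u5mr)
test$nmr_upr <- to_nmr(pred[, "upr"], test$u5mr)

# Test-set errors on the natural scale, plus empirical interval coverage.
rmse <- sqrt(mean((test$nmr - test$nmr_fit)^2))
mae  <- mean(abs(test$nmr - test$nmr_fit))
coverage <- mean(test$nmr >= test$nmr_lwr & test$nmr <= test$nmr_upr)
```

Because the back-transform is non-linear, nmr_fit is the back-transformed point prediction rather than an estimate of the mean NMR; the interval endpoints are unaffected by this.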
To improve our model fit, we also consider a non-linear effect of one variable. We use B-splines, which are a better method than polynomial regression for this problem. B-spline basis functions are smooth functions, each non-zero on only a small part of the domain, that are designed to add together to make smooth functions.
The following graph plots B-spline basis functions:
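
A minimal sketch that evaluates and plots such a basis, assuming a grid on [0, 1] and 10 cubic basis functions (both arbitrary choices for illustration). With intercept = TRUE the full basis is kept, so the functions sum to one at every point, and each function is exactly zero over part of the domain.

```r
library(splines)

x <- seq(0, 1, length.out = 201)

# Cubic B-spline basis with 10 functions (6 interior knots at quantiles of x).
B <- bs(x, df = 10, intercept = TRUE)

# Drop the bs attributes to get a plain numeric matrix.
B_mat <- matrix(as.numeric(B), nrow = length(x))

# Partition of unity: the basis functions add up to one everywhere.
row_total <- rowSums(B_mat)

# Local support: fraction of the grid on which each function is exactly zero.
frac_zero <- colMeans(B_mat == 0)

# One curve per basis function.
matplot(x, B_mat, type = "l", lty = 1, xlab = "x", ylab = "basis value")
```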
B-splines can be added to a regression using the splines::bs() function. This function can be used in an lm formula like an ordinary variable in our tibble. An illustrative example, which fits an interaction model and uses 15 basis functions (df = 15), is below.
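    library(splines)
    library(broom)
    library(tidyverse)

    # Simulated data: a tanh curve whose amplitude depends on the two-level factor z
    dat <- tibble(
      x = seq(-5, 5, length.out = 500),
      z = sample(c(-2, 2), size = 500, replace = TRUE),
      y = (1 + z) * tanh(3 * x) + rnorm(500, sd = 0.3)
    ) %>%
      mutate(z = factor(z))

    # Interaction model: a separate B-spline curve in x for each level of z
    fit <- lm(y ~ bs(x, df = 15) * z, data = dat)

    # Plot the data with the fitted curve, one panel per level of z
    fit %>%
      augment(data = dat) %>%
      ggplot(aes(x, y)) +
      geom_point() +
      geom_line(aes(x, .fitted), colour = "blue") +
      facet_wrap(~z)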
We then modified the linear model we produced to incorporate an appropriate non-linear effect. We did the following:
  • Explain our choice of model, using appropriate visualisations to support our choice.
  • Use cross-validation to select an appropriate number of basis functions for bs()
  • For our final model, assess our linear model and comment on its fit. This should be done a) for all data simultaneously; b) for data in each region; and c) for data in a maximum of 3 countries that should be chosen to highlight different aspects of the fit diagnostics.
  • Estimate the root mean square error and the mean absolute error using the same test set as before.
  • Produce a prediction, with prediction intervals, of the NMR on its natural scale (i.e., not on the log scale) and plot these a) for all data simultaneously; b) for data in each region; and c) for data in a maximum of 3 countries that show different aspects of the fit.
  • Write a paragraph or two describing the differences between the two models and explaining which we think is a more appropriate model of the data.
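
The cross-validation step in the list above could look roughly like this. It is a sketch on simulated data, not the report's code: the 5 folds, the candidate grid 3:12, and the additive region term are all assumptions made to keep the example small (the interaction version would use * region instead of + region). The boundary knots are fixed to the full data range so that held-out points never fall outside the basis.

```r
library(splines)
set.seed(2)

# Simulated stand-in data with a curved relationship (made-up values).
n <- 600
dat <- data.frame(
  region = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  u5mr   = exp(runif(n, log(5), log(250)))
)
dat$log_u5mr  <- log(dat$u5mr)
dat$log_ratio <- 1 - 0.5 * dat$log_u5mr + 0.4 * sin(2 * dat$log_u5mr) +
  0.2 * as.integer(dat$region) + rnorm(n, sd = 0.3)

# Shared boundary knots, so every fold uses the same spline domain.
bk <- range(dat$log_u5mr)

# Random assignment of rows to k folds of (nearly) equal size.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))
df_grid <- 3:12

# Cross-validated RMSE (on the transformed scale) for each candidate df.
cv_rmse <- sapply(df_grid, function(d) {
  fold_mse <- sapply(seq_len(k), function(j) {
    train <- dat[fold != j, ]
    test  <- dat[fold == j, ]
    fit <- lm(log_ratio ~ bs(log_u5mr, df = d, Boundary.knots = bk) + region,
              data = train)
    mean((test$log_ratio - predict(fit, newdata = test))^2)
  })
  sqrt(mean(fold_mse))
})

# Candidate with the smallest cross-validated error.
best_df <- df_grid[which.min(cv_rmse)]
```

Plotting cv_rmse against df_grid is a useful check: when the curve is nearly flat around its minimum, a smaller df within that flat stretch is often a reasonable choice.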

Full version of the report

Note: the following PDF previewer missed some pages. If interested, feel free to download the entire document by clicking the following link.
 