type
Post
status
Published
slug
Customer-Churn-time
summary
This analysis looks into several aspects of analysing churn data. Customer churn time is the length of time a customer stays with a company before leaving. It is a vital metric in many businesses, and there is great interest in estimating it. The churn time changes over time, so we always want to estimate it with the most recent data available. This is statistically challenging because many (hopefully most) customers will not have churned at the time the data is collected. This means that customer churn data is highly censored.
category
Data Analytics
tags
Data Analysis
R
Predictive Analytics
Statistics
date
Dec 11, 2022 09:42 AM
password
icon
Priority
URL
survival_analysis_cust_churn

What is this about?

This analysis looks into several aspects of analysing churn data.
Customer churn time is the length of time a customer stays with a company before leaving. It is a vital metric in many businesses, and there is great interest in estimating it.
The churn time changes over time, so we always want to estimate it with the most recent data available. This is statistically challenging because many (hopefully most) customers will not have churned at the time the data is collected. This means that customer churn data is highly censored.

Why am I doing this?

Churn time is a vital metric for any subscription business: it feeds into revenue forecasts and estimates of customer lifetime value. Because most customers have not yet churned when the data is pulled, naive summaries (such as the average tenure of customers who have already left) are badly biased. This analysis works through the survival-analysis tools that handle such censored data correctly.

Description of the dataset and questions of the analysis

The dataset -- "churn.csv"

  • comes from the square/pysurvival repository and records, for each customer, how many months they were active, whether they churned, and the size of their company.
  • We will use the following data. Note: it is important to remove the rows with months_active == 0.
library(tidyverse)
## Load the churn data from the pysurvival repository and drop customers
## with zero recorded months of activity.
churn_dat <- read_csv("https://raw.githubusercontent.com/square/pysurvival/master/pysurvival/datasets/churn.csv")
churn_dat <- churn_dat %>% filter(months_active > 0)

We will be interested in 3 columns:

  • months_active: The churn time;
  • churned: The censoring indicator that is 1 if the customer churned and 0 if the measurement is censored (aka they were still customers when the data was pulled); and
  • company_size: A categorical variable with the size of the client’s company.
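
As a quick sanity check (a minimal sketch, assuming churn_dat was loaded as above), we can peek at these columns:

## Print the column types and the first few values
churn_dat %>%
  select(months_active, churned, company_size) %>%
  glimpse()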

Sections of the analysis

S1: Compute the Kaplan-Meier estimate of the survival function

  • I wrote a function that takes a survival pair (time, event) and uses it to compute the Kaplan-Meier estimate of the survival function, using only base R and the Tidyverse. It should work for any survival data.
  • There are many possible ways to program the function for computing the Kaplan-Meier curve. Two functions that can be helpful are accumulate() and accumulate2() from the purrr package; a sketch using accumulate() appears after the explanation below.
  • Use this function and ggplot2 to plot:
    • The Kaplan-Meier curve for the full data; and
    • The Kaplan-Meier curve for each company_size
  • Write a few sentences describing and interpreting the curves.
These are used to apply a function recursively along a list. For example, if f <- function(x, a){...} is a function, then
accumulate(a, f, .init = 1)
is, if a has 3 elements, equivalent to
x[1] <- 1
x[2] <- f(x[1], a[1])
x[3] <- f(x[2], a[2])
x[4] <- f(x[3], a[3])
The function accumulate2 works similarly, except it takes two vectors. For example, if f <- function(x, a, b){...} is a function, then
accumulate2(a, b, f, .init = 1)
is equivalent to
x[1] <- 1
x[2] <- f(x[1], a[1], b[1])
x[3] <- f(x[2], a[2], b[2])
x[4] <- f(x[3], a[3], b[3])
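
For concreteness, here is one possible sketch of such a function built on accumulate(). This is a minimal illustration, not the only way to do it; kaplan_meier_est and km_all are names introduced here, not from the original report.

library(tidyverse)

## Kaplan-Meier estimate using only the tidyverse. `time` is the observed
## time; `event` is 1 if the customer churned and 0 if censored.
kaplan_meier_est <- function(time, event) {
  n <- length(time)
  tibble(time = time, event = event) %>%
    group_by(time) %>%
    summarise(d = sum(event), m = n(), .groups = "drop") %>%  ## events and exits at each time
    arrange(time) %>%
    mutate(
      n_risk = n - lag(cumsum(m), default = 0),  ## number at risk just before each time
      surv = accumulate(1 - d / n_risk, `*`)     ## running product-limit estimate
    ) %>%
    filter(d > 0)  ## keep only the event times
}

## Kaplan-Meier curve for the full data
km_all <- kaplan_meier_est(churn_dat$months_active, churn_dat$churned)
ggplot(km_all, aes(time, surv)) +
  geom_step() +
  labs(x = "Months active", y = "Estimated survival")

The curves for each company_size follow by splitting the data first, for example with group_split(churn_dat, company_size).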

S2: Estimate the median churn time

For each company size (as measured in the company_size column):
  • Compute the Kaplan-Meier curve and use this to estimate the median churn time
  • Use a non-parametric bootstrap to construct 90% confidence intervals for the median of each company size
  • Make a plot that shows the estimate of the median and the corresponding confidence interval on the same axes
  • Write some sentences describing how the median churn time changes across company sizes.
If your data is (time, event), then you can get the estimated survival curve with the following code.
library(survival)
fit <- survfit(Surv(time, event) ~ 1)
event_times <- fit$time
kaplan_meier <- fit$surv
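
One way to carry out the bootstrap, sketched under the same assumptions (km_median, boot_median_ci, and med_by_size are illustrative names introduced here, not from the original report):

library(tidyverse)
library(survival)

## Median churn time: the first time the KM curve drops to 0.5 or below.
km_median <- function(time, event) {
  fit <- survfit(Surv(time, event) ~ 1)
  cand <- fit$time[fit$surv <= 0.5]
  if (length(cand) == 0) NA_real_ else min(cand)
}

## Percentile bootstrap: resample (time, event) pairs together, recompute
## the median each time, and take the empirical 5% and 95% quantiles.
boot_median_ci <- function(time, event, B = 1000, level = 0.90) {
  n <- length(time)
  meds <- replicate(B, {
    idx <- sample.int(n, n, replace = TRUE)
    km_median(time[idx], event[idx])
  })
  alpha <- 1 - level
  quantile(meds, c(alpha / 2, 1 - alpha / 2), na.rm = TRUE)
}

## Estimate and 90% interval for each company size, then plot both together
med_by_size <- churn_dat %>%
  group_by(company_size) %>%
  group_modify(function(df, key) {
    ci <- boot_median_ci(df$months_active, df$churned)
    tibble(median = km_median(df$months_active, df$churned),
           lower = ci[[1]], upper = ci[[2]])
  })

ggplot(med_by_size, aes(company_size, median)) +
  geom_point() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2)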

S3: Compute simultaneous coverage for the entire survival function

For one of the company sizes in the data, I did the following:
  • Use a nonparametric bootstrap to re-sample the data and construct 90% confidence intervals for the survival curve at each time.
  • Compute simultaneous coverage for the entire survival function, that is, the probability that the true survival function is entirely contained between the piecewise constant functions formed by connecting the lower and upper confidence intervals at each event time.
  • Write a few sentences detailing your results and comparing them to the results in the previous question.
While you will ideally use the Kaplan-Meier function you produced in question 1, you will not be marked down for using the survival::survfit function. If your data is (time, event), then you can get the estimated survival curve with the following code.
library(survival)
fit <- survfit(Surv(time, event) ~ 1)
event_times <- fit$time
kaplan_meier <- fit$surv
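
A rough sketch of the coverage computation under the same assumptions. The subset choice and B are illustrative, and the same bootstrap draws are reused both to build the pointwise band and to assess simultaneous coverage, which is a simple plug-in shortcut.

library(tidyverse)
library(survival)

## Work with a single company size, as the question asks; picking the first
## level in the data is purely illustrative.
dat_one <- churn_dat %>% filter(company_size == first(company_size))
time  <- dat_one$months_active
event <- dat_one$churned
n <- length(time)

fit <- survfit(Surv(time, event) ~ 1)
event_times <- fit$time

## Each bootstrap replicate is evaluated at the original event times via a
## right-continuous step function, matching the KM curve's behaviour.
B <- 1000
boot_curves <- replicate(B, {
  idx <- sample.int(n, n, replace = TRUE)
  fit_b <- survfit(Surv(time[idx], event[idx]) ~ 1)
  stepfun(fit_b$time, c(1, fit_b$surv))(event_times)
})

## Pointwise 90% band at each event time
lower <- apply(boot_curves, 1, quantile, probs = 0.05)
upper <- apply(boot_curves, 1, quantile, probs = 0.95)

## Simultaneous coverage: the fraction of bootstrap curves that stay inside
## the pointwise band at every event time (typically well below 90%).
inside <- apply(boot_curves, 2, function(s) all(s >= lower & s <= upper))
mean(inside)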
 
 

Full version of the report

Note: the PDF previewer below misses some pages. If interested, feel free to download the entire document by clicking the link below.

Additional Sources:

 
  • Regression analysis on under-five child mortality
  • System Analysis for a university PhD Research Meeting System