type
Post
status
Published
slug
Customer-Churn-time
summary
This analysis looks into several aspects of analysing churn data. Customer churn time is the length of time a customer stays with a company before leaving. It is a vital metric in many businesses, and there is great interest in estimating it. The churn time changes over time, so we always want to estimate it with the most recent data available. This is statistically challenging because many (hopefully most) customers will not have churned at the time the data is collected. This means that customer churn data is highly censored.
category
Data Analytics
tags
Data Analysis
R
Predictive Analytics
Statistics
date
Dec 11, 2022 09:42 AM
password
icon
Priority
URL
survival_analysis_cust_churn

What is this about?

This analysis looks into several aspects of analysing churn data.
Customer churn time is the length of time a customer stays with a company before leaving. It is a vital metric in many businesses, and there is great interest in estimating it.
The churn time changes over time, so we always want to estimate it with the most recent data available. This is statistically challenging because many (hopefully most) customers will not have churned at the time the data is collected. This means that customer churn data is highly censored.

Why am I doing this?

Churn time is a vital metric for any subscription business: it feeds into revenue forecasts and estimates of customer lifetime value. Because most customers have not yet churned when the data is pulled, naive summaries (such as the average tenure of customers who have already left) are badly biased. This analysis works through the survival-analysis tools that handle such censored data correctly.

Description of the dataset and questions of the analysis

The dataset -- "churn.csv"

  • comes from the square/pysurvival repository and records, for each customer, how many months they were active, whether they churned, and the size of their company.
  • We will use the following data. Note: it is important to remove the rows with months_active == 0.
library(tidyverse)
## Load the churn data from the pysurvival repository and drop customers
## with zero recorded months of activity.
churn_dat <- read_csv("https://raw.githubusercontent.com/square/pysurvival/master/pysurvival/datasets/churn.csv")
churn_dat <- churn_dat %>% filter(months_active > 0)

We will be interested in 3 columns:

  • months_active: The churn time;
  • churned: The censoring indicator that is 1 if the customer churned and 0 if the measurement is censored (aka they were still customers when the data was pulled); and
  • company_size: A categorical variable with the size of the client’s company.
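
As a quick sanity check (a minimal sketch, assuming churn_dat was loaded as above), we can peek at these columns:

## Print the column types and the first few values
churn_dat %>%
  select(months_active, churned, company_size) %>%
  glimpse()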

Sections of the analysis

S1: Compute the Kaplan-Meier estimate of the survival function

  • I wrote a function that takes a survival pair (time, event) and uses it to compute the Kaplan-Meier estimate of the survival function, using only base R and the Tidyverse. It should work for any survival data.
  • There are many possible ways to program the function for computing the Kaplan-Meier curve. Two functions that can be helpful are accumulate() and accumulate2() from the purrr package; a sketch using accumulate() appears after the explanation below.
  • Use this function and ggplot2 to plot:
    • The Kaplan-Meier curve for the full data; and
    • The Kaplan-Meier curve for each company_size
  • Write a few sentences describing and interpreting the curves.
These are used to apply a function recursively along a list. For example, if f <- function(x, a){...} is a function, then
accumulate(a, f, .init = 1)
is, if a has 3 elements, equivalent to
x[1] <- 1
x[2] <- f(x[1], a[1])
x[3] <- f(x[2], a[2])
x[4] <- f(x[3], a[3])
The function accumulate2 works similarly, except it takes two vectors. For example, if f <- function(x, a, b){...} is a function, then
accumulate2(a, b, f, .init = 1)
is equivalent to
x[1] <- 1
x[2] <- f(x[1], a[1], b[1])
x[3] <- f(x[2], a[2], b[2])
x[4] <- f(x[3], a[3], b[3])
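
For concreteness, here is one possible sketch of such a function built on accumulate(). This is a minimal illustration, not the only way to do it; kaplan_meier_est and km_all are names introduced here, not from the original report.

library(tidyverse)

## Kaplan-Meier estimate using only the tidyverse. `time` is the observed
## time; `event` is 1 if the customer churned and 0 if censored.
kaplan_meier_est <- function(time, event) {
  n <- length(time)
  tibble(time = time, event = event) %>%
    group_by(time) %>%
    summarise(d = sum(event), m = n(), .groups = "drop") %>%  ## events and exits at each time
    arrange(time) %>%
    mutate(
      n_risk = n - lag(cumsum(m), default = 0),  ## number at risk just before each time
      surv = accumulate(1 - d / n_risk, `*`)     ## running product-limit estimate
    ) %>%
    filter(d > 0)  ## keep only the event times
}

## Kaplan-Meier curve for the full data
km_all <- kaplan_meier_est(churn_dat$months_active, churn_dat$churned)
ggplot(km_all, aes(time, surv)) +
  geom_step() +
  labs(x = "Months active", y = "Estimated survival")

The curves for each company_size follow by splitting the data first, for example with group_split(churn_dat, company_size).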

S2: Estimate the median churn time

For each company size (as measured in the company_size column):
  • Compute the Kaplan-Meier curve and use this to estimate the median churn time
  • Use a non-parametric bootstrap to construct 90% confidence intervals for the median of each company size
  • Make a plot that shows the estimate of the median and the corresponding confidence interval on the same axes
  • Write some sentences describing how the median churn time changes across company sizes.
If your data is (time, event), then you can get the estimated survival curve with the following code.
library(survival)
fit <- survfit(Surv(time, event) ~ 1)
event_times <- fit$time
kaplan_meier <- fit$surv
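
One way to carry out the bootstrap, sketched under the same assumptions (km_median, boot_median_ci, and med_by_size are illustrative names introduced here, not from the original report):

library(tidyverse)
library(survival)

## Median churn time: the first time the KM curve drops to 0.5 or below.
km_median <- function(time, event) {
  fit <- survfit(Surv(time, event) ~ 1)
  cand <- fit$time[fit$surv <= 0.5]
  if (length(cand) == 0) NA_real_ else min(cand)
}

## Percentile bootstrap: resample (time, event) pairs together, recompute
## the median each time, and take the empirical 5% and 95% quantiles.
boot_median_ci <- function(time, event, B = 1000, level = 0.90) {
  n <- length(time)
  meds <- replicate(B, {
    idx <- sample.int(n, n, replace = TRUE)
    km_median(time[idx], event[idx])
  })
  alpha <- 1 - level
  quantile(meds, c(alpha / 2, 1 - alpha / 2), na.rm = TRUE)
}

## Estimate and 90% interval for each company size, then plot both together
med_by_size <- churn_dat %>%
  group_by(company_size) %>%
  group_modify(function(df, key) {
    ci <- boot_median_ci(df$months_active, df$churned)
    tibble(median = km_median(df$months_active, df$churned),
           lower = ci[[1]], upper = ci[[2]])
  })

ggplot(med_by_size, aes(company_size, median)) +
  geom_point() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2)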

S3: Compute simultaneous coverage for the entire survival function

For one of the company sizes in the data, I did the following:
  • Use a nonparametric bootstrap to re-sample the data and construct 90% confidence intervals for the survival curve at each time.
  • Compute simultaneous coverage for the entire survival function, that is, the probability that the true survival function is entirely contained between the piecewise constant functions formed by connecting the lower and upper confidence intervals at each event time.
  • Write a few sentences detailing your results and comparing them to the results in the previous question.
While you will ideally use the Kaplan-Meier function you produced in question 1, you will not be marked down for using the survival::survfit function. If your data is (time, event), then you can get the estimated survival curve with the following code.
library(survival)
fit <- survfit(Surv(time, event) ~ 1)
event_times <- fit$time
kaplan_meier <- fit$surv
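
A rough sketch of the coverage computation under the same assumptions. The subset choice and B are illustrative, and the same bootstrap draws are reused both to build the pointwise band and to assess simultaneous coverage, which is a simple plug-in shortcut.

library(tidyverse)
library(survival)

## Work with a single company size, as the question asks; picking the first
## level in the data is purely illustrative.
dat_one <- churn_dat %>% filter(company_size == first(company_size))
time  <- dat_one$months_active
event <- dat_one$churned
n <- length(time)

fit <- survfit(Surv(time, event) ~ 1)
event_times <- fit$time

## Each bootstrap replicate is evaluated at the original event times via a
## right-continuous step function, matching the KM curve's behaviour.
B <- 1000
boot_curves <- replicate(B, {
  idx <- sample.int(n, n, replace = TRUE)
  fit_b <- survfit(Surv(time[idx], event[idx]) ~ 1)
  stepfun(fit_b$time, c(1, fit_b$surv))(event_times)
})

## Pointwise 90% band at each event time
lower <- apply(boot_curves, 1, quantile, probs = 0.05)
upper <- apply(boot_curves, 1, quantile, probs = 0.95)

## Simultaneous coverage: the fraction of bootstrap curves that stay inside
## the pointwise band at every event time (typically well below 90%).
inside <- apply(boot_curves, 2, function(s) all(s >= lower & s <= upper))
mean(inside)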
 
 

Full version of the report

Note: the PDF previewer below misses some pages. If interested, feel free to download the entire document by clicking the link below.

Additional Sources:

 
  • Regression analysis on under-five child mortality
  • System Analysis for a university PhD Research Meeting System