Week 3: Binomial Logistic Regression

Author: Robert W. Walker

Published: September 12, 2022

The slides are here.

Our third class meeting will focus on Chapter 5 of Handbook of Regression Modeling in People Analytics.

The Skinny

Why not just use a linear model for a binary outcome? It turns out that you can, but many people know just enough to think you don’t know what you are talking about if you do. In a very, very dense paper, Jamie Robins (1986, iirc) proves that, as long as the predicted probabilities tend to remain in the region of about 0.2 to 0.8 (that is, as long as the model is not all that good), it really doesn’t matter, except that the standard errors you would estimate are likely incorrect. If the model is good at predicting one or the other of the levels, then the behavior in the extremes matters; you could end up with predictions that are less than zero or greater than one, and those are invalid as probabilities. Let’s just peek at an example. I will load some data on Churn. For details, look on Kaggle.com – a great source of data.

```{r}
Churn <- read.csv("https://github.com/robertwwalker/DADMStuff/raw/master/WA_Fn-UseC_-Telco-Customer-Churn.csv")
names(Churn)
table(Churn$Churn)
```
 [1] "customerID"       "gender"           "SeniorCitizen"    "Partner"         
 [5] "Dependents"       "tenure"           "PhoneService"     "MultipleLines"   
 [9] "InternetService"  "OnlineSecurity"   "OnlineBackup"     "DeviceProtection"
[13] "TechSupport"      "StreamingTV"      "StreamingMovies"  "Contract"        
[17] "PaperlessBilling" "PaymentMethod"    "MonthlyCharges"   "TotalCharges"    
[21] "Churn"           

  No  Yes 
5174 1869 

Now to a regression model. We will need a variable type change to pull this off. Let’s have a look at the necessary transformation.

```{r}
str(Churn$Churn)
str(as.factor(Churn$Churn))
str(as.numeric(as.factor(Churn$Churn)))
Churn$Churn.Numeric <- as.numeric(as.factor(Churn$Churn))-1
str(Churn$Churn.Numeric)
```
 chr [1:7043] "No" "No" "Yes" "No" "Yes" "Yes" "No" "No" "Yes" "No" "No" ...
 Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
 num [1:7043] 1 1 2 1 2 2 1 1 2 1 ...
 num [1:7043] 0 0 1 0 1 1 0 0 1 0 ...

Churn is a character variable. To turn it to a quantity taking values of zero and one, as.factor() turns the character variable Churn into a factor with two levels, No and Yes. as.numeric() turns this into a number, one or two. I then subtract one to get a variable that takes values zero or one and store it as Churn.Numeric. I will make Churn a function of a few chosen variables; there is more in the dataset that I could work with.
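Since Churn only takes the values No and Yes, the same recode can be done in one step with ifelse(); this is just an equivalent alternative to the factor route above.

```{r}
# Map "Yes" to 1 and "No" to 0 directly, skipping the factor step
Churn$Churn.Numeric <- ifelse(Churn$Churn == "Yes", 1, 0)
table(Churn$Churn.Numeric)
```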

```{r, message=FALSE, warning=FALSE}
library(stargazer); library(magrittr); library(tidyverse); library(skimr)
my.lm <- lm(Churn.Numeric~InternetService+tenure+PhoneService+Contract+TotalCharges, data=Churn)
summary(my.lm)
```

Call:
lm(formula = Churn.Numeric ~ InternetService + tenure + PhoneService + 
    Contract + TotalCharges, data = Churn)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.64458 -0.26256 -0.07269  0.35565  1.12180 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 4.274e-01  1.651e-02  25.892  < 2e-16 ***
InternetServiceFiber optic  2.601e-01  1.268e-02  20.512  < 2e-16 ***
InternetServiceNo          -1.476e-01  1.588e-02  -9.290  < 2e-16 ***
tenure                     -1.909e-03  4.642e-04  -4.112 3.97e-05 ***
PhoneServiceYes            -3.823e-02  1.769e-02  -2.161   0.0307 *  
ContractOne year           -1.385e-01  1.383e-02 -10.018  < 2e-16 ***
ContractTwo year           -1.184e-01  1.641e-02  -7.218 5.81e-13 ***
TotalCharges               -3.961e-05  5.244e-06  -7.555 4.73e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3812 on 7024 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.2562,    Adjusted R-squared:  0.2554 
F-statistic: 345.6 on 7 and 7024 DF,  p-value: < 2.2e-16

Scientific notation can render tables hard to read. We can adjust R’s options to require more leading zeroes before scientific notation is used, via scipen in options(), e.g.

```{r}
options(scipen=6)
summary(my.lm)
```

Call:
lm(formula = Churn.Numeric ~ InternetService + tenure + PhoneService + 
    Contract + TotalCharges, data = Churn)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.64458 -0.26256 -0.07269  0.35565  1.12180 

Coefficients:
                               Estimate   Std. Error t value Pr(>|t|)    
(Intercept)                 0.427421784  0.016507683  25.892  < 2e-16 ***
InternetServiceFiber optic  0.260067246  0.012678756  20.512  < 2e-16 ***
InternetServiceNo          -0.147560579  0.015884435  -9.290  < 2e-16 ***
tenure                     -0.001908716  0.000464227  -4.112 3.97e-05 ***
PhoneServiceYes            -0.038230897  0.017690283  -2.161   0.0307 *  
ContractOne year           -0.138502281  0.013825977 -10.018  < 2e-16 ***
ContractTwo year           -0.118438811  0.016407760  -7.218 5.81e-13 ***
TotalCharges               -0.000039614  0.000005244  -7.555 4.73e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3812 on 7024 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.2562,    Adjusted R-squared:  0.2554 
F-statistic: 345.6 on 7 and 7024 DF,  p-value: < 2.2e-16

That is much easier to read. The first thing to note is that there are lots of stars. The model seems to explain variance in the outcome, with all the caveats that go with that idea for a 0/1 variable; about 25% of the variance in Churn can be accounted for by these predictors. The F-statistic assessing the joint hypothesis that all predictors have zero slopes is enormous: the observed value is 345.6, while 99% of F values with 7 numerator and 7024 denominator degrees of freedom are less than 2.64, so a statistic this large is quite unlikely by chance. Moreover, each of the individual t-statistics is greater than 2 in absolute value. Those are the encouraging parts. The residual standard error should give us pause; on average, we are 0.38 away from the observed outcome in the probability metric. That is not great, though to make it smaller the model would have to predict in the extremes. Now, let me put this into a nice table.
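First, a quick check on that 2.64 critical value with qf():

```{r}
# 99th percentile of F with 7 numerator and 7024 denominator df; about 2.64
qf(0.99, df1 = 7, df2 = 7024)
```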

```{r, results='asis', warning=FALSE, message=FALSE}
stargazer(my.lm, type="html", style="apsr")
```
                            Churn.Numeric
InternetServiceFiber optic   0.260***
                            (0.013)
InternetServiceNo           -0.148***
                            (0.016)
tenure                      -0.002***
                            (0.0005)
PhoneServiceYes             -0.038**
                            (0.018)
ContractOne year            -0.139***
                            (0.014)
ContractTwo year            -0.118***
                            (0.016)
TotalCharges                -0.00004***
                            (0.00001)
Constant                     0.427***
                            (0.017)
N                            7,032
R2                           0.256
Adjusted R2                  0.255
Residual Std. Error          0.381 (df = 7024)
F Statistic                  345.603*** (df = 7; 7024)
*p < .1; **p < .05; ***p < .01

Let’s return to my original criticism of using this particular model and assuming that the outcome is a quantity when it only takes two values. Are all of the predictions well-behaved?

```{r}
my.lm$fitted.values %>% skim()
```
── Data Summary ──
Name: Piped data
Number of rows: 7032
Number of columns: 1
Column type frequency: numeric (1)
Group variables: none

── Variable type: numeric ──
skim_variable  n_missing  complete_rate  mean  sd    p0     p25   p50   p75   p100  hist
data           0          1              0.27  0.22  -0.14  0.08  0.24  0.42  0.64  ▆▇▇▇▇

No, there are negative values. To prevent that, we need a different tool; this is the subject of Chapter 5.

We should also examine residuals. Using a variety of tests of linear model assumptions, we find the model lacking on every one but constant variance [homoscedasticity].

```{r}
library(gvlma)
gvlma(my.lm)
```

Call:
lm(formula = Churn.Numeric ~ InternetService + tenure + PhoneService + 
    Contract + TotalCharges, data = Churn)

Coefficients:
               (Intercept)  InternetServiceFiber optic  
                0.42742178                  0.26006725  
         InternetServiceNo                      tenure  
               -0.14756058                 -0.00190872  
           PhoneServiceYes            ContractOne year  
               -0.03823090                 -0.13850228  
          ContractTwo year                TotalCharges  
               -0.11843881                 -0.00003961  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma(x = my.lm) 

                     Value   p-value                   Decision
Global Stat        630.599 0.000e+00 Assumptions NOT satisfied!
Skewness           377.348 0.000e+00 Assumptions NOT satisfied!
Kurtosis            64.207 1.110e-15 Assumptions NOT satisfied!
Link Function      187.439 0.000e+00 Assumptions NOT satisfied!
Heteroscedasticity   1.605 2.051e-01    Assumptions acceptable.

A linear model seems not to work well for these data. Models designed for this task will occupy our attention after a few notes and an overview.

Overview and Comments

What we require is a regression type tool tuned to represent data drawn from a generic binomial distribution. There are actually a few such models that I will introduce you to. There are also some really interesting models that you can fit that build mixtures of the different approaches but we won’t go that far. I should also note that there is a whole class of models on binary classification using trees. If you remember regression trees, you can also build regression trees for binary problems. I will do a bit of this in the end. Before that, some initial observations:

  • I am using stargazer to produce the tables; they are nice and easy to produce. stargazer emits raw html, so I can embed it directly by setting results='asis' in the code chunk when typesetting to html.
  • This whole document makes use of fenced code chunks. You can copy and paste it into a new R Markdown or Quarto document to play along, with the ticks built in.
  • If one wants to omit a chunk at the top, add the option include=FALSE to the bracketed chunk header (surrounded by curly brackets), e.g. {r setup, include=FALSE}. I always suppress warnings and messages this way. One caution: if you use this option while loading libraries, readers will find it hard to figure out how commands may have changed meaning through masking.
```{r setup}
knitr::opts_chunk$set(warning=FALSE, message=FALSE, tidy = TRUE)
set.seed(9119)
```

My Notes on Binary GLMs

Suppose we have some variable that we want to explain, say Churn that has two mutually exclusive and exhaustive alternatives. Customers can either Churn or not. Any given customer is conceptualized as a Bernoulli trial, e.g. \(\pi^{y}(1-\pi)^{1-y}\). With a willingness to believe that every Churn decision is an independently and identically distributed trial in this group of customers, overall churn is a binomial random variable with probability mass function \[P_{y} = {n \choose y} \pi^{y}(1-\pi)^{n-y}\] where

  • \(P_{y}\) is the binomial probability of \(y\)
  • \(y\) is the number of successes in \(n\) trials
  • \(n\) is the number of trials
  • \(\pi\) is the probability of success in any given trial.

That’s just a fancy way of saying that we have a binomial distribution on our hands. This is known as the canonical distribution for binary data because \(\pi\) is a sufficient statistic – a complete characterization of Churn because it only takes two values. The challenge is that we wish to formulate a regression model for \(\pi\), which will first require that we come to grips with the existence of \(\pi_{i}\).
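As a quick illustration of the mass function, dbinom() computes \(P_{y}\) directly; the values of \(y\), \(n\), and \(\pi\) here are made up for illustration.

```{r}
# P(y = 3 successes in n = 10 trials with pi = 0.25)
dbinom(3, size = 10, prob = 0.25)
# the same probability written out from the formula; both are about 0.25
choose(10, 3) * 0.25^3 * 0.75^7
```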

Generalized Linear Models

I need some model that is bounded between zero and one, abstractly, because probabilities are subject to a sum-to-one constraint. This is where there is some diversity in the representations; let me explain. In generalized linear models, there are two keys to the specification: the family and the link. We have already covered the family; it has to be binomial.

In the theory of these models, presented by Peter McCullagh and John Nelder in 1989¹, the link for the probabilities is what ties regression to the binomial distribution; we posit that \((1-\pi_{i}) = Pr(y_{i}=1|X_{i}) = 1-F(X_{i}\beta)\) so that \(\pi_{i} = Pr(y_{i}=0|X_{i}) = F(X_{i}\beta)\). If \(F\) is some well-behaved probability distribution, then the above is valid. There are a few common choices for \(F\): the \(\pi_{i}\) could be derived from a normal distribution (the model is called probit), from the logistic distribution that the text focuses on (the model is called logit), and there are a few others that are somewhat common: the Cauchy, the log-log, and the complementary log-log. The latter two are asymmetric, and each is the mirror of the other. What we want to do is to find the estimates of \(\beta\) that maximize the likelihood of the sample we observe.²
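Each of those links corresponds to a cumulative distribution function for \(F\); a small sketch of the candidates evaluated at an arbitrary value of \(X_{i}\beta\):

```{r}
# Each link pairs the binomial family with a CDF F
xb <- 0.5
plogis(xb)          # logistic F (logit)
pnorm(xb)           # standard normal F (probit)
pcauchy(xb)         # Cauchy F (cauchit)
1 - exp(-exp(xb))   # inverse complementary log-log
```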

A Probit Model

First, a little substitution and some notation. Let me label the normal probability up to \(X_{i}\beta\) as \(\Phi(X_{i}\beta)\) and the probability above \(X_{i}\beta\) as \(1-\Phi(X_{i}\beta)\). I can substitute this into the binomial and obtain the product over the entire sample; this is known as the likelihood.

\[\prod^{n}_{i=1} \Phi(X_{i}\beta)^{1-y_{i}}(1-\Phi(X_{i}\beta))^{y_{i}}\]

Taking logs yields:

\[\ln \mathcal{L} = \sum^{n}_{i=1} (1-y_{i})\ln \Phi(X_{i}\beta) + y_{i} \ln (1-\Phi(X_{i}\beta))\]

So the solution becomes

\[\arg \max_{\beta} \ln \mathcal{L} = \arg \max_{\beta} \sum^{n}_{i=1} (1-y_{i})\ln \Phi(X_{i}\beta) + y_{i} \ln (1-\Phi(X_{i}\beta))\]

In English, we want to find the values of \(\beta\) that maximize the log-likelihood of the entire sample.

Estimation of a First GLM

Now to another example with the same measure of Churn. The outcome of interest is Churn. The model specification will call glm(); let me examine Churn as a function of InternetService, tenure, PhoneService, Contract, and TotalCharges. There is one trick to deploying it: the outcome variable must be a factor type. To make the table nice, let me mutate the type to a factor and then we can model it.

```{r, warning=FALSE, message=FALSE}
Churn %<>%
    mutate(ChurnF = as.factor(Churn))
Churn %>%
    select(Churn, ChurnF) %>%
    mutate(as.numeric(ChurnF)) %>%
    head()
```
  Churn ChurnF as.numeric(ChurnF)
1    No     No                  1
2    No     No                  1
3   Yes    Yes                  2
4    No     No                  1
5   Yes    Yes                  2
6   Yes    Yes                  2

Now I want to estimate the model and have a look at the result. I will put this in a stargazer table.

```{r, results='asis', warning=FALSE, message=FALSE}
my.probit <- glm(ChurnF ~ InternetService + tenure + PhoneService + Contract + TotalCharges,
    family = binomial(link = "probit"), data = Churn)
stargazer(my.probit, type = "html", style = "apsr")
```
                            ChurnF
InternetServiceFiber optic   0.726***
                            (0.053)
InternetServiceNo           -0.458***
                            (0.070)
tenure                      -0.028***
                            (0.003)
PhoneServiceYes             -0.372***
                            (0.073)
ContractOne year            -0.489***
                            (0.057)
ContractTwo year            -0.866***
                            (0.083)
TotalCharges                 0.0001***
                            (0.00003)
Constant                     0.131*
                            (0.069)
N                            7,032
Log Likelihood              -3,019.439
AIC                          6,054.878
*p < .1; **p < .05; ***p < .01

We can do astrology on the tables; read the stars. Fiber optic customers are more likely to Churn and those without internet service are less likely to Churn but both conditions are compared to a third category absorbed into the Constant. What is that category?

```{r}
janitor::tabyl(Churn$InternetService)
```
 Churn$InternetService    n   percent
                   DSL 2421 0.3437456
           Fiber optic 3096 0.4395854
                    No 1526 0.2166690

DSL subscribers. It is the first in alphabetical order. That is the default option. That also means that the constant captures those on Month-to-month contracts and without phone service – the omitted category for each. So what do these coefficients show?

The slopes represent the effect of a one-unit change in \(x\) on the underlying distribution for the probabilities. Unless one has intuition for those distributions, they come across as nonsensical. In the table above, let me take the example of tenure. For each unit of tenure [another month having been a customer], the normal variable \(Z \sim N(0,1)\) decreases by 0.028. But what that means depends on whether we are going from 0 to -0.028 or from -2 to -2.028. Remember the standard normal has about 95% of probability between -2 and 2 and has a modal/most common value at zero.

```{r}
data.frame(Z = rnorm(10000)) %>%
    ggplot(.) + aes(x = Z) + geom_density() + theme_minimal() + geom_vline(aes(xintercept = -2.028),
    color = "red") + geom_vline(aes(xintercept = -2), color = "red") + geom_vline(aes(xintercept = -0.028),
    color = "blue") + geom_vline(aes(xintercept = 0), color = "blue")
```

The associated probability differences in that example would be either small or nearly trivial, even aggregated over 1000s of customers.

```{r}
pnorm(-2) - pnorm(-2.028)
pnorm(0) - pnorm(-0.028)
```
[1] 0.001470008
[1] 0.01116892

It seems like most scholars I run across don’t actually know this; they tend to stick to stars. (That’s why your book plays with odds and logistic regression, but I want to start here because y’all have never seen the logistic distribution, except on my poster….) Before that, here’s another way of showing the actual estimated slopes, even if their intuition is hard.

```{r}
library(jtools)
plot_summs(my.probit, inner_ci_level = 0.95)
```

I can make that plot better.

A Scaled Coefficient Plot

Remember scaling from last time and linear models? No rules against that. The two metric variables – tenure and TotalCharges – are now the change in Z for a one standard deviation change in the relevant variable. That’s 2267 dollars for TotalCharges and 24.6 months for tenure.

```{r}
Churn %>%
    skim(TotalCharges, tenure)
```
── Data Summary ──
Name: Piped data
Number of rows: 7043
Number of columns: 23
Column type frequency: numeric (2 selected)
Group variables: none

── Variable type: numeric ──
skim_variable  n_missing  complete_rate  mean     sd       p0    p25     p50      p75      p100    hist
TotalCharges   11         1              2283.30  2266.77  18.8  401.45  1397.47  3794.74  8684.8  ▇▂▂▂▁
tenure         0          1              32.37    24.56    0.0   9.00    29.00    55.00    72.0    ▇▃▃▃▆
```{r}
Churn %>%
    mutate(tenure = scale(tenure), TotalCharges = scale(TotalCharges)) %>%
    glm(ChurnF ~ InternetService + tenure + PhoneService + Contract + TotalCharges,
        family = binomial(link = "probit"), data = .) %>%
    plot_summs(., inner_ci_level = 0.95)
```

What the plot makes clear is that basically all of the predictors are deemed important in Churn decisions by conventional standards as all have a very low probability of no relationship/zero slope. But that’s as far as we can get with these unless our audience shares this intuition for probability distributions.

The Trouble with Non-linear Models

I should be clear that the model does have lines; they are just lines inside of a nonlinear function – the \(F\). The generalized linear part means that the interpretation of any one factor will depend on the values of the others. We will usually want to generate hypothetical data to understand what is really going on. After a presentation of the remaining models, I will return to my preferred method of interpretation.

Logistic Regression

The logistic distribution is the focus of the textbook chapter. To respecify the model using it, the only change in syntax is the link: we need link="logit", which is the default.

The logistic function is given by:

\[\Lambda = \frac{e^{X\beta}}{1+e^{X\beta}}\]

But the rest is the same; it takes the very general representation and provides a specific probability function for \(F\):

\[\arg \max_{\beta} \ln \mathcal{L} = \arg \max_{\beta} \sum^{n}_{i=1} (1-y_{i})\ln \Lambda(X_{i}\beta) + y_{i} \ln (1-\Lambda(X_{i}\beta))\]

One of the advantages of using the logistic distribution is that you can analytically solve it with only categorical variables. The other is the interpretation of the estimates; the slope is an increment in the log-odds, e.g. \(\ln (\frac{\pi_{y=1}}{1-\pi_{y=1}})\).
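A sketch of that log-odds correspondence: qlogis() converts a probability to log-odds and plogis() inverts it; the probability 0.75 here is arbitrary.

```{r}
pi <- 0.75
qlogis(pi)               # log-odds: log(0.75/0.25), about 1.1
log(pi / (1 - pi))       # the same quantity written out
plogis(qlogis(pi))       # back to the probability, 0.75
```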

```{r, results='asis', warning=FALSE, message=FALSE}
my.logit <- glm(ChurnF ~ InternetService + tenure + PhoneService + Contract + TotalCharges,
    family = binomial(link = "logit"), data = Churn)
stargazer(my.logit, my.probit, type = "html", style = "apsr")
```
                            ChurnF
                            logistic      probit
                            (1)           (2)
InternetServiceFiber optic   1.172***      0.726***
                            (0.091)       (0.053)
InternetServiceNo           -0.765***     -0.458***
                            (0.127)       (0.070)
tenure                      -0.063***     -0.028***
                            (0.006)       (0.003)
PhoneServiceYes             -0.714***     -0.372***
                            (0.126)       (0.073)
ContractOne year            -0.874***     -0.489***
                            (0.103)       (0.057)
ContractTwo year            -1.781***     -0.866***
                            (0.172)       (0.083)
TotalCharges                 0.0004***     0.0001***
                            (0.0001)      (0.00003)
Constant                     0.373***      0.131*
                            (0.117)       (0.069)
N                            7,032         7,032
Log Likelihood              -3,004.328    -3,019.439
AIC                          6,024.656     6,054.878
*p < .1; **p < .05; ***p < .01

If we see these in a side by side comparison, it is obvious that the logistic version is bigger in absolute value across the board. So what do these mean in terms of actual odds of Churn or not?

```{r}
exp(my.logit$coefficients)
```
               (Intercept) InternetServiceFiber optic 
                 1.4521830                  3.2282235 
         InternetServiceNo                     tenure 
                 0.4652907                  0.9393570 
           PhoneServiceYes           ContractOne year 
                 0.4898382                  0.4173338 
          ContractTwo year               TotalCharges 
                 0.1685345                  1.0003559 

All else equal,

  • The odds of Churning with Fiber optic, as opposed to DSL, increase by 223%.
  • The odds of Churning with No internet, as opposed to DSL, decrease by 53.5%.
  • The odds of Churning with Phone service, as opposed to no phone service, are 51% lower.
  • The odds of Churning decrease by about 6% per unit of tenure [month].
  • The odds of Churning increase by 0.04% per dollar of total charges.
  • The odds of Churning decrease under contracts. Compared to month-to-month, odds are about 83% lower under a two-year contract and 58% lower under a one-year contract.
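The percentages in those bullets can be read off in one line by converting the exponentiated coefficients to percentage changes in the odds; a small convenience sketch.

```{r}
# 100*(odds ratio - 1): positive values are % increases in the odds,
# negative values are % decreases
round(100 * (exp(coef(my.logit)) - 1), 1)
```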

If you choose to work with odds, then the suggestion to exponentiate the confidence intervals for the odds-ratios is sound.

```{r}
exp(confint(my.logit))
```
                               2.5 %    97.5 %
(Intercept)                1.1535576 1.8288933
InternetServiceFiber optic 2.7035580 3.8611249
InternetServiceNo          0.3619193 0.5951803
tenure                     0.9284409 0.9500587
PhoneServiceYes            0.3822153 0.6276984
ContractOne year           0.3400210 0.5101487
ContractTwo year           0.1190988 0.2338910
TotalCharges               1.0002350 1.0004795

There are diagnostics that can be applied to these models, for example the various pseudo-\(R^2\) measures. This model’s fit is neither terrible nor good.

```{r}
library(DescTools)
DescTools::PseudoR2(my.logit, which = c("McFadden", "CoxSnell", "Nagelkerke", "Tjur"))
```
  McFadden   CoxSnell Nagelkerke       Tjur 
 0.2621401  0.2618213  0.3817196  0.2798359 

Taking advantage of the book’s example, I first need to clean up the data; there are a few missing values. Then let me estimate the regression and diagnose it.

```{r}
Churn.CC <- Churn %>%
    select(ChurnF, InternetService, tenure, PhoneService, Contract, TotalCharges) %>%
    filter(!is.na(TotalCharges))
my.logit.CC <- glm(ChurnF ~ InternetService + tenure + PhoneService + Contract +
    TotalCharges, family = binomial(link = "logit"), data = Churn.CC)
library(LogisticDx)
# get range of goodness-of-fit diagnostics
model_diagnostics <- LogisticDx::gof(my.logit.CC, plotROC = TRUE)
```

One very common plot for binary logistic regression is the ROC: the Receiver Operating Characteristic curve. It plots sensitivity against specificity. Specificity is the ability, in this case, to correctly identify non-Churners [few false positives is highly specific]; sensitivity is the ability of the test to correctly identify Churners [few false negatives is highly sensitive]. A useful mnemonic is that the letter f in specificity is a reminder that False test results are False for the condition, while the t in sensitivity is that True test results are True for the condition. Now, turning to the actual provided diagnostics, what all is in there? See ?gof, for example.
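To pin down those two definitions, here is a sketch computing sensitivity and specificity by hand at a 0.5 cutoff; the object names are mine, not part of LogisticDx.

```{r}
# Rows are observed outcomes, columns are predictions at a 0.5 cutoff
pred <- as.numeric(fitted(my.logit.CC) > 0.5)
obs <- as.numeric(Churn.CC$ChurnF == "Yes")
cm <- table(obs, pred)
c(sensitivity = cm["1", "1"] / sum(cm["1", ]),  # true Churners caught
  specificity = cm["0", "0"] / sum(cm["0", ]))  # true non-Churners caught
```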

```{r}
# returns a list
names(model_diagnostics)
model_diagnostics$gof
```
[1] "ct"    "chiSq" "ctHL"  "gof"   "R2"    "auc"  
         test  stat       val df         pVal
1:         HL chiSq 31.832686  8 9.979186e-05
2:        mHL     F 16.436041  9 4.896337e-27
3:       OsRo     Z  3.282456 NA 1.029070e-03
4: SstPgeq0.5     Z  6.656484 NA 2.804560e-11
5:   SstPl0.5     Z  5.983933 NA 2.178133e-09
6:    SstBoth chiSq 80.116227  2 4.008504e-18
7: SllPgeq0.5 chiSq 45.280618  1 1.707304e-11
8:   SllPl0.5 chiSq 29.794958  1 4.802393e-08
9:    SllBoth chiSq 54.741365  2 1.297369e-12

This is not a very good model. It fails all the tests. We need to add more predictors; I have those but let’s keep it simple for now. Let me look at two more.

Other Binomial GLMs

The other two common links are the Cauchy (cauchit) and the complementary log-log (cloglog).

```{r, results='asis', warning=FALSE, message=FALSE}
my.cauchit <- glm(ChurnF ~ InternetService + tenure + PhoneService + Contract + TotalCharges,
    family = binomial(link = "cauchit"), data = Churn)
my.cloglogit <- glm(ChurnF ~ InternetService + tenure + PhoneService + Contract +
    TotalCharges, family = binomial(link = "cloglog"), data = Churn)
stargazer(my.cauchit, my.cloglogit, my.logit, my.probit, type = "html", style = "apsr")
```
                            ChurnF
                            cauchit      cloglog      logistic     probit
                            (1)          (2)          (3)          (4)
InternetServiceFiber optic   0.993***     0.877***     1.172***     0.726***
                            (0.094)      (0.071)      (0.091)      (0.053)
InternetServiceNo           -0.790***    -0.640***    -0.765***    -0.458***
                            (0.174)      (0.112)      (0.127)      (0.070)
tenure                      -0.123***    -0.059***    -0.063***    -0.028***
                            (0.011)      (0.005)      (0.006)      (0.003)
PhoneServiceYes             -0.705***    -0.586***    -0.714***    -0.372***
                            (0.136)      (0.099)      (0.126)      (0.073)
ContractOne year            -1.190***    -0.767***    -0.874***    -0.489***
                            (0.166)      (0.092)      (0.103)      (0.057)
ContractTwo year            -7.065***    -1.696***    -1.781***    -0.866***
                            (1.254)      (0.161)      (0.172)      (0.083)
TotalCharges                 0.001***     0.0004***    0.0004***    0.0001***
                            (0.0001)     (0.0001)     (0.0001)     (0.00003)
Constant                     0.590***    -0.024        0.373***     0.131*
                            (0.122)      (0.088)      (0.117)      (0.069)
N                            7,032        7,032        7,032        7,032
Log Likelihood              -2,997.445   -2,991.982   -3,004.328   -3,019.439
AIC                          6,010.891    5,999.964    6,024.656    6,054.878
*p < .1; **p < .05; ***p < .01

The best fit (lowest AIC) seems to be provided by the cloglog link, which is asymmetric.

```{r}
library(pROC)
predicted <- predict(my.cloglogit, type = "response")
auc(Churn.CC$ChurnF, predicted, plot = TRUE)
```
Area under the curve: 0.8386

Residuals

```{r}
d <- density(residuals(my.logit, "pearson"))
plot(d, main = "")
```

This is rather poor.

Predicted Probability

I find that the most straightforward way to interpret them is with plots in the probability metric. Let me take the example of tenure.

I will need to create data for interpretation. Let’s suppose we have a DSL user with phone service on a two year contract with average TotalCharges. The last thing I need to know is what values of tenure to show.

```{r}
library(skimr)
Churn %>%
    filter(InternetService == "DSL", PhoneService == "Yes", Contract == "Two year") %>%
    skim(tenure, TotalCharges)
```
── Data Summary ──
Name: Piped data
Number of rows: 467
Number of columns: 23
Column type frequency: numeric (2 selected)
Group variables: none

── Variable type: numeric ──
skim_variable  n_missing  complete_rate  mean     sd       p0     p25      p50      p75      p100     hist
tenure         0          1.00           60.52    14.79    0.0    55.00    66.00    71.00    72.00    ▁▁▁▂▇
TotalCharges   3          0.99           4733.47  1382.51  130.5  3867.24  4913.88  5879.86  6859.05  ▁▂▅▇▇

Now I can create the data and generate predictions in the probability metric of the response.

```{r}
Tenure.Pred <- data.frame(InternetService = "DSL", PhoneService = "Yes", Contract = "Two year",
    TotalCharges = 4733.5, tenure = seq(0, 72, by = 1))
Tenure.Pred$Prob.Churn <- predict(my.logit, newdata = Tenure.Pred, type = "response")
```

Now let me plot it.

```{r}
ggplot(Tenure.Pred) + aes(x = tenure, y = Prob.Churn) + geom_line() + theme_minimal()
```

We could get fancier, too.

```{r}
Tenure.Pred.Three <- rbind(data.frame(InternetService = "DSL", PhoneService = "Yes",
    Contract = "Month-to-month", TotalCharges = 4733.5, tenure = seq(0, 72, by = 1)),
    data.frame(InternetService = "DSL", PhoneService = "Yes", Contract = "One year",
        TotalCharges = 4733.5, tenure = seq(0, 72, by = 1)), data.frame(InternetService = "DSL",
        PhoneService = "Yes", Contract = "Two year", TotalCharges = 4733.5, tenure = seq(0,
            72, by = 1)))
Tenure.Pred.Three$Prob.Churn <- predict(my.logit, newdata = Tenure.Pred.Three, type = "response")
ggplot(Tenure.Pred.Three) + aes(x = tenure, y = Prob.Churn, color = Contract) + geom_line() +
    theme_minimal() + labs(y = "Pr(Churn)")
```

A Better (Businessy) Way of Thinking About all of This

I personally believe that the only real way to assess models for use in predictive analytics is to assess them by that criterion. That doesn’t mean fitting inside the extant sample of data, but rather sampling from it and then using the model to predict what is known as a holdout sample. Let me show you what I mean. In this case, let me use the probit specification from before and a 75/25 split. This means that I will analyse 75 percent and predict the other 25 percent. I can use join style commands to pull it off pretty simply. I have 7043 rows, so I want 5283 rows of the original data out of that 7043.

```{r}
train <- Churn[sample(c(1:7043), size = 5283, replace = FALSE), ]
test <- Churn %>%
    anti_join(., train)
```

Now to estimate the model.

```{r}
library(janitor)
mod.train <- train %>%
    glm(ChurnF ~ InternetService + tenure + PhoneService + Contract + TotalCharges,
        family = binomial(link = "probit"), data = .)
```

Now to predict the result on the test set that I created. I will then turn the probabilities into a best guess by whether Churn or No is more likely.

```{r}
test$Pred.Probs <- predict(mod.train, newdata = test, type = "response")
test %>%
    mutate(Pred.Val = (Pred.Probs > 0.5)) %>%
    janitor::tabyl(Churn, Pred.Val, show_na = FALSE) %>%
    adorn_percentages("row")
```
 Churn     FALSE      TRUE
    No 0.8961749 0.1038251
   Yes 0.5042017 0.4957983

Then all of the totals.

```{r}
test %>%
    mutate(Pred.Val = (Pred.Probs > 0.5)) %>%
    janitor::tabyl(Churn, Pred.Val, show_na = FALSE) %>%
    adorn_totals(c("row", "col"))
```
 Churn FALSE TRUE Total
    No  1148  133  1281
   Yes   240  236   476
 Total  1388  369  1757

Now you might complain that we can only get 50 to 55 percent of Churn='Yes' right with the model, but remember that only 26.5 percent of people Churn overall, so we have improved quite a bit over knowing nothing at all but the raw row probability. In this specific case, the probability of Yes in the test set is shown below.

```{r}
test %>%
    tabyl(Churn)
```
 Churn    n   percent
    No 1284 0.7295455
   Yes  476 0.2704545
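To put a number on the improvement, compare the model’s overall accuracy on the test set to the naive rule of always predicting No, using the totals in the tables above.

```{r}
(1148 + 236) / 1757  # model accuracy, about 0.79
1281 / 1757          # always-No accuracy, about 0.73
```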

Quadratic Terms?

What would happen if I assumed that the effect of tenure is not a line but instead has some curvature?

```{r, results='asis', warning=FALSE, message=FALSE}
mod.train.SQ <- train %>%
    glm(ChurnF ~ InternetService + tenure + I(tenure^2) + PhoneService + Contract +
        TotalCharges, family = binomial(link = "probit"), data = .)
stargazer(mod.train, mod.train.SQ, type = "html", style = "apsr")
```
                            ChurnF
                            (1)          (2)
InternetServiceFiber optic   0.703***     0.761***
                            (0.061)      (0.062)
InternetServiceNo           -0.520***    -0.566***
                            (0.082)      (0.083)
tenure                      -0.030***    -0.043***
                            (0.004)      (0.004)
I(tenure^2)                               0.0003***
                                         (0.0001)
PhoneServiceYes             -0.296***    -0.262***
                            (0.085)      (0.085)
ContractOne year            -0.467***    -0.465***
                            (0.067)      (0.067)
ContractTwo year            -0.927***    -1.067***
                            (0.100)      (0.105)
TotalCharges                 0.0001***    0.00005
                            (0.00004)    (0.00004)
Constant                     0.086        0.171**
                            (0.081)      (0.082)
N                            5,275        5,275
Log Likelihood              -2,235.094   -2,220.230
AIC                          4,486.189    4,458.460
*p < .1; **p < .05; ***p < .01

As we can see from the table, the curvature appears to be different from zero, though interpreting such a thing ceteris paribus is probably nonsense. It is probably better to see what happens in the metric of the predicted probability. Let me recycle the prediction data I used to draw this in the earlier section. What would the two predictions look like, and how do they differ?

```{r}
Tenure.Pred$Prob.Churn.2 <- predict(mod.train, newdata = Tenure.Pred, type = "response")
Tenure.Pred$Prob.Churn.Sq <- predict(mod.train.SQ, newdata = Tenure.Pred, type = "response")
ggplot(Tenure.Pred) + aes(x = tenure, y = Prob.Churn.Sq) + geom_line() + geom_line(aes(y = Prob.Churn.2),
    color = "purple") + theme_minimal() + labs(y = "Pr(Churn)")
```

Now let’s predict for the test set, does it really do better?

```{r}
test$Pred.Probs.Sq <- predict(mod.train.SQ, newdata = test, type = "response")
test %>%
    mutate(Pred.Val.Sq = (Pred.Probs.Sq > 0.5)) %>%
    janitor::tabyl(Churn, Pred.Val.Sq, show_na = FALSE) %>%
    adorn_percentages("row")
```
 Churn     FALSE      TRUE
    No 0.8961749 0.1038251
   Yes 0.5042017 0.4957983

Not usually. Such people are really unlikely to Churn no matter what; the predicted probability starts at only about 0.25.

A final note: a classification tree

First, I will start with a generic classification tree with everything set to the defaults. Then I will look at a report to refine it.

```{r}
library(rpart)
library(rpart.plot)
fit.BT <- rpart(ChurnF ~ InternetService + tenure + PhoneService + Contract + TotalCharges,
    data = train, method = "class")
rpart.plot(fit.BT)
```

How well does it fit the test sample?

```{r}
test$Churn.No <- predict(fit.BT, newdata = test)[, 1]
test$Churn.PredRT <- (test$Churn.No < 0.5)
test %>%
    tabyl(Churn, Churn.PredRT)
```
 Churn FALSE TRUE
    No  1210   74
   Yes   319  157

Not very well. We can alter the tolerance for complexity using some diagnostics about the tree.

```{r}
printcp(fit.BT)
```

Classification tree:
rpart(formula = ChurnF ~ InternetService + tenure + PhoneService + 
    Contract + TotalCharges, data = train, method = "class")

Variables actually used in tree construction:
[1] Contract        InternetService tenure         

Root node error: 1393/5283 = 0.26368

n= 5283 

        CP nsplit rel error  xerror     xstd
1 0.056712      0   1.00000 1.00000 0.022991
2 0.010000      3   0.78177 0.78823 0.021172

The option cp controls a complexity parameter that keeps the tree from overfitting. I want to show a fairly complex one, so I will change it from the default of 0.01 to 0.0025.

```{r}
fit.BT.2 <- rpart(ChurnF ~ InternetService + tenure + PhoneService + Contract + TotalCharges,
    data = train, method = "class", cp = 0.0025)
rpart.plot(fit.BT.2, extra = 106)
```

```{r}
test$Churn.No <- predict(fit.BT.2, newdata = test)[, 1]
test$Churn.PredRT2 <- (test$Churn.No < 0.5)
test %>%
    tabyl(Churn, Churn.PredRT2)
```
 Churn FALSE TRUE
    No  1154  130
   Yes   251  225

In this case, we have 1379 correct with the big tree and 1367 right with the smaller tree.
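Those totals are just the diagonals of the two confusion tables; a quick check.

```{r}
1210 + 157  # default tree (cp = 0.01): 1367 correct
1154 + 225  # more complex tree (cp = 0.0025): 1379 correct
```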

Footnotes

  1. McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models. 2nd Edition, Chapman and Hall, London. http://dx.doi.org/10.1007/978-1-4899-3242-6↩︎

  2. Full disclosure, I am cheating a bit here. I don’t really want to explain the fitting of generalized linear models as it most often involves iteratively reweighted least squares. I prefer to motivate them with the far more intuitive likelihood approach though they are not, strictly speaking, identical.↩︎