Choice and Forecasting: Week 3

Robert W. Walker

Regression for Binary Variables

Why Not Linear Models

With an outcome taking two values, say Yes and No, one might think that a direct model linking changes in x to the probability of y would be ideal, given that y is either 0 or 1 – 0% or 100%. As it happens, lines are a source of trouble here because lines extend forever, while probabilities cannot escape the interval from 0 to 1.
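A tiny simulated example makes the point before we see it in real data. This is a sketch with made-up data (not the Churn data): a line fit to a 0/1 outcome happily produces "probabilities" below zero and above one.

# A line fit to 0/1 data predicts outside [0, 1] once we leave the sample
set.seed(1)
x <- seq(-3, 3, length.out = 100)
y <- rbinom(100, 1, plogis(2 * x))
range(predict(lm(y ~ x), newdata = data.frame(x = c(-5, 5))))  # below 0 and above 1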

Churn: The Data

We will work with data on Churn. For details, look on Kaggle.com – a great source of data.

Churn <- read.csv("https://github.com/robertwwalker/DADMStuff/raw/master/WA_Fn-UseC_-Telco-Customer-Churn.csv")
names(Churn)
 [1] "customerID"       "gender"           "SeniorCitizen"    "Partner"         
 [5] "Dependents"       "tenure"           "PhoneService"     "MultipleLines"   
 [9] "InternetService"  "OnlineSecurity"   "OnlineBackup"     "DeviceProtection"
[13] "TechSupport"      "StreamingTV"      "StreamingMovies"  "Contract"        
[17] "PaperlessBilling" "PaymentMethod"    "MonthlyCharges"   "TotalCharges"    
[21] "Churn"           

More on the Data

dim(Churn)
[1] 7043   21
str(Churn)
'data.frame':   7043 obs. of  21 variables:
 $ customerID      : chr  "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
 $ gender          : chr  "Female" "Male" "Male" "Male" ...
 $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Partner         : chr  "Yes" "No" "No" "No" ...
 $ Dependents      : chr  "No" "No" "No" "No" ...
 $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
 $ PhoneService    : chr  "No" "Yes" "Yes" "No" ...
 $ MultipleLines   : chr  "No phone service" "No" "No" "No phone service" ...
 $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
 $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
 $ OnlineBackup    : chr  "Yes" "No" "Yes" "No" ...
 $ DeviceProtection: chr  "No" "Yes" "No" "Yes" ...
 $ TechSupport     : chr  "No" "No" "No" "Yes" ...
 $ StreamingTV     : chr  "No" "No" "No" "No" ...
 $ StreamingMovies : chr  "No" "No" "No" "No" ...
 $ Contract        : chr  "Month-to-month" "One year" "Month-to-month" "One year" ...
 $ PaperlessBilling: chr  "Yes" "No" "Yes" "No" ...
 $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
 $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
 $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
 $ Churn           : chr  "No" "No" "Yes" "No" ...
table(Churn$Churn)

  No  Yes 
5174 1869 

Transformation to Numbers

Now to a regression model. We will need a variable type change to pull this off. Let’s have a look at the necessary transformation.

str(Churn$Churn)
 chr [1:7043] "No" "No" "Yes" "No" "Yes" "Yes" "No" "No" "Yes" "No" "No" ...
str(as.factor(Churn$Churn))
 Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
str(as.numeric(as.factor(Churn$Churn)))
 num [1:7043] 1 1 2 1 2 2 1 1 2 1 ...
Churn$Churn.Numeric <- as.numeric(as.factor(Churn$Churn)) - 1
str(Churn$Churn.Numeric)
 num [1:7043] 0 0 1 0 1 1 0 0 1 0 ...

Linear Regression

library(stargazer)
library(magrittr)
library(tidyverse)
library(skimr)
my.lm <- lm(Churn.Numeric ~ InternetService + tenure + PhoneService + Contract +
    TotalCharges, data = Churn)
summary(my.lm)

Call:
lm(formula = Churn.Numeric ~ InternetService + tenure + PhoneService + 
    Contract + TotalCharges, data = Churn)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.64458 -0.26256 -0.07269  0.35565  1.12180 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 4.274e-01  1.651e-02  25.892  < 2e-16 ***
InternetServiceFiber optic  2.601e-01  1.268e-02  20.512  < 2e-16 ***
InternetServiceNo          -1.476e-01  1.588e-02  -9.290  < 2e-16 ***
tenure                     -1.909e-03  4.642e-04  -4.112 3.97e-05 ***
PhoneServiceYes            -3.823e-02  1.769e-02  -2.161   0.0307 *  
ContractOne year           -1.385e-01  1.383e-02 -10.018  < 2e-16 ***
ContractTwo year           -1.184e-01  1.641e-02  -7.218 5.81e-13 ***
TotalCharges               -3.961e-05  5.244e-06  -7.555 4.73e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3812 on 7024 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.2562,    Adjusted R-squared:  0.2554 
F-statistic: 345.6 on 7 and 7024 DF,  p-value: < 2.2e-16

Scientific Notation

options(scipen = 6)
summary(my.lm)

Call:
lm(formula = Churn.Numeric ~ InternetService + tenure + PhoneService + 
    Contract + TotalCharges, data = Churn)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.64458 -0.26256 -0.07269  0.35565  1.12180 

Coefficients:
                               Estimate   Std. Error t value Pr(>|t|)    
(Intercept)                 0.427421784  0.016507683  25.892  < 2e-16 ***
InternetServiceFiber optic  0.260067246  0.012678756  20.512  < 2e-16 ***
InternetServiceNo          -0.147560579  0.015884435  -9.290  < 2e-16 ***
tenure                     -0.001908716  0.000464227  -4.112 3.97e-05 ***
PhoneServiceYes            -0.038230897  0.017690283  -2.161   0.0307 *  
ContractOne year           -0.138502281  0.013825977 -10.018  < 2e-16 ***
ContractTwo year           -0.118438811  0.016407760  -7.218 5.81e-13 ***
TotalCharges               -0.000039614  0.000005244  -7.555 4.73e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3812 on 7024 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.2562,    Adjusted R-squared:  0.2554 
F-statistic: 345.6 on 7 and 7024 DF,  p-value: < 2.2e-16

The Predictions?

my.lm$fitted.values %>%
    skim()
Data summary
Name Piped data
Number of rows 7032
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 0 1 0.27 0.22 -0.14 0.08 0.24 0.42 0.64 ▆▇▇▇▇

No: there are negative fitted values (the minimum is -0.14), so these cannot be probabilities. To prevent that, we need a different tool; this is the subject of Chapter 5.

Residuals?

We should also examine residuals. Using a battery of tests of the linear model assumptions, we find the model lacking on every count except constant variance [homo/heteroscedasticity].

library(gvlma)
gvlma(my.lm)

Call:
lm(formula = Churn.Numeric ~ InternetService + tenure + PhoneService + 
    Contract + TotalCharges, data = Churn)

Coefficients:
               (Intercept)  InternetServiceFiber optic  
                0.42742178                  0.26006725  
         InternetServiceNo                      tenure  
               -0.14756058                 -0.00190872  
           PhoneServiceYes            ContractOne year  
               -0.03823090                 -0.13850228  
          ContractTwo year                TotalCharges  
               -0.11843881                 -0.00003961  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma(x = my.lm) 

                     Value   p-value                   Decision
Global Stat        630.599 0.000e+00 Assumptions NOT satisfied!
Skewness           377.348 0.000e+00 Assumptions NOT satisfied!
Kurtosis            64.207 1.110e-15 Assumptions NOT satisfied!
Link Function      187.439 0.000e+00 Assumptions NOT satisfied!
Heteroscedasticity   1.605 2.051e-01    Assumptions acceptable.

Churn is a Binomial

Any given customer is conceptualized as a Bernoulli trial with probability mass function \pi^{y}(1-\pi)^{1-y}. With a willingness to believe that every Churn decision is an independent and identically distributed trial in this group of customers, overall churn is a binomial random variable with probability mass function P_{y} = {n \choose y} \pi^{y}(1-\pi)^{n-y} where

  • P_{y} is the binomial probability of y
  • y is the number of successes in n trials
  • n is the number of trials
  • \pi is the probability of success in any given trial.
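As a quick numerical check of the notation, the formula can be written out and compared against R's built-in dbinom(); the values of n and y here are illustrative, with p set near the sample churn rate:

# Binomial pmf written out versus dbinom(); illustrative values
n <- 10; y <- 3; p <- 0.27
choose(n, y) * p^y * (1 - p)^(n - y)
dbinom(y, size = n, prob = p)  # the same number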

GLM Theory

In the theory of these models, presented by Peter McCullagh and John Nelder in 1989, the link for the probabilities is what ties regression to the binomial distribution:

  • We posit that (1-\pi_{i}) = Pr(y_{i}=1|X_{i}) = 1-F(X_{i}\beta) so that \pi_{i} = Pr(y_{i}=0|X_{i}) = F(X_{i}\beta). Any well-behaved probability distribution F makes this valid.
  • There are a few ways of actually writing that; the \pi_{i} could be derived from
    • a normal distribution – the model is called probit
    • the logistic distribution – the model is named logit
    • a few others that are somewhat common: the Cauchy, the log-log, and the complementary log-log. The latter two are asymmetric and are mirror images of one another.

What we want to do is find the estimates of \beta that maximize the likelihood of the sample we observe.

A Probit Model

First, a little substitution and some notation. Label the normal probability up to X_{i}\beta as \Phi(X_{i}\beta) and the probability above X_{i}\beta as 1-\Phi(X_{i}\beta). I can substitute these into the binomial and obtain the product over the entire sample – this is known as the likelihood.

\prod^{n}_{i=1} \Phi(X_{i}\beta)^{1-y_{i}}(1-\Phi(X_{i}\beta))^{y_{i}}

Taking Logs

Taking logs yields:

\ln \mathcal{L} = \sum^{n}_{i=1} (1-y_{i})\ln \Phi(X_{i}\beta) + y_{i} \ln (1-\Phi(X_{i}\beta))

So the solution becomes

\arg \max_{\beta} \ln \mathcal{L} = \arg \max_{\beta} \sum^{n}_{i=1} (1-y_{i})\ln \Phi(X_{i}\beta) + y_{i} \ln (1-\Phi(X_{i}\beta))

In English, we want to find the values of \beta that maximize the log-likelihood of the entire sample.
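To see the machinery at work, here is a minimal sketch (simulated data, not the Churn data) that maximizes exactly this log-likelihood with optim() and compares the result to glm(). Note that the convention above treats \Phi(X_{i}\beta) as Pr(y_{i}=0), so the recovered coefficients are the negatives of glm()'s, which models Pr(y_{i}=1).

# Maximize the probit log-likelihood directly (illustrative simulation)
set.seed(42)
x <- rnorm(500)
y <- rbinom(500, 1, pnorm(-0.25 + 0.8 * x))  # Pr(y = 1) = Phi(-0.25 + 0.8x)
X <- cbind(1, x)
negLL <- function(beta) {
    p <- pnorm(X %*% beta)  # Phi(X beta), playing the role of Pr(y = 0)
    -sum((1 - y) * log(p) + y * log(1 - p))
}
optim(c(0, 0), negLL, method = "BFGS")$par  # about (0.25, -0.8)
coef(glm(y ~ x, family = binomial(link = "probit")))  # the mirror image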

Estimating a Probit Model

The outcome of interest is Churn. The model specification will call glm; let me examine Churn as a function of InternetService, tenure, PhoneService, Contract, and TotalCharges. There is one trick to deploying it: the outcome variable must be a factor type. To make the table nice, let me mutate the type to a factor and then we can model it.

Churn %<>%
    mutate(ChurnF = as.factor(Churn))
Churn %>%
    select(Churn, ChurnF) %>%
    mutate(as.numeric(ChurnF)) %>%
    head()
  Churn ChurnF as.numeric(ChurnF)
1    No     No                  1
2    No     No                  1
3   Yes    Yes                  2
4    No     No                  1
5   Yes    Yes                  2
6   Yes    Yes                  2

Estimates

my.probit <- glm(ChurnF ~ InternetService + tenure + PhoneService + Contract + TotalCharges,
    family = binomial(link = "probit"), data = Churn)
summary(my.probit)

Call:
glm(formula = ChurnF ~ InternetService + tenure + PhoneService + 
    Contract + TotalCharges, family = binomial(link = "probit"), 
    data = Churn)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5100  -0.7306  -0.3184   0.8794   3.7241  

Coefficients:
                              Estimate  Std. Error z value Pr(>|z|)    
(Intercept)                 0.13075295  0.06871592   1.903 0.057066 .  
InternetServiceFiber optic  0.72623953  0.05291228  13.725  < 2e-16 ***
InternetServiceNo          -0.45803763  0.06972801  -6.569 5.07e-11 ***
tenure                     -0.02815768  0.00300707  -9.364  < 2e-16 ***
PhoneServiceYes            -0.37177905  0.07275027  -5.110 3.22e-07 ***
ContractOne year           -0.48852366  0.05695570  -8.577  < 2e-16 ***
ContractTwo year           -0.86638328  0.08281282 -10.462  < 2e-16 ***
TotalCharges                0.00011698  0.00003235   3.616 0.000299 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8143.4  on 7031  degrees of freedom
Residual deviance: 6038.9  on 7024  degrees of freedom
  (11 observations deleted due to missingness)
AIC: 6054.9

Number of Fisher Scoring iterations: 6

AIC: Akaike Information Criterion

A measure of relative goodness of fit that penalizes parameter count: AIC = 2k - 2\ln \hat{\mathcal{L}}, where k is the number of estimated parameters and \hat{\mathcal{L}} is the maximized likelihood. Smaller values are better.

AIC Definition: see ?stats::AIC.

Observations

We can do astrology on the tables; read the stars. Fiber optic customers are more likely to Churn and those without internet service are less likely to Churn but both conditions are compared to a third category absorbed into the Constant. What is that category?

janitor::tabyl(Churn$InternetService)
 Churn$InternetService    n   percent
                   DSL 2421 0.3437456
           Fiber optic 3096 0.4395854
                    No 1526 0.2166690

DSL subscribers. It is the first in alphabetical order. That is the default option. That also means that the constant captures those on Month-to-month contracts and without phone service – the omitted category for each.
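If a different baseline is preferred, the factor can be releveled before fitting; a minimal sketch (the choice of "Fiber optic" as the reference here is arbitrary):

# Refit with Fiber optic as the omitted reference category (illustrative)
Churn %>%
    mutate(InternetService = relevel(as.factor(InternetService), ref = "Fiber optic")) %>%
    glm(ChurnF ~ InternetService + tenure + PhoneService + Contract + TotalCharges,
        family = binomial(link = "probit"), data = .) %>%
    summary()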

So what do these coefficients show?

The slopes represent the effect of a one-unit change in x on the underlying normal distribution for the probabilities. Unless one has intuition for that distribution, they come across as nonsensical. In the table above, take the example of tenure. For each unit of tenure [another month having been a customer], the underlying normal index Z \sim N(0,1) decreases by 0.028. But what that means for the probability depends on whether we are going from 0 to -0.028 or from -2 to -2.028. Remember that the standard normal has about 95% of its probability between -2 and 2, with its modal/most common value at zero.
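A short computation makes that concrete; a sketch using the tenure slope from the table above:

# The same -0.028 shift in Z moves the probability by different amounts
pnorm(0) - pnorm(0 - 0.028)    # near the mode: about 0.011
pnorm(-2) - pnorm(-2 - 0.028)  # out in the tail: about 0.0015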

A Picture of the Normal

Plotting Slopes

library(jtools)
plot_summs(my.probit, inner_ci_level = 0.95)

A Better Plot

Churn %>%
    mutate(tenure = scale(tenure), TotalCharges = scale(TotalCharges)) %>%
    glm(ChurnF ~ InternetService + tenure + PhoneService + Contract + TotalCharges,
        family = binomial(link = "probit"), data = .) %>%
    plot_summs(., inner_ci_level = 0.95)

The Trouble with Non-linear Models

I should be clear that the model does have lines; they are just lines inside of a nonlinear function – the F. The generalized linear part means that the interpretation of any one factor will depend on the values of the others. We will usually want to generate hypothetical data to understand what is really going on. After a presentation of the remaining models, I will return to my preferred method of interpretation.

Logistic Regression

The logistic distribution is the focus of the textbook chapter. To respecify the model using it, the only change in syntax is the link: we need link = "logit", which is the default.

The logistic function is given by:

\Lambda(X\beta) = \frac{e^{X\beta}}{1+e^{X\beta}}
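In R, this function is available as plogis(); a one-line check that it matches the formula (the value 0.7 is an arbitrary X\beta):

# The logistic cdf versus the formula, at an arbitrary X*beta
exp(0.7) / (1 + exp(0.7))
plogis(0.7)  # identical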

The Analytics

\arg \max_{\beta} \ln \mathcal{L} = \arg \max_{\beta} \sum^{n}_{i=1} (1-y_{i})\ln \Lambda(X_{i}\beta) + y_{i} \ln (1-\Lambda(X_{i}\beta))

One of the advantages of using the logistic distribution is that, with only categorical variables, the model can be solved analytically. The other is the interpretation of the estimates: each slope is an increment in the log-odds, i.e. \ln (\frac{\pi_{y=1}}{1-\pi_{y=1}}).
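A small numeric sketch of that interpretation, using rounded values close to the intercept and tenure estimates reported below (assumed here purely for illustration, ignoring the other covariates):

# Each extra month of tenure adds b1 to the log-odds
b0 <- 0.37; b1 <- -0.063           # rounded from the fit below
log.odds <- b0 + b1 * c(10, 11)    # log-odds at tenure = 10 and 11
diff(log.odds)                     # the increment is exactly b1
plogis(log.odds)                   # the implied probabilities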

The Estimates: Logistic Regression

my.logit <- glm(ChurnF ~ InternetService + tenure + PhoneService + Contract + TotalCharges,
    family = binomial(link = "logit"), data = Churn)
summary(my.logit)

Call:
glm(formula = ChurnF ~ InternetService + tenure + PhoneService + 
    Contract + TotalCharges, family = binomial(link = "logit"), 
    data = Churn)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5317  -0.7205  -0.3044   0.8632   3.5180  

Coefficients:
                              Estimate  Std. Error z value     Pr(>|z|)    
(Intercept)                 0.37306796  0.11748513   3.175       0.0015 ** 
InternetServiceFiber optic  1.17193200  0.09089725  12.893      < 2e-16 ***
InternetServiceNo          -0.76509292  0.12680615  -6.034 0.0000000016 ***
tenure                     -0.06255973  0.00587067 -10.656      < 2e-16 ***
PhoneServiceYes            -0.71368010  0.12649081  -5.642 0.0000000168 ***
ContractOne year           -0.87386890  0.10344255  -8.448      < 2e-16 ***
ContractTwo year           -1.78061500  0.17185288 -10.361      < 2e-16 ***
TotalCharges                0.00035586  0.00006235   5.707 0.0000000115 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8143.4  on 7031  degrees of freedom
Residual deviance: 6008.7  on 7024  degrees of freedom
  (11 observations deleted due to missingness)
AIC: 6024.7

Number of Fisher Scoring iterations: 6

Odds Ratios

exp(my.logit$coefficients)
               (Intercept) InternetServiceFiber optic 
                 1.4521830                  3.2282235 
         InternetServiceNo                     tenure 
                 0.4652907                  0.9393570 
           PhoneServiceYes           ContractOne year 
                 0.4898382                  0.4173338 
          ContractTwo year               TotalCharges 
                 0.1685345                  1.0003559 

All else equal,

  • The odds of Churning with Fiber optic, as opposed to DSL, increase by 223%.
  • The odds of Churning with No internet, as opposed to DSL, decrease by 53.5%.
  • The odds of Churning with phone service, as opposed to no phone service, are 51% lower.
  • The odds of Churning decrease by about 6% per unit of tenure [month].
  • The odds of Churning increase by 0.04% per dollar of total charges.
  • The odds of Churning decrease under contracts: compared to month-to-month, about 83% lower odds under a two-year contract and 58% lower odds under a one-year contract.
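Each of those percentages is just 100 \times (odds ratio - 1); a one-line check against the exponentiated coefficients:

# Percent change in the odds implied by each coefficient
round(100 * (exp(coef(my.logit)) - 1), 1)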

Confidence Intervals for Odds Ratios

If you choose to work with odds, then the suggestion to exponentiate the confidence intervals for the coefficients, yielding intervals for the odds ratios, is sound.

exp(confint(my.logit))
                               2.5 %    97.5 %
(Intercept)                1.1535576 1.8288933
InternetServiceFiber optic 2.7035580 3.8611249
InternetServiceNo          0.3619193 0.5951803
tenure                     0.9284409 0.9500587
PhoneServiceYes            0.3822153 0.6276984
ContractOne year           0.3400210 0.5101487
ContractTwo year           0.1190988 0.2338910
TotalCharges               1.0002350 1.0004795

Diagnostics

There are diagnostics that can be applied to these models, for example the various pseudo-r^{2} measures. This model's fit is neither terrible nor good.

library(DescTools)
DescTools::PseudoR2(my.logit, which = c("McFadden", "CoxSnell", "Nagelkerke", "Tjur"))
  McFadden   CoxSnell Nagelkerke       Tjur 
 0.2621401  0.2618213  0.3817196  0.2798359 
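McFadden's measure, for example, compares the log-likelihood of the model to that of an intercept-only model; a sketch of the arithmetic, fitting the null model on the same rows the full model used:

# McFadden's pseudo-R^2 by hand: 1 - logLik(model)/logLik(null)
null.fit <- glm(ChurnF ~ 1, family = binomial(link = "logit"),
    data = model.frame(my.logit))
1 - as.numeric(logLik(my.logit))/as.numeric(logLik(null.fit))  # about 0.26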

ROC

Churn.CC <- Churn %>%
    select(ChurnF, InternetService, tenure, PhoneService, Contract, TotalCharges) %>%
    filter(!is.na(TotalCharges))
my.logit.CC <- glm(ChurnF ~ InternetService + tenure + PhoneService + Contract +
    TotalCharges, family = binomial(link = "logit"), data = Churn.CC)
library(LogisticDx)
# get range of goodness-of-fit diagnostics
model_diagnostics <- LogisticDx::gof(my.logit.CC, plotROC = TRUE)

The ROC

The Receiver Operating Curve

It plots sensitivity against specificity. Specificity is the ability, in this case, to correctly identify non-Churners [few false positives is highly specific]; sensitivity is the ability of the test to correctly identify Churners [few false negatives is highly sensitive]. A useful mnemonic: the f in specificity is a reminder that the False test results are False for the condition, while the t in sensitivity says that True test results are True for the condition.
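At the default 0.5 cutoff, both quantities can be computed by hand from the fitted model; a sketch using my.logit.CC and Churn.CC from the previous slide:

# Sensitivity and specificity at a 0.5 cutoff, by hand
pred <- predict(my.logit.CC, type = "response") > 0.5
actual <- Churn.CC$ChurnF == "Yes"
c(sensitivity = mean(pred[actual]), specificity = mean(!pred[!actual]))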

Other Diagnostics

What all is in gof?

# returns a list
names(model_diagnostics)
[1] "ct"    "chiSq" "ctHL"  "gof"   "R2"    "auc"  
model_diagnostics$gof
         test  stat       val df         pVal
1:         HL chiSq 31.832686  8 9.979186e-05
2:        mHL     F 16.436041  9 4.896337e-27
3:       OsRo     Z  3.282456 NA 1.029070e-03
4: SstPgeq0.5     Z  6.656484 NA 2.804560e-11
5:   SstPl0.5     Z  5.983933 NA 2.178133e-09
6:    SstBoth chiSq 80.116227  2 4.008504e-18
7: SllPgeq0.5 chiSq 45.280618  1 1.707304e-11
8:   SllPl0.5 chiSq 29.794958  1 4.802393e-08
9:    SllBoth chiSq 54.741365  2 1.297369e-12

Other Binomial GLMs

my.cauchit <- glm(ChurnF ~ InternetService + tenure + PhoneService + Contract + TotalCharges,
    family = binomial(link = "cauchit"), data = Churn)
my.cloglogit <- glm(ChurnF ~ InternetService + tenure + PhoneService + Contract +
    TotalCharges, family = binomial(link = "cloglog"), data = Churn)
stargazer(my.cauchit, my.cloglogit, my.logit, my.probit, type = "html", style = "apsr")
                                          ChurnF
                            -----------------------------------------------------
                             cauchit      cloglog      logit        probit
                               (1)          (2)          (3)          (4)
----------------------------------------------------------------------------------
InternetServiceFiber optic    0.993***     0.877***     1.172***     0.726***
                             (0.094)      (0.071)      (0.091)      (0.053)
InternetServiceNo            -0.790***    -0.640***    -0.765***    -0.458***
                             (0.174)      (0.112)      (0.127)      (0.070)
tenure                       -0.123***    -0.059***    -0.063***    -0.028***
                             (0.011)      (0.005)      (0.006)      (0.003)
PhoneServiceYes              -0.705***    -0.586***    -0.714***    -0.372***
                             (0.136)      (0.099)      (0.126)      (0.073)
ContractOne year             -1.190***    -0.767***    -0.874***    -0.489***
                             (0.166)      (0.092)      (0.103)      (0.057)
ContractTwo year             -7.065***    -1.696***    -1.781***    -0.866***
                             (1.254)      (0.161)      (0.172)      (0.083)
TotalCharges                  0.001***     0.0004***    0.0004***    0.0001***
                             (0.0001)     (0.0001)     (0.0001)     (0.00003)
Constant                      0.590***    -0.024        0.373***     0.131*
                             (0.122)      (0.088)      (0.117)      (0.069)
N                             7,032        7,032        7,032        7,032
Log Likelihood               -2,997.445   -2,991.982   -3,004.328   -3,019.439
AIC                           6,010.891    5,999.964    6,024.656    6,054.878
----------------------------------------------------------------------------------
All four are binomial GLMs; the links are cauchit (1), cloglog (2), logit (3), and probit (4).
*p < .1; **p < .05; ***p < .01

Best by AIC: cloglog

library(pROC)
predicted <- predict(my.cloglogit, type = "response")
auc(Churn.CC$ChurnF, predicted, plot = TRUE)
Area under the curve: 0.8386

Residuals

d <- density(residuals(my.logit, "pearson"))
plot(d, main = "")

This is rather poor.

Predicted Probability (1/3)

I find that the most straightforward way to interpret them is with plots in the probability metric. Let me take the example of tenure.

I will need to create data for interpretation. Let’s suppose we have a DSL user with phone service on a two year contract with average TotalCharges. The last thing I need to know is what values of tenure to show.

library(skimr)
Churn %>%
    filter(InternetService == "DSL", PhoneService == "Yes", Contract == "Two year") %>%
    skim(tenure, TotalCharges)
Data summary
Name Piped data
Number of rows 467
Number of columns 23
_______________________
Column type frequency:
numeric 2
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
tenure 0 1.00 60.52 14.79 0.0 55.00 66.00 71.00 72.00 ▁▁▁▂▇
TotalCharges 3 0.99 4733.47 1382.51 130.5 3867.24 4913.88 5879.86 6859.05 ▁▂▅▇▇

Data for Prediction (2/3)

Now I can create the data and generate predictions in the probability metric of the response.

Tenure.Pred <- data.frame(InternetService = "DSL", PhoneService = "Yes", Contract = "Two year",
    TotalCharges = 4733.5, tenure = seq(0, 72, by = 1))
Tenure.Pred$Prob.Churn <- predict(my.logit, newdata = Tenure.Pred, type = "response")

Plot of Effect of Tenure

ggplot(Tenure.Pred) + aes(x = tenure, y = Prob.Churn) + geom_line() + theme_minimal()

A Fancier Plot

A Better Way of Thinking About all of This

I personally believe that the only real way to assess models for use in predictive analytics is to assess them by that criterion. That doesn't mean fitting inside the extant sample of data, but rather sampling from it and then using the model to predict what is known as a holdout sample. Let me show you what I mean. In this case, let me use the probit specification from before and a 75/25 split: I will analyse 75 percent of the data and predict the other 25 percent. I can use join-style commands to pull it off pretty simply. I have 7043 rows, so I want 5283 rows of the original data out of that 7043.

Train and Test

train <- Churn[sample(c(1:7043), size = 5283, replace = FALSE), ]
test <- Churn %>%
    anti_join(., train)
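Because sample() draws at random, the split (and every number that follows) will change from run to run; setting a seed before sampling makes the exercise reproducible. A one-line sketch (the seed value is arbitrary):

set.seed(9119)  # any integer; fix it before sampling for reproducibility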

Estimate the model

library(janitor)
mod.train <- train %>%
    glm(ChurnF ~ InternetService + tenure + PhoneService + Contract + TotalCharges,
        family = binomial(link = "probit"), data = .)

Predict on test

Predict the result on the test set that I created. I will then turn the probabilities into a best guess by whether Churn or No is more likely.

test$Pred.Probs <- predict(mod.train, newdata = test, type = "response")
test %>%
    mutate(Pred.Val = (Pred.Probs > 0.5)) %>%
    janitor::tabyl(Churn, Pred.Val, show_na = FALSE) %>%
    adorn_percentages("row")
 Churn     FALSE       TRUE
    No 0.9007037 0.09929633
   Yes 0.4810924 0.51890756

More Calibration

Now the same table with all of the totals.

test %>%
    mutate(Pred.Val = (Pred.Probs > 0.5)) %>%
    janitor::tabyl(Churn, Pred.Val, show_na = FALSE) %>%
    adorn_totals(c("row", "col"))
 Churn FALSE TRUE Total
    No  1152  127  1279
   Yes   229  247   476
 Total  1381  374  1755
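From those totals, the overall accuracy at the 0.5 cutoff is (1152 + 247)/1755, or about 0.80. A quick sketch to compute it directly from the same objects:

# Overall accuracy of the 0.5 cutoff rule on the test set
test %>%
    mutate(Pred.Val = (Pred.Probs > 0.5)) %>%
    summarise(Accuracy = mean(Pred.Val == (Churn == "Yes"), na.rm = TRUE))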

Now you might object that we can only get 50 to 55 percent of Churn='Yes' right with the model. But remember that only 26.5 percent of people Churn overall, so we have improved quite a bit over knowing nothing at all but the raw row probability. In this specific case, the probability of Yes in the test set is shown below.

test %>%
    tabyl(Churn)
 Churn    n   percent
    No 1284 0.7295455
   Yes  476 0.2704545

Quadratic Terms?

What would happen if I assume that the effect of tenure is not a line but instead has some curvature?

mod.train.SQ <- train %>%
    glm(ChurnF ~ InternetService + tenure + I(tenure^2) + PhoneService + Contract +
        TotalCharges, family = binomial(link = "probit"), data = .)
summary(mod.train.SQ)

Call:
glm(formula = ChurnF ~ InternetService + tenure + I(tenure^2) + 
    PhoneService + Contract + TotalCharges, family = binomial(link = "probit"), 
    data = .)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6125  -0.7182  -0.3253   0.7979   3.4343  

Coefficients:
                              Estimate  Std. Error z value Pr(>|z|)    
(Intercept)                 0.18820202  0.07953535   2.366    0.018 *  
InternetServiceFiber optic  0.81992753  0.06187946  13.250  < 2e-16 ***
InternetServiceNo          -0.44105955  0.08100756  -5.445 5.19e-08 ***
tenure                     -0.03946995  0.00405250  -9.740  < 2e-16 ***
I(tenure^2)                 0.00031610  0.00005755   5.492 3.97e-08 ***
PhoneServiceYes            -0.36582066  0.08331362  -4.391 1.13e-05 ***
ContractOne year           -0.48056197  0.06588209  -7.294 3.00e-13 ***
ContractTwo year           -0.95535805  0.09832546  -9.716  < 2e-16 ***
TotalCharges                0.00002354  0.00003661   0.643    0.520    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6091.5  on 5276  degrees of freedom
Residual deviance: 4546.3  on 5268  degrees of freedom
  (6 observations deleted due to missingness)
AIC: 4564.3

Number of Fisher Scoring iterations: 6

Comparison by Plot

Tenure.Pred$Prob.Churn.2 <- predict(mod.train, newdata = Tenure.Pred, type = "response")
Tenure.Pred$Prob.Churn.Sq <- predict(mod.train.SQ, newdata = Tenure.Pred, type = "response")
ggplot(Tenure.Pred) + aes(x = tenure, y = Prob.Churn.Sq) + geom_line() + geom_line(aes(y = Prob.Churn.2),
    color = "purple") + theme_minimal() + labs(y = "Pr(Churn)")

Predicting the Test Set

Does it really do better?

test$Pred.Probs.Sq <- predict(mod.train.SQ, newdata = test, type = "response")
test %>%
    mutate(Pred.Val.Sq = (Pred.Probs.Sq > 0.5)) %>%
    janitor::tabyl(Churn, Pred.Val.Sq, show_na = FALSE) %>%
    adorn_percentages("row")
 Churn     FALSE       TRUE
    No 0.9007037 0.09929633
   Yes 0.4810924 0.51890756

Not usually. Such people are really unlikely to Churn no matter what; the predicted probability starts at only about 0.25 and declines from there.

A final note: a classification tree

First, I will start with a generic classification tree with everything set to the defaults. Then I will look at a report to refine it.

library(rpart)
library(rpart.plot)
fit.BT <- rpart(ChurnF ~ InternetService + tenure + PhoneService + Contract + TotalCharges,
    data = train, method = "class")

A Plot of the Tree

rpart.plot(fit.BT)

How well does it fit the test sample?

test$Churn.No <- predict(fit.BT, newdata = test)[, 1]
test$Churn.PredRT <- (test$Churn.No < 0.5)
test %>%
    tabyl(Churn, Churn.PredRT)
 Churn FALSE TRUE
    No  1203   81
   Yes   276  200

Not very well. We can alter the tolerance for complexity using some diagnostics about the tree.

cp: Complexity

printcp(fit.BT)

Classification tree:
rpart(formula = ChurnF ~ InternetService + tenure + PhoneService + 
    Contract + TotalCharges, data = train, method = "class")

Variables actually used in tree construction:
[1] Contract        InternetService tenure         

Root node error: 1393/5283 = 0.26368

n= 5283 

       CP nsplit rel error  xerror     xstd
1 0.06748      0   1.00000 1.00000 0.022991
2 0.01000      3   0.79756 0.81407 0.021423

The option cp controls a complexity parameter that keeps the algorithm from overfitting the tree. I want to show a fairly complex one, so I will change it from the default of 0.01 to 0.0025.

Estimation

fit.BT.2 <- rpart(ChurnF ~ InternetService + tenure + PhoneService + Contract + TotalCharges,
    data = train, method = "class", cp = 0.0025)

Plot of Big Tree

rpart.plot(fit.BT.2, extra = 106)

Predictions

test$Churn.No <- predict(fit.BT.2, newdata = test)[, 1]
test$Churn.PredRT2 <- (test$Churn.No < 0.5)
test %>%
    tabyl(Churn, Churn.PredRT2)
 Churn FALSE TRUE
    No  1195   89
   Yes   259  217

Conclusions

  • Linear regression is generally inappropriate for binary outcomes.
  • Lines extend infinitely.
  • Generalized linear models come to the rescue, but require a family for the data and, in this case, a link for the probability.
  • A probit uses the normal distribution; the logit uses the logistic. There are others.
  • Fewer diagnostics exist for such models.
  • Plots of probability by input help to understand what is going on.
  • Some people like odds ratios: the exponentiated coefficients from a logit.
  • Training and test samples are an important means of maintaining parsimony and maximizing predictive accuracy.
  • Classification trees are another approach to these problems.