With an outcome taking only two values, say Yes and No, one might think the ideal would be a direct model linking a change in x to the probability of y, with y coded as 0 or 1 and probabilities running from 0% to 100%. As it happens, combining probabilities with lines is a source of trouble, because probabilities are bounded while lines extend forever.
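The summary below comes from fitting exactly such a linear probability model with lm. A minimal sketch of the call; the recode of Churn into a 0/1 Churn.Numeric is my assumption, since only the lm formula is confirmed by the output:
library(dplyr)
# assumed recode: 1 for a churner, 0 otherwise (the original step is not shown)
Churn <- Churn %>% mutate(Churn.Numeric = if_else(Churn == "Yes", 1, 0))
# linear probability model: a straight line through a 0/1 outcome
my.lm <- lm(Churn.Numeric ~ InternetService + tenure + PhoneService +
              Contract + TotalCharges, data = Churn)
summary(my.lm)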
Call:
lm(formula = Churn.Numeric ~ InternetService + tenure + PhoneService +
Contract + TotalCharges, data = Churn)
Residuals:
Min 1Q Median 3Q Max
-0.64458 -0.26256 -0.07269 0.35565 1.12180
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.274e-01 1.651e-02 25.892 < 2e-16 ***
InternetServiceFiber optic 2.601e-01 1.268e-02 20.512 < 2e-16 ***
InternetServiceNo -1.476e-01 1.588e-02 -9.290 < 2e-16 ***
tenure -1.909e-03 4.642e-04 -4.112 3.97e-05 ***
PhoneServiceYes -3.823e-02 1.769e-02 -2.161 0.0307 *
ContractOne year -1.385e-01 1.383e-02 -10.018 < 2e-16 ***
ContractTwo year -1.184e-01 1.641e-02 -7.218 5.81e-13 ***
TotalCharges -3.961e-05 5.244e-06 -7.555 4.73e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3812 on 7024 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.2562, Adjusted R-squared: 0.2554
F-statistic: 345.6 on 7 and 7024 DF, p-value: < 2.2e-16
Scientific Notation
The table above prints in scientific notation. Raising scipen tells R to prefer fixed notation when re-printing the same summary.
options(scipen = 6)
summary(my.lm)
Call:
lm(formula = Churn.Numeric ~ InternetService + tenure + PhoneService +
Contract + TotalCharges, data = Churn)
Residuals:
Min 1Q Median 3Q Max
-0.64458 -0.26256 -0.07269 0.35565 1.12180
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.427421784 0.016507683 25.892 < 2e-16 ***
InternetServiceFiber optic 0.260067246 0.012678756 20.512 < 2e-16 ***
InternetServiceNo -0.147560579 0.015884435 -9.290 < 2e-16 ***
tenure -0.001908716 0.000464227 -4.112 3.97e-05 ***
PhoneServiceYes -0.038230897 0.017690283 -2.161 0.0307 *
ContractOne year -0.138502281 0.013825977 -10.018 < 2e-16 ***
ContractTwo year -0.118438811 0.016407760 -7.218 5.81e-13 ***
TotalCharges -0.000039614 0.000005244 -7.555 4.73e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3812 on 7024 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.2562, Adjusted R-squared: 0.2554
F-statistic: 345.6 on 7 and 7024 DF, p-value: < 2.2e-16
The Predictions?
library(skimr)
my.lm$fitted.values %>% skim()
Data summary
Name: Piped data
Number of rows: 7032
Number of columns: 1
Column type frequency: numeric: 1
Group variables: None

Variable type: numeric
skim_variable n_missing complete_rate mean   sd    p0  p25  p50  p75 p100 hist
data                  0             1 0.27 0.22 -0.14 0.08 0.24 0.42 0.64 ▆▇▇▇▇
No, there are negative values (the minimum fitted value is -0.14), which cannot be probabilities. To prevent that, we need a different tool; this is the subject of Chapter 5.
Residuals?
We should also examine residuals. Using a variety of tests of the linear model assumptions, we find the model lacking on every one but constant variance [homoscedasticity].
library(gvlma)
gvlma(my.lm)
Call:
lm(formula = Churn.Numeric ~ InternetService + tenure + PhoneService +
Contract + TotalCharges, data = Churn)
Coefficients:
(Intercept) InternetServiceFiber optic
0.42742178 0.26006725
InternetServiceNo tenure
-0.14756058 -0.00190872
PhoneServiceYes ContractOne year
-0.03823090 -0.13850228
ContractTwo year TotalCharges
-0.11843881 -0.00003961
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = my.lm)
Value p-value Decision
Global Stat 630.599 0.000e+00 Assumptions NOT satisfied!
Skewness 377.348 0.000e+00 Assumptions NOT satisfied!
Kurtosis 64.207 1.110e-15 Assumptions NOT satisfied!
Link Function 187.439 0.000e+00 Assumptions NOT satisfied!
Heteroscedasticity 1.605 2.051e-01 Assumptions acceptable.
Churn is a Binomial
Any given customer is conceptualized as a Bernoulli trial with probability mass function \pi^{y}(1-\pi)^{1-y}. With a willingness to believe that every Churn decision is an independently and identically distributed trial in this group of customers, overall churn is a binomial random variable with probability mass function P_{y} = {n \choose y} \pi^{y}(1-\pi)^{n-y} (a quick numerical check follows the list) where
P_{y} is the binomial probability of y
y is the number of successes in n trials
n is the number of trials
\pi is the probability of success in any given trial.
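The promised check, confirming the mass function against R's built-in dbinom; the values n = 10, y = 3, and p = 0.265 are illustrative only:
n <- 10; y <- 3; p <- 0.265            # p plays the role of \pi
choose(n, y) * p^y * (1 - p)^(n - y)   # direct evaluation of the mass function
dbinom(y, size = n, prob = p)          # the built-in density gives the same number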
GLM Theory
In the theory of these models, presented by Peter McCullagh and John Nelder in 1989, a link function for the probabilities is what ties regression to the binomial distribution:
We posit that \pi_{i} = Pr(y_{i}=1|X_{i}) = F(X_{i}\beta) so that (1-\pi_{i}) = Pr(y_{i}=0|X_{i}) = 1-F(X_{i}\beta). If F is some well-behaved probability distribution, then this yields a valid model for the probabilities.
There are a few ways of actually writing that; the \pi_{i} could be derived from
a normal distribution – called probit
the logistic distribution – the model is named logit
there are a few others that are somewhat common: the Cauchy, the log-log, and the complementary log-log. The latter two are asymmetric, and each is the mirror image of the other.
What we want to do is find the estimates of \beta that maximize the likelihood of the sample we observe.
A Probit Model
First, a little substitution and some notation. Let me label the normal probability up to X_{i}\beta as \Phi(X_{i}\beta) and the probability above X_{i}\beta as 1-\Phi(X_{i}\beta). I can substitute these into the binomial and obtain the product over the entire sample – this is known as the likelihood.
In English, we want to find the values of \beta that maximize the log-likelihood of the entire sample.
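Written out under the model above, that log-likelihood is the standard binomial one:
\ln L(\beta) = \sum_{i=1}^{n} \left[ y_{i} \ln \Phi(X_{i}\beta) + (1-y_{i}) \ln \left( 1-\Phi(X_{i}\beta) \right) \right]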
Estimating a Probit Model
The outcome of interest is Churn. The model specification will call glm; let me examine Churn as a function of InternetService, tenure, PhoneService, Contract, and TotalCharges. There is one trick to deploying it: the outcome variable must be a factor type. To make the table nice, let me mutate the type to a factor and then we can model it.
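A minimal sketch of those two steps; the factor recode and the object name my.probit are my assumptions, while the glm call itself is confirmed by the output below:
# the binomial family needs a factor outcome; the first level is treated as failure
Churn <- Churn %>% mutate(ChurnF = factor(Churn))
my.probit <- glm(ChurnF ~ InternetService + tenure + PhoneService +
                   Contract + TotalCharges,
                 family = binomial(link = "probit"), data = Churn)
summary(my.probit)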
Call:
glm(formula = ChurnF ~ InternetService + tenure + PhoneService +
Contract + TotalCharges, family = binomial(link = "probit"),
data = Churn)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5100 -0.7306 -0.3184 0.8794 3.7241
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.13075295 0.06871592 1.903 0.057066 .
InternetServiceFiber optic 0.72623953 0.05291228 13.725 < 2e-16 ***
InternetServiceNo -0.45803763 0.06972801 -6.569 5.07e-11 ***
tenure -0.02815768 0.00300707 -9.364 < 2e-16 ***
PhoneServiceYes -0.37177905 0.07275027 -5.110 3.22e-07 ***
ContractOne year -0.48852366 0.05695570 -8.577 < 2e-16 ***
ContractTwo year -0.86638328 0.08281282 -10.462 < 2e-16 ***
TotalCharges 0.00011698 0.00003235 3.616 0.000299 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8143.4 on 7031 degrees of freedom
Residual deviance: 6038.9 on 7024 degrees of freedom
(11 observations deleted due to missingness)
AIC: 6054.9
Number of Fisher Scoring iterations: 6
AIC: Akaike Information Criterion
A penalized measure of goodness of fit; lower values indicate a better trade-off between fit and model complexity.
Observations
We can do astrology on the tables; read the stars. Fiber optic customers are more likely to Churn and those without internet service are less likely to Churn, but both conditions are compared to a third category absorbed into the Constant. What is that category?
janitor::tabyl(Churn$InternetService)
Churn$InternetService n percent
DSL 2421 0.3437456
Fiber optic 3096 0.4395854
No 1526 0.2166690
DSL subscribers: DSL is first in alphabetical order, which is R's default choice of reference level. That also means that the constant captures those on Month-to-month contracts and without phone service – the omitted category for each.
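If a different baseline is more natural, relevel the factor before fitting; a small sketch, where the choice of "Fiber optic" as reference and the name Churn.Refit are purely illustrative:
# make Fiber optic the omitted/reference category instead of DSL
Churn.Refit <- Churn %>%
  mutate(InternetService = relevel(factor(InternetService), ref = "Fiber optic"))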
So what do these coefficients show?
The slopes represent the effect of a one-unit change in x on the underlying distribution for the probabilities. Unless one has intuition for those distributions, they come across as nonsensical. In the table above, let me take the example of tenure. For each unit of tenure [another month having been a customer], the latent standard normal index Z \sim N(0,1) decreases by 0.028. But what that means for the probability depends on whether we are going from 0 to -0.028 or from -2 to -2.028. Remember the standard normal has about 95% of its probability between -2 and 2 and has its modal/most common value at zero.
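To make that baseline dependence concrete, compare what the same 0.028 shift does to the probability at those two starting points:
pnorm(0) - pnorm(-0.028)      # near the mode: about a 1.1 point drop
pnorm(-2) - pnorm(-2.028)     # out in the tail: only about a 0.15 point drop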
I should be clear that the model does have lines; they are just lines inside of a nonlinear function – the F. The generalized linear part means that the interpretation of any one factor will depend on the values of the others. We will usually want to generate hypothetical data to understand what is really going on. After a presentation of the remaining models, I will return to my preferred method of interpretation.
Logistic Regression
The logistic distribution is the focus of the textbook chapter. To respecify the model using it, the only change in syntax is the link: we need link = "logit", which is in fact the default for the binomial family.
One advantage of using the logistic distribution is that, with only categorical predictors, the model can be solved analytically. The other is the interpretation of the estimates; each slope is an increment in the log-odds, \ln (\frac{\pi_{y=1}}{1-\pi_{y=1}}).
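The respecified call, fit and summarized below; the right-hand side is unchanged, and the object name my.logit is the one used in the later code:
my.logit <- glm(ChurnF ~ InternetService + tenure + PhoneService +
                  Contract + TotalCharges,
                family = binomial(link = "logit"), data = Churn)
summary(my.logit)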
Call:
glm(formula = ChurnF ~ InternetService + tenure + PhoneService +
Contract + TotalCharges, family = binomial(link = "logit"),
data = Churn)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5317 -0.7205 -0.3044 0.8632 3.5180
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.37306796 0.11748513 3.175 0.0015 **
InternetServiceFiber optic 1.17193200 0.09089725 12.893 < 2e-16 ***
InternetServiceNo -0.76509292 0.12680615 -6.034 0.0000000016 ***
tenure -0.06255973 0.00587067 -10.656 < 2e-16 ***
PhoneServiceYes -0.71368010 0.12649081 -5.642 0.0000000168 ***
ContractOne year -0.87386890 0.10344255 -8.448 < 2e-16 ***
ContractTwo year -1.78061500 0.17185288 -10.361 < 2e-16 ***
TotalCharges 0.00035586 0.00006235 5.707 0.0000000115 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8143.4 on 7031 degrees of freedom
Residual deviance: 6008.7 on 7024 degrees of freedom
(11 observations deleted due to missingness)
AIC: 6024.7
Number of Fisher Scoring iterations: 6
Odds Ratios
exp(my.logit$coefficients)
(Intercept) InternetServiceFiber optic
1.4521830 3.2282235
InternetServiceNo tenure
0.4652907 0.9393570
PhoneServiceYes ContractOne year
0.4898382 0.4173338
ContractTwo year TotalCharges
0.1685345 1.0003559
All else equal,
The odds of Churning with Fiber optic, as opposed to DSL, increase by 223%.
The odds of Churning with No internet, as opposed to DSL, decrease by 53.5%.
The odds of Churning with Phone service, as opposed to no phone service, are 51% lower.
The odds of Churning decrease by about 6% per unit of tenure [month].
The odds of Churning increase by about 0.04% per dollar of total charges.
The odds of Churning decrease under contracts: compared to Month-to-month, about 83% lower odds under a two-year contract and 58% lower odds under a one-year contract.
All of these percentages come straight from the odds ratios, as shown below.
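Each figure is just the odds ratio re-expressed as a percent change, which one line reproduces:
# percent change in the odds per one-unit change in each predictor
round((exp(coef(my.logit)) - 1) * 100, 1)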
Confidence Intervals for Odds Ratios
If you choose to work with odds, then the standard suggestion is sound: exponentiate the endpoints of the confidence intervals for the coefficients to obtain confidence intervals for the odds ratios.
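A minimal sketch with the fitted logit; confint on a glm reports profile-likelihood intervals (in older R versions the method comes from MASS):
# 95% confidence intervals, translated to the odds-ratio scale
exp(confint(my.logit))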
Churn.CC <- Churn %>%
  select(ChurnF, InternetService, tenure, PhoneService, Contract, TotalCharges) %>%
  filter(!is.na(TotalCharges))
my.logit.CC <- glm(ChurnF ~ InternetService + tenure + PhoneService + Contract + TotalCharges,
                   family = binomial(link = "logit"), data = Churn.CC)
library(LogisticDx)
# get a range of goodness-of-fit diagnostics
model_diagnostics <- LogisticDx::gof(my.logit.CC, plotROC = TRUE)
The ROC
The Receiver Operating Curve
It plots sensitivity against specificity across classification thresholds. Specificity is the ability, in this case, to correctly identify non-Churners [few false positives is highly specific]; sensitivity is the ability of the test to correctly identify Churners [few false negatives is highly sensitive]. A useful mnemonic: the f in specificity is a reminder that the False test results are False for the condition, while the t in sensitivity is a reminder that True test results are True for the condition.
library(pROC)
predicted <- predict(my.logit.CC, type = "response")
auc(Churn.CC$ChurnF, predicted, plot = TRUE)
Area under the curve: 0.8386
Residuals
d <- density(residuals(my.logit, "pearson"))
plot(d, main = "")
This is rather poor; the residual density is far from the symmetric shape we would like.
Predicted Probability (1/3)
I find that the most straightforward way to interpret them is with plots in the probability metric. Let me take the example of tenure.
I will need to create data for interpretation. Let’s suppose we have a DSL user with phone service on a two year contract with average TotalCharges. The last thing I need to know is what values of tenure to show.
library(skimr)
Churn %>%
  filter(InternetService == "DSL", PhoneService == "Yes", Contract == "Two year") %>%
  skim(tenure, TotalCharges)
Data summary
Name: Piped data
Number of rows: 467
Number of columns: 23
Column type frequency: numeric: 2
Group variables: None

Variable type: numeric
skim_variable n_missing complete_rate    mean      sd    p0     p25     p50     p75    p100 hist
tenure                0          1.00   60.52   14.79   0.0   55.00   66.00   71.00   72.00 ▁▁▁▂▇
TotalCharges          3          0.99 4733.47 1382.51 130.5 3867.24 4913.88 5879.86 6859.05 ▁▂▅▇▇
Data for Prediction (2/3)
Now I can create the data and generate predictions in the probability metric of the response.
Tenure.Pred <- data.frame(InternetService = "DSL", PhoneService = "Yes",
                          Contract = "Two year", TotalCharges = 4733.5,
                          tenure = seq(0, 72, by = 1))
Tenure.Pred$Prob.Churn <- predict(my.logit, newdata = Tenure.Pred, type = "response")
Plot of Effect of Tenure
library(ggplot2)
ggplot(Tenure.Pred) + aes(x = tenure, y = Prob.Churn) + geom_line() + theme_minimal()
A Fancier Plot
A Better Way of Thinking About all of This
I personally believe that the only real way to assess models for use in predictive analytics is to assess them by that criterion. That doesn't mean fitting inside the extant sample of data, but rather sampling from it and then using the model to predict what is known as a holdout sample. Let me show you what I mean. In this case, let me use the probit and logit models from before and a 75/25 split: I will analyse 75 percent and predict the other 25 percent. I can use join-style commands to pull it off pretty simply, as sketched below. I have 7043 rows, so I want 5283 rows of the original data out of that 7043.
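A sketch of that split; slice_sample and anti_join are my choices here, and using customerID as the join key is an assumption about the Churn data:
set.seed(1234)  # any fixed seed makes the split reproducible; this value is arbitrary
train <- Churn %>% slice_sample(n = 5283)
test  <- Churn %>% anti_join(train, by = "customerID")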
Churn FALSE TRUE Total
No 1152 127 1279
Yes 229 247 476
Total 1381 374 1755
Now you might scoff that we can only correctly identify 50 to 55 percent of the Churn='Yes' cases with the model. But remember that only 26.5 percent of people Churn overall, so we have improved quite a bit over knowing nothing at all but the raw row probability. In this specific case, the probability of Yes in the test set is shown below.
test %>% tabyl(Churn)
Churn n percent
No 1284 0.7295455
Yes 476 0.2704545
Quadratic Terms?
What would happen if I assume that the effect of tenure is not a line but instead has some curvature?
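One way to sketch it is to add a quadratic in tenure to the logit; the object name my.logit.sq is hypothetical:
# same specification, but with a second-degree polynomial in tenure
my.logit.sq <- glm(ChurnF ~ InternetService + poly(tenure, 2) + PhoneService +
                     Contract + TotalCharges,
                   family = binomial(link = "logit"), data = Churn)
summary(my.logit.sq)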
The option cp controls a complexity parameter that keeps the tree from overfitting the data. I want to show a fairly complex tree, so I will change cp from the default of 0.01 to 0.0025.