Choice and Forecasting: Week 7

Robert W. Walker

The “white paper” I shared

Thinking through Florida’s evidence on vaccines and heart issues.

Survival Models

There are two classes of models commonly used for survival data. The text focuses on the Cox model, though there are also AFT (accelerated failure time) models for parametric analysis of survival-time data. First, consider the data structure.

Data for Survival models

The most common applications are in statistics for life tables. How long do individuals live beyond some intervention? In people analytics, how long do employees stay with a firm or how long do customers maintain a relationship with a firm or service provider?

Two Classes of Survival Models

  • Time-varying covariates: Panel Data/Grouped Duration Data
  • Time-constant covariates: Spells Data (see the sketch below)
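
A minimal sketch of the two outcome encodings in the survival package (toy values, not the course data):

library(survival)
# Spells data: one row per unit -- a duration and an event indicator
Surv(time = c(5, 12, 7), event = c(1, 0, 1))
# Grouped/panel data: one row per unit-period in counting-process
# (start, stop] form, so covariates can change across a unit's rows
Surv(time = c(0, 5, 0), time2 = c(5, 9, 7), event = c(0, 1, 1))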

Some Survival Time Data

There will be two key components: events and time.

# Job retention data accompanying the people analytics text
url <- "http://peopleanalytics-regression-book.org/data/job_retention.csv"
job_retention <- read.csv(url)
library(survival)
# The survival outcome pairs the event indicator (left) with the time observed (month)
retention <- Surv(event = job_retention$left, time = job_retention$month)

A Time-Varying Covariates Example

library(haven)
# Dyad-year data on militarized disputes from Beck, Katz, and Tucker (1998)
BKT.Data <- read_dta("./img/bkt98ajps.dta")
BKT.Data
# A tibble: 20,990 × 53
   dispute   dem  growth allies contig capratio    trade    py  pys1  pys2  pys3
     <dbl> <dbl>   <dbl>  <dbl>  <dbl>    <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl>
 1       1    -9  -2.5        0      1     2.66 0            1     0     0     0
 2       0    -7   2.35       0      1     1.93 0.00200      1     0     0     0
 3       0    -3  -5.27       0      0    10.7  0            1     0     0     0
 4       0     5   2.29       0      0   160.   0            1     0     0     0
 5       0    -9 -13.6        0      1     1.07 0            1     0     0     0
 6       0    -7   0.320      0      1     1.19 0.00300      1     0     0     0
 7       0    -7   2.10       1      1     1.91 0.00460      1     0     0     0
 8       0    -8   4.14       1      1    26.7  0            1     0     0     0
 9       1    -7  -0.230      0      1     1.14 0.000600     1     0     0     0
10       1    -9   2.47       1      1     1.42 0            1     0     0     0
# … with 20,980 more rows, and 42 more variables: contdisp <dbl>,
#   prefail <dbl>, tzero <dbl>, `_st` <dbl>, `_d` <dbl>, `_t` <dbl>,
#   `_t0` <dbl>, k01 <dbl>, k02 <dbl>, k03 <dbl>, k04 <dbl>, k05 <dbl>,
#   k06 <dbl>, k07 <dbl>, k08 <dbl>, k09 <dbl>, k010 <dbl>, k011 <dbl>,
#   k012 <dbl>, k013 <dbl>, k014 <dbl>, k015 <dbl>, k016 <dbl>, k017 <dbl>,
#   k018 <dbl>, k019 <dbl>, k020 <dbl>, k021 <dbl>, k022 <dbl>, k023 <dbl>,
#   k024 <dbl>, k025 <dbl>, k026 <dbl>, k027 <dbl>, k028 <dbl>, k029 <dbl>, …

On Censoring

What happens when the event has yet to occur by the time observation ends? The observation is censored; these are the values marked with + in the output below. Data collection ended before the event happened.

unique(retention)
 [1]  1  12+  5   2   3   6   8   4   8+  4+ 11  10   9   7+  5+  3+  7   9+ 11+
[20] 12  10+  6+  2+  1+

What’s the logic? We know the unit survived at least that long.

A First Very Simple Model

# Dichotomize sentiment: scores of 7 or above count as "High"
job_retention$sentiment_category <- ifelse(job_retention$sentiment >= 7, "High",
    "Not High")
# Kaplan-Meier estimates of retention by sentiment category
kmestimate_sentimentcat <- survival::survfit(formula = Surv(event = left, time = month) ~
    sentiment_category, data = job_retention)
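
survfit computes the Kaplan-Meier estimator,

\hat{S}(t) = \prod_{i:\, t_{i} \leq t}\left(1 - \frac{d_{i}}{n_{i}}\right)

where n_{i} is the number at risk and d_{i} the number of events at time t_{i}. As a check against the first row of the output below: for the High group at month 1, 1 - 15/3225 \approx 0.995.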

The output

summary(kmestimate_sentimentcat)
Call: survfit(formula = Surv(event = left, time = month) ~ sentiment_category, 
    data = job_retention)

                sentiment_category=High 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1   3225      15    0.995 0.00120        0.993        0.998
    2   3167      62    0.976 0.00272        0.971        0.981
    3   3075     120    0.938 0.00429        0.929        0.946
    4   2932     102    0.905 0.00522        0.895        0.915
    5   2802      65    0.884 0.00571        0.873        0.895
    6   2700     144    0.837 0.00662        0.824        0.850
    7   2532     110    0.801 0.00718        0.787        0.815
    8   2389     140    0.754 0.00778        0.739        0.769
    9   2222     112    0.716 0.00818        0.700        0.732
   10   2077      56    0.696 0.00835        0.680        0.713
   11   1994     134    0.650 0.00871        0.633        0.667
   12   1827      45    0.634 0.00882        0.617        0.651

                sentiment_category=Not High 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1    545       9    0.983 0.00546        0.973        0.994
    2    532      28    0.932 0.01084        0.911        0.953
    3    500      25    0.885 0.01373        0.859        0.912
    4    472      29    0.831 0.01618        0.800        0.863
    5    438      21    0.791 0.01758        0.757        0.826
    6    411      27    0.739 0.01906        0.703        0.777
    7    379      18    0.704 0.01987        0.666        0.744
    8    357      21    0.662 0.02065        0.623        0.704
    9    330      24    0.614 0.02136        0.574        0.658
   10    302      15    0.584 0.02171        0.543        0.628
   11    283      24    0.534 0.02209        0.493        0.579
   12    254       8    0.517 0.02218        0.476        0.563

Plotting Survival Data

The canonical plot is the Kaplan-Meier survival curve.

library(survminer)

# show survival curves with the log-rank test p-value and confidence intervals
survminer::ggsurvplot(kmestimate_sentimentcat, pval = TRUE, conf.int = TRUE, palette = c("blue",
    "red"), linetype = c("solid", "dashed"), xlab = "Month", ylab = "Retention Rate")

The Cox Model

The key to such modelling is the definition of a hazard function. Let h(t) be the instantaneous rate at which events occur at time t among units that have survived to t. The Cox model writes:

h(t) = h_{0}(t)\exp(\beta_{1}x_{1} + \beta_{2}x_{2} + \ldots + \beta_{k}x_{k})

Proportional Hazards

The covariates scale the baseline hazard h_{0}(t) multiplicatively, by a factor that does not depend on time; this is the proportional hazards (PH) assumption.
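
Under that assumption, the hazard ratio for a one-unit increase in x_{1} is constant over time:

\frac{h(t \mid x_{1}+1, x_{2}, \ldots, x_{k})}{h(t \mid x_{1}, x_{2}, \ldots, x_{k})} = \exp(\beta_{1})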

Estimating the Model

The outcome has two parts: the event indicator and the time. Let’s model retention as a function of gender, field, level, and sentiment.

cox_model <- survival::coxph(formula = Surv(event = left, time = month) ~ gender +
    field + level + sentiment, data = job_retention)

A Summary

summary(cox_model)
Call:
survival::coxph(formula = Surv(event = left, time = month) ~ 
    gender + field + level + sentiment, data = job_retention)

  n= 3770, number of events= 1354 

                           coef exp(coef) se(coef)      z Pr(>|z|)    
genderM                -0.04548   0.95553  0.05886 -0.773 0.439647    
fieldFinance            0.22334   1.25025  0.06681  3.343 0.000829 ***
fieldHealth             0.27830   1.32089  0.12890  2.159 0.030849 *  
fieldLaw                0.10532   1.11107  0.14515  0.726 0.468086    
fieldPublic/Government  0.11499   1.12186  0.08899  1.292 0.196277    
fieldSales/Marketing    0.08776   1.09173  0.10211  0.859 0.390082    
levelLow                0.14813   1.15967  0.09000  1.646 0.099799 .  
levelMedium             0.17666   1.19323  0.10203  1.732 0.083362 .  
sentiment              -0.11756   0.88909  0.01397 -8.415  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                       exp(coef) exp(-coef) lower .95 upper .95
genderM                   0.9555     1.0465    0.8514    1.0724
fieldFinance              1.2502     0.7998    1.0968    1.4252
fieldHealth               1.3209     0.7571    1.0260    1.7005
fieldLaw                  1.1111     0.9000    0.8360    1.4767
fieldPublic/Government    1.1219     0.8914    0.9423    1.3356
fieldSales/Marketing      1.0917     0.9160    0.8937    1.3336
levelLow                  1.1597     0.8623    0.9721    1.3834
levelMedium               1.1932     0.8381    0.9770    1.4574
sentiment                 0.8891     1.1248    0.8651    0.9138

Concordance= 0.578  (se = 0.008 )
Likelihood ratio test= 89.18  on 9 df,   p=2e-15
Wald test            = 94.95  on 9 df,   p=<2e-16
Score (logrank) test = 95.31  on 9 df,   p=<2e-16
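
To interpret, each additional point of sentiment multiplies the hazard of leaving by about 0.889, an 11% reduction, holding the other covariates fixed. A quick check against the exp(coef) column above:

exp(coef(cox_model)["sentiment"])
# sentiment 
# 0.8890854 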

Examining PH

As a table (the null hypothesis is that proportional hazards holds, so large p-values show no evidence of a violation):

(ph_check <- survival::cox.zph(cox_model))
           chisq df    p
gender     0.726  1 0.39
field      6.656  5 0.25
level      2.135  2 0.34
sentiment  1.828  1 0.18
GLOBAL    11.156  9 0.27

Or Graphically

survminer::ggcoxzph(ph_check, font.main = 10, font.x = 10, font.y = 10)

Models of Frailty

Frailty models combine the insights of hierarchical models from the last chapter with survival analysis.

A frailty model augments the baseline hazard with a parametric random effect. In this example, let’s allow units to share a frailty by field.
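
A sketch of the shared gamma frailty specification, where each group j (here, a field) carries a multiplicative frailty u_{j}:

h_{ij}(t) = u_{j}\, h_{0}(t)\exp(\beta_{1}x_{1} + \ldots + \beta_{k}x_{k}), \qquad u_{j} \sim \text{Gamma}\left(\frac{1}{\theta}, \frac{1}{\theta}\right)

so that E(u_{j}) = 1 and Var(u_{j}) = \theta. The test of \theta = 0 asks whether fields vary in baseline risk.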

Estimation

library(frailtypack)

# Penalized-likelihood estimation: n.knots and kappa govern the spline basis
# and smoothing penalty for the baseline hazard
(frailty_model <- frailtypack::frailtyPenal(formula = Surv(event = left, time = month) ~
    gender + level + sentiment + cluster(field), data = job_retention, n.knots = 12,
    kappa = 10000))

Be patient. The program is computing ... 
The program took 0.47 seconds 
Call:
frailtypack::frailtyPenal(formula = Surv(event = left, time = month) ~ 
    gender + level + sentiment + cluster(field), data = job_retention, 
    n.knots = 12, kappa = 10000)


  Shared Gamma Frailty model parameter estimates  
  using a Penalized Likelihood on the hazard function 

                 coef exp(coef) SE coef (H) SE coef (HIH)         z          p
genderM     -0.029530  0.970902   0.0591105     0.0591105 -0.499574 6.1738e-01
levelLow     0.198549  1.219632   0.0914449     0.0914449  2.171245 2.9913e-02
levelMedium  0.223268  1.250155   0.1032712     0.1032712  2.161954 3.0622e-02
sentiment   -0.108262  0.897393   0.0141327     0.0141327 -7.660398 1.8541e-14

        chisq df global p
level 5.32143  2   0.0699

    Frailty parameter, Theta: 48.3212 (SE (H): 25.6198 ) p = 0.029642 
 
      penalized marginal log-likelihood = -5510.36
      Convergence criteria: 
      parameters = 9.61e-06 likelihood = 1.48e-06 gradient = 1.73e-09 

      LCV = the approximate likelihood cross-validation criterion
            in the semi parametrical case     = 1.46587 

      n= 3770
      n events= 1354  n groups= 6
      number of iterations:  17 

      Exact number of knots used:  12 
      Value of the smoothing parameter:  10000, DoF:  6.31

Two comments. First, is there evidence of variation across groups? That is the test of the Theta parameter above. Second, do the coefficient estimates change?

Comparison

Careful: the frailty model has no field coefficients (field is the cluster), so the two coefficient vectors must be aligned by name before comparing.

shared <- c("genderM", "levelLow", "levelMedium", "sentiment")
round(exp(cbind(frailty = frailty_model$coef, cox = coef(cox_model)[shared])), 4)
            frailty    cox
genderM      0.9709 0.9555
levelLow     1.2196 1.1597
levelMedium  1.2502 1.1932
sentiment    0.8974 0.8891

The story is stable: sentiment lowers the hazard of leaving in both models, while the level effects grow slightly once field-level frailty is absorbed.

Time-Varying Covariates

The counting-process form of Surv takes a start time, a stop time, and the event indicator, so covariates may vary across the rows that make up a unit’s history.

options(scipen = 7)
cox_model <- survival::coxph(formula = Surv(`_t0`, `_t`, dispute) ~ dem + growth +
    allies + contig + capratio + trade, data = BKT.Data)
summary(cox_model)
Call:
survival::coxph(formula = Surv(`_t0`, `_t`, dispute) ~ dem + 
    growth + allies + contig + capratio + trade, data = BKT.Data)

  n= 20990, number of events= 947 

                  coef     exp(coef)      se(coef)      z Pr(>|z|)    
dem       -0.049197226   0.951993354   0.007229125 -6.805 1.01e-11 ***
growth    -0.007580611   0.992448050   0.007492787 -1.012    0.312    
allies    -0.422180458   0.655615717   0.076717663 -5.503 3.73e-08 ***
contig     0.544156705   1.723154641   0.076855065  7.080 1.44e-12 ***
capratio  -0.003017259   0.996987288   0.000403396 -7.480 7.45e-14 ***
trade    -12.292159040   0.000004588  10.071693690 -1.220    0.222    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

           exp(coef)  exp(-coef) lower .95 upper .95
dem      0.951993354      1.0504 9.386e-01    0.9656
growth   0.992448050      1.0076 9.780e-01    1.0071
allies   0.655615717      1.5253 5.641e-01    0.7620
contig   1.723154641      0.5803 1.482e+00    2.0033
capratio 0.996987288      1.0030 9.962e-01    0.9978
trade    0.000004588 217980.0971 1.226e-14 1716.4232

Concordance= 0.703  (se = 0.01 )
Likelihood ratio test= 381.8  on 6 df,   p=<2e-16
Wald test            = 238.3  on 6 df,   p=<2e-16
Score (logrank) test = 267.9  on 6 df,   p=<2e-16
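
For interpretation, contiguous dyads face a dispute hazard about 1.72 times that of non-contiguous dyads; checking against the exp(coef) column above:

exp(coef(cox_model)["contig"])
#   contig 
# 1.723155 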

Binary Choice Model

Grouped-duration data like these can also be modeled as binary choices. In the spirit of Beck, Katz, and Tucker (1998), dummies for py (years since the last dispute) capture duration dependence, and the cloglog link preserves a proportional-hazards interpretation.
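
Under the cloglog link, the probability of an event in period t is

\Pr(y_{it} = 1) = 1 - \exp\left(-\exp(\beta_{1}x_{1} + \ldots + \beta_{k}x_{k} + \alpha_{t})\right)

so \exp(\beta) retains a hazard-ratio interpretation, and the \alpha_{t} (the py dummies) trace out the baseline hazard.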

library(haven)
BKT.Data <- read_dta("https://github.com/robertwwalker/xaringan/raw/master/CMF-Week-6/img/bkt98ajps.dta")
cloglog_model <- glm(dispute ~ dem + growth + allies + contig + capratio + trade +
    as.factor(py), data = BKT.Data, family = binomial(link = "cloglog"))
summary(cloglog_model)

Call:
glm(formula = dispute ~ dem + growth + allies + contig + capratio + 
    trade + as.factor(py), family = binomial(link = "cloglog"), 
    data = BKT.Data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4183  -0.2282  -0.1420  -0.0828   3.9654  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -1.113947   0.082348 -13.527  < 2e-16 ***
dem              -0.049384   0.007288  -6.776 1.24e-11 ***
growth           -0.008082   0.007612  -1.062 0.288357    
allies           -0.430045   0.078126  -5.504 3.70e-08 ***
contig            0.554479   0.077111   7.191 6.45e-13 ***
capratio         -0.003023   0.000405  -7.464 8.39e-14 ***
trade           -12.503176   9.960629  -1.255 0.209385    
as.factor(py)2   -1.797014   0.130842 -13.734  < 2e-16 ***
as.factor(py)3   -1.910023   0.145076 -13.166  < 2e-16 ***
as.factor(py)4   -2.130485   0.165704 -12.857  < 2e-16 ***
as.factor(py)5   -2.378740   0.193902 -12.268  < 2e-16 ***
as.factor(py)6   -2.836553   0.246354 -11.514  < 2e-16 ***
as.factor(py)7   -2.831567   0.246584 -11.483  < 2e-16 ***
as.factor(py)8   -2.757412   0.239957 -11.491  < 2e-16 ***
as.factor(py)9   -2.701098   0.239710 -11.268  < 2e-16 ***
as.factor(py)10  -2.990651   0.280582 -10.659  < 2e-16 ***
as.factor(py)11  -3.207771   0.319289 -10.047  < 2e-16 ***
as.factor(py)12  -3.077872   0.305041 -10.090  < 2e-16 ***
as.factor(py)13  -3.670483   0.410654  -8.938  < 2e-16 ***
as.factor(py)14  -4.310370   0.578624  -7.449 9.38e-14 ***
as.factor(py)15  -4.329282   0.578841  -7.479 7.48e-14 ***
as.factor(py)16  -3.209284   0.336350  -9.542  < 2e-16 ***
as.factor(py)17  -3.973393   0.501858  -7.917 2.43e-15 ***
as.factor(py)18  -5.339488   1.000841  -5.335 9.55e-08 ***
as.factor(py)19  -3.367405   0.381031  -8.838  < 2e-16 ***
as.factor(py)20  -3.704984   0.449627  -8.240  < 2e-16 ***
as.factor(py)21  -4.552930   0.708058  -6.430 1.27e-10 ***
as.factor(py)22  -4.069630   0.579535  -7.022 2.18e-12 ***
as.factor(py)23  -3.894665   0.579557  -6.720 1.82e-11 ***
as.factor(py)24  -3.761143   0.579762  -6.487 8.73e-11 ***
as.factor(py)25 -17.294682 317.920692  -0.054 0.956617    
as.factor(py)26  -2.758181   0.411827  -6.697 2.12e-11 ***
as.factor(py)27 -17.282784 369.870471  -0.047 0.962731    
as.factor(py)28 -17.251363 368.516083  -0.047 0.962662    
as.factor(py)29  -3.702860   0.709164  -5.221 1.78e-07 ***
as.factor(py)30  -3.209144   0.579970  -5.533 3.14e-08 ***
as.factor(py)31  -4.201176   1.002675  -4.190 2.79e-05 ***
as.factor(py)32  -3.364920   0.710748  -4.734 2.20e-06 ***
as.factor(py)33  -3.869193   1.001648  -3.863 0.000112 ***
as.factor(py)34  -3.112066   0.709943  -4.384 1.17e-05 ***
as.factor(py)35  -3.651929   1.003808  -3.638 0.000275 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 7719.2  on 20989  degrees of freedom
Residual deviance: 5108.1  on 20949  degrees of freedom
AIC: 5190.1

Number of Fisher Scoring iterations: 17
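
The enormous standard errors at py values 25, 27, and 28 are a symptom of separation: no disputes occur at those peace-year durations, so those dummies are driven toward minus infinity. A quick check (a sketch, assuming BKT.Data is loaded as above):

with(BKT.Data, table(py, dispute))[c("25", "27", "28"), ]
# the dispute = 1 column should be zero in each of these rows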

Visualizing the Strata

stratified_base <- frailtypack::frailtyPenal(formula = Surv(event = left, time = month) ~
    strata(sentiment_category), data = job_retention, n.knots = 12, kappa = rep(5000,
    2))

Be patient. The program is computing ... 
The program took 0.4 seconds 

Visual

plot(stratified_base, type.plot = "Survival", pos.legend = "topright", Xlab = "Month",
    Ylab = "Baseline retention rate", color = 1)

A Result on Baseline Hazards

Carter and Signorino (2010) argue that a cubic polynomial in time is sufficient for the analysis of most baseline hazards. Because it is cubic, this requires only inserting time, time squared, and time cubed among the predictors of a binary choice model with the cloglog link. Note that poly(py, 3) below uses orthogonal polynomials; the fit is identical to entering the raw powers of py directly.

Modified disputes

cloglog_model.s <- glm(dispute ~ dem + growth + allies + contig + capratio + trade +
    poly(py, 3), data = BKT.Data, family = binomial(link = "cloglog"))
summary(cloglog_model.s)

Call:
glm(formula = dispute ~ dem + growth + allies + contig + capratio + 
    trade + poly(py, 3), family = binomial(link = "cloglog"), 
    data = BKT.Data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3282  -0.1974  -0.1359  -0.0902   3.9660  

Coefficients:
                 Estimate   Std. Error z value Pr(>|z|)    
(Intercept)    -4.1139710    0.0926772 -44.390  < 2e-16 ***
dem            -0.0494551    0.0072331  -6.837 8.07e-12 ***
growth         -0.0118322    0.0075851  -1.560    0.119    
allies         -0.4560842    0.0773249  -5.898 3.67e-09 ***
contig          0.5590242    0.0766158   7.296 2.95e-13 ***
capratio       -0.0030813    0.0004072  -7.567 3.81e-14 ***
trade         -14.2260838   10.2085494  -1.394    0.163    
poly(py, 3)1 -129.1817426   10.5291922 -12.269  < 2e-16 ***
poly(py, 3)2   89.6804700   12.9347930   6.933 4.11e-12 ***
poly(py, 3)3  -80.4785273    9.1927794  -8.755  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 7719.2  on 20989  degrees of freedom
Residual deviance: 5295.8  on 20980  degrees of freedom
AIC: 5315.8

Number of Fisher Scoring iterations: 8

A Plot

library(ggplot2)
# Predicted dispute probability across years of peace for a reference dyad
newdf <- data.frame(dem = 0, allies = 0, growth = 0, contig = 0, capratio = 1, trade = 0,
    py = seq(0, 33))
newdf$pred <- predict(cloglog_model.s, newdata = newdf, type = "response")
ggplot(newdf) + aes(x = py, y = pred) + geom_line() + theme_minimal() + labs(x = "Years of Peace",
    y = "Pr(Dispute)")

Other Models

  • Competing Risks Models for Multinomial Outcomes
  • Semi-Markov Processes: Duration Depends on Both the Prior State and How Long in Prior State

Power and Study Planning

Type I and II Errors

Four possibilities arise from crossing the truth of the null hypothesis with the decision to reject it:

                        H0 true                  H0 false
  Reject H0             Type I error (α)         Correct rejection (power)
  Fail to reject H0     Correct non-rejection    Type II error (β)

Statistical Power

Power is the probability of rejecting the null hypothesis when it is false: 1 − β.

Book on Power

Draw a Picture

Cohen’s Effect Size

Cohen’s d scales a difference in means by the standard deviation, d = (\mu_{1} - \mu_{2})/\sigma, with conventional benchmarks of 0.2 (small), 0.5 (medium), and 0.8 (large).

WebPower

WebPower has a nice website with worked examples for the power analyses it supports. Other packages are also of interest.
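
For a flavor of the calculation, base R’s power.t.test solves for whichever of sample size, effect size, and power is left unspecified; here, the n per group needed to detect a medium effect (d = 0.5) with 80% power:

power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)
# n = 63.77 per group for a two-sample t test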

Proportions

z = \frac{\overbrace{\hat{p}_{0} - \pi}^{MOE}}{\sqrt{\frac{\pi(1-\pi)}{n}}}

  • z represents the standard normal critical value for the chosen two-sided confidence level
  • \hat{p}_{0} represents the estimated proportion
  • \pi represents the true proportion
  • n represents the sample size

Let’s solve for n to get a lower bound on the sample size.
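
Because MOE = z\sqrt{\pi(1-\pi)/n} at the margin, rearranging gives

n \geq \pi(1-\pi)\left(\frac{z}{MOE}\right)^{2}

and \pi(1-\pi) is maximized at \pi = 0.5, which yields a conservative bound. For example, at 95% confidence (z = 1.96) with MOE = 0.03:

ceiling(0.5 * 0.5 * (1.96 / 0.03)^2)
# [1] 1068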

Radiant has much of this

Radiant Design

An Example on Study Planning

  • Calling admits