Two classes of models are commonly used for survival data. The text focuses on the semi-parametric Cox model, though there are also accelerated failure time (AFT) models for fully parametric analysis of survival-time data. First, the data structure.
The most common applications are in statistics for life tables. How long do individuals live beyond some intervention? In people analytics, how long do employees stay with a firm, or how long do customers maintain a relationship with a firm or service provider?
There will be two key components: events and time. As a first example, here is a dyad-year dataset on militarized disputes that returns later in the section:
# A tibble: 20,990 × 53
dispute dem growth allies contig capratio trade py pys1 pys2 pys3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 -9 -2.5 0 1 2.66 0 1 0 0 0
2 0 -7 2.35 0 1 1.93 0.00200 1 0 0 0
3 0 -3 -5.27 0 0 10.7 0 1 0 0 0
4 0 5 2.29 0 0 160. 0 1 0 0 0
5 0 -9 -13.6 0 1 1.07 0 1 0 0 0
6 0 -7 0.320 0 1 1.19 0.00300 1 0 0 0
7 0 -7 2.10 1 1 1.91 0.00460 1 0 0 0
8 0 -8 4.14 1 1 26.7 0 1 0 0 0
9 1 -7 -0.230 0 1 1.14 0.000600 1 0 0 0
10 1 -9 2.47 1 1 1.42 0 1 0 0 0
# … with 20,980 more rows, and 42 more variables: contdisp <dbl>,
# prefail <dbl>, tzero <dbl>, `_st` <dbl>, `_d` <dbl>, `_t` <dbl>,
# `_t0` <dbl>, k01 <dbl>, k02 <dbl>, k03 <dbl>, k04 <dbl>, k05 <dbl>,
# k06 <dbl>, k07 <dbl>, k08 <dbl>, k09 <dbl>, k010 <dbl>, k011 <dbl>,
# k012 <dbl>, k013 <dbl>, k014 <dbl>, k015 <dbl>, k016 <dbl>, k017 <dbl>,
# k018 <dbl>, k019 <dbl>, k020 <dbl>, k021 <dbl>, k022 <dbl>, k023 <dbl>,
# k024 <dbl>, k025 <dbl>, k026 <dbl>, k027 <dbl>, k028 <dbl>, k029 <dbl>, …
What happens when the event has yet to happen? The observation is censored; those are the observations marked with a + when survival times are printed. What does it mean? Data collection ended before the event. What's the logic? We know the unit survived at least that long.
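A minimal sketch with made-up times: survival::Surv() pairs the two components and marks the censored observation with a +.
library(survival)
# events at months 5 and 12; the middle unit is censored at month 8
Surv(time = c(5, 8, 12), event = c(1, 0, 1))
# [1]  5   8+ 12
The survfit summary below applies this to the job_retention data by sentiment group.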
Call: survfit(formula = Surv(event = left, time = month) ~ sentiment_category,
data = job_retention)
sentiment_category=High
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1 3225 15 0.995 0.00120 0.993 0.998
2 3167 62 0.976 0.00272 0.971 0.981
3 3075 120 0.938 0.00429 0.929 0.946
4 2932 102 0.905 0.00522 0.895 0.915
5 2802 65 0.884 0.00571 0.873 0.895
6 2700 144 0.837 0.00662 0.824 0.850
7 2532 110 0.801 0.00718 0.787 0.815
8 2389 140 0.754 0.00778 0.739 0.769
9 2222 112 0.716 0.00818 0.700 0.732
10 2077 56 0.696 0.00835 0.680 0.713
11 1994 134 0.650 0.00871 0.633 0.667
12 1827 45 0.634 0.00882 0.617 0.651
sentiment_category=Not High
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1 545 9 0.983 0.00546 0.973 0.994
2 532 28 0.932 0.01084 0.911 0.953
3 500 25 0.885 0.01373 0.859 0.912
4 472 29 0.831 0.01618 0.800 0.863
5 438 21 0.791 0.01758 0.757 0.826
6 411 27 0.739 0.01906 0.703 0.777
7 379 18 0.704 0.01987 0.666 0.744
8 357 21 0.662 0.02065 0.623 0.704
9 330 24 0.614 0.02136 0.574 0.658
10 302 15 0.584 0.02171 0.543 0.628
11 283 24 0.534 0.02209 0.493 0.579
12 254 8 0.517 0.02218 0.476 0.563
The canonical plot is the Kaplan-Meier survival curve.
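A minimal sketch of that plot, refitting the curves summarized above (colors and labels are arbitrary choices):
library(survival)
km_fit <- survfit(Surv(event = left, time = month) ~ sentiment_category,
    data = job_retention)
plot(km_fit, conf.int = TRUE, col = c("blue", "red"),
    xlab = "Month", ylab = "Proportion retained")
legend("bottomleft", legend = names(km_fit$strata),
    col = c("blue", "red"), lty = 1)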
The key to such modelling is the definition of a hazard function. Let h(t) be the hazard at time t: the instantaneous rate at which events occur among units that have survived to time t. The Cox proportional hazards model writes:
h(t) = h_{0}(t)\exp(\beta_{1}x_{1} + \beta_{2}x_{2} + \ldots + \beta_{k}x_{k})
The outcome will need two parts: the event indicator and the time. Let's model this as a function of gender, field, level, and sentiment.
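The fitting code, reconstructed from the Call echoed below:
cox_model <- survival::coxph(formula = Surv(event = left, time = month) ~
    gender + field + level + sentiment, data = job_retention)
summary(cox_model)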
Call:
survival::coxph(formula = Surv(event = left, time = month) ~
gender + field + level + sentiment, data = job_retention)
n= 3770, number of events= 1354
coef exp(coef) se(coef) z Pr(>|z|)
genderM -0.04548 0.95553 0.05886 -0.773 0.439647
fieldFinance 0.22334 1.25025 0.06681 3.343 0.000829 ***
fieldHealth 0.27830 1.32089 0.12890 2.159 0.030849 *
fieldLaw 0.10532 1.11107 0.14515 0.726 0.468086
fieldPublic/Government 0.11499 1.12186 0.08899 1.292 0.196277
fieldSales/Marketing 0.08776 1.09173 0.10211 0.859 0.390082
levelLow 0.14813 1.15967 0.09000 1.646 0.099799 .
levelMedium 0.17666 1.19323 0.10203 1.732 0.083362 .
sentiment -0.11756 0.88909 0.01397 -8.415 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
genderM 0.9555 1.0465 0.8514 1.0724
fieldFinance 1.2502 0.7998 1.0968 1.4252
fieldHealth 1.3209 0.7571 1.0260 1.7005
fieldLaw 1.1111 0.9000 0.8360 1.4767
fieldPublic/Government 1.1219 0.8914 0.9423 1.3356
fieldSales/Marketing 1.0917 0.9160 0.8937 1.3336
levelLow 1.1597 0.8623 0.9721 1.3834
levelMedium 1.1932 0.8381 0.9770 1.4574
sentiment 0.8891 1.1248 0.8651 0.9138
Concordance= 0.578 (se = 0.008 )
Likelihood ratio test= 89.18 on 9 df, p=2e-15
Wald test = 94.95 on 9 df, p=<2e-16
Score (logrank) test = 95.31 on 9 df, p=<2e-16
As a table:
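One way to arrange these estimates as a table is broom::tidy() on the fitted model; this is a sketch, since the tabulation code is not shown.
library(broom)
# one row per term: hazard ratios with 95 percent confidence intervals
tidy(cox_model, exponentiate = TRUE, conf.int = TRUE)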
Frailty models combine the insights of the hierarchical models from the last chapter with survival analysis. A frailty model augments the baseline hazard with parametric random effects. In this example, let's allow the field variable to have a shared frailty.
library(frailtypack)
(frailty_model <- frailtypack::frailtyPenal(
    formula = Surv(event = left, time = month) ~ gender + level + sentiment + cluster(field),
    data = job_retention, n.knots = 12, kappa = 10000))
Be patient. The program is computing ...
The program took 0.47 seconds
Call:
frailtypack::frailtyPenal(formula = Surv(event = left, time = month) ~
gender + level + sentiment + cluster(field), data = job_retention,
n.knots = 12, kappa = 10000)
Shared Gamma Frailty model parameter estimates
using a Penalized Likelihood on the hazard function
coef exp(coef) SE coef (H) SE coef (HIH) z p
genderM -0.029530 0.970902 0.0591105 0.0591105 -0.499574 6.1738e-01
levelLow 0.198549 1.219632 0.0914449 0.0914449 2.171245 2.9913e-02
levelMedium 0.223268 1.250155 0.1032712 0.1032712 2.161954 3.0622e-02
sentiment -0.108262 0.897393 0.0141327 0.0141327 -7.660398 1.8541e-14
chisq df global p
level 5.32143 2 0.0699
Frailty parameter, Theta: 48.3212 (SE (H): 25.6198 ) p = 0.029642
penalized marginal log-likelihood = -5510.36
Convergence criteria:
parameters = 9.61e-06 likelihood = 1.48e-06 gradient = 1.73e-09
LCV = the approximate likelihood cross-validation criterion
in the semi parametrical case = 1.46587
n= 3770
n events= 1354 n groups= 6
number of iterations: 17
Exact number of knots used: 12
Value of the smoothing parameter: 10000, DoF: 6.31
Two comments. First, there is a test of the frailty: is there evidence of variation across fields? That is the test of the Theta parameter above. Second, do the parameter estimates change? Because the frailty model absorbs field into the random effect, only four fixed-effect coefficients remain to compare. On the hazard-ratio scale (frailty model first, Cox model second):
              frailty       cox
genderM     0.9709017 0.9555343
levelLow    1.2196322 1.1596653
levelMedium 1.2501551 1.1932310
sentiment   0.8973926 0.8890854
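A sketch of one way to build that comparison, assuming the two fitted objects above:
shared <- c("genderM", "levelLow", "levelMedium", "sentiment")
# hazard ratios for the coefficients the two specifications share
cbind(frailty = exp(frailty_model$coef), cox = exp(coef(cox_model))[shared])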
Back to the dispute data from the start of the section. These durations arrive in counting-process form: each row carries an entry time `_t0`, an exit time `_t`, and the event indicator dispute, so Surv() takes all three.
options(scipen = 7)
cox_model <- survival::coxph(formula = Surv(`_t0`, `_t`, dispute) ~ dem + growth +
    allies + contig + capratio + trade, data = BKT.Data)
summary(cox_model)
Call:
survival::coxph(formula = Surv(`_t0`, `_t`, dispute) ~ dem +
growth + allies + contig + capratio + trade, data = BKT.Data)
n= 20990, number of events= 947
coef exp(coef) se(coef) z Pr(>|z|)
dem -0.049197226 0.951993354 0.007229125 -6.805 1.01e-11 ***
growth -0.007580611 0.992448050 0.007492787 -1.012 0.312
allies -0.422180458 0.655615717 0.076717663 -5.503 3.73e-08 ***
contig 0.544156705 1.723154641 0.076855065 7.080 1.44e-12 ***
capratio -0.003017259 0.996987288 0.000403396 -7.480 7.45e-14 ***
trade -12.292159040 0.000004588 10.071693690 -1.220 0.222
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
dem 0.951993354 1.0504 9.386e-01 0.9656
growth 0.992448050 1.0076 9.780e-01 1.0071
allies 0.655615717 1.5253 5.641e-01 0.7620
contig 1.723154641 0.5803 1.482e+00 2.0033
capratio 0.996987288 1.0030 9.962e-01 0.9978
trade 0.000004588 217980.0971 1.226e-14 1716.4232
Concordance= 0.703 (se = 0.01 )
Likelihood ratio test= 381.8 on 6 df, p=<2e-16
Wald test = 238.3 on 6 df, p=<2e-16
Score (logrank) test = 267.9 on 6 df, p=<2e-16
The data were read from Stata format with haven. An alternative to the Cox model treats the same grouped durations as binary time-series cross-section data: a cloglog link (the grouped-duration analogue of proportional hazards) with an indicator for each value of py, the years since the last dispute, estimates a fully flexible discrete-time hazard.
library(haven)
BKT.Data <- read_dta("https://github.com/robertwwalker/xaringan/raw/master/CMF-Week-6/img/bkt98ajps.dta")
cloglog_model <- glm(dispute ~ dem + growth + allies + contig + capratio + trade +
    as.factor(py), data = BKT.Data, family = binomial(link = "cloglog"))
summary(cloglog_model)
summary(cloglog_model)
Call:
glm(formula = dispute ~ dem + growth + allies + contig + capratio +
trade + as.factor(py), family = binomial(link = "cloglog"),
data = BKT.Data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4183 -0.2282 -0.1420 -0.0828 3.9654
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.113947 0.082348 -13.527 < 2e-16 ***
dem -0.049384 0.007288 -6.776 1.24e-11 ***
growth -0.008082 0.007612 -1.062 0.288357
allies -0.430045 0.078126 -5.504 3.70e-08 ***
contig 0.554479 0.077111 7.191 6.45e-13 ***
capratio -0.003023 0.000405 -7.464 8.39e-14 ***
trade -12.503176 9.960629 -1.255 0.209385
as.factor(py)2 -1.797014 0.130842 -13.734 < 2e-16 ***
as.factor(py)3 -1.910023 0.145076 -13.166 < 2e-16 ***
as.factor(py)4 -2.130485 0.165704 -12.857 < 2e-16 ***
as.factor(py)5 -2.378740 0.193902 -12.268 < 2e-16 ***
as.factor(py)6 -2.836553 0.246354 -11.514 < 2e-16 ***
as.factor(py)7 -2.831567 0.246584 -11.483 < 2e-16 ***
as.factor(py)8 -2.757412 0.239957 -11.491 < 2e-16 ***
as.factor(py)9 -2.701098 0.239710 -11.268 < 2e-16 ***
as.factor(py)10 -2.990651 0.280582 -10.659 < 2e-16 ***
as.factor(py)11 -3.207771 0.319289 -10.047 < 2e-16 ***
as.factor(py)12 -3.077872 0.305041 -10.090 < 2e-16 ***
as.factor(py)13 -3.670483 0.410654 -8.938 < 2e-16 ***
as.factor(py)14 -4.310370 0.578624 -7.449 9.38e-14 ***
as.factor(py)15 -4.329282 0.578841 -7.479 7.48e-14 ***
as.factor(py)16 -3.209284 0.336350 -9.542 < 2e-16 ***
as.factor(py)17 -3.973393 0.501858 -7.917 2.43e-15 ***
as.factor(py)18 -5.339488 1.000841 -5.335 9.55e-08 ***
as.factor(py)19 -3.367405 0.381031 -8.838 < 2e-16 ***
as.factor(py)20 -3.704984 0.449627 -8.240 < 2e-16 ***
as.factor(py)21 -4.552930 0.708058 -6.430 1.27e-10 ***
as.factor(py)22 -4.069630 0.579535 -7.022 2.18e-12 ***
as.factor(py)23 -3.894665 0.579557 -6.720 1.82e-11 ***
as.factor(py)24 -3.761143 0.579762 -6.487 8.73e-11 ***
as.factor(py)25 -17.294682 317.920692 -0.054 0.956617
as.factor(py)26 -2.758181 0.411827 -6.697 2.12e-11 ***
as.factor(py)27 -17.282784 369.870471 -0.047 0.962731
as.factor(py)28 -17.251363 368.516083 -0.047 0.962662
as.factor(py)29 -3.702860 0.709164 -5.221 1.78e-07 ***
as.factor(py)30 -3.209144 0.579970 -5.533 3.14e-08 ***
as.factor(py)31 -4.201176 1.002675 -4.190 2.79e-05 ***
as.factor(py)32 -3.364920 0.710748 -4.734 2.20e-06 ***
as.factor(py)33 -3.869193 1.001648 -3.863 0.000112 ***
as.factor(py)34 -3.112066 0.709943 -4.384 1.17e-05 ***
as.factor(py)35 -3.651929 1.003808 -3.638 0.000275 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 7719.2 on 20989 degrees of freedom
Residual deviance: 5108.1 on 20949 degrees of freedom
AIC: 5190.1
Number of Fisher Scoring iterations: 17
Carter and Signorino argue that a cubic function of time is sufficient for the analysis of most baseline hazards. Because it is cubic, it requires only the insertion of time, time squared, and time cubed among the set of predictors in a binary choice model with the cloglog link. The poly(py, 3) below uses orthogonal polynomials, which give the same fitted model as the raw powers.
cloglog_model.s <- glm(dispute ~ dem + growth + allies + contig + capratio + trade +
poly(py, 3), data = BKT.Data, family = binomial(link = "cloglog"))
summary(cloglog_model.s)
Call:
glm(formula = dispute ~ dem + growth + allies + contig + capratio +
trade + poly(py, 3), family = binomial(link = "cloglog"),
data = BKT.Data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3282 -0.1974 -0.1359 -0.0902 3.9660
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.1139710 0.0926772 -44.390 < 2e-16 ***
dem -0.0494551 0.0072331 -6.837 8.07e-12 ***
growth -0.0118322 0.0075851 -1.560 0.119
allies -0.4560842 0.0773249 -5.898 3.67e-09 ***
contig 0.5590242 0.0766158 7.296 2.95e-13 ***
capratio -0.0030813 0.0004072 -7.567 3.81e-14 ***
trade -14.2260838 10.2085494 -1.394 0.163
poly(py, 3)1 -129.1817426 10.5291922 -12.269 < 2e-16 ***
poly(py, 3)2 89.6804700 12.9347930 6.933 4.11e-12 ***
poly(py, 3)3 -80.4785273 9.1927794 -8.755 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 7719.2 on 20989 degrees of freedom
Residual deviance: 5295.8 on 20980 degrees of freedom
AIC: 5315.8
Number of Fisher Scoring iterations: 8
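To compare the dummy and cubic specifications, the information criteria can be read off the two summaries or computed in one call:
# the saturated py dummies fit better here: AIC 5190.1 (41 df) vs 5315.8 (10 df)
AIC(cloglog_model, cloglog_model.s)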
Four possibilities: a power analysis ties together sample size, effect size, the significance level, and power; fix any three and solve for the fourth.
WebPower has a nice website with worked examples for the sets of power analyses it supports. There are also others of interest.
z = \frac{\overbrace{\hat{p}_{0} - \pi}^{MOE}}{\sqrt{\frac{\pi(1-\pi)}{n}}}
Let’s solve for n to get a lower bound on the sample size.
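Setting the numerator equal to the desired MOE and rearranging gives
n \geq \frac{z^{2}\pi(1-\pi)}{MOE^{2}}
For example, with z = 1.96 for a 95 percent interval, \pi = 0.5 as the worst case, and MOE = 0.03, n \geq 1.96^{2}(0.25)/0.03^{2} \approx 1067.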