Choice and Forecasting: Week 6

Robert W. Walker

Regression for Categorical Variables

Structural Equations Models

A few weeks ago, Jack mentioned the use of principal components as a means for combining collinear variables. There is a more general language for describing models of this sort. The following example will play off of work I am currently finishing up with Elliot Maltz and a co-author.

First, the data.

library(lavaan)
load(url("https://github.com/robertwwalker/ChoiceAndForecasting/raw/main/posts/week-5/data/EMData.RData"))

There is a ton of data in here. Let me pay particular attention to specific parts we are interested in.

Agentic

names(EMData)[[76]]
[1] "28. You lack career guidance and support [People in the community expect you to be a leader]"
table(EMData.Anonymous$...76)

 1  2  3  4  5  6  7 
22 18 22 21 12 15  5 
names(EMData)[[77]]
[1] "28. You lack career guidance and support [Your community encourages you to achieve individual success]"
table(EMData.Anonymous$...77)

 1  2  3  4  5  6  7 
17 19 26 23 18 10  2 
names(EMData)[[78]]
[1] "28. You lack career guidance and support [You are expected to be assertive in your interactions with others]"
table(EMData.Anonymous$...78)

 1  2  3  4  5  6  7 
13 27 27 31 11  4  2 
names(EMData)[[79]]
[1] "28. You lack career guidance and support [You are expected to have strong opinions]"
table(EMData.Anonymous$...79)

 1  2  3  4  5  6  7 
11 26 19 31 23  3  2 

Agentic: Model

AB <- cfa("Agentic =~ ...76 + ...77 + ...78 + ...79", data = EMData.Anonymous, ordered = TRUE)
summary(AB, fit.measures = TRUE, standardized = TRUE)
lavaan 0.6-12 ended normally after 15 iterations

  Estimator                                       DWLS
  Optimization method                           NLMINB
  Number of model parameters                        28

  Number of observations                           115

Model Test User Model:
                                              Standard      Robust
  Test Statistic                                 7.422      21.832
  Degrees of freedom                                 2           2
  P-value (Chi-square)                           0.024       0.000
  Scaling correction factor                                  0.341
  Shift parameter                                            0.087
    simple second-order correction                                

Model Test Baseline Model:

  Test statistic                              2486.114    1657.812
  Degrees of freedom                                 6           6
  P-value                                        0.000       0.000
  Scaling correction factor                                  1.501

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.998       0.988
  Tucker-Lewis Index (TLI)                       0.993       0.964
                                                                  
  Robust Comparative Fit Index (CFI)                            NA
  Robust Tucker-Lewis Index (TLI)                               NA

Root Mean Square Error of Approximation:

  RMSEA                                          0.154       0.295
  90 Percent confidence interval - lower         0.047       0.191
  90 Percent confidence interval - upper         0.280       0.412
  P-value RMSEA <= 0.05                          0.053       0.000
                                                                  
  Robust RMSEA                                                  NA
  90 Percent confidence interval - lower                        NA
  90 Percent confidence interval - upper                        NA

Standardized Root Mean Square Residual:

  SRMR                                           0.033       0.033

Parameter Estimates:

  Standard errors                           Robust.sem
  Information                                 Expected
  Information saturated (h1) model        Unstructured

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  Agentic =~                                                            
    ...76             1.000                               0.853    0.853
    ...77             1.056    0.051   20.773    0.000    0.901    0.901
    ...78             1.045    0.042   25.093    0.000    0.891    0.891
    ...79             0.962    0.044   22.041    0.000    0.820    0.820

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   ....76             0.000                               0.000    0.000
   ....77             0.000                               0.000    0.000
   ....78             0.000                               0.000    0.000
   ....79             0.000                               0.000    0.000
    Agentic           0.000                               0.000    0.000

Thresholds:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    ...76|t1         -0.873    0.135   -6.459    0.000   -0.873   -0.873
    ...76|t2         -0.391    0.121   -3.241    0.001   -0.391   -0.391
    ...76|t3          0.098    0.118    0.835    0.403    0.098    0.098
    ...76|t4          0.588    0.125    4.702    0.000    0.588    0.588
    ...76|t5          0.939    0.138    6.790    0.000    0.939    0.939
    ...76|t6          1.712    0.207    8.262    0.000    1.712    1.712
    ...77|t1         -1.046    0.144   -7.264    0.000   -1.046   -1.046
    ...77|t2         -0.487    0.123   -3.975    0.000   -0.487   -0.487
    ...77|t3          0.098    0.118    0.835    0.403    0.098    0.098
    ...77|t4          0.641    0.127    5.062    0.000    0.641    0.641
    ...77|t5          1.257    0.158    7.948    0.000    1.257    1.257
    ...77|t6          2.111    0.285    7.411    0.000    2.111    2.111
    ...78|t1         -1.211    0.155   -7.826    0.000   -1.211   -1.211
    ...78|t2         -0.391    0.121   -3.241    0.001   -0.391   -0.391
    ...78|t3          0.209    0.118    1.763    0.078    0.209    0.209
    ...78|t4          1.046    0.144    7.264    0.000    1.046    1.046
    ...78|t5          1.624    0.195    8.320    0.000    1.624    1.624
    ...78|t6          2.111    0.285    7.411    0.000    2.111    2.111
    ...79|t1         -1.307    0.162   -8.058    0.000   -1.307   -1.307
    ...79|t2         -0.463    0.122   -3.792    0.000   -0.463   -0.463
    ...79|t3         -0.033    0.117   -0.279    0.781   -0.033   -0.033
    ...79|t4          0.695    0.128    5.418    0.000    0.695    0.695
    ...79|t5          1.712    0.207    8.262    0.000    1.712    1.712
    ...79|t6          2.111    0.285    7.411    0.000    2.111    2.111

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   ....76             0.273                               0.273    0.273
   ....77             0.188                               0.188    0.188
   ....78             0.207                               0.207    0.207
   ....79             0.327                               0.327    0.327
    Agentic           0.727    0.051   14.358    0.000    1.000    1.000

Scales y*:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    ...76             1.000                               1.000    1.000
    ...77             1.000                               1.000    1.000
    ...78             1.000                               1.000    1.000
    ...79             1.000                               1.000    1.000

Picture

library(lavaanPlot)
lavaanPlot(model = AB, node_options = list(shape = "box", fontname = "Helvetica"),
    edge_options = list(color = "grey"), covs = TRUE, coefs = TRUE)

Communal

names(EMData)[[80]]
[1] "28. You lack career guidance and support [You are expected to be unselfish]"
table(EMData.Anonymous$...80)

 1  2  3  4  5  6  7 
 5 13 24 25 25 19  4 
names(EMData)[[81]]
[1] "28. You lack career guidance and support [In your interactions with others you are expected to consider others opinions over your own]"
table(EMData.Anonymous$...81)

 1  2  3  4  5  6  7 
 9 21 32 28 20  4  1 
names(EMData)[[84]]
[1] "28. You lack career guidance and support [Your community expects you to put others interests ahead of your own]"
table(EMData.Anonymous$...84)

 1  2  3  4  5  6  7 
12 21 24 25 20 11  2 
CB <- cfa("Communal =~ ...80 + ...81 + ...84", data = EMData.Anonymous, ordered = TRUE)
summary(CB, fit.measures = TRUE, standardized = TRUE)
lavaan 0.6-12 ended normally after 13 iterations

  Estimator                                       DWLS
  Optimization method                           NLMINB
  Number of model parameters                        21

  Number of observations                           115

Model Test User Model:
                                              Standard      Robust
  Test Statistic                                 0.000       0.000
  Degrees of freedom                                 0           0

Model Test Baseline Model:

  Test statistic                               511.919     427.939
  Degrees of freedom                                 3           3
  P-value                                        0.000       0.000
  Scaling correction factor                                  1.198

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000       1.000
  Tucker-Lewis Index (TLI)                       1.000       1.000
                                                                  
  Robust Comparative Fit Index (CFI)                            NA
  Robust Tucker-Lewis Index (TLI)                               NA

Root Mean Square Error of Approximation:

  RMSEA                                          0.000       0.000
  90 Percent confidence interval - lower         0.000       0.000
  90 Percent confidence interval - upper         0.000       0.000
  P-value RMSEA <= 0.05                             NA          NA
                                                                  
  Robust RMSEA                                                  NA
  90 Percent confidence interval - lower                     0.000
  90 Percent confidence interval - upper                     0.000

Standardized Root Mean Square Residual:

  SRMR                                           0.000       0.000

Parameter Estimates:

  Standard errors                           Robust.sem
  Information                                 Expected
  Information saturated (h1) model        Unstructured

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  Communal =~                                                           
    ...80             1.000                               0.823    0.823
    ...81             0.908    0.078   11.572    0.000    0.747    0.747
    ...84             0.983    0.083   11.823    0.000    0.808    0.808

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   ....80             0.000                               0.000    0.000
   ....81             0.000                               0.000    0.000
   ....84             0.000                               0.000    0.000
    Communal          0.000                               0.000    0.000

Thresholds:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    ...80|t1         -1.712    0.207   -8.262    0.000   -1.712   -1.712
    ...80|t2         -1.009    0.142   -7.110    0.000   -1.009   -1.009
    ...80|t3         -0.345    0.120   -2.872    0.004   -0.345   -0.345
    ...80|t4          0.209    0.118    1.763    0.078    0.209    0.209
    ...80|t5          0.842    0.134    6.289    0.000    0.842    0.842
    ...80|t6          1.815    0.223    8.129    0.000    1.815    1.815
    ...81|t1         -1.417    0.172   -8.235    0.000   -1.417   -1.417
    ...81|t2         -0.641    0.127   -5.062    0.000   -0.641   -0.641
    ...81|t3          0.098    0.118    0.835    0.403    0.098    0.098
    ...81|t4          0.781    0.131    5.945    0.000    0.781    0.781
    ...81|t5          1.712    0.207    8.262    0.000    1.712    1.712
    ...81|t6          2.378    0.369    6.451    0.000    2.378    2.378
    ...84|t1         -1.257    0.158   -7.948    0.000   -1.257   -1.257
    ...84|t2         -0.562    0.124   -4.521    0.000   -0.562   -0.562
    ...84|t3         -0.011    0.117   -0.093    0.926   -0.011   -0.011
    ...84|t4          0.562    0.124    4.521    0.000    0.562    0.562
    ...84|t5          1.211    0.155    7.826    0.000    1.211    1.211
    ...84|t6          2.111    0.285    7.411    0.000    2.111    2.111

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   ....80             0.323                               0.323    0.323
   ....81             0.443                               0.443    0.443
   ....84             0.347                               0.347    0.347
    Communal          0.677    0.072    9.442    0.000    1.000    1.000

Scales y*:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    ...80             1.000                               1.000    1.000
    ...81             1.000                               1.000    1.000
    ...84             1.000                               1.000    1.000

Plot Code

lavaanPlot(model = CB, node_options = list(shape = "box", fontname = "Helvetica"),
    edge_options = list(color = "grey"), coefs = TRUE, covs = TRUE)

Plot Code

Mentoring

names(EMData)[[13]]
[1] "5. On a scale of 7 to 1, with 7 being very regularly and 1 being very rarely, how regularly do you:  [Encourage entrepreneurs to start businesses]"
table(EMData.Anonymous$...13)

 1  2  3  4  5  6  7 
21 27 12 14 10 26  5 
names(EMData)[[14]]
[1] "5. On a scale of 7 to 1, with 7 being very regularly and 1 being very rarely, how regularly do you:  [Reassure other entrepreneurs when things are not going well]"
table(EMData.Anonymous$...14)

 1  2  3  4  5  6  7 
22 21 10 19 33  4  6 
names(EMData)[[15]]
[1] "5. On a scale of 7 to 1, with 7 being very regularly and 1 being very rarely, how regularly do you:  [Help other entrepreneurs have confidence they can succeed]"
table(EMData.Anonymous$...15)

 1  2  3  4  5  6  7 
21 29 27 11 21  5  1 
M <- cfa("Mentoring =~ ...13 + ...14 + ...15", data = EMData.Anonymous, ordered = TRUE)
summary(M, fit.measures = TRUE, standardized = TRUE)
lavaan 0.6-12 ended normally after 10 iterations

  Estimator                                       DWLS
  Optimization method                           NLMINB
  Number of model parameters                        21

  Number of observations                           115

Model Test User Model:
                                              Standard      Robust
  Test Statistic                                 0.000       0.000
  Degrees of freedom                                 0           0

Model Test Baseline Model:

  Test statistic                              1726.590    1624.559
  Degrees of freedom                                 3           3
  P-value                                        0.000       0.000
  Scaling correction factor                                  1.063

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000       1.000
  Tucker-Lewis Index (TLI)                       1.000       1.000
                                                                  
  Robust Comparative Fit Index (CFI)                            NA
  Robust Tucker-Lewis Index (TLI)                               NA

Root Mean Square Error of Approximation:

  RMSEA                                          0.000       0.000
  90 Percent confidence interval - lower         0.000       0.000
  90 Percent confidence interval - upper         0.000       0.000
  P-value RMSEA <= 0.05                             NA          NA
                                                                  
  Robust RMSEA                                                  NA
  90 Percent confidence interval - lower                     0.000
  90 Percent confidence interval - upper                     0.000

Standardized Root Mean Square Residual:

  SRMR                                           0.000       0.000

Parameter Estimates:

  Standard errors                           Robust.sem
  Information                                 Expected
  Information saturated (h1) model        Unstructured

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  Mentoring =~                                                          
    ...13             1.000                               0.981    0.981
    ...14             0.879    0.058   15.193    0.000    0.862    0.862
    ...15             0.844    0.056   15.063    0.000    0.827    0.827

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   ....13             0.000                               0.000    0.000
   ....14             0.000                               0.000    0.000
   ....15             0.000                               0.000    0.000
    Mentoring         0.000                               0.000    0.000

Thresholds:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    ...13|t1         -0.905    0.137   -6.626    0.000   -0.905   -0.905
    ...13|t2         -0.209    0.118   -1.763    0.078   -0.209   -0.209
    ...13|t3          0.055    0.117    0.464    0.643    0.055    0.055
    ...13|t4          0.368    0.120    3.057    0.002    0.368    0.368
    ...13|t5          0.614    0.126    4.882    0.000    0.614    0.614
    ...13|t6          1.712    0.207    8.262    0.000    1.712    1.712
    ...14|t1         -0.873    0.135   -6.459    0.000   -0.873   -0.873
    ...14|t2         -0.322    0.120   -2.688    0.007   -0.322   -0.322
    ...14|t3         -0.098    0.118   -0.835    0.403   -0.098   -0.098
    ...14|t4          0.322    0.120    2.688    0.007    0.322    0.322
    ...14|t5          1.360    0.167    8.155    0.000    1.360    1.360
    ...14|t6          1.624    0.195    8.320    0.000    1.624    1.624
    ...15|t1         -0.905    0.137   -6.626    0.000   -0.905   -0.905
    ...15|t2         -0.164    0.118   -1.392    0.164   -0.164   -0.164
    ...15|t3          0.439    0.122    3.608    0.000    0.439    0.439
    ...15|t4          0.723    0.129    5.595    0.000    0.723    0.723
    ...15|t5          1.624    0.195    8.320    0.000    1.624    1.624
    ...15|t6          2.378    0.369    6.451    0.000    2.378    2.378

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   ....13             0.039                               0.039    0.039
   ....14             0.258                               0.258    0.258
   ....15             0.315                               0.315    0.315
    Mentoring         0.961    0.077   12.508    0.000    1.000    1.000

Scales y*:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    ...13             1.000                               1.000    1.000
    ...14             1.000                               1.000    1.000
    ...15             1.000                               1.000    1.000

Plot Code

lavaanPlot(model = M, node_options = list(shape = "box", fontname = "Helvetica"),
    edge_options = list(color = "grey"), coefs = TRUE, covs = TRUE)

Plot Code

Social Influence

names(EMData)[[37]]
[1] "11. On a scale of 7 to 1, with 7 being strongly agree and 1 being strongly disagree, please indicate your level of agreement with the following items relating to your community:  [The respect you have in the community helps your business]"
table(EMData.Anonymous$...37)

 1  2  3  4  5  6 
26 38 31  5  8  7 
names(EMData)[[38]]
[1] "11. On a scale of 7 to 1, with 7 being strongly agree and 1 being strongly disagree, please indicate your level of agreement with the following items relating to your community:  [Your understanding of the community helps your business]"
table(EMData.Anonymous$...38)

 1  2  3  4  5  6 
27 40 28  6  7  7 
names(EMData)[[39]]
[1] "11. On a scale of 7 to 1, with 7 being strongly agree and 1 being strongly disagree, please indicate your level of agreement with the following items relating to your community:  [Your influence in the community helps your business]"
table(EMData.Anonymous$...39)

 1  2  3  4  5  6  7 
30 38 30  4  7  5  1 
SI <- cfa("Social.Influence =~ ...37 + ...38 + ...39", data = EMData.Anonymous, ordered = TRUE)
summary(SI, fit.measures = TRUE, standardized = TRUE)
lavaan 0.6-12 ended normally after 12 iterations

  Estimator                                       DWLS
  Optimization method                           NLMINB
  Number of model parameters                        19

  Number of observations                           115

Model Test User Model:
                                              Standard      Robust
  Test Statistic                                 0.000       0.000
  Degrees of freedom                                 0           0

Model Test Baseline Model:

  Test statistic                              3636.699    2922.199
  Degrees of freedom                                 3           3
  P-value                                        0.000       0.000
  Scaling correction factor                                  1.245

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000       1.000
  Tucker-Lewis Index (TLI)                       1.000       1.000
                                                                  
  Robust Comparative Fit Index (CFI)                            NA
  Robust Tucker-Lewis Index (TLI)                               NA

Root Mean Square Error of Approximation:

  RMSEA                                          0.000       0.000
  90 Percent confidence interval - lower         0.000       0.000
  90 Percent confidence interval - upper         0.000       0.000
  P-value RMSEA <= 0.05                             NA          NA
                                                                  
  Robust RMSEA                                                  NA
  90 Percent confidence interval - lower                     0.000
  90 Percent confidence interval - upper                     0.000

Standardized Root Mean Square Residual:

  SRMR                                           0.000       0.000

Parameter Estimates:

  Standard errors                           Robust.sem
  Information                                 Expected
  Information saturated (h1) model        Unstructured

Latent Variables:
                      Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  Social.Influence =~                                                      
    ...37                1.000                               0.928    0.928
    ...38                1.055    0.031   33.595    0.000    0.979    0.979
    ...39                0.942    0.029   32.441    0.000    0.874    0.874

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   ....37             0.000                               0.000    0.000
   ....38             0.000                               0.000    0.000
   ....39             0.000                               0.000    0.000
    Social.Influnc    0.000                               0.000    0.000

Thresholds:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    ...37|t1         -0.752    0.130   -5.771    0.000   -0.752   -0.752
    ...37|t2          0.142    0.118    1.207    0.228    0.142    0.142
    ...37|t3          0.939    0.138    6.790    0.000    0.939    0.939
    ...37|t4          1.124    0.149    7.558    0.000    1.124    1.124
    ...37|t5          1.548    0.186    8.325    0.000    1.548    1.548
    ...38|t1         -0.723    0.129   -5.595    0.000   -0.723   -0.723
    ...38|t2          0.209    0.118    1.763    0.078    0.209    0.209
    ...38|t3          0.939    0.138    6.790    0.000    0.939    0.939
    ...38|t4          1.166    0.152    7.696    0.000    1.166    1.166
    ...38|t5          1.548    0.186    8.325    0.000    1.548    1.548
    ...39|t1         -0.641    0.127   -5.062    0.000   -0.641   -0.641
    ...39|t2          0.231    0.119    1.948    0.051    0.231    0.231
    ...39|t3          1.046    0.144    7.264    0.000    1.046    1.046
    ...39|t4          1.211    0.155    7.826    0.000    1.211    1.211
    ...39|t5          1.624    0.195    8.320    0.000    1.624    1.624
    ...39|t6          2.378    0.369    6.451    0.000    2.378    2.378

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   ....37             0.139                               0.139    0.139
   ....38             0.041                               0.041    0.041
   ....39             0.236                               0.236    0.236
    Social.Influnc    0.861    0.034   25.088    0.000    1.000    1.000

Scales y*:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    ...37             1.000                               1.000    1.000
    ...38             1.000                               1.000    1.000
    ...39             1.000                               1.000    1.000

Plot Code

lavaanPlot(model = SI, node_options = list(shape = "box", fontname = "Helvetica"),
    edge_options = list(color = "grey"), coefs = TRUE, covs = TRUE)

Plot Code

An SEM

Now let me combine those measurement models to produce a set of two structural equations. I wish to explain income and employment given these factors.

names(EMData)[c(5, 59)]
Struct <- sem("Agentic =~ ...76 + ...77 + ...78 + ...79
          Communal =~ ...80 + ...81 + ...84
          Mentoring =~ ...13 + ...14 + ...15
          Social.Influence =~ ...37 + ...38 + ...39
          ...59 ~ Agentic + Communal + Mentoring + Social.Influence
          ...5 ~ Agentic + Communal + Mentoring + Social.Influence",
    data = EMData.Anonymous, ordered = c("...13", "...14", "...15", "...80", "...81",
        "...84", "...76", "...77", "...78", "...79", "...37", "...38", "...39"))
summary(Struct, fit.measures = TRUE, standardized = TRUE)
[1] "3. What is the current income generated from the business? (USD PM)"
[2] "20. How many people do you manage in your business?"                
lavaan 0.6-12 ended normally after 112 iterations

  Estimator                                       DWLS
  Optimization method                           NLMINB
  Number of model parameters                       108

  Number of observations                           115

Model Test User Model:
                                              Standard      Robust
  Test Statistic                               176.253     255.843
  Degrees of freedom                                77          77
  P-value (Chi-square)                           0.000       0.000
  Scaling correction factor                                  0.804
  Shift parameter                                           36.682
    simple second-order correction                                

Model Test Baseline Model:

  Test statistic                              9489.339    3694.019
  Degrees of freedom                               105         105
  P-value                                        0.000       0.000
  Scaling correction factor                                  2.615

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.989       0.950
  Tucker-Lewis Index (TLI)                       0.986       0.932
                                                                  
  Robust Comparative Fit Index (CFI)                            NA
  Robust Tucker-Lewis Index (TLI)                               NA

Root Mean Square Error of Approximation:

  RMSEA                                          0.106       0.143
  90 Percent confidence interval - lower         0.086       0.124
  90 Percent confidence interval - upper         0.127       0.162
  P-value RMSEA <= 0.05                          0.000       0.000
                                                                  
  Robust RMSEA                                                  NA
  90 Percent confidence interval - lower                        NA
  90 Percent confidence interval - upper                        NA

Standardized Root Mean Square Residual:

  SRMR                                           0.075       0.075

Parameter Estimates:

  Standard errors                           Robust.sem
  Information                                 Expected
  Information saturated (h1) model        Unstructured

Latent Variables:
                      Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
  Agentic =~                                                                 
    ...76                 1.000                                0.812    0.812
    ...77                 1.110    0.051   21.573    0.000     0.901    0.901
    ...78                 1.076    0.044   24.545    0.000     0.873    0.873
    ...79                 1.097    0.047   23.283    0.000     0.890    0.890
  Communal =~                                                                
    ...80                 1.000                                0.715    0.715
    ...81                 1.233    0.126    9.803    0.000     0.881    0.881
    ...84                 1.103    0.091   12.190    0.000     0.788    0.788
  Mentoring =~                                                               
    ...13                 1.000                                0.961    0.961
    ...14                 0.908    0.057   15.897    0.000     0.872    0.872
    ...15                 0.876    0.054   16.228    0.000     0.842    0.842
  Social.Influence =~                                                        
    ...37                 1.000                                0.948    0.948
    ...38                 1.004    0.030   33.882    0.000     0.952    0.952
    ...39                 0.941    0.029   32.029    0.000     0.892    0.892

Regressions:
                   Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
  ...59 ~                                                                 
    Agentic           -0.087    0.063   -1.386    0.166    -0.071   -0.052
    Communal           0.119    0.108    1.100    0.271     0.085    0.062
    Mentoring          0.166    0.075    2.213    0.027     0.159    0.116
    Social.Influnc    -0.422    0.055   -7.620    0.000    -0.400   -0.292
  ...5 ~                                                                  
    Agentic           51.893   19.705    2.633    0.008    42.118    0.205
    Communal          -9.427   20.599   -0.458    0.647    -6.736   -0.033
    Mentoring          1.218   15.046    0.081    0.935     1.170    0.006
    Social.Influnc    35.014   16.427    2.131    0.033    33.195    0.161

Covariances:
                   Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
  Agentic ~~                                                              
    Communal           0.259    0.044    5.821    0.000     0.447    0.447
    Mentoring          0.203    0.062    3.269    0.001     0.261    0.261
    Social.Influnc     0.247    0.056    4.423    0.000     0.321    0.321
  Communal ~~                                                             
    Mentoring          0.134    0.057    2.329    0.020     0.195    0.195
    Social.Influnc     0.308    0.062    4.963    0.000     0.454    0.454
  Mentoring ~~                                                            
    Social.Influnc     0.067    0.074    0.904    0.366     0.074    0.074
 ....59 ~~                                                                
   ....5              78.509   19.599    4.006    0.000    78.509    0.304

Intercepts:
                   Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
   ....76              0.000                                0.000    0.000
   ....77              0.000                                0.000    0.000
   ....78              0.000                                0.000    0.000
   ....79              0.000                                0.000    0.000
   ....80              0.000                                0.000    0.000
   ....81              0.000                                0.000    0.000
   ....84              0.000                                0.000    0.000
   ....13              0.000                                0.000    0.000
   ....14              0.000                                0.000    0.000
   ....15              0.000                                0.000    0.000
   ....37              0.000                                0.000    0.000
   ....38              0.000                                0.000    0.000
   ....39              0.000                                0.000    0.000
   ....59              0.435    0.305    1.427    0.154     0.435    0.317
   ....5             380.609   23.778   16.007    0.000   380.609    1.849
    Agentic            0.000                                0.000    0.000
    Communal           0.000                                0.000    0.000
    Mentoring          0.000                                0.000    0.000
    Social.Influnc     0.000                                0.000    0.000

Thresholds:
                   Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
    ...76|t1          -0.873    0.135   -6.459    0.000    -0.873   -0.873
    ...76|t2          -0.391    0.121   -3.241    0.001    -0.391   -0.391
    ...76|t3           0.098    0.118    0.835    0.403     0.098    0.098
    ...76|t4           0.588    0.125    4.702    0.000     0.588    0.588
    ...76|t5           0.939    0.138    6.790    0.000     0.939    0.939
    ...76|t6           1.712    0.207    8.262    0.000     1.712    1.712
    ...77|t1          -1.046    0.144   -7.264    0.000    -1.046   -1.046
    ...77|t2          -0.487    0.123   -3.975    0.000    -0.487   -0.487
    ...77|t3           0.098    0.118    0.835    0.403     0.098    0.098
    ...77|t4           0.641    0.127    5.062    0.000     0.641    0.641
    ...77|t5           1.257    0.158    7.948    0.000     1.257    1.257
    ...77|t6           2.111    0.285    7.411    0.000     2.111    2.111
    ...78|t1          -1.211    0.155   -7.826    0.000    -1.211   -1.211
    ...78|t2          -0.391    0.121   -3.241    0.001    -0.391   -0.391
    ...78|t3           0.209    0.118    1.763    0.078     0.209    0.209
    ...78|t4           1.046    0.144    7.264    0.000     1.046    1.046
    ...78|t5           1.624    0.195    8.320    0.000     1.624    1.624
    ...78|t6           2.111    0.285    7.411    0.000     2.111    2.111
    ...79|t1          -1.307    0.162   -8.058    0.000    -1.307   -1.307
    ...79|t2          -0.463    0.122   -3.792    0.000    -0.463   -0.463
    ...79|t3          -0.033    0.117   -0.279    0.781    -0.033   -0.033
    ...79|t4           0.695    0.128    5.418    0.000     0.695    0.695
    ...79|t5           1.712    0.207    8.262    0.000     1.712    1.712
    ...79|t6           2.111    0.285    7.411    0.000     2.111    2.111
    ...80|t1          -1.712    0.207   -8.262    0.000    -1.712   -1.712
    ...80|t2          -1.009    0.142   -7.110    0.000    -1.009   -1.009
    ...80|t3          -0.345    0.120   -2.872    0.004    -0.345   -0.345
    ...80|t4           0.209    0.118    1.763    0.078     0.209    0.209
    ...80|t5           0.842    0.134    6.289    0.000     0.842    0.842
    ...80|t6           1.815    0.223    8.129    0.000     1.815    1.815
    ...81|t1          -1.417    0.172   -8.235    0.000    -1.417   -1.417
    ...81|t2          -0.641    0.127   -5.062    0.000    -0.641   -0.641
    ...81|t3           0.098    0.118    0.835    0.403     0.098    0.098
    ...81|t4           0.781    0.131    5.945    0.000     0.781    0.781
    ...81|t5           1.712    0.207    8.262    0.000     1.712    1.712
    ...81|t6           2.378    0.369    6.451    0.000     2.378    2.378
    ...84|t1          -1.257    0.158   -7.948    0.000    -1.257   -1.257
    ...84|t2          -0.562    0.124   -4.521    0.000    -0.562   -0.562
    ...84|t3          -0.011    0.117   -0.093    0.926    -0.011   -0.011
    ...84|t4           0.562    0.124    4.521    0.000     0.562    0.562
    ...84|t5           1.211    0.155    7.826    0.000     1.211    1.211
    ...84|t6           2.111    0.285    7.411    0.000     2.111    2.111
    ...13|t1          -0.905    0.137   -6.626    0.000    -0.905   -0.905
    ...13|t2          -0.209    0.118   -1.763    0.078    -0.209   -0.209
    ...13|t3           0.055    0.117    0.464    0.643     0.055    0.055
    ...13|t4           0.368    0.120    3.057    0.002     0.368    0.368
    ...13|t5           0.614    0.126    4.882    0.000     0.614    0.614
    ...13|t6           1.712    0.207    8.262    0.000     1.712    1.712
    ...14|t1          -0.873    0.135   -6.459    0.000    -0.873   -0.873
    ...14|t2          -0.322    0.120   -2.688    0.007    -0.322   -0.322
    ...14|t3          -0.098    0.118   -0.835    0.403    -0.098   -0.098
    ...14|t4           0.322    0.120    2.688    0.007     0.322    0.322
    ...14|t5           1.360    0.167    8.155    0.000     1.360    1.360
    ...14|t6           1.624    0.195    8.320    0.000     1.624    1.624
    ...15|t1          -0.905    0.137   -6.626    0.000    -0.905   -0.905
    ...15|t2          -0.164    0.118   -1.392    0.164    -0.164   -0.164
    ...15|t3           0.439    0.122    3.608    0.000     0.439    0.439
    ...15|t4           0.723    0.129    5.595    0.000     0.723    0.723
    ...15|t5           1.624    0.195    8.320    0.000     1.624    1.624
    ...15|t6           2.378    0.369    6.451    0.000     2.378    2.378
    ...37|t1          -0.752    0.130   -5.771    0.000    -0.752   -0.752
    ...37|t2           0.142    0.118    1.207    0.228     0.142    0.142
    ...37|t3           0.939    0.138    6.790    0.000     0.939    0.939
    ...37|t4           1.124    0.149    7.558    0.000     1.124    1.124
    ...37|t5           1.548    0.186    8.325    0.000     1.548    1.548
    ...38|t1          -0.723    0.129   -5.595    0.000    -0.723   -0.723
    ...38|t2           0.209    0.118    1.763    0.078     0.209    0.209
    ...38|t3           0.939    0.138    6.790    0.000     0.939    0.939
    ...38|t4           1.166    0.152    7.696    0.000     1.166    1.166
    ...38|t5           1.548    0.186    8.325    0.000     1.548    1.548
    ...39|t1          -0.641    0.127   -5.062    0.000    -0.641   -0.641
    ...39|t2           0.231    0.119    1.948    0.051     0.231    0.231
    ...39|t3           1.046    0.144    7.264    0.000     1.046    1.046
    ...39|t4           1.211    0.155    7.826    0.000     1.211    1.211
    ...39|t5           1.624    0.195    8.320    0.000     1.624    1.624
    ...39|t6           2.378    0.369    6.451    0.000     2.378    2.378

Variances:
                   Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
   ....76              0.341                                0.341    0.341
   ....77              0.188                                0.188    0.188
   ....78              0.238                                0.238    0.238
   ....79              0.207                                0.207    0.207
   ....80              0.489                                0.489    0.489
   ....81              0.224                                0.224    0.224
   ....84              0.378                                0.378    0.378
   ....13              0.077                                0.077    0.077
   ....14              0.239                                0.239    0.239
   ....15              0.292                                0.292    0.292
   ....37              0.101                                0.101    0.101
   ....38              0.094                                0.094    0.094
   ....39              0.204                                0.204    0.204
   ....59              1.711    0.147   11.666    0.000     1.711    0.910
   ....5           38984.352 4516.263    8.632    0.000 38984.352    0.920
    Agentic            0.659    0.048   13.761    0.000     1.000    1.000
    Communal           0.511    0.068    7.492    0.000     1.000    1.000
    Mentoring          0.923    0.072   12.824    0.000     1.000    1.000
    Social.Influnc     0.899    0.033   27.070    0.000     1.000    1.000

Scales y*:
                   Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
    ...76              1.000                                1.000    1.000
    ...77              1.000                                1.000    1.000
    ...78              1.000                                1.000    1.000
    ...79              1.000                                1.000    1.000
    ...80              1.000                                1.000    1.000
    ...81              1.000                                1.000    1.000
    ...84              1.000                                1.000    1.000
    ...13              1.000                                1.000    1.000
    ...14              1.000                                1.000    1.000
    ...15              1.000                                1.000    1.000
    ...37              1.000                                1.000    1.000
    ...38              1.000                                1.000    1.000
    ...39              1.000                                1.000    1.000

Plot Code

lavaanPlot(model = Struct, node_options = list(shape = "box", fontname = "Helvetica"),
    edge_options = list(color = "grey"), coefs = TRUE, covs = TRUE)

Plot Code

Survival Models

There are two classes of models commonly used for survival data. The text focuses on the Cox model though there are also AFT [accelerated failure time] models for parametric analysis of survival-time data. First, on the data structure.

Data for Survival models

The most common applications are in statistics for life tables. How long do individuals live beyond some intervention? In people analytics, how long do employees stay with a firm or how long do customers maintain a relationship with a firm or service provider?

Two Classes of Survival Models

  • Time-varying covariates
  • Time-constant covariates

Some Survival Time Data

There will be two key components: events and time.

url <- "http://peopleanalytics-regression-book.org/data/job_retention.csv"
job_retention <- read.csv(url)
library(survival)
retention <- Surv(event = job_retention$left, time = job_retention$month)

A Time-Varying Covariates Example

library(haven)
BKT.Data <- read_dta("./img/bkt98ajps.dta")
BKT.Data
# A tibble: 20,990 × 53
   dispute   dem  growth allies contig capratio    trade    py  pys1  pys2  pys3
     <dbl> <dbl>   <dbl>  <dbl>  <dbl>    <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl>
 1       1    -9  -2.5        0      1     2.66 0            1     0     0     0
 2       0    -7   2.35       0      1     1.93 0.00200      1     0     0     0
 3       0    -3  -5.27       0      0    10.7  0            1     0     0     0
 4       0     5   2.29       0      0   160.   0            1     0     0     0
 5       0    -9 -13.6        0      1     1.07 0            1     0     0     0
 6       0    -7   0.320      0      1     1.19 0.00300      1     0     0     0
 7       0    -7   2.10       1      1     1.91 0.00460      1     0     0     0
 8       0    -8   4.14       1      1    26.7  0            1     0     0     0
 9       1    -7  -0.230      0      1     1.14 0.000600     1     0     0     0
10       1    -9   2.47       1      1     1.42 0            1     0     0     0
# … with 20,980 more rows, and 42 more variables: contdisp <dbl>,
#   prefail <dbl>, tzero <dbl>, `_st` <dbl>, `_d` <dbl>, `_t` <dbl>,
#   `_t0` <dbl>, k01 <dbl>, k02 <dbl>, k03 <dbl>, k04 <dbl>, k05 <dbl>,
#   k06 <dbl>, k07 <dbl>, k08 <dbl>, k09 <dbl>, k010 <dbl>, k011 <dbl>,
#   k012 <dbl>, k013 <dbl>, k014 <dbl>, k015 <dbl>, k016 <dbl>, k017 <dbl>,
#   k018 <dbl>, k019 <dbl>, k020 <dbl>, k021 <dbl>, k022 <dbl>, k023 <dbl>,
#   k024 <dbl>, k025 <dbl>, k026 <dbl>, k027 <dbl>, k028 <dbl>, k029 <dbl>, …

On Censoring

What happens when the event has yet to happen? The observation is censored. Those are the + in the table below. What does it mean? Data collection ended before the event.

unique(retention)
 [1]  1  12+  5   2   3   6   8   4   8+  4+ 11  10   9   7+  5+  3+  7   9+ 11+
[20] 12  10+  6+  2+  1+

What’s the logic? We know the unit survived at least that long.

A First Very Simple Model

job_retention$sentiment_category <- ifelse(job_retention$sentiment >= 7, "High",
    "Not High")
kmestimate_sentimentcat <- survival::survfit(formula = Surv(event = left, time = month) ~
    sentiment_category, data = job_retention)

The output

summary(kmestimate_sentimentcat)
Call: survfit(formula = Surv(event = left, time = month) ~ sentiment_category, 
    data = job_retention)

                sentiment_category=High 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1   3225      15    0.995 0.00120        0.993        0.998
    2   3167      62    0.976 0.00272        0.971        0.981
    3   3075     120    0.938 0.00429        0.929        0.946
    4   2932     102    0.905 0.00522        0.895        0.915
    5   2802      65    0.884 0.00571        0.873        0.895
    6   2700     144    0.837 0.00662        0.824        0.850
    7   2532     110    0.801 0.00718        0.787        0.815
    8   2389     140    0.754 0.00778        0.739        0.769
    9   2222     112    0.716 0.00818        0.700        0.732
   10   2077      56    0.696 0.00835        0.680        0.713
   11   1994     134    0.650 0.00871        0.633        0.667
   12   1827      45    0.634 0.00882        0.617        0.651

                sentiment_category=Not High 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1    545       9    0.983 0.00546        0.973        0.994
    2    532      28    0.932 0.01084        0.911        0.953
    3    500      25    0.885 0.01373        0.859        0.912
    4    472      29    0.831 0.01618        0.800        0.863
    5    438      21    0.791 0.01758        0.757        0.826
    6    411      27    0.739 0.01906        0.703        0.777
    7    379      18    0.704 0.01987        0.666        0.744
    8    357      21    0.662 0.02065        0.623        0.704
    9    330      24    0.614 0.02136        0.574        0.658
   10    302      15    0.584 0.02171        0.543        0.628
   11    283      24    0.534 0.02209        0.493        0.579
   12    254       8    0.517 0.02218        0.476        0.563

Plotting Survival Data

The canonical plot is the Kaplan-Meier survival curve.

library(survminer)

# show survival curves with p-value estimate and confidence intervals
survminer::ggsurvplot(kmestimate_sentimentcat, pval = TRUE, conf.int = TRUE, palette = c("blue",
    "red"), linetype = c("solid", "dashed"), xlab = "Month", ylab = "Retention Rate")

The Cox Model

The key to such modelling is the definition of a hazard function. Let h(t) be the proportion that have not survived to time t, we can write:

h(t) = h_{0}(t)\exp(\beta_{1}x_{1} + \beta_{2}x_{2} + \ldots + \beta_{k}x_{k})

Proportional Hazards

PH Model

Estimating the Model

The outcome will need to be a double: the event and the time. Let’s model this as a function of gender, field, level, and sentiment.

cox_model <- survival::coxph(formula = Surv(event = left, time = month) ~ gender +
    field + level + sentiment, data = job_retention)

A Summary

summary(cox_model)
Call:
survival::coxph(formula = Surv(event = left, time = month) ~ 
    gender + field + level + sentiment, data = job_retention)

  n= 3770, number of events= 1354 

                           coef exp(coef) se(coef)      z Pr(>|z|)    
genderM                -0.04548   0.95553  0.05886 -0.773 0.439647    
fieldFinance            0.22334   1.25025  0.06681  3.343 0.000829 ***
fieldHealth             0.27830   1.32089  0.12890  2.159 0.030849 *  
fieldLaw                0.10532   1.11107  0.14515  0.726 0.468086    
fieldPublic/Government  0.11499   1.12186  0.08899  1.292 0.196277    
fieldSales/Marketing    0.08776   1.09173  0.10211  0.859 0.390082    
levelLow                0.14813   1.15967  0.09000  1.646 0.099799 .  
levelMedium             0.17666   1.19323  0.10203  1.732 0.083362 .  
sentiment              -0.11756   0.88909  0.01397 -8.415  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                       exp(coef) exp(-coef) lower .95 upper .95
genderM                   0.9555     1.0465    0.8514    1.0724
fieldFinance              1.2502     0.7998    1.0968    1.4252
fieldHealth               1.3209     0.7571    1.0260    1.7005
fieldLaw                  1.1111     0.9000    0.8360    1.4767
fieldPublic/Government    1.1219     0.8914    0.9423    1.3356
fieldSales/Marketing      1.0917     0.9160    0.8937    1.3336
levelLow                  1.1597     0.8623    0.9721    1.3834
levelMedium               1.1932     0.8381    0.9770    1.4574
sentiment                 0.8891     1.1248    0.8651    0.9138

Concordance= 0.578  (se = 0.008 )
Likelihood ratio test= 89.18  on 9 df,   p=2e-15
Wald test            = 94.95  on 9 df,   p=<2e-16
Score (logrank) test = 95.31  on 9 df,   p=<2e-16

Examining PH

As a table:

(ph_check <- survival::cox.zph(cox_model))
           chisq df    p
gender     0.726  1 0.39
field      6.656  5 0.25
level      2.135  2 0.34
sentiment  1.828  1 0.18
GLOBAL    11.156  9 0.27

Or Graphically

survminer::ggcoxzph(ph_check, font.main = 10, font.x = 10, font.y = 10)

Models of Frailty

Frailty models combine the insights of hierarchical models from the last chapter with survival analysis.

A frailty model seeks to augment the baseline hazard with some parametric random-effects. In this example, let’s allow the field variable to have shared frailty.

Estimation

library(frailtypack)

(frailty_model <- frailtypack::frailtyPenal(formula = Surv(event = left, time = month) ~
    gender + level + sentiment + cluster(field), data = job_retention, n.knots = 12,
    kappa = 10000))

Be patient. The program is computing ... 
The program took 0.48 seconds 
Call:
frailtypack::frailtyPenal(formula = Surv(event = left, time = month) ~ 
    gender + level + sentiment + cluster(field), data = job_retention, 
    n.knots = 12, kappa = 10000)


  Shared Gamma Frailty model parameter estimates  
  using a Penalized Likelihood on the hazard function 

                 coef exp(coef) SE coef (H) SE coef (HIH)         z          p
genderM     -0.029530  0.970902   0.0591105     0.0591105 -0.499574 6.1738e-01
levelLow     0.198549  1.219632   0.0914449     0.0914449  2.171245 2.9913e-02
levelMedium  0.223268  1.250155   0.1032712     0.1032712  2.161954 3.0622e-02
sentiment   -0.108262  0.897393   0.0141327     0.0141327 -7.660398 1.8541e-14

        chisq df global p
level 5.32143  2   0.0699

    Frailty parameter, Theta: 48.3212 (SE (H): 25.6198 ) p = 0.029642 
 
      penalized marginal log-likelihood = -5510.36
      Convergence criteria: 
      parameters = 9.61e-06 likelihood = 1.48e-06 gradient = 1.73e-09 

      LCV = the approximate likelihood cross-validation criterion
            in the semi parametrical case     = 1.46587 

      n= 3770
      n events= 1354  n groups= 6
      number of iterations:  17 

      Exact number of knots used:  12 
      Value of the smoothing parameter:  10000, DoF:  6.31

Two comments. First, there is a test of the frailty; is there evidence of variation? This is the test of the Theta parameter. Second, do the parameters change?

Comparison

exp(cbind(frailty_model$coef, coef(cox_model)))
                            [,1]      [,2]
genderM                0.9709017 0.9555343
fieldFinance           1.2196322 1.2502466
fieldHealth            1.2501551 1.3208885
fieldLaw               0.8973926 1.1110687
fieldPublic/Government 0.9709017 1.1218640
fieldSales/Marketing   1.2196322 1.0917301
levelLow               1.2501551 1.1596653
levelMedium            0.8973926 1.1932310
sentiment              0.9709017 0.8890854

Time-Varying Covariates

options(scipen = 7)
cox_model <- survival::coxph(formula = Surv(`_t0`, `_t`, dispute) ~ dem + growth +
    allies + contig + capratio + trade, data = BKT.Data)
summary(cox_model)
Call:
survival::coxph(formula = Surv(`_t0`, `_t`, dispute) ~ dem + 
    growth + allies + contig + capratio + trade, data = BKT.Data)

  n= 20990, number of events= 947 

                  coef     exp(coef)      se(coef)      z Pr(>|z|)    
dem       -0.049197226   0.951993354   0.007229125 -6.805 1.01e-11 ***
growth    -0.007580611   0.992448050   0.007492787 -1.012    0.312    
allies    -0.422180458   0.655615717   0.076717663 -5.503 3.73e-08 ***
contig     0.544156705   1.723154641   0.076855065  7.080 1.44e-12 ***
capratio  -0.003017259   0.996987288   0.000403396 -7.480 7.45e-14 ***
trade    -12.292159040   0.000004588  10.071693690 -1.220    0.222    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

           exp(coef)  exp(-coef) lower .95 upper .95
dem      0.951993354      1.0504 9.386e-01    0.9656
growth   0.992448050      1.0076 9.780e-01    1.0071
allies   0.655615717      1.5253 5.641e-01    0.7620
contig   1.723154641      0.5803 1.482e+00    2.0033
capratio 0.996987288      1.0030 9.962e-01    0.9978
trade    0.000004588 217980.0971 1.226e-14 1716.4232

Concordance= 0.703  (se = 0.01 )
Likelihood ratio test= 381.8  on 6 df,   p=<2e-16
Wald test            = 238.3  on 6 df,   p=<2e-16
Score (logrank) test = 267.9  on 6 df,   p=<2e-16

Binary Choice Model

library(haven)
BKT.Data <- read_dta("https://github.com/robertwwalker/xaringan/raw/master/CMF-Week-6/img/bkt98ajps.dta")
cloglog_model <- glm(dispute ~ dem + growth + allies + contig + capratio + trade +
    as.factor(py), data = BKT.Data, family = binomial(link = "cloglog"))
summary(cloglog_model)

Call:
glm(formula = dispute ~ dem + growth + allies + contig + capratio + 
    trade + as.factor(py), family = binomial(link = "cloglog"), 
    data = BKT.Data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4183  -0.2282  -0.1420  -0.0828   3.9654  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -1.113947   0.082348 -13.527  < 2e-16 ***
dem              -0.049384   0.007288  -6.776 1.24e-11 ***
growth           -0.008082   0.007612  -1.062 0.288357    
allies           -0.430045   0.078126  -5.504 3.70e-08 ***
contig            0.554479   0.077111   7.191 6.45e-13 ***
capratio         -0.003023   0.000405  -7.464 8.39e-14 ***
trade           -12.503176   9.960629  -1.255 0.209385    
as.factor(py)2   -1.797014   0.130842 -13.734  < 2e-16 ***
as.factor(py)3   -1.910023   0.145076 -13.166  < 2e-16 ***
as.factor(py)4   -2.130485   0.165704 -12.857  < 2e-16 ***
as.factor(py)5   -2.378740   0.193902 -12.268  < 2e-16 ***
as.factor(py)6   -2.836553   0.246354 -11.514  < 2e-16 ***
as.factor(py)7   -2.831567   0.246584 -11.483  < 2e-16 ***
as.factor(py)8   -2.757412   0.239957 -11.491  < 2e-16 ***
as.factor(py)9   -2.701098   0.239710 -11.268  < 2e-16 ***
as.factor(py)10  -2.990651   0.280582 -10.659  < 2e-16 ***
as.factor(py)11  -3.207771   0.319289 -10.047  < 2e-16 ***
as.factor(py)12  -3.077872   0.305041 -10.090  < 2e-16 ***
as.factor(py)13  -3.670483   0.410654  -8.938  < 2e-16 ***
as.factor(py)14  -4.310370   0.578624  -7.449 9.38e-14 ***
as.factor(py)15  -4.329282   0.578841  -7.479 7.48e-14 ***
as.factor(py)16  -3.209284   0.336350  -9.542  < 2e-16 ***
as.factor(py)17  -3.973393   0.501858  -7.917 2.43e-15 ***
as.factor(py)18  -5.339488   1.000841  -5.335 9.55e-08 ***
as.factor(py)19  -3.367405   0.381031  -8.838  < 2e-16 ***
as.factor(py)20  -3.704984   0.449627  -8.240  < 2e-16 ***
as.factor(py)21  -4.552930   0.708058  -6.430 1.27e-10 ***
as.factor(py)22  -4.069630   0.579535  -7.022 2.18e-12 ***
as.factor(py)23  -3.894665   0.579557  -6.720 1.82e-11 ***
as.factor(py)24  -3.761143   0.579762  -6.487 8.73e-11 ***
as.factor(py)25 -17.294682 317.920692  -0.054 0.956617    
as.factor(py)26  -2.758181   0.411827  -6.697 2.12e-11 ***
as.factor(py)27 -17.282784 369.870471  -0.047 0.962731    
as.factor(py)28 -17.251363 368.516083  -0.047 0.962662    
as.factor(py)29  -3.702860   0.709164  -5.221 1.78e-07 ***
as.factor(py)30  -3.209144   0.579970  -5.533 3.14e-08 ***
as.factor(py)31  -4.201176   1.002675  -4.190 2.79e-05 ***
as.factor(py)32  -3.364920   0.710748  -4.734 2.20e-06 ***
as.factor(py)33  -3.869193   1.001648  -3.863 0.000112 ***
as.factor(py)34  -3.112066   0.709943  -4.384 1.17e-05 ***
as.factor(py)35  -3.651929   1.003808  -3.638 0.000275 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 7719.2  on 20989  degrees of freedom
Residual deviance: 5108.1  on 20949  degrees of freedom
AIC: 5190.1

Number of Fisher Scoring iterations: 17

Visualizing the Strata

stratified_base <- frailtypack::frailtyPenal(formula = Surv(event = left, time = month) ~
    strata(sentiment_category), data = job_retention, n.knots = 12, kappa = rep(5000,
    2))

Be patient. The program is computing ... 
The program took 0.37 seconds 

Visual

plot(stratified_base, type.plot = "Survival", pos.legend = "topright", Xlab = "Month",
    Ylab = "Baseline retention rate", color = 1)

A Result on Baseline Hazards

In a paper by Carter and Signorino, they argue that a cubic function is sufficient for the analysis of most baseline hazards. Because it is cubic, that requires only the insertion of time, time-squared, and time-cubed among the set of predictors in a binary choice model with the cloglog link.

Modified disputes

cloglog_model.s <- glm(dispute ~ dem + growth + allies + contig + capratio + trade +
    poly(py, 3), data = BKT.Data, family = binomial(link = "cloglog"))
summary(cloglog_model.s)

Call:
glm(formula = dispute ~ dem + growth + allies + contig + capratio + 
    trade + poly(py, 3), family = binomial(link = "cloglog"), 
    data = BKT.Data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3282  -0.1974  -0.1359  -0.0902   3.9660  

Coefficients:
                 Estimate   Std. Error z value Pr(>|z|)    
(Intercept)    -4.1139710    0.0926772 -44.390  < 2e-16 ***
dem            -0.0494551    0.0072331  -6.837 8.07e-12 ***
growth         -0.0118322    0.0075851  -1.560    0.119    
allies         -0.4560842    0.0773249  -5.898 3.67e-09 ***
contig          0.5590242    0.0766158   7.296 2.95e-13 ***
capratio       -0.0030813    0.0004072  -7.567 3.81e-14 ***
trade         -14.2260838   10.2085494  -1.394    0.163    
poly(py, 3)1 -129.1817426   10.5291922 -12.269  < 2e-16 ***
poly(py, 3)2   89.6804700   12.9347930   6.933 4.11e-12 ***
poly(py, 3)3  -80.4785273    9.1927794  -8.755  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 7719.2  on 20989  degrees of freedom
Residual deviance: 5295.8  on 20980  degrees of freedom
AIC: 5315.8

Number of Fisher Scoring iterations: 8

A Plot

newdf <- data.frame(dem = 0, allies = 0, growth = 0, contig = 0, capratio = 1, trade = 0,
    py = seq(0, 33))
newdf$pred <- predict(cloglog_model.s, newdata = newdf, type = "response")
ggplot(newdf) + aes(x = py, y = pred) + geom_line() + theme_minimal() + labs(x = "Years of Peace",
    y = "Pr(Dispute)")