A few weeks ago, Jack mentioned the use of principal components as a means of combining collinear variables. There is a more general language for describing models of this sort: latent variable models. The following example builds on work I am currently finishing with Elliot Maltz and a co-author.
First, the data.
There is a lot of data here; let me focus on the specific items we are interested in.
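For contrast, the principal-components approach collapses a set of collinear items into a composite score. A minimal sketch, assuming the four agentic items used below are stored as the numeric columns ...76 through ...79 of EMData.Anonymous; the confirmatory factor models that follow are the latent-variable alternative.
items <- EMData.Anonymous[, c("...76", "...77", "...78", "...79")]  # the four agentic items
pc <- prcomp(na.omit(items), center = TRUE, scale. = TRUE)  # principal components of the items
summary(pc)  # how much variance the first component captures
agentic_pc <- pc$x[, 1]  # the first principal component as a composite score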
[1] "28. You lack career guidance and support [People in the community expect you to be a leader]"
1 2 3 4 5 6 7
22 18 22 21 12 15 5
[1] "28. You lack career guidance and support [Your community encourages you to achieve individual success]"
1 2 3 4 5 6 7
17 19 26 23 18 10 2
[1] "28. You lack career guidance and support [You are expected to be assertive in your interactions with others]"
1 2 3 4 5 6 7
13 27 27 31 11 4 2
[1] "28. You lack career guidance and support [You are expected to have strong opinions]"
1 2 3 4 5 6 7
11 26 19 31 23 3 2
AB <- cfa("Agentic =~ ...76 + ...77 + ...78 + ...79", data = EMData.Anonymous, ordered = TRUE)
summary(AB, fit.measures = TRUE, standardized = TRUE)
lavaan 0.6-12 ended normally after 15 iterations
Estimator DWLS
Optimization method NLMINB
Number of model parameters 28
Number of observations 115
Model Test User Model:
Standard Robust
Test Statistic 7.422 21.832
Degrees of freedom 2 2
P-value (Chi-square) 0.024 0.000
Scaling correction factor 0.341
Shift parameter 0.087
simple second-order correction
Model Test Baseline Model:
Test statistic 2486.114 1657.812
Degrees of freedom 6 6
P-value 0.000 0.000
Scaling correction factor 1.501
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.998 0.988
Tucker-Lewis Index (TLI) 0.993 0.964
Robust Comparative Fit Index (CFI) NA
Robust Tucker-Lewis Index (TLI) NA
Root Mean Square Error of Approximation:
RMSEA 0.154 0.295
90 Percent confidence interval - lower 0.047 0.191
90 Percent confidence interval - upper 0.280 0.412
P-value RMSEA <= 0.05 0.053 0.000
Robust RMSEA NA
90 Percent confidence interval - lower NA
90 Percent confidence interval - upper NA
Standardized Root Mean Square Residual:
SRMR 0.033 0.033
Parameter Estimates:
Standard errors Robust.sem
Information Expected
Information saturated (h1) model Unstructured
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
Agentic =~
...76 1.000 0.853 0.853
...77 1.056 0.051 20.773 0.000 0.901 0.901
...78 1.045 0.042 25.093 0.000 0.891 0.891
...79 0.962 0.044 22.041 0.000 0.820 0.820
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
....76 0.000 0.000 0.000
....77 0.000 0.000 0.000
....78 0.000 0.000 0.000
....79 0.000 0.000 0.000
Agentic 0.000 0.000 0.000
Thresholds:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
...76|t1 -0.873 0.135 -6.459 0.000 -0.873 -0.873
...76|t2 -0.391 0.121 -3.241 0.001 -0.391 -0.391
...76|t3 0.098 0.118 0.835 0.403 0.098 0.098
...76|t4 0.588 0.125 4.702 0.000 0.588 0.588
...76|t5 0.939 0.138 6.790 0.000 0.939 0.939
...76|t6 1.712 0.207 8.262 0.000 1.712 1.712
...77|t1 -1.046 0.144 -7.264 0.000 -1.046 -1.046
...77|t2 -0.487 0.123 -3.975 0.000 -0.487 -0.487
...77|t3 0.098 0.118 0.835 0.403 0.098 0.098
...77|t4 0.641 0.127 5.062 0.000 0.641 0.641
...77|t5 1.257 0.158 7.948 0.000 1.257 1.257
...77|t6 2.111 0.285 7.411 0.000 2.111 2.111
...78|t1 -1.211 0.155 -7.826 0.000 -1.211 -1.211
...78|t2 -0.391 0.121 -3.241 0.001 -0.391 -0.391
...78|t3 0.209 0.118 1.763 0.078 0.209 0.209
...78|t4 1.046 0.144 7.264 0.000 1.046 1.046
...78|t5 1.624 0.195 8.320 0.000 1.624 1.624
...78|t6 2.111 0.285 7.411 0.000 2.111 2.111
...79|t1 -1.307 0.162 -8.058 0.000 -1.307 -1.307
...79|t2 -0.463 0.122 -3.792 0.000 -0.463 -0.463
...79|t3 -0.033 0.117 -0.279 0.781 -0.033 -0.033
...79|t4 0.695 0.128 5.418 0.000 0.695 0.695
...79|t5 1.712 0.207 8.262 0.000 1.712 1.712
...79|t6 2.111 0.285 7.411 0.000 2.111 2.111
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
....76 0.273 0.273 0.273
....77 0.188 0.188 0.188
....78 0.207 0.207 0.207
....79 0.327 0.327 0.327
Agentic 0.727 0.051 14.358 0.000 1.000 1.000
Scales y*:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
...76 1.000 1.000 1.000
...77 1.000 1.000 1.000
...78 1.000 1.000 1.000
...79 1.000 1.000 1.000
[1] "28. You lack career guidance and support [You are expected to be unselfish]"
1 2 3 4 5 6 7
5 13 24 25 25 19 4
[1] "28. You lack career guidance and support [In your interactions with others you are expected to consider others opinions over your own]"
1 2 3 4 5 6 7
9 21 32 28 20 4 1
[1] "28. You lack career guidance and support [Your community expects you to put others interests ahead of your own]"
1 2 3 4 5 6 7
12 21 24 25 20 11 2
CB <- cfa("Communal =~ ...80 + ...81 + ...84", data = EMData.Anonymous, ordered = TRUE)
summary(CB, fit.measures = TRUE, standardized = TRUE)
lavaan 0.6-12 ended normally after 13 iterations
Estimator DWLS
Optimization method NLMINB
Number of model parameters 21
Number of observations 115
Model Test User Model:
Standard Robust
Test Statistic 0.000 0.000
Degrees of freedom 0 0
Model Test Baseline Model:
Test statistic 511.919 427.939
Degrees of freedom 3 3
P-value 0.000 0.000
Scaling correction factor 1.198
User Model versus Baseline Model:
Comparative Fit Index (CFI) 1.000 1.000
Tucker-Lewis Index (TLI) 1.000 1.000
Robust Comparative Fit Index (CFI) NA
Robust Tucker-Lewis Index (TLI) NA
Root Mean Square Error of Approximation:
RMSEA 0.000 0.000
90 Percent confidence interval - lower 0.000 0.000
90 Percent confidence interval - upper 0.000 0.000
P-value RMSEA <= 0.05 NA NA
Robust RMSEA NA
90 Percent confidence interval - lower 0.000
90 Percent confidence interval - upper 0.000
Standardized Root Mean Square Residual:
SRMR 0.000 0.000
Parameter Estimates:
Standard errors Robust.sem
Information Expected
Information saturated (h1) model Unstructured
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
Communal =~
...80 1.000 0.823 0.823
...81 0.908 0.078 11.572 0.000 0.747 0.747
...84 0.983 0.083 11.823 0.000 0.808 0.808
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
....80 0.000 0.000 0.000
....81 0.000 0.000 0.000
....84 0.000 0.000 0.000
Communal 0.000 0.000 0.000
Thresholds:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
...80|t1 -1.712 0.207 -8.262 0.000 -1.712 -1.712
...80|t2 -1.009 0.142 -7.110 0.000 -1.009 -1.009
...80|t3 -0.345 0.120 -2.872 0.004 -0.345 -0.345
...80|t4 0.209 0.118 1.763 0.078 0.209 0.209
...80|t5 0.842 0.134 6.289 0.000 0.842 0.842
...80|t6 1.815 0.223 8.129 0.000 1.815 1.815
...81|t1 -1.417 0.172 -8.235 0.000 -1.417 -1.417
...81|t2 -0.641 0.127 -5.062 0.000 -0.641 -0.641
...81|t3 0.098 0.118 0.835 0.403 0.098 0.098
...81|t4 0.781 0.131 5.945 0.000 0.781 0.781
...81|t5 1.712 0.207 8.262 0.000 1.712 1.712
...81|t6 2.378 0.369 6.451 0.000 2.378 2.378
...84|t1 -1.257 0.158 -7.948 0.000 -1.257 -1.257
...84|t2 -0.562 0.124 -4.521 0.000 -0.562 -0.562
...84|t3 -0.011 0.117 -0.093 0.926 -0.011 -0.011
...84|t4 0.562 0.124 4.521 0.000 0.562 0.562
...84|t5 1.211 0.155 7.826 0.000 1.211 1.211
...84|t6 2.111 0.285 7.411 0.000 2.111 2.111
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
....80 0.323 0.323 0.323
....81 0.443 0.443 0.443
....84 0.347 0.347 0.347
Communal 0.677 0.072 9.442 0.000 1.000 1.000
Scales y*:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
...80 1.000 1.000 1.000
...81 1.000 1.000 1.000
...84 1.000 1.000 1.000
[1] "5. On a scale of 7 to 1, with 7 being very regularly and 1 being very rarely, how regularly do you: [Encourage entrepreneurs to start businesses]"
1 2 3 4 5 6 7
21 27 12 14 10 26 5
[1] "5. On a scale of 7 to 1, with 7 being very regularly and 1 being very rarely, how regularly do you: [Reassure other entrepreneurs when things are not going well]"
1 2 3 4 5 6 7
22 21 10 19 33 4 6
[1] "5. On a scale of 7 to 1, with 7 being very regularly and 1 being very rarely, how regularly do you: [Help other entrepreneurs have confidence they can succeed]"
1 2 3 4 5 6 7
21 29 27 11 21 5 1
M <- cfa("Mentoring =~ ...13 + ...14 + ...15", data = EMData.Anonymous, ordered = TRUE)
summary(M, fit.measures = TRUE, standardized = TRUE)
lavaan 0.6-12 ended normally after 10 iterations
Estimator DWLS
Optimization method NLMINB
Number of model parameters 21
Number of observations 115
Model Test User Model:
Standard Robust
Test Statistic 0.000 0.000
Degrees of freedom 0 0
Model Test Baseline Model:
Test statistic 1726.590 1624.559
Degrees of freedom 3 3
P-value 0.000 0.000
Scaling correction factor 1.063
User Model versus Baseline Model:
Comparative Fit Index (CFI) 1.000 1.000
Tucker-Lewis Index (TLI) 1.000 1.000
Robust Comparative Fit Index (CFI) NA
Robust Tucker-Lewis Index (TLI) NA
Root Mean Square Error of Approximation:
RMSEA 0.000 0.000
90 Percent confidence interval - lower 0.000 0.000
90 Percent confidence interval - upper 0.000 0.000
P-value RMSEA <= 0.05 NA NA
Robust RMSEA NA
90 Percent confidence interval - lower 0.000
90 Percent confidence interval - upper 0.000
Standardized Root Mean Square Residual:
SRMR 0.000 0.000
Parameter Estimates:
Standard errors Robust.sem
Information Expected
Information saturated (h1) model Unstructured
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
Mentoring =~
...13 1.000 0.981 0.981
...14 0.879 0.058 15.193 0.000 0.862 0.862
...15 0.844 0.056 15.063 0.000 0.827 0.827
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
....13 0.000 0.000 0.000
....14 0.000 0.000 0.000
....15 0.000 0.000 0.000
Mentoring 0.000 0.000 0.000
Thresholds:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
...13|t1 -0.905 0.137 -6.626 0.000 -0.905 -0.905
...13|t2 -0.209 0.118 -1.763 0.078 -0.209 -0.209
...13|t3 0.055 0.117 0.464 0.643 0.055 0.055
...13|t4 0.368 0.120 3.057 0.002 0.368 0.368
...13|t5 0.614 0.126 4.882 0.000 0.614 0.614
...13|t6 1.712 0.207 8.262 0.000 1.712 1.712
...14|t1 -0.873 0.135 -6.459 0.000 -0.873 -0.873
...14|t2 -0.322 0.120 -2.688 0.007 -0.322 -0.322
...14|t3 -0.098 0.118 -0.835 0.403 -0.098 -0.098
...14|t4 0.322 0.120 2.688 0.007 0.322 0.322
...14|t5 1.360 0.167 8.155 0.000 1.360 1.360
...14|t6 1.624 0.195 8.320 0.000 1.624 1.624
...15|t1 -0.905 0.137 -6.626 0.000 -0.905 -0.905
...15|t2 -0.164 0.118 -1.392 0.164 -0.164 -0.164
...15|t3 0.439 0.122 3.608 0.000 0.439 0.439
...15|t4 0.723 0.129 5.595 0.000 0.723 0.723
...15|t5 1.624 0.195 8.320 0.000 1.624 1.624
...15|t6 2.378 0.369 6.451 0.000 2.378 2.378
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
....13 0.039 0.039 0.039
....14 0.258 0.258 0.258
....15 0.315 0.315 0.315
Mentoring 0.961 0.077 12.508 0.000 1.000 1.000
Scales y*:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
...13 1.000 1.000 1.000
...14 1.000 1.000 1.000
...15 1.000 1.000 1.000
Now let me combine those measurement models to produce a structural model with two equations: I wish to explain income and employment (the number of people managed) given these factors. A fourth latent variable, Social.Influence, measured by items ...37 through ...39, also enters both equations.
names(EMData)[c(5, 59)]
Struct <- sem("Agentic =~ ...76 + ...77 + ...78 + ...79
Communal =~ ...80 + ...81 + ...84
Mentoring =~ ...13 + ...14 + ...15
Social.Influence =~ ...37 + ...38 + ...39
...59 ~ Agentic + Communal + Mentoring + Social.Influence
...5 ~ Agentic + Communal + Mentoring + Social.Influence",
data = EMData.Anonymous, ordered = c("...13", "...14", "...15", "...80", "...81",
"...84", "...76", "...77", "...78", "...79", "...37", "...38", "...39"))
summary(Struct, fit.measures = TRUE, standardized = TRUE)
[1] "3. What is the current income generated from the business? (USD PM)"
[2] "20. How many people do you manage in your business?"
lavaan 0.6-12 ended normally after 112 iterations
Estimator DWLS
Optimization method NLMINB
Number of model parameters 108
Number of observations 115
Model Test User Model:
Standard Robust
Test Statistic 176.253 255.843
Degrees of freedom 77 77
P-value (Chi-square) 0.000 0.000
Scaling correction factor 0.804
Shift parameter 36.682
simple second-order correction
Model Test Baseline Model:
Test statistic 9489.339 3694.019
Degrees of freedom 105 105
P-value 0.000 0.000
Scaling correction factor 2.615
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.989 0.950
Tucker-Lewis Index (TLI) 0.986 0.932
Robust Comparative Fit Index (CFI) NA
Robust Tucker-Lewis Index (TLI) NA
Root Mean Square Error of Approximation:
RMSEA 0.106 0.143
90 Percent confidence interval - lower 0.086 0.124
90 Percent confidence interval - upper 0.127 0.162
P-value RMSEA <= 0.05 0.000 0.000
Robust RMSEA NA
90 Percent confidence interval - lower NA
90 Percent confidence interval - upper NA
Standardized Root Mean Square Residual:
SRMR 0.075 0.075
Parameter Estimates:
Standard errors Robust.sem
Information Expected
Information saturated (h1) model Unstructured
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
Agentic =~
...76 1.000 0.812 0.812
...77 1.110 0.051 21.573 0.000 0.901 0.901
...78 1.076 0.044 24.545 0.000 0.873 0.873
...79 1.097 0.047 23.283 0.000 0.890 0.890
Communal =~
...80 1.000 0.715 0.715
...81 1.233 0.126 9.803 0.000 0.881 0.881
...84 1.103 0.091 12.190 0.000 0.788 0.788
Mentoring =~
...13 1.000 0.961 0.961
...14 0.908 0.057 15.897 0.000 0.872 0.872
...15 0.876 0.054 16.228 0.000 0.842 0.842
Social.Influence =~
...37 1.000 0.948 0.948
...38 1.004 0.030 33.882 0.000 0.952 0.952
...39 0.941 0.029 32.029 0.000 0.892 0.892
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
...59 ~
Agentic -0.087 0.063 -1.386 0.166 -0.071 -0.052
Communal 0.119 0.108 1.100 0.271 0.085 0.062
Mentoring 0.166 0.075 2.213 0.027 0.159 0.116
Social.Influnc -0.422 0.055 -7.620 0.000 -0.400 -0.292
...5 ~
Agentic 51.893 19.705 2.633 0.008 42.118 0.205
Communal -9.427 20.599 -0.458 0.647 -6.736 -0.033
Mentoring 1.218 15.046 0.081 0.935 1.170 0.006
Social.Influnc 35.014 16.427 2.131 0.033 33.195 0.161
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
Agentic ~~
Communal 0.259 0.044 5.821 0.000 0.447 0.447
Mentoring 0.203 0.062 3.269 0.001 0.261 0.261
Social.Influnc 0.247 0.056 4.423 0.000 0.321 0.321
Communal ~~
Mentoring 0.134 0.057 2.329 0.020 0.195 0.195
Social.Influnc 0.308 0.062 4.963 0.000 0.454 0.454
Mentoring ~~
Social.Influnc 0.067 0.074 0.904 0.366 0.074 0.074
....59 ~~
....5 78.509 19.599 4.006 0.000 78.509 0.304
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
....76 0.000 0.000 0.000
....77 0.000 0.000 0.000
....78 0.000 0.000 0.000
....79 0.000 0.000 0.000
....80 0.000 0.000 0.000
....81 0.000 0.000 0.000
....84 0.000 0.000 0.000
....13 0.000 0.000 0.000
....14 0.000 0.000 0.000
....15 0.000 0.000 0.000
....37 0.000 0.000 0.000
....38 0.000 0.000 0.000
....39 0.000 0.000 0.000
....59 0.435 0.305 1.427 0.154 0.435 0.317
....5 380.609 23.778 16.007 0.000 380.609 1.849
Agentic 0.000 0.000 0.000
Communal 0.000 0.000 0.000
Mentoring 0.000 0.000 0.000
Social.Influnc 0.000 0.000 0.000
Thresholds:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
...76|t1 -0.873 0.135 -6.459 0.000 -0.873 -0.873
...76|t2 -0.391 0.121 -3.241 0.001 -0.391 -0.391
...76|t3 0.098 0.118 0.835 0.403 0.098 0.098
...76|t4 0.588 0.125 4.702 0.000 0.588 0.588
...76|t5 0.939 0.138 6.790 0.000 0.939 0.939
...76|t6 1.712 0.207 8.262 0.000 1.712 1.712
...77|t1 -1.046 0.144 -7.264 0.000 -1.046 -1.046
...77|t2 -0.487 0.123 -3.975 0.000 -0.487 -0.487
...77|t3 0.098 0.118 0.835 0.403 0.098 0.098
...77|t4 0.641 0.127 5.062 0.000 0.641 0.641
...77|t5 1.257 0.158 7.948 0.000 1.257 1.257
...77|t6 2.111 0.285 7.411 0.000 2.111 2.111
...78|t1 -1.211 0.155 -7.826 0.000 -1.211 -1.211
...78|t2 -0.391 0.121 -3.241 0.001 -0.391 -0.391
...78|t3 0.209 0.118 1.763 0.078 0.209 0.209
...78|t4 1.046 0.144 7.264 0.000 1.046 1.046
...78|t5 1.624 0.195 8.320 0.000 1.624 1.624
...78|t6 2.111 0.285 7.411 0.000 2.111 2.111
...79|t1 -1.307 0.162 -8.058 0.000 -1.307 -1.307
...79|t2 -0.463 0.122 -3.792 0.000 -0.463 -0.463
...79|t3 -0.033 0.117 -0.279 0.781 -0.033 -0.033
...79|t4 0.695 0.128 5.418 0.000 0.695 0.695
...79|t5 1.712 0.207 8.262 0.000 1.712 1.712
...79|t6 2.111 0.285 7.411 0.000 2.111 2.111
...80|t1 -1.712 0.207 -8.262 0.000 -1.712 -1.712
...80|t2 -1.009 0.142 -7.110 0.000 -1.009 -1.009
...80|t3 -0.345 0.120 -2.872 0.004 -0.345 -0.345
...80|t4 0.209 0.118 1.763 0.078 0.209 0.209
...80|t5 0.842 0.134 6.289 0.000 0.842 0.842
...80|t6 1.815 0.223 8.129 0.000 1.815 1.815
...81|t1 -1.417 0.172 -8.235 0.000 -1.417 -1.417
...81|t2 -0.641 0.127 -5.062 0.000 -0.641 -0.641
...81|t3 0.098 0.118 0.835 0.403 0.098 0.098
...81|t4 0.781 0.131 5.945 0.000 0.781 0.781
...81|t5 1.712 0.207 8.262 0.000 1.712 1.712
...81|t6 2.378 0.369 6.451 0.000 2.378 2.378
...84|t1 -1.257 0.158 -7.948 0.000 -1.257 -1.257
...84|t2 -0.562 0.124 -4.521 0.000 -0.562 -0.562
...84|t3 -0.011 0.117 -0.093 0.926 -0.011 -0.011
...84|t4 0.562 0.124 4.521 0.000 0.562 0.562
...84|t5 1.211 0.155 7.826 0.000 1.211 1.211
...84|t6 2.111 0.285 7.411 0.000 2.111 2.111
...13|t1 -0.905 0.137 -6.626 0.000 -0.905 -0.905
...13|t2 -0.209 0.118 -1.763 0.078 -0.209 -0.209
...13|t3 0.055 0.117 0.464 0.643 0.055 0.055
...13|t4 0.368 0.120 3.057 0.002 0.368 0.368
...13|t5 0.614 0.126 4.882 0.000 0.614 0.614
...13|t6 1.712 0.207 8.262 0.000 1.712 1.712
...14|t1 -0.873 0.135 -6.459 0.000 -0.873 -0.873
...14|t2 -0.322 0.120 -2.688 0.007 -0.322 -0.322
...14|t3 -0.098 0.118 -0.835 0.403 -0.098 -0.098
...14|t4 0.322 0.120 2.688 0.007 0.322 0.322
...14|t5 1.360 0.167 8.155 0.000 1.360 1.360
...14|t6 1.624 0.195 8.320 0.000 1.624 1.624
...15|t1 -0.905 0.137 -6.626 0.000 -0.905 -0.905
...15|t2 -0.164 0.118 -1.392 0.164 -0.164 -0.164
...15|t3 0.439 0.122 3.608 0.000 0.439 0.439
...15|t4 0.723 0.129 5.595 0.000 0.723 0.723
...15|t5 1.624 0.195 8.320 0.000 1.624 1.624
...15|t6 2.378 0.369 6.451 0.000 2.378 2.378
...37|t1 -0.752 0.130 -5.771 0.000 -0.752 -0.752
...37|t2 0.142 0.118 1.207 0.228 0.142 0.142
...37|t3 0.939 0.138 6.790 0.000 0.939 0.939
...37|t4 1.124 0.149 7.558 0.000 1.124 1.124
...37|t5 1.548 0.186 8.325 0.000 1.548 1.548
...38|t1 -0.723 0.129 -5.595 0.000 -0.723 -0.723
...38|t2 0.209 0.118 1.763 0.078 0.209 0.209
...38|t3 0.939 0.138 6.790 0.000 0.939 0.939
...38|t4 1.166 0.152 7.696 0.000 1.166 1.166
...38|t5 1.548 0.186 8.325 0.000 1.548 1.548
...39|t1 -0.641 0.127 -5.062 0.000 -0.641 -0.641
...39|t2 0.231 0.119 1.948 0.051 0.231 0.231
...39|t3 1.046 0.144 7.264 0.000 1.046 1.046
...39|t4 1.211 0.155 7.826 0.000 1.211 1.211
...39|t5 1.624 0.195 8.320 0.000 1.624 1.624
...39|t6 2.378 0.369 6.451 0.000 2.378 2.378
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
....76 0.341 0.341 0.341
....77 0.188 0.188 0.188
....78 0.238 0.238 0.238
....79 0.207 0.207 0.207
....80 0.489 0.489 0.489
....81 0.224 0.224 0.224
....84 0.378 0.378 0.378
....13 0.077 0.077 0.077
....14 0.239 0.239 0.239
....15 0.292 0.292 0.292
....37 0.101 0.101 0.101
....38 0.094 0.094 0.094
....39 0.204 0.204 0.204
....59 1.711 0.147 11.666 0.000 1.711 0.910
....5 38984.352 4516.263 8.632 0.000 38984.352 0.920
Agentic 0.659 0.048 13.761 0.000 1.000 1.000
Communal 0.511 0.068 7.492 0.000 1.000 1.000
Mentoring 0.923 0.072 12.824 0.000 1.000 1.000
Social.Influnc 0.899 0.033 27.070 0.000 1.000 1.000
Scales y*:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
...76 1.000 1.000 1.000
...77 1.000 1.000 1.000
...78 1.000 1.000 1.000
...79 1.000 1.000 1.000
...80 1.000 1.000 1.000
...81 1.000 1.000 1.000
...84 1.000 1.000 1.000
...13 1.000 1.000 1.000
...14 1.000 1.000 1.000
...15 1.000 1.000 1.000
...37 1.000 1.000 1.000
...38 1.000 1.000 1.000
...39 1.000 1.000 1.000
There are two classes of models commonly used for survival data. The text focuses on the Cox model, though there are also AFT (accelerated failure time) models for parametric analysis of survival-time data. First, the data structure.
The most common applications come from life tables: how long do individuals live beyond some intervention? In people analytics, how long do employees stay with a firm, or how long do customers maintain a relationship with a firm or service provider?
There will be two key components: events and time.
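In R, the survival package bundles these two components into a Surv object. A minimal sketch with made-up data (the month and left names mirror the job_retention variables used below):
library(survival)
# Toy data: months observed and whether the person left (1) or was still there (0)
toy <- data.frame(month = c(3, 7, 12, 9), left = c(1, 1, 0, 0))
Surv(time = toy$month, event = toy$left)  # censored times print with a trailing +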
# A tibble: 20,990 × 53
dispute dem growth allies contig capratio trade py pys1 pys2 pys3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 -9 -2.5 0 1 2.66 0 1 0 0 0
2 0 -7 2.35 0 1 1.93 0.00200 1 0 0 0
3 0 -3 -5.27 0 0 10.7 0 1 0 0 0
4 0 5 2.29 0 0 160. 0 1 0 0 0
5 0 -9 -13.6 0 1 1.07 0 1 0 0 0
6 0 -7 0.320 0 1 1.19 0.00300 1 0 0 0
7 0 -7 2.10 1 1 1.91 0.00460 1 0 0 0
8 0 -8 4.14 1 1 26.7 0 1 0 0 0
9 1 -7 -0.230 0 1 1.14 0.000600 1 0 0 0
10 1 -9 2.47 1 1 1.42 0 1 0 0 0
# … with 20,980 more rows, and 42 more variables: contdisp <dbl>,
# prefail <dbl>, tzero <dbl>, `_st` <dbl>, `_d` <dbl>, `_t` <dbl>,
# `_t0` <dbl>, k01 <dbl>, k02 <dbl>, k03 <dbl>, k04 <dbl>, k05 <dbl>,
# k06 <dbl>, k07 <dbl>, k08 <dbl>, k09 <dbl>, k010 <dbl>, k011 <dbl>,
# k012 <dbl>, k013 <dbl>, k014 <dbl>, k015 <dbl>, k016 <dbl>, k017 <dbl>,
# k018 <dbl>, k019 <dbl>, k020 <dbl>, k021 <dbl>, k022 <dbl>, k023 <dbl>,
# k024 <dbl>, k025 <dbl>, k026 <dbl>, k027 <dbl>, k028 <dbl>, k029 <dbl>, …
What happens when the event has yet to happen? The observation is censored. Those are the observations that print with a + in a Surv object (as in the toy example above). What does it mean? Data collection ended before the event occurred. What’s the logic? We know the unit survived at least that long.
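The tables that follow come from a Kaplan-Meier fit stratified by sentiment category. Reconstructed from the Call: line in the output, the code is roughly the following (the object name km_fit is my own):
library(survival)
km_fit <- survfit(Surv(event = left, time = month) ~ sentiment_category, data = job_retention)
summary(km_fit)  # per-month survival estimates within each sentiment stratum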
Call: survfit(formula = Surv(event = left, time = month) ~ sentiment_category,
data = job_retention)
sentiment_category=High
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1 3225 15 0.995 0.00120 0.993 0.998
2 3167 62 0.976 0.00272 0.971 0.981
3 3075 120 0.938 0.00429 0.929 0.946
4 2932 102 0.905 0.00522 0.895 0.915
5 2802 65 0.884 0.00571 0.873 0.895
6 2700 144 0.837 0.00662 0.824 0.850
7 2532 110 0.801 0.00718 0.787 0.815
8 2389 140 0.754 0.00778 0.739 0.769
9 2222 112 0.716 0.00818 0.700 0.732
10 2077 56 0.696 0.00835 0.680 0.713
11 1994 134 0.650 0.00871 0.633 0.667
12 1827 45 0.634 0.00882 0.617 0.651
sentiment_category=Not High
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1 545 9 0.983 0.00546 0.973 0.994
2 532 28 0.932 0.01084 0.911 0.953
3 500 25 0.885 0.01373 0.859 0.912
4 472 29 0.831 0.01618 0.800 0.863
5 438 21 0.791 0.01758 0.757 0.826
6 411 27 0.739 0.01906 0.703 0.777
7 379 18 0.704 0.01987 0.666 0.744
8 357 21 0.662 0.02065 0.623 0.704
9 330 24 0.614 0.02136 0.574 0.658
10 302 15 0.584 0.02171 0.543 0.628
11 283 24 0.534 0.02209 0.493 0.579
12 254 8 0.517 0.02218 0.476 0.563
The canonical plot is the Kaplan-Meier survival curve.
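A sketch of that plot with base graphics, refitting the curve so the chunk stands alone (packages such as survminer or ggsurvfit give more polished versions):
library(survival)
km_fit <- survfit(Surv(event = left, time = month) ~ sentiment_category, data = job_retention)
# Step-function survival curves for each sentiment group, with confidence bands
plot(km_fit, conf.int = TRUE, col = c("black", "grey50"), xlab = "Month", ylab = "Proportion remaining")
legend("bottomleft", legend = names(km_fit$strata), col = c("black", "grey50"), lty = 1)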
The key to such modelling is the definition of a hazard function. Let h(t) be the hazard: the instantaneous rate at which the event occurs at time t among units that have survived up to t. The Cox proportional-hazards model writes the hazard as a baseline hazard h_0(t) scaled by the covariates:
h(t) = h_{0}(t)\exp(\beta_{1}x_{1} + \beta_{2}x_{2} + \ldots + \beta_{k}x_{k})
The outcome needs two pieces, the event and the time, bundled together in a Surv object. Let’s model leaving as a function of gender, field, level, and sentiment.
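The summary below comes from fitting that Cox model. Reconstructed from the Call: line, the code is roughly the following (the object name cox_jr is my own):
library(survival)
cox_jr <- coxph(Surv(event = left, time = month) ~ gender + field + level + sentiment, data = job_retention)
summary(cox_jr)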
Call:
survival::coxph(formula = Surv(event = left, time = month) ~
gender + field + level + sentiment, data = job_retention)
n= 3770, number of events= 1354
coef exp(coef) se(coef) z Pr(>|z|)
genderM -0.04548 0.95553 0.05886 -0.773 0.439647
fieldFinance 0.22334 1.25025 0.06681 3.343 0.000829 ***
fieldHealth 0.27830 1.32089 0.12890 2.159 0.030849 *
fieldLaw 0.10532 1.11107 0.14515 0.726 0.468086
fieldPublic/Government 0.11499 1.12186 0.08899 1.292 0.196277
fieldSales/Marketing 0.08776 1.09173 0.10211 0.859 0.390082
levelLow 0.14813 1.15967 0.09000 1.646 0.099799 .
levelMedium 0.17666 1.19323 0.10203 1.732 0.083362 .
sentiment -0.11756 0.88909 0.01397 -8.415 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
genderM 0.9555 1.0465 0.8514 1.0724
fieldFinance 1.2502 0.7998 1.0968 1.4252
fieldHealth 1.3209 0.7571 1.0260 1.7005
fieldLaw 1.1111 0.9000 0.8360 1.4767
fieldPublic/Government 1.1219 0.8914 0.9423 1.3356
fieldSales/Marketing 1.0917 0.9160 0.8937 1.3336
levelLow 1.1597 0.8623 0.9721 1.3834
levelMedium 1.1932 0.8381 0.9770 1.4574
sentiment 0.8891 1.1248 0.8651 0.9138
Concordance= 0.578 (se = 0.008 )
Likelihood ratio test= 89.18 on 9 df, p=2e-15
Wald test = 94.95 on 9 df, p=<2e-16
Score (logrank) test = 95.31 on 9 df, p=<2e-16
As a table:
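A sketch of one way to build that table, assuming the broom package and the cox_jr object named above (the original may well have used a different tool):
library(survival)
library(broom)
cox_jr <- coxph(Surv(event = left, time = month) ~ gender + field + level + sentiment, data = job_retention)
tidy(cox_jr, exponentiate = TRUE, conf.int = TRUE)  # hazard ratios with 95% intervals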
Frailty models combine the insights of the hierarchical models from the last chapter with survival analysis. A frailty model augments the baseline hazard with a parametric random effect. In this example, let’s give the field variable a shared frailty.
library(frailtypack)
(frailty_model <- frailtypack::frailtyPenal(formula = Surv(event = left, time = month) ~
gender + level + sentiment + cluster(field), data = job_retention, n.knots = 12,
kappa = 10000))
Be patient. The program is computing ...
The program took 0.48 seconds
Call:
frailtypack::frailtyPenal(formula = Surv(event = left, time = month) ~
gender + level + sentiment + cluster(field), data = job_retention,
n.knots = 12, kappa = 10000)
Shared Gamma Frailty model parameter estimates
using a Penalized Likelihood on the hazard function
coef exp(coef) SE coef (H) SE coef (HIH) z p
genderM -0.029530 0.970902 0.0591105 0.0591105 -0.499574 6.1738e-01
levelLow 0.198549 1.219632 0.0914449 0.0914449 2.171245 2.9913e-02
levelMedium 0.223268 1.250155 0.1032712 0.1032712 2.161954 3.0622e-02
sentiment -0.108262 0.897393 0.0141327 0.0141327 -7.660398 1.8541e-14
chisq df global p
level 5.32143 2 0.0699
Frailty parameter, Theta: 48.3212 (SE (H): 25.6198 ) p = 0.029642
penalized marginal log-likelihood = -5510.36
Convergence criteria:
parameters = 9.61e-06 likelihood = 1.48e-06 gradient = 1.73e-09
LCV = the approximate likelihood cross-validation criterion
in the semi parametrical case = 1.46587
n= 3770
n events= 1354 n groups= 6
number of iterations: 17
Exact number of knots used: 12
Value of the smoothing parameter: 10000, DoF: 6.31
Two comments. First, there is a test of the frailty: is there evidence of variation across fields? That is the test of the Theta parameter above. Second, do the regression parameters change? A rough way to put the two sets of hazard ratios side by side is sketched after the comparison below.
[,1] [,2]
genderM 0.9709017 0.9555343
fieldFinance 1.2196322 1.2502466
fieldHealth 1.2501551 1.3208885
fieldLaw 0.8973926 1.1110687
fieldPublic/Government 0.9709017 1.1218640
fieldSales/Marketing 1.2196322 1.0917301
levelLow 1.2501551 1.1596653
levelMedium 0.8973926 1.1932310
sentiment 0.9709017 0.8890854
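A sketch of how such a comparison might be assembled, aligning coefficients by name so only the terms the two models share are compared. It assumes the cox_jr object from the earlier Cox sketch and that frailtyPenal stores its fixed-effect coefficients in a named $coef component (both are assumptions):
# Hazard ratios for the coefficients common to the frailty and Cox fits
shared <- intersect(names(frailty_model$coef), names(coef(cox_jr)))
round(cbind(frailty = exp(frailty_model$coef[shared]), cox = exp(coef(cox_jr)[shared])), 4)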
options(scipen = 7)
cox_model <- survival::coxph(formula = Surv(`_t0`, `_t`, dispute) ~ dem + growth +
allies + contig + capratio + trade, data = BKT.Data)
summary(cox_model)
Call:
survival::coxph(formula = Surv(`_t0`, `_t`, dispute) ~ dem +
growth + allies + contig + capratio + trade, data = BKT.Data)
n= 20990, number of events= 947
coef exp(coef) se(coef) z Pr(>|z|)
dem -0.049197226 0.951993354 0.007229125 -6.805 1.01e-11 ***
growth -0.007580611 0.992448050 0.007492787 -1.012 0.312
allies -0.422180458 0.655615717 0.076717663 -5.503 3.73e-08 ***
contig 0.544156705 1.723154641 0.076855065 7.080 1.44e-12 ***
capratio -0.003017259 0.996987288 0.000403396 -7.480 7.45e-14 ***
trade -12.292159040 0.000004588 10.071693690 -1.220 0.222
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
dem 0.951993354 1.0504 9.386e-01 0.9656
growth 0.992448050 1.0076 9.780e-01 1.0071
allies 0.655615717 1.5253 5.641e-01 0.7620
contig 1.723154641 0.5803 1.482e+00 2.0033
capratio 0.996987288 1.0030 9.962e-01 0.9978
trade 0.000004588 217980.0971 1.226e-14 1716.4232
Concordance= 0.703 (se = 0.01 )
Likelihood ratio test= 381.8 on 6 df, p=<2e-16
Wald test = 238.3 on 6 df, p=<2e-16
Score (logrank) test = 267.9 on 6 df, p=<2e-16
library(haven)
BKT.Data <- read_dta("https://github.com/robertwwalker/xaringan/raw/master/CMF-Week-6/img/bkt98ajps.dta")
cloglog_model <- glm(dispute ~ dem + growth + allies + contig + capratio + trade +
as.factor(py), data = BKT.Data, family = binomial(link = "cloglog"))
summary(cloglog_model)
Call:
glm(formula = dispute ~ dem + growth + allies + contig + capratio +
trade + as.factor(py), family = binomial(link = "cloglog"),
data = BKT.Data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4183 -0.2282 -0.1420 -0.0828 3.9654
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.113947 0.082348 -13.527 < 2e-16 ***
dem -0.049384 0.007288 -6.776 1.24e-11 ***
growth -0.008082 0.007612 -1.062 0.288357
allies -0.430045 0.078126 -5.504 3.70e-08 ***
contig 0.554479 0.077111 7.191 6.45e-13 ***
capratio -0.003023 0.000405 -7.464 8.39e-14 ***
trade -12.503176 9.960629 -1.255 0.209385
as.factor(py)2 -1.797014 0.130842 -13.734 < 2e-16 ***
as.factor(py)3 -1.910023 0.145076 -13.166 < 2e-16 ***
as.factor(py)4 -2.130485 0.165704 -12.857 < 2e-16 ***
as.factor(py)5 -2.378740 0.193902 -12.268 < 2e-16 ***
as.factor(py)6 -2.836553 0.246354 -11.514 < 2e-16 ***
as.factor(py)7 -2.831567 0.246584 -11.483 < 2e-16 ***
as.factor(py)8 -2.757412 0.239957 -11.491 < 2e-16 ***
as.factor(py)9 -2.701098 0.239710 -11.268 < 2e-16 ***
as.factor(py)10 -2.990651 0.280582 -10.659 < 2e-16 ***
as.factor(py)11 -3.207771 0.319289 -10.047 < 2e-16 ***
as.factor(py)12 -3.077872 0.305041 -10.090 < 2e-16 ***
as.factor(py)13 -3.670483 0.410654 -8.938 < 2e-16 ***
as.factor(py)14 -4.310370 0.578624 -7.449 9.38e-14 ***
as.factor(py)15 -4.329282 0.578841 -7.479 7.48e-14 ***
as.factor(py)16 -3.209284 0.336350 -9.542 < 2e-16 ***
as.factor(py)17 -3.973393 0.501858 -7.917 2.43e-15 ***
as.factor(py)18 -5.339488 1.000841 -5.335 9.55e-08 ***
as.factor(py)19 -3.367405 0.381031 -8.838 < 2e-16 ***
as.factor(py)20 -3.704984 0.449627 -8.240 < 2e-16 ***
as.factor(py)21 -4.552930 0.708058 -6.430 1.27e-10 ***
as.factor(py)22 -4.069630 0.579535 -7.022 2.18e-12 ***
as.factor(py)23 -3.894665 0.579557 -6.720 1.82e-11 ***
as.factor(py)24 -3.761143 0.579762 -6.487 8.73e-11 ***
as.factor(py)25 -17.294682 317.920692 -0.054 0.956617
as.factor(py)26 -2.758181 0.411827 -6.697 2.12e-11 ***
as.factor(py)27 -17.282784 369.870471 -0.047 0.962731
as.factor(py)28 -17.251363 368.516083 -0.047 0.962662
as.factor(py)29 -3.702860 0.709164 -5.221 1.78e-07 ***
as.factor(py)30 -3.209144 0.579970 -5.533 3.14e-08 ***
as.factor(py)31 -4.201176 1.002675 -4.190 2.79e-05 ***
as.factor(py)32 -3.364920 0.710748 -4.734 2.20e-06 ***
as.factor(py)33 -3.869193 1.001648 -3.863 0.000112 ***
as.factor(py)34 -3.112066 0.709943 -4.384 1.17e-05 ***
as.factor(py)35 -3.651929 1.003808 -3.638 0.000275 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 7719.2 on 20989 degrees of freedom
Residual deviance: 5108.1 on 20949 degrees of freedom
AIC: 5190.1
Number of Fisher Scoring iterations: 17
Carter and Signorino argue that a cubic function of time is sufficient to capture most baseline hazards. Because it is cubic, it requires only inserting time, time squared, and time cubed among the predictors in a binary choice model with the cloglog link. The code below uses poly(py, 3), an orthogonal-polynomial parameterization of the same cubic.
cloglog_model.s <- glm(dispute ~ dem + growth + allies + contig + capratio + trade +
poly(py, 3), data = BKT.Data, family = binomial(link = "cloglog"))
summary(cloglog_model.s)
Call:
glm(formula = dispute ~ dem + growth + allies + contig + capratio +
trade + poly(py, 3), family = binomial(link = "cloglog"),
data = BKT.Data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3282 -0.1974 -0.1359 -0.0902 3.9660
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.1139710 0.0926772 -44.390 < 2e-16 ***
dem -0.0494551 0.0072331 -6.837 8.07e-12 ***
growth -0.0118322 0.0075851 -1.560 0.119
allies -0.4560842 0.0773249 -5.898 3.67e-09 ***
contig 0.5590242 0.0766158 7.296 2.95e-13 ***
capratio -0.0030813 0.0004072 -7.567 3.81e-14 ***
trade -14.2260838 10.2085494 -1.394 0.163
poly(py, 3)1 -129.1817426 10.5291922 -12.269 < 2e-16 ***
poly(py, 3)2 89.6804700 12.9347930 6.933 4.11e-12 ***
poly(py, 3)3 -80.4785273 9.1927794 -8.755 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 7719.2 on 20989 degrees of freedom
Residual deviance: 5295.8 on 20980 degrees of freedom
AIC: 5315.8
Number of Fisher Scoring iterations: 8
library(ggplot2)  # attach ggplot2 for the plot, if it is not already loaded
newdf <- data.frame(dem = 0, allies = 0, growth = 0, contig = 0, capratio = 1, trade = 0,
    py = seq(0, 33))
newdf$pred <- predict(cloglog_model.s, newdata = newdf, type = "response")
ggplot(newdf) + aes(x = py, y = pred) + geom_line() + theme_minimal() + labs(x = "Years of Peace",
    y = "Pr(Dispute)")