Multiple Regresion with Alumni Giving

Robert W. Walker

A Regression Puzzle

Three of them. To be precise.

A Plot

Three on Two

x_1

x_2

Both

What’s going on?

Both are related to the outcome. Both are even more strongly related to each other. Parsing the difference between them is impossible, there is not enough unique variance. A big F so the model accounts for variation alongside small t’s suggest exactly this. Because it inflates the variance of the slope: hence Variance Inflation Factors. A common solution is extracting the principal component but it comes at a cost….. Levers.

Alumni Giving

The setup

The director of development at a large state university is under pressure to raise giving rates to conform to US News and World Reports ranking that place an important weight on the average alumni giving rate - a metric that is a heavy component of their overall ranking of institutions. The institution’s current rate of 8% is in the low category and raising it could play a key role in raising the overall ranking of the university from the current ranking in the mid 130s. You have the following data available for a sample of schools that fielded football teams in the NCAA Football Bowl Subdivision [this is the sampling criterion because it includes almost all big state universities and big private institutions]. There is no reason to believe that this is unrepresentative for our purposes.

Data

You have the following data available for a sample of schools that fielded football teams in the NCAA Football Bowl Subdivision [this is the sampling criterion because it includes almost all big state universities and big private institutions]. There is no reason to believe that this is unrepresentative for our purposes. Of course, the goal is to explain giving rates in GivingRates.RData. Once we have done this, we can evaluate our development director but we need benchmarking.

Description

Variable	Description
ID	An identifier of the school running from 1 to 125.
School	The name of the school.
SFR	the Student to Faculty ratio [the number of students divided by the number of faculty]
Small.Classes	Percentage of classes with fewer than 20 students.
Big.Classes	Percentage of classes with greater than 50 students.
Graduation.Rate	the six year rate of graduation.
Freshman.Retention	the Freshman retention rate or the rate at which first year students are retained into the second year.
Giving	the average alumni giving rate for the institution.
Special	A Yes or No variable denoting schools that were involved in special giving campaigns tied to the football program.

Load the data

load(url("https://robertwwalker.github.io/xaringan/AlumniGiving/data/AlumniGiving.rdata"))

Automagic Data Analysis

install.packages("SmartEDA")
library(SmartEDA)
ExpReport(AlumniGiving, Target="Giving", op_file="EDA-Giving.html")

Understanding Giving

The data are quite skewed to the right with a few seemingly large and outlying values. The lower half [below the median] spans a much smaller range than the upper half even without considering those values.

AlumniGiving %>%
    ggplot() + aes(x = Giving) + geom_boxplot()

Single mean [what’s average giving with 95% confidence?]

t.test(AlumniGiving$Giving)


    One Sample t-test

data:  AlumniGiving$Giving
t = 19, df = 122, p-value <2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.127 0.156
sample estimates:
mean of x 
    0.142

What’s Up with Special?

Comparing independent samples, 0.1 to 0.34 higher for Special schools with 95% confidence.

t.test(Giving ~ Special, data = AlumniGiving)


    Welch Two Sample t-test

data:  Giving by Special
t = -7, df = 2, p-value = 0.01
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
 -0.3362 -0.0974
sample estimates:
 mean in group No mean in group Yes 
            0.137             0.353

Big Classes or Small Classes more Common?

With 95% confidence, the average school has 0.24 to 0.30 more small classes than big classes.

t.test(AlumniGiving$Small.Classes, AlumniGiving$Big.Classes, paired = TRUE)


    Paired t-test

data:  AlumniGiving$Small.Classes and AlumniGiving$Big.Classes
t = 17, df = 122, p-value <2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 0.236 0.299
sample estimates:
mean difference 
          0.267

Look at the data

GGally::ggpairs({
    AlumniGiving %>%
        select(-School)
})

A Regression

Estimate and report a regression using all of the potential predictors [except the ID of course, and Special] to explain Giving. With 95% confidence, what factors seem to explain giving? Only Small.Classes. Interpret the residual standard error to summarize how well the regression with all predictors fits. The residual standard error is 0.056; on average, the regression is off for a particular school by 0.056. Does the model predict variance? Yes, 54% of the variance in alumni giving can be accounted for by these factors. The F-statistic is far too large for this quantity of variance to be explained by chance alone.

# Model.LM <-
# lm(Giving~SFR+Small.Classes+Big.Classes+Graduation.Rate+Freshman.Retention,
# data=AlumniGiving) summary(Model.LM) AlumniGiving$ResidsNoS <-
# residuals(Model.LM)
result <- regress(AlumniGiving, rvar = "Giving", evar = c("SFR", "Small.Classes",
    "Big.Classes", "Graduation.Rate", "Freshman.Retention"))
summary(result, sum_check = "sumsquares")

Linear regression (OLS)
Data     : AlumniGiving 
Response variable    : Giving 
Explanatory variables: SFR, Small.Classes, Big.Classes, Graduation.Rate, Freshman.Retention 
Null hyp.: the effect of x on Giving is zero
Alt. hyp.: the effect of x on Giving is not zero

                    coefficient std.error t.value p.value  
 (Intercept)             -0.195     0.114  -1.710   0.090 .
 SFR                     -0.001     0.002  -0.617   0.538  
 Small.Classes            0.151     0.064   2.339   0.021 *
 Big.Classes             -0.032     0.119  -0.270   0.787  
 Graduation.Rate          0.130     0.098   1.324   0.188  
 Freshman.Retention       0.257     0.177   1.452   0.149  

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-squared: 0.533,  Adjusted R-squared: 0.513 
F-statistic: 26.7 df(5,117), p.value < .001
Nr obs: 123 

Sum of squares:
            df    SS
Regression   5 0.423
Error      117 0.371
Total      122 0.794

AlumniGiving <- store(AlumniGiving, result, name = "ResidsNoS")

Normal Residuals?

Are the residuals from the previous regression normal? No. There are a few schools that are quite poorly explained with large amounts of excess giving given the model. Note the long right tail on the residuals. Provide convincing evidence that they are. Or Fix it with Special?

It is hard to say definitively, most everything conforms but there is also a disquieting pattern of large positive residuals; more than the quantile-quantile plot would suggest there should be.

May need to install:
install.packages("gvlma")

Use gvlma

library(gvlma)
gvlma(result$model)


Call:
lm(formula = form_upper, data = dataset)

Coefficients:
       (Intercept)                 SFR       Small.Classes  
           -0.1950             -0.0011              0.1507  
       Big.Classes     Graduation.Rate  Freshman.Retention  
           -0.0322              0.1298              0.2569  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma(x = result$model) 

                   Value  p-value                   Decision
Global Stat         40.6 3.27e-08 Assumptions NOT satisfied!
Skewness            14.8 1.17e-04 Assumptions NOT satisfied!
Kurtosis            11.7 6.30e-04 Assumptions NOT satisfied!
Link Function        3.6 5.77e-02    Assumptions acceptable.
Heteroscedasticity  10.5 1.22e-03 Assumptions NOT satisfied!

# gvlma(Model.LM) Null is normal.  Here that's very unlikely to be true;
# they are non-normal.
shapiro.test(AlumniGiving$ResidsNoS)


    Shapiro-Wilk normality test

data:  AlumniGiving$ResidsNoS
W = 1, p-value = 4e-04

Nope.

Another Regression [post-hoc]

Reexamine a regression but now use all of the predictors except for the School and ID to explain Giving. With 95% confidence, what factors seem to explain giving? Special and Small.Classes Interpret the residual standard error to summarize how well the regression with all predictors fits. We can predict Giving Rates in the sample to within 0.048, on average. Does the model predict variance? Yes

# Model.S <-
# lm(Giving~SFR+Small.Classes+Big.Classes+Graduation.Rate+Freshman.Retention,
# data=AlumniGiving) summary(Model.S) AlumniGiving$ResidsS <-
# residuals(Model.S)
result <- regress(AlumniGiving, rvar = "Giving", evar = c("SFR", "Small.Classes",
    "Big.Classes", "Graduation.Rate", "Freshman.Retention", "Special"))
summary(result, sum_check = c("rmse", "sumsquares", "confint"))

Linear regression (OLS)
Data     : AlumniGiving 
Response variable    : Giving 
Explanatory variables: SFR, Small.Classes, Big.Classes, Graduation.Rate, Freshman.Retention, Special 
Null hyp.: the effect of x on Giving is zero
Alt. hyp.: the effect of x on Giving is not zero

                    coefficient std.error t.value p.value    
 (Intercept)             -0.186     0.098  -1.893   0.061 .  
 SFR                     -0.001     0.002  -0.705   0.482    
 Small.Classes            0.165     0.055   2.969   0.004 ** 
 Big.Classes             -0.022     0.103  -0.212   0.833    
 Graduation.Rate          0.111     0.084   1.314   0.192    
 Freshman.Retention       0.246     0.152   1.617   0.109    
 Special|Yes              0.185     0.029   6.478  < .001 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-squared: 0.657,  Adjusted R-squared: 0.639 
F-statistic: 37 df(6,116), p.value < .001
Nr obs: 123 

Prediction error (RMSE):  0.047 
Residual st.dev   (RSD):  0.048 

Sum of squares:
            df    SS
Regression   6 0.521
Error      116 0.273
Total      122 0.794

                   coefficient   2.5% 97.5%   +/-
(Intercept)             -0.186 -0.380 0.009 0.194
SFR                     -0.001 -0.004 0.002 0.003
Small.Classes            0.165  0.055 0.275 0.110
Big.Classes             -0.022 -0.225 0.182 0.203
Graduation.Rate          0.111 -0.056 0.278 0.167
Freshman.Retention       0.246 -0.055 0.548 0.302
Special|Yes              0.185  0.128 0.241 0.057

plot(result, plots = "dashboard", lines = "line", nrobs = -1, custom = FALSE)

pred <- predict(result, pred_data = AlumniGiving)
AlumniGiving <- store(AlumniGiving, result, name = "ResidsSpec")
print(pred, n = 10)

Linear regression (OLS)
Data                 : AlumniGiving 
Response variable    : Giving 
Explanatory variables: SFR, Small.Classes, Big.Classes, Graduation.Rate, Freshman.Retention, Special 
Interval             : confidence 
Prediction dataset   : AlumniGiving 
Rows shown           : 10 of 123 

    SFR Small.Classes Big.Classes Graduation.Rate Freshman.Retention
 24.000         0.420       0.160           0.590              0.810
 19.000         0.490       0.040           0.370              0.690
 18.000         0.240       0.170           0.660              0.870
  8.000         0.740       0.001           0.810              0.880
  8.000         0.950       0.000           0.860              0.920
 20.000         0.390       0.060           0.350              0.690
 20.000         0.350       0.170           0.600              0.790
 18.000         0.280       0.180           0.580              0.830
 18.000         0.340       0.120           0.570              0.780
 14.000         0.490       0.090           0.710              0.850
 Special Prediction  2.5% 97.5%   +/-
      No      0.119 0.099 0.139 0.020
      No      0.085 0.062 0.107 0.023
     Yes      0.303 0.246 0.360 0.057
      No      0.234 0.208 0.260 0.026
      No      0.284 0.246 0.322 0.038
      No      0.064 0.044 0.085 0.021
      No      0.108 0.094 0.122 0.014
      No      0.106 0.091 0.121 0.015
      No      0.104 0.089 0.118 0.015
      No      0.166 0.153 0.179 0.013

AlumniGiving <- store(AlumniGiving, pred, name = "pred_reg")

The model, as a whole and as shown by F, accounts for far more variance than would be expected by chance alone. The p-value is less than 0.001. Indeed, the value of F such that 99.9 percent of F-ratios are smaller is 4.05 while the F we obtain is 36.987. The model almost surely explains far more than random varation. Yet, of six plausible predictors, only Special and Small.Classes have non-zero slopes.

Giving = -0.186 - 0.001*SFR + 0.165*Small.Classes - 0.022*Big.Classes + 0.111*Graduation.Rate + 0.246*Freshman.Retention + 0.185(If Special is Yes) + error

equatiomatic::extract_eq(result$model, use_coefs = TRUE, coef_digits = 4)

\operatorname{\widehat{Giving}} = -0.1858 - 0.0011(\operatorname{SFR}) + 0.1647(\operatorname{Small.Classes}) - 0.0217(\operatorname{Big.Classes}) + 0.111(\operatorname{Graduation.Rate}) + 0.2462(\operatorname{Freshman.Retention}) + 0.1849(\operatorname{Special}_{\operatorname{Yes}})

Residuals with Special

Are the residuals from the previous regression normal?

They are certainly better as Special has pulled those extreme values of residual giving to the line.

library(gvlma)
gvlma(result$model)


Call:
lm(formula = form_upper, data = dataset)

Coefficients:
       (Intercept)                 SFR       Small.Classes  
          -0.18576            -0.00108             0.16470  
       Big.Classes     Graduation.Rate  Freshman.Retention  
          -0.02173             0.11096             0.24625  
        SpecialYes  
           0.18491  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma(x = result$model) 

                   Value p-value                Decision
Global Stat        7.579  0.1083 Assumptions acceptable.
Skewness           2.726  0.0988 Assumptions acceptable.
Kurtosis           0.734  0.3915 Assumptions acceptable.
Link Function      1.705  0.1916 Assumptions acceptable.
Heteroscedasticity 2.414  0.1203 Assumptions acceptable.

# gvlma(Model.S)

Analysis of Residuals

Given the information in the regression with all predictors, which schools have the most and least unexplained giving given the characteristics of the school?

AlumniGiving %>%
    arrange(ResidsSpec) %>%
    select(School, ResidsSpec) %>%
    head

# A tibble: 6 × 2
  School                                                    ResidsSpec
  <chr>                                                          <dbl>
1 United States Air Force Academy                              -0.124 
2 University of California—Berkeley                            -0.115 
3 University of California—Los Angeles                         -0.0865
4 San Diego State University                                   -0.0802
5 Rutgers, the State University of New Jersey—New Brunswick    -0.0707
6 University of Maryland—College Park                          -0.0703

AlumniGiving %>%
    arrange(desc(ResidsSpec)) %>%
    select(School, ResidsSpec) %>%
    head

# A tibble: 6 × 2
  School                            ResidsSpec
  <chr>                                  <dbl>
1 University of Southern California     0.146 
2 University of Arkansas                0.124 
3 Georgia Institute of Technology       0.108 
4 Duke University                       0.0956
5 Clemson University                    0.0935
6 University of Nebraska—Lincoln        0.0900

USC most/highest residual giving and the Air Force Academy least/highest (in absolute value) negative residual giving.

Interpreting the slopes.

Interpreting the intercept here is reassuring. Though it is not clear what a school with an SFR [or any other measure for that matter] of 0 would be, the giving rate could be zero.

confint(result$model)

                      2.5 %  97.5 %
(Intercept)        -0.38013 0.00860
SFR                -0.00412 0.00196
Small.Classes       0.05485 0.27456
Big.Classes        -0.22497 0.18152
Graduation.Rate    -0.05635 0.27827
Freshman.Retention -0.05534 0.54784
SpecialYes          0.12838 0.24144

Otherwise, increase x by one unit and Giving changes by [Lower bound] to [Upper bound] with 95% confidence. Increase small classes by 10% [0.1] and Giving should increase by 0.5% to 2.75% [0.1*0.055] and [0.1*0.275]

Assessing the Development Director

Our school is not Special, has a graduation rate of 0.67, a student to faculty ratio of 17 (SFR), 0.34 of the classes have fewer than 20 students (Small.Classes), 0.23 of the classes have over 50 students (Big.Classes), and a freshman retention rate of 0.77. The school’s giving rate is 8%. Assess their performance on average and for a specific year. What does this say about the performance of the development office?

Two ways to do this.

Type in the values?
Upload a spreadsheet and predict with that.

We are just barely within the 95% interval for average giving given the characteristics of our school.

pred <- predict(result, pred_cmd = c("Graduation.Rate=0.67", "Small.Classes=0.34",
    "Big.Classes=0.23", "Special='No'", "Freshman.Retention=0.77"), conf_lev = 0.9)
plot(pred, xvar = "Special", conf_lev = 0.95)

print(pred, n = 10)

Linear regression (OLS)
Data                 : AlumniGiving 
Response variable    : Giving 
Explanatory variables: SFR, Small.Classes, Big.Classes, Graduation.Rate, Freshman.Retention, Special 
Interval             : confidence 
Prediction command   : Graduation.Rate = 0.67, Small.Classes = 0.34, Big.Classes = 0.23, Special = 'No', Freshman.Retention = 0.77 

    SFR Graduation.Rate Small.Classes Big.Classes Special
 17.772           0.670         0.340       0.230      No
 Freshman.Retention Prediction    5%   95%   +/-
              0.770      0.110 0.084 0.136 0.026

At 90%, this would be unacceptable.

pred <- predict(result, pred_cmd = c("Graduation.Rate=0.67", "Small.Classes=0.34",
    "Big.Classes=0.23", "Special='No'", "Freshman.Retention=0.77"), conf_lev = 0.9)
plot(pred, xvar = "Special", conf_lev = 0.9)

print(pred, n = 10)

Linear regression (OLS)
Data                 : AlumniGiving 
Response variable    : Giving 
Explanatory variables: SFR, Small.Classes, Big.Classes, Graduation.Rate, Freshman.Retention, Special 
Interval             : confidence 
Prediction command   : Graduation.Rate = 0.67, Small.Classes = 0.34, Big.Classes = 0.23, Special = 'No', Freshman.Retention = 0.77 

    SFR Graduation.Rate Small.Classes Big.Classes Special
 17.772           0.670         0.340       0.230      No
 Freshman.Retention Prediction    5%   95%   +/-
              0.770      0.110 0.084 0.136 0.026

Predicting the Data [An aside]

pred <- predict(result, pred_cmd = c("Graduation.Rate=0.67", "Small.Classes=0.34",
    "Big.Classes=0.23", "Special='No'", "Freshman.Retention=0.77"), conf_lev = 0.9,
    interval = "prediction")
# plot(pred, xvar = 'Special', conf_lev = 0.9)
print(pred, n = 10)

Linear regression (OLS)
Data                 : AlumniGiving 
Response variable    : Giving 
Explanatory variables: SFR, Small.Classes, Big.Classes, Graduation.Rate, Freshman.Retention, Special 
Interval             : prediction 
Prediction command   : Graduation.Rate = 0.67, Small.Classes = 0.34, Big.Classes = 0.23, Special = 'No', Freshman.Retention = 0.77 

    SFR Graduation.Rate Small.Classes Big.Classes Special
 17.772           0.670         0.340       0.230      No
 Freshman.Retention Prediction    5%   95%   +/-
              0.770      0.110 0.026 0.194 0.084

Letting the Machine Learn….

What factors matter? We will think about this in two ways. First, by adding predictors that clearly explain variation. Second, by starting with all reasonable predictors and removing them if they do not have non-zero slopes, using a stepwise procedure.

This is stepwise

# step(Model.S, direction='both')
result <- regress(AlumniGiving, rvar = "Giving", evar = c("SFR", "Small.Classes",
    "Big.Classes", "Graduation.Rate", "Freshman.Retention", "Special"), check = "stepwise-backward")

Start:  AIC=-738
Giving ~ SFR + Small.Classes + Big.Classes + Graduation.Rate + 
    Freshman.Retention + Special

                     Df Sum of Sq   RSS  AIC
- Big.Classes         1    0.0001 0.273 -740
- SFR                 1    0.0012 0.274 -739
- Graduation.Rate     1    0.0041 0.277 -738
<none>                            0.273 -738
- Freshman.Retention  1    0.0061 0.279 -737
- Small.Classes       1    0.0207 0.293 -731
- Special             1    0.0986 0.371 -702

Step:  AIC=-740
Giving ~ SFR + Small.Classes + Graduation.Rate + Freshman.Retention + 
    Special

                     Df Sum of Sq   RSS  AIC
- SFR                 1    0.0014 0.274 -741
- Graduation.Rate     1    0.0040 0.277 -740
<none>                            0.273 -740
- Freshman.Retention  1    0.0063 0.279 -739
- Small.Classes       1    0.0315 0.304 -728
- Special             1    0.0987 0.371 -704

Step:  AIC=-741
Giving ~ Small.Classes + Graduation.Rate + Freshman.Retention + 
    Special

                     Df Sum of Sq   RSS  AIC
<none>                            0.274 -741
- Graduation.Rate     1    0.0051 0.279 -741
- Freshman.Retention  1    0.0063 0.280 -740
- Small.Classes       1    0.0571 0.331 -720
- Special             1    0.0989 0.373 -705

summary(result, sum_check = c("rmse", "sumsquares", "confint"), conf_lev = 0.9)

----------------------------------------------------
Backward stepwise selection of variables
----------------------------------------------------
Linear regression (OLS)
Data     : AlumniGiving 
Response variable    : Giving 
Explanatory variables: SFR, Small.Classes, Big.Classes, Graduation.Rate, Freshman.Retention, Special 
Null hyp.: the effect of x on Giving is zero
Alt. hyp.: the effect of x on Giving is not zero

                    coefficient std.error t.value p.value    
 (Intercept)             -0.224     0.085  -2.625   0.010 ** 
 Small.Classes            0.191     0.039   4.958  < .001 ***
 Graduation.Rate          0.117     0.079   1.480   0.141    
 Freshman.Retention       0.248     0.151   1.644   0.103    
 Special|Yes              0.185     0.028   6.525  < .001 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-squared: 0.655,  Adjusted R-squared: 0.643 
F-statistic: 56 df(4,118), p.value < .001
Nr obs: 123 

Prediction error (RMSE):  0.047 
Residual st.dev   (RSD):  0.048 

Sum of squares:
            df    SS
Regression   4 0.520
Error      118 0.274
Total      122 0.794

                   coefficient     5%    95%   +/-
(Intercept)             -0.224 -0.366 -0.083 0.142
Small.Classes            0.191  0.127  0.255 0.064
Graduation.Rate          0.117 -0.014  0.248 0.131
Freshman.Retention       0.248 -0.002  0.498 0.250
Special|Yes              0.185  0.138  0.232 0.047

On VIF

It manages to remove two variables for having insufficient associated squares; we lose 0.002 when measured in R-squared. Though we cannot be certain that Graduation.Rate and Freshman.Retention do not have zero slopes, this largely results from the relationship among them.

# car::vif(Model.S)

Holistic Takeaways

If you could recommend one or two operational factors to improve giving, what would they be? Think carefully about which factors can and cannot be controlled by the relevant decision makers.

The overarching evidence provided by Small.Classes, Graduation.Rate, and Freshman.Retention suggests that the ability to deliver more personalized experiences as opposed to greater bureaucratization of the university might be a solution; administrative creep is a well known phenomenon in higher education.

Some Additional Interpretation: Plots

pred <- predict(result, pred_cmd = "Small.Classes=seq(0,0.7, by=0.01)")
plot(pred, xvar = "Small.Classes")