Multiple Regresion with Alumni Giving

Robert W. Walker

Alumni Giving

The setup

The director of development at a large state university is under pressure to raise giving rates to conform to US News and World Reports ranking that place an important weight on the average alumni giving rate - a metric that is a heavy component of their overall ranking of institutions. The institution’s current rate of 8% is in the low category and raising it could play a key role in raising the overall ranking of the university from the current ranking in the mid 130s. You have the following data available for a sample of schools that fielded football teams in the NCAA Football Bowl Subdivision [this is the sampling criterion because it includes almost all big state universities and big private institutions]. There is no reason to believe that this is unrepresentative for our purposes.

Data

You have the following data available for a sample of schools that fielded football teams in the NCAA Football Bowl Subdivision [this is the sampling criterion because it includes almost all big state universities and big private institutions]. There is no reason to believe that this is unrepresentative for our purposes. Of course, the goal is to explain giving rates in GivingRates.RData.

Description

Variable Description
ID An identifier of the school running from 1 to 125.
School The name of the school.
SFR the Student to Faculty ratio [the number of students divided by the number of faculty]
Small.Classes Percentage of classes with fewer than 20 students.
Big.Classes Percentage of classes with greater than 50 students.
Graduation.Rate the six year rate of graduation.
Freshman.Retention the Freshman retention rate or the rate at which first year students are retained into the second year.
Giving the average alumni giving rate for the institution.
Special A Yes or No variable denoting schools that were involved in special giving campaigns tied to the football program.

Load the data

load(url("https://robertwwalker.github.io/xaringan/AlumniGiving/data/AlumniGiving.rdata"))

Shot

Shot 2

Automagic Data Analysis

install.packages("SmartEDA")
library(SmartEDA)
ExpReport(AlumniGiving, Target="Giving", op_file="EDA-Giving.html")

Question 1

Provide an instructive graphic with interpretation for rates of giving. Discuss the symmetry, the center, and the spread. In addition, provide a 95% confidence interval for average giving. The data are quite skewed to the right with a few seemingly large and outlying values. The lower half [below the median] spans a much smaller range than the upper half even without considering those values.

AlumniGiving %>%
    ggplot() + aes(x = Giving) + geom_boxplot()

A Note on Special

With 95% confidence, what is the difference among special and non-special schools in terms of giving? Comparing independent samples, 0.1 to 0.34 higher for Special schools with 95% confidence.

t.test(Giving ~ Special, data = AlumniGiving)

    Welch Two Sample t-test

data:  Giving by Special
t = -7, df = 2, p-value = 0.01
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
 -0.3362 -0.0974
sample estimates:
 mean in group No mean in group Yes 
            0.137             0.353 

Question 2

Is there a difference or no difference between the percentage of big classes and the percentage of small classes with 95% confidence? Provide a 95% confidence interval for the average difference and interpret the interval. With 95% confidence, the average school has 0.24 to 0.30 more small classes than big classes.

t.test(AlumniGiving$Small.Classes, AlumniGiving$Big.Classes, paired = TRUE)

    Paired t-test

data:  AlumniGiving$Small.Classes and AlumniGiving$Big.Classes
t = 17, df = 122, p-value <2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 0.236 0.299
sample estimates:
mean difference 
          0.267 

Question 3

GGally::ggpairs({
    AlumniGiving %>%
        select(-School)
})

Question 4

Estimate and report a regression using all of the potential predictors [except the ID of course, and Special] to explain Giving. With 95% confidence, what factors seem to explain giving? Only Small.Classes. Interpret the residual standard error to summarize how well the regression with all predictors fits. The residual standard error is 0.056; on average, the regression is off for a particular school by 0.056. Does the model predict variance? Yes, 54% of the variance in alumni giving can be accounted for by these factors. The F-statistic is far too large for this quantity of variance to be explained by give factors by chance alone.

# Model.LM <-
# lm(Giving~SFR+Small.Classes+Big.Classes+Graduation.Rate+Freshman.Retention,
# data=AlumniGiving) summary(Model.LM) AlumniGiving$ResidsNoS <-
# residuals(Model.LM)
result <- regress(AlumniGiving, rvar = "Giving", evar = c("SFR", "Small.Classes",
    "Big.Classes", "Graduation.Rate", "Freshman.Retention"))
summary(result, sum_check = "sumsquares")
Linear regression (OLS)
Data     : AlumniGiving 
Response variable    : Giving 
Explanatory variables: SFR, Small.Classes, Big.Classes, Graduation.Rate, Freshman.Retention 
Null hyp.: the effect of x on Giving is zero
Alt. hyp.: the effect of x on Giving is not zero

                    coefficient std.error t.value p.value  
 (Intercept)             -0.195     0.114  -1.710   0.090 .
 SFR                     -0.001     0.002  -0.617   0.538  
 Small.Classes            0.151     0.064   2.339   0.021 *
 Big.Classes             -0.032     0.119  -0.270   0.787  
 Graduation.Rate          0.130     0.098   1.324   0.188  
 Freshman.Retention       0.257     0.177   1.452   0.149  

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-squared: 0.533,  Adjusted R-squared: 0.513 
F-statistic: 26.7 df(5,117), p.value < .001
Nr obs: 123 

Sum of squares:
            df    SS
Regression   5 0.423
Error      117 0.371
Total      122 0.794
AlumniGiving <- store(AlumniGiving, result, name = "ResidsNoS")

Question 5

Are the residuals from the previous regression normal? No. There are a few schools that are quite poorly explained with large amounts of excess giving given the model. Note the long right tail on the residuals. Provide convincing evidence that they are. Or Fix it with Special?

It is hard to say definitively, most everything conforms but there is also a disquieting pattern of large positive residuals; more than the quantile-quantile plot would suggest there should be.

May need to install:
install.packages("gvlma")

Use gvlma

library(gvlma)
gvlma(result$model)

Call:
lm(formula = form_upper, data = dataset)

Coefficients:
       (Intercept)                 SFR       Small.Classes  
           -0.1950             -0.0011              0.1507  
       Big.Classes     Graduation.Rate  Freshman.Retention  
           -0.0322              0.1298              0.2569  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma(x = result$model) 

                   Value  p-value                   Decision
Global Stat         40.6 3.27e-08 Assumptions NOT satisfied!
Skewness            14.8 1.17e-04 Assumptions NOT satisfied!
Kurtosis            11.7 6.30e-04 Assumptions NOT satisfied!
Link Function        3.6 5.77e-02    Assumptions acceptable.
Heteroscedasticity  10.5 1.22e-03 Assumptions NOT satisfied!
# gvlma(Model.LM) Null is normal.  Here that's very unlikely to be true;
# they are non-normal.
shapiro.test(AlumniGiving$ResidsNoS)

    Shapiro-Wilk normality test

data:  AlumniGiving$ResidsNoS
W = 1, p-value = 4e-04

Nope.

Question 6

Reexamine a regression but now use all of the predictors except for the School and ID to explain Giving. With 95% confidence, what factors seem to explain giving? Special and Small.Classes Interpret the residual standard error to summarize how well the regression with all predictors fits. We can predict Giving Rates in the sample to within 0.048, on average. Does the model predict variance? Yes

# Model.S <-
# lm(Giving~SFR+Small.Classes+Big.Classes+Graduation.Rate+Freshman.Retention,
# data=AlumniGiving) summary(Model.S) AlumniGiving$ResidsS <-
# residuals(Model.S)
result <- regress(AlumniGiving, rvar = "Giving", evar = c("SFR", "Small.Classes",
    "Big.Classes", "Graduation.Rate", "Freshman.Retention", "Special"))
summary(result, sum_check = c("rmse", "sumsquares", "confint"))
Linear regression (OLS)
Data     : AlumniGiving 
Response variable    : Giving 
Explanatory variables: SFR, Small.Classes, Big.Classes, Graduation.Rate, Freshman.Retention, Special 
Null hyp.: the effect of x on Giving is zero
Alt. hyp.: the effect of x on Giving is not zero

                    coefficient std.error t.value p.value    
 (Intercept)             -0.186     0.098  -1.893   0.061 .  
 SFR                     -0.001     0.002  -0.705   0.482    
 Small.Classes            0.165     0.055   2.969   0.004 ** 
 Big.Classes             -0.022     0.103  -0.212   0.833    
 Graduation.Rate          0.111     0.084   1.314   0.192    
 Freshman.Retention       0.246     0.152   1.617   0.109    
 Special|Yes              0.185     0.029   6.478  < .001 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-squared: 0.657,  Adjusted R-squared: 0.639 
F-statistic: 37 df(6,116), p.value < .001
Nr obs: 123 

Prediction error (RMSE):  0.047 
Residual st.dev   (RSD):  0.048 

Sum of squares:
            df    SS
Regression   6 0.521
Error      116 0.273
Total      122 0.794

                   coefficient   2.5% 97.5%   +/-
(Intercept)             -0.186 -0.380 0.009 0.194
SFR                     -0.001 -0.004 0.002 0.003
Small.Classes            0.165  0.055 0.275 0.110
Big.Classes             -0.022 -0.225 0.182 0.203
Graduation.Rate          0.111 -0.056 0.278 0.167
Freshman.Retention       0.246 -0.055 0.548 0.302
Special|Yes              0.185  0.128 0.241 0.057
plot(result, plots = "dashboard", lines = "line", nrobs = -1, custom = FALSE)
pred <- predict(result, pred_data = AlumniGiving)
AlumniGiving <- store(AlumniGiving, result, name = "ResidsSpec")
print(pred, n = 10)
Linear regression (OLS)
Data                 : AlumniGiving 
Response variable    : Giving 
Explanatory variables: SFR, Small.Classes, Big.Classes, Graduation.Rate, Freshman.Retention, Special 
Interval             : confidence 
Prediction dataset   : AlumniGiving 
Rows shown           : 10 of 123 

    SFR Small.Classes Big.Classes Graduation.Rate Freshman.Retention
 24.000         0.420       0.160           0.590              0.810
 19.000         0.490       0.040           0.370              0.690
 18.000         0.240       0.170           0.660              0.870
  8.000         0.740       0.001           0.810              0.880
  8.000         0.950       0.000           0.860              0.920
 20.000         0.390       0.060           0.350              0.690
 20.000         0.350       0.170           0.600              0.790
 18.000         0.280       0.180           0.580              0.830
 18.000         0.340       0.120           0.570              0.780
 14.000         0.490       0.090           0.710              0.850
 Special Prediction  2.5% 97.5%   +/-
      No      0.119 0.099 0.139 0.020
      No      0.085 0.062 0.107 0.023
     Yes      0.303 0.246 0.360 0.057
      No      0.234 0.208 0.260 0.026
      No      0.284 0.246 0.322 0.038
      No      0.064 0.044 0.085 0.021
      No      0.108 0.094 0.122 0.014
      No      0.106 0.091 0.121 0.015
      No      0.104 0.089 0.118 0.015
      No      0.166 0.153 0.179 0.013
AlumniGiving <- store(AlumniGiving, pred, name = "pred_reg")

The model, as a whole and as shown by F, accounts for far more variance than would be expected by chance alone. The p-value is less than 0.001. Indeed, the value of F such that 99.9 percent of F-ratios are smaller is 4.05 while the F we obtain is 36.987. The model almost surely explains far more than random varation. Yet, of six plausible predictors, only Special and Small.Classes have non-zero slopes.

Giving = -0.186 - 0.001*SFR + 0.165*Small.Classes - 0.022*Big.Classes + 0.111*Graduation.Rate + 0.246*Freshman.Retention + 0.185(If Special is Yes) + error

OR

equatiomatic::extract_eq(result$model, use_coefs = TRUE, coef_digits = 4)

\operatorname{\widehat{Giving}} = -0.1858 - 0.0011(\operatorname{SFR}) + 0.1647(\operatorname{Small.Classes}) - 0.0217(\operatorname{Big.Classes}) + 0.111(\operatorname{Graduation.Rate}) + 0.2462(\operatorname{Freshman.Retention}) + 0.1849(\operatorname{Special}_{\operatorname{Yes}})

Question 7

Are the residuals from the previous regression normal? Provide convincing evidence that they are. Compare this to the previous model?

They are certainly better as Special has pulled those extreme values of residual giving to the line.

library(gvlma)
gvlma(result$model)

Call:
lm(formula = form_upper, data = dataset)

Coefficients:
       (Intercept)                 SFR       Small.Classes  
          -0.18576            -0.00108             0.16470  
       Big.Classes     Graduation.Rate  Freshman.Retention  
          -0.02173             0.11096             0.24625  
        SpecialYes  
           0.18491  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma(x = result$model) 

                   Value p-value                Decision
Global Stat        7.579  0.1083 Assumptions acceptable.
Skewness           2.726  0.0988 Assumptions acceptable.
Kurtosis           0.734  0.3915 Assumptions acceptable.
Link Function      1.705  0.1916 Assumptions acceptable.
Heteroscedasticity 2.414  0.1203 Assumptions acceptable.
# gvlma(Model.S)

Question 8

Given the information in the regression with all predictors, which schools have the most and least unexplained giving given the characteristics of the school?

AlumniGiving %>%
    arrange(ResidsSpec) %>%
    select(School, ResidsSpec) %>%
    head
# A tibble: 6 × 2
  School                                                    ResidsSpec
  <chr>                                                          <dbl>
1 United States Air Force Academy                              -0.124 
2 University of California—Berkeley                            -0.115 
3 University of California—Los Angeles                         -0.0865
4 San Diego State University                                   -0.0802
5 Rutgers, the State University of New Jersey—New Brunswick    -0.0707
6 University of Maryland—College Park                          -0.0703
AlumniGiving %>%
    arrange(desc(ResidsSpec)) %>%
    select(School, ResidsSpec) %>%
    head
# A tibble: 6 × 2
  School                            ResidsSpec
  <chr>                                  <dbl>
1 University of Southern California     0.146 
2 University of Arkansas                0.124 
3 Georgia Institute of Technology       0.108 
4 Duke University                       0.0956
5 Clemson University                    0.0935
6 University of Nebraska—Lincoln        0.0900

USC most/highest residual giving and the Air Force Academy least/highest (in absolute value) negative residual giving.

Question 9

confint(result$model)
                      2.5 %  97.5 %
(Intercept)        -0.38013 0.00860
SFR                -0.00412 0.00196
Small.Classes       0.05485 0.27456
Big.Classes        -0.22497 0.18152
Graduation.Rate    -0.05635 0.27827
Freshman.Retention -0.05534 0.54784
SpecialYes          0.12838 0.24144

Question 10

Our school is not Special, has a graduation rate of 0.67, a student to faculty ratio of 17 (SFR), 0.34 of the classes have fewer than 20 students (Small.Classes), 0.23 of the classes have over 50 students (Big.Classes), and a freshman retention rate of 0.77. The school’s giving rate is 8%. Assess their performance on average and for a specific year. What does this say about the performance of the development office?

Two ways to do this.

  • Type in the values?
  • Upload a spreadsheet and predict with that.

We are just barely within the 95% interval for average giving given the characteristics of our school.

pred <- predict(result, pred_cmd = c("Graduation.Rate=0.67", "Small.Classes=0.34",
    "Big.Classes=0.23", "Special='No'", "Freshman.Retention=0.77"), conf_lev = 0.9)
plot(pred, xvar = "Special", conf_lev = 0.9)
print(pred, n = 10)
Linear regression (OLS)
Data                 : AlumniGiving 
Response variable    : Giving 
Explanatory variables: SFR, Small.Classes, Big.Classes, Graduation.Rate, Freshman.Retention, Special 
Interval             : confidence 
Prediction command   : Graduation.Rate = 0.67, Small.Classes = 0.34, Big.Classes = 0.23, Special = 'No', Freshman.Retention = 0.77 

    SFR Graduation.Rate Small.Classes Big.Classes Special
 17.772           0.670         0.340       0.230      No
 Freshman.Retention Prediction    5%   95%   +/-
              0.770      0.110 0.084 0.136 0.026

Predicting the Data

pred <- predict(result, pred_cmd = c("Graduation.Rate=0.67", "Small.Classes=0.34",
    "Big.Classes=0.23", "Special='No'", "Freshman.Retention=0.77"), conf_lev = 0.9,
    interval = "prediction")
# plot(pred, xvar = 'Special', conf_lev = 0.9)
print(pred, n = 10)
Linear regression (OLS)
Data                 : AlumniGiving 
Response variable    : Giving 
Explanatory variables: SFR, Small.Classes, Big.Classes, Graduation.Rate, Freshman.Retention, Special 
Interval             : prediction 
Prediction command   : Graduation.Rate = 0.67, Small.Classes = 0.34, Big.Classes = 0.23, Special = 'No', Freshman.Retention = 0.77 

    SFR Graduation.Rate Small.Classes Big.Classes Special
 17.772           0.670         0.340       0.230      No
 Freshman.Retention Prediction    5%   95%   +/-
              0.770      0.110 0.026 0.194 0.084

Question 11

What factors matter? We will think about this in two ways. First, by adding predictors that clearly explain variation. Second, by starting with all reasonable predictors and removing them if they do not have non-zero slopes, using a stepwise procedure.

This is stepwise

# step(Model.S, direction='both')
result <- regress(AlumniGiving, rvar = "Giving", evar = c("SFR", "Small.Classes",
    "Big.Classes", "Graduation.Rate", "Freshman.Retention", "Special"), check = "stepwise-backward")
Start:  AIC=-738
Giving ~ SFR + Small.Classes + Big.Classes + Graduation.Rate + 
    Freshman.Retention + Special

                     Df Sum of Sq   RSS  AIC
- Big.Classes         1    0.0001 0.273 -740
- SFR                 1    0.0012 0.274 -739
- Graduation.Rate     1    0.0041 0.277 -738
<none>                            0.273 -738
- Freshman.Retention  1    0.0061 0.279 -737
- Small.Classes       1    0.0207 0.293 -731
- Special             1    0.0986 0.371 -702

Step:  AIC=-740
Giving ~ SFR + Small.Classes + Graduation.Rate + Freshman.Retention + 
    Special

                     Df Sum of Sq   RSS  AIC
- SFR                 1    0.0014 0.274 -741
- Graduation.Rate     1    0.0040 0.277 -740
<none>                            0.273 -740
- Freshman.Retention  1    0.0063 0.279 -739
- Small.Classes       1    0.0315 0.304 -728
- Special             1    0.0987 0.371 -704

Step:  AIC=-741
Giving ~ Small.Classes + Graduation.Rate + Freshman.Retention + 
    Special

                     Df Sum of Sq   RSS  AIC
<none>                            0.274 -741
- Graduation.Rate     1    0.0051 0.279 -741
- Freshman.Retention  1    0.0063 0.280 -740
- Small.Classes       1    0.0571 0.331 -720
- Special             1    0.0989 0.373 -705
summary(result, sum_check = c("rmse", "sumsquares", "confint"), conf_lev = 0.9)
----------------------------------------------------
Backward stepwise selection of variables
----------------------------------------------------
Linear regression (OLS)
Data     : AlumniGiving 
Response variable    : Giving 
Explanatory variables: SFR, Small.Classes, Big.Classes, Graduation.Rate, Freshman.Retention, Special 
Null hyp.: the effect of x on Giving is zero
Alt. hyp.: the effect of x on Giving is not zero

                    coefficient std.error t.value p.value    
 (Intercept)             -0.224     0.085  -2.625   0.010 ** 
 Small.Classes            0.191     0.039   4.958  < .001 ***
 Graduation.Rate          0.117     0.079   1.480   0.141    
 Freshman.Retention       0.248     0.151   1.644   0.103    
 Special|Yes              0.185     0.028   6.525  < .001 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-squared: 0.655,  Adjusted R-squared: 0.643 
F-statistic: 56 df(4,118), p.value < .001
Nr obs: 123 

Prediction error (RMSE):  0.047 
Residual st.dev   (RSD):  0.048 

Sum of squares:
            df    SS
Regression   4 0.520
Error      118 0.274
Total      122 0.794

                   coefficient     5%    95%   +/-
(Intercept)             -0.224 -0.366 -0.083 0.142
Small.Classes            0.191  0.127  0.255 0.064
Graduation.Rate          0.117 -0.014  0.248 0.131
Freshman.Retention       0.248 -0.002  0.498 0.250
Special|Yes              0.185  0.138  0.232 0.047

On VIF

It manages to remove two variables for having insufficient associated squares; we lose 0.002 when measured in R-squared. Though we cannot be certain that Graduation.Rate and Freshman.Retention do not have zero slopes, this largely results from the relationship among them.

# car::vif(Model.S)

Question 12

If you could recommend one or two operational factors to improve giving, what would they be? Think carefully about which factors can and cannot be controlled by the relevant decision makers.

The overarching evidence provided by Small.Classes, Graduation.Rate, and Freshman.Retention suggests that the ability to deliver more personalized experiences as opposed to greater bureaucratization of the university might be a solution; administrative creep is a well known phenomenon in higher education.

Some Additional Interpretation: Plots

pred <- predict(result, pred_cmd = "Small.Classes=seq(0,0.7, by=0.01)")
plot(pred, xvar = "Small.Classes")