The director of development at a large state university is under pressure to raise giving rates to conform to US News and World Reports ranking that place an important weight on the average alumni giving rate - a metric that is a heavy component of their overall ranking of institutions. The institution’s current rate of 8% is in the low category and raising it could play a key role in raising the overall ranking of the university from the current ranking in the mid 130s. You have the following data available for a sample of schools that fielded football teams in the NCAA Football Bowl Subdivision [this is the sampling criterion because it includes almost all big state universities and big private institutions]. There is no reason to believe that this is unrepresentative for our purposes.
You have the following data available for a sample of schools that fielded football teams in the NCAA Football Bowl Subdivision [this is the sampling criterion because it includes almost all big state universities and big private institutions]. There is no reason to believe that this is unrepresentative for our purposes. Of course, the goal is to explain giving rates in GivingRates.RData.
| Variable | Description |
|---|---|
| ID | An identifier of the school running from 1 to 125. |
| School | The name of the school. |
| SFR | the Student to Faculty ratio [the number of students divided by the number of faculty] |
| Small.Classes | Percentage of classes with fewer than 20 students. |
| Big.Classes | Percentage of classes with greater than 50 students. |
| Graduation.Rate | the six year rate of graduation. |
| Freshman.Retention | the Freshman retention rate or the rate at which first year students are retained into the second year. |
| Giving | the average alumni giving rate for the institution. |
| Special | A Yes or No variable denoting schools that were involved in special giving campaigns tied to the football program. |
install.packages("SmartEDA")
library(SmartEDA)
ExpReport(AlumniGiving, Target="Giving", op_file="EDA-Giving.html")
Provide an instructive graphic with interpretation for rates of giving. Discuss the symmetry, the center, and the spread. In addition, provide a 95% confidence interval for average giving. The data are quite skewed to the right with a few seemingly large and outlying values. The lower half [below the median] spans a much smaller range than the upper half even without considering those values.
With 95% confidence, what is the difference among special and non-special schools in terms of giving? Comparing independent samples, 0.1 to 0.34 higher for Special schools with 95% confidence.
Welch Two Sample t-test
data: Giving by Special
t = -7, df = 2, p-value = 0.01
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
-0.3362 -0.0974
sample estimates:
mean in group No mean in group Yes
0.137 0.353
Is there a difference or no difference between the percentage of big classes and the percentage of small classes with 95% confidence? Provide a 95% confidence interval for the average difference and interpret the interval. With 95% confidence, the average school has 0.24 to 0.30 more small classes than big classes.
Paired t-test
data: AlumniGiving$Small.Classes and AlumniGiving$Big.Classes
t = 17, df = 122, p-value <2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
0.236 0.299
sample estimates:
mean difference
0.267
Estimate and report a regression using all of the potential predictors [except the ID of course, and Special] to explain Giving. With 95% confidence, what factors seem to explain giving? Only Small.Classes. Interpret the residual standard error to summarize how well the regression with all predictors fits. The residual standard error is 0.056; on average, the regression is off for a particular school by 0.056. Does the model predict variance? Yes, 54% of the variance in alumni giving can be accounted for by these factors. The F-statistic is far too large for this quantity of variance to be explained by give factors by chance alone.
# Model.LM <-
# lm(Giving~SFR+Small.Classes+Big.Classes+Graduation.Rate+Freshman.Retention,
# data=AlumniGiving) summary(Model.LM) AlumniGiving$ResidsNoS <-
# residuals(Model.LM)
result <- regress(AlumniGiving, rvar = "Giving", evar = c("SFR", "Small.Classes",
"Big.Classes", "Graduation.Rate", "Freshman.Retention"))
summary(result, sum_check = "sumsquares")Linear regression (OLS)
Data : AlumniGiving
Response variable : Giving
Explanatory variables: SFR, Small.Classes, Big.Classes, Graduation.Rate, Freshman.Retention
Null hyp.: the effect of x on Giving is zero
Alt. hyp.: the effect of x on Giving is not zero
coefficient std.error t.value p.value
(Intercept) -0.195 0.114 -1.710 0.090 .
SFR -0.001 0.002 -0.617 0.538
Small.Classes 0.151 0.064 2.339 0.021 *
Big.Classes -0.032 0.119 -0.270 0.787
Graduation.Rate 0.130 0.098 1.324 0.188
Freshman.Retention 0.257 0.177 1.452 0.149
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-squared: 0.533, Adjusted R-squared: 0.513
F-statistic: 26.7 df(5,117), p.value < .001
Nr obs: 123
Sum of squares:
df SS
Regression 5 0.423
Error 117 0.371
Total 122 0.794
Are the residuals from the previous regression normal? No. There are a few schools that are quite poorly explained with large amounts of excess giving given the model. Note the long right tail on the residuals. Provide convincing evidence that they are. Or Fix it with Special?
It is hard to say definitively, most everything conforms but there is also a disquieting pattern of large positive residuals; more than the quantile-quantile plot would suggest there should be.
May need to install:
install.packages("gvlma")
Use gvlma
Call:
lm(formula = form_upper, data = dataset)
Coefficients:
(Intercept) SFR Small.Classes
-0.1950 -0.0011 0.1507
Big.Classes Graduation.Rate Freshman.Retention
-0.0322 0.1298 0.2569
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = result$model)
Value p-value Decision
Global Stat 40.6 3.27e-08 Assumptions NOT satisfied!
Skewness 14.8 1.17e-04 Assumptions NOT satisfied!
Kurtosis 11.7 6.30e-04 Assumptions NOT satisfied!
Link Function 3.6 5.77e-02 Assumptions acceptable.
Heteroscedasticity 10.5 1.22e-03 Assumptions NOT satisfied!
Shapiro-Wilk normality test
data: AlumniGiving$ResidsNoS
W = 1, p-value = 4e-04
Nope.
Reexamine a regression but now use all of the predictors except for the School and ID to explain Giving. With 95% confidence, what factors seem to explain giving? Special and Small.Classes Interpret the residual standard error to summarize how well the regression with all predictors fits. We can predict Giving Rates in the sample to within 0.048, on average. Does the model predict variance? Yes
# Model.S <-
# lm(Giving~SFR+Small.Classes+Big.Classes+Graduation.Rate+Freshman.Retention,
# data=AlumniGiving) summary(Model.S) AlumniGiving$ResidsS <-
# residuals(Model.S)
result <- regress(AlumniGiving, rvar = "Giving", evar = c("SFR", "Small.Classes",
"Big.Classes", "Graduation.Rate", "Freshman.Retention", "Special"))
summary(result, sum_check = c("rmse", "sumsquares", "confint"))Linear regression (OLS)
Data : AlumniGiving
Response variable : Giving
Explanatory variables: SFR, Small.Classes, Big.Classes, Graduation.Rate, Freshman.Retention, Special
Null hyp.: the effect of x on Giving is zero
Alt. hyp.: the effect of x on Giving is not zero
coefficient std.error t.value p.value
(Intercept) -0.186 0.098 -1.893 0.061 .
SFR -0.001 0.002 -0.705 0.482
Small.Classes 0.165 0.055 2.969 0.004 **
Big.Classes -0.022 0.103 -0.212 0.833
Graduation.Rate 0.111 0.084 1.314 0.192
Freshman.Retention 0.246 0.152 1.617 0.109
Special|Yes 0.185 0.029 6.478 < .001 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-squared: 0.657, Adjusted R-squared: 0.639
F-statistic: 37 df(6,116), p.value < .001
Nr obs: 123
Prediction error (RMSE): 0.047
Residual st.dev (RSD): 0.048
Sum of squares:
df SS
Regression 6 0.521
Error 116 0.273
Total 122 0.794
coefficient 2.5% 97.5% +/-
(Intercept) -0.186 -0.380 0.009 0.194
SFR -0.001 -0.004 0.002 0.003
Small.Classes 0.165 0.055 0.275 0.110
Big.Classes -0.022 -0.225 0.182 0.203
Graduation.Rate 0.111 -0.056 0.278 0.167
Freshman.Retention 0.246 -0.055 0.548 0.302
Special|Yes 0.185 0.128 0.241 0.057
Linear regression (OLS)
Data : AlumniGiving
Response variable : Giving
Explanatory variables: SFR, Small.Classes, Big.Classes, Graduation.Rate, Freshman.Retention, Special
Interval : confidence
Prediction dataset : AlumniGiving
Rows shown : 10 of 123
SFR Small.Classes Big.Classes Graduation.Rate Freshman.Retention
24.000 0.420 0.160 0.590 0.810
19.000 0.490 0.040 0.370 0.690
18.000 0.240 0.170 0.660 0.870
8.000 0.740 0.001 0.810 0.880
8.000 0.950 0.000 0.860 0.920
20.000 0.390 0.060 0.350 0.690
20.000 0.350 0.170 0.600 0.790
18.000 0.280 0.180 0.580 0.830
18.000 0.340 0.120 0.570 0.780
14.000 0.490 0.090 0.710 0.850
Special Prediction 2.5% 97.5% +/-
No 0.119 0.099 0.139 0.020
No 0.085 0.062 0.107 0.023
Yes 0.303 0.246 0.360 0.057
No 0.234 0.208 0.260 0.026
No 0.284 0.246 0.322 0.038
No 0.064 0.044 0.085 0.021
No 0.108 0.094 0.122 0.014
No 0.106 0.091 0.121 0.015
No 0.104 0.089 0.118 0.015
No 0.166 0.153 0.179 0.013
The model, as a whole and as shown by F, accounts for far more variance than would be expected by chance alone. The p-value is less than 0.001. Indeed, the value of F such that 99.9 percent of F-ratios are smaller is 4.05 while the F we obtain is 36.987. The model almost surely explains far more than random varation. Yet, of six plausible predictors, only Special and Small.Classes have non-zero slopes.
Giving = -0.186 - 0.001*SFR + 0.165*Small.Classes - 0.022*Big.Classes + 0.111*Graduation.Rate + 0.246*Freshman.Retention + 0.185(If Special is Yes) + error
OR
\operatorname{\widehat{Giving}} = -0.1858 - 0.0011(\operatorname{SFR}) + 0.1647(\operatorname{Small.Classes}) - 0.0217(\operatorname{Big.Classes}) + 0.111(\operatorname{Graduation.Rate}) + 0.2462(\operatorname{Freshman.Retention}) + 0.1849(\operatorname{Special}_{\operatorname{Yes}})
Are the residuals from the previous regression normal? Provide convincing evidence that they are. Compare this to the previous model?
They are certainly better as Special has pulled those extreme values of residual giving to the line.
Call:
lm(formula = form_upper, data = dataset)
Coefficients:
(Intercept) SFR Small.Classes
-0.18576 -0.00108 0.16470
Big.Classes Graduation.Rate Freshman.Retention
-0.02173 0.11096 0.24625
SpecialYes
0.18491
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = result$model)
Value p-value Decision
Global Stat 7.579 0.1083 Assumptions acceptable.
Skewness 2.726 0.0988 Assumptions acceptable.
Kurtosis 0.734 0.3915 Assumptions acceptable.
Link Function 1.705 0.1916 Assumptions acceptable.
Heteroscedasticity 2.414 0.1203 Assumptions acceptable.
Given the information in the regression with all predictors, which schools have the most and least unexplained giving given the characteristics of the school?
# A tibble: 6 × 2
School ResidsSpec
<chr> <dbl>
1 United States Air Force Academy -0.124
2 University of California—Berkeley -0.115
3 University of California—Los Angeles -0.0865
4 San Diego State University -0.0802
5 Rutgers, the State University of New Jersey—New Brunswick -0.0707
6 University of Maryland—College Park -0.0703
# A tibble: 6 × 2
School ResidsSpec
<chr> <dbl>
1 University of Southern California 0.146
2 University of Arkansas 0.124
3 Georgia Institute of Technology 0.108
4 Duke University 0.0956
5 Clemson University 0.0935
6 University of Nebraska—Lincoln 0.0900
USC most/highest residual giving and the Air Force Academy least/highest (in absolute value) negative residual giving.
Our school is not Special, has a graduation rate of 0.67, a student to faculty ratio of 17 (SFR), 0.34 of the classes have fewer than 20 students (Small.Classes), 0.23 of the classes have over 50 students (Big.Classes), and a freshman retention rate of 0.77. The school’s giving rate is 8%. Assess their performance on average and for a specific year. What does this say about the performance of the development office?
Two ways to do this.
We are just barely within the 95% interval for average giving given the characteristics of our school.
Linear regression (OLS)
Data : AlumniGiving
Response variable : Giving
Explanatory variables: SFR, Small.Classes, Big.Classes, Graduation.Rate, Freshman.Retention, Special
Interval : confidence
Prediction command : Graduation.Rate = 0.67, Small.Classes = 0.34, Big.Classes = 0.23, Special = 'No', Freshman.Retention = 0.77
SFR Graduation.Rate Small.Classes Big.Classes Special
17.772 0.670 0.340 0.230 No
Freshman.Retention Prediction 5% 95% +/-
0.770 0.110 0.084 0.136 0.026
Linear regression (OLS)
Data : AlumniGiving
Response variable : Giving
Explanatory variables: SFR, Small.Classes, Big.Classes, Graduation.Rate, Freshman.Retention, Special
Interval : prediction
Prediction command : Graduation.Rate = 0.67, Small.Classes = 0.34, Big.Classes = 0.23, Special = 'No', Freshman.Retention = 0.77
SFR Graduation.Rate Small.Classes Big.Classes Special
17.772 0.670 0.340 0.230 No
Freshman.Retention Prediction 5% 95% +/-
0.770 0.110 0.026 0.194 0.084
What factors matter? We will think about this in two ways. First, by adding predictors that clearly explain variation. Second, by starting with all reasonable predictors and removing them if they do not have non-zero slopes, using a stepwise procedure.
This is stepwise
Start: AIC=-738
Giving ~ SFR + Small.Classes + Big.Classes + Graduation.Rate +
Freshman.Retention + Special
Df Sum of Sq RSS AIC
- Big.Classes 1 0.0001 0.273 -740
- SFR 1 0.0012 0.274 -739
- Graduation.Rate 1 0.0041 0.277 -738
<none> 0.273 -738
- Freshman.Retention 1 0.0061 0.279 -737
- Small.Classes 1 0.0207 0.293 -731
- Special 1 0.0986 0.371 -702
Step: AIC=-740
Giving ~ SFR + Small.Classes + Graduation.Rate + Freshman.Retention +
Special
Df Sum of Sq RSS AIC
- SFR 1 0.0014 0.274 -741
- Graduation.Rate 1 0.0040 0.277 -740
<none> 0.273 -740
- Freshman.Retention 1 0.0063 0.279 -739
- Small.Classes 1 0.0315 0.304 -728
- Special 1 0.0987 0.371 -704
Step: AIC=-741
Giving ~ Small.Classes + Graduation.Rate + Freshman.Retention +
Special
Df Sum of Sq RSS AIC
<none> 0.274 -741
- Graduation.Rate 1 0.0051 0.279 -741
- Freshman.Retention 1 0.0063 0.280 -740
- Small.Classes 1 0.0571 0.331 -720
- Special 1 0.0989 0.373 -705
----------------------------------------------------
Backward stepwise selection of variables
----------------------------------------------------
Linear regression (OLS)
Data : AlumniGiving
Response variable : Giving
Explanatory variables: SFR, Small.Classes, Big.Classes, Graduation.Rate, Freshman.Retention, Special
Null hyp.: the effect of x on Giving is zero
Alt. hyp.: the effect of x on Giving is not zero
coefficient std.error t.value p.value
(Intercept) -0.224 0.085 -2.625 0.010 **
Small.Classes 0.191 0.039 4.958 < .001 ***
Graduation.Rate 0.117 0.079 1.480 0.141
Freshman.Retention 0.248 0.151 1.644 0.103
Special|Yes 0.185 0.028 6.525 < .001 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-squared: 0.655, Adjusted R-squared: 0.643
F-statistic: 56 df(4,118), p.value < .001
Nr obs: 123
Prediction error (RMSE): 0.047
Residual st.dev (RSD): 0.048
Sum of squares:
df SS
Regression 4 0.520
Error 118 0.274
Total 122 0.794
coefficient 5% 95% +/-
(Intercept) -0.224 -0.366 -0.083 0.142
Small.Classes 0.191 0.127 0.255 0.064
Graduation.Rate 0.117 -0.014 0.248 0.131
Freshman.Retention 0.248 -0.002 0.498 0.250
Special|Yes 0.185 0.138 0.232 0.047
It manages to remove two variables for having insufficient associated squares; we lose 0.002 when measured in R-squared. Though we cannot be certain that Graduation.Rate and Freshman.Retention do not have zero slopes, this largely results from the relationship among them.
If you could recommend one or two operational factors to improve giving, what would they be? Think carefully about which factors can and cannot be controlled by the relevant decision makers.
The overarching evidence provided by Small.Classes, Graduation.Rate, and Freshman.Retention suggests that the ability to deliver more personalized experiences as opposed to greater bureaucratization of the university might be a solution; administrative creep is a well known phenomenon in higher education.
Multiple Regression and Model Selection