Choice and Forecasting: Week 2

Robert W. Walker

Inference and Linear Regression

Chapter 3: Statistics Foundations

Definitions

  • [Arithmetic] Mean mean() and mean(., na.rm=TRUE) \overline{x} = \frac{1}{N}\sum_{i=1}^{N} x_{i}
  • Variances var() and var(., na.rm=TRUE)
    • Population Var_{p}(x) = \frac{1}{N}\sum_{i=1}^{N} (x_{i}-\overline{x})^2
    • Sample Var_{s}(x) = \frac{1}{N-1}\sum_{i=1}^{N} (x_{i}-\overline{x})^2

Standard Deviation

sd() and sd(., na.rm=TRUE)

We use the standard deviation to return to the original units of measurement, avoiding the squared metric of the variance.

  • Population \sigma_{p}(x) = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_{i}-\overline{x})^2}
  • Sample \sigma_{s}(x) = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_{i}-\overline{x})^2}
  • They are related: \sigma_{s}(x) = \sqrt{\frac{N}{N-1}}\sigma_{p}(x)
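A quick numeric check of this relation (a sketch with made-up values, not from the text):

# Illustrative values
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
N <- length(x)
sd.pop <- sqrt(mean((x - mean(x))^2))  # divides by N
sd(x)                                  # divides by N - 1
sqrt(N / (N - 1)) * sd.pop             # matches sd(x)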

Load Some Data

salespeople <- read.csv("http://peopleanalytics-regression-book.org/data/salespeople.csv")

Examples

#Won't work
mean(salespeople$sales)
[1] NA
# Works
mean(salespeople$sales, na.rm=TRUE)
[1] 527.0057
#Won't work
var(salespeople$sales)
[1] NA
# Works
var(salespeople$sales, na.rm=TRUE)
[1] 34308.11
#Won't work
sd(salespeople$sales)
[1] NA
# Works
sd(salespeople$sales, na.rm=TRUE)
[1] 185.2245

Correlation and Covariance

Covariance is the shared variance of x and y.

cov_{s}(x,y) = \frac{1}{N-1}\sum_{i=1}^{N} (x_{i}-\overline{x})(y_{i}-\overline{y})

There is also a population covariance that divides by N.

# Won't work
cov(salespeople$sales, salespeople$customer_rate)
[1] NA
# Works
cov(salespeople$sales, salespeople$customer_rate, use = "complete.obs")
[1] 55.81769

Correlation is a bounded measure between -1 and 1.

r_{s}(x,y) = \frac{cov_{s}(x,y)}{\sigma_{s}(x)\sigma_{s}(y)}

In a two variable regression, r_{x,y}^2 is the variance in y accounted for by x.

# Won't work
cor(salespeople$sales, salespeople$customer_rate)
[1] NA
# Works
cor(salespeople$sales, salespeople$customer_rate, use = "complete.obs")
[1] 0.337805

Sampling and Distributions

What is a random variable? At what level?

  • The text emphasizes independent and identically distributed random variables
  • This can get deep into the mathematical weeds. Wikipedia definition

Statistics have sampling distributions

  • Remember t? A distribution entirely defined by degrees of freedom. For the mean, because the deviations about the mean must sum to zero, it is always N-1.
  • With statistics we distinguish the variability of a statistic, as opposed to data, with the term standard error.
  • The standard error of the mean is SE(\overline{x}) = \frac{\sigma_{s}(x)}{\sqrt{N}}

A Function for the Standard Error

# Define the function
SE.mean <- function(vectorX) {  # Input a vector
# Insert the formula from the slide
  SE <- sd(vectorX, na.rm=TRUE)/sqrt(sum(complete.cases(vectorX)))
# return the value calculated
  SE
}
SE.mean(salespeople$customer_rate)
[1] 0.04761609

Confidence Intervals

A little reading, a refresher, for the case of a proportion is here as a Tufte Handout. You can also see an extended example on the Week 1 entry on the website.

  • Student [W. S. Gosset] showed that the sampling distribution of the sample mean is given by the t-distribution under general conditions satisfying the central limit theorem.
  • t = \frac{\overline{x} - \mu}{SE(\overline{x})}
  • so it must be that, given percentiles of t, \mu = \overline{x} - t \cdot SE(\overline{x})
  • In a sentence, with XXX probability, the true mean lies within \pm t_{n-1, \frac{1-XXX}{2}} standard errors of the sample mean if we want a central interval. Otherwise, all the probability comes from one side.
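A minimal sketch of this arithmetic in R, using the salespeople data loaded earlier (the choice of customer_rate and the 95 percent level are illustrative):

x <- salespeople$customer_rate[!is.na(salespeople$customer_rate)]
n <- length(x)
se <- sd(x) / sqrt(n)                             # standard error of the mean
mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se   # central 95 percent interval
t.test(salespeople$customer_rate)$conf.int        # t.test reports the same interval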

Hypothesis Tests

Posit some true value and assess the likelihood with some a priori level of probability. The duality of this is the topic of the two linked handouts.
  • For cases beyond a single mean, the trick is calculating the appropriate standard error and/or the appropriate degrees of freedom.

Difference of Means

If we wish to know if the mean of two groups is:

  • the same or different
  • one is greater/less than the other

The t distribution is also appropriate for this task, as demonstrated by Welch. Welch’s t

This is implemented as t.test(x, y) or, for tidy data, t.test(var ~ group, data = data).
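A sketch of the two-sample version with the salespeople data from earlier; the split of performance at 3 is purely illustrative and assumes performance is stored as an integer rating:

# Do the two (illustrative) performance groups differ in mean sales?
t.test(sales ~ factor(performance >= 3), data = salespeople)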

An Aside on the Behrens-Fisher Problem

The text makes it seem as though this is a simple problem. It is not.

Behrens-Fisher Wikipedia

Correlations

A modification of the correlation coefficient has also been shown to have a t-distribution with N-2 degrees of freedom; it is often known as t^{*}:

t^{*} = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}

This is automatically implemented as cor.test(x, y).
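For instance, with the salespeople data from earlier (cor.test drops incomplete pairs on its own):

cor.test(salespeople$sales, salespeople$customer_rate)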

Contingency tables

We can also examine independence among the rows and columns of a table. Section 3.3.3 contains this example. The comparison relies on the difference between the observed counts and the counts we would expect knowing only the marginal probability of each row and column value and the total number of items: the expected count in each cell is N times the product of the corresponding row and column probabilities, where the row probabilities and the column probabilities each sum to one.

Chi-square independence
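A small sketch of the expected-count logic with a made-up two-by-two table (the counts are illustrative):

tab <- matrix(c(30, 20, 10, 40), nrow = 2,
              dimnames = list(Group = c("A", "B"), Outcome = c("Yes", "No")))
N <- sum(tab)
outer(rowSums(tab), colSums(tab)) / N   # N times row probability times column probability
chisq.test(tab)$expected                # chisq.test computes the same expected counts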

Two-by-two tables

Are a special case of the above. The Tufte Handout cited earlier goes through the example of a single proportion [of a binary variable] to show that, as long as the expected count in each cell is greater than about 5 or 10, we can use a normal approximation to assess a difference in a two-by-two table.

This is implemented in prop.test(table) after creating the table using table(). Building the table this way will require us to learn a new pipe: %$%, the exposition pipe from magrittr.

Illustration

Let me use an internal dataset to illustrate.

library(magrittr) # for the new pipe
data("Titanic")   # Some data
Tidy.Titanic <- DescTools::Untable(Titanic)  # unTable the data
Tidy.Titanic %$% table(Sex, Survived)
        Survived
Sex        No  Yes
  Male   1364  367
  Female  126  344
Tidy.Titanic %$% table(Sex, Survived) %>% prop.test(.)

    2-sample test for equality of proportions with continuity correction

data:  .
X-squared = 454.5, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
 0.4741109 0.5656866
sample estimates:
   prop 1    prop 2 
0.7879838 0.2680851 
Tidy.Titanic %$% chisq.test(Sex, Survived)

    Pearson's Chi-squared test with Yates' continuity correction

data:  Sex and Survived
X-squared = 454.5, df = 1, p-value < 2.2e-16

Chapter 4 – Regression Models

Ordinary Least Squares

Ordinary Least Squares describes the most common algorithm for estimation, but the linear regression model is the “workhorse” of applied statistics, generalizing the y = mx + c slope-intercept form to multivariate problems. The text cites three examples:

Examples

Bivariate Regression

With y as outcome and x as input, and assuming the relationship is linear, we can write

y = \alpha + \beta x + \epsilon

\alpha and \beta are parameters or coefficients to be estimated

\epsilon is residual or error – the difference between the function of x and the observed value of y.

The method of estimation is, most often and conveniently, finding the values of \alpha and \beta that minimize the sum of squared errors. Because the objective is quadratic, a unique solution always exists. Indeed, the solution for \beta, often written \hat{\beta}, is

\hat{\beta} = \frac{cov(x,y)}{var(x)}
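A quick check of this formula using the salespeople data loaded earlier (treating sales as y and customer_rate as x; the pairing is purely illustrative):

cc <- na.omit(salespeople[, c("sales", "customer_rate")])
cov(cc$sales, cc$customer_rate) / var(cc$customer_rate)       # hat(beta) from the formula
coef(lm(sales ~ customer_rate, data = cc))["customer_rate"]   # the same slope from lm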

An Example with Fictitious Data, part 1

An Example with Fictitious Data, part 2

An Example with Fictitious Data, part 3

Using rayshader plot_gg

Orthogonality

  • Predicted values are a function of x.
  • Actual values are an (identity) function of y
  • Ergo, the estimation space and the error space are independent.

Multiple Regression

The magic of multiple regression is parsing out the partial effects of multiple inputs.

Section 4.3

Interpretation

  • \beta_{0} – the constant – is the expected value of y if all the variables x are zero.
  • \beta_{1} is the change in y for a one unit change in x_{1}, all else held constant [ceteris paribus]
  • \beta_{2} is the change in y for a one unit change in x_{2}, all else held constant [ceteris paribus]
  • \beta_{p} is the change in y for a one unit change in x_{p}, all else held constant [ceteris paribus]

The last three are commonly referred to as partial effects – think partial derivative if you are familiar with calculus.

The point estimate

Is the best guess, but these are estimates with accompanying uncertainty. If \epsilon is normal, then those slopes have a t distribution with N-p-1 degrees of freedom. The reason that each slope consumes a degree of freedom is that the line/plane/hyperplane is constrained to pass through the mean of both x and y.

The confidence interval incorporates the uncertainty – the standard error of \beta_{\cdot}. Confidence intervals can be obtained with confint applied to a model object. visreg::visreg(model.object) will plot them.

An Illustration

data("freeny")
my.lm.rev <- lm(y ~ price.index + market.potential + income.level, data=freeny)
summary(my.lm.rev)

Call:
lm(formula = y ~ price.index + market.potential + income.level, 
    data = freeny)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0273061 -0.0090031  0.0007218  0.0111354  0.0270294 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -13.31014    5.04423  -2.639 0.012339 *  
price.index       -0.83488    0.13084  -6.381 2.44e-07 ***
market.potential   1.62735    0.37682   4.319 0.000123 ***
income.level       0.84556    0.09904   8.538 4.47e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01468 on 35 degrees of freedom
Multiple R-squared:  0.998, Adjusted R-squared:  0.9978 
F-statistic:  5846 on 3 and 35 DF,  p-value: < 2.2e-16

confint

confint(my.lm.rev)
                       2.5 %     97.5 %
(Intercept)      -23.5504685 -3.0698191
price.index       -1.1005095 -0.5692559
market.potential   0.8623559  2.3923347
income.level       0.6445029  1.0466143

How does the model fit?

  • r^2 – yes, the correlation between the predictions and the actual values squared is a measure of goodness of fit.
cor(my.lm.rev$model$y,my.lm.rev$fitted.values)^2
[1] 0.9980083
summary(my.lm.rev)$r.squared
[1] 0.9980083
plot(my.lm.rev$fitted.values,my.lm.rev$model$y, xlab="Fitted", ylab="Actual")
abline(a=0, b=1)

Counterfactual Predictions

# Point prediction
predict(my.lm.rev, newdata=data.frame(price.index=5, income.level=5.86, market.potential=13))
       1 
8.625905 
# On average: a confidence interval for the expected response
predict(my.lm.rev, newdata=data.frame(price.index=5, income.level=5.86, market.potential=13), interval="confidence")
       fit      lwr      upr
1 8.625905 8.523296 8.728514
# All possibilities: a prediction interval for an individual observation
predict(my.lm.rev, newdata=data.frame(price.index=5, income.level=5.86, market.potential=13), interval="prediction")
       fit      lwr      upr
1 8.625905 8.519058 8.732752

Managing Inputs

McNulty’s 2

  1. The input is jointly determined with the outcome – endogeneity.
  2. Missingness

Categorical Predictors

as.factor() almost always sorts this out. His method is a bit more trouble but also works.

Also, one level of the factor will be omitted by default; it is absorbed into the constant to avoid perfect linear combinations. Let me explain.
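A brief sketch with the salespeople data from earlier (this assumes performance is stored as an integer rating, so as.factor() is doing real work):

cat.lm <- lm(sales ~ as.factor(performance), data = salespeople)
coef(cat.lm)               # the first level is absorbed into the (Intercept) as the baseline
head(model.matrix(cat.lm)) # the 0/1 dummy coding that lm builds behind the scenes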

OLS: Assumptions

  1. Linearity:

y = X\beta + \epsilon

  2. Strict Exogeneity

E[\epsilon | X] = 0

  3. No [perfect] multicollinearity:

Rank of the NxK data matrix X is K with probability 1 (N > K).

  4. X is a nonstochastic matrix.

  5. Homoskedasticity

E[\epsilon\epsilon^{\prime}] = \sigma^2I s.t. \sigma^{2} > 0

Properties of OLS Estimators

  • Unbiasedness is E(\hat{\beta} - \beta) = 0
  • Variance E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)^{\prime}]
  • The Gauss-Markov Theorem – Minimum Variance Unbiased Estimator

The first two

Need nothing about the distribution other than the two moment definitions. It is for number three that this starts to matter and, in many ways, this is directly a reflection of Basu’s theorem.

Unbiasedness

With \hat{\beta}=(X^{\prime}X)^{-1}X^{\prime}y, unbiasedness, \mathbb{E}[\hat{\beta} - \beta] = 0, requires

\mathbb{E}[(X^{\prime}X)^{-1}X^{\prime}y - \beta] = 0

We require an inverse already. Invoking the definition of y, we get

\mathbb{E}[\mathbf{(X^{\prime}X)^{-1}X^{\prime}}(\mathbf{X}\beta + \epsilon) - \beta] = 0

\mathbb{E}[\mathbf{(X^{\prime}X)^{-1}X^{\prime}}\mathbf{X}\beta + \mathbf{(X^{\prime}X)^{-1}X^{\prime}}\epsilon - \beta] = 0

Because \mathbf{(X^{\prime}X)^{-1}X^{\prime}}\mathbf{X} = \mathbf{I}, taking expectations and rearranging,

\mathbb{E}[\hat{\beta} - \beta] = \mathbb{E}[\mathbf{(X^{\prime}X)^{-1}X^{\prime}}\epsilon]

If that last expectation is zero, all is well.
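A numeric check of \hat{\beta}=(X^{\prime}X)^{-1}X^{\prime}y using the freeny model estimated in the illustration above (a sketch; model.matrix and coef are the standard accessors):

X <- model.matrix(my.lm.rev)                      # N x K design matrix, intercept included
y <- model.response(model.frame(my.lm.rev))
beta.hat <- solve(crossprod(X), crossprod(X, y))  # (X'X)^{-1} X'y
cbind(beta.hat, coef(my.lm.rev))                  # matches lm's estimates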

Variance

\mathbb{E}[(\hat{\mathbf{\beta}} - \beta)(\hat{\mathbf{\beta}} - \beta)^{\prime}] can be derived as follows.

\mathbb{E}[(\mathbf{(X^{\prime}X)^{-1}X^{\prime}}\mathbf{X}\beta + \mathbf{(X^{\prime}X)^{-1}X^{\prime}}\epsilon - \beta)(\mathbf{(X^{\prime}X)^{-1}X^{\prime}}\mathbf{X}\beta + \mathbf{(X^{\prime}X)^{-1}X^{\prime}}\epsilon - \beta)^{\prime}]

= \mathbb{E}[(\mathbf{I}\beta + \mathbf{(X^{\prime}X)^{-1}X^{\prime}}\epsilon - \beta)(\mathbf{I}\beta + \mathbf{(X^{\prime}X)^{-1}X^{\prime}}\epsilon - \beta)^{\prime}]

Recognizing the zero part from before, we are left with the manageable,

\mathbb{E}[(\hat{\mathbf{\beta}} - \beta)(\hat{\mathbf{\beta}} - \beta)^{\prime}] = \mathbb{E}[\mathbf{(X^{\prime}X)^{-1}X^{\prime}}\epsilon\epsilon^{\prime}\mathbf{X(X^{\prime}X)^{-1}}]

Nifty. With nonstochastic \mathbf{X}, it’s the structure of \epsilon\epsilon^{\prime} and we know what that is. By assumption, we have \sigma^{2}\mathbf{I}. If stochastic, we need more steps to get to the same place.

Gauss-Markov

Proving the Gauss-Markov theorem is not so instructive. From what we already have, we are restricted to linear estimators; any competitor adds or subtracts something from the OLS weights. After the computation, we get the OLS variance plus a positive semi-definite matrix, so OLS always wins. From here, a natural place to go is corrections for non-\mathbf{I} error covariance. We will do plenty of that. And we will eventually need Aitken.

Checking Assumptions

  1. Linearity: Examine the added-variable plots accessible in car for patterns
library(car)
avPlots(my.lm.rev)

  2. Exogeneity: No easy means of assessment. Quite unfortunate.

  3. No perfect multicollinearity: really a technical requirement, because something will have to be dropped to produce estimates owing to matrix invertibility.

  4. Nonstochastic X: makes the math easy but doesn’t really matter.

  5. Constant error variance

For constant error variance, look at the avPlots from before: the spread of the residuals should not change as a function of x. We also commonly use the default plots of the regression model.

plot(my.lm.rev)

Leverage assists in detecting outliers.

Care with Normality

It is not a core assumption for the method; it is a core assumption for inference. There is widespread use of a robust standard error that does not require normality. Normality is what allows t distributions for inference. The q-q plots he cites are nice. Even better is

library(gvlma)
gvlma(my.lm.rev)

Call:
lm(formula = y ~ price.index + market.potential + income.level, 
    data = freeny)

Coefficients:
     (Intercept)       price.index  market.potential      income.level  
        -13.3101           -0.8349            1.6273            0.8456  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma(x = my.lm.rev) 

                      Value p-value                Decision
Global Stat        2.852228  0.5829 Assumptions acceptable.
Skewness           0.056866  0.8115 Assumptions acceptable.
Kurtosis           1.218968  0.2696 Assumptions acceptable.
Link Function      1.574565  0.2095 Assumptions acceptable.
Heteroscedasticity 0.001828  0.9659 Assumptions acceptable.
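The robust standard errors mentioned above are commonly obtained with the sandwich and lmtest packages; a minimal sketch, assuming both packages are installed:

library(sandwich)  # heteroskedasticity-consistent covariance estimators
library(lmtest)    # coeftest() re-tests coefficients with a supplied covariance matrix
coeftest(my.lm.rev, vcov. = vcovHC(my.lm.rev, type = "HC3"))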

Collinearity Inflates Standard Errors

Because the variables are hard to separate; that’s the definition of correlated or collinear.
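Variance inflation factors quantify how much collinearity inflates each coefficient's variance; a quick check on the model above with car (already loaded for avPlots):

car::vif(my.lm.rev)  # values well above 5 or 10 are the usual warning signs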

Extending Linear Regression

  1. Interactions: Suppose two variables have an impact where each depends on the value of the other. The partial derivative now includes a base term and a product term.

  2. Higher order polynomials: the magic of Taylor series – just include powers of the relevant variable.
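Two sketches with the freeny data from the illustration above (the particular specifications are illustrative, not from the text):

# Interaction: the partial effect of price.index now depends on income.level (and vice versa)
summary(lm(y ~ price.index * income.level, data = freeny))
# Second-order polynomial in price.index
summary(lm(y ~ poly(price.index, 2), data = freeny))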