Robert W. Walker
- mean() and mean(., na.rm=TRUE): \overline{x} = \frac{1}{N}\sum_{i=1}^{N} x_{i}
- var() and var(., na.rm=TRUE)
- sd() and sd(., na.rm=TRUE)
We use standard deviation to avoid the squared metric.
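A minimal sketch with a made-up vector (not from the text) showing why the na.rm=TRUE variants matter:
x <- c(1, 2, 3, 4, NA)    # a toy vector with one missing value
mean(x)                   # returns NA because of the missing value
mean(x, na.rm = TRUE)     # 2.5, the mean of the non-missing values
var(x, na.rm = TRUE)      # 1.666667, dividing by N-1 = 3
sd(x, na.rm = TRUE)       # 1.290994, the square root of the variance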
Covariance is the shared variance of x and y.
cov_{s}(x,y) = \frac{1}{N-1}\sum_{i=1}^{N} (x_{i}-\overline{x})(y_{i}-\overline{y})
There is also a population covariance that divides by N.
Correlation is a bounded measure between -1 and 1.
r_{s}(x,y) = \frac{cov_{s}(x,y)}{\sigma_{s}(x)\sigma_{s}(y)}
In a two variable regression, r_{x,y}^2 is the variance in y accounted for by x.
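As a quick sketch of that claim, using R's built-in mtcars data (my choice of example, not the text's):
r <- with(mtcars, cor(wt, mpg))                 # correlation, bounded between -1 and 1
r^2                                             # share of the variance in mpg accounted for by wt
summary(lm(mpg ~ wt, data = mtcars))$r.squared  # matches r^2 in the two-variable case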
What is a random variable? At what level?
A little reading, a refresher, for the case of a proportion is here as a Tufte Handout. You can also see an extended example on the Week 1 entry on the website.
Posit some true value and assess the likelihood with some a priori level of probability. The duality of this is the topic of the two linked handouts.
- For cases beyond a single mean, the trick is calculating the appropriate standard error and/or the appropriate degrees of freedom.
If we wish to know whether the means of two groups are the same:
The t distribution is also appropriate for this task, as demonstrated by Welch.
This is implemented as t.test(x, y) or, for tidy data, t.test(var ~ group, data=data).
The text makes it seem as though this is a simple problem. It is not.
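To make the mechanics concrete, a minimal sketch with R's built-in sleep data (my choice of example):
data(sleep)                          # two groups of sleep-gain measurements
t.test(extra ~ group, data = sleep)  # Welch two-sample t test by default (var.equal = FALSE)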
A modification of the correlation coefficient has also been shown to have a t distribution with N-2 degrees of freedom, often known as t^{*}:
t^{*} = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}
This is automatically implemented as cor.test(x, y).
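For example, again with the built-in mtcars data (an illustration of mine, not the text's):
with(mtcars, cor.test(wt, mpg))  # reports t* on n - 2 = 30 degrees of freedom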
We can also examine independence among the rows and columns of a table; Section 3.3.3 contains this example. The comparison relies on the difference between observed counts and expected counts computed knowing only the marginal probability of each value along the rows and columns and the total number of items, because the expected count is N times the product of the row and column probabilities, where the row probabilities and the column probabilities must each sum to one.
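In symbols, for a table with N total observations, the expected count for the cell in row i and column j is
E_{ij} = N \cdot p_{i\cdot} \cdot p_{\cdot j}
where p_{i\cdot} and p_{\cdot j} are the marginal row and column probabilities.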
Tests of proportions are a special case of the above. The Tufte Handout cited earlier goes through the example of a single proportion [of a binary variable] to show that, as long as the expected count is greater than about 5 or 10, we can use a normal to assess a difference in a two-by-two table.
This is implemented in prop.test(table) after creating the table using table; that will require us to learn a new pipe: %$%.
Let me use an internal dataset to illustrate.
library(magrittr) # for the new pipe
data("Titanic") # Some data
Tidy.Titanic <- DescTools::Untable(Titanic) # unTable the data
Tidy.Titanic %$% table(Sex, Survived)
         Survived
Sex         No  Yes
  Male    1364  367
  Female   126  344
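The output below was presumably produced by piping that table into prop.test; a sketch of the likely call (an assumption, since the original chunk is not shown) is:
Tidy.Titanic %$% table(Sex, Survived) %>% prop.test()  # test equality of the two survival proportions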
2-sample test for equality of proportions with continuity correction
data: .
X-squared = 454.5, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
0.4741109 0.5656866
sample estimates:
prop 1 prop 2
0.7879838 0.2680851
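The chi-squared version presumably comes from a call like the following (again my reconstruction of the chunk):
Tidy.Titanic %$% chisq.test(Sex, Survived)  # Pearson chi-squared with Yates' correction for the 2x2 table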
Pearson's Chi-squared test with Yates' continuity correction
data: Sex and Survived
X-squared = 454.5, df = 1, p-value < 2.2e-16
Ordinary Least Squares describes the most common algorithm for estimation, but the linear regression model is the “workhorse” of applied statistics, generalizing the y = mx + c slope-intercept method to multivariate problems. The text cites three examples:
With y as outcome and x as input, and assuming the relationship is linear, we can write
y = \alpha + \beta x + \epsilon
\alpha and \beta are parameters or coefficients to be estimated
\epsilon is residual or error – the difference between the function of x and the observed value of y.
The method of estimation is, most often and conveniently, finding the values of \alpha and \beta that minimize the sum of squared errors. Because the objective is quadratic, a unique solution exists (provided x varies). Indeed, the solution for \beta, often written \hat{\beta}, is
\hat{\beta} = \frac{cov(x,y)}{var(x)}
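We can verify that identity numerically; a sketch with the built-in mtcars data (my choice of illustration):
with(mtcars, cov(wt, mpg) / var(wt))  # the covariance-over-variance formula
coef(lm(mpg ~ wt, data = mtcars))     # the slope on wt is the same number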
Using rayshader and plot_gg.
The magic of multiple regression is parsing out the partial effects of multiple inputs.
The last three are commonly referred to as partial effects – think partial derivative if you are familiar with calculus.
The point estimate is the best guess, but these are estimates with accompanying uncertainty. If \epsilon is normal, then those slopes have a t distribution with N-p-1 degrees of freedom. The reason each slope consumes a degree of freedom is that the line/plane/hyperplane is constrained to pass through the mean of both x and y.
The confidence interval incorporates the uncertainty: the standard error of \beta_{\cdot}. Confidence intervals can be obtained with confint applied to a model object; visreg::visreg(model.object) will plot them.
data("freeny")
my.lm.rev <- lm(y ~ price.index + market.potential + income.level, data=freeny)
summary(my.lm.rev)
Call:
lm(formula = y ~ price.index + market.potential + income.level,
data = freeny)
Residuals:
Min 1Q Median 3Q Max
-0.0273061 -0.0090031 0.0007218 0.0111354 0.0270294
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -13.31014 5.04423 -2.639 0.012339 *
price.index -0.83488 0.13084 -6.381 2.44e-07 ***
market.potential 1.62735 0.37682 4.319 0.000123 ***
income.level 0.84556 0.09904 8.538 4.47e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.01468 on 35 degrees of freedom
Multiple R-squared: 0.998, Adjusted R-squared: 0.9978
F-statistic: 5846 on 3 and 35 DF, p-value: < 2.2e-16
confint applied to my.lm.rev gives interval estimates for each coefficient. The model's multiple R-squared is 0.9980083.
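A minimal sketch of calls that yield these quantities (my reconstruction; the original chunk is not shown):
confint(my.lm.rev)                  # interval estimates for each coefficient
summary(my.lm.rev)$r.squared        # the multiple R-squared, 0.9980083
cor(freeny$y, fitted(my.lm.rev))^2  # the same quantity as a squared correlation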
# Point
predict(my.lm.rev, newdata=data.frame(price.index=5, income.level=5.86, market.potential=13))
1
8.625905
# On Average
predict(my.lm.rev, newdata=data.frame(price.index=5, income.level=5.86, market.potential=13), interval="confidence")
fit lwr upr
1 8.625905 8.523296 8.728514
# All possibilities
predict(my.lm.rev, newdata=data.frame(price.index=5, income.level=5.86, market.potential=13), interval="prediction")
fit lwr upr
1 8.625905 8.519058 8.732752
as.factor() almost always sorts this out. His method is a bit more trouble but also works.
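A minimal sketch with built-in data (mtcars again, my choice; cyl is stored as a number but is categorical in spirit):
mod <- lm(mpg ~ wt + as.factor(cyl), data = mtcars)  # cyl treated as categorical
coef(mod)  # no coefficient for cyl == 4; that level is the omitted baseline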
Also, one level of the factor will be omitted by default; it is absorbed into the constant to avoid perfect linear combinations. Let me explain.
y = X\beta + \epsilon
- Exogeneity: E[\epsilon | X] = 0.
- The rank of the NxK data matrix X is K with probability 1 (N > K).
- X is a nonstochastic matrix.
- Homoskedasticity: E[\epsilon\epsilon^{\prime}] = \sigma^2I with \sigma^{2} > 0.
We need nothing about the distribution other than the two moment definitions. It is for number three that this starts to matter and, in many ways, this is directly a reflection of Basu’s theorem.
With \hat{\beta}=(X^{\prime}X)^{-1}X^{\prime}y, unbiasedness, \mathbb{E}[\hat{\beta} - \beta] = 0, requires
\mathbb{E}[(X^{\prime}X)^{-1}X^{\prime}y - \beta] = 0
We require an inverse already. Invoking the definition of y, we get
\mathbb{E}[\mathbf{(X^{\prime}X)^{-1}X^{\prime}}(\mathbf{X}\beta + \epsilon) - \beta] = 0
\mathbb{E}[\mathbf{(X^{\prime}X)^{-1}X^{\prime}}\mathbf{X}\beta + \mathbf{(X^{\prime}X)^{-1}X^{\prime}}\epsilon - \beta] = 0
Noting that \mathbf{(X^{\prime}X)^{-1}X^{\prime}X} = \mathbf{I}, taking expectations and rearranging gives
\mathbb{E}[\hat{\beta} - \beta] = \mathbb{E}[\mathbf{(X^{\prime}X)^{-1}X^{\prime}}\epsilon]
If the latter term is zero, all is well.
\mathbb{E}[(\hat{\mathbf{\beta}} - \beta)(\hat{\mathbf{\beta}} - \beta)^{\prime}] can be derived as follows.
\mathbb{E}[(\mathbf{(X^{\prime}X)^{-1}X^{\prime}}\mathbf{X}\beta + \mathbf{(X^{\prime}X)^{-1}X^{\prime}}\epsilon - \beta)(\mathbf{(X^{\prime}X)^{-1}X^{\prime}}\mathbf{X}\beta + \mathbf{(X^{\prime}X)^{-1}X^{\prime}}\epsilon - \beta)^{\prime}]
\mathbb{E}[(\mathbf{I}\beta + \mathbf{(X^{\prime}X)^{-1}X^{\prime}}\epsilon - \beta)(\mathbf{I}\beta + \mathbf{(X^{\prime}X)^{-1}X^{\prime}}\epsilon - \beta)^{\prime}]
Recognizing the zero part from before, we are left with the manageable,
\mathbb{E}[(\hat{\mathbf{\beta}} - \beta)(\hat{\mathbf{\beta}} - \beta)^{\prime}] = \mathbb{E}[\mathbf{(X^{\prime}X)^{-1}X^{\prime}}\epsilon\epsilon^{\prime}\mathbf{X(X^{\prime}X)^{-1}}]
Nifty. With nonstochastic \mathbf{X}, it’s the structure of \epsilon\epsilon^{\prime} and we know what that is. By assumption, we have \sigma^{2}\mathbf{I}. If stochastic, we need more steps to get to the same place.
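Substituting \sigma^{2}\mathbf{I} for \epsilon\epsilon^{\prime} collapses this to the textbook result,
\mathbb{E}[(\hat{\mathbf{\beta}} - \beta)(\hat{\mathbf{\beta}} - \beta)^{\prime}] = \sigma^{2}\mathbf{(X^{\prime}X)^{-1}}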
Proving the Gauss-Markov theorem is not so instructive. From what we already have, we are restricted to linear estimators; any competitor adds or subtracts something from the OLS weights. After computation, we get the OLS variance plus a positive semi-definite matrix, so OLS always wins. From here, a natural place to go is corrections for a non-\mathbf{I} error covariance. We will do plenty of that, and we will eventually need Aitken.
We can use car to look for patterns.
Exogeneity: No easy means of assessment. Quite unfortunate.
No perfect multicollinearity: really a technical requirement, because something will have to be dropped to produce estimates owing to matrix invertibility.
A nonstochastic X makes the math easy but doesn’t really matter.
Constant error variance
The avPlots from before, as a function of x. We also commonly use the default plots of the regression model.
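A sketch of the standard calls (the original chunks are not shown; the car package is needed for the second):
par(mfrow = c(2, 2))     # arrange the four default diagnostic plots
plot(my.lm.rev)          # residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage
car::avPlots(my.lm.rev)  # added-variable plots from the car package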
Leverage assists in detecting outliers.
It is not a core assumption for the method; it is a core assumption for inference. There is widespread use of a robust standard error that does not require normality. Normality is what allows t distributions for inference. The q-q plots he cites are nice. Even better is gvlma.
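The assessment below comes from the gvlma package; the call was presumably something like:
library(gvlma)    # global validation of linear model assumptions
gvlma(my.lm.rev)  # prints the assessment shown below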
Call:
lm(formula = y ~ price.index + market.potential + income.level,
data = freeny)
Coefficients:
(Intercept) price.index market.potential income.level
-13.3101 -0.8349 1.6273 0.8456
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = my.lm.rev)
Value p-value Decision
Global Stat 2.852228 0.5829 Assumptions acceptable.
Skewness 0.056866 0.8115 Assumptions acceptable.
Kurtosis 1.218968 0.2696 Assumptions acceptable.
Link Function 1.574565 0.2095 Assumptions acceptable.
Heteroscedasticity 0.001828 0.9659 Assumptions acceptable.
Because the variables are hard to separate: that is the definition of correlated or collinear.
Interactions: Suppose two variables have an impact where each depends on the value of the other. The partial derivative now includes a base term and a product term.
Higher order polynomials: the magic of Taylor series – just include powers of the relevant variable.
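A sketch of both ideas with the freeny model from above (my own extensions, not the text's examples):
lm(y ~ price.index * income.level, data = freeny)      # interaction: both main effects plus their product
lm(y ~ price.index + I(price.index^2), data = freeny)  # quadratic term added via I()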
Models of Choice and Forecasting