Covariance is defined as: \[\sum_{i} (x_{i} - \overline{x})(y_{i} - \overline{y})\]
Where \(\overline{x}\) and \(\overline{y}\) are the sample means of x and y. The items in parentheses are deviations from the mean; covariance is the product of those deviations from the mean. The metric of covariance is the product of the metric of \(x\) and \(y\); often something for which no baseline exists. For this reason, we use a normalized form of covariance known as correlation, often just \(\rho\).
Correlation is a normed measure; it varies between -1 and 1.
An Oregon Gender Gap
A random sample of salaries in the 1990s for new hires from Oregon DAS. I know nothing about them. It is worth noting that these data represent a protected class by Oregon statute. We have a state file or we can load the data from github.
The mean of all salaries is 41142.433. Represented in equation form, we have:
\[Salary_{i} = \alpha + \epsilon_{i}\]
The \(i^{th}\) person’s salary is some average salary \(\alpha\) [or perhaps \(\mu\) to maintain conceptual continuity] (shown as a solid blue line) plus some idiosyncratic remainder or residual salary for individual \(i\) denoted by \(\epsilon_{i}\) (shown as a blue arrow). Everything here is measured in dollars.
By definition, those vertical distances would/will sum to zero. This sum to zero constraint is what consumes a degree of freedom; it is why the standard deviation has N-1 degrees of freedom. The distance from the point to the line is also shown in blue; that is the observed residual salary. It shows how far that individual’s salary is from the overall average.
The total area of the black box in the original metric (squared dollars) is: 892955385. The length of each side is the square root of that area, e.g. 29882.36 in dollars.
This square, when averaged by degrees of freedom and called mean square, has a known distribution, \(\chi^2\)if we assume that the deviations, call them errors, are normal.
The \(\chi^2\) distribution
The \(\chi^2\) [chi-squared] distribution can be derived as the sum of \(k\) squared \(Z\) or standard normal variables. Because they are squares, they are always positive. The \(k\) is called degrees of freedom. Differences in \(\chi^2\) also have a \(\chi^2\) distribution with degrees of freedom equal to the difference in degrees of freedom.
As a technical matter (for x > 0): \[f(x) = \frac{1}{2^{\frac{k}{2}}\Gamma(k/2)} x^{\frac{k}{2-1}}e^{\frac{-x}{2}}\]
Some language and \(F\)
Sum of squares: \[\sum_{i=1}^{N} e_{i}^{2}\]Mean square: divide the sum of squares by \(k\) degrees of freedom. NB: Take the square root and you get the standard deviation. \[\frac{1}{k}\sum_{i=1}^{N} e_{i}^{2}\]
Fisher’s \(F\) ratio or the \(F\) describes the ratio of two \(\chi^2\) variables (mean square) with numerator and denominator degrees of freedom. It is the ratio of [degrees of freedom] averaged squared errors. As the ratio becomes large, we obtain greater evidence that the two are not the same. Let’s try that here.
The Core Conditional
But all of this stems from the presumption that \(e_{i}\) is normal; that is a claim that cannot really be proven or disproven. It is a core assumption that we should try to verify/falsify.
Because the \(e_{i}\) being normal means implies that their squares are \(\chi^2\)a and further that the ratio of their mean-squares are \(F\).
As an aside, the technical definition of \(t\)b is a normal divided by the square root of a \(\chi^2\) so \(t\) and \(F\) must be equivalent with one degree of freedom in the numerator.
a This is why the various results in prop.test are reported as X-squared.
b I mentioned this fact previously when introducing t (defined by degrees of freedom and with metric standard error.)
The mean of residual salary is zero. The standard deviation is the root-mean-square. Compare observed residual salary [as z] to the hypothetical normal.
We can examine the claim that residual salary is plausibly normal by examining the slope of the sample and theoretical quantiles: the slope of the q-q plot. This is exactly what the following does under the null hypothesis of a normal.
Shapiro-Wilk normality test
data: lm(Salary ~ 1, data = OregonSalaries)$residuals
W = 0.9591, p-value = 0.2595
There are a few others. I personally prefer gvlma.
Linear Models
Will expand on the previous by unpacking the box, reducing residual squares via the inclusion of features, explanatory variables, explanatory factors, exogenous inputs, predictor variables. In this case, let us posit that salary is a function of gender. As an equation, that yields:
\[Salary_{i} = \alpha + \beta_{1}*Gender_{i} + \epsilon_{i}\] We want to measure that \(\beta\); in this case, the difference between Male and Female. By default, R will include Female as the constant.
What does the regression imply? That salary for each individual \(i\) is a function of a constant and gender. Given the way that R works, \(\alpha\) will represent the average for females and \(\beta\) will represent the difference between male and female average salaries. The \(\epsilon\) will capture the difference between the individual’s salary and the average of their group (the mean of males or females).
A New Residual Sum of Squares
The picture will now have a red line and a black line and the residual/leftover/unexplained salary is now the difference between the point and the respective vertical line (red arrows or black arrows). What is the relationship between the datum and the group mean? The answer is shown in black/red.
The Squares
The sum of the remaining squared vertical distances is 238621277 and is obtained by squaring each black/red number and summing them. The amount explained by gender [measured in squared dollars] is the difference between the sums of blue and black/red numbers, squared. It is important to notice that the highest paid females and the lowest paid males may have more residual salary given two averages but the different averages, overall, lead to far less residual salary than a single average for all salaries. Indeed, gender alone accounts for:
Intuitively, Gender should account for 16 times the squared difference between Female and Overall Average (4522) and 16 times the squared difference between Male and Overall Average (4522).
With data on per period (t) costs and units produced, we can partition fixed \(\alpha\) and variable costs \(\beta\) (or cost per unit). Consider the data on Handmade Bags. We want to accomplish two things. First, to measure the two key quantities. Second, be able to predict costs for hypothetical number of units.
A comment on correlation
The Data
How’s that done?
HMB %>%skim() %>%kable(format ="html", digits =3)
skim_type
skim_variable
n_missing
complete_rate
numeric.mean
numeric.sd
numeric.p0
numeric.p25
numeric.p50
numeric.p75
numeric.p100
numeric.hist
numeric
units
0
1
351.967
131.665
141.00
241.5
356.00
457.0
580.00
▇▇▆▇▆
numeric
TotCost
0
1
19946.865
2985.202
14357.98
17593.8
20073.93
22415.2
24887.83
▆▇▇▇▇
A Look at the Data
How’s that done?
p1 <-ggplot(HMB) +aes(x=units, y=TotCost) +geom_point() +labs(y="Total Costs per period", x="Units")plotly::ggplotly(p1)
From normal, the slope and intercept have \(t\) confidence intervals.
How’s that done?
confint(Model.HMB,level =0.95)
2.5 % 97.5 %
(Intercept) 11660.88464 12956.39966
units 19.97605 23.42705
With 95% confidence, fixed costs range from 11660.88 to 12956.40 dollars per period and the variable costs range from 19.976 to 23.427 dollars per unit. If the goal is to attain 20 dollars per unit, we cannot rule that out [though a 95 percent lower bound would].
How’s that done?
confint(Model.HMB,level =0.9)
5 % 95 %
(Intercept) 11767.72623 12849.55807
units 20.26066 23.14245
Analysis of Variance Table
Model 1: TotCost3 ~ units
Model 2: TotCost3 ~ units:PlantFac
Res.Df RSS Df Sum of Sq F Pr(>F)
1 58 585943761
2 56 348638405 2 237305356 19.059 0.0000004859 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
How’s that done?
anova(PPlm,PPlm2)
Analysis of Variance Table
Model 1: TotCost3 ~ units:PlantFac
Model 2: TotCost3 ~ PlantFac * units
Res.Df RSS Df Sum of Sq F Pr(>F)
1 56 348638405
2 54 345221354 2 3417052 0.2672 0.7665