Linear Models
Will expand on the previous by unpacking the box, reducing residual squares via the inclusion of features, explanatory variables, explanatory factors, exogenous inputs, predictor variables. In this case, let us posit that salary is a function of gender. As an equation, that yields:
Salary_{i} = \alpha + \beta_{1}*Gender_{i} + \epsilon_{i} We want to measure that \beta; in this case, the difference between Male and Female. By default, R will include Female as the constant.
What does the regression imply? That salary for each individual i is a function of a constant and gender. Given the way that R works, \alpha will represent the average for females and \beta will represent the difference between male and female average salaries. The \epsilon will capture the difference between the individual’s salary and the average of their group (the mean of males or females).
The Squares
The sum of the remaining squared vertical distances is 238621277 and is obtained by squaring each black/red number and summing them. The amount explained by gender [measured in squared dollars] is the difference between the sums of blue and black/red numbers, squared. It is important to notice that the highest paid females and the lowest paid males may have more residual salary given two averages but the different averages, overall, lead to far less residual salary than a single average for all salaries. Indeed, gender alone accounts for:
Intuitively, Gender should account for 16 times the squared difference between Female and Overall Average (4522) and 16 times the squared difference between Male and Overall Average (4522).