class: center, middle, inverse, title-slide .title[ # All About Squares ] .subtitle[ ## Unpacking the Black Box ] .author[ ### Robert W. Walker ] .date[ ### 2024-10-30 ] --- background-image: url("img/SquaresBG.jpg") class: center, middle -- ![Black Box Technology](img/bbox.png) --- .left-column[ ### Summary Statistics are Insufficient ``` r load(url("https://github.com/robertwwalker/DADMStuff/raw/master/RegressionExamples.RData")) datasaurus %>% group_by(dataset) %>% summarise(Mean.x = mean(x), Std.Dev.x = sd(x), Mean.y = mean(y), Std.Dev.y = sd(y)) %>% kable(format="html") ``` ] .right-column[ <table> <thead> <tr> <th style="text-align:left;"> dataset </th> <th style="text-align:right;"> Mean.x </th> <th style="text-align:right;"> Std.Dev.x </th> <th style="text-align:right;"> Mean.y </th> <th style="text-align:right;"> Std.Dev.y </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> away </td> <td style="text-align:right;"> 54.26610 </td> <td style="text-align:right;"> 16.76983 </td> <td style="text-align:right;"> 47.83472 </td> <td style="text-align:right;"> 26.93974 </td> </tr> <tr> <td style="text-align:left;"> bullseye </td> <td style="text-align:right;"> 54.26873 </td> <td style="text-align:right;"> 16.76924 </td> <td style="text-align:right;"> 47.83082 </td> <td style="text-align:right;"> 26.93573 </td> </tr> <tr> <td style="text-align:left;"> circle </td> <td style="text-align:right;"> 54.26732 </td> <td style="text-align:right;"> 16.76001 </td> <td style="text-align:right;"> 47.83772 </td> <td style="text-align:right;"> 26.93004 </td> </tr> <tr> <td style="text-align:left;"> dino </td> <td style="text-align:right;"> 54.26327 </td> <td style="text-align:right;"> 16.76514 </td> <td style="text-align:right;"> 47.83225 </td> <td style="text-align:right;"> 26.93540 </td> </tr> <tr> <td style="text-align:left;"> dots </td> <td style="text-align:right;"> 54.26030 </td> <td style="text-align:right;"> 16.76774 </td> <td style="text-align:right;"> 47.83983 </td> <td style="text-align:right;"> 26.93019 </td> </tr> <tr> <td style="text-align:left;"> h_lines </td> <td style="text-align:right;"> 54.26144 </td> <td style="text-align:right;"> 16.76590 </td> <td style="text-align:right;"> 47.83025 </td> <td style="text-align:right;"> 26.93988 </td> </tr> <tr> <td style="text-align:left;"> high_lines </td> <td style="text-align:right;"> 54.26881 </td> <td style="text-align:right;"> 16.76670 </td> <td style="text-align:right;"> 47.83545 </td> <td style="text-align:right;"> 26.94000 </td> </tr> <tr> <td style="text-align:left;"> slant_down </td> <td style="text-align:right;"> 54.26785 </td> <td style="text-align:right;"> 16.76676 </td> <td style="text-align:right;"> 47.83590 </td> <td style="text-align:right;"> 26.93610 </td> </tr> <tr> <td style="text-align:left;"> slant_up </td> <td style="text-align:right;"> 54.26588 </td> <td style="text-align:right;"> 16.76885 </td> <td style="text-align:right;"> 47.83150 </td> <td style="text-align:right;"> 26.93861 </td> </tr> <tr> <td style="text-align:left;"> star </td> <td style="text-align:right;"> 54.26734 </td> <td style="text-align:right;"> 16.76896 </td> <td style="text-align:right;"> 47.83955 </td> <td style="text-align:right;"> 26.93027 </td> </tr> <tr> <td style="text-align:left;"> v_lines </td> <td style="text-align:right;"> 54.26993 </td> <td style="text-align:right;"> 16.76996 </td> <td style="text-align:right;"> 47.83699 </td> <td style="text-align:right;"> 26.93768 </td> </tr> <tr> <td style="text-align:left;"> wide_lines </td> <td 
style="text-align:right;"> 54.26692 </td> <td style="text-align:right;"> 16.77000 </td> <td style="text-align:right;"> 47.83160 </td> <td style="text-align:right;"> 26.93790 </td> </tr> <tr> <td style="text-align:left;"> x_shape </td> <td style="text-align:right;"> 54.26015 </td> <td style="text-align:right;"> 16.76996 </td> <td style="text-align:right;"> 47.83972 </td> <td style="text-align:right;"> 26.93000 </td> </tr> </tbody> </table> ]

---
class: center, middle, inverse

# Relationships Matter

---

.left-column[

``` r
datasaurus %>%
  ggplot() +
  aes(x=x, y=y, color=dataset) +
  geom_point() +
  guides(color="none") +
  facet_wrap(vars(dataset))
```
]

.right-column[
<img src="index_files/figure-html/unnamed-chunk-3-1.svg" width="468" />
]

## Always Visualize the Data

---
class: inverse

## Models Use Inference to ...

--

## Unpack Black Boxes

--

## We will literally do it with boxes/squares

---

# Data and a Library

``` r
library(tidyverse); library(patchwork); library(skimr)
load(url("https://github.com/robertwwalker/DADMStuff/raw/master/RegressionExamples.RData"))
source('https://github.com/robertwwalker/DADMStuff/raw/master/ResidPlotter.R')
source('https://github.com/robertwwalker/DADMStuff/raw/master/PlotlyReg.R')
```

---

# Pivots: Gather/Spread and Tidy Data

Radiant can be picky about data structures. `ggplot` is built around the idea of tidy data from Week 2. How do we make data tidy? At the command line, we use `pivot_longer` and `pivot_wider`; Radiant still relies on their (at some point to be obsolete) predecessors, `gather` and `spread`.

---

# A Bit on Covariance and Correlation

Covariance is defined as:

`$$cov_{xy} = \frac{1}{N-1}\sum_{i} (x_{i} - \overline{x})(y_{i} - \overline{y})$$`

Where `\(\overline{x}\)` and `\(\overline{y}\)` are the sample means of x and y. The items in parentheses are deviations from the mean; covariance is (essentially) the average product of those deviations. The metric of covariance is the product of the metrics of `\(x\)` and `\(y\)`; often something for which no baseline exists. For this reason, we use a normalized form of covariance known as correlation, often just `\(\rho\)`.

`$$\rho_{xy} = \frac{1}{N-1}\sum_{i} \frac{(x_{i} - \overline{x})(y_{i} - \overline{y})}{s_{x}s_{y}}$$`

Correlation is a **normed** measure; it varies between -1 and 1.

---
class: inverse

## An Oregon Gender Gap

--

.pull-left[
A random sample of salaries in the 1990s for new hires from Oregon DAS. I know *nothing* about them. It is worth noting that these data represent a protected class by Oregon statute. We have a state file or we can load the data from github.

``` r
OregonSalaries %>%
  select(-Obs) %>%
  mutate(Row = rep(c(1:16),2)) %>%
  pivot_wider(names_from = Gender, values_from = Salary) %>%
  select(-Row)
```
]

.pull-right[

```
# A tibble: 16 × 2
   Female   Male
    <dbl>  <dbl>
 1 41514. 47343.
 2 40964. 46382.
 3 39170. 45813.
 4 37937. 46410.
 5 33982. 43796.
 6 36077. 43124.
 7 39174. 49444.
 8 39037. 44806.
 9 29132. 44440.
10 36200. 46680.
11 38561. 47337.
12 33248. 47299.
13 33609. 41461.
14 33669. 43598.
15 37806. 43431.
16 35846. 49266.
```
]

---

## Visualising Independent Samples

``` r
ggplot(OregonSalaries) +
  aes(x=Gender, y=Salary) +
  geom_boxplot() +
  theme_xaringan()
```

<img src="index_files/figure-html/unnamed-chunk-5-1.svg" width="468" />

---

## Visualising Independent Samples

``` r
ggplot(OregonSalaries) +
  aes(x=Gender, y=Salary, fill=Gender) +
  geom_violin() +
  theme_xaringan()
```

<img src="index_files/figure-html/unnamed-chunk-6-1.svg" width="468" />

---

## About Those Squares: Deviations

The mean of all salaries is 41142.433.
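That grand mean, and each person's deviation from it, can be computed directly; a minimal sketch using the `OregonSalaries` data loaded above (the intercept-only model `lm(Salary ~ 1)` returns exactly those deviations as its residuals):

``` r
# Grand mean of the 32 salaries, pooling both genders
mean(OregonSalaries$Salary)
# Each person's deviation from that single mean
residuals(lm(Salary ~ 1, data = OregonSalaries))
```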
Represented in equation form, we have:

`$$Salary_{i} = \alpha + \epsilon_{i}$$`

The `\(i^{th}\)` person's salary is some average salary `\(\alpha\)` [or perhaps `\(\mu\)` to maintain conceptual continuity] (shown as a solid blue line) plus some idiosyncratic remainder or residual salary for individual `\(i\)` denoted by `\(\epsilon_{i}\)` (shown as a blue arrow). Everything here is measured in dollars.

`$$Salary_{i} = 41142.433 + \epsilon_{i}$$`

---
class: left
background-image: url("img/Dev1.png")
background-position: right

--

```
# A tibble: 16 × 2
   Female[,1] Male[,1]
        <dbl>    <dbl>
 1       372.    6200.
 2      -178.    5240.
 3     -1972.    4670.
 4     -3206.    5267.
 5     -7161.    2654.
 6     -5065.    1982.
 7     -1968.    8301.
 8     -2105.    3663.
 9    -12011.    3298.
10     -4942.    5537.
11     -2581.    6195.
12     -7895.    6156.
13     -7533.     319.
14     -7473.    2456.
15     -3337.    2289.
16     -5296.    8124.
```

---

By definition, those vertical distances would/will sum to zero. **This sum-to-zero constraint is what consumes a degree of freedom; it is why the standard deviation has N-1 degrees of freedom.**

The distance from the point to the line is also shown in blue; that is the observed residual salary. It shows how far that individual's salary is from the overall average.

---
class: left
background-image: url("img/Dev2.png")
background-position: right

# A Big Black Box

--

```
# A tibble: 16 × 2
   Female[,1]  Male[,1]
        <dbl>     <dbl>
 1    138350. 38442803.
 2     31813. 27457097.
 3   3889736. 21813358.
 4  10277545. 27743644.
 5  51274988.  7041698.
 6  25655866.  3926692.
 7   3874503. 68912991.
 8   4431282. 13420200.
 9 144256540. 10876058.
10  24423237. 30660131.
11   6661738. 38373872.
12  62323288. 37899934.
13  56745270.   101515.
14  55848872.  6031248.
15  11132919.  5238385.
16  28050778. 65999032.
```

```
  sum(Salary.Deviation.Squared)
1                     892955384
```

---

## The Black Box of Salaries

The total area of the `black box` in the original metric (squared dollars) is: 892955385. The length of each side is the square root of that area, e.g. 29882.36 in dollars.

![](img/Dev3.png)

---
class: inverse
background-image: url("img/Dev3.png")
background-position: bottom

This square, when averaged by degrees of freedom and called the *mean square*, has a known distribution, `\(\chi^2\)`, *if we assume that the deviations, call them errors, are normal.*

---

# The `\(\chi^2\)` distribution

The `\(\chi^2\)` [chi-squared] distribution can be derived as the sum of `\(k\)` squared `\(Z\)` or standard normal variables. Because they are squares, they are always positive. The `\(k\)` is called *degrees of freedom*. Differences in `\(\chi^2\)` also have a `\(\chi^2\)` distribution with degrees of freedom equal to the difference in degrees of freedom. As a technical matter (for x > 0):

`$$f(x) = \frac{1}{2^{\frac{k}{2}}\Gamma(k/2)} x^{\frac{k}{2}-1}e^{\frac{-x}{2}}$$`

---

# Some language and `\(F\)`

*Sum of squares*:

`$$\sum_{i=1}^{N} e_{i}^{2}$$`

*Mean square*: divide the sum of squares by `\(k\)` degrees of freedom. NB: Take the square root and you get the standard deviation.

`$$\frac{1}{k}\sum_{i=1}^{N} e_{i}^{2}$$`

Fisher's `\(F\)` ratio describes the ratio of two `\(\chi^2\)` variables (mean squares), with numerator and denominator degrees of freedom. It is the ratio of [degrees-of-freedom] averaged squared errors. As the ratio becomes large, we obtain greater evidence that the two are not the same. Let's try that here.
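Before turning to the data, both distributional claims can be spot-checked by simulation; a minimal sketch using only base R (the seed and degrees of freedom are arbitrary illustrative choices):

``` r
# The sum of k squared standard normals behaves like a chi-squared with k degrees of freedom.
set.seed(1234)
k <- 5
chisq.sim <- replicate(10000, sum(rnorm(k)^2))
c(simulated.mean = mean(chisq.sim), theoretical.mean = k)
# The ratio of two independent mean squares behaves like an F with those degrees of freedom.
F.sim <- replicate(10000, mean(rnorm(10)^2) / mean(rnorm(20)^2))
c(simulated.95th = unname(quantile(F.sim, 0.95)), theoretical.95th = qf(0.95, df1 = 10, df2 = 20))
```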
---

# The Core Conditional

But all of this stems from the presumption that `\(e_{i}\)` is normal; that is a claim that cannot really be proven or disproven. *It is a core assumption that we should try to verify/falsify.*

The `\(e_{i}\)` being normal implies that their squares are `\(\chi^2\)`<sup>a</sup> and, further, that the ratio of their mean squares is `\(F\)`. As an aside, the technical definition of `\(t\)`<sup>b</sup> is a standard normal divided by the square root of a `\(\chi^2\)` (itself divided by its degrees of freedom), so `\(t\)`, squared, must be equivalent to an `\(F\)` with one degree of freedom in the numerator.

.footnote[
<sup>a</sup> This is why the various results in prop.test are reported as X-squared.

<sup>b</sup> I mentioned this fact previously when introducing t (defined by degrees of freedom and with metric standard error).
]

---

# Verification of the Normal

1. Plots [shape]
2. The quantile-quantile plot
3. Statistical tests

---

# plots

``` r
plot(density(lm(OregonSalaries$Salary~1)$residuals), main="Residual Salary")
```

<img src="index_files/figure-html/unnamed-chunk-7-1.svg" width="468" />

---

### the quantile-quantile plot of a normal

.pull-left[
The mean of residual salary is zero. The standard deviation is the root-mean-square. Compare observed residual salary [as z] to the hypothetical normal.

``` r
data.frame(x=scale(lm(OregonSalaries$Salary~1)$residuals)) %>%
  arrange(x) %>%
  mutate(y=qnorm(seq(1, length(OregonSalaries$Salary))/33)) %>%
  ggplot() +
  aes(x=x, y=y) +
  geom_point()
```
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-9-1.svg" width="468" />
]

---

## tests

We can examine the claim that residual salary is plausibly normal by examining how closely the sample quantiles track the theoretical normal quantiles: the slope of the q-q plot. This is essentially what the following does under the null hypothesis of a normal.

``` r
shapiro.test(lm(Salary~1, data=OregonSalaries)$residuals)
```

```

	Shapiro-Wilk normality test

data:  lm(Salary ~ 1, data = OregonSalaries)$residuals
W = 0.9591, p-value = 0.2595
```

---

## Linear Models

We will expand on the previous by unpacking the box, reducing the residual squares via the inclusion of *features, explanatory variables, explanatory factors, exogenous inputs, predictor variables*. In this case, let us posit that salary is a function of gender. As an equation, that yields:

`$$Salary_{i} = \alpha + \beta_{1}*Gender_{i} + \epsilon_{i}$$`

We want to measure that `\(\beta\)`; in this case, the difference between Male and Female. By default, R will include Female as the constant. What does the regression imply? That salary for each individual `\(i\)` is a function of a constant and gender. Given the way that R works, `\(\alpha\)` will represent the average for females and `\(\beta\)` will represent the difference between male and female average salaries. The `\(\epsilon\)` will capture the difference between the individual's salary and the average of their group (the mean of males or females).

---

## A New Residual Sum of Squares

.pull-left[
The picture will now have a red line and a black line, and the residual/leftover/unexplained salary is now the vertical distance between each point and its respective group line (red arrows or black arrows). What is the relationship between the datum and the group mean? The answer is shown in black/red.
]

.pull-right[
<img src="index_files/figure-html/BasePlot-1.svg" width="468" />
]

---

## The Squares

The sum of the remaining squared vertical distances is 238621277 and is obtained by squaring each black/red number and summing them. The amount explained by gender [measured in squared dollars] is the difference between the sum of squared blue numbers and the sum of squared black/red numbers.
It is important to notice that the highest-paid females and the lowest-paid males may have *more* residual salary measured against two group averages, but overall the two group averages leave far less residual salary than a single average for all salaries. Indeed, gender alone accounts for:

``` r
sum(scale(OregonSalaries$Salary, scale = FALSE)^2) - sum(lm(Salary~Gender, data=OregonSalaries)$residuals^2)
```

```
[1] 654334108
```

Intuitively, *Gender* should account for 16 times the squared difference between the Female average and the overall average plus 16 times the squared difference between the Male average and the overall average (a difference of roughly 4522 dollars in each case).

``` r
32*(mean(OregonSalaries$Salary[OregonSalaries$Gender=="Female"], na.rm=TRUE)-mean(OregonSalaries$Salary))^2
```

```
[1] 654334108
```

---

## A Visual: What Proportion is Accounted For?

<img src="index_files/figure-html/BBReg-1.svg" width="468" />

---

## Regression Tables

``` r
Model.1 <- lm(Salary~Gender, data=OregonSalaries)
summary(Model.1)
```

```

Call:
lm(formula = Salary ~ Gender, data = OregonSalaries)

Residuals:
    Min      1Q  Median      3Q     Max 
-7488.7 -2107.9   433.3  1743.9  4893.9 

Coefficients:
            Estimate Std. Error t value       Pr(>|t|)    
(Intercept)  36620.5      705.1   51.94        < 2e-16 ***
GenderMale    9043.9      997.1    9.07 0.000000000422 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2820 on 30 degrees of freedom
Multiple R-squared:  0.7328,  Adjusted R-squared:  0.7239 
F-statistic: 82.26 on 1 and 30 DF,  p-value: 0.0000000004223
```

---
class: inverse
background-image: url("img/Fbox.png")
background-position: bottom
background-size: 600px 150px

## Analysis of Variance: The Squares

F is the ratio of the two mean squares. It is also `\(t^2\)` (here, `\(9.07^2 \approx 82.26\)`).

.small[

``` r
anova(Model.1)
```

```
Analysis of Variance Table

Response: Salary
          Df    Sum Sq   Mean Sq F value          Pr(>F)    
Gender     1 654334108 654334108  82.264 0.0000000004223 ***
Residuals 30 238621277   7954043                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
]

---
background-image: url("img/F130.jpg")
background-size: contain

---
background-image: url("img/radiant-ORSlm.jpg")
background-size: contain

---
class: center, middle, inverse

# The Idea Scales to both Quantitative Variables and Multiple Categories

---

### A Multi-Category Example

``` r
Model.2 <- lm(Expenditures~Age.Cohort, data=WH)
summary(Model.2)
```

```

Call:
lm(formula = Expenditures ~ Age.Cohort, data = WH)

Residuals:
     Min       1Q   Median       3Q      Max 
-20157.4  -1175.2     21.9   1160.9  16348.4 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)       2224.9      330.0   6.742 3.07e-11 ***
Age.Cohort0 - 5   -839.9      584.9  -1.436 0.151368    
Age.Cohort13-17   1710.3      443.5   3.856 0.000125 ***
Age.Cohort18-21   7816.2      458.7  17.040  < 2e-16 ***
Age.Cohort22-50  38142.7      440.1  86.668  < 2e-16 ***
Age.Cohort51 +   51042.5      537.3  95.000  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3863 on 771 degrees of freedom
Multiple R-squared:  0.9613,  Adjusted R-squared:  0.9611 
F-statistic:  3833 on 5 and 771 DF,  p-value: < 2.2e-16
```

---

## `anova`

``` r
anova(Model.2)
```

```
Analysis of Variance Table

Response: Expenditures
            Df       Sum Sq     Mean Sq F value    Pr(>F)    
Age.Cohort   5 285981965642 57196393128  3833.3 < 2.2e-16 ***
Residuals  771  11503984348    14920862                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

``` r
confint(Model.2)
```

```
                      2.5 %     97.5 %
(Intercept)       1577.0735  2872.7514
Age.Cohort0 - 5  -1988.0158   308.1598
Age.Cohort13-17    839.7131  2580.8856
Age.Cohort18-21   6915.7707  8716.7038
Age.Cohort22-50  37278.7115 39006.5887
Age.Cohort51 +   49987.7479 52097.1984
```

---

# Not A Lot Left....

``` r
black.box.maker(Model.2)
```

<img src="index_files/figure-html/unnamed-chunk-17-1.svg" width="468" />

---
class: center, middle

# Bullseye

![bullseye](https://media1.giphy.com/media/Z61gwKI72Gx2g/giphy.gif)

---
class: inverse

## Messy

``` r
qqnorm(residuals(Model.2))
```

<img src="index_files/figure-html/unnamed-chunk-18-1.svg" width="468" />

---

## What about Ethnicity?

``` r
Model.3 <- lm(Expenditures ~ Ethnicity + Age.Cohort, data=WH)
summary(Model.3)
```

```

Call:
lm(formula = Expenditures ~ Ethnicity + Age.Cohort, data = WH)

Residuals:
     Min       1Q   Median       3Q      Max 
-20075.0  -1206.7      0.1   1202.1  16446.7 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   2359.9      344.7   6.846 1.55e-11 ***
EthnicityWhite not Hispanic   -402.1      298.3  -1.348 0.178045    
Age.Cohort0 - 5               -849.3      584.6  -1.453 0.146685    
Age.Cohort13-17               1733.8      443.6   3.908 0.000101 ***
Age.Cohort18-21               7870.0      460.2  17.101  < 2e-16 ***
Age.Cohort22-50              38311.5      457.4  83.768  < 2e-16 ***
Age.Cohort51 +               51227.2      554.2  92.432  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3861 on 770 degrees of freedom
Multiple R-squared:  0.9614,  Adjusted R-squared:  0.9611 
F-statistic:  3198 on 6 and 770 DF,  p-value: < 2.2e-16
```

---
class: inverse

``` r
confint(Model.3)
```

```
                                  2.5 %     97.5 %
(Intercept)                   1683.2345  3036.6106
EthnicityWhite not Hispanic   -987.6405   183.4497
Age.Cohort0 - 5              -1996.8464   298.2797
Age.Cohort13-17                862.9646  2604.5597
Age.Cohort18-21               6966.5797  8773.3519
Age.Cohort22-50              37413.6880 39209.3043
Age.Cohort51 +               50139.2506 52315.1525
```

---
class: inverse

``` r
anova(Model.2, Model.3)
```

```
Analysis of Variance Table

Model 1: Expenditures ~ Age.Cohort
Model 2: Expenditures ~ Ethnicity + Age.Cohort
  Res.Df         RSS Df Sum of Sq      F Pr(>F)
1    771 11503984348                           
2    770 11476899027  1  27085321 1.8172  0.178
```

At the margin, *Ethnicity* accounts for little variance.

---

# Some Basic Cost Accounting

`$$Total.Cost_{t} = \alpha + \beta*units_t + \epsilon_{t}$$`

With data on per-period (`\(t\)`) costs and units produced, we can partition costs into fixed costs `\(\alpha\)` and variable costs `\(\beta\)` (the cost per unit). Consider the data on Handmade Bags. We want to accomplish two things: first, measure the two key quantities; second, predict costs for a hypothetical number of units.
**A comment on correlation** --- # The Data ``` r HMB %>% skim() %>% kable(format = "html", digits = 3) ``` <table> <thead> <tr> <th style="text-align:left;"> skim_type </th> <th style="text-align:left;"> skim_variable </th> <th style="text-align:right;"> n_missing </th> <th style="text-align:right;"> complete_rate </th> <th style="text-align:right;"> numeric.mean </th> <th style="text-align:right;"> numeric.sd </th> <th style="text-align:right;"> numeric.p0 </th> <th style="text-align:right;"> numeric.p25 </th> <th style="text-align:right;"> numeric.p50 </th> <th style="text-align:right;"> numeric.p75 </th> <th style="text-align:right;"> numeric.p100 </th> <th style="text-align:left;"> numeric.hist </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> numeric </td> <td style="text-align:left;"> units </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 351.967 </td> <td style="text-align:right;"> 131.665 </td> <td style="text-align:right;"> 141.00 </td> <td style="text-align:right;"> 241.5 </td> <td style="text-align:right;"> 356.00 </td> <td style="text-align:right;"> 457.0 </td> <td style="text-align:right;"> 580.00 </td> <td style="text-align:left;"> ▇▇▆▇▆ </td> </tr> <tr> <td style="text-align:left;"> numeric </td> <td style="text-align:left;"> TotCost </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 19946.865 </td> <td style="text-align:right;"> 2985.202 </td> <td style="text-align:right;"> 14357.98 </td> <td style="text-align:right;"> 17593.8 </td> <td style="text-align:right;"> 20073.93 </td> <td style="text-align:right;"> 22415.2 </td> <td style="text-align:right;"> 24887.83 </td> <td style="text-align:left;"> ▆▇▇▇▇ </td> </tr> </tbody> </table> --- class: inverse ### A Look at the Data ``` r p1 <- ggplot(HMB) + aes(x=units, y=TotCost) + geom_point() + labs(y="Total Costs per period", x="Units") plotly::ggplotly(p1) ```
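Recalling the comment on correlation above: with a single predictor, the squared correlation between the two columns equals the regression's R-squared. A minimal check, assuming `HMB` is loaded from the RData file as above:

``` r
# Correlation between units produced and total cost per period
r <- cor(HMB$units, HMB$TotCost)
# Its square previews the Multiple R-squared of the simple regression that follows
c(correlation = r, r.squared = r^2)
```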
--- ## The Thing to Measure ``` r p1 <- ggplot(HMB) + aes(x=units, y=TotCost) + geom_point() + geom_smooth(method="lm") + labs(y="Total Costs per period", x="Units") plotly::ggplotly(p1) ```
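Before fitting, it may help to see what the fitted line summarizes: with a single predictor, least squares sets the slope to the covariance of `\(x\)` and `\(y\)` divided by the variance of `\(x\)`, and the intercept to the mean of `\(y\)` minus the slope times the mean of `\(x\)`. A small sketch of that arithmetic with the `HMB` data:

``` r
# The least-squares slope (variable cost per unit) and intercept (fixed cost), by hand
slope     <- cov(HMB$units, HMB$TotCost) / var(HMB$units)
intercept <- mean(HMB$TotCost) - slope * mean(HMB$units)
c(intercept = intercept, slope = slope)  # should match coef(lm(TotCost ~ units, data=HMB))
```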
---

### A Linear Model

``` r
Model.HMB <- lm(TotCost ~ units, data=HMB)
summary(Model.HMB)
```

```

Call:
lm(formula = TotCost ~ units, data = HMB)

Residuals:
    Min      1Q  Median      3Q     Max 
-1606.7  -610.9  -124.8   509.8  2522.3 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 12308.642    323.601   38.04   <2e-16 ***
units          21.702      0.862   25.18   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 871.8 on 58 degrees of freedom
Multiple R-squared:  0.9162,  Adjusted R-squared:  0.9147 
F-statistic: 633.8 on 1 and 58 DF,  p-value: < 2.2e-16
```

`$$Total.Cost_{t} = 12308.64 + 21.70 * units_t + \epsilon_{t}$$`

where `\(\epsilon_{t}\)` has a standard deviation of 872 dollars.

---

### `fitted.values()` or predictions

The line is made up of the fitted values.

``` r
data.frame(units=HMB$units, TotCost = HMB$TotCost, Fitted = fitted.values(Model.HMB)) %>%
  ggplot() +
  aes(x=units, y=TotCost) +
  geom_point() +
  geom_line(aes(x=units, y=Fitted)) +
  scale_x_continuous(limits=c(0,600)) +
  scale_y_continuous(limits=c(0,25000))
```

<img src="index_files/figure-html/unnamed-chunk-26-1.svg" width="468" />

---

### Residual Costs?

``` r
data.frame(Residual.Cost = residuals(Model.HMB)) %>%
  ggplot() +
  aes(sample=Residual.Cost) +
  geom_qq() +
  geom_qq_line()
```

<img src="index_files/figure-html/unnamed-chunk-27-1.svg" width="468" />

---

## Not the Prettiest but the best....

``` r
car::qqPlot(residuals(Model.HMB))
```

<img src="index_files/figure-html/unnamed-chunk-28-1.svg" width="468" />

---

## Because normal is sustainable, I can ....

- Interpret the table including `\(t\)` and `\(F\)`.
- Utilize confidence intervals for the estimates.
- Predict from the regression in two forms.
- Subject the regression to examination of its other characteristics.

---

### `anova` [Sums of squares]

``` r
anova(Model.HMB)
```

```
Analysis of Variance Table

Response: TotCost
          Df    Sum Sq   Mean Sq F value    Pr(>F)    
units      1 481694173 481694173  633.81 < 2.2e-16 ***
Residuals 58  44080089    760002                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Two types of squares: those that are a function of units and those that are residual. Why?

![](https://www.math.tamu.edu/~don.allen/history/pythag/pyth_pic.gif)

---

### `confint` [confidence intervals]

Given normality, the slope and intercept have `\(t\)` confidence intervals.

``` r
confint(Model.HMB,level = 0.95)
```

```
                  2.5 %      97.5 %
(Intercept) 11660.88464 12956.39966
units          19.97605    23.42705
```

With 95% confidence, fixed costs range from 11660.88 to 12956.40 dollars per period and the variable costs range from 19.976 to 23.427 dollars per unit. If the goal is to attain 20 dollars per unit, we cannot rule that out [though a 95 percent lower bound would].

``` r
confint(Model.HMB,level = 0.9)
```

```
                    5 %        95 %
(Intercept) 11767.72623 12849.55807
units          20.26066    23.14245
```

---

## Plots

``` r
RegressionPlots(Model.HMB)
```
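The predictions on the next slide are just this fitted equation evaluated at new unit counts, with intervals from `predict()`. A minimal sketch of the point prediction for 250 units, assuming `Model.HMB` as estimated above:

``` r
# Fixed cost plus variable cost times volume, by hand
coef(Model.HMB)["(Intercept)"] + coef(Model.HMB)["units"] * 250
# The same point prediction via predict()
predict(Model.HMB, newdata = data.frame(units = 250))
```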
---

### Predicting Costs

### Average Costs `interval="confidence"`

``` r
predict(Model.HMB, newdata=data.frame(units=c(200,250,300)), interval="confidence")
```

```
       fit      lwr      upr
1 16648.95 16303.25 16994.66
2 17734.03 17448.18 18019.88
3 18819.11 18576.63 19061.58
```

### All Costs `interval="prediction"`

``` r
predict(Model.HMB, newdata=data.frame(units=c(200,250,300)), interval="prediction")
```

```
       fit      lwr      upr
1 16648.95 14869.98 18427.92
2 17734.03 15965.71 19502.35
3 18819.11 17057.28 20580.93
```

---

### Diminishing Marginal Cost?

``` r
Semiconductors %>%
  ggplot() +
  aes(x=units, y=TotCost4) +
  geom_point() +
  geom_smooth(method = "lm") +
  geom_smooth(method = "loess", color="red") +
  theme_xaringan() +
  labs(x="Units", y="Total Cost", title="Semiconductors")
```

<img src="index_files/figure-html/unnamed-chunk-35-1.svg" width="468" />

---

## Residuals

``` r
SClm <- lm(TotCost4~units, data=Semiconductors)
SClm2 <- lm(TotCost4~poly(units, 2, raw=TRUE), data=Semiconductors)
par(mfrow=c(2,2))
plot(SClm)
```

<img src="index_files/figure-html/unnamed-chunk-36-1.svg" width="468" />

---

## Residuals

``` r
resid.plotter(SClm2)
```

<img src="index_files/figure-html/unnamed-chunk-37-1.svg" width="468" />

---

## Comparing Squares

``` r
anova(SClm,SClm2)
```

```
Analysis of Variance Table

Model 1: TotCost4 ~ units
Model 2: TotCost4 ~ poly(units, 2, raw = TRUE)
  Res.Df       RSS Df Sum of Sq     F      Pr(>F)    
1     58 304149603                                   
2     57 203027172  1 101122431 28.39 0.000001757 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

## Line*s*?

``` r
PPlm <- lm(TotCost3 ~ units:PlantFac, data=PowerPlant)
PowerPlant %>%
  ggplot() +
  aes(x=units, y=TotCost3, color=PlantFac) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_xaringan() +
  labs(x="Units", y="Total Cost", title="Power Plant")
```

<img src="index_files/figure-html/unnamed-chunk-39-1.svg" width="468" />

---

## Comparison `test` or `anova`

``` r
PPlm1 <- lm(TotCost3~units, data=PowerPlant)
PPlm2 <- lm(TotCost3~PlantFac*units, data=PowerPlant)
anova(PPlm1,PPlm)
```

```
Analysis of Variance Table

Model 1: TotCost3 ~ units
Model 2: TotCost3 ~ units:PlantFac
  Res.Df       RSS Df Sum of Sq      F       Pr(>F)    
1     58 585943761                                     
2     56 348638405  2 237305356 19.059 0.0000004859 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

``` r
anova(PPlm,PPlm2)
```

```
Analysis of Variance Table

Model 1: TotCost3 ~ units:PlantFac
Model 2: TotCost3 ~ PlantFac * units
  Res.Df       RSS Df Sum of Sq      F Pr(>F)
1     56 348638405                           
2     54 345221354  2   3417052 0.2672 0.7665
```

---

## The Premier League

``` r
Model.EPL <- lm(Points ~ Wage.Bill.milGBP, data=EPL)
summary(Model.EPL)
```

```

Call:
lm(formula = Points ~ Wage.Bill.milGBP, data = EPL)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.569  -3.201   0.353   3.712  11.129 

Coefficients:
                 Estimate Std. Error t value       Pr(>|t|)    
(Intercept)      32.11589    2.74905  11.683 0.000000000776 ***
Wage.Bill.milGBP  0.24023    0.02987   8.042 0.000000227415 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.438 on 18 degrees of freedom
Multiple R-squared:  0.7823,  Adjusted R-squared:  0.7702 
F-statistic: 64.67 on 1 and 18 DF,  p-value: 0.0000002274
```

`$$Points = 32.12 + 0.24*Wage.Bill + \epsilon$$`

---

``` r
EPL$FV <- fitted.values(Model.EPL)
PP <- ggplot(EPL) +
  aes(x=Wage.Bill.milGBP, y=Points, text=Team) +
  geom_point() +
  geom_smooth(method="lm", color="black")
PP
```

<img src="index_files/figure-html/unnamed-chunk-43-1.svg" width="468" />

---

``` r
plotly::ggplotly(PP)
```
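Because the fitted values were attached to `EPL` above (`EPL$FV`), over- and under-performance relative to the wage bill is just the residual. A small sketch using the columns named above:

``` r
# Teams ranked by points above or below their wage-bill prediction
EPL %>%
  mutate(Residual.Points = Points - FV) %>%
  arrange(desc(Residual.Points)) %>%
  select(Team, Points, Wage.Bill.milGBP, Residual.Points) %>%
  head()
```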
---

# Not Unreasonable

``` r
qqnorm(residuals(Model.EPL))
```

<img src="index_files/figure-html/unnamed-chunk-45-1.svg" width="468" />

---

## Residuals

``` r
resid.plotter(Model.EPL)
```

<img src="index_files/figure-html/unnamed-chunk-46-1.svg" width="468" />

---

# Predicting Points

``` r
head(predict(Model.EPL, newdata=data.frame(Wage.Bill.milGBP=seq(0,200,10)), interval = "confidence"), 10)
```

```
        fit      lwr      upr
1  32.11589 26.34036 37.89143
2  34.51820 29.26702 39.76939
3  36.92051 32.16859 41.67244
4  39.32282 35.03629 43.60936
5  41.72513 37.85788 45.59239
6  44.12744 40.61679 47.63809
7  46.52975 43.29225 49.76725
8  48.93206 45.86190 52.00222
9  51.33437 48.30814 54.36060
10 53.73668 50.62574 56.84762
```

---

## Some Concluding Remarks

1. It's all about squares. But really, it's all about *normals*.
2. Assessing normality is crucial to inference in linear models, though we cannot really **know** whether or not it holds. If it does:
   a. Slopes and intercepts are `t`, as are their confidence intervals.
   b. Sums of squares have ratios defined by `F`.
   c. Predictions have `t` averages and `normal` intervals.
   d. **All founded upon the sum of squared errors being `\(\chi^2\)`.**
3. This core idea of **explaining variance** or unpacking a black box lies at the core of virtually all *machine learning* and modelling.
4. [The capital asset pricing model.](https://shinytick-geiwez4tia-uc.a.run.app)