Robert W. Walker
The course is intended to expand your familiarity with common models for two distinct types of data, extending the basic linear regression tools that you encountered in Data Analysis, Modelling, and Decision Making. We will do it in two sections.
Non-continuous data
[term of art: limited dependent variables]
Time series data
Assignments of two types:
A primary text for this is Keith McNulty’s excellent Handbook of Regression Modelling in People Analytics: With Examples in R, Python, and Julia
A primary text for forecasting is Rob Hyndman and George Athanasopoulos’s excellent Forecasting: Principles and Practice (3rd ed). The book combines forecasting principles with practical examples in R.
Date TBA: We will have to reschedule this because it represents the cancelled class the first week.
An original modelling project is the expectation/deliverable.
Each week, we need to engage with examples. As a result, in addition to weekly reading, the homework is twofold. One part should be easy, the other a bit harder.
Data Exercises
Installation [a similar guide is on discord]
In R, everything is an object, so we need a means of assignment. For example, we can compute,
Character data: "WHATEVER" form but always quoted either ‘single’ or “double”.
factors
Logical: TRUE or FALSE
typeof to find the type
str to see type and contents
length tells us the number of items.
Lists are nominally one-dimensional data structures that can hold data of any type.
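The basics above can be sketched in a few lines of base R. This is only an illustration; the object names (x, label, flag, grp, mixed) are made up here and do not come from the slides.

```r
# Assignment uses the arrow operator; all names below are illustrative.
x <- 3 + 4                               # a computed numeric value stored in x
label <- "WHATEVER"                      # character data is always quoted
flag <- x > 5                            # logical: TRUE or FALSE
grp <- factor(c("low", "high", "low"))   # a factor stores categories

typeof(x)       # "double"
typeof(label)   # "character"
typeof(flag)    # "logical"
typeof(grp)     # "integer": factors are integer codes with labels attached
str(grp)        # type and contents together
length(grp)     # number of items: 3

# A list can hold items of different types.
mixed <- list(number = x, text = label, truth = flag)
str(mixed)
length(mixed)   # 3
```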
A data.frame is a special class of list that combines vectors of the same length that are addressable by name. Data frames are like spreadsheets. Two key descriptors: str for types and dim for dimensions.
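As a small sketch of the idea, the following builds a data.frame by hand and inspects it. The object toy and its three rows are hypothetical values chosen for illustration, not the course data.

```r
# A tiny hypothetical data.frame: equal-length vectors addressable by name.
toy <- data.frame(
  promoted = c(0L, 1L, 0L),
  sales = c(594L, 674L, 446L),
  customer_rate = c(3.94, 3.83, 4.06)
)

str(toy)       # one line per column, showing its type
dim(toy)       # rows, then columns: 3 3
toy$sales      # a column is extracted by name
is.list(toy)   # TRUE: a data.frame is a special kind of list
```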
salespeople <- read.csv("http://peopleanalytics-regression-book.org/data/salespeople.csv")
str(salespeople)
'data.frame': 351 obs. of 4 variables:
$ promoted : int 0 0 1 0 1 1 0 0 0 0 ...
$ sales : int 594 446 674 525 657 918 318 364 342 387 ...
$ customer_rate: num 3.94 4.06 3.83 3.62 4.4 4.54 3.09 4.89 3.74 3 ...
$ performance : int 2 3 4 2 3 2 3 1 3 3 ...
dim(salespeople)
[1] 351 4
salespeople
promoted sales customer_rate performance
1 0 594 3.94 2
2 0 446 4.06 3
3 1 674 3.83 4
4 0 525 3.62 2
5 1 657 4.40 3
6 1 918 4.54 2
7 0 318 3.09 3
8 0 364 4.89 1
9 0 342 3.74 3
10 0 387 3.00 3
11 0 527 2.43 3
12 1 716 3.16 3
13 0 557 3.51 2
14 0 450 3.21 3
15 0 344 3.02 2
16 0 372 3.87 3
17 0 258 2.49 1
18 0 338 2.66 4
19 0 410 3.14 2
20 1 937 5.00 2
21 1 702 3.53 4
22 0 469 4.24 2
23 0 535 4.47 2
24 0 342 3.60 1
25 1 819 4.45 2
26 1 736 3.94 4
27 0 330 2.54 2
28 0 274 4.06 1
29 0 341 4.47 2
30 1 717 2.98 2
31 0 478 3.48 2
32 0 487 3.74 1
33 0 239 2.47 4
34 1 825 3.32 3
35 0 400 3.53 2
36 1 728 2.66 3
37 1 773 4.89 3
38 0 425 3.62 1
39 1 943 4.40 4
40 0 510 2.56 3
41 0 389 3.34 4
42 0 270 2.56 2
43 1 945 4.31 4
44 0 497 3.02 3
45 0 329 2.86 3
46 0 389 2.98 4
47 0 475 3.39 3
48 0 383 2.36 2
49 1 432 2.33 3
50 1 619 1.94 3
51 1 578 4.17 4
52 0 411 3.07 4
53 0 445 3.00 3
54 0 440 3.62 2
55 0 359 3.92 1
56 0 419 3.85 3
57 1 840 5.00 4
58 0 393 4.49 1
59 1 754 3.74 3
60 0 441 4.75 2
61 1 803 4.89 3
62 0 444 4.15 2
63 1 753 5.00 4
64 1 688 4.29 2
65 0 431 4.29 4
66 0 511 3.74 2
67 0 464 2.22 3
68 0 473 3.57 2
69 0 532 3.74 1
70 0 280 3.41 2
71 0 342 3.71 2
72 0 320 2.15 3
73 0 531 3.41 4
74 0 373 2.01 2
75 0 547 4.40 1
76 1 611 4.03 4
77 1 825 4.66 2
78 0 431 3.62 3
79 0 401 3.69 2
80 0 517 4.20 3
81 1 803 4.15 3
82 0 586 5.00 1
83 0 444 3.21 4
84 1 693 3.80 3
85 1 659 4.20 1
86 0 416 3.87 3
87 0 423 2.75 3
88 1 756 3.55 4
89 0 245 2.52 2
90 0 419 3.76 2
91 1 757 3.11 3
92 1 617 4.33 1
93 1 909 3.21 3
94 0 516 2.47 1
95 0 317 1.51 1
96 0 425 3.53 3
97 0 528 4.63 2
98 0 416 3.37 1
99 1 645 4.08 2
100 0 390 3.16 4
101 0 393 3.76 1
102 0 394 3.07 2
103 0 387 3.87 3
104 0 450 3.62 3
105 0 487 3.46 3
106 1 607 2.49 4
107 0 369 2.22 1
108 0 489 4.98 2
109 0 324 3.05 3
110 0 417 4.47 1
111 1 694 1.90 2
112 1 651 5.00 4
113 0 395 3.46 2
114 0 442 2.29 1
115 0 422 4.54 3
116 0 404 4.06 3
117 0 381 3.37 4
118 0 501 4.77 4
119 1 944 5.00 2
120 1 753 4.43 3
121 0 591 4.93 4
122 1 735 4.03 4
123 1 538 3.05 3
124 0 451 4.49 2
125 0 477 3.87 3
126 0 436 4.13 2
127 1 738 3.05 3
128 1 902 5.00 4
129 0 464 3.90 1
130 1 944 3.92 4
131 0 285 3.53 3
132 0 453 4.68 2
133 0 382 3.51 2
134 0 414 2.03 2
135 0 335 3.71 3
136 1 935 5.00 3
137 0 203 2.72 2
138 0 348 5.00 3
139 1 800 4.24 2
140 0 436 3.51 3
141 0 360 3.23 1
142 1 674 4.47 3
143 0 425 2.43 3
144 1 901 2.70 3
145 0 453 4.98 2
146 0 350 3.00 3
147 0 362 2.89 2
148 0 486 3.41 1
149 0 471 4.38 2
150 0 459 5.00 3
151 0 506 5.00 3
152 0 262 2.70 2
153 1 825 4.95 3
154 0 291 2.54 2
155 1 464 2.70 3
156 1 802 3.78 2
157 1 818 4.24 3
158 1 736 3.78 3
159 0 364 4.01 3
160 0 308 4.82 1
161 1 862 4.17 4
162 0 349 1.67 4
163 0 375 3.05 2
164 0 423 2.54 3
165 1 938 3.69 3
166 0 456 2.91 1
167 0 517 5.00 2
168 0 373 2.93 1
169 1 898 2.26 4
170 1 777 4.86 3
171 0 470 4.84 3
172 0 545 3.94 4
173 1 699 2.66 4
174 1 697 4.06 3
175 0 300 1.94 2
176 1 677 4.63 3
177 0 497 3.14 1
178 1 669 4.56 4
179 1 596 4.98 2
180 0 492 4.24 3
181 0 346 2.20 2
182 1 590 4.17 2
183 0 592 2.20 3
184 1 780 4.15 4
185 0 432 4.15 2
186 0 418 4.01 2
187 1 662 4.56 4
188 1 678 4.49 3
189 1 716 3.44 3
190 0 330 3.05 1
191 0 414 3.83 1
192 0 416 2.79 2
193 0 403 2.75 1
194 0 362 2.03 3
195 0 284 4.20 3
196 0 363 4.72 1
197 1 655 3.39 3
198 0 597 4.08 3
199 1 794 3.83 3
200 1 818 2.70 1
201 0 409 3.44 1
202 1 681 3.97 1
203 1 606 1.83 3
204 0 489 4.47 2
205 0 475 4.56 3
206 0 590 4.43 3
207 0 396 4.86 2
208 0 420 5.00 2
209 1 857 3.85 2
210 0 371 2.77 2
211 0 421 3.39 3
212 1 828 1.37 4
213 0 594 3.05 1
214 0 533 4.86 2
215 0 462 2.98 2
216 0 392 3.85 3
217 0 475 3.83 3
218 1 752 4.89 2
219 1 659 1.97 2
220 1 650 3.14 2
221 0 496 4.31 3
222 0 211 2.52 1
223 1 898 3.51 3
224 0 388 2.54 1
225 0 383 2.47 2
226 0 455 2.36 3
227 0 319 3.21 4
228 1 756 3.09 3
229 0 377 2.08 3
230 1 940 2.82 3
231 1 757 3.55 3
232 0 469 3.85 3
233 0 394 3.57 1
234 0 484 2.86 2
235 0 491 3.44 4
236 0 547 5.00 2
237 0 519 3.34 4
238 1 739 3.99 3
239 0 479 4.06 2
240 1 943 3.21 4
241 1 742 4.17 2
242 0 357 2.72 1
243 0 432 3.80 3
244 0 584 3.78 2
245 1 595 3.74 2
246 0 401 2.86 3
247 0 460 4.45 2
248 1 753 4.89 2
249 0 466 5.00 2
250 0 362 2.26 2
251 0 361 2.66 2
252 0 338 4.03 3
253 1 882 2.63 3
254 0 293 3.51 2
255 1 922 4.15 1
256 1 793 4.08 2
257 1 787 2.56 3
258 0 400 3.34 2
259 0 516 5.00 4
260 0 295 3.87 2
261 0 307 1.00 1
262 0 151 2.31 2
263 0 441 3.34 2
264 0 406 3.25 1
265 0 270 4.10 2
266 1 680 3.09 4
267 1 662 4.77 2
268 0 347 3.62 3
269 0 453 4.86 1
270 0 309 3.00 1
271 0 592 4.79 2
272 0 540 3.41 4
273 1 886 4.68 3
274 0 420 5.00 4
275 1 718 4.03 4
276 0 284 3.69 2
277 0 323 1.85 3
278 0 513 4.20 3
279 1 841 5.00 4
280 0 362 2.38 1
281 1 842 3.99 3
282 0 321 3.25 1
283 0 516 2.89 3
284 0 428 3.28 4
285 0 383 2.98 3
286 1 521 3.23 1
287 0 358 3.09 2
288 0 489 3.41 3
289 0 252 1.69 2
290 1 720 3.76 3
291 1 610 2.75 4
292 1 871 5.00 2
293 0 594 4.75 3
294 0 522 4.59 2
295 0 379 1.83 3
296 0 454 4.29 2
297 0 450 3.69 2
298 0 317 2.66 2
299 1 835 3.90 1
300 0 297 2.61 4
301 0 516 3.90 3
302 0 355 3.41 2
303 1 858 3.67 3
304 0 305 1.99 3
305 0 410 1.37 3
306 1 707 2.38 1
307 1 798 4.72 3
308 0 265 3.48 2
309 1 576 3.60 3
310 0 448 3.18 1
311 0 590 4.77 3
312 0 456 4.03 3
313 1 930 4.22 4
314 0 412 4.10 2
315 0 286 3.64 1
316 0 440 2.29 1
317 0 546 3.55 1
318 0 385 2.66 3
319 0 544 3.48 1
320 0 505 2.89 1
321 1 732 3.57 2
322 0 506 4.36 3
323 0 394 2.79 4
324 1 674 3.60 2
325 0 458 3.39 4
326 0 251 3.32 2
327 0 429 3.41 1
328 0 348 3.69 3
329 1 789 3.71 3
330 1 795 4.31 1
331 0 509 4.61 3
332 1 754 4.33 4
333 0 580 4.70 1
334 0 289 3.57 3
335 0 390 2.01 3
336 1 787 3.14 1
337 0 241 3.05 2
338 0 522 4.72 2
339 0 412 5.00 2
340 0 359 5.00 2
341 0 489 4.86 3
342 1 940 5.00 4
343 0 592 4.38 4
344 1 796 5.00 3
345 1 653 5.00 3
346 0 459 2.82 3
347 0 586 3.41 2
348 0 401 1.60 3
349 0 500 4.17 2
350 0 373 2.54 1
351 0 NA NA NA
head() gives the first few lines.
  promoted sales customer_rate performance
1 0 594 3.94 2
2 0 446 4.06 3
3 1 674 3.83 4
4 0 525 3.62 2
5 1 657 4.40 3
tail() gives the last few lines.
summary()
The default summary is often fine.
promoted sales customer_rate performance
Min. :0.0000 Min. :151.0 Min. :1.000 Min. :1.0
1st Qu.:0.0000 1st Qu.:389.2 1st Qu.:3.000 1st Qu.:2.0
Median :0.0000 Median :475.0 Median :3.620 Median :3.0
Mean :0.3219 Mean :527.0 Mean :3.608 Mean :2.5
3rd Qu.:1.0000 3rd Qu.:667.2 3rd Qu.:4.290 3rd Qu.:3.0
Max. :1.0000 Max. :945.0 Max. :5.000 Max. :4.0
NA's :1 NA's :1 NA's :1
skimr::skim()
I prefer skim from skimr.
Name | salespeople |
Number of rows | 351 |
Number of columns | 4 |
_______________________ | |
Column type frequency: | |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
promoted | 0 | 1 | 0.32 | 0.47 | 0 | 0.00 | 0.00 | 1.00 | 1 | ▇▁▁▁▃ |
sales | 1 | 1 | 527.01 | 185.22 | 151 | 389.25 | 475.00 | 667.25 | 945 | ▂▇▅▃▃ |
customer_rate | 1 | 1 | 3.61 | 0.89 | 1 | 3.00 | 3.62 | 4.29 | 5 | ▁▃▆▇▆ |
performance | 1 | 1 | 2.50 | 0.95 | 1 | 2.00 | 3.00 | 3.00 | 4 | ▃▇▁▇▃ |
NA values are troublesome for calculations because one cannot perform math operations on non-numbers.
Easiest solution: listwise delete
promoted sales customer_rate performance
Min. :0.0000 Min. :151.0 Min. :1.000 Min. :1.0
1st Qu.:0.0000 1st Qu.:389.2 1st Qu.:3.000 1st Qu.:2.0
Median :0.0000 Median :475.0 Median :3.620 Median :3.0
Mean :0.3229 Mean :527.0 Mean :3.608 Mean :2.5
3rd Qu.:1.0000 3rd Qu.:667.2 3rd Qu.:4.290 3rd Qu.:3.0
Max. :1.0000 Max. :945.0 Max. :5.000 Max. :4.0
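A minimal base R sketch of listwise deletion follows. It assumes a small made-up data.frame (toy) with one incomplete row; na.omit() and complete.cases() are standard base R functions, but whether the slides used exactly these calls here is an assumption.

```r
# Hypothetical data with one row containing missing values.
toy <- data.frame(
  promoted = c(0L, 1L, 0L, 0L),
  sales = c(594L, 674L, 446L, NA),
  customer_rate = c(3.94, 3.83, 4.06, NA)
)

mean(toy$sales)                  # NA: the missing value propagates
mean(toy$sales, na.rm = TRUE)    # drops the NA for this one calculation

# Listwise deletion: drop every row that has any NA.
complete <- na.omit(toy)
nrow(complete)                   # 3

# The same rows selected with a logical index.
same <- toy[complete.cases(toy), ]
nrow(same)                       # 3
```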
I will use the tidyverse extensively because I think the workflow and code are far easier to both read and understand.
# There was an error in the original slides that owes to disambiguation.
# The tidyverse version of filter in library(dplyr) works differently
# than the default filter in library(stats) that is loaded at startup.
# Examine ?stats::filter vs. ?dplyr::filter
# R will use stats::filter unless the tidyverse has `masked` it or the
# version from dplyr is called explicitly.
library(dplyr) # provides the %>% pipe and dplyr::filter
salespeople %>% dplyr::filter(complete.cases(.)) %>% summary
promoted sales customer_rate performance
Min. :0.0000 Min. :151.0 Min. :1.000 Min. :1.0
1st Qu.:0.0000 1st Qu.:389.2 1st Qu.:3.000 1st Qu.:2.0
Median :0.0000 Median :475.0 Median :3.620 Median :3.0
Mean :0.3229 Mean :527.0 Mean :3.608 Mean :2.5
3rd Qu.:1.0000 3rd Qu.:667.2 3rd Qu.:4.290 3rd Qu.:3.0
Max. :1.0000 Max. :945.0 Max. :5.000 Max. :4.0
The filter step gives us the data.frame with rows that have “NA” values omitted.
There is an R package called learnr that has embedded tutorials for practice. These can be very helpful for the basics of tidy data wrangling, in particular filter and mutate.
Base R plots are ugly. The state of the art is ggplot2, which is part of the tidyverse. To get started with those, I would suggest esquisse, an R package built to use R’s internal shiny to drag and drop plots. For example,
Section 2.8 of HRMPA contains a bit on RMarkdown. This summer RStudio released a more general version of RMarkdown called Quarto that is quite flexible and, in some ways, more general purpose. These slides were produced using reveal.js and quarto. The beauty is that everything is reproducible and that it seamlessly handles code, output, and fancy text, math, and a host of stuff. If you want to see the code for these slides, it is available here.
mean() and mean(., na.rm=TRUE)
\overline{x} = \frac{1}{N}\sum_{i=1}^{N} x_{i}
var() and var(., na.rm=TRUE)
sd() and sd(., na.rm=TRUE)
We use standard deviation to avoid the squared metric.
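The following sketch computes the mean, variance, and standard deviation by hand and checks them against the built-ins. The vector x is made up for illustration. Note that R's var() and sd() use the sample (N-1) divisor.

```r
x <- c(2, 4, 4, 6, NA)    # illustrative data with one missing value

mean(x)                   # NA unless the missing value is removed
xc <- x[!is.na(x)]        # what na.rm = TRUE does internally, in effect
N <- length(xc)

m <- sum(xc) / N                    # mean: 4
v <- sum((xc - m)^2) / (N - 1)      # sample variance: 8/3
s <- sqrt(v)                        # standard deviation, back in the units of x

# Each comparison returns TRUE.
all.equal(m, mean(x, na.rm = TRUE))
all.equal(v, var(x, na.rm = TRUE))
all.equal(s, sd(x, na.rm = TRUE))
```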
Covariance is the shared variance of x and y.
cov_{s}(x,y) = \frac{1}{N-1}\sum_{i=1}^{N} (x_{i}-\overline{x})(y_{i}-\overline{y})
There is also population covariance that divides by N.
Correlation is a bounded measure between -1 and 1.
r_{s}(x,y) = \frac{cov_{s}(x,y)}{\sigma_{s}(x)\sigma_{s}(y)}
In a two variable regression, r_{x,y}^2 is the proportion of the variance in y accounted for by x.
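As a quick check of these definitions, the sketch below computes the sample covariance and correlation from the formulas and compares them with cov(), cor(), and the R-squared from lm(). The x and y values are invented for illustration.

```r
x <- c(1, 2, 3, 4, 5)                  # illustrative values
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
N <- length(x)

# Sample covariance: cross-products of deviations divided by N - 1.
cov_s <- sum((x - mean(x)) * (y - mean(y))) / (N - 1)

# Correlation: covariance scaled by both sample standard deviations.
r_s <- cov_s / (sd(x) * sd(y))

all.equal(cov_s, cov(x, y))            # TRUE
all.equal(r_s, cor(x, y))              # TRUE

# In a two variable regression, r squared matches the model R-squared.
fit <- lm(y ~ x)
all.equal(r_s^2, summary(fit)$r.squared)   # TRUE
```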
What is a random variable? At what level?
A little reading, a refresher, for the case of a proportion is here as a Tufte Handout. You can also see an extended example on the Week 1 entry on the website.
Posit some true value and assess the likelihood with some a priori level of probability. The duality of this is the topic of the two linked handouts.
- For cases beyond a single mean, the trick is calculating the appropriate standard error and/or the appropriate degrees of freedom.
If we wish to know if the mean of two groups is:
The t distribution is also appropriate for this task, as demonstrated by Welch.
This is implemented as t.test(x, y) or, for tidy data, t.test(var~group, data=data)
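A small sketch of both calling styles, using simulated data. The group sizes, means, seed, and the names dat, var, and group are assumptions made for illustration only. By default t.test() does not assume equal variances, which is the Welch version.

```r
set.seed(1)                       # simulated data, for illustration only
x <- rnorm(30, mean = 0)
y <- rnorm(40, mean = 0.5)

# Two-vector interface.
two_vec <- t.test(x, y)

# Formula interface on tidy data: one value column, one group column.
dat <- data.frame(
  var = c(x, y),
  group = rep(c("a", "b"), times = c(30, 40))
)
formula_ver <- t.test(var ~ group, data = dat)

two_vec$method                    # "Welch Two Sample t-test"
two_vec$parameter                 # Welch degrees of freedom, usually not an integer
all.equal(unname(two_vec$statistic), unname(formula_ver$statistic))   # TRUE
```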
A modification of the correlation coefficient has also been shown to have a t-distribution with N-2 degrees of freedom. It is often known as t^{*}:
t^{*} = \frac{r\sqrt{N-2}}{\sqrt{1-r^{2}}}
This is automatically implemented as cor.test(x, y).
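To see that the formula and cor.test() agree, the sketch below computes the statistic by hand on simulated data. The seed, sample size, and slope are arbitrary choices for illustration.

```r
set.seed(2)                       # simulated data, for illustration only
x <- rnorm(25)
y <- 0.5 * x + rnorm(25)
N <- length(x)

r <- cor(x, y)
t_star <- r * sqrt(N - 2) / sqrt(1 - r^2)    # the statistic from the formula

ct <- cor.test(x, y)
all.equal(t_star, unname(ct$statistic))      # TRUE
ct$parameter                                 # degrees of freedom: N - 2 = 23

# Two-sided p-value from the t distribution with N - 2 degrees of freedom.
p_manual <- 2 * pt(abs(t_star), df = N - 2, lower.tail = FALSE)
all.equal(p_manual, ct$p.value)              # TRUE
```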
We can also examine the independence among rows and columns of a table. Section 3.3.3 contains this example. The comparison relies on the difference between observed counts and expected counts. The expected counts require only the marginal probability of each value along the rows and columns and the total number of items: under independence, the expected count in a cell is N times its row probability times its column probability, where the row probabilities and the column probabilities each sum to one.
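The expected-count logic can be sketched directly. The 2 by 2 table obs below is hypothetical, and the continuity correction is switched off so that the hand calculation matches chisq.test() exactly.

```r
# Hypothetical observed counts (filled column by column).
obs <- matrix(c(20, 30, 40, 10), nrow = 2,
              dimnames = list(row = c("a", "b"), col = c("c", "d")))

N <- sum(obs)
p_row <- rowSums(obs) / N          # marginal row probabilities, sum to one
p_col <- colSums(obs) / N          # marginal column probabilities, sum to one

# Under independence: expected count = N * row probability * column probability.
expected <- N * outer(p_row, p_col)

# Pearson statistic: squared gaps between observed and expected, scaled.
chi_sq <- sum((obs - expected)^2 / expected)

test <- chisq.test(obs, correct = FALSE)
all.equal(expected, test$expected, check.attributes = FALSE)   # TRUE
all.equal(chi_sq, unname(test$statistic))                      # TRUE
```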
Two-by-two tables are a special case of the above. The Tufte Handout cited earlier goes through the example of a single proportion [of a binary variable] to show that, as long as the expected count in each cell is greater than about 5 or 10, we can use a normal approximation to assess a difference in a two-by-two table.
This is implemented in prop.test(table) after creating the table using table. table will require us to learn a new pipe; %$%.
Let me use an internal dataset to illustrate.
library(magrittr) # for the new pipe
data("Titanic") # Some data
Tidy.Titanic <- DescTools::Untable(Titanic) # unTable the data
Tidy.Titanic %$% table(Sex, Survived)
Survived
Sex No Yes
Male 1364 367
Female 126 344
2-sample test for equality of proportions with continuity correction
data: .
X-squared = 454.5, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
0.4741109 0.5656866
sample estimates:
prop 1 prop 2
0.7879838 0.2680851
Pearson's Chi-squared test with Yates' continuity correction
data: Sex and Survived
X-squared = 454.5, df = 1, p-value < 2.2e-16
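The two test outputs above can be reproduced in base R alone. This sketch swaps in margin.table() to collapse the built-in Titanic array to Sex by Survived, rather than the DescTools::Untable and %$% route used in the slides, so it is an alternative path to the same table and not the original code.

```r
data("Titanic")                        # built-in 4-way table of counts

# Collapse to Sex (dimension 2) by Survived (dimension 4).
tab <- margin.table(Titanic, c(2, 4))
tab

# prop.test treats the first column (No) as the "success" count, so the
# estimated proportions are the share not surviving within each sex.
pt_out <- prop.test(tab)
pt_out$estimate

# The chi-squared test with Yates' correction gives the same statistic.
cs_out <- chisq.test(tab)
all.equal(unname(pt_out$statistic), unname(cs_out$statistic))   # TRUE
```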
Models of Choice and Forecasting