Choice and Forecasting: Week 1

Robert W. Walker

An Overview of Forecasting and Choice

Choice and Forecasting

This course expands your familiarity with common models for two distinct types of data. Both sets of models extend the basic linear regression tools that you encountered in Data Analysis, Modelling, and Decision Making. We will cover them in two sections.

  • Non-continuous data
    [term of art: limited dependent variables]

  • Time series data

Overview

Class Structure

Assignments of two types:

  • Weekly homework implementing the model/topic
  • Two summary unit presentations
    - Choice modelling of some chosen data
    - Time series forecasting with data of your choice

Choice

The Text on Choice

A primary text for this section is Keith McNulty’s excellent Handbook of Regression Modelling in People Analytics: With Examples in R, Python, and Julia.

Topics

  • Weeks 1 and 2: Review of Linear Models and Inferential Statistics [chapters 1-4]
  • Week 3: Binomial Logistic Regression [chapter 5]
  • Week 4: Ordered and Multinomial Logistic Regression [chapters 6 and 7]
  • Week 5: Hierarchical Data [chapter 8]
  • Week 6: Survival Analysis [chapter 9]
  • Week 7: Power Analysis: How much data do I need? and Review [chapter 10]

Forecasting

The Text on Forecasting

A primary text for forecasting is Rob Hyndman and George Athanasopoulos’s excellent Forecasting: Principles and Practice (3rd ed). The book combines forecasting principles with practical examples in R.

Topics

  • Weeks 8 and 9: The Basics, Time as a Variable, and Decomposition [chapters 1-5]
  • Week 10: Judgemental Forecast and Regression [chapters 6 and 7]
  • Week 11: Exponential Smoothing and ARIMA [chapters 8 and 9]
  • Week 12: Dynamic Regression [chapter 10]
  • Week 13: Hierarchies, advanced forecasting and related issues [chapters 11-13]
  • Week 14: Presentations on a people analytics problem and a time series forecast [See footer]

On the Projects/Presentations

An original modelling project is the expectation/deliverable.

  • Pose an interesting question
  • Find some data that can inform an answer.
  • Present:
    • a motivation,
    • the data,
    • the question,
    • the models, and
    • the answer, concluding with
    • some directions to take it further.

On Homework

Each week, we need to engage with examples. As a result, in addition to weekly reading, the homework is twofold. One part should be easy, the other a bit harder.

  • The Easy Part
    • Replicate the computing in the book chapter.
    • That gives us working examples to start from.
  • The Harder Part
    • Each chapter concludes with Data Exercises
    • These push you to apply the concepts and models in new settings.

Questions?

Introduction, R, Statistics, and Regression

The Importance of People Analytics

  • What are models?
  • Why models?
  • A theory of inference via models
    • Correlation is not causation.
  • A bit on samples, populations, and representation

An inferential process

  • Define the outcome and inputs with a question in mind.
  • Confirm the outcome is reliably measured.
  • Find measures of inputs.
  • Find a sample of outputs and inputs.
  • Explore data to construct models.
  • Render data appropriate for model(s).
  • Estimate models.
  • Interpret and evaluate models.
  • Select an optimal model.
  • Articulate the generalizable inferences owing to sufficient information.

On R and the Basics of R

Installation [a similar guide is on discord]

Chapter 2.2

Key Ideas

  1. Object-orientation: Almost everything in R is an object. So we need a means of assignment. For example, we can compute,
3+4
[1] 7

but if we want that for later use, it must be assigned, with <-, -> or =.

my_sum_la <- 3 + 4
3 + 4 -> my_sum_ra
my_sum_eq = 3 + 4

The Environment Tab

Objects on Objects

my_sum_la + 5
[1] 12
my_new_sum <- my_sum_la + 5
my_new_sum
[1] 12

Data Types

  1. Numeric
    • double
    • integer [followed by L]
  2. Character
    • always quoted, with either ‘single’ or “double” quotes, as in "WHATEVER".
    • factors are related: categorical data stored as integer codes with character labels
  3. Logical takes either TRUE or FALSE
  4. Dates [more on this later]
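A quick way to check these types at the console is typeof(), with class() for factors and dates. The sketch below is illustrative only; the object names and the date are made up for the example.

```r
# Each value below is a length-one (or short) vector; typeof() reports how R stores it.
x_double  <- 3.5
x_integer <- 3L                       # the trailing L requests an integer
x_char    <- "WHATEVER"
x_logical <- TRUE
x_factor  <- factor(c("low", "high", "low"))
x_date    <- as.Date("2022-08-29")    # hypothetical date

typeof(x_double)    # "double"
typeof(x_integer)   # "integer"
typeof(x_char)      # "character"
typeof(x_logical)   # "logical"
typeof(x_factor)    # "integer": factors store integer codes plus labels
class(x_factor)     # "factor"
levels(x_factor)    # "high" "low" (sorted alphabetically by default)
typeof(x_date)      # "double": dates are stored as days since 1970-01-01
class(x_date)       # "Date"
```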

Homogeneous Data Structures

  1. Vectors
    • One dimensional data structures of the same type.
      • typeof to find the type
      • str to see type and contents
      • length tells us the number of items.
  2. Matrices
    • Two dimensional data structures of the same type defined by rows and columns
(m <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2))
     [,1] [,2]
[1,]    1    3
[2,]    2    4
m[2,2]
[1] 4
  3. Arrays
    • 3 or more-dimensional data structures of the same type.
( arr <- array(data=c(1:16), dim=c(2,2,2,2)) )
, , 1, 1

     [,1] [,2]
[1,]    1    3
[2,]    2    4

, , 2, 1

     [,1] [,2]
[1,]    5    7
[2,]    6    8

, , 1, 2

     [,1] [,2]
[1,]    9   11
[2,]   10   12

, , 2, 2

     [,1] [,2]
[1,]   13   15
[2,]   14   16
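The matrix and array have examples above; vectors do not, so here is a small sketch of typeof, str, and length on a vector. The values are made up. It also shows what "of the same type" means in practice: mixing types in c() silently coerces everything to one common type.

```r
v <- c(2.5, 4, 7)
typeof(v)    # "double"
length(v)    # 3
str(v)       # num [1:3] 2.5 4 7

# Mixing types forces coercion to a single common type
mixed <- c(1, "a", TRUE)
typeof(mixed)  # "character"
mixed          # "1" "a" "TRUE"
```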

Heterogeneous Data Structures:

Lists

Lists are nominally one-dimensional data structures that can hold data of any type.

( new_list <- list(
  scalar = 6, 
  vector = c("Hello", "Goodbye"), 
  matrix = matrix(1:4, nrow = 2, ncol = 2)
) )
$scalar
[1] 6

$vector
[1] "Hello"   "Goodbye"

$matrix
     [,1] [,2]
[1,]    1    3
[2,]    2    4

Heterogeneous Data Structures:

Data Frames

A data.frame is a special class of list that combines vectors of the same length that are addressable by name. Data frames are like spreadsheets: each column is a variable and each row is an observation. Two key descriptors: str for types and dim for dimensions.

salespeople <- read.csv("http://peopleanalytics-regression-book.org/data/salespeople.csv")
str(salespeople)
'data.frame':   351 obs. of  4 variables:
 $ promoted     : int  0 0 1 0 1 1 0 0 0 0 ...
 $ sales        : int  594 446 674 525 657 918 318 364 342 387 ...
 $ customer_rate: num  3.94 4.06 3.83 3.62 4.4 4.54 3.09 4.89 3.74 3 ...
 $ performance  : int  2 3 4 2 3 2 3 1 3 3 ...
dim(salespeople)
[1] 351   4
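To see the "list of equal-length vectors" idea without reading a file, here is a minimal sketch that builds a data.frame by hand. The toy values and column names are hypothetical and are not taken from the salespeople file.

```r
# Three vectors of the same length, each with its own type
toy <- data.frame(
  name     = c("Ana", "Ben", "Cy"),
  sales    = c(594L, 446L, 674L),
  promoted = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE   # keep name as character in any R version
)

is.list(toy)          # TRUE: a data.frame is a list underneath
dim(toy)              # 3 3
sapply(toy, typeof)   # "character" "integer" "logical"
str(toy)              # types and the first few values of each column
```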

salespeople

salespeople
    promoted sales customer_rate performance
1          0   594          3.94           2
2          0   446          4.06           3
3          1   674          3.83           4
4          0   525          3.62           2
5          1   657          4.40           3
6          1   918          4.54           2
7          0   318          3.09           3
8          0   364          4.89           1
9          0   342          3.74           3
10         0   387          3.00           3
11         0   527          2.43           3
12         1   716          3.16           3
13         0   557          3.51           2
14         0   450          3.21           3
15         0   344          3.02           2
16         0   372          3.87           3
17         0   258          2.49           1
18         0   338          2.66           4
19         0   410          3.14           2
20         1   937          5.00           2
21         1   702          3.53           4
22         0   469          4.24           2
23         0   535          4.47           2
24         0   342          3.60           1
25         1   819          4.45           2
26         1   736          3.94           4
27         0   330          2.54           2
28         0   274          4.06           1
29         0   341          4.47           2
30         1   717          2.98           2
31         0   478          3.48           2
32         0   487          3.74           1
33         0   239          2.47           4
34         1   825          3.32           3
35         0   400          3.53           2
36         1   728          2.66           3
37         1   773          4.89           3
38         0   425          3.62           1
39         1   943          4.40           4
40         0   510          2.56           3
41         0   389          3.34           4
42         0   270          2.56           2
43         1   945          4.31           4
44         0   497          3.02           3
45         0   329          2.86           3
46         0   389          2.98           4
47         0   475          3.39           3
48         0   383          2.36           2
49         1   432          2.33           3
50         1   619          1.94           3
51         1   578          4.17           4
52         0   411          3.07           4
53         0   445          3.00           3
54         0   440          3.62           2
55         0   359          3.92           1
56         0   419          3.85           3
57         1   840          5.00           4
58         0   393          4.49           1
59         1   754          3.74           3
60         0   441          4.75           2
61         1   803          4.89           3
62         0   444          4.15           2
63         1   753          5.00           4
64         1   688          4.29           2
65         0   431          4.29           4
66         0   511          3.74           2
67         0   464          2.22           3
68         0   473          3.57           2
69         0   532          3.74           1
70         0   280          3.41           2
71         0   342          3.71           2
72         0   320          2.15           3
73         0   531          3.41           4
74         0   373          2.01           2
75         0   547          4.40           1
76         1   611          4.03           4
77         1   825          4.66           2
78         0   431          3.62           3
79         0   401          3.69           2
80         0   517          4.20           3
81         1   803          4.15           3
82         0   586          5.00           1
83         0   444          3.21           4
84         1   693          3.80           3
85         1   659          4.20           1
86         0   416          3.87           3
87         0   423          2.75           3
88         1   756          3.55           4
89         0   245          2.52           2
90         0   419          3.76           2
91         1   757          3.11           3
92         1   617          4.33           1
93         1   909          3.21           3
94         0   516          2.47           1
95         0   317          1.51           1
96         0   425          3.53           3
97         0   528          4.63           2
98         0   416          3.37           1
99         1   645          4.08           2
100        0   390          3.16           4
101        0   393          3.76           1
102        0   394          3.07           2
103        0   387          3.87           3
104        0   450          3.62           3
105        0   487          3.46           3
106        1   607          2.49           4
107        0   369          2.22           1
108        0   489          4.98           2
109        0   324          3.05           3
110        0   417          4.47           1
111        1   694          1.90           2
112        1   651          5.00           4
113        0   395          3.46           2
114        0   442          2.29           1
115        0   422          4.54           3
116        0   404          4.06           3
117        0   381          3.37           4
118        0   501          4.77           4
119        1   944          5.00           2
120        1   753          4.43           3
121        0   591          4.93           4
122        1   735          4.03           4
123        1   538          3.05           3
124        0   451          4.49           2
125        0   477          3.87           3
126        0   436          4.13           2
127        1   738          3.05           3
128        1   902          5.00           4
129        0   464          3.90           1
130        1   944          3.92           4
131        0   285          3.53           3
132        0   453          4.68           2
133        0   382          3.51           2
134        0   414          2.03           2
135        0   335          3.71           3
136        1   935          5.00           3
137        0   203          2.72           2
138        0   348          5.00           3
139        1   800          4.24           2
140        0   436          3.51           3
141        0   360          3.23           1
142        1   674          4.47           3
143        0   425          2.43           3
144        1   901          2.70           3
145        0   453          4.98           2
146        0   350          3.00           3
147        0   362          2.89           2
148        0   486          3.41           1
149        0   471          4.38           2
150        0   459          5.00           3
151        0   506          5.00           3
152        0   262          2.70           2
153        1   825          4.95           3
154        0   291          2.54           2
155        1   464          2.70           3
156        1   802          3.78           2
157        1   818          4.24           3
158        1   736          3.78           3
159        0   364          4.01           3
160        0   308          4.82           1
161        1   862          4.17           4
162        0   349          1.67           4
163        0   375          3.05           2
164        0   423          2.54           3
165        1   938          3.69           3
166        0   456          2.91           1
167        0   517          5.00           2
168        0   373          2.93           1
169        1   898          2.26           4
170        1   777          4.86           3
171        0   470          4.84           3
172        0   545          3.94           4
173        1   699          2.66           4
174        1   697          4.06           3
175        0   300          1.94           2
176        1   677          4.63           3
177        0   497          3.14           1
178        1   669          4.56           4
179        1   596          4.98           2
180        0   492          4.24           3
181        0   346          2.20           2
182        1   590          4.17           2
183        0   592          2.20           3
184        1   780          4.15           4
185        0   432          4.15           2
186        0   418          4.01           2
187        1   662          4.56           4
188        1   678          4.49           3
189        1   716          3.44           3
190        0   330          3.05           1
191        0   414          3.83           1
192        0   416          2.79           2
193        0   403          2.75           1
194        0   362          2.03           3
195        0   284          4.20           3
196        0   363          4.72           1
197        1   655          3.39           3
198        0   597          4.08           3
199        1   794          3.83           3
200        1   818          2.70           1
201        0   409          3.44           1
202        1   681          3.97           1
203        1   606          1.83           3
204        0   489          4.47           2
205        0   475          4.56           3
206        0   590          4.43           3
207        0   396          4.86           2
208        0   420          5.00           2
209        1   857          3.85           2
210        0   371          2.77           2
211        0   421          3.39           3
212        1   828          1.37           4
213        0   594          3.05           1
214        0   533          4.86           2
215        0   462          2.98           2
216        0   392          3.85           3
217        0   475          3.83           3
218        1   752          4.89           2
219        1   659          1.97           2
220        1   650          3.14           2
221        0   496          4.31           3
222        0   211          2.52           1
223        1   898          3.51           3
224        0   388          2.54           1
225        0   383          2.47           2
226        0   455          2.36           3
227        0   319          3.21           4
228        1   756          3.09           3
229        0   377          2.08           3
230        1   940          2.82           3
231        1   757          3.55           3
232        0   469          3.85           3
233        0   394          3.57           1
234        0   484          2.86           2
235        0   491          3.44           4
236        0   547          5.00           2
237        0   519          3.34           4
238        1   739          3.99           3
239        0   479          4.06           2
240        1   943          3.21           4
241        1   742          4.17           2
242        0   357          2.72           1
243        0   432          3.80           3
244        0   584          3.78           2
245        1   595          3.74           2
246        0   401          2.86           3
247        0   460          4.45           2
248        1   753          4.89           2
249        0   466          5.00           2
250        0   362          2.26           2
251        0   361          2.66           2
252        0   338          4.03           3
253        1   882          2.63           3
254        0   293          3.51           2
255        1   922          4.15           1
256        1   793          4.08           2
257        1   787          2.56           3
258        0   400          3.34           2
259        0   516          5.00           4
260        0   295          3.87           2
261        0   307          1.00           1
262        0   151          2.31           2
263        0   441          3.34           2
264        0   406          3.25           1
265        0   270          4.10           2
266        1   680          3.09           4
267        1   662          4.77           2
268        0   347          3.62           3
269        0   453          4.86           1
270        0   309          3.00           1
271        0   592          4.79           2
272        0   540          3.41           4
273        1   886          4.68           3
274        0   420          5.00           4
275        1   718          4.03           4
276        0   284          3.69           2
277        0   323          1.85           3
278        0   513          4.20           3
279        1   841          5.00           4
280        0   362          2.38           1
281        1   842          3.99           3
282        0   321          3.25           1
283        0   516          2.89           3
284        0   428          3.28           4
285        0   383          2.98           3
286        1   521          3.23           1
287        0   358          3.09           2
288        0   489          3.41           3
289        0   252          1.69           2
290        1   720          3.76           3
291        1   610          2.75           4
292        1   871          5.00           2
293        0   594          4.75           3
294        0   522          4.59           2
295        0   379          1.83           3
296        0   454          4.29           2
297        0   450          3.69           2
298        0   317          2.66           2
299        1   835          3.90           1
300        0   297          2.61           4
301        0   516          3.90           3
302        0   355          3.41           2
303        1   858          3.67           3
304        0   305          1.99           3
305        0   410          1.37           3
306        1   707          2.38           1
307        1   798          4.72           3
308        0   265          3.48           2
309        1   576          3.60           3
310        0   448          3.18           1
311        0   590          4.77           3
312        0   456          4.03           3
313        1   930          4.22           4
314        0   412          4.10           2
315        0   286          3.64           1
316        0   440          2.29           1
317        0   546          3.55           1
318        0   385          2.66           3
319        0   544          3.48           1
320        0   505          2.89           1
321        1   732          3.57           2
322        0   506          4.36           3
323        0   394          2.79           4
324        1   674          3.60           2
325        0   458          3.39           4
326        0   251          3.32           2
327        0   429          3.41           1
328        0   348          3.69           3
329        1   789          3.71           3
330        1   795          4.31           1
331        0   509          4.61           3
332        1   754          4.33           4
333        0   580          4.70           1
334        0   289          3.57           3
335        0   390          2.01           3
336        1   787          3.14           1
337        0   241          3.05           2
338        0   522          4.72           2
339        0   412          5.00           2
340        0   359          5.00           2
341        0   489          4.86           3
342        1   940          5.00           4
343        0   592          4.38           4
344        1   796          5.00           3
345        1   653          5.00           3
346        0   459          2.82           3
347        0   586          3.41           2
348        0   401          1.60           3
349        0   500          4.17           2
350        0   373          2.54           1
351        0    NA            NA          NA

Working with data.frame

  • head() gives the first few lines.
head(salespeople, 5)
  promoted sales customer_rate performance
1        0   594          3.94           2
2        0   446          4.06           3
3        1   674          3.83           4
4        0   525          3.62           2
5        1   657          4.40           3
  • tail() gives the last few lines.
tail(salespeople, 6)
    promoted sales customer_rate performance
346        0   459          2.82           3
347        0   586          3.41           2
348        0   401          1.60           3
349        0   500          4.17           2
350        0   373          2.54           1
351        0    NA            NA          NA

Single columns and cells

A single column can be found in three ways:

  1. by $
salespeople$sales
  2. by brackets and quoted name
salespeople[,"sales"]
  3. by [empty] row and column number
salespeople[,2]

A particular cell is addressed by [row,column]:

salespeople[38,2]
[1] 425
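As a small check that the three column forms really are equivalent, here is a sketch on a made-up data.frame (the values are hypothetical). The last line goes one step further and puts a logical condition in the row position, which is a common next step.

```r
toy <- data.frame(
  sales         = c(594L, 446L, 674L),
  customer_rate = c(3.94, 4.06, 3.83)
)

identical(toy$sales, toy[, "sales"])  # TRUE
identical(toy$sales, toy[, 1])        # TRUE
toy[2, 1]                             # 446: row 2, column 1

# A logical vector in the row position keeps only the rows where it is TRUE
big_sellers <- toy[toy$sales > 500, ]
nrow(big_sellers)                     # 2
```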

Summaries summary()

The default summary is often fine.

summary(salespeople)
    promoted          sales       customer_rate    performance 
 Min.   :0.0000   Min.   :151.0   Min.   :1.000   Min.   :1.0  
 1st Qu.:0.0000   1st Qu.:389.2   1st Qu.:3.000   1st Qu.:2.0  
 Median :0.0000   Median :475.0   Median :3.620   Median :3.0  
 Mean   :0.3219   Mean   :527.0   Mean   :3.608   Mean   :2.5  
 3rd Qu.:1.0000   3rd Qu.:667.2   3rd Qu.:4.290   3rd Qu.:3.0  
 Max.   :1.0000   Max.   :945.0   Max.   :5.000   Max.   :4.0  
                  NA's   :1       NA's   :1       NA's   :1    

Summaries: skimr::skim()

I prefer skim from skimr

library(skimr)
skim(salespeople)
Data summary
Name salespeople
Number of rows 351
Number of columns 4
_______________________
Column type frequency:
numeric 4
________________________
Group variables None

Variable type: numeric

skim_variable  n_missing complete_rate   mean     sd  p0    p25    p50    p75 p100 hist
promoted               0             1   0.32   0.47   0   0.00   0.00   1.00    1 ▇▁▁▁▃
sales                  1             1 527.01 185.22 151 389.25 475.00 667.25  945 ▂▇▅▃▃
customer_rate          1             1   3.61   0.89   1   3.00   3.62   4.29    5 ▁▃▆▇▆
performance            1             1   2.50   0.95   1   2.00   3.00   3.00    4 ▃▇▁▇▃

Dealing with Missing Data

NA values are troublesome for calculations because any arithmetic that involves a missing value returns NA.

sum(salespeople$sales)
[1] NA
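Many summary functions take an na.rm argument that drops missing values before computing. A minimal sketch on a made-up vector (the numbers are hypothetical):

```r
x <- c(594, 446, NA, 525)

sum(x)                  # NA: the missing value propagates
sum(x, na.rm = TRUE)    # 1565
mean(x, na.rm = TRUE)   # 521.6667
sum(is.na(x))           # 1: how many values are missing
```

This handles one calculation at a time; removing the incomplete rows changes the data for everything that follows.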

Easiest solution: listwise delete

clean_salespeople <- salespeople[complete.cases(salespeople),]
summary(clean_salespeople)
    promoted          sales       customer_rate    performance 
 Min.   :0.0000   Min.   :151.0   Min.   :1.000   Min.   :1.0  
 1st Qu.:0.0000   1st Qu.:389.2   1st Qu.:3.000   1st Qu.:2.0  
 Median :0.0000   Median :475.0   Median :3.620   Median :3.0  
 Mean   :0.3229   Mean   :527.0   Mean   :3.608   Mean   :2.5  
 3rd Qu.:1.0000   3rd Qu.:667.2   3rd Qu.:4.290   3rd Qu.:3.0  
 Max.   :1.0000   Max.   :945.0   Max.   :5.000   Max.   :4.0  

I like tidy

I will use the tidyverse extensively because I think the workflow and code are far easier to both read and understand.

library(tidyverse)  # loads dplyr and the %>% pipe
salespeople %>% na.omit(.) %>% summary()
    promoted          sales       customer_rate    performance 
 Min.   :0.0000   Min.   :151.0   Min.   :1.000   Min.   :1.0  
 1st Qu.:0.0000   1st Qu.:389.2   1st Qu.:3.000   1st Qu.:2.0  
 Median :0.0000   Median :475.0   Median :3.620   Median :3.0  
 Mean   :0.3229   Mean   :527.0   Mean   :3.608   Mean   :2.5  
 3rd Qu.:1.0000   3rd Qu.:667.2   3rd Qu.:4.290   3rd Qu.:3.0  
 Max.   :1.0000   Max.   :945.0   Max.   :5.000   Max.   :4.0  
# There was an error in the original slides that owes to disambiguation.
# The tidyverse version of filter in library(dplyr) works differently
# than the default filter in library(stats) that is loaded at startup.
# Examine ?stats::filter vs. ?dplyr::filter.
# R will use stats::filter unless the tidyverse has `masked` it or the
# version from dplyr is called explicitly.
salespeople %>% dplyr::filter(complete.cases(.)) %>% summary()
    promoted          sales       customer_rate    performance 
 Min.   :0.0000   Min.   :151.0   Min.   :1.000   Min.   :1.0  
 1st Qu.:0.0000   1st Qu.:389.2   1st Qu.:3.000   1st Qu.:2.0  
 Median :0.0000   Median :475.0   Median :3.620   Median :3.0  
 Mean   :0.3229   Mean   :527.0   Mean   :3.608   Mean   :2.5  
 3rd Qu.:1.0000   3rd Qu.:667.2   3rd Qu.:4.290   3rd Qu.:3.0  
 Max.   :1.0000   Max.   :945.0   Max.   :5.000   Max.   :4.0  

Each of these gives us the data.frame with the rows that contain NA values omitted.

Familiarizing yourself with tidy

There is an R package called learnr that has embedded tutorials for practice.

install.packages("learnr")

These tutorials can be very helpful for the basics of tidy data wrangling, in particular filter and mutate.

learnr tutorial window

Plotting

Base R plots are ugly. The state of the art is ggplot2, which is part of the tidyverse. To get started with it, I would suggest esquisse, an R package built on shiny that lets you drag and drop to build plots. For example,

esquisse::esquisser(salespeople)

Code

Code tab

RMarkdown and Quarto

Section 2.8 of HRMPA contains a bit on RMarkdown. This summer RStudio released a more general version of RMarkdown called Quarto that is quite flexible and, in some ways, more general purpose. These slides were produced using reveal.js and quarto. The beauty is that everything is reproducible and that it seamlessly handles code, output, and fancy text, math, and a host of stuff. If you want to see the code for these slides, it is available here.

Chapter 3: Statistics Foundations

Definitions

  • [Arithmetic] Mean: mean() and mean(., na.rm=TRUE) \overline{x} = \frac{1}{N}\sum_{i=1}^{N} x_{i}
  • Variance: var() and var(., na.rm=TRUE)
    • Population Var_{p}(x) = \frac{1}{N}\sum_{i=1}^{N} (x_{i}-\overline{x})^2
    • Sample [this is what var() computes] Var_{s}(x) = \frac{1}{N-1}\sum_{i=1}^{N} (x_{i}-\overline{x})^2

Standard Deviation

sd() and sd(., na.rm=TRUE)

We use standard deviation to avoid the squared metric.

  • Population \sigma_{p}(x) = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_{i}-\overline{x})^2}
  • Sample \sigma_{s}(x) = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_{i}-\overline{x})^2}
  • They are related: \sigma_{s}(x) = \sqrt{\frac{N}{N-1}}\sigma_{p}(x)
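The variance and standard deviation formulas can be checked by hand. The sketch below uses a small made-up vector; var() and sd() in R compute the sample (N-1) versions, so the population versions are computed explicitly.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical data with mean 5
N <- length(x)

var_p <- sum((x - mean(x))^2) / N         # population variance: 4
var_s <- sum((x - mean(x))^2) / (N - 1)   # sample variance: 32/7
all.equal(var_s, var(x))                  # TRUE: var() is the sample version

sd_p <- sqrt(var_p)                       # population sd: 2
sd_s <- sd(x)                             # sample sd
all.equal(sd_s, sqrt(N / (N - 1)) * sd_p) # TRUE: the relation on the slide
```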

Correlation and Covariance

Covariance is the shared variance of x and y.

cov_{s}(x,y) = \frac{1}{N-1}\sum_{i=1}^{N} (x_{i}-\overline{x})(y_{i}-\overline{y})

There is also a population covariance that divides by N.

Correlation is a bounded measure between -1 and 1.

r_{s}(x,y) = \frac{cov_{s}(x,y)}{\sigma_{s}(x)\sigma_{s}(y)}

In a two-variable regression, r_{x,y}^2 is the proportion of the variance in y accounted for by x.
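A sketch that computes the sample covariance and correlation from their definitions and compares them with cov(), cor(), and the R-squared of a two-variable regression. The built-in mtcars data is only a stand-in here; it is not one of the course datasets.

```r
x <- mtcars$wt    # car weight
y <- mtcars$mpg   # miles per gallon
N <- length(x)

cov_s <- sum((x - mean(x)) * (y - mean(y))) / (N - 1)
r     <- cov_s / (sd(x) * sd(y))

all.equal(cov_s, cov(x, y))   # TRUE
all.equal(r, cor(x, y))       # TRUE

# In the regression of y on x alone, R-squared equals r squared
fit <- lm(mpg ~ wt, data = mtcars)
all.equal(r^2, summary(fit)$r.squared)   # TRUE
```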

Sampling and Distributions

What is a random variable? At what level?

  • The text emphasizes independent and identically distributed
  • This can get pretty into the math weeds. Wikipedia definition

Statistics have sampling distributions

  • Remember t? A distribution entirely defined by degrees of freedom. For the mean, because the deviations about the mean must sum to zero, it is always N-1.
  • With statistics we distinguish the variability of a statistic, as opposed to data, with the term standard error.
  • The standard error of the mean is SE(\overline{x}) = \frac{\sigma_{s}(x)}{\sqrt{N}}

A Function for the Standard Error

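# Define the function
SE.mean <- function(vectorX) {  # Input a vector
# Insert the formula from the slide. Divide by the number of
# non-missing values: length(complete.cases(vectorX)) would
# count the NA entries too, so use sum() instead.
  SE <- sd(vectorX, na.rm=TRUE)/sqrt(sum(complete.cases(vectorX)))
# return the value calculated
  SE
}
SE.mean(salespeople$customer_rate)
# roughly 0.0477, using the 350 non-missing values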

Confidence Intervals

A little reading, a refresher for the case of a proportion, is here as a Tufte Handout. You can also see an extended example in the Week 1 entry on the website.

  • Student proved that the sampling distribution of the standardized sample mean is given by the t-distribution, exactly for normally distributed data and approximately under general conditions satisfying the central limit theorem.
  • t = \frac{\overline{x} - \mu}{SE(\overline{x})}
  • so it must be that, given percentiles of t, \mu = \overline{x} - t * SE(\overline{x})
  • In a sentence, with probability XXX, the true mean lies within \pm t_{N-1, \frac{1-XXX}{2}} standard errors of the sample mean if we want a central interval. Otherwise, all the probability comes from one side.
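The central interval can be built directly from qt() and the standard error, then compared with what t.test() reports. This is a sketch with the built-in mtcars data as a stand-in; the 95 percent level is just the conventional choice.

```r
x    <- mtcars$mpg
N    <- length(x)
se   <- sd(x) / sqrt(N)        # standard error of the mean
conf <- 0.95                   # the XXX in the sentence above

# Percentile of t with N-1 degrees of freedom, leaving (1-conf)/2 in each tail
t_crit <- qt(1 - (1 - conf) / 2, df = N - 1)

ci <- mean(x) + c(-1, 1) * t_crit * se
ci

# The same interval from the built-in test
builtin <- as.numeric(t.test(x, conf.level = conf)$conf.int)
all.equal(ci, builtin)         # TRUE
```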

Hypothesis Tests

Posit some true value and assess how plausible the sample is under it, using a level of probability chosen a priori. The duality between this and confidence intervals is the topic of the two linked handouts.

  • For cases beyond a single mean, the trick is calculating the appropriate standard error and/or the appropriate degrees of freedom.
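A sketch of the single-mean case, computed by hand and with t.test(). The data (built-in mtcars) and the posited value of 18 are hypothetical stand-ins. The last lines illustrate the duality: the posited value sits inside the 95 percent interval exactly when the two-sided p-value exceeds 0.05.

```r
x   <- mtcars$mpg
N   <- length(x)
mu0 <- 18                                    # hypothetical posited true value
se  <- sd(x) / sqrt(N)

t_stat <- (mean(x) - mu0) / se
p_two  <- 2 * pt(-abs(t_stat), df = N - 1)   # two-sided p-value

test <- t.test(x, mu = mu0)
all.equal(unname(test$statistic), t_stat)    # TRUE
all.equal(test$p.value, p_two)               # TRUE

# Duality with the confidence interval
ci     <- test$conf.int
inside <- mu0 >= ci[1] && mu0 <= ci[2]
inside == (p_two > 0.05)                     # TRUE
```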

Difference of Means

If we wish to know whether the means of two groups are:

  • the same or different
  • one greater/less than the other

The t distribution is also appropriate for this task, as demonstrated by Welch [Welch’s t].

This is implemented as t.test(x, y) or, for tidy data, t.test(var~group, data=data).
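Both calling styles give the same Welch test. The sketch below uses the built-in mtcars data as a stand-in, comparing mpg across the two values of am; the variable choice is only for illustration.

```r
# Formula interface: outcome ~ grouping variable
welch <- t.test(mpg ~ am, data = mtcars)
welch$method       # "Welch Two Sample t-test" (the default, var.equal = FALSE)
welch$parameter    # Welch degrees of freedom, usually not a whole number

# Two-vector interface on the same groups
x  <- mtcars$mpg[mtcars$am == 0]
y  <- mtcars$mpg[mtcars$am == 1]
xy <- t.test(x, y)
all.equal(unname(welch$statistic), unname(xy$statistic))   # TRUE

# Assuming equal variances instead gives N - 2 degrees of freedom
pooled <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)
unname(pooled$parameter)   # 30

# A one-sided alternative: is the first group's mean less than the second's?
one_sided <- t.test(x, y, alternative = "less")
```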

Correlations

A modification of the correlation coefficient has also been shown to have a t-distribution with N-2 degrees of freedom, often known as t^{*}:

t^{*} = \frac{r\sqrt{N-2}}{\sqrt{1-r^{2}}}

This is implemented in cor.test(x, y).
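The statistic can be computed from the formula and checked against cor.test(). A sketch with the built-in mtcars data as a stand-in:

```r
x <- mtcars$wt
y <- mtcars$mpg
N <- length(x)

r      <- cor(x, y)
t_star <- r * sqrt(N - 2) / sqrt(1 - r^2)

ct <- cor.test(x, y)
all.equal(unname(ct$statistic), t_star)   # TRUE
unname(ct$parameter)                      # 30, that is N - 2

# Two-sided p-value from the t distribution with N - 2 degrees of freedom
p_two <- 2 * pt(-abs(t_star), df = N - 2)
all.equal(ct$p.value, p_two)              # TRUE
```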

Contingency tables

We can also examine the independence among rows and columns of a table. Section 3.3.3 contains this example. The comparison relies on the difference between observed counts and expected counts. Computing the expected counts requires only the marginal probability of each value along the rows and columns and the total number of items: under independence, the expected count in a cell is N times the row probability times the column probability, where the row probabilities and the column probabilities must each sum to one.
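A sketch of the expected-count calculation, using the built-in Titanic table collapsed to Sex by Survived. The chi-square statistic is computed without the continuity correction (correct = FALSE) so that it matches the plain observed-versus-expected formula.

```r
# Collapse the 4-way Titanic table to Sex (rows) by Survived (columns)
tab <- apply(Titanic, c(2, 4), sum)
N   <- sum(tab)

row_p <- rowSums(tab) / N      # marginal row probabilities, sum to one
col_p <- colSums(tab) / N      # marginal column probabilities, sum to one

# Expected count under independence: N * row probability * column probability
expected <- N * outer(row_p, col_p)

# Pearson chi-square: squared differences scaled by the expected counts
chi_sq <- sum((tab - expected)^2 / expected)

test <- chisq.test(tab, correct = FALSE)
all.equal(as.vector(test$expected), as.vector(expected))   # TRUE
all.equal(unname(test$statistic), chi_sq)                  # TRUE
```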

Chi-square independence

Two-by-two tables

These are a special case of the above. The Tufte Handout cited earlier goes through the example of a single proportion [of a binary variable] to show that, as long as each expected count is greater than about 5 or 10, we can use a normal to assess a difference in a two-by-two table.

This is implemented in prop.test(table) after creating the table using table(). Using table() this way will require us to learn a new pipe: %$%, the exposition pipe from magrittr, which makes the columns of a data frame available by name to the function on its right.

Illustration

Let me use an internal dataset to illustrate.

library(magrittr) # for the new pipe
data("Titanic")   # Some data
Tidy.Titanic <- DescTools::Untable(Titanic)  # unTable the data
Tidy.Titanic %$% table(Sex, Survived)
        Survived
Sex        No  Yes
  Male   1364  367
  Female  126  344
Tidy.Titanic %$% table(Sex, Survived) %>% prop.test(.)

    2-sample test for equality of proportions with continuity correction

data:  .
X-squared = 454.5, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
 0.4741109 0.5656866
sample estimates:
   prop 1    prop 2 
0.7879838 0.2680851 
Tidy.Titanic %$% chisq.test(Sex, Survived)

    Pearson's Chi-squared test with Yates' continuity correction

data:  Sex and Survived
X-squared = 454.5, df = 1, p-value < 2.2e-16

For Next Time: Chapter 4 – Regression Models