datasaurus <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-10-13/datasaurus.csv')
Robert W. Walker
October 19, 2020
datasaurus dozen
The datasaurus dozen is a fantastic teaching resource for examining the importance of data visualization. Let’s have a look. The basic idea is that all thirteen datasets (the datasaurus plus twelve others) have nearly identical means and standard deviations, though they differ once the five-number summaries are examined. The scatterplots that come from data with such similar x-y summaries are a useful reminder that data science is about patterns, not just statistics.
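One way to put a number on "nearly identical" is to compute the mean and standard deviation of x and y within each dataset and then ask how far apart the largest and smallest values are. The following is a minimal sketch, not code from the original post; the helper names `xy_moments` and `moment_spread` are hypothetical, and it assumes a data frame with columns `dataset`, `x`, and `y`, as in the file read in above.

```r
library(dplyr)

# Hypothetical helper: mean and standard deviation of x and y per dataset.
xy_moments <- function(data) {
  data %>%
    group_by(dataset) %>%
    summarise(
      mean_x = mean(x), sd_x = sd(x),
      mean_y = mean(y), sd_y = sd(y),
      .groups = "drop"
    )
}

# Hypothetical helper: for each statistic, the gap between the largest
# and smallest value across datasets.
moment_spread <- function(moments) {
  moments %>%
    summarise(across(-dataset, ~ max(.x) - min(.x)))
}
```

With the data loaded, `xy_moments(datasaurus)` gives one row per dataset and `moment_spread(xy_moments(datasaurus))` collapses that to a single row of gaps. Small gaps here say nothing about the shape of the data, which is the point of the exercise.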
Two libraries to make our work easy.
library(tidyverse)
library(skimr)
First, the summary statistics. Summary statistics are great, but they are no substitute for basic data familiarity. Notice that the means and standard deviations are nearly identical across datasets, though the five-number summaries do vary.
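The tables that follow have the shape of skimr output for grouped data, one table for x and one for y. The calls that produced them are not shown here, so the following is only a sketch of one way to get tables like them; the helper name `skim_by_dataset` is hypothetical, and it assumes the data frame has a `dataset` column to group on.

```r
library(dplyr)
library(skimr)

# Hypothetical helper: skim the selected column(s) separately for each dataset.
# The dots are passed straight through to skim(), which accepts column names.
skim_by_dataset <- function(data, ...) {
  data %>%
    group_by(dataset) %>%
    skim(...)
}
```

Under these assumptions, `skim_by_dataset(datasaurus, x)` and `skim_by_dataset(datasaurus, y)` would each return one row per dataset with the mean, standard deviation, quantiles, and an inline histogram.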
Name | Piped data |
Number of rows | 1846 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | dataset |
Variable type: numeric
skim_variable | dataset | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
x | away | 0 | 1 | 54.27 | 16.77 | 15.56 | 39.72 | 53.34 | 69.15 | 91.64 | ▁▇▃▇▁ |
x | bullseye | 0 | 1 | 54.27 | 16.77 | 19.29 | 41.63 | 53.84 | 64.80 | 91.74 | ▂▆▇▅▂ |
x | circle | 0 | 1 | 54.27 | 16.76 | 21.86 | 43.38 | 54.02 | 64.97 | 85.66 | ▅▃▇▅▃ |
x | dino | 0 | 1 | 54.26 | 16.77 | 22.31 | 44.10 | 53.33 | 64.74 | 98.21 | ▅▇▇▅▂ |
x | dots | 0 | 1 | 54.26 | 16.77 | 25.44 | 50.36 | 50.98 | 75.20 | 77.95 | ▂▁▇▁▅ |
x | h_lines | 0 | 1 | 54.26 | 16.77 | 22.00 | 42.29 | 53.07 | 66.77 | 98.29 | ▅▇▇▅▁ |
x | high_lines | 0 | 1 | 54.27 | 16.77 | 17.89 | 41.54 | 54.17 | 63.95 | 96.08 | ▂▅▇▃▁ |
x | slant_down | 0 | 1 | 54.27 | 16.77 | 18.11 | 42.89 | 53.14 | 64.47 | 95.59 | ▂▅▇▃▁ |
x | slant_up | 0 | 1 | 54.27 | 16.77 | 20.21 | 42.81 | 54.26 | 64.49 | 95.26 | ▃▆▇▃▂ |
x | star | 0 | 1 | 54.27 | 16.77 | 27.02 | 41.03 | 56.53 | 68.71 | 86.44 | ▅▇▇▃▆ |
x | v_lines | 0 | 1 | 54.27 | 16.77 | 30.45 | 49.96 | 50.36 | 69.50 | 89.50 | ▃▇▁▅▁ |
x | wide_lines | 0 | 1 | 54.27 | 16.77 | 27.44 | 35.52 | 64.55 | 67.45 | 77.92 | ▇▂▁▇▅ |
x | x_shape | 0 | 1 | 54.26 | 16.77 | 31.11 | 40.09 | 47.14 | 71.86 | 85.45 | ▇▆▁▃▅ |
Name | Piped data |
Number of rows | 1846 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | dataset |
Variable type: numeric
skim_variable | dataset | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
y | away | 0 | 1 | 47.83 | 26.94 | 0.02 | 24.63 | 47.54 | 71.80 | 97.48 | ▅▆▃▇▃ |
y | bullseye | 0 | 1 | 47.83 | 26.94 | 9.69 | 26.24 | 47.38 | 72.53 | 85.88 | ▇▆▃▅▇ |
y | circle | 0 | 1 | 47.84 | 26.93 | 16.33 | 18.35 | 51.03 | 77.78 | 85.58 | ▇▁▁▂▆ |
y | dino | 0 | 1 | 47.83 | 26.94 | 2.95 | 25.29 | 46.03 | 68.53 | 99.49 | ▇▇▇▅▆ |
y | dots | 0 | 1 | 47.84 | 26.93 | 15.77 | 17.11 | 51.30 | 82.88 | 94.25 | ▇▁▇▁▆ |
y | h_lines | 0 | 1 | 47.83 | 26.94 | 10.46 | 30.48 | 50.47 | 70.35 | 90.46 | ▆▇▇▅▅ |
y | high_lines | 0 | 1 | 47.84 | 26.94 | 14.91 | 22.92 | 32.50 | 75.94 | 87.15 | ▇▁▁▃▅ |
y | slant_down | 0 | 1 | 47.84 | 26.94 | 0.30 | 27.84 | 46.40 | 68.44 | 99.64 | ▆▇▇▅▆ |
y | slant_up | 0 | 1 | 47.83 | 26.94 | 5.65 | 24.76 | 45.29 | 70.86 | 99.58 | ▇▇▇▅▅ |
y | star | 0 | 1 | 47.84 | 26.93 | 14.37 | 20.37 | 50.11 | 63.55 | 92.21 | ▇▂▂▅▅ |
y | v_lines | 0 | 1 | 47.84 | 26.94 | 2.73 | 22.75 | 47.11 | 65.85 | 99.69 | ▇▆▇▃▅ |
y | wide_lines | 0 | 1 | 47.83 | 26.94 | 0.22 | 24.35 | 46.28 | 67.57 | 99.28 | ▇▇▇▅▆ |
y | x_shape | 0 | 1 | 47.84 | 26.93 | 4.58 | 23.47 | 39.88 | 73.61 | 97.84 | ▇▇▂▆▅ |
Notice that the means and standard deviations of all of the datasets are nearly identical. But have a look at the data themselves.
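One way to look at them is a grid of scatterplots with one panel per dataset. This is a minimal sketch rather than the original figure code; the helper name `plot_datasets` is hypothetical, the point size, transparency, and theme are arbitrary choices, and the data frame is assumed to have columns `dataset`, `x`, and `y`.

```r
library(ggplot2)

# Hypothetical helper: scatterplot of y against x, one panel per dataset.
plot_datasets <- function(data) {
  ggplot(data, aes(x = x, y = y)) +
    geom_point(size = 0.8, alpha = 0.7) +
    facet_wrap(vars(dataset)) +
    theme_minimal()
}
```

Calling `plot_datasets(datasaurus)` draws all thirteen panels on shared axes, which makes the differences in shape easy to compare against the matching summary statistics.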