The `datasaurus dozen`

The datasaurus dozen is a fantastic teaching resource for showing why data visualization matters. The basic idea is that all thirteen datasets (the datasaurus plus twelve others) have nearly identical means and standard deviations, though they do differ once the five-number summaries are examined. The very different scatterplots that come from data with such similar x-y summaries are a useful reminder that data science is about patterns, not just statistics. Let’s have a look.

datasaurus <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-10-13/datasaurus.csv')

Two libraries make our work easy: tidyverse for data manipulation and plotting, and skimr for the summary statistics.

library(tidyverse)
library(skimr)

First, the summary statistics. Summary statistics are great, but they are no substitute for basic data familiarity. Notice that the means and standard deviations are nearly identical across datasets, though the five-number summaries (the p0, p25, p50, p75, and p100 columns) do vary.

datasaurus %>% 
  group_by(dataset) %>% 
  skim(x)

Data summary
Name	Piped data
Number of rows	1846
Number of columns	3
_______________________
Column type frequency:
numeric	1
________________________
Group variables	dataset

Variable type: numeric

skim_variable	dataset	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
x	away	1	54.27	16.77	15.56	39.72	53.34	69.15	91.64	▁▇▃▇▁
x	bullseye	1	54.27	16.77	19.29	41.63	53.84	64.80	91.74	▂▆▇▅▂
x	circle	1	54.27	16.76	21.86	43.38	54.02	64.97	85.66	▅▃▇▅▃
x	dino	1	54.26	16.77	22.31	44.10	53.33	64.74	98.21	▅▇▇▅▂
x	dots	1	54.26	16.77	25.44	50.36	50.98	75.20	77.95	▂▁▇▁▅
x	h_lines	1	54.26	16.77	22.00	42.29	53.07	66.77	98.29	▅▇▇▅▁
x	high_lines	1	54.27	16.77	17.89	41.54	54.17	63.95	96.08	▂▅▇▃▁
x	slant_down	1	54.27	16.77	18.11	42.89	53.14	64.47	95.59	▂▅▇▃▁
x	slant_up	1	54.27	16.77	20.21	42.81	54.26	64.49	95.26	▃▆▇▃▂
x	star	1	54.27	16.77	27.02	41.03	56.53	68.71	86.44	▅▇▇▃▆
x	v_lines	1	54.27	16.77	30.45	49.96	50.36	69.50	89.50	▃▇▁▅▁
x	wide_lines	1	54.27	16.77	27.44	35.52	64.55	67.45	77.92	▇▂▁▇▅
x	x_shape	1	54.26	16.77	31.11	40.09	47.14	71.86	85.45	▇▆▁▃▅

datasaurus %>% 
  group_by(dataset) %>% 
  skim(y)

Data summary
Name	Piped data
Number of rows	1846
Number of columns	3
_______________________
Column type frequency:
numeric	1
________________________
Group variables	dataset

Variable type: numeric

skim_variable	dataset	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
y	away	1	47.83	26.94	0.02	24.63	47.54	71.80	97.48	▅▆▃▇▃
y	bullseye	1	47.83	26.94	9.69	26.24	47.38	72.53	85.88	▇▆▃▅▇
y	circle	1	47.84	26.93	16.33	18.35	51.03	77.78	85.58	▇▁▁▂▆
y	dino	1	47.83	26.94	2.95	25.29	46.03	68.53	99.49	▇▇▇▅▆
y	dots	1	47.84	26.93	15.77	17.11	51.30	82.88	94.25	▇▁▇▁▆
y	h_lines	1	47.83	26.94	10.46	30.48	50.47	70.35	90.46	▆▇▇▅▅
y	high_lines	1	47.84	26.94	14.91	22.92	32.50	75.94	87.15	▇▁▁▃▅
y	slant_down	1	47.84	26.94	0.30	27.84	46.40	68.44	99.64	▆▇▇▅▆
y	slant_up	1	47.83	26.94	5.65	24.76	45.29	70.86	99.58	▇▇▇▅▅
y	star	1	47.84	26.93	14.37	20.37	50.11	63.55	92.21	▇▂▂▅▅
y	v_lines	1	47.84	26.94	2.73	22.75	47.11	65.85	99.69	▇▆▇▃▅
y	wide_lines	1	47.83	26.94	0.22	24.35	46.28	67.57	99.28	▇▇▇▅▆
y	x_shape	1	47.84	26.93	4.58	23.47	39.88	73.61	97.84	▇▇▂▆▅
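
The two skim() tables can be condensed into a single row per dataset. The sketch below is one way to do that, not part of the original analysis: `summarise_xy()` is a hypothetical helper name, and `toy` is made-up stand-in data so the block runs without a download. In the toy data, dataset "b" mirrors the y values of "a" around their shared mean, so the mean and standard deviation of y match while the median moves, which is the same effect the tables above show on a larger scale.

```r
library(dplyr)

# Hypothetical helper (not from the original post): one row per dataset with
# the mean and standard deviation of x and y, plus the median of y.
summarise_xy <- function(df) {
  df %>%
    group_by(dataset) %>%
    summarise(
      mean_x   = mean(x),
      sd_x     = sd(x),
      mean_y   = mean(y),
      sd_y     = sd(y),
      median_y = median(y),
      .groups  = "drop"
    )
}

# Toy stand-in data (invented for illustration): "b" reflects the y values
# of "a" around their common mean, keeping mean and sd but not the median.
toy <- data.frame(
  dataset = rep(c("a", "b"), each = 4),
  x       = rep(1:4, times = 2),
  y       = c(0, 0, 0, 4, 2, 2, 2, -2)
)

toy_summary <- summarise_xy(toy)
toy_summary
```

Assuming the `datasaurus` data frame has been read in as shown earlier, `summarise_xy(datasaurus)` should return one row for each of the thirteen datasets.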

Notice that the summary statistics of all the datasets are nearly identical. But have a look at the data themselves.

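# One panel per dataset; guides(color = "none") drops the redundant legend (color = FALSE is deprecated)
DP <- datasaurus %>% ggplot() + aes(x = x, y = y, color = dataset, group = dataset) +
  geom_point() + guides(color = "none") + facet_wrap(vars(dataset))
DP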
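
To look at one dataset at a time rather than all thirteen panels, a small wrapper can filter before plotting. This is a minimal sketch under stated assumptions: `plot_one()` and its `which_dataset` argument are hypothetical names, and `toy` is made-up stand-in data so the block runs on its own.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper (not from the original post): scatterplot of a single
# named dataset from a data frame with columns dataset, x, and y.
plot_one <- function(df, which_dataset) {
  df %>%
    filter(dataset == which_dataset) %>%
    ggplot(aes(x = x, y = y)) +
    geom_point() +
    labs(title = which_dataset)
}

# Toy stand-in data (invented for illustration)
toy <- data.frame(
  dataset = rep(c("a", "b"), each = 4),
  x       = rep(1:4, times = 2),
  y       = c(0, 0, 0, 4, 2, 2, 2, -2)
)

p <- plot_one(toy, "a")
p
```

With the data from this post, a call such as `plot_one(datasaurus, "dino")` would draw only the dinosaur.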

References

knitr::write_bib(names(sessionInfo()$otherPkgs), file="bibliography.bib")
