Notes on Joins

Author

Robert W. Walker

Published

September 16, 2025

Two Data Sets

One on heroes.

How’s that done?
library(tidyverse)
heroes <- "
    name, alignment, gender,         publisher
 Magneto,       bad,   male,            Marvel
   Storm,      good, female,            Marvel
Mystique,       bad, female,            Marvel
  Batman,      good,   male,                DC
   Joker,       bad,   male,                DC
Catwoman,       bad, female,                DC
 Hellboy,      good,   male, Dark Horse Comics
"
heroes <- read_csv(heroes, skip = 1)

brands <- "
  publisher, yr_founded
         DC,       1934
     Marvel,       1939
      Image,       1992
"
brands <- read_csv(brands, skip = 1)
How’s that done?
heroes
# A tibble: 7 × 4
  name     alignment gender publisher        
  <chr>    <chr>     <chr>  <chr>            
1 Magneto  bad       male   Marvel           
2 Storm    good      female Marvel           
3 Mystique bad       female Marvel           
4 Batman   good      male   DC               
5 Joker    bad       male   DC               
6 Catwoman bad       female DC               
7 Hellboy  good      male   Dark Horse Comics
How’s that done?
brands
# A tibble: 3 × 2
  publisher yr_founded
  <chr>          <dbl>
1 DC              1934
2 Marvel          1939
3 Image           1992

Joins

Are the core idea behind SQL. R does all of these things though it does them in an R way. We will figure that out. The assignment? To learn joins. How? Here are the commands; use them and see how they work. It should be simple. inner_join(heroes,brands) produces what? Same for semi? left? right? full? In the following explanations, when I use left and right data, I mean heroes as left and brands as right where the sides are with respect to the ,.

  1. inner_join
How’s that done?
inner_join(heroes,brands)
Joining with `by = join_by(publisher)`
# A tibble: 6 × 5
  name     alignment gender publisher yr_founded
  <chr>    <chr>     <chr>  <chr>          <dbl>
1 Magneto  bad       male   Marvel          1939
2 Storm    good      female Marvel          1939
3 Mystique bad       female Marvel          1939
4 Batman   good      male   DC              1934
5 Joker    bad       male   DC              1934
6 Catwoman bad       female DC              1934

We keep only those rows that exist in both the left and the right datasets and we join all columns.

  1. semi_join
How’s that done?
semi_join(heroes,brands)
Joining with `by = join_by(publisher)`
# A tibble: 6 × 4
  name     alignment gender publisher
  <chr>    <chr>     <chr>  <chr>    
1 Magneto  bad       male   Marvel   
2 Storm    good      female Marvel   
3 Mystique bad       female Marvel   
4 Batman   good      male   DC       
5 Joker    bad       male   DC       
6 Catwoman bad       female DC       

The rows are the same as inner join but it does not keep columns from right data.

  1. left_join
How’s that done?
left_join(heroes,brands)
Joining with `by = join_by(publisher)`
# A tibble: 7 × 5
  name     alignment gender publisher         yr_founded
  <chr>    <chr>     <chr>  <chr>                  <dbl>
1 Magneto  bad       male   Marvel                  1939
2 Storm    good      female Marvel                  1939
3 Mystique bad       female Marvel                  1939
4 Batman   good      male   DC                      1934
5 Joker    bad       male   DC                      1934
6 Catwoman bad       female DC                      1934
7 Hellboy  good      male   Dark Horse Comics         NA

Keep all rows from the left

  1. right_join
How’s that done?
right_join(heroes,brands)
Joining with `by = join_by(publisher)`
# A tibble: 7 × 5
  name     alignment gender publisher yr_founded
  <chr>    <chr>     <chr>  <chr>          <dbl>
1 Magneto  bad       male   Marvel          1939
2 Storm    good      female Marvel          1939
3 Mystique bad       female Marvel          1939
4 Batman   good      male   DC              1934
5 Joker    bad       male   DC              1934
6 Catwoman bad       female DC              1934
7 <NA>     <NA>      <NA>   Image           1992

Keep all rows from the right

  1. full_join
How’s that done?
full_join(heroes,brands)
Joining with `by = join_by(publisher)`
# A tibble: 8 × 5
  name     alignment gender publisher         yr_founded
  <chr>    <chr>     <chr>  <chr>                  <dbl>
1 Magneto  bad       male   Marvel                  1939
2 Storm    good      female Marvel                  1939
3 Mystique bad       female Marvel                  1939
4 Batman   good      male   DC                      1934
5 Joker    bad       male   DC                      1934
6 Catwoman bad       female DC                      1934
7 Hellboy  good      male   Dark Horse Comics         NA
8 <NA>     <NA>      <NA>   Image                   1992

Keep everything, all rows and all columns

  1. anti_join
How’s that done?
anti_join(heroes,brands)
Joining with `by = join_by(publisher)`
# A tibble: 1 × 4
  name    alignment gender publisher        
  <chr>   <chr>     <chr>  <chr>            
1 Hellboy good      male   Dark Horse Comics

Only keep left rows that do not exist in the right data.

The most important option here is by. Here is an example using it explicitly. We need to use c() to combine the two variable names where each name is specified as character name, with quotes surrounding it.

How’s that done?
anti_join(heroes,brands, by=c("publisher" = "publisher"))
# A tibble: 1 × 4
  name    alignment gender publisher        
  <chr>   <chr>     <chr>  <chr>            
1 Hellboy good      male   Dark Horse Comics

If there were two [not applicable in this case], it would be by = c("publisher" = "publisher", "acronym" = "acronym") and can apply to as many variables as one wishes. The other argument of potential consequence is na_matches. By default, NA values are matched [as are NotaNumber – NaN values]. The option na_matches = "never" prevents this. This is also completely inconsequential in this case but will be shown for completeness.

How’s that done?
anti_join(heroes,brands, by=c("publisher" = "publisher"), na_matches="never")
# A tibble: 1 × 4
  name    alignment gender publisher        
  <chr>   <chr>     <chr>  <chr>            
1 Hellboy good      male   Dark Horse Comics

For a graphical depiction of SQL joins, see link