12  Joining data together

Briefly go over the bigger picture (found in the introduction section) and remind everyone the ‘what’ and ‘why’ of what we are doing, and where we are at.

12.1 Learning objectives

  1. Describe some ways to join or “bind” data and identify which join is appropriate for a given situation. Then apply dplyr’s full_join() function to join two datasets together into one single dataset.

12.2 📖 Reading task: The many ways to join data

Time: ~10 minutes.

Walk through and describe these images and the different type of joins after they’ve read it.

The ability to join datasets together is a fundamental component of data processing and transformation. In many cases, it is required to do before you can meaningfully do any work. Like in our case, we have essentially three datasets that aren’t particularly useful on their own and only really benefit by being joined together. We join them to be able to make comparisons, analyse the variables together, and to work on one dataset rather than three.

But joining datasets can be tricky. There are many ways to join datasets and you often have to do several checks of the data and prepare it in the right way for your needs, like we’ve done, to get a useful and correct output in the end. It can be really easy to not realise that some data was dropped or not included or that extra data was added. It can be extremely frustrating to check and confirm things, but ultimately necessary to continue with doing your analyses and work correctly.

There are many ways to join datasets in dplyr, all of which are described in the ?dplyr::join and ?dplyr::semi_join help documentation. While there are several, only a few are most commonly used. These are left_join(), right_join(), and full_join().

left_join(x, y)

Join all rows and columns in y that match rows and columns in x. Columns that exist in y but not x are joined to x. The number of rows in the joined data is equal to the number of rows in x. If there are some rows in y that don’t match with x, those rows are not included, and so are assigned NA values in the output data.

A diagram showing two data frames, one green and one blue, with three columns each. Each data frame has one different column and one different row. The output data frame shows that all columns are output, but only the rows that match with the left data frame (the green one) are included in the output.
Figure 12.1: Left joining in dplyr. Notice how the last row in the blue data (the row with d in column A) is not included in the outputted data on the right. Modified from the Posit dplyr cheat sheet. Image license is CC-BY-SA.
right_join(x, y)

The opposite of left_join(). Join all rows and columns in x that match rows and columns in y. Columns that exist in x but not y are joined to y. Like left_join(), but the opposite, the number of rows in the joined data is equal to the number of rows in y. If there are some rows in x that don’t match with y, those rows are not included and are given an NA value. This function can be useful when used while piping with |> when you are processing data in a sequence of steps and want to continue piping but join a different way then left_join().

A diagram showing two data frames, one green and one blue, with three columns each. Each data frame has one different column and one different row. The output data frame shows that all columns are output, but only the rows that match with the right data frame (the blue one) are included in the output.
Figure 12.2: Right joining in dplyr. Notice how the last row in the green data (the row with c in column A) is not included in the outputted data on the right. Modified from the Posit dplyr cheat sheet. Image license is CC-BY-SA.
full_join(x, y)

Join all rows and columns in y that match rows and columns in x. Columns and rows that exist in y but not x are joined to x. A full join keeps all the data from both x and y. This can be useful if you want to make sure you have all the data from both datasets, even if it results in more missing values in the output data.

A diagram showing two data frames, one green and one blue, with three columns each. Each data frame has one different column and one different row. The output data frame shows that all columns are output, and all rows are included in the output, even if they don't match with the other data frame.
Figure 12.3: Full joining in dplyr. Notice how all rows and columns are included in the outputted data on the right, and that some missingness is introduced because those values don’t exist when the data are combined in this way. Modified from the Posit dplyr cheat sheet. Image license is CC-BY-SA.
Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩

12.3 Joining with dplyr functions

Remind everyone that there will be some things that look odd when joining, but it’s likely due to us only importing the first 100 rows from all the data files.

Walk through the code, the output, and highlight what remains after the join.

Let’s use some of dplyr’s join functions! We’ll join the participant_details and cgm_data datasets together. We’ll probably want to use full_join(), but let’s test out the other two joins first. In your docs/learning.qmd file, go to the bottom of the file and make a new header called ## Joining datasets. Below that make a new code chunk with Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type “new chunk”). We’ll start with using left_join() to join the two datasets together.

First, let’s look at the two datasets before joining them.

docs/learning.qmd
participant_details
# A tibble: 823 × 5
      id gender age_years intervention        date      
   <dbl> <chr>      <dbl> <chr>               <date>    
 1   101 woman       45.6 baseline            2021-03-15
 2   101 woman       45.6 baseline            2021-03-16
 3   101 woman       45.6 baseline            2021-03-17
 4   101 woman       45.6 baseline            2021-03-18
 5   101 woman       45.6 baseline            2021-03-19
 6   101 woman       45.6 baseline            2021-03-20
 7   101 woman       45.6 baseline            2021-03-21
 8   101 woman       45.6 baseline            2021-03-22
 9   101 woman       45.6 high bioactive diet 2021-05-11
10   101 woman       45.6 high bioactive diet 2021-05-12
# ℹ 813 more rows
cgm_data
# A tibble: 506 × 5
      id date        hour glucose_mean glucose_sd
   <int> <date>     <int>        <dbl>      <dbl>
 1   101 2021-03-18     8         5.43     0.351 
 2   101 2021-03-18     9         5.05     0.300 
 3   101 2021-03-18    10         5.3      0.392 
 4   101 2021-03-18    11         4.03     0.189 
 5   101 2021-03-18    12         4.02     0.0957
 6   101 2021-03-18    13         4.1      0.141 
 7   101 2021-03-18    14         5.62     0.946 
 8   101 2021-03-18    15         7.3      0.141 
 9   101 2021-03-18    16         6.55     0.7   
10   101 2021-03-18    17         4.8      0.356 
# ℹ 496 more rows

Notice the number of rows in each dataset. The participant_details dataset has 823 rows and the cgm_data dataset has 506 rows. When we left_join() them, there should no more than 1329 rows (cgm_data plus participant_details) in the output. There will likely be less because, remember, we have only imported the first 100 rows of each dataset. Let’s join them with left_join().

docs/learning.qmd
participant_details |>
  left_join(cgm_data)
Joining with `by = join_by(id, date)`
# A tibble: 1,212 × 8
      id gender age_years intervention date        hour glucose_mean
   <dbl> <chr>      <dbl> <chr>        <date>     <int>        <dbl>
 1   101 woman       45.6 baseline     2021-03-15    NA        NA   
 2   101 woman       45.6 baseline     2021-03-16    NA        NA   
 3   101 woman       45.6 baseline     2021-03-17    NA        NA   
 4   101 woman       45.6 baseline     2021-03-18     8         5.43
 5   101 woman       45.6 baseline     2021-03-18     9         5.05
 6   101 woman       45.6 baseline     2021-03-18    10         5.3 
 7   101 woman       45.6 baseline     2021-03-18    11         4.03
 8   101 woman       45.6 baseline     2021-03-18    12         4.02
 9   101 woman       45.6 baseline     2021-03-18    13         4.1 
10   101 woman       45.6 baseline     2021-03-18    14         5.62
# ℹ 1,202 more rows
# ℹ 1 more variable: glucose_sd <dbl>

Here, because cgm_data has the hour column, that means it expands all the data from participant_details for the dates that it joins by. The number of rows now is 1212, which is slightly less than we expected but it also isn’t surprising. Some rows in cgm_data got dropped because we haven’t yet imported all the data.

Because theoretically, the same id and date values should be present in both cgm_data and participant_details, using either left_join() or full_join() should be similar results. It should not give more than 1329 rows though.

But let’s try right_join() first to see what happens. This is simpler, since it will only join what is in the right side, e.g. cgm_data. That means there should be exactly 506 rows in the output.

docs/learning.qmd
participant_details |>
  right_join(cgm_data)
Joining with `by = join_by(id, date)`
# A tibble: 506 × 8
      id gender age_years intervention date        hour glucose_mean
   <dbl> <chr>      <dbl> <chr>        <date>     <int>        <dbl>
 1   101 woman       45.6 baseline     2021-03-18     8         5.43
 2   101 woman       45.6 baseline     2021-03-18     9         5.05
 3   101 woman       45.6 baseline     2021-03-18    10         5.3 
 4   101 woman       45.6 baseline     2021-03-18    11         4.03
 5   101 woman       45.6 baseline     2021-03-18    12         4.02
 6   101 woman       45.6 baseline     2021-03-18    13         4.1 
 7   101 woman       45.6 baseline     2021-03-18    14         5.62
 8   101 woman       45.6 baseline     2021-03-18    15         7.3 
 9   101 woman       45.6 baseline     2021-03-18    16         6.55
10   101 woman       45.6 baseline     2021-03-18    17         4.8 
# ℹ 496 more rows
# ℹ 1 more variable: glucose_sd <dbl>

We see that, yes, it does have the same number of rows. We also see that there are no missing values in the glucose columns. That’s because the participant_details dataset doesn’t have glucose data and we know there is no missing values in cgm_data for it. So there will be no missing values in the output.

Lastly, we’ll fully join. Like left_join(), the number of rows can not be more than the sum of the number of rows in both datasets. But it can be less, since, by joining there will be overlap between the two datasets. At a minimum, there can not be less rows than the largest of the two datasets, which is in this case participant_details at 823 rows. Let’s try full_join() to see what happens.

docs/learning.qmd
participant_details |>
  full_join(cgm_data)
Joining with `by = join_by(id, date)`
# A tibble: 1,292 × 8
      id gender age_years intervention date        hour glucose_mean
   <dbl> <chr>      <dbl> <chr>        <date>     <int>        <dbl>
 1   101 woman       45.6 baseline     2021-03-15    NA        NA   
 2   101 woman       45.6 baseline     2021-03-16    NA        NA   
 3   101 woman       45.6 baseline     2021-03-17    NA        NA   
 4   101 woman       45.6 baseline     2021-03-18     8         5.43
 5   101 woman       45.6 baseline     2021-03-18     9         5.05
 6   101 woman       45.6 baseline     2021-03-18    10         5.3 
 7   101 woman       45.6 baseline     2021-03-18    11         4.03
 8   101 woman       45.6 baseline     2021-03-18    12         4.02
 9   101 woman       45.6 baseline     2021-03-18    13         4.1 
10   101 woman       45.6 baseline     2021-03-18    14         5.62
# ℹ 1,282 more rows
# ℹ 1 more variable: glucose_sd <dbl>

We can see that the number of rows is 1292, which is still less than the sum of the two datasets. But again, it isn’t surprising to us here, since we know there is some overlap between the two datasets.

The full_join() function is most often used because we often want to see all the data we have, even if there are some missing values. Especially considering that some statistical models can use missing values as extra information.

Let’s code assign this output to a new object called dime_data.

docs/learning.qmd
dime_data <- participant_details |>
  full_join(cgm_data)
Joining with `by = join_by(id, date)`
dime_data
# A tibble: 1,292 × 8
      id gender age_years intervention date        hour glucose_mean
   <dbl> <chr>      <dbl> <chr>        <date>     <int>        <dbl>
 1   101 woman       45.6 baseline     2021-03-15    NA        NA   
 2   101 woman       45.6 baseline     2021-03-16    NA        NA   
 3   101 woman       45.6 baseline     2021-03-17    NA        NA   
 4   101 woman       45.6 baseline     2021-03-18     8         5.43
 5   101 woman       45.6 baseline     2021-03-18     9         5.05
 6   101 woman       45.6 baseline     2021-03-18    10         5.3 
 7   101 woman       45.6 baseline     2021-03-18    11         4.03
 8   101 woman       45.6 baseline     2021-03-18    12         4.02
 9   101 woman       45.6 baseline     2021-03-18    13         4.1 
10   101 woman       45.6 baseline     2021-03-18    14         5.62
# ℹ 1,282 more rows
# ℹ 1 more variable: glucose_sd <dbl>

And then we can continue with adding the sleep data to it.

docs/learning.qmd
dime_data <- participant_details |>
  full_join(cgm_data) |>
  full_join(sleep_data)
Joining with `by = join_by(id, date)`
Joining with `by = join_by(id, date, hour)`
dime_data
# A tibble: 1,867 × 15
      id gender age_years intervention date        hour glucose_mean
   <dbl> <chr>      <dbl> <chr>        <date>     <int>        <dbl>
 1   101 woman       45.6 baseline     2021-03-15    NA        NA   
 2   101 woman       45.6 baseline     2021-03-16    NA        NA   
 3   101 woman       45.6 baseline     2021-03-17    NA        NA   
 4   101 woman       45.6 baseline     2021-03-18     8         5.43
 5   101 woman       45.6 baseline     2021-03-18     9         5.05
 6   101 woman       45.6 baseline     2021-03-18    10         5.3 
 7   101 woman       45.6 baseline     2021-03-18    11         4.03
 8   101 woman       45.6 baseline     2021-03-18    12         4.02
 9   101 woman       45.6 baseline     2021-03-18    13         4.1 
10   101 woman       45.6 baseline     2021-03-18    14         5.62
# ℹ 1,857 more rows
# ℹ 8 more variables: glucose_sd <dbl>, seconds_deep <dbl>,
#   seconds_light <dbl>, seconds_wake <dbl>, seconds_rem <dbl>,
#   seconds_asleep <dbl>, seconds_restless <dbl>, seconds_awake <dbl>

Woooo! 🎉 We now are at the point of having only one data frame! Let’s check the data out a bit more. Using the Console do:

Console
View(dime_data)

Let’s do some basic checks. Notice the column names, they seem to be repeats but with different names, like awake and wake. Also, the seconds sleeping per hour is more seconds than exist in an hour. We don’t know why there are these issues as it isn’t explained in the documentation.

Keep in mind, since we only imported the first 100 rows, there will be some things that don’t match up. But even with this, there are odd things that we will need to address.

12.4 Putting it all together: Save a final dataset

We are finally at the point where we can put it all together and make a single dataset! Let’s cut the join code we wrote in the previous section and paste it into the setup code chunk of docs/learning.qmd.

docs/learning.qmd
dime_data <- participant_details |>
  full_join(cgm_data) |>
  full_join(sleep_data)
Joining with `by = join_by(id, date)`
Joining with `by = join_by(id, date, hour)`

Then, open up the R/functions.R file and look at the import_dime() function. Find the line that says n_max = 100 and delete it.

The last step… saving the dataset to the data/ folder! Why save it there? That way you communicate to yourself (and others) that this is the dataset to use, not the data in the data-raw/ folder. So, open up docs/learning.qmd so we can save it as a CSV file with write_csv(). In the last line of the setup code chunk, do:

docs/learning.qmd
write_csv(dime_data, here("data/dime.csv"))

Since we don’t want to track this file in Git, let’s end this with ignoring it by running in the Console:

Console
usethis::use_git_ignore("data/dime.csv")

Now, in the docs/learning.qmd file, render the document with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”). This will reproducibly create the dataset and save it to the data/ folder. Once that is done, we will finish with committing the changes with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”) and then pushing to GitHub.

12.5 Key takeaways

Quickly cover this and get them to do the survey before moving on to the discussion activity.

  • Knowing how to join datasets together is a fundamental part of data processing and transformation. But it can be also tricky and a source of unexpected issues later on if not done well or not checked.
  • Use left_join(), right_join(), and full_join() to join two datasets together.

12.6 Code used in session

This lists some, but not all, of the code used in the section. Some code is incorporated into Markdown content, so is harder to automatically list here in a code chunk. The code below also includes the code from the exercises.

participant_details |>
  left_join(cgm_data)
participant_details |>
  right_join(cgm_data)
participant_details |>
  full_join(cgm_data)
dime_data <- participant_details |>
  full_join(cgm_data)
dime_data
dime_data <- participant_details |>
  full_join(cgm_data) |>
  full_join(sleep_data)
dime_data
dime_data <- participant_details |>
  full_join(cgm_data) |>
  full_join(sleep_data)
write_csv(dime_data, here("data/dime.csv"))
usethis::use_git_ignore("data/dime.csv")
save.image(here::here("_temp/joins.RData"))