participant_details |>
left_join(cgm_data)
participant_details |>
right_join(cgm_data)
participant_details |>
full_join(cgm_data)
dime_data <- participant_details |>
full_join(cgm_data)
dime_data
dime_data <- participant_details |>
full_join(cgm_data) |>
full_join(sleep_data)
dime_data
dime_data <- participant_details |>
full_join(cgm_data) |>
full_join(sleep_data)
write_csv(dime_data, here("data/dime.csv"))
usethis::use_git_ignore("data/dime.csv")
save.image(here::here("_temp/joins.RData"))22 Joining data together
22.1 Learning objectives
- Describe some ways to join or “bind” data and identify which join is appropriate for a given situation. Then apply dplyr’s
full_join()function to join two datasets together into one single dataset.
22.2 📖 Reading task: The many ways to join data
Time: ~10 minutes.
The ability to join datasets together is a fundamental component of data processing and transformation. In many cases, it is required to do before you can meaningfully do any work. Like in our case, we have essentially three datasets that aren’t particularly useful on their own and only really benefit by being joined together. We join them to be able to make comparisons, analyse the variables together, and to work on one dataset rather than three.
But joining datasets can be tricky. There are many ways to join datasets and you often have to do several checks of the data and prepare it in the right way for your needs, like we’ve done, to get a useful and correct output in the end. It can be really easy to not realise that some data was dropped or not included or that extra data was added. It can be extremely frustrating to check and confirm things, but ultimately necessary to continue with doing your analyses and work correctly.
There are many ways to join datasets in dplyr, all of which are described in the ?dplyr::join and ?dplyr::semi_join help documentation. While there are several, only a few are most commonly used. These are left_join(), right_join(), and full_join().
left_join(x, y)-
Join all rows and columns in
ythat match rows and columns inx. Columns that exist inybut notxare joined tox. The number of rows in the joined data is equal to the number of rows inx. If there are some rows inythat don’t match withx, those rows are not included, and so are assignedNAvalues in the output data.
d in column A) is not included in the outputted data on the right. Modified from the Posit dplyr cheat sheet. Image license is CC-BY-SA.
right_join(x, y)-
The opposite of
left_join(). Join all rows and columns inxthat match rows and columns iny. Columns that exist inxbut notyare joined toy. Likeleft_join(), but the opposite, the number of rows in the joined data is equal to the number of rows iny. If there are some rows inxthat don’t match withy, those rows are not included and are given anNAvalue. This function can be useful when used while piping with|>when you are processing data in a sequence of steps and want to continue piping but join a different way thenleft_join().
c in column A) is not included in the outputted data on the right. Modified from the Posit dplyr cheat sheet. Image license is CC-BY-SA.
full_join(x, y)-
Join all rows and columns in
ythat match rows and columns inx. Columns and rows that exist inybut notxare joined tox. A full join keeps all the data from bothxandy. This can be useful if you want to make sure you have all the data from both datasets, even if it results in more missing values in the output data.
👷 This part is being updated. Once ready, it will be shown here.
22.3 Key takeaways
- Knowing how to join datasets together is a fundamental part of data processing and transformation. But it can be also tricky and a source of unexpected issues later on if not done well or not checked.
- Use
left_join(),right_join(), andfull_join()to join two datasets together.
22.4 Code used in session
This lists some, but not all, of the code used in the section. Some code is incorporated into Markdown content, so is harder to automatically list here in a code chunk. The code below also includes the code from the exercises.