8 Processing and joining datasets for cleaning
Here we will continue using the “Workflow” block and start moving over to the third block, “Create project data”, in Figure 8.1.
8.1 Learning objectives
The overall learning outcome for this session is to:
- Describe a few ways of joining data and processing character data, and apply them to an existing sequence of data processing.
Specific objectives are to:
- Describe what regular expressions are in very simple terms, and then use a regular expression on character data.
- Describe a few ways that data can be joined and identify which join is appropriate for a given situation. Then apply one join type to two datasets to create a single dataset.
- Demonstrate how to use functionals to repeatedly join more than two datasets together.
- Apply the `case_when()` function in situations that require nested conditionals (if-else).
- Use the `usethis::use_data()` function to save the final, fully joined dataset as an `.Rda` file in `data/`.
8.2 Processing character data
8.3 Exercise: Brainstorm a regex that will match for the user ID
Time: 10 minutes.
In your groups, do these tasks. Try not to look ahead or peek at the solution section 😉! When the time is up, we'll share some ideas and go over what the regex will be.
- Looking at the `file_path_id` column, list what is similar in the user ID between rows and what is different.
- Discuss and verbally describe (in English, not regex) what text pattern you might use to extract the user ID.
- Use the list below to think about how you might convert the English description of the text pattern to a regex. This will probably be very hard, but try anyway.
  - When characters are written as is, regex will find those characters, e.g. `user` will find only `user`.
  - Use `[]` to find one possible character of the several between the brackets, e.g. `[12]` means 1 or 2 and `[ab]` means "a" or "b". To find a range of numbers or letters, use `-` between the start and end of the range, e.g. `[1-3]` means 1 to 3 and `[a-c]` means "a" to "c".
  - Use `?` if the preceding character might or might not be there, e.g. `ab?` means "a" optionally followed by "b", and `1[1-2]?` means 1 optionally followed by 1 or 2.
Once you’ve done these tasks, we’ll discuss all together and go over what the regex would be to extract the user ID.
8.4 Using regular expressions to extract text
Now that we've identified a possible regex to use to extract the user ID, let's test it out on the `user_info_df` data. Once it works, we will convert it into a function and move (cut and paste) it into the `R/functions.R` file.
Since we will create a new column for the user ID, we will use the `mutate()` function from the dplyr package, together with the `str_extract()` function from the stringr package to "extract a string" using the regex `user_[1-9][0-9]?` that we discussed in the exercise. Since we're going to use stringr, let's add it as a package dependency by typing this in the Console:
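The exact call isn't shown here, but with the usethis workflow used throughout this course it is likely:

```r
usethis::use_package("stringr")
```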
We're also using an argument to `mutate()` you might not have seen previously, called `.before`. This will insert the new `user_id` column before the column we specify. We do this entirely for visual reasons, since it makes it easier to see the newly created column when we run the code. In your `doc/learning.qmd` file, create a new header called `## Using regex for user ID` at the bottom of the document, and create a new code chunk below that.
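The code in that chunk will likely look something like this sketch, using `mutate()`, `str_extract()`, and the regex from the exercise:

```r
user_info_df |>
  mutate(
    # Extract "user_" followed by one or two digits from the file path
    user_id = str_extract(file_path_id, "user_[1-9][0-9]?"),
    # Only for visual reasons: insert the new column before file_path_id
    .before = file_path_id
  )
```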
#> # A tibble: 22 × 6
#> user_id file_path_id gender weight height age
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 user_1 data-raw/mmash/user_1/user_in… M 65 169 29
#> 2 user_10 data-raw/mmash/user_10/user_i… M 85 180 27
#> 3 user_11 data-raw/mmash/user_11/user_i… M 115 186 27
#> 4 user_12 data-raw/mmash/user_12/user_i… M 67 170 27
#> 5 user_13 data-raw/mmash/user_13/user_i… M 74 180 25
#> 6 user_14 data-raw/mmash/user_14/user_i… M 64 171 27
#> 7 user_15 data-raw/mmash/user_15/user_i… M 80 180 24
#> 8 user_16 data-raw/mmash/user_16/user_i… M 67 176 27
#> 9 user_17 data-raw/mmash/user_17/user_i… M 60 175 24
#> 10 user_18 data-raw/mmash/user_18/user_i… M 80 180 0
#> # ℹ 12 more rows
Since we don't need the `file_path_id` column anymore, let's drop it using `select()` and `-`.
doc/learning.qmd
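A sketch of the code that produces the output below, extending the previous chunk with `select()`:

```r
user_info_df |>
  mutate(
    user_id = str_extract(file_path_id, "user_[1-9][0-9]?"),
    .before = file_path_id
  ) |>
  # Drop the column we no longer need
  select(-file_path_id)
```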
#> # A tibble: 22 × 5
#> user_id gender weight height age
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 user_1 M 65 169 29
#> 2 user_10 M 85 180 27
#> 3 user_11 M 115 186 27
#> 4 user_12 M 67 170 27
#> 5 user_13 M 74 180 25
#> 6 user_14 M 64 171 27
#> 7 user_15 M 80 180 24
#> 8 user_16 M 67 176 27
#> 9 user_17 M 60 175 24
#> 10 user_18 M 80 180 0
#> # ℹ 12 more rows
8.5 Exercise: Convert ID extractor code into a function
Time: 15 minutes.
We now have code that takes data containing the `file_path_id` column and extracts the user ID from it. First step: while in the `doc/learning.qmd` file, convert this code into a function, using the same process you've done previously.
Use this code as a guide to help complete the exercise tasks below:
doc/learning.qmd
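The guide code isn't shown here, but a minimal scaffold would look something like this, with blanks for you to fill in:

```r
___ <- function(___) {
  ___ <- ___
  return(___)
}
```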
- Call the new function `extract_user_id` and add one argument called `imported_data`.
  - Remember to output the code into an object and `return()` it at the end of the function.
  - Include Roxygen documentation.
- After writing it and testing that the function works, move (cut and paste) the function into `R/functions.R`.
- Run styler while in the `R/functions.R` file with the Palette (Ctrl-Shift-P, then type "style file").
- Replace the code in the `doc/learning.qmd` file with the function name so it looks like `extract_user_id(user_info_df)`, restart the R session, source everything with `source()` using Ctrl-Shift-S or with the Palette (Ctrl-Shift-P, then type "source"), and run the new function in the code chunk inside the `doc/learning.qmd` file to test that it works. This should automatically run the `setup` code chunk; if it doesn't, run that chunk manually.
- Knit / render the `doc/learning.qmd` file to make sure things remain reproducible with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type "render").
- Add and commit the changes to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type "commit").
Click for the solution. Only click if you are struggling or are out of time.
#' Extract user ID from data with file path column.
#'
#' @param imported_data Data with `file_path_id` column.
#'
#' @return A data.frame/tibble.
#'
extract_user_id <- function(imported_data) {
extracted_id <- imported_data |>
dplyr::mutate(
user_id = stringr::str_extract(
file_path_id,
"user_[0-9][0-9]?"
),
.before = file_path_id
) |>
dplyr::select(-file_path_id)
return(extracted_id)
}
# This tests that it works:
extract_user_id(user_info_df)
8.6 Modifying existing functions as part of the processing workflow
Now that we’ve created a new function to extract the user ID from the file path variable, we need to actually use it within our processing pipeline. Since we want this function to work on all the datasets that we will import, we need to add it to the import_multiple_files()
function. We’ll go to the import_multiple_files()
function in R/functions.R
and use the |>
to add it after using the list_rbind()
function. The code should look something like:
R/functions.R
import_multiple_files <- function(file_pattern, import_function) {
data_files <- fs::dir_ls(here::here("data-raw/mmash/"),
regexp = file_pattern,
recurse = TRUE
)
combined_data <- purrr::map(data_files, import_function) |>
purrr::list_rbind(names_to = "file_path_id") |>
extract_user_id() # Add the function here
return(combined_data)
}
We'll re-source the functions with `source()` using Ctrl-Shift-S or with the Palette (Ctrl-Shift-P, then type "source"). Then re-run the pieces of code you wrote during the exercise in Section 7.3 to update them based on the new code in the `import_multiple_files()` function. Add this to your `doc/learning.qmd` file for now.
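Those pieces from Section 7.3 were likely along these lines (the `import_*` function names here are placeholders for the import functions you wrote in that exercise):

```r
user_info_df <- import_multiple_files("user_info.csv", import_user_info)
saliva_df <- import_multiple_files("saliva.csv", import_saliva)
rr_df <- import_multiple_files("RR.csv", import_rr)
actigraph_df <- import_multiple_files("Actigraph.csv", import_actigraph)
```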
We also need to update `summarised_rr_df` and `summarised_actigraph_df` to use `user_id` instead of `file_path_id`:
doc/learning.qmd
summarised_rr_df <- rr_df |>
group_by(user_id, day) |> # change file_path_id to user_id here
summarise(
across(ibi_s, list(
mean = \(x) mean(x, na.rm = TRUE),
sd = \(x) sd(x, na.rm = TRUE)
)),
.groups = "drop"
)
summarised_actigraph_df <- actigraph_df |>
group_by(user_id, day) |> # change file_path_id to user_id here
# These statistics will probably be different for you
summarise(
across(hr, list(
mean = \(x) mean(x, na.rm = TRUE),
sd = \(x) sd(x, na.rm = TRUE)
)),
.groups = "drop"
)
Let's render the `doc/learning.qmd` document using Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type "render") to make sure everything still runs fine. Then, add and commit all the changed files into the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type "commit").
8.7 Join datasets together
Let's code this together, using `reduce()`, `full_join()`, and `list()` while in the `doc/learning.qmd` file.
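As a sketch, joining just the first two datasets produces the output below:

```r
reduce(list(user_info_df, saliva_df), full_join)
```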
#> Joining with `by = join_by(user_id)`
#> # A tibble: 43 × 8
#> user_id gender weight height age samples cortisol_norm
#> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 user_1 M 65 169 29 before sleep 0.0341
#> 2 user_1 M 65 169 29 wake up 0.0779
#> 3 user_10 M 85 180 27 before sleep 0.0370
#> 4 user_10 M 85 180 27 wake up 0.0197
#> 5 user_11 M 115 186 27 before sleep 0.0406
#> 6 user_11 M 115 186 27 wake up 0.0156
#> 7 user_12 M 67 170 27 before sleep 0.156
#> 8 user_12 M 67 170 27 wake up 0.145
#> 9 user_13 M 74 180 25 before sleep 0.0123
#> 10 user_13 M 74 180 25 wake up 0.0342
#> # ℹ 33 more rows
#> # ℹ 1 more variable: melatonin_norm <dbl>
We now have the data in a form that makes sense to join with the other datasets. So let's try it:
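A sketch of adding the summarised RR dataset to the list:

```r
reduce(list(user_info_df, saliva_df, summarised_rr_df), full_join)
```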
#> Joining with `by = join_by(user_id)`
#> Joining with `by = join_by(user_id)`
#> # A tibble: 86 × 11
#> user_id gender weight height age samples cortisol_norm
#> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 user_1 M 65 169 29 before sleep 0.0341
#> 2 user_1 M 65 169 29 before sleep 0.0341
#> 3 user_1 M 65 169 29 wake up 0.0779
#> 4 user_1 M 65 169 29 wake up 0.0779
#> 5 user_10 M 85 180 27 before sleep 0.0370
#> 6 user_10 M 85 180 27 before sleep 0.0370
#> 7 user_10 M 85 180 27 wake up 0.0197
#> 8 user_10 M 85 180 27 wake up 0.0197
#> 9 user_11 M 115 186 27 before sleep 0.0406
#> 10 user_11 M 115 186 27 before sleep 0.0406
#> # ℹ 76 more rows
#> # ℹ 4 more variables: melatonin_norm <dbl>, day <dbl>,
#> # ibi_s_mean <dbl>, ibi_s_sd <dbl>
Hmm, but wait, we now have four rows for each user, when we should have only two, one for each day. By looking at each dataset we joined, we can see that `saliva_df` doesn't have a `day` column and instead has a `samples` column. We'll need to add a `day` column in order to join properly with the RR dataset. For this, we'll learn about using nested conditionals.
8.8 Cleaning with nested conditionals
While still in the `doc/learning.qmd` file, we can use the `case_when()` function to set `"before sleep"` as day 1 and `"wake up"` as day 2 by creating a new column called `day`. (We will use `NA_real_` because the other `day` columns are numeric, not integer.)
doc/learning.qmd
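The chunk probably looks something like this (the object name `saliva_with_day_df` matches how the result is referred to later in the chapter):

```r
saliva_with_day_df <- saliva_df |>
  mutate(
    day = case_when(
      samples == "before sleep" ~ 1,
      samples == "wake up" ~ 2,
      # NA_real_ keeps the column numeric, like the other day columns
      .default = NA_real_
    )
  )
saliva_with_day_df
```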
#> # A tibble: 42 × 5
#> user_id samples cortisol_norm melatonin_norm day
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 user_1 before sleep 0.0341 0.0000000174 1
#> 2 user_1 wake up 0.0779 0.00000000675 2
#> 3 user_10 before sleep 0.0370 0.00000000867 1
#> 4 user_10 wake up 0.0197 0.00000000257 2
#> 5 user_11 before sleep 0.0406 0.00000000204 1
#> 6 user_11 wake up 0.0156 0.00000000965 2
#> 7 user_12 before sleep 0.156 0.00000000354 1
#> 8 user_12 wake up 0.145 0.00000000864 2
#> 9 user_13 before sleep 0.0123 0.00000000190 1
#> 10 user_13 wake up 0.0342 0.00000000230 2
#> # ℹ 32 more rows
Now, let's use `reduce()` with `full_join()` again:
doc/learning.qmd
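A sketch of the full join, now with all four datasets and the new `saliva_with_day_df`:

```r
reduce(
  list(
    user_info_df,
    saliva_with_day_df,
    summarised_rr_df,
    summarised_actigraph_df
  ),
  full_join
)
```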
#> Joining with `by = join_by(user_id)`
#> Joining with `by = join_by(user_id, day)`
#> Joining with `by = join_by(user_id, day)`
#> # A tibble: 47 × 13
#> user_id gender weight height age samples cortisol_norm
#> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 user_1 M 65 169 29 before sleep 0.0341
#> 2 user_1 M 65 169 29 wake up 0.0779
#> 3 user_10 M 85 180 27 before sleep 0.0370
#> 4 user_10 M 85 180 27 wake up 0.0197
#> 5 user_11 M 115 186 27 before sleep 0.0406
#> 6 user_11 M 115 186 27 wake up 0.0156
#> 7 user_12 M 67 170 27 before sleep 0.156
#> 8 user_12 M 67 170 27 wake up 0.145
#> 9 user_13 M 74 180 25 before sleep 0.0123
#> 10 user_13 M 74 180 25 wake up 0.0342
#> # ℹ 37 more rows
#> # ℹ 6 more variables: melatonin_norm <dbl>, day <dbl>,
#> # ibi_s_mean <dbl>, ibi_s_sd <dbl>, hr_mean <dbl>, hr_sd <dbl>
We now have two rows per participant! Let's add and commit the changes to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type "commit").
8.9 Wrangling data into final form
Now that we've got several datasets processed and joined, it's time to bring it all together and put it into the `data-raw/mmash.R` script so we can create a final working dataset.
Open up the `data-raw/mmash.R` file and, at the top of the file, move the `library(fs)` code to go with the other packages. It should look something like this now:
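The top of the script would look roughly like this (the exact set of `library()` and `source()` calls depends on what you added to the script earlier in the course):

```r
library(here)
library(fs)
library(tidyverse)
# Load the functions we've written so far
source(here("R/functions.R"))
```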
Next, since we have altered `import_multiple_files()` to output `user_id` instead of `file_path_id`, we'll need to update how we `group_by()` when creating `summarised_rr_df` and `summarised_actigraph_df`.
data-raw/mmash.R
summarised_rr_df <- rr_df |>
group_by(user_id, day) |>
summarise(
across(ibi_s, list(
mean = \(x) mean(x, na.rm = TRUE),
sd = \(x) sd(x, na.rm = TRUE)
)),
.groups = "drop"
)
summarised_actigraph_df <- actigraph_df |>
group_by(user_id, day) |>
summarise(
across(hr, list(
mean = \(x) mean(x, na.rm = TRUE),
sd = \(x) sd(x, na.rm = TRUE)
)),
.groups = "drop"
)
Go into the `doc/learning.qmd` file and cut the code used to create the `saliva_with_day_df`, as well as the code to `full_join()` all the datasets together with `reduce()`, and paste it at the bottom of the `data-raw/mmash.R` script. Assign the output to a new object called `mmash`, like this:
data-raw/mmash.R
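Based on the description above, the pasted code with the new assignment would look something like:

```r
mmash <- reduce(
  list(
    user_info_df,
    saliva_with_day_df,
    summarised_rr_df,
    summarised_actigraph_df
  ),
  full_join
)
```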
#> Joining with `by = join_by(user_id)`
#> Joining with `by = join_by(user_id, day)`
#> Joining with `by = join_by(user_id, day)`
Lastly, we have to save this final dataset into the `data/` folder. We'll use the function `usethis::use_data()` to create the folder and save the data as an `.rda` file. We'll add this code to the very bottom of the script:
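Given the `overwrite = TRUE` argument described just below, the line is likely:

```r
usethis::use_data(mmash, overwrite = TRUE)
```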
#> ✔ Setting active project to
#> "/home/runner/work/r-cubed-intermediate/r-cubed-intermediate".
#> ✔ Saving "mmash" to "data/mmash.rda".
#> ☐ Document your data (see <https://r-pkgs.org/data.html>).
We're adding `overwrite = TRUE` so that every time we re-run this script, the dataset will be saved again. Alternatively, we could (but won't in this course) save it as a `.csv` file with:
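For instance, something like this (the file path is an assumption for illustration):

```r
readr::write_csv(mmash, here::here("data/mmash.csv"))
```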
And later load it in with `read_csv()` (since it is so fast). Alright, we're finished creating this dataset! Let's generate it by:
- First running styler with the Palette (Ctrl-Shift-P, then type "style file").
- Restarting the R session with Ctrl-Shift-F10 or with the Palette (Ctrl-Shift-P, then type "restart").
- Sourcing the `data-raw/mmash.R` script with Ctrl-Shift-S or with the Palette (Ctrl-Shift-P, then type "source").
We now have a final dataset to start working on! The main way to load it is with `load(here::here("data/mmash.rda"))`. Go into the `doc/learning.qmd` file and delete everything again, except for the YAML header and `setup` code chunk, so that we are ready for the next session. Lastly, add and commit all the changes, including the final `mmash.rda` data file, to the Git history using Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type "commit").
8.10 Summary
- While very difficult to learn and use, regular expressions (regex or regexp) are incredibly powerful at processing character data.
- Use `left_join()`, `right_join()`, and `full_join()` to join two datasets together.
- Use the functional `reduce()` to iteratively apply a function to a set of items in order to end up with one item (e.g. join more than two datasets into one final dataset).
- Use `case_when()` instead of nesting multiple "if else" conditions whenever you need to do slightly more complicated conditionals.