If you find any typos, errors, or places where the text may be improved, please let us know by providing feedback either in the feedback survey (given during class) or by using GitLab.
On GitLab open an issue or submit a merge request by clicking the "Edit this page " button on the side of this page.
9 Processing and joining datasets for cleaning
Here we will continue using the “Workflow” block and start moving over to the third block, “Create final data”, in Figure 9.1.
Figure 9.1: Section of the overall workflow we will be covering.
And your folder and file structure should look like (use fs::dir_tree(recurse = 2)
if you want to check using R):
LearnR3
├── data/
│ ├── mmash.rda
│ └── README.md
├── data-raw/
│ ├── README.md
│ ├── mmash-data.zip
│ ├── mmash/
│ │ ├── user_1
│ │ ├── ...
│ │ └── user_22
│ └── mmash.R
├── doc/
│ ├── README.md
│ └── lesson.Rmd
├── R/
│ ├── functions.R
│ └── README.md
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
└── README.md
9.1 Learning objectives
- Learn what regular expressions are and how to use them on character data.
- Learn about and apply the various ways data can be joined.
- Apply functionals when repeatedly joining multiple datasets.
- Apply the function
case_when()
when you need nested conditionals for cleaning. - Use the
usethis::use_data()
function to save the final, fully joined dataset as an.Rda
file indata/
.
9.2 Processing character data
Take 5 min to read this section before continuing on to the exercise.
Before we go into joining datasets together, we have to do a bit of processing
first. Specifically, we want to get the user ID from the file_path_id
character data. Whenever you are processing and cleaning data, you will very
likely encounter and deal with character data. A wonderful package to use for
working with character data is called stringr, which we’ll use to extract the
user ID from the file_path_id
column.
The main driver behind the functions in stringr are regular expressions (or
regex for short). These expressions are powerful, very concise ways of finding
patterns in text. Because they are so concise, though, they are also very very
difficult to learn, write, and read, even for experienced users. That’s because
certain characters like [
or ?
have special meanings. For instance,
[aeiou]
is regex for “find one character in a string that is either a, e, i, o, or u”.
The []
in this case mean “find the character in between the two brackets”.
We won’t cover regex too much in this course, some great resources for learning
them are the
R for Data Science regex section,
the stringr regex page,
as well as in the help doc ?regex
.
We’ve already used them a bit in the dir_ls()
function with the regexp
argument to find our data files. In the case of the regex in our use of
dir_ls()
, we had wanted to find, for instance, the pattern "user_info.csv"
in all the folders and files.
But in this case, we want to extract the user ID pattern, from user_1
to user_22
.
So how would we go about extracting this pattern?
9.3 Exercise: Brainstorm a regex that will match for the user ID
Time: 10 min
In your groups do these tasks. Try not to look ahead, nor in the solution section 😉! When the time is up, we’ll share some ideas and go over what the regex will be.
- Looking at the
file_path_id
column, list what is similar in the user ID between rows and what is different. - Discuss and verbally describe (in English, not regex) what text pattern you might use to extract the user ID.
- Use the list below to think about how you might convert the English
description of the text pattern to a regex. This will probably be very hard, but
try anyway.
- When characters are written as is, regex will find those characters, e.g.
user
will find onlyuser
. - Use
[]
to find one possible character of the several between the brackets. E.g.[12]
means 1 or 2 or[ab]
means “a” or “b”. To find a range of numbers or letters, use-
in between the start and end ranges, e.g.[1-3]
means 1 to 3 or[a-c]
means “a” to “c”. - Use
?
if the character might be there or not. E.g.ab?
means “a” and maybe “b” follows it or1[1-2]?
means 1 and maybe 1 or 2 will follow it.
- When characters are written as is, regex will find those characters, e.g.
Once you’ve done these tasks, we’ll discuss all together and go over what the regex would be to extract the user ID.
Click for the (possible) solution. Click only if you are really struggling or you are out of time for the exercise.
"user_[1-9][0-9]?"
For instructors: Click for details.
Make sure to reinforce that while regex is incredibly complicated, there are some basic things you can do with it that are quite powerful.
More or less, this section and exercise are to introduce the idea and concept of regex, but not to really teach it since that is well beyond the scope of this course and this time frame.
Go over the solution. Explanation is that the pattern will find anything that hasuser_
followed by a number from 1 to 9 and maybe followed by another number from 0 to 9.
9.4 Using regular expressions to extract text
Now that we’ve identified a possible regex to use to extract the user ID,
let’s test it out on the user_info_df
data.
Once it works, we will convert it into a function and move it into the
R/functions.R
file.
Since we will create a new column for the user ID, we will use the mutate()
function from the dplyr package. We’ll use the str_extract()
function from the
stringr package to “extract a string” by using the regex user_[1-9][0-9]?
that
we discussed from the exercise. We’re also using an argument to mutate()
you might
not have seen previously, called .before
. This will insert the new user_id
column
before the column we use and we do this entirely for visual reasons, since it is
easier to see the newly created column when we run the code.
In your doc/lesson.Rmd
file, create a new header called
## Using regex for user ID
at the bottom of the document, and create a new
code chunk below that.
For instructors: Click for details.
Walk through writing this code, briefly explain/remind how to use mutate, and about the stringr function.
user_info_df <- import_multiple_files("user_info.csv", import_user_info)
# Note: your file paths and data may look slightly different.
user_info_df %>%
mutate(user_id = str_extract(file_path_id,
"user_[1-9][0-9]?"),
.before = file_path_id)
#> # A tibble: 22 × 6
#> user_id file_path_id gender weight height age
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 user_1 data-raw/mmash/user_1/user_info.csv M 65 169 29
#> 2 user_10 data-raw/mmash/user_10/user_info.csv M 85 180 27
#> 3 user_11 data-raw/mmash/user_11/user_info.csv M 115 186 27
#> 4 user_12 data-raw/mmash/user_12/user_info.csv M 67 170 27
#> 5 user_13 data-raw/mmash/user_13/user_info.csv M 74 180 25
#> 6 user_14 data-raw/mmash/user_14/user_info.csv M 64 171 27
#> 7 user_15 data-raw/mmash/user_15/user_info.csv M 80 180 24
#> 8 user_16 data-raw/mmash/user_16/user_info.csv M 67 176 27
#> 9 user_17 data-raw/mmash/user_17/user_info.csv M 60 175 24
#> 10 user_18 data-raw/mmash/user_18/user_info.csv M 80 180 0
#> # … with 12 more rows
Since we don’t need the file_path_id
column anymore, let’s drop it using select()
and -
.
user_info_df %>%
mutate(user_id = str_extract(file_path_id,
"user_[1-9][0-9]?"),
.before = file_path_id) %>%
select(-file_path_id)
#> # A tibble: 22 × 5
#> user_id gender weight height age
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 user_1 M 65 169 29
#> 2 user_10 M 85 180 27
#> 3 user_11 M 115 186 27
#> 4 user_12 M 67 170 27
#> 5 user_13 M 74 180 25
#> 6 user_14 M 64 171 27
#> 7 user_15 M 80 180 24
#> 8 user_16 M 67 176 27
#> 9 user_17 M 60 175 24
#> 10 user_18 M 80 180 0
#> # … with 12 more rows
9.5 Exercise: Convert ID extractor code into a function
Time: 15 min
We now have code that takes the data that has the file_path_id
column
and extracts the user ID from it. First step: While in the doc/lesson.Rmd
file, convert this code into a function, using the same process you’ve done
previously.
- Call the new function
extract_user_id
and add one argument calledimported_data
.- Remember to output the code into an object and
return()
it at the end of the function. - Include Roxygen documentation.
- Remember to output the code into an object and
- After writing it and testing that the function works,
move the function into
R/functions.R
. - Replace the code in the
doc/lesson.Rmd
file with the function name so it looks likeextract_user_id(user_info_df)
, restart the R session, source everything withsource()
(Ctrl-Shift-S
), and run the new function to test that it works. - Knit the
doc/lesson.Rmd
file to make sure things remain reproducible.
Tip: If you don’t know what package a function comes from when you
need to append the package when using ::
, you can find out what the package is
by using the help documentation ?functionname
(can also be done by pressing F1
when the cursor is over the function). The package name is at the very top left
corner, surrounded by { }
.
Use this code as a guide to help complete this exercise:
<- ___(___) {
extract_user_id <- ___ %>%
___ ___mutate(
user_id = ___str_extract(file_path_id,
"user_[0-9][0-9]?"),
.before = file_path_id
%>%
) ___select(-file_path_id)
return(___)
}
# This tests that it works:
# extract_user_id(user_info_df)
Click for the (possible) solution. Click only if you are really struggling or you are out of time for the exercise.
#' Extract user ID from data with file path column.
#'
#' @param imported_data Data with `file_path_id` column.
#'
#' @return A data.frame/tibble.
#'
extract_user_id <- function(imported_data) {
extracted_id <- imported_data %>%
dplyr::mutate(user_id = stringr::str_extract(file_path_id,
"user_[0-9][0-9]?"),
.before = file_path_id) %>%
dplyr::select(-file_path_id)
return(extracted_id)
}
# This tests that it works:
extract_user_id(user_info_df)
9.6 Modifying existing functions as part of the processing workflow
Now that we’ve created a new function to extract the user ID from the file path
variable, we need to actually use it within our processing pipeline.
Since we want this function to work on all the datasets that we will import,
we need to add it to the import_multiple_files()
function.
We’ll go to the import_multiple_files()
function in R/functions.R
and use the %>%
to
add it after using the map_dfr()
function. The code should look something like:
import_multiple_files <- function(file_pattern, import_function) {
data_files <- fs::dir_ls(here::here("data-raw/mmash/"),
regexp = file_pattern,
recurse = TRUE)
combined_data <- purrr::map_dfr(data_files, import_function,
.id = "file_path_id") %>%
extract_user_id() # Add the function here.
return(combined_data)
}
We’ll re-source the functions with source()
(Ctrl-Shift-S
). Then re-run these
pieces of code you wrote during Exercise 8.4 to
update them based on the new code in the import_multiple_files()
function.
Add this to your doc/lesson.Rmd
file for now.
user_info_df <- import_multiple_files("user_info.csv", import_user_info)
saliva_df <- import_multiple_files("saliva.csv", import_saliva)
rr_df <- import_multiple_files("RR.csv", import_rr)
actigraph_df <- import_multiple_files("Actigraph.csv", import_actigraph)
As well as adding the summarised_rr_df
and summarised_actigraph_df
to use
user_id
instead of file_path_id
:
summarised_rr_df <- rr_df %>%
group_by(user_id, day) %>%
summarise(across(ibi_s, list(mean = mean, sd = sd), na.rm = TRUE))
summarised_actigraph_df <- actigraph_df %>%
group_by(user_id, day) %>%
# These statistics will probably be different for you
summarise(across(hr, list(mean = mean, sd = sd), na.rm = TRUE))
Let’s knit the doc/lesson.Rmd
document to make sure everything still
runs fine. Then, add and commit all the changed files into the Git history.
9.7 Join datasets together
For instructors: Click for details.
Walk through and describe these images and the different type of joins after they’ve read it.
Take 10 min to read over this section, until it says to stop. The ability to join datasets together is a fundamental component of data processing and transformation. In our case, we want to add the datasets together so we eventually have preferably one main dataset to work with.
There are many ways to join datasets, the more common ones that are implemented in the dplyr package are:
-
left_join(x, y)
: Join all rows and columns iny
that match rows and columns inx
. Columns that exist iny
but notx
are joined tox
.Figure 9.2: Left joining in dplyr. Modified from the RStudio dplyr cheatsheet.
-
right_join(x, y)
: The opposite ofleft_join()
. Join all rows and columns inx
that match rows and columns iny
. Columns that exist inx
but noty
are joined toy
.Figure 9.3: Right joining in dplyr. Modified from the RStudio dplyr cheatsheet.
-
full_join(x, y)
: Join all rows and columns iny
that match rows and columns inx
. Columns and rows that exist iny
but notx
are joined tox
.Figure 9.4: Full joining in dplyr. Modified from the RStudio dplyr cheatsheet.
In our case, we want to use full_join()
, since we want all the data from both
datasets. This function takes two datasets and lets you indicate which column
to join by using the by
argument. Here, both datasets have the column user_id
so we will join by them.
full_join(user_info_df, saliva_df, by = "user_id")
#> # A tibble: 43 × 8
#> user_id gender weight height age samples cortisol_norm melatonin_norm
#> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 user_1 M 65 169 29 before sleep 0.0341 0.0000000174
#> 2 user_1 M 65 169 29 wake up 0.0779 0.00000000675
#> 3 user_10 M 85 180 27 before sleep 0.0370 0.00000000867
#> 4 user_10 M 85 180 27 wake up 0.0197 0.00000000257
#> 5 user_11 M 115 186 27 before sleep 0.0406 0.00000000204
#> 6 user_11 M 115 186 27 wake up 0.0156 0.00000000965
#> 7 user_12 M 67 170 27 before sleep 0.156 0.00000000354
#> 8 user_12 M 67 170 27 wake up 0.145 0.00000000864
#> 9 user_13 M 74 180 25 before sleep 0.0123 0.00000000190
#> 10 user_13 M 74 180 25 wake up 0.0342 0.00000000230
#> # … with 33 more rows
full_join()
is useful if we want to include all values from both datasets,
as long as each participant (“user”) had data collected from that dataset.
When the two datasets have rows that don’t match, we will get missingness in that
row, but that’s ok in this case.
We also eventually have other datasets to join together later on. Since
full_join()
can only take two datasets at a time, do we then just keep
using full_join()
until all the other datasets are combined?
What if we get more data later on? Well, that’s where more functional programming
comes in. Again, we have a simple goal: For a set of data frames, join them
all together. Here we use another functional programming concept called reduce()
.
Like map()
, which “maps” a function onto a set of items, reduce()
applies a function to each item of a vector or list, each time reducing the set
of items down until only one remains: the output. Let’s use an example with our simple function
add_numbers()
we had created before (but later deleted) and add up 1 to 5.
Since add_numbers()
only takes
two numbers, we have to give it two numbers at a time and repeat until we reach
5.
# Add from 1 to 5
first <- add_numbers(1, 2)
second <- add_numbers(first, 3)
third <- add_numbers(second, 4)
add_numbers(third, 5)
#> [1] 15
Instead, we can use reduce to do the same thing:
reduce(1:5, add_numbers)
#> [1] 15
Figure 9.5 visually shows what is happening within reduce()
.
![A functional that iteratively uses a function on a set of items until only one output remains. Modified from the [RStudio purrr cheatsheet][purrr-cheatsheet].](images/reduce.png)
Figure 9.5: A functional that iteratively uses a function on a set of items until only one output remains. Modified from the RStudio purrr cheatsheet.
If we look at ?reduce
, we see that reduce()
, like map()
, takes either a
vector or a list as an input. Since data frames can only be put together as a list
and not as a vector (a data frame has vectors for columns and so can’t be a vector itself),
we need to combine the datasets together in a list()
and reduce them with full_join()
.
Stop reading here and we will go over it again before continuing to code.
combined_data <- reduce(list(user_info_df, saliva_df), full_join)
#> Joining, by = "user_id"
combined_data
#> # A tibble: 43 × 8
#> user_id gender weight height age samples cortisol_norm melatonin_norm
#> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 user_1 M 65 169 29 before sleep 0.0341 0.0000000174
#> 2 user_1 M 65 169 29 wake up 0.0779 0.00000000675
#> 3 user_10 M 85 180 27 before sleep 0.0370 0.00000000867
#> 4 user_10 M 85 180 27 wake up 0.0197 0.00000000257
#> 5 user_11 M 115 186 27 before sleep 0.0406 0.00000000204
#> 6 user_11 M 115 186 27 wake up 0.0156 0.00000000965
#> 7 user_12 M 67 170 27 before sleep 0.156 0.00000000354
#> 8 user_12 M 67 170 27 wake up 0.145 0.00000000864
#> 9 user_13 M 74 180 25 before sleep 0.0123 0.00000000190
#> 10 user_13 M 74 180 25 wake up 0.0342 0.00000000230
#> # … with 33 more rows
We now have the data in a form that would make sense to join it with the other datasets. So lets try it:
reduce(list(user_info_df, saliva_df, summarised_rr_df), full_join)
#> Joining, by = "user_id"
#> Joining, by = "user_id"
#> # A tibble: 86 × 11
#> user_id gender weight height age samples cortisol_norm melatonin_norm day
#> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 user_1 M 65 169 29 before… 0.0341 0.0000000174 1
#> 2 user_1 M 65 169 29 before… 0.0341 0.0000000174 2
#> 3 user_1 M 65 169 29 wake up 0.0779 0.00000000675 1
#> 4 user_1 M 65 169 29 wake up 0.0779 0.00000000675 2
#> 5 user_10 M 85 180 27 before… 0.0370 0.00000000867 1
#> 6 user_10 M 85 180 27 before… 0.0370 0.00000000867 2
#> 7 user_10 M 85 180 27 wake up 0.0197 0.00000000257 1
#> 8 user_10 M 85 180 27 wake up 0.0197 0.00000000257 2
#> 9 user_11 M 115 186 27 before… 0.0406 0.00000000204 1
#> 10 user_11 M 115 186 27 before… 0.0406 0.00000000204 2
#> # … with 76 more rows, and 2 more variables: ibi_s_mean <dbl>, ibi_s_sd <dbl>
Hmm, but wait, we now have four rows of each user, when we should have
only two, one for each day. By looking at each dataset we joined, we can find that
the saliva_df
doesn’t have a day
column and instead has a samples
column.
We’ll need to add a day column in order to join properly with the RR dataset.
For this, we’ll learn about using nested conditionals.
9.8 Cleaning with nested conditionals
Take 6 min to read through this section, after which we will go over it again.
There are many ways to clean up this particular problem, but probably the
easiest, most explicit, and programmatically accurate way of doing it would be
with the function case_when()
.
This function works by providing it with a series of logical conditions and an
associated output if the condition is true. Each condition is processed
sequentially, meaning that if a condition is TRUE, the output won’t be overridden
for later conditions. The general form of case_when()
looks like:
case_when(
variable1 == condition1 ~ output,
variable2 == condition2 ~ output,
# (Optional) Otherwise
TRUE ~ final_output
)
The optional ending is only necessary if you want a certain output if none of your
conditions are met. Because conditions are processed sequentially and because
it is the last condition, by setting it as TRUE
the final output will used.
If this last TRUE
condition is not used then by default, the final output
would be a missing value. A (silly) example using age might be:
case_when(
age > 20 ~ "old",
age <= 20 ~ "young",
# For final condition
TRUE ~ "unborn!"
)
If instead you want one of the conditions to be NA
, you need to set the appropriate
NA
value:
case_when(
age > 20 ~ "old",
age <= 20 ~ NA_character_,
# For final condition
TRUE ~ "unborn!"
)
Alternatively, if we want missing age values to output NA
values at the end
(instead of "unborn!"
), we would exclude the final condition:
case_when(
age > 20 ~ "old",
age <= 20 ~ "young"
)
With dplyr functions like case_when()
,
it requires you be explicit about the type of output each condition has since
all the outputs must match (e.g. all character or all numeric).
This prevents you from accidentally mixing e.g. numeric output with character
output. Missing values also have data types:
-
NA_character_
(character) -
NA_real_
(numeric) -
NA_integer_
(integer)
Assuming the final output is NA
, in a pipeline this would look like how you
normally would use mutate()
:
user_info_df %>%
mutate(age_category = case_when(
age > 20 ~ "old",
age <= 20 ~ "young"
))
#> # A tibble: 22 × 6
#> user_id gender weight height age age_category
#> <chr> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 user_1 M 65 169 29 old
#> 2 user_10 M 85 180 27 old
#> 3 user_11 M 115 186 27 old
#> 4 user_12 M 67 170 27 old
#> 5 user_13 M 74 180 25 old
#> 6 user_14 M 64 171 27 old
#> 7 user_15 M 80 180 24 old
#> 8 user_16 M 67 176 27 old
#> 9 user_17 M 60 175 24 old
#> 10 user_18 M 80 180 0 young
#> # … with 12 more rows
Ok, please stop reading and we will code together again.
For instructors: Click for details.
Briefly review the content again, to reinforce what they read.
By using the case_when()
function, we can set "before sleep"
as day 1 and
"wake up"
as day 2 by creating a new column called day
. (We will use
NA_real_
because the other day
columns are numeric, not integer.)
saliva_with_day_df <- saliva_df %>%
mutate(day = case_when(
samples == "before sleep" ~ 1,
samples == "wake up" ~ 2
))
saliva_with_day_df
#> # A tibble: 42 × 5
#> user_id samples cortisol_norm melatonin_norm day
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 user_1 before sleep 0.0341 0.0000000174 1
#> 2 user_1 wake up 0.0779 0.00000000675 2
#> 3 user_10 before sleep 0.0370 0.00000000867 1
#> 4 user_10 wake up 0.0197 0.00000000257 2
#> 5 user_11 before sleep 0.0406 0.00000000204 1
#> 6 user_11 wake up 0.0156 0.00000000965 2
#> 7 user_12 before sleep 0.156 0.00000000354 1
#> 8 user_12 wake up 0.145 0.00000000864 2
#> 9 user_13 before sleep 0.0123 0.00000000190 1
#> 10 user_13 wake up 0.0342 0.00000000230 2
#> # … with 32 more rows
…Now, let’s use the reduce()
with full_join()
again:
reduce(list(user_info_df, saliva_with_day_df, summarised_rr_df), full_join)
#> Joining, by = "user_id"
#> Joining, by = c("user_id", "day")
#> # A tibble: 47 × 11
#> user_id gender weight height age samples cortisol_norm melatonin_norm day
#> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 user_1 M 65 169 29 before… 0.0341 0.0000000174 1
#> 2 user_1 M 65 169 29 wake up 0.0779 0.00000000675 2
#> 3 user_10 M 85 180 27 before… 0.0370 0.00000000867 1
#> 4 user_10 M 85 180 27 wake up 0.0197 0.00000000257 2
#> 5 user_11 M 115 186 27 before… 0.0406 0.00000000204 1
#> 6 user_11 M 115 186 27 wake up 0.0156 0.00000000965 2
#> 7 user_12 M 67 170 27 before… 0.156 0.00000000354 1
#> 8 user_12 M 67 170 27 wake up 0.145 0.00000000864 2
#> 9 user_13 M 74 180 25 before… 0.0123 0.00000000190 1
#> 10 user_13 M 74 180 25 wake up 0.0342 0.00000000230 2
#> # … with 37 more rows, and 2 more variables: ibi_s_mean <dbl>, ibi_s_sd <dbl>
We now have two rows per participant!
9.9 Wrangling data into final form
Now that we’ve got several datasets processed and joined, its now time to
bring it all together and put it into the data-raw/mmash.R
script so we
can create a final working dataset.
Open up the data-raw/mmash.R
file and the top of the file, add the vroom
package to the end of the list of other packages. Move the code library(fs)
to go with the other packages as well. It should look something like this now:
Then, we will comment out the download.file()
, unzip()
, file_delete()
,
and file_move()
code, since we don’t want to download and unzip the data every
single time we run this script. It should look like this:
# Download
mmash_link <- "https://physionet.org/static/published-projects/mmash/multilevel-monitoring-of-activity-and-sleep-in-healthy-people-1.0.0.zip"
# download.file(mmash_link, destfile = here("data-raw/mmash-data.zip"))
# Unzip
# unzip(here("data-raw/mmash-data.zip"),
# exdir = here("data-raw"),
# junkpaths = TRUE)
# unzip(here("data-raw/MMASH.zip"),
# exdir = here("data-raw"))
# Remove/tidy up left over files
# file_delete(here(c("data-raw/MMASH.zip",
# "data-raw/SHA256SUMS.txt",
# "data-raw/LICENSE.txt")))
# file_move(here("data-raw/DataPaper"), here("data-raw/mmash"))
Go into the doc/lesson.Rmd
and cut the code used to create the
saliva_with_day_df
as well as the code to full_join()
all the datasets
together with reduce()
and paste it at the bottom of the data-raw/mmash.R
script. Assign the output into a new variable called mmash
, like this:
saliva_with_day_df <- saliva_df %>%
mutate(day = case_when(
samples == "before sleep" ~ 1,
samples == "wake up" ~ 2,
TRUE ~ NA_real_
))
mmash <- reduce(
list(
user_info_df,
saliva_with_day_df,
summarised_rr_df,
summarised_actigraph_df
),
full_join
)
#> Joining, by = "user_id"
#> Joining, by = c("user_id", "day")
#> Joining, by = c("user_id", "day")
Lastly, we have to save this final dataset into the data/
folder. We’ll use the
function usethis::use_data()
to create the folder and save the data as an .rda
file. We’ll add this code to the very bottom of the script:
usethis::use_data(mmash, overwrite = TRUE)
We’re adding overwrite = TRUE
so every time we re-run this script, the dataset
will be saved. If the final dataset is going to be really large, we could (but won’t
in this course) save it as a .csv
file with:
vroom_write(mmash, here("data/mmash.csv"))
And later load it in with vroom()
(since it is so fast).
Alright, we’re finished creating this dataset! Let’s generate it
by:
- Restarting the R session with
Ctrl-Shift-F10
(“Session -> Restart R”). - Sourcing the
data-raw/mmash.R
script withCtrl-Shift-S
(“Code -> Source” or the “Source” button in the top right corner of the script pane).
We now have a final dataset to start working on! The main way to load data is
with load(here::here("data/mmash.rda"))
. Go into the doc/lesson.Rmd
file
and delete everything again, except for the YAML header and setup
code
chunk, so that we are ready for the next session. Lastly, add and commit all
the changes, including adding the final mmash.rda
data file, to the Git
history.
9.10 Summary
For instructors: Click for details.
Quickly cover this before finishing the session and when starting the next session.
- While very difficult to learn and use, regular expressions (regex or regexp) are incredibly powerfully at processing character data
- Use
left_join()
,right_join()
, andfull_join()
to join two datasets together - Use the functional
reduce()
to iteratively apply a function to a set of items in order to end up with one item (e.g. join more than two datasets into one final dataset) - Use
case_when()
instead of nesting multiple “if else” conditions whenever you need to do slightly more complicated conditionals