We’ve read a lot of our data into R so far. But we now have several individual data frames that we want to be able to analyse together. If you recall from the discussion activity we had in the introduction session, we want to have a single data frame that has variables for data like HR and BVP as well as the nurses’ stress levels and other details. That way, we can analyse the relationships between these variables. For example, what is the relationship between HR and stress? That means we want to join the data frames together, but the question is: how do we do that?
Time: ~5 minutes.
You haven’t yet read in the survey-results.csv.gz file, which is the main file that connects everything else together. So, read it into R and compare the data values and structure to the HR data. Try to identify the issues that prevent you from joining the datasets together right now. Use this code as a scaffold.
1. In the docs/learning.qmd file, go to the bottom of the file and create a new header called ## Inspecting the data. Then create a new code chunk below that with Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type “new chunk”).
2. Use read() to read in the data-raw/nurses-stress/survey-results.csv.gz file and assign it to an object called survey_data.

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩
The issues stopping us from joining the datasets together are:
- The HR data doesn’t have a column called id, but does have a file_path_id with the participant ID contained within it.
- The HR data has collection_datetime in long format, while the survey data has start_time, end_time, and date, all in one row in wide format.
- In the survey data, date is a character, not a date.

Time: ~6 minutes.
You’ve inspected the data frames to try and identify why we can’t join them together right now. With a neighbour, discuss what you’ve found and see if you can come to a consensus on what the issues are. Then we will share together.
Time: ~5 minutes.
Before we go into joining datasets together, we have to do a bit of processing first. Specifically, we want to get the participant ID from the file_path_id character data since all the datasets have an ID (in some form). Whenever you are processing and cleaning data, you will very likely encounter and deal with character data. A wonderful package to use for working with character data is called stringr, which we’ll use to get the participant ID from the file_path_id column. The package is called stringr because in programming, a “string” is defined as any collection of character objects. So we will be using “string” from now on.
The main “engine” behind the functions in stringr is called regular expressions (or regex for short). You’ve already used regex when you used dir_ls() with the regexp argument in the previous session. These expressions are powerful, very concise ways of finding patterns in text. But they are also very difficult to learn, write, and read, even for experienced users. That’s because certain strings like [ or ? have special meanings. For instance, [aeiou] is regex for “find one string in the object that is either a, e, i, o, or u”. The [] in this case means “find one of the strings in between the two brackets”. We won’t cover regex too much in this workshop, but some great resources for learning them are the R for Data Science regex section, the stringr regex page, as well as the help doc ?regex. There’s also a nice cheat sheet on it. But we will cover some of the more commonly used regex patterns, like those in the following table:
| regex | Meaning |
|---|---|
| \\d | find one digit (0-9) |
| \\D | find one non-digit |
| [0-9] | find one digit between 0 and 9 (the - is a range) |
| [a-z] | find one letter between a and z (lowercase) |
| [A-Z] | find one letter between A and Z (uppercase) |
| [a-zA-Z] | find one letter between a and z or A and Z |
| \\. | find one dot string |
| $ | find the end of the string, e.g. \\.$ means find a dot that is at the end of the string |
| ^ | find the start of the string, e.g. ^\\d means find a digit at the start of the string |
| ? | maybe find the string before the ?, e.g. \\d? means maybe find one digit |
| * | find the string before it, but maybe many times, e.g. \\d* means find zero or many digits one after the other |
| + | find the string before it one or more times, e.g. \\d+ means find one or more digits one after the other |
| {n} | find the string before the {} exactly n times, e.g. \\d{2} means find two digits one after the other |
| [:alpha:] | find one letter (a-z or A-Z) |
| [:digit:] | find one digit (0-9) |
| [:alnum:] | find one alphanumeric (a-z, A-Z, 0-9) |
| [:upper:] | find one uppercase letter (A-Z) |
| [:lower:] | find one lowercase letter (a-z) |

One more pattern, written outside the table because it contains the | string itself: (A|B) means find either A or B (| means “or” and the () groups them), e.g., (\\D|\\d) means find either a non-digit or a digit.
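To get a feel for a few of the patterns listed above, here is a minimal sketch that tries them on some made-up strings. The vector x and its values are hypothetical and only for illustration; str_detect() and str_extract() are functions from stringr.

```r
library(stringr)

# Hypothetical strings, made up purely for illustration.
x <- c("15_HR.csv", "6B", "stress", "end.")

# \\d finds one digit; str_detect() returns TRUE where a match exists.
has_digit <- str_detect(x, "\\d")
has_digit
# TRUE TRUE FALSE FALSE

# \\d+ finds one or more digits in a row; NA means nothing matched.
first_digits <- str_extract(x, "\\d+")
first_digits
# "15" "6" NA NA

# ^[a-z] finds a lowercase letter at the start of the string.
starts_lower <- str_detect(x, "^[a-z]")

# \\.$ finds a dot at the very end of the string.
ends_with_dot <- str_detect(x, "\\.$")

# [A-Z]{2} finds two uppercase letters one after the other.
two_upper <- str_extract(x, "[A-Z]{2}")
```

Note that str_extract() only returns the first match in each string, which is why "15_HR.csv" gives "15" for \\d+.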
The \ string is a special string in regex, but in R it means “escape or use this literally as the string”. For instance, \" in R means “use the " string”, like if you wanted to write a message like:
print("Analysis has \"nearly\" finished.")
[1] "Analysis has \"nearly\" finished."

Without the \, R would see this instead:

print("Analysis has "nearly" finished.")

And think "Analysis has " is the first string, that nearly is an object, and that " finished." is the second string.

But in regex, the \ string is a special command that can mean different things. For instance, \d means “find a digit” (0-9) and \D means “find a non-digit”. So when using it in R, you would write \\d and \\D so that regex “sees”, e.g., \d.
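One way to check what regex will actually “see” is to compare how R stores a string with how you typed it. The sketch below uses only base R; the object names pattern and message_text are made up for this example.

```r
# In R code, the pattern is typed with two backslashes...
pattern <- "\\d"

# ...but the string itself only holds two characters: one backslash and "d".
nchar(pattern)

# writeLines() shows the string without R's escaping, so it shows what
# regex receives: \d
writeLines(pattern)

# The same idea applies to escaped quotes: print() shows the escapes, while
# writeLines() shows the actual text.
message_text <- "Analysis has \"nearly\" finished."
print(message_text)
writeLines(message_text)
```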
When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩
Time: ~8 minutes.
Using what you just read from above, try to think of a regex that would get the participant ID from the file path. Try not to look ahead nor at the potential solutions! 😁 😉
When the time is up, we’ll share some ideas and go over what the regex could be (there are a few different ways to do this). With your neighbour, do the following tasks:
- Looking at the file_path_id column, list what is similar in the user ID between rows and what is different.
- Use [] to find one possible string of the several between the brackets. E.g. [12] means 1 or 2, and [ab] means “a” or “b”. To find a range of numbers or letters, use - in between the start and end of the range, e.g. [1-3] means 1 to 3 and [a-c] means “a” to “c”.
- Use {n} to find exactly n occurrences of the previous pattern. E.g. a{2} means “aa” and \\d{3} means three digits in a row.

Once you’ve done these tasks, we’ll discuss all together and go over what the regex could be to extract the user ID.
- (\\d|\\D)+
- (\\d|\\D){2}
- [0-9A-Z]+
- [0-9A-Z]{2}
- [:alnum:]+
- [:alnum:]{2}
- ([:upper:]|[:digit:]){2}

To be very explicit and certain, we could add /stress/ to the start of the regex and / to the end to make certain we get the correct part of the path that has the ID. The simplest one from above is [:alnum:]{2} since it is easier to read compared to the others. With the /stress/ and / added, it would be /stress/[:alnum:]{2}/. Adding those things at the start and end does mean we will need to have a couple other steps to remove those after extracting the ID. But there are nice functions from stringr to help out.
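To see why adding /stress/ and / makes the pattern safer, here is a small sketch on a single made-up path. The path value is hypothetical (only shaped like the values in file_path_id), so your real paths will look different.

```r
library(stringr)

# Hypothetical path, only shaped like the real values in file_path_id.
path <- "data-raw/nurses-stress/stress/15/15_example/HR.csv.gz"

# On its own, [:alnum:]{2} matches the first two alphanumerics it finds,
# which here are the start of the folder name, not the ID.
unanchored <- str_extract(path, "[:alnum:]{2}")
unanchored
# "da"

# Surrounding the pattern with /stress/ and / pins it to the ID folder.
anchored <- str_extract(path, "/stress/[:alnum:]{2}/")
anchored
# "/stress/15/"
```

The anchored version finds the right part of the path, at the cost of also returning the /stress/ and / around the ID.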
Make sure to reinforce that while regex is incredibly complicated, there are some basic things you can do with it that are quite powerful.
More or less, this section and exercise are to introduce the idea and concept of regex, but not to really teach it since that is well beyond the scope of this workshop and this time frame.
Go over the solution. The explanation is that the pattern will find anything that has two of a letter and/or a number that is preceded by /stress/ and followed by /.
Now that we’ve identified a possible regex to use to extract the participant ID, let’s try it out on the hr_data.
We want to create a new column called id for the participant ID rather than modify the current file_path_id column, so we will use the mutate() function from the dplyr package to create this new column. We’ll use the str_extract() function from the stringr package to “extract a string” by using the regex /stress/[:alnum:]{2}/ that we discussed from the activity above. We’ll also use a few str_remove() functions to remove the /stress/ and / from the extracted string so that we just have the ID.
We’re also using an argument to mutate() you might not have seen previously, called .before. This will insert the new id column before the column we name and we do this entirely for visual reasons, since it is easier to see the newly created column when we run the code. In your docs/learning.qmd file, create a new header called ## Using regex for ID at the bottom of the document, and create a new code chunk below that with Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type “new chunk”).
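As a quick illustration of what .before does on its own, here is a minimal sketch with a tiny made-up data frame. The toy_data object and its values are hypothetical stand-ins, not the real hr_data.

```r
library(dplyr)

# Hypothetical toy data, standing in for hr_data.
toy_data <- tibble(
  file_path_id = c("a/stress/15/x", "a/stress/15/y"),
  hr = c(83.7, 78.2)
)

# Without .before, mutate() adds the new column at the end.
default_position <- toy_data |>
  mutate(id = "15")
names(default_position)
# "file_path_id" "hr" "id"

# With .before, the new column is placed before the named column.
moved_position <- toy_data |>
  mutate(id = "15", .before = file_path_id)
names(moved_position)
# "id" "file_path_id" "hr"
```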
Walk through writing this code, briefly explain/remind how to use mutate, and about the stringr function.
First we’ll pipe hr_data into mutate() to create a new column called id that will contain the participant ID. We’ll start with using the str_extract() function to extract the participant ID from the file_path_id column using the regex /stress/[:alnum:]{2}/.
docs/learning.qmd
# Note: your file paths and data may look slightly different.
hr_data |>
  mutate(
    id = str_extract(
      file_path_id,
      "/stress/[:alnum:]{2}/"
    ),
    .before = file_path_id
  )

# A tibble: 57,183 × 4
id file_path_id collection_datetime hr
<chr> <chr> <dttm> <dbl>
1 /stress/15/ data-raw/nurses-stress/stress/… 2020-07-07 16:43:07 83.7
2 /stress/15/ data-raw/nurses-stress/stress/… 2020-07-07 16:43:20 78.2
3 /stress/15/ data-raw/nurses-stress/stress/… 2020-07-07 16:43:41 72.5
4 /stress/15/ data-raw/nurses-stress/stress/… 2020-07-07 16:43:43 73.2
5 /stress/15/ data-raw/nurses-stress/stress/… 2020-07-07 16:44:18 77.8
6 /stress/15/ data-raw/nurses-stress/stress/… 2020-07-07 16:44:35 83.7
7 /stress/15/ data-raw/nurses-stress/stress/… 2020-07-07 16:44:37 84.1
8 /stress/15/ data-raw/nurses-stress/stress/… 2020-07-07 16:44:44 85.2
9 /stress/15/ data-raw/nurses-stress/stress/… 2020-07-07 16:44:52 83.7
10 /stress/15/ data-raw/nurses-stress/stress/… 2020-07-07 16:45:02 83.1
# ℹ 57,173 more rows
Cool! But we have the /stress/ and / that we don’t want. We can use the str_remove() function to remove both from the string. So we can pipe our output into that function (twice):
docs/learning.qmd
hr_data |>
  mutate(
    id = str_extract(
      file_path_id,
      "/stress/[:alnum:]{2}/"
    ) |>
      str_remove("/stress/") |>
      str_remove("/"),
    .before = file_path_id
  )

# A tibble: 57,183 × 4
id file_path_id collection_datetime hr
<chr> <chr> <dttm> <dbl>
1 15 data-raw/nurses-stress/stress/15/15_… 2020-07-07 16:43:07 83.7
2 15 data-raw/nurses-stress/stress/15/15_… 2020-07-07 16:43:20 78.2
3 15 data-raw/nurses-stress/stress/15/15_… 2020-07-07 16:43:41 72.5
4 15 data-raw/nurses-stress/stress/15/15_… 2020-07-07 16:43:43 73.2
5 15 data-raw/nurses-stress/stress/15/15_… 2020-07-07 16:44:18 77.8
6 15 data-raw/nurses-stress/stress/15/15_… 2020-07-07 16:44:35 83.7
7 15 data-raw/nurses-stress/stress/15/15_… 2020-07-07 16:44:37 84.1
8 15 data-raw/nurses-stress/stress/15/15_… 2020-07-07 16:44:44 85.2
9 15 data-raw/nurses-stress/stress/15/15_… 2020-07-07 16:44:52 83.7
10 15 data-raw/nurses-stress/stress/15/15_… 2020-07-07 16:45:02 83.1
# ℹ 57,173 more rows
Nice! Almost there. We don’t actually want to keep the file_path_id column, so we can use the select() function to drop it. We can do this by using the - operator to drop the column. So we can pipe our output into that function:
docs/learning.qmd
hr_data |>
  mutate(
    id = str_extract(
      file_path_id,
      "/stress/[:alnum:]{2}/"
    ) |>
      str_remove("/stress/") |>
      str_remove("/"),
    .before = file_path_id
  ) |>
  select(-file_path_id)

# A tibble: 57,183 × 3
id collection_datetime hr
<chr> <dttm> <dbl>
1 15 2020-07-07 16:43:07 83.7
2 15 2020-07-07 16:43:20 78.2
3 15 2020-07-07 16:43:41 72.5
4 15 2020-07-07 16:43:43 73.2
5 15 2020-07-07 16:44:18 77.8
6 15 2020-07-07 16:44:35 83.7
7 15 2020-07-07 16:44:37 84.1
8 15 2020-07-07 16:44:44 85.2
9 15 2020-07-07 16:44:52 83.7
10 15 2020-07-07 16:45:02 83.1
# ℹ 57,173 more rows
We’ve done it! We now have a new column called id that contains the participant ID from the file_path_id column. Now, time to make it into a function!
Time: ~10 minutes.
You now have code that takes the data that has the file_path_id column and gets the participant ID from it. You want to be able to easily use that on any of the HR-type data frames. So, as you’ve done, convert this code in the docs/learning.qmd file into a function.
By the end of this exercise, the code to read in any of the HR-type data should look like this:
hr_data |>
  get_participant_id()
# Or with one of the other datasets:
read_all("BVP.csv.gz") |>
  get_participant_id()

So, convert the code we just wrote into a function by following these steps:

1. Assign function() to a new named function called get_participant_id and include one argument in function() with the name data.
2. Remember to return() the output at the end of the function and to create Roxygen documentation. Replace all the relevant variables with the data argument.
3. Add stringr and dplyr as package dependencies by using usethis::use_package() in the Console.
4. Explicitly link the functions to their packages with :: (e.g. dplyr:: and stringr::). Remember that you can find which package a function belongs to by using the ? help documentation and looking at the very top to see the package name.
5. Test that the function works by running get_participant_id(hr_data).
6. Move the function into R/functions.R.
7. Style the R/functions.R file and docs/learning.qmd with the Palette (Ctrl-Shift-P, then type “style file”).
8. Render the docs/learning.qmd file to make sure things remain reproducible with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”).

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩
#' Get the participant ID from the file path column.
#'
#' @param data Data with `file_path_id` column.
#'
#' @returns A data frame/tibble.
#'
get_participant_id <- function(data) {
  data_with_id <- data |>
    dplyr::mutate(
      id = stringr::str_extract(
        file_path_id,
        "/stress/[:alnum:]{2}/"
      ) |>
        stringr::str_remove("/stress/") |>
        stringr::str_remove("/"),
      .before = file_path_id
    ) |>
    dplyr::select(-file_path_id)
  return(data_with_id)
}

# How it will be used.
get_participant_id(hr_data)

# A tibble: 57,183 × 3
id collection_datetime hr
<chr> <dttm> <dbl>
1 15 2020-07-07 16:43:07 83.7
2 15 2020-07-07 16:43:20 78.2
3 15 2020-07-07 16:43:41 72.5
4 15 2020-07-07 16:43:43 73.2
5 15 2020-07-07 16:44:18 77.8
6 15 2020-07-07 16:44:35 83.7
7 15 2020-07-07 16:44:37 84.1
8 15 2020-07-07 16:44:44 85.2
9 15 2020-07-07 16:44:52 83.7
10 15 2020-07-07 16:45:02 83.1
# ℹ 57,173 more rows
Time: ~8 minutes.
If you finished the previous exercise early, you can try this more challenging exercise. Try to simplify the regex so that it only extracts the two alphanumeric characters without also getting the /stress/ and /. That way, you won’t need to use the str_remove() functions to remove those parts afterwards. This requires looking in the regex documentation about something called look arounds. Modify the internals of get_participant_id() to use these look arounds so that you only use one str_extract() function and no str_remove() functions. The output should be the same as before, but the code should be simpler.
1. Modify the str_extract() function to use look arounds so that it finds the two alphanumeric characters that are preceded by /stress/ (look behind) and followed by / (look ahead), without including those in the output.
2. Remove the str_remove() functions since they should no longer be needed.
3. Test that the function still works by source()’ing the R/functions.R file with Ctrl-Shift-S or with the Palette (Ctrl-Shift-P, then type “source”) and running get_participant_id(hr_data) in the Console.

get_participant_id <- function(data) {
  data_with_id <- data |>
    dplyr::mutate(
      id = stringr::str_extract(
        file_path_id,
        "(?<=/stress/)[:alnum:]{2}(?=/)"
      ),
      .before = file_path_id
    ) |>
    dplyr::select(-file_path_id)
  return(data_with_id)
}

# How it will be used.
get_participant_id(hr_data)

# A tibble: 57,183 × 3
id collection_datetime hr
<chr> <dttm> <dbl>
1 15 2020-07-07 16:43:07 83.7
2 15 2020-07-07 16:43:20 78.2
3 15 2020-07-07 16:43:41 72.5
4 15 2020-07-07 16:43:43 73.2
5 15 2020-07-07 16:44:18 77.8
6 15 2020-07-07 16:44:35 83.7
7 15 2020-07-07 16:44:37 84.1
8 15 2020-07-07 16:44:44 85.2
9 15 2020-07-07 16:44:52 83.7
10 15 2020-07-07 16:45:02 83.1
# ℹ 57,173 more rows
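To make the difference between the two approaches easier to see, here is a small sketch that runs both patterns on a single made-up path. The path value is hypothetical and only shaped like the values in file_path_id.

```r
library(stringr)

# Hypothetical path, only shaped like the real values in file_path_id.
path <- "data-raw/nurses-stress/stress/15/15_example/HR.csv.gz"

# Plain pattern: the surrounding /stress/ and / are part of the match.
with_context <- str_extract(path, "/stress/[:alnum:]{2}/")
with_context
# "/stress/15/"

# Look behind (?<=...) and look ahead (?=...) must be present next to the
# match, but they are not included in what is returned.
id_only <- str_extract(path, "(?<=/stress/)[:alnum:]{2}(?=/)")
id_only
# "15"
```

The look arounds act like conditions on the surrounding text, which is why no str_remove() step is needed afterwards.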
When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩
Quickly cover this and get them to do the survey before moving on to the discussion activity.
- Regex uses special strings such as [], ?, *, and + that can be used to find patterns in string data.
- Use stringr functions together with regex to process string data, for example str_extract() or str_remove().

This lists some, but not all, of the code used in the section. Some code is incorporated into Markdown content, so is harder to automatically list here in a code chunk. The code below also includes the code from the exercises.
# Note: your file paths and data may look slightly different.
hr_data |>
  mutate(
    id = str_extract(
      file_path_id,
      "/stress/[:alnum:]{2}/"
    ),
    .before = file_path_id
  )

hr_data |>
  mutate(
    id = str_extract(
      file_path_id,
      "/stress/[:alnum:]{2}/"
    ) |>
      str_remove("/stress/") |>
      str_remove("/"),
    .before = file_path_id
  )

hr_data |>
  mutate(
    id = str_extract(
      file_path_id,
      "/stress/[:alnum:]{2}/"
    ) |>
      str_remove("/stress/") |>
      str_remove("/"),
    .before = file_path_id
  ) |>
  select(-file_path_id)

#' Get the participant ID from the file path column.
#'
#' @param data Data with `file_path_id` column.
#'
#' @returns A data frame/tibble.
#'
get_participant_id <- function(data) {
  data_with_id <- data |>
    dplyr::mutate(
      id = stringr::str_extract(
        file_path_id,
        "/stress/[:alnum:]{2}/"
      ) |>
        stringr::str_remove("/stress/") |>
        stringr::str_remove("/"),
      .before = file_path_id
    ) |>
    dplyr::select(-file_path_id)
  return(data_with_id)
}

# How it will be used.
get_participant_id(hr_data)

# Simplified version from the challenge exercise, using look arounds.
get_participant_id <- function(data) {
  data_with_id <- data |>
    dplyr::mutate(
      id = stringr::str_extract(
        file_path_id,
        "(?<=/stress/)[:alnum:]{2}(?=/)"
      ),
      .before = file_path_id
    ) |>
    dplyr::select(-file_path_id)
  return(data_with_id)
}

# How it will be used.
get_participant_id(hr_data)