Use the powerful (but also very difficult) regular expressions to process and clean character data by making use of the stringr package.
Handle dates and times in R using the lubridate package.
Recognise when you trigger a “non-standard evaluation” error when you make functions using tidyverse functions and fix it by using {{ }}.
9.2💬 Discussion activity: Why can’t we join datasets together yet?
🧑🏫 Teacher note
The issues are related to the:
IDs, where the CGM and sleep data have file_path_id while participant_details.csv has id.
Dates, where the CGM data has device_timestamp, the sleep data has date, and the participant_details.csv has start_date and end_date.
Both the CGM and sleep data is measured at different times, but the same date, so a direct join wouldn’t work.
The participant details dates are in two columns rather than one.
Time: ~8 minutes.
We now have a function that imports all our CGM and sleep data, as well as import_dime() that could import the participant_details.csv file. The next direction we want to go to is to join them in some way so that we can be able to analyse them together. But right now, we can’t really do that. There are two main reasons. Look at the data below and try to identify them—don’t peek ahead! 😁 Then discuss with a neighbour to see if you can come to a consensus.
# A tibble: 1,800 × 4
file_path_id date sleep_type seconds
<chr> <dttm> <chr> <dbl>
1 data-raw/dime/sleep/101.csv 2021-05-24 23:03:00 wake 540
2 data-raw/dime/sleep/101.csv 2021-05-24 23:12:00 light 180
3 data-raw/dime/sleep/101.csv 2021-05-24 23:15:00 deep 1440
4 data-raw/dime/sleep/101.csv 2021-05-24 23:39:00 light 240
5 data-raw/dime/sleep/101.csv 2021-05-24 23:43:00 wake 300
6 data-raw/dime/sleep/101.csv 2021-05-24 23:48:00 light 120
# ℹ 1,794 more rows
Participant details
# A tibble: 60 × 6
id gender age_years intervention start_date end_date
<dbl> <chr> <dbl> <chr> <date> <date>
1 101 woman 45.6 baseline 2021-03-15 2021-03-22
2 101 woman 45.6 low bioactive diet 2021-03-23 2021-04-06
3 101 woman 45.6 high bioactive diet 2021-05-11 2021-05-25
4 102 man 43.7 baseline 2021-03-15 2021-03-22
5 102 man 43.7 high bioactive diet 2021-03-23 2021-04-06
6 102 man 43.7 low bioactive diet 2021-05-11 2021-05-25
# ℹ 54 more rows
For 1 minute, look at the above data and try to identify the reasons why we can’t (easily) join the datasets together right now.
For ~4 minutes, discuss what you’ve thought about with your neighbour
Then for about 3 minutes we will all share and discuss together.
9.3📖 Reading task: Cleaning character data with regular expressions
Time: ~5 minutes.
Before we go into joining datasets together, we have to do a bit of processing first. Specifically, we want to get the participant ID from the file_path_id character data since all the datasets have an ID (in some form). Whenever you are processing and cleaning data, you will very likely encounter and deal with character data. A wonderful package to use for working with character data is called stringr, which we’ll use to get the participant ID from the file_path_id column. The package is called stringr because in programming, a “string” is defined as any collection of character objects. So we will be using “string” from now on.
The main “engine” behind the functions in stringr is called regular expressions (or regex for short). These expressions are powerful, very concise ways of finding patterns in text. But they are also very very difficult to learn, write, and read, even for experienced users. That’s because certain strings like [ or ? have special meanings. For instance, [aeiou] is regex for “find one string in the object that is either a, e, i, o, or u”. The [] in this case means “find the string in between the two brackets”. We won’t cover regex too much in this workshop, but some great resources for learning them are the R for Data Science regex section, the stringr regex page, as well as in the help doc ?regex. There’s also a nice cheat sheet on it. But we will cover some of the more commonly used regex patterns, like:
regex
Meaning
\\d
find one digit (0-9)
\\D
find one non-digit
[0-9]
find one digit between 0 and 9 (the - is a range)
[a-z]
find one letter between a and z
[A-Z]
find one letter between A and Z
[a-zA-Z]
find one letter between a and z or A and Z
\\.
find one dot string
$
find the end of the string, e.g. \\.$ means find a dot that is at the end of the string
^
find the start of the string, e.g. ^\\d means find a digit at the start of the string
?
maybe find the string before the ?, e.g. \\d? means maybe find one digit
*
find the string before it, but maybe many times, e.g. \\d* means find zero or many digits one after the other
+
find the string one or more times before {}, e.g. \\d+ means find one or more digits one after the other
{n}
find the same string n times before {}, e.g. \\d{2} means find two digits one after the other
[:alpha:]
find one letter (a-z or A-Z)
[:digit:]
find one digit (0-9)
[:alnum:]
find one alphanumeric (a-z, A-Z, 0-9)
Note
The \ string is a special string in regex, but in R it means “escape or use this as the string”. For instance, \" in R means “use the " string”, like if you wanted to write a message like:
And think "Analysis has " is the first string, that nearly is an object, and the " finished" is the second string object.
But in regex, the \ string is a special command that can mean different things. For instance, \d means “find a digit” (0-9) and \D means “find a non-digit”. So when using it in R, it would be \\d and \\D so that regex “sees”, e.g., \d
Sticky/hat up!
When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒🎩
9.4💬 Discussion activity: Brainstorm a regex that will get the participant ID
Time: ~8 minutes.
Using what you just read from above, try to think of a regex that would get the participant ID from the file path. Try not to look ahead nor at the potential solutions! 😁😉
When the time is up, we’ll share some ideas and go over what the regex could be (there is actually many different ways to do this). With your neighbour, do the following tasks:
Looking at the file_path_id column, list what is similar in the user ID between rows and what is different.
Discuss and verbally describe (in English, not regex) what text pattern you might use to extract the user ID.
Use the list below to think about how you might convert the English description of the text pattern to a regex. This will probably be very hard, but try anyway.
When strings are written as is, regex will find those strings, e.g. user will find only user.
Use [] to find one possible string of the several between the brackets. E.g. [12] means 1 or 2 or [ab] means “a” or “b”. To find a range of numbers or letters, use - in between the start and end ranges, e.g. [1-3] means 1 to 3 or [a-c] means “a” to “c”.
Use ? if the string might be there or not. E.g. ab? means “a” and maybe “b” follows it or 1[1-2]? means 1 and maybe 1 or 2 will follow it.
Once you’ve done these tasks, we’ll discuss all together and go over what the regex would be to extract the user ID.
Potential regex solutions
[0-9][0-9][0-9]\\.csv$
[0-9]+\\.csv$
[0-9]{3}\\.csv$
\\d+\\.csv$
\\d{3}\\.csv$
[:digit:]+\\.csv$
[:digit:]{3}\\.csv$
For the rest of the workshop, we will use the regex [:digit:]+\\.csv$ to get the participant ID from the file path. We use this one because it is more readable than the others and it allows us to get participant IDs that are one to many digits long.
Teacher note
Make sure to reinforce that while regex is incredibly complicated, there are some basic things you can do with it that are quite powerful.
More or less, this section and exercise are to introduce the idea and concept of regex, but not to really teach it since that is well beyond the scope of this workshop and this time frame.
Go over the solution. The explanation is that the pattern will find anything that has \\.csv$ at the end that is preceded by some numbers.
9.5 Using regular expressions to extract text
Now that we’ve identified a possible regex to use to extract the participant ID, let’s try it out on the cgm_data. Before we do though, let’s make it easier on us to use the cgm_data in our Quarto document by importing it in the setup chunk. So, in the docs/learning.qmd file, go to the top of the file, find the setup chunk, and in the bottom of it below the library() functions, add the following code:
Then we can run this code chunk by using Ctrl-EnterCtrl-Enter so that we have the cgm_data object available to us to work with. Then we can continue!
We want to create a new column called id for the participant ID rather than modify the current file_path_id column, so we will use the mutate() function from the dplyr package to create this new column. We’ll use the str_extract() function from the stringr package to “extract a string” by using the regex [:digit:]+\\.csv$ that we discussed from the activity above.
We’re also using an argument to mutate() you might not have seen previously, called .before. This will insert the new id column before the column we use and we do this entirely for visual reasons, since it is easier to see the newly created column when we run the code. In your docs/learning.qmd file, create a new header called ## Using regex for ID at the bottom of the document, and create a new code chunk below that with Ctrl-Alt-ICtrl-Alt-I or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “new chunk”).
Teacher note
Walk through writing this code, briefly explain/remind how to use mutate, and about the stringr function.
First we’ll pipe cgm_data into mutate() to create a new column called id that will contain the participant ID. We’ll use the str_extract() function to extract the participant ID from the file_path_id column using the regex [:digit:]+\\.csv$.
docs/learning.qmd
# Note: your file paths and data may look slightly different.cgm_data|>mutate( id =str_extract(file_path_id,"[:digit:]+\\.csv$"), .before =file_path_id)
Cool! But we have .csv at the end that we don’t want. We can use the str_remove() function to remove the .csv from the string. So we can pipe our output into that function:
docs/learning.qmd
cgm_data|>mutate( id =str_extract(file_path_id,"[:digit:]+\\.csv$")|>str_remove("\\.csv$"), .before =file_path_id)
Nice! Almost there though. The id column is actually set as a character type, but we will need it as an integer to join with participant_details.csv. So we can use the as.integer() function to convert it to an integer:
Finally, we don’t actually want to keep the file_path_id column, so we can use the select() function to drop it. We can do this by using the - operator to drop the column. So we can pipe our output into that function:
We’ve done it! We now have a new column called id that contains the participant ID from the file_path_id column. Now, time to make it into a function!
9.6🧑💻 Exercise: Convert ‘get ID’ code into a function
Time: ~15 minutes.
You now have code that takes the data that has the file_path_id column and gets the participant ID from it. You want to be able to easily use that on the sleep data too. So, as you’ve done, convert this code in the docs/learning.qmd file into a function.
By the end of this exercise, the code to import both the CGM and sleep data should look like this in the setup chunk of the docs/learning.qmd file:
# A tibble: 1,800 × 4
id date sleep_type seconds
<int> <dttm> <chr> <dbl>
1 101 2021-05-24 23:03:00 wake 540
2 101 2021-05-24 23:12:00 light 180
3 101 2021-05-24 23:15:00 deep 1440
4 101 2021-05-24 23:39:00 light 240
5 101 2021-05-24 23:43:00 wake 300
6 101 2021-05-24 23:48:00 light 120
7 101 2021-05-24 23:50:00 rem 1350
8 101 2021-05-25 00:12:30 wake 870
9 101 2021-05-25 00:27:00 rem 360
10 101 2021-05-25 00:33:00 light 210
# ℹ 1,790 more rows
So, convert the code we just wrote into a function by following these steps:
Assign function() to a new named function called get_participant_id and include one argument in function() with the name data.
Put the code we just wrote into the body of the function. Make sure to return() the output at the end of the function and to create Roxygen documentation. Replace all the relevant variables with the data argument.
Add stringr and dplyr as a package dependency by using usethis::use_package() in the Console.
Explicitly link the functions you are using in this new function to their package by using :: (e.g. dplyr:: and stringr::). Remember that you can find which package a function belongs by using the ? help documentation and looking at the very top to see the package name.
Test that the function works by using get_participant_id(cgm_data) or get_participant_id(sleep_data).
After creating the function and testing that it works, move (cut and paste) the function into R/functions.R.
Run styler in both the R/functions.R file and docs/learning.qmd with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “style file”).
In the setup chunk in the docs/learning.qmd file, pipe |> the output of import_csv_files() into the get_participant_id() function for both the CGM and sleep data (as shown above).
Render the docs/learning.qmd file to make sure things remain reproducible with Ctrl-Shift-KCtrl-Shift-K or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “render”).
Restart your R session with Ctrl-Shift-F10Ctrl-Shift-F10 or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “restart”) to make sure later work is in a clean state.
Add and commit the changes to the Git history with Ctrl-Alt-MCtrl-Alt-M or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “commit”). Then push to GitHub.
Sticky/hat up!
When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒🎩
9.7💬 Discussion activity: How to join by dates?
Time: ~10 minutes.
Now that we have the participant ID, we are nearly able to join the CGM data with the sleep data. But before we do that, we need to make sure that the date columns are in the same format. The CGM data has a date column called device_timestamp, which is a date-time column, while the sleep data has a date column called date, which is a date column.
But they also both have different times for when the data was recorded by the devices. We can’t join by exact time, since the CGM data is not recorded at exactly the same time as the sleep data is. So how do we go about joining these two datasets together? There is no right or correct answer as it heavily depends on how you might want to analyse the data later on.
For 1 minute, consider theoretically how you would change the date columns so that they can be joined together.
For 3 minutes, discuss and brainstorm with your neighbour how you might do this and try to come to a consensus.
For 2 minutes, we will all share our ideas all together as a group.
🧑🏫 Teacher note
There is no right answer here, but you can discuss some ideas like:
Create a date column and an hour column, later group by ID, date, and hour, summarise the values from each dataset first, and then join the two datasets together.
Create a date column, group by ID and date, summarise the values, and then join the two datasets together.
We would like to have more data points to work with, so we will go with the first option.
9.8 Finding the right date
Working with dates can be surprisingly incredibly difficult and tricky. There’s timezones, daylight savings time, leap years, and different date formats that all can cause a dozen different problems. In R, there is a wonderful package called lubridate that helps with working with dates and times. It can’t solve all date and time problems, but it can help with a many of them.
Thankfully, lubridate is part of the tidyverse ecosystem, so we don’t need to explicitly load it.
In the docs/learning.qmd file, at the bottom of the document create a new code chunk called ## Working with dates and below that create a new code chunk with Ctrl-Alt-ICtrl-Alt-I or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “new chunk”). We’ll use mutate() again to create some new columns, as well as .before to put it before the device_timestamp.
As we discussed in the activity above, we will create a new date and hour column. lubridate has the function as_date() to convert a column into just a date (dropping time if it has it). It also has as_time(), but since we want to get just the hour, it also has a function to do that too: hour(). There are many other helper functions like minute(), year(), month(), and so on. For now, we will keep the device_timestamp column, so we have it as reference, but we will drop it later on.
cgm_data|>mutate( date =as_date(device_timestamp), hour =hour(device_timestamp), .before =device_timestamp)
Like we normally do, we can convert this into a function because we need the same things from the sleep data too. So, let’s use our workflow we’ve done several times before and try it out.
Since we are using lubridate, let’s add it as a package dependency first:
Use the workflow we’ve done rather than directly writing the function below as is. Meaning, start with making the function structure, cut and paste the code into it, add the package:: to the functions, add the Roxygen documentation, and then test it out as shown in the code below.
docs/learning.qmd
#' Prepare the date columns in DIME CGM and sleep data for joining.#'#' @param data The data that has the datetime column.#' @param column The datetime column to convert to date and hour.#'#' @returns A tibble/data.frame#'prepare_dates<-function(data, column){prepared_dates<-data|>dplyr::mutate( date =lubridate::as_date(column), hour =lubridate::hour(column), .before =column)return(prepared_dates)}cgm_data|>prepare_dates(device_timestamp)
Error in `dplyr::mutate()`:
ℹ In argument: `date = lubridate::as_date(column)`.
Caused by error in `h()`:
! error in evaluating the argument 'x' in selecting a method for function 'as_date': object 'device_timestamp' not found
Hmmm, we encountered an error! This error is something that you would very quickly encounter as you make functions using tidyverse functions. Read the next section to learn about what is happening.
9.9📖 Reading task: Non-standard evaluation (NSE)
🧑🏫 Teacher note
After they’ve read it, briefly review it again and try to explain what NSE is in a slightly different way. Try to emphasise that NSE is built-in to so many functions in R and that using {{ }} with tidyverse functions can help with using it in their own functions.
Time: ~5 minutes.
As you have noticed, we encountered an error when trying to write, what we think, is correct code in our function. This error, "object not found" is not easy to figure out. The error occurs because of something called “non-standard evaluation” (or NSE). NSE is a feature of R and is used quite a lot throughout R (e.g. library()). The tidyverse packages in particular use it quite a lot. It is a concept that isn’t commonly found in other programming language.
In R, NSE is what allows you to use formulas (e.g. y ~ x + x2) in statistical models or allows you to type out select(Gender, BMI) or library(purrr). In “standard evaluation”, these would instead be select("Gender", "BMI") or library("purrr"). So NSE gives flexibility and ease of use for the user (we don’t have to type quotes every time) when doing data analysis, but can be quite a headache when trying to program in R, like when making functions. There’s more detail about this on the dplyr website, which will give a few options to deal with NSE while programming with tidyverse packages. The simplest way to use NSE in your own functions when using tidyverse packages is to wrap the variable with {{ }} (called “curly-curly”). For example, using a simplified version of the code we wrote above, instead of this:
By wrapping column with {{ }}, we are telling R that we are giving it an unquoted variable and to give that variable to the as_date() function as the name of the column. It gets very technical very quickly, but the main thing to remember is that when you make functions that use tidyverse functions and you get this type of error, it is likely because of NSE. So you can use {{ }} to wrap the variable name in your function.
Sticky/hat up!
When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒🎩
9.10🧑💻 Exercise: Using NSE in your function
Time: ~10 minutes.
Using what you just read about NSE, fix the prepare_dates() to correctly use the column argument.
Before fixing the function though, you’ll need to fix the sleep_data date column. We made prepare_dates() to create a new column called date and hour, but the sleep data already has a column called date. But, this name is incorrect, since it actually contains datetime data. So you need to rename it to datetime. To do this, you can use the rename() function from the dplyr package. So, in the setup code chunk in docs/learning.qmd, add a pipe after get_participant_id() into rename(datetime = date), like:
Wrap column with {{ }} the three times it is used inside the function body.
Test that the function works by using prepare_dates(cgm_data, device_timestamp) and prepare_dates(sleep_data, datetime).
Cut and paste the function into R/functions.R.
Run styler on both R/functions.R and docs/learning.qmd with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “style file”).
Render the docs/learning.qmd file to make sure things remain reproducible with Ctrl-Shift-KCtrl-Shift-K or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “render”).
Restart your R session with Ctrl-Shift-F10Ctrl-Shift-F10 or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “restart”) so that your environment is cleaned up and empty.
Add and commit the changes to the Git history with Ctrl-Alt-MCtrl-Alt-M or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “commit”) and then push to GitHub.
Sticky/hat up!
When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒🎩
9.11 Key takeaways
Teacher note
Quickly cover this and get them to do the survey before moving on to the discussion activity.
While very difficult to learn and use, regular expressions (regex or regexp) are incredibly powerful at processing character (also called strings) data. Some strings have special meanings like [], ?, *, and + that can be used to find patterns in string data.
Use the stringr package to work with string data as it has many functions to help use regex to process string data, like str_extract() or str_remove().
Use the lubridate package when working with dates and times, as there are many functions in it to help simplify date processing, like as_date() or hour().
Non-standard evaluation (NSE) is a great feature of R but also can be frustrating. When using tidyverse functions in your own function, use {{}} to wrap the variable that gives the error to fix the problem and use NSE correctly.
9.12💬 Discussion activity: How have you worked with dates and characters?
Time: ~10 minutes.
As we prepare for the next thing (either lunch or the next session), get up (and if it’s nice, go outside), and discuss with your neighbour or group the following questions:
Have you worked with characters and/or dates before in your work? How was that experience? Easy, hard?
Can you see some useful ways you could use regex in your work?
9.13 Code used in session
This lists some, but not all, of the code used in the section. Some code is incorporated into Markdown content, so is harder to automatically list here in a code chunk. The code below also includes the code from the exercises.
cgm_data<-here("data-raw/dime/cgm")|>import_csv_files()# Note: your file paths and data may look slightly different.cgm_data|>mutate( id =str_extract(file_path_id,"[:digit:]+\\.csv$"), .before =file_path_id)cgm_data|>mutate( id =str_extract(file_path_id,"[:digit:]+\\.csv$")|>str_remove("\\.csv$"), .before =file_path_id)cgm_data|>mutate( id =str_extract(file_path_id,"[:digit:]+\\.csv$")|>str_remove("\\.csv$")|>as.integer(), .before =file_path_id)cgm_data|>mutate( id =str_extract(file_path_id,"[:digit:]+\\.csv$")|>str_remove("\\.csv$")|>as.integer(), .before =file_path_id)|>select(-file_path_id)#' Get the participant ID from the file path column.#'#' @param data Data with `file_path_id` column.#'#' @return A data.frame/tibble.#'get_participant_id<-function(data){data_with_id<-data|>dplyr::mutate( id =stringr::str_extract(file_path_id,"[:digit:]+\\.csv$")|>stringr::str_remove("\\.csv$")|>as.integer(), .before =file_path_id)|>dplyr::select(-file_path_id)return(data_with_id)}# CGM datacgm_data<-here("data-raw/dime/cgm")|>import_csv_files()|>get_participant_id()# Sleep datasleep_data<-here("data-raw/dime/sleep")|>import_csv_files()|>get_participant_id()cgm_datasleep_data#' Prepare the date columns in DIME CGM and sleep data for joining.#'#' @param data The data that has the datetime column.#' @param column The datetime column to convert to date and hour.#'#' @returns A tibble/data.frame#'prepare_dates<-function(data, column){prepared_dates<-data|>dplyr::mutate( date =lubridate::as_date(column), hour =lubridate::hour(column), .before =column)return(prepared_dates)}cgm_data|>prepare_dates(device_timestamp)sleep_data<-here("data-raw/dime/sleep")|>import_csv_files()|>get_participant_id()|>rename(datetime =date)#' Prepare the date columns in DIME CGM and sleep data for joining.#'#' @param data The data that has the datetime column.#' @param column The datetime column to convert to date and hour.#'#' @returns A tibble/data.frame#'prepare_dates<-function(data, column){prepared_dates<-data|>dplyr::mutate( date =lubridate::as_date({{column}}), hour =lubridate::hour({{column}}), .before ={{column}})return(prepared_dates)}cgm_datasleep_data