19  Processing strings with regular expressions

We’ve read a lot of our data into R so far, but it now sits in several individual data frames that we want to analyse together. If you recall the discussion activity from the introduction session, we want a single data frame that has variables like HR and BVP alongside the nurses’ stress levels and other details. That way, we can analyse the relationships between these variables, for example, the relationship between HR and stress. To get there, we need to join the data frames together, but the question is, how do we do that?

19.1 Learning objectives

  1. Use the powerful (but also very difficult) regular expressions to process and clean character data by making use of the stringr package.

19.2 🧑‍💻 Exercise: Inspect and compare the data frames

Time: ~5 minutes.

You haven’t yet read in the survey-results.csv.gz file, which is the main file that connects everything else together. So, read it into R and compare the data values and structure to the HR data. Try to identify the issues that prevent you from joining the datasets together right now. Use this code as a scaffold.

hr_data <- read_all("HR.csv.gz")

survey_data <- ___("___") |>
  ___()

hr_data
survey_data
  1. In your docs/learning.qmd file, go to the bottom of the file and create a new header called ## Inspecting the data. Then create a new code chunk below that with Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type “new chunk”).
  2. Copy and paste the code scaffold above into the code chunk you created.
  3. Fill in the blanks to read() in the data-raw/nurses-stress/survey-results.csv.gz file and assign it to an object called survey_data.
  4. Run the code to see the two data frames in the Console and compare them to identify the issues that prevent you from joining them together right now.
Caution: Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩

19.3 💬 Discussion activity: Why can’t we join datasets together yet?

The issues stopping us from joining the datasets together are:

  • HR-type data doesn’t have an id, but does have a file_path_id with the participant ID contained within it.
  • HR-type data has a single collection_datetime column, but the survey data splits time across the start_time, end_time, and date columns.
  • The survey data date is a character, not a date.
  • The survey data time columns have the smallest unit of time in minutes but the HR-type data has the smallest unit of time in seconds.
  • The HR-type data is in long format but the survey data is in wide format.

Time: ~6 minutes.

You’ve inspected the data frames to try and identify why we can’t join them together right now. With a neighbour, discuss what you’ve found and see if you can come to a consensus on what the issues are. Then we will share together.

  1. For about 3 minutes, discuss what you’ve found when comparing the datasets with your neighbour
  2. Then for about 3 minutes we will all share and discuss together.

19.4 📖 Reading task: Cleaning character data with regular expressions

Time: ~5 minutes.

Before we go into joining datasets together, we have to do a bit of processing first. Specifically, we want to get the participant ID from the file_path_id character data since all the datasets have an ID (in some form). Whenever you are processing and cleaning data, you will very likely encounter and deal with character data. A wonderful package to use for working with character data is called stringr, which we’ll use to get the participant ID from the file_path_id column. The package is called stringr because in programming, a “string” is a sequence of characters (what R calls character data). So we will use the term “string” from now on.

The main “engine” behind the functions in stringr is called regular expressions (or regex for short). You’ve already used regex when you used dir_ls() with the regexp argument in the previous session. These expressions are powerful, very concise ways of finding patterns in text. But they are also very difficult to learn, write, and read, even for experienced users. That’s because certain strings like [ or ? have special meanings. For instance, [aeiou] is regex for “find one string in the object that is either a, e, i, o, or u”. The [] in this case means “find any one of the strings listed between the two brackets”. We won’t cover regex too much in this workshop, but some great resources for learning them are the R for Data Science regex section, the stringr regex page, as well as the help doc ?regex. There’s also a nice cheat sheet on it. But we will cover some of the more commonly used regex patterns, like those in the table below:

A list of some common regex patterns and their meanings.

regex      Meaning
\\d        find one digit (0-9)
\\D        find one non-digit
[0-9]      find one digit between 0 and 9 (the - is a range)
[a-z]      find one letter between a and z (lowercase)
[A-Z]      find one letter between A and Z (uppercase)
[a-zA-Z]   find one letter between a and z or A and Z
\\.        find one literal dot (.)
$          find the end of the string, e.g. \\.$ means find a dot that is at the end of the string
^          find the start of the string, e.g. ^\\d means find a digit at the start of the string
?          optionally find the string before the ?, e.g. \\d? means find zero or one digit
*          find the string before the * zero or more times, e.g. \\d* means find zero or more digits one after the other
+          find the string before the + one or more times, e.g. \\d+ means find one or more digits one after the other
{n}        find the string before the {} exactly n times, e.g. \\d{2} means find two digits one after the other
[:alpha:]  find one letter (a-z or A-Z)
[:digit:]  find one digit (0-9)
[:alnum:]  find one alphanumeric (a-z, A-Z, 0-9)
[:upper:]  find one uppercase letter (A-Z)
[:lower:]  find one lowercase letter (a-z)
(A|B)      find either A or B (| means “or” and the () groups them), e.g. (\\D|\\d) means find either a non-digit or a digit
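
To get a feel for a few of the patterns in the table, here is a small sketch that applies them with str_detect() and str_extract() from stringr. The strings in x are made up for illustration and are not part of the workshop data.

```r
library(stringr)

# Made-up strings, only for trying out the patterns from the table.
x <- c("abc", "a1.", "42", "Nurse 7")

# \\d: is there at least one digit anywhere in the string?
str_detect(x, "\\d")
# [1] FALSE  TRUE  TRUE  TRUE

# ^\\d: does the string start with a digit?
str_detect(x, "^\\d")
# [1] FALSE FALSE  TRUE FALSE

# \\.$: does the string end with a literal dot?
str_detect(x, "\\.$")
# [1] FALSE  TRUE FALSE FALSE

# \\d+: extract the first run of one or more digits (NA when there is none).
str_extract(x, "\\d+")
# [1] NA   "1"  "42" "7"

# [a-z]{2}: extract the first two lowercase letters in a row.
str_extract(x, "[a-z]{2}")
# [1] "ab" NA   NA   "ur"
```

Note that str_extract() returns NA for strings where the pattern is not found, which is a handy signal that a pattern does not fit all of your data.
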
Note

The \ string is a special string in regex, but in R it means “escape or use this literally as the string”. For instance, \" in R means “use the " string”, like if you wanted to write a message like:

print("Analysis has \"nearly\" finished.")
[1] "Analysis has \"nearly\" finished."

Without the \, R would see this instead:

print("Analysis has "nearly" finished.")

R would then think that "Analysis has " is the first string, that nearly is an object, and that " finished." is the second string.

But in regex, the \ string is a special command that can mean different things. For instance, \d means “find a digit” (0-9) and \D means “find a non-digit”. So in R you write \\d and \\D: R turns each \\ into a single \, and the regex engine then “sees” \d and \D.
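
If the double backslash feels confusing, it can help to look at what R actually stores. The sketch below uses a made-up string ("Room 12") and also shows raw strings, an alternative available since R 4.0.0 that this workshop does not otherwise use.

```r
library(stringr)

# In R source code we type two backslashes...
pattern <- "\\d"

# ...but the string itself holds a single backslash followed by "d".
nchar(pattern)
# [1] 2
writeLines(pattern)
# \d

# So the regex engine receives \d and finds the first digit.
str_extract("Room 12", pattern)
# [1] "1"

# Raw strings (available since R 4.0.0) let you type the regex as-is.
identical(r"(\d)", pattern)
# [1] TRUE
```
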

Caution: Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩

19.5 💬 Discussion activity: Brainstorm a regex for getting the participant ID

Time: ~8 minutes.

Using what you just read from above, try to think of a regex that would get the participant ID from the file path. Try not to look ahead nor at the potential solutions! 😁 😉

When the time is up, we’ll share some ideas and go over what the regex could be (there are a few different ways to do this). With your neighbour, do the following tasks:

  1. Looking at the file_path_id column, list what is similar in the participant ID between rows and what is different.
  2. Discuss and verbally describe (in English, not regex) what text pattern you might use to extract the participant ID.
  3. Using what you just read about regex and the tips below, think about how you might convert the English description of the text pattern to a regex. This will probably be very hard, but try anyway.
    • Use [] to find one possible string of the several between the brackets. E.g. [12] means 1 or 2, and [ab] means “a” or “b”. To find a range of numbers or letters, use - in between the start and end of the range, e.g. [1-3] means 1 to 3 and [a-c] means “a” to “c”.
    • Use {n} to find exactly n occurrences of the previous pattern. E.g. a{2} means “aa” or \\d{3} means three digits in a row.

Once you’ve done these tasks, we’ll discuss all together and go over what the regex could be to extract the participant ID.

  • (\\d|\\D)+
  • (\\d|\\D){2}
  • [0-9A-Z]+
  • [0-9A-Z]{2}
  • [:alnum:]+
  • [:alnum:]{2}
  • ([:upper:]|[:digit:]){2}

To be very explicit and certain, we could add /stress/ to the start of the regex and / to the end to make certain we get the correct part of the path that has the ID. The simplest one from above is [:alnum:]{2} since it is easier to read compared to the others. With the /stress/ and / added, it would be /stress/[:alnum:]{2}/. Adding those things at the start and end does mean we will need to have a couple other steps to remove those after extracting the ID. But there are nice functions from stringr to help out.
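
To see why the extra /stress/ and / matter, here is a small sketch comparing the two patterns. The path is hypothetical: the folder layout mimics the one in the data, but the "15_example" part is made up, so your real paths will look different.

```r
library(stringr)

# Hypothetical path: the folder layout mimics the one in the data, but the
# "15_example" part is made up for this illustration.
path <- "data-raw/nurses-stress/stress/15/15_example/HR.csv.gz"

# On its own, the pattern matches the first two letters or digits it meets.
str_extract(path, "[:alnum:]{2}")
# [1] "da"

# Anchored by the surrounding folders, it finds the ID (plus the extra text).
str_extract(path, "/stress/[:alnum:]{2}/")
# [1] "/stress/15/"
```
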

Make sure to reinforce that while regex is incredibly complicated, there are some basic things you can do with it that are quite powerful.

More or less, this section and exercise are to introduce the idea and concept of regex, but not to really teach it since that is well beyond the scope of this workshop and this time frame.

Go over the solution. The explanation is that the pattern finds exactly two characters, each either a letter or a digit, that are preceded by /stress/ and followed by /.

19.6 Using regular expressions to extract text

Now that we’ve identified a possible regex to use to extract the participant ID, let’s try it out on the hr_data.

We want to create a new column called id for the participant ID rather than modify the current file_path_id column, so we will use the mutate() function from the dplyr package to create this new column. We’ll use the str_extract() function from the stringr package to “extract a string” by using the regex /stress/[:alnum:]{2}/ that we discussed from the activity above. We’ll also use a few str_remove() functions to remove the /stress/ and / from the extracted string so that we just have the ID.

We’re also using an argument to mutate() you might not have seen previously, called .before. This inserts the new id column before the column we name (here, file_path_id). We do this entirely for visual reasons, since it is easier to see the newly created column when we run the code. In your docs/learning.qmd file, create a new header called ## Using regex for ID at the bottom of the document, and create a new code chunk below that with Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type “new chunk”).
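
As a quick illustration of what .before does, the sketch below compares the column order with and without it. The example data frame and its path and value columns are made up for this purpose only.

```r
library(dplyr)

# A tiny made-up data frame, only to show where the new column ends up.
example <- tibble(
  path = c("a/01/", "a/02/"),
  value = c(1.5, 2.5)
)

# Default: the new column is added at the end.
example |>
  mutate(id = c("01", "02")) |>
  names()
# [1] "path"  "value" "id"

# With .before: the new column is placed in front of `path`.
example |>
  mutate(id = c("01", "02"), .before = path) |>
  names()
# [1] "id"    "path"  "value"
```
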

Walk through writing this code, briefly explain/remind how to use mutate, and about the stringr function.

First we’ll pipe hr_data into mutate() to create a new column called id that will contain the participant ID. We’ll start with using the str_extract() function to extract the participant ID from the file_path_id column using the regex /stress/[:alnum:]{2}/.

docs/learning.qmd
# Note: your file paths and data may look slightly different.
hr_data |>
  mutate(
    id = str_extract(
      file_path_id,
      "/stress/[:alnum:]{2}/"
    ),
    .before = file_path_id
  )
# A tibble: 57,183 × 4
   id          file_path_id                    collection_datetime    hr
   <chr>       <chr>                           <dttm>              <dbl>
 1 /stress/15/ data-raw/nurses-stress/stress/… 2020-07-07 16:43:07  83.7
 2 /stress/15/ data-raw/nurses-stress/stress/… 2020-07-07 16:43:20  78.2
 3 /stress/15/ data-raw/nurses-stress/stress/… 2020-07-07 16:43:41  72.5
 4 /stress/15/ data-raw/nurses-stress/stress/… 2020-07-07 16:43:43  73.2
 5 /stress/15/ data-raw/nurses-stress/stress/… 2020-07-07 16:44:18  77.8
 6 /stress/15/ data-raw/nurses-stress/stress/… 2020-07-07 16:44:35  83.7
 7 /stress/15/ data-raw/nurses-stress/stress/… 2020-07-07 16:44:37  84.1
 8 /stress/15/ data-raw/nurses-stress/stress/… 2020-07-07 16:44:44  85.2
 9 /stress/15/ data-raw/nurses-stress/stress/… 2020-07-07 16:44:52  83.7
10 /stress/15/ data-raw/nurses-stress/stress/… 2020-07-07 16:45:02  83.1
# ℹ 57,173 more rows

Cool! But we have the /stress/ and / that we don’t want. We can use the str_remove() function to remove both from the string. So we can pipe our output into that function (twice):

docs/learning.qmd
hr_data |>
  mutate(
    id = str_extract(
      file_path_id,
      "/stress/[:alnum:]{2}/"
    ) |>
      str_remove("/stress/") |>
      str_remove("/"),
    .before = file_path_id
  )
# A tibble: 57,183 × 4
   id    file_path_id                          collection_datetime    hr
   <chr> <chr>                                 <dttm>              <dbl>
 1 15    data-raw/nurses-stress/stress/15/15_… 2020-07-07 16:43:07  83.7
 2 15    data-raw/nurses-stress/stress/15/15_… 2020-07-07 16:43:20  78.2
 3 15    data-raw/nurses-stress/stress/15/15_… 2020-07-07 16:43:41  72.5
 4 15    data-raw/nurses-stress/stress/15/15_… 2020-07-07 16:43:43  73.2
 5 15    data-raw/nurses-stress/stress/15/15_… 2020-07-07 16:44:18  77.8
 6 15    data-raw/nurses-stress/stress/15/15_… 2020-07-07 16:44:35  83.7
 7 15    data-raw/nurses-stress/stress/15/15_… 2020-07-07 16:44:37  84.1
 8 15    data-raw/nurses-stress/stress/15/15_… 2020-07-07 16:44:44  85.2
 9 15    data-raw/nurses-stress/stress/15/15_… 2020-07-07 16:44:52  83.7
10 15    data-raw/nurses-stress/stress/15/15_… 2020-07-07 16:45:02  83.1
# ℹ 57,173 more rows

Nice! Almost there. We don’t actually want to keep the file_path_id column, so we can use the select() function to drop it. We can do this by using the - operator to drop the column. So we can pipe our output into that function:

docs/learning.qmd
hr_data |>
  mutate(
    id = str_extract(
      file_path_id,
      "/stress/[:alnum:]{2}/"
    ) |>
      str_remove("/stress/") |>
      str_remove("/"),
    .before = file_path_id
  ) |>
  select(-file_path_id)
# A tibble: 57,183 × 3
   id    collection_datetime    hr
   <chr> <dttm>              <dbl>
 1 15    2020-07-07 16:43:07  83.7
 2 15    2020-07-07 16:43:20  78.2
 3 15    2020-07-07 16:43:41  72.5
 4 15    2020-07-07 16:43:43  73.2
 5 15    2020-07-07 16:44:18  77.8
 6 15    2020-07-07 16:44:35  83.7
 7 15    2020-07-07 16:44:37  84.1
 8 15    2020-07-07 16:44:44  85.2
 9 15    2020-07-07 16:44:52  83.7
10 15    2020-07-07 16:45:02  83.1
# ℹ 57,173 more rows

We’ve done it! We now have a new column called id that contains the participant ID from the file_path_id column. Now, time to make it into a function!

19.7 🧑‍💻 Exercise: Convert ‘get ID’ code into a function

Time: ~10 minutes.

You now have code that takes the data that has the file_path_id column and gets the participant ID from it. You want to be able to easily use that on any of the HR-type data frames. So, as you’ve done before, convert this code in the docs/learning.qmd file into a function.

By the end of this exercise, the code to read in any of the HR-type data should look like this:

hr_data |>
  get_participant_id()
# Or with one of the other datasets:
read_all("BVP.csv.gz") |>
  get_participant_id()

So, convert the code we just wrote into a function by following these steps:

  1. Assign function() to a new named function called get_participant_id and include one argument in function() with the name data.
  2. Put the code we just wrote into the body of the function. Make sure to return() the output at the end of the function and to create Roxygen documentation. Replace all the relevant variables with the data argument.
  3. Add stringr and dplyr as package dependencies by using usethis::use_package() in the Console.
  4. Explicitly link the functions you are using in this new function to their package by using :: (e.g. dplyr:: and stringr::). Remember that you can find which package a function belongs to by using the ? help documentation and looking at the very top to see the package name.
  5. Test that the function works by using get_participant_id(hr_data).
  6. After creating the function and testing that it works, move (cut and paste) the function into R/functions.R.
  7. Run styler in both the R/functions.R file and docs/learning.qmd with the Palette (Ctrl-Shift-P, then type “style file”).
  8. Render the docs/learning.qmd file to make sure things remain reproducible with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”).
  9. Restart your R session with Ctrl-Shift-F10 or with the Palette (Ctrl-Shift-P, then type “restart”) to make sure later work is in a clean state.
  10. Add and commit the changes to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”). Then push to GitHub.
Caution: Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩

#' Get the participant ID from the file path column.
#'
#' @param data Data with `file_path_id` column.
#'
#' @returns A data frame/tibble.
#'
get_participant_id <- function(data) {
  data_with_id <- data |>
    dplyr::mutate(
      id = stringr::str_extract(
        file_path_id,
        "/stress/[:alnum:]{2}/"
      ) |>
        stringr::str_remove("/stress/") |>
        stringr::str_remove("/"),
      .before = file_path_id
    ) |>
    dplyr::select(-file_path_id)
  return(data_with_id)
}

# How it will be used.
get_participant_id(hr_data)
# A tibble: 57,183 × 3
   id    collection_datetime    hr
   <chr> <dttm>              <dbl>
 1 15    2020-07-07 16:43:07  83.7
 2 15    2020-07-07 16:43:20  78.2
 3 15    2020-07-07 16:43:41  72.5
 4 15    2020-07-07 16:43:43  73.2
 5 15    2020-07-07 16:44:18  77.8
 6 15    2020-07-07 16:44:35  83.7
 7 15    2020-07-07 16:44:37  84.1
 8 15    2020-07-07 16:44:44  85.2
 9 15    2020-07-07 16:44:52  83.7
10 15    2020-07-07 16:45:02  83.1
# ℹ 57,173 more rows
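
One possible way to check the function without loading the full dataset is to run it on a tiny data frame. This is only a sketch: toy_data and its paths are made up (including the "15_example" and "6B_example" parts), and the function is copied from the solution above so the block runs on its own.

```r
# Copied from the solution above so that this block runs on its own.
get_participant_id <- function(data) {
  data_with_id <- data |>
    dplyr::mutate(
      id = stringr::str_extract(
        file_path_id,
        "/stress/[:alnum:]{2}/"
      ) |>
        stringr::str_remove("/stress/") |>
        stringr::str_remove("/"),
      .before = file_path_id
    ) |>
    dplyr::select(-file_path_id)
  return(data_with_id)
}

# Made-up paths: two follow the expected layout, one does not.
toy_data <- dplyr::tibble(
  file_path_id = c(
    "data-raw/nurses-stress/stress/15/15_example/HR.csv.gz",
    "data-raw/nurses-stress/stress/6B/6B_example/HR.csv.gz",
    "somewhere/else/HR.csv.gz"
  ),
  hr = c(83.7, 78.2, 72.5)
)

checked <- get_participant_id(toy_data)
checked$id
# [1] "15" "6B" NA
names(checked)
# [1] "id" "hr"
```

The NA in the third row shows what happens when a path does not contain the /stress/ folder pattern: no error is raised, so it can be worth checking for missing IDs after running the function.
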

19.8 🧑‍💻 Extra exercise: Simplify the regex with look arounds

Time: ~8 minutes.

If you finished the previous exercise early, you can try this more challenging exercise. Try to simplify the regex so that it only extracts the two alphanumeric characters without also getting the /stress/ and /. That way, you won’t need to use the str_remove() functions to remove those parts afterwards. This requires looking in the regex documentation for something called look arounds. Modify the internals of get_participant_id() to use these look arounds so that you only use one str_extract() function and no str_remove() functions. The output should be the same as before, but the code should be simpler.

  1. Look up the documentation for regex look arounds and try to understand how they work.
  2. Modify the regex in the str_extract() function to use look arounds so that it finds the two alphanumeric characters that are preceded by /stress/ (look behind) and followed by / (look ahead), without including those in the output.
  3. Remove the two str_remove() functions since they should no longer be needed.
  4. Test out the updates to the function by source()’ing the R/functions.R file with Ctrl-Shift-S or with the Palette (Ctrl-Shift-P, then type “source”) and running get_participant_id(hr_data) in the Console.
  5. Once it works, add and commit the changes to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”). Then push to GitHub.
get_participant_id <- function(data) {
  data_with_id <- data |>
    dplyr::mutate(
      id = stringr::str_extract(
        file_path_id,
        "(?<=/stress/)[:alnum:]{2}(?=/)"
      ),
      .before = file_path_id
    ) |>
    dplyr::select(-file_path_id)
  return(data_with_id)
}

# How it will be used.
get_participant_id(hr_data)
# A tibble: 57,183 × 3
   id    collection_datetime    hr
   <chr> <dttm>              <dbl>
 1 15    2020-07-07 16:43:07  83.7
 2 15    2020-07-07 16:43:20  78.2
 3 15    2020-07-07 16:43:41  72.5
 4 15    2020-07-07 16:43:43  73.2
 5 15    2020-07-07 16:44:18  77.8
 6 15    2020-07-07 16:44:35  83.7
 7 15    2020-07-07 16:44:37  84.1
 8 15    2020-07-07 16:44:44  85.2
 9 15    2020-07-07 16:44:52  83.7
10 15    2020-07-07 16:45:02  83.1
# ℹ 57,173 more rows
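
If look arounds still feel abstract, this small sketch shows the same idea on a simpler, made-up string (not from the workshop data): the text inside (?<=...) and (?=...) has to be present, but is left out of the extracted match.

```r
library(stringr)

# Made-up text, only to show what each look around does.
x <- "id=15; hr=83"

# Without look arounds, the surrounding text is part of the match.
str_extract(x, "id=\\d+;")
# [1] "id=15;"

# (?<=id=) is a look behind: "id=" must come right before the match.
# (?=;) is a look ahead: ";" must come right after the match.
# Neither is included in what gets extracted.
str_extract(x, "(?<=id=)\\d+(?=;)")
# [1] "15"
```
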
Caution: Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩

19.9 Key takeaways

Quickly cover this and get them to do the survey before moving on to the discussion activity.

  • While very difficult to learn and use, regular expressions (regex or regexp) are incredibly powerful at processing character data (also called strings). Some characters, like [], ?, *, and +, have special meanings that can be used to find patterns in string data.
  • Use the stringr package to work with string data as it has many functions to help use regex to process string data, like str_extract() or str_remove().

19.10 Code used in session

This lists some, but not all, of the code used in the section. Some code is incorporated into Markdown content, so is harder to automatically list here in a code chunk. The code below also includes the code from the exercises.

# Note: your file paths and data may look slightly different.
hr_data |>
  mutate(
    id = str_extract(
      file_path_id,
      "/stress/[:alnum:]{2}/"
    ),
    .before = file_path_id
  )
hr_data |>
  mutate(
    id = str_extract(
      file_path_id,
      "/stress/[:alnum:]{2}/"
    ) |>
      str_remove("/stress/") |>
      str_remove("/"),
    .before = file_path_id
  )
hr_data |>
  mutate(
    id = str_extract(
      file_path_id,
      "/stress/[:alnum:]{2}/"
    ) |>
      str_remove("/stress/") |>
      str_remove("/"),
    .before = file_path_id
  ) |>
  select(-file_path_id)
#' Get the participant ID from the file path column.
#'
#' @param data Data with `file_path_id` column.
#'
#' @returns A data frame/tibble.
#'
get_participant_id <- function(data) {
  data_with_id <- data |>
    dplyr::mutate(
      id = stringr::str_extract(
        file_path_id,
        "/stress/[:alnum:]{2}/"
      ) |>
        stringr::str_remove("/stress/") |>
        stringr::str_remove("/"),
      .before = file_path_id
    ) |>
    dplyr::select(-file_path_id)
  return(data_with_id)
}

# How it will be used.
get_participant_id(hr_data)
get_participant_id <- function(data) {
  data_with_id <- data |>
    dplyr::mutate(
      id = stringr::str_extract(
        file_path_id,
        "(?<=/stress/)[:alnum:]{2}(?=/)"
      ),
      .before = file_path_id
    ) |>
    dplyr::select(-file_path_id)
  return(data_with_id)
}

# How it will be used.
get_participant_id(hr_data)