Want to help out or contribute?

If you find any typos, errors, or places where the text may be improved, please let us know by providing feedback either in the feedback survey (given during class), by using GitLab, or directly in this document with hypothes.is annotations.

  • Open an issue or submit a merge request on GitLab.
  • Add an annotation using hypothes.is. To add an annotation, select some text and then click the annotate button on the pop-up menu. To see the annotations of others, click the annotations icon in the upper right-hand corner of the page.

8 Processing datasets for cleaning

Here we will continue making use of the “Workflow” block as we cover the third block, “Create final data” in Figure 8.1.

Figure 8.1: Section of the overall workflow we will be covering.

And your folder and file structure should look like:

LearnR3
├── data/
│   └── README.md
├── data-raw/
│   ├── mmash-data.zip
│   ├── mmash/
│   │  ├── user_1
│   │  ├── ...
│   │  └── user_22
│   └── mmash.R
├── doc/
│   ├── README.md
│   └── lesson.Rmd
├── R/
│   ├── functions.R
│   └── README.md
├── .Rbuildignore
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
└── README.md

8.1 Processing character data

Take 8 min to read this section before we quickly go over it together. When processing data, you likely will encounter and deal with cleaning up character data. A wonderful package to use for working with character data is called stringr. We’ll use that in order to process the file_path_id so that we can get the user ID from it. First, let’s go to the setup code chunk and replace purrr with tidyverse and move library(tidyverse) to the top of the setup code chunk. We also have to add tidyverse as a dependency. Because tidyverse is a large collection of packages, the recommended way to add the dependency is with:

usethis::use_package("tidyverse", type = "Depends")

The main drivers behind the functions in stringr are regular expressions (or regex for short). These expressions are powerful, very concise ways of finding patterns in text. We’ve already used them in the dir_ls() function with the regexp argument to find our data files.

To give an example, the regex ^.*[1-9][0-9]? means “starting from the beginning of the text, match any character zero or more times, up to a number from 1 to 9, optionally followed by another number from 0 to 9”. So, using this on a string like "hi there, it's 30 degrees out", the regex would select "hi there, it's 30". If we break it down:

  • ^ means start of the string.
  • . means match any character once.
  • * means match the previous character zero or more times.
  • [] means match whatever is inside the brackets once.
  • 1-9 (or 0-9) means match any number in the given range.
  • ? means the previous item may or may not be matched.
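As a quick sketch of how this might look in practice (assuming stringr is installed), the example above can be checked interactively with str_extract():

```r
library(stringr)

# Extract the pattern described above from the example string.
extracted <- str_extract("hi there, it's 30 degrees out", "^.*[1-9][0-9]?")
extracted
#> [1] "hi there, it's 30"
```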

Confused? Yeah, regex does that to pretty much everyone, you aren’t alone. While regex can be very powerful, it can also be incredibly difficult to write and work with. We won’t cover it any more in this course, but some great resources are the R for Data Science regex section, the stringr regex page, as well as the help doc ?regex. For now, we will only use it for the simplest purposes possible: to extract user_1 to user_22 from the file_path_id. Please stop here and we’ll go over this together.

For instructors: Click for details

Make sure to reinforce that while regex is incredibly complicated, there are some basic things you can do with it that are quite powerful.

More or less, this section is to introduce the idea and concept of regex, but not to really teach it since that is well beyond the scope of this course and this time frame.

8.2 Exercise: Brainstorm a regex that will match for the user ID

Time: 5 min

Discuss with your neighbour what a potential regex might be to find the ID in the file_path_id. Try not to look ahead 😉 since we will use this regex later on. When the time is up, we’ll share possible ideas.

8.3 Exercise: What is the pipe?

Time: 5 min

For instructors: Click for details

Before starting this exercise, ask how many have used the pipe before. If everyone has, then move on to the next section.

We haven’t used the %>% pipe from the magrittr package yet, but it is used extensively in many R packages, is the foundation of the tidyverse packages, and a pipe will eventually be incorporated into base R itself (rather than coming through magrittr). Because of this, we will make heavy use of it. To make sure everyone is on the same page, please do one of the following:

  • If one of you in your pair doesn’t know what the pipe is, take some time to talk about and explain it (if you know).
  • If neither of you knows, please read the section on it from the beginner course.
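As a quick refresher (a minimal sketch, assuming magrittr is installed): the pipe takes the value on its left and passes it as the first argument of the function on its right, so nested calls can be read left to right.

```r
library(magrittr)

# These two expressions are equivalent: the pipe inserts the
# left-hand side as the first argument of the next function.
nested <- round(mean(c(1.25, 2.75)), 1)
piped <- c(1.25, 2.75) %>%
    mean() %>%
    round(1)
nested
#> [1] 2
piped
#> [1] 2
```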

8.4 Working with character columns

Now that we’ve talked about regex and pipes, let’s start using them. The first thing we’ll do is work with the user_info_df to write code that works, after which we will convert it into a function and move it into the R/functions.R file.

We want to create a new column for the user ID, so we will use the mutate() function from the dplyr package. We’ll use the regex user_[0-9][0-9]? to match for the user ID and we’ll use the str_extract() function from the stringr package. So, in your doc/lesson.Rmd file, create a new header called ## Using regex for user ID at the bottom of the document, and create a new code chunk below that.

For instructors: Click for details

Walk through writing this code, explain about how to use mutate, and about the stringr function.

# Note: your file paths and data may look slightly different.
user_info_df %>% 
    mutate(user_id = str_extract(file_path_id, "user_[0-9][0-9]?"))
#> # A tibble: 22 x 6
#>    file_path_id                         user_id gender weight height   age
#>    <chr>                                <chr>   <chr>   <dbl>  <dbl> <dbl>
#>  1 data-raw/mmash/user_1/user_info.csv  user_1  M          65    169    29
#>  2 data-raw/mmash/user_10/user_info.csv user_10 M          85    180    27
#>  3 data-raw/mmash/user_11/user_info.csv user_11 M         115    186    27
#>  4 data-raw/mmash/user_12/user_info.csv user_12 M          67    170    27
#>  5 data-raw/mmash/user_13/user_info.csv user_13 M          74    180    25
#>  6 data-raw/mmash/user_14/user_info.csv user_14 M          64    171    27
#>  7 data-raw/mmash/user_15/user_info.csv user_15 M          80    180    24
#>  8 data-raw/mmash/user_16/user_info.csv user_16 M          67    176    27
#>  9 data-raw/mmash/user_17/user_info.csv user_17 M          60    175    24
#> 10 data-raw/mmash/user_18/user_info.csv user_18 M          80    180     0
#> # … with 12 more rows

Since we don’t need to keep the file_path_id, let’s drop it using select() and -.

user_info_df %>% 
    mutate(user_id = str_extract(file_path_id, "user_[0-9][0-9]?")) %>% 
    select(-file_path_id)
#> # A tibble: 22 x 5
#>    gender weight height   age user_id
#>    <chr>   <dbl>  <dbl> <dbl> <chr>  
#>  1 M          65    169    29 user_1 
#>  2 M          85    180    27 user_10
#>  3 M         115    186    27 user_11
#>  4 M          67    170    27 user_12
#>  5 M          74    180    25 user_13
#>  6 M          64    171    27 user_14
#>  7 M          80    180    24 user_15
#>  8 M          67    176    27 user_16
#>  9 M          60    175    24 user_17
#> 10 M          80    180     0 user_18
#> # … with 12 more rows

8.5 Exercise: Convert this code into a function

Time: 10 min

We now have code that takes the data that has the file_path_id column and extracts the user ID from it. While in the doc/lesson.Rmd file, convert this code into a function, using the same process you’ve done previously.

  • Call the new function extract_user_id and add one argument called imported_data.
    • Don’t forget to assign the output of the code to an object and add return() at the end with that object inside it.
    • Also include Roxygen documentation.
  • After writing it and testing it, move the function into R/functions.R.
  • Replace the code in the doc/lesson.Rmd file with the function name (e.g. extract_user_id(user_info_df)), load everything with load_all() (Ctrl-Shift-L), and run the new function.

Tip: If you don’t know which package a function comes from (e.g. when you need to prepend the package name with ::), you can find out by using the help documentation ?functionname (which can also be opened by pressing F1 when the cursor is over the function). The package name is in the very top left corner, surrounded by { }.

Since we want this function to work on all data that we import, we should add it to import_multiple_files(). After you’ve created the function, go to the import_multiple_files() function in R/functions.R and use the %>% to add it after using the map_dfr() function. The code should look something like:

import_multiple_files <- function(file_pattern, import_function) {
    data_files <- fs::dir_ls(here::here("data-raw/mmash/"),
                             regexp = file_pattern,
                             recurse = TRUE)
    
    combined_data <- purrr::map_dfr(data_files, import_function,
                                    .id = "file_path_id") %>% 
        extract_user_id()
    return(combined_data)
}

Re-load the functions with load_all() (Ctrl-Shift-L). Then re-run the pieces of code you wrote during Exercise 7.8 to update them based on the new code in the import_multiple_files() function. Something like this should already be somewhere in your doc/lesson.Rmd file.

user_info_df <- import_multiple_files("user_info.csv", import_user_info)
saliva_df <- import_multiple_files("saliva.csv", import_saliva)
rr_df <- import_multiple_files("RR.csv", import_rr)
actigraph_df <- import_multiple_files("Actigraph.csv", import_actigraph)

Lastly, don’t forget to add and commit the files into the Git history.

8.6 Join datasets together

For instructors: Click for details

Walk through and describe these images and the different type of joins.

The ability to join datasets together is a fundamental component of data processing and transformation. In our case, we want to add the datasets together so we eventually have at least one main dataset to work with.

There are many ways to join datasets, the more common ones that are implemented in the dplyr package are:

  • left_join(x, y): Join to x all rows in y that match rows in x. Columns that exist in y but not x are added to x.

    Left joining in dplyr. Modified from the [RStudio dplyr cheatsheet][dplyr-cheatsheet].

    Figure 8.2: Left joining in dplyr. Modified from the RStudio dplyr cheatsheet.

  • right_join(x, y): The opposite of left_join(). Join to y all rows in x that match rows in y. Columns that exist in x but not y are added to y.

    Right joining in dplyr. Modified from the [RStudio dplyr cheatsheet][dplyr-cheatsheet].

    Figure 8.3: Right joining in dplyr. Modified from the RStudio dplyr cheatsheet.

  • full_join(x, y): Join all rows and columns from both x and y, matched where possible. Rows and columns that exist in y but not x are added to x, and vice versa.

    Full joining in dplyr. Modified from the [RStudio dplyr cheatsheet][dplyr-cheatsheet].

    Figure 8.4: Full joining in dplyr. Modified from the RStudio dplyr cheatsheet.

In our case, we want to use full_join(), since we want all the data from both datasets. This function takes two datasets and lets you indicate which column to join by using the by argument. Here, both datasets have the column user_id so we will join by it.

full_join(user_info_df, saliva_df, by = "user_id")
#> # A tibble: 43 x 8
#>    gender weight height   age user_id samples      cortisol_norm melatonin_norm
#>    <chr>   <dbl>  <dbl> <dbl> <chr>   <chr>                <dbl>          <dbl>
#>  1 M          65    169    29 user_1  before sleep        0.0341  0.0000000174 
#>  2 M          65    169    29 user_1  wake up             0.0779  0.00000000675
#>  3 M          85    180    27 user_10 before sleep        0.0370  0.00000000867
#>  4 M          85    180    27 user_10 wake up             0.0197  0.00000000257
#>  5 M         115    186    27 user_11 before sleep        0.0406  0.00000000204
#>  6 M         115    186    27 user_11 wake up             0.0156  0.00000000965
#>  7 M          67    170    27 user_12 before sleep        0.156   0.00000000354
#>  8 M          67    170    27 user_12 wake up             0.145   0.00000000864
#>  9 M          74    180    25 user_13 before sleep        0.0123  0.00000000190
#> 10 M          74    180    25 user_13 wake up             0.0342  0.00000000230
#> # … with 33 more rows

full_join() is useful if we want to include all values from both datasets, as long as each participant (“user”) had data collected from that dataset. When the two datasets have rows that don’t match, we will get missingness in that row, but that’s ok in this case.

We also eventually have other datasets to join together later on. Since full_join() can only take two datasets at a time, do we then just keep using full_join() until all the other datasets are combined? What if we get more data later on? Well, that’s where more functional programming comes in. Again, we have a simple goal: For a set of data frames, join them all together. Here we use another functional programming concept called reduce(). Like map(), which “maps” a function onto a set of items, reduce() applies a function to each item of a vector or list, each time reducing the set of items down until only one remains: the output. Let’s use our simple function add_numbers() from before and add up 1 to 5. Since add_numbers() only takes two numbers, we have to give it two numbers at a time and repeat until we reach 5.

# Add from 1 to 5
first <- add_numbers(1, 2)
second <- add_numbers(first, 3)
third <- add_numbers(second, 4)
add_numbers(third, 5)
#> [1] 15

Instead, we can use reduce to do the same thing:

reduce(1:5, add_numbers)
#> [1] 15

Figure 8.5 visually shows what is happening within reduce().

A functional that iteratively uses a function on a set of items until only one output remains. Modified from the [RStudio purrr cheatsheet][purrr-cheatsheet].

Figure 8.5: A functional that iteratively uses a function on a set of items until only one output remains. Modified from the RStudio purrr cheatsheet.

Since reduce(), like map(), takes either a vector or a list as an input, and since data frames can only be put together as a list (a data frame has vectors for columns and so can’t be a vector itself), we need to combine the datasets together in a list() and reduce them with full_join():

combined_data <- reduce(list(user_info_df, saliva_df), full_join)
#> Joining, by = "user_id"
combined_data
#> # A tibble: 43 x 8
#>    gender weight height   age user_id samples      cortisol_norm melatonin_norm
#>    <chr>   <dbl>  <dbl> <dbl> <chr>   <chr>                <dbl>          <dbl>
#>  1 M          65    169    29 user_1  before sleep        0.0341  0.0000000174 
#>  2 M          65    169    29 user_1  wake up             0.0779  0.00000000675
#>  3 M          85    180    27 user_10 before sleep        0.0370  0.00000000867
#>  4 M          85    180    27 user_10 wake up             0.0197  0.00000000257
#>  5 M         115    186    27 user_11 before sleep        0.0406  0.00000000204
#>  6 M         115    186    27 user_11 wake up             0.0156  0.00000000965
#>  7 M          67    170    27 user_12 before sleep        0.156   0.00000000354
#>  8 M          67    170    27 user_12 wake up             0.145   0.00000000864
#>  9 M          74    180    25 user_13 before sleep        0.0123  0.00000000190
#> 10 M          74    180    25 user_13 wake up             0.0342  0.00000000230
#> # … with 33 more rows

8.7 Summarizing data prior to joining

Functionals and vectorization are an integral component of how R works and they appear throughout many of R’s functions and packages. They are used particularly heavily in the tidyverse packages like dplyr. Let’s get into some more advanced features of dplyr functions that work as functionals.

There are many “verbs” in dplyr, like select(), rename(), mutate(), summarize(), and group_by() (covered in more detail in the Data Management and Wrangling session of the beginner course). The common usage of these verbs is through acting on and directly using the column names. For instance, to select only the age column, you would type out:

user_info_df %>% 
    select(age)
#> # A tibble: 22 x 1
#>      age
#>    <dbl>
#>  1    29
#>  2    27
#>  3    27
#>  4    27
#>  5    25
#>  6    27
#>  7    24
#>  8    27
#>  9    24
#> 10     0
#> # … with 12 more rows

But many dplyr verbs can also take functions as input. When combined with the where() function, you can, for example, select all columns that are numeric:

user_info_df %>% 
    select(where(is.numeric))
#> # A tibble: 22 x 3
#>    weight height   age
#>     <dbl>  <dbl> <dbl>
#>  1     65    169    29
#>  2     85    180    27
#>  3    115    186    27
#>  4     67    170    27
#>  5     74    180    25
#>  6     64    171    27
#>  7     80    180    24
#>  8     67    176    27
#>  9     60    175    24
#> 10     80    180     0
#> # … with 12 more rows

Or, only character columns:

user_info_df %>% 
    select(where(is.character))
#> # A tibble: 22 x 2
#>    gender user_id
#>    <chr>  <chr>  
#>  1 M      user_1 
#>  2 M      user_10
#>  3 M      user_11
#>  4 M      user_12
#>  5 M      user_13
#>  6 M      user_14
#>  7 M      user_15
#>  8 M      user_16
#>  9 M      user_17
#> 10 M      user_18
#> # … with 12 more rows

Likewise, with functions like summarise(), if you want to e.g. calculate the mean of a column, you would usually type out:

saliva_df %>% 
    summarise(cortisol_mean = mean(cortisol_norm))
#> # A tibble: 1 x 1
#>   cortisol_mean
#>           <dbl>
#> 1        0.0490

If you want to calculate the mean of multiple columns, you might think you’d have to do something like:

saliva_df %>% 
    summarise(cortisol_mean = mean(cortisol_norm),
              melatonin_mean = mean(melatonin_norm))
#> # A tibble: 1 x 2
#>   cortisol_mean melatonin_mean
#>           <dbl>          <dbl>
#> 1        0.0490  0.00000000765

But instead, there is the across() function that works like map() and allows you to calculate the mean across whichever columns you want. In many ways, across() is similar to map(), particularly in the arguments you give it.

Take 2 min and read through this list. When you look in ?across, there are two main arguments and two optional ones:

  1. .cols argument: Columns you want to use.
    • Write column names directly and wrapped in c(): c(age, weight).
    • Write tidyselect helpers: everything(), starts_with(), contains(), ends_with()
    • Use a function wrapped in where(): where(is.numeric), where(is.character)
  2. .fns: The function to use on the .cols.
    • A bare function (mean) applies it to each column and returns the output, with the column name unchanged.
    • A list with bare functions (list(mean, sd)) applies each function to each column and returns the output with the column name appended with a number.
    • A named list with bare functions (list(average = mean, stddev = sd)) does the same as above but instead returns an output with the column names appended with the name given to the function in the list.
    • A function passed with ~ and .x, like in map(). For instance, across(c(age, weight), ~ mean(.x, na.rm = TRUE)) is used to say “put age and weight, one after the other, in place of where .x is located” to calculate the mean for age and the mean for weight.
  3. ... argument: Arguments to give to the functions in .fns. For instance, across(age, mean, na.rm = TRUE) passes the argument to remove missingness na.rm into the mean() function.
  4. .names argument: Customize the output of the column names. We won’t cover this argument.
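As a small sketch of the ~ and .x form described above (using a made-up data frame purely for illustration, assuming dplyr is loaded):

```r
library(dplyr)

# Hypothetical data frame to illustrate the lambda (~ .x) form of .fns.
demo_df <- tibble(
    age = c(25, 30, NA),
    weight = c(70, NA, 80)
)

# Calculate the mean of each column, removing missing values first.
lambda_means <- demo_df %>%
    summarise(across(c(age, weight), ~ mean(.x, na.rm = TRUE)))
lambda_means
#> # A tibble: 1 x 2
#>     age weight
#>   <dbl>  <dbl>
#> 1  27.5     75
```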

Ok, stop reading and we’ll cover this together.

For instructors: Click for details

Go over the first two arguments again, reinforcing what they read.

Let’s try out some examples. To calculate the mean of cortisol_norm like we did above, we’d do:

saliva_df %>% 
    summarise(across(cortisol_norm, mean))
#> # A tibble: 1 x 1
#>   cortisol_norm
#>           <dbl>
#> 1        0.0490

To calculate the mean of another column:

saliva_df %>% 
    summarise(across(c(cortisol_norm, melatonin_norm), mean))
#> # A tibble: 1 x 2
#>   cortisol_norm melatonin_norm
#>           <dbl>          <dbl>
#> 1        0.0490  0.00000000765

This is nice, but appending the function name to the column names would make it clearer what the column contents are. That’s when we would use “named lists”, which are lists that look like:

list(item_one_name = ..., item_two_name = ...)

So, for having a named list with mean inside across(), it would look like:

list(mean = mean)
# or
list(average = mean)
# or
list(ave = mean)

You can confirm the names of a list by using the function names():

names(list(mean = mean))
#> [1] "mean"
names(list(average = mean))
#> [1] "average"
names(list(ave = mean))
#> [1] "ave"

Let’s stick with list(mean = mean):

saliva_df %>% 
    summarise(across(cortisol_norm, list(mean = mean)))
#> # A tibble: 1 x 1
#>   cortisol_norm_mean
#>                <dbl>
#> 1             0.0490

If we wanted to do that for all numeric columns and also calculate sd():

saliva_df %>% 
    summarise(across(where(is.numeric), list(mean = mean, sd = sd)))
#> # A tibble: 1 x 4
#>   cortisol_norm_mean cortisol_norm_sd melatonin_norm_mean melatonin_norm_sd
#>                <dbl>            <dbl>               <dbl>             <dbl>
#> 1             0.0490           0.0478       0.00000000765     0.00000000651

We can use these concepts and code to process the other longer datasets, like RR.csv, in a way that makes it meaningful to join them with the smaller datasets like user_info.csv or saliva.csv. Let’s work with the RR.csv dataset to eventually join it with the others.

With the RR dataset, each participant had almost 100,000 data points recorded over two days of collection. So if we want to join with the other datasets, we need to calculate summary measures by at least user_id and also preferably by day as well. In this case, we need to group_by() these two variables before summarising. Let’s first summarise by taking the mean of ibi_s (which is the Inter-beat interval in seconds):

rr_df %>% 
    group_by(user_id, day) %>% 
    summarise(across(ibi_s, list(mean = mean)))
#> # A tibble: 44 x 3
#> # Groups:   user_id [22]
#>    user_id   day ibi_s_mean
#>    <chr>   <dbl>      <dbl>
#>  1 user_1      1      0.666
#>  2 user_1      2      0.793
#>  3 user_10     1      0.820
#>  4 user_10     2      0.856
#>  5 user_11     1      0.818
#>  6 user_11     2      0.923
#>  7 user_12     1      0.779
#>  8 user_12     2      0.883
#>  9 user_13     1      0.727
#> 10 user_13     2      0.953
#> # … with 34 more rows

While there are no missing values here, let’s add the argument na.rm = TRUE just in case.

rr_df %>% 
    group_by(user_id, day) %>% 
    summarise(across(ibi_s, list(mean = mean), na.rm = TRUE))
#> # A tibble: 44 x 3
#> # Groups:   user_id [22]
#>    user_id   day ibi_s_mean
#>    <chr>   <dbl>      <dbl>
#>  1 user_1      1      0.666
#>  2 user_1      2      0.793
#>  3 user_10     1      0.820
#>  4 user_10     2      0.856
#>  5 user_11     1      0.818
#>  6 user_11     2      0.923
#>  7 user_12     1      0.779
#>  8 user_12     2      0.883
#>  9 user_13     1      0.727
#> 10 user_13     2      0.953
#> # … with 34 more rows

You might notice a message (if you have the latest version of dplyr):

`summarise()` regrouping output by 'user_id' (override with `.groups` argument)

Take 5 min to read this section over before we continue. This message is about regrouping, which can be overridden with the .groups argument. If we look in ?summarise at the .groups argument, we see that this argument is currently “experimental”. At the bottom there is a note:

In addition, a message informs you of that choice, unless the option “dplyr.summarise.inform” is set to FALSE, or when summarise() is called from a function in a package.

So how would we go about removing this message? By setting “dplyr.summarise.inform” with the options() function. So, go to the setup code chunk and add this code to the top:

options(dplyr.summarise.inform = FALSE)

You will now no longer get the message. Please stop reading and we will continue together.

Let’s also add standard deviation as another measure from the RR datasets:

summarised_rr_df <- rr_df %>% 
    group_by(user_id, day) %>% 
    summarise(across(ibi_s, list(mean = mean, sd = sd), na.rm = TRUE))
summarised_rr_df
#> # A tibble: 44 x 4
#> # Groups:   user_id [22]
#>    user_id   day ibi_s_mean ibi_s_sd
#>    <chr>   <dbl>      <dbl>    <dbl>
#>  1 user_1      1      0.666   0.164 
#>  2 user_1      2      0.793   0.194 
#>  3 user_10     1      0.820   0.225 
#>  4 user_10     2      0.856   0.397 
#>  5 user_11     1      0.818   0.137 
#>  6 user_11     2      0.923   0.182 
#>  7 user_12     1      0.779   0.0941
#>  8 user_12     2      0.883   0.258 
#>  9 user_13     1      0.727   0.147 
#> 10 user_13     2      0.953   0.151 
#> # … with 34 more rows

We now have the data in a form that makes sense to join with the other datasets. So let’s try it:

reduce(list(user_info_df, saliva_df, summarised_rr_df), full_join)
#> Joining, by = "user_id"
#> Joining, by = "user_id"
#> # A tibble: 86 x 11
#>    gender weight height   age user_id samples cortisol_norm melatonin_norm   day
#>    <chr>   <dbl>  <dbl> <dbl> <chr>   <chr>           <dbl>          <dbl> <dbl>
#>  1 M          65    169    29 user_1  before…        0.0341  0.0000000174      1
#>  2 M          65    169    29 user_1  before…        0.0341  0.0000000174      2
#>  3 M          65    169    29 user_1  wake up        0.0779  0.00000000675     1
#>  4 M          65    169    29 user_1  wake up        0.0779  0.00000000675     2
#>  5 M          85    180    27 user_10 before…        0.0370  0.00000000867     1
#>  6 M          85    180    27 user_10 before…        0.0370  0.00000000867     2
#>  7 M          85    180    27 user_10 wake up        0.0197  0.00000000257     1
#>  8 M          85    180    27 user_10 wake up        0.0197  0.00000000257     2
#>  9 M         115    186    27 user_11 before…        0.0406  0.00000000204     1
#> 10 M         115    186    27 user_11 before…        0.0406  0.00000000204     2
#> # … with 76 more rows, and 2 more variables: ibi_s_mean <dbl>, ibi_s_sd <dbl>

Hmm, but wait, we now have four rows for each user, when really we should only have two, one for each day. That’s because saliva_df doesn’t have a day column; instead it has a samples column. We’ll need to add a day column in order to join properly with the RR dataset.

There are several ways to do this, but probably the easiest, most explicit, and programmatically accurate way is with the function case_when(). This function works by giving it a series of logical conditions, each with an associated output if the condition is true. The general form looks like:

case_when(
    variable1 == condition1 ~ output,
    variable2 == condition2 ~ output,
    # Otherwise
    TRUE ~ final_output
)

A (silly) example using age might be:

case_when(
    age > 20 ~ "old",
    age <= 20 ~ "young",
    # For final condition
    TRUE ~ NA_character_
)

A quick note about NA_character_. dplyr functions like case_when() require you to be explicit about the type of output each condition has. This prevents you from accidentally mixing, e.g., numeric output with character output. This includes missing values. Other explicit NA values include:

  • NA_real_ (numeric)
  • NA_integer_ (integer)
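As a minimal sketch of why the typed NA matters (using a made-up numeric vector, assuming dplyr is loaded), NA_real_ keeps the whole output numeric:

```r
library(dplyr)

# Hypothetical ages to illustrate typed missing values.
ages <- c(15, 25, 40)

# Since the first condition returns numbers, the catch-all uses
# NA_real_ so that every branch has the same (numeric) type.
over_20 <- case_when(
    ages > 20 ~ ages,
    TRUE ~ NA_real_
)
over_20
#> [1] NA 25 40
```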

How this would look in a pipeline would be when it is used in mutate():

user_info_df %>% 
    mutate(age_category = case_when(
        age > 20 ~ "old",
        age <= 20 ~ "young",
        TRUE ~ NA_character_
    ))
#> # A tibble: 22 x 6
#>    gender weight height   age user_id age_category
#>    <chr>   <dbl>  <dbl> <dbl> <chr>   <chr>       
#>  1 M          65    169    29 user_1  old         
#>  2 M          85    180    27 user_10 old         
#>  3 M         115    186    27 user_11 old         
#>  4 M          67    170    27 user_12 old         
#>  5 M          74    180    25 user_13 old         
#>  6 M          64    171    27 user_14 old         
#>  7 M          80    180    24 user_15 old         
#>  8 M          67    176    27 user_16 old         
#>  9 M          60    175    24 user_17 old         
#> 10 M          80    180     0 user_18 young       
#> # … with 12 more rows

Ok, please stop reading and we will code together again.

By using this function, we can set "before sleep" as day 1 and "wake up" as day 2 by creating a new column called day that uses the case_when() function. (We will use NA_real_ because the other day columns are numeric, not integer.)

saliva_with_day_df <- saliva_df %>% 
    mutate(day = case_when(
        samples == "before sleep" ~ 1,
        samples == "wake up" ~ 2,
        TRUE ~ NA_real_
    ))
saliva_with_day_df
#> # A tibble: 42 x 5
#>    samples      cortisol_norm melatonin_norm user_id   day
#>    <chr>                <dbl>          <dbl> <chr>   <dbl>
#>  1 before sleep        0.0341  0.0000000174  user_1      1
#>  2 wake up             0.0779  0.00000000675 user_1      2
#>  3 before sleep        0.0370  0.00000000867 user_10     1
#>  4 wake up             0.0197  0.00000000257 user_10     2
#>  5 before sleep        0.0406  0.00000000204 user_11     1
#>  6 wake up             0.0156  0.00000000965 user_11     2
#>  7 before sleep        0.156   0.00000000354 user_12     1
#>  8 wake up             0.145   0.00000000864 user_12     2
#>  9 before sleep        0.0123  0.00000000190 user_13     1
#> 10 wake up             0.0342  0.00000000230 user_13     2
#> # … with 32 more rows

Now, let’s use the reduce() with full_join() again:

reduce(list(user_info_df, saliva_with_day_df, summarised_rr_df), full_join)
#> Joining, by = "user_id"
#> Joining, by = c("user_id", "day")
#> # A tibble: 47 x 11
#>    gender weight height   age user_id samples cortisol_norm melatonin_norm   day
#>    <chr>   <dbl>  <dbl> <dbl> <chr>   <chr>           <dbl>          <dbl> <dbl>
#>  1 M          65    169    29 user_1  before…        0.0341  0.0000000174      1
#>  2 M          65    169    29 user_1  wake up        0.0779  0.00000000675     2
#>  3 M          85    180    27 user_10 before…        0.0370  0.00000000867     1
#>  4 M          85    180    27 user_10 wake up        0.0197  0.00000000257     2
#>  5 M         115    186    27 user_11 before…        0.0406  0.00000000204     1
#>  6 M         115    186    27 user_11 wake up        0.0156  0.00000000965     2
#>  7 M          67    170    27 user_12 before…        0.156   0.00000000354     1
#>  8 M          67    170    27 user_12 wake up        0.145   0.00000000864     2
#>  9 M          74    180    25 user_13 before…        0.0123  0.00000000190     1
#> 10 M          74    180    25 user_13 wake up        0.0342  0.00000000230     2
#> # … with 37 more rows, and 2 more variables: ibi_s_mean <dbl>, ibi_s_sd <dbl>

We now have two rows per participant!

8.8 Exercise: Summarise then join the Actigraph data

Time: 20 min

Like with the RR.csv dataset, let’s process the Actigraph.csv dataset so that you can join it with the other datasets.

  1. Like usual, create a new Markdown header called e.g. ## Exercise: Summarise and join Actigraph and insert a new code chunk below that.
  2. Look into the Data Description to find out what each column is for.
  3. Based on the documentation, which variables would you be most interested in analyzing more?
    • Keep those columns as well as user_id and day by using select().
  4. Decide which summary measure(s) you think may be most interesting for you (e.g. median(), sd(), mean(), max(), min(), var()).
  5. Using group_by() on user_id and day, summarise the variables with the summary functions you chose.
  6. Put this new data into a new object called e.g. summarised_actigraph_df and include it with the other datasets in the reduce() function.
  7. Add and commit the changes you’ve made into the Git history.
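If you get stuck, here is one possible sketch of the pattern, using a small made-up stand-in for the Actigraph data (the real data has many more columns; hr is just one possible choice of variable, and mean/sd one possible choice of summary measures):

```r
library(dplyr)

# Made-up stand-in for the selected Actigraph columns
actigraph_df <- tibble(
    user_id = rep(c("user_1", "user_2"), each = 4),
    day = rep(c(1, 1, 2, 2), times = 2),
    hr = c(70, 72, 75, 73, 65, 68, 80, 78)
)

summarised_actigraph_df <- actigraph_df %>% 
    group_by(user_id, day) %>% 
    summarise(across(hr, list(mean = mean, sd = sd)))

summarised_actigraph_df
# One row per user per day, with new hr_mean and hr_sd columns
```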

8.9 Exercise: Import and process the activity data

Time: 25 minutes

We have a few other datasets that we could join together, but they would likely require more processing in order to join appropriately with the other datasets. Complete these tasks:

  1. Create a new header called ## Exercise: Importing activity data
  2. Create a new code chunk below this new header.
  3. Starting the workflow from the beginning (e.g. with the spec() process), write code that imports the Activity.csv data into R.
  4. Convert this code into a new function using the workflow you’ve used from this course:
    • Call the new function import_activity.
    • Include one argument called file_path.
    • Test that it works.
    • Add Roxygen documentation and explicit package links (::) to the functions.
    • Move the newly created function into R/functions.R.
    • Use the new function in doc/lesson.Rmd and use load_all() (Ctrl-Shift-L) to run it.
  5. Import all the user_ datasets with import_multiple_files() and the import_activity() function.
  6. Pipe the results into mutate() and create a new column called activity_seconds that is calculated by subtracting start from end.
    • Use ?mutate and check the examples if you don’t recall how to use this function.
  7. Add and commit your changes to the Git history.
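For step 6, the mutate() call could look like this sketch, shown with a made-up stand-in for the imported activity data (the invented start and end values here are just seconds for illustration):

```r
library(dplyr)

# Made-up stand-in for the imported Activity data
activity_df <- tibble(
    user_id = c("user_1", "user_1"),
    activity = c(1, 2),
    start = c(0, 300),
    end = c(300, 420)
)

activity_with_seconds <- activity_df %>% 
    mutate(activity_seconds = end - start)

activity_with_seconds
# activity_seconds is 300 and 120
```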

Look into the Data Description to find out what each column represents and what the numbers in the activity column mean. Then brainstorm with your neighbour:

  1. The disadvantage of using numbers instead of text to describe categorical data like in the activity column.
  2. Ways you could include this information in the dataset.
  3. How you could meaningfully join this dataset with the other datasets.

Afterwards, we’ll quickly discuss some of these ideas as a group.
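For the second point, one common approach is to recode the numbers into descriptive labels, for instance with case_when(). Note that the number-to-label mapping below is invented for illustration; check the Data Description for the real meanings:

```r
library(dplyr)

activity_df <- tibble(activity = c(1, 2, 3))

# Hypothetical mapping; the real one is in the Data Description
labelled_df <- activity_df %>% 
    mutate(activity_label = case_when(
        activity == 1 ~ "sleeping",
        activity == 2 ~ "laying down",
        activity == 3 ~ "sitting",
        TRUE ~ NA_character_
    ))

labelled_df
```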

8.10 Wrangling data into final form

Now that we’ve got several datasets processed and joined, it’s time to bring it all together in the data-raw/mmash.R script so we can create a final working dataset.

Open up the data-raw/mmash.R file and let’s start cutting and pasting the code from doc/lesson.Rmd related to importing and joining the user_info.csv, saliva.csv, RR.csv, and Actigraph.csv.

The script so far should look like:

library(here)
mmash_link <- "https://physionet.org/static/published-projects/mmash/multilevel-monitoring-of-activity-and-sleep-in-healthy-people-1.0.0.zip"
download.file(mmash_link, destfile = here("data-raw/mmash-data.zip"))
unzip(here("data-raw/mmash-data.zip"), 
      exdir = here("data-raw"),
      junkpaths = TRUE)
unzip(here("data-raw/MMASH.zip"),
      exdir = here("data-raw"))

library(fs)
file_delete(here(c("data-raw/MMASH.zip", 
                   "data-raw/SHA256SUMS.txt",
                   "data-raw/LICENSE.txt")))
file_move(here("data-raw/DataPaper"), here("data-raw/mmash"))

First thing to do is comment out the download.file() code, since we don’t want to download it every time we run this script. Next, we’ll want to rearrange the script so all the packages are at the top, and then add the other packages we’ll need: tidyverse and vroom.

library(tidyverse)
library(vroom)
library(fs)
library(here)

Next, we need to let this script know where to get our functions from. So right below the library() functions, add:

devtools::load_all()

This is the code that runs when you use Ctrl-Shift-L.

Now, find the import_multiple_files() calls, the summarised RR and Actigraph data, the saliva data with the day column, and the reduce() code in your doc/lesson.Rmd file, then cut and paste them to the bottom of the data-raw/mmash.R script. The full script should now look like:

library(tidyverse)
library(vroom)
library(fs)
library(here)
devtools::load_all()

mmash_link <- "https://physionet.org/static/published-projects/mmash/multilevel-monitoring-of-activity-and-sleep-in-healthy-people-1.0.0.zip"
# download.file(mmash_link, destfile = here("data-raw/mmash-data.zip"))
unzip(here("data-raw/mmash-data.zip"), 
      exdir = here("data-raw"),
      junkpaths = TRUE)
unzip(here("data-raw/MMASH.zip"),
      exdir = here("data-raw"))

file_delete(here(c("data-raw/MMASH.zip", 
                   "data-raw/SHA256SUMS.txt",
                   "data-raw/LICENSE.txt")))
file_move(here("data-raw/DataPaper"), here("data-raw/mmash"))

user_info_df <- import_multiple_files("user_info.csv", import_user_info)
saliva_df <- import_multiple_files("saliva.csv", import_saliva)
rr_df <- import_multiple_files("RR.csv", import_rr)
actigraph_df <- import_multiple_files("Actigraph.csv", import_actigraph)

summarised_rr_df <- rr_df %>% 
    group_by(user_id, day) %>% 
    summarise(across(ibi_s, list(mean = mean, sd = sd), na.rm = TRUE))

saliva_with_day_df <- saliva_df %>% 
    mutate(day = case_when(
        samples == "before sleep" ~ 1,
        samples == "wake up" ~ 2,
        TRUE ~ NA_real_
    ))

# Your Actigraph code will probably be different
summarised_actigraph_df <- actigraph_df %>% 
    group_by(user_id, day) %>% 
    summarise(across(hr, list(mean = mean, sd = sd)))

mmash <- reduce(
    list(
        user_info_df,
        saliva_with_day_df,
        summarised_rr_df,
        summarised_actigraph_df
    ),
    full_join
)
#> Joining, by = "user_id"
#> Joining, by = c("user_id", "day")
#> Joining, by = c("user_id", "day")

Lastly, we have to save this final dataset into the data/ folder. We’ll use the function usethis::use_data() to create the folder and save the data as an .rda file. Add this code to the very bottom of the script:

usethis::use_data(mmash, overwrite = TRUE)

We’re adding overwrite = TRUE so that every time we re-run this script, the saved dataset is updated. If the final dataset is going to be really large, we could instead save it as a .csv file with:

vroom_write(mmash, here("data/mmash.csv"))

And later load it back in with vroom() (since it is so fast). Alright, we’re finished creating this dataset! Let’s generate it:

  • Restart the R session with Ctrl-Shift-F10 (“Session -> Restart R”).
  • Source the data-raw/mmash.R script with Ctrl-Shift-S (“Code -> Source” or the “Source” button in the top right corner of the script pane).

We now have a final dataset to start working on! There are two ways to access the dataset:

  • Through load_all() (Ctrl-Shift-L), which automatically loads all .rda datasets in data/ into your working session.
  • Through load(here::here("data/mmash.rda")).

Try loading it: restart the R session again, run load_all(), and in the Console type mmash and hit Enter. You will see that your dataset is available.

8.11 Exercise: What other cleaned data might you create?

Time: 10 min

In order to meaningfully join all the datasets together, we had to calculate summary measures for the RR.csv and Actigraph.csv data. However, summarising also loses a lot of information and potentially interesting analyses. With your neighbour, discuss some potential ways that you might join the datasets to retain that information, and what questions you might be interested in answering with that data.

8.12 Easily add parallel processing

Take 8 minutes to read through this section and then move on to the last exercise. One major reason to get comfortable with purrr functions like map() is that it is relatively trivial to switch to parallel processing and speed up your analysis. Packages like furrr (a combination of the future and purrr packages) have a series of functions starting with future_ that convert code using map() into parallel processing code simply by switching to future_map(). The same goes for future_map_chr(), future_map_dfr(), and so on. What these functions do is create multiple R sessions that run the code simultaneously and then eventually merge the results back into one R session. But note that parallel processing is good for some things and not for others. If your code requires a specific sequence to run (step 2 relies on the result of step 1, which step 3 then needs), parallel processing is not the right tool. It also doesn’t always speed up tasks: creating the multiple R sessions and then merging them back takes some time, so if your task is already pretty fast, using parallel processing might actually make things slower.

furrr works by using the plan() function to set a processing strategy. There are really only two strategies you need right now: sequential, which is what R uses by default, and multisession, which runs the parallel processing. Note that plan() should not be put inside a function; it should be included on its own in an R script (like data-raw/mmash.R). Here’s an example of a short R script:

# Add this to the top:
library(furrr)
plan(multisession)

future_map(1:5, paste)

# Add this to the end so it returns to normal
plan(sequential)

The script should generally end with plan(sequential) so that R switches back to normal sequential processing.
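To see the trade-off for yourself, here is a small self-contained sketch comparing the two strategies; the exact timings will depend on how many cores your computer has:

```r
library(purrr)
library(furrr)

slow_task <- function(x) {
    Sys.sleep(0.5) # simulate a slow computation
    x * 2
}

# Sequential: roughly 4 x 0.5 = 2 seconds
plan(sequential)
system.time(seq_res <- map(1:4, slow_task))

# Parallel: noticeably faster on a multi-core machine
plan(multisession)
system.time(par_res <- future_map(1:4, slow_task))

# Switch back to normal processing
plan(sequential)

# Same results, different speed
identical(seq_res, par_res)
```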

8.13 Exercise: Add parallel processing to your raw data processing

Time: 10 min

Let’s compare the difference between not using parallel processing and using it. First:

  1. Open the data-raw/mmash.R file.
  2. Restart the R session Ctrl-Shift-F10.
  3. Re-run the script with source() by using either the button “Source” at the top right corner of the RStudio pane or with Ctrl-Shift-S.

After it finishes, then:

  1. Add library(furrr) with the other library() functions at the top of data-raw/mmash.R.
  2. Add plan(multisession) right below all the other library functions.
  3. Go to the import_multiple_files() function in R/functions.R and replace purrr::map_dfr() with furrr::future_map_dfr().
  4. Go back to the data-raw/mmash.R script and add plan(sequential) to the very end.
  5. Restart the R session with Ctrl-Shift-F10.
  6. Re-run the code with source() using either the button “Source” at the top right corner of the RStudio pane or with Ctrl-Shift-S.

Do you notice a difference in speed compared to before?