values <- 1:10
# Vectorized
sum(values)
#> [1] 55
We will continue covering the “Workflow” block in Figure 7.1.
But what do functionals have to do with what we are doing now? Well, our import_user_info() function only takes in one data file at a time, but we have 22 files that we could load all at once if we used functionals.
The first thing we have to do is add library(purrr) to the setup code chunk in the doc/learning.qmd document. Then we need to add the package dependency by going to the Console and running:
usethis::use_package("purrr")
Then, the next step for using the map() functional is to get a vector or list of all the dataset files available to us. We will return to using the fs package, which has a function called dir_ls() that finds files matching a certain pattern. In our case, the pattern is user_info.csv. So, let’s add library(fs) to the setup code chunk. Then, go to the bottom of the doc/learning.qmd document, create a new header called ## Using map, and create a code chunk below that with the Palette (Ctrl-Shift-P, then type “new chunk”).
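Inside the chunk, add this code (the same code reappears in the exercise further down):
user_info_files <- dir_ls(here("data-raw/mmash/"),
                          regexp = "user_info.csv",
                          recurse = TRUE)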
The dir_ls() function takes the path that we want to search (data-raw/mmash/), uses the argument regexp (short for “regular expression”, also known as “regex”) to find the pattern, and recurse to look in all subfolders. We’ll cover regular expressions more in the next session.
Then let’s see what the output looks like. For the website, we are only showing the first 3 files. Your output will look slightly different from this.
user_info_files
#> data-raw/mmash/user_1/user_info.csv
#> data-raw/mmash/user_10/user_info.csv
#> data-raw/mmash/user_11/user_info.csv
Alright, we now have all the files ready to give to map(). So let’s try it!
user_info_list <- map(user_info_files, import_user_info)
Remember that map() always outputs a list, so when we look into this object, it will give us 22 tibbles (data frames). Here we’ll only show the first one:
user_info_list[[1]]
#> # A tibble: 1 × 4
#> gender weight height age
#> <chr> <dbl> <dbl> <dbl>
#> 1 M 65 169 29
This is great because with one line of code we imported all these datasets! But we’re missing an important bit of information: the user ID. A powerful feature of the purrr package is that it has other functions to make working with functionals easier. We know map() always outputs a list. What if you want to output a character vector instead? If we check the help:
?map
We see that there are other functions, including a function called map_chr() that seems to output a character vector. There are several others that give an output based on the ending of map_, such as:

- map_int() outputs an integer.
- map_dbl() outputs a numeric value, called a “double” in programming.
- map_dfr() outputs a data frame, combining the list items by row (r).
- map_dfc() outputs a data frame, combining the list items by column (c).

The map_dfr() function looks like the one we want, since we want all these datasets together as one. If we look at the help for it, we see that it has an argument .id, which we can use to create a new column that sets the user ID, or in this case, the file path to the dataset, which has the user ID information in it. So, let’s use it and create a new column called file_path_id.
user_info_df <- map_dfr(user_info_files, import_user_info,
.id = "file_path_id")
Your file_path_id variable will look different. Don’t worry, we’re going to tidy up the file_path_id variable later.
user_info_df
#> # A tibble: 22 × 5
#> file_path_id gender weight height age
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 data-raw/mmash/user_1/user_info.csv M 65 169 29
#> 2 data-raw/mmash/user_10/user_info.csv M 85 180 27
#> 3 data-raw/mmash/user_11/user_info.csv M 115 186 27
#> 4 data-raw/mmash/user_12/user_info.csv M 67 170 27
#> 5 data-raw/mmash/user_13/user_info.csv M 74 180 25
#> 6 data-raw/mmash/user_14/user_info.csv M 64 171 27
#> 7 data-raw/mmash/user_15/user_info.csv M 80 180 24
#> 8 data-raw/mmash/user_16/user_info.csv M 67 176 27
#> 9 data-raw/mmash/user_17/user_info.csv M 60 175 24
#> 10 data-raw/mmash/user_18/user_info.csv M 80 180 0
#> # ℹ 12 more rows
Now that we have this working, let’s add and commit the changes to the Git history with the Palette (Ctrl-Shift-P, then type “commit”).
Time: 10 minutes.
As a group, discuss whether you’ve ever used for loops or functionals like map() and your experiences with either. Discuss any advantages of using for loops over functionals and vice versa. Then, brainstorm and discuss as many ways as you can for how you might incorporate functionals like map(), or replace for loops with them, in your own work. Afterwards, groups will briefly share some of what they thought of before we move on to the next exercise.
Time: 25 minutes.
We need to do basically the same thing for the saliva.csv, RR.csv, and Actigraph.csv datasets, following this format:
user_info_files <- dir_ls(here("data-raw/mmash/"),
regexp = "user_info.csv",
recurse = TRUE)
user_info_df <- map_dfr(user_info_files, import_user_info,
.id = "file_path_id")
To import the other datasets, we have to modify the code in two locations: at the regexp = argument and at import_user_info. This is the perfect chance to make a function that you can use for other purposes and that is itself a functional (since it takes a function as an input). So inside doc/learning.qmd, convert this bit of code into a function that works to import the other three datasets.
1. Create a new header called ## Exercise: Map on the other datasets at the bottom of the document.
2. Convert the code above into a function by wrapping it with function() { ... } and naming the new function import_multiple_files.
3. Within function(), set two new arguments called file_pattern and import_function.
4. Inside the code, replace "user_info.csv" with file_pattern (this is without quotes around it) and import_user_info with import_function (also without quotes).
5. Rename the objects so they are more generic than user_info_files and user_info_df. So, replace and re-write user_info_files with data_files and user_info_df with combined_data.
6. Use return(combined_data) at the end of the function to output the imported data frame.
7. Add packagename:: to the individual functions (there are three packages used: fs, here, and purrr).
8. Test that the function works on saliva.csv.
9. Once it works, move (cut and paste) the function into the R/functions.R file. Then restart the R session with the Palette (Ctrl-Shift-P, then type “restart”), run the line with source(here("R/functions.R")) or with the Palette (Ctrl-Shift-P, then type “source”), and test the code out in the Console.

Use this code as a guide to help complete this exercise:
___ <- ___(___, ___) {
    ___ <- ___dir_ls(___here("data-raw/mmash/"),
                     regexp = ___,
                     recurse = TRUE)
    ___ <- ___map_dfr(___, ___,
                      .id = "file_path_id")
    ___(___)
}
#' Import multiple MMASH data files and merge into one data frame.
#'
#' @param file_pattern Pattern for which data file to import.
#' @param import_function Function to import the data file.
#'
#' @return A single data frame/tibble.
#'
import_multiple_files <- function(file_pattern, import_function) {
data_files <- fs::dir_ls(here::here("data-raw/mmash/"),
regexp = file_pattern,
recurse = TRUE)
combined_data <- purrr::map_dfr(data_files, import_function,
.id = "file_path_id")
return(combined_data)
}
# Test on saliva in the Console
import_multiple_files("saliva.csv", import_saliva)
Now that we’ve made a function that imports multiple data files based on the type of data file, we can start using it directly, like we did in the exercise above for the saliva data. We’ve already imported user_info_df previously, but now we should do some tidying up of our Quarto / R Markdown file and start updating the data-raw/mmash.R script. Why are we doing that? Because the Quarto / R Markdown file is only a sandbox to test code out, and in the end we want a script that takes the raw data, processes it, and creates a working dataset we can use for analysis.
The first thing we will do is delete everything below the setup code chunk that contains the library() and source() code. Why do we delete everything? Because it keeps things cleaner and makes it easier to look through the file. And because we use Git, nothing is truly gone, so you can always go back to the text later. Next, we restart the R session with the Palette (Ctrl-Shift-P, then type “restart”). Then we’ll create a new code chunk below the setup chunk where we will use the import_multiple_files() function to import the user info and saliva data.
user_info_df <- import_multiple_files("user_info.csv", import_user_info)
saliva_df <- import_multiple_files("saliva.csv", import_saliva)
To test that things work, we’ll create an HTML document from our Quarto / R Markdown document by using the “Render” / “Knit” button at the top of the pane or with the Palette (Ctrl-Shift-P, then type “render”). Once it creates the file, it should either pop up or open in the Viewer pane on the side. If it works, then we can move on and open up the data-raw/mmash.R script. Inside the script, copy and paste these two lines of code to the bottom of the script. Afterwards, go to the top of the script and, right below the library(here) code, add these two lines of code, so it looks like this:
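A minimal sketch of how the top of the script could then look, assuming the two added lines mirror the setup chunk in doc/learning.qmd (loading the magrittr pipe and sourcing the functions file; adjust to match your own script):
library(here)
# Assumed additions (a sketch, not the definitive script):
library(magrittr)
source(here("R/functions.R"))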
Save the files, then add and commit the changes to the Git history with the Palette (Ctrl-Shift-P, then type “commit”).
We’re taking a quick detour to briefly talk about a concept that perfectly illustrates how vectorization and functionals fit into doing data analysis. The concept is called the split-apply-combine technique, which we covered in the beginner R course. The method is: split the data into groups, apply an analysis to each group, and combine the results back together.
So when you split data into multiple groups, you get a vector or list of groups that you can then apply (for instance with the map() functional) some statistical technique to, through vectorization. This technique works really well for a range of tasks, including our task of summarizing some of the MMASH data so we can merge it all into one dataset.
Time: 5 minutes.
We haven’t used the %>% pipe from the magrittr package yet, but it is used extensively in many R packages and is the foundation of the tidyverse packages. The pipe fundamentally changed how people write R code, so much so that in version 4.1 a similar operator, |>, was added to base R. To make sure everyone is aware of what the pipe is, in your groups please do either task:
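For reference during the discussion, here is a minimal illustration (assuming magrittr or a tidyverse package is loaded to provide %>%). Both pipes pass the object on the left as the first argument to the function on the right:
# These two lines do the same thing as sum(values)
values %>% sum()
values |> sum()
#> [1] 55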
Functionals and vectorization are an integral component of how R works, and they appear throughout many of R’s functions and packages. They are used particularly heavily throughout the tidyverse packages like dplyr. Let’s get into some more advanced features of dplyr functions that work as functionals. Before we continue, re-run the code for getting user_info_df, since you restarted the R session previously.
There are many “verbs” in dplyr, like select(), rename(), mutate(), summarise(), and group_by() (covered in more detail in the Data Management and Wrangling session of the beginner course). The common usage of these verbs is through acting on, and directly using, the column names (e.g. without " quotes around the column name). For instance, to select only the age column, you would type out something like:
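# Select one column directly by its (unquoted) name
user_info_df %>%
    select(age)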
#> # A tibble: 22 × 1
#> age
#> <dbl>
#> 1 29
#> 2 27
#> 3 27
#> 4 27
#> 5 25
#> 6 27
#> 7 24
#> 8 27
#> 9 24
#> 10 0
#> # ℹ 12 more rows
But many dplyr verbs can also take functions as input. When you combine select() with the where() function, you can select variables based on a condition, such as their data type. The where() function is a “tidyselect” helper, one of a set of functions that make it easier to select variables. Some additional helper functions are listed in Table 7.1.
What it selects | Example | Function
---|---|---
Select variables where a function returns TRUE | Select variables that have character data (is.character) | where()
Select all variables | Select all variables in user_info_df | everything()
Select variables that contain the matching string | Select variables that contain the string “user_info” | contains()
Select variables that end with a string | Select all variables that end with “date” | ends_with()
Let’s select columns that are numeric:
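# where() keeps only the columns for which is.numeric() returns TRUE
user_info_df %>%
    select(where(is.numeric))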
#> # A tibble: 22 × 3
#> weight height age
#> <dbl> <dbl> <dbl>
#> 1 65 169 29
#> 2 85 180 27
#> 3 115 186 27
#> 4 67 170 27
#> 5 74 180 25
#> 6 64 171 27
#> 7 80 180 24
#> 8 67 176 27
#> 9 60 175 24
#> 10 80 180 0
#> # ℹ 12 more rows
Or, only character columns:
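user_info_df %>%
    select(where(is.character))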
#> # A tibble: 22 × 2
#> file_path_id gender
#> <chr> <chr>
#> 1 data-raw/mmash/user_1/user_info.csv M
#> 2 data-raw/mmash/user_10/user_info.csv M
#> 3 data-raw/mmash/user_11/user_info.csv M
#> 4 data-raw/mmash/user_12/user_info.csv M
#> 5 data-raw/mmash/user_13/user_info.csv M
#> 6 data-raw/mmash/user_14/user_info.csv M
#> 7 data-raw/mmash/user_15/user_info.csv M
#> 8 data-raw/mmash/user_16/user_info.csv M
#> 9 data-raw/mmash/user_17/user_info.csv M
#> 10 data-raw/mmash/user_18/user_info.csv M
#> # ℹ 12 more rows
Likewise, with functions like summarise(), if you want to, for example, calculate the mean of cortisol in the saliva dataset, you would usually type out something like:
#> # A tibble: 1 × 1
#> cortisol_mean
#> <dbl>
#> 1 0.0490
If you want to calculate the mean of multiple columns, you might think you’d have to do something like:
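saliva_df %>%
    summarise(cortisol_mean = mean(cortisol_norm),
              melatonin_mean = mean(melatonin_norm))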
#> # A tibble: 1 × 2
#> cortisol_mean melatonin_mean
#> <dbl> <dbl>
#> 1 0.0490 0.00000000765
But instead, there is the across() function, which allows you to calculate the mean across whichever columns you want. In many ways, across() works much like map(), particularly in the arguments you give it.
Let’s try out some examples. To calculate the mean of cortisol_norm like we did above, we’d do:
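saliva_df %>%
    summarise(across(cortisol_norm, mean))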
#> # A tibble: 1 × 1
#> cortisol_norm
#> <dbl>
#> 1 0.0490
To calculate the mean of another column:
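saliva_df %>%
    summarise(across(c(cortisol_norm, melatonin_norm), mean))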
#> # A tibble: 1 × 2
#> cortisol_norm melatonin_norm
#> <dbl> <dbl>
#> 1 0.0490 0.00000000765
This is nice, but it would be clearer what the column contents are if the function’s name was appended to the column names. That’s when we would use “named lists”, which are lists that look like:
list(item_one_name = ..., item_two_name = ...)
So, a named list with mean inside it, for use within across(), would look like:
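list(mean = mean)
# The name on the left-hand side can be whatever you want:
list(average = mean)
list(ave = mean)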
You can confirm that it is a named list by using the function names():
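# Each call outputs the name given to the list item:
names(list(mean = mean))
names(list(average = mean))
names(list(ave = mean))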
#> [1] "mean"
#> [1] "average"
#> [1] "ave"
Let’s stick with list(mean = mean):
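saliva_df %>%
    summarise(across(cortisol_norm, list(mean = mean)))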
#> # A tibble: 1 × 1
#> cortisol_norm_mean
#> <dbl>
#> 1 0.0490
If we wanted to do that for all numeric columns and also calculate sd():
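saliva_df %>%
    summarise(across(where(is.numeric), list(mean = mean, sd = sd)))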
#> # A tibble: 1 × 4
#> cortisol_norm_mean cortisol_norm_sd melatonin_norm_mean
#> <dbl> <dbl> <dbl>
#> 1 0.0490 0.0478 0.00000000765
#> # ℹ 1 more variable: melatonin_norm_sd <dbl>
We can use these concepts and code to process the other, longer datasets, like RR.csv, in a way that makes it more meaningful to eventually merge (also called “join”) them with the smaller datasets like user_info.csv or saliva.csv. Let’s work with the RR.csv dataset to eventually join it with the others.
With the RR dataset, each participant had almost 100,000 data points recorded over two days of collection. So if we want to join it with the other datasets, we need to calculate summary measures by at least file_path_id, and preferably by day as well. In this case, we need to group_by() these two variables before summarising, which lets us use the split-apply-combine technique. Let’s first summarise by taking the mean of ibi_s (which is the inter-beat interval in seconds):
rr_df <- import_multiple_files("RR.csv", import_rr)
rr_df %>%
group_by(file_path_id, day) %>%
summarise(across(ibi_s, list(mean = mean)))
#> # A tibble: 44 × 3
#> file_path_id day ibi_s_mean
#> <chr> <dbl> <dbl>
#> 1 data-raw/mmash/user_1/RR.csv 1 0.666
#> 2 data-raw/mmash/user_1/RR.csv 2 0.793
#> 3 data-raw/mmash/user_10/RR.csv 1 0.820
#> 4 data-raw/mmash/user_10/RR.csv 2 0.856
#> 5 data-raw/mmash/user_11/RR.csv 1 0.818
#> 6 data-raw/mmash/user_11/RR.csv 2 0.923
#> 7 data-raw/mmash/user_12/RR.csv 1 0.779
#> 8 data-raw/mmash/user_12/RR.csv 2 0.883
#> 9 data-raw/mmash/user_13/RR.csv 1 0.727
#> 10 data-raw/mmash/user_13/RR.csv 2 0.953
#> # ℹ 34 more rows
While there are no missing values here, let’s add the argument na.rm = TRUE just in case:
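# na.rm = TRUE is passed on to mean() through the ... of across()
rr_df %>%
    group_by(file_path_id, day) %>%
    summarise(across(ibi_s, list(mean = mean), na.rm = TRUE))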
#> # A tibble: 44 × 3
#> file_path_id day ibi_s_mean
#> <chr> <dbl> <dbl>
#> 1 data-raw/mmash/user_1/RR.csv 1 0.666
#> 2 data-raw/mmash/user_1/RR.csv 2 0.793
#> 3 data-raw/mmash/user_10/RR.csv 1 0.820
#> 4 data-raw/mmash/user_10/RR.csv 2 0.856
#> 5 data-raw/mmash/user_11/RR.csv 1 0.818
#> 6 data-raw/mmash/user_11/RR.csv 2 0.923
#> 7 data-raw/mmash/user_12/RR.csv 1 0.779
#> 8 data-raw/mmash/user_12/RR.csv 2 0.883
#> 9 data-raw/mmash/user_13/RR.csv 1 0.727
#> 10 data-raw/mmash/user_13/RR.csv 2 0.953
#> # ℹ 34 more rows
You might notice a message (depending on the version of dplyr you have):
`summarise()` regrouping output by 'file_path_id' (override with `.groups` argument)
Let’s also add standard deviation as another measure from the RR datasets:
summarised_rr_df <- rr_df %>%
group_by(file_path_id, day) %>%
summarise(across(ibi_s, list(mean = mean, sd = sd), na.rm = TRUE))
summarised_rr_df
#> # A tibble: 44 × 4
#> file_path_id day ibi_s_mean ibi_s_sd
#> <chr> <dbl> <dbl> <dbl>
#> 1 data-raw/mmash/user_1/RR.csv 1 0.666 0.164
#> 2 data-raw/mmash/user_1/RR.csv 2 0.793 0.194
#> 3 data-raw/mmash/user_10/RR.csv 1 0.820 0.225
#> 4 data-raw/mmash/user_10/RR.csv 2 0.856 0.397
#> 5 data-raw/mmash/user_11/RR.csv 1 0.818 0.137
#> 6 data-raw/mmash/user_11/RR.csv 2 0.923 0.182
#> 7 data-raw/mmash/user_12/RR.csv 1 0.779 0.0941
#> 8 data-raw/mmash/user_12/RR.csv 2 0.883 0.258
#> 9 data-raw/mmash/user_13/RR.csv 1 0.727 0.147
#> 10 data-raw/mmash/user_13/RR.csv 2 0.953 0.151
#> # ℹ 34 more rows
Whenever you are finished with a grouping effect, it’s good practice to end the group_by() with ungroup(). Let’s add it to the end:
summarised_rr_df <- rr_df %>%
group_by(file_path_id, day) %>%
summarise(across(ibi_s, list(mean = mean, sd = sd), na.rm = TRUE)) %>%
ungroup()
summarised_rr_df
#> # A tibble: 44 × 4
#> file_path_id day ibi_s_mean ibi_s_sd
#> <chr> <dbl> <dbl> <dbl>
#> 1 data-raw/mmash/user_1/RR.csv 1 0.666 0.164
#> 2 data-raw/mmash/user_1/RR.csv 2 0.793 0.194
#> 3 data-raw/mmash/user_10/RR.csv 1 0.820 0.225
#> 4 data-raw/mmash/user_10/RR.csv 2 0.856 0.397
#> 5 data-raw/mmash/user_11/RR.csv 1 0.818 0.137
#> 6 data-raw/mmash/user_11/RR.csv 2 0.923 0.182
#> 7 data-raw/mmash/user_12/RR.csv 1 0.779 0.0941
#> 8 data-raw/mmash/user_12/RR.csv 2 0.883 0.258
#> 9 data-raw/mmash/user_13/RR.csv 1 0.727 0.147
#> 10 data-raw/mmash/user_13/RR.csv 2 0.953 0.151
#> # ℹ 34 more rows
Ungrouping the data with ungroup() does not provide any visual indication of what is happening. However, in the background, it removes certain metadata that the group_by() function added.
Before continuing, let’s knit the Quarto / R Markdown document with the Palette (Ctrl-Shift-P, then type “render”) to confirm that everything runs as it should. If the knitting works, then switch to the Git interface and add and commit the changes so far with the Palette (Ctrl-Shift-P, then type “commit”).
Time: 15 minutes.
Like with the RR.csv dataset, let’s process the Actigraph.csv dataset so that it is easier to join with the other datasets later.
1. Create a new header called ## Exercise: Summarise Actigraph and insert a new code chunk below it with the Palette (Ctrl-Shift-P, then type “new chunk”).
2. Import the Actigraph data using the import_multiple_files() function you created previously. Name the new data frame actigraph_df.
3. Decide which variables you are interested in summarising and which summary function(s) you want to use (e.g. median(), sd(), mean(), max(), min(), var()).
4. Use group_by() of file_path_id and day, then use summarise() with across() to summarise the variables you are interested in (from item 3 above) with the summary functions you chose. Assign the newly summarised data frame to a new data frame and call it summarised_actigraph_df.
5. End the grouping effect with ungroup().
6. Render the doc/learning.qmd document with the Palette (Ctrl-Shift-P, then type “render”) to make sure everything works.

We’ll do this all together. We’ve tested out, imported, and processed two new datasets, the RR and the Actigraph datasets. First, in the R Markdown / Quarto document, cut the code that we used to import and process the rr_df and actigraph_df data. Then open up the data-raw/mmash.R file and paste the cut code at the bottom of the script. It should look something like this:
user_info_df <- import_multiple_files("user_info.csv", import_user_info)
saliva_df <- import_multiple_files("saliva.csv", import_saliva)
rr_df <- import_multiple_files("RR.csv", import_rr)
actigraph_df <- import_multiple_files("Actigraph.csv", import_actigraph)
summarised_rr_df <- rr_df %>%
group_by(file_path_id, day) %>%
summarise(across(ibi_s, list(mean = mean, sd = sd), na.rm = TRUE)) %>%
ungroup()
# Code pasted here that was made from the above exercise
Next, go to the R Markdown / Quarto document and again delete everything below the setup code chunk. After it has been deleted, add and commit the changes to the Git history with the Palette (Ctrl-Shift-P, then type “commit”).
- Use map() when you want to repeat a function on multiple items at once.
- Use group_by(), summarise(), and across(), followed by ungroup(), to apply the split-apply-combine technique when needing to do an action on groups within the data (e.g. calculate the mean age between education groups).