7 Save time, don’t repeat yourself: Using functionals
We will continue covering the “Workflow” block in Figure 7.1.
7.1 Learning objectives
The overall learning outcome for this session is to:
- Describe a basic workflow for creating functions and then apply this workflow to import data.
Specific objectives are to:
- Explain what functional programming, vectorization, and functionals are within R and identify when code is a functional or uses functional programming. Then apply this knowledge using the purrr package.
- Review the split-apply-combine technique and identify how these concepts connect to functional programming.
- Apply functional programming to summarize data using the split-apply-combine technique.
7.2 Functional programming
But what do functionals have to do with what we are doing now? Well, our import_user_info() function only takes in one data file. But we have 22 files that we could load all at once if we used functionals.
Before we continue, let’s clean up the doc/learning.qmd file by deleting everything below the setup code chunk that contains the library() and source() code. Why do we delete everything? Because it keeps things cleaner and makes it easier to look through the file (both for you and for us as instructors). And because we use Git, nothing is truly gone, so you can always go back to the text later. Next, restart the R session with Ctrl-Shift-F10 or with the Palette (Ctrl-Shift-P, then type “restart”).
Before we use the map() functional, we need to get a vector or list of all the dataset files available to us. We will return to using the fs package, which has a function called dir_ls() that finds files matching a certain pattern. So, let’s add library(fs) to the setup code chunk. Then, go to the bottom of the doc/learning.qmd document, create a new header called ## Using map, and create a code chunk below it with Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type “new chunk”).
The dir_ls() function takes the path that we want to search (data-raw/mmash/), uses the argument regexp (short for regular expression, also called regex) to find the pattern, and recurse to look in all subfolders. We’ll cover regular expressions more in the next session. In our case, the pattern is user_info.csv, so the code should look something like this:
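```r
user_info_files <- dir_ls(here("data-raw/mmash/"),
  regexp = "user_info.csv",
  recurse = TRUE
)
```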
Then let’s see what the output looks like. For the website, we are only showing the first 3 files. Your output will look slightly different from this.
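```r
user_info_files
```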
#> data-raw/mmash/user_1/user_info.csv
#> data-raw/mmash/user_10/user_info.csv
#> data-raw/mmash/user_11/user_info.csv
Alright, we now have all the files ready to give to map(). But before using it, we need to add purrr, where map() comes from, as a package dependency by going to the Console and running:
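```r
# One way to add a package dependency, using usethis
usethis::use_package("purrr")
```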
Since purrr is part of the tidyverse, we don’t need to load it with library(). So let’s try it!
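```r
map(user_info_files, import_user_info)
```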
Remember that map() always outputs a list, so when we look into this object, it will give us 22 tibbles (data frames). Here we’ll only show the first one:
#> # A tibble: 1 × 4
#> gender weight height age
#> <chr> <dbl> <dbl> <dbl>
#> 1 M 65 169 29
This is great because with one line of code we imported all these datasets! But we’re missing an important bit of information: the user ID. A powerful feature of the purrr package is that it has other functions that make it easier to work with functionals. We know map() always outputs a list. But what we want is a single data frame at the end that also contains the user ID.
The function that takes a list and converts it into a data frame is called list_rbind(), to bind (“stack”) by rows, or list_cbind(), to bind (“stack”) by columns. We want to bind by rows, so we will use list_rbind(). If we look at the help for it, we see that it has an argument names_to. This argument lets us create a new column that sets the user ID, or in this case the file path to the dataset, which has the user ID information in it. So, let’s use it and create a new column called file_path_id:
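```r
user_info_df <- map(user_info_files, import_user_info) |>
  list_rbind(names_to = "file_path_id")
user_info_df
```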
Your file_path_id variable will look different. Don’t worry, we’re going to tidy up the file_path_id variable later.
#> # A tibble: 22 × 5
#> file_path_id gender weight height age
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 data-raw/mmash/user_1/user_info.csv M 65 169 29
#> 2 data-raw/mmash/user_10/user_info.csv M 85 180 27
#> 3 data-raw/mmash/user_11/user_info.csv M 115 186 27
#> 4 data-raw/mmash/user_12/user_info.csv M 67 170 27
#> 5 data-raw/mmash/user_13/user_info.csv M 74 180 25
#> 6 data-raw/mmash/user_14/user_info.csv M 64 171 27
#> 7 data-raw/mmash/user_15/user_info.csv M 80 180 24
#> 8 data-raw/mmash/user_16/user_info.csv M 67 176 27
#> 9 data-raw/mmash/user_17/user_info.csv M 60 175 24
#> 10 data-raw/mmash/user_18/user_info.csv M 80 180 0
#> # ℹ 12 more rows
We’re using the base R |> pipe rather than the magrittr pipe %>%, as more documentation and packages are using or relying on it. In terms of functionality, they are nearly the same, with some small differences. It ultimately doesn’t matter which one you use, but we’re using the base R |> pipe to be consistent with other documentation and with the general trend of recommending it over the magrittr pipe.
Now that we have this working, let’s add and commit the changes to the Git history, by using Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).
7.3 Exercise: Make a function for importing other datasets with functionals
Time: ~30 minutes.
We eventually (but not yet) want to do the exact same thing for importing the saliva.csv, RR.csv, and Actigraph.csv datasets, mimicking the code:
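```r
user_info_files <- dir_ls(here("data-raw/mmash/"),
  regexp = "user_info.csv",
  recurse = TRUE
)

user_info_df <- map(user_info_files, import_user_info) |>
  list_rbind(names_to = "file_path_id")
```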
Notice that if we wanted to import one of the other datasets, we could copy the code above and make changes at two locations: at the regexp = argument in dir_ls() and at the import_user_info location within map(). Since we do not want to repeat ourselves, this is a perfect chance to convert the code above into a function, so we can use this new function to import the other datasets without duplicating code. So, inside doc/learning.qmd, convert this bit of code into a function that works to import the other three datasets.
Complete the tasks below and use the code below as a guide to help complete this exercise (a rough template; the ___ marks the pieces you need to fill in):
doc/learning.qmd
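```r
# Template sketch: replace each ___ as you work through the tasks
___ <- function(___, ___) {
  ___ <- dir_ls(here("data-raw/mmash/"),
    regexp = ___,
    recurse = TRUE
  )
  ___ <- map(___, ___) |>
    list_rbind(names_to = "file_path_id")
  return(___)
}
```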
- Create a new header ## Exercise: Map on the other datasets at the bottom of the document.
- Create a new code chunk below it, using Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type “new chunk”).
- Using the template code above and the user_info code at the start of this exercise section, re-write the code so you can repeat the steps you’ve taken previously and import the other three datasets with a newly created function:
  - Name the function import_multiple_files.
  - Use the function() { ... } code to create a new function.
  - Within function(), set two new arguments called file_pattern and import_function.
  - Within the code, using the user_info code above as a guide, replace "user_info.csv" with file_pattern (without quotes around it; otherwise R will interpret it as the literal pattern "file_pattern" to look for in the regexp argument, and not as the value from the file_pattern argument we created for our function) and replace import_user_info with import_function (also without quotes).
  - Create generic intermediate objects: change user_info_files to data_files and user_info_df to combined_data.
  - Use return(combined_data) at the end of the function to output the imported data frame.
  - Create and write Roxygen documentation to describe the new function by using Ctrl-Shift-Alt-R or with the Palette (Ctrl-Shift-P, then type “roxygen comment”).
  - Append packagename:: to the individual functions (there are three packages used: fs, here, and purrr).
  - Run it and check that it works on saliva.csv by using import_multiple_files("saliva.csv", import_saliva).
- After it works, cut and paste the function into the R/functions.R file. Then restart the R session with Ctrl-Shift-F10 or with the Palette (Ctrl-Shift-P, then type “restart”), run the line with source(here("R/functions.R")) or with Ctrl-Shift-S or with the Palette (Ctrl-Shift-P, then type “source”), and test the code out in the Console.
- Run styler while in the R/functions.R file with the Palette (Ctrl-Shift-P, then type “style file”).
- Once done, add the changes you’ve made and commit them to the Git history, using Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).
Click for the solution. Only click if you are struggling or are out of time.
```r
#' Import multiple MMASH data files and merge into one data frame.
#'
#' @param file_pattern Pattern for which data file to import.
#' @param import_function Function to import the data file.
#'
#' @return A single data frame/tibble.
#'
import_multiple_files <- function(file_pattern, import_function) {
  data_files <- fs::dir_ls(here::here("data-raw/mmash/"),
    regexp = file_pattern,
    recurse = TRUE
  )
  combined_data <- purrr::map(data_files, import_function) |>
    purrr::list_rbind(names_to = "file_path_id")
  return(combined_data)
}

# Test on saliva in the Console
import_multiple_files("saliva.csv", import_saliva)
```
7.4 Adding to the processing script and cleaning up the Quarto document
Now that we’ve made a function that imports multiple data files based on the type of data file, we can start using it directly, like we did in the exercise above for the saliva data. We’ve already imported the user_info_df previously, but now we should do some tidying up of our Quarto file and start updating the data-raw/mmash.R script. Why are we doing that? Because the Quarto file is only a sandbox to test code out in, and in the end we want a script that takes the raw data, processes it, and creates a working dataset we can use for analysis.
Like we did before, delete everything below the setup code chunk that contains the library() and source() code. Then restart the R session with Ctrl-Shift-F10 or with the Palette (Ctrl-Shift-P, then type “restart”), and create a new code chunk below the setup chunk where we will use the import_multiple_files() function to import the user info and saliva data:
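```r
user_info_df <- import_multiple_files("user_info.csv", import_user_info)
saliva_df <- import_multiple_files("saliva.csv", import_saliva)
```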
To test that things work, we’ll create an HTML document from our Quarto document by using the “Render” / “Knit” button at the top of the pane, with Ctrl-Shift-K, or with the Palette (Ctrl-Shift-P, then type “render”). Once it creates the file, it should either pop up or open in the Viewer pane on the side. If it works, then we can move on and open up the data-raw/mmash.R script. If not, it means that there is an issue in your code and that it won’t be reproducible.
Before continuing, we’ll collect our imported packages at the top of the script by adding the library(fs) line right below library(here). Then, inside data-raw/mmash.R, copy and paste the two lines of code that create the user_info_df and saliva_df (i.e. the two lines in the code chunk above) to the bottom of the script. Afterwards, go to the top of the script and, right below the library(fs) code, add these two lines of code, so it looks like this:
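data-raw/mmash.R
```r
library(here)
library(fs)
# Load the functions we've created
source(here("R/functions.R"))
```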
Save the files, then add and commit the changes to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).
7.5 Split-apply-combine technique and functionals
We’re taking a quick detour to briefly talk about a concept that perfectly illustrates how vectorization and functionals fit into doing data analysis. The concept is called the split-apply-combine technique, which we covered in the beginner R course. The method is:
- Split the data into groups (e.g. diabetes status).
- Apply some analysis or statistics to each group (e.g. finding the mean of age).
- Combine the results to present them together (e.g. into a data frame that you can use to make a plot or table).
So when you split data into multiple groups, you create a list (or a vector) that you can then use (with the map functional) to apply a statistical technique to each group through vectorization. This technique works really well for a range of tasks, including for our task of summarizing some of the MMASH data so we can merge it all into one dataset.
7.6 Summarising data through functionals
Functionals and vectorization are integral components of how R works and they appear throughout many of R’s functions and packages. They are particularly used throughout the tidyverse packages like dplyr. Let’s get into some more advanced features of dplyr functions that work as functionals.
Before we continue, re-run the code for getting user_info_df and saliva_df, since you restarted the R session previously. And since we’re going to use dplyr, we need to add it as a dependency by typing this in the Console:
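```r
usethis::use_package("dplyr")
```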
There are many “verbs” in dplyr, like select(), rename(), mutate(), summarise(), and group_by() (covered in more detail in the Data Management and Wrangling session of the beginner course). These verbs are commonly used by acting on, and directly using, the column names (e.g. without " quotes around the column name, like with saliva_df |> select(cortisol_norm)). But many dplyr verbs can also take functions as input, especially when using the column selection helpers from the tidyselect package.
Likewise, with functions like summarise(), if you want to, for example, calculate the mean of cortisol in the saliva dataset, you would usually type out:
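```r
saliva_df |>
  summarise(cortisol_mean = mean(cortisol_norm))
```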
#> # A tibble: 1 × 1
#> cortisol_mean
#> <dbl>
#> 1 0.0490
Don’t know what the |> pipe is? Check out the section on it from the beginner course.
If you want to calculate the mean of multiple columns, you might think you’d have to do something like:
doc/learning.qmd
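```r
saliva_df |>
  summarise(
    cortisol_mean = mean(cortisol_norm),
    melatonin_mean = mean(melatonin_norm)
  )
```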
#> # A tibble: 1 × 2
#> cortisol_mean melatonin_mean
#> <dbl> <dbl>
#> 1 0.0490 0.00000000765
But instead, there is the across() function, which works like map() and allows you to calculate the mean across whichever columns you want. In many ways, across() is similar to map(), particularly in the arguments you give it and in the sense that it is a functional. But they are used in different settings: across() works well with columns within a dataframe and within a mutate() or summarise(), while map() is more generic.
Let’s try out some examples. To calculate the mean of cortisol_norm like we did above, we’d do:
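```r
saliva_df |>
  summarise(across(cortisol_norm, mean))
```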
#> # A tibble: 1 × 1
#> cortisol_norm
#> <dbl>
#> 1 0.0490
To calculate the mean of another column:
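```r
saliva_df |>
  summarise(across(c(cortisol_norm, melatonin_norm), mean))
```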
#> # A tibble: 1 × 2
#> cortisol_norm melatonin_norm
#> <dbl> <dbl>
#> 1 0.0490 0.00000000765
This is nice, but it would be clearer what the column contents are if the function name was added to the column names. That’s when we would use “named lists”, which are lists that look like:
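```r
# An illustrative named list: each item has a name
list(
  item_one = "contents",
  item_two = "contents"
)
```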
So, a named list with mean inside across() would look like this (run it in the Console):
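```r
list(
  mean = mean,
  average = mean,
  ave = mean
)
```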
#> $mean
#> function (x, ...)
#> UseMethod("mean")
#> <bytecode: 0x55977cc3f728>
#> <environment: namespace:base>
#> $average
#> function (x, ...)
#> UseMethod("mean")
#> <bytecode: 0x55977cc3f728>
#> <environment: namespace:base>
#> $ave
#> function (x, ...)
#> UseMethod("mean")
#> <bytecode: 0x55977cc3f728>
#> <environment: namespace:base>
Let’s stick with list(mean = mean):
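```r
saliva_df |>
  summarise(across(cortisol_norm, list(mean = mean)))
```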
#> # A tibble: 1 × 1
#> cortisol_norm_mean
#> <dbl>
#> 1 0.0490
Now, let’s collect some of the concepts from above to calculate the mean and standard deviation of all numeric columns in the saliva_df:
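```r
# where() selects columns by a predicate function, here is.numeric
saliva_df |>
  summarise(across(where(is.numeric), list(mean = mean, sd = sd)))
```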
#> # A tibble: 1 × 4
#> cortisol_norm_mean cortisol_norm_sd melatonin_norm_mean
#> <dbl> <dbl> <dbl>
#> 1 0.0490 0.0478 0.00000000765
#> # ℹ 1 more variable: melatonin_norm_sd <dbl>
We can use these concepts and code to process the other, longer datasets, like RR.csv, in a way that makes it more meaningful to eventually merge (also called “join”) them with the smaller datasets like user_info.csv or saliva.csv. Let’s work with the RR.csv dataset to eventually join it with the others.
7.7 Summarising long data like the RR dataset
With the RR dataset, each participant had almost 100,000 data points recorded over two days of collection. So if we want to join it with the other datasets, we need to calculate summary measures by at least file_path_id, and preferably also by day. In this case, we need to group_by() these two variables before summarising. In this way, we use the split-apply-combine technique. Let’s first summarise by taking the mean of ibi_s (which is the inter-beat interval in seconds).
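First, we import the RR data with the import_multiple_files() function we created, and then we group and summarise:

```r
rr_df <- import_multiple_files("RR.csv", import_rr)

rr_df |>
  group_by(file_path_id, day) |>
  summarise(across(ibi_s, list(mean = mean)))
```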
#> `summarise()` has grouped output by 'file_path_id'. You can
#> override using the `.groups` argument.
#> # A tibble: 44 × 3
#> file_path_id day ibi_s_mean
#> <chr> <dbl> <dbl>
#> 1 data-raw/mmash/user_1/RR.csv 1 0.666
#> 2 data-raw/mmash/user_1/RR.csv 2 0.793
#> 3 data-raw/mmash/user_10/RR.csv 1 0.820
#> 4 data-raw/mmash/user_10/RR.csv 2 0.856
#> 5 data-raw/mmash/user_11/RR.csv 1 0.818
#> 6 data-raw/mmash/user_11/RR.csv 2 0.923
#> 7 data-raw/mmash/user_12/RR.csv 1 0.779
#> 8 data-raw/mmash/user_12/RR.csv 2 0.883
#> 9 data-raw/mmash/user_13/RR.csv 1 0.727
#> 10 data-raw/mmash/user_13/RR.csv 2 0.953
#> # ℹ 34 more rows
While there are no missing values here, let’s add the argument na.rm = TRUE just in case. In order to add this argument to the mean, we need to wrap mean() in an anonymous function inside across():
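```r
rr_df |>
  group_by(file_path_id, day) |>
  summarise(across(ibi_s, list(mean = \(x) mean(x, na.rm = TRUE))))
```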
#> `summarise()` has grouped output by 'file_path_id'. You can
#> override using the `.groups` argument.
#> # A tibble: 44 × 3
#> file_path_id day ibi_s_mean
#> <chr> <dbl> <dbl>
#> 1 data-raw/mmash/user_1/RR.csv 1 0.666
#> 2 data-raw/mmash/user_1/RR.csv 2 0.793
#> 3 data-raw/mmash/user_10/RR.csv 1 0.820
#> 4 data-raw/mmash/user_10/RR.csv 2 0.856
#> 5 data-raw/mmash/user_11/RR.csv 1 0.818
#> 6 data-raw/mmash/user_11/RR.csv 2 0.923
#> 7 data-raw/mmash/user_12/RR.csv 1 0.779
#> 8 data-raw/mmash/user_12/RR.csv 2 0.883
#> 9 data-raw/mmash/user_13/RR.csv 1 0.727
#> 10 data-raw/mmash/user_13/RR.csv 2 0.953
#> # ℹ 34 more rows
Let’s also add standard deviation as another measure from the RR datasets:
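```r
rr_df |>
  group_by(file_path_id, day) |>
  summarise(across(ibi_s, list(
    mean = \(x) mean(x, na.rm = TRUE),
    sd = \(x) sd(x, na.rm = TRUE)
  )))
```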
#> `summarise()` has grouped output by 'file_path_id'. You can
#> override using the `.groups` argument.
#> # A tibble: 44 × 4
#> file_path_id day ibi_s_mean ibi_s_sd
#> <chr> <dbl> <dbl> <dbl>
#> 1 data-raw/mmash/user_1/RR.csv 1 0.666 0.164
#> 2 data-raw/mmash/user_1/RR.csv 2 0.793 0.194
#> 3 data-raw/mmash/user_10/RR.csv 1 0.820 0.225
#> 4 data-raw/mmash/user_10/RR.csv 2 0.856 0.397
#> 5 data-raw/mmash/user_11/RR.csv 1 0.818 0.137
#> 6 data-raw/mmash/user_11/RR.csv 2 0.923 0.182
#> 7 data-raw/mmash/user_12/RR.csv 1 0.779 0.0941
#> 8 data-raw/mmash/user_12/RR.csv 2 0.883 0.258
#> 9 data-raw/mmash/user_13/RR.csv 1 0.727 0.147
#> 10 data-raw/mmash/user_13/RR.csv 2 0.953 0.151
#> # ℹ 34 more rows
Whenever you are finished with a grouping effect, it’s good practice to end the group_by() with .groups = "drop". Let’s add it to the end:
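```r
summarised_rr_df <- rr_df |>
  group_by(file_path_id, day) |>
  summarise(
    across(ibi_s, list(
      mean = \(x) mean(x, na.rm = TRUE),
      sd = \(x) sd(x, na.rm = TRUE)
    )),
    .groups = "drop"
  )
summarised_rr_df
```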
#> # A tibble: 44 × 4
#> file_path_id day ibi_s_mean ibi_s_sd
#> <chr> <dbl> <dbl> <dbl>
#> 1 data-raw/mmash/user_1/RR.csv 1 0.666 0.164
#> 2 data-raw/mmash/user_1/RR.csv 2 0.793 0.194
#> 3 data-raw/mmash/user_10/RR.csv 1 0.820 0.225
#> 4 data-raw/mmash/user_10/RR.csv 2 0.856 0.397
#> 5 data-raw/mmash/user_11/RR.csv 1 0.818 0.137
#> 6 data-raw/mmash/user_11/RR.csv 2 0.923 0.182
#> 7 data-raw/mmash/user_12/RR.csv 1 0.779 0.0941
#> 8 data-raw/mmash/user_12/RR.csv 2 0.883 0.258
#> 9 data-raw/mmash/user_13/RR.csv 1 0.727 0.147
#> 10 data-raw/mmash/user_13/RR.csv 2 0.953 0.151
#> # ℹ 34 more rows
Ungrouping the data with the .groups = "drop" argument in the summarise() function does not provide any visual indication of what is happening. However, in the background, it removes certain metadata that the group_by() function added.
By default, using group_by() continues the grouping effect in later code, like mutate() and summarise(). Normally we would end a group_by() by using ungroup(), especially if we want to do multiple wrangling operations on the same grouping. But sometimes, especially after using summarise(), we don’t need to keep the grouping, so we can use the .groups = "drop" argument in summarise() to end it.
Before continuing, let’s run styler with the Palette (Ctrl-Shift-P, then type “style file”) and knit the Quarto document with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”) to confirm that everything runs as it should. If the knitting works, then switch to the Git interface and add and commit the changes so far with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).
7.8 Exercise: Summarise the Actigraph data
Time: 15 minutes.
Like with the RR.csv dataset, let’s process the Actigraph.csv dataset so that it is easier to join with the other datasets later. Make sure to read the warning block below.
Since the actigraph_df dataset is quite large, we strongly recommend not using View() or selecting the dataframe in the Environments pane to view it. For many computers, your R session will crash! Instead, type out glimpse(actigraph_df) or simply actigraph_df in the Console.
- Like usual, create a new Markdown header called e.g. ## Exercise: Summarise Actigraph and insert a new code chunk below that with Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type “new chunk”).
- Import all the Actigraph data files using the import_multiple_files() function you created previously. Name the new data frame actigraph_df.
- Look into the Data Description to find out what each column is for.
- Based on the documentation, which variables would you be most interested in analysing more?
- Decide which summary measure(s) you think may be most interesting for you (e.g. median(), sd(), mean(), max(), min(), var()).
- Use group_by() with the file_path_id and day variables only, then use summarise() with across() to summarise the variables you are interested in (from the item above) with the summary functions you chose. Assign the newly summarised data frame to a new data frame called summarised_actigraph_df.
- End the grouping effect with .groups = "drop" in summarise().
- Run styler while in the doc/learning.qmd file with the Palette (Ctrl-Shift-P, then type “style file”).
- Knit the doc/learning.qmd document with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”) to make sure everything works.
- Add and commit the changes you’ve made into the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).
Click for the solution. Only click if you are struggling or are out of time.
```r
actigraph_df <- import_multiple_files("Actigraph.csv", import_actigraph)

summarised_actigraph_df <- actigraph_df |>
  group_by(file_path_id, day) |>
  # These statistics will probably be different for you
  summarise(
    across(hr, list(
      mean = \(x) mean(x, na.rm = TRUE),
      sd = \(x) sd(x, na.rm = TRUE)
    )),
    .groups = "drop"
  )
```
7.9 Cleaning up and adding to the processing script
We’ll do this all together. We’ve tested out, imported, and processed two new datasets: the RR and the Actigraph datasets. First, in the Quarto document, cut the code that we used to import and process the rr_df and actigraph_df data. Then open up the data-raw/mmash.R file and paste the cut code at the bottom of the script. It should look something like this:
data-raw/mmash.R
```r
user_info_df <- import_multiple_files("user_info.csv", import_user_info)
saliva_df <- import_multiple_files("saliva.csv", import_saliva)
rr_df <- import_multiple_files("RR.csv", import_rr)
actigraph_df <- import_multiple_files("Actigraph.csv", import_actigraph)

summarised_rr_df <- rr_df |>
  group_by(file_path_id, day) |>
  summarise(
    across(ibi_s, list(
      mean = \(x) mean(x, na.rm = TRUE),
      sd = \(x) sd(x, na.rm = TRUE)
    )),
    .groups = "drop"
  )

# Code pasted here that was made from the above exercise
```
Next, go to the Quarto document and again delete everything below the setup code chunk. After it has been deleted, add and commit the changes to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).
7.10 Summary
- R is a functional programming language:
  - It uses functions that take an input, do an action, and give an output.
  - It uses vectorisation, which applies a function to multiple items (in a vector) all at once rather than using loops.
  - It uses functionals, which allow functions to use other functions as input.
- Use the purrr package and its map() function when you want to repeat a function on multiple items at once.
- Use group_by(), summarise(), and across() with .groups = "drop" in the summarise() function to apply the split-apply-combine technique when needing to do an action on groups within the data (e.g. calculate the mean age between education groups).
function to use the split-apply-combine technique when needing to do an action on groups within the data (e.g. calculate the mean age between education groups).