Want to help out or contribute?

If you find any typos, errors, or places where the text may be improved, please let us know by providing feedback either in the feedback survey (given during class) or by using GitHub.

# 7  Save time, don’t repeat yourself: Using functionals

We will continue covering the “Workflow” block in Figure 7.1. Figure 7.1: Section of the overall workflow we will be covering.

Briefly go over the bigger picture (found in the introduction section) and remind everyone the ‘what’ and ‘why’ of what we are doing.

## 7.1 Learning objectives

1. Learn about and apply functional programming, vectorization, and functionals within R.
2. Review the split-apply-combine technique and understand the link with functional programming.
3. Apply functional programming to summarizing data and for using the split-apply-combine technique.

## 7.2 Functional programming

Go over this section briefly by reinforcing what they read, especially reinforcing the concepts shown in the image. Make sure they understand the concept of applying something to many things at once and that functionals are better coding patterns to use compared to loops. Doing the code-along should also help reinforce this concept.

Also highlight that the resources appendix has some links for continued learning for this and that the RStudio purrr cheatsheet is an amazing resource to use.

Unlike many other programming languages, R’s primary strength and approach to programming is in functional programming. So what is it? It is programming that:

• Uses functions (like `function()`).
• Applies functions to vectors all at once (called vectorisation), rather than one at a time.
• Vectors are multiple items, like a sequence of numbers from 1 to 5, that are bundled together, for instance a column for body weight in a dataset is a vector of numbers.
• Can use functions as input to other functions to then output a vector (called a functional).

We’ve already covered functions. You’ve definitely already used vectorization since it is one of R’s big strengths. For instance, functions like `mean()`, `sd()`, `sum()` are vectorized in that you give them a vector of numbers and they do something to all the values in the vector at once. In vectorized functions, you can give the function an entire vector (e.g. `c(1, 2, 3, 4)`) and R will know what to do with it. Figure 7.2 shows how a function conceptually uses vectorization. Figure 7.2: A function using vectorization. Notice how a set of items is included all at once in the func() function and outputs a single item on the right. Modified from the RStudio purrr cheatsheet.

For example, in R, there is a vectorized function called `sum()` that takes the entire vector of `values` and outputs the total sum, without needing a for loop.

``````values <- 1:10
# Vectorized
sum(values)``````
``#>  55``

As a comparison, in other programming languages, if you wanted to calculate the sum you would need a loop:

``````total_sum <- 0
# a vector
values <- 1:10
for (value in values) {
total_sum <- value + total_sum
}
total_sum``````
``#>  55``

Emphasize this next paragraph.

Writing effective and proper for loops is actually quite tricky and difficult to easily explain. Because of this and because there are better and easier ways of writing R code to replace for loops, we will not be covering loops in this course.

A functional on the other hand is a function that can also use a function as one of its arguments. Figure 7.3 shows how the functional `map()` from the purrr package works by taking a vector (or list), applying a function to each of those items, and outputting the results from each function. The name `map()` doesn’t mean a geographic map, it is the mathematical meaning of map: To use a function on each item in a set of items. Figure 7.3: A functional that uses a function to apply it to each item in a vector. Notice how each of the green coloured boxes are placed into the func() function and outputs the same number of blue boxes as there are green boxes. Modified from the RStudio purrr cheatsheet.

Here’s a simple toy example to show how it works. We’ll use `paste()` on each item of `1:5`.

``````library(purrr)
map(1:5, paste)``````
``````#> []
#>  "1"
#>
#> []
#>  "2"
#>
#> []
#>  "3"
#>
#> []
#>  "4"
#>
#> []
#>  "5"``````

You’ll notice that `map()` outputs a list, with all the `[]` printed. `map()` will always output a list. Also notice that the `paste()` function is given without the `()` brackets. Without the brackets, the function can be used by the `map()` functional and treated like any other object in R.

Let’s say we wanted to paste together the number with the sentence “seconds have passed”. Normally it would look like:

``paste(1, "seconds have passed")``
``#>  "1 seconds have passed"``
``paste(2, "seconds have passed")``
``#>  "2 seconds have passed"``
``paste(3, "seconds have passed")``
``#>  "3 seconds have passed"``
``paste(4, "seconds have passed")``
``#>  "4 seconds have passed"``
``paste(5, "seconds have passed")``
``#>  "5 seconds have passed"``

Or as a loop:

``````for (num in 1:5) {
sec_passed <- paste(num, "seconds have passed")
print(sec_passed)
}``````
``````#>  "1 seconds have passed"
#>  "2 seconds have passed"
#>  "3 seconds have passed"
#>  "4 seconds have passed"
#>  "5 seconds have passed"``````

With `map()`, we’d do this a bit differently. purrr allows us to create anonymous functions (functions that are used once and disappear after usage) to extend its capabilities. Anonymous functions are created by writing `function(x)` (the short version is `\(x)`), followed by the function definition inside of `map()`. Using `map()` with an anonymous function allows us to do more things to the input vector (e.g. `1:5`). Here is an example:

``map(1:5, function(x) paste(x, "seconds have passed"))``
``````#> []
#>  "1 seconds have passed"
#>
#> []
#>  "2 seconds have passed"
#>
#> []
#>  "3 seconds have passed"
#>
#> []
#>  "4 seconds have passed"
#>
#> []
#>  "5 seconds have passed"``````
``````# Or with the short version
map(1:5, \(x) paste(x, "seconds have passed"))``````
``````#> []
#>  "1 seconds have passed"
#>
#> []
#>  "2 seconds have passed"
#>
#> []
#>  "3 seconds have passed"
#>
#> []
#>  "4 seconds have passed"
#>
#> []
#>  "5 seconds have passed"``````

purrr supports the use of a syntax shortcut to write anonymous functions. This shortcut is using `~` (tilde) to start the function and `.x` as the replacement for the vector item. `.x` is used instead of `x` in order to avoid potential name collisions in functions where `x` is an function argument (for example in `ggplot2::aes()`, where `x` can be used to define the x-axis mapping for a graph). Here is the same example as above, now using the `~` shortcut:

``map(1:5, ~ paste(.x, "seconds have passed"))``
``````#> []
#>  "1 seconds have passed"
#>
#> []
#>  "2 seconds have passed"
#>
#> []
#>  "3 seconds have passed"
#>
#> []
#>  "4 seconds have passed"
#>
#> []
#>  "5 seconds have passed"``````

`map()` will always output a list, but sometimes we might want to output a different data type. If we look into the help documentation with `?map`, it shows several other types of map that all start with `map_`:

• `map_chr()` outputs a character vector.
• `map_int()` outputs an integer.
• `map_dbl()` outputs a numeric value, called a “double” in programming.

This is the basics of using functionals. Functions, vectorization, and functionals provide expressive and powerful approaches to a simple task: Doing an action on each item in a set of items. And while technically using a for loop lets you “not repeat yourself”, they tend to be more error prone and harder to write and read compared to these other tools. For some alternative explanations of this, see Section C.1.

But what does functionals have to do with what we are doing now? Well, our `import_user_info()` function only takes in one data file. But we have 22 files that we could load all at once if we used functionals.

Before we continue, let’s clean up the `doc/learning.qmd` file by deleting everything below the `setup` code chunk that contains the `library()` and `source()` code. Why do we delete everything? Because it keeps things cleaner and makes it easier to look through the file (both for you and for us as instructors). And because we use Git, nothing is truly gone so you can always go back to the text later. Next, we restart the R session with or with the Palette (, then type “restart”).

Next, we’ll need to add purrr as a package dependency by going to the Console and running:

``usethis::use_package("purrr")``

Since purrr is part of the tidyverse, we don’t need to load it with `library()`. The next step for using the `map()` functional is to get a vector or list of all the dataset files available to us. We will return to using the fs package, which has a function called `dir_ls()` that finds files of a certain pattern. In our case, the pattern is `user_info.csv`. So, let’s add `library(fs)` to the `setup` code chunk. Then, go to the bottom of the `doc/learning.qmd` document, create a new header called `## Using map`, and create a code chunk below that with or with the Palette (, then type “new chunk”)

The `dir_ls()` function takes the path that we want to search (`data-raw/mmash/`), uses the argument `regexp` (short for regular expression or also `regex`) to find the pattern, and `recurse` to look in all subfolders. We’ll cover regular expressions more in the next session.

``````user_info_files <- dir_ls(here("data-raw/mmash/"),
regexp = "user_info.csv",
recurse = TRUE
)``````

Then let’s see what the output looks like. For the website, we are only showing the first 3 files. Your output will look slightly different from this.

``user_info_files``
``````#> data-raw/mmash/user_1/user_info.csv
#> data-raw/mmash/user_10/user_info.csv
#> data-raw/mmash/user_11/user_info.csv``````

Alright, we now have all the files ready to give to `map()`. So let’s try it!

``user_info_list <- map(user_info_files, import_user_info)``

Remember, that `map()` always outputs a list, so when we look into this object, it will give us 22 tibbles (data.frames). Here we’ll only show the first one:

``user_info_list[]``
``````#> # A tibble: 1 × 4
#>   gender weight height   age
#>   <chr>   <dbl>  <dbl> <dbl>
#> 1 M          65    169    29``````

This is great because with one line of code we imported all these datasets! But we’re missing an important bit of information: The user ID. A powerful feature of the purrr package is that it has other functions to make working with functionals easier. We know `map()` always outputs a list. But what we want is a single data frame at the end that also contains the user ID information.

The function that will take a list and convert it into a data frame is called `list_rbind()` to bind (“stack”) by rows or `list_cbind()` to bind (“stack”) by columns. We want to bind by rows, so will use `list_rbind()` and if we look at the help for it, we see that it has an argument `names_to`. This argument lets us create a new column that sets the user ID, or in this case, the file path to the dataset, which has the user ID information in it. So, let’s use it and create a new column called `file_path_id`.

``````user_info_df <- map(user_info_files, import_user_info) %>%
list_rbind(names_to = "file_path_id")``````

Your `file_path_id` variable will look different. Don’t worry, we’re going to tidy up the `file_path_id` variable later.

``user_info_df``
``````#> # A tibble: 22 × 5
#>    file_path_id                         gender weight height   age
#>    <chr>                                <chr>   <dbl>  <dbl> <dbl>
#>  1 data-raw/mmash/user_1/user_info.csv  M          65    169    29
#>  2 data-raw/mmash/user_10/user_info.csv M          85    180    27
#>  3 data-raw/mmash/user_11/user_info.csv M         115    186    27
#>  4 data-raw/mmash/user_12/user_info.csv M          67    170    27
#>  5 data-raw/mmash/user_13/user_info.csv M          74    180    25
#>  6 data-raw/mmash/user_14/user_info.csv M          64    171    27
#>  7 data-raw/mmash/user_15/user_info.csv M          80    180    24
#>  8 data-raw/mmash/user_16/user_info.csv M          67    176    27
#>  9 data-raw/mmash/user_17/user_info.csv M          60    175    24
#> 10 data-raw/mmash/user_18/user_info.csv M          80    180     0
#> # ℹ 12 more rows``````

Now that we have this working, let’s add and commit the changes to the Git history, by using or with the Palette (, then type “commit”)

## 7.3 Exercise: Make a function for importing other datasets with functionals

Time: ~30 minutes.

We need to do basically the exact same thing for the `saliva.csv`, `RR.csv`, and `Actigraph.csv` datasets, following this format:

``````user_info_files <- dir_ls(here("data-raw/mmash/"),
regexp = "user_info.csv",
recurse = TRUE
)
user_info_df <- map(user_info_files, import_user_info) %>%
list_rbind(names_to = "file_path_id")``````

For importing the other datasets, we have to modify the code in two locations to get this code to import the other datasets: at the `regexp =` argument and at `import_user_info`. This is the perfect chance to make a function that you can use for other purposes and that is itself a functional (since it takes a function as an input). So inside `doc/learning.qmd`, convert this bit of code into a function that works to import the other three datasets.

1. Create a new header `## Exercise: Map on the other datasets` at the bottom of the document.
2. Create a new code chunk below it, using or with the Palette (, then type “new chunk”).
3. Repeat the steps you’ve taken previously to create a new function:
• Wrap the code with `function() { ... }`
• Name the function `import_multiple_files`
• Within `function()`, set two new arguments called `file_pattern` and `import_function`.
• Within the code, replace and re-write `"user_info.csv"` with `file_pattern` (this is without quotes around it) and `import_user_info` with `import_function` (also without quotes).
• Create generic intermediate objects (instead of `user_info_files` and `user_info_df`). So, replace and re-write `user_info_file` with `data_files` and `user_info_df` with `combined_data`.
• Use `return(combined_data)` at the end of the function to output the imported data frame.
• Create and write Roxygen documentation to describe the new function by using or with the Palette (, then type “roxygen comment”).
• Append `packagename::` to the individual functions (there are three packages used: fs, here, and purrr)
• Run it and check that it works on `saliva.csv`.
4. After it works, cut and paste the function into the `R/functions.R` file. Then restart the R session with or with the Palette (, then type “restart”), run the line with `source(here("R/functions.R"))` or with or with the Palette (, then type “source”), and test the code out in the Console.
5. Run styler while in the `R/functions.R` file with with the Palette (, then type “style file”).
6. Once done, add the changes you’ve made and commit them to the Git history, using or with the Palette (, then type “commit”).

Use this code as a guide to help complete this exercise:

``````___ <- ___(___, ___) {
___ <- ___dir_ls(___here("data-raw/mmash/"),
regexp = ___,
recurse = TRUE
)

___ <- ___map(___, ____) %>%
___list_rbind(names_to = "file_path_id")
___(___)
}``````
Click for the solution. Only click if you are struggling or are out of time.
``````#' Import multiple MMASH data files and merge into one data frame.
#'
#' @param file_pattern Pattern for which data file to import.
#' @param import_function Function to import the data file.
#'
#' @return A single data frame/tibble.
#'
import_multiple_files <- function(file_pattern, import_function) {
data_files <- fs::dir_ls(here::here("data-raw/mmash/"),
regexp = file_pattern,
recurse = TRUE
)

combined_data <- purrr::map(data_files, import_function) %>%
purrr::list_rbind(names_to = "file_path_id")
return(combined_data)
}

# Test on saliva in the Console
import_multiple_files("saliva.csv", import_saliva)``````

## 7.4 Adding to the processing script and clean up Quarto / R Markdown document

We’ve now made a function that imports multiple data files based on the type of data file, we can start using this function directly, like we did in the exercise above for the saliva data. We’ve already imported the `user_info_df` previously, but now we should do some tidying up of our Quarto / R Markdown file and to start updating the `data-raw/mmash.R` script. Why are we doing that? Because the Quarto / R Markdown file is only a sandbox to test code out and in the end we want a script that takes the raw data, processes it, and creates a working dataset we can use for analysis.

Like we did before, delete everything below the `setup` code chunk that contains the `library()` and `source()` code. Then, we will restart the R session with or with the Palette (, then type “restart”) and we’ll create a new code chunk below the `setup` chunk where we will use the `import_multiple_files()` function to import the user info and saliva data.

``````user_info_df <- import_multiple_files("user_info.csv", import_user_info)
saliva_df <- import_multiple_files("saliva.csv", import_saliva)``````

To test that things work, we’ll create an HTML document from our Quarto / R Markdown document by using the “Render” / “Knit” button at the top of the pane or with or with the Palette (, then type “render”). Once it creates the file, it should either pop up or open in the Viewer pane on the side. If it works, then we can move on and open up the `data-raw/mmash.R` script. Inside the script, copy and paste these two lines of code to the bottom of the script. Afterwards, go the top of the script and right below the `library(here)` code, add these two lines of code, so it looks like this:

``````library(here)
library(tidyverse)
source(here("R/functions.R"))``````

Save the files, then add and commit the changes to the Git history with or with the Palette (, then type “commit”).

## 7.5 Split-apply-combine technique and functionals

We’re taking a quick detour to briefly talk about a concept that perfectly illustrates how vectorization and functionals fit into doing data analysis. The concept is called the split-apply-combine technique, which we covered in the beginner R course. The method is:

1. Split the data into groups (e.g. diabetes status).
2. Apply some analysis or statistics to each group (e.g. finding the mean of age).
3. Combine the results to present them together (e.g. into a data frame that you can use to make a plot or table).

So when you split data into multiple groups, you make a vector that you can than apply (i.e. the map functional) some statistical technique to each group through vectorization. This technique works really well for a range of tasks, including for our task of summarizing some of the MMASH data so we can merge it all into one dataset.

## 7.6 Summarising data through functionals

Before starting this section, ask how many have used the pipe before. If everyone has, then move on. If some haven’t, very briefly explain it, but do not use much time on it since we will be using it shortly and they will see how it works then. We covered this in the introduction course, so we should not cover it again here.

Functionals and vectorization are an integral component of how R works and they appear throughout many of R’s functions and packages. They are particularly used throughout the tidyverse packages like dplyr. Let’s get into some more advanced features of dplyr functions that work as functionals. Before we continue, re-run the code for getting `user_info_df` since you had restarted the R session previously.

Since we’re going to use dplyr, we need to add it as a dependency by typing this in the Console:

``usethis::use_package("dplyr")``

There are many “verbs” in dplyr, like `select()`, `rename()`, `mutate()`, `summarise()`, and `group_by()` (covered in more detail in the Data Management and Wrangling session of the beginner course). The common usage of these verbs is through acting on and directly using the column names (e.g. without `"` quotes around the column name). For instance, to select only the `age` column, you would type out:

``````user_info_df %>%
select(age)``````
``````#> # A tibble: 22 × 1
#>      age
#>    <dbl>
#>  1    29
#>  2    27
#>  3    27
#>  4    27
#>  5    25
#>  6    27
#>  7    24
#>  8    27
#>  9    24
#> 10     0
#> # ℹ 12 more rows``````

Don’t know what the `%>%` pipe is? Check out the section on it from the beginner course.

But many dplyr verbs can also take functions as input. When you combine `select()` with the `where()` function, you can select different variables. The `where()` function is a `tidyselect` helper, a set of functions that make it easier to select variables. Some additional helper functions are listed in Table 7.1.

Table 7.1: Common tidyselect helper functions and some of their use cases.
What it selects Example Function
Select variables where a function returns TRUE Select variables that have data as character (`is.character`) `where()`
Select all variables Select all variables in `user_info_df` `everything()`
Select variables that contain the matching string Select variables that contain the string “user_info” `contains()`
Select variables that ends with string Select all variables that end with “date” `ends_with()`

Let’s select columns that are numeric:

``````user_info_df %>%
select(where(is.numeric))``````
``````#> # A tibble: 22 × 3
#>    weight height   age
#>     <dbl>  <dbl> <dbl>
#>  1     65    169    29
#>  2     85    180    27
#>  3    115    186    27
#>  4     67    170    27
#>  5     74    180    25
#>  6     64    171    27
#>  7     80    180    24
#>  8     67    176    27
#>  9     60    175    24
#> 10     80    180     0
#> # ℹ 12 more rows``````

Or, only character columns:

``````user_info_df %>%
select(where(is.character))``````
``````#> # A tibble: 22 × 2
#>    file_path_id                         gender
#>    <chr>                                <chr>
#>  1 data-raw/mmash/user_1/user_info.csv  M
#>  2 data-raw/mmash/user_10/user_info.csv M
#>  3 data-raw/mmash/user_11/user_info.csv M
#>  4 data-raw/mmash/user_12/user_info.csv M
#>  5 data-raw/mmash/user_13/user_info.csv M
#>  6 data-raw/mmash/user_14/user_info.csv M
#>  7 data-raw/mmash/user_15/user_info.csv M
#>  8 data-raw/mmash/user_16/user_info.csv M
#>  9 data-raw/mmash/user_17/user_info.csv M
#> 10 data-raw/mmash/user_18/user_info.csv M
#> # ℹ 12 more rows``````

Likewise, with functions like `summarise()`, if you want to for example calculate the mean of cortisol in the saliva dataset, you would usually type out:

``````saliva_df %>%
summarise(cortisol_mean = mean(cortisol_norm))``````
``````#> # A tibble: 1 × 1
#>   cortisol_mean
#>           <dbl>
#> 1        0.0490``````

If you want to calculate the mean of multiple columns, you might think you’d have to do something like:

``````saliva_df %>%
summarise(
cortisol_mean = mean(cortisol_norm),
melatonin_mean = mean(melatonin_norm)
)``````
``````#> # A tibble: 1 × 2
#>   cortisol_mean melatonin_mean
#>           <dbl>          <dbl>
#> 1        0.0490  0.00000000765``````

But instead, there is the `across()` function that works like `map()` and allows you to calculate the mean across which ever columns you want. In many ways, `across()` is a duplicate of `map()`, particularly in the arguments you give it.

When you look in `?across`, there are two main arguments and one optional one (plus one deprecated one):

1. `.cols` argument: Columns you want to use.
• Write column names directly and wrapped in `c()`: `c(age, weight)`.
• Write `tidyselect` helpers: `everything()`, `starts_with()`, `contains()`, `ends_with()`
• Use a function wrapped in `where()`: `where(is.numeric)`, `where(is.character)`
2. `.fns`: The function to use on the `.cols`.
• A bare function (`mean`) applies it to each column and returns the output, with the column name unchanged.

• A list with bare functions (`list(mean, sd)`) applies each function to each column and returns the output with the column name appended with a number (e.g. `cortisol_norm_1`).

• An anonymous function passed with either `~` (using `.x` to refer to the variable) or `\()` or `function()` like in `map()`. For instance, these three lines of code below mean all the same and are used to say “put age and weight, one after the other, in place of where `.x` is located” to calculate the mean for age and the mean for weight.

``````across(c(age, weight), ~ mean(.x, na.rm = TRUE))
across(c(age, weight), function(x) mean(x, na.rm = TRUE))
across(c(age, weight), \(x) mean(x, na.rm = TRUE))``````
• A named list with bare or anonymous functions (`list(average = mean, stddev = sd)`) does the same as above but instead returns an output with the column names appended with the name given to the function in the list (e.g. `cortisol_norm_average`). You can also use anonymous functions within the list:

``````list(
average = ~ mean(.x, na.rm = TRUE),
stddev = ~ sd(.x, na.rm = TRUE),
)``````
3. `...` argument (deprecated): Arguments to give to the functions in `.fns`. No longer used.
4. `.names` argument: Customize the output of the column names. We won’t cover this argument.

Go over the first two arguments again, reinforcing what they read.

Let’s try out some examples. To calculate the mean of `cortisol_norm` like we did above, we’d do:

``````saliva_df %>%
summarise(across(cortisol_norm, mean))``````
``````#> # A tibble: 1 × 1
#>   cortisol_norm
#>           <dbl>
#> 1        0.0490``````

To calculate the mean of another column:

``````saliva_df %>%
summarise(across(c(cortisol_norm, melatonin_norm), mean))``````
``````#> # A tibble: 1 × 2
#>   cortisol_norm melatonin_norm
#>           <dbl>          <dbl>
#> 1        0.0490  0.00000000765``````

This is nice, but changing the column names so that the function name is added would make reading what the column contents are clearer. That’s when we would use “named lists”, which are lists that look like:

``list(item_one_name = ..., item_two_name = ...)``

So, for having a named list with mean inside `across()`, it would look like:

``````list(mean = mean)
# or
list(average = mean)
# or
list(ave = mean)``````

You can confirm that it is a list by using the function `names()`:

``names(list(mean = mean))``
``#>  "mean"``
``names(list(average = mean))``
``#>  "average"``
``names(list(ave = mean))``
``#>  "ave"``

Let’s stick with `list(mean = mean)`:

``````saliva_df %>%
summarise(across(cortisol_norm, list(mean = mean)))``````
``````#> # A tibble: 1 × 1
#>   cortisol_norm_mean
#>                <dbl>
#> 1             0.0490``````

If we wanted to do that for all numeric columns and also calculate `sd()`:

``````saliva_df %>%
summarise(across(where(is.numeric), list(mean = mean, sd = sd)))``````
``````#> # A tibble: 1 × 4
#>   cortisol_norm_mean cortisol_norm_sd melatonin_norm_mean
#>                <dbl>            <dbl>               <dbl>
#> 1             0.0490           0.0478       0.00000000765
#> # ℹ 1 more variable: melatonin_norm_sd <dbl>``````

We can use these concepts and code to process the other longer datasets, like `RR.csv`, in a way that makes it more meaningful to eventually merge (also called “join”) them with the smaller datasets like `user_info.csv` or `saliva.csv`. Let’s work with the `RR.csv` dataset to eventually join it with the others.

## 7.7 Summarizing long data like the RR dataset

With the RR dataset, each participant had almost 100,000 data points recorded over two days of collection. So if we want to join with the other datasets, we need to calculate summary measures by at least `file_path_id` and also preferably by `day` as well. In this case, we need to `group_by()` these two variables before summarising that lets us use the split-apply-combine technique. Let’s first summarise by taking the mean of `ibi_s` (which is the inter-beat interval in seconds):

``````rr_df <- import_multiple_files("RR.csv", import_rr)
rr_df %>%
group_by(file_path_id, day) %>%
summarise(across(ibi_s, list(mean = mean)))``````
``````#> `summarise()` has grouped output by 'file_path_id'. You can
#> override using the `.groups` argument.``````
``````#> # A tibble: 44 × 3
#>    file_path_id                    day ibi_s_mean
#>    <chr>                         <dbl>      <dbl>
#>  1 data-raw/mmash/user_1/RR.csv      1      0.666
#>  2 data-raw/mmash/user_1/RR.csv      2      0.793
#>  3 data-raw/mmash/user_10/RR.csv     1      0.820
#>  4 data-raw/mmash/user_10/RR.csv     2      0.856
#>  5 data-raw/mmash/user_11/RR.csv     1      0.818
#>  6 data-raw/mmash/user_11/RR.csv     2      0.923
#>  7 data-raw/mmash/user_12/RR.csv     1      0.779
#>  8 data-raw/mmash/user_12/RR.csv     2      0.883
#>  9 data-raw/mmash/user_13/RR.csv     1      0.727
#> 10 data-raw/mmash/user_13/RR.csv     2      0.953
#> # ℹ 34 more rows``````

While there are no missing values here, let’s add the argument `na.rm = TRUE` just in case. In order add this argument to the mean, we

``````rr_df %>%
group_by(file_path_id, day) %>%
summarise(across(ibi_s, list(mean = ~ mean(.x, na.rm = TRUE))))``````
``````#> `summarise()` has grouped output by 'file_path_id'. You can
#> override using the `.groups` argument.``````
``````#> # A tibble: 44 × 3
#>    file_path_id                    day ibi_s_mean
#>    <chr>                         <dbl>      <dbl>
#>  1 data-raw/mmash/user_1/RR.csv      1      0.666
#>  2 data-raw/mmash/user_1/RR.csv      2      0.793
#>  3 data-raw/mmash/user_10/RR.csv     1      0.820
#>  4 data-raw/mmash/user_10/RR.csv     2      0.856
#>  5 data-raw/mmash/user_11/RR.csv     1      0.818
#>  6 data-raw/mmash/user_11/RR.csv     2      0.923
#>  7 data-raw/mmash/user_12/RR.csv     1      0.779
#>  8 data-raw/mmash/user_12/RR.csv     2      0.883
#>  9 data-raw/mmash/user_13/RR.csv     1      0.727
#> 10 data-raw/mmash/user_13/RR.csv     2      0.953
#> # ℹ 34 more rows``````

You might notice a message (depending on the version of dplyr you have):

```summarise()` regrouping output by 'file_path_id' (override with `.groups` argument)``

This message talks about regrouping, and overriding based on the `.groups` argument. If we look in the help `?summarise`, at the `.groups` argument, we see that this argument is currently “experimental”. At the bottom there is a message about:

In addition, a message informs you of that choice, unless the option “dplyr.summarise.inform” is set to FALSE, or when summarise() is called from a function in a package.

So how would be go about removing this message? By putting the “dplyr.summarise.inform” in the `options()` function. So, go to the `setup` code chunk at the top of the document and add this code to the top:

``options(dplyr.summarise.inform = FALSE)``

You will now no longer get the message.

Let’s also add standard deviation as another measure from the RR datasets:

``````summarised_rr_df <- rr_df %>%
group_by(file_path_id, day) %>%
summarise(across(ibi_s, list(
mean = ~ mean(.x, na.rm = TRUE),
sd = ~ sd(.x, na.rm = TRUE)
)))
summarised_rr_df``````
``````#> # A tibble: 44 × 4
#>    file_path_id                    day ibi_s_mean ibi_s_sd
#>    <chr>                         <dbl>      <dbl>    <dbl>
#>  1 data-raw/mmash/user_1/RR.csv      1      0.666   0.164
#>  2 data-raw/mmash/user_1/RR.csv      2      0.793   0.194
#>  3 data-raw/mmash/user_10/RR.csv     1      0.820   0.225
#>  4 data-raw/mmash/user_10/RR.csv     2      0.856   0.397
#>  5 data-raw/mmash/user_11/RR.csv     1      0.818   0.137
#>  6 data-raw/mmash/user_11/RR.csv     2      0.923   0.182
#>  7 data-raw/mmash/user_12/RR.csv     1      0.779   0.0941
#>  8 data-raw/mmash/user_12/RR.csv     2      0.883   0.258
#>  9 data-raw/mmash/user_13/RR.csv     1      0.727   0.147
#> 10 data-raw/mmash/user_13/RR.csv     2      0.953   0.151
#> # ℹ 34 more rows``````

Whenever you are finished with a grouping effect, it’s good practice to end the `group_by()` with `ungroup()`. Let’s add it to the end:

``````summarised_rr_df <- rr_df %>%
group_by(file_path_id, day) %>%
summarise(across(ibi_s, list(
mean = ~ mean(.x, na.rm = TRUE),
sd = ~ sd(.x, na.rm = TRUE)
))) %>%
ungroup()
summarised_rr_df``````
``````#> # A tibble: 44 × 4
#>    file_path_id                    day ibi_s_mean ibi_s_sd
#>    <chr>                         <dbl>      <dbl>    <dbl>
#>  1 data-raw/mmash/user_1/RR.csv      1      0.666   0.164
#>  2 data-raw/mmash/user_1/RR.csv      2      0.793   0.194
#>  3 data-raw/mmash/user_10/RR.csv     1      0.820   0.225
#>  4 data-raw/mmash/user_10/RR.csv     2      0.856   0.397
#>  5 data-raw/mmash/user_11/RR.csv     1      0.818   0.137
#>  6 data-raw/mmash/user_11/RR.csv     2      0.923   0.182
#>  7 data-raw/mmash/user_12/RR.csv     1      0.779   0.0941
#>  8 data-raw/mmash/user_12/RR.csv     2      0.883   0.258
#>  9 data-raw/mmash/user_13/RR.csv     1      0.727   0.147
#> 10 data-raw/mmash/user_13/RR.csv     2      0.953   0.151
#> # ℹ 34 more rows``````

Ungrouping the data with `ungroup()` does not provide any visual indication of what is happening. However, in the background, it removes certain metadata that the `group_by()` function added.

Before continuing, let’s run styler with with the Palette (, then type “style file”) and knit the Quarto / R Markdown document with or with the Palette (, then type “render”) to confirm that everything runs as it should. If the knitting works, then switch to the Git interface and add and commit the changes so far with or with the Palette (, then type “commit”).

## 7.8 Exercise: Summarise the Actigraph data

Time: 15 minutes.

Like with the `RR.csv` dataset, let’s process the `Actigraph.csv` dataset so that it makes it easier to join with the other datasets later. Make sure to read the warning block below.

1. Like usual, create a new Markdown header called e.g. `## Exercise: Summarise Actigraph` and insert a new code chunk below that with or with the Palette (, then type “new chunk”).
2. Import all the Actigraph data files using the `import_multiple_files()` function you created previously. Name the new data frame `actigraph_df`.
3. Look into the Data Description to find out what each column is for.
4. Based on the documentation, which variables would you be most interested in analyzing more?
5. Decide which summary measure(s) you think may be most interesting for you (e.g. `median()`, `sd()`, `mean()`, `max()`, `min()`, `var()`).
6. Use `group_by()` of `file_path_id` and `day`, then use `summarise()` with `across()` to summarise the variables you are interested in (from item 4 above) with the summary functions you chose. Assign the newly summarised data frame to a new data frame and call it `summarised_actigraph_df`.
7. End the grouping effect with `ungroup()`.
8. Run styler while in the `doc/learning.qmd` file with with the Palette (, then type “style file”).
9. Knit the `doc/learning.qmd` document with or with the Palette (, then type “render”) to make sure everything works.
10. Add and commit the changes you’ve made into the Git history with or with the Palette (, then type “commit”).
Warning

Since the `actigraph_df` dataset is quite large, we strongly recommend not using `View()` or selecting the dataframe in the Environments pane to view it. For many computers, your R session will crash! Instead type out `glimpse(actigraph_df)` or simply `actigraph_df` in the Console.

Click for the solution. Only click if you are struggling or are out of time.
``````actigraph_df <- import_multiple_files("Actigraph.csv", import_actigraph)
summarised_actigraph_df <- actigraph_df %>%
group_by(file_path_id, day) %>%
# These statistics will probably be different for you
summarise(across(hr, list(
mean = ~ mean(.x, na.rm = TRUE),
sd = ~ sd(.x, na.rm = TRUE)
))) %>%
ungroup()``````

## 7.9 Cleaning up and adding to the processing script

We’ll do this all together. We’ve tested out, imported, and processed two new datasets, the RR and the Actigraph datasets. First, in the R Markdown / Quarto document, cut the code that we used to import and process the `rr_df` and `actigraph_df` data. Then open up the `data-raw/mmash.R` file and paste the cut code into the bottom of the script. It should look something like this:

``````user_info_df <- import_multiple_files("user_info.csv", import_user_info)
saliva_df <- import_multiple_files("saliva.csv", import_saliva)
rr_df <- import_multiple_files("RR.csv", import_rr)
actigraph_df <- import_multiple_files("Actigraph.csv", import_actigraph)

summarised_rr_df <- rr_df %>%
group_by(file_path_id, day) %>%
summarise(across(ibi_s, list(
mean = ~ mean(.x, na.rm = TRUE),
sd = ~ sd(.x, na.rm = TRUE)
))) %>%
ungroup()

# Code pasted here that was made from the above exercise``````
Important

Do not yet `source()` or run code in this file!!

Next, go to the R Markdown / Quarto document and again delete everything below the `setup` code chunk. After it has been deleted, add and commit the changes to the Git history with or with the Palette (, then type “commit”).

## 7.10 Exercise: Brainstorm and discuss how you’d use functionals in your work

Time: 10 minutes.

As a group, discuss if you’ve ever used for loops or functionals like `map()` and your experiences with either. Discuss any advantages to using for loops over functionals and vice versa. Then, brainstorm and discuss as many ways as you can for how you might incorporate functionals like `map()`, or replace for loops with them, into your own work. Afterwards, groups will briefly share some of what they thought of before we move on to the next exercise.

## 7.11 Summary

• R is a functional programming language:
• It uses functions that take an input, do an action, and give an output.
• It uses vectorisation that apply a function to multiple items (in a vector) all at once rather than using loops.
• It uses functionals that allow functions to use other functions as input.
• Use the purrr package and its function `map()` when you want to repeat a function on multiple items at once.
• Use `group_by()`, `summarise()`, and `across()` followed by `ungroup()` to use the split-apply-combine technique when needing to do an action on groups within the data (e.g. calculate the mean age between education groups).