8 Save time, don’t repeat yourself: Using functionals
We will continue covering the “Workflow” block in Figure 8.1.
Figure 8.1: Section of the overall workflow we will be covering.
Your folder and file structure should look like this (use fs::dir_tree(recurse = 2) if you want to check using R):
LearnR3
├── data/
│ └── README.md
├── data-raw/
│ ├── README.md
│ ├── mmash-data.zip
│ ├── mmash/
│ │ ├── user_1
│ │ ├── ...
│ │ └── user_22
│ └── mmash.R
├── doc/
│ ├── README.md
│ └── lesson.Rmd
├── R/
│ ├── functions.R
│ └── README.md
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
└── README.md
For instructors: Click for details.
Briefly go over the bigger picture (found in the introduction section) and remind everyone the ‘what’ and ‘why’ of what we are doing.
8.1 Learning objectives
- Learn about and apply functional programming, vectorization, and functionals within R.
- Review the split-apply-combine technique and understand the link with functional programming.
- Apply functional programming to summarizing data and for using the split-apply-combine technique.
8.2 Functional programming
Please take 15 min to read over this section before we continue. Unlike many other programming languages, R’s primary strength and approach to programming is in functional programming. So what is it? It is programming that:
- Uses functions (like function()).
- Applies functions to vectors all at once (called vectorisation), rather than one at a time.
    - Vectors are multiple items, like a sequence of numbers from 1 to 5, that are bundled together. For instance, a column for body weight in a dataset is a vector of numbers.
- Can use functions as input to other functions to then output a vector (called a functional).
We’ve already covered functions. You’ve definitely already used vectorization since it is one of R’s big strengths. For instance, functions like mean(), sd(), and sum() are vectorized in that you give them a vector of numbers and they do something to all the values in the vector at once. In vectorized functions, you can give the function an entire vector (e.g. c(1, 2, 3, 4)) and R will know what to do with it. Figure 8.2 shows how a function conceptually uses vectorization.
Figure 8.2: A function using vectorization. Modified from the RStudio purrr cheatsheet.
For example, in R, there is a vectorized function called sum() that takes the entire vector of values and outputs the total sum, without needing a for loop.
values <- 1:10
# Vectorized
sum(values)
#> [1] 55
As a comparison, in other programming languages, if you wanted to calculate the sum you would need a loop:
total_sum <- 0
# a vector
values <- 1:10
for (value in values) {
    total_sum <- value + total_sum
}
total_sum
#> [1] 55
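Vectorization also applies to arithmetic operators, not only to summary functions like sum(). As a minimal sketch with made-up values (the objects weight_kg, height_m, and bmi are hypothetical and not part of the MMASH data), dividing one vector by another works on every pair of items at once:

```r
# Made-up body weights (kg) and heights (m) for three people.
weight_kg <- c(65, 85, 115)
height_m <- c(1.69, 1.80, 1.86)

# Both `/` and `^` are vectorized, so no loop is needed:
# each weight is divided by the square of the matching height.
bmi <- weight_kg / height_m^2
bmi
```

The result is a vector of the same length as the inputs, one value per person.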
For instructors: Click for details.
Emphasize this next paragraph.
Writing effective and proper for loops is actually quite tricky and difficult to easily explain. Because of this and because there are better and easier ways of writing R code to replace for loops, we will not be covering loops in this course.
A functional, on the other hand, is a function that can also use a function as one of its arguments. Figure 8.3 shows how the functional map() from the purrr package works by taking a vector (or list), applying a function to each of those items, and outputting the results from each function. The name map() doesn’t refer to a geographic map. It has the mathematical meaning of map: to use a function on each item in a set of items.
Figure 8.3: A functional that uses a function to apply it to each item in a vector. Modified from the RStudio purrr cheatsheet.
Here’s a simple toy example to show how it works. We’ll use paste() on each item of 1:5.
library(purrr)
map(1:5, paste)
#> [[1]]
#> [1] "1"
#>
#> [[2]]
#> [1] "2"
#>
#> [[3]]
#> [1] "3"
#>
#> [[4]]
#> [1] "4"
#>
#> [[5]]
#> [1] "5"
You’ll notice that map() outputs a list, with all the [[1]] printed. map() will always output a list. Also notice that the paste() function is given without the () brackets. Without the brackets, the function can be used by the map() functional and treated like any other object in R.
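To make the idea of a function being treated like any other object more concrete, here is a small sketch. The names my_summary and apply_to_each are made up for this illustration. They are not part of purrr or of this course’s code.

```r
library(purrr)

# Without brackets, `mean` refers to the function itself,
# so it can be assigned to a new name like any other object.
my_summary <- mean
my_summary(c(2, 4, 6))

# A hypothetical helper that takes a function (`fn`) as an argument.
# Because it accepts a function as input, it is itself a functional.
apply_to_each <- function(items, fn) {
    map(items, fn)
}

# `sum` is passed without brackets and applied to each list item.
apply_to_each(list(1:3, 4:6), sum)
```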
Let’s say we wanted to paste together the number with the sentence “seconds have passed”. Normally it would look like:
paste(1, "seconds have passed")
#> [1] "1 seconds have passed"
paste(2, "seconds have passed")
#> [1] "2 seconds have passed"
paste(3, "seconds have passed")
#> [1] "3 seconds have passed"
paste(4, "seconds have passed")
#> [1] "4 seconds have passed"
paste(5, "seconds have passed")
#> [1] "5 seconds have passed"
Or as a loop:
for (num in 1:5) {
    sec_passed <- paste(num, "seconds have passed")
    print(sec_passed)
}
#> [1] "1 seconds have passed"
#> [1] "2 seconds have passed"
#> [1] "3 seconds have passed"
#> [1] "4 seconds have passed"
#> [1] "5 seconds have passed"
With map(), we’d do this a bit differently. R allows us to create anonymous functions (functions that are used once and disappear after usage), which we can give to map() to extend its capabilities. Anonymous functions are created by writing function(x) followed by the function definition inside of map(). Using map() with an anonymous function allows us to do more things to the input vector (e.g. 1:5). Here is an example:
map(1:5, function(x) paste(x, "seconds have passed"))
#> [[1]]
#> [1] "1 seconds have passed"
#>
#> [[2]]
#> [1] "2 seconds have passed"
#>
#> [[3]]
#> [1] "3 seconds have passed"
#>
#> [[4]]
#> [1] "4 seconds have passed"
#>
#> [[5]]
#> [1] "5 seconds have passed"
As this is quite verbose, purrr supports the use of a syntax shortcut to write anonymous functions. This shortcut uses ~ (tilde) to start the function and .x as the replacement for the vector item. .x is used instead of x in order to avoid potential name collisions in functions where x is a function argument (for example in ggplot2::aes(), where x can be used to define the x-axis mapping for a graph). Here is the same example as above, now using the ~ shortcut:
map(1:5, ~paste(.x, "seconds have passed"))
#> [[1]]
#> [1] "1 seconds have passed"
#>
#> [[2]]
#> [1] "2 seconds have passed"
#>
#> [[3]]
#> [1] "3 seconds have passed"
#>
#> [[4]]
#> [1] "4 seconds have passed"
#>
#> [[5]]
#> [1] "5 seconds have passed"
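The different ways of writing the anonymous function should all give the same result. As a quick check, the sketch below compares them with identical(). The object names are made up for illustration, and the \(x) form is a base R shorthand that assumes R version 4.1.0 or later.

```r
library(purrr)

# The full anonymous function.
long_form <- map(1:3, function(x) paste(x, "seconds have passed"))

# The purrr shortcut with ~ and .x.
tilde_form <- map(1:3, ~ paste(.x, "seconds have passed"))

# The base R shorthand for function(x) (assumes R >= 4.1.0).
backslash_form <- map(1:3, \(x) paste(x, "seconds have passed"))

# All three produce the same list.
identical(long_form, tilde_form)
identical(long_form, backslash_form)
```

Since paste() is itself vectorized, paste(1:3, "seconds have passed") would give the same text as a character vector rather than a list. The map() version is shown here only because it is a simple way to see how a functional works.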
These are the basics of using functionals. Functions, vectorization, and functionals provide expressive and powerful approaches to a simple task: doing an action on each item in a set of items. And while a for loop technically lets you “not repeat yourself”, loops tend to be more error prone and harder to write and read compared to these other tools. For some alternative explanations of this, see Section D.1.
Ok, stop reading and we’ll go over this again before continuing with the coding.
For instructors: Click for details.
Go over this section briefly by reinforcing what they read. Make sure they understand the concept of applying something to many things at once. Doing the code-along should also help reinforce this concept.
Also highlight that the resources appendix has some links for continued learning for this and that the RStudio purrr cheatsheet is an amazing resource to use.
But what do functionals have to do with what we are doing now? Well, our import_user_info() function only takes in one data file. But we have 22 files that we could load all at once if we used functionals.
The first thing we have to do is add library(purrr) to the setup code chunk in the doc/lesson.Rmd document. Then we need to add the package dependency by going to the Console and running:
usethis::use_package("purrr")
Then, the next step for using the map() functional is to get a vector or list of all the dataset files available to us. We will return to using the fs package, which has a function called dir_ls() that finds files of a certain pattern. In our case, the pattern is user_info.csv. So, let’s add library(fs) to the setup code chunk. Then, go to the bottom of the doc/lesson.Rmd document, create a new header called ## Using map, and create a code chunk below that.
The dir_ls() function takes the path that we want to search (data-raw/mmash/), uses the argument regexp (short for regular expression, also called regex) to find the pattern, and uses recurse to look in all subfolders. We’ll cover regular expressions more in the next session.
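In this lesson, the call takes the form dir_ls(here("data-raw/mmash/"), regexp = "user_info.csv", recurse = TRUE), with the result saved as user_info_files (the same code is repeated in the exercise in Section 8.4). To see what these arguments do without needing the MMASH data, here is a self-contained sketch that builds a small throwaway folder first. The folder name mmash-demo and the empty files in it are made up for illustration.

```r
library(fs)

# Build a throwaway folder that mimics the layout of data-raw/mmash/.
demo_dir <- path(path_temp(), "mmash-demo")
dir_create(path(demo_dir, c("user_1", "user_2")))
file_create(path(demo_dir, c("user_1", "user_2"), "user_info.csv"))
file_create(path(demo_dir, c("user_1", "user_2"), "saliva.csv"))

# Search all subfolders (recurse = TRUE) and keep only the paths
# that match the pattern given to regexp.
demo_files <- dir_ls(demo_dir, regexp = "user_info.csv", recurse = TRUE)

# Only the two user_info.csv files are found, not the saliva.csv files.
path_file(demo_files)
```

Because regexp is a regular expression, the . in "user_info.csv" matches any character. That is fine for these file names, but a stricter pattern such as "user_info\\.csv$" is an option if other files have similar names.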
Then let’s see what the output looks like. For the website, we are only showing the first 3 files. Your output will look slightly different from this.
user_info_files
#> data-raw/mmash/user_1/user_info.csv data-raw/mmash/user_10/user_info.csv
#> data-raw/mmash/user_11/user_info.csv
Alright, we now have all the files ready to give to map(). So let’s try it!
user_info_list <- map(user_info_files, import_user_info)
Remember that map() always outputs a list, so when we look into this object, it will give us 22 tibbles (data.frames). Here we’ll only show the first one:
user_info_list[[1]]
#> # A tibble: 1 × 4
#> gender weight height age
#> <chr> <dbl> <dbl> <dbl>
#> 1 M 65 169 29
This is great because with one line of code we imported all these datasets!
But we’re missing an important bit of information: The user ID.
A powerful feature of the purrr package is that it has other functions to make working with functionals easier. We know map() always outputs a list. What if you want to output a character vector instead? If we check the help:
?map
For instructors: Click for details.
Go through this help documentation and talk a bit about it.
We see that there are other functions, including a function called map_chr() that seems to output a character vector. There are several others that give an output based on the ending of map_, such as:
- map_int() outputs an integer.
- map_dbl() outputs a numeric value, called a “double” in programming.
- map_dfr() outputs a data frame, combining the list items by row (r).
- map_dfc() outputs a data frame, combining the list items by column (c).
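As a small sketch of how the ending of map_ changes the output, the example below runs three of these variants on a made-up list called values (a hypothetical object, not from the MMASH data):

```r
library(purrr)

# A made-up named list with two integer vectors.
values <- list(a = 1:3, b = 4:6)

# map_dbl() returns a numeric (double) vector with one mean per item.
map_dbl(values, mean)

# map_int() returns an integer vector with one length per item.
map_int(values, length)

# map_chr() returns a character vector with one string per item.
map_chr(values, ~ paste(.x, collapse = "-"))
```

Each variant keeps the names of the input, and each fails with an error if the function does not return a single value of the expected type for every item.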
The map_dfr() function looks like the one we want, since we want all these datasets together as one. If we look at the help for it, we see that it has an argument .id, which we can use to create a new column that sets the user ID, or in this case, the file path to the dataset, which has the user ID information in it. So, let’s use it and create a new column called file_path_id.
user_info_df <- map_dfr(user_info_files, import_user_info,
                        .id = "file_path_id")
Your file_path_id variable will look different. Don’t worry, we’re going to tidy up the file_path_id variable later.
user_info_df
#> # A tibble: 22 × 5
#> file_path_id gender weight height age
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 data-raw/mmash/user_1/user_info.csv M 65 169 29
#> 2 data-raw/mmash/user_10/user_info.csv M 85 180 27
#> 3 data-raw/mmash/user_11/user_info.csv M 115 186 27
#> 4 data-raw/mmash/user_12/user_info.csv M 67 170 27
#> 5 data-raw/mmash/user_13/user_info.csv M 74 180 25
#> 6 data-raw/mmash/user_14/user_info.csv M 64 171 27
#> 7 data-raw/mmash/user_15/user_info.csv M 80 180 24
#> 8 data-raw/mmash/user_16/user_info.csv M 67 176 27
#> 9 data-raw/mmash/user_17/user_info.csv M 60 175 24
#> 10 data-raw/mmash/user_18/user_info.csv M 80 180 0
#> # … with 12 more rows
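The reason the file path ends up in the file_path_id column is that the .id argument uses the names of the input, and dir_ls() returns a vector whose names are the file paths. The sketch below shows the same mechanism on made-up data, where make_info() is a hypothetical stand-in for an import function and ages is a made-up named vector:

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for an import function: builds a one-row tibble.
make_info <- function(age) {
    tibble(age = age)
}

# A named vector, playing the role of the named file paths from dir_ls().
ages <- c(user_1 = 29, user_10 = 27)

# The names of the input become the values of the .id column.
toy_info_df <- map_dfr(ages, make_info, .id = "file_path_id")
toy_info_df
```

In purrr version 1.0.0 and later, map_dfr() is marked as superseded in favour of using map() followed by list_rbind(). It still works, so this lesson keeps using it.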
Now that we have this working, let’s add and commit the changes to the Git history.
8.3 Exercise: Brainstorm and discuss how you’d use functionals in your work
Time: 10 min
As a group, discuss if you’ve ever used for loops or functionals like map() and your experiences with either. Discuss any advantages to using for loops over functionals and vice versa. Then, brainstorm and discuss as many ways as you can for how you might incorporate functionals like map(), or replace for loops with them, into your own work. Afterwards, groups will briefly share some of what they thought of before we move on to the next exercise.
8.4 Exercise: Make a function for importing other datasets with functionals
Time: 25 min
We need to do basically the exact same thing for the saliva.csv, RR.csv, and Actigraph.csv datasets, following this format:
user_info_files <- dir_ls(here("data-raw/mmash/"),
                          regexp = "user_info.csv",
                          recurse = TRUE)
user_info_df <- map_dfr(user_info_files, import_user_info,
                        .id = "file_path_id")
To import the other datasets, we have to modify this code in two locations: at the regexp = argument and at import_user_info. This is the perfect chance to make a function that you can use for other purposes and that is itself a functional (since it takes a function as an input). So inside doc/lesson.Rmd, convert this bit of code into a function that works to import the other three datasets.
- Create a new header ## Exercise: Map on the other datasets at the bottom of the document.
- Create a new code chunk below it.
- Repeat the steps you’ve taken previously to create a new function:
    - Wrap the code with function() { ... }
    - Name the function import_multiple_files
    - Within function(), set two new arguments called file_pattern and import_function.
    - Within the code, replace and re-write "user_info.csv" with file_pattern (this is without quotes around it) and import_user_info with import_function (also without quotes).
    - Create generic intermediate objects (instead of user_info_files and user_info_df). So, replace and re-write user_info_files with data_files and user_info_df with combined_data.
    - Use return(combined_data) at the end of the function to output the imported data frame.
    - Create and write Roxygen documentation to describe the new function.
    - Prepend packagename:: to the individual functions (there are three packages used: fs, here, and purrr).
    - Run it and check that it works on saliva.csv.
- After it works, cut and paste the function into the R/functions.R file. Then restart the R session, run the line with source(here("R/functions.R")), and test the code out in the Console.
- Once done, add the changes you’ve made and commit them to the Git history.
Use this code as a guide to help complete this exercise:
___ <- ___(___, ___) {
    ___ <- ___dir_ls(___here("data-raw/mmash/"),
                     regexp = ___,
                     recurse = TRUE)

    ___ <- ___map_dfr(___, ___,
                      .id = "file_path_id")
    ___(___)
}
Click for the (possible) solution. Click only if you are really struggling or you are out of time for the exercise.
#' Import multiple MMASH data files and merge into one data frame.
#'
#' @param file_pattern Pattern for which data file to import.
#' @param import_function Function to import the data file.
#'
#' @return A single data frame/tibble.
#'
import_multiple_files <- function(file_pattern, import_function) {
    data_files <- fs::dir_ls(here::here("data-raw/mmash/"),
                             regexp = file_pattern,
                             recurse = TRUE)

    combined_data <- purrr::map_dfr(data_files, import_function,
                                    .id = "file_path_id")
    return(combined_data)
}
# Test on saliva in the Console
import_multiple_files("saliva.csv", import_saliva)
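If you want to try the idea without the MMASH files, the sketch below uses a variant of the function with an extra data_dir argument, so it can run on a throwaway folder. The variant name import_multiple_files_demo, the data_dir argument, the folder functional-demo, and the made-up cortisol values are all assumptions for this illustration only. Base R’s read.csv() stands in for the course’s import functions.

```r
# Variant of import_multiple_files() with a hypothetical `data_dir` argument,
# so the folder to search is not fixed to data-raw/mmash/.
import_multiple_files_demo <- function(data_dir, file_pattern, import_function) {
    data_files <- fs::dir_ls(data_dir,
                             regexp = file_pattern,
                             recurse = TRUE)

    combined_data <- purrr::map_dfr(data_files, import_function,
                                    .id = "file_path_id")
    return(combined_data)
}

# Create two made-up CSV files in a throwaway folder.
demo_dir <- fs::path(fs::path_temp(), "functional-demo")
fs::dir_create(fs::path(demo_dir, c("user_1", "user_2")))
write.csv(data.frame(cortisol = 0.05),
          fs::path(demo_dir, "user_1", "saliva.csv"), row.names = FALSE)
write.csv(data.frame(cortisol = 0.03),
          fs::path(demo_dir, "user_2", "saliva.csv"), row.names = FALSE)

# The import function (read.csv) is given without brackets.
demo_df <- import_multiple_files_demo(demo_dir, "saliva.csv", read.csv)
demo_df
```

Swapping read.csv for another import function changes how each file is read without touching the rest of the code, which is the main benefit of writing the function as a functional.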
8.5 Adding to the processing script and clean up R Markdown document
We’ll do this together.
Now that we’ve made a function that imports multiple data files based on the type of data file, we can start using this function directly, like we did in the exercise above for the saliva data. We’ve already imported the user_info_df previously, but now we should do some tidying up of our R Markdown file and start updating the data-raw/mmash.R script. Why are we doing that? Because the R Markdown file is only a sandbox to test code out and in the end we want a script that takes the raw data, processes it, and creates a working dataset we can use for analysis.
The first thing we will do is delete everything below the setup code chunk that contains the library() and source() code. Why do we delete everything? Because it keeps things cleaner and makes it easier to look through the file. And because we use Git, nothing is truly gone so you can always go back to the text later. Next, we restart the R session (Ctrl-Shift-F10). Then we’ll create a new code chunk below the setup chunk where we will use the import_multiple_files() function to import the user info and saliva data.
user_info_df <- import_multiple_files("user_info.csv", import_user_info)
saliva_df <- import_multiple_files("saliva.csv", import_saliva)
To test that things work, we’ll “Knit” our R Markdown document into HTML by using the “Knit” button at the top of the pane or with Ctrl-Shift-K. Once it creates the file, it should either pop up or open in the Viewer pane on the side. If it works, then we can move on and open up the data-raw/mmash.R script.
Inside the script, copy and paste these two lines of code to the bottom of the script. Afterwards, go to the top of the script and right below the library(here) code, add these two lines of code, so it looks like this:
Save the files, then add and commit the changes to the Git history.
8.6 Split-apply-combine technique and functionals
For instructors: Click for details.
Verbally cover this section before moving on to the summarizing. Let them know they can read more about this in this section.
We’re taking a quick detour to briefly talk about a concept that perfectly illustrates how vectorization and functionals fit into doing data analysis. The concept is called the split-apply-combine technique, which we covered in the beginner R course. The method is:
- Split the data into groups (e.g. diabetes status).
- Apply some analysis or statistics to each group (e.g. finding the mean of age).
- Combine the results to present them together (e.g. into a data frame that you can use to make a plot or table).
So when you split data into multiple groups, you make a vector of groups that you can then apply some statistical technique to (e.g. with the map functional), handling each group through vectorization. This technique works really well for a range of tasks, including for our task of summarizing some of the MMASH data so we can merge it all into one dataset.
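To make the three steps concrete, here is a minimal sketch on made-up data (the people data frame and its values are hypothetical). It first does the split and apply steps by hand with split() and map_dbl(), then does the same with group_by() and summarise(), which handle all three steps for you.

```r
library(dplyr)
library(purrr)

# Made-up data: diabetes status and age for four people.
people <- tibble(
    diabetes = c("yes", "yes", "no", "no"),
    age = c(60, 70, 40, 50)
)

# Split: one data frame per diabetes status.
groups <- split(people, people$diabetes)

# Apply and combine: the mean age of each group, combined into one vector.
mean_age <- map_dbl(groups, ~ mean(.x$age))
mean_age

# The same technique with dplyr.
mean_age_df <- people %>%
    group_by(diabetes) %>%
    summarise(mean_age = mean(age))
mean_age_df
```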
8.7 Exercise: What is the pipe?
Time: 5 min
For instructors: Click for details.
Before starting this exercise, ask how many have used the pipe before. If everyone has, then move on to the next section. If some haven’t, let the others in the group explain, but do not use much time or even demonstrate it. If they don’t know what it is, they can look it up after. We covered this in the introduction course, so we should not cover it again here.
We haven’t used the %>% pipe from the magrittr package yet, but it is used extensively in many R packages and is the foundation of tidyverse packages. The pipe changed how people write R code so much that in version 4.1 a similar operator, |>, was added to base R. To make sure everyone is aware of what the pipe is, in your groups please do one of these tasks:
- If one or more people in the group don’t know what the pipe is, take some time to talk about and explain it (if you know).
- If no one in the group knows, please read the section on it from the beginner course.
8.8 Summarising data through functionals
Functionals and vectorization are an integral component of how R works and they appear throughout many of R’s functions and packages. They are particularly used throughout the tidyverse packages like dplyr. Let’s get into some more advanced features of dplyr functions that work as functionals. Before we continue, re-run the code for getting user_info_df since you had restarted the R session previously.
There are many “verbs” in dplyr, like select(), rename(), mutate(), summarise(), and group_by() (covered in more detail in the Data Management and Wrangling session of the beginner course). The common usage of these verbs is through acting on and directly using the column names (e.g. without " quotes around the column name). For instance, to select only the age column, you would type out:
user_info_df %>%
    select(age)
#> # A tibble: 22 × 1
#> age
#> <dbl>
#> 1 29
#> 2 27
#> 3 27
#> 4 27
#> 5 25
#> 6 27
#> 7 24
#> 8 27
#> 9 24
#> 10 0
#> # … with 12 more rows
But many dplyr verbs can also take functions as input. When you combine select() with the where() function, you can select different variables. The where() function is a tidyselect helper, a set of functions that make it easier to select variables. Some additional helper functions include:
What it selects | Example | Function |
---|---|---|
Select variables where a function returns TRUE | Select variables that have data as character (is.character) | where() |
Select all variables | Select all variables in user_info_df | everything() |
Select variables that contain the matching string | Select variables that contain the string "user_info" | contains() |
Select variables that end with the matching string | Select all variables that end with "date" | ends_with() |
Let’s select columns that are numeric:
user_info_df %>%
    select(where(is.numeric))
#> # A tibble: 22 × 3
#> weight height age
#> <dbl> <dbl> <dbl>
#> 1 65 169 29
#> 2 85 180 27
#> 3 115 186 27
#> 4 67 170 27
#> 5 74 180 25
#> 6 64 171 27
#> 7 80 180 24
#> 8 67 176 27
#> 9 60 175 24
#> 10 80 180 0
#> # … with 12 more rows
Or, only character columns, by using where(is.character) in place of where(is.numeric):
#> # A tibble: 22 × 2
#> file_path_id gender
#> <chr> <chr>
#> 1 data-raw/mmash/user_1/user_info.csv M
#> 2 data-raw/mmash/user_10/user_info.csv M
#> 3 data-raw/mmash/user_11/user_info.csv M
#> 4 data-raw/mmash/user_12/user_info.csv M
#> 5 data-raw/mmash/user_13/user_info.csv M
#> 6 data-raw/mmash/user_14/user_info.csv M
#> 7 data-raw/mmash/user_15/user_info.csv M
#> 8 data-raw/mmash/user_16/user_info.csv M
#> 9 data-raw/mmash/user_17/user_info.csv M
#> 10 data-raw/mmash/user_18/user_info.csv M
#> # … with 12 more rows
Likewise, with functions like summarise(), if you want to, for example, calculate the mean of cortisol in the saliva dataset, you would usually type out:
saliva_df %>%
    summarise(cortisol_mean = mean(cortisol_norm))
#> # A tibble: 1 × 1
#> cortisol_mean
#> <dbl>
#> 1 0.0490
If you want to calculate the mean of multiple columns, you might think you’d have to do something like:
saliva_df %>%
    summarise(cortisol_mean = mean(cortisol_norm),
              melatonin_mean = mean(melatonin_norm))
#> # A tibble: 1 × 2
#> cortisol_mean melatonin_mean
#> <dbl> <dbl>
#> 1 0.0490 0.00000000765
But instead, there is the across() function that works like map() and allows you to calculate the mean across whichever columns you want. In many ways, across() is similar to map(), particularly in the arguments you give it.
Take 2 min and read through this list. When you look in ?across, there are two main arguments and two optional ones:
- .cols argument: Columns you want to use.
    - Write column names directly and wrapped in c(): c(age, weight).
    - Write tidyselect helpers: everything(), starts_with(), contains(), ends_with()
    - Use a function wrapped in where(): where(is.numeric), where(is.character)
- .fns argument: The function to use on the .cols.
    - A bare function (mean) applies it to each column and returns the output, with the column name unchanged.
    - A list with bare functions (list(mean, sd)) applies each function to each column and returns the output with the column name appended with a number (e.g. cortisol_norm_1).
    - A named list with bare functions (list(average = mean, stddev = sd)) does the same as above but instead returns an output with the column names appended with the name given to the function in the list (e.g. cortisol_norm_average).
    - A function passed with ~ and .x, like in map(). For instance, across(c(age, weight), ~ mean(.x, na.rm = TRUE)) is used to say “put age and weight, one after the other, in place of where .x is located” to calculate the mean for age and the mean for weight.
- ... argument: Arguments to give to the functions in .fns. For instance, across(age, mean, na.rm = TRUE) passes the argument to remove missingness, na.rm, into the mean() function.
- .names argument: Customize the output of the column names. We won’t cover this argument.
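As a small sketch of how the .fns forms in the list above change the output, the example below uses a made-up data frame called toy_df (hypothetical, with one missing age). It writes the functions with ~ and .x rather than passing na.rm through the ... argument, because dplyr version 1.1.0 and later deprecates the ... argument of across().

```r
library(dplyr)

# Made-up data with one missing value in age.
toy_df <- tibble(age = c(20, 30, NA), weight = c(60, 70, 80))

# One function written with ~ and .x: the column names stay unchanged.
means_df <- toy_df %>%
    summarise(across(c(age, weight), ~ mean(.x, na.rm = TRUE)))
means_df

# A named list of functions: the names in the list are appended
# to the column names (e.g. age_average).
stats_df <- toy_df %>%
    summarise(across(
        where(is.numeric),
        list(average = ~ mean(.x, na.rm = TRUE), stddev = ~ sd(.x, na.rm = TRUE))
    ))
names(stats_df)
```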
Ok, stop reading and we’ll cover this together.
For instructors: Click for details.
Go over the first two arguments again, reinforcing what they read.
Let’s try out some examples. To calculate the mean of cortisol_norm like we did above, we’d do:
saliva_df %>%
    summarise(across(cortisol_norm, mean))
#> # A tibble: 1 × 1
#> cortisol_norm
#> <dbl>
#> 1 0.0490
To calculate the mean of more than one column:
saliva_df %>%
    summarise(across(c(cortisol_norm, melatonin_norm), mean))
#> # A tibble: 1 × 2
#> cortisol_norm melatonin_norm
#> <dbl> <dbl>
#> 1 0.0490 0.00000000765
This is nice, but changing the column names so that the function name is added would make reading what the column contents are clearer. That’s when we would use “named lists”, which are lists that look like:
list(item_one_name = ..., item_two_name = ...)
So, a named list with mean inside across() would look like list(mean = mean). You can check the name given to each item in the list by using the function names():
names(list(mean = mean))
#> [1] "mean"
names(list(average = mean))
#> [1] "average"
names(list(ave = mean))
#> [1] "ave"
Let’s stick with list(mean = mean):
saliva_df %>%
    summarise(across(cortisol_norm, list(mean = mean)))
#> # A tibble: 1 × 1
#> cortisol_norm_mean
#> <dbl>
#> 1 0.0490
If we wanted to do that for all numeric columns and also calculate sd():
saliva_df %>%
    summarise(across(where(is.numeric), list(mean = mean, sd = sd)))
#> # A tibble: 1 × 4
#> cortisol_norm_mean cortisol_norm_sd melatonin_norm_mean melatonin_norm_sd
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.0490 0.0478 0.00000000765 0.00000000651
We can use these concepts and code to process the other longer datasets, like RR.csv, in a way that makes it more meaningful to eventually merge (also called “join”) them with the smaller datasets like user_info.csv or saliva.csv. Let’s work with the RR.csv dataset to eventually join it with the others.
8.9 Summarizing long data like the RR dataset
With the RR dataset, each participant had almost 100,000 data points recorded over two days of collection. So if we want to join with the other datasets, we need to calculate summary measures by at least file_path_id and also preferably by day as well. In this case, we need to group_by() these two variables before summarising, which lets us use the split-apply-combine technique. Let’s first summarise by taking the mean of ibi_s (which is the inter-beat interval in seconds):
rr_df <- import_multiple_files("RR.csv", import_rr)
rr_df %>%
    group_by(file_path_id, day) %>%
    summarise(across(ibi_s, list(mean = mean)))
#> # A tibble: 44 × 3
#> # Groups: file_path_id [22]
#> file_path_id day ibi_s_mean
#> <chr> <dbl> <dbl>
#> 1 data-raw/mmash/user_1/RR.csv 1 0.666
#> 2 data-raw/mmash/user_1/RR.csv 2 0.793
#> 3 data-raw/mmash/user_10/RR.csv 1 0.820
#> 4 data-raw/mmash/user_10/RR.csv 2 0.856
#> 5 data-raw/mmash/user_11/RR.csv 1 0.818
#> 6 data-raw/mmash/user_11/RR.csv 2 0.923
#> 7 data-raw/mmash/user_12/RR.csv 1 0.779
#> 8 data-raw/mmash/user_12/RR.csv 2 0.883
#> 9 data-raw/mmash/user_13/RR.csv 1 0.727
#> 10 data-raw/mmash/user_13/RR.csv 2 0.953
#> # … with 34 more rows
While there are no missing values here, let’s add the argument na.rm = TRUE to across(), after the list of functions, just in case.
#> # A tibble: 44 × 3
#> # Groups: file_path_id [22]
#> file_path_id day ibi_s_mean
#> <chr> <dbl> <dbl>
#> 1 data-raw/mmash/user_1/RR.csv 1 0.666
#> 2 data-raw/mmash/user_1/RR.csv 2 0.793
#> 3 data-raw/mmash/user_10/RR.csv 1 0.820
#> 4 data-raw/mmash/user_10/RR.csv 2 0.856
#> 5 data-raw/mmash/user_11/RR.csv 1 0.818
#> 6 data-raw/mmash/user_11/RR.csv 2 0.923
#> 7 data-raw/mmash/user_12/RR.csv 1 0.779
#> 8 data-raw/mmash/user_12/RR.csv 2 0.883
#> 9 data-raw/mmash/user_13/RR.csv 1 0.727
#> 10 data-raw/mmash/user_13/RR.csv 2 0.953
#> # … with 34 more rows
You might notice a message (depending on the version of dplyr you have):
`summarise()` regrouping output by 'file_path_id' (override with `.groups` argument)
Take 5 min to read this section over before we continue.
This message talks about regrouping, and overriding based on the .groups argument. If we look in the help ?summarise, at the .groups argument, we see that this argument is currently “experimental”. At the bottom there is a message about:
In addition, a message informs you of that choice, unless the option “dplyr.summarise.inform” is set to FALSE, or when summarise() is called from a function in a package.
So how would we go about removing this message? By setting “dplyr.summarise.inform” in the options() function. So, go to the setup code chunk at the top of the document and add this code to the top:
options(dplyr.summarise.inform = FALSE)
You will now no longer get the message. Please stop reading and we will continue together.
Let’s also add standard deviation as another measure from the RR datasets:
summarised_rr_df <- rr_df %>%
    group_by(file_path_id, day) %>%
    summarise(across(ibi_s, list(mean = mean, sd = sd), na.rm = TRUE))
summarised_rr_df
#> # A tibble: 44 × 4
#> # Groups: file_path_id [22]
#> file_path_id day ibi_s_mean ibi_s_sd
#> <chr> <dbl> <dbl> <dbl>
#> 1 data-raw/mmash/user_1/RR.csv 1 0.666 0.164
#> 2 data-raw/mmash/user_1/RR.csv 2 0.793 0.194
#> 3 data-raw/mmash/user_10/RR.csv 1 0.820 0.225
#> 4 data-raw/mmash/user_10/RR.csv 2 0.856 0.397
#> 5 data-raw/mmash/user_11/RR.csv 1 0.818 0.137
#> 6 data-raw/mmash/user_11/RR.csv 2 0.923 0.182
#> 7 data-raw/mmash/user_12/RR.csv 1 0.779 0.0941
#> 8 data-raw/mmash/user_12/RR.csv 2 0.883 0.258
#> 9 data-raw/mmash/user_13/RR.csv 1 0.727 0.147
#> 10 data-raw/mmash/user_13/RR.csv 2 0.953 0.151
#> # … with 34 more rows
Whenever you are finished with a grouping effect, it’s good practice to end the group_by() with ungroup(). Let’s add it to the end:
summarised_rr_df <- rr_df %>%
    group_by(file_path_id, day) %>%
    summarise(across(ibi_s, list(mean = mean, sd = sd), na.rm = TRUE)) %>%
    ungroup()
summarised_rr_df
#> # A tibble: 44 × 4
#> file_path_id day ibi_s_mean ibi_s_sd
#> <chr> <dbl> <dbl> <dbl>
#> 1 data-raw/mmash/user_1/RR.csv 1 0.666 0.164
#> 2 data-raw/mmash/user_1/RR.csv 2 0.793 0.194
#> 3 data-raw/mmash/user_10/RR.csv 1 0.820 0.225
#> 4 data-raw/mmash/user_10/RR.csv 2 0.856 0.397
#> 5 data-raw/mmash/user_11/RR.csv 1 0.818 0.137
#> 6 data-raw/mmash/user_11/RR.csv 2 0.923 0.182
#> 7 data-raw/mmash/user_12/RR.csv 1 0.779 0.0941
#> 8 data-raw/mmash/user_12/RR.csv 2 0.883 0.258
#> 9 data-raw/mmash/user_13/RR.csv 1 0.727 0.147
#> 10 data-raw/mmash/user_13/RR.csv 2 0.953 0.151
#> # … with 34 more rows
Ungrouping the data with ungroup() does not provide any visual indication of what is happening. However, in the background, it removes certain metadata that the group_by() function added.
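One way to see that metadata is with dplyr’s group_vars() function, which lists the variables a data frame is currently grouped by. The sketch below uses a made-up data frame called toy_rr (hypothetical values, not the real RR data). It also shows the .groups = "drop" option of summarise(), which removes the grouping in the same step.

```r
library(dplyr)

# Made-up data shaped like the RR data: an ID, a day, and ibi_s values.
toy_rr <- tibble(
    file_path_id = c("a", "a", "a", "b", "b", "b"),
    day = c(1, 1, 2, 1, 2, 2),
    ibi_s = c(0.6, 0.8, 0.7, 0.9, 0.5, 0.7)
)

# After summarise(), the output is still grouped by the first variable.
still_grouped <- toy_rr %>%
    group_by(file_path_id, day) %>%
    summarise(across(ibi_s, list(mean = mean)))
group_vars(still_grouped)

# After ungroup(), no grouping variables remain.
no_groups <- ungroup(still_grouped)
group_vars(no_groups)

# Alternative: drop all grouping inside summarise() itself.
dropped <- toy_rr %>%
    group_by(file_path_id, day) %>%
    summarise(across(ibi_s, list(mean = mean)), .groups = "drop")
is_grouped_df(dropped)
```

Either approach gives an ungrouped data frame. This lesson uses ungroup() because it makes the end of the grouping explicit in the pipe.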
Before continuing, let’s knit the R Markdown document to confirm that everything runs as it should. If the knitting works, then switch to the Git interface and add and commit the changes so far.
8.10 Exercise: Summarise the Actigraph data
Time: 15 min
Like with the RR.csv dataset, let’s process the Actigraph.csv dataset so that it is easier to join with the other datasets later.
- Like usual, create a new Markdown header called e.g. ## Exercise: Summarise Actigraph and insert a new code chunk below that.
- Import all the Actigraph data files using the import_multiple_files() function you created previously. Name the new data frame actigraph_df.
- Look into the Data Description to find out what each column is for.
- Based on the documentation, which variables would you be most interested in analyzing more?
- Decide which summary measure(s) you think may be most interesting for you (e.g. median(), sd(), mean(), max(), min(), var()).
- Use group_by() on file_path_id and day, then use summarise() with across() to summarise the variables you are interested in (from item 4 above) with the summary functions you chose. Assign the newly summarised data frame to a new data frame and call it summarised_actigraph_df.
- End the grouping effect with ungroup().
- Knit the doc/lesson.Rmd document to make sure everything works.
- Add and commit the changes you’ve made into the Git history.
Click for the (possible) solution. Click only if you are really struggling or you are out of time for the exercise.
8.11 Cleaning up and adding to the processing script
We’ll do this all together.
We’ve tested out, imported, and processed two new datasets, the RR and the Actigraph datasets. First, in the R Markdown document, cut the code that we used to import and process the rr_df and actigraph_df data. Then open up the data-raw/mmash.R file and paste the cut code into the bottom of the script. It should look something like this:
user_info_df <- import_multiple_files("user_info.csv", import_user_info)
saliva_df <- import_multiple_files("saliva.csv", import_saliva)
rr_df <- import_multiple_files("RR.csv", import_rr)
actigraph_df <- import_multiple_files("Actigraph.csv", import_actigraph)
summarised_rr_df <- rr_df %>%
    group_by(file_path_id, day) %>%
    summarise(across(ibi_s, list(mean = mean, sd = sd), na.rm = TRUE)) %>%
    ungroup()
# Code pasted here that was made from the above exercise
Next, go to the R Markdown document and again delete everything below the setup code chunk. After it has been deleted, add and commit the changes to the Git history.
8.12 Summary
For instructors: Click for details.
Quickly cover this before finishing the session and when starting the next session.
- R is a functional programming language:
    - It uses functions that take an input, do an action, and give an output
    - It uses vectorisation, which applies a function to multiple items (in a vector) all at once rather than using loops
    - It uses functionals that allow functions to use other functions as input
- Use the purrr package and its function map() when you want to repeat a function on multiple items at once
- Use group_by(), summarise(), and across() followed by ungroup() to use the split-apply-combine technique when needing to do an action on groups within the data (e.g. calculate the mean age between education groups)