18  Doing many things at once with functionals

Briefly go over the bigger picture (found in the introduction section) and remind everyone of the ‘what’ and ‘why’ of what we are doing.

We’ve made a simple but robust function to help us read in our data more easily and with less code. But we still have the problem of having nearly 3000 CSV files to read in. Even if we just used our simple function, that would still be 3000 lines of copied and pasted code. That’s not very efficient and also sounds like an anxiety-inducing situation to be in. So in this session, we’ll go over using things called “functionals” to do many things at once with minimal extra code.

18.1 Learning objectives

  1. Explain what functional programming, vectorization, and functionals are within R and identify when code is a functional or uses functional programming. Then apply this knowledge using the {purrr} package’s map() function.

18.2 💬 Discussion activity: Recall and share what you’ve learned so far

Time: ~6 minutes.

A very effective way to learn is to recall and describe to someone else what you’ve learned. So before we continue this session, take some time to think about what you’ve learned from yesterday.

  1. Take 1-2 minutes and try to recall as much as you can about what you’ve done over the last two days. Without looking at your notes or the other sections of the website, try to remember things about importing and robust functions.
  2. Then, for 4-5 minutes, share with your neighbour what you remember and try to describe it to each other. Maybe you will each remember different things.

18.3 📖 Reading task: Functional programming

Go over this section briefly by reinforcing what they read, especially reinforcing the concepts shown in the image. Make sure they understand the concept of applying something to many things at once and that functionals are better coding patterns to use compared to loops. Doing the code-along should also help reinforce this concept.

Also highlight that the resources appendix has some links for continued learning for this and that the Posit {purrr} cheat sheet is an amazing resource to use.

Time: ~10 minutes.

Unlike many other programming languages, R’s primary strength and approach to programming is in functional programming. So what is “functional programming”? At its simplest, functional programming is a way of thinking about and doing programming that is declarative rather than imperative. A functional (“declarative”) way of thinking is actually how we humans intuitively think about the world. Unfortunately, many programming languages are imperative, and even the computer at its core needs imperative instructions. This is part of the reason why programming is so hard for humans.

A useful analogy is that declarative/functional programming is similar to how you talk to an adult about what you want done and letting the adult do it, e.g. “move these chairs to this room”. “Imperative” programming is how you might talk to a very young child to make sure it is done exactly as you say, without deviations from those instructions. Using the chair example, it would be like saying “take this chair, walk over to that room, drop off the chair in the corner, come back, take that next chair, walk back to the room and put it beside the other chair, …” and so on. There’s a lot of room for error because, e.g. “take” or “walk” or “drop” can mean many different things.

Functional programming, in R, is programming that:

  • Uses functions (like function()).
  • Applies actions as a function to vectors all at once (called vectorization), rather than one at a time.
    • Vectors are multiple items, like a sequence of numbers from 1 to 5, that are bundled together. For instance, a column for body weight in a dataset is a vector of numbers.
  • Can use functions as input to other functions to then output a vector (called a functional or “higher-order function”).

We’ve already covered functions. You’ve definitely already used vectorization since it is one of R’s big strengths. For instance, functions like mean(), sd(), or sum() are vectorized in that you give them a vector of numbers and they do something to all the values in the vector at once. In vectorized functions, you can give the function an entire vector (e.g. c(1, 2, 3, 4)) and R will know what to do with it. Figure 18.1 shows how a function conceptually uses vectorization.

Figure 18.1: A function using vectorization. Notice how a set of items is included all at once in the func() function and outputs a single item on the right. Modified from the Posit purrr cheat sheet. Image license is CC-BY-SA 4.0.

For example, in R, there is a vectorized function called sum() that takes the entire vector of values and outputs the total sum.

values <- 1:10
# Vectorized
sum(values)
[1] 55

In many other programming languages, you would need to use a loop to calculate the sum because the language doesn’t support vectorization. In R, a loop to calculate the sum would look like this:

total_sum <- 0
# a vector
values <- 1:10
for (value in values) {
  total_sum <- value + total_sum
}
total_sum
[1] 55
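
The examples above summarise a whole vector into a single number. Vectorization also works element by element. The sketch below uses only base R, and the names doubled and doubled_loop are made up for illustration:

```r
values <- 1:10

# Vectorized: the multiplication is applied to every element at once.
doubled <- values * 2
doubled

# The same result with a loop needs an output vector created up front
# that is then filled in one position at a time.
doubled_loop <- numeric(length(values))
for (i in seq_along(values)) {
  doubled_loop[i] <- values[i] * 2
}

# Both approaches give the same values.
identical(doubled, doubled_loop)
```

The vectorized version is a single line, while the loop needs extra bookkeeping (creating the output and tracking the position) that is easy to get wrong.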

Emphasize this next paragraph.

Writing effective and correct loops is surprisingly difficult and tricky. Because of this and because there are better and easier ways of writing R code to replace loops, we strongly recommend not using loops. If you think you need to, you probably don’t. This is why we will not be covering loops in this workshop.

A functional is R’s native way of doing loops. A functional is a function where you give it a function as one of its arguments. Figure 18.2 shows how the functional map() from the {purrr} package works by taking a vector (or list), applying a function to each of those items, and outputting the results from each function. The name map() doesn’t mean a geographic map. It uses the mathematical meaning of map: To use a function on each item in a set of items.

Figure 18.2: A functional, in this case map(), applies a function to each item in a vector. Notice how each of the green coloured boxes are placed into the func() function and outputs the same number of blue boxes as there are green boxes. Modified from the Posit purrr cheat sheet. Image license is CC-BY-SA 4.0.

Here’s a simple example to show how it works. We’ll use paste() on each item of 1:5 to simply output the same number as the input, converted to text. We will load in {purrr} to be clear which package it comes from, but {purrr} is loaded with the {tidyverse} package, so you don’t need to load it if you are using the {tidyverse} package.

library(purrr)
map(1:5, paste)
[[1]]
[1] "1"

[[2]]
[1] "2"

[[3]]
[1] "3"

[[4]]
[1] "4"

[[5]]
[1] "5"

You’ll notice that map() outputs a list, with all the [[1]] printed. A list is a specific type of object that can contain different types of objects. For example, a data frame is actually a special type of list where each column is a list item that contains the values of that column. map() will always output a list, so it is very predictable. Notice also that the paste() function is given without the () brackets. Without the brackets, the function can be used by the map() functional and treated like any other object in R. Remember how we said that a function is an action when it has () at the end. In this case, we do not want R to do the action yet, we want R to use the action on each item in the map(). In order for map() to do that, you need to give the function as an object so it can do the action later.

Tip

A useful analogy here is to imagine one person writing instructions on several pieces of paper, along with a box of things related to the instructions. This person then gives each person in a group one piece of paper and one thing from the box. Both the box and the piece of paper are objects, since no action has happened yet. But once each person gets the thing from the box and the piece of paper and reads the instructions, they will then do an action, based on the instructions, to the thing from the box.

That piece of paper is like a function without the (), and when the person reads it and follows the instructions, the function becomes an action with ().
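
The difference between a function as an object and a function as an action can be checked directly in R. This is a small sketch, assuming {purrr} is installed; the name pasted is made up for illustration:

```r
library(purrr)

# Without (), `paste` is just an object, like any other value in R.
is.function(paste)

# With (), R runs the function right away and we get its result.
paste(1)

# map() receives the function as an object and calls it on each item later.
pasted <- map(1:3, paste)

# The result is a list with one element per input item.
is.list(pasted)
length(pasted)
```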

Let’s say we wanted to paste together the number with the sentence “seconds have passed”. Normally it would look like:

paste(1, "seconds have passed")
[1] "1 seconds have passed"
paste(2, "seconds have passed")
[1] "2 seconds have passed"
paste(3, "seconds have passed")
[1] "3 seconds have passed"
paste(4, "seconds have passed")
[1] "4 seconds have passed"
paste(5, "seconds have passed")
[1] "5 seconds have passed"

Or as a loop:

for (num in 1:5) {
  sec_passed <- paste(num, "seconds have passed")
  print(sec_passed)
}
[1] "1 seconds have passed"
[1] "2 seconds have passed"
[1] "3 seconds have passed"
[1] "4 seconds have passed"
[1] "5 seconds have passed"

Using map() we can give it something called “anonymous” functions. They are “anonymous” because they aren’t named with name <- function() like we’ve done before. Anonymous functions are functions that you use once and that aren’t saved in your environment (like named functions are). As you might guess, you can make an anonymous function by simply not naming it! For instance:

function(number) {
  paste(number, "seconds have passed")
}

There is a shortened version of this using \(x):

\(number) paste(number, "seconds have passed")

Both forms of anonymous functions are equivalent and can be used in map():

map(1:5, function(number) paste(number, "seconds have passed"))
[[1]]
[1] "1 seconds have passed"

[[2]]
[1] "2 seconds have passed"

[[3]]
[1] "3 seconds have passed"

[[4]]
[1] "4 seconds have passed"

[[5]]
[1] "5 seconds have passed"
# Or with the short version
map(1:5, \(number) paste(number, "seconds have passed"))
[[1]]
[1] "1 seconds have passed"

[[2]]
[1] "2 seconds have passed"

[[3]]
[1] "3 seconds have passed"

[[4]]
[1] "4 seconds have passed"

[[5]]
[1] "5 seconds have passed"

So map() will take each number in 1:5, put it into the anonymous function in the number argument we made, and then run the paste() function on it. The output will be a list of the results of the paste() function.
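
To see that an anonymous function behaves just like a named one, the sketch below compares the two. The name add_seconds is made up for this illustration and isn’t used elsewhere in the workshop:

```r
library(purrr)

# A named function that does the same thing as the anonymous one.
add_seconds <- function(number) {
  paste(number, "seconds have passed")
}

with_named <- map(1:5, add_seconds)
with_anonymous <- map(1:5, \(number) paste(number, "seconds have passed"))

# Both give exactly the same list.
identical(with_named, with_anonymous)

# Each list item is what you'd get from calling the function on that one number.
identical(with_named[[1]], add_seconds(1))
```

The only difference is that add_seconds stays in your environment after the code runs, while the anonymous function does not.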

Note

{purrr} also supports the use of a syntax shortcut to write anonymous functions. This shortcut uses ~ (tilde) to start the function and .x as the replacement for the vector item. .x is used instead of x in order to avoid potential name collisions in functions where x is a function argument (for example in ggplot2::aes(), where x can be used to define the x-axis mapping for a graph). Here is the same example as above, now using the ~ shortcut:

map(1:5, ~ paste(.x, "seconds have passed"))
[[1]]
[1] "1 seconds have passed"

[[2]]
[1] "2 seconds have passed"

[[3]]
[1] "3 seconds have passed"

[[4]]
[1] "4 seconds have passed"

[[5]]
[1] "5 seconds have passed"

The ~ shortcut was made before \() was added to R. Now that this native R version exists, the {purrr} authors strongly recommend using it rather than their ~ for anonymous functions. The \() is also a bit clearer because you give it the name of the argument, e.g. number, rather than being forced to use .x.

map() will always output a list, but sometimes we might want to output a different data type. If we look into the help documentation with ?map, it shows several other types of map that all start with map_:

  • map_chr() outputs a character vector.
  • map_int() outputs an integer vector.
  • map_dbl() outputs a numeric vector, where the numbers are of a type called “double” in programming.
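
As a small sketch of these variants, assuming {purrr} is installed (the inputs and the names as_chr, as_int, and as_dbl are made up for illustration):

```r
library(purrr)

# Character vector instead of a list.
as_chr <- map_chr(1:3, \(number) paste(number, "seconds have passed"))
as_chr

# Integer vector: nchar() counts the characters in each word.
as_int <- map_int(c("heart", "rate"), nchar)
as_int

# Double (numeric) vector: the mean of each list item.
as_dbl <- map_dbl(list(c(1, 2), c(3, 4, 5)), mean)
as_dbl
```

Each variant returns a plain vector of that type rather than a list, which is handy when each result is a single value.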

These are the basics of functionals like map(). Functions, vectorization, and functionals provide expressive and powerful approaches to a simple task: Doing an action on each item in a set of items. And while technically using a for loop lets you “not repeat yourself”, loops tend to be more error prone and harder to write and read compared to these other tools. For some alternative explanations of this, see Section A.2.

This approach of doing an action on each item in a set of items also allows us to use another powerful “design pattern”: If you make a function to do something to one item, you can do it for all items. So when writing code, aim to build a solid, robust function for one thing, and then use functionals to very easily apply that to many things at once with nearly no extra effort. For example, do you have a dozen different models to run on the same data? Focus on building a solid, robust function that runs one model, then arrange the input to be a list of the different model formulas, and then use functionals to run all your models at once. There are many other ways of applying this thinking to designing and writing your code.
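
The model example could be sketched roughly like this. It is only an illustration, not part of the workshop project: fit_model and model_formulas are made-up names, and mtcars is a dataset built into R.

```r
library(purrr)

# A function that does one thing: fit one linear model from one formula.
fit_model <- function(model_formula) {
  lm(model_formula, data = mtcars)
}

# Arrange the inputs as a list of formulas.
model_formulas <- list(
  mpg ~ wt,
  mpg ~ wt + hp,
  mpg ~ wt + hp + cyl
)

# Then run all the models at once.
models <- map(model_formulas, fit_model)

# One fitted model per formula; here we count each model's coefficients.
n_coefficients <- map_int(models, \(model) length(coef(model)))
n_coefficients
```

Adding another model only means adding another formula to the list. The function and the map() call stay the same.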

Caution: Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩

18.4 Using functionals to import multiple datasets

But what do functionals have to do with what we are doing now? Well, we have read() that only reads in one CSV file. But we have many more! Well, we can use functionals to read them all at once 😁

Before we can use map() though, we need to have a vector or list of the file paths for all the CSV files we want to read in. For that, we can use the {fs} package, which has a function called dir_ls() that finds files in a folder (directory).

So, let’s add library(fs) to the setup code chunk. Then, go to the bottom of the docs/learning.qmd document, create a new header called ## Using map, and create a code chunk below that with Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type “new chunk”).

The dir_ls() function, which stands for “directory list”, takes the path that we want to search, in this case data-raw/nurses-stress/, an argument to search all sub-folders with recurse = TRUE, and an argument called regexp to tell it what type of files to search for. This regexp argument is short for “regular expression”, which we will cover a bit more in a later session. Regular expressions are very powerful pattern finding tools, but they can also get very complicated. Thankfully in our case, we only want to find all CSV files in the data-raw/nurses-stress/ folder that are called HR.csv.gz. So we can use that in regexp. Let’s pipe the output of the here() path to the stress folder into dir_ls() and assign it to a new variable hr_files so we can use it later.

docs/learning.qmd
hr_files <- here("data-raw/nurses-stress/") |>
  dir_ls(regexp = "HR.csv.gz", recurse = TRUE)

Let’s see what the output looks like. The paths you see on your computer will be different from what is shown here (it is a different computer). Plus to keep it simpler, on the website we will only show the first 3 files.

docs/learning.qmd
hr_files
data-raw/nurses-stress/stress/15/15_1594140175/HR.csv.gz
data-raw/nurses-stress/stress/15/15_1594149654/HR.csv.gz
data-raw/nurses-stress/stress/15/15_1594213322/HR.csv.gz
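
One thing worth knowing about regexp is that a . in a regular expression matches any character, not only a literal dot. The pattern above works for our data as it is, but the sketch below shows the difference on a throwaway folder in a temporary directory. All the file and folder names here are made up for illustration:

```r
library(fs)

# Build a small throwaway folder structure to search through.
example_dir <- path(path_temp(), "regexp-example")
dir_create(path(example_dir, c("01", "02")))
file_create(path(example_dir, "01", c("HR.csv.gz", "IBI.csv.gz")))
file_create(path(example_dir, "02", c("HR.csv.gz", "HR_csv_gz.txt")))

# "." matches any character, so this pattern also finds "HR_csv_gz.txt".
loose <- dir_ls(example_dir, regexp = "HR.csv.gz", recurse = TRUE)

# Escaping the dots and ending with "$" only finds paths ending in "HR.csv.gz".
strict <- dir_ls(example_dir, regexp = "HR\\.csv\\.gz$", recurse = TRUE)

length(loose)
length(strict)
```

The stricter pattern is only needed if a folder contains files with similar-looking names.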

Alright, we now have the paths to all the HR files ready to give to map(), so let’s read all of them into R!

docs/learning.qmd
hr_data <- hr_files |>
  map(read)

Notice that we give map() the function read without the (). That’s because we need to give the functional map() the function as an object, not as an action. After you run it, map() will take the read function as an object and apply it to each HR file as an action.

Remember that map() always outputs a list, so when we look into this object, it will give us 609 tibbles (data frames). For the website we’ll only show the first two tibbles:

docs/learning.qmd
hr_data[1:2]
$`/home/runner/work/r-cubed-intermediate/r-cubed-intermediate/data-raw/nurses-stress/stress/15/15_1594140175/HR.csv.gz`
# A tibble: 100 × 2
   collection_datetime    hr
   <dttm>              <dbl>
 1 2020-07-07 16:43:07  83.7
 2 2020-07-07 16:43:20  78.2
 3 2020-07-07 16:43:41  72.5
 4 2020-07-07 16:43:43  73.2
 5 2020-07-07 16:44:18  77.8
 6 2020-07-07 16:44:35  83.7
 7 2020-07-07 16:44:37  84.1
 8 2020-07-07 16:44:44  85.2
 9 2020-07-07 16:44:52  83.7
10 2020-07-07 16:45:02  83.1
# ℹ 90 more rows

$`/home/runner/work/r-cubed-intermediate/r-cubed-intermediate/data-raw/nurses-stress/stress/15/15_1594149654/HR.csv.gz`
# A tibble: 100 × 2
   collection_datetime    hr
   <dttm>              <dbl>
 1 2020-07-07 19:21:46 114. 
 2 2020-07-07 19:21:54 112. 
 3 2020-07-07 19:22:05 112. 
 4 2020-07-07 19:22:25 111. 
 5 2020-07-07 19:22:35 106. 
 6 2020-07-07 19:22:43 104. 
 7 2020-07-07 19:22:51 103. 
 8 2020-07-07 19:22:55 102. 
 9 2020-07-07 19:22:56 101. 
10 2020-07-07 19:23:05  97.2
# ℹ 90 more rows

Remind everyone that we still only import the first 100 rows of each data file. So if some of the data looks odd or there seems to be very little of it, that is the reason why. Remind them that we do this to more quickly prototype and test code out.

This is great because with one line of code we imported all these datasets! But they are still separate data frames inside a list, and we’re missing an important bit of information: The participant ID. The {purrr} package has many other powerful functions to make it easier to work with functionals. We know map() always outputs a list. But what we want is a single tibble at the end that also contains the participant ID.

There are two functions that take a list of data frames and convert them into a single data frame. They are called list_rbind() to bind (“stack”) the data frames by rows or list_cbind() to bind (“stack”) the data frames by columns. In our case, we want to bind (stack) by rows, so we will use list_rbind() by piping the output of the map() code we wrote into list_rbind().

docs/learning.qmd
hr_data <- hr_files |>
  map(read) |>
  list_rbind()
hr_data
# A tibble: 57,183 × 2
   collection_datetime    hr
   <dttm>              <dbl>
 1 2020-07-07 16:43:07  83.7
 2 2020-07-07 16:43:20  78.2
 3 2020-07-07 16:43:41  72.5
 4 2020-07-07 16:43:43  73.2
 5 2020-07-07 16:44:18  77.8
 6 2020-07-07 16:44:35  83.7
 7 2020-07-07 16:44:37  84.1
 8 2020-07-07 16:44:44  85.2
 9 2020-07-07 16:44:52  83.7
10 2020-07-07 16:45:02  83.1
# ℹ 57,173 more rows
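
To make the difference between the two binding functions concrete, here is a small sketch with made-up data, assuming {purrr} (version 1.0.0 or later) and {tibble} are installed:

```r
library(purrr)
library(tibble)

# Two small, made-up data frames with the same columns.
by_rows <- list(
  tibble(id = 1:2, hr = c(80, 82)),
  tibble(id = 3:4, hr = c(90, 91))
)

# list_rbind() stacks them on top of each other: 4 rows, 2 columns.
stacked <- list_rbind(by_rows)
stacked

# Two data frames with different columns but the same number of rows.
by_columns <- list(
  tibble(id = 1:2),
  tibble(hr = c(80, 82))
)

# list_cbind() places them side by side: 2 rows, 2 columns.
side_by_side <- list_cbind(by_columns)
side_by_side
```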

But, hmm, we don’t have the participant ID in the data frame. This is because list_rbind() doesn’t know how to get that information, since it isn’t included in the data frames. If we look at the help for list_rbind(), we will see that it has an argument called names_to. This argument lets us create a new column that is based on the name of the list item, which in our case is the file path. This file path has the participant ID information in it, but it also contains the rest of the full path, which isn’t exactly what we want. So we’ll have to fix it later. But first, let’s start with adding this information to the data frame as a new column called file_path_id. So we will add the names_to argument to the code we’ve already written.

docs/learning.qmd
hr_data <- hr_files |>
  map(read) |>
  list_rbind(names_to = "file_path_id")

And then look at it by running hr_data below the code:

docs/learning.qmd
hr_data
# A tibble: 57,183 × 3
   file_path_id                                collection_datetime    hr
   <chr>                                       <dttm>              <dbl>
 1 data-raw/nurses-stress/stress/15/15_159414… 2020-07-07 16:43:07  83.7
 2 data-raw/nurses-stress/stress/15/15_159414… 2020-07-07 16:43:20  78.2
 3 data-raw/nurses-stress/stress/15/15_159414… 2020-07-07 16:43:41  72.5
 4 data-raw/nurses-stress/stress/15/15_159414… 2020-07-07 16:43:43  73.2
 5 data-raw/nurses-stress/stress/15/15_159414… 2020-07-07 16:44:18  77.8
 6 data-raw/nurses-stress/stress/15/15_159414… 2020-07-07 16:44:35  83.7
 7 data-raw/nurses-stress/stress/15/15_159414… 2020-07-07 16:44:37  84.1
 8 data-raw/nurses-stress/stress/15/15_159414… 2020-07-07 16:44:44  85.2
 9 data-raw/nurses-stress/stress/15/15_159414… 2020-07-07 16:44:52  83.7
10 data-raw/nurses-stress/stress/15/15_159414… 2020-07-07 16:45:02  83.1
# ℹ 57,173 more rows
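
How names_to works is easier to see on a tiny named list. In this sketch the list names stand in for the file paths, and the paths, IDs, and values are all made up for illustration:

```r
library(purrr)
library(tibble)

# A named list: each name plays the role of a file path.
named_data <- list(
  "stress/15/HR.csv.gz" = tibble(hr = c(80, 82)),
  "stress/83/HR.csv.gz" = tibble(hr = c(90, 91))
)

# names_to puts each list item's name into a new column, repeated for each row.
combined <- list_rbind(named_data, names_to = "file_path_id")
combined
```

This works in our code because map() keeps the names of its input, so each tibble in the list is named after the file path it was read from.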

The file_path_id variable will look different on everyone’s computer. Don’t worry, we’re going to tidy up the file_path_id variable later in another session. Looking at the code and thinking about our “design pattern” of “do something to one thing, then do it for all things”, we can see that if we wanted to read in the IBI or BVP or other files found inside stress/, we would only need to make one change to the code above, which is to change the regexp argument in dir_ls() to find the files we want. So, it’s your turn to try to convert it to a function and read in the other datasets.

Before moving on to the exercise, let’s style the document with the Palette (Ctrl-Shift-P, then type “style file”), render it with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”), and finally add and commit the changes to the Git history using Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”) before pushing to GitHub.

18.5 🧑‍💻 Exercise: Convert the map with reading code into a function

Because later sessions depend on this code, walk through the solution with them after they’ve finished the exercise, so that we are all on the same page.

Time: ~15 minutes.

Tip

As you add more code and text to your docs/learning.qmd file, it will start to get longer. A helpful tip to move around a Quarto or R script more easily is to open up the “Document Outline” on the side by clicking the button in the top right corner of the Quarto pane or by using Ctrl-Shift-O or with the Palette (Ctrl-Shift-P, then type “outline”).

We’ve made some code that does what we want. Now the next step is to convert it into a function so we can use it on the other datasets within stress/. The code we’ve written only reads the HR data and so far it looks like this:

hr_files <- here("data-raw/nurses-stress/") |>
  dir_ls(regexp = "HR.csv.gz", recurse = TRUE)

hr_data <- hr_files |>
  map(read) |>
  list_rbind(names_to = "file_path_id")

hr_data

We want to be able to do the same thing for the other datasets like BVP or IBI, but we don’t want to repeat ourselves. What we want is a new function that looks like the code chunk below that we can use to import the other datasets.

hr_data <- read_all("HR.csv.gz")
ibi_data <- read_all("IBI.csv.gz")
bvp_data <- read_all("BVP.csv.gz")

Recall the structure for making functions shown below and use this template to help you get started:

___ <- function(___) {
  # Code that does something
  ___ <- ___

  ___ <- ___

  return(___)
}

Complete the tasks below to convert the code into a stable, robust function that you can re-use.

  1. We’ve used two new packages in this session, {fs} and {purrr}. First add these packages to the DESCRIPTION file by using usethis::use_package() in the Console.
  2. In the bottom of docs/learning.qmd, create a new header ## Exercise: Convert map to function and use on sleep.
  3. Below the header, create a new code chunk using Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type “new chunk”).
  4. As you’ve already done in the previous sessions, convert this code into a function. You can always refer to the Chapter 16 and Chapter 17 sessions.
    • Name the function read_all.
    • Use the function() { ... } code to create a new function.
    • Inside function(), add an argument called filename.
    • Paste the code into the function body.
    • Rename hr_files to simply files.
    • Rename hr_data to simply data.
    • Include a return() at the end.
    • Add package_name:: to the individual functions used. To find which package a function comes from, use ?functionname to see the help documentation (there are four functions to add :: to).
    • Create and write Roxygen documentation to describe the new function by using Ctrl-Shift-Alt-R or with the Palette (Ctrl-Shift-P, then type “roxygen comment”).
    • Run it and check that it works using the code shown above e.g. read_all("HR.csv.gz").
  5. Render the document with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”) to check that the function works.
  6. Cut and paste the function into the R/functions.R file. Then go back to the docs/learning.qmd file and (making sure the old function is deleted or the code chunk with it has eval: false), render again with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”) to check that the function works after moving it.
  7. Run {styler} while in the R/functions.R file with the Palette (Ctrl-Shift-P, then type “style file”).
  8. Once done, add the changes you’ve made and commit them to the Git history, using Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”) and then push to GitHub.
# A tibble: 57,183 × 3
   file_path_id                                collection_datetime    hr
   <chr>                                       <dttm>              <dbl>
 1 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:43:07  83.7
 2 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:43:20  78.2
 3 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:43:41  72.5
 4 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:43:43  73.2
 5 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:18  77.8
 6 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:35  83.7
 7 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:37  84.1
 8 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:44  85.2
 9 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:52  83.7
10 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:45:02  83.1
# ℹ 57,173 more rows
# A tibble: 45,164 × 3
   file_path_id                                collection_datetime   ibi
   <chr>                                       <dttm>              <dbl>
 1 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:46 0.703
 2 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:46:33 0.766
 3 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:48:34 0.727
 4 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 17:04:41 0.734
 5 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 17:04:48 0.672
 6 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 17:06:27 0.766
 7 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 17:06:28 0.766
 8 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 17:14:28 0.688
 9 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 17:15:17 0.688
10 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 17:34:19 0.719
# ℹ 45,154 more rows
Caution: Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩

18.6 🧑‍💻 Extra exercise: Include a new max_rows argument to read_all()

Time: ~10 minutes.

If you finished the previous exercise early, you can try this exercise, which is much harder. To make it a bit harder, we also won’t provide a scaffold.

Right now, read_all() can only read in the first 100 rows of each file, because we’ve set the default value of max_rows in read() to 100. If we wanted to read in all the data, we would have to change that default value, which is neither ideal nor robust for read_all(). Instead, it would be nice if we could set the max_rows argument in read_all(), like:

read_all("HR.csv.gz", max_rows = 1000)
read_all("HR.csv.gz", max_rows = 10)
read_all("HR.csv.gz", max_rows = Inf)

In the R/functions.R file, update the read_all() function by completing the tasks below. Use the workflow of editing, source()’ing the R/functions.R file, and testing the function in the Console.

  1. Add a new argument to the read_all() function in the function(...) called max_rows with a default value of 100.
  2. Review the section you read at the start of this session on functionals and “anonymous functions”.
  3. Use an anonymous function in the map() code to give the max_rows argument to the read() function’s max_rows argument. You will need to use the \(x) syntax for this. Instead of x, call it file to be more descriptive. In the anonymous function, you will need to use both file and max_rows in read().
  4. Test that read_all("HR.csv.gz", max_rows = 10) works (in the Console) and only reads in 10 rows of each file, by source()’ing the R/functions.R file with Ctrl-Shift-S or with the Palette (Ctrl-Shift-P, then type “source”) whenever you make a change to the read_all() function.
  5. Once it works, update the Roxygen docs by adding @param max_rows to the read_all() documentation.
  6. Render the docs/learning.qmd file to check that it works in the document with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”).
  7. Style the R/functions.R file with the Palette (Ctrl-Shift-P, then type “style file”).
  8. Lastly, commit your changes to Git with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”) and push to GitHub.
# A tibble: 6,044 × 3
   file_path_id                                collection_datetime    hr
   <chr>                                       <dttm>              <dbl>
 1 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:43:07  83.7
 2 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:43:20  78.2
 3 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:43:41  72.5
 4 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:43:43  73.2
 5 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:18  77.8
 6 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:35  83.7
 7 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:37  84.1
 8 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:44  85.2
 9 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:52  83.7
10 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:45:02  83.1
# ℹ 6,034 more rows
# A tibble: 500,181 × 3
   file_path_id                                collection_datetime    hr
   <chr>                                       <dttm>              <dbl>
 1 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:43:07  83.7
 2 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:43:20  78.2
 3 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:43:41  72.5
 4 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:43:43  73.2
 5 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:18  77.8
 6 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:35  83.7
 7 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:37  84.1
 8 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:44  85.2
 9 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:52  83.7
10 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:45:02  83.1
# ℹ 500,171 more rows
Caution: Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩

18.7 🧑‍💻 Extra exercise: Move the here::here() path out of the function

Time: ~5 minutes.

If you finished the previous exercise early, you can try this exercise, which is much harder. Like with the extra exercise above, to make it a bit harder, we won’t provide a scaffold.

In general, it isn’t good practice to hard code things in functions, as it makes them less robust to changes. For instance, if we ever renamed the data-raw/ folder or the nurses-stress/ folder, the read_all() function (and any other function that might use here::here() to those folders) would break and stop working. So instead, it is better to move those types of things out of the function and into a common location in the “global” environment (meaning outside of a function). That way, if you ever need to change the path, you only need to change the value of that one variable. So in this exercise, you will move that here::here() path out of the function and into a variable in the global environment, and then use that variable in the function. All while the function continues to work as usual.

While this approach is quite powerful when there are some variables that many functions use, it can also cause some problems, as you might accidentally create a variable with the same name as the one you used in the function. We can avoid this by using a specific naming style for “global” variables: capitalising them, and hiding them by using a . at the start of the variable name. So try it out by following these steps:

  1. Open the R/functions.R file and at the top of the file, create a new “section” with a header called # Global variables -----. The ----- is a common style to make it easier to visually see the section in the code and to make it appear in the “Document Outline” in RStudio.
  2. Below that header, create a new variable called .DATASET_DIR and assign it the value of here::here("data-raw/nurses-stress/"). The . at the start of the variable name makes it hidden in the environment and harder to accidentally change or use again. The capitalisation makes it clear that it is a “global” variable that is used inside functions and isn’t just a variable for one function.
  3. Go into the read_all() function body and replace the here::here("data-raw/nurses-stress/") with the new variable .DATASET_DIR.
  4. Test that read_all() works (in the Console) by source()’ing the R/functions.R file with Ctrl-Shift-S or with the Palette (Ctrl-Shift-P, then type “source”) and running read_all("HR.csv.gz") whenever you make a change to the read_all() function.
  5. When it works, style the R/functions.R file with the Palette (Ctrl-Shift-P, then type “style file”).
  6. Lastly, commit your changes to Git with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”) and push to GitHub.
# A tibble: 6,044 × 3
   file_path_id                                collection_datetime    hr
   <chr>                                       <dttm>              <dbl>
 1 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:43:07  83.7
 2 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:43:20  78.2
 3 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:43:41  72.5
 4 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:43:43  73.2
 5 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:18  77.8
 6 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:35  83.7
 7 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:37  84.1
 8 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:44  85.2
 9 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:52  83.7
10 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:45:02  83.1
# ℹ 6,034 more rows
# A tibble: 500,181 × 3
   file_path_id                                collection_datetime    hr
   <chr>                                       <dttm>              <dbl>
 1 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:43:07  83.7
 2 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:43:20  78.2
 3 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:43:41  72.5
 4 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:43:43  73.2
 5 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:18  77.8
 6 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:35  83.7
 7 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:37  84.1
 8 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:44  85.2
 9 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:44:52  83.7
10 /home/runner/work/r-cubed-intermediate/r-c… 2020-07-07 16:45:02  83.1
# ℹ 500,171 more rows
  1. Place the .DATASET_DIR variable at the top of R/functions.R so that it can be used by the functions below it. Use a . at the start to make it hidden, so it isn’t visible in the environment and is harder to accidentally change or use again. Capitalise it to make it clear that it is a “global” variable that is used inside functions and isn’t just a variable for one function.
  2. Replace the here::here() in the function body with the new variable .DATASET_DIR. This way, if we ever need to change the path to the data, we only need to change it in one place, which is more robust and easier to maintain.
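
To see the pattern from this exercise in isolation, here is a minimal sketch that only uses base R. It is not the course solution: the temporary folder, the small `HR.csv` file, and the `list_dataset_files()` function are hypothetical stand-ins so that the code runs without the real dataset.

```r
# Hypothetical stand-in for the real data folder: a temporary directory
# that contains one small file.
.DATASET_DIR <- tempfile("nurses-stress-")
dir.create(.DATASET_DIR)
writeLines(c("hr", "80", "82"), file.path(.DATASET_DIR, "HR.csv"))

# A hypothetical function that uses the "global" variable instead of a
# hard-coded path. R looks up `.DATASET_DIR` outside of the function
# when the function runs.
list_dataset_files <- function(pattern) {
  list.files(.DATASET_DIR, pattern = pattern, full.names = TRUE)
}

found <- list_dataset_files("HR")
basename(found)

# Variables that start with a `.` are hidden from `ls()` by default.
is_listed <- ".DATASET_DIR" %in% ls()
is_listed_with_hidden <- ".DATASET_DIR" %in% ls(all.names = TRUE)
```

If the folder ever moves, only the first line needs to change, and every function that uses `.DATASET_DIR` keeps working.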

18.8 Key takeaways

Quickly cover this and get them to do the survey before moving on to the discussion activity.

  • R is a functional programming language:
    • It has functions that take an input, do an action, and give an output.
    • It uses vectorisation, which applies a function to multiple items (in a vector) all at once rather than using loops.
    • It uses functionals that allow functions to use other functions as input.
  • Thinking in terms of declaratively telling a computer to do actions, rather than telling a computer how to do things step-by-step, is a powerful way to think about programming, especially in R.
  • Design your code by making functions that do an action to one type of object (like a data frame), putting similar types of objects (like data frames or vectors) into a list, and then using functionals like map() to do the action to all of them at once. This lets you do a lot of things effectively and efficiently with very little code.
  • Use the {purrr} package and its function map() when you want to repeat a function on multiple items at once.
  • Use list_rbind() to combine multiple data frames into one data frame by stacking them on top of each other.
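
As a small recap of the takeaways above, here is a minimal sketch that puts vectorisation, `map()`, and `list_rbind()` side by side. The `measurements` list and the `summarise_values()` function are made up for illustration and are not part of the course project.

```r
library(purrr)

# Vectorisation: `sqrt()` is applied to every item in the vector at once.
values <- c(1, 4, 9)
sqrt(values)

# A made-up function that does one action to one object.
summarise_values <- function(x) {
  data.frame(n = length(x), mean = mean(x))
}

# Put similar objects into a (named) list.
measurements <- list(
  first = c(1, 2, 3),
  second = c(10, 20)
)

# A functional: `map()` takes a function as input and applies it to
# each item, always returning a list.
summaries <- map(measurements, summarise_values)

# Stack the list of data frames into one, keeping the list names
# in a new `id` column.
combined <- list_rbind(summaries, names_to = "id")
combined
```

The same three steps (one function, one list, one functional) are what `read_all()` does with the file paths.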

18.9 💬 Discussion activity: Incorporating functionals in your own work

Time: ~6 minutes.

As we prepare for the break and the next session, get up and discuss the following questions with your neighbour or group:

  • What are some things you do repetitively that you think might benefit from using functionals?
  • Consider the code you’ve written, or the tasks you have, in other projects. How might you use functional programming to help your work? How might you change existing code to use functionals? Might it be a lot of work?

18.10 Code used in session

This lists some, but not all, of the code used in the section. Some code is incorporated into Markdown content, so is harder to automatically list here in a code chunk. The code below also includes the code from the exercises.

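# Find all the HR files in the dataset folder and its sub-folders.
hr_files <- here("data-raw/nurses-stress/") |>
  dir_ls(regexp = "HR.csv.gz", recurse = TRUE)
hr_files

# Read each file into its own data frame, giving a list of data frames.
hr_data <- hr_files |>
  map(read)
hr_data[1:2]

# Stack the list of data frames into one data frame.
hr_data <- hr_files |>
  map(read) |>
  list_rbind()
hr_data

# Keep track of which file each row came from.
hr_data <- hr_files |>
  map(read) |>
  list_rbind(names_to = "file_path_id")
hr_data
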
#' Read all `.csv.gz` files in the `nurses-stress/` folder into one data frame.
#'
#' @param filename The name of files in the sub-folders that we
#'    want to read in.
#'
#' @returns A single data frame/tibble.
#'
read_all <- function(filename) {
  files <- here::here("data-raw/nurses-stress/") |>
    fs::dir_ls(regexp = filename, recurse = TRUE)

  data <- files |>
    purrr::map(read) |>
    purrr::list_rbind(names_to = "file_path_id")

  return(data)
}

read_all("HR.csv.gz")
read_all("IBI.csv.gz")
#' Read all `.csv.gz` files in the `nurses-stress/` folder into one data frame.
#'
#' @param filename The name of files in the sub-folders that we
#'    want to read in.
#' @param max_rows The maximum number of rows to read in for each file.
#'
#' @returns A single data frame/tibble.
#'
read_all <- function(filename, max_rows = 100) {
  files <- here::here("data-raw/nurses-stress/") |>
    fs::dir_ls(regexp = filename, recurse = TRUE)

  data <- files |>
    purrr::map(\(file) read(file, max_rows = max_rows)) |>
    purrr::list_rbind(names_to = "file_path_id")

  return(data)
}

read_all("HR.csv.gz", max_rows = 10)
read_all("HR.csv.gz", max_rows = 1000)
.DATASET_DIR <- here::here("data-raw/nurses-stress/")

# In the function body of `read_all()`.
read_all <- function(filename, max_rows = 100) {
  files <- .DATASET_DIR |>
    fs::dir_ls(regexp = filename, recurse = TRUE)

  data <- files |>
    purrr::map(\(file) read(file, max_rows = max_rows)) |>
    purrr::list_rbind(names_to = "file_path_id")

  return(data)
}

read_all("HR.csv.gz", max_rows = 10)
read_all("HR.csv.gz", max_rows = 1000)
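
The `\(file) read(file, max_rows = max_rows)` pattern used in `read_all()` can also be tried without the dataset. In this minimal sketch, `first_rows()` is a hypothetical stand-in for `read()` and `tables` is made-up data. It only shows how an anonymous function lets `map()` pass an extra argument along to each call.

```r
library(purrr)

# Hypothetical stand-in for `read()`: keeps only the first `max_rows` rows.
first_rows <- function(data, max_rows = 100) {
  head(data, n = max_rows)
}

# Made-up data: a named list of two small data frames.
tables <- list(
  a = data.frame(x = 1:5),
  b = data.frame(x = 6:8)
)

# The anonymous function `\(table)` receives one item at a time and
# passes it, together with `max_rows`, on to `first_rows()`.
limited <- tables |>
  map(\(table) first_rows(table, max_rows = 2)) |>
  list_rbind(names_to = "id")
limited
```

Here `limited` has two rows from each data frame, with the `id` column showing which list item each row came from.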