16  Bundling code into functions

We’ve written simple code that reads in one of our data files and does some basic cleaning of it. But we have about 2000 CSV files to read in. So even with our simple code, it isn’t going to be very efficient if we just copy and paste that code to read in all those files. This session is about bundling our code into functions that will make it easier to read more files with less code.

16.1 Learning objectives

  1. Describe and identify the individual components of a function as well as the workflow for creating one, and then use that workflow to create a function that imports data.
  2. Identify and describe some basic principles and patterns for writing general-purpose and reusable code and functions.

16.2 📖 Reading task: The basics of a function

Repeat and reinforce what functions are made of and how they are structured, that all actions are functions, and that all functions are objects (but not all objects are functions).

Time: ~5 minutes.

The first thing to know about R is that everything is an “object” and that some objects can do an action. These objects that do an action are called functions. You’ve heard or read about functions before during the workshop, but what is a function?

A function is a bundled sequence of steps that achieve a specific action, and you can usually tell if an object is an action if it has a () at the end of its name. For example, mean() is a function to calculate the mean and sd() is a function to calculate the standard deviation. It isn’t always true that functions end in () though, which you’ll read about shortly. For instance, the + is a function that adds two numbers together, the [] is a function that is used to subset or extract an item from a list of items, like getting a column from a data frame, and <- is a function that creates a new object from some value.

All created functions have the same structure: they are assigned a function name with <-, they use function() to define their parameters (also called arguments), and they have an internal sequence of steps, called the function body, wrapped in {}:

Console
function_name <- function(argument1, argument2) {
  # body of function with R code
}

Notice that this uses two functions to create this function_name function object:

  • <- is the action (function) that will create the new object function_name.
  • function() is the action (function) to tell R that this object is an action (function) whenever it is used with a () at the end, e.g. function_name().
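
As a small sketch of the idea that a function is an object that can also do an action, the code below creates a tiny function and inspects it. The name say_hello is hypothetical and only used for illustration.

```r
# A hypothetical function with no arguments, used only for illustration.
say_hello <- function() {
  return("hello")
}

# The name without () refers to the function object itself.
is.function(say_hello) # TRUE
class(say_hello) # "function"

# Adding () runs the action stored in the object.
say_hello() # "hello"

# Not every object is a function: a number is an object but not an action.
is.function(3) # FALSE
```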

Because R is open source, anyone can see how things work underneath. So, if you want to see what a function does underneath, you would type out the function name without the () into the Console and run it. If we do it with the function sd() which calculates the standard deviation, we see:

Console
sd
function (x, na.rm = FALSE) 
sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
    na.rm = na.rm))
<bytecode: 0x55e492e98a58>
<environment: namespace:stats>

Here you see that sd() has the arguments x and na.rm. The function body shows how it calculates the standard deviation, which is the square root of the variance. In this code, var() is inside the sqrt() function, which is exactly what it should be if you know the math (you don’t need to).
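
If you want to check this relationship yourself, a minimal sketch is to compare sd() with the square root of var() on some numbers. The values below are made up purely for illustration.

```r
# Example values chosen only for illustration.
values <- c(2, 4, 4, 4, 5, 5, 7, 9)

# The built-in function and the "by hand" version from its function body.
sd_builtin <- sd(values)
sd_by_hand <- sqrt(var(values))

# Compare the two results (all.equal() allows for tiny rounding differences).
all.equal(sd_builtin, sd_by_hand) # TRUE
```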

Normally you can tell if something is a function if it has () at the end of the name. But there are special functions, like +, [], or even <-, that do an action but don’t have () at the end. These are called operator functions, which is why <- is called the assignment operator: it assigns something to a new object. To see how they work internally, you would wrap the operator in backticks (`). So for + or <- it would be:

Console
`+`
function (e1, e2)  .Primitive("+")
`<-`
.Primitive("<-")

You’ll see something called .Primitive. Operators are often written in very low-level computer code, called “primitives”. Explaining primitives is well beyond the scope of this workshop, so we won’t go into what that means or why.
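
Without going into what primitives are, you can at least ask R whether something is one. The sketch below uses the base R helpers is.primitive() and is.function().

```r
# Operators are primitives...
is.primitive(`+`) # TRUE
is.primitive(`<-`) # TRUE

# ...while sd() is written in regular R code, as seen above.
is.primitive(sd) # FALSE

# Either way, they are all functions.
is.function(`+`) # TRUE
is.function(sd) # TRUE
```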

To show that they are a function, you can even use them with their () version like this:

1 + 2
[1] 3
`+`(1, 2)
[1] 3
`<-`(x, 1)
x
[1] 1
x <- 1
x
[1] 1

But hopefully you can see that using it with the () isn’t very nice to read or use!
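
The same is true for [], and the backtick form can be handy in one situation: when you give an operator to another function. The sketch below uses a made-up vector called numbers.

```r
# Made-up values, only for illustration.
numbers <- c(10, 20, 30)

# Subsetting the usual way and with the function form of `[`.
numbers[2] # 20
`[`(numbers, 2) # 20

# Passing the `+` function to other functions.
Reduce(`+`, numbers) # 60, adds all the values together
sapply(numbers, `+`, 1) # 11 21 31, adds 1 to each value
```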

If you can learn to make your own functions, it can help make your life and work much easier and more efficient! That’s because you can make a sequence of actions that you can then reuse again and again. And luckily, you will be making many functions throughout this workshop. Making a function always follows a basic structure:

  1. Give a name to the function (e.g. mean).
  2. Use function() to tell R the new object will be a function and assign it to the name with <-.
  3. Optionally provide arguments to the function object, for example function(argument1, argument2, argument3).
  4. Fill out the body of the function, with the arguments (if any) contained inside, that does some sequence of actions.
  5. Optionally, use return() to indicate what final output you want the function to have. For learning purposes, we’ll always use return() to help show us what is the final function output but it isn’t necessary.
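
To illustrate why return() is optional, here is a minimal sketch with two hypothetical functions (add_explicit and add_implicit are names made up for this example). Without return(), R outputs the value of the last expression in the function body.

```r
# Uses return() to state the output explicitly.
add_explicit <- function(num1, num2) {
  added <- num1 + num2
  return(added)
}

# No return(): the last expression in the body becomes the output.
add_implicit <- function(num1, num2) {
  num1 + num2
}

# Both give the same output.
identical(add_explicit(1, 2), add_implicit(1, 2)) # TRUE
```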

Emphasize that we will be using this workflow for creating functions all the time throughout the workshop and that this workflow is also what you’d use in your daily work.

While there is no minimum or maximum number of arguments you can provide for a function (e.g. you could have zero or dozens of arguments), it’s generally good practice and design to have as few arguments as necessary to get the job done. Part of making functions is to reduce your own and others’ cognitive load when working with or reading code. The fewer arguments you use, the lower the cognitive load. So, the structure is:

name <- function(argument1, argument2) {
    # body of function
    output <- ... code ....
    return(output)
}

Writing your own functions can be absolutely amazing and fun and powerful, but you will also often want to pull your hair out with frustration at errors that are difficult to understand and fix. One of the best ways to deal with this is by making functions that are small and simple, and testing them as you use them. The smaller they are, the lower the chance of an error or issue that you can’t figure out. There are also some formal debugging steps you can do, but we won’t be going over debugging, both because of time and because meaningful debugging exercises are hard to make: solutions to problems are very dependent on the project and context. There is some extra material in Appendix A that you can look over in your own time that has some instructions on debugging and dealing with some common problems you might encounter with R.

Caution: Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩

16.3 Creating our first function

Take your time slowly going over this, especially when talking about the Roxygen documentation template.

Let’s create a really basic function to show the process. First, create a new Markdown header called ## Making a function to add numbers and create a code chunk below that with Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type “new chunk”). Then, inside the code chunk, we’ll write this code:

docs/learning.qmd
add_numbers <- function(num1, num2) {
  added <- num1 + num2
  return(added)
}

You can use the new function by running the above code and writing out your new function, with arguments to give it.

docs/learning.qmd
add_numbers(1, 2)
[1] 3
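
One simple way to test a small function as you use it is to try a few inputs where you already know the answer. The sketch below uses base R’s stopifnot(), which stays silent when a check passes and gives an error when it fails. The inputs are made up for illustration.

```r
add_numbers <- function(num1, num2) {
  added <- num1 + num2
  return(added)
}

# Each check is silent if the result is what we expect.
stopifnot(add_numbers(1, 2) == 3)
stopifnot(add_numbers(-1, 1) == 0)
stopifnot(add_numbers(0.5, 0.25) == 0.75)

# Because + works on whole vectors, so does our function.
add_numbers(c(1, 2), c(10, 20)) # 11 22
```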

The function name is fairly good; add_numbers is read as “add numbers”. While we generally want code that explains what it does just by reading it, it’s also good practice to add some formal documentation to the function. Use “Insert Roxygen Skeleton” in the “Code” menu, either by typing Ctrl-Shift-Alt-R or with the Palette (Ctrl-Shift-P, then type “roxygen comment”), to add template documentation right above the function. Make sure your cursor is within the function so that the Roxygen template is added to it. It looks like:

docs/learning.qmd
#' Title
#'
#' @param num1
#' @param num2
#'
#' @returns
#' @export
#'
#' @examples
add_numbers <- function(num1, num2) {
  added <- num1 + num2
  return(added)
}

The Title area is where you type out a brief sentence or several words that describe the function. Creating a new paragraph below this line allows you to add a more detailed description. The other items are:

  • @param num: These lines describe what each argument (also called parameter) is for and what to give it.
  • @returns: This describes what output the function gives. Is it a data.frame? A plot? What else does the output give?
  • @export: Tells R that this function should be accessible to the user of your package. Since we aren’t making packages, delete it.
  • @examples: Any lines below this are used to show examples of how to use the function. This is very useful when making packages, but not really in this case. So we’ll delete it.

Let’s write out some documentation for this function:

docs/learning.qmd
#' Add two numbers together.
#'
#' @param num1 A number here.
#' @param num2 A number here.
#'
#' @returns Returns the sum of the two numbers.
#'
add_numbers <- function(num1, num2) {
  added <- num1 + num2
  return(added)
}

Once we’ve created that and before moving on, let’s style our code with the Palette (Ctrl-Shift-P, then type “style file”), render the Quarto document with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”), and then open up the Git Interface with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”) to add and commit these changes to the Git history before then pushing to GitHub.

16.4 📖 Reading task: Workflow for prototyping and creating functions

Highlight the workflow and diagram. Reinforce this workflow and that we will be using it all throughout this workshop.

Time: ~6 minutes.

At the level of the code, the way you prototype code is to:

  1. Write the code out in Quarto so that it does what you want.
  2. Convert that code into a function.
  3. Test that the function works either in the Quarto document or in the R Console.
  4. Fix the function if it doesn’t work.
  5. Restart the R console with Ctrl-Shift-F10 or with the Palette (Ctrl-Shift-P, then type “restart”) or render with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”) to test that the function works in a clean environment.
  6. Whenever the function works, add and commit the changes to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”) (or commit after you move the function to the R/functions.R script, which we will talk about in the next session).
flowchart TD
    code("Write code<br>in Quarto") --> as_function("Convert to<br>function")
    as_function --> test("Test function in<br>Quarto or Console")
    test --> fix_function("Fix function")
    test -- "Commit" --> git("Git history")
    fix_function --> test
    fix_function --> either([Either:])
    either -. As needed .-> restart("Restart R<br>session")
    either -. As needed .-> render("Render<br>Quarto")
    restart -.-> test
    render -.-> test
Figure 16.1: Workflow for prototyping code in Quarto, converting to a function, testing it, rendering or restarting, and committing to Git.

Restarting R or rendering the Quarto document are the only ways to be certain that the R workspace is in a clean state. When code runs successfully from a clean state, it is more likely that your code and project are reproducible.
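
To see why a clean state matters, here is a minimal sketch of a function that accidentally depends on an object created earlier in the session. The names scale_value and hypothetical_multiplier are made up for this example, and rm() is only used here to mimic what a restart or render would reveal.

```r
# Hypothetical function that accidentally depends on an object outside itself.
scale_value <- function(x) {
  x * hypothetical_multiplier
}

# In a session where the object was created earlier, the function seems fine.
hypothetical_multiplier <- 10
first_result <- scale_value(2)
first_result # 20

# Removing the object mimics starting from a clean state.
rm(hypothetical_multiplier)
second_result <- tryCatch(
  scale_value(2),
  error = function(e) "error: object not found"
)
second_result # "error: object not found"
```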

We use Git because it is the best way of keeping track of what was done to your files, when, and why. It helps to keep your work transparent and makes it easier for you to share your code by uploading to GitHub. Using version control should be a standard practice for doing better science since it fits with the philosophy of doing science (e.g., transparency, reproducibility, and documentation).

Caution: Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩

16.5 Making a function for importing our data

Now that we have a basic understanding of what a function looks like, let’s apply that to what we’re doing right now: Importing our data.

Emphasize that we will be using this workflow for creating functions all the time throughout the workshop and that it is also a common workflow when making functions in R.

While you read about the general workflow above, the more detailed steps for making a function are to:

  1. Write code that works and does what you want.
  2. Enclose it as a function with name <- function() { ... }, with an appropriate and descriptive name.
  3. Create arguments in the function call (function(argument1, argument2)) with appropriate and descriptive names, then replace the code in the function body with the argument names where appropriate.
  4. Rename any objects created to be more generic and include the return() function at the end to indicate what the function will output.
  5. Run the function and check that it works.
  6. Add the Roxygen documentation tags with Ctrl-Shift-Alt-R or with the Palette (Ctrl-Shift-P, then type “roxygen comment”) while the cursor is in the function.

In docs/learning.qmd, create a new Markdown header called ## Import 5C's HR data with a function and create a code chunk below that with Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type “new chunk”).

So, step one. Let’s copy and paste the code we previously wrote for importing the HR.csv.gz data from participant 5C, so that we can convert it into a function:

docs/learning.qmd
here("data-raw/nurses-stress/stress/5C/5C_1586886626/HR.csv.gz") |>
  read_csv(
    show_col_types = FALSE,
    name_repair = to_snake_case,
    n_max = 100
  )

Next we wrap it in the function call and give it an appropriate name. Since we’ve used this code to read in different types of data, the name should reflect that it is fairly generic, at least for our data. So we could simply call it read.

Caution

We want to be careful using very generic names for functions, as these are often already used by R or in other packages. We can always check if a name is used by R by typing ?function_name in the Console. If it shows you documentation, then it is used and it’s best to use another name for our function.
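
Besides ?function_name, a complementary check is base R’s exists(), which tells you whether a function with that name can currently be found in your session. This is only a sketch: it only knows about packages that are loaded, and the second name below is made up.

```r
# mean() comes with R, so a function with this name already exists.
exists("mean", mode = "function") # TRUE

# A made-up name that no package is expected to use.
exists("hypothetical_unused_name_123", mode = "function") # FALSE
```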

docs/learning.qmd
read <- function() {
  here("data-raw/nurses-stress/stress/5C/5C_1586886626/HR.csv.gz") |>
    read_csv(
      show_col_types = FALSE,
      name_repair = to_snake_case,
      n_max = 100
    )
}

Then, we add arguments in the function and replace within the code. We want to use the read() function to read a file path that has data in it. So we can add an argument to our function for the file path to read from. So, a good name might be file_path.

It’s rarely good to hard code things into functions, like the path inside here(); otherwise the function will only ever read from that one file. Instead, it’s good design to give the function a full file path as an argument that it can use internally. Then, when we use the function, we would use here() with the correct path as the function argument.

So we would replace here("...") with file_path:

docs/learning.qmd
read <- function(file_path) {
  file_path |>
    read_csv(
      show_col_types = FALSE,
      name_repair = to_snake_case,
      n_max = 100
    )
}
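
To show what the argument buys us, here is a self-contained sketch that can run without the workshop data. It is not our read() function: it assumes base R’s read.csv() instead of read_csv(), two small temporary files with made-up values, and a hypothetical function name read_first_rows.

```r
# Hypothetical stand-in for our function, using base R's read.csv().
read_first_rows <- function(file_path) {
  utils::read.csv(file_path, nrows = 100)
}

# Create two small temporary CSV files with made-up values.
temp_a <- tempfile(fileext = ".csv")
temp_b <- tempfile(fileext = ".csv")
write.csv(data.frame(hr = c(80, 81)), temp_a, row.names = FALSE)
write.csv(data.frame(hr = c(70, 76, 79)), temp_b, row.names = FALSE)

# The same function reads either file, because the path is an argument.
rows_a <- nrow(read_first_rows(temp_a))
rows_b <- nrow(read_first_rows(temp_b))
rows_a # 2
rows_b # 3
```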

While it isn’t required, for learning purposes we will use the return() function to explicitly indicate what the function outputs at the end. In our case, we want it to output the data frame that we just read in. Since we haven’t assigned the output of read_csv() to an object, let’s do that first and then use return() to output that object at the end. Let’s keep it generic again and call it data:

docs/learning.qmd
read <- function(file_path) {
  data <- file_path |>
    read_csv(
      show_col_types = FALSE,
      name_repair = to_snake_case,
      n_max = 100
    )
  return(data)
}

Great! Now we need to test it out. Let’s try it on two HR.csv.gz datasets from 5C:

docs/learning.qmd
here("data-raw/nurses-stress/stress/5C/5C_1586886626/HR.csv.gz") |>
  read()
# A tibble: 37 × 2
   collection_datetime    hr
   <dttm>              <dbl>
 1 2020-04-14 17:50:46  80.4
 2 2020-04-14 17:50:50  81.6
 3 2020-04-14 17:50:55  81.8
 4 2020-04-14 17:51:08  88.8
 5 2020-04-14 17:50:44  79.3
 6 2020-04-14 17:50:52  81.8
 7 2020-04-14 17:50:48  81.2
 8 2020-04-14 17:50:53  81.7
 9 2020-04-14 17:50:56  82  
10 2020-04-14 17:50:58  82.8
# ℹ 27 more rows
here("data-raw/nurses-stress/stress/5C/5C_1586886712/HR.csv.gz") |>
  read()
# A tibble: 100 × 2
   collection_datetime    hr
   <dttm>              <dbl>
 1 2020-04-14 17:52:12  70  
 2 2020-04-14 17:52:42  76.8
 3 2020-04-14 17:52:51  79.3
 4 2020-04-14 17:53:12  89.4
 5 2020-04-14 17:53:32 101. 
 6 2020-04-14 17:53:35 102. 
 7 2020-04-14 17:53:42 102. 
 8 2020-04-14 17:53:45 102. 
 9 2020-04-14 17:53:52 100.0
10 2020-04-14 17:54:03  94.5
# ℹ 90 more rows

Awesome! It works 🎉 The final stage is to add the Roxygen documentation.

docs/learning.qmd
#' Read in one nurses' stress data file.
#'
#' @param file_path Path to the data file.
#'
#' @returns Outputs a data frame/tibble.
#'
read <- function(file_path) {
  data <- file_path |>
    read_csv(
      show_col_types = FALSE,
      name_repair = to_snake_case,
      n_max = 100
    )
  return(data)
}

A massive advantage of using functions is that if you want to change what your code does, for example to fix a mistake, you only need to modify the function and every place that uses it picks up the change.

Now that we have a working function, run styler with the Palette (Ctrl-Shift-P, then type “style file”), render with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”), and then add and commit the changes to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”) before then pushing to GitHub.

16.6 📖 Reading task: Principles to make general-purpose and reusable functions

Briefly reinforce what they read by slowly going through these points below about making generalised functions. Emphasise the principles below, especially the “do one thing” and “keep it small”.

Time: ~3 minutes.

There are a few ways to make a function more general-purpose and reusable, while still being useful for your specific purposes. These principles are:

  1. Have the function’s input (the arguments) be generic enough to take different types of objects as long as they are the same “type”. In our case, we have a function that takes a path “type”. A path is a very general input, so we can keep it as is.
  2. Have the function’s output be a common “type”, like a vector or data frame. When working in R, it’s a very good practice to have functions output a data frame, since many functions, especially in the tidyverse, take a data frame as input.
  3. Make the first argument of the function be something that can be “pipe-able”. That way you can chain together your functions with the |> operator. In general, if the function does some action to a data frame, then the data frame should be the first argument called data so that it works well with piping from tidyverse functions.
  4. Make your function do one conceptual thing well. For example, read data from a file and make it cleaner. Or convert all columns that are characters into numbers.
  5. Keep the function small. It is easier to be reused, easier to test, and easier to debug when it has fewer lines of code.
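
As a rough sketch of principles 2 to 5, the hypothetical function below takes a data frame as its first argument (called data), does one small thing, and outputs a data frame, so it fits into a chain with |>. The function name keep_first_rows and the example data are made up; the |> operator needs R 4.1 or later.

```r
# Hypothetical small function: data frame in, data frame out.
keep_first_rows <- function(data, max_rows) {
  output <- head(data, max_rows)
  return(output)
}

# Made-up data, only for illustration.
small_data <- data.frame(id = 1:10, value = (1:10) * 2)

# Because `data` is the first argument, the function can be piped into.
result <- small_data |>
  keep_first_rows(max_rows = 3) |>
  nrow()
result # 3
```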

Caution: Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩

16.7 🧑‍💻 Exercise: Add another function argument to read different number of rows

Time: ~10 minutes.

We’ve converted the code to read in different data files from the nurses’ stress dataset into a function. Now, let’s modify it to be a bit more general-purpose so that we can also test out reading in different numbers of rows without having to keep changing the n_max value inside of the function.

Use the read() function code we wrote above and revise it so it can be used like the below:

# Outputs only 100 rows
here("data-raw/nurses-stress/stress/5C/5C_1586886712/HR.csv.gz") |>
  read()

# Outputs only 1000 rows
here("data-raw/nurses-stress/stress/5C/5C_1586886712/HR.csv.gz") |>
  read(max_rows = 1000)

# Outputs all rows (`Inf` means infinity)
here("data-raw/nurses-stress/stress/5C/5C_1586886712/HR.csv.gz") |>
  read(max_rows = Inf)

  1. In the read() function, add a new argument called max_rows after the file_path argument inside function(). We’ll call it max_rows because that’s a more descriptive name for it than n_max.
  2. Inside the function body, replace the n_max = 100 with n_max = max_rows so that the function will use the value given in the max_rows argument when it is used.
  3. Since we’d like this argument to be optional, meaning that we don’t have to use it (like the first code example above), we have to give the argument a default value in function(). Since we are using 100 already, let’s use that value by putting max_rows = 100 in function().
  4. Copy and paste the example code above to below the function code, then run the function code again along with the pasted example code. You should see that the first code example outputs 100 rows, the second outputs 1000 rows, and the third outputs all rows.
  5. Update the Roxygen documentation to include the new max_rows argument and its description. You can copy and paste the existing line with the @param for file_path and then modify it for max_rows.
  6. Run styler while in the docs/learning.qmd file with the Palette (Ctrl-Shift-P, then type “style file”).
  7. Render the Quarto document with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”).
  8. Finally, add and commit the changes to the Git history, using Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”). Then push to GitHub.
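
If you are unsure how a default value behaves, here is a small sketch with an unrelated, hypothetical function (first_values and how_many are names made up for this example). The default is used whenever the argument is left out.

```r
# Hypothetical function with an optional argument that defaults to 3.
first_values <- function(values, how_many = 3) {
  output <- head(values, how_many)
  return(output)
}

# Leaving out `how_many` uses the default value.
first_values(letters) # "a" "b" "c"

# Giving `how_many` a value overrides the default.
first_values(letters, how_many = 5) # "a" "b" "c" "d" "e"
```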

Caution: Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩

Go over this solution with everyone as we will use it in the next session.

16.8 💬 Discussion activity: What are some tasks that could be functions?

Time: ~6 minutes.

As we prepare for the next session and the break, get up, walk around, and discuss with your neighbour some of the following questions:

  • What are some tasks you do that are repetitive or that you do multiple times with very small changes each time you do the task?
  • How might you use functions in your work? Can you think of specific tasks or situations where you could use one?

16.9 Key takeaways

Quickly cover this and get them to do the survey before moving on to the discussion activity.

  • Everything in R is an object.
  • Every action in R is a function and every function is an object.
  • Functions contain a sequence of steps that do actions to an object.
  • Functions have five components:
    • The three required ones are using function() { }, the code in the function body between the { }, and an output (usually set with return()).
    • The two optional ones are assigning the function as a new object with <- and the function arguments put within function().
  • Document functions by using Roxygen.
  • Keep functions small and simple, so it is easier to test and fix them.
  • Use few arguments in functions to reduce cognitive load.

16.10 Code used in session

This lists some, but not all, of the code used in the section. Some code is incorporated into Markdown content, so is harder to automatically list here in a code chunk. The code below also includes the code from the exercises.

sd
`+`
`<-`
1 + 2
`+`(1, 2)

`<-`(x, 1)
x
x <- 1
x
add_numbers <- function(num1, num2) {
  added <- num1 + num2
  return(added)
}
add_numbers(1, 2)
#' Title
#'
#' @param num1
#' @param num2
#'
#' @returns
#' @export
#'
#' @examples
add_numbers <- function(num1, num2) {
  added <- num1 + num2
  return(added)
}
#' Add two numbers together.
#'
#' @param num1 A number here.
#' @param num2 A number here.
#'
#' @returns Returns the sum of the two numbers.
#'
add_numbers <- function(num1, num2) {
  added <- num1 + num2
  return(added)
}
here("data-raw/nurses-stress/stress/5C/5C_1586886626/HR.csv.gz") |>
  read_csv(
    show_col_types = FALSE,
    name_repair = to_snake_case,
    n_max = 100
  )
read <- function() {
  here("data-raw/nurses-stress/stress/5C/5C_1586886626/HR.csv.gz") |>
    read_csv(
      show_col_types = FALSE,
      name_repair = to_snake_case,
      n_max = 100
    )
}
read <- function(file_path) {
  file_path |>
    read_csv(
      show_col_types = FALSE,
      name_repair = to_snake_case,
      n_max = 100
    )
}
read <- function(file_path) {
  data <- file_path |>
    read_csv(
      show_col_types = FALSE,
      name_repair = to_snake_case,
      n_max = 100
    )
  return(data)
}
here("data-raw/nurses-stress/stress/5C/5C_1586886626/HR.csv.gz") |>
  read()
here("data-raw/nurses-stress/stress/5C/5C_1586886712/HR.csv.gz") |>
  read()
#' Read in one nurses' stress data file.
#'
#' @param file_path Path to the data file.
#'
#' @returns Outputs a data frame/tibble.
#'
read <- function(file_path) {
  data <- file_path |>
    read_csv(
      show_col_types = FALSE,
      name_repair = to_snake_case,
      n_max = 100
    )
  return(data)
}
# Outputs only 100 rows
here("data-raw/nurses-stress/stress/5C/5C_1586886712/HR.csv.gz") |>
  read()

# Outputs only 1000 rows
here("data-raw/nurses-stress/stress/5C/5C_1586886712/HR.csv.gz") |>
  read(max_rows = 1000)

# Outputs all rows (`Inf` means infinity)
here("data-raw/nurses-stress/stress/5C/5C_1586886712/HR.csv.gz") |>
  read(max_rows = Inf)
#' Read in one nurses' stress data file.
#'
#' @param file_path Path to the data file.
#' @param max_rows Maximum number of rows to read.
#'
#' @returns Outputs a data frame/tibble.
#'
read <- function(file_path, max_rows = 100) {
  data <- file_path |>
    read_csv(
      show_col_types = FALSE,
      name_repair = to_snake_case,
      n_max = max_rows
    )
  return(data)
}