7  Making robust and general-purpose functions

Warning

🚧 We are doing major changes to this workshop, so much of the content will be changed. 🚧

7.1 Learning objectives

  1. Explain what R package dependency management is and why it is necessary for writing code and ensuring reproducibility.
  2. Use tools like usethis::use_package() to manage dependencies in an R project and use :: to explicitly use a specific function from a specific package in any functions you make.
  3. Identify and describe some basic principles and patterns for writing more robust, reusable, and general-purpose code.

7.2 💬 Discussion activity: How might you make your functions more general-purpose?

Time: ~8 minutes.

One of the more powerful features of making functions is that you can easily reuse them in other sections of the file or in other files, which we will cover how to do later in this session. Part of that power comes from knowing how to make functions that are general enough to be used in other settings too.

In the previous session, we made the import_cgm() function together and in the exercises you made the import_sleep() function. These two functions basically do the same thing. So why have two? We don’t need two! So let’s refactor the import_cgm() function so that it is clearer what it does and what it can do.

The first step is to look at the code to see what it does and see how you can modify it to be more general-purpose.

  1. Take 2 minutes to look over the two functions you made and think about what you could do to make them more general-purpose.
  2. Take 3 minutes to discuss with your neighbour what you think could be done to make it more general-purpose and try to come to a conclusion.
  3. Then all together and in the remaining time, everyone will share what they’ve thought of.

Try not to look ahead 😜 We won’t generalise the function yet. First we will make it more robust, and then we will generalise it.

7.3 📖 Reading task: Making your function more robust with explicit dependencies

Time: ~10 minutes.

Before we make the function more general-purpose, this is a good time to talk about package dependencies and making your function more robust and trustworthy.

So what is a package dependency and how do you manage it? Whenever you use an R package in your project, you depend on it in order for your code to work. The informal way to “manage” dependencies is by doing what you’ve already done before: using the library() function to load the package into R.

As you read others’ code online or from other researchers, you may notice that the function require() is sometimes used to load packages in the same way as library(). The problem with require() is that if the package can’t be loaded, it doesn’t give an error: it only gives a warning, returns FALSE, and lets the rest of the code continue running. As we’ll cover in this course, this can be a very bad thing because if a package isn’t loaded, it can change the behaviour of some of your code and give you potentially wrong results. On the other hand, library() will give an error if it can’t find the package, which is what you expect if your code depends on a package.
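
The difference is easy to see for yourself. The following is a minimal sketch, assuming that no package called notARealPackage is installed (the name is made up for illustration):

```r
# Hypothetical package name, assumed not to be installed.
missing_pkg <- "notARealPackage"

# require() only warns and returns FALSE, so any code after it keeps running.
loaded <- suppressWarnings(require(missing_pkg, character.only = TRUE))
loaded

# library() stops with an error. We catch it here only to show that it happened.
outcome <- tryCatch(
  {
    library(missing_pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)
outcome
```

Here loaded is FALSE and outcome is "error": with require() the script would have silently carried on without the package, while library() stops right away.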

So, what happens if you come back to the project, or get a new computer, or someone else is working on your project too and they want to make sure they have the packages your project needs? How will they know what packages your project uses? What do they do to get those packages installed? Do they have to search through all your files just to find all the library() calls you used and then install those packages individually and manually? A much better way is to formally declare your package dependencies so that installing them is easy! And we do this by making use of the DESCRIPTION file.

The advantage of using the DESCRIPTION file is that it is a standard file used by R projects to store metadata about the project, including which packages are needed to run the project. It also means there are many helper tools available that use this DESCRIPTION file, including tools to install all the packages you need.

So, if you or someone else wants to install all the packages your project depends on, all you or they have to do is go to the Console and type out (you don’t need to do this right now):

Console
pak::pak()

This function looks into the DESCRIPTION file and installs all the packages listed as dependencies there.

Where are these package dependencies listed in the DESCRIPTION file? Open up your DESCRIPTION file, which you can do quickly with Ctrl-., typing the file name out, and hitting enter to open it. Your file may or may not look like the text below. If it doesn’t, it isn’t a problem, as the text is just to give you an idea of what it might look like.

Type: Project
Package: LearnR3
Title: Analysis Project for LearnR3
Version: 0.0.1
Encoding: UTF-8

Package dependencies are listed under the Imports: key. The example above doesn’t have this key yet, and your file might not either. In the next section, we will go over how to add packages to this field.

Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩

7.4 Explicitly state a project’s package dependencies

There are a few ways to add package dependencies to the DESCRIPTION file. The most straightforward way is to manually write the package you need in the Imports: section of the DESCRIPTION file. But there are a few issues with that, mainly that it is easy to make a mistake, like a typo or incorrect formatting. The other, better way to add dependencies is to use the usethis::use_package() function.

Since we’ve used the here package in our code, let’s add it as a dependency. Go to the Console and let’s type out how to add it. Don’t write this code in your Quarto document, since you don’t want to run it every time you render the document.

Console
usethis::use_package("here")

You will see a bunch of text about adding it to Imports. If you look in your DESCRIPTION file now, you’ll see something like:

Type: Project
Package: LearnR3
Title: Analysis Project for LearnR3
Version: 0.0.1
Imports:
    here
Encoding: UTF-8

Since we will also make use of the tidyverse set of packages later in the workshop, we’ll also add tidyverse as a dependency.

Console
usethis::use_package("tidyverse")
Error in `refuse_package()`:
✖ tidyverse is a meta-package and it is rarely a good idea to
  depend on it.
Please determine the specific underlying package(s) that provide the
function(s) you need and depend on that instead.
ℹ For data analysis projects that use a package structure but do not
  implement a formal R package, adding tidyverse to 'Depends' is a
  reasonable compromise.
Call `use_package("tidyverse", type = "depends")` to achieve this.

This gives an error though. That’s because the tidyverse is a large collection of packages, so as stated by the message, the recommended way to add this particular dependency is with:

Console
usethis::use_package("tidyverse", type = "Depends")

If you look in the DESCRIPTION file now, you see that the new Depends field has been added with tidyverse right below it.

Type: Project
Package: LearnR3
Title: Analysis Project for LearnR3
Version: 0.0.1
Depends:
    tidyverse
Imports:
    here
Encoding: UTF-8

There are fairly technical reasons for putting tidyverse in the Depends field that you don’t need to know about for this workshop, aside from the fact that it is a common practice in R projects. In this context, we use the Depends field for two reasons. First, usethis::use_package() will complain if we try to put tidyverse in the Imports field and recommends the Depends field instead. Second, you never directly use the tidyverse package itself, but rather the individual packages that it loads.
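
If you want to check which packages are declared without opening the file, base R can read the fields of a DESCRIPTION file with read.dcf(). The following is a minimal sketch that writes a temporary copy of the example above so that it runs anywhere; in your own project you would instead point read.dcf() at the real DESCRIPTION file.

```r
# Write a temporary copy of the example DESCRIPTION file shown above.
description_file <- tempfile()
writeLines(
  c(
    "Type: Project",
    "Package: LearnR3",
    "Title: Analysis Project for LearnR3",
    "Version: 0.0.1",
    "Depends:",
    "    tidyverse",
    "Imports:",
    "    here",
    "Encoding: UTF-8"
  ),
  description_file
)

# read.dcf() returns a character matrix with one column per requested field.
fields <- read.dcf(description_file, fields = c("Depends", "Imports"))

# Split each field on commas and trim whitespace to get the package names.
depends <- trimws(strsplit(fields[1, "Depends"], ",")[[1]])
imports <- trimws(strsplit(fields[1, "Imports"], ",")[[1]])

depends
imports
```

This is only meant to show that the DESCRIPTION file is plain, machine-readable metadata, which is what makes helper tools for installing dependencies possible.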

Great! Now that we’ve formally established package dependencies in our project, we also need to formally declare which package each function comes from inside our own functions.

7.5 Explicitly state which package a function comes from

Put the callout block below on the projector and explain why we want to make dependencies within functions more explicit and why using library() and require() is a bad idea.

One important way of making functions more robust is to explicitly state which package each function we use inside our own function comes from. That makes our function much easier to reuse, less likely to break, and more predictable each time you run it.

Important

Regarding the use of library() and require(), you may think that one way of telling your function what package to use is to include library() or require() inside the function. This is an incorrect way to do it and can often give completely wrong results without giving any error or warning. Sometimes, on some websites and help forums, you may see code that looks like this:

add_numbers <- function(num1, num2) {
    library(package_name)
    ...code...
    return(added)
}

Or:

add_numbers <- function(num1, num2) {
    require(package_name)
    ...code...
    return(added)
}

This is very bad practice and can have some unintended and serious consequences without giving any warning or error. We won’t get into the reasons why this is incorrect because it can quickly get quite technical and is out of the scope of this workshop.

The correct way to explicitly use a function from a package is something we’ve already used before with usethis::use_package(): using ::! For every function you use inside your own function, aside from functions that come from {base}, use package_name::function_name().

When we use package_name::function_name for each function in our function, we are explicitly telling R (and us the readers) where the function comes from. This can be important because sometimes the same function name can be used by multiple packages, for example the filter() function. So if you don’t explicitly state which package the function is from, R will use the function that it finds first—which isn’t always the function you wanted to use. We also do this step at the end of making the function because doing it while we create it can be quite tedious.
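
To see how R picks “the function that it finds first”, here is a minimal sketch that uses only base R. The filter() defined below is a made-up stand-in for a filter() from another package; it is not how any real package defines it.

```r
# A hypothetical stand-in for "another package" that defines its own filter().
filter <- function(x, ...) {
  "our own filter() was used"
}

# Without a prefix, R uses the first filter() it finds: ours, not the one from stats.
unprefixed <- filter(1:5, rep(1, 3))

# With the prefix there is no ambiguity: this is always the filter() from stats.
prefixed <- stats::filter(1:5, rep(1, 3))

unprefixed
class(prefixed)

# Remove our stand-in again so it doesn't keep masking other functions.
rm(filter)
```

The unprefixed call silently runs a different function than the prefixed one, without any error or warning. This is the situation that package_name::function_name() protects you from.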

Let’s start doing that with our function. We may not always know which package a function comes from, but we can easily find that out. Let’s start with the first action in our function: read_csv(). In the Console:

Console
?read_csv

This will open the help page for the read_csv() function. If you look at the top left corner, you’ll see the package name in curly brackets {}. This tells you which package the function comes from. In this case, it is readr. So, we can update our function to use readr::read_csv() instead of just read_csv():
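
If you’d rather not open the help page, base R can also report where a function comes from. This is a small sketch that uses sd() from the stats package as a stand-in, because it doesn’t require any extra package to be installed. For read_csv(), the same calls would only work after readr (or tidyverse) has been loaded.

```r
# Which attached package provides a function with this name?
found <- utils::find("sd")
found

# Which package was the function itself defined in?
package_name <- environmentName(environment(sd))
package_name
```

The first call searches the attached packages by name, while the second looks at the function object itself, so they are two slightly different ways of asking the same question.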

import_cgm <- function(file_path) {
  cgm <- file_path |>
    readr::read_csv(
      show_col_types = FALSE,
      name_repair = to_snake_case,
      n_max = 100
    )
  return(cgm)
}

There is still more to do, but now it’s your turn to try.

7.6 🧑‍💻 Exercise: Finish setting the dependencies

Time: ~10 minutes.

  1. While we added readr to the function, we haven’t added it to the DESCRIPTION file yet. In the Console, use usethis::use_package() to add the readr package to the DESCRIPTION file.

  2. There is one other function we use in the import_cgm() function. Find it, figure out what package it comes from, use :: to explicitly state the package, and add it to DESCRIPTION file using usethis::use_package() in the Console.

  3. There is one other package we’ve used that we haven’t added to the DESCRIPTION file. We used the package in the data-raw/dime.R file. Open that file, which you can do with Ctrl-. and typing “dime.R” and selecting the file from the menu. In that file, find the package we used and add it to the DESCRIPTION file using usethis::use_package() in the Console.

  4. Finally, style the code using the Palette (Ctrl-Shift-P, then type “style file”), render the docs/learning.qmd file with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”) to test that everything still works, and then add and commit the changes to Git with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).

Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩

7.7 📖 Reading task: Principles to make general-purpose and reusable functions

Briefly reinforce what they read by slowly going through these points below about making generalised functions. Emphasise the principles below, especially the “do one thing” and “keep it small”.

Time: ~3 minutes.

Recall our discussion at the start of this session about making our import_cgm() function more general. There are a few ways we could do it, but first, let’s go over some general principles of making functions that are more general-purpose and reusable. These principles are:

  1. Have the function’s input (the arguments) be generic enough to take different types of objects as long as they are the same “type”. In our case, we have two functions that both take a path “type”. A path is a very general input, so we can keep it as is.
  2. Have the function’s output be a common “type”, like a vector or data frame. When working in R, it’s a very good practice to have functions output a data frame, since many functions, especially in the tidyverse, take a data frame as input.
  3. Make the first argument of the function be something that can be “pipe-able”. That way you can chain together your functions with the |> operator. In general, always have the data as the first argument so that the function works well with piping from tidyverse functions.
  4. Make your function do one conceptual thing well. For example, read data from a file and make it cleaner. Or convert all columns that are characters into numbers.
  5. Keep the function small. A function with fewer lines of code is easier to reuse, easier to test, and easier to debug.
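
As an illustration of these principles, here is a minimal sketch of a function that does the “convert all columns that are characters into numbers” task mentioned above. The function name chr_to_num() and the example data are made up for this illustration and are not part of the workshop project.

```r
# Hypothetical helper: data frame in (first argument), data frame out,
# one conceptual task, and only a few lines of code.
chr_to_num <- function(data) {
  is_chr <- vapply(data, is.character, logical(1))
  data[is_chr] <- lapply(data[is_chr], as.numeric)
  data
}

# Made-up example data where the numbers were stored as text.
example <- data.frame(
  id = c("1", "2"),
  value = c("3.5", "4"),
  stringsAsFactors = FALSE
)

# Because the data is the first argument, the function works with the pipe.
converted <- example |> chr_to_num()
str(converted)
```

Since the input and the output are both data frames, a function like this can be placed anywhere in a chain of piped steps.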

Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩

7.8 Generalising our import function

So, let’s make our import_cgm() function more general-purpose. We know that both the import_cgm() and import_sleep() functions do basically the same thing:

  1. Have a file path as an argument.
  2. Read in the CSV file at that path with readr::read_csv().
  3. Convert column names to snake case with snakecase::to_snake_case().
  4. Limit the number of rows imported with n_max.
  5. Quiet the message about the column types with show_col_types = FALSE.
  6. Output the imported data frame.

So, we can combine the two functions into one function that does all of these things. We could call the function a lot of different names (naming is really hard in coding), but let’s keep it simple and call it import_dime(). We want this function to be able to import different CSV files, for example:

docs/learning.qmd
import_dime(here("data-raw/dime/cgm/101.csv"))
import_dime(here("data-raw/dime/sleep/101.csv"))

Let’s generalise the function! Rather than internally say cgm or sleep, we can keep it simple and call it data. Create a new header at the bottom of the docs/learning.qmd file called ## Import DIME data function and create a code chunk with Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type “new chunk”). Then we’ll write the new function from scratch:

docs/learning.qmd
import_dime <- function(file_path) {
  data <- file_path |>
    readr::read_csv(
      show_col_types = FALSE,
      name_repair = snakecase::to_snake_case,
      n_max = 100
    )
  return(data)
}

Before testing it out, let’s make the Roxygen documentation for it with Ctrl-Shift-Alt-R or with the Palette (Ctrl-Shift-P, then type “roxygen comment”):

docs/learning.qmd
#' Import data from the DIME study dataset.
#'
#' @param file_path Path to the CSV file.
#'
#' @returns A data frame.
#'
import_dime <- function(file_path) {
  data <- file_path |>
    readr::read_csv(
      show_col_types = FALSE,
      name_repair = snakecase::to_snake_case,
      n_max = 100
    )
  return(data)
}

Below the function, write out these two lines of code to test that it works:

docs/learning.qmd
import_dime(here("data-raw/dime/cgm/101.csv"))
# A tibble: 100 × 2
   device_timestamp    historic_glucose_mmol_l
   <dttm>                                <dbl>
 1 2021-03-18 08:15:00                     5.8
 2 2021-03-18 08:30:00                     5.4
 3 2021-03-18 08:45:00                     5.1
 4 2021-03-18 09:01:00                     5.3
 5 2021-03-18 09:16:00                     5.3
 6 2021-03-18 09:31:00                     4.9
 7 2021-03-18 09:46:00                     4.7
 8 2021-03-18 10:01:00                     4.8
 9 2021-03-18 10:16:00                     5.5
10 2021-03-18 10:31:00                     5.7
# ℹ 90 more rows
import_dime(here("data-raw/dime/sleep/101.csv"))
# A tibble: 100 × 3
   date                sleep_type seconds
   <dttm>              <chr>        <dbl>
 1 2021-05-24 23:03:00 wake           540
 2 2021-05-24 23:12:00 light          180
 3 2021-05-24 23:15:00 deep          1440
 4 2021-05-24 23:39:00 light          240
 5 2021-05-24 23:43:00 wake           300
 6 2021-05-24 23:48:00 light          120
 7 2021-05-24 23:50:00 rem           1350
 8 2021-05-25 00:12:30 wake           870
 9 2021-05-25 00:27:00 rem            360
10 2021-05-25 00:33:00 light          210
# ℹ 90 more rows

This should work without any problems 🎉 Let’s style the code with the Palette (Ctrl-Shift-P, then type “style file”) and then render the document with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”) to test that everything works. If everything works, let’s add and commit the changes to the Git history using Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).
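
If you want to try the function on something other than the workshop data, the following is a minimal sketch that imports a small, made-up CSV file written to a temporary location. It assumes that the readr and snakecase packages are installed, as they are in the workshop project. The column names and values are invented for illustration.

```r
# Same function as above, repeated so that this example runs on its own.
import_dime <- function(file_path) {
  data <- file_path |>
    readr::read_csv(
      show_col_types = FALSE,
      name_repair = snakecase::to_snake_case,
      n_max = 100
    )
  return(data)
}

# A made-up CSV file with column names that are not in snake case.
example_file <- tempfile(fileext = ".csv")
writeLines(
  c(
    "Device Timestamp,Sleep Type",
    "2021-05-24 23:03:00,wake",
    "2021-05-24 23:12:00,light"
  ),
  example_file
)

imported <- import_dime(example_file)
names(imported)
```

The function doesn’t know or care whether the file contains CGM or sleep data. It only needs a path to a CSV file, which is what makes it general-purpose.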

7.9 Easily reuse stable, robust functions by storing them in the R/ folder

Really emphasize that they should cut and paste, so that the function in the docs/learning.qmd file is deleted and no longer kept in the Quarto document.

We’ve now created one general-purpose function that we can use later to import many different types of data files. We’ve made it more robust and have tested it, so we can be fairly certain that it is stable now. So, let’s move the function to a location where we can reuse it in other Quarto documents (if we had more). Since we already have a file called R/functions.R, we will keep all our stable and tested functions in there.

So, in the docs/learning.qmd file, cut only the function and its Roxygen documentation, open the R/functions.R file with Ctrl-., and then paste it into this file.

The code in the R/functions.R file should now look like this:

R/functions.R
#' Import data from the DIME study dataset.
#'
#' @param file_path Path to the CSV file.
#'
#' @returns A data frame.
#'
import_dime <- function(file_path) {
  data <- file_path |>
    readr::read_csv(
      show_col_types = FALSE,
      name_repair = snakecase::to_snake_case,
      n_max = 100
    )
  return(data)
}

We move the function over into this file for a few reasons:

  1. To prevent the Quarto document from becoming too long and having too many different functions and code throughout it.
  2. To make it easier to maintain and find things in your project since you know that all stable, tested functions are in the R/ folder.
  3. To make use of the source() function to load the functions into any Quarto document you want to use them in.

Once we have cut and pasted it into the R/functions.R file, let’s include source() in the Quarto document. Open the docs/learning.qmd file and go to the top of the file to the setup code chunk. Add the line source(here("R/functions.R")) to the bottom of the code chunk. This will load the functions into the Quarto document when it is rendered. This means that we can use the functions in the R/functions.R file without having the actual code be in the Quarto document.
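
To see what source() does on its own, here is a minimal sketch that doesn’t touch your project files. It writes a stand-in for R/functions.R to a temporary file. The add_one() function is made up for this illustration.

```r
# Write a stand-in for the R/functions.R file to a temporary location.
functions_file <- tempfile(fileext = ".R")
writeLines(
  c(
    "add_one <- function(x) {",
    "  x + 1",
    "}"
  ),
  functions_file
)

# source() runs the code in the file, which creates the function in the session.
source(functions_file)

# The function can now be used even though it was never typed out here.
result <- add_one(2)
result
```

In the Quarto document, source(here("R/functions.R")) does the same thing with the real file, so every function stored there becomes available when the document is rendered.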

Since we’ve also explicitly used the snakecase::to_snake_case() function in the import_dime() function, we don’t need the library(snakecase) line from the setup code chunk. So let’s also remove that.

The setup code chunk should look like this now:

```{r setup}
library(tidyverse)
library(here)
source(here("R/functions.R"))
```

And the bottom of the Quarto document should still have the code:

docs/learning.qmd
import_dime(here("data-raw/dime/cgm/101.csv"))
# A tibble: 100 × 2
   device_timestamp    historic_glucose_mmol_l
   <dttm>                                <dbl>
 1 2021-03-18 08:15:00                     5.8
 2 2021-03-18 08:30:00                     5.4
 3 2021-03-18 08:45:00                     5.1
 4 2021-03-18 09:01:00                     5.3
 5 2021-03-18 09:16:00                     5.3
 6 2021-03-18 09:31:00                     4.9
 7 2021-03-18 09:46:00                     4.7
 8 2021-03-18 10:01:00                     4.8
 9 2021-03-18 10:16:00                     5.5
10 2021-03-18 10:31:00                     5.7
# ℹ 90 more rows
import_dime(here("data-raw/dime/sleep/101.csv"))
# A tibble: 100 × 3
   date                sleep_type seconds
   <dttm>              <chr>        <dbl>
 1 2021-05-24 23:03:00 wake           540
 2 2021-05-24 23:12:00 light          180
 3 2021-05-24 23:15:00 deep          1440
 4 2021-05-24 23:39:00 light          240
 5 2021-05-24 23:43:00 wake           300
 6 2021-05-24 23:48:00 light          120
 7 2021-05-24 23:50:00 rem           1350
 8 2021-05-25 00:12:30 wake           870
 9 2021-05-25 00:27:00 rem            360
10 2021-05-25 00:33:00 light          210
# ℹ 90 more rows

But the Quarto document should no longer contain the code that creates the import_dime() function.

Let’s test that it works. Render the document with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”) and check that it works. If it does, then we can add and commit the changes to both the docs/learning.qmd and R/functions.R files.

7.10 Key takeaways

  • Make it easier to collaborate with yourself in the future and with others by explicitly setting which packages your project depends on. Use usethis::use_package() to set the dependency for you in the DESCRIPTION file.
  • Create functions that are more re-usable and easier to test and debug by keeping them small (few lines of code) and having them do one (conceptual) thing at a time. Less is more!
  • Make your function more robust by explicitly stating which packages the code you use in your function comes from by using package_name::function_name().
  • Keep your stable, robust functions in a separate file for easier re-use across your files, for instance, in the R/functions.R file. You can re-use the functions by using source(here("R/functions.R")) in your Quarto documents.

7.11 💬 Discussion activity: How robust is your code, or code that you’ve read?

Time: ~6 minutes.

As we prepare for the next session and to help reinforce what you’ve learned in this session, get up and walk around with your neighbour and talk about some of these questions.

Tip

Part of improving your coding skills is to think about how you can improve your code and the code of others. No one writes perfect code and no one writes great code the first time. Or the second, or the third time. Often code will be refactored multiple times before it is (sufficiently) stable and robust. That is just how coding works.

Being open and receptive to constructive critique and feedback is an essential skill to have, both as a researcher and as a coder. So it’s important to seek out feedback on your own code, to give feedback on others’ code, and to try to improve both.

  1. Think about code you’ve written or that you’ve read from others (online or colleagues). How robust do you think it was? What are some things you could do to make it more robust?
  2. Together with your neighbour, discuss some of these things you’ve thought about. Try to find out if you have similar thoughts or ideas on how to improve things.

7.12 Code used in this session

This lists some, but not all, of the code used in the section. Some code is incorporated into Markdown content, so is harder to automatically list here in a code chunk. The code below also includes the code from the exercises.

import_cgm <- function(file_path) {
  cgm <- file_path |>
    readr::read_csv(
      show_col_types = FALSE,
      name_repair = snakecase::to_snake_case,
      n_max = 100
    )
  return(cgm)
}
usethis::use_package("readr")
usethis::use_package("snakecase")
usethis::use_package("fs")
import_dime <- function(file_path) {
  data <- file_path |>
    readr::read_csv(
      show_col_types = FALSE,
      name_repair = snakecase::to_snake_case,
      n_max = 100
    )
  return(data)
}
#' Import data from the DIME study dataset.
#'
#' @param file_path Path to the CSV file.
#'
#' @returns A data frame.
#'
import_dime <- function(file_path) {
  data <- file_path |>
    readr::read_csv(
      show_col_types = FALSE,
      name_repair = snakecase::to_snake_case,
      n_max = 100
    )
  return(data)
}
import_dime(here("data-raw/dime/cgm/101.csv"))
import_dime(here("data-raw/dime/sleep/101.csv"))