We’ve written simple code that reads in one of our data files and does some simple cleaning on it. But we have about 2000 CSV files to read in. So even though our code is simple, copying and pasting it to read in all those files isn’t going to be very efficient. This session is about bundling our code into functions that will make it easier to read more files with less code.
Repeat and reinforce the part of what functions are made of, their structure, that all actions are functions, and all functions are objects (but not that all objects are functions).
Time: ~5 minutes.
The first thing to know about R is that everything is an “object” and that some objects can do an action. These objects that do an action are called functions. You’ve heard or read about functions before during the workshop, but what is a function?
A function is a bundled sequence of steps that achieves a specific action. You can usually tell that an object is a function if it has a () at the end of its name. For example, mean() is a function to calculate the mean and sd() is a function to calculate the standard deviation. It isn’t always true that functions end in () though, which you’ll read about shortly. For instance, + is a function that adds two numbers together, [] is a function that subsets or extracts an item from a list of items, like getting a column from a data frame, and <- is a function that creates a new object from some value.
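To make the idea that functions are objects a bit more concrete, here is a small sketch using only base R. The object `numbers` is a made-up example, and the backticks around the operators are explained a little further down.

```r
# Functions are objects, so we can ask R questions about them.
is.function(mean)
class(sd)

# Not every object is a function.
numbers <- c(1, 2, 3)
is.function(numbers)

# Operators are functions too, even without () at the end of their name.
is.function(`+`)
is.function(`[`)
is.function(`<-`)
```

Everything here except `is.function(numbers)` is a check on a function, so only that one line returns `FALSE`.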
All created functions have the same structure: they are assigned a function name with <-, they use function() to define their parameters or arguments, and they have an internal sequence of steps, called the function body, that is wrapped in {}:
Console
function_name <- function(argument1, argument2) {
  # body of function with R code
}
Notice that this uses two functions to create this function_name function object:
- <- is the action (function) that will create the new object function_name.
- function() is the action (function) to tell R that this object is an action (function) whenever it is used with a () at the end, e.g. function_name().
Because R is open source, anyone can see how things work underneath. So, if you want to see what a function does underneath, you would type out the function name without the () into the Console and run it. If we do it with the function sd(), which calculates the standard deviation, we see:
Console
sd
function (x, na.rm = FALSE)
sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
na.rm = na.rm))
<bytecode: 0x55e492e98a58>
<environment: namespace:stats>
Here you see that sd() has the arguments x and na.rm. The function body shows how it calculates the standard deviation, which is the square root of the variance. In this code, var() is inside the sqrt() function, which is exactly what it should be if you know the math (you don’t need to).
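If you want to convince yourself that the function body really does what it says, a quick check like the one below can help. It is only a sketch: the values in `heart_rates` are example numbers, and the object names are ours.

```r
# Example values to calculate the standard deviation of.
heart_rates <- c(80.4, 81.6, 81.8, 88.8, 79.3)

# The built-in function and the "by hand" version from its function body.
sd_builtin <- sd(heart_rates)
sd_by_hand <- sqrt(var(heart_rates))
all.equal(sd_builtin, sd_by_hand)

# The na.rm argument is passed along to var(), so missing values
# are only dropped when we ask for it.
with_missing <- c(1, NA, 3)
sd(with_missing)
sd(with_missing, na.rm = TRUE)
```

With the default `na.rm = FALSE`, the missing value makes the result `NA`. With `na.rm = TRUE`, only the values 1 and 3 are used.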
Normally you can tell if something is a function if it has () at the end of the name. But there are special functions, like +, [], or even <-, that do an action but that don’t have () at the end. These are called operator functions, which is why <- is called the assignment operator: it assigns something to a new object. To see how they work internally, you would wrap the operator in backticks (`). So for + or <- it would be:
Console
`+`
function (e1, e2) .Primitive("+")
`<-`
.Primitive("<-")
You’ll see something called .Primitive. Operators are often written in very low-level computer code, and these are called “primitives”. Explaining primitives is well beyond the scope of this workshop, so we won’t go into what they are or why they exist.
To show that they are a function, you can even use them with their () version like this:
1 + 2
[1] 3
`+`(1, 2)
[1] 3
`<-`(x, 1)
x
[1] 1
x <- 1
x
[1] 1
But hopefully you can see that using it with the () isn’t very nice to read or use!
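The same trick works for the subsetting operator. And because operators are ordinary function objects, they can also be handed to other functions. The sketch below uses a made-up vector called `values` and the base R function `Reduce()`, which this workshop doesn’t otherwise cover.

```r
# Made-up example values.
values <- c(10, 20, 30)

# The usual way to extract the second item...
usual <- values[2]
# ...and the same thing written with the () version of `[`.
with_parentheses <- `[`(values, 2)
identical(usual, with_parentheses)

# Operators can be given to other functions, e.g. Reduce() applies `+`
# to the items one after another: ((1 + 2) + 3) + 4.
total <- Reduce(`+`, c(1, 2, 3, 4))
total
```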
If you can learn to make your own functions, it can help make your life and work much easier and more efficient! That’s because you can make a sequence of actions that you can then reuse again and again. And luckily, you will be making many functions throughout this workshop. Making a function always follows a basic structure:
1. Give the function a name (e.g. mean).
2. Use function() to tell R the new object will be a function, assigning it to the name with <-.
3. Optionally, give the function arguments, e.g. function(argument1, argument2, argument3).
4. Use return() to indicate what final output you want the function to have. For learning purposes, we’ll always use return() to help show us what the final function output is, but it isn’t necessary.

Emphasize that we will be using this workflow for creating functions all the time throughout the workshop and that this workflow is also what you’d use in your daily work.
While there is no minimum or maximum number of arguments you can provide for a function (e.g. you could have zero or dozens of arguments), it’s generally good practice and design to have as few arguments as necessary to get the job done. Part of making functions is to reduce your own and others’ cognitive load when working with or reading code. The fewer arguments you use, the lower the cognitive load. So, the structure is:
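As a minimal sketch of that structure, here are two tiny functions: one with zero arguments and one with two. The names `say_hello()` and `multiply_numbers()` are hypothetical and only used for illustration.

```r
# A function with zero arguments: there is nothing to remember when using it.
say_hello <- function() {
  greeting <- "Hello!"
  return(greeting)
}

# A function with two arguments: only what is needed to get the job done.
multiply_numbers <- function(num1, num2) {
  multiplied <- num1 * num2
  return(multiplied)
}

say_hello()
multiply_numbers(3, 4)
```

Both follow the same pattern: a name, `function()` with its arguments (if any), a body inside `{}`, and `return()` at the end.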
Writing your own functions can be absolutely amazing and fun and powerful, but you will also often want to pull your hair out with frustration at errors that are difficult to understand and fix. One of the best ways to deal with this is to make functions that are small and simple, and to test them as you use them. The smaller they are, the less chance there is of an error or issue that you can’t figure out. There are also some formal debugging steps you can do, but we won’t be going over debugging, both because of time and because it is challenging to make meaningful debugging exercises when solutions to problems are very dependent on the project and context. There is some extra material in Appendix A that you can look over in your own time that has some instructions on debugging and dealing with some common problems you might encounter with R.
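One lightweight way to test a small function as you use it is to compare its output with answers you already know. The sketch below assumes a hypothetical function `celsius_to_fahrenheit()` and uses base R’s `stopifnot()`, which gives an error if any of its checks is not `TRUE`.

```r
# A hypothetical small function that does one thing.
celsius_to_fahrenheit <- function(celsius) {
  fahrenheit <- celsius * 9 / 5 + 32
  return(fahrenheit)
}

# Quick checks against values we already know the answer to.
# If the function is wrong, this stops with an error right away.
stopifnot(
  celsius_to_fahrenheit(0) == 32,
  celsius_to_fahrenheit(100) == 212
)
```

Because the function is so small, a failing check points almost directly at the line that needs fixing.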
When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩
Take your time slowly going over this, especially when talking about the Roxygen documentation template.
Let’s create a really basic function to show the process. First, create a new Markdown header called ## Making a function to add numbers and create a code chunk below that with Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type “new chunk”). Then, inside the code chunk, we’ll write this code:
docs/learning.qmd
add_numbers <- function(num1, num2) {
  added <- num1 + num2
  return(added)
}
You can use the new function by running the above code and writing out your new function, with arguments to give it.
docs/learning.qmd
add_numbers(1, 2)
[1] 3
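Two small details are worth trying out here. First, `return()` isn’t strictly necessary, since R outputs the last evaluated expression of the function body. Second, arguments can be given by name. The sketch below repeats `add_numbers()` so it can be run on its own. The name `add_numbers_implicit()` is hypothetical and only there for comparison.

```r
# The function from above, with an explicit return().
add_numbers <- function(num1, num2) {
  added <- num1 + num2
  return(added)
}

# The same function without return(): the last expression is the output.
add_numbers_implicit <- function(num1, num2) {
  num1 + num2
}

# Both versions give the same output.
identical(add_numbers(1, 2), add_numbers_implicit(1, 2))

# Arguments can also be given by name, in any order.
add_numbers(num2 = 2, num1 = 1)
```

We’ll keep using `return()` in this workshop because it makes the output of the function easy to spot.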
The function name is fairly good; add_numbers is read as “add numbers”. While we generally want to write code that describes what it does by reading it, it’s also good practice to add some formal documentation to the function. Use the “Insert Roxygen Skeleton” in the “Code” menu, by typing Ctrl-Shift-Alt-R or with the Palette (Ctrl-Shift-P, then type “roxygen comment”), and you can add template documentation right above the function. Make sure your cursor is within the function in order for the Roxygen template to be added to your function. It looks like:
docs/learning.qmd
#' Title
#'
#' @param num1
#' @param num2
#'
#' @returns
#' @export
#'
#' @examples
add_numbers <- function(num1, num2) {
  added <- num1 + num2
  return(added)
}
The Title area is where you type out a brief sentence or several words that describe the function. Creating a new paragraph below this line allows you to add a more detailed description. The other items are:
- @param num: These lines describe what each argument (also called parameter) is for and what to give it.
- @returns: This describes what output the function gives. Is it a data.frame? A plot? What else does the output give?
- @export: Tells R that this function should be accessible to the user of your package. Since we aren’t making packages, delete it.
- @examples: Any lines below this are used to show examples of how to use the function. This is very useful when making packages, but not really in this case. So we’ll delete it.
Let’s write out some documentation for this function:
docs/learning.qmd
#' Add two numbers together.
#'
#' @param num1 A number here.
#' @param num2 A number here.
#'
#' @returns Returns the sum of the two numbers.
#'
add_numbers <- function(num1, num2) {
  added <- num1 + num2
  return(added)
}
Once we’ve created that and before moving on, let’s style our code with the Palette (Ctrl-Shift-P, then type “style file”), render the Quarto document with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”), and then open up the Git Interface with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”) to add and commit these changes to the Git history before then pushing to GitHub.
Highlight the workflow and diagram. Reinforce this workflow and that we will be using it all throughout this workshop.
Time: ~6 minutes.
At the level of the code, the way you prototype code is to:
… R/functions.R script, which we will talk about in the next session).
flowchart TD
code("Write code<br>in Quarto") --> as_function("Convert to<br>function")
as_function --> test("Test function in<br>Quarto or Console")
test --> fix_function("Fix function")
test -- "Commit" --> git("Git history")
fix_function --> test
fix_function --> either([Either:])
either -. As needed .-> restart("Restart R<br>session")
either -. As needed .-> render("Render<br>Quarto")
restart -.-> test
render -.-> test
Restarting R or rendering the Quarto document is the only way to be certain that the R workspace is in a clean state. When code runs from a clean state, it improves the chances that your code and project will be reproducible.
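To see why a clean state matters, consider a function that quietly depends on an object that only exists because we created it while testing. This is only a sketch: `multiplier` and `scale_value()` are hypothetical names, and `rm()` is used to imitate what a restart of R would do to that object.

```r
# An object created interactively, e.g. while testing in the Console.
multiplier <- 2

# A hypothetical function that quietly depends on that object.
scale_value <- function(value) {
  scaled <- value * multiplier
  return(scaled)
}

# Works, but only because `multiplier` happens to exist right now.
scale_value(10)

# Imitate a clean session by removing the object.
rm(multiplier)

# Now the same call fails; tryCatch() lets us capture that as a message.
result <- tryCatch(
  scale_value(10),
  error = function(error) "Error: `multiplier` was not found"
)
result
```

Restarting R or rendering the document catches this kind of hidden dependency early, before someone else runs into it.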
We use Git because it is the best way of keeping track of what was done to your files, when, and why. It helps to keep your work transparent and makes it easier for you to share your code by uploading to GitHub. Using version control should be a standard practice for doing better science since it fits with the philosophy of doing science (e.g., transparency, reproducibility, and documentation).
When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩
Now that we have a basic understanding of what a function looks like, let’s apply that to what we’re doing right now: Importing our data.
Emphasize that we will be using this workflow for creating functions all the time throughout the workshop and that it is also a common workflow when making functions in R.
While you read about the general workflow above, the more detailed steps for making a function are to:
1. Wrap the code in name <- function() { ... }, with an appropriate and descriptive name.
2. Add arguments to the function (function(argument1, argument2)) with appropriate and descriptive names, then replace the code in the function body with the argument names where appropriate.
3. Add the return() function at the end to indicate what the function will output.

In docs/learning.qmd, create a new Markdown header called ## Import 5C's HR data with a function and create a code chunk below that with Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type “new chunk”).
So, step one. Let’s copy and paste the code we previously wrote for importing the HR.csv.gz data from participant 5C and convert it into a function:
docs/learning.qmd
here("data-raw/nurses-stress/stress/5C/5C_1586886626/HR.csv.gz") |>
  read_csv(
    show_col_types = FALSE,
    name_repair = to_snake_case,
    n_max = 100
  )
Next we wrap it in the function call and give it an appropriate name. Since we’ve used this code to read in different types of data, the name should reflect that it is fairly generic, at least for our data. So we could simply call it read.
We want to be careful using very generic names for functions, as these are often already used by R or in other packages. We can always check if a name is used by R by typing ?function_name in the Console. If it shows you documentation, then it is used and it’s best to use another name for our function.
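Besides looking at the help with ?function_name, base R’s exists() can give a quick yes or no. This is only a rough sketch: exists() looks at the objects R can currently find, so it won’t know about packages that aren’t loaded. The name `read_nurses_stress_file` is made up for the example.

```r
# mean() is already used by R, so this name is taken.
mean_is_taken <- exists("mean", mode = "function")
mean_is_taken

# A made-up, very specific name that is unlikely to be used anywhere.
made_up_is_taken <- exists("read_nurses_stress_file", mode = "function")
made_up_is_taken
```

A `TRUE` here is a good hint to pick another name, while a `FALSE` only tells us that nothing with that name is available in the current session.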
docs/learning.qmd
read <- function() {
  here("data-raw/nurses-stress/stress/5C/5C_1586886626/HR.csv.gz") |>
    read_csv(
      show_col_types = FALSE,
      name_repair = to_snake_case,
      n_max = 100
    )
}
Then, we add arguments in the function and replace within the code. We want to use the read() function to read a file path that has data in it. So we can add an argument to our function for the file path to read from. So, a good name might be file_path.
It’s rarely good to hard code things into functions, like the use of here(), because then the function will only ever read from that one file. Instead, it’s good design to give a function a full file path that it can use internally. Then when we use the function, we would use here() with the correct path in the function argument.
So we would replace here("...") with file_path:
docs/learning.qmd
read <- function(file_path) {
  file_path |>
    read_csv(
      show_col_types = FALSE,
      name_repair = to_snake_case,
      n_max = 100
    )
}
While it isn’t required, for learning purposes we will use the return() function to explicitly indicate what the function outputs at the end. In our case, we want it to output the data frame that we just read in. Since we haven’t assigned the output of read_csv() to an object, let’s do that first and then use return() to output that object at the end. Let’s keep it generic again and call it data:
docs/learning.qmd
read <- function(file_path) {
  data <- file_path |>
    read_csv(
      show_col_types = FALSE,
      name_repair = to_snake_case,
      n_max = 100
    )
  return(data)
}
Great! Now we need to test it out. Let’s try it on two HR.csv.gz datasets from 5C:
docs/learning.qmd
here("data-raw/nurses-stress/stress/5C/5C_1586886626/HR.csv.gz") |>
  read()
# A tibble: 37 × 2
collection_datetime hr
<dttm> <dbl>
1 2020-04-14 17:50:46 80.4
2 2020-04-14 17:50:50 81.6
3 2020-04-14 17:50:55 81.8
4 2020-04-14 17:51:08 88.8
5 2020-04-14 17:50:44 79.3
6 2020-04-14 17:50:52 81.8
7 2020-04-14 17:50:48 81.2
8 2020-04-14 17:50:53 81.7
9 2020-04-14 17:50:56 82
10 2020-04-14 17:50:58 82.8
# ℹ 27 more rows
here("data-raw/nurses-stress/stress/5C/5C_1586886712/HR.csv.gz") |>
  read()
# A tibble: 100 × 2
collection_datetime hr
<dttm> <dbl>
1 2020-04-14 17:52:12 70
2 2020-04-14 17:52:42 76.8
3 2020-04-14 17:52:51 79.3
4 2020-04-14 17:53:12 89.4
5 2020-04-14 17:53:32 101.
6 2020-04-14 17:53:35 102.
7 2020-04-14 17:53:42 102.
8 2020-04-14 17:53:45 102.
9 2020-04-14 17:53:52 100.0
10 2020-04-14 17:54:03 94.5
# ℹ 90 more rows
Awesome! It works 🎉 The final stage is to add the Roxygen documentation.
docs/learning.qmd
#' Read in one nurses' stress data file.
#'
#' @param file_path Path to the data file.
#'
#' @returns Outputs a data frame/tibble.
#'
read <- function(file_path) {
  data <- file_path |>
    read_csv(
      show_col_types = FALSE,
      name_repair = to_snake_case,
      n_max = 100
    )
  return(data)
}
A massive advantage of using functions is that if you want to make a change to what your code does, like if you fix a mistake, you can very easily do it by modifying the function and it will change all your other code too.
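If you want to try this pattern without the workshop data, the sketch below writes two tiny made-up CSV files to a temporary folder and reads them with the same function. Note that it swaps in base R’s read.csv() instead of read_csv() so that it runs without extra packages, and that `read_example()` is a hypothetical stand-in, not the read() function from above.

```r
# Write two tiny CSV files with made-up values to a temporary folder.
temp_folder <- tempfile("example-data-")
dir.create(temp_folder)
first_file <- file.path(temp_folder, "first.csv")
second_file <- file.path(temp_folder, "second.csv")
write.csv(data.frame(hr = c(80, 81, 82)), first_file, row.names = FALSE)
write.csv(data.frame(hr = c(70, 71, 72, 73)), second_file, row.names = FALSE)

# Hypothetical stand-in for read(), using base R's read.csv().
# `nrows` plays the same role as `n_max` does in read_csv().
read_example <- function(file_path) {
  data <- read.csv(file_path, nrows = 2)
  return(data)
}

# The same function is used for both files.
first_data <- read_example(first_file)
second_data <- read_example(second_file)
nrow(first_data)
nrow(second_data)
```

If we later decide to read three rows instead of two, we only change `nrows` inside `read_example()` and both calls pick up the change.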
Now that we have a working function, run styler with the Palette (Ctrl-Shift-P, then type “style file”), render with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”), and then add and commit the changes to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”) before then pushing to GitHub.
Briefly reinforce what they read by slowly going through these points below about making generalised functions. Emphasise the principles below, especially the “do one thing” and “keep it small”.
Time: ~3 minutes.
There are a few ways to make a function more general-purpose and reusable, while still being useful for your specific purposes. These principles are:
… |> operator. In general, if the function does some action to a data frame, then the data frame should be the first argument called data so that it works well with piping from tidyverse functions.

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩
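To illustrate the point above about putting the data frame first, here is a minimal sketch. The function `keep_first_rows()`, its `number_rows` argument, and the values in `heart_rate` are all made up for the example.

```r
# A hypothetical function: the data frame comes first and is called `data`.
keep_first_rows <- function(data, number_rows) {
  shortened <- head(data, n = number_rows)
  return(shortened)
}

# Made-up example values.
heart_rate <- data.frame(hr = c(80, 82, 85, 90, 88))

# Because `data` is the first argument, the function works with the pipe...
piped <- heart_rate |>
  keep_first_rows(number_rows = 2)

# ...and gives the same output as using it directly.
direct <- keep_first_rows(heart_rate, number_rows = 2)
identical(piped, direct)
```

The pipe puts whatever is on its left side into the first argument of the function on its right side, which is why the order of arguments matters.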
Time: ~10 minutes.
We’ve converted the code to read in different data files from the nurses’ stress dataset into a function. Now, let’s modify it to be a bit more general-purpose so that we can also test out reading in different numbers of rows without having to keep changing the n_max value inside of the function.
Use the read() function code we wrote above and revise it so it can be used like the below:
# Outputs only 100 rows
here("data-raw/nurses-stress/stress/5C/5C_1586886712/HR.csv.gz") |>
  read()
# Outputs only 1000 rows
here("data-raw/nurses-stress/stress/5C/5C_1586886712/HR.csv.gz") |>
  read(max_rows = 1000)
# Outputs all rows (`Inf` means infinity)
here("data-raw/nurses-stress/stress/5C/5C_1586886712/HR.csv.gz") |>
  read(max_rows = Inf)
1. In the read() function, add a new argument called max_rows after the file_path argument inside function(). We’ll call it max_rows because that’s a more descriptive name for it than n_max.
2. Replace n_max = 100 with n_max = max_rows so that the function will use the value given in the max_rows argument when it is used.
3. Give max_rows a default value inside function(). Since we are using 100 already, let’s use that value by putting max_rows = 100 in function().
4. Update the Roxygen documentation with the max_rows argument and its description. You can copy and paste the existing line with the @param for file_path and then modify it for max_rows.
5. Style the docs/learning.qmd file with the Palette (Ctrl-Shift-P, then type “style file”).

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩
Go over this solution with everyone as we will use it in the next session.
Time: ~6 minutes.
As we prepare for the next session and the break, get up, walk around, and discuss with your neighbour some of the following questions:
Quickly cover this and get them to do the survey before moving on to the discussion activity.
This lists some, but not all, of the code used in the section. Some code is incorporated into Markdown content, so is harder to automatically list here in a code chunk. The code below also includes the code from the exercises.
sd
`+`
`<-`
1 + 2
`+`(1, 2)
`<-`(x, 1)
x
x <- 1
x
add_numbers <- function(num1, num2) {
  added <- num1 + num2
  return(added)
}
add_numbers(1, 2)
#' Title
#'
#' @param num1
#' @param num2
#'
#' @returns
#' @export
#'
#' @examples
add_numbers <- function(num1, num2) {
  added <- num1 + num2
  return(added)
}
#' Add two numbers together.
#'
#' @param num1 A number here.
#' @param num2 A number here.
#'
#' @returns Returns the sum of the two numbers.
#'
add_numbers <- function(num1, num2) {
  added <- num1 + num2
  return(added)
}
here("data-raw/nurses-stress/stress/5C/5C_1586886626/HR.csv.gz") |>
  read_csv(
    show_col_types = FALSE,
    name_repair = to_snake_case,
    n_max = 100
  )
read <- function() {
  here("data-raw/nurses-stress/stress/5C/5C_1586886626/HR.csv.gz") |>
    read_csv(
      show_col_types = FALSE,
      name_repair = to_snake_case,
      n_max = 100
    )
}
read <- function(file_path) {
  file_path |>
    read_csv(
      show_col_types = FALSE,
      name_repair = to_snake_case,
      n_max = 100
    )
}
read <- function(file_path) {
  data <- file_path |>
    read_csv(
      show_col_types = FALSE,
      name_repair = to_snake_case,
      n_max = 100
    )
  return(data)
}
here("data-raw/nurses-stress/stress/5C/5C_1586886626/HR.csv.gz") |>
  read()
here("data-raw/nurses-stress/stress/5C/5C_1586886712/HR.csv.gz") |>
  read()
#' Read in one nurses' stress data file.
#'
#' @param file_path Path to the data file.
#'
#' @returns Outputs a data frame/tibble.
#'
read <- function(file_path) {
  data <- file_path |>
    read_csv(
      show_col_types = FALSE,
      name_repair = to_snake_case,
      n_max = 100
    )
  return(data)
}
# Outputs only 100 rows
here("data-raw/nurses-stress/stress/5C/5C_1586886712/HR.csv.gz") |>
  read()
# Outputs only 1000 rows
here("data-raw/nurses-stress/stress/5C/5C_1586886712/HR.csv.gz") |>
  read(max_rows = 1000)
# Outputs all rows (`Inf` means infinity)
here("data-raw/nurses-stress/stress/5C/5C_1586886712/HR.csv.gz") |>
  read(max_rows = Inf)
#' Read in one nurses' stress data file.
#'
#' @param file_path Path to the data file.
#' @param max_rows Maximum number of rows to read.
#'
#' @returns Outputs a data frame/tibble.
#'
read <- function(file_path, max_rows = 100) {
  data <- file_path |>
    read_csv(
      show_col_types = FALSE,
      name_repair = to_snake_case,
      n_max = max_rows
    )
  return(data)
}