9 Workflow for analyzing your tidy data
Here we will continue using the Workflow block as we cover the fourth block, “Work with final data” in Figure 9.1.
And your folder and file structure should look like this:

```
LearnR3
├── data/
│   ├── mmash.rda
│   └── README.md
├── data-raw/
│   ├── mmash-data.zip
│   ├── mmash/
│   │   ├── user_1
│   │   ├── ...
│   │   └── user_22
│   └── mmash.R
├── doc/
│   ├── README.md
│   └── lesson.Rmd
├── R/
│   ├── functions.R
│   └── README.md
├── .Rbuildignore
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
└── README.md
```
9.1 Setup for the analysis in R Markdown
We now have a working dataset to start doing some simple analyses on in the R Markdown document. A recommended workflow with R Markdown is to often “Knit” it and make sure your analysis is reproducible (while on your computer). So let’s first clean it up and start from the beginning again.
Before we do that, make sure to add and commit all current changes into the Git history. That way, any change we make now is saved on top of the old version, which we can always go back and look at in the history.
Next, we’ll delete everything right below the setup code chunk. This code chunk should have all the packages listed with `library()` calls. We’ll also add another one right below them:
When everything has been deleted, add and commit these changes into the Git history.
Then, we’ll “Knit” our R Markdown document into HTML by using the “Knit” button
at the top of the pane or with
Ctrl-Shift-K. Once it creates the file, it should
either pop up or open in the Viewer pane on the side.
As we write more R code and do some simple analyses of the data, we are going to be knitting often. The main reason for this is to ensure that whatever we write and do will at least be reproducible on our own computer, since R Markdown is designed to ensure the document is reproducible.
For this specific workflow and for checking reproducibility, you should output to HTML rather than to a Word document. While you can create a Word document by changing `output: html_document` to `output: word_document` at the top in the YAML header, you’d only do this when you need to submit to a journal or need to email co-authors for review. The reason is simple: after you generate the Word document from R Markdown, the Word file opens up and consequently Word locks the file from further edits. That means every time you generate the Word document, you have to close it before you can generate it again, otherwise knitting will fail. This can get annoying very quickly (trust me), since you don’t always remember to close the Word document. If you output to HTML, this won’t be a problem.
9.2 Split-apply-combine technique
In the beginner course, we covered the split-apply-combine approach to data analysis: split the data into groups, apply some analysis to each group, and then combine the results back together. You’ve already used this approach when summarizing the RR and Actigraph datasets in the last session. We’re going to continue using it because of its usefulness and power.
Let’s do some descriptive statistics: finding the mean and standard deviation of some variables by day. We will use `group_by()` and `summarise()` like previously. First, in `doc/lesson.Rmd`, create a new header called `## Split-apply-combine` and make a new code chunk below it.
```r
mmash %>%
  group_by(day) %>%
  summarise(across(
    c(age, weight),
    list(mean = mean, sd = sd),
    na.rm = TRUE
  ))
#> # A tibble: 4 x 5
#>     day age_mean age_sd weight_mean weight_sd
#>   <dbl>    <dbl>  <dbl>       <dbl>     <dbl>
#> 1   -29    NaN     NA         NaN      NA
#> 2     1     26.0   7.28        75.3    13.1
#> 3     2     26.0   7.28        75.3    13.1
#> 4    NA     28     NA          70      NA
```
We can see from this that there is a random day of -29, which gives NaN for age and weight, as well as a missing (NA) day where the mean age is 28 and the mean weight is 70. We can always fix this at a later point, but for now, we will ignore them.
Let’s look at some other summary statistics, like the minimum and maximum with `min()` and `max()`. We add these into the `list()` with the other statistics:
```r
mmash %>%
  group_by(day) %>%
  summarise(across(
    c(age, weight),
    list(mean = mean, sd = sd, min = min, max = max),
    na.rm = TRUE
  ))
#> Warning in fn(col, ...): no non-missing arguments to min; returning Inf
#> Warning in fn(col, ...): no non-missing arguments to max; returning -Inf
#> Warning in fn(col, ...): no non-missing arguments to min; returning Inf
#> Warning in fn(col, ...): no non-missing arguments to max; returning -Inf
#> # A tibble: 4 x 9
#>     day age_mean age_sd age_min age_max weight_mean weight_sd weight_min
#>   <dbl>    <dbl>  <dbl>   <dbl>   <dbl>       <dbl>     <dbl>      <dbl>
#> 1   -29    NaN     NA       Inf    -Inf       NaN      NA            Inf
#> 2     1     26.0   7.28       0      40        75.3    13.1          60
#> 3     2     26.0   7.28       0      40        75.3    13.1          60
#> 4    NA     28     NA        28      28        70      NA            70
#> # … with 1 more variable: weight_max <dbl>
```
We could keep adding other variables and other summary statistics, but you can see the dataset is quickly becoming harder and harder to read. One way to fix this is by converting the data to the long format, since a longer format is easier for the eye to scan than a wider one.
9.3 Pivot data to long format
For instructors: Walk through and explain this section, making use of the tables and graphs.
Pivoting is an extremely useful action when working with data. Pivoting is when you convert data between longer forms (more rows) and wider forms (more columns). The tidyr package within the tidyverse contains two wonderful functions for pivoting: `pivot_longer()` and `pivot_wider()`. There is well-written documentation on pivoting on the tidyr website that can explain more about it.
The first one we’ll use, and probably the more commonly used in general, is `pivot_longer()`. It is commonly used because entering data in the wide form is easier and more time-efficient than entering data in the long form. For instance, if you were measuring glucose values over time in participants, you might enter the data like this:
However, when it comes time to analyze the data, this wide form is very inefficient and difficult to work with computationally and statistically. So, we do data entry in the wide form and use functions like `pivot_longer()` to get the data ready for analysis. Figure 9.2 visually shows what happens when you pivot from wide to long.
If you had, for instance, an ID column for each participant, the pivoting would look like what is shown in Figure 9.3.
Pivoting is a conceptually challenging thing to grasp, so don’t be disheartened if you can’t understand how it works yet. As you practice using it, you will understand it. With `pivot_longer()`, the first argument is the data. The other arguments are:

- `cols`: The columns to convert to long form. The input is a vector made using `c()` that contains the column names, like you would use in `select()` (e.g. you can use the `-` minus to exclude).
- `names_to`: Optional; the default is `name`. If provided, it sets the name (as a quoted character) of the newly created column that contains the original column names.
- `values_to`: Optional; the default is `value`. Like `names_to`, it sets the name (as a quoted character) of the newly created column, in this case the one that contains the values.
Let’s try this out with `mmash`. In your `doc/lesson.Rmd` file, create a new header called `## Pivot longer` and create a new code chunk below it. Now we can start typing in our code:
Ok, this gives an error because we are mixing data types. Let’s pivot only numbers.
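As a side note, you can reproduce this type-mixing error on a tiny made-up data frame (the names here are purely illustrative):

```r
library(tidyr)

mixed <- data.frame(
  gender = c("M", "F"),
  age = c(29, 31)
)

# Pivoting every column fails: the character `gender` values and
# numeric `age` values would have to share a single `value` column.
try(pivot_longer(mixed, everything()))
```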
```r
mmash %>%
  pivot_longer(where(is.numeric))
#> # A tibble: 470 x 5
#>    gender user_id samples      name              value
#>    <chr>  <chr>   <chr>        <chr>             <dbl>
#>  1 M      user_1  before sleep weight          6.50e+1
#>  2 M      user_1  before sleep height          1.69e+2
#>  3 M      user_1  before sleep age             2.90e+1
#>  4 M      user_1  before sleep cortisol_norm   3.41e-2
#>  5 M      user_1  before sleep melatonin_norm  1.74e-8
#>  6 M      user_1  before sleep day             1.00e+0
#>  7 M      user_1  before sleep ibi_s_mean      6.66e-1
#>  8 M      user_1  before sleep ibi_s_sd        1.64e-1
#>  9 M      user_1  before sleep hr_mean         9.06e+1
#> 10 M      user_1  before sleep hr_sd           1.30e+1
#> # … with 460 more rows
```
Nice! But not super useful. We can exclude specific columns from pivoting by putting a `-` minus before the column name. We can also drop columns entirely, for instance removing the `samples` column with `select()` before pivoting:
```r
mmash %>%
  select(-samples) %>%
  pivot_longer(c(-user_id, -day, -gender))
#> # A tibble: 423 x 5
#>    gender user_id   day name              value
#>    <chr>  <chr>   <dbl> <chr>             <dbl>
#>  1 M      user_1      1 weight          6.50e+1
#>  2 M      user_1      1 height          1.69e+2
#>  3 M      user_1      1 age             2.90e+1
#>  4 M      user_1      1 cortisol_norm   3.41e-2
#>  5 M      user_1      1 melatonin_norm  1.74e-8
#>  6 M      user_1      1 ibi_s_mean      6.66e-1
#>  7 M      user_1      1 ibi_s_sd        1.64e-1
#>  8 M      user_1      1 hr_mean         9.06e+1
#>  9 M      user_1      1 hr_sd           1.30e+1
#> 10 M      user_1      2 weight          6.50e+1
#> # … with 413 more rows
```
9.4 Exercise: Pivot your summarised results to longer form
Time: 10 min
Use `pivot_longer()` after the `summarise()` we did previously:

- Look into the help documentation of `ends_with()` and use it to pivot only the columns that end with `_mean` and `_sd`. `ends_with()` is vectorized, meaning you can give it a vector of column endings.
- After using `pivot_longer()`, knit the R Markdown document into HTML (Ctrl-Shift-K or the “Knit” button).
- Open up the Git interface and add and commit the changes into the Git history.
9.5 Pivot data to wider form
After using `pivot_longer()` on the summarized data, it looks nice, but it still has a problem: we have both mean and standard deviation values in the same column. It would be nicer if we had two columns, one for the means and one for the sd values. We can’t pivot to wider right now, because that would put the data right back to where we started. So instead, we need to separate the existing column of names into two columns: one for the variable name and one for the summary statistic. We can do this within `pivot_longer()` by using two arguments:

- `names_to` can take a vector of new column names that will be created, meaning if you give it two new names, it will create two columns. But when it has more than one new column name, it relies on you also using at least `names_sep`.
- `names_sep`, which means “names separation”, is used to tell `pivot_longer()` that a specific character separates two or more distinct possible columns. In our case, `_` separates the variable name from the summary statistic.

So, add these two arguments:
```r
mmash %>%
  group_by(day) %>%
  summarise(across(
    c(age, weight),
    list(mean = mean, sd = sd),
    na.rm = TRUE
  )) %>%
  pivot_longer(
    ends_with(c("_mean", "_sd")),
    names_sep = "_",
    names_to = c("name", "summary_statistic")
  )
#> # A tibble: 16 x 4
#>      day name   summary_statistic   value
#>    <dbl> <chr>  <chr>               <dbl>
#>  1   -29 age    mean              NaN
#>  2   -29 weight mean              NaN
#>  3   -29 age    sd                 NA
#>  4   -29 weight sd                 NA
#>  5     1 age    mean               26.0
#>  6     1 weight mean               75.3
#>  7     1 age    sd                  7.28
#>  8     1 weight sd                 13.1
#>  9     2 age    mean               26.0
#> 10     2 weight mean               75.3
#> 11     2 age    sd                  7.28
#> 12     2 weight sd                 13.1
#> 13    NA age    mean               28
#> 14    NA weight mean               70
#> 15    NA age    sd                 NA
#> 16    NA weight sd                 NA
```
Now we are in a good position to pivot wider. So how does `pivot_wider()` work? Like its opposite `pivot_longer()`, the first position argument is the data, and the other arguments are:

- `id_cols`: This is optional, since it defaults to all existing columns. It tells `pivot_wider()` which columns to use as the identifiers when converting. Unlike `pivot_longer()`, which doesn’t require some type of “key” or “id” column to convert to the long form, converting to the wide form does require one, because `pivot_wider()` needs to know which rows belong with each other.
- `names_from`: Similar to `names_to` in `pivot_longer()`, this is the name of the column that has the values that will make up the new column names. Unlike `names_to`, which takes a character string as input, the column name given to `names_from` must be unquoted, because you are selecting a column that already exists in the dataset.
- `values_from`: Same as `names_from`, this is the column name (that exists and must be given unquoted) for the values that will go in the new columns.
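As a small self-contained sketch of these three arguments (made-up data, not the course dataset):

```r
library(tidyr)

# Hypothetical long-form data: one row per person per statistic.
stats_long <- data.frame(
  person = c("a", "a", "b", "b"),
  statistic = c("mean", "sd", "mean", "sd"),
  value = c(26, 7.3, 75, 13)
)

stats_long %>%
  pivot_wider(
    id_cols = person,
    names_from = statistic,
    values_from = value
  )
#> 2 rows, one per person, with new `mean` and `sd` columns.
```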
Figure 9.4 visually shows what’s happening when using `pivot_wider()`. In our case, we want the new column names to come from `summary_statistic` and the values to come from `value`. Let’s add it to the end of our pipe:
```r
mmash %>%
  group_by(day) %>%
  summarise(across(
    c(age, weight),
    list(mean = mean, sd = sd),
    na.rm = TRUE
  )) %>%
  pivot_longer(
    ends_with(c("_mean", "_sd")),
    names_sep = "_",
    names_to = c("name", "summary_statistic")
  ) %>%
  pivot_wider(names_from = summary_statistic, values_from = value)
#> # A tibble: 8 x 4
#>     day name    mean    sd
#>   <dbl> <chr>  <dbl> <dbl>
#> 1   -29 age    NaN   NA
#> 2   -29 weight NaN   NA
#> 3     1 age     26.0  7.28
#> 4     1 weight  75.3 13.1
#> 5     2 age     26.0  7.28
#> 6     2 weight  75.3 13.1
#> 7    NA age     28   NA
#> 8    NA weight  70   NA
```
And technically, you can pivot longer and wider all at once by using some features of `pivot_longer()`. The only change we need to make to the code is to replace `"summary_statistic"` in `names_to` with the special value `".value"`. Delete the `pivot_wider()` step and you have a longer-to-wider dataset in one go! We didn’t cover this first because we wanted to show how to use `pivot_wider()` on its own.
```r
mmash %>%
  group_by(day) %>%
  summarise(across(
    c(age, weight),
    list(mean = mean, sd = sd),
    na.rm = TRUE
  )) %>%
  pivot_longer(
    ends_with(c("_mean", "_sd")),
    names_sep = "_",
    names_to = c("name", ".value")
  )
#> # A tibble: 8 x 4
#>     day name    mean    sd
#>   <dbl> <chr>  <dbl> <dbl>
#> 1   -29 age    NaN   NA
#> 2   -29 weight NaN   NA
#> 3     1 age     26.0  7.28
#> 4     1 weight  75.3 13.1
#> 5     2 age     26.0  7.28
#> 6     2 weight  75.3 13.1
#> 7    NA age     28   NA
#> 8    NA weight  70   NA
```
9.6 Exercise: Convert this code into a function
Time: 15 min
Using the same workflow we’ve been doing throughout this course, convert this code into a function.
- Name the function `tidy_summarise_by_day`.
- Create two arguments, called `data` and `columns`.
- To work well with the pipe, put the `data` argument first.
- Test that it works by using it on the `mmash` dataset.
- It won’t work and will throw an error. In the next section we will cover how to debug the function.
- Move it over into `R/functions.R`, add Roxygen documentation, and use explicit function calls with `packagename::functionname()`.
- Don’t forget, use `?functionname` to find out which package a function comes from.
9.7 Debugging functions
Debugging is one of those things that seems really scary and difficult, but once you try it and use it, it is actually not as intimidating. To debug, which means to find and fix problems in your code, there are two ways to do it:
- By typing `browser()` at the start of your function and then re-running it (either by manually running the code or with `load_all()`). This works for all functions in all R Project types.
- By using RStudio’s “Breakpoints” while in a Package R Project type. This can be started by putting your cursor on the line in the function you want to debug and either:
  - Clicking “Debug -> Toggle Breakpoints”, or
  - Clicking the empty space to the left of the line number.
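As a minimal self-contained illustration of the first approach (a toy function, not one from the course):

```r
divide <- function(x, y) {
  # In an interactive session, browser() pauses execution here and
  # opens the Browse> prompt, where you can inspect `x` and `y`
  # and step through the rest of the function line by line.
  browser()
  x / y
}

# Calling divide(1, 2) in the Console drops you into debug mode;
# press "Next" to step, "Continue" to finish, or "Stop" (Q) to quit.
```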
While `browser()` works with non-Package Projects, we’re going to cover and use the Breakpoint approach for this course; the skills are directly transferable to using `browser()`. In the `R/functions.R` file, go to the `tidy_summarise_by_day()` function and click the space to the left of the line number for the first line inside the function. Something will pop up at the top of the pane saying something like:
“Breakpoints will be activated when the package is built and reloaded.”
Activate the breakpoint by running `load_all()` with Ctrl-Shift-L. The dot should now be solid red, rather than a hollow circle. In the Console, type out and run the code:
You may have also already written that code previously, so you can use the up arrow in the Console to go back to it. After running the code, a bunch of things happen:
- A yellow line will highlight the code in the function, along with a green arrow on the left of the line number
- The Console prompt will now start with `Browse>` and will show text like `debug at ...`
- There will be new buttons on the top of the Console like “Next”, “Continue”, and “Stop”
- The Environment pane will be empty and will say “Traceback”
We are now in debug mode. In this mode you can really investigate what is happening with your code and figure out how to fix it. The way to find out what’s wrong is by running the code bit by bit. First, let’s check that the first line works by highlighting the code (shown below) and running it.
After highlighting and running this, we see that it works fine. Alright, next step:
Still fine. Next:
Ah, here is where the problem occurs. The argument `columns` contains the `age` variable we want to use in `summarise()`… but the problem is that R is interpreting `age` as an object that exists in the R session, when in reality it only exists inside the `mmash` data frame. But it works when we use it outside of the function, so why not here?
So what’s happening? We’ve encountered a problem due to “non-standard evaluation” (or NSE). NSE is a feature of R and is used quite a lot throughout R, but is especially used in the tidyverse packages. It’s one of the first things computer scientists complain about when they use R, because it is such a foreign thing in other programming languages. But NSE is what allows you to use formulas (e.g. `y ~ x + x2` in modeling) or to type out `select(Gender, BMI)` or `library(purrr)`. In “standard evaluation”, it would instead be `select("Gender", "BMI")` or `library("purrr")`. So NSE gives flexibility and ease of use for the user (we don’t have to type quotes every time) when doing data analysis, but can give some headaches when programming in R, like when making functions. There’s more detail about this on the dplyr website, which gives a few options for dealing with it, the simplest of which is to wrap `columns` inside the `across()` function with “curly-curly” brackets (`{{ }}`). So let’s “Stop” the debugging, add these curly brackets around `columns`, and remove the Breakpoint by clicking it (if it hasn’t been removed already).
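The full fixed function isn’t printed in this excerpt, so here is a sketch of roughly what `tidy_summarise_by_day()` might look like with the curly-curly fix applied (the exact body and statistics may differ from the course version, which also tidies the output further):

```r
library(dplyr)

tidy_summarise_by_day <- function(data, columns) {
  data %>%
    group_by(day) %>%
    summarise(across(
      # Wrapping `columns` in {{ }} ("curly-curly") tells dplyr to
      # look for the columns inside `data`, not in the R session.
      {{ columns }},
      list(mean = mean, sd = sd),
      na.rm = TRUE
    ))
}
```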
Now, if we run the function again:
Yea! It works! If we want to include other variables:
9.8 Exercise: Extend the function to allow adding other summary statistics
Time: 20 min
We’ve made the function so that we can include any column we want, but now let’s make it so we can use any summary function we want, not just `mean()` and `sd()`, through a new argument called `summary_fn`. This task will be challenging, so discuss and work with your neighbour on completing this exercise. Work within the debugging environment to deal with any problems. Some hints:
- `list(mean = mean, ...)` is a named list, meaning you can use `names()` to get a character vector of the names in the list.
- You will have to create an intermediate object based on the new `summary_fn` argument.
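For instance, with a named list of summary functions like the one passed to `summarise()`:

```r
# A named list: the names are "mean" and "sd",
# the values are the functions themselves.
summary_fn <- list(mean = mean, sd = sd)

names(summary_fn)
#> [1] "mean" "sd"
```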
9.9 Making prettier output in R Markdown
What we created is nice and all, but since we are working in an R Markdown
document and knitting to HTML, let’s make it easier for others (including
yourself) to read the document. Let’s format the output as actual tables.
We can do that with `knitr::kable()` (meaning “knitr table”). We can even add a table caption with the `caption` argument:
```r
mmash %>%
  tidy_summarise_by_day(
    c(age, weight, height),
    list(mean = mean, min = min, max = max)
  ) %>%
  knitr::kable(caption = "Descriptive statistics of some variables.")
#> Warning in fn(col, ...): no non-missing arguments to min; returning Inf
#> Warning in fn(col, ...): no non-missing arguments to max; returning -Inf
#> Warning in fn(col, ...): no non-missing arguments to min; returning Inf
#> Warning in fn(col, ...): no non-missing arguments to max; returning -Inf
#> Warning in fn(col, ...): no non-missing arguments to min; returning Inf
#> Warning in fn(col, ...): no non-missing arguments to max; returning -Inf
```
Then knit the document and check out the HTML file. So pretty! 😁
9.10 General workflow up to this point
You now have some skills and tools to allow you to reproducibly import, process, clean, join, and eventually analyze your datasets. For the next two exercises, you will apply these skills and tools. Listed below are the general workflows we’ve covered and that you can use as a guideline to completing the following exercises:
- Import with the
- Convert into a function with the
R/functions.R, restart R, and
- Test that the datasets join properly while in `doc/lesson.Rmd`, then cut and paste the joining code into `data-raw/mmash.R`.
- Restart R and re-create the `data/mmash.rda` dataset by sourcing `data-raw/mmash.R`.
- Restart R, load the new dataset with `load_all()` (Ctrl-Shift-L), and analyze it in `doc/lesson.Rmd`.
- Add any additional cleaning code to `data-raw/mmash.R` and update the `data/mmash.rda` dataset whenever you find problems.
- Write R code in chunks in `doc/lesson.Rmd` to further analyze your data, and check reproducibility by knitting to HTML often.
- Part of this workflow is also writing the R code so that the output looks nice in the HTML (or Word) document, e.g. by creating tables or figures of the output.
And don’t forget, for each stage, add and commit the changes you’ve made to the files into the Git history.
9.11 Exercise: Process and join sleep and questionnaire data
Time: 25 min
There are still a few datasets that you can join in with the current `mmash` dataset.
Using the workflows in Section 9.10 as a guide,
start from the beginning and import, process, clean, make functions,
and join these two datasets in with the others so that they get
included in the
data/mmash.rda final dataset. Afterwards, do some
descriptive analysis using the `tidy_summarise_by_day()` function we created.
9.12 Exercise: Create a second dataset of only the Actigraph and RR data
Time: 25 min
The Actigraph and RR datasets contain a ton of interesting and useful data that gets destroyed when we first summarize and then join them with the other datasets. While we can’t meaningfully join all this data with the other datasets, we can join them on their own.
Using the workflows in Section 9.10 as a guide,
start from the beginning and import, process, clean, make functions,
and create a final dataset of only the Actigraph and RR data.
- Join only these two datasets by
- Name the new dataset `actigraph_rr` and save it to `data/` by using another `usethis::use_data()` line in the `data-raw/mmash.R` script.