9 Quickly re-arranging data with pivots
🚧 We are doing major changes to this workshop, so much of the content will be changed. 🚧
Here we will continue using the Workflow block as we cover the fourth block, “Work with project data” in Figure 9.1.
9.1 Learning objectives
Describe the concept of “pivoting” data to convert data in a “long” format to a “wide” format and vice versa.
Identify situations where it is more appropriate that data is in the long or wide format.
Apply the
pivot_longer()
andpivot_wider()
functions from the tidyr package to pivot data.
9.2 Setup for the analysis in Quarto
We now have a working dataset to start doing some simple analyses on in the Quarto document. A recommended workflow with Quarto is to often render / “knit” it and make sure your analysis is reproducible (while on your computer). We already cleaned it up from the previous session.
We will now add the load()
code right below the source()
function in the setup
code chunk:
As we write more R code and do some simple analyses of the data, we are going to be knitting fairly often (depending on how long the analysis takes of workshop). The main reason for this is to ensure that whatever you are writing and coding will at least be reproducible on your computer, since Quarto is designed to ensure the document is reproducible.
9.3 Re-arranging data for easier summarizing
Now that we have the final dataset to work with, we want to explore it a bit with some simple descriptive statistics. One extremely useful and powerful tool to summarizing as well as processing/wrangling data is by “pivoting” your data. Pivoting is when you convert data between longer forms (more rows) and wider forms (more columns). The tidyr package within tidyverse contains two wonderful functions for pivoting: pivot_longer()
and pivot_wider()
. There is a well written documentation on pivoting in the tidyr website that can explain more about it. The first thing we’ll use, and probably the more commonly used in general, is pivot_longer()
. This function is commonly used because entering data in the wide form is easier and more time efficient than entering data in long form. For instance, if you were measuring glucose values over time in participants, you might enter data in like this:
However, when it comes time to analyze the data, this wide form is very inefficient and difficult to computationally and statistically work with. So, we do data entry in wide and use functions like pivot_longer()
to get the data ready for analysis. Figure 9.2 visually shows what happens when you pivot from wide to long.

A
and B
columns are stacked on top of each other in the newly created V
column.
If you had, for instance, an ID column for each participant, the pivoting would look like what is shown in Figure 9.3.

id
column is excluded when pivoting into the data on the right.
Pivoting is a conceptually challenging thing to grasp, so don’t be disheartened if you can’t understand how it works yet. As you practice using it, you will understand it. With pivot_longer()
, the first argument is the data itself. The other arguments are:
-
cols
: The columns to use to convert to long form. The input is a vector made usingc()
that contains the column names, like you would use inselect()
(e.g. you can use theselect_helpers
likestarts_with()
, or-
minus to exclude). -
names_to
: Optional, the default isname
. If provided, it will be the name of the newly created column (as a quoted character) that contains the original column names. -
values_to
: Optional, the default isvalue
. Likenames_to
, sets the name of the new columns.
Both pivot_longer()
and its opposite pivot_wider()
, which we will cover later in the session, are incredibly powerful functions. We can’t show close to everything it can do in this workshop, but if you want to learn more, read up on the documentation for it.
Since we’ll be using tidyr, we will need to add it as a dependency by running this in the Console:
Console
usethis::use_package("tidyr")
Let’s try this out with mmash
. In your doc/learning.qmd
file, create a new header called ## Pivot longer
and create a new code chunk below that with Ctrl-Alt-ICtrl-Alt-I or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “new chunk”). Now we can start typing in our code:
doc/learning.qmd
mmash |>
# pivot every column
pivot_longer(everything())
Why the error? We get an error because we are trying to mix data types. We can’t have character data and number data in the same column. Let’s pivot only numbers.
doc/learning.qmd
mmash |>
pivot_longer(where(is.numeric))
Nice! But not super useful. We can exclude specific columns from pivoting with -
before the column name, for instance with user_id
and day
. Let’s drop the samples
column before pivoting since day
gives us the same information:
9.4 Exercise: Summarise your data after pivoting
Time: 15 minutes.
Using the group_by()
and summarise()
functions we learned in Section 7.6, complete these tasks starting from this code.
doc/learning.qmd
|>
mmash select(-samples) |>
pivot_longer(c(-user_id, -day, -gender)) |>
___
- Continuing the
|>
frompivot_longer()
, usegroup_by()
to group the data bygender
,day
, andname
(the long form column produced frompivot_longer()
). - After grouping with
group_by()
, usesummarise()
andacross()
on thevalue
column and find the mean and standard deviation (put them into a named list like we did previously). Don’t forget to usena.rm = TRUE
to exclude missing values. - Stop the grouping effect with
.groups = "drop"
insummarise()
. - Run styler while in the
doc/learning.qmd
file with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “style file”). - Render the Quarto file into HTML with Ctrl-Shift-KCtrl-Shift-K or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “render”).
- Open up the Git interface and add and commit the changes to
doc/learning.qmd
with Ctrl-Alt-MCtrl-Alt-M or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “commit”)
Click for the solution. Only click if you are struggling or are out of time.
9.5 Pivot data to wider form
After using pivot_longer()
on the summarised data, it looks nice, but it could be better. Right now it is in a pretty long form, but for showing as a table, having columns for either gender
or day
would make it easier to compare the mean and SD values we obtain. This is where we can use pivot_wider()
to get the data wider rather than long. The arguments for pivot_wider()
are very similar to those in pivot_longer()
, except instead of names_to
and values_to
, they are called names_from
and values_from
. Like with many R functions, the first argument is the data and the other arguments are:
-
id_cols
: This is optional as it will default to all column names. This argument tellspivot_wider()
to use the given columns as the identifiers for when converting. Unlikepivot_longer()
which doesn’t require some type of “key” or “id” column to convert to long form, the conversion to wide form requires some type of “key” or “id” column becausepivot_wider()
needs to know which rows belong with each other. -
names_from
: Similar to thepivot_longer()
, this is the name of the column that has the values that will make up the new columns. Unlike with thenames_to
argument inpivot_longer()
which takes a character string as input, the column name fornames_from
must be unquoted because you are selecting a column that already exists in the dataset. -
values_from
: Same asnames_from
, this is the column name (that exists and must be given unquoted) for the values that will be in the new columns.
Figure 9.4 visually shows what’s happening when using pivot_wider()
.

N
, which is used in the names_from
argument, are used as the names for the new columns A
and B
in the new data.
In our case, we want either gender
or day
as columns with the mean and SD values. Let’s use pivot_wider()
on day
to see differences between days.
Hmm, didn’t work. Nothing has been pivoted to wider. That’s because we are missing the value_from
argument. Since we actually have the two value_mean
and value_sd
columns that have “values” in them, we need to tell pivot_wider()
to use those two columns. Since values_from
works similar to select()
, we can use starts_with()
to select the columns starting with "values"
.
doc/learning.qmd
Now we have a different problem. There are missing values in both the day
and gender
columns that, at least in this case, we don’t want pivoted. Shouldn’t they be removed when we include na.rm = TRUE
in our code? The function of na.rm = TRUE
is not to remove NA
values, but to instead tell R to not include variables in mmash
that are NA
when calculating the mean and standard deviation. In this particular case, the columns value_mean_NA
or value_mean_-29
have NA
or NaN
values because there are no other values in the data other than NA
. Since we don’t actually care about missing days (or the random -29
day), we can remove missing values with the function called drop_na()
. We also don’t care about missing gender
values, so we’ll drop them as well. Add it in the pipe right before group_by()
.
doc/learning.qmd
mmash |>
select(-samples) |>
pivot_longer(c(-user_id, -day, -gender)) |>
drop_na(day, gender) |>
group_by(gender, day, name) |>
summarise(
across(
value,
list(
mean = \(x) mean(x, na.rm = TRUE),
sd = \(x) sd(x, na.rm = TRUE)
)
),
.groups = "drop"
) |>
pivot_wider(names_from = day, values_from = starts_with("value"))
Now that that works, let’s render the document to HTML with Ctrl-Shift-KCtrl-Shift-K or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “render”).
9.6 Exercise: Convert this code into a function
Time: 15 minutes.
Using the same workflow we’ve been doing throughout this workshop, convert the code we just wrote above into a function.
- Name the function
tidy_summarise_by_day
. - Create one argument called
data
. Create a new variable inside the function calleddaily_summary
and put it inreturn()
so the function outputs it. - Test that the function works.
- Add Roxygen documentation with Ctrl-Shift-Alt-RCtrl-Shift-Alt-R or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “roxygen comment”) and use explicit function calls with
packagename::
.- Don’t forget, you can use
?functionname
to find out which package the function comes from.
- Don’t forget, you can use
- Move (cut and paste) the newly created function over into the
R/functions.R
file. - Run styler while in the
R/functions.R
file with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “style file”). - Restart R with Ctrl-Shift-F10Ctrl-Shift-F10 or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “restart”), go into the
doc/learning.qmd
file and run thesetup
code chunk in the Quarto document with thesource()
andload()
commands. Then test that the new function works in a code chunk at the bottom of the document. - Render the Quarto document to HTML with Ctrl-Shift-KCtrl-Shift-K or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “render”).
- Add and commit the changes to the Git history with Ctrl-Alt-MCtrl-Alt-M or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “commit”).
Click for the solution. Only click if you are struggling or are out of time.
#' Calculate tidy summary statistics by day.
#'
#' @param data The MMASH dataset.
#'
#' @return A data.frame/tibble.
#'
tidy_summarise_by_day <- function(data) {
daily_summary <- data |>
dplyr::select(-samples) |>
tidyr::pivot_longer(c(-user_id, -day, -gender)) |>
tidyr::drop_na(day, gender) |>
dplyr::group_by(gender, day, name) |>
dplyr::summarise(
dplyr::across(
value,
list(
mean = \(x) mean(x, na.rm = TRUE),
sd = \(x) sd(x, na.rm = TRUE)
)
),
.groups = "drop"
) |>
tidyr::pivot_wider(
names_from = day,
values_from = dplyr::starts_with("value")
)
return(daily_summary)
}
# Testing that the function works.
mmash |>
tidy_summarise_by_day()
9.7 Extending the function to use other statistics and to be tidier
Now that we’ve made the tidy summary code into a function, let’s make it more generic so we can use other summary statistics and to have the output be a bit tidier. For instance, it would be nice to be able to do something like this (we’d still have to use na.rm
though):
mmash |>
tidy_summarise_by_day(\(x) median(x, na.rm = TRUE))
mmash |>
tidy_summarise_by_day(\(x) max(x, na.rm = TRUE))
mmash |>
tidy_summarise_by_day(
list(
median = \(x) median(x, na.rm = TRUE),
max = \(x) max(x, na.rm = TRUE)
)
)
Before we get to adding this functionality, let’s first make it so the function has a tidier output. Specifically, we want to round the values so they are easier to read. Go into the R/functions.R
script to the tidy_summarize_by_day()
function. We’ll create a new line right after the dplyr::summarise()
function, after the |>
pipe. Since we want to round values of existing columns, we need to use mutate()
. And like we used across()
in summarise()
, we can also use across()
within mutate()
on specific columns. In our case, we want to round columns that start_with()
the word "value"
to 2 digits.
R/functions.R
tidy_summarise_by_day <- function(data) {
daily_summary <- data |>
dplyr::select(-samples) |>
tidyr::pivot_longer(c(-user_id, -day, -gender)) |>
tidyr::drop_na(day, gender) |>
dplyr::group_by(gender, day, name) |>
dplyr::summarise(
dplyr::across(
value,
list(
mean = \(x) mean(x, na.rm = TRUE),
sd = \(x) sd(x, na.rm = TRUE)
)
),
.groups = "drop"
) |>
dplyr::mutate(dplyr::across(
dplyr::starts_with("value"),
\(x) round(x, digits = 2)
)) |>
tidyr::pivot_wider(
names_from = day,
values_from = dplyr::starts_with("value")
)
return(daily_summary)
}
Source the R/functions.R
file with Ctrl-Shift-SCtrl-Shift-S or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “source”), then test out the function in the Console:
Console
tidy_summarise_by_day(mmash)
That’s much easier to read with the values rounded. Now let’s add the ability to change the summary statistics function to something else. This is a surprisingly easy thing so before we do that, let’s take a few minutes to brainstorm how we can achieve this.
Now that we’ve discussed this and come to a conclusion, let’s update the function.
R/functions.R
tidy_summarise_by_day <- function(data, summary_fn) {
daily_summary <- data |>
dplyr::select(-samples) |>
tidyr::pivot_longer(c(-user_id, -day, -gender)) |>
tidyr::drop_na(day, gender) |>
dplyr::group_by(gender, day, name) |>
dplyr::summarise(
dplyr::across(
value,
summary_fn
),
.groups = "drop"
) |>
dplyr::mutate(dplyr::across(
dplyr::starts_with("value"),
\(x) round(x, digits = 2)
)) |>
tidyr::pivot_wider(
names_from = day,
values_from = dplyr::starts_with("value")
)
return(daily_summary)
}
Source the R/functions.R
file with Ctrl-Shift-SCtrl-Shift-S or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “source”), then test out the function in the Console:
Console
tidy_summarise_by_day(mmash, \(x) max(x, na.rm = TRUE))
Now that it works, let’s add some summary statistics to the doc/learning.qmd
file.
While everything now works, the column names are not very clear. For example, what does value_median_1
or 1
mean? It would be nice if it said something like median_day_1
or day_1
. And with pivot_wider()
, we can do that with the name_glue
argument! With this argument, we can write a custom variable naming scheme that involves combining the use of {}
and a special {.value}
keyword within a string in the argument to create these custom column names in the wider dataset. So this would look like {.value}_day_{day}
, where {.value}
refers to the values of the column(s) used in values_from
and the {day}
refers to the column used in names_from
.
We can then pipe the output to rename_with()
from the dplyr package, which acts like map()
(but for renaming columns) and can take an anonymous function. We can use this to then remove the value_
from the start of the column name.
R/functions.R
tidy_summarise_by_day <- function(data, summary_fn) {
daily_summary <- data |>
dplyr::select(-samples) |>
tidyr::pivot_longer(c(-user_id, -day, -gender)) |>
tidyr::drop_na(day, gender) |>
dplyr::group_by(gender, day, name) |>
dplyr::summarise(
dplyr::across(
value,
summary_fn
),
.groups = "drop"
) |>
dplyr::mutate(dplyr::across(
dplyr::starts_with("value"),
\(x) round(x, digits = 2)
)) |>
tidyr::pivot_wider(
names_from = day,
values_from = dplyr::starts_with("value"),
names_glue = "{.value}_day_{day}"
) |>
dplyr::rename_with(\(x) stringr::str_remove(x, "value_"))
return(daily_summary)
}
Let’s see how it looks by rendering the doc/learning.qmd
file again with Ctrl-Shift-KCtrl-Shift-K or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “render”). One of the output tables show now look something like:
Nice!! 😁 🎉
9.8 Making prettier output in Quarto
What we created is nice and all, but since we are working in a Quarto document and generating to HTML, let’s make it easier for others (including yourself) to read the document. Let’s make the output as an actual table. We can do that with knitr::kable()
(meaning “knitr table”). We can also add a table caption with the caption
argument.
Then render the document with Ctrl-Shift-KCtrl-Shift-K or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “render”) and check out the HTML file. So pretty! 😁 Well, there’s lots of things to fix up, but its a good starting place. Let’s finish by running styler with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “style file”) and then committing the changes to the Git history with Ctrl-Alt-MCtrl-Alt-M or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “commit”).
9.9 Summary
- Data is usually structured to varying degrees as wide or long format.
- Use
pivot_longer()
to convert from wide to long. - Use
pivot_wider()
to convert from long to wide.
- Use