11 Pivoting your data from and to long or wide

11.1 Learning objectives

Describe the concept of “pivoting” data to convert data in a “long” format to a “wide” format and vice versa.
Identify situations where it is more appropriate that data is in the long or wide format.
Apply the pivot_longer() and pivot_wider() functions from the tidyr package to pivot data.
Recognise and appreciate the impact that seeking out and using existing functions to solve problems has on how you get things done, instead of writing custom code to do the things you want. For example, looking through the functions found in packages like tidyr.

11.3 📖 Reading task: Pivoting data for easier wrangling

Teacher note

Briefly walk through it again after they’ve read it and explain it a bit more, making use of the tables and graphs. Pivoting is often a difficult concept to grasp, to it’s good to go over it a few times. Emphasise the difficulty of it, to remind them it is ok to not get it.

Time: ~6 minutes.

Rarely is our data in the exact format we need or want it to be in to effectively process, analyse, or visualize it. Often we’ll need to do some heavy reshaping. One very powerful way to reshape data is with “pivoting”. Pivoting is when you take a dataset’s rows and convert them to columns or convert columns to rows, and in many different combinations of this.

The tidyr package within the tidyverse contains two wonderful functions for pivoting: pivot_longer() and pivot_wider(). There is a well written documentation on pivoting in the tidyr website that can explain more about it that you can read at another time.

Most use cases of pivoting are to convert to the longer form, which you can do with pivot_longer() and is what we will cover first in this session. There are many reasons pivoting longer is the most common pivot direction.

An example is if you ever need to enter data. Entering data, especially in a spreadsheet tool, in the wide form is much easier and more time efficient than entering data in long form. Consider the scenario where you are entering data from a participant for glucose values that were measured every 30 minutes. So you might enter data like this:

This is a **wide** dataset that is useful for data entry.
person_id	glucose_0	glucose_30	glucose_60
1	5.6	7.8	4.5
2	4.7	9.5	5.3
3	5.1	10.2	4.2

If working with a spreadsheet, is this is to use tab to move to the next cell and continue entering the data. However, when it comes time to analyze the data, this wide form is very inefficient and difficult to computationally and statistically work with. So, we do data entry in wide and use functions like pivot_longer() to get the data ready for analysis. See Figure 11.1 below that visually shows what happens when you pivot from wide to long.

A diagram showing the use of the pivot longer function, where a table with two columns, A and B, is converted to a table with two columns, name and value. The original columns values are stacked in the value column and the original columns names are stacked in the name column. — Figure 11.1: Pivot longer in tidyr. New columns are called `name` and `value`. Notice how the values in `A` and `B` columns are stacked on top of each other in the newly created `V` column.

If you had, for instance, an ID column for each participant, the pivoting would look like what is shown in Figure 11.2.

A diagram showing the use of the pivot longer function, where a table with three columns, id, A and B, is converted to a table with three columns, id, name and value. The original columns values are stacked in the value column and the original columns names are stacked in the name column, but the original id column remains, but with the ids stacked on top of each other. — Figure 11.2: Pivot longer in tidyr, excluding an `id` column. New columns are called `name` and `value`, as well as the old `id` column. Notice how, unlike the previous image, the `id` column is excluded when pivoting into the data on the right.

Pivoting is a conceptually challenging thing to grasp, so don’t be discouraged if you can’t understand how it works yet or of the ways you could use. As you practice using it, you will slowly begin to understand it. With pivot_longer(), the first argument is the data itself. The other arguments are:

cols: The columns to use to convert to long form. The input is a vector made using c() that contains the column names, like you would use in select() (e.g. you can use the select_helpers like starts_with(), or - minus to exclude).
names_to: Optional, the default is name. If provided, it will be the name of the newly created column (as a quoted character) that contains the original column names.
values_to: Optional, the default is value. Like names_to, sets the name of the new columns.

Both pivot_longer() and its opposite pivot_wider(), which we will cover later in the session, are incredibly powerful functions. We can’t show close to everything it can do in this workshop, but if you want to learn more, read up on the documentation for it.

Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩

11.4 Pivot to long form

🧑‍🏫 Teacher note

Emphasise that there are a lot of really powerful ways you can use pivot_longer() and to point out looking to the documentation to see some examples of that.

Our CGM data is already in a long format, but we eventually want to join it with the participant details data. The participant details data is in a wide format. Let’s start with importing it in the setup code chunk of our docs/learning.qmd file:

docs/learning.qmd

participant_details <- here("data-raw/dime/participant_details.csv") |>
  import_dime()

Run the code in the setup code chunk by running each line with Ctrl-Enter. Then, let’s move to the end of our docs/learning.qmd file and create a new header called ## Pivot longer. Below that, create a new code chunk with Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type “new chunk”). We’ll start with seeing what the data looks like:

docs/learning.qmd

participant_details

# A tibble: 60 × 6
      id gender age_years intervention        start_date end_date  
   <dbl> <chr>      <dbl> <chr>               <date>     <date>    
 1   101 woman       45.6 baseline            2021-03-15 2021-03-22
 2   101 woman       45.6 low bioactive diet  2021-03-23 2021-04-06
 3   101 woman       45.6 high bioactive diet 2021-05-11 2021-05-25
 4   102 man         43.7 baseline            2021-03-15 2021-03-22
 5   102 man         43.7 high bioactive diet 2021-03-23 2021-04-06
 6   102 man         43.7 low bioactive diet  2021-05-11 2021-05-25
 7   104 man         37.3 baseline            2021-06-21 2021-06-28
 8   104 man         37.3 low bioactive diet  2021-06-29 2021-07-13
 9   104 man         37.3 high bioactive diet 2021-08-23 2021-09-06
10   105 man         35.8 baseline            2021-04-13 2021-04-21
# ℹ 50 more rows

You can see that the data, while the data has the intervention column in the long format, the dates are in the wide format, with start_date and end_date. Our CGM and sleep only has a single date column, so we can’t yet join these datasets with participant_details. We first need to pivot the data to the long format. Let’s add a pipe to participant_details to output into pivot_longer():

docs/learning.qmd

participant_details |>
  pivot_longer()

Error in `pivot_longer()`:
! `cols` must select at least one column.

This give an error because we haven’t told it which columns to use yet. We want to pivot only the date columns. We can do this using tidyselect helpers, for instance the helper function ends_with() is perfect for this case, since both date columns end with date.

docs/learning.qmd

participant_details |>
  pivot_longer(ends_with("date"))

# A tibble: 120 × 6
      id gender age_years intervention        name       value     
   <dbl> <chr>      <dbl> <chr>               <chr>      <date>    
 1   101 woman       45.6 baseline            start_date 2021-03-15
 2   101 woman       45.6 baseline            end_date   2021-03-22
 3   101 woman       45.6 low bioactive diet  start_date 2021-03-23
 4   101 woman       45.6 low bioactive diet  end_date   2021-04-06
 5   101 woman       45.6 high bioactive diet start_date 2021-05-11
 6   101 woman       45.6 high bioactive diet end_date   2021-05-25
 7   102 man         43.7 baseline            start_date 2021-03-15
 8   102 man         43.7 baseline            end_date   2021-03-22
 9   102 man         43.7 high bioactive diet start_date 2021-03-23
10   102 man         43.7 high bioactive diet end_date   2021-04-06
# ℹ 110 more rows

Awesome! We now have a long format with dates in a single column. But we also have new column names that are a bit ambiguous, like name and value. We can tell pivot_longer() to give them more descriptive names with names_to and values_to. But, we also don’t need the name column, since implicitly we know that the last date is the end date and the first date is the start date. So, we can set names_to = NULL to drop the new name column from being created.

docs/learning.qmd

participant_details |>
  pivot_longer(ends_with("date"), names_to = NULL, values_to = "date")

# A tibble: 120 × 5
      id gender age_years intervention        date      
   <dbl> <chr>      <dbl> <chr>               <date>    
 1   101 woman       45.6 baseline            2021-03-15
 2   101 woman       45.6 baseline            2021-03-22
 3   101 woman       45.6 low bioactive diet  2021-03-23
 4   101 woman       45.6 low bioactive diet  2021-04-06
 5   101 woman       45.6 high bioactive diet 2021-05-11
 6   101 woman       45.6 high bioactive diet 2021-05-25
 7   102 man         43.7 baseline            2021-03-15
 8   102 man         43.7 baseline            2021-03-22
 9   102 man         43.7 high bioactive diet 2021-03-23
10   102 man         43.7 high bioactive diet 2021-04-06
# ℹ 110 more rows

Great! But, maybe you may have noticed something else here. If we want to join this data with the CGM data, we need to have a single row for each day of the study that the participant has data for. Right now, the participant_details data has only the date for the first day and for the last day. So if we join the CGM or sleep data, we would only get the data for those two dates. What we want is to have all the days included between the start and end date for each participant. The tidyr package has many wonderful functions that help tidy up data, including a function for this particular situation! The function is called complete() because it will fill in data where you want a sequence with the missing dates in that sequence. For our situation, that is filling in all the missing dates between the start and end date for each participant.

Since we want to complete the data for each participant for each intervention, we will need to group the data. We could select each column to group by, but we could also use the pick() function we used before to select all the columns except for the date column.

docs/learning.qmd

participant_details |>
  pivot_longer(ends_with("date"), names_to = NULL, values_to = "date") |>
  group_by(pick(-date))

# A tibble: 120 × 5
      id gender age_years intervention        date      
   <dbl> <chr>      <dbl> <chr>               <date>    
 1   101 woman       45.6 baseline            2021-03-15
 2   101 woman       45.6 baseline            2021-03-22
 3   101 woman       45.6 low bioactive diet  2021-03-23
 4   101 woman       45.6 low bioactive diet  2021-04-06
 5   101 woman       45.6 high bioactive diet 2021-05-11
 6   101 woman       45.6 high bioactive diet 2021-05-25
 7   102 man         43.7 baseline            2021-03-15
 8   102 man         43.7 baseline            2021-03-22
 9   102 man         43.7 high bioactive diet 2021-03-23
10   102 man         43.7 high bioactive diet 2021-04-06
# ℹ 110 more rows

We can now use complete() to fill in the dates. The complete() function is used like mutate() or summarise(). Unlike these other functions, you need to provide it with a column that exists, and you can’t create new columns. And you also need to provide it with the values that you want completed. In our case, we want each date from the start date to the end date, one for each day. We can use the seq() function to create a sequence of dates, from the min() to the max(). So we’d write it like:

docs/learning.qmd

participant_details |>
  pivot_longer(ends_with("date"), names_to = NULL, values_to = "date") |>
  group_by(pick(-date)) |>
  complete(
    date = seq(min(date), max(date), by = "1 day")
  )

# A tibble: 823 × 5
      id gender age_years intervention        date      
   <dbl> <chr>      <dbl> <chr>               <date>    
 1   101 woman       45.6 baseline            2021-03-15
 2   101 woman       45.6 baseline            2021-03-16
 3   101 woman       45.6 baseline            2021-03-17
 4   101 woman       45.6 baseline            2021-03-18
 5   101 woman       45.6 baseline            2021-03-19
 6   101 woman       45.6 baseline            2021-03-20
 7   101 woman       45.6 baseline            2021-03-21
 8   101 woman       45.6 baseline            2021-03-22
 9   101 woman       45.6 high bioactive diet 2021-05-11
10   101 woman       45.6 high bioactive diet 2021-05-12
# ℹ 813 more rows

We now have a dataset that we can join with the other datasets! Time to make it into a function 😁

11.5 Converting the date completion code into a function

Since we’ve used the tidyr package, let’s add it as a package dependency. Open the Console and run:

Console

usethis::use_package("tidyr")

Now, we can make the function. In the docs/learning.qmd file, go to the bottom of the file where we wrote the code to clean the participant details. At the bottom of the chunk, we’ll create a new function named clean_participant_details by assigning function() to it. We’ll only have one argument called data. Then we’ll cut and paste the cleaning code into the function. After that we will replace the participant_details variable with the data argument. Finally, we will assign the cleaned data to a new variable called cleaned and return() it at the end of the function.

Next we’ll go through and add the :: to the functions used, which are tidyr::, tidyselect::, and dplyr::. After that, we will make an Roxygen documentation with Ctrl-Shift-Alt-R or with the Palette (Ctrl-Shift-P, then type “roxygen comment”).

docs/learning.qmd

#' Convert the participant details data to long and clean it up.
#'
#' @param data The DIME participant details data.
#'
#' @returns A data frame.
#'
clean_participant_details <- function(data) {
  cleaned <- data |>
    tidyr::pivot_longer(
      tidyselect::ends_with("date"),
      names_to = NULL, values_to = "date"
    ) |>
    dplyr::group_by(dplyr::pick(-date)) |>
    tidyr::complete(
      date = seq(min(date), max(date), by = "1 day")
    )
  return(cleaned)
}

Run the function in the code chunk with Ctrl-Enter and then test it out by going to the Console and running:

Console

clean_participant_details(participant_details)

# A tibble: 823 × 5
      id gender age_years intervention        date      
   <dbl> <chr>      <dbl> <chr>               <date>    
 1   101 woman       45.6 baseline            2021-03-15
 2   101 woman       45.6 baseline            2021-03-16
 3   101 woman       45.6 baseline            2021-03-17
 4   101 woman       45.6 baseline            2021-03-18
 5   101 woman       45.6 baseline            2021-03-19
 6   101 woman       45.6 baseline            2021-03-20
 7   101 woman       45.6 baseline            2021-03-21
 8   101 woman       45.6 baseline            2021-03-22
 9   101 woman       45.6 high bioactive diet 2021-05-11
10   101 woman       45.6 high bioactive diet 2021-05-12
# ℹ 813 more rows

Nice! It works! Once it works as intended, cut and paste it over into R/functions.R. Then, go to the docs/learning.qmd file and in the setup code chunk, add a pipe to the end of import_dime() for the participant_details object into clean_participant_details(). It should look like this:

docs/learning.qmd

participant_details <- here("data-raw/dime/participant_details.csv") |>
  import_dime() |>
  clean_participant_details()

Then to check the reproducibility, render the document with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”) to make sure it all works. Woohoo 🎉

11.6 📖 Reading task: Pivoting data to a wider form

Teacher note

Like with the pivoting to long section, let them read through this section first and then go over it again to verbally explain it more, making use of the graphs to help illustrate what is happening.

Time: ~10 minutes.

Working in long form can be really useful when doing certain types of wrangling, like you’ve seen and read about above. The majority of the time, you probably will only ever use pivot_longer(). But, sometimes you need to pivot to a form that is a bit wider to do, for example, when doing certain types of joins or processing.

For us, the sleep data is in a very long form, where each hour of sleep has several rows for the different types of sleep. Joining the data in this form to the other datasets, while possible, won’t produce usable output. We want to convert it so there is only one row per hour of a participant’s day, rather than multiple rows per hour. For that, we need to convert to the wider form.

The arguments for pivot_wider() are very similar to those in pivot_longer(), except instead of names_to and values_to, they are called names_from and values_from. Like with many tidyverse functions, the first argument is the data and the other arguments are:

id_cols: This is optional as it will default to all column names. This argument tells pivot_wider() to use the given columns as the identifiers for when converting. Unlike pivot_longer() which doesn’t require some type of “key” or “id” column to convert to long form, the conversion to wide form requires some type of “key” or “id” column because pivot_wider() needs to know which rows belong with each other.
names_from: Similar to the pivot_longer(), this is the name of the column that has the values that will make up the new columns. Unlike with the names_to argument in pivot_longer() which takes a character string as input, the column name for names_from must be unquoted, because you are selecting a column that already exists in the dataset (unlike names_to which is the name of a column that you want to create but that doesn’t exist yet).
values_from: Same as names_from, this is the column name (that exists and must be given unquoted) for the values that will be in the new columns.

Figure 11.3 visually shows what’s happening when using pivot_wider().

A diagram showing the use of the pivot wider function, where a table with three columns, id, name, and value, is converted to a table with three columns, id, A and B. The values in the column name are now the names of the columns in the new table: A and B. The original id column still exists. — Figure 11.3: Pivot wider in tidyr, where a set of stacked “groups” in the data on the left are placed side-by-side as new columns in the output data on the right. Notice how the values in the column `N`, which is used in the `names_from` argument, are used as the names for the new columns `A` and `B` in the new data.

Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩

11.7 🧑‍💻 Exercise: Create a new function to pivot sleep to wider

Time: ~25 minutes.

This will a bit more of a challenging exercise since we haven’t used the pivot_wider() function before. But the way you use it is very similar to pivot_longer(). We’ve had you practice the process many times and this will be an exercise to really challenge what you’ve learned so far! Which is part of the learning journey 😁

For this exercise, you will create a new function that converts the sleep data into a wider form. To test it out, it should look like this and should output something similar:

docs/learning.qmd

sleep_data |>
  sleep_types_to_wider()

# A tibble: 575 × 10
      id date        hour seconds_deep seconds_light seconds_wake
   <int> <date>     <int>        <dbl>         <dbl>        <dbl>
 1   101 2021-05-21    23          390          2730          450
 2   101 2021-05-22     0         1200          1230           NA
 3   101 2021-05-22     1          420          2970           NA
 4   101 2021-05-22     2           NA          3870           NA
 5   101 2021-05-22     3           NA            NA           NA
 6   101 2021-05-22     4         1710          1200          360
 7   101 2021-05-22     5           NA          1680         1350
 8   101 2021-05-22    23          330          1710          420
 9   101 2021-05-23     0          720          2910           NA
10   101 2021-05-23     1          660           630          570
# ℹ 565 more rows
# ℹ 4 more variables: seconds_rem <dbl>, seconds_asleep <dbl>,
#   seconds_restless <dbl>, seconds_awake <dbl>

Once you’ve made this new function, move it into the clean_sleep() so it looks like:

R/functions.R

clean_sleep <- function(data) {
  cleaned <- data |>
    get_participant_id() |>
    dplyr::rename(datetime = date) |>
    prepare_dates(datetime) |>
    # The function used to sum summarise may be different for you.
    summarise_column(seconds, list(sum = sum)) |>
    # This is the new function you will add below.
    sleep_types_to_wider()
  return(cleaned)
}

Then, you go to your setup code chunk in docs/learning.qmd and run all the code in it, especially:

docs/learning.qmd

sleep_data <- here("data-raw/dime/sleep/") |>
  import_csv_files() |>
  clean_sleep()

It should like this if you run it in the Console:

Console

sleep_data

# A tibble: 575 × 10
      id date        hour seconds_deep seconds_light seconds_wake
   <int> <date>     <int>        <dbl>         <dbl>        <dbl>
 1   101 2021-05-21    23          390          2730          450
 2   101 2021-05-22     0         1200          1230           NA
 3   101 2021-05-22     1          420          2970           NA
 4   101 2021-05-22     2           NA          3870           NA
 5   101 2021-05-22     3           NA            NA           NA
 6   101 2021-05-22     4         1710          1200          360
 7   101 2021-05-22     5           NA          1680         1350
 8   101 2021-05-22    23          330          1710          420
 9   101 2021-05-23     0          720          2910           NA
10   101 2021-05-23     1          660           630          570
# ℹ 565 more rows
# ℹ 4 more variables: seconds_rem <dbl>, seconds_asleep <dbl>,
#   seconds_restless <dbl>, seconds_awake <dbl>

Look into the documentation for pivot_wider() to get a better idea for how it works.
Create a new header called ## Pivot wider in the docs/learning.qmd file, then create a new code chunk below it with Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type “new chunk”).
Before making the function, prototype and test out code by piping the sleep_data into pivot_wider(). Try using names_from and values_from to get the right output. You can use the argument names_prefix to add text to the start of the new names columns.

After you’ve written some code that works, use the same process you’ve used before to make a function:

Create a new function called sleep_types_to_wider and assign function() to it.
Include data as an argument in function().
Replace the sleep_data variable with the data argument.
Assign it to a new variable called wider and return() it at the end of the function.
Explicitly indicate the package the function uses by using tidyr::.
Create Roxygen documentation for the function with Ctrl-Shift-Alt-R or with the Palette (Ctrl-Shift-P, then type “roxygen comment”).
Test that the code works by running the function, using it like sleep_types_to_wider(sleep_data).
When it works, cut and paste it over into R/functions.R.
Include the new function in the clean_sleep() function in R/functions.R so that it looks like the above.
Style both R/functions.R and docs/learning.qmd with the Palette (Ctrl-Shift-P, then type “style file”).
Render the document with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”) to make sure it all works.
And lastly, commit your changes to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”) and push it to your GitHub.

Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩

11.8 Key takeaways

Teacher note

Quickly cover this and get them to do the survey. The previous exercise will likely be challenging for them, so we will not do a discussion activity this time.

Data is usually structured to varying degrees as wide or long format. Thinking of your data in terms of shifting it around in structure is a powerful and useful skill to have when doing data wrangling and analysis.
When needing to convert to the longer form, use pivot_longer() from the tidyr package.
If you need to convert to the wider form, use the tidyr function pivot_wider().
Seek out and test different functions within the tidyr package, it has many very helpful functions for tidying up data such as the complete() function.

11.9 Code used in session

This lists some, but not all, of the code used in the section. Some code is incorporated into Markdown content, so is harder to automatically list here in a code chunk. The code below also includes the code from the exercises.

participant_details <- here("data-raw/dime/participant_details.csv") |>
  import_dime()
participant_details
participant_details |>
  pivot_longer()
participant_details |>
  pivot_longer(ends_with("date"))
participant_details |>
  pivot_longer(ends_with("date"), names_to = NULL, values_to = "date")
participant_details |>
  pivot_longer(ends_with("date"), names_to = NULL, values_to = "date") |>
  group_by(pick(-date))
participant_details |>
  pivot_longer(ends_with("date"), names_to = NULL, values_to = "date") |>
  group_by(pick(-date)) |>
  complete(
    date = seq(min(date), max(date), by = "1 day")
  )
usethis::use_package("tidyr")
#' Convert the participant details data to long and clean it up.
#'
#' @param data The DIME participant details data.
#'
#' @returns A data frame.
#'
clean_participant_details <- function(data) {
  cleaned <- data |>
    tidyr::pivot_longer(
      tidyselect::ends_with("date"),
      names_to = NULL, values_to = "date"
    ) |>
    dplyr::group_by(dplyr::pick(-date)) |>
    tidyr::complete(
      date = seq(min(date), max(date), by = "1 day")
    )
  return(cleaned)
}
clean_participant_details(participant_details)
participant_details <- here("data-raw/dime/participant_details.csv") |>
  import_dime() |>
  clean_participant_details()
#' Convert the sleep types to wide format.
#'
#' @param data The cleaned DIME sleep data.
#'
#' @returns A data frame.
#'
sleep_types_to_wider <- function(data) {
  wider <- data |>
    tidyr::pivot_wider(
      names_from = sleep_type,
      names_prefix = "seconds_",
      values_from = seconds_sum
    )
  return(wider)
}

11.1 Learning objectives

11.2 💬 Discussion activity: Recall and share what you’ve learned so far

11.3 📖 Reading task: Pivoting data for easier wrangling

11.4 Pivot to long form

11.5 Converting the date completion code into a function

11.6 📖 Reading task: Pivoting data to a wider form

11.7 🧑‍💻 Exercise: Create a new function to pivot sleep to wider

11.8 Key takeaways

11.9 Code used in session