In the HR-type datasets, we now have the participant ID that matches the participant ID in the survey results data. We also have minutes as the unit of time for the datetime column that matches the unit of time in the time columns of the survey results data. But, the HR data is in a longer format, where there are multiple rows for each participant, while the survey result data is in wider format but also has multiple rows per participant. Right now, we are not able to (meaningfully) join these datasets together because of this incompatibility in the format of the data. So in this session we will go over how to pivot data between long(er) and wide(r) formats.
21.1 Learning objectives
Describe the concept of “pivoting” data to convert data in a “long” format to a “wide” format and vice versa.
Identify situations where it is more appropriate that data is in the long or wide format.
Apply the pivot_longer() and pivot_wider() functions from the tidyr package to pivot data.
Recognise and appreciate the impact that seeking out and using existing functions to solve problems has on how you get things done, instead of writing custom code to do the things you want. For example, looking through the functions found in packages like tidyr.
21.2📖 Reading task: Pivoting data for easier wrangling
NoteTeacher note
Briefly walk through it again after they’ve read it and explain it a bit more, making use of the tables and graphs. Pivoting is often a difficult concept to grasp, to it’s good to go over it a few times. Emphasise the difficulty of it, to remind them it is ok to not get it.
Time: ~6 minutes.
Rarely is our data in the exact format we need or want it to be in to effectively process, analyse, or visualize it. Often we’ll need to do some heavy reshaping. One very powerful way to reshape data is with “pivoting”. Pivoting is when you take a dataset’s rows and convert them to columns or convert columns to rows, and in many different combinations of this.
The tidyr package within the tidyverse contains two wonderful functions for pivoting: pivot_longer() and pivot_wider(). There is a well written documentation on pivoting in the tidyr website that can explain more about it that you can read at another time.
Most use cases of pivoting are to convert to the longer form, which you can do with pivot_longer() and is what we will cover first in this session. There are many reasons pivoting longer is the most common pivot direction.
An example of why you often need to pivot longer is because data entry is easier to do in the wider format. Entering data, especially through a spreadsheet tool like Excel, is so much easier in the wider format and more time efficient than entering data in long form. Consider the scenario where you are entering data from a participant for glucose values that were measured every 30 minutes. So you might enter data like this:
This is a wide dataset that is useful for data entry.
person_id
glucose_0
glucose_30
glucose_60
1
5.6
7.8
4.5
2
4.7
9.5
5.3
3
5.1
10.2
4.2
If you are working with a spreadsheet, you might enter the glucose value for one participant at one time point, then hit the tab key to move to the next cell and add the next glucose value. Even outside of a spreadsheet, the way data is structured from survey software is often also in the wide format. However, when it comes time to analyze the data, this wide format is very inefficient and difficult to computationally and statistically work with. So, we do data entry in wide format and use functions like pivot_longer() to get the data ready for analysis. See Figure 21.1 below that visually shows what happens when you pivot from wide to long.
Figure 21.1: Pivot longer in tidyr. New columns are called name and value. Notice how the values in A and B columns are stacked on top of each other in the newly created V column.
If you had, for instance, an ID column for each participant, the pivoting would look like what is shown in Figure 21.2.
Figure 21.2: Pivot longer in tidyr, excluding an id column. New columns are called name and value, as well as the old id column. Notice how, unlike the previous image, the id column is excluded when pivoting into the data on the right.
Pivoting is a conceptually challenging thing to grasp, so don’t be discouraged if you don’t understand how it works yet or of the ways you could use it. As you practice using it, you will slowly begin to understand it. With pivot_longer(), the first argument is the data itself. The other arguments are:
cols: The columns to convert to long form. The input is a vector made using c() that contains the column names, like you would use in select() (e.g. you can use the tidyselect helpers like starts_with(), or - minus to exclude).
names_to: Optional, the default is name. If given, it will be the name of the newly created column (as a quoted character) that contains the original column names.
values_to: Optional, the default is value. Like names_to, sets the name of the new columns.
Both pivot_longer() and its opposite pivot_wider(), which we will cover later in the session, are incredibly powerful functions. We can’t show close to everything it can do in this workshop, but if you want to learn more, read up on the documentation for it.
CautionSticky/hat up!
When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒🎩
21.3 Pivot to long form
Note🧑🏫 Teacher note
Emphasise that there are a lot of really powerful ways you can use pivot_longer() and to point out looking to the documentation to see some examples of that.
There are several code chunks below. You should not create a new code chunk for each one. Instead, write the code within a single chunk while teaching, unless the text explicitly says otherwise.
Over the last few sessions, we have prepared our HR-type data. But it is in long format, which in this case is in a tidy format (where each value in a row and column represents a distinct observation at a given time). And while our survey results data is also in a tidy format, it is in a wider format and doesn’t allow us to join with the HR-type data.
While everyone read in the survey-results.csv.gz data in a previous session’s exercise, let’s make sure we’re all together and read it in again. So, go to the bottom of the docs/learning.qmd file and create a new header called ## Pivot longer. Below that, create a new code chunk with Ctrl-Alt-ICtrl-Alt-I or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “new chunk”). Then, in that code chunk, read in the survey results data again with and run the code with Ctrl-EnterCtrl-Enter:
Let’s examine the data and make some observations about it. We see that each participant has potentially multiple rows of data, one row for times when they were surveyed on their stress levels. There are also multiple date and time columns: start_time, end_time, duration, and date. After that are columns for the responses to the survey questions for that survey time period.
What we want is to have a single column for the ID and a single column for a combined datetime in a way that matches the format of the HR-type data. That means we have to pivot the date and time columns into a long format and then expand them so that we have a row for each minute within the start and end time. Before we can pivot though, we need to process the data a bit. First, the date column is actually a character column. So we need to convert that to an actual date. Second, because we want a datetime column, we will need to create two new columns for start_datetime and end_datetime. Let’s fix the date first.
docs/learning.qmd
survey_data|>mutate(date =as_date(date))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `date = as_date(date)`.
Caused by warning:
! All formats failed to parse. No formats found.
# A tibble: 100 × 20
id start_time end_time duration date stress_level
<chr> <time> <time> <time> <date> <dbl>
1 5C 08:00 09:00 01:00 NA 1
2 5C 17:31 17:58 00:27 NA 1
3 E4 15:32 15:37 00:05 NA 2
4 E4 14:05 14:11 00:06 NA 2
5 7A 13:52 14:03 00:11 NA 2
6 94 08:49 08:56 00:07 NA NA
7 94 09:29 09:36 00:07 NA NA
8 E4 08:15 08:24 00:09 NA 2
9 E4 09:32 09:38 00:06 NA 2
10 7A 09:13 10:13 01:00 NA 2
# ℹ 90 more rows
# ℹ 14 more variables: covid_related <dbl>,
# treating_a_covid_patient <dbl>, patient_in_crisis <dbl>,
# patient_or_patient_s_family <dbl>, doctors_or_colleagues <dbl>,
# administration_lab_pharmacy_radiology_or_other_ancilliary_services <dbl>,
# increased_workload <dbl>, technology_related_stress <dbl>,
# lack_of_supplies <dbl>, documentation <dbl>, …
Hmm, that didn’t work as expected, now the dates are all NA! Dates are really tricky to work with and this is an example of that. The reason it didn’t work is because the date column is in the format of month/day/year, but the as_date() function is expecting it to be in the format of year-month-day. Thankfully, the lubridate package has the mdy() function (for month-day-year) that can handle this format. So we can use that instead.
Nice! Ok, let’s create those datetime columns now. We can use the as_datetime() function to convert the date and time columns into datetimes. Within as_datetime(), we can use paste() to combine the date and time columns together into a single character string, which can then be converted to a datetime. Let’s do that for both the start and end datetime columns. We’ll use the .before argument to place the new columns before the start_time column, mostly to make it easier to see them.
Great, we’ve combined them together. Now we don’t need the date, start_time, and end_time columns anymore since we combined them together, so we can drop those. And since the start and end datetimes provide the same information as duration, we can drop that. But we still have some unique way of identify each observation of participant at a given time. So let’s add an explicit copy of start_datetime and call it datetime_id, to ensure we always have an explicit unique set of identifiers for later processing. Let’s also assign the data to a new variable called survey_tidy so we can work on this data frame for the rest of the session.
Great, we have the datetimes in the format we want and no columns that we don’t need. But now, we need to get the data into the format to match the HR-type data. We’ll need at least a column for the participant ID and a column for the datetime, which includes both start and end time. Since we want to have a common dataset to join all data together, let’s select only the columns we need for that, which are the id, datetime_id, start_datetime, and end_datetime columns. It also makes it easier to see what we’re pivoting.
Error in `pivot_longer()`:
! `cols` must select at least one column.
This give an error because we haven’t told it which columns to use yet. We want to pivot only the date columns. We can do this by using c() for each of the columns we want to pivot.
It seems to have worked. We see that there are two new columns created called name and value. These aren’t very descriptive names though. If we look at the help documentation for pivot_longer(), we can see that it has an argument called names_to and values_to that allow use to rename these new columns to something else. If we look at the name column, it contains the original column names start_datetime and end_datetime. With the id, the value (which is the actual datetime value), and the duration column, we have a unique “key” for each observation of a participant at a given date and time. So we don’t actually need the name column. With names_to we can set it to NULL so it doesn’t create it. Let’s do that. We’ll have to remove the name value in the relocate() function too.
Better, but let’s rename the value column to collection_datetime, to match the other HR-type datasets. We can do that with the values_to argument in pivot_longer().
We’re getting there. But our HR-type data has rows in minutes, but here we only have the start and end times. So now we need to convert the collection_datetime column so that it has rows per minute, from the start time to the end time. The tidyr package has many wonderful functions that help tidy up data, including a function for this particular situation. The complete() function will “complete” (fill in) the columns you give it, converting implicit missing values to explicit missing values. This is exactly what we’re “missing” here: the rows for each minute for each participant. Like mutate() and summarise(), the complete() function works by assigning values to columns. But unlike mutate() and summarise(), you can’t create new columns with complete(). You have to apply it to an existing column, which is the collection_datetime column.
We also need to give this function the actual values that should be present in the specific column, which is the sequence of datetimes from the start to the end time by 1 minute. In this case, we can use the seq() function from the start time (min()) to the end time (max()), with a step of 60 seconds (since seq() interprets a datetime as seconds). Let’s try it out:
# A tibble: 104,528 × 3
collection_datetime id datetime_id
<dttm> <chr> <dttm>
1 2020-04-14 17:31:00 5C 2020-04-14 17:31:00
2 2020-04-14 17:32:00 <NA> NA
3 2020-04-14 17:33:00 <NA> NA
4 2020-04-14 17:34:00 <NA> NA
5 2020-04-14 17:35:00 <NA> NA
6 2020-04-14 17:36:00 <NA> NA
7 2020-04-14 17:37:00 <NA> NA
8 2020-04-14 17:38:00 <NA> NA
9 2020-04-14 17:39:00 <NA> NA
10 2020-04-14 17:40:00 <NA> NA
# ℹ 104,518 more rows
Ok, so it made a row for each minute, but now we have a lot of other missing data. And it seems that it took the min() and max() of all the datetime values, rather than the min and max for each participant. We want to fill in the missing datetimes for each participant for each period of time. So we can use group_by() to group the data for everything except the collection_datetime column, which is the column we want to complete. Rather than writing out all the columns to group by, we can use the pick() function from dplyr to select all the columns except for the collection_datetime column. We’ll also ungroup() afterwards:
Nice! We now have our dataset in a format that allows us to join it with the other datasets, including both the original survey results data and the HR-type data. Time to make it into a function and include within the docs/cleaning.qmd file.
21.4🧑💻 Exercise: Convert into two new functions
Time: ~20 minutes.
Time to convert the code into functions. There were two distinct “steps” we did to the survey data. The first was tidying up the dates and times to prepare it for pivoting and the second was pivoting into a longer format. So convert those into two separate functions. They should be used as shown below:
You’ll eventually take those functions, put them into R/functions.R and use them in the docs/cleaning.qmd file to process the survey data. Start with the date-tidying code:
As you’ve done throughout this workshop, convert the code into a function called tidy_survey_dates, with one argument data inside function().
Copy and paste the code we just wrote into the body of the function. Replace the dataset name survey_data with the data argument and assign the output to tidied.
Explicitly link each function you use inside the function to its packages with ::. There should be three packages used within the function.
There’s one package you haven’t added to the dependencies. Add it with usethis::use_package().
Add Roxygen documentation with Ctrl-Shift-Alt-RCtrl-Shift-Alt-R or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “roxygen comment”).
Run styler in your docs/learning.qmd with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “style file”).
Run the function with Ctrl-EnterCtrl-Enter and test it out in the Console with tidy_survey_dates(survey_data). It should give you the same output as before.
Once it works fine, cut and paste the function into R/functions.R.
Add and commit the changes to the Git history with Ctrl-Alt-MCtrl-Alt-M or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “commit”). Then push to GitHub.
Now do the same thing for the pivoting code:
Convert the code into a function called survey_to_long(), with one argument data inside function().
Copy and paste the code we just wrote into the body of the function. Replace the dataset name survey_tidy with the data argument and assign the output to longer.
Link the two packages used to the functions with ::.
Add Roxygen documentation with Ctrl-Shift-Alt-RCtrl-Shift-Alt-R or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “roxygen comment”).
Run styler in your docs/learning.qmd with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “style file”).
Run the function with Ctrl-EnterCtrl-Enter and test it out in the Console with survey_to_long(survey_tidy). It should give you
Cut and paste the function into R/functions.R.
Add and commit the changes to the Git history with Ctrl-Alt-MCtrl-Alt-M or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “commit”). Then push to GitHub.
In the next exercise, you’ll use these new functions in the docs/cleaning.qmd file to process the survey data.
#' Tidy up the dates in the survey results#'#' @param data A data frame of the survey results data.#'#' @return A data frame.tidy_survey_dates<-function(data){tidied<-data|>dplyr::mutate( date =lubridate::mdy(date), start_datetime =lubridate::as_datetime(paste(date, start_time)), end_datetime =lubridate::as_datetime(paste(date, end_time)), datetime_id =start_datetime, .before =start_time)|>dplyr::select(-c(date, start_time, end_time, duration))return(tidied)}survey_data|>tidy_survey_dates()
#' Pivot survey data to longer format, with only IDs and datetimes#'#' @param data A data frame of the data from `tidy_survey_dates()`.#'#' @returns A data frame.survey_to_long<-function(data){longer<-data|>dplyr::select(id, datetime_id, start_datetime, end_datetime)|>tidyr::pivot_longer(c(start_datetime, end_datetime), names_to =NULL, values_to ="collection_datetime")|>dplyr::group_by(dplyr::pick(-collection_datetime))|>tidyr::complete( collection_datetime =seq(min(collection_datetime),max(collection_datetime), by =60))|>dplyr::ungroup()return(longer)}survey_data|>tidy_survey_dates()|>survey_to_long()
CautionSticky/hat up!
When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒🎩
21.5🧑💻 Exercise: Add survey processing to docs/cleaning.qmd
Note🧑🏫 Teacher note
Once everyone has finished this exercise, walkthrough it to make sure everyone is aligned and briefly explain again why we use the docs/cleaning.qmd file (because it’s the more “central” file to clean and prepare the data, while the docs/learning.qmd file is more to use as a sandbox).
Time: ~10 minutes.
Now that you’ve tidying up the next piece, time to build up the docs/cleaning.qmd more. So, open up the docs/cleaning.qmd file (you can open files quickly with Ctrl-.Ctrl-.).
On the bottom of docs/cleaning.qmd, create a new header called ## Process survey data. Create a new code chunk below that with Ctrl-Alt-ICtrl-Alt-I or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “new chunk”).
In that code chunk, pipe here::here() to the survey-results.csv.gz file into read(), then pipe that into tidy_survey_dates(). Assign the output to survey_tidy.
Then below that code, pipe survey_tidy into survey_to_long(), and assign that to survey_long.
Style the code with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “style file”).
Render the Quarto document with Ctrl-Shift-KCtrl-Shift-K or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “render”) to confirm that everything works.
Once it works fine, add and commit the changes to the Git history with Ctrl-Alt-MCtrl-Alt-M or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “commit”). Then push to GitHub.
When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒🎩
21.6🧑💻 Extra exercise: Simplify code with {tidyselect} helpers
Time: ~5 minutes.
If you’ve finished the previous exercises early and have some time, try simplify the code in the survey_to_long() function by using tidyselect helpers rather than writing out the column names with c(). To make it more challenging, we won’t give anymore hints nor solution code.
CautionSticky/hat up!
When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒🎩
21.7📖 Reading task: Pivoting data to a wider form
NoteTeacher note
Like with the pivoting to long section, let them read through this section first and then go over it again to verbally explain it more, making use of the graphs to help illustrate what is happening.
Time: ~10 minutes.
While we won’t use this function in this workshop, it’s still useful to read of pivot_longer()’s opposite function, pivot_wider(), which converts to the wider format. Working in long form can be really useful when doing certain types of wrangling, like you’ve read about above and actually used. The majority of the time, you probably will only ever use pivot_longer() and this is part of the reason why we don’t cover it. But, sometimes you need to pivot to a form that is a bit wider to do, for example, when doing certain types of joins or processing.
The arguments for pivot_wider() are very similar to those in pivot_longer(), except instead of names_to and values_to, they are called names_from and values_from. Like with many tidyverse functions, the first argument is the data and the other arguments are:
id_cols: This is optional as it will default to all column names. This argument tells pivot_wider() to use the given columns as the identifiers for when converting. Unlike pivot_longer() which doesn’t require some type of “key” or “id” column to convert to long form, the conversion to wide form requires some type of “key” or “id” column because pivot_wider() needs to know which rows belong with each other.
names_from: Similar to the pivot_longer(), this is the name of the column that has the values that will make up the new columns. Unlike with the names_to argument in pivot_longer() which takes a character string as input, the column name for names_from must be unquoted, because you are selecting a column that already exists in the dataset (unlike names_to which is the name of a column that you want to create but that doesn’t exist yet).
values_from: Same as names_from, this is the column name (that exists and must be given unquoted) for the values that will be in the new columns.
Figure 21.3 visually shows what’s happening when using pivot_wider().
Figure 21.3: Pivot wider in tidyr, where a set of stacked “groups” in the data on the left are placed side-by-side as new columns in the output data on the right. Notice how the values in the column N, which is used in the names_from argument, are used as the names for the new columns A and B in the new data.
CautionSticky/hat up!
When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒🎩
21.8 Key takeaways
NoteTeacher note
Quickly cover this and get them to do the survey. The previous exercise will likely be challenging for them, so we will not do a discussion activity this time.
Data is usually structured to varying degrees as wide or long format. Thinking of your data in terms of shifting it around in structure is a powerful and useful skill to have when doing data wrangling and analysis.
When needing to convert to the longer form, use pivot_longer() from the tidyr package.
If you need to convert to the wider form, use the tidyr function pivot_wider().
Seek out and test different functions within the tidyr package, it has many very helpful functions for tidying up data such as the complete() function.
21.9 Code used in session
This lists some, but not all, of the code used in the section. Some code is incorporated into Markdown content, so is harder to automatically list here in a code chunk. The code below also includes the code from the exercises.