10 Quickly re-arranging data with pivots
Here we will continue using the Workflow block as we cover the fourth block, “Work with final data” in Figure 9.1.
Figure 10.1: Section of the overall workflow we will be covering.
And your folder and file structure should look like this (use fs::dir_tree(recurse = 2) if you want to check using R):
LearnR3
├── data/
│ ├── mmash.rda
│ └── README.md
├── data-raw/
│ ├── README.md
│ ├── mmash-data.zip
│ ├── mmash/
│ │ ├── user_1
│ │ ├── ...
│ │ └── user_22
│ └── mmash.R
├── doc/
│ ├── README.md
│ └── lesson.Rmd
├── R/
│ ├── functions.R
│ └── README.md
├── .Rbuildignore
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
└── README.md
For instructors: Click for details.
Briefly go over the bigger picture (found in the introduction section) and remind everyone the ‘what’ and ‘why’ of what we are doing.
10.1 Learning objectives
- Using the concept of “pivoting” to arrange data from long to wide and vice versa.
10.2 Setup for the analysis in R Markdown
We now have a working dataset to start doing some simple analyses on in the R Markdown document. A recommended workflow with R Markdown is to often “Knit” it and make sure your analysis is reproducible (while on your computer). We already cleaned it up from the previous session.
We will now add the load() code right below the source() function in the setup code chunk:
As we write more R code and do some simple analyses of the data, we are going to be knitting fairly often (depending on how long the analysis takes of course). The main reason for this is to ensure that whatever you are writing and coding will at least be reproducible on your computer, since R Markdown is designed to ensure the document is reproducible.
For this specific workflow and for checking reproducibility, you should output to HTML rather than to a Word document. While you can create a Word document by changing the output: html_document to output: word_document at the top in the YAML header, you’d only do this when you need to submit to a journal or need to email to co-authors for review. The reason is simple: After you generate the Word document from R Markdown, the Word file opens up and consequently Word locks the file from further edits. What that means is that every time you generate the Word document, you have to close it before you can generate it again, otherwise knitting will fail. This can get annoying very quickly (trust me), since you don’t always remember to close the Word document. If you output to HTML, this won’t be a problem.
10.3 Re-arranging data for easier summarizing
For instructors: Click for details.
Let them read through this section and then walk through it again and explain it a bit more, making use of the tables and graphs. Doing both reading and listening again will help reinforce the concept of pivoting, which is usually quite difficult to grasp for those new to it.
Take 6 min to read over the sections until it says to stop, and then we’ll
go over it again.
Now that we have the final dataset to work with, we want to explore it a bit with some simple descriptive statistics. One extremely useful and powerful tool for summarizing data is “pivoting” your data. Pivoting is when you convert data between longer forms (more rows) and wider forms (more columns). The tidyr package within tidyverse contains two wonderful functions for pivoting: pivot_longer() and pivot_wider(). There is well-written documentation on pivoting on the tidyr website that can explain more about it.
The first thing we’ll use, and probably the more commonly used in general, is pivot_longer(). This function is commonly used because entering data in the wide form is easier and more time efficient than entering data in long form. For instance, if you were measuring glucose values over time in participants, you might enter the data like this:
person_id | glucose_0 | glucose_30 | glucose_60 |
---|---|---|---|
1 | 5.6 | 7.8 | 4.5 |
2 | 4.7 | 9.5 | 5.3 |
3 | 5.1 | 10.2 | 4.2 |
However, when it comes time to analyze the data, this wide form is very inefficient and difficult to work with computationally and statistically. So, we do data entry in wide form and use functions like pivot_longer() to get the data ready for analysis. Figure 10.2 visually shows what happens when you pivot from wide to long.

Figure 10.2: Pivot longer in tidyr. New columns are called ‘name’ and ‘value’.
If you had, for instance, an ID column for each participant, the pivoting would look like what is shown in Figure 10.3.

Figure 10.3: Pivot longer in tidyr, excluding an ‘id’ column. New columns are called ‘name’ and ‘value’, as well as the old ‘id’ column.
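To make the figures a bit more concrete, here is a minimal sketch that applies pivot_longer() to the glucose table shown earlier. The object names glucose_wide and glucose_long are made up for this illustration and are not part of the course project.

```r
library(tidyr)
library(tibble)

# Hypothetical data, copied from the wide glucose table above.
glucose_wide <- tribble(
  ~person_id, ~glucose_0, ~glucose_30, ~glucose_60,
  1, 5.6, 7.8, 4.5,
  2, 4.7, 9.5, 5.3,
  3, 5.1, 10.2, 4.2
)

# Pivot every column except person_id. Since no names are given,
# the new columns get the default names `name` and `value`.
glucose_long <- pivot_longer(glucose_wide, -person_id)
glucose_long
```

Each participant now takes up three rows (one per glucose measurement) instead of one row with three glucose columns.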
Pivoting is a conceptually challenging thing to grasp, so don’t be disheartened if you can’t understand how it works yet. As you practice using it, you will understand it. With pivot_longer(), the first argument is the data itself. The other arguments are:
- cols: The columns to convert to long form. The input is a vector made using c() that contains the column names, like you would use in select() (e.g. you can use the select_helpers like starts_with(), or - minus to exclude).
- names_to: Optional, the default is name. If provided, it will be the name of the newly created column (as a quoted character) that contains the original column names.
- values_to: Optional, the default is value. Like names_to, sets the name of the newly created column (as a quoted character), in this case the one that contains the values.
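As a rough sketch of how these three arguments fit together, the glucose example could be pivoted like this. The column names minutes and glucose are arbitrary choices for this illustration, and names_prefix is an extra pivot_longer() argument (not covered in this session) that strips the given text from the start of the old column names.

```r
library(tidyr)
library(tibble)

# Hypothetical data, copied from the wide glucose table above.
glucose_wide <- tribble(
  ~person_id, ~glucose_0, ~glucose_30, ~glucose_60,
  1, 5.6, 7.8, 4.5,
  2, 4.7, 9.5, 5.3,
  3, 5.1, 10.2, 4.2
)

glucose_long <- pivot_longer(
  glucose_wide,
  # Select the columns to pivot, like you would in select().
  cols = starts_with("glucose"),
  # Quoted name for the new column holding the old column names.
  names_to = "minutes",
  # Quoted name for the new column holding the values.
  values_to = "glucose",
  # Drop the "glucose_" prefix so only "0", "30", "60" remain.
  names_prefix = "glucose_"
)
glucose_long
```

Note that the minutes column is still stored as character here, since it was created from column names.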
pivot_longer() and its opposite pivot_wider(), which we will cover later in the session, are both incredibly powerful functions. We can’t show close to everything they can do in this course, but if you want to learn more, read up on their documentation.
Ok, stop reading at this point and we will go over pivoting to long again.
Let’s try this out with mmash. In your doc/lesson.Rmd file, create a new header called ## Pivot longer and create a new code chunk below that. Now we can start typing in our code:
mmash %>%
# pivot every column
pivot_longer(everything())
#> Error in `pivot_longer_spec()`:
#> ! Can't combine `gender` <character> and `weight` <double>.
This gives us an error because we are mixing data types. We can’t have character data and number data in the same column. Let’s pivot only numbers.
mmash %>%
pivot_longer(where(is.numeric))
#> # A tibble: 1,740 × 5
#> gender user_id samples name value
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 M user_1 before sleep weight 6.5 e+1
#> 2 M user_1 before sleep height 1.69e+2
#> 3 M user_1 before sleep age 2.9 e+1
#> 4 M user_1 before sleep cortisol_norm 3.41e-2
#> 5 M user_1 before sleep melatonin_norm 1.74e-8
#> 6 M user_1 before sleep day 1 e+0
#> 7 M user_1 before sleep ibi_s_mean 6.66e-1
#> 8 M user_1 before sleep ibi_s_sd 1.64e-1
#> 9 M user_1 before sleep hr_mean 9.06e+1
#> 10 M user_1 before sleep hr_sd 1.30e+1
#> # … with 1,730 more rows
Nice! But not super useful. We can exclude specific columns from pivoting with - before the column name, for instance with user_id and day. Let’s drop the samples column before pivoting since day gives us the same information:
mmash %>%
select(-samples) %>%
pivot_longer(c(-user_id, -day, -gender))
#> # A tibble: 1,566 × 5
#> gender user_id day name value
#> <chr> <chr> <dbl> <chr> <dbl>
#> 1 M user_1 1 weight 6.5 e+1
#> 2 M user_1 1 height 1.69e+2
#> 3 M user_1 1 age 2.9 e+1
#> 4 M user_1 1 cortisol_norm 3.41e-2
#> 5 M user_1 1 melatonin_norm 1.74e-8
#> 6 M user_1 1 ibi_s_mean 6.66e-1
#> 7 M user_1 1 ibi_s_sd 1.64e-1
#> 8 M user_1 1 hr_mean 9.06e+1
#> 9 M user_1 1 hr_sd 1.30e+1
#> 10 M user_1 2 weight 6.5 e+1
#> # … with 1,556 more rows
10.4 Exercise: Brainstorm and discuss other ways of using pivots
Time: 10 min
As a group, brainstorm and discuss as many ways as you can on how pivoting longer or wider might enhance using the split-apply-combine technique. Groups will briefly share what they’ve come up with before moving on to the next exercise.
10.5 Exercise: Summarise your data after pivoting
Time: 15 min
Using the group_by() and summarise() functions we learned in section 8.8, summarise the data after it has been pivoted with pivot_longer(). Complete these tasks starting from this code.
mmash %>%
  select(-samples) %>%
  pivot_longer(c(-user_id, -day, -gender)) %>%
  ___
- Continuing the %>% from pivot_longer(), use group_by() to group the data by gender, day, and name (the long form column produced from pivot_longer()).
- After grouping with group_by(), use summarise() and across() on the value column and find the mean and standard deviation (put them into a named list like we did previously). Don’t forget to use na.rm = TRUE to exclude missing values.
- Stop the grouping effect with ungroup().
- Knit the R Markdown document into HTML (Ctrl-Shift-K or the “Knit” button).
- Open up the Git interface and add and commit the changes to doc/lesson.Rmd.
Click for the (possible) solution. Click only if you are really struggling or you are out of time for the exercise.
10.6 Pivot data to wider form
For instructors: Click for details.
Like with the pivoting to long section, let them read through this section first and then go over it again to verbally explain it more, making use of the graphs to help illustrate what is happening. Doing both reading and listening will help reinforce the concepts.
Take 6 min to read over the sections until it says to stop, and then we’ll go over it again.
After using pivot_longer() and summarising the data, the output looks nice, but it could be better. Right now it is in a pretty long form, but for showing as a table, having columns for either gender or day would make it easier to compare the mean and SD values we obtain. This is where we can use pivot_wider() to get the data wider rather than long.
The arguments for pivot_wider() are very similar to those in pivot_longer(), except instead of names_to and values_to, they are called names_from and values_from. Like with many R functions, the first argument is the data and the other arguments are:
- id_cols: This is optional as it will default to all columns that are not used in names_from and values_from. This argument tells pivot_wider() to use the given columns as the identifiers when converting. Unlike pivot_longer(), which doesn’t require some type of “key” or “id” column to convert to long form, the conversion to wide form requires some type of “key” or “id” column because pivot_wider() needs to know which rows belong with each other.
- names_from: Similar to names_to in pivot_longer(), this is the name of the column that has the values that will make up the new column names. Unlike with the names_to argument in pivot_longer(), which takes a character string as input, the column name for names_from must be unquoted because you are selecting a column that already exists in the dataset.
- values_from: Same as names_from, this is the column name (that exists and must be given unquoted) for the values that will be in the new columns.
Figure 10.4 visually shows what’s happening when using pivot_wider().

Figure 10.4: Pivot wider in tidyr.
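As a small sketch of the reverse direction, the snippet below takes a long version of the glucose example and pivots it back to wide. The data and the names glucose_long, minutes, and glucose are assumptions made for this illustration; names_prefix is an optional pivot_wider() argument that adds text in front of the new column names.

```r
library(tidyr)
library(tibble)

# Hypothetical long-form glucose data for two participants.
glucose_long <- tribble(
  ~person_id, ~minutes, ~glucose,
  1, "0", 5.6,
  1, "30", 7.8,
  1, "60", 4.5,
  2, "0", 4.7,
  2, "30", 9.5,
  2, "60", 5.3
)

glucose_wide <- pivot_wider(
  glucose_long,
  # Unquoted: the existing column whose values become column names.
  names_from = minutes,
  # Unquoted: the existing column whose values fill the new columns.
  values_from = glucose,
  # Prepend "glucose_" so the new columns are not just "0", "30", "60".
  names_prefix = "glucose_"
)
glucose_wide
```

Here person_id is the only remaining column, so it acts as the identifier that tells pivot_wider() which rows belong together.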
Stop here and we will go over it again.
In our case, we want either gender or day as columns with the mean and SD values. Let’s use pivot_wider() on day to see differences between days.
mmash %>%
select(-samples) %>%
pivot_longer(c(-user_id, -day, -gender)) %>%
group_by(gender, day, name) %>%
summarise(across(
value,
list(mean = mean, sd = sd),
na.rm = TRUE
)) %>%
ungroup() %>%
pivot_wider(names_from = day)
#> Error in `loc_validate()`:
#> ! Can't subset columns past the end.
#> ℹ Location 10 doesn't exist.
#> ℹ There are only 5 columns.
Hmm, didn’t work. Nothing has been pivoted to wider. That’s because we are missing the values_from argument. Since we actually have the two value_mean and value_sd columns that have “values” in them, we need to tell pivot_wider() to use those two columns. Since values_from works similarly to select(), we can use starts_with() to select the columns starting with "value".
mmash %>%
select(-samples) %>%
pivot_longer(c(-user_id, -day, -gender)) %>%
group_by(gender, day, name) %>%
summarise(across(
value,
list(mean = mean, sd = sd),
na.rm = TRUE
)) %>%
ungroup() %>%
pivot_wider(names_from = day, values_from = starts_with("value"))
#> # A tibble: 18 × 10
#> gender name value_mean_1 value_mean_2 value_mean_NA `value_mean_-29`
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 M age 2.60e+1 2.60e+1 28 NA
#> 2 M cortisol_norm 2.81e-2 6.99e-2 NaN NA
#> 3 M height 1.80e+2 1.80e+2 175 NA
#> 4 M hr_mean 8.10e+1 6.69e+1 NaN NA
#> 5 M hr_sd 1.27e+1 1.55e+1 NaN NA
#> 6 M ibi_s_mean 7.60e-1 9.21e-1 NaN NA
#> 7 M ibi_s_sd 1.77e-1 2.78e-1 NaN NA
#> 8 M melatonin_no… 8.33e-9 6.96e-9 NaN NA
#> 9 M weight 7.53e+1 7.53e+1 70 NA
#> 10 <NA> age NaN NaN NA NaN
#> 11 <NA> cortisol_norm NaN NaN NA NaN
#> 12 <NA> height NaN NaN NA NaN
#> 13 <NA> hr_mean 7.96e+1 7.04e+1 NA 62.6
#> 14 <NA> hr_sd 1.09e+1 1.78e+1 NA 10.0
#> 15 <NA> ibi_s_mean 7.58e-1 8.56e-1 NA 0.962
#> 16 <NA> ibi_s_sd 1.14e-1 1.90e-1 NA 0.232
#> 17 <NA> melatonin_no… NaN NaN NA NaN
#> 18 <NA> weight NaN NaN NA NaN
#> # … with 4 more variables: value_sd_1 <dbl>, value_sd_2 <dbl>,
#> # value_sd_NA <dbl>, `value_sd_-29` <dbl>
Now we have a different problem. There are missing values in both the day and gender columns that, at least in this case, we don’t want pivoted. Shouldn’t they be removed when we include na.rm = TRUE in our code? The function of na.rm = TRUE is not to remove NA values from the data, but to instead tell R to not include values in mmash that are NA when calculating the mean and standard deviation. In this particular case, the columns value_mean_NA or value_mean_-29 have NA or NaN values because there are no other values in the data other than NA. Since we don’t actually care about missing days (or the random -29 day), we can remove missing values with the function called drop_na(). We also don’t care about missing gender values, so we’ll drop them as well (as the output above shows, the -29 day only has values in rows where gender is missing, so it disappears too). Add it in the pipe right before group_by().
mmash %>%
select(-samples) %>%
pivot_longer(c(-user_id, -day, -gender)) %>%
drop_na(day, gender) %>%
group_by(gender, day, name) %>%
summarise(across(
value,
list(mean = mean, sd = sd),
na.rm = TRUE
)) %>%
ungroup() %>%
pivot_wider(names_from = day, values_from = starts_with("value"))
#> # A tibble: 9 × 6
#> gender name value_mean_1 value_mean_2 value_sd_1 value_sd_2
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 M age 2.60e+1 2.60e+1 7.15e+0 7.15e+0
#> 2 M cortisol_norm 2.81e-2 6.99e-2 2.99e-2 5.22e-2
#> 3 M height 1.80e+2 1.80e+2 8.19e+0 8.19e+0
#> 4 M hr_mean 8.10e+1 6.69e+1 7.51e+0 6.63e+0
#> 5 M hr_sd 1.27e+1 1.55e+1 4.86e+0 6.89e+0
#> 6 M ibi_s_mean 7.60e-1 9.21e-1 7.82e-2 8.92e-2
#> 7 M ibi_s_sd 1.77e-1 2.78e-1 8.45e-2 1.23e-1
#> 8 M melatonin_norm 8.33e-9 6.96e-9 6.59e-9 6.28e-9
#> 9 M weight 7.53e+1 7.53e+1 1.28e+1 1.28e+1
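To see the difference between na.rm = TRUE and drop_na() in isolation, here is a small sketch on made-up data. The toy tibble and its values are purely hypothetical and only mimic the day and value columns.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical data with one missing value and one missing day.
toy <- tribble(
  ~day, ~value,
  1, 10,
  1, NA,
  2, 20,
  NA, 30
)

# na.rm = TRUE only skips the missing `value` when calculating the mean.
# The group where `day` is NA is still part of the output.
with_na_group <- toy %>%
  group_by(day) %>%
  summarise(value_mean = mean(value, na.rm = TRUE))

# drop_na(day) removes the rows with a missing `day` before grouping,
# so no NA group is created in the first place.
without_na_group <- toy %>%
  drop_na(day) %>%
  group_by(day) %>%
  summarise(value_mean = mean(value, na.rm = TRUE))

with_na_group
without_na_group
```

The first result has three rows (days 1, 2, and NA), while the second only has the two real days.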
10.7 Exercise: Convert this code into a function
Time: 15 min
Using the same workflow we’ve been doing throughout this course, convert the code we just wrote above into a function.
- Name the function tidy_summarise_by_day.
- Create one argument called data. Create a new variable inside the function called daily_summary and put it in return() so the function outputs it.
- Test that the function works.
- Add Roxygen documentation and use explicit function calls with packagename::.
  - Don’t forget, you can use ?functionname to find out which package the function comes from.
- Move the newly created function over into the R/functions.R file.
- Restart R, go into the doc/lesson.Rmd file and run the setup code chunk in the R Markdown document with the source() and load() commands. Then test that the new function works in a code chunk at the bottom of the document.
Use this code to refresh your memory and to use as a starting point:
___ <- function(___) {
___
}
Click for the (possible) solution. Click only if you are really struggling or you are out of time for the exercise.
#' Calculate tidy summary statistics by day.
#'
#' @param data The MMASH dataset.
#'
#' @return A data.frame/tibble.
#'
tidy_summarise_by_day <- function(data) {
daily_summary <- data %>%
dplyr::select(-samples) %>%
tidyr::pivot_longer(c(-user_id, -day, -gender)) %>%
tidyr::drop_na(day, gender) %>%
dplyr::group_by(gender, day, name) %>%
dplyr::summarise(dplyr::across(value,
list(mean = mean, sd = sd),
na.rm = TRUE)) %>%
dplyr::ungroup() %>%
tidyr::pivot_wider(names_from = day,
values_from = dplyr::starts_with("value"))
return(daily_summary)
}
# Testing that the function works.
mmash %>%
tidy_summarise_by_day()
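One caveat about the na.rm = TRUE used inside across() above: since dplyr 1.1.0, passing extra arguments through across() like this is deprecated and gives a warning. A possible alternative, sketched here on made-up data, is to wrap each function in an anonymous function that sets na.rm itself. The toy tibble is hypothetical; this is one way to write it, not necessarily how the course material would update the function.

```r
library(dplyr)
library(tibble)

# Hypothetical long-form data with one missing value.
toy <- tribble(
  ~name, ~value,
  "weight", 65,
  "weight", NA,
  "weight", 75,
  "height", 169,
  "height", 180
)

toy_summary <- toy %>%
  group_by(name) %>%
  summarise(across(
    value,
    # Each anonymous function passes na.rm = TRUE on directly.
    list(
      mean = function(x) mean(x, na.rm = TRUE),
      sd = function(x) sd(x, na.rm = TRUE)
    )
  ))
toy_summary
```

The output columns are still called value_mean and value_sd, because across() builds the names from the column name and the names in the list.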
10.8 Extending the function to use other statistics and to be tidier
Now that we’ve made the tidy summary code into a function, let’s make it more generic so we can use other summary statistics and to have the output be a bit tidier. For instance, it would be nice to be able to do something like this:
mmash %>%
tidy_summarise_by_day(median)
mmash %>%
tidy_summarise_by_day(max)
mmash %>%
tidy_summarise_by_day(list(median = median, max = max))
Before we get to adding this functionality, let’s first make it so the function has a tidier output. Specifically, we want to round the values so they are easier to read. Go into the R/functions.R script to the tidy_summarise_by_day() function. We’ll create a new line right after the dplyr::summarise() function, after the %>% pipe. Since we want to round values of existing columns, we need to use mutate(). And like we used across() in summarise(), we can also use across() within mutate() on specific columns. In our case, we want to round the columns selected by starts_with() the word "value" to 2 digits.
tidy_summarise_by_day <- function(data) {
data %>%
dplyr::select(-samples) %>%
tidyr::pivot_longer(c(-user_id, -day, -gender)) %>%
tidyr::drop_na(day, gender) %>%
dplyr::group_by(gender, day, name) %>%
dplyr::summarise(dplyr::across(value,
list(mean = mean, sd = sd),
na.rm = TRUE)) %>%
dplyr::mutate(dplyr::across(dplyr::starts_with("value"),
round, digits = 2)) %>%
tidyr::pivot_wider(names_from = day,
values_from = dplyr::starts_with("value"))
}
# Source, then test out the function in the Console:
tidy_summarise_by_day(mmash)
#> # A tibble: 9 × 6
#> # Groups: gender [1]
#> gender name value_mean_1 value_mean_2 value_sd_1 value_sd_2
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 M age 26.0 26.0 7.15 7.15
#> 2 M cortisol_norm 0.03 0.07 0.03 0.05
#> 3 M height 180. 180. 8.19 8.19
#> 4 M hr_mean 81.0 66.9 7.51 6.63
#> 5 M hr_sd 12.7 15.5 4.86 6.89
#> 6 M ibi_s_mean 0.76 0.92 0.08 0.09
#> 7 M ibi_s_sd 0.18 0.28 0.08 0.12
#> 8 M melatonin_norm 0 0 0 0
#> 9 M weight 75.3 75.3 12.8 12.8
That’s much easier to read with the values rounded. Now let’s add the ability to change the summary statistics function to something else. This is a surprisingly easy thing so before we do that, let’s take a few minutes to brainstorm how we can achieve this.
For instructors: Click for details.
Get the groups to chat together for about 5 minutes to think about how they’d do that. Ask that they don’t look ahead in the text. After that, discuss some ways to add the functionality.
Now that we’ve discussed this and come to a conclusion, let’s update the function.
tidy_summarise_by_day <- function(data, summary_fn) {
data %>%
dplyr::select(-samples) %>%
tidyr::pivot_longer(c(-user_id, -day, -gender)) %>%
tidyr::drop_na(day, gender) %>%
dplyr::group_by(gender, day, name) %>%
dplyr::summarise(dplyr::across(
value,
summary_fn,
na.rm = TRUE)
) %>%
dplyr::mutate(dplyr::across(dplyr::starts_with("value"),
round, digits = 2)) %>%
tidyr::pivot_wider(names_from = day,
values_from = dplyr::starts_with("value"))
}
# Source, then test out the function in the Console:
tidy_summarise_by_day(mmash, max)
#> # A tibble: 9 × 4
#> # Groups: gender [1]
#> gender name `1` `2`
#> <chr> <chr> <dbl> <dbl>
#> 1 M age 40 40
#> 2 M cortisol_norm 0.16 0.26
#> 3 M height 205 205
#> 4 M hr_mean 97.5 83.1
#> 5 M hr_sd 31.9 38.6
#> 6 M ibi_s_mean 0.96 1.06
#> 7 M ibi_s_sd 0.44 0.56
#> 8 M melatonin_norm 0 0
#> 9 M weight 115 115
Now that it works, let’s add some summary statistics to the
doc/lesson.Rmd
file.
mmash %>%
tidy_summarise_by_day(max)
#> # A tibble: 9 × 4
#> # Groups: gender [1]
#> gender name `1` `2`
#> <chr> <chr> <dbl> <dbl>
#> 1 M age 40 40
#> 2 M cortisol_norm 0.16 0.26
#> 3 M height 205 205
#> 4 M hr_mean 97.5 83.1
#> 5 M hr_sd 31.9 38.6
#> 6 M ibi_s_mean 0.96 1.06
#> 7 M ibi_s_sd 0.44 0.56
#> 8 M melatonin_norm 0 0
#> 9 M weight 115 115
mmash %>%
tidy_summarise_by_day(median)
#> # A tibble: 9 × 4
#> # Groups: gender [1]
#> gender name `1` `2`
#> <chr> <chr> <dbl> <dbl>
#> 1 M age 27 27
#> 2 M cortisol_norm 0.02 0.06
#> 3 M height 180 180
#> 4 M hr_mean 79.3 66.4
#> 5 M hr_sd 12.1 14.3
#> 6 M ibi_s_mean 0.77 0.91
#> 7 M ibi_s_sd 0.15 0.21
#> 8 M melatonin_norm 0 0
#> 9 M weight 70 70
mmash %>%
tidy_summarise_by_day(list(median = median, max = max))
#> # A tibble: 9 × 6
#> # Groups: gender [1]
#> gender name value_median_1 value_median_2 value_max_1 value_max_2
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 M age 27 27 40 40
#> 2 M cortisol_norm 0.02 0.06 0.16 0.26
#> 3 M height 180 180 205 205
#> 4 M hr_mean 79.3 66.4 97.5 83.1
#> 5 M hr_sd 12.1 14.3 31.9 38.6
#> 6 M ibi_s_mean 0.77 0.91 0.96 1.06
#> 7 M ibi_s_sd 0.15 0.21 0.44 0.56
#> 8 M melatonin_norm 0 0 0 0
#> 9 M weight 70 70 115 115
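The reason this works is that functions in R can be passed around as arguments just like data. A stripped-down sketch of the same idea, using a hypothetical helper called summarise_values() and made-up heart rate values (neither is part of the course code), could look like this:

```r
# Hypothetical helper: applies whichever function is given as
# `summary_fn` to the values, ignoring missing values.
summarise_values <- function(values, summary_fn) {
  summary_fn(values, na.rm = TRUE)
}

# Made-up values with one missing observation.
heart_rates <- c(90.6, 81.0, NA, 66.9)

# The same helper gives different statistics depending on the
# function that is passed in.
highest <- summarise_values(heart_rates, max)
middle <- summarise_values(heart_rates, median)
highest
middle
```

This also helps explain the column names in the output above. When a single function is given, across() keeps the column name value, so after pivoting only the day values (1 and 2) are left as column names. When a named list is given, across() creates one column per function (like value_median and value_max), and pivot_wider() then combines those names with the day.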
10.9 Making prettier output in R Markdown
For instructors: Click for details.
Can go over this quite quickly after they’ve (optionally) finished the previous exercise.
What we created is nice and all, but since we are working in an R Markdown document and knitting to HTML, let’s make it easier for others (including yourself) to read the document. Let’s make the output as an actual table. We can do that with knitr::kable() (meaning “knitr table”). We can also add a table caption with the caption argument.
mmash %>%
tidy_summarise_by_day(list(mean = mean, min = min, max = max)) %>%
knitr::kable(caption = "Descriptive statistics of some variables.")
gender | name | value_mean_1 | value_mean_2 | value_min_1 | value_min_2 | value_max_1 | value_max_2 |
---|---|---|---|---|---|---|---|
M | age | 25.95 | 25.95 | 0.00 | 0.00 | 40.00 | 40.00 |
M | cortisol_norm | 0.03 | 0.07 | 0.01 | 0.02 | 0.16 | 0.26 |
M | height | 180.14 | 180.14 | 169.00 | 169.00 | 205.00 | 205.00 |
M | hr_mean | 80.96 | 66.89 | 70.27 | 56.84 | 97.47 | 83.10 |
M | hr_sd | 12.70 | 15.49 | 7.85 | 7.64 | 31.86 | 38.61 |
M | ibi_s_mean | 0.76 | 0.92 | 0.62 | 0.75 | 0.96 | 1.06 |
M | ibi_s_sd | 0.18 | 0.28 | 0.09 | 0.15 | 0.44 | 0.56 |
M | melatonin_norm | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
M | weight | 75.29 | 75.29 | 60.00 | 60.00 | 115.00 | 115.00 |
Then knit the document and check out the HTML file. So pretty! 😁 (well, there are lots of things to fix up, but it’s a good starting place.)
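One of the things that could be fixed up is the column names and the number of digits. As a rough sketch, knitr::kable() has col.names and digits arguments for this. The small data frame below is a hypothetical stand-in for the output of tidy_summarise_by_day(), with values taken from the table above, and the new column labels are just examples.

```r
library(knitr)

# Hypothetical summary table, standing in for the function output.
toy_summary <- data.frame(
  name = c("hr_mean", "weight"),
  value_mean_1 = c(80.96, 75.29),
  value_mean_2 = c(66.89, 75.29)
)

pretty_table <- kable(
  toy_summary,
  # More readable column headers for the final document.
  col.names = c("Variable", "Mean (day 1)", "Mean (day 2)"),
  # Round numeric columns to one digit when displaying.
  digits = 1,
  caption = "Descriptive statistics of some variables."
)
pretty_table
```

When run in a code chunk and knitted, this shows up as a formatted table with the new headers instead of the raw column names.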
10.10 General workflow up to this point
For instructors: Click for details.
You can go over this point verbally, reiterating what they’ve learned so far.
You now have some skills and tools to allow you to reproducibly import, process, clean, join, and eventually analyze your datasets. Listed below are the general workflows we’ve covered and that you can use as a guideline to complete the following (optional) exercises and group work.
- Import with vroom(), then spec(), then vroom() again.
- Convert importing into a function in an R Markdown document, move it to the R/functions.R script, restart R, and source().
- Test that joining datasets into a final form works properly while in an R Markdown document, then cut and paste the code into a data processing R script in the data-raw/ folder (optionally this can also be done in the data-raw/ R script).
- Restart R and generate the .rda dataset in the data/ folder by sourcing the data-raw/ R script.
- Restart R, load the new dataset with load() and put the loading code into an R Markdown document.
- Add any additional cleaning code to the data processing R script in data-raw/ and update the .rda dataset in data/ whenever you encounter problems in the dataset.
- Write R in code chunks in an R Markdown document to further analyze your data and check reproducibility by often knitting to HTML.
  - Part of this workflow is to also write R code to output in a way that looks nice in the HTML (or Word) formats by mostly creating tables or figures of the R output.
- Use Git often by adding and committing into the history so you never lose stuff and can keep track of changes to your files.
10.11 Exercise: Discuss how you might use this workflow in your own work
Time: 15 min
We’ve covered quite a bit in this course and you’ve (hopefully) learned a lot. Before moving on to other exercises, discuss with your group how you might use this workflow (or parts of it) in your own work. What are some ways you might use these workflows and techniques? What challenges do you see might come up by using these skills and tools? Groups will briefly share what they’ve discussed before moving on to the other exercises. (side note: this exercise is partly to help reinforce what you’ve learned and also partly selfish since we’d love to hear how you might use these tools and some challenges that might come up by using them.)
10.12 Summary
For instructors: Click for details.
Quickly cover this before finishing the session and when starting the next session.
- Data is usually structured to varying degrees as wide or long format
  - Use pivot_longer() to convert from wide to long
  - Use pivot_wider() to convert from long to wide