9  Using split-apply-combine to help in processing

Warning

🚧 We are doing major changes to this workshop, so much of the content will be changed. 🚧

Briefly go over the bigger picture (found in the introduction section) and remind everyone the ‘what’ and ‘why’ of what we are doing.

9.1 Learning objectives

  1. Review the split-apply-combine technique and identify how these concepts make use of functional programming.

  2. Apply functional programming to summarize data using the split-apply-combine technique with dplyr’s group_by(), summarise(), and across() functions.

9.2 Split-apply-combine technique and functionals

Verbally cover this section before moving on to the summarizing. Let them know they can read more about this in this section.

We’re taking a quick detour to briefly talk about a concept that perfectly illustrates how vectorization and functionals fit into doing data analysis. The concept is called the split-apply-combine technique, which we covered in the beginner R workshop. The method is:

  1. Split the data into groups (e.g. diabetes status).
  2. Apply some analysis or statistics to each group (e.g. finding the mean of age).
  3. Combine the results to present them together (e.g. into a data frame that you can use to make a plot or table).

So when you split data into multiple groups, you create a list (or a vector) that you can then use (with the map functional) to apply a statistical technique to each group through vectorization. This technique works really well for a range of tasks, including for our task of summarizing some of the MMASH data so we can merge it all into one dataset.

9.3 Summarising data through functionals

Functionals and vectorization are integral components of how R works and they appear throughout many of R’s functions and packages. They are particularly used throughout the tidyverse packages like dplyr. Let’s get into some more advanced features of dplyr functions that work as functionals.

Before we continue, re-run the code for getting user_info_df since you had restarted the R session previously.

Since we’re going to use dplyr, we need to add it as a dependency by typing this in the Console:

Console
usethis::use_package("dplyr")

There are many “verbs” in dplyr, like select(), rename(), mutate(), summarise(), and group_by() (covered in more detail in the Data Management and Wrangling session of the beginner workshop). The common usage of these verbs is through acting on and directly using the column names (e.g. without " quotes around the column name like with saliva_df |> select(cortisol_norm)). But many dplyr verbs can also take functions as input, especially when using the column selection helpers from the tidyselect package.

Likewise, with functions like summarise(), if you want to for example calculate the mean of cortisol in the saliva dataset, you would usually type out:

docs/learning.qmd
saliva_df |>
  summarise(cortisol_mean = mean(cortisol_norm))

Don’t know what the |> pipe is? Check out the section on it from the beginner workshop.

If you want to calculate the mean of multiple columns, you might think you’d have to do something like:

docs/learning.qmd
saliva_df |>
  summarise(
    cortisol_mean = mean(cortisol_norm),
    melatonin_mean = mean(melatonin_norm)
  )

But instead, there is the across() function that works like map() and allows you to calculate the mean across which ever columns you want. In many ways, across() is similar to map(), particularly in the arguments you give it and in the sense that it is a functional. But they are used in different settings: across() works well with columns within a dataframe and within a mutate() or summarise(), while map() is more generic.

Reading task: ~2 minutes

When you look in ?across, there are two main arguments and one optional one (plus one deprecated one):

  1. .cols argument: Columns you want to use.
  2. .fns: The function to use on the .cols.
    • A bare function (mean) applies it to each column and returns the output, with the column name unchanged.

    • A list with bare functions (list(mean, sd)) applies each function to each column and returns the output with the column name appended with a number (e.g. cortisol_norm_1).

    • An anonymous function passed with either ~ (using .x to refer to the variable) or \() or function() like in map(). For instance, these three lines of code below mean all the same and are used to say “put age and weight, one after the other, in place of where .x is located” to calculate the mean for age and the mean for weight.

      across(c(age, weight), ~ mean(.x, na.rm = TRUE))
      across(c(age, weight), function(x) mean(x, na.rm = TRUE))
      across(c(age, weight), \(x) mean(x, na.rm = TRUE))

      We will use the \() syntax to be consistent with how we use purrr functions.

    • A named list with bare or anonymous functions (list(average = mean, stddev = sd)) does the same as above but instead returns an output with the column names appended with the name given to the function in the list (e.g. cortisol_norm_average). You can also use anonymous functions within the list:

      list(
        average = \(x) mean(x, na.rm = TRUE),
        stddev = \(x) sd(x, na.rm = TRUE),
      )
  3. ... argument (deprecated): Arguments to give to the functions in .fns. No longer used.
  4. .names argument: Customize the output of the column names. We won’t cover this argument.

Go over the first two arguments again, reinforcing what they read.

Let’s try out some examples. To calculate the mean of cortisol_norm like we did above, we’d do:

docs/learning.qmd
saliva_df |>
  summarise(across(cortisol_norm, mean))

To calculate the mean of another column:

docs/learning.qmd
saliva_df |>
  summarise(across(c(cortisol_norm, melatonin_norm), mean))

This is nice, but changing the column names so that the function name is added would make reading what the column contents are clearer. That’s when we would use “named lists”, which are lists that look like:

list(item_one_name = ..., item_two_name = ...)

So, for having a named list with mean inside across(), it would look like (in the Console):

Console
list(mean = mean)
# or
list(average = mean)
# or
list(ave = mean)

Let’s stick with list(mean = mean):

docs/learning.qmd
saliva_df |>
  summarise(across(cortisol_norm, list(mean = mean)))

Now, let’s collect some of the concepts from above to calculate the mean and standard deviation for all numeric columns in the saliva_df:

docs/learning.qmd
saliva_df |>
  summarise(across(where(is.numeric), list(mean = mean, sd = sd)))

We can use these concepts and code to process the other longer datasets, like RR.csv, in a way that makes it more meaningful to eventually merge (also called “join”) them with the smaller datasets like user_info.csv or saliva.csv. Let’s work with the RR.csv dataset to eventually join it with the others.

9.4 Summarizing long data like the RR dataset

With the RR dataset, each participant had almost 100,000 data points recorded over two days of collection. So if we want to join with the other datasets, we need to calculate summary measures by at least file_path_id and also preferably by day as well. In this case, we need to group_by() these two variables before summarising. In this way, we use the split-apply-combine technique. Let’s first summarise by taking the mean of ibi_s (which is the inter-beat interval in seconds).

docs/learning.qmd
rr_df <- import_multiple_files("RR.csv", import_rr)
rr_df |>
  group_by(file_path_id, day) |>
  summarise(across(ibi_s, list(mean = mean)))

While there are no missing values here, let’s add the argument na.rm = TRUE just in case. In order add this argument to the mean, we

docs/learning.qmd
rr_df |>
  group_by(file_path_id, day) |>
  summarise(across(ibi_s, list(mean = \(x) mean(x, na.rm = TRUE))))

Let’s also add standard deviation as another measure from the RR datasets:

docs/learning.qmd
summarised_rr_df <- rr_df |>
  group_by(file_path_id, day) |>
  summarise(
    across(ibi_s, list(
      mean = \(x) mean(x, na.rm = TRUE),
      sd = \(x) sd(x, na.rm = TRUE)
    ))
  )

summarised_rr_df

Whenever you are finished with a grouping effect, it’s good practice to end the group_by() with .groups = "drop". Let’s add it to the end:

docs/learning.qmd
summarised_rr_df <- rr_df |>
  group_by(file_path_id, day) |>
  summarise(
    across(ibi_s, list(
      mean = \(x) mean(x, na.rm = TRUE),
      sd = \(x) sd(x, na.rm = TRUE)
    )),
    .groups = "drop"
  )
summarised_rr_df

Ungrouping the data with the .groups = "drop" in the summarise() function does not provide any visual indication of what is happening. However, in the background, it removes certain metadata that the group_by() function added.

Note

By default, using group_by() continues the grouping effect of later code, like mutate() and summarise(). Normally we would end a group_by() by using ungroup(), especially if we want to do multiple wrangling functions on the same grouping. Because sometimes, especially after using summarise(), we don’t need to keep the grouping. So we can use the .groups = "drop" argument in summarise() to end the grouping.

Before continuing, let’s run styler with the Palette (Ctrl-Shift-P, then type “style file”) and knit the Quarto document with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”) to confirm that everything runs as it should. If the knitting works, then switch to the Git interface and add and commit the changes so far with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).

9.5 Exercise: Summarise the Actigraph data

Time: 15 minutes.

Like with the RR.csv dataset, let’s process the Actigraph.csv dataset so that it makes it easier to join with the other datasets later. Make sure to read the warning block below.

Warning

Since the actigraph_df dataset is quite large, we strongly recommend not using View() or selecting the dataframe in the Environments pane to view it. For many computers, your R session will crash! Instead type out glimpse(actigraph_df) or simply actigraph_df in the Console.

  1. Like usual, create a new Markdown header called e.g. ## Exercise: Summarise Actigraph and insert a new code chunk below that with Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type “new chunk”).
  2. Import all the Actigraph data files using the import_multiple_files() function you created previously. Name the new data frame actigraph_df.
  3. Look into the Data Description to find out what each column is for.
  4. Based on the documentation, which variables would you be most interested in analyzing more?
  5. Decide which summary measure(s) you think may be most interesting for you (e.g. median(), sd(), mean(), max(), min(), var()).
  6. Use group_by() with the file_path_id and day variables only, then use summarise() with across() to summarise the variables you are interested in (from item 4 above) with the summary functions you chose. Assign the newly summarised data frame to a new data frame and call it summarised_actigraph_df.
  7. End the grouping effect with .groups = "drop" in summarise().
  8. Run styler while in the docs/learning.qmd file with the Palette (Ctrl-Shift-P, then type “style file”).
  9. Knit the docs/learning.qmd document with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”) to make sure everything works.
  10. Add and commit the changes you’ve made into the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).
Click for the solution. Only click if you are struggling or are out of time.
actigraph_df <- import_multiple_files("Actigraph.csv", import_actigraph)
summarised_actigraph_df <- actigraph_df |>
  group_by(file_path_id, day) |>
  # These statistics will probably be different for you
  summarise(
    across(hr, list(
      mean = \(x) mean(x, na.rm = TRUE),
      sd = \(x) sd(x, na.rm = TRUE)
    )),
    .groups = "drop"
  )

9.6 💬 Discussion activity: Incorporating functionals in your own work

Time: ~10 minutes.

As we prepare for the next session, discuss with your neighbour or group the following question:

  • Think about the functions and functionals sessions. Discuss with your group some ideas on how you can implement making functions and using functionals in your work. What are some things you do repetitively (in code or general work) that you think would be great to have simplified or automated.

9.7 Summary

  • Use group_by(), summarise(), and across() with .groups = "drop" in the summarise() function to use the split-apply-combine technique when needing to do an action on groups within the data (e.g. calculate the mean age between education groups).