10  Hands-on exercises

We’re still testing to see what works best for this section.

This session is all about reinforcing the skills you’ve learned in this course by continuing to use them. You can either continue working on the MMASH dataset and complete the exercises below, or you can try to apply these skills to another dataset (such as your own). If you continue with the MMASH dataset, we strongly encourage you to work together with your group to ask questions and to help each other understand what to do and what to code (we are also here to help).

10.1 Quick recap of workflow

You can go over this point verbally, reiterating what they’ve learned so far.

You now have some skills and tools to allow you to reproducibly import, process, clean, join, and eventually analyze your datasets. Listed below are the general workflows we’ve covered and that you can use as a guideline to complete the following (optional) exercises and group work.

  • Import with vroom(), and optionally use spec() to get the column specification and then vroom() again to select specific variables.
  • Convert the import code into a function in a Quarto / R Markdown document, move it to the R/functions.R script, restart R, and source() the script.
  • Test that joining datasets into a final form works properly while in a Quarto / R Markdown document, then cut and paste the code into a data processing R script in the data-raw/ folder (alternatively, you can directly write and test code while in the data-raw/ R script).
  • Restart R and generate the .rda dataset in the data/ folder by sourcing the data-raw/ R script.
  • Restart R, load the new dataset with load() and put the loading code into a Quarto / R Markdown document.
  • Add any additional cleaning code to the data processing R script in data-raw/ and update the .rda dataset in data/ whenever you encounter problems in the dataset.
  • Write R code in code chunks in a Quarto / R Markdown document to further analyze your data, and check reproducibility by rendering to HTML often.
    • Part of this workflow is to also write R code to output in a way that looks nice in the HTML (or Word) formats by mostly creating tables or figures of the R output.
  • Use Git often by adding and committing into the history so you never lose stuff and can keep track of changes to your files.
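
As a rough illustration of the import and save steps above, here is a minimal sketch using a toy CSV file written to a temporary location. The file contents, the column names (id, Start, End), and the function name import_toy are hypothetical and only stand in for the real MMASH files; in the course project the saving step happens in the data-raw/ script rather than with save().

```r
library(vroom)

# Hypothetical toy file standing in for one raw data file.
toy_file <- tempfile(fileext = ".csv")
writeLines(c("id,Start,End", "1,10:00,10:30", "2,11:00,11:45"), toy_file)

# Step 1: import once and look at the guessed column specification.
first_pass <- vroom(toy_file)
spec(first_pass)

# Step 2: wrap a second, more specific import in a function
# (import_toy is a made-up name for this sketch).
import_toy <- function(file_path) {
  vroom(
    file_path,
    col_select = c(id, Start),
    col_types = cols(
      id = col_double(),
      Start = col_time(format = "")
    )
  )
}

# Step 3: save the imported data as an .rda file, then load it again.
toy <- import_toy(toy_file)
rda_file <- tempfile(fileext = ".rda")
save(toy, file = rda_file)
rm(toy)
load(rda_file)
toy
```

After load(), the object comes back under the name it was saved with (here toy), which is why the loading code does not need an assignment.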

10.2 Import and process the activity data

We have a few other datasets that we could join together, but they would likely require more processing before they can be appropriately joined with the other datasets. Complete these tasks in the doc/learning.qmd file:

  1. Create a new header called ## Exercise: Importing activity data
  2. Create a new code chunk below this new header, for example with the Palette (type “new chunk”).
  3. Starting the workflow from the beginning (e.g. with the spec() process), write code that imports the Activity.csv data into R.
  4. Convert this code into a new function using the workflow you’ve used in this course:
    • Call the new function import_activity.
    • Include one argument called file_path.
    • Test that it works.
    • Add Roxygen documentation, for example with the Palette (type “roxygen comment”), and use explicit package links (::) for the functions.
    • Move the newly created function into the R/functions.R script.
    • Use the new function in doc/learning.qmd, and use source() or the Palette (type “source”) to make it available.
  5. Import all the user_ datasets with import_multiple_files() and the import_activity() function.
  6. Pipe the results into mutate() and create a new column called activity_seconds by subtracting start from end.
    • Use ?mutate and check the examples in the help document that pops up if you don’t recall how to use this function.
  7. You’ll notice that the activity column is numeric. Look into the Data Description and find out what each column represents and what the numbers mean in the column activity. Then think about or complete these tasks:
    • What are the advantages and disadvantages of using numbers instead of text to describe categorical data like in the activity column?
    • Using the case_when() function (within mutate()) we learned about, convert the activity numbers into more meaningful character data.
  8. Render the Quarto document, for example with the Palette (type “render”), to see if it is reproducible.
  9. Run styler with the Palette (type “style file”).
  10. Add and commit your changes to the Git history, for example with the Palette (type “commit”).
Solution (only look if you are struggling or are out of time):
import_activity <- function(file_path) {
  activity_data <- vroom::vroom(
    file_path,
    # Exclude the first column.
    col_select = -1,
    col_types = vroom::cols(
      activity = vroom::col_double(),
      start = vroom::col_time(format = ""),
      end = vroom::col_time(format = ""),
      day = vroom::col_double(),
      .delim = ","
    ),
    # Convert column names to snake_case.
    .name_repair = snakecase::to_snake_case
  )
  return(activity_data)
}

activity_df <- import_multiple_files("Activity.csv", import_activity)

activity_df %>%
  mutate(activity_seconds = end - start)
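
The case_when() task is not covered in the code above, so here is a minimal sketch of how it could look on toy data. The text labels below ("label_for_1", "label_for_4") are placeholders, not the real meanings; replace them with the descriptions you find in the Data Description. The toy_activity object is also hypothetical.

```r
library(dplyr)

# Toy data standing in for the imported activity data.
toy_activity <- tibble(
  user_id = c("user_1", "user_1", "user_2"),
  activity = c(1, 4, 7)
)

labelled <- toy_activity %>%
  mutate(
    activity_label = case_when(
      # Placeholder labels: look up the real ones in the Data Description.
      activity == 1 ~ "label_for_1",
      activity == 4 ~ "label_for_4",
      # Anything not listed above becomes missing.
      TRUE ~ NA_character_
    )
  )
labelled
```

Keeping the original numeric column next to the new text column makes it easy to check that each number was mapped to the label you intended.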

10.3 Fix up some remaining data issues

There are still some issues with the data. Try to work through these tasks to clean the data up more.

  1. Use the code below to examine the cleaned data for user 21. Notice some problems? Go into the raw data as well as the data documentation on the MMASH website and try to figure out what happened. How might you fix the problem? Try your fix within the data-raw/mmash.R cleaning script.

    mmash %>% 
      filter(user_id == "user_21")
    • Hint: Look at the .csv files in user 21’s folder. Compare them to the other users’ folders. Are the same number of files there?
    • Hint: In the reduce() and full_join() code, try re-arranging the order that the datasets are listed. Does order matter?
  2. If you do count(mmash, day) in the Console while the mmash dataset is loaded, you’ll see that the day column has a weird value of -29. Use filter() to see which rows and users are affected. Look into the data documentation on the MMASH website and see if you can find an explanation for this. How would you go about solving this issue? Write up code to fix this issue in the data-raw/mmash.R script.

    • Hint: Find the users that have this value and look into their data .csv files that have day (or Day) variables. Which files have this -29 in the day column? Where does the -29 value start in that data file? Can you guess what the issue was and what -29 is or should be?
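
To get started on the second task, the inspection step might look like the sketch below. It uses a small made-up dataset (toy_mmash) rather than the real mmash data, so the users and values shown are purely illustrative, and it only finds the affected rows; deciding on the fix is still up to you.

```r
library(dplyr)

# Toy data with one suspicious day value.
toy_mmash <- tibble(
  user_id = c("user_1", "user_1", "user_2", "user_2"),
  day = c(1, 2, 1, -29)
)

# How often does each day value occur?
count(toy_mmash, day)

# Which rows, and which users, have the odd value?
odd_days <- toy_mmash %>%
  filter(day == -29)
odd_days
distinct(odd_days, user_id)
```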

10.4 Process and join sleep and questionnaire data

There are still a few datasets that you can join in with the current datasets like sleep.csv and questionnaire.csv. Using the workflows in Section 10.1 as a guide, start from the beginning and import, process, clean, make functions, and join these two datasets in with the others so that they get included in the data/mmash.rda final dataset. Afterwards, do some descriptive analysis using the function tidy_summarise_by_day().


User 11 has no sleep data, so you will eventually have to drop user 11 from the dataset before summarizing the data.
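
Dropping a single user can be done with filter(). This is a minimal sketch on toy data; the toy_data object and its value column are hypothetical.

```r
library(dplyr)

toy_data <- tibble(
  user_id = c("user_10", "user_11", "user_12"),
  value = c(1.5, 2.0, 2.5)
)

# Keep every row except those belonging to user 11.
without_user_11 <- toy_data %>%
  filter(user_id != "user_11")
without_user_11
```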

10.5 Create a second dataset of only the Actigraph and RR data

The Actigraph and RR datasets contain a lot of interesting and useful data that gets destroyed when we first summarise and then join them with the other datasets. While we can’t meaningfully join all this data with the other datasets, we can join them on their own as separate datasets.

Using the workflows in Section 10.1 as a guide, start from the beginning and import, process, clean, make functions, and create a final dataset of only the Actigraph.csv and RR.csv datasets.

  • Join only these two datasets by user_id, day, and time.
  • Name the new dataset actigraph_rr and save it to data/ by using another usethis::use_data() line in the data-raw/mmash.R script.
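
The join and save steps could be sketched as follows. The two toy datasets and their steps and ibi_s columns are made up for illustration, and using full_join() is an assumption: it keeps time points that appear in only one dataset, while inner_join() would keep only the matching ones. Choose whichever suits your question.

```r
library(dplyr)

# Toy stand-ins for the processed Actigraph and RR data.
toy_actigraph <- tibble(
  user_id = "user_1",
  day = 1,
  time = c("10:00:00", "10:00:01"),
  steps = c(0, 2)
)
toy_rr <- tibble(
  user_id = "user_1",
  day = 1,
  time = c("10:00:01", "10:00:02"),
  ibi_s = c(0.8, 0.9)
)

# Join on all three key columns.
actigraph_rr <- full_join(
  toy_actigraph,
  toy_rr,
  by = c("user_id", "day", "time")
)
actigraph_rr

# In data-raw/mmash.R, the saving step would then be something like:
# usethis::use_data(actigraph_rr, overwrite = TRUE)
```

The use_data() line is left as a comment because it only works inside a project set up like the course project.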

Then think about how some of these questions might be answered, using the Data Description as a guide:

  1. What times of the day are people lying down (inclinometer_lying) most often? How long are they lying down for?
  2. When are people most likely to be sitting down (inclinometer_sitting)? Does this also correspond to a lower heart rate or a lower interbeat interval (from the RR data) compared to when standing?
  3. Are people who have more activity throughout the day also reporting better sleep quality (like number_of_awakenings)?
  4. Does the self-reported activity match the accelerometry data?
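
As one possible starting point for the first question, here is a sketch that summarises how often people are lying down at each hour. It assumes that inclinometer_lying is recorded as 0 or 1 for each row and that you have already created an hour column from the time variable; both are assumptions to check against the Data Description, and the toy values are made up.

```r
library(dplyr)

toy_actigraph <- tibble(
  hour = c(8, 8, 14, 23, 23, 23),
  inclinometer_lying = c(0, 1, 0, 1, 1, 1)
)

# Proportion of recordings spent lying down, per hour of the day.
lying_by_hour <- toy_actigraph %>%
  group_by(hour) %>%
  summarise(prop_lying = mean(inclinometer_lying), .groups = "drop") %>%
  arrange(desc(prop_lying))
lying_by_hour
```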

10.6 Other datasets to try out

If you have completed the exercises above and still want to practice these skills, here are some other datasets you can use (aside from your own):