12 Hands-on exercises

Warning

🚧 We are making major changes to this workshop, so much of the content will change. 🚧

Warning

We’re still testing to see what works best for this section.

This session is all about reinforcing the skills you’ve learned in this workshop by continuing to use them. You can either continue working on the MMASH dataset and complete the exercises below, or you can try to apply these skills to another dataset (such as your own). If you continue with the MMASH dataset, we strongly encourage you to work together with your group to ask questions and to help each other understand what to do and what to code (we are also here to help).

12.1 Quick recap of workflow

Instructor note

You can go over this point verbally, reiterating what they’ve learned so far.

You now have some skills and tools to allow you to reproducibly import, process, clean, join, and eventually analyze your datasets. Listed below are the general workflows we’ve covered and that you can use as a guideline to complete the following (optional) exercises and group work.

Import with read_csv().
Convert the importing code into a function in a Quarto document, move it to the R/functions.R script, restart R, and source() the script.
Test that joining datasets into a final form works properly while in a Quarto document, then cut and paste the code into a data processing R script in the data-raw/ folder (alternatively, you can directly write and test code while in the data-raw/ R script).
Restart R and generate the .csv dataset in the data/ folder by sourcing the data-raw/ R script.
Restart R, load the new .csv dataset with read_csv(), and put the loading code into a Quarto document.
Add any additional cleaning code to the data processing R script in data-raw/ and update the .csv dataset in data/ whenever you encounter problems in the dataset.
Write R in code chunks in a Quarto document to further analyze your data and check reproducibility by often rendering to HTML.
- Part of this workflow is also writing R code whose output looks nice in the HTML (or Word) formats, mostly by creating tables or figures from the R output.
Use Git often by adding and committing into the history so you never lose stuff and can keep track of changes to your files.
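
As a rough illustration of how the import, save, and reload steps above fit together, here is a minimal sketch. It assumes the readr package is installed; the function name import_example(), the toy columns, and the temporary files are hypothetical stand-ins for your own function, the raw data, and the data/ folder.

```r
library(readr)

# Steps 1-2: wrap the import code in a function (hypothetical name).
import_example <- function(file_path) {
  readr::read_csv(file_path, show_col_types = FALSE)
}

# A toy raw file standing in for one of the raw .csv files.
raw_file <- tempfile(fileext = ".csv")
readr::write_csv(
  data.frame(user_id = c("user_1", "user_2"), day = c(1, 2)),
  raw_file
)

# Steps 3-4: process the data, then save the result as a .csv file.
# In the project, this would live in the data-raw/ script and save to data/.
processed <- import_example(raw_file)
processed_file <- tempfile(fileext = ".csv")
readr::write_csv(processed, processed_file)

# Step 5: in the Quarto document, read the saved .csv file back in.
reloaded <- readr::read_csv(processed_file, show_col_types = FALSE)
reloaded
```

In a real project you would restart R between these steps, as the list describes, so that each step is checked in a clean session.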

12.2 Import and process the activity data

We have a few other datasets that we could join together, but they would likely require more processing before they can be appropriately joined with the other datasets. Complete these tasks in the docs/learning.qmd file:

Create a new header called ## Exercise: Importing activity data
Create a new code chunk below this new header with Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type “new chunk”).
Starting the workflow from the beginning, write code that imports the Activity.csv data into R.
Convert this code into a new function using the workflow you’ve used from this workshop:
- Call the new function import_activity.
- Include one argument called file_path.
- Test that it works.
- Add Roxygen documentation with Ctrl-Shift-Alt-R or with the Palette (Ctrl-Shift-P, then type “roxygen comment”) and explicit package links (::) with the functions.
- Move the newly created function into R/functions.R.
- Use the new function in docs/learning.qmd and use source() with Ctrl-Shift-S or with the Palette (Ctrl-Shift-P, then type “source”) to run it.
Import all the user_ datasets with import_multiple_files() and the import_activity() function.
Pipe the results into mutate() and create a new column called activity_seconds by subtracting start from end.
- Use ?mutate and check the examples in the help document that pops up if you don’t recall how to use this function.
You’ll notice that the activity column is numeric. Look into the Data Description and find out what each column represents and what the numbers mean in the column activity. Then think about or complete these tasks:
- What are the advantages and disadvantages of using numbers instead of text to describe categorical data like in the activity column?
- Using the case_when() function (within mutate()) we learned about, convert the activity numbers into more meaningful character data.
Render the Quarto document with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”) to see if it is reproducible.
Run styler with the Palette (Ctrl-Shift-P, then type “style file”).
Add and commit your changes to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).

Click for the solution. Only click if you are struggling or are out of time.

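# Import one Activity.csv file with explicit column types.
import_activity <- function(file_path) {
  activity_data <- readr::read_csv(
    file_path,
    # Drop the first column.
    col_select = -1,
    col_types = readr::cols(
      activity = readr::col_double(),
      start = readr::col_time(format = ""),
      end = readr::col_time(format = ""),
      day = readr::col_double()
    ),
    # Convert the column names to snake_case.
    name_repair = snakecase::to_snake_case
  )
  return(activity_data)
}

# Import the Activity.csv file of every user.
activity_df <- import_multiple_files("Activity.csv", import_activity)

# Duration of each activity, named as asked for in the task.
activity_df |>
  mutate(activity_seconds = end - start)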
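
The case_when() task from the exercise could be approached along these lines. This is only a sketch on toy data, assuming the dplyr package is installed: the codes and labels are placeholders and not the real MMASH categories, which you should look up in the Data Description.

```r
library(dplyr)

# Toy data: the codes and labels are placeholders, not the MMASH codes.
toy_activity <- tibble::tibble(
  user_id = c("user_1", "user_1", "user_2"),
  activity = c(1, 2, 3)
)

toy_activity <- toy_activity |>
  mutate(
    activity_label = case_when(
      activity == 1 ~ "first category",
      activity == 2 ~ "second category",
      activity == 3 ~ "third category",
      # Any code not matched above becomes missing, so gaps stay visible.
      TRUE ~ NA_character_
    )
  )

toy_activity
```

Sending unmatched codes to NA rather than to a catch-all label makes it easier to spot codes you forgot to handle.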

12.3 Fix up some remaining data issues

There are some issues with the data still. Try to work through these tasks to clean the data up more.

Use the code below to examine the cleaned data for user 21. Notice some problems? Go into the raw data as well as the data documentation on the MMASH website and try to figure out what happened. How might you fix the problem? Try your fix within the docs/learning.qmd file.
```
docs/learning.qmd
```
```
mmash |>
  filter(user_id == "user_21")
```
- Hint: Look at the .csv files in user 21’s folder and compare them to the other users’ folders. Are the same number of files there?
- Hint: In the reduce() and full_join() code, try re-arranging the order that the datasets are listed. Does order matter?
If you do count(mmash, day) in the Console while the mmash dataset is loaded, you’ll see that the day column has a weird value of -29. Use filter() to see which rows and users are affected. Look into the data documentation on the MMASH website and see if you can find an explanation for this. How would you go about solving this issue? Write up code to fix this issue in the docs/learning.qmd file.
- Hint: Find the users that have this value and look into their data .csv files that have day (or Day) variables. Which files have this -29 in the day column? Where does the -29 value start in that data file? Can you guess what the issue was and what -29 is or should be?
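
The count() and filter() steps for the day column might look something like the sketch below. It assumes the dplyr package is installed and uses a toy stand-in for the mmash dataset (toy_mmash, with made-up values), so it shows the pattern only and not the actual answer.

```r
library(dplyr)

# Toy stand-in for the mmash dataset; the values are made up.
toy_mmash <- tibble::tibble(
  user_id = c("user_1", "user_1", "user_2", "user_2"),
  day = c(1, 2, 1, -29)
)

# How often does each value of day occur?
count(toy_mmash, day)

# Which rows, and so which users, carry the unexpected value?
odd_days <- toy_mmash |>
  filter(day == -29)

odd_days
```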

12.4 Process and join sleep and questionnaire data

There are still a few datasets that you can join in with the current datasets like sleep.csv and questionnaire.csv. Using the workflows in Section 12.1 as a guide, start from the beginning and import, process, clean, make functions, and join these two datasets in with the others so that they get included in the data/mmash.csv final dataset. Afterwards, do some descriptive analysis using the function tidy_summarise_by_day().

Tip

User 11 has no sleep data, so you will eventually have to drop user 11 from the dataset before summarizing the data.
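
To see why a user without sleep data needs attention, here is a small sketch of joining and then dropping a user. It assumes the dplyr and purrr packages are installed; the toy datasets and the columns other_value and sleep_value are hypothetical and do not reflect the real MMASH columns.

```r
library(dplyr)
library(purrr)

# Toy datasets; column names other than user_id and day are hypothetical.
toy_other <- tibble::tibble(
  user_id = c("user_10", "user_11"),
  day = c(1, 1),
  other_value = c(0.1, 0.2)
)
toy_sleep <- tibble::tibble(
  user_id = "user_10",
  day = 1,
  sleep_value = 7
)

# Join every dataset in the list, one after the other.
joined <- list(toy_other, toy_sleep) |>
  reduce(\(x, y) full_join(x, y, by = c("user_id", "day")))

# user_11 has no sleep rows, so sleep_value is NA for that user.
joined

# Drop the user before summarising.
without_user_11 <- joined |>
  filter(user_id != "user_11")

without_user_11
```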

12.5 Create a second dataset of only the Actigraph and RR data

The Actigraph and RR datasets contain a lot of interesting and useful data that gets destroyed when we first summarise and then join them with the other datasets. While we can’t meaningfully join all this data with the other datasets, we can join them on their own as separate datasets.

Using the workflows in Section 12.1 as a guide, start from the beginning and import, process, clean, make functions, and create a final dataset of only the Actigraph.csv and RR.csv datasets.

Join only these two datasets by user_id, day, and time.
Name the new dataset actigraph_rr and save it to data/ by using another write_csv() line in the docs/learning.qmd script.
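
A join by user_id, day, and time followed by a write_csv() call could be sketched as follows. It assumes the dplyr and readr packages are installed. The toy values, the ibi_s column, the time column stored as text, and the temporary output file are all assumptions for illustration; in the project you would save into the data/ folder instead.

```r
library(dplyr)
library(readr)

# Toy stand-ins for the imported Actigraph and RR data (made-up values).
toy_actigraph <- tibble::tibble(
  user_id = "user_1",
  day = 1,
  time = c("10:00:00", "10:00:01"),
  inclinometer_lying = c(0, 1)
)
toy_rr <- tibble::tibble(
  user_id = "user_1",
  day = 1,
  time = c("10:00:01", "10:00:02"),
  ibi_s = c(0.8, 0.9) # Hypothetical column name.
)

# Keep every row from both datasets, matched on the three key columns.
actigraph_rr <- full_join(
  toy_actigraph,
  toy_rr,
  by = c("user_id", "day", "time")
)

# A temporary file stands in for a file in the data/ folder.
output_file <- tempfile(fileext = ".csv")
write_csv(actigraph_rr, output_file)
```

A full join keeps time points that appear in only one of the two datasets, with NA in the columns from the other. Whether that is what you want depends on the question you are asking.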

Then think of how some of these questions might be answered, using the Data Description as a guide:

When are people most likely to be lying down (inclinometer_lying)? How long do they lie down for?
When are people most likely to be sitting down (inclinometer_sitting)? Does this also correspond to a lower heart rate or a lower interbeat interval (from the RR data) compared to when standing?
Are people who have more activity throughout the day also reporting better sleep quality (like number_of_awakenings)?
Does the self-reported activity match the accelerometry data?
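
One possible starting point for the first question is to summarise by hour of the day. The sketch below assumes the dplyr and hms packages are installed, and that inclinometer_lying is 1 while lying down and 0 otherwise. That coding is an assumption to check against the Data Description, and the toy values are made up.

```r
library(dplyr)

# Toy Actigraph-like data; the values are made up.
toy_actigraph <- tibble::tibble(
  user_id = "user_1",
  time = hms::as_hms(c("02:15:00", "02:45:00", "14:10:00", "14:50:00")),
  inclinometer_lying = c(1, 1, 0, 1)
)

lying_by_hour <- toy_actigraph |>
  # Times are stored as seconds since midnight, so this gives the hour.
  mutate(hour = as.numeric(time) %/% 3600) |>
  group_by(hour) |>
  # Share of records in each hour where the person was lying down.
  summarise(proportion_lying = mean(inclinometer_lying), .groups = "drop")

lying_by_hour
```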

12.6 Other datasets to try out

If you have completed the exercises above and still want to practice the skills, here are some other datasets you can use (aside from your own):

MIMIC-IV demo data in the OMOP Common Data Model
RR interval time series from healthy subjects
Data on the 2021 Olympics in Tokyo
Find your own dataset from PhysioNet or Zenodo.
Work on the game “Murder Mystery” (which Anders Askeland converted into a game using R).