14  Introduction to workshop

Introduction slides

The slides contain speaking notes that you can view by pressing ‘S’ on the keyboard.

14.1 Background and context

Hypothetical research questions

Imagine that you are a researcher who wants to do a study. You’re interested in how stress affects physiological measures. You work in a hospital and have many nurses and other health professionals around you who regularly experience stress. So you came up with these research questions:

  1. How might stress during a nurse’s shift impact some simple physiological indicators, like heart rate?
  2. Will these physiological indicators be noticeably different during a stressful shift compared to less stressful shifts?
  3. Does the type of stress matter when it comes to the impact on physiological indicators?

Actual scenario

The nurses’ stress study is a real study that some researchers did. They collected physiological data from sensors worn by nurses during their shifts for several days in a row, with each device (or each specific measurement on a device) collecting data several times every second. They also had the nurses fill out a survey about their self-reported stress levels, where they thought the stress came from, and when they experienced the stress during their shifts.

14.2 💬 Discussion activity: What should the data look like to answer our research questions?

Time: ~12 minutes.

Think about the research questions. Now try to think about what a dataset might look like (as a single, “rectangular” dataset with columns and rows like a typical spreadsheet) that would allow you to answer those research questions.

  1. What might the columns be in order to answer the questions?
  2. What might the rows be or represent to answer the questions? Per nurse? Per timepoint? Per shift?
  3. How might these rows relate to the columns?
  4. What might the values be in the cells? What do they represent?
  5. How big might the data be?

We have the research questions. Now, what would the data exactly look like to be able to answer these questions? How is it structured?

  1. For 4 minutes, think about these questions on your own, writing down notes if you like. Try sketching out with pen and paper what the data might look like.
  2. For the next 4 minutes, discuss with a neighbour and compare your ideas.
  3. For the last 4 minutes, we’ll all share our ideas with the group and discuss.

Try to come to a concrete idea of what the data might look like. If you are able to, use a whiteboard or flipchart to write down what they share.

14.3 A brief overview of the data

Go through the data files and structure on the projector in RStudio, but they don’t need to follow along on their own computer for this part. Just show them the file tree and explain the data in a bit more detail. Especially remind them about the .gz part.

Now that we’ve discussed what the data might look like, let’s take a look at how the data actually is structured. Below is a file tree of the entire project.

LearnR3
├── data/
│   └── README.md
├── data-raw/
│   ├── README.md
│   ├── nurses-stress.tar
│   ├── nurses-stress/
│   │  ├── stress/
│   │  └── survey-results.csv.gz
│   └── nurses-stress.R
├── docs/
│   ├── README.md
│   ├── cleaning.qmd
│   └── learning.qmd
├── R/
│   ├── functions.R
│   └── README.md
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
├── README.md
└── TODO.md

When we open up the data-raw/nurses-stress/ folder, we see the survey-results.csv.gz file and the stress/ folder. The survey-results.csv.gz file is a compressed (gzipped) CSV file that contains the events when nurses experienced stress (or were asked to fill in the survey). In the stress/ folder are more folders, one for each participant. Inside each participant’s folder are folders for each date they wore the device(s), and inside those are the raw data files for each physiological measurement. Those raw .csv.gz files could cover a full day, a few hours, or even less than an hour, depending on multiple factors like the participant, their shift, or whether the device was working properly.
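As a rough sketch of how this nested layout could be explored from R, the snippet below builds a tiny, made-up copy of the folder structure in a temporary directory, then lists and reads the gzipped files. It uses only base R so that it runs without extra packages. The participant folder, date folder, file name, and column names are hypothetical placeholders, not the real names in the dataset.

```r
# Build a tiny, made-up copy of the folder layout in a temporary directory.
# Participant ID, date, file name, and columns are all hypothetical.
raw_dir <- tempfile("nurses-stress-")
day_dir <- file.path(raw_dir, "stress", "participant-01", "2020-01-01")
dir.create(day_dir, recursive = TRUE)

# Write a small gzipped CSV, standing in for one raw measurement file.
demo <- data.frame(timestamp = c(0, 1, 2), value = c(70, 72, 75))
con <- gzfile(file.path(day_dir, "HR.csv.gz"), open = "w")
write.csv(demo, con, row.names = FALSE)
close(con)

# Find every gzipped CSV below stress/, however deeply it is nested.
raw_files <- list.files(
  file.path(raw_dir, "stress"),
  pattern = "\\.csv\\.gz$",
  recursive = TRUE,
  full.names = TRUE
)

# R decompresses .gz files while reading, so no manual unzip step is needed.
first_file <- read.csv(gzfile(raw_files[1]))
```

Because the participant and date only appear in the folder names, the file path itself carries information that would need to be kept alongside the values when the files are read in.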

Warning

While this dataset is one of the cleanest/tidiest open datasets I’ve found that nicely aligns with the learning outcomes of this workshop, it is still quite messy and some parts don’t make a lot of sense. The documentation for the dataset is incomplete and there are some issues that only those with knowledge specific to this study could answer. Unfortunately, if you spend any time working with real-world data, this messiness and lack of documentation is very common.

14.4 💬 Discussion activity: What are some steps we might take to get our final dataset?

Time: ~10 minutes.

Now that you have a better idea of the data, let’s brainstorm some steps we might need to take to get the data from its current state to the single dataset that we can use to answer our research questions. While you haven’t yet seen the actual data, you can still brainstorm and consider some potential steps to take.

  1. For 2 minutes, on your own, think about some steps to take to get that single dataset.
  2. For the next 4 minutes, discuss with your neighbour and compare your ideas.
  3. For the last 4 minutes, we’ll share our ideas with the group and discuss.

After they’ve discussed this, come up with a relatively concrete, though high-level, list of steps that we need to take to get from the start to the end. In general, it might be something like this:

  1. Read in and process each heart rate (HR)-type data file.
  2. Merge all the HR-type data files together into a single dataset.
  3. Read in and process the survey results data file.
  4. Merge the HR-type datasets with the survey results dataset into a single dataset.
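To make the four steps above a little more concrete, here is a minimal sketch in base R that uses small, made-up data frames in place of the real files. The column names (participant_id, date, heart_rate, stress_level), the helper process_hr(), and the choice to average heart rate per participant and date are all assumptions for illustration; the real data and the functions used later in the workshop may look quite different.

```r
# Hypothetical stand-ins for two HR files that have already been read in.
hr_files <- list(
  data.frame(participant_id = "01", date = "2020-01-01", heart_rate = c(70, 74)),
  data.frame(participant_id = "02", date = "2020-01-01", heart_rate = c(80, 90))
)

# Step 1: process every file the same way with one function.
# Here "processing" is assumed to mean averaging per participant and date.
process_hr <- function(data) {
  aggregate(heart_rate ~ participant_id + date, data = data, FUN = mean)
}
hr_processed <- lapply(hr_files, process_hr)

# Step 2: stack the processed files into a single dataset.
hr_all <- do.call(rbind, hr_processed)

# Step 3: a hypothetical, already-processed survey dataset.
survey <- data.frame(
  participant_id = c("01", "02"),
  date = "2020-01-01",
  stress_level = c(1, 2)
)

# Step 4: join the HR data with the survey data on their shared columns.
final <- merge(hr_all, survey, by = c("participant_id", "date"))
```

Putting the processing into a single function means that the same code runs on every file, which is the idea behind the function-oriented approach.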

14.5 Workshop’s main “destination”

We use the nurses’ stress dataset as a case study to demonstrate the main “destination” of the workshop, which is not the final dataset itself. Instead, the core “destination” of the workshop is:

To apply a reproducible and programmatic approach to efficiently processing, cleaning, wrangling, and joining multiple data files into a single data file that is ready for analysis.

We’ll achieve this by using a “project-based workflow” (that we covered in the introduction workshop), using a “function-oriented workflow” (that we will cover and use throughout the workshop), and using Quarto documents for prototyping as well as for our final document that will contain the code to generate the final dataset.

By the end, the files and folders in our project should look like the below:

LearnR3
├── data/
│   ├── README.md
│   └── nurses-stress.csv <- Added
├── data-raw/
│   ├── README.md
│   ├── nurses-stress.tar
│   ├── nurses-stress/
│   │  ├── stress/
│   │  └── survey-results.csv.gz
│   └── nurses-stress.R
├── docs/
│   ├── README.md
│   ├── cleaning.html <- Added
│   ├── cleaning.qmd
│   ├── learning.html <- Added
│   └── learning.qmd
├── R/
│   ├── functions.R
│   └── README.md
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
├── README.md
└── TODO.md

This structure, particularly the use of the .Rproj file and the DESCRIPTION file (which we will cover in this workshop), is an expansion of the “project-based workflow” that is commonly used in R projects. We use this type of workflow because we want to follow some standard conventions in R, like having a DESCRIPTION metadata file, keeping raw data in the data-raw/ folder, and keeping (most) R scripts in the R/ folder. We also want to keep things structured to make it easier for others and ourselves to reproduce the work.