14 Introduction to workshop
The slides contain speaking notes that you can view by pressing "S" on the keyboard.
14.1 Background and context
Hypothetical research questions
Imagine that you are a researcher who wants to do a study. You're interested in how physiological measures are affected by stress. You work in a hospital and have many nurses and other health professionals around you who regularly experience stress. So you came up with these research questions:
- How might stress during a nurse's shift impact some simple physiological indicators, like heart rate?
- Will these physiological indicators be noticeably different during a stressful shift compared to less stressful shifts?
- Does the type of stress matter when it comes to the impact on physiological indicators?
Actual scenario
The nurses' stress study is a real study that some researchers did. They collected physiological data from sensors worn by nurses during their shifts for several days in a row, with each device (or each device's specific measurement) collecting data several times every second. They also had the nurses fill out a survey about their self-reported stress levels, where they thought the stress came from, and when they experienced the stress during their shifts.
14.2 💬 Discussion activity: What should the data look like to answer our research questions?
Time: ~12 minutes.
Think about the research questions. Now try to think about what a dataset might look like (as a single, "rectangular" dataset with columns and rows, like a typical spreadsheet) that would allow you to answer those research questions.
- What might the columns be in order to answer the questions?
- What might the rows be or represent to answer the questions? Per nurse? Per timepoint? Per shift?
- How might these rows relate to the columns?
- What might the values be in the cells? What do they represent?
- How big might the data be?
We have the research questions. Now, what exactly would the data need to look like to answer these questions? How would it be structured?
- For 4 minutes, think about these questions on your own, writing down notes if you like. Try sketching out with pen and paper what the data might look like.
- For the next 4 minutes, discuss with a neighbour and compare your ideas.
- For the last 4 minutes, we'll all share our ideas with the group and discuss.
Try to come to a concrete idea of what the data might look like. If you are able to, use a whiteboard or flipchart to write down what the learners share.
14.3 A brief overview of the data
Go through the data files and structure on the projector in RStudio. The learners don't need to follow along on their own computers for this part. Just show them the file tree and explain the data in a bit more detail. Especially remind them about the .gz part.
Now that we've discussed what the data might look like, let's take a look at how the data is actually structured. Below is a file tree of the entire project.
LearnR3
├── data/
│   └── README.md
├── data-raw/
│   ├── README.md
│   ├── nurses-stress.tar
│   ├── nurses-stress/
│   │   ├── stress/
│   │   └── survey-results.csv.gz
│   └── nurses-stress.R
├── docs/
│   ├── README.md
│   ├── cleaning.qmd
│   └── learning.qmd
├── R/
│   ├── functions.R
│   └── README.md
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
├── README.md
└── TODO.md
When we open up the data-raw/nurses-stress/ folder, we see the survey-results.csv.gz file and the stress/ folder. The survey-results.csv.gz file is a compressed (gzipped) CSV file that contains the events when nurses experienced stress (or were asked to fill in the survey). In the stress/ folder are more folders, one for each participant. Inside each participant's folder are folders for each date they wore the device(s), and inside those are the raw data files for each physiological measurement. Those raw .csv.gz files could cover a full day, a few hours, or even less than an hour, depending on multiple factors like the participant, their shift, or whether the device was working properly.
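To make this layout more concrete, here is a minimal sketch in base R of how a nested folder of gzipped CSV files could be explored. It builds a small mock folder in a temporary location rather than using the real data, so the participant ID (`P01`), the date folder, the `HR.csv.gz` file name, and the column names are all hypothetical.

```r
# Create a small mock of the folder layout in a temporary location.
# The participant ID, date, file name, and columns are made up for illustration.
raw_dir <- tempfile("nurses-stress-")
day_dir <- file.path(raw_dir, "stress", "P01", "2020-01-01")
dir.create(day_dir, recursive = TRUE)

# Write a gzip-compressed CSV file through a gzfile() connection.
con <- gzfile(file.path(day_dir, "HR.csv.gz"), open = "wt")
write.csv(
  data.frame(second = 1:3, heart_rate = c(71, 73, 72)),
  con,
  row.names = FALSE
)
close(con)

# Find every .csv.gz file, however deeply it is nested.
raw_files <- list.files(
  raw_dir,
  pattern = "\\.csv\\.gz$",
  recursive = TRUE,
  full.names = TRUE
)

# read.csv() can read a gzipped file directly, without unzipping it first.
heart_rate <- read.csv(raw_files[1])

# The folder names carry information that is not inside the file itself.
participant_id <- basename(dirname(dirname(raw_files[1])))
collection_date <- basename(dirname(raw_files[1]))
```

The last two lines hint at why the folder structure matters: in a layout like this, the participant and the date only exist in the file path, so they would need to be extracted from it when each file is read.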
While this dataset is one of the cleanest/tidiest open datasets I've found that nicely aligns with the learning outcomes of this workshop, it is still quite messy and some parts don't make a lot of sense. The documentation for the dataset is incomplete and there are some issues that only those with knowledge specific to this study could answer. Unfortunately, if you spend any time working with real-world data, this messiness and lack of documentation is very common.
14.4 💬 Discussion activity: What are some steps we might take to get our final dataset?
Time: ~10 minutes.
Now that you have a better idea of the data, let's brainstorm some steps we might need to take to get the data from its current state to the single dataset that we can use to answer our research questions. While you haven't yet seen the actual data, you can still brainstorm and consider some potential steps to take.
- For 2 minutes, on your own, think about some steps to take to get that single dataset.
- For the next 4 minutes, discuss with your neighbour and compare your ideas.
- For the last 4 minutes, we'll share our ideas with the group and discuss.
After they've discussed this, come up with a relatively concrete, though high-level, list of steps that we need to take to get from the start to the end. In general, it might be something like this:
- Read in and process each heart rate (HR)-type data file.
- Merge all the HR-type data files together into a single dataset.
- Read in and process the survey results data file.
- Merge the HR-type datasets with the survey results dataset into a single dataset.
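As a rough illustration of how these four steps might translate into code, here is a sketch in base R that runs on mock data written to a temporary folder. Everything specific in it is an assumption for demonstration purposes: the helper functions `write_gz_csv()` and `read_hr_file()`, the `HR.csv.gz` file name, the participant IDs, and the column names (`minute`, `heart_rate`, `participant_id`, `stress_level`) do not come from the real dataset.

```r
# Mock data only: the names and columns below are hypothetical.
raw_dir <- tempfile("nurses-stress-")

# Helper to write a data frame as a gzipped CSV, creating folders as needed.
write_gz_csv <- function(data, path) {
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  con <- gzfile(path, open = "wt")
  on.exit(close(con))
  write.csv(data, con, row.names = FALSE)
}

write_gz_csv(
  data.frame(minute = 1:2, heart_rate = c(70, 75)),
  file.path(raw_dir, "stress", "P01", "HR.csv.gz")
)
write_gz_csv(
  data.frame(minute = 1:2, heart_rate = c(80, 95)),
  file.path(raw_dir, "stress", "P02", "HR.csv.gz")
)
write_gz_csv(
  data.frame(
    participant_id = c("P01", "P02"),
    minute = c(2, 2),
    stress_level = c(1, 3)
  ),
  file.path(raw_dir, "survey-results.csv.gz")
)

# Step 1: read and process one HR file, taking the participant ID
# from the name of the folder that contains the file.
read_hr_file <- function(path) {
  data <- read.csv(path)
  data$participant_id <- basename(dirname(path))
  data
}

# Step 2: apply that function to every HR file and stack the results.
hr_files <- list.files(
  file.path(raw_dir, "stress"),
  pattern = "HR\\.csv\\.gz$",
  recursive = TRUE,
  full.names = TRUE
)
hr_all <- do.call(rbind, lapply(hr_files, read_hr_file))

# Step 3: read in the survey results.
survey <- read.csv(file.path(raw_dir, "survey-results.csv.gz"))

# Step 4: join the two, keeping every HR row even without a survey match.
combined <- merge(
  hr_all,
  survey,
  by = c("participant_id", "minute"),
  all.x = TRUE
)
```

Wrapping the first step in a function is what makes the second step short: the same function is applied to each file, no matter how many files there are.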
14.5 Workshop's main "destination"
We use the nurses' stress dataset as a case study to demonstrate the main "destination" of the workshop, which is not the final dataset itself. Instead, the core "destination" of the workshop is:
To apply a reproducible and programmatic approach to efficiently processing, cleaning, wrangling, and joining multiple data files into a single data file that is ready for analysis.
We'll achieve this by using a "project-based workflow" (that we covered in the introduction workshop), using a "function-oriented workflow" (that we will cover and use throughout the workshop), and using Quarto documents for prototyping as well as for our final document that will contain the code to generate the final dataset.
By the end, the files and folders in our project should look like the below:
LearnR3
├── data/
│   ├── README.md
│   └── nurses-stress.csv <- Added
├── data-raw/
│   ├── README.md
│   ├── nurses-stress.tar
│   ├── nurses-stress/
│   │   ├── stress/
│   │   └── survey-results.csv.gz
│   └── nurses-stress.R
├── docs/
│   ├── README.md
│   ├── cleaning.html <- Added
│   ├── cleaning.qmd
│   ├── learning.html <- Added
│   └── learning.qmd
├── R/
│   ├── functions.R
│   └── README.md
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
├── README.md
└── TODO.md
This structure, particularly the use of the .Rproj file and the DESCRIPTION file (that we will cover in this workshop), is an expansion of the "project-based workflow" that is commonly used in R projects. We use this type of workflow because we want to follow some standard conventions in R, like having a DESCRIPTION metadata file, keeping raw data in the data-raw/ folder, and keeping (most) R scripts in the R/ folder. We also want to keep things structured to make it easier for others and ourselves to reproduce the work.