13 Group project work
Walk through tasks 1 to 5 with them.
Get them to decide who will be the "owner" and who will be the "collaborator". Then use the "add collaborator" section on GitHub to have each team add the collaborator to the owner's repository.
Then walk the collaborators through how to clone the owner's repository to their computer and how to open the project in RStudio.
After that, walk through having each team member create a new Quarto document named after themselves. Commit and push those changes so that they can all get started sooner.
Maybe walk through the code used to initially process and clean the data.
This is a new section, so it might not be very smooth yet. We will share any issues that come up with the group and update this page immediately. So we will be co-creating this section together!! 🎉 😁
It's time to try what you've learned from this workshop on a new dataset! We've taken the dataset from this Zenodo record and tidied it up a bit for you, as it was a bit difficult to work with in its original form. And it was a bit big. So, work through these tasks in teams of 2-3:
1. Using the same repository as you have been using for this workshop, we're going to make a small change. Instead of there being only one person on the project, we will get you to add a collaborator so you can work together. Follow the instructions from the teacher for this step.
2. Every team member should then clone this new repository to their own computer. Again, follow the instructions from the teacher for this step.
3. Every team member should open up this project in RStudio. Then download the data from this link on Google Drive to your computer. Yes, it's less than ideal to use Google Drive, but it's an easy way to get the data to you.
4. Once you have downloaded it, save the file in the `data-raw/` folder of your project. This next step is important! One team member should run `usethis::use_git_ignore("data-raw/nurses-stress*")` so that you don't accidentally save the zip file to your Git repository. That team member should then commit the changes to Git and push them up to GitHub, and every other team member should pull the changes from GitHub to their own computer. If you open the `.gitignore` file, you should see that `data-raw/nurses-stress*` is now ignored by Git. This means it won't be included in your Git repository, which is good because it's a big file and we don't want to include it in our project.

5. Each team member should then create a new Quarto document for themselves, named after themselves, in the `docs/` folder. For example, if Luke was in your group, he would create a file called `docs/luke.qmd`. Use the menu: File > New File > Quarto Document. Commit the new file to Git. From now on, you will each work separately in this document, while pushing and pulling the changes to and from GitHub for the others to see.

6. Look over the documentation of the dataset. We have only done some minimal tidying of the data, mainly adding a timestamp column and converting the files into more compressed file formats. We have also only included the HR, TEMP, and EDA data files, as well as the main participant data file (`survey_results.csv`).
7. Unzip the file you downloaded (you can do this manually) into the `data-raw/` folder, keeping the same name as the zip file. Look through the files. The first level of folders is labeled by participant ID, and the sub-folders within them follow the pattern `ID_TIME-IN-SECONDS`. For instance, `15_1594140175` means participant 15, where 1594140175 is the number of seconds that have passed since 1970-01-01 (the "origin"); the small sketch just after this list shows how to convert that number into a date. Inside those folders are the data files for that participant, stored as `.parquet` files. Parquet is a data format that is very efficient for storing large amounts of data. You can load these files into R with the `arrow` package, using `arrow::read_parquet()`. You will need to install the `arrow` package and set the dependency with `usethis::use_package("arrow")`.
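To make those folder names a bit more concrete, here is a small sketch showing how to turn the example timestamp above into a human-readable date. This isn't a required step, just an illustration of what the numbers mean:

```r
# The folder name 15_1594140175 encodes participant 15 plus the number
# of seconds since 1970-01-01 (the "origin"). Base R can convert it:
as.POSIXct(1594140175, origin = "1970-01-01", tz = "UTC")
#> [1] "2020-07-07 16:42:55 UTC"

# The lubridate package does the same with:
lubridate::as_datetime(1594140175)
#> [1] "2020-07-07 16:42:55 UTC"
```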
Using the same functions and workflows we covered in the workshop, you want to end up with a single dataset that contains all the HR, TEMP, and EDA data for each participant, so that you can (eventually) analyse it.
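As a rough illustration of that goal, here is one possible shape the combined dataset could take. The column names and values below are entirely made up for this sketch; your final dataset may well look different:

```r
# A hypothetical illustration of one possible final structure. The
# column names and values are made up purely for this sketch.
library(tibble)

tribble(
  ~participant_id, ~timestamp,                                                ~hr, ~temp, ~eda,
  "15",            as.POSIXct(1594140175, origin = "1970-01-01", tz = "UTC"),  80,  33.0,  0.5
)
```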
As a group, discuss some ways to do this. One idea is to work in stages: first discuss how you will go about exploring the data and identifying what the issues are, then get into the details of fixing those issues.
Here are some things to get you started on the exploration tasks (a small code sketch pulling them together comes after the list):
- Get a list of all the files that end in `.parquet` in the `data-raw/` folder. Tip: `dir_ls()` has a `regexp` argument that lets you use regular expressions. For example, if you wanted to find only the `HR.parquet` files, you could use `regexp = "HR\\.parquet"`.
- Read in one of those files using `arrow::read_parquet()`.
- After reading it in, keep only the first 1000 rows of the data using `slice_head(n = 1000)` from the `{dplyr}` package.
- Then, identify how you want to join all the data together. This will probably require that you import each of the HR, TEMP, and EDA files to see what they contain.
- You will also need to import the `survey_results.csv` file, which will probably need some cleaning up too.
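Here is a minimal sketch that pulls these exploration steps together. It assumes the zip file was unzipped into `data-raw/` as described in the tasks above; the exact file paths on your computer may differ, so treat it as a starting point rather than a finished solution:

```r
library(fs)
library(dplyr)
library(purrr)

# List all the HR parquet files anywhere inside data-raw/
hr_files <- dir_ls("data-raw/", regexp = "HR\\.parquet", recurse = TRUE)

# Read one file in and keep only the first 1000 rows while exploring
one_hr <- arrow::read_parquet(hr_files[1]) |>
  slice_head(n = 1000)

# One way to stack all the HR files into a single data frame, keeping
# the file path so you can later extract the participant ID from it
all_hr <- hr_files |>
  set_names() |>
  map(arrow::read_parquet) |>
  list_rbind(names_to = "file_path")
```

The same pattern works for the TEMP and EDA files; after that, how to join everything together is exactly what the group discussion is for.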
If you'd like to see how this data was tidied up, check out this script. It uses almost all the same functions and processes we've covered in this workshop, plus some extra ones to help speed things up (since processing these large files can take some time if done "normally").