23 Project work
In this last session of the workshop, you will take what you learned and apply it to create a reproducible project while working in a team (of 2 or 3 people). This will help you reinforce what you learned in the previous sessions.
The aim of this project is to produce two files that are stored on GitHub:
- A reproducible Quarto file in docs/cleaning.qmd that cleans the data.
- A generated data/cleaned-dime.csv file that is created by the docs/cleaning.qmd file.
During the project, we want you to:
- Create and use functions that process the data.
- Store those functions in the R/functions.R file and use source() to load them in your Quarto document.
- Use functionals to apply those functions to the data.
- Process and join together the data into one dataset that can help answer a research question.
- Use Git and GitHub to keep track of changes to your project’s files and to collaborate together.
- Use docs/cleaning.qmd as the main file that only contains the code to clean and process the data. It shouldn’t be used for testing out code or doing exploration; that’s what the docs/sandbox-*.qmd files are for.
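The function-plus-functional pattern from the list above might look roughly like the sketch below. The function name add_participant_id(), the column names, and the toy values are all made up for illustration and are not taken from the DIME data. In your project, the function definition would live in R/functions.R and be loaded with source(here::here("R/functions.R")) rather than defined inline.

```r
library(dplyr)
library(purrr)

# Hypothetical function: in the project this would be stored in
# R/functions.R and loaded into docs/cleaning.qmd with source().
add_participant_id <- function(data, participant_id) {
  data |>
    mutate(id = participant_id, .before = 1)
}

# Made-up toy data standing in for two participants' imported files.
toy_files <- list(
  "101" = tibble(glucose = c(5.1, 5.4)),
  "102" = tibble(glucose = c(6.0, 5.8, 5.7))
)

# Functional: apply the function to every list element (imap() also
# passes the element's name), then stack the results into one data frame.
combined <- toy_files |>
  imap(\(data, name) add_participant_id(data, name)) |>
  list_rbind()

combined
```

Keeping the function small and separate from the code that applies it makes it easier to test on one file in a sandbox document before using it on all files.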
During the last 15-20 minutes of this session, the lead teacher will download your project from GitHub and re-generate your report to check that it is reproducible. The teacher will run approximately these functions after downloading your project:
- styler::style_dir() to check that your code is styled.
- quarto::quarto_render(here::here("docs/cleaning.qmd")) to check that your Quarto file is reproducible/can run.
- fs::file_exists(here::here("data/cleaned-dime.csv")) to check that the cleaned dataset was created.
23.1 Specific tasks
You will be collaborating as a team using Git and GitHub. We will have set up the project with Git and GitHub for you before the session, so you can quickly start collaborating together on the project. You will be pushing and pulling a lot of content, so you will need to maintain regular and open communication with your team member(s).
Below is a list of tasks to complete in order. Read through all the tasks first before starting any of them.
23.1.1 Clone team repository
All team members need to clone (download) your team’s repository to their own computer. The lead teacher will walk you through where to find your repository.
The steps to clone the repository are to:
- Copy the URL for your team’s repository from the team’s repository page on GitHub.
- Open RStudio and go to the “File” menu on the top menu bar.
- In the “File” menu, click “New Project…”, select “Version Control”, and then click “Git”.
- In the next screen, paste your team’s repository URL that you copied into the “Repository URL” box. Don’t change the name of the project in the “Project directory name”.
- Select the “Browse…” button next to the “Create project as subdirectory of” box and choose to save the project to your Desktop/ folder.
- Click the “Create Project” button.
After clicking it, RStudio will clone the repository to your computer and should then open it as an R project.
23.1.2 Read the documentation of the dataset
The dataset you will work with for this project is a study that examined how bioactive components in food impact our glucose levels and sleep. The study is called the DIME study (1).
The original dataset is a bit messy and difficult to initially work with. I’ve had to do some simple cleaning of the dataset to make it easier for you to start working with. If you want to see what cleaning I did, check out the clean.R file on the workshop repository. It uses many of the same functions, style, and processes that we’ve covered in this workshop.
Read the description of the DIME dataset that was taken from the DIME dataset page:
“The DIME study consists of a randomised 2x2 cross-over human intervention where healthy participants (n = 20) are subjected to a diet high in bioactive-rich food for two weeks and a diet low in bioactive-rich food. There is a four-week washout between the two interventions.
The continuous glucose monitoring was achieved using the Abbott freestyle libre flash glucose device. The baseline of the participants were determined 7 days before the start of the intervention, followed by the first arm and second arm. The period between the two arms (washout) was not recorded.
We also included sleep data which consists of the amount of time spent in bed and during that time the amount of time spent in light, deep and rem in all 20 participants during the course of the dietary intervention, both the high and low bioactive diet. that was captured using FitBit wearables during both stages of the dietary intervention”.
23.1.3 Download the dataset
All team members should now go back to the RStudio R project of your team’s repository. Then, open the data-raw/dime.R file. I’ve included code to download and unzip the data into a new folder for you. All team members should source() this file with Ctrl-Shift-S or with the Palette (Ctrl-Shift-P, then type “source”). This will download the dataset and save it to data-raw/dime/. Afterwards, the data-raw/ folder should look a bit like:
data-raw/
├── dime/
│ ├── cgm/
│ ├── sleep/
│ └── participant_details.csv
├── dime.zip
├── dime.R
└── README.md
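One way to confirm that the download worked is to list the files from R. The sketch below uses a hypothetical helper, list_csv_files(), and assumes that the files inside the cgm/ and sleep/ folders are CSV files, which the folder listing above does not confirm.

```r
library(fs)

# Hypothetical helper: list the CSV files found directly inside one folder.
list_csv_files <- function(folder) {
  dir_ls(folder, glob = "*.csv")
}

# In the project, assuming the folder layout above, you could run:
# list_csv_files(here::here("data-raw/dime/cgm"))
```

The vector of file paths it returns is also a natural input for a functional later on, when you import every file with the same function.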
23.1.4 Brainstorm a design for the flow of data processing
Like we did at the start of the workshop, take some time to think, brainstorm, design, and plan what needs to be processed and how you might do that. You can use pen and paper for this if you want.
To help you focus on the processing stage, these would be the research questions we want you to try to answer:
- How do glucose levels over time compare between the baseline and the different treatments, after controlling for gender and age?
- How do glucose levels over time compare between the two treatments, after controlling for gender and age?
- How does sleep quality impact this relationship between glucose levels and the different treatments?
From these questions, brainstorm what the data should look like as a single dataset in order to answer the research questions. Then investigate what the data looks like right now and identify what tasks generally need to be done to get from the current state of the data to the state that would allow you to answer the research questions.
The final dataset should have one row per participant per time point (e.g. every minute/half-hour), along with the necessary columns to answer the research questions above.
Compare this hypothetical dataset to the current state of the different data files. Now, try to list out and identify the different processing tasks that need to be done to get it from the current state to the ideal state.
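As a rough illustration of the target shape, the sketch below reduces toy glucose readings to one row per participant per time point and then joins on participant-level columns. All table names, column names, and values here are made up; the real DIME files will look different, so treat this as a starting point for your brainstorm rather than the processing you must do.

```r
library(dplyr)

# Made-up toy tables; the real DIME files will have different columns.
cgm <- tibble(
  id = c("101", "101", "101", "102"),
  hour = c(0, 0, 1, 0),
  glucose = c(5.0, 5.4, 6.0, 4.8)
)
participants <- tibble(
  id = c("101", "102"),
  age = c(30, 41)
)

# Step 1: reduce to one row per participant per time point.
cgm_hourly <- cgm |>
  summarise(mean_glucose = mean(glucose), .by = c(id, hour))

# Step 2: attach the participant-level columns to every time point.
final <- cgm_hourly |>
  left_join(participants, by = join_by(id))

final
```

Listing which columns each join needs (here only id) is a quick way to spot processing tasks, such as extracting a participant ID or rounding a timestamp, that have to happen before the join.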
23.1.5 Coordinate tasks so docs/cleaning.qmd creates a final cleaned dataset
👷 This part is being actively developed and edited. It will be added here once done.
23.1.6 Regularly style your code
Make sure your code is styled correctly by regularly using the Palette (Ctrl-Shift-P, then type “style file”) to style your code as you work in your Quarto document. Nicely formatted code is just as important as code that runs, as it makes it easier to read and understand. So practice styling your code often! It might help to run styler::style_dir() on your whole project as you get closer to finishing the project, so that everything is styled correctly.
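If you are curious what styling actually changes, styler can also style a piece of code given as text, without touching any files. The messy line below is a made-up example.

```r
library(styler)

# Deliberately badly formatted code, given as text.
messy <- "x<-c(1,2,3)"

# style_text() returns the styled version without modifying any files.
styled <- style_text(messy)
styled
```

With the default style, this adds the spaces around the assignment operator and after each comma.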
23.1.7 Regularly commit the changes
Use the “Git workflow” we’ve used throughout this workshop by adding to the staged area, committing, pushing, and pulling the changes that you and your teammate(s) will make. You may encounter merge conflicts. If you do, get one of the helpers or teachers to help you out!
23.1.8 Regularly render the Quarto document
Render your Quarto document to HTML often to ensure reproducibility of your document. You don’t need to commit the rendered HTML file. Git will ignore it automatically because we added it to the .gitignore file for you when we set the project up.
23.2 Quick “checklist” for a good project
- Project used Git and is on GitHub.
- Used Quarto for testing out code and prototyping how you would process the data.
- Created functions and stored them in the R/functions.R file.
- Used functionals to apply functions to chunks or groups of data.
- Used {tidyverse} for data processing.
- Code is correctly styled.
23.3 Checking reproducibility
At the end, the lead teacher will download each of the teams’ projects from GitHub and will test the reproducibility of the projects. As mentioned above, the teacher will run these functions after downloading your project:
- styler::style_dir()
- quarto::quarto_render(here::here("docs/cleaning.qmd"))
- fs::file_exists(here::here("data/cleaned-dime.csv"))
23.4 Expectations for the project
What we expect you to do for the team project:
- Use Git and GitHub throughout your work.
- Work collaboratively as a team and share responsibilities and tasks.
- Use as much of what we covered in the workshop to practice what you learned.
What we don’t expect:
- Complicated analysis or coding. The simpler it is, the easier it is for you to do the coding and understand what is going on. It also helps us to see that you’ve practiced what you’ve learned.
- Clever or overly concise code. Clearly written and readable code is always better than clever or concise code. Keep it simple and understandable!
Essentially, the team project is a way to reinforce what you learned during the workshop, but in a more relaxed and collaborative setting. The point is to give you space and time to practice what you learned, not to create a perfect (or even reproducible) project.