23 Project work
In this last session of the workshop, you will take what you learned and apply it to create a reproducible project while working in a team (of 2 or 3 people). This will help you reinforce what you learned in the previous sessions.
The aim of this project is to produce two files, one that is stored on GitHub and one that is ignored by Git:
- A reproducible Quarto file in docs/cleaning.qmd that cleans the data.
- A generated data/cleaned-dime.csv file that is created by the docs/cleaning.qmd file, but that is not stored on GitHub.
During the project, we want you to:
- Create and use functions that process the data.
- Store those functions in each of your R/functions-*.R files and use source() to use them in your Quarto document.
- Use functionals to apply those functions to the data.
- Process and join together the data into one dataset that can help answer a research question.
- Use Git and GitHub to keep track of changes to your project’s files and to collaborate together.
- Use docs/cleaning.qmd as the main file that only contains the code to clean and process the data. It shouldn’t be used for testing out code or doing exploration; that’s what the docs/sandbox-*.qmd files are for. It should source all of your R/functions-*.R files to use the functions you created in those files.
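As a minimal sketch of the sourcing step, assuming two hypothetical function files (the file names and the tiny functions inside them are made up for illustration, and a temporary folder stands in for your project's R/ folder), the top of docs/cleaning.qmd could source every matching file like this:

```r
library(fs)
library(purrr)

# A temporary folder stands in for the project's R/ folder so that this
# sketch runs anywhere. In the real project you would point at
# here::here("R") instead.
r_dir <- path(path_temp(), "R")
dir_create(r_dir)

# Two hypothetical function files, one per team member.
writeLines("add_one <- function(x) x + 1", path(r_dir, "functions-member-a.R"))
writeLines("double_it <- function(x) x * 2", path(r_dir, "functions-member-b.R"))

# Find every functions-*.R file and source each one in turn.
function_files <- dir_ls(r_dir, glob = "*functions-*.R")
walk(function_files, source)

# Functions from both files can now be used together.
result <- double_it(add_one(1))
```

Sourcing by pattern rather than by individual file name means a new team member's file is picked up without editing docs/cleaning.qmd, but listing each source() call explicitly works just as well.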
During the last 15-20 minutes of this session, the lead teacher will download your project from GitHub and re-generate your report to check that it is reproducible. The teacher will run approximately these functions after downloading your project:
- styler::style_dir() to check that your code is styled.
- quarto::quarto_render(here::here("docs/cleaning.qmd")) to check that your Quarto file is reproducible/can run.
- fs::file_exists(here::here("data/cleaned-dime.csv")) to check that the cleaned dataset was created.
23.1 Specific tasks
You will be collaborating as a team using Git and GitHub. We will have set up the project with Git and GitHub for you before the session, so you can quickly start collaborating together on the project. You will be pushing and pulling a lot of content, so you will need to maintain regular and open communication with your team member(s).
Below is a list of tasks to complete in order. Read through all the tasks first before starting any of them.
23.1.1 Clone team repository
All team members need to clone (download) your team’s repository to their own computer. The lead teacher will walk you through where to find your repository.
The steps to clone the repository are to:
- Copy the URL for your team’s repository from the team’s repository page on GitHub.
- Open RStudio and go to the “File” menu on the top menu bar.
- In the “File” menu, click “New Project…”, select “Version Control”, and then click “Git”.
- In the next screen, paste your team’s repository URL that you copied into the “Repository URL” box. Don’t change the name of the project in the “Project directory name”.
- Select the “Browse…” button next to the “Create project as subdirectory of” box and choose to save the project to your Desktop/ folder.
- Click the “Create Project” button.
After clicking it, RStudio will clone the repository to your computer and should then open it as an R project.
23.1.2 Read the documentation of the dataset
The dataset you will work with for this project is a study that examined how bioactive components in food impact our glucose levels and sleep. The study is called the DIME study (1).
The original dataset is a bit messy and difficult to initially work with. I’ve had to do some simple cleaning of the dataset to make it easier for you to start working with. If you want to see what cleaning I did, check out the clean.R file on the workshop repository. It uses many of the same functions, style, and processes that we’ve covered in this workshop.
Read the description of the DIME dataset that was taken from the DIME dataset page:
“The DIME study consists of a randomised 2x2 cross-over human intervention where healthy participants (n = 20) are subjected to a diet high in bioactive-rich food for two weeks and a diet low in bioactive-rich food. There is a four-week washout between the two interventions.
The continuous glucose monitoring was achieved using the Abbott freestyle libre flash glucose device. The baseline of the participants were determined 7 days before the start of the intervention, followed by the first arm and second arm. The period between the two arms (washout) was not recorded.
We also included sleep data which consists of the amount of time spent in bed and during that time the amount of time spent in light, deep and rem in all 20 participants during the course of the dietary intervention, both the high and low bioactive diet. that was captured using FitBit wearables during both stages of the dietary intervention”.
23.1.3 Download the dataset
All team members should now go back to the RStudio R project of your team’s repository. Then, open the data-raw/dime.R file. I’ve included a TODO task to manually download the data, along with code to unzip the data into a new folder for you. All team members should manually download the data into data-raw/ and save the file as dime.zip. Then run source() on the data-raw/dime.R file with Ctrl-Shift-S or with the Palette (Ctrl-Shift-P, then type “source”). This will unzip the dataset into data-raw/dime/. Afterwards, the data-raw/ folder should look a bit like:
data-raw/
├── dime/
│ ├── cgm/
│ ├── sleep/
│ └── participant_details.csv
├── dime.zip
├── dime.R
└── README.md
23.1.4 Brainstorm a design for the flow of data processing
Like we did at the start of the workshop, take some time to think, brainstorm, design, and plan what needs to be processed and how you might do that. You can use pen and paper for this if you want.
To help you focus on the processing stage, these would be the research questions we want you to try to answer:
- How do glucose levels over time compare between the baseline and the different treatments, after controlling for gender and age?
- How do glucose levels over time compare between the two treatments, after controlling for gender and age?
- How does sleep quality impact this relationship between glucose levels and the different treatments?
From these questions, brainstorm what the data should look like as a single dataset in order to answer the research questions. Then investigate what the data looks like right now and identify what tasks generally need to be done to get from the current state of the data to the state that would allow you to answer the research questions.
The final dataset should have one row per participant per time point (e.g. every minute/half-hour), along with the necessary columns to answer the research questions above.
Compare this hypothetical dataset to the current state of the different data files. Now, try to list out and identify the different processing tasks that need to be done to get it from the current state to the ideal state.
23.1.5 Coordinate tasks so docs/cleaning.qmd creates a final cleaned dataset
This project is fairly open-ended but here are some potential steps or tasks to do:
- Before starting to code, brainstorm and decide on what the final dataset should look like in order to answer the research questions. Think and be explicit about what the columns are, what the rows are, and so on.
- Open up the raw DIME data files in a spreadsheet or with a simple read_csv() to better understand what the data looks like right now. As you do, brainstorm and design what potential processing steps might be needed to get the data into the right format and structure to answer the research questions.
- In each team member’s own docs/sandbox-*.qmd file, start to code so that you can read in the data and do some of the processing steps you brainstormed. Use what we covered in the workshop. Remember: “if you can do it for one, you can do it for all”. Start with reading in one file, tidying it up if you need to, and then write a function that does the reading for the one file. Then, use functionals to apply that function to all the files in the folder. Do this for each of the different data files.
- Communicate and coordinate with your team member(s) when you make functions and add them into each of your R/functions-*.R files, so that you can re-use each other’s functions in your docs/sandbox-*.qmd files by using source().
- In your docs/cleaning.qmd file, use source() to source all of your R/functions-*.R files, so that you can use the functions you created in those files in your docs/cleaning.qmd file.
- As you build and tidy up the dataset, move your finalized code into docs/cleaning.qmd and make sure it runs and creates the cleaned dataset in data/cleaned-dime.csv. You can use View() to check what the data looks like at each step. Make sure to communicate with your team member(s) as you move code into docs/cleaning.qmd.
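The “if you can do it for one, you can do it for all” pattern from the steps above could look roughly like the sketch below. It is only an illustration on toy files written to a temporary folder: the folder name, file names, column names, and the read_one() helper are all hypothetical and do not describe the real DIME files.

```r
library(fs)
library(readr)
library(purrr)
library(dplyr)

# Toy stand-in for a folder of raw files (hypothetical names and columns).
raw_dir <- path(path_temp(), "toy-cgm")
dir_create(raw_dir)
write_csv(
  tibble(time = c("00:00", "00:15"), glucose = c(5.1, 5.4)),
  path(raw_dir, "101.csv")
)
write_csv(
  tibble(time = c("00:00", "00:15"), glucose = c(6.0, 5.8)),
  path(raw_dir, "102.csv")
)

# Step 1: a function that reads one file and records which file it came from.
read_one <- function(file) {
  read_csv(file, show_col_types = FALSE) |>
    mutate(id = as.character(path_ext_remove(path_file(file))), .before = 1)
}

# Step 2: use a functional to apply that function to every file in the
# folder, then stack the results into one data frame.
all_data <- raw_dir |>
  dir_ls(glob = "*.csv") |>
  map(read_one) |>
  list_rbind()
```

Once a function like this works in your sandbox file, it can move into your R/functions-*.R file and be reused by the rest of the team.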
You might need to use pivot_wider() to tidy up the sleep data.
You might need to summarise by day for the sleep and glucose data in order to effectively join them together, especially for joining with the participant details data.
You might need to complete() the participant details by "1 day" in order to effectively join it with the sleep and glucose data.
As you prototype, you can pipe into View() to check how the data looks at each step.
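To make the hints above a bit more concrete, here is a rough sketch on toy data of widening, summarising by day, completing by "1 day", and joining. Every dataset, column name, and value below is invented for illustration, so treat it as one possible shape for the steps rather than the way the DIME data must be processed.

```r
library(dplyr)
library(tidyr)

# Toy data: all column names and values are hypothetical.
sleep_long <- tibble(
  id = "101",
  date = as.Date(c("2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02")),
  sleep_type = c("light", "deep", "light", "deep"),
  hours = c(4, 2, 5, 1.5)
)
glucose <- tibble(
  id = "101",
  date = as.Date(c("2021-01-01", "2021-01-01", "2021-01-02")),
  glucose = c(5, 6, 7)
)
participants <- tibble(
  id = "101",
  date = as.Date(c("2021-01-01", "2021-01-03")),
  intervention = c("baseline", "high")
)

# One column per sleep type, giving one row per participant per day.
sleep_wide <- sleep_long |>
  pivot_wider(names_from = sleep_type, values_from = hours)

# Summarise glucose so there is one row per participant per day.
glucose_daily <- glucose |>
  summarise(mean_glucose = mean(glucose), .by = c(id, date))

# Add a row for every day between each participant's first and last date.
# The newly added rows contain NA in the other columns.
participants_daily <- participants |>
  group_by(id) |>
  complete(date = seq(min(date), max(date), by = "1 day")) |>
  ungroup()

# All three datasets now share the same keys and can be joined.
combined <- participants_daily |>
  left_join(glucose_daily, by = c("id", "date")) |>
  left_join(sleep_wide, by = c("id", "date"))
```

Whether you summarise to daily values or keep a finer time resolution is a design decision for your team; the sketch uses days only because that is what the hints mention.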
23.1.6 Regularly style your code
Make sure your code is styled correctly by regularly using the Palette (Ctrl-Shift-P, then type “style file”) to style your code as you work in your Quarto document. Nicely formatted code is just as important as code that runs, as it makes it easier to read and understand. So practice styling your code often! It might help to run styler::style_dir() on your whole project as you get closer to finishing the project, so that everything is styled correctly.
23.1.7 Regularly commit the changes
Use the “Git workflow” we’ve used throughout this workshop by adding to the staged area, committing, pushing, and pulling the changes that you and your teammate will make. You may encounter merge conflicts. If you do, get one of the helpers or teachers to help you out!
23.1.8 Regularly render the Quarto document
Render your Quarto document to HTML often to ensure reproducibility of your document. You don’t need to commit the rendered HTML file. It will be automatically ignored by Git because we added it to the .gitignore file for you when we set the project up.
23.2 Quick “checklist” for a good project
- Project used Git and is on GitHub.
- Used Quarto for testing out code and prototyping how you would process the data.
- Created functions and stored them in the R/functions-*.R files.
- Used functionals to apply functions to chunks or groups of data.
- Used {tidyverse} for data processing.
- Code is correctly styled.
23.3 Checking reproducibility
At the end, the lead teacher will download each of the teams’ projects from GitHub and will test the reproducibility of the projects. As mentioned above, the teacher will run these functions after downloading your project:
- styler::style_dir()
- quarto::quarto_render(here::here("docs/cleaning.qmd"))
- fs::file_exists(here::here("data/cleaned-dime.csv"))
23.4 Expectations for the project
What we expect you to do for the team project:
- Use Git and GitHub throughout your work.
- Work collaboratively as a team and share responsibilities and tasks.
- Use as much of what we covered in the workshop to practice what you learned.
What we don’t expect:
- Complicated analysis or coding. The simpler it is, the easier it is for you to do the coding and understand what is going on. It also helps us to see that you’ve practiced what you’ve learned.
- Clever or overly concise code. Clearly written and readable code is always better than clever or concise code. Keep it simple and understandable!
Essentially, the team project is a way to reinforce what you learned during the workshop, but in a more relaxed and collaborative setting. The point is to give you space and time to practice what you learned, not to create a perfect (or even reproducible) project.
23.5 Want to see how it could be done?
If you’re really stuck or if you’re just curious to see how we might resolve this problem, you can check out the code on the GitHub repository in data-raw/dime.R. Only look at this code after you’ve worked on the project in your team for a while. The point of this team project is to give you a chance to practice what you learned and solidify your understanding of the different concepts and tools we covered in the workshop. The code linked above is only an example of what could be done. It doesn’t include all the things we covered in this workshop, nor does it fix all the potential issues with the data. But it can be helpful to see how you could solve some of these problems.
23.6 Survey
Please complete the survey for this session: