4 Introduction to workshop
🚧 We are making major changes to this workshop, so much of the content will change. 🚧
The slides contain speaking notes that you can view by pressing "S" on the keyboard.
4.1 📖 Reading task: The big picture
Time: ~15 minutes.
This section provides a bigger-picture view of what we will be doing, why we want to do it, how we will go about doing it, and what it will look like in the end. The overall practical aim of this workshop is:
To apply a reproducible and programmatic approach to efficiently downloading a dataset that has multiple data files, then processing and cleaning them all, and saving them as a single data file to use for later analysis.
In our case, during the workshop we want to process the DIME data so that we have a single dataset to work with for later (hypothetical) analyses. This can be visually summarised as:
We also want to make sure that whatever processing we do to the data is reproducible, so we'll be writing the code in a way that enables and enforces reproducibility.
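As a rough sketch of that aim, the snippet below reads several CSV files from one folder, stacks them, and writes a single file. It uses only base R and made-up example files in a temporary folder; the file names, the columns, and the `combine_csv_files()` helper are hypothetical and are not the workshop's actual code or the layout of the DIME data.

```r
# Hypothetical helper: read every CSV file in `folder` and stack them.
combine_csv_files <- function(folder) {
  files <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)
  # Read each file and record which file each row came from.
  tables <- lapply(files, function(file) {
    data <- read.csv(file)
    data$source_file <- basename(file)
    data
  })
  do.call(rbind, tables)
}

# Made-up example data standing in for a multi-file dataset.
raw_folder <- file.path(tempdir(), "example-raw")
dir.create(raw_folder, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, value = c(5.1, 6.3)),
          file.path(raw_folder, "101.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c(4.8, 7.0)),
          file.path(raw_folder, "102.csv"), row.names = FALSE)

combined <- combine_csv_files(raw_folder)

# Save everything as one file for later analysis.
output_file <- file.path(tempdir(), "combined.csv")
write.csv(combined, output_file, row.names = FALSE)
```

Because the whole process is code, rerunning it from the same input files should recreate the same combined file.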
So how specifically will we do this? At the highest level, everything revolves around files and folders. Below is what the project currently looks like and what it will look like at the end of the workshop. Notice that the only differences are the addition of a new file in the data/ folder and a new HTML file in the docs/ folder. The new file in the data/ folder is the final dataset we will be creating, and the new HTML file in the docs/ folder is the rendered Quarto document that we will be creating.
Currently looks like:
LearnR3
├── data/
│   └── README.md
├── data-raw/
│   ├── README.md
│   ├── dime.zip
│   ├── dime/
│   │   ├── cgm/
│   │   ├── sleep/
│   │   └── participant_details.csv
│   └── dime.R
├── docs/
│   ├── README.md
│   └── learning.qmd
├── R/
│   ├── functions.R
│   └── README.md
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
├── README.md
└── TODO.md
End of workshop:
LearnR3
├── data/
│   ├── README.md
│   └── dime.csv <- Added
├── data-raw/
│   ├── README.md
│   ├── dime.zip
│   ├── dime/
│   │   ├── cgm/
│   │   ├── sleep/
│   │   └── participant_details.csv
│   └── dime.R
├── docs/
│   ├── README.md
│   ├── learning.html <- Added
│   └── learning.qmd
├── R/
│   ├── functions.R
│   └── README.md
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
├── README.md
└── TODO.md
This structure, particularly the use of the .Rproj file and the DESCRIPTION file (which we will cover in this workshop), is part of a "project-based workflow" that is commonly used. We use this type of workflow because we want to follow some standard conventions in R, like having a DESCRIPTION metadata file, keeping raw data in the data-raw/ folder, and keeping (most) R scripts in the R/ folder. We also want to keep things structured to make it easier for others and ourselves to reproduce the work. Below are sections for the different workflows we will be using in the workshop.
Download or copy data to data-raw/, only change with R
In general, you want to store the original raw data for your project in a folder within your R Project and never change it. Keeping the original data intact means you can always go back to it if you need to. Since the raw data is rarely used directly for analyses, you'd use R to process the data and save the processed data to a new file that you'd then use for your analysis. This workflow is visually represented below in Figure 4.2.
In R Projects, the common convention is to keep the raw data in the data-raw/ folder, which is what you've already done during the pre-workshop tasks. This is where you would download or copy the data files you need for your project.
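A minimal sketch of this read-only convention, using base R and a temporary folder in place of the real project: the raw file is only read, the cleaned version is written to a separate data/ folder, and a checksum confirms the raw file is unchanged. The file name `measurements.csv` and its columns are invented for illustration.

```r
# Stand-in project folders inside a temporary directory.
project <- file.path(tempdir(), "example-project")
dir.create(file.path(project, "data-raw"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "data"), showWarnings = FALSE)

# Invented raw data with a missing value.
raw_file <- file.path(project, "data-raw", "measurements.csv")
write.csv(data.frame(id = 1:3, glucose = c(5.4, NA, 6.1)),
          raw_file, row.names = FALSE)
checksum_before <- tools::md5sum(raw_file)

# Process in R: read the raw file and drop incomplete rows.
raw <- read.csv(raw_file)
processed <- raw[!is.na(raw$glucose), ]

# Save the result to data/, never back to data-raw/.
processed_file <- file.path(project, "data", "measurements.csv")
write.csv(processed, processed_file, row.names = FALSE)

# Recompute the checksum to confirm the raw file was not modified.
checksum_after <- tools::md5sum(raw_file)
stopifnot(identical(checksum_before, checksum_after))
```

The checksum comparison is optional; it is included here only to make the "never change the raw data" rule visible in code.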
Use Quarto as a sandbox to prototype code
Quarto is an incredibly powerful tool when working to ensure your code is reproducible. For this workshop, the Quarto document we use is at docs/learning.qmd.
With Quarto, it's easy to quickly test code out and make sure it works. Think of the Quarto file as a sandbox to test out and play with code, without fear of messing things up. When you render the document to, for example, HTML, Quarto runs all the code from beginning to end in a new, completely clean environment, which means you can test part of the reproducibility of your code. This lets you detect errors and issues with the code, making Quarto an excellent prototyping tool. The workflow is represented below in Figure 4.3.
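To see why running code from a clean state catches problems, here is a small base R sketch. It is only an analogy for rendering: instead of rendering a document, it evaluates code in a new environment that cannot see the objects in your workspace. The object name `leftover` is made up.

```r
# An object created interactively, for example while experimenting
# in the Console. It is not defined anywhere in the document's code.
leftover <- 10

# Code that silently depends on that object.
code <- quote(leftover * 2)

# In the current workspace the code works.
in_workspace <- eval(code)

# A fresh environment that only sees base R, not the workspace.
clean <- new.env(parent = baseenv())
in_clean_state <- tryCatch(
  eval(code, envir = clean),
  error = function(e) conditionMessage(e)
)
```

Here `in_clean_state` holds an error message rather than a number. This is the kind of hidden dependency that running everything from a clean state brings to the surface.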
For this workshop, we also use Quarto because it's easier from a teaching perspective. As the learner, you can easily mix text and notes to yourself with the code you write during the code-alongs and the exercises. We also use it so that you practice working within Quarto documents, a key component of reproducibility, in a supportive learning environment.
Use R/functions.R to keep stable, tested code
While you use Quarto to test out and prototype code, you'll use R scripts like R/functions.R to keep the code you have already tested and are fairly confident works as you want it to. This workflow, of creating code and converting it into a function, is called a "function-based workflow". This is an incredibly common workflow in R projects and forms the basis for many other workflows and tools, such as ones that are covered in the advanced workshop.
So you'll use Quarto (docs/learning.qmd) to write and test out code, convert it into a function (which we will cover in this workshop), and then move it into the R/functions.R script. We make this split to keep a cognitive and physical separation between prototyping code and finalized, tested code. Then, within the Quarto document, we can source() the R/functions.R script so we have access to the stable and tested code. This workflow is represented below in Figure 4.4.
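A minimal sketch of this split, assuming a made-up function `celsius_to_kelvin()`. A temporary file stands in for R/functions.R so the example runs on its own; in the project, you would instead give source() the path to the real R/functions.R script.

```r
# A temporary file stands in for R/functions.R in this sketch.
functions_file <- tempfile(fileext = ".R")

# The "stable" code: a hypothetical, already-tested function.
writeLines(
  c(
    "celsius_to_kelvin <- function(celsius) {",
    "  celsius + 273.15",
    "}"
  ),
  functions_file
)

# In the Quarto document: load the stable functions, then use them.
source(functions_file)
kelvin <- celsius_to_kelvin(c(0, 25))
```

Keeping the source() call near the top of the Quarto document is one way to make the stable functions available to every code chunk that follows.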
Workflow for prototyping code
At the level of the code, the way you prototype code is to:
- Write it out in Quarto so that it does what you want.
- Convert that code into a function.
- Test that the function works either in the Quarto document or in the R Console.
- Fix the function if it doesn't work.
- Restart the R console with Ctrl-Shift-F10 or with the Palette (Ctrl-Shift-P, then type "restart"), or render with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type "render"), to test that the function works.
- Whenever the function works, add and commit the changes to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type "commit").
Restarting R or rendering the Quarto document is the only way to be certain the R workspace is in a clean state. Code that runs correctly from a clean state is more likely to make your project reproducible.
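The first few steps of the list above can be sketched as follows. The data and the `mean_by_id()` function are invented for illustration; the functions written in the workshop will differ.

```r
# Step 1: prototype code that does what you want.
example <- data.frame(
  id = c("a", "a", "b"),
  value = c(1, 3, 10)
)
prototype_result <- aggregate(value ~ id, data = example, FUN = mean)

# Step 2: convert that code into a function.
mean_by_id <- function(data) {
  aggregate(value ~ id, data = data, FUN = mean)
}

# Step 3: test that the function gives the same result as the prototype.
function_result <- mean_by_id(example)
stopifnot(identical(function_result, prototype_result))
```

If the check in step 3 fails, you would fix the function and run the check again before moving the function anywhere else.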
We use Git because it is the best way of keeping track of what was done to your files, when, and why. It helps to keep your work transparent and makes it easier for you to share your code by uploading to GitHub. Using version control should be a standard practice for doing better science, since it fits with the philosophy of doing science (e.g., transparency, reproducibility, and documentation).
While we covered GitHub in the introductory workshop, we can't assume everyone will have taken that workshop. Because of that, we won't be using GitHub in this workshop.
4.2 🧑‍💻 Exercise: What types of workflows do you use?
Time: ~8 minutes.
The process of learning is about taking the content you just learned and trying to integrate it into your own context and situation, as well as talking about it to reinforce it in your memory. So:
- Take 2 minutes to think about the workflows you use in your own work and how they compare to the ones you've read about above and that we all went over briefly. Try to be exact and specific in what you do and how you do it.
- Then, with your neighbour, take 6 minutes to share what each of you has thought about and discuss it. How do your ways of working compare to each other and to what was described above?