This section provides a bigger-picture view of what we will be doing, why we want to do it, how we will go about doing it, and what it will look like in the end.
Big picture aim:
To apply a reproducible and programmatic approach to efficiently downloading multiple data files, processing and cleaning them, and saving them as a single data file.
In our case during the course, we ultimately want to process the MMASH data so that we have a single dataset to work with for later (hypothetical) analyses.
We also want to make sure that whatever processing we do to the data is reproducible. So everything we’ll do throughout the course is done in order to achieve our aim of making a single dataset and to make things reproducible.
Both the folder and file structures below and Figure 4.1 show exactly what we will be doing and what it will look like at the file level (not at the R code level). Hopefully with this overview, you can better understand where we are and where we want to get to.
Everyone’s project structure starts off looking like the “Currently looks like” structure and ends looking like the “End of course” structure. You’ll notice very few changes between the two, except for some files removed (e.g. TODO.md) and others added (e.g. data/mmash.rda and doc/learning.html).
Currently looks like:
LearnR3
├── data/
│ └── README.md
├── data-raw/
│ ├── README.md
│ ├── mmash-data.zip
│ ├── mmash/
│ │ ├── user_1
│ │ ├── ...
│ │ └── user_22
│ └── mmash.R
├── doc/
│ ├── README.md
│ └── learning.qmd
├── R/
│ ├── functions.R
│ └── README.md
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
├── README.md
└── TODO.md
End of course:
LearnR3
├── data/
│ ├── README.md
│ └── mmash.rda
├── data-raw/
│ ├── README.md
│ ├── mmash-data.zip
│ ├── mmash/
│ │ ├── user_1
│ │ ├── ...
│ │ └── user_22
│ └── mmash.R
├── doc/
│ ├── README.md
│ ├── learning.html
│ └── learning.qmd
├── R/
│ ├── README.md
│ └── functions.R
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
└── README.md
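As a quick check of what actually changes between the two structures, the file lists can be compared directly. This is only a sketch: the vectors below are typed in by hand from the two listings above, and the contents of data-raw/mmash/ are left out because they are elided in the listings.

```r
# File paths copied by hand from the "Currently looks like" listing.
current <- c(
  "data/README.md",
  "data-raw/README.md", "data-raw/mmash-data.zip", "data-raw/mmash.R",
  "doc/README.md", "doc/learning.qmd",
  "R/functions.R", "R/README.md",
  ".gitignore", "DESCRIPTION", "LearnR3.Rproj", "README.md", "TODO.md"
)

# File paths copied by hand from the "End of course" listing.
end_of_course <- c(
  "data/README.md", "data/mmash.rda",
  "data-raw/README.md", "data-raw/mmash-data.zip", "data-raw/mmash.R",
  "doc/README.md", "doc/learning.html", "doc/learning.qmd",
  "R/README.md", "R/functions.R",
  ".gitignore", "DESCRIPTION", "LearnR3.Rproj", "README.md"
)

# Files that only exist at the end of the course.
added <- setdiff(end_of_course, current)

# Files that exist now but are gone by the end of the course.
removed <- setdiff(current, end_of_course)

print(added)
print(removed)
```

Here `added` holds data/mmash.rda and doc/learning.html, and `removed` holds TODO.md, matching the description above.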
Why do we structure it this way? Because we want to follow some standard conventions in R, like having a DESCRIPTION metadata file, keeping raw data in the data-raw/ folder, and keeping (most) R scripts in the R/ folder. We also want to keep things structured to make it easier for others and ourselves to reproduce the work.
Our workflow will generally look like Figure 4.1, with each block representing one or two sessions. We’ve already done a bit of the first block, “Download raw data”.
Our workflow and process will be something like:
- Make an R script (data-raw/mmash.R) to download the dataset (data-raw/mmash/), which you already did in the pre-course tasks.
- We do this to keep the code for downloading and processing the data together with the raw data.
- Use Quarto (doc/learning.qmd) to write and test out code, convert it into a function, test it, and then move it into R/functions.R. Later we will also move code into the data-raw/mmash.R script. We use this workflow because:
- It’s easier to quickly test code out and make sure it works before moving the code over into a more formal location and structure. Think of using the Quarto file as a sandbox to test out and play with code, without fear of messing things up.
- We also test code out in the Quarto file because, from a teaching perspective, it’s easier to interweave text and comments with code as we do the code-alongs. It also forces you to practice working within Quarto documents, which are key components of a reproducible workflow in R.
- Rendering a Quarto file runs all the code from beginning to end in a new (so completely clean) environment, meaning we can test the reproducibility of our code. This lets us detect errors and issues with the code, making Quarto an excellent prototyping tool.
- We keep the functions in a separate file because we will frequently source() from it as we prototype and test out code in the Quarto file. It also creates a clear separation between “finalized” code and prototyping code.
- Use a combination of restarting R (Ctrl-Shift-F10, or with the Palette: Ctrl-Shift-P, then type “restart”) and using source() (Ctrl-Shift-S, or with the Palette: Ctrl-Shift-P, then type “source”) while in R/functions.R to run the functions inside of R/functions.R.
- We restart R because it is the only sure way to clear the R workspace and start from a “clean plate”. And for reproducibility, we should always aim to work from a clean plate.
- Replace the old code in the Quarto document (doc/learning.qmd) with the new function.
- Finish building functions that prepare the dataset, use them in the cleaning script (data-raw/mmash.R), run the script, and save the data as an .rda file (data/mmash.rda).
- Like with the R/functions.R script, we move the finalized data processing code over from the Quarto file into the data-raw/mmash.R script to have a clear separation of completed code and prototyping code.
- Why is the data-raw/mmash.R script not a Quarto file instead? Because our aim for the R script is to produce a specific dataset output, while the aim of a Quarto file is to create an output document (like HTML) and to test reproducibility. We use specific tools or file formats for specific purposes based on their design.
- Remove the old code from the Quarto document (doc/learning.qmd).
- Once we’ve finished prototyping code and moved it over into a more “final” location, we remove leftover code because it isn’t necessary and we don’t want our files to be littered with it. Again, like with restarting R often, we want to work from a clean plate.
- Whenever we complete a task, we add and commit those file changes and save them into the Git history.
- We use Git because it is the best way of keeping track of what was done to your files, when, and why. It keeps your work transparent and makes it easier to share your code by uploading to GitHub. Using version control should be a standard practice for doing better science since it fits with the philosophy of doing science (e.g., transparency, reproducibility, and documentation).
- Note: While we covered GitHub in the introductory course, we can’t assume everyone will have taken that course. Because of that, we won’t be using GitHub in this course.
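The prototype-to-function part of the workflow above can be sketched in a few lines of base R. This is only an illustration under loud assumptions: the toy data frame and the `summarise_by_user()` function are hypothetical stand-ins (not the course's actual MMASH processing code), everything is written to a temporary folder rather than the real project, a fresh environment stands in for restarting R, and base `save()` is used here simply as one way of writing an .rda file.

```r
# Work in a throwaway folder that mimics the project layout.
project <- file.path(tempdir(), "LearnR3-sketch")
dir.create(file.path(project, "R"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "data"), showWarnings = FALSE)

# Hypothetical toy data standing in for the raw MMASH files.
toy <- data.frame(
  user_id = c("user_1", "user_1", "user_22"),
  value = c(1, 2, 3)
)

# 1. Prototype: code written and tested interactively,
#    as we would do inside doc/learning.qmd.
prototype_result <- aggregate(value ~ user_id, data = toy, FUN = mean)

# 2. Convert the working code into a function and move it into
#    R/functions.R (written to the temporary project here).
functions_file <- file.path(project, "R", "functions.R")
writeLines(
  c(
    "summarise_by_user <- function(data) {",
    "  aggregate(value ~ user_id, data = data, FUN = mean)",
    "}"
  ),
  functions_file
)

# 3. source() the functions file. A new environment is used as a
#    rough stand-in for the clean workspace we get after restarting R.
clean_env <- new.env()
source(functions_file, local = clean_env)

# 4. Replace the prototype code with a call to the new function
#    and confirm it gives the same result.
function_result <- clean_env$summarise_by_user(toy)
stopifnot(isTRUE(all.equal(prototype_result, function_result)))

# 5. Save the processed data as an .rda file, as the cleaning
#    script (data-raw/mmash.R) would do at the end.
mmash <- function_result
rda_file <- file.path(project, "data", "mmash.rda")
save(mmash, file = rda_file)
```

Checking that the function reproduces the prototype's result before deleting the prototype code is the step that makes it safe to remove leftover code from the Quarto document afterwards.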