If you find any typos, errors, or places where the text may be improved, please let us know by providing feedback either in the feedback survey (given during class) or by using GitLab.
For instructors:
The slides contain speaking notes that you can view by pressing ‘p’ on the keyboard.
Take 10 min to read over this section before we go over it together.
This section provides a bigger-picture view of what we will be doing, why we want to do it, how we will go about doing it, and what it will look like in the end.
Big picture overall aim:
Firstly, what we ultimately want to do is process the MMASH data so that we have a single dataset to work with for later (hypothetical) analyses.
We also want to make sure that whatever processing we do to the data is reproducible. So everything we’ll do throughout the course is done to achieve our aim of making a single dataset and of making things reproducible.
Both the folder and file structures below and Figure 5.1 show exactly what we will be doing and what it will look like at the file level (not at the R code level).
Hopefully, with this overview, you can better understand where we are and where
we want to get to. A comment on naming: whenever folder names are given,
they always end in /, as in the listings below.
Right now, everyone’s initial project structure should look like:
LearnR3
├── data/
│   └── README.md
├── data-raw/
│   ├── README.md
│   ├── mmash-data.zip
│   ├── mmash/
│   │   ├── user_1
│   │   ├── ...
│   │   └── user_22
│   └── mmash.R
├── doc/
│   ├── README.md
│   └── lesson.Rmd
├── R/
│   ├── functions.R
│   └── README.md
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
├── README.md
└── TODO.md
At the end of this workshop, it should look something like:
LearnR3
├── data/
│   ├── README.md
│   └── mmash.rda
├── data-raw/
│   ├── README.md
│   ├── mmash-data.zip
│   ├── mmash/
│   │   ├── user_1
│   │   ├── ...
│   │   └── user_22
│   └── mmash.R
├── doc/
│   ├── README.md
│   ├── lesson.html
│   └── lesson.Rmd
├── R/
│   ├── README.md
│   └── functions.R
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
└── README.md
Why do we structure it this way? Because we want to follow some standard conventions
in R, like having a DESCRIPTION file, keeping raw data in its own folder,
and keeping R scripts in R/. We also want to keep things structured to make
it easier for others and ourselves to reproduce the work.
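To compare your own project against the listings above, a small check like the following may help. It is only a sketch: missing_paths() is a hypothetical helper, not part of the course material, and the list of expected paths is an assumption taken from the folder listings.

```r
# Hypothetical helper (not from the course material): report which of the
# expected project paths are missing under `root`.
missing_paths <- function(root = ".",
                          expected = c("data", "data-raw", "doc", "R", "DESCRIPTION")) {
  expected[!file.exists(file.path(root, expected))]
}

# Run from the project root; an empty result means nothing is missing.
missing_paths()

# For a visual comparison, print the folder tree to a limited depth
# (only if the fs package is installed).
if (requireNamespace("fs", quietly = TRUE)) {
  fs::dir_tree(".", recurse = 1)
}
```

An empty character vector from missing_paths() suggests the top-level layout matches; anything it returns is a path to create or investigate.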
Our workflow will generally look like Figure 5.1, with each block representing one or two sessions. We’ve already done a bit of the first block, “Download raw data”.
Our workflow and process will be something like:
- Make an R script (data-raw/mmash.R) to download the dataset (data-raw/mmash/), which you already did in the pre-course tasks.
  - We do this to keep the code for downloading and processing the data together with the raw data.
- Use R Markdown (doc/lesson.Rmd) to write and test out code, convert it into a function, test it, and then move it into R/functions.R. Later we will also move code into data-raw/mmash.R. We use this workflow because:
  - It’s easier to quickly test code out and make sure it works before moving the code over into a more formal location and structure. Think of using the R Markdown file as a sandbox to test out and play with code, without fear of messing things up.
  - We also test code out in the R Markdown file because it’s easier from a teaching perspective to interweave text and comments with code as we do the code-alongs, and because it forces you to practice working within R Markdown documents, which are key components of a reproducible workflow in R.
  - We keep the functions in a separate file because we will frequently source() from it as we prototype and test out code in the R Markdown file. It also creates a clear separation between “finalized” code and prototyping code.
- Use a combination of restarting R (Ctrl-Shift-F10 or “Session -> Restart R”) and using source() to run the functions inside of R/functions.R.
  - We restart R because it is the only sure way to clear the R workspace and start from a “clean plate”. And for reproducibility, we should always aim to work from a clean plate.
- Replace the old code in the R Markdown document (doc/lesson.Rmd) with the new function.
- Finish building functions that prepare the dataset, use them in the cleaning script (data-raw/mmash.R), run the script, and save the data as an .rda file (data/mmash.rda in the final structure above).
  - Like with the R/functions.R script, we move the finalized data processing code over from the R Markdown file into the data-raw/mmash.R script to have a clear separation of completed code and prototyping code.
- Remove the old code from the R Markdown document (doc/lesson.Rmd).
  - Once we’ve finished prototyping code and moved it over into a more “final” location, we remove leftover code because it isn’t necessary and we don’t want our files to be littered with leftover code. Again, like with restarting R often, we want to work from a clean plate.
- Whenever we complete a task, we add and commit those file changes and save them into the Git history.
  - We use Git because it is the best way of keeping track of what was done to your files, when, and why. It keeps your work transparent and makes it easier to share your code by uploading it to GitHub. Using version control should be a standard practice for doing better science, since it fits with the philosophy of doing science.
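The core loop in the list above (prototype a function, move it into a functions file, source() it from a clean session, then save the prepared data) can be sketched roughly as follows. This is only an illustration under stated assumptions: add_one(), the two-column data frame, and the temporary files standing in for R/functions.R and data/mmash.rda are all hypothetical and not taken from the course material.

```r
# Stand-in for R/functions.R: a temporary file holding one "finalized" function.
functions_file <- tempfile(fileext = ".R")
writeLines(
  c(
    "# Hypothetical example function, not from the course material.",
    "add_one <- function(x) {",
    "  x + 1",
    "}"
  ),
  functions_file
)

# After restarting R, load the finalized functions into the clean session.
source(functions_file)

# Use the function where the prototyping code used to be.
add_one(1:3)

# Stand-in for the prepared dataset (column names are made up).
mmash <- data.frame(
  user_id = c("user_1", "user_2"),
  value = add_one(c(1, 2))
)

# Stand-in for data/mmash.rda: save the dataset as an .rda file.
rda_file <- file.path(tempdir(), "mmash.rda")
save(mmash, file = rda_file)
```

In the actual project, the function would live in R/functions.R and the save step would sit at the end of data-raw/mmash.R; temporary files are used here only so the sketch runs anywhere without touching your project.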
Stop reading here.
For instructors:
Go over the text again, to reinforce the workflow, and the reasons why we do certain things.
Time: 15 min
Take 2 min to think about the workflow you use in your work. How exactly do you do the things you do (like what apps you use, how you collaborate, how you name your files and folders, where you save your work)?
Then for about 10 min, in your group/table, share and discuss each other’s workflows. How do they compare to each other? What are things you’d like to try out in your own work? How do all your workflows compare with what has been described so far?
For the remaining time, as the whole group, we’ll briefly share what you’ve all thought and discussed.
Now that we’ve gone over the overview, let’s get our project ready for the next
steps. But first, we need to do a few things. Since we did all that work in the
pre-course tasks downloading the data and unzipping it, we need to save these
changes to the Git history. Open the Git interface with either the Git icon at
the top near the menu bar, or with Ctrl-Alt-M. When the Git interface opens up,
we’ll click the checkbox beside the file containing the download code.
Then we write a commit message in the text box on the right, something like
“Code to download data zip file”. Click the “Commit” button and close the Git
interface.
Next, delete one of the files we don’t need by using the fs package.
In the Console, type out:
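A minimal sketch of such a deletion, assuming the file in question is TODO.md (it appears in the initial project structure but not in the final one) and using file_delete() from the fs package:

```r
# Assumption: TODO.md is the file we no longer need. Run this from the
# project root. The check avoids an error if the file is already gone.
if (file.exists("TODO.md")) {
  fs::file_delete("TODO.md")
}
```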
Then, open up the
README.md and fix some of the TODO items.
After cleaning everything up, now we need to use Git to add and commit all the
current files into the history. Open up the Git interface in RStudio with
Ctrl-Alt-M or through the Git button. Write a message in the commit textbox
saying “Added initial files”. Now we’re ready for the next session!