Want to help out or contribute?

If you find any typos, errors, or places where the text may be improved, please let us know by providing feedback either in the feedback survey (given during class) or by using GitHub.

On GitHub open an issue or submit a pull request by clicking the " Edit this page" link at the side of this page.

4  Introduction to course

Introduction slides

The slides contain speaking notes that you can view by pressing ‘S’ on the keyboard.

4.1 The Big Picture

Reading task: ~10 minutes

This section provides an bigger picture view of what we will be doing, why we want to do it, how we will be going about doing it, and what it will look like in the end.

Big picture aim:

To apply a reproducible and programmatic approach to efficiently downloading multiple data files, processing and cleaning them, and saving them as a single data file.

In our case during the course, we ultimately want to process the MMASH data so that we have a single dataset to work with for later (hypothetical) analyses.

We also want to make sure that whatever processing we do to the data is reproducible. So everything we’ll do throughout the course is done in order to achieve our aim of making a single dataset and to make things reproducible.

Both the folder and file structures below as well as the Figure Figure 4.1 show exactly what we will be doing and how it will look like at the file level (not at the R code level). Hopefully with this overview, you can better understand where we are and where we want to get to.

Right now, everyone’s initial project structure should look like:

LearnR3
├── data/
│   └── README.md
├── data-raw/
│   ├── README.md
│   ├── mmash-data.zip
│   ├── mmash/
│   │  ├── user_1
│   │  ├── ...
│   │  └── user_22
│   └── mmash.R
├── doc/
│   ├── README.md
│   └── learning.qmd
├── R/
│   ├── functions.R
│   └── README.md
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
├── README.md
└── TODO.md

At the end of this workshop, it should look something like:

LearnR3
├── data/
│   ├── README.md
│   └── mmash.rda
├── data-raw/
│   ├── README.md
│   ├── mmash-data.zip
│   ├── mmash/
│   │  ├── user_1
│   │  ├── ...
│   │  └── user_22
│   └── mmash.R
├── doc/
│   ├── README.md
│   ├── learning.html
│   └── learning.qmd
├── R/
│   ├── README.md
│   └── functions.R
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
└── README.md

Why do we structure it this way? Because we want to follow some standard conventions in R, like having a DESCRIPTION metadata file, keeping raw data in the data-raw/ folder, and keeping (most) R scripts in the R/. We also want to keep things structured to make it easier for others and ourselves to reproduce the work.

Our workflow will generally look like Figure Figure 4.1, with each block representing one or two sessions. We’ve already done a bit of the first block, “Download raw data”.

Figure 4.1: Overview of the workflow we will be using and covering.

Our workflow and process will be something like:

  • Make an R script (data-raw/mmash.R) to download the dataset (data-raw/mmash/), which you already did in the pre-course tasks.
    • We do this to keep the code for downloading and processing the data together with the raw data.
  • Use Quarto (doc/learning.qmd) to write and test out code, convert it into a function, test it, and then move it into R/functions.R. Later we will also move code into the data-raw/mmash.R. We use this workflow because:
    • It’s easier to quickly test code out and make sure it works before moving the code over into a more formal location and structure. Think of using the Quarto file as a sandbox to test out and play with code, without fear of messing things up.
    • We also test code out in the Quarto because it’s easier from a teaching perspective to interweave text and comments with code as we do the code-alongs and because it forces you to practice working within Quarto documents, which are key components to a reproducible workflow in R.
    • Rendering a Quarto file runs all the code from beginning to end in a new environment (so completely clean), meaning we can test the reproducibility of our code. Because of this feature, it allows us to detect errors and issues with the code, making Quarto an excellent prototyping tool.
    • We keep the functions in a separate file because we will frequently source() from it as we prototype and test out code in the Quarto file. It also creates a clear separation between “finalized” code and prototyping code.
  • Use a combination of restarting R (linux=Ctrl-Shift-F10 or with the Palette (linux=Ctrl-Shift-P, then type “restart”)) and using source() with linux=Ctrl-Shift-S or with the Palette (linux=Ctrl-Shift-P, then type “source”) while in R/functions.R to run the functions inside of R/functions.R.
    • We restart R because it is the only sure way there is to clearing up the R workspace and starting for a “clean plate”. And for reproducibility, we should always aim to working from a clean plate.
  • Replace the old code in the Quarto document (doc/learning.qmd) with the new function.
  • Finish building functions that prepare the dataset, use them in the cleaning script (data-raw/mmash.R), run the script and save the data as an .rda file (data/mmash.rda).
    • Like with the R/functions.R script, we move the finalized data processing code over from the Quarto into the data-raw/mmash.R script to have a clear separation of completed code and prototyping code.
    • Why is the data-raw/mmash.R script not a Quarto file instead? Because our aim for the R script is to produce a specific dataset output, while the aim of a Quarto file is to create an output document (like HTML) and to test reproducibility. We use specific tools or file formats for specific purposes base on their design.
  • Remove the old code from the Quarto document (doc/learning.qmd).
    • Once we’ve finished prototyping code and moved it over into a more “final” location, we remove left over code because it isn’t necessary and we don’t want our files to be littered with left over code. Again, like with restarting R often, we want to work from a clean plate.
  • Whenever we complete a task, we add and commit those file changes and save them into the Git history.
    • We use Git because it is the best way of keeping track of what was done to your files, when, and why. It keeps your work transparent and makes it easier to share your code by uploading to GitHub. Using version control should be a standard practice to doing better science since it fits with the philosophy of doing science.
    • Note: While we covered GitHub in the introductory course, we can’t assume everyone will have taken that course. Because of that, we won’t be using GitHub in this course.

Go over the text again, to reinforce the workflow, and the reasons why we do certain things. Especially reinforce why we have the separation between mmash.R, learning.qmd, and functions.R.

4.2 Exercise: Discuss the types of workflows you use

Time: 15 minutes.

  • Take 2 min to think about your workflow you use in your work. How do you exactly do the things you do (like what apps you use, how you collaborate, how you name your files and folders, where you save your work)?

  • Then for about 10 min, in your group/table share and discuss each others workflows. How do they compare to each other? What are things you’d like to try out in your own work? How do all your workflows compare with what has been described so far?

  • For the remaining time, as the whole group, we’ll briefly share what you’ve all thought and discussed.

4.3 Setting up our project

Take this time to briefly explain the Git interface, what add and commit mean, as well as where the history is located.

Now that we’ve gone over the overview, let’s get our project ready for the next steps. But first, we need to do a few things. Since we did all that work in the pre-course tasks downloading the data and unzipping it, we need to save these changes to the Git history. Open the Git interface with either the Git icon at the top near the menu bar, or with linux=Ctrl-Alt-M or with the Palette (linux=Ctrl-Shift-P, then type “commit”). When the Git inferface opens up we’ll click the checkbox beside the .gitignore and data-raw/mmash.R files. Then we write a commit message in the text box on the right, something like Code to download data zip file. Click the “Commit” button and close the Git interface.

Next, delete these two files we don’t need by using the fs package.

In the Console, type out:

fs::file_delete("TODO.md")
fs::file_delete("doc/report.Rmd")

Then, open up the README.md and fix some of the TODO items. After cleaning everything up, now we need to use Git to add and commit all the current files into the history. Open up the Git interface in RStudio with linux=Ctrl-Alt-M or with the Palette (linux=Ctrl-Shift-P, then type “commit”) or through the Git button. Write a message in the commit textbox saying Added initial files.

4.4 Styling your files

Before moving to the next session, there’s one more “setup” thing to do. Since we write code for other humans (including ourselves in the future), code should also look pretty, clean, and organized. To do that, we’ll follow a the tidyverse style guide and use the package called styler, which automatically fixes code to fit the style guide. With styler you can style multiple files at once, one file at a time, or based on code you select and highlight. Throughout the course, we will re-style the file were are working on a lot. We’ll do that by using the Command Palette by typing linux=Ctrl-Shift-P, followed by typing “style file”. The first one that pops up, which should say “Style active file”, will be the one to select and use. We’ll get lots of chances to use it as we go through the course. The thing to note is that styler isn’t always perfect, so you might need to sometimes manually fix code