
5 Basic setup and workflow

5.1 Introduction to course

Introduction slides

For instructors:

The slides contain speaking notes that you can view by pressing ‘p’ on the keyboard.

5.2 Overview of workflow

Take 10 min to read over this section before we go over it together.

This section provides a bigger-picture view of what we will be doing, why we want to do it, how we will go about doing it, and what it will look like in the end.

Big picture overall aim:

Firstly, what we ultimately want to do is process the MMASH data so that we have a single dataset to work with for later (hypothetical) analyses.

We also want to make sure that whatever processing we do to the data is reproducible. So everything we’ll do throughout the course serves two aims: making a single dataset and making the work reproducible.

Both the folder and file structures below and Figure 5.1 show exactly what we will be doing and what it will look like at the file level (not at the R code level). Hopefully with this overview, you can better understand where we are and where we want to get to. A comment on naming: whenever folder names are given, they always end in /, for instance data/ or doc/.

Right now, everyone’s initial project structure should look like:

LearnR3
├── data/
│   └── README.md
├── data-raw/
│   ├── README.md
│   ├── mmash-data.zip
│   ├── mmash/
│   │   ├── user_1/
│   │   ├── ...
│   │   └── user_22/
│   └── mmash.R
├── doc/
│   ├── README.md
│   └── lesson.Rmd
├── R/
│   ├── functions.R
│   └── README.md
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
├── README.md
└── TODO.md

At the end of this workshop, it should look something like:

LearnR3
├── data/
│   ├── README.md
│   └── mmash.rda
├── data-raw/
│   ├── README.md
│   ├── mmash-data.zip
│   ├── mmash/
│   │   ├── user_1/
│   │   ├── ...
│   │   └── user_22/
│   └── mmash.R
├── doc/
│   ├── README.md
│   ├── lesson.html
│   └── lesson.Rmd
├── R/
│   ├── README.md
│   └── functions.R
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
└── README.md

Why do we structure it this way? Because we want to follow some standard conventions in R, like having a DESCRIPTION file, keeping raw data in the data-raw/ folder, and keeping R scripts in the R/ folder. We also want to keep things structured to make it easier for others and ourselves to reproduce the work.
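
To check your own project against the initial structure listed above, something like the following sketch could be run in the Console. It assumes the working directory is the project root, and `find_missing_paths()` is a hypothetical helper written for this illustration, not a function from any package.

```r
# Paths taken from the initial project listing above.
expected_paths <- c(
  "data/README.md",
  "data-raw/README.md",
  "data-raw/mmash-data.zip",
  "data-raw/mmash",
  "data-raw/mmash.R",
  "doc/README.md",
  "doc/lesson.Rmd",
  "R/functions.R",
  "R/README.md",
  ".gitignore",
  "DESCRIPTION",
  "LearnR3.Rproj",
  "README.md",
  "TODO.md"
)

# Hypothetical helper: returns the expected paths that do not exist under `root`.
find_missing_paths <- function(paths, root = ".") {
  paths[!file.exists(file.path(root, paths))]
}

# An empty result means every expected file and folder was found.
find_missing_paths(expected_paths)
```

If anything is listed, it is probably worth revisiting the pre-course tasks before continuing.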

Our workflow will generally look like Figure 5.1, with each block representing one or two sessions. We’ve already done a bit of the first block, “Download raw data”.

Figure 5.1: Overview of the workflow we will be using and covering.

Our workflow and process will be something like:

  • Make an R script (data-raw/mmash.R) to download the dataset (data-raw/mmash/), which you already did in the pre-course tasks.
    • We do this to keep the code for downloading and processing the data together with the raw data.
  • Use R Markdown (doc/lesson.Rmd) to write and test out code, convert it into a function, test it, and then move it into R/functions.R. Later we will also move code into data-raw/mmash.R. We use this workflow because:
    • It’s easier to quickly test code out and make sure it works before moving the code over into a more formal location and structure. Think of using the R Markdown file as a sandbox to test out and play with code, without fear of messing things up.
    • We also test code out in the R Markdown because it’s easier from a teaching perspective to interweave text and comments with code as we do the code-alongs and because it forces you to practice working within R Markdown documents, which are key components to a reproducible workflow in R.
    • We keep the functions in a separate file because we will frequently source() from it as we prototype and test out code in the R Markdown file. It also creates a clear separation between “finalized” code and prototyping code.
  • Use a combination of restarting R (Ctrl-Shift-F10 or “Session -> Restart R”) and using source() (Ctrl-Shift-S while in R/functions.R) to run the functions inside of R/functions.R.
    • We restart R because it is the only sure way to clear the R workspace and start from a “clean slate”. And for reproducibility, we should always aim to work from a clean slate.
  • Replace the old code in the R Markdown document (doc/lesson.Rmd) with the new function.
  • Finish building functions that prepare the dataset, use them in the cleaning script (data-raw/mmash.R), run the script and save the data as an .rda file (data/mmash.rda).
    • Like with the R/functions.R script, we move the finalized data processing code over from the R Markdown into the data-raw/mmash.R script to have a clear separation of completed code and prototyping code.
  • Remove the old code from the R Markdown document (doc/lesson.Rmd).
    • Once we’ve finished prototyping code and moved it over into a more “final” location, we remove leftover code because it isn’t necessary and we don’t want our files to be littered with it. Again, like with restarting R often, we want to work from a clean slate.
  • Whenever we complete a task, we add and commit those file changes and save them into the Git history.
    • We use Git because it is the best way of keeping track of what was done to your files, when, and why. It keeps your work transparent and makes it easier to share your code by uploading to GitHub. Using version control should be a standard practice for doing better science since it fits with the philosophy of doing science.
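
The prototype, function, source() and save steps above can be sketched roughly as follows. This is only an illustration under assumptions: a temporary folder stands in for the project so that nothing real is touched, the toy data values are made up, `convert_value()` is a hypothetical function, and base R's `save()` is just one way of writing an .rda file (the course may use a different function for that step).

```r
# Sketch only: a temporary folder stands in for the LearnR3 project.
project <- file.path(tempdir(), "LearnR3-sketch")
dir.create(file.path(project, "R"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "data"), showWarnings = FALSE)

# 1. Prototype in doc/lesson.Rmd: loose code on a toy data frame (made-up values).
toy <- data.frame(user = c("user_1", "user_2"), value = c("1.5", "2.5"))
toy$value <- as.numeric(toy$value)

# 2. Wrap the working code in a function and move it into R/functions.R.
#    (Here the file is written with writeLines() so the sketch is self-contained.)
functions_file <- file.path(project, "R", "functions.R")
writeLines(c(
  "convert_value <- function(data) {",
  "  data$value <- as.numeric(data$value)",
  "  data",
  "}"
), functions_file)

# 3. After restarting R, source() the file to load the finalized functions.
source(functions_file)

# 4. In data-raw/mmash.R: use the function and save the result as an .rda file.
raw <- data.frame(user = c("user_1", "user_2"), value = c("1.5", "2.5"))
mmash <- convert_value(raw)
rda_file <- file.path(project, "data", "mmash.rda")
save(mmash, file = rda_file)
```

In the real project, step 1 lives in doc/lesson.Rmd, step 2 in R/functions.R, and step 4 in data-raw/mmash.R, so that finished code and prototyping code stay separate.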

Stop reading here.

For instructors:

Go over the text again, to reinforce the workflow, and the reasons why we do certain things.

5.3 Exercise: Discuss the types of workflows you use

Time: 15 min

  • Take 2 min to think about the workflow you use in your work. How exactly do you do the things you do (like what apps you use, how you collaborate, how you name your files and folders, where you save your work)?

  • Then for about 10 min, in your group/table, share and discuss each other’s workflows. How do they compare to each other? What are things you’d like to try out in your own work? How do all your workflows compare with what has been described so far?

  • For the remaining time, as the whole group, we’ll briefly share what you’ve all thought and discussed.

5.4 Setting up our project

Now that we’ve gone over the overview, let’s get our project ready for the next steps. But first, we need to do a few things. Since we did all that work in the pre-course tasks downloading the data and unzipping it, we need to save these changes to the Git history. Open the Git interface with either the Git icon at the top near the menu bar, or with Ctrl-Alt-M. When the Git interface opens up, we’ll click the checkbox beside the .gitignore and data-raw/mmash.R files. Then we write a commit message in the text box on the right, something like “Code to download data zip file”. Click the “Commit” button and close the Git interface.

Next, delete one of the files we don’t need by using the fs package.

In the Console, type out:

fs::file_delete("TODO.md")
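
If you would rather have the deletion be safe to run more than once, a guarded version is sketched below. It assumes the fs package is installed and that the working directory is the project root; `delete_if_present()` is a hypothetical helper made up for this example and is not part of fs.

```r
# Assumes the fs package is installed (the text above already uses it).
# delete_if_present() is a hypothetical helper, not part of fs.
delete_if_present <- function(path) {
  # Only attempt the deletion when the file is actually there.
  if (fs::file_exists(path)) {
    fs::file_delete(path)
    return(TRUE)
  }
  FALSE
}

# Run from the project root, as in the text above.
delete_if_present("TODO.md")
```

The helper returns TRUE when it deleted the file and FALSE when there was nothing to delete.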

Then, open up the README.md and fix some of the TODO items. After cleaning everything up, we need to use Git to add and commit all the current files into the history. Open up the Git interface in RStudio with Ctrl-Alt-M or through the Git button, stage the files, write “Added initial files” in the commit message text box, and click “Commit”. Now we’re ready for the next session!