If you find any typos, errors, or places where the text may be improved, please let us know by providing feedback either in the feedback survey (given during class) or by using GitLab.
To participate in this course, you must complete the pre-course tasks in this section and finish by completing the survey at the end. These tasks are designed to make it easier for everyone to start the course with everything set up. For some of the tasks, you might not understand why you need to do them, but you will likely understand why once the course begins.
Depending on your skills and knowledge, these tasks could take between 3 and 5 hours to finish, so we suggest planning a full day to complete them. Depending on your institution and how it handles installing software on work computers, you might also have to contact IT very early to make sure everything is properly installed and set up.
Here’s a quick overview of the tasks you need to do. Specific details about them are found as you work through the section.
- Install R, RStudio, and Git. For some people, depending on their institution, this task can take the longest because you have to contact your IT department to install this software.
- Install the necessary R packages.
- Read about Git from the introduction course and configure Git on your computer. If you haven’t used Git before, this task could take a while because of the reading.
- Run a check to see if everything works.
  - You’ll later need to paste this output into the survey.
- Create an R Project, along with the folder and file setup.
- Create an R Markdown file.
- Write R code to download the data and save it to your computer. This task will probably take up the most time, maybe 30-60 minutes.
- Run a check to see that everything is as expected.
  - You’ll later need to paste this output into the survey.
- Read the syllabus.
- Read the Code of Conduct.
- Complete the pre-course survey. This survey is pretty quick, maybe ~10 minutes.
Check each section for exact details on completing these tasks.
In general, these pre-course tasks are meant to help prepare you for the course and make sure everything is set up properly so the first session runs smoothly. However, some of these tasks are meant for learning as well as for general setup, so we have defined the following learning objectives for this page:
- Learn about making reproducible documents with R Markdown.
- Learn about filesystems, relative and absolute paths, and how to make use of the fs R package to navigate files in your project.
- Learn where to store your raw data so that you can use scripts as a record of what was done to process the data before analyzing it, and why that’s important.
Given this is an intermediate course, you should already have R and RStudio installed. However, you may not have the latest versions installed.
- Have a recent version of R installed (version 4.0.0 or later).
- Have a recent version of RStudio installed (version 1.4 or later).
There are a few other things to install:
- Git. We’ll be using Git (building off of the Introduction course), so it needs to be installed.
- Rtools. Some Windows users may need to install Rtools in order for some R packages to be installed (which you’ll do shortly) and to work. On some computers, installing Rtools can take some time.
All these programs are required for the course, even Git. Git, which is a software program to formally manage file versions, is used because of its popularity and the amount of documentation available for it. Check out the online book Happy Git with R, especially the “Why Git” section, to understand why we are using Git. Windows users tend to have more trouble with installing Git than macOS or Linux users. See the section on Installing Git for Windows for help.
A note to those who have or use work laptops with restrictive administrative privileges: You may encounter problems installing software due to administrative reasons (e.g. you don’t have permission to install things). For issues with updating to the latest version of R or RStudio, if you have at least 4.0.0 for R and at least 1.4 for RStudio, that should be fine. If you have versions of R and RStudio older than that, you will need to ask IT to update your software if you can’t do it yourself. Unfortunately, since Git is not commonly used in some organizations, you may not have it installed and you will need to ask IT to install it. We require it for the course, so please make sure to give IT enough time to be able to install it for you.
Once R, RStudio, and Git have been installed, open RStudio. If at any point during these pre-course tasks you have any troubles, try as best as you can to complete the task and then let us know about the issues in the pre-course survey (at the end of this section). If you continue having problems, indicate on the survey that you need help and we can try to book a quick video call to fix the problem. Otherwise, if you can, come to the course earlier (about 20-30 min) to get help.
We will be using specific R packages for the course, so you will need to install them. A detailed walkthrough for installing the necessary packages is available in the pre-course tasks for installing packages section of the introduction course. However, you only need to install the r3 helper package in order to install all the necessary packages. Do that by running these commands in the R Console:
Install the remotes package:
install.packages("remotes")
Install the r3 helper package for this course:
remotes::install_gitlab("rostools/r3", upgrade = TRUE)
Note: When you see a command written with ::, for example remotes::install_gitlab(), you would “read” this as:
R, can you please use the install_gitlab function from the remotes package.
The normal way of doing this would be to load the package with library(remotes) and then run the command (install_gitlab()). But by using the ::, we tell R to directly use a function from a package, without needing to load the package and all of its other functions too. We use this trick because we only want to use the install_gitlab() command from the remotes package and not have to load all the other functions as well. In this
course we will be using
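To make the difference between :: and library() concrete, here is a minimal sketch. It uses the tools package that ships with R rather than remotes, purely as an assumption so that the example runs without installing anything; the file name is only an illustration.

```r
# Option 1: call one function directly with `::`, without attaching the package.
ext_direct <- tools::file_ext("data-raw/mmash-data.zip")

# Option 2: attach the whole package first, then call the function by name.
library(tools)
ext_attached <- file_ext("data-raw/mmash-data.zip")

# Both approaches give the same answer; the first simply avoids
# attaching every function from the package.
identical(ext_direct, ext_attached)
```

The :: form is handy for one-off commands typed in the Console, since it makes it obvious which package a function comes from.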
Since Git has already been covered in the Introduction course, we won’t cover learning it during this course. However, since version control is a fundamental component of any modern data analysis workflow and should be used, we will be using it throughout the course. If you have used or currently use Git, you can skip this section. If you haven’t used it, please do these tasks:
Follow the pre-course tasks for Git (not the GitHub tasks) from the introduction course. Specifically, type in the RStudio Console:
# A pop-up to type in your name (first and last),
# as well as your email
r3::setup_git_config()
Please read through the Version Control lesson of the introduction course. You don’t need to do any of the exercises or activities, but you are welcome to do them if it will help you learn or understand it better. For most of the course, we will be using Git as shown in the Using Git in RStudio section. Later on during the course, we might connect our projects to GitHub, which is described in the Synchronizing with GitHub section.
Regardless of whether you’ve done the steps above or not, everyone needs to run:
The output you’ll get for success will look something like this:
Checking R version:
✔ Your R is at the latest version of 4.2.0!
Checking RStudio version:
✔ Your RStudio is at the latest version of 2022.2.2.485!
Checking Git config settings:
✔ Your Git configuration is all setup!
Git now knows that:
- Your name is 'Luke W. Johnston'
- Your email is 'firstname.lastname@example.org'
Eventually you will need to copy and paste the output into one of the survey questions. Note that while GitHub is a natural connection to using Git, given the limited time available, we will not be going over how to use GitHub. If you want to learn about using GitHub, check out the session on it in the introduction course.
One of the basic steps to reproducibility and modern workflows in data analysis is to keep everything contained in a single location. In RStudio, this is done with R Projects. Please read all of Section 7.1 from the introduction course to learn about R Projects and how they help keep things self-contained. You don’t need to do any of the exercises or activities.
There are several ways to organise a project folder. We’ll be using the structure from the package prodigenr. The project setup can be done by either:
- Using RStudio’s New Project menu item: “File -> New Project -> New Directory”, scroll down to “Scientific Analysis Project using prodigenr” and name the project “LearnR3” in the Directory Name, saving it to the “Desktop” with Browse.
- Or, running the command prodigenr::setup_project("~/Desktop/LearnR3") in the R Console.
When the RStudio Project opens up again, run these three commands in the R Console:
Here we use the usethis package to help set things up. usethis is an extremely useful package for managing R Projects and we highly recommend checking it out more to see how you can use it more in your own work.
We teach and use R Markdown because it is one of the very first steps to being reproducible and because it is a very powerful tool for doing data analysis. Please do these two tasks:
Please read over the R Markdown section of the introduction course. If you use R Markdown already, you can skip this step.
Open up the LearnR3 project, either by clicking the LearnR3.Rproj file or by using the “File -> Open Project” menu. Run the function below in the Console when RStudio is in the LearnR3 project, which will create a new file called
To best demonstrate the concepts in the course, we ideally should work on a real dataset to apply what we’re going to learn. So for this course, we’re going to use an openly licensed dataset on monitoring sleep and activity (MMASH) (1,2). To begin learning about being reproducible and applying modern approaches to data analysis, we’re going to write and save R code to download a dataset, prepare it a bit so it’s at least a little usable, and then save it to your computer. The goal at the end of the course is to create a pipeline to download the data, process and clean it, and save it in a form that makes it easier to analyze. Why don’t we get you to download an already cleaned and prepared dataset? Because in the real world, the data you get is rarely all cleaned up and ready for you, and this course is about learning more advanced tools to do the data wrangling. Look over these tasks and then switch over to the MMASH website:
- Look through the Data Description to get familiar with the dataset and see what is contained inside of it. We’ll refer back to the Data Description throughout the course as well as in the exercises.
- Look over the open license that allows you to re-use it, even for research purposes. Note: GDPR makes it stricter on how to share and use personal data, but it does not prohibit sharing it or making it public! GDPR and Open Data are not in conflict.
Note: Sometimes the PhysioNet website is slow. If that’s the case, use this alternative link instead.
After looking over the MMASH website, you need to set up where to store the dataset to prepare it for later processing. While in your LearnR3 R Project, go to the Console pane in RStudio and type out:
What this function does is create a new folder called data-raw/ and an R script called mmash.R in that folder. This is where we will store the raw, original MMASH data that we’ll get from the website. The R script should have opened up for you; otherwise, go into the data-raw/ folder and open up the mmash.R script.
The first thing we want to do is delete all the code in the script that is added there by default. Then we’ll create a new line at the top and type out:
library(here)
The here package was described in the Management of R Projects section of the introductory course and makes it easier to refer to other files in an R project. Read through the section about the here package in the introductory R course.
R works based on the current working directory, which is shown at the top of the RStudio Console pane. When in an RStudio R Project, the working directory is the folder where the .Rproj file is located. When you run scripts in R with source(), sometimes the working directory will be set to where the R script is located. So you can sometimes encounter problems with finding files. Instead, when you use here(), R knows to start searching for files from the folder containing the .Rproj file.
Let’s use an example. Below is the tree of the folders and files you have so far. If we open up RStudio with the LearnR3.Rproj file and run code in data-raw/mmash.R, R runs the commands assuming everything starts in the LearnR3/ folder. But! If we run the code in the mmash.R script in other ways (e.g. not with RStudio, not in an R Project, or with source()), R may run everything assuming it starts in the data-raw/ folder. This can make things tricky. What here() does is tell R to first look for the .Rproj file and then start looking for the file we actually want. This might not make sense yet, but as we go through the course, you will see why this is important.
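To see why the working directory matters, here is a minimal base R sketch. It does not use the here package itself. Instead it assumes a throwaway folder called LearnR3-demo (a made-up name) inside R's temporary directory, standing in for the real project, and checks whether the same relative path can be found from two different working directories.

```r
# Build a throwaway stand-in for the project inside a temporary folder.
project <- file.path(tempdir(), "LearnR3-demo")
dir.create(file.path(project, "data-raw"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(project, "data-raw", "mmash.R"))

old_wd <- getwd()

# Working directory is the project root: the relative path is found.
setwd(project)
found_from_root <- file.exists("data-raw/mmash.R")

# Working directory is data-raw/: the same relative path no longer resolves,
# because R now looks for data-raw/data-raw/mmash.R.
setwd(file.path(project, "data-raw"))
found_from_subfolder <- file.exists("data-raw/mmash.R")

# Restore the original working directory.
setwd(old_wd)

# A path built from the project root works from anywhere. This is, roughly,
# what here() does once it has located the folder with the .Rproj file.
found_from_anywhere <- file.exists(file.path(project, "data-raw", "mmash.R"))

c(found_from_root, found_from_subfolder, found_from_anywhere)
```

The middle check fails only because the working directory changed, which is the kind of surprise that here() is meant to prevent.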
Note, you don’t need to run the below code. But if you want to see the structure and content of a directory, you can use the dir_tree() function from the fs package, which means “filesystem”, by running the following code in the R Console:
# To print the file list.
fs::dir_tree("~/Desktop/LearnR3", recurse = 1)
LearnR3
├── data
│   └── README.md
├── data-raw
│   └── mmash.R
├── doc
│   ├── lesson.Rmd
│   └── README.md
├── R
│   └── README.md
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
└── README.md
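Since one of the learning objectives is telling relative and absolute paths apart, here is a small sketch using the fs package. It assumes fs is installed, and the path is only an illustration (the file does not need to exist for these functions to work).

```r
library(fs)

# A relative path: it only has meaning in relation to some starting folder.
relative <- "data-raw/mmash.R"

# Turn it into an absolute path, anchored at the current working directory.
absolute <- path_abs(relative)

is_absolute_path(relative)
is_absolute_path(absolute)

# Going back: express the absolute path relative to the working directory.
back_to_relative <- path_rel(absolute, start = path_wd())
back_to_relative
```

The absolute version will look different on every computer, which is one reason project code is easier to share when it relies on paths relative to the project folder.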
Alright, the next step is to download the dataset. Paste this code into the data-raw/mmash.R script:
mmash_link <- "https://physionet.org/static/published-projects/mmash/multilevel-monitoring-of-activity-and-sleep-in-healthy-people-1.0.0.zip"
Note: Sometimes the PhysioNet website is slow. If that’s the case, use r3::mmash_data_link instead of the link used above. In this case, it will look like mmash_link <- r3::mmash_data_link.
Then we’re going to write out the function download.file() to download and save the zip dataset. We’re going to save the zip file to data-raw/mmash-data.zip with the destfile argument. This code should be written in the data-raw/mmash.R script:
download.file(mmash_link, destfile = here("data-raw/mmash-data.zip"))
Run these lines of code to download the dataset.
After downloading the zip file (it should be called mmash-data.zip in the data-raw/ folder), comment out this line, so that inside the data-raw/mmash.R script it looks like:
# download.file(mmash_link, destfile = here("data-raw/mmash-data.zip"))
We do this because we don’t want to accidentally run this code again, since we already downloaded the file.
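Commenting the line out is the approach used in this course. For comparison only, here is a sketch of an alternative that some people use: a small wrapper that skips the download when the file already exists. The helper download_if_missing() and the stand-in downloader are hypothetical names made up for this example, and the demonstration deliberately avoids the network by writing a placeholder file instead.

```r
# Hypothetical helper (not part of the course material): download a file
# only when it is not already saved at `destfile`.
download_if_missing <- function(url, destfile, download = utils::download.file) {
  if (file.exists(destfile)) {
    message("Already downloaded: ", destfile)
    return(invisible(FALSE))
  }
  download(url, destfile = destfile)
  invisible(TRUE)
}

# Demonstration without touching the network: a stand-in "downloader"
# that just writes a small placeholder file.
fake_download <- function(url, destfile) writeLines("placeholder", destfile)
demo_file <- tempfile(fileext = ".zip")

# The first call creates the file, the second call skips the download.
first_call <- download_if_missing("https://example.org/data.zip", demo_file, fake_download)
second_call <- download_if_missing("https://example.org/data.zip", demo_file, fake_download)
```

In the real script, a call would look something like download_if_missing(mmash_link, here("data-raw/mmash-data.zip")), leaving the default downloader in place.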
Because the original dataset is stored elsewhere on a website, we don’t need to save it to our Git history. Add the zip file to the Git ignore list by typing out and running this code in the Console. You only need to do this once.
Next, open up the zip file with your File Manager and look at what is inside. There should be the license file, another file to check if the download worked correctly (the SHA file, which you don’t need to worry about), and another zip file of the dataset. Because we are starting with the original mmash-data.zip, we should record exactly how we process the dataset for use. This relates to the key principle of “keep your raw data raw”, as in don’t edit or touch your raw data; let R or another programming language process it. This lets you have a history of what was done to the raw data.
During data collection and entry, programs like Excel or Google Sheets are
incredibly powerful. But after collection is done, don’t make edits directly to
the data unless absolutely necessary.
A quick comment about whether you should save your raw data in data-raw/. A general guideline is:
- Do store it in data-raw/ if the data will only be used for the one project. Use the data-raw/ R script as the record of how you processed your data for the final analysis work.
- Don’t save it to data-raw/ if: 1) there is a central dataset that multiple people use for multiple projects; or 2) you have the data online. Instead, use the data-raw/ R script as the record of which website or central location you extracted it from and how you later processed it.
- Don’t save it to a project-specific data-raw/ folder if you will use the raw data for multiple projects. Instead, create a central location for the data for yourself so that you can point all other projects to it and use their individual data-raw/ R scripts as the record of how you processed the raw data.
Unzip the zip files by using the unzip() function and writing it in data-raw/mmash.R below the download.file() function. The main argument for unzip() is the zip file and the next important one is called exdir, which tells unzip() which folder you want to extract the files to. The argument junkpaths is used here to tell unzip() to drop (“junk”) any folder paths stored inside the zip file, so that everything is extracted directly into the data-raw/ folder. This code should be written and executed in the data-raw/mmash.R script:
unzip(here("data-raw/mmash-data.zip"),
      exdir = here("data-raw"),
      junkpaths = TRUE)
Notice the indentation and spacing of the code. Like writing in any language, code should follow a style guide. An easy way of following a style is by selecting your code and using RStudio’s built-in style fixer with either Ctrl-Shift-A or the “Code -> Reformat Code” menu item.
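Next, we’ll extract the new data-raw/MMASH.zip file using unzip() again. Because we want to keep the folder structure inside this zip file, we won’t use junkpaths. Write and execute this code in the data-raw/mmash.R script:
unzip(here("data-raw/MMASH.zip"),
      exdir = here("data-raw"))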
Almost done! There are several files left over that we don’t need, so you’ll need to write code in the script that removes them. We’ll use the fs package to work with files. Before you change any files, look into the data-raw/ folder and confirm that the below listed files and folders are there.
# NOTE: You don't need to run this code,
# it's here to show how we got the file list.
fs::dir_tree("data-raw", recurse = 1)
data-raw
├── README.md
├── LICENSE.txt
├── MMASH.zip
├── SHA256SUMS.txt
├── DataPaper
│   ├── user_1
│   ├── user_10
│   ├── ...
│   ├── user_8
│   └── user_9
├── mmash-data.zip
└── mmash.R
If your files and folders in the
data-raw/ folder do not look like this,
start over by deleting all the files except for the
files. Then re-run the code from beginning to end.
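To tidy up the files, first, use the file_delete() function from fs to delete all the files we originally extracted (LICENSE.txt, SHA256SUMS.txt, and MMASH.zip). Then use file_move() to rename the new folder data-raw/DataPaper/ to something more explicit like data-raw/mmash/. Add these lines of code to the data-raw/mmash.R script and run them:
library(fs)
file_delete(here(c("data-raw/MMASH.zip",
                   "data-raw/SHA256SUMS.txt",
                   "data-raw/LICENSE.txt")))
file_move(here("data-raw/DataPaper"), here("data-raw/mmash"))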
Afterwards, the files and folders in data-raw/ will look like:
data-raw
├── README.md
├── mmash
│   ├── user_1
│   ├── user_10
│   ├── ...
│   ├── user_8
│   └── user_9
├── mmash-data.zip
└── mmash.R
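If you would like to try file_delete() and file_move() somewhere safe before touching your real data-raw/ folder, the sketch below rehearses the same tidy-up on empty placeholder files. It assumes the fs package is installed. The folder name data-raw-demo is made up for this example and lives in R's temporary directory.

```r
library(fs)

# Set up a throwaway folder that mimics the extracted files.
demo_dir <- path(tempdir(), "data-raw-demo")
if (dir_exists(demo_dir)) dir_delete(demo_dir)
dir_create(path(demo_dir, "DataPaper", "user_1"))
file_create(path(demo_dir, c("MMASH.zip", "SHA256SUMS.txt", "LICENSE.txt")))

# Delete the left over files, several at once.
file_delete(path(demo_dir, c("MMASH.zip", "SHA256SUMS.txt", "LICENSE.txt")))

# Rename the folder; its contents move along with it.
file_move(path(demo_dir, "DataPaper"), path(demo_dir, "mmash"))

# See what is left at the top level of the demo folder.
remaining <- path_file(dir_ls(demo_dir))
remaining
```

Only the renamed mmash folder should remain, with user_1 still inside it.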
Like before, if your files and folders inside
data-raw/ don’t look like those
listed above, start over again (making sure to delete all but the
Since we have an R script that downloads the data and processes it for us, we don’t need to have Git track the extracted data. So, in the Console, type out and run this command:
The data-raw/mmash.R script should look like this at this point:
library(here)
# Download
mmash_link <- "https://physionet.org/static/published-projects/mmash/multilevel-monitoring-of-activity-and-sleep-in-healthy-people-1.0.0.zip"
download.file(mmash_link, destfile = here("data-raw/mmash-data.zip"))
# Unzip
unzip(here("data-raw/mmash-data.zip"),
      exdir = here("data-raw"),
      junkpaths = TRUE)
unzip(here("data-raw/MMASH.zip"),
      exdir = here("data-raw"))
# Remove/tidy up left over files
library(fs)
file_delete(here(c("data-raw/MMASH.zip",
                   "data-raw/SHA256SUMS.txt",
                   "data-raw/LICENSE.txt")))
file_move(here("data-raw/DataPaper"), here("data-raw/mmash"))
You now have the data ready for the course! At this point, please run this function in the Console:
~/Desktop/LearnR3
├── DESCRIPTION
├── LearnR3.Rproj
├── R
│   ├── README.md
│   └── functions.R
├── README.md
├── TODO.md
├── data
│   └── README.md
├── data-raw
│   ├── README.md
│   ├── mmash
│   │   ├── user_1
│   │   ├── user_10
│   │   ├── user_11
│   │   ├── user_12
│   │   ├── user_13
│   │   ├── user_14
│   │   ├── user_15
│   │   ├── user_16
│   │   ├── user_17
│   │   ├── user_18
│   │   ├── user_19
│   │   ├── user_2
│   │   ├── user_20
│   │   ├── user_21
│   │   ├── user_22
│   │   ├── user_3
│   │   ├── user_4
│   │   ├── user_5
│   │   ├── user_6
│   │   ├── user_7
│   │   ├── user_8
│   │   └── user_9
│   ├── mmash-data.zip
│   └── mmash.R
└── doc
    ├── README.md
    └── lesson.Rmd
The output should look something like the above text. If it doesn’t,
start over by deleting all but the
and running the code from the beginning again. If your output looks a bit like this,
then copy and paste the output into the survey question at the end.
Most of the course description is found in the syllabus. If you haven’t read it, please read it now. Read over what the course will cover, what we expect you to learn at the end of it, and what our basic assumptions are about who you are and what you know. The final pre-course task is a survey that asks some questions about whether you’ve read and understood it.
One goal of the course is to teach about open science, and true to our mission, we practice what we preach. The course material is publicly accessible (all on this website) and openly licensed so you can use and re-use it for free! The material and the table of contents on the side are listed in the order that we will cover them in the course.
We have a Code of Conduct. If you haven’t read it, read it now. The survey at the end will ask about Conduct. We want to make sure this course is a supportive and safe environment for learning, so this Code of Conduct is important.
You’re almost done. Please fill out the pre-course survey to finish this assignment.
See you at the course!