If you find any typos, errors, or places where the text may be improved, please let us know by providing feedback either in the feedback survey (given during class), by using GitLab, or directly in this document with hypothes.is annotations.
6 Importing data, fast!
Here we will cover the first block, “Download raw data”, in Figure 6.1.
And your folder and file structure should look like:

```
LearnR3
├── data/
│   └── README.md
├── doc/
│   ├── README.md
│   └── lesson.Rmd
├── R/
│   ├── functions.R
│   └── README.md
├── .Rbuildignore
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
└── README.md
```
6.1 Learning objectives
- Learn about filesystems, relative and absolute paths, and how to make use of the fs package to navigate files in your project.
- Learn where to store your raw data so that you can use scripts as a record of what was done to process the data before analyzing it, and why that’s important.
- Learn how to import data and do minor cleaning with the vroom package.
- Learn about strategies and resources to use when encountering problems when importing data.
- Practice using Git version control as part of the workflow of data analysis.
6.2 The MMASH dataset
- What is contained in the dataset, by looking at the Data Description. We’ll be making use of the Data Description for the code-along as well as the exercises. Take 5 min to quickly look over the Data Description and get more familiar with it.
- The open license and the ability to re-use it. A small note: GDPR makes it stricter to share and use personal data, but it does not prohibit sharing it or making it public! GDPR and Open Data are not in conflict.
Note: Sometimes the PhysioNet website is slow. If that’s the case, use this alternative link instead.
After we have looked over the MMASH website, we need to set up where we will store and prepare the dataset for processing. Here we’ll make use of the usethis package to help set things up. usethis is an extremely useful package for managing R projects, and I highly recommend checking out how you can use it more in your own work. For now, while in your LearnR3 R Project, go to the Console pane in RStudio and type out:
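The code that belongs here is a single usethis call; a minimal sketch, assuming the script is to be named mmash as described below:

```r
# Creates the data-raw/ folder and opens a data-raw/mmash.R script inside it.
usethis::use_data_raw("mmash")
```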
What this function does is create a new folder called data-raw/ and an R script called mmash.R inside it. This is where we will store the raw, original MMASH data that we’ll get from the website. The R script should have opened up for you; otherwise, go into the data-raw/ folder and open up mmash.R.
The first thing we want to do is delete all the code that is there by default. Then we’ll create a new line at the top and type out:
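Since the next paragraph introduces the here package, the line at the top is presumably loading it:

```r
library(here)
```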
The here package was described in the Management of R Projects section of the introductory course. Take 8 minutes to read the section about here and read the next two paragraphs. here makes it easy to refer to other files in an R project.
R works based on the current working directory, which you can see at the top of the RStudio Console pane. When in an RStudio R Project, the working directory is the folder where the .Rproj file is located. When you run scripts with source(), however, the working directory can instead be wherever the R script is located, so you can encounter problems with finding files. By using here() instead, R knows to start searching for files from the project root, the folder containing the .Rproj file.
Let’s use an example. Below is the folder tree. If we open up RStudio with the LearnR3.Rproj file and run code in data-raw/mmash.R, R runs the commands assuming everything starts in the LearnR3/ folder. But! If we run the code in the mmash.R script some other way (e.g. not with RStudio, not in an R Project, or via source()), R runs everything assuming it starts in the data-raw/ folder. This can make things tricky. What here() does is tell R to first look for the .Rproj file and then start looking for the file we actually want.
```
LearnR3
├── data
│   └── README.md
├── data-raw
│   └── mmash.R
├── doc
│   ├── lesson.Rmd
│   └── README.md
├── R
│   └── README.md
├── .Rbuildignore
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
└── README.md
```
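As a sketch of the idea, here() builds paths starting from the folder that contains the .Rproj file, so the same call works no matter where the code is run from (the file name here is just illustrative):

```r
library(here)

# Builds the full path to data-raw/mmash.R, starting from the
# folder that contains LearnR3.Rproj.
here("data-raw", "mmash.R")
```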
Stop reading and we’ll go back to coding together. The first step we want to take is to download the dataset. From this material, paste this code into the data-raw/mmash.R script.

Note: Sometimes the PhysioNet website is slow. If that’s the case, use r3::mmash_data_link instead of the link used above. In this case, it will look like mmash_link <- r3::mmash_data_link.
Then we’re going to write out the download.file() function to download and save the zipped dataset. We’ll save the zip file to data-raw/mmash-data.zip with the destfile argument. This code should be written in the data-raw/mmash.R script. Run these lines of code to download the dataset.
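A sketch of those lines, assuming mmash_link holds the PhysioNet download URL (or r3::mmash_data_link as the fallback):

```r
library(here)

# mmash_link should already contain the dataset's download URL.
download.file(mmash_link, destfile = here("data-raw/mmash-data.zip"))
```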
Because the original dataset is stored elsewhere, we don’t need to keep it or save it to our Git history. So we’ll add the zip file to the Git ignore list. In the Console, type out and run this code. You only need to do this once.
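A sketch of that one-time command, using usethis:

```r
# Add the downloaded zip file to .gitignore so Git doesn't track it.
usethis::use_git_ignore("data-raw/mmash-data.zip")
```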
This is a good time to save the changes we’ve made to the Git history.
Let’s open the Git interface with either the Git icon at the top near the menu bar, or with Ctrl-Alt-M. When this opens up, we’ll click the checkbox beside the data-raw/mmash.R file (and any other changed files, like .gitignore). Then we write a commit message in the text box on the right, something like “Code to download data zip file”. Click the “Commit” button and close the Git interface.
Alright, let’s start preparing the dataset. We’ll open up the zip file and look
at what is inside (only instructor does this). Inside there is the license file,
another file to check if the download worked correctly (the SHA file),
and another zip of the dataset inside. Because we are starting with the original mmash-data.zip, we should record exactly how we process the dataset for use. This also relates to the principle of “keep your raw data raw”: don’t edit or touch your raw data; let R or another programming language process it. This lets you have a history of what was done to the raw data.
During data collection, programs like Excel or Google Sheets are incredibly
powerful. But after collection is done, don’t make edits directly to the data
unless absolutely necessary.
A quick comment about whether you should save your raw data in data-raw/. A general guideline is:
- Do store it in data-raw/ if the data will only be used for this one project. Use the data-raw/ R script as the record of how you processed your data for the final analysis work.
- Don’t save it to data-raw/ if: 1) there is a central dataset that multiple people use for multiple projects; or 2) you got the data online. Instead, use the data-raw/ R script as the record of which website you downloaded it from, or which central location you extracted it from, and how you processed it.
- Don’t save it to a project-specific data-raw/ folder if you will use the raw data for multiple projects. Instead, create a central location for the data for yourself, so that you can point all other projects to it and use their individual data-raw/ R scripts as the record of how you processed the raw data.
Ok, let’s start unzipping the zip files. In data-raw/mmash.R, continue writing code below the download.file() function. We’ll use the unzip() function to unzip the dataset. The main argument for unzip() is the zip file, and the other important one, called exdir, tells unzip() the folder we want to extract the files to. The argument junkpaths is used here because we want everything extracted directly into the data-raw/ folder (don’t ask why it’s named that).
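A sketch of that unzip() call:

```r
library(here)

# Extract everything directly into data-raw/, dropping the zip's
# internal folder paths (junkpaths = TRUE).
unzip(here("data-raw/mmash-data.zip"),
      exdir = here("data-raw"),
      junkpaths = TRUE)
```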
Notice the indentation and spacing of the code. Like writing any language, code should follow a style guide. An easy way of following a style is by selecting your code and using RStudio’s built-in style fixer with either Ctrl-Shift-A or the “Code -> Reformat Code” menu item.
Ok, next, we want to extract the new data-raw/MMASH.zip file. Because we want to keep the folder structure inside this zip file, we don’t use the junkpaths argument this time.
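A sketch of this second unzip() call, this time keeping the folder structure:

```r
library(here)

# No junkpaths here, so the folders inside the zip are preserved.
unzip(here("data-raw/MMASH.zip"), exdir = here("data-raw"))
```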
Almost done! There are several files left over that we don’t need, so we’ll also write code in the script to remove them. We’ll use the fs package, short for “filesystem”, to work with files. First, we delete the files we originally extracted (data-raw/MMASH.zip, data-raw/LICENSE.txt, and data-raw/SHA256SUMS.txt) by using the file_delete() function. Then we’ll rename the new folder data-raw/DataPaper/ to something more explicit like data-raw/mmash/ with the file_move() function. So the data-raw/ folder will initially look like:
```
data-raw
├── LICENSE.txt
├── MMASH.zip
├── SHA256SUMS.txt
├── DataPaper
│   ├── user_1
│   ├── user_10
│   ├── ...
│   ├── user_8
│   └── user_9
├── mmash-data.zip
└── mmash.R
```
Then we add these lines of code to the
data-raw/mmash.R script and run them:
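A sketch of those lines, assuming the extracted folder is named DataPaper as described above:

```r
library(fs)
library(here)

# Delete the leftover files we don't need.
file_delete(here(c("data-raw/MMASH.zip",
                   "data-raw/LICENSE.txt",
                   "data-raw/SHA256SUMS.txt")))

# Rename the extracted folder to something more explicit.
file_move(here("data-raw/DataPaper"), here("data-raw/mmash"))
```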
Afterward, the files and folders in
data-raw/ will look like:
```
data-raw
├── mmash
│   ├── user_1
│   ├── user_10
│   ├── ...
│   ├── user_8
│   └── user_9
├── mmash-data.zip
└── mmash.R
```
Since we have an R script that downloads the data and processes it for us, we don’t need to have Git track it. So, in the Console, type out and run this command:
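A sketch of that command:

```r
# The mmash/ folder can be fully re-created by the script,
# so Git doesn't need to track it.
usethis::use_git_ignore("data-raw/mmash/")
```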
Now that we have everything prepared, let’s add and commit the changes to the Git history.
6.3 Importing the raw data
While we’ll eventually come back to the data-raw/mmash.R script, for now we’ll move over to the doc/lesson.Rmd file. At the bottom of the file, create a header by typing out ## Importing raw data.
Next, we’ll make a new code chunk with Ctrl-Alt-I and call it setup. Inside the code chunk we’ll load the vroom package with library(vroom) as well as library(here). It should look like this:
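A sketch of the chunk contents (in the R Markdown file this sits inside a chunk named setup):

```r
library(vroom)
library(here)
```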
This specially named code chunk tells R to run it first whenever you start running code in this R Markdown file. So it’s here that we will add library() calls when we want to load other packages.
Take 5 minutes to read the next paragraphs until it says to stop.
What is vroom? It is a package designed to load in data, specifically text-based data files such as CSV. In R there are several packages you can use to load in data of different file formats. We won’t cover these other packages, but you can use this as a reference for when or if you ever have to load other file types:
- haven: For reading (also known as importing or loading) in SAS, SPSS, and Stata files.
- readxl: For reading in Excel spreadsheets (.xls and .xlsx files).
- googlesheets4: For reading in Google Sheets from their cloud service.
- readr: Standard package used to load in text-based data files like CSV. This package is included by default with tidyverse.
- utils::read.delim(): This function comes from the core R package utils, which also includes related functions like read.csv().
- data.table::fread(): From the data.table package, used to load in CSV files.
We’re using the vroom package for largely one reason: It makes use of recent improvements in R that allow data to be imported in very very quickly. Just how fast? The vroom website has a benchmark page showing how fast it is. For many people, loading in the data can be one of the most time-consuming parts of starting an analysis. Hopefully by using this package, that time can be reduced.
The packages readr, vroom, haven, readxl, and googlesheets4 are all very similar in how you use them, and their documentation is almost identical. So the skills you learn in this session with vroom can mostly be applied to these other packages. And because readr (which the other packages are based on) has been around for a while, there is a large amount of support and help for using it. If you’re curious to learn more about vroom, check out its website.
If your data is in CSV format, vroom is perfect. If not, there are other ways of importing data which we won’t cover. The CSV file format is a commonly used format for data because it is open, readable by any computer, and doesn’t depend on any special software to open (unlike for e.g. Excel spreadsheets). Please stop reading and we’ll go over this together.
Let’s first start by creating an object that has the file path to the dataset,
then we’ll use
vroom() to import that dataset.
```r
user_1_info_file <- here("data-raw/mmash/user_1/user_info.csv")
user_1_info_data <- vroom(user_1_info_file)
#> New names:
#> * `` -> ...1
#> Rows: 1
#> Columns: 5
#> Delimiter: ","
#> chr : Gender
#> dbl : ...1, Weight, Height, Age
#>
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message
```
You’ll see the output mention using spec() to retrieve the guessed column specification to use in the col_types argument. It also says the data has 5 columns, one called ...1. If we look at the CSV file though, we see that there are only four columns with names… but that technically there is a first empty column without a column header.
So, let’s figure out what this message means.
Let’s go to the Console and type out:
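Presumably the command here opens the help page for spec():

```r
?spec
```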
In the documentation, we see that it says:
“extracts the full column specification from a tibble…”
Without seeing the output, it’s not clear what “specification” means. So let’s use spec() on the dataset variable. In the Console again:
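Based on the specification we copy out of it further below, the output would look something like:

```r
spec(user_1_info_data)
#> cols(
#>   ...1 = col_double(),
#>   Gender = col_character(),
#>   Weight = col_double(),
#>   Height = col_double(),
#>   Age = col_double(),
#>   .delim = ","
#> )
```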
Ok, so from this we can see that a specification describes which columns are imported into R and what data type they are given. For instance, col_double() means numeric (“double” is how computers represent non-integer numbers), while col_character() means a character data type.
Next, let’s see what the message meant about col_types. Let’s check out the help documentation for vroom() by typing in the Console:
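That help lookup would be:

```r
?vroom
```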
And if we scroll down to the explanation of col_types, we see:

“One of NULL, a cols() specification, or a string. See vignette("readr") for more details.”

It says to use a “cols() specification”, which is likely the output of spec().
So, let’s copy the output from spec() and paste it into the col_types argument of vroom():
```r
user_1_info_data <- vroom(
  user_1_info_file,
  col_types = cols(
    ...1 = col_double(),
    Gender = col_character(),
    Weight = col_double(),
    Height = col_double(),
    Age = col_double(),
    .delim = ","
  )
)
#> Warning: The following named parsers don't match the column names: ...1
#> New names:
#> * NA -> ...1
#> Error in vroom_(file, delim = delim %||% col_types$delim, col_names = col_names, :
#>   Invalid input type, expected 'list' actual 'NULL'
```
Hmm. A warning and an error. Ok, if we look through the message, there’s the part that says:
“The following named parsers don’t match the column names: …1”
And the error that is:
“Invalid input type, expected ‘list’ actual ‘NULL’”
We copied and pasted, so what’s going on? If you recall, the user_info.csv file has an empty column name. Looking at the data dictionary, there doesn’t seem to be any reference to this column, so it seems it isn’t important. More than likely, vroom is complaining about this empty column name and the use of ...1 to represent it. Since we don’t need it, let’s just get rid of it when we load in the dataset. But how? Let’s look at the help documentation again. Go to the Console and type out ?vroom.
Looking at the list of arguments, there is one called col_select that sounds like it could be used to keep or drop columns. It says that it is used similarly to dplyr::select(), which normally takes actual column names. Our column doesn’t have a name; that’s the problem. Next let’s check the Examples section of the help. Scrolling down, you’ll eventually see:
vroom(input_file, col_select = c(1, 3, 11))
So, it takes numbers! With dplyr::select(), using a - before the column name (or number) means to drop the column. So in this case, we could drop the first column with col_select = -1!
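A sketch of that call:

```r
# Drop the unnamed first column by its position.
user_1_info_data <- vroom(user_1_info_file, col_select = -1)
```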
Amazing! We did it 😁
But… we also have a new message. You may or may not see this message depending on the version of your packages (the most up-to-date packages will show it). But it should say:

New names: * NA -> ...1

This is vroom() letting you know that a column was renamed, because you didn’t explicitly indicate that you wanted it renamed.
This is the empty column. Even though we excluded it with
col_select = -1,
that only excludes it after importing it.
To remove this new message, we need to tell
vroom() exactly how to handle
renaming cases. Let’s look at the help docs of vroom:
?vroom. Scroll down and
we see an argument called
.name_repair that handles naming of columns. Going
into the provided link takes us to the
tibble::tibble() help documentation.
Scroll down to the .name_repair argument documentation and it says that it handles problematic column names, of which a missing column name is definitely one. There are several options here, but the one I want to focus on is the comment about “function: apply custom name repair”. This is an important one because we eventually want to rename the columns to snake_case to match the style guide, and there’s a package to do that called snakecase.

In the Console, type out snakecase:: and hit Tab. You’ll see a list of possible functions to use. We want the snake case one, so scroll down to to_snake_case(). That’s the one we want to use.
So to remove the messages and convert the variable names to snake case, we add .name_repair = snakecase::to_snake_case to the code. Notice the lack of () when using the function this way. We’ll explain more about this in the next session.
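A sketch of the resulting call (keeping the col_select argument from before):

```r
user_1_info_data <- vroom(
  user_1_info_file,
  col_select = -1,
  # Pass the function itself, without (), so vroom can apply it
  # to each column name.
  .name_repair = snakecase::to_snake_case
)
```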
Ok, no more messages! We can now look at the data:
Why might we use col_types? Depending on the size of the dataset, it may take a long time to load everything, which may not be very efficient if you only intend to use some parts of the dataset and not all of it. Sometimes the guessed column types are wrong, so using col_types = cols() can fix those problems. And if you have a lot of columns in your dataset, you can make use of cols_only() to keep only the columns you want. Check out vroom’s website and help documentation for more details on how to use these.
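As a sketch, using two of the columns from user_info.csv:

```r
user_1_info_data <- vroom(
  user_1_info_file,
  # Import only these two columns; all others are dropped.
  col_types = cols_only(
    Gender = col_character(),
    Age = col_double()
  )
)
```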
6.4 Exercise: Import the saliva data
Time: 15 min
Practice importing data files by doing the same process with the saliva data.
- Create a new header at the bottom of the doc/lesson.Rmd file and call it ## Exercise: Import the saliva data.
- Below the header, create a new code chunk with Ctrl-Alt-I.
- Copy and paste the code template below into the new code chunk. Begin replacing the ___ with the correct R functions or other information.
- Once you have the code working, use the RStudio Git interface to add and commit the changes into the version history.
6.5 Importing larger datasets
Sometimes you may have a dataset that’s just a bit too large. Sometimes vroom
may not have enough information to guess the data type of the column.
Or maybe there are hundreds or thousands of columns in your data
and you only want to import specific columns.
In these cases, we can do a trick: read in only the first few lines of the dataset, use spec() on that, paste the output into the col_types argument, and then keep only the columns you want. Let’s do this on the RR.csv file. We can see from the file size that it is bigger than most of the other files (~2 MB). So, we’ll use this technique to decide what we want to keep.
First, create a new header ## Import larger datasets and a new code chunk below it. Do the same thing that we’ve been doing, but this time use the argument n_max, which tells vroom how many rows to read into R. In this case, let’s read in 100, since that is how many rows vroom uses to guess the column types. This dataset, like the others, has an empty column that we will drop.
```r
user_1_rr_file <- here("data-raw/mmash/user_1/RR.csv")
user_1_rr_data_prep <- vroom(user_1_rr_file, n_max = 100, col_select = -1)
#> New names:
#> * `` -> ...1
#> Rows: 100
#> Columns: 3
#> Delimiter: ","
#> dbl : ibi_s, day
#> time : time
#>
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message

spec(user_1_rr_data_prep)
#> cols(
#>   ...1 = col_skip(),
#>   ibi_s = col_double(),
#>   day = col_double(),
#>   time = col_time(format = "")
#> )
```
Like last time, copy and paste the spec() output into a new use of vroom()’s col_types argument. Remove the .delim line and the ...1 line; don’t forget to also remove the , at the end! Make sure to remove the n_max argument too, since we want to import the whole dataset.
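A sketch of what the final call might look like after those edits (keeping col_select and the name repair from earlier):

```r
user_1_rr_data <- vroom(
  user_1_rr_file,
  col_select = -1,
  col_types = cols(
    ibi_s = col_double(),
    day = col_double(),
    time = col_time(format = "")
  ),
  .name_repair = snakecase::to_snake_case
)
```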
There’s a new column type: col_time(). To see all the other types of column specifications, type col_ in the Console and then hit the Tab key. You’ll see other types, like col_factor(), and so on.
Right, what does the data look like?
```r
user_1_rr_data
#> # A tibble: 91,858 x 3
#>    ibi_s   day time
#>    <dbl> <dbl> <time>
#>  1 0.258     1 10:10:17
#>  2 0.319     1 10:10:18
#>  3 0.266     1 10:10:18
#>  4 0.401     1 10:10:18
#>  5 1.09      1 10:10:19
#>  6 0.752     1 10:10:20
#>  7 0.337     1 10:10:20
#>  8 0.933     1 10:10:21
#>  9 0.731     1 10:10:22
#> 10 0.454     1 10:10:23
#> # … with 91,848 more rows
```
To make sure everything is reproducible so far within the doc/lesson.Rmd file, we will “Knit” the R Markdown document to output an HTML file. Click the “Knit” button at the top of the Source pane or type Ctrl-Shift-K. If it generates an HTML document without problems, we know our code is at least starting to be reproducible.
6.6 Exercise: Import the Actigraph data
Time: 15 min
Practice some more. Do the same thing with the Actigraph.csv dataset as we did for RR.csv. But first:
- Create a new header at the bottom of the doc/lesson.Rmd file and call it ## Exercise: Import the Actigraph data.
- Below the header, create a new code chunk with Ctrl-Alt-I.
Use the same technique as we used for the RR.csv file and read in the Actigraph.csv file from the data-raw/mmash/user_1/ folder.

- Set the file path to the dataset with here().
- Read in a max of 100 rows with n_max and exclude the first column with col_select = -1.
- Use spec() to output the column specification and paste the results into col_types.
- Don’t forget to remove the ...1 = col_skip() and the .delim = "," lines from the pasted specification.
Rossi, Alessio, Eleonora Da Pozzo, Dario Menicagli, Chiara Tremolanti, Corrado Priami, Alina Sirbu, David Clifton, Claudia Martini, and David Morelli. 2020. “Multilevel Monitoring of Activity and Sleep in Healthy People.” PhysioNet. https://doi.org/10.13026/CERQ-FC86.