
6 Importing data, fast!

Here we will cover the first block, “Download raw data”, in Figure 6.1.

Figure 6.1: Section of the overall workflow we will be covering.

Your folder and file structure should look like this:

LearnR3
├── data/
│   └── README.md
├── doc/
│   ├── README.md
│   └── lesson.Rmd
├── R/
│   ├── functions.R
│   └── README.md
├── .Rbuildignore
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
└── README.md

6.1 Learning objectives

  1. Learn about filesystems, relative and absolute paths, and how to make use of the fs package to navigate files in your project.
  2. Learn where to store your raw data so that you can use scripts as a record of what was done to process the data before analyzing it, and why that’s important.
  3. Learn how to import data and do minor cleaning with the vroom package.
  4. Learn about strategies and resources to use when encountering problems when importing data.
  5. Practice using Git version control as part of the workflow of data analysis.

6.2 The MMASH dataset

For this course, we’re going to use an openly licensed dataset on monitoring sleep and activity (MMASH) (Rossi et al. 2020). We’ll switch over to the MMASH website and go over:

  1. What is contained in the dataset by looking at the Data Description. We’ll be making use of the Data Description for the code-along as well as the exercises. Take 5 min to quickly look over the Data Description and get more familiar with it.
  2. The open license and the ability to re-use it. A small note: the GDPR imposes stricter rules on how personal data can be shared and used, but it does not prohibit sharing it or making it public! The GDPR and Open Data are not in conflict.

Note: Sometimes the PhysioNet website is slow. If that’s the case, use this alternative link instead.

After we have looked over the MMASH website, we need to set up where we will store and prepare the dataset for processing. Here we’ll make use of the usethis package to help set things up. usethis is an extremely useful package for managing R Projects and I highly recommend checking out how you can use it more in your own work. For now, while in your LearnR3 R Project, go to the Console pane in RStudio and type out:

usethis::use_data_raw("mmash")

This function creates a new folder called data-raw/ and an R script called mmash.R inside that folder. This is where we will store the raw, original MMASH data that we’ll get from the website. The R script should have opened up for you; otherwise, go into the data-raw/ folder and open up the new mmash.R script.

The first thing we want to do is delete all the code that is there by default. Then we’ll create a new line at the top and type out:

library(here)
#> here() starts at /builds/rostools/r-cubed-intermediate

The here package was described in the Management of R Projects section of the introductory course. Take 8 minutes to read that section about here and read the next two paragraphs. here makes it easy to refer to other files in an R project.

R works based on the current working directory, which you can see at the top of the RStudio Console pane. When in an RStudio R Project, the working directory is the folder where the .Rproj file is located. But when you run scripts with source(), the working directory can sometimes be the folder where the R script is located, so you can encounter problems with finding files. By using here() instead, R knows to start searching for files from the folder containing the .Rproj file.

Let’s use an example. Below is the folder tree. If we open up RStudio with the LearnR3.Rproj file and run code in data-raw/mmash.R, R runs the commands assuming everything starts in the LearnR3/ folder. But! If we run the code in the mmash.R script in other ways (e.g. not with RStudio, not in an R Project, or with source()), R runs everything assuming it starts in the data-raw/ folder. This can make things tricky. What here() does is tell R to first look for the .Rproj file and then start looking for the file we actually want.

LearnR3
├── data
│   └── README.md
├── data-raw
│   └── mmash.R
├── doc
│   ├── lesson.Rmd
│   └── README.md
├── R
│   └── README.md
├── .Rbuildignore
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
└── README.md
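
To make this concrete, here is a small sketch (the file name some-file.csv is made up, and the output path will of course depend on where your project lives; here() builds the path whether or not the file exists):

# A relative path is resolved from the current working directory,
# so it only works if R happens to be running from the LearnR3/ folder:
"data/some-file.csv"

# here() builds the path starting from the folder containing the
# .Rproj file, no matter where the code is run from:
here("data/some-file.csv")
#> [1] "/home/you/LearnR3/data/some-file.csv"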

Stop reading and we’ll go back to coding together. The first step we want to take is to download the dataset. From this material, paste this code into the data-raw/mmash.R script:

mmash_link <- "https://physionet.org/static/published-projects/mmash/multilevel-monitoring-of-activity-and-sleep-in-healthy-people-1.0.0.zip"

Note: Sometimes the PhysioNet website is slow. If that’s the case, use r3::mmash_data_link instead of the link used above. In this case, it will look like mmash_link <- r3::mmash_data_link.

Then we’re going to use the function download.file() to download and save the zipped dataset. We’ll save the zip file as data-raw/mmash-data.zip with the destfile argument. This code should be written in the data-raw/mmash.R file. Run these lines of code to download the dataset.

download.file(mmash_link, destfile = here("data-raw/mmash-data.zip"))

Because the original dataset is stored elsewhere, we don’t need to keep it or save it to our Git history. So we’ll add the zip file to the Git ignore list. In the Console, type out and run this code. You only need to do this once.

usethis::use_git_ignore("data-raw/mmash-data.zip")

This is a good time to save the changes we’ve made to the Git history. Let’s open the Git interface with either the Git icon at the top near the menu bar, or with Ctrl-Alt-M. When this opens up we’ll click the checkbox beside the .gitignore and data-raw/mmash.R files. Then we write a commit message in the text box on the right, something like “Code to download data zip file”. Click the “Commit” button and close the Git interface.

Alright, let’s start preparing the dataset. We’ll open up the zip file and look at what is inside (only the instructor does this). Inside there is the license file, a file to check whether the download worked correctly (the SHA file), and another zip file of the dataset itself. Because we are starting with the original raw mmash-data.zip, we should record exactly how we process the dataset for use. This also relates to the principle of “keep your raw data raw”: don’t edit or touch your raw data; let R or another programming language process it. This way you have a history of what was done to the raw data. During data collection, programs like Excel or Google Sheets are incredibly powerful. But after collection is done, don’t make edits directly to the data unless absolutely necessary.

A quick comment about whether you should save your raw data in data-raw/. A general guideline is:

  • Do store it to data-raw/ if the data will only be used for the one project. Use the data-raw/ R script to be the record for how you processed your data for final analysis work.
  • Don’t save it to data-raw/ if: 1) there is a central dataset that multiple people use for multiple projects; or 2) you got the data online. Instead, use the data-raw/ R script to be the record for which website you downloaded it from or from which central location you extracted it from and how you processed it.
  • Don’t save it to a project-specific data-raw/ folder if you will use the raw data for multiple projects. Instead, create a central location for the data for yourself so that you can point all other projects to it and use their individual data-raw/ R scripts as the record for how you processed the raw data.

Ok, let’s start unzipping the zip files. In data-raw/mmash.R, continue writing below the download.file() function. We’ll use the unzip() function to unzip the dataset. The main argument for unzip() is the zip file, and the other important one, called exdir, tells unzip() which folder we want to extract the files to. The argument junkpaths is used here because we want everything extracted directly into the data-raw/ folder (don’t ask why it’s called “junkpaths”; we don’t know either).

unzip(here("data-raw/mmash-data.zip"), 
      exdir = here("data-raw"),
      junkpaths = TRUE)

Notice the indentation and spacing of the code. Like writing in any language, code should follow a style guide. An easy way of following a style is to select your code and use RStudio’s built-in style fixer, with either Ctrl-Shift-A or the “Code -> Reformat Code” menu item. Ok, next we want to extract the new data-raw/MMASH.zip file. Because we want to keep the folder structure inside this zip file, we don’t use junkpaths.

unzip(here("data-raw/MMASH.zip"),
      exdir = here("data-raw"))

Almost done! There are several files left over that we don’t need, so we’ll also write code in the script to remove them. We’ll use the fs package, whose name stands for “filesystem”, to work with files. First, we delete all the files we originally extracted (LICENSE.txt, SHA256SUMS.txt, and MMASH.zip) using the file_delete() function. Then we’ll rename the new folder data-raw/DataPaper/ to something more explicit like data-raw/mmash/ using the file_move() function. So the data-raw/ folder will initially look like:

dir_tree("data-raw", recurse = 1)
data-raw
├── LICENSE.txt
├── MMASH.zip
├── SHA256SUMS.txt
├── mmash
│   ├── user_1
│   ├── user_10
│   ├── ...
│   ├── user_8
│   └── user_9
├── mmash-data.zip
└── mmash.R

Then we add these lines of code to the data-raw/mmash.R script and run them:

library(fs)
file_delete(here(c("data-raw/MMASH.zip", 
                   "data-raw/SHA256SUMS.txt",
                   "data-raw/LICENSE.txt")))
file_move(here("data-raw/DataPaper"), here("data-raw/mmash"))

Afterward, the files and folders in data-raw/ will look like:

data-raw
├── mmash
│   ├── user_1
│   ├── user_10
│   ├── ...
│   ├── user_8
│   └── user_9
├── mmash-data.zip
└── mmash.R

Since we have an R script that downloads the data and processes it for us, we don’t need Git to track the data itself. So, in the Console, type out and run this command:

usethis::use_git_ignore("data-raw/mmash/")

Now that we have everything prepared, let’s add and commit the changes to the Git history.

6.3 Importing the raw data

While we’ll eventually come back to the data-raw/mmash.R script, for now we’ll move over to the doc/lesson.Rmd file. At the bottom of the file, create a header by typing out ## Importing raw data. Next, we’ll make a new code chunk with Ctrl-Alt-I and call it setup. Inside the code chunk we’ll load the vroom package with library(vroom) as well as library(here). It should look like this:

```{r setup}
library(vroom)
library(here)
```

This is a specially named code chunk that tells R to run it first whenever you start running code in this R Markdown file. So it’s here that we will add library() functions when we want to load other packages.

Take 5 minutes to read the next paragraphs until it says to stop.

What is vroom? It is a package designed for loading in data, specifically text-based data files such as CSVs. In R there are several packages that you can use to load in data of different file formats. We won’t cover these other packages, but you can use this list as a reference for when or if you ever have to load other file types (a short sketch of some of them follows the list):

  • haven: For reading (also known as importing or loading) in SAS, SPSS, and Stata files.
  • readxl: For reading in Excel spreadsheets with .xls or .xlsx file endings.
  • googlesheets4: For reading in Google Sheets from their cloud service.
  • readr: Standard package used to load in text-based data files like CSV. This package is included by default with tidyverse.
  • utils::read.delim(): This function comes from the core R package utils, which also includes related functions like utils::read.csv().
  • data.table::fread(): From the data.table package, used to load in CSV files.
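
As a quick, optional sketch (not part of the code-along), reading the same CSV file with a few of these alternatives might look like this; info_file is a made-up name, and the readr and data.table packages are assumed to be installed:

# Each of these reads the same CSV file, just with a different package
info_file <- here("data-raw/mmash/user_1/user_info.csv")
readr::read_csv(info_file)    # tidyverse's standard CSV reader
utils::read.csv(info_file)    # base R, no extra packages needed
data.table::fread(info_file)  # data.table's fast CSV reader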

We’re using the vroom package largely for one reason: it makes use of recent improvements in R that allow data to be imported very quickly. Just how fast? The vroom website has a benchmark page showing how fast it is. For many people, loading in the data can be one of the most time-consuming parts of starting an analysis. Hopefully, by using this package, that time can be reduced.

The packages readr, vroom, haven, readxl, and googlesheets4 are all very similar in how you use them, and their documentation is almost identical. So the skills you learn in this session with vroom can mostly be applied to these other packages. And because readr (which the other packages are based on) has been around for a while, there is a large amount of support and help for using it. If you’re curious to learn more about vroom, check out its website.

If your data is in CSV format, vroom is perfect. If not, there are other ways of importing data, which we won’t cover. The CSV file format is commonly used for data because it is open, readable by any computer, and doesn’t depend on any special software to open (unlike, e.g., Excel spreadsheets). Please stop reading and we’ll go over this together.

Let’s first start by creating an object that has the file path to the dataset, then we’ll use vroom() to import that dataset.

user_1_info_file <- here("data-raw/mmash/user_1/user_info.csv")
user_1_info_data <- vroom(user_1_info_file)
#> New names:
#> * `` -> ...1
#> Rows: 1
#> Columns: 5
#> Delimiter: ","
#> chr [1]: Gender
#> dbl [4]: ...1, Weight, Height, Age
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message

You’ll see the output mention using spec() to retrieve the column specification to pass to the col_types argument. It also says there are 5 columns, one of them called ...1. If we look at the CSV file, though, we see that there are only four columns with names… but that, technically, there is a first empty column without a column header. So, let’s figure out what this message means. Let’s go to the Console and type out:

?vroom::spec

In the documentation, we see that it says:

“extracts the full column specification from a tibble…”

Without seeing the output, it’s not clear what “specification” means. So let’s use spec() on the dataset variable. In the Console again:

spec(user_1_info_data)
#> cols(
#>   ...1 = col_double(),
#>   Gender = col_character(),
#>   Weight = col_double(),
#>   Height = col_double(),
#>   Age = col_double()
#> )

Ok, so from this we can see that a specification describes which columns are imported into R and what data types they are given. For instance, we can assume that col_double() means numeric (double is how computers represent non-integer numbers) and col_character() means a character data type. Next, let’s see what the message meant about col_types. Let’s check out the help documentation for vroom() by typing in the Console:

?vroom::vroom

And if we scroll down to the explanation of col_types:

“One of NULL, a cols() specification, or a string. See vignette("readr") for more details.”

It says to use a “cols() specification”, which is likely the output of spec(). So, let’s copy the output from spec() and paste it into the col_types argument of vroom().

user_1_info_data <- vroom(
    user_1_info_file,
    col_types = cols(
        ...1 = col_double(),
        Gender = col_character(),
        Weight = col_double(),
        Height = col_double(),
        Age = col_double(),
        .delim = ","
    )
)
#> Warning: The following named parsers don't match the column names: ...1
#> New names:
#> * NA -> ...1
#> Error in vroom_(file, delim = delim %||% col_types$delim, col_names = col_names, : Invalid input type, expected 'list' actual 'NULL'

Hmm. A warning and an error. Ok, if we look through the message, there’s the part that says:

“The following named parsers don’t match the column names: …1”

And the error, which says:

“Invalid input type, expected ‘list’ actual ‘NULL’”

We copied and pasted, so what’s going on? If you recall, the user_info.csv file has an empty column name. Looking at the data dictionary, there doesn’t seem to be any reference to this column, so it seems it isn’t important. More than likely, vroom is complaining about this empty column name and the use of ...1 to represent it. Since we don’t need it, let’s just get rid of it when we load in the dataset. But how? Let’s look at the help documentation again. Go to the Console and type out:

?vroom::vroom

Looking at the list of arguments, there is one called col_select that sounds like something we could use to keep or drop columns. The documentation says it is used similarly to dplyr::select(), which is normally used with actual column names. Our column doesn’t have a name; that’s the problem. Next, let’s check the Examples section of the help. Scrolling down, you’ll eventually see:

vroom(input_file, col_select = c(1, 3, 11))

So, it takes numbers! With dplyr::select(), putting a - before the column name (or number) means dropping that column, so in this case we can drop the first column with col_select = -1! We’ll also leave out the .delim = "," line from the copied cols() output, since it isn’t a column specification.

user_1_info_data <- vroom(
    user_1_info_file,
    col_select = -1,
    col_types = cols(
        Gender = col_character(),
        Weight = col_double(),
        Height = col_double(),
        Age = col_double()
    )
)
#> New names:
#> * NA -> ...1

Amazing! We did it 😁

But… we also have a new message. You may or may not see this message, depending on the version of your packages (the most up-to-date packages will show it). But it should say:

New names:
* NA -> ...1

This is vroom() letting you know that a column was renamed because you didn’t explicitly indicate that you wanted it renamed. This is the empty column. Even though we excluded it with col_select = -1, that only excludes it after importing it.

To remove this new message, we need to tell vroom() exactly how to handle renaming. Let’s look at the help docs of vroom: ?vroom. Scrolling down, we see an argument called .name_repair that handles the naming of columns. Following the provided link takes us to the tibble::tibble() help documentation. Scroll down to the .name_repair argument documentation and you’ll see that it deals with problematic column names, and a missing column name is definitely a problem. There are several options here, but the one I want to focus on is the comment about “function: apply custom name repair”. This one is important because we eventually want to rename the columns to match the style guide, using snake_case. And there’s a package to do just that, called snakecase.

In the Console, type out snakecase:: and hit Tab. You’ll see a list of possible functions to use. We want the snake case one, so scroll down and find to_snake_case(). That’s the one we want. So, to remove the messages and convert the variable names to snake case, we add .name_repair = snakecase::to_snake_case to the code. Notice the lack of () when using the function here. We’ll explain more about this in the next session.
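
To get a feel for what this function does before we hand it to vroom(), you can try it directly in the Console (a quick sketch; "HeartRate" is just a made-up example name):

# to_snake_case() converts each string to snake_case
snakecase::to_snake_case(c("Gender", "Weight", "HeartRate"))
#> [1] "gender"     "weight"     "heart_rate"

Now, adding it to our vroom() call: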

user_1_info_data <- vroom(
    user_1_info_file,
    col_select = -1,
    col_types = cols(
        Gender = col_character(),
        Weight = col_double(),
        Height = col_double(),
        Age = col_double()
    ),
    .name_repair = snakecase::to_snake_case
)

Ok, no more messages! We can now look at the data:

user_1_info_data
#> # A tibble: 1 x 4
#>   gender weight height   age
#>   <chr>   <dbl>  <dbl> <dbl>
#> 1 M          65    169    29

Why might we use spec() and col_types? Depending on the size of the dataset, it may take a long time to load everything, which isn’t very efficient if you only intend to use some parts of the dataset and not all of it. And sometimes vroom guesses the column types incorrectly, so using col_types = cols() lets you fix those problems.

If you have a lot of columns in your dataset, you can make use of col_select or cols_only() to keep only the columns you want. Check out vroom’s help documentation on the col_select argument or on cols_only() for more details on how to use them.
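
For instance, a minimal sketch of using cols_only() on the user info data might look like this (user_1_info_subset is a made-up name, and this assumes the same user_1_info_file as above):

# Keep only the Gender and Age columns by naming them in cols_only();
# every other column in the file is dropped during import.
user_1_info_subset <- vroom(
    user_1_info_file,
    col_types = cols_only(
        Gender = col_character(),
        Age = col_double()
    ),
    .name_repair = snakecase::to_snake_case
)

Only the columns named inside cols_only() are kept, so there’s no need for col_select here.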

6.4 Exercise: Import the saliva data

Time: 15 min

Practice importing data files by doing the same process with the saliva data.

  1. Create a new header at the bottom of the doc/lesson.Rmd file and call it ## Exercise: Import the saliva data.
  2. Below the header, create a new code chunk with Ctrl-Alt-I.
  3. Copy and paste the code template below into the new code chunk. Begin replacing the ___ with the correct R functions or other information.
  4. Once you have the code working, use the RStudio Git interface to add and commit the changes into the version history.
user_1_saliva_file <- here("data-raw/mmash/user_1/___")
user_1_saliva_data_prep <- vroom(user_1_saliva_file,
                                 col_select = ___)
___(user_1_saliva_data_prep)

user_1_saliva_data <- vroom(
    user_1_saliva_file,
    col_select = ___,
    col_types = ___,
    .name_repair = ___
)

6.5 Importing larger datasets

Sometimes you may have a dataset that’s just a bit too large. Sometimes vroom may not have enough information to guess the data type of a column. Or maybe there are hundreds or thousands of columns in your data and you only want to import specific ones. In these cases, we can do a trick: read in only the first few lines of the dataset, use spec() to get the guessed specification, paste that output into the col_types argument, and then keep only the columns we want.

Let’s do this with the RR.csv file. We can see from the file size that it is bigger than most of the other files (~2 MB). So we’ll use this technique to decide what we want to keep. First, create a new header ## Import larger datasets and a new code chunk below it (with Ctrl-Alt-I).

Do the same thing that we’ve been doing, but this time use the argument n_max, which tells vroom how many rows to read into R. In this case, let’s read in 100 rows, since that is plenty for vroom to guess the column types from. This dataset, like the others, has an empty column that we will drop.

user_1_rr_file <- here("data-raw/mmash/user_1/RR.csv")
user_1_rr_data_prep <- vroom(user_1_rr_file,
                             n_max = 100,
                             col_select = -1)
#> New names:
#> * `` -> ...1
#> Rows: 100
#> Columns: 3
#> Delimiter: ","
#> dbl  [2]: ibi_s, day
#> time [1]: time
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message
spec(user_1_rr_data_prep)
#> cols(
#>   ...1 = col_skip(),
#>   ibi_s = col_double(),
#>   day = col_double(),
#>   time = col_time(format = "")
#> )

Like last time, copy and paste the output into a new use of vroom(). Remove the ...1 line (and the .delim line, if your output includes one). Don’t forget to also remove the last , at the end! And make sure to remove the n_max argument, since we want to import the whole dataset.

user_1_rr_data <- vroom(
    user_1_rr_file,
    col_select = -1,
    col_types = cols(
        ibi_s = col_double(),
        day = col_double(),
        # Converts to seconds
        time = col_time(format = "")
    ),
    .name_repair = snakecase::to_snake_case
) 

There’s a new column type here: col_time(). To see all the other types of column specifications, type col_ in the Console and then hit the Tab key. You’ll see other types, like col_date, col_factor, and so on.
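
For instance, a hypothetical cols() specification mixing a few of these types might look like this (the column names here are made up, purely for illustration):

# A sketch showing a few of the available column type specifiers
cols(
    id = col_integer(),
    visit_date = col_date(format = "%Y-%m-%d"),
    group = col_factor(levels = c("control", "treatment"))
)

Right, what does the data look like?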

user_1_rr_data
#> # A tibble: 91,858 x 3
#>    ibi_s   day time    
#>    <dbl> <dbl> <time>  
#>  1 0.258     1 10:10:17
#>  2 0.319     1 10:10:18
#>  3 0.266     1 10:10:18
#>  4 0.401     1 10:10:18
#>  5 1.09      1 10:10:19
#>  6 0.752     1 10:10:20
#>  7 0.337     1 10:10:20
#>  8 0.933     1 10:10:21
#>  9 0.731     1 10:10:22
#> 10 0.454     1 10:10:23
#> # … with 91,848 more rows

To make sure everything so far is reproducible within the lesson.Rmd file, we will “Knit” the R Markdown document to output an HTML file. Click the “Knit” button at the top of the Source pane or type Ctrl-Shift-K. If it generates an HTML document without problems, we know our code is at least starting to be reproducible.

6.6 Exercise: Import the Actigraph data

Time: 15 min

Practice some more. Do the same thing with the Actigraph.csv dataset as we did for the RR.csv. But first:

  1. Create a new header at the bottom of the doc/lesson.Rmd file and call it ## Exercise: Import the Actigraph data.
  2. Below the header, create a new code chunk with Ctrl-Alt-I.

Use the same technique as we used for the RR.csv data and read in the Actigraph.csv file from user_1/.

  1. Set the file path to the dataset with here().
  2. Read in a max of 100 rows for n_max and exclude the first column with col_select = -1.
  3. Use spec() to output the column specification and paste the results into the col_types argument. Don’t forget to remove the ...1 = col_skip() line and the .delim = "," line from the cols() specification.

References

Rossi, Alessio, Eleonora Da Pozzo, Dario Menicagli, Chiara Tremolanti, Corrado Priami, Alina Sirbu, David Clifton, Claudia Martini, and David Morelli. 2020. “Multilevel Monitoring of Activity and Sleep in Healthy People.” PhysioNet. https://doi.org/10.13026/CERQ-FC86.