Want to help out or contribute?

If you find any typos, errors, or places where the text may be improved, please let us know by providing feedback either in the feedback survey (given during class) or by using GitLab.

On GitLab open an issue or submit a merge request by clicking the "Edit this page " button on the side of this page.

6 Importing data, fast!

During this session we’ll be covering the first block, “Download raw data” in Figure 6.1. You already downloaded and unzipped the raw data during the pre-course tasks and here we’ll continue the process by now importing it into R.

Section of the overall workflow we will be covering.

Figure 6.1: Section of the overall workflow we will be covering.

Right now, your folder and file structure should look like (use fs::dir_tree(recurse = 2) if you want to check using R):

LearnR3
├── data/
│   └── README.md
├── data-raw/
│   ├── README.md
│   ├── mmash-data.zip
│   ├── mmash/
│   │  ├── user_1
│   │  ├── ...
│   │  └── user_22
│   └── mmash.R
├── doc/
│   ├── README.md
│   └── lesson.Rmd
├── R/
│   ├── functions.R
│   └── README.md
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
└── README.md

6.1 Learning objectives

  1. Learn how to import data and do minor cleaning with the vroom package.
  2. Learn about strategies and resources to use when encountering problems when importing data (arguably one of the most important skills in any coding-type work).
  3. Practice using Git version control as part of the data analysis workflow.

6.2 Importing in the raw data

The ultimate goal for the beginning phases of a data analysis project is to eventually save a version of the raw data that you can work on. The first step to processing data is to import it into R so we can work on it. So for now, we’ll open up the doc/lesson.Rmd file so we can start building and testing out the code. At the bottom of the file, create a new header by typing out ## Importing raw data. Right below the header, make a new code chunk with Ctrl-Alt-I and call it setup. Inside the code chunk, load the vroom package with library(vroom) as well as library(here). Since we’ll also be using the tidyverse package, let’s also add library(tidyverse). It should look like this:

```{r setup}
library(tidyverse)
library(vroom)
library(here)
```

This is a special, named code chunk that tells R to run this code chunk first whenever you open this R Markdown file and run code inside of the file. It’s in this setup code chunk that we will add library() functions when we want to load other packages. After adding this code chunk, create a new code chunk (Ctrl-Alt-I) right below it.

Take 5 minutes to read the next paragraphs until it says to stop.

What is vroom? It is a package designed to load in data, specifically text-based data files such as CSV. In R there are several packages that you can use to load in data and of different types of file formats. We won’t cover these other packages, but you can use this as a reference for when or if you ever have to load other file types:

  • haven: For reading (also known as importing or loading) in SAS, SPSS, and Stata files.
  • readxl: For reading in Excel spreadsheets with .xls or .xlsx file endings.
  • googlesheets4: For reading in Google Sheets from their cloud service.
  • readr: Standard package used to load in text-based data files like CSV. This package is included by default within the tidyverse.
  • utils::read.delim(): This function comes from the core R package utils and includes other functions like utils::read.csv().
  • data.table::fread(): From the data.table package, this function is used to load in CSV files.

We’re using the vroom package for largely one reason: It makes use of recent improvements in R that allow data to be imported into R very very quickly. Just how fast? The vroom website has a benchmark page showing how fast it is. For many people, loading in the data can be one of the most time-consuming parts of starting an analysis. Hopefully by using this package, that time can be reduced.

The packages readr, vroom, haven, readxl, and googlesheets4 all are very similar in how you use them and their documentation are almost identical. So the skills you learn in this session with vroom can mostly be applied to these other packages. And because readr (which the other packages are based on) has been around for awhile, there is a large amount of support and help for using it. If you’re curious to learn more about vroom, check out the website.

If your data is in CSV format, vroom is perfect. The CSV file format is a commonly used format for data because it is open and transparent, readable by any computer, and doesn’t depend on any special software to open (unlike for e.g. Excel spreadsheets). Please stop reading now and we’ll go over this together.

Let’s first start by creating an object that has the file path to the dataset, then we’ll use vroom() to import that dataset.

user_1_info_file <- here("data-raw/mmash/user_1/user_info.csv")
user_1_info_data <- vroom(user_1_info_file)
#> New names:
#> Rows: 1 Columns: 5
#> ── Column specification
#> ──────────────────────────────────────────────────────── Delimiter: "," chr
#> (1): Gender dbl (4): ...1, Weight, Height, Age
#> ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
#> Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> • `` -> `...1`

You’ll see the output mention using spec() to use in the argument col_types. And that it has 5 columns, one called ...1. If we look at the CSV file though, we see that there are only four columns with names… but that technically there is a first empty column without a column header. So, let’s figure out what this message means. Let’s go to the Console and type out:

?vroom::spec

In the documentation, we see that it says:

“extracts the full column specification from a tibble…”

Without seeing the output, it’s not clear what “specification” means. Use spec() on the dataset object. In the Console again:

spec(user_1_info_data)
#> cols(
#>   ...1 = col_double(),
#>   Gender = col_character(),
#>   Weight = col_double(),
#>   Height = col_double(),
#>   Age = col_double()
#> )

This shows that a specification is a list and description of which columns are imported into R and what data types they are given. For instance, col_double() means numeric (double is how computers represent non-integer numbers) and col_character() means a character data type. Next, let’s see what the message meant about col_types. Let’s check out the help documentation for vroom() by typing in the Console:

?vroom::vroom

And if we scroll down to the explanation of col_types:

“One of NULL, a cols() specification, or a string. See vignette(”readr”) for more details.”

It says to use a “cols() specification”, which is the output of spec(). Copy the output from spec() and paste it into the col_types argument of vroom().

user_1_info_data <- vroom(
    user_1_info_file,
    col_types = cols(
        ...1 = col_double(),
        Gender = col_character(),
        Weight = col_double(),
        Height = col_double(),
        Age = col_double(),
        .delim = ","
    )
)
#> New names:
#> • `` -> `...1`

Hmm. A warning and an error message. Let’s deal with the error message first. We copied and pasted, so what’s going on? If you recall, the user_info.csv file has an empty column name. Looking at the data dictionary it doesn’t seem there is any reference to this column, so it likely isn’t important. More than likely, vroom is complaining about this empty column name and the use of ...1 to represent it. We don’t need it, so we’ll get rid of it when we load in the dataset. But how? Look at the help documentation again. Go to the Console and type out:

?vroom::vroom

Looking at the list of arguments, there is an argument called col_select that sounds like we could use that to keep or drop columns. It says that it is used similar to dplyr::select(), which normally is used with actual column names. Our column doesn’t have a name, that’s the problem. Next check the Example section of the help. Scrolling down, you’ll eventually see:

vroom(input_file, col_select = c(1, 3, 11))

So, it takes numbers! With dplyr::select(), using the - before the column name (or number) means to drop the column, so in this case, we could drop the first column with col_select = -1!

user_1_info_data <- vroom(
    user_1_info_file,
    col_select = -1,
    col_types = cols(
        Gender = col_character(),
        Weight = col_double(),
        Height = col_double(),
        Age = col_double(),
        .delim = ","
    )
)
#> New names:
#> • `` -> `...1`

Amazing! We did it 😁

But… we still have the warning message. You may or may not see this message depending on the version of your packages (the most updated packages will show this), but you can see it in the code chunk above.

This is vroom() letting you know that a column was renamed because it found an empty column with no column name and is letting you know it had to rename it. Even though we excluded it with col_select = -1, that only excludes it after reading the column names, before importing the data.

To remove this new message, we need to tell vroom() exactly how to handle renaming cases. Let’s look at the help docs of vroom: ?vroom. Scroll down and we see an argument called .name_repair that handles naming of columns. Going into the provided link takes us to the tibble::tibble() help documentation. Scroll down to the .name_repair argument documentation and it says that it treats problematic column names, of which a missing column name is definitely a problem. There are several options here, but the one I want to focus on is the comment about “function: apply custom name repair”. This is an important one because we eventually want to rename the columns to match the style guide by using snake_case, since there is a package to do that called snakecase.

In the Console, type out snakecase:: and hit Tab. You’ll see a list of possible functions to use. We want to use the snakecase one, so scroll down and find the to_snake_case(). That’s the one we want to use. So to remove the messages and convert the variable names to snake case, we would add .name_repair = snakecase::to_snake_case to the code. Notice the lack of () when using the function. We’ll explain more about this in later sessions.

user_1_info_data <- vroom(
    user_1_info_file,
    col_select = -1,
    col_types = cols(
        Gender = col_character(),
        Weight = col_double(),
        Height = col_double(),
        Age = col_double()
    ),
    .name_repair = snakecase::to_snake_case
)
#> Warning: The following named parsers don't match the column names: Gender,
#> Weight, Height, Age

Now we have another warning though. It’s probably because the names are repaired to snake case before the col_types are used. Let’s try repeating the vroom() to spec() to vroom() workflow we just used to see if that fixes the problem.

user_1_info_data <- vroom(
    user_1_info_file,
    col_select = -1,
    .name_repair = snakecase::to_snake_case
)
#> Rows: 1 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): gender
#> dbl (3): weight, height, age
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
spec(user_1_info_data)
#> cols(
#>   col_skip(),
#>   gender = col_character(),
#>   weight = col_double(),
#>   height = col_double(),
#>   age = col_double()
#> )
user_1_info_data <- vroom(
    user_1_info_file,
    col_select = -1,
    col_types = cols(
        gender = col_character(),
        weight = col_double(),
        height = col_double(),
        age = col_double(),
        .delim = ","
    ),
    .name_repair = snakecase::to_snake_case
)

That fixed it! No more messages or warnings! We can now look at the data:

user_1_info_data
#> # A tibble: 1 × 4
#>   gender weight height   age
#>   <chr>   <dbl>  <dbl> <dbl>
#> 1 M          65    169    29

Why might we use spec() and col_types? Depending on the size of the dataset, it could take a long time to load everything, which may not be very efficient if you only intend to use some parts of the dataset and not all of it. And sometimes, spec() incorrectly guesses the column types, so using col_types = cols() can fix those problems.

If you have a lot of columns in your dataset, then you can make use of col_select or cols_only() to keep only the columns you want. Before moving on to the exercise, add and commit the changes to the Git history through the RStudio Git interface with Ctrl-Alt-M, through the Git tab, or the Git button.

6.3 Exercise: Import the saliva data

Time: 15 min

Practice importing data files by doing the same process with the saliva data.

  1. Create a new header at the bottom of the doc/lesson.Rmd file and call it ## Exercise: Import the saliva data.
  2. Below the header, create a new code chunk with Ctrl-Alt-I.
  3. Copy and paste the code template below into the new code chunk. Begin replacing the ___ with the correct R functions or other information.
  4. Once you have the code working, use the RStudio Git interface to add and commit the changes into the version history.
user_1_saliva_file <- here("data-raw/mmash/user_1/___")
user_1_saliva_data_prep <- vroom(
    user_1_saliva_file,
    col_select = ___,
    .name_repair = ___
)
___(user_1_saliva_data_prep)

user_1_saliva_data <- vroom(
    user_1_saliva_file,
    col_select = ___,
    col_types = ___,
    .name_repair = ___
)
Click for the (possible) solution. Click only if you are really struggling or you are out of time for the exercise.

user_1_saliva_file <- here("data-raw/mmash/user_1/saliva.csv")
user_1_saliva_data_prep <- vroom(
    user_1_saliva_file,
    col_select = -1,
    .name_repair = snakecase::to_snake_case
)
#> Rows: 2 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): samples
#> dbl (2): cortisol_norm, melatonin_norm
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
spec(user_1_saliva_data_prep)

user_1_saliva_data <- vroom(
    user_1_saliva_file,
    col_select = -1,
    col_types = cols(
        samples = col_character(),
        cortisol_norm = col_double(),
        melatonin_norm = col_double()  
    ), 
    .name_repair = snakecase::to_snake_case
)

6.4 Importing larger datasets

Sometimes you may have a dataset that’s just a bit too large. Sometimes vroom may not have enough information to guess the data type of the column. Or maybe there are hundreds or thousands of columns in your data and you only want to import specific columns. In these cases, we can do a trick: read in the first few lines of the dataset, use spec() and paste in the output into the col_type argument, and then keep only the columns you want to keep.

Let’s do this on the RR.csv file. We can see from the file size that it is bigger than most of the other files (~2Mb). So, we’ll use this technique to decide what we want to keep. First, create a new header ## Import larger datasets and a new code chunk below it (with Ctrl-Alt-I).

Do the same thing that we’ve been doing, but this time we are going to use the argument n_max, which tells vroom how many rows to read into R. In this case, let’s read in 100, since that is the amount vroom will guess until. This dataset, like the others, has an empty column that we will drop.

user_1_rr_file <- here("data-raw/mmash/user_1/RR.csv")
user_1_rr_data_prep <- vroom(
    user_1_rr_file,
    n_max = 100,
    col_select = -1,
    .name_repair = snakecase::to_snake_case
)
#> Rows: 100 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl  (2): ibi_s, day
#> time (1): time
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
spec(user_1_rr_data_prep)
#> cols(
#>   col_skip(),
#>   ibi_s = col_double(),
#>   day = col_double(),
#>   time = col_time(format = "")
#> )

Like with last time, copy and paste the output into a new use of vroom(). Remove the ...1 line. Add the .name_repair argument with snakecase::to_snake_case. Don’t forget to also remove the last , at the end! Make sure to remove the n_max argument, since we want to import in the whole dataset.

user_1_rr_data <- vroom(
    user_1_rr_file,
    col_select = -1,
    col_types = cols(
        ibi_s = col_double(),
        day = col_double(),
        # Converts to seconds
        time = col_time(format = "")
    ),
    .name_repair = snakecase::to_snake_case
) 

There’s a new column type: col_time(). To see all the other types of column specifications, in the Console type out col_ and then hit the Tab key. You’ll see other types, like col_date, col_factor, and so on. Right, back to the data, what does it look like?

user_1_rr_data
#> # A tibble: 91,858 × 3
#>    ibi_s   day time    
#>    <dbl> <dbl> <time>  
#>  1 0.258     1 10:10:17
#>  2 0.319     1 10:10:18
#>  3 0.266     1 10:10:18
#>  4 0.401     1 10:10:18
#>  5 1.09      1 10:10:19
#>  6 0.752     1 10:10:20
#>  7 0.337     1 10:10:20
#>  8 0.933     1 10:10:21
#>  9 0.731     1 10:10:22
#> 10 0.454     1 10:10:23
#> # … with 91,848 more rows

To make sure everything is so far reproducible within the lesson.Rmd file, we will “Knit” the R Markdown document to output an HTML file. Click the “Knit” button at the top of the Source pane or by typing Ctrl-Shift-K. If it generates an HTML document without problems, we know our code is at least starting to be reproducible.

6.5 Exercise: Import the Actigraph data

Time: 15 min

Practice some more. Do the same thing with the Actigraph.csv dataset as we did for the RR.csv. But first:

  1. Create a new header at the bottom of the doc/lesson.Rmd file and call it ## Exercise: Import the Actigraph data.
  2. Below the header, create a new code chunk with Ctrl-Alt-I.

Use the same technique as we used for the RR.csv data and read in the Actigraph.csv file from user_1/.

  1. Set the file path to the dataset with here().
  2. Read in a max of 100 rows for n_max and exclude the first column with col_select = -1.
  3. Use spec() to output the column specification and paste the results into col_types(). Don’t forget to remove the ...1 = col_skip() (or NA = col_skip()) line from the cols() function.
  4. Knit the R Markdown document (Ctrl-Shift-K or the “Knit” button) to regenerate the HTML document.
  5. If everything works, add and commit the new changes to the Git history (Ctrl-Alt-M or with the Git button).

Use this template as a guide for completing this exercise.

user_1_actigraph_file <- here("data-raw/mmash/user_1/___")
user_1_actigraph_data_prep <- ___(
    ___,
    n_max = ___,
    col_select = ___,
    .name_repair = ___
)
___(user_1_actigraph_data_prep)

user_1_actigraph_data <- ___(
    user_1_actigraph_file,
    col_select = ___,
    col_types = ___(
        ___
    ),
    ___ = ___
)
Click for the (possible) solution. Click only if you are really struggling or you are out of time for the exercise.

# Use first 100 or so lines to get spec
user_1_actigraph_file <- here("data-raw/mmash/user_1/Actigraph.csv")
user_1_actigraph_data_prep <- vroom(
    user_1_actigraph_file,
    n_max = 100,
    col_select = -1,
    .name_repair = snakecase::to_snake_case
)
#> Rows: 100 Columns: 12
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl  (11): axis_1, axis_2, axis_3, steps, hr, inclinometer_off, inclinometer...
#> time  (1): time
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
spec(user_1_actigraph_data_prep)

user_1_actigraph_data <- vroom(
    user_1_actigraph_file,
    col_select = -1,
    col_types = cols(
        axis_1 = col_double(),
        axis_2 = col_double(),
        axis_3 = col_double(),
        steps = col_double(),
        hr = col_double(),
        inclinometer_off = col_double(),
        inclinometer_standing = col_double(),
        inclinometer_sitting = col_double(),
        inclinometer_lying = col_double(),
        vector_magnitude = col_double(),
        day = col_double(),
        time = col_time(format = "")
    ),
    .name_repair = snakecase::to_snake_case
)

6.6 Summary

For instructors: Click for details.

Quickly cover this before finishing the session and when starting the next session.

  • Use the appropriate package when importing a specific data file (e.g. for .xlsx use readxl)
  • Use vroom() for fast importing of plain text type files (e.g. csv)
  • Use spec() to diagnose importing issues
  • Use col_types = and col_select = arguments in vroom() to import only the data you need
  • Fix column names to match a style using .name_repair = argument in vroom()
  • For very large datasets, importing in the first 100-1000 rows using n_max = to test that importing works fine before importing the full dataset