5 Importing data, fast!
During this session we'll be covering the first block, "Download raw data", in Figure 5.1. You already downloaded and unzipped the raw data during the pre-course tasks, and here we'll continue the process by importing it into R.
Right now, your folder and file structure should look like this (use fs::dir_tree(recurse = 2)
if you want to check using R):
LearnR3
├── data/
│ └── README.md
├── data-raw/
│ ├── README.md
│ ├── mmash-data.zip
│ ├── mmash/
│ │ ├── user_1
│ │ ├── ...
│ │ └── user_22
│ └── mmash.R
├── doc/
│ ├── README.md
│ └── learning.qmd
├── R/
│ ├── functions.R
│ └── README.md
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
└── README.md
5.1 Learning objectives
- Learn how to import data and do minor cleaning with the vroom package.
- Learn about strategies and resources to use when encountering problems when importing data (arguably one of the most important skills in any coding-type work).
- Practice using Git version control as part of the data analysis workflow.
5.2 Importing the raw data
The ultimate goal of the beginning phases of a data analysis project is to eventually save a version of the raw data that you can work on. The first step of that process is to import the raw data into R. So for now, open up the doc/learning.qmd file so we can start building and testing out the code. At the bottom of the file, create a new header by typing out ## Importing raw data. Right below the header, make a new code chunk with Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type "new chunk") and call it setup
. Inside the code chunk, load the vroom package with library(vroom)
as well as library(here)
. Since we’ll also be using the tidyverse package, let’s also add library(tidyverse)
. It should look like this:
```{r setup}
library(tidyverse)
library(vroom)
library(here)
```
This is a special, named code chunk that tells R to run it first whenever you open this Quarto file and run code inside of it. It's in this setup code chunk that we will add library() functions when we want to load other packages. After adding this code chunk, create a new code chunk right below it with Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type "new chunk").
Let’s first start by creating an object that has the file path to the dataset, then we’ll use vroom()
to import that dataset.
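A minimal version of that could look like the following; the path assumes the user_info.csv file inside data-raw/mmash/user_1/ (the file this section works with), and the object names match those used in the rest of the section:

# File path to the user information data for user 1
user_1_info_file <- here("data-raw/mmash/user_1/user_info.csv")
# Let vroom guess the column types for now
user_1_info_data <- vroom(user_1_info_file)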
You’ll see the output mention using spec()
to use in the argument col_types
. And that it has 5 columns, one called ...1
. If we look at the CSV file itself, though, we see that there are only four columns with names… although technically there is a first, empty column without a column header. So, let's figure out what this message means. Let's go to the Console and type out:
?vroom::spec
In the documentation, we see that it says:
“extracts the full column specification from a tibble…”
Without seeing the output, it’s not clear what “specification” means. Use spec()
on the dataset object. In the Console again:
spec(user_1_info_data)
#> cols(
#> ...1 = col_double(),
#> Gender = col_character(),
#> Weight = col_double(),
#> Height = col_double(),
#> Age = col_double(),
#> .delim = ","
#> )
This shows that a specification is a list and description of which columns are imported into R and what data types they are given. For instance, col_double()
means numeric (double is how computers represent non-integer numbers) and col_character()
means a character data type. Next, let’s see what the message meant about col_types
. Let’s check out the help documentation for vroom()
by typing in the Console:
?vroom::vroom
And if we scroll down to the explanation of col_types
:
"One of NULL, a cols() specification, or a string. See vignette("readr") for more details."
It says to use a “cols() specification”, which is the output of spec()
. Copy the output from spec()
and paste it into the col_types
argument of vroom()
.
user_1_info_data <- vroom(
user_1_info_file,
col_types = cols(
...1 = col_double(),
Gender = col_character(),
Weight = col_double(),
Height = col_double(),
Age = col_double(),
.delim = ","
)
)
#> New names:
#> • `` -> `...1`
Hmm. A warning and an error message. Let’s deal with the error message first. We copied and pasted, so what’s going on? If you recall, the user_info.csv
file has an empty column name. Looking at the data dictionary it doesn’t seem there is any reference to this column, so it likely isn’t important. More than likely, vroom is complaining about this empty column name and the use of ...1
to represent it. We don’t need it, so we’ll get rid of it when we load in the dataset. But how? Look at the help documentation again. Go to the Console and type out:
?vroom::vroom
Looking at the list of arguments, there is an argument called col_select
that sounds like it could be used to keep or drop columns. It says that it is used similarly to dplyr::select(), which normally works with actual column names. Our column doesn't have a name; that's the problem. Next, check the Examples section of the help. Scrolling down, you'll eventually see:
vroom(input_file, col_select = c(1, 3, 11))
So, it takes numbers! With dplyr::select()
, using the -
before the column name (or number) means to drop the column, so in this case, we could drop the first column with col_select = -1
!
user_1_info_data <- vroom(
user_1_info_file,
col_select = -1,
col_types = cols(
Gender = col_character(),
Weight = col_double(),
Height = col_double(),
Age = col_double(),
.delim = ","
)
)
Amazing! We did it! 😁
But… we still have the warning message. You may or may not see this message depending on the version of your packages (the most updated packages will show this), but you can see it in the code chunk above.
This is vroom() letting you know that it found an empty column name and had to rename it. Even though we excluded that column with col_select = -1, the exclusion only happens after the column names have been read, before the data are imported.
To remove this new message, we need to tell vroom()
exactly how to handle renaming cases. Let’s look at the help docs of vroom: ?vroom
. Scroll down and we see an argument called .name_repair
that handles naming of columns. Going into the provided link takes us to the tibble::tibble()
help documentation. Scroll down to the .name_repair
argument documentation and it says that it handles problematic column names, of which a missing column name is definitely a problem. There are several options here, but the one I want to focus on is the comment about "function: apply custom name repair". This is an important one because we eventually want to rename the columns to match the style guide's snake_case, and there is a package called snakecase that can do exactly that.
In the Console, type out snakecase::
and hit Tab. You'll see a list of possible functions to use. We want the snake case one, so scroll down and find to_snake_case(). So, to remove the message and convert the variable names to snake case, we would add .name_repair = snakecase::to_snake_case
to the code. Notice the lack of ()
when using the function. We’ll explain more about this in later sessions.
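If you want a feel for what the function does before using it, you can try it on a single string in the Console (a quick illustrative check, not part of the analysis code; the string is just an example):

# Convert an arbitrary string to snake case
snakecase::to_snake_case("Body Weight")
#> [1] "body_weight"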
user_1_info_data <- vroom(
user_1_info_file,
col_select = -1,
col_types = cols(
Gender = col_character(),
Weight = col_double(),
Height = col_double(),
Age = col_double()
),
.name_repair = snakecase::to_snake_case
)
Now we have another warning though. It’s probably because the names are repaired to snake case before the col_types
are used. Let’s try repeating the vroom()
to spec()
to vroom()
workflow we just used to see if that fixes the problem.
user_1_info_data <- vroom(
user_1_info_file,
col_select = -1,
.name_repair = snakecase::to_snake_case
)
spec(user_1_info_data)
#> cols(
#> col_skip(),
#> gender = col_character(),
#> weight = col_double(),
#> height = col_double(),
#> age = col_double(),
#> .delim = ","
#> )
user_1_info_data <- vroom(
user_1_info_file,
col_select = -1,
col_types = cols(
gender = col_character(),
weight = col_double(),
height = col_double(),
age = col_double(),
.delim = ","
),
.name_repair = snakecase::to_snake_case
)
That fixed it! No more messages or warnings! We can now look at the data:
user_1_info_data
#> # A tibble: 1 × 4
#> gender weight height age
#> <chr> <dbl> <dbl> <dbl>
#> 1 M 65 169 29
Why might we use spec()
and col_types
? Depending on the size of the dataset, it could take a long time to load everything, which may not be very efficient if you only intend to use some parts of it and not all of it. And sometimes vroom guesses the column types incorrectly, so using col_types = cols() lets you fix those guesses.
If you have a lot of columns in your dataset, then you can make use of col_select
or cols_only()
to keep only the columns you want (see the sketch just below). Before moving on to the exercise, add and commit the changes to the Git history through the RStudio Git interface with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type "commit").
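As a rough sketch of that last point (this isn't part of the lesson's own code), cols_only() could be applied to the user info data to keep and type only a couple of columns; the lowercase names and the object name user_1_info_data_subset are illustrative, and assume the snake case repair has already happened:

# Hypothetical example: keep only the gender and weight columns,
# skipping everything else (including the unnamed first column)
user_1_info_data_subset <- vroom(
  user_1_info_file,
  col_types = cols_only(
    gender = col_character(),
    weight = col_double()
  ),
  .name_repair = snakecase::to_snake_case
)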
5.3 Exercise: Import the saliva data
Time: 15 minutes.
Practice importing data files by doing the same process with the saliva data.
- Create a new header at the bottom of the doc/learning.qmd file and call it ## Exercise: Import the saliva data.
- Below the header, create a new code chunk with Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type "new chunk").
- Copy and paste the code template below into the new code chunk. Begin replacing the ___ with the correct R functions or other information.
- Once you have the code working, use the RStudio Git interface to add and commit the changes to the Git history using Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type "commit").
user_1_saliva_file <- here("data-raw/mmash/user_1/___")
user_1_saliva_data_prep <- vroom(
  user_1_saliva_file,
  col_select = ___,
  .name_repair = ___
)
___(user_1_saliva_data_prep)
user_1_saliva_data <- vroom(
  user_1_saliva_file,
  col_select = ___,
  col_types = ___,
  .name_repair = ___
)
Click for the solution. Only click if you are struggling or are out of time.
user_1_saliva_file <- here("data-raw/mmash/user_1/saliva.csv")
user_1_saliva_data_prep <- vroom(
user_1_saliva_file,
col_select = -1,
.name_repair = snakecase::to_snake_case
)
spec(user_1_saliva_data_prep)
user_1_saliva_data <- vroom(
user_1_saliva_file,
col_select = -1,
col_types = cols(
samples = col_character(),
cortisol_norm = col_double(),
melatonin_norm = col_double()
),
.name_repair = snakecase::to_snake_case
)
5.4 Importing larger datasets
Sometimes you may have a dataset that’s just a bit too large. Sometimes vroom may not have enough information to guess the data type of the column. Or maybe there are hundreds or thousands of columns in your data and you only want to import specific columns. In these cases, we can do a trick: read in the first few lines of the dataset, use spec()
and paste the output into the col_types argument, and then keep only the columns you want.
Let’s do this on the RR.csv
file. We can see from the file size that it is bigger than most of the other files (~2 MB). So, we'll use this technique to decide what we want to keep. First, create a new header ## Import larger datasets and a new code chunk below it with Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type "new chunk").
Do the same thing that we’ve been doing, but this time we are going to use the argument n_max
, which tells vroom how many rows to read into R. In this case, let's read in the first 100 rows, since that is how many rows vroom uses by default when guessing the column types. This dataset, like the others, has a first column with an empty name that we will drop.
user_1_rr_file <- here("data-raw/mmash/user_1/RR.csv")
user_1_rr_data_prep <- vroom(
user_1_rr_file,
n_max = 100,
col_select = -1,
.name_repair = snakecase::to_snake_case
)
spec(user_1_rr_data_prep)
#> cols(
#> col_skip(),
#> ibi_s = col_double(),
#> day = col_double(),
#> time = col_time(format = ""),
#> .delim = ","
#> )
Like last time, copy and paste the output into a new call to vroom(). Remove the col_skip() line for the dropped first column (the code below also drops the .delim line). Add the .name_repair argument with snakecase::to_snake_case. Don't forget to also remove the trailing , left at the end! And make sure to remove the n_max argument, since we now want to import the whole dataset.
user_1_rr_data <- vroom(
user_1_rr_file,
col_select = -1,
col_types = cols(
ibi_s = col_double(),
day = col_double(),
# Converts to seconds
time = col_time(format = "")
),
.name_repair = snakecase::to_snake_case
)
There’s a new column type: col_time()
. To see all the other types of column specifications, in the Console type out col_
and then hit the Tab key. You’ll see other types, like col_date
, col_factor
, and so on. Right, back to the data, what does it look like?
user_1_rr_data
#> # A tibble: 91,858 × 3
#> ibi_s day time
#> <dbl> <dbl> <hms>
#> 1 0.258 1 36617 secs
#> 2 0.319 1 36618 secs
#> 3 0.266 1 36618 secs
#> 4 0.401 1 36618 secs
#> 5 1.09 1 36619 secs
#> 6 0.752 1 36620 secs
#> 7 0.337 1 36620 secs
#> 8 0.933 1 36621 secs
#> 9 0.731 1 36622 secs
#> 10 0.454 1 36623 secs
#> # ℹ 91,848 more rows
To make sure everything we've done so far is reproducible within the doc/learning.qmd file, we will render the Quarto document to output an HTML file. Click the "Render" button at the top of the Source pane, use Ctrl-Shift-K, or use the Palette (Ctrl-Shift-P, then type "render"). If it generates an HTML document without problems, we know our code is at least starting to be reproducible.
5.5 Exercise: Import the Actigraph data
Time: 15 minutes.
Practice some more. Do the same thing with the Actigraph.csv
dataset as we did for the RR.csv
. But first:
- Create a new header at the bottom of the doc/learning.qmd file and call it ## Exercise: Import the Actigraph data.
- Below the header, create a new code chunk with Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type "new chunk").
Use the same technique as we used for the RR.csv
data and read in the Actigraph.csv
file from user_1/
.
- Set the file path to the dataset with here().
- Read in a max of 100 rows with n_max and exclude the first column with col_select = -1.
- Use spec() to output the column specification and paste the results into the col_types argument. Don't forget to remove the ...1 = col_skip() (or NA = col_skip()) line from the cols() function.
- Render the Quarto document with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type "render") to regenerate the HTML document.
- If everything works, add and commit the new changes to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type "commit").
Use this template as a guide for completing this exercise.
user_1_actigraph_file <- here("data-raw/mmash/user_1/___")
user_1_actigraph_data_prep <- ___(
  ___,
  n_max = ___,
  col_select = ___,
  .name_repair = ___
)
___(user_1_actigraph_data_prep)
user_1_actigraph_data <- ___(
  user_1_actigraph_file,
  col_select = ___,
  col_types = ___(
    ___
  ),
  ___ = ___
)
Click for the solution. Only click if you are struggling or are out of time.
# Use first 100 or so lines to get spec
user_1_actigraph_file <- here("data-raw/mmash/user_1/Actigraph.csv")
user_1_actigraph_data_prep <- vroom(
user_1_actigraph_file,
n_max = 100,
col_select = -1,
.name_repair = snakecase::to_snake_case
)
spec(user_1_actigraph_data_prep)
user_1_actigraph_data <- vroom(
user_1_actigraph_file,
col_select = -1,
col_types = cols(
axis_1 = col_double(),
axis_2 = col_double(),
axis_3 = col_double(),
steps = col_double(),
hr = col_double(),
inclinometer_off = col_double(),
inclinometer_standing = col_double(),
inclinometer_sitting = col_double(),
inclinometer_lying = col_double(),
vector_magnitude = col_double(),
day = col_double(),
time = col_time(format = "")
),
.name_repair = snakecase::to_snake_case
)
5.6 Summary
- Use the appropriate package when importing a specific data file (e.g. for .xlsx use readxl).
- Use vroom() for fast importing of plain text type files (e.g. csv).
- Use spec() to diagnose importing issues.
- Use the col_types = and col_select = arguments in vroom() to import only the data you need.
- Fix column names to match a style using the .name_repair = argument in vroom().
- For very large datasets, import the first 100-1000 rows using n_max = to test that importing works fine before importing the full dataset.