5 Pre-process data as you import it
🚧 We are doing major changes to this workshop, so much of the content will be changed. 🚧
During this session we’ll be introducing concepts for both the “Download raw data” block as well as the “Workflow” block in Figure 5.1. You already downloaded and unzipped the raw data during the pre-workshop tasks and here we’ll continue the process by now importing it into R.
5.1 Learning objectives
Identify the appropriate R package to use to import data based on what format the file is.
Use the readr package to import data from a CSV file into R, while also applying some initial data cleaning steps during the import.
Continue practicing basic reproducible and open workflows, such as using Git version control and using styler to format your code.
5.2 Importing in the raw data
The ultimate goal for the beginning phases of a data analysis project is to eventually save a version of the raw data that is specific to your research questions and enables you to conduct your analyses. The first step to processing data is to import it into R so we can work on it. So for now, we’ll open up the doc/learning.qmd
file so we can start building and testing out the code. There should be a setup
code chunk already be in the file, where we will put the library()
code for loading the tidyverse package, which has readr bundled with it, as well as library(here)
. It should look like this:
```{r setup}
library(tidyverse)
library(here)
```
This setup
code chunk is a special, named code chunk that tells R to run this code chunk first whenever you open this Quarto file and run code inside of the file. It’s in this setup
code chunk that we will add library()
functions when we want to load other packages. After adding this code chunk, create a new header below the setup
code chunk by typing out ## Importing raw data
, followed by creating a new code chunk right below it using Ctrl-Alt-ICtrl-Alt-I or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “new chunk”).
What is readr? It is a package designed to load in data, specifically text-based data files such as CSV. In R there are several packages that you can use to load in data and of different types of file formats. We won’t cover these other packages, so you can use this as a reference for when or if you ever have to load other file types:
- haven: For reading (also known as importing or loading) in SAS, SPSS, and Stata files.
-
readxl: For reading in Excel spreadsheets with
.xls
or.xlsx
file endings. - googlesheets4: For reading in Google Sheets from their cloud service.
- vroom: We originally taught this package, but since then it has been merged in with the readr package, which is why we now teach that instead. This package was designed to read in text-based data files really fast.
-
utils::read.delim()
: This function comes from the core R package utils and includes other functions likeutils::read.csv()
(note the.
in the name, rather than_
in thereadr::read_csv()
version). -
data.table::fread()
: From the data.table package, this function is used to load in CSV files.
We’re using the readr package because it is well integrated within the tidyverse universe of packages. It also runs vroom internally. How fast is vroom? The website has a benchmark page showing how fast it is. For many people, loading in the data can be one of the most time-consuming parts of starting an analysis. Hopefully by using this package, that time can be reduced.
The packages readr, vroom, haven, readxl, and googlesheets4 all are very similar in how you use them and their documentation are almost identical. So the skills you learn in this session with readr can mostly be applied to these other packages. And because readr (which the other packages are based on) has been around for awhile, there is a large amount of support and help for using it.
If your data is in CSV format, readr is perfect. The CSV file format is a commonly used format for data because it is open and transparent, readable by any computer, and doesn’t depend on any special software to open (unlike for e.g. Excel spreadsheets).
Let’s first start by creating an object that has the file path to the dataset, then we’ll use read_csv()
to import that dataset.
You’ll see the output mention using spec()
to use in the argument col_types
. And that it has 5 columns, one called ...1
. If we look at the CSV file though, we see that there are only four columns with names, but that technically there is a first empty column without a column header. If you recall, the user_info.csv
file has an empty column name. Looking at the data dictionary it doesn’t seem there is any reference to this column, so it likely isn’t important. More than likely, readr is complaining about this empty column name and the use of ...1
to represent it. We don’t need it, so we’ll get rid of it when we load in the dataset. But how? Look at the help documentation again. Go to the Console and type out:
Console
?readr::read_csv
Looking at the list of arguments, there is an argument called col_select
that sounds like we could use that to keep or drop columns. It says that it is used similar to dplyr::select()
, which normally is used with actual column names. Our column doesn’t have a name, that’s the problem. Next check the Example section of the help. Scrolling down, you’ll eventually see:
read_csv(input_file, col_select = c(1, 3, 11))
So, it takes numbers! With dplyr::select()
, using the -
before the column name (or number) means to drop the column, so in this case, we could drop the first column with col_select = -1
!
doc/learning.qmd
user_1_info_data <- read_csv(
user_1_info_file,
col_select = -1
)
Hmm, we still have the message. You may or may not see this message depending on the version of your packages (the most updated packages will show this), but you can see it in the code chunk above.
This is read_csv()
letting you know that a column was renamed because it found an empty column with no column name and is letting you know it had to rename it. Even though we excluded it with col_select = -1
, that only excludes it after reading the column names, before importing the data.
To remove this new message, we need to tell read_csv()
exactly how to handle renaming cases. Let’s look at the help docs of readr: ?read_csv
. Scroll down and we see an argument called name_repair
that handles naming of columns. It says that it treats problematic column names, of which a missing column name is definitely a problem. There are several options here, but the one I want to focus on is the comment about “function: apply custom name repair”. This is an important one because we eventually want to rename the columns to match the style guide by using snake_case
, since there is a package to do that called snakecase.
In the Console, type out snakecase::
and hit Tab. You’ll see a list of possible functions to use. We want to use the snakecase one, so scroll down and find the to_snake_case()
. That’s the one we want to use. So to remove the messages and convert the variable names to snake case, we would add name_repair = snakecase::to_snake_case
to the code. Notice the lack of ()
when using the function. We’ll explain more about this in later sessions.
doc/learning.qmd
user_1_info_data <- read_csv(
user_1_info_file,
col_select = -1,
name_repair = snakecase::to_snake_case
)
We can now look at the data:
Console
user_1_info_data
Before moving on to the exercise, run styler with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “style file”), then add and commit the changes to the Git history through the RStudio Git interface with Ctrl-Alt-MCtrl-Alt-M or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “commit”).
5.3 Exercise: Import the saliva data
Time: 15 minutes.
Practice importing data files by doing the same process with the saliva data.
- Create a new header at the bottom of the
doc/learning.qmd
file and call it## Exercise: Import the saliva data
. - Below the header, create a new code chunk with Ctrl-Alt-ICtrl-Alt-I or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “new chunk”).
- Copy and paste the code template below into the new code chunk. Begin replacing the
___
with the correct R functions or other information. - Once you have the code working, run styler with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “style file”) and then use the RStudio Git interface to add and commit the changes to the Git history using Ctrl-Alt-MCtrl-Alt-M or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “commit”).
doc/learning.qmd
<- here("data-raw/mmash/user_1/___")
user_1_saliva_file
<- read_csv(
user_1_saliva_data
user_1_saliva_file,col_select = ___,
name_repair = ___
)
Click for the solution. Only click if you are struggling or are out of time.
user_1_saliva_file <- here("data-raw/mmash/user_1/saliva.csv")
user_1_saliva_data <- read_csv(
user_1_saliva_file,
col_select = -1,
name_repair = snakecase::to_snake_case
)
5.4 Importing larger datasets
Sometimes you may have a dataset that’s just a bit too large. Sometimes readr may not have enough information to guess the data type of the column. Or maybe there are hundreds or thousands of columns in your data and you only want to import specific columns. In these cases, we can do a trick: read in the first few lines of the dataset to prototype your code to make sure it works. Reading fewer data into R is much faster than reading in the entire dataset.
Let’s do this on the RR.csv
file. We can see from the file size that it is bigger than most of the other files (~2Mb). So, we’ll use this technique to decide what we want to keep. First, create a new header ## Import larger datasets
and a new code chunk below it using Ctrl-Alt-ICtrl-Alt-I or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “new chunk”).
Do the same thing that we’ve been doing, but this time we are going to use the argument n_max
, which tells readr how many rows to read into R. In this case, let’s read in 100, since that is the amount readr will guess until. This dataset, like the others, has an empty column that we will drop.
doc/learning.qmd
user_1_rr_file <- here("data-raw/mmash/user_1/RR.csv")
user_1_rr_data <- read_csv(
user_1_rr_file,
n_max = 100,
col_select = -1,
name_repair = snakecase::to_snake_case
)
We’ll use n_max
to help us prototype our code, since it is faster to import only a few rows. We can look at the data:
Console
user_1_rr_data
To make sure everything is so far reproducible within the doc/learning.qmd
file, we will render the Quarto document to output an HTML file. Click the “Render” button at the top of the Source pane or by typing Ctrl-Shift-KCtrl-Shift-K or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “render”). If it generates an HTML document without problems, we know our code is at least starting to be reproducible.
5.5 Exercise: Import the Actigraph data
Time: 15 minutes.
Practice some more. Do the same thing with the Actigraph.csv
dataset as we did for the RR.csv
. But first:
- Create a new header at the bottom of the
doc/learning.qmd
file and call it## Exercise: Import the Actigraph data
. - Below the header, create a new code chunk with Ctrl-Alt-ICtrl-Alt-I or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “new chunk”).
Use the same technique as we used for the RR.csv
data and read in the Actigraph.csv
file from user_1/
.
Set the file path to the dataset with
here()
.Read in a max of 100 rows for
n_max
and exclude the first column withcol_select = -1
.Render the Quarto document with Ctrl-Shift-KCtrl-Shift-K or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “render”) to regenerate the HTML document.
-
If everything works, run styler with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “style file”) and then add and commit the new changes to the Git history with
Ctrl-Alt-MCtrl-Alt-M or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “commit”)
Use this template as a guide for completing this exercise.
doc/learning.qmd
<- here("data-raw/mmash/user_1/___")
user_1_actigraph_file <- ___(
user_1_actigraph_data
___,n_max = ___,
col_select = ___,
name_repair = ___
)
Click for the solution. Only click if you are struggling or are out of time.
# Use first 100 or so lines to get spec
user_1_actigraph_file <- here("data-raw/mmash/user_1/Actigraph.csv")
user_1_actigraph_data <- read_csv(
user_1_actigraph_file,
n_max = 100,
col_select = -1,
name_repair = snakecase::to_snake_case
)
5.6 💬 Discussion activity: Experiences with cleaning data
Time: ~10 minutes.
As we prepare for the next session, get up and discuss with you neighbour some of the following questions:
- Share a positive or exciting experience you had with cleaning, processing, and preparing data for later analysis. It can be recent or in the past, but as long as it was a memorable and positive. What made it positive? Do you try to repeat that experience, or at least the feeling of it?
- Share a challenging experience or barriers that you encounter while cleaning, processing, and preparing data. Why was it challenging or had barriers and how could you limit those for for next time?
5.7 Summary
- Use the appropriate package when importing a specific data file (e.g. for
.xlsx
use readxl). - Use
read_csv()
for fast importing of plain text type files (e.g. csv). - Use
col_select =
arguments inread_csv()
to import only the data you need. - Fix column names to match a style using
name_repair =
argument inread_csv()
. - For very large datasets, importing in the first 100-1000 rows using
n_max =
to test that importing works fine before importing the full dataset.