5  Pre-process data as you import it

Warning

🚧 We are doing major changes to this workshop, so much of the content will be changed. 🚧

5.1 Learning objectives

  1. Identify the appropriate R package to use to import data based on the file’s format.

  2. Use the readr package to import data from a CSV file into R, while also applying some initial data cleaning steps during the import.

  3. Continue practicing basic reproducible and open workflows, such as using Git version control and using styler to format your code.

5.2 📖 Reading task: Packages for importing specific data file formats

Time: ~5 minutes.

Before you can effectively analyse your project’s data, you will very likely need to first clean and process it into a format and structure that fits your project’s specific needs. So the first phase of an analysis project is to import the data into R, check it out, and clean and process it. Eventually, you want to have a cleaned version of the raw (or original) data saved in data/, ready for your analysis.

There are many different ways to import data into R, and which one you use depends on the file format the data is stored in. For most file formats, there is a specific R package designed to import it. For instance, below is a list of packages for loading common file formats:

  • haven: For reading (also known as importing or loading) in SAS, SPSS, and Stata files.
  • readxl: For reading in Excel spreadsheets with .xls or .xlsx file endings.
  • googlesheets4: For reading in Google Sheets from their cloud service.
  • utils::read.delim(): This function comes from the core R package {utils}, which also includes related functions like utils::read.csv().
  • data.table::fread(): From the data.table package, this function is used to load in CSV files.
  • arrow::read_parquet(): This function reads in Parquet files, an increasingly common file format for storing large datasets. Parquet is also becoming more common in data analysis work because the imported data can easily connect to the powerful duckdb engine through duckplyr. We won’t cover duckplyr during the workshop, but its advantage is that it uses the same syntax as dplyr, so it is very easy to switch between the two.
  • readr: A package, part of the tidyverse, that is designed to load data stored in text-based files such as CSV with readr::read_csv(), and it is incredibly fast at importing data. Notice the _ in the function name, compared to the . in the utils::read.csv() function name above. This small difference can confuse people, and it is often the cause of the first errors we see when teaching this workshop.

The most common file format you will likely encounter for data is CSV (or Excel), since spreadsheet-structured data is the most common way to record and store data. CSVs are also open and transparent: they are simply text files that are both human-readable and machine-readable, and you don’t need special software to open them, as you might for Excel spreadsheets. Because CSV is so common, we will be using readr::read_csv() and won’t cover the other packages listed above. We list them for your reference and information only.

We’re going to use the readr package because it is well integrated within the tidyverse universe of packages. It’s also very fast at loading in data, which for large datasets can be one of the more time-consuming parts of starting an analysis.

The packages readr, vroom, haven, readxl, and googlesheets4 are all very similar in how you use them, and their documentation is almost identical. So the skills you learn in this workshop with readr can mostly be applied to these other packages. And because readr (which the other packages are based on) has been around for a while, there is a large amount of support and help for using it.

If your data is in CSV format, readr is perfect.

Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩

5.3 Importing the raw data

Take this time to briefly explain the Git interface, what add and commit mean, as well as where the history is located.

Now it’s time to import the data! Before we do that though, we should do one thing first. While you downloaded the data during the pre-workshop tasks, you haven’t yet committed those changes to the Git history. So we will start with that.

Open the Git interface with either the Git icon at the top near the menu bar or with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”). In the interface, select all the current changes by clicking the checkbox beside the files and stage them. Then write a commit message in the text box on the right, something like Setting up the project. Click the “Commit” button, and finally close the Git interface.

Now, let’s get to importing some data! Open up the docs/learning.qmd. A quick way of opening a file in RStudio is to type Ctrl-. (dot or period) and then type learning, which should bring up the file we want to open. Hitting Enter will open the file.

Now with docs/learning.qmd open, we should see a setup code chunk already in the file. This code chunk is where we will put the library() code for loading the tidyverse package, which has readr bundled with it, as well as library(here). It should look like this:

```{r setup}
library(tidyverse)
library(here)
```

This setup code chunk is a special, named code chunk. Whenever you first open a Quarto file or after you restart your R session and then start running R code, RStudio will first look through the Quarto file for any code chunk named setup. When it finds one, it will run the R code in that chunk first.

So it’s good practice to put library() functions as well as functions that load your data in the setup code chunk. For this session, we will only put the library() functions there.

After adding the code to this chunk, run it with Ctrl-Enter so that the packages get loaded. Then, create a new header below the setup code chunk by typing out ## Importing raw data, followed by creating a new code chunk right below it using Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type “new chunk”).
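
After these steps, that part of docs/learning.qmd should look roughly like this, with an empty chunk that we will fill in next:

docs/learning.qmd
## Importing raw data

```{r}
```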

Very briefly explain what here() is doing and why it’s important to use it when working with projects. Specifically, that it ensures reproducibility and that the project is self-contained.

The first argument to the read_csv() function is the path to the file you want to import, and we use the here() function to build it. here() creates file paths relative to the project root, which is where the R Project file is. Any time we want a file path, we use here() to ensure our project is reproducible and self-contained.
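
To see what here() actually produces, you can run it on its own in the Console. The output below is only an illustration; the absolute path will look different on your computer, since it depends on where your project folder is:

Console
here("data-raw/dime/cgm/101.csv")
[1] "/path/to/your-project/data-raw/dime/cgm/101.csv"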

Tip

We covered the here() function in more detail in the Introductory workshop, so won’t go over it in much detail here.

Emphasize and take your time explaining the |> pipe operator, describing how the output of one function goes into the first argument of the next function.

We’ll use the |> pipe operator to pipe the output of here() into read_csv(). The pipe operator is a way to give the output of one function as the input to the next function. We will use this a lot throughout the workshop.
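
For instance, the two hypothetical lines below (the file path is only a placeholder) are equivalent; the pipe simply puts the left-hand side into the first argument of the function on its right:

read_csv(here("some/file.csv"))
here("some/file.csv") |> read_csv()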

There’s a lot of data we could import, so let’s start with importing the continuous glucose monitor data in the cgm/ folder. For now, we will only import participant 101’s data, which is in the 101.csv file. We will assign the output of this code to cgm_101 so we can use it later:

docs/learning.qmd
cgm_101 <- here("data-raw/dime/cgm/101.csv") |>
  read_csv()
Rows: 3564 Columns: 2
── Column specification ────────────────────────────────────────────────
Delimiter: ","
dbl  (1): Historic Glucose mmol/L
dttm (1): Device Timestamp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Run this code with Ctrl-Enter. Before looking at the data, let’s look through the message it creates. The data types of the columns inside a CSV file are not explicit, since CSV is plain text without metadata attached. So when you load data from a CSV file into R with read_csv(), the function will try to guess the data type of each column. Usually it guesses accurately, but it isn’t perfect. This message tells us what read_csv() guessed for each column.

To remove this message, there are two things we can do:

  1. Explicitly tell read_csv() what the column names and data types are using the col_types argument (see the sketch after this list). This is tedious, but often a very powerful and useful step, depending on the issues your data may have.
  2. Tell read_csv() to just guess and to hide the message using show_col_types = FALSE. Usually, read_csv() does a good job of guessing, so for this workshop we will use this option.
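
To illustrate the first option, below is a minimal sketch of what explicitly setting the column types could look like for this CGM data, assuming the names given to cols() exactly match the headers in the raw file:

cgm_101 <- here("data-raw/dime/cgm/101.csv") |>
  read_csv(
    col_types = cols(
      `Device Timestamp` = col_datetime(),
      `Historic Glucose mmol/L` = col_double()
    )
  )

We won’t use this option in the workshop, since read_csv() guesses well enough for this data.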

So in the code, we will add show_col_types = FALSE to read_csv() and then run the code:

docs/learning.qmd
cgm_101 <- here("data-raw/dime/cgm/101.csv") |>
  read_csv(show_col_types = FALSE)

When we run it, we no longer see the message. Great! Now let’s look at the data itself. In the Console, type out cgm_101 and hit enter:

Console
cgm_101
# A tibble: 3,564 × 2
   `Device Timestamp`  `Historic Glucose mmol/L`
   <dttm>                                  <dbl>
 1 2021-03-18 08:15:00                       5.8
 2 2021-03-18 08:30:00                       5.4
 3 2021-03-18 08:45:00                       5.1
 4 2021-03-18 09:01:00                       5.3
 5 2021-03-18 09:16:00                       5.3
 6 2021-03-18 09:31:00                       4.9
 7 2021-03-18 09:46:00                       4.7
 8 2021-03-18 10:01:00                       4.8
 9 2021-03-18 10:16:00                       5.5
10 2021-03-18 10:31:00                       5.7
# ℹ 3,554 more rows

One thing to notice is that the column names are not in a format that is easy to use. For example, the first column is called "Device Timestamp", and the space in the name will make it harder to work with. So we will need to fix up the column names to follow a common style that is easier to code with. That style is snake_case, which uses _ instead of spaces and is all lower case. Snake case is recommended in the tidyverse style guide, and there’s even a package called snakecase that can help us with this!

But how do we fix this as we import the data? Let’s look at the help docs of readr: ?read_csv. Scroll down in the help document and we will see an argument called name_repair that handles the naming of columns. There are several options here, none of which converts names to snake_case. But there is one that says “function: apply custom name repair”. Since the snakecase package already has functions for converting to snake case, we will use this option to give name_repair a function from snakecase.

To find the name of the function to use, go to the Console and start to type out snakecase::. As we type, we will see a list of possible functions to use. Scroll down and find the to_snake_case() function. That’s the one we want to use in name_repair. To convert the variable names to snake case, we add name_repair = snakecase::to_snake_case to the code, like so:

docs/learning.qmd
cgm_101 <- here("data-raw/dime/cgm/101.csv") |>
  read_csv(
    show_col_types = FALSE,
    name_repair = snakecase::to_snake_case
  )

Notice the lack of () when using the function. We’ll explain more about this in later sessions.
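
To see what this function does on its own, you can test it in the Console on one of the original column names:

Console
snakecase::to_snake_case("Device Timestamp")
[1] "device_timestamp"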

We can now look at the data by going to the Console and typing:

Console
cgm_101
# A tibble: 3,564 × 2
   device_timestamp    historic_glucose_mmol_l
   <dttm>                                <dbl>
 1 2021-03-18 08:15:00                     5.8
 2 2021-03-18 08:30:00                     5.4
 3 2021-03-18 08:45:00                     5.1
 4 2021-03-18 09:01:00                     5.3
 5 2021-03-18 09:16:00                     5.3
 6 2021-03-18 09:31:00                     4.9
 7 2021-03-18 09:46:00                     4.7
 8 2021-03-18 10:01:00                     4.8
 9 2021-03-18 10:16:00                     5.5
10 2021-03-18 10:31:00                     5.7
# ℹ 3,554 more rows

5.4 Import less data while prototyping code

As you write code, if it is a bit slow to run, it will be harder for you to be quick at writing, trying, and fixing things. Not just harder from a technical point of view, but also from a motivation and cognitive load perspective: if running code takes a while, you will be less likely to want to test things out, fix things, and get things right.

A big reason for slower code is working with a larger dataset. This is especially true when:

  • you first read the data into R, as larger data will be slower to load into R than smaller data.
  • your work computer is not the most powerful or has other programs (installed or set up by your IT department) that can slow it down.

In both those cases, working with less data means your code runs faster and you get quicker feedback on whether it works, which means less stress and frustration. So a good practice as you write and test code is to import only some of the data. Then, when you are sure the code works as you want and you are ready to do the full analysis, you can import all the data and run the code on it.

So, until the end of the workshop, we will import only a subset of the rows. read_csv() has an argument called n_max that allows you to specify how many rows to import, which we can read about by looking at the ?read_csv help documentation.

For the data we are importing right now, let’s add the n_max argument to load in only the first 100 rows. It could be fewer or more, but it’s good to use enough rows that read_csv() can effectively guess the column data types, while not so many that things slow down. In the code we’ve already written, add:

docs/learning.qmd
cgm_101 <- here("data-raw/dime/cgm/101.csv") |>
  read_csv(
    show_col_types = FALSE,
    name_repair = snakecase::to_snake_case,
    n_max = 100
  )

When we look at the data by going to the Console and typing out the data object:

Console
cgm_101
# A tibble: 100 × 2
   device_timestamp    historic_glucose_mmol_l
   <dttm>                                <dbl>
 1 2021-03-18 08:15:00                     5.8
 2 2021-03-18 08:30:00                     5.4
 3 2021-03-18 08:45:00                     5.1
 4 2021-03-18 09:01:00                     5.3
 5 2021-03-18 09:16:00                     5.3
 6 2021-03-18 09:31:00                     4.9
 7 2021-03-18 09:46:00                     4.7
 8 2021-03-18 10:01:00                     4.8
 9 2021-03-18 10:16:00                     5.5
10 2021-03-18 10:31:00                     5.7
# ℹ 90 more rows

We see that the amount of data we’ve imported is much smaller now. Before moving on to the exercise, let’s:

  • Tidy up the code using styler with the Palette (Ctrl-Shift-P, then type “style file”)
  • Make sure everything is reproducible in the docs/learning.qmd file by rendering the Quarto document to an HTML file. Click the “Render” button at the top of the Source pane, type Ctrl-Shift-K, or use the Palette (Ctrl-Shift-P, then type “render”). If it generates an HTML document without problems, we know our code is at least starting to be reproducible.
  • Add and commit the changes to the Git history through the RStudio Git interface with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).

5.5 🧑‍💻 Exercise: Import participant 101’s sleep data

Time: ~15 minutes.

An important part of learning is to try what we’ve just covered on your own and in a practical way. So, using the same steps as what we just did, import participant 101’s sleep data.

  1. Create a new header at the bottom of the docs/learning.qmd file and call it ## Exercise: Import 101's sleep data.
  2. Below the header, create a new code chunk with Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type “new chunk”).
  3. Copy and paste the code template below into the new code chunk. Begin replacing the ___ with the correct R functions or other information.
    • Give the correct file path in here() to 101’s sleep data in the sleep/ folder.
    • Use read_csv() (not read.csv()).
    • Use show_col_types = FALSE to hide the importing message.
    • Convert the columns to_snake_case using the name_repair argument in read_csv(). Don’t forget to attach the snakecase package to the function with :: and don’t add () at the end of the function.
    • Import only the first several rows with n_max. Instead of 100 like we did above, use another number of your choosing. Maybe fewer rows? Maybe a bit more?
  4. Once you have the code working, run styler with the Palette (Ctrl-Shift-P, then type “style file”), render the Quarto document to HTML to check reproducibility using Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”), and then use the RStudio Git interface to add and commit the changes to the Git history using Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).
docs/learning.qmd
sleep_101 <- here(___) |>
  ___(
    ___, 
    ___,
    ___
  )
sleep_101

The output of the code might look a bit like:

# A tibble: 100 × 3
   date                sleep_type seconds
   <dttm>              <chr>        <dbl>
 1 2021-05-24 23:03:00 wake           540
 2 2021-05-24 23:12:00 light          180
 3 2021-05-24 23:15:00 deep          1440
 4 2021-05-24 23:39:00 light          240
 5 2021-05-24 23:43:00 wake           300
 6 2021-05-24 23:48:00 light          120
 7 2021-05-24 23:50:00 rem           1350
 8 2021-05-25 00:12:30 wake           870
 9 2021-05-25 00:27:00 rem            360
10 2021-05-25 00:33:00 light          210
# ℹ 90 more rows
Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩

5.6 Key takeaways

Quickly cover this before moving on to the discussion activity.

  • Use the appropriate package when importing a specific data file (e.g. for .xlsx use readxl).
  • Use read_csv() for fast importing of plain text type files (e.g. csv).
  • Use the col_select argument in read_csv() to import only the data you need (see the sketch below).
  • Fix column names to match a style by using the name_repair argument in read_csv().
  • For very large datasets, import only the first 100-1000 rows using n_max to test that importing works fine before importing the full dataset.
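
Since we didn’t use col_select in this session, below is a minimal sketch of how it could look with the CGM data, where the quoted column name is assumed to match the header in the raw file:

cgm_101 <- here("data-raw/dime/cgm/101.csv") |>
  read_csv(
    show_col_types = FALSE,
    col_select = "Historic Glucose mmol/L"
  )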

5.7 💬 Discussion activity: Experiences with cleaning data

Time: ~8 minutes.

As we prepare for the next session, get up, walk around, and discuss with your neighbour some of the following questions:

  • How tidy and clean is your data in your research? Do you have to do a lot of processing before you can analyse it?
  • What are some struggles you have with processing, importing, and generally working with data?
  • What file format is your data stored in? Is it easy to work with?

5.8 Code used in this session

This lists some, but not all, of the code used in the section. Some code is incorporated into the Markdown content, so it is harder to automatically list here in a code chunk. The code below also includes the code from the exercises.

cgm_101 <- here("data-raw/dime/cgm/101.csv") |>
  read_csv()
cgm_101 <- here("data-raw/dime/cgm/101.csv") |>
  read_csv(show_col_types = FALSE)
cgm_101
cgm_101 <- here("data-raw/dime/cgm/101.csv") |>
  read_csv(
    show_col_types = FALSE,
    name_repair = snakecase::to_snake_case
  )
cgm_101
cgm_101 <- here("data-raw/dime/cgm/101.csv") |>
  read_csv(
    show_col_types = FALSE,
    name_repair = snakecase::to_snake_case,
    n_max = 100
  )
sleep_101 <- here("data-raw/dime/sleep/101.csv") |>
  read_csv(
    show_col_types = FALSE,
    name_repair = snakecase::to_snake_case,
    n_max = 100
  )
sleep_101