Want to help out or contribute?

If you find any typos, errors, or places where the text could be improved, please let us know through the feedback survey (given during class), through GitLab, or directly in this document with hypothes.is annotations.

  • Open an issue or submit a merge request on GitLab.
  • Add an annotation using hypothes.is. To add an annotation, select some text and then click the annotate button on the pop-up menu. To see the annotations of others, click the icon in the upper right-hand corner of the page.

A Resources

A.1 For this course

A.1.1 Potential exercise solutions

These are potential solutions to the exercises, given as R code only. They are mostly intended as a resource for after the class, not during it.

Exercise: Import the saliva data

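# Packages providing the un-prefixed here(), vroom(), cols(), and spec() calls.
library(here)
library(vroom)

user_1_saliva_file <- here("data-raw/mmash/user_1/saliva.csv")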
user_1_saliva_data_prep <- vroom(user_1_saliva_file,
                                 col_select = -1)
spec(user_1_saliva_data_prep)

user_1_saliva_data <- vroom(
    user_1_saliva_file,
    col_select = -1,
    col_types = cols(
        SAMPLES = col_character(),
        `Cortisol NORM` = col_double(),
        `Melatonin NORM` = col_double()
    ),
    .name_repair = snakecase::to_snake_case
)

Exercise: Import the Actigraph data

# Use first 100 or so lines to get spec
user_1_actigraph_file <- here("data-raw/mmash/user_1/Actigraph.csv")
user_1_actigraph_data_prep <- vroom(user_1_actigraph_file,
                                    n_max = 100,
                                    col_select = -1)
spec(user_1_actigraph_data_prep)

user_1_actigraph_data <- vroom(
    user_1_actigraph_file,
    col_select = -1,
    col_types = cols(
        Axis1 = col_double(),
        Axis2 = col_double(),
        Axis3 = col_double(),
        Steps = col_double(),
        HR = col_double(),
        `Inclinometer Off` = col_double(),
        `Inclinometer Standing` = col_double(),
        `Inclinometer Sitting` = col_double(),
        `Inclinometer Lying` = col_double(),
        `Vector Magnitude` = col_double(),
        day = col_double(),
        time = col_time(format = "")
    ),
    .name_repair = snakecase::to_snake_case
)

Exercise: Repeat with the saliva data

# Excluding the Roxygen docs.
import_saliva <- function(file_path) {
    saliva_data <- vroom(
        file_path,
        col_select = -1,
        col_types = cols(
            SAMPLES = col_character(),
            `Cortisol NORM` = col_double(),
            `Melatonin NORM` = col_double()
        ),
        .name_repair = snakecase::to_snake_case
    )
    return(saliva_data)
}

Exercise: Move and update the rest of the functions

# Excluding the Roxygen docs.
import_saliva <- function(file_path) {
    saliva_data <- vroom::vroom(
        file_path,
        col_select = -1,
        col_types = vroom::cols(
            SAMPLES = vroom::col_character(),
            `Cortisol NORM` = vroom::col_double(),
            `Melatonin NORM` = vroom::col_double()
        ),
        .name_repair = snakecase::to_snake_case
    )
    return(saliva_data)
}
import_rr <- function(file_path) {
    rr_data <- vroom::vroom(
        file_path,
        col_select = -1,
        col_types = vroom::cols(
            ibi_s = vroom::col_double(),
            day = vroom::col_double(),
            # Converts to seconds
            time = vroom::col_time(format = "")
        ),
        .name_repair = snakecase::to_snake_case
    ) 
    return(rr_data)
}
import_actigraph <- function(file_path) {
    actigraph_data <- vroom::vroom(
        file_path,
        col_select = -1,
        col_types = vroom::cols(
            Axis1 = vroom::col_double(),
            Axis2 = vroom::col_double(),
            Axis3 = vroom::col_double(),
            Steps = vroom::col_double(),
            HR = vroom::col_double(),
            `Inclinometer Off` = vroom::col_double(),
            `Inclinometer Standing` = vroom::col_double(),
            `Inclinometer Sitting` = vroom::col_double(),
            `Inclinometer Lying` = vroom::col_double(),
            `Vector Magnitude` = vroom::col_double(),
            day = vroom::col_double(),
            time = vroom::col_time(format = "")
        ),
        .name_repair = snakecase::to_snake_case
    )
    return(actigraph_data)
}

Exercise: Make a function for importing other datasets with functionals

# Excluding the Roxygen docs.
import_multiple_files <- function(file_pattern, import_function) {
    data_files <- fs::dir_ls(here::here("data-raw/mmash/"),
                             regexp = file_pattern,
                             recurse = TRUE)
    
    combined_data <- purrr::map_dfr(data_files, import_function,
                                    .id = "file_path_id")
    return(combined_data)
}
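
The find-files, import-each, then combine pattern above can also be sketched with base R alone, which makes it easy to try without the course data. This is only an illustration under stated assumptions: the folder `mmash-sketch`, the two small CSV files, and their values are all hypothetical, and `lapply()` with `do.call(rbind, ...)` stands in for `purrr::map_dfr()`.

```r
# Build a small, hypothetical folder tree with one CSV file per user.
raw_dir <- file.path(tempdir(), "mmash-sketch")
for (user in c("user_1", "user_2")) {
    dir.create(file.path(raw_dir, user), recursive = TRUE, showWarnings = FALSE)
    write.csv(
        data.frame(samples = c("before sleep", "wake up"),
                   cortisol_norm = c(0.03, 0.08)), # made-up values
        file.path(raw_dir, user, "saliva.csv"),
        row.names = FALSE
    )
}

# Find every file matching the pattern, searching sub-folders too.
data_files <- list.files(raw_dir, pattern = "saliva\\.csv$",
                         recursive = TRUE, full.names = TRUE)

# Import each file and record which file every row came from.
imported <- lapply(data_files, function(path) {
    cbind(file_path_id = path, read.csv(path))
})

# Stack the per-file data frames into a single data frame.
combined_data <- do.call(rbind, imported)
combined_data
```

The `file_path_id` column plays the same role as the `.id = "file_path_id"` argument in the solution: it keeps the source path next to each row so the user ID can be extracted from it later.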

Exercise: Brainstorm a regex that will match for the user ID

# There is no code for this exercise.
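
Although the exercise itself has no code, a candidate regex can be checked quickly against a few paths. The sketch below uses the pattern that appears in `extract_user_id()` and base R's `regexpr()` and `regmatches()` rather than stringr; the example paths are hypothetical and only mimic the shape of the values in `file_path_id`.

```r
# Hypothetical file paths, shaped like those stored in `file_path_id`.
example_paths <- c(
    "data-raw/mmash/user_1/saliva.csv",
    "data-raw/mmash/user_12/RR.csv",
    "data-raw/mmash/user_7/Actigraph.csv"
)

# "user_" followed by one digit and an optional second digit.
user_id_pattern <- "user_[0-9][0-9]?"

# regexpr() locates the first match in each path; regmatches() extracts it.
user_ids <- regmatches(example_paths, regexpr(user_id_pattern, example_paths))
user_ids
# returns "user_1", "user_12", "user_7"
```

Note that `regmatches()` silently drops paths without a match, whereas `stringr::str_extract()` returns `NA` for them, so the stringr version is the safer choice inside `mutate()`.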

Exercise: What is the pipe?

# There is no code for this exercise.
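
As a small illustration of what the pipe does, the sketch below compares a nested call with its piped equivalent. It uses R's native pipe `|>` (available from R 4.1) so that it runs without extra packages; the `%>%` pipe used elsewhere in these solutions comes from the magrittr package and behaves the same way in this simple case. The `values` vector is made up for the example.

```r
values <- c(1, 4, 9)

# Nested calls are read from the inside out.
nested_result <- sum(sqrt(values))

# Piped calls are read left to right: take values, then sqrt(), then sum().
piped_result <- values |> sqrt() |> sum()

identical(nested_result, piped_result)
# returns TRUE
```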

Exercise: Convert this code into a function

extract_user_id <- function(imported_data) {
    extracted_id <- imported_data %>% 
        dplyr::mutate(user_id = stringr::str_extract(file_path_id, 
                                                     "user_[0-9][0-9]?")) %>% 
        dplyr::select(-file_path_id)
    return(extracted_id)
}

Exercise: Summarise then join the Actigraph data

# Assumes dplyr is loaded for %>%, group_by(), summarise(), and across().
library(dplyr)

summarised_actigraph_df <- actigraph_df %>% 
    group_by(user_id, day) %>% 
    summarise(across(hr, list(mean = mean, sd = sd)))

Exercise: Import and process the activity data

import_activity <- function(file_path) {
    activity_data <- vroom::vroom(
        # Use the argument, not a fixed path, so every user's file is imported.
        file_path,
        col_select = -1,
        col_types = vroom::cols(
            Activity = vroom::col_double(),
            Start = vroom::col_time(format = ""),
            End = vroom::col_time(format = ""),
            Day = vroom::col_double()
        ),
        .name_repair = snakecase::to_snake_case
    ) 
    return(activity_data)
}

activity_df <- import_multiple_files("Activity.csv", import_activity)

activity_df %>% 
    # group_by(user_id) %>% 
    mutate(activity_duration = end - start)

Exercise: What other cleaned data might you create?

# There is no code for this exercise.

Exercise: Add parallel processing to your raw data processing

# The code for this is given in the section right before the exercise

A.2 For continued learning

Free online books:

Quick references:

Articles:

General sites:

Interactive sites or resources for hands-on learning:

Getting help:

  • StackOverflow for tidyr
  • StackOverflow for dplyr
  • Tip: Combine auto-completion with :: to find new functions and documentation on the functions (e.g. try typing base:: and then hitting Tab to show a list of all functions found in base R)
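
The auto-completion tip has a rough programmatic counterpart, which can help when working outside an editor with Tab completion. The sketch below is one possible way to do it, assuming you want to browse base R: `ls(baseenv())` lists the names defined in base R, and `grep()` narrows them down much like typing a few letters after `base::`.

```r
# All objects defined in base R (roughly what `base::` + Tab would list).
base_functions <- ls(baseenv())

# Narrow the list down, similar to typing `base::str` and hitting Tab.
str_functions <- grep("^str", base_functions, value = TRUE)
str_functions

# In an interactive session, `?strsplit` then opens the documentation
# for one of the functions found.
```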

A.3 Useful R packages

Table A.1: Useful and common packages to use in data analysis.
Package | Title | Description
bookdown | Authoring Books and Technical Documents with R Markdown | Output formats and utilities for authoring books and technical documents with R Markdown.
broom | Convert Statistical Analysis Objects into Tidy Tibbles | Summarizes key information about statistical objects in tidy tibbles. This makes it easy to report results, create plots and consistently work with large numbers of models at once. Broom provides three verbs that each provide different types of information about a model. tidy() summarizes information about model components such as coefficients of a regression. glance() reports information about an entire model, such as goodness of fit measures like AIC and BIC. augment() adds information about individual observations to a dataset, such as fitted values or influence measures.
data.table | Extension of data.frame | Fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns, friendly and fast character-separated-value read/write. Offers a natural and flexible syntax, for faster development.
datapasta | R Tools for Data Copy-Pasta | RStudio addins and R functions that make copy-pasting vectors and tables to text painless.
dplyr | A Grammar of Data Manipulation | A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
forcats | Tools for Working with Categorical Variables (Factors) | Helpers for reordering factor levels (including moving specified levels to front, ordering by first appearance, reversing, and randomly shuffling), and tools for modifying factor levels (including collapsing rare levels into other, ‘anonymising’, and manually ‘recoding’).
fs | Cross-Platform File System Operations Based on ‘libuv’ | A cross-platform interface to file system operations, built on top of the ‘libuv’ C library.
ggplot2 | Create Elegant Data Visualisations Using the Grammar of Graphics | A system for ‘declaratively’ creating graphics, based on “The Grammar of Graphics”. You provide the data, tell ‘ggplot2’ how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
glue | Interpreted String Literals | An implementation of interpreted string literals, inspired by Python’s Literal String Interpolation <https://www.python.org/dev/peps/pep-0498/> and Docstrings <https://www.python.org/dev/peps/pep-0257/> and Julia’s Triple-Quoted String Literals <https://docs.julialang.org/en/v1.3/manual/strings/#Triple-Quoted-String-Literals-1>.
googledrive | An Interface to Google Drive | Manage Google Drive files from R.
haven | Import and Export ‘SPSS’, ‘Stata’ and ‘SAS’ Files | Import foreign statistical formats into R via the embedded ‘ReadStat’ C library, <https://github.com/WizardMac/ReadStat>.
here | A Simpler Way to Find Your Files | Constructs paths to your project’s files. The ‘here()’ function uses a reasonable heuristics to find your project’s files, based on the current working directory at the time when the package is loaded. Use it as a drop-in replacement for ‘file.path()’, it will always locate the files relative to your project root.
janitor | Simple Tools for Examining and Cleaning Dirty Data | The main janitor functions can: perfectly format data.frame column names; provide quick counts of variable combinations (i.e., frequency tables and crosstabs); and isolate duplicate records. Other janitor functions nicely format the tabulation results. These tabulate-and-report functions approximate popular features of SPSS and Microsoft Excel. This package follows the principles of the “tidyverse” and works well with the pipe function %>%. janitor was built with beginning-to-intermediate R users in mind and is optimized for user-friendliness. Advanced R users can already do everything covered here, but with janitor they can do it faster and save their thinking for the fun stuff.
knitr | A General-Purpose Package for Dynamic Report Generation in R | Provides a general-purpose tool for dynamic report generation in R using Literate Programming techniques.
lubridate | Make Dealing with Dates a Little Easier | Functions to work with date-times and time-spans: fast and user friendly parsing of date-time data, extraction and updating of components of a date-time (years, months, days, hours, minutes, and seconds), algebraic manipulation on date-time and time-span objects. The ‘lubridate’ package has a consistent and memorable syntax that makes working with dates easy and fun. Parts of the ‘CCTZ’ source code, released under the Apache 2.0 License, are included in this package. See <https://github.com/google/cctz> for more details.
patchwork | The Composer of Plots | The ‘ggplot2’ package provides a strong API for sequentially building up a plot, but does not concern itself with composition of multiple plots. ‘patchwork’ is a package that expands the API to allow for arbitrarily complex composition of plots by, among others, providing mathematical operators for combining multiple plots. Other packages that try to address this need (but with a different approach) are ‘gridExtra’ and ‘cowplot’.
purrr | Functional Programming Tools | A complete and consistent functional programming toolkit for R.
readr | Read Rectangular Text Data | The goal of ‘readr’ is to provide a fast and friendly way to read rectangular data (like ‘csv’, ‘tsv’, and ‘fwf’). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.
readxl | Read Excel Files | Import excel files into R. Supports ‘.xls’ via the embedded ‘libxls’ C library <https://github.com/libxls/libxls> and ‘.xlsx’ via the embedded ‘RapidXML’ C++ library <http://rapidxml.sourceforge.net>. Works on Windows, Mac and Linux without external dependencies.
rio | A Swiss-Army Knife for Data I/O | Streamlined data import and export by making assumptions that the user is probably willing to make: ‘import()’ and ‘export()’ determine the data structure from the file extension, reasonable defaults are used for data import and export (e.g., ‘stringsAsFactors=FALSE’), web-based import is natively supported (including from SSL/HTTPS), compressed files can be read directly without explicit decompression, and fast import packages are used where appropriate. An additional convenience function, ‘convert()’, provides a simple method for converting between file types.
rmarkdown | Dynamic Documents for R | Convert R Markdown documents into a variety of formats.
stringr | Simple, Consistent Wrappers for Common String Operations | A consistent, simple and easy to use set of wrappers around the fantastic ‘stringi’ package. All function and argument names (and positions) are consistent, all functions deal with “NA”’s and zero length vectors in the same way, and the output from one function is easy to feed into the input of another.
tibble | Simple Data Frames | Provides a ‘tbl_df’ class (the ‘tibble’) that provides stricter checking and better formatting than the traditional data frame.
tidyr | Tidy Messy Data | Tools to help to create tidy data, where each column is a variable, each row is an observation, and each cell contains a single value. ‘tidyr’ contains tools for changing the shape (pivoting) and hierarchy (nesting and ‘unnesting’) of a dataset, turning deeply nested lists into rectangular data frames (‘rectangling’), and extracting values out of string columns. It also includes tools for working with missing values (both implicit and explicit).
tidyverse | Easily Install and Load the ‘Tidyverse’ | The ‘tidyverse’ is a set of packages that work in harmony because they share common data representations and ‘API’ design. This package is designed to make it easy to install and load multiple ‘tidyverse’ packages in a single step. Learn more about the ‘tidyverse’ at <https://tidyverse.org>.
usethis | Automate Package and Project Setup | Automate package and project setup tasks that are otherwise performed manually. This includes setting up unit testing, test coverage, continuous integration, Git, ‘GitHub’, licenses, ‘Rcpp’, ‘RStudio’ projects, and more.
vroom | Read and Write Rectangular Text Data Quickly | The goal of ‘vroom’ is to read and write data (like ‘csv’, ‘tsv’ and ‘fwf’) quickly. When reading it uses a quick initial indexing step, then reads the values lazily, so only the data you actually use needs to be read. The writer formats the data in parallel and writes to disk asynchronously from formatting.