1  Workshop syllabus

Reproducibility and open scientific practices are increasingly demanded of, and needed by, scientists and researchers in modern research environments. We increasingly produce larger and more complex datasets that often need to be heavily cleaned, reorganized, and processed before they can be analysed. This data processing often consumes the majority of the time spent coding and doing data analysis. Yet even though this stage is so time-consuming, little to no training and support is provided for it. This has led to minimal attention, scrutiny, and rigour in describing, detailing, and reviewing these procedures in studies, and contributes to the systemic lack of code sharing among researchers. Altogether, this aspect of research is often completely hidden and is likely to be the source of many irreproducible results.

With this workshop, we aim to begin addressing this gap. We use a highly practical, mixed-methods, hands-on approach to learning that revolves around code-along sessions (teacher and you, the learner, coding together), hands-on exercises, discussion activities, reading tasks, and group project work. The learning outcome and overall aim of the workshop is to enable you to:

  1. Demonstrate an open and reproducible workflow in R that makes use of the effective and powerful functional programming approach for working with data, and apply it to import and process some real-world data.

This aim is broken down into specific learning objectives that are spread across the workshop’s sessions. The workshop will enable you to:

Pre-process data as you import it

  1. Identify the appropriate R package to use to import data based on the file's format.
  2. Use the {readr} package to import data from a CSV file into R, while also applying some initial data cleaning steps during the import (see the sketch after this list).
  3. Continue practicing basic reproducible and open workflows, such as using Git version control and using {styler} to format your code.
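
As a small taste of what this looks like, here is a minimal sketch of cleaning during import, assuming {readr} version 2 or later (the file path and column names are hypothetical):

```r
library(readr)

# Import a CSV file while applying some initial cleaning steps:
patient_data <- read_csv(
  "data/patients.csv",               # hypothetical file path
  na = c("", "NA"),                  # treat empty strings and "NA" as missing
  col_select = c(id, age, glucose)   # keep only the columns we need
)
```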

Bundle code into functions

  1. Describe and identify the individual components of a function as well as the workflow for creating one, and then use that workflow to create a function that imports data.
  2. Describe and apply a workflow of prototyping code into a working function in a Quarto document, moving the function into a script (called, e.g., functions.R) once it is prototyped and tested, and then using source() to load the functions into the R session. End this workflow by rendering the Quarto document with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”) to ensure reproducibility. A sketch of this workflow follows this list.
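
A minimal sketch of this workflow, with a hypothetical function and file paths:

```r
# Contents of functions.R, after the function has been prototyped and
# tested in the Quarto document:
import_csv_file <- function(file_path) {
  data <- readr::read_csv(file_path)
  return(data)
}
```

Then, in a code chunk near the top of the Quarto document:

```r
# Load the functions into the R session and use them:
source("functions.R")
patient_data <- import_csv_file("data/patients.csv")
```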

Making robust and general-purpose functions

  1. Explain what R package dependency management is and why it is necessary for writing code and ensuring reproducibility.
  2. Use tools like usethis::use_package() to manage dependencies in an R project, and use :: to explicitly use a specific function from a specific package in any functions you make (see the sketch after this list).
  3. Identify and describe some basic principles and patterns for writing more robust, reusable, and general-purpose code.
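
A minimal sketch of these two tools together (the package and function here are illustrative):

```r
# Run once in the Console to record {readr} as a project dependency
# (this adds it to the project's DESCRIPTION file):
usethis::use_package("readr")

# In your own functions, use :: to be explicit about where a
# function comes from:
import_csv_file <- function(file_path) {
  data <- readr::read_csv(file_path)
  return(data)
}
```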

Doing many things at once with functionals

  1. Explain what functional programming, vectorization, and functionals are within R, and identify when code is a functional or uses functional programming. Then apply this knowledge using the {purrr} package’s map() function (sketched below).
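
For instance, a minimal sketch of using map() to import several files with one function (the file paths are hypothetical):

```r
library(purrr)

# A vector of (hypothetical) files that all need the same import step.
file_paths <- c("data/2021.csv", "data/2022.csv", "data/2023.csv")

# map() is a functional: it takes a function as input and applies it
# to each element, returning a list (here, a list of data frames).
all_data <- map(file_paths, readr::read_csv)
```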

Cleaning characters and dates

  1. Use the powerful (but admittedly difficult) regular expressions to process and clean character data, making use of the {stringr} package.
  2. Handle dates and times in R using the {lubridate} package.
  3. Recognise when functions you make with {tidyverse} functions trigger a “non-standard evaluation” error, and fix it by using {{ }} (see the sketches after this list).
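
Some minimal sketches of these three tools (the data, patterns, and column names are illustrative):

```r
library(stringr)
library(lubridate)
library(dplyr)

# Use a regular expression with {stringr} to strip non-digit characters:
str_remove_all(c("id-101", "id-102"), "[^0-9]")
#> [1] "101" "102"

# Parse character dates into proper date objects with {lubridate}:
ymd(c("2024-01-15", "2024-02-20"))

# Use {{ }} so a function can take an unquoted column name without
# triggering a non-standard evaluation error:
max_of <- function(data, column) {
  data %>%
    summarise(max_value = max({{ column }}))
}
```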

Using split-apply-combine to help in processing

  1. Review the split-apply-combine technique and identify how these concepts make use of functional programming.
  2. Apply functional programming to summarize data using the split-apply-combine technique with {dplyr}’s group_by(), summarise(), and across() functions (sketched after this list).
  3. Identify and design ways to simplify the functions you make by creating general functions that contain other functions you’ve made, such as a general “cleaning” function that contains your custom functions that clean your specific data.
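
A minimal sketch of this technique (the dataset is hypothetical):

```r
library(dplyr)

# Hypothetical example data.
measurements <- tibble(
  patient_id = c(1, 1, 2, 2),
  glucose = c(5.1, 5.5, 6.2, 6.0)
)

# Split by patient, apply summary functions across the chosen columns,
# then combine the results into one data frame.
measurements %>%
  group_by(patient_id) %>%
  summarise(across(glucose, list(mean = mean, max = max)))
```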

Pivoting your data from and to long or wide

  1. Describe the concept of “pivoting” data to convert data in a “long” format to a “wide” format and vice versa.
  2. Identify situations where the long or the wide format is more appropriate for the data.
  3. Apply the pivot_longer() and pivot_wider() functions from the {tidyr} package to pivot data (sketched after this list).
  4. Recognise and appreciate the impact that seeking out and using existing functions (for example, those found in packages like {tidyr}) has on how you get things done, compared with writing custom code from scratch.
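
A minimal sketch of pivoting in both directions (the dataset is hypothetical):

```r
library(tidyr)

# Hypothetical wide data: one column per year.
counts_wide <- tibble::tibble(
  site = c("A", "B"),
  `2022` = c(10, 20),
  `2023` = c(12, 24)
)

# Wide to long: the year columns become rows.
counts_long <- pivot_longer(
  counts_wide,
  cols = c(`2022`, `2023`),
  names_to = "year",
  values_to = "count"
)

# Long back to wide.
pivot_wider(counts_long, names_from = year, values_from = count)
```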

Joining data together

  1. Describe some ways to join or “bind” data and identify which join is appropriate for a given situation. Then apply {dplyr}’s full_join() function to join two datasets together into one single dataset (sketched below).
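
A minimal sketch of joining two (hypothetical) datasets that share an id column:

```r
library(dplyr)

demographics <- tibble(id = c(1, 2, 3), age = c(34, 51, 29))
lab_results <- tibble(id = c(2, 3, 4), glucose = c(5.5, 6.1, 5.8))

# full_join() keeps all rows from both datasets, filling in NA where
# an id appears in only one of them.
full_join(demographics, lab_results, by = "id")
```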

To simplify this into tangible tools and methods, during the workshop we will:

  1. Use a function-based workflow for writing code.
  2. Use {usethis}’s use_package() function to manage package dependencies.
  3. Use Quarto to write reproducible documents.
  4. Use Git and GitHub to manage code and version control.
  5. Use functional programming tools like {purrr}’s map() function.
  6. Use the {lubridate} package to clean and process dates and times.
  7. Use the {stringr} package and regular expressions to clean and process character strings.
  8. Use the split-apply-combine technique to summarize data with {dplyr}’s group_by(), summarise(), and across() functions.
  9. Use the {tidyr} package to pivot data to long with pivot_longer() and to wide with pivot_wider().
  10. Use the {dplyr} package to full_join() datasets together.

Learning and coding are not just solo activities, they are also very much social activities. So throughout the workshop, we provide plenty of opportunity to meet new people, share experiences, have discussion activities, and work in groups.

1.1 Is this workshop for you?

This workshop is designed in a specific way and is ideal for you if:

  • You are a researcher, preferably working in the biomedical field (ranging from experimental to clinical to epidemiological).
  • You currently do, or will soon be doing, quantitative data analysis.
  • You either have taken the introductory r-cubed workshop or are familiar with its content (basic R usage, Git, and RStudio R Projects).

Considering that this is a natural extension of the introductory r-cubed workshop, this workshop incorporates tools learned during that workshop, including basic Git usage as well as the use of RStudio R Projects. If you do not have familiarity with these tools, you will need to go over the material from the introduction workshop beforehand (more details about pre-workshop tasks will be sent out a couple of weeks before the workshop).

We make these assumptions about you as the learner to help focus the content of the workshop. However, if you are interested in learning R but don’t fit any of the above assumptions, you are still welcome to attend the workshop! We welcome everyone, at least until the workshop capacity is reached.

During the workshop, we will:

  • Learn how to use R, specifically at the mid-beginner to early-intermediate level.
  • Focus only on the data processing and cleaning stage of a data analysis project.
  • Teach from a reproducible research and open scientific perspective (e.g. by making use of Git).
  • Be using practical, applied, and hands-on lessons and exercises.
  • Apply evidence-based teaching practices.
  • Use a real-world dataset to work with.

And we will not:

  • Go over the basics of using R and RStudio.
  • Cover any statistics, as these are already covered by most university curricula.