1  Workshop syllabus

Warning

🚧 We are making major changes to this workshop, so much of the content will change. 🚧

Reproducibility and open scientific practices are increasingly demanded of, and needed by, scientists and researchers in our modern research environments. We increasingly produce larger and more complex datasets that often need to be heavily cleaned, reorganized, and processed before they can be analysed. This data processing often consumes the majority of the time spent coding and doing data analysis. Yet even though this stage of data analysis is so time-consuming, there is little to no training or support provided for it. This has led to minimal attention, scrutiny, and rigour in describing, detailing, and reviewing these procedures in studies, and contributes to the systemic lack of code sharing among researchers. Altogether, this aspect of research is often completely hidden and is likely a source of many irreproducible results.

With this workshop, we aim to begin addressing this gap. We use a highly practical, mixed-methods approach to learning that revolves around code-along sessions (the teacher and you, the learner, coding together), hands-on exercises, discussion activities, reading tasks, and group project work. The learning outcome and overall aim of the workshop is to enable you to:

  1. Demonstrate an open and reproducible workflow in R that makes use of the effective and powerful functional programming approach for working with data, and apply it to import and process some real-world data.

This aim is broken down into specific learning objectives that are spread across the workshop’s sessions. The workshop will enable you to:

  1. Identify the appropriate R package to use to import data based on the file's format.

  2. Use the {readr} package to import data from a CSV file into R, while also applying some initial data cleaning steps during the import.

  3. Continue practicing basic reproducible and open workflows, such as using Git version control and using {styler} to format your code.

  4. Describe and identify the individual components of a function as well as the workflow for creating one, and then use this workflow to make a function that imports some data.

  5. Describe and apply a workflow of prototyping code into a working function in a Quarto document, moving the function into a script (called, e.g., functions.R) once it is prototyped and tested, and then using source() to load the functions into the R session. End this workflow by rendering the Quarto document with Ctrl-Shift-K or via the Palette (Ctrl-Shift-P, then type “render”) to ensure reproducibility (see the sketch after this list).

  6. Explain what R package dependency management is and why it is necessary for writing code and ensuring reproducibility.

  7. Use tools like usethis::use_package() to manage dependencies in an R project and use :: to explicitly use a specific function from a specific package in any functions you make.

  8. Identify and describe some basic principles and patterns for writing more robust, reusable, and general-purpose code.

  9. Explain what functional programming, vectorization, and functionals are within R and identify when code is a functional or uses functional programming. Then apply this knowledge using the {purrr} package’s map() function.

  10. Review the split-apply-combine technique and identify how it makes use of functional programming.

  11. Apply functional programming to summarize data using the split-apply-combine technique with {dplyr}’s group_by(), summarise(), and across() functions.

  12. Describe some ways to join or “bind” data and identify which join is appropriate for a given situation. Then apply {dplyr}’s full_join() function to join two datasets together into one single dataset.

  13. Demonstrate how to use functionals to repeatedly join more than two datasets together.

  14. Apply the function case_when() in situations that require nested conditionals (if-else).

  15. Describe the concept of “pivoting” data to convert it from a “long” format to a “wide” format and vice versa.

  16. Identify situations where it is more appropriate that data is in the long or wide format.

  17. Apply the pivot_longer() and pivot_wider() functions from the {tidyr} package to pivot data.
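
To make the function-based workflow in objectives 4 to 7 more concrete, below is a minimal, hypothetical sketch; the function name, file path, and comments are ours for illustration only, not the workshop's own code:

```r
# functions.R: functions are prototyped and tested in a Quarto document,
# then moved into this script once they work.

#' Import a single CSV file, applying some initial cleaning during import.
#'
#' @param file_path Path to the CSV file.
#' @return A data frame.
import_csv_file <- function(file_path) {
  # `::` makes the package dependency explicit; the dependency itself would
  # be recorded in the project with usethis::use_package("readr").
  data <- readr::read_csv(file_path, show_col_types = FALSE)
  data
}
```

In the Quarto document, the script could then be loaded and used with something like source(here::here("R/functions.R")) followed by a call to import_csv_file(), before rendering the document with Ctrl-Shift-K to check that everything still runs reproducibly.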

To simplify this into tangible tools and methods, during the workshop we will:

  1. Use a function-based workflow for writing code.
  2. Use {usethis}’s use_package() function to manage package dependencies.
  3. Use Quarto to write reproducible documents.
  4. Use Git and GitHub to manage code and version control.
  5. Use functional programming tools such as {purrr}’s map() and reduce() functions (see the sketch after this list).
  6. Use the split-apply-combine technique to summarize data with {dplyr}’s group_by(), summarise(), and across() functions.
  7. Use the {dplyr} package to full_join() datasets together.
  8. Use the {tidyr} package to pivot data to a long format with pivot_longer() and to a wide format with pivot_wider().
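
To give a feel for how these tools fit together, here is a minimal, hypothetical sketch; the file paths, dataset, and column names (like id) are made up for illustration:

```r
# Functional programming: import several CSV files with {purrr}'s map(),
# then repeatedly join them into one dataset with reduce() and full_join().
csv_files <- c("data-raw/user_1.csv", "data-raw/user_2.csv")
combined_data <- csv_files |>
  purrr::map(readr::read_csv, show_col_types = FALSE) |>
  purrr::reduce(dplyr::full_join)

# Split-apply-combine: summarise all numeric columns for each `id`.
summarised_data <- combined_data |>
  dplyr::group_by(id) |>
  dplyr::summarise(dplyr::across(where(is.numeric), \(x) mean(x, na.rm = TRUE))) |>
  dplyr::ungroup()

# Pivoting: reshape the summary from wide to long and back again.
long_data <- summarised_data |>
  tidyr::pivot_longer(-id, names_to = "measure", values_to = "value")
wide_data <- long_data |>
  tidyr::pivot_wider(names_from = "measure", values_from = "value")
```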

Learning and coding are not just solo activities; they are also very much social activities. So throughout the workshop, we provide plenty of opportunities to meet new people, share experiences, take part in discussion activities, and work in groups.

1.1 Is this workshop for you?

This workshop is designed in a specific way and is ideal for you if:

  • You are a researcher, preferably working in the biomedical field (ranging from experimental to clinical to epidemiological).
  • You currently do, or will soon be doing, quantitative data analysis.
  • You have either taken the introductory r-cubed workshop or are already familiar with the material it covers (such as basic Git usage and RStudio R Projects).

As a natural extension of the introductory r-cubed workshop, this workshop incorporates tools learned during that workshop, including basic Git usage and the use of RStudio R Projects. If you are not familiar with these tools, you will need to go over the material from the introductory workshop beforehand (more details about pre-workshop tasks will be sent out a couple of weeks before the workshop).

We make these assumptions about you as the learner to help focus the content of the workshop. However, if you are interested in learning R but don’t fit any of the above assumptions, you are still welcome to attend the workshop! We welcome everyone, at least until the workshop capacity is reached.

During the workshop, we will:

  • Learn how to use R at the mid-beginner to early-intermediate level.
  • Focus only on the data processing and cleaning stage of a data analysis project.
  • Teach from a reproducible research and open scientific perspective (e.g. by making use of Git).
  • Use practical, applied, and hands-on lessons and exercises.
  • Apply evidence-based teaching practices.
  • Work with a real-world dataset.

And we will not:

  • Go over the basics of using R and RStudio.
  • Cover any statistics, as these are already covered by most university curricula.