1 Syllabus
Reproducibility and open scientific practices are increasingly demanded of, and needed by, scientists and researchers in modern research environments. We produce ever larger and more complex datasets that often need to be heavily cleaned, reorganized, and processed before they can be analysed. This data processing often consumes the majority of the time spent coding and doing data analysis, yet there is little to no training or support provided for it. As a result, these procedures receive minimal attention, scrutiny, and rigour when described, detailed, and reviewed in studies, which contributes to the systemic lack of code sharing among researchers. Altogether, this aspect of research is often completely hidden and is likely the source of many irreproducible results.
In this workshop we aim to begin addressing this gap. We use a highly practical, mixed-methods, hands-on approach to learning that revolves around code-along sessions (teacher and you, the learner, coding together), hands-on exercises, discussion activities, reading tasks, and team project work. This workshop is designed to be done over three full days in person, following the schedule listed in Chapter 3.
1.1 Learning outcome and objectives
The learning outcome and overall aim of the workshop is to enable you to:
- Demonstrate an open and reproducible workflow in R that makes use of the effective and powerful functional programming approach for working with data, and apply it to import and process some real-world data.
This aim is broken down into specific learning objectives that are spread across the workshop’s sessions. The workshop will enable you to:
Pre-process data as you import it
- Identify the appropriate R package for importing data based on the file's format.
- Use the readr package to import data from a CSV file into R, while also applying some initial data cleaning steps during the import.
- Continue practicing basic reproducible and open workflows, such as using Git version control and using styler to format your code.
- Describe and identify the individual components of a function as well as the workflow for creating one, and then use that workflow to create a function that imports data.
- Identify and describe some basic principles and patterns for writing general-purpose and reusable code and functions.
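As a sketch of what cleaning during import can look like with readr (the file path, column names, and missing-value codes below are hypothetical):

```r
library(readr)

# Hypothetical file, column names, and missing-value codes.
participants <- read_csv(
  "data/participants.csv",
  # Keep only the columns needed and set their types while importing.
  col_select = c(id, age, weight_kg),
  col_types = cols(
    id = col_character(),
    age = col_integer(),
    weight_kg = col_double()
  ),
  # Treat placeholder codes as missing values during import.
  na = c("", "NA", "-99")
)
```

Doing this during import, rather than afterwards, keeps the cleaning steps in one place and documents what the raw file looks like.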
Making robust and reusable functions
- Explain what R package dependency management is and why it is necessary for writing robust code and ensuring reproducibility.
- Use tools like `usethis::use_package()` to manage dependencies in an R project, and use `::` to explicitly call a specific function from a specific package in any functions you make.
- Describe and apply a workflow of prototyping code into a working function in a Quarto document, moving functions into a script (called, e.g., `functions.R`) once they are robust and tested, and then using `source()` to allow easy reuse of functions in other R scripts or Quarto documents.
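A sketch of how this workflow can fit together (the function name and file paths are hypothetical, and the project setup with usethis is assumed):

```r
# In the console, declare the dependency once per project; usethis records
# it in the project's DESCRIPTION file:
# usethis::use_package("readr")

# R/functions.R: use explicit `package::function()` calls inside functions.
import_csv_file <- function(file_path) {
  data <- readr::read_csv(file_path, show_col_types = FALSE)
  data
}

# In an analysis script or Quarto document, reuse it with source():
# source("R/functions.R")
# weight_data <- import_csv_file("data/weight.csv")
```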
Doing many things at once with functionals
- Explain what functional programming, vectorization, and functionals are within R and identify when code is a functional or uses functional programming. Then apply this knowledge using the purrr package's `map()` function.
- Use the powerful (but admittedly difficult) regular expressions to process and clean character data, making use of the stringr package.
- Handle dates and times in R using the lubridate package.
- Recognise when you trigger a "non-standard evaluation" error while making functions that use tidyverse functions, and fix it by using `{{ }}`.
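A minimal sketch of both ideas, using the built-in `mtcars` data (the `mean_of()` helper is hypothetical):

```r
library(dplyr)
library(purrr)

# Hypothetical helper: "embracing" the column with {{ }} forwards the
# unquoted column name into dplyr, avoiding the non-standard evaluation error.
mean_of <- function(data, column) {
  data |>
    summarise(mean_value = mean({{ column }}, na.rm = TRUE))
}

# map() is a functional: it applies a function to each element of a list.
samples <- list(
  all_cars = mtcars,
  four_cylinder = mtcars[mtcars$cyl == 4, ]
)
map(samples, \(data) mean_of(data, mpg))
```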
Using split-apply-combine to help in processing
- Review the split-apply-combine technique and identify how these concepts make use of functional programming.
- Apply functional programming to summarize data using the split-apply-combine technique with dplyr's `summarise()`, its `.by` argument, and `across()`.
- Identify and design ways to simplify the functions you make by creating general functions that contain other functions you've made, such as a general "cleaning" function that contains your custom functions that clean your specific data.
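For example, with the built-in `mtcars` data, the technique can be sketched as:

```r
library(dplyr)

# Split by cyl, apply mean() to each selected column, and combine the
# results into one summary data frame.
mtcars |>
  summarise(
    across(c(mpg, hp), \(x) mean(x, na.rm = TRUE)),
    .by = cyl
  )
```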
Pivoting your data from and to long or wide
- Describe the concept of “pivoting” data to convert data in a “long” format to a “wide” format and vice versa.
- Identify situations where it is more appropriate that data is in the long or wide format.
- Apply the `pivot_longer()` and `pivot_wider()` functions from the tidyr package to pivot data.
- Recognise and appreciate the impact that seeking out and using existing functions, instead of writing custom code, has on how you get things done. For example, looking through the functions found in packages like tidyr.
- Describe some ways to join or "bind" data and identify which join is appropriate for a given situation. Then apply dplyr's `full_join()` function to join two datasets together into one single dataset.
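A small sketch of pivoting and joining, using hypothetical `weights` and `heights` data:

```r
library(tidyr)
library(dplyr)

# Hypothetical small dataset in wide format.
weights <- tibble(
  id = c("a", "b"),
  weight_2023 = c(70, 80),
  weight_2024 = c(72, 79)
)

# Wide to long: one row per id-year combination.
long <- weights |>
  pivot_longer(
    cols = starts_with("weight_"),
    names_to = "year",
    names_prefix = "weight_",
    values_to = "weight"
  )

# Long back to wide.
wide_again <- long |>
  pivot_wider(
    names_from = year,
    values_from = weight,
    names_prefix = "weight_"
  )

# Join with another hypothetical dataset, keeping all rows from both.
heights <- tibble(id = c("a", "c"), height = c(170, 180))
full_join(weights, heights, by = "id")
```

A full join keeps rows present in either dataset, filling gaps with `NA`, which is why it suits combining datasets that only partially overlap.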
To simplify this into tangible tools and methods, during the workshop we will:
- Use a function-based workflow for writing code.
- Use usethis’s
use_package()function to manage package dependencies. - Use Quarto to write reproducible documents.
- Use Git and GitHub to manage code and version control.
- Use functional programming tools like purrr's `map()` function.
- Use the lubridate package to clean and process dates and times.
- Use the stringr package and regular expressions to clean and process character strings.
- Use the split-apply-combine technique to summarize data with dplyr's `group_by()`, `summarise()`, and `across()` functions.
- Use the tidyr package to pivot data to long with `pivot_longer()` and to wide with `pivot_wider()`.
- Use dplyr's `full_join()` to join datasets together.
Learning and coding are not just solo activities; they are also very much social ones. So throughout the workshop, we provide plenty of opportunities to meet new people, share experiences, take part in discussion activities, and work in groups.