If you find any typos, errors, or places where the text may be improved, please let us know by providing feedback either in the feedback survey (given during class) or by using GitLab.
Reproducibility and open scientific practices are increasingly demanded of, and needed by, scientists and researchers in our modern research environments. We produce ever larger and more complex datasets that often need to be heavily cleaned, reorganized, and processed before they can be analyzed. This data processing often consumes the majority of the time spent coding and doing data analysis. Yet even though this stage of data analysis is so time-consuming, little to no training or support is provided for it. This has led to minimal attention, scrutiny, and rigour in describing, detailing, and reviewing these procedures in studies, and contributes to the systemic lack of code sharing among researchers. Altogether, this aspect of research is often completely hidden and is likely to be the source of many irreproducible results.
With this course, we aim to begin addressing this gap. Using a highly practical approach that revolves around code-along sessions (instructor and learner coding together), hands-on exercises, and group work, participants of the course will be able to:
- Learn and demonstrate what an open and reproducible data processing and analysis workflow looks like.
- Learn and apply some fundamental concepts, techniques, and skills needed for processing and managing data in a reproducible and well-documented way.
- Learn where to go to get help and to continue learning modern data science and analysis skills.
The course will enable participants to answer questions such as:
- What does a modern data analysis setup and workflow look like?
- How can I create pipelines that get, process, and clean my data quickly and that work regardless of whether there is one data file or hundreds (i.e. that scale well)?
- How can I write code that is more reproducible, readable, and easily re-used for my future self and for my collaborators and colleagues?
By the end of the course, participants will: have improved their competency in processing and wrangling datasets; have improved their proficiency in using the R statistical computing language; know how to write re-usable and well-documented code; and know how to make modern and reproducible data analysis projects.
This course is designed in a specific way and is ideal for you if:
- You are a researcher, preferably working in the biomedical field (ranging from experimental to epidemiological). Specifically, this course targets those working on topics in diabetes and metabolism.
- You currently do, or will soon do, quantitative data analysis.
- You either:
Considering that this is a natural extension of the introductory r-cubed course, this course incorporates tools learned during that course, including basic Git usage as well as use of RStudio R Projects. If you do not have familiarity with these tools, you will need to go over the material from the introduction course beforehand (more details about pre-course tasks will be sent out a couple of weeks before the course).
While these assumptions help to focus the content of the course, if you have an interest in learning R but don’t fit any of the above assumptions, you are still welcome to attend the course! We welcome everyone until the course capacity is reached.
During the course, we will:
- Learn how to use R, specifically at the mid-beginner to early-intermediate level.
- Focus only on the data processing and cleaning stage of a data analysis project.
- Teach from a reproducible research and open scientific perspective (e.g. by making use of Git).
- Use practical, applied, and hands-on lessons and exercises.
And we will not learn:
- The basics of using R and RStudio.
- Statistics (this is already covered by most university curricula).