If you find any typos, errors, or places where the text may be improved, please let us know by providing feedback either in the feedback survey (given during class), by using GitLab, or directly in this document with hypothes.is annotations.
2 Course syllabus
Reproducibility and open scientific practices are increasingly demanded of, and needed by, scientists and researchers in modern research environments. We increasingly produce larger and more complex datasets that often need to be heavily cleaned, reorganized, and processed before they can be analyzed. This data processing often consumes the majority of the time spent coding and doing data analysis. Yet even though this stage is so time-consuming, little to no training and support is provided for it. This has led to minimal attention, scrutiny, and rigour in describing, detailing, and reviewing these procedures in studies, and contributes to the systemic lack of code sharing among researchers. Altogether, this aspect of research is often completely hidden and is likely the source of many irreproducible results.
With this course, we aim to begin addressing this gap. Using a highly practical approach that revolves around code-along sessions (instructor and learner coding together), hands-on exercises, and group work, participants of the course will be able to:
- Learn and demonstrate what an open and reproducible data analysis workflow looks like.
- Learn and apply the fundamental tools and skills for conducting a reproducible and modern analysis for a research project.
- Apply programming techniques to process and manage data in a reproducible and well-documented way.
- Learn where to go to get help and to continue learning modern data analysis skills.
The course will enable participants to answer questions such as:
- What does a modern data analysis setup and workflow look like?
- How can I ensure that my data analysis project is reproducible?
- How can I create pipelines that get, process, and clean my data quickly and that work regardless of whether there is one data file or hundreds (i.e. they scale well)?
- How can I write code that is more reproducible and readable, and that can easily be re-used by my future self, my collaborators, and my colleagues?
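As a small taste of the scaling question above, here is a minimal sketch in R using the tidyverse-style packages purrr, readr, and fs. The folder name `data-raw` and the use of CSV files are assumptions for illustration, not part of the course material:

```r
library(fs)    # file system helpers
library(purrr) # functional iteration
library(readr) # fast, consistent data import

# List every CSV file in a (hypothetical) data-raw/ folder.
file_paths <- dir_ls("data-raw", glob = "*.csv")

# Read each file and row-bind them into one data frame.
# The same code works whether there is one file or hundreds.
all_data <- file_paths |>
  map(read_csv) |>
  list_rbind()
```

Because the iteration is expressed once over a list of paths, the pipeline needs no changes as the number of files grows, which is one sense in which it "scales well".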
By the end of the course, participants will: have improved their competency in processing and wrangling datasets; have improved their proficiency in using the R statistical computing language; know how to write re-usable and well-documented code; and know how to make modern and reproducible data analysis projects.
2.1 Is this course for you?
This course is designed in a specific way and is ideal for you if:
- You are a researcher, preferably working in the biomedical field (ranging from experimental to epidemiological). Specifically, this course targets those working in diabetes and metabolism.
- You currently or will soon do some quantitative data analysis.
- You either:
Considering that this is a natural extension of the introduction to R course, I will be incorporating tools learned in that course, including basic Git usage as well as use of RStudio R Projects. If you are not familiar with these tools, you will need to go over the material from the introduction course beforehand (more details about pre-course tasks will be sent out a couple of weeks before the course).
While I have these assumptions to help focus the content of the course, if you have an interest in learning R but don’t fit any of the above assumptions, you are still welcome to attend! We welcome everyone, at least until the course capacity is reached.
During the course, we will:
- cover how to use R, aimed at those at the mid-beginner to early-intermediate level
- focus only on the data processing and cleaning stage of a data analysis project
- teach from a reproducible research and open scientific perspective (e.g. by making use of Git)
- be using practical, applied, and hands-on lessons and exercises
And we will not cover:
- the basics of using R and RStudio
- statistics (these are already covered by most university curricula)
The workshop is structured as a series of participatory live-coding sessions (instructor and learner coding together) interspersed with hands-on exercises and group work, using either a practice dataset or some other real-world dataset. A few lectures are given, mainly at the start and end of the workshop. The general schedule is shown in the table below.
|Date and time|Session topic|
|---|---|
|**Day 1 (Sept. 8th)**||
|9:30|Arrival; coffee and snacks|
|10:00|Introduction to the course|
|11:00|Importing data, fast! (with short break)|
|13:30|Save time, don’t repeat yourself|
|14:45|Coffee break and snacks|
|15:00|Save time, don’t repeat yourself (with short break)|
|17:30|End of day survey|
|**Day 2 (Sept. 9th)**||
|8:30|Processing datasets for cleaning|
|10:00|Coffee break and snacks|
|10:15|Processing datasets for cleaning (with short break)|
|13:00|Workflow to analyzing your tidy data|
|14:45|Coffee break and snacks|
|15:00|Workflow to analyzing your tidy data (with short break)|
|16:15|What next? Applying open and reproducible practices in real life|
|16:30|Closing remarks and end of day survey|
The course will take place at Aarhus University in two different buildings:

- Day 1, Sept. 8th: Building 1264, room 209
- Day 2, Sept. 9th: Building 1231, room 228