D Reproducible Analytical Pipelines
Automating the steps used to create your report or output is a good way to avoid the human errors that manual intervention can introduce.
This is the focus of the Reproducible Analytical Pipelines community. The RAP community has curated the following resources:
- RAP Website
- Udemy Course
- National Services Scotland RAP paper
- National Services Scotland RAP checklist
D.1 Automating Pipelines
You can do this by:
- Designing a final output which can be created without manual intervention.
- Making sure your code is broken down into chunks which do discrete tasks in your pipeline, for example:
  - gathering
  - cleaning
  - processing and modelling
  - reporting and visualisation
- Taking advantage of tools which keep track of the interactions between your data and code. These tools can then re-run the required bits of your pipeline automatically as you update, correct and improve it:
  - GNU Make is the classic tool and is language agnostic, but perhaps not user friendly.
  - The drake package is designed for analysis pipelines in R.
  - The doit package can do the same and more for Python. The Python wiki also has a list of build tools for Python.
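To make the idea of re-running only the required parts more concrete, here is a minimal sketch in Python using only the standard library. It is not how drake or doit are implemented; it imitates the timestamp comparison that Make-style tools rely on. The function names (`is_stale`, `run_step`, `run_pipeline`) and the file names (`raw.txt`, `clean.txt`, `report.txt`) are hypothetical and chosen purely for illustration.

```python
from pathlib import Path


def is_stale(target: Path, deps: list[Path]) -> bool:
    """Return True if the target is missing or older than any dependency."""
    if not target.exists():
        return True
    target_mtime = target.stat().st_mtime_ns
    return any(dep.stat().st_mtime_ns > target_mtime for dep in deps)


def run_step(name, func, deps, target, ran):
    """Run func(deps, target) only when the target is stale; record what ran."""
    if is_stale(target, deps):
        func(deps, target)
        ran.append(name)


def clean(deps, target):
    # Cleaning step: strip whitespace and drop blank lines from the raw data.
    lines = [ln.strip() for ln in deps[0].read_text().splitlines() if ln.strip()]
    target.write_text("\n".join(lines) + "\n")


def summarise(deps, target):
    # Processing and reporting step: total the cleaned values.
    values = [float(ln) for ln in deps[0].read_text().splitlines()]
    target.write_text(f"total={sum(values)}\n")


def run_pipeline(workdir: Path) -> list[str]:
    """Run each discrete step in order and return the names of steps that ran."""
    raw = workdir / "raw.txt"          # hypothetical gathered input
    cleaned = workdir / "clean.txt"    # hypothetical intermediate file
    report = workdir / "report.txt"    # hypothetical final output
    ran: list[str] = []
    run_step("clean", clean, [raw], cleaned, ran)
    run_step("summarise", summarise, [cleaned], report, ran)
    return ran
```

Calling `run_pipeline` on a folder containing a fresh `raw.txt` runs both steps. Calling it again straight away should return an empty list, because neither target is older than its inputs. Dedicated tools add much more on top of this, such as dependency graphs and parallel execution, so treat the sketch as an explanation of the principle rather than a replacement for them.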
D.2 Code Version Linked to Outputs
You might successfully implement an automated pipeline and a reproducible environment. However, unless you know which version of these was used to produce an output, you might well come unstuck!
Make sure that you can track backwards and determine which version of your code produced a particular output.
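As an illustration of what tracking backwards can look like, the sketch below writes a small "sidecar" file next to an output, recording a checksum of the output and an identifier for the code version. It assumes the code lives in a git repository and that git is installed; if not, the version is recorded as "unknown". The function names (`current_commit`, `write_provenance`) and the `.provenance.json` suffix are hypothetical, not an established convention.

```python
import hashlib
import json
import subprocess
from pathlib import Path


def current_commit() -> str:
    """Return the current git commit hash, or "unknown" if it cannot be read."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or we are not inside a repository.
        return "unknown"


def write_provenance(output: Path, code_version: str) -> Path:
    """Write a sidecar JSON file linking an output file to a code version."""
    digest = hashlib.sha256(output.read_bytes()).hexdigest()
    sidecar = output.with_name(output.name + ".provenance.json")
    record = {
        "output": output.name,
        "sha256": digest,              # identifies this exact output file
        "code_version": code_version,  # e.g. the git commit that produced it
    }
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar
```

A pipeline's final step could call `write_provenance(report_path, current_commit())`, where `report_path` is whatever output it has just written. Given an output file later on, the checksum confirms that the sidecar belongs to it, and the recorded commit tells you which version of the code to check out.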
You can do this by: