By James Kobielus (@jameskobielus)
Science isn’t a mad dash to enlightenment. Instead, it usually involves tedious, painstaking, and methodical slogs through empirical data by researchers seeking confirmation of precisely scoped hypotheses.
You shouldn’t trust scientific findings unless they’ve been independently reproduced. To confirm someone else’s findings, an independent researcher needs to know precisely how those results were achieved in the first place. If, however, the original researcher failed to document their procedures in precise detail, neither they nor anyone else can be confident that the findings can be reproduced at a later date.
When scientists use agile methodologies, there’s every incentive to skimp on documentation in the interest of speed. The essence of agile is that self-organizing, cross-functional teams sprint toward results in fast, iterative, incremental, and adaptive steps. Considering that methodological improvisation is the heart of this approach, it’s not surprising that teams following agile principles may neglect to record every step in the winding journey they took to achieve desired outcomes. If they’ve failed to maintain a detailed audit trail of their efforts, they may inadvertently undermine later attempts to reproduce their discoveries.
If data scientists wish to deserve the title of “scientist,” they can’t turn a deaf ear to the need for reproducibility of findings. Unfortunately, reproducibility is seldom a high priority in data science initiatives, especially those caught up in an agile scramble for some semblance of statistical truth. As Daniel Whitenack says in this recent O’Reilly article, “Truly ‘scientific’ results should not be accepted by the community unless they can be clearly reproduced and have undergone a peer review process. Of course, things get messy in practice for both academic scientists and data scientists, and many workflows employed by data scientists are far from reproducible….At the very best, the results generated by these sorts of workflows could be re-created by the person(s) directly involved with the project, but they are unlikely to be reproduced by anyone new to the project or by anyone charged with reviewing the project.”
In order to ensure that reproducibility isn’t undermined by agile methods, data scientists need to ensure that their teams conduct all their work on shared platforms that automate the following functions:
- Logging of every step in the process of acquiring, manipulating, modeling, and deploying data-driven analytics;
- Versioning of all data, models, and other artifacts at all stages in the development pipeline;
- Retention of archival copies of all data sets, plots, scripts, tools, random seeds, and other artifacts used in every iteration of the modeling process;
- Generation of detailed narratives that spell out how each step contributed to analytical results achieved in every iteration; and
- Access for independent parties to inspect every script, run, and result at the project level.
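To make the first few of these functions concrete, here is a minimal sketch of what automated run logging might look like. The `log_run_manifest` helper and all file names and parameters are hypothetical illustrations, not part of any particular platform: the idea is simply that each modeling run records its random seed, a content hash of the data set, and the hyperparameters used, so that an independent party can later verify they are rerunning the same experiment.

```python
import hashlib
import json
import random
import time

def log_run_manifest(data_bytes, params, seed, path="run_manifest.json"):
    """Record the inputs needed to reproduce one modeling run:
    the random seed, a content hash of the data set, and the
    hyperparameters used. (Illustrative helper, not a real API.)"""
    manifest = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "seed": seed,
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "params": params,
    }
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest

# Fix the seed before any stochastic step so reruns are deterministic.
seed = 42
random.seed(seed)

# Toy data set; in practice this would be the archived training data.
data = b"feature1,feature2,label\n1,2,0\n3,4,1\n"
manifest = log_run_manifest(data, {"model": "logistic", "C": 1.0}, seed)
```

Archiving such a manifest alongside the versioned data and scripts gives reviewers the audit trail the bullets above describe: the hash proves the data set is byte-identical, and the recorded seed makes stochastic steps repeatable.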
Of course, some data scientists might argue that retraining their models on fresh data is a form of reproducibility. In other words, iterative retraining shows that the features and correlations identified in prior runs are still valid predictors of the phenomena of interest. But retraining does not address the following concerns that stand in the way of true reproducibility of data-scientific findings:
- Training may not flag circumstances in which a statistical model has been overfitted to a particular data set, a phenomenon that limits the reproducibility of its predictions in other circumstances.
- Training may simply confirm that the model has identified key statistical correlations, but may obscure the underlying causal factors that could be confirmed through independent reproduction.
- Training doesn’t address the need for interpretability of the results by independent parties, which ensures that the reproduced findings are not only statistically significant but also relevant to the application domain of interest.
For all these reasons, data scientists should always ensure that agile methods leave a sufficient audit trail for independent verification, even if their peers (or compliance specialists) never choose to take them up on that challenge.
Reproducibility is the hallmark of professional integrity, being grounded in a commitment to the quality, transparency, and reusability of one’s work. At the data-science community level, reproducibility can be the greatest agility enabler of all. If statistical modelers ensure that their work meets consistent standards of transparency, auditability, and accessibility, they thereby make it easier for others to review, cite, and reuse it in other contexts.
Science is, after all, an iterative process in which an entire community of data-driven investigators systematically probe their way closer to the truth.