Open Research is something I am quite passionate about: most academic research is publicly funded, and the fruits of such labour should therefore, in my view, be publicly available. This includes not just the peer-reviewed paper, but also the data and code used to carry out the research. This principle requires a subtle shift in the way research is undertaken, particularly given that so much analysis these days is performed in silico, i.e. using a computer to analyse your data.

When I went through university there was a big emphasis on keeping detailed lab books of the work done in the wet lab; these days the same is true of, and required for, work undertaken on computers, whether that involves writing detailed compiled programs in C or using RMarkdown, Jupyter Notebooks or Org-mode.

Reproducibility

When undertaking analyses on a computer, literate programming should be the default approach. The basic principle is that a program is self-documenting: it contains a natural-language explanation of its function interspersed with code snippets. There are many ways to achieve this; one of the earliest I encountered was Sweave, which allowed writing documents that were a mixture of LaTeX and R, such that results from R (tables, figures and numbers) were automatically inserted into the PDF when the document was compiled.
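
As an illustration, here is a minimal Sweave document; the chunk name and the simulated data are invented for this sketch. R chunks sit inside the LaTeX between <<>>= and @, and \Sexpr{} inlines a computed value:

  \documentclass{article}
  \begin{document}

  <<simulate, fig=TRUE>>=
  x <- rnorm(100)   # simulated data for illustration
  hist(x)           # this figure is inserted into the PDF
  @

  The mean of the sample is \Sexpr{round(mean(x), 2)}.

  \end{document}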

In the R ecosystem this has largely been superseded by knitr, which allows R code to be embedded in a range of markup languages, by far the most popular of which is RMarkdown.
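
The RMarkdown equivalent of the sketch above uses a YAML header, Markdown prose and backtick-fenced R chunks (again, the contents are invented):

  ---
  title: "A reproducible analysis"
  output: html_document
  ---

  ```{r simulate}
  x <- rnorm(100)   # simulated data for illustration
  hist(x)           # this figure is embedded in the output
  ```

  The mean of the sample is `r round(mean(x), 2)`.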

There are options for other languages too, such as Pweave for Python (although its development has stalled) and Codebraid, which allows code in a number of languages (R, Python, Julia, etc.) to be embedded in Pandoc Markdown.

One flavour of literate programming I particularly like is org-babel, afforded by the excellent Org Mode for Emacs, although this is mainly because I really like Emacs and Org Mode and am trying to use them for most tasks.
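
In Org Mode the same sketch becomes a source block, with header arguments controlling how the results are captured on export:

  #+TITLE: A reproducible analysis

  #+BEGIN_SRC R :results graphics file :file hist.png
    x <- rnorm(100)   # simulated data for illustration
    hist(x)           # written to hist.png and linked in the export
  #+END_SRC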

A newer system that works across R, Python, Julia and Observable is Quarto, which allows users to write documents in plain Markdown or Jupyter Notebooks. There is already a quarto-mode for Emacs.
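
A minimal Quarto document (again with invented contents) looks much like RMarkdown, with chunk options written as #| comments inside the chunk:

  ---
  title: "A reproducible analysis"
  format: html
  ---

  ```{r}
  #| label: simulate
  x <- rnorm(100)   # simulated data for illustration
  hist(x)
  ```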

Open Data

Making data open should be the default for any publicly funded research. If no human subjects are involved in the research then this is relatively straightforward. Where data is collected on humans, there is a balance to be struck between the benefits of openness on the one hand, and the individual's right to privacy and data-governance legislation such as GDPR on the other.

There are frameworks for managing data in research pipelines, such as Pachyderm.

Data Capture and Management

Ideally, data should be captured in a structured manner, with validation of fields performed as the data is entered, and stored in a database, whether as SQL-like tabular data or as JSON-like documents under MongoDB.
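
As a sketch of what validation on entry might look like, assuming an invented participants table and field names, checks can be run before a record is written to the database (here SQLite via R's DBI):

  library(DBI)
  library(RSQLite)

  ## hypothetical record with invented fields
  record <- data.frame(id = 1L, age = 42L, smoker = FALSE)

  ## validate fields on entry, before they reach the database
  stopifnot(
    is.integer(record$age),
    record$age >= 0, record$age <= 120,   # plausible range check
    is.logical(record$smoker)
  )

  con <- dbConnect(RSQLite::SQLite(), "study.sqlite")
  dbWriteTable(con, "participants", record, append = TRUE)
  dbDisconnect(con)

In practice a production system would enforce the same constraints in the database schema itself, so that no route into the data bypasses them.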

Privacy

Studies involving human subjects require careful consideration because of the inherent right to privacy and the laws governing how such data should be handled. Even anonymised data can reveal information about participants, because responses to questions and accumulated data (e.g. GPS traces) can be used to reconstruct and identify details about an individual, such as where they live. If the condition being studied is rare, as is the case with some cancers and other medical conditions, it can be relatively easy to identify individuals.

Thus, whilst data should be made openly available, the need to do so should be balanced against the privacy of a study's participants.

Accessible

Data needs to be accessible to users. That doesn't just mean making it physically available, but providing it in a format that can be read by a range of software packages (not everyone uses the same software you do). Further, data should be well annotated with data dictionaries that describe its structure: whether variables are integers, booleans, floats, categorical, ordered categorical, numerically encoded (and what the encoding means) or free-text fields, so that others can make sense of the data and start using it straight away.
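
For example, a data dictionary for a hypothetical study might look like:

  variable  type                 description
  --------  -------------------  ---------------------------------------
  id        integer              unique participant identifier
  age       integer (years)      age at enrolment
  smoker    boolean              TRUE if a current smoker
  status    encoded categorical  0 = control, 1 = case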

A very basic and common way to provide data is as plain-text files, either as Comma-Separated Values (CSV) or, these days, as JavaScript Object Notation (JSON).
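
For example, the same two invented records in each format:

  id,age,smoker
  1,42,TRUE
  2,67,FALSE

  [
    {"id": 1, "age": 42, "smoker": true},
    {"id": 2, "age": 67, "smoker": false}
  ]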

Computing Environments

One of the biggest challenges posed by the explosion of software and the continual stream of incremental versions is ensuring that the environment in which the work was done remains available, should the underlying code fail to run under subsequent versions of the software. To this end, tools such as Docker, which provide containerised environments with user-defined versions of software, are invaluable.

Docker

A useful overview is provided by the article Ten simple rules for writing Dockerfiles for reproducible data science.
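
Following those rules, a minimal Dockerfile for an R analysis might pin every version the work depends on; the image tag and file names here are illustrative:

  # Pin the base image to an exact R version (rocker/verse bundles pandoc and TeX)
  FROM rocker/verse:4.2.1

  # Install a fixed set of extra packages at build time
  RUN R -e "install.packages('data.table', repos = 'https://cloud.r-project.org')"

  # Copy the analysis into the image and render it by default
  COPY analysis.Rmd /home/analysis.Rmd
  CMD ["R", "-e", "rmarkdown::render('/home/analysis.Rmd')"]

Anyone can then rebuild exactly the environment in which the analysis was run with docker build, long after the host system has moved on.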

Links

General

Publication

Project Management

ChangeLog

Reproducibility

Literate Programming Frameworks

Data

Repositories

Computing Environments

Docker

Programming

Design Patterns

Functional Programming
