HDBI Data Resource

Data: Inception to Publication & Beyond

This is a resource for all questions related to data sharing, collaboration, management, and publication.

Richard J. Acton


January 31, 2023

doi: 10.5281/zenodo.8021381 (a9e2a07) [2024-04-11]

About this resource

These are our aspirations for making the data generated by the HDBI adhere to the FAIR principles.

Datasets should be accompanied by structured metadata which provides the details of who generated the data, how, and for what purpose. The metadata should also include a detailed account of the experimental design and how the data files map onto this design. Wherever possible, files should be identified with publicly available unique identifiers with which they can be retrieved from a repository. Wherever suitable controlled vocabularies and ontologies exist, these should be used to refer to methods, organisms, medical conditions, and other such annotations. Any methods used in the generation of data should reference protocol-level details, and any sources of biological materials should be clearly identified, so that the data generation process could be readily reproduced with the same materials and processes. Any secondary data which are the product of analyses of primary data should be published with reference to the openly licensed code which produced these outputs. The computational environment in which the code was run should also be described, ideally in a form which details exact software versions and permits its automated reproduction.
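As a rough illustration of what such structured metadata might look like, the sketch below builds a small record and checks that every data file maps onto the experimental design. The field names, the example values, and the placeholder identifiers are all assumptions made for illustration; they are not a schema required by HDBI or Wellcome. The only real-world identifier used is the NCBI Taxonomy term for Homo sapiens.

```python
# Hypothetical metadata record: field names and values are illustrative only.
REQUIRED_FIELDS = {"creator", "purpose", "organism", "conditions", "files"}

record = {
    "creator": "A. Researcher",             # who generated the data (placeholder)
    "purpose": "Example pilot experiment",  # why it was generated (placeholder)
    # Ontology term rather than free text for the organism
    "organism": {"label": "Homo sapiens", "term": "NCBITaxon:9606"},
    "conditions": ["control", "treated"],   # the experimental design
    "files": [
        # Each file has an identifier (placeholders here, not resolvable)
        # and the condition it belongs to in the design.
        {"id": "doi:10.0000/placeholder.1", "condition": "control"},
        {"id": "doi:10.0000/placeholder.2", "condition": "treated"},
    ],
}


def validate(record: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    missing = sorted(REQUIRED_FIELDS - set(record))
    problems = [f"missing field: {name}" for name in missing]
    known_conditions = set(record.get("conditions", []))
    for entry in record.get("files", []):
        if entry.get("condition") not in known_conditions:
            problems.append(f"file {entry.get('id')} maps to unknown condition")
    return problems


print(validate(record))
```

A check like this could run before data are deposited, so that gaps in the metadata are caught while the person who generated the data can still fill them in.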

Why use this resource

“Ask not what you can do for reproducibility; ask what reproducibility can do for you!”

- Florian Markowetz

Firstly, work which takes place as a part of HDBI MUST meet the stipulations for research data provided by our funder, Wellcome. Wellcome’s policy consists of 6 main points, summarized below.

Wellcome Data, software and materials management, and sharing policy summary
  1. Maximise the availability of research data, software, and materials, at a minimum those underpinning research publications.

  2. Have a plan for managing and sharing research outputs, especially those which might serve as a general resource of use to the wider academic community.

  3. Data should be discoverable, shared using persistent identifiers according to the most recent community standards, and in appropriate repositories.

  4. Users of research data, software and materials should cite those resources and abide by the terms and conditions of any resource used.

  5. When assessing researchers, Wellcome recognizes a range of research outputs in addition to conventional publications, including inventions, datasets, software, translation to health applications, and materials.

  6. Successful sharing of research outputs will be considered critical by Wellcome in its assessment of end-of-grant reporting.

Beyond their data management guidelines, it is also worth reviewing Wellcome’s guide on completing an outputs management plan. Anyone applying for further funding from Wellcome might find it helpful to review section 3, How To Store Your Data, of this guide when writing a grant proposal, so that they can request suitable funding for research data storage.

This guide aims to help you not only to meet but to exceed Wellcome’s requirements in ways that benefit your research. It is intended to provide the resources needed to make it as simple as possible for HDBI researchers to reach these goals.

A substantial portion of this guide relates to working reproducibly, either directly or indirectly. The best practices for the care and feeding of research data, from choosing the right tools to store and organise it to the right time and place to release it, are important foundations for reproducible science.

Reasons to work reproducibly

Florian Markowetz outlines why working reproducibly is a good idea for pragmatic and self-interested reasons, beyond high scientific ideals, in his excellent short piece ‘Five selfish reasons to work reproducibly’ (Markowetz 2015). I will briefly recap and paraphrase these here:

  1. Idealism
    The ability of others to reproduce our work is a foundational idea in the philosophy of science … and whatnot. Inspiring but not always motivating to action in the mundane day-to-day.

  2. Avoidance of disaster
    The easier it is for you and your collaborators to understand and cross-check your work, the easier it is for co-authors to spot errors (or even fraud) before you publish them and tarnish your reputation.

    Not losing years of hard work if you lose your laptop, because you have things backed up properly - you do have things backed up properly, right? 👀.

  3. An easier time writing papers
    You can avoid the laborious, error-prone process of transcribing numbers and updating figures when you change a data cleaning step, and just have it all update automagically at the click of a button. You may even be able to avoid having to manually reformat things to different journals’ persnickety layout requirements.

  4. Easier for reviewers to see things your way
    Conversations with reviewers tend to be more constructive when they can actually see what you did, and even better if it’s easy for them to understand it and poke at it themselves to see if they can find anything wrong with it.
    It is much easier to avoid talking at cross-purposes when you have a common set of concrete facts. An imprecise description can confuse what you did; code & protocols clarify this.

  5. Continuity of work
    The ability to actually pick up where you or someone else left off, rather than spending months redoing stuff that’s already been done but not adequately documented.

    The ability to run old code on a different computer, or use an old protocol in a new lab and still have a good chance of it working.

  6. Building a reputation
    People know your work is in good faith because they can actually see it; even if it’s not perfect, at least they know you’re honest.

    People will like working with you and your published data better because it’s easy to use. A good dataset or software package can translate into a lot of citations.

    If you publish data and details of your analysis with every paper, that’s two more DOIs than normal, and more published artifacts on a CV = CV more better 🤷?

Learning more about the technical and systemic barriers to open and reproducible work, as well as the tools and solutions that facilitate the adoption of these workflows, has been an interest of mine for about the last 8 years¹, or more or less since the start of my academic career as a PhD student. This has progressed to the point where that number of years was calculated on the fly from the difference between the current date and the approximate date at which I started my PhD, so that it will remain accurate in future revisions. I’m aiming here to codify as many of my learnings as possible and make them accessible to you, so that you can do better than I have managed to do so far and avoid some of the pitfalls I’ve encountered along the way. As such, this resource is at present based largely on my experience and influenced by my opinions; if and when the project gets more contributions, that may change. A good resource with more contributors, which is organised less linearly and covers more domains, is RDMkit (Research data management kit) from the ELIXIR-CONVERGE project, which aimed to “help standardize life science data management across Europe”.
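The on-the-fly year count mentioned above can be sketched along the following lines. This is a minimal illustration in Python, not the code actually used to build this resource; the function name and the start date are hypothetical placeholders.

```python
from datetime import date


def whole_years_between(start: date, today: date) -> int:
    """Count the completed years between two dates."""
    years = today.year - start.year
    # Subtract one if this year's anniversary of `start` has not yet passed.
    if (today.month, today.day) < (start.month, start.day):
        years -= 1
    return years


# Hypothetical start date, chosen only for illustration.
PHD_START = date(2015, 10, 1)

# Recomputed every time the document is rendered, so the text never goes stale.
print(f"about the last {whole_years_between(PHD_START, date.today())} years")
```

Embedding a computed value like this in the text, rather than typing the number by hand, is a small instance of the "update automagically" workflow described in the list of reasons above.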

How to use this resource

Searching this resource

Press k to summon the search box, or click the magnifying glass above the table of contents in the top left of the page.

You can share a search by clicking the copy icon in the search dialog.

You can share a link to a section by using the link icon that appears next to section headers when you hover over them.

It should be possible to consume this document out of order, referring only to the sections relevant to your problem(s); cross-references will be present when another section provides useful context. That said, it does aim to be readable in its entirety by a general reader with an academic/technical background.

Collapsed blocks

You will find occasional ‘callout’ blocks like this colored box in the body of the text which are collapsed by default. You can click to expand them and read their contents. Sometimes when I’m delving into technical specifics I’ve hidden these by default so that they are available to anyone who needs them but won’t disrupt the flow for the general reader.

In each section there will be an overview of the topic with links to external sources covering these topics in additional detail. I aim to include text and video resources on each topic to suit different people’s preferences for the media from which they learn best. In addition, I will attempt to provide sources which range from basic/quickstart/TLDR introductions for newcomers to longer-form, deeper-dive content, so that beginners and advanced users all have something of value to refer to in each section. Another type of content I’ll aim to include is high-level overviews, useful both to newcomers before they dive into the details and to supervisors/collaborators who just need a high-level summary to understand why the specialists they are working with are using these tools/methods and how to talk to them about it.

An example of a similar resource to this book, which covers some of the same ground and a number of other areas not addressed here, is The Turing Way, a collaboratively authored book from The Alan Turing Institute.

How to contribute to this resource

“Online, a book can be a gathering place, a shared space where readers record their reactions and conversations.”

- Jennifer Howard (2012)

The source for this document can be found in the HDBI group on renkulab.io. Input, feedback and suggestions are welcome. Anyone wishing to tell me that I am wrong and/or stupid/ignorant for anything I have written here is warmly invited to do so, as long as it is constructive, ideally with specific suggestions for improvement, and in accordance with the contributor code of conduct. The best way of doing this is to open an issue in the GitLab repository or, for small fixes like typos, to directly suggest an edit. Please check existing open issues before opening a new one in case someone else has already spotted the same problem. In the web version of this resource you will see, in the top right under the section headings of this chapter, an edit link and a view link that will take you to the source for the current page if you would like to suggest an edit.

This site has the Hypothesis annotation viewer enabled, so feel free to add comments and annotations to this site there². Email me if you’d like an invite to the HDBI Hypothesis group.

Markowetz, Florian. 2015. “Five Selfish Reasons to Work Reproducibly.” Genome Biology 16 (1). https://doi.org/10.1186/s13059-015-0850-7.

  1. Assuming that I have continued to be interested in this subject since I wrote this, which seems likely at time of writing 🤪.

  2. Quickstart: Hypothesis - Web Annotation Tool Overview

    Longform: Hypothesis 101 - Social annotation for beginners