6  Where To Publish Data

6.1 Searching for Research Data Repositories

re3data is a global registry of research data repositories with over 3,000 entries, where you can search for an appropriate place to deposit your data (Pampel et al. 2023 [cito:citesAsAuthority]). If you don’t know of a suitable repository, searching re3data for one that fits your data is a good place to start.

FAIRsharing.org is a curated resource of educational material on databases, standards, and policies for data sharing.

6.2 Selected Public Data Repositories & Data Sharing Platforms

6.2.1 Sequencing Data

GEO (Gene Expression Omnibus)

  • Accepts data from methods which measure some property of genomic features e.g. expression micro-arrays, RNA-seq, ChIP-seq, ATAC-seq but not genomic sequence data.
  • GEO submission
    GEO has quite a flexible model for metadata: you can (and should) include a good deal of additional data along with any sequencing data that you deposit here. They explicitly provide for secondary data (i.e. data derived from your sequencing data) to be included. It is also a good idea, wherever possible, to include (or link to) the process by which you got from the data and metadata to the secondary results. For example, in previous GEO submissions of RNA-seq data that I processed with a standard nf-core pipeline, I included the count matrices and some other pipeline outputs as secondary results. I also included the command used to launch the pipeline that generated these results, the design matrix input file needed in addition to the sequencing files, and the version of the pipeline that I used. This way someone downloading my data could recapitulate my results exactly by re-running that same version of the pipeline on my raw data. It also means anyone wanting to use my results can interrogate the code in the nf-core pipeline to see exactly how the analysis was performed.

SRA (Sequence Read Archive)

  • Stores the raw sequencing data underlying GEO, and other data including genomic sequencing data

HCA (Human Cell Atlas) data portal

GenBank

  • Stores whole genome sequencing data and assembled genomes

ENA (European Nucleotide Archive)

  • Sequencing data
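The GEO submission practice described above (recording the pipeline, its version, the launch command and the extra input files alongside the secondary results) can be sketched as a small provenance record. This is a minimal illustration only, not a GEO requirement or an nf-core tool: the function names and the JSON layout are hypothetical, and the MD5 checksums are an added assumption intended to help a reader confirm they have the same input files.

```python
import hashlib
import json
from pathlib import Path


def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_provenance(pipeline, version, command, input_files):
    """Collect what is needed to re-run an analysis (hypothetical layout).

    pipeline    -- name of the pipeline that was run
    version     -- exact pipeline version that was run
    command     -- the command line used to launch the pipeline
    input_files -- extra inputs (e.g. the design matrix / sample sheet)
    """
    return {
        "pipeline": pipeline,
        "pipeline_version": version,
        "command": command,
        "inputs": [
            {"file": Path(f).name, "md5": md5sum(Path(f))}
            for f in input_files
        ],
    }


def write_provenance(record, destination):
    """Write the record as JSON so it can be deposited with the results."""
    Path(destination).write_text(json.dumps(record, indent=2) + "\n")
```

A call might look like `build_provenance("nf-core/rnaseq", "3.14.0", "nextflow run nf-core/rnaseq -r 3.14.0 --input samplesheet.csv --outdir results -profile docker", ["samplesheet.csv"])`, where the version, file name and command are purely illustrative values. The resulting JSON file could then be uploaded as a supplementary file next to the count matrices.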

6.2.2 Imaging Data

‘Global Bioimaging’ (Swedlow et al. 2021 [cito:agreesWith] [cito:citesAsAuthority] [cito:citesAsRecommendedReading] [cito:discusses]) is a group founded “to disseminate best practices, develop common imaging and data standards that promote data sharing”. They describe a distinction between ‘image data archives/repositories’ and ‘added-value databases (AVDBs)’. The Image Data Resource (IDR) is an added-value database whereas the Bioimage Archive is, as the name suggests, an archive/repository (see the dedicated sections on these below).

The envisioned workflow for image generation, storage and sharing is outlined at a high level in three steps:

  1. Local Data Storage (pre-publication)
  2. Archive/Repository
  3. Added-value Database

This might for example go:

  1. a local OMERO instance
  2. Bioimage Archive
  3. IDR

Only data with sufficient value to be archived should make it from local data storage into public data repositories. If it is data that underpins a published result then it should be archived. Once in a public archive that data should ideally only be referenced by AVDBs rather than replicated to them to avoid unnecessary duplication. Work is underway at the Bioimage Archive to implement APIs which should make it easier to submit data to the archive directly from a local OMERO instance. This should also make further curation/annotation efforts in the AVDBs like IDR & EMPIAR easier.

When publishing results based on image data, you can increase their value and communicate them more effectively by following community-developed standards such as those described here: (Schmied et al. 2023 [cito:citesAsRecommendedReading]). These recommendations come with simple checklists to follow for publishing images and image analysis workflows.

When depositing images in repositories such as the Bioimage Archive (BIA), the REMBI (Recommended Metadata for Biological Images) standard provides excellent guidance for making image data FAIR (Sarkans et al. 2021 [cito:citesAsRecommendedReading]).

IDR (Image Data Resource)

  • An added-value database of high-quality, well-annotated bio-image data of cells and tissues
  • Quite extensive manual curation; currently only accepting ‘reference’ collections with high potential for re-use.
  • IDR is an instance of OMERO, so managing your metadata in a local instance of OMERO should make it easier to release results here.

Bioimage Archive (EBI)

  • An archival repository of biological images
  • Broad scope, any scale or modality
  • Associated with a publication or resource of general interest.

EMPIAR (Electron Microscopy Public Image Archive)

  • An added-value database for electron microscopy data

figshare

  • Sharing of other images / figures. This can include things like raw blots and gels underlying more highly edited figures in a paper.

6.2.3 Protocols

Useful article in Nature about writing reproducible lab protocols (Baker 2021 [cito:citesAsRecommendedReading]).

Protocols.io

  • Publish details of your laboratory protocols as step-by-step procedures, optionally supplementing the textual descriptions of the methods with images and other media. (Unfortunately protocols.io is a proprietary platform operated by a private company (Springer Nature), not a publicly owned archive or open source tool, but I’m not aware of any good alternatives at the moment.)

JOVE

  • (Journal of Visualized Experiments) Publish videos of how you perform your experimental work. This makes it easier to share intricate experimental details not readily captured in text.

6.2.4 Code & Computational Environments

This flow diagram is intended to guide you through the steps of sharing, publishing and distributing different kinds of research software outputs. In addition to the diagram there is further information in the sub-sections below.

Software packages

  • In language specific package repositories

    • R:
      Bioconductor and CRAN (Comprehensive R Archive Network), both of which have review processes for submitting packages to their repositories.

    • Python:
      PyPI (Python Package Index), where anyone can upload a package.

  • Software publications
    If you have written a piece of open source software as a part of your research that stands alone as a substantial scientific output, then you might want to turn it into an academic publication with peer review. These slightly alternative journals and review communities facilitate that.

    • JOSS (Journal of Open Source Software)

    • rOpenSci specifically for R packages

    • pyOpenSci specifically for Python packages

Bioinformatic analysis pipelines

If you have constructed a robust bioinformatic analysis pipeline that does the sort of data processing that others might want to do as well, then, as long as you have used the appropriate tools to build your pipeline, there are options to share it with a wider community of researchers.

  • WorkflowHub for any type of workflow

  • nf-core for Nextflow pipelines

  • targetopia for R {targets} pipelines (via rOpenSci), which is more focused on composable components that can be connected together to perform certain types of analysis than on necessarily complete pipelines

Scripts, notebooks and project-specific workflows can be shared as git repositories.

6.2.5 Biological Materials Access / Sharing

HDBR (Human Developmental Biology Resource)

  • “[HDBR] is organised from two sites: the Institute of Genetic Medicine, Newcastle, and the Institute of Child Health, London. The HDBR is an ongoing collection of human embryonic and fetal material ranging from 3 to 20 weeks of development.”
  • See also the HDBR Atlas: “a digital atlas comprising 3D reconstructions from Carnegie Stage 12 to 23, generated using Optical Projection Tomography (OPT), and annotations of the 3D models linked to an anatomical database”

6.2.6 Spatial transcriptomics

A consensus has yet to emerge in this area, and different technologies have different underlying data types: some use sequencing and some are more array-like.

The SpatialData format is emerging as an open format for processed spatial data. It is most similar to other image data formats, as it is built on zarr data structures and has been developed in coordination with the OME-NGFF efforts. It is therefore possible that BIA might take submissions in this form; one could consider the expression matrices as rather extensive image metadata.

The Haniffa lab’s webatlas is a good tool for viewing this data especially if it is integrated with single cell transcriptomics.

6.2.7 Flow Cytometry

  • FlowRepository, the International Society for Advancement of Cytometry (ISAC) FCS File Repository

    • Data deposited in FlowRepository should meet the MIFlowCyt (Minimum Information about a Flow Cytometry Experiment) standard (Lee et al. 2008 [cito:citesAsAuthority]). This paper is a guide to preparing data that meets the MIFlowCyt standard: (Spidlen, Breuer, and Brinkman 2012 [cito:citesAsRecommendedReading]).

6.2.8 Proteomics

PRIDE is the primary repository for proteomics data. To submit data to them you need an account and their Java-based submission tool, which you download; the process is well documented on their website.

6.2.9 None of the above

If your data does not fit into any of the above categories and you can’t find a public repository that will host it for you, then there are a number of generalist repositories such as Zenodo, OSF, & Dryad (see this Zenodo publication on choosing a generalist repository).

If for some reason even these generalist repositories don’t work for you, you might consider hosting your own instance of Dataverse, DataHub or iRODS, which are software projects that provide tooling for managing your own data repository.

6.3 Integrated publishing - a possible future

Data, analysis, prose, collaboration, pre-print, review and publication in one place with literate programming and single source publishing

You begin your project on an instance of a platform like Renku (Section 4.8). Start by uploading your raw data to a domain-specific data repository. You get a DOI or accession for your dataset and import this into your project. You perform your computational analyses in the reproducible computational environment, potentially documenting your analysis as a workflow that could be used by others with a pipeline management tool.

You write your manuscript in a literate programming format like Quarto. You work with your collaborators on the manuscript using a git hosting tool like GitLab, where you raise and discuss issues and share revised versions. You generate your statistics and graphics for inclusion in the manuscript with code from your data in a reproducible computational environment.

You publish a pre-print by making use of a static site generator like the one built into GitLab and simply setting the project to public. You tag this version 0.0.0 and associate it with a DOI from Zenodo. To manage reviews of your work you make use of GitLab issues in a manner similar to the review processes of JOSS, rOpenSci and F1000, but potentially independent of a particular publication venue through community peer review projects like Peer Community In (PCI) & Review Commons. This approach permits author-led updates, errata, & corrections whilst preserving a version of record (Kane and Amin 2023 [cito:agreesWith]).

Once reviewed and published you have the 1.0.0 version of your manuscript. For future minor corrections you increment the patch version to 1.0.1, and your change-log reflects that you fixed a typo. If you add a new dataset or fix an error that changes an outcome you increment the minor version number. If the journal updates the version of record you increment the major version number.
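The versioning rules just described can be sketched as a small function. This is only an illustration: the change labels (`minor_correction`, `outcome_change`, `version_of_record`) are hypothetical names for the three cases in the text, and resetting the lower components to zero on a minor or major bump is an assumption borrowed from common semantic versioning practice rather than something stated above.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """A three-part manuscript version such as 1.0.1."""

    major: int
    minor: int
    patch: int

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


def parse_version(text: str) -> Version:
    """Parse a string like '1.0.0' into a Version."""
    major, minor, patch = (int(part) for part in text.split("."))
    return Version(major, minor, patch)


def bump(version: Version, change: str) -> Version:
    """Return the next version for a given kind of change.

    The change labels are hypothetical; lower components are reset to
    zero on minor and major bumps (an assumed convention).
    """
    if change == "minor_correction":  # e.g. fixing a typo
        return Version(version.major, version.minor, version.patch + 1)
    if change == "outcome_change":  # new dataset, or a fix that alters a result
        return Version(version.major, version.minor + 1, 0)
    if change == "version_of_record":  # the journal updates the version of record
        return Version(version.major + 1, 0, 0)
    raise ValueError(f"unknown change type: {change!r}")
```

For example, `str(bump(parse_version("1.0.0"), "minor_correction"))` gives `"1.0.1"`, matching the typo fix case in the text. The resulting string could be used as the git tag for the corrected manuscript.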

In this fashion the complete history of the project is documented from start to finish, and you never have to change medium from scripts to manuscripts in word processors, to emailing PDFs, to publisher websites, etc. Review is handled with the same set of tools as your internal collaboration with co-authors. Pre-print publication amounts to creating a version tag and setting the repo to public. Anyone can pick up your project in its entirety and play around with their own variants of your analysis at the click of a button (specifically the ‘fork’ button).

Baker, Monya. 2021. “Five Keys to Writing a Reproducible Lab Protocol.” Nature 597 (7875): 293–94. https://doi.org/10.1038/d41586-021-02428-3.
Kane, Adam, and Bawan Amin. 2023. “Amending the Literature Through Version Control.” Biology Letters 19 (1). https://doi.org/10.1098/rsbl.2022.0463.
Lee, Jamie A., Josef Spidlen, Keith Boyce, Jennifer Cai, Nicholas Crosbie, Mark Dalphin, Jeff Furlong, et al. 2008. “MIFlowCyt: The Minimum Information about a Flow Cytometry Experiment.” Cytometry Part A 73A (10): 926–30. https://doi.org/10.1002/cyto.a.20623.
Pampel, Heinz, Nina Leonie Weisweiler, Dorothea Strecker, Michael Witt, Paul Vierkant, Kirsten Elger, Roland Bertelmann, et al. 2023. “Re3data Indexing the Global Research Data Repository Landscape Since 2012.” Scientific Data 10 (1). https://doi.org/10.1038/s41597-023-02462-y.
Sarkans, Ugis, Wah Chiu, Lucy Collinson, Michele C. Darrow, Jan Ellenberg, David Grunwald, Jean-Karim Hériché, et al. 2021. “REMBI: Recommended Metadata for Biological Images - Enabling Reuse of Microscopy Data in Biology.” Nature Methods 18 (12): 1418–22. https://doi.org/10.1038/s41592-021-01166-8.
Schmied, Christopher, Michael S. Nelson, Sergiy Avilov, Gert-Jan Bakker, Cristina Bertocchi, Johanna Bischof, Ulrike Boehm, et al. 2023. “Community-Developed Checklists for Publishing Images and Image Analyses.” Nature Methods, September. https://doi.org/10.1038/s41592-023-01987-9.
Spidlen, Josef, Karin Breuer, and Ryan Brinkman. 2012. “Preparing a Minimum Information about a Flow Cytometry Experiment (MIFlowCyt) Compliant Manuscript Using the International Society for Advancement of Cytometry (ISAC) FCS File Repository (FlowRepository.org).” Current Protocols in Cytometry 61 (1). https://doi.org/10.1002/0471142956.cy1018s61.
Swedlow, Jason R., Pasi Kankaanpää, Ugis Sarkans, Wojtek Goscinski, Graham Galloway, Leonel Malacrida, Ryan P. Sullivan, et al. 2021. “A Global View of Standards for Open Image Data Formats and Repositories.” Nature Methods 18 (12): 1440–46. https://doi.org/10.1038/s41592-021-01113-7.