4  Working With Data

The challenges of working collaboratively on data and some solutions

4.1 Electronic Lab Notebooks (ELNs)

“rigour, reproducibility and robustness. These remind us of the reason why we became scientists in the first place.”

- Levi Garraway (2017)

Note: The ELN function is also frequently combined with a LIMS (Laboratory Information/Inventory Management System) solution, as the two are often closely interrelated. They are, however, distinct functions and you may wish to pick different tools for each.

See also the Turing Way sub-chapter on ELNs, to which I have contributed.

ELNs are incredibly useful and massively expand on the utility of a paper lab notebook, adding links, images, multimedia, collaboration, search and sharing, integration with LIMS systems, and much more. It is, however, important when choosing an ELN solution that you do not give up the advantages offered by a paper lab notebook.

There is an immense and baffling array of options in the Electronic Lab Notebook space. Many organisations offer software that purports to solve the problem of electronic lab notebooks, so choosing a suitable solution can be a major headache. The choice of ELN is an incredibly important decision and one that your lab/institution will likely have to live with for years or even decades. You are putting the record of your research into the hands of the tool that you choose, and entering into a long-term relationship with the provider of your ELN solution.

Consider the differences between a paper copy of a lab notebook (PLN) and an electronic copy. Consider also which properties of the paper copy it is important to be able to retain when adopting an electronic alternative.

A paper lab notebook is physically under your control: you (or more likely your institution) own it. You can control access to it physically. Your physical possession of this resource means that it would be difficult for anyone to prevent you from accessing it. It would also be difficult for anyone to charge you a fee in order to continue using it. You do not need any specialist tools in order to access its contents. You are not dependent on the functioning of any complex systems like computer networks in order to be able to use your paper lab notebook. You do not have to agree to a ‘terms of service’ or ‘end-user license agreement’ with a 3rd party (the terms of which are likely subject to unilateral alteration by that 3rd party) in order to use and retain access to your lab book.

If the provider of my paper lab books goes out of business it has almost no bearing on my ability to continue doing my work. One paper notebook is much like another, so finding a new provider is pretty easy. Changing providers does not impact my ability to access my past notebooks or to continue operating with the same workflow in my future ones. This is not necessarily true of ELNs. Few active measures are needed to maintain the data in paper notebooks; they are vulnerable in that they exist in only one copy, but as long as they are kept in a cool, dry and dark spot they will likely last decades. Electronic data requires much more active upkeep.

Lab notebooks perform an archival function, and proprietary formats are antithetical to this as they assume that the institution acting as gatekeeper to the proprietary format will outlive the need to archive the material. When choosing an archival format one seeks to maximize the likelihood that one can recover the relevant information from that format. Using a proprietary solution is taking a needless risk with the future of your data. Your data’s fate can become tied to that of the firm, or project within a firm, that develops and operates the software that you use to store your lab notes.

When looking for any piece of software the first question that I ask is: “Is there a libre / open-source solution to this?”. If it is a web app I ask: “Can I host my own fully featured instance, should I need to?”. I also ask: “Is there a large community using the project, and does it have institutional backing of some kind?” This might take the form of a company which sells service contracts, or offers paid hosting, ideally with feature parity with a self-hosted option. Or perhaps a foundation or other non-profit/academic organisation with robust funding.

Open solutions provide the assurance that, if I do the appropriate preparatory work, I should be able to access all of my data in its native form in the future by running the ELN application in a VM or similar reproducible computational environment, should I need to. Even if the tools are no longer maintained and in a state that can be used in production, they can still be used to read the data and interact with it in the same way. The data will also likely be stored in an open format from which it can relatively easily be extracted and ported to a new format.

This recent review (Higgins, Nogiwa-Valdez, and Stevens 2022 [cito:citesAsRecommendedReading] [cito:critiques]) provides a good overview of considerations when adopting an ELN solution. It covers such things as regulatory compliance which I have not touched on here. It does occasionally appear to conflate open-source solutions with self-hosted ones, which need not necessarily be the case. This guide on choosing an ELN from Simon Bungers of Labfolder is also worth a read. Some companies will let you host proprietary apps on premises, and you can pay for 3rd party hosting and administration of open source applications. This matters because, if you don’t have the expertise or internal resources to administer a self-hosted instance of an open-source ELN solution, you can still pay a 3rd party to do this for you; you then get the benefits of professional support and the reassurance of an open solution. You should still take regular local backups of exports from your hosting provider, from which you could restore your ELN system with different hosting. This means that you retain the option to change providers, as the hosting and support are no longer vertically integrated parts of the software as a service (SaaS) experience for you.

The only ELN/LIMS software solutions that I have so far identified that meet my initial screening criteria are listed here. They are each quite different but share many of the same core features: for example, rich text editing in a web browser, the ability to upload files, and sharing and permissions based on roles/groups.

4.1.0.1 eLabFTW

  • Laboratory resource scheduling feature for booking things like hoods and microscopes, automatic mol file previews for molecules and proteins & support for free-hand drawing.

  • The eLabFTW site and documentation; there is also a demo deployment that you can try out.

  • Self-hosting is relatively simple according to the documentation. There is also a paid support tier, which would be recommended for any larger deployment to support the ongoing development of the project.

  • Paid cloud hosting is available from the developer in a geographical region suited to your needs; a more expensive tier, hosted in France and compliant with additional security and privacy certifications, is also available.

4.1.0.2 openBIS

  • Good features for integrated metadata management, e.g. linking to ontologies / controlled vocabularies. This is based on a flexible object system for making similar entries.

  • openBIS has an API and can integrate with jupyterhub for electronic lab notebooks.

  • Very feature rich LIMS system with optional integration of stores management with protocols and experiments including keeping track of bar-coded stocks.

  • You can get a feel for it in the demo deployment.

  • openBIS is a bit more complex to administer based on its documentation. It’s a slightly older and more complex project than the others on this list, meaning it is very featureful and well tested against the needs of the groups at ETH Zurich.

  • Developed at ETH Zurich, openBIS can be hosted for you under the openRDM service operated by ETH Zurich Scientific IT Services. No fixed pricing is available; cost would depend on your specific needs.

4.1.0.3 OSF

  • OSF is oriented towards sharing and collaborating on your work, including the ability to generate DOIs and host pre-prints directly on the main instance.

  • It is free to use OSF at the main instance at osf.io so you can try it out there directly. For larger data you must provide your own additional storage addons, available from a number of cloud storage providers.

  • Whilst you can host OSF yourself, the project presents this as being for development purposes, and self-hosting is not directly available as a paid service.

  • Strong sharing features make it easy to take your ELN and make it, or parts of it, public.

Notes Beyond ELNs

There are additional sections covering the management of types of information which don’t necessarily fit into an ELN solution: for more general, personal or informal notes see the short section on Personal Knowledge Management (Section 9.2); for bibliographic information management, the section on Zotero (Section 9.1); and for passwords, the section on Bitwarden (?sec-bitwarden).

4.2 Code vs GUIs for provenance and repeatability

“Code is text, code is readable, code is reproducible”

- Hadley Wickham

The choice between analyzing your data in a graphical (GUI, Graphical User Interface) tool, such as a spreadsheet or a statistical analysis and plotting application, and doing so in code in a programming language such as R or Python is a significant one, though the two are not always mutually exclusive. This choice is sometimes made for us when the only tools available to solve our specific problem are either graphical or command line (i.e. code). On many occasions, however, we are presented with a choice between the two.

Working in a graphical tool is typically, though not always, faster to pick up for those with no prior coding experience than interacting through code, and can make it easier to quickly get started with data analysis with a shallower learning curve.

Working in code typically provides much better provenance for the data and implicitly documents every step taken in the analysis of your data. With a graphical tool it is commonplace for the steps taken to require manual, and often inexact, instructions to repeat the same analysis. This requires that such steps be carefully documented, and meticulously followed by someone attempting to reproduce your work. Both of these steps leave significant room for a class of ‘operator’ error: failing to unambiguously document a step, misinterpreting or having to guess at an ambiguous step, or just random mistakes. Whilst these difficulties are hard to avoid in lab protocols, where physical steps must be described, they are theoretically avoidable in computational analyses, which reduce to a series of unambiguous, mechanically executed steps. This can introduce its own class of errors, for example bugs in widely used tools which, if undiscovered, lead to widespread errors, so there are some trade-offs.

A graphical tool which permits you to define a series of operations, export these instructions to a file, and import that process into a different session is a step up over one in which the user must repeatedly manually specify an action. However such approaches are usually not quite as robust as code: software versions change and user interfaces no longer match the instructions or can no longer import the files from older versions. This can be especially difficult with proprietary or SaaS (software as a service) solutions where access to older versions of the software is not available. It is much easier to maintain the equivalent of a lab notebook for your computational analyses if you are able to do so in code than it is to do the equivalent when using a GUI tool.

Some GUI tools which generate/edit code snippets based on a GUI wrapper, or which produce a file containing manual annotation information, can provide a bridge between things that are just easier to do graphically and purely code-based solutions. These sorts of hybrid solutions are available for certain tasks and make it possible to have a primarily code-first workflow augmented by GUI assistance when needed/desired. This document is an example of this sort of workflow. I’m currently writing it in RStudio’s visual editor mode, which resembles an ordinary WYSIWYG (what you see is what you get) word processor with all the bells and whistles like automated reference management integrated with Zotero, but it’s actually generating well-formed Rmarkdown syntax.

Another useful example of this, for generating reproducible figures from imaging data and graphs in Inkscape with ImageJ, is Jérôme Mutterer’s inkscape-imagej-panel plugin.

A section from Hadley Wickham’s 2019 Keynote at EMBL covering the merits of computational notebooks for reproducible science.

4.3 Reproducible computational analyses

Two Scientists look at a blackboard with complicated mathematical notation on either side of a section in the middle which reads: "then a miracle occurs". One of the scientists is pointing to this section and says "I think that you should be more explicit in step 2"

A miracle occurs - Sidney Harris

“In science consensus is irrelevant. What is relevant is reproducible results.”

- Michael Crichton

Important

To reproduce, or indeed to easily collaborate on a data analysis project you need shared access to three things:

  • Code / Documentation
    • The source code and its dependencies that spell out the steps taken in an analysis, plus the comments, context and motivation for writing the code that you did: how and why it works the way it does.
  • Data
    • The inputs to that code, both larger datasets and the configuration options / parameters used.
  • Compute Environment
    • The computational context in which the code was run: operating system, package versions, configuration, etc.

We will cover a number of technologies in the following sections, each of which solves a different aspect of the problems associated with performing, collaborating on, and sharing reproducible computational analyses. Then we will look at a tool which brings many of these technologies together into a single, relatively easy to use platform: Renku. If you are in a hurry you can skip directly to the Renku section (Section 4.8) and revisit the intervening sections as needed, though I’d suggest at least skimming them to get a little context.

4.3.1 Source Management

AKA version control or source control

When working with code at any scale beyond a few small scripts (and sometimes even then) it is highly advisable to use a tool to keep track of the changes that you have made to your code. This is especially true if you are collaborating with others, as such tools usually also feature utilities to help you merge code developed by multiple people working on the same project asynchronously. The de facto standard tool for this is git; it is widely used and there is much tooling built around the core git software.

git - Track changes but OP (and a bit more complicated)

- Richard J. Acton

Using git in a data analysis project is also a bit like using a lab notebook. Whenever you take a snapshot of your project by making a ‘commit’ you accompany it with a ‘commit message’ giving a brief description of why you did what you did. A digital file is not necessarily like a lab notebook: a physical notebook has a chronological order where you can see the history of what you did, when and why, whereas a digital file that you change over time just has its current form and does not retain a history of its changes. git adds this chronological dimension back to digital projects, letting you time travel through the history of your projects. This can be very valuable, for example if you want to be able to get back a result exactly as you generated it before you updated your code.

It is simple to learn basic git operations but its underlying structure can be a bit conceptually difficult to grasp. I recommend taking the time to form a good mental model of git’s workings if you are going to use it regularly. If you want to understand more, perform more advanced operations, or indeed just fully understand the simple ones, see the learning resources below for some more in-depth material on git.


Before we delve a little more into git I’m going to introduce another concept - Literate programming. Source code is after all just text and many of these same concepts translate well to collaborating on prose.

4.3.1.1 Literate programming

“Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”

- Donald Knuth

Literate programming is a paradigm for writing code interspersed with natural language prose, or vice versa. The concept was introduced by Donald Knuth in 1984; an early example of literate programming was LaTeX, the document authoring and typesetting language still popular with anyone writing mathematical notation on computers. Literate programming is now very popular in ‘computational notebooks’ used by data scientists; these are in many ways the computational equivalent of a lab notebook. This is a literate programming document. It leans heavily towards prose and does not contain much code, but I can easily include some, check it out:

2 + 2
[1] 4

The literate programming tool I’m using is called Quarto. With it I can write plain text documents which include snippets of code in R or a variety of other languages, and format my text with a simple markup syntax called markdown. As I mentioned above, I’m currently editing this document in a WYSIWYG editor, much like the word processors with which you are likely familiar, that generates the Quarto/markdown formatted text. Markdown is however remarkably simple and very easy to learn, and I regularly switch between source and visual modes with minimal friction.

This is an extremely powerful tool for generating and properly documenting my work, and indeed for outputting it in different publication formats, a concept called single source publishing. This document, for example, is automatically published as a website, a pdf & an epub every time I commit and push changes to the gitlab repository where it is hosted. You can even get your markdown formatted according to the requirements of many journals with {rticles} or Quarto Journals. Thus the published output from this source document is tightly coupled to the code in it. Any code I write here is re-run when this document is built (unless I cache the results).
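As a rough illustration of single source publishing, rendering one Quarto source file to several output formats is a one-line command per format (a minimal sketch; the file name notes.qmd is a placeholder):

# render the same source document to different output formats
quarto render notes.qmd --to html
quarto render notes.qmd --to pdf
quarto render notes.qmd --to docx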

A Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions.

- John Gruber

Here’s a quick markdown syntax rundown (~90% of all the markdown syntax you’ll ever need):

# Heading 1
## Heading 2
### Heading 3 etc. {#sec-h3}

[hyperlink text](url)
footnote [^1]
Inline references were used by @smith2021, it has been claimed [@jones2022] (not inline)
cross references @sec-h3 using the h3 short alias

**Bold**
*italic*
***Bold & Italic***
`inline code`

-   Bullet point
    -   nested
-   point

1.   Numbered list
2.   another thing ...

> Quotation - unattributed

![alt text](/path/to/an/image.png)

```
generic code chunk
```

$inline~math~x^y$

$$
math~chunk~\frac{x}{y}
$$

[^1]: callback!
Tip

Markdown comes in a number of ‘flavors’, usually supersets of the CommonMark specification / reference implementation which extend it with additional features, so there is some variation in syntax. Many tools have built-in linters to check/auto-correct any syntax not supported in a given flavor.

4.3.1.1.1 Literate Programming Learning resources

Quarto is a scientific and technical publishing system that uses markdown

Whilst still from the Posit (formerly RStudio) team it is more language agnostic than Rmarkdown which may be familiar to R users and can be installed as a separate command-line utility without R dependencies. It can use jupyter notebooks as a source document format and integrates well with vs-code as well as RStudio. It also unifies the variety of different pre-processing steps for different output formats previously performed by a family of R packages bringing us closer to true single source publishing.

If you are starting in 2023, begin with Quarto: it’s basically the same as Rmarkdown but better, and highly backwards compatible with Rmarkdown. These texts are still relevant, but you can now do a lot of cool new stuff in Quarto; in particular, ‘fenced divs’ are awesome.

4.3.1.1.2 Jupyter Notebooks (Python)

Jupyter notebooks are another major player in the scientific computational notebook space and originate in the python community with the IPython interactive shell. They are run from within your web browser, either locally or on a remote JupyterHub. They also make use of markdown syntax for literate programming.

Unlike Quarto/Rmarkdown they are not tightly integrated with an IDE (integrated development environment) meaning they sometimes lack some of the features that this can provide. Though tighter integrations with Microsoft’s featureful open source text editor vscode are changing this.

I am primarily an R user; I’ve used jupyter notebooks for Python and Raku projects but much prefer the experience of Rmarkdown-style notebooks over jupyter, mostly because with jupyter you do not generally see or edit the actual source document, you only generate it from the interface. This makes working with version control tools like git more challenging. Thankfully MyST makes markdown-style notebooks possible if you don’t like Quarto.

There are trade-offs between the Rmarkdown & Jupyter Notebook ways of working (see: The First Notebook War) but Quarto and jupyter book in conjunction with MyST go a long way to resolving some of these issues.

If you are primarily a python person looking to get started with a literate programming workflow I would suggest that you avoid classic jupyter notebook files in favor of those written entirely in Markdown.

You could use Quarto with VScode, and/or Python in RStudio with Quarto and {reticulate}, over jupyterhub or the jupyter extension for vscode, but this may not be the best fit for your established workflow; it is a matter of taste.

One of the nice features of vscode for working collaboratively is the live share extension which gives you real-time google-docs-like collaboration tools, though of course you can still use git in vscode for asynchronous collaboration. JupyterHub now also has support for real-time collaboration.

4.4 Jupyter/Quarto & vscode learning resources

4.4.0.1 Using git

I’m including command line examples here but the concepts should map well onto a number of different GUI front-ends to git. The glossary should help with both git’s terminology and understanding some of the key concepts that make it up.

  • repository/repo - a project in git; a repo is just a directory (a folder) with the right git configuration files in it.
    • git init initializes a new repo; you can also create one on a git hosting service like gitlab or github and then clone a local copy. After this you’ll find a hidden .git folder in your repo.
  • commit - A single point in the git history, like a save point. Commits have authors, short messages, optional longer descriptions, and a reference to the previous commit. When you have made changes in your repo you can pick those changes that make sense to group together under a single commit and add this state to the tracked history of your project by committing these changes.
    • git commit <file that I changed> -m"Short informative message"
    • commit all changes to tracked files git commit -am"Short informative message"
  • add - To include a file in tracking by git. Often used to add multiple files; if there are some items in your project that you explicitly do not want to be tracked by git you can specify these in a .gitignore file.
    • git add new_file.txt
    • git add -A add all files - handle with care! you don’t want to accidentally commit a large binary file or things that should be kept secret.
    • [Advanced note: an untracked file will remain in your working directory even if you change branches, and by default will not be stashed if you run git stash to get a clean working tree to perform other git actions. Untracked files must be added first or stashed with the -u option.]
  • branch - commits form a tree with each commit referencing the previous one. This tree can branch and allow two different commits to have the same parent. Thus a branch represents a particular lineage of development. The default name for the primary branch or ‘trunk’ of a repo is usually master or main.
    • list branches git branch -l
    • create branch git branch <new branch name>
    • switch to branch git checkout <new branch name>
    • delete a branch git branch -d <branch name>
  • head - The name referring to the commit at the tip of a branch, i.e. the latest commit made to that branch.
  • HEAD - ‘HEAD’ is the ‘head’ of your current active branch. Unless you have a `detached HEAD`, in which case it references the arbitrary commit that you have checked out (you can checkout commits, not just branches, if you want to see the state of the project at the point of a given commit). I find it helpful to think of HEAD as a window that I can drag over the tree of commits. Usually it slides along the tip of the branch I’m committing changes to. If I want to see another branch or point in history I can slide HEAD there and now the working directory will reflect the state of the project at that point in the tree.
  • checkout - to look at the state of the repo on a given branch or at a given commit. ‘Checking-out’ the repo can be scary as it can look as though your work has disappeared from the repo. All committed states are kept in the .git directory; your working directory merely points to one of them. By changing which commit your HEAD is pointing to you can move the window of your working directory around the history and branches of your project.
  • remote - A repository tracking the same code base, kept somewhere else. Often used in the context of a git server from which you might pull and to which you might push changes. The default name of a repository’s main remote is usually origin.
    • list remotes and their urls git remote -v
    • add a new remote git remote add <remote name> <remote url>
  • pull - To retrieve changes from a remote repo. If you are collaborating on a project you would generally pull any changes from the remote before starting work so that you have the latest version of your colleagues’ changes.
    • git pull <remote name> <branch name>, commonly git pull origin master
  • push - To send the changes that you have made locally to a remote. The commit at the tip of the remote branch must be an ancestor of the tree that you are trying to push to the remote. If it is not you will have to perform a pull and resolve conflicts between your changes and those of your colleagues.
    • git push <remote name> <branch name>, commonly git push origin master
  • log - The history of git commits. This shows the commit hash, message, author, author email, and date of previous commits.
    • git log shows a list of commits; adding the --graph flag shows a text-based graphic of the branch structure
  • commit hash - Every commit has an identifier associated with it: a seemingly random string of letters and numbers. This is called the commit hash (because it is a SHA-1 hash of the commit). It is a function of the contents of the commit and an (approximately) globally unique identifier for it. You can refer to a commit by a unique portion of its commit hash; this is why you often see only the first few characters of a commit used to reference it. (Remember that a commit contains the hash of the previous commit, making git a hash tree data structure from which you can infer a directed acyclic graph.)
  • staging area - The git staging area houses the contents of proposed commits. It lets you decide which of the changes you have made you would like to commit.
    • The git status command reveals the contents of your staging area under Changes to be committed:. If you change a tracked file it will appear under Changes not staged for commit: until you stage it with git add. If you create a new file it will appear under Untracked files: and will not be in the staging area until you add it with git add.
    • The staging area can be manipulated interactively from the command line with git add -i
  • stash - The stash is generally used to get your uncommitted changes out of the way so that you have a clean working directory. Many git operations are easier or only possible once you have a clean working directory, for example pulling from upstream and then re-applying your changes to the updated tree. This way you can update whilst avoiding committing or deleting unfinished changes.
    • git stash stash your current changes
    • git stash list list the stashes
    • git stash pop apply the most recent item in the stash list to the working directory. This is like rebasing in that it will apply your stash on top of the tip of the current branch.
  • merge - Replaying the changes made on one branch on top of those made on another in order to combine them. There are different approaches to combining branches; merging may not always be the best action, and we will discuss branching and merging strategies later.
    • If I am on the master branch and I want to merge in the feature branch I can use the command: git merge feature to merge the branches. This will, if there are no conflicts, create a merge commit.
    • Resolving merge conflicts can be fiddly and is best avoided. Try and get your branch into a state where it can be merged without conflicts before attempting a merge. You can back out of a merge with git merge --abort. If you weren’t expecting a conflict, do this before making any attempts at resolving the conflict, as if you make changes during a merge you may not be able to revert cleanly.
  • rebase - Reapply commits from your branch onto the tip of another. If your branch has diverged from master but has no conflicts with its current tip then instead of merging you can rebase on master: ‘snip’ your branch off from its current parent and automatically generate new commits where the current tip of master is the parent instead.
    • Rebase your current branch on master git rebase master.
    • Rebase a specific branch on master git rebase master <branch name>
    • You may also encounter the concept of a ‘fast-forward merge’: when the branch being merged is a direct descendant of the target branch’s tip (for example after a rebase), the branch pointer can simply be moved forward without creating a merge commit.
  • diff - Get the difference between file(s) at different states. This might be between two commits, or various special cases of this; a common diff is between the staging area and unstaged changes. (If you use git on the CLI check out delta for improved visual diffs.)
  • status - What is the condition of the working tree? Which changes have been made since the last commit? Which of the changes have or have not been staged yet?
  • mv / rm - using the base UNIX commands rather than the git versions may not have the desired effect. If you mv a file to re-name it, it will appear to git as though you deleted and re-added it unless you use git mv. If you rm a file you will, counter-intuitively, need to add the action of removing it to your staging area to let git know you’ve removed it; it is simpler to git rm a file, which both removes it and stages the removal.
  • submodule - gitception: a git repo inside another git repo! 🤯
  • Pull Request (PR) - A request that the stewards of the upstream project pull changes from your tree into theirs.

unfinished!

Installing git…links

Tell git who you are

git config --global user.email "youremail@yourdomain.com"
git config --global user.name "your name"

(Dropping --global will only set these values for the current project)

Initialize a git repository (turn a folder into a git repo)

git init

Add files to be tracked by git:

echo "# README" > README.md # an example file
git add README.md

Set up a remote (a copy of your repository on a git server, to which you can send your changes)

git remote add <remote name> <remote url>

Pull from a remote (get the latest changes from a git server)

git pull <remote name> <branch name>

Get the status of your git repository with git status; this will show you the current branch, which changes are staged or unstaged, and any untracked files.

See the differences between versions of your files with git diff (git diff --staged shows staged changes)

Stage individual chunks of a changed file interactively with git add -p
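Tying these pieces together, a minimal end-to-end sketch might look like the following (the remote URL is a placeholder for wherever your repo is hosted, and newer versions of git may call the default branch main rather than master):

# start a new repository and make a first commit
git init
echo "# README" > README.md
git add README.md
git commit -m "Add README"

# connect it to an empty repository created on a git server (placeholder URL)
git remote add origin git@gitlab.example.com:me/test-project.git
git push -u origin master

# the day-to-day loop: pull, review, stage, commit, push
git pull origin master
git status      # what has changed since the last commit?
git diff        # review unstaged changes
git add -p      # stage changes chunk by chunk
git commit -m "Describe the change"
git push origin master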

4.4.0.1.1 git hosting & UIs

There are numerous GUI (e.g. gitkraken) / TUI (terminal/text user interface) (e.g. gitui, lazygit) front-ends which provide convenient interfaces to git beyond the core command line application. RStudio provides a built-in git UI in which you can commit changes, see diffs, explore history, manage branches etc. By default it is located in a tab in the top right pane of the RStudio interface in projects which use git.

Git and the platforms built around it such as github and gitlab solve the problem of sharing and collaborating on your code, and in the context of literate programming your prose as well.

You can also explore the history of the changes made to a project in the history view on github or gitlab; for example, here is the gitlab history of this document.

Another useful feature of git is attribution. Every git commit has an author, so when collaborating on a project managed in git credit can go to the people who wrote particular parts of the document. (git also distinguishes between an author and a committer, so a committer can commit changes from an author who is not themselves directly using git if desired, though this is not entirely the intended use case.)
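A couple of standard git commands surface this attribution information (a quick sketch; the file name is a placeholder):

# commits per author, most prolific first
git shortlog -sn

# who last changed each line of a file, and in which commit
git blame file.Rmd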

You can temporarily override the default author / committer values set in the global or local git config files by setting these environment variables:

export GIT_AUTHOR_NAME="John Smith"
export GIT_AUTHOR_EMAIL="jsmith@example.com"
export GIT_COMMITTER_NAME="Jane Doe"
export GIT_COMMITTER_EMAIL="jdoe@example.com"
Important

Note that truly deleting things from a git history once that history has been pushed to a repo used by others can be quite difficult. (It can also take a long time: because git is based on a hash tree, if you delete something from the history you have to re-write all subsequent commits. This is part of what makes it such a good system for provenance of code.) So never commit secrets such as passwords or API keys, even to private repos, if these might ever be made public. Storing sensitive values in environment variables is a common solution to this problem.
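One hedged sketch of that pattern: keep the secret in an untracked file, make sure git ignores it, and read it from the environment at run time (the file and variable names here are purely illustrative):

# store the secret outside of version control
echo 'MY_API_KEY="not-a-real-key"' > .env

# make sure the file can never be committed by accident
echo ".env" >> .gitignore
git add .gitignore
git commit -m "Ignore local secrets file"

# load the variable into the environment before running your analysis
source .env
export MY_API_KEY
Rscript -e 'Sys.getenv("MY_API_KEY")'   # analysis code reads the key from the environment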

4.4.0.1.2 git branching strategies for collaborative document editing

feature branches for internal collaborators, forking for external collaborators.

There are a variety of workflow patterns which can be followed when it comes to collaborating on git projects. For a solo project you might be able to get away with just committing directly to the primary branch (often called main or master) almost all of the time. When collaborating, however, it can be a good idea to switch to working in ‘feature branches’. You have some small feature that you want to implement or issue to address, so you make a branch and work on it there. Once you are done you can check on the status of the master branch. If master is ahead of where you branched off you might want to rebase on the new master, resolve any conflicts, and perform a fast-forward merge, appending your new commits to the end of the master branch. It is best to keep the scope of these branches as small as possible so there are minimal issues when merging back into the master branch.

A collaborator with access and permissions on your repository can work on feature branches in your repo, but an external collaborator without these permissions cannot. So to achieve the same thing they can fork the repo i.e. make their own copy and submit pull requests (PR) from there. PRs are best for specific suggested changes. If there is a problem or query around what changes need to be made an issue should be opened in the issue tracker of the project and once plans for specific changes are agreed then a PR can be generated with the proposed changes. PRs can be reviewed and revised before being accepted and merged into the master branch.

This process is generally how ‘peer review’, usually just referred to as ‘code review’, tends to happen in software projects. An issue becomes a proposed set of changes, becomes a specific implementation of those changes, becomes a pull request. Any alterations to the specifics are worked out in the PR before the agreed changes are merged.

In literate programming it is advisable to follow the convention of one sentence per line in the source document when using git. This makes it easier to manage git diffs as git focuses on line-wise, not character-wise, differences. You can get this behavior in RStudio with the options below in the YAML header of an Rmarkdown document, or at the project or global level in the RStudio settings.

---
editor_options:
  markdown:
    wrap: sentence
    canonical: true
---

4.5 Git learning resources

4.5.1 Environment management

When you write data analysis code in a language like R or Python chances are that you are going to be depending on some other packages to do your work. You may have noticed that updates to these packages sometimes break your code. A function that used to exist has been deprecated and is no longer in a package, or the arguments to a function have changed. More worrying still, sometimes such changes won’t stop your code running but will produce an output that is wrong, and not in an obvious fashion. Thus in order to reproduce your analysis exactly we would need not just your code but the versions of the language and the packages that your code depends on. This way it is possible to run your code with the confidence that it is functioning the same way for us as it was for you.

4.5.1.1 Package & Environment management tools

In the R programming language the best package management solution for reproducible environments is {renv}.

{renv} provides renv::install(), which is a replacement for the base install.packages() as well as the BiocManager::install() & remotes::install_github() functions used to install R packages. The renv::snapshot() function is used to create a project-specific manifest file, renv.lock, which documents all the packages used and their versions. renv::restore() can then be used to make the installed packages and their versions match those specified in the lock file. {renv} has a central package cache and uses symbolic links to project libraries to ensure that there is only one copy of a given version of a package installed on your system, improving its performance over previous attempts at project-specific package management in R.
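A minimal sketch of that workflow, run from the project root (the package name is purely illustrative):

# initialize renv for the project (creates a project library and renv.lock)
Rscript -e 'renv::init()'

# install a package into the project library, then record it in the lock file
Rscript -e 'renv::install("dplyr")'
Rscript -e 'renv::snapshot()'

# on another machine (or after cloning), recreate the same library
Rscript -e 'renv::restore()'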

In Python there are two main tools for managing package environments.

venv is a python-specific environment manager for isolated, project-specific python package management and is part of the python standard library. A virtual environment can be created by running python3 -m venv venv/ in your project directory. This command uses the venv module (-m venv) to create a virtual environment called venv (venv/), but the name is arbitrary and a sub-directory with the environment’s name will be created. To use the environment it must be activated with source venv/bin/activate; deactivate exits the environment. The pip package manager can be used as normal within the environment and will only affect the local environment while it is active. To capture a snapshot of the environment from which it could be restored later use pip freeze > requirements.txt. A virtual environment can be restored from a requirements.txt file with pip install -r requirements.txt. This guide to python virtual environments goes into some additional details of how to use venv and how it works. For management of the version of python itself pyenv is a good tool. For the management of python packaging the tool poetry is a good choice for its dependency management.
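Putting those venv commands together as a brief sketch (the package name is just an example):

# create and activate a project-local virtual environment
python3 -m venv venv/
source venv/bin/activate

# install packages as usual; only the active environment is affected
pip install pandas

# record the environment so it can be recreated later
pip freeze > requirements.txt

# elsewhere: recreate the environment from the requirements file
pip install -r requirements.txt

# leave the environment
deactivate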

conda is both a package and environment manager and is language agnostic, however it tends to be used in predominantly python settings. Whilst conda can be used to manage R packages I would not recommend it for a predominantly R project. By default conda does not take the approach of storing the specification of your environment within the project directory, unlike {renv} & venv; I would avoid this default behavior. Keeping the environment specification in the project directory is obviously preferable if you want to be able to share the project along with its environment. This guide to conda projects provides a nice overview of getting started with conda environments using an environment.yml file and this demonstrates how you can set the location of the conda environments to be within a project directory.
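A sketch of a conda workflow that keeps the environment inside the project directory, assuming an environment.yml file already describes your dependencies (the ./env path is arbitrary):

# create the environment in a sub-directory of the project from environment.yml
conda env create -f environment.yml --prefix ./env

# activate it by path rather than by name
conda activate ./env

# after adding packages, re-export the specification
conda env export --prefix ./env > environment.yml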

Beyond language-specific packages, many of the packages in a given language will depend on system libraries in your operating system. For instance an R package which parses XML files might rely on a fast system library written in C, which is used by many other packages in other languages, rather than re-implement its own XML parsing and duplicate the effort. Now a language-specific package or environment management solution will no longer be sufficient alone. Solutions to this problem include more advanced system-level package management tools such as Nix-based package management (see also GNU GUIX). Alternatively, and more popularly at the moment, system dependencies can be managed by the operating system’s package manager and containers can be used to create portable and isolated system environments with different system dependencies. Nix-like package management solves more and different problems than containers and can be used to more reproducibly build container images, but still currently has a bit of an ‘early adopter tax’.

4.5.1.2 Containers

A container provides an isolated, self-contained computing environment similar in practice to that of a virtual machine (VM) whilst not having nearly the same performance deficits associated with virtualization (for the technically inclined, a simplification is that containers share a kernel but provide a different user-land). This lets you package up your code along with all its dependencies and configuration in a standard ‘box’ that can run exactly the same way on essentially any Linux back-end (as well as on mac and windows through what amounts to a wrapper around a linux VM).

The most popular containerization technology is Docker, though others exist (podman & Apptainer/Singularity for example). You specify the environment you want inside a Docker container using a Dockerfile and build a container image which can run things in the environment specified in that Dockerfile. Whilst running something with the exact container image is fully reproducible, building a container image from a specification is not necessarily so. The Dockerfile starts from a ‘base image’, usually of the operating system you’d like to set up your environment in. You might use, for example, ubuntu:latest; the second part of this text, latest, is called a tag. The latest tag obviously depends on what happened to be the latest version when the build command was run, thus you cannot rebuild an identical image to the original one built from this Dockerfile unless you know what version of Ubuntu was the latest at that time. To avoid this ambiguity it is best to specify the version more explicitly, e.g. ubuntu:jammy; jammy is the code name for Ubuntu 22.04, the current (as of writing) LTS (long term support) release of the Ubuntu operating system.
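For example, a minimal Dockerfile pinning the base image, written here via a shell heredoc so it can be pasted as one block, then built into a tagged image (the installed package and image name are purely illustrative):

# write a Dockerfile that pins the base image to a specific release
cat > Dockerfile <<'EOF'
FROM ubuntu:jammy
RUN apt-get update \
    && apt-get install -y --no-install-recommends r-base \
    && rm -rf /var/lib/apt/lists/*
EOF

# build and tag an image from it
docker build -t my-analysis:0.1 .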

4.5.2 Dataset Management

When datasets are small their day-to-day management is often relatively un-complicated. We can just make a copy of our original raw data and work with that in our analysis. As datasets get larger and simply making a copy of them becomes an expensive operation we often have to get a bit more creative with their management.

4.5.2.1 Your Own Raw Data

Raw data is conceptually ‘read-only’; it can be a good idea to make this literal. Keep the raw data for your project in a place where you cannot accidentally modify or delete it. An easy way to do this is to make your raw data files read-only and keep them in a specific location which you can back up with a little extra thoroughness.

Tip

On UNIX like systems you might want to follow a pattern like this:

# A central directory to store all your raw data files
# with subdirectories by project
mkdir -p ~/tank/test-project-data

# Make an example data file
touch ~/tank/test-project-data/data.file

# Change the mode of all the files in `test-project-data`
# and all sub-directories with `-R`
# remove the write permission with `-w`
chmod -R -w ~/tank/test-project-data

# Make a directory in your project folder to link to your raw data
mkdir -p ~/projects/test-project/data
ln -s ~/tank/test-project-data/data.file ~/projects/test-project/data/data.file

# The link at ~/projects/test-project/data/data.file can now be deleted
# The file it is linked to will not be affected
rm ~/projects/test-project/data/data.file

# If you run this you'll find it's still there
ls -l ~/tank/test-project-data/data.file

Need help understanding any of these shell commands? Check out explainshell.com: paste in any shell command to get a breakdown of its component parts and what they mean.

This approach lets you keep your datasets within your project directory without actually having to keep the files there. For example you might have your data directory on a secondary higher capacity storage device than your projects folder.

You may even want to make a dedicated user account who is the only one with write permissions to your raw data files as extra protection against their accidentally being changed. A dedicated account able only to read the raw data files that is used to perform backups is also a potentially sensible strategy.
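One possible sketch of that arrangement on a Linux system, using a dedicated owner for the raw data while everyone else gets read-only access (the account name and path are illustrative, and your institution's policies may dictate a different approach):

# create a dedicated account to own the raw data
sudo useradd --system rawdata

# hand ownership of the raw data directory to that account
sudo chown -R rawdata:rawdata ~/tank/test-project-data

# owner may read and write; everyone else may only read (and traverse directories)
sudo chmod -R u=rwX,go=rX ~/tank/test-project-data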

4.5.2.2 Raw Vs. Processed Data

Raw data is generally data directly from whatever your instrument is. There may be some degree of pre-processing applied by the instrument to its own raw sensor data prior to outputting this pre-processed data to the end user. For example, in DNA sequencing machines base calls are generally made on the machine from the raw sensor output, e.g. the fluorescence intensity, before being output as a fastq file with the call and an indicator of its quality.

Once you have your raw data you process it, yielding (surprise!) processed data. Some of this processed data will be ‘end points’ and other parts may be ‘intermediate data products’. Whether your data is an endpoint or an intermediate product is context dependent. You might for example consider the count matrix from an RNA-seq experiment as an endpoint, as it is a common product of analysis used in further downstream analyses. It’s the sort of data product that is useful to others if you include it when you deposit your data in a public repository. But you might discard the alignments in the form of BAM files as these are very large. BAM files are however computationally expensive to generate, so you might keep them around for the active duration of the project but not archive them.

In theory all processed data should be dispensable if your raw data, analysis code, and computational environment are properly documented. It should be possible to exactly regenerate your results from your raw data and your computational methods.

4.5.2.3 Public Data

When working with data that you did not generate, and thus do not need to ensure the preservation of, you might want to keep it somewhere separate from your own raw data: somewhere outside your own backups where you can cache the data. Always be sure to capture the metadata about how you acquired your copy though: accession numbers, when you did so, and any version numbers available.

Domain specific data repositories may have their own download tools and approaches to locally caching data which you can use.

As we will discuss in Chapter 5, When to Publish Data, it can be a good idea to publish your own data to a public repository before publishing your main analysis. This way you can access your data from the public resource as other researchers would. This is a good practice as it permits you to validate whether your data are indeed FAIR. It shows that you were able to find your own data, refer to it with its unique identifiers, and retrieve it in an appropriate format. This improves the documentation of the provenance of your data, as its shared accessions and metadata annotation are used in the original work, leaving less opportunity for errors of labeling etc. in the data repository.

4.5.2.4 Large files in git repos

Within a project managed by git, large binary files such as images can be a problem, as they will cause a repo to quickly grow to an unmanageable size if they are tracked by git. A solution to this problem, if you want to remain within the git paradigm, is git-lfs (git large file storage), though this approach is not without its drawbacks. Every committed version of a large file is still kept, just on the git-lfs server rather than in everyone’s local repos, where only the needed version is synchronized. When git-lfs is available at a git hosting service it is often, understandably, a paid feature or has limited capacity. It is also a non-trivial effort to configure and host your own git-lfs server.
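Getting started with git-lfs in a repo where it is available looks roughly like this (the *.tiff pattern and the image path are just examples):

# one-time setup of git-lfs for your user account
git lfs install

# track large image files with lfs instead of plain git
git lfs track "*.tiff"

# the tracking rules live in .gitattributes, which is itself versioned
git add .gitattributes
git add images/scan-01.tiff
git commit -m "Add scan via git-lfs"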

The tool data version control (DVC) provides git-compatible, git-lfs-like functionality with different storage back-end options, including consumer cloud storage options like google drive, dropbox, etc. Other alternatives include lakefs, and you can also use ZFS to version data if you are using it directly and not just as a storage back-end behind other abstractions.

4.6 git-lfs learning resources

4.6.0.1 Imaging datasets

Note: I am not an image analysis specialist, so this section would likely benefit from the contributions of someone with more experience in this area.

Working with imaging data can pose a number of substantial practical challenges. Imaging datasets are often quite complex to administer: they are often comprised of many files which need to be structured and accompanied by both experimental design and technical metadata. Many microscopes produce images in proprietary formats which attempt to address some of these organisational issues by bundling together metadata and images from individual planes or channels into single files that are interpretable to their software. Unfortunately these proprietary formats can present issues when trying to use them with software other than that provided by the manufacturer of your imaging equipment or their partners.

Thanks to the Open Microscopy Environment’s (OME) Bio-Formats project and its hard work reverse engineering many of these formats, it is now possible to work with many of them interoperably and with open software tools. There are ongoing efforts to have commercial imaging providers make use of open standards and open up their imaging formats.

As alluded to in Chapter 3 How To Store Your Data imaging data can in some cases be very large. 3 dimensional multi-channel and time course (aka 5D) datasets at high resolution from imaging techniques such as light sheet microscopy can rapidly balloon in size. Extremely high resolution electron microscopy images are another example. When datasets reach multiple terabytes we start running up against the limits of the current generation of readily available computing technology to make use of datasets of this unwieldy size. Fortunately there is much work underway to make this a more manageable problem including the development of next generation file formats OME-NGFF which facilitate parallel processing and the streaming of only needed portions of large datasets to users remotely accessing data from central repository(s) (Moore et al. 2021 [cito:citesAsAuthority] [cito:credits] [cito:agreesWith]).

Organizing your imaging data benefits from software tools which permit you to store, view, annotate, share, search, and programmatically explore your imaging datasets. A simple file and folder directory structure with some standard operating procedures for where to put files and what to call them is slow, manual, cumbersome, and error prone. The Open Microscopy Environment’s OMERO tool is probably the best available software tool to solve your image data organisation woes. It operates a standard client-server approach, with a central server on which the data is indexed and stored which can be accessed by various clients. There is a general web client, additional web-based viewers and figure creation tools, as well as a desktop client to speed up larger image uploads & downloads. The OMERO server can be accessed via an application programming interface (API) which permits you to interact with your data from applications like Fiji, cellprofiler, QuPath, or napari. There are libraries for Python, R, Java, & MATLAB to facilitate using the API in your own custom analysis code.

One of the advantages of deploying an OMERO instance and using it to store and analyse your data is that it is the same software stack which underpins public image databases such as the Image Data Repository (IDR), a highly curated ‘added-value database’ for image datasets that are community resources. You can interact with your own data in the same way you interact with publicly available datasets, and when you make your own data public others can access it the same way you do internally.

Image Data Learning Resources

4.6.1 Pipeline/Workflow Management

4.6.1.1 Pipeline management tools

4.6.1.1.1 Why pipeline/workflow management tools?

One of the significant practical issues addressed by using a pipeline management system for developing a new analysis, or iterating on an existing one, is results caching. If your long, fairly complex pipeline with some slow, computationally expensive steps is just a script that has to be re-run from scratch because you changed how a graph looks, you are not going to run your whole analysis within a single framework. You are, quite sensibly, going to break it up into separate steps. You are however now at risk of ending up with your analysis in an inconsistent state if, for example, you forget to re-run a step downstream of your change. You have introduced semi-manual stepping through each of the separate sections of your analysis to get the final result.

Many pipeline managers are designed to be (mostly) idempotent, that is to say running the same pipeline repeatedly will get you the same result; subsequent runs will not be affected by previous runs, so running it repeatedly is a safe operation in that it won’t affect the outcome. Whilst you can manage to get the same result with an ordinary script, it can be very cumbersome and time consuming to do so. One of the tricks generally employed by pipeline managers to make idempotency practical is caching. If you can cache the computationally expensive parts of an analysis you can feel safe running the pipeline command again. This way you can make a minor downstream modification to a plot safe in the knowledge that you won’t have to wait hours to see the results of your change, as the same pipeline command only runs the steps that need to be run to apply the changes.

This lets you keep your long and complex analyses properly connected together and re-run-able from scratch with a single command but means that you don’t have to re-run the bits you have not changed to ensure everything stays consistent. Despite this it is almost always advisable to re-run any pipeline from scratch with a clean cache once you think you have the final version ready to make sure there are no hitches. Cache invalidation is after all a legendarily hard thing to get consistently right. Because pipeline managers generally understand the dependency relationships between the steps of your analysis it is usually simple to automatically parallelise independent tasks and get better run times.

Pipeline management tools are most advantageous for longer more complex or more computationally expensive analyses, especially those intended to be reused by others. Their design tends to favor workloads which require large batch processing with little to no user interaction needed during a run. So they won’t be applicable for everyone’s use-case.

4.6.1.1.2 Which pipeline tool?

There are a number of language-specific pipeline tools which may be easier to learn if you are already proficient with a particular language, and which benefit from language-specific integrations. R’s {targets} pipeline manager for instance has nice integrations with R’s literate programming tools, which can be useful when writing a pipeline with nicely formatted outputs. In Python there is the snakemake pipeline manager.

A common reason for using a pipeline manager, however, is not writing a new pipeline from scratch but making use of an existing one. A good example of this is the nf-core project, which uses nextflow, a domain specific language (DSL) for pipeline management which excels in portability of pipelines between different systems. nf-core has a number of pre-built pipelines for common bioinformatic analyses which can be used by anyone and make it easy for others to reproduce your analysis. nf-core is an open source project, so anyone can contribute updates, extensions, bug fixes or entirely new pipelines to the project, which may be incorporated into the upstream versions used by the community. If you have a novel analysis method, creating such a community pipeline is one of the best ways to make it easy for other researchers to use your work. (Publications which accompany tools that become popular tend to attract out-sized numbers of citations.)
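Running a community nf-core pipeline, and re-running it using Nextflow's cache, looks roughly like this (a sketch only: the pipeline, revision and parameters are illustrative and vary by pipeline and release):

# run a released nf-core pipeline with a containerised software environment
nextflow run nf-core/rnaseq -r 3.14.0 -profile docker \
    --input samplesheet.csv --outdir results

# after tweaking a downstream step, resume: only affected tasks are re-executed
nextflow run nf-core/rnaseq -r 3.14.0 -profile docker \
    --input samplesheet.csv --outdir results -resume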

A project that is worth being aware of in the workflow management space is the Common Workflow Language (CWL). Many other pipeline management frameworks have at least partial support for importing/exporting CWL, which is very valuable when migrating a pipeline between systems. Whatever pipeline management tool you are working with, Nextflow or otherwise, you can also deposit workflows in WorkflowHub, which supports workflows of any type.
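
For instance, with the cwltool reference runner a CWL workflow can be executed directly (the file names here are placeholders):

```bash
# Sketch: run a CWL workflow with the reference runner, cwltool.
cwltool my-workflow.cwl my-inputs.yml   # workflow description + its input parameters
```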

The Renku platform also has a built-in workflow management system, which takes a slightly different, step-wise approach to constructing pipelines, can export them to CWL, and which we will cover further in Section 4.8.

4.7 Pipeline tools learning resources

4.7.0.0.1 Continuous integration & deployment (CI/CD)

CI/CD are concepts popularised by the software development industry for testing and deploying applications. If you are developing a software package to share your code, learning some of this tooling can be very useful for automating many of the steps involved in testing and distributing software. This same tooling can also be very useful for checking that your analyses are indeed reproducible, and for publishing documentation associated with any workflows that you share.
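
As a sketch of what such a reproducibility check might look like, these are the kinds of commands a CI job could run on every push (the pipeline tool, the paths, and the idea of tracking small summary outputs in git are assumptions, not a prescription):

```bash
#!/usr/bin/env bash
# Hypothetical CI job script: rebuild the analysis from scratch and fail if the
# regenerated outputs differ from the versions tracked in the repository.
set -euo pipefail

snakemake --cores 4 --forceall    # re-run the whole pipeline, ignoring caches
git diff --exit-code -- results/  # non-zero exit (job failure) if results/ changed
```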

This text makes use of a CI/CD pipeline in its publication process. It is built from its markdown source files into a website, epub & pdf on a gitlab CI/CD 'runner' every time I push commits to the remote repository. The static website for this document is only updated if the build completes correctly with no errors. This .gitlab-ci.yml file details the steps taken when building this document from source. (I'm doing some extra things, so building a document like this would not normally be as complicated as this file might make it appear.)

4.8 Renku - Bringing it all together

Keeping your code, data, & compute environment together

Renku (連句 “linked verses”), is a Japanese form of popular collaborative linked verse poetry, written by more than one author working together.

- Wikipedia

In the last few sections we covered a number of powerful, complex and configurable technologies; it may feel a bit overwhelming, as there is a lot to learn and a lot of choices to make. Fortunately, the Renku platform combines many of these technologies and has chosen some sensible defaults to make it simpler to get started using them. If you are already familiar with Jupyter notebooks or RStudio projects, picking up a Renku project is about as easy as picking up either of those, and you get much of the rest (almost) for free. Renku's flexible template system also makes it possible for people with more experience of the platform to set up easy-to-use environments, specialised for particular tasks, for other collaborators on a project.

4.8.1 Why Renku?

Whilst there are other solutions to the reproducible compute problem which make it fairly straightforward to reproduce environments, e.g. Binder, they lack data integrations. There are proprietary cloud-based solutions offering the trifecta of data, code and compute environment in the same place, such as Google's Colab. However, given Google's graveyard of discontinued projects, it may be unwise to depend on them if you want your work to be around and accessible in the medium to long term. There are also Code Ocean & DagsHub, but these too are paid, closed-source solutions, even if many of their internals and integrations are based on open source tooling. Stencila is an ambitious but still early-stage open source project; notably, it has an integration with eLife.

Fundamentally, adopting a proprietary platform as a standard for computational reproducibility is an oxymoron: full transparency is not possible with this approach. I can't verifiably reproduce your analysis if I'm using a black box to do it, or if I'm missing key features needed to create new analyses or interact with their results. Paid services in support of open source tools are the only transparent, ethical and sustainable approach to solving this problem. A project worth watching as a source of publicly funded cloud infrastructure for hosting such open platforms is the European Open Science Cloud (EOSC), though it remains at a relatively early stage of development at the time of writing.

Renku provides all the needed features of the above projects, but is an open-source project developed at the Swiss Data Science Center, based at EPFL and ETH Zurich, which you can host yourself and which has a public instance at renkulab.io. Similar considerations apply to the choice of a reproducible computational analysis platform as to the choice of an electronic lab notebook (Section 4.1), because they have semi-overlapping functions. A platform like Renku can serve as the lab notebook for your more computationally focused researchers.

4.8.2 Getting started with Renku

4.8.2.1 Account setup

You can sign up for an account at renkulab.io at this registration page. ORCID & GitHub are supported as single sign-on providers.

4.8.2.2 Your first project

4.8.2.3 Templates

When you start a new project in Renku you generally do so from a template. Renku has a templating system which permits users to create their own templates for projects. There is a core set of default templates as well as community-contributed ones. I'm also developing some templates for HDBI.

4.8.2.4 Running Renku Sessions Locally

If you have Docker and the renku CLI client installed on your system, you can run an interactive Renku session locally by running renku session start and navigating to the link that it returns in your web browser. The link will look something like this: http://0.0.0.0:49153/?token=998dasdf...

If you are running a Renku session on a local workstation or server with a lot of compute resources, but still want to access this session remotely from your laptop or even your phone, there are a couple of ways of doing this.

If you can ssh (secure shell) into the machine running your container, you can access your session with ssh port forwarding, using a command structured like the following: ssh -nNT -L <local port>:localhost:<remote port> <user>@<host>. Let's say renku session start on my workstation returns http://0.0.0.0:49153/?token=998dasdf. I can run ssh -nNT -L 49153:localhost:49153 me@host on my laptop, where me is my username on my workstation and host is my workstation's IP address or hostname. I can then navigate to http://localhost:49153/?token=998dasdf on my laptop and remotely access the session.

Tailscale will create a secure WireGuard mesh VPN between the clients on which it is installed, so if you can install Tailscale on both your workstation and your laptop you can connect to your workstation irrespective of any firewalls or NAT that would normally block your path. Simply navigate to your workstation's Tailscale IP address and append the port/token for the session, e.g. http://100.10.10.10:49153/?token=998dasdf, where 100.10.10.10 is your workstation's IP address as reported by tailscale status.

4.8.3 Renku Learning Resources


  1. Feature Parity - meaning that the self-hosted version offers all the same features as the paid hosted version, i.e. there are no features locked behind a pay-wall. The alternative model, locking features behind a pay-wall, tends to trend increasingly closed and can lead to tension between the in-house dev team and the community over the implementation of paid features in community versions.↩︎

  2. Jérôme's CNRS page, as his ORCID is a bit sparse↩︎

  3. inkscape-imagej-panel resources:

    ↩︎
  4. OP: Over Powered, a colloquialism originating in the concept of a video game character having excessive abilities which upset the game balance. Now commonly used to refer to characters or items, in fiction or reality, whose abilities are disruptively good.↩︎

  5. Dear Funders, please get together and adopt a blanket policy of refusing to fund the purchase of any scientific equipment which outputs data in a proprietary format, ideally eventually moving on to refusing to fund the purchase of any equipment with proprietary embedded software. This would really save everyone a lot of time and money in the long run. Pretty Please↩︎