KBI0018: Combining GitHub-like repositories and RIA stores

authors:

Michał Szczepanik <m.szczepanik@fz-juelich.de>

discussion:

https://github.com/psychoinformatics-de/knowledge-base/pull/73

keywords:

RIA, GitHub, configuration, clone, source candidate

software-versions:

datalad_0.18.3

RIA stores are a convenient way of storing DataLad datasets on computing infrastructure: personal computers, servers, or compute clusters. Users may combine a RIA store with an outward-facing GitHub-like repository that serves as a landing page or collaboration hub. In many such configurations, only the top-level dataset is pushed to GitHub, while its subdatasets are kept only in the RIA store.

This document describes how to configure a superdataset so that clones made from GitHub-like repositories can access subdatasets from RIA stores. We focus primarily on the datalad.get.subdataset-source-candidate setting.

Although the KBI targets nested datasets, some of the information also applies to single datasets.

Subdataset source candidates

A subdataset source candidate can be configured in the superdataset:

$ git config -f .datalad/config datalad.get.subdataset-source-candidate-000mypreferredRIAstore "ria+ssh://path/to/store#{id}"

The last part of the option name (000mypreferredRIAstore) combines a three-digit cost with an arbitrary candidate name. Clone candidates are tried in order of increasing cost. Note that there are some default candidates, including the superdataset’s remote URL with the submodule path appended (cost 500) and the submodule URL stored in the .gitmodules file (cost 600). See the Prioritizing subdataset clone locations chapter in the DataLad Handbook for more information.
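
To check which candidates a dataset configures, and in which order they would be considered, the entries can be listed and sorted by their cost prefix. The following is a minimal sketch, not a DataLad feature: the function name list_source_candidates is made up for illustration, it assumes that every candidate name starts with a three-digit cost, and it does not show DataLad's built-in default candidates.

```shell
# List the subdataset source candidates configured in a DataLad config file,
# lowest cost (tried first) on top.
# Hypothetical helper; assumes each candidate name starts with a 3-digit cost.
list_source_candidates() {
    config_file=${1:-.datalad/config}
    # --get-regexp prints "<option name> <value>" for every matching option;
    # git prints the option names in lowercase.
    git config -f "$config_file" \
        --get-regexp '^datalad\.get\.subdataset-source-candidate-' |
        # Strip the common prefix so that each line starts with the cost.
        sed 's/^datalad\.get\.subdataset-source-candidate-//' |
        sort
}
```

Called without arguments from the dataset root, the function reads .datalad/config; passing .git/config as the argument would list the candidates set locally in a clone instead.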

Writing the configuration option to .datalad/config (a repository-specific file which is version controlled and shared with the dataset) ensures that it will be available in dataset clones. Naturally, the option can also be set in a dataset clone afterwards and placed, e.g., in .git/config (a repository-specific file which is not version controlled). See the More on DIY configurations chapter in the DataLad Handbook for more information about configuration files.

Example

Assume the following dataset structure (the datalad tree command is provided by the datalad-next extension):

$ datalad tree
[DS~0] /tmp/foo
├── [DS~1] bar/
└── [DS~1] baz/

Configure and publish

We create the RIA store and siblings for all datasets, and an additional GitHub sibling for the superdataset only (without --recursive):

$ datalad create-sibling-ria --recursive --new-store-ok --name ria-store "ria+ssh://example.com/path/to/store"
$ datalad create-sibling-github --name github <repo-name>

We then set the RIA location as the lowest-cost (and therefore first-tried) subdataset source candidate, and save this configuration file change in the dataset:

$ git config -f .datalad/config datalad.get.subdataset-source-candidate-000-myPreferredRiaStore "ria+ssh://example.com/path/to/store#{id}"
$ datalad save -m "Added source candidate config" .datalad/config

We push all datasets to the RIA store, and the superdataset additionally to GitHub:

$ datalad push --recursive --to ria-store
$ datalad push --to github

Clone and get

Clone from GitHub into another location:

$ cd ..
$ datalad clone <github repo url> <clone target>
$ cd <clone target>

The clone is able to install a subdataset, and it does so from the preferred location:

$ datalad get --no-data bar
[INFO   ] Configure additional publication dependency on "ria-store-storage"
install(ok): /tmp/<repo-name>/bar (dataset) [Installed subdataset in order to get /tmp/<repo-name>/bar]

The subdataset’s origin is the respective location in the RIA store; the *-storage special remote is enabled automatically:

$ datalad siblings -d bar
.: here(+) [git]
.: origin(-) [ssh://example.com/path/to/store/7df/cc05d-b7ba-4b32-b7b0-9f9bb6edcf9d (git)]
.: ria-store-storage(+) [ora]

The top-level dataset also has its *-storage remote enabled automatically, but since it was cloned from GitHub, its origin remote points there:

$ datalad siblings -d .
.: here(+) [git]
.: origin(-) [<github repo url> (git)]
.: ria-store-storage(+) [ora]

Adding RIA git remote manually

The dataset which was cloned directly from GitHub (the superdataset in the example above) has GitHub as its origin. The ria-store-storage sibling (an autoenabled git-annex special remote) is already available, but the git remote (named ria-store in the original dataset) is not. If we want to push the superdataset’s git updates (not just annexed contents) back to the RIA store, we need to configure the git remote.

There are plans to allow adding git remotes other than origin automatically, but no implementation yet.

Although it was created as part of a RIA store, the git remote is no different from any other git remote, and can be added with git remote add. We need to know the store URL and the dataset ID. Since this is a git remote, we cannot use the ria+, #{id} or #~alias notation, and we have to split the ID with a path separator after the first three characters:

$ datalad configuration get datalad.dataset.id
4183e386-1fb7-467c-a508-cea7d6b1f8e6
$ git remote add ria-store "ssh://example.com/path/to/store/418/3e386-1fb7-467c-a508-cea7d6b1f8e6"
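
The splitting rule described above can be sketched as a small shell function. This is only an illustration under the assumptions of this example: the function name ria_git_url is made up, and the store URL is expected without the ria+ prefix.

```shell
# Build the plain git URL of a dataset in a RIA store from the store's base
# URL (without "ria+") and the dataset ID.
# Hypothetical helper, written with POSIX parameter expansion only.
ria_git_url() {
    store_url=${1%/}                  # drop a trailing slash, if any
    dataset_id=$2
    rest=${dataset_id#???}            # ID without its first three characters
    prefix=${dataset_id%"$rest"}      # the first three characters
    printf '%s/%s/%s\n' "$store_url" "$prefix" "$rest"
}

ria_git_url "ssh://example.com/path/to/store" "4183e386-1fb7-467c-a508-cea7d6b1f8e6"
# prints ssh://example.com/path/to/store/418/3e386-1fb7-467c-a508-cea7d6b1f8e6
```

The printed URL can be passed to git remote add in place of the hand-written one.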

If for some reason this step needs to be repeated for all subdatasets (e.g. because they were installed from another source), it should be possible to write a short script that figures out the URL and to run it with datalad foreach-dataset.
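
One possible shape for such a script is sketched below. It is an untested-with-DataLad illustration rather than a recommended implementation: the function name add_ria_remote and its arguments are made up, it assumes that it runs from the root of an installed dataset, and it reads the dataset ID directly from .datalad/config with git config.

```shell
# add_ria_remote STORE_URL [REMOTE_NAME]
# Add a git remote pointing at the current dataset's location in a RIA store.
# Hypothetical helper; must be run from the root of an installed dataset.
add_ria_remote() {
    store_url=${1%/}                  # store URL without "ria+", no trailing slash
    remote_name=${2:-ria-store}
    # The dataset ID is recorded in the version-controlled DataLad config file.
    dataset_id=$(git config -f .datalad/config datalad.dataset.id) || return 1
    rest=${dataset_id#???}            # ID without its first three characters
    prefix=${dataset_id%"$rest"}      # the first three characters
    git remote add "$remote_name" "$store_url/$prefix/$rest"
}
```

How the script is best passed to datalad foreach-dataset, and which working directory the command runs in, should be checked against datalad foreach-dataset --help for the installed version.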

Gitmodules file

Since in our example the subdatasets were created using datalad create (rather than cloned into the superdataset), their URLs in the .gitmodules file only record the local path:

[submodule "bar"]
      path = bar
      url = ./bar
      datalad-id = 7dfcc05d-b7ba-4b32-b7b0-9f9bb6edcf9d

Had the superdataset been cloned from a ria+[http|https|ssh] URL, no source candidate configuration would be necessary, as DataLad would (by default) use the combination of the superdataset’s origin and the local path as one of the source candidates. Naturally, this does not work when the superdataset is on GitHub and the subdatasets are not.

However, the .gitmodules file can be edited to contain the “right” URL, as it is also one of the default source candidates:

[submodule "bar"]
      path = bar
      url = ssh://example.com/path/to/store/7df/cc05d-b7ba-4b32-b7b0-9f9bb6edcf9d
      datalad-id = 7dfcc05d-b7ba-4b32-b7b0-9f9bb6edcf9d
      datalad-url = "ria+ssh://example.com/path/to/store#7dfcc05d-b7ba-4b32-b7b0-9f9bb6edcf9d"
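
Instead of editing the file by hand, the same entries can be written with git config, which can read and write .gitmodules like any other git configuration file. The commands below are a sketch using the example values from this document and are meant to be run from the root of the superdataset:

```shell
# Point the "bar" submodule at its location in the RIA store.
# Values are the example store URL and dataset ID used in this document.
git config -f .gitmodules submodule.bar.url \
    "ssh://example.com/path/to/store/7df/cc05d-b7ba-4b32-b7b0-9f9bb6edcf9d"
git config -f .gitmodules submodule.bar.datalad-url \
    "ria+ssh://example.com/path/to/store#7dfcc05d-b7ba-4b32-b7b0-9f9bb6edcf9d"

# Show the resulting entries for the submodule.
git config -f .gitmodules --get-regexp '^submodule\.bar\.'
```

The modified .gitmodules file still has to be saved in the superdataset (for example with datalad save) and pushed to GitHub before new clones can pick up the change.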