KBI0024: Create and update a 7-Zip archive in a RIA store

authors: Laura Waite <laura@waite.eu>, Michael Hanke <michael.hanke@gmail.com>

discussion: https://github.com/psychoinformatics-de/knowledge-base/issues/47

keywords: RIA, archive, 7-Zip, inodes

software-versions: datalad_0.18.3

When working with RIA stores, the content under annex/objects can be packed into a 7-Zip archive. With this approach, the entire annex object store lives in a single archive file and remains fully accessible, while inode usage stays minimal regardless of the number and size of the files. This is beneficial for compression gains or when operating on HPC systems with inode limitations.

This document describes how to create a 7-Zip archive for an existing RIA store using the script shown below. The script does the following:

  1. Remove non-essential files and directories within the RIA store (e.g. hooks). This step is unrelated to creating the archive, but it helps reduce the number of inodes.

  2. Create a 7-Zip archive containing the content in the annex object store.

  3. Clean up (remove) the content in the annex object store after it is archived.

The script can also be used to update an archive if it already exists, as it uses the update flag when calling 7-Zip (7z u).

Preparation

Populate a DataLad dataset:

$ datalad create my_dataset
$ cd my_dataset
$ echo "file 1 content" > file1.txt
$ datalad save -m "add file1 to the dataset"

Create a RIA store:

$ datalad create-sibling-ria -d ./ -s ria-store --new-store-ok "ria+file:///tmp/my_store"
$ datalad push --to ria-store
$ cd ../

Create an archive

First, take a look at the state of the RIA store. Content exists under annex/objects:

$ tree my_store
my_store
├── 1e9
│   └── 91f14-baff-4565-8b38-fceed63bb805
│       ├── annex
│       │   └── objects
│       │       └── v2
│       │           └── 2W
│       │               └── MD5E-s15--af1cdf0b10caa12cf13312f7bb4215df.txt
│       │                   └── MD5E-s15--af1cdf0b10caa12cf13312f7bb4215df.txt
│       ├── archives
│       ├── branches
│       ├── config
│       ├── config.dataladlock
│       ├── description
│       ├── HEAD
│       ├── hooks
│       │   ├── applypatch-msg.sample
│       │   ├── commit-msg.sample
│       │   ├── fsmonitor-watchman.sample
│       │   ├── post-update.sample
│       │   ├── pre-applypatch.sample
│       │   ├── pre-commit.sample
│       │   ├── pre-merge-commit.sample
│       │   ├── prepare-commit-msg.sample
│       │   ├── pre-push.sample
│       │   ├── pre-rebase.sample
│       │   ├── pre-receive.sample
│       │   ├── push-to-checkout.sample
│       │   └── update.sample
│       ├── info
│       │   └── exclude
│       ├── objects
│       │   ├── 08
│       │   │   └── e3ea145c77abd0cf9cb07f04f069efed2bd637
│       │   ├── 0d
│       │   │   └── 81baa2295544cae79101a18f6473a6c917b927
│       │   ├── [...]
│       │   │   └── [...]
│       │   ├── info
│       │   └── pack
│       ├── ora-remote-e9bef249-aeea-46b6-b9f3-f8e0c10c1931
│       │   └── transfer
│       ├── refs
│       │   ├── heads
│       │   │   ├── git-annex
│       │   │   └── master
│       │   └── tags
│       └── ria-layout-version
├── error_logs
└── ria-layout-version
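
To make the effect on inode usage measurable, it can help to record how many file system entries the store occupies before archiving. The following is a minimal sketch, assuming the store is reachable as `my_store` from the current directory, as in the `tree` call above. The `count_entries` helper is a hypothetical name and not part of DataLad.

```shell
#!/bin/bash
# count_entries: hypothetical helper, not part of DataLad.
# Prints the number of files, directories, and symlinks at and below a
# path. Each entry uses roughly one inode (hard links share one).
# Assumption: no file name contains a newline.
count_entries() {
    find "$1" | wc -l | tr -d ' '
}

# Example call, only run if the store from this walkthrough exists.
if [ -d my_store ]; then
    echo "entries in my_store: $(count_entries my_store)"
fi
```

Running the same call again after `cleanup.sh` has finished gives a number to compare against.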

Create the archive using the cleanup.sh script shown below:

$ find my_store -mindepth 2 -maxdepth 2 -type d | xargs -n1 bash cleanup.sh

As a result, the store should look like this:

$ tree my_store
my_store
├── 1e9
│   └── 91f14-baff-4565-8b38-fceed63bb805
│       ├── archives
│       │   └── archive.7z
│       ├── branches
│       ├── config
│       ├── config.dataladlock
│       ├── description
│       ├── HEAD
│       ├── objects
│       │   ├── 08
│       │   │   └── e3ea145c77abd0cf9cb07f04f069efed2bd637
│       │   ├── 0d
│       │   │   └── 81baa2295544cae79101a18f6473a6c917b927
│       │   ├── [...]
│       │   │   └── [...]
│       │   ├── info
│       │   └── pack
│       ├── refs
│       │   ├── heads
│       │   │   ├── git-annex
│       │   │   └── master
│       │   └── tags
│       └── ria-layout-version
├── error_logs
└── ria-layout-version
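
The `find` call used above relies on the store layout: each dataset directory sits two levels below the store root (here `1e9/91f14-baff-4565-8b38-fceed63bb805`), so `-mindepth 2 -maxdepth 2 -type d` yields one path per dataset. Because `cleanup.sh` deletes files, it may be worth previewing that selection on a store with many datasets. A sketch under these assumptions follows; `list_store_datasets` is a hypothetical helper, and its `config` check simply mirrors the repository test in `cleanup.sh`.

```shell
#!/bin/bash
# list_store_datasets: hypothetical helper, not part of DataLad.
# Prints every directory two levels below the store root that contains
# a `config` file, i.e. the paths cleanup.sh would accept as repositories.
list_store_datasets() {
    find "$1" -mindepth 2 -maxdepth 2 -type d | sort | while IFS= read -r ds; do
        if [ -f "$ds/config" ]; then
            printf '%s\n' "$ds"
        fi
    done
}

# Example call, only run if the store from this walkthrough exists.
if [ -d my_store ]; then
    list_store_datasets my_store
fi
```

If the path to the store root may contain whitespace, `find ... -print0 | xargs -0 -n1 bash cleanup.sh` is a more robust way to pass the directories on.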

Update an archive

The same script (cleanup.sh) can be used to update an already existing archive within a RIA store.

Make a fresh clone from the RIA store:

$ datalad clone "ria+file:///tmp/my_store#1e991f14-baff-4565-8b38-fceed63bb805" my_clone
$ cd my_clone

Add another file to the dataset:

$ echo "file 2 content" > file2.txt
$ datalad save -m "add file2 to the dataset"
$ datalad push --to origin
$ cd ../

Take a look at the state of the store. Since we added a new file, there is again content under annex/objects:

$ tree my_store
my_store
├── 1e9
│   └── 91f14-baff-4565-8b38-fceed63bb805
│       ├── annex
│       │   └── objects
│       │       └── Pf
│       │           └── vq
│       │               └── MD5E-s15--7a593f3460f1efc629489d5a9e86c7b0.txt
│       │                   └── MD5E-s15--7a593f3460f1efc629489d5a9e86c7b0.txt
│       ├── archives
│       │   └── archive.7z
│       ├── branches
│       ├── config
│       ├── config.dataladlock
│       ├── description
│       ├── HEAD
│       ├── objects
│       │   ├── 08
│       │   │   └── e3ea145c77abd0cf9cb07f04f069efed2bd637
│       │   ├── 0d
│       │   │   └── 81baa2295544cae79101a18f6473a6c917b927
│       │   ├── [...]
│       │   │   └── [...]
│       │   ├── info
│       │   └── pack
│       ├── ora-remote-5a413a03-91cb-4433-a2b5-e2d108ec291b
│       │   └── transfer
│       ├── refs
│       │   ├── heads
│       │   │   ├── git-annex
│       │   │   └── master
│       │   └── tags
│       └── ria-layout-version
├── error_logs
└── ria-layout-version

Run the cleanup.sh script again to update the archive with the new objects:

$ find my_store -mindepth 2 -maxdepth 2 -type d | xargs -n1 bash cleanup.sh
$ tree my_store
my_store
├── 1e9
│   └── 91f14-baff-4565-8b38-fceed63bb805
│       ├── archives
│       │   └── archive.7z
│       ├── branches
│       ├── config
│       ├── config.dataladlock
│       ├── description
│       ├── HEAD
│       ├── objects
│       │   ├── 08
│       │   │   └── e3ea145c77abd0cf9cb07f04f069efed2bd637
│       │   ├── 0d
│       │   │   └── 81baa2295544cae79101a18f6473a6c917b927
│       │   ├── [...]
│       │   │   └── [...]
│       │   ├── info
│       │   └── pack
│       ├── refs
│       │   ├── heads
│       │   │   ├── git-annex
│       │   │   └── master
│       │   └── tags
│       └── ria-layout-version
├── error_logs
└── ria-layout-version
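
A direct way to check the archive itself is to list its contents with `7z l`. After the update, the listing should include the keys of both files. This is a sketch, assuming `7z` is on the PATH and the store is reachable as `my_store`; `list_archive_keys` is a hypothetical helper and not part of DataLad.

```shell
#!/bin/bash
# list_archive_keys: hypothetical helper, not part of DataLad.
# Prints the lines of a 7z listing that mention MD5E annex keys.
list_archive_keys() {
    local archive="$1"
    if [ ! -f "$archive" ]; then
        echo "no archive at $archive" >&2
        return 1
    fi
    if ! command -v 7z >/dev/null 2>&1; then
        echo "7z not found on PATH" >&2
        return 1
    fi
    7z l "$archive" | grep 'MD5E-'
}

# Archive location taken from the tree output in this walkthrough.
store_archive=my_store/1e9/91f14-baff-4565-8b38-fceed63bb805/archives/archive.7z
list_archive_keys "$store_archive" || true
```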

Let’s verify that the archive was updated successfully with the new content, using the dataset we started with:

$ cd my_dataset

This dataset only has one file (file1.txt):

$ tree
.
└── file1.txt -> .git/annex/objects/v2/2W/MD5E-s15--af1cdf0b10caa12cf13312f7bb4215df.txt/MD5E-s15--af1cdf0b10caa12cf13312f7bb4215df.txt

Run datalad update to bring in the updates from the RIA store (i.e. file2.txt):

$ datalad update --merge
$ tree
.
├── file1.txt -> .git/annex/objects/v2/2W/MD5E-s15--af1cdf0b10caa12cf13312f7bb4215df.txt/MD5E-s15--af1cdf0b10caa12cf13312f7bb4215df.txt
└── file2.txt -> .git/annex/objects/Pf/vq/MD5E-s15--7a593f3460f1efc629489d5a9e86c7b0.txt/MD5E-s15--7a593f3460f1efc629489d5a9e86c7b0.txt

cleanup.sh
#!/bin/bash
#
# Michael Hanke 2020

set -e -u

echo "Processing $1"

# enter the dataset directory inside the RIA store
cd "$1"
ds_path="$(readlink -f .)"

test -f config || ( echo "not a repository: $1" && exit 1 )

# remove non-essential files and directories to save inodes
rm -f info/exclude
rm -f hooks/*
test -d info && rmdir --ignore-fail-on-non-empty info
test -d hooks && rmdir --ignore-fail-on-non-empty hooks
rm -rf annex/journal
rm -f annex/index
rm -f annex/index.lck
rm -f annex/journal.lck
rm -f annex/othertmp.lck
test -d annex/othertmp && rmdir annex/othertmp
test -d ora-remote-*/transfer && rmdir --ignore-fail-on-non-empty ora-remote-*/transfer
test -d ora-remote-* && rmdir --ignore-fail-on-non-empty ora-remote-*

# uncompressed archive by default
sevenzopts=${HP_ZIPOPTS:--mx0}

objpath="$(readlink -f annex/objects)"
archivepath="$(readlink -f archives)"

if [ ! -d "$objpath" ]; then
        >&2 echo "No annex objects. Done."
        exit 0
fi

if [ ! -d "$archivepath" ]; then
        mkdir -p "$archivepath"
        # only chown when freshly created to not destroy potential
        # custom permission setup
        # whoever owns the object store, owns the archives
        chown -R --reference "$objpath" "$archivepath"
fi

# move the object store aside so that new pushes do not interfere
mv "$objpath" "$objpath"_
objpath="$objpath"_

cd "$objpath"
# always update, also works from scratch
# $sevenzopts is intentionally unquoted so that multiple options are split
7z u "$archivepath/archive.7z" . $sevenzopts
chown -R --reference "$objpath" "$archivepath"/archive.7z
cd -

# remove the archived objects and the annex directory if it is now empty
rm -rf "$objpath"
rmdir --ignore-fail-on-non-empty "$ds_path/annex"
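
Note that the script builds an uncompressed archive unless told otherwise: the 7-Zip options are read from the `HP_ZIPOPTS` environment variable and fall back to `-mx0`. To obtain compression gains, the variable has to be set when the script is called. The sketch below isolates that default expansion; `resolve_7z_opts` is a hypothetical helper for illustration, and `-mx5` is just one example compression level.

```shell
#!/bin/bash
# resolve_7z_opts: hypothetical helper mirroring the expansion in cleanup.sh.
# Prints the options that would be passed to 7z.
resolve_7z_opts() {
    printf '%s\n' "${HP_ZIPOPTS:--mx0}"
}

unset HP_ZIPOPTS
resolve_7z_opts                  # variable unset or empty: falls back to -mx0

HP_ZIPOPTS=-mx5
resolve_7z_opts                  # variable set: its value is used instead
```

Applied to the store from this document, a call could look like `find my_store -mindepth 2 -maxdepth 2 -type d | HP_ZIPOPTS=-mx5 xargs -n1 bash cleanup.sh`, since the environment of `xargs` is inherited by the script.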