KBI0017: How to delete commits including their annexed data

authors:

Christian Mönch <christian.moench@web.de>

discussion:

https://github.com/psychoinformatics-de/knowledge-base/pull/69

keywords:

delete commits, delete annexed data

software-versions:

datalad_0.8.13, git-annex_10.20230407

This knowledge base item describes how to delete commits and the annexed data that was added to the annex when the commits were created.

The initial state and the goal

There is a datalad dataset with an annex and a number of commits. Some of those commits added files to the annex, and some might just have added files to git. We assume that the last commit has added a number of files to the annex. The dataset has not yet been pushed to a sibling.

The goal is to remove the last commit and all annexed files that were added when the commit was performed. (However, the same process can be applied for any specific commit, not only the last. See the note below.)

Principal strategy

To remove a commit and all annexed files, two phases have to be executed. First, the commit itself has to be removed. Second, the annexed files that were added in the commit, and moved into the annex object store, have to be removed.

There are a few ways to delete commits from git. We propose to use an interactive git rebase.

Note

Technically, git reset --hard could also be used in this instance to reset the current branch tip to the commit before the last one. However, suppose that the commit we want to remove is not the last one, and that other commits that we consider useful were added after it. We would not want to remove those other commits as well, which is exactly what git reset --hard would do if it were applied to the commit preceding the to-be-removed commit. In this case, an interactive git rebase provides much more flexibility to pick and drop specific commits, as demonstrated below.

Let’s assume the git-log looks like this:

255c4fd (HEAD -> master) add two more files
9bb8847 add two initial files
2dd6618 [DATALAD] new dataset

To remove the last commit, start an interactive rebase onto the commit that precedes it, i.e. 9bb8847. To do that, execute the command:

git rebase -i 9bb8847

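During the interactive rebase, drop commit 255c4fd: in the todo list that git opens in the editor, change the word pick in front of that commit to drop (or delete the line), then save and close the editor. (It is possible to drop multiple commits in a single interactive git rebase by choosing an earlier commit as the rebase target.)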
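The same kind of removal can also be scripted without an editor. The following is a minimal sketch, not part of the original procedure: it assumes a throwaway repository with made-up file names and uses plain git only (no annex), and it drops a commit in the middle of the history with git rebase --onto while keeping the commit that follows it.

```shell
#!/bin/sh
set -eu

# Build a throwaway repository with three commits (plain git, no annex).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

for n in 1 2 3; do
    echo "content $n" > "file_$n.txt"
    git add "file_$n.txt"
    git commit -q -m "add file_$n.txt"
done

# Replay everything after HEAD~1 (i.e. only the last commit) onto HEAD~2.
# This removes the middle commit, like marking it "drop" in `git rebase -i HEAD~2`.
git rebase -q --onto HEAD~2 HEAD~1

# Show the remaining commit subjects, newest first.
git log --format=%s
```

After the rebase, the log lists only the first and the third commit, and file_2.txt is no longer in the working tree. In an annexed dataset the content of the dropped files would still be in the annex object store at this point.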

Now the links to the annexed content are removed, but the content itself is still in the annex. To remove it, first list the annexed content that is no longer referenced with the command git annex unused:

git annex unused

That will display a number of annexed files that are not referenced anymore from git. For example:

 unused . (checking for unused data...) (checking master...)
   Some annexed data is no longer used by any files:
     NUMBER  KEY
     1       MD5E-s9--c31ea01ca12b5558b6503a8143cdb98c.txt
     2       MD5E-s11--1d6f4be608158f767aafd1bab92043a7.txt
(To see where this data was previously used, run: git annex whereused --historical --unused

The result provides the numbers that can be used to drop the annexed data, here 1 and 2. It does not tell us much about the unused data itself. It is important to note that not all unused data necessarily stems from the deleted commit. There might be other historical data that is also no longer referenced, but still useful. Git-annex provides a command to examine the unused data more closely: git annex whereused --historical --unused:

> git annex whereused --historical --unused
MD5E-s9--c31ea01ca12b5558b6503a8143cdb98c.txt HEAD@{2}:./file_1.txt
MD5E-s11--1d6f4be608158f767aafd1bab92043a7.txt HEAD@{2}:./file_2.txt

The output can be used to determine two things. First, which file name is associated with a key. Second, which commits created the data objects that are now unused. The entry HEAD@{2} refers to an entry in git’s reflog. It can be examined with the command git reflog:

> git reflog
0fbb9e2 (HEAD -> master) HEAD@{0}: commit: add more files
6e56f3d HEAD@{1}: reset: moving to 6e56f3d2c628476d8af0c2d1d14f3e4b560f017f
5a498f9 HEAD@{2}: commit: save files for subject 1 and subject 2
...

The commit message (here: save files for subject 1 and subject 2) and the hash of the commit (here: 5a498f9) make it possible to investigate in more detail what data the data objects contain.
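As an illustration of that investigation, the sketch below looks up a removed commit through the reflog and lists the files it touched. It is an assumption-laden example: the repository, file names and commit messages are invented for the demonstration, and it uses plain git only.

```shell
#!/bin/sh
set -eu

# Throwaway repository with two commits (hypothetical file names).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

for n in 1 2; do
    echo "content $n" > "file_$n.txt"
    git add "file_$n.txt"
    git commit -q -m "add file_$n.txt"
done

# Remove the last commit from the branch.
git reset -q --hard HEAD~1

# HEAD@{0} is now the reset itself; HEAD@{1} is the commit that was removed.
removed=$(git rev-parse --short 'HEAD@{1}')

# Print the hash, the subject, and the files changed by the removed commit.
git show --stat --format='%h %s' "$removed"
```

In a real dataset, the reflog entry reported by git annex whereused (such as HEAD@{2} above) would take the place of HEAD@{1}.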

Another option to determine whether the data object is still of value is of course to examine the data object itself.

Once a file has been identified as really unused, it can be dropped with the command git annex dropunused:

> git annex dropunused 1
> git annex dropunused 2
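When the list of unused keys is long, it can help to reduce the git annex unused output to plain number and key pairs before deciding what to drop. The following sketch assumes the table layout shown above; the helper name list_unused is hypothetical, and the sample text is copied from the example output so that the snippet runs without git-annex.

```shell
#!/bin/sh
set -eu

# Sample text copied from the `git annex unused` output shown earlier.
unused_output='unused . (checking for unused data...) (checking master...)
  Some annexed data is no longer used by any files:
    NUMBER  KEY
    1       MD5E-s9--c31ea01ca12b5558b6503a8143cdb98c.txt
    2       MD5E-s11--1d6f4be608158f767aafd1bab92043a7.txt'

# Hypothetical helper: keep only table rows, i.e. lines with exactly two
# fields where the first field is a number; print "NUMBER KEY".
list_unused() {
    awk 'NF == 2 && $1 ~ /^[0-9]+$/ { print $1, $2 }'
}

printf '%s\n' "$unused_output" | list_unused
```

In a dataset, the helper would be fed directly, e.g. git annex unused | list_unused. Once every listed key has been checked, the numbers can be passed to git annex dropunused, which also accepts ranges such as 1-2.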

Words of warning

Despite the described ways to assess the value of an unused file, there is still a risk of deleting an “unused” data object that was actually still of value.

BE CAREFUL!

One useful approach is to keep a branch or tag on everything that is important; git annex unused then considers those objects necessary. git annex unused is quite powerful, and we recommend studying its help.
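The tagging approach could look like the sketch below. It only demonstrates the git side, under the assumption of a throwaway repository with invented file names and a hypothetical tag name keep-file-2: a commit that is removed from the branch stays reachable through the tag. In an annexed dataset, this reachability is what keeps the corresponding keys out of the git annex unused list.

```shell
#!/bin/sh
set -eu

# Throwaway repository with two commits (plain git, no annex).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

for n in 1 2; do
    echo "content $n" > "file_$n.txt"
    git add "file_$n.txt"
    git commit -q -m "add file_$n.txt"
done

# Tag the commit whose content should stay referenced (hypothetical tag name).
git tag keep-file-2 HEAD

# Remove the commit from the branch; the tag still points at it.
git reset -q --hard HEAD~1

# The file is gone from the branch but still listed in the tagged tree.
git ls-tree --name-only keep-file-2
```

Deleting the tag later (git tag -d keep-file-2) makes the commit unreferenced again, so the protection lasts only as long as the tag exists.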