KBI0022: Performance of mv + datalad save vs git mv + git commit when renaming dataset directories

authors:

Stephan Heunis <jsheunis@gmail.com>

discussion:

https://github.com/psychoinformatics-de/knowledge-base/issues/74

keywords:

performance, time, datalad save, mv, git move, rename

software-versions:

datalad_0.18.4, git-annex_10.20230330-g98a3ba0ea

When renaming a directory that contains many files in a DataLad dataset, a subsequent datalad save may take unexpectedly long. While performance is always relative, it is worth considering git mv followed by git commit instead of a plain mv and datalad save in datasets with large file trees.

Note

The use of mv for renaming and moving dataset content is covered extensively in a dedicated DataLad Handbook chapter: Miscellaneous file system operations. That chapter also compares mv with git mv and comments on when and when not to use either method. Performance, however, is not covered in the handbook and is therefore presented here.

Important: The recommendations here apply solely to directories. git mv operations should not be performed on subdatasets. Instead, stick to a plain mv followed by a datalad save.

A simple performance comparison

The following test was done on a MacBook Pro. Let’s say we start off with a simple dataset with the following structure:

>> tree ../test_dataset

../test_dataset
└── toplevel
    ├── A
    │   ├── one
    │   │   └── a1.txt
    │   └── two
    │       └── a2.txt
    └── B
        ├── one
        │   └── b1.txt
        └── two
            └── b2.txt
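To reproduce this layout, a minimal sketch along the following lines could be used. The helper name make_tree and the file contents are assumptions for illustration; the original test does not state how the files were created or what they contained.

```shell
# Hypothetical helper (not part of DataLad): build the example tree
# toplevel/{A,B}/{one,two}/<file>.txt below the directory given as $1.
make_tree() {
    root="$1"
    for group in A B; do
        # File names use the lower-case group letter: a1.txt, b2.txt, ...
        lower=$(printf '%s' "$group" | tr 'A-Z' 'a-z')
        n=1
        for sub in one two; do
            mkdir -p "$root/toplevel/$group/$sub"
            # The file content is an assumption; any small text file will do.
            printf 'content of %s%s\n' "$lower" "$n" \
                > "$root/toplevel/$group/$sub/$lower$n.txt"
            n=$((n + 1))
        done
    done
}
```

Assuming the dataset itself was created with datalad create test_dataset, running make_tree test_dataset followed by a datalad save inside the dataset should give a comparable starting point.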

To rename the directory toplevel, we can follow one of two options:

mv and datalad save

>> time ( mv toplevel new_toplevel; datalad save )

add(ok): new_toplevel/A/one/a1.txt (file)
add(ok): new_toplevel/A/two/a2.txt (file)
add(ok): new_toplevel/B/one/b1.txt (file)
add(ok): new_toplevel/B/two/b2.txt (file)
save(ok): . (dataset)
action summary:
add (ok: 4)
save (ok: 1)
( mv toplevel new_toplevel; datalad save; )  0.41s user 0.39s system 85% cpu 0.933 total

git mv and git commit

Note

git mv encapsulates a mv operation from the old path to the new path, followed by staging the new path and removing the old path. It is therefore not necessary to run git add on the new path after a git mv; the rename can be committed directly.
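The staging behaviour described in this note can be checked in a throwaway repository. The following is a minimal sketch using plain Git only; the function name demo_git_mv, the demo identity, and the single-file layout are assumptions for illustration.

```shell
# Hypothetical demo: show that a rename done with git mv is already staged.
demo_git_mv() {
    repo=$(mktemp -d)
    (
        cd "$repo" || exit 1
        git init -q .
        # A local identity so that git commit works in a clean environment.
        git config user.email "demo@example.com"
        git config user.name "Demo"
        mkdir -p toplevel/A
        echo "hello" > toplevel/A/a1.txt
        git add toplevel
        git commit -q -m "add file"
        git mv toplevel new_toplevel
        # Staged changes carry a status letter in the first column;
        # untracked files would be listed with '??' instead.
        git status --porcelain
    )
    rm -rf "$repo"
}
```

Calling demo_git_mv should list the new path as a staged change and show no untracked entries, which is why a git commit can follow directly.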

>> time ( git mv toplevel new_toplevel; git commit -m "rename directory" )

[main ee82fde] rename directory
4 files changed, 0 insertions(+), 0 deletions(-)
rename {toplevel => new_toplevel}/A/one/a1.txt (100%)
rename {toplevel => new_toplevel}/A/two/a2.txt (100%)
rename {toplevel => new_toplevel}/B/one/b1.txt (100%)
rename {toplevel => new_toplevel}/B/two/b2.txt (100%)
( git mv toplevel new_toplevel; git commit -m "rename directory"; )  0.03s user 0.05s system 70% cpu 0.117 total

Summary

As the timings show, the mv + datalad save option took about 0.93 seconds in total, while the git mv + git commit option took about 0.12 seconds, which is roughly 8 times faster. While this difference is not substantial for a dataset of four files, it could be an important consideration when renaming paths in datasets with large file trees. Importantly, this comparison is purely about performance and does not consider other aspects that could influence the decision of which renaming method to use.
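To see how the two options compare on a larger tree, the same test can be repeated with more files. The sketch below only generates such a tree; the helper name make_large_tree, the directory and file naming, and the sizes are assumptions, and no timings for larger trees were measured here.

```shell
# Hypothetical helper: create toplevel/dir_<d>/file_<f>.txt below $1,
# with $2 directories and $3 files per directory.
make_large_tree() {
    root="$1"
    n_dirs="$2"
    n_files="$3"
    d=1
    while [ "$d" -le "$n_dirs" ]; do
        mkdir -p "$root/toplevel/dir_$d"
        f=1
        while [ "$f" -le "$n_files" ]; do
            printf 'file %s in dir %s\n' "$f" "$d" \
                > "$root/toplevel/dir_$d/file_$f.txt"
            f=$((f + 1))
        done
        d=$((d + 1))
    done
}
```

For example, make_large_tree . 100 100 inside a fresh dataset creates 10,000 files. After saving them once, the two commands from above, time ( mv toplevel new_toplevel; datalad save ) and time ( git mv toplevel new_toplevel; git commit -m "rename directory" ), can each be timed on an identical copy of the dataset.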