KBI0016: Drop local files added in a specific commit#
- authors:
Stephan Heunis <jsheunis@gmail.com>
- discussion:
https://github.com/psychoinformatics-de/knowledge-base/pull/66
- keywords:
datalad status, datalad drop, git diff, scripting
- software-versions:
datalad_0.18.3, git_2.39.1,
In some cases it might be preferable to drop file content from a DataLad dataset in local storage after having pushed this content to a sibling of the dataset. This is particularly useful in order to free up local storage space: since the data is now pushed safely to remote storage, we don’t have to store it locally anymore. However, a likely requirement could be that only specific files should be dropped, for example all files that were added to the dataset by a specific commit, while all other files that are available locally should remain untouched.
This Knowledge Base Item outlines several methods for dropping local files that were added
in a specific commit. These methods differ in the way they identify which files to drop
(via datalad status
, git diff
, or datalad diff
), but the actual dropping of
content is handled by datalad drop
in all cases.
Content#
Note
If you are not interested in details and just looking for the quickest and leanest way to get the job done, skip over to the section: Using datalad diff in a one-liner.
Preparation#
Let’s first create a DataLad dataset with the correct setup to support this demonstration.
datalad create drop-files-test
create(ok): /Users/jsheunis/Documents/psyinf/Data/drop-files-test (dataset)
We can add some content to ensure that prior local content exists:
cd drop-files-test
echo "file 1 content" > file1.txt
datalad save -m "add file1 to the dataset"
add(ok): file1.txt (file)
save(ok): . (dataset)
action summary:
add (ok: 1)
save (ok: 1)
After saving the dataset state, we can verify the specific commits in the git history:
git log
commit 42f197501c3293bc1c0c22e36b1618eec706090e (HEAD -> main)
Author: Stephan Heunis <s.heunis@fz-juelich.de>
Date: Wed May 10 21:50:27 2023 +0200
add file1 to the dataset
commit ba8266ccd88db5ae704e08b5f292c16748952026
Author: Stephan Heunis <s.heunis@fz-juelich.de>
Date: Wed May 10 21:49:18 2023 +0200
[DATALAD] new dataset
Let’s also create and push to a sibling to ensure it exists and can be pushed to:
datalad create-sibling -s my-sibling ../my-sibling
[INFO ] Considering to create a target dataset /Users/jsheunis/Documents/psyinf/Data/drop-files-test at /Users/jsheunis/Documents/psyinf/Data/my-sibling of localhost
[INFO ] Fetching updates for Dataset(/Users/jsheunis/Documents/psyinf/Data/drop-files-test)
update(ok): . (dataset)
[INFO ] Adjusting remote git configuration
[INFO ] Running post-update hooks in all created siblings
create_sibling(ok): /Users/jsheunis/Documents/psyinf/Data/drop-files-test (dataset)
datalad push --to my-sibling
copy(ok): file1.txt (file) [to my-sibling...]
publish(ok): . (dataset) [refs/heads/git-annex->my-sibling:refs/heads/git-annex 08856c6..ccfdb72]
publish(ok): . (dataset) [refs/heads/main->my-sibling:refs/heads/main [new branch]]
action summary:
copy (ok: 1)
publish (ok: 2)
Lastly, let’s create more content in the dataset, this time without saving it (yet):
echo "the quick brown fox" > file2.txt
echo "jumps over the lazy dog" > file3.txt
Using datalad status
#
The first method that gives a view of what changed in the dataset is datalad status,
an analog to git status
. By running this command, we can see which files are in the
untracked
state, which tells us which files we should drop after the push. Here we
show the state of the two files that were added last:
datalad status
untracked: file2.txt (file)
untracked: file3.txt (file)
The drawback of this approach is that it can’t be done after the files have been committed
to git
or git-annex
(i.e. after running datalad save
), because then the files’
state would have changed to clean
, as with any other previously committed files in the
dataset.
In addition to datalad status
, other shell tools can also be used to streamline the
process. Below we use:
jq
to select onlyuntracked
files from thedatalad status
output, and then to extract the file pathsxargs -I{} sh -c
to run a shell command for each line in the output fromjq
echo $(basename $1) >> "files_to_drop.out"
to write the filename from each line above into an output file
datalad -f json status | jq '. | select(.state == "untracked") | .path' | xargs -I{} sh -c 'echo $(basename $1) >> "files_to_drop.out"' -- {}
Since we now have the list of files that we want to drop in an (untracked) file, we can save the dataset and push the files to the remote sibling:
datalad save file2.txt file3.txt -m "save file2 and file3"
add(ok): file2.txt (file)
add(ok): file3.txt (file)
save(ok): . (dataset)
action summary:
add (ok: 2)
save (ok: 1)
datalad push --to my-sibling
copy(ok): file2.txt (file) [to my-sibling...]
copy(ok): file3.txt (file) [to my-sibling...]
publish(ok): . (dataset) [refs/heads/git-annex->my-sibling:refs/heads/git-annex 08856c6..ccfdb72]
publish(ok): . (dataset) [refs/heads/main->my-sibling:refs/heads/main [new branch]]
action summary:
copy (ok: 2)
publish (ok: 2)
Using git diff
#
git diff is a git command that can provide detailed information about the changes between commits, branches, and more. If we know the commit hashes for the states before and after the files were added, we can use this command to inspect the changed files.
By using git log
, we can find the specific commits:
git log
commit 73489f56ecd5eb4dee14c957349f09c0d8b1684d (HEAD -> main, my-sibling/main)
Author: Stephan Heunis <s.heunis@fz-juelich.de>
Date: Wed May 10 22:16:27 2023 +0200
save file2 and file3
commit 42f197501c3293bc1c0c22e36b1618eec706090e
Author: Stephan Heunis <s.heunis@fz-juelich.de>
Date: Wed May 10 21:50:27 2023 +0200
add file1 to the dataset
commit ba8266ccd88db5ae704e08b5f292c16748952026
Author: Stephan Heunis <s.heunis@fz-juelich.de>
Date: Wed May 10 21:49:18 2023 +0200
[DATALAD] new dataset
This means:
the files that we want to drop were added as part of commit
73489f5...
the commit state before adding these files was
42f1975...
Now, we inspect git diff
between the two commits (using ..
), and we specify the
--name-only
flag so that it gives us only the filenames that changed between those
commits (i.e. not everything that changed inside these files):
git diff --name-only 42f197501c3293bc1c0c22e36b1618eec706090e..73489f56ecd5eb4dee14c957349f09c0d8b1684d > files_to_drop.out
file2.txt
file3.txt
Note: since we know that the commit with the added files is also the last commit (i.e. it
corresponds to the current HEAD
), we can also omit the second commit hash.
Let’s write the filenames into an output file:
git diff --name-only 42f197501c3293bc1c0c22e36b1618eec706090e.. > files_to_drop.out
Dropping the files#
Now we can again use some shell tools to streamline the dropping process.
Here we use:
xargs -0 -n 1
to execute a command once per line in the input file<<(tr \\n \\0 <files_to_drop.out)
to supply the input file toxargs
after usingtr
on the file to replace newline character with the\0
character thatxargs
expectsdatalad -f json drop
to drop the file provided by thexargs
code
xargs -0 -n 1 datalad -f json drop <<(tr \\n \\0 <files_to_drop.out)
{"action": "drop", "annexkey": "MD5E-s10--6fe97938d91d6a56a50c14caa5c81e12.txt", "path": "/Users/jsheunis/Documents/psyinf/Data/drop-files-test/file2.txt", "refds": "/Users/jsheunis/Documents/psyinf/Data/drop-files-test", "status": "ok", "type": "file"}
{"action": "drop", "annexkey": "MD5E-s10--6fe97938d91d6a56a50c14caa5c81e12.txt", "path": "/Users/jsheunis/Documents/psyinf/Data/drop-files-test/file3.txt", "refds": "/Users/jsheunis/Documents/psyinf/Data/drop-files-test", "status": "ok", "type": "file"}
Using datalad diff
in a one-liner#
datalad diff provides similar information as git diff
, although with additonial
functionality related to (nested) DataLad datasets.
If you enjoy running one-liners and preventing unnecessary write operations to disk,
this option is for you. Below is a single line of code that uses datalad diff
,
datalad drop
, and standard UNIX tools to identify and drop files related to a
specific commit:
datalad drop $(datalad -f '{state}:{path}' diff -f HEAD~1 -t HEAD | grep '^added:' | cut -d ':' -f 2-)
To explain:
-f '{state}:{path}'
provides an output format template which will be used to format results of thedatalad diff
command. It produces output likeadded::/Users/jsheunis/Documents/psyinf/Data/drop-files-test/file2.txt
.-f HEAD~1 -t HEAD
usesdatalad diff
’s--from
and--to
options to specify the two states that will be compared (here using symbolic names referring to previous and last commit). Full or partial commit shasums can also be used like in previous examples (-f 42f197501c3293bc1c0c22e36b1618eec706090e -t 73489f56ecd5eb4dee14c957349f09c0d8b1684d
)grep
andcut
are standard UNIX tools; here they are used to find lines starting withadded:
, and to extract only the path that is contained in these lines.
This approach could be extended to also cover files that were modified in a specific
commit, by merely amending the grep
part of the command to grep '^modified:'
.
Congrats! You now know multiple ways to drop local files that were added in a specific commit!
Drop limitation#
All of the above examples use a path-based approach to drop
content, although this
has a specific limitation if the relevant file path was removed in an earlier commit.
This means there is no actual file in the worktree, and datalad drop <path-to-file>
would result in an error. To address this, we can let datalad diff
report annex keys
instead of paths, and use git annex drop to drop the content:
datalad -f '{state}:{key}' diff --annex -f HEAD~1 -t HEAD | grep -v '^clean:' | cut -d ':' -f 2- | git annex drop --batch-keys
To explain:
datalad diff
’s--from
and--to
options are used here to find the files that changed during the last commit (-f HEAD~1 -t HEAD
).-f '{state}:{path}'
is used in the same way as beforegrep -v '^clean:'
is used with the invert the matching of lines, i.e. it selects all lines where the state is notclean
cut
is used in the same way as beforegit annex drop --batch-keys
tells git-annex to drop files specified by the incoming annex keys