KBI0004: Why do files move to .git/annex/objects when converting an existing directory into a DataLad dataset?¶
- authors:
Stephan Heunis <jsheunis@gmail.com>, Christian Mönch <christian.moench@web.de>
- discussion:
https://github.com/psychoinformatics-de/knowledge-base/pull/9
- keywords:
datalad faq, git-annex faq, .git/annex/object, symlink, link
This knowledge-base item explains why files are moved into .git/annex/objects
, when
a DataLad dataset is created from an existing directory, or when a dataset with new files
is saved.
Question:¶
On my Mac or Linux machine, when I convert an existing folder to a DataLad dataset, all files are moved to ./.git/annex/objects/
and
the file at the original location becomes a link to the moved content in ./.git/annex/objects
. Is this
normal?
Answer:¶
Yes, this is normal. DataLad manages your data with two main tools: git
and git-annex
. The moving and linking is
used by git-annex
to enable git
to work well with very large files. By applying moving and linking, git-annex
ensures that git
only needs to track the links, which are quite small in size. The task of transporting and handling
the data content is performed by git-annex
.
This setup creates a modular and portable dataset (the git
repository) which contains information about the versions
and history of data inside the dataset, while the actual data content is managed by git-annex
. File content that is
placed under management of git-annex
will be moved into ./.git/annex/objects/
and a symbolic link (symlink) to
the content will remain in the original path. This symbolic link will be tracked by git
.
The separation of version management (done by git
) and content management (done by git-annex
)) make a DataLad
dataset very flexible. You can for example share the dataset (git
repository) publicly, while keeping the contents
safe elsewhere. People can then access the dataset (with datalad clone
) and download individual files in the
dataset (with datalad get
) if they have access credentials for that particular storage location.
However, you can use configurations to specify how DataLad should commit/manage your data. You might want to commit all
your files to git
(unless they are large, too numerous, or unless you don’t want to make them available to
everybody who clones the git
repository). You could also let every file be managed by git-annex
.
The DataLad Handbook has very useful information on applying standard or custom configurations to your datasets: