KBI0003: Capturing interactive computations in Jupyter Notebooks¶

authors:: Adina Wagner <a.wagner@fz-juelich.de>
discussion:: https://github.com/psychoinformatics-de/knowledge-base/pull/17
keywords:: datalad run, run, unlock

This knowledge-base item discusses how wrap interactive computing with Jupyter Notebooks by invoking them from the terminal, wrapped in a datalad run call.

Overview¶

Jupyter notebooks offer an interactive computing environment. Just as any other files, they can be a part of a DataLad dataset, and their code can use or modify files inside of DataLad datasets.

More so than computational scripts, they incentivize interactive computations. This bears at least two difficulties:

modifying annexed files on the fly can cause permission errors
keeping track of changes during interactive computing

Wrapping the entire execution of the notebook session into a datalad run command can alleviate those difficulties. It can help to capture all changes in a session as long as the user shuts down the Jupyter server orderly via the “Quit” button, and it can unlock relevant files at the start if the user adds an --output specification to it.

If the notebook is in a state where it can be ran from start to end (i.e., no manual step-by-step execution of individual code cells), the entire notebook can be run at once using the command jupyter run <notebook>. In this execution mode, the jupyter run <notebook> call can be wrapped in a datalad run like this:

❱ datalad run \
  -m "running my notebook" \
  --output <path/to/file/getting/modified> \
  "jupyter run <my-notebook>"

If the computation involves manual execution of certain cells, and the jupyter server is ran for interactive computations, the entire session can be wrapped as follows:

Consider a dataset with an annexed file (output.file) that will be modified in a notebook session:

❱ datalad create mynotebookenv
create(ok): /tmp/mynotebookenv (dataset)
❱ cd mynotebookenv
❱ echo 123456 > output.file
❱ datalad save -m "annexed something"
add(ok): output.file (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

Wrapping a jupyter-notebook command (or a more specific jupyter-notebook <path-to-notebook> into datalad run with an --output declaration can capture any changes, and allows modifying the annexed file:

❱ datalad run \
 -m "running jupyter notebook" \
 --output output.file "jupyter-notebook"
unlock(ok): output.file (file)
...
[Notebook logmessages]
...
[INFO   ] == Command exit (modification check follows) =====
run(ok): /tmp/mynotebookenv (dataset) [jupyter-notebook]
add(ok): Untitled.ipynb (file)
add(ok): output.file (file)
save(ok): . (dataset)

This process will also work if the data to be unlocked or the Notebook invoked are in different levels of a dataset hierarchy as long as the paths to --input or --output declarations to not point upwards - in other words, as long as the run command is executed from a same- or top-level dataset.

Here is an example with a subdataset that contains one annexed file:

# create a dataset hierarchy, and some content
❱ datalad create super && \
cd super && \
datalad create -d sub && \
echo 1234 > output.file && \
datalad save -m "annex something"
create(ok): /tmp/super (dataset)
create(ok): . (dataset)
add(ok): sub (dataset)
add(ok): .gitmodules (file)
add(ok): output.file (file)
save(ok): . (dataset)
action summary:
  add (ok: 3)
  save (ok: 1)

We can modify content in the subdataset as long as the command is run from the a dataset higher in the dataset hierarchy:

❱ datalad run \
-m "running jupyter notebook to modify subdataset content" \
--output sub/output.file \
"jupyter-notebook Untitled.ipynb"
unlock(ok): sub/output.file (file)
[INFO   ] == Command start (output follows) =====

[Notebook log output]

[INFO   ] == Command exit (modification check follows) =====
run(ok): /tmp/super (dataset) [jupyter-notebook Untitled.ipynb]
add(ok): output.file (file)
save(ok): sub (dataset)
add(ok): sub (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)

It would not work if the --output specification points outside of the dataset:

❱ datalad create super && \
cd super && \
datalad create -d sub && \
echo 1234 > output.file && \
datalad save -m "annex something"
 create(ok): /tmp/super (dataset)
 create(ok): . (dataset)
 add(ok): sub (dataset)
 add(ok): .gitmodules (file)
 add(ok): output.file (file)
 save(ok): . (dataset)
 action summary:
   add (ok: 3)
   save (ok: 1)
❱ tree
.
├── output.file -> .git/annex/objects/kj/05/MD5E-s5--e7df7cd2ca07f4f1ab415d457a6e1c13/MD5E-s5--e7df7cd2ca07f4f1ab415d457a6e1c13
└── sub

❱ cd sub
❱ datalad run \
 -m "running jupyter notebook from subdataset" \
 --output ../output.file \
 "jupyter-notebook"
get(error): .. [path not associated with dataset Dataset(/tmp/super/sub)]