What is a good strategy for keeping IPython notebooks under version control?
The notebook format is quite amenable to version control: if one wants to version control the notebook and the outputs, then this works quite well. The annoyance comes when one wants to version control only the input, excluding the cell outputs (a.k.a. "build products"), which can be large binary blobs, especially for movies and plots. In particular, I am trying to find a good workflow that:
As mentioned, if I choose to include the outputs (which is desirable when using nbviewer, for example), then everything is fine. The problem is when I do not want to version control the output. There are some tools and scripts for stripping the output of the notebook, but I frequently encounter the following issues:
[...] Cell/All Output/Clear menu option, thereby creating unwanted noise in the diffs. This is resolved by some of the answers.

I have considered several options that I shall discuss below, but have yet to find a good comprehensive solution. A full solution might require some changes to IPython, or may rely on some simple external scripts. I currently use Mercurial, but would like a solution that also works with git: an ideal solution would be version-control agnostic.
This issue has been discussed many times, but there is no definitive or clear solution from the user's perspective. The answer to this question should provide the definitive strategy. It is fine if it requires a recent (even development) version of IPython or an easily installed extension.
Update: I have been playing with my modified notebook version, which optionally saves a .clean version with every save, using Gregory Crosswhite's suggestions. This satisfies most of my constraints but leaves the following unresolved:
[...] .clean file, and then need to be integrated somehow into my working version. (Of course, I can always re-execute the notebook, but this can be a pain, especially if some of the results depend on long calculations, parallel computations, etc.) I do not have a good idea about how to resolve this yet. Perhaps a workflow involving an extension like ipycache might work, but that seems a little too complicated.
[...] Cell/All Output/Clear menu option for removing the output.

Here is my solution with git. It allows you to just add and commit (and diff) as usual: those operations will not alter your working tree, and at the same time (re)running a notebook will not alter your git history.
Although this can probably be adapted to other VCSs, I know it doesn't satisfy your requirements (at least the VCS agnosticism). Still, it works perfectly for me. It is nothing particularly brilliant, and many people probably already use it, but I didn't find clear instructions on how to implement it by googling around, so it may be useful to other people.
Create the filter script (here assumed to be ~/bin/ipynb_output_filter.py)
Make it executable (chmod +x ~/bin/ipynb_output_filter.py)
Create the file ~/.gitattributes, with the following content
*.ipynb filter=dropoutput_ipynb
Run the following commands:
git config --global core.attributesfile ~/.gitattributes
git config --global filter.dropoutput_ipynb.clean ~/bin/ipynb_output_filter.py
git config --global filter.dropoutput_ipynb.smudge cat
Done!
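The contents of ~/bin/ipynb_output_filter.py are not reproduced here. As a rough sketch only (this is not the answer's actual script), a clean filter could look something like the following. It assumes the notebook is plain JSON in the nbformat 4 layout, or the older nbformat 3 layout with worksheets; the function name strip_outputs is made up for illustration. Git passes the file content to a clean filter on standard input and stores whatever the filter writes to standard output.

```python
#!/usr/bin/env python3
"""Illustrative clean filter: notebook JSON on stdin, stripped notebook on stdout."""
import json
import sys


def strip_outputs(notebook):
    """Remove outputs and execution counters from a notebook dict, in place."""
    # nbformat 4 keeps cells at the top level; nbformat 3 nests them in worksheets.
    cells = list(notebook.get("cells", []))
    for sheet in notebook.get("worksheets", []):
        cells.extend(sheet.get("cells", []))
    for cell in cells:
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells carry no outputs
        cell["outputs"] = []
        if "execution_count" in cell:
            cell["execution_count"] = None  # nbformat 4 counter
        cell.pop("prompt_number", None)  # nbformat 3 counter
    return notebook


def main():
    text = sys.stdin.read()
    if not text.strip():
        # Empty input (for example an empty file): pass it through unchanged.
        sys.stdout.write(text)
        return
    notebook = strip_outputs(json.loads(text))
    json.dump(notebook, sys.stdout, indent=1, sort_keys=True)
    sys.stdout.write("\n")


if __name__ == "__main__":
    main()
```

Because the smudge filter is cat, nothing is restored on checkout: the working copy keeps its outputs only until git rewrites the file.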
Limitations:
If you are on branch somebranch and you do git checkout otherbranch; git checkout somebranch, you usually expect the working tree to be unchanged. Here instead you will have lost the output and cell numbering of notebooks whose source differs between the two branches.
[...] git commit notebook_file.ipynb, although it would at least keep git diff notebook_file.ipynb free from base64 garbage).

My solution reflects the fact that I personally don't like to keep generated stuff versioned. Note that merges involving the output are almost guaranteed to invalidate the output, your productivity, or both.
EDIT:
If you adopt the solution as I suggested it, that is, globally, you will run into trouble if you want to version the output in some particular git repository. To disable the output filtering for a specific repository, simply create inside it a file .git/info/attributes, with
**.ipynb filter=
as content. Clearly, in the same way it is possible to do the opposite: enable the filtering only for a specific repository.
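For the opposite case, enabling the filter in one repository only, the manual steps could be automated along these lines. This is a sketch under assumptions, not part of the original answer: the helper names write_repo_attributes and configure_repo_filter are hypothetical, the filter name and script path are the ones used above, and it assumes that no global ~/.gitattributes rule is in place.

```python
import subprocess
from pathlib import Path

FILTER_NAME = "dropoutput_ipynb"  # filter name used in the answer


def write_repo_attributes(repo):
    """Add the *.ipynb filter rule to <repo>/.git/info/attributes; return its path."""
    attributes = Path(repo) / ".git" / "info" / "attributes"
    attributes.parent.mkdir(parents=True, exist_ok=True)
    rule = "*.ipynb filter=" + FILTER_NAME
    # Keep unrelated rules, but drop any earlier *.ipynb filter rule.
    kept = []
    if attributes.exists():
        kept = [line for line in attributes.read_text().splitlines()
                if not line.startswith("*.ipynb filter=")]
    attributes.write_text("\n".join(kept + [rule]) + "\n")
    return attributes


def configure_repo_filter(repo, script="~/bin/ipynb_output_filter.py"):
    """Register the clean and smudge commands in the repository's own config."""
    script_path = str(Path(script).expanduser())
    for key, value in (("clean", script_path), ("smudge", "cat")):
        # Same settings as the global commands above, but without --global.
        subprocess.run(
            ["git", "-C", str(repo), "config", f"filter.{FILTER_NAME}.{key}", value],
            check=True,
        )
```

Calling write_repo_attributes(".") followed by configure_repo_filter(".") from the top of a repository would then confine the filtering to that repository.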
The code is now maintained in its own git repo.
If the instructions above result in ImportErrors, try adding "ipython" before the path of the script:
git config --global filter.dropoutput_ipynb.clean ipython ~/bin/ipynb_output_filter.py
EDIT: May 2016 (updated February 2017): there are several alternatives to my script - for completeness, here is a list of those I know: nbstripout (other variants), nbstrip, jq.