The Jupyter (IPython) notebook is deservedly known as a good tool for prototyping code and doing all kinds of machine learning work interactively. But whenever I use it, I inevitably run into the following problem:
Suppose I've developed a whole machine learning pipeline in Jupyter that includes fetching raw data from various sources, cleaning the data, feature engineering, and finally training models. What's the best way to turn it into scripts with efficient and readable code? So far I've tackled this in several ways:
1. Simply convert the .ipynb to .py and, with only slight changes, hard-code the whole pipeline from the notebook into one Python script.
2. Make a single script with many functions (roughly one function per one or two cells), trying to capture the stages of the pipeline in separate functions, and name them accordingly. Then specify all parameters and global constants via argparse (see the first sketch after this list).
3. The same as point (2), but now wrap all the functions inside a class, so that all the global constants, as well as the outputs of each method, can be stored as class attributes (see the second sketch after this list).
4. Convert the notebook into a Python module with several scripts. I haven't tried this out, but I suspect it's the most laborious way to deal with the problem.
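For example, option (2) would look roughly like this. This is just a sketch; the stage names (fetch_data, train_model) and the parameters are placeholders, not my actual pipeline:

import argparse

def fetch_data(url):
    """Fetch raw data from `url` (placeholder stage)."""
    ...

def train_model(data, n_estimators):
    """Train a model on `data` (placeholder stage)."""
    ...

def main():
    parser = argparse.ArgumentParser(description='Run the whole pipeline.')
    parser.add_argument('--data-url', default='https://example.com/data.csv',
                        help='where to fetch the raw data from')
    parser.add_argument('--n-estimators', type=int, default=100,
                        help='a model hyperparameter')
    args = parser.parse_args()
    data = fetch_data(args.data_url)
    train_model(data, args.n_estimators)

if __name__ == '__main__':
    main()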
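And option (3) in the same spirit, with the same placeholder names, where former global constants and each stage's output live as instance attributes:

class Pipeline:
    """Bundles the pipeline stages; constants and stage outputs are attributes."""

    def __init__(self, data_url, n_estimators=100):
        self.data_url = data_url          # former global constant
        self.n_estimators = n_estimators  # former global constant

    def fetch_data(self):
        self.raw = ...       # output of stage 1, available to later stages

    def clean(self):
        self.cleaned = ...   # output of stage 2

    def train(self):
        self.model = ...     # output of stage 3

pipeline = Pipeline('https://example.com/data.csv', n_estimators=200)
pipeline.fetch_data()
pipeline.clean()
pipeline.train()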
I suppose this situation is very common among data scientists, but surprisingly I can't find any useful advice on it.
Folks, please share your ideas and experience. Have you ever encountered this issue? How have you tackled it?
Life saver: as you're writing your notebooks, incrementally refactor your code into functions, writing some minimal assert tests and docstrings.
After that, refactoring from notebook to script is natural. Not only that, but it makes your life easier when writing long notebooks, even if you have no plans to turn them into anything else.
Basic example of a cell's content with "minimal" tests and docstrings:
import os
import zipfile
from contextlib import closing

def zip_count(f):
    """Given zip filename, returns number of files inside.

    str -> int"""
    with closing(zipfile.ZipFile(f)) as archive:
        num_files = len(archive.infolist())
        return num_files

zip_filename = 'data/myfile.zip'

# Make sure `myfile` always has three files
assert zip_count(zip_filename) == 3
# And total zip size is under 2 MB
assert os.path.getsize(zip_filename) / 1024**2 < 2

print(zip_count(zip_filename))
Once you've exported it to bare .py files, your code will probably not be structured into classes yet. But it is worth the effort to refactor your notebook to the point where it has a set of documented functions, each with a set of simple assert statements that can easily be moved into tests.py for testing with pytest, unittest, or what have you. If it makes sense, bundling these functions into methods for your classes is dead easy after that.
If all goes well, all you need to do after that is write your if __name__ == '__main__': block and its "hooks": if you're writing a script to be called from the terminal, you'll want to handle command-line arguments; if you're writing a module, you'll want to think about its API with the __init__.py file, etc.
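For the script case, the entry point could look something like this (again just a sketch, reusing the made-up mymodule from above):

# cli.py -- a minimal command-line entry point
import argparse

from mymodule import zip_count  # `mymodule` is whichever module zip_count moved to

def main():
    parser = argparse.ArgumentParser(description='Count the files inside a zip archive.')
    parser.add_argument('zip_filename', help='path to the .zip file')
    args = parser.parse_args()
    print(zip_count(args.zip_filename))

if __name__ == '__main__':
    main()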
It all depends on what the intended use case is, of course: there's quite a difference between converting a notebook to a small script vs. turning it into a full-fledged module or package.
Here are a few ideas for a notebook-to-script workflow:
1. Export the notebook to a bare .py file.
2. Strip out the lines that don't do the actual work: print statements, plots, etc.
3. Write your script's entry point under if __name__ == '__main__'.
4. Gather the assert statements for each of your functions/methods, and flesh out a minimal test suite in tests.py.