This parent task lays out potential designs for making Jupyter Notebooks a first-class citizen for Analytics. This means:
- Distribution by default - all notebooks should run either in YARN or Kubernetes - not doing this :(
- Notebook sharing - notebooks should be readable and shareable between engineers - still not sure how :(
- Easy to use and work with existing Analytics tools and data (Hive, Spark, HDFS, Druid, etc.)
- Easy to understand, manage, and deploy dependencies
- Should integrate with TBD ML Pipeline project
This would be a full SWAP rewrite.
Ideas:
- https://medium.com/netflix-techblog/notebook-innovation-591ee3221233
- https://medium.com/netflix-techblog/scheduling-notebooks-348e6c14cfd6
- https://nteract.io
- https://github.com/jupyter/enterprise_gateway
- https://hopsworks.readthedocs.io/en/0.9/user_guide/hopsworks/jupyter.html
Design Document: https://docs.google.com/document/d/1r-oqMXViWvQCqsYz0qzezZBWpip8LvkvCGF6GivFB_8/edit#heading=h.vpanev2oq14b
Final outcome:
We were able to satisfy 2 of the 5 stated outcomes:
- Easy to use and work with existing Analytics tools and data (Hive, Spark, HDFS, Druid, etc.)
- Easy to understand, manage, and deploy dependencies
The other outcomes were difficult to support for notebooks running inside WMF production networks. ML integration work is being explored in T243089.