Spike. Try to run ML models distributed in Jupyter notebooks with Dask.
Maybe we can get a simple model from research and rewrite some of its computation?
First step done: Having dask running on our cluster.
Python environment setup:
https_proxy=http://webproxy.eqiad.wmnet:8080 virtualenv -p /usr/bin/python3 dask_yarn
cd dask_yarn
source bin/activate
https_proxy=http://webproxy.eqiad.wmnet:8080 pip install dask dask-yarn pyarrow pandas scikit-learn venv-pack
https_proxy=http://webproxy.eqiad.wmnet:8080 pip install dask[dataframe] --upgrade
unset http_proxy
unset https_proxy
venv-pack -o test_dask_yarn.tar.gz
python
Then in the Python shell:
from dask_yarn import YarnCluster
# This takes some time
cluster = YarnCluster(environment='test_dask_yarn.tar.gz', name='test_dask_yarn', worker_vcores=2, worker_memory='4GB', n_workers=4)
from dask.distributed import Client
client = Client(cluster)
client
import dask.dataframe as dd
ddf = dd.read_parquet('hdfs:///wmf/data/wmf/projectview/hourly/year=2020/month=1/day=20')
ddf.groupby(ddf.project).view_count.sum().compute()
cluster.close()
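For reference, the aggregation that Dask distributes across the YARN workers is logically the same as a plain pandas groupby on an in-memory frame. A minimal local sketch, using a toy stand-in for the projectview data (the column values here are made up, not the real schema contents):

```python
import pandas as pd

# Hypothetical sample mimicking the projectview hourly layout
df = pd.DataFrame({
    'project': ['en.wikipedia', 'en.wikipedia', 'de.wikipedia'],
    'view_count': [100, 50, 30],
})

# Same aggregation as ddf.groupby(ddf.project).view_count.sum().compute(),
# except pandas runs it in one process while Dask partitions the parquet
# files across workers and merges the partial sums.
totals = df.groupby('project')['view_count'].sum()
print(totals)
```

The Dask version only differs in that `read_parquet` produces a lazy, partitioned dataframe and `.compute()` triggers the distributed execution.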
To keep the archives happy: we are already testing https://github.com/criteo/tf-yarn with Miriam and Aiko, which behind the scenes uses Skein.