Spike. Try to run ML models distributed in Jupyter notebooks with Dask
Closed, Declined (Public)

Description

Spike. Try to run ML models distributed in Jupyter notebooks with Dask

Maybe we can take a simple model from Research and rewrite some of its computation?

https://github.com/dask/dask/tree/master/docs

Event Timeline

First step done: Dask is running on our cluster.

Python environment setup:

https_proxy=http://webproxy.eqiad.wmnet:8080 virtualenv -p /usr/bin/python3 dask_yarn
cd dask_yarn
source bin/activate
https_proxy=http://webproxy.eqiad.wmnet:8080 pip install dask dask-yarn pyarrow pandas scikit-learn venv-pack 
https_proxy=http://webproxy.eqiad.wmnet:8080 pip install 'dask[dataframe]' --upgrade
unset http_proxy
unset https_proxy
venv-pack -o test_dask_yarn.tar.gz
python

Then in the Python shell:

from dask_yarn import YarnCluster
# This takes some time
cluster = YarnCluster(environment='test_dask_yarn.tar.gz',
                      name='test_dask_yarn',
                      worker_vcores=2,
                      worker_memory='4GB',
                      n_workers=4)


from dask.distributed import Client
client = Client(cluster)
client

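import dask.dataframe as dd
# Read one day of hourly projectview data from HDFS as a Dask dataframe
ddf = dd.read_parquet('hdfs:///wmf/data/wmf/projectview/hourly/year=2020/month=1/day=20')
# Sum view counts per project; compute() runs the job on the YARN workers
ddf.groupby(ddf.project).view_count.sum().compute()

# Shut down the YARN application and release its resources
cluster.close()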
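
For readers unfamiliar with how a grouped sum can be spread over workers, here is a minimal standard-library sketch of the general idea: each partition is aggregated on its own, then the partial results are merged. It only illustrates the pattern and is not Dask's actual implementation. The sample rows and the helper names partial_sum and combine are made up for the example.

```python
from collections import Counter
from functools import reduce

# Hypothetical rows standing in for the projectview data: (project, view_count).
# In the real job each partition would be one chunk of the Parquet dataset.
partitions = [
    [("en.wikipedia", 10), ("de.wikipedia", 4)],
    [("en.wikipedia", 7), ("fr.wikipedia", 3)],
    [("de.wikipedia", 1), ("fr.wikipedia", 2)],
]


def partial_sum(rows):
    """Sum view_count per project within a single partition."""
    totals = Counter()
    for project, view_count in rows:
        totals[project] += view_count
    return totals


def combine(left, right):
    """Merge two partial results by adding counts for matching projects."""
    merged = Counter(left)
    merged.update(right)  # Counter.update adds counts instead of replacing them
    return merged


# Step 1: aggregate every partition independently (this part can run in parallel).
partials = [partial_sum(rows) for rows in partitions]

# Step 2: fold the partial results into the final per-project totals.
view_counts = dict(reduce(combine, partials, Counter()))
print(view_counts)
```

Because the merge step only adds numbers, the partitions can be processed in any order and on any worker, which is what makes this kind of query a good first test for a distributed setup.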
fdans triaged this task as Medium priority. Feb 10 2020, 5:51 PM
fdans moved this task from Incoming to Machine Learning Platform on the Analytics board.

For the record: we are already testing https://github.com/criteo/tf-yarn with Miriam and Aiko, which uses Skein behind the scenes.