Spike. Try to run ML models distributed in Jupyter notebooks with Dask
Closed, Declined, Public
Assigned To: None

Authored By: Nuria, Jan 17 2020, 5:26 PM

Description

Spike. Try to run ML models distributed in Jupyter notebooks with Dask.

Maybe we can get a simple model from research and rewrite some of its computation?

https://github.com/dask/dask/tree/master/docs

Related Objects

Mentioned In: T224658: Newpyter - SWAP Juypter Rewrite

Event Timeline

• Nuria created this task. Jan 17 2020, 5:26 PM

Restricted Application added a subscriber: Aklapper. Jan 17 2020, 5:26 PM

elukey subscribed. Jan 21 2020, 10:49 AM

First step done: Dask is running on our cluster.

Python environment setup:

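# Create and activate a Python 3 virtualenv (downloads go through the web proxy)
https_proxy=http://webproxy.eqiad.wmnet:8080 virtualenv -p /usr/bin/python3 dask_yarn
cd dask_yarn
source bin/activate
# Install Dask, the YARN integration, and the libraries used in the test
https_proxy=http://webproxy.eqiad.wmnet:8080 pip install dask dask-yarn pyarrow pandas scikit-learn venv-pack
https_proxy=http://webproxy.eqiad.wmnet:8080 pip install 'dask[dataframe]' --upgrade
# Make sure no proxy settings remain in the shell
unset http_proxy
unset https_proxy
# Package the environment so it can be shipped to the YARN workers
venv-pack -o test_dask_yarn.tar.gz
python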

Then in the Python shell:

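from dask_yarn import YarnCluster
from dask.distributed import Client
import dask.dataframe as dd

# Start a Dask cluster on YARN from the packed environment (this takes some time)
cluster = YarnCluster(environment='test_dask_yarn.tar.gz',
                      name='test_dask_yarn',
                      worker_vcores=2,
                      worker_memory='4GB',
                      n_workers=4)

# Connect a client to the cluster; evaluating it in the shell prints a summary
client = Client(cluster)
client

# Read one day of hourly projectview data from HDFS and sum views per project
ddf = dd.read_parquet('hdfs:///wmf/data/wmf/projectview/hourly/year=2020/month=1/day=20')
ddf.groupby(ddf.project).view_count.sum().compute()

# Shut down the YARN application
cluster.close()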

MGerlach subscribed. Feb 5 2020, 6:04 PM

• fdans triaged this task as Medium priority. Feb 10 2020, 5:51 PM

• fdans moved this task from Incoming to Machine Learning Platform on the Analytics board.

https://analytics-zoo.readthedocs.io/en/latest/

For the record: we are already testing https://github.com/criteo/tf-yarn with Miriam and Aiko; behind the scenes it uses Skein.

Ottomata mentioned this in T224658: Newpyter - SWAP Juypter Rewrite. May 3 2021, 2:55 PM

JArguello-WMF edited projects, added Machine-Learning-Team; removed Analytics. Jul 4 2022, 6:28 PM

calbon closed this task as Declined. Oct 25 2022, 2:35 PM

isarantopoulos moved this task from Unsorted to 2023-2024 Q3 Done on the Machine-Learning-Team board. Nov 20 2023, 11:43 AM
