speedup proposal: no lead loop in HindcastEnsemble but lead and init in obs #798

aaronspring · 2020-11-16T12:45:37Z

aaronspring
Nov 16, 2020
Maintainer

Create init and lead dimensions for the observations seamlessly with cfsv2_ds["o"] = (("L", "S"), o.sel(T=T)).
This approach thinks more in terms of real_time or valid_time. However, we do not have a dimension for time, just a multi-dimensional coordinate time. Why is this important? If time were a dimension, the shape would get really large and we would run into memory issues sooner.
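A minimal sketch of that idea on toy data, not the code from the gist: the integer time axis, the monthly lead unit and the array sizes are assumptions for illustration. A 2D valid-time indexer T(L, S) is passed to `.sel`, which selects pointwise and returns observations with dimensions (L, S).

```python
import numpy as np
import xarray as xr

# Hypothetical toy observations on an integer time axis T
# (e.g. months since a reference date, as with decode_times=False).
o = xr.DataArray(np.arange(24.0), dims="T", coords={"T": np.arange(24)})

# Hypothetical init times S and leads L, both in months.
S = xr.DataArray(np.arange(12), dims="S", coords={"S": np.arange(12)})
L = xr.DataArray(np.array([1, 2, 3]), dims="L", coords={"L": [1, 2, 3]})

# 2D valid time T(L, S) = S + L. It is used as an indexer and ends up
# as a coordinate, never as a dimension.
T = L + S

# Vectorized (pointwise) label selection: the result has dims (L, S).
obs_2d = o.sel(T=T)
```

Here `obs_2d.sel(L=2, S=5)` holds the observation valid at T=7, and `obs_2d["T"]` is a 2D coordinate over (L, S).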

Demo: https://gist.github.com/aaronspring/9d724cf385c1b29a5288eabf3e55148b

Idea comes from https://github.com/mktippett/ENSO/blob/master/ForecastVerification.ipynb
Comparison of this approach with current climpred: 10x fewer tasks: https://gist.github.com/aaronspring/fa99abbee189d65305179d3344e0a405

Small summary:

The new verify reproduces verify from climpred v2.1.1 (tiny errors of O(10^-7)).
Tasks decrease from ~100 to 13-31 (when applying to small data and calling .chunk()).
Timing reduces, especially for same_verif.
Timing increases for big geospatial data: 1000x1000 gridcells.
Allows lead units from microseconds to years (now also independent of YS, so Y or MS/M, QS/S work as well).

Implementation proposal:

Add a multi-dimensional coordinate time(init, lead) when instantiating initialized: this facilitates PredictionEnsemble.plot(), but requires allowing time as a coordinate, which is currently explicitly not allowed.
Second step: change the alignment. Maybe not all the way as proposed here, but a combination of the new and old way, e.g. using the new lead time matrix but still looping over each lead?

Benchmark done on my 2018 MacBook Pro. Results are in the CSV file in the gist.

Figure summary:

Visualization of alignment: here, observations in the 1990s are partly removed
timing reduces especially for same_verif
timing increases for big geospatial data: 200x200 gridcells

bradyrx · 2020-11-16T19:05:04Z

bradyrx
Nov 16, 2020
Maintainer

One random thought I wanted to add is this:

I view dask as a pretty advanced python tool. I myself didn't really start to leverage it to its full potential until early last summer I believe. Many students and researchers I work with are still transitioning to xarray and are happy doing things in memory with long runtimes because they view dask as daunting. I think that's totally reasonable -- it was a big jump for me.

So my concern is that when we construct this vectorized grid, if people are working in memory we could end up crashing their server or causing really long runtimes. In some cases maybe looping is better? This goes I guess for the current implementation plus this. But I view @aaronspring's ideas as moving toward a non-looped vectorized system which would create larger objects. I think it'll run beautifully in dask but we can't expect all users to have that.

So is there some way to detect their total memory and approximate the size of object we'll generate on the fly?
Can we use the looping method when in memory for a certain problem size?
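A rough sketch of how such a check could look. Everything here is hypothetical and not part of climpred: the function names, the safety factor and the idea of passing the available memory in as a plain number (it could come from something like psutil, which is not used here). It only estimates the size of a dense (lead, init, space) array of aligned observations.

```python
import numpy as np


def estimate_aligned_obs_nbytes(n_lead, n_init, spatial_shape=(), dtype="float64"):
    """Bytes needed for a dense (lead, init, *spatial_shape) array."""
    n_spatial = int(np.prod(spatial_shape, dtype=np.int64))  # 1 for no spatial dims
    return n_lead * n_init * n_spatial * np.dtype(dtype).itemsize


def should_loop_over_leads(nbytes, available_bytes, safety_factor=0.5):
    """Fall back to looping when the object would use too much of the memory budget."""
    return nbytes > safety_factor * available_bytes


# Example: 10 leads, 50 inits on a 1000x1000 grid of float64 values.
nbytes = estimate_aligned_obs_nbytes(10, 50, (1000, 1000))
loop = should_loop_over_leads(nbytes, available_bytes=4 * 10**9)
```

With these assumed numbers the vectorized object needs 4 GB, so with 4 GB available and a safety factor of 0.5 the sketch would pick the looping path.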

Just some thoughts. We can deal with this in testing.

0 replies

aaronspring · 2020-11-16T19:31:20Z

aaronspring
Nov 16, 2020
Maintainer Author

I see it this way: once people run into memory issues, they should/need to learn how to use dask to overcome this challenge. If they are way below the memory limit, I don't mind how they do it.

0 replies

ahuang11 · 2020-11-17T01:36:02Z

ahuang11
Nov 17, 2020
Maintainer

.interp() might be more robust than .sel() if not all times match exactly (e.g. offset by 12H) or if dates are simply missing in the obs (although it is slower). .sel(method='nearest') wouldn't work, I think: if the model valid time is 2020-04-02 and the nearest obs selected is 2020-04-01, it's still a completely different date.

So I recommend doing a

try:
    ds.sel()...
except KeyError:
    ds.interp()...
except MemoryError:
    raise MemoryError("Please utilize dask, e.g. ds.chunk('auto')")
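The difference between exact and nearest-neighbour selection can be checked on toy data. This is only a sketch: the two-day series and the 12-hour tolerance are made up for illustration. Exact `.sel` raises a KeyError for a missing date, `method='nearest'` silently returns a different date, and adding `tolerance` turns a too-distant match back into a KeyError.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy observations with a gap: 2020-04-02 to 2020-04-04 are missing.
obs = xr.DataArray(
    np.array([1.0, 5.0]),
    dims="time",
    coords={"time": pd.to_datetime(["2020-04-01", "2020-04-05"])},
)
valid_time = pd.Timestamp("2020-04-02")

# Exact matching refuses to pick a value for a missing date.
try:
    obs.sel(time=valid_time)
    exact_found = True
except KeyError:
    exact_found = False

# Nearest-neighbour matching silently returns the value from 2020-04-01.
nearest = obs.sel(time=valid_time, method="nearest")

# A tolerance rejects matches that are further away than 12 hours.
try:
    obs.sel(time=valid_time, method="nearest", tolerance=pd.Timedelta("12h"))
    within_tolerance = True
except KeyError:
    within_tolerance = False
```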

0 replies

aaronspring · 2020-11-17T06:35:12Z

aaronspring
Nov 17, 2020
Maintainer Author

With interpolation we don't know which valid time was taken in the end, and NaN masking stops being effective. So far we have required perfect matching of times.

0 replies

aaronspring · 2020-11-17T14:08:14Z

aaronspring
Nov 17, 2020
Maintainer Author

In general, we should add a multi-dimensional time coordinate to our output. It is quick to calculate with cftime_add_time_from_init_lead, and it facilitates plotting sooo much:

import xarray as xr
from climpred.utils import shift_cftime_singular


def cftime_add_time_from_init_lead(ds):
    """Add a 2D valid-time coordinate time(lead, init) if it is not there yet."""
    if 'time' not in ds.coords and 'time' not in ds.dims:
        freq = xr.infer_freq(ds.init.to_index())
        # one copy of init shifted by each lead, stacked along lead
        time = xr.concat(
            [xr.DataArray(shift_cftime_singular(ds.init, int(lead), freq), dims='init')
             for lead in ds.lead.values],
            dim='lead', join='inner', compat='broadcast_equals')
        time['lead'] = ds.lead
        ds.coords['time'] = time
    return ds

assert 'time' in cftime_add_time_from_init_lead(fo).coords


hind = cftime_add_time_from_init_lead(hind)

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette('viridis',10)
fig,ax=plt.subplots(figsize=(10,6),nrows=2,sharex=False,sharey=True)
for l in mask.lead:
    hind.sel(lead=l)[v].plot(label=str(l.values),x='init',ax=ax[0])
    hind.sel(lead=l)[v].plot(label=str(l.values),x='time',ax=ax[1])
ax[0].set_title('plot by inits')
ax[1].set_title('plot by valid time')
obs_ds_cf_smaller[v].plot(c='k',label='obs',ax=ax[1])
plt.legend()
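For readers without the gist data, the same idea can be sketched with plain pandas on a standard calendar. This is an illustration under assumptions, not the climpred helper: the yearly inits, the lead unit of years and the variable name sst are made up. It attaches a 2D coordinate time(lead, init) that can then be used as the x axis when plotting.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical hindcast: yearly inits, leads in years, dummy data.
init = pd.date_range("2000-01-01", periods=3, freq="YS")
lead = np.array([1, 2])
hind = xr.Dataset(
    {"sst": (("init", "lead"), np.zeros((3, 2)))},
    coords={"init": init, "lead": lead},
)

# Valid time = init shifted by lead years; one row per lead.
valid = np.stack([(init + pd.DateOffset(years=int(l))).values for l in lead])

# Attach it as a multi-dimensional coordinate; time is not a dimension.
hind = hind.assign_coords(time=(("lead", "init"), valid))
```

After this, `hind.sel(lead=2)["sst"].plot(x="time")` would plot that lead against its valid time.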

0 replies

bradyrx · 2020-11-17T15:46:38Z

bradyrx
Nov 17, 2020
Maintainer

Oh man, what a great point on plotting.. that was always a pain!

0 replies

aaronspring · 2020-11-17T16:04:39Z

aaronspring
Nov 17, 2020
Maintainer Author

And this will be quite easy to add. I will open a PR allowing this soon.

0 replies

ahuang11 · 2020-11-18T00:22:14Z

ahuang11
Nov 18, 2020
Maintainer

Never mind: "zero is a zero order spline. It's value at any point is the last raw value seen." That is not what I thought.
https://stackoverflow.com/questions/27698604/what-do-the-different-values-of-the-kind-argument-mean-in-scipy-interpolate-inte

If I am not mistaken, interp(method='zero') is the most robust choice for the except clause if you don't want it to be interpolated:

import numpy as np
import pandas as pd
import xarray as xr

cfs_url = 'http://iridl.ldeo.columbia.edu/SOURCES/.Models/.NMME/.NCEP-CFSv2/.HINDCAST/.MONTHLY/.T/dods'
obs_url = "http://iridl.ldeo.columbia.edu/expert/SOURCES/.NOAA/.NCEP/.EMC/.CMB/.GLOBAL/.Reyn_SmithOIv2/.monthly/.sst/T/1+index/dods"

obs = xr.open_dataset(obs_url, decode_times=False).isel(T=slice(0, 10))
cfs = xr.open_dataset(cfs_url, decode_times=False)

obs.interp(T=cfs['T'], method='zero')

0 replies

ahuang11 · 2020-11-19T03:08:30Z

ahuang11
Nov 19, 2020
Maintainer

For missing indices, I think something along the lines of reindex/align; see:
https://xarray.pydata.org/en/stable/indexing.html
but I couldn't get it to work elegantly besides using where, so I'll see if anyone has suggestions:
pydata/xarray#4593

obs = xr.open_dataset(obs_url, decode_times=False).isel(T=slice(0, 10))
cfs = xr.open_dataset(cfs_url, decode_times=False)
tst = obs.interp(T=cfs['T'])
tst.where(tst['T'].isin(obs['T'].values))
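One possible way to get the same masking without interpolating first is `reindex`, which by default only matches exact labels and fills everything else with NaN. A small sketch on made-up data (the integer T values stand in for the undecoded time axis):

```python
import numpy as np
import xarray as xr

# Toy observations on an integer time axis with T=2 missing.
obs = xr.DataArray(
    np.array([10.0, 11.0, 12.0]), dims="T", coords={"T": [0, 1, 3]}
)

# Hypothetical target times, e.g. the forecast valid times.
target_T = np.array([0, 1, 2, 3, 4])

# Exact label matching: labels not present in obs become NaN.
aligned = obs.reindex(T=target_T)
```

Whether this scales well to the 2D valid-time case discussed above is not tested here.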

0 replies

aaronspring · 2020-11-21T17:41:01Z

aaronspring
Nov 21, 2020
Maintainer Author

I updated the gist and the top post. My new way is faster for non-geospatial data but slower for more than 100000 gridcells.
I think the sel command is tough for big geospatial data. That's really an advantage of the current implementation.

Definitely, we should introduce the new time coord, regardless of what else we do with it.

I'd be interested in your opinions @bradyrx @ahuang11

0 replies

aaronspring · 2020-12-12T17:56:17Z

aaronspring
Dec 12, 2020
Maintainer Author

Lots of code can be used from xarray https://github.com/pydata/xarray/blob/6c32d7c21941461ae9c21b43e6071ee79fb47d68/xarray/core/accessor_dt.py

0 replies

aaronspring · 2021-03-12T20:21:19Z

aaronspring
Mar 12, 2021
Maintainer Author

Could try initialised.coords.to_dataset()
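For context, a small sketch of what that call returns on a made-up dataset (the names and sizes are assumptions): `.coords.to_dataset()` gives a Dataset that carries only the coordinates and no data variables, which might be a cheap object to do the time bookkeeping on.

```python
import numpy as np
import xarray as xr

# Hypothetical initialized dataset with dummy data.
initialized = xr.Dataset(
    {"sst": (("init", "lead"), np.zeros((3, 2)))},
    coords={"init": [2000, 2001, 2002], "lead": [1, 2]},
)

# Coordinates only: init and lead are kept, sst is dropped.
coords_only = initialized.coords.to_dataset()
```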

0 replies
