speedup proposal: no lead loop in HindcastEnsemble but lead and init in obs #798
Replies: 12 comments
-
One random thought I wanted to add is this: I view So my concern is that when we construct this vectorized grid, if people are working in memory we could end up crashing their server or causing really long runtimes. In some cases maybe looping is better? This goes I guess for the current implementation plus this. But I view @aaronspring's ideas as moving toward a non-looped vectorized system which would create larger objects. I think it'll run beautifully in
Just some thoughts. We can deal with this in testing.
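To put rough numbers on the memory concern above, here is a back-of-the-envelope sketch. The sizes are hypothetical and chosen only for illustration; it compares holding the matched observations for one lead at a time (looping) with holding them for all leads at once (vectorized).

```python
from math import prod


def nbytes(shape, itemsize=8):
    """Bytes needed for a dense array of the given shape (float64 by default)."""
    return prod(shape) * itemsize


# Hypothetical sizes, not taken from any real dataset.
n_init, n_lead, n_cell = 60, 10, 100_000

# Looping over lead: only one (init, cell) slice of matched observations at a time.
per_lead = nbytes((n_init, n_cell))

# Vectorized: matched observations for every lead held at once.
vectorized = nbytes((n_init, n_lead, n_cell))

print(f"looped, one lead at a time: {per_lead / 1e6:.0f} MB")
print(f"vectorized, all leads:      {vectorized / 1e6:.0f} MB")
```

In memory, the vectorized object is larger by a factor of the number of leads; with dask-backed arrays the same computation can be evaluated chunk by chunk instead.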
-
I see it this way: once people run into memory issues, they need to learn how to use dask to overcome this challenge. If they are well below memory limits, I don't mind how they do it.
-
.interp() might be more robust than .sel() if not all times match exactly (for example, offset by 12H) or if dates are simply missing in obs, although it is slower. I don't think .sel(method='nearest') would work: if the model valid time is 2020-04-02 and the nearest obs selected is 2020-04-01, it's still a completely different date. So I recommend doing a
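The pitfall with .sel(method='nearest') described above can be reproduced on a toy series. This is only a sketch: it assumes daily datetime64 observations with 2020-04-02 and 2020-04-03 missing, and all variable names are made up for illustration. Passing a tolerance turns the silent mismatch into an error.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy observations; 2020-04-02 and 2020-04-03 are missing.
obs_time = pd.to_datetime(["2020-03-31", "2020-04-01", "2020-04-04"])
obs = xr.DataArray([1.0, 2.0, 3.0], coords={"time": obs_time}, dims="time")

target = np.datetime64("2020-04-02")  # model valid time without a matching observation

# Nearest-neighbor lookup silently returns the observation of a different date.
wrong = obs.sel(time=target, method="nearest")
print("requested", target, "got", wrong["time"].values)

# With a tolerance, the lookup fails loudly instead of borrowing a neighbor.
raised = False
try:
    obs.sel(time=target, method="nearest", tolerance=pd.Timedelta("12h"))
except KeyError:
    raised = True
    print("no observation within 12 hours of", target)
```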
-
With interpolation we don't know which valid time was taken in the end, and NaN masking stops being effective. So far we have required perfect matching of times.
-
In general, we should add a multi-dimensional time coordinate to our output. It is quick to calculate:
import matplotlib.pyplot as plt
import seaborn as sns
import xarray as xr
from climpred.utils import shift_cftime_index, get_lead_cftime_shift_args, shift_cftime_singular

def cftime_add_time_from_init_lead(ds):
    # Add valid time as a 2D coordinate unless ds already has time.
    if 'time' not in ds.coords and 'time' not in ds.dims:
        freq = xr.infer_freq(ds.init.to_index())
        # One shifted copy of init per lead, stacked along lead.
        time = xr.concat(
            [xr.DataArray(shift_cftime_singular(ds.init, int(lead), freq), dims='init')
             for lead in ds.lead.values],
            dim='lead', join='inner', compat='broadcast_equals')
        time['lead'] = ds.lead
        ds.coords['time'] = time
    return ds

# fo, hind, mask, v and obs_ds_cf_smaller are defined in the gist.
assert 'time' in cftime_add_time_from_init_lead(fo).coords
hind = cftime_add_time_from_init_lead(hind)

sns.set_palette('viridis', 10)
fig, ax = plt.subplots(figsize=(10, 6), nrows=2, sharex=False, sharey=True)
for l in mask.lead:
    hind.sel(lead=l)[v].plot(label=str(l.values), x='init', ax=ax[0])
    hind.sel(lead=l)[v].plot(label=str(l.values), x='time', ax=ax[1])
ax[0].set_title('plot by inits')
ax[1].set_title('plot by valid time')
obs_ds_cf_smaller[v].plot(c='k', label='obs', ax=ax[1])
plt.legend()
-
Oh man, what a great point on plotting. That was always a pain!
-
And this will be quite easy to add. I will open a PR allowing this soon.
-
Never mind; "zero is a zero order spline. It's value at any point is the last raw value seen." That is not what I thought.
If I am not mistaken, interp(method='zero') is the most robust choice for the except clause if you don't want the values to be interpolated:
-
For missing indices, I think something along the lines of reindex/align; see:
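For missing dates, a reindex/align approach might look like the sketch below. It assumes daily datetime64 data and made-up variable names. By default, reindex matches labels exactly and fills absent dates with NaN, so NaN masking keeps working; xr.align with join="inner" instead drops the dates that are not present in both objects.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy observations with 2020-04-02 missing.
obs_time = pd.to_datetime(["2020-04-01", "2020-04-03", "2020-04-04"])
obs = xr.DataArray([1.0, 3.0, 4.0], coords={"time": obs_time}, dims="time")

# Valid times requested by the forecast.
valid_time = pd.date_range("2020-04-01", "2020-04-04", freq="D")

# Exact matching only: dates absent from obs become NaN instead of borrowing a neighbor.
obs_on_valid = obs.reindex(time=valid_time)
print(obs_on_valid.values)

# Alternative: keep only the dates present in both forecast and observations.
fct = xr.DataArray(np.zeros(4), coords={"time": valid_time}, dims="time")
fct_common, obs_common = xr.align(fct, obs, join="inner")
print(fct_common["time"].values)
```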
-
I updated the gist and the top post. My new approach is faster for non-geospatial data but slower for more than 100000 grid cells. We should definitely introduce the new
-
A lot of code can be reused from xarray: https://github.com/pydata/xarray/blob/6c32d7c21941461ae9c21b43e6071ee79fb47d68/xarray/core/accessor_dt.py
-
Could try initialised.coords.to_dataset()
-
observation seamlessly with cfsv2_ds["o"] = (("L", "S"), o.sel(T=T))
real_time or valid_time; however, we do not have a dimension for time, just a multi-dimensional coordinate time. Why is this important? If time were a dimension, the shape would get really large and we would run into memory issues sooner.
Demo: https://gist.github.com/aaronspring/9d724cf385c1b29a5288eabf3e55148b
Idea comes from https://github.com/mktippett/ENSO/blob/master/ForecastVerification.ipynb
Comparison of this approach with current climpred: 10x fewer tasks: https://gist.github.com/aaronspring/fa99abbee189d65305179d3344e0a405
Small summary:
new verify reproduces verify from climpred v2.1.1 (tiny errors, O(10^-7))
.chunk() )
same_verif
Implementation proposal:
time(init, lead) when instantiating initialized: facilitates PredictionEnsemble.plot() but requires allowing time as a coord, which is explicitly not allowed
Benchmark done on my 2018 MacBook Pro. Results are in the CSV file in the gist.
Figure summary:
same_verif
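The core idea of the post, a two-dimensional valid time coordinate time(init, lead) used to pick observations for all inits and leads in one vectorized selection, can be sketched with plain xarray. This is a minimal sketch and not climpred's implementation: it uses daily datetime64 data instead of cftime, and the names obs, valid_time and matched are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy observations: one value per day, equal to the day index.
obs_time = pd.date_range("2000-01-01", periods=20, freq="D")
obs = xr.DataArray(np.arange(20.0), coords={"time": obs_time}, dims="time")

# Toy forecast axes: five daily inits and three leads in days.
init = obs_time[:5]
lead = np.array([1, 2, 3])

# Valid time as a 2D array with dims (init, lead): valid = init + lead.
valid_time = xr.DataArray(
    init.values[:, None] + lead[None, :] * np.timedelta64(1, "D"),
    coords={"init": init, "lead": lead},
    dims=("init", "lead"),
)

# Vectorized label-based selection, no loop over lead.
# Every valid time must exist in obs, otherwise a KeyError is raised.
matched = obs.sel(time=valid_time)

print(matched.dims)
print(matched["time"].dims)
```

The result has dims (init, lead) and carries time only as a 2D coordinate, so no additional time dimension is created.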