-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Add IntervalIndex #10296
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add IntervalIndex #10296
Conversation
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
xarray/indexes/interval_index.py
Outdated
""" | ||
|
||
_index: PandasIndex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should save the central values too.
xarray/indexes/interval_index.py
Outdated
) | ||
# TODO: use PandasIndexingAdapter directly (avoid data copy) | ||
# and/or maybe add an index build option to preserve original labels? | ||
# (if those differ from interval midpoints as defined by pd.IntervalIndex) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should always save the central value and return that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. This is sub-optimal in the case where the central values exactly correspond to what is returned pd.IntervalIndex.mid
(e.g., when creating an IntervalIndex directly from a pd.IntervalIndex
), but as I was trying to account for this special case I went into all sorts of small complications, so I think it'll be easier to always save the central values into a separate PandasIndex
here.
xarray/indexes/interval_index.py
Outdated
def equals(self, other: Index) -> bool: | ||
if not isinstance(other, IntervalIndex): | ||
return False | ||
return self._index.equals(other._index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again need to check the central value here
Thanks for taking this on, Benoit!
A third case we could support is checking
👍 We would use this inside
This line "The instant time(i) should be contained within the interval, or be at one end of it" from example 7.1 seems to suggests it is closed at both edges? IME the intent is usually I started a discussion at the CF repo. I'd lean towards the default being
Not common but certainly useful; i have used One request: Can we add a test with datetime intervals please? That's certainly the most common use-case in weather/climate. |
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
- a PandasIndex for central values - another PandasIndex (with pd.IntervalIndex) for boundaries
check_mid_in_interval( | ||
joined_mid_index.index, cast(pd.IntervalIndex, joined_bounds_index.index) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that this check alone is enough to ensure we always yield meaningful results here, but combined with the check done in IntervalIndex.reindex_like
(i.e., indexers are equal for both mid and bounds indexes) I think it should be enough?
Looking at the CF discussion this seems very much needed indeed. |
|
I started implementing 2 but eventually gave up as things were getting too complicated and the code uglier. Regarding 3: I think that PandasIndex(pd.IntervalIndex) is useful, it is tied to just one coordinate and it is optimal, we shouldn't auto-convert it. This may actually be the preferred option to wrap an existing pd.IntervalIndex... we should "just" better document it: >>> idx = PandasIndex(pd.interval_range(start=0, end=5), "x")
>>> ds = xr.Dataset(coords=xr.Coordinates.from_xindex(idx))
>>> ds
<xarray.Dataset> Size: 80B
Dimensions: (x: 5)
Coordinates:
* x (x) interval[int64, right] 80B (0, 1] (1, 2] (2, 3] (3, 4] (4, 5]
Data variables:
*empty* The main reason to prefer an xarray IntervalIndex, apart from better compatibility with CF-conventions, is the possibility to define explicit positions for central values (possibly different from mid-points) that are checked during alignment. If that's a useful "generic" feature then it's worth for IntervalIndex to live in Xarray itself. Otherwise, yes this should probably better live in cf-xarray (although one technical complication is wrapping the |
Grouping the logic into one method will make it easier overriding the behavior in subclasses (interval index) without affecting much readability. Also it yield more DRY code.
So we can wrap pd.IntervalIndex in a boundary 2-dimensional coordinate variable.
Prevent numpy.dtype conversions or castings implemented in various places, gather the logic into one method.
Hey @pochedls and @lee1043, you're both more knowledgeable/experienced than me on on this topic. Do you have anything that to share that might be helpful? Also refer to the CF-convention discussion posted in the quote block above. |
whats-new.rst
api.rst
The
IntervalIndex
added here works with two coordinates, one for the midpoints (1D) and the other one for the left-right boundaries (2D), similarly to the CF conventions. Both coordinates are propagated in DataArrays thanks to #10137.A few TODOs / thoughts:
Besides
.set_xindex(["x", "x_bounds"], IntervalIndex)
we could also support the single coordinate case.set_xindex("x", IntervalIndex, bounds_dim="bounds")
where the "x" original variable is either:pd.IntervalIndex
-> simply reuse the latterA quicker way to wrap a
pd.IntervalIndex
would be useful to have too, e.g.,Conversely, perhaps it would be nice to have some convenient API that converts an IntervalIndex and its two coordinates into a basic PandasIndex and its unique coordinate? I.e., something equivalent to
.drop_indexes(["x", "x_bounds"]).drop_vars("x_bounds").set_xindex("x")
, assuming that the "x" coordinate data still wraps thepd.IntervalIndex
after dropping the Xarray IntervalIndex.(TODO): wrap IntervalIndex coordinate data with
PandasIndexingAdapter
so we avoid duplicating data (also for the bounds coordinate via a dedicated adapter subclass).(TODO): implement
IntervalIndex.join()
andIntervalIndex.reindex_like()
I tested
xr.concat()
with IntervalIndex and it is working but requires Alignment of n-dimensional indexes with partially excluded dims #10293By default for
pd.IntervalIndex
the intervals are closed on the right-side. From my understanding of CF conventions (section 7.1.1) it seems that the intervals should be closed on either the left of the right side, although none of these two is clearly stated. Should we stick with pandas' defaults then?I've not seen either in the CF conventions any specific statement on where exactly the cell coordinate labels should be located within each interval. Is it common to have locations different than midpoints? If it is the case, we could introduce an index build option to keep the original cell coordinate labels instead of relying on
pd.IntervalIndex.mid
.