Add IntervalIndex #10296

benbovy · 2025-05-07T20:39:01Z

Closes Design for IntervalIndex #8005
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst

The IntervalIndex added here works with two coordinates, one for the midpoints (1D) and the other one for the left-right boundaries (2D), similarly to the CF conventions. Both coordinates are propagated in DataArrays thanks to #10137.

A few TODOs / thoughts:

Besides .set_xindex(["x", "x_bounds"], IntervalIndex) we could also support the single coordinate case .set_xindex("x", IntervalIndex, bounds_dim="bounds") where the "x" original variable is either:
- already wrapping a pd.IntervalIndex -> simply reuse the latter
- any other arbitrary 1D coordinate -> infer bounds similarly to cf_xarray's add_bounds
A quicker way to wrap a pd.IntervalIndex would be useful to have too, e.g.,

>>> xidx = IntervalIndex.from_index(pd.IntervalIndex(...), dim="x", bounds_dim="bnd")
>>> ds = xr.Dataset(coords=xr.Coordinates.from_xindex(xidx))
>>> ds
<xarray.Dataset>
Dimensions:      (x: 10, bnd: 2)
Coordinates:
  * x            (x) float64
  * x_bnd        (x, bnd) float64 
Data variables:
  *empty*

Conversely, perhaps it would be nice to have some convenient API that converts an IntervalIndex and its two coordinates into a basic PandasIndex and its unique coordinate? I.e., something equivalent to .drop_indexes(["x", "x_bounds"]).drop_vars("x_bounds").set_xindex("x"), assuming that the "x" coordinate data still wraps the pd.IntervalIndex after dropping the Xarray IntervalIndex.
(TODO): wrap IntervalIndex coordinate data with PandasIndexingAdapter so we avoid duplicating data (also for the bounds coordinate via a dedicated adapter subclass).
(TODO): implement IntervalIndex.join() and IntervalIndex.reindex_like()
I tested xr.concat() with IntervalIndex and it is working but requires Alignment of n-dimensional indexes with partially excluded dims #10293
By default for pd.IntervalIndex the intervals are closed on the right-side. From my understanding of CF conventions (section 7.1.1) it seems that the intervals should be closed on either the left of the right side, although none of these two is clearly stated. Should we stick with pandas' defaults then?
I've not seen either in the CF conventions any specific statement on where exactly the cell coordinate labels should be located within each interval. Is it common to have locations different than midpoints? If it is the case, we could introduce an index build option to keep the original cell coordinate labels instead of relying on pd.IntervalIndex.mid.

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>

xarray/indexes/interval_index.py

dcherian · 2025-05-08T03:40:55Z

xarray/indexes/interval_index.py

+
+    """
+
+    _index: PandasIndex


we should save the central values too.

dcherian · 2025-05-08T03:41:14Z

xarray/indexes/interval_index.py

+        )
+        # TODO: use PandasIndexingAdapter directly (avoid data copy)
+        # and/or maybe add an index build option to preserve original labels?
+        # (if those differ from interval midpoints as defined by pd.IntervalIndex)


I think we should always save the central value and return that.

Agreed. This is sub-optimal in the case where the central values exactly correspond to what is returned pd.IntervalIndex.mid (e.g., when creating an IntervalIndex directly from a pd.IntervalIndex), but as I was trying to account for this special case I went into all sorts of small complications, so I think it'll be easier to always save the central values into a separate PandasIndex here.

xarray/indexes/interval_index.py

dcherian · 2025-05-08T03:42:16Z

xarray/indexes/interval_index.py

+    def equals(self, other: Index) -> bool:
+        if not isinstance(other, IntervalIndex):
+            return False
+        return self._index.equals(other._index)


Again need to check the central value here

dcherian · 2025-05-08T03:57:07Z

Thanks for taking this on, Benoit!

we could also support the single coordinate case .set_xindex("x", IntervalIndex, bounds_dim="bounds")

A third case we could support is checking x for the bounds attribute and follow that to find the edges.

A quicker way to wrap a pd.IntervalIndex would be useful to have too

👍 We would use this inside BinGrouper I think but that can be a followup.

By default for pd.IntervalIndex the intervals are closed on the right-side. From my understanding of CF conventions (section 7.1.1) it seems that the intervals should be closed on either the left of the right side, although none of these two is clearly stated.

This line "The instant time(i) should be contained within the interval, or be at one end of it" from example 7.1 seems to suggests it is closed at both edges? IME the intent is usually closed="left" (@tomvothecoder does this match your experience too?) We should certainly make this configurable.

I started a discussion at the CF repo.

I'd lean towards the default being closed="left" because I'm not sure how indexing would work when closed="both" and you request an interval edge that belongs to two intervals.

I've not seen either in the CF conventions any specific statement on where exactly the cell coordinate labels should be located within each interval. Is it common to have locations different than midpoints?

Not common but certainly useful; i have used median of a bunch of samples when binning data, for example.

One request: Can we add a test with datetime intervals please? That's certainly the most common use-case in weather/climate.

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>

- a PandasIndex for central values - another PandasIndex (with pd.IntervalIndex) for boundaries

benbovy · 2025-05-08T13:51:29Z

xarray/indexes/interval_index.py

+        check_mid_in_interval(
+            joined_mid_index.index, cast(pd.IntervalIndex, joined_bounds_index.index)
+        )


I'm not sure that this check alone is enough to ensure we always yield meaningful results here, but combined with the check done in IntervalIndex.reindex_like (i.e., indexers are equal for both mid and bounds indexes) I think it should be enough?

benbovy · 2025-05-08T13:55:29Z

We should certainly make this [closed] configurable.

Looking at the CF discussion this seems very much needed indeed.

dcherian · 2025-05-08T17:44:07Z

Given how ill-defined CF is, and that it is domain specific, should we have IntervalIndex.from_cf(central_values, bounds_variable, closed="auto") instead? IntervalIndex.to_cf(dim="bounds") could be helpful too.
We could also have IntervalIndex.from_pandas. When created from Pandas, the "central value" index could be None.
More importantly, Is there any reason to prefer this over a PandasIndex wrapping pd.IntervalIndex? Should we auto-convert a PandasIndex(pd.IntervalIndex) to xr.IntervalIndex? If there is no benefit, perhaps this one should live in cf-xarray?

benbovy · 2025-05-08T21:21:22Z

I started implementing 2 but eventually gave up as things were getting too complicated and the code uglier.

Regarding 3: I think that PandasIndex(pd.IntervalIndex) is useful, it is tied to just one coordinate and it is optimal, we shouldn't auto-convert it. This may actually be the preferred option to wrap an existing pd.IntervalIndex... we should "just" better document it:

>>> idx = PandasIndex(pd.interval_range(start=0, end=5), "x")
>>> ds = xr.Dataset(coords=xr.Coordinates.from_xindex(idx))
>>> ds
<xarray.Dataset> Size: 80B
Dimensions:  (x: 5)
Coordinates:
  * x        (x) interval[int64, right] 80B (0, 1] (1, 2] (2, 3] (3, 4] (4, 5]
Data variables:
    *empty*

The main reason to prefer an xarray IntervalIndex, apart from better compatibility with CF-conventions, is the possibility to define explicit positions for central values (possibly different from mid-points) that are checked during alignment. If that's a useful "generic" feature then it's worth for IntervalIndex to live in Xarray itself. Otherwise, yes this should probably better live in cf-xarray (although one technical complication is wrapping the pd.IntervalIndex as the boundary coordinate data, since xarray.core.indexing.PandasIndexAdapter is not public and we need to subclass it).

Grouping the logic into one method will make it easier overriding the behavior in subclasses (interval index) without affecting much readability. Also it yield more DRY code.

So we can wrap pd.IntervalIndex in a boundary 2-dimensional coordinate variable.

Prevent numpy.dtype conversions or castings implemented in various places, gather the logic into one method.

tomvothecoder · 2025-05-09T17:33:42Z

This line "The instant time(i) should be contained within the interval, or be at one end of it" from example 7.1 seems to suggests it is closed at both edges? IME the intent is usually closed="left" (@tomvothecoder does this match your experience too?) We should certainly make this configurable.

I started a discussion at the CF repo.

I'd lean towards the default being closed="left" because I'm not sure how indexing would work when closed="both" and you request an interval edge that belongs to two intervals.

Hey @pochedls and @lee1043, you're both more knowledgeable/experienced than me on on this topic. Do you have anything that to share that might be helpful? Also refer to the CF-convention discussion posted in the quote block above.

benbovy and others added 5 commits May 7, 2025 15:14

add IntervalIndex

9282fc4

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>

add index description (docstrings)

f71f767

add type annotations

f7041fd

expose IntervalIndex publicly via xarray.indexes

9401774

add a few TODOs

781d33f

github-actions bot added topic-indexing topic-documentation labels May 7, 2025

dcherian reviewed May 8, 2025

View reviewed changes

xarray/indexes/interval_index.py Outdated Show resolved Hide resolved

dcherian reviewed May 8, 2025

View reviewed changes

xarray/indexes/interval_index.py Outdated Show resolved Hide resolved

dcherian reviewed May 8, 2025

View reviewed changes

xarray/indexes/interval_index.py Outdated

"""

_index: PandasIndex

Copy link

Contributor

dcherian May 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should save the central values too.

dcherian reviewed May 8, 2025

View reviewed changes

xarray/indexes/interval_index.py Outdated Show resolved Hide resolved

dcherian reviewed May 8, 2025

View reviewed changes

benbovy and others added 7 commits May 8, 2025 09:53

clean-up

48dc0bd

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>

better docstrings

b424b12

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>

refactor: use two sub-indexes

8d80e71

- a PandasIndex for central values - another PandasIndex (with pd.IntervalIndex) for boundaries

check consistent central values vs. intervals

e60a1a4

fix mypy

8918fe8

implement join and reindex_like

c722a2e

add mid_index and bounds_index properties

23fb18b

benbovy commented May 8, 2025

View reviewed changes

benbovy added 3 commits May 9, 2025 11:19

clean-up indexing.PandasIndexingAdapter typing

de4f5d8

streamline PandasIndexingAdapter indexing logic

e1bf896

Grouping the logic into one method will make it easier overriding the behavior in subclasses (interval index) without affecting much readability. Also it yield more DRY code.

add xarray indexing adapater for pd.IntervalIndex

06a3b92

So we can wrap pd.IntervalIndex in a boundary 2-dimensional coordinate variable.

benbovy mentioned this pull request May 9, 2025

Support extension array indexes #9671

Merged

4 tasks

clean-up PandasIndexingAdapter dtype handling

80f496f

Prevent numpy.dtype conversions or castings implemented in various places, gather the logic into one method.

benbovy added 3 commits May 9, 2025 14:41

fix mypy

67d8f6c

IntervalIndex sel / isel: handle boundary dim & coord

a8015aa

more clean-up

5b5cbee

benbovy mentioned this pull request May 26, 2025

Clean-up indexing adapter classes #10355

Open

1 task

dcherian mentioned this pull request Jun 18, 2025

Design for IntervalIndex #8005

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Add IntervalIndex #10296

Add IntervalIndex #10296

Uh oh!

benbovy commented May 7, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

dcherian May 8, 2025

Uh oh!

dcherian May 8, 2025

Uh oh!

benbovy May 8, 2025

Uh oh!

Uh oh!

dcherian May 8, 2025

Uh oh!

dcherian commented May 8, 2025 •

edited

Loading

Uh oh!

benbovy May 8, 2025

Uh oh!

benbovy commented May 8, 2025

Uh oh!

dcherian commented May 8, 2025 •

edited

Loading

Uh oh!

benbovy commented May 8, 2025

Uh oh!

tomvothecoder commented May 9, 2025

Uh oh!

Uh oh!

Uh oh!

Add IntervalIndex #10296

Are you sure you want to change the base?

Add IntervalIndex #10296

Uh oh!

Conversation

benbovy commented May 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dcherian May 8, 2025

Choose a reason for hiding this comment

Uh oh!

dcherian May 8, 2025

Choose a reason for hiding this comment

Uh oh!

benbovy May 8, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

dcherian May 8, 2025

Choose a reason for hiding this comment

Uh oh!

dcherian commented May 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

benbovy May 8, 2025

Choose a reason for hiding this comment

Uh oh!

benbovy commented May 8, 2025

Uh oh!

dcherian commented May 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

benbovy commented May 8, 2025

Uh oh!

tomvothecoder commented May 9, 2025

Uh oh!

Uh oh!

benbovy commented May 7, 2025 •

edited

Loading

dcherian commented May 8, 2025 •

edited

Loading

dcherian commented May 8, 2025 •

edited

Loading