Skip to content

Add IntervalIndex #10296

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 19 commits into
base: main
Choose a base branch
from
Open

Add IntervalIndex #10296

wants to merge 19 commits into from

Conversation

benbovy
Copy link
Member

@benbovy benbovy commented May 7, 2025

  • Closes Design for IntervalIndex #8005
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

The IntervalIndex added here works with two coordinates, one for the midpoints (1D) and the other one for the left-right boundaries (2D), similarly to the CF conventions. Both coordinates are propagated in DataArrays thanks to #10137.

A few TODOs / thoughts:

  • Besides .set_xindex(["x", "x_bounds"], IntervalIndex) we could also support the single coordinate case .set_xindex("x", IntervalIndex, bounds_dim="bounds") where the "x" original variable is either:

    • already wrapping a pd.IntervalIndex -> simply reuse the latter
    • any other arbitrary 1D coordinate -> infer bounds similarly to cf_xarray's add_bounds
  • A quicker way to wrap a pd.IntervalIndex would be useful to have too, e.g.,

>>> xidx = IntervalIndex.from_index(pd.IntervalIndex(...), dim="x", bounds_dim="bnd")
>>> ds = xr.Dataset(coords=xr.Coordinates.from_xindex(xidx))
>>> ds
<xarray.Dataset>
Dimensions:      (x: 10, bnd: 2)
Coordinates:
  * x            (x) float64
  * x_bnd        (x, bnd) float64 
Data variables:
  *empty*
  • Conversely, perhaps it would be nice to have some convenient API that converts an IntervalIndex and its two coordinates into a basic PandasIndex and its unique coordinate? I.e., something equivalent to .drop_indexes(["x", "x_bounds"]).drop_vars("x_bounds").set_xindex("x"), assuming that the "x" coordinate data still wraps the pd.IntervalIndex after dropping the Xarray IntervalIndex.

  • (TODO): wrap IntervalIndex coordinate data with PandasIndexingAdapter so we avoid duplicating data (also for the bounds coordinate via a dedicated adapter subclass).

  • (TODO): implement IntervalIndex.join() and IntervalIndex.reindex_like()

  • I tested xr.concat() with IntervalIndex and it is working but requires Alignment of n-dimensional indexes with partially excluded dims #10293

  • By default for pd.IntervalIndex the intervals are closed on the right-side. From my understanding of CF conventions (section 7.1.1) it seems that the intervals should be closed on either the left of the right side, although none of these two is clearly stated. Should we stick with pandas' defaults then?

  • I've not seen either in the CF conventions any specific statement on where exactly the cell coordinate labels should be located within each interval. Is it common to have locations different than midpoints? If it is the case, we could introduce an index build option to keep the original cell coordinate labels instead of relying on pd.IntervalIndex.mid.

"""

_index: PandasIndex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should save the central values too.

)
# TODO: use PandasIndexingAdapter directly (avoid data copy)
# and/or maybe add an index build option to preserve original labels?
# (if those differ from interval midpoints as defined by pd.IntervalIndex)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should always save the central value and return that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. This is sub-optimal in the case where the central values exactly correspond to what is returned pd.IntervalIndex.mid (e.g., when creating an IntervalIndex directly from a pd.IntervalIndex), but as I was trying to account for this special case I went into all sorts of small complications, so I think it'll be easier to always save the central values into a separate PandasIndex here.

def equals(self, other: Index) -> bool:
if not isinstance(other, IntervalIndex):
return False
return self._index.equals(other._index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again need to check the central value here

@dcherian
Copy link
Contributor

dcherian commented May 8, 2025

Thanks for taking this on, Benoit!

we could also support the single coordinate case .set_xindex("x", IntervalIndex, bounds_dim="bounds")

A third case we could support is checking x for the bounds attribute and follow that to find the edges.

A quicker way to wrap a pd.IntervalIndex would be useful to have too

👍 We would use this inside BinGrouper I think but that can be a followup.

By default for pd.IntervalIndex the intervals are closed on the right-side. From my understanding of CF conventions (section 7.1.1) it seems that the intervals should be closed on either the left of the right side, although none of these two is clearly stated.

This line "The instant time(i) should be contained within the interval, or be at one end of it" from example 7.1 seems to suggests it is closed at both edges? IME the intent is usually closed="left" (@tomvothecoder does this match your experience too?) We should certainly make this configurable.

I started a discussion at the CF repo.

I'd lean towards the default being closed="left" because I'm not sure how indexing would work when closed="both" and you request an interval edge that belongs to two intervals.

I've not seen either in the CF conventions any specific statement on where exactly the cell coordinate labels should be located within each interval. Is it common to have locations different than midpoints?

Not common but certainly useful; i have used median of a bunch of samples when binning data, for example.


One request: Can we add a test with datetime intervals please? That's certainly the most common use-case in weather/climate.

benbovy and others added 7 commits May 8, 2025 09:53
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
- a PandasIndex for central values
- another PandasIndex (with pd.IntervalIndex) for boundaries
Comment on lines +229 to +231
check_mid_in_interval(
joined_mid_index.index, cast(pd.IntervalIndex, joined_bounds_index.index)
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that this check alone is enough to ensure we always yield meaningful results here, but combined with the check done in IntervalIndex.reindex_like (i.e., indexers are equal for both mid and bounds indexes) I think it should be enough?

@benbovy
Copy link
Member Author

benbovy commented May 8, 2025

We should certainly make this [closed] configurable.

Looking at the CF discussion this seems very much needed indeed.

@dcherian
Copy link
Contributor

dcherian commented May 8, 2025

  1. Given how ill-defined CF is, and that it is domain specific, should we have IntervalIndex.from_cf(central_values, bounds_variable, closed="auto") instead? IntervalIndex.to_cf(dim="bounds") could be helpful too.
  2. We could also have IntervalIndex.from_pandas. When created from Pandas, the "central value" index could be None.
  3. More importantly, Is there any reason to prefer this over a PandasIndex wrapping pd.IntervalIndex? Should we auto-convert a PandasIndex(pd.IntervalIndex) to xr.IntervalIndex? If there is no benefit, perhaps this one should live in cf-xarray?

@benbovy
Copy link
Member Author

benbovy commented May 8, 2025

I started implementing 2 but eventually gave up as things were getting too complicated and the code uglier.

Regarding 3: I think that PandasIndex(pd.IntervalIndex) is useful, it is tied to just one coordinate and it is optimal, we shouldn't auto-convert it. This may actually be the preferred option to wrap an existing pd.IntervalIndex... we should "just" better document it:

>>> idx = PandasIndex(pd.interval_range(start=0, end=5), "x")
>>> ds = xr.Dataset(coords=xr.Coordinates.from_xindex(idx))
>>> ds
<xarray.Dataset> Size: 80B
Dimensions:  (x: 5)
Coordinates:
  * x        (x) interval[int64, right] 80B (0, 1] (1, 2] (2, 3] (3, 4] (4, 5]
Data variables:
    *empty*

The main reason to prefer an xarray IntervalIndex, apart from better compatibility with CF-conventions, is the possibility to define explicit positions for central values (possibly different from mid-points) that are checked during alignment. If that's a useful "generic" feature then it's worth for IntervalIndex to live in Xarray itself. Otherwise, yes this should probably better live in cf-xarray (although one technical complication is wrapping the pd.IntervalIndex as the boundary coordinate data, since xarray.core.indexing.PandasIndexAdapter is not public and we need to subclass it).

benbovy added 3 commits May 9, 2025 11:19
Grouping the logic into one method will make it easier overriding the
behavior in subclasses (interval index) without affecting much
readability. Also it yield more DRY code.
So we can wrap pd.IntervalIndex in a boundary 2-dimensional coordinate
variable.
@benbovy benbovy mentioned this pull request May 9, 2025
4 tasks
Prevent numpy.dtype conversions or castings implemented in various
places, gather the logic into one method.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Design for IntervalIndex
2 participants