diff --git a/intermediate/xarray_ecosystem.ipynb b/intermediate/xarray_ecosystem.ipynb index 44d793a9..0a313bcf 100644 --- a/intermediate/xarray_ecosystem.ipynb +++ b/intermediate/xarray_ecosystem.ipynb @@ -10,26 +10,126 @@ "\n", "Xarray is easily extensible.\n", "This means it is easy to add onto to build custom packages that tackle particular computational problems.\n", - "Here we introduce two popular and widely used extensions that are installable as their own packages (via conda and pip).\n", "\n", + "These packages can plug in to xarray in various different ways. They may build directly on top of xarray, or they may take advantage of some of xarray's dedicated interfacing features:\n", + "- Accessors\n", + "- Backend (filetype) entrypoint\n", + "- Metadata attributes\n", + "- Duck-array wrapping interface\n", + "- Flexible indexes (coming soon!)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here we introduce several popular or interesting extensions that are installable as their own packages (via conda and pip). These packages integrate with xarray using one or more of the features mentioned above.\n", + "\n", + "- [hvplot](https://hvplot.holoviz.org/), a powerful interactive plotting library\n", "- [rioxarray](https://corteva.github.io/rioxarray/stable/index.html), for working with geospatial raster data using rasterio\n", + "- [cf-xarray](https://cf-xarray.readthedocs.io/en/latest/), for interpreting CF-compliant data\n", "- [pint-xarray](https://pint-xarray.readthedocs.io/en/latest/), for unit-aware computations using pint." ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import xarray as xr\n", + "import numpy as np" + ] + }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## The accessor interface\n", + "## Quick note on accessors" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "Before we look at the packages we need to briefly introduce a feature they commonly use: [\"xarray accessors\"](https://docs.xarray.dev/en/stable/internals/extending-xarray.html).\n", + "\n", + "The accessor-style syntax is used heavily by the other libraries we are about to cover.\n", "\n", - "- Describe what an accessor is before introducting `.rio`, `.pint` etc." + "For users accessors just allow us to have a method-like syntax on xarray objects, for example `da.hvplot()`, `da.pint.quantify()`, or `da.cf.describe()`.\n", + "\n", + "For developers who are interested in defining their own accessors, see the section on accessors at the bottom of this page." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## hvplot via accessors" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [HoloViews library](https://holoviews.org/) makes great use of accessors to allow seamless plotting of xarray data using a completely different plotting backend." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We first need to import the code that registers the hvplot accessor" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import hvplot.xarray" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And now we can call the `.hvplot` method to plot using holoviews in the same way that we would have used `.plot` to plot using matplotlib." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds = xr.tutorial.load_dataset(\"air_temperature\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds['air'].isel(time=1).hvplot(cmap=\"fire\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For some more examples of how powerful holoviews is [see here](https://tutorial.xarray.dev/intermediate/hvplot.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Rioxarray\n", + "## Rioxarray via the backend entrypoint\n", "more details about rioxarray here" ] }, @@ -39,7 +139,6 @@ "metadata": {}, "outputs": [], "source": [ - "import xarray as xr\n", "import rioxarray # this activates the rio accessor" ] }, @@ -47,9 +146,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "jupyter": { - "source_hidden": true - } + "tags": [] }, "outputs": [], "source": [ @@ -60,9 +157,227 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Pint\n", + "## cf-xarray via metadata attributes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Xarray objects can store [arbitrary metadata](https://docs.xarray.dev/en/stable/getting-started-guide/faq.html#what-is-your-approach-to-metadata) in the form of a `dict` attached to each `DataArray` and `Dataset` object, accessible via the `.attrs` property." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "xr.DataArray(name=\"Hitchhiker\", data=0, attrs={\"life\": 42, \"name\": \"Arthur Dent\"})" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Normally xarray operations ignore this metadata, simply carting it around until you explicitly choose to use it. However sometimes we might want to write custom code which makes use of the metadata." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[cf_xarray](https://cf-xarray.readthedocs.io/) is a project that tries to\n", + "let you make use of other [Climate and Forecast metadata convention attributes](http://cfconventions.org/) (or \"CF attributes\") that xarray ignores. It attaches itself\n", + "to all xarray objects under the `.cf` namespace.\n", + "\n", + "Where xarray allows you to specify dimension names for analysis, `cf_xarray`\n", + "lets you specify logical names like `\"latitude\"` or `\"longitude\"` instead as\n", + "long as the appropriate CF attributes are set.\n", "\n", - "more details about pint here\n" + "For example, the `\"longitude\"` dimension in different files might be labelled as: (lon, LON, long, x…), but cf_xarray let's you always refer to the logical name `\"longitude\"` in your code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import cf_xarray" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# describe cf attributes in dataset\n", + "ds.air.cf.describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following `mean` operation will work with any dataset that has appropriate\n", + "attributes set that allow detection of the \"latitude\" variable (e.g.\n", + "`units: \"degress_north\"` or `standard_name: \"latitude\"`)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# demonstrate equivalent of .mean(\"lat\")\n", + "ds.air.cf.mean(\"latitude\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# demonstrate indexing\n", + "ds.air.cf.sel(longitude=242.5, method=\"nearest\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pint via duck array wrapping" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Why use pint" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[Pint](https://pint.readthedocs.io/en/stable/) defines physical units, allowing you to work with numpy-like arrays which track the units of your values through computations." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Pint defines a numpy-like array class called `pint.Quantity`, which is made up of a numpy array and a `pint.Unit` instance." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from pint import Unit\n", + "\n", + "# you can create a pint.Quantity by multiplying a value by a pint.Unit\n", + "d = np.array(10) * Unit(\"metres\")\n", + "d" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "These units are automatically propagated through operations" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = 1 * Unit(\"seconds\")\n", + "v = d / t\n", + "v" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Or if the operation involves inconsistent units, a `pint.DimensionalityError` is raised." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [], + "source": [ + "d + t" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### pint inside xarray objects" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We have already seen that xarray can wrap numpy arrays or dask arrays, but in fact xarray can wrap any type of array that behaves similarly to a numpy array.\n", + "Using this feature we can store a `pint.Quantity` array inside an xarray DataArray" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "da = xr.DataArray(d)\n", + "da" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can see that the data type stored within the DataArray is a `Quantity` object, rather than just a `np.ndarray` object, and the units of the data are displayed in the repr." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The reason this works is that a `pint.Quantity` array is what we call a \"duck array\", in that it behaves so similarly to a `numpy.ndarray` that xarray can treat them the same way. (See [python duck typing](https://towardsdatascience.com/duck-typing-python-7aeac97e11f8))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### pint-xarray" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The convenience package [pint-xarray](https://pint-xarray.readthedocs.io/en/latest/index.html) makes it easier to get the benefits of pint whilst working with xarray objects.\n", + "\n", + "It provides utility accessor methods for promoting xarray data to pint quantities (which we call \"quantifying\") in various ways and for manipulating the resulting objects." ] }, { @@ -77,6 +392,254 @@ "\n", "xr.set_options(display_expand_data=False)" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "pint-xarray provides the `.pint` accessor, which firstly allows us to easily extract the units of our data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "da.pint.units" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `.pint.quantify()` accessor gives us various ways to convert normal xarray data to be unit-aware." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note: It is preferred to use `.pint.quantify()` to convert xarray data to use pint rather than explicitly creating a `pint.Quantity` array and placing it inside the xarray object, because pint-xarray will handle various subtleties involving dask etc." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can explicitly specify the units we want" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "da = xr.DataArray([4.5, 6.7, 3.8], dims=\"time\")\n", + "da.pint.quantify(\"V\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Or if the data has a \"units\" entry in its `.attrs` metadata dictionary, we can automatically use that to convert each variable.\n", + "\n", + "For example, the xarray tutorial dataset we opened earlier has units in its attributes" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds.air.attrs['units']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "which we can automatically read with `.pint.quantify()`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "quantified_air = ds.pint.quantify()\n", + "quantified_air" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here we have actually gone even further, and used cf-xarray to automatically interpret cf-compliant units in the `.attrs` to valid pint units.\n", + "This automatically happened just as a result of importing cf-xarray above." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When we plot quantified data with xarray the correct units will automatically appear on the plot" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "quantified_air[\"air\"].isel(time=500).plot()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we want to cast the pint arrays back to numpy arrays, we can use `.pint.dequantify()`, which will also write the current units back out to the `.attrs[\"units\"]` field" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "quantified_air.pint.dequantify()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Exercise:** Write a function which will raise an error if supplied with data in the wrong units." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from pint import DimensionalityError\n", + "\n", + "\n", + "def special_science_function(distance):\n", + " if distance.pint.units != \"miles\":\n", + " raise DimensionalityError(\n", + " \"this function will only give the correct answer if the input is in units of miles\"\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Exercise:** Try this on some of your data!\n", + "\n", + "After you have imported pint-xarray (and maybe cf-xarray) as above, start with something like\n", + "\n", + "`ds = xr.open_dataset(my_data).pint.quantify()`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Take a look at the [pint-xarray documentation](https://pint-xarray.readthedocs.io/en/latest/) or the [pint documentation](https://pint.readthedocs.io/en/stable/) if you get stuck." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Accessors part 2 - defining custom accessors" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "An accessor is a way of attaching a custom function to xarray types so that it can be called as if it were a method, but while retaining a clear separation between \"core\" xarray API and custom API." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For example, imagine you're a statistician who regularly uses a special `skewness` function which acts on dataarrays but is only of interest to people in your specific field.\n", + "\n", + "You can create a method which applies this skewness function to an xarray objects, and then register the method under a custom `stats` accessor like this" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from scipy.stats import skew\n", + "\n", + "\n", + "@xr.register_dataarray_accessor(\"stats\")\n", + "class StatsAccessor:\n", + " def __init__(self, da):\n", + " self._da = da\n", + "\n", + " def skewness(self, dim):\n", + " return self._da.reduce(func=skew, dim=dim)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can conveniently access this functionality via the `stats` accessor" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds['air'].stats.skewness(dim=\"time\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice how the presence of `.stats` clearly differentiates our new \"accessor method\" from core xarray methods." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## The wider world..." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There are many other libraries in the wider xarray ecosystem. For a list of a few packages we particularly like for geoscience work [see here](https://tutorial.xarray.dev/overview/xarray-in-45-min.html#other-cool-packages), and for a [more exhaustive list see here](https://docs.xarray.dev/en/stable/ecosystem.html)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": {