diff --git a/intermediate/01-high-level-computation-patterns.ipynb b/intermediate/01-high-level-computation-patterns.ipynb index 77a0a3ad..ba429693 100644 --- a/intermediate/01-high-level-computation-patterns.ipynb +++ b/intermediate/01-high-level-computation-patterns.ipynb @@ -10,7 +10,24 @@ "tags": [] }, "source": [ - "# High-level computational patterns\n" + "# Computational Patterns\n", + "\n", + "Often when writing code we repeat certain patterns, whether we realize it or not.\n", + "If you have learned to write list comprehensions, you are taking advantage of a \"control pattern\".\n", + "Often, these patterns are so common that many packages have built in functions to implement them.\n", + "\n", + "Quoting the [toolz documentation](https://toolz.readthedocs.io/en/latest/control.html):\n", + "\n", + "> The Toolz library contains dozens of patterns like map and groupby. Learning a\n", + "> core set (maybe a dozen) covers the vast majority of common programming tasks\n", + "> often done by hand. A rich vocabulary of core control functions conveys the\n", + "> following benefits:\n", + ">\n", + "> - You identify new patterns\n", + "> - You make fewer errors in rote coding\n", + "> - You can depend on well tested and benchmarked implementations\n", + "\n", + "The same is true for xarray." ] }, { @@ -25,18 +42,10 @@ "source": [ "## Motivation / Learning goals\n", "\n", - "From https://toolz.readthedocs.io/en/latest/control.html\n", - "\n", - "> The Toolz library contains dozens of patterns like map and groupby. Learning a\n", - "> core set (maybe a dozen) covers the vast majority of common programming tasks\n", - "> often done by hand. A rich vocabulary of core control functions conveys the\n", - "> following benefits:\n", - ">\n", - "> - You identify new patterns\n", - "> - You make fewer errors in rote coding\n", - "> - You can depend on well tested and benchmarked implementations\n", - "\n", - "The same is true for xarray\n" + "- Learn what high-level computational patterns are available in Xarray\n", + "- Identify when you are re-implementing a high-level computational pattern\n", + "- Implement that pattern using built-in Xarray functionality\n", + "- Understand the difference between `map` and `reduce`" ] }, { @@ -62,16 +71,21 @@ "analysis tasks.\n", "\n", "1. `rolling` :\n", - " [Operate on rolling windows of your data e.g. running mean](https://docs.xarray.dev/en/stable/user-guide/computation.html#rolling-window-operations)\n", + " [Operate on rolling windows of your data e.g. running mean.](https://docs.xarray.dev/en/stable/user-guide/computation.html#rolling-window-operations)\n", "1. `coarsen` :\n", - " [Downsample your data](https://docs.xarray.dev/en/stable/user-guide/computation.html#coarsen-large-arrays)\n", + " [Downsample your data.](https://docs.xarray.dev/en/stable/user-guide/computation.html#coarsen-large-arrays)\n", "1. `groupby` :\n", - " [Bin data in to groups and reduce](https://docs.xarray.dev/en/stable/groupby.html)\n", + " [Bin data in to groups and reduce.](https://docs.xarray.dev/en/stable/groupby.html)\n", "1. `groupby_bins`: GroupBy after discretizing a numeric variable.\n", "1. `resample` :\n", - " [Groupby specialized for time axes. Either downsample or upsample your data.](https://docs.xarray.dev/en/stable/user-guide/time-series.html#resampling-and-grouped-operations)\n", - "1. `weighted` :\n", - " [Weight your data before reducing](https://docs.xarray.dev/en/stable/user-guide/computation.html#weighted-array-reductions)\n" + " [GroupBy specialized for time axes. Either downsample or upsample your data.](https://docs.xarray.dev/en/stable/user-guide/time-series.html#resampling-and-grouped-operations)\n", + "1. `weighted` : [Weight your data before reducing.](https://docs.xarray.dev/en/stable/user-guide/computation.html#weighted-array-reductions), as in [this tutorial](https://tutorial.xarray.dev/fundamentals/03.4_weighted.html).\n", + "\n", + "\n", + "\n", + "```{Note}\n", + "the documentation links in this tutorial point to the DataArray implementations of each function, but they are also available for DataSet objects.\n", + "```\n" ] }, { @@ -84,7 +98,7 @@ "tags": [] }, "source": [ - "## Load example dataset\n" + "### Load example dataset\n" ] }, { @@ -102,7 +116,111 @@ "da = xr.tutorial.load_dataset(\"air_temperature\", engine=\"netcdf4\").air\n", "monthly = da.resample(time=\"M\").mean()\n", "data = da.isel(time=0)\n", - "data.plot()" + "data.plot();" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "edc35fa6", + "metadata": {}, + "outputs": [], + "source": [ + "da" + ] + }, + { + "cell_type": "markdown", + "id": "6ff7edbb-ab97-4bf0-881a-0627230565f3", + "metadata": {}, + "source": [ + "***\n", + "\n", + "### Identifying high-level computation patterns\n", + "\n", + "*or, when should I use these functions?*\n", + "\n", + "Consider a common use case. We want to complete some \"task\" for each of \"something\". The \"task\" might be a computation (e.g. mean, median, plot). The \"something\" could be a group of array values (e.g. pixels) or segments of time (e.g. monthly or seasonally).\n", + "\n", + "Often, our solution to this type of problem is to write a for loop. Say we want the average air temperature for each month across the entire domain (all lat and lon values):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "70159772", + "metadata": {}, + "outputs": [], + "source": [ + "months = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]\n", + "avg_temps = []\n", + "\n", + "for mon in months:\n", + " avg = da[da[\"time.month\"] == mon].mean()\n", + " avg_temps.append(float(avg.data))\n", + "\n", + "print(avg_temps)" + ] + }, + { + "cell_type": "markdown", + "id": "d3a992bf", + "metadata": {}, + "source": [ + "An easy conceptual next step for this example (but still using our for loop) would be to use Xarray's `groupby` function to create an iterator that does the work of grouping our data by month and looping over each month." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6f1b23fa", + "metadata": {}, + "outputs": [], + "source": [ + "avg_temps = []\n", + "\n", + "for label, group in da.groupby(\"time.month\"):\n", + " avg_temps.append(float(group.mean().data))\n", + "\n", + "print(avg_temps)" + ] + }, + { + "cell_type": "markdown", + "id": "c1772b16", + "metadata": {}, + "source": [ + "Writing a for-loop here is not wrong, but it can quickly become cumbersome if you have a complex function to apply and it will take awhile to compute on a large dataset (you may even run out of memory). Parallelizing the computation would take a lot of additional work.\n", + "\n", + "Xarray's functionality instead allows us to do the same computation in one line of code (plus, the computation is optimized and ready to take advantage of parallel compute resources)!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c53fda41", + "metadata": {}, + "outputs": [], + "source": [ + "avg_temps = da.groupby(\"time.month\").mean(...) # note the use of the ellipses here\n", + "print(avg_temps.data)" + ] + }, + { + "cell_type": "markdown", + "id": "4f548b71", + "metadata": {}, + "source": [ + "Here we showed an example for computing a mean over a certain period of time (months), which ultimately uses the `GroupBy` function. The transition from loops to a built-in function is similar for `rolling` and `coarsen` over windows of values (e.g. pixels) instead of \"groups\" of time.\n", + "\n", + "Read on through this tutorial to learn some of the incredible ways to use Xarray to avoid writing long for-loops and efficiently complete computational analyses on your data.\n", + "\n", + "```{note}\n", + "By default, `da.mean()` (and `df.mean()`) will calculate the mean by reducing your data over all dimensions (unless you specify otherwise using the `dim` kwarg). The default behavior of `.mean()` on a groupby is to calculate the mean over all dimensions of the variable you are grouping by - but not all the dimensions of the object you are operating on. To compute the mean across all dimensions of a groupby, we must specify `...` for all dimensions (or use the `dim` kwarg to specify which dimensions to reduce by).\n", + "\n", + "```\n", + "\n", + "For a more complex example (identifying flood events - including their start and end date - from rainfall data) illustrating the transition from for loops to high level computation tools, see [this discussion](https://github.com/pydata/xarray/discussions/7641). The [original 40 lines of code](https://github.com/pydata/xarray/discussions/7641#discussion-4976005), including nested for loops, was streamlined into a ~15 line workflow without any loops." ] }, { @@ -115,14 +233,9 @@ "tags": [] }, "source": [ - "---\n", + "***\n", "\n", - "## Concept: \"index space\" vs \"label space\"\n", - "\n", - "These are windowed operations with a window of a fixed size.\n", - "\n", - "- `rolling`: sliding window operations e.g. running mean\n", - "- `coarsen`: decimating; reshaping\n" + "### Concept refresher: \"index space\" vs \"label space\"\n" ] }, { @@ -198,19 +311,23 @@ "id": "e9b80381-8a0d-4833-97fa-687bf693ca5a", "metadata": {}, "source": [ - "---\n", + "***\n", "\n", "## Xarray provides high-level patterns in both \"index space\" and \"label space\"\n", "\n", "### Index space\n", "\n", + "These are windowed operations with a window of a fixed size.\n", + "\n", "1. `rolling` :\n", - " [Operate on rolling windows of your data e.g. running mean](https://docs.xarray.dev/en/stable/user-guide/computation.html#rolling-window-operations)\n", + " [Operate on rolling (sliding) windows of your data e.g. running mean](https://docs.xarray.dev/en/stable/user-guide/computation.html#rolling-window-operations)\n", "1. `coarsen` :\n", - " [Downsample your data](https://docs.xarray.dev/en/stable/user-guide/computation.html#coarsen-large-arrays)\n", + " [Downsample your data (decimating, reshaping)](https://docs.xarray.dev/en/stable/user-guide/computation.html#coarsen-large-arrays)\n", "\n", "### Label space\n", "\n", + "These are windowed operations with irregular windows based on your data.\n", + "\n", "1. `groupby` :\n", " [Bin data in to groups and reduce](https://docs.xarray.dev/en/stable/groupby.html)\n", "1. `groupby_bins`: GroupBy after discretizing a numeric variable.\n", @@ -228,11 +345,13 @@ "tags": [] }, "source": [ + "add some \"loop\" versions to show what a user might come up with that could be turned into one of these pattern operations\n", + "\n", "---\n", "\n", "## Index space: windows of fixed width\n", "\n", - "### Sliding windows of fixed length: `rolling`\n", + "### Sliding windows of fixed length: [`rolling`](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.rolling.html)\n", "\n", "- returns object of same shape as input\n", "- pads with NaNs to make this happen\n", @@ -248,7 +367,7 @@ "metadata": {}, "outputs": [], "source": [ - "data.plot()" + "data.plot();" ] }, { @@ -266,7 +385,7 @@ "metadata": {}, "outputs": [], "source": [ - "data.rolling(lat=5, lon=5, center=True).mean().plot()" + "data.rolling(lat=5, lon=5, center=True).mean().plot();" ] }, { @@ -281,48 +400,58 @@ "source": [ "#### Apply an existing numpy-only function with `reduce`\n", "\n", - "Tip: The `reduce` method expects a function that can receive and return plain\n", - "arrays (e.g. numpy). The `map` method expects a function that can receive and\n", - "return Xarray objects.\n", + "In some cases, we may want to apply a sliding window function using rolling that is not built in to Xarray. In these cases we can still leverage the sliding windows of rolling and apply our own function with [`reduce`](https://docs.xarray.dev/en/stable/generated/xarray.core.rolling.DataArrayRolling.reduce.html).\n", + "\n", + "```{tip}\n", + " The `reduce` method expects a function that can receive and return plain arrays (e.g. numpy), as in each of the \"windows\" provided by the rolling iterator. This is in contrast to the `map` method, which expects a function that can receive and return Xarray objects.\n", + "```\n", "\n", - "Here's an example function: `np.mean`\n" + "Here's an example function: [`np.ptp`](https://numpy.org/doc/stable/reference/generated/numpy.ptp.html).\n" ] }, { - "cell_type": "markdown", - "id": "9ef251aa-ce3c-4318-95ba-470568ebd967", + "cell_type": "code", + "execution_count": null, + "id": "b1610220", "metadata": {}, + "outputs": [], "source": [ - "**Exercise** Calculate the rolling mean in 5 point bins along both latitude and\n", - "longitude using\n", - "[`rolling(...).reduce`](https://docs.xarray.dev/en/stable/generated/xarray.core.rolling.DataArrayRolling.reduce.html)\n" + "data.rolling(lat=5, lon=5, center=True).reduce(np.ptp).plot();" ] }, { "cell_type": "markdown", - "id": "75397b3d-5961-4924-b688-23520b79aae8", + "id": "9ef251aa-ce3c-4318-95ba-470568ebd967", "metadata": {}, "source": [ - "**Answer:**\n" + "```{exercise}\n", + ":label: rolling-reduce\n", + "\n", + "Calculate the rolling mean in 5 point bins along both latitude and longitude using\n", + "[`rolling(**kwargs).reduce`](https://docs.xarray.dev/en/stable/generated/xarray.core.rolling.DataArrayRolling.reduce.html)\n", + "\n", + "```" ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "id": "a36cbf94-ed41-42c6-8ccf-9278927d395b", "metadata": { - "jupyter": { - "source_hidden": true - }, "slideshow": { "slide_type": "subslide" }, "tags": [] }, - "outputs": [], "source": [ + "````{solution} rolling-reduce\n", + ":class: dropdown\n", + "\n", + "```python\n", "# exactly equivalent to data.rolling(...).mean()\n", - "data.rolling(lat=5, lon=5, center=True).reduce(np.mean).plot()" + "data.rolling(lat=5, lon=5, center=True).reduce(np.mean).plot();\n", + "```\n", + "\n", + "````" ] }, { @@ -335,10 +464,9 @@ "tags": [] }, "source": [ - "#### For more complicated analysis, construct a new array with a new dimension.\n", + "#### View outputs from `rolling` operations with `construct`\n", "\n", - "Allows things like short-time fourier transform, spectrogram, windowed rolling\n", - "etc.\n" + "In the above examples, we plotted the outputs of our rolling operations. Xarray makes it easy to integrate the outputs from `rolling` directly into the DataArray using the [`construct`](https://docs.xarray.dev/en/stable/generated/xarray.core.rolling.DataArrayRolling.construct.html#xarray.core.rolling.DataArrayRolling.construct) method." ] }, { @@ -365,34 +493,30 @@ }, { "cell_type": "markdown", - "id": "0a23b9a9-076b-472d-b7a6-57083566a32d", + "id": "5d7562a7", "metadata": {}, "source": [ - "**Exercise** Calculate the 5 point running mean in time using\n", - "`rolling.construct`\n" + "Because `.construct()` only returns a \"view\" (not a copy) of the original data object (i.e. it is not operating \"in-place\"), in order to \"save\" the results you would need to rewrite the original object: `simple = simple.rolling(time=5, center=True).construct(\"window\")`." ] }, { "cell_type": "markdown", - "id": "fcd19bd6-0564-4b0e-b3cb-b5c31f88b4da", + "id": "5743ba77-def9-4b6f-a777-87014311253d", "metadata": {}, "source": [ - "**Answer**\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "80c80728-440a-43d9-957c-65bf111e710d", - "metadata": { - "jupyter": { - "source_hidden": true - }, - "tags": [] - }, - "outputs": [], - "source": [ - "(simple.rolling(time=5, center=True).construct(\"window\").mean(\"window\"))" + "```{exercise}\n", + ":label: rolling-construct\n", + "Calculate the 5 point running mean in time and add it to your DataArray using `rolling.construct`\n", + "```\n", + "\n", + "````{solution} rolling-construct\n", + ":class: dropdown\n", + "\n", + "```python\n", + "simple.rolling(time=5, center=True).construct(\"window\").mean(\"window\")\n", + "```\n", + "\n", + "````\n" ] }, { @@ -402,10 +526,8 @@ "source": [ "`construct` is clever.\n", "\n", - "1. It constructs a **view** of the original array, so it is memory-efficient.\n", - " but you didn't have to know that.\n", - "1. It does something sensible for dask arrays (though generally you want big\n", - " chunksizes for the dimension you're sliding along).\n", + "1. It constructs a [**view**](https://numpy.org/doc/stable/user/basics.copies.html) of the original array, so it is memory-efficient.\n", + "1. It does something sensible for dask arrays (though generally you want big chunksizes for the dimension you're sliding along).\n", "1. It also works with rolling along multiple dimensions!\n" ] }, @@ -418,7 +540,7 @@ "source": [ "#### Advanced: Another `construct` example\n", "\n", - "This is a 2D rolling example; we need to provide two new dimension names\n" + "This is a 2D rolling example; we need to provide two new dimension names.\n" ] }, { @@ -428,7 +550,7 @@ "metadata": {}, "outputs": [], "source": [ - "(data.rolling(lat=5, lon=5, center=True).construct(lat=\"lat_roll\", lon=\"lon_roll\"))" + "data.rolling(lat=5, lon=5, center=True).construct(lat=\"lat_roll\", lon=\"lon_roll\")" ] }, { @@ -441,13 +563,11 @@ "tags": [] }, "source": [ - "---\n", + "***\n", "\n", "### Block windows of fixed length: `coarsen`\n", "\n", - "For non-overlapping windows or \"blocks\" use `coarsen`. The syntax is very\n", - "similar to `rolling`. You will need to specify `boundary` if the length of the\n", - "dimension is not a multiple of the block size\n" + "For non-overlapping windows or \"blocks\" use [`coarsen`](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.coarsen.html). The syntax is very similar to `rolling`. You will need to specify how you want Xarray to handle the `boundary` if the length of the dimension is not a multiple of the block size.\n" ] }, { @@ -467,7 +587,7 @@ "metadata": {}, "outputs": [], "source": [ - "data.plot()" + "data.plot();" ] }, { @@ -477,7 +597,7 @@ "metadata": {}, "outputs": [], "source": [ - "data.coarsen(lat=5, lon=5, boundary=\"trim\").std()" + "data.coarsen(lat=5, lon=5, boundary=\"trim\").mean()" ] }, { @@ -500,10 +620,12 @@ "tags": [] }, "source": [ - "#### coarsen supports `reduce` for custom reductions\n", + "#### Coarsen supports `reduce` for custom reductions\n", "\n", - "**Exercise** Use `coarsen.reduce` to apply `np.mean` in 5x5 (latxlon) point\n", - "blocks of `data`\n" + "```{exercise}\n", + ":label: coarsen-reduce\n", + "Use `coarsen.reduce` to apply `np.ptp` in 5x5 (lat x lon) point blocks to `data`\n", + "```" ] }, { @@ -511,22 +633,14 @@ "id": "4f88d113-86d1-4158-b4e7-f54f98af3c0c", "metadata": {}, "source": [ - "**Answer**\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "446c773f-59e4-4b7a-86bd-fd7d40e223e6", - "metadata": { - "jupyter": { - "source_hidden": true - }, - "tags": [] - }, - "outputs": [], - "source": [ - "(data.coarsen(lat=5, lon=5, boundary=\"trim\").reduce(np.mean).plot())" + "````{solution} coarsen-reduce\n", + ":class: dropdown\n", + "\n", + "```python\n", + "data.coarsen(lat=5, lon=5, boundary=\"trim\").reduce(np.mean).plot();\n", + "```\n", + "\n", + "````\n" ] }, { @@ -539,12 +653,11 @@ "tags": [] }, "source": [ - "#### coarsen supports `construct` for block reshaping\n", + "#### Coarsen supports `construct` for block reshaping and storing outputs\n", "\n", "This is usually a good alternative to `np.reshape`\n", "\n", - "A simple example splits a 2-year long monthly 1D time series into a 2D array\n", - "shaped (year x month)\n" + "A simple example splits a 2-year long monthly 1D time series into a 2D array shaped (year x month)\n" ] }, { @@ -584,8 +697,7 @@ "1. The new dimensions `year` and `month` don't have any coordinate labels\n", " associated with them.\n", "\n", - "What if the data had say 23 instead of 24 values? In that case we specify a\n", - "different `boundary` , here we pad to 24 values.\n" + "What if the data had say 23 instead of 24 values (`months.isel(time=slice(1, None)`)? In that case we specify a different `boundary` (the default `boundary=\"exact\"` worked above); here we pad to 24 values.\n" ] }, { @@ -603,7 +715,7 @@ "id": "f4e90b49-42e4-411f-9148-bcaf145de26c", "metadata": {}, "source": [ - "This ends up adding values at the end of the array, not so sensible for this\n", + "This adds values at the end of the array (see the 'nan' at the end of the time coordinate?), which is not so sensible for this\n", "problem. We have some control of the padding through the `side` kwarg to `coarsen`. For `side=\"right\"` we get more sensible output." ] }, @@ -624,7 +736,7 @@ "id": "8174aad1-d6e1-4772-bf23-91e363a92c19", "metadata": {}, "source": [ - "Note that `coarsen` pads with NaNs. For more control over paddnig, use\n", + "Note that `coarsen` pads with NaNs. For more control over padding, use\n", "[DataArray.pad](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.pad.html) explicitly." ] }, @@ -645,36 +757,36 @@ }, { "cell_type": "markdown", - "id": "db43eb72-fb9f-4d6a-aab3-4617c9c41ab1", + "id": "fbe916a3", "metadata": {}, "source": [ - "**Exercise** Reshape the `time` dimension of the DataArray `monthly` to year x\n", - "month and visualize the seasonal cycle for two years at 250°E\n" + "```{note}\n", + "The value specified in `.pad` only applies the `fill_value` to the array, not to coordinate variables.\n", + "This is why the first value of time in the above example is NaN and not -1.\n", + "```" ] }, { "cell_type": "markdown", - "id": "b668514e-b40c-4c64-98bf-4579747ae6ab", + "id": "db43eb72-fb9f-4d6a-aab3-4617c9c41ab1", "metadata": {}, "source": [ - "**Answer**\n" + "```{exercise}\n", + ":label: reshape\n", + "Reshape the `time` dimension of the DataArray `monthly` to year x\n", + "month and visualize the seasonal cycle for two years at 250°E\n", + "```\n" ] }, { - "cell_type": "code", - "execution_count": null, - "id": "d01c6873-dc67-4d8b-928a-ad4f834429fa", - "metadata": { - "jupyter": { - "source_hidden": true - }, - "slideshow": { - "slide_type": "subslide" - }, - "tags": [] - }, - "outputs": [], + "cell_type": "markdown", + "id": "b668514e-b40c-4c64-98bf-4579747ae6ab", + "metadata": {}, "source": [ + "````{solution} reshape\n", + ":class: dropdown\n", + "\n", + "```python\n", "# splits time dimension into year x month\n", "year_month = monthly.coarsen(time=12).construct(time=(\"year\", \"month\"))\n", "\n", @@ -698,7 +810,10 @@ "year_month[\"year\"] = [2013, 2014]\n", "\n", "# seasonal cycle for two years\n", - "year_month.sel(lon=250).plot.contourf(col=\"year\", x=\"month\", y=\"lat\")" + "year_month.sel(lon=250).plot.contourf(col=\"year\", x=\"month\", y=\"lat\")\n", + "```\n", + "\n", + "````\n" ] }, { @@ -706,29 +821,33 @@ "id": "4de2984e-9c28-4ed7-909f-bab47b6eae49", "metadata": {}, "source": [ - "This exercise came up during the live lecture.\n", + "This exercise came up during a live lecture.\n", "\n", - "**Exercise** Calculate the rolling 4 month average, averaged across years.\n", - "\n", - "**Answer**\n", - "\n", - "1. We first reshape using `coarsen.construct` to add `year` as a new dimension.\n", - "2. Then `rolling` on the month dimension.\n", - "3. It turns out that `roll.mean([\"year\", \"month\"])` doesn't work. So we use\n", - " `roll.construct` to get a DataArray with a new dimension `window` and then\n", - " take the mean over `window` and `year`\n" + "```{exercise}\n", + ":label: rolling\n", + "Calculate the rolling 4 month average, averaged across years.\n", + "```\n" ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "id": "9d907b2b-c9c8-41cb-8af0-756d5c21ffef", "metadata": {}, - "outputs": [], "source": [ + "````{solution} rolling\n", + ":class: dropdown\n", + "\n", + "1. We first reshape using `coarsen.construct` to add `year` as a new dimension.\n", + "2. Apply `rolling` on the month dimension.\n", + "3. It turns out that `roll.mean([\"year\", \"month\"])` doesn't work. So we use `roll.construct` to get a DataArray with a new dimension `window` and then take the mean over `window` and `year`\n", + "\n", + "```python\n", "reshaped = months.coarsen(time=12).construct(time=(\"year\", \"month\"))\n", "roll = reshaped.rolling(month=4, center=True)\n", - "roll.construct(\"window\").mean([\"window\", \"year\"])" + "roll.construct(\"window\").mean([\"window\", \"year\"])\n", + "```\n", + "\n", + "````" ] }, { @@ -757,33 +876,46 @@ "tags": [] }, "source": [ - "---\n", + "***\n", "\n", "## Label space \"windows\" or bins : GroupBy\n", "\n", - "Generalization of `coarsen`: sometimes the windows you want are not regular.\n", + "Sometimes the windows you want are not regularly spaced or even defined by a grid.\n", + "For instance, grouping data by month (which have varying numbers of days) or the results of an image classification.\n", + "The GroupBy functions are essentially a generalization of `coarsen`: \n", + "\n", + "- `groupby`: divide data into distinct groups, e.g. climatologies, composites. Works when \"groups\" are exact and can be determined using equality (`==`), e.g. characters or integers. Remember that floats are not exact values.\n", + "- `groupby_bins`: Use binning operations, e.g. histograms, to group your data.\n", + "- `resample`: Specialized implementation of GroupBy specifically for time grouping (so far)\n", "\n", - "- `groupby`: e.g. climatologies, composites; works when \"groups\" are exact: e.g.\n", - " characters or integers; not floats\n", - "- `groupby_bins`: binning operations e.g. histograms\n", - "- `resample`: groupby but specialized for time grouping (so far)\n", "\n", - "**tip** Both `groupby_bins` and `resample` are implemented as `GroupBy` with a\n", - "specific way of constructing group labels.\n", + "```{hint}\n", + " Both `groupby_bins` and `resample` are implemented as `GroupBy` with a specific way of constructing group labels.\n", + "```\n", + "\n", "\n", "### Deconstructing GroupBy\n", "\n", - "Commonly called \"split-apply-combine\".\n", + "The GroupBy workflow is commonly called \"split-apply-combine\".\n", "\n", "1. \"split\" : break dataset into groups\n", - "1. \"apply\" : apply an operation, usually a reduction like `mean`\n", - "1. \"combine\" : concatenate results from apply step along new \"group\" dimension\n", + "1. \"apply\" : apply an operation, for instance a reduction like `mean`\n", + "1. \"combine\" : concatenate results from apply step along a new \"group\" dimension\n", "\n", - "But really there is a first step: \"identifying groups\" also called\n", - "\"factorization\" (or \"binning\"). Usually this is the hard part.\n", + "But really there is a \"hidden\" first step: identifying groups (also called factorization or binning). Usually this is the hard part.\n", "\n", - "So \"identify groups\" → \"split into groups\" → \"apply function\" → \"combine\n", - "results\".\n" + "In reality the workflow is: \"identify groups\" → \"split into groups\" → \"apply function\" → \"combine results\".\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "55c5e475", + "metadata": {}, + "outputs": [], + "source": [ + "# recall our earlier DataArray\n", + "da" ] }, { @@ -793,6 +925,9 @@ "metadata": {}, "outputs": [], "source": [ + "# GroupBy returns an iterator that traverses the specified groups, here by month.\n", + "# Notice that groupby is clever enough for us to leave out the `.dt` before `.month`\n", + "# we would need to specify to access the month data directly, as in `da.time.dt.month`.\n", "da.groupby(\"time.month\")" ] }, @@ -803,6 +938,8 @@ "metadata": {}, "outputs": [], "source": [ + "# for each group (e.g. the air temperature in a given month for all the years),\n", + "# compute the mean\n", "da.groupby(\"time.month\").mean()" ] }, @@ -811,7 +948,9 @@ "id": "7a579539-1634-462c-b4d9-ea558fceadfb", "metadata": {}, "source": [ - "This is how xarray identifies \"groups\" for the monthly climatology computation\n" + "Notice that since we have averaged over all the years for each month, our resulting DataArray no longer has a \"year\" coordinate.\n", + "\n", + "If we want to see how Xarray identifies \"groups\" for the monthly climatology computation, we can plot our input to `groupby`. GroupBy is clever enough to figure out how many values there are an thus how many groups to make.\n" ] }, { @@ -821,7 +960,7 @@ "metadata": {}, "outputs": [], "source": [ - "da.time.dt.month.plot()" + "da[\"time.month\"].plot();" ] }, { @@ -829,7 +968,7 @@ "id": "a6d21727-4c15-4f13-ae53-61d5f4944554", "metadata": {}, "source": [ - "Similarly for binning,\n" + "Similarly for binning (remember this is useful when the parameter you are binning over is not \"exact\", like a float),\n" ] }, { @@ -860,6 +999,18 @@ "da.resample(time=\"M\")" ] }, + { + "cell_type": "markdown", + "id": "3763efb3", + "metadata": {}, + "source": [ + "```{note}\n", + "\n", + "Resampling is changing the frequency of our data to monthly (for two years), so we have 24 bins. GroupBy is taking the average across all data in the same month for two years, so we have 12 bins.\n", + "\n", + "```" + ] + }, { "cell_type": "markdown", "id": "0b2de08d-0b7b-4725-80f3-c94d19d91669", @@ -872,21 +1023,21 @@ "source": [ "### Constructing group labels\n", "\n", - "Xarray uses `pandas.factorize` for `groupby` and `pandas.cut` for\n", - "`groupby_bins`.\n", + "Xarray uses [`pandas.factorize`](https://pandas.pydata.org/docs/reference/api/pandas.factorize.html) for `groupby` and [`pandas.cut`](https://pandas.pydata.org/docs/reference/api/pandas.cut.html) for `groupby_bins`.\n", "\n", - "If the automatic group detection doesn't work for your problem then these\n", - "functions are useful for constructing \"group labels\" in many cases\n", + "#### Functions to construct group labels\n", + "If the automatic group detection doesn't work for your problem then these functions are useful for constructing specific \"group labels\" in many cases\n", "\n", "1. [numpy.digitize](https://numpy.org/doc/stable/reference/generated/numpy.digitize.html)\n", - " (binning)\n", + " for binning\n", "1. [numpy.searchsorted](https://numpy.org/doc/stable/reference/generated/numpy.searchsorted.html)\n", " supports many other data types\n", "1. [pandas.factorize](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.factorize.html)\n", " supports characters, strings etc.\n", "1. [pandas.cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html)\n", " for binning\n", - "1. [DataArray.isin](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.isin.html)\n" + "1. [DataArray.isin](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.isin.html)\n", + "1. [scipy.ndimage.label](https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.label.html)" ] }, { @@ -899,12 +1050,12 @@ "tags": [] }, "source": [ - "#### More commonly useful are [\"datetime components\"](https://docs.xarray.dev/en/stable/user-guide/time-series.html#datetime-components)\n", + "#### [\"Datetime components\"](https://docs.xarray.dev/en/stable/user-guide/time-series.html#datetime-components) for creating groups\n", "\n", "See a full list\n", "[here](https://docs.xarray.dev/en/stable/generated/xarray.core.accessor_dt.DatetimeAccessor.html?highlight=DatetimeAccessor)\n", "\n", - "Accessed using `DataArray.dt.*`\n" + "These can be accessed in a few different ways as illustrated below.\n" ] }, { @@ -954,10 +1105,11 @@ "id": "db7bd7e6-59cd-4b2a-ac37-2ff4d40d9fc8", "metadata": {}, "source": [ - "**Demo** Grouping over a custom definition of seasons using numpy.isin.\n", + "#### Construct and use custom labels\n", + "\n", + "##### Custom seasons with `numpy.isin`.\n", "\n", - "We want to group over 4 seasons: `DJF`, `MAM`, `JJAS`, `ON` - this makes\n", - "physical sense in the Indian Ocean basin\n", + "We want to group over four seasons: `DJF`, `MAM`, `JJAS`, `ON` - this makes physical sense in the Indian Ocean basin.\n", "\n", "Start by extracting months.\n" ] @@ -988,8 +1140,8 @@ "metadata": {}, "outputs": [], "source": [ - "season = np.full(month.shape, \" \")\n", - "season" + "myseason = np.full(month.shape, \" \")\n", + "myseason" ] }, { @@ -1007,12 +1159,29 @@ "metadata": {}, "outputs": [], "source": [ - "season[np.isin(month, [12, 1, 2])] = \"DJF\"\n", - "season[np.isin(month, [3, 4, 5])] = \"MAM\"\n", - "season[np.isin(month, [6, 7, 8, 9])] = \"JJAS\"\n", - "season[np.isin(month, [10, 11])] = \"ON\"\n", - "season = da.time.copy(data=season)\n", - "season" + "myseason[np.isin(month, [12, 1, 2])] = \"DJF\"\n", + "myseason[np.isin(month, [3, 4, 5])] = \"MAM\"\n", + "myseason[np.isin(month, [6, 7, 8, 9])] = \"JJAS\"\n", + "myseason[np.isin(month, [10, 11])] = \"ON\"" + ] + }, + { + "cell_type": "markdown", + "id": "297f4d2f", + "metadata": {}, + "source": [ + "Turn our new seasonal group array into a DataArray." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6a72a117", + "metadata": {}, + "outputs": [], + "source": [ + "myseason_da = da.time.copy(data=myseason)\n", + "myseason_da" ] }, { @@ -1024,7 +1193,7 @@ "source": [ "(\n", " # Calculate climatology\n", - " da.groupby(season)\n", + " da.groupby(myseason_da)\n", " .mean()\n", " # reindex to get seasons in logical order (not alphabetical order)\n", " .reindex(time=[\"DJF\", \"MAM\", \"JJAS\", \"ON\"])\n", @@ -1042,9 +1211,9 @@ "tags": [] }, "source": [ - "#### `floor`, `ceil` and `round` time\n", + "##### `floor`, `ceil` and `round` on time\n", "\n", - "Basically \"resampling\"\n" + "Additional functionality in the [datetime accessor](https://docs.xarray.dev/en/stable/generated/xarray.core.accessor_dt.DatetimeAccessor.html) allows us to effectively \"resample\" our time data to remove roundoff errors in timestamps.\n" ] }, { @@ -1079,11 +1248,11 @@ "tags": [] }, "source": [ - "#### `strftime` can be extremely useful\n", + "##### `strftime` is another powerful option\n", "\n", "So useful and so unintuitive that it has its own website: https://strftime.org/\n", "\n", - "This example avoids merging \"Feb-29\" and \"Mar-01\" for a daily climatology\n" + "This is useful to avoid merging \"Feb-29\" and \"Mar-01\" for a daily climatology\n" ] }, { @@ -1103,9 +1272,9 @@ "tags": [] }, "source": [ - "### groupby supports `reduce` for custom reductions\n", + "### Custom reductions with GroupBy\n", "\n", - "This applies to `groupby_bins` and `resample`\n" + "Analogous to `rolling`, `reduce` and `map` apply custom reductions to `groupby_bins` and `resample`.\n" ] }, { @@ -1115,7 +1284,7 @@ "metadata": {}, "outputs": [], "source": [ - "(da.groupby(\"time.month\").reduce(np.mean).plot(col=\"month\", col_wrap=4))" + "(da.groupby(\"time.month\").reduce(np.ptp).plot(col=\"month\", col_wrap=4))" ] }, { @@ -1123,9 +1292,9 @@ "id": "7cd7ede5-8e57-4099-ab39-b9d75427f125", "metadata": {}, "source": [ - "**tip** `map` is for functions that expect and return xarray objects (see also\n", - "`Dataset.map`). `reduce` is for functions that expect and return plain arrays\n", - "(like numpy or scipy functions)\n" + "```{tip}\n", + " `map` is for functions that expect and return xarray objects (see also [`Dataset.map`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.map.html)). `reduce` is for functions that expect and return plain arrays (like Numpy or SciPy functions).\n", + "```\n" ] }, { @@ -1135,14 +1304,15 @@ "tags": [] }, "source": [ - "### GroupBy does not provide construct\n", + "### Adding GroupBy outputs to your DataArray or DataSet\n", + "\n", + "GroupBy does not provide a `construct` method, because all the groups need not be the same \"length\" (e.g. months can have 28, 29, 30, or 31 days).\n", "\n", - "All the groups need not be the same \"length\" (e.g. months can have 28, 29, 30,\n", - "or 31 days)\n", + "#### Instead looping over groupby objects is possible\n", "\n", - "### Instead looping over groupby objects is possible\n", + "Because `groupby` returns an iterator that loops over each group, it is easy to loop over groupby objects. You can also iterate over `rolling` and `coarsen` objects, however this approach is usually quite slow.\n", "\n", - "Maybe you want to plot data in each group separately?\n" + "Maybe you want to plot data in each group separately:\n" ] }, { @@ -1161,7 +1331,7 @@ "id": "8017d842-ff79-47ec-928d-43e3cf4e7b66", "metadata": {}, "source": [ - "This is a DataArray contain data for all December days\n" + "This is a DataArray containing data for all December days (because the last printed `label` value is `12`, so the last `group` value is for December)." ] }, { @@ -1194,6 +1364,24 @@ "group.plot.hist()" ] }, + { + "cell_type": "markdown", + "id": "d339c52c", + "metadata": {}, + "source": [ + "Remember, this example is just to show how you could operate on each group object in a groupby operation. If we wanted to just explore the December (or March) data, we should just filter for it directly:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0c4fd9b2", + "metadata": {}, + "outputs": [], + "source": [ + "da[da[\"time.month\"] == 12].plot.hist()" + ] + }, { "cell_type": "markdown", "id": "32dfe5fd-0e8f-4b69-a3c1-03f73c484b6b", @@ -1201,11 +1389,9 @@ "tags": [] }, "source": [ - "### In most cases, avoid a for loop using `map`\n", + "#### In most cases, avoid a for loop using `map`\n", "\n", - "Apply functions that expect xarray Datasets or DataArrays.\n", - "\n", - "Avoid having to manually combine results using concat\n" + "`map` enables us to apply functions that expect xarray Datasets or DataArrays. This makes it easy to perform calculations on the grouped data, add the results from each group back to the original object, and avoid having to manually combine results (using concat).\n" ] }, { @@ -1217,9 +1403,9 @@ }, "outputs": [], "source": [ - "def iqr(da, dim):\n", + "def iqr(gb_da, dim):\n", " \"\"\"Calculates interquartile range\"\"\"\n", - " return (da.quantile(q=0.75, dim=dim) - da.quantile(q=0.25, dim=dim)).rename(\"iqr\")\n", + " return (gb_da.quantile(q=0.75, dim=dim) - gb_da.quantile(q=0.25, dim=dim)).rename(\"iqr\")\n", "\n", "\n", "da.groupby(\"time.month\").map(iqr, dim=\"time\")" @@ -1230,7 +1416,7 @@ "id": "3122e22a-77f0-402f-baf6-111821973250", "metadata": {}, "source": [ - "---\n" + "***" ] }, { @@ -1243,22 +1429,22 @@ "Xarray provides methods for high-level analysis patterns:\n", "\n", "1. `rolling` :\n", - " [Operate on rolling windows of your data e.g. running mean](https://docs.xarray.dev/en/stable/user-guide/computation.html#rolling-window-operations)\n", + " [Operate on rolling (fixed length, overlapping) windows of your data e.g. running mean.](https://docs.xarray.dev/en/stable/user-guide/computation.html#rolling-window-operations)\n", "1. `coarsen` :\n", - " [Downsample your data](https://docs.xarray.dev/en/stable/user-guide/computation.html#coarsen-large-arrays)\n", + " [Operate on blocks (fixed length) of your data (downsample).](https://docs.xarray.dev/en/stable/user-guide/computation.html#coarsen-large-arrays)\n", "1. `groupby` :\n", - " [Bin data in to groups and reduce](https://docs.xarray.dev/en/stable/groupby.html)\n", - "1. `groupby_bins`: GroupBy after discretizing a numeric variable.\n", + " [Parse data into groups (using an exact value) and operate on each one (reduce data).](https://docs.xarray.dev/en/stable/groupby.html)\n", + "1. `groupby_bins`: [GroupBy after discretizing a numeric (non-exact, e.g. float) variable.](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.groupby_bins.html)\n", "1. `resample` :\n", " [Groupby specialized for time axes. Either downsample or upsample your data.](https://docs.xarray.dev/en/stable/user-guide/time-series.html#resampling-and-grouped-operations)\n", - "1. `weighted` :\n", - " [Weight your data before reducing](https://docs.xarray.dev/en/stable/user-guide/computation.html#weighted-array-reductions)\n", + "1. [Weight your data before reducing.](https://docs.xarray.dev/en/stable/user-guide/computation.html#weighted-array-reductions)\n", "\n", - "## More resources\n", + "Xarray also provides a consistent interface to make using those patterns easy:\n", "\n", - "1. More tutorials here: https://tutorial.xarray.dev/\n", - "1. Answers to common questions on \"how to do X\" are here:\n", - " https://docs.xarray.dev/en/stable/howdoi.html\n" + "1. Iterate over the operators (`rolling`, `coarsen`, `groupby`, `groupby_bins`, `resample`).\n", + "1. Apply functions that accept numpy-like arrays with `reduce`.\n", + "1. Reshape to a new xarray object with `.construct` (`rolling`, `coarsen` only).\n", + "1. Apply functions that accept xarray objects with `map` (`groupby`, `groupby_bins`, `resample` only).\n" ] } ],