diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 42fcf502..a6b1f814 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -28,10 +28,6 @@ repos: hooks: - id: flake8 - - repo: https://github.com/asottile/seed-isort-config - rev: v2.2.0 - hooks: - - id: seed-isort-config - repo: https://github.com/PyCQA/isort rev: 5.12.0 hooks: diff --git a/_config.yml b/_config.yml index 3e3e184c..d5d27504 100644 --- a/_config.yml +++ b/_config.yml @@ -65,9 +65,10 @@ sphinx: # maintain old paths and redirect them (so google results dont go to 404) # https://github.com/wpilibsuite/sphinxext-rediraffe - sphinxext.rediraffe + - sphinx_exercise config: # application/vnd.holoviews_load.v0+json, application/vnd.holoviews_exec.v0+json - suppress_warnings: ['mystnb.unknown_mime_type'] + suppress_warnings: ['mystnb.unknown_mime_type', 'misc.highlighting_failure'] notfound_context: body: "
Sorry, this page doesn't exist. Many sections of this book have been updated recently.
Try the search box 🔎 to find what you're looking for!
" notfound_urls_prefix: / diff --git a/_toc.yml b/_toc.yml index 81c44f23..47e8d35c 100644 --- a/_toc.yml +++ b/_toc.yml @@ -47,8 +47,12 @@ parts: - file: advanced/apply_ufunc/apply_ufunc.md sections: - file: advanced/apply_ufunc/simple_numpy_apply_ufunc - - file: advanced/apply_ufunc/simple_dask_apply_ufunc - - file: advanced/apply_ufunc/vectorize_1d + - file: advanced/apply_ufunc/core-dimensions + - file: advanced/apply_ufunc/complex-output-numpy + - file: advanced/apply_ufunc/automatic-vectorizing-numpy + - file: advanced/apply_ufunc/dask_apply_ufunc + - file: advanced/apply_ufunc/numba-vectorization + - file: advanced/apply_ufunc/example-interp - file: advanced/map_blocks/map_blocks.md sections: - file: advanced/map_blocks/simple_map_blocks diff --git a/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb new file mode 100644 index 00000000..d88eb7bb --- /dev/null +++ b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb @@ -0,0 +1,359 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "6849dcdc-3484-4f41-8b23-51613d36812f", + "metadata": { + "tags": [] + }, + "source": [ + "(vectorize)=\n", + "# Automatic Vectorization" + ] + }, + { + "cell_type": "markdown", + "id": "afc56d28-6e55-4967-b27d-28e2cc539cc7", + "metadata": { + "tags": [] + }, + "source": [ + "Previously we looked at [applying functions](gentle-intro) on numpy arrays, and the concept of [core dimensions](core-dimensions).\n", + "We learned that functions commonly support specifying \"core dimensions\" through the `axis` keyword\n", + "argument. \n", + "\n", + "However, many functions exist that implicitly have core dimensions but do not provide an `axis` keyword\n", + "argument. Applying such functions to an nD array usually involves one or multiple loops over the other dimensions\n", + "--- termed \"loop dimensions\" or \"broadcast dimensions\".\n", + "\n", + "\n", + "A good example is numpy's 1D interpolate function `numpy.interp`:\n", + "\n", + "```\n", + "    Signature: np.interp(x, xp, fp, left=None, right=None, period=None)\n", + "    Docstring:\n", + "        One-dimensional linear interpolation.\n", + "\n", + "    Returns the one-dimensional piecewise linear interpolant to a function\n", + "    with given discrete data points (`xp`, `fp`), evaluated at `x`.\n", + "```\n", + "\n", + "This function expects 1D arrays as input, so there is one core dimension and we cannot easily apply \n", + "it to an nD array since there is no `axis` keyword argument. \n", + "\n", + "\n", + "Our goal here is to \n", + "1. Understand the difference between core dimensions and loop dimensions\n", + "1. Understand vectorization\n", + "1. Learn how to apply such functions without loops using `apply_ufunc` by providing the `vectorize` keyword argument.\n", + "\n", + "## Core dimensions and looping\n", + "\n", + "Let's say we want to\n", + "interpolate an array with two dimensions (`space`, `time`) over the `time` dimension. We might \n", + "1. Loop over the `space` dimension, \n", + "1. Subset the array to a 1D array at that `space` location, \n", + "1. Interpolate the 1D array to the new `time` vector, and\n", + "1. 
Assign that new interpolated 1D array to the appropriate location of a 2D output array\n", + "\n", + "In pseudo-code this might look like:\n", + "\n", + "```python\n", + "for index in range(size_of_space_axis):\n", + "    out[index, :] = np.interp(..., array[index, :], ...)\n", + "```\n", + "\n", + "\n", + "```{exercise}\n", + ":label: coreloopdims\n", + "\n", + "Consider the example problem of interpolating a 2D array with dimensions `space` and `time` along the `time` dimension.\n", + "Which dimension is the core dimension, and which is the \"loop dimension\"?\n", + "```\n", + "```{solution} coreloopdims\n", + ":class: dropdown\n", + "\n", + "`time` is the core dimension, and `space` is the loop dimension.\n", + "```\n", + "\n", + "## Vectorization\n", + "\n", + "The pattern of looping over any number of \"loop dimensions\" and applying a function along \"core dimensions\" \n", + "is so common that numpy provides wrappers that automate these steps: \n", + "1. [numpy.apply_along_axis](https://numpy.org/doc/stable/reference/generated/numpy.apply_along_axis.html)\n", + "1. [numpy.apply_over_axes](https://numpy.org/doc/stable/reference/generated/numpy.apply_over_axes.html)\n", + "1. [numpy.vectorize](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html)\n", + "\n", + "\n", + "`apply_ufunc` provides an easy interface to `numpy.vectorize` through the keyword argument `vectorize`. Here we see how to use\n", + "that to automatically apply `np.interp` along a single axis of an nD array.\n", + "\n", + "## Load data\n", + "\n", + "First let's load an example dataset\n", + "\n", + "```{tip}\n", + "We'll reduce the length of error messages using `%xmode minimal`. See the [ipython documentation](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-xmode) for details.\n", + "```" ] }, { "cell_type": "code", "execution_count": null, "id": "76aa13b8-5ced-4468-a72e-6b0a29172d6d", "metadata": { "tags": [] }, "outputs": [], "source": [ "%xmode minimal\n", "\n", "import xarray as xr\n", "import numpy as np\n", "\n", "xr.set_options(display_expand_data=False)\n", "\n", "air = (\n", "    xr.tutorial.load_dataset(\"air_temperature\")\n", "    .air.sortby(\"lat\")  # np.interp needs coordinate in ascending order\n", "    .isel(time=slice(4), lon=slice(3))  # choose a small subset for convenience\n", ")\n", "air" ] }, { "cell_type": "markdown", "id": "81356724-6c1a-4d4a-9a32-bb906a9419b2", "metadata": { "tags": [] }, "source": [ "## Review\n", "\n", "\n", "We'll work with the `apply_ufunc` call from the section on [handling dimensions that change size](complex-output-change-size). See the \"Handling Complex Output\" section for how to get here.\n", "\n", "This version only works with 1D vectors. We will expand that to work with inputs of any number of dimensions."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cb286fa0-deba-4929-b18a-79af5acb0b5b", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "newlat = np.linspace(15, 75, 100)\n", + "\n", + "xr.apply_ufunc(\n", + " np.interp, # first the function\n", + " newlat,\n", + " air.lat,\n", + " air.isel(lon=0, time=0), # this version only works with 1D vectors\n", + " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", + " output_core_dims=[[\"lat\"]],\n", + " exclude_dims={\"lat\"},\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "e3382658-14e1-4842-a618-ce7a27948c31", + "metadata": { + "tags": [] + }, + "source": [ + "## Try nD input\n", + "\n", + "Our goal is to interpolate latitude at every longitude and time, such that we go from a dataset with dimensions `(time: 4, lat: 25, lon: 3)` to `(time: 4, lat: 100, lon: 3)`. \n", + "\n", + "If we blindly try passing `air` (a 3D DataArray), we get a hard-to-understand error" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1476bcce-cc7b-4252-90dd-f45502dffb09", + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [], + "source": [ + "newlat = np.linspace(15, 75, 100)\n", + "\n", + "xr.apply_ufunc(\n", + " np.interp, # first the function\n", + " newlat,\n", + " air.lat,\n", + " air,\n", + " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", + " output_core_dims=[[\"lat\"]],\n", + " exclude_dims={\"lat\"},\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "1d1da9c2-a634-4920-890c-74d9bec9eab9", + "metadata": { + "tags": [] + }, + "source": [ + "We will use a \"wrapper\" function `debug_interp` to examine what gets passed to `numpy.interp`.\n", + "\n", + "```{tip}\n", + "Such wrapper functions are a great way to understand and debug `apply_ufunc` use cases.\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fa306dcf-eec3-425c-b278-42d15bbc0e4f", + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [], + "source": [ + "def debug_interp(xi, x, data):\n", + " print(f\"data: {data.shape} | x: {x.shape} | xi: {xi.shape}\")\n", + " return np.interp(xi, x, data)\n", + "\n", + "\n", + "interped = xr.apply_ufunc(\n", + " debug_interp, # first the function\n", + " newlat,\n", + " air.lat,\n", + " air,\n", + " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", + " output_core_dims=[[\"lat\"]],\n", + " exclude_dims={\"lat\"}, # dimensions allowed to change size. Must be set!\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "6f5c928b-f8cb-4016-9d6d-39743f9c2976", + "metadata": { + "tags": [] + }, + "source": [ + "That's a hard-to-interpret error from NumPy but our `print` call helpfully printed the shapes of the input data: \n", + "\n", + " data: (4, 3, 25) | x: (25,) | xi: (100,)\n", + "\n", + "We see that `apply_ufunc` passes the full 3D array to `interp1d_np` which in turn passes that on to `numpy.interp`. But `numpy.interp` requires a 1D input, and thus the error.\n", + "\n", + "Instead of passing the full 3D array we want loop over all combinations of `lon` and `time`; and apply our function to each corresponding vector of data along `lat`." 
+ ] + }, + { + "cell_type": "markdown", + "id": "737cc6b4-522f-488c-9124-524cc42ebef3", + "metadata": { + "tags": [] + }, + "source": [ + "## Vectorization with `np.vectorize`\n" + ] + }, + { + "cell_type": "markdown", + "id": "b6dac8da-8420-4fc4-9aeb-29b8999d4b37", + "metadata": { + "tags": [] + }, + "source": [ + "`apply_ufunc` makes it easy to loop over the loop dimensions by specifying `vectorize=True`:\n", + "\n", + " vectorize : bool, optional\n", + " If True, then assume ``func`` only takes arrays defined over core\n", + " dimensions as input and vectorize it automatically with\n", + " :py:func:`numpy.vectorize`. This option exists for convenience, but is\n", + " almost always slower than supplying a pre-vectorized function.\n", + " Using this option requires NumPy version 1.12 or newer.\n", + " \n", + "\n", + "```{warning}\n", + "Also see the numpy documentation for [numpy.vectorize](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html). Most importantly\n", + "\n", + " The vectorize function is provided primarily for convenience, not for performance. \n", + " The implementation is essentially a for loop.\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d72fdd8c-44d2-4f6e-9fc4-7084e0e49986", + "metadata": { + "tags": [], + "user_expressions": [] + }, + "outputs": [], + "source": [ + "interped = xr.apply_ufunc(\n", + " debug_interp, # first the function\n", + " newlat,\n", + " air.lat,\n", + " air,\n", + " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", + " output_core_dims=[[\"lat\"]],\n", + " exclude_dims={\"lat\"}, # dimensions allowed to change size. Must be set!\n", + " vectorize=True,\n", + ")\n", + "interped" + ] + }, + { + "cell_type": "markdown", + "id": "d81f399e-1649-4d4b-ad28-81cba8403210", + "metadata": { + "tags": [] + }, + "source": [ + "Wow that worked!\n", + "\n", + "Notice that \n", + "1. the printed input shapes are all 1D and correspond to one vector of size 25 along the `lat` dimension.\n", + "2. `debug_interp` was called 4x3 = 12 times which is the total number `lat` vectors since the size along `time` is 4, and the size along `lon` is 3.\n", + "3. The result `interped` is now an xarray object with coordinate values copied over from `data`. \n", + "\n", + "\n", + "```{note}\n", + "`lat` is now the *last* dimension in `interped`. This is a \"property\" of core dimensions: they are moved to the end before being sent to `interp1d_np` as noted in the docstring for `input_core_dims`\n", + "\n", + " Core dimensions are automatically moved to the last axes of input\n", + " variables before applying ``func``, which facilitates using NumPy style\n", + " generalized ufuncs [2]_.\n", + "```\n", + "\n", + "## Conclusion\n", + "This is why `apply_ufunc` is so convenient; it takes care of a lot of code necessary to apply functions that consume and produce numpy arrays to xarray objects.\n", + "\n", + "The `vectorize` keyword argument, when set to True, will use `numpy.vectorize` to apply the function by looping over the \"loop dimensions\" --- dimensions that are not the core dimensions for the applied function." 
+ ] + } + ], + "metadata": { + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/advanced/apply_ufunc/complex-output-numpy.ipynb b/advanced/apply_ufunc/complex-output-numpy.ipynb new file mode 100644 index 00000000..74387ace --- /dev/null +++ b/advanced/apply_ufunc/complex-output-numpy.ipynb @@ -0,0 +1,375 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "0203c86c-2e42-4226-acd1-1e9bcffc6708", + "metadata": { + "tags": [] + }, + "source": [ + "(complex-output)=\n", + "# Handling complex output\n", + "\n", + "We've seen how to use `apply_ufunc` to handle relatively simple functions that transform every element, or reduce along a single dimension.\n", + "\n", + "This lesson will show you how to handle cases where the output is more complex in two ways:\n", + "1. Handle adding a new dimension by specifying `output_core_dims`\n", + "1. Handling the change in size of an existing dimension by specifying `exclude_dims` in addition to `output_core_dims`\n", + "\n", + "\n", + "## Introduction\n", + "\n", + "A good example of a function that returns relatively complex output is numpy's 1D interpolate function `numpy.interp`:\n", + "\n", + "```\n", + " Signature: np.interp(x, xp, fp, left=None, right=None, period=None)\n", + " Docstring:\n", + " One-dimensional linear interpolation.\n", + "\n", + " Returns the one-dimensional piecewise linear interpolant to a function\n", + " with given discrete data points (`xp`, `fp`), evaluated at `x`.\n", + "```\n", + "\n", + "This function expects a 1D array as input, and returns a 1D array as output. 
That is, `numpy.interp` has one core dimension.\n", + "\n", + "\n", + "```{tip}\n", + "We'll reduce the length of error messages using `%xmode minimal`. See the [ipython documentation](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-xmode) for details.\n", + "```\n" ] }, { "cell_type": "code", "execution_count": null, "id": "f7842446-6b7d-494b-9625-5dd04dc6e9ad", "metadata": { "tags": [] }, "outputs": [], "source": [ "%xmode minimal\n", "\n", "import xarray as xr\n", "import numpy as np\n", "\n", "np.set_printoptions(threshold=10, edgeitems=2)\n", "xr.set_options(display_expand_data=False)\n", "\n", "air = (\n", "    xr.tutorial.load_dataset(\"air_temperature\")\n", "    .air.sortby(\"lat\")  # np.interp needs coordinate in ascending order\n", "    .isel(time=0, lon=0)  # choose a 1D subset\n", ")\n", "air" ] }, { "cell_type": "code", "execution_count": null, "id": "936f0cb2-491b-4560-9470-d98df595b3d1", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Our goal is to densify from 25 to 100 coordinate values:\n", "newlat = np.linspace(15, 75, 100)\n", "np.interp(newlat, air.lat.data, air.data)" ] }, { "cell_type": "markdown", "id": "db79d3e3-2a1a-4644-82a0-d58da74e0e36", "metadata": { "tags": [] }, "source": [ "(interp-add-new-dim)=\n", "## Adding a new dimension\n", "\n", "1D interpolation transforms the size of the input along a single dimension.\n", "\n", "Logically, we can think of this as removing the old dimension and adding a new dimension.\n", "\n", "We provide this information to `apply_ufunc` using the `output_core_dims` keyword argument\n", "\n", "```\n", "   output_core_dims : List[tuple], optional\n", "        List of the same length as the number of output arguments from\n", "        ``func``, giving the list of core dimensions on each output that were\n", "        not broadcast on the inputs. By default, we assume that ``func``\n", "        outputs exactly one array, with axes corresponding to each broadcast\n", "        dimension.\n", "\n", "        Core dimensions are assumed to appear as the last dimensions of each\n", "        output in the provided order.\n", "```\n", "\n", "For `interp` we expect one returned output with one new core dimension that we will call `\"lat_interp\"`.\n", "\n", "Specify this using `output_core_dims=[[\"lat_interp\"]]`" ] }, { "cell_type": "code", "execution_count": null, "id": "b5e9b4b4-712d-4ef7-8e13-9bbbad59728a", "metadata": { "tags": [] }, "outputs": [], "source": [ "newlat = np.linspace(15, 75, 100)\n", "\n", "xr.apply_ufunc(\n", "    np.interp,  # function to apply\n", "    newlat,  # 1st input to np.interp\n", "    air.lat,  # 2nd input to np.interp\n", "    air,  # 3rd input to np.interp\n", "    input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],  # one entry per function input, 3 in total!\n", "    output_core_dims=[[\"lat_interp\"]],\n", ")" ] }, { "cell_type": "markdown", "id": "c0a5b8d4-729e-4d0e-b284-4751c5edc37c", "metadata": { "tags": [] }, "source": [ "````{exercise}\n", ":label: newdim\n", "\n", "Apply the following function using `apply_ufunc`. It adds a new dimension to the input array, let's call it `newdim`. Specify the new dimension using `output_core_dims`. 
Do you need any `input_core_dims`?\n", + "\n", + "```python\n", + "def add_new_dim(array):\n", + "    return np.expand_dims(array, axis=-1)\n", + "```\n", + "````\n", + "````{solution} newdim\n", + ":class: dropdown\n", + "\n", + "``` python\n", + "def add_new_dim(array):\n", + "    return np.expand_dims(array, axis=-1)\n", + "\n", + "\n", + "xr.apply_ufunc(\n", + "    add_new_dim,\n", + "    air,\n", + "    output_core_dims=[[\"newdim\"]],\n", + ")\n", + "```\n", + "````" ] }, { "cell_type": "markdown", "id": "7767a63d-20c5-4c2d-8bf0-3b26bc2b336f", "metadata": { "tags": [], "user_expressions": [] }, "source": [ "(complex-output-change-size)=\n", "## Dimensions that change size\n", "\n", "Imagine that you want the output to have the same dimension name `\"lat\"`, i.e. applying `np.interp` changes the size of the `\"lat\"` dimension.\n", "\n", "We get an error if we specify `\"lat\"` in `output_core_dims`" ] }, { "cell_type": "code", "execution_count": null, "id": "0fc8f8f5-0560-46f2-be65-8daabea2d837", "metadata": { "tags": [ "raises-exception" ] }, "outputs": [], "source": [ "newlat = np.linspace(15, 75, 100)\n", "\n", "xr.apply_ufunc(\n", "    np.interp,  # first the function\n", "    newlat,\n", "    air.lat,\n", "    air,\n", "    input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", "    output_core_dims=[[\"lat\"]],\n", ")" ] }, { "cell_type": "markdown", "id": "5276692d-0e1d-498a-8d60-e08a4d8b9d3a", "metadata": { "tags": [], "user_expressions": [] }, "source": [ "As the error message points out,\n", "```\n", "Only dimensions specified in ``exclude_dims`` with xarray.apply_ufunc are allowed to change size.\n", "```\n", "\n", "Looking at the docstring we need to specify `exclude_dims` as a \"set\":\n", "\n", "```\n", "exclude_dims : set, optional\n", "        Core dimensions on the inputs to exclude from alignment and\n", "        broadcasting entirely. Any input coordinates along these dimensions\n", "        will be dropped. Each excluded dimension must also appear in\n", "        ``input_core_dims`` for at least one argument. 
Only dimensions listed\n", + " here are allowed to change size between input and output objects.\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "61a39f36-9386-407b-b1c0-b49953da9d0c", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "newlat = np.linspace(15, 75, 100)\n", + "\n", + "xr.apply_ufunc(\n", + " np.interp, # first the function\n", + " newlat,\n", + " air.lat,\n", + " air,\n", + " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", + " output_core_dims=[[\"lat\"]],\n", + " exclude_dims={\"lat\"},\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "a3c45b16-795c-4549-95ee-8e9c0d7dd517", + "metadata": { + "tags": [] + }, + "source": [ + "## Returning multiple variables\n", + "\n", + "Another common, but more complex, case is to handle multiple outputs returned by the function.\n", + "\n", + "As an example we will write a function that returns the minimum and maximum value along the last axis of the array.\n", + "\n", + "We will work with a 2D array, and apply the function `minmax` along the `\"lat\"` dimension:\n", + "```python\n", + "def minmax(array):\n", + " return array.min(axis=-1), array.max(axis=-1)\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c7accd69-fece-46be-852b-0cb7c432197a", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "def minmax(array):\n", + " return array.min(axis=-1), array.max(axis=-1)\n", + "\n", + "\n", + "air2d = xr.tutorial.load_dataset(\"air_temperature\").air.isel(time=0)\n", + "air2d" + ] + }, + { + "cell_type": "markdown", + "id": "c99f6a5e-f977-4828-9418-202d93d0acda", + "metadata": { + "tags": [], + "user_expressions": [] + }, + "source": [ + "By default, Xarray assumes one array is returned by the applied function.\n", + "\n", + "Here we have two returned arrays, and the input core dimension `\"lat\"` is removed (or reduced over).\n", + "\n", + "So we provide `output_core_dims=[[], []]` i.e. an empty list of core dimensions for each of the two returned arrays." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "204c649e-18d4-403a-9366-c46caaaefb52", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "minda, maxda = xr.apply_ufunc(\n", + " minmax,\n", + " air2d,\n", + " input_core_dims=[[\"lat\"]],\n", + " output_core_dims=[[], []],\n", + ")\n", + "minda" + ] + }, + { + "cell_type": "markdown", + "id": "b9b023e8-5ca4-436a-bdfb-3ce35f6ea712", + "metadata": { + "tags": [] + }, + "source": [ + "````{exercise}\n", + ":label: generalize\n", + "\n", + "We presented the concept of \"core dimensions\" as the \"smallest unit of data the function could handle.\" Do you understand how the above use of `apply_ufunc` generalizes to an array with more than one dimension? \n", + "\n", + "Try applying the minmax function to a 3d air temperature dataset \n", + "```python\n", + "air3d = xr.tutorial.load_dataset(\"air_temperature\").air)\n", + "``` \n", + "Your goal is to have a minimum and maximum value of temperature across all latitudes for a given time and longitude.\n", + "````\n", + "\n", + "````{solution} generalize\n", + ":class: dropdown\n", + "\n", + "We want to use `minmax` to compute the minimum and maximum along the \"lat\" dimension always, regardless of how many dimensions are on the input. So we specify `input_core_dims=[[\"lat\"]]`. The output does not contain the \"lat\" dimension, but we expect two returned variables. 
So we pass an empty list `[]` for each returned array: `output_core_dims=[[], []]`, just as before.\n", + "\n", + "\n", + "```python\n", + "minda, maxda = xr.apply_ufunc(\n", + "    minmax,\n", + "    air3d,\n", + "    input_core_dims=[[\"lat\"]],\n", + "    output_core_dims=[[], []],\n", + ")\n", + "```\n", + "````" ] } ], "metadata": { "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 5 +} diff --git a/advanced/apply_ufunc/core-dimensions.ipynb b/advanced/apply_ufunc/core-dimensions.ipynb new file mode 100644 index 00000000..2c4d39d9 --- /dev/null +++ b/advanced/apply_ufunc/core-dimensions.ipynb @@ -0,0 +1,372 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "8a5fd05a-5187-4fb6-8809-049e7158124b", + "metadata": { + "tags": [] + }, + "source": [ + "# Core dimensions\n", + "\n", + "[Previously](gentle-intro) we learned to use `apply_ufunc` on simple functions that acted element by element. \n", + "\n", + "Here we move on to slightly more complex functions like `np.mean` that can act along a subset of an input array's dimensions.\n", + "\n", + "Such operations involve the concept of \"core dimensions\". \n", + "\n", + "Our learning goals are:\n", + "- Learn how to identify \"core dimensions\" for the function you're applying.\n", + "- Learn that \"core dimensions\" are automatically moved or transposed to the end of the array.\n", + "\n", + "\n", + "## Introduction\n", + "\n", + "To use more complex operations that consider some array values collectively,\n", + "it’s important to understand the idea of **core dimensions**. \n", + "Usually, they correspond to the fundamental dimensions over\n", + "which an operation is defined, e.g., the summed axis in `np.sum`. One way to think about core dimensions \n", + "is to consider the smallest dimensionality of data that the function acts on.\n", + "\n", + "```{important}\n", + "\n", + "A good clue that core dimensions are needed is the presence of an `axis` argument on the\n", + "corresponding NumPy function.\n", + "\n", + "```\n" ] }, { "cell_type": "markdown", "id": "626c49ba-7d57-42e1-8e20-7d1b3ec7147d", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": null, "id": "cb4f4b80-d990-4bde-abf9-f77bac55c7ff", "metadata": {}, "outputs": [], "source": [ "%xmode minimal\n", "\n", "import numpy as np\n", "import xarray as xr\n", "\n", "# limit the amount of information printed to screen\n", "xr.set_options(display_expand_data=False)\n", "np.set_printoptions(threshold=10, edgeitems=2)" ] }, { "cell_type": "markdown", "id": "61124d8d-2992-46d1-bd49-6282fe9aba82", "metadata": {}, "source": [ "Let's load a dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "920c4b39-6f55-43cf-99df-7eb6ea01ad82", "metadata": {}, "outputs": [], "source": [ "ds = xr.tutorial.load_dataset(\"air_temperature\")\n", "ds" ] }, { "cell_type": "markdown", "id": "cbaab26c-24ed-4e0f-8285-f0ffff92ec14", "metadata": {}, "source": [ "## Reducing with `np.mean`\n", "\n", "Let's write a function that computes the mean along `time` for a provided xarray object. \n", "\n", "This function requires one core dimension `time`. For `ds.air`, note that `time` is the 0th axis."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6637aa49-4f2f-4526-929c-65086e0196a1", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "ds.air.dims" + ] + }, + { + "cell_type": "markdown", + "id": "7b66f1f2-369c-42b5-b082-735c71a2e1f9", + "metadata": { + "tags": [] + }, + "source": [ + "`get_axis_num` is a useful method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "30e7e10b-5447-4384-bc25-0cdea72da25a", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "ds.air.get_axis_num(\"time\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f53ec56a-63bb-4103-87e9-4e7c14dc6c1a", + "metadata": {}, + "outputs": [], + "source": [ + "np.mean(ds.air, axis=ds.air.get_axis_num(\"time\"))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a8214ae7-00fa-47fa-af79-79c3731b8cc1", + "metadata": {}, + "outputs": [], + "source": [ + "np.mean(ds.air.data, axis=0)" + ] + }, + { + "cell_type": "markdown", + "id": "938bd6a2-f0c0-412a-8f9b-4b2147750892", + "metadata": {}, + "source": [ + "Let's try to use `apply_ufunc` to replicate `np.mean(ds.air.data, axis=0)`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "424a54a2-b704-407b-80d8-27107949f184", + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [], + "source": [ + "xr.apply_ufunc(\n", + " # function to apply\n", + " np.mean,\n", + " # object with data to pass to function\n", + " ds,\n", + " # keyword arguments to pass to np.mean\n", + " kwargs={\"axis\": 0},\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "0ec1348a-cf6d-4d7e-b01b-5a9ea32f7565", + "metadata": { + "tags": [] + }, + "source": [ + "The error here\n", + "```\n", + "applied function returned data with unexpected number of dimensions. \n", + "Received 2 dimension(s) but expected 3 dimensions with names: ('time', 'lat', 'lon')\n", + "```\n", + "\n", + "means that while `np.mean` did indeed reduce one dimension, we did not tell `apply_ufunc` that this would happen. That is, we need to specify the core dimensions on the input.\n", + "\n", + "Do that by passing a list of dimension names for each input object. For this function we have one input : `ds` and with a single core dimension `\"time\"` so we have `input_core_dims=[[\"time\"]]`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6e8d7580-2300-47f3-953b-403967eb7fcd", + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [], + "source": [ + "xr.apply_ufunc(\n", + " np.mean,\n", + " ds,\n", + " # specify core dimensions as a list of lists\n", + " # here 'time' is the core dimension on `ds`\n", + " input_core_dims=[\n", + " [\"time\"], # core dimension for ds\n", + " ],\n", + " kwargs={\"axis\": 0},\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "6e856818-80fe-4412-a71f-4f77bb6911c6", + "metadata": { + "tags": [] + }, + "source": [ + "This next error is a little confusing.\n", + "\n", + "```\n", + "size of dimension 'lat' on inputs was unexpectedly changed by applied function from 25 to 53. 
\n", + "Only dimensions specified in ``exclude_dims`` with xarray.apply_ufunc are allowed to change size.\n", + "```\n", + "\n", + "\n", + "A good trick here is to pass a little wrapper function to `apply_ufunc` instead and inspect the shapes of data received by the wrapper.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "562d5758-f492-4d9a-bf68-960da7580582", + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [], + "source": [ + "def wrapper(array, **kwargs):\n", + " print(f\"received {type(array)} shape: {array.shape}, kwargs: {kwargs}\")\n", + " result = np.mean(array, **kwargs)\n", + " print(f\"result.shape: {result.shape}\")\n", + " return result\n", + "\n", + "\n", + "xr.apply_ufunc(\n", + " wrapper,\n", + " ds,\n", + " # specify core dimensions as a list of lists\n", + " # here 'time' is the core dimension on `ds`\n", + " input_core_dims=[[\"time\"]],\n", + " kwargs={\"axis\": 0},\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "27008aae-f1b7-45b2-88c7-2c440f23a958", + "metadata": {}, + "source": [ + "Now we see the issue:\n", + "\n", + " received👆
Click the Dashboard link above. Or click the \"Search\" button in the dashboard.\n", + "\n", + "Let's test that the dashboard is working.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "4419e2a6-9ff7-4d33-b6da-243210c34a37", "metadata": {}, "outputs": [], "source": [ "import dask.array\n", "\n", "dask.array.ones((1000, 4), chunks=(2, 1)).compute()  # should see activity in dashboard" ] }, { "cell_type": "markdown", "id": "e6abae80-004a-481f-9b1a-c476de951ef0", "metadata": {}, "source": [ "Let's open a dataset. We specify `chunks` so that we create dask arrays for the DataArrays" ] }, { "cell_type": "code", "execution_count": null, "id": "b74da273-e603-442f-9970-ef3eb17a3ad9", "metadata": {}, "outputs": [], "source": [ "ds = xr.tutorial.open_dataset(\"air_temperature\", chunks={\"time\": 100})\n", "ds" ] }, { "cell_type": "markdown", "id": "a6b21b09-3ebb-4efb-9c57-44bf587ba92d", "metadata": { "tags": [] }, "source": [ "## A simple example\n", "\n", "All the concepts from applying numpy functions carry over.\n", "\n", "However, the handling of dask arrays needs to be explicitly activated.\n", "\n", "There are three options for the `dask` kwarg.\n", "\n", "```\n", "    dask : {\"forbidden\", \"allowed\", \"parallelized\"}, default: \"forbidden\"\n", "        How to handle applying to objects containing lazy data in the form of\n", "        dask arrays:\n", "\n", "        - 'forbidden' (default): raise an error if a dask array is encountered.\n", "        - 'allowed': pass dask arrays directly on to ``func``. Prefer this option if\n", "          ``func`` natively supports dask arrays.\n", "        - 'parallelized': automatically parallelize ``func`` if any of the\n", "          inputs are a dask array by using :py:func:`dask.array.apply_gufunc`. Multiple output\n", "          arguments are supported. Only use this option if ``func`` does not natively\n", "          support dask arrays (e.g. converts them to numpy arrays).\n", "```\n", "\n", "We will work through the following two:\n", "\n", "1. `dask=\"allowed\"`. Dask arrays are passed to the user function. This is a good\n", "   choice if your function can handle dask arrays and won't compute the result unless \n", "   explicitly requested.\n", "2. `dask=\"parallelized\"`. This applies the user function over blocks of the dask\n", "   array using `dask.array.apply_gufunc`. This is useful when your function cannot\n", "   handle dask arrays natively (e.g. scipy API)." ] }, { "cell_type": "code", "execution_count": null, "id": "26800228-97cc-4f35-baf8-a5face577543", "metadata": { "tags": [ "raises-exception" ] }, "outputs": [], "source": [ "# Expect an error here\n", "def squared_error(x, y):\n", "    return (x - y) ** 2\n", "\n", "\n", "xr.apply_ufunc(squared_error, ds.air, 1)" ] }, { "cell_type": "markdown", "id": "5e2c1b9f-f436-414c-944b-fa134837ee32", "metadata": { "tags": [] }, "source": [ " \n", "A good thing to check is whether the applied function (here `squared_error`) can handle pure dask arrays. \n", "To do this, call `squared_error(ds.air.data, 1)` and make sure of the following:\n", "1. That you don't see any activity on the dask dashboard\n", "2. That the returned result is a dask array."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b2b6ffab-5b27-4169-b4c3-ce605707ba9d", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "squared_error(ds.air.data, 1)" + ] + }, + { + "cell_type": "markdown", + "id": "a4c17b34-7fd9-45ef-9f71-041fc8b16fdf", + "metadata": { + "tags": [] + }, + "source": [ + "Since `squared_error` can handle dask arrays without computing them, we specify\n", + "`dask=\"allowed\"`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4e0806da-b6ea-4618-8394-b2e888f4c556", + "metadata": {}, + "outputs": [], + "source": [ + "sqer = xr.apply_ufunc(\n", + " squared_error,\n", + " ds.air,\n", + " 1,\n", + " dask=\"allowed\",\n", + ")\n", + "sqer # dask-backed DataArray! with nice metadata!" + ] + }, + { + "cell_type": "markdown", + "id": "c1418c06-e1f8-4db8-bdf5-a4e23f6524e1", + "metadata": { + "tags": [] + }, + "source": [ + "### Understanding what's happening\n", + "\n", + "Let's again use the wrapper trick to understand what `squared_error` receives.\n", + "\n", + "We see that it receives a dask array (analogous to the numpy array in the previous example)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7d9a5eb4-c6e9-445f-87bb-5bcaa653439f", + "metadata": {}, + "outputs": [], + "source": [ + "def wrapper(x, y):\n", + " print(f\"received x of type {type(x)}, shape {x.shape}\")\n", + " print(f\"received y of type {type(y)}\")\n", + " return squared_error(x, y)\n", + "\n", + "\n", + "xr.apply_ufunc(wrapper, ds.air, 1, dask=\"allowed\")" + ] + }, + { + "cell_type": "markdown", + "id": "ce1b0d1c-8bf3-4ddb-aa5f-e2ca70415ad6", + "metadata": { + "tags": [] + }, + "source": [ + "## Core dimensions\n", + "\n", + "`squared_error` operated on a per-element basis. How about a reduction like `np.mean`?\n", + "\n", + "Such functions involve the concept of \"core dimensions\". This concept is independent of the underlying array type, and is a property of the applied function. See the [core dimensions with NumPy](core-dimensions) tutorial for more.\n", + "\n", + "\n", + "```{exercise}\n", + ":label: daskmean\n", + "\n", + "Use `dask.array.mean` as an example of a function that can handle dask\n", + "arrays and uses an `axis` kwarg. 
\n", + "```\n", + "\n", + "````{solution} daskmean\n", + ":class: dropdown\n", + "```python\n", + "def time_mean(da):\n", + "    return xr.apply_ufunc(\n", + "        dask.array.mean,\n", + "        da,\n", + "        input_core_dims=[[\"time\"]],\n", + "        dask=\"allowed\",\n", + "        kwargs={\"axis\": -1},  # core dimensions are moved to the end\n", + "    )\n", + "\n", + "\n", + "time_mean(ds.air)\n", + "```\n", + "````\n" ] }, { "cell_type": "markdown", "id": "7952ceb1-8402-41d1-8813-31228c3e9ae6", "metadata": { "tags": [] }, "source": [ "Again, this is identical to the built-in `mean`" ] }, { "cell_type": "code", "execution_count": null, "id": "6c343d01-cdf6-4e48-a349-606f06c4c251", "metadata": {}, "outputs": [], "source": [ "def time_mean(da):\n", "    return xr.apply_ufunc(\n", "        dask.array.mean,\n", "        da,\n", "        input_core_dims=[[\"time\"]],\n", "        dask=\"allowed\",\n", "        kwargs={\"axis\": -1},  # core dimensions are moved to the end\n", "    )\n", "\n", "\n", "ds.air.mean(\"time\").identical(time_mean(ds.air))" ] }, { "cell_type": "markdown", "id": "8abf7aa7-e1e6-4935-9d59-58d4f587135c", "metadata": { "tags": [] }, "source": [ "## Automatically parallelizing dask-unaware functions\n", "\n", "### Basics\n", "\n", "Not all functions can handle dask arrays appropriately by default.\n", "\n", "A very useful `apply_ufunc` feature is the ability to apply arbitrary functions\n", "in parallel to each block. This ability can be activated using\n", "`dask=\"parallelized\"`. \n", "\n", "We will use `scipy.integrate.trapz` as an example of a function that cannot\n", "handle dask arrays and requires a core dimension. If we call `trapz` with a dask\n", "array, we get a numpy array back; that is, the values have been eagerly computed.\n", "This is undesirable behaviour.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "43a04e83-c061-4e14-80d9-b73d6c36981a", "metadata": {}, "outputs": [], "source": [ "import scipy as sp\n", "import scipy.integrate\n", "\n", "sp.integrate.trapz(\n", "    ds.air.data, axis=ds.air.get_axis_num(\"lon\")\n", ")  # does NOT return a dask array, you should see activity on the dashboard" ] }, { "cell_type": "markdown", "id": "f518099f-7778-4f0a-9d40-950968c651d5", "metadata": { "tags": [] }, "source": [ "Let's activate automatic parallelization by using `apply_ufunc` with `dask=\"parallelized\"`" ] }, { "cell_type": "code", "execution_count": null, "id": "8faef8dd-13b0-46c6-a19a-08e5e095fd4c", "metadata": {}, "outputs": [], "source": [ "integrated = xr.apply_ufunc(\n", "    sp.integrate.trapz,\n", "    ds,\n", "    input_core_dims=[[\"lon\"]],\n", "    kwargs={\"axis\": -1},\n", "    dask=\"parallelized\",\n", ")\n", "integrated" ] }, { "cell_type": "markdown", "id": "3ec8e13e-77e0-4e46-a32d-dafa5da6fa6a", "metadata": { "tags": [] }, "source": [ "And make sure the returned data is a dask array" ] }, { "cell_type": "code", "execution_count": null, "id": "95c9e6e4-5011-4968-84f7-befe25c049bb", "metadata": { "tags": [] }, "outputs": [], "source": [ "integrated.air.data" ] }, { "cell_type": "markdown", "id": "59e97071-493c-4df3-8214-d9d5ada940d2", "metadata": {}, "source": [ "Now you have control over executing this parallel computation."
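A minimal sketch of the options now available, all standard dask/xarray API; the variable names are ours:

```python
computed = integrated.compute()   # evaluate now, bring the result back as numpy
persisted = integrated.persist()  # evaluate now, but keep the result as chunked dask arrays
# integrated.to_netcdf("integrated.nc")  # or compute while writing blocks to disk
```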
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "901f0d1d-a0c5-49fe-a27e-4b831ae0f9f4", + "metadata": {}, + "outputs": [], + "source": [ + "# Dask -> Numpy array of integrated values\n", + "parallelized_results = integrated.compute()\n", + "parallelized_results" + ] + }, + { + "cell_type": "markdown", + "id": "6b91b8ca-cec9-49ca-b136-bf3b5f612d91", + "metadata": { + "tags": [] + }, + "source": [ + "### Understanding `dask=\"parallelized\"`\n", + "\n", + "It is very important to understand what `dask=\"parallelized\"` does. To fully understand it, requires understanding some core concepts.\n", + "\n", + "```{seealso}\n", + "For `dask=\"parallelized\"` `apply_ufunc` will call `dask.array.apply_gufunc`. See the dask documentation on [generalized ufuncs](https://docs.dask.org/en/stable/array-gufunc.html) and [`apply_gufunc`](https://docs.dask.org/en/stable/generated/dask.array.gufunc.apply_gufunc.html) for more.\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "116fd5c7-5d15-4c05-96d0-73acb28e5403", + "metadata": { + "tags": [] + }, + "source": [ + "#### Embarrassingly parallel or blockwise operations\n", + "\n", + "`dask=\"parallelized\"` works well for \"blockwise\" or \"embarrassingly parallel\" operations ([Wikipedia](https://en.wikipedia.org/wiki/Embarrassingly_parallel)).\n", + "\n", + "These are operations where one block or chunk of the output array corresponds to one block or chunk of the input array. Specifically, the blocks or chunks of the _core dimension_ is what matters. Importantly, no communication between blocks is necessary to create the output, which makes parallelization quite simple or \"embarrassing\".\n", + "\n", + "Let's look at the dask repr for `ds` and note chunksizes are (100,25,53) for a array with shape (2920, 25, 53). This means that each block or chunk of the array contains all `lat`, `lon` points and a subset of `time` points." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "45d737df-8a6b-4244-bef7-c3464a1b65da", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "ds.air.data" + ] + }, + { + "cell_type": "markdown", + "id": "82b66790-4c7c-4902-aa3b-89f92c5641b5", + "metadata": { + "tags": [] + }, + "source": [ + "The core dimension for `trapz` is `lon`, and there is only one chunk along `lon`. This means that integrating along `lon` is a \"blockwise\" or \"embarrassingly parallel\" operation and `dask=\"parallelized\"` works quite well. \n", + "\n", + "```{caution} Question\n", + "Do you understand why `integrate(ds)` when `ds` has a single chunk along `lon` is a \"embarrassingly parallel\" operation?\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "141c4c5d-d3dc-433c-a89f-8c8e7084f333", + "metadata": { + "tags": [] + }, + "source": [ + "```{exercise} \n", + ":label: rechunk\n", + "Apply the integrate function to `ds` after rechunking to have a different chunksize along `lon` using `ds.chunk(lon=4)` (for example). What happens?\n", + "```\n", + "\n", + "```{solution} rechunk\n", + ":class: dropdown\n", + "\n", + "`apply_ufunc` complains that it cannot automatically parallelize because the dataset `ds` is now chunked along the core dimension `lon`. You should see the following error:\n", + "\n", + " ValueError: dimension lon on 0th function argument to apply_ufunc with dask='parallelized' \n", + " consists of multiple chunks, but is also a core dimension. 
To fix, either rechunk \n", + "    into a single array chunk along this dimension, i.e., \n", + "    ``.chunk(dict(lon=-1))``, or pass ``allow_rechunk=True`` in ``dask_gufunc_kwargs`` \n", + "    but beware that this may significantly increase memory usage.\n", + "\n", + "```" ] }, { "cell_type": "markdown", "id": "f411e1dc-5413-4400-8127-757f4bb7d640", "metadata": { "tags": [] }, "source": [ "#### Understanding execution\n", "\n", "We are layering many concepts together here, so it is important to understand how the function is executed, and what input it will receive. Again we will use our wrapper trick." ] }, { "cell_type": "code", "execution_count": null, "id": "1c454151-b331-43d1-8eb1-c0103e821712", "metadata": { "tags": [] }, "outputs": [], "source": [ "def integrate_wrapper(array, **kwargs):\n", "    print(f\"received array of type {type(array)}, shape {array.shape}\")\n", "    result = sp.integrate.trapz(array, **kwargs)\n", "    print(f\"returned array of type {type(result)}, shape {result.shape}\")\n", "    return result\n", "\n", "\n", "integrated = xr.apply_ufunc(\n", "    integrate_wrapper,\n", "    ds,\n", "    input_core_dims=[[\"lon\"]],\n", "    kwargs={\"axis\": -1},\n", "    dask=\"parallelized\",\n", ")\n", "integrated" ] }, { "cell_type": "markdown", "id": "33c4e90c-f153-407f-ba1b-e100e827e2c2", "metadata": {}, "source": [ "Note that we received an Xarray object back (`integrated`) but our wrapper function was called with a numpy array of shape `(1,1,1)`.\n", "\n", "```{important}\n", "The full 3D array has **not yet been** passed to `integrate_wrapper`. Yet dask needs to know the shape and dtype of the result. This is key. \n", "```\n", "\n", "The `integrate_wrapper` function is treated like a black box, and its effect on the inputs has to either be described through additional keyword arguments, or inferred by passing dummy inputs.\n", "\n", "To do so, `dask.array.apply_gufunc` calls the user function with dummy inputs (here a numpy array of shape `(1,1,1)`), and inspects the returned value to understand that one dimension was removed (it returned a numpy array of shape `(1,1)`).\n", "\n", "````{caution}\n", ":class: dropdown\n", "\n", "Some functions can have trouble handling such dummy inputs. Alternatively you can pass `meta = np.ones((1,1))` in `dask_gufunc_kwargs` to prevent dask from calling the function with dummy inputs.\n", "```python\n", "xr.apply_ufunc(\n", "    integrate_wrapper,\n", "    ds,\n", "    input_core_dims=[[\"lon\"]],\n", "    kwargs={\"axis\": -1},\n", "    dask=\"parallelized\",\n", "    dask_gufunc_kwargs={\"meta\": np.ones((1,1))},\n", ")\n", "```\n", "````\n", "\n", "Since no errors were raised we proceed as-is.\n", "\n", "Let's compute the array to get real values." ] }, { "cell_type": "code", "execution_count": null, "id": "b53fed48-5dc2-4a0b-985c-71bd436ca058", "metadata": { "tags": [ "output-scroll" ] }, "outputs": [], "source": [ "integrated.compute()" ] }, { "cell_type": "markdown", "id": "c265429d-6abb-4475-a9b0-2dcb0902afb1", "metadata": {}, "source": [ "We see that `integrate_wrapper` is called many times! 
In fact, it is called as many times as there are blocks in the array: 30 here (the product of `ds.air.data.numblocks`).\n", + "\n", + "Our function is independently executed on each block of the array, and then the results are concatenated to form the final result.\n", + "\n", + "Conceptually, there is a two-way flow of information between various packages when executing `integrated.compute()`:\n", + "\n", + "`xarray.apply_ufunc` ↔ `dask.array.apply_gufunc` ↔ `integrate_wrapper` ↔ `scipy.integrate.trapz` ↔ `ds.air.data`\n", + "\n", + "\n", + "When executed\n", + "\n", + "1. Xarray loops over all data variables.\n", + "1. Xarray unwraps the underlying dask array (e.g. `ds.air`) and passes that to dask's `apply_gufunc`.\n", + "1. `apply_gufunc` calls `integrate_wrapper` on each block of the array.\n", + "1. For each block, `integrate_wrapper` calls `scipy.integrate.trapz` and returns one block of the output array.\n", + "1. dask stitches all the output blocks to form the output array.\n", + "1. `xarray.apply_ufunc` wraps the output array with Xarray metadata to give the final result.\n", + "\n", + "Phew!\n", + "\n" ] }, { "cell_type": "markdown", "id": "4d971644-7cd2-4cda-8004-c0bb588e3a8b", "metadata": { "tags": [] }, "source": [ "## More complex situations\n", "\n", "Here we quickly demonstrate that all the concepts from the numpy material earlier carry over.\n", "\n", "Xarray needs a lot of extra metadata, so depending\n", "on the function, extra arguments such as `output_dtypes` and `output_sizes` may\n", "be necessary for supporting dask arrays. We demonstrate this below." ] }, { "cell_type": "markdown", "id": "0c984ddb-3798-435b-b33b-0f9d8a645b83", "metadata": { "tags": [] }, "source": [ "### Adding new dimensions\n", "\n", "We use the [expand_dims example](newdim) that adds a new dimension to the input array.\n", "\n", "```python\n", "def add_new_dim(array):\n", "    return np.expand_dims(array, axis=-1)\n", "```\n", "\n", "When automatically parallelizing with `dask`, we need to provide some more information about the outputs.\n", "1. When adding a new dimension, we need to provide the size in `dask_gufunc_kwargs` using the key `output_sizes`\n", "2. Usually we need to provide the datatype or `dtype` of the returned array. Usually the dtype of the input is a good guess." ] }, { "cell_type": "code", "execution_count": null, "id": "5d82f436-65fe-484e-9295-382f6c725b80", "metadata": { "tags": [ "raises-exception" ] }, "outputs": [], "source": [ "def add_new_dim(array):\n", "    return np.expand_dims(array, axis=-1)\n", "\n", "\n", "xr.apply_ufunc(\n", "    add_new_dim,  # first the function\n", "    ds.air.chunk({\"time\": 2, \"lon\": 2}),\n", "    output_core_dims=[[\"newdim\"]],\n", "    dask=\"parallelized\",\n", ")" ] }, { "cell_type": "markdown", "id": "d92e60e2-2652-4023-a6ce-7f83046e44ad", "metadata": {}, "source": [ "Provide the size of the newly added dimension `newdim` in `output_sizes` as part of the `dask_gufunc_kwargs` keyword argument:\n", "\n", "    dask_gufunc_kwargs (dict, optional) – Optional keyword arguments passed to dask.array.apply_gufunc() \n", "    if dask=’parallelized’. 
Possible keywords are output_sizes, allow_rechunk and meta.\n", + " \n", + "The syntax is \n", + "```python\n", + "dask_gufunc_kwargs={\n", + " \"output_sizes\": {\"newdim\": 1}\n", + "}\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "32bc121c-cb91-430b-96ab-a39079b5e80c", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "xr.apply_ufunc(\n", + " add_new_dim, # first the function\n", + " ds.air.chunk({\"time\": 2, \"lon\": 2}),\n", + " output_core_dims=[[\"newdim\"]],\n", + " dask=\"parallelized\",\n", + " dask_gufunc_kwargs={\"output_sizes\": {\"newdim\": 1}},\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "b3c08d6e-cc7b-4d6d-b074-ff90ba99e6c7", + "metadata": {}, + "source": [ + "### Dimensions that change size" + ] + }, + { + "cell_type": "markdown", + "id": "72177e24-0c3b-4e73-9a2b-9929d19c0490", + "metadata": {}, + "source": [ + "We will now repeat the [interpolation example from earlier](interp-add-new-dim) with `\"lat\"` as the output core dimension. See the numpy notebook on [complex output](complex-output) for more.\n", + "\n", + "```python\n", + "newlat = np.linspace(15, 75, 100)\n", + "\n", + "xr.apply_ufunc(\n", + " np.interp,\n", + " newlat,\n", + " ds.air.lat,\n", + " ds.air.chunk({\"time\": 2, \"lon\": 2}),\n", + " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", + " output_core_dims=[[\"lat\"]],\n", + " exclude_dims={\"lat\"},\n", + ")\n", + "```\n", + "\n", + "We will first add `dask=\"parallelized\"` and provide `output_sizes` in `dask_gufunc_kwargs`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "180ae3e2-82a2-4d2c-8217-88c97533eb1e", + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [], + "source": [ + "newlat = np.linspace(15, 75, 100)\n", + "\n", + "xr.apply_ufunc(\n", + " np.interp, # first the function\n", + " newlat,\n", + " ds.air.lat,\n", + " ds.air.chunk({\"time\": 2, \"lon\": 2}),\n", + " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", + " output_core_dims=[[\"lat\"]],\n", + " exclude_dims={\"lat\"},\n", + " # The following are dask-specific\n", + " dask=\"parallelized\",\n", + " dask_gufunc_kwargs=dict(output_sizes={\"lat\": len(newlat)}),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "62424e94-baed-425c-8169-da769da480b8", + "metadata": {}, + "source": [ + "This error means that we need to provide `output_dtypes`\n", + "\n", + " output_dtypes (list of dtype, optional) – Optional list of output dtypes. \n", + " Only used if dask='parallelized' or vectorize=True." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "797a2875-e727-4866-af43-5ed0e8eaed5f", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "newlat = np.linspace(15, 75, 100)\n", + "\n", + "xr.apply_ufunc(\n", + " np.interp, # first the function\n", + " newlat,\n", + " ds.air.lat,\n", + " ds.air.chunk({\"time\": 100, \"lon\": -1}),\n", + " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", + " output_core_dims=[[\"lat\"]],\n", + " exclude_dims={\"lat\"},\n", + " # The following are dask-specific\n", + " dask=\"parallelized\",\n", + " dask_gufunc_kwargs=dict(output_sizes={\"lat\": len(newlat)}),\n", + " output_dtypes=[ds.air.dtype],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "344b3051-b469-4859-8412-dc60552b1d14", + "metadata": {}, + "source": [ + "```{tip}\n", + "Dask can sometimes figure out the output sizes and dtypes. 
When it cannot, the usual workflow is to read the error messages and iteratively pass more information to `apply_ufunc`.\n", + "```" ] }, { "cell_type": "markdown", "id": "9646a631-2318-43c4-923d-a5384885134a", "metadata": {}, "source": [ "### Automatic Vectorizing\n", "\n", "[Automatic vectorizing](vectorize) with `vectorize=True` also carries over!" ] }, { "cell_type": "code", "execution_count": null, "id": "d8accd13-1afe-4085-97a6-86e4a5bd4405", "metadata": { "tags": [] }, "outputs": [], "source": [ "interped = xr.apply_ufunc(\n", "    np.interp,  # first the function\n", "    newlat,\n", "    ds.air.lat,\n", "    ds.chunk({\"time\": 100, \"lon\": -1}),\n", "    input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", "    output_core_dims=[[\"lat\"]],\n", "    exclude_dims={\"lat\"},  # dimensions allowed to change size. Must be set!\n", "    dask=\"parallelized\",\n", "    dask_gufunc_kwargs=dict(output_sizes={\"lat\": len(newlat)}),\n", "    output_dtypes=[ds.air.dtype],\n", "    vectorize=True,\n", ")\n", "interped" ] }, { "cell_type": "markdown", "id": "e484de1b-9d2f-4ad0-b978-550df593ee2f", "metadata": {}, "source": [ "Again, it is important to understand the conceptual flow of information between the various packages when executing `interped.compute()`, which looks like\n", "\n", "`xarray.apply_ufunc` ↔ `dask.array.apply_gufunc` ↔ `numpy.vectorize` ↔ `numpy.interp`\n", "\n", "\n", "When executed\n", "\n", "1. Xarray loops over all data variables.\n", "1. Xarray unwraps the underlying dask array (e.g. `ds.air`) and passes that to dask's `apply_gufunc`.\n", "1. `apply_gufunc` calls the vectorized function on each block of the array.\n", "1. For each block, `numpy.vectorize` handles looping over the loop dimensions \n", "   and passes 1D vectors along the core dimension to `numpy.interp`\n", "1. The 1D results for each block are concatenated by `numpy.vectorize` to create one output block.\n", "1. dask stitches all the output blocks to form the output array.\n", "1. 
`xarray.apply_ufunc` wraps the output array with Xarray metadata to give the final result.\n", + "\n", + "Phew!\n", + "\n" ] }, { "cell_type": "markdown", "id": "c3618181-7432-409e-95e7-d7bc4cc567ce", "metadata": { "tags": [] }, "source": [ "## Clean up the cluster" ] }, { "cell_type": "code", "execution_count": null, "id": "b5d9ab78-1b89-4530-a904-1d1e8475e949", "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "client.close();" ] } ], "metadata": { "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 5 +} diff --git a/advanced/apply_ufunc/vectorize_1d.ipynb b/advanced/apply_ufunc/example-interp.ipynb similarity index 83% rename from advanced/apply_ufunc/vectorize_1d.ipynb rename to advanced/apply_ufunc/example-interp.ipynb index f4398ce1..2cfe45d1 100644 --- a/advanced/apply_ufunc/vectorize_1d.ipynb +++ b/advanced/apply_ufunc/example-interp.ipynb @@ -2,20 +2,24 @@ "cells": [ { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "tags": [] + }, "source": [ - "# Applying unvectorized functions\n", + "# np.interp: An end-to-end example\n", "\n", "**Author** [Deepak Cherian (NCAR)](https://cherian.net)" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "tags": [] + }, "source": [ - "This example will illustrate how to conveniently apply an unvectorized function `func` to xarray objects using `apply_ufunc`. `func` expects 1D numpy arrays and returns a 1D numpy array. Our goal is to coveniently apply this function along a dimension of xarray objects that may or may not wrap dask arrays with a signature.\n", + "This example will illustrate how to conveniently apply an unvectorized function `func` to xarray objects using `apply_ufunc`. `func` expects 1D numpy arrays and returns a 1D numpy array. Our goal is to conveniently apply this function along a dimension of xarray objects that may or may not wrap dask arrays with a signature.\n", "\n", - "We will illustrate this using `np.interp`: \n", + "We will illustrate this using [`np.interp`](https://numpy.org/doc/stable/reference/generated/numpy.interp.html): \n", "\n", "    Signature: np.interp(x, xp, fp, left=None, right=None, period=None)\n", "    Docstring:\n", @@ -27,6 +31,24 @@ "and write an `xr_interp` function with signature\n", "\n", "    xr_interp(xarray_object, dimension_name, new_coordinate_to_interpolate_to)\n", + "    \n", + "    \n", + "## Learning goals \n", + "\n", + "Our goal is to use `apply_ufunc` with a general function so that we can reuse our code to apply to different xarray datasets or along different dimensions. Specifically, this example will illustrate \n", + "1. Specifying core dimensions with `input_core_dims`\n", + "1. Handling core dimensions of the output with `output_core_dims`\n", + "1. Handling core dimensions that change size using `exclude_dims`\n", + "1. Automatic vectorizing or looping over dimensions that are not core dimensions using `vectorize=True`\n", + "1. Automatic parallelization with dask arrays using `dask=\"parallelized\"`\n", + "1. 
High-performance vectorization with numba and `vectorize=False`.\n", + "\n", + "It puts together all the concepts covered earlier.\n", + "\n", + "\n", + "```{tip}\n", + "We'll reduce the length of error messages in this tutorial using `%xmode minimal`. See the [ipython documentation](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-xmode) for details.\n", + "```\n", "\n", "## Load data\n", "\n", @@ -36,13 +58,19 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ + "%xmode minimal\n", + "\n", "import xarray as xr\n", "import numpy as np\n", "\n", - "xr.set_options(display_style=\"html\") # fancy HTML repr\n", + "# limit the amount of information printed to screen\n", + "xr.set_options(display_expand_data=False)\n", + "np.set_printoptions(threshold=10, edgeitems=2)\n", "\n", "air = (\n", " xr.tutorial.load_dataset(\"air_temperature\")\n", @@ -54,7 +82,9 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "tags": [] + }, "source": [ "The function we will apply is `np.interp` which expects 1D numpy arrays. This functionality is already implemented in xarray so we use that capability to make sure we are not making mistakes." ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "newlat = np.linspace(15, 75, 100)\n", @@ -79,7 +111,9 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "def interp1d_np(data, x, xi):\n", @@ -139,15 +173,12 @@ }, { "cell_type": "markdown", - "metadata": {}, - "source": [ - "## `exclude_dims`" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, + "metadata": { + "tags": [] + }, "source": [ + "## `exclude_dims`\n", + "\n", "\n", "```\n", "exclude_dims : set, optional\n", @@ -248,7 +279,9 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "xr.apply_ufunc(\n", @@ -273,7 +306,9 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "interped = xr.apply_ufunc(\n", @@ -298,9 +333,11 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "tags": [] + }, "source": [ - "## Vectorization with `np.vectorize`" + "## Automatic vectorization with `np.vectorize`" ] }, { @@ -334,20 +371,22 @@ " input_core_dims=[[\"lat\"], [\"lat\"], []], # list with one entry per arg\n", " output_core_dims=[[\"lat\"]],\n", " exclude_dims=set((\"lat\",)), # dimensions allowed to change size. Must be set!\n", - ")\n", - "interped[\"lat\"] = newlat # need to add this manually\n", - "xr.testing.assert_allclose(expected.isel(time=0, lon=0), interped)" + ")" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "tags": [] + }, "source": [ "That's a hard-to-interpret error but our `print` call helpfully printed the shapes of the input data: \n", "\n", " data: (10, 53, 25) | x: (25,) | xi: (100,)\n", "\n", "We see that `air` has been passed as a 3D numpy array which is not what `np.interp` expects. Instead we want to loop over all combinations of `lon` and `time`, and apply our function to each corresponding vector of data along `lat`.\n",
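Written out by hand, that looping might look like the following sketch. It assumes the `air` DataArray, `newlat`, and `interp1d_np` defined earlier in this notebook; the loop order and the output array layout are illustrative choices.

```python
import numpy as np

# Loop over every (time, lon) combination and apply the 1D function to the
# vector of data along the core dimension `lat`.
out = np.empty((air.time.size, air.lon.size, len(newlat)))
for t in range(air.time.size):
    for lon in range(air.lon.size):
        out[t, lon, :] = interp1d_np(
            air.isel(time=t, lon=lon).data,  # 1D data along `lat`
            air.lat.data,  # original coordinate
            newlat,  # new coordinate
        )
```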
+ "\n", + "\n", "`apply_ufunc` makes this easy by specifying `vectorize=True`:\n", "\n", "    vectorize : bool, optional\n", @@ -357,10 +396,10 @@ "     almost always slower than supplying a pre-vectorized function.\n", "     Using this option requires NumPy version 1.12 or newer.\n", "    \n", - "Also see the documentation for `np.vectorize`:👆
Click the Dashboard link above. Or click the \"Search\" button in the dashboard.\n", - "\n", - "Let's test that the dashboard is working..\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4419e2a6-9ff7-4d33-b6da-243210c34a37", - "metadata": {}, - "outputs": [], - "source": [ - "import dask.array\n", - "\n", - "dask.array.ones((1000, 4), chunks=(2, 1)).compute() # should see activity in dashboard" - ] - }, - { - "cell_type": "markdown", - "id": "e6abae80-004a-481f-9b1a-c476de951ef0", - "metadata": {}, - "source": [ - "Let's open a dataset. We specify `chunks` so that we create a dask arrays for the DataArrays" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b74da273-e603-442f-9970-ef3eb17a3ad9", - "metadata": {}, - "outputs": [], - "source": [ - "ds = xr.tutorial.open_dataset(\"air_temperature\", chunks={\"time\": 100})\n", - "ds" - ] - }, - { - "cell_type": "markdown", - "id": "a6b21b09-3ebb-4efb-9c57-44bf587ba92d", - "metadata": {}, - "source": [ - "## A simple example\n", - "\n", - "All the concepts from applying numpy functions carry over.\n", - "\n", - "However the handling of dask arrays needs to be explicitly activated." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "26800228-97cc-4f35-baf8-a5face577543", - "metadata": { - "tags": [ - "raises-exception" - ] - }, - "outputs": [], - "source": [ - "# Expect an error here\n", - "def squared_error(x, y):\n", - " return (x - y) ** 2\n", - "\n", - "\n", - "xr.apply_ufunc(squared_error, ds.air, 1)" - ] - }, - { - "cell_type": "markdown", - "id": "28f4f80d-37cf-416a-a32d-9186c8df5f2d", - "metadata": {}, - "source": [ - "There are two options for the `dask` kwarg.\n", - "\n", - "1. `dask=\"allowed\"` Dask arrays are passed to the user function. This is a good\n", - " choice if your function can handle dask arrays and won't call compute\n", - " explicitly.\n", - "2. `dask=\"parallelized\"`. This applies the user function over blocks of the dask\n", - " array using `dask.array.blockwise`. This is useful when your function cannot\n", - " handle dask arrays natively (e.g. scipy API).\n", - "\n", - "Since `squared_error` can handle dask arrays without computing them, we specify\n", - "`dask=\"allowed\"`.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4e0806da-b6ea-4618-8394-b2e888f4c556", - "metadata": {}, - "outputs": [], - "source": [ - "sqer = xr.apply_ufunc(\n", - " squared_error,\n", - " ds.air,\n", - " 1,\n", - " dask=\"allowed\",\n", - ")\n", - "sqer # dask-backed DataArray! with nice metadata!" - ] - }, - { - "cell_type": "markdown", - "id": "c1418c06-e1f8-4db8-bdf5-a4e23f6524e1", - "metadata": {}, - "source": [ - "Let's again use the wrapper trick to understand what `squared_error` receives.\n", - "\n", - "We see that it receives a dask array." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7d9a5eb4-c6e9-445f-87bb-5bcaa653439f", - "metadata": {}, - "outputs": [], - "source": [ - "def wrapper(x, y):\n", - " print(f\"received x of type {type(x)}, shape {x.shape}\")\n", - " print(f\"received y of type {type(y)}\")\n", - " return squared_error(x, y)\n", - "\n", - "\n", - "xr.apply_ufunc(wrapper, ds.air, 1, dask=\"allowed\")" - ] - }, - { - "cell_type": "markdown", - "id": "ce1b0d1c-8bf3-4ddb-aa5f-e2ca70415ad6", - "metadata": {}, - "source": [ - "## Reductions and core dimensions\n", - "\n", - "`squared_error` operated on a per-element basis. 
How about a reduction like `np.mean`?\n", - "\n", - "Such functions involve the concept of \"core dimensions\". One way to think about core dimensions is to consider the smallest dimensionality of data necessary to apply the function.\n", - "\n", - "For using more complex operations that consider some array values collectively,\n", - "it’s important to understand the idea of **core dimensions**. \n", - "Usually, they correspond to the fundamental dimensions over\n", - "which an operation is defined, e.g., the summed axis in `np.sum`. A good clue\n", - "that core dimensions are needed is the presence of an `axis` argument on the\n", - "corresponding NumPy function.\n", - "\n", - "With `apply_ufunc`, core dimensions are recognized by name, and then moved to\n", - "the last dimension of any input arguments before applying the given function.\n", - "This means that for functions that accept an `axis` argument, you usually need\n", - "to set `axis=-1`\n", - "\n", - "Let's use `dask.array.mean` as an example of a function that can handle dask\n", - "arrays and uses an `axis` kwarg\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "447f79cd-8ee2-416f-9ebe-b1b4317583d0", - "metadata": {}, - "outputs": [], - "source": [ - "def time_mean(da):\n", - " return xr.apply_ufunc(\n", - " dask.array.mean,\n", - " da,\n", - " input_core_dims=[[\"time\"]],\n", - " dask=\"allowed\",\n", - " kwargs={\"axis\": -1}, # core dimensions are moved to the end\n", - " )\n", - "\n", - "\n", - "time_mean(ds.air)" - ] - }, - { - "cell_type": "markdown", - "id": "7952ceb1-8402-41d1-8813-31228c3e9ae6", - "metadata": {}, - "source": [ - "Again, this identical to the built-in `mean`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6c343d01-cdf6-4e48-a349-606f06c4c251", - "metadata": {}, - "outputs": [], - "source": [ - "ds.air.mean(\"time\").identical(time_mean(ds.air))" - ] - }, - { - "cell_type": "markdown", - "id": "8abf7aa7-e1e6-4935-9d59-58d4f587135c", - "metadata": {}, - "source": [ - "## Automatically parallelizing dask-unaware functions\n", - "\n", - "A very useful `apply_ufunc` feature is the ability to apply arbitrary functions\n", - "in parallel to each block. This ability can be activated using\n", - "`dask=\"parallelized\"`. Again xarray needs a lot of extra metadata, so depending\n", - "on the function, extra arguments such as `output_dtypes` and `output_sizes` may\n", - "be necessary.\n", - "\n", - "We will use `scipy.integrate.trapz` as an example of a function that cannot\n", - "handle dask arrays and requires a core dimension. 
If we call `trapz` with a dask\n", - "array, we get a numpy array back that is, the values have been eagerly computed.\n", - "This is undesirable behaviour\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "43a04e83-c061-4e14-80d9-b73d6c36981a", - "metadata": {}, - "outputs": [], - "source": [ - "import scipy as sp\n", - "import scipy.integrate\n", - "\n", - "sp.integrate.trapz(ds.air.data, axis=ds.air.get_axis_num(\"lon\")) # does NOT return a dask array" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8faef8dd-13b0-46c6-a19a-08e5e095fd4c", - "metadata": {}, - "outputs": [], - "source": [ - "xr.apply_ufunc(\n", - " sp.integrate.trapz,\n", - " ds,\n", - " input_core_dims=[[\"lon\"]],\n", - " kwargs={\"axis\": -1},\n", - " dask=\"parallelized\",\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6b375c8a-b7ef-47a6-b007-1fc2f34a2cf5", - "metadata": {}, - "outputs": [], - "source": [ - "client.close()" - ] - }, - { - "cell_type": "markdown", - "id": "71492531-4eda-4c47-86e5-dad033c22751", - "metadata": {}, - "source": [ - "## More\n", - "\n", - "1. https://docs.xarray.dev/en/stable/examples/apply_ufunc_vectorize_1d.html\n", - "2. https://docs.dask.org/en/latest/array-best-practices.html\n" - ] - } - ], - "metadata": { - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb index a5d9e32e..68fd8eb8 100644 --- a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb @@ -3,21 +3,35 @@ { "cell_type": "markdown", "id": "7a8ff28a-4be3-469f-8cf4-9297e71cc4ca", - "metadata": {}, + "metadata": { + "tags": [] + }, "source": [ - "