|
6 | 6 | "source": [
|
7 | 7 | "# Xarray's Data structures\n",
|
8 | 8 | "\n",
|
9 |
| - "In this lesson, we cover the basics of Xarray data structures. Our\n", |
10 |
| - "learning goals are as follows. By the end of the lesson, we will be able to:\n", |
| 9 | + "In this lesson, we cover the basics of Xarray data structures. By the end of the lesson, we will be able to:\n", |
11 | 10 | "\n",
|
12 |
| - "- Understand the basic data structures (`DataArray` and `Dataset` objects) in Xarray\n", |
13 |
| - "\n", |
14 |
| - "---\n", |
15 |
| - "\n", |
16 |
| - "## Introduction\n", |
17 |
| - "\n", |
18 |
| - "Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called “tensors”)\n", |
19 |
| - "are an essential part of computational science. They are encountered in a wide\n", |
20 |
| - "range of fields, including physics, astronomy, geoscience, bioinformatics,\n", |
21 |
| - "engineering, finance, and deep learning. In Python, [NumPy](https://numpy.org/)\n", |
22 |
| - "provides the fundamental data structure and API for working with raw ND arrays.\n", |
23 |
| - "However, real-world datasets are usually more than just raw numbers; they have\n", |
24 |
| - "labels which encode information about how the array values map to locations in\n", |
25 |
| - "space, time, etc.\n", |
26 |
| - "\n", |
27 |
| - "Here is an example of how we might structure a dataset for a weather forecast:\n", |
28 |
| - "\n", |
29 |
| - "<img src=\"https://docs.xarray.dev/en/stable/_images/dataset-diagram.png\" align=\"center\" width=\"80%\">\n", |
30 |
| - "\n", |
31 |
| - "You'll notice multiple data variables (temperature, precipitation), coordinate\n", |
32 |
| - "variables (latitude, longitude), and dimensions (x, y, t). We'll cover how these\n", |
33 |
| - "fit into Xarray's data structures below.\n", |
34 |
| - "\n", |
35 |
| - "Xarray doesn’t just keep track of labels on arrays – it uses them to provide a\n", |
36 |
| - "powerful and concise interface. For example:\n", |
37 |
| - "\n", |
38 |
| - "- Apply operations over dimensions by name: `x.sum('time')`.\n", |
39 |
| - "\n", |
40 |
| - "- Select values by label (or logical location) instead of integer location:\n", |
41 |
| - " `x.loc['2014-01-01']` or `x.sel(time='2014-01-01')`.\n", |
42 |
| - "\n", |
43 |
| - "- Mathematical operations (e.g., `x - y`) vectorize across multiple dimensions\n", |
44 |
| - " (array broadcasting) based on dimension names, not shape.\n", |
45 |
| - "\n", |
46 |
| - "- Easily use the split-apply-combine paradigm with groupby:\n", |
47 |
| - " `x.groupby('time.dayofyear').mean()`.\n", |
48 |
| - "\n", |
49 |
| - "- Database-like alignment based on coordinate labels that smoothly handles\n", |
50 |
| - " missing values: `x, y = xr.align(x, y, join='outer')`.\n", |
51 |
| - "\n", |
52 |
| - "- Keep track of arbitrary metadata in the form of a Python dictionary:\n", |
53 |
| - " `x.attrs`.\n", |
54 |
| - "\n", |
55 |
| - "The N-dimensional nature of xarray’s data structures makes it suitable for\n", |
56 |
| - "dealing with multi-dimensional scientific data, and its use of dimension names\n", |
57 |
| - "instead of axis labels (`dim='time'` instead of `axis=0`) makes such arrays much\n", |
58 |
| - "more manageable than the raw numpy ndarray: with xarray, you don’t need to keep\n", |
59 |
| - "track of the order of an array’s dimensions or insert dummy dimensions of size 1\n", |
60 |
| - "to align arrays (e.g., using np.newaxis).\n", |
61 |
| - "\n", |
62 |
| - "The immediate payoff of using xarray is that you’ll write less code. The\n", |
63 |
| - "long-term payoff is that you’ll understand what you were thinking when you come\n", |
64 |
| - "back to look at it weeks or months later.\n" |
| 11 | + ":::{admonition} Learning Goals\n", |
| 12 | + "- Understand the basic Xarray data structures `DataArray` and `Dataset` \n", |
| 13 | + "- Customize the display of Xarray data structures\n", |
| 14 | + "- Understand the connection between Pandas and Xarray data structures\n", |
| 15 | + ":::" |
65 | 16 | ]
|
66 | 17 | },
|
67 | 18 | {
|
|
72 | 23 | "\n",
|
73 | 24 | "Xarray provides two data structures: the `DataArray` and `Dataset`. The\n",
|
74 | 25 | "`DataArray` class attaches dimension names, coordinates and attributes to\n",
|
75 |
| - "multi-dimensional arrays while `Dataset` combines multiple arrays.\n", |
| 26 | + "multi-dimensional arrays while `Dataset` combines multiple DataArrays.\n", |
76 | 27 | "\n",
|
77 | 28 | "Both classes are most commonly created by reading data.\n",
|
78 |
| - "To learn how to create a DataArray or Dataset manually, see the [Creating Data Structures](01.1_creating_data_structures.ipynb) tutorial.\n", |
79 |
| - "\n", |
80 |
| - "Xarray has a few small real-world tutorial datasets hosted in this GitHub repository https://github.com/pydata/xarray-data.\n", |
81 |
| - "We'll use the [xarray.tutorial.load_dataset](https://docs.xarray.dev/en/stable/generated/xarray.tutorial.open_dataset.html#xarray.tutorial.open_dataset) convenience function to download and open the `air_temperature` (National Centers for Environmental Prediction) Dataset by name." |
| 29 | + "To learn how to create a DataArray or Dataset manually, see the [Creating Data Structures](01.1_creating_data_structures.ipynb) tutorial." |
82 | 30 | ]
|
83 | 31 | },
|
84 | 32 | {
|
|
88 | 36 | "outputs": [],
|
89 | 37 | "source": [
|
90 | 38 | "import numpy as np\n",
|
91 |
| - "import xarray as xr" |
| 39 | + "import xarray as xr\n", |
| 40 | + "import pandas as pd\n", |
| 41 | + "\n", |
| 42 | + "# When working in a Jupyter Notebook, you might want to customize Xarray display settings to your liking.\n", |
| 43 | + "# The following settings reduce the amount of data displayed by default.\n", |
| 44 | + "xr.set_options(display_expand_attrs=False, display_expand_data=False)\n", |
| 45 | + "np.set_printoptions(threshold=10, edgeitems=2)" |
92 | 46 | ]
|
93 | 47 | },
|
94 | 48 | {
|
|
97 | 51 | "source": [
|
98 | 52 | "### Dataset\n",
|
99 | 53 | "\n",
|
100 |
| - "`Dataset` objects are dictionary-like containers of DataArrays, mapping a variable name to each DataArray.\n" |
| 54 | + "`Dataset` objects are dictionary-like containers of DataArrays, mapping a variable name to each DataArray.\n", |
| 55 | + "\n", |
| 56 | + "Xarray has a few small real-world tutorial datasets hosted in this GitHub repository https://github.com/pydata/xarray-data.\n", |
| 57 | + "We'll use the [xarray.tutorial.load_dataset](https://docs.xarray.dev/en/stable/generated/xarray.tutorial.open_dataset.html#xarray.tutorial.open_dataset) convenience function to download and open the `air_temperature` (National Centers for Environmental Prediction) Dataset by name." |
101 | 58 | ]
|
102 | 59 | },
|
103 | 60 | {
|
|
147 | 104 | "cell_type": "markdown",
|
148 | 105 | "metadata": {},
|
149 | 106 | "source": [
|
150 |
| - "#### What is all this anyway? (String representations)\n", |
| 107 | + "#### HTML vs text representations\n", |
151 | 108 | "\n",
|
152 | 109 | "Xarray has two representation types: `\"html\"` (which is only available in\n",
|
153 | 110 | "notebooks) and `\"text\"`. To choose between them, use the `display_style` option.\n",
|
154 | 111 | "\n",
|
155 | 112 | "So far, our notebook has automatically displayed the `\"html\"` representation (which we will continue using).\n",
|
156 |
| - "The `\"html\"` representation is interactive, allowing you to collapse sections (left arrows) and\n", |
157 |
| - "view attributes and values for each value (right hand sheet icon and data symbol)." |
| 113 | + "The `\"html\"` representation is interactive, allowing you to collapse sections (▶) and\n", |
| 114 | + "view attributes and values for each variable (📄 and ≡)." |
158 | 115 | ]
|
159 | 116 | },
|
160 | 117 | {
|
|
171 | 128 | "cell_type": "markdown",
|
172 | 129 | "metadata": {},
|
173 | 130 | "source": [
|
174 |
| - "The output consists of:\n", |
| 131 | + "☝️ From top to bottom the output consists of:\n", |
175 | 132 | "\n",
|
176 |
| - "- a summary of all *dimensions* of the `Dataset` `(lat: 25, time: 2920, lon: 53)`: this tells us that the first\n", |
177 |
| - " dimension is named `lat` and has a size of `25`, the second dimension is named\n", |
178 |
| - " `time` and has a size of `2920`, and the third dimension is named `lon` and has a size\n", |
179 |
| - " of `53`. Because we will access the dimensions by name, the order doesn't matter.\n", |
180 |
| - "- an unordered list of *coordinates* or dimensions with coordinates with one item\n", |
181 |
| - " per line. Each item has a name, one or more dimensions in parentheses, a dtype\n", |
182 |
| - " and a preview of the values. Also, if it is a dimension coordinate, it will be\n", |
183 |
| - " marked with a `*`.\n", |
184 |
| - "- an alphabetically sorted list of *dimensions without coordinates* (if there are any)\n", |
185 |
| - "- an unordered list of *attributes*, or metadata" |
| 133 | + "- **Dimensions**: summary of all *dimensions* of the `Dataset` `(lat: 25, time: 2920, lon: 53)`: this tells us that the first dimension is named `lat` and has a size of `25`, the second dimension is named `time` and has a size of `2920`, and the third dimension is named `lon` and has a size of `53`. Because we will access the dimensions by name, the order doesn't matter.\n", |
| 134 | + "- **Coordinates**: an unordered list of *coordinates* or dimensions with coordinates with one item per line. Each item has a name, one or more dimensions in parentheses, a dtype and a preview of the values. Also, if it is a dimension coordinate, it will be printed in **bold** font. *Non-dimension coordinates* appear in plain font (there are none in this example, but you might imagine a 'mask' coordinate that has a value assigned at every point).\n", |
| 135 | + "- **Data variables**: the name of each nD *measurement* in the dataset, followed by its dimensions `(time, lat, lon)`, dtype, and a preview of values.\n", |
| 136 | + "- **Indexes**: Each dimension with coordinates is backed by an \"Index\". In this example, each dimension is backed by a `PandasIndex`.\n", |
| 137 | + "- **Attributes**: an unordered list of metadata (for example, a paragraph describing the dataset)" |
186 | 138 | ]
|
187 | 139 | },
|
188 | 140 | {
|
|
379 | 331 | "methods on `xarray` objects:\n"
|
380 | 332 | ]
|
381 | 333 | },
|
382 |
| - { |
383 |
| - "cell_type": "code", |
384 |
| - "execution_count": null, |
385 |
| - "metadata": {}, |
386 |
| - "outputs": [], |
387 |
| - "source": [ |
388 |
| - "import pandas as pd" |
389 |
| - ] |
390 |
| - }, |
391 | 334 | {
|
392 | 335 | "cell_type": "code",
|
393 | 336 | "execution_count": null,
|
|
429 | 372 | "cell_type": "markdown",
|
430 | 373 | "metadata": {},
|
431 | 374 | "source": [
|
432 |
| - "**<code>to_series</code>**: This will always convert `DataArray` objects to\n", |
433 |
| - "`pandas.Series`, using a `MultiIndex` for higher dimensions\n" |
| 375 | + "### to_series\n", |
| 376 | + "This will always convert `DataArray` objects to `pandas.Series`, using a `MultiIndex` for higher dimensions.\n" |
434 | 377 | ]
|
435 | 378 | },
|
436 | 379 | {
|
|
446 | 389 | "cell_type": "markdown",
|
447 | 390 | "metadata": {},
|
448 | 391 | "source": [
|
449 |
| - "**<code>to_dataframe</code>**: This will always convert `DataArray` or `Dataset`\n", |
450 |
| - "objects to a `pandas.DataFrame`. Note that `DataArray` objects have to be named\n", |
451 |
| - "for this.\n" |
| 392 | + "### to_dataframe\n", |
| 393 | + "\n", |
| 394 | + "This will always convert `DataArray` or `Dataset` objects to a `pandas.DataFrame`. Note that `DataArray` objects have to be named for this. Since columns in a `DataFrame` need to have the same index, they are\n", |
| 395 | + "broadcast." |
452 | 396 | ]
|
453 | 397 | },
|
454 | 398 | {
|
|
459 | 403 | "source": [
|
460 | 404 | "ds.air.to_dataframe()"
|
461 | 405 | ]
|
462 |
| - }, |
463 |
| - { |
464 |
| - "cell_type": "markdown", |
465 |
| - "metadata": {}, |
466 |
| - "source": [ |
467 |
| - "Since columns in a `DataFrame` need to have the same index, they are\n", |
468 |
| - "broadcasted.\n" |
469 |
| - ] |
470 |
| - }, |
471 |
| - { |
472 |
| - "cell_type": "code", |
473 |
| - "execution_count": null, |
474 |
| - "metadata": {}, |
475 |
| - "outputs": [], |
476 |
| - "source": [ |
477 |
| - "ds.to_dataframe()" |
478 |
| - ] |
479 | 406 | }
|
480 | 407 | ],
|
481 | 408 | "metadata": {
|
|
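The `to_series` and `to_dataframe` cells touched in this diff can be tried without downloading the tutorial data. The following is a minimal sketch on a small synthetic array: the names `da`, `ds`, `air` and `elevation` and all values are invented for illustration and are not part of the notebook. It shows the `MultiIndex` produced for a 2-D `DataArray`, and how a variable with fewer dimensions is repeated so that every column of the resulting `DataFrame` shares one index.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 2 x 3 array standing in for the tutorial's `ds.air`;
# the values and coordinate labels are made up for illustration.
da = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("time", "lat"),
    coords={
        "time": pd.date_range("2013-01-01", periods=2),
        "lat": [10.0, 20.0, 30.0],
    },
    name="air",  # to_dataframe() needs a named DataArray
)

# Series indexed by a (time, lat) MultiIndex, one row per array element.
series = da.to_series()

# DataFrame with the same MultiIndex and a single column named "air".
frame = da.to_dataframe()

# A Dataset whose variables have different dimensions: "elevation"
# depends only on lat, so to_dataframe() repeats it for each time step.
ds = xr.Dataset({"air": da, "elevation": ("lat", [1.0, 2.0, 3.0])})
wide = ds.to_dataframe()

print(series.index.names)
print(frame.columns.tolist())
print(wide)
```

Printing `wide` makes the broadcasting visible: each `elevation` value appears once per `time` label.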