# Creating Data Structures

In [None]:
import numpy as np
import pandas as pd
import xarray as xr

xr.set_options(display_expand_data=False)

rng = np.random.default_rng(seed=0)  # we'll use this later

In the last lecture, we looked at the following example Dataset. In most cases, Xarray Datasets are created by reading a file; we'll address that in the next lecture. Here we'll learn how to create Xarray objects from scratch.

In [None]:
ds = xr.tutorial.load_dataset("air_temperature")
ds

## DataArray

The `DataArray` class is used to attach a name, dimension names, labels, and
attributes to an array.

Our goal will be to recreate the `ds.air` DataArray, starting with the underlying NumPy data.

In [None]:
ds.air

In [None]:
array = ds.air.data

We do this using the [DataArray](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.html) _constructor_.

In [None]:
xr.DataArray(array)

This works, but notice that the default dimension names are not very useful: `dim_0`, `dim_1`, `dim_2`.


### Dimension Names

We can change this by specifying dimension names, in the same order as the array's axes, using the `dims` kwarg.

In [None]:
xr.DataArray(array, dims=("time", "lat", "lon"))

Much better! But notice we have no entries under "Coordinates".
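
To see the effect of `dims` without loading the tutorial data, here is a minimal sketch on a small synthetic array. The array `small` and its shape are placeholders chosen for illustration; they are not part of the lecture's dataset.

```python
import numpy as np
import xarray as xr

# Small synthetic array standing in for the air temperature data (illustrative only)
small = np.zeros((4, 3, 2))

unnamed = xr.DataArray(small)
named = xr.DataArray(small, dims=("time", "lat", "lon"))

print(unnamed.dims)  # default names
print(named.dims)    # names we supplied, in axis order

# Named dimensions can be used instead of axis numbers
print(named.mean("time").dims)

# The number of names must match the number of array axes
mismatch_raised = False
try:
    xr.DataArray(small, dims=("time", "lat"))
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```

Once dimensions have names, reductions such as `mean("time")` no longer depend on remembering which axis is which.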

### Coordinates

While associating names with dimensions (or axes) of an array is quite useful, attaching coordinate labels to DataArrays makes a lot of analysis quite convenient.

First we'll simply add values for `lon` using the `coords` kwarg. For this dataset, longitudes are regularly spaced at 2.5° intervals between 200°E and 330°E.

`coords` takes a dictionary that maps the name of a dimension to one of
- another `DataArray` object
- a tuple of the form `(dims, data, attrs)` where `attrs` is optional. This is
  roughly equivalent to creating a new `DataArray` object with
  `DataArray(dims=dims, data=data, attrs=attrs)`
- a `numpy` array (or anything that can be coerced to one using `numpy.array`).

We'll start with the last one

In [None]:
lon_values = np.arange(200, 331, 2.5)
xr.DataArray(array, dims=("time", "lat", "lon"), coords={"lon": lon_values})
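
The stop value of 331 may look odd: `np.arange` excludes its stop value, so the stop has to overshoot 330 for the last longitude to be included. As a sketch of an alternative (not the approach used in this lecture), `np.linspace` takes both endpoints plus a count, which here is (330 - 200) / 2.5 + 1 = 53.

```python
import numpy as np

# np.arange excludes the stop value, so we overshoot 330 slightly
lon_arange = np.arange(200, 331, 2.5)

# np.linspace includes both endpoints; 53 points gives a 2.5 degree spacing
lon_linspace = np.linspace(200, 330, 53)

print(lon_arange.size, lon_arange[0], lon_arange[-1])
print(np.allclose(lon_arange, lon_linspace))
```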

Assigning a plain NumPy array is equivalent to creating a DataArray with those values and the same dimension name.

In [None]:
lon_da = xr.DataArray(lon_values, dims="lon")
da = xr.DataArray(array, dims=("time", "lat", "lon"), coords={"lon": lon_da})
da
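
The three forms that `coords` accepts can be compared side by side. The sketch below uses a small synthetic array; the names `data`, `from_array`, `from_tuple`, and `from_da`, and the `units` attribute, are made up for illustration.

```python
import numpy as np
import xarray as xr

data = np.zeros((2, 3))
lon_values = np.array([200.0, 202.5, 205.0])

# 1. a plain NumPy array
from_array = xr.DataArray(data, dims=("lat", "lon"), coords={"lon": lon_values})

# 2. a (dims, data, attrs) tuple; attrs is optional
from_tuple = xr.DataArray(
    data,
    dims=("lat", "lon"),
    coords={"lon": ("lon", lon_values, {"units": "degrees_east"})},
)

# 3. another DataArray
from_da = xr.DataArray(
    data,
    dims=("lat", "lon"),
    coords={"lon": xr.DataArray(lon_values, dims="lon")},
)

# .equals compares values, dimensions and coordinates, but not attributes
print(from_array.equals(from_tuple), from_array.equals(from_da))
print(from_tuple.lon.attrs)
```

Only the tuple and `DataArray` forms can carry attributes for the coordinate, which is why the tuple form is handy when metadata is needed.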

We can also assign coordinates after a DataArray has been created.

In [None]:
da.coords["lat"] = np.arange(75, 14.9, -2.5)
da
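
Coordinate labels are what make label-based selection possible. The following sketch, on a small synthetic array with made-up values, compares selecting by label with `.sel` against selecting by integer position with `.isel`.

```python
import numpy as np
import xarray as xr

data = np.arange(6.0).reshape(2, 3)
da_small = xr.DataArray(data, dims=("lat", "lon"), coords={"lon": [200.0, 202.5, 205.0]})

# Assign the lat coordinate after construction, as in the cell above
da_small.coords["lat"] = [75.0, 72.5]

by_label = da_small.sel(lat=72.5, lon=202.5)  # uses coordinate values
by_position = da_small.isel(lat=1, lon=1)     # uses integer positions

print(by_label.item(), by_position.item())
```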

### Attributes 

Arbitrary attributes can be assigned using the `.attrs` property

In [None]:
da.attrs["attribute"] = "hello"
da

or specified in the constructor

In [None]:
da2 = xr.DataArray(
    array, dims=("time", "lat", "lon"), coords={"lon": lon_da}, attrs={"attribute": "hello"}
)
da2

### Non-dimension coordinates

Sometimes we want to attach additional coordinate variables along an existing dimension. In the output of the cell below, notice that `itime`
1. is not bolded and
2. has a name, "itime", that is different from its dimension name, "time".

`itime` is an example of a non-dimension coordinate variable, i.e. a coordinate variable whose name does not match a dimension name. Here we demonstrate the "tuple" form of assignment: `(dims, data, attrs)`.

In [None]:
da.coords["itime"] = ("time", np.arange(2920), {"name": "value"})
da
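
One practical consequence, sketched here on a small synthetic array: a non-dimension coordinate appears in `.coords` but does not get an index, so it is not listed in `.indexes`. The array and its coordinate values below are illustrative assumptions.

```python
import numpy as np
import xarray as xr

da_small = xr.DataArray(
    np.zeros((4, 2)),
    dims=("time", "lon"),
    coords={"time": [10, 20, 30, 40], "lon": [200.0, 202.5]},
)

# Non-dimension coordinate: lives along "time" but is named "itime"
da_small.coords["itime"] = ("time", np.arange(4), {"name": "value"})

print(list(da_small.coords))   # includes itime
print(list(da_small.indexes))  # only dimension coordinates
print(da_small.itime.dims)

# swap_dims promotes itime to the dimension coordinate
swapped = da_small.swap_dims({"time": "itime"})
print(swapped.dims)
```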

### Exercises

Create a `DataArray` named "height" from random data `rng.random((180, 360)) * 400`

1. with dimensions named "latitude" and "longitude"


In [None]:
xr.DataArray(rng.random((180, 360)) * 400, dims=("latitude", "longitude"), name="height")

2. with dimension coordinates:

- "latitude": -90 to 89 with step size 1
- "longitude": -180 to 179 with step size 1


In [None]:
xr.DataArray(
    rng.random((180, 360)) * 400,
    dims=("latitude", "longitude"),
    coords={"latitude": np.arange(-90, 90, 1), "longitude": np.arange(-180, 180, 1)},
    name="height",
)

3. with metadata for both data and coordinates:

- height: "type": "ellipsoid"
- latitude: "type": "geodetic"
- longitude: "prime_meridian": "greenwich"


In [None]:
xr.DataArray(
    rng.random((180, 360)) * 400,
    dims=("latitude", "longitude"),
    coords={
        "latitude": ("latitude", np.arange(-90, 90, 1), {"type": "geodetic"}),
        "longitude": (
            "longitude",
            np.arange(-180, 180, 1),
            {"prime_meridian": "greenwich"},
        ),
    },
    attrs={"type": "ellipsoid"},
    name="height",
)

## Dataset

`Dataset` objects collect multiple data variables, each with possibly different
dimensions.

The constructor of `Dataset` takes three parameters:

- `data_vars`: dict-like mapping names to values. Values are either `DataArray` objects
  or tuples of the form `(dims, data)` or `(dims, data, attrs)`.
- `coords`: same as for `DataArray`
- `attrs`: same as for `DataArray`

Creating an empty Dataset is easy!

In [None]:
xr.Dataset()
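
As a minimal sketch of all three parameters together, the toy Dataset below holds two variables with different dimensions. The variable names, values, and attributes are invented for illustration.

```python
import numpy as np
import xarray as xr

ds_small = xr.Dataset(
    data_vars={
        # (dims, data) tuple
        "temperature": (("time", "lon"), np.zeros((3, 2))),
        # (dims, data, attrs) tuple; this variable only varies along lon
        "elevation": ("lon", np.array([10.0, 20.0]), {"units": "m"}),
    },
    coords={"lon": [200.0, 202.5]},
    attrs={"title": "toy example"},
)

print(dict(ds_small.sizes))
print(ds_small["elevation"].dims, ds_small["elevation"].attrs)
```

Variables that share a dimension name must agree on its length, which is how the Dataset keeps `temperature` and `elevation` aligned along `lon`.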

### Data Variables

Let's create a `Dataset` with two data variables, `air` and `air2`, built from `da` and `da2`.

In [None]:
ds = xr.Dataset({"air": da, "air2": da2})
ds

You can also assign a new data variable directly.

In [None]:
ds["air3"] = da
ds
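
Direct assignment accepts the same kinds of values as the constructor. A small sketch with made-up variable names `a`, `b`, and `c`:

```python
import numpy as np
import xarray as xr

ds_small = xr.Dataset({"a": ("x", np.arange(3))})

# Assign from a (dims, data) tuple
ds_small["b"] = ("x", np.array([10, 20, 30]))

# Assign from a DataArray, here one computed from an existing variable
ds_small["c"] = ds_small["a"] * 2

print(sorted(ds_small.data_vars))
print(ds_small["c"].values)
```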

### Coordinates

Coordinate variables can be assigned using the `coords` kwarg to `xr.Dataset`. Here we use `date_range` from pandas to create a time vector.

In [None]:
xr.Dataset(
    {"air": da, "air2": da2},
    coords={"time": pd.date_range("2013-01-01", "2014-12-31 18:00", freq="6h")},
)

Again we can assign coordinate variables after a Dataset has been created.

In [None]:
ds

In [None]:
ds.coords["time"] = pd.date_range("2013-01-01", "2014-12-31 18:00", freq="6h")
ds
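
The time vector has to match the length of the `time` dimension: two years of 6-hourly steps is 730 × 4 = 2920 values. The sketch below checks that count and then shows, on a short synthetic series, what a datetime coordinate allows. Passing a `pd.Timedelta` as `freq` is a choice made here to avoid depending on frequency alias spelling; it is not what the lecture cells use.

```python
import numpy as np
import pandas as pd
import xarray as xr

six_hours = pd.Timedelta(hours=6)

# Same range as in the lecture: two non-leap years of 6-hourly timestamps
full_range = pd.date_range("2013-01-01", "2014-12-31 18:00", freq=six_hours)
print(len(full_range))

# Two days of synthetic 6-hourly data (illustrative values)
time = pd.date_range("2013-01-01", periods=8, freq=six_hours)
ds_small = xr.Dataset({"air": ("time", np.arange(8.0))})
ds_small.coords["time"] = time

# A date string selects every timestamp that falls on that day
first_day = ds_small.sel(time="2013-01-01")
print(first_day.sizes["time"])
```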

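### Attributes

As with a `DataArray`, attributes can be specified in the constructor using the `attrs` kwarg, or assigned afterwards through the `.attrs` property.
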
In [None]:
xr.Dataset(
    {"air": da, "air2": da2},
    coords={"time": pd.date_range("2013-01-01", "2014-12-31 18:00", freq="6h")},
    attrs={"key0": "value0"},
)

In [None]:
ds.attrs["key"] = "value"
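
Dataset-level attributes are kept separately from the attributes of each variable. A minimal sketch, with an invented `units` attribute on a toy variable:

```python
import numpy as np
import xarray as xr

ds_small = xr.Dataset(
    {"air": ("x", np.zeros(3), {"units": "K"})},  # variable-level attrs
    attrs={"key0": "value0"},                     # Dataset-level attrs
)
ds_small.attrs["key"] = "value"

print(ds_small.attrs)
print(ds_small["air"].attrs)
```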

### Exercises

1. Create a Dataset with two variables along `latitude` and `longitude`:
   `height` and `gravity_anomaly`


In [None]:
height = rng.random((180, 360)) * 400
gravity_anomaly = rng.random((180, 360)) * 400 - 200

In [None]:
xr.Dataset(
    {
        "height": (("latitude", "longitude"), height),
        "gravity_anomaly": (("latitude", "longitude"), gravity_anomaly),
    }
)

2. add coordinates to `latitude` and `longitude`:

- `latitude`: from -90 to 89 with step size 1
- `longitude`: from -180 to 179 with step size 1


In [None]:
xr.Dataset(
    {
        "height": (("latitude", "longitude"), height),
        "gravity_anomaly": (("latitude", "longitude"), gravity_anomaly),
    },
    coords={
        "latitude": ("latitude", np.arange(-90, 90, 1)),
        "longitude": ("longitude", np.arange(-180, 180, 1)),
    },
)

3. add metadata to coordinates and variables:

- `latitude`: "type": "geodetic"
- `longitude`: "prime_meridian": "greenwich"
- `height`: "ellipsoid": "wgs84"
- `gravity_anomaly`: "ellipsoid": "grs80"


In [None]:
xr.Dataset(
    {
        "height": (("latitude", "longitude"), height, {"ellipsoid": "wgs84"}),
        "gravity_anomaly": (("latitude", "longitude"), gravity_anomaly, {"ellipsoid": "grs80"}),
    },
    coords={
        "latitude": ("latitude", np.arange(-90, 90, 1), {"type": "geodetic"}),
        "longitude": (
            "longitude",
            np.arange(-180, 180, 1),
            {"prime_meridian": "greenwich"},
        ),
    },
)