netCDF and Xarray
Multidimensional arrays in climate science
● Metadata is a crucial requirement for scientific data
● What can you tell from the tas.npy dataset?
○ Global metadata - Institution? Model? Experiment?
○ Time - Units? Calendar (365-day or 360-day)? Leap years?
○ Temperature - Units? Missing values? Mean, median, max/min? Point or cell?
Multidimensional arrays in climate science
● “Information about a document rather than document content” (Raggett, Le Hors
& Jacobs, 1999)
○ Units, coordinate systems, model and institution information...
● File name can provide metadata
Multidimensional arrays in climate science
● Open tas.npy, what can you say from it?
● We can use NumPy ‘np.save’ and ‘np.load’ to share our
multidimensional arrays but...
○ no metadata is available
○ no coordinates are available
○ disk representation is not optimal
■ C-contiguous (row-major) or Fortran-contiguous (column-major)
■ chunked
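The limitations above can be seen in a minimal sketch. The array here is a hypothetical stand-in for tas.npy (its shape, values and file name are assumptions made for illustration), written to a temporary directory so nothing is left on disk.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for tas.npy: 3 time steps on a 4 x 8 lat/lon grid.
tas = np.full((3, 4, 8), 288.0, dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tas_demo.npy")  # illustrative file name
    np.save(path, tas)
    loaded = np.load(path)

# Only the values, the shape and the dtype survive the round trip.
# Nothing says which axis is time, what the units are, or where the grid is.
print(loaded.shape, loaded.dtype)

# The memory layout is a property of the whole array: either row-major (C)
# or column-major (Fortran). A .npy file has no notion of chunks.
print(loaded.flags["C_CONTIGUOUS"], loaded.flags["F_CONTIGUOUS"])
tas_f = np.asfortranarray(tas)
print(tas_f.flags["C_CONTIGUOUS"], tas_f.flags["F_CONTIGUOUS"])
```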
Multidimensional arrays in climate science
We want two different things
● Data model and disk storage format for multiple chunked multidimensional
arrays with metadata (netCDF)
● Library for semantic data analysis (xarray)
netCDF
Network Common Data Form
● Developed by Unidata
● “a set of software libraries and machine-independent data formats that
support the creation, access, and sharing of array-oriented scientific data”
● C, Python and Java APIs
netCDF
Network Common Data Form
● Data Model
○ Groups, Variables, Dimensions,
Attributes, Data types
● Classic vs version 4 data models
○ Since version 4, files written in the netCDF-4 format are
valid HDF5 files
○ Alternative backends can be
implemented (NCZarr)
Multidimensional arrays in climate science
● netCDF files contain multiple multidimensional arrays (variables) and metadata
○ See the source of tas.npy
● netCDF variables can be stored on disk with a contiguous or chunked layout
● netCDF files can be opened using the netCDF4-python library
netCDF
● Well integrated within the Python ecosystem
netCDF
● Format used for international climate research projects (CMIP6, CORDEX)
○ ESGF (Earth System Grid Federation)
● To avoid huge file sizes, datasets are often split by time and by variable
[Figure: a Dataset and the Files in the Dataset]
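A dataset split this way can be put back together with xarray. The sketch below uses small in-memory pieces instead of real files; the helper `piece`, the variable names and the dates are hypothetical. For files on disk, `xr.open_mfdataset` is the usual entry point, although it needs extra dependencies such as dask.

```python
import numpy as np
import pandas as pd
import xarray as xr


def piece(name, start, periods):
    """Build a hypothetical one-variable piece with monthly time steps."""
    time = pd.date_range(start, periods=periods, freq="MS")
    values = np.arange(periods, dtype="float64")
    return xr.Dataset({name: ("time", values)}, coords={"time": time})


# Pieces split by variable (tas, pr) and by time (2015, 2016).
tas_2015 = piece("tas", "2015-01-01", 12)
tas_2016 = piece("tas", "2016-01-01", 12)
pr_2015 = piece("pr", "2015-01-01", 12)
pr_2016 = piece("pr", "2016-01-01", 12)

# Concatenate along time for each variable, then merge the variables.
tas = xr.concat([tas_2015, tas_2016], dim="time")
pr = xr.concat([pr_2015, pr_2016], dim="time")
combined = xr.merge([tas, pr])

print(combined)
```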
netCDF
CF Conventions
● Metadata that provide a definitive description of what the data in each variable
represents, and the spatial and temporal properties of the data.
● Interoperability between applications that are “CF compliant”
● Standard table of variable standard names
● See CMIP6_ScenarioMIP_CSIRO_ACCESS-ESM1-5_ssp585_r1i1p1f1_Amon again
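To illustrate what CF metadata makes possible, here is a small sketch using xarray's `decode_cf`. The values are made up; the attribute names (`standard_name`, `units`, `calendar`, `_FillValue`) are CF attributes, and the standard calendar is an assumption chosen so that no extra calendar library is needed.

```python
import numpy as np
import xarray as xr

fill = np.float32(1.0e20)  # illustrative missing-value marker

# Raw, undecoded content: plain numbers plus CF attributes.
raw = xr.Dataset(
    data_vars={
        "tas": (
            "time",
            np.array([287.5, fill, 288.1], dtype=np.float32),
            {
                "standard_name": "air_temperature",
                "units": "K",
                "_FillValue": fill,
            },
        )
    },
    coords={
        "time": (
            "time",
            np.array([0, 31, 59]),
            {"units": "days since 2015-01-01", "calendar": "standard"},
        )
    },
)

# A CF-aware tool can interpret the attributes without further hints:
# time offsets become dates and fill values become NaN.
decoded = xr.decode_cf(raw)

print(decoded["time"].values)
print(decoded["tas"].values)
```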
xarray
● Xarray introduces labels in the form of dimensions, coordinates and attributes
on top of raw NumPy-like arrays
● More intuitive, more concise, and less error-prone developer experience
● Real-world datasets are usually more than just raw numbers
○ Labels which encode information about how the array values map to locations in
space, time, etc.
xarray
● Apply operations over dimensions by name: x.sum('time')
● Select values by label (or logical location) instead of integer location:
x.loc['2014-01-01'] or x.sel(time='2014-01-01')
● Mathematical operations (e.g., x - y) vectorize across multiple dimensions (array
broadcasting) based on dimension names, not shape.
● Easily use the split-apply-combine paradigm with groupby:
x.groupby('time.dayofyear').mean()
● Database-like alignment based on coordinate labels that smoothly handles
missing values: x, y = xr.align(x, y, join='outer')
● Keep track of arbitrary metadata in the form of a Python dictionary: x.attrs
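The operations listed above can be tried on a small made-up array. In this sketch the names `x` and `offset`, the dates and the values are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2014-01-01", periods=4, freq="D")
x = xr.DataArray(
    np.arange(8.0).reshape(4, 2),
    dims=("time", "lat"),
    coords={"time": time, "lat": [-10.0, 10.0]},
    attrs={"units": "K"},  # arbitrary metadata kept in a dictionary
)

# Reduce over a dimension by name instead of by axis number.
total = x.sum("time")

# Select by label instead of by integer position.
first_day = x.sel(time="2014-01-01")
same_day = x.loc["2014-01-01"]

# Arithmetic broadcasts by dimension name: offset only has a lat dimension.
offset = xr.DataArray(
    [100.0, 200.0], dims="lat", coords={"lat": [-10.0, 10.0]}
)
diff = x - offset

# Split-apply-combine with groupby on a datetime component.
daily_mean = x.groupby("time.dayofyear").mean()

# Outer alignment on coordinate labels fills the gaps with NaN.
a, b = xr.align(
    x.isel(time=slice(0, 3)), x.isel(time=slice(1, 4)), join="outer"
)

print(total.values, diff.dims, a.sizes["time"], x.attrs)
```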
xarray
Two core data structures, which build upon and extend the core strengths of NumPy
and pandas. Both data structures are fundamentally N-dimensional: