PyNIO Extended Selection

As of version 1.3.0b1 PyNIO supports an extended selection mechanism:

data = f.variables[varname][selection]

where selection is either:

a standard NumPy multi-dimensional slicing object
a string representing an extended selection in coordinate and/or index space

The standard slicing object is discussed in the main PyNIO documentation. This page documents the capabilities and syntax of extended selection using the string representation.

Syntax
Examples
Detailed Description
Known issues

Syntax

Using named dimensions:: '<axis_name>|[<coord_name>|]<coord_selection> <axis_name>|[<coord_name>|]<coord_selection> ... '
Using positional syntax:: '<coord_selection> <coord_selection> ...'

where

<axis_name>

name of the coordinate axis (dimension)

|

bar separator character

<coord_name>|

name of multi-dimensional auxiliary coordinate variable followed by bar separator character (optional)

<coord_selection>

<prefix><selection><postfix>

<prefix> is one of:

None: <selection> is in native coordinate space
i: <selection> is in index space

<selection> is one of:

#: A scalar (single element specified as an index or a coordinate value).
#,#,..: A vector (multiple individual elements within the index or coordinate range).
#:#:# or #:#:i#: A slice (start:stop:step). This includes all the normal short-form slice variations such as ':' (all elements), 'start:' (all elements beginning with the 'start' coordinate or index value), etc. Unlike normal Python slices, both the lower and upper bounds (the 'start' and 'stop' coordinate or index values) are included in the selection, whether in coordinate or index space. However, when using coordinate slicing, if a specified bound falls between the coordinate values at adjacent indexes, only elements within the range are included in the selection. The optional i prefix for the step value indicates that, while the start and stop values are coordinate space values, the step is in index space. The step value must be explicitly specified in order to distinguish the step prefix from the postfix i that indicates interpolation mode. When interpolation mode is in effect, all three slice values are required.
Note that '#' represents a number (possibly with a decimal point) with an optional multiplier (e.g. 10k is 10000). Valid multipliers are k (10**3), M (10**6), h (3600), m (60), H (100).
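
The multiplier notation can be illustrated with a few lines of Python. This is only a sketch of how such tokens could be expanded; the function name expand_number is hypothetical and the parsing logic is an assumption, not PyNIO's own code. Only the multiplier table is taken from the note above.

```python
# Multiplier suffixes as listed in the note above. The parsing logic is an
# illustrative assumption, not PyNIO's actual implementation.
MULTIPLIERS = {"k": 10**3, "M": 10**6, "h": 3600, "m": 60, "H": 100}


def expand_number(token):
    """Convert a token such as '10k' or '1.5h' to a float."""
    if token and token[-1] in MULTIPLIERS:
        return float(token[:-1]) * MULTIPLIERS[token[-1]]
    return float(token)


print(expand_number("10k"))   # 10000.0
print(expand_number("1.5h"))  # 5400.0
print(expand_number("975H"))  # 97500.0
```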

<postfix> is one of:

None: Default setting: interpolate data if using indirect selection via a multi-dimensional coordinate; otherwise use the nearest index.
i: Interpolate data to exact location(s).
n: Do not interpolate. Use data from nearest index.
m: Return a masked array with missing values if any selection coordinates are outside data coordinate boundaries, using default interpolation setting.
mi: Return a masked array with missing values if any selection coordinates are outside data coordinate boundaries, interpolating data to exact location.
mn: Return a masked array with missing values if any selection coordinates are outside data coordinate boundaries, without interpolation. Use data from the nearest index.

Examples

The following extended selection examples are all based on a NetCDF file that is available in the example directory. The example 'nio05.py' prints out the results from each of these selections. The file was created using data from a GFS GRIB2 file. The variable and dimension names were shortened and simplified for clarity, and the data was sampled to reduce the file to an easily manageable size.

Given:
tmp = f.variables['tmp']
where tmp.dimensions is ('time','lev','lat','lon')

time has length 7 with coordinates [ 0,  3,  6,  9, 12, 15] 
    (units: 'hours since 11/15/2006 12:00')
lev has length 9 with coordinates  [1000, 5000, 15000, 30000, 45000, 60000, 75000, 90000, 97500]
    (units: 'Pa')
lat has length 61 with coordinates stepping by -3 from 90 to -90 
    (units: 'degrees_north')
lon has length 120 with coordinates stepping by 3 from 0 to 357
    (units: 'degrees_east')

Get temperature for the first time step and levels 1000 and 100000, latitude 60 and longitude 100-120. Use positional syntax; the closest value to the specified coordinate is selected:

a = tmp['i0 1000,100000 60 100:120']

Same thing but specify the dimensions by name:

a = tmp['time|i0 lev|1000,100000 lat|60 lon|100:120']

Now rearrange the dimension order. For the longitudes 100 and 120 get all the level values as the rightmost dimension:

a = tmp['time|i0 lat|60 lon|100,120 lev|:']

Suppose you need to set some of the values programmatically using variables that have been defined in your code. Although you cannot directly introduce variables into the specification string, you can use Python's string formatting syntax to put the values in the correct location in the string. For instance, suppose you have variables minlon and maxlon that you want to insert into the previous example. Here is one way:

a = tmp['time|i0 lat|60 lon|%f,%f lev|:' % (minlon,maxlon)]
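
When several values are set programmatically, a small helper can keep the string assembly readable. The following is a sketch only: selection_string is a hypothetical helper, not part of PyNIO, and it simply produces text in the named-dimension syntax. It uses %g rather than %f so that 100 is written as '100' instead of '100.000000'.

```python
def selection_string(**axes):
    """Join axis=selection pairs into 'axis|selection ...' form.

    Keyword order determines the order of the dimensions in the string.
    This helper is hypothetical; it only builds the text.
    """
    return " ".join("%s|%s" % (name, sel) for name, sel in axes.items())


minlon, maxlon = 100, 120
spec = selection_string(time="i0", lat=60,
                        lon="%g,%g" % (minlon, maxlon), lev=":")
print(spec)  # time|i0 lat|60 lon|100,120 lev|:
```

The resulting string would then be used as the subscript, as in tmp[spec].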

Interpolate the level values from 0 to 100000 in steps of 10000. Use 'k' as a short form multiplier of 1000. Note that a minor amount of extrapolation can occur near the limits of the coordinate range:

a = tmp['time|i0 lat|60 lon|100,120 lev|0:100k:10ki']

Interpolate the level values from 0 to 120000 in steps of 10000. Values outside the extrapolation range get set to the bounding value:

a = tmp['time|i0 lat|60 lon|100,120 lev|0:120k:10ki']

Interpolate the level values from 0 to 120000 in steps of 10000. Use the 'm' flag to indicate that values outside the bounding array should be set to missing values:

a = tmp['time|i0 lat|60 lon|100,120 lev|0:120k:10kmi']

Using positional syntax, get temperature for the first time step and level, latitudes 30 - 40 and longitude 100. Note that the latitude coordinates are in descending order north to south. The default stride is the coordinate spacing between the first 2 elements -- if the coordinate values are descending the default spacing is negative. In this case it is -3.0:

a = tmp['i0 i0 40:30 100']

Make the latitude values go south to north. Since the spacing is known to be 3 degrees, and since the default spacing is negative, a positive spacing value of 3 is used to step in the opposite direction:

a = tmp['i0 i0 30:40:3 100']

Or alternatively use a negative index step (prefixing the step value with 'i'). In index space reversing the order always means a negative step:

a = tmp['i0 i0 30:40:i-1 100']

Use the geopotential height variable in the file to get temperature at constant geopotential height for two time steps:

a = tmp['time|0,3 lev|hgt|1500 lat|50,60 lon|237:252']

Indirect indexing uses interpolation by default; use 'n' suffix to turn off interpolation. This is useful for examining how the process works. If you examine the height and temperature variable carefully you will note that 1500 meters (geopotential height) shifts from closer to level 75000 to closer to 90000 between longitudes 240 and 243:

a = tmp['time|0,3 lev|hgt|1500n lat|60 lon|237:252']

Detailed Description

PyNIO's extended slicing capability was originally developed by Juerg Schmidli of NCAR. The development team would like to acknowledge his major contribution to the project. The primary feature of the new code is the ability to specify slices of multidimensional data both in coordinate and index space. But it has other capabilities as well, including especially the ability to interpolate along any axis using step sizes that are arbitrary fractions of the original spacing, essentially implementing a rudimentary regridding facility. It also provides dimension-reordering (transposition) based on the order of named dimensions in the specification string, as well as vector subscripting in either coordinate or index space.

The string representation is specified using dimension (axis) names or positional syntax. If dimension names are used all specified dimensions must be named. If a dimension name is omitted, all elements of the omitted dimension are returned. For selection using named dimensions, the syntax consists of a dimension name followed by the vertical bar (|) character followed by a selection specification. The selection specifications for each dimension are separated by white space.
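
To make the structure of a named-dimension string concrete, the sketch below splits one into its parts. It is an illustration of the syntax only: parse_named_selection is a hypothetical function, not PyNIO's parser, and it does not interpret the prefix, postfix, or multipliers inside each selection.

```python
def parse_named_selection(spec):
    """Split a named-dimension selection string into its parts.

    Returns a list of (axis_name, aux_coord_name_or_None, coord_selection).
    Illustrative only; PyNIO's real parser is not reproduced here.
    """
    parsed = []
    for item in spec.split():          # selections are separated by white space
        fields = item.split("|")       # parts of a selection are separated by '|'
        if len(fields) == 2:
            axis, selection = fields
            aux = None
        elif len(fields) == 3:
            axis, aux, selection = fields
        else:
            raise ValueError("expected axis|[coord|]selection, got %r" % item)
        parsed.append((axis, aux, selection))
    return parsed


parts = parse_named_selection("time|0,3 lev|hgt|1500 lat|50,60 lon|237:252")
for part in parts:
    print(part)
```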

Coordinate and index space selection

Coordinate space selection depends on a convention widely adopted for file formats such as NetCDF, where associated with the named dimensions in the file are one-dimensional coordinate variables that have the same name as the dimension. Published conventions for NetCDF, particularly COARDS and more recently CF, encourage or may even require this association. With this convention, each element of the coordinate variable locates in coordinate space the corresponding element of the data variable along the dimension with the given name. By default, values in the selection specification are assumed to represent coordinate values.

The i prefix indicates the selection along the dimension is specified in index space. However, if a dimension does not have an associated coordinate variable in the file, the prefix becomes optional. It might help to imagine that, in the absence of a coordinate variable, the fall-back default coordinates become the array indexes. Fractional index values are accepted. Unless interpolation is specified, these are rounded to the nearest integer value using NumPy rounding rules.
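
One consequence of NumPy rounding rules is that values exactly halfway between two integers round to the nearest even integer. Python's built-in round() follows the same round-half-to-even rule as numpy.round, so it can be used to preview the effect without NumPy. This is an illustration of the rounding rule only; it assumes PyNIO applies it as stated above.

```python
# round() uses round-half-to-even, the same rule as numpy.round/numpy.rint.
fractional_indexes = (1.4, 2.5, 3.5, 4.6)
rounded = [round(x) for x in fractional_indexes]

for x, r in zip(fractional_indexes, rounded):
    print(x, "->", r)
# 1.4 -> 1
# 2.5 -> 2
# 3.5 -> 4
# 4.6 -> 5
```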

Scalar, vector, and slice selection types

Both the coordinate and index space selection modes support three forms of selection: scalar, vector, and slice. Scalar selection is specified with a single value. Vector selection uses a comma-separated list of values. When not in interpolation mode both scalar and vector selection choose values at the coordinate locations nearest the specified values.
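
Nearest-value selection can be sketched as follows, using the lev coordinates from the examples. The function nearest_index is hypothetical and is not taken from PyNIO; in particular, how PyNIO resolves a value exactly midway between two coordinates is not covered by this sketch (here the lower index wins).

```python
def nearest_index(coords, value):
    """Return the index of the coordinate closest to value.

    Ties go to the lower index; that tie rule is an assumption.
    """
    return min(range(len(coords)), key=lambda i: abs(coords[i] - value))


lev = [1000, 5000, 15000, 30000, 45000, 60000, 75000, 90000, 97500]
picked = [lev[nearest_index(lev, v)] for v in (1000, 100000)]
print(picked)  # [1000, 97500]
```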

Slice selection uses a syntax similar to standard Python and NumPy slices, with start, stop, and step values separated by the colon (:) character. Together these values define a sub-range of the dimension with an optional sampling interval. As with normal Python slicing syntax, in many cases one or two of the colon-separated values can be omitted. Ellipsis is not supported, however.

When a range is specified using slicing syntax, the start and stop values represent included boundaries, whether in coordinate or in index space. This is unlike standard Python or NumPy slicing, where the stop value is an excluded bound, but this difference is required for coordinate space selection to make sense. Additionally it enables the implementation of the interpolation capability both in coordinate and index space. However, note that when using non-interpolated coordinate space selection, unless exact values contained in the coordinate array are chosen, the selected coordinates will lie inside the bounds at both ends of the selected range.
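
The inclusive-bounds behaviour for non-interpolated coordinate slices can be sketched in plain Python. The function inclusive_coord_range is a hypothetical illustration rather than PyNIO code; it uses the lat axis from the examples and shows that a start value of 40, which is not a coordinate value, yields a selection beginning at 39.

```python
def inclusive_coord_range(coords, start, stop):
    """Indices of coordinates lying on or between start and stop."""
    low, high = min(start, stop), max(start, stop)
    return [i for i, c in enumerate(coords) if low <= c <= high]


lat = list(range(90, -91, -3))  # 90, 87, ..., -90 (61 values)
idx = inclusive_coord_range(lat, 40, 30)
print([lat[i] for i in idx])  # [39, 36, 33, 30]
```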

When using coordinate space selection, the step value, if included, represents by default a spacing in the units of the coordinate axis. If the step is omitted, it defaults to the quantity required to step between adjacent values along the axis (however, see the caveat concerning coordinate space subscripting). If the coordinate elements are descending, then the default spacing is negative, and the start value must be greater than the stop value. When using coordinate space selection, the spacing value itself may be prefixed with the i character to indicate that the spacing value specifies a step in index space.
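
The relationship between a coordinate-space step and an index-space step can be sketched as below. The conversion shown (dividing the requested step by the spacing of the first two coordinate values) is an assumption made for illustration, and index_stride is a hypothetical name; it is not taken from the PyNIO source.

```python
def index_stride(coords, coord_step):
    """Translate a coordinate-space step into an index-space stride.

    The default spacing is taken from the first two coordinate values.
    The division-and-round conversion is an illustrative assumption.
    """
    default_spacing = coords[1] - coords[0]
    return round(coord_step / default_spacing)


lat = list(range(90, -91, -3))  # descending axis, default spacing -3
print(index_stride(lat, -3))  # 1
print(index_stride(lat, 3))   # -1
print(index_stride(lat, -6))  # 2
```

Under this assumption, a coordinate step of 3 on the descending latitude axis corresponds to an index step of -1.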

PyNIO supports axis reversal both in coordinate space and in index space. To reverse an axis using coordinate values, the start and stop values must be exchanged and the sign of the step reversed. If the coordinate values are descending in the forward direction, reversing them requires the start value to be less than the stop value and either a positive coordinate step value or a negative index space step. Index space axis reversal works as in NumPy or in standard Python.

Indirect selection using a multi-dimensional auxiliary coordinate variable

A multi-dimensional variable may be used as an auxiliary coordinate variable to specify the coordinate selection indirectly. For example, the data variable may have its vertical coordinate specified in units of pressure. However, the user may desire to have a selection based on geopotential height. If a geopotential height variable is available in the file that has the same pressure coordinate dimension as the data variable, then the geopotential height variable can serve as an auxiliary coordinate that allows the selection of the elements of the data variable corresponding to specific heights.

Interpolation

By default, no interpolation occurs except when using indirect selection with an auxiliary coordinate variable. For scalar or vector selections, the element closest to the selected coordinate location is chosen. For slice selections, only elements that fall on or inside the range boundaries are returned.

If interpolation is enabled then a basic linear interpolation along the axis is performed using the two data elements on each side of the desired coordinate location. Interpolation is possible either in index or coordinate space. Also, if the data location is outside the coordinate bounds but within half the spacing between coordinate points, extrapolation may be performed.
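
Interpolation between the two neighbouring elements along one axis can be sketched as follows. This is a minimal illustration with made-up sample values; interpolate_along_axis is a hypothetical function, and the extrapolation and out-of-range handling described above are deliberately left out.

```python
def interpolate_along_axis(coords, values, target):
    """Linearly interpolate values at target between its two neighbours.

    coords may be ascending or descending and must be distinct.
    Targets outside the coordinate range are rejected in this sketch.
    """
    for i in range(len(coords) - 1):
        left, right = coords[i], coords[i + 1]
        if min(left, right) <= target <= max(left, right):
            weight = (target - left) / (right - left)
            return values[i] + weight * (values[i + 1] - values[i])
    raise ValueError("target lies outside the coordinate range")


lev = [1000, 5000, 15000, 30000]
sample = [220.0, 215.0, 210.0, 225.0]  # made-up values for illustration
print(interpolate_along_axis(lev, sample, 10000))  # 212.5
```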

Dimension reordering

Positional syntax assumes that selections for each axis are given in the order of the variable's dimensions. In this case dimension reordering is not possible. However, when dimensions are named, the order of appearance in the string indicates the order desired for the output array. If the order is not the same as the dimension order of the variable as it appears in the file, the array will be transposed to the specified order. (See the note concerning dimension reordering.)
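
The transposition implied by a named-dimension order can be sketched as a permutation of axis numbers. The function transpose_order is hypothetical and, for simplicity, assumes every dimension of the variable is named exactly once; it says nothing about how PyNIO places omitted dimensions.

```python
def transpose_order(var_dims, requested_dims):
    """Return the axis permutation that reorders var_dims as requested."""
    if sorted(var_dims) != sorted(requested_dims):
        raise ValueError("this sketch requires every dimension to be named once")
    return tuple(var_dims.index(name) for name in requested_dims)


var_dims = ("time", "lev", "lat", "lon")
order = transpose_order(var_dims, ("time", "lat", "lon", "lev"))
print(order)  # (0, 2, 3, 1)
```

A permutation of this form is what numpy.transpose accepts as its axes argument.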

Aliasing dimension names using the axis attribute

If any of a file's coordinate variables have an axis attribute with one of the CF-compliant axis names 'T', 'Z', 'Y', or 'X' as values, then these names can serve as convenient short aliases for the actual dimension names in the selection specification string, provided that the Nio option UseAxisAttribute is set True. Note however, that if this option is set True and the axis attribute exists for a coordinate variable, then the name given by the axis attribute must be used; using the real dimension name will raise an exception.

Known issues

The coordinate space subscripting step size is calculated based on the first two elements of the coordinate array. When the coordinate axis values are irregularly spaced and interpolation is not used, non-default coordinate space step sizes can lead to strange results. Either use interpolation or use an index space step value.
Dimension reordering does not currently work when an extended selection string contains an indirect selection using an auxiliary multi-dimensional coordinate variable.