The pytesmo validation framework

The pytesmo validation framework takes care of iterating over datasets, spatial and temporal matching, as well as scaling. It then uses metric calculators to compute metrics that are returned to the user. Several metric calculators are included in pytesmo, and new ones can easily be added.

Overview

How does the validation framework work? It makes the following assumptions about the datasets it uses:

  • The dataset readers that are used have a method for reading data (usually called read) that can be called either with a grid point index (gpi), which can be any indicator that identifies a certain grid point, or with longitude and latitude. This means that both call signatures read(gpi) and read(lon, lat) must be valid. Please check the pygeobase documentation for more details on how a fully compatible dataset class should look, but a simple read method is enough for the validation framework. This assumption can be relaxed by using the read_ts_names keyword in the pytesmo.validation_framework.data_manager.DataManager class. A minimal sketch of such a reader is shown below.

  • The reading method returns a pandas.DataFrame time series.

  • Ideally the dataset classes also have a grid attribute that is a pygeogrids grid. This makes the calculation of lookup tables straightforward and the nearest neighbor search faster.

Fortunately these assumptions hold for the dataset readers included in pytesmo.
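
As a rough sketch, a dataset class compatible with these assumptions could look like the following (the class name and the in-memory data store are hypothetical; only the read method and the grid attribute matter):

import pandas as pd
from pygeogrids.grids import BasicGrid

class MyTsReader:
    """Hypothetical reader: time series are held in a dict keyed by gpi."""

    def __init__(self, lons, lats, data_store):
        # a pygeogrids grid enables fast nearest neighbor lookups
        self.grid = BasicGrid(lons, lats)
        self.data_store = data_store

    def read(self, *args):
        # both read(gpi) and read(lon, lat) must work
        if len(args) == 1:
            gpi = args[0]
        else:
            gpi, _ = self.grid.find_nearest_gpi(*args)
        # must return a pandas.DataFrame time series
        return pd.DataFrame(self.data_store[gpi])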

It also makes a few assumptions about how to perform a validation. For a comparison study it is often necessary to choose a spatial reference grid, a temporal reference and a scaling or data space reference.

Spatial reference

The spatial reference is the dataset to which all the other datasets are matched spatially, usually through nearest neighbor search. The validation framework uses the grid points of the dataset specified as the spatial reference to spatially match all the other datasets with nearest neighbor search. Other, more sophisticated spatial matching algorithms are not implemented at the moment. If you need a more complex spatial matching, preprocessing the data is currently the only option.
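
For illustration, this is roughly what a nearest neighbor lookup on a pygeogrids grid looks like (the regular 1 degree grid here is a made-up example):

import numpy as np
from pygeogrids.grids import BasicGrid

# hypothetical regular 1 degree grid used as spatial reference
lons, lats = np.meshgrid(np.arange(-179.5, 180), np.arange(-89.5, 90))
grid = BasicGrid(lons.flatten(), lats.flatten())

# gpi of the grid point closest to an arbitrary location, plus its distance
gpi, distance = grid.find_nearest_gpi(102.13, 33.88)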

Temporal reference

The temporal reference is the dataset to which the other datasets are temporally matched. That means that for each comparison dataset the observation nearest to the reference timestamp within a certain time window is chosen. By default this is done by the temporal matching module included in pytesmo. How many datasets are matched to the reference dataset at once can be configured; we will cover how to do this later.
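
As a sketch of what this matching does, the temporal_matching module can also be called directly on two DataFrames (exact function signatures may differ slightly between pytesmo versions, so treat this as an illustration only):

import numpy as np
import pandas as pd
from pytesmo import temporal_matching

# two toy time series: an hourly reference and 3-hourly comparison data
ref = pd.DataFrame(
    {"ref_sm": np.random.rand(24)},
    index=pd.date_range("2020-01-01", periods=24, freq="h"))
other = pd.DataFrame(
    {"other_sm": np.random.rand(8)},
    index=pd.date_range("2020-01-01 00:30", periods=8, freq="3h"))

# pick the nearest observation of `other` within +/- 6 hours of each
# reference timestamp
matched = temporal_matching.temporal_collocation(
    ref, other, pd.Timedelta(6, "h"))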

Data space reference

It is often necessary to bring all the datasets into a common data space. Scaling is typically used for this, and pytesmo offers a choice of several scaling algorithms (e.g. CDF matching, min-max scaling, mean-std scaling, triple collocation based scaling). The data space reference can also be chosen independently from the other two references.
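
For example, the scaling module can be used directly to bring one column into the data space of another (here with a toy DataFrame and mean/standard deviation matching; see pytesmo.scaling for the full list of methods):

import numpy as np
import pandas as pd
import pytesmo.scaling as scaling

# toy DataFrame with two already temporally matched columns
df = pd.DataFrame({
    "ref": np.random.rand(100),
    "other": 0.5 * np.random.rand(100) + 0.2})

# scale all other columns into the data space of column 0 ("ref")
scaled = scaling.scale(df, method="mean_std", reference_index=0)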

Data Flow

After it is initialized, the validation framework works through the following steps:

  1. Read all the datasets for a certain job (gpi, lon, lat)

  2. Read all the masking datasets, if any

  3. Mask the temporal reference dataset using the masking data

  4. Temporally match all the chosen combinations of temporal reference and other datasets

  5. Turn the temporally matched time series over to the metric calculators

  6. Get the calculated metrics from the metric calculators

  7. Put all the metrics into a dictionary by dataset combination and return them.

Masking datasets

Masking datasets can be used if the datasets being compared do not themselves contain the information needed to mask them. For example, we might want to use modelled soil temperature data to mask our soil moisture observations before comparing them. For that we just need a dataset that returns a pandas.DataFrame with one column of boolean data type. Wherever the masking dataset is True, the data will be masked.
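
A masking dataset can be as simple as a small wrapper class. The following sketch (with a hypothetical soil temperature reader and column name) returns True wherever the soil is likely frozen:

import pandas as pd

class FrozenSoilMask:
    """Hypothetical masking dataset wrapping a soil temperature reader."""

    def __init__(self, temperature_reader):
        self.temperature_reader = temperature_reader

    def read(self, *args, **kwargs):
        ts = self.temperature_reader.read(*args, **kwargs)
        # one boolean column: True means "mask this observation"
        return pd.DataFrame({"frozen": ts["soil_temperature"] < 273.15})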

Let’s look at a first example.

Example soil moisture validation: ASCAT - ISMN

This example shows how to setup the pytesmo validation framework to perform a comparison between ASCAT and ISMN data.

[1]:
from pathlib import Path

import pytesmo.validation_framework.metric_calculators as metrics_calculators
from datetime import datetime
import warnings
from ascat.read_native.cdr import AscatGriddedNcTs
from ismn.interface import ISMN_Interface  # install ismn: 'pip install ismn'
from pytesmo.validation_framework.validation import Validation
from pytesmo.validation_framework.results_manager import netcdf_results_manager
from pytesmo.utils import rootdir

import tempfile

You need the test data from https://github.com/TUW-GEO/pytesmo-test-data for this example

[2]:
from tempfile import mkdtemp
output_folder = mkdtemp()
print('Data is stored in:', output_folder)
Data is stored in: /tmp/tmp_rb5_o93

First we initialize the data readers that we want to use. In this case the ASCAT soil moisture time series and in situ data from the ISMN.

Initialize ASCAT reader

[3]:
testdata_path = rootdir() / "tests" / "test-data"
ascat_data_folder = testdata_path / "sat" / "ascat" / "netcdf" / "55R22"
ascat_grid_fname = testdata_path / "sat" / "ascat" / "netcdf" / "grid" / "TUW_WARP5_grid_info_2_1.nc"
static_layer_path = testdata_path / "sat" / "h_saf" / "static_layer"


# init the AscatGriddedNcTs reader with the paths
with warnings.catch_warnings():
    warnings.filterwarnings('ignore') # some warnings are expected and ignored

    ascat_reader = AscatGriddedNcTs(
        ascat_data_folder,
        "TUW_METOP_ASCAT_WARP55R22_{:04d}",
        grid_filename=ascat_grid_fname,
        static_layer_path=static_layer_path
    )
ascat_reader.read_bulk = True

Initialize ISMN reader

[4]:
ismn_data_folder = testdata_path / "ismn/multinetwork/header_values"
meta_path = tempfile.mkdtemp()
ismn_reader = ISMN_Interface(ismn_data_folder, meta_path=meta_path)
Processing metadata for all ismn stations into folder /shares/wpreimes/home/code/pytesmo/tests/test-data/ismn/multinetwork/header_values.
This may take a few minutes, but is only done once...
Hint: Use `parallel=True` to speed up metadata generation for large datasets
Files Processed: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 53.19it/s]
Metadata generation finished after 0 Seconds.
Metadata and Log stored in /tmp/tmpro_61tg2

Found existing ismn metadata in /tmp/tmpro_61tg2/header_values.csv.

The validation is run based on jobs. A job consists of at least three elements (single values, lists or numpy arrays) specifying the grid point index, its longitude and its latitude. In the case of the ISMN we can use the dataset_ids that identify every time series in the downloaded ISMN data as our grid point index. We can then get longitude and latitude from the metadata of the dataset.

DO NOT CHANGE the name jobs because it will be searched during the parallel processing!

[5]:
jobs = []

ids = ismn_reader.get_dataset_ids(variable='soil_moisture', min_depth=0, max_depth=0.1)

for idx in ids:
    metadata = ismn_reader.read_metadata(idx)
    jobs.append((idx, metadata['longitude'].val, metadata['latitude'].val))
    break

print("Jobs (gpi, lon, lat):")
print(jobs)
Jobs (gpi, lon, lat):
[(0, 102.1333, 33.8833)]

For this example we only process one job / pixel / sensor.

It is important here that the ISMN reader has a reading function that works with just the dataset_id. This way the validation framework can go through the jobs and read the correct time series.

[6]:
data = ismn_reader.read(ids[0])
print('ISMN data example:')
print(data.head())
ISMN data example:
                     soil_moisture soil_moisture_flag soil_moisture_orig_flag
date_time
2008-07-01 00:00:00            0.5                C03                       M
2008-07-01 01:00:00            0.5                C03                       M
2008-07-01 02:00:00            0.5                C03                       M
2008-07-01 03:00:00            0.5                C03                       M
2008-07-01 04:00:00            0.5                C03                       M

Initialize the Validation class

The Validation class is the heart of the validation framework. It contains the information about which datasets to read, using which arguments or keywords, and whether they are spatially compatible. It also contains the settings about which metric calculators to use and how to perform the scaling into the reference data space. It is initialized in the following way:

[7]:
datasets = {
    'ISMN': {
        'class': ismn_reader,
        'columns': ['soil_moisture']
    },
    'ASCAT': {
        'class': ascat_reader,
        'columns': ['sm'],
        'kwargs': {'mask_frozen_prob': 80,
                   'mask_snow_prob': 80,
                   'mask_ssf': True}
    }}

The datasets dictionary contains all the information about the datasets to read. The class is the already initialized dataset reader to use. The columns key describes which columns of the dataset are of interest for the validation. This is a mandatory field telling the framework which other columns to ignore; in this case the columns soil_moisture_flag and soil_moisture_orig_flag will be ignored by the ISMN reader. We can also specify additional keywords that should be given to the read method of the dataset reader. In this case we want the ASCAT reader to mask the ASCAT soil moisture using the included frozen and snow probabilities as well as the SSF. There are also other keys that can be used here; please see the documentation for explanations.

[8]:
period = [datetime(2007, 1, 1), datetime(2014, 12, 31)]
basic_metrics = metrics_calculators.BasicMetrics(other_name='k1')

process = Validation(
    datasets, 'ISMN',
    temporal_ref='ASCAT',
    scaling='lin_cdf_match',
    scaling_ref='ASCAT',
    metrics_calculators={(2, 2): basic_metrics.calc_metrics},
    period=period)
/shares/wpreimes/home/code/pytesmo/src/pytesmo/validation_framework/validation.py:141: UserWarning: You are using the default temporal matcher. If you are using one of the newer metric calculators (PairwiseIntercomparisonMetrics, TripleCollocationMetrics) you should probably use `make_combined_temporal_matcher` instead. Have a look at the documentation of the metric calculators for more info.
  warnings.warn(

During the initialization of the Validation class we can also tell it other things that it needs to know. In this case it uses the datasets we have specified earlier. The spatial reference is the 'ISMN' dataset, which is given as the second argument. The metrics_calculators argument looks a little bit strange, so let's look at it in more detail.

It is a dictionary with a tuple as the key and a function as the value. The key tuple (n, k) has the following meaning: n datasets are temporally matched together and then given in sets of k columns to the metric calculator. The metric calculator then gets a DataFrame with the columns [‘ref’, ‘k1’, ‘k2’ …] and so on depending on the value of k. The value of (2, 2) makes sense here since we only have two datasets and all our metrics also take two inputs.

This can be used in more complex scenarios to e.g. have three input datasets that are all temporally matched together and then combinations of two input datasets are given to one metric calculator while all three datasets are given to another metric calculator. This could look like this:

{(3, 2): metric_calc,
 (3, 3): triple_collocation}

Create the variable save_path which is a string representing the path where the results will be saved. DO NOT CHANGE the name save_path because it will be searched during the parallel processing!

[9]:
save_path = output_folder

import pprint
for job in jobs:

    results = process.calc(*job)
    pprint.pprint(results)
    netcdf_results_manager(results, save_path)
{(('ASCAT', 'sm'), ('ISMN', 'soil_moisture')): {'BIAS': array([0.237454], dtype=float32),
                                                'R': array([0.4996146], dtype=float32),
                                                'RMSD': array([11.583476], dtype=float32),
                                                'gpi': array([0], dtype=int32),
                                                'lat': array([33.8833]),
                                                'lon': array([102.1333]),
                                                'n_obs': array([357], dtype=int32),
                                                'p_R': array([6.127213e-24], dtype=float32),
                                                'p_rho': array([2.471651e-28], dtype=float32),
                                                'p_tau': array([nan], dtype=float32),
                                                'rho': array([0.53934574], dtype=float32),
                                                'tau': array([nan], dtype=float32)}}
/shares/wpreimes/home/code/pytesmo/src/pytesmo/scaling.py:85: DeprecationWarning: Pytesmo method 'pytesmo.scaling.lin_cdf_match' is deprecated and will be removed soon.
  scaling_func(df[series].values, reference.values, **kwargs),

The validation is then performed by looping over all the defined jobs and storing the results. You can see that the results are a dictionary where the key is a tuple defining the exact combination of datasets and columns that was used for the calculation of the metrics. The metrics themselves are a dictionary of metric-name: numpy.ndarray, which also includes information about the gpi, lon and lat. Since all the information contained in the job is given to the metric calculator, it can be stored in the results.
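
For example, a single metric can be pulled out of the returned dictionary by using the dataset/column combination as the key:

key = (('ASCAT', 'sm'), ('ISMN', 'soil_moisture'))
print("Pearson R:", results[key]['R'][0])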

Storing the results to disk is at the moment supported by the netcdf_results_manager, which creates a netCDF file for each dataset combination and stores each metric as a variable. We can inspect the stored netCDF file, which is named after the dictionary key:

[10]:
import netCDF4
results_fname = Path(save_path) / "ASCAT.sm_with_ISMN.soil_moisture.nc"

with netCDF4.Dataset(str(results_fname)) as ds:
    for var in ds.variables:
        print(var, ds.variables[var][:])
lon [102.1333]
lat [33.8833]
idx [0]
_row_size [0]
time []
gpi [0]
n_obs [357]
R [0.4996146]
p_R [6.127213e-24]
rho [0.53934574]
p_rho [2.471651e-28]
RMSD [11.583476]
BIAS [0.237454]
tau [nan]
p_tau [nan]

Parallel processing

The same code can be executed in parallel by defining the following start_processing function.

[11]:
def start_processing(job):
    try:
        return process.calc(*job)
    except RuntimeError:
        # retry the job once if a RuntimeError occurs
        return process.calc(*job)

pytesmo.validation_framework.start_validation can then be used to run your validation in parallel. Your setup code can look like this IPython notebook, but without the loop over the jobs; otherwise the validation would be done twice. Save it into a .py file, e.g. my_validation.py.

After starting the ipyparallel cluster you can then execute the following code:

from pytesmo.validation_framework import start_validation

# Note that before starting the validation you must start a controller
# and engines, for example by using: ipcluster start -n 4
# This command will launch a controller and 4 engines on the local machine.
# Also, do not forget to change the setup_code path to your current setup.

setup_code = "my_validation.py"
start_validation(setup_code)

Masking datasets

Masking datasets are datasets that return a pandas DataFrame with boolean values. True means that the observation should be masked, False means it should be kept. All masking datasets are temporally matched in pairs to the temporal reference dataset. Only observations for which all masking datasets have a value of False are kept for further validation.

The masking datasets have the same format as the dataset dictionary and can be specified in the Validation class with the masking_datasets keyword.
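A sketch of how this could look, reusing the hypothetical FrozenSoilMask reader from the overview above (soil_temperature_reader is likewise assumed to exist):

# soil_temperature_reader is a hypothetical reader providing soil temperature
masking_datasets = {
    'frozen-mask': {
        'class': FrozenSoilMask(soil_temperature_reader),
        'columns': ['frozen']
    }}

process = Validation(
    datasets, 'ISMN',
    temporal_ref='ASCAT',
    scaling='lin_cdf_match',
    scaling_ref='ASCAT',
    metrics_calculators={(2, 2): basic_metrics.calc_metrics},
    masking_datasets=masking_datasets,
    period=period)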

Masking adapter

To easily transform an existing dataset into a masking dataset, pytesmo offers an adapter class that calls the reading method of an existing dataset and creates a masking dataset based on an operator, a given threshold, and (optionally) a column name.

[12]:
from pytesmo.validation_framework.adapters import MaskingAdapter

ds_mask = MaskingAdapter(ismn_reader, '<', 0.2, 'soil_moisture')
print(ds_mask.read(ids[0]).head())
                     soil_moisture
date_time
2008-07-01 00:00:00          False
2008-07-01 01:00:00          False
2008-07-01 02:00:00          False
2008-07-01 03:00:00          False
2008-07-01 04:00:00          False
/tmp/ipykernel_102873/3230815451.py:3: DeprecationWarning: `MaskingAdapter` is deprecated, use `SelfMaskingAdapter` or `AdvancedMaskingAdapter` instead.
  ds_mask = MaskingAdapter(ismn_reader, '<', 0.2, 'soil_moisture')

Self-masking adapter

pytesmo also has a class that masks a dataset “on-the-fly”, based on one of the columns it contains, an operator, and a threshold. In contrast to the masking adapter mentioned above, the output of the self-masking adapter is the masked data, not the mask. The self-masking adapter wraps a data reader, which must have a read_ts or read method. Alternatively, to use a method with a name other than read/read_ts, use the read_name keyword which is available for each adapter - it is still required that the method returns a pandas DataFrame! Calling the reading method will return the masked data - more precisely, a DataFrame containing only the rows where the masking condition is true.

[13]:
from pytesmo.validation_framework.adapters import SelfMaskingAdapter

ds_mask = SelfMaskingAdapter(ismn_reader, '<', 0.2, 'soil_moisture')
print(ds_mask.read(ids[0]).head())
                     soil_moisture soil_moisture_flag soil_moisture_orig_flag
date_time
2008-11-29 09:00:00           0.19            D01,D03                       M
2008-11-29 10:00:00           0.19            D01,D03                       M
2008-11-29 11:00:00           0.19            D01,D03                       M
2008-11-30 03:00:00           0.19            D01,D03                       M
2008-11-30 04:00:00           0.19            D01,D03                       M