Temporal Collocation of irregularly sampled timeseries

Satellite observations usually have an irregular temporal sampling pattern (intervals between 6-36 hours), which is mostly controlled by the orbit of the satellite and the instrument measurement geometry. On the other hand, in-situ instruments or land surface models generally sample on regular time intervals (commonly every 1, 3, 6, 12 or 24 hours). In order to compute error/performance statistics (such as RMSD, bias, correlation) between the time series coming from different sources, it is required that observation pairs (or triplets, etc.) are found which (nearly) coincide in time.

A simple way to identify such pairs is by using a nearest neighbor search. First, one time series needs to be selected as temporal reference (i.e. all other time series will be matched to this reference) and second, a tolerance window (typically around 1-12 hours) has to be defined characterizing the temporal correlation of neighboring observation (i.e. observations outside of the tolerance window are no longer be considered as representative neighbors).

Temporal collocation in pytesmo

Pytesmo contains the function pytesmo.temporal_matching.temporal_collocation for temporally collocating timeseries. Currently, it implements nearest neighbour matching and a windowed mean. It requires a reference index (can also be a DataFrame or a Series), a DataFrame (or Series) to be collocated, and a window.

collocated = temporal_collocation(reference, input_frame, window)

The window argument corresponds to the time intervals that are included in the nearest neighbour search in each direction, e.g. if the reference time is \(t\) and the window \(\Delta\), the nearest neighbour inside \([t-\Delta, t+\Delta]\) is returned. If no neighbour is found np.nan is used as replacement. NaNs can be dropped from the returned dataframe by providing the optional keyword argument dropna=True to the function.

Below are two simple examples which demonstrate the usage. The first example assumes that the index of data to be collocated is shifted by 3 hours with respect to the reference, while using a 6 hour window. The second example uses an index that is randomly shifted by \(\pm12\) hours with respect to the reference. The second example also uses a 6 hour window, which results in some missing values in the resulting dataframe.

[47]:
import numpy as np
import pandas as pd

from pytesmo.temporal_matching import temporal_collocation, combined_temporal_collocation
[48]:
# create reference time series
ref = pd.date_range("2020-01-01", "2020-12-31", freq="D")
# temporal_collocation can also take a DataFrame or Series as reference input,
# in case their index is a DatetimeIndex.

# create other time series as dataframe
values = np.random.randn(len(ref), 3)
shifted = pd.DataFrame(values, index=ref + pd.Timedelta(hours=3),
                       columns=list(map(lambda x: f"shifted_{x}", range(3))))
random_shift = np.random.uniform(-12, 12, len(ref))
random = pd.DataFrame(values, index=ref + pd.to_timedelta(random_shift, "H"),
                      columns=list(map(lambda x: f"random_{x}", range(3))))

shifted.head()
[48]:
shifted_0 shifted_1 shifted_2
2020-01-01 03:00:00 -0.687795 -0.626649 0.237109
2020-01-02 03:00:00 -0.514778 -1.981137 0.354644
2020-01-03 03:00:00 -0.600629 -0.761766 0.169777
2020-01-04 03:00:00 -0.650058 -0.548499 0.548560
2020-01-05 03:00:00 1.331785 -1.611482 -1.325902
[49]:
random.head()
[49]:
random_0 random_1 random_2
2019-12-31 21:00:12.990837600 -0.687795 -0.626649 0.237109
2020-01-02 03:18:11.512674000 -0.514778 -1.981137 0.354644
2020-01-03 04:15:20.703038399 -0.600629 -0.761766 0.169777
2020-01-03 22:47:25.851343200 -0.650058 -0.548499 0.548560
2020-01-05 01:02:46.090482000 1.331785 -1.611482 -1.325902

We can now match the shifted timeseries to the reference index by using a 6-hour window, either for a nearest neighbour search, or for taking a windowed mean. Both should return unchanges timeseries, except for the index.

[50]:
# match the regularly shifted data
window = pd.Timedelta(hours=6)
matched_shifted_nn = temporal_collocation(ref, shifted, window, method="nearest")
matched_shifted_mean = temporal_collocation(ref, shifted, window, method="mean")

# the data should be the same before and after matching for both methods
assert np.all(shifted.values == matched_shifted_nn.values)
assert np.all(shifted.values == matched_shifted_mean.values)

matched_shifted_nn


[50]:
shifted_0 shifted_1 shifted_2
2020-01-01 -0.687795 -0.626649 0.237109
2020-01-02 -0.514778 -1.981137 0.354644
2020-01-03 -0.600629 -0.761766 0.169777
2020-01-04 -0.650058 -0.548499 0.548560
2020-01-05 1.331785 -1.611482 -1.325902
... ... ... ...
2020-12-27 -1.030740 -0.520146 -0.049081
2020-12-28 -1.221772 -0.770726 0.585857
2020-12-29 0.420476 -2.558351 -0.797550
2020-12-30 0.146506 1.014053 -0.629454
2020-12-31 -1.296130 2.000808 -1.084081

366 rows × 3 columns

We can do the same for the randomly shifted timeseries. Here we should see some changes, because sometimes there’s no value inside the window that we are looking at. However, the result of mean and nearest neighbour should be the same

[51]:
# match the randomly shifted data
matched_random_mean = temporal_collocation(ref, random, window, method="mean")
matched_random_nn = temporal_collocation(ref, random, window, method="nearest")

# the data should be the same as before matching at the locations where the shift
# was below 6 hours, and should be np.nan when shift was larger
should_be_nan = np.abs(random_shift) > 6

assert np.all(matched_random_nn[~should_be_nan].values == random[~should_be_nan].values)
assert np.all(np.isnan(matched_random_nn[should_be_nan].values))

for c in matched_random_nn.columns:
    df = pd.DataFrame(index=range(366),
                      data={'mean': matched_random_mean[c].values,
                            'nn': matched_random_nn[c].values}).dropna()
    np.testing.assert_almost_equal(
        df.diff(axis=1).iloc[:, 1].values,
        np.zeros(df.index.size),
        decimal=4)

matched_random_nn
[51]:
random_0 random_1 random_2
2020-01-01 -0.687795 -0.626649 0.237109
2020-01-02 -0.514778 -1.981137 0.354644
2020-01-03 -0.600629 -0.761766 0.169777
2020-01-04 -0.650058 -0.548499 0.548560
2020-01-05 1.331785 -1.611482 -1.325902
... ... ... ...
2020-12-27 -1.030740 -0.520146 -0.049081
2020-12-28 -1.221772 -0.770726 0.585857
2020-12-29 0.420476 -2.558351 -0.797550
2020-12-30 0.146506 1.014053 -0.629454
2020-12-31 -1.296130 2.000808 -1.084081

366 rows × 3 columns

Returning the original index

temporal_collocation can also return the original index of the data that was matched as a separate column in the resulting DataFrame, if required, and can additionally also calculate the distance to the reference. The column names are “index_other” and “distance_other”, respectively.

[52]:
# also return original index and distance
matched_shifted = temporal_collocation(
    ref, shifted, window,
    method="nearest",
    return_index=True,
    return_distance=True
)

# the index should be the same as unmatched, and the distance should be 3  hours
assert np.all(matched_shifted["index_other"].values == shifted.index.values)
assert np.all(matched_shifted["distance_other"] == pd.Timedelta(hours=3))

matched_shifted
[52]:
shifted_0 shifted_1 shifted_2 index_other distance_other
2020-01-01 -0.687795 -0.626649 0.237109 2020-01-01 03:00:00 0 days 03:00:00
2020-01-02 -0.514778 -1.981137 0.354644 2020-01-02 03:00:00 0 days 03:00:00
2020-01-03 -0.600629 -0.761766 0.169777 2020-01-03 03:00:00 0 days 03:00:00
2020-01-04 -0.650058 -0.548499 0.548560 2020-01-04 03:00:00 0 days 03:00:00
2020-01-05 1.331785 -1.611482 -1.325902 2020-01-05 03:00:00 0 days 03:00:00
... ... ... ... ... ...
2020-12-27 -1.030740 -0.520146 -0.049081 2020-12-27 03:00:00 0 days 03:00:00
2020-12-28 -1.221772 -0.770726 0.585857 2020-12-28 03:00:00 0 days 03:00:00
2020-12-29 0.420476 -2.558351 -0.797550 2020-12-29 03:00:00 0 days 03:00:00
2020-12-30 0.146506 1.014053 -0.629454 2020-12-30 03:00:00 0 days 03:00:00
2020-12-31 -1.296130 2.000808 -1.084081 2020-12-31 03:00:00 0 days 03:00:00

366 rows × 5 columns

Flags

Satellite data often contains flags indicating quality issues with the data. With temporal_collocation it is possible to use this information. Flags can either be provided as array (of the same length as the input DataFrame), or the name of a column in the DataFrame to be used as flag can be provided as string. Any non-zero flag is interpreted as indicating invalid data. By default this will not be used, but when passing use_invalid=True, the invalid values will be used in case no valid match was found.

For the following example, we reuse the input data shifted by 3 hours, but we will now assume that the first 3 observations had quality issues.

[53]:
# flag the first 3 observations as invalid
flag = np.zeros(len(ref), dtype=bool)
flag[0:3] = True
flag[0:10]
[53]:
array([ True,  True,  True, False, False, False, False, False, False,
       False])
[54]:
matched_flagged = temporal_collocation(ref, shifted, window, flag=flag)

# the first 3 values should be NaN, otherwise the result should be the same as matched_shifted
assert np.all(np.isnan(matched_flagged.values[0:3, :]))
assert np.all(matched_flagged.values[3:, :] == matched_shifted.values[3:, 0:3])  # excluding additonal columns
matched_flagged
[54]:
shifted_0 shifted_1 shifted_2
2020-01-01 NaN NaN NaN
2020-01-02 NaN NaN NaN
2020-01-03 NaN NaN NaN
2020-01-04 -0.650058 -0.548499 0.548560
2020-01-05 1.331785 -1.611482 -1.325902
... ... ... ...
2020-12-27 -1.030740 -0.520146 -0.049081
2020-12-28 -1.221772 -0.770726 0.585857
2020-12-29 0.420476 -2.558351 -0.797550
2020-12-30 0.146506 1.014053 -0.629454
2020-12-31 -1.296130 2.000808 -1.084081

366 rows × 3 columns

[55]:
# This also works when the flag is already in the input frame, but note that
# in the output frame the nonzero flag values are replaced by NaN
flagged = shifted.assign(my_flag=flag)
matched_flagged = temporal_collocation(ref, flagged, window, flag="my_flag")

# the first 3 values should be NaN, otherwise the result should be the same as matched_shifted
assert np.all(np.isnan(matched_flagged.iloc[0:3, 0:3].values))
assert np.all(matched_flagged.iloc[3:, 0:3].values == matched_shifted.values[3:, 0:3])  # excluding additonal columns
matched_flagged
[55]:
shifted_0 shifted_1 shifted_2 my_flag
2020-01-01 NaN NaN NaN NaN
2020-01-02 NaN NaN NaN NaN
2020-01-03 NaN NaN NaN NaN
2020-01-04 -0.650058 -0.548499 0.548560 False
2020-01-05 1.331785 -1.611482 -1.325902 False
... ... ... ... ...
2020-12-27 -1.030740 -0.520146 -0.049081 False
2020-12-28 -1.221772 -0.770726 0.585857 False
2020-12-29 0.420476 -2.558351 -0.797550 False
2020-12-30 0.146506 1.014053 -0.629454 False
2020-12-31 -1.296130 2.000808 -1.084081 False

366 rows × 4 columns

Combined collocation

It is also possible to match multiple timeseries together against a reference dataset using the function pytesmo.temporal_matching.combined_temporal_collocation. With the keyword argument combined_dropna it’s possible to drop data where one of the input datasets has missing values from all other input datasets.

[56]:
combined_match = combined_temporal_collocation(ref, (random, shifted), window, combined_dropna=True)
# matched dataframe should have same length as matched_random_nn without NaNs
assert len(combined_match == len(matched_random_nn.dropna()))
combined_match
[56]:
random_0 random_1 random_2 shifted_0 shifted_1 shifted_2
2020-01-01 -0.687795 -0.626649 0.237109 -0.687795 -0.626649 0.237109
2020-01-02 -0.514778 -1.981137 0.354644 -0.514778 -1.981137 0.354644
2020-01-03 -0.600629 -0.761766 0.169777 -0.600629 -0.761766 0.169777
2020-01-04 -0.650058 -0.548499 0.548560 -0.650058 -0.548499 0.548560
2020-01-05 1.331785 -1.611482 -1.325902 1.331785 -1.611482 -1.325902
... ... ... ... ... ... ...
2020-12-27 -1.030740 -0.520146 -0.049081 -1.030740 -0.520146 -0.049081
2020-12-28 -1.221772 -0.770726 0.585857 -1.221772 -0.770726 0.585857
2020-12-29 0.420476 -2.558351 -0.797550 0.420476 -2.558351 -0.797550
2020-12-30 0.146506 1.014053 -0.629454 0.146506 1.014053 -0.629454
2020-12-31 -1.296130 2.000808 -1.084081 -1.296130 2.000808 -1.084081

366 rows × 6 columns