Temporal Collocation of irregularly sampled timeseries

Satellite observations usually have an irregular temporal sampling pattern (intervals between 6-36 hours), which is mostly controlled by the orbit of the satellite and the instrument measurement geometry. On the other hand, in-situ instruments or land surface models generally sample on regular time intervals (commonly every 1, 3, 6, 12 or 24 hours). In order to compute error/performance statistics (such as RMSD, bias, correlation) between the time series coming from different sources, it is required that observation pairs (or triplets, etc.) are found which (nearly) coincide in time.

A simple way to identify such pairs is by using a nearest neighbor search. First, one time series needs to be selected as temporal reference (i.e. all other time series will be matched to this reference) and second, a tolerance window (typically around 1-12 hours) has to be defined characterizing the temporal correlation of neighboring observation (i.e. observations outside of the tolerance window are no longer be considered as representative neighbors).

Temporal collocation in pytesmo

Pytesmo contains the function pytesmo.temporal_matching.temporal_collocation for temporally collocating timeseries. Currently, it implements nearest neighbour matching and a windowed mean. It requires a reference index (can also be a DataFrame or a Series), a DataFrame (or Series) to be collocated, and a window.

collocated = temporal_collocation(reference, input_frame, window)

The window argument corresponds to the time intervals that are included in the nearest neighbour search in each direction, e.g. if the reference time is \(t\) and the window \(\Delta\), the nearest neighbour inside \([t-\Delta, t+\Delta]\) is returned. If no neighbour is found np.nan is used as replacement. NaNs can be dropped from the returned dataframe by providing the optional keyword argument dropna=True to the function.

Below are two simple examples which demonstrate the usage. The first example assumes that the index of data to be collocated is shifted by 3 hours with respect to the reference, while using a 6 hour window. The second example uses an index that is randomly shifted by \(\pm12\) hours with respect to the reference. The second example also uses a 6 hour window, which results in some missing values in the resulting dataframe.

[47]:

import numpy as np
import pandas as pd

from pytesmo.temporal_matching import temporal_collocation, combined_temporal_collocation

[48]:

# create reference time series
ref = pd.date_range("2020-01-01", "2020-12-31", freq="D")
# temporal_collocation can also take a DataFrame or Series as reference input,
# in case their index is a DatetimeIndex.

# create other time series as dataframe
values = np.random.randn(len(ref), 3)
shifted = pd.DataFrame(values, index=ref + pd.Timedelta(hours=3),
                       columns=list(map(lambda x: f"shifted_{x}", range(3))))
random_shift = np.random.uniform(-12, 12, len(ref))
random = pd.DataFrame(values, index=ref + pd.to_timedelta(random_shift, "H"),
                      columns=list(map(lambda x: f"random_{x}", range(3))))

shifted.head()

[48]:

	shifted_0	shifted_1	shifted_2
2020-01-01 03:00:00	-0.687795	-0.626649	0.237109
2020-01-02 03:00:00	-0.514778	-1.981137	0.354644
2020-01-03 03:00:00	-0.600629	-0.761766	0.169777
2020-01-04 03:00:00	-0.650058	-0.548499	0.548560
2020-01-05 03:00:00	1.331785	-1.611482	-1.325902

[49]:

random.head()

[49]:

	random_0	random_1	random_2
2019-12-31 21:00:12.990837600	-0.687795	-0.626649	0.237109
2020-01-02 03:18:11.512674000	-0.514778	-1.981137	0.354644
2020-01-03 04:15:20.703038399	-0.600629	-0.761766	0.169777
2020-01-03 22:47:25.851343200	-0.650058	-0.548499	0.548560
2020-01-05 01:02:46.090482000	1.331785	-1.611482	-1.325902

We can now match the shifted timeseries to the reference index by using a 6-hour window, either for a nearest neighbour search, or for taking a windowed mean. Both should return unchanges timeseries, except for the index.

[50]:

# match the regularly shifted data
window = pd.Timedelta(hours=6)
matched_shifted_nn = temporal_collocation(ref, shifted, window, method="nearest")
matched_shifted_mean = temporal_collocation(ref, shifted, window, method="mean")

# the data should be the same before and after matching for both methods
assert np.all(shifted.values == matched_shifted_nn.values)
assert np.all(shifted.values == matched_shifted_mean.values)

matched_shifted_nn

[50]:

	shifted_0	shifted_1	shifted_2
2020-01-01	-0.687795	-0.626649	0.237109
2020-01-02	-0.514778	-1.981137	0.354644
2020-01-03	-0.600629	-0.761766	0.169777
2020-01-04	-0.650058	-0.548499	0.548560
2020-01-05	1.331785	-1.611482	-1.325902
...	...	...	...
2020-12-27	-1.030740	-0.520146	-0.049081
2020-12-28	-1.221772	-0.770726	0.585857
2020-12-29	0.420476	-2.558351	-0.797550
2020-12-30	0.146506	1.014053	-0.629454
2020-12-31	-1.296130	2.000808	-1.084081

366 rows × 3 columns

We can do the same for the randomly shifted timeseries. Here we should see some changes, because sometimes there’s no value inside the window that we are looking at. However, the result of mean and nearest neighbour should be the same

[51]:

# match the randomly shifted data
matched_random_mean = temporal_collocation(ref, random, window, method="mean")
matched_random_nn = temporal_collocation(ref, random, window, method="nearest")

# the data should be the same as before matching at the locations where the shift
# was below 6 hours, and should be np.nan when shift was larger
should_be_nan = np.abs(random_shift) > 6

assert np.all(matched_random_nn[~should_be_nan].values == random[~should_be_nan].values)
assert np.all(np.isnan(matched_random_nn[should_be_nan].values))

for c in matched_random_nn.columns:
    df = pd.DataFrame(index=range(366),
                      data={'mean': matched_random_mean[c].values,
                            'nn': matched_random_nn[c].values}).dropna()
    np.testing.assert_almost_equal(
        df.diff(axis=1).iloc[:, 1].values,
        np.zeros(df.index.size),
        decimal=4)

matched_random_nn

[51]:

	random_0	random_1	random_2
2020-01-01	-0.687795	-0.626649	0.237109
2020-01-02	-0.514778	-1.981137	0.354644
2020-01-03	-0.600629	-0.761766	0.169777
2020-01-04	-0.650058	-0.548499	0.548560
2020-01-05	1.331785	-1.611482	-1.325902
...	...	...	...
2020-12-27	-1.030740	-0.520146	-0.049081
2020-12-28	-1.221772	-0.770726	0.585857
2020-12-29	0.420476	-2.558351	-0.797550
2020-12-30	0.146506	1.014053	-0.629454
2020-12-31	-1.296130	2.000808	-1.084081

366 rows × 3 columns

Returning the original index

temporal_collocation can also return the original index of the data that was matched as a separate column in the resulting DataFrame, if required, and can additionally also calculate the distance to the reference. The column names are “index_other” and “distance_other”, respectively.

[52]:

# also return original index and distance
matched_shifted = temporal_collocation(
    ref, shifted, window,
    method="nearest",
    return_index=True,
    return_distance=True
)

# the index should be the same as unmatched, and the distance should be 3  hours
assert np.all(matched_shifted["index_other"].values == shifted.index.values)
assert np.all(matched_shifted["distance_other"] == pd.Timedelta(hours=3))

matched_shifted

[52]:

	shifted_0	shifted_1	shifted_2	index_other	distance_other
2020-01-01	-0.687795	-0.626649	0.237109	2020-01-01 03:00:00	0 days 03:00:00
2020-01-02	-0.514778	-1.981137	0.354644	2020-01-02 03:00:00	0 days 03:00:00
2020-01-03	-0.600629	-0.761766	0.169777	2020-01-03 03:00:00	0 days 03:00:00
2020-01-04	-0.650058	-0.548499	0.548560	2020-01-04 03:00:00	0 days 03:00:00
2020-01-05	1.331785	-1.611482	-1.325902	2020-01-05 03:00:00	0 days 03:00:00
...	...	...	...	...	...
2020-12-27	-1.030740	-0.520146	-0.049081	2020-12-27 03:00:00	0 days 03:00:00
2020-12-28	-1.221772	-0.770726	0.585857	2020-12-28 03:00:00	0 days 03:00:00
2020-12-29	0.420476	-2.558351	-0.797550	2020-12-29 03:00:00	0 days 03:00:00
2020-12-30	0.146506	1.014053	-0.629454	2020-12-30 03:00:00	0 days 03:00:00
2020-12-31	-1.296130	2.000808	-1.084081	2020-12-31 03:00:00	0 days 03:00:00

366 rows × 5 columns

Flags

Satellite data often contains flags indicating quality issues with the data. With temporal_collocation it is possible to use this information. Flags can either be provided as array (of the same length as the input DataFrame), or the name of a column in the DataFrame to be used as flag can be provided as string. Any non-zero flag is interpreted as indicating invalid data. By default this will not be used, but when passing use_invalid=True, the invalid values will be used in case no valid match was found.

For the following example, we reuse the input data shifted by 3 hours, but we will now assume that the first 3 observations had quality issues.

[53]:

# flag the first 3 observations as invalid
flag = np.zeros(len(ref), dtype=bool)
flag[0:3] = True
flag[0:10]

[53]:

array([ True,  True,  True, False, False, False, False, False, False,
       False])

[54]:

matched_flagged = temporal_collocation(ref, shifted, window, flag=flag)

# the first 3 values should be NaN, otherwise the result should be the same as matched_shifted
assert np.all(np.isnan(matched_flagged.values[0:3, :]))
assert np.all(matched_flagged.values[3:, :] == matched_shifted.values[3:, 0:3])  # excluding additonal columns
matched_flagged

[54]:

	shifted_0	shifted_1	shifted_2
2020-01-01	NaN	NaN	NaN
2020-01-02	NaN	NaN	NaN
2020-01-03	NaN	NaN	NaN
2020-01-04	-0.650058	-0.548499	0.548560
2020-01-05	1.331785	-1.611482	-1.325902
...	...	...	...
2020-12-27	-1.030740	-0.520146	-0.049081
2020-12-28	-1.221772	-0.770726	0.585857
2020-12-29	0.420476	-2.558351	-0.797550
2020-12-30	0.146506	1.014053	-0.629454
2020-12-31	-1.296130	2.000808	-1.084081

366 rows × 3 columns

[55]:

# This also works when the flag is already in the input frame, but note that
# in the output frame the nonzero flag values are replaced by NaN
flagged = shifted.assign(my_flag=flag)
matched_flagged = temporal_collocation(ref, flagged, window, flag="my_flag")

# the first 3 values should be NaN, otherwise the result should be the same as matched_shifted
assert np.all(np.isnan(matched_flagged.iloc[0:3, 0:3].values))
assert np.all(matched_flagged.iloc[3:, 0:3].values == matched_shifted.values[3:, 0:3])  # excluding additonal columns
matched_flagged

[55]:

	shifted_0	shifted_1	shifted_2	my_flag
2020-01-01	NaN	NaN	NaN	NaN
2020-01-02	NaN	NaN	NaN	NaN
2020-01-03	NaN	NaN	NaN	NaN
2020-01-04	-0.650058	-0.548499	0.548560	False
2020-01-05	1.331785	-1.611482	-1.325902	False
...	...	...	...	...
2020-12-27	-1.030740	-0.520146	-0.049081	False
2020-12-28	-1.221772	-0.770726	0.585857	False
2020-12-29	0.420476	-2.558351	-0.797550	False
2020-12-30	0.146506	1.014053	-0.629454	False
2020-12-31	-1.296130	2.000808	-1.084081	False

366 rows × 4 columns

Combined collocation

It is also possible to match multiple timeseries together against a reference dataset using the function pytesmo.temporal_matching.combined_temporal_collocation. With the keyword argument combined_dropna it’s possible to drop data where one of the input datasets has missing values from all other input datasets.

[56]:

combined_match = combined_temporal_collocation(ref, (random, shifted), window, combined_dropna=True)
# matched dataframe should have same length as matched_random_nn without NaNs
assert len(combined_match == len(matched_random_nn.dropna()))
combined_match

[56]:

	random_0	random_1	random_2	shifted_0	shifted_1	shifted_2
2020-01-01	-0.687795	-0.626649	0.237109	-0.687795	-0.626649	0.237109
2020-01-02	-0.514778	-1.981137	0.354644	-0.514778	-1.981137	0.354644
2020-01-03	-0.600629	-0.761766	0.169777	-0.600629	-0.761766	0.169777
2020-01-04	-0.650058	-0.548499	0.548560	-0.650058	-0.548499	0.548560
2020-01-05	1.331785	-1.611482	-1.325902	1.331785	-1.611482	-1.325902
...	...	...	...	...	...	...
2020-12-27	-1.030740	-0.520146	-0.049081	-1.030740	-0.520146	-0.049081
2020-12-28	-1.221772	-0.770726	0.585857	-1.221772	-0.770726	0.585857
2020-12-29	0.420476	-2.558351	-0.797550	0.420476	-2.558351	-0.797550
2020-12-30	0.146506	1.014053	-0.629454	0.146506	1.014053	-0.629454
2020-12-31	-1.296130	2.000808	-1.084081	-1.296130	2.000808	-1.084081

366 rows × 6 columns