Temporal Collocation of irregularly sampled timeseries
Satellite observations usually have an irregular temporal sampling pattern (intervals between 6-36 hours), which is mostly controlled by the orbit of the satellite and the instrument measurement geometry. On the other hand, in-situ instruments or land surface models generally sample on regular time intervals (commonly every 1, 3, 6, 12 or 24 hours). In order to compute error/performance statistics (such as RMSD, bias, correlation) between the time series coming from different sources, it is required that observation pairs (or triplets, etc.) are found which (nearly) coincide in time.
A simple way to identify such pairs is by using a nearest neighbor search. First, one time series needs to be selected as temporal reference (i.e. all other time series will be matched to this reference) and second, a tolerance window (typically around 1-12 hours) has to be defined characterizing the temporal correlation of neighboring observation (i.e. observations outside of the tolerance window are no longer be considered as representative neighbors).
Temporal collocation in pytesmo
Pytesmo contains the function pytesmo.temporal_matching.temporal_collocation
for temporally collocating timeseries. Currently, it implements nearest neighbour matching and a windowed mean. It requires a reference index (can also be a DataFrame or a Series), a DataFrame (or Series) to be collocated, and a window.
collocated = temporal_collocation(reference, input_frame, window)
The window argument corresponds to the time intervals that are included in the nearest neighbour search in each direction, e.g. if the reference time is \(t\) and the window \(\Delta\), the nearest neighbour inside \([t-\Delta, t+\Delta]\) is returned. If no neighbour is found np.nan
is used as replacement. NaNs can be dropped from the returned dataframe by providing the optional keyword argument dropna=True
to the function.
Below are two simple examples which demonstrate the usage. The first example assumes that the index of data to be collocated is shifted by 3 hours with respect to the reference, while using a 6 hour window. The second example uses an index that is randomly shifted by \(\pm12\) hours with respect to the reference. The second example also uses a 6 hour window, which results in some missing values in the resulting dataframe.
[47]:
import numpy as np
import pandas as pd
from pytesmo.temporal_matching import temporal_collocation, combined_temporal_collocation
[48]:
# create reference time series
ref = pd.date_range("2020-01-01", "2020-12-31", freq="D")
# temporal_collocation can also take a DataFrame or Series as reference input,
# in case their index is a DatetimeIndex.
# create other time series as dataframe
values = np.random.randn(len(ref), 3)
shifted = pd.DataFrame(values, index=ref + pd.Timedelta(hours=3),
columns=list(map(lambda x: f"shifted_{x}", range(3))))
random_shift = np.random.uniform(-12, 12, len(ref))
random = pd.DataFrame(values, index=ref + pd.to_timedelta(random_shift, "H"),
columns=list(map(lambda x: f"random_{x}", range(3))))
shifted.head()
[48]:
shifted_0 | shifted_1 | shifted_2 | |
---|---|---|---|
2020-01-01 03:00:00 | -0.687795 | -0.626649 | 0.237109 |
2020-01-02 03:00:00 | -0.514778 | -1.981137 | 0.354644 |
2020-01-03 03:00:00 | -0.600629 | -0.761766 | 0.169777 |
2020-01-04 03:00:00 | -0.650058 | -0.548499 | 0.548560 |
2020-01-05 03:00:00 | 1.331785 | -1.611482 | -1.325902 |
[49]:
random.head()
[49]:
random_0 | random_1 | random_2 | |
---|---|---|---|
2019-12-31 21:00:12.990837600 | -0.687795 | -0.626649 | 0.237109 |
2020-01-02 03:18:11.512674000 | -0.514778 | -1.981137 | 0.354644 |
2020-01-03 04:15:20.703038399 | -0.600629 | -0.761766 | 0.169777 |
2020-01-03 22:47:25.851343200 | -0.650058 | -0.548499 | 0.548560 |
2020-01-05 01:02:46.090482000 | 1.331785 | -1.611482 | -1.325902 |
We can now match the shifted timeseries to the reference index by using a 6-hour window, either for a nearest neighbour search, or for taking a windowed mean. Both should return unchanges timeseries, except for the index.
[50]:
# match the regularly shifted data
window = pd.Timedelta(hours=6)
matched_shifted_nn = temporal_collocation(ref, shifted, window, method="nearest")
matched_shifted_mean = temporal_collocation(ref, shifted, window, method="mean")
# the data should be the same before and after matching for both methods
assert np.all(shifted.values == matched_shifted_nn.values)
assert np.all(shifted.values == matched_shifted_mean.values)
matched_shifted_nn
[50]:
shifted_0 | shifted_1 | shifted_2 | |
---|---|---|---|
2020-01-01 | -0.687795 | -0.626649 | 0.237109 |
2020-01-02 | -0.514778 | -1.981137 | 0.354644 |
2020-01-03 | -0.600629 | -0.761766 | 0.169777 |
2020-01-04 | -0.650058 | -0.548499 | 0.548560 |
2020-01-05 | 1.331785 | -1.611482 | -1.325902 |
... | ... | ... | ... |
2020-12-27 | -1.030740 | -0.520146 | -0.049081 |
2020-12-28 | -1.221772 | -0.770726 | 0.585857 |
2020-12-29 | 0.420476 | -2.558351 | -0.797550 |
2020-12-30 | 0.146506 | 1.014053 | -0.629454 |
2020-12-31 | -1.296130 | 2.000808 | -1.084081 |
366 rows × 3 columns
We can do the same for the randomly shifted timeseries. Here we should see some changes, because sometimes there’s no value inside the window that we are looking at. However, the result of mean and nearest neighbour should be the same
[51]:
# match the randomly shifted data
matched_random_mean = temporal_collocation(ref, random, window, method="mean")
matched_random_nn = temporal_collocation(ref, random, window, method="nearest")
# the data should be the same as before matching at the locations where the shift
# was below 6 hours, and should be np.nan when shift was larger
should_be_nan = np.abs(random_shift) > 6
assert np.all(matched_random_nn[~should_be_nan].values == random[~should_be_nan].values)
assert np.all(np.isnan(matched_random_nn[should_be_nan].values))
for c in matched_random_nn.columns:
df = pd.DataFrame(index=range(366),
data={'mean': matched_random_mean[c].values,
'nn': matched_random_nn[c].values}).dropna()
np.testing.assert_almost_equal(
df.diff(axis=1).iloc[:, 1].values,
np.zeros(df.index.size),
decimal=4)
matched_random_nn
[51]:
random_0 | random_1 | random_2 | |
---|---|---|---|
2020-01-01 | -0.687795 | -0.626649 | 0.237109 |
2020-01-02 | -0.514778 | -1.981137 | 0.354644 |
2020-01-03 | -0.600629 | -0.761766 | 0.169777 |
2020-01-04 | -0.650058 | -0.548499 | 0.548560 |
2020-01-05 | 1.331785 | -1.611482 | -1.325902 |
... | ... | ... | ... |
2020-12-27 | -1.030740 | -0.520146 | -0.049081 |
2020-12-28 | -1.221772 | -0.770726 | 0.585857 |
2020-12-29 | 0.420476 | -2.558351 | -0.797550 |
2020-12-30 | 0.146506 | 1.014053 | -0.629454 |
2020-12-31 | -1.296130 | 2.000808 | -1.084081 |
366 rows × 3 columns
Returning the original index
temporal_collocation
can also return the original index of the data that was matched as a separate column in the resulting DataFrame, if required, and can additionally also calculate the distance to the reference. The column names are “index_other” and “distance_other”, respectively.
[52]:
# also return original index and distance
matched_shifted = temporal_collocation(
ref, shifted, window,
method="nearest",
return_index=True,
return_distance=True
)
# the index should be the same as unmatched, and the distance should be 3 hours
assert np.all(matched_shifted["index_other"].values == shifted.index.values)
assert np.all(matched_shifted["distance_other"] == pd.Timedelta(hours=3))
matched_shifted
[52]:
shifted_0 | shifted_1 | shifted_2 | index_other | distance_other | |
---|---|---|---|---|---|
2020-01-01 | -0.687795 | -0.626649 | 0.237109 | 2020-01-01 03:00:00 | 0 days 03:00:00 |
2020-01-02 | -0.514778 | -1.981137 | 0.354644 | 2020-01-02 03:00:00 | 0 days 03:00:00 |
2020-01-03 | -0.600629 | -0.761766 | 0.169777 | 2020-01-03 03:00:00 | 0 days 03:00:00 |
2020-01-04 | -0.650058 | -0.548499 | 0.548560 | 2020-01-04 03:00:00 | 0 days 03:00:00 |
2020-01-05 | 1.331785 | -1.611482 | -1.325902 | 2020-01-05 03:00:00 | 0 days 03:00:00 |
... | ... | ... | ... | ... | ... |
2020-12-27 | -1.030740 | -0.520146 | -0.049081 | 2020-12-27 03:00:00 | 0 days 03:00:00 |
2020-12-28 | -1.221772 | -0.770726 | 0.585857 | 2020-12-28 03:00:00 | 0 days 03:00:00 |
2020-12-29 | 0.420476 | -2.558351 | -0.797550 | 2020-12-29 03:00:00 | 0 days 03:00:00 |
2020-12-30 | 0.146506 | 1.014053 | -0.629454 | 2020-12-30 03:00:00 | 0 days 03:00:00 |
2020-12-31 | -1.296130 | 2.000808 | -1.084081 | 2020-12-31 03:00:00 | 0 days 03:00:00 |
366 rows × 5 columns
Flags
Satellite data often contains flags indicating quality issues with the data. With temporal_collocation
it is possible to use this information. Flags can either be provided as array (of the same length as the input DataFrame), or the name of a column in the DataFrame to be used as flag can be provided as string. Any non-zero flag is interpreted as indicating invalid data. By default this will not be used, but when passing use_invalid=True
, the invalid values will be used in case no valid
match was found.
For the following example, we reuse the input data shifted by 3 hours, but we will now assume that the first 3 observations had quality issues.
[53]:
# flag the first 3 observations as invalid
flag = np.zeros(len(ref), dtype=bool)
flag[0:3] = True
flag[0:10]
[53]:
array([ True, True, True, False, False, False, False, False, False,
False])
[54]:
matched_flagged = temporal_collocation(ref, shifted, window, flag=flag)
# the first 3 values should be NaN, otherwise the result should be the same as matched_shifted
assert np.all(np.isnan(matched_flagged.values[0:3, :]))
assert np.all(matched_flagged.values[3:, :] == matched_shifted.values[3:, 0:3]) # excluding additonal columns
matched_flagged
[54]:
shifted_0 | shifted_1 | shifted_2 | |
---|---|---|---|
2020-01-01 | NaN | NaN | NaN |
2020-01-02 | NaN | NaN | NaN |
2020-01-03 | NaN | NaN | NaN |
2020-01-04 | -0.650058 | -0.548499 | 0.548560 |
2020-01-05 | 1.331785 | -1.611482 | -1.325902 |
... | ... | ... | ... |
2020-12-27 | -1.030740 | -0.520146 | -0.049081 |
2020-12-28 | -1.221772 | -0.770726 | 0.585857 |
2020-12-29 | 0.420476 | -2.558351 | -0.797550 |
2020-12-30 | 0.146506 | 1.014053 | -0.629454 |
2020-12-31 | -1.296130 | 2.000808 | -1.084081 |
366 rows × 3 columns
[55]:
# This also works when the flag is already in the input frame, but note that
# in the output frame the nonzero flag values are replaced by NaN
flagged = shifted.assign(my_flag=flag)
matched_flagged = temporal_collocation(ref, flagged, window, flag="my_flag")
# the first 3 values should be NaN, otherwise the result should be the same as matched_shifted
assert np.all(np.isnan(matched_flagged.iloc[0:3, 0:3].values))
assert np.all(matched_flagged.iloc[3:, 0:3].values == matched_shifted.values[3:, 0:3]) # excluding additonal columns
matched_flagged
[55]:
shifted_0 | shifted_1 | shifted_2 | my_flag | |
---|---|---|---|---|
2020-01-01 | NaN | NaN | NaN | NaN |
2020-01-02 | NaN | NaN | NaN | NaN |
2020-01-03 | NaN | NaN | NaN | NaN |
2020-01-04 | -0.650058 | -0.548499 | 0.548560 | False |
2020-01-05 | 1.331785 | -1.611482 | -1.325902 | False |
... | ... | ... | ... | ... |
2020-12-27 | -1.030740 | -0.520146 | -0.049081 | False |
2020-12-28 | -1.221772 | -0.770726 | 0.585857 | False |
2020-12-29 | 0.420476 | -2.558351 | -0.797550 | False |
2020-12-30 | 0.146506 | 1.014053 | -0.629454 | False |
2020-12-31 | -1.296130 | 2.000808 | -1.084081 | False |
366 rows × 4 columns
Combined collocation
It is also possible to match multiple timeseries together against a reference dataset using the function pytesmo.temporal_matching.combined_temporal_collocation
. With the keyword argument combined_dropna
it’s possible to drop data where one of the input datasets has missing values from all other input datasets.
[56]:
combined_match = combined_temporal_collocation(ref, (random, shifted), window, combined_dropna=True)
# matched dataframe should have same length as matched_random_nn without NaNs
assert len(combined_match == len(matched_random_nn.dropna()))
combined_match
[56]:
random_0 | random_1 | random_2 | shifted_0 | shifted_1 | shifted_2 | |
---|---|---|---|---|---|---|
2020-01-01 | -0.687795 | -0.626649 | 0.237109 | -0.687795 | -0.626649 | 0.237109 |
2020-01-02 | -0.514778 | -1.981137 | 0.354644 | -0.514778 | -1.981137 | 0.354644 |
2020-01-03 | -0.600629 | -0.761766 | 0.169777 | -0.600629 | -0.761766 | 0.169777 |
2020-01-04 | -0.650058 | -0.548499 | 0.548560 | -0.650058 | -0.548499 | 0.548560 |
2020-01-05 | 1.331785 | -1.611482 | -1.325902 | 1.331785 | -1.611482 | -1.325902 |
... | ... | ... | ... | ... | ... | ... |
2020-12-27 | -1.030740 | -0.520146 | -0.049081 | -1.030740 | -0.520146 | -0.049081 |
2020-12-28 | -1.221772 | -0.770726 | 0.585857 | -1.221772 | -0.770726 | 0.585857 |
2020-12-29 | 0.420476 | -2.558351 | -0.797550 | 0.420476 | -2.558351 | -0.797550 |
2020-12-30 | 0.146506 | 1.014053 | -0.629454 | 0.146506 | 1.014053 | -0.629454 |
2020-12-31 | -1.296130 | 2.000808 | -1.084081 | -1.296130 | 2.000808 | -1.084081 |
366 rows × 6 columns