Machine learning practitioners often advise taking first differences of the target time series prior to forecasting. “Stationary data is easier to model and will very likely result in more skillful forecasts,” writes Jason Brownlee at Machine Learning Mastery, who goes on to demonstrate the use of differences to remove trend in a time series. Differences are not limited to *y(t) - y(t-1)*; they can be taken at various lags and stacked iteratively to create a time series that is stable and easy to work with. But after the forecasting work is done, the forecasts must be reconstructed on the original scale. This post examines the conceptual issues of this reconstruction, using a simple example and only first principles.

Statisticians and econometricians warn that over-differencing is a real risk, so we should enter this investigation vigilant for downsides to differencing. There is no theory here, only a purely mechanical view of the differencing and reconstruction process when a seasonal difference is followed by a first difference. The simplicity of the example helps clarify the exact steps taken, as well as the contributions of the real data versus the forecasted series. This mechanical analysis also suggests what happens to forecast quality under repeated differencing.

## A simple example

We start with an artificially simple time series y(t):

| t | y(t) |
|---|------|
| 0 | 1 |
| 1 | 3 |
| 2 | 6 |
| 3 | 9 |
| 4 | 10 |

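Consider the protocol of taking the third difference (that is, the lag-3 difference y(t) - y(t-3), not a third-order difference) followed by the first difference. The table of third differences is shorter because the first three observations have no value three periods earlier to subtract: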

| t | diff3 y(t) |
|---|------------|
| 3 | 8 |
| 4 | 7 |

Now taking first differences (here there’s only one to take), we arrive at the final data set for model building:

| t | diff1 diff3 y(t) |
|---|------------------|
| 4 | -1 |

Forget for a moment that you’d have a hard time building a forecasting model with one data point and imagine that [-1] is representative of a larger data set used to build a model. This data set, after all differencing operations, is what we use for building models. If we spend extensive time in the forecasting process, we may get very comfortable with this data set and forget that we are two levels deep in differencing.
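The two differencing steps can be sketched in a few lines of plain Python. The helper name `difference` and the choice to store each series as a dictionary keyed by time are illustrative assumptions, not something this analysis prescribes.

```python
def difference(series, lag):
    """Return {t: series[t] - series[t - lag]} for every t that has a value lag periods earlier."""
    return {t: series[t] - series[t - lag] for t in series if t - lag in series}

y = {0: 1, 1: 3, 2: 6, 3: 9, 4: 10}

d3 = difference(y, lag=3)      # third (lag-3) difference
d1d3 = difference(d3, lag=1)   # first difference of the third difference

print(d3)    # {3: 8, 4: 7}
print(d1d3)  # {4: -1}
```

Keying by time makes the end effect visible: each pass drops as many leading periods as its lag.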

## The Forecast

Given our single data point’s magical ability to generate a forecasting model, we have forecasted values for time periods past 4, i.e., the unseen future. The known past is displayed in boldface.

| t | diff1 diff3 y(t) |
|---|------------------|
| **4** | **-1** |
| 5 | 3 |
| 6 | 7 |
| 7 | -2 |
| 8 | -1 |
| 9 | 1 |

Proceeding in a last-in-first-out manner, we must undo the first differences before undoing the third differences. To do so, we must reach back into the series of third differences, one level up in the differencing history, for its last observed value.

| t | diff3 y(t) |
|---|------------|
| **3** | **8** |
| **4** | **7** |
| 5 | 10 |
| 6 | 17 |
| 7 | 15 |
| 8 | 14 |
| 9 | 15 |

To start the unraveling, we really only needed the third difference at t = 4. The value at t = 3 is shown because the first difference at t = 4 could not have been computed without it.
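Undoing a first difference amounts to a running sum seeded with the last observed value one level up. A minimal sketch, assuming Python 3.8+ for the `initial` argument of `itertools.accumulate`; the variable names are illustrative:

```python
from itertools import accumulate

d3_last_observed = 7                 # third difference at t = 4, from real data
d1d3_forecast = [3, 7, -2, -1, 1]    # model output for t = 5..9

# Running sum seeded with the last observed value; drop the seed itself.
d3_forecast = list(accumulate(d1d3_forecast, initial=d3_last_observed))[1:]

print(d3_forecast)  # [10, 17, 15, 14, 15]
```

Only the value at t = 5 adds a forecast directly to observed data; every later value adds a forecast to an earlier reconstructed value.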

Now we unravel the third differences, which requires us to dig back into the original series for at least the last three observed periods (2 <= t <= 4). Italics show which reconstructed values are anchored on forecasts rather than real values.

| t | y(t) |
|---|------|
| **0** | **1** |
| **1** | **3** |
| **2** | **6** |
| **3** | **9** |
| **4** | **10** |
| 5 | 16 |
| 6 | 26 |
| 7 | 25 |
| *8* | *30* |
| *9* | *41* |
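The last step can be sketched as a loop that adds each forecast third difference to the value three periods earlier. The function name `undo_lag_difference` and its list-based interface are assumptions made for illustration.

```python
def undo_lag_difference(history, diffs, lag):
    """Invert a lag difference using y(t) = y(t - lag) + diff(t).

    history: observed values on the original scale, oldest first.
    diffs:   forecast lag differences for the periods right after history.
    """
    values = list(history)
    for d in diffs:
        # values[-lag] is observed at first, then a reconstructed forecast.
        values.append(values[-lag] + d)
    return values[len(history):]

y_observed = [1, 3, 6, 9, 10]          # t = 0..4
d3_forecast = [10, 17, 15, 14, 15]     # t = 5..9

y_forecast = undo_lag_difference(y_observed, d3_forecast, lag=3)
print(y_forecast)  # [16, 26, 25, 30, 41]
```

For the first three forecast periods `values[-lag]` is an observed value; from t = 8 onward it is a value the loop itself appended.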

## Takeaways

Consider only the reconstruction of the final forecasts from the third differences. For future time periods 5 <= t <= 7, the anchor of the reconstructed forecast (on the original scale) was a real data point from the original series, with no uncertainty. Contrast that with t = 8, a forecast that must be anchored on a previous forecast. Every three time points (i.e., the lag used for this level of differencing), another layer of uncertainty is added as forecasts are stacked on another layer of forecasts.

After the third difference, there was the first difference, meaning that all forecast values available at the last stage of reconstruction were themselves reconstructions. Being a first difference, all but one value (at t = 5) were anchored upon other forecasts, and the level of stacking forecast upon forecast increases with every additional time increment. The forecast at t = 5 is the only one anchored on real data points for both the first and the third difference.
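To make the stacking concrete, the reconstruction of the value at t = 8 can be expanded term by term. The notation here is an assumption introduced for this illustration: hats mark forecast or reconstructed quantities, and Δ₃ denotes the third (lag-3) difference.

```latex
\begin{aligned}
\hat{y}(8) &= \hat{y}(5) + \widehat{\Delta_3 y}(8) \\
           &= y(2) + \widehat{\Delta_3 y}(5) + \widehat{\Delta_3 y}(8) \\
           &= 6 + (7 + 3) + (7 + 3 + 7 - 2 - 1) \\
           &= 6 + 10 + 14 = 30
\end{aligned}
```

In each parenthesis the leading 7 is the observed third difference at t = 4 and the remaining terms are model forecasts. The value at t = 5, by comparison, is 6 + (7 + 3) = 16 and contains a single forecast term.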

Differencing may be a way to tame unwieldy time series, but there is a cost. The alternatives may not be superior; achieving stationarity by removing a structural trend and seasonality trades compounding forecast uncertainty for the uncertainty in extrapolating these structural elements into the future.

One point is clear: if you have to difference, a larger differencing lag provides more time before the compounding effects of forecasts-upon-forecasts start to pile up. If that larger differencing lag lines up with real seasonal effects, then you’ve got both trend and seasonality removal alongside a (potentially substantial) grace period before the uncertainty begins to compound.
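A rough way to see the grace period is to count how many forecast lag differences are summed into the value at each horizon. This is only a sketch: it counts terms rather than measuring uncertainty, it ignores any first difference stacked beneath the lag difference, and the function name `stacked_forecasts` is hypothetical.

```python
def stacked_forecasts(horizon, lag):
    """Count lag-difference forecasts summed into the value `horizon` steps ahead.

    A count of 1 means the anchor is an observed value; anything larger
    means the anchor is itself a reconstructed forecast.
    """
    return -(-horizon // lag)  # ceiling division

for lag in (1, 3, 12):
    print(lag, [stacked_forecasts(h, lag) for h in range(1, 7)])
# 1 [1, 2, 3, 4, 5, 6]
# 3 [1, 1, 1, 2, 2, 2]
# 12 [1, 1, 1, 1, 1, 1]
```

With a lag of 12, every forecast in the first twelve periods is anchored directly on observed data, whereas a lag of 1 starts stacking at the second step.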