Simultaneous Forecasting of Multiple Time-Series


In typical data science projects there is always more data we could be looking at and more features that could be informative; one set of data rarely paints the whole picture. The same is true for time-series: we can forecast one set of historical data on its own, but could we improve our forecasts by using additional external time-series? If so, how would that work?

It turns out that we can formulate this as a multiple regression problem, in which we forecast both the original time-series and each additional external time-series we use.

As we will see, it is not too hard to produce decent forecasts this way. However, we will also discuss issues that arise and certain optimizations that can be made.

Code for this post is available here.

Example Vector Time-Series

For this post let us consider a two-dimensional autoregressive series with a weekly seasonality and some normal random noise $\epsilon_t$. $$ \begin{aligned} y_{t} &= \phi_1 y_{t-7} + \phi_2 x_{t-1} +\epsilon_t \\ x_{t} &= \theta_1 x_{t-7} + \epsilon_t \end{aligned} $$

This model could represent something like daily demand together with a daily advertising schedule that increases or decreases the next day's demand.

Setting $\phi_1=0.99$, $\phi_2=0.05$, and $\theta_1=1$ and using the initial conditions $$ \begin{aligned} [y_1,...,y_7] &= [1,1,2,2,4,8,5]\\ [x_1,...,x_7] &= [-1,-3,1,3,0,0,0] \end{aligned} $$ the series look like this
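As a concrete reference, here is a minimal simulation of this process. The coefficients and initial conditions are the ones given above; the noise scale, series length, and random seed are assumptions, since they are not specified in the post.

```python
import numpy as np

rng = np.random.default_rng(0)          # assumed seed, for reproducibility
phi1, phi2, theta1 = 0.99, 0.05, 1.0    # coefficients from the post
sigma, n = 1.0, 400                     # assumed noise scale and series length

# Initial conditions for the first week
y = np.concatenate([[1, 1, 2, 2, 4, 8, 5], np.zeros(n - 7)])
x = np.concatenate([[-1, -3, 1, 3, 0, 0, 0], np.zeros(n - 7)])

for t in range(7, n):
    x[t] = theta1 * x[t - 7] + sigma * rng.standard_normal()
    y[t] = phi1 * y[t - 7] + phi2 * x[t - 1] + sigma * rng.standard_normal()
```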

We see the clear recurring seasonal pattern, with random fluctuations due to the noise. Note also the small influence of the $x_t$ series on $y_t$: to be able to predict the first time series, we also need to be able to predict the second.

Vector Lasso Model

There are many ways we could model this series. A simple way is to use a linear vector autoregressive model with a specified window size. In this model we let both $y_t$ and $x_t$ only be functions of the past observed values. In particular, both $x_t$ and $y_t$ can depend on past values of themselves and each other. In mathematical terms $$ \begin{aligned} \hat{y}_t &= \sum_{i=1}^W\left(a_i y_{t-i}+b_ix_{t-i}\right)\\ \hat{x}_t &= \sum_{i=1}^W\left(c_i y_{t-i}+d_ix_{t-i}\right) \end{aligned} $$ where $a_i$, $b_i$, $c_i$ and $d_i$ are the learned parameters of the model.

Now in this simple example we know exactly how far back the historical dependence of the time series goes. However, in practice we will not know the actual historical dependence and corresponding window size $W$. To emulate a realistic scenario, we will choose a window size of $W=14$, twice as big as needed, to see how this affects our model.

The question remains: how do we find good values for the unknown parameters of our model? We can interpret this as a multiple linear regression problem. We want to predict the time series values $y_t$ and $x_t$, and the features shared by both regressions are the past values of the $y$ and $x$ series. If we create a dataset of example history vectors and their corresponding targets, we can use standard machine learning libraries to fit our model.
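A sketch of this dataset construction, assuming the simulated arrays `y` and `x` from earlier and using the window size $W=14$ chosen above (the helper name `make_lagged_dataset` is just for illustration):

```python
import numpy as np

def make_lagged_dataset(y, x, window=14):
    """Build rows [y_{t-1},...,y_{t-W}, x_{t-1},...,x_{t-W}] with targets y_t and x_t."""
    rows, y_targets, x_targets = [], [], []
    for t in range(window, len(y)):
        rows.append(np.concatenate([y[t - window:t][::-1], x[t - window:t][::-1]]))
        y_targets.append(y[t])
        x_targets.append(x[t])
    return np.array(rows), np.array(y_targets), np.array(x_targets)

X_feat, y_target, x_target = make_lagged_dataset(y, x, window=14)
```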

We now need to decide which model to use. We could use ordinary least-squares, ridge regression or lasso regression. In this case we will use Lasso to promote simpler models with few nonzero parameters.
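Concretely, fitting one Lasso regression per series with scikit-learn might look like the following sketch. It reuses the lagged arrays built above and the penalty of $0.01$ discussed next; the train/test split point is an assumption.

```python
from sklearn.linear_model import Lasso

split = int(0.8 * len(X_feat))          # assumed split; the post holds out a test set
X_train, X_test = X_feat[:split], X_feat[split:]

model_y = Lasso(alpha=0.01).fit(X_train, y_target[:split])
model_x = Lasso(alpha=0.01).fit(X_train, x_target[:split])

# The first 14 coefficients correspond to y lags 1..14, the next 14 to x lags 1..14
print(model_y.coef_.round(3))
print(model_x.coef_.round(3))
```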

Using a lasso coefficient of $0.01$ we get the following coefficient values for the $y_t$ predictions. We see that the model discovered historical $y$ coefficients that are close to the true values. However, for the $x$ coefficients it missed lag 1 and instead picked up a nonzero coefficient at lag 8. This is not unusual: because of the autoregressive nature of the $x_t$ series, the first and eighth lags are collinear, and so lag 8 also has predictive power for $y_t$.

For the $x_t$ predictions, we see that our model came close to finding the true coefficient value at lag 7. Interestingly, the coefficient at lag 14 also ended up being nonzero. Due to the autocorrelation, lags 7 and 14 are collinear, so lag 14 also has predictive power; this was to be expected and is not a flaw in the model.

Now, using our model, we can consider the one-step-ahead predictions on the test set. As can be seen they look very accurate, despite the fact that our model does not completely match the true model.
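Continuing the sketch above, the one-step-ahead predictions are simply the model outputs on the held-out rows of the lagged design matrix:

```python
# Each row of X_test contains the true past values, so every prediction
# is conditioned on actual observed history rather than on earlier forecasts.
y_one_step = model_y.predict(X_test)
x_one_step = model_x.predict(X_test)
```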

One-step forecasts can be misleading: because every prediction is conditioned on the true recent history, they make the model seem very accurate. Of course, we are mostly interested in forecasting much further than one step ahead. To do this we need to forecast both series, feeding our predictions back in as history. Below are the 100-step forecasts for both the $y_t$ and $x_t$ series.
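A recursive multi-step forecaster consistent with the earlier sketches might look like this; it assumes the fitted `model_y` and `model_x` and the same feature ordering as `make_lagged_dataset`.

```python
import numpy as np

def forecast(model_y, model_x, y_hist, x_hist, steps=100, window=14):
    """Forecast both series recursively, feeding each prediction back in as history."""
    y_hist, x_hist = list(y_hist[-window:]), list(x_hist[-window:])
    y_preds, x_preds = [], []
    for _ in range(steps):
        features = np.concatenate([y_hist[::-1], x_hist[::-1]]).reshape(1, -1)
        y_next = model_y.predict(features)[0]
        x_next = model_x.predict(features)[0]
        y_preds.append(y_next)
        x_preds.append(x_next)
        y_hist = y_hist[1:] + [y_next]   # drop the oldest value, append the forecast
        x_hist = x_hist[1:] + [x_next]
    return np.array(y_preds), np.array(x_preds)

# Forecast 100 steps from the end of the training portion of the series
y_fc, x_fc = forecast(model_y, model_x, y[:14 + split], x[:14 + split], steps=100)
```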

Again they seem fairly accurate, but we now see that certain aspects of the series are missed by our model. In the $x$ forecasts the main highs and lows are predicted fairly accurately, but some of the smaller highs and lows are missed. These errors then propagate to the $y$ forecasts because of the dependence of $y$ on $x$. This behavior is likely due to some of the small spurious nonzero coefficients that were picked up in the $x$ model.

To remedy this we could use a lasso model with a different lasso penalty for each series, which would allow us to tune each set of coefficients more carefully. Of course, in practice we do not know the true coefficients, so tuning the lasso penalties would have to rely on cross-validation and could still be susceptible to a certain amount of error.
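One way to choose a separate penalty for each series is cross-validation with scikit-learn's `LassoCV`, as in the sketch below. It reuses the training arrays from earlier; the cross-validation scheme is an assumption, and a time-series aware splitter such as `TimeSeriesSplit` may be preferable to plain k-fold.

```python
from sklearn.linear_model import LassoCV

# Pick a separate lasso penalty for each series by cross-validating over a grid of alphas
model_y = LassoCV(cv=5).fit(X_train, y_target[:split])
model_x = LassoCV(cv=5).fit(X_train, x_target[:split])

print("alpha for y model:", model_y.alpha_)
print("alpha for x model:", model_x.alpha_)
```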