Comparison of Machine Learning Methods for Forecasting Web Traffic


Lately, in relation to time series forecasting, I have been considering the following questions:

  • How can we make accurate long-term forecasts when we do not have much data?
  • What methods, if any, produce the best accuracy?
  • In particular, will they work with real data and not just toy examples?

To answer these questions I ran a forecasting experiment using some real web traffic data. This post covers the results of the experiment. The code is available here.

The main conclusion is that, when training data is limited, we need to use methods that are good at feature selection, such as Lasso regression, to produce accurate forecasts.

Forecasting With Limited Training Data

When forecasting time-series, we often want to forecast far into the future relative to our historical data. For example, given a few years of past data, we want to know what to expect for the next few years.

However, to be able to forecast, we need to determine how much information the past data gives us about the future data. We then need to select and fit a model that can capture those relationships. If we have lots of historical data this process can be straightforward. However, when only limited data is available, it becomes more difficult, both to find statistically significant relationships in the data and to fit more sophisticated models.

Web Traffic Data

I used data from a Kaggle competition on forecasting web traffic for Wikipedia articles (available here). The dataset contains 1.5-2 years of daily web traffic data for 145,000 Wikipedia articles. The articles range from sparse traffic, a few visits per day, to very heavy traffic, 10,000-100,000 visits per day.

Forecasting sparse traffic could introduce difficulties, so here I consider the web traffic for the "Internet" Wikipedia article. Below is a graph of the historical traffic data. There definitely seem to be repeating seasonal patterns across different time scales. There are also occasional outlier spikes in traffic.

We can investigate the first few months to get a better idea of any short term patterns. Here we see some evidence of a repeating weekly cyclical pattern in the traffic data.

Analyzing Historical Dependencies

Looking at the data gave us a qualitative idea of the possible relationships we can model. Now we need to analyze the data to determine how we can describe those relationships quantitatively.

Here we need to choose a historical window within which to look for relationships, i.e. how far back in time the past data still contains information about the future. Intuitively, web traffic data seems like it could depend a lot on the time of year, the month, and the day of the week. E.g. if it is winter somewhere and cold outside, we would perhaps expect to see increased traffic. To find these relationships we will need a historical window of at least a year, so 365 steps back.

To keep the analysis and experimentation fair I split the data into a training set and testing set with 390 and 160 data points respectively. Note that with the 365 step historical window, 390 training points only gives us 25 example windows to learn from, meaning that we are definitely in the limited training data regime.
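As a rough sketch of this setup (the `make_windows` helper and the synthetic `traffic` array are illustrative stand-ins, not the actual Kaggle series), the windowing looks like:

```python
import numpy as np

def make_windows(series, window=365):
    """Build (lags, target) pairs: each row holds the previous
    `window` values, and y holds the value that follows them."""
    X, y = [], []
    for t in range(window, len(series)):
        X.append(series[t - window:t])
        y.append(series[t])
    return np.array(X), np.array(y)

# Hypothetical daily traffic series with 550 observations.
traffic = np.random.default_rng(0).poisson(5000, size=550).astype(float)
train, test = traffic[:390], traffic[390:]

X_train, y_train = make_windows(train, window=365)
print(X_train.shape)  # (25, 365): only 25 training windows
```

This makes the limited-data problem concrete: a 365-step window eats almost all of the 390 training points, leaving only 25 examples.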

Now let us explore the relationships in the data. A simple tool for this is the correlation between the current value and all the past values in the historical window. This is known as the autocorrelation function (ACF). However, the ACF can only provide somewhat indirect information on the relationships in the data so it can take some skill to analyze. This is because, if the current value has a relationship with the past at some lag, the ACF will be nonzero for that lag, but will also be nonzero for subsequent lags.

To see this, consider an AR(2) process $$ y_t = \phi y_{t-2}+\epsilon_t $$ Here we would certainly expect to see a nonzero correlation at the second lag. But if we recursively apply the AR formula we get $$ y_t = \phi(\phi y_{t-4}+\epsilon_{t-2}) + \epsilon_t = \phi^2 y_{t-4} + \phi\epsilon_{t-2} + \epsilon_t $$ So we see that there will also be a nonzero correlation at the fourth lag. In particular, if $\phi$ is negative it will cause the ACF to oscillate between positive and negative. However, the indirect correlations will typically decay. So it is really when the magnitude of the ACF changes suddenly that there is a direct relationship.
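We can check this numerically by simulating such an AR(2) process and estimating its autocorrelations (a quick illustrative script; the value of `phi` is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
phi = 0.8
n = 20000

# Simulate y_t = phi * y_{t-2} + eps_t
y = np.zeros(n)
eps = rng.normal(size=n)
for t in range(2, n):
    y[t] = phi * y[t - 2] + eps[t]

def acf(x, lag):
    """Sample autocorrelation at a single lag."""
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

print(acf(y, 2), acf(y, 4))  # roughly phi and phi**2
```

The lag-4 correlation is nonzero even though the process only directly depends on lag 2, exactly as the recursion above predicts.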

So let's look at the ACF. I plot the ACF, along with the 20% upper and lower confidence bounds. We see periodic behavior at different time scales. This indeed suggests there are short and long-term relationships in the data. However, inspecting the upper and lower confidence bounds it is difficult to say whether the correlations are statistically significant.

Another tool that can be used to find relationships in time series is the partial autocorrelation function (PACF). The PACF is similar to the ACF in that it measures the relationship between the current value and past lags. However, the PACF subtracts out any indirect correlations caused by the intervening lags. The PACF is shown below. Unfortunately, it seems that with our limited training data it mostly discovered large spurious relationships.

Finally, we can try fitting various forms of linear regression and analyzing the resulting coefficient values. Neither linear regression nor ridge regression discovered any large relationships in the data. However, when we apply Lasso regression, we see stronger short-term relationships in addition to a few long-term ones.
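A sketch of this coefficient inspection, using a synthetic series with a known lag-7 dependency (the series, window length, and regularization value are all illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
# Synthetic series with a direct weekly (lag-7) dependency.
n, window = 500, 30
y = np.zeros(n)
for t in range(7, n):
    y[t] = 0.9 * y[t - 7] + rng.normal()

# Lag features: each row holds the previous `window` values, oldest first.
X = np.array([y[t - window:t] for t in range(window, n)])
target = y[window:]

model = Lasso(alpha=0.1, max_iter=10000).fit(X, target)
# Column j corresponds to lag (window - j), so lag 7 is column window - 7.
print(np.argmax(np.abs(model.coef_)), window - 7)
```

The L1 penalty zeroes out the uninformative lags, which is exactly the feature selection behavior that matters later in this post.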

Machine Learning and Statistical Forecasting

I now test several machine learning methods for forecasting:

  • Linear regression
  • Ridge Regression
  • Lasso Regression
  • Gradient Boosting Trees
  • Random Forests (RF)
  • Support Vector Regression with RBF kernel (SVR)
  • Feed-forward Neural Network (3 layers) (NN)

I also test the Holt-Winters (HW) statistical time-series modeling method.

To test each method I use the 390 training data points to forecast 160 steps into the future and compare against the test data. For the historical window I used 365 steps, so that the tested methods have a chance to capture any yearly relationships in the data. Most of the methods I tested have hyperparameters that can be tuned to improve their accuracy. To tune the hyperparameters, I used the first 80 forecast steps as a validation dataset. Automated hyperparameter optimization would be useful here, but is beyond the scope of this post for now. Interestingly enough, all but the linear-regression-based models were difficult to tune manually. For Lasso, a regularization value of 50 worked well. The ridge regression model was not very sensitive to the regularization value, so I set it to 100.
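The validation-based tuning can be sketched as a simple grid search over the Lasso regularization value. This runs on a synthetic series and scores one-step-ahead validation error, which is a simplification of the multi-step forecasts used in the actual experiment:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, window = 300, 30
series = 100 * np.sin(2 * np.pi * np.arange(n) / 7) + rng.normal(0, 10, n)

# Lag features, oldest first, as before.
X = np.array([series[t - window:t] for t in range(window, n)])
y = series[window:]
split = len(y) - 80  # hold out the last 80 points for validation
X_tr, y_tr = X[:split], y[:split]
X_val, y_val = X[split:], y[split:]

best_alpha, best_err = None, np.inf
for alpha in [0.1, 1, 10, 50, 100]:
    pred = Lasso(alpha=alpha, max_iter=10000).fit(X_tr, y_tr).predict(X_val)
    err = np.mean((pred - y_val) ** 2)
    if err < best_err:
        best_alpha, best_err = alpha, err
print(best_alpha)
```

The best regularization value depends heavily on the scale and noise of the series, which is why tuning it on held-out data matters so much in the discussion below.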

Here is a plot of the validation forecasts. We see that the HW and Lasso models produced detailed forecasts, capturing some of the high and low points of web traffic. The other methods seem to have captured the weekly variations, but do not model the longer term variations in the data.

Finally here are the forecasts on the full test set. The results are mostly similar to the validation dataset.

Discussion

Since most methods ended up producing a somewhat average forecast, whereas a properly tuned Lasso produced more detailed forecasts, we see that making accurate forecasts on real data requires careful feature selection. This means we need to use a method (such as Lasso) that can determine the specific seasonal relationships in our data.

I tested the methods in this post on other web traffic data. To get Lasso to work well, it was important to tune the regularization parameter on the validation dataset. Without tuning, Lasso was prone to overfitting and did not produce accurate forecasts. In contrast, the Holt-Winters approach worked fairly consistently. Therefore, with limited training data, we either need to be very disciplined in our hyperparameter optimization process, or we should use models with few parameters, like Holt-Winters.