Developing a Statistical Early Warning System

Data analysis is a core skill in digital marketing and SEO. As part of our efforts to improve the way we serve our clients, Kenny and I have been looking at how we interpret the plethora of data that the various online services we use, such as Google Search Console or Facebook advertising, provide us with.

[Chart: impressions to November 2019]

A good example of the kind of data we tend to deal with is visualised in the above chart. This is a plot of the number of impressions our site received in the 12 months to November 2019. As you can see, it comprises at least two, if not three, distinct trends. But this assertion is based on a subjective view of the data, and we want to offer our clients an opinion based on a rigorous and objective one.

Our goal, therefore, is to develop a process which leads to such an objective view.

At an abstract level, we want to be able to discern the status quo, and realise when the status quo has changed. Our strategy for developing such a process involves creating a predictive model of the data which can be used to forecast future values.

Predictive models of this kind generate a series in which each value is calculated by applying a function to the previous value in the sequence. For instance:

S_t = \theta S_{t-1} + m + Z_t

Where t is a point in time, θ is a tuning parameter, m is a constant and Z_t is a random variable representing the random fluctuations that cannot be predicted.
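To make the formula concrete, here is a minimal sketch that generates a series from it. The function name `simulate_series`, the starting value `s0` and the choice of a normally distributed Z_t with standard deviation `sigma` are illustrative assumptions, not something taken from our actual tooling.

```python
import random


def simulate_series(theta, m, sigma, s0, n, seed=0):
    """Generate n values of S_t = theta * S_{t-1} + m + Z_t.

    Assumption: Z_t is drawn from a normal distribution with
    mean 0 and standard deviation sigma.
    """
    rng = random.Random(seed)  # seeded so the example is repeatable
    series = [s0]
    for _ in range(n - 1):
        z = rng.gauss(0.0, sigma)  # the unpredictable fluctuation
        series.append(theta * series[-1] + m + z)
    return series


# With theta = 1 the series drifts upwards by roughly m per step.
example = simulate_series(theta=1.0, m=2.0, sigma=5.0, s0=100.0, n=30)
```

Setting `sigma` to zero removes the noise and leaves only the deterministic part of the model, which is a handy way to see what θ and m do on their own.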

The model is developed using data that is already available to us. As such, the model represents the status quo. With the model, we can forecast future values for as many periods ahead (e.g. 31 days) as necessary. Once that time has passed, we can calculate the forecast error, which is the observed value minus the forecast value.
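As a rough sketch of that workflow, the following fits the example model above by ordinary least squares, rolls it forward, and computes the errors. This is an assumption about one simple way to do it rather than a description of the software we used; the names `fit_model`, `forecast` and `forecast_errors` are hypothetical.

```python
def fit_model(series):
    """Least-squares estimates of theta and m in S_t = theta * S_{t-1} + m + Z_t."""
    x = series[:-1]  # previous values
    y = series[1:]   # the values that followed them
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    theta = sxy / sxx
    m = mean_y - theta * mean_x
    return theta, m


def forecast(last_value, theta, m, horizon):
    """Roll the model forward `horizon` periods with Z_t set to its mean of zero."""
    values = []
    value = last_value
    for _ in range(horizon):
        value = theta * value + m
        values.append(value)
    return values


def forecast_errors(observed, forecasts):
    """Error = observed value minus forecast value, period by period."""
    return [o - f for o, f in zip(observed, forecasts)]
```

For example, `fit_model([1, 3, 7, 15, 31])` recovers θ = 2 and m = 1 (up to floating point rounding), because each value in that toy series is exactly twice the previous one plus one.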

Take the following model and forecast we produced for the above impressions data:

The orange line is the value that the model predicts for the data on which it is trained. The model is arrived at by finding the set of parameters which best “fit” the training data. The red band is the forecast’s 95% confidence interval (strictly speaking, a prediction interval): the range within which we expect the observed value to fall 95% of the time. Lastly, the green line is the forecast. As you can see, the forecast is a straight line.
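For the simple example model given earlier, a band like this can be sketched by hand. The h-step-ahead forecast error has variance σ² (1 + θ² + ... + θ^(2(h-1))), where σ is the standard deviation of Z_t. The function below is a sketch under that assumption; it ignores the uncertainty in the fitted parameters, and the name `prediction_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def prediction_interval(forecasts, theta, sigma, level=0.95):
    """Return (lower, upper) bounds for each step of a forecast.

    Assumes S_t = theta * S_{t-1} + m + Z_t with normally distributed
    Z_t of standard deviation sigma, and treats theta as known exactly.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95%
    bands = []
    variance_factor = 0.0
    for h, value in enumerate(forecasts):
        variance_factor += theta ** (2 * h)  # adds theta^0, theta^2, theta^4, ...
        half_width = z * sigma * sqrt(variance_factor)
        bands.append((value - half_width, value + half_width))
    return bands
```

Note how the band widens the further ahead we look: each extra period adds another dose of unpredictable fluctuation to the forecast.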

The green line fits the first segment of the data quite well, but it starts to run into trouble around August, when there is a conspicuous dip in the series followed by a reversal, and a new trend emerges which is steeper and more positive than the one the model was trained on.

The changes in trend are easy to spot on a chart with this much hindsight, but how could we have realised that the trend had changed in early August? How could we tell that the precursor of the new trend wasn’t just random fluctuations around the trend?

To answer these questions we look to the forecasting errors.

Even if a model is producing reasonable forecasts, there will still be a difference between the forecast value and the observed value. These differences are represented by the Z component above and should be “stationary”, which means that, overall, they should have zero mean and constant variance. This can be checked by looking at a histogram of the errors, such as the following:

You can see from the above that most of the errors are close to zero, and the number of errors tails off in both directions as the magnitude of the error rises. We get comparatively few big errors, but lots of small ones.

The shape of the above histogram is similar to the shape we would expect if the errors are “normally distributed”. The normal distribution is a kind of probability distribution which appears very frequently in the natural world. Things like the height of human beings, the length of tree leaves, the return on equities from one time period to the next, population IQ, and lots of other worldly phenomena can be represented with a normal distribution.

If a forecasting model is working properly, we should see its errors conform to a normal distribution, with a mean of zero (the mean is the location of the peak at the centre of the distribution) and a symmetrical, bell-shaped tailing off in frequency as the magnitude of the errors increases. The width of the “bell” is defined by the standard deviation, which is a measure of the spread of the data.
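A quick numerical companion to the histogram is to check the mean of the errors and how many fall within one and two standard deviations of it. For normally distributed errors we would expect roughly 68% and 95% respectively. The helper below is only a sketch; the name `summarise_errors` is hypothetical, and it is a rough check rather than a formal normality test.

```python
import statistics


def summarise_errors(errors):
    """Summarise forecast errors for a rough zero-mean, bell-shape check."""
    mean = statistics.fmean(errors)
    sd = statistics.stdev(errors)  # sample standard deviation
    count = len(errors)
    within_1sd = sum(abs(e - mean) <= sd for e in errors) / count
    within_2sd = sum(abs(e - mean) <= 2 * sd for e in errors) / count
    return {
        "mean": mean,              # should be close to zero
        "sd": sd,
        "within_1sd": within_1sd,  # roughly 0.68 if normal
        "within_2sd": within_2sd,  # roughly 0.95 if normal
    }
```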

So how does this help us? Consider any point in time, its forecast and the observed value. The difference (error) between the forecast and the observed value will fall somewhere within the normal distribution, and we can use that distribution to work out how probable an error of that size is under random variation alone. A low-probability error (i.e. one with a high magnitude) is unlikely to be due to a random fluctuation. Several high-magnitude errors in a row are very unlikely to be due to random fluctuations, and so a series of high-magnitude errors indicates that something structural in the data has changed.
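One plausible way to turn an error into a probability is a two-sided tail probability: the chance of seeing an error at least this far from the mean if the model still holds. This is an assumption about how such a figure might be calculated, and the function name `error_probability` is hypothetical.

```python
from statistics import NormalDist


def error_probability(error, sigma, mean=0.0):
    """Probability of an error at least this extreme under N(mean, sigma^2).

    An error of zero gives 1.0; the larger the magnitude of the
    error, the closer the result gets to 0.
    """
    z = abs(error - mean) / sigma  # distance from the mean in standard deviations
    return 2.0 * (1.0 - NormalDist().cdf(z))
```

With this definition, an error of about 1.96 standard deviations has a probability of roughly 0.05, which ties in with the 95% band around the forecast.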

In our way of thinking, the model represents the status quo. By focussing on the forecast errors we are alerted to a change in the status quo in the form of a new trend, but one which is much easier and faster to detect than simply looking at the chart and trying to intuit when things have changed. Take the following plot:

There is quite a lot going on here, so let me explain.

The dots represent the observed values. The darker they are, the less probable they are according to our model. The green line is a moving average of those probabilities. The straight orange line is the forecast which is based on modelling the data up to the period immediately prior to its start.

During August the observed values suddenly become distinctly less probable under our model (that is, P(E) falls), and this is reflected in a decline in the green moving average (MA(7) P(E)). It is this moving average which gives us the signal that something is awry: something has changed, and we should be alert to the possibility that we may need to respond.
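The MA(7) smoothing can be sketched as a trailing seven-period average of the probabilities. The function name `moving_average` is hypothetical, and the choice to leave the first six positions empty (rather than averaging over a shorter window) is an assumption.

```python
def moving_average(values, window=7):
    """Trailing moving average of `values`.

    Positions without a full window behind them are returned as None.
    """
    averages = []
    running_total = 0.0
    for i, value in enumerate(values):
        running_total += value
        if i >= window:
            running_total -= values[i - window]  # drop the value leaving the window
        averages.append(running_total / window if i >= window - 1 else None)
    return averages
```

Averaging over a week smooths out the odd improbable day, so the line only falls when low probabilities persist.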

We can take this a step further by applying a (fairly naive) level which acts as a signal threshold. If the moving average falls below this threshold (a lower value means a lower probability), then we can infer that the old model is invalid and the paradigm that the website is working in has changed:

The red line, which is set at 0.3, is comprehensively breached in early August, which is when the data starts to diverge from the model. The moving average briefly recovers and then drops away again. Using this method, we have early, objective evidence that the world has changed, and we can alert our clients and investigate the phenomenon accordingly.
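The threshold check itself is very simple. Here is a minimal sketch that finds the first period in which the smoothed probability drops below 0.3; the name `first_breach` is hypothetical, and `None` entries stand for periods where no moving average is available yet.

```python
def first_breach(smoothed, threshold=0.3):
    """Return the index of the first value below the threshold, or None."""
    for i, value in enumerate(smoothed):
        if value is not None and value < threshold:
            return i
    return None


# The third period (index 3) is the first to fall below 0.3.
alert_index = first_breach([None, 0.5, 0.4, 0.29, 0.6])
```

In practice one might also want to require the breach to persist for a few periods before raising an alert, given that the line can briefly recover.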

In the next blog post, I’ll tell you what we did to precipitate this change.