## Understand the logic behind the fundamental algorithm used inside the gradient descent

In time series analysis, there is often a need to understand the trend direction of a sequence by taking into account previous values. Approximation of the next values in a sequence can be performed in several ways, including the usage of simple baselines or the construction of advanced machine learning models.

An** exponential (weighted) moving average** is a robust trade-off between these two methods. Having a simple recursive method under the hood makes it possible to efficiently implement the algorithm. At the same time, it is very flexible and can be successfully adapted for most types of sequences.

This article covers the motivation behind the method, a description of its workflow and bias correction — an effective technique to overcome a bias obstacle in approximation.

Imagine a problem of approximating a given parameter that changes in time. On every iteration, we are aware of all of its previous values. The objective is to predict the next value which depends on the previous values.

One of the naive strategies is to simply take the average of the last several values. This might work in certain cases but it is not very suitable for scenarios when a parameter is more dependent on the most recent values.

One of the possible ways to overcome this issue is to distribute higher weights to more recent values and assign fewer weights to prior values. The exponential moving average is exactly a strategy that follows this principle. **It is based on the assumption that more recent values of a variable contribute more to the formation of the next value than precedent values**.

To understand how the exponential moving average works, let us look at its recursive equation:

- vₜ is a time series that approximates a given variable. Its index t corresponds to the timestamp t. Since this formula is recursive, the value v₀ for the initial timestamp t = 0 is needed. In practice, v₀ is usually taken as 0.
- θ is the observation on the current iteration.
- β is a hyperparameter between 0 and 1 which defines how weight importance should be distributed between a previous average value vₜ-₁ and the current observation θ

Let us write this formula for first several parameter values:

As a result, the final formula looks like this:

We can see that the most recent observation θ has a weight of 1, the second last observation — β, the third last — β², etc. Since 0 < β < 1, the multiplication term βᵏ goes exponentially down with the increase of k, *so the older the observations, the less important they are*. Finally, every sum term is multiplied by (1 —β).

In practice, the value for β is usually chosen close to 0.9.

Using the famous second wonderful limit from mathematical analysis, it is possible to prove the following limit:

By making a substitution β = 1 – *x*, we can rewrite it in the form below:

We also know that in the equation for the exponential moving average, every observation value is multiplied by a term βᵏ where k indicates how many timestamps ago the observation was computed. Since the base β is equal in both cases, we can equate the exponents of both formulas:

By using this equation, for a chosen value of β, we can compute an approximate number of timestamps t it takes for weight terms to reach the value of 1 / e ≈ 0.368). It means that observations computed within last t iterations have a weight term greater than 1 / e and those more precedent calculated out of last t timestamp range come up with weights lower than 1 / e having a much less significance.

In reality, weights lower than 1 / e make a tiny impact on the exponentially weighted average. That is why it is said that **for a given value of β, the exponential weighted average takes into consideration the last t = 1 / (1 – β) observations**.

To get a better sense of the formula, let us plug in different values for β**:**

For instance, taking β

= 0.9 indicates that approximately in t = 10 iterations, the weight decays to 1 / e, compared to the weight of the current observation. In other words, the exponential weighted average mostly depends only on the last t = 10 observations.

The common problem with using exponential weighted average is that in most problems it cannot approximate well the first series values. It occurs due to the absence of a sufficient amount of data on the first iterations. For example, imagine we are given the following time series sequence:

The goal is to approximate it with the exponential weighted average. However, if we use the normal formula, then the first several values will put a large weight on v₀ which is 0 whereas most of the points on the scatterplot are above 20. As a consequence, a sequence of first weighted averages will be too low to precisely approximate the original sequence.

One of the naive solutions is to take a value for v₀ being close to the first observation θ₁. Though this approach works well in some situations, it is still not perfect, especially in cases when a given sequence is volatile. For example, if θ₂ differs too much from θ₁, then while calculating the second value v₂, the weighted average will normally put much more importance on the previous trend v₁ than the current observation θ₂. As a result, the approximation will be very poor.

A much more flexible solution is to use a technique called “**bias correction**”. Instead of simply using computed values vₖ, they are divided by (1 —βᵏ). Assuming that β is chosen close to 0.9–1, this expression tends to be close to 0 for first iterations where k is small. Thus, instead of slowly accumulating the first several values where v₀ = 0, they are now divided by a relatively small number scaling them into larger values.

In general, this scaling works very well and precisely adapts the first several terms. When k becomes larger, the denominator gradually approaches 1, thus gradually omitting the effect of this scaling which is no longer needed, because starting from a certain iteration, the algorithm can rely with a high confidence on its recent values without any additional scaling.

In this article, we have covered an extremely useful technique for approximating a time series sequence. The robustness of the exponential weighted average algorithm is primarily achieved by its hyperparameter β which can be adapted for a particular type of sequence. Apart from it, the introduced bias correction mechanism makes it possible to efficiently approximate data even on early timestamps when there is too little information.

Exponential weighted average has a wide application scope in time series analysis. Additionally, it used in variations of gradient descent algorithm for convergence acceleration. One of the most popular of them is the Momentum optimizer in deep learning which removes unnecessary oscillations of an optimized function aligning it more precisely towards a local minimum.

*All images unless otherwise noted are by the author*

This post originally appeared on TechToday.