STOCHASTIC APPROXIMATION AS A FUNCTION OF THE SIGNS OF THE ERROR PROCESS
 
 
09. 09. 2019
 
ABSTRACT
Stochastic approximation (STA) is a recursive parameter estimator, a simple heuristic algorithm often used in the modeling of stochastic processes. Its convergence rate behaves like 1/n, n = 1, 2, 3, ... . The recursive estimation structure of STA can be recognized in recursive least squares, maximum-likelihood and similar algorithms. The algorithm can be accelerated by imposing additional conditions. In this paper we present an algorithm based on the analysis of the prediction error process, together with a rule for controlling the convergence rate based on the signs of the errors, which prescribes a zero expected value and sign symmetry for the prediction error process.
INTRODUCTION
The stochastic approximation algorithm (STA) is a heuristic recursive parameter estimation algorithm, often used in mathematical modeling. The prediction error process is the difference between the measured value of the output and the value predicted in the previous step. In the case of linear models with scalar output and estimated parameter vector Θ, the model can always be reduced to the form yn = ΘnT gn, n = 1, 2, 3, ..., where yn denotes the predicted value of the output at time n, so the prediction error process has the form yn, measured - yn. The vector gn is called the vector of function components or the gradient vector in the literature, because gn = δ yn / δ Θ. T denotes the transpose. The parameter vector to be estimated is recursively modified, the direction of the modification being computed from the gradient vector gn:
 
 
                                                                  Θn = Θn-1 + 1/n (yn, measured - yn) gn .

The prediction error process and the prediction process denoted by yn are uncorrelated for known parameters, and the STA is an optimal estimate in the least squares sense; its convergence is proven in the case of the substitution 1/n → 1/(n + const.). Note: in the case of exponential forgetting, the update can be written in the form Θn = (1 - 1/n) Θn-1 + 1/n (yn, measured - yn) gn, where the factor 1 - 1/n typically takes values between 0.90 and 0.999. Exponential forgetting is a good algorithm for updating initial values. A good set of starting values is one for which the arithmetic mean of the prediction errors estimated on a small sample equals zero.
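As an illustration, here is a minimal sketch of the recursion above (our own illustrative code, not from the original text); the exponential-forgetting line follows the constant-factor form given in the note, and all names are assumptions:

```python
import numpy as np

def sta_update(theta, g, y_measured, n, forgetting=None):
    """One stochastic approximation step for the linear model y = theta^T g.

    forgetting: None gives the classical 1/n gain; a constant value in
    roughly 0.90-0.999 gives the exponential-forgetting form from the text."""
    error = y_measured - theta @ g           # prediction error
    if forgetting is None:
        return theta + (1.0 / n) * error * g                      # classical STA
    return forgetting * theta + (1.0 - forgetting) * error * g    # exponential forgetting

# Usage: estimate the parameters of a noisy linear model
rng = np.random.default_rng(0)
theta_true = np.array([2.0, -1.0])
theta = np.zeros(2)
for n in range(1, 2001):
    g = rng.normal(size=2)                   # persistently exciting regressor
    y_meas = theta_true @ g + 0.1 * rng.normal()
    theta = sta_update(theta, g, y_meas, n)
print(theta)                                 # close to theta_true
```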
By increasing the dimension of the estimated parameter vector, which is also the dimension of the gradient vector, the convergence of the estimate, i.e. the learning rate, deteriorates rapidly, and the error process may oscillate and diverge. Note: collinearity of the gradient (auxiliary variable) components means that two or more function components are strongly correlated with each other and carry the same information, which distorts the estimates and worsens the predictive ability of the model. Several techniques are available to deal with collinearity, such as variable selection or dimension reduction methods. Beginner researchers sometimes do not pay enough attention to the variability of the input signals and function components (i.e. persistently exciting signals), which can be checked by the sign changes of a zero-mean output.
The stochastic approximation is an approximately 70-year-old family of algorithms, and many accelerated versions are known (https://en.wikipedia.org/wiki/Stochastic_approximation). Here a normalized version of the gradient is used, so that the algorithm does not diverge in the case of large gradients:
 
                                                  Θn = Θn-1 + 1/n (yn, measured - yn) (Σ gi2)-1/2 gn ,
 
provided that the components gi are not all equal to zero. (It can easily be rewritten in diagonal matrix form.)
The algorithm is less sensitive to large values of the gradient than the original STA algorithm, due to the factor (Σ gi2)-1/2, i = 1, 2, ..., dim g = dim Θn. The case of small gradients is numerically simple to handle. In the case of large gradients, it is advisable to replace the gradient components with their moving averages.
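A minimal sketch of the normalized update, under the same assumptions as the previous example (illustrative code, not the author's implementation):

```python
import numpy as np

def sta_update_normalized(theta, g, y_measured, n, eps=1e-12):
    """Normalized STA step: the gradient is divided by its Euclidean norm
    (sum of g_i^2)^(1/2), so large gradients cannot make the step explode."""
    error = y_measured - theta @ g
    norm = np.sqrt(np.sum(g**2))
    if norm < eps:                 # all components (numerically) zero: skip the step
        return theta
    return theta + (1.0 / n) * error * g / norm
```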
 
CONVERGENCE OF THE LEARNING RATE AND THE ERROR PROCESS
In the following, we examine the learning rate and the error process. The bias contributes to the error process the term ΔΘnT gn; the error process is thus linearly proportional to the unknown parameter error ΔΘn. If 1/n approaches zero faster than ΔΘn, the estimate remains biased, which we try to avoid with a scalar multiplier cn calculated from the properties of the error process. The scalar multiplier 1/n is called the learning rate, and it decreases as O(1/n), where O denotes the order (Landau) symbol. The statement is also valid for closed-loop linear control systems, provided that there is a one-step delay in the loop. The proof can be carried out using the Steiner formula written for the estimation error (https://en.wikipedia.org/wiki/Bias_of_an_estimator). Only suboptimal STA algorithms exist, but suboptimal learning is also built from optimal sub-strategies. Due to our incomplete a priori knowledge, in the case of a suboptimal STA convergence can be achieved by decreasing the error process as 1/n and ensuring that its expected value is zero.
If the decrease is significantly faster than 1/n, it can stop the recursive estimation, resulting in a biased estimate. For example, if we reduce the error to zero with a separate iteration at each step in order to determine a good starting value, we obtain a noise-tracking estimate (https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff), which is nevertheless a frequently used acceleration. In the case of large but bounded errors relative to the absolute values of the parameter components, the estimates oscillate and then diverge, which can be prevented by artificially decreasing the step size, at least to 1/n (https://pp.bme.hu/ee/article/view/4974, Bencsik, I. (1974) "Convergence rate acceleration of successive approximation algorithms in real-time case", Periodica Polytechnica Electrical Engineering (Archives), 18(1), pp. 99-103). It is common to accept a small bias in the estimate, which speeds up convergence.
A problem related to measurement noise is estimating the true value of a signal in the presence of independent measurement noise with known standard deviation. For example, if the signal were white noise observed with measurement noise and both standard deviations were given, then the estimate of the true value would be the measured instantaneous value of the white noise multiplied by the ratio

(standard deviation of white noise - standard deviation of measurement noise) / standard deviation of white noise,

based on the Kalman filter (https://en.wikipedia.org/wiki/Kalman_filter).
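A tiny numeric illustration of this ratio (the standard deviations below are made-up example values; the formula itself is taken as given in the text):

```python
sigma_signal = 1.0    # standard deviation of the white-noise signal (assumed value)
sigma_noise = 0.3     # standard deviation of the measurement noise (assumed value)
ratio = (sigma_signal - sigma_noise) / sigma_signal   # ratio quoted above

y_measured = 0.8                   # one noisy observation
y_estimate = ratio * y_measured    # estimate of the true, noise-free value
print(ratio, y_estimate)           # 0.7 and 0.56
```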
 
The problem related to system noise, and to a lesser extent to measurement noise, is the noise sensitivity of the estimates: when estimating good starting values, a learning sample helps to deal with the problem. (Based on the Kalman filter, it can be seen that for a truly optimal estimation of the output and the parameters in the least squares sense, i.e. the so-called extended Kalman filter, the magnitudes of the system noise and the measurement noise, i.e. the standard deviation of the system white noise and the standard deviation of the measurement noise, are also required. The estimation then coincides with the Bayesian estimate and the maximum-likelihood (MLH) estimate.) The noise processes of the Kalman filter are assumed to be independent (i.e. white) Gaussian noises with zero expectation, or noises with a uniform distribution and zero expectation, the latter being a new discovery.
The zero expected value of the error process and the suboptimal convergence rate are ensured by a special choice of the step size: a scalar multiplier, denoted by cn, is recalculated in each step, and we require not only that the error process decreases as 1/n, but also that the sign of the error process differs in ≈ 50% of the steps. (Sign test: István Vincze: Matematikai Statisztika, Műszaki Könyvkiadó, 1968, p. 160, in Hungarian.) The learning rate is then characterized by cn/n. The variability of the input signals, i.e. persistently exciting inputs in general, is important, because without sufficient variability of the signals the estimate will not change to the required extent. This variability can be partly ensured by calculating the scalar step-size multiplier cn in each step in such a way that the error, after a trial step with a unit scalar multiplier, changes sign and simultaneously decreases in absolute value by 1/n: this is the optimal learning rate for STA algorithms. The steepest descent method is then also applied to the sign of the zero-expected-value error process. With cn > 0, convergence faster than 1/n can be set, but the price of faster learning is a biased estimate. A sketch of this step-size rule is given below.
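For the scalar-output linear model, the post-step error can be written in closed form, so cn can be solved for directly. The sketch below is one possible reading of the rule, not a prescription from the text: we interpret "decreases in absolute value by 1/n" as a reduction by the factor 1 - 1/n, and the function name and closed-form choice of cn are our assumptions.

```python
import numpy as np

def sign_symmetric_step(theta, g, y_measured, n):
    """Choose c_n so that after the step the prediction error changes sign and
    its magnitude shrinks by the factor (1 - 1/n).

    For the linear model y = theta^T g with the normalized step
        theta_new = theta + (c_n / n) * err * g / ||g||,
    the post-step error is err * (1 - (c_n / n) * ||g||); requiring
        err_after = -(1 - 1/n) * err
    gives c_n = (2n - 1) / ||g||."""
    err = y_measured - theta @ g
    g_norm = np.linalg.norm(g)
    if g_norm == 0.0 or err == 0.0:
        return theta, 0.0              # nothing to learn from this step
    cn = (2.0 * n - 1.0) / g_norm
    theta_new = theta + (cn / n) * err * g / g_norm
    return theta_new, cn
```

Note that this choice tracks the current measurement aggressively; in practice cn would also be constrained by the sign-symmetry statistics of the last 20 errors, as discussed in the next section.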
 
ABOUT THE ZERO EXPECTED VALUE PREDICTION ERROR PROCESS
We assume that the estimation, system and measurement noise processes are white Gaussian processes with zero expected value and are uncorrelated with the process under investigation and being modeled. The assumed noise models influence the modeling. To evaluate the prediction error process, we use a short sample containing the last 20 elements. By testing the symmetry of the signs of the errors, we find a scalar cn ≥ 0, recalculated at each step, which ensures that the error decreases in absolute value by O(1/n) and that the two signs of the errors occur approximately the same number of times. We check the latter using an indicator variable: we determine the expected length of runs of identical signs.
 
Statement: Consider a number system with base b in which each digit occurs independently with probability 1/b, and let P(X=k) denote the probability that a pattern (a run of identical digits) obtained by independent trials has length exactly k, in a countably infinite sequence. (For example, the binary expansion of the number π can serve as such a sequence; the concept of a pattern corresponds here to runs between sign changes.) Then, for a general base b:
 
                                                P(X = k) = p (1-p)^(k-1) = ((b-1)/b) (1/b)^(k-1) = (b-1)/b^k , for k = 1, 2, 3, ...
 
where p = (b-1)/b is the parameter of the geometric distribution, and also the probability of singletons (runs of length 1). The expected value of the random variable X is b/(b-1), the reciprocal of the parameter (b-1)/b; the geometric distribution has maximum entropy among distributions on the positive integers with a given mean (it is the discrete analogue of the exponential distribution). In the case b = 2 the expected value is 2.

The expected value of a run without singletons, i.e. conditioned on k = 2, 3, 4, ..., is (2b-1)/(b-1), which equals 3 in the case b = 2.
For a finite sample of size kmax, the sum of the probabilities (b-1) Σk b^(-k) over k = 2, 3, ..., kmax is
(b-1) Σk b^(-k) = 1/b - b^(-kmax) ≈ 1/2 for b = 2, where we chose kmax = 20.
So the sum of the probabilities P(X=k) of the runs with k = 2, 3, 4, ... in a window containing 20 elements is approximately 1/2, and the probability of occurrence of singletons is likewise 1/2 at maximum entropy. Based on the statement, the frequencies of occurrence of the signs of the errors in the 20-element window can be evaluated, and the values of cn can be calculated in the STA algorithm so that the signs occur symmetrically. (Another possible way to calculate the step-size modifier cn is to require the absolute value of the error process yn, measured - yn to decrease at a rate between 1/n and 1/n^(1/2) and to change sign at each step.) A sketch of such a window-based check is given below.
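To make the statement concrete, the following sketch (the variable names, tolerance and decision rule are illustrative assumptions on our part) computes P(X = k) = (b-1)/b^k for b = 2, verifies that singletons and longer runs each carry probability of about 1/2 in a 20-element window, and checks the sign runs of the last 20 prediction errors against the theoretical mean run length b/(b-1) = 2:

```python
import numpy as np

B = 2          # binary "alphabet": the two signs of the error
K_MAX = 20     # window length used in the text

# Geometric run-length probabilities P(X = k) = (b-1) / b**k
p_singleton = (B - 1) / B**1                               # k = 1 -> 1/2
p_longer = sum((B - 1) / B**k for k in range(2, K_MAX + 1))
print(p_singleton, p_longer)          # 0.5 and approximately 0.5

def run_lengths(signs):
    """Lengths of maximal runs of identical signs in a sequence of +/-1."""
    lengths, current = [], 1
    for prev, cur in zip(signs[:-1], signs[1:]):
        if cur == prev:
            current += 1
        else:
            lengths.append(current)
            current = 1
    lengths.append(current)
    return lengths

def signs_look_symmetric(errors, tol=0.5):
    """Crude check on the last 20 errors: the mean run length should be close
    to the theoretical value b/(b-1) = 2 for a zero-mean, sign-symmetric
    error process (the tolerance is an illustrative choice)."""
    signs = np.sign(errors[-K_MAX:])
    mean_run = np.mean(run_lengths(list(signs)))
    return abs(mean_run - 2.0) <= tol

# Example: a zero-mean white error sequence should usually pass the check
rng = np.random.default_rng(1)
errors = rng.normal(size=200)
print(signs_look_symmetric(errors))
```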