ON THE CONVERGENCE OF STOCHASTIC APPROXIMATION
 
 
AUGUST 19, 2019
 
 
ABSTRACT
This paper considers the O(1/n) convergence rate of parameter estimation of noise-corrupted, linear-in-parameters systems by stochastic approximation algorithms, and an algorithm is presented as well. The paper deals with the convergence of a stochastic approximation algorithm family based on the signs of the output prediction error process, which is assumed to be asymptotically a zero-mean white noise process orthogonal (uncorrelated) to the predicted output process. Chebyshev-type criteria are presented.
 
 
INTRODUCTION
The problem is to compute the unknown parameters Θ of a dynamical system when only noise-corrupted measurements of the outputs y_n are available at discrete times n = 1, 2, 3, .... This problem is close to the classical linear regression and prediction problem with noise-corrupted measurements; see the extended Kalman filter. In the scalar, linear-in-parameters case the prediction model is assumed in the form y_n = Θ^T g_n, where T stands for the transpose and g_n denotes grad_Θ y, i.e. the gradient of y with respect to Θ in the n-th step.
The origin of the stochastic approximation algorithm family (STA, https://en.wikipedia.org/wiki/Stochastic_approximation) is the heuristic combination of two methods: the recursive estimation of relative frequencies and the gradient method (https://en.wikipedia.org/wiki/Stochastic_gradient_descent):
 
A_{avr,n} = (1/n) \sum_{i=1}^{n} A_i = (1/n)\{(n-1) A_{avr,n-1} + A_n\} = (1 - 1/n) A_{avr,n-1} + A_n/n = A_{avr,n-1} + (1/n)(A_n - A_{avr,n-1})
 
in the form of:
Θ_n = Θ_{n-1} + (1/n)(y_{n,measured} - y_n) g_n .
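
As an illustration of how the recursive averaging above specializes to this parameter update, the following minimal Python sketch implements both; the function names and calling conventions are illustrative assumptions, not part of the paper.

```python
import numpy as np

def recursive_average(a_prev, a_new, n):
    """Recursive mean: A_{avr,n} = A_{avr,n-1} + (1/n)(A_n - A_{avr,n-1})."""
    return a_prev + (a_new - a_prev) / n

def sa_update(theta_prev, y_measured, g_n, n):
    """Basic stochastic approximation step:
    Theta_n = Theta_{n-1} + (1/n)(y_measured - y_pred) g_n,
    with the linear-in-parameters prediction y_pred = Theta^T g_n."""
    y_pred = theta_prev @ g_n
    return theta_prev + (y_measured - y_pred) * g_n / n
```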
 
Following a further idea, the gradient is computed in a normed form, so that the algorithm has the form:
 
                                           Θ_n = Θ_{n-1} + (1/n)(y_{n,measured} - y_n) (\sum_{i} g_i^2)^{-1/2} g_n ,
 
in order to prevent divergence caused by large gradient values; here i = 1, 2, ..., dim g. It is assumed that the inverse exists. In what follows the prediction error process is discussed, which tends asymptotically to a zero-mean process at rate O(1/n), where O denotes the Ordo (big-O) notation and 1/n is considered the learning rate.
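
A minimal sketch of the normed-gradient variant, assuming the normalization is by the Euclidean norm of the current gradient vector (the function name and interface are illustrative assumptions):

```python
import numpy as np

def sa_update_normed(theta_prev, y_measured, g_n, n):
    """Normed-gradient SA step:
    Theta_n = Theta_{n-1} + (1/n)(y_measured - y_pred)(sum_i g_i^2)^{-1/2} g_n,
    where the sum runs over the components i = 1, ..., dim g of g_n."""
    y_pred = theta_prev @ g_n
    g_norm = np.sqrt(np.sum(g_n ** 2))
    # Normalizing by the gradient norm limits the effect of large gradient
    # values; it is assumed that g_norm > 0 so the inverse exists.
    return theta_prev + (y_measured - y_pred) * g_n / (n * g_norm)
```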
 
THE LEARNING RATE AND CONVERGENCE OF THE PREDICTION ERROR PROCESS (See https://en.wikipedia.org/wiki/Fisher_information)
There are known methods to accelerate the convergence rate of STA algorithms of the LMS type relative to O(1/n), but the consequence of the acceleration is bias in the parameters. The usual compromise is a small bias with fewer estimated parameters (https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff). The estimation can stop, or in the opposite case the estimates can oscillate (https://pp.bme.hu/ee/article/view/4974; Bencsik, I. (1974) "Convergence rate acceleration of successive approximation algorithms in real-time case", Periodica Polytechnica Electrical Engineering (Archives), 18(1), pp. 99-103). A general assumption for STA algorithms is that the quadratic moving average of the predicted output error process decreases with increasing n. The decrease of a quadratic moving average of the output error process with a modified step size is ensured by a scalar c. The system, estimation and measurement noise processes are assumed to be zero-mean white Gaussian noise processes uncorrelated with the predicted process. The assumed noise models determine the parameter estimation method; this is a reason for using the prediction-type model in this paper as well. Well-known applications are STA algorithms for models of the Θ_n^T g_n form, obtained by whitening the residual processes, which are prediction-error-type processes in this paper.
The modification of the original step size by the scalar c is not only due to the O(1/n) step size but also due to the signs of the prediction errors, so as to obtain equal numbers of +/- signs in a moving window. The symmetry of the signs is checked by a moving average of an indicator value, and c > 0 is modified after a test step in every step. The prediction error y_{n,measured} - y_n can converge to the system noise process at a rate between n^{-1} and n^{-1/2} (https://arxiv.org/abs/1805.08114).
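
The following sketch illustrates one possible reading of this sign-based step-size modification: a moving average of the prediction-error signs serves as the indicator, and c is adjusted after a test step whenever the +/- signs in the window are unbalanced. The window length, adjustment factor, threshold and function names are assumptions made only for illustration.

```python
from collections import deque

def make_sign_balancer(window=20, adjust=0.9):
    """Return an updater for c > 0 driven by the signs of the prediction errors.

    The indicator is the moving average of the error signs over the window;
    it stays near zero when the +/- signs are balanced."""
    signs = deque(maxlen=window)

    def update_c(c, prediction_error):
        signs.append(1.0 if prediction_error >= 0 else -1.0)
        indicator = sum(signs) / len(signs)
        # Test step: if the signs are unbalanced, shrink c; otherwise relax it
        # back toward its nominal value (the relaxation rule is an assumption).
        if abs(indicator) > 0.5:
            c = c * adjust
        else:
            c = min(c / adjust, 1.0)
        return c

    return update_c
```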
Increasing the number of degrees of freedom of the model has a "cost". A workable solution is to let N denote dim Θ_n, the dimension of the parameter vector and thus of the model, and let D²(y)/M(y²) denote the criterion value, where D²(·) denotes the variance and M(·) the expectation. Then it makes sense to compute an (N + 1)-dimensional model if
 
                                           (N + 1) D^2_{N+1}(y) < N D^2_N(y).
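
A sketch of this model-order test, assuming D²_N(y) is estimated by the sample variance of the prediction errors of the N-dimensional model (the estimator choice and function name are assumptions):

```python
import numpy as np

def accept_extra_dimension(errors_N, errors_N1, N):
    """Accept the (N+1)-dimensional model only if
    (N + 1) * D^2_{N+1}(y) < N * D^2_N(y),
    where D^2 is estimated here by the sample variance of the prediction errors."""
    d2_N = np.var(errors_N)
    d2_N1 = np.var(errors_N1)
    return (N + 1) * d2_N1 < N * d2_N
```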
 
 

Last modification: AUGUST 20, 2019