ON THE CONVERGENCE OF STOCHASTIC APPROXIMATION
 
 
AUGUST 19, 2019
 
 
ABSTRACT
This paper considers the O(1/n) convergence rate of parameter estimation of noise-corrupted, linear-in-parameters systems by stochastic approximation algorithms, and an algorithm is presented as well. The paper deals with the convergence of a stochastic approximation algorithm family based on the signs of the output prediction error process, which is assumed to be asymptotically a zero-mean white noise process orthogonal (uncorrelated) to the predicted output process. Chebyshev-type criteria are presented.
 
 
INTRODUCTION
The problem is to compute the unknown parameters Θ of a dynamical system when only noise-corrupted measurements of the outputs y_n are available at discrete times n = 1, 2, 3, .... This problem is close to the classical linear regression and prediction problem with noise-corrupted measurements; see the extended Kalman filter. In the scalar, linear-in-parameters case the prediction model is assumed in the form y_n = Θ^T g_n, where T stands for the transpose and g_n denotes grad_Θ y, i.e. the gradient of y with respect to Θ in the n-th step.
The origin of the stochastic approximation algorithm family (STA, https://en.wikipedia.org/wiki/Stochastic_approximation) is the heuristic combination of two methods: the recursive estimation of relative frequencies and the gradient method (https://en.wikipedia.org/wiki/Stochastic_gradient_descent):
 
A_{avr,n} = (1/n) \sum_{i=1}^{n} A_i = (1/n)\{(n-1) A_{avr,n-1} + A_n\} = (1 - 1/n) A_{avr,n-1} + A_n/n = A_{avr,n-1} + (1/n)(A_n - A_{avr,n-1})
 
in the form of:
Θ_n = Θ_{n-1} + (1/n)(y_{n,measured} - y_n) g_n .
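
As an illustration of how the recursive averaging above specializes to this parameter update, the following minimal Python sketch implements both; the function names and calling conventions are illustrative assumptions, not part of the paper.

```python
import numpy as np

def recursive_average(a_prev, a_new, n):
    """Recursive mean: A_{avr,n} = A_{avr,n-1} + (1/n)(A_n - A_{avr,n-1})."""
    return a_prev + (a_new - a_prev) / n

def sa_update(theta_prev, y_measured, g_n, n):
    """Basic stochastic approximation step:
    Theta_n = Theta_{n-1} + (1/n)(y_measured - y_pred) g_n,
    with the linear-in-parameters prediction y_pred = Theta^T g_n."""
    y_pred = theta_prev @ g_n
    return theta_prev + (y_measured - y_pred) * g_n / n
```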
 
Following a further idea, the gradient is computed in a normed form, so that the algorithm has the form:
 
                                           Θ_n = Θ_{n-1} + (1/n)(y_{n,measured} - y_n) (\sum_{i} g_i^2)^{-1/2} g_n ,
 
in order to prevent divergence caused by large gradient values; here i = 1, 2, ..., dim g. It is assumed that the inverse exists. In what follows the prediction error process is discussed, which tends asymptotically to a zero-mean process at rate O(1/n), where O denotes the Ordo (big-O) notation and 1/n is considered the learning rate.
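
A minimal sketch of the normed-gradient variant, assuming the normalization is by the Euclidean norm of the current gradient vector (the function name and interface are illustrative assumptions):

```python
import numpy as np

def sa_update_normed(theta_prev, y_measured, g_n, n):
    """Normed-gradient SA step:
    Theta_n = Theta_{n-1} + (1/n)(y_measured - y_pred)(sum_i g_i^2)^{-1/2} g_n,
    where the sum runs over the components i = 1, ..., dim g of g_n."""
    y_pred = theta_prev @ g_n
    g_norm = np.sqrt(np.sum(g_n ** 2))
    # Normalizing by the gradient norm limits the effect of large gradient
    # values; it is assumed that g_norm > 0 so the inverse exists.
    return theta_prev + (y_measured - y_pred) * g_n / (n * g_norm)
```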
 
THE LEARNING RATE AND CONVERGENCE OF THE PREDICTION ERROR PROCESS (See https://en.wikipedia.org/wiki/Fisher_information)
There are known methods to accelerate the convergence rate of STA algorithms of the LMS type relative to O(1/n), but the consequence of the acceleration is bias in the parameters. The usual compromise is a small bias with fewer estimated parameters (https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff). The estimation can stop, or in the opposite case the estimates can oscillate (https://pp.bme.hu/ee/article/view/4974; Bencsik, I. (1974) "Convergence rate acceleration of successive approximation algorithms in real-time case", Periodica Polytechnica Electrical Engineering (Archives), 18(1), pp. 99-103). A general assumption for STA algorithms is that the quadratic moving average of the predicted output error process decreases with increasing n. The decrease of a quadratic moving average of the output error process with a modified step size is ensured by a scalar c. The system, estimation and measurement noise processes are assumed to be zero-mean white Gaussian noise processes uncorrelated with the predicted process. The assumed noise models determine the parameter estimation method; this is a reason for using the prediction-type model in this paper as well. Well-known applications are STA algorithms for models of the Θ_n^T g_n form, obtained by whitening the residual processes, which are prediction-error-type processes in this paper.
The modification of the original step size by the scalar c is not only due to the O(1/n) step size but also due to the signs of the prediction errors, so as to obtain equal numbers of +/- signs in a moving window. The symmetry of the signs is checked by a moving average of an indicator value, and c > 0 is modified after a test step in every step. The prediction error y_{n,measured} - y_n can converge to the system noise process at a rate between n^{-1} and n^{-1/2} (https://arxiv.org/abs/1805.08114).
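
The following sketch illustrates one possible reading of this sign-based step-size modification: a moving average of the prediction-error signs serves as the indicator, and c is adjusted after a test step whenever the +/- signs in the window are unbalanced. The window length, adjustment factor, threshold and function names are assumptions made only for illustration.

```python
from collections import deque

def make_sign_balancer(window=20, adjust=0.9):
    """Return an updater for c > 0 driven by the signs of the prediction errors.

    The indicator is the moving average of the error signs over the window;
    it stays near zero when the +/- signs are balanced."""
    signs = deque(maxlen=window)

    def update_c(c, prediction_error):
        signs.append(1.0 if prediction_error >= 0 else -1.0)
        indicator = sum(signs) / len(signs)
        # Test step: if the signs are unbalanced, shrink c; otherwise relax it
        # back toward its nominal value (the relaxation rule is an assumption).
        if abs(indicator) > 0.5:
            c = c * adjust
        else:
            c = min(c / adjust, 1.0)
        return c

    return update_c
```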
Increasing the number of degrees of freedom of the model has a "cost". A workable solution is to let N denote dim Θ_n, the dimension of the parameter vector and thus of the model, and let D²(y)/M(y²) denote the criterion value, where D²(·) denotes the variance and M(·) the expectation. Then it makes sense to compute an (N + 1)-dimensional model if
 
                                           (N + 1) D^2_{N+1}(y) < N D^2_N(y).
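
A sketch of this model-order test, assuming D²_N(y) is estimated by the sample variance of the prediction errors of the N-dimensional model (the estimator choice and function name are assumptions):

```python
import numpy as np

def accept_extra_dimension(errors_N, errors_N1, N):
    """Accept the (N+1)-dimensional model only if
    (N + 1) * D^2_{N+1}(y) < N * D^2_N(y),
    where D^2 is estimated here by the sample variance of the prediction errors."""
    d2_N = np.var(errors_N)
    d2_N1 = np.var(errors_N1)
    return (N + 1) * d2_N1 < N * d2_N
```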
 
 

Last modification: AUGUST 20, 2019