STOCHASTIC APPROXIMATION AS A FUNCTION OF THE SIGNS OF THE ERROR PROCESS
 
 
09. 09. 2019
 
 
 
 
 
 
 
 
 
ABSTRACT
In this comment the prediction error type output process is considered; it is assumed to be a zero mean normal white noise process, uncorrelated with the predicted output process. A check of the symmetry of the +/- signs in a moving window is proposed: by theory the window consists of 50% singles and 50% sequences (with a mean length of 3). A scalar is modified in every step to obtain symmetric signs, i.e. 50% + and 50% - signs in the window. Chebyshev type criteria are presented.
 
INTRODUCTION
 
In stochastic approximation type algorithms the error process can be an equation error, a statistical error, a residual (prediction, filtering) error, or some other error. The statistical disturbance type error is the deviation of the observed value from the (generally unobservable) true value; an example is the measurement noise. The residual error (that of predictors and filters) is the difference between the observed value and the estimated value. In the considered case the observed values are corrupted by measurement disturbances. The sum of the residuals is zero and the residuals are assumed to be uncorrelated (whitening in closed loop: https://arc.aiaa.org/doi/abs/10.2514/2.4574?journalCode=jgcd). The sum of squares of the residuals and the sample mean can be shown to be independent of each other (https://en.wikipedia.org/wiki/Errors_and_residuals). In numerical computation further errors may arise because of truncation and roundoff.
 
The considered problem is to compute the unknown parameter vector Θ of a dynamical system when only noise corrupted measurements of the outputs are available at discrete time n = 1, 2, 3, .... This problem is close to the classical linear regression and prediction problem with noise corrupted measurements, see the Kalman filter with whitened residuals (https://arc.aiaa.org/doi/abs/10.2514/2.4574?journalCode=jgcd). In the scalar case the predictor model, linear in the parameters, is assumed in the form y_n = Θ^T g_n, where T stands for the transpose and g_n denotes grad_Θ y, i.e. the gradient of the estimate y with respect to Θ in the n-th step.
The stochastic approximation algorithm family (STA, https://en.wikipedia.org/wiki/Stochastic_approximation) is the heuristic recursive estimation of Θ by a gradient method (https://en.wikipedia.org/wiki/Stochastic_gradient_descent):

 
                                                            Θ_n = Θ_{n-1} + (1/n) (y_{n,measured} − y_n) g_n .
 
To avoid large magnitude gradients the gradient is used in a normalized form (see the ADAGRAD type STA algorithm: https://en.wikipedia.org/wiki/Stochastic_gradient_descent):
 
                                                            Θ_n = Θ_{n-1} + (1/n) (y_{n,measured} − y_n) (Σ_i g_i²)^{−1/2} g_n
 
It is assumed that the inverse exists. It is easy to compute the elements g_i, i = 1, 2, ..., dim g, so the effects of large gradient values are avoided. In what follows the prediction error process is discussed, which goes asymptotically to zero in mean by O(1/n), where O denotes the Ordo (big-O) symbol and 1/n is sometimes considered as the learning rate, which can be accelerated up to n^{−1/2} (https://en.wikipedia.org/wiki/Stochastic_approximation). However, this acceleration can result in bias in the parameters. The prediction error y_{n,measured} − y_n converges at a rate between n^{−1} and n^{−1/2} (https://arxiv.org/abs/1805.08114) to the system noise process.
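
For illustration, a minimal Python sketch of the normalized recursive update above is given; the linear-in-parameters predictor y_n = Θ^T g_n, the 1/n step size and the element-wise gradient normalization follow the formulas of this section, while the simulated true parameters, the noise level and the regressor generation are assumptions made only for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    dim = 3
    theta_true = np.array([1.0, -0.5, 0.25])   # assumed "true" parameters, for the example only
    theta = np.zeros(dim)                      # initial estimate of Θ
    grad_sq_sum = np.zeros(dim)                # running sums Σ g_i² for the normalization

    for n in range(1, 2001):
        g = rng.normal(size=dim)                       # gradient of the linear predictor (regressor)
        y_meas = theta_true @ g + 0.1 * rng.normal()   # noise corrupted measurement
        y_pred = theta @ g                             # predictor y_n = Θ^T g_n
        err = y_meas - y_pred                          # prediction error

        grad_sq_sum += g**2                            # accumulate squared gradient elements
        norm = np.sqrt(grad_sq_sum) + 1e-12            # small constant so the inverse exists
        theta = theta + (1.0 / n) * err * g / norm     # normalized (ADAGRAD-like) STA step

    print(theta_true, theta)                           # the estimate should approach theta_true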
 
THE PREDICTION ERROR PROCESS WITH ZERO MEAN
The system, estimation and measurement noise processes are assumed to be zero mean white Gaussian noise processes uncorrelated with the predicted process. The assumed noise models determine the parameter estimation method as well. A short sample (≤ 20 steps) mean could serve as a device to control the mean and the signs of the prediction error process. The symmetry of the signs could be ensured by a scalar c_n > 0 to be updated in every step.
The original step size is modified by the scalar c_n, subject to the O(1/n) decrease of the step size and to a symmetric-sign criterion on the prediction errors: equal numbers of + and − signs in a moving window and an error process decreasing by O(1/n). A moving window of fixed width (16-20 steps) is applied, so the average of an indicator value and the symmetry of the signs must be discussed (a sketch of the window-based sign check is given below). First the average length of the sign runs is determined in the truly random case.
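
The following is a minimal Python sketch of such a window-based sign check, assuming a window width of 20; the exact update rule for c_n is not specified here, so the multiplicative adjustment, the factor and the tolerance in this sketch are hypothetical.

    from collections import deque

    def update_scale(c, window, err, width=20, factor=1.05):
        """Track the +/- signs of the last `width` prediction errors and adjust
        the step-size scalar c when the signs are not balanced 50/50.
        The direction and size of the adjustment are assumptions of this sketch."""
        window.append(1 if err >= 0 else -1)
        if len(window) > width:
            window.popleft()
        if len(window) == width:
            plus = sum(1 for s in window if s > 0)
            if abs(plus - width // 2) > 2:   # clearly asymmetric window of signs
                c *= factor                  # assumed: enlarge c so the error changes sign sooner
            else:
                c += 0.1 * (1.0 - c)         # assumed: relax c toward 1 when the window is balanced
        return c

    # usage with the recursion above: theta = theta + (c / n) * err * g / norm
    c, window = 1.0, deque()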
 
In base b the probability of occurrence of a run (subsequence) of length k – with equal frequencies 1/b of the symbols in independent trials – is P(X = k) when any pattern of length k is considered in the (countably) infinite case:

                                                                  P(X = k) = p (1 − p)^{k−1} = ((b−1)/b) (1/b)^{k−1},   for k = 1, 2, 3, ..., ∞,

where the parameter p = (b−1)/b is equal to the probability of singles. Any length k pattern has the same probability P(X = k). The expected value of the random variable X is equal to b/(b−1), the reciprocal of the parameter p = (b−1)/b of this geometric distribution (https://en.wikipedia.org/wiki/Geometric_distribution), which has maximum entropy among distributions on the positive integers with the given mean. The variance is equal to b/(b−1)². Thus in the case of b = 2 the expected value is equal to 2.

Comment: restricted to the sequences, i.e. to run lengths k = 2, 3, 4, ..., the conditional expected value is equal to (2b−1)/(b−1) = 3 when b = 2.
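
These run-length statistics can be checked by a short simulation; the Python sketch below generates a symmetric +/- sign sequence (b = 2) and estimates the mean and variance of the run lengths, the conditional mean of the sequences and the fraction of singles. The sample size is an arbitrary choice for the example.

    import numpy as np

    rng = np.random.default_rng(1)

    # Symmetric +/- signs (b = 2): the geometric law above gives p = 1/2,
    # E[X] = 2, Var[X] = 2, E[X | X >= 2] = 3 and P(X = 1) = 1/2.
    signs = rng.choice([-1, 1], size=1_000_000)
    flips = np.flatnonzero(np.diff(signs) != 0)                       # indices where the sign changes
    runs = np.diff(np.concatenate(([-1], flips, [signs.size - 1])))   # lengths of the constant-sign runs

    print(runs.mean())              # ~2.0  mean run length, b/(b-1)
    print(runs.var())               # ~2.0  variance, b/(b-1)^2
    print(runs[runs >= 2].mean())   # ~3.0  mean length of the "sequences", (2b-1)/(b-1)
    print((runs == 1).mean())       # ~0.5  fraction of singles among the runs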

For a finite k_max the sum (b − 1) Σ_k b^{−k} over the sequence lengths k = 2, 3, ..., k_max takes the form

                      (b − 1) Σ_{k=2}^{k_max} b^{−k} = 1/b − b^{−k_max} ≈ 1/2, for k_max ≈ 20 and b = 2.
 
It follows that a window of nearly 20 steps width consists, on average, of 50% singles and 50% sequences. Based on this criterion the scalar c_n is modified to get symmetric signs in the window of 20.
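
A quick numerical check of the finite sum above (b = 2, k_max = 20), purely for illustration:

    b, k_max = 2, 20

    # Probability mass of the sequences (run lengths k = 2, ..., k_max)
    partial_sum = (b - 1) * sum(b ** -k for k in range(2, k_max + 1))
    closed_form = 1 / b - b ** -k_max

    print(partial_sum)    # 0.4999990463256836  -> ~50% sequences
    print(closed_form)    # identical closed form 1/b - b^(-k_max)
    print((b - 1) / b)    # 0.5 = probability of singles, so ~50% singles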
Another way to compute the scalar c_n: in the optimal case – this is a guess which has not been proved – the scalar c_n is modified in such a way that the absolute value of (y_{n,measured} − y_n) decreases at a rate between 1/n and 1/n^{1/2} and its sign changes in every step.
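
One possible reading of this guess, sketched below with hypothetical adjustment directions and factors (none of which are specified above): keep |y_{n,measured} − y_n| inside the band [1/n, 1/√n] and require the sign to alternate from step to step.

    import math

    def update_scale_alt(c, err, prev_err, n, factor=1.05):
        """Hypothetical rule for the 'optimal case' guess: keep |err| between
        roughly 1/n and 1/sqrt(n) and let its sign alternate every step."""
        lo, hi = 1.0 / n, 1.0 / math.sqrt(n)
        if abs(err) > hi or (prev_err is not None and err * prev_err > 0):
            c *= factor      # error too large, or sign did not change: assumed larger correction
        elif abs(err) < lo:
            c /= factor      # error already below the 1/n floor: assumed smaller correction
        return c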
  