Interesting Probability Problem for Serious Geeks: Self-correcting Random Walks
This is another off-the-beaten-path problem, one that you won’t find in textbooks. You can solve it using data science methods (my approach) but the mathematician with some spare time could find an elegant solution. Share it with your colleagues to see how math-savvy they are, or with your students. I was able to make substantial progress in 1-2 hours of work using Excel alone, thought I haven’t found a final solution yet (maybe you will.) My Excel spreadsheet with all computations is accessible from this article. You don’t need a deep statistical background to quickly discover some fun and interesting results playing with this stuff. Computer scientists, software engineers, quants, BI and analytic professionals from beginners to veterans, will also be able to enjoy it!
We are dealing with a stochastic process barely more complicated than a random walk. Random walks are also called drunken walks, as they represent the path of a drunken guy moving left and right seemingly randomly, and getting lost over time. Here the process is a self-correcting random walk, also called controlled random walk, in the sense that the walker, less drunk than in a random walk, is able to correct any departure from a straight path, more and more over time, by either slightly over- or under-correcting at each step. One of the model parameter (the positive parameter a) represents how drunk the walker is, with a = 0 being the worst. Unless a = 0, the amplitude of the corrections decreases over time to the point that eventually (after many steps) the walker walks almost straight and arrives at his destination. This model represents many physical processes, for instance the behavior of a stock market somewhat controlled by a government to avoid bubbles and implosions, and it is defined as follows:
Let’s start with X(1) = 0, and define X(k) recursively as follows, for k > 1:
and let’s define U(k), Z(k), and Z as follows:
where the V(k)’s are deviates from independent uniform variables on [0, 1], obtained for instance using the function RAND in Excel. So there are two positive parameters in this problem, a and b, and U(k) is always between 0 and 1. When b = 1, the U(k)’s are just standard uniform deviates, and if b = 0, then U(k) = 1. The case a = b = 0 is degenerate and should be ignored. The case a > 0 and b = 0 is of special interest, and it is a number theory problem in itself. Also, just like in random walks or Markov chains, the X(k)’s are not independent; they are indeed highly auto-correlated.
Prove that if a > 0, then X(k) rapidly converges to 0 as k increases. Also prove that the limiting distribution Z
- always exists,
- always takes values between -1 and +1, with min(Z) = -1 and max(Z) = +1,
- is symmetric, with mean and median equal to 0
- and does not depend on a, but only on b (if b not too close to 0)
For instance, for b =1, even a = 0 yields the same triangular distribution for Z, as any a > 0.
If a > 0 and b = 0, (the non-stochastic case) prove that
- Z can only take 3 values: -1 with probability 0.25, +1 with probability 0.25, and 0 with probability 0.50
- If U(k) and U(k+1) have the same sign,then U(k+2) is of opposite sign
And here is a more challenging question: In general, what is the limiting distribution of Z? Also, what happens if you replace the U(k)’s with (say) Gaussian deviates? Or with U(k) = | sin (k*k) | which has a somewhat random behavior?
Hints to solve this problem
It is necessary to use a decent random number generator to solve this problem. With simulations done in Excel, plotting the empirical distribution of Z(k) for large values of k, and matching the kurtosis, variance and empirical percentiles with those of known statistical distributions, one quickly notices that when b = 1 (and even if a = 0) the limiting distribution Z is well approximated by a symmetric triangular distribution on [-1, 1], and thus centered on 0, with a kurtosis of -3/5 and and a variance of 1/6. In short, this is the distribution of the difference of two uniform random variables on [0, 1]. In other words, it is the distribution of U(3) – U(2). Of course, this needs to be proved rigorously. Note that the limiting distribution Z can be estimated by computing the values Z(n+1), …, Z(n+m) for large values of n and m, using just one instance of this simulated stochastic process.
Does this result generalize to other values of b? That is, does Z always have the distribution of U(3) – U(2)? Obviously not for the case b = 0, and probably not for small values of b such as b = 0.01.
Figure 1: Mixture-like distribution of Z (estimated) when b = 0.01 and a = 0.8
Interestingly, for small values of b, the limiting distribution Z looks like a mixture of (barely overlapping) simple distributions. So it could be used as a statistical model in clustering problems, each component of the mixture representing a cluster. See Figure 1.
Figure 2: Triangular distribution of Z (estimated) when b = 1 and a = 0.8
The spreadsheet with all computations and model fitting can be downloaded here..
- Services: Hire a Data Scientist | Search DSC | Classifieds | Find a Job
- Contributors: Post a Blog | Ask a Question
- Follow us: @DataScienceCtrl | @AnalyticBridge
- Difference between Machine Learning, Data Science, AI, Deep Learning, and Statistics
- What is Data Science? 24 Fundamental Articles Answering This Question
- Hitchhiker’s Guide to Data Science, Machine Learning, R, Python
- Advanced Machine Learning with Basic Excel