How might the use of Gaussian distributions play tricks on you



· Why and how Gaussian distributions can model several different systems, from science to finance;
· A discussion of the validity criteria that set the boundaries for the use of Gaussian distributions.

It is widely known that one of the most important concepts in statistical analysis is the Gaussian, or Normal, probability distribution. The equation that describes the probability density of the Gaussian distribution was chosen by the mathematician Ian Stewart as one of the most remarkable equations in history, together with the renowned $E=mc^2$ from the theory of relativity and the Black-Scholes equation (discussed in this post and in this video) used for the modelling of financial derivatives. One can safely say that most graduate students and professionals in engineering, earth and social sciences, logistics, finance, and business have seen or used the Gaussian distribution at least once in their lives. What makes it an important concept is the fact that it is able to model, at least approximately, several different systems, such as experimental measurement errors, height and weight distributions in a population, call intervals in a call center, the number of cars passing a highway toll, the number of passes in a soccer game, and so on.
However, although people are familiar with the use of the Gaussian distribution, few of them can seriously discuss situations in which it should not be used, or should be used only with caveats. A celebrated example of this limitation concerns the relation between finance models and the crisis of 2007-2008, as discussed by Nassim Taleb in his must-read book The Black Swan. When the dust settled, market players performed a rigorous revision of their mathematical models, also motivated by stricter regulations.
In this context, this post aims to discuss (quite informally) the validity criteria of the Gaussian distribution without using any complicated math, providing simple but constructive examples.
So, let’s take a look at the basic concepts. As you may know, the normalized probability density of the Gaussian distribution is uniquely determined by two parameters, the mean (average) $\mu$ and the standard deviation $\sigma$, by means of:
$p(x)=\frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left\{ -\frac{(x-\mu)^2}{2 \sigma^2} \right\}$
where $x$ is a real value. The plot of this equation is exemplified as follows:
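A minimal Python sketch, offered as an illustration rather than as the author's own code, evaluates this density, prints a rough text rendering of the bell shape, and checks numerically that the area under the curve is close to 1. The function names `gaussian_pdf` and `integrate_pdf`, the integration range, and the parameter values are assumptions chosen only for this example.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Probability density of a Gaussian with mean mu and standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi * sigma ** 2)

def integrate_pdf(mu, sigma, half_width=8.0, steps=20000):
    """Midpoint-rule integral of the density over mu +/- half_width * sigma."""
    a = mu - half_width * sigma
    dx = 2.0 * half_width * sigma / steps
    return sum(gaussian_pdf(a + (i + 0.5) * dx, mu, sigma) for i in range(steps)) * dx

# Rough text plot of the standard bell curve (mu = 0, sigma = 1).
for i in range(-6, 7):
    x = i * 0.5
    print(f"{x:5.1f} " + "#" * round(100 * gaussian_pdf(x)))

peak = gaussian_pdf(0.0)        # height of the standard curve at its center
area = integrate_pdf(1.5, 2.0)  # illustrative mu and sigma; should be close to 1
print(f"peak = {peak:.4f}, area = {area:.6f}")
```

Changing $\mu$ shifts the curve along the horizontal axis, while a larger $\sigma$ makes it wider and lower, since the total area stays equal to 1.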

It is important to understand why the Gaussian distribution is able to fit such a wide range of systems. Formally, the answer lies in the Central Limit Theorem. From a practical (and informal) point of view, we can say that if the values of a system come from the collective response (roughly, the sum) of a large number of independent random variables, then it is very likely that this system obeys a Normal distribution. In practice, these variables usually denote complicated or even inaccessible mechanisms which are simply encompassed in the final value that quantifies the system state. So, the statistical nature of the system comes from the collective effect of these variables randomly switching on and off.
Two illustrative examples shed some light on the general idea:
1. Consider a survey of the heights of a large population of kids of around the same age. A graph of the result, with the heights on the horizontal axis and the number of kids on the vertical one, tends to a bell curve, such as the ones illustrated above. In this example, the collective response comes from many factors: the genes of the parents, nutrition (and the details behind it), specific medicines, exercise (such as body stretching), and so on.
2. Consider a game of heads or tails using, let's say, twenty coins at the same time, and suppose the twenty coins are flipped together at least twenty times. With exactly twenty rounds, this gives 400 individual coin outcomes. As you can imagine, if you organize the number of heads (or tails) per round in a histogram, the corresponding fitted curve tends to the Normal distribution, and the match improves as the number of coins and flips increases. In this example, the collective variables of the system can be the force of the flip, the position of the coins before flipping, the trajectory of the hands, the wind velocity, among others.
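The coin example is easy to try on a computer. The sketch below is a simulation under stated assumptions, not a description of a real experiment: the coins are assumed fair and independent, the helper `heads_per_round` is a name made up for this example, and the number of rounds is raised to 20000 (instead of twenty) so that the histogram is smooth enough to read.

```python
import random
import statistics

def heads_per_round(n_coins, n_rounds, rng):
    """Flip n_coins fair coins in each round and return the head count per round."""
    return [sum(rng.random() < 0.5 for _ in range(n_coins)) for _ in range(n_rounds)]

rng = random.Random(42)  # fixed seed so the run is repeatable
n_coins, n_rounds = 20, 20000
counts = heads_per_round(n_coins, n_rounds, rng)

mean = statistics.fmean(counts)
std = statistics.pstdev(counts)

# For fair coins the theoretical values are n/2 and sqrt(n)/2.
print(f"mean heads: {mean:.2f} (theory: {n_coins / 2})")
print(f"std dev:    {std:.2f} (theory: {(n_coins ** 0.5) / 2:.2f})")

# Crude text histogram: the bars trace out the bell shape.
for k in range(n_coins + 1):
    bar = "#" * (counts.count(k) * 200 // n_rounds)
    print(f"{k:2d} {bar}")
```

With only twenty rounds the histogram looks ragged; the bell shape emerges as the number of rounds grows, and it becomes smoother as more coins are used.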
If one wants to determine whether a system meets the requirements for the Gaussian distribution, one should think in terms of the statistical behavior of the several variables that compose the system. In principle, the variance of any single inner variable must be much smaller than the overall variance of the system. In simpler terms, the variation of a single variable should not be responsible for drastic changes in the collective effect. The formal statement of this requirement regarding the variances is given by Lindeberg's condition.
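For reference, one common textbook formulation of this condition is sketched here. The notation is an assumption introduced only for this statement: $X_1, \dots, X_n$ are the independent inner variables, with means $\mu_k$ and variances $\sigma_k^2$, $s_n^2=\sum_{k=1}^{n}\sigma_k^2$ is the variance of their sum, and $\mathbf{1}\{\cdot\}$ equals 1 when its argument holds and 0 otherwise. The condition requires that, for every $\varepsilon>0$:
$\lim_{n\to\infty}\frac{1}{s_n^2}\sum_{k=1}^{n} E\left[(X_k-\mu_k)^2 \, \mathbf{1}\{|X_k-\mu_k|>\varepsilon s_n\}\right]=0$
Read informally: the share of the total variance that comes from unusually large deviations of the individual variables must vanish as more variables are added.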
As a rule of thumb, one can think that the Gaussian distribution works for systems with a low probability of surprises, that is, a low likelihood of values very far from the average, measured in units of the standard deviation of the data.
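To put numbers on "low probability of surprises", the short sketch below computes how likely a Gaussian variable is to land more than $k$ standard deviations away from its average. It uses the standard identity $P(|x-\mu|>k\sigma)=\mathrm{erfc}(k/\sqrt{2})$; the function name `two_sided_tail` is made up for this example.

```python
import math

def two_sided_tail(k):
    """P(|x - mu| > k * sigma) for a Gaussian variable x."""
    return math.erfc(k / math.sqrt(2.0))

for k in (1, 2, 3, 4, 5, 6):
    p = two_sided_tail(k)
    print(f"beyond {k} sigma: p = {p:.3e}, about 1 in {1.0 / p:,.0f} observations")
```

Under a Gaussian model, roughly 4.6% of the observations fall beyond two standard deviations and roughly 0.27% beyond three, and the probability keeps collapsing very quickly after that. If a system produces such distant values noticeably more often, the Gaussian description deserves a second look.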
Gaussian distribution curves, a.k.a. bell curves

In practice, it is necessary to be very careful with Lindeberg's condition, because sometimes it is only apparently fulfilled. Because some drastic events have a low probability, it is tempting to treat their occurrences as simple outliers, when in reality they are a possible behavior of the system that must not be excluded; excluding them leads to an incomplete analysis and consequently to poor decisions. Another tricky situation arises when the statistical parameters of the system, $\mu$ and $\sigma$, vary strongly over time.
In the following, I will discuss two simple examples. It is very important to mention that these examples do not represent any absolute truth and are given simply to illustrate the topic. A deeper analysis of the exemplified situations is left as an exercise for the reader.
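The effect of a $\sigma$ that changes over time can be seen in a small simulation. The sketch below is a toy illustration with made-up numbers, not a model of any real system: one series keeps $\sigma=1$ throughout, the other uses $\sigma=1$ for its first half and $\sigma=3$ for its second half. The helper `excess_kurtosis` is a name chosen for this example; it returns a value close to 0 for Gaussian data and a positive value when extreme values are more frequent than a Gaussian would allow.

```python
import random
import statistics

def excess_kurtosis(data):
    """Sample excess kurtosis: about 0 for Gaussian data, positive for fat tails."""
    m = statistics.fmean(data)
    var = statistics.fmean((x - m) ** 2 for x in data)
    fourth = statistics.fmean((x - m) ** 4 for x in data)
    return fourth / var ** 2 - 3.0

rng = random.Random(7)  # fixed seed so the run is repeatable
n = 100_000

# Series A: sigma stays fixed at 1.
steady = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Series B: sigma is 1 for the first half of the series and 3 for the second half.
shifting = [rng.gauss(0.0, 1.0 if i < n // 2 else 3.0) for i in range(n)]

k_steady = excess_kurtosis(steady)
k_shifting = excess_kurtosis(shifting)
print(f"excess kurtosis, fixed sigma:    {k_steady:.2f}")
print(f"excess kurtosis, shifting sigma: {k_shifting:.2f}")
```

Every individual value in the second series is drawn from a Gaussian, yet the series as a whole is not Gaussian: fitting a single $\mu$ and $\sigma$ to it understates how often large values occur.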

  1. Suppose, for instance, that a company specialized in retail sales, like a department store or a supermarket, wants to optimize its product stock. One of the variables the company is interested in controlling could be the interval between sales of a given set of products. In retail, it is very likely that the sales of a given product are driven by a large number of customers, so they may be modeled in terms of Gaussian distributions, although seasonality might apply. For example, fewer coats are sold during the summer.
  2. The challenge in trading financial assets is to determine whether the current price is low or high, which is done in essence by analyzing past prices and/or the implied market expectation for a given period ahead. Considering the dynamic nature of prices, a wise and (why not) easy approach to this analysis relies on evaluating the probabilities and expected values that come from the collective sense of the market. In fact, financial markets work under the "forces" of thousands or even millions of bids and asks, which randomly drive prices and indexes in any direction. One should notice, however, that a few of these forces are able to influence the system much more substantially than others, such as the high-volume operations conducted by governments in the forex market to control exchange rates. This shows that Lindeberg's condition is not satisfied at all times in the financial market. The tricky situation appears because the effect of the stronger players is not felt all the time; in addition, one cannot say that this group acts in an organized way, which diminishes its influence and creates periods for which the use of Gaussian curves is reasonable. The problem is that the Gaussian distribution underestimates the probability of significant market movements, which in some markets cannot simply be dismissed as outliers. It is worth remembering that the Black-Scholes formula relies on the assumption that the logarithmic returns of assets follow a Gaussian distribution.
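The role of a few dominant players can be illustrated with a toy simulation. Everything in the sketch below is hypothetical and chosen only for illustration: each "price move" is the sum of 50 small independent orders, and in about 1% of the periods a single large order of size 30 is added on top. The function `simulate_moves` and all parameter values are assumptions of this example, not a calibrated market model. The script then fits $\mu$ and $\sigma$ to the data and compares the number of moves beyond four standard deviations with what a Gaussian of the same $\mu$ and $\sigma$ would predict.

```python
import math
import random
import statistics

def simulate_moves(n_periods, n_small, shock_prob, shock_size, rng):
    """Each move is the sum of many small orders plus, rarely, one large order."""
    moves = []
    for _ in range(n_periods):
        move = sum(rng.uniform(-1.0, 1.0) for _ in range(n_small))
        if rng.random() < shock_prob:
            # Rare dominant player: one order far larger than all the others.
            move += rng.choice((-1.0, 1.0)) * shock_size
        moves.append(move)
    return moves

rng = random.Random(2008)  # fixed seed so the run is repeatable
moves = simulate_moves(n_periods=20000, n_small=50,
                       shock_prob=0.01, shock_size=30.0, rng=rng)

mu = statistics.fmean(moves)
sigma = statistics.pstdev(moves)

# Count the moves beyond 4 sigma and compare with the Gaussian prediction.
observed = sum(abs(m - mu) > 4.0 * sigma for m in moves)
expected_if_gaussian = len(moves) * math.erfc(4.0 / math.sqrt(2.0))

print(f"moves beyond 4 sigma, observed:              {observed}")
print(f"moves beyond 4 sigma, Gaussian fit predicts: {expected_if_gaussian:.2f}")
```

In this toy setting the large order alone carries a substantial share of the total variance, which is exactly the situation Lindeberg's condition rules out, and the fitted Gaussian underestimates the number of extreme moves by a wide margin.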

I hope you enjoyed the post. Leave your comments and share with your friends!

May the Force be with you.

#=================================
Diogo de Moura Pedroso
LinkedIn: www.linkedin.com/in/diogomourapedroso
