Modeling seasonality
- Statistical modeling of seasonal variables
- Logistic function and Fourier series
Hey all!
Many random variables show some kind of seasonality. For a simple example, think about the number of jackets sold over the year. For a more complex example, think about the amount of coal consumed by thermal power stations, let's say in Germany, over the year. These examples have a yearly frequency, but other variables might have time-dependent levels with different or even combined frequencies. In this post, I'll present a statistical model for dealing with these seasonalities, providing probability distributions and points of regime transition. This post was inspired by an amazing paper on applied statistics published by Lima and Lall in 2009, which is available open access.
Periodicity...
The starting point of the model is the logistic function:
$p(t)=\frac{1}{1+\exp \{ -x(t) \}}$
This function maps the variable $x \in \Re$ to a unique value within the interval $(0,1)$. The limits 0 and 1 of this interval are approached as $x \to - \infty$ and $x \to \infty$, respectively. In the present model, the variable $x$ is modeled as:
$x(t)=\sum_{n} \left[ a_{n}+b_{n}\cos(\omega_n t)+c_{n}\sin(\omega_n t) \right]$
which is simply a Fourier series. For the sake of simplicity, we neglect any contributions in the imaginary space, which is quite reasonable. Therefore, $a, b, c \in \Re$. The index $n$ labels the components of frequency $\omega_n$ included in the model. The time $t$ can be either discrete or continuous.
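To make the two formulas above concrete, here is a minimal sketch that evaluates $p(t)$ for given coefficients. The function names and the coefficient values are made up for illustration; only the formulas come from the text.

```python
import math

def fourier_x(t, a, b, c, omegas):
    """x(t) = sum over n of [a_n + b_n*cos(w_n*t) + c_n*sin(w_n*t)]."""
    return sum(a_n + b_n * math.cos(w * t) + c_n * math.sin(w * t)
               for a_n, b_n, c_n, w in zip(a, b, c, omegas))

def logistic(x):
    """p = 1 / (1 + exp(-x)), written so that large |x| does not overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def seasonal_probability(t, a, b, c, omegas):
    """Probability of the event at time t under the logistic-Fourier model."""
    return logistic(fourier_x(t, a, b, c, omegas))

# Illustrative (made-up) coefficients for a single yearly component.
omega = 2 * math.pi / 365
p_day_1 = seasonal_probability(1, [0.0], [1.5], [0.5], [omega])
p_day_180 = seasonal_probability(180, [0.0], [1.5], [0.5], [omega])
print(p_day_1, p_day_180)
```

With these made-up coefficients the probability is high at the start of the year and low around mid-year, and it repeats every 365 days.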
Hands on
In order to calibrate the model, that is, to fit $a, b, c$, we need to provide adequate training data. For didactic purposes, I'm going to explain this process with an example from climatology, similar to the one used in the paper. In principle, only binary events are assumed. For instance, it may or may not rain today. If it rains, we set $p$ to 1 for today; if it does not, we set it to 0. This is done successively for each day in the training data. Of course, the classification need not be so all-or-nothing: one can define a tolerance above which the probability $p$ is set to 1. Note that assigning the values 0 or 1 is conceptually sound, since they refer to realized observations. Hence, the objective is to fit $a, b, c$ so as to maximize the likelihood of the realized observations. The model is then intended to predict the probabilities for the next period.
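The labeling step and the quantity being maximized can be sketched as follows. This is a minimal illustration with made-up rainfall values and hypothetical function names; it assumes the days are treated as independent Bernoulli trials, which is my reading of the description above rather than something spelled out in the post.

```python
import math

def label_events(rain_mm, tolerance=0.0):
    """Return 1 for days whose measurement exceeds the tolerance, else 0."""
    return [1 if r > tolerance else 0 for r in rain_mm]

def log_likelihood(y, p):
    """Bernoulli log-likelihood: sum of y*log(p) + (1-y)*log(1-p)."""
    eps = 1e-12  # guards against log(0)
    return sum(yi * math.log(max(pi, eps)) + (1 - yi) * math.log(max(1.0 - pi, eps))
               for yi, pi in zip(y, p))

rain_mm = [0.0, 12.4, 0.3, 0.0, 5.1]           # made-up daily totals
y = label_events(rain_mm)                      # any rain counts as an event
y_tol = label_events(rain_mm, tolerance=1.0)   # only days above 1.0 count

# Probabilities that agree with the labels score higher than a coin flip.
good = log_likelihood(y, [0.1, 0.9, 0.9, 0.1, 0.9])
flat = log_likelihood(y, [0.5] * 5)
print(y, y_tol, good, flat)
```

Calibration then amounts to searching for the $a, b, c$ whose probabilities $p(t)$ give the largest value of this log-likelihood.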
The following code (in Python) performs the calibration described above:
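As a stand-in, here is a minimal sketch of such a calibration for a single yearly component; it is not necessarily the code used to produce the results below. It maximizes the Bernoulli log-likelihood with Newton's method (my choice of optimizer, not taken from the paper), and it runs on synthetic data drawn from known parameters instead of the station record. All names and the values in `true_theta` are hypothetical.

```python
import math
import random

OMEGA = 2 * math.pi / 365  # yearly frequency

def features(t):
    # With one component, x(t) = a + b*cos(w*t) + c*sin(w*t)
    return (1.0, math.cos(OMEGA * t), math.sin(OMEGA * t))

def prob(theta, t):
    x = sum(th * f for th, f in zip(theta, features(t)))
    return 1.0 / (1.0 + math.exp(-x))

def solve3(A, g):
    """Solve the 3x3 system A d = g by Gaussian elimination with pivoting."""
    M = [row[:] + [gi] for row, gi in zip(A, g)]
    for i in range(3):
        piv = max(range(i, 3), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(i + 1, 3):
            f = M[r][i] / M[i][i]
            for k in range(i, 4):
                M[r][k] -= f * M[i][k]
    d = [0.0] * 3
    for i in (2, 1, 0):
        d[i] = (M[i][3] - sum(M[i][k] * d[k] for k in range(i + 1, 3))) / M[i][i]
    return d

def fit(days, y, iterations=15):
    """Fit (a, b, c) by Newton steps on the Bernoulli log-likelihood."""
    theta = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        grad = [0.0] * 3                      # gradient of the log-likelihood
        hess = [[0.0] * 3 for _ in range(3)]  # negative Hessian
        for t, yi in zip(days, y):
            f = features(t)
            p = prob(theta, t)
            w = p * (1.0 - p)
            for i in range(3):
                grad[i] += (yi - p) * f[i]
                for j in range(3):
                    hess[i][j] += w * f[i] * f[j]
        step = solve3(hess, grad)
        theta = [th + s for th, s in zip(theta, step)]
    return theta

# Synthetic stand-in for the training data: 20 "years" of daily 0/1 events.
random.seed(0)
true_theta = [-0.3, 1.8, 0.6]
days = [d for _ in range(20) for d in range(1, 366)]
y = [1 if random.random() < prob(true_theta, d) else 0 for d in days]
a, b, c = fit(days, y)
print(a, b, c)
```

Because the model is linear in $a, b, c$ inside the logistic function, this is an ordinary logistic regression on the features $1$, $\cos(\omega t)$ and $\sin(\omega t)$, so any logistic-regression routine could replace the hand-written Newton loop.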
In order to test this model, I used meteorological data provided by INMET (the Brazilian national meteorological institute) from a weather station located in Cuiabá, Brazil. The data were collected from 1961 to 2016. The figure below presents the results. The x-axis is given in Julian days, that is, January 1st is 1, January 2nd is 2, and so on. The solid blue line denotes the probability distribution obtained with the parameters $a, b, c$ fitted to the whole data set except the year 2016, which is used for comparison. Considering a yearly frequency, we use a single term in the Fourier series ($n=1$) and set the frequency component to $\omega = 2 \pi / 365$.
The model described by Lima and Lall in 2009 takes the inflection points of the probability distribution as the regime transitions from a rainy to a dry season and vice versa. These points are indicated by the vertical lines in the figure below. The small black circles indicate the observations of rain during 2016. No tolerance has been considered in these results, that is, any rain measurement higher than zero is enough to set $p=1$ on that day. The dashed black line is the probability distribution fitted to the data of 2016.
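One way to locate such inflection points is to look for sign changes in the second derivative of $p(t)$ over the year. The sketch below does this numerically with a central finite difference; the parameter values are made up, the function names are hypothetical, and the numerical approach is my own shortcut, not necessarily how the paper computes the transitions.

```python
import math

OMEGA = 2 * math.pi / 365  # yearly frequency

def prob(theta, t):
    a, b, c = theta
    x = a + b * math.cos(OMEGA * t) + c * math.sin(OMEGA * t)
    return 1.0 / (1.0 + math.exp(-x))

def inflection_days(theta, h=0.5):
    """Days in 1..365 near which the second derivative of p(t) changes sign."""
    def second_diff(t):
        # Proportional to a central finite-difference estimate of p''(t)
        return prob(theta, t + h) - 2.0 * prob(theta, t) + prob(theta, t - h)
    found = []
    prev = second_diff(1)
    for day in range(2, 367):
        cur = second_diff(day)
        if prev * cur < 0:
            # Keep whichever neighbouring day is closer to the zero crossing
            found.append(day - 1 if abs(prev) < abs(cur) else day)
        prev = cur
    return found

theta = [-0.3, 1.8, 0.6]  # made-up parameters, not the fitted ones
transitions = inflection_days(theta)
print(transitions)
```

For these made-up parameters the scan returns two days, one on each side of the probability peak, which play the role of the vertical lines described above.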
The probability distribution curve fitted to the period from 1961 to 2015 predicts the pattern observed in 2016 very well. In addition, one can observe that the highest concentration of rain occurs within the vertical lines proposed by the model. Nevertheless, there is a small shift of the peak of the dashed line to the right with respect to the solid line, which suggests a rainy season in 2016 slightly longer than the one fitted to the previous period.
But honestly, buddy, this model is too simple... why this post?
Yes, indeed. Even when considering the theoretical machinery that I have skipped here, the model takes a simple approach. This is its beauty. Never forget Occam's razor. There are several possible studies to which this model can be applied. First, I should mention that a researcher can determine the regime transitions on a scale of days by tuning only a tolerance parameter (as mentioned before, I set it to zero in my example). One can study, for instance, shifts of the seasons over a period of observation. The duration of a regime can also be studied. It is worth mentioning that one can use this model to evaluate other variables such as commodity demand, sales, and crime distribution, among others. The creativity of the quant is the limit.
I hope you enjoyed the post! Leave your comments and share!
May the Force be with you.
#=================================
Diogo de Moura Pedroso
LinkedIn: www.linkedin.com/in/diogomourapedroso