Walking through regularization

  • Ill-posed and ill-conditioned problems;
  • Concepts and applications of regularization;
  • Calculation of the regularization weight.

Hello, all!!
In the previous post, I discussed some concepts of PCA (Principal Component Analysis), one of the main analytical tools that quantitative analysts, data scientists, statisticians, and other professionals related to modelling should master. The topic today is regularization, another mainstream technique that is quite important in the calibration of models. Regularization is a technique that leads an algorithm to choose simpler models when fitting a data set, thereby enforcing the idea of Occam's razor.
Let's first discuss some introductory concepts.

Ill-posed and ill-conditioned problems 

[Figure: an ill-posed problem, with two different curves fitting the same data set]
The expression of a (quantitative) model comprises parameters and variables that intend to reproduce the set of characteristics of the system being modeled. The calibration, or fitting, of a model to a given data set is the process of attributing suitable values to the parameters of the model, in such a way that the model is able to reproduce the values of the data set within a small enough margin of error. In this context, an ill-posed problem is one for which the suitable parametrization is not unique, that is, different sets of values may be used to reproduce the data set. The figure above illustrates an ill-posed problem: both the blue and the green curves reproduce the values of the data set with small margins of error.
Some problems are commonly ill-posed. For instance, when we do not know beforehand which variables should really compose a suitable model (for instance, a linear model), we may end up using many more variables than necessary, resulting in overfitting and non-unique solutions. Overfitting refers to the situation in which the model contains more variables than the data set can justify. As an example in quantitative finance, the use of Dupire's equation for the calibration of volatility surfaces is highly ill-posed (more on the calibration of volatility surfaces using optimization engines in this previous post). Dupire's equation will be discussed in a future post.
Some problems are ill-conditioned. Conceptually, ill-conditioned refers to problems that are analytically well-posed but whose numerical discretization is so sensitive to small perturbations that, in practice, a non-unique set of solutions is obtained. This is particularly common in inverse problems.
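To make the idea of ill-conditioning concrete, here is a minimal sketch in Python (using NumPy, with a made-up design matrix) of a least-squares problem whose columns are nearly collinear: a tiny perturbation of the data changes the fitted parameters dramatically, even though the problem is formally well-posed.

```python
import numpy as np

# Minimal illustration of ill-conditioning: a least-squares problem whose
# design matrix has two nearly collinear columns. A tiny perturbation of
# the data produces a very different parameter vector.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
A = np.column_stack([x, x + 1e-6 * rng.standard_normal(50)])  # almost identical columns
beta_true = np.array([1.0, 2.0])
b = A @ beta_true

beta_clean, *_ = np.linalg.lstsq(A, b, rcond=None)
beta_perturbed, *_ = np.linalg.lstsq(A, b + 1e-6 * rng.standard_normal(50), rcond=None)

print("condition number of A:", np.linalg.cond(A))
print("parameters (clean data):    ", beta_clean)
print("parameters (perturbed data):", beta_perturbed)
```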
As mentioned before, regularization intends to enforce simpler solutions when calibrating or fitting a model against a data set. Therefore, the analyst should consider the different aspects of the problem before neglecting the use of regularization.

First steps

Consider the following objective function (for some minimization problem, for instance):
$OF = \sum_j[Y(x_j) - \sum_i \beta_i g_i(x_j)]^2 + \lambda \sum_i |\beta_i|^2$
where $Y(x_j)$ denotes the observed (or experimental) data set, $g_i(x)$ are the basis functions, and $\beta_i$ are coefficients. The last term on the right-hand side is the regularization (in this case an L2-regularization, known as Tikhonov regularization), where the parameter $\lambda$ is a real number that sets the weight of the regularization in the calibration process.
Let's take a closer look at the expression above. The first part simply penalizes the fitting errors of the model with respect to the data set, using the squared modulus of the differences. The second part contains nothing related to the data set or to the model output, only the parametrization itself. In principle, the interest lies in the model whose vector of parameters has the lowest possible quadratic norm.
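As a simple sketch of this objective function, the snippet below (Python/NumPy, with a hypothetical polynomial basis and synthetic data) assembles the design matrix from the basis functions $g_i$ and evaluates both the fitting term and the L2 penalty:

```python
import numpy as np

def objective(beta, x, y, basis, lam):
    """Tikhonov (L2) regularized objective:
    sum_j [y_j - sum_i beta_i * g_i(x_j)]^2 + lam * sum_i |beta_i|^2."""
    G = np.column_stack([g(x) for g in basis])  # design matrix, G[j, i] = g_i(x_j)
    residual = y - G @ beta
    return np.sum(residual**2) + lam * np.sum(beta**2)

# hypothetical polynomial basis and noisy synthetic data
basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x**2]
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x + 0.05 * np.random.default_rng(0).standard_normal(20)

print(objective(np.array([1.0, 2.0, 0.0]), x, y, basis, lam=0.1))
```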
An overview of the role of regularization can be visualized in the figures below:
[Figure: regularized fit (left) and overfitted fit (right) of the same data set]
Both figures present the same data set (green circles) with the model indicated by the pink line. On the right side, one observes a noisier line that passes through all the experimental points. This model is clearly overfitted: it reproduces the data set almost exactly but has no predictive power. To check for overfitting, one can verify the relative stability of the parameters when some points of the data set are removed or the data set is perturbed slightly; in case of overfitting, the parameters will change considerably. The figure on the left side shows a much more stable model, obtained by applying regularization. With regularization, the optimization algorithm gives preference to models whose vector of parameters has a lower norm, which may result in many null parameters, mainly in linear models. The model on the left side is considerably more predictive than the other one, since its parameters are more stable with respect to changes in the data set.
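The stability check described above can be sketched as follows: fit the same (deliberately high-degree) polynomial model with and without an L2 penalty, refit on a slightly perturbed copy of the data, and compare how far the coefficient vectors move. The data, degree, and values of $\lambda$ are made up for illustration.

```python
import numpy as np

# Stability check sketch: the unregularized high-degree fit typically shows a
# much larger coefficient shift under a small perturbation of the data.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 15)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)
y_perturbed = y + 0.05 * rng.standard_normal(y.size)

def fit(x, y, degree, lam):
    # ridge fit written as an augmented least-squares system:
    # minimize ||G beta - y||^2 + lam * ||beta||^2
    G = np.vander(x, degree + 1, increasing=True)
    G_aug = np.vstack([G, np.sqrt(lam) * np.eye(degree + 1)])
    y_aug = np.concatenate([y, np.zeros(degree + 1)])
    return np.linalg.lstsq(G_aug, y_aug, rcond=None)[0]

for lam in (0.0, 1e-2):
    shift = np.linalg.norm(fit(x, y, 10, lam) - fit(x, y_perturbed, 10, lam))
    print(f"lambda={lam:.0e}  coefficient shift = {shift:.3f}")
```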
Another point to take into account as a motivation for the use of regularization is the fact that a "pure" fitting process might be strongly influenced by the experimental errors and rounding approximations of the data set. These factors may introduce noise in the solution that can be damped by regularization.

Other paths to Rome

The regularization presented above (Tikhonov regularization) is the most common one, but several others exist. A simple variation with respect to Tikhonov regularization consists of using the first-order norm (L1-regularization) instead of the squared norm, that is:
$OF = \sum_j[Y(x_j) - \sum_i \beta_i g_i(x_j)]^2 + \lambda \sum_i |\beta_i|$
In fact, the discussion on the choice between L1- and L2-regularization is long, mainly in situations with a large number of basis functions. The following papers are interesting: paper1, paper2, and paper3 (this blog may also help, and this blog brings a list of differences for quick consultation). In principle, L1-regularization is better able to perform what is called feature selection, in which some of the parameters $\beta_i$ are set exactly to zero, generating more sparsity in the vector of parameters $\beta$ than L2-regularization does. In principle, L1-regularization has a higher probability of providing simpler models with respect to the number of basis functions considered, and can therefore better handle the overfitting problem. On the other hand, L2-regularization has a higher probability of providing less noisy solutions with higher predictability for problems with a much more complex structure. Several complex problems, such as the pricing of financial derivatives, use L2-regularization. Another important characteristic of L2-regularization is that the regularization term is differentiable, so one can apply gradient methods in the optimization process, which can bring a considerable reduction of computational cost.
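The sparsity effect of L1-regularization can be seen in a small sketch with scikit-learn (synthetic data, hypothetical penalty weights): a Lasso (L1) fit typically zeroes out the irrelevant coefficients, while a Ridge (L2) fit keeps all of them small but non-zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression problem: 20 candidate features, only 3 of them relevant.
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true + 0.1 * rng.standard_normal(100)

l1 = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
l2 = Ridge(alpha=0.1).fit(X, y)   # L2 penalty

print("L1 (Lasso) non-zero coefficients:", int(np.sum(l1.coef_ != 0)))            # typically ~3
print("L2 (Ridge) non-zero coefficients:", int(np.sum(np.abs(l2.coef_) > 1e-8)))  # typically all 20
```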
Another interesting solution is the Elastic Net (EN) regularization in which both L1 and L2 are employed:
$OF = \sum_j[Y(x_j) - \sum_i \beta_i g_i(x_j)]^2 + \lambda \sum_i [(1-\alpha)|\beta_i| + \alpha |\beta_i|^2]$
This solution mixes the characteristics of the two most common regularization methods, at the cost of introducing an additional parameter $\alpha$.
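A quick sketch of an elastic-net fit with scikit-learn is shown below (again with synthetic data). Note that scikit-learn parametrizes the penalty with `alpha` as the overall weight (the $\lambda$ above) and `l1_ratio` as the weight of the L1 term, so `l1_ratio` plays the role of $(1-\alpha)$ in the formula above.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Synthetic data: only the first two of 20 features matter.
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 20))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.standard_normal(100)

# alpha is the overall penalty weight (lambda); l1_ratio weights the L1 part.
en = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("Elastic Net non-zero coefficients:", int(np.sum(en.coef_ != 0)))
```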

Getting the regularization factor

The regularization factor $\lambda$ plays an important role in the regularization process. In principle, it controls the weight of the penalization of the parameters relative to the penalization of the model error itself. In an ideal situation, when the errors with respect to the data set are small enough, the penalization from the regularization should be dominant. On the other hand, when the errors are large enough, the regularization should have a minor impact on the objective function. An ideal value of $\lambda$ must reflect this trade-off.
In practice, the regularization factor is often obtained by some ad hoc method. Here, I will discuss a very interesting graphical method called the L-curve, published by Hansen in 2005. It is worth noticing, however, that several other methods exist. Consider the classic least-squares problem with Tikhonov regularization:
$OF=||AX-B||^2 + \lambda^2 ||X||^2$
The L-curve is constructed with $||AX-B||^2$ on the horizontal axis and $||X||^2$ on the vertical axis, for different values of $\lambda$. The result is illustrated below:
In the figure above, higher values of $\lambda$ lead to oversmoothing, that is, the model does not fit the data. Lower values of $\lambda$ lead to undersmoothing. The appropriate values for $\lambda$ are those near the kink of the curve, which in this example corresponds to $\lambda=0.02$ (value not shown in the figure). Since in the real world (sometimes called "in production") the data set varies over time or over runs, producing distinct L-curves, the value of $\lambda$ at the kink of the curve can serve as an excellent educated guess. Of course, there is a whole theoretical toolbox behind the L-curve that is far beyond the scope of this post. This theory helps, for example, in defining a suitable range of $\lambda$ close to the kink, as well as other properties of the L-curve.
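A minimal sketch of how an L-curve can be built in Python is shown below (NumPy and matplotlib, with a made-up matrix A and vector b standing in for a real problem): sweep $\lambda$, solve the Tikhonov-regularized least-squares problem, and plot the residual norm against the solution norm, usually on log-log axes, to locate the kink visually.

```python
import numpy as np
import matplotlib.pyplot as plt

# L-curve sketch: sweep lambda, solve min ||Ax - b||^2 + lambda^2 ||x||^2,
# and plot residual norm versus solution norm on log-log axes.
# A and b below are made-up stand-ins for a real problem.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = A @ rng.standard_normal(20) + 0.1 * rng.standard_normal(50)

lambdas = np.logspace(-4, 2, 60)
residual_norms, solution_norms = [], []
for lam in lambdas:
    # normal equations of the regularized problem: (A^T A + lambda^2 I) x = A^T b
    x = np.linalg.solve(A.T @ A + lam**2 * np.eye(A.shape[1]), A.T @ b)
    residual_norms.append(np.linalg.norm(A @ x - b) ** 2)
    solution_norms.append(np.linalg.norm(x) ** 2)

plt.loglog(residual_norms, solution_norms, "o-")
plt.xlabel("||AX - B||^2")
plt.ylabel("||X||^2")
plt.title("L-curve: choose lambda near the kink")
plt.show()
```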

I hope you enjoyed the post. See you next week!!

Live long and prosper!

#=================================
Diogo de Moura Pedroso
