Walking through PCA (Principal Component Analysis)

Some posts ago, I discussed some issues and limitations of Gaussian distributions, one of the most important tools in quantitative analysis. In this post, I will walk through the concept of PCA (Principal Component Analysis), which is also one of the mainstream techniques for analyzing problems with multiple stochastic variables. In fact, PCA has found applications in several areas, such as machine learning, financial engineering, remote sensing, mechatronics, and, of course, statistical analysis. As always, this post is not formal, since the goal here is to spread accessible information and discuss concepts. For completeness, I also recommend Shlens' paper, which brings an interesting discussion on the topic.
What is PCA for?
It is a mathematical procedure that transforms a set of (possibly) correlated variables into a set of uncorrelated variables. Mathematically, we want to obtain a set of variables for which we can write $p(x_1, x_2)=p(x_1)p(x_2)$, where $p(\cdot)$ is the probability density and the $x_i$ are the variables. But why do we need it? For several reasons, but in general the idea is to reduce computational cost and complexity when simulating or analyzing a problem. For instance, once we get a set of independent variables, we can choose to reduce the dimensionality of the system, picking only as many variables as needed to explain the system we are investigating.
It is better to work with some examples.
Example 1: One needs to evaluate the performance of a portfolio of possibly correlated stocks by simulating their trends in the near future. PCA can then be used to reduce the computational cost of this simulation.
Example 2: Shlens' paper proposes the following situation: one decides to use cameras to evaluate a simple spring-mass system that oscillates along the z-axis. In order to reduce measurement noise, one decides to use more than one camera (let's say three cameras) placed arbitrarily with respect to the z-axis, that is, the cameras might be located anywhere. PCA can be used to recover the motion along the z-axis from the correlated samples obtained by the cameras.
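To make Example 2 concrete, here is a minimal sketch with synthetic data (the oscillation frequency, noise level, and camera directions are my own illustrative assumptions, not values from Shlens' paper):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# True 1-D oscillation along the z-axis, sampled at 500 instants.
t = np.linspace(0, 10, 500)
z = np.sin(2 * np.pi * 0.5 * t)

# Three cameras, each observing the motion along an arbitrary direction,
# plus independent measurement noise.
directions = rng.normal(size=(3, 1))           # arbitrary camera orientations
noise = 0.05 * rng.normal(size=(3, t.size))    # measurement noise
readings = directions @ z[None, :] + noise     # shape (3 cameras, 500 samples)

# PCA on the (samples x variables) matrix: the first component should
# recover the underlying oscillation up to scale and sign.
pca = PCA(n_components=1)
recovered = pca.fit_transform(readings.T).ravel()

print("explained variance ratio:", pca.explained_variance_ratio_[0])
print("correlation with true motion:", np.corrcoef(recovered, z)[0, 1])
```

With three noisy, arbitrarily oriented readings of a one-dimensional motion, the first principal component typically explains almost all of the variance and correlates strongly (up to sign) with the true oscillation.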
Example 3: In a machine learning project, many variables might be available to the data scientist. However, not all of these variables have explanatory power for the phenomenon she is studying. Consider, for instance, a study of energy blackouts in a city for which the following data is available: age of the installations, meteorological history, power consumption, and so on. A problem like this might involve tens of variables; the meteorological variables alone might include precipitation, thunder (number and magnitude), wind (direction and magnitude), humidity, and temperature. PCA can be used to reduce the set to the variables that in fact provide useful information for the project.
Assumptions and limitations
As mentioned before, PCA is one of the main methods in quantitative analysis, and several other authors agree with me (this post, for instance, also brings other concepts). However, despite being a mainstream method, it relies on several assumptions and has limitations that should be taken into account before using it. Let's take a look:
- Gaussian variables: it is assumed that the distributions of the considered variables are fully described by their first two moments (mean and variance). Distributions other than the Gaussian, such as the Poisson, cannot be used in a PCA analysis;
- Linearity: it is assumed that the relations among the variables are linear;
- Zero mean: it is necessary to transform the variables so that they have zero mean (a centering sketch follows this list);
- Variance is the most relevant parameter: it is assumed that directions of large variance carry the most relevant information about the system;
- Principal components are orthogonal: the new reference system (basis) built by PCA must have orthogonal axes.
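As a quick illustration of the zero-mean assumption, here is a minimal centering sketch (scikit-learn's PCA performs this centering internally, but it is useful to see it explicitly); the data is random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=2.0, size=(3, 100))  # 3 variables, 100 trials

X_centered = X - X.mean(axis=1, keepdims=True)     # remove each variable's mean
print(X_centered.mean(axis=1))                     # ~ [0, 0, 0]
```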
Calculating PCA
Consider an $n \times m$ matrix $\mathbf{X}$ whose columns contain the $m$ trials (measurements, for instance) and whose rows contain the $n$ variables obtained in each measurement. We assume that the variables in $\mathbf{X}$ have zero mean (assumption 3 of the list above). In this case, the covariance matrix of these variables can be written as $\Sigma_X=\frac{1}{m-1}\mathbf{XX^T}$. Our goal will be to find an orthonormal matrix $\mathbf{P}$ such that the covariance of the transformation $\mathbf{Y}=\mathbf{PX}$ is diagonal, that is, the matrix $\Sigma_Y=\mathbf{P}\Sigma_X\mathbf{P^T}$ is diagonal.
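In code, with the convention above (rows are variables, columns are trials), the covariance matrix can be computed as follows; the data here is just random for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 1000                               # n variables, m trials
X = rng.normal(size=(n, m))
X -= X.mean(axis=1, keepdims=True)           # enforce zero mean per variable

Sigma_X = (X @ X.T) / (m - 1)                # n x n covariance matrix
print(np.allclose(Sigma_X, np.cov(X)))       # np.cov uses the same convention
```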
In order to obtain the matrix $\mathbf{Y}$, we diagonalize $\Sigma_X$ in such a way that the rows of $\mathbf{P}$ contain the eigenvectors and the diagonal matrix $\mathbf{D}$ contains the eigenvalues, that is, $\Sigma_X=\mathbf{P^TDP}$. This last step is a great trick, because the only information really relevant for the process is the covariance matrix $\Sigma_X$; no parametrization is required.
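A minimal NumPy sketch of this diagonalization on the same kind of synthetic data as above:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 1000
X = rng.normal(size=(n, m))
X -= X.mean(axis=1, keepdims=True)
Sigma_X = (X @ X.T) / (m - 1)

eigvals, eigvecs = np.linalg.eigh(Sigma_X)   # eigh: Sigma_X is symmetric
P = eigvecs.T                                # rows of P are the eigenvectors
D = np.diag(eigvals)

print(np.allclose(Sigma_X, P.T @ D @ P))     # Sigma_X = P^T D P
print(np.allclose(P @ Sigma_X @ P.T, D))     # Sigma_Y = P Sigma_X P^T is diagonal
```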
With the definitions above, one can notice that the matrix $\mathbf{Y}$ is composed of independent random variables. Moreover, we can rewrite it as $\mathbf{Y}=\mathbf{\sqrt{D}\widetilde{Y}}$, where $\mathbf{\widetilde{Y}}$ is composed of normal variables with zero mean and unit variance. Therefore, we can write the matrix $\mathbf{X}$ in terms of independent variables (which is amazing!!!), that is:
$\mathbf{X}=\mathbf{P^T\sqrt{D}\widetilde{Y}}$
where the rows of $\mathbf{P}$ (equivalently, the columns of $\mathbf{P^T}$) are the (principal) components.
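A small numerical check of this factorization (same synthetic setup as before): $\mathbf{\widetilde{Y}}$ is obtained by rescaling the principal-component scores to unit variance, and reassembling the factors recovers $\mathbf{X}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 1000
X = rng.normal(size=(n, m))
X -= X.mean(axis=1, keepdims=True)
Sigma_X = (X @ X.T) / (m - 1)
eigvals, eigvecs = np.linalg.eigh(Sigma_X)
P, D = eigvecs.T, np.diag(eigvals)

Y = P @ X                                          # principal-component scores
Y_tilde = np.diag(1.0 / np.sqrt(eigvals)) @ Y      # rescale rows to unit variance

print(np.allclose(np.cov(Y_tilde), np.eye(n)))     # ~ identity covariance
print(np.allclose(P.T @ np.sqrt(D) @ Y_tilde, X))  # X = P^T sqrt(D) Y~
```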
In principle, the equation as written above does not bring much benefit, since the dimensionality of the problem remains the same. However, one can make some approximations leading to a reduction of the dimensionality. The first step is to reorganize the matrix $\mathbf{D}$ so that the diagonal values are in descending order. Of course, we also reorder the rows of $\mathbf{P}$ to keep the correspondence with the eigenvalues in $\mathbf{D}$. After that, we evaluate the percentage of the variance that each component explains: denoting the eigenvalues by $S_i$, their sum $S=\sum_{i}S_i$ explains 100% of the variance and each component explains a fraction $\frac{S_i}{S}$ of it. The trick to reduce the dimensionality of the problem becomes quite clear at this moment: we simply keep the largest eigenvalues that explain a desirable amount of the variance, neglecting the rows associated with the lowest eigenvalues. In this way, we reduce the number of rows in $\mathbf{D}$ and in $\mathbf{\widetilde{Y}}$, resulting in an approximation of $\mathbf{X}$ with reduced dimensionality.
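Here is a sketch of this selection on synthetic data that is, by construction, dominated by two directions; the 95% threshold is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 5, 1000
# Synthetic data dominated by two directions, plus small noise.
basis = rng.normal(size=(n, 2))
X = basis @ rng.normal(size=(2, m)) + 0.1 * rng.normal(size=(n, m))
X -= X.mean(axis=1, keepdims=True)

Sigma_X = (X @ X.T) / (m - 1)
eigvals, eigvecs = np.linalg.eigh(Sigma_X)

order = np.argsort(eigvals)[::-1]                     # descending eigenvalues
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
explained = eigvals / eigvals.sum()                   # S_i / S for each component

k = np.searchsorted(np.cumsum(explained), 0.95) + 1   # smallest k reaching 95%
P_k = eigvecs[:, :k].T                                # keep only k components
X_reduced = P_k @ X                                   # k x m representation

print("variance explained per component:", np.round(explained, 3))
print("components kept:", k)
```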
Implementation
All the steps described in the previous section are available in different black-box packages. There are at least two well-known packages that implement PCA: scikit-learn and dask. In fact, their interfaces are identical! Internally, however, the methods differ in the algorithm used for the computation (see the links). Performance is also similar.
The criterion for choosing one package or the other depends on your objectives with PCA. Both packages were initially designed for machine learning applications, so the first criterion could simply be which package your team decides to work with. This site discusses a bit of this comparison (scikit-learn vs dask) for ML applications. However, one is not always focused on ML and may simply want an analysis in the context of one of the applications mentioned at the beginning of this post. In that case, I personally use scikit-learn, because I can rely on the other implementations available in this package, and it does not make sense to import several packages with the same or similar functionalities.
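For reference, a minimal scikit-learn usage sketch (per the comparison above, the dask interface is essentially the same); the data and the 95% threshold are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
data = rng.normal(size=(1000, 10))          # 1000 samples, 10 features

# Passing a float keeps just enough components to reach that fraction of variance.
pca = PCA(n_components=0.95)
scores = pca.fit_transform(data)            # samples projected on the components

print(pca.n_components_)                    # number of components kept
print(pca.explained_variance_ratio_)        # S_i / S for each kept component
print(pca.components_.shape)                # the principal components (rows)
```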
I hope you enjoyed the post. Leave your comments and share!
Live long and prosper!
#=================================
Diogo de Moura Pedroso
LinkedIn: www.linkedin.com/in/diogomourapedroso