Principal Component Analysis (PCA)

Many times we face the need to analyze data collections that have a large number of variables. This leads to problems when we try to extract information from the data. When using multivariate regression to describe or predict, we may not know which variables to use; even approaches such as backward or forward selection can be costly. Not only that, they can bring another problem: multicollinearity. Those approaches do not care about the correlation among the selected x's, which makes multiple regression difficult. This is where PCA comes in handy.
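
For instance, a quick way to spot multicollinearity before fitting a regression is to look at the correlation matrix of the candidate predictors. A minimal sketch in R, using the built-in mtcars data and an arbitrary pick of columns purely for illustration:

    predictors <- mtcars[, c("disp", "hp", "wt", "drat")]  # candidate x's (illustrative choice)
    round(cor(predictors), 2)                              # values near +/-1 signal multicollinearity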

Principal Component Analysis helps us identify commonalities among the variables and group them into components that we then interpret, hoping to find common sense in them. By doing this we may drop or surrogate variables, not just for the statistical benefit of our regression model: we can actually decide which variables to use and what to do with those left aside.

PCA is a mathematical approach rather than a statistical one. By using the directions (eigenvectors) and the spread along each direction (eigenvalues) we can rearrange the variables so as to gather as much variance as possible in just a few components, or at least that is what we hope for. So we know that even when we transform the data, its meaning and its relations remain. PCA is an interesting and powerful tool that can be used at different steps of data mining: for dimension reduction, as it helps us perform feature extraction, and for pattern discovery, as we may use it to describe a phenomenon.
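
As a minimal sketch of what this looks like in practice, R's built-in prcomp() runs PCA directly; the data set and the number of components shown here are only assumptions for illustration:

    pca <- prcomp(mtcars, scale. = TRUE)  # scale variables so no single unit dominates the variance
    summary(pca)                          # proportion of variance captured by each component
    pca$rotation[, 1:2]                   # loadings: how the original variables combine into PC1 and PC2
    screeplot(pca, type = "lines")        # eigenvalues per component, to decide how many to keep

The summary and the scree plot answer the "how few components" question, while the loadings are what we interpret when trying to give each component a meaning.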

Most important & difficult task – Explaining it

We recently used PCA to describe the interrelation between a client and its customers, and we found that the most challenging part of performing this analysis is making your client understand it. So try to tell a coherent story. That is not an easy task when you are explaining how merging variables helps identify patterns in a relationship. Surrogating a single variable may be the best approach: you might lose some predictive power, but explaining the phenomenon gets easier. If you trigger a eureka moment in your customer, the adoption process gets closer.
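
One simple, hedged way to pick such surrogates (not necessarily how it was done for this client) is to take, for each retained component, the original variable with the largest absolute loading; continuing the sketch above:

    loadings <- pca$rotation[, 1:3]                                      # loadings of the first three components
    surrogates <- rownames(loadings)[apply(abs(loadings), 2, which.max)] # strongest variable per component
    surrogates                                                           # one representative variable each

Each surrogate is a single, concrete variable the client already knows, which makes the story much easier to tell than a weighted mix of everything.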

By Julio | PCA, R
