A TUTORIAL ON PRINCIPAL COMPONENT ANALYSIS …

A TUTORIAL ON PRINCIPAL COMPONENT ANALYSISD erivation, Discussion and Singular Value DecompositionJon March 2003|Version 1 PRINCIPAL COMPONENT ANALYSIS (PCA) is a mainstayof modern data ANALYSIS - a black box that is widelyused but poorly understood. The goal of this paper isto dispel the magic behind this black box. This tutorialfocuses on building a solid intuition for how and whyprincipal COMPONENT ANALYSIS works; furthermore, itcrystallizes this knowledge by deriving from first prin-cipals, the mathematics behind PCA . This tutorialdoes not shy away from explaining the ideas infor-mally, nor does it shy away from the hope is that by addressing both aspects, readersof all levels will be able to gain a better understand-ing of the power of PCA as well as the when, the howand the why of applying this COMPONENT ANALYSIS (PCA)

Has been calledone of the most valuable results from applied lin-ear used abundantly in all formsof ANALYSIS - from neuroscience to computer graphics- because it is a simple, non-parametric method ofextracting relevant information from confusing datasets. With minimal additional effortPCAprovidesa roadmap for how to reduce a complex data set toa lower dimension to reveal the sometimes hidden,simplified dynamics that often underlie goal of this TUTORIAL is to provide both an intu-itive feel forPCA, and a thorough discussion of thistopic.

We will begin with a simple example and pro-vide an intuitive explanation of the goal ofPCA. Wewill continue by adding mathematical rigor to placeit within the framework of linear algebra and explic-itly solve this problem. We will see how and whyPCAis intimately related to the mathematical tech-nique of singular value decomposition (SVD). Thisunderstanding will lead us to a prescription for howto applyPCAin the real world. We will discuss boththe assumptions behind this technique as well as pos-sible extensions to overcome these discussion and explanations in this paper areinformal in the spirit of a TUTORIAL .

The goal of thispaper is toeducate. Occasionally, rigorous mathe-matical proofs are necessary although relegated tothe Appendix. Although not as vital to the TUTORIAL ,the proofs are presented for the adventurous readerwho desires a more complete understanding of themath. The only assumption is that the reader has aworking knowledge of linear algebra. Nothing feel free to contact me with any suggestions,corrections or : A Toy ExampleHere is the perspective: we are an experimenter. Weare trying to understand some phenomenon by mea-suring various quantities ( spectra, voltages, ve-locities, etc.)

In our system. Unfortunately, we cannot figure out what is happening because the dataappears clouded, unclear and even redundant. Thisis not a trivial problem, but rather a fundamentalobstacle to experimental science. Examples aboundfrom complex systems such as neuroscience, photo-science, meteorology and oceanography - the numberof variables to measure can be unwieldy and at timesevendeceptive, because the underlying dynamics canoften be quite for example a simple toy problem fromphysics diagrammed in Figure 1. Pretend we arestudying the motion of the physicist s ideal system consists of a ball of massmattached toamassless, frictionlessspring.

The ball is released asmall distance away from equilibrium ( the springis stretched). Because the spring is ideal, it oscil-lates indefinitely along thex-axis about its equilib-rium at a set is a standard problem in physics in which the1 Figure 1: A diagram of the toy along thexdirection is solved by an explicitfunction of time. In other words, the underlying dy-namics can be expressed as a function of a single , being ignorant experimenters we do notknow any of this. We do not know which, let alonehow many, axes and dimensions are important tomeasure.

Thus, we decide to measure the ball s posi-tion in a three-dimensional space (since we live in athree dimensional world). Specifically, we place threemovie cameras around our system of interest. At200 Hz each movie camera records an image indicat-ing a two dimensional position of the ball (a projec-tion). Unfortunately, because of our ignorance, wedo not even know what are thereal x , y and z axes, so we choose three camera axes{~a,~b,~c}at somearbitrary angles with respect to the system. The an-gles between our measurements might not even be90o!

Now, we record with the cameras for 2 big question remains:how do we get from thisdata set to a simple equation ofx?We know a-priori that if we were smart experi-menters, we would have just measured the positionalong thex-axis with one camera. But this is notwhat happens in the real world. We often do notknow what measurements best reflect the dynamicsof our system in question. Furthermore, we some-times record more dimensions than we actually need!Also, we have to deal with that pesky, real-worldproblem ofnoise. In the toy example this meansthat we need to deal with air, imperfect cameras oreven friction in a less-than-ideal spring.

Noise con-taminates our data set only serving to obfuscate thedynamics toy example is the challengeexperimenters face will refer to thisexample as we delve further into abstract , by the end of this paper we will have agood understanding of how to systematically extractxusing PRINCIPAL COMPONENT : Change of BasisThe Goal: PRINCIPAL COMPONENT ANALYSIS computesthe most meaningfulbasisto re-express a noisy, gar-bled data set. The hope is that this new basis willfilter out the noise and reveal hidden dynamics. Inthe example of the spring, the explicit goal ofPCAisto determine: the dynamics are along thex-axis.

In other words, the goal ofPCAis to determine that x- the unit basis vector along thex-axis - is the im-portant dimension. Determining this fact allows anexperimenter to discern which dynamics are impor-tant and which are just Naive BasisWith a more precise definition of our goal, we needa more precise definition of our data as well. Foreach time sample (or experimental trial), an exper-imenter records a set of data consisting of multiplemeasurements ( voltage, position, etc.). Thenumber of measurement types is thedimensionofthe data set.

A TUTORIAL ON PRINCIPAL COMPONENT ANALYSIS …

Tags:

Information

Advertisement

Transcription of A TUTORIAL ON PRINCIPAL COMPONENT ANALYSIS …

Related search queries

A TUTORIAL ON PRINCIPAL COMPONENT ANALYSIS …

Tags:

Information

Advertisement

Documents from same domain

Related documents

Related search queries