
A Tutorial on Principal Component Analysis

Jonathon Shlens
Google Research, Mountain View, CA 94043
(Dated: April 7, 2014; Version 3.02)

Principal component analysis (PCA) is a mainstay of modern data analysis - a black box that is widely used but (sometimes) poorly understood. The goal of this paper is to dispel the magic behind this black box. This manuscript focuses on building a solid intuition for how and why principal component analysis works. This manuscript crystallizes this knowledge by deriving from simple intuitions, the mathematics behind PCA.



This tutorial does not shy away from explaining the ideas informally, nor does it shy away from the mathematics. The hope is that by addressing both aspects, readers of all levels will be able to gain a better understanding of PCA as well as the when, the how and the why of applying this technique.

INTRODUCTION

Principal component analysis (PCA) is a standard tool in modern data analysis - in diverse fields from neuroscience to computer graphics - because it is a simple, non-parametric method for extracting relevant information from confusing data sets. With minimal effort PCA provides a roadmap for how to reduce a complex data set to a lower dimension to reveal the sometimes hidden, simplified structures that often underlie it.

The goal of this tutorial is to provide both an intuitive feel for PCA, and a thorough discussion of this topic.

We will begin with a simple example and provide an intuitive explanation of the goal of PCA. We will continue by adding mathematical rigor to place it within the framework of linear algebra to provide an explicit solution. We will see how and why PCA is intimately related to the mathematical technique of singular value decomposition (SVD). This understanding will lead us to a prescription for how to apply PCA in the real world and an appreciation for the underlying assumptions. My hope is that a thorough understanding of PCA provides a foundation for approaching the fields of machine learning and dimensional reduction.

The discussion and explanations in this paper are informal in the spirit of a tutorial.

The goal of this paper is to educate. Occasionally, rigorous mathematical proofs are necessary although relegated to the Appendix. Although not as vital to the tutorial, the proofs are presented for the adventurous reader who desires a more complete understanding of the math. My only assumption is that the reader has a working knowledge of linear algebra. My goal is to provide a thorough discussion by largely building on ideas from linear algebra and avoiding challenging topics in statistics and optimization theory (but see Discussion). Please feel free to contact me with any suggestions, corrections or comments.

MOTIVATION: A TOY EXAMPLE

Here is the perspective: we are an experimenter. We are trying to understand some phenomenon by measuring various quantities (e.g. spectra, voltages, velocities, etc.) in our system. Unfortunately, we cannot figure out what is happening because the data appears clouded, unclear and even redundant. This is not a trivial problem, but rather a fundamental obstacle in empirical science. Examples abound from complex systems such as neuroscience, web indexing, meteorology and oceanography - the number of variables to measure can be unwieldy and at times even deceptive, because the underlying relationships can often be quite simple.

Take for example a simple toy problem from physics diagrammed in Figure 1.

Pretend we are studying the motion of the physicist's ideal spring. This system consists of a ball of mass $m$ attached to a massless, frictionless spring. The ball is released a small distance away from equilibrium (i.e. the spring is stretched). Because the spring is ideal, it oscillates indefinitely along the $x$-axis about its equilibrium at a set frequency.

This is a standard problem in physics in which the motion along the $x$ direction is solved by an explicit function of time. In other words, the underlying dynamics can be expressed as a function of a single variable $x$.

However, being ignorant experimenters we do not know any of this.
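For reference, the "explicit function of time" is the textbook solution of the simple harmonic oscillator. The spring constant $k$, the initial displacement $x_0$ and the choice of releasing the ball from rest at $t = 0$ are notation assumed here for illustration; they do not appear in the text. With $x$ measured from equilibrium:

```latex
m\,\ddot{x}(t) = -k\,x(t)
\quad\Longrightarrow\quad
x(t) = x_0 \cos(\omega t),
\qquad \omega = \sqrt{k/m}
```

A single number, $x(t)$, therefore describes the state of the ball at every instant, which is what "a function of a single variable" means above.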

We do not know which, let alone how many, axes and dimensions are important to measure. Thus, we decide to measure the ball's position in a three-dimensional space (since we live in a three-dimensional world). Specifically, we place three movie cameras around our system of interest. At 120 Hz each movie camera records an image indicating a two-dimensional position of the ball (a projection). Unfortunately, because of our ignorance, we do not even know what the real $x$, $y$ and $z$ axes are, so we choose three camera positions $\vec{a}$, $\vec{b}$ and $\vec{c}$ at some arbitrary angles with respect to the system.

The angles between our measurements might not even be 90°! Now, we record with the cameras for several minutes. The big question remains: how do we get from this data set to a simple equation of $x$?

FIG. 1 (panels: camera A, camera B, camera C) A toy example. The position of a ball attached to an oscillating spring is recorded using three cameras A, B and C. The position of the ball tracked by each camera is depicted in each panel.

We know a-priori that if we were smart experimenters, we would have just measured the position along the $x$-axis with one camera.

But this is not what happens in the real world. We often do not know which measurements best reflect the dynamics of our system in question. Furthermore, we sometimes record more dimensions than we actually need.

Also, we have to deal with that pesky, real-world problem of noise. In the toy example this means that we need to deal with air, imperfect cameras or even friction in a less-than-ideal spring. Noise contaminates our data set only serving to obfuscate the dynamics further. This toy example is the challenge experimenters face every day. Keep this example in mind as we delve further into abstract concepts.
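To keep the toy example concrete, here is a minimal simulation sketch of the three-camera recording. It is not part of the original setup: the function names (`camera_axes`, `simulate_cameras`, `variance_fractions`), the camera angles, the 1 Hz oscillation, the 10 s duration and the noise level are all hypothetical choices, and each camera is assumed to perform a simple orthographic projection onto its image plane.

```python
import numpy as np

def camera_axes(yaw_deg, pitch_deg):
    """Two orthonormal image-plane axes (2 x 3) for a camera at the given angles."""
    yaw, pitch = np.deg2rad([yaw_deg, pitch_deg])
    right = np.array([np.cos(yaw), np.sin(yaw), 0.0])
    up = np.array([-np.sin(yaw) * np.sin(pitch),
                   np.cos(yaw) * np.sin(pitch),
                   np.cos(pitch)])
    return np.vstack([right, up])

def simulate_cameras(n_samples=1200, rate_hz=120.0, freq_hz=1.0,
                     amplitude=1.0, noise_std=0.0, seed=0):
    """Return an (n_samples, 6) array: two image coordinates from each of 3 cameras."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_samples) / rate_hz

    # True dynamics: the ball moves along the x-axis only.
    position = np.zeros((n_samples, 3))
    position[:, 0] = amplitude * np.cos(2 * np.pi * freq_hz * t)

    # Hypothetical, arbitrary camera orientations (yaw, pitch) in degrees.
    angles = [(20.0, 35.0), (110.0, -15.0), (250.0, 60.0)]
    views = [position @ camera_axes(yaw, pitch).T for yaw, pitch in angles]

    data = np.hstack(views)                      # columns: A1, A2, B1, B2, C1, C2
    data += noise_std * rng.normal(size=data.shape)
    return data

def variance_fractions(data):
    """Fraction of total variance along each direction found by an SVD."""
    centered = data - data.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    return s**2 / np.sum(s**2)

if __name__ == "__main__":
    clean = simulate_cameras(noise_std=0.0)
    print("rank of noiseless data:", np.linalg.matrix_rank(clean))
    noisy = simulate_cameras(noise_std=0.05)
    print("variance fractions with noise:", np.round(variance_fractions(noisy), 4))
```

Under these assumptions, the six recorded columns are all scaled copies of one underlying signal, so the noiseless data matrix has rank 1. With noise added, a single direction should still carry almost all of the variance. This is the sense in which we record more dimensions than we actually need.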

Hopefully, by the end of this paper we will have a good understanding of how to systematically extract $x$ using principal component analysis.

FRAMEWORK: CHANGE OF BASIS

The goal of principal component analysis is to identify the most meaningful basis to re-express a data set. The hope is that this new basis will filter out the noise and reveal hidden structure. In the example of the spring, the explicit goal of PCA is to determine: "the dynamics are along the $x$-axis." In other words, the goal of PCA is to determine that $\hat{x}$, i.e. the unit basis vector along the $x$-axis, is the important dimension. Determining this fact allows an experimenter to discern which dynamics are important, redundant or noise.

A Naive Basis

With a more precise definition of our goal, we need a more precise definition of our data as well.

