Visualizing Data using t-SNE

Journal of Machine Learning Research 9 [...], 5000 LE Tilburg, [...] King's College Road, M5S 3G4 Toronto, ON, Canada. Editor: Yoshua Bengio

Abstract

We present a new technique called "t-SNE" that visualizes high-dimensional data by giving each datapoint a location in a two [...] a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency [...] better than existing techniques at creating a single map that reveals structure at many [...] particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, [...], we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is [...] illustrate the performance of t-SNE on a wide variety of datasets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap [...]

2. Stochastic Neighbor Embedding

Stochastic Neighbor Embedding (SNE) starts by converting the high-dimensional Euclidean distances between datapoints into conditional probabilities that represent similarities.[1] The similarity of datapoint $x_j$ to datapoint $x_i$ is the conditional probability, $p_{j|i}$, that $x_i$ would pick $x_j$ as its neighbor


Transcription of Visualizing Data using t-SNE

2 Visualization, dimensionality reduction, manifold learning, embedding algorithms, [...] different domains, and deals with data of widely varying dimensionality. Cell nuclei that are relevant to breast cancer, for example, are described by approximately 30 variables (Street et al., 1993), whereas the pixel intensity vectors used to represent images or the word-count vectors used to represent documents typically have [...], a variety of techniques for the visualization of such high-dimensional data have been proposed, many of which are reviewed by de Oliveira and Levkowitz (2003). Important techniques include iconographic displays such as Chernoff faces (Chernoff, 1973), pixel-based techniques (Keim, 2000), and techniques that represent the dimensions in the data as vertices in a graph (Battista et al., 1994). Most of these techniques simply provide tools to display more than two data dimensions, and leave the interpretation of the [...]

© 2008 Laurens van der Maaten and Geoffrey [...]

[...], dimensionality reduction methods convert the high-dimensional dataset $X = \{x_1, x_2, \ldots, x_n\}$ into two or three-dimensional data $Y = \{y_1, y_2, \ldots, y_n\}$ that can be displayed in a [...], we refer to the low-dimensional data representation $Y$ as a map, [...] been proposed that differ in the type of structure they [...] (PCA; Hotelling, 1933) and classical multidimensional scaling (MDS; Torgerson, 1952) [...] low-dimensional, non-linear manifold it is usually more important to keep the low-dimensional representations of very similar datapoints close together, which is typically not possible with a [...] large number of nonlinear dimensionality reduction techniques that aim to preserve the local structure of data have been proposed, many of which are reviewed by Lee and Verleysen (2007).

3 In particular, we mention the following seven techniques: (1) Sammon mapping (Sammon, 1969), (2) curvilinear components analysis (CCA; Demartines and Hérault, 1997), (3) Stochastic Neighbor Embedding (SNE; Hinton and Roweis, 2002), (4) Isomap (Tenenbaum et al., 2000), (5) Maximum Variance Unfolding (MVU; Weinberger et al., 2004), (6) Locally Linear Embedding (LLE; Roweis and Saul, 2000), and (7) Laplacian Eigenmaps (Belkin and Niyogi, 2002). Despite the strong performance of these techniques on artificial datasets, they are often not very successful at visualizing real, [...], most of the techniques are not capable of retaining both the local and the global structure of the data in a [...], a recent study reveals that even a semi-supervised variant of MVU is not capable of separating handwritten digits into their natural clusters (Song et al., 2007).

In this paper, we describe a way of converting a high-dimensional dataset into a matrix of pairwise similarities and we introduce a new technique, called "t-SNE", [...] capable of capturing much of the local structure of the high-dimensional data very well, while also revealing global structure such as the presence of clusters at [...] illustrate the performance of t-SNE by comparing it to the seven dimensionality reduction techniques mentioned above on five datasets from a [...], most of the $(7+1) \times 5 = 40$ maps are presented in the supplemental material, [...], we outline SNE as presented by Hinton and Roweis (2002), [...], we present t-SNE, [...] Section 5 shows how t-SNE can be modified to visualize real-world datasets that contain many more than 10 [...]

4 More detail in [...] (SNE) [...] datapoint $x_i$ is the conditional probability, $p_{j|i}$, that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a [...], $p_{j|i}$ is relatively high, whereas for widely separated datapoints, $p_{j|i}$ will be almost infinitesimal (for reasonable values of the variance of the Gaussian, $\sigma_i$). Mathematically, the conditional probability $p_{j|i}$ is given by

$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}, \qquad (1)$$

where $\sigma_i$ is the variance of the Gaussian that is centered on datapoint $x_i$. The method for determining the value of $\sigma_i$ is presented later in [...] modeling pairwise similarities, [...], it is possible to compute a similar conditional probability, which we denote by $q_{j|i}$. We set[2] the variance of the Gaussian that is employed in the computation of the conditional probabilities $q_{j|i}$ to $\frac{1}{\sqrt{2}}$. Hence, we model the similarity of map point $y_j$ to map point $y_i$ by

$$q_{j|i} = \frac{\exp\left(-\|y_i - y_j\|^2\right)}{\sum_{k \neq i} \exp\left(-\|y_i - y_k\|^2\right)}.$$

Again, since we are only interested in modeling pairwise similarities, we set $q_{i|i} =$ [...] the map points $y_i$ and $y_j$ correctly model the similarity between the high-dimensional datapoints $x_i$ and $x_j$, [...], SNE aims to find a low-dimensional data representation that minimizes the mismatch between $p_{j|i}$ and $q_{j|i}$.
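The two conditional probabilities above can be illustrated with a small pure-Python sketch. The function names (`sq_dist`, `conditional_p`, `conditional_q`) and the toy data are hypothetical and not from the paper. The sketch treats $\sigma_i$ as the bandwidth that appears in $2\sigma_i^2$, and it leaves the diagonal entries at zero, which is an assumption chosen to match the $k \neq i$ sum in the denominator.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def conditional_p(X, sigmas):
    """Return P with P[i][j] = p_{j|i}, following Equation 1.

    Assumption of this sketch: the diagonal is left at 0, matching the
    k != i sum in the denominator.
    """
    n = len(X)
    P = []
    for i in range(n):
        weights = [
            0.0 if j == i
            else math.exp(-sq_dist(X[i], X[j]) / (2.0 * sigmas[i] ** 2))
            for j in range(n)
        ]
        total = sum(weights)
        P.append([w / total for w in weights])
    return P

def conditional_q(Y):
    """Same formula in the map, with the bandwidth fixed at 1/sqrt(2),
    so that 2 * sigma^2 = 1 and the exponent is just -||y_i - y_j||^2."""
    return conditional_p(Y, [1.0 / math.sqrt(2.0)] * len(Y))

# Toy data: two nearby points and one far away.
X = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]
P = conditional_p(X, sigmas=[1.0, 1.0, 1.0])
```

Each row of `P` sums to one, and the nearby point receives almost all of the probability mass in row 0, which is the behaviour the text describes for nearby versus widely separated datapoints. For very small bandwidths every weight in a row can underflow to zero, so a practical implementation would need to guard the division.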

5 A natural measure of the faithfulness with which $q_{j|i}$ models $p_{j|i}$ is the Kullback-Leibler divergence (which is in this case equal to the cross-entropy up to an additive constant). SNE minimizes the sum of Kullback-Leibler divergences over all datapoints using a [...] given by

$$C = \sum_i KL(P_i \| Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}, \qquad (2)$$

in which $P_i$ represents the conditional probability distribution over all other datapoints given datapoint $x_i$, and $Q_i$ represents the conditional probability distribution over all other map points given map point $y_i$. Because the Kullback-Leibler divergence is not symmetric, different types of error in the pairwise distances in the low-dimensional map are not weighted equally. In particular, there is a large cost for using widely separated map points to represent nearby datapoints ([...] small $q_{j|i}$ to model a large $p_{j|i}$), but there is only a [...], the SNE cost function focuses on retaining the local structure of the data in the map (for reasonable values of the variance of the Gaussian in the high-dimensional space, $\sigma_i$).

Footnote 1: [...], datasets that consist of pairwise similarities between objects rather than high-dimensional vector representations of each object, [...], human word association data consists of the probability of producing each possible word in response to a given word, as a result of which it is [...]

Footnote 2: [...], we lose the property that the data is a perfect model of itself if we embed it in a space of the same dimensionality, because in the high-dimensional space, we used a different variance [...]
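The asymmetry of the cost can be made concrete with a short sketch. The helper names `kl_term` and `sne_cost` and the numbers 0.9 and 0.1 are illustrative assumptions, not values from the paper.

```python
import math

def kl_term(p, q):
    """One summand p * log(p / q) of the cost; defined as 0 when p is 0."""
    return p * math.log(p / q) if p > 0.0 else 0.0

def sne_cost(P, Q):
    """Sum over i of KL(P_i || Q_i), where row i of P and Q holds the
    conditional distributions for datapoint i and map point i."""
    return sum(kl_term(p, q) for Pi, Qi in zip(P, Q) for p, q in zip(Pi, Qi))

# A large p modeled by a small q (nearby datapoints placed far apart).
far_apart_penalty = kl_term(0.9, 0.1)

# A small p modeled by a large q (distant datapoints placed close together).
too_close_penalty = kl_term(0.1, 0.9)
```

The first term is several times larger in magnitude than the second, which is why the cost mainly protects local structure. The second term is negative on its own; only the sum over a full distribution is guaranteed to be non-negative.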

6 The remaining parameter to be selected is the variance $\sigma_i$ of the Gaussian that is centered over each high-dimensional datapoint, $x_i$. It is not likely that there is a single value of $\sigma_i$ that is optimal for all datapoints in the dataset because the density of the data is likely to vary. In dense regions, a smaller value of $\sigma_i$ is [...] particular value of $\sigma_i$ induces a probability distribution, $P_i$, [...] which increases as [...] binary search for the value of $\sigma_i$ that produces a $P_i$ with a fixed perplexity that is [...] defined as

$$Perp(P_i) = 2^{H(P_i)},$$

where $H(P_i)$ is the Shannon entropy of $P_i$ measured in bits

$$H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}.$$

The perplexity can be interpreted as a smooth measure of the effective [...] fairly robust to changes in the perplexity, [...] is performed using a [...] surprisingly simple form

$$\frac{\delta C}{\delta y_i} = 2 \sum_j \left(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}\right)(y_i - y_j).$$

Physically, the gradient may be interpreted as the resultant force created by a set of springs between the map point $y_i$ and all other map points $y_j$. All springs exert a force along the direction $(y_i - y_j)$.
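The binary search for $\sigma_i$ can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the search bounds `lo` and `hi`, the iteration count, and the use of a geometric midpoint are all choices of this sketch rather than details given in the text. It relies on the perplexity increasing with $\sigma_i$.

```python
import math

def row_probs(sq_dists, sigma):
    """p_{j|i} for one datapoint, given squared distances to the others.

    The smallest distance is subtracted before exponentiating. This leaves
    the normalized result unchanged and avoids all weights underflowing to 0.
    """
    d_min = min(sq_dists)
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probs):
    """Perp(P_i) = 2 ** H(P_i), with H the Shannon entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy

def find_sigma(sq_dists, target, lo=1e-3, hi=1e3, iters=100):
    """Bisect (on a log scale) for the sigma whose perplexity hits target.

    Assumes 1 < target < len(sq_dists) and that the answer lies in [lo, hi].
    """
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if perplexity(row_probs(sq_dists, mid)) > target:
            hi = mid
        else:
            lo = mid
    return math.sqrt(lo * hi)

# Squared distances from one datapoint to four others (illustrative values).
sq_dists = [1.0, 4.0, 9.0, 16.0]
sigma = find_sigma(sq_dists, target=2.0)
```

With a target of 2, the returned bandwidth yields a distribution that behaves roughly as if the point had two effective neighbors. A uniform distribution over four neighbors has perplexity 4, the upper limit for this toy row.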

7 The spring between $y_i$ and $y_j$ repels or attracts the map points depending on whether the distance between the two in the map is [...] proportional to its length, and also proportional to its stiffness, which is the mismatch $(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})$ [...] initialized by sampling map points randomly from an isotropic Gaussian with small variance that is [...], a relatively large momentum term is added to [...], the current gradient is added to an exponentially decaying sum of previous gradients in order to determine the changes in the coordinates of the map points at [...], the gradient update with a momentum term is given by

$$\mathcal{Y}^{(t)} = \mathcal{Y}^{(t-1)} + \eta \frac{\delta C}{\delta \mathcal{Y}} + \alpha(t) \left( \mathcal{Y}^{(t-1)} - \mathcal{Y}^{(t-2)} \right),$$

where $\mathcal{Y}^{(t)}$ indicates the solution at iteration $t$, $\eta$ indicates the learning rate, and $\alpha(t)$ represents the momentum at [...], in the early stages of the optimization, Gaussian noise is [...] the variance of the noise changes very slowly at the critical point at which the global structure of the map starts to form, SNE tends to find maps with a [...], this requires sensible choices of the initial amount of Gaussian noise and the rate at which it [...], is therefore common to run the optimization several times on a [...], SNE is inferior to methods that allow convex optimization and it [...] discussed SNE as it was presented by Hinton and Roweis (2002).
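The momentum update can be tried out on a toy problem. The sketch below is not the SNE optimizer: it applies the same three-term update to a simple quadratic cost so that the effect of the momentum term is easy to check. The function names, the quadratic cost, and the constants `eta` and `alpha` are assumptions of this sketch. It also steps against the gradient, the usual descent convention, which is an assumption about how the sign in the printed update is meant to be read. The momentum is held constant here, whereas the text lets $\alpha(t)$ depend on the iteration.

```python
def momentum_step(y_prev, y_prev2, grad, eta, alpha):
    """One update: previous solution, a gradient step, and a momentum term
    proportional to the last change (y_prev - y_prev2)."""
    return [a - eta * g + alpha * (a - b)
            for a, b, g in zip(y_prev, y_prev2, grad)]

def minimize_quadratic(start, eta=0.1, alpha=0.5, steps=100):
    """Minimize C(y) = sum_k y_k^2, whose gradient is 2 * y."""
    y_prev2 = list(start)
    y_prev = list(start)
    for _ in range(steps):
        grad = [2.0 * v for v in y_prev]
        y_next = momentum_step(y_prev, y_prev2, grad, eta, alpha)
        y_prev2, y_prev = y_prev, y_next
    return y_prev

solution = minimize_quadratic([3.0, -2.0])
```

Because each step reuses the previous change, the gradient contributions accumulate as an exponentially decaying sum, which is the behaviour the text attributes to the momentum term.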

8 Although SNE constructs reasonably good visualizations, it is hampered by a cost function that is difficult to optimize and by a problem we refer to as the "crowding problem". In this section, we present a new technique called "t-Distributed Stochastic Neighbor Embedding" or "t-SNE" [...] ways: (1) it uses a symmetrized version of the SNE cost function with simpler gradients that was briefly introduced by Cook et al. (2007) and (2) it uses a Student-t distribution rather than a Gaussian to compute the similarity between two points in the low-dimensional space. t-SNE employs a [...], we first discuss the symmetric version of SNE ([...]). Subsequently, we discuss the crowding problem ([...]), and the use of heavy-tailed distributions to address this problem ([...]). We conclude the section by describing our approach to the optimization of the t-SNE cost function ([...]).

[...] to minimizing the sum of the Kullback-Leibler divergences between the conditional probabilities $p_{j|i}$ and $q_{j|i}$, it is also possible to minimize a single Kullback-Leibler divergence between a joint probability distribution, $P$, in the high-dimensional space and a joint probability distribution, $Q$, in the low-dimensional space:

$$C = KL(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}},$$

where again, [...] refer to this type of SNE as symmetric SNE, because it has the property that $p_{ij} = p_{ji}$ and $q_{ij} = q_{ji}$ for $\forall i, j$.

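9 In symmetric SNE, [...] $q_{ij}$ are given by

$$q_{ij} = \frac{\exp\left(-\|y_i - y_j\|^2\right)}{\sum_{k \neq l} \exp\left(-\|y_k - y_l\|^2\right)}. \qquad (3)$$

The obvious way to define the pairwise similarities in the high-dimensional space $p_{ij}$ is

$$p_{ij} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma^2\right)}{\sum_{k \neq l} \exp\left(-\|x_k - x_l\|^2 / 2\sigma^2\right)},$$

but this causes problems when a high-dimensional datapoint $x_i$ is an outlier (i.e., all pairwise distances $\|x_i - x_j\|^2$ are large for $x_i$). For such an outlier, the values of $p_{ij}$ are extremely small for all $j$, [...] result, the position of the map point is [...] circumvent this problem by defining the joint probabilities $p_{ij}$ in the high-dimensional space to be the symmetrized conditional probabilities, that is, we set $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$. This ensures that $\sum_j p_{ij} > \frac{1}{2n}$ for all datapoints $x_i$, as a result of which each datapoint $x_i$ makes a [...], the simpler form of its gradient, [...] fairly similar to that of asymmetric SNE, and is given by

$$\frac{\delta C}{\delta y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j).$$

Footnote: [...] visualization of the data is not nearly as problematic as picking the model that does best on a [...], the aim is to see the structure in the training data, [...]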
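The symmetrization and the symmetric SNE gradient can be checked numerically with a small sketch. All function names and the toy matrices are hypothetical. The conditional matrix is indexed as `P_cond[i][j]` $= p_{j|i}$, and the gradient is compared against a finite-difference estimate of the cost rather than taken on trust.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def symmetrize(P_cond):
    """p_ij = (p_{j|i} + p_{i|j}) / 2n, with P_cond[i][j] = p_{j|i}."""
    n = len(P_cond)
    return [[(P_cond[i][j] + P_cond[j][i]) / (2.0 * n) for j in range(n)]
            for i in range(n)]

def joint_q(Y):
    """q_ij with a single normalizer over all pairs k != l."""
    n = len(Y)
    W = [[0.0 if i == j else math.exp(-sq_dist(Y[i], Y[j]))
          for j in range(n)] for i in range(n)]
    Z = sum(map(sum, W))
    return [[w / Z for w in row] for row in W]

def cost(P, Y):
    """C = KL(P || Q) over all pairs i != j with p_ij > 0."""
    Q = joint_q(Y)
    n = len(Y)
    return sum(P[i][j] * math.log(P[i][j] / Q[i][j])
               for i in range(n) for j in range(n)
               if i != j and P[i][j] > 0.0)

def gradient(P, Y):
    """dC/dy_i = 4 * sum_j (p_ij - q_ij) * (y_i - y_j), per coordinate."""
    Q = joint_q(Y)
    n, dim = len(Y), len(Y[0])
    return [[4.0 * sum((P[i][j] - Q[i][j]) * (Y[i][d] - Y[j][d])
                       for j in range(n))
             for d in range(dim)] for i in range(n)]

# Illustrative conditional probabilities for three datapoints.
P = symmetrize([[0.0, 0.7, 0.3], [0.6, 0.0, 0.4], [0.5, 0.5, 0.0]])
Y = [[0.0, 0.0], [1.0, 0.2], [0.3, 0.8]]
grad = gradient(P, Y)
```

The entries of `P` sum to one and every row sum exceeds $1/2n$, so each datapoint contributes to the cost even if it is an outlier.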

10 In preliminary experiments, we observed that symmetric SNE seems to produce maps that are just as good as asymmetric SNE, and sometimes even a [...] set of datapoints that lie on a two-dimensional curved manifold which is approximately linear on a small scale, and which is embedded within a [...] is possible to model the small pairwise distances between datapoints fairly well in a two-dimensional map, which is often illustrated on toy examples such as the "Swiss roll" [...] embedded within a [...] the pairwise distances in a [...], in ten dimensions, it is possible to have 11 datapoints that are mutually equidistant and there is no way to model this faithfully in a [...] related problem is the very different distribution of pairwise distances in the two [...] sphere centered on datapoint $i$ scales as $r^m$, [...] the datapoints are approximately uniformly distributed in the region around $i$ on the ten-dimensional manifold, and we try to model the distances from $i$ to the other datapoints in the two-dimensional map, we get the following "crowding problem": the area of the two-dimensional map that is available to accommodate moderately distant datapoints will not be nearly large enough compared with the area available to [...], if we want to model the small distances accurately in the map, [...] moderate distance from datapoint $i$ will have [...], the spring connecting datapoint $i$ to each of these too-distant map points will thus exert a very small attractive [...] forces are very small, the very large number of such forces crushes together the points in the center of the map, [...], but that it [...] slight repulsion to all springs was presented by Cook et al. [...]
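Both ingredients of the crowding argument can be checked with a few lines of Python. The construction below (the unit basis vectors plus one extra point on the diagonal) is one standard way to build mutually equidistant points and is an assumption of this sketch, as are the function names.

```python
import math
from itertools import combinations

def equidistant_points(m):
    """Return m + 1 mutually equidistant points in m dimensions.

    The m unit basis vectors are pairwise sqrt(2) apart. The extra point
    c * (1, ..., 1) is also sqrt(2) from each of them when
    m * c^2 - 2 * c - 1 = 0, that is, c = (1 + sqrt(m + 1)) / m.
    """
    points = [[1.0 if k == i else 0.0 for k in range(m)] for i in range(m)]
    c = (1.0 + math.sqrt(m + 1.0)) / m
    points.append([c] * m)
    return points

def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points."""
    return [math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
            for a, b in combinations(points, 2)]

def outer_shell_fraction(m):
    """Fraction of a ball's volume lying beyond half its radius, using the
    fact that volume scales as r ** m."""
    return 1.0 - 0.5 ** m

points = equidistant_points(10)
distances = pairwise_distances(points)
```

In ten dimensions all 55 pairwise distances among the 11 points coincide, while a plane can hold at most three mutually equidistant points. The shell fraction shows the second effect: in two dimensions three quarters of a disc's area lies beyond half its radius, whereas in ten dimensions almost all of the volume does, so far more points sit at moderate distances than a two-dimensional map has room for.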

