Visualizing Data using t-SNE

Journal of Machine Learning Research 9

[...], 5000 LE Tilburg [...]
[...] King's College Road, M5S 3G4 Toronto, ON, Canada

Editor: Yoshua Bengio

Abstract

We present a new technique called "t-SNE" that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds [...]. [...] we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of data sets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap [...].

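Keywords: visualization, dimensionality reduction, manifold learning, embedding algorithms [...]

© 2008 Laurens van der Maaten and Geoffrey Hinton.

Visualization of high-dimensional data is an important problem in many different domains, and deals with data of widely varying dimensionality. Cell nuclei that are relevant to breast cancer, for example, are described by approximately 30 variables (Street et al., 1993), whereas the pixel intensity vectors used to represent images or the word-count vectors used to represent documents typically have thousands of dimensions. Over the last few decades, a variety of techniques for the visualization of such high-dimensional data have been proposed, many of which are reviewed by de Oliveira and Levkowitz (2003). Important techniques include iconographic displays such as Chernoff faces (Chernoff, 1973), pixel-based techniques (Keim, 2000), and techniques that represent the dimensions in the data as vertices in a graph (Battista et al., 1994). Most of these techniques simply provide tools to display more than two data dimensions, and leave the interpretation of the data to the human observer. [...]

In contrast to the visualization techniques discussed above, dimensionality reduction methods convert the high-dimensional data set $X = \{x_1, x_2, \ldots, x_n\}$ into two or three-dimensional data $Y = \{y_1, y_2, \ldots, y_n\}$ that can be displayed in a scatterplot. In the paper, we refer to the low-dimensional data representation $Y$ as a map, and to the low-dimensional representations $y_i$ of individual datapoints as map points. [...] Various techniques for this problem have been proposed that differ in the type of structure they preserve. Traditional dimensionality reduction techniques such as Principal Components Analysis (PCA; Hotelling, 1933) and classical multidimensional scaling (MDS; Torgerson, 1952) are linear techniques that focus on keeping the low-dimensional representations of dissimilar datapoints far apart. For high-dimensional data that lies on or near a low-dimensional, non-linear manifold it is usually more important to keep the low-dimensional representations of very similar datapoints close together, which is typically not possible with a linear mapping.

A large number of nonlinear dimensionality reduction techniques that aim to preserve the local structure of data have been proposed, many of which are reviewed by Lee and Verleysen (2007).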

In particular, we mention the following seven techniques: (1) Sammon mapping (Sammon, 1969), (2) curvilinear components analysis (CCA; Demartines and Hérault, 1997), (3) Stochastic Neighbor Embedding (SNE; Hinton and Roweis, 2002), (4) Isomap (Tenenbaum et al., 2000), (5) Maximum Variance Unfolding (MVU; Weinberger et al., 2004), (6) Locally Linear Embedding (LLE; Roweis and Saul, 2000), and (7) Laplacian Eigenmaps (Belkin and Niyogi, 2002). Despite the strong performance of these techniques on artificial data sets, they are often not very successful at visualizing real, high-dimensional data. In particular, most of the techniques are not capable of retaining both the local and the global structure of the data in a single map. For instance, a recent study reveals that even a semi-supervised variant of MVU is not capable of separating handwritten digits into their natural clusters (Song et al., 2007).

In this paper, we describe a way of converting a high-dimensional data set into a matrix of pairwise similarities and we introduce a new technique, called "t-SNE", for visualizing the resulting similarity data. t-SNE is capable of capturing much of the local structure of the high-dimensional data very well, while also revealing global structure such as the presence of clusters at several scales. We illustrate the performance of t-SNE by comparing it to the seven dimensionality reduction techniques mentioned above on five data sets from a variety of domains. Because of space limitations, most of the (7 + 1) × 5 = 40 maps are presented in the supplemental material [...].

[...] we outline SNE as presented by Hinton and Roweis (2002) [...], we present t-SNE [...]. Section 5 shows how t-SNE can be modified to visualize real-world data sets that contain many more than 10[...].

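[...] more detail in [...].

Stochastic Neighbor Embedding (SNE) [...]. The similarity of datapoint $x_j$ to datapoint $x_i$ is the conditional probability, $p_{j|i}$, that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$. For nearby datapoints, $p_{j|i}$ is relatively high, whereas for widely separated datapoints, $p_{j|i}$ will be almost infinitesimal (for reasonable values of the variance of the Gaussian, $\sigma_i$). Mathematically, the conditional probability $p_{j|i}$ is given by

$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}, \tag{1}$$

where $\sigma_i$ is the variance of the Gaussian that is centered on datapoint $x_i$. The method for determining the value of $\sigma_i$ is presented later in this section. Because we are only interested in modeling pairwise similarities, we set the value of $p_{i|i}$ to zero. For the low-dimensional counterparts $y_i$ and $y_j$ of the high-dimensional datapoints $x_i$ and $x_j$, it is possible to compute a similar conditional probability, which we denote by $q_{j|i}$. We set (footnote 2) the variance of the Gaussian that is employed in the computation of the conditional probabilities $q_{j|i}$ to $\frac{1}{\sqrt{2}}$. Hence, we model the similarity of map point $y_j$ to map point $y_i$ by

$$q_{j|i} = \frac{\exp\left(-\|y_i - y_j\|^2\right)}{\sum_{k \neq i} \exp\left(-\|y_i - y_k\|^2\right)}.$$

Again, since we are only interested in modeling pairwise similarities, we set $q_{i|i} = 0$. If the map points $y_i$ and $y_j$ correctly model the similarity between the high-dimensional datapoints $x_i$ and $x_j$, [...]. [...], SNE aims to find a low-dimensional data representation that minimizes the mismatch between $p_{j|i}$ and $q_{j|i}$.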
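The two conditional probabilities can be illustrated with a minimal sketch in plain Python. The function names (`conditional_probabilities`, `map_conditional_probabilities`) and the list-of-tuples data layout are assumptions made for illustration, not part of the paper. The sketch treats $\sigma_i$ as the bandwidth that appears squared in Equation 1, and obtains $q_{j|i}$ by fixing that bandwidth to $1/\sqrt{2}$, so that $2\sigma^2 = 1$.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def conditional_probabilities(points, sigmas):
    """Return a matrix whose row i holds p_{j|i} for all j (Equation 1).

    The diagonal entry p_{i|i} is set to zero, as in the text. No numerical
    safeguards are included: if every weight in a row underflows to zero,
    the normalization divides by zero.
    """
    n = len(points)
    rows = []
    for i in range(n):
        weights = [
            0.0 if j == i
            else math.exp(-squared_distance(points[i], points[j])
                          / (2.0 * sigmas[i] ** 2))
            for j in range(n)
        ]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


def map_conditional_probabilities(map_points):
    """q_{j|i} in the map: same formula with sigma fixed to 1/sqrt(2)."""
    sigma = 1.0 / math.sqrt(2.0)
    return conditional_probabilities(map_points, [sigma] * len(map_points))
```

Each row of the returned matrix sums to one, and a nearby point receives a larger share of that mass than a distant one.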

A natural measure of the faithfulness with which $q_{j|i}$ models $p_{j|i}$ is the Kullback-Leibler divergence (which is in this case equal to the cross-entropy up to an additive constant). SNE minimizes the sum of Kullback-Leibler divergences over all datapoints using a gradient descent method. The cost function $C$ is given by

$$C = \sum_i KL(P_i \| Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}, \tag{2}$$

in which $P_i$ represents the conditional probability distribution over all other datapoints given datapoint $x_i$, and $Q_i$ represents the conditional probability distribution over all other map points given map point $y_i$. Because the Kullback-Leibler divergence is not symmetric, different types of error in the pairwise distances in the low-dimensional map are not weighted equally. In particular, there is a large cost for using widely separated map points to represent nearby datapoints (i.e., for using a small $q_{j|i}$ to model a large $p_{j|i}$), but there is only a small cost for using nearby map points to represent widely separated datapoints. [...] In other words, the SNE cost function focuses on retaining the local structure of the data in the map (for reasonable values of the variance of the Gaussian in the high-dimensional space, $\sigma_i$).

Footnote 1. [...] data sets that consist of pairwise similarities between objects rather than high-dimensional vector representations of each object [...]. [...], human word association data consists of the probability of producing each possible word in response to a given word, as a result of which it is [...].

Footnote 2. [...], we lose the property that the data is a perfect model of itself if we embed it in a space of the same dimensionality, because in the high-dimensional space, we used a different variance $\sigma_i$ [...].
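The cost in Equation 2 and the asymmetry of the Kullback-Leibler divergence can be checked with a short sketch. The function name `sne_cost` and the nested-list layout (row $i$ holds the distribution conditioned on point $i$) are assumptions for illustration. Terms with $p_{j|i} = 0$ are taken to contribute zero, which covers the zeroed diagonal.

```python
import math


def sne_cost(P, Q):
    """Sum over i of KL(P_i || Q_i), using the natural logarithm.

    P and Q are lists of rows; row i holds p_{j|i} (resp. q_{j|i}).
    Entries with p == 0 are skipped (0 * log 0 is taken as 0). A zero q
    paired with a positive p raises ZeroDivisionError, which mirrors the
    divergence being infinite in that case.
    """
    cost = 0.0
    for p_row, q_row in zip(P, Q):
        for p, q in zip(p_row, q_row):
            if p > 0.0:
                cost += p * math.log(p / q)
    return cost
```

Swapping the two arguments generally changes the value, which is why modeling a large $p_{j|i}$ with a small $q_{j|i}$ is penalized differently from the reverse error.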

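The remaining parameter to be selected is the variance $\sigma_i$ of the Gaussian that is centered over each high-dimensional datapoint, $x_i$. It is not likely that there is a single value of $\sigma_i$ that is optimal for all datapoints in the data set because the density of the data is likely to vary. In dense regions, a smaller value of $\sigma_i$ is usually more appropriate than in sparser regions. Any particular value of $\sigma_i$ induces a probability distribution, $P_i$, over all of the other datapoints. This distribution has an entropy which increases as $\sigma_i$ increases. SNE performs a binary search for the value of $\sigma_i$ that produces a $P_i$ with a fixed perplexity that is specified by the user. The perplexity is defined as

$$Perp(P_i) = 2^{H(P_i)},$$

where $H(P_i)$ is the Shannon entropy of $P_i$ measured in bits

$$H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}.$$

The perplexity can be interpreted as a smooth measure of the effective number of neighbors. The performance of SNE is fairly robust to changes in the perplexity, [...].

The minimization of the cost function in Equation 2 is performed using a gradient descent method. The gradient has a surprisingly simple form

$$\frac{\delta C}{\delta y_i} = 2 \sum_j \left(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}\right)(y_i - y_j).$$

Physically, the gradient may be interpreted as the resultant force created by a set of springs between the map point $y_i$ and all other map points $y_j$. All springs exert a force along the direction $(y_i - y_j)$.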
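The binary search for $\sigma_i$ can be sketched as follows. The function names, the search bracket `[1e-3, 1e3]`, and the iteration count are assumptions chosen for illustration; the paper does not specify them here. The sketch relies on the property stated above that the entropy of $P_i$ increases with $\sigma_i$. Subtracting the smallest squared distance before exponentiating is a numerical convenience that leaves the normalized probabilities unchanged.

```python
import math


def conditional_row(sq_dists, i, sigma):
    """p_{j|i} for all j, given squared distances from point i to every point."""
    nearest = min(d for j, d in enumerate(sq_dists) if j != i)
    weights = [
        0.0 if j == i else math.exp(-(d - nearest) / (2.0 * sigma ** 2))
        for j, d in enumerate(sq_dists)
    ]
    total = sum(weights)  # at least 1.0, because the nearest point has weight 1
    return [w / total for w in weights]


def perplexity(p_row):
    """Perp(P_i) = 2 ** H(P_i), with the entropy H measured in bits."""
    entropy_bits = -sum(p * math.log2(p) for p in p_row if p > 0.0)
    return 2.0 ** entropy_bits


def find_sigma(sq_dists, i, target_perplexity, lo=1e-3, hi=1e3, iterations=60):
    """Bisect on sigma until the row's perplexity matches the target.

    Assumes the target is attainable inside [lo, hi], i.e. it lies between
    1 and the number of other points.
    """
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if perplexity(conditional_row(sq_dists, i, mid)) > target_perplexity:
            hi = mid  # distribution too flat: shrink sigma
        else:
            lo = mid  # distribution too peaked: grow sigma
    return 0.5 * (lo + hi)
```

A uniform distribution over four neighbors has perplexity 4, which matches the reading of perplexity as an effective number of neighbors.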

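The spring between $y_i$ and $y_j$ repels or attracts the map points depending on whether the distance between the two in the map is too small or too large to represent the similarities between the two high-dimensional datapoints. The force exerted by the spring between $y_i$ and $y_j$ is proportional to its length, and also proportional to its stiffness, which is the mismatch $(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})$ [...].

The gradient descent is initialized by sampling map points randomly from an isotropic Gaussian with small variance that is centered around the origin. [...], a relatively large momentum term is added to the gradient. In other words, the current gradient is added to an exponentially decaying sum of previous gradients in order to determine the changes in the coordinates of the map points at each iteration of the gradient search. Mathematically, the gradient update with a momentum term is given by

$$Y^{(t)} = Y^{(t-1)} + \eta \frac{\delta C}{\delta Y} + \alpha(t) \left(Y^{(t-1)} - Y^{(t-2)}\right),$$

where $Y^{(t)}$ indicates the solution at iteration $t$, $\eta$ indicates the learning rate, and $\alpha(t)$ represents the momentum at iteration $t$.

In addition, in the early stages of the optimization, Gaussian noise is added to the map points after each iteration. [...] If the variance of the noise changes very slowly at the critical point at which the global structure of the map starts to form, SNE tends to find maps with a better global organization. Unfortunately, this requires sensible choices of the initial amount of Gaussian noise and the rate at which it decays. [...] It is therefore common to run the optimization several times on a data set [...]. [...], SNE is inferior to methods that allow convex optimization and it [...].

[...] discussed SNE as it was presented by Hinton and Roweis (2002).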
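The momentum update can be sketched in a few lines. The helper names (`momentum_step`, `run_descent`) and the constant momentum value are assumptions for illustration; the added Gaussian noise described in the text is deliberately left out. The sketch steps against the gradient, which is what descent on $C$ requires, whereas the update formula as printed writes the learning-rate term with a plus sign.

```python
def momentum_step(Y_prev, Y_prev2, grad, learning_rate, momentum):
    """One update: Y(t) = Y(t-1) - eta * grad + alpha * (Y(t-1) - Y(t-2)).

    All arguments except the two scalars are lists of coordinate lists
    with identical shapes.
    """
    return [
        [y1 - learning_rate * g + momentum * (y1 - y2)
         for y1, y2, g in zip(row1, row2, grad_row)]
        for row1, row2, grad_row in zip(Y_prev, Y_prev2, grad)
    ]


def run_descent(gradient_fn, Y0, learning_rate, momentum, iterations):
    """Iterate momentum_step, starting with zero velocity (Y(-1) = Y(0))."""
    Y_prev2 = [row[:] for row in Y0]
    Y_prev = [row[:] for row in Y0]
    for _ in range(iterations):
        Y_new = momentum_step(Y_prev, Y_prev2, gradient_fn(Y_prev),
                              learning_rate, momentum)
        Y_prev2, Y_prev = Y_prev, Y_new
    return Y_prev
```

As a usage check, passing the gradient of a simple quadratic bowl, `lambda Y: [row[:] for row in Y]`, drives every coordinate toward zero. For SNE, `gradient_fn` would instead evaluate the spring-force gradient given earlier.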

Although SNE constructs reasonably good visualizations, it is hampered by a cost function that is difficult to optimize and by a problem we refer to as the "crowding problem". In this section, we present a new technique called "t-Distributed Stochastic Neighbor Embedding" or "t-SNE" [...] ways: (1) it uses a symmetrized version of the SNE cost function with simpler gradients that was briefly introduced by Cook et al. (2007) and (2) it uses a Student-t distribution rather than a Gaussian to compute the similarity between two points in the low-dimensional space. t-SNE employs a [...]. [...], we first discuss the symmetric version of SNE. Subsequently, we discuss the crowding problem, and the use of heavy-tailed distributions to address this problem. We conclude the section by describing our approach to the optimization of the t-SNE cost function.

As an alternative to minimizing the sum of the Kullback-Leibler divergences between the conditional probabilities $p_{j|i}$ and $q_{j|i}$, it is also possible to minimize a single Kullback-Leibler divergence between a joint probability distribution, $P$, in the high-dimensional space and a joint probability distribution, $Q$, in the low-dimensional space:

$$C = KL(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}},$$

where again, we set $p_{ii}$ and $q_{ii}$ to zero. We refer to this type of SNE as symmetric SNE, because it has the property that $p_{ij} = p_{ji}$ and $q_{ij} = q_{ji}$ for $\forall i, j$.

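In symmetric SNE, the pairwise similarities in the low-dimensional map $q_{ij}$ are given by

$$q_{ij} = \frac{\exp\left(-\|y_i - y_j\|^2\right)}{\sum_{k \neq l} \exp\left(-\|y_k - y_l\|^2\right)}. \tag{3}$$

The obvious way to define the pairwise similarities in the high-dimensional space $p_{ij}$ is

$$p_{ij} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma^2\right)}{\sum_{k \neq l} \exp\left(-\|x_k - x_l\|^2 / 2\sigma^2\right)},$$

but this causes problems when a high-dimensional datapoint $x_i$ is an outlier (i.e., all pairwise distances $\|x_i - x_j\|^2$ are large for $x_i$). For such an outlier, the values of $p_{ij}$ are extremely small for all $j$, so the location of its low-dimensional map point $y_i$ has very little effect on the cost function. As a result, the position of the map point is not well determined by the positions of the other map points. We circumvent this problem by defining the joint probabilities $p_{ij}$ in the high-dimensional space to be the symmetrized conditional probabilities, that is, we set $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$. This ensures that $\sum_j p_{ij} > \frac{1}{2n}$ for all datapoints $x_i$, as a result of which each datapoint $x_i$ makes a significant contribution to the cost function. In the low-dimensional space, symmetric SNE simply uses Equation 3. The main advantage of the symmetric version of SNE is the simpler form of its gradient, [...]. The gradient of symmetric SNE is fairly similar to that of asymmetric SNE, and is given by

$$\frac{\delta C}{\delta y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j).$$

Footnote 3. [...] visualization of the data is not nearly as problematic as picking the model that does best on a [...]. [...], the aim is to see the structure in the training data, [...].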
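The symmetrization and the symmetric SNE gradient can be sketched and checked numerically. All function names below are assumptions for illustration, and the inputs are plain nested lists. The sketch builds $p_{ij} = (p_{j|i} + p_{i|j}) / 2n$ from a matrix of conditional probabilities, computes $q_{ij}$ from Equation 3, and evaluates the gradient $4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)$.

```python
import math


def symmetrize(P_cond):
    """Joint p_ij = (p_{j|i} + p_{i|j}) / (2n); row i of P_cond holds p_{j|i}."""
    n = len(P_cond)
    return [[(P_cond[i][j] + P_cond[j][i]) / (2.0 * n) for j in range(n)]
            for i in range(n)]


def joint_q(Y):
    """Joint q_ij of Equation 3, normalized over all pairs k != l."""
    n = len(Y)
    weights = [
        [0.0 if i == j
         else math.exp(-sum((a - b) ** 2 for a, b in zip(Y[i], Y[j])))
         for j in range(n)]
        for i in range(n)
    ]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


def symmetric_cost(P, Q):
    """C = KL(P || Q) over all pairs, skipping entries with p_ij = 0."""
    return sum(p * math.log(p / q)
               for p_row, q_row in zip(P, Q)
               for p, q in zip(p_row, q_row) if p > 0.0)


def symmetric_gradient(P, Y):
    """dC/dy_i = 4 * sum_j (p_ij - q_ij) * (y_i - y_j) for every map point."""
    Q = joint_q(Y)
    n, dim = len(Y), len(Y[0])
    grad = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            coeff = 4.0 * (P[i][j] - Q[i][j])
            for d in range(dim):
                grad[i][d] += coeff * (Y[i][d] - Y[j][d])
    return grad
```

Comparing `symmetric_gradient` with central finite differences of `symmetric_cost` on a small example is a quick way to confirm that the closed form and the cost agree.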

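In preliminary experiments, we observed that symmetric SNE seems to produce maps that are just as good as asymmetric SNE, and sometimes even a little better.

Consider a set of datapoints that lie on a two-dimensional curved manifold which is approximately linear on a small scale, and which is embedded within a higher-dimensional space. It is possible to model the small pairwise distances between datapoints fairly well in a two-dimensional map, which is often illustrated on toy examples such as the "Swiss roll" data set. Now suppose that the manifold has ten intrinsic dimensions and is embedded within a space of much higher dimensionality. There are several reasons why the pairwise distances in a two-dimensional map cannot faithfully model distances between points on the ten-dimensional manifold. For instance, in ten dimensions, it is possible to have 11 datapoints that are mutually equidistant and there is no way to model this faithfully in a two-dimensional map. A related problem is the very different distribution of pairwise distances in the two spaces. The volume of a sphere centered on datapoint $i$ scales as $r^m$, where $r$ is the radius and $m$ the dimensionality of the sphere. So if the datapoints are approximately uniformly distributed in the region around $i$ on the ten-dimensional manifold, and we try to model the distances from $i$ to the other datapoints in the two-dimensional map, we get the following "crowding problem": the area of the two-dimensional map that is available to accommodate moderately distant datapoints will not be nearly large enough compared with the area available to accommodate nearby datapoints. Hence, if we want to model the small distances accurately in the map, most of the points that are at a moderate distance from datapoint $i$ will have to be placed much too far away in the two-dimensional map. In SNE, the spring connecting datapoint $i$ to each of these too-distant map points will thus exert a very small attractive force. Although these attractive forces are very small, the very large number of such forces crushes together the points in the center of the map, [...]. Note that the crowding problem is not specific to SNE, but that it [...]. [...] slight repulsion to all springs was presented by Cook et al.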
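Both ingredients of the crowding argument can be made concrete with a small sketch. The function names are assumptions for illustration. The first function uses the $r^m$ volume scaling to compute what fraction of a ball lies beyond a given fraction of its radius. The second builds $m + 1$ mutually equidistant points as the unit vectors of $\mathbb{R}^{m+1}$, which span an $m$-dimensional affine subspace; this is one standard construction, not necessarily the one the authors had in mind.

```python
import math


def outer_shell_fraction(m, inner_ratio):
    """Fraction of an m-dimensional ball's volume beyond inner_ratio * radius.

    Volume inside radius r scales as r ** m, so the inner part holds
    inner_ratio ** m of the total.
    """
    return 1.0 - inner_ratio ** m


def simplex_vertices(m):
    """m + 1 mutually equidistant points (unit vectors of R^(m+1))."""
    return [[1.0 if k == i else 0.0 for k in range(m + 1)]
            for i in range(m + 1)]


def pairwise_distances(points):
    """Euclidean distances between all unordered pairs of points."""
    return [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
        for idx, p in enumerate(points)
        for q in points[idx + 1:]
    ]
```

With uniformly distributed points, `outer_shell_fraction(10, 0.5)` shows that almost all neighbors in ten dimensions sit in the outer half of the radius, while `outer_shell_fraction(2, 0.5)` shows that a two-dimensional map offers proportionally far less room at that range. `pairwise_distances(simplex_vertices(10))` returns 55 equal distances for the 11 points.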
