
Hierarchical Clustering - Princeton University


Hierarchical Clustering Ryan P. Adams COS 324 – Elements of Machine Learning Princeton University K-Means clustering is a good general-purpose way to think about discovering groups in data, but there are several aspects of it that are unsatisfying. For one, it …


Transcription of Hierarchical Clustering - Princeton University

K-Means clustering is a good general-purpose way to think about discovering groups in data, … it requires the user to specify the number of clusters in advance, … K-Means is nondeterministic; the solution it finds will depend on the initialization, and even good initialization algorithms such as K-Means++ … we partition organisms into different species, but science has also developed a rich taxonomy of living things: kingdom, phylum, class, … (usually binary) … "clusters of clusters" … (HAC) starts at the bottom, with every datum in its own singleton cluster, … this algorithm maintains an "active set" … shaped paths, where the legs show …

Algorithm 1 Hierarchical Agglomerative Clustering
Note: written for clarity, …
1: Input: Data vectors $\{x_n\}_{n=1}^N$, group-wise distance $D(G, G')$
2: $A \leftarrow \emptyset$
3: for $n \leftarrow 1 \ldots N$ do
4:     $A \leftarrow A \cup \{\{x_n\}\}$
5: end for
6: $T \leftarrow A$
7: while $|A| > 1$ do
8:     $G_1^\star, G_2^\star \leftarrow \arg\min_{G_1, G_2 \in A;\ G_1 \neq G_2} D(G_1, G_2)$
9:     $A \leftarrow (A \setminus \{G_1^\star\}) \setminus \{G_2^\star\}$
10:     $A \leftarrow A \cup \{G_1^\star \cup G_2^\star\}$
11:     $T \leftarrow T \cup \{G_1^\star \cup G_2^\star\}$
12: end while
13: Return: $T$

… and a valid dendrogram … the distance between two merged groups $G$ and $G'$ must always be greater than or equal to the distance between any of the previously-merged subgroups that formed $G$ and $G'$.
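The loop in Algorithm 1 can be sketched directly in Python. This is a minimal, naive version under stated assumptions: points are tuples of floats, the group-wise distance defaults to the single-linkage criterion, and the function names (`hac`, `single_link`) are this sketch's own rather than anything from the notes. It records each merge with its distance instead of building an explicit tree.

```python
import math
from itertools import combinations


def single_link(g1, g2):
    """Smallest Euclidean distance over all cross-group pairs (assumed default)."""
    return min(math.dist(x, y) for x in g1 for y in g2)


def hac(points, group_dist=single_link):
    """Naive HAC: return the merges, in order, as (group_a, group_b, distance)."""
    # Active set: every datum starts in its own singleton group.
    active = [(tuple(p),) for p in points]
    merges = []
    while len(active) > 1:
        # Exhaustive search over every pair of distinct active groups.
        i, j = min(
            combinations(range(len(active)), 2),
            key=lambda ij: group_dist(active[ij[0]], active[ij[1]]),
        )
        a, b = active[i], active[j]
        merges.append((a, b, group_dist(a, b)))
        # Remove both groups from the active set and add their union.
        active = [g for k, g in enumerate(active) if k not in (i, j)]
        active.append(a + b)
    return merges


merges = hac([(0.0,), (1.0,), (5.0,), (6.5,)])
heights = [d for _, _, d in merges]
```

On this toy input the merge distances never decrease, which is the monotonicity condition a valid dendrogram needs. The pair search is recomputed from scratch on every pass, so the sketch favors readability over speed.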

Figure 1b shows a dendrogram for a set of professional basketball players, … we might observe that this structure seems to correspond to position; all of the players in the bottom subtree between Dwight Howard and Paul Millsap are centers or power forwards (except for Paul Pierce, who is considered more of a small forward) … (…, Stephen Curry and Tony Parker) and shooting guards (…, Dwayne Wade and Kobe Bryant). At the top are Kevin Durant and LeBron James, …; … the $D(G, G')$ … we looked at distances between data items; … we consider the distances between two groups $G = \{x_n\}_{n=1}^N$ and $G' = \{y_m\}_{m=1}^M$, where $N$ and $M$ are …

Footnote 1: These are not necessarily "distances" …

Figure 1: (a) Pairwise Distances (b) Single-Linkage Dendrogram. These figures demonstrate hierarchical agglomerative clustering of high-scoring professional basketball players in the NBA, based on a set of normalized features such as assists and rebounds per game, … (a) …: power forwards and centers appear in the bottom right, point guards in the middle block, with some unusual players in the top right. (b) …

… "linkages". …

$$D_{\mathrm{SL}}\left(\{x_n\}_{n=1}^N, \{y_m\}_{m=1}^M\right) = \min_{n,m} \|x_n - y_m\|, \tag{1}$$

where $\|x - y\|$ … when the algorithm terminates, … (subject to not adding a loop), … to merge two clusters with the single-linkage criterion, … "chaining" …

…: Rather than choosing the shortest distance, in complete-linkage clustering the distance between two groups is determined by the largest distance over all possible pairs, …

$$D_{\mathrm{CL}}\left(\{x_n\}_{n=1}^N, \{y_m\}_{m=1}^M\right) = \max_{n,m} \|x_n - y_m\|, \tag{2}$$

where again $\|x - y\|$ …

…: Rather than the worst or best distances, when using the average-linkage criterion we average over all possible pairs between the groups:

$$D_{\mathrm{A}}\left(\{x_n\}_{n=1}^N, \{y_m\}_{m=1}^M\right) = \frac{1}{NM} \sum_{n=1}^N \sum_{m=1}^M \|x_n - y_m\|. \tag{3}$$

… 3b, …

…: Another alternative approach to computing the distance between clusters is to look at the difference between their centroids:

$$D_{\mathrm{C}}\left(\{x_n\}_{n=1}^N, \{y_m\}_{m=1}^M\right) = \left\| \left(\frac{1}{N} \sum_{n=1}^N x_n\right) - \left(\frac{1}{M} \sum_{m=1}^M y_m\right) \right\|. \tag{4}$$

Note that this is something that only makes sense if an average of data items is sensible; … or instance-based, … it is useful to think about the space of possible things that can be learned from data, …

Figure 2: (a) Single-Linkage (b) Complete-Linkage (c) Average-Linkage (d) Centroid. Four different types of linkage criteria for hierarchical agglomerative clustering (HAC). (a) Single linkage looks at minimum distance between all inter-group pairs. (b) Complete linkage looks at the maximum distance between all inter-group pairs. (c) Average linkage uses the average distance between all inter-group pairs. (d) …

… (2008). … 10, 100, … where the rows are animals such as "lion" and "german shepherd" while the features are columns such as "tall" and "jungle". Figure 6b shows a matrix of Hamming distances in this feature space, … ~ckemp/ …

Figure 3: (a) Single-Linkage (b) Complete-Linkage (c) Average-Linkage (d) Centroid; axes: Duration and Wait Time. These figures show clusterings from the four different group distance criteria, … HAC was run, the tree was truncated at six groups, and these groups are shown as different colors.
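The four criteria in Equations (1) to (4) differ only in how they summarize the cross-group distances, which a small Python sketch may make concrete. It assumes Euclidean distance for the norm and represents each group as a list of coordinate tuples; the function names are illustrative and not taken from the notes.

```python
import math


def pairwise(g1, g2):
    """All Euclidean distances between a point of g1 and a point of g2."""
    return [math.dist(x, y) for x in g1 for y in g2]


def dist_single(g1, g2):
    """Equation (1): the closest cross-group pair."""
    return min(pairwise(g1, g2))


def dist_complete(g1, g2):
    """Equation (2): the farthest cross-group pair."""
    return max(pairwise(g1, g2))


def dist_average(g1, g2):
    """Equation (3): mean over all N * M cross-group pairs."""
    d = pairwise(g1, g2)
    return sum(d) / len(d)


def centroid(g):
    """Coordinate-wise mean of the points in a group."""
    return [sum(coords) / len(g) for coords in zip(*g)]


def dist_centroid(g1, g2):
    """Equation (4): distance between the two group means."""
    return math.dist(centroid(g1), centroid(g2))


G = [(0.0, 0.0), (2.0, 0.0)]
H = [(5.0, 0.0), (9.0, 0.0)]
results = {
    "single": dist_single(G, H),
    "complete": dist_complete(G, H),
    "average": dist_average(G, H),
    "centroid": dist_centroid(G, H),
}
```

For these two collinear groups the cross-group distances are 5, 9, 3 and 7, so single linkage gives 3, complete linkage gives 9, and average linkage gives 6. The centroid criterion also gives 6 here, but only because the points are collinear; in general it differs from the average.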

Figure 4: (a) Single-Linkage (b) Complete-Linkage (c) Average-Linkage (d) Centroid. These figures show clusterings from the four different group distance criteria, applied to 1500 synthetic "pinwheel" … HAC was run, the tree was truncated at three groups, and these groups are shown as different colors. (a) The single-linkage criterion can give stringy clusters, so it can capture the pinwheel shapes. (b-d) Complete, average, …

…: National Governments and Demographics. These data are binary properties of 14 nations (collected in 1965), … Burma, China, Cuba, Egypt, India, Indonesia, Israel, Jordan, the Netherlands, Poland, the USSR, the United Kingdom, … social structures, …

Figure 5: panels (a) to (d) show Interpoint Distances; panel (d) is labeled 1000 Dimensions. … (zero) and maximum (… D) …

… it first divides the data into K clusters using … K-Means or K-Medoids, … it suffers from all of the difficulties and non-determinism of flat clustering, …

Chapter 17 of Manning et al. (2008) is freely available online and is an excellent resource.
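The figure captions describe running HAC and truncating the tree at a fixed number of groups. One way to read that, sketched below, is to stop merging once K active groups remain. The sketch assumes complete linkage and a tiny hand-made data set; `hac_truncated` and `complete_link` are hypothetical names for illustration only.

```python
import math
from itertools import combinations


def complete_link(g1, g2):
    """Largest Euclidean distance over all cross-group pairs."""
    return max(math.dist(x, y) for x in g1 for y in g2)


def hac_truncated(points, k, group_dist=complete_link):
    """Merge the closest pair of groups until only k groups remain."""
    groups = [[tuple(p)] for p in points]
    while len(groups) > k:
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda ij: group_dist(groups[ij[0]], groups[ij[1]]),
        )
        merged = groups[i] + groups[j]
        groups = [g for idx, g in enumerate(groups) if idx not in (i, j)]
        groups.append(merged)
    return groups


data = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
clusters = hac_truncated(data, k=3)
normalized = sorted(sorted(g) for g in clusters)
```

Stopping early yields the same groups as building the full tree and undoing the last K - 1 merges, so the choice here is purely one of convenience.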

Duda et al. (2001) …

…: written for clarity, …
1: Input: Data vectors $\{x_n\}_{n=1}^N$, flat clustering procedure $\mathrm{FC}(G, K)$
2:
3: function $\mathrm{SD}(G, K)$
4:     $\{H_k\}_{k=1}^K \leftarrow \mathrm{FC}(G, K)$
5:     $S \leftarrow \emptyset$
6:     for $k \leftarrow 1 \ldots K$ do
7:         if $|H_k| = 1$ then
8:             $S \leftarrow S \cup \{H_k\}$
9:         else
10:             $S \leftarrow S \cup \mathrm{SD}(H_k, K)$
11:         end if
12:     end for
13:     Return: $S$
14: end function
15:
16: Return: $\mathrm{SD}(\{x_n\}_{n=1}^N, K)$

… Prabhakar Raghavan, and Hinrich Sch… …

Figure 6: (a) Animals and Features (b) Hamming Distances (c) Average-Linkage Dendrogram. These figures show the result of running HAC on a data set of 50 animals, each with 85 binary features. (a) The feature matrix, where the rows are animals and the columns are binary features. (b) … (c) …

Figure 7: (a) Nations and Features (b) Euclidean Distances (c) Complete-Linkage Dendrogram. These figures show the result of running HAC on a data set of 14 nations, with binary features. (a) The feature matrix, … (b) … (c) …

Figure 8: (a) Senators and Votes (b) Euclidean Distances (c) Average-Linkage Dendrogram. These figures show the result of running HAC on a data set of 104 senators in the 113th US congress, with binary features corresponding to votes on 172 bills. (a) The feature matrix, … (b) … (c) A dendrogram arising from HAC with complete linkage.
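The recursive divisive procedure listed above (flat-cluster a group into K parts, then recurse on every part that is not a singleton) can be sketched as follows. Two assumptions are worth flagging. First, `chunk_flat_cluster` is a hypothetical, deterministic stand-in for a real flat clusterer such as K-Means or K-Medoids: it only sorts one-dimensional values and cuts them into contiguous chunks. Second, the sketch nests the recursive results so the hierarchy stays visible, whereas the pseudocode accumulates a flat union.

```python
def chunk_flat_cluster(group, k):
    """Hypothetical flat clusterer: sort, then cut into up to k contiguous chunks."""
    items = sorted(group)
    k = min(k, len(items))
    size, extra = divmod(len(items), k)
    chunks, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def subdivide(group, k, flat_cluster=chunk_flat_cluster):
    """Recursively split a non-empty group until every leaf is a single datum."""
    if k < 2:
        # With k == 1 the group would never shrink and recursion would not end.
        raise ValueError("k must be at least 2")
    tree = []
    for part in flat_cluster(group, k):
        if len(part) == 1:
            tree.append(part[0])  # singleton: becomes a leaf
        else:
            tree.append(subdivide(part, k, flat_cluster))
    return tree


tree = subdivide([1.0, 1.2, 5.0, 5.1, 9.0, 9.3, 9.4], k=2)
```

With a real flat clusterer in place of the stand-in, the result inherits that clusterer's sensitivity to initialization, and a guard against a part that equals the whole group would be needed to guarantee termination.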

