Transcription of Cluster Analysis: Basic Concepts and Algorithms
1 8 Cluster Analysis: Basic Concepts andAlgorithmsCluster analysis divides data into groups (clusters) that are meaningful, useful,or both. If meaningful groups are the goal, then the clusters should capture thenatural structure of the data . In some cases, however, Cluster analysis is only auseful starting point for other purposes, such as data summarization. Whetherfor understanding or utility, Cluster analysis has long played an importantrole in a wide variety of fields: psychology and other social sciences, biology,statistics, pattern recognition, information retrieval, machine learning, anddata have been many applications of Cluster analysis to practical prob-lems.
2 We provide some specific examples, organized by whether the purposeof the clustering is understanding or for UnderstandingClasses, or conceptually meaningful groupsof objects that share common characteristics, play an important role in howpeople analyze and describe the world. Indeed, human beings are skilled atdividing objects into groups (clustering) and assigning particular objects tothese groups (classification). For example, even relatively young children canquickly label the objects in a photograph as buildings, vehicles, people, ani-mals, plants, etc.
3 In the context of understanding data , clusters are potentialclasses and Cluster analysis is the study of techniques for automatically findingclasses. The following are some examples:488 Chapter 8 Cluster analysis : Basic Concepts and Algorithms have spent many years creating a taxonomy (hi-erarchical classification) of all living things: kingdom, phylum, class,order, family, genus, and species. Thus, it is perhaps not surprising thatmuch of the early work in Cluster analysis sought to create a disciplineof mathematical taxonomy that could automatically find such classifi-cation structures.
4 More recently, biologists have applied clustering toanalyze the large amounts of genetic information that are now example, clustering has been used to find groups of genes that havesimilar functions. Information World Wide Web consists of billions ofWeb pages, and the results of a query to a search engine can returnthousands of pages. Clustering can be used to group these search re-sults into a small number of clusters, each of which captures a particularaspect of the query. For instance, a query of movie might returnWeb pages grouped into categories such as reviews, trailers, stars, andtheaters.
5 Each category ( Cluster ) can be broken into subcategories (sub-clusters), producing a hierarchical structure that further assists a user sexploration of the query results. the Earth s climate requires finding patternsin the atmosphere and ocean. To that end, Cluster analysis has beenapplied to find patterns in the atmospheric pressure of polar regions andareas of the ocean that have a significant impact on land climate. Psychology and illness or condition frequently has anumber of variations, and Cluster analysis can be used to identify thesedifferent subcategories.
6 For example, clustering has been used to identifydifferent types of depression. Cluster analysis can also be used to detectpatterns in the spatial or temporal distribution of a disease. collect large amounts of information on currentand potential customers. Clustering can be used to segment customersinto a small number of groups for additional analysis and for UtilityCluster analysis provides an abstraction from in-dividual data objects to the clusters in which those data objects reside. Ad-ditionally, some clustering techniques characterize each Cluster in terms of acluster prototype; , a data object that is representative of the other ob-jects in the Cluster .
7 These Cluster prototypes can be used as the basis for a489number of data analysis or data processing techniques. Therefore, in the con-text of utility, Cluster analysis is the study of techniques for finding the mostrepresentative Cluster prototypes. data analysis techniques, such as regression orPCA, have a time or space complexity ofO(m2) or higher (wheremisthe number of objects), and thus, are not practical for large data , instead of applying the algorithm to the entire data set, it canbe applied to a reduced data set consisting only of Cluster on the type of analysis , the number of prototypes, and theaccuracy with which the prototypes represent the data , the results canbe comparable to those that would have been obtained if all the datacould have been used.
8 Prototypes can also be used for data compres-sion. In particular, a table is created that consists of the prototypes foreach Cluster ; , each prototype is assigned an integer value that is itsposition (index) in the table. Each object is represented by the indexof the prototype associated with its Cluster . This type of compression isknown asvector quantizationand is often applied to image, sound,and video data , where (1) many of the data objects are highly similarto one another, (2) some loss of information is acceptable, and (3) asubstantial reduction in the data size is desired.
9 Efficiently Finding Nearest nearest neighborscan require computing the pairwise distance between all points. Oftenclusters and their Cluster prototypes can be found much more objects are relatively close to the prototype of their Cluster , then we canuse the prototypes to reduce the number of distance computations thatare necessary to find the nearest neighbors of an object. Intuitively, if twocluster prototypes are far apart, then the objects in the correspondingclusters cannot be nearest neighbors of each other. Consequently, tofind an object s nearest neighbors it is only necessary to compute thedistance to objects in nearby clusters, where the nearness of two clustersis measured by the distance between their prototypes.
10 This idea is mademore precise in Exercise 25 on page chapter provides an introduction to Cluster analysis . We begin witha high-level overview of clustering, including a discussion of the various ap-proaches to dividing objects into sets of clusters and the different types ofclusters. We then describe three specific clustering techniques that represent490 Chapter 8 Cluster analysis : Basic Concepts and Algorithmsbroad categories of Algorithms and illustrate a variety of Concepts : K-means,agglomerative hierarchical clustering, and DBSCAN. The final section of thischapter is devoted to Cluster validity methods for evaluating the goodnessof the clusters produced by a clustering algorithm.