Transcription of Cluster Analysis: Basic Concepts and Algorithms
1 8 Cluster Analysis: Basic Concepts andAlgorithmsCluster analysis divides data into groups (clusters) that are meaningful, useful,or both. If meaningful groups are the goal, then the clusters should capture thenatural structure of the data. In some cases, however, Cluster analysis is only auseful starting point for other purposes, such as data summarization. Whetherfor understanding or utility, Cluster analysis has long played an importantrole in a wide variety of fields: psychology and other social sciences, biology,statistics, pattern recognition, information retrieval, machine learning, anddata have been many applications of Cluster analysis to practical prob-lems. We provide some specific examples, organized by whether the purposeof the clustering is understanding or for UnderstandingClasses, or conceptually meaningful groupsof objects that share common characteristics, play an important role in howpeople analyze and describe the world.
2 Indeed, human beings are skilled atdividing objects into groups (clustering) and assigning particular objects tothese groups (classification). For example, even relatively young children canquickly label the objects in a photograph as buildings, vehicles, people, ani-mals, plants, etc. In the context of understanding data, clusters are potentialclasses and Cluster analysis is the study of techniques for automatically findingclasses. The following are some examples:488 Chapter 8 Cluster analysis : Basic Concepts and Algorithms have spent many years creating a taxonomy (hi-erarchical classification) of all living things: kingdom, phylum, class,order, family, genus, and species. Thus, it is perhaps not surprising thatmuch of the early work in Cluster analysis sought to create a disciplineof mathematical taxonomy that could automatically find such classifi-cation structures. More recently, biologists have applied clustering toanalyze the large amounts of genetic information that are now example, clustering has been used to find groups of genes that havesimilar functions.
3 Information World Wide Web consists of billions ofWeb pages, and the results of a query to a search engine can returnthousands of pages. Clustering can be used to group these search re-sults into a small number of clusters, each of which captures a particularaspect of the query. For instance, a query of movie might returnWeb pages grouped into categories such as reviews, trailers, stars, andtheaters. Each category ( Cluster ) can be broken into subcategories (sub-clusters), producing a hierarchical structure that further assists a user sexploration of the query results. the Earth s climate requires finding patternsin the atmosphere and ocean. To that end, Cluster analysis has beenapplied to find patterns in the atmospheric pressure of polar regions andareas of the ocean that have a significant impact on land climate. Psychology and illness or condition frequently has anumber of variations, and Cluster analysis can be used to identify thesedifferent subcategories.
4 For example, clustering has been used to identifydifferent types of depression. Cluster analysis can also be used to detectpatterns in the spatial or temporal distribution of a disease. collect large amounts of information on currentand potential customers. Clustering can be used to segment customersinto a small number of groups for additional analysis and for UtilityCluster analysis provides an abstraction from in-dividual data objects to the clusters in which those data objects reside. Ad-ditionally, some clustering techniques characterize each Cluster in terms of acluster prototype; , a data object that is representative of the other ob-jects in the Cluster . These Cluster prototypes can be used as the basis for a489number of data analysis or data processing techniques. Therefore, in the con-text of utility, Cluster analysis is the study of techniques for finding the mostrepresentative Cluster prototypes.
5 Data analysis techniques, such as regression orPCA, have a time or space complexity ofO(m2) or higher (wheremisthe number of objects), and thus, are not practical for large data , instead of applying the algorithm to the entire data set, it canbe applied to a reduced data set consisting only of Cluster on the type of analysis , the number of prototypes, and theaccuracy with which the prototypes represent the data, the results canbe comparable to those that would have been obtained if all the datacould have been used. prototypes can also be used for data compres-sion. In particular, a table is created that consists of the prototypes foreach Cluster ; , each prototype is assigned an integer value that is itsposition (index) in the table. Each object is represented by the indexof the prototype associated with its Cluster . This type of compression isknown asvector quantizationand is often applied to image, sound,and video data, where (1) many of the data objects are highly similarto one another, (2) some loss of information is acceptable, and (3) asubstantial reduction in the data size is desired.
6 Efficiently Finding Nearest nearest neighborscan require computing the pairwise distance between all points. Oftenclusters and their Cluster prototypes can be found much more objects are relatively close to the prototype of their Cluster , then we canuse the prototypes to reduce the number of distance computations thatare necessary to find the nearest neighbors of an object. Intuitively, if twocluster prototypes are far apart, then the objects in the correspondingclusters cannot be nearest neighbors of each other. Consequently, tofind an object s nearest neighbors it is only necessary to compute thedistance to objects in nearby clusters, where the nearness of two clustersis measured by the distance between their prototypes. This idea is mademore precise in Exercise 25 on page chapter provides an introduction to Cluster analysis . We begin witha high-level overview of clustering, including a discussion of the various ap-proaches to dividing objects into sets of clusters and the different types ofclusters.
7 We then describe three specific clustering techniques that represent490 Chapter 8 Cluster analysis : Basic Concepts and Algorithmsbroad categories of Algorithms and illustrate a variety of Concepts : K-means,agglomerative hierarchical clustering, and DBSCAN. The final section of thischapter is devoted to Cluster validity methods for evaluating the goodnessof the clusters produced by a clustering algorithm. More advanced clusteringconcepts and Algorithms will be discussed in Chapter 9. Whenever possible,we discuss the strengths and weaknesses of different schemes. In addition,the bibliographic notes provide references to relevant books and papers thatexplore Cluster analysis in greater OverviewBefore discussing specific clustering techniques, we provide some necessarybackground. First, we further define Cluster analysis , illustrating why it isdifficult and explaining its relationship to other techniques that group we explore two important topics: (1) different ways to group a set ofobjects into a set of clusters, and (2) types of What Is Cluster analysis ?
8 Cluster analysis groups data objects based only on information found in thedata that describes the objects and their relationships. The goal is that theobjects within a group be similar (or related) to one another and different from(or unrelated to) the objects in other groups. The greater the similarity (orhomogeneity) within a group and the greater the difference between groups,the better or more distinct the many applications, the notion of a Cluster is not well defined. To betterunderstand the difficulty of deciding what constitutes a Cluster , consider , which shows twenty points and three different ways of dividing them intoclusters. The shapes of the markers indicate Cluster membership. (b) and (d) divide the data into two and six parts, respectively. However,the apparent division of each of the two larger clusters into three subclustersmay simply be an artifact of the human visual system. Also, it may not beunreasonable to say that the points form four clusters, as shown in (c).
9 This figure illustrates that the definition of a Cluster is imprecise andthat the best definition depends on the nature of data and the desired analysis is related to other techniques that are used to divide dataobjects into groups. For instance, clustering can be regarded as a form ofclassification in that it creates a labeling of objects with class ( Cluster ) , it derives these labels only from the data. In contrast, (a) Original points.(b) Two clusters.(c) Four clusters.(d) Six ways of clustering the same set of the sense of Chapter 4 issupervised classification; , new, unlabeledobjects are assigned a class label using a model developed from objects withknown class labels. For this reason, Cluster analysis is sometimes referredto asunsupervised classification. When the term classification is usedwithout any qualification within data mining, it typically refers to , while the termssegmentationandpartitioningare sometimesused as synonyms for clustering, these terms are frequently used for approachesoutside the traditional bounds of Cluster analysis .
10 For example, the termpartitioning is often used in connection with techniques that divide graphs intosubgraphs and that are not strongly connected to clustering. Segmentationoften refers to the division of data into groups using simple techniques; ,an image can be split into segments based only on pixel intensity and color, orpeople can be divided into groups based on their income. Nonetheless, somework in graph partitioning and in image and market segmentation is relatedto Cluster Different Types of ClusteringsAn entire collection of clusters is commonly referred to as aclustering, and inthis section, we distinguish various types of clusterings: hierarchical (nested)versus partitional (unnested), exclusive versus overlapping versus fuzzy, andcomplete versus versus PartitionalThe most commonly discussed distinc-tion among different types of clusterings is whether the set of clusters is nested492 Chapter 8 Cluster analysis : Basic Concepts and Algorithmsor unnested, or in more traditional terminology, hierarchical or partitional.