Example: dental hygienist

Classification: Basic Concepts and Techniques

3 Classification: BasicConcepts andTechniquesHumans have an innate ability to classify things into categories, , mundanetasks such as filtering spam email messages or more specialized tasks suchas recognizing celestial objects in telescope images (see Figure ). Whilemanual classification often suffices for small and simple data sets with onlya few attributes, larger and more complex data sets require an automatedsolution.(a) A spiral galaxy.(b) An elliptical of galaxies from telescope images taken from the NASA website. 114 Chapter 3 ClassificationClassificationmodelInputAt tribute set(x)OutputClass label(y)Figure schematic illustration of a classification chapter introduces the Basic Concepts of classification and describessome of its key issues such as model overfitting, model selection, and modelevaluation.

3 Classification: Basic Concepts and Techniques Humanshaveaninnateabilitytoclassifythingsintocategories,e.g.,mundane tasks such as filtering spam email messages or ...

Tags:

  Basics, Concept, Technique, Classification, Basic concepts and techniques

Information

Domain:

Source:

Link to this page:

Please notify us if you found a problem with this document:

Other abuse

Advertisement

Transcription of Classification: Basic Concepts and Techniques

1 3 Classification: BasicConcepts andTechniquesHumans have an innate ability to classify things into categories, , mundanetasks such as filtering spam email messages or more specialized tasks suchas recognizing celestial objects in telescope images (see Figure ). Whilemanual classification often suffices for small and simple data sets with onlya few attributes, larger and more complex data sets require an automatedsolution.(a) A spiral galaxy.(b) An elliptical of galaxies from telescope images taken from the NASA website. 114 Chapter 3 ClassificationClassificationmodelInputAt tribute set(x)OutputClass label(y)Figure schematic illustration of a classification chapter introduces the Basic Concepts of classification and describessome of its key issues such as model overfitting, model selection, and modelevaluation.

2 While these topics are illustrated using a classification techniqueknown as decision tree induction, most of the discussion in this chapter isalso applicable to other classification Techniques , many of which are coveredin Chapter Basic ConceptsFigure illustrates the general idea behind classification. The data for aclassification task consists of a collection of instances (records). Each suchinstance is characterized by the tuple (x,y), wherexis the set of attributevalues that describe the instance andyis the class label of the instance. Theattribute setxcan contain attributes of any type, while the class labelymustbe modelis an abstract representation of the relationshipbetween the attribute set and the class label. As will be seen in the nexttwo chapters, the model can be represented in many ways, , as a tree, aprobability table, or simply, a vector of real-valued parameters.

3 More formally,we can express it mathematically as a target functionfthat takes as input theattribute setxand produces an output corresponding to the predicted classlabel. The model is said to classify an instance (x,y) correctly iff(x)= of classification setClass labelSpam filteringFeatures extracted from email messageheader and contentspam or non-spamTumor identificationFeatures extracted from magnetic reso-nance imaging (MRI) scansmalignant or benignGalaxy classificationFeatures extracted from telescope imageselliptical, spiral, orirregular-shaped Concepts115 Table sample data for the vertebrate classification shows examples of attribute sets and class labels for variousclassification tasks. Spam filtering and tumor identification are examples ofbinary classification problems, in which each data instance can be categorizedinto one of two classes.

4 If the number of classes is larger than 2, as in the galaxyclassification example, then it is called a multiclass classification illustrate the Basic Concepts of classification in this chapter with thefollowing two [Vertebrate Classification]Table shows a sample dataset for classifying vertebrates into mammals, reptiles, birds, fishes, and am-phibians. The attribute set includes characteristics of the vertebrate such asits body temperature, skin cover, and ability to fly. The data set can also beused for a binary classification task such as mammal classification, by groupingthe reptiles, birds, fishes, and amphibians into a single category called [Loan Borrower Classification]Consider the problem ofpredicting whether a loan borrower will repay the loan or default on the loanpayments. The data set used to build the classification model is shown in The attribute set includes personal information of the borrower such asmarital status and annual income, while the class label indicates whether theborrower had defaulted on the loan payments.

5 116 Chapter 3 ClassificationTable sample data for the loan borrower classification OwnerMarital StatusAnnual IncomeDefaulted?1 YesSingle125000No2 NoMarried100000No3 NoSingle70000No4 YesMarried120000No5 NoDivorced95000 Yes6 NoSingle60000No7 YesDivorced220000No8 NoSingle85000 Yes9 NoMarried75000No10 NoSingle90000 YesA classification model serves two important roles in data mining. First, it isused as apredictive modelto classify previously unlabeled instances. A goodclassification model must provide accurate predictions with a fast responsetime. Second, it serves as adescriptive modelto identify the characteristicsthat distinguish instances from different classes. This is particularly usefulfor critical applications, such as medical diagnosis, where it is insufficient tohave a model that makes a prediction without justifying how it reaches sucha example, a classification model induced from the vertebrate data setshown in Table can be used to predict the class label of the followingvertebrate:VertebrateBodySkinGi vesAquaticAerialHasHiber-ClassNameTemper atureCoverBirthCreatureCreatureLegsnates Labelgila monstercold-bloodedscalesnononoyesyes?

6 In addition, it can be used as a descriptive model to help determine charac-teristics that define a vertebrate as a mammal, a reptile, a bird, a fish, or anamphibian. For example, the model may identify mammals as warm-bloodedvertebrates that give birth to their are several points worth noting regarding the previous , although all the attributes shown in Table are qualitative, there areno restrictions on the type of attributes that can be used as predictor class label, on the other hand, must be of nominal type. This distinguishesclassification from other predictive modeling tasks such as regression, wherethe predicted value is often quantitative. More information about regressioncan be found in Appendix point worth noting is that not all attributes may be relevantto the classification task.

7 For example, the average length or weight of a Framework for Classification117vertebrate may not be useful for classifying mammals, as these attributescan show same value for both mammals and non-mammals. Such an attributeis typically discarded during preprocessing. The remaining attributes mightnot be able to distinguish the classes by themselves, and thus, must be used inconcert with other attributes. For instance, theBody Temperatureattributeis insufficient to distinguish mammals from other vertebrates. When it is usedtogether withGives Birth, the classification of mammals improves signifi-cantly. However, when additional attributes, such asSkin Coverare included,the model becomes overly specific and no longer covers all mammals. Findingthe optimal combination of attributes that best discriminates instances fromdifferent classes is the key challenge in building classification General Framework for ClassificationClassification is the task of assigning labels to unlabeled data instances and aclassifieris used to perform such a task.

8 A classifier is typically described interms of a model as illustrated in the previous section. The model is createdusing a given a set of instances, known as thetraining set, which contains at-tribute values as well as class labels for each instance. The systematic approachfor learning a classification model given a training set is known as alearningalgorithm. The process of using a learning algorithm to build a classificationmodel from the training data is known asinduction. This process is alsooften described as learning a model or building a model. This process ofapplying a classification model on unseen test instances to predict their classlabels is known asdeduction. Thus, the process of classification involves twosteps: applying a learning algorithm to training data to learn a model, andthen applying the model to assign labels to unlabeled instances.

9 Figure the general framework for techniquerefers to a general approach to classification, , the decision tree technique that we will study in this chapter. Thisclassification technique like most others, consists of a family of related modelsand a number of algorithms for learning these models. In Chapter 4, wewill study additional classification Techniques , including neural networks andsupport vector couple notes on terminology. First, the terms classifier and model are often taken to be synonymous. If a classification technique builds a single,global model, then this is fine. However, while every model defines a classifier,not every classifier is defined by a single model. Some classifiers, such ask-nearest neighbor classifiers, do not build an explicit model (Section ), while 118 Chapter 3 ClassificationFigure framework for building a classification classifiers, such as ensemble classifiers, combine the output of a collectionof models (Section ).

10 Second, the term classifier is often used in a moregeneral sense to refer to a classification technique . Thus, for example, decisiontree classifier can refer to the decision tree classification technique or a specificclassifier built using that technique . Fortunately, the meaning of classifier is usually clear from the the general framework shown in Figure , the induction and deductionsteps should be performed separately. In fact, as will be discussed later inSection , the training and test sets should be independent of each otherto ensure that the induced model can accurately predict the class labels ofinstances it has never encountered before. Models that deliver such predictiveinsights are said to have goodgeneralization of a model (classifier) can be evaluated by comparing the predictedlabels against the true labels of instances.


Related search queries