Classification: Basic Concepts, Decision Trees, and Model ...

4 Classification: Basic Concepts, Decision Trees, and Model Evaluation

Classification, which is the task of assigning objects to one of several predefined categories, is a pervasive problem that encompasses many diverse applications. Examples include detecting spam email messages based upon the message header and content, categorizing cells as malignant or benign based upon the results of MRI scans, and classifying galaxies based upon their shapes (see the galaxy figure).

[Figure: Classification of galaxies. (a) A spiral galaxy. (b) An elliptical galaxy. The images are from NASA.]

[Figure: Classification as the task of mapping an input attribute set x into its class label y. Input: attribute set (x). Classification model. Output: class label (y).]

This chapter introduces the basic concepts of classification, describes some of the key issues such as model overfitting, and presents methods for evaluating and comparing the performance of a classification technique.

While it focuses mainly on a technique known as decision tree induction, most of the discussion in this chapter is also applicable to other classification techniques, many of which are covered in another chapter.

Preliminaries

The input data for a classification task is a collection of records. Each record, also known as an instance or example, is characterized by a tuple (x, y), where x is the attribute set and y is a special attribute, designated as the class label (also known as category or target attribute). The accompanying table shows a sample data set used for classifying vertebrates into one of the following categories: mammal, bird, fish, reptile, or amphibian. The attribute set includes properties of a vertebrate such as its body temperature, skin cover, method of reproduction, ability to fly, and ability to live in water.
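As a rough illustration of the (x, y) tuple described above, a record might be represented as follows. This is only a sketch: the `Record` type, the field names, and the sample values are assumptions made for the example and are not copied from the chapter's vertebrate table.

```python
from typing import NamedTuple, Optional

# The five categories named in the text.
CLASSES = ("mammal", "bird", "fish", "reptile", "amphibian")


class Record(NamedTuple):
    """One record: an attribute set x and a class label y."""
    x: dict            # attribute set (attribute name -> value)
    y: Optional[str]   # class label; None when the label is unknown


# Illustrative record (hypothetical field names and values).
human = Record(
    x={
        "body_temperature": "warm-blooded",
        "skin_cover": "hair",
        "gives_birth": True,
        "can_fly": False,
        "lives_in_water": False,
    },
    y="mammal",
)

attributes, label = human  # unpacks like the tuple (x, y)
```

Keeping the label separate from the attribute set makes it easy to represent unlabeled records by setting `y` to `None`.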

Although the attributes presented in the vertebrate table are mostly discrete, the attribute set can also contain continuous features. The class label, on the other hand, must be a discrete attribute. This is a key characteristic that distinguishes classification from regression, a predictive modeling task in which y is a continuous attribute. Regression techniques are covered in an appendix.

Definition (Classification). Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y.

The target function is also known informally as a classification model. A classification model is useful for the following purposes.

Descriptive Modeling. A classification model can serve as an explanatory tool to distinguish between objects of different classes. For example, it would be useful for both biologists and others to have a descriptive model that summarizes the vertebrate data and explains what features define a vertebrate as a mammal, reptile, bird, fish, or amphibian.

Predictive Modeling. A classification model can also be used to predict the class label of unknown records.

As shown in the input-output figure, a classification model can be treated as a black box that automatically assigns a class label when presented with the attribute set of an unknown record. Suppose we are given the following characteristics of a creature known as a gila monster:

  Name:              gila monster
  Body Temperature:  cold-blooded
  Skin Cover:        scales
  Gives Birth:       no
  Aquatic Creature:  no
  Aerial Creature:   no
  Has Legs:          yes
  Hibernates:        yes
  Class Label:       ?

We can use a classification model built from the vertebrate data set to determine the class to which the creature belongs.

Classification techniques are most suited for predicting or describing data sets with binary or nominal categories. They are less effective for ordinal categories (e.g., to classify a person as a member of a high-, medium-, or low-income group) because they do not consider the implicit order among the categories.

Other forms of relationships, such as the subclass-superclass relationships among categories (e.g., humans and apes are primates, which in turn is a subclass of mammals) are also ignored. The remainder of this chapter focuses only on binary or nominal class labels.

General Approach to Solving a Classification Problem

A classification technique (or classifier) is a systematic approach to building classification models from an input data set. Examples include decision tree classifiers, rule-based classifiers, neural networks, support vector machines, and naïve Bayes classifiers. Each technique employs a learning algorithm to identify a model that best fits the relationship between the attribute set and class label of the input data. The model generated by a learning algorithm should both fit the input data well and correctly predict the class labels of records it has never seen before.

Therefore, a key objective of the learning algorithm is to build models with good generalization capability; i.e., models that accurately predict the class labels of previously unknown records.

The accompanying figure shows a general approach for solving classification problems. First, a training set consisting of records whose class labels are known must be provided.

[Figure: General approach for building a classification model. A learning algorithm learns a model from the training set (induction); the model is then applied to the test set (deduction). The training set holds ten records (Tid 1 to 10) with attributes Attrib1, Attrib2, Attrib3 and a known Class; the test set holds five records (Tid 11 to 15) whose Class is unknown.]

Table: Confusion matrix for a 2-class problem.

                              Predicted Class
                        Class = 1      Class = 0
  Actual   Class = 1    $f_{11}$       $f_{10}$
  Class    Class = 0    $f_{01}$       $f_{00}$

The training set is used to build a classification model, which is subsequently applied to the test set, which consists of records with unknown class labels.

Evaluation of the performance of a classification model is based on the counts of test records correctly and incorrectly predicted by the model. These counts are tabulated in a table known as a confusion matrix. The accompanying table depicts the confusion matrix for a binary classification problem. Each entry $f_{ij}$ in this table denotes the number of records from class $i$ predicted to be of class $j$. For instance, $f_{01}$ is the number of records from class 0 incorrectly predicted as class 1. Based on the entries in the confusion matrix, the total number of correct predictions made by the model is $(f_{11} + f_{00})$ and the total number of incorrect predictions is $(f_{10} + f_{01})$.

Although a confusion matrix provides the information needed to determine how well a classification model performs, summarizing this information with a single number would make it more convenient to compare the performance of different models.
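The counting behind a confusion matrix can be sketched in a few lines of Python. The helper name `confusion_matrix` and the label lists below are hypothetical, chosen only to illustrate how the entries $f_{ij}$ and the totals of correct and incorrect predictions are obtained.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Map each pair (i, j) to f_ij: records of actual class i predicted as class j."""
    return Counter(zip(actual, predicted))


# Hypothetical labels for eight test records (illustrative data, not from the text).
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

f = confusion_matrix(actual, predicted)

correct = f[(1, 1)] + f[(0, 0)]    # f11 + f00
incorrect = f[(1, 0)] + f[(0, 1)]  # f10 + f01

print(f"f11={f[(1, 1)]} f10={f[(1, 0)]} f01={f[(0, 1)]} f00={f[(0, 0)]}")
print(f"correct={correct} incorrect={incorrect}")
```

A `Counter` returns 0 for pairs that never occur, so the sums stay valid even when one of the four cells is empty.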

This can be done using a performance metric such as accuracy, which is defined as follows:

\[
\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}.
\]

Equivalently, the performance of a model can be expressed in terms of its error rate, which is given by the following equation:

\[
\text{Error rate} = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}}.
\]

Most classification algorithms seek models that attain the highest accuracy, or equivalently, the lowest error rate when applied to the test set. We will revisit the topic of model evaluation in a later section.

Decision Tree Induction

This section introduces a decision tree classifier, which is a simple yet widely used classification technique.

How a Decision Tree Works

To illustrate how classification with a decision tree works, consider a simpler version of the vertebrate classification problem described in the previous section.

Instead of classifying the vertebrates into five distinct groups of species, we assign them to two categories: mammals and non-mammals.

Suppose a new species is discovered by scientists. How can we tell whether it is a mammal or a non-mammal? One approach is to pose a series of questions about the characteristics of the species. The first question we may ask is whether the species is cold- or warm-blooded. If it is cold-blooded, then it is definitely not a mammal. Otherwise, it is either a bird or a mammal. In the latter case, we need to ask a follow-up question: Do the females of the species give birth to their young? Those that do give birth are definitely mammals, while those that do not are likely to be non-mammals (with the exception of egg-laying mammals such as the platypus and spiny anteater).

The previous example illustrates how we can solve a classification problem by asking a series of carefully crafted questions about the attributes of the test record.

Each time we receive an answer, a follow-up question is asked until we reach a conclusion about the class label of the record. The series of questions and their possible answers can be organized in the form of a decision tree, which is a hierarchical structure consisting of nodes and directed edges. The accompanying figure shows the decision tree for the mammal classification problem. The tree has three types of nodes:

  - A root node that has no incoming edges and zero or more outgoing edges.
  - Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.
  - Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges.

In a decision tree, each leaf node is assigned a class label. The non-terminal nodes, which include the root and other internal nodes, contain attribute test conditions to separate records that have different characteristics.
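The structure just described can be sketched in Python. The tree below is reconstructed from the two questions posed in the text (body temperature, then whether the species gives birth), not from the original figure; the `Node` class, the attribute names, and the `classify` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """A decision tree node: either an attribute test or a leaf with a class label."""
    attribute: Optional[str] = None                            # test condition (non-terminal nodes)
    children: Dict[str, "Node"] = field(default_factory=dict)  # outgoing edges, keyed by answer
    label: Optional[str] = None                                # class label (leaf nodes only)


def classify(node: Node, record: Dict[str, str]) -> str:
    """Follow the edges that match the record's answers until a leaf is reached."""
    while node.label is None:
        node = node.children[record[node.attribute]]
    return node.label


# Root node tests body temperature; one internal node tests "gives birth".
tree = Node(
    attribute="body_temperature",
    children={
        "cold-blooded": Node(label="non-mammal"),
        "warm-blooded": Node(
            attribute="gives_birth",
            children={
                "yes": Node(label="mammal"),
                "no": Node(label="non-mammal"),
            },
        ),
    },
)

# The gila monster from the earlier example: cold-blooded, does not give birth.
gila_monster = {"body_temperature": "cold-blooded", "gives_birth": "no"}
print(classify(tree, gila_monster))  # non-mammal
```

For the gila monster, the walk stops at the first test because the cold-blooded branch leads directly to a leaf.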
