An Introduction to Feature Extraction

Isabelle Guyon (ClopiNet, 955 Creston Rd., Berkeley, CA 94708, USA) and André Elisseeff (IBM Research GmbH, Zürich Research Laboratory, Säumerstrasse 4, CH-8803 Rüschlikon, Switzerland)

This chapter introduces the reader to the various aspects of feature extraction covered in this book. Section 1 reviews definitions and notations and proposes a unified view of the feature extraction problem. Section 2 is an overview of the methods and results presented in the book, emphasizing novel contributions. Section 3 provides the reader with an entry point in the field of feature extraction by showing small revealing examples and describing simple but effective algorithms. Finally, Section 4 introduces a more theoretical formalism and points to directions of research and open problems.

1 Feature Extraction Basics

In this section, we present key notions that will be necessary to understand the first part of the book, and we synthesize different notions that will be seen separately later on.

1.1 Predictive modeling

This book is concerned with problems of predictive modeling, or supervised machine learning.

The latter refers to a branch of computer science interested in reproducing human learning capabilities with computer programs. The term machine learning was first coined by Samuel in the 1950s and was meant to encompass many intelligent activities that could be transferred from human to machine. The term "machine" should be understood in an abstract way: not as a physically instantiated machine, but as an automated system that may, for instance, be implemented in software. Since the 1950s, machine learning research has mostly focused on finding relationships in data and on analyzing the processes for extracting such relations, rather than on building truly "intelligent" systems. Machine learning problems occur when a task is defined by a series of cases or examples rather than by predefined rules. Such problems are found in a wide variety of application domains, ranging from engineering applications in robotics and pattern recognition (speech, handwriting, face recognition), to Internet applications (text categorization) and medical applications (diagnosis, prognosis, drug discovery).

Given a number of training examples (also called data points, samples, patterns or observations) associated with desired outcomes, the machine learning process consists of finding the relationship between the patterns and the outcomes using solely the training examples. This shares a lot with human learning, where students are given examples of what is correct and what is not and have to infer which rule underlies the decision. To make it concrete, consider the following example: the data points or examples are clinical observations of patients, and the outcome is the health status: healthy or suffering from disease. (The outcome, also called target value, may be binary for a 2-class classification problem, categorical for a multi-class problem, ordinal or continuous for regression.) The goal is to predict the unknown outcome for new "test" examples, e.g., the health status of new patients. The performance on test data is called "generalization". To perform this task, one must build a predictive model, or predictor, which is typically a function with adjustable parameters called a "learning machine".
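As a minimal code illustration of these notions, the sketch below implements a learning machine whose adjustable parameters are class centroids, trains it on hypothetical clinical observations, and evaluates it on held-out test examples. The data, features and labels are invented for this example, and Python with NumPy is our choice of language, not the book's.

    import numpy as np

    # Hypothetical training examples: rows are patients, columns are features
    # (say, temperature in deg C and glucose in mg/dL). Outcomes: 0 = healthy,
    # 1 = suffering from the disease (a 2-class classification problem).
    X_train = np.array([[36.8, 90.0], [37.1, 95.0], [39.2, 160.0], [38.9, 170.0]])
    y_train = np.array([0, 0, 1, 1])

    # "Training" selects the adjustable parameters: here, one centroid per class.
    centroids = {c: X_train[y_train == c].mean(axis=0) for c in np.unique(y_train)}

    def predict(x):
        """Predict the class whose centroid is nearest to the pattern x."""
        return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

    # Generalization is measured on "test" examples never seen during training.
    X_test = np.array([[37.0, 92.0], [39.0, 165.0]])
    print([predict(x) for x in X_test])  # expected: [0, 1]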

The training examples are used to select an optimum set of parameters. We will see along the chapters of this book that enhancing learning machine generalization often motivates feature selection. For that reason, classical learning machines (e.g., Fisher's linear discriminant and nearest neighbors) and state-of-the-art learning machines (e.g., neural networks, tree classifiers, Support Vector Machines (SVM)) are reviewed in Chapter 1. More advanced techniques like ensemble methods are reviewed in Chapter 5. Less conventional neuro-fuzzy approaches are introduced in Chapter 8. Chapter 2 provides guidance on how to assess the performance of learning machines. But, before any modeling takes place, a data representation must be chosen. This is the object of the following section.

1.2 Feature construction

In this book, data are represented by a fixed number of features, which can be binary, categorical or continuous.

Feature is a synonym of input variable or attribute. (It is sometimes necessary to make the distinction between raw input variables and features that are variables constructed from the original input variables; we will make it clear when this distinction is necessary.) Finding a good data representation is very domain specific and related to the available measurements. In our medical diagnosis example, the features may be symptoms, that is, a set of variables categorizing the health status of a patient (e.g., fever, glucose level, etc.). Human expertise, which is often required to convert "raw" data into a set of useful features, can be complemented by automatic feature construction methods.
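As a small illustration of such hand-crafted feature construction, the sketch below derives a single feature from two raw clinical variables. The choice of body mass index is our own hypothetical example, not one taken from the book.

    # Hypothetical constructed feature: body mass index (BMI), derived from
    # the raw input variables height (m) and weight (kg). A single constructed
    # feature can summarize the information carried by several raw variables.
    def bmi(height_m: float, weight_kg: float) -> float:
        return weight_kg / height_m ** 2

    print(round(bmi(1.75, 70.0), 2))  # 22.86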

In some approaches, feature construction is integrated in the modeling process: for example, the hidden units of artificial neural networks compute internal representations analogous to constructed features. In other approaches, feature construction is a preprocessing step. To describe preprocessing, let us introduce some notation. Let x be a pattern vector of dimension n, x = [x_1, x_2, ..., x_n]. The components x_i of this vector are the original features. We call x' a vector of transformed features of dimension n'. Preprocessing transformations may include the following (a code sketch of several of these transformations follows the list):

- Standardization: Features can have different scales although they refer to comparable objects. Consider, for instance, a pattern x = [x_1, x_2] where x_1 is a width measured in meters and x_2 is a height measured in centimeters. Both can be compared, added or subtracted, but it would be unreasonable to do so before appropriate normalization. The following classical centering and scaling of the data is often used: x'_i = (x_i - μ_i)/σ_i, where μ_i and σ_i are the mean and the standard deviation of feature x_i over the training examples.

- Normalization: Consider, for example, the case where x is an image and the x_i are the numbers of pixels with color i. It makes sense to normalize x by dividing it by the total number of counts, in order to encode the distribution and remove the dependence on the size of the image. This translates into the formula x' = x/||x||.

- Signal enhancement: The signal-to-noise ratio may be improved by applying signal or image-processing filters. These operations include baseline or background removal, de-noising, smoothing, or sharpening. The Fourier transform and wavelet transforms are popular methods. We refer to introductory books in digital signal processing (Lyons, 2004), wavelets (Walker, 1999), image processing (R. C. Gonzalez, 1992), and morphological image analysis (Soille, 2004).

- Extraction of local features: For sequential, spatial or other structured data, specific techniques like convolutional methods using hand-crafted kernels, or syntactic and structural methods, are used. These techniques encode problem-specific knowledge into the features. They are beyond the scope of this book, but it is worth mentioning that they can bring significant improvement.

- Linear and non-linear space embedding methods: When the dimensionality of the data is very high, some techniques might be used to project or embed the data into a lower dimensional space while retaining as much information as possible. Classical examples are Principal Component Analysis (PCA) and Multidimensional Scaling (MDS) (Kruskal and Wish, 1978). The coordinates of the data points in the lower dimensional space might be used as features, or simply as a means of data visualization.

- Non-linear expansions: Although dimensionality reduction is often invoked when speaking about complex data, it is sometimes better to increase the dimensionality. This happens when the problem is very complex and first-order interactions are not enough to derive good results. This consists, for instance, in computing products of the original features x_i to create monomials x_{k1} x_{k2} ... x_{kp}.

- Feature discretization: Some algorithms do not handle continuous data well. It then makes sense to discretize continuous values into a finite discrete set. This step not only facilitates the use of certain algorithms; it may also simplify the data description and improve data understanding (Liu and Motoda, 1998).

Some methods do not alter the space dimensionality (e.g., signal enhancement, normalization, standardization), while others enlarge it (non-linear expansions, feature discretization), reduce it (space embedding methods) or can act in either direction (e.g., extraction of local features).
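To make these transformations concrete, here is a minimal sketch in Python with NumPy showing standardization, normalization, a degree-2 non-linear expansion, a PCA embedding and feature discretization on a toy data matrix. The data and the parameter choices (number of bins, expansion degree) are invented for illustration and do not come from the book.

    import numpy as np

    # Toy training data: 4 patterns (rows), 2 original features (columns),
    # e.g. a height in meters and a weight in kilograms. Invented numbers.
    X = np.array([[1.80, 75.0],
                  [1.65, 60.0],
                  [1.72, 68.0],
                  [1.90, 90.0]])

    # Standardization: x'_i = (x_i - mu_i) / sigma_i, with mu_i and sigma_i
    # estimated over the training examples.
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    X_std = (X - mu) / sigma

    # Normalization: x' = x / ||x||, removing dependence on overall magnitude.
    X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)

    # Non-linear expansion: append the product x_1*x_2 and the squares,
    # increasing the dimensionality from 2 to 5.
    X_exp = np.hstack([X, X[:, [0]] * X[:, [1]], X ** 2])

    # Linear space embedding (PCA): project the centered, standardized data
    # onto its first principal direction, reducing the dimensionality to 1.
    _, _, Vt = np.linalg.svd(X_std, full_matrices=False)
    X_pca = X_std @ Vt[:1].T

    # Feature discretization: map each continuous feature to 3 quantile bins.
    edges = [np.quantile(X[:, j], [1/3, 2/3]) for j in range(X.shape[1])]
    X_disc = np.column_stack([np.digitize(X[:, j], edges[j]) for j in range(X.shape[1])])

    print(X_std.shape, X_norm.shape, X_exp.shape, X_pca.shape, X_disc.shape)
    # (4, 2) (4, 2) (4, 5) (4, 1) (4, 2)

Note that all the statistics involved (means, standard deviations, quantile edges, principal directions) are estimated on the training examples only, and would be reused unchanged on test data.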

Feature construction is one of the key steps in the data analysis process, largely conditioning the success of any subsequent statistics or machine learning endeavor. In particular, one should beware of losing information at the feature construction stage. It may be a good idea to add the raw features to the preprocessed data, or at least to compare the performances obtained with either representation. We argue that it is always better to err on the side of being too inclusive rather than to risk discarding useful information. The medical diagnosis example that we have used before illustrates this point. Many factors might influence the health status of a patient. To the usual clinical variables (temperature, blood pressure, glucose level, weight, height, etc.), one might want to add diet information (low fat, low carbohydrate, etc.), family history, or even weather conditions. Adding all those features seems reasonable, but it comes at a price: it increases the dimensionality of the patterns and thereby immerses the relevant information into a sea of possibly irrelevant, noisy or redundant features.

