Text Mining Process, Techniques and Tools: An Overview

Vidhya. K. A¹ & G. Aghila²

Text mining has become an important research area. It refers to the application of machine learning (or data mining) techniques to the study of information retrieval and natural language processing. In a sense, it is defined as the way of discovering knowledge from ubiquitous text data that are easily accessible over the Internet or an intranet. This paper surveys text mining techniques, text mining applications, and tools, together with a literature survey of various applications. Text mining techniques such as document clustering and document classification are presented. For text-mining-based frameworks for applications such as summarization, topic discovery, information extraction, and information retrieval, the terms and techniques in each method are discussed.

Various text mining and data visualization tools for application to patent information are presented, covering their working mode, capabilities, data sources, and result output. An in-depth analysis of the algorithms related to classification techniques, with their advantages, disadvantages, and working mode, is also given.

Keywords: Text Mining, Information Extraction, Information Retrieval, Document [...]

Mining is the process of searching for patterns within structured or unstructured data. The various mining methods differ in the context and the type of dataset to which they are applied. The process of extracting information and knowledge from unstructured text led to the need for various mining techniques for useful pattern discovery. Data Mining (DM) and Text Mining (TM) [1] are similar in that both techniques mine large amounts of data, looking for meaningful patterns [19].

Some of the mining types are data, text, web, business process and [...]. DM looks for patterns within structured data, that is, databases. The underlying technologies are based on statistics and artificial intelligence, littering the field with buzzwords such as classification and regression trees (CART), chi-squared automatic interaction detection (CHAID) [2], [14], neural networks and genetic algorithms. TM looks for patterns in unstructured data: memos and documents, PDF and text files. Consequently, it often uses language-based techniques, such as semantic analysis and [...].

Web Mining deals with the extraction of specific knowledge from the World Wide Web. More precisely [13], Web Content Mining is that part of Web Mining which focuses on the raw information available in Web pages [9]; the source data mainly consist of textual data in Web pages (e.g., words, but also tags); typical applications are content-based categorization and content-based ranking of Web pages.

International Journal of Information Technology and Knowledge Management, July-December 2010, Volume 2, No. 2, pp. 613-622
¹,² Department of Computer Science, Pondicherry University, Pondicherry

The Web is transforming from a Web of data to a Web of both semantic data and services [6], and this trend is providing increasing opportunities to compose potentially interesting and useful services from existing services. Because the user may sometimes not have the specific queries needed in top-down service composition approaches to identify them, service mining, the early and proactive exposure of these opportunities, will be key to harvesting the great potential of the large body of Web [...].

Information systems record business events in so-called event logs [6]; business process mining [10] takes these logs to discover process, control, data, organizational, and social structures.

Although many researchers are developing new and more powerful process mining techniques [16] and software vendors are incorporating these in their software, few of the more advanced process mining techniques have been tested on real-life processes. Although there are many mining techniques, the orientation of this document is towards text mining techniques and their applications.

The paper is organized in the following way: Section II covers the Text Mining Process; Section III, Text Mining vs Information Extraction; Section IV, Text Mining vs Information Retrieval; Section V, Text Mining vs Natural Language Processing; Section VI, Text Mining vs Document Clustering; Section VII, Text Mining vs Document Classification; and Section VIII, Text Mining Tools, finally followed by the Conclusion and [...].

TEXT MINING PROCESS

Text Mining (TM) refers to the informational content included in items such as newspapers, articles, books, reports, stories, manuals, blogs, email, and articles on the WWW.

The quantity of text available today is vast and ever-growing [19]. The prime aim of text mining is to identify useful information, without duplication, from various documents with synonymous understanding. TM is an empirical tool that has the capacity to identify new information that is not apparent from a document. Figure 1 depicts the TM process, which uses information retrieval and natural language processing to mine a large dataset and infer the knowledge available in it. The process of TM includes searching, extracting, and categorization, where the themes are readable and the meaning is [...]. Text mining tasks include text categorization [15], text clustering, information extraction, information retrieval [14], sentiment analysis, document summarization [4], [5] and entity relation modeling.

TM, also known as Knowledge Text Analysis, Text Data Mining or Knowledge-Discovery in Text (KDT), refers generally to the process of extracting information and knowledge from unstructured text. The process starts with a collection of documents; a particular document is retrieved and preprocessed by checking its format and character sets. It then goes through a text analysis phase, sometimes repeating techniques until information is extracted. Three text analysis techniques are shown in the example, but many other combinations of techniques could be used depending on the goals of the organization. The resulting information can be placed in a management information system [21], yielding an abundant amount of knowledge for the user of that system. The following figure shows the detailed processing methods in text mining.

Figure 1: Text Mining Process

The document collection in Figure 1 is a set of files, possibly with any extension such as PDF, txt or even a flat file extension. These are normally collected as noisy unstructured text data found in informal settings such as online chat, SMS, emails, message boards, newsgroups, blogs, wikis and web pages.

Also, a text data set created by processing spontaneous speech, printed text and handwritten text contains processing noise. The dataset is an unstructured set of documents, which are pre-processed using the following three rules:

1. Tokenize the file into individual tokens using space as the delimiter.
2. Remove the stop words, which do not convey any meaning.
3. Use the Porter stemmer algorithm to stem words with a common root.

Feature Generation: The process of TM involves generating features in a spreadsheet format, which is simpler and more restrictive than open-ended data mining. Text is unstructured because it is very far from the spreadsheet model that we need in order to process data for prediction. Even so, transforming data from text to the spreadsheet model can be highly methodical, and there is a need for an organized procedure to fill in the cells of the spreadsheet.

Feature Selection: Feature selection algorithms are ad hoc. Selecting the important features in principle requires an exhaustive search of all subsets of features of a chosen cardinality.
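The three pre-processing rules and the spreadsheet-style feature generation described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the stop-word list is a tiny illustrative one, `crude_stem` is a simple suffix stripper standing in for the Porter stemmer (it is not the Porter algorithm), and the function names are hypothetical.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "in", "to"}

# Suffixes tried in order by the stand-in stemmer below.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(token):
    """Strip one common suffix. A stand-in for, NOT an implementation of,
    the Porter stemmer, which applies several ordered rule phases."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Apply the three rules: tokenize on spaces, drop stop words, stem."""
    tokens = text.lower().split(" ")                 # rule 1: space delimiter
    tokens = [t.strip(".,;:!?()") for t in tokens]   # trim punctuation
    tokens = [t for t in tokens if t and t not in STOP_WORDS]  # rule 2
    return [crude_stem(t) for t in tokens]           # rule 3 (stand-in)


def document_term_matrix(docs):
    """Fill the 'spreadsheet': one row per document, one column per term,
    each cell holding the term's count in that document."""
    counts = [Counter(preprocess(d)) for d in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


if __name__ == "__main__":
    vocab, rows = document_term_matrix(
        ["Mining the text", "Text mining tools are mined"]
    )
    print(vocab)
    for row in rows:
        print(row)
```

Each row of the resulting matrix is the kind of fixed-width record that prediction methods expect, which is why the spreadsheet model is described as more restrictive than the raw text.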

When a large number of features is available, such an exhaustive search is impractical; supervised learning algorithms therefore search for a satisfactory set of features instead of an optimal one. After the appropriate selection of features, text mining techniques are incorporated for applications such as information retrieval, information extraction, summarization and topic discovery for the necessary knowledge discovery process.

By using Knowledge Discovery in Databases (KDD), which is the fundamental step in text mining, knowledge experts can obtain important strategic information for their [...]. [...] has more intensive transformation methods to cross-examine traditional databases, where data are in structured form, by automatically finding new and unknown patterns in huge quantities of data.
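To illustrate why exhaustive subset search is impractical, the number of feature subsets of cardinality k drawn from n features is the binomial coefficient C(n, k). The sketch below computes that count and contrasts it with one simple "satisfactory" alternative: scoring each feature individually and keeping the top k. The choice of document frequency as the score is an assumption made for illustration; the paper does not prescribe a scoring function, and the function names are hypothetical.

```python
from math import comb


def subsets_to_examine(n_features, k):
    """Number of size-k feature subsets an exhaustive search would evaluate."""
    return comb(n_features, k)


def top_k_by_document_frequency(rows, vocabulary, k):
    """Filter-style selection (an illustrative assumption, not the paper's
    method): rank each term by the number of documents that contain it and
    keep the k highest, breaking ties alphabetically."""
    doc_freq = [
        sum(1 for row in rows if row[j] > 0) for j in range(len(vocabulary))
    ]
    ranked = sorted(
        range(len(vocabulary)), key=lambda j: (-doc_freq[j], vocabulary[j])
    )
    return [vocabulary[j] for j in ranked[:k]]


if __name__ == "__main__":
    # Exhaustive search grows combinatorially with the vocabulary size.
    print(subsets_to_examine(1000, 10))

    # Per-feature scoring needs only one pass over the matrix.
    vocabulary = ["min", "text", "tool"]
    rows = [[1, 1, 0], [2, 1, 1]]
    print(top_k_by_document_frequency(rows, vocabulary, 2))
```

Scoring features one at a time ignores interactions between features, which is the price paid for avoiding the combinatorial search.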

Mostly, structured data represent only a small part of the overall organizational knowledge, and the knowledge is incorporated in textual [...]. Figure 1 depicts the knowledge stored in the management information system, where the knowledge is stored and retrieved. When the system gets a new training set, it goes for incremental learning rather than repeating the initial learning process.

For example, Text Data Mining [18], [10] in customer relationship management applications can contribute significantly to the business: rather than randomly contacting a customer through a call center or sending mail, a company can concentrate its efforts on customers that are predicted to have a high likelihood of responding to an offer. More sophisticated methods may be used to optimize resources across the dataset so that one may predict which channel and which offer the customer is most likely to respond to across all potential offers that are made out of [...].

TEXT MINING VS INFORMATION EXTRACTION

Information Extraction (IE) is the process of automatic extraction of structured information, such as entities, relationships between entities, and attributes describing entities, from unstructured text.
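As a rough illustration of what "structured information from unstructured text" means, the sketch below pulls one kind of entity relationship out of free text with a hand-written pattern. It is a toy example under stated assumptions: the "works at" pattern, the use of capitalized word sequences as entity mentions, and the function name are all hypothetical, and practical IE systems rely on far more robust linguistic or statistical methods.

```python
import re

# Assumption: an entity mention is a run of capitalized words.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Hypothetical relation pattern: "<Person> works at <Organization>".
WORKS_AT = re.compile(rf"({NAME}) works at ({NAME})")


def extract_employment(text):
    """Return one structured record per 'works at' statement found."""
    return [
        {
            "person": match.group(1),
            "relation": "works_at",
            "organization": match.group(2),
        }
        for match in WORKS_AT.finditer(text)
    ]


if __name__ == "__main__":
    sample = "Alice Smith works at Acme Labs. Bob Jones works at Initech."
    for record in extract_employment(sample):
        print(record)
```

The output records have fixed fields (entities plus the relationship between them), so they can be loaded into a database table, unlike the sentences they came from.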
