Web Mining — Concepts, Applications, and Research …

Chapter 21
Web Mining: Concepts, Applications, and Research Directions

Jaideep Srivastava, Prasanna Desikan, Vipin Kumar

Web mining is the application of data mining techniques to extract knowledge from web data, including web documents, hyperlinks between documents, usage logs of web sites, etc. A panel organized at ICTAI 1997 (Srivastava and Mobasher 1997) asked the question "Is there anything distinct about web mining (compared to data mining in general)?" While no definitive conclusions were reached then, the tremendous attention on web mining in the past five years, and a number of significant ideas that have been developed, have certainly answered this question in the affirmative in a big way.

In addition, a fairly stable community of researchers interested in the area has been formed, largely through the successful series of WebKDD workshops, which have been held annually in conjunction with the ACM SIGKDD Conference since 1999 (Masand and Spiliopoulou 1999; Kohavi, Spiliopoulou, and Srivastava 2001; Kohavi, Masand, Spiliopoulou, and Srivastava 2001; Masand, Spiliopoulou, Srivastava, and Zaiane 2002), and the web analytics workshops, which have been held in conjunction with the SIAM data mining conference (Ghosh and Srivastava 2001a, b). Good surveys of the research in the field (through 1999) are provided by Kosala and Blockeel (2000) and Madria, Bhowmick, Ng, and Lim (1999).

Two different approaches were taken in initially defining web mining. The first was a process-centric view, which defined web mining as a sequence of tasks (Etzioni 1996). The second was a data-centric view, which defined web mining in terms of the types of web data that were being used in the mining process (Cooley, Srivastava, and Mobasher 1997). The second definition has become more widely accepted, as is evident from the approach adopted in most recent papers (Madria, Bhowmick, Ng, and Lim 1999; Borges and Levene 1998; Kosala and Blockeel 2000) that have addressed the issue. In this chapter we follow the data-centric view of web mining, which is defined as follows:

    Web mining is the application of data mining techniques to extract knowledge from web data, i.e., web content, web structure, and web usage data.

The attention paid to web mining in research, the software industry, and web-based organizations has led to the accumulation of significant experience.

It is our goal in this chapter to capture this experience in a systematic manner, and to identify directions for future research.

The rest of this chapter is organized as follows. We first provide a taxonomy of web mining, then summarize some of the key concepts in the field, and then describe successful applications of web mining. Finally, we present some directions for future research and conclude the chapter.

Web Mining Taxonomy

Web mining can be broadly divided into three distinct categories, according to the kinds of data to be mined. The taxonomy is shown in the accompanying figure.

Web Content Mining

Web content mining is the process of extracting useful information from the contents of web documents.

Content data is the collection of facts a web page is designed to contain. It may consist of text, images, audio, video, or structured records such as lists and tables. The application of text mining to web content has been the most widely researched. Issues addressed in text mining include topic discovery and tracking, extracting association patterns, clustering of web documents, and classification of web pages. Research activities on this topic have drawn heavily on techniques developed in other disciplines such as Information Retrieval (IR) and Natural Language Processing (NLP). While there exists a significant body of work on extracting knowledge from images in the fields of image processing and computer vision, the application of these techniques to web content mining has been limited.

Web Structure Mining

The structure of a typical web graph consists of web pages as nodes, and hyperlinks as edges connecting related pages.

Web structure mining is the process of discovering structure information from the web. This can be further divided into two kinds, based on the kind of structure information used.

Hyperlinks

A hyperlink is a structural unit that connects a location in a web page to a different location, either within the same web page or on a different web page. A hyperlink that connects to a different part of the same page is called an intra-document hyperlink, and a hyperlink that connects two different pages is called an inter-document hyperlink. There has been a significant body of work on hyperlink analysis, of which Desikan, Srivastava, Kumar, and Tan (2002) provide an up-to-date survey.

Document Structure

In addition, the content within a web page can also be organized in a tree-structured format, based on the various HTML and XML tags within the page. Efforts here have focused on automatically extracting document object model (DOM) structures out of documents (Wang and Liu 1998; Moh, Lim, and Ng 2000).
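The distinction between intra-document and inter-document hyperlinks can be illustrated with a short Python sketch. This is not a method from the chapter; it is one possible reading of the definitions above, using only the standard library. The function name `classify_links`, the class `LinkCollector`, and the example.org URLs are hypothetical, and the sketch assumes that a link is intra-document when it resolves to the same URL as the page once the fragment is dropped.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def classify_links(page_url, html):
    """Split the links on a page into intra- and inter-document links.

    Assumption: a link is intra-document if, after resolving it against
    page_url and dropping the fragment, it equals the page's own URL.
    """
    collector = LinkCollector()
    collector.feed(html)
    base, _ = urldefrag(page_url)
    result = {"intra": [], "inter": []}
    for href in collector.hrefs:
        target, _fragment = urldefrag(urljoin(page_url, href))
        kind = "intra" if target == base else "inter"
        result[kind].append(href)
    return result


if __name__ == "__main__":
    # Hypothetical page with one same-page link and one link to another page.
    sample = '<a href="#sec2">Section 2</a> <a href="b.html">Next page</a>'
    print(classify_links("http://example.org/a.html", sample))
```

The inter-document links found this way are the edges of the web graph described above, while intra-document links stay within a single node.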

Web Usage Mining

Web usage mining is the application of data mining techniques to discover interesting usage patterns from web usage data, in order to understand and better serve the needs of web-based applications (Srivastava, Cooley, Deshpande, and Tan 2000). Usage data captures the identity or origin of web users along with their browsing behavior at a web site. Web usage mining itself can be classified further depending on the kind of usage data considered:

Web Server Data

User logs are collected by the web server and typically include IP address, page reference, and access time.

[Figure: Web Mining Taxonomy. Web mining branches into web content mining (text, image, audio, video, structured records), web structure mining (hyperlinks, both intra-document and inter-document, and document structure), and web usage mining (web server logs, application server logs, application level logs). An annotation in the figure marks the area on which web mining research has focused.]

Application Server Data

Commercial application servers such as WebLogic and StoryServer have significant features to enable e-commerce applications to be built on top of them with little effort. A key feature is the ability to track various kinds of business events and log them in application server logs.

Application Level Data

New kinds of events can be defined in an application, and logging can be turned on for them, generating histories of these events.

It must be noted, however, that many end applications require a combination of one or more of the techniques applied in the above categories.

Key Concepts

In this section we briefly describe the new concepts introduced by web mining research.

Ranking Metrics for Page Quality and Relevance

Searching the web involves two main steps: extracting the pages relevant to a query and ranking them according to their quality.

Ranking is important as it helps the user look for quality pages that are relevant to the query. Different metrics have been proposed to rank web pages according to their quality. We briefly discuss two of the prominent ones.

PageRank

PageRank is a metric for ranking hypertext documents based on their quality. Page, Brin, Motwani, and Winograd (1998) developed this metric for the popular search engine Google (Brin and Page 1998). The key idea is that a page has a high rank if it is pointed to by many highly ranked pages. So, the rank of a page depends upon the ranks of the pages pointing to it. The computation is repeated iteratively until the ranks of all pages are determined. The rank of a page p can be written as:

$$PR(p) = \frac{d}{n} + (1 - d) \sum_{(q,p) \in G} \frac{PR(q)}{\mathrm{OutDegree}(q)}$$

Here, n is the number of nodes in the graph and OutDegree(q) is the number of hyperlinks on page q.
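The iterative computation of the formula above can be sketched in Python as follows. This is a minimal illustration, not the implementation used by any search engine. The function name `pagerank`, the default d = 0.15, the convergence tolerance, and the dictionary-based graph representation are all assumptions made for the example. As in the formula, d multiplies the uniform term d/n, and pages with no outgoing links contribute nothing to the sum.

```python
def pagerank(graph, d=0.15, tol=1e-10, max_iter=200):
    """Iterate PR(p) = d/n + (1 - d) * sum over links (q, p) of PR(q)/OutDegree(q).

    graph maps each page to the list of pages it links to (an assumed
    representation of the web graph G).
    """
    # Every page that appears as a source or a target is a node.
    pages = set(graph)
    for targets in graph.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {p: 1.0 / n for p in pages}  # start from a uniform ranking
    for _ in range(max_iter):
        new_rank = {p: d / n for p in pages}  # first term of the formula
        for q in pages:
            targets = graph.get(q, [])
            if not targets:
                continue  # no outgoing links: no term in the sum
            share = (1 - d) * rank[q] / len(targets)
            for p in targets:
                new_rank[p] += share  # second term, one link (q, p) at a time
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank


if __name__ == "__main__":
    # Hypothetical four-link graph: A -> B, A -> C, B -> C, C -> A.
    example = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for page, score in sorted(pagerank(example).items()):
        print(page, round(score, 4))
```

In this small graph, page C is pointed to by both A and B and ends up with the highest rank, which matches the idea that a page ranks highly when highly ranked pages point to it.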

Intuitively, the approach can be viewed as a stochastic analysis of a random walk on the web graph. The first term on the right-hand side of the equation is the probability that a random web surfer arrives at a page p by typing the URL or from a bookmark, or because the page is the surfer's homepage. Here d is the probability that the surfer chooses a URL directly, rather than traversing a link, and 1 - d is the probability that a person arrives at a page by traversing a link. The second term on the right-hand side of the equation is the probability of arriving at a page by traversing a link.

Hubs and Authorities

Hubs and authorities can be viewed as fans and centers in a bipartite core of a web graph, where the fans represent the hubs and the centers represent the authorities.
