Transcription of Extracting data from XML
Extracting data from XML
Wednesday
DTL

Parsing - XML package
2 basic models - DOM & SAX

Document Object Model (DOM)
    Tree stored internally as C, or as regular R objects
    Use XPath to query nodes of interest and extract information
    Write recursive functions to "visit" nodes, extracting information as they descend the tree
    Extract information to R data structures via handler functions that are called for particular XML elements by matching XML name

SAX
    For processing very large XML files with a low-level state machine via R handler functions

Approach
    DOM (with internal C representation and XPath)

Given a node, several operations
    xmlName() - element name (optionally with namespace prefix)
    xmlNamespace()
    xmlAttrs() - all attributes
    xmlGetAttr() - particular value
    xmlValue() - get text content
    node[[ i ]], node[[ "el-name" ]] - child by position or by element name
    xmlSApply()
    xmlNamespaceDefinitions()

Examples
    Scraping HTML - (you name it!)
    Zillow - house price estimates
    PubMed articles/abstracts
    European Bank exchange rates
    iTunes - CDs, tracks, play lists, ...
    PMML - predictive modeling markup language
    CIS - Current Index of Statistics/Google Scholar
    Google - Page Rank, Natural Language Processing
    Wikipedia - History of changes, ...
    SBML - Systems biology markup language
    Books - Docbook
    SOAP - eBay, KEGG
    Yahoo Geo/places - given name, get most likely location

PubMed
Professionally archived collection of "medically-related" information, including
    article abstracts
    submission, acceptance and publication dates
We'll use a sample PubMed example article.
Can get a very large, rich <ArticleSet> with many articles via an HTTP query done from within the R/XML package.
Take a look at the data, see what is available, or read the documentation.
Or explore the document:
    doc = xmlTreeParse(" ", useInternal = TRUE)
    top = xmlRoot(doc)
    xmlName(top)
    [1] "ArticleSet"
    names(top)    - child nodes of this root
    [1] "Article" "Article"    - so 2 articles in this document
Let's fetch the author list for each article.
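Before going further, the exploration steps above can be tried without the PubMed file. The following is a minimal sketch, assuming a small inline document: the titles, the author names and the lang attribute are invented for illustration and are not taken from the real sample. It uses only node operations named earlier (xmlName, xmlValue, xmlGetAttr, xmlSApply and [[ ]] indexing).

```r
library(XML)

# Hypothetical stand-in for the PubMed sample file; all values are invented.
# Built without whitespace between tags so that every child is an element node.
txt <- paste0(
  "<ArticleSet>",
  "<Article>",
  "<ArticleTitle lang='en'>First title</ArticleTitle>",
  "<AuthorList>",
  "<Author><FirstName>Ann</FirstName><LastName>Lee</LastName></Author>",
  "<Author><FirstName>Bo</FirstName><LastName>Kim</LastName></Author>",
  "</AuthorList>",
  "</Article>",
  "<Article>",
  "<ArticleTitle lang='en'>Second title</ArticleTitle>",
  "<AuthorList>",
  "<Author><FirstName>Cy</FirstName><LastName>Roe</LastName></Author>",
  "</AuthorList>",
  "</Article>",
  "</ArticleSet>")

doc <- xmlParse(txt, asText = TRUE)   # parse a string into an internal (C-level) tree
top <- xmlRoot(doc)

root_name   <- xmlName(top)                        # element name of the root
child_names <- unname(xmlSApply(top, xmlName))     # element names of the root's children
first       <- top[[1]]                            # child by position
title       <- first[["ArticleTitle"]]             # child by element name
title_text  <- xmlValue(title)                     # text content of the node
title_lang  <- xmlGetAttr(title, "lang")           # value of one attribute
n_authors   <- length(xmlChildren(first[["AuthorList"]]))  # number of <Author> children
```

Printing root_name, child_names and the other variables at the R prompt gives the same kind of overview as the names() calls on the real document.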
Do it first for just one and then use "apply" to iterate.
    names( top[[ 1 ]] )
              Journal    ArticleTitle       FirstPage
            "Journal"  "ArticleTitle"     "FirstPage"
             LastPage     ELocationID     ELocationID
           "LastPage"   "ELocationID"   "ELocationID"
             Language      AuthorList       GroupList
           "Language"    "AuthorList"     "GroupList"
        ArticleIdList         History        Abstract
      "ArticleIdList"       "History"      "Abstract"
           ObjectList
         "ObjectList"
    art = top[[ 1 ]][[ "AuthorList" ]]    - what we want
    names(art)
    [1] "Author" "Author" "Author" "Author" "Author" "Author"
    names(art[[1]])
    [1] "FirstName"   "MiddleName"  "LastName"    "Suffix"
    [5] "Affiliation"
So how do we get these values, to put in a data frame?
Each element is a node with text content, so loop over the nodes and get the content as a string:
    xmlSApply(art[[1]], xmlValue)
To do this for all authors of the article:
    xmlSApply(art, function(x) xmlSApply(x, xmlValue))
How do we deal with the different types of fields in the names? First, Middle, Last, Affiliation, CollectiveName.
This is a data representation/analysis question.

Dates
In the <History> element, have date received, accepted, aheadofprint.
May want to look at the publication lag (received to publication time).
So get these dates for all the articles.
    <History>
      <PubDate PubStatus="received">
        <Year>...</Year>
        <Month>06</Month>
        <Day>15</Day>
      </PubDate>
      <PubDate PubStatus="accepted">
        <Year>...</Day>
      </PubDate>
Find the element PubDate within History which has an attribute whose value is "received".
Can use art[["History"]][["PubDate"]], with art being one <Article> node here, to step down to the PubDate elements; note that [[ ]] by name returns the first matching child, so reaching all 3 takes more work.
But what if we want to access the 'received' dates for all the articles in a single operation, then the accepted, ...
Need a language to identify nodes with a particular characteristic/condition.

XPath
XPath is a language for expressing such node subsetting, with rich semantics for identifying nodes
    by name
    with specific attributes present
    with attributes with particular values
    with parents, ancestors, children
XPath = YALTL (Yet another language to learn)
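The kinds of selection listed above can be sketched on a small document shaped like the <History> fragment. This is only an illustration under stated assumptions: the two articles and all of their dates are invented, and pub_date() and dates_for() are hypothetical helper names, not part of the XML package. The package calls used are xmlParse(), getNodeSet(), xpathSApply(), xmlValue() and [[ ]] indexing.

```r
library(XML)

# Hypothetical helper: build one <PubDate> element as text (dates are invented).
pub_date <- function(status, y, m, d)
  sprintf("<PubDate PubStatus='%s'><Year>%s</Year><Month>%s</Month><Day>%s</Day></PubDate>",
          status, y, m, d)

txt <- paste0(
  "<ArticleSet>",
  "<Article><History>",
  pub_date("received", "2008", "06", "15"),
  pub_date("accepted", "2008", "07", "01"),
  "</History></Article>",
  "<Article><History>",
  pub_date("received", "2008", "05", "02"),
  pub_date("accepted", "2008", "06", "20"),
  "</History></Article>",
  "</ArticleSet>")

doc <- xmlParse(txt, asText = TRUE)

# Hypothetical helper: select every PubDate with the given PubStatus value,
# across all articles at once, and convert each to an R Date.
dates_for <- function(doc, status) {
  path  <- sprintf("//History/PubDate[@PubStatus='%s']", status)
  nodes <- getNodeSet(doc, path)
  as.Date(vapply(nodes, function(x)
    paste(xmlValue(x[["Year"]]), xmlValue(x[["Month"]]), xmlValue(x[["Day"]]), sep = "-"),
    character(1)))
}

received <- dates_for(doc, "received")
accepted <- dates_for(doc, "accepted")
lag_days <- as.numeric(accepted - received)   # days from received to accepted, per article

# Selecting attribute values rather than nodes: every PubStatus in document order.
statuses <- as.character(xpathSApply(doc, "//PubDate/@PubStatus"))
```

A single XPath expression replaces the per-article loop: the same dates_for() call works unchanged whether the document holds two articles or thousands.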
XPath language
    /node - top-level node
    //node - node at any level
    node[@attr-name] - node that has an attribute named "attr-name"
    node[@attr-name='bob'] - node that has attribute named attr-name with value 'bob'
    node/@x - value of attribute x in node with such an attribute
Result is a collection of nodes, attributes, ...
Let's find the date when the articles were received:
    nodes = getNodeSet(top, "//History/PubDate[@PubStatus='received']")
2 nodes - 1 per article
Extract year, month, day:
    lapply(nodes, function(x) xmlSApply(x, xmlValue))
Easy to get date "accepted" and "aheadofprint".

Text mining of abstract
Content of abstract as words:
    abstracts = xpathApply(top, "//Abstract", xmlValue)
Now, break up into words, stem the words, remove the stop-words:
    abstractWords = lapply(abstracts, strsplit, "[[:space:]]+")
    library(Rstem)
    # strsplit() returns a list of length 1, so stem its first element
    abstractWords = lapply(abstractWords, function(x) wordStem(x[[1]]))
Remove stop words (stopWords is a character vector of stop words, assumed to be defined already):
    lapply(abstractWords, function(x) x[!(x %in% stopWords)])

Zillow - house prices
Thanks to Roger, yesterday evening I found the Zillow XML API (Application Programming Interface).
Can register with Zillow, make queries to find estimated house prices for a given house, comparables, demographics, ...
Put address, city-state-zip & Zillow login in URL request.
Can put this at the end of a URL within xmlTreeParse():
    " 's%20 Way&citystatezip=Berkeley"
But spaces are problematic, as are other characters. So I use
    library(RCurl)
    reply = getForm(" ",
                    'zws-id' = "AB-XXXXXXXXXXX_10312q",
                    address = "1093 Zuchini Way",
                    citystatezip = "Berkeley, CA, 94212")
reply is text from the Web server containing XML:
    <?xml version=\" \" encoding=\"utf-8\"?>\n
    <SearchResults:searchresults xsi:schemaLocation=\" /vstatic/71a179109333d30cfb3b2de866d9add9/static/ \" xmlns:xsi=\" \" xmlns:SearchResults=\" \">\n\n
     <request>\n
     <address>112 Bob's Way Avenue</address>\n
     <citystatezip>Berkeley, CA, 94212</citystatezip>\n
     </request>\n \n
     <message>\n
     <text>Request successfully processed</text>\n
     <code>0</code>\n\t\t\n
     </message>\n\n \n
     <response>\n
    \t\t<results>\n\t\t\t\n
    \t\t\t<result>\n
    \t\t\t\t\t<zpid>24842792</zpid>\n
    \t<links>\n
    \t\t<homedetails> </homedetails>\n
    \t\t<graphsanddata> **&s_cid=Pa-Cv-X1-CLz1carc3c49ms_htxqb&partner=X1-CLz1carc3c49ms_htxqb</graphsanddata>\n
    \t\t<mapthishome> #src=url&s_cid=Pa-Cv-X1-CLz1carc3c49ms_htxqb&partner=X1-CLz1carc3c49ms_htxqb</mapthishome>\n
    \t\t<myestimator> </myestimator>\n
    \t\t<myzestimator deprecated=\"true\"> </myzestimator>\n
    \t</links>\n
    \t<address>\n
    \t\t<street>1292 Bob's way</street>\n
    \t\t<zipcode>94</zipcode>\n
    \t\t<city>Berkeley</city>\n
    \t\t<state>CA</state>\n
    \t\t<latitude> </latitude>\n
    \t\t<longitude> </longitude>\n
    \t</address>\n\t\n\t\n
    \t<zestimate>\n
    \t\t<amount currency=\"USD\">803000</amount>\n
    \t\t<last-updated>07/14/2008</last-updated>\n\t\t\n\t\t\n
    \t\t\t<oneWeekChange deprecated=\"true\"> </oneWeekChange>\n\t\t\n\t\t\n
    \t\t\t<valueChange currency=\"USD\" duration=\"31\">-33500</valueChange>\n\t\t\n\t\t\n
    \t\t<valuationRange>\n
    \t\t\t<low currency=\"USD\">650430</low>\n
    \t\t\t<?