
Project Report and Technical Documentation - SourceForge


Project Report and Technical Documentation
Thomas Jund <info@jund.ch>
Andrew Mustun <andrew@mustun.com>
Laurent Cohn <info@cohn.ch>
24th May 2004, Version 1.0

Abstract

In this paper we present quaneko, a tool to efficiently find data on the local computer system. The purpose of this document is the technical specification and description of the


A tool to efficiently ... the technical specification ... not a ... engineer an efficient index, which is the heart of the project, is a ..., there has to be a flexible handling for file types that are used in a typical modern o..., quaneko can be configured to parse any ... information and file downloads for quaneko are ...

Contents: ... of this Document; ...'s Goals; ... Overview; ... Library; ... Register; Different Types of Indexes; ..., Linux, Mac OS X; ... (32 bit); License; Development Platform; Programming Language; GUI Toolkit; A C++ ...; B Glossary; C References; D Index; E About the Authors

1 Introduction

Whenever you find that you are on the side of the majority, it is time to reform.

(Mark Twain)

This chapter briefly introduces the project and de... , the skill to retrieve information in a ... , while searching and browsing the Internet has become a matter of course, finding the right file on the local hard disk can still prove to be a ... to keep track of the various directories and finding older ... is possible to search for files using the standard search tools provided by the operating system (... 'Search for Files and Folders' under Windows or the find/grep commands under Unix systems). ...: quaneko creates an index over the ... , searching for a word takes only seconds, no matter how many files need to be searched. quaneko can search any format type as long as there is a tool available and con... file format is as easy as configuring a ... nutshell, quaneko was built with search performance and flexible ... designed to keep track of user specified directories and ... , quaneko lists all ...

... based on a description that was originally worded by Karl Rege: Disk capacity grows steadily and so does the need for tools that allow to organize and search for ... are ... , finding the appropriate keywords isn't ... file formats that store the text in Unicode or in a ... , possible with generation of index, over files in common formats (doc, pdf, html, etc.)

For keywords similar like Google does it (single keywords only).

... structured in three parts: In the first part we analyze the problem (section 2). The second part is about the design and implementation of quaneko (sections 3 - 9). Finally we discuss the evaluation and conclusion of the project in the third part (section 13).

The objective of this document is to reveal the internal structure and design of quaneko. Further, it contains the specification for the C++ ...

... type for i in *.dvi; do xdvi $i; done in a GUI? (... interfaces.)

...'s point of view. It briefly describes the user's goals, how they currently solve the problem of finding documents, ... files in specific formats (... company internal format or a format used in a certain profession such as a CAD file format).

C Software ...'s the performance (a few seconds for a search query) and that the search can be configured to include the user's favorite file types such as doc, html, pdf, txt, ... not very flexible and require a ... definition of three interfaces:

- A GUI for user groups A and B.
- A command line interface (CLI) for user group D.
- A simple C++ ...

... (GUI, CLI, API) must offer the following functionality:

- Creating new indexes.
- Parsing files and directories into the index.
- Updating existing indexes.
- Querying an index for a ...

... offers a user friendly way to adjust application options and to configure the format ... , the GUI must allow the user to accomplish the following tasks:

- Configuring format filters (Add/Remove/Edit).
- Optional: Preview of search results.

- Open search results in a specifi...

... , the command line interface must offer switches for:

- Removing individual files or directories from an existing index.

C++ ... detailed documentation of the API, please refer to the API reference documentation in appendix A, ...

... intended to work under all major Unix systems as well as Windows (32 bit) and Mac OS X ... parts of quaneko are implemented in C++ ... , the Qt toolkit [8] is chosen to offer the required portability in combination with C++.

The third tier stores the index and holds the data files that are ... file system as the third ...

Figure 1: Architecture Overview

For a detailed description of the User Interfaces, please refer to the Quaneko User Manual [1]. The data files and the index over the data files reside on the local hard ...

Figure 2: Core Library, Components

The core library contains all core functionality of quaneko.

The following list describes the most important components of the Core Library:

- The Filter module converts files in various formats into plain text (section 4).
- The Stemmer module applies stemming to words (section 5).
- The Parser extracts words from files and directories through the Filter Interface and uses the Index Handler to store the information about words and ... , the Parser is responsible to manage the file register and the word register (section 6).
- The Index Handler persistently manages the links between words and files (section 7). ... , please refer to the API reference documentation in section A.
- The Settings module stores and handles meta information about every index as well as user preferences (section 9).
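As a rough illustration of how the components listed above could work together, the following Python sketch wires a filter, a stemmer and an index into one pass over a set of files. It is not quaneko's C++ API: the function names, the in-memory "files" and the trivial stand-in components are all assumptions made for this example.

```python
from collections import defaultdict


def build_index(files, to_text, stem):
    """Build a word -> set-of-paths mapping from a {path: raw content} dict.

    to_text plays the role of the Filter module and stem the role of the
    Stemmer module; the returned dict stands in for the Index Handler.
    """
    index = defaultdict(set)
    for path, raw in files.items():
        text = to_text(path, raw)          # Filter: any format -> plain text
        for word in text.split():          # Parser: extract words
            index[stem(word)].add(path)    # Index Handler: link word -> file
    return dict(index)


def query(index, word, stem):
    """Return the sorted paths linked to the stemmed form of the search word."""
    return sorted(index.get(stem(word), set()))


# Hypothetical in-memory "files" and trivial stand-in components.
files = {"/data/a.txt": "Cycling tours", "/data/b.txt": "city tours"}
index = build_index(files, to_text=lambda path, raw: raw, stem=str.lower)
print(query(index, "TOURS", str.lower))
```

The same stemming function is applied while indexing and while querying, so that both sides agree on the form of a word.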

4 Filter Module

The filter module manages the format filters for converting a ... non-text files can be converted into plain text with a command line utility called 'pdf2txt'. The filter module maintains a list of all available format filters and calls the third ...

- The path to a file in any kind of format (txt, doc, html, pdf, ..).
- A set of format filters (read from the application wide configuration file).

If a filter is available to convert the given file into plain text, a plain text file is produced. If there is no filter available for that format, an error is handed back to the caller and no output is ...

... based on a set of stemmers that are available from the Snowball [3] ... , stemming can be enabled for exactly one language or disabled at the time a ...

Text              Language  Before stemming  After stemming  Reduction
RFC (Technical)   en        47000            39500           16%
Moby Dick         en        19500            13000           33%
Goethes Faust     de        13500            10000           26%

Table 1: Reduction of unique words found in a ...

... read from the data files as well as to the search word entered by the user.
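The filter lookup described in section 4 above can be sketched in Python as follows. This is only an illustration under stated assumptions: the configuration format, the `{input}` placeholder, the `NoFilterError` name and the way 'pdf2txt' is invoked are all hypothetical and not taken from quaneko.

```python
import os


class NoFilterError(Exception):
    """Raised when no format filter is configured for a file type."""


def filter_command(path, filters):
    """Return the argv list of the filter that converts path to plain text.

    filters maps a lower-case extension (without the dot) to a command
    template; the token "{input}" is replaced by the file path.
    """
    extension = os.path.splitext(path)[1].lstrip(".").lower()
    try:
        template = filters[extension]
    except KeyError:
        # No filter for this format: hand an error back to the caller.
        raise NoFilterError(f"no filter configured for '{path}'") from None
    return [path if token == "{input}" else token for token in template]


# Hypothetical configuration; the real argument syntax of the tool may differ.
FILTERS = {"pdf": ["pdf2txt", "{input}"]}
print(filter_command("report.PDF", FILTERS))
```

The returned list could be handed to `subprocess.run` to execute the external tool; the sketch stops before that step so that it runs without any filter installed.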

If the user searches for example for the word 'cycling', quaneko looks in the index for 'cycl' which will find all files containing 'cycle', 'cycling', 'cycles' and 'cycled'. At the time of writing this document, stemming is supported for the following languages: Danish (da), German (de), English (en), Spanish (es), Finnish (fi), French (fr), Italian (it), Dutch (nl), Norwegian (no), Portuguese (pt), Russian (ru), Swedish (sv).

6 Parser Module

The parser is ... reads the plain text output of a filter and adds all words that are ... , it can ignore all numbers and always strips away any whitespace or special characters from the beginning and end of a ... files that are no longer available on disk, adds new files in directories that were previously parsed and updates files in which the word ...

... built on three different types of indexes.
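The word clean-up of the parser and the 'cycling' example from the stemming section can be illustrated with a small Python sketch. The suffix rules below are a toy stand-in, not the Snowball algorithm, and the lower-casing of tokens is an assumption of this example; the function names are hypothetical.

```python
import string

SUFFIXES = ("ing", "es", "ed", "s", "e")  # toy rules, NOT the Snowball algorithm


def toy_stem(word):
    """Strip the first matching suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def clean_tokens(text, ignore_numbers=True):
    """Split text on whitespace and strip special characters from both ends."""
    tokens = []
    for raw in text.split():
        # Lower-casing is an assumption of this sketch (case-insensitive index).
        token = raw.strip(string.punctuation).lower()
        if not token:
            continue
        if ignore_numbers and token.isdigit():
            continue
        tokens.append(token)
    return tokens


words = clean_tokens('"Cycling," she said: 42 cycles; he cycled (one cycle).')
print([toy_stem(w) for w in words])
```

With these toy rules, 'cycle', 'cycling', 'cycles' and 'cycled' all reduce to 'cycl', so a query for any of them reaches the same index entry.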

- Direct Index.
- Inverted Index.
- ...

... , two registers keep track of the words and files that have been indexed:

- Word Register
- File Register

Figure 3 ... 3 is saved with an Index ID and a Word ID. The Index ID specifies which index links the word to a ... ID is ... ID is unique and referred to as Full Word ID. ... , the word register is ...

...: Example for a Word ...

... file or directory has a unique File ID. The modification time of the file is also stored to detect changes of the ... (00:00:00 UTC, January 1, 1970).

sh31083360145/home/tux/data/ ...

...: Example for a ...

... 4: In this example, the word 'aardvark' is directly linked to the only file it ... used for words which occur in one ... ID equals the File ID and the Index ID is set to -1 (see Figure 3).
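The two registers and the Direct Index case described above can be sketched in Python as follows. The class names, field names and register contents are hypothetical; only the rule that a directly indexed word carries Index ID -1 and a Word ID equal to the File ID comes from the text.

```python
from dataclasses import dataclass

DIRECT_INDEX = -1  # Index ID the text gives for words found in a single file


@dataclass(frozen=True)
class FullWordId:
    index_id: int  # which index links the word to its files
    word_id: int   # equals the File ID when the word is directly indexed


@dataclass(frozen=True)
class FileEntry:
    mtime: int  # modification time, seconds since 00:00:00 UTC, 1 January 1970
    path: str


def files_for(word, word_register, file_register):
    """Resolve a word to file paths; only the Direct Index case is sketched."""
    full_id = word_register.get(word)
    if full_id is None:
        return []
    if full_id.index_id == DIRECT_INDEX:
        return [file_register[full_id.word_id].path]
    raise NotImplementedError("other index types are not sketched here")


def has_changed(entry, current_mtime):
    """A file needs re-parsing when its stored modification time differs."""
    return entry.mtime != current_mtime


# Hypothetical register contents.
file_register = {7: FileEntry(mtime=1000000000, path="/tmp/example/animals.txt")}
word_register = {"aardvark": FullWordId(index_id=DIRECT_INDEX, word_id=7)}
print(files_for("aardvark", word_register, file_register))
```

Because the Word ID doubles as the File ID, a directly indexed word needs no separate index entry at all, which is what makes this case cheap.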

Our research has shown that between 30% and 60% of all words appear in only one file and can therefore be indexed with this efficient method (see Figure 7).

[Plot: x-axis "Matching files (x)", y-axis "% of words that match x files"]

Figure 5: An example for the distribution of words in 1000 RFC ... only found in one ...

... 6: An Inverted Index links a word to a number of ... than one file (note: for words that occur in many files, the Array Index is used instead, ...).

Each row of an Inverted Index stores a ... (not stored in the file) indicates the Word ... can store ... a word occurs in two files, it is indexed in Inverted Index 0, which can store ... third file is found which also contains the same word, the word moves to inverted index 1 which can store ... can be any number of Inverted Indexes in quaneko.
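The way a word moves between indexes as it is found in more files can be sketched in Python as follows. The capacities of the individual Inverted Indexes are not recoverable from the text, so the list used here is an assumption; only the steps "one file: Direct Index (-1)", "two files: Inverted Index 0" and "third file: Inverted Index 1" follow from the description above. The function name is hypothetical.

```python
DIRECT_INDEX = -1  # Index ID for words that occur in exactly one file


def index_id_for(file_count, capacities):
    """Return the ID of the index that would hold a word found in file_count files.

    capacities[i] is the assumed maximum number of files per word in
    Inverted Index i. None means the word no longer fits any Inverted Index
    and would go to the Array Index, which is not sketched here.
    """
    if file_count < 1:
        raise ValueError("a word must occur in at least one file")
    if file_count == 1:
        return DIRECT_INDEX
    for index_id, capacity in enumerate(capacities):
        if file_count <= capacity:
            return index_id
    return None


# Hypothetical capacities: only the limit of two files for Inverted Index 0
# follows from the text; the remaining values are made up for illustration.
CAPACITIES = [2, 4, 8]
for count in (1, 2, 3):
    print(count, index_id_for(count, CAPACITIES))
```

Keeping the capacities in a plain list reflects the statement that there can be any number of Inverted Indexes.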

