
Project Report and Technical Documentation




[...] A tool to efficiently [...] the technical speci[...] not a [...] engineer an efficient index, which is the heart of the project, is a [...], there has to be a flexible handling for file types that are used in a typical modern o[...], quaneko can be configured to parse any [...] information and file downloads for quaneko are [...]

Contents (fragmentary): [...] of this Document; [...]'s Goals; [...] Overview; [...] Library; [...] Register; Different Types of Indexes; [...], Linux, Mac OS X; [...] (32 bit); License; Development Platform; Programming Language; GUI Toolkit; A C++ [...]; B Glossary; C References; D Index; E About the Authors

1 Introduction

Whenever you find that you are on the side of the majority, it is time to reform. (Mark Twain)

This chapter briefly introduces the project and de[...], the skill to retrieve information in a [...], while searching and browsing the Internet has become a matter of course, finding the right file on the local hard disk can still prove to be a [...] to keep track of the various directories and [...] finding older [...] is possible to search for files using the standard search tools provided by the operating system ([...] 'Search for Files and Folders' under Windows or the find/grep commands under Unix systems).

[...]: quaneko creates an index over the [...], searching for a word takes only seconds, no matter how many files need to be searched. quaneko can search any format type as long as there is a tool available and con[...] file format is as easy as configuring a [...] nutshell, quaneko was built with search performance and flexible [...] designed to keep track of user specified directories and [...], quaneko lists all [...]

[...] based on a description that was originally worded by Karl Rege: Disk capacity grows steadily and so does the need for tools that allow to organize and search for [...] are [...], finding the appropriate keywords isn't [...] file formats that store the text in Unicode or in a [...], possible with generation of index, over files in common formats (doc, pdf, html, etc.) for keywords similar like Google does it (single keywords only).

[...] structured in three parts:
- In the first part we analyze the problem (section 2).
- The second part is about the design and implementation of quaneko (sections 3 - 9).
- Finally we discuss the evaluation and conclusion of the project in the third part (section 13).

The objective of this document is to reveal the internal structure and design of quaneko. Further, it contains the specification for the C++ [...]

[...] type: for i in *.dvi; do xdvi $i; done [...] in a GUI? ([...] interfaces.)

[...]'s point of view. It briefly describes the user's goals, how they currently solve the problem of finding documents, [...] files in specific formats ([...] company internal format or a format used in a certain profession such as a CAD file format).

C Software [...]'s [...] the performance (a few seconds for a search query) and that the search can be configured to include the user's favorite file types such as doc, html, pdf, txt, [...] not very flexible and require a [...] definition of three interfaces:
- A GUI for user groups A and B.
- A command line interface (CLI) for user group D.
- A simple C++ [...]

[...] (GUI, CLI, API) must offer the following functionality:
- Creating new indexes.
- Parsing files and directories into the index.
- Updating existing indexes.
- Querying an index for a [...]

[...] offers a user friendly way to adjust application options and to configure the format [...], the GUI must allow the user to accomplish the following tasks:
- Configuring format filters (Add/Remove/Edit).
- Optional: Preview of search results.
- Open search results in a speci[...]

[...], the command line interface must offer switches for:
- Removing individual files or directories from an existing index.

[...] C++ [...] detailed documentation of the API, please refer to the API reference documentation in appendix A, [...] intended to work under all major Unix systems as well as Windows (32 bit) and Mac OS X [...] parts of quaneko are implemented in C++ [...], the Qt toolkit [8] is chosen to offer the required portability in combination with C++. The third tier stores the index and holds the data files that are [...] file system as the third [...]

[Figure] 1: Architecture Overview

For a detailed description of the User Interfaces, please refer to the Quaneko User Manual [1]. The data files and the index over the data files reside on the local hard [...]

[Figure] 2: Core Library, Components

The core library contains all core functionality of quaneko. The following list describes the most important components of the Core Library:
- The Filter module converts files in various formats into plain text (section 4).
- The Stemmer module applies stemming to words (section 5).
- The Parser extracts words from files and directories through the Filter Interface and uses the Index Handler to store the information about words and [...], the Parser is responsible to manage the file register and the word register (section 6).

- The Index Handler persistently manages the links between words and files (section 7). [...], please refer to the API reference documentation in section A.
- The Settings module stores and handles meta information about every index as well as user preferences (section 9).

4 Filter Module

The filter module manages the format filters for converting a non-text [...] files can be converted into plain text with a command line utility called 'pdf2txt'. The filter module maintains a list of all available format filters and calls the third [...]
- The path to a file in any kind of format (txt, doc, html, pdf, ..).
- A set of format filters (read from the application wide configuration file).

If a filter is available to convert the given file into plain text, a plain text file is produced. If there is no filter available for that format, an error is handed back to the caller and no output is [...]

[...] based on a set of stemmers that are available from the Snowball [3] [...], stemming can be enabled for exactly one language or disabled at the time a [...]

Document                 Language  Before stemming  After stemming  Reduction
RFC (Technical [...])    en        47000            39500           16%
Moby Dick                en        19500            13000           33%
Goethes Faust            de        13500            10000           26%

Table 1: Reduction of unique words found in a [...]

[...] read from the data files as well as to the search word entered by the user.
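The point of applying the same stemmer to indexed words and to the search word can be illustrated with a small sketch. It is not quaneko's code and does not use the Snowball library: `toyStem` is a hypothetical stand-in that strips a few English suffixes, and `StemmedIndex` is an assumed in-memory map, chosen only to show that a query matches every word form that shares its stem.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for a real stemmer: strips a few English suffixes.
// This is NOT the Snowball algorithm, only enough to illustrate the idea.
std::string toyStem(const std::string& word) {
    static const std::vector<std::string> suffixes = {"ing", "es", "ed", "e", "s"};
    for (const std::string& suffix : suffixes) {
        // Keep at least three characters of the word after stripping.
        if (word.size() > suffix.size() + 2 &&
            word.compare(word.size() - suffix.size(), suffix.size(), suffix) == 0) {
            return word.substr(0, word.size() - suffix.size());
        }
    }
    return word;
}

// Hypothetical in-memory index keyed by stem (not quaneko's on-disk format).
class StemmedIndex {
public:
    // Called while parsing: the word is stemmed before it is stored.
    void addWord(const std::string& word, const std::string& file) {
        filesByStem_[toyStem(word)].insert(file);
    }

    // Called while searching: the query is stemmed the same way.
    std::set<std::string> search(const std::string& query) const {
        auto it = filesByStem_.find(toyStem(query));
        if (it == filesByStem_.end()) {
            return {};
        }
        return it->second;
    }

private:
    std::map<std::string, std::set<std::string>> filesByStem_;
};
```

With this sketch, adding "cycles" for one file and "cycled" for another and then searching for "cycling" returns both files, because all three forms reduce to the same stem. Storing only stems is also what reduces the number of unique words in the index.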

If the user searches for example for the word 'cycling', quaneko looks in the index for 'cycl' which will find all files containing 'cycle', 'cycling', 'cycles' and 'cycled'. At the time of writing this document, stemming is supported for the following languages:
- Danish (da)
- German (de)
- English (en)
- Spanish (es)
- Finnish (fi)
- French (fr)
- Italian (it)
- Dutch (nl)
- Norwegian (no)
- Portuguese (pt)
- Russian (ru)
- Swedish (sv)

6 Parser Module

The parser is [...] reads the plain text output of a filter and adds all words that are [...], it can ignore all numbers and always strips away any whitespace or special characters from the beginning and end of a [...] files that are no longer available on disk, adds new files in directories that were previously parsed and updates files in which the word [...]

[...] built on three different types of indexes:
- Direct Index.
- Inverted Index.
- [Array Index.]

[...], two registers keep track of the words and files that have been indexed:
- Word Register
- File Register

Figure 3 [...]

[...] 3 [...] is saved with an Index ID and a Word ID. The Index ID specifies which index links the word to a [...] ID is [...] ID is unique and referred to as Full Word ID.

[...], the word register is [...]

[...]: Example for a Word [...]

[...] file or directory has a unique File ID. The modification time of the file is also stored to detect changes of the [...] (00:00:00 UTC, January 1, 1970).

[...] sh31083360145/home/tux/data/[...]

[...]: Example for a [...]

[Figure] 4: In this example, the word 'aardvark' is directly linked to the only file it [...]

[...] used for words which occur in one [...] ID equals the File ID and the Index ID is set to -1 (see Figure 3). Our research has shown that between 30% and 60% of all words appear in only one file and can therefore be indexed with this efficient method (see Figure 7).

[Plot; y-axis: % of words that match x files; x-axis: Matching files (x)]

Figure 5: An example for the distribution of words in 1000 RFC [...] only found in one [...]

[Figure] 6: An Inverted Index links a word to a number of [...] than one file (note: for words that occur in many files, the Array Index is used instead, [...]).

Each row of an Inverted Index stores a [...] (not stored in the file) indicates the Word [...] can store [...] a word occurs in two files, it is indexed in Inverted Index 0, which can store [...] third file is found which also contains the same word, the word moves to inverted index 1 which can store [...] can be any number of Inverted Indexes in quaneko.
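How a word moves between the index types as it is found in more files can be sketched as follows. The sketch assumes the capacity rule the report gives for inverted indexes (Inverted Index n links a word to up to 2^(n+1) files) and the Index ID of -1 for the Direct Index. The function names and the `kArrayIndexId` sentinel are hypothetical and not taken from the quaneko API.

```cpp
#include <cstdint>

// Index ID used for the Direct Index, as stated in the report.
constexpr int kDirectIndexId = -1;
// Hypothetical sentinel meaning "use the Array Index"; not a quaneko value.
constexpr int kArrayIndexId = -2;

// Capacity rule from the report: Inverted Index n holds up to 2^(n+1)
// file IDs per word (index 0 holds 2, index 1 holds 4, and so on).
constexpr std::uint64_t invertedCapacity(int indexId) {
    return std::uint64_t{1} << (indexId + 1);
}

// Pick the smallest index that can hold all file IDs of one word.
// invertedIndexCount is the number of configured Inverted Indexes.
int chooseIndex(std::uint64_t fileCount, int invertedIndexCount) {
    if (fileCount <= 1) {
        return kDirectIndexId;  // word occurs in a single file
    }
    for (int id = 0; id < invertedIndexCount; ++id) {
        if (fileCount <= invertedCapacity(id)) {
            return id;
        }
    }
    return kArrayIndexId;  // too many files for any Inverted Index
}
```

Under these assumptions a word found in two files lands in Inverted Index 0, a third file moves it to Inverted Index 1, and with six inverted indexes a word found in more than 64 files falls through to the Array Index.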

However, in most cases, having more than 8 Inverted Indexes proves to be ine[...] up to 2^(IndexID+1) [...] finding all File IDs to a belonging Word [...] find all words that occur in a certain [...] different [...] requires as many bits of memory as there are [...] file that does not contain the word, a '0' is stored and for every file that contains the word a '1'.

Figure 7: The Array Index is a bitmap in which every column accounts for a File ID and every row for a Word [...]

[...] a word occurs in a file, the bit at position (Word ID/File ID) is set to 1, [...] file needs to be horizontally resized when no File IDs are [...] files that they end up in the Array Index are never removed from it [...]

8 Different Types of Indexes

Our first approach of creating an index that links each word to a number of files was a simple bitmap ([...]). When tests showed that a lot of words appear in only one or two files (see Figure 7, page 14), [...], 2000 RFC files with a [...]

[Plot; axes: Index Size in MB, Number of Inverted Indexes]

Figure 8 [...], that each Inverted Index can link up to 2^(n+1) files to a [...] first Inverted Index (0) can link each word in it to up to 2^1 = 2 [...] to up to 2^11 = 2048 files (for this example, this would mean that the Array Index is never used).
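The bitmap addressing of the Array Index described above (one row per Word ID, one column per File ID, a set bit meaning "this word occurs in this file") can be sketched in memory as follows. This is only an illustration under assumptions: quaneko keeps its Array Index in a file, and the class and method names here are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// In-memory sketch of a word-by-file bitmap: rows are Word IDs,
// columns are File IDs. It mirrors the addressing only, not the file format.
class ArrayIndexSketch {
public:
    explicit ArrayIndexSketch(std::size_t fileCapacity) : columns_(fileCapacity) {}

    // Mark that the word occurs in the file, growing the bitmap if needed.
    void set(std::size_t wordId, std::size_t fileId) {
        if (fileId >= columns_) {
            resizeColumns(fileId + 1);  // "horizontal" resize: more File IDs
        }
        if (wordId >= rows_.size()) {
            rows_.resize(wordId + 1, std::vector<bool>(columns_, false));
        }
        rows_[wordId][fileId] = true;
    }

    bool contains(std::size_t wordId, std::size_t fileId) const {
        return wordId < rows_.size() && fileId < columns_ && rows_[wordId][fileId];
    }

    // Collect all File IDs whose bit is set in the row of the given word.
    std::vector<std::size_t> filesFor(std::size_t wordId) const {
        std::vector<std::size_t> result;
        if (wordId >= rows_.size()) {
            return result;
        }
        for (std::size_t fileId = 0; fileId < columns_; ++fileId) {
            if (rows_[wordId][fileId]) {
                result.push_back(fileId);
            }
        }
        return result;
    }

private:
    void resizeColumns(std::size_t newColumns) {
        for (auto& row : rows_) {
            row.resize(newColumns, false);
        }
        columns_ = newColumns;
    }

    std::size_t columns_;
    std::vector<std::vector<bool>> rows_;
};
```

Every row costs one bit per known file regardless of how many files contain the word, which is why a plain bitmap is wasteful for the many words that occur in only one or two files.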

Table 4 shows a more [...]: Precon[...], only 6 Inverted Indexes are [...] configured in the settings file of quaneko ([...]).

9 Settings

The configuration of quaneko is saved in a settings [...], this file is placed in the user's home directory. [...], it is not clearly defined where the user's [...] specified in either of the environment variables: HOME, USERPROFILE, [...] file is: C:\Documents and Settings\Tux\[...]

[...]: 4

/filter1type: File types [...] for the [...] file extensions this filter can handle separated by a [...]

/filter1app: The application used to convert the [...], %f marks the place where the data [...] the temporary plain text output is [...], the output of the command to stdout is redirected into the temporary plain text [...]:
    html2text %f -o %o
    pdf2txt %f
    cp %f %o

../filter1open: The application used to execute and open an indexed [...], %f marks the place where the data file path is [...]

[...]c Settings

Key / Description

/[...]file: Points to the file which contains the array index for the speci[...]: /tmp/[...]

[...]fileindex [...]file: Points to the file which contains the [...]: /tmp/TestIndex/[...]

[...]fileindex [...]file: Points to the file which contains the word [...] this value is 1, numbers are [...]

[...]file: Points to the file which contains the inverted index with the given ID (in this case 0).

[...]/invertedindex/0/max: Maximum number of file IDs stored for one word [...] file IDs stored for one word in the index with the given ID (currently not used).

../lock: 1 indicates that this index is currently in use (locked). 0 means the index is [...]

[...]: da, de, en, es, fi, fr, it, nl, no, pt, ru and sv. If left empty, stemming is [...]

[9, ...] The testing took approximately 50% of the total development e[...] have tested the whole application (black box) using various test cases to cover di[...] ful[...] defined in the testing script:

(1) .. Test --update File deletion
(2) .. Test --parse File deletion
(3) .. Test --parse All indexes
(4) .. Test --update All indexes, files deleted
(5) .. Test --parse All indexed, files deleted
(6) .. Test --update File content exchange, modification time
(7) .. Test --update File content exchange
(8) .. Test --update Files emptied
(9) .. Test --update File added
(10) .. [...]

[...] run and evaluated on functions from the core library. Critical functions are [...] an example output of a performance test:

TIMER[27]:started1 LDS_ArrayIndex::getFileSize: 0:000086
TIMER[28]:started0 LDS_ArrayIndex::resizeFile: 0:000000
TIMER[29] [...] LDS_ArrayIndex::getChar: [...] 78 [...]

[...] test for memory leaks, the tool Valgrind [10] [...] detailed description of the tool is [...]: valgrind --leak-check=yes --show-reachable=[...]

[...], Linux, Mac OS X

Under Unix compatible systems with gcc >= [...], building quaneko from sources is straightforward:
- Run 'make' in the quaneko directory to build the core and CLI
- Run 'make qtgui' for the GUI (requires Qt >= [...])

[...] (32 bit)

This section describes the installation of quaneko from sources on a 32 bit Windows platform ([...], 98, ME, XP, 2000).

