SUGI 27: Table Look-up: Techniques Beyond the Obvious

Paper 11-271 Table Lookup: Techniques Beyond the ObviousNancy Croonen, CC Training Services, Belgiumir. Henri Theuwissen, SOLID Partners, BelgiumABSTRACTT able lookup operations are often the most time consuming partof many SAS programs. In this paper we will combine two SASdata sets by using the value of a specific variable to locateinformation in an auxiliary or lookup SAS data set and add it toinformation from the primary SAS data set. Base SAS softwareoffers a broad range of Techniques to perform Table lookupoperations. Do you use the most Obvious technique or do youevaluate multiple Techniques and determine which technique isthe most efficient way to perform the lookup?INTRODUCTIONThis paper discusses seven different approaches to perform thetable lookup in terms of processing time and the complexity ofcoding.

The following programming Techniques , ranging fromfairly straightforward to more complicated but also from less tomore efficient, are discussed using examples:1. DATA step MERGE statement2. SQL inner join3. SQL subquery4. FORMAT procedure and PUT function5. SET statements and KEY = option6. CALL EXECUTE routine7. SQL INTO clauseBenchmarking results are summarized with graphs and paper addresses base SAS and is intended for inter mediateusers of following terms are frequently used throughout the paper: The primary file is the file for which you want to obtainauxiliary information. The lookup file is an auxiliary file that is maintainedseparately from the primary file and that is referenced forone or more of the observations of the primary file. The key variable is the variable or variables whose valuesare the common elements between the primary file and thelookup file.

Typically, key values are unique in the lookup filebut not necessarily unique in the primary file. The lookup result is the auxiliary information obtained usingthe key variable or variables as reference into the lookup SAS DATA SETSThe examples in this paper use the following input SAS data setscontaining fictitious data: The lookup SAS data set containsaddress information of all Belgian companies. The SAS dataset contains observations and the followingvariables:- VAT_NUMBER- COMPANY_NAME- STREET- POSTAL_CODE- CITYThe key variable VAT_NUMBER uniquely identifies acompany. Each VAT number only appears once in the inputSAS data set The SAS data set is a copyof the SAS data set , but contains aunique index on the variable VAT_NUMBER. The index is aseparate structure that contains the data values of thevariable VAT_NUMBER paired with a location identifier forthe observations containing the value.

The primary SAS data set containsa list of companies that are registered as bad debtors. TheSAS data set contains 200 observations and contains onlythe key variable VAT_NUMBER. Each VAT number onlyappears once in the SAS data set The SAS data set contains salesinformation in the year 2001 of all Belgian companies. TheSAS data set contains observations and thefollowing variables:- VAT_NUMBER- DATE- SALESThe combination of the variables VAT_NUMBER and DATE uniquely identifies a transaction. Each VAT number canappear multiple times in the SAS data set SAS data set contains an index on thevariable in and are related by common valuesfor VAT_NUMBER. There is a one-to-one relationship betweenthe primary SAS data set and the lookupSAS data set or the indexed lookup SASdata set which implies that eachvalue of the variable VAT_NUMBER occurs no more than once ineach SAS data set.

In other words, a single observation in theSAS data set is related to a singleobservation in the SAS data set or the SASdata set We will use the value ofthe key variable VAT_NUMBER from the primary SAS data to locate the associated addressinformation in the lookup SAS data set and add it to information from theSUGI 27 Advanced Tutorials2primary SAS data in the SAS data sets are also related by common values forVAT_NUMBER. There is a one-to-many relationship and which implies thateach value of the variable VAT_NUMBER occurs no more thanonce in the SAS data set but may occurmore than once in the SAS data set In otherwords, a single observation in BAD_DEBTOR may be related tomultiple observations in We will combine thetwo SAS data sets by using the value of the key variableVAT_NUMBER from the SAS data set tolocate the associated sales information in the SAS data SAS SYSTEM OPTIONSB efore getting started, we will turn on some SAS system optionsto make as much information as possible available about theexecution of the SAS programs: The FULLSTIMER system option writes all the systemperformance statistics to the LOG window.

The SAS Systemwrites to the LOG a complete list of computer resourcesused for each step and the entire SAS session. The type ofstatistics written varies with host systems. Computerresources can be measured in the following terms:- CPU time is the actual time spent on a task. The CPUtime is the amount of time the Central Processing Unituses to perform the requested tasks, includingcalculations, reading and writing data, ..- I/O is a measurement of the Input (read) and Output(write) operations as data and programs are movedfrom a storage device to memory (input) or frommemory to a storage or display device (output).- Memory is the size of the work area that the CPUrequires to hold the executable program modules, data,and buffers. The MSGLEVEL = system option controls the level of detailin messages that are written to the LOG = N prints notes, warnings, and error messagesonly.

This is the default. MSGLEVEL = I prints additionalnotes pertaining to index usage, merge processing, and sortutilities along with standard notes, warnings, and errormessages. The SYMBOLGEN system option writes the results ofresolving macro variable references to the LOG FULLSTIMER MSGLEVEL = I SYMBOLGEN;DATA STEP MERGE STATEMENTThe first technique , which is very easy to code, uses the MERGEand BY statements in a DATA step. The observations from theSAS data sets and are combined into a singleobservation in the new SAS data set according to the values ofthe common variable VAT_NUMBER. Before you can perform amatch-merge, both SAS data sets must be sorted by the variableVAT_NUMBER or they must have an index on the the following example, both input SAS data sets are notarranged in order of the values of the variable VAT_NUMBERand they also have no index on the variable VAT_NUMBER.

Sobefore match-merging the two SAS data sets in a DATA step, wemust sort the SAS data 1-APROC SORT DATA = OUT = COMPANY; BY VAT_NUMBER;RUN;PROC SORT DATA = OUT = BAD_DEBTOR; BY VAT_NUMBER;RUN;DATA MERGE_SORTED; MERGE BAD_DEBTOR (IN = BD) COMPANY; BY VAT_NUMBER; IF BD;RUN;The IN = data set option creates and names a variable BD thatindicates whether the input SAS data set BAD_DEBTOR contributed data to the current output observation. BD is atemporary numeric variable with values of 0 or 1. The value 0indicates that the input SAS data set BAD_DEBTOR did notcontribute to the current output observation whereas the value 1indicates that the input SAS data set BAD_DEBTOR contributedto the current output observation. The variable BD is available toprogramming statements during the DATA step, but is notincluded in the SAS data set being subsetting IF statement causes the DATA step to continueprocessing only those observations that meet the condition of theexpression specified in the IF statement.

Therefore, the resultingSAS data set contains only observations to which the input SASdata set BAD_DEBTOR no index exists on the variable VAT_NUMBER in neitherof both input SAS data sets, the observations are readsequentially in the order in which they appear in the input SASdata the following example, containsan index on the variable VAT_NUMBER. Before match-mergingthe two SAS data sets in a DATA step, we must only sort theinput SAS data set 27 Advanced Tutorials3 PROGRAM 1-BPROC SORT DATA = OUT = BAD_DEBTOR; BY VAT_NUMBER;RUN;DATA MERGE_INDEXED; MERGE BAD_DEBTOR (IN = BD) ; BY VAT_NUMBER; IF BD;RUN;SQL INNER JOINThe second technique , which is also fairly easy to code andunderstand, uses an SQL inner join. If you are familiar with SQL(Structured Query Language), you may want to use PROC SQLinstead of the DATA 2-APROC SQL; CREATE Table JOIN_SORTED AS SELECT BD.

* FROM BAD_DEBTOR BD, COMPANY C WHERE = ;QUIT;Conceptually, a query with a join and a WHERE expression isevaluated in two phases. First the FROM clause is internally builds a virtual, temporary join Table by combiningeach row from BAD_DEBTOR with every row from result of this combination is the Cartesian product of the twotables. Next, the WHERE expression is processed. Only rowsthat satisfy the WHERE clause condition are selected from thisjoin it is helpful to imagine that SQL builds a temporary, internaljoin Table for every join, this is often not the case. In reality, theSQL procedure optimizer breaks the Cartesian product intosmaller pieces. SAS data sets are stored in pages that contain acertain number of observations. To reduce I/O, the SQLprocedure optimizer makes use of these pages in its a two-way join, the following tasks are completed:1.

SUGI 27: Table Look-up: Techniques Beyond the Obvious

Tags:

Information

Transcription of SUGI 27: Table Look-up: Techniques Beyond the Obvious

Related search queries

SUGI 27: Table Look-up: Techniques Beyond the Obvious

Tags:

Information

Documents from same domain

Related documents

Related search queries