Example: biology

Proc SQL, the Data Step Killer

ProcSQL, the data step KillerMike AtkinsonAcko Systems Consulting IncData StepProcSQLThe data step It s a complete programming language Easy to combine (or merge) datasets It allows you to perform sequential algorithm steps Some algorithms require data to be in sorted order first Can use values of variables from different observations, using retain It has the where statement (stolen from SQL) that allows efficient use of filtering criteria prior to other processing There are some things the data step can do that can t be done in ProcSQL, Can create multiple datasets in one step Can easily join multiple datasets each left, right, or full outer join in a ProcSQL query can join only two datasets at a time (although inner joins without the join keyword can bring together any number)ProcSQL SQL is the de facto standard query language , widely used (beyond SAS even!) for retrieving and summarizing data ProcSQL can summarize results in the same step as performing row level calculations without ProcSQL, summarizing requires a separate procsummary step , and often a pre-sort ProcSQL can sort its results in the same step It can perform distinct counts It can use like expressions While SQL isn t a complete programming language , it has case expressions, which can be very powerfulSome stuff SAS ProcSQL can do Sending (pass-through) queries to Oracle (or another DBMS) for processing, and receiving the results into a SAS datas

Proc SQLSQL is the de facto standard query language, widely used (beyond SAS even!) for retrieving and summarizing dataProc SQL can summarize results in the same step as performing row level calculations without Proc SQL, summarizing requires a separate proc summary step, and often a pre-sort •Proc SQL can sort its results in the same ...

Tags:

  Language, Standards, Data, Step, Corps, Proc sql, Query, Data step, Standard query language

Information

Domain:

Source:

Link to this page:

Please notify us if you found a problem with this document:

Other abuse

Advertisement

Transcription of Proc SQL, the Data Step Killer

1 ProcSQL, the data step KillerMike AtkinsonAcko Systems Consulting IncData StepProcSQLThe data step It s a complete programming language Easy to combine (or merge) datasets It allows you to perform sequential algorithm steps Some algorithms require data to be in sorted order first Can use values of variables from different observations, using retain It has the where statement (stolen from SQL) that allows efficient use of filtering criteria prior to other processing There are some things the data step can do that can t be done in ProcSQL, Can create multiple datasets in one step Can easily join multiple datasets each left, right, or full outer join in a ProcSQL query can join only two datasets at a time (although inner joins without the join keyword can bring together any number)ProcSQL SQL is the de facto standard query language , widely used (beyond SAS even!) for retrieving and summarizing data ProcSQL can summarize results in the same step as performing row level calculations without ProcSQL, summarizing requires a separate procsummary step , and often a pre-sort ProcSQL can sort its results in the same step It can perform distinct counts It can use like expressions While SQL isn t a complete programming language , it has case expressions, which can be very powerfulSome stuff SAS ProcSQL can do Sending (pass-through) queries to Oracle (or another DBMS) for processing, and receiving the results into a SAS dataset Administration tasks, such as managing SAS datasets and indexes Using the SQL language against SAS datasets as an alternative to the data step Setting values of macro variables As an alternative to ProcPrintData stepBasic syntax: data new_SAS_dataset;set some_existing_dataset;/* set some_existing_dataset(keep=column_1 column_2); to subset variables*/* do stuff;run.

2 ProcSQL Create TableBasic syntax:procsql;create table new_SAS_datasetas/* select * for all columns/variables*/select column_1,column_2from some_existing_dataset;quit; Although it says create table, it is actually creating a SAS dataset. proc sql terminates with a quit;statement (not run;).WHERE clauseCan be used in data step statement and within ProcSQL, including within a case expressionComparison operators<, >, =, <=, >=, ^=or lt, gt, eq, le, ge, neLogic operators, bracketsand, or, ((a > b) and (not (c = d)))IN operatorin (values, separated, by, commas)WHERE clause like & contains While the in (list) has made its way to the IF statement, like (and contains ) have not; they are only in the WHEREThe syntax for CONTAINS is straightforward, name contains 'JONES'But I prefer LIKE, which is more powerfulpercent sign (%) match zero or more charactersunderscore (_) match any one where name like 'JONES%'ORDER BY clauseNot only does proc sql notrequire data to be sorted beforehand, but you can ask it to sort its resulting output simply by adding an ORDER BY clauseThe ORDER BY clause appears last, after the GROUP BY clause and the HAVING clause, if those are presentThe ORDER BY clause can be used on his own, without groupingThe syntax of the ORDER BY clause is slightly different than the data step (and other Procs ) BY statements.

3 The BY statement separates variables by spaces, while the ORDER BY separates them using BY clauseThe GROUP BY clause in ProcSQL lets you summarisedata (similar to ProcSummary) but without requiring the data to be sorted GROUP BY clause (if present) follows the WHERE clauseVariables in the GROUP BY are separated by commasUsing a GROUP BY statement does not guarantee the sort order of the results (although SAS is more likely to put the data into that order than Oracle is). Use an ORDER BY with a GROUP BY if you need to ensure the sort order of the : the variables in the ORDER BY clause need not match the variables in the GROUP BY functionsIf you have a GROUP BY in your query , then every variable you select should either be listed in the GROUP BY, or be summarised in some way. If you select a variable that is not summarised and not listed in the GROUP BY clause, you will almost certainly not get the summarized results you are some sample summary functions:sum(services) as services,max(service_date) as max_servdt,mean(paid_amount) as avg_paidamt,count(distinct PHN) as patient_countHAVING ClausThe HAVING clause applies after the GROUP BY, WHERE asthe WHERE clause applies before groupingThe HAVING clause looks at summarisedvalues, and cannot be used without a GROUP BY ;create table three_or_moreasselect service_date,count(*) as record_countgroup by service_datehaving count(*) >= 3;quit;CASE expressionThis is proc sql s closest equivalent to the IF statement.

4 A CASE expression, however, can only return a single value. (an IF statement can use a do/end to to perform multiple actions)The CASE expression consists of a series of WHEN conditions (that use the same syntax as WHERE conditions), followed by ELSE. So it s really more like an IF WHEN condition is accompanied by a THEN expression that evaluates to a CASE expression will use the THEN expression of the first WHEN condition that is found to be True. If none of the WHEN conditions are true, the ELSE expression will be s good practice to always have an expression exampleprocsql;select case when age = 0 then ' 0 'when age between 1 and 5 then ' 1-5'when age between 6 and 10 then ' 6-10'when age between 11 and 15 then '11-15'..else '?????' end as age_group,count(distinct recipient_id) as person_cntfrom health_servicesgroup by calculated age_group;quit;AliasesWhen joining two or more tables, it is useful to use an alias for each alias can be used as a prefix to variable names to indicate which table the variable comes from, which is handier than using the whole table name as a a variable of the same name appears in more than one table (being joined using a ProcSQL select statement), you must specify which table you want to refer to each time you refer to the variable name.

5 Prefixing variables with the table alias is the usual way to do JOIN, RIGHT JOINThe default SQL join is an Inner Join, meaning that only rows that match across both tables are includedLEFT JOIN and RIGHT JOIN in ProcSQL always operate on exactly two tables, and the order the tables are listed is very significant. Imagine writing them on the same line the first dataset listed is the Left one, and the second is the Right you use LEFT JOIN or RIGHT JOIN, you use the ON keyword (instead of the WHERE keyword) to indicate the join you use the INNER JOIN syntax to perform an inner join, you will also need to use the ON keywordComparing Inner, Left, and Right joinsHere s some sample data in two , Jane56 Adams, Giselle78 Keppel, LenStudentsGradesStudent_IDSubjectGrade3 4 MathA34 EnglishB56 MathC+99 FrenchFInner Join (usual, without JOIN keyword)procsql;create table after_innerasselect a.*,b.*from students a,grades bwhere by ;quit;Note: This will give a note in the log that student_idalready exists in the dataset.

6 Because student_idis the same in both datasets (guaranteed by the WHERE condition), this note can be safely , here s how you could rid of the note(without listing all the columns you want)procsql;create table after_inner(drop=student_id2)asselect a.*,b.*from students a,grades (rename=(student_id=student_id2))bwhere by ;quit;It s probably easier just to ignore the note in the of (default) Inner JoinStudent_IDName34 Gray, Jane56 Adams, Giselle78 Keppel, LenStudentsGradesStudent_IDSubjectGrade3 4 MathA34 EnglishB56 MathC+99 FrenchFStudent_IDNameSubjectGrade34 Gray, JaneMathA34 Gray, JaneEnglishB56 Adams, GiselleMathC+After_InnerDefaultInner Joinon student_idLEFT Joinprocsql;create table after_leftasselect a.*,b.*from students a left joingrades bon by ;quit;Results of Left JoinStudent_IDName34 Gray, Jane56 Adams, Giselle78 Keppel, LenStudentsGradesStudent_IDSubjectGrade3 4 MathA34 EnglishB56 MathC+99 FrenchFAfter_LeftLeft Joinon student_idStudent_IDNameSubjectGrade34 Gray, JaneMathA34 Gray, JaneEnglishB56 Adams, GiselleMathC+78 Keppel, LenRIGHT joinprocsql;create table after_rightasselect a.

7 *,b.*from students a right joingrades bon by ;quit;Results of Right JoinStudent_IDName34 Gray, Jane56 Adams, Giselle78 Keppel, LenStudentsGradesStudent_IDSubjectGrade3 4 MathA34 EnglishB56 MathC+99 FrenchFAfter_RightRight Joinon student_idStudent_IDNameSubjectGrade34 Gray, JaneMathA34 Gray, JaneEnglishB56 Adams, GiselleMathC+FrenchFFULL (Outer) joinprocsql;create table after_fullasselect coalesce( , ) asstudent_id, , , students a full joingrades bon by ;quit;Results of Full (Outer) JoinStudent_IDName34 Gray, Jane56 Adams, Giselle78 Keppel, LenStudentsGradesStudent_IDSubjectGrade3 4 MathA34 EnglishB56 MathC+99 FrenchFAfter_FullFull Joinon student_idStudent_IDNameSubjectGrade34 Gray, JaneMathA34 Gray, JaneEnglishB56 Adams, GiselleMathC+78 Keppel, Len99 FrenchF27 Traditional SAS Code( data step needs helpers!)procsortdata=prac_info;byprac_l ha;run;procsummarydata=prac_info;byprac_ lha;outputout=prac_lha_counts(drop=_type _ rename=(_freq_=prac_cnt));run;28 ProcSQL doing a summary procsql;createtableprac_lha_countsassele ctprac_lha,count(*) asprac_cntfromprac_infogroup byprac_lhaorder by prac_lha;quit;Calculated keyword in ProcSQLThe keyword calculated can be used to refer to a column being created within a ProcSQL query by name, in a reference later within the same can be used to reference a calculated column within the GROUP BY expression, or even in expressions to create other is no abbreviation for calculated.

8 30 Traditional SAS Codesummarize and lookup a descriptionprocsortdata=fitm_servcd;byse rvcd;run;procsummarydata=fitm_servcd;bys ervcd;outputout=servcd_fitm_cnts_0(drop= _type_ rename=(_freq_=fitm_cnt));run;dataservcd _fitm_cnts;setservcd_fitm_cnts_0;servcd_ descrip= put(servcd, svcd2ds.);run;31 proc sql Codeprocsql;createtableservcd_fitm_cntsa sselectservcd,put(servcd, svcd2ds.) asservcd_descrip,count(*) asfitm_cntfromfitm_servcdgroup byservcd, calculated servcd_descriporderbyservcd;quit;Partial Codewith joinprocsql; , ,count(*) asfitm_cntfromfitm_servcda , , 2;quit;Select desired observationsusing a data step %letstartdt_sas= '01apr2012'd;%letenddt_sas= '31mar2013'd;datadata_centres_2;setdata_ centres;whereefctvdt<= &enddt_sasand cncldt>= &startdt_sasand dt_cntr_statusin ('D', 'P')and dt_cntr_typein ('C', 'P')and not ( ' '|| dt_cntr_nm|| ' 'like '% HOLDINGS %'or ' '|| dt_cntr_nm|| ' 'like '% HOSP%'or ' '|| dt_cntr_nm|| ' 'like '%SYS%');run;Anything you can (well, not anything, but this )procsql;createtabledata_centres_with_fl agsasselectefctvdt,cncldt,dt_cntr_status ,dt_cntr_type,casewhen(efctvdt> &enddt_sas) or(cncldt< &startdt_sas)then'1.

9 Outside date range'when(dt_cntr_statusnotin('D', 'P'))then'2. Status not D or P'when(dt_cntr_typenotin('C', 'P'))then'3. Type not C or P'when(' '|| dt_cntr_nm|| ' 'like'% HOLDINGS %'or' '|| dt_cntr_nm|| ' 'like'% HOSP%'or' '|| dt_cntr_nm|| ' 'like'%SYS%') then'4. Computing type'else' 'endaserror_typefromdata_centres;quit;In formative reportprocfreqdata=data_centres_with_fla gs;tableserror_type/ missing;run;Getting the goods, either wayprocsql;createtabledata_centres_2 asselect*fromdata_centres_with_flagswher eerror_typeisnull;quit;ordatadata_centre s_2;setdata_centres_with_flags;whereerro r_typeis null;run;Distinct keywordIf distinct appears as in select distinct , it applies to all selected columns, and is basically the same as using PROC SORT with NODUP. distinct provider, patient, service_dateDistinct can also appear within a count summary function. (distinct provider) as uniq_prac_cnt,count(*) as record_cnt,count(provider) as cnt_recs_w_providerDemonstration of calculatedprocsql;createtableattached_w_ age_rangeasselect*,floor(yrdif(datepart( birth_date), '31mar2013'd, 'AGE'))asage,5* (floor(calculated age/5)) asage_temp,casewhen(calculated age) = 0then'000'elseput(calculated age_temp, z3.)

10 || '-'||put(calculated age_temp+ 4, z3.)endasage_rangefromattached_2012_2013 orderbyres_at_yr_end;quit;Some sample ;infilecards4;inputspecialty clntagepaidamt;cards4;00 5 500000 10 1000000 20 1000000 30 1000000 40 1500000 50 2500000 60 3500000 70 5500000 80 7500000 90 8500001 10 1500001 20 15000;;;;run;Get percent costs for patients (clients)aged 65 or overprocsql;createtablepct_over_65 asselectspecialty,sum(paidamt) aspaidamt,sum(casewhenclntage>= 65thenpaidamtelse0end) aspaidamt_ge65,(calculated paidamt_ge65) / (calculated paidamt) aspct_paid_over_65format= ;quit;Results percent costs for patients aged 65 or overPractitioner, service code datadataprac_servcd;infilecards4;inputpr acnumservcdpaidamt;cards4;00001 01 5000000001 12 1000000001 91 2000000002 01 4500000002 90 800000003 01 6000000003 12 500000003 92 30000;;;;run;Flag practitioners with all threeprocsql;createtableprac_3 asselectpracnum,sum(casewhenservcd= 01thenpaidamtelse0end)aspaidamt_01,sum(c asewhenservcd= 12thenpaidamtelse0end)aspaidamt_12,sum(c asewhenservcd>= 89thenpaidamtelse0end)aspaidamt_89_plus, casewhen(calculated paidamt_01) > 0and(calculated paidamt_12) > 0and(calculated paidamt_89_plus) > 0then1else0endasall_3fromprac_servcdgrou pbypracnum;quit;Practitioners with flag for all 3 Subset to those with all 3procsql;createtableprac_all_3 asselect*from(selectpracnum,sum(casewhen servcd= 01thenpaidamtelse0end) aspaidamt_01,sum(casewhenservcd= 12thenpaidamtelse0end) aspaidamt_12,sum(casewhenservcd>= 89thenpaidamtelse0end) aspaidamt_89_plus,casewhen(calculated paidamt_01) > 0and(calculated paidamt_12) > 0and(calculated paidamt_89_plus) > 0then1else0endasall_3fromprac_servcdgrou pbypracnum)whereall_3;quit;Subset to Practitioners with flag for all 348 Select Values into Macro Variableprocsqlnoprint.


Related search queries