
Transforming SAS® Data Sets Using Arrays
Ronald Cody, Ed.D., Robert Wood Johnson Medical School, Piscataway, NJ


Introduction

This paper describes how to efficiently transform SAS data sets using arrays. First, let us discuss what we mean by the term "transforming." You may want to create multiple observations from a single observation or vice versa. There are several possible reasons why you may want to do this. You may want to create multiple observations from a single observation to count frequencies or to allow for BY variable processing. You may also want to restructure SAS data sets for certain statistical analyses. Creating a single observation from multiple observations may make it easier for you to compute differences between variables without resorting to LAG functions, or to use the REPEATED statement in a procedure such as PROC GLM. PROC TRANSPOSE may come to mind as a solution to these transforming problems, but using arrays in a DATA step can be more flexible and allow you to have full control over the transformation.

Example 1: Creating a New Data Set with Several Observations per Subject from a Data Set with One Observation per Subject

Suppose you have a data set called DIAGNOSE, with the variables ID, DX1, DX2, and DX3.

The DXn variables represent three diagnosis codes. The observations in data set DIAGNOSE are:

DATA SET DIAGNOSE
ID   DX1   DX2   DX3
01     3     4     .
02     1     2     3
03     4     5     .
04     7     .     .

As you can see, some subjects have only one diagnosis code, some two, and some, all three. Suppose you want to count how many subjects have diagnosis 1, how many have diagnosis 2, and so forth. You don't care if the diagnosis code is listed as DX1, DX2, or DX3. In the example here, you would have a frequency of one for diagnosis codes 1, 2, 5, and 7 and a frequency of two for diagnosis codes 3 and 4.

One way to accomplish this task is to transform the data set DIAGNOSE, which has one observation per subject and three diagnosis variables, to a data set that has a single diagnosis variable and as many observations per subject as there are diagnoses for that subject. This new data set (call it NEW_DX) would look like the one shown next:

TRANSFORMED DATA SET (NEW_DX)
ID   DX
01    3
01    4
02    1
02    2
02    3
03    4
03    5
04    7

Using data set NEW_DX, it is now a simple job to count diagnosis codes using PROC FREQ on the single variable DX.
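To try the programs that follow, the DIAGNOSE listing can be entered as a small test data set. The step below is a minimal sketch and is not part of the original paper: it assumes ID is read as a character variable (to keep the leading zero) and that each missing diagnosis code is typed as a period.

```sas
/* Hypothetical setup step: builds DIAGNOSE from the listing above. */
/* Assumption: ID is character; a period marks a missing DX value.  */
DATA DIAGNOSE;
   INPUT ID $ DX1 DX2 DX3;
DATALINES;
01 3 4 .
02 1 2 3
03 4 5 .
04 7 . .
;
RUN;

/* Print the data set to confirm it matches the listing. */
PROC PRINT DATA=DIAGNOSE NOOBS;
RUN;
```

With this data set in place, the Example 1 programs can be submitted unchanged.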

Let us first write a SAS DATA step that creates data set NEW_DX from data set DIAGNOSE but does not use arrays. Here is the code:

    *----------------------------------*
    | EXAMPLE 1A: CREATING MULTIPLE    |
    | OBSERVATIONS FROM A SINGLE       |
    | OBSERVATION WITHOUT USING AN     |
    | ARRAY                            |
    *----------------------------------*;
    DATA NEW_DX;
       SET DIAGNOSE;
       DX = DX1;
       IF DX NE . THEN OUTPUT;
       DX = DX2;
       IF DX NE . THEN OUTPUT;
       DX = DX3;
       IF DX NE . THEN OUTPUT;
       KEEP ID DX;
    RUN;

Let's see how this program works. As you read in each observation from data set DIAGNOSE, you create from one to three observations in the new data set NEW_DX. The SET statement brings in each observation from the original data set (DIAGNOSE), one at a time. In the first iteration of this DATA step, the values of ID, DX1, DX2, and DX3 are 01, 3, 4, and missing, respectively. Next, a new variable, DX, is set equal to DX1 (which is a 3).

Since this is not a missing value, the OUTPUT statement in the next line is executed and the first observation in data set NEW_DX is formed. The values of all the variables in the PDV (program data vector) at this point are:

ID=01 DX1=3 DX2=4 DX3=. DX=3

But, since we have a KEEP statement in the DATA step, only the values for ID and DX are written out to the new data set. Next, the value of DX is set to DX2 (which is a 4). Again, since the value of DX is not a missing value, another observation is written to the data set NEW_DX. Finally, DX is set equal to DX3 (which is a missing value). Since the condition in the following IF statement is not true, the third OUTPUT statement of the DATA step does not execute; execution returns to the top of the DATA step and another observation from data set DIAGNOSE is read. As you can see, the program will create as many observations per subject as there are nonmissing DX codes for that subject.

Notice the repetitive nature of the program and your array light bulb should turn on.

Here is the program rewritten using arrays:

    *----------------------------------*
    | EXAMPLE 1B: CREATING MULTIPLE    |
    | OBSERVATIONS FROM A SINGLE       |
    | OBSERVATION USING AN ARRAY       |
    *----------------------------------*;
    DATA NEW_DX;
       SET DIAGNOSE;
       ARRAY DXARRAY[3] DX1 - DX3;
       DO I = 1 TO 3;
          DX = DXARRAY[I];
          IF DX NE . THEN OUTPUT;
       END;
       KEEP ID DX;
    RUN;

In this program, you first create an array called DXARRAY which contains the three numeric variables DX1, DX2, and DX3. Remember that this array allows you to refer to any of the variables associated with it by listing the array name, subscripted with the appropriate index (subscript). Also remember that array elements exist only in the DATA step in which they are created and they are not part of the SAS data set being created. Finally, array names follow the same rules as SAS variable names. (Note: Do not use the same name for an array as a variable in your data set.)
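As a variation that is not in the original paper, the element count does not have to be written twice. The sketch below assumes the same DIAGNOSE data set; it declares the array with [*] so that SAS counts the variables in the list, and uses the DIM function as the upper bound of the loop.

```sas
/* Variation on Example 1B (an illustration, not the author's code). */
DATA NEW_DX;
   SET DIAGNOSE;
   ARRAY DXARRAY[*] DX1-DX3;        /* [*]: SAS counts the elements  */
   DO I = 1 TO DIM(DXARRAY);        /* DIM returns the element count */
      DX = DXARRAY[I];
      IF DX NE . THEN OUTPUT;       /* skip missing diagnosis codes  */
   END;
   KEEP ID DX;                      /* I and DX1-DX3 are not written */
RUN;
```

Written this way, only the variable list in the ARRAY statement needs to change if more diagnosis variables are added.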

Now, back to the program. The two lines of code inside the DO loop are similar to the repeated lines in the non-array example, with the variable names DX1, DX2, and DX3 replaced by the array reference.

To count the number of subjects with each diagnosis code, you can now use PROC FREQ like this:

    PROC FREQ DATA=NEW_DX;
       TABLES DX / NOCUM;
    RUN;

In this example, by using an array, you only saved one line of SAS code. However, if there were more variables, DX1 to DX50 for example, the savings would be substantial.

Example 2: Another Example of Creating Multiple Observations from a Single Observation

Here is an example that is similar to Example 1. We start with a data set that contains an ID variable and three variables S1, S2, and S3, which represent a score at times 1, 2, and 3 respectively. The original data set, called ONEPER, looks as follows:

DATA SET ONEPER
ID   S1   S2   S3
01    3    4    5
02    7    8    9
03    6    5    4

You want to create a new data set called MANYPER, which looks like this:

DATA SET MANYPER
ID   TIME   SCORE
01      1       3
01      2       4
01      3       5
02      1       7
02      2       8
02      3       9
03      1       6
03      2       5
03      3       4

The program to transform data set ONEPER to data set MANYPER is similar to the program in Example 1 except that you need to create the TIME variable in the transformed data set.

This is easily accomplished by naming the DO loop counter TIME as follows:

    *----------------------------------*
    | EXAMPLE 2: CREATING MULTIPLE     |
    | OBSERVATIONS FROM A SINGLE       |
    | OBSERVATION USING AN ARRAY       |
    *----------------------------------*;
    DATA MANYPER;
       SET ONEPER;
       ARRAY S[3];
       DO TIME = 1 TO 3;
          SCORE = S[TIME];
          OUTPUT;
       END;
       KEEP ID TIME SCORE;
    RUN;

We first notice that the ARRAY statement in this DATA step does not have a variable list. This was done to demonstrate another way of writing an ARRAY statement. When the variable list is omitted, the variable names default to the array name followed by the numbers from the lower bound to the upper bound. In this case, the statement

    ARRAY S[3];

is equivalent to

    ARRAY S[3] S1-S3;

The SET statement brings in observations from the data set ONEPER, which contain the variables ID, S1, S2, and S3. This program is similar to the previous program except for the fact that we call the DO loop variable TIME and we always output three observations for every observation in data set ONEPER.

(If there are any missing values for the variables S1, S2, or S3, there will be an observation in the new data set with a missing value for the variable called SCORE.)

Continuing in the direction of creating multiple observations from a single observation, let us extend this program to include an additional dimension.

Example 3: Going from One Observation per Subject to Many Observations per Subject Using Multidimensional Arrays

Suppose you have a SAS data set (call it WT_ONE) that contains an ID and six weights for each subject in an experiment. The first three values represent weights at times 1, 2, and 3 under condition 1; the next three values represent weights at times 1, 2, and 3 under condition 2 (see the diagram below):

           CONDITION
      1                 2
-------------------------------------
    TIME              TIME
 1    2    3      1    2    3
WT1  WT2  WT3    WT4  WT5  WT6

To clarify this, suppose that data set WT_ONE contains two observations:

DATA SET WT_ONE
ID   WT1   WT2   WT3   WT4   WT5   WT6
01   155   158   162   149   148   147
02   110   112   114   107   108   109

Here, weights 1, 2, and 3 correspond to measurements made under condition 1 at times 1 to 3, and weights 4, 5, and 6 correspond to measurements made under condition 2 at times 1 to 3.

You want a new data set called WT_MANY to look like this:

DATA SET WT_MANY
ID   COND   TIME   WEIGHT
01      1      1      155
01      1      2      158
01      1      3      162
01      2      1      149
01      2      2      148
01      2      3      147
02      1      1      110
02      1      2      112
02      1      3      114
02      2      1      107
02      2      2      108
02      2      3      109

A convenient way to make this conversion would be to create a two-dimensional array with the first dimension representing condition and the second representing time. So, instead of having a one-dimensional array like this:

    ARRAY WEIGHT[6] WT1-WT6;

you could create a two-dimensional array like this:

    ARRAY WEIGHT[2,3] WT1-WT6;

The comma between the 2 and 3 separates the dimensions of the array. This is a 2 by 3 array. Array element WEIGHT[2,3], for example, would represent a subject's weight under condition 2 at time 3.

Let us use this array structure to create the new data set which contains 6 observations for each ID.

Each observation is to contain the ID and one of the 6 weights, along with two new variables, COND and TIME, which represent the condition and the time at which the weight was recorded. Here is the restructuring program:

    *----------------------------------*
    | EXAMPLE 3: USING A MULTIDIMEN-   |
    | SIONAL ARRAY TO RESTRUCTURE A    |
    | DATA SET                         |
    *----------------------------------*;
    DATA WT_MANY;
       SET WT_ONE;
       ARRAY WTS[2,3] WT1-WT6;
       DO COND = 1 TO 2;
          DO TIME = 1 TO 3;
             WEIGHT = WTS[COND,TIME];
             OUTPUT;
          END;
       END;
       DROP WT1-WT6;
    RUN;

To cover all combinations of condition and time, you use "nested" DO loops, that is, a DO loop within a DO loop. Here's how it works: COND is first set to 1 by the outer loop. Next, TIME is set to 1, 2, and 3 in the inner loop while COND remains at 1. Once the inner loop is completed (TIME has reached 3), control returns to the outer loop, where COND is now set to 2, and the inner loop cycles through the three times again.
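The loop order described above can be traced in the SAS log. The following is a minimal sketch rather than part of the original paper: it assumes WT_ONE is entered with DATALINES and ID is read as character, and it uses DATA _NULL_ with a PUT statement so that each pass through the inner loop writes one line to the log instead of creating a data set.

```sas
/* Hypothetical setup step: the two WT_ONE observations shown earlier. */
DATA WT_ONE;
   INPUT ID $ WT1-WT6;
DATALINES;
01 155 158 162 149 148 147
02 110 112 114 107 108 109
;
RUN;

/* Trace the nested loops: one log line per [COND,TIME] combination.  */
/* WT1-WT6 fill the array row by row, so WTS[2,1] refers to WT4.      */
DATA _NULL_;
   SET WT_ONE;
   ARRAY WTS[2,3] WT1-WT6;
   DO COND = 1 TO 2;
      DO TIME = 1 TO 3;
         WEIGHT = WTS[COND,TIME];
         PUT ID= COND= TIME= WEIGHT=;
      END;
   END;
RUN;
```

The log lines should appear in the same order as the rows of the WT_MANY listing, with TIME changing fastest within each value of COND.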

