Example: stock market

Proc Compare to Validate Datasets

Proc Compare to Validate Datasets Angelina Cecilia Casas , PPD Development, Research Triangle Park, NC. Variables Summary Number of Variables in Common: 3. ABSTRACT. Comparison of two Datasets is a technique used to know that two Observation Summary Datasets are equal or if they have discrepancies. This method can be used to Validate that a dataset has been created correctly Observation Base Compare or that the changes in a dataset are only those that are expected. observations were added or removed or corrections were First Obs 1 1. done. Last Obs 4 4. Number of Observations in Common: 4. INTRODUCTION Total Number of Observations Read from Proc Compare is a procedure that allows two Datasets to be : 4. compared for properties, number of observations, number of Total Number of Observations Read from variables, and properties of the Datasets . : 4. For a variable, you can get output about differences in: Number of Observations with Some Compared Variables Type, length, formats, informats and label(s).

proc compare data=demog compare=compare transpose; id unique ; run; Unique=20110: Variable Base Value Compare Diff. % Diff WEIGHTKG 85.7 85.0 -0.700000 -0.816803 dob 09/17/1949 09/14/1949 -3.000000 0.079830 Unique=20120: Variable Base Value Compare Diff. % Diff WEIGHTKG 101.6 101.0 -0.600000 -0.590551 dob 08/19/1949 08/16/1949 -3.000000 0 ...

Tags:

  Corps, Transpose

Information

Domain:

Source:

Link to this page:

Please notify us if you found a problem with this document:

Other abuse

Advertisement

Transcription of Proc Compare to Validate Datasets

1 Proc Compare to Validate Datasets Angelina Cecilia Casas , PPD Development, Research Triangle Park, NC. Variables Summary Number of Variables in Common: 3. ABSTRACT. Comparison of two Datasets is a technique used to know that two Observation Summary Datasets are equal or if they have discrepancies. This method can be used to Validate that a dataset has been created correctly Observation Base Compare or that the changes in a dataset are only those that are expected. observations were added or removed or corrections were First Obs 1 1. done. Last Obs 4 4. Number of Observations in Common: 4. INTRODUCTION Total Number of Observations Read from Proc Compare is a procedure that allows two Datasets to be : 4. compared for properties, number of observations, number of Total Number of Observations Read from variables, and properties of the Datasets . : 4. For a variable, you can get output about differences in: Number of Observations with Some Compared Variables Type, length, formats, informats and label(s).

2 Unequal: 0. For a dataset, you can find differences in: Number of Observations with All Compared Variables date of creation, last modification of the Datasets was modified, Equal: 4. number of variables and observations of the Datasets . You can NOTE: No unequal values were found. All values also see the labels of the Datasets , but differences are not compared are exactly equal. reported for that. For observations, you can Compare the value of the record for The records in DEMOG and the records in Compare have the each variable. Also, you can decide how different the values of same order and an ID statement is not needed. However if the the observations can be. order of the two Datasets is not the same, records might be compared incorrectly. THE Datasets THAT WILL BE COMPARED. proc print data=demog noobs;. WHEN THE Datasets HAVE A DIFFERENT. Unique WEIGHTKG dob ORDER.

3 20120 08/19/1949 proc sort data= Compare ;. 20130 06/02/1934 by DOB;. 20110 09/17/1949 run;. 20202 10/17/1931. proc Compare base=demog (keep=unique weightkg). Compare = Compare (keep=unique weightkg);. proc contents data=demog; run;. Data Set Name: Observations: 4 As in other SAS procedures and data step, it is possible to select Member Type: DATA Variables: 3. observations and variables in the Datasets that are going to be --Alphabetic List of Variables and Attributes-- used. Dataset Created Modified NVar Nobs Variable Type Len Pos Format Label 20 JAN03:13:17 20 JAN03:13:17 2 4. ---------------------------------------- ------- 20 JAN03:13:17 20 JAN03:13:17 2 4. UNIQUE Char 200 32 Unique Record WEIGHTKG Num 8 8 Weight in kg Vars Summary dob Num 8 24 MMDDYY10. Date of Birth # of Vars in Common: 2. proc sql; Observation Summary create table Compare as select * from demog.

4 Quit; Observation Base Compare # of Obs in Common: 4. Total # of Obs Read from : 4. Total # of Obs Read from : 4. AN EXAMPLE OF A CLEAN Compare OUTPUT. # of Obs with Some Compared Vars Unequal: 3. proc Compare base= Compare = Compare ; # of Obs with All Compared Vars Equal: 1. run;. Values Comparison Summary The Compare Procedure # of Vars Compared with All Obs Equal: 0. Comparison of with # of Vars Compared with Some Obs Unequal: 2. (Method=EXACT) Total # of Values which Compare Unequal: 6. Maximum Difference: Data Set Summary All Vars Compared have Unequal Values Dataset Created Modified Nvar Nobs Variable Type Len Label Ndif MaxDif Unique CHAR 200 Unique Record 3. 14 JAN03:16:03 14 JAN03:16:06 3 4 WEIGHTKG NUM 8 Weight in kg 3 14 JAN03:16:06 14 JAN03:16:06 3 4. Value Comparison Results for Vars _____ # of Obs with Some Compared Variables Unequal: 0. || Unique Record # of Obs with All Compared Variables Equal: 3.

5 || Base Value Compare Value Obs || Unique Unique NOTE: No unequal values were found. All values compared _____ || _____+ _____+ are exactly equal. ||. 1 || 20120 20202. 3 || 20110 20120. 4 || 20202 20110 INFORMATION ABOUT VARIABLES OR. _____. _____ OBSERVATIONS THAT ARE IN DIFFERENT IN. || Weight in kg THE Datasets BEING COMPARED. || Base Compare Obs || WEIGHTKG WEIGHTKG Diff. % Diff _____ || _____ _____ _____ _____ The output there are no differences, but it might be easy to miss || the fact that only 3 of the 4 observations were compared. Also, 1 || 3 || you don't know which observations were not compared unless 4 || you use the LIST or LISTALL option. _____. proc Compare data=base Compare = Compare The values of the variable that was to the right of the BASE or list;. DATA statement is in the BASE column. The values of dataset id unique;. listed to right of the Compare dataset is in the Compare run.

6 Column. It is necessary that the two Datasets that are going to be Observation 2 in not found in : compared have the same order UNLESS the order of the Unique=20120. Datasets is something that is being tested. Observation 4 in not found in : What happens when the order of the observations in each of the Unique=34343. Datasets is different? The LIST option will list the observations and variables that exist proc sql; in one dataset and are missing in the other. There are other update Compare set unique='34343' where options, LISTALL, LISTBASE, LISTBASEOBS, LISTBASEVAR, unique='20202';. LISTCOMP, LISTCMPOBS, LISTCOMPVAR, LISTOBS, AND. quit;. LISTVAR, to limit this report to one or the other dataset and only proc sort data= Compare ; variables and / or observations. LISTEQUALVAR will also list by unique; the variables that are not used in the ID variable list for which all run; values are equal.

7 Proc Compare base=demog(keep = unique weightkg). Compare = Compare (keep = unique weightkg);. run; LIMITING THE PRINTED OUTPUT. _____ Sometimes it is not necessary to know all the observations that || Unique Record || Base Value Compare Value have an error, once it is known that an error exists, it might not be Obs || Unique Unique needed to know all the discrepancies. _____ || _____+ _____+. || It might be desirable to limit the size of the output; for that, the 2 || 20120 20130. 3 || 20130 20202 option MAXPRINT is very useful. 4 || 20202 34343. _____ First it is necessary to use the original dataset and modify the variable WEIGHTKG to have something to report. Etcetera For example, let's change the value of one of the variables in the ID statement from 34343' to 20120'. Even though, there is only one discrepancy (UNIQUE) in the Datasets , several discrepancies are reported.

8 It is necessary to proc sql;. specify which observations should be compared . update Compare set unique='20120'. The option that tells proc Compare , which variables should be where unique='34343';. together, is ID. After it, you should list the number of variables that define which observation in each of the Datasets should be update Compare set weightkg=int(weightkg);. compared. This is similar to the way Datasets are merged using a by statement. quit;. proc Compare base=demog Compare = Compare ; proc sort data= Compare ;. id unique; by unique;. run; run;. proc Compare data=demog # of Obs in Common: 3. Compare = Compare maxprint=(4,2);. # of Obs in but not in : 1. id unique ;. # of Obs in but not in : 1. run;. Total # of Obs Read from : 4. Total # of Obs Read from : 4. At the most, 4 differences will be printed, 2 for every variable that has discrepancies. var age;. with nage.

9 Run;. Variable Type Len Label Ndif MaxDif WEIGHTKG NUM 8 Weight in kg 4 # of Vars in Common: 3. _____ # of Vars in but not in : || Weight in kg 1. || Base Compare Unique || WEIGHTKG WEIGHTKG Diff. % Diff _____ || _____ _____ _____ _____ # of Vars in but not in : || 1. 20110 || 20120 || # of ID Vars: 1. # of VAR Statement Vars: 1. NOTE: The MAXPRINT=2 printing limit has been reached # of WITH Statement Vars: 1. for the variable WEIGHTKG. No more values will be printed for this comparison. Sometimes, there are differences in the Datasets that are not important for the purpose of the comparison. For this scenario, it is possible to give a value for which only observations that are bigger than this number will be marked as a difference. PRINTING RELATED INFORMATION. TOGETHER. proc sql;. update Compare set nage=nage+ ;. quit;. Sometimes it is desired to see the information with discrepancies grouped together by the variables in the ID statement.

10 Proc Compare data=demog Compare = Compare proc Compare data=demog criterion=.01;. Compare = Compare var age;. transpose with nage;. ; run;. id unique ;. run;. The Compare Procedure Unique=20110: Comparison of with Variable Base Value Compare Diff. % Diff (Method=RELATIVE( ), Criterion= ). WEIGHTKG dob 09/17/1949 09/14/1949 Variables Summary Unique=20120: Number of Variables in Common: 3. Variable Base Value Compare Diff. % Diff Number of Variables in but not in WEIGHTKG : 1. dob 08/19/1949 08/16/1949 Number of Variables in but not in : 1. Unique=20130: Number of VAR Statement Variables: 1. Variable Base Value Compare Diff. % Diff Number of WITH Statement Variables: 1. WEIGHTKG dob 06/02/1934 05/30/1934 Number of Observations in Common: 4. Total Number of Observations Read from : 4. Unique=20202: Total Number of Observations Read from : Variable Base Value Compare Diff.


Related search queries