Example: air traffic controller

Imputing ISO8601 Dates From Character Variables Containing ...

Paper FC06 Imputing ISO8601 Dates From Character Variables Containing partial Dates John R. Gerlach, MaxisIT Abstract Assume that there is an initiative to include legacy studies into a database such that the original data are converted into SDTM domains in compliance with CDISC standards. One aspect of this effort involves date Variables that must follow ISO8601 rules. This sub-task would be easy if the data were stored in SAS data sets and, specifically, the Dates were stored as SAS date Variables . Unfortunately, the plethora of data is stored mostly as raw text files. Moreover, Dates are stored as Character Variables in various formats ( , mmddyyyy, yyyymmdd). Even worse, the date Variables contain partial Dates and, perhaps, bogus values. This paper explains a SAS macro that imputes ISO8601 Dates from Character Variables Containing even partial Dates . Introduction The International Standard ISO 8601 specifies numeric representations of date and time. This standard is based on the Gregorian calendar (introduced in 1582) for Dates and the 24-hour clock for time.

Paper FC06 Imputing ISO8601 Dates From Character Variables Containing Partial Dates John R. Gerlach, MaxisIT Abstract Assume that there is an initiative to include legacy studies into a database such that the original data are converted

Tags:

  Form, Date, Variable, Partial, Containing, Character, Dates from character variables containing partial dates

Information

Domain:

Source:

Link to this page:

Please notify us if you found a problem with this document:

Other abuse

Transcription of Imputing ISO8601 Dates From Character Variables Containing ...

1 Paper FC06 Imputing ISO8601 Dates From Character Variables Containing partial Dates John R. Gerlach, MaxisIT Abstract Assume that there is an initiative to include legacy studies into a database such that the original data are converted into SDTM domains in compliance with CDISC standards. One aspect of this effort involves date Variables that must follow ISO8601 rules. This sub-task would be easy if the data were stored in SAS data sets and, specifically, the Dates were stored as SAS date Variables . Unfortunately, the plethora of data is stored mostly as raw text files. Moreover, Dates are stored as Character Variables in various formats ( , mmddyyyy, yyyymmdd). Even worse, the date Variables contain partial Dates and, perhaps, bogus values. This paper explains a SAS macro that imputes ISO8601 Dates from Character Variables Containing even partial Dates . Introduction The International Standard ISO 8601 specifies numeric representations of date and time. This standard is based on the Gregorian calendar (introduced in 1582) for Dates and the 24-hour clock for time.

2 All ISO compliant Dates and times are written with the largest time units given first, from left to right, that is: year, month, week, day, hour, minute, and second. This standard offers representations for the following time elements: date , time of day, Coordinated Universal Time (UTC), local time with offset to UTC, date and time, time intervals, and recurring time intervals. Representations can be in one of two formats: a basic format showing only the date /time elements, such as 20060522, or extended format, such as 2006-05-22, both representing May 22, 2006. Hyphens are used sometimes to denote missing components. The CDISC / SDTM standard includes the ISO 8601 standard for those Variables representing Dates and time. Consider the partial list of SDTM Variables denoting time elements: RFSTDTC Subject Reference Start date /Time (Demographic domain) AESTDTC Start date /Time of an adverse event AEENDTC End date /Time of an adverse event AEDUR Duration of an adverse event LBDTC date /Time of specimen collection PEDTC date /Time of examination Suffice it to say that time elements are an important component of SDTM data sets.

3 However, for this discussion, this paper will address only Dates , not time elements. If this vital information were stored as SAS date Variables , it would be easy to populate such Variables using SAS, Version , specifically the IS8601DA for date Variables and IS8601DT and IS8601DN for date -time Variables . The IS8601DT format writes a date -time value ( , 2006-05-22T09:30:00:00) and the IS8601DT writes a date value using a date -time variable , similar to the IS8601DA format for date Variables . These formats write values in complete representation, which includes seconds for the IS8601DT format, which may not be desirable. Also noteworthy, these same formats do not allow a truncation of a value using a smaller width, which prompts a nasty error message, such as: Error: Format IS8601DT specified with width out of range. Unfortunately, the aforementioned legacy data contains Dates that are stored as Character Variables often consisting of incomplete Dates . Thus, the SAS-supplied formats are not immediately useful for the task of generating ISO8601 compliant Dates for SDTM data sets.

4 Imputation Rules In order to impute ISO compliant Dates from Character Variables , it is important to consider the possible natural formats of the Dates found in the legacy data. For this discussion, three natural date formats are considered: MMDDYYYY ( , 05/22/2006); DDMONYYYY ( , 22 MAY2006); and, YYYYMMDD ( , 20060522). It is assumed that these Dates might be manifested in several ways, as illustrated in the following table. The original values are listed on the left side of each date -type column; whereas, the right side denotes the imputed values written in year, month, day order. -------- MMDDYYYY --------- ------- DDMONYYYY ------- ------- YYYYYMMDD -------- 01012005 2005 01 01 01 JAN2005 2005 JAN 01 20050101 2005 01 01 01/01/2005 2005 01 01 01 JAN2005 2005 JAN 01 2005/01/01 2005 01 01 01-01-2005 2005 01 01 01 JAN2005 2005 JAN 01 2005-01-01 2005 01 01 01- -2005 2005 01 01 JAN2005 2005 JAN 01 2005- -01 2005 01 -01-2005 2005 01 01 JAN2005 2005 JAN 01 2005-01- 2005 01 2005 01 01 01 JAN2005 2005 JAN 01 2005 01 01 01 2005 2005 01 01 JAN2005 2005 JAN 01 2005 01 2005 01 01 2005 2005 01 JAN2005 2005 JAN 200501 2005 01 012005 2005 01 01 2005 2005 01 2005 01 2005 01 012005

5 2005 01 01 JAN JAN 01 2005 01 2005 01 As stated already, the date Variables might contain incomplete Dates . Moreover, the date Variables might contain other characters, such as a virgule, dash, blank, or period ( , 01/01/2005). The utility must be robust enough to deal with the diverse values, as well as consider the opportunity to impute partial Dates in a hierarchical fashion. The table above documents the proposed imputations. Also, keep in mind that the imputed values will be used accordingly, meaning that a date component might not be used because there is no higher-level date component associated with it. The Utility The %isodate macro is designed to process a SAS data set Containing one or more Character Variables that contain full or partial Dates having the form DDMMYYYY ( , 22052006), DDMONYYYY ( , 22 MAY2006), or MMDDYYYY ( , 05222006). The macro definition contains five parameters including one keyword parameter denoting a default date format.

6 The positional parameters are: IDSN Input data set ODSN Output data set IVAR Input Character variable Containing full or partial Dates OVAR Output Character variable Containing ISO8601 Dates Basically, the macro processes a SAS data set using, a designated Character variable , and creates a new data set Containing the original variable plus a new variable Containing ISO8601 compliant Dates . In order to accomplish this task, the macro considers one of three possible natural date formats and applies it to the Character variable in order to impute its ISO equivalent. Using one of the aforementioned natural date formats, denoted the keyword parameter FMT, the macro assigns the appropriate SAS-supplied input format, which is used to read the values of the Character variable . Thus, for example, if the keyword parameter FMT is assigned the value DDMONYYYY, then the same local macro variable is assigned the value date9, the respective SAS input format. For the formats MMDDYYYY and YYYYMMDD, the FMT variable is assigned the SAS input formats mmddyy10 and yymmdd10, respectively.

7 Since one of the proposed formats includes an abbreviation for the month, a format is created to convert that component into a two digit number. Another format $missf is used for reporting the effectiveness of the transformation from original data to ISO-compliant Dates . proc format; value $monthf 'JAN'='01' 'FEB'='02' 'MAR'='03' 'APR'='04' 'MAY'='05' 'JUN'='06' 'JUL'='07' 'AUG'='08' 'SEP'='09' 'OCT'='10' 'NOV'='11' 'DEC'='12'; value $missf ' ' = 'Missing' other = 'Not Missing'; run; The %isodate macro begins by making sure that a valid natural date format has been specified and thereby assigns the respective SAS-supplied input format; otherwise, an error message is printed to the SAS log and the macro terminates.

8 Assuming a valid date format, a Data step processes a data set employing the IF-logic and assignment statements depending on the natural date specified. Notice that the code required for Dates having the form DDMONYYYY is rather terse compared to the other two formats. As indicated in the table illustrating the imputations, there are fewer ambiguities among the component parts, such as the month, which is denoted by abbreviated text. In contrast, other two possible formats pose greater challenges for one obvious reason: the value denoting day could be construed as the month and vice versa. Indeed, there is not much difference between the imputation process for YYYYMMDD and MMDDYYYY. It is left for the reader to appraise the imputation process. Once the component parts are obtained (month, day, year), there is another block of code that assigns the ISO date to designated output variable , in hierarchical fashion. It is here where a component date -part might be imputed, but then not used, since it does not have a respective higher level ( , an imputed day, but no month).

9 After the Data step, the FREQ procedure is used to generate a report by comparing the original (input) variable with the ISO date (output) variable , that is, by simply comparing the two Variables with respect to Missing / Non-Missing values, a concise way to show the effectiveness of the process. %macro isodate(idsn,odsn,ivar,ovar,fmt=DDMONYYY Y); options nomprint nosymbolgen nomacrogen nomlogic; %if %upcase(&fmt.) eq DDMONYYYY | %upcase(&fmt.) eq MMDDYYYY | %upcase(&fmt.) eq YYYYMMDD %then %do; %if %upcase(&fmt.) eq DDMONYYYY %then %do; %let fmt = date9.; %end; %else %if %upcase(&fmt.)

10 Eq MMDDYYYY %then %do; %let fmt = mmddyy10.; %end; %else %do; %let fmt = yymmdd10.; %end; data length dd mm $2 yy $4 &ovar. $10; set %if %upcase(&fmt.) eq YYMMDD10. %then %do; if length(compress(&ivar.,' .-/')) ge 8 then &ovar. = put(input(&ivar., ?? yymmdd10.),yymmdd10.); if &ovar.


Related search queries