Transcription of Address Cleaning Using the TRANWRD Function
1 Address Cleaning Using the TRANWRD Function Rena Jones, New York State Department of Health, Troy, NYMike Zdeb, U@Albany School of Public Health, Rensselaer, NYABSTRACTSAS data sets sometimes contain character variables that need alteration. Address databases used to generate mailing lists or for geocoding (assigning of a geographic coordinate location in terms of latitude and longitude based on street Address ) require standardized Address information. The various portions of the Address (numbers, street names, street types) should conform to a set of pre-specified string functions are one way to manipulate the components of Address data.
2 SAS on-line help states that the TRANWRD Function replaces or removes all occurrences of a given word (or a pattern of characters) within a character string. This paper will provide an overview of the TRANWRD Function syntax and its utility for Cleaning an Address database either by itself or in conjunction with other string functions (COMPRESS, TRANSLATE). A SAS macro is described that uses the TRANWRD Function to adjust for common inconsistencies and errors in Address data. The macro provides one of a number of BASE SAS alternatives to functions available in the SAS Data Quality DATAC onsider the following raw data from a table of addresses which we would eventually like to geocode in a GIS software package, or use to create a mailing list.
3 Data mail; input Address $25.;datalines;81 CANTOR LN922 NEW_HAVEN PL22 HARING/ ROADONE WILLOW RD727 RAVEN TERR35 PL ST15 S. SWAN AVE129 NORTH BDWY;run;Assume some pre-set rules to obtain consistency among all addresses, such as requiring the full suffix for street names and removing all punctuation. STANDARD OPTIONS Using a simple IF/THEN statement and a quoted text string, which only allows for replacement of an entire text string with another, would be inefficient if you have multiple records in which you want to make the same change:if Address = '8 CANTOR LANE' then Address = '8 CANTOR LN';An IF/THEN statement with the SUBSTR Function would also work.
4 However, it requires knowledge of the location of the text to be changed and thus allows only for correction of Address text that starts in the same location across observations. This is also not useful for a standard set of changes:if substr( Address ,10,4)= 'LANE' then substr( Address ,10) = 'LN';Coders' CornerNESUG 2008 TRANWRDThe TRANWRD Function is a search and replace Function . TRANWRD does not rely on knowing the specific location of the text to replace, and will replace all occurrences of a text string Using the following syntax: (source, target text, replacement text): Address = TRANWRD ( Address ,'ST.)
5 ','STREET');The list of adjustments Using TRANWRD can be as comprehensive as the data set requires. Depending on the number of changes needed, one can write numerous lines of code. TRANWRD can make numerous changes with each observation in a data AND TRANSLATECOMPRESS will remove specified characters from a string. By default, SAS removes blanks unless other characters are specified in the second part of the argument. Using COMPRESS prior to or in conjunction with making text replacements with TRANWRD may be useful. Address = COMPRESS( Address ,".");1 81 CANTOR LN2 922 NEW_HAVEN PL3 22 HARING/ ROAD4 ONE WILLOW RD5 727 RAVEN TERR6 35 PL ST7 42 W26TH ST8 15 S SWAN AVE9 129 NORTH BDWYThe TRANSLATE Function replaces characters in strings, but is less "sophisticated" than TRANWRD in that it replaces single characters, not patterns of characters (source, replacement text or list, target text or list).
6 Address = TRANSLATE ( Address , '', '_ /'); 1 81 CANTOR LN2 922 NEW HAVEN PL3 22 HARING ROAD4 ONE WILLOW RD5 727 RAVEN TERR6 35 PL 42 ST8 15 S. SWAN AVE9 129 NORTH BDWYA MACROA convenient way to use the TRANWRD Function to clean multiple Address data sets is to put all of your common corrections and inside a SAS macro, such as the one correct(dataset);data temp; set **TRANSLATE AND COMPRESS TO REMOVE/REPLACE CHARACTERS; Address = COMPRESS( Address ,'.'); Address = translate( Address ,' ','_ /');Coders' CornerNESUG 2008** TRANWRD LIST OF COMMON CONVERSIONS; Address = TRANWRD ( Address ,'ST','STREET'); Address = TRANWRD ( Address ,'LN','LANE'); Address = TRANWRD ( Address ,'RD','ROAD'); Address = TRANWRD ( Address ,'PL','PLACE'); Address = TRANWRD ( Address ,'TERR','TERRACE'); Address = TRANWRD ( Address ,'ONE','1'); Address = TRANWRD ( Address ,'BDWY','BROADWAY');**THIS LIST COULD GO ON AND ;run;%mend;**CALL THE MACRO Using the mail dataset;%correct (mail).
7 CAVEATSF irst, don't assume that the changes will work for your entire data set. When working with Address data, a mass change can create more errors rather than making multiple simple corrections. Look what happens to the original addresses if we use the code: Address = TRANWRD ( Address ,'PL','PLACE');1 81 CANTOR LN2 922 NEW_HAVEN PLACE3 22 HARING/ ROAD4 ONE WILLOW RD5 727 RAVEN TERR6 35 PLACE 42 ST8 15 S. SWAN AVE9 129 NORTH BDWYIn observation # 2 the change was appropriate. For observation # 6, assuming the original Address was factually correct (that 35 PL Street does exist), the new Address now has an incorrect street name.
8 This is problematic for both a mailing list and for data that will be geocoded, where such a change could lead to an incorrect geographic assignment or may not get geocoded at , TRANWRD does not search and replace words. TRANWRD does not consider word boundaries since it is scanning for strings. It will replace the target text string with your text of choice and leave any neighboring text: Address = TRANWRD ( Address ,'AVE','AVENUE');1 81 CANTOR LN2 922 NEW_HAVENUEN PL3 22 HARING/ ROAD4 ONE WILLOW RD5 727 RAVENUEN TERR6 35 PL 42 ST8 15 S.
9 SWAN AVENUE9 129 NORTH BDWYC oders' CornerNESUG 2008In observation #8, the desired change was made. However, notice that in observations #2 and #5, the string 'AVE' in the middle of street names has been changed to 'AVENUE'. One way to avoid such problems is to embed a blank space at the start or the end of the search string. That also requires that you add a space temporarily at the start and end of the variable value being changed: Address = left( TRANWRD (cat(' ', Address ,' '),' AVE ',' AVENUE ')); Using the CAT Function to add the extra blank spaces ensures that even if a search string occurs at the beginning or the end of an Address , there will always be a set of blanks around that string, for example:AVE OF THE AMERICASW ithout the leading blank added with the CAT Function , the string ' AVE ' would not be found.
10 The LEFT Function removes the leading blank that is added at the beginning of the string with the CAT , as of version 9 of SAS, the PRX functions also offer text changing capabilities. They are both more flexible and more powerful than the TRANWRD Function . However, they do require more study and practice and are not as easily understood as TRANWRD . Here is one example that allows you to change either 'AV' or 'AVE' to 'AVENUE' in one statement Using PRXCHANGE: Address = left(prxchange('s/ AVE | AV / AVENUE /',-1,cat(' ', Address ,' ')));The same precautions mentioned earlier with TRANWRD about searching for words rather than portions of words are taken here.