encode — Encode string into numeric and vice versa
Description   Quick start   Menu   Syntax   Options for encode
Options for decode   Remarks and examples   References   Also see

Description

encode creates a new variable named newvar based on the string variable varname, creating, adding to, or just using (as necessary) the value label newvar or, if specified, name. Do not use encode if varname contains numbers that merely happen to be stored as strings; instead, use generate newvar = real(varname) or destring; see [U] Categorical string variables, [FN] String functions, and [D] destring.

decode creates a new string variable named newvar based on the "encoded" numeric variable varname and its value label.

Quick start

Generate numeric newv1 from string v1, using the values of v1 to create a value label that is applied to newv1
    encode v1, generate(newv1)

Same as above, but name the value label mylabel1
    encode v1, generate(newv1) label(mylabel1)

Same as above, but refuse to encode v1 if values exist in v1 that are not present in preexisting value label mylabel1
    encode v1, generate(newv1) label(mylabel1) noextend

Convert numeric v2 to string newv2 using the value label applied to v2 to generate values of newv2
    decode v2, generate(newv2)
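The Quick start commands can be tried on a small made-up dataset. The following is a minimal sketch, not part of the manual entry: the values of v1 are invented, and the names newv1b and newv1c are hypothetical, introduced only so that the three encode calls can coexist in one session.

```stata
* Toy data (assumed values): a string variable with one empty observation
clear
input str6 v1
"male"
"female"
"female"
""
end

* Encode v1; the value label is created with the same name as the new variable
encode v1, generate(newv1)
label list newv1

* Same, but store the mapping in a value label named mylabel1
encode v1, generate(newv1b) label(mylabel1)

* mylabel1 now exists and already covers every value of v1, so noextend
* succeeds; a value of v1 absent from mylabel1 would make encode refuse
encode v1, generate(newv1c) label(mylabel1) noextend

* Convert the labeled numeric variable back into a string variable
decode newv1, generate(newv2)

* nolabel shows the stored numeric codes next to the strings
list v1 newv1 newv2, nolabel
```

Listing with nolabel makes it easy to see which variables hold numeric codes and which hold the strings themselves; the empty string in v1 is encoded as a numeric missing value.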
Menu

encode
    Data > Create or change data > Other variable-transformation commands > Encode value labels from string variable

decode
    Data > Create or change data > Other variable-transformation commands > Decode strings from labeled numeric variable

Syntax

String variable to numeric variable
    encode varname [if] [in], generate(newvar) [label(name) noextend]

Numeric variable to string variable
    decode varname [if] [in], generate(newvar) [maxlength(#)]

Options for encode

generate(newvar) is required and specifies the name of the variable to be created.

label(name) specifies the name of the value label to be created or used and added to if the named value label already exists. If label() is not specified, encode uses the same name for the label as it does for the new variable.

noextend specifies that varname not be encoded if there are values contained in varname that are not present in label(name). By default, any values not present in label(name) will be added to that label.

Options for decode

generate(newvar) is required and specifies the name of the variable to be created.

maxlength(#) specifies how many bytes of the value label to retain; # must be between 1 and 32,000.
The default is maxlength(32000).

Remarks and examples

Remarks are presented under the following headings:
    encode
    decode
    Video example

encode

encode is most useful in making string variables accessible to Stata's statistical routines, most of which can work only with numeric variables. encode is also useful in reducing the size of a dataset. If you are not familiar with value labels, read [U] Value labels.

The maximum number of associations within each value label is 65,536. Each association in a value label maps a string of up to 32,000 bytes to a number. For plain ASCII text, the number of bytes is equal to the number of characters. If your string has other Unicode characters, the number of bytes is greater than the number of characters. See [U] Handling Unicode strings. If your variable contains string values longer than 32,000 bytes, then only the first 32,000 bytes are retained and assigned as a value label to a number.

Example 1

We have a dataset on high blood pressure, and among the variables is sex, a string variable containing either "male" or "female".
We wish to run a regression of high blood pressure on race, sex, and age group. We type regress hbp race sex age_grp and get the message "no observations".

    . use
    . regress hbp sex race age_grp
    no observations
    r(2000);

Stata's statistical procedures cannot directly deal with string variables; as far as they are concerned, all observations on sex are missing. encode provides the solution:

    . encode sex, gen(gender)
    . regress hbp gender race age_grp
      (output omitted; Number of obs = 1,121, Root MSE = .21027)

encode looks at a string variable and makes an internal table of all the values it takes on, here "male" and "female".
It then alphabetizes that list and assigns numeric codes to each entry. Thus 1 becomes "female" and 2 becomes "male". It creates a new long variable (gender) and substitutes a 1 where sex is "female", a 2 where sex is "male", and a missing (.) where sex is null (""). It creates a value label (also named gender) that records the mapping of 1 to female and 2 to male. Finally, encode labels the values of the new variable with the value label.

Example 2

It is difficult to distinguish the result of encode from the original string variable. For instance, in our last two examples, we typed encode sex, gen(gender). Let's compare the two variables:

    . list sex gender in 1/4
      (output omitted)

They look almost identical, although you should notice the missing value for gender.

The difference does show, however, if we tell list to ignore the value labels and show how the data really appear.
    . list sex gender in 1/4, nolabel
      (output omitted)

We could also ask to see the underlying value label:

    . label list gender
    gender:
               1 female
               2 male

gender really is a numeric variable, but because all Stata commands understand value labels, the variable displays as "male" and "female", just as the underlying string variable does.

Example 3

We can drastically reduce the size of our dataset by encoding strings and then discarding the underlying string variable. We have a string variable, sex, that records each person's sex as "male" and "female". Because "female" has six characters, the variable is stored as a str6.

We can encode the sex variable and use compress to store the variable as a byte, which takes only 1 byte. Because our dataset contains 1,130 people, the string variable takes 6,780 bytes, but the encoded variable will take only 1,130 bytes.

    . use , clear
    . describe
    Contains data from
     Observations:         1,130
        Variables:             7                  3 Mar 2022 06:47

    Variable      Storage   Display    Value
        name         type    format    label      Variable label
    id            str10     %10s                  Record identification number
    city          byte      %                     City
    year          int       %                     Year
    age_grp       byte      %          agefmt     Age group
    race          byte      %          racefmt    Race
    hbp           byte      %          yn         High blood pressure
    sex           str6      %9s                   Sex
    Sorted by:
    . encode sex, generate(gender)
    . list sex gender in 1/5
      (output omitted)
    . drop sex
    . rename gender sex
    . compress
      variable sex was long now byte
      (3,390 bytes saved)
    . describe
    Contains data from
     Observations:         1,130
        Variables:             7                  3 Mar 2022 06:47

    Variable      Storage   Display    Value
        name         type    format    label      Variable label
    id            str10     %10s                  Record identification number
    city          byte      %                     City
    year          int       %                     Year
    age_grp       byte      %          agefmt     Age group
    race          byte      %          racefmt    Race
    hbp           byte      %          yn         High blood pressure
    sex           byte      %          gender     Sex
    Sorted by:
         Note: Dataset has changed since last saved.

The size of our dataset has fallen from 24,860 bytes to 19,210 bytes.

Technical note

In the examples above, the value label did not exist before encode created it, because that is not required.
If the value label does exist, encode uses your encoding as far as it can and adds new mappings for anything not found in your value label. For instance, if you wanted "female" to be encoded as 0 rather than 1 (possibly for use in linear regression), you could type

    . label define gender 0 "female"
    . encode sex, gen(gender)

You can also specify the name of the value label. If you do not, the value label is assumed to have the same name as the newly created variable. For instance,

    . label define sexlbl 0 "female"
    . encode sex, gen(gender) label(sexlbl)

decode

decode is used to convert numeric variables with associated value labels into true string variables.

Example 4

We have a numeric variable named female that records the values 0 and 1. female is associated with a value label named sexlbl that says that 0 means male and 1 means female:

    . use , clear
    . describe female

    Variable      Storage   Display    Value
        name         type    format    label      Variable label
    female        byte      %          sexlbl     Female
    . label list sexlbl
    sexlbl:
               0 Male
               1 Female

We see that female is stored as a byte. It is a numeric variable. Nevertheless, it has an associated value label describing what the numeric codes mean, so if we tabulate the variable, for instance, it appears to contain the strings "male" and "female":

    . tabulate female
      (output omitted)

We can create a real string variable from this numerically encoded variable by using decode:

    . decode female, gen(sex)
    . describe sex

    Variable      Storage   Display    Value
        name         type    format    label      Variable label
    sex           str6      %9s                   Female

We have a new variable called sex. It is a string, and Stata automatically created the shortest possible string. The word "female" has six characters, so our new variable is a str6. When listed, female and sex look the same:

    . list female sex in 1/4
      (output omitted)

But when we add nolabel, the difference is apparent.
    . list female sex in 1/4, nolabel
      (output omitted)

Example 5

decode is most useful in instances when we wish to match-merge two datasets on a variable that has been encoded inconsistently.

For instance, we have two datasets on individual states in which one of the variables (state) takes on values such as "CA" and "NY". The state variable was originally a string, but along the way the variable was encoded into an integer with a corresponding value label in one or both datasets.

We wish to merge these two datasets, but either 1) one of the datasets has a string variable for state and the other an encoded variable or 2) although both are numeric, we are not certain that the codings are consistent. Perhaps "CA" has been coded 5 in one dataset and 6 in the other.

Because decode can take an encoded variable and turn it back into a string, decode provides the solution:

    use first                    // load the first dataset
    decode state, gen(st)        // make a string state variable
    drop state                   // discard the encoded variable
    sort st                      // sort on string
    save first, replace          // save the dataset
    use second                   // load the second dataset
    decode state, gen(st)        // make a string variable
    drop state                   // discard the encoded variable
    sort st                      // sort on string
    merge 1:1 st using first     // merge the data

Video example

    How to convert categorical string variables to labeled numeric variables

References

Cox, N.