Transcription of Summary Statistics in SAS - Mark Irwin
Summary Statistics in SAS

Statistics 135, Autumn 2005
Copyright (c) 2005 by Mark E. Irwin

There are a number of approaches to calculating summary statistics in SAS. The three most common are:

- PROC MEANS: Provides data summarization tools to compute descriptive statistics for variables across all observations and within groups of observations.
- PROC UNIVARIATE: Calculates many of the statistics that PROC MEANS does, plus some standard univariate graphical summaries, comparison of data to fixed distributions, and parameter estimation.
- PROC TABULATE: Displays descriptive statistics in tabular format, using some or all of the variables in a data set. You can create a variety of tables ranging from simple to highly customized. PROC TABULATE computes many of the same statistics that are computed by other descriptive statistical procedures such as PROC MEANS, PROC FREQ, and PROC ...

Roofing Shingle Sales

Data on sales last year in 49 sales districts were collected for a maker of asphalt roofing shingles.
- Sales in 1000s of squares (sales)
- Promotional expenditures in 1000s of $ (promotion)
- Number of active accounts (accounts)
- Number of competing brands (brands)
- District potential (potential)

PROC MEANS

- Calculates descriptive statistics based on moments
- Estimates quantiles, which includes the median
- Calculates confidence limits for the mean
- Identifies extreme values
- Performs a t test

    PROC MEANS <option(s)> <statistic-keyword(s)>;
        BY <DESCENDING> variable-1 <...<DESCENDING> variable-n> <NOTSORTED>;
        CLASS variable(s) </ option(s)>;
        FREQ variable;
        ID variable(s);
        OUTPUT <OUT=SAS-data-set> <output-statistic-specification(s)>
               <id-group-specification(s)> <maximum-id-specification(s)>
               <minimum-id-specification(s)> </ option(s)>;
        TYPES request(s);
        VAR variable(s) </ WEIGHT=weight-variable>;
        WAYS list;
        WEIGHT variable;

There is a wide range of statistics calculated in this PROC.
These include:

- Descriptive statistics: N, NMISS, MEAN, STDDEV|STD, VAR, MIN, MAX, RANGE, CV, SKEWNESS|SKEW, KURTOSIS|KURT, STDERR, CSS, SUM, SUMWGT, USS, CLM (2-sided CI of μ), LCLM, UCLM (1-sided CI of μ).
  The default statistics are N, MEAN, STD, MIN, MAX.
- Quantile statistics: MEDIAN|P50, Q3|P75, P1, P90, P5, P95, P10, P99, Q1|P25, QRANGE
- Hypothesis testing: PROBT, T

There are many options available in this PROC. The most useful are:

- DATA = SAS-data-set: Sets the data set for the PROC.
- ALPHA = α (default = 0.05): Sets the confidence level to 1 - α for the confidence procedures.
- FW = field-width: Specifies the field width used to display statistics in displayed output. Has no effect on values saved in an output data set.
- PRINT|NOPRINT (default = PRINT): Specifies whether output is to be displayed.

    PROC MEANS DATA = shingles;
        TITLE 'PROC MEANS Output of Roofing Shingle Sales';
        TITLE2 'Default Output';
        VAR sales promotion accounts brands potential;
    RUN;

    PROC MEANS Output of Roofing Shingle Sales
    Default Output
    19:43 Sunday, November 27, 2005

    The MEANS Procedure

    [Output table: N, Mean, Std Dev, Minimum, and Maximum for sales, promotion, accounts, brands, and potential; N = 49 for each variable.]

    PROC MEANS DATA = shingles
            MEAN STD MIN Q1 MEDIAN Q3 MAX CLM PROBT T   /* statistics */
            ALPHA = 0.01 FW = 8;                        /* options */
        TITLE 'PROC MEANS Output of Roofing Shingle Sales';
        TITLE2 'Statistics Selected';
        VAR sales promotion accounts brands potential;
    RUN;

    PROC MEANS Output of Roofing Shingle Sales
    Statistics Selected
    19:43 Sunday, November 27, 2005

    The MEANS Procedure

    [Output table: Mean, Std Dev, Minimum, Lower Quartile, Median, Upper Quartile, Maximum, Lower 99% CL for Mean, Upper 99% CL for Mean, Pr > |t|, and t Value for each variable; Pr > |t| is <.0001 for all five variables.]

PROC UNIVARIATE

PROC UNIVARIATE provides:

- descriptive statistics based on moments (including skewness and kurtosis), quantiles or percentiles (such as the median), frequency tables, and extreme values
- histograms and comparative histograms. Optionally, these can be fitted with probability density curves for various distributions and with kernel density estimates.
- quantile-quantile plots (Q-Q plots) and probability plots. These plots facilitate the comparison of a data distribution with various theoretical distributions.
- goodness-of-fit tests for a variety of distributions including the normal
- the ability to inset summary statistics on plots produced on a graphics device
- the ability to analyze data sets with a frequency variable
- the ability to create output data sets containing summary statistics, histogram intervals, and parameters of fitted curves

    PROC UNIVARIATE <options>;
        BY variables;
        CLASS variable-1 <(v-options)> <variable-2 <(v-options)>>
              </ KEYLEVEL= value1 | (value1 value2)>;
        FREQ variable;
        HISTOGRAM <variables> </ options>;
        ID variables;
        INSET keyword-list </ options>;
        OUTPUT <OUT=SAS-data-set> <keyword1=...> <percentile-options>;
        PROBPLOT <variables> </ options>;
        QQPLOT <variables> </ options>;
        VAR variables;
        WEIGHT variable;

This PROC generates a very large amount of output by default, and other options will increase it.
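As a minimal sketch of how these statements fit together, the following call requests the default output for one variable and adds a histogram and a Q-Q plot, each compared with a normal distribution whose parameters are estimated from the data. It assumes the shingles data set used in the PROC MEANS examples. The choice of sales as the analysis variable and the NORMAL plot options are illustrative, not taken from the lecture.

```sas
/* Sketch only: assumes the shingles data set from the PROC MEANS examples. */
PROC UNIVARIATE DATA = shingles;
    VAR sales;                                /* analysis variable (illustrative choice) */
    HISTOGRAM sales / NORMAL;                 /* histogram with a fitted normal density */
    QQPLOT sales / NORMAL(MU=EST SIGMA=EST);  /* Q-Q plot with a reference line from the estimated mean and sd */
RUN;
```

Even this small call produces several pages of output, since the default tables of moments, quantiles, and extreme observations are printed in addition to the plots.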
Some useful options are:

- ALPHA = α (default = 0.05): Sets the default confidence level to 1 - α for the confidence procedures. Can be overridden for specific intervals.
- CIBASIC <(<TYPE = keyword> <ALPHA = α>)>: Gives confidence intervals for μ, σ, and σ², assuming the data are normally distributed. TYPE specifies whether the interval is TWOSIDED (default), LOWER, or UPPER.
- CIPCTLDF <(<TYPE = keyword> <ALPHA = α>)>
  CIQUANTDF <(<TYPE = keyword> <ALPHA = α>)>: Calculates confidence intervals for quantiles by a distribution-free method. The TYPE keywords include LOWER, UPPER, and SYMMETRIC (default).
- CIPCTLNORMAL <(<TYPE = keyword> <ALPHA = α>)>
  CIQUANTNORMAL <(<TYPE = keyword> <ALPHA = α>)>: Calculates confidence intervals for quantiles assuming normally distributed data.
  The options are the same as those for CIBASIC.
- MU0 = μ0: Sets the null hypothesis value of the location parameter for tests of location. If you specify one value, it is used for all variables. If you specify more than one, you must specify the variables with a VAR statement. The default value is 0.
- NEXTROBS = n: Specifies the number of extreme observations (n smallest and n largest) to be displayed for each variable.
- NORMAL: Generates 4 tests of normality: Shapiro-Wilk, Kolmogorov-Smirnov, Anderson-Darling, and Cramer-von Mises. I suspect, but can't confirm, that the Kolmogorov-Smirnov test is actually the Lilliefors test, as you don't want to specify a mean and variance of the normal for the test, which would be required for the strict use of the Kolmogorov-Smirnov test.
- PLOT: Produces stem-and-leaf, box plot, and normal probability plot in line-printer output.
  If a BY statement is used, side-by-side box plots are generated.
- ROBUSTSCALE: Generates a table of robust estimates of scale. These include the interquartile range, Gini's mean difference, median absolute deviation around the median (MAD), plus a couple more due to Rousseeuw and Croux (1993).
- TRIMMED=values <(<TYPE = keyword> <ALPHA = α>)>
  TRIM=values <(<TYPE = keyword> <ALPHA = α>)>: Generates a table of trimmed means, where values specifies the number or proportion of observations trimmed.
- WINSORIZED=values <(<TYPE = keyword> <ALPHA = α>)>
  WINSOR=values <(<TYPE = keyword> <ALPHA = α>)>: Generates a table of Winsorized means, a robust measure of location. The options work the same as for TRIMMED.
- VARDEF=divisor: Specifies the divisor to use in calculating variances. There are 4 choices:

    Value         Divisor                     Formula for Divisor
    DF            Degrees of freedom          n - 1
    N             Number of observations      n
    WDF           Sum of weights minus one    (Σ_i w_i) - 1
    WEIGHT|WGT    Sum of weights              Σ_i w_i

Let's now look at the various statements that can be included in a PROC UNIVARIATE block:

- VAR: Specifies the analysis variables and their order in the results.
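The options described above are all given on the PROC UNIVARIATE statement itself, while VAR is a separate statement inside the block. A minimal sketch, again assuming the shingles data set from the PROC MEANS examples; the particular values (a 99% level, a trimming proportion of 0.1, two analysis variables) are illustrative choices, not values from the lecture:

```sas
/* Sketch only: option values are illustrative assumptions. */
PROC UNIVARIATE DATA = shingles
        ALPHA = 0.01        /* default confidence level becomes 99% */
        CIBASIC             /* CIs for the mean, standard deviation, and variance */
        MU0 = 0             /* null value for the tests of location (the default) */
        NEXTROBS = 5        /* show the 5 smallest and 5 largest observations */
        NORMAL              /* tests of normality */
        ROBUSTSCALE         /* table of robust estimates of scale */
        TRIMMED = 0.1       /* trimmed means, with 0.1 given as a proportion */
        WINSORIZED = 0.1    /* Winsorized means, same proportion */
        VARDEF = DF;        /* divisor n - 1 for variances */
    VAR sales promotion;    /* analysis variables, reported in this order */
RUN;
```

Because a single MU0 value is given, it applies to both variables. Supplying one value per variable would require them to be listed in the VAR statement, as they are here.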