
DESCRIPTIVE STATISTICS AND EXPLORATORY …

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS

SEEMA JAGGI
Indian Agricultural Statistics Research Institute
Library Avenue, New Delhi - 110 012

1. DESCRIPTIVE STATISTICS

Statistics is a set of procedures for gathering, measuring, classifying, computing, describing, synthesizing, analyzing, and interpreting systematically acquired quantitative data. Statistics has two major components: descriptive statistics and inferential statistics. Descriptive statistics gives numerical and graphic procedures to summarize a collection of data in a clear and understandable way, whereas inferential statistics provides procedures to draw inferences about a population from a sample.

Suppose that we have the distribution of the yields (kg per plot) of two paddy varieties from 5 plots each.

Tags:

  Statistics, Descriptive, Exploratory, Descriptive statistics and exploratory, Paddy



Descriptive statistics help us to simplify large amounts of data in a sensible way. Each descriptive statistic reduces lots of data into a simpler summary. There are two basic methods: numerical and graphical. Using the numerical approach, one might compute statistics such as the mean and standard deviation, which convey information about the average and the spread of the data. Using the graphical approach, one might draw plots, which contain detailed information about the distribution. Graphical methods are better suited than numerical methods for identifying patterns in the data, while numerical approaches are more precise and objective. Since the numerical and graphical approaches complement each other, it is wise to use both.

There are three major characteristics of a single variable that we tend to look at:

Distribution
Central Tendency
Dispersion

Distribution

The distribution is a summary of the frequency of individual values or ranges of values for a variable. The simplest distribution would list every value of a variable and the number of times each value occurs. One of the most common ways to describe a single variable is with a frequency distribution. Frequency distributions can be depicted in two ways, as a table or as a graph. Distributions may also be displayed using percentages. A frequency distribution organizes the raw data or observations that have been collected as either ungrouped data or grouped data.
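As a minimal sketch of the two tabular forms, the Python snippet below builds an ungrouped table (one row per distinct value) and a grouped table (one row per class interval), with percentages for the ungrouped case. The function names, the sample scores, and the class width of 10 are illustrative assumptions, not part of the original notes.

```python
from collections import Counter

# Illustrative scores; any list of numbers would do.
scores = [15, 15, 15, 20, 20, 21, 25, 36]

def ungrouped_frequencies(values):
    """Return {value: count} for every distinct value, in ascending order."""
    counts = Counter(values)
    return {value: counts[value] for value in sorted(counts)}

def grouped_frequencies(values, width):
    """Return {(lower, upper): count} for equal-width classes [lower, upper).

    Only classes that contain at least one value are listed.
    """
    counts = Counter((v // width) * width for v in values)
    return {(lower, lower + width): counts[lower] for lower in sorted(counts)}

ungrouped = ungrouped_frequencies(scores)
grouped = grouped_frequencies(scores, width=10)  # class width is an assumption

print("value  count  percent")
for value, count in ungrouped.items():
    print(f"{value:>5}  {count:>5}  {100 * count / len(scores):.1f}%")

print("class  count")
for (lower, upper), count in grouped.items():
    print(f"{lower}-{upper - 1}  {count}")
```

The grouped table is shorter and shows the overall pattern more readily, but it no longer records which individual scores fell inside each class.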

Ungrouped data list every score that occurs in a distribution and indicate how often each score occurs. Grouped data combine the scores into classes and indicate how many scores fall within each class. With grouped data it is easier to see patterns, but the information about individual scores is lost. Graphs make it easier to see certain characteristics and trends in a set of data. Graphs for quantitative data include the histogram, frequency polygon, etc., and graphs for qualitative data include the bar chart, pie chart, etc.

Shape of the Distribution

An important aspect of the "description" of a variable is the shape of its distribution, which tells the frequency of values from different ranges of the variable.

Typically, a researcher is interested in how well the distribution can be approximated by the normal distribution. Simple descriptive statistics can provide some information relevant to this issue. For example, if the skewness (which measures the deviation of the distribution from symmetry) is clearly different from 0, then that distribution is asymmetrical, while normal distributions are perfectly symmetrical. If the kurtosis (which measures the peakedness of the distribution) is clearly different from 0, then the distribution is either flatter or more peaked than normal; the kurtosis of the normal distribution is 0 when it is expressed as excess kurtosis, that is, relative to the normal distribution.
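The following Python sketch computes skewness and excess kurtosis from central moments, so that both are 0 for a normal distribution. It uses the simple moment-based (population) formulas as an assumption; statistical packages often apply small-sample corrections, so their output may differ slightly. The function names and sample data are illustrative.

```python
def central_moment(values, k):
    """k-th central moment: the mean of (x - mean)**k."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (0 for a symmetric sample)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]        # evenly spread around 3
right_skewed = [1, 1, 1, 2, 10]    # long tail towards large values

print(skewness(symmetric))         # symmetric sample: skewness of zero
print(skewness(right_skewed))      # positive: tail on the right
print(excess_kurtosis(symmetric))  # negative: flatter than normal
```

A positive skewness indicates a longer tail towards large values and a negative one a longer tail towards small values; a negative excess kurtosis indicates a distribution flatter than the normal.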

More precise information can be obtained by performing one of the tests of normality to determine the probability that the sample came from a normally distributed population of observations (e.g., the so-called Kolmogorov-Smirnov test, or the Shapiro-Wilk W test). However, none of these tests can entirely substitute for a visual examination of the data using a histogram (i.e., a graph that shows the frequency distribution of a variable). With the normal curve superimposed over the histogram, the graph allows you to evaluate the normality of the empirical distribution. It also allows you to examine various aspects of the distribution qualitatively.

For example, the distribution could be bimodal (have 2 peaks). This might suggest that the sample is not homogeneous but possibly its elements came from two different populations, each more or less normally distributed. In such a case, in order to understand the nature of the variable in question, one should look for a way to quantitatively identify the two sub-samples.

Central Tendency

The central tendency of a distribution is an estimate of the "center" of a distribution of values. There are three major types of estimates of central tendency:

Mean
Median
Mode

The Mean or average is probably the most commonly used method of describing central tendency.

It is the most common measure of central tendency. To compute the mean, all the values are added up and divided by the number of values.

The Median is the score found at the exact middle of the set of values. One way to compute the median is to list all scores in numerical order, and then locate the score in the center of the sample. For example, if there are 500 scores in the list, the median lies at the halfway point, between score number 250 and score number 251. Let the 8 scores be ordered as 15, 15, 15, 20, 20, 21, 25, 36. Score number 4 and number 5 represent the halfway point. Since both of these scores are 20, the median is 20. If the two middle scores had different values, you would have to interpolate to determine the median, usually by taking their average.
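The two procedures just described can be sketched in Python as follows, using the 8 scores from the text. The function names are illustrative, and averaging the two middle scores is the usual interpolation convention for an even number of values.

```python
def mean(values):
    """Add up all the values and divide by the number of values."""
    return sum(values) / len(values)

def median(values):
    """Middle score of the ordered values.

    With an even number of scores, the two middle scores are averaged.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

scores = [15, 15, 15, 20, 20, 21, 25, 36]

print(mean(scores))    # 167 / 8 = 20.875
print(median(scores))  # scores number 4 and 5 are both 20, so 20.0
```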

The Mode is the most frequently occurring value in the set of scores. To determine the mode, order the scores as shown above, and then count each one. The most frequently occurring value is the mode. It can be used for either numerical or categorical data. In the above example, the value 15 occurs three times and is the mode. In some distributions, there may be more than one modal value. For instance, in a bimodal distribution there are two values that occur most frequently. Further, there may not be a mode at all. The mode is not affected by extreme values. If the yields of paddy from different fields are , , , , , , , , , and tonnes per hectare, the modal value is tonnes per hectare.
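A minimal Python sketch of the counting procedure is given below. The function name `modes` and the convention of returning an empty list when every value occurs exactly once (the "no mode" case) are assumptions made for illustration.

```python
from collections import Counter

def modes(values):
    """Return all most frequently occurring values, in sorted order.

    Returns an empty list when there is no mode, i.e. when several
    distinct values are present and each occurs exactly once.
    """
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1 and len(counts) > 1:
        return []
    return sorted(value for value, count in counts.items() if count == highest)

print(modes([15, 15, 15, 20, 20, 21, 25, 36]))  # [15]
print(modes([1, 1, 2, 2, 3]))                   # bimodal: [1, 2]
print(modes([1, 2, 3]))                         # no mode: []
print(modes(["red", "blue", "red"]))            # categorical data: ['red']
```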

For the same set of 8 scores, three different values, 20.875, 20, and 15 for the mean, median and mode respectively have been obtained. If the distribution is truly normal (i.e., bell-shaped), the mean, median and mode are all equal to each other. While the mean is the most frequently used measure of central tendency, it does suffer from one major drawback. Unlike other measures of central tendency, the mean can be influenced profoundly by one extreme data point (referred to as an "outlier"). The median and mode clearly do not suffer from this problem. There are certainly occasions where the mode or median might be more appropriate. For qualitative and categorical data, the mode makes sense, but the mean and median do not.
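The sensitivity of the mean to a single outlier can be illustrated with Python's standard `statistics` module. The outlier value of 360 is a hypothetical recording error introduced only for this sketch; it does not come from the original notes.

```python
import statistics

scores = [15, 15, 15, 20, 20, 21, 25, 36]

# Hypothetical outlier: the largest score mistakenly recorded as 360.
with_outlier = [15, 15, 15, 20, 20, 21, 25, 360]

for label, data in (("original", scores), ("with outlier", with_outlier)):
    print(
        label,
        "mean =", statistics.mean(data),
        "median =", statistics.median(data),
        "mode =", statistics.mode(data),
    )
```

Under this assumption the mean rises from 20.875 to 61.375, while the median stays at 20 and the mode stays at 15.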

