Large amounts of
data are often compressed into more easily assimilated summaries, which provide
the user with a sense of the content, without overwhelming him or her with too
many numbers. There a number of ways data can be presented. We will consider
two here—one is to present the data in a distribution, and the other is
to provide summary statistics that capture key aspects of the data.
When presented
with thousands of pieces of information, you can break the numbers down into
individual values (or ranges of values) and indicate the number of individual
data items that take on each value or range of values. This is called a frequency distribution. If the data can only take on specific values, as is
the case when we record the number of goals scored in a soccer game, you get a discrete distribution. When the data can take on any value within the range,
as is the case with income or market capitalization, it is called a continuous distribution.
The advantages of
presenting the data in a distribution are twofold. For one thing, you can
summarize even the largest data sets into one distribution and get a measure of
what values occur most frequently and the range of high and low values. The
second is that the distribution can resemble one of the many common ones about
which we know a great deal in statistics. Consider, for instance, the
distribution that we tend to draw on the most in analysis: the normal
distribution, illustrated in Figure A1.1.
A
normal distribution is symmetric, has a peak centered around the middle of the
distribution, and tails that are not fat and stretch to include infinite positive
or negative values. Not all
distributions are symmetric, though. Some are weighted towards extreme positive
values and are called positively skewed, and some towards extreme negative
values and are considered negatively skewed. Figure A1.2 illustrates positively
and negatively skewed distributions.
The simplest way
to measure the key characteristics of a data set is to estimate the summary
statistics for the data. For a data series, X1, X2, X3,
. . . Xn,
where n is the number of observations
in the series, the most widely used summary statistics are as follows:
¥ The mean (m), which
is the average of all of the observations in the data series.
¥ The median, which is the midpoint of the series; half the data
in the series is higher than the median and half is lower.
¥ The
variance, which is a measure of the spread in the distribution around the mean
and is calculated by first summing up the squared deviations from the mean, and
then dividing by either the number of observations (if the data represent the
entire population) or by this number, reduced by one (if the data represent a
sample).
The
standard deviation is the square root of the variance.
The
mean and the standard deviation are the called the first two moments of any
data distribution. A normal distribution can be entirely described by just
these two moments; in other words, the mean and the standard deviation of a
normal distribution suffice to characterize it completely. If a distribution is
not symmetric, the skewness is the third moment that
describes both the direction and the magnitude of the asymmetry and the
kurtosis (the fourth moment) measures the fatness of the tails
of the distribution relative to a normal distribution.