Lecture 1.3: Outline
Lecture 1.3: Interpreting results, descriptive statistics
- descriptive statistics [Lar82], sample mean and sample variance
After collecting data in a simulation experiment, we
often want to calculate some statistics to characterize
the results, typically estimates of the mean and variance
of certain observed quantities. If you measure the tail
length of 10 laboratory mice, for example, you might
very naturally calculate the average length, and the
average squared deviation from the average length.
The average length is more properly called the sample
mean, and the average squared deviation from the sample
mean, the sample variance.
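For concreteness, here is a minimal sketch in Python (the ten tail
lengths below are made-up numbers, used only for illustration):

    # Hypothetical tail lengths (cm) of 10 laboratory mice -- made-up data.
    lengths = [7.9, 8.3, 8.1, 7.6, 8.4, 8.0, 7.8, 8.2, 8.5, 7.7]

    n = len(lengths)
    sample_mean = sum(lengths) / n

    # Average squared deviation from the sample mean, divided by (N-1);
    # the reason for (N-1) rather than N is discussed next.
    sample_variance = sum((x - sample_mean) ** 2 for x in lengths) / (n - 1)

    print(sample_mean, sample_variance)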
Here's a point that's sometimes shrouded in mystery,
but is actually simple to understand:
When we sum the squares of the differences between the observations
and the sample mean, we should divide by (N-1), not N, where
there are N observations. The intuitive reason is
that there are only (N-1) degrees of freedom among the N sample
deviations from the sample mean, because the deviations must sum to zero.
The algebra shows that the sum of the squared deviations
divided by (N-1) has the right expected value, the variance
of the observed quantity, and is thus an unbiased estimate
of the actual variance.
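A quick Monte Carlo check of this claim, as a sketch (assumes NumPy;
the Gaussian parent distribution with sigma^2 = 4 is an arbitrary
choice, any distribution with finite variance would do):

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 10.0, 2.0            # true mean and standard deviation
    N, trials = 5, 100_000           # a small N makes the bias easy to see

    var_nm1, var_n = [], []
    for _ in range(trials):
        x = rng.normal(mu, sigma, N)
        ss = np.sum((x - x.mean()) ** 2)   # sum of squared deviations
        var_nm1.append(ss / (N - 1))
        var_n.append(ss / N)

    # Dividing by (N-1) averages out near sigma^2 = 4.0;
    # dividing by N comes out near ((N-1)/N)*sigma^2 = 3.2.
    print(np.mean(var_nm1), np.mean(var_n))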
If the observed random variable has mean mu and variance sigma^2,
and we take N independent samples, the sample mean is an
unbiased estimator of mu (its expected value is mu), and has
variance (1/N)*sigma^2. The square root of this, the standard
deviation of the sample mean, thus decreases as the square root
of the number of observations. This matters in many practical
situations --- you need a hundred times as many observations
to decrease the standard deviation of the sample mean by a factor of
ten.
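A sketch of this square-root behavior (assumes NumPy; the Gaussian
parent distribution and the particular sample sizes are arbitrary
choices):

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma = 0.0, 1.0
    trials = 1_000

    for N in (100, 10_000):          # a factor of 100 more observations
        means = rng.normal(mu, sigma, (trials, N)).mean(axis=1)
        # Empirical standard deviation of the sample mean vs. sigma/sqrt(N):
        # going from N = 100 to N = 10,000 shrinks it by about a factor of ten.
        print(N, means.std(), sigma / np.sqrt(N))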
- Importance of Gaussian (normal) distribution
The Gaussian, or normal, distribution plays
a fundamental role in probability theory and statistics.
One reason is that the sum of independent observations
tends to this distribution rather quickly in practice.
Under broad conditions this can be proved mathematically,
and the result is called the Central Limit Theorem.
Many observed variables in nature are in fact the sum of
many independent random effects, and are very well
approximated by a Gaussian random variable.
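A small demonstration of this tendency (a sketch assuming NumPy; the
choice of Uniform(0,1) summands and k = 12 of them is arbitrary):

    import numpy as np

    rng = np.random.default_rng(2)
    k, trials = 12, 100_000

    # Sum of k independent Uniform(0,1) variables: mean k/2, variance k/12.
    sums = rng.uniform(0.0, 1.0, (trials, k)).sum(axis=1)
    z = (sums - k / 2) / np.sqrt(k / 12)      # standardize

    # A standard Gaussian puts about 68.3% of its mass within one
    # standard deviation and 95.4% within two; the standardized sums
    # come very close to that even for k = 12.
    print(np.mean(np.abs(z) < 1), np.mean(np.abs(z) < 2))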
It's common for experimentalists to assume that the
variables, and especially the noise, they are dealing
with are roughly Gaussian when estimating confidence
intervals. That's OK, as long as you are aware of the
assumption and are prepared to think about exceptional
cases when you might be led astray.
Important properties of the Gaussian:
- linear combinations are also Gaussian
- has maximum entropy for a given mean and variance; that is, is ``most random''
- least-squares estimates are maximum likelihood
- many derived random variables have analytically
known densities, like chi-squared and Student t (see below)
- the sample mean and the sample variance of N independent,
identically distributed samples are independent, and the sample
mean is also normal with the same mean as the parent distribution,
and (1/N)th the variance.
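The last property is easy to probe by simulation; the sketch below
(assuming NumPy) estimates the correlation between sample mean and
sample variance. It is near zero for Gaussian samples, but clearly
nonzero for a skewed parent distribution such as the exponential.
(Zero correlation does not by itself prove independence; this is only
a consistency check.)

    import numpy as np

    rng = np.random.default_rng(3)
    N, trials = 10, 50_000

    def mean_var_corr(samples):
        # Correlation, across trials, between sample mean and sample variance.
        m = samples.mean(axis=1)
        v = samples.var(axis=1, ddof=1)
        return np.corrcoef(m, v)[0, 1]

    gauss = rng.normal(0.0, 1.0, (trials, N))
    expo = rng.exponential(1.0, (trials, N))   # skewed, for contrast

    # Near zero for the Gaussian; clearly positive for the exponential.
    print(mean_var_corr(gauss), mean_var_corr(expo))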
- distributions derived from Gaussian
Certain statistics derived from independent samples of a Gaussian
have special known distributions which can be computed fairly
easily. (They used to be published in big fat tables.)
The two most important are
- the normalized sample variance (sum of the squared deviations
from the sample mean, divided by the true variance) of N
independent samples from a Gaussian is chi-squared distributed
with N-1 degrees of freedom
- the deviation of the sample mean from the true mean, divided by
the estimated standard deviation of the sample mean (s/sqrt(N), where
s^2 is the sample variance), is Student t distributed with N-1 degrees of freedom.
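Today the table lookup is a couple of lines in SciPy; a sketch (the
choice of N = 10 and the 99% level are arbitrary):

    from scipy import stats

    N = 10                     # number of samples
    df = N - 1                 # degrees of freedom

    # The values a big fat table would give you: points cutting off the
    # central 99% of each distribution.
    t_hi = stats.t.ppf(0.995, df)         # Student t is symmetric about 0
    chi_lo = stats.chi2.ppf(0.005, df)    # chi-squared is not symmetric
    chi_hi = stats.chi2.ppf(0.995, df)

    print(t_hi, chi_lo, chi_hi)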
- confidence intervals and interpretation of results
If you plot, say, the sample mean of a number of measurements,
it would be very nice if you could say roughly how accurate
your estimate is --- that is, how far you can reasonably expect
your estimate to differ from the true mean. These "error bars"
are expected in your graphs as a matter of course in many
experimental disciplines.
If you assume that the samples you are measuring are independent
samples from a Gaussian distribution, the distributions above
enable you to estimate confidence intervals. Take the sample
mean, for example. Its (normalized) deviation from the true mean has a known
distribution (Student t), so you can look up two values, say
L and R, with the property that this normalized deviation falls in
the interval [L,R] with probability 99%. Undoing the normalization
gives an interval you can plot
around your data point, giving the viewer an estimate of how accurate
the observation is. (The same idea works for sample variance if
you use the chi-squared distribution.)
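A sketch of the whole recipe for the sample mean (assumes NumPy and
SciPy; the function name mean_confidence_interval and the measurements
are made up for illustration):

    import numpy as np
    from scipy import stats

    def mean_confidence_interval(x, confidence=0.99):
        # Confidence interval for the mean, assuming the observations are
        # independent samples from a (roughly) Gaussian distribution.
        x = np.asarray(x, dtype=float)
        n = len(x)
        m = x.mean()
        s = x.std(ddof=1)                  # sample standard deviation
        half = stats.t.ppf(0.5 + confidence / 2, n - 1) * s / np.sqrt(n)
        return m - half, m + half

    # Made-up measurements.
    data = [10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.0, 9.7, 10.3, 10.6]
    print(mean_confidence_interval(data))

The returned endpoints are what you would draw as error bars around
the sample mean.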