The logic of inferential statistics: some building blocks.
What we want to know:
Population parameter
Examples:
Mean height for Oregon women
Mean difference in stats anxiety for men & women
Proportion of Oregonians who prefer Bradley or Gore or Bush or Dole for president
What we have (in inferential statistics)
Sample statistic
Height of a few Oregon women
Difference in stats anxiety for men & women taking 302 this term
Results of a telephone poll of 500 Oregonians
We can use our statistic to make a guess at the parameter. How good is the guess? How far off
is it likely to be? To know this, we need a sampling distribution. The sampling distribution
allows us to figure out, for example, the standard deviation of our statistic, which (when we
are guessing at the mean) is the typical distance the statistic will fall from the true parameter.
In some cases, we can figure out what the sampling distribution should be because we know what
the population is (example: our card deck). However, if we know about the population, why
would we be doing inferential statistics in the first place? Usually we don't have the sampling
distribution -- we only have a single sample.
What comes to our rescue is the central limit theorem, which tells us what the sampling
distribution will be. It will be NORMAL, for one thing, because it is a distribution of
measurements that include error.
Early on, the normal curve was called the "error law" because it described the distribution of
errors in astronomical observations (which were full of error because of primitive telescopes). A
sample mean is also a kind of measurement of the population, and it is typically in error (it usually
will not be equal to the true population mean).
Central Limit Theorem:
For any population with mean μ and variance σ², the distribution of sample means
(x-bar) for sample size n will approach a normal distribution with a mean of μ and a
variance of σ²/n as n approaches infinity.
This means the following three things are true about sampling distributions:
1. The mean of x-bar will equal μ
2. The variance of x-bar will equal σ²/n
3. The distribution will be normal as n gets large OR if the population distribution is normal
In practice, "large" means about 20-30. When the population distribution is rectangular, the
sampling distribution becomes very close to normal with samples of 20. For a skewed population,
samples of size 30 or so are needed to correct the skew, especially when the skew is strong.
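The three claims above can be checked with a quick simulation. This is a sketch in Python, assuming a rectangular population of the values 1 through 13 (like card denominations, for which μ = 7 and σ² = 14); the variable names are illustrative:

```python
import random
import statistics

# Rectangular (uniform) population: the values 1 through 13,
# like the denominations in a card deck.
population = list(range(1, 14))
mu = statistics.mean(population)            # 7
sigma2 = statistics.pvariance(population)   # 14

# Draw many samples of size n (with replacement) and keep each sample mean.
random.seed(1)
n = 25
sample_means = [statistics.mean(random.choices(population, k=n))
                for _ in range(20000)]

# CLT predictions: mean of x-bar is mu; variance of x-bar is sigma2 / n.
print("mean of sample means:", round(statistics.mean(sample_means), 2))
print("variance of sample means:", round(statistics.pvariance(sample_means), 3))
print("sigma^2 / n:", sigma2 / n)
```

With 20,000 samples of size 25, the mean of the sample means comes out very close to 7 and their variance very close to 14/25 = 0.56, as the theorem predicts.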
Empirical sampling distribution:
1. Take a deck of cards.
2. Pick two cards at random.
3. Take the average of the DENOMINATIONS of the cards
   (A=1, J=11, Q=12, K=13).
(Class does this, then we create a distribution of the means on the whiteboard)
Notice that the means are starting to pile up more in the middle than on the sides, so the shape
is moving away from rectangular. The variance should be smaller than the population's. The mean
may or may not equal 7, but if we kept sampling for a very long time, it would.
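The class exercise can also be simulated. This is a sketch in Python (the seed and the number of "students" are arbitrary choices), which draws two cards without replacement many times and tallies the means like the whiteboard:

```python
import random
import statistics
from collections import Counter

# A deck as denominations: 1 through 13 (A=1, J=11, Q=12, K=13), four of each.
deck = [d for d in range(1, 14) for _ in range(4)]

# Each "student" draws two cards at random without replacement and averages them.
random.seed(302)
means = [sum(random.sample(deck, 2)) / 2 for _ in range(200)]

# A crude whiteboard tally of the sample means.
tally = Counter(means)
for value in sorted(tally):
    print(f"{value:4.1f}: {'x' * tally[value]}")

print("grand mean:", round(statistics.mean(means), 2))     # near 7
print("variance:", round(statistics.pvariance(means), 2))  # well under 14
```

The tally piles up in the middle, the grand mean lands near 7, and the variance of the means is well below the population variance of 14, just as the notes describe.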
Theoretical sampling distribution:
Can be computed by figuring out the probability of each possible sample mean. For example, there
are 6 ways to get a sample mean of 1 (two aces), 6 ways to get a sample mean of 13 (two kings),
and 102 ways to get a sample mean of 7, the true mean (there are 6 ways to get two 7s, 16 ways to
get a 6 and an 8, and so on).
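These counts can be verified by brute force. A sketch in Python that enumerates every unordered two-card hand from a 52-card deck and tallies the sample means:

```python
from itertools import combinations
from collections import Counter

# Deck as denominations: 1 through 13 (A=1, J=11, Q=12, K=13), four of each suit.
deck = [d for d in range(1, 14) for _ in range(4)]

# Enumerate every unordered two-card hand and tally the sample means.
counts = Counter((a + b) / 2 for a, b in combinations(deck, 2))

print("ways to get a mean of 1:", counts[1.0])    # two aces
print("ways to get a mean of 13:", counts[13.0])  # two kings
print("ways to get a mean of 7:", counts[7.0])    # the true mean
```

This confirms the figures in the notes: 6 ways to get a mean of 1, 6 ways to get a mean of 13, and 102 ways to get a mean of 7, out of C(52, 2) = 1,326 possible hands.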