Chapters 6-7 notes:
1. Probability as Relative Frequency
Probability (p) can be understood as relative frequency.
The probability of getting a particular outcome, when choosing at random, is the same as the relative frequency of that outcome in a distribution of possible outcomes.
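A minimal sketch of this idea, assuming a small made-up population of ten scores (the values and the helper name `relative_frequency` are hypothetical, chosen only for illustration):

```python
from collections import Counter
from fractions import Fraction

# Hypothetical population of 10 scores (made up for illustration).
scores = [1, 2, 2, 3, 3, 3, 3, 4, 4, 5]

def relative_frequency(values, outcome):
    """Proportion of values equal to outcome, as an exact fraction."""
    counts = Counter(values)
    return Fraction(counts[outcome], len(values))

# Probability of drawing a 3 at random = relative frequency of 3.
p_three = relative_frequency(scores, 3)
print(p_three)         # 2/5
print(float(p_three))  # 0.4
```

Four of the ten scores are 3s, so the chance of picking a 3 at random is 4/10 = .40.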
2. Samples versus Population
1. We want to know what is true about a population of scores, but
2. We don't have, and often can't get, all the scores.
1. Take a sample of scores.
2. Calculate the statistics for the sample
3. Use sample statistics to make a guess about the population parameters.
Good Sampling Strategy:
Mean for population is a PARAMETER
Mean for a sample is a SAMPLE STATISTIC
3. Normal Curve and Probability as Area
The normal curve is a good model of how certain variables are distributed (chest size of Scottish men, errors in locating stars).
If your population of scores or statistics is normally distributed, you can find probabilities of scores in a certain range by calculating the percentage area under the curve.
0. Change X value(s) to Z-score(s)
1. Sketch distribution, mark mean and sd
2. Mark location of your score(s) on the distribution and draw vertical line(s).
3. Shade the area. Guess the proportion.
4. Consult the unit normal table. Add and subtract if needed to find the exact area.
Note: Making a guess first helps you catch a mistake, for example if you look in the wrong column of the table.
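The steps above can also be checked by computer. This is a rough sketch, not a replacement for the table: it uses Python's standard `statistics.NormalDist` in place of the unit normal table, and the mean of 100 and sd of 15 are made-up values for illustration.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Step 0: change an X value to a z-score."""
    return (x - mu) / sigma

def proportion_between(lo, hi, mu, sigma):
    """Area under the normal curve between two X values."""
    unit_normal = NormalDist()  # mean 0, sd 1 (plays the role of the table)
    z_lo = z_score(lo, mu, sigma)
    z_hi = z_score(hi, mu, sigma)
    # Area to the left of the upper score minus area to the left of the lower.
    return unit_normal.cdf(z_hi) - unit_normal.cdf(z_lo)

# Hypothetical example: mean 100, sd 15, scores between 85 and 115.
area = proportion_between(85, 115, mu=100, sigma=15)
print(round(area, 4))  # 0.6827
```

The two scores sit at z = -1 and z = +1, so a sketch-and-guess answer of "about two thirds" agrees with the computed .6827.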
The normal distribution equation (see your book, p. 138). Do not worry about understanding or memorizing this equation! I just want you to see that it is both (1) the algebraic equivalent of the curve and (2) the equation used to figure out the numbers in the unit normal table.
In this equation, e and pi are mathematical constants; mu (the mean) and sigma-squared (the variance) are the two parameters. The formula gives the relative frequency (y) for each x score (can you find the x in the equation?).
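For the curious, here is the standard normal density formula written as a small function. It is only a sketch to show where e, pi, mu, and sigma-squared enter; the function name `normal_density` is my own label, not something from the book.

```python
import math

def normal_density(x, mu, sigma):
    """Height (y) of the normal curve at score x."""
    variance = sigma ** 2
    # 1 / sqrt(2 * pi * sigma^2): scales the curve so the total area is 1.
    coefficient = 1.0 / math.sqrt(2 * math.pi * variance)
    # -(x - mu)^2 / (2 * sigma^2): the power that e is raised to.
    exponent = -((x - mu) ** 2) / (2 * variance)
    return coefficient * math.exp(exponent)

# Height of the unit normal curve (mu = 0, sigma = 1) at its mean.
print(round(normal_density(0, 0, 1), 4))  # 0.3989
```

The x appears only in the exponent, as a squared distance from mu, which is why the curve is symmetric around the mean.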
4. Two routes to approximately normal distributions:
1. Binomial distribution (coins)
2. Error law (astronomer's)
Exercise: Determine the distribution of the variable "number of heads" for tossing the coin once (50/50 for 0 or 1 head), for tossing the coin twice (25/50/25 for 0, 1, or 2 heads), and so on. How do you figure this out? You enumerate the outcomes (for 2 coin tosses, they are 00, 01, 10, 11), and then see what proportion have what number of heads. For 2 coin tosses, 1 of the 4 outcomes has 0 heads (probability 1/4 = .25), 1 of the 4 outcomes has 2 heads (.25), and TWO of the 4 outcomes have 1 head (probability 2/4 = .5). If you've done it correctly, the probabilities should add up to 1.00 (100%). For every problem like this, in which there are two distinct outcomes in each trial and you have multiple trials, the distribution is a binomial distribution.
Definition: Distribution of "successes" in a fixed number of trials, when the probability of success on each trial is constant.
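The enumeration in the exercise can be sketched in a few lines, assuming a fair coin (0 = tail, 1 = head). The function name `heads_distribution` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def heads_distribution(n_tosses):
    """Enumerate every outcome of n fair coin tosses and return
    the probability of each possible number of heads."""
    outcomes = list(product([0, 1], repeat=n_tosses))  # e.g. (0, 0), (0, 1), ...
    counts = Counter(sum(outcome) for outcome in outcomes)
    total = len(outcomes)
    return {heads: Fraction(counts[heads], total)
            for heads in range(n_tosses + 1)}

for n in (1, 2, 3):
    dist = heads_distribution(n)
    print(n, {heads: str(p) for heads, p in dist.items()})
```

For 2 tosses this gives 1/4, 1/2, 1/4 for 0, 1, and 2 heads, matching the 25/50/25 worked out by hand.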
The limit of the binomial is the normal, as the number of trials goes to infinity. Discovery of the normal distribution allowed people to calculate proportions for a large number of trials without having to go through the exhaustive procedure of enumerating every outcome, then counting up which ones gave the same number of heads, etc. The number of distinct outcomes for a given number (n) of trials = 2^n, so for tossing 3 coins there are 2^3 = 8 outcomes, for tossing 4 coins there are 2^4 = 16 outcomes, and so on. For three coins, for example, the possible outcomes are (with 0=tail and 1=head): 000 001 010 100 011 101 110 111. Once you get up to tossing 10 coins, you are up to how many distinct outcomes? 2^10 = 1024. You can see why the normal distribution (and the ability to construct and use a unit normal table) became very attractive.
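To see how close the normal curve already is at 10 tosses, here is a rough comparison, assuming a fair coin. Two things here are my additions, not from the notes: the exact binomial probability is computed with the counting formula (`math.comb`) rather than by listing outcomes, and the normal area is taken from k - 0.5 to k + 0.5 (a continuity correction).

```python
from math import comb, sqrt
from statistics import NormalDist

def binomial_prob(k, n, p=0.5):
    """Exact probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def normal_approx_prob(k, n, p=0.5):
    """Normal-curve area standing in for P(k successes).
    Assumption: uses the band k - 0.5 to k + 0.5 (continuity correction)."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    curve = NormalDist(mu, sigma)
    return curve.cdf(k + 0.5) - curve.cdf(k - 0.5)

n = 10
print(2 ** n)  # 1024 distinct outcomes
for k in range(n + 1):
    print(k, round(binomial_prob(k, n), 4), round(normal_approx_prob(k, n), 4))
```

The two columns agree to about two decimal places, without enumerating any of the 1024 outcomes.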
How does this relate to statistics?
Many variables in the world we are interested in investigating are distributed approximately normally, via the binomial route. An outcome that is influenced by the contribution of many factors which can go one way or another (heads or tails) is often distributed normally.
The distribution of the statistic x-bar approaches normal as n goes to infinity (central limit theorem), via the error law route (think of the astronomers trying to figure out where the star REALLY is).
A little History:
(Note: this is not something you need to memorize or will be tested on; it is to give you some interesting background that I hope will help you distinguish between two distinct processes that give rise to the normal distribution. This is something NONE of my stats teachers, and I've had 5, ever explained clearly to me, so I had them confused in my mind for years. The confusion has also contributed to pernicious social ideas about difference from the mean being bad: "abnormal.")
The NORMAL DISTRIBUTION was one of the first distributions studied in statistics, and is still the most important for inferential statistics. It was first discovered by Abraham de Moivre, who published it in a 1733 paper as the limit of the binomial distribution as the number of trials goes to infinity. During most of the 19th century, the normal distribution was known as the astronomer's error law, or error curve, because of the realization that it describes the distribution of errors in astronomical observations. Once they realized that the observations were normally distributed, it was clear that the mean was the TRUE position of the planet.
Later the error curve was adopted by social scientists who wanted to study mass phenomena without having detailed knowledge of the constituent individuals. However, the normal distribution of the variables they were studying was caused not by "error" but via the binomial route of many factors contributing in one direction or another.
The Belgian Adolphe Quetelet went to Paris in 1823 to learn astronomy, but was introduced to mathematical probability and was "infected" by the belief in its universal applicability. The new social science of statistics became for him a branch of "social physics," patterned closely on celestial physics. The error law finally found its place in 1844 as the formula governing deviations from an idealized "average man." His idea: HUMAN VARIABILITY IS FUNDAMENTALLY ERROR (the average is also what is best, most perfect; think of our duck as the perfect bird). People eventually dropped this idea in the literal sense. It was an inappropriate generalization of the use of the error law in astronomy, and it confuses the two routes to the normal distribution.
Instead, in biological and social sciences, variation is genuine, important, meaningful.
Back to inferential statistics:
SAMPLING ERROR is TRUE ERROR, and here the theory of measurement error, and the application of the normal distribution as an "error curve," applies. This distribution is often called Gaussian after Carl Friedrich Gauss (early 19th century). At the very end of the 19th century and the beginning of the 20th, the terms NORMAL LAW and NORMAL DISTRIBUTION came into use (Karl Pearson, another giant in statistics, used the term in 1894). It is also called the bell curve.
5. The central limit theorem:
CLT: For any population with mean mu and standard deviation sigma, the distribution of sample means for sample size n will approach a normal distribution with a mean of mu and a standard deviation of sigma divided by the square root of n as n approaches infinity (see p. 163 G&W).
In practice, the distribution of sample means (x-bar) will be almost perfectly normal if
1. The underlying population of scores (X) is normally distributed ***OR****
2. The sample size (n) is "relatively large." Large means 30 or larger.
Sampling distribution of x-bar: Distribution of all possible sample means for samples of size n.
When n is small, sampling distribution looks like distribution of X (raw scores).
When n is large, sampling distribution looks normal. In between small and large, looks in-between! (As shown in pictures during class).
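You can watch this happen with a small simulation. This is only a sketch under stated assumptions: the skewed population below is made up, samples are drawn with replacement, and 5000 samples stand in for "all possible samples."

```python
import random
from statistics import mean, pstdev

def simulate_sample_means(population, n, n_samples, seed=0):
    """Draw n_samples random samples of size n (with replacement)
    and return the list of their means (x-bars)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [mean(rng.choices(population, k=n)) for _ in range(n_samples)]

# Hypothetical, deliberately skewed population of 100 scores.
population = [1] * 50 + [2] * 25 + [3] * 15 + [10] * 10
mu = mean(population)
sigma = pstdev(population)

for n in (2, 30):
    means = simulate_sample_means(population, n, n_samples=5000)
    # Columns: n, average x-bar, sd of the x-bars, sigma / sqrt(n)
    print(n, round(mean(means), 2), round(pstdev(means), 2),
          round(sigma / n ** 0.5, 2))
```

For both sample sizes the average x-bar lands near mu and the sd of the x-bars lands near sigma divided by the square root of n. A histogram of `means` (not drawn here) is what shows the shape changing from skewed at n = 2 toward normal at n = 30.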
6. Measurement and standard error
X-bar as measurement: x-bar = mu + measurement error.
Error might be positive or negative. Average x-bar will be mu (errors cancel out).
The typical size of the error (standard error) = sigma divided by the square root of n (as indicated by the central limit theorem, CLT). Standard error: symbol is sigma with x-bar subscript, to show that this is the "standard deviation" of sample means, thinking of sample means as measurements that have some error in them. The larger the n, the smaller the standard error (Law of large numbers). Think of getting a better telescope, or getting closer to the dart board.
The larger the n, the more RELIABLE the measurement (more consistent, more accurate)
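A quick illustration of how the standard error shrinks as n grows, assuming a made-up population sd of 15 (the function name `standard_error` is just a label for sigma divided by the square root of n):

```python
from math import sqrt

def standard_error(sigma, n):
    """Standard error of x-bar: sigma divided by the square root of n."""
    return sigma / sqrt(n)

sigma = 15  # hypothetical population standard deviation
for n in (1, 4, 25, 100):
    print(n, standard_error(sigma, n))
# 1 15.0
# 4 7.5
# 25 3.0
# 100 1.5
```

Notice that to cut the standard error in half you need four times as many scores, not twice as many.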