More on hypothesis testing:
Why we need two hypotheses:
Because we need a sampling distribution to figure out probabilities.
The null hypothesis provides us with a sampling distribution.
NOTE: When researchers talk about hypotheses, they always mean the "real" research
hypotheses.
The null is part of the machinery that allows testing of research hypotheses.
Tails and significance levels:
Two tails for a nondirectional hypothesis
(and often for directional as well)
One tail for a directional hypothesis
(sometimes)
In practice, psychologists use two-tailed tests unless a difference in the other direction is
implausible or of no interest whatsoever.
Examples:
1. Can psychics guess the denomination of cards at a better-than-chance level?
2. Does prayer help speed recovery from serious illness?
If one-tailed, the probability for the significance level goes entirely into one tail.
If two-tailed, we divide this probability in half and put half in each tail.
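A quick numerical sketch of that split (using Python's standard-library NormalDist; the .05 level here is just an illustrative choice):

```python
from statistics import NormalDist

alpha = 0.05
z = NormalDist()  # standard normal distribution

# One-tailed: all of alpha sits in a single tail.
one_tailed_cutoff = z.inv_cdf(1 - alpha)      # ~1.64

# Two-tailed: alpha is split in half, alpha/2 in each tail.
two_tailed_cutoff = z.inv_cdf(1 - alpha / 2)  # ~1.96

print(f"one-tailed cutoff:  {one_tailed_cutoff:.2f}")
print(f"two-tailed cutoffs: +/-{two_tailed_cutoff:.2f}")
```

Note that the two-tailed cutoff sits farther out: splitting alpha across both tails makes each individual tail smaller.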
For the test of the "mystery deck," we used two tails, because stacking in either direction
(higher OR lower mean denomination from normal deck) was plausible.
Results of H-test on our "mystery deck," from which we took a sample of size 20 and calculated a mean of 7.9:
* Didn't find a difference at the specified level of significance (.05)
* So we retained the null hypothesis... we don't know if the deck is stacked or not. We
may still suspect that it is stacked, but we didn't have sufficient evidence to reject the
null.
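Here is a sketch of the arithmetic behind that decision, using the numbers from the example (the z-test form is an assumption based on the normal-curve approach used throughout these notes):

```python
from math import sqrt
from statistics import NormalDist

mu_null = 7.0        # mean denomination of a normal deck
sigma_sq = 14.0      # variance of a normal deck
n = 20
sample_mean = 7.9

se = sqrt(sigma_sq / n)                # standard error, ~.837
z_obs = (sample_mean - mu_null) / se   # ~1.08

alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed cutoff, ~1.96

# |1.08| < 1.96, so we retain the null at the .05 level.
print(f"z = {z_obs:.2f}, cutoff = +/-{z_crit:.2f}, reject: {abs(z_obs) > z_crit}")
```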
In this state of uncertainty, we could still make an estimate about what the "true" mean of
the mystery deck is.
Two Kinds of Estimation:
Point estimate: What's the best estimate of the mean denomination for the deck we tested? Answer: the sample mean, 7.9, NOT the mean of a regular deck (7).
Why? Even though we were unable to reject the null, the ONLY information we really
have about this deck is the sample mean. So we use that mean. We know it is most
likely wrong, but it's the guess most likely to minimize our error.
* For point estimate, use the sample mean
Interval estimate:
* For interval estimate, construct an interval with the SAMPLE MEAN at the middle.
What's a 90% confidence interval for the mean denomination of this deck?
We put the sample mean in the middle, and we want 45% of the probability on one side, and 45% of the probability on the other side.
NOTE: Confidence intervals are always "two-sided" in this way--we want to construct the interval around the sample mean, not all on one side or the other.
Because we are estimating based on a distribution of means (like our sample mean), we
can presume a normal distribution (a sample size of 20 is sufficient even though the deck
itself has a rectangular distribution).
Look up cutoffs: z of plus/minus 1.65 to have 45% of the probability on each side. We multiply this by the SE for the NORMAL deck, previously determined to be .837 (square root of 14/20, since sigma^2 = 14, n = 20).
1.65 X .837 = 1.38, so we add and subtract 1.38 from our sample mean to get confidence
limits of 6.52 and 9.28. We are 90% confident that the TRUE mean for this deck lies
somewhere in this interval. Notice that 7, the true mean for the NORMAL deck, is
included in the interval. This shows us that even with a significance level of .10, we
wouldn't be able to reject the null.
Repeat for 80% confidence interval, this time taking 40% on either side of the sample
mean. (Work through the math: z = 1.28, and 1.28 X .837 = 1.07; see the sketch below.)
This gives us an interval of 6.83 to 8.97. 7 is still included (would not be able to reject
even at a significance level of .20).
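Working through that math in one sketch (both intervals, same SE of .837):

```python
from math import sqrt
from statistics import NormalDist

sample_mean = 7.9
se = sqrt(14 / 20)   # SE for the normal deck, ~.837

for conf in (0.90, 0.80):
    # Half of the leftover probability (1 - conf) goes in each tail,
    # so we need the z that leaves conf/2 between the mean and each limit.
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    half_width = z * se
    print(f"{conf:.0%} CI: {sample_mean - half_width:.2f} "
          f"to {sample_mean + half_width:.2f} (z = {z:.2f})")

# 90% CI: 6.52 to 9.28 (z ~ 1.65)
# 80% CI: 6.83 to 8.97 (z ~ 1.28)
```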
However, we HAVE successfully captured the TRUE mean for the mystery deck, which
is 8.5... the deck really is stacked.
What's fishy about this?
We retained the null hypothesis, but we are estimating the mean as 7.9, which is
definitely not 7 (the mean under the null hypothesis). How can we have it both ways?
Because we are operating under conditions of uncertainty, we may not be able to reject
the null, but that doesn't mean we ACCEPT the null. The only real piece of evidence we
have suggests a mean higher than 7... so that's what we use (without believing that the
sample mean really IS the true mean, which it probably is not).
What else is fishy? The whole business of hypothesis testing presumes we have to use
the NULL to provide us with a mean and variance.... but in estimation we use the
SAMPLE MEAN... and use the variance of the NULL population. As is often the case
in inferential statistics, we need a number that we don't have (variance of the mystery
deck), so we use a number we do have.
Important assumption we are making: We are assuming that the variances of the two
decks do not differ. This assumption is important, and you will be seeing more of it in
future chapters...
Two kinds of error:
We just made an error in testing the stacked deck. We failed to detect a true difference.
The other kind of error is "finding" a difference that doesn't exist.
To help with grasping this slippery abstract business of hypothesis testing, let's move
away from cards and consider a situation where finding out the true state of affairs is
actually quite important. Consider testing for the presence of cancer. In this case,
consider a single person going in for a test.
There are two ways to be right, and two ways to be wrong.
Two ways to be right:
Correct detection of cancer for person with cancer.
Correct result of no cancer for healthy person.
Two ways to be wrong:
A false positive:
Test says healthy person has cancer
A false negative:
Test fails to detect cancer when it is there
The possibilities:

                                 Does person have cancer?
    Decision after the test  |       YES        |       NO
    -------------------------+------------------+----------------
    Has cancer               |     Correct!     |  Type I Error
    No cancer                |   Type II Error  |    Correct!
Applied to hypothesis testing:
Type I error: a false positive (a "false alarm"). p(Type I) = alpha, the significance level.
In hypothesis testing, a false positive means we reject the null even though the null is
correct. We have "found" an effect that doesn't exist.
Type II error: a false negative (a failure to detect what's there). p(Type II) = beta.
In hypothesis testing, a false negative means we fail to detect a real effect -- we don't
reject the null even when the research hypothesis is true.
A small simulation below illustrates why p(Type I) comes out equal to alpha.
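One way to see that p(Type I) really is alpha: simulate many studies in a world where the null is true and count the false alarms. This is a hypothetical simulation using the deck example's numbers and a normal approximation to the sampling distribution of the mean:

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(1)                            # reproducible sketch
mu_null, se = 7.0, sqrt(14 / 20)          # null mean and SE from the deck example
z_crit = NormalDist().inv_cdf(0.975)      # two-tailed cutoff at alpha = .05

trials = 100_000
false_alarms = 0
for _ in range(trials):
    m = random.gauss(mu_null, se)         # a sample mean when the null is TRUE
    if abs(m - mu_null) / se > z_crit:    # we (wrongly) reject
        false_alarms += 1

print(false_alarms / trials)              # hovers near .05 = alpha
```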
The Power of a test is its ability to detect something when it is REALLY there.
Statistical power:
Probability that a study will yield a significant result when the research hypothesis is true.
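A companion sketch for power, assuming (as above) that the mystery deck's true mean is 8.5 and borrowing the normal deck's variance of 14 -- the same equal-variance assumption flagged earlier:

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(2)                            # reproducible sketch
mu_null, mu_true, se = 7.0, 8.5, sqrt(14 / 20)
z_crit = NormalDist().inv_cdf(0.975)      # two-tailed cutoff at alpha = .05

trials = 100_000
rejections = 0
for _ in range(trials):
    m = random.gauss(mu_true, se)         # a sample mean when the deck IS stacked
    if abs(m - mu_null) / se > z_crit:    # we correctly reject the null
        rejections += 1

power = rejections / trials
print(power)                              # roughly .43 under these assumptions
# beta = p(Type II) = 1 - power, roughly .57 here
```

With power well below .5, the failure to detect the stacking in our study was actually the more likely outcome.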
Possibilities:

                                   Your research hypothesis is:
    Decision after the H-test  |       TRUE          |       FALSE
    ---------------------------+---------------------+-------------------
    Reject null                |    Correct! :)      |  Type I Error :(
    Retain null                |  Type II Error :(   |    Correct! :)