More on hypothesis testing:

Why we need two hypotheses:

Because we need a sampling distribution to figure out probabilities.

The null hypothesis provides us with a sampling distribution.

NOTE: When researchers talk about hypotheses, they always mean the "real" research hypotheses.

The null is part of the machinery that allows testing of research hypotheses.

Tails and significance levels:

Two tails for a nondirectional hypothesis

(and often for directional as well)

One tail for a directional hypothesis

(sometimes)

In practice, psychologists use two-tailed tests unless a difference in the other direction is implausible or of no interest whatsoever.

Examples:

1. Can psychics guess denomination of cards at better than chance level?

2. Does prayer help speed recovery from serious illness?

If one-tailed, the probability for the significance level is all in one tail

If two-tailed, need to divide this probability in half, and split between the two tails
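The split can be seen directly in the critical z values. A minimal sketch using Python's standard-library normal distribution (the alpha = .05 value is just an illustration):

```python
from statistics import NormalDist

alpha = 0.05
z = NormalDist()  # standard normal distribution

# One-tailed: all of alpha sits in one tail
z_one = z.inv_cdf(1 - alpha)          # about 1.645

# Two-tailed: alpha is split in half between the two tails
z_two = z.inv_cdf(1 - alpha / 2)      # about 1.96

print(f"one-tailed cutoff: {z_one:.3f}")
print(f"two-tailed cutoffs: plus/minus {z_two:.3f}")
```

Note that the two-tailed cutoff (1.96) is farther out than the one-tailed cutoff (1.645): splitting alpha makes each tail smaller, so a result must be more extreme to reach significance.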

For the test of the "mystery deck," we used two tails, because stacking in either direction (higher OR lower mean denomination from normal deck) was plausible.

Results of H-test on our "mystery deck," from which we took a sample of size 20 and calculated a mean of 7.9:

* Didn't find a difference at the specified level of significance (.05)

* So we retained the null hypothesis... we don't know if the deck is stacked or not. We may still suspect that it is stacked, but we didn't have sufficient evidence to reject the null.
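The test itself can be sketched in a few lines, using the numbers from the notes (null mean 7, sigma^2 = 14 for the normal deck, n = 20, sample mean 7.9):

```python
from statistics import NormalDist
from math import sqrt

mu0, sigma2, n = 7, 14, 20     # null mean and variance (normal deck), sample size
xbar = 7.9                     # sample mean from the mystery deck

se = sqrt(sigma2 / n)          # standard error, about .837
z = (xbar - mu0) / se          # about 1.08

# Two-tailed p-value: probability this extreme in EITHER tail
p = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.3f}, p = {p:.3f}")   # p comes out around .28, well above .05
```

Since p exceeds .05, we retain the null, exactly as above.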

In this state of uncertainty, we could still make an estimate about what the "true" mean of the mystery deck is.

Two Kinds of Estimation:

Point estimate: What's the best estimate of the mean denomination for the deck we tested? Answer: The sample mean: 7.9 NOT the mean of a regular deck (7)

Why? Even though we were unable to reject the null, the ONLY information we really have about this deck is the sample mean. So we use that mean. We know it is most likely wrong, but it's the guess most likely to minimize our error.

* For point estimate, use the sample mean

Interval estimate:

* For interval estimate, construct an interval with the SAMPLE MEAN at the middle.

What's a 90% confidence interval for the mean denomination of this deck?

We put the sample mean in the middle, and we want 45% of the probability on one side, and 45% of the probability on the other side.

NOTE: Confidence intervals are always "two-sided" in this way--we want to construct the interval around the sample mean, not all on one side or the other.

Because we are estimating based on a distribution of means (like our sample mean), we can presume a normal distribution (a sample size of 20 is sufficient even though the population distribution of denominations is rectangular).

Look up cutoffs: z of plus/minus 1.65 to have 45% of the probability on each side. We multiply this by the SE for the NORMAL deck, previously determined to be .837 (square root of 14/20, since sigma^2 = 14, n = 20).

1.65 X .837 = 1.38, so we add and subtract 1.38 from our sample mean to get confidence limits of 6.52 and 9.28. We are 90% confident that the TRUE mean for this deck lies somewhere in this interval. Notice that 7, the true mean for the NORMAL deck, is included in the interval. This shows us that even with a significance level of .10, we wouldn't be able to reject the null.

Repeat for 80% confidence interval, this time taking 40% on either side of the sample mean (cutoff is z of plus/minus 1.28; work through the math: 1.28 X .837 = 1.07). This gives us an interval of 6.83 to 8.97. 7 is still included (would not be able to reject even at significance level of .20).
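Both intervals follow the same recipe, so the computation can be sketched as one small function (same numbers as above: sample mean 7.9, SE from the normal deck's variance):

```python
from statistics import NormalDist
from math import sqrt

xbar = 7.9                     # sample mean (the middle of every interval)
se = sqrt(14 / 20)             # SE borrowed from the NORMAL deck, about .837

def confidence_interval(confidence):
    # Put confidence/2 of the probability on each side of the sample mean
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return xbar - z * se, xbar + z * se

lo90, hi90 = confidence_interval(0.90)   # about 6.52 to 9.28
lo80, hi80 = confidence_interval(0.80)   # about 6.83 to 8.97
print(f"90% CI: ({lo90:.2f}, {hi90:.2f})")
print(f"80% CI: ({lo80:.2f}, {hi80:.2f})")
```

Notice that lower confidence buys a narrower interval: the 80% interval is tighter than the 90% one, but we are less sure it captures the true mean.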

However, we HAVE successfully captured the TRUE mean for the mystery deck, which is 8.5... the deck really is stacked.

We retained the null hypothesis, yet we are estimating the mean as 7.9, which is definitely not 7 (the null-hypothesis value). How can we have it both ways? Because we are operating under conditions of uncertainty, we may not be able to reject the null, but that doesn't mean we ACCEPT the null. The only real piece of evidence we have suggests a mean higher than 7... so that's what we use (without believing that the sample mean really IS the true mean, which it probably is not).

What else is fishy? The whole business of hypothesis testing presumes we have to use the NULL to provide us with a mean and variance.... but in estimation we use the SAMPLE MEAN... and use the variance of the NULL population. As is often the case in inferential statistics, we need a number that we don't have (variance of the mystery deck), so we use a number we do have.

Important assumption we are making: We are assuming that the variances of the two decks do not differ. This assumption is important, and you will be seeing more of it in future chapters...

Two kinds of error:

We just made an error in testing the stacked deck: we failed to detect a true difference.

The other kind of error is "finding" a difference that doesn't exist.

To help with grasping this slippery abstract business of hypothesis testing, let's move away from cards and consider a situation where finding out the true state of affairs is actually quite important. Consider testing for the presence of cancer. In this case, consider a single person going in for a test.

There are two ways to be right, and two ways to be wrong.

Two ways to be right:

Correct detection of cancer for person with cancer.

Correct result of no cancer for healthy person.

Two ways to be wrong:

A false positive:

Test says healthy person has cancer

A false negative:

Test fails to detect cancer when it is there

The possibilities:

Does person have cancer?

Decision after the test:     YES               NO
  "Has cancer"               Correct!          Type I Error
  "No cancer"                Type II Error     Correct!

Applied to hypothesis testing:

Type I error: false positive p(Type I) = alpha

Type II error: false negative p(Type II) = beta

Probability of Type I error = significance level (alpha):

(False alarm)

In hypothesis testing, false positive means we reject the null even though the null is correct. We have "found" an effect that doesn't exist.

Type II: A false negative

(Failure to detect what's there)

In hypothesis testing, false negative means we fail to detect a real effect -- we don't reject the null even when the research hypothesis is true.

The Power of a test is its ability to detect something when it is REALLY there.

Statistical power:

Probability that a study will yield a significant result when the research hypothesis is true.
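For our mystery deck, the power can be computed directly, using the equal-variance assumption from above (sigma^2 = 14 for both decks) and the true mean of 8.5. A sketch with the two-tailed .05 test:

```python
from statistics import NormalDist
from math import sqrt

se = sqrt(14 / 20)                          # SE, about .837
z_crit = NormalDist().inv_cdf(0.975)        # two-tailed cutoff at alpha = .05

# Rejection region for the sample mean under the null (mu = 7)
upper = 7 + z_crit * se
lower = 7 - z_crit * se

# Power: probability the sample mean lands in the rejection region
# when the TRUE mean is 8.5 (assuming, as above, the same variance)
true_dist = NormalDist(mu=8.5, sigma=se)
power = true_dist.cdf(lower) + (1 - true_dist.cdf(upper))
beta = 1 - power
print(f"power approx {power:.2f}, beta approx {beta:.2f}")   # power about .43
```

So even though the deck really is stacked, a sample of 20 gives us well under a 50-50 chance of detecting it... which helps explain why our test failed to reject the null.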

Possibilities: