NOTE: These include both notes about what I said and what you saw on the overheads.
I've reorganized some of this material and dropped some of the "visual" explanation that
just confused people. Since this material is very difficult, I thought more elaborate notes
might help, so I've also added some explanation that was not actually in the lecture. If
you find discrepancies between these posted notes and the notes you took -- especially
the sequencing -- it is because I've made these changes.

Correlation

INTRO:

In chapter 3 we move from looking at one variable only to looking at two variables
measured for the same people or objects. So instead of just looking at height, for
example, and how it is distributed in this class, we are looking at how height might be
connected to age (as people grow up, they get taller) or infant nutrition (poor nutrition
can stunt growth).

We can look for associations between any two variables measured for the same object.
We can find out if the expected difficulty of a class is associated with the amount of
stress people feel about a class, as long as we have measured these two variables for the
same people. We can see if height is associated with how much stress people feel about
a statistics course as long as we have measured height and statistics stress for the same
people (which the class questionnaire did).

**Correlation:** A statistical association between scores on two or more variables (or)

A statistical technique for describing a relationship between two or more variables.

A correlation can be positive, negative, curvilinear, or 0

**Positive:** High scores on one variable go with high scores on the other, low go with low,
and moderate go with moderate

Example: Height and age in children

Example: Time since last meal and hunger

**Negative:** High scores on one variable go with low scores on the other, low go with high

Example: Height and age in adults

Example: Time spent partying and time spent studying

**Curvilinear:** The two variables are associated, but not in a straight linear fashion

Example: Sleep and alertness

Very little sleep: not alert

Right amount of sleep: alert

Too much sleep: less alert

**Zero correlation:** Example: Denomination of card and Suit (1,2,3,4). Knowing the value
of one tells you nothing (0) about what the value of the other might be.

**X and Y variables**

When there are two variables, the first is X and the second is Y. X goes on the horizontal
axis, Y on the vertical.

IF the two variables can be considered independent and dependent, then make the
independent variable X and the dependent variable Y

The value of a "dependent" variable is expected to DEPEND ON the value of the
"independent" variable.

However, lots of times we just have two variables, which can't be designated as
independent and dependent. We call one X and one Y **arbitrarily**.

Positive and Negative correlations can also vary in **Strength:**

A correlation can be small, medium, or large.

If there is a small positive correlation, there is a SLIGHT tendency for high values of X
to go with high values of Y, and low with low

If there is a large positive correlation, there is a STRONG tendency for high values of X
to go with high values of Y

*How can we tell if two variables are correlated?*

1. Draw a picture, called a **scatter diagram** or a **scatterplot**. Dots are scores in two-dimensional space (x, y)

2. Calculate a number: **Pearson's r**

Correlation coefficient, also called Pearson's **r**:

Average of cross products of Z scores for two variables

Formula: **r = E(Z_{x}Z_{y}) / N**

Note: I've had to resort to using a capital E for summation because the Greek symbols are disappearing when I post on the web, sorry.

**Z_{x}Z_{y}** = cross products of Zs for X & Y

**E / N =** take average (sum, then divide by N)

Why Z scores?

Reason 1. To put the scores on the same footing--standardized distance from the mean. Since raw scores come from two variables, they may be measured on very different scales (for example, inches vs. 10-point scale)

Reason 2. To give the scores a sign, which makes the result of the calculation negative or
positive.

Steps for calculating the correlation coefficient:

1. Find **Z** scores for X & Y (Z_{x} scores & Z_{y} scores)

2. Compute cross products of Z_{x} and Z_{y}

3. Sum crossproducts **E**

4. Divide by **N** (number of objects/people involved)

(NOTE: Steps 3 & 4 = take average)

(Mnemonic: **Z** scores, **Cross** products, **Average** -- "ZCA")
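The four steps can be sketched in Python with a small made-up data set (the heights and stress scores below are hypothetical, just to make the steps concrete):

```python
from statistics import mean, pstdev  # pstdev = population SD (divides by N)

x = [60, 62, 65, 68, 70]  # e.g. heights in inches (made-up numbers)
y = [8, 5, 6, 4, 3]       # e.g. stats stress on a 10-point scale

n = len(x)

# Step 1: change raw scores to Z scores
zx = [(xi - mean(x)) / pstdev(x) for xi in x]
zy = [(yi - mean(y)) / pstdev(y) for yi in y]

# Step 2: cross products of Zx and Zy
cross = [a * b for a, b in zip(zx, zy)]

# Steps 3 & 4: sum the cross products, divide by N (i.e., take the average)
r = sum(cross) / n

print(round(r, 4))  # a fairly strong negative correlation for these numbers
```

Notice the sign comes out negative here because high Z_{x} scores pair with low (negative) Z_{y} scores, so the cross products are mostly negative.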

**Strength:** How closely the variables track one another. *r* ranges from -1 to +1

**Correlation & Causality:**

The fact that two variables are correlated does not mean they are causally related. They
might be, and they might not be.

Here are the possibilities:

1. X causes Y (in part, directly or indirectly)

2. Y causes X (in part, directly or indirectly)

3. Factor Z has an effect on both X & Y

4. 1&2

5. 1&2&3

6. Random correlation. This means that two variables are correlated (a real correlation)
but there is no plausible reason for them to be. It's just an odd fact. Example: Number of
letters in a state name is probably correlated with SOME other variable for states, but it's
highly unlikely there is any causal chain connecting the two--the association would be a
coincidence.

**Prediction using Z-scores:**

INTRO.

Problem: Making a good guess. The guess is about an INDIVIDUAL's score, based on what we
know about POPULATION parameters. This is like Z-score: We know about the distribution,
and want to locate an INDIVIDUAL score within that distribution. Now presume that we know a
person's score and its location in the distribution of ONE variable, and want to guess what their
score will be on another variable. If the two variables are correlated, we can use this information
to improve our guess.

You all filled out a questionnaire at the beginning of class, and I've been calculating various
parameters for this particular population. I know the mean for height, I know the mean for stress
about stats on a 10 point scale. Now, I DON'T know each person's score for stress about stats,
because there are no names on the questionnaires.

So suppose a student comes to my office at the end of the first day, and I have calculated means and
standard deviations, and I want to guess how stressed he or she is about statistics. What should I
guess? Answer: guess the mean. That will minimize my expected error.
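That claim -- guessing the mean minimizes the expected (squared) error -- can be checked numerically; the stress scores below are made up for illustration:

```python
from statistics import mean

# Hypothetical stress scores for a class of seven students (10-point scale)
stress = [3, 5, 6, 6, 7, 8, 9]
m = mean(stress)  # the class mean

def mse(guess):
    # average squared error if we guess `guess` for every student
    return mean((s - guess) ** 2 for s in stress)

# Guessing the mean gives a smaller average squared error than any
# other constant guess:
print(round(mse(m), 3), round(mse(5), 3), round(mse(7), 3))
```

The mean squared error of guessing the mean is just the variance of the scores; any other constant guess adds the squared distance between that guess and the mean.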

Now suppose I have also calculated correlations, and I know height and stats stress have a
negative correlation (I ran correlations on the data to figure out what is correlated with what).

So now when a student comes to my office, I can estimate height (since that is visible). How
should I use this information to make a better guess about how stressed they are about stats?

I can guess they will be above the mean or below the mean on stress, based on whether they are
below the mean or above the mean on height (since the correlation is negative). So if someone is
quite short, I will guess that they are more stressed than average about stats.

Prediction using Z-scores formalizes the process of using correlation information to improve our
guesses.

**Prediction using Z-scores**

Formula: Predicted **Z_{y} = (beta)(Z_{x})**

California Z_{x} = .76

Oregon Z_{x} = -.41

Beta = Correlation Coefficient = .5785

We want to find the PREDICTED **Z_{y}**

Have students calculate using the formula.
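Plugging the numbers above into the formula (a quick check of the in-class calculation):

```python
# beta (equal to Pearson's r when there is a single predictor) and the
# two states' Z scores on X, from the example above
beta = 0.5785
z_california = 0.76
z_oregon = -0.41

pred_zy_california = beta * z_california  # 0.5785 * 0.76
pred_zy_oregon = beta * z_oregon          # 0.5785 * -0.41

print(round(pred_zy_california, 2), round(pred_zy_oregon, 2))  # → 0.44 -0.24
```

Note that both predicted Z_{y} values are closer to 0 (the mean) than the Z_{x} values they came from -- a preview of regression to the mean, below.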

**Steps when predicting Y from single X:**

1. Change X scores to Zx scores

2. Multiply Zx by beta to get predicted Zy

3. Transform predicted Zy scores to predicted "raw" Y scores.
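The three steps can be sketched as a small function; the means, SDs, and beta below are all hypothetical numbers chosen just for illustration:

```python
# Suppose X (height, inches) has mean 67 and SD 4; Y (stress, 10-point
# scale) has mean 5.5 and SD 2; and beta = r = -0.5 (all made up).
mean_x, sd_x = 67.0, 4.0
mean_y, sd_y = 5.5, 2.0
beta = -0.5

def predict_y(x):
    zx = (x - mean_x) / sd_x        # Step 1: raw X -> Zx
    pred_zy = beta * zx             # Step 2: multiply by beta -> predicted Zy
    return mean_y + pred_zy * sd_y  # Step 3: predicted Zy -> predicted raw Y

print(predict_y(75))  # someone 2 SDs above the mean height → 3.5
print(predict_y(67))  # someone AT the mean height → 5.5 (predict the mean)
```

With a negative beta, someone above the mean on X gets a predicted Y below the mean of Y, and someone exactly at the mean of X is predicted to be at the mean of Y.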

The regression line (line drawn through the predicted scores--all predicted scores will fall on a
single line in the graph) is like a two-dimensional mean. It minimizes the squared error in our
guesses.

**Regression to the mean**

I showed you a visual illustration of prediction using a single X to predict a single Y. Not very successful!

Let's try the algebra:

Formula: Predicted **Z_{y} = (beta)(Z_{x})**

UNLESS the correlation between X and Y is -1 or +1 (perfect), the predicted Y scores
will always be** less variable** than the X scores (and vice versa), because you are
multiplying Z_{x} by a fraction.

Thus the predicted Y scores will be closer to the mean of Y than the actual scores.

**The only place there is no regression is AT the mean **(because you predict the mean Y
from the mean X).

The more extreme the values of X [the farther from the mean] the stronger the regression
of the predicted Y to mean Y

Examples:

1. If we take the tallest 5 men in the class, and use a correlation <1 to predict the height of their fathers, we will predict that the fathers will, on average, be shorter than the sons.

2. If we take the men in the class with the 5 tallest fathers, and use a correlation <1 to
predict the height of their sons, we will predict that the sons will, on average, be shorter
than the fathers.

So when you SELECT a set of extreme scores on one variable, the mean score on an
(imperfectly) correlated variable will be less extreme. That's regression to the mean.
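A small numerical sketch of the same point; the beta value and the Z scores are hypothetical:

```python
from statistics import pstdev

# With |beta| < 1, predicted Zy scores are pulled toward 0 (the mean),
# and the farther a Zx score is from 0, the bigger the pull.
beta = 0.6  # an imperfect (hypothetical) correlation
zx_scores = [-2.0, -1.0, 0.0, 1.0, 2.0]
pred_zy = [beta * z for z in zx_scores]

print(pred_zy)  # each prediction is 0.6 of the way out, not all the way
print(pstdev(pred_zy) < pstdev(zx_scores))  # predictions are less variable
```

The score at the mean (Z_{x} = 0) is the only one that does not regress: its prediction is also exactly at the mean.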

Formula: Predicted **Z_{y} = (beta)(Z_{x})**

We change the X scores to Z_{x} scores, multiply by beta [the standardized regression coefficient,
which turns out to be the same value as Pearson's *r* when you have a SINGLE predictor variable]
to get the predicted Z_{y} scores. Then we go from the Z_{y} scores back to the predicted raw Y
scores.

**Proportion of variance accounted for:**

The amount of improvement in your "guess" about the Y value based on knowing the X value. This is the improvement you get over just guessing the mean of Y for every score.

If you guess the mean for each score, the variance of your guesses will be zero (they don't vary at
all, right?).

Total variance for Y: the average squared deviation from the mean of Y.

To find the proportion of variance accounted for, **square the correlation coefficient. **

**Why?** Variance is figured in the unit of squared deviations. We square *r* to get it back into the
right kind of unit.

If correlation is **perfect** (+1 or -1), ALL the variance is accounted for (our guesses will be perfect).
If not, some of the variance is accounted for -- the proportion equal to the squared correlation.
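A quick computation using the correlation value from the prediction example earlier in these notes:

```python
# r from the California/Oregon example above
r = 0.5785

prop_accounted = r ** 2        # proportion of Y's variance accounted for
prop_unaccounted = 1 - r ** 2  # proportion left unexplained

print(round(prop_accounted, 4))  # → 0.3347, about a third of the variance
```

So even a moderately large correlation like .58 accounts for only about a third of the variance in Y; the two proportions always sum to 1.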

**Multiple regression.**

Multiple regression is using scores on **multiple** X (predictor) variables to predict a score on a
single Y (criterion) variable.

So it is a generalization of the two variable prediction or regression we've been discussing.

Examples:

1. Predict how many salmon will return to spawn (Y) based on pollution of rivers (X1), intensity of fishing (X2), and average ocean temperature (X3).

2. Predict how well a student will do on the final exam based on their scores on mean quiz score (X1) and midterm (X2).

3. Predict miles per gallon for a road trip based on efficiency of the car (X1), proportion of
city to highway driving (X2), and proportion of mountain driving to plains driving (X3).

Formula:

Predicted **Z_{y} = (Beta_{1})(Z_{x1}) + (Beta_{2})(Z_{x2}) + ...** for as many X variables as you have

NOTE: The beta weights for multiple regression will typically NOT be equal to the correlation
coefficient for that variable. The Xs share variance when they are correlated to each other.
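For the special case of exactly two standardized predictors, the beta weights can be written directly in terms of the zero-order correlations. The sketch below uses hypothetical correlation values to show the point in the NOTE above: when the predictors are correlated with each other, the betas come out smaller than the corresponding r's.

```python
# Hypothetical zero-order correlations
r_y1 = 0.50  # correlation of Y with X1
r_y2 = 0.40  # correlation of Y with X2
r_12 = 0.30  # correlation between the two predictors

# Standard two-predictor solution for standardized beta weights
beta1 = (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2)
beta2 = (r_y2 - r_y1 * r_12) / (1 - r_12 ** 2)

def predict_zy(zx1, zx2):
    # the multiple-regression prediction formula from above
    return beta1 * zx1 + beta2 * zx2

print(round(beta1, 3), round(beta2, 3))  # → 0.418 0.275
```

Each beta is smaller than its zero-order correlation (0.418 < 0.50, 0.275 < 0.40) because the predictors share variance; if r_{12} were 0, each beta would equal its r exactly.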