Chapters 1-4 overheads from lecture.
Note: When I translate into HTML, I lose Greek symbols. So I'm using M for mu, SD for the standard deviation (sigma), x-bar, and other words such as sigma instead of the symbols.
Techniques to summarize and organize a whole set of observations (data). This area of statistics is concerned with describing the data you actually have.
Distribution of hours of sleep for this class.
Census data about how the population of the United States is distributed across regions, states, urban, suburban, rural, how much money people in different regions make, etc., etc.
What percentage of people voting chose McCain over Bush in primaries in the different states
Techniques that use data from samples to generalize about a population. Inferential statistics allow us to make conclusions that go beyond the data we actually have.
Use sleep data from your little group to infer how much sleep students in the whole class got last night.
From a sample of whole U.S. population that fills out the census long form, infer the answers to questions such as "What percentage of U.S. households have access to the Web at home?"
From a sample of people interviewed on the phone, infer whether voters prefer Bush or Gore for president
Frequency distribution table:
|Score (X)|Frequency (f)|
|(Hours)|(Number of people with this score)|
To make a histogram from this data, put the scores on the X-axis (horizontal) and the frequency (number of people/events/objects with this score) on the Y-axis (vertical). If instead of counting up frequencies you made a mark for each person, when you turned the frequency table on its side you would see a rough histogram.
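The frequency-table-to-histogram idea above can be sketched in a few lines of Python. The sleep-hours data here are made up for illustration; `collections.Counter` plays the role of the frequency table, and printing one mark per person gives the rough sideways histogram described above.

```python
# Build a frequency table and a rough text histogram from
# hypothetical sleep-hours data.
from collections import Counter

hours = [7, 6, 8, 7, 5, 7, 6, 8, 9, 7]  # made-up scores

freq = Counter(hours)  # maps each score X to its frequency f

# The frequency table, lowest score first
for score in sorted(freq):
    print(f"{score} hours: {freq[score]} people")

# One mark per person: turning the table on its side gives a rough histogram
for score in sorted(freq):
    print(f"{score:>2} | {'*' * freq[score]}")
```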
Variables and Values:
A Variable is a characteristic that can have different values--that varies.
A Value is a particular score for a variable. Also called an observation or measurement.
Amount of sleep is a variable. 3, 5, 9 hours of sleep are different values for that variable.
Sex is a variable. Male or female are different values for that variable
A frequency distribution shows how frequent the different values are for a particular variable
Variables can be continuous or discrete.
Examples of continuous variables: Stress level, Amount of sleep
Examples of discrete variables: Country, Number of siblings
Question: Are these variables discrete or continuous?
State of birth _____________
Difficulty of college classes ____________
Popularity of math as a subject ____________
There are at least 4 distinct measurement scales:
Nominal (categorical, think names)
Ordinal (think ordinal numbers: 1st, 2nd, etc.)
Interval (think equal intervals)
Ratio (needs an absolute zero)
NOTE: Both discrete and continuous variables can be measured on multiple scales. Many psychological variables are measured on scales that are treated as interval but in actuality are better than ordinal but not perfectly interval (that is, the intervals may only be roughly equal).
When we do experiments, we call variables independent or dependent based on their role in the experimental design. An experiment manipulates (controls the values of) one or more variables and then measures others, which are free to vary.
The manipulated variables are called independent variables (IVs).
The variables that are free to vary are called dependent variables (DVs).
NOTE: Independent and dependent are not inherent qualities of the variables. Instead they depend on the research design. A variable can be the IV in one experiment and the DV in another.
Displaying, describing, summarizing distributions:
Two ways to display the information contained in a distribution: frequency table and histogram.
Distributions are also described by words and numbers.
WORDS for SHAPE:
Modes: Unimodal, bimodal, multimodal. (Rectangular has no mode.) The two modes in a bimodal distribution may be equal or unequal (major and minor).
Normal distribution (bell curve). This is a mathematical object, defined by an equation. It shows relative frequency (no numbers on the y axis). Many variables are distributed in a way that is approximately normal.
Distributions may be symmetric (mirror image, like the normal curve) or asymmetric (skewed).
Skew may be positive (skewed to the right) or negative (skewed to the left).
The direction of the skew is the side where the skinny tail of the distribution is.
Population: Any complete set of observations or measurements; the entire set of individuals, objects, or events a researcher wants to study.
Examples: 1) all UO undergrads 2) all Oregonians
Sample: A subset of observations from a population, used to infer what is true of the population
Examples: 1) the students in this class 2) the Oregonians who got the long census form
Parameters are numbers used to describe the distribution of a population.
The parameters used most in inferential stats are Mean, Variance, and Standard Deviation.
Mean is a measure of central tendency;
Variance and standard deviation are measures of variability
Chapter 3. Central tendency: Mean, Median, Mode
Mean: The average-- sum of all scores divided by N.
Greek letter "mu" is used for the population mean; the symbol "x-bar" for the sample mean.
In symbols: M = [sum of X] / N. For samples, x-bar = [sum of x] / n.
The mean M equals the sum [capital sigma, which looks like a big E] of all scores X, divided by the number of scores N.
Median: The score of the middle person (odd N), or the midpoint between the scores of the two middle people (even N). N = the number of people/objects in the population. The median divides N (the number of scores) in half. It does NOT divide the x-axis (the range of values) in half (the most common misconception). When finding the median of a distribution, you should be dividing the area of the distribution in half, not the number line.
How mean, median and mode are related:
In a unimodal symmetric distribution, Mean = Median = Mode
In a skewed distribution, they are not equal: What gets skewed, mainly, is the MEAN, and it is skewed toward the tail, away from the main bulk of the distribution.
The median is more resistant to skew, and the mode is not affected.
WHEN TO USE the different measures:
Use Mode for nominal data.
Use Median to describe seriously skewed distributions (like income, house prices).
Most inferential tests use the mean; some use the median. None use the mode.
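The relationship between mean, median, and mode in a skewed distribution can be checked with Python's `statistics` module. The scores below are a made-up, positively skewed (income-like) sample chosen to show the mean being pulled toward the skinny right tail.

```python
# Sketch: skew pulls the mean toward the tail.
import statistics

scores = [1, 2, 2, 2, 3, 3, 4, 5, 9, 20]  # long right (positive) tail

mode = statistics.mode(scores)      # most frequent score
median = statistics.median(scores)  # midpoint of the two middle scores
mean = statistics.mean(scores)      # pulled toward the tail

print(mode, median, mean)  # mode <= median <= mean for this sample
```

For this sample the mode is 2, the median is 3, and the mean is 5.1: the mean is farthest out toward the tail, as described above.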
Chapter 4. Variability
Range: Distance between the largest score (Max) and the smallest score (Min). For whole numbers, the formula is Max - Min + 1 (plus one so that the Min is counted too).
Example: 1, 3, 8. Range is 8-1 +1 = 8 (1, 2, 3, 4, 5, 6, 7, 8)
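The inclusive whole-number range formula can be verified directly; this tiny sketch uses the example scores 1, 3, 8 from above.

```python
# Whole-number (inclusive) range: Max - Min + 1
scores = [1, 3, 8]  # example scores from above

inclusive_range = max(scores) - min(scores) + 1  # 8 - 1 + 1 = 8
print(inclusive_range)

# Sanity check: it counts every whole number from Min through Max
print(len(range(min(scores), max(scores) + 1)))
```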
Population Variance and Standard Deviation
Variance (sigma^2):
Mean squared deviation from the mean [MS].
In symbols: sigma^2 = [sum of (X - M)^2] / N
Four steps for variance:
1. Calculate the deviation scores (X - M)
2. Square each deviation score: (X - M)^2
3. Sum the squares (giving SS, the sum of squares)
4. Divide by N for the mean square (MS)
Example: Xs (scores) are 1, 3, 4, 4, 4, 6, 6
N (number of scores) = 7, Sum of X = 28
M = [sum of X] /N = 28/7 = 4
|1 - 4 = -3|(-3)^2 = 9|
|3 - 4 = -1|(-1)^2 = 1|
|4 - 4 = 0 [x 3]|0^2 x 3 = 0|
|6 - 4 = 2 [x 2]|2^2 x 2 = 8|
Sum of (X - M)^2 = 18 (the sum of squares)
Variance = [sum of (X - M)^2] / N = SS/N = 18/7 = 2.57
Standard deviation (sigma):
Mathematically: Positive square root of the variance.
Conceptually: The typical distance of scores from the mean.
(Not exactly the average distance, but close)
Two steps for standard deviation:
1. Calculate the variance
2. Take the square root
Our variance was 2.57, so SD = sqrt(2.57) = 1.60
From our example: The deviation scores were -3, -1, 0, 0, 0, 2, 2
Average distance from the mean using absolute values = 8/7 = 1.14. So the standard deviation is not EXACTLY the same as the average distance from the mean; instead it is a "typical" distance from the mean.
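The four variance steps and the standard deviation can be run as a short script on the example scores 1, 3, 4, 4, 4, 6, 6 from above (population formulas, dividing by N):

```python
# Population variance (four steps) and standard deviation for the
# example scores used above.
import math

X = [1, 3, 4, 4, 4, 6, 6]
N = len(X)
M = sum(X) / N                       # mean = 28/7 = 4

dev = [x - M for x in X]             # step 1: deviation scores (X - M)
sq = [d ** 2 for d in dev]           # step 2: square each deviation
SS = sum(sq)                         # step 3: sum of squares = 18
variance = SS / N                    # step 4: mean square = 18/7

sd = math.sqrt(variance)             # standard deviation = sqrt(variance)

print(round(variance, 2), round(sd, 2))  # 2.57 1.6
```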
Sample variance (s^2) and sample standard deviation (s) are used to estimate the population variance and standard deviation. The difference is that instead of dividing SS by N, you divide SS by n - 1. Why the change? (1) It is a correction that improves your estimate of the population variance. The uncorrected variance of samples is systematically smaller than the true population variance; it is a BIASED estimator, unlike x-bar, which is an unbiased estimator of the mean. (2) You lose a degree of freedom (df) when you calculate the mean. This is because if you know the mean and n - 1 of the scores, you also know the value of the last score; it is no longer free to vary.
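The bias described above can be demonstrated with a small simulation. This sketch (the sample size and number of trials are arbitrary choices) draws many samples from a known population and compares the average of SS/n against the average of SS/(n - 1): the uncorrected version lands well below the true population variance, while the corrected version lands close to it.

```python
# Simulation: SS/n is a biased estimator of the population variance;
# SS/(n - 1) corrects the bias.
import random

random.seed(0)
population = [1, 3, 4, 4, 4, 6, 6]
N = len(population)
mu = sum(population) / N
true_var = sum((x - mu) ** 2 for x in population) / N  # 18/7, about 2.57

n = 3            # sample size (arbitrary for the demo)
trials = 20000   # number of simulated samples
biased_sum = corrected_sum = 0.0
for _ in range(trials):
    sample = [random.choice(population) for _ in range(n)]  # with replacement
    xbar = sum(sample) / n
    SS = sum((x - xbar) ** 2 for x in sample)
    biased_sum += SS / n           # uncorrected: systematically too small
    corrected_sum += SS / (n - 1)  # corrected: unbiased on average

print(round(true_var, 2),
      round(biased_sum / trials, 2),
      round(corrected_sum / trials, 2))
```

Averaged over many samples, SS/n comes out near (n - 1)/n times the true variance, while SS/(n - 1) comes out near the true variance itself.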
Calculate the variance for the number of siblings in your group, using the [sum of (X - M)^2] / N formula (population variance)
Xs = , M =
1. X - M =
2. (X - M)^2 =
3. Sum of (X - M)^2 =
4. sigma^2 (variance, MS) =
Let's see how well these variances predict the TRUE population variance, by averaging them together: _________
Now calculate the SAMPLE variance for your groups (s^2), dividing the SS by n - 1 (the degrees of freedom) instead of by N.
How well do these CORRECTED variance scores predict the true population variance? (They should be a closer fit.)