2. The Normal Linear Regression Model

As a motivating example that begins in familiar territory, I'll first review a few basic concepts and assumptions which underlie the least squares regression equation. The model carries the descriptive word "Normal" because the residuals are assumed to be normally distributed (and not the response variable itself, whose marginal distribution need not be normal). This model is one of the first to be explained in most introductory courses in statistics and provides the foundation on which other types of models can be developed.

Let y be a vector of responses measured on a continuous scale, and let x be one covariate (it can easily be a vector which includes two or more covariates). In the normal linear regression model each observed response value y_i is assumed to be a linear function of an intercept and its corresponding explanatory value x_i, plus a random residual:

    y_i = beta_0 + beta_1*x_i + resid_i

where the error terms (the residuals) are independent and identically distributed as N(0, sigma^2). An equivalent way of writing this model is

    y_i ~ N(mu_i, sigma^2)

where y_1, ..., y_n are independent and the expected value of y_i for a given covariate x_i is:

    mu_i = beta_0 + beta_1*x_i

With two or more explanatory variables the expected values of the multiple linear regression model are more compactly expressed in matrix notation:

    E(y) = mu = X*beta                                   (1)

where X is the matrix of explanatory variables (including a column of ones for the intercept) and beta is a vector of regression coefficients. Since the individual observations y_i are independent and normally distributed with variance/covariance matrix sigma^2*I(n), the parameter vector beta is estimated by minimizing the sum of squares:

    (y - mu)'(y - mu) = (y - X*beta)'(y - X*beta)

Always keep in mind that the "assumptions" behind any data analysis method are not just there to simplify the mathematics; they provide a reasonably simple mathematical representation of the data-generating process. Generalized Linear Models also have assumptions.
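To make the least squares criterion above concrete, here is a minimal sketch for the single-covariate case, written in plain Python. The function names (fit_simple_ols, sum_of_squares) and the five data points are hypothetical illustrations, not part of the original analysis; the closed-form slope and intercept used here are the standard solution to minimizing the sum of squares with one covariate.

```python
# Minimal sketch: least squares for y_i = beta_0 + beta_1*x_i + resid_i.
# The data below are made-up illustration values, not taken from the text.

def fit_simple_ols(x, y):
    """Return (beta_0, beta_1) minimizing sum((y_i - beta_0 - beta_1*x_i)^2)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two or more paired observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Spread of x around its mean; zero spread means the slope is not identifiable.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no spread; the slope cannot be estimated")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_1 = sxy / sxx
    beta_0 = y_bar - beta_1 * x_bar
    return beta_0, beta_1

def sum_of_squares(x, y, beta_0, beta_1):
    """The criterion (y - mu)'(y - mu) with mu_i = beta_0 + beta_1*x_i."""
    return sum((yi - (beta_0 + beta_1 * xi)) ** 2 for xi, yi in zip(x, y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

beta_0, beta_1 = fit_simple_ols(x, y)
rss = sum_of_squares(x, y, beta_0, beta_1)
print("intercept:", beta_0, "slope:", beta_1, "sum of squares:", rss)
```

Moving either coefficient away from the fitted values and calling sum_of_squares again should give a larger value, which is what "minimizing the sum of squares" means in practice.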
GLMs do, however, relax the following three assumptions usually placed on linear regression models whose residuals follow the normal distribution:

* The response is a linear function of the explanatory data
* The error term has constant variance
* The residuals are normally distributed

Why is Linearity Not Necessarily Appropriate?

Bounded values of y_i (and thus of their respective means mu_i) are a common situation in data analysis. For example, if y_i represents a measurement of a physical attribute of a substance, then y_i > 0 and likewise its mean mu_i > 0. Generally the realistic range of continuous data, such as heights or weights, lies far enough from the lower bound of 0 that the bound is not a practical concern, and assuming normality works quite well. On the other hand, if y_i has a dichotomous outcome with probability p (0
> p)

For any linear model to be estimated, the number of observations (n) must be greater than the number of parameters. This implies the number of sample values must be greater than the number of regressors. Since GENMOD fits models by maximum likelihood, the working assumption is that the total number of observations is much larger than the number of parameters in the model.

Observe spread in the independent variables and combinations with other factors

Statistics is the study of variability in both the dependent and independent variables: the significance of a predictor can only be observed when there is suitable variation in the independent values (a predictor should not look like a constant and should cover the relevant range of possible values).

Model Specification

The model should be correctly specified concerning:

* Included variables (i.e., no important variables have been omitted)
* Interactions or non-linear terms (such as regressors which are squared)
* Functional form (including the link)

Absence of Excessive Multicollinearity

Independent variables in any GLM are assumed to be linearly independent. That is, no independent variable can be expressed as a non-trivial linear combination of the remaining independent variables. Multicollinearity makes it infeasible to disentangle the effects of the supposedly independent variables and yields poor coefficient estimates (i.e., estimates with large variances).

Error-free predictors (no errors in variables, EIV)

All the independent variables are assumed to be measured without error.
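Two of the conditions above, more observations than parameters and linearly independent predictors, can be checked mechanically before fitting. The following is a minimal sketch in plain Python; the helper names (matrix_rank, check_design) and both small design matrices are hypothetical illustrations and are not part of GENMOD. The rank is computed by ordinary Gaussian elimination with a tolerance, which is an assumption of this sketch rather than a prescribed method.

```python
# Minimal sketch: check n > p and exact linear dependence among predictors.
# Helper names and data are hypothetical illustrations.

def matrix_rank(rows, tol=1e-10):
    """Rank of a matrix via Gaussian elimination with partial pivoting."""
    a = [[float(v) for v in r] for r in rows]
    n_rows = len(a)
    n_cols = len(a[0]) if a else 0
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Choose the remaining row with the largest entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) <= tol:
            continue  # no usable pivot: this column depends on earlier ones
        a[rank], a[pivot] = a[pivot], a[rank]
        for r in range(rank + 1, n_rows):
            factor = a[r][col] / a[rank][col]
            for c in range(col, n_cols):
                a[r][c] -= factor * a[rank][c]
        rank += 1
    return rank

def check_design(X):
    """Report whether n > p and whether the columns are linearly independent."""
    n, p = len(X), len(X[0])
    return {
        "n_exceeds_p": n > p,
        "full_column_rank": matrix_rank(X) == p,
    }

# Columns: intercept, x1, x2
X_ok = [[1, 1, 2], [1, 2, 1], [1, 3, 5], [1, 4, 3], [1, 5, 4]]
# Here x2 = 2*x1 + 1, an exact linear combination of the other columns.
X_bad = [[1, 1, 3], [1, 2, 5], [1, 3, 7], [1, 4, 9], [1, 5, 11]]

print(check_design(X_ok))
print(check_design(X_bad))
```

A rank check of this kind only detects exact linear dependence. Near-dependence, which is the more common form of excessive multicollinearity, would need a diagnostic such as variance inflation factors or a condition number instead.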