3. Introduction to Generalized Linear Models

The theory of generalized linear models (GLMs) unifies important statistical models for continuous and categorical response variables. The term "generalized linear model" was introduced in a landmark paper by Nelder and Wedderburn (1972, JRSS A). That paper placed a wide range of statistical modeling and inference problems into one elegant framework, including:

* Analysis of variance and covariance (linear regression) for normal outcomes
* Regression models for binomial, Poisson, negative binomial, inverse Gaussian, and gamma outcomes

Generalized linear models extend classical linear models for independent, normally distributed random variables with constant variance. They consist of three components:

Random      specifies a probability distribution for the response Y.
Systematic  specifies the explanatory variables entered as predictors in the model. They can be either continuous (numbers) or discrete (categorical).
Link        describes the functional relationship between the systematic component and the expected value (mean) of the random component.

Unifying Features

In generalized linear models, the defining characteristic is the random component, which specifies that the observations are independent and come from an exponential family distribution. The choices include the normal, Poisson, binomial, negative binomial, inverse Gaussian, and gamma distributions. For each of these choices the probability density (or mass) function, whether continuous or discrete, can be written in the general form

    f(y | theta, phi) = exp{ [y*theta - b(theta)] / a(phi) + c(y, phi) }        (1)

If the value of phi is known, this equation describes an exponential family model with canonical parameter theta.
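As a concrete check of the exponential family form, the Poisson distribution can be written with theta = LOG(mu), b(theta) = EXP(theta), a(phi) = 1, and c(y, phi) = -LOG(y!). The short Python sketch below compares the usual Poisson formula with the exponential family expression. The function names and the value mu = 3.5 are illustrative assumptions, not part of the original text.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability mass function in its usual form."""
    return math.exp(-mu) * mu ** y / math.factorial(y)

def exp_family_density(y, theta, a_phi, b, c):
    """Evaluate exp{[y*theta - b(theta)] / a(phi) + c(y, phi)}.

    a_phi is the already-evaluated value of a(phi); b is a function of
    theta and c is a function of y (phi is fixed, so c takes y only).
    """
    return math.exp((y * theta - b(theta)) / a_phi + c(y))

# Poisson pieces of the general form (mu = 3.5 is an arbitrary choice).
mu = 3.5
theta = math.log(mu)                 # canonical parameter
b = math.exp                         # b(theta) = exp(theta) = mu
c = lambda y: -math.lgamma(y + 1)    # lgamma(y + 1) equals log(y!)

for y in range(6):
    direct = poisson_pmf(y, mu)
    via_family = exp_family_density(y, theta, 1.0, b, c)
    print(y, round(direct, 6), round(via_family, 6))
```

The two printed columns should agree up to floating-point rounding, since both expressions are algebraically the same function of y and mu.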
The systematic component of a GLM relates a function of the mean, g(mu), where mu = E(y) is the expected value of the response, to the explanatory variables, which enter as a linear predictor [i.e., in the form SUM(beta_j*x_ij)]:

    g{E(y_i)} = g(mu_i) = beta_0 + beta_1*x_i1 + beta_2*x_i2 + ... + beta_p*x_ip        (2)

where y_i is the response for subject i (i = 1,...,n), mu_i = E(y_i), and x_ij (j = 1,...,p) are the p explanatory variables available for each subject. The beta_j, j = 0,1,...,p, are the corresponding regression parameters to be estimated. The function g, called the link, is explained in more detail below.

Some model predictors may be computed from other variables in the model to form interactions (e.g., beta_12*x_i1*x_i2) or to capture nonlinearity (squared terms such as beta_33*x_i3^2). In both situations the resulting products of variables still enter the model linearly in the parameters. You may also want to center these variables before forming the products rather than computing with the original values.

The monotone link function between the random and systematic components specifies how the expected value of the response [mu = E(y)] relates to the p explanatory variables in the linear predictor. For example, you can model the mean directly (i.e., the identity link with g(mu) = mu, as in linear regression), or you can model a monotone nonlinear function of the mean (e.g., the logit link, where g(mu) = LOG[mu/(1-mu)]). The GLM model formula specifies that

    g(mu_i) = beta_0 + beta_1*x_i1 + ... + beta_p*x_ip

where g() is a one-to-one (monotonic) function of mu. Typically, the preferred link function g() maps the constrained values of the mean mu onto a range where the transformed mean is unconstrained (i.e., it converts a bounded interval into the entire real line). For example, with g(mu) = LOG(mu) for Poisson data (where mu can only be positive), the range of g(mu) is the entire real line.
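The role of the link can be illustrated with a small Python sketch: the linear predictor may take any real value, and the inverse of the link maps it back to a valid mean. The coefficients and predictor values below are made up for illustration, and the function names are assumptions rather than names from any particular package.

```python
import math

def log_link(mu):
    """Log link: maps a positive mean onto the real line."""
    return math.log(mu)

def inv_log_link(eta):
    """Inverse of the log link: always returns a positive mean."""
    return math.exp(eta)

def logit_link(mu):
    """Logit link: maps a mean in (0, 1) onto the real line."""
    return math.log(mu / (1.0 - mu))

def inv_logit_link(eta):
    """Inverse of the logit link: always returns a mean in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients and predictor values, for illustration only.
beta = [0.5, -1.2, 0.8]    # beta_0, beta_1, beta_2
x = [1.0, 2.0, 0.5]        # the leading 1.0 multiplies the intercept

# Linear predictor eta = beta_0 + beta_1*x_1 + beta_2*x_2
eta = sum(b_j * x_j for b_j, x_j in zip(beta, x))

print("linear predictor:", eta)
print("mean under log link:", inv_log_link(eta))
print("mean under logit link:", inv_logit_link(eta))
```

Whatever sign or size the linear predictor has, the first inverse returns a positive number and the second returns a number strictly between 0 and 1.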
With g(mu) = LOG[mu/(1-mu)] for binomial data, the mean mu is bounded between 0 and 1, yet the range of g(mu) is also the entire set of real numbers.

Because the variance of a response often depends on its mean (that is, the variance is nonhomogeneous), generalized linear models assume that

    VAR(y) = a(phi)*V(mu)        (3)

where V() is a known variance function appropriate for the particular type of response data and a(phi) is the scale term from the exponential family form. Fitting the regression equation (2) weights the observations inversely to the variance function V() for given values of the mean. This weighting procedure turns out to be equivalent to maximum likelihood estimation when the observations come from one of the exponential family distributions.

The following table summarizes the wide range of statistical modeling and inference available with GLMs:

Response   | Random       | Link        | Systematic  | Type
Data       | Component    |             | Component   | of Model
-----------+--------------+-------------+-------------+------------------------
Continuous | Normal       | Identity    | Continuous  | Regression
Continuous | Normal       | Identity    | Categorical | Analysis of Variance
Continuous | Normal       | Identity    | Mixed       | Analysis of Covariance
Discrete   | Binomial     | Logit       | Mixed       | Logistic regression
Discrete   | Poisson      | Log         | Mixed       | Loglinear
Discrete   | Multinomial  | Cumulative  | Mixed       | Ordinal
           |              | logit       |             |
Discrete   | Multinomial  | Generalized | Mixed       | Nominal
           |              | logit       |             |
-------------------------------------------------------------------------------

This table implies that different modeling approaches exist for specific types of response data. In particular, statistical computing does not center on the normal distribution as an all-encompassing model for numerical response data; in reality, only continuous, essentially unbounded data meet this condition. The type of data you have collected determines the modeling strategy.
This is a very different approach from the all-too-common decision to choose one model (i.e., one based on the normal distribution) to fit many types of data.

Generalized linear models extend the applications formerly limited to linear regression and ANOVA models (that is, those which assume independent and normally distributed random variables with constant variance). The inclusive nature of generalized linear model theory allows the same algorithms to be used to fit data from a variety of distributions. Significance tests for model parameters are also closely related, so if you understand statistical tests in a linear regression or analysis of variance model, it is a small step to apply the same techniques to logistic or Poisson regression and other types of GLMs.

The major difference among these models is the interpretation of the estimated parameters. Interpretation is directly tied to the particular link function chosen, which, as the table above shows, varies across types of distributions. Several different links may also exist for the same distribution, but that discussion is deferred to another article. The links specified here are derived from exponential family theory (the identity, logit, and log links are the "canonical" links for the normal, binomial, and Poisson distributions), although that does not mean they are the only ones, nor even the best ones, for every situation. The important concept to remember is that, no matter which link function is chosen, all these models produce predicted values in the units of the original data by applying the inverse of the link to the linear predictor.

Summary

Generalized linear models (GLMs) exist for regression-like modeling of data for which a normal distribution is not assumed. GLMs are flexible enough to include a wide range of common data analysis situations, while at the same time using many familiar ideas from linear regression based on the normal distribution.
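The weighting procedure described with equation (3), and the way predictions return to the units of the original data, can be sketched for the simplest Poisson case with a log link and a single predictor. This is a minimal illustration of iteratively reweighted least squares under those assumptions, not a general-purpose fitting routine; the data are made up and the function name is hypothetical.

```python
import math

def fit_poisson_irls(x, y, n_iter=25, tol=1e-10):
    """Fit LOG(mu_i) = beta_0 + beta_1*x_i by iteratively reweighted
    least squares (one predictor, Poisson response, log link)."""
    # Start from the intercept-only fit: mu_i = mean(y) for every i.
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(n_iter):
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        # Working response and weights for the log link with V(mu) = mu.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        w = mu
        # Weighted least squares of z on x via the 2x2 normal equations.
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx * swx
        new_b0 = (swxx * swz - swx * swxz) / det
        new_b1 = (sw * swxz - swx * swz) / det
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1

# Made-up counts, for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1, 2, 4, 6, 12, 19]

b0, b1 = fit_poisson_irls(x, y)
# The inverse link (EXP) returns fitted means on the count scale.
fitted = [math.exp(b0 + b1 * xi) for xi in x]
print("beta_0 =", round(b0, 4), "beta_1 =", round(b1, 4))
print("fitted means:", [round(m, 2) for m in fitted])
```

At convergence the fitted means satisfy the Poisson likelihood equations, so the fitted counts sum to the observed counts. That is one way to see that the reweighting reproduces maximum likelihood estimation for this model.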