Multiple regression is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the value of two or more other variables. The variable we want to predict is called the dependent variable (or sometimes, the outcome, target or criterion variable).
Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple linear regression, also known as multivariable linear regression.
Multiple regression also allows you to determine the overall fit of the model and the relative contribution of each of the predictors to the total variance explained. For example, you might want to know how much of the variation in exam performance can be explained by revision time, test anxiety, lecture attendance and gender “as a whole”, but also the “relative contribution” of each independent variable in explaining the variance.
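At the center of multiple linear regression analysis is the task of fitting a linear function to a multi-dimensional cloud of data points. With one independent variable this is a single line through a scatter plot; with two it is a plane, and with more it is a hyperplane. The simplest form has one dependent and two independent variables. The general form of the multiple linear regression with p independent variables is defined as
yᵢ = β₀ + β₁xᵢ₁ + β₂xᵢ₂ + … + βₚxᵢₚ + εᵢ
for i = 1…n, where β₀ is the intercept, β₁ through βₚ are the regression coefficients, and εᵢ is the error term.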
Sometimes the dependent variable is also called endogenous variable or prognostic variable. The independent variables are also called exogenous variables, predictor variables or regressors.
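As a rough illustration of fitting such a model, the sketch below estimates the coefficients by ordinary least squares with NumPy. The data, the variable names x1 and x2, and the coefficient values are all made up for the example; the response is built without noise so that the fit can be checked by eye.

```python
import numpy as np

# Hypothetical data: 6 observations of two independent variables.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])

# Noise-free dependent variable built from known coefficients (1, 2, 3),
# an assumption made only so the result is easy to verify.
y = 1.0 + 2.0 * x1 + 3.0 * x2

# Design matrix: a column of ones for the intercept, then one column per predictor.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Ordinary least squares estimate of the intercept and the two slopes.
beta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

# Fitted values on the estimated plane.
fitted = X @ beta
```

With real data the response contains noise, so the estimates would only approximate the underlying coefficients.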
There are 3 major uses for Multiple Linear Regression Analysis – (1) causal analysis, (2) forecasting an effect, (3) trend forecasting.
First, it can be used to identify the strength of the effect that the independent variables have on the dependent variable. Typical questions are: what is the strength of the relationship between dose and effect, between sales and marketing spend, or between age and income?
Second, it can be used to forecast the effects or impacts of changes. That is, multiple linear regression analysis helps us understand how much the dependent variable will change when we change the independent variables.
Third, multiple linear regression analysis predicts trends and future values, and can be used to get point estimates. Typical questions are: what will the price of gold be six months from now? What is the total effort for a task X?
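When selecting the model for the multiple linear regression analysis, another important consideration is the model fit. Adding independent variables to a multiple linear regression model never decreases the explained variance (typically expressed as R²), and almost always increases it at least slightly, even when the added variable is unrelated to the dependent variable. A higher R² therefore does not by itself make the model more valid: adding too many independent variables without theoretical justification can produce an over-fit model. Measures that penalize the number of predictors, such as adjusted R², help guard against this.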
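The behavior of R² when a predictor is added can be checked numerically. The sketch below uses synthetic data and two hypothetical helper functions, r_squared and adjusted_r_squared; the second predictor is random noise that is unrelated to the dependent variable by construction.

```python
import numpy as np

def r_squared(X, y):
    """R² of an OLS fit of y on X (X already contains an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Adjusted R² for n observations and p predictors (intercept not counted)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(0)  # fixed seed; all data here are synthetic
n = 50
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)
noise = rng.normal(size=n)  # a predictor with no real relation to y

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([np.ones(n), x, noise])

r2_small = r_squared(X_small, y)
r2_big = r_squared(X_big, y)  # cannot be lower than r2_small

adj_small = adjusted_r_squared(r2_small, n, 1)
adj_big = adjusted_r_squared(r2_big, n, 2)  # may be lower than adj_small
```

Because the smaller model is nested in the larger one, r2_big is at least r2_small. Adjusted R² carries no such guarantee, which is why it is often preferred when comparing models with different numbers of predictors.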
A categorical variable is ordinal if there is a natural ordering of its possible categories. If there is no natural ordering, it is nominal.
Because it is not appropriate to perform arithmetic on the values of the variable, there are only a few possibilities for describing the variable, and these are all based on counting. First, you can count the number of categories. Many categorical variables such as Gender have only two categories. Others such as Region can have more than two categories. As you count the categories, you can also give the categories names, such as Male and Female.
Once you know the number of categories and their names, you can count the number of observations in each category (this is referred to as the count of categories). The resulting counts can be reported as “raw counts” or they can be transformed into percentages of totals. For example, if there are 1000 observations, you can report that there are 560 males and 440 females, or you can report that 56% of the observations are males and 44% are females.
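The counting described above takes only a few lines of code. This sketch builds a hypothetical sample that matches the 560/440 example and reports raw counts and percentages using the Python standard library.

```python
from collections import Counter

# Hypothetical sample matching the example: 560 males and 440 females.
observations = ["Male"] * 560 + ["Female"] * 440

# Raw counts per category.
counts = Counter(observations)

# Number of distinct categories and total number of observations.
n_categories = len(counts)
total = sum(counts.values())

# Counts expressed as percentages of the total.
percentages = {category: 100.0 * k / total for category, k in counts.items()}
```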
A dummy variable is a variable with possible values 0 and 1. It equals 1 if the observation is in a particular category and 0 if it is not.
Dummy variables are used in two situations. The first is when a categorical variable has only two categories. A good example of this is a gender variable that has the two categories “male” and “female.” In this case only a single dummy variable is required, and you have the choice of assigning the 1s to either category. If the dummy variable is called Gender, you can code Gender as 1 for males and 0 for females, or you can code Gender as 1 for females and 0 for males. You just need to be consistent and specify explicitly which coding scheme you are using.
The other situation is when there are more than two categories. A good example of this is when you have quarterly time series data and you want to treat the quarter of the year as a categorical variable with four categories, 1 through 4. Then you can create four dummy variables, Q1 through Q4. For example, Q2 equals 1 for all second-quarter observations and 0 for all other observations. Although you can create four dummy variables, only three of them (any three) should be used in a regression equation. The four dummies always sum to 1, so including all of them alongside an intercept makes the predictors perfectly collinear. The omitted quarter becomes the reference category against which the other three are compared.
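The reason for using only three of the four quarter dummies can be seen by checking the rank of the design matrix. The sketch below uses a made-up three-year quarterly index; dropping Q1 is an arbitrary choice, since any one dummy may be omitted.

```python
import numpy as np

# Hypothetical quarterly series: three years of quarters 1 through 4.
quarter = np.tile(np.array([1, 2, 3, 4]), 3)

# One dummy column per quarter: column k is 1 for quarter k+1 and 0 otherwise.
Q = (quarter[:, None] == np.array([1, 2, 3, 4])[None, :]).astype(float)

intercept = np.ones((len(quarter), 1))

# Intercept plus all four dummies: the dummy columns add up to the intercept
# column, so the matrix has 5 columns but only 4 independent ones.
X_all = np.hstack([intercept, Q])

# Intercept plus three dummies (Q2, Q3, Q4): full column rank.
X_three = np.hstack([intercept, Q[:, 1:]])

rank_all = np.linalg.matrix_rank(X_all)
rank_three = np.linalg.matrix_rank(X_three)
```

Here rank_all is smaller than the number of columns in X_all, so the coefficients are not uniquely determined, while X_three has as many independent columns as coefficients.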
Dummy coding is used when there is a control or comparison group in mind. One is therefore analyzing the data of one group in relation to the comparison group: a represents the mean of the control group and b is the difference between the mean of the experimental group and the mean of the control group. It is suggested that three criteria be met for specifying a suitable control group: the group should be a well-established group (e.g. should not be an “other” category), there should be a logical reason for selecting this group as a comparison (e.g. the group is anticipated to score highest on the dependent variable), and finally, the group’s sample size should be substantive and not small compared to the other groups.
The following table is an example of dummy coding with French as the control group and C1, C2, and C3 respectively being the codes for Italian, German, and Other (neither French nor Italian nor German):
|Nationality|C1|C2|C3|
|French|0|0|0|
|Italian|1|0|0|
|German|0|1|0|
|Other|0|0|1|
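To make the interpretation of a and b concrete, the sketch below regresses a set of invented scores on the three dummy codes, with French as the control group. The scores and group sizes are hypothetical and chosen only so that the group means are easy to read off.

```python
import numpy as np

# Hypothetical scores, three people per group (values are made up).
groups = ["French"] * 3 + ["Italian"] * 3 + ["German"] * 3 + ["Other"] * 3
scores = np.array([4.0, 5.0, 6.0,   # French, mean 5
                   6.0, 7.0, 8.0,   # Italian, mean 7
                   2.0, 3.0, 4.0,   # German, mean 3
                   5.0, 5.0, 8.0])  # Other, mean 6

# Dummy codes (C1, C2, C3) with French as the control group (all zeros).
codes = {
    "French":  [0, 0, 0],
    "Italian": [1, 0, 0],
    "German":  [0, 1, 0],
    "Other":   [0, 0, 1],
}
C = np.array([codes[g] for g in groups], dtype=float)

# Design matrix: intercept column followed by the three dummy columns.
X = np.column_stack([np.ones(len(groups)), C])

# a is the intercept; b1, b2, b3 are the coefficients of C1, C2, C3.
a, b1, b2, b3 = np.linalg.lstsq(X, scores, rcond=None)[0]
```

In this example a equals the French mean, and each b equals the corresponding group's mean minus the French mean.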
In the effects coding system, data are analyzed through comparing one group to all other groups. Unlike dummy coding, there is no control group. Rather, the comparison is being made at the mean of all groups combined (a is now the grand mean). Therefore, one is not looking for data in relation to another group but rather, one is seeking data in relation to the grand mean.
Effects coding can be either weighted or unweighted. Weighted effects coding simply calculates a weighted grand mean, thus taking into account the sample size of each group. This is most appropriate in situations where the sample is representative of the population in question. Unweighted effects coding is most appropriate in situations where differences in sample size are the result of incidental factors. The interpretation of b is different for each: in unweighted effects coding b is the difference between the mean of the experimental group and the grand mean, whereas in the weighted situation it is the mean of the experimental group minus the weighted grand mean.
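The two grand means differ only when group sizes are unequal. A small sketch with invented scores and deliberately unequal group sizes shows the difference; the unweighted grand mean is taken here as the simple average of the group means.

```python
from statistics import mean

# Hypothetical scores with unequal group sizes (values are made up).
groups = {
    "French":  [4.0, 5.0, 6.0],
    "Italian": [6.0, 8.0],
    "German":  [2.0, 3.0, 4.0],
    "Other":   [5.0, 5.0, 8.0, 6.0],
}

group_means = {name: mean(values) for name, values in groups.items()}

# Unweighted grand mean: every group counts equally, whatever its size.
unweighted_grand_mean = mean(list(group_means.values()))

# Weighted grand mean: each group mean is weighted by its sample size,
# which is the same as averaging all observations together.
total_n = sum(len(values) for values in groups.values())
weighted_grand_mean = sum(
    len(values) * group_means[name] for name, values in groups.items()
) / total_n
```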
In effects coding, we code the group of interest with a 1, just as we would for dummy coding. The principal difference is that we code −1 for the group we are least interested in. Since we continue to use a g – 1 coding scheme, where g is the number of groups, it is in fact the −1 coded group that will not produce its own b value, hence the fact that we are least interested in that group. A code of 0 is assigned to all other groups.
The b values should be interpreted such that the experimental group is being compared against the mean of all groups combined (or the weighted grand mean in the case of weighted effects coding). A negative b value therefore means that the coded group scored lower than the mean of all groups on the dependent variable. Using our previous example of optimism scores among nationalities, if the group of interest is Italians, a negative b value suggests that they obtain a lower optimism score than the mean of all groups.
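The following table is an example of effects coding with Other as the group of least interest. Here C1, C2, and C3 are taken to be the codes for French, Italian, and German respectively; the assignment of groups to columns is arbitrary.
|Nationality|C1|C2|C3|
|French|1|0|0|
|Italian|0|1|0|
|German|0|0|1|
|Other|−1|−1|−1|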
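A short sketch of unweighted effects coding in a regression follows. The scores are invented, the group sizes are deliberately unequal, and the column order (French, Italian, German) is an assumption made for the example, with Other as the −1 coded group.

```python
import numpy as np

# Hypothetical scores with unequal group sizes (values are made up).
data = {
    "French":  [4.0, 5.0, 6.0],       # mean 5
    "Italian": [6.0, 8.0],            # mean 7
    "German":  [2.0, 3.0, 4.0],       # mean 3
    "Other":   [5.0, 5.0, 8.0, 6.0],  # mean 6
}

# Effects codes: 1 for the coded group, 0 otherwise, -1 on every column
# for the group of least interest (Other).
codes = {
    "French":  [1, 0, 0],
    "Italian": [0, 1, 0],
    "German":  [0, 0, 1],
    "Other":   [-1, -1, -1],
}

rows, y = [], []
for group, values in data.items():
    for value in values:
        rows.append([1.0] + codes[group])  # intercept column, then the codes
        y.append(value)

X = np.array(rows, dtype=float)
y = np.array(y)

# a is the intercept; the b values belong to French, Italian, and German.
a, b_french, b_italian, b_german = np.linalg.lstsq(X, y, rcond=None)[0]

# The implied effect for the -1 coded group is minus the sum of the others.
b_other = -(b_french + b_italian + b_german)
```

With these codes a comes out as the unweighted mean of the four group means, and each b is that group's mean minus a. Reproducing the weighted variant would need codes that depend on the group sizes, which this sketch does not attempt.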
Regression Models with Nonlinear Terms
Nonlinear regression is a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables. The data are fitted by a method of successive approximations.
While a linear equation has one basic form, nonlinear equations can take many different forms. A regression equation is linear when it is linear in the parameters: each term is either a constant or the product of a parameter and a predictor variable, and the terms are added together. If the equation doesn’t meet these criteria, it’s nonlinear.
Unlike linear regression, these functions can have more than one parameter per predictor variable.
|Nonlinear function|
|Power (convex): Theta1 * X^Theta2|
|Weibull growth: Theta1 + (Theta2 – Theta1) * exp(-Theta3 * X^Theta4)|
|Fourier: Theta1 * cos(X + Theta4) + Theta2 * cos(2*X + Theta4) + Theta3|
The data consist of error-free independent variables (explanatory variables), x, and their associated observed dependent variables (response variables), y. Each y is modeled as a random variable with a mean given by a nonlinear function f(x,β). Systematic error may be present but its treatment is outside the scope of regression analysis. If the independent variables are not error-free, this is an errors-in-variables model, also outside this scope.
In general, there is no closed-form expression for the best-fitting parameters, as there is in linear regression. Usually numerical optimization algorithms are applied to determine the best-fitting parameters. Again in contrast to linear regression, there may be many local minima of the function to be optimized, and even the global minimum may produce a biased estimate. In practice, initial estimates of the parameters are supplied to the optimization algorithm as starting values in an attempt to find the global minimum of the sum of squares.
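One way to picture the method of successive approximations is a Gauss-Newton iteration with step halving, sketched below for the power curve Theta1 * X^Theta2. This is only one of several possible algorithms, not necessarily the one any particular package uses. The data are noise-free and invented, and the function names and starting values are assumptions made for the example.

```python
import numpy as np

def model(theta, x):
    """Power curve Theta1 * x**Theta2."""
    return theta[0] * x ** theta[1]

def jacobian(theta, x):
    """Partial derivatives of the model with respect to Theta1 and Theta2."""
    return np.column_stack([x ** theta[1],
                            theta[0] * x ** theta[1] * np.log(x)])

def sse(theta, x, y):
    """Sum of squared residuals for a given parameter vector."""
    return float(np.sum((y - model(theta, x)) ** 2))

def gauss_newton(x, y, theta0, max_iter=100, tol=1e-12):
    """Repeatedly linearize the model and solve a linear least-squares step."""
    theta = np.asarray(theta0, dtype=float)
    current = sse(theta, x, y)
    for _ in range(max_iter):
        residual = y - model(theta, x)
        step, *_ = np.linalg.lstsq(jacobian(theta, x), residual, rcond=None)
        # Halve the step until the sum of squares actually decreases.
        t = 1.0
        while t > 1e-8 and not (sse(theta + t * step, x, y) < current):
            t *= 0.5
        candidate = theta + t * step
        new = sse(candidate, x, y)
        if not (new < current):
            break  # no improving step found; stop at the current estimate
        improvement = current - new
        theta, current = candidate, new
        if improvement < tol:
            break
    return theta

# Hypothetical noise-free data generated from Theta1 = 2, Theta2 = 1.5.
x = np.arange(1.0, 9.0)
y = 2.0 * x ** 1.5

theta_start = np.array([1.0, 1.0])  # starting values supplied by the user
theta_hat = gauss_newton(x, y, theta_start)

sse_start = sse(theta_start, x, y)
sse_end = sse(theta_hat, x, y)
```

The iteration only guarantees that the sum of squares does not increase from the starting values. Whether it reaches the global minimum depends on the model, the data, and the starting values, so it is common to try several starting points.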