Multiple Linear Regression

Page Contents

The Linear Regression Model

The general linear regression model is one of the form:

y_i = a + b_1 x_{i,1} + b_2 x_{i,2} + ... + e

where y_i is the value of the response in the i-th observation, x_{i,j} is the value of the j-th explanatory variable in the i-th observation, a, b_1, b_2, ... are unknown parameters, and e is a random variable with e ~ N(0, σ²) in all observations, σ² being the variance.
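As a concrete illustration of the model equation (a hypothetical numeric sketch, independent of MathymaStats — the function name and parameter values are invented for the example):

```javascript
// One observation from the model y = a + b_1*x_1 + b_2*x_2 + ... + e,
// with made-up parameter values.
function linearModel(a, b, x, e) {
  // mean response a + sum_j b_j * x_j, plus the random error term e
  return a + b.reduce(function (sum, bj, j) { return sum + bj * x[j]; }, 0) + e;
}

// With a = 1, b = [2, 3], x = [4, 5] and e = 0:
// 1 + 2*4 + 3*5 = 24
console.log(linearModel(1, [2, 3], [4, 5], 0)); // 24
```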

Multiple Linear Regression

The example we use to illustrate multiple regression is a study of crime rates in 47 American states in 1960. A number of other variables were measured for each state, all of which are included in the data.

Source: D. R. Cox and E. J. Snell, Applied Statistics (Chapman & Hall, 1981).
(Data from W. Vandaele, in Deterrence and Incapacitation, edited by A. Blumstein, J. Cohen and D. Nagin (National Academy of Sciences, 1978) pp. 270-335.)

<script type="text/javascript" src="../Lib/MathymaStat.js"></script>
<script type="text/javascript" src="../Lib/MathymaLinMod.js"></script>
. . .
<script>
  var myData = new mathyma.stats.DataBlock("StatsData/crime.xml");
</script>
Variable Name Description
crime crime rate
malyth number of males aged 14-24 per 1000 of whole population
state 0 = Northern state, 1 = Southern state
school mean number of years of schooling (×10) of people over 25
pol59 police expenditure 1959 ($ per head of population)
pol60 police expenditure 1960 ($ per head of population)
empyth labour force participation of civilian urban males aged 14-24 (number per 1000)
mf males per 1000 females in population
popn population (in 100,000s)
race number of non-whites per 1000 of whole population
uneyth unemployment, 14-24 yr. olds (number unemployed per 1000)
unemid unemployment, 35-59 yr. olds (number unemployed per 1000)
income median value of family income, transferable goods and assets (× $10)
poor number of families per 1000 earning less than half the median income

First let's add all 14 variables, and then look at a summary of the response variable:

<script>
</script>

Now we will define our first model, in which we include all 13 explanatory variables.

Model1 = myData.Model("crime=_c+malyth+state+school+pol60+pol59+empyth+mf+popn+race+uneyth+unemid+income+poor"); 

The main purpose of defining this model is to be able to look at the relationships between the explanatory variables. To do this we draw the correlation matrix, which shows the value of the correlation coefficient between every possible pair of variables. Correlation varies between -1 and +1, and is zero for two variables that are independent (although zero correlation does not by itself imply independence). As a visual aid, pairs with an absolute correlation coefficient greater than 0.7 are printed in red:
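Each cell of that matrix contains the Pearson correlation coefficient of one pair of variables. As a sketch of what is being computed (plain JavaScript, independent of MathymaStats; the function name is ours):

```javascript
// Pearson correlation coefficient between two equal-length samples:
// r = sum((x-mx)(y-my)) / sqrt(sum((x-mx)^2) * sum((y-my)^2))
function pearson(x, y) {
  var n = x.length;
  var mx = x.reduce(function (a, b) { return a + b; }, 0) / n;
  var my = y.reduce(function (a, b) { return a + b; }, 0) / n;
  var sxy = 0, sxx = 0, syy = 0;
  for (var i = 0; i < n; i++) {
    sxy += (x[i] - mx) * (y[i] - my);
    sxx += (x[i] - mx) * (x[i] - mx);
    syy += (y[i] - my) * (y[i] - my);
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Two perfectly linearly related variables have correlation 1:
console.log(pearson([1, 2, 3], [2, 4, 6])); // 1
```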


Next we use the display() method of the model to print the ANOVA table for the model, together with the table of coefficient estimates, their t-values and significance probabilities.
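The ANOVA table rests on splitting the total sum of squares into a model part and a residual part. As a sketch of that decomposition for a single explanatory variable (the multiple-variable case is analogous; the function name and data values here are invented for the example):

```javascript
// Least-squares fit of y = a + b*x and the sums of squares that make up
// a simple regression ANOVA table: ssTotal = ssModel + ssResidual.
function simpleRegressionAnova(x, y) {
  var n = x.length;
  var mx = x.reduce(function (a, b) { return a + b; }, 0) / n;
  var my = y.reduce(function (a, b) { return a + b; }, 0) / n;
  var sxy = 0, sxx = 0, sst = 0;
  for (var i = 0; i < n; i++) {
    sxy += (x[i] - mx) * (y[i] - my);
    sxx += (x[i] - mx) * (x[i] - mx);
    sst += (y[i] - my) * (y[i] - my);   // total SSq about the mean
  }
  var b = sxy / sxx;                     // slope estimate
  var a = my - b * mx;                   // intercept estimate
  var ssModel = b * sxy;                 // model (regression) SSq
  return { a: a, b: b, ssModel: ssModel, ssResidual: sst - ssModel, ssTotal: sst };
}

// y = 1 + 2x exactly, so the residual sum of squares is 0:
var fit = simpleRegressionAnova([1, 2, 3, 4], [3, 5, 7, 9]);
console.log(fit.a, fit.b, fit.ssResidual); // 1 2 0
```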


Model Reduction

One of the common aims of a multiple regression analysis is to find as small a set of explanatory variables as possible that still provides a reasonable model for the data.

More sophisticated statistical packages (SAS, Genstat, etc.) provide semi-automatic procedures to determine this minimal set of variables. Although MathymaStats does not do this, it does provide the methods needed to do all the necessary calculations. Here we will illustrate how to test for the removal of one of the explanatory variables.

In our original model with 13 explanatory variables we saw that there was a strong correlation between police expenditure in 1959 and 1960 (> 0.99). This suggests that we can remove one of the two without significantly altering the model. Let us introduce model number 2, in which pol59 (the previous year's police expenditure) has been removed:

  Model2 = myData.Model("crime=_c+malyth+state+school+pol60+empyth+mf+popn+race+uneyth+unemid+income+poor"); 

If you compare the ANOVA table for this model with that for the full model, you'll see that the model sum of squares has been reduced by about 160, and the residual sum of squares increased accordingly.

Finally we test for model reduction:


Note that the significance probability of over 56% is a very strong indication that model 2 can be accepted.
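The test above is the usual extra-sum-of-squares F test for comparing a full model against a nested, reduced one. A sketch of the statistic (the function name is ours, and the sums of squares below are hypothetical round numbers, not the actual crime-data values):

```javascript
// F statistic for dropping variables from a nested model:
// F = ((RSS_reduced - RSS_full) / q) / (RSS_full / dfFull),
// where q is the number of dropped variables and dfFull/dfReduced are the
// residual degrees of freedom of each model (dropping q variables raises
// the residual df by q).
function reductionF(rssFull, dfFull, rssReduced, dfReduced) {
  var q = dfReduced - dfFull;
  return ((rssReduced - rssFull) / q) / (rssFull / dfFull);
}

// Dropping one variable: with 47 observations and 13 explanatory variables
// the full model has residual df 47 - 13 - 1 = 33, rising to 34 after the
// drop. With a hypothetical residual SSq rise of 160 on a base of 15000:
console.log(reductionF(15000, 33, 15160, 34)); // ≈ 0.352, clearly non-significant
```

An F value this small is far below any conventional critical value, which is the situation the 56% significance probability above describes.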

We can then proceed to remove other variables, one by one, until further removal affects the model significantly.