
COMPUTER LAB 4: LINEAR REGRESSION DIAGNOSTICS - A REVIEW

 

When using linear regression models to provide inference about a given dataset, it is very important to ensure that the model actually fits the data and meets the appropriate assumptions. Today we will review some of the methods for testing the fit of your model.

SET UP THE R ENVIRONMENT

Let's first import the data that we are going to use for linear regression. Download the gala.txt dataset and save it to your directory, then read the tab-delimited text file into R (note: you will have to change the path to wherever you stored the file):

gala <- read.table("//miriam/shannon$/asas/salbeke/LinRegDiag/gala.txt", sep="\t", header=TRUE)

Next we need to load the R packages required to run our diagnostic tests:

library(MASS)
library(car)

Review the data...you will notice there are 30 rows of data with 8 columns. For this dataset, the response variables are the total number of species, 'Species', and the number that are 'Endemics'. We will choose which column to use later. The remaining five columns are the predictor variables: Area, Elevation, Nearest, Scruz and Adjacent (I am not totally sure what each of these variables represents).
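
If you want to take that quick look yourself, a few base R commands will do the trick:

#Review the structure and first few rows of the data
str(gala)
head(gala)
summary(gala)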

CHECKING THE DISTRIBUTION

It is always a good idea to look at the distribution of your response variable. Copy and paste the following lines of code into R:

d <- density(gala$Species)
plot(d, type="n", main="Distribution of Species", xlab="Species", ylab="pdf")
polygon(d, col="blue")

What we have created is a kernel density estimate (using a Gaussian kernel) of the total number of species, and we have plotted the results. From this plot we can see that the number of species is highly skewed toward smaller numbers, but there are several study areas that have very high numbers of species. We can obtain a similar result by creating a histogram of the data:

hist(gala$Species,10)

Because of the strong skewness and the very large numbers toward the tail of the distribution, we may want to determine if the data contains any outliers. An outlier is an observation which differs from the main trend in the data. The reason might be due to (unforeseen) special circumstances about the particular observation, or it might be due to a measurement error. But there is also the possibility that the unusual observation is simply due to random variation in the data: since the data are observations of random variables, there will be some variation away from the 'true mean'. Most points will lie close to the 'true mean', some will lie a little away, and a few might lie a bit further away from the mean. Let's test for outliers using the following function:

 

# Test for outliers using modified Z-scores based on the median absolute deviation (MAD)
outliers <- function(x) {
  # Expected maximum Z-score for a sample of this size
  e <- (length(x) - 1) / sqrt(length(x))
  # Local copy of the MAD function (same behavior as stats::mad)
  mad <- function(x, center = median(x), constant = 1.4826,
                  low = FALSE, high = FALSE) {
    n <- length(x)
    constant * if ((low || high) && n %% 2 == 0) {
      if (low && high)
        stop("'low' and 'high' cannot be both TRUE")
      n2 <- n %/% 2 + as.integer(high)
      sort(abs(x - center), partial = n2)[n2]
    } else {
      median(abs(x - center))
    }
  }
  # Modified Z-score for each observation
  z <- (0.6745 * (x - median(x))) / mad(x)
  # Compare the largest observed Z to the expected maximum
  if (max(z) > e) {
    print(paste("OUTLIERS DETECTED - ", "Expected-Z: ", round(e, digits = 2),
                " Observed-Z: ", round(max(z), digits = 2), sep = ""))
  } else {
    print(paste("NO OUTLIERS DETECTED - ", "Expected-Z: ", round(e, digits = 2),
                " Observed-Z: ", round(max(z), digits = 2), sep = ""))
  }
  list(Z = z, Expected = e, Observed = max(z))
}

 Once you have loaded the function into R, test for outliers within the Species column:

Z <- outliers(gala$Species)
 

The function should indicate that there are outliers within our dataset, but we do not know which values are the outliers...just yet.
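
If you want to see which observations are driving that result, the list returned by the function already contains the individual Z-scores, so a quick comparison against the expected value will show them:

# Identify which observations exceed the expected maximum Z-score
which(Z$Z > Z$Expected)          # row numbers of the flagged observations
gala$Species[Z$Z > Z$Expected]   # the corresponding Species counts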

MULTIPLE LINEAR REGRESSION WITH FIT & ASSUMPTION DIAGNOSTICS

The principle of least squares regression provides a general methodology for fitting straight-line models to regression data. Recall that for a multiple linear regression model:

$\displaystyle Y_{i}=\beta_{0}+\beta_{1}x_{i,1}+\beta_{2}x_{i,2}+\cdots+\beta_{k}x_{i,k}+\varepsilon_{i},\qquad i=1,\ldots,n,$

 

we must meet the following four model assumptions:

(I) Independence: The response variables $Y_{i}$ are independent (i.e. each sample is independent of the other samples); thus the errors/residuals are also independent (no autocorrelation).

(II) Normality: The random errors $\varepsilon_{i}$ are normally distributed.

(III) Homoscedasticity: The errors around the estimated values of the response variable $Y_{i}$ have constant variance (the term homoscedasticity is from Greek and means 'same variance').

(IV) Linearity: The true relationship between the mean of the response variable $\mathbb{E}[Y]$ and the explanatory variables $x_{1},\ldots,x_{k}$ is a straight line.

For our next set of diagnostic tests, we are going to move forward in a rather blind manner, purely for demonstration purposes. When you are trying to model your own data, use your common sense and the knowledge you have gained during this class. To begin, fit a linear regression model to the data:

fit <- lm(Species ~ Area+Elevation+Nearest+Scruz+Adjacent, data=gala)
#look at the modeling results
summary(fit) 

Peruse the regression results, noting the parameter estimates, the adjusted $R^{2}$, the lack of an AIC value, etc. Now let's start to dive deeper into determining whether our model is a good fit and meets the assumptions.
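
As an aside, although summary() for an lm object does not report AIC, you can compute it directly if you want it for later model comparisons:

# AIC for the fitted model (not shown by summary)
AIC(fit)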

Check #1: Extreme Values / Outliers

Outliers may have undue influence upon your parameter estimates, therefore we should look and see if any exist within our dataset. Within the R package 'car', there is a function to test for outliers:

# BONFERRONI P-VALUE FOR EXTREME OBSERVATIONS
outlierTest(fit) #(package: car)

Your results should indicate that 2 of the 30 observations are considered to be outliers. At the very least you would want to investigate each observation, and you might even consider removing them from the analysis.
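
If you did decide to drop them, one simple way is to refit the model on a subset of the data. The row number below is only a placeholder; substitute the observations that outlierTest() actually flags in your run:

# Refit without a flagged observation (16 is a placeholder row number)
fit_noout <- update(fit, subset = -c(16))
summary(fit_noout)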

Check #2: Assumption of Independence

Correlation in the residuals means that there is room for improvement in the model, and extreme autocorrelation is often a symptom of a badly mis-specified model. Autocorrelation is also sometimes a byproduct of a violation of the linearity assumption--as in the case of a simple (i.e., straight) trend line fitted to data which are growing exponentially over time. The Durbin-Watson statistic provides a test for significant residual autocorrelation at lag 1: the DW statistic is approximately equal to 2(1-a), where a is the lag-1 residual autocorrelation, so ideally it should be close to 2.0--say, between 1.4 and 2.6 for a sample size of 50.

# TEST FOR AUTOCORRELATED ERRORS USING Durbin-Watson statistic
durbinWatsonTest(fit) #(package: car)

For our model, the result indicates that our observations are not autocorrelated; therefore we have met the first assumption of linear regression - independence of observations/errors.
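
To see where the 2(1-a) approximation comes from, you can compute the lag-1 residual autocorrelation yourself and compare 2(1-a) to the D-W statistic reported above (this is only a rough check):

# Rough check of the DW ~ 2*(1 - a) approximation
r <- residuals(fit)
a <- cor(r[-1], r[-length(r)])   # lag-1 autocorrelation of the residuals
2 * (1 - a)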

Check #3: Assumption of Normality

During model fit (i.e. the estimation of coefficients and the calculation of confidence intervals), sometimes the error distribution is "skewed" by the presence of a few large outliers. Since parameter estimation is based on the minimization of squared error, a few extreme observations can exert a disproportionate influence on parameter estimates. Calculation of confidence intervals and various significance tests for coefficients are all based on the assumption of normally distributed errors. If the error distribution is significantly non-normal, confidence intervals may be too wide or too narrow.

Before we move on to testing for normality, let's take a slight detour and discuss residuals. Recall that residuals are the differences between the measured samples and the estimated function values. These are also known as the 'raw' residuals; each raw residual has a different variance, and they are not necessarily independent. Therefore, the raw residuals are not suitable for checking the assumptions on the random errors. The standardized (or studentized) residuals are designed to overcome the problem of differing variances among the raw residuals; thus they are better suited than the raw residuals for checking the assumptions on the random errors. You can calculate the studentized residuals using the code below:

# DISTRIBUTION OF STUDENTIZED RESIDUALS
sresid <- studres(fit) #(package: MASS)
hist(sresid, freq=FALSE, main="Distribution of Studentized Residuals")
xfit <- seq(min(sresid), max(sresid), length=40)
yfit <- dnorm(xfit)
lines(xfit, yfit)
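
Note that studres() gives studentized residuals. If you would like to compare the raw, standardized, and studentized residuals discussed above, base R provides all three:

# Compare raw, standardized, and studentized residuals
res_compare <- cbind(raw = residuals(fit),
                     standardized = rstandard(fit),
                     studentized = rstudent(fit))
head(res_compare)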

 

The best test for normally distributed errors is a normal probability plot (Q-Q plot) of the standardized residuals. This is a plot of the fractiles of the error distribution versus the fractiles of a normal distribution having the same mean and variance. If the distribution is normal, the points on this plot should fall close to the diagonal line. A bow-shaped pattern of deviations from the diagonal indicates that the residuals have excessive skewness (i.e., they are not symmetrically distributed, with too many large errors in the same direction). An S-shaped pattern of deviations indicates that the residuals have excessive kurtosis--i.e., there are either too many or too few large errors in both directions.

# QQ PLOT OF STUDENTIZED RESIDUALS
qqPlot(fit, main="QQ Plot") #(package: car)

You will notice that in this case there is an 'S-shape' to the curve, where we have two points falling far from the diagonal line. Since it appears we have a couple of misbehaving data points (probably the outliers previously identified), we can use a couple of other statistics to assess the effect of these data points. One such statistic is Cook's distance. Cook's distance measures the effect of deleting a given observation. Data points with large residuals (outliers) and/or high leverage may distort the outcome and accuracy of a regression. Points with a large Cook's distance are considered to merit closer examination in the analysis.

# COOK'S DISTANCE PLOT - IDENTIFY D > 4/(n-k-1)
cutoff <- 4/((nrow(gala)-length(fit$coefficients)-2))
plot(fit, which=4, cook.levels=cutoff, main="Cook's Distance")

You will notice that three of the data points appear to merit closer examination within our model, specifically 'sample 16'. We can gain yet another perspective by creating an Influence plot.
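
If you prefer numbers to a plot, the same information is available from cooks.distance(); for example, you can list the observations that exceed the cutoff computed above:

# Observations whose Cook's distance exceeds the cutoff
cd <- cooks.distance(fit)
which(cd > cutoff)
sort(cd, decreasing=TRUE)[1:3]   # the three most influential observations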

# INFLUENCE PLOT (ESC OUT)
influencePlot(fit, id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's Distance")

The larger the circle, the greater the influence of the sample on the model. Leverage plots can also be used to assess the effect of a data point on the model fit. A leverage point is a point for which the observed value has a large influence on the analysis. The value of this isolated point is disproportionately influential on the least squares line (one might say that it works as a lever: if the value of this observation is changed, the least squares line changes considerably). In contrast, if the value of one of the points within the main cluster is changed, the least squares line will not be affected to the same extent.

# LEVERAGE PLOTS
leveragePlots(fit) #(package: car)
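
Leverage can also be inspected numerically through the hat values; a common rule of thumb is to take a closer look at points whose leverage exceeds roughly 2(k+1)/n:

# Hat values (leverage); flag points above the 2*(k+1)/n rule of thumb
h <- hatvalues(fit)
which(h > 2 * length(fit$coefficients) / nrow(gala))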

 

Check #4: Assumption of Homoscedasticity

Violations of homoscedasticity make it difficult to gauge the true standard deviation of the forecast errors, usually resulting in confidence intervals that are too wide or too narrow. In particular, if the variance of the errors is increasing over time, confidence intervals for out-of-sample predictions will tend to be unrealistically narrow. Heteroscedasticity may also have the effect of giving too much weight to a small subset of the data (namely the subset where the error variance was largest) when estimating coefficients. One possible method to test for homoscedasticity is to apply a Non-Constant Variance Score Test:

# NON-CONSTANT ERROR VARIANCE TEST
ncvTest(fit) #(package: car)

If the calculated p-value is below your chosen threshold (e.g. α = 0.05), then your model is not meeting the assumption of homoscedasticity. A more visual and probably more common approach is to plot the studentized residuals against the fitted values:
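
If you happen to have the lmtest package installed, the closely related Breusch-Pagan test offers a second opinion on constant error variance:

# Optional cross-check with the Breusch-Pagan test (requires the lmtest package)
library(lmtest)
bptest(fit)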

# PLOT STUDENTIZED RESID VS FITTED
plot(fitted(fit), studres(fit))
abline(0,0)

Within this plot we are looking to see if there is a systematic pattern in the residuals around zero. When looking for patterns in residual plots, there are three main features which are important. If the residuals seem to increase or decrease in average magnitude with the fitted values, it is an indication that the variance of the residuals is not constant; that is, the errors are heteroscedastic. If the points in the plot lie on a curve around zero, rather than fluctuating randomly, it is an indication that the data do not meet the assumption of linearity. If a few points in the plot lie a long way from the rest of the points, they might be outliers. But what should we do if the residuals are not homoscedastic? One option is to transform the response variable.

# STUDENTIZED RESIDUALS VS. FITTED VALUES - TRANSFORMATION
spreadLevelPlot(fit) #(package: car)

The resulting plot can be viewed somewhat similarly to the previous residual plot. However, one potentially convenient statistic provided by this function is the calculation of a suggested Power Transformation value. In theory, if one were to transform their response variable using the following formula:

$Y_{t} = (Y^{p} - 1) / p$

where p is the Power Transformation value and Y is the observed response value, then the model residuals will hopefully be homoscedastic and meet the assumption criteria.
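
As a minimal sketch of that idea (the power value below is only a placeholder; use whatever value spreadLevelPlot() suggests for your model), the transformation and refit might look like this:

# Apply a power transformation to the response (p = 0.5 is a placeholder value)
p <- 0.5
gala$Species_t <- (gala$Species^p - 1) / p
fit_t <- lm(Species_t ~ Area + Elevation + Nearest + Scruz + Adjacent, data=gala)
ncvTest(fit_t)   # re-check the constant variance assumption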

Check #5: Assumption of Linearity

If you fit a linear model to data which are nonlinearly related, your predictions are likely to be seriously in error, especially when you extrapolate beyond the range of the sample data. In multiple regression, the correlations among predictors often make it difficult to see the unique effect of one predictor while controlling for the others. Partial regression residual plots (Larsen & McCleary, 1972) are designed to show the relationship between $y$ and each $x_{k}$, after the effects of all other predictors have been removed.

# PARTIAL RESIDUAL PLOT
crPlots(fit) #(package: car)

If the partial residual plot shows a straight line (i.e. the green line; the red line is the slope of the parameter estimate derived during model fit), it is an indication that the true relationship between the response variable and the kth explanatory variable is a straight line, when all other variables are taken into account. If the plot shows a non-linear relationship, it is an indication that the explanatory variable affects the response variable in a non-linear fashion. However, partial residual plots can fail if there are nonlinear relationships among the predictors. Dennis Cook developed the CERES plot to show a curve in the relationship of the dependent variable to a predictor, in spite of any nonlinear relationships among the predictors.

# CERES PLOT
ceresPlots(fit) #(package: car)

 

THE COUP DE GRÂCE

As you are becoming acutely aware, Program R wouldn't be Program R without someone developing a package to try and make your statistical life a little easier. Below we will use a package that helps determine if our model fits the data and meets the assumptions:

# Global test of model assumptions
library(gvlma)
gvmodel <- gvlma(fit)
summary(gvmodel)

You will notice that our model fails nearly every test. What happens if we model the number of endemic species instead of the total number of species? Let's try:

#Fit a new model
fit2 <- lm(Endemics ~ Area+Elevation+Nearest+Scruz+Adjacent, data=gala)
gvmodel2 <- gvlma(fit2)
summary(gvmodel2)

Yes, now all of the global assumptions have been met. I, personally, would want to go through each of the steps manually to get a better feel for the data. I just like being in control of my statistical analyses...unless I was the one who wrote the package ;)


 

THERE IS NO HOMEWORK ASSIGNMENT FOR THIS LAB