Lab 3 - Ecoinformatics Initiative

COMPUTER LAB 3: INFORMATION THEORETIC APPROACH TO

MODEL SELECTION AND INFERENCE
 

Adapted from Dr. Jim Peterson, Oregon State University/University of Georgia

 

Simulated data are often used to help understand how and why certain quantitative techniques work. They also provide a known benchmark against which to examine the ability of each technique to detect and estimate important effects. Thus, for this lab, we will be using simulated data to examine various aspects of model selection and inference. Be sure to review Burnham and Anderson 2001 before beginning the lab.

 

WE WILL USE THE FOLLOWING TWO R SCRIPTS:

Simulate_Data.r

AIC_Selection.r

 

GENERATING MODEL

Simulated sampling/experimental data are created with a generating model. Generating models provide a convenient benchmark for examining the efficacy of various modeling procedures. For our model selection evaluations, we will use the data generated with the following model:

Y = 5 + A*2 + B*5 + C*2 + D*2 + Error

 

where A, B, C, and D are normally distributed variables with means (standard deviations) of 8(2.5), 8(2.5), 5(2) and 3(3), respectively. The Error has a mean of zero and a variance of 40. In reality, we will probably never know the 'true' model, even if one exists. Consequently, we'll assume that factor D is unknown to us and that information on D was never collected during our experiment (i.e. we won't include it in the model fitting).
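For reference, here is a minimal sketch of how such a generating function might be written (the function name rand.reg.sketch and the column order are illustrative; the actual rand.reg function in Simulate_Data.r may differ in detail):

#illustrative sketch of a generating function similar to rand.reg
#(an Error variance of 40 corresponds to a standard deviation of sqrt(40))
rand.reg.sketch <- function(x){
  A <- rnorm(x, mean=8, sd=2.5)
  B <- rnorm(x, mean=8, sd=2.5)
  C <- rnorm(x, mean=5, sd=2)
  D <- rnorm(x, mean=3, sd=3)
  Error <- rnorm(x, mean=0, sd=sqrt(40))
  Y <- 5 + A*2 + B*5 + C*2 + D*2 + Error
  data.frame(Y, A, B, C, D, Error)
}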

To create the simulated dataset, open Simulate_Data.r and examine the R script file. Notice how the random variables A through D are generated within the rand.reg function. The rand.reg function requires one argument, x, which represents the sample size, and it returns a data.frame containing a row for each randomly generated observation from the regression equation (see above; A, B, C, D, Y, Error). Read the rand.reg function into your R session, then run the following code to generate 100 rows of 'data' (you can think of this as a sample that you collected in the field) and calculate the mean (sd) of the simulated data.

#run the simulation with sample size = 100
rand.test<-rand.reg(100)
#summarize the results
rbind(sapply(rand.test, FUN=mean), sapply(rand.test, FUN=sd))

Run these 2 lines of code 5 separate times. Note that the first row of the matrix contains the means and the second row contains the standard deviations.

Do the means and variances differ?

 

Now set the sample size to 500, estimate the means and standard deviations 5 more times and compare.

#run the simulation with sample size = 500
rand.test<-rand.reg(500)
#summarize the results
rbind(sapply(rand.test, FUN=mean), sapply(rand.test, FUN=sd))

 

MODEL SELECTION WITH HYPOTHESIS TESTS

One commonly used model selection technique relies on hypothesis tests. Here, independent variables are tested individually or in combination and are retained in the model when hypothesis tests (p-values) indicate that they are 'significantly' related to the response variable. Common names for this process are Forward Selection, Backward Selection, and Stepwise Selection over all possible combinations of the variables. Often variables are retained in models when their (t-test or F-test) p-values are <0.05. As stated in lecture, I do not agree with this method and therefore we will not dedicate lab time to learning these methods...but hey, at least you now know that they exist in case you run across them within the literature.

MODEL SELECTION WITH AIC

Now that you are somewhat familiar with AIC model selection after this week's lecture and review material, let's walk through the steps to develop the competing models and generate AIC, ΔAIC, and AIC weights (wi). We will complete the AIC model selection process using Program R. Luckily for us, a very convenient R package has been developed. Within the R GUI, install the AICcmodavg package, then load the package for future use:

#install the package once, if needed: install.packages("AICcmodavg")
library(AICcmodavg)

STEP 1 - Develop Models:  The first step in the process of model selection using Program R is to actually develop the candidate set of models. Other than collecting 'good' data, this step is extremely important and should not be taken lightly. Recall that our simulated 'sample' data have a response variable, Y, and three predictor variables A, B, and C (Note: all of the variables are continuous). Open AIC_Selection.r and you will notice 7 individual models stored to unique R objects. What type of model are we specifying? As you can see, all possible combinations of the variables A through C are being fit to the data. This is only for demonstrative purposes and I do not recommend this approach (all possible combinations) for real data analyses. Scientists should always give considerable thought to their set of candidate models and include only those that make ecological sense.

You will notice that the first model, 'lr01', has all of the predictor variables. This is a very important model that is often called the Global Model. The Global Model must contain all of the parameters that are to be used within subsequent candidate models. So, if you deem that an interaction is occurring between two predictor variables (X1 and X2), the Global Model must have the form: Y = X1 + X2 + X1*X2, in addition to any other predictors you deem to be important.
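For illustration only, a global model containing an interaction could be specified as follows (here using A and B from our simulated data purely to show the syntax; interactions are not part of this lab's candidate set, and the object name lrInt is hypothetical):

#illustration of interaction syntax in a global model; Y ~ A*B is shorthand for Y ~ A + B + A:B
lrInt <- glm(Y ~ A + B + A:B, family=gaussian(link="identity"), data=rand.test)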

STEP 2 - Test GOF:  Before running a model, generate a new random dataset:

#run the simulation with sample size = 100
rand.test<-rand.reg(100)
#summarize the results
rbind(sapply(rand.test, FUN=mean), sapply(rand.test, FUN=sd))

As always, we should first test for collinearity between our predictor variables:

#typically a good idea to test for collinearity of predictor variables
cor(rand.test[,2:6])

Submit all of the models to Program R. You have just run 7 unique linear regression models on the same dataset! Let's explore the glm object for the Global Model by typing into R:

#this displays linear reg results
summary(lr01)

What you should see is a list of the model coefficients (i.e. parameter estimates) and the standard error for each independent variable. Investigate the rest of the summary output to familiarize yourself with it. This output is essentially identical to output from other statistical software packages. One of the most important steps in developing a 'useful' model is to assess whether the model actually fits the data. Thus, the best way to assess the Goodness-of-Fit (GOF) for all of the candidate models is to test the Global Model. In this case we are using linear regression to fit our model. One of the easiest ways to evaluate GOF is by examining the residuals, such as a plot of the residuals vs. predicted/fitted values. Submit the following code to R:

#check for homoscedasticity
plot(fitted(lr01), resid(lr01))
abline(0,0)

The plotted values should center on zero and have no discernible pattern (i.e. be homoscedastic). If the residuals do show a pattern (i.e. are heteroscedastic), then the response data may require a transformation or you may consider using a different type of GLM to approximate the response. If the Global Model is deemed to be a good fit, we can then make the assumption that the subsequent candidate models are also a good fit.
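As an optional additional diagnostic (not required for this lab), a normal Q-Q plot of the residuals can help you judge whether the errors are roughly normally distributed:

#optional: Q-Q plot of the residuals to check approximate normality
qqnorm(resid(lr01))
qqline(resid(lr01))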

STEP 3 - List models by name: Now that we have determined our Global Model meets our GOF criteria, we need to perform a few 'housekeeping' steps within R to prepare our candidate models for the model selection process. First we need to create a 'list object' containing our fitted set of candidate models. Second it will make our lives easier if we give appropriate names to each of our models. Submit the following lines of code to R:

#Create list of all the models
lrModels<- list(lr01, lr02, lr03, lr04, lr05, lr06, lr07)
#Give each model a name
lrNames <- c("Global", "A", "B", "C", "AB", "BC", "AC")

STEP 4 - Calculate AICc table:  By now your curiosity has to be bubbling over, "What will be the results of the model selection?!" To squelch any anxiety, run the following R code:

aicWt<-aictab(cand.set=lrModels, modnames=lrNames, sort=TRUE, c.hat=1)
#Display AIC table
aicWt

The previous code uses the already loaded AICcmodavg package. Within that package is a function named 'aictab' that requires a list of glm models and the associated names of the models. The output should be similar to the table below (Note: The table below is a recommended format for publication).

Model   K  AICc    ΔAICc  AICcWt
A       3  517.30   0.00   0.57
AC      4  519.54   2.24   0.19
AB      4  519.64   2.34   0.18
Global  5  522.01   4.71   0.05
B       3  526.28   8.98   0.01
C       3  526.50   9.20   0.01
BC      4  528.60  11.31   0.00

 

There are a couple of things to point out with this table. 1) The output is sorted by AICcWt because we set 'sort=TRUE'. This is very handy and makes reading the table much more efficient. 2) Notice column K for the number of parameters. Because this is a linear regression, we have to account for both the intercept (β0) and the error variance as model parameters in addition to A, B, and C (a quick way to verify K is shown below). 3) Your AICc output should differ from this output because each of you is using randomly derived data. In the above example, the best approximating model (i.e. the most parsimonious) is the univariate model containing only the predictor variable A. What would happen if we created a new random dataset and performed model selection? Chances are we may get a different AIC table, which leads us to the topic of Model Selection Uncertainty.
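To verify the K column yourself, you can count the estimated parameters directly (lr02 is the A-only model from AIC_Selection.r):

#K for a gaussian glm = number of regression coefficients + 1 for the error variance
length(coef(lr02)) + 1   #returns 3 for the A-only model (intercept, A, error variance)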

STEP 5 - Model Selection Uncertainty: Let's assume that I conducted 1000 experiments (in this case randomly created 1000 datasets) and used AIC to select the best approximating model.

AIC MODEL SELECTION FREQUENCIES (N = SAMPLE SIZE)

Notice that no one model was selected more than 50% of the time with either sample size. Conservation biologists and resource managers base many of their management strategies on the prediction and interpretation of the best fitting model. This model, however, can change from experiment to experiment simply due to random chance (as is shown above). Thus, in most instances, we are uncertain as to which model best approximates the 'true' ecological processes. This is called model selection uncertainty. Conservation biologists who ignore model selection uncertainty may fail to consider important effects when relying on the best fitting model for inference.
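Here is a sketch of how you could replicate this kind of experiment yourself (assuming rand.reg and AICcmodavg are loaded; the 100 replicates, sample size of 100, and object names are illustrative — you would use 1000 replicates to match the example above):

#illustrative: repeat the simulate/fit/select cycle and tally which model ranks best
top.model <- replicate(100, {
  d <- rand.reg(100)
  mods <- list(glm(Y ~ A+B+C, family=gaussian, data=d),
               glm(Y ~ A, family=gaussian, data=d),
               glm(Y ~ B, family=gaussian, data=d),
               glm(Y ~ C, family=gaussian, data=d),
               glm(Y ~ A+B, family=gaussian, data=d),
               glm(Y ~ B+C, family=gaussian, data=d),
               glm(Y ~ A+C, family=gaussian, data=d))
  nms <- c("Global", "A", "B", "C", "AB", "BC", "AC")
  as.character(aictab(cand.set=mods, modnames=nms)$Modnames[1])
})
table(top.model)  #model selection frequencies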

What are some of the implications of ignoring model uncertainty?

 

To incorporate model selection uncertainty (Burnham and Anderson 2002, Anderson 2008), one can base inferences on a confidence set of models (somewhat analogous to a confidence interval for a mean estimate). The confidence set is determined using AIC weights and includes those models that fall within some predetermined fraction of the highest weight. Royall (1997) suggested a 1/8 (12.5%) minimum cutoff point as a general rule-of-thumb for evaluating strength of evidence. Others have used a 10% cutoff. Create a new random set of data in R:

#run the simulation with sample size = 500
rand.test<-rand.reg(500)
#summarize the results
rbind(sapply(rand.test, FUN=mean), sapply(rand.test, FUN=sd))

Now resubmit the 7 linear regression models and recreate the list:

#store candidate set of models to the variables/objects
lr01<- glm(Y ~ A+B+C, family=gaussian(link="identity"), data=rand.test, na.action=na.pass)
lr02<- glm(Y ~ A, family=gaussian(link="identity"), data=rand.test, na.action=na.pass)
lr03<- glm(Y ~ B, family=gaussian(link="identity"), data=rand.test, na.action=na.pass)
lr04<- glm(Y ~ C, family=gaussian(link="identity"), data=rand.test, na.action=na.pass)
lr05<- glm(Y ~ A+B, family=gaussian(link="identity"), data=rand.test, na.action=na.pass)
lr06<- glm(Y ~ B+C, family=gaussian(link="identity"), data=rand.test, na.action=na.pass)
lr07<- glm(Y ~ A+C, family=gaussian(link="identity"), data=rand.test, na.action=na.pass)

#Create list of all the models
lrModels<- list(lr01, lr02, lr03, lr04, lr05, lr06, lr07)
#Give each model a name
lrNames <- c("Global", "A", "B", "C", "AB", "BC", "AC")

Now create a confidence set of models using a 90% level (i.e. keep the top-ranked models whose AICc weights sum to 0.90, excluding roughly the bottom 10% of the evidence):

#Create the confidence set of models using the 10% threshold
aicSet<-confset(cand.set=lrModels, modnames=lrNames, level=0.90)
aicSet #show the candidate set of models 

You will notice that the confidence set includes fewer models than the original candidate set. The models falling outside of the threshold can be deemed to have insufficient evidence to be plausible descriptions of the particular phenomenon of interest (i.e. the dependent variable).

STEP 6 - Model uncertainty and parameter estimates: Often the parameter estimates (e.g. slopes and intercepts in regression models) for the same variable differ among models within the candidate set. We could simply choose the most plausible model, but as we've shown, that choice is fraught with uncertainty. We could also just average each parameter across the candidate set of models. However, this is somewhat unsatisfying because one model may have a much higher weight of evidence (wi) than another model. Thus, our best option is to use the AIC weights to calculate the parameter estimates.

The idea behind AIC model averaging is to use the AIC weights to weight the parameter estimates from each model and then combine them. These are known as model-averaged parameter estimates. Thus, we incorporate model selection uncertainty directly into the parameter estimates using the AIC weights. Model-averaged parameter estimates are calculated only for those parameters that are included in the confidence set of models, and the final model built from the model-averaged parameter estimates is called the composite model. The composite model should always be used for interpretation.
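To see what the weighting does, here is an illustration of the averaging calculation for a single coefficient (the coefficient of A). This is for intuition only; the modavg function used below does the calculation for you and also computes the unconditional standard error. The object names are illustrative:

#illustration of the weighting idea behind model averaging, for the coefficient of A
aic.tab <- aictab(cand.set=lrModels, modnames=lrNames, sort=FALSE)
has.A   <- sapply(lrModels, function(m) "A" %in% names(coef(m)))
w       <- aic.tab$AICcWt[has.A] / sum(aic.tab$AICcWt[has.A])  #re-scale weights over models containing A
est.A   <- sapply(lrModels[has.A], function(m) coef(m)["A"])
sum(w * est.A)  #model-averaged estimate of the coefficient of A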

Copy and paste the following two lines of code, submit them separately, and edit them so that they include only the models in your confidence set:

#Create new list of models given the candidate set,
#so you must edit this list before submitting
lrCanSet<- list(lr01, lr02, lr03, lr04, lr05, lr06, lr07)
lrCanNames <- c("Global", "A", "B", "C", "AB", "BC", "AC")

Now, using your new candidate set of models, calculate the model-averaged parameter estimates for each variable:

#Calculate parameter estimate for each model parameter in candidate set
aicAvg<-modavg(cand.set=lrCanSet, parm="(Intercept)", modnames=lrCanNames, conf.level=0.95)

Make sure not to forget the intercept (the default parm in the code above). You will want to record the estimate and SE for use in your composite model. The parameter estimate standard errors are used to determine the precision of the individual parameter estimates in regression. Large values (relative to the parameter estimates) indicate imprecise estimates, and small values indicate precise estimates. Precise estimates (small standard errors) are preferred because they indicate that the parameter estimate is probably closer to the 'true' value. One thing to keep an eye out for is whether your 95% CI crosses zero. When this occurs, it is difficult to determine the 'direction' of the effect of that predictor variable on your response variable.
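To obtain model-averaged estimates for the remaining predictors, you could repeat the call and change parm each time (the object names below are illustrative, and only request parameters that appear in at least one model of your confidence set):

#repeat for each predictor retained in the confidence set, then display the results
aicAvg.A <- modavg(cand.set=lrCanSet, parm="A", modnames=lrCanNames, conf.level=0.95)
aicAvg.A
aicAvg.B <- modavg(cand.set=lrCanSet, parm="B", modnames=lrCanNames, conf.level=0.95)
aicAvg.B
aicAvg.C <- modavg(cand.set=lrCanSet, parm="C", modnames=lrCanNames, conf.level=0.95)
aicAvg.C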

When model-averaging, we must include 2 sources of variance: 1) variance due to model selection and 2) sampling variance. The modavg function of the AICcmodavg package provides a convenient method for calculating the unconditional variance of the model-averaged parameter estimates.
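For reference, one common form of the unconditional standard error (Burnham and Anderson 2002) for a model-averaged estimate θ_avg is:

SE(θ_avg) = Σ wi * sqrt( var(θ_i | g_i) + (θ_i − θ_avg)^2 )

where θ_i is the estimate from model g_i and wi is that model's AIC weight; the first term inside the square root is the within-model (sampling) variance and the second reflects model selection uncertainty. Check the modavg documentation for the exact variant it implements.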

A CAUTIONARY NOTE

When comparing the relative evidence for a particular factor or hypothesis, it is important to remember that these comparisons are influenced by the error in your measures of the independent variables (i.e. the predictors) as well as the sampling error of the response variable.

 

 

 

COMPUTER EXERCISE ASSIGNMENT 2

DUE February 11, 2014

  1. The dataset shapedata.txt contains data on the body shape of a random selection of men and women. It contains the following variables:
    male - an indicator variable of whether or not the person is a man
    height - the height of the person
    weight - the weight of the person
    age - the age of the person

    You can read a tab-delimited text file into R using the following code; just change the directory path:
    sdata<-read.table("C:/Users/salbeke/Documents/Lab3AIC/shapedata.txt", sep="\t", header=TRUE)

     
    • Fit 3 candidate models of a person's weight using glm in R
    • Compare the AICc, ΔAICc and wi between models.
    • Provide only the AIC table and your interpretation. Which is the best model? How much better is it?

       
  2. For this exercise, you can use your own data OR the fitness dataset described below. Using the tools/scripts you deem necessary, create a set of candidate models. The modeling results should be presented in table format as described during lab. I expect the write-up to be in a "methods and results" format for publication (this also means you should include cited literature as needed). Items to consider including are your a priori hypotheses, AIC table(s) and your brief interpretation of the results.
    • Develop at least 3 hypotheses based on the available information (your data OR the fitness data)
    • Create a model to represent each hypothesis
    • Calculate AICc, AICc weights and model-averaged estimates
    • Interpret the results

      The fitnessdata.txt dataset contains data on the fitness characteristics of a random selection of men and women. I am interested in the influence of various combinations of these factors on oxygen consumption.
      oxy - a person's oxygen consumption
      runtime - the amount of time a person ran during the fitness test
      runpulse - the average pulse rate during running
      maxpulse - the maximum pulse rate during the test
      weight - the weight of the person
      age - the age of the person

       
DON'T FORGET TO CHECK YOUR VARIABLES FOR COLLINEARITY!!
ALSO, DON'T FORGET TO INCLUDE ALL PREDICTORS IN THE GLOBAL MODEL