Lab 2 - Random Sample Selection

COMPUTER LAB 2: BRIEF INTRODUCTION TO R SPATIAL OBJECTS... THEN

RANDOM SAMPLE SELECTION AND PSEUDO-ABSENCE CREATION

 

You can download the R Script here: AdvSpatial_Lab2.R

 

R SPATIAL OBJECTS:

You can go here (http://geostat-course.org/system/files/monday_slides.pdf) and here (http://cran.r-project.org/web/views/Spatial.html) to get additional information provided by Roger Bivand on R Spatial Objects. 

Vector Data:

For this class, the primary packages that we will use to manipulate and process vector datasets are rgdal, rgeos and sp.

  • sp :  A package that provides classes and methods for spatial data. The classes document where the spatial location information resides, for 2D or 3D data. Utility functions are provided, e.g. for plotting data as maps, spatial selection, as well as methods for retrieving coordinates, for subsetting, print, summary, etc.
     
  • rgdal : Provides bindings to Frank Warmerdam's Geospatial Data Abstraction Library (GDAL) (>= 1.6.3) and access to projection/transformation operations from the PROJ.4 library. The GDAL and PROJ.4 libraries are external to the package, and, when installing the package from source, must be correctly installed first. Both GDAL raster and OGR vector map data can be imported into R, and GDAL raster data and OGR vector data exported. Use is made of classes defined in the sp package. Windows and Mac Intel OS X binaries (including GDAL, PROJ.4 and Expat) are provided on CRAN.
     
  • rgeos :  Interface to Geometry Engine - Open Source (GEOS) using the C API for topology operations on geometries. The GEOS library is external to the package, and, when installing the package from source, must be correctly installed first. Windows and Mac Intel OS X binaries are provided on CRAN.

We will use the rgdal functions readOGR and writeOGR to read/write ESRI Shapefiles that contain our spatial data. Once a shapefile is read into R, it is stored in memory as an sp object of class SpatialPointsDataFrame (points), SpatialLinesDataFrame (lines) or SpatialPolygonsDataFrame (polygons). As you would expect, these objects have multiple embedded object types that help describe the spatial and tabular components of the information.

 

#Grab the necessary packages
require(sp)
require(raster)
require(rgdal)
require(rgeos)

#readOGR can read any vector format supported by your GDAL/OGR build; here we use ESRI Shapefiles
#points - check out the R object type (class) in the Workspace

pts<- readOGR(dsn="//wygiscdev/AdvSpatial/LabData/lab2", layer="Cactus")
#streams - check out the R object type (class) in the Workspace
lines<- readOGR(dsn="//wygiscdev/AdvSpatial/LabData/lab2", layer="NHD_Streams")
#study area - check out the R object type (class) in the Workspace
poly<- readOGR(dsn="//wygiscdev/AdvSpatial/LabData/lab2", layer="StudyArea")
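
As a quick check (a minimal sketch), confirm the sp class that each readOGR call returned:

#each layer type maps to a different sp class
class(pts)   #SpatialPointsDataFrame
class(lines) #SpatialLinesDataFrame
class(poly)  #SpatialPolygonsDataFrame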

#first, let's visualize the data
#let's plot the vector data to get an areal perspective and see if it looks like it does in ArcGIS

plot(lines, col="blue")
plot(poly, add=TRUE)
plot(pts, add=TRUE, col="red")

#OK, visually this is looking correct
#Now let's check out the attribute tables

View(pts)
View(lines)
View(poly)

#Look under the hood of a spatial object
str(poly) #or str(lines) or str(pts)
#you will notice there are objects called 'slots' (data, polygons(or lines or coords), plotOrder, bbox, proj4string)
#these slots essentially mimic the file structure of a shapefile
#@data = the *.dbf = the attribute table
#@polygons/lines/coords = *.shp = the geometries that define the spatial components
#@bbox = the extent of the object(s) in rectangular form
#@proj4string = *.prj = the projection file describing how to draw the coordinates in space

#the number of rows in @data will match the number of features listed in @polygons/lines/coords
#you can have complex multipart features or simpler singlepart features
#because of this, you can select features and their associated attributes by simply referring to their index
#in this example we are retrieving the 5th line

lines[5,]
#we can plot this as well
plot(lines[5,])
#this logic works for retrieving only the attribute table information. 
#Thus @data is simply a data.frame() associated with spatial features

lines@data[5,]
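
You can also subset features by an attribute query rather than a row index. A minimal sketch (the field name GNIS_NAME and the value are hypothetical; check names(lines@data) for the real fields in your table):

#subset features by an attribute value instead of a row index
trout<- lines[lines@data$GNIS_NAME == "Trout Creek", ]
plot(trout)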


Raster Data:

  • raster : Reading, writing, manipulating, analyzing and modeling of gridded spatial data. The package implements basic and high-level functions and processing of very large files is supported.
 

#R can read in a multitude of raster formats, such as *.img, *.tif (GeoTIFF) and *.asc. It can also
#handle multidimensional formats such as HDF and netCDF. WARNING: R cannot read in the ESRI GRID format,
#so pay close attention to the format in which you have received data from others. I prefer to use *.img

#read in the existing 30m DEM
dem<- raster("//wygiscdev/AdvSpatial/LabData/lab2/DEM_30m.img")
#visualize the raster
plot(dem)
plot(poly, add=TRUE, col="red")
#look at the structure of the object. It tells you all that you need to know about the raster
dem
str(dem)

#the raster package can perform many tasks on raster objects, often in a more convenient manner
#we need to crop the DEM to match our study area

cdem<- crop(dem, poly)
plot(cdem)
plot(poly, add=TRUE)
plot(pts, add=TRUE)

#calculate the slope using the DEM
slp<- terrain(cdem, opt="slope", unit="degrees", progress="text")

#one very useful thing about the raster package is the ability to stack rasters in-memory and perform functions
#The only caveat is that for R to stack the rasters, each raster must have the exact same extent, origin, 
#coordinate system and cell size.
#Stacked rasters give us a lot of power to perform broad-scale spatial analyses
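
If you are unsure whether two rasters will stack, compareRaster() from the raster package tests these requirements before you try; a quick sketch:

#TRUE only if extent, row/col count, CRS, resolution and origin all agree
#stopiffalse=FALSE reports FALSE on a mismatch instead of throwing an error
compareRaster(cdem, slp, res=TRUE, orig=TRUE, stopiffalse=FALSE)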

rs<- stack(cdem, slp)
#see the data and notice that a single raster object now has two bands
rs

#create a moving window/focal raster of the slope
#note: w must be a weights matrix; padValue fills the edge cells so the window stays full
ms<- focal(rs[[2]], w=matrix(1, nrow=3, ncol=3), fun=mean, pad=TRUE, padValue=3.673, progress="text")
names(ms)<- "MeanSlope"
#add to the raster stack
rs<- stack(rs, ms)
#view data, 3-bands now exist
rs

#maybe we are interested in knowing the elevation and slope for each point
e<- data.frame(extract(rs, pts))
#now add the results to the in-memory point shapefile
pts@data<- data.frame(pts@data, e[match(rownames(pts@data), rownames(e)),])
View(pts@data)

 

 

RANDOM SAMPLE SELECTION

During this computer exercise, we will use both ArcGIS and Program R to randomly select points from an occurrence dataset in a manner that will reduce inherent spatial autocorrelation within the dataset. As is often the case, we may want to employ an algorithm that requires both presence and absence points. We will develop a set of randomly distributed pseudo-absence points that provide an appropriate filter to match our occurrence data.

THE DATA (PLEASE DO NOT DISTRIBUTE THIS DATA TO OTHERS OUTSIDE OF CLASS)

The occurrence data consist of 5,000 points at which a rare cactus (Sclerocactus brevispinus) was observed. Our goal is to develop a Species Distribution Model to predict potential locations of this cactus in locations that have not yet been surveyed. To help constrain the study area (extent), we have been requested to use the HUC12 watersheds as logical boundaries for our predictions.

STEP 1: PREPPING THE DATA IN R

 

##From your network drive (\\wygiscdev\AdvSpatial\LabData), add the BoundingHUCs and SCBR shapefiles from the lab2 folder.

#First Let's read in the HUCs shapefile using rgdal from the network drive
huc<- readOGR(dsn="//wygiscdev/AdvSpatial/LabData/lab2", layer="BoundingHUCs")
#view the hucs
plot(huc)
#look at the information such as the coordinate system
proj4string(huc)
#check out the attribute table
View(huc)

#Now read in the Cactus Observation points
cactus<- readOGR(dsn="//wygiscdev/AdvSpatial/LabData/lab2", layer="SCBR")
#view the points
plot(cactus, add=TRUE)
#look at the information such as the coordinate system
proj4string(cactus)
#check out the attribute table
View(cactus)
#check out the data types of the attributes
str(cactus@data)

Notice how 'clumped' the data are. The interesting thing about this dataset is that we do not know the accuracy of the GPS receivers. Thus, we cannot filter out observations based on GPS accuracy, leaving us with many points that may be highly spatially autocorrelated. Secondly, we need to decide the spatial extent from which to perform our analyses. We can either limit the extent to HUC12s that have an occurrence record or use all of the HUC12s. By choosing the smaller extent, the accuracy of the model will likely increase because the variability in the predictor variables will be reduced. However, we may sacrifice the ability of the model to extrapolate to areas outside of where the model was specified (i.e. the model is highly specific to the study area). On the other hand, if we increased the extent to include all of the HUC12s, our model may become less accurate, but it may also be more transferable to other areas because the model fit accounted for greater variability within the predictors. We could debate about which path to choose, but for brevity and processing time, we will use only the HUC12s having occurrence data for our extent:

#now use a spatial query to select the HUCs that have observations
h<- over(huc, cactus, returnList=FALSE) #sp package
#create reduced set of huc polygons
huc.sm<- huc[row.names(h[is.na(h[,1])==FALSE,]),]
#plot the smaller extent hucs
plot(huc.sm)
#plot the observations again
plot(cactus, add=TRUE)

#now save the new shapefile to your network directory because we will use this again later
#make sure that you change the directory to be YOUR working directory
#you will need to create your own lab2 directory

writeOGR(huc.sm, dsn="//wygiscdev/AdvSpatial/salbeke/lab2", layer="HUC_SCBR", driver="ESRI Shapefile")

 

  • I would just like to make a point here. Recall that when setting an extent you want to account for possible influences from outside the actual study area. We may need to think about buffering HUC_SCBR.shp to account for outside influence; a quick buffering sketch follows below. More to come, just food for thought.
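
A minimal sketch of that idea using rgeos (the 1000 m buffer width is an arbitrary placeholder, not a recommendation):

#buffer the reduced HUCs by 1 km; width is in map units (meters here)
huc.buf<- gBuffer(huc.sm, byid=FALSE, width=1000)
plot(huc.buf)
plot(huc.sm, add=TRUE)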
     

STEP 2: RANDOMLY SELECT SAMPLES OF OBSERVED POINTS

There are several options at our disposal to limit the number of samples. We could simply select a set number of samples from the dataset, we could select a certain percent from each HUC12, we could select sites that are not within a certain distance from another sample, etc. For this example, we are going to set a minimum distance that a sample must be from another occurrence point. The shapefile SCBR.shp has 5,000 sample locations, and with this feature class we are assuming:

  1. Each sample location was captured with the same level of accuracy. We have to make this assumption because there isn't an attribute that correctly describes the point's GPS accuracy. 
     
  2. Each sample location is unique (i.e. each point location represents an individual plant, so no plant was counted twice).
     
  3. Each sample was collected by a person fully capable of correctly identifying the species (i.e. no misidentifications). 
     
  4. Each sample was opportunistically collected, not part of a set study design.
     

Given that we have chosen to use Program R to help us randomly sample our locations, we must first ensure that we have a way of identifying each row as unique other than its spatial location. Do the following steps:

#check out the attribute table. 
View(cactus)

#the table doesn't have a true/static unique ID column, so create one named 'PointID' and make it the 1st column
cactus@data<- data.frame(PointID=as.numeric(row.names(cactus@data)), cactus@data)
#check out the attribute table again. 
View(cactus)
#now save the updated observation points to your own directory for later use
writeOGR(cactus, dsn="//wygiscdev/AdvSpatial/salbeke/lab2", layer="SCBR", driver="ESRI Shapefile")

  • One thing that I want to point out is that far more often than not I choose to make my unique column AlphaNumeric (i.e. a combination of letters and numbers). I just find that I have much more flexibility in how I can treat the data than if the field were numeric only. ALSO REMEMBER that data types are extremely important to keep track of!!! To query a numeric field you use different syntax than to query a text field.
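
If you want to experiment with that advice, here is a minimal sketch (the 'SCBR_' prefix is an arbitrary example; the rest of this lab, and Homework #1, assume the numeric PointID, so treat this as illustration only):

#build an alphanumeric ID by pasting a text prefix to a zero-padded sequence
head(sprintf("SCBR_%05d", 1:nrow(cactus@data)))
#the result is a character field, so queries would need quotes, e.g. AlphaID == "SCBR_00005"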

 

#this is the function to test for the min distance between two sets of points: uses raster package
#read it into memory

min.dist <- function(xy1, xy2, longlat=FALSE, ...)
{
  outdist<-pointDistance(xy1, xy2, longlat=longlat)
  return(min(outdist))
}
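
As a quick sanity check of the function (a sketch; any subset of the cactus points will do):

#minimum distance (in map units, here meters) from the first 10 points to the 11th
min.dist(cactus[1:10,], cactus[11,])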

#set the number of samples to select
n<- 200
#set the minimum distance two points can be from each other
mindist <- 50

#We need to randomly choose a point, one at a time, and test for proximity to existing points
#To accomplish this task, we will run a loop

for(i in 1:n){
  #set a tracking variable for each iteration of i to know we found a point that meets 
  #the distance criteria

  good<-0
  repeat{
    #grab a single random presence point and only the first column of data
    rpt<- cactus[sample(1:nrow(cactus), size=1, replace=FALSE),1]
    #if it is the first point, nothing to test against for distance, add to output
    if(i == 1)
    {
      #save the point to the tracking matrix
      outPts<- rpt
      good<-1
    }#if i==1
    else
    {
      #get the min distance to the selected points
      testxy <- min.dist(outPts, rpt)
      #test the distance against our selected distance threshold
      if(testxy >= mindist){
        #save the point to the tracking matrix
        outPts<-rbind(outPts, rpt)
        #get out of the repeat
        good<-1
      }#test mindist
    }#if i !=1
    #criteria met, get out of current repeat
    if(good==1) break()
  }#close the repeat, move to next i
}#close the i loop
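
One caveat about this design: if n and mindist are set such that no remaining point can satisfy the threshold, the repeat never exits. Below is a sketch of a bounded variant of the inner repeat (for i > 1); maxTries and the stop() message are my additions, not part of the original lab.

#sketch: cap the number of rejected draws so the loop cannot hang forever
maxTries<- 1000
tries<- 0
repeat{
  #draw a candidate presence point as before
  rpt<- cactus[sample(1:nrow(cactus), size=1),1]
  #accept it if it meets the distance threshold
  if(min.dist(outPts, rpt) >= mindist){
    outPts<- rbind(outPts, rpt)
    break()
  }
  #otherwise count the failure and bail out after maxTries rejections
  tries<- tries + 1
  if(tries >= maxTries) stop("no point found that meets mindist; reduce n or mindist")
}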

#double check to make sure our subsampling by distance worked
d<- as.numeric(gDistance(outPts, byid=TRUE)) #uses rgeos package
#remove 0's because they are the distance of the point to itself
( d<- min(d[d!=0]) )

#now add an additional column describing these points as presence
outPts$PresAbs<- 1

 

STEP 3: RANDOMLY CREATE PSEUDO-ABSENCES

As was previously described, we want to use an algorithm that requires absence points, a piece that is not currently part of our dataset. Thus we need to randomly place locations within either 1) our existing polygons (i.e. HUC_SCBR.shp) OR 2) the extent of our polygons. Secondly, these points need to meet our criteria: 1) they are not within the minimum distance of a previously selected sample location (see Step 2), 2) they are distributed throughout the chosen mask, and 3) they are in relative proportion to the number of samples selected from our occurrence data (i.e. we don't want to swamp the algorithm with 'negative' data).

#////////////////////////
#Method #1: Create Points That are WITHIN AN IRREGULAR POLYGON (i.e. the HUCs)

#set the number of absences to create
a<- 400
#set the beginning number for the PointID
ptID<- 6000 #this is larger than num of rows in input, so unique values

#First we need to dissolve the HUCs to make a single boundary of the HUCs
bnd<- gUnaryUnion(huc.sm)
#check out the result
plot(bnd)
#add the points
plot(outPts, add=TRUE)

#We need to randomly choose a point location within the boundary, 
#one at a time, and test for proximity to existing points
#To accomplish this task, we will run a loop

for(i in 1:a){
  #set a tracking variable for each iteration of i to know we found a point that meets 
  #the distance criteria

  good<-0
  repeat{
    #grab a single random pseudo-absence point, must make into SPDF to bind to existing data
    rpt<- SpatialPointsDataFrame(spsample(bnd, n=1, type="random", iter=100), data=data.frame(PointID=(i+ptID), PresAbs=0))
    #get the min distance to the selected points
    testxy <- min.dist(outPts, rpt)
    #test the distance against our selected distance threshold
    if(testxy >= mindist){
      #save the point to the tracking matrix
      outPts<-rbind(outPts, rpt)
      #get out of the repeat
      good<-1
    }#test mindist
    #criteria met, get out of current repeat
    if(good==1) break()
  }#close the repeat, move to next i
}#close the i loop

#double check to make sure our subsampling by distance worked
d<- as.numeric(gDistance(outPts, byid=TRUE)) #uses rgeos package
#remove 0's because they are the distance of the point to itself
( d<- min(d[d!=0]) )

#now plot the spatial data
plot(bnd)
plot(outPts, add=TRUE, col=ifelse(outPts$PresAbs==1,"black", "red"))

 

 

#////////////////////////
#Method #2: Create Points That are WITHIN THE EXTENT OF AN IRREGULAR POLYGON (i.e. the HUCs)

#set the number of absences to create
a<- 400
#set the beginning number for the PointID
ptID<- 6000 #this is larger than num of rows in input, so unique values

#We are going to use the extent of the HUCs to make a new polygon
#let's build a polygon from scratch. To create a true polygon feature class, we need to deal with the spatial hierarchy

xs<- c(bbox(huc.sm)[1,1], bbox(huc.sm)[1,2], bbox(huc.sm)[1,2], bbox(huc.sm)[1,1], bbox(huc.sm)[1,1])
ys<- c(bbox(huc.sm)[2,1], bbox(huc.sm)[2,1], bbox(huc.sm)[2,2], bbox(huc.sm)[2,2], bbox(huc.sm)[2,1])
#create a single polygon
rstdy<- Polygon(matrix(c(xs, ys), ncol=2)) #sp package
#create a list of polygons (yes we only have one, but...)
rstdy<- Polygons(list(rstdy), ID="1")
#now make it spatially aware
rstdy<- SpatialPolygons(list(rstdy), proj4string=CRS(proj4string(huc.sm)))
#plot your new polygon
plot(rstdy)
plot(huc.sm, add=TRUE)
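
As an aside, the raster package can coerce an Extent object straight to SpatialPolygons, which should give the same rectangle in one step (a sketch; note you must re-attach the CRS yourself):

#one-line alternative to the manual Polygon/Polygons/SpatialPolygons construction
rstdy2<- as(extent(huc.sm), "SpatialPolygons")
proj4string(rstdy2)<- CRS(proj4string(huc.sm))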

#We need to randomly choose a point location within the boundary, 
#one at a time, and test for proximity to existing points
#To accomplish this task, we will run a loop

for(i in 1:a){
  #set a tracking variable for each iteration of i to know we found a point that meets 
  #the distance criteria

  good<-0
  repeat{
    #grab a single random pseudo-absence point, must make into SPDF to bind to existing data
    rpt<- SpatialPointsDataFrame(spsample(rstdy, n=1, type="random", iter=100), data=data.frame(PointID=(i+ptID), PresAbs=0))
    #get the min distance to the selected points
    testxy <- min.dist(outPts, rpt)
    #test the distance against our selected distance threshold
    if(testxy >= mindist){
      #save the point to the tracking matrix
      outPts<-rbind(outPts, rpt)
      #get out of the repeat
      good<-1
    }#test mindist
    #criteria met, get out of current repeat
    if(good==1) break()
  }#close the repeat, move to next i
}#close the i loop

#double check to make sure our subsampling by distance worked
d<- as.numeric(gDistance(outPts, byid=TRUE)) #uses rgeos package
#remove 0's because they are the distance of the point to itself
( d<- min(d[d!=0]) )

#now plot the spatial data
plot(rstdy)
plot(outPts, add=TRUE, col=ifelse(outPts$PresAbs==1,"black", "red"))

 

 

#////////////////////////
#Method #3: Create Points That are WITHIN THE EXTENT OF AN IRREGULAR POLYGON (i.e. the HUCs)
#But use an inverse kernel density estimate to weight random points to be further from presence points

#get an additional package for point-pattern analysis
require(spatstat)

#set the number of absences to create
a<- 400
#set the beginning number for the PointID
ptID<- 6000 #this is larger than num of rows in input, so unique values

#you need your presence points (i.e. cactus)
#read in the DEM to provide extent/resolution information
#this could be any raster, but the DEM is typically your base raster data source

dem<- raster("//wygiscdev/AdvSpatial/LabData/lab2/DEM_30m.img")
#We are going to use the extent of the HUCs to make a new polygon
#let's build a polygon from scratch. To create a true polygon feature class, we need to deal with the spatial hierarchy

xs<- c(bbox(huc.sm)[1,1], bbox(huc.sm)[1,2], bbox(huc.sm)[1,2], bbox(huc.sm)[1,1], bbox(huc.sm)[1,1])
ys<- c(bbox(huc.sm)[2,1], bbox(huc.sm)[2,1], bbox(huc.sm)[2,2], bbox(huc.sm)[2,2], bbox(huc.sm)[2,1])
#create a single polygon
rstdy<- Polygon(matrix(c(xs, ys), ncol=2)) #sp package
#create a list of polygons (yes we only have one, but...)
rstdy<- Polygons(list(rstdy), ID="1")
#now make it spatially aware
rstdy<- SpatialPolygons(list(rstdy), proj4string=CRS(proj4string(huc.sm)))
#plot your new polygon
plot(rstdy)
plot(huc.sm, add=TRUE)
#now use the study extent to crop the DEM to the size that we want. This could be done 
#prior to analysis as well and then you would only have to read in the raster instead of process
#we need to crop the DEM to match our study area

cdem<- crop(dem, rstdy)

#now with the base data ready, create a KUD from occurrence points
#recreate a raster object with correct extent and cell size

r<- raster(nrows=dim(cdem)[1], ncols=dim(cdem)[2], ext=extent(cdem), crs=CRS(proj4string(cdem)))
#get the raster extent
e<- extent(r)
#set the sample window for spatstat
win<- as.owin(c(e@xmin, e@xmax, e@ymin, e@ymax))
#create point pattern object
cactus.ppp <- as.ppp(coordinates(cactus), win)
#calculate sigma (the kernel bandwidth, i.e. how quickly influence decays with distance)
sigma <- bw.diggle(cactus.ppp)     
#create kernel density estimate
den.ppp <- density(cactus.ppp, sigma=sigma) 

#////////////////////////////////////////
#this is a function from Jeff Evans to convert a spatstat density image (im) to a SpatialGridDataFrame

as.SpatialGridDataFrame.im <- function(from){
  offset=c(from$xrange[1] + 0.5 * from$xstep, from$yrange[1] + 0.5 * from$ystep)
  cellsize = c(diff(from$xrange)/from$dim[2], diff(from$yrange)/from$dim[1])
  dim = c(from$dim[2], from$dim[1])
  gt = GridTopology(offset, cellsize, dim)
  m = t(from$v[nrow(from$v):1, ])
  data = data.frame(v = as.vector(m))
  SpatialGridDataFrame(gt, data)
}  

#//////////////////////////////////////

#convert to Grid
den.sp <- as.SpatialGridDataFrame.im(den.ppp)
#set the projection  
proj4string(den.sp)<- CRS(proj4string(cdem))
#make into a raster object
kde<- raster(den.sp)
#view the estimate
plot(kde)
#reverse the weights and rescale
kde<- calc(kde, fun=function(x){1-((x-cellStats(kde, stat="min"))*(1)/(cellStats(kde, stat="max")-cellStats(kde, stat="min")))}, progress="text")
#plot the results. This creates a raster in which cells farther from occurrence points
#have a greater weight (i.e. a greater likelihood of being randomly selected) than cells closer to them

plot(kde)
plot(cactus, add=TRUE)
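
To be explicit about the calc() call above: it applies a reversed min-max rescale, w = 1 - (x - min)/(max - min), so the cell with the maximum density receives weight 0, the cell with the minimum density receives weight 1, and everything in between scales linearly.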
#view the stats of the raster. You will notice the resolution is much larger than the 30m we expect
kde
#save the raster to disk for later use
writeRaster(kde, "//wygiscdev/AdvSpatial/salbeke/lab2/kernel.img", format="HFA", overwrite=TRUE, progress="text")
#thus, we need to create a kde that matches our 30m resolution so that we can randomly sample at the
#correct scale/grain/resolution. Use the resample function to accomplish this
#resample to be same raster size

kde30<- raster::resample(kde, r, progress="text")
#plot the result and view the attributes
plot(kde30)
kde30
#we like what we see, so save it to disk
#write out the raster for later use. Make sure you change your directory

writeRaster(kde30, "//wygiscdev/AdvSpatial/salbeke/lab2/kernel30m.img", format="HFA", overwrite=TRUE, progress="text")

#convert low res kernel values to a vector
kde.vec<-getValues(kde)

#This function allows us to obtain an extent from a raster cell
#that has a larger resolution than a raster that we want to sample
#we can extract the cells that fall within the larger resolution
#/////////////////////////////////////////////////////////////

CreateCellExtent<- function(inRaster, inCell){
  inXY<- xyFromCell(inRaster, inCell)
  #create an extent object for coordinates of a larger raster cell
  return(extent(c((inXY[1]-(xres(inRaster)/2)), (inXY[1]+(xres(inRaster)/2)),
                  (inXY[2]-(yres(inRaster)/2)), (inXY[2]+(yres(inRaster)/2)))))
}
#/////////////////////////////////////////////////////////////
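
A quick illustration of the helper (a sketch using cell 1 of the coarse kernel):

#the rectangular extent of the first low-res kernel cell...
CreateCellExtent(kde, 1)
#...and the 30m cells of kde30 that fall inside it
cellsFromExtent(kde30, CreateCellExtent(kde, 1))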

#We need to randomly choose a point location within the boundary, 
#one at a time, and test for proximity to existing points
#To accomplish this task, we will run a loop that will use the low resolution kernel

for(i in 1:a){
  #set a tracking variable for each iteration of i to know we found a point that meets 
  #the distance criteria

  good<-0
  repeat{
    #grab a single random absence point
    #extract the cells from the high res kernel given the 1 cell from low res

    a.cell<- cellsFromExtent(kde30, CreateCellExtent(kde, sample(1:length(kde.vec), size=1, prob=kde.vec)))
    #sample the subset of cells and grab the one cell using the kernel weights
    rcell<- sample(a.cell, size=1, prob=kde30[a.cell])
    #get the cell coordinates of our random point, must make into SPDF to bind to existing data
    rpt<- SpatialPointsDataFrame(xyFromCell(kde30, rcell, spatial=TRUE), data=data.frame(PointID=(i+ptID), PresAbs=0))
    #get the min distance to the selected points
    testxy <- min.dist(outPts, rpt)
    #test the distance against our selected distance threshold
    if(testxy >= mindist){
      #save the point to the tracking matrix
      outPts<-rbind(outPts, rpt)
      #get out of the repeat
      good<-1
    }#test mindist
    #criteria met, get out of current repeat
    if(good==1) break()
  }#close the repeat, move to next i
}#close the i loop

#double check to make sure our subsampling by distance worked
d<- as.numeric(gDistance(outPts, byid=TRUE)) #uses rgeos package
#remove 0's because they are the distance of the point to itself
( d<- min(d[d!=0]) )

#now plot the spatial data
plot(kde30)
plot(outPts, add=TRUE, col=ifelse(outPts$PresAbs==1,"black", "red"))

 

 

#////////////////////////
#Method #4: First perform a proximity filter on Presence Points to make sure no remaining point is within
#our set distance of another. Then randomly sample 70% of the points for Training data and 30% for
#Testing data. Create sets of Pseudo-Absence Points that are WITHIN THE EXTENT OF AN IRREGULAR
#POLYGON (i.e. the HUCs)

#next we need to limit the points to only those farther than mindist from all others
#create a complete distance matrix

d<- gDistance(cactus, byid=TRUE) #uses rgeos package
#loop through each point and determine which points are within 50m
outDF<- data.frame()
for(i in 1:nrow(d)){
  d.match<- unique(which(d[i,]>0 & d[i,]<mindist))
  if(length(d.match)>0){
    outDF<- rbind(outDF, data.frame(inPt=i, closePt=paste(d.match, collapse=",")))
  }
}
#now we need to remove points from the OBS Points
outDF$closePt<- as.character(outDF$closePt)
#now loop through the close points to create a non-spatial match dataset
sdata<- cactus
for(i in 1:nrow(outDF)){
  kill<- which(sdata@data$PointID %in% unlist(strsplit(outDF[i,2], ",")))
  if(length(kill)>0){
    sdata<- sdata[-kill,]
  }  
}
#Add a field to indicate Presence
sdata@data$PresAbs<- 1
#now limit the attribute table to this field only
sdata@data<- data.frame(PresAbs=sdata@data$PresAbs)
#plot your new polygon
plot(huc.sm)
#add the limited points
plot(sdata, add=TRUE)

#set the number of Presence points to be used within a Training dataset
np<- round(nrow(sdata)*0.7)
#set the number of pseudo absence
a<- np*2

#set the beginning number for the PointID
ptID<- 6000 #this is larger than num of rows in input, so unique values

#We are going to use the extent of the HUCs to make a new polygon
#let's build a polygon from scratch. To create a true polygon feature class, we need to deal with the spatial hierarchy

xs<- c(bbox(huc.sm)[1,1], bbox(huc.sm)[1,2], bbox(huc.sm)[1,2], bbox(huc.sm)[1,1], bbox(huc.sm)[1,1])
ys<- c(bbox(huc.sm)[2,1], bbox(huc.sm)[2,1], bbox(huc.sm)[2,2], bbox(huc.sm)[2,2], bbox(huc.sm)[2,1])
#create a single polygon
rstdy<- Polygon(matrix(c(xs, ys), ncol=2)) #sp package
#create a list of polygons (yes we only have one, but...)
rstdy<- Polygons(list(rstdy), ID="1")
#now make it spatially aware
rstdy<- SpatialPolygons(list(rstdy), proj4string=CRS(proj4string(huc.sm)))

#RANDOMLY SELECT THE POINTS TO BE USED AS TRAINING DATA
#select random presence points for model fit

rpts<- sample(1:nrow(sdata), np)
#get the random points
spts<- sdata[rpts,]
#get the remaining presence points for validation (drop the sampled rows by position)
vpts<- sdata[-rpts,]
#plot the training and validation presence points
plot(spts)
plot(vpts, add=TRUE, col="red")

#now we have to create pseudo-absence points for model fit
rpt<- spsample(rstdy, n=a, type="random", iter=100)

#create a distance matrix between pres and abs
d<- gDistance(rpt, spts, byid=TRUE) #uses rgeos package
#loop through each point and determine which points are within 50m
outDF<- data.frame()
for(j in 1:nrow(d)){
  d.match<- unique(which(d[j,]>0 & d[j,]<mindist))
  if(length(d.match)>0){
    outDF<- rbind(outDF, data.frame(inPt=j, closePt=paste(d.match, collapse=",")))
  }
}
#now test for points being too close to presence points
if(nrow(outDF)>0){
  #remove ALL offending points first so the row indices stored in outDF stay aligned
  rpt<- rpt[-outDF$inPt,]
  #then draw a replacement for each removed point
  for(j in 1:nrow(outDF)){
    #keep drawing random samples until the distance criteria is met
    repeat{
      #for simplicity sake, resample a replacement point from the polygon
      rpt.add<- spsample(rstdy, 1, type="random")
      #test if the new point meets the distance criteria
      if(gDistance(rpt.add, spts)>=mindist){
        #add the point to the random samples
        rpt<- rbind(rpt, rpt.add)
        #break out of the repeat
        break()
      }#close if gDistance
    }#close repeat
  }#on to next outDF
}#close outDF if
#make random points into a SPDF

rpt<- SpatialPointsDataFrame(rpt, data=data.frame(PresAbs=rep(0, length(rpt))))
#now append absence data to presence
spts<- rbind(spts, rpt)
#check out your handiwork
plot(rstdy)
plot(spts, add=TRUE, col=ifelse(spts$PresAbs==1,"black", "red"))

#now we have to create pseudo-absence points for model validation
rpt<- spsample(rstdy, n=(nrow(vpts)*2), type="random", iter=100)
#create a distance matrix between pres and abs
d<- gDistance(rpt, vpts, byid=TRUE) #uses rgeos package
#loop through each point and determine which points are within 50m
outDF<- data.frame()
for(j in 1:nrow(d)){
  d.match<- unique(which(d[j,]>0 & d[j,]<mindist))
  if(length(d.match)>0){
    outDF<- rbind(outDF, data.frame(inPt=j, closePt=paste(d.match, collapse=",")))
  }
}
#now test for points being too close to presence points
if(nrow(outDF)>0){
  #remove ALL offending points first so the row indices stored in outDF stay aligned
  rpt<- rpt[-outDF$inPt,]
  #then draw a replacement for each removed point
  for(j in 1:nrow(outDF)){
    #keep drawing random samples until the distance criteria is met
    repeat{
      #for simplicity sake, resample a replacement point from the polygon
      rpt.add<- spsample(rstdy, 1, type="random")
      #test if the new point meets the distance criteria
      if(gDistance(rpt.add, vpts)>=mindist){
        #add the point to the random samples
        rpt<- rbind(rpt, rpt.add)
        #break out of the repeat
        break()
      }#close if gDistance
    }#close repeat
  }#on to next outDF
}#close outDF if
#make random points into a SPDF
rpt<- SpatialPointsDataFrame(rpt, data=data.frame(PresAbs=rep(0, length(rpt))))
#now append absence data to presence
vpts<- rbind(vpts, rpt)
#check out your handiwork
#plot(rstdy)

plot(vpts, add=TRUE, col=ifelse(vpts$PresAbs==1,"green", "purple"))
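
You will likely want both point sets on disk for later labs; a sketch mirroring the earlier writeOGR calls (the layer names are placeholders, and remember to change the dsn to YOUR directory):

#save the training and validation datasets as shapefiles
writeOGR(spts, dsn="//wygiscdev/AdvSpatial/salbeke/lab2", layer="SCBR_Training", driver="ESRI Shapefile")
writeOGR(vpts, dsn="//wygiscdev/AdvSpatial/salbeke/lab2", layer="SCBR_Testing", driver="ESRI Shapefile")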

 


HOMEWORK ASSIGNMENT #1

 

You have been contracted to develop a Species Distribution Model for Pygmy Rabbit (Brachylagus idahoensis). The hiring entity has given you a set of occurrence points and wants to use Logistic Regression as one of the algorithm approaches. The goal of the project is to develop a statewide distribution model (i.e. Wyoming). For this assignment, I would like for you to use your newly acquired insight, skills and tools to create a dataset you feel is appropriate for use in model creation. All that I am asking is for you to create a set of observation and pseudo-absence points that you would feel comfortable for use in developing a logistic regression model. To receive full credit you will need to provide:

  1. A single spatial point dataset containing both the occurrence and absence points, making sure to correctly label which is which. The data type/model is up to you (i.e. shapefile, feature class, etc.). Make sure to pay attention to your projection and ensure that it is appropriate for the extent of your study.

    Note: the projection of the rabbit data is WyLAM (Wyoming Lambert). You will need to set the projection of your spatial points to this parameterization: xysp <- SpatialPoints(xy, proj4string= CRS("+proj=lcc +lat_1=41 +lat_2=45 +lat_0=41 +lon_0=-107.5 +x_0=500000 +y_0=200000 +ellps=GRS80 +datum=NAD83 +units=m +no_defs +towgs84=0,0,0"))

    Note 2: For the time being, create a numeric Unique ID field for use in the Random Site selection. I will work on creating a work-around for using alpha-numeric fields.
     
  2. An MS Word document (or something that I can at least open and edit) written in manuscript style. Within this document, please articulate the choices you made, and why, to create your set of points to include in the model. This could be as short as a single paragraph, but I would hope you could keep it shorter than 1 page in length. Think of this as a portion of your Methods section if you were writing a true manuscript.
     
  3. The data can be downloaded from here (PygmyRabbit.zip). This assignment is due 1 week from now, in my inbox before class begins (i.e. February 4, 2014). Please email me a zip file containing your document and your points.
    • The key metadata for the occurrence sites are as follows:
      • UNIQUE_ID - Unique character describing the data row
      • SEC_SOURCE - Description of data origination
      • TAX_CERT - The certainty the taxa was identified correctly. Y = Yes, Q = Questionable
      • OBS_YEAR - Year of the Observation
      • OBS_MONTH - Month of the Observation
      • OBS_DAY - Day of the Observation
      • LATITUDE - Y coordinate of occurrence point in Geographic Coordinates (WGS84)
      • LONGITUDE - X coordinate of occurrence point in Geographic Coordinates (WGS84)
      • precis - Precision of coordinates (in meters?)
      • POINT_X - X coordinate in WyLAM (meters)
      • POINT_Y - Y coordinate in WyLAM (meters)
         

Good luck and let me know if you have any questions.