Saturday, March 17, 2018

Bioregionalisation part 5: Cleaning point distribution data in R

I should finally complete my series on bioregionalisation. What is still missing is a post on how to do a network (Modularity) analysis in R. But first I thought I would write a bit about how to efficiently clean point distribution data in R. As so often, I am writing this partly because it may be useful to somebody who finds it via a search engine, but also so that I can look it up myself if I need it again after not having done it for months.

The assumption is that we start our spatial or biogeographic analyses by obtaining point distribution data, e.g. by querying an online biodiversity database or aggregator such as GBIF or the Atlas of Living Australia (ALA) for the genus or family we want to study. We download the record list in CSV format and now presumably have a large file with many columns, most of them irrelevant to our interests.
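(As an aside, such queries can also be scripted directly from R. A minimal sketch using the rgbif package, assuming it is installed; the taxon name and record limit are merely illustrative, and larger downloads would need GBIF's download service:)
library(rgbif)
# fetch up to 500 georeferenced Lycopodiales records from GBIF
gbifdata <- occ_search(scientificName = "Lycopodiales", hasCoordinate = TRUE, limit = 500)$data
In this post, however, we will work with a CSV file downloaded manually from ALA.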

One problem that we may find is that there are numerous cases of records occurring in implausible locations. They may represent geospatial data entry errors, such as land plants supposedly occurring in the ocean, or vouchers collected from plants in botanic gardens where the databasers for some reason entered the garden's coordinates instead of those of the source location, or other outliers that we suspect to be misidentifications. What follows assumes that this kind of outlier check has already been done (it is hard to automate completely anyway), but we can use R to help us with a few other problems.
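(Some of these checks can, however, be partly automated. The CoordinateCleaner package, for example, offers tests flagging records that fall into the sea or sit suspiciously close to biodiversity institutions such as botanic gardens. A minimal sketch, assuming the package is installed, the data have already been read into a data frame called rawdata as in the next step, and the coordinate columns are named as in my ALA download:)
library(CoordinateCleaner)
# remove records plotting in the ocean
rawdata <- cc_sea(rawdata, lon = "decimalLongitude", lat = "decimalLatitude")
# remove records within the default buffer around botanic gardens, museums etc.
rawdata <- cc_inst(rawdata, lon = "decimalLongitude", lat = "decimalLatitude")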

We start up R and begin by reading in our data, in this case all lycopod records downloaded from ALA. (One of the advantages of that group is that very few of them are cultivated in botanic gardens, and I did not want to do that kind of data clean-up for a blog post.)
rawdata <- read.csv("Lycopodiales.csv", sep=",", na.strings = "", header=TRUE, row.names=NULL)
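A quick look at the dimensions and column names tells us how many records we have and which fields are relevant:
dim(rawdata)
colnames(rawdata)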
We now want to remove all records that lack any of the data we need for spatial and biogeographic analyses, i.e. identification to the species level, latitude and longitude. Other filtering may be desired, e.g. of records with too little geocode precision, but we will leave it at that for the moment. In my case the relevant columns are called genus, specificEpithet.1, decimalLatitude, and decimalLongitude, but that may of course be different in other data sources and require appropriate adjustment of the commands below.
rawdata <- rawdata[!(is.na(rawdata$decimalLatitude) | rawdata$decimalLatitude==""), ]
rawdata <- rawdata[!(is.na(rawdata$decimalLongitude) | rawdata$decimalLongitude==""), ]
rawdata <- rawdata[!(is.na(rawdata$genus) | rawdata$genus==""), ]
rawdata <- rawdata[!(is.na(rawdata$specificEpithet.1) | rawdata$specificEpithet.1==""), ]
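The same filtering can be done more compactly with complete.cases(), given that empty fields were read in as NA (which the na.strings argument of read.csv above ensures):
# keep only rows in which all four required fields are present
wanted <- c("genus", "specificEpithet.1", "decimalLatitude", "decimalLongitude")
rawdata <- rawdata[complete.cases(rawdata[, wanted]), ]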
All the records missing those data should be gone now. Next we make a new data frame containing only the data we are actually interested in.
lat <- rawdata$decimalLatitude
long <- rawdata$decimalLongitude
species <- paste(as.character(rawdata$genus), as.character(rawdata$specificEpithet.1), sep=" ")
mydata <- data.frame(species, lat, long)
mydata$species <- as.character(mydata$species)
Unfortunately, at this stage there are still records that we may not want for our analysis. Most of them can be recognised by having more than the usual two name elements of genus name and specific epithet: hybrids (something like "Huperzia prima x secunda" or "Huperzia x tertia") and undescribed phrase name taxa that may or may not actually be distinct species ("Lycopodiella spec. Mount Farewell"). At the same time, we may want to check the list of species in our data table with unique(mydata$species) to see if we recognise any other problems that do have exactly two name elements, such as "Lycopodium spec." or "Lycopodium Undesignated". If there are any of those, we place them into a vector:
kickout <- c("Lycopodium spec.", "Lycopodium Undesignated")
Then we loop through the data to get rid of all these problematic entries.
# start with all records flagged to be kept
myflags <- rep(TRUE, length(mydata[,1]))
for (i in 1:length(myflags))
{
  # flag for removal: names without exactly two elements (hybrids, phrase names)
  # and names on the kickout list
  if ( (length(strsplit(mydata$species[i], split=" ")[[1]]) != 2) || (mydata$species[i] %in% kickout) )
  {
    myflags[i] <- FALSE
  }
}
mydata <- mydata[myflags, ]
If there is no 'kickout' vector of undesirable records with two name elements, we do the same but adjust the if condition accordingly so that it does not expect the vector to exist.
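For what it is worth, the same filtering can also be done without an explicit loop. A vectorised sketch that covers both situations:
# count the name elements of each species string
n.elements <- sapply(strsplit(mydata$species, split=" "), length)
keep <- (n.elements == 2)
# only apply the kickout list if it exists
if (exists("kickout")) keep <- keep & !(mydata$species %in% kickout)
mydata <- mydata[keep, ]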

Check unique(mydata$species) again to see if the situation has improved. If there are instances of name variants or outdated taxonomy that need to be corrected, that is surprisingly easy with a command along the following lines:
mydata$species[mydata$species == "Outdatica fastigiata"] <- "Valida fastigiata"
In that way we can efficiently harmonise the names so that one species does not get scored as two just because some specimens still have an outdated or misspelled name.
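If more than a handful of names need correcting, a named lookup vector saves repetition (the names below are made up for illustration, like the ones above):
# named vector mapping outdated or misspelled names to accepted names
corrections <- c("Outdatica fastigiata" = "Valida fastigiata",
                 "Outdatica cernua" = "Valida cernua")
hits <- mydata$species %in% names(corrections)
mydata$species[hits] <- unname(corrections[mydata$species[hits]])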

Although we assume that we have already checked for geographic outliers, we may still want to limit our analysis to a specific area. In my case I want to get rid of non-Australian records, so I remove every record outside of a box from 9.5 to 44.5 degrees south and 111 to 154 degrees east around the continent. It turns out that this leaves in parts of New Guinea, but that is fine for present purposes; we don't want to over-complicate things now.
mydata <- mydata[mydata$long<154, ]
mydata <- mydata[mydata$long>111, ]
mydata <- mydata[mydata$lat>(-44.5), ]
mydata <- mydata[mydata$lat<(-9.5), ]
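The four steps can equivalently be combined into a single subset command:
# keep only records inside the bounding box around Australia
mydata <- mydata[mydata$long > 111 & mydata$long < 154 & mydata$lat > (-44.5) & mydata$lat < (-9.5), ]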
At this stage we may want to save the cleaned up data for future use, just in case.
write.table(mydata, file = "Lycopodiales_records_cleaned.csv", sep=",", row.names=FALSE)
And now, finally, we can actually turn the point distribution data into grid cells and conduct a network analysis, but that will be the next (and final) post of the series.
