Rate of description of new animal species and *that* Taxatoy graph

As part of the discussion on whether legacy biodiversity literature matters, a graph from the following paper came up:

Sarkar, I., Schenk, R., & Norton, C. N. (2008). Exploring historical trends using taxonomic name metadata. BMC Evolutionary Biology, 8(1), 144. doi:10.1186/1471-2148-8-144


So, why is the Sarkar et al. graph bogus? Here is their graph (Fig. 3) for animals:

Taxatoy

This is the number of new animal species described each year, estimated by parsing taxonomic names and extracting the date in the taxonomic authority. There are two prominent "spikes" which are worrying. Sarkar et al. discuss the peak in 1994:

For example, the analyzed data indicate that a significant portion of the 1994 peak is due to an increase in descriptions of the family Cerambycidae, a large group of beetles.


So, 1994 was a bumper year for describing new species of Cerambycidae? Not quite. Taxatoy is based on names in uBio, and I have a local copy of most of these names. The Cerambycidae names contain lots of duplicate names that differ only in taxon authority. For example, searching the name Ancylocera macrotela on uBio finds:


Ancylocera macrotela
Ancylocera macrotela Aurivillius, 1912
Ancylocera macrotela BATES Henry Walter, 1880
Ancylocera macrotela Bates, 1880
Ancylocera macrotela Bates, 1885
Ancylocera macrotela Blackwelder, 1946
Ancylocera macrotela Chemsak & Linsley, 1970
Ancylocera macrotela Chemsak, 1963
Ancylocera macrotela Chemsak, 1964
Ancylocera macrotela Chemsak, Linsley & Mankins, 1980
Ancylocera macrotela Chemsak, Linsley & Noguera, 1992
Ancylocera macrotela Lameere, 1883
Ancylocera macrotela Maes & al., 1994
Ancylocera macrotela Monné & Giesbert, 1994
Ancylocera macrotela Monné, 1994
Ancylocera macrotela Noguera & Chemsak, 1996
Ancylocera macrotela Viana, 1971


These names are chresonyms. The original name is Ancylocera macrotela Bates, 1880 (you can see the first publication of this name in BHL); the rest are subsequent citations of that name (gotta love taxonomy...).

Why the spike in 1994? I suspect that this is due to the publication in 1994 of "Checklist of the Cerambycidae and Disteniidae (Coleoptera) of the Western Hemisphere" by Miguel A Monné and Edmund F Giesbert. At least 8552 names from that checklist seem to have ended up in uBio, all with the date "1994". So the spike is an artefact. Similarly, the other peak (1912) corresponds to the publication of a checklist by Per Olof Christopher Aurivillius, which contributes over 3000 names.
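To make the artefact concrete, here is a minimal Python sketch (not the code behind Taxatoy) that counts names per year in two ways: naively, by taking the year from every name string, and after collapsing each binomial to its earliest year. The regular expression and the function names `split_name`, `naive_counts` and `earliest_counts` are my own illustrative choices, and "earliest year wins" is a crude assumption that would wrongly merge genuine homonyms.

```python
import re
from collections import Counter

# A few of the uBio name strings listed above.
names = [
    "Ancylocera macrotela Aurivillius, 1912",
    "Ancylocera macrotela Bates, 1880",
    "Ancylocera macrotela Bates, 1885",
    "Ancylocera macrotela Monné & Giesbert, 1994",
    "Ancylocera macrotela Monné, 1994",
]

# Assumption: the year is a four-digit number at the end of the string,
# optionally followed by a closing parenthesis.
YEAR = re.compile(r"\b(1[7-9]\d\d|20\d\d)\)?\s*$")


def split_name(name_string):
    """Return (binomial, year), or None if no trailing year is found."""
    match = YEAR.search(name_string)
    if not match:
        return None
    binomial = " ".join(name_string.split()[:2])
    return binomial, int(match.group(1))


def naive_counts(name_strings):
    """Count every name string under the year in its authority."""
    parsed = filter(None, (split_name(s) for s in name_strings))
    return Counter(year for _, year in parsed)


def earliest_counts(name_strings):
    """Count each binomial once, under the earliest year seen for it."""
    earliest = {}
    for binomial, year in filter(None, (split_name(s) for s in name_strings)):
        earliest[binomial] = min(year, earliest.get(binomial, year))
    return Counter(earliest.values())


print(naive_counts(names))
print(earliest_counts(names))
```

With the naive count, this one species contributes to 1880, 1885, 1912 and twice to 1994. With the collapsed count it contributes once, to 1880.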

One reason I was suspicious of the Taxatoy graph is that it doesn't look anything like the equivalent graph from the Index of Organism Names. After a bit of fussing I've grabbed data from the ION site, and from Taxatoy's Google Code repository and created the following chart:

Taxatoy version2

The data for this chart is on figshare http://dx.doi.org/10.6084/m9.figshare.156862. ION is an index of all new animal names, based on Zoological Record. I place more confidence in its data than data derived from uBio, but ION clearly has its own issues (such as the gap after 1850, and the uneven sampling of the early years of taxonomy). The key point is that arguments on the temporal distribution of taxonomic descriptions (and the value of legacy literature) need to be aware that the data used is in pretty poor shape.

Update 2013-02-23
Jose Antonio Gonzalez Oreja pointed out in an email that the values for ION that I used were a little higher than those that appear on the ION web site. My script for retrieving those values hadn't quite worked. I've uploaded the corrected data to Figshare http://dx.doi.org/10.6084/m9.figshare.156862, updated the diagram above, and put the web calls I used to fetch the data on GitHub https://gist.github.com/rdmpage/5019153. The story doesn't change, but it helps to have the correct data.

Nomenclator Zoologicus meets Biodiversity Heritage Library: linking names directly to literature

Following on from my previous post on microcitations I've blasted all the citations in Nomenclator Zoologicus through my microcitation service and created a simple web site where these results can be browsed.

The web site is here: http://iphylo.org/~rpage/nz/.

To create it I've taken a file dump of Nomenclator Zoologicus provided by Dave Remsen and run all the citations through the microcitation service, storing the results in a simple database. You can search by genus name, author and year, or publication. The search is pretty crude, and in the case of publications can be a bit hit and miss. Citations in Nomenclator Zoologicus are stored as strings, so I've used some crude rules to try and extract the publication name from the rest of the details (such as page numbering).
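As an illustration of the kind of crude rule involved, here is a small Python sketch that splits a citation string into a publication name and the remaining details. It is not the rule set used for the site: the rule (cut at the first digit, or at a parenthesised series number) and the example citation strings are assumptions made up for this example.

```python
import re

# Hypothetical rule: the publication name runs up to the first digit, or up
# to an opening parenthesis directly followed by a digit, such as "(8)".
DETAILS_START = re.compile(r"\(?\d")


def split_citation(citation):
    """Split a citation string into (publication, remaining details)."""
    match = DETAILS_START.search(citation)
    if not match:
        # No volume or page numbers found: treat everything as the name.
        return citation.strip(" ,"), ""
    publication = citation[:match.start()].strip(" ,")
    details = citation[match.start():].strip()
    return publication, details


# Invented citation strings in a plausible abbreviated style.
print(split_citation("Ann. Mag. nat. Hist., (8) 5, 312"))
print(split_citation("Zool. Anz. 35, 12"))
```

A rule like this fails whenever the publication name itself contains a digit, which is one reason searching by publication can be hit and miss.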

To get started, you can look at names published by Distant in 1910, which you can see below:

Nz1

If the citation has been found you can click on the icon to view the page in a popup, like this:

Nz2

You can also click on the page number to be taken to that page in BHL.


I've also added some other links, such as to the name in the Index to Organism Names, as well as bibliographic identifiers such as DOIs, Handles, and links to JSTOR and CiNii.

So far only 10% of Nomenclator Zoologicus records have a match in BHL, which is slightly depressing. Browsing through, there are some obvious gaps where my parser clearly failed, typically where multiple pages are included in the citation, or the citation has some additional comments. These could be fixed. There are also cases where the OCR text is so mangled that a match has been rejected because the genus name and text were too different.
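One way such a rejection could work is sketched below in Python. This is an assumption about the general approach, not the microcitation service's actual test: it compares the genus name with each word of the OCR text using `difflib.SequenceMatcher` from the standard library and accepts the page only if the best similarity reaches a threshold. The function names and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def best_ocr_match(genus, ocr_text):
    """Return (token, similarity) for the OCR token most similar to genus."""
    target = genus.lower()
    best_token, best_score = "", 0.0
    for token in ocr_text.split():
        cleaned = token.strip(".,;:()[]").lower()
        score = SequenceMatcher(None, target, cleaned).ratio()
        if score > best_score:
            best_token, best_score = cleaned, score
    return best_token, best_score


def accept_match(genus, ocr_text, threshold=0.8):
    """Accept the page only if some token is close enough to the genus name."""
    return best_ocr_match(genus, ocr_text)[1] >= threshold


# A single OCR error ("c" for "e") still passes; badly mangled text does not.
print(accept_match("Stygiomedusa", "Stygiomcdusa gen. nov."))
print(accept_match("Stygiomedusa", "5tyg1 0m3d u5a"))
```

Lowering the threshold recovers more mangled pages at the cost of more false matches, so the value needs tuning against pages that have been checked by eye.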

This has been hastily assembled, but it's one vision of a simple service where we can go from genus name to being able to see the original publication of that name. There are other things we could do with this mapping, such as enabling BHL to tell users that the reference they are looking at is the original source of a particular name, and enabling services that use BHL content (such as EOL and Atlas of Living Australia) to flag which reference in BHL is the one that matters in terms of nomenclature.

Time for some decent service

The BBC web site has an article entitled Giant deep sea jellyfish filmed in Gulf of Mexico which has footage of Stygiomedusa gigantea, and mentions an associated fish, Thalassobathia pelagica.



One thing that frustrates me beyond belief is how hard it is to get more information about these organisms. Put another way, the biodiversity informatics community is missing a huge opportunity here. There are a slew of services, such as Zemanta and OpenCalais.com, that can enrich the content of a document by identifying terms and adding links. Imagine a similar service that took taxonomic names and could provide information and links about that name, so that sites such as the BBC could enrich their pages. We've had various attempts at this [1], but we are still far from creating something genuinely useful.

Part of the problem is that the plethora of taxonomic databases we have are often of little use. After fussing with Google I discover that Stygiomedusa gigantea (Browne, 1910) has the synonym Stygiomedusa fabulosa Russell, 1959 (see, e.g., the WoRMS database), but no database tells me that the genus Stygiomedusa was published by Russell in Nature in 1959 (doi:10.1038/1841527a0). Nor can I readily find the original reference for (Browne, 1910) in these databases [2]. Why is this so hard?

Then when we do have information, we fail to make it digestible. For example, the EOL page for Thalassobathia pelagica links to BHL pages, but fails to point out that the pages it links belong to a single article, and that this article (http://biostor.org/reference/4339) is the original description of the fish.

Publishers are increasingly interested in any tools that can embellish their content. The organisation that gets their act together and provides a decent service for publishers (including academic journals, and news services such as the BBC) is going to own this space. Any takers...?

  1. Such as uBio LinkIT and EOL NameLink.
  2. After finding another taxon with the author Browne 1910 in BHL, I found Diplulmaris (?) gigantea, which looked like a good candidate for the original name for the jellyfish, see http://biodiversitylibrary.org/page/1727009. This is confirmed by the Smithsonian's Antarctic Invertebrates site.

e-Biosphere Challenge: visualising biodiversity digitisation in real time

e-Biosphere '09 kicks off next week, and features the challenge:
Prepare and present a real-time demonstration during the days of the Conference of the capabilities in your community of practice to discover, disseminate, integrate, and explore new biodiversity-related data by:
  • Capturing data in private and public databases;
  • Conducting quality assurance on the data by automated validation and/or peer review;
  • Indexing, linking and/or automatically submitting the new data records to other relevant databases;
  • Integrating the data with other databases and data streams;
  • Making these data available to relevant audiences;
  • Make the data and links to the data widely accessible; and
  • Offering interfaces for users to query or explore the data.


Originally I planned to enter the wiki project I've been working on for a while, but time was running out and the deadline was too ambitious. Hence, I switched to thinking about RSS feeds. The idea was to first create a set of RSS feeds for sources that lack them, which I've been doing over at http://bioguid.info/rss, then integrate these feeds in a useful way. For example, the feeds would include images from Flickr (such as EOL's pool), geotagged sequences from GenBank, the latest papers from Zootaxa, and new names from uBio (I'd hoped to include ION as well, but they've been spectacularly hacked).

After playing with triple stores and SPARQL (incompatible vocabularies and multiple identifiers rather buggers this approach), and visualisations based on Google Maps (building on my swine flu timemap), it dawned on me what I really needed was an eye-catching way of displaying geotagged, timestamped information, just like David Troy's wonderful twittervision and flickrvision.com. In particular, David took the Poly9 Globe and added Twitter and Flickr feeds (see twittervision 3D and flickrvision 3D). So, I hacked David's code and created this, which you can view at http://bioguid.info/ebio09/www/3d/:



It's a lot easier to simply look at it rather than describe what it does, but here's a quick sketch of what's under the hood.

Firstly, I take RSS feeds, either the raw geoFeed from Flickr, or from http://bioguid.info/rss. The bioGUID feeds include the latest papers in Zootaxa (most new animal species are described in this journal), a modified version of uBio's new names feed, and a feed of the latest, geotagged sequences in GenBank (I'd hoped to use only DNA barcodes, but it turns out rather few barcode sequences are geotagged, and few have the "BARCODE" keyword). The Flickr feeds are simple to handle because they include locality information (including latitude, longitude, and Yahoo Where-on-Earth Identifiers (WOEIDs)). Similarly, the GenBank feed I created has latitudes and longitudes (although extracting these isn't always as straightforward as it should be). Other feeds require more processing. The uBio feed already has taxonomic names, but no geotagging, so I use services from Yahoo! GeoPlanet™ to find localities from article titles. For the Zootaxa feed that I created I use uBio's SOAP service to extract taxonomic names, and Yahoo! GeoPlanet™ to extract localities.
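The first step, reading items and separating those that already carry coordinates from those that need a geocoder, can be sketched in Python as below. This is an illustration rather than the code behind the demo. The sample feed is invented, it assumes coordinates arrive as a GeoRSS `georss:point` element, and the function name `read_items` is hypothetical. Items without a point are simply flagged; the geocoding call itself is left out.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

NS = {"georss": "http://www.georss.org/georss"}

# Invented two-item feed: one geotagged item, one with only a title.
SAMPLE_FEED = """<rss version="2.0" xmlns:georss="http://www.georss.org/georss">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Geotagged photo</title>
      <pubDate>Sat, 23 May 2009 10:00:00 GMT</pubDate>
      <georss:point>55.87 -4.29</georss:point>
    </item>
    <item>
      <title>New species of copepod from Brazil</title>
      <pubDate>Mon, 25 May 2009 09:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def read_items(feed_xml):
    """Return a list of dicts with title, date and (lat, lon) or None."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        point = item.findtext("georss:point", default=None, namespaces=NS)
        latlon = tuple(float(v) for v in point.split()) if point else None
        items.append({
            "title": item.findtext("title"),
            "date": parsedate_to_datetime(item.findtext("pubDate")),
            "latlon": latlon,
        })
    return items


items = read_items(SAMPLE_FEED)
ready = [i for i in items if i["latlon"] is not None]
needs_geocoding = [i for i in items if i["latlon"] is None]
print(len(ready), "items can be plotted,", len(needs_geocoding), "need a locality lookup")
```

Keeping the two groups apart makes it easy to plot what is ready immediately and to retry the locality lookup later for the rest.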


I've tried to create a useful display popup. For Zootaxa papers you get a thumbnail of the paper, and where possible an icon of the taxonomic group the paper talks about (the presence of this icon depends on the success of uBio's taxonomic name finding service, the Catalogue of Life having the same name, and my having a suitable icon). The example above shows a paper about copepods. Other papers have an icon for the journal (again, a function of my being able to determine the journal ISSN and having a suitable icon). Flickr images simply display a thumbnail of the image.

What does it all mean? Well, I could say all sorts of things about integration and mash-ups but, dammit, it's pretty. I think it's a fun way to see just what is happening in digital biodiversity. I've deliberately limited the demo to items that came online in the month of May, and I'll be adding items during the conference (June 1-3rd in London). For example, if any more papers appear in Zootaxa, or in the uBio feeds I'll add those. If anybody uploads geotagged photos to EOL's Flickr group, I'll grab those as well. It's still a bit crude, but it shows some of the potential of bringing things together, coupled with a nice visualisation. I welcome any feedback.

Clustering taxonomic names

As part of my Quixotic attempt to construct a wiki of taxonomic names, I'm building a database of names and links. My current plan is to seed this with the NCBI taxonomy. What I want to do is flesh out the NCBI taxonomy with authorities and links to the original literature. At the moment the NCBI taxonomy is almost "nude", lacking links to the literature behind the names. As the magnificently bearded Geoffrey Bilder notes in an interview with Martin Fenner:
One way in which researchers assess the trustworthiness of content is by determining how it sits within the scholarly record. Does it provide evidence for its assertions in citations? Do other people cite it?

Given how important the NCBI taxonomy is, I think it would be a great improvement if each name could be linked to the original taxonomic publication. A first step to this is to find the taxonomic authority, the name of the author (or authors) of the name.

One potential source is uBio, which provides web services for retrieving information on names. Hence, an obvious approach is to map NCBI names to uBio names. However, if I use uBio's SOAP service I typically get multiple records for the same name. Some of these are due to homonyms (e.g., the same name used for a plant and an animal), but many are the same name with variations on the taxonomic authority. Much of this variation arises because uBio aggregates information from a wide range of databases, and each database differs in how it records the taxonomic authority.

For example, for the name "Diplura" (which I've discussed earlier) we get these names and authorities:
  • Diplura (Greene MS.) Allman 1864
  • Diplura Borner, 1904
  • Diplura C. L. Koch 1850
  • Diplura G. J. Hollenberg, 1969
  • Diplura Hollenb.
  • Diplura Jerdon 1864
  • Diplura Koch 1850
  • Diplura Koch 1851
  • Diplura Rambur 1866
  • Diplura Simon 1892
Before asking which of these names corresponds to "Diplura" in NCBI, I'd like to cluster these names into sets by merging names that are "the same." This resembles the problem of equivalent author names. The approach I'm using is to build a graph linking taxonomic authorities that are more similar than some threshold, then finding the components of that graph. For example, here is the graph for "Diplura":

The nodes in the graph are the taxonomic authorities, "cleaned" by making all the text lower case, and stripping any punctuation. The edges are labelled by the length of the longest common substring shared by the nodes that edge links (I ignore substrings less than four characters long). This graph groups the variations on Diplura Koch (a spider), and Diplura Hollenberg (a brown alga, see doi:10.1111/j.1529-8817.1969.tb02617.x).
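The clustering described above can be sketched in Python as follows. This is a sketch under stated assumptions, not the code I'm still playing with: it links two authorities when their longest common substring has at least four characters, then reads off the connected components with a small union-find. One step is an addition that is not part of the description above: the sketch drops the digits of the year before comparing, because otherwise shared fragments such as " 186" are themselves four characters long and link unrelated authorities (Allman 1864 and Rambur 1866, for instance). The function names are illustrative.

```python
import re
from itertools import combinations

authorities = [
    "(Greene MS.) Allman 1864", "Borner, 1904", "C. L. Koch 1850",
    "G. J. Hollenberg, 1969", "Hollenb.", "Jerdon 1864", "Koch 1850",
    "Koch 1851", "Rambur 1866", "Simon 1892",
]


def clean(authority):
    """Lower-case, strip punctuation and (my assumption) drop year digits."""
    text = re.sub(r"[^\w\s]|\d", "", authority.lower())
    return " ".join(text.split())


def longest_common_substring(a, b):
    """Length of the longest substring shared by a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            length = previous[j - 1] + 1 if char_a == char_b else 0
            current.append(length)
            best = max(best, length)
        previous = current
    return best


def cluster(strings, min_length=4):
    """Group strings into the connected components of the similarity graph."""
    cleaned = [clean(s) for s in strings]
    parent = list(range(len(strings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(strings)), 2):
        if longest_common_substring(cleaned[i], cleaned[j]) >= min_length:
            parent[find(i)] = find(j)

    groups = {}
    for index, original in enumerate(strings):
        groups.setdefault(find(index), []).append(original)
    return list(groups.values())


for group in cluster(authorities):
    print(group)
```

On the ten Diplura authorities this puts the three Koch variants in one cluster and the two Hollenberg variants in another, and leaves the other five authorities as singletons. Note that dropping the year also means "Koch 1850" and "Koch 1851" can no longer be told apart by date, which is the desired behaviour here but would not be for two different authors who share a surname.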

Not surprisingly, perhaps, the linkouts from the NCBI taxonomy for Diplura are a mess, with the algal genus (taxon:371965) linking to both plant and animal databases, and the insect class (taxon:29997) linking to a mix of plants and animals, not all of which are insects.

I'm still playing with the underlying code, but I might try and build a web service that returns name clusters (and perhaps the graph as well).