What would people want/expect from taxon searching on TreeBASE?

Rutger Vos asked on Twitter "What would people want/expect from taxon searching on TreeBASE?". This is a good question, and one which motivated the work I did on TBMap (see doi:10.1186/1471-2105-8-158), which developed a mapping between TreeBASE taxa and other databases. In that paper I published a table showing the effectiveness of string and hierarchical queries of TreeBASE. A string query returned just the studies that included the query term, a hierarchical query returned all studies that contained taxa that descend from a given node in the NCBI taxonomy.

A hierarchical query can be visualised as a range query. For example, the diagram below shows a classification where the descendants of Node A correspond to the range 1-8 (this is a simplification of visitation numbers, see my earlier post, and also Chen et al. doi:10.1186/1471-2148-8-90). The three trees can be represented as the ranges 3-8, 6-7, and 2-4, respectively. To find trees for the taxon corresponding to Node A we look for trees whose range intersects 1-8, to find trees corresponding to Node B we look for trees whose range intersects 1-5.


This approach retrieves a list of all trees that include a given taxon, but potentially this list could be very large (for example, a query for all plant trees could return 1000's of trees). So the question becomes how to order these trees? Some ideas:
  1. order by the number of taxa in the tree.
  2. order by the size of the range ("span") of the tree.
  3. order by set inclusion.

Ordering by size seems attractive, but will favour trees with more taxa over those with a greater taxonomic spread (which is favoured by ordering by the span of the tree). I used span in my challenge entry to display papers with related taxa.

What I have in mind for ordering by set inclusion is constructing a directed graph where each node represents a tree, and a pair of nodes (x, y) is connected by an edge xy if the taxa in tree y are nested inside the taxa in tree x. We could also introduce some additional nodes corresponding to nodes in the classification. If we then topologically sort the graph we have a linear order for the trees. Given that this order could be pre-computed independently of any queries (in much the same way that PageRank is), it could make for faster queries.

It would be useful to explore these and other ordering criteria. Perhaps the best approach would be some measure which combines one or more criteria, in which case we might want to use some form of rank aggregation (see iSpecies clones, and taxonomic intelligence for some links to the relevant literature).

Charting taxonomic knowledge

Nice paper by Robert Huber and Jens Klump has appeared in Computers & Geosciences entitled "Charting taxonomic knowledge through ontologies and ranking algorithms" (doi:10.1016/j.cageo.2008.02.016). The paper is not open access, but you can get some background from the post How TaxonRank works. Here's the abstract.

Since the inception of geology as a modern science, paleontologists have described a large number of fossil species. This makes fossilized organisms an important tool in the study of stratigraphy and past environments. Since taxonomic classifications of organisms, and thereby their names, change frequently, the correct application of this tool requires taxonomic expertise in finding correct synonyms for a given species name. Much of this taxonomic information has already been published in journals and books where it is compiled in carefully prepared synonymy lists. Because this information is scattered throughout the paleontological literature, it is difficult to find and sometimes not accessible. Also, taxonomic information in the literature is often difficult to interpret for non-taxonomists looking for taxonomic synonymies as part of their research.

The highly formalized structure makes Open Nomenclature synonymy lists ideally suited for computer aided identification of taxonomic synonyms. Because a synonymy list is a list of citations related to a taxon name, its bibliographic nature allows the application of bibliometric techniques to calculate the impact of synonymies and taxonomic concepts. TaxonRank is a ranking algorithm based on bibliometric analysis and Internet page ranking algorithms. TaxonRank uses published synonymy list data stored in TaxonConcept, a taxonomic information system. The basic ranking algorithm has been modified to include a measure of confidence on species identification based on the Open Nomenclature notation used in synonymy list, as well as other synonymy specific criteria.

The results of our experiments show that the output of the proposed ranking algorithm gives a good estimate of the impact a published taxonomic concept has on the taxonomic opinions in the geological community. Also, our results show that treating taxonomic synonymies as part of on an ontology is a way to record and manage taxonomic knowledge, and thus contribute to the preservation our scientific heritage.
Powered by Blogger.