A decadal view of biodiversity informatics: challenges and priorities

BMC Ecology has published Alex Hardisty and Dave Roberts' white paper on biodiversity informatics:

Hardisty, A., & Roberts, D. (2013). A decadal view of biodiversity informatics: challenges and priorities. BMC Ecology, 13(1), 16. doi:10.1186/1472-6785-13-16

Here are their 12 recommendations (with some comments of my own):

  1. Open Data should be normal practice and should embody the principles of being accessible, assessable, intelligible and usable.

    Seems obvious, but data providers are often reluctant to open "their" data up for reuse.
  2. Data encoding should allow analysis across multiple scales, e.g. from nanometers to planet-wide and from fractions of a second to millions of years, and such encoding schemes need to be developed. Individual data sets will have application over a small fraction of these scales, but the encoding schema needs to facilitate the integration of various data sets in a single analytical structure.

    No, I don't know what this means either, but I'm guessing it's relevant if we want to attempt this: doi:10.1038/493295a
  3. Infrastructure projects should devote significant resources to market the service they develop, specifically to attract users from outside the project-funded community, and ideally in significant numbers. To make such an investment effective, projects should release their service early and update often, in response to user feedback.

    Put simply, make something that is both useful and easy to use. Simples.
  4. Build a complete list of currently used taxon names with a statement of their interrelationships (e.g. this is a spelling variation; this is a synonym; etc.). This is a much simpler challenge than building a list of valid names, and an essential pre-requisite.

    One of the simplest tasks, first tackled successfully by uBio, now moribund. The Global Names project seems stalled, intent on drowning in acronym soup (GNA, GNI, GNUB, GNITE).
  5. Attach a Persistent Identifier (PID) to every resource so that they can be linked to one another. Part of the PID should be a common syntactic structure, such as ‘DOI: ...’ so that any instance can be simply found in a free-text search.

    DOIs have won the identifier wars, and everything citable (publications, figures, datasets) is acquiring one. The mistake to avoid is forgetting that identifiers need services built on top of them (see http://labs.crossref.org/ for some DOI-related tools). The core service we need is reverse lookup: given this thing (publication, specimen, etc.) what is its identifier?
  6. Implement a system of author identifiers so that the individual contributing a resource can be identified. This, in combination with the PID (above), will allow the computation of the impact of any contribution and the provenance of any resource.

    This is a solved problem, assuming ORCID continues to gain momentum. For past authors VIAF has identifiers (which are being incorporated into Wikipedia).
  7. Make use of trusted third-party authentication measures so that users can easily work with multiple resources without having to log into each one separately.

    Again, a solved problem. People routinely use third parties such as Google and Facebook for this purpose.
  8. Build a repository for classifications (classification bank) that will allow, in combination with the list of taxonomic names, automatic construction of taxonomies to close gaps in coverage.

    Let's not; let's focus on the only two classifications that actually matter because they are linked to data, namely GBIF and NCBI. If we want one classification to coalesce around, make it GBIF (NCBI will grow anyway).
  9. Develop a single portal for currently accepted names - one of the priority requirements for most users.

    Yup, we still haven't got this; we clearly didn't get the memo about point 3.
  10. Standards and tools are needed to structure data into a linked format by using the potential of vocabularies and ontologies for all biodiversity facets, including: taxonomy, environmental factors, ecosystem functioning and services, and data streams like DNA (up to genomics).

    The most successful vocabulary we've come up with (Darwin Core) is essentially an agreed way to label columns in Excel spreadsheets. I've argued elsewhere that focussing on vocabularies and ontologies distracts from the real prerequisite for linking stuff together, namely reusable identifiers (see 5). No point developing labels for links if you don't have the links.
  11. Mechanisms to evaluate data quality and fitness-for-purpose are required.

    Our data is inaccurate and full of holes, and we lack decent tools for visualising and fixing this (hence my interest in putting the GBIF classification into GitHub).
  12. A next-generation infrastructure is needed to manage ever-increasing amounts of observational data.

    Not our problem, see doi:10.1038/nature11875 (by which I mean lots of people need massive storage, so it will be solved)
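The "reverse lookup" service I'm asking for in point 5 can be sketched against CrossRef's public REST API. This is an illustration, not an endorsed implementation: the query URL uses the real api.crossref.org endpoint and its query.bibliographic parameter, but the sample response below is hand-made to show the shape of the result.

```python
from urllib.parse import urlencode

# CrossRef's public REST API; fetching the URL built here returns JSON with
# candidate works ranked by relevance.
CROSSREF_API = "https://api.crossref.org/works"

def reverse_lookup_url(citation, rows=1):
    """Given a free-text citation, build a query asking CrossRef for its DOI."""
    return CROSSREF_API + "?" + urlencode({"query.bibliographic": citation, "rows": rows})

def best_doi(response):
    """Extract the top-ranked DOI from a parsed CrossRef /works response."""
    items = response.get("message", {}).get("items", [])
    return items[0]["DOI"] if items else None

url = reverse_lookup_url("Hardisty & Roberts 2013 A decadal view of biodiversity informatics")

# A hand-made example of the response shape, to show the extraction step:
sample = {"message": {"items": [{"DOI": "10.1186/1472-6785-13-16"}]}}
print(best_doi(sample))  # 10.1186/1472-6785-13-16
```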
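To make the argument in point 10 concrete: once two datasets share an identifier, linking them is a trivial join, no ontology required. A toy sketch, in which both records and the taxon id value are invented for illustration:

```python
# Two hypothetical datasets: a GBIF-style occurrence record and a GenBank-style
# sequence record. Darwin Core supplies agreed column labels, but it is the
# shared identifier (an NCBI taxon id here; the value is made up) that joins them.
occurrences = [
    {"scientificName": "Atelopus nanay", "ncbiTaxonId": "9999", "country": "Ecuador"},
]
sequences = [
    {"accession": "XX000001", "ncbiTaxonId": "9999"},  # accession is invented
]

# Index occurrences by the shared identifier, then join the sequences to them.
by_taxon = {rec["ncbiTaxonId"]: rec for rec in occurrences}
linked = [
    (seq["accession"], by_taxon[seq["ncbiTaxonId"]]["scientificName"])
    for seq in sequences
    if seq["ncbiTaxonId"] in by_taxon
]
print(linked)  # [('XX000001', 'Atelopus nanay')]
```

Without the shared identifier there is nothing to join on, however rich the vocabulary.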

Food for thought. I suspect the gaggle of biodiversity informatics projects will seek to align themselves with some of these goals, carving up the territory. Sadly, we have yet to find a way to coalesce critical mass around tackling these challenges. It's a cliché, but I can't help thinking "what would Google do?" or, more precisely, "what would a Google of biodiversity look like?"

EOL challenge draft proposal

In the spirit of the Would you give me a grant experiment? [1] here's the draft of a proposal I'm working on for the Computable Data Challenge. It's an attempt to merge taxonomic names, the primary literature, and phylogenetics into one all-singing, all-dancing website that makes it easy to browse names, see the publications relevant to those names, and see what, if anything, we know about the phylogeny of those taxa. It builds on a number of other projects I've been working on, most recently my efforts to link names to the primary literature. Comments welcome (the proposal deadline is next week).

The proposal is embedded below using Google's PDF viewer, if you can't see it try logging into your Google account, or click here.



1. The answer from NERC was a resounding "no".

EOL Computable Data Challenge community

Now we are awash in challenges! EOL has announced its Computable Data Challenge:
We invite ideas for scientific research projects that use EOL, including the Biodiversity Heritage Library (BHL), to answer questions in biology. The specific field of biological interest for the challenge is open; projects in ecology, evolution, behavior, conservation biology, developmental biology, or systematics may be most appropriate. Projects advancing informatics alone may be less competitive. EOL may be used as a source of biological information, to establish a sampling strategy, to assist the retrieval of computable data by mapping identifiers across sources (e.g. to accomplish name resolution), and/or in other innovative ways. Projects involving data or text or image mining of EOL or BHL content are encouraged. Current EOL data and API shall be used; suggestions for modification of content or the API could be a deliverable of the project. We encourage the use of data not yet in EOL for analyses. In all cases projects must honor terms of use and licensing as appropriate.

Some $US 50,000 is on offer. "Challenge" is perhaps a misnomer, as EOL is offering this money not as a prize at the end, but rather to fund one or more proposals (submitted by 22 May) that are accepted. So, it's essentially a grant competition (with a pleasingly minimal amount of administrivia). There is also a Computable Data Challenge community to discuss the challenge.

It's great to see EOL trying different strategies to engage with developers. Of the different challenges EOL is running this one is perhaps the most appealing to me, because one of my biggest complaints about EOL is that it's hard to envisage "doing science" with it. For example, we can download GenBank and cluster sequences into gene families, or grab data from GBIF and model species distributions, but what could we do with EOL? This challenge will be a chance to explore the extent to which EOL can support science, which I would argue will be a key part of its long term future.

EOL Phylogenetic Tree Challenge

The Encyclopedia of Life have announced the EOL Phylogenetic Tree Challenge. The contest has two purposes:


It provides a testbed for the Evolutionary Informatics community to develop robust methods for producing, serving, and evaluating large, biologically meaningful trees that will be useful both to the research community and to broader audiences.

It enables the Encyclopedia of Life to organise the information it aggregates according to phylogenetic relationships; in other words, it provides a direct pipeline from research results to practical use.


First prize is a trip to iEvoBio 2012, this year in Ottawa, Canada. For more details visit the challenge website. There is also an EOL community devoted to this challenge.

Challenges are great things, especially ones with worthwhile tasks and decent prizes. EOL badly needs a phylogenetic perspective, so this is a welcome development.

But (there's always a but), I can't help feeling that we need something a little more radical. The tree of life isn't a tree. At deep levels it's a forest, and even at shallow levels things are a complicated tangle of gene trees. Sometimes the tree is clear, sometimes not, and some of this is real and some reflects our ignorance.

If you want a simple tree to navigate, then I'd argue that the NCBI tree is a pretty good start, and EOL already has this. What would be really cool is to have a way to navigate that makes it clear that phylogenetic knowledge has a degree of uncertainty, and that the "tree of life" might be better depicted as a set of overlapping trees. The mental image I have is of a collage of trees from different data sets, superimposed over each other, with perhaps an underlying consensus to help navigate. This visualisation could be zoomable, because in some ways the tree of life is fractal. Trees don't stop at species, as the wealth of barcoding and phylogeographic studies show. Given computational constraints (not to mention visualisation issues), I wonder whether there is an effective limit to the size of any one tree in terms of number of taxa. What varies is the taxonomic scope. So we could imagine a backbone tree based on slowly evolving genes; we zoom in and more trees appear, but at lower levels, and finally we hit populations and individuals: trees that may have hundreds of samples, but a very narrow scope.

This is all rather poorly articulated, but I can't help wondering whether a phylogenetic classification will end up distorting the very thing we're trying to depict. It also loses the connection with the underlying data (and trees), which for me is a huge drawback of existing classifications. There's no sense of why they are the way they are. There's a chance here to bring together ideas that have been kicking around in the phylogenetic community for a couple of decades and rethink how we navigate the "tree of life".

Final thoughts on TDWG RDF challenge

Quick final comment on the TDWG Challenge - what is RDF good for?. As I noted in the previous post, Olivier Rovellotti (@orovellotti) and Javier de la Torre (@jatorre) have produced some nice visualisations of the frog data set:
Nice as these are, I can't help feeling that they actually help make my point about the current state of RDF in biodiversity informatics. The only responses to my challenge have been to use geography, where the shared coordinate system (latitude and longitude) facilitates integration. Having geographic coordinates means we don't need to have shared identifiers to do something useful, and I think it's no accident that GBIF is one of the most important resources we have. Geography is also the easiest way to integrate across other fields (e.g., climate).

But what of the other dimensions? What I'm really after are links across datasets that enable us to make new inferences, or address interesting questions. The challenge is still there...

Reflections on the TDWG RDF "Challenge"

This is a follow up to my previous post TDWG Challenge - what is RDF good for?, where I'm being, frankly, a pain in the arse, and asking why we bother with RDF. In many ways I'm not particularly anti-RDF, but it bothers me that there's a big disconnect between the reasons we are going down this route and how we are actually using RDF. In other words, if you like RDF and buy the promise of large-scale data integration while still being decentralised ("the web as database"), then we're doing it wrong.

As an aside, my own perspective is one of data integration. I want to link all this stuff together so I can follow a path through multiple datasets and extract the information I want. In other words, "linked data" (little "l", little "d"). I'm interested in fairly lightweight integration, typically through shared identifiers. There is also integration via ontologies, which strikes me as a different, if related, problem, one that in many ways is closer to the original vision of the Semantic Web as a giant inference engine. I think the concerns (and experience) of these two communities are somewhat different. I don't particularly care about ontologies; I want key-value pairs and reusable identifiers so I can link stuff together. If, for example, you're working on something like Phenoscape, then I think you have a rather more circumscribed set of data, with potentially complicated interrelationships that you want to make inferences on, in which case ontologies are your friend.

So, I posted a "challenge". It wasn't a challenge so much as a set of RDF to play with. What I'm interested in is seeing how easily we can string this data together to learn stuff. For example, using the RDF I posted earlier here is a table listing the name, conservation status, publication DOI and date, and (where available) image from Wikipedia for frogs with sequences in GenBank.

| Species | Status | DOI | Year described | Image |
| --- | --- | --- | --- | --- |
| Atelopus nanay | CR | http://dx.doi.org/10.1655/0018-0831(2002)058[0229:TNSOAA]2.0.CO;2 | 2002 | |
| Eleutherodactylus mariposa | CR | http://dx.doi.org/10.2307/1466962 | 1992 | |
| Phrynopus kauneorum | CR | http://dx.doi.org/10.2307/1565993 | 2002 | |
| Eleutherodactylus eunaster | CR | http://dx.doi.org/10.2307/1563010 | 1973 | |
| Eleutherodactylus amadeus | CR | http://dx.doi.org/10.2307/1445557 | 1987 | |
| Eleutherodactylus lamprotes | CR | http://dx.doi.org/10.2307/1563010 | 1973 | |
| Churamiti maridadi | CR | http://dx.doi.org/10.1080/21564574.2002.9635467 | 2002 | |
| Eleutherodactylus thorectes | CR | http://dx.doi.org/10.2307/1445381 | 1988 | |
| Eleutherodactylus apostates | CR | http://dx.doi.org/10.2307/1563010 | 1973 | |
| Leptodactylus silvanimbus | CR | http://dx.doi.org/10.2307/1563691 | 1980 | |
| Eleutherodactylus sciagraphus | CR | http://dx.doi.org/10.2307/1563010 | 1973 | |
| Bufo chavin | CR | http://dx.doi.org/10.1643/0045-8511(2001)001[0216:NSOBAB]2.0.CO;2 | 2001 | |
| Eleutherodactylus fowleri | CR | http://dx.doi.org/10.2307/1563010 | 1973 | |
| Ptychohyla hypomykter | CR | http://dx.doi.org/10.2307/3672060 | 1993 | |
| Hyla suweonensis | DD | http://dx.doi.org/10.2307/1444138 | 1980 | |
| Proceratophrys concavitympanum | DD | http://dx.doi.org/10.2307/1565412 | 2000 | |
| Phrynopus bufoides | DD | http://dx.doi.org/10.1643/CH-04-278R2 | 2005 | |
| Boophis periegetes | DD | http://dx.doi.org/10.1111/j.1096-3642.1995.tb01427.x | 1995 | |
| Phyllomedusa duellmani | DD | http://dx.doi.org/10.2307/1444649 | 1982 | |
| Boophis liami | DD | http://dx.doi.org/10.1163/156853803322440772 | 2003 | |
| Hyalinobatrachium ignioculus | DD | http://dx.doi.org/10.1670/0022-1511(2003)037[0091:ANSOHA]2.0.CO;2 | 2003 | |
| Proceratophrys cururu | DD | http://dx.doi.org/10.2307/1447712 | 1998 | |
| Amolops bellulus | DD | http://dx.doi.org/10.1643/0045-8511(2000)000[0536:ABANSO]2.0.CO;2 | 2000 | |
| Centrolene bacatum | DD | http://dx.doi.org/10.2307/1564528 | 1994 | |
| Litoria kumae | DD | http://dx.doi.org/10.1071/ZO03008 | 2004 | |
| Phrynopus pesantesi | DD | http://dx.doi.org/10.1643/CH-04-278R2 | 2005 | |
| Gastrotheca galeata | DD | http://dx.doi.org/10.2307/1443617 | 1978 | |
| Paratelmatobius cardosoi | DD | http://dx.doi.org/10.2307/1447976 | 1999 | |
| Rhacophorus catamitus | DD | http://dx.doi.org/10.1655/0733-1347(2002)016[0046:NAPKPF]2.0.CO;2 | 2002 | |
| Huia melasma | DD | http://dx.doi.org/10.1643/CH-04-137R3 | 2005 | |
| Telmatobius vilamensis | DD | http://dx.doi.org/10.1655/0018-0831(2003)059[0253:ANSOTA]2.0.CO;2 | 2003 | |
| Callulina kisiwamsitu | EN | http://dx.doi.org/10.1670/209-03A | 2004 | |
| Arthroleptis nikeae | EN | http://dx.doi.org/10.1080/21564574.2003.9635486 | 2003 | |
| Eleutherodactylus amplinympha | EN | http://dx.doi.org/10.1139/z94-297 | 1994 | |
| Eleutherodactylus glaphycompus | EN | http://dx.doi.org/10.2307/1563010 | 1973 | |
| Bufo tacanensis | EN | http://dx.doi.org/10.2307/1439700 | 1952 | |
| Phrynopus bracki | EN | http://dx.doi.org/10.2307/1445826 | 1990 | |
| Telmatobius sibiricus | EN | http://dx.doi.org/10.1655/0018-0831(2003)059[0127:ANSOTF]2.0.CO;2 | 2003 | |
| Cochranella mache | EN | http://dx.doi.org/10.1655/03-74 | 2004 | |
| Eleutherodactylus melacara | EN | http://dx.doi.org/10.2307/1466962 | 1992 | |
| Plectrohyla glandulosa | EN | http://dx.doi.org/10.2307/1441046 | 1964 | |
| Aglyptodactylus laticeps | EN | http://dx.doi.org/10.1111/j.1439-0469.1998.tb00775.x | 1998 | |
| Eleutherodactylus glamyrus | EN | http://dx.doi.org/10.2307/1565664 | 1997 | |
| Gastrotheca trachyceps | EN | http://dx.doi.org/10.2307/1564375 | 1987 | |
| Eleutherodactylus grahami | EN | http://dx.doi.org/10.2307/1563929 | 1979 | |
| Litoria havina | LC | http://dx.doi.org/10.1071/ZO9930225 | 1993 | |
| Crinia riparia | LC | http://dx.doi.org/10.2307/1440794 | 1965 | |
| Litoria longirostris | LC | http://dx.doi.org/10.2307/1443159 | 1977 | |
| Osteocephalus mutabor | LC | http://dx.doi.org/10.1163/156853802320877609 | 2002 | |
| Leptobrachium nigrops | LC | http://dx.doi.org/10.2307/1440966 | 1963 | |
| Pseudis tocantins | LC | http://dx.doi.org/10.1590/S0101-81751998000400011 | 1998 | |
| Mantidactylus argenteus | LC | http://dx.doi.org/10.1111/j.1096-3642.1919.tb02128.x | 1919 | |
| Aglyptodactylus securifer | LC | http://dx.doi.org/10.1111/j.1439-0469.1998.tb00775.x | 1998 | |
| Pseudis cardosoi | LC | http://dx.doi.org/10.1163/156853800507264 | 2000 | |
| Uperoleia inundata | LC | http://dx.doi.org/10.1071/AJZS079 | 1981 | |
| Litoria pronimia | LC | http://dx.doi.org/10.1071/ZO9930225 | 1993 | |
| Litoria paraewingi | LC | http://dx.doi.org/10.1071/ZO9760283 | 1976 | |
| Philautus aurifasciatus | LC | http://dx.doi.org/10.1163/156853887X00036 | 1987 | |
| Proceratophrys avelinoi | LC | http://dx.doi.org/10.1163/156853893X00156 | 1993 | |
| Osteocephalus deridens | LC | http://dx.doi.org/10.1163/156853800507525 | 2000 | |
| Gephyromantis boulengeri | LC | http://dx.doi.org/10.1111/j.1096-3642.1919.tb02128.x | 1919 | |
| Crossodactylus caramaschii | LC | http://dx.doi.org/10.2307/1446907 | 1995 | |
| Rana yavapaiensis | LC | http://dx.doi.org/10.2307/1445338 | 1984 | |
| Boophis lichenoides | LC | http://dx.doi.org/10.1163/156853898X00025 | 1998 | |
| Megistolotis lignarius | LC | http://dx.doi.org/10.1071/ZO9790135 | 1979 | |
| Ansonia endauensis | NE | http://dx.doi.org/10.1655/0018-0831(2006)62[466:ANSOAS]2.0.CO;2 | 2006 | |
| Ansonia kraensis | NE | http://dx.doi.org/10.2108/zsj.22.809 | 2005 | |
| Arthroleptella landdrosia | NT | http://dx.doi.org/10.2307/1565359 | 2000 | |
| Litoria jungguy | NT | http://dx.doi.org/10.1071/ZO02069 | 2004 | |
| Phrynobatrachus phyllophilus | NT | http://dx.doi.org/10.2307/1565925 | 2002 | |
| Philautus ingeri | VU | http://dx.doi.org/10.1163/156853887X00036 | 1987 | |
| Gastrotheca dendronastes | VU | http://dx.doi.org/10.2307/1445088 | 1983 | |
| Hyperolius cystocandicans | VU | http://dx.doi.org/10.2307/1443911 | 1977 | |
| Boophis sambirano | VU | http://dx.doi.org/10.1080/21564574.2005.9635520 | 2005 | |
| Ansonia torrentis | VU | http://dx.doi.org/10.1163/156853883X00021 | 1983 | |
| Telmatobufo australis | VU | http://dx.doi.org/10.2307/1563086 | 1972 | |
| Stefania coxi | VU | http://dx.doi.org/10.1655/0018-0831(2002)058[0327:EDOSAH]2.0.CO;2 | 2002 | |
| Oreolalax multipunctatus | VU | http://dx.doi.org/10.2307/1564828 | 1993 | |
| Eleutherodactylus guantanamera | VU | http://dx.doi.org/10.2307/1466962 | 1992 | |
| Spicospina flammocaerulea | VU | http://dx.doi.org/10.2307/1447757 | 1997 | |
| Cycloramphus acangatan | VU | http://dx.doi.org/10.1655/02-78 | 2003 | |
| Leiopelma pakeka | VU | http://dx.doi.org/10.1080/03014223.1998.9517554 | 1998 | |
| Rana okaloosae | VU | http://dx.doi.org/10.2307/1444847 | 1985 | |
| Phrynobatrachus uzungwensis | VU | http://dx.doi.org/10.1163/156853883X00030 | 1983 | |


This is a small fraction of the frog species actually in GenBank because I've filtered it down to those that have been linked to Wikipedia (from where we get the conservation status) and which were described in papers with DOIs (from which we get the date of description).

I generated this result using this SPARQL query on a triple store that had the primary data sources (Uniprot, Dbpedia, CrossRef, ION) loaded, together with the all-important "glue" datasets that link ION to CrossRef, and Uniprot to Dbpedia (see previous post for details):


PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX uniprot: <http://purl.uniprot.org/core/>
PREFIX tdwg_tn: <http://rs.tdwg.org/ontology/voc/TaxonName#>
PREFIX tdwg_co: <http://rs.tdwg.org/ontology/voc/Common#>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?name ?status ?doi ?date ?thumbnail
WHERE {
  ?ncbi uniprot:scientificName ?name .
  ?ncbi rdfs:seeAlso ?dbpedia .
  ?dbpedia dbpedia-owl:conservationStatus ?status .
  ?ion tdwg_tn:nameComplete ?name .
  ?ion tdwg_co:publishedInCitation ?doi .
  ?doi dcterms:date ?date .

  OPTIONAL {
    ?dbpedia dbpedia-owl:thumbnail ?thumbnail
  }
}
ORDER BY ASC(?status)


This table doesn't tell us a great deal, but we could, for example, graph date of description against conservation status (CR = critically endangered, EN = endangered, VU = vulnerable, NT = near threatened, LC = least concern, DD = data deficient):
In other words, is it the case that more recently described species are more likely to be endangered than taxa we've known about for some time (based on the assumption that we've found all the common species already)? We could imagine extending this query to retrieve sequences for a class of frog (e.g., critically endangered) so we could compute a measure of population genetic variation, etc. We shouldn't take the graph above too seriously because it's based on a small fraction of the data, but you get the idea. As more frog taxonomy goes online (there's a lot of stuff in BHL and BioStor, for example) we could add more dates and build a dataset worth analysing properly.
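As a sketch of the sort of computation involved, here is a minimal pure-Python version using a handful of (status, year described) pairs taken from the table above; a real analysis would of course use all the rows:

```python
from statistics import median

# A few (conservation status, year described) pairs from the table above.
records = [
    ("CR", 2002), ("CR", 1973), ("CR", 1992),
    ("EN", 2004), ("EN", 1952), ("EN", 1994),
    ("LC", 1919), ("LC", 1965), ("LC", 1993),
]

# Group years of description by status category.
by_status = {}
for status, year in records:
    by_status.setdefault(status, []).append(year)

# Median year of description per status category - the numbers behind the chart.
medians = {status: median(years) for status, years in by_status.items()}
for status in sorted(medians):
    print(status, medians[status])
```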

It seems to me that these should be fairly simple things to do, yet they are the sort of thing that if we attempt today it's a world of hurt involving scripts, Excel, data cleaning, etc. before we can do the science.

The thing is, without the "glue" files mapping identifiers across different databases even this simple query isn't possible. Obviously we have no say in how many organisations publish RDF, but within the biodiversity informatics community we should make every effort to use external identifiers wherever possible so that we can make these links. This is the core of my complaint. If we are using RDF to foster data integration so we can query across the diverse data sets that speak to biodiversity, then we are doing it wrong.

Update
Here is a nice visualisation of this dataset from @orovellotti (original here), made using ecoRelevé:


Suggested apps for BHL's Life and Literature Code Challenge


Since I won't be able to be at the Biodiversity Heritage Library's Life and Literature meeting I thought I'd share some ideas for their Life and Literature Code Challenge. The deadline is pretty close (October 17) so having ideas now isn't terribly helpful, I admit. That aside, here are some thoughts inspired by the challenge. In part this post has been inspired by the Results of the PLoS and Mendeley "Call for Apps", where PLoS and Mendeley asked for people (not necessarily developers) to suggest the kind of apps they'd like to see. As an aside, one thing conspicuous by its absence is a prize for winning the challenge. PLoS and Mendeley have an "API Binary Battle" with a prize of $US 10,001, which seems more likely to inspire people to take part.

Visual search engine
I suspect that many BHL users are looking for illustrations (exemplified by the images being gathered in BHL's Flickr group). One way to search for images would be to search within the OCR text for figure and plate captions, such as "Fig. 1". Indexing these captions by taxonomic name would provide a simple image search tool. For modern publications most figures are on the same page as the caption, but for older publications with illustrations as plates, the caption and corresponding image may be separated (e.g., on facing pages), so the search results might need to show pages around the page containing the caption. As an aside, it's a pity the Flickr images only link to the BHL item and not the BHL page. If they did the latter, and the images were tagged with what they depict, you could create a visual search engine using the Flickr API (of course, this might be just the way to implement the visual search engine — harvest images, tag them with the PageID and taxon names, upload to Flickr).
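The caption-spotting step could start as simply as a regular expression over the OCR text; a rough sketch (the OCR sample is invented):

```python
import re

# Invented OCR text standing in for a BHL page.
ocr = """On some frogs from Madagascar.
Fig. 1. Boophis liami, dorsal view.
Some running text about the species.
Plate XII. Adult male, lateral view."""

# Lines that look like figure or plate captions: "Fig. 1. ..." or "Plate XII. ..."
caption = re.compile(r"^(Fig\.|Plate)\s+(\d+|[IVXLC]+)\.\s*(.*)$", re.MULTILINE)
matches = caption.findall(ocr)
for kind, number, text in matches:
    print(kind, number, "->", text)
```

Real OCR is far messier than this, so a production version would need fuzzier matching (OCR often mangles "Fig." into "Pig." or "F1g."), but the indexing idea is the same.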

Mobile interface
The BHL web site doesn't look great on an iPhone. It makes no concessions to the mobile device, and there are some weird things such as the way the list of pages is rendered. A number of mainstream science publishers are exploring mobile versions of their web sites, for example Taylor and Francis have a jQuery Mobile powered interface for mobile users. I've explored iPad interfaces to scientific articles in previous posts. BHL content poses some challenges, but is fundamentally the same as viewing PDFs — you have fixed pages that you may want to zoom.

OCR correction
There is a lot of scope for cleaning up the OCR text in BHL. Part of the trick would be to have a simple user interface for people to contribute to this task. In an earlier post I discussed a Firefox hOCR add-on that provides a nice way to do this. Take this as a starting point, add a way to save the cleaned-up text, and you'd be well on the way to making a useful tool.

Taxon name timeline
Despite the shiny new interface, the Encyclopedia of Life still displays BHL literature in the same clunky way I described in an earlier blog post. It would be great to have a timeline of the usage of a name, especially if you could compare the usage of different names (such as synonyms). In many ways this is the BHL equivalent of the Google Books Ngram viewer.
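The underlying computation is just counting name occurrences per publication year; a toy sketch with invented page data:

```python
from collections import Counter

# Invented (publication year, names found on page) pairs, standing in for the
# output of matching taxon names against BHL OCR text.
pages = [
    (1919, ["Mantidactylus argenteus"]),
    (1965, ["Crinia riparia", "Mantidactylus argenteus"]),
    (1993, ["Mantidactylus argenteus"]),
    (1993, ["Crinia riparia"]),
]

def timeline(name, pages):
    """Number of pages per year mentioning `name`: the raw data for an Ngram-style plot."""
    return Counter(year for year, names in pages if name in names)

print(timeline("Mantidactylus argenteus", pages))
```

Comparing two names (say, a name and its synonym) is then just plotting two such counters on the same axis.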

These are just a few hastily put together thoughts. If you have any other ideas or suggestions, feel free to add them as comments below.

- Posted using BlogPress from my iPad

The Mendeley API Binary Battle - win $US 10,001

Now we'll bring the awesome. Mendeley have announced The Mendeley API Binary Battle, with a first prize of $US 10,001, and some very high-profile judges (Juan Enriquez, Tim O'Reilly, James Powell, Werner Vogels, and John Wilbanks). Deadline for submission is August 31st 2011, with the results announced in October.

The criteria for judging are:
  1. How active is your application? We’ll look at your API key usage.

  2. How viral is the app? We’ll look at the number of sign ups on Mendeley and/or your application, and we’ll also have an eye on Twitter.

  3. Does the application increase collaboration and/or transparency? We’ll look at how much your application contributes to making science more open.

  4. How cool is your app? Does it make our jaws drop? Is it the most fun that you can have with your pants on? Is it making use of Facebook, Twitter, etc.?

  5. The Binary Battle is open to apps built previous to this announcement.


Start your engines...

Mendeley mangles my references: phantom documents and the problem of duplicate references

One issue I'm running into with Mendeley is that it can create spurious documents, mangling my references in the process. This appears to be due to some over-zealous attempts to de-duplicate documents. Duplicate documents are the number one problem faced by Mendeley, and have been discussed in some detail by Duncan Hull in his post How many unique papers are there in Mendeley?. Duncan focussed on the case where the same article may appear multiple times in Mendeley's database, which will inflate estimates of how many distinct references the database contains. It also has implications for metrics derived from Mendeley, such as those displayed by ReaderMeter.

In this post I discuss the reverse problem, combining two or more distinct references into one. I've been uploading large collections of references based on harvesting metadata for journal articles. Although the metadata isn't perfect, it's usually pretty good, and in many cases linked to Open Access content in BioStor. References that I upload appear in public groups listed on my profile, such as the group Proceedings of the Entomological Society of Washington.

Reverse engineering Mendeley
In the absence of a good description by Mendeley of how their tools work, we have to try and figure it out ourselves. If you click on a reference that has been recently added to Mendeley you get a URL that looks like this: http://www.mendeley.com/c/3708087012/g/584201/magalhaes-2008-a-new-species-of-kingsleya-from-the-yanomami-indians-area-in-the-upper-rio-orinoco-venezuela-crustacea-decapoda-brachyura-pseudothelphusidae/ where 584201 is the group id, 3708087012 is the "remoteId" of the document (this is what it's called in the SQLite database that underlies the desktop client), and the rest of the URL is the article title, minus stop words.

After a while (perhaps a day or so) Mendeley gets around to trying to merge the references I've added with those it already knows about, and the URLs lose the group and remoteId and look like this: http://www.mendeley.com/research/review-genus-saemundssonia-timmerman-phthiraptera-philopteridae-alcidae-aves-charadriiformes-including-new-species-new-host/ . Let's call this document the "canonical document" (this document also has a UUID, which is what the Mendeley API uses to retrieve the document). Once the document gets one of these URLs Mendeley will also display how many people are "reading" that document, and whether anyone has tagged it.
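The URL structure described above can be checked with a quick parse; the regex below is written from the pattern as I understand it, not from any Mendeley documentation:

```python
import re

url = ("http://www.mendeley.com/c/3708087012/g/584201/"
       "magalhaes-2008-a-new-species-of-kingsleya-from-the-yanomami-indians-area-"
       "in-the-upper-rio-orinoco-venezuela-crustacea-decapoda-brachyura-"
       "pseudothelphusidae/")

# Pattern: /c/<remoteId>/g/<groupId>/<title-slug>/
m = re.search(r"/c/(\d+)/g/(\d+)/([^/]+)/", url)
remote_id, group_id, slug = m.groups()
print(remote_id, group_id)  # 3708087012 584201
```

Once a document is merged into a canonical record, the /c/.../g/... components disappear and this parse no longer applies.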

But that's not my paper!
The problem is that sometimes (and more often than I'd like) the canonical document bears little relation to the document I uploaded. For example, here is a paper that I uploaded to the group Proceedings of the Entomological Society of Washington:

Review of the genus Saemundssonia Timmermann (Phthiraptera: Philopteridae) from the Alcidae (Aves: Charadriiformes), including a new species and new host records by Roger D Price, Ricardo L Palma, Dale H Clayton, Proceedings of the Entomological Society of Washington, 105(4):915-924 (2003).


You can see the actual paper in BioStor: http://biostor.org/reference/57185. To see the paper in the Mendeley group, browse it using the tag Phthiraptera:



Note the 2, indicating that two people (including myself) have this paper in their library. The URL for this paper is http://www.mendeley.com/research/review-genus-saemundssonia-timmerman-phthiraptera-philopteridae-alcidae-aves-charadriiformes-including-new-species-new-host/, but this is not the paper I added!

What Mendeley displays for this URL is this:


Not only is this not the paper I added, there is no such paper! There is a paper entitled "A new genus and a new species of Daladerini (Hemiptera: Heteroptera: Coreidae) from Madagascar", but that is by Harry Brailovsky, not Clayton and Price (you can see this paper in BioStor as http://biostor.org/reference/55669). The BioStor link for the phantom paper displayed by Mendeley, http://biostor.org/reference/55761, is for a third paper "A review of ground beetle species (Coleoptera: Carabidae) of Minnesota, United States : New records and range extensions". The table below shows the original details for the paper, the details for the "canonical paper" created by Mendeley, and the details for two papers that have some of the bibliographic details in common with this non-existent paper (highlighted in bold).

| Field | Original paper | Mendeley "canonical" paper | Daladerini paper (Brailovsky) | Carabidae paper |
| --- | --- | --- | --- | --- |
| Title | Review of the genus Saemundssonia Timmermann (Phthiraptera: Philopteridae) from the Alcidae (Aves: Charadriiformes), including a new species and new host records | A new genus and a new species of Daladerini (Hemiptera: Heteroptera: Coreidae) from Madagascar | A new genus and a new species of Daladerini (Hemiptera: Heteroptera: Coreidae) from Madagascar | A review of ground beetle species (Coleoptera: Carabidae) of Minnesota, United States: New records and range extensions |
| Author(s) | Roger D Price, Ricardo L Palma, Dale H Clayton | DH Clayton, RD Price | Harry Brailovsky | |
| Volume | 105 | 105 | 104 | 107 |
| Pages | 915-924 | 915-924 | 111-118 | 917-940 |
| BioStor | 57185 | 55761 | 55669 | 55761 |

As you can see, it's a bit of a mess. Now, finding and merging duplicates is a hard problem (see doi:10.1145/1141753.1141817 for some background), but I'm struggling to see why these documents were considered to be duplicates.
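To get a feel for why these records should never have been merged, here is a crude, standard-library sketch of bibliographic similarity scoring. The field names, weights, and threshold are my own invention (not Mendeley's algorithm, which is unpublished), but they illustrate the point: if the title is weighted sensibly, matching volume and page numbers alone cannot drag two unrelated papers together.

```python
from difflib import SequenceMatcher

def field_sim(a, b):
    """Normalised string similarity in [0, 1]."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_sim(r1, r2, weights=None):
    """Weighted similarity between two bibliographic records (dicts).
    The weights are illustrative; the title should dominate."""
    weights = weights or {"title": 0.6, "authors": 0.2,
                          "volume": 0.1, "pages": 0.1}
    return sum(w * field_sim(r1.get(f, ""), r2.get(f, ""))
               for f, w in weights.items())

louse = {"title": "Review of the genus Saemundssonia Timmermann from the Alcidae",
         "authors": "Price, Palma, Clayton", "volume": "105", "pages": "915-924"}
bug = {"title": "A new genus and a new species of Daladerini from Madagascar",
      "authors": "Brailovsky", "volume": "104", "pages": "111-118"}

# The titles share almost nothing, so a title-weighted score stays low
# and the two papers would not be merged.
print(record_sim(louse, bug))
```

A real matcher would of course normalise author names and tokenise titles, but even this toy version refuses to merge the louse and bug papers.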

What I'd like to see
I'm a big fan of Mendeley, so I'd like to see this problem fixed. What I'd really like to see is the following:
  1. Mendeley publish a description of how their de-duplication algorithms work.

  2. Mendeley describe the series of steps a document goes through as they process it (if nothing else, so that users can make sense of the multiple URLs a document may acquire over its lifetime in Mendeley).

  3. For each canonical reference, Mendeley show the set of documents that have been merged to create it, and display some measure of their confidence that the match is genuine.

  4. Mendeley enable users to provide feedback on a canonical document (e.g., a button by each document in the set that lets the user say "yes, this is a match" or "no, this isn't a match").


Perhaps what would be useful is for Mendeley (or the community) to assemble a test collection of documents containing duplicates, together with the set of canonical documents the collection actually represents, and use this to evaluate alternative algorithms for finding duplicates. Let's make this a "challenge" with prizes! In many ways I'd be much more impressed by a deduplication challenge than the DataTEL challenge, especially as it seems clear that Mendeley readership data is too sparse to generate useful recommendations (see Mendeley Data vs. Netflix Data).
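Such a challenge would need a scoring rule, and pairwise precision and recall against a hand-checked gold standard is the usual choice for deduplication. A minimal sketch (the record labels and clusterings below are made up for illustration):

```python
from itertools import combinations

def pairs(clusters):
    """All unordered pairs of records that a clustering calls duplicates."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}

def score(predicted, gold):
    """Pairwise precision and recall of a predicted duplicate clustering."""
    p, g = pairs(predicted), pairs(gold)
    tp = len(p & g)
    precision = tp / len(p) if p else 1.0
    recall = tp / len(g) if g else 1.0
    return precision, recall

# Gold standard: records a and b are the same paper; c is distinct.
gold = [["a", "b"], ["c"]]
# An over-eager matcher that merges everything:
greedy = [["a", "b", "c"]]
print(score(greedy, gold))
```

The over-eager matcher gets perfect recall but poor precision, which is exactly the failure mode the Saemundssonia example suggests.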


Elsevier Grand Challenge paper out

At long last the peer-reviewed version of the paper "Enhanced display of scientific articles using extended metadata" (doi:10.1016/j.websem.2010.03.004), in which I describe my entry in the Elsevier Grand Challenge, has appeared in the journal Web Semantics: Science, Services and Agents on the World Wide Web. The pre-print version of this paper had been online (hdl:10101/npre.2009.3173.1) for a year prior to the appearance of the published version (24 April 2009 versus 3 April 2010), and the Challenge entry itself went online in December 2008. Unfortunately the published version has an awful typo in the title (one that was in neither the manuscript nor the proofs).

Given this typo, the time lag between doing the work, writing the manuscript, and seeing it published, and the fact that I've already been to meetings where my invitation was based on the entry and the pre-print, I do wonder why on Earth I would bother with traditional publication (which is somewhat ironic, given the topic of the paper).

e-Biosphere '09: Twitter rules, and all that


So, e-Biosphere '09 is over (at least for plebs like me; the grown-ups get to spend two days charting the future of biodiversity informatics). It was an interesting event, on several levels. It's late, and I'm shattered, so this post will cover only a few things.

This was the first conference I'd attended where some of the participants twittered during proceedings. A bunch of us settled on the hashtag #ebio09 (you can also see the tweets at search.twitter.com). For the uninitiated, a "hashtag" is a string preceded by a hash symbol (#), to indicate that it is a tag, such as #fail. It provides a simple way to tag tweets so that others interested in that topic can find them.

Twittering created a whole additional layer to the conference. We were able to:

Twitter greatly enhanced the conversation, noticeably when a speaker said something controversial (all too rare, sadly), or when a group rapporteur's summary didn't reflect all the views in that group. It also helped document what was going on, and this can be further exploited. For fun, I grabbed tweets from days 2 and 3 and made a wordle:
As @edwbaker noted, "The size of 'together', 'people' & 'visionary' is somewhat telling...". In case you're wondering about the prominence of "Knowlton", it's because Nancy Knowlton gave a nice talk highlighting the ever-increasing number of cases where we have no names for the things we are encountering (for example, when barcoding fresh samples from poorly studied environments). This is just one example of the huge disconnect between the obsession with taxonomic names in biodiversity informatics and the reality of metagenomics and DNA barcoding.

Just as worrying is the lack of resemblance between the taxonomic classification used by the Encyclopedia of Life and our notion of the evolutionary tree of those organisms. A systematist would find much of EOL's classification laughable. I don't want to bash EOL, but it's worrying that they can continue to crank out press releases, yet fail to provide something like a modern classification.

But I digress. In many ways this was less of a scientific conference and more of an event to birth a discipline, namely "biodiversity informatics" (which I'm sure some would claim has been around for quite a while). So, the event was to attract attention to the topic, and assure the outside world (and those attending) that the field exists and has something to say. It also was billed as a forum to discuss strategies for its future. Sadly, much of this discussion will take place behind closed doors, and will feature the major players who bring money and influence (but not much innovation) to the table.

Symptomatic of this lack of innovation, in a sense, was the contrast between the official "Online Conference Community" and the Twitter feed. When I asked if anybody on Twitter had used the official forum, @fak3r replied tellingly: "@rdmpage thought we were on it ;) #ebio09". As fun as it is to use the new hotness to conduct a parallel (and slightly subversive) discussion at a conference, it's worrying that, in a field that calls itself "informatics", the big beasts probably had little idea what was going on. If we are going to exploit the tools the web provides, we need people who "get it", and I'm unconvinced that the big players in this area truly grasp the web (in all its forms). There's also a worrying degree of physics envy, which might be cured by reading The Unreasonable Effectiveness of Data (doi:10.1109/mis.2009.36).

I tried to stir things up a little (almost literally, as captured in this photo by Chris Freeland), with a couple of questions, but to not much effect (other than apparently driving the poor chap behind me to despair).


But enough grumbling. It was great to see lots of people attending the event, there were lots of interesting posters and booths (creating a market for this field may go some way towards providing an incentive to provide better, more reliable services), and my challenge entry won joint first prize, so perhaps I should sit back, enjoy the wine Joel Sachs chose as the prize (many thanks for his efforts in putting the challenge event together), and let others say what they thought of the meeting.

e-Biosphere 09 Challenge slides

I've put the slides for my e-Biosphere 09 challenge entry on SlideShare.

Not much information on the other entries yet, except for the eBiosphere Citizen Science Challenge by Joel Sachs and colleagues, which will demonstrate a "global human sensor net". Their plan is to aggregate observations posted on Flickr, Twitter, Spotter, and email. It might be fun to make use of some of this for my own entry (my entry already does, by default, because we are both using EOL's Flickr pool).

e-Biosphere Challenge: visualising biodiversity digitisation in real time

e-Biosphere '09 kicks off next week, and features the challenge:
Prepare and present a real-time demonstration during the days of the Conference of the capabilities in your community of practice to discover, disseminate, integrate, and explore new biodiversity-related data by:
  • Capturing data in private and public databases;
  • Conducting quality assurance on the data by automated validation and/or peer review;
  • Indexing, linking and/or automatically submitting the new data records to other relevant databases;
  • Integrating the data with other databases and data streams;
  • Making these data available to relevant audiences;
  • Make the data and links to the data widely accessible; and
  • Offering interfaces for users to query or explore the data.


Originally I planned to enter the wiki project I've been working on for a while, but time was running out and the deadline was too ambitious. Hence, I switched to thinking about RSS feeds. The idea was to first create a set of RSS feeds for sources that lack them, which I've been doing over at http://bioguid.info/rss, then integrate these feeds in a useful way. For example, the feeds would include images from Flickr (such as EOL's pool), geotagged sequences from GenBank, the latest papers from Zootaxa, and new names from uBio (I'd hoped to include ION as well, but they've been spectacularly hacked).

After playing with triple stores and SPARQL (incompatible vocabularies and multiple identifiers rather bugger this approach), and visualisations based on Google Maps (building on my swine flu timemap), it dawned on me that what I really needed was an eye-catching way of displaying geotagged, timestamped information, just like David Troy's wonderful twittervision and flickrvision.com. In particular, David took the Poly9 Globe and added Twitter and Flickr feeds (see twittervision 3D and flickrvision 3D). So, I hacked David's code and created this, which you can view at http://bioguid.info/ebio09/www/3d/:



It's a lot easier to simply look at it rather than describe what it does, but here's a quick sketch of what's under the hood.

Firstly, I take RSS feeds, either the raw geoFeed from Flickr, or from http://bioguid.info/rss. The bioGUID feeds include the latest papers in Zootaxa (most new animal species are described in this journal), a modified version of uBio's new names feed, and a feed of the latest geotagged sequences in GenBank (I'd hoped to use only DNA barcodes, but it turns out rather few barcode sequences are geotagged, and few have the "BARCODE" keyword). The Flickr feeds are simple to handle because they include locality information (latitude, longitude, and Yahoo Where-on-Earth Identifiers (WOEIDs)). Similarly, the GenBank feed I created has latitudes and longitudes (although extracting these isn't always as straightforward as it should be). Other feeds require more processing. The uBio feed already has taxonomic names but no geotagging, so I use services from Yahoo! GeoPlanet™ to find localities from article titles. For the Zootaxa feed that I created, I use uBio's SOAP service to extract taxonomic names, and Yahoo! GeoPlanet™ to extract localities.
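As a sketch of the kind of processing involved, here is how one might pull coordinates out of a feed whose entries carry the W3C geo vocabulary, using only the Python standard library. The sample entry below is invented for illustration (real Flickr and GenBank feeds carry many more fields), but the namespaces are the real Atom and W3C geo ones:

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
}

# A made-up, minimal geotagged Atom feed for illustration.
feed = """<feed xmlns="http://www.w3.org/2005/Atom"
  xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
  <entry>
    <title>Copepod photo</title>
    <geo:lat>55.8700</geo:lat>
    <geo:long>-4.2900</geo:long>
  </entry>
</feed>"""

def geotagged_items(xml_text):
    """Yield (title, lat, long) for each entry carrying geo coordinates."""
    root = ET.fromstring(xml_text)
    for entry in root.findall("atom:entry", NS):
        lat = entry.findtext("geo:lat", namespaces=NS)
        lon = entry.findtext("geo:long", namespaces=NS)
        if lat is not None and lon is not None:
            yield entry.findtext("atom:title", namespaces=NS), float(lat), float(lon)

print(list(geotagged_items(feed)))
```

Entries that come out of this kind of loop can be dropped straight onto a map or globe; the harder work, as described above, is geocoding the feeds that carry no coordinates at all.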


I've tried to create a useful display popup. For Zootaxa papers you get a thumbnail of the paper, and where possible an icon of the taxonomic group the paper talks about (the presence of this icon depends on the success of uBio's taxonomic name finding service, the Catalogue of Life having the same name, and my having a suitable icon). The example above shows a paper about copepods. Other papers have an icon for the journal (again, a function of my being able to determine the journal ISSN and having a suitable icon). Flickr images simply display a thumbnail of the image.

What does it all mean? Well, I could say all sorts of things about integration and mash-ups but, dammit, it's pretty. I think it's a fun way to see just what is happening in digital biodiversity. I've deliberately limited the demo to items that came online in the month of May, and I'll be adding items during the conference (June 1-3rd in London). For example, if any more papers appear in Zootaxa, or in the uBio feeds I'll add those. If anybody uploads geotagged photos to EOL's Flickr group, I'll grab those as well. It's still a bit crude, but it shows some of the potential of bringing things together, coupled with a nice visualisation. I welcome any feedback.

Integrating and displaying data using RSS


Although I'd been thinking of getting the wiki project ready for e-Biosphere '09 as a challenge entry, lately I've been playing with RSS as a complementary, but quicker, way to achieve some simple integration.

I've been playing with RSS on and off for a while, but what reignited my interest was the swine flu timemap I made last week. The neatest thing about the timemap was how easy it was to make. Just take some RSS that is geotagged and you get the timemap (courtesy of Nick Rabinowitz's wonderful Timemap library).

So, I began to think about taking RSS feeds for, say journals and taxonomic and genomic databases and adding them together and displaying them using tools such as timemap (see here for an earlier mock up of some GenBank data). Two obstacles are in the way. The first is that not every data source of interest provides RSS feeds. To address this I've started to develop wrappers around some sources, the first of which is ZooBank.

The second obstacle is that integration requires shared content (e.g., tags, identifiers, or localities). Some integration will be possible geographically (for example, adding geotagged sequences and images to a map), but this won't work for everything. So, I need to spend some time trying to link stuff together. In the case of Zoobank there's some scope for this, as ZooBank metadata sometimes includes DOIs, which enables us to link to the original publication, as well as bookmarking services such as Connotea. I'm aiming to include these links within the feed, as shown in this snippet (see the <link rel="related"...> element):


<entry>
<title>New Protocetid Whale from the Middle Eocene of Pakistan: Birth on Land, Precocial Development, and Sexual Dimorphism</title>
<link rel="alternate" type="text/html" href="http://zoobank.org/urn:lsid:zoobank.org:pub:8625FB9A-1FC3-43C3-9A99-7A3CDE0DFC9C"/>
<updated>2009-05-06T18:37:34+01:00</updated>
<id>urn:uuid:c8f6be01-2359-1805-8bdb-02f271a95ab4</id>
<content type="html">Gingerich, Philip D., Munir ul-Haq, Wighart von Koenigswald, William J. Sanders, B. Holly Smith & Iyad S. Zalmout<br/><a href="http://dx.doi.org/10.1371/journal.pone.0004366">doi:10.1371/journal.pone.0004366</a></content>
<summary type="html">Gingerich, Philip D., Munir ul-Haq, Wighart von Koenigswald, William J. Sanders, B. Holly Smith & Iyad S. Zalmout<br/><a href="http://dx.doi.org/10.1371/journal.pone.0004366">doi:10.1371/journal.pone.0004366</a></summary>
<link rel="related" type="text/html" href="http://dx.doi.org/10.1371/journal.pone.0004366" title="doi:10.1371/journal.pone.0004366"/>
<link rel="related" type="text/html" href="http://bioguid.info/urn:lsid:zoobank.org:pub:8625FB9A-1FC3-43C3-9A99-7A3CDE0DFC9C" title="urn:lsid:zoobank.org:pub:8625FB9A-1FC3-43C3-9A99-7A3CDE0DFC9C"/>
</entry>
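A consumer of such a feed can pick out the related links with a few lines of standard-library Python. The entry below is a trimmed, well-formed version of the snippet above (the content element is omitted for brevity):

```python
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# Trimmed, well-formed version of the ZooBank entry shown above.
entry_xml = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>New Protocetid Whale from the Middle Eocene of Pakistan</title>
  <link rel="alternate" type="text/html"
    href="http://zoobank.org/urn:lsid:zoobank.org:pub:8625FB9A-1FC3-43C3-9A99-7A3CDE0DFC9C"/>
  <link rel="related" type="text/html"
    href="http://dx.doi.org/10.1371/journal.pone.0004366"
    title="doi:10.1371/journal.pone.0004366"/>
</entry>"""

def related_links(xml_text):
    """Return the href of every <link rel="related"> in an Atom entry."""
    entry = ET.fromstring(xml_text)
    return [link.get("href")
            for link in entry.findall("atom:link", ATOM)
            if link.get("rel") == "related"]

print(related_links(entry_xml))
```

Following those hrefs (DOIs, LSIDs resolved through bioGUID) is what makes the cross-source linking possible.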


What I'm hoping is that there will be enough links to create something rather like my Elsevier Challenge entry, but with a much more diverse set of sources.

e-Biosphere Challenge


The e-Biosphere meeting in London June 1-3 has announced The e-Biosphere 09 Informatics Challenge:

Prepare and present a real-time demonstration during the days of the Conference of the capabilities in your community of practice to discover, disseminate, integrate, and explore new biodiversity-related data by:
  • Capturing data in private and public databases
  • Conducting quality assurance on the data by automated validation and/or peer review
  • Indexing, linking and/or automatically submitting the new data records to other relevant databases
  • Integrating the data with other databases and data streams
  • Making these data available to relevant audiences
  • Make the data and links to the data widely accessible and
  • Offering interfaces for users to query or explore the data.

The "real time" aspect of the challenge seems a bit forced. I think they originally wanted a "live demo", but now they seem to be happy with a demo that unfolds over the three day meeting, without necessarily literally taking three days (what the organisers term "cooking shows"). I also think cash prizes would have been a good idea (the web site simply says "there will be prizes"). It's not the cash itself that matters, it's the fact that it indicates that the organisers are serious about wanting to attract entries. Entrants are likely to invest more time than they'd recoup in cash.

In any event, given that challenges are a great way to focus the mind on a deadline, I'll be entering the wiki of taxonomic names that I've been working on.

Failure

Success is the ability to go from failure to failure without losing your enthusiasm -- Winston Churchill
I learnt today that my Elsevier Challenge entry didn't make the final cut. This wasn't unexpected. In the interests of "open science" (blame Paulo Nuin) here is the feedback I received from the judges:

Strengths
Beautiful presentation, lovely website. Page clearly made his case for open access to metadata/full articles in order to allow communities to build the tools they want. The judges would have enjoyed seeing more elements from the original abstract (tree of life). Great contribution so far to the discussion; Page made his point very well.

Weaknesses
Given that no specific tool was proposed, this submission is somewhat out of scope for the competition. Nonetheless, in support of his point, Page could have elaborated on the kinds of open formats and standards for text and data and figures that would support integrated community-wide tool-building. Alternatively, if the framework and the displayed functionalities were to be the submission, there could have been more discussion of how others can integrate their plug-ins and make them cross-referential to the plug-ins of others. The proposal for Linked Data should utilize Semantic Web standards.

Elements to Consider for Development
How many, and which types of, information substrates? How much work for a new developer to create a new one, and to make this work? How to incentivize authors to produce the required metadata? Or to make the data formats uniform?
I think this is a pretty fair evaluation of my entry. I was making a case for what could be done, rather than providing a specific bit of kit that could make this happen right now. I think I was also a little guilty of not following the "underpromise but overdeliver" mantra. My original proposal included harvesting phylogenies from images, and that proved too difficult to do in the time available. I don't think having trees would have ultimately changed the result (i.e., not making the cut), but it would have been cool to have them.

Anyway, time to stomp around the house a bit, and be generally grumpy towards innocent children and pets. Congratulations to the ~~bastards~~ fellow contestants who made it to the next round.

Challenge entry

I've submitted my entry for the Elsevier Grand Challenge. The paper describing the entry is available from Nature Precedings (doi:10.1038/npre.2008.2579.1). The web site demo is at http://iphylo.org/~rpage/challenge/www/. I'm now officially knackered.

Sequencing Carmen Electra


One byproduct of playing with the Challenge Demo is that I come across some rather surprising results. For example, the rather staidly titled "Cryptic speciation and paraphyly in the cosmopolitan bryozoan Electra pilosa—Impact of the Tethys closing on species evolution" (doi:10.1016/j.ympev.2007.07.016) starts to look a whole lot more interesting given the taxon treemap (right).

The girl is Carmen Electra, which is understandable given that the Yahoo image search was for "Electra" (a genus of bryozoan). However, what are the wild men (and women) doing at the top? It turns out this is the result of searching for the genus Homo. But why, you ask, does a paper on bryozoans have human sequences? Well, it looks like the table in the paper has incorrect GenBank accession numbers. The sequences AJ711044-50 should, I'm guessing, be AJ971044-50.

Ironically, although it was Carmen Electra's photo that initially made me wonder what was going on, it's really the hairy folks above her image that signal something is wrong. I've come across at least one other example of a paper citing an incorrect sequence, so it might be time to automate this checking. Or, what is probably going to be more fun, looking at treemaps for obviously wrong images and trying to figure out why.
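Automating that check is mostly a matter of expanding accession ranges like "AJ711044-50" into individual accessions and comparing what GenBank records for each against the taxa the paper claims to be about. A sketch (the `organism_of` lookup here is a hard-coded stand-in for a real NCBI query, and its values are invented for illustration):

```python
import re

def expand_range(text):
    """Expand an accession range like 'AJ711044-50' into a full list."""
    m = re.match(r"([A-Z]+)(\d+)-(\d+)$", text)
    if not m:
        return [text]
    prefix, start, end = m.groups()
    width = len(start)
    # 'AJ711044-50' means AJ711044..AJ711050: the suffix replaces the tail digits.
    last = int(start[: width - len(end)] + end)
    return [f"{prefix}{n:0{width}d}" for n in range(int(start), last + 1)]

def mismatches(accessions, expected_genus, organism_of):
    """Accessions whose recorded organism is not in the expected genus."""
    return [a for a in accessions
            if not organism_of.get(a, "").startswith(expected_genus)]

# Stand-in for a real GenBank lookup: these values are invented for the sketch.
organism_of = {"AJ711044": "Homo sapiens", "AJ711045": "Homo sapiens"}
accs = expand_range("AJ711044-45")
# A bryozoan paper citing sequences that GenBank says are human would be flagged.
print(mismatches(accs, "Electra", organism_of))
```

In a real pipeline the lookup would come from NCBI rather than a dictionary, but the flagging logic is this simple.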

Challenge Demo online

I've put my Elsevier Challenge demo online. I'm still loading data into it, so it will grow over the next day or so. There's also the small matter of writing a paper on what's under the hood of the demo. Feel free to leave comments on the demo home page.

For an example of what the project does, take a look at Mitochondrial paraphyly in a polymorphic poison frog species (Dendrobatidae; D. pumilio), then compare it to the same publication in Science Direct (doi:10.1016/j.ympev.2007.06.010).

What is a study about? Treemaps of taxa


One of the things I've struggled with most in putting together a web site for the challenge is how to summarise the taxonomic content of a study. Initially I was playing with showing a subtree of the NCBI taxonomy, highlighting the taxa in the study. But this assumes the user is familiar with the scientific names of most of life. I really wanted something that tells you "at a glance" what the study is about.

I've settled (for now, at least) on using a treemap of images of the taxa in the study. I've played with treemaps before, and have never been totally convinced of their utility. However, in this context I think they work well. For each paper I extract the taxonomic names (via the GenBank sequences linked to the paper), group them into genera, and then construct a treemap where the size of each cell is proportional to the number of species in each genus. Then I harvest images from Flickr and/or Yahoo's image search APIs and display a thumbnail with a link to the image source.
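The layout itself can be as simple as "slice-and-dice": divide the rectangle among genera in proportion to their species counts, splitting along the longer side each time. A sketch of that idea (not the actual code behind the site; genus names and counts are invented):

```python
def slice_and_dice(counts):
    """Lay out {label: count} as rectangles (label, x, y, w, h) in the unit
    square, with areas proportional to counts, splitting the remaining
    space along its longer side."""
    total = float(sum(counts.values()))
    x, y, w, h = 0.0, 0.0, 1.0, 1.0
    remaining = total
    rects = []
    for label, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        frac = n / remaining          # share of the space that is left
        if w >= h:                    # take a vertical strip off the left
            rects.append((label, x, y, w * frac, h))
            x += w * frac
            w *= (1 - frac)
        else:                         # take a horizontal strip off the bottom
            rects.append((label, x, y, w, h * frac))
            y += h * frac
            h *= (1 - frac)
        remaining -= n
    return rects

# e.g. a study with three genera of unequal size (made-up counts)
print(slice_and_dice({"Homo": 1, "Dendrobates": 6, "Electra": 3}))
```

Each rectangle then just needs a thumbnail image for its genus; fancier squarified layouts keep the cells closer to square, but this captures the proportional-area idea.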


I'm hoping that these treemaps will give the user an almost instant sense of what the study is about, even if it's only "it's about plants". The treemap above is for Frost et al.'s The amphibian tree of life (doi:10.1206/0003-0090(2006)297[0001:TATOL]2.0.CO;2), the one to the right is for Johnson and Weese's "Geographic distribution, morphological and molecular characterization, and relationships of Lathrocasis tenerrima (Polemoniaceae)".

Note that the more taxa a study includes the smaller and more numerous the cells (see below). This may obscure some images, but gives the user the sense that the study includes a lot of taxa. The image search isn't perfect, but I think it works well enough for my purposes.
