TDWG Challenge - what is RDF good for?

Last month, feeling particularly grumpy, I fired off an email to the TDWG-TAG mailing list with the subject Lobbing grenades: a challenge. Here's the email:
It's morning and the coffee hasn't quite kicked in yet, but reading through recent TDWG TAG posts, and mindful of the upcoming meeting in New Orleans (which sadly I won't be attending) I'm seeing a mismatch between the amount of effort being expended on discussions of vocabularies, ontologies, etc. and the concrete results we can point to.

Hence, a challenge:

"What new things have we learnt about biodiversity by converting biodiversity data into RDF?"

I'm not saying we can't learn new things, I'm simply asking what have we learnt so far?

Since around 2006 we have had literally millions of triples in the wild (uBio, ION, Index Fungorum, IPNI, Catalogue of Life, more recently Biodiversity Collections Index, Atlas of Living Australia, World Register of Marine Species, etc.), most of these using the same vocabulary. What new inferences have we made?

Let's make the challenge more concrete. Load all these data sources into a triple store (subchallenge - is this actually possible?). Perhaps add other RDF sources (DBpedia, Bio2RDF, CrossRef). What novel inferences can we make?

I may, of course, simply be in "grumpy old arse" mode, but we have millions of triples in the wild and nothing to show for it. I hope I'm not alone in wondering why...

In the context of the TDWG meeting (happening as we speak and which I'm following via Twitter, hashtag #tdwg) Joel Sachs asked me whether I had any specific data in mind that could form the basis of a discussion. So, here goes. I've assembled some small RDF data sets that it might be fun to play with. Each data set is for frogs, and I've divided them into two sets.

Primary data
These data sets are essentially unmodified RDF fetched from data providers:
  • uniprot.rdf Uniprot RDF for frogs in GenBank
  • ion.rdf Index of Organism Names (ION) RDF for taxonomic names for frogs (filtered to just those names that are also in GenBank, the RDF comes from ION LSIDs)
  • crossref.rdf CrossRef RDF for DOIs for publications that published new frog names (obtained using CrossRef's support for Linked Data for DOIs)
  • dbpedia.rdf DBpedia RDF for frogs in GenBank (Update 2011-10-20: the dbpedia.rdf file is a bit big, so here is subset.rdf which has just the conservation status and thumbnail image)


These sources give us information on genomics (at least, they tell us which taxa have been sequenced), where and when the original taxonomic description was published, and by whom, as well as some information on conservation status and what the frog looks like (via DBpedia). Ideally we would just load these files into a triple store and then ask a bunch of questions, such as: What is the conservation status of frogs sequenced in GenBank? Is there a correlation between the conservation status of a frog and the date it was discovered? Who has described the most frog species?

My contention is that actually we can't do any of this because the data is siloed due to the lack of shared identifiers and vocabularies (I suspect that there is not a single identifier any of these files share). The only way we can currently link these data sets together is by shared string literals (e.g., taxonomic names), in which case why bother with RDF? So my first challenge is to see whether any of the questions I've just listed can actually be tackled using this data.
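One rough way to test the "not a single shared identifier" suspicion before reaching for a triple store is to collect every URI each file uses and intersect the sets. The sketch below does that with Python's standard library; the two inline documents are made-up stand-ins for the real files (only the UniProt taxon URI comes from this post), and the function names are mine.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def uris_in_rdfxml(xml_text):
    """Collect every URI used as rdf:about or rdf:resource in an RDF/XML string."""
    root = ET.fromstring(xml_text)
    uris = set()
    for element in root.iter():
        for attribute in ("about", "resource"):
            value = element.get(f"{{{RDF_NS}}}{attribute}")
            if value:
                uris.add(value)
    return uris


def shared_uris(datasets):
    """Map each pair of dataset names to the set of URIs that both use."""
    uri_sets = {name: uris_in_rdfxml(text) for name, text in datasets.items()}
    return {
        (a, b): uri_sets[a] & uri_sets[b]
        for a, b in combinations(sorted(uri_sets), 2)
    }


# Made-up stand-ins for two data files (example.org URIs are hypothetical).
SEQUENCES = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://example.org/sequence/1">
    <dcterms:subject rdf:resource="http://purl.uniprot.org/taxonomy/8354"/>
  </rdf:Description>
</rdf:RDF>"""

NAMES = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://example.org/name/1">
    <dcterms:title>Xenopus borealis</dcterms:title>
  </rdf:Description>
</rdf:RDF>"""

overlap = shared_uris({"names": NAMES, "sequences": SEQUENCES})
print(overlap)
```

For the real files you would pass the file contents instead of the inline strings. Note that this counts any URI, including class URIs in rdf:type statements, so an empty intersection is a fairly strong sign that nothing joins up.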

Glue
In a slightly more constructive mode, to see if we can make progress I'm providing some additional RDF files, based on projects I'm working on to link data together. These files may help provide some of the missing "glue" to connect these data sets.

  • linkout.rdf The list of links between NCBI and DBpedia (based on mapping in iPhylo LinkOut)
  • ion_doi.rdf A subset of publications listed in ION have DOIs; this file links the corresponding ION LSIDs to those DOIs (this file is from an ongoing project mapping names to primary literature)


The first file, linkout.rdf, links NCBI taxon ids (in this case in the form of UniProt URIs) to Wikipedia (in the form of DBpedia URIs). DBpedia has information on conservation status, and some frogs will also have pictures, so we can start to join genomics to conservation, as well as make some visualisations. The second file, ion_doi.rdf, links the ION and CrossRef RDF, so we could start to ask questions about dates of discovery, who described what species, etc.
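To show what the glue buys us, here is a minimal sketch of the join from a sequenced taxon to a conservation status, with triples held as plain Python tuples. The predicate URIs are assumptions (I have not checked which predicate linkout.rdf actually uses, and the DBpedia property name should be verified against the data), and the DBpedia-style URIs and status value are placeholders.

```python
from collections import defaultdict

# Assumed predicates: check these against the actual files before relying on them.
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
STATUS = "http://dbpedia.org/ontology/conservationStatus"


def index_by_predicate(triples):
    """Build predicate -> subject -> set of objects."""
    index = defaultdict(lambda: defaultdict(set))
    for subject, predicate, obj in triples:
        index[predicate][subject].add(obj)
    return index


def conservation_status_by_taxon(triples):
    """Follow taxon -> linked resource -> status, keeping taxa that reach a status."""
    index = index_by_predicate(triples)
    result = {}
    for taxon, linked_resources in index[SAME_AS].items():
        statuses = set()
        for resource in linked_resources:
            statuses |= index[STATUS].get(resource, set())
        if statuses:
            result[taxon] = statuses
    return result


# Placeholder data: the taxon URI is from this post, everything else is made up.
TRIPLES = [
    ("http://purl.uniprot.org/taxonomy/8354", SAME_AS, "http://example.org/dbpedia/Some_frog"),
    ("http://example.org/dbpedia/Some_frog", STATUS, "example-status"),
    ("http://example.org/dbpedia/Unlinked_frog", STATUS, "example-status"),
]

statuses = conservation_status_by_taxon(TRIPLES)
print(statuses)
```

The unlinked frog never shows up in the result, which is the whole point: without the glue triple, its conservation status is unreachable from the genomic side.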

Update
I've now added another RDF file for 1000 georeferenced GenBank sequences for frogs. The file is genbank.rdf. This file is generated from a local, processed version of EMBL, and uses a mixture of Dublin Core and TDWG vocabularies. Here's an example of a single record:

<?xml version="1.0"?>
<rdf:RDF xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:tcommon="http://rs.tdwg.org/ontology/voc/Common#"
         xmlns:toccurrence="http://rs.tdwg.org/ontology/voc/TaxonOccurrence#"
         xmlns:uniprot="http://purl.uniprot.org/core/">
  <uniprot:Molecule rdf:about="http://bio2rdf.org/genbank:EU566842">
    <dcterms:created>2008-07-06</dcterms:created>
    <dcterms:modified>2010-12-23</dcterms:modified>
    <dcterms:title>EU566842</dcterms:title>
    <dcterms:description>Xenopus borealis voucher MHNG:Herp:2644.64 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial.</dcterms:description>
    <dcterms:subject rdf:resource="http://purl.uniprot.org/taxonomy/8354"/>
    <dcterms:relation rdf:parseType="Resource">
      <rdf:type rdf:resource="http://rs.tdwg.org/ontology/voc/TaxonOccurrence#TaxonOccurrence"/>
      <toccurrence:identifiedToString>Xenopus borealis</toccurrence:identifiedToString>
      <toccurrence:decimalLatitude>0.66</toccurrence:decimalLatitude>
      <geo:lat>0.66</geo:lat>
      <toccurrence:decimalLongitude>37.5</toccurrence:decimalLongitude>
      <geo:long>37.5</geo:long>
      <toccurrence:verbatimCoordinates>0.66 N 37.5 E</toccurrence:verbatimCoordinates>
      <toccurrence:country>Kenya</toccurrence:country>
      <dcterms:identifier>MHNG:Herp:2644.64</dcterms:identifier>
    </dcterms:relation>
  </uniprot:Molecule>
</rdf:RDF>

I've added this simply so one could do some geographical queries.
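As a sketch of the simplest kind of geographical query, the code below pulls the geo:lat and geo:long values out of records shaped like the one above and filters them by a bounding box. It uses only the Python standard library, the inline sample is a trimmed copy of the record shown, and the box coordinates are arbitrary values chosen for illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
    "uniprot": "http://purl.uniprot.org/core/",
}


def georeferenced_sequences(xml_text):
    """Return (sequence URI, latitude, longitude) for each record that has both."""
    root = ET.fromstring(xml_text)
    records = []
    for molecule in root.findall("uniprot:Molecule", NS):
        lat = molecule.findtext("dcterms:relation/geo:lat", namespaces=NS)
        lon = molecule.findtext("dcterms:relation/geo:long", namespaces=NS)
        if lat is None or lon is None:
            continue  # skip records without coordinates
        uri = molecule.get(f"{{{NS['rdf']}}}about")
        records.append((uri, float(lat), float(lon)))
    return records


def within_box(records, south, north, west, east):
    """Keep records whose coordinates fall inside the bounding box."""
    return [r for r in records if south <= r[1] <= north and west <= r[2] <= east]


# Trimmed copy of the example record from this post.
SAMPLE = """<rdf:RDF xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:uniprot="http://purl.uniprot.org/core/">
  <uniprot:Molecule rdf:about="http://bio2rdf.org/genbank:EU566842">
    <dcterms:title>EU566842</dcterms:title>
    <dcterms:relation rdf:parseType="Resource">
      <geo:lat>0.66</geo:lat>
      <geo:long>37.5</geo:long>
    </dcterms:relation>
  </uniprot:Molecule>
</rdf:RDF>"""

records = georeferenced_sequences(SAMPLE)
print(within_box(records, south=-5, north=5, west=33, east=42))
```

A triple store with geospatial support would do this more elegantly, but for 1000 sequences a linear scan is plenty.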

Missing links
There are still lots of missing links here (for example, there's no explicit link between NCBI and ION, so we'd need to create this using taxonomic names), and we could add further links to the literature via sequences for taxa. Then there's the lack of geographic data. We could get some of this via georeferenced sequences in GenBank, but there's no RDF for this (Bio2RDF does have RDF for sequences but it ignores the bulk of the organismal metadata such as voucher specimens and latitude and longitude).
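Creating the NCBI to ION link from taxonomic names amounts to a string join. A minimal sketch, assuming we have already extracted identifier-to-name dictionaries from each source (the ION-style identifier below is a placeholder, not a real LSID):

```python
def normalise(name):
    """Collapse whitespace and ignore case so trivially different strings match."""
    return " ".join(name.split()).casefold()


def link_by_name(ncbi_names, ion_names):
    """Pair NCBI and ION identifiers whose name strings match after normalisation."""
    ion_by_name = {}
    for ion_id, name in ion_names.items():
        ion_by_name.setdefault(normalise(name), []).append(ion_id)
    links = []
    for ncbi_id, name in ncbi_names.items():
        for ion_id in ion_by_name.get(normalise(name), []):
            links.append((ncbi_id, ion_id))
    return links


# The NCBI taxon URI is from this post; the ION-style identifier is hypothetical.
NCBI = {"http://purl.uniprot.org/taxonomy/8354": "Xenopus borealis"}
ION = {
    "urn:lsid:example.org:name:1": "Xenopus  borealis",
    "urn:lsid:example.org:name:2": "Xenopus laevis",
}

links = link_by_name(NCBI, ION)
print(links)
```

This is exactly the "shared string literals" approach I was grumbling about: it works for clean names, but homonyms and spelling variants will produce false or missing links, which is why explicit identifier-to-identifier links would be preferable.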

In many ways it's this lack of links that was the point of my original email. The reality is that "linked data" isn't linked to anything like the extent that makes it useful. Simply pumping out RDF won't get us very far until we tackle this problem (see also my earlier post Linked data that isn't: the failings of RDF).

So, if you think RDF is the way to go, please tell me what you can learn from these data files.


NCBI taxonomy, TDWG vocabularies, and RDF


Lately I've been returning to playing with RDF and triple stores. This is a serious case of déjà vu, as two blogs I've now abandoned will testify (bioGUID and SemAnt). Basically, a combination of frustration with the tools, data cleaning, and the lack of identifiers got in the way of making much progress. I gave up on triple stores for a while, rolling my own Entity–Attribute–Value (EAV) database, which I used for the Elsevier Challenge (EAV databases are essentially key-value databases, CouchDB being a well-known example).

Now, I'm revisiting triple stores and SPARQL, partly because Linked Data is gaining momentum, and partly because we now have a few LSID providers, and some decent vocabularies from TDWG. Having created an LSID resolver that plays nicely with Linked Data (it also does the same thing for DOIs), it's time to dust off SPARQL and see what can be done.

One reason there's interest in having GUIDs and standard vocabularies is so that we can link different sources of information together. But more than just linking, we should be able to compute across these links and learn new things, or at least add annotations from one database to another.

To make this concrete, take the NCBI taxon 101855, Lulworthia uniseptata. If we visit the NCBI page we see links to other resources, such as Index Fungorum record 105488, which tells us that Lulworthia uniseptata was published in Trans. Mycol. Soc. Japan 25(4): 382 (1984), and that the current name is Lulwoana uniseptata, which was published in Mycol. Res. 109(5): 562 (2005).

Wouldn't it be nice to be able to automatically link these things together? And wouldn't it be nice to have identifiers for the literature, rather than only human-readable text strings? Using bioGUID, we can discover that Mycol. Res. 109(5): 562 (2005) has the DOI doi:10.1017/S0953756205002716 -- I haven't found Trans. Mycol. Soc. Japan 25(4): 382 (1984) online anywhere.

Now, given that we have LSIDs for Index Fungorum, I can resolve urn:lsid:indexfungorum.org:names:369395 and discover that

urn:lsid:indexfungorum.org:names:369395 tname:hasBasionym urn:lsid:indexfungorum.org:names:105488

and I can add the statement

urn:lsid:indexfungorum.org:names:369395 tcommon:publishedInCitation doi:10.1017/S0953756205002716
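To see how little machinery is needed once statements like these exist, here is a sketch that holds the basionym and citation statements as Python tuples and, starting from the older name, finds any later name based on it along with where that later name was published. The prefixed predicate names are kept as written here rather than expanded to full URIs, and the function name is mine.

```python
NEW_NAME = "urn:lsid:indexfungorum.org:names:369395"
OLD_NAME = "urn:lsid:indexfungorum.org:names:105488"

# The basionym statement from Index Fungorum plus the added citation statement.
TRIPLES = [
    (NEW_NAME, "tname:hasBasionym", OLD_NAME),
    (NEW_NAME, "tcommon:publishedInCitation", "doi:10.1017/S0953756205002716"),
]


def later_names_with_citations(basionym, triples):
    """Find names whose basionym is the given name, with any citations for them."""
    later = [s for s, p, o in triples if p == "tname:hasBasionym" and o == basionym]
    return {
        name: [
            o for s, p, o in triples
            if s == name and p == "tcommon:publishedInCitation"
        ]
        for name in later
    }


annotations = later_names_with_citations(OLD_NAME, TRIPLES)
print(annotations)
```

In a triple store this would be a two-pattern SPARQL query; the point is only that the join is trivial once both ends share identifiers.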

What I'd like to do is link this to the NCBI taxon, so that I can display this additional knowledge in one place (i.e., there is an additional name for this fungus, and where it is published). To do this, I need the NCBI taxonomy in RDF. Turns out that everyone and their dog has been generating RDF versions of the NCBI taxonomy, including UniProt (source of the diagram above). The problem is that each effort creates its own project-specific vocabulary. For example, here is the record for NCBI taxon 101855 in UniProt RDF (http://www.uniprot.org/taxonomy/101855):


<?xml version='1.0' encoding='UTF-8'?>
<rdf:RDF xmlns="http://purl.uniprot.org/core/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://purl.uniprot.org/taxonomy/101855">
    <rdf:type rdf:resource="http://purl.uniprot.org/core/Taxon"/>
    <rank rdf:resource="http://purl.uniprot.org/core/Species"/>
    <scientificName>Lulworthia uniseptata</scientificName>
    <otherName>Zalerion maritimum</otherName>
    <rdfs:subClassOf rdf:resource="http://purl.uniprot.org/taxonomy/45817"/>
    <partOfLineage>false</partOfLineage>
  </rdf:Description>
</rdf:RDF>


UniProt has its own vocabulary, http://purl.uniprot.org/core/. So, what I'd like to do is create a version of the NCBI taxonomy using TDWG's TaxonConcept vocabulary, so that it becomes straightforward to link NCBI to name databases such as Index Fungorum, IPNI, Zoobank, and ION that are serving taxon names.
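A rough sketch of what such a conversion might look like, reading a UniProt-style record and emitting triples that use TDWG terms. This is an illustration rather than a finished mapping: the TaxonConcept namespace URI and the nameString property are my assumptions and should be checked against the TDWG ontology, and the parent link is simply carried over as rdfs:subClassOf.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
UNIPROT = "http://purl.uniprot.org/core/"
# Assumed namespace and terms; verify against the TDWG TaxonConcept vocabulary.
TCONCEPT = "http://rs.tdwg.org/ontology/voc/TaxonConcept#"


def uniprot_taxon_to_tdwg(xml_text):
    """Translate UniProt taxon records into (subject, predicate, object) triples."""
    root = ET.fromstring(xml_text)
    triples = []
    for node in root.findall(f"{{{RDF}}}Description"):
        subject = node.get(f"{{{RDF}}}about")
        triples.append((subject, RDF + "type", TCONCEPT + "TaxonConcept"))
        name = node.findtext(f"{{{UNIPROT}}}scientificName")
        if name:
            triples.append((subject, TCONCEPT + "nameString", name))
        parent = node.find(f"{{{RDFS}}}subClassOf")
        if parent is not None:
            # Parent link kept as UniProt expresses it.
            triples.append((subject, RDFS + "subClassOf", parent.get(f"{{{RDF}}}resource")))
    return triples


# Trimmed copy of the UniProt record shown above.
UNIPROT_RECORD = """<rdf:RDF xmlns="http://purl.uniprot.org/core/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="http://purl.uniprot.org/taxonomy/101855">
    <rdf:type rdf:resource="http://purl.uniprot.org/core/Taxon"/>
    <scientificName>Lulworthia uniseptata</scientificName>
    <rdfs:subClassOf rdf:resource="http://purl.uniprot.org/taxonomy/45817"/>
  </rdf:Description>
</rdf:RDF>"""

converted = uniprot_taxon_to_tdwg(UNIPROT_RECORD)
for triple in converted:
    print(triple)
```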