Supplemental Information

Supporting retrieval of diverse biomedical data using evidence-aware queries
Eithon Cadag, Peter Tarczy-Hornoch
Submitted to AMIA Fall Symposium 2009

Though there have been many advances in providing access to linked and integrated biomedical data across repositories, providing methods that let users specify ambiguous and exploratory queries over disparate sources remains a challenge in extracting well-curated or diversely supported biological information. In the following work, we discuss the concepts of data coverage and evidence in the context of integrated sources. We address diverse information retrieval via a simple framework for representing coverage and evidence that operates in parallel with an arbitrary schema, and a language in which queries on the schema and framework may be executed. We show that this approach can answer questions that require a range of evidence levels or triangulation.
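
As a rough illustration of what an evidence-aware query might look like, the sketch below filters annotation records by an evidence range and by the number of distinct sources that support them. It is not PyDI or DaRQL code: the `Annotation` record, the 0-3 evidence scale, the `query` function, and the sample data are all hypothetical, and triangulation is assumed here to mean support from several independent sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    entity: str    # e.g. a protein identifier (hypothetical)
    label: str     # the asserted annotation
    source: str    # repository that supplied the record
    evidence: int  # assumed ordinal scale: 0 = inferred ... 3 = experimental


def query(annotations, min_evidence=0, max_evidence=3, min_sources=1):
    """Return (entity, label) pairs supported by at least `min_sources`
    distinct sources, counting only records inside the evidence range."""
    support = {}
    for a in annotations:
        if min_evidence <= a.evidence <= max_evidence:
            support.setdefault((a.entity, a.label), set()).add(a.source)
    return sorted(key for key, srcs in support.items() if len(srcs) >= min_sources)


# Hypothetical sample records, for illustration only.
RECORDS = [
    Annotation("P1", "kinase", "SourceA", 3),
    Annotation("P1", "kinase", "SourceB", 1),
    Annotation("P1", "transporter", "SourceC", 0),
    Annotation("P2", "protease", "SourceA", 2),
]

# Annotations asserted by at least two sources, at any evidence level.
triangulated = query(RECORDS, min_sources=2)

# Annotations backed by stronger evidence (level 2 or above) from any source.
well_supported = query(RECORDS, min_evidence=2)
```

Keeping the evidence and source fields separate from the annotation content mirrors the idea of a framework that runs in parallel with an arbitrary schema, although the actual representation used in the paper may differ.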

Supplemental information:
[PyDI + DaRQL package (tar.gz)]




On the reachability of trustworthy information from integrated exploratory biological queries
Eithon Cadag, Peter Tarczy-Hornoch, Peter J. Myler
In Proceedings of the 6th International Workshop on Data Integration in the Life Sciences, 2009, pp 55-70, Manchester, UK

Levels of curation across biological databases are widely recognized as being highly variable, depending on provenance and type. In spite of ambiguous quality, searches against biological sources, such as those for sequence homology, remain a frontline strategy for biomedical scientists studying molecular data. In the following, we investigate the accessibility of well-curated data retrieved from exploratory queries across multiple sources. We present the architecture and design of a lightweight data integration platform conducive to graph-theoretic analysis. Using data collected via this framework, we examine the reachability of evidence-supported annotations across triangulated sources in the face of uncertainty, using a simple random sampling model oriented around fault tolerance. We characterize the accessibility of high-quality data from uncertain queries and the levels of redundancy across data sources, and find that, in general, encountering non-experimentally verified annotations is nearly as likely as encountering experimentally verified annotations, with the exception of a group of proteins whose link structure is dominated by experimental evidence. Finally, we discuss the prospect of determining the overall accessibility of relevant information based on metadata about a query and its results.
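
One way to picture a sampling-based reachability analysis is the sketch below. It treats integrated records as a directed graph, randomly drops links with some failure probability, and estimates how often an evidence-supported record can still be reached from the query. This is an illustrative assumption about what such a model could look like, not the model from the paper: the graph, the node names, `p_fail`, and both functions are hypothetical.

```python
import random
from collections import deque


def reachable(graph, start, targets, alive):
    """Breadth-first search that follows only links for which alive(u, v) is true."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node in targets:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in seen and alive(node, nxt):
                seen.add(nxt)
                queue.append(nxt)
    return False


def estimate_reachability(graph, start, targets, p_fail, trials=2000, seed=0):
    """Monte Carlo estimate of the probability that any target stays reachable
    when each link fails independently with probability p_fail."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    hits = 0
    for _ in range(trials):
        failed = {(u, v) for u, vs in graph.items() for v in vs
                  if rng.random() < p_fail}
        hits += reachable(graph, start, targets,
                          lambda u, v: (u, v) not in failed)
    return hits / trials


# Hypothetical link structure: a query reaches one verified annotation
# through two redundant intermediate hits.
GRAPH = {
    "query": ["hitA", "hitB"],
    "hitA": ["verified"],
    "hitB": ["verified", "predicted"],
}

estimate = estimate_reachability(GRAPH, "query", {"verified"}, p_fail=0.5)
```

In this toy graph the two paths to the verified record are independent, so redundancy across sources directly raises the chance of reaching it as links become unreliable.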

Supplemental information:




BioMediator data integration and inference for functional annotation of anonymous sequences
Eithon Cadag, Brent Louie, Peter J. Myler, Peter Tarczy-Hornoch
In Proceedings of the Pacific Symposium on Biocomputing, 2007 pp 343-354, Maui, Hawaii

Scientists working on genomics projects are often faced with the difficult task of sifting through large amounts of biological information dispersed across various online data sources that are relevant to their area or organism of research. Gene annotation, the process of identifying the functional role of a possible gene, in particular has become increasingly time-consuming and laborious to conduct as more genomes are sequenced and the number of candidate genes continues to increase at a near-exponential pace; genes are left un-annotated, or worse, incorrectly annotated. Many groups have attempted to address the annotation backlog through automated annotation systems that are geared toward specific organisms, and which may thus not possess the necessary flexibility and scalability to annotate other genomes. In this paper, we present a method and framework that attempt to address problems inherent in manual and automatic annotation by coupling a data integration system, BioMediator, to an inference engine with the aim of elucidating functional annotations. The framework and heuristics developed are not specific to any particular genome. We validated the method with a set of randomly selected annotated sequences from a variety of organisms. Preliminary results show that the hybrid data integration and inference approach generates functional annotations that are as good as or better than "gold standard" annotations 80% of the time.

Supplemental information:
[JESS code (zip, external link)]