BioDataModel.org

An extended brain dump by Steve Chervitz
Created: 2007-12-06
Last Modified: 2009-10-19, 17:50:18 PDT

Disclaimer: This collection is liable to be incomplete and out of date. Have suggestions to improve it? Contact me.



About This Site

This site is an attempt to consolidate current and prior art regarding the modeling of biological data pertaining to the molecules of the central dogma of biology (DNA, RNA, protein) and their structural or functional properties. This survey could prove useful in guiding best practices, encouraging re-use, discouraging wheel reinvention, and improving the interoperability of bioinformatics software between organizations.

The vast majority of real-world data modeling in bioinformatics these days occurs within database schemas or data exchange formats. Such platform-specific modeling is all well and good, but the proliferation of these models tends to limit reusability and interoperability, since it can lead to functionally equivalent but programmatically incompatible interfaces to the same data. This leads developers to wonder, "Which one should I pick?" and, when in doubt, they may feel compelled to roll their own. We should strive to avoid a scenario that either forces developers (and their end users) to make tough compatibility choices or raises the barrier for future work by leading developers and data providers to "add to the pile".

This collection does not attempt to be all-encompassing, but rather focuses on currently active, publicly available data modeling efforts of general applicability. It does not cover models specific to analytical tools, data pipelines, GUIs, etc. It is not an attempt to survey bioinformatics standards efforts generally, but points to aspects of those that have a bearing on the types of bio data objects mentioned above. For a survey of bioinformatics standards, see BioStandards.info, which will likely subsume BioDataModel.org in the not-too-distant future, since that would enable community maintenance of this information.

At present, there are minimal comparisons or reviews. Still, simply tracking such bio data modeling efforts and providing a broader perspective on them is a worthy goal in its own right. If you have suggestions for how to improve this resource, let me know. Thanks.

For general information about data modeling, check out these sites:


Frequently Asked Questions


Data Model Survey

Data models are grouped based on their general class and are presented in alphabetical order.


Databases

Comments:

The 2004 BOSC featured a compendium of talks from several of these database groups, so the abstracts provide a good overview: http://open-bio.org/bosc2004/accepted_abstracts.html

Chado is often compared to BioSQL, but they serve quite different purposes. Chado provides a rich schema to serve as a common framework for model organism databases, while BioSQL primarily serves as an object-relational mapper for the various Bio* projects (BioPerl, BioJava, Biopython, BioRuby, etc.). Here's an example of its use within BioPerl: http://www.bioperl.org/wiki/BioSQL. Version 1.0.0 of BioSQL was officially released on 6 Mar 2008.

Object-oriented access to data in a Chado database is provided by the Modware project. It makes heavy use of BioPerl (described below), so you can retrieve and persist Bio::Seq/Bio::SeqFeatures (or iterators of Bio::Seq/Bio::SeqFeatures) in a Chado database.

BioMart, BioWarehouse, and GUS are data-mining/warehousing tools. BioMart focuses on genome annotation, BioWarehouse supports a variety of ontology and pathway resources, while GUS targets functional genomics and dates further back than either. GUS has extensions such as RAD for gene expression analysis and users such as GeneDB.

The OBDA is not a schema but provides an abstraction layer for access to bio sequence data residing in different types of repositories (so it's a form of bio data access modeling, rather than object modeling).
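
The distinction between data access modeling and object modeling can be illustrated with a small Python sketch. This is not the OBDA specification or any of its implementations: the class names `SequenceStore`, `DictStore`, and `FastaStore` and the helper `gc_content` are all hypothetical, invented here to show the general idea of client code that asks for a sequence by ID without knowing what kind of repository holds it.

```python
from abc import ABC, abstractmethod
import io


class SequenceStore(ABC):
    """Hypothetical access interface: fetch a sequence by its identifier."""

    @abstractmethod
    def fetch(self, seq_id: str) -> str:
        ...


class DictStore(SequenceStore):
    """Repository backed by an in-memory mapping of ID -> sequence."""

    def __init__(self, records):
        self._records = dict(records)

    def fetch(self, seq_id: str) -> str:
        return self._records[seq_id]


class FastaStore(SequenceStore):
    """Repository backed by FASTA-formatted text read from a file handle."""

    def __init__(self, handle):
        self._records = {}
        current = None
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # The ID is the first whitespace-delimited token of the header.
                current = line[1:].split()[0]
                self._records[current] = []
            elif line and current is not None:
                self._records[current].append(line)

    def fetch(self, seq_id: str) -> str:
        return "".join(self._records[seq_id])


def gc_content(store: SequenceStore, seq_id: str) -> float:
    """Client code depends only on the access interface, not the backend."""
    seq = store.fetch(seq_id).upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


memory_store = DictStore({"seq1": "ACGT"})
fasta_store = FastaStore(io.StringIO(">seq1 demo record\nAC\nGT\n"))
```

Here `gc_content` works unchanged against either store, which is the point of an access layer: the repository type can change without touching the code that consumes the sequences.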

SWAMI underlies the Next Generation Biology Workbench (NGBW) from the SDSC. Not sure if they provide any documentation of their bio data models, but their book chapter mentions some of the other dbs noted here.


Data Exchange Formats

Comments:

NCBI's ASN.1 representation of bio sequence data and annotations is rightly the granddaddy of all platform-specific, comprehensive bio data models and is the raison d'être of the NCBI toolbox. It's been around since the early 1990s, and so deserves praise for its robustness and longevity. But one might ask, "Why has there been such a proliferation of other formats subsequent to ASN.1?" It underscores the maxim that one size doesn't fit all for data formats or data models.

Big thumbs up for the constraint in GFF3 that feature types be tied to a controlled vocabulary (SO). Chado is also strongly ontology oriented. Ontologies will be our salvation, some day perhaps.
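
As a rough illustration of the GFF3 constraint, here is a minimal Python sketch that parses one feature line and checks its type column against a controlled vocabulary. It is not a validator: `SO_TERMS` is a tiny hand-picked subset of Sequence Ontology term names, and `Feature`, `parse_gff3_line`, and `type_is_so_term` are hypothetical names made up for this example. The sample lines are also invented.

```python
from typing import NamedTuple

# Tiny hand-picked subset of Sequence Ontology term names (assumption:
# a real check would load the full vocabulary from an SO release).
SO_TERMS = {"gene", "mRNA", "exon", "CDS", "five_prime_UTR", "three_prime_UTR"}


class Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: dict


def parse_gff3_line(line: str) -> Feature:
    """Split one GFF3 feature line into its nine tab-separated columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    # Column 9 holds semicolon-separated tag=value pairs ("." means none).
    attributes = {}
    if cols[8] != ".":
        for pair in cols[8].split(";"):
            if pair:
                tag, _, value = pair.partition("=")
                attributes[tag] = value
    return Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                   cols[5], cols[6], cols[7], attributes)


def type_is_so_term(feature: Feature) -> bool:
    """Return True if the feature type is in our small SO subset."""
    return feature.type in SO_TERMS


good = parse_gff3_line(
    "chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene0001;Name=demo")
bad = parse_gff3_line(
    "chr1\texample\tmy_custom_thing\t1000\t2000\t.\t+\t.\tID=x1")
```

With this subset, `type_is_so_term(good)` is true and `type_is_so_term(bad)` is false. The sketch only handles term names; the GFF3 spec also allows SO accession numbers in the type column, which a fuller check would need to accept.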

The DAS bio data object model isn't very rich, since DAS is a specification for how clients and servers can share genome sequence and annotations, and it therefore needs generic ways to describe seqs and features. Here are some key aspects of the DAS/2 spec that relate to bio object data modeling:

FlyBase/BDGP were the main developers and users of GAME XML and co-developed the Apollo genome browser, which handles this format. It is also supported by the open-bio projects. FlyBase distributed GAME XML files containing fly genome annotations, but has used Chado-XML instead since ~2006. Here's an old description of GAME containing some obsolete links. (If anyone has better GAME docs, let me know.)

The designs of MAGE-TAB and MAGE-ML are fairly light on modeling of biological data objects, since they are concerned with the exchange of experimental results. They were driven to a large degree by the MIAME checklist, which can be considered a requirements list for the productive exchange of gene expression experimental results. The "minimal information" meme has spread to other areas within scientific data exchange (listed on the MIAME page), and IMHO drawing up such a checklist would be a useful exercise for any bioinformatics data modeling effort.

Quantitative biological data models distributed in BioPAX, CellML, SBML and related formats have their own repository called the BioModels Database.

The SAM format is an emerging standard format for representing sequence alignments, driven by the explosion of data from high-throughput sequencing projects, such as the 1000 Genomes Project. (Note that this data format and related toolset is completely unrelated to the Sequence Alignment and Modeling System with which it shares an acronym.)
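
To give a feel for the SAM data model, here is a minimal Python sketch that parses one alignment line into its eleven mandatory tab-separated fields plus optional TAG:TYPE:VALUE fields. Field names follow the published SAM specification; `SamRecord`, `parse_sam_line`, `is_unmapped`, and `is_reverse` are hypothetical names for this example, the sample read is invented, and only integer-typed tags are converted. Real work should use an established toolkit rather than this sketch.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # query (read) name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # per-base qualities
    tags: dict   # optional fields keyed by two-character tag


def parse_sam_line(line: str) -> SamRecord:
    """Parse one SAM alignment line (not a header line starting with '@')."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError(f"expected at least 11 columns, got {len(cols)}")
    tags = {}
    for field in cols[11:]:
        tag, value_type, value = field.split(":", 2)
        # Only integer tags are converted; other types stay as strings.
        tags[tag] = int(value) if value_type == "i" else value
    return SamRecord(cols[0], int(cols[1]), cols[2], int(cols[3]),
                     int(cols[4]), cols[5], cols[6], int(cols[7]),
                     int(cols[8]), cols[9], cols[10], tags)


def is_unmapped(record: SamRecord) -> bool:
    """Flag bit 0x4 marks a read that has no alignment."""
    return bool(record.flag & 0x4)


def is_reverse(record: SamRecord) -> bool:
    """Flag bit 0x10 marks a read aligned to the reverse strand."""
    return bool(record.flag & 0x10)


record = parse_sam_line(
    "read1\t16\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNM:i:0")
```

Note how little biology the record itself carries: a read name, a reference name, and coordinates. Like DAS, the format stays generic and leaves richer bio object modeling to the tools that consume it.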


Object Modelling Efforts

Comments:

The Bio* projects can utilize basic sequence, feature, and taxonomic data stored in a BioSQL database (described above), but the different projects vary in terms of the types and depth of the sequence feature hierarchy they support. BioPerl has specialized objects representing different feature types, while BioJava and Biopython are more generic, yet capable of handling features in the GenBank/EMBL/DDBJ feature model. BioPerl provides support for some genome browsers, such as GBrowse and Genquire.

caBIO is NCICB's cancer Bioinformatics Infrastructure Objects model, an extensive framework comprising many biological object types and APIs for web services that provide access to cancer research data. It is part of the infrastructure that powers the Cancer Biomedical Informatics Grid (caBIG). A good writeup of caBIG appears in the April 2008 issue of The Scientist.

FuGE (Functional Genomics Experiment) and derivative projects are among the few efforts doing generalized, platform-independent object modeling these days, although their focus is on representing experimental workflows rather than defining standard data structures for bio objects. FuGE and related efforts like MAGE-OM typically have relatively minimal representations of bio objects that are compatible with those found in richer data models.

FuGE has attracted interest from groups in application areas such as gene expression, proteomics, and toxicogenomics that want to model their data and workflows as extensions of the generic FuGE model, thus realizing the benefits of re-use and potentially facilitating interoperation between FuGE-based apps.

The GenoViz data model, called Genometry, is used by the Integrated Genome Browser (IGB) and other tools based on the GenoViz code base. Documentation for IGB is end user-centric and lacks details about the underlying Genometry data model, which has some powerful concepts. Here are a few pointers:

Modware provides an object-oriented interface to a Chado database, described above.


Hall of Fame

Here are some data models that have risen and fallen, or, if not quite fallen, didn't achieve widespread adoption or are no longer actively maintained.