Disclaimer: This collection is liable to be incomplete and out of date. Have suggestions to improve it? Contact me.
- About This Site
- Data Model Survey
This site is an attempt to consolidate current and prior art regarding the modeling of biological data pertaining to the molecules of the central dogma of biology (DNA, RNA, protein) and their structural or functional properties. This survey could prove useful in guiding best practices, encouraging re-use, discouraging wheel reinvention, and improving the interoperability of bioinformatics software between organizations.
The vast majority of real world data modeling in bioinformatics these days occurs within database schemas or data exchange formats. Such platform-specific modeling is all well and good, but the proliferation of these models tends to limit reusability and interoperability, since it can lead to functionally equivalent but programmatically incompatible interfaces to the same data. This leads developers to wonder, "Which one should I pick?" and, when in doubt, they may feel compelled to roll their own. We should strive to avoid a scenario which either forces developers (and their end users) to make tough compatibility choices or increases the barrier for future work by leading developers and data providers to "add to the pile".
This collection does not attempt to be all-encompassing, but rather focuses on currently active, publicly available data modeling efforts of general applicability. It does not cover models specific to analytical tools, data pipelines, GUIs, etc. It is not an attempt to survey bioinformatics standards efforts generally, but points to aspects of those that have a bearing on the types of bio data objects mentioned above. For a survey of bioinformatics standards, see BioStandards.info, which will likely subsume BioDataModel.org in the not-too-distant future, as it would enable community maintenance of this information.
At present, there are minimal comparisons or reviews. Still, simply tracking such bio data modeling efforts and providing a broader perspective on them is a worthy goal in its own right. If you have suggestions for how to improve this resource, let me know. Thanks.
For general information about data modeling, check out these sites:
- Why are there so many bio data models?
Short answer: because there is so much #$*%!@ data! As the collection at this site attests, people have been doing bioinformatics data modeling in various ways for some time. Many of these efforts co-emerged, created by different groups working to solve real-world problems with budget and time constraints. Data modeling is so vital to application development that it's typically impractical to try and construct a generic model that would see wide adoption, or to embark on a collaborative effort with other groups (who don't share your application area or timelines).
Bioinformatics has always played catch-up with changes driven by new knowledge and technologies in the molecular sciences, and in such a young field that will continue for some time to come. So the craft of biological data modeling must of necessity be a dynamic, adaptable endeavor.
- Do we need standard biological data models?
Eventually, we may achieve a "critical mass" stage where there are a sufficient number of mature bio data models that explore a sufficiently large sampling of application space that we can start seeing more convergence. Time will tell.
Some aspects of biological data are more nailed down than others. Any effort to define reusable bio data models should capture the more well-established areas while leaving room to accommodate change as our scientific understanding improves over time.
Whether we will ever or should ever achieve broad uniformity in the area of biological data models is an open debate. If we all agree to use a common object model, database schema, and exchange format we should be good, right? Maybe for a while, but different applications must cope with new and ever-changing data types and end-user requirements, and there will always be different approaches to data denormalization and the like. So it's hard to come up with a one-size-fits-all database or format that new user X working with new data Y won't be tempted to tweak.
- Will data standards help data modeling?
The development and widespread adoption of increasingly comprehensive ontologies for biological data could help to reduce the need for uniform practices in biological data modeling. For example, if an object's type can be specified by reference to a term in a standard ontology, this provides some freedom in how that object is implemented and simplifies alignment across different platforms and implementations. Sure, bio data objects possess properties and behaviors that are not addressable by ontologies, but ontologies could be a big help when it comes to object typing and classification (a major hurdle). Certain other technologies, such as RDF, may help to tame some of the other aspects of bio data objects, and the HCLS is looking into this as we speak.
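To make the ontology-typing idea a bit more concrete, here is a minimal Python sketch. It is only an illustration: the `RichFeature` class, the dict layout, and the helper functions are hypothetical and do not come from any of the projects surveyed here. The only borrowed pieces are two Sequence Ontology accessions (SO:0000704 for gene, SO:0000147 for exon). The point is that two quite different implementations can still be compared by type once both carry a reference to the same ontology term.

```python
from dataclasses import dataclass

# Sequence Ontology accessions used for illustration.
SO_GENE = "SO:0000704"  # gene
SO_EXON = "SO:0000147"  # exon


@dataclass
class RichFeature:
    """Hypothetical implementation #1: an object with explicit fields."""
    seq_id: str
    start: int
    end: int
    so_term: str


def type_of(obj) -> str:
    """Return the ontology accession, whichever way the object is implemented.

    Hypothetical implementation #2 is a plain dict with a "type" key.
    """
    if isinstance(obj, dict):
        return obj["type"]
    return obj.so_term


def same_type(a, b) -> bool:
    """Two objects align on type if they reference the same ontology term."""
    return type_of(a) == type_of(b)


feature_a = RichFeature("chr2L", 100, 500, SO_GENE)
feature_b = {"seqid": "chr2L", "range": (100, 500), "type": SO_GENE}
feature_c = {"seqid": "chr2L", "range": (100, 200), "type": SO_EXON}

print(same_type(feature_a, feature_b))  # both reference the gene term
print(same_type(feature_a, feature_c))  # gene vs. exon
```

Note that nothing here addresses the properties and behaviors of the objects; the shared term only settles what kind of thing each one is.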
I'd argue that even with a flexible, widely accepted set of ontologies and semantic web technologies, it will still be of use to define some data modeling guidelines for working with core bioinformatics data types, possibly with customizations for specific application areas. The survey of data models at this site might lead to insights as to whether or not this may be the case. Or it may just be a handy place to shop for a data model that fits your needs.
Data models are grouped based on their general class and are presented in alphabetical order.
- BioMart: http://www.biomart.org/
- BioSQL: http://biosql.org
- BioWarehouse: http://biowarehouse.ai.sri.com/
- Chado: http://www.gmod.org/wiki/index.php/Schema
- Ensembl: http://www.ensembl.org/info/software/core/schema/
- GUS: http://gusdb.org
- OBDA: http://www.bioperl.org/wiki/HOWTO:OBDA
- SWAMI and NGBW: http://www.ngbw.org/
- UCSC: http://genome.cse.ucsc.edu/goldenPath/gbdDescriptions.html
The 2004 BOSC featured a compendium of talks from several of these database groups, so the abstracts provide a good overview: http://open-bio.org/bosc2004/accepted_abstracts.html
Chado is often compared to BioSQL, but they serve quite different purposes. Chado provides a rich schema to serve as a common framework for model organism databases, while BioSQL primarily serves as an object-relational mapper for the various Bio* projects (BioPerl, BioJava, Biopython, BioRuby, etc.). Here's an example of its use within BioPerl: http://www.bioperl.org/wiki/BioSQL. Version 1.0.0 of BioSQL was officially released on 6 Mar 2008.
Object-oriented access to data in a Chado database is provided by the Modware project. It makes heavy use of BioPerl (described below) so you can retrieve and persist Bio::Seq/Bio::SeqFeatures (or iterators of Bio::Seq/Bio::SeqFeatures) in a Chado database.
BioMart, BioWarehouse, and GUS are data-mining/warehousing tools. BioMart focuses on genome annotation, BioWarehouse supports a variety of ontology and pathway resources, while GUS targets functional genomics and dates further back than either. GUS has extensions such as RAD for gene expression analysis and users like GeneDB.
The OBDA is not a schema but provides an abstraction layer for access to bio sequence data residing in different types of repositories (so it's a form of bio data access modeling, rather than object modeling).
SWAMI underlies the Next Generation Biology Workbench (NGBW) from the SDSC. Not sure if they provide any documentation of their bio data models, but their book chapter mentions some of the other dbs noted here.
- ASN.1: http://www.ncbi.nlm.nih.gov/Sitemap/Summary/asn1.html
- Chado-XML: http://www.gmod.org/wiki/index.php/Chado_XML
- EMBLxml and INSDseq: http://www.ebi.ac.uk/xembl/
- GAME: http://www.fruitfly.org/annot/apollo/game.rng.txt
- GFF3: http://song.sourceforge.net/gff3.shtml
- PML: http://www.openpml.org/
- Quantitative biological modeling data formats:
- SAM: http://samtools.sourceforge.net/
NCBI's ASN.1 representation of bio sequence data and annotations is rightly considered the granddaddy of all platform-specific, comprehensive bio data models and is the raison d'être of the NCBI toolbox. It's been around since the early 1990s, and so deserves praise for its robustness and longevity. But one might ask, "Why has there been such a proliferation of other formats subsequent to ASN.1?" It underscores the maxim that one size doesn't fit all for data formats or data models.
The DAS bio data object model isn't very rich, since this is a specification for how clients and servers can share genome sequence and annotations and therefore needs generic ways to describe seqs and features. Here are some key aspects of the DAS/2 spec that relate to bio object data modeling:
- Sequence locations (ranges) - represented as half-open, zero-based intervals, a concept also used in the UCSC genome annotation database and in Genoviz/Genometry (see below), but at odds with the 1-based coordinate system used by Genbank/EMBL/DDBJ (not a big deal though).
- Feature types reference the Sequence Ontology (as in GFF3).
- Sequence alignments and complex, nested features can be represented.
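The difference between the two coordinate conventions mentioned above can be sketched in a few lines of Python. This is an illustration only; the function names are my own and are not part of DAS/2, UCSC, or any other toolkit. It assumes forward-strand ranges and that the 1-based system is inclusive at both ends, as in Genbank/EMBL/DDBJ feature locations.

```python
from typing import Tuple


def one_based_to_half_open(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based, inclusive range to a 0-based, half-open interval."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def half_open_to_one_based(start: int, end: int) -> Tuple[int, int]:
    """Convert a 0-based, half-open interval to a 1-based, inclusive range.

    An empty interval (start == end) has no 1-based inclusive equivalent,
    so it is rejected here.
    """
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


def span_length(start: int, end: int) -> int:
    """Length of a half-open interval: no off-by-one correction needed."""
    return end - start


seq = "ACGTACGT"
# Bases 3..5 in 1-based inclusive terms are G, T, A.
start, end = one_based_to_half_open(3, 5)
print(start, end, seq[start:end])  # 2 5 GTA
```

Only the start coordinate shifts; the end value is numerically the same in both systems. Half-open intervals also happen to line up with Python slicing, so the converted range can be used directly as `seq[start:end]`.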
Flybase/BDGP were the main developers and users of GAME XML and co-developed the Apollo genome browser which handles this format. It is also supported by the open-bio projects. Flybase distributed GAME XML files containing fly genome annotations, but as of ~2006 uses Chado-XML instead. Here's an old description of GAME containing some obsolete links. (If anyone has better GAME docs, let me know.)
The designs of MAGE-TAB and MAGE-ML are fairly light on modeling of biological data objects since they are concerned with the exchange of experimental results. They were driven to a large degree by the MIAME checklist, which can be considered a requirements list for the productive exchange of gene expression experimental results. The "minimal information" meme has spread to other areas within scientific data exchange (listed on the MIAME page), and IMHO drawing up such a checklist would be a useful exercise for any bioinformatics data modeling effort.
Quantitative biological data models distributed in BioPAX, CellML, SBML and related formats have their own repository called the BioModels Database.
The SAM format is an emerging standard format for representing sequence alignments, driven by the explosion of data from high-throughput sequencing projects, such as the 1000 Genomes Project. (Note that this data format and related toolset is completely unrelated to the Sequence Alignment and Modeling System with which it shares an acronym.)
- Bio* project object models for sequence/features/annotations:
- caBIO: http://cabio.nci.nih.gov/NCICB/infrastructure/cacore_overview/caBIO
- FuGE: http://fuge.sourceforge.net/
- GenoViz/Genometry: http://genoviz.sourceforge.net
- MAGE-OM: http://mged.sourceforge.net/software/index.php
- Modware: http://gmod-ware.sourceforge.net/
- PAGE-OM: http://www.openpml.org/page-om/index.html
The Bio* projects can utilize basic sequence, feature, and taxonomic data stored in a BioSQL database (described above), but the different projects vary in terms of the types and depth of the sequence feature hierarchy they support. BioPerl has specialized objects representing different feature types, while BioJava and Biopython are more generic, yet capable of handling features in the Genbank/EMBL/DDBJ feature model. BioPerl provides support for some genome browsers, such as GBrowse and Genquire.
caBIO is NCICB's cancer Bioinformatics Infrastructure Objects model, an extensive framework comprising many biological object types and APIs for web services that provide access to cancer research data. It is part of the infrastructure that powers the Cancer Biomedical Informatics Grid (caBIG). A good writeup of caBIG appears in the April 2008 issue of The Scientist.
FuGE (Functional Genomics Experiment) and derivative projects are among the few groups that are doing generalized, platform-independent object modeling these days, although their focus is on representing experimental workflows rather than defining standard data structures for bio objects. FuGE and related efforts like MAGE-OM typically have relatively minimal representations of bio objects that are compatible with those found in richer data models.
FuGE has attracted interest from groups in different application areas such as gene expression, proteomics, and toxicogenomics interested in modeling their data and workflows as extensions of the generic FuGE model, thus realizing the benefits of re-use and potentially facilitating interoperation between FuGE-based apps.
The GenoViz data model, called Genometry, is used by the Integrated Genome Browser (IGB) and other tools based on the GenoViz code base. Documentation for IGB is end user-centric and lacks details about the underlying Genometry data model, which has some powerful concepts. Here are a few pointers:
- Genometry defines a concept called a sequence symmetry that provides a generic way to map locations across any number of bio sequences (e.g., link a SNP in genomic sequence coordinates to an amino acid residue in a protein domain). Generally, Genometry can get by without needing many specialized object types (e.g., gene, exon) but some specializations exist, such as com.affymetrix.genometryImpl.UcscGeneSym, which represents a UCSC database dump of RefSeq data.
- A Powerpoint presentation describing how Genometry works can be found in the documentation directory of the Genoviz CVS source tree.
- There's also an object model describing Genometry in the context of the DAS/2 spec. See the GenometryWithDas.zuml file located in the same Genoviz documentation directory.
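The sequence symmetry idea in the first pointer above can be approximated with a small Python sketch. To be clear, this is not the Genometry API: the `Symmetry` class, its fields, and `map_position` are hypothetical names of my own, and the sketch assumes forward-strand, zero-based half-open spans with simple linear scaling between them. It only aims to show how one generic structure (a span per sequence, plus child symmetries) can map a genomic position to a protein residue without any gene or exon classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Symmetry:
    """Hypothetical stand-in for a sequence symmetry.

    spans maps a sequence id to a 0-based, half-open (start, end) span.
    children refine the mapping (e.g. one child per exon).
    """
    spans: Dict[str, Tuple[int, int]]
    children: List["Symmetry"] = field(default_factory=list)

    def map_position(self, pos: int, source: str, target: str) -> Optional[int]:
        """Map pos on the source sequence to a position on the target sequence."""
        if source not in self.spans or target not in self.spans:
            return None
        s_start, s_end = self.spans[source]
        if not (s_start <= pos < s_end):
            return None
        if self.children:
            for child in self.children:
                hit = child.map_position(pos, source, target)
                if hit is not None:
                    return hit
            return None  # inside the parent span but in a gap, e.g. an intron
        # Leaf: scale the offset linearly from the source span to the target span.
        t_start, t_end = self.spans[target]
        offset = pos - s_start
        return t_start + offset * (t_end - t_start) // (s_end - s_start)


# Two coding exons (9 and 6 bases) mapped onto a 5-residue protein.
exon1 = Symmetry({"chr1": (100, 109), "protX": (0, 3)})
exon2 = Symmetry({"chr1": (200, 206), "protX": (3, 5)})
gene = Symmetry({"chr1": (100, 206), "protX": (0, 5)}, [exon1, exon2])

print(gene.map_position(104, "chr1", "protX"))  # SNP in exon 1 -> residue index 1
print(gene.map_position(150, "chr1", "protX"))  # intronic position -> None
```

The same structure works in the other direction: mapping residue index 4 from "protX" back to "chr1" lands on the first base of its codon in the second exon.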
Modware provides an object-oriented interface to a Chado database, described above.
- AGAVE: http://www.agavexml.org/ (DTD)
- BSML: http://xml.coverpages.org/bsml.html (DTD)
- BIOML: http://www.proteometrics.com/BIOML/ (DTD)
- Genomics Algebra
- Early OMG specifications by the Life Sciences Research (LSR) task force
These specs were developed in the pre-MDA, CORBA-centric days of the LSR, an OMG task force whose activity has dropped off substantially from its heyday in the late 1990s/early 2000s. Though they are no longer maintained, some good thought was put into these efforts and their contents remain a worthwhile source of reference material:
- Biomolecular Sequence Analysis (BSA) spec
In 2001, an attempt was made to re-tool this spec in an effort aptly called BSANE, which aimed to create an MDA-based version of the original BSA spec, incorporating ideas from the Bio* projects. This effort faltered for various reasons, chiefly because the Bio* developers viewed BioSQL as adequately serving the purpose of a common data model. Here's the last presentation on BSANE, from the 2001 BOSC.
- Genomic Maps spec and IDL files.
- Macromolecular Structure spec and IDL files.
- The LSR created other specs besides the ones listed above, which are the ones most relevant to bio data modeling. Here's the LSR's complete list.
Genomics Algebra: A New, Integrating Data Model, Language, and Tool for Processing and Querying Genomic Information (2003)
I'm not familiar with this work, but the paper looked interesting when it came up in a Google search.