1. Introduction
There is a need for an open source proteomics expression exploratory
data analysis software tool. Such a tool should handle data from 2D-PAGE,
2D LC-MS, 2D IPG-MS, n-dimensional LC-MS*MS*MS..., protein arrays, and other
protein expression separation methods. All of these methods share a common
paradigm: proteins separated by orthogonal features. Some of these
methods are semi-quantitative, but some of that data could still be analyzed
using exploratory data analysis methods. Data in these systems can
be represented as protein expression profiles, which lend themselves to
exploratory data analysis. Open2Dprot could then be used as part of a
broader set of integrated tools.
This Web site describes the Open2Dprot project.
2D-PAGE was not widely used until recently due to limitations in
identifying differentially expressed spots and the difficulty in resolving
and detecting specialized classes of proteins (e.g., basic proteins,
membrane proteins, low abundance proteins). 2D-PAGE is often
used as a prescreening stage for mass-spectrometry to identify excised
spots found in differential analysis. Today 2D gels have improved resolution
with zoom 2D-gels and new pre-fractionation methods. There are other protein
separation techniques, such as 2D LC-MS and protein arrays, that
could use exploratory analysis paradigms developed for expression data
from 2D gels and, more recently, from DNA microarrays.
An open-source n-dimensional proteomics data analysis effort
An open-source project can be beneficial to the proteomics community at
large much as the genomics community has benefited from open-source tools.
Academic-style collaboration within the research community is far more
likely to lead to wide distribution and algorithm improvement than a
closed-source business model. Researchers can more rapidly adapt new
methods to existing software without waiting for release of commercial
products. It is possible to use the contributed expertise and code of
proteomics experts and bioinformaticians to help build and test open software.
Algorithms are more transparent, so researchers can verify results more easily,
and there is more opportunity to share data in standard non-proprietary
formats. No expensive software licenses are required, which reduces
deployment costs within large organizations and small labs. In addition,
choosing an appropriate open-source license (see the
Open Source Licensing book by L. Rosen, 2005) can encourage adoption and
collaboration between commercial and academic interests (e.g., Linux, Firefox,
Eclipse, Apache). Many free open-source repositories are available. Such
repositories offer tools to support collaboration, software development,
documentation, forums, and distribution.
Project goals for Open2Dprot
The Open2Dprot project goals include an international community
effort to create an open-source n-D quantitative data analysis system. It
will be a stand-alone downloadable system that can connect to relational
databases both locally and over the Internet. It could be used for data
mining protein expression across sets of samples from researchers'
experiments to investigate and find significant protein expression from
multiple experiments. It should provide an integrated set of
software tools, analysis methods, and data structures for quantitative
and system biology protein expression. Finally, it should handle protein
expression data from 2D-gel, 2D LC-MS, protein arrays, and other protein
separation methods.
Using open-source resources for Open2Dprot
Open2Dprot is hosted and developed on the SourceForge repository at
open2dprot.sourceforge.net.
This Web site discusses the current Open2Dprot software development plan.
We are using an open-source development methodology similar to the one we
used for our Java/R-based MAExplorer DNA microarray data-mining software.
Open2Dprot could later reside as part of other proteomics resource Web
sites that integrate it with other tools relating to mass spectrometry,
dye multiplexing, protein arrays, Internet proteomic databases, etc.
2. The Open2Dprot project - phased implementation
In the initial phase, Open2Dprot is partly based on exploratory
data analysis, image processing, and other methods from a variety of
bioinformatics software systems. These include a subset of methods
from the last (1993) Unix version of the NCI "GELLAB-II" system and
open-source bioinformatics codes such as MAExplorer, Flicker, and
R-language-based systems. In addition, other open-source 2D-gel and
2D LC-MS analysis codes and microarray data-mining codes may be
incorporated. This is described in more detail in the Open2Dprot
development plan.
The overall analysis consists of two procedures: a pre-processing
procedure to create the database, and then an open-ended data mining
procedure of the composite samples database.
As subprojects in the analysis pipeline reach the beta-testing stage,
they are made available for download and are announced in the module
list of subprojects integrating our schema with developing community
proteomics data schemas.
In the second phase, Open2Dprot will be extended with other donated 2D
gel analysis and n-D LC-MS*MS*MS*... analysis, mass spectrometry,
protein microarray, and related proteomics and bioinformatics software
codes as well as core-developer efforts by the bioinformatics research
community. We envision extending the Open2Dprot project to a more
general proteomics expression analysis environment incorporating other
types of n-D proteomics expression data. These will be integrated by a
common proteomics database schema.
We welcome suggestions for
extending or modifying this agenda for Open2Dprot for both the initial
and second phase of development.
The Open2Dprot software is available as freely downloadable data
mining software tools for n-D data analysis from the SourceForge
files mirror. These will implement a minimal system including
the ability to:
- Accession (i.e., enter) samples (scanned images or spot data from
other pre-processing software) and experiment information into a user's
experiment project database. This will be extended to implement
the Proteomics Standards Initiative MIAPE schema.
- Quantify polypeptide "spots" by various methods, including image
processing of 2D gel images, polypeptide "spots" assembled by
clustering in a series of n-D LC-MS-MS-MS*... data, spot data from
protein arrays analyzed by other software, or other separation
methods.
- If needed for spot pairing, create a landmark database between a
reference sample and the remaining samples using a spot-pairing algorithm
(this step is not required for some automatic spot-pairing methods).
- Pair "spots" between reference samples and the rest of the
samples if required (e.g., this is not required for protein arrays, since
spots between arrays are paired by definition).
- Construct a Composite Samples Database,
CSD, from the paired-spot data for subsequent exploratory data analysis.
- Explore the CSD data using
exploratory data analysis techniques:
visualization methods, statistics, clustering, direct-manipulation
graphics and reports, using additional annotation data derived from
Internet proteomic, genomic and functional bioinformatics databases,
etc.
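The pairing and database-construction steps in the list above can be sketched as follows. This is only an illustration, not the Open2Dprot implementation: the `Spot` fields, the function names `pair_spots` and `build_csd`, the greedy nearest-neighbor rule, and the `max_dist` tolerance are all assumptions made for the example, and real spot pairing (e.g., with landmarks) is considerably more involved.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Spot:
    """Hypothetical spot record; field names are assumptions for this sketch."""
    spot_id: int     # identifier within its own sample
    x: float         # position along the first separation dimension
    y: float         # position along the second separation dimension
    density: float   # integrated quantity measured for the spot


def pair_spots(reference, sample, max_dist=5.0):
    """Greedily pair each reference spot with the nearest unused sample spot.

    Returns {reference spot_id: paired sample Spot}. Reference spots with no
    sample spot within max_dist are left unpaired.
    """
    pairs = {}
    used = set()
    for ref in reference:
        best, best_d = None, max_dist
        for s in sample:
            if s.spot_id in used:
                continue
            d = hypot(ref.x - s.x, ref.y - s.y)
            if d <= best_d:
                best, best_d = s, d
        if best is not None:
            pairs[ref.spot_id] = best
            used.add(best.spot_id)
    return pairs


def build_csd(reference, samples, max_dist=5.0):
    """Build a toy composite samples database.

    The result maps each reference spot_id to an expression profile
    {sample name: density}, including the reference sample itself.
    """
    csd = {ref.spot_id: {"reference": ref.density} for ref in reference}
    for name, spots in samples.items():
        for ref_id, s in pair_spots(reference, spots, max_dist).items():
            csd[ref_id][name] = s.density
    return csd


reference = [Spot(1, 10.0, 10.0, 100.0), Spot(2, 50.0, 40.0, 200.0)]
samples = {
    "A": [Spot(7, 11.0, 10.5, 120.0), Spot(8, 80.0, 80.0, 50.0)],
    "B": [Spot(3, 49.0, 41.0, 180.0)],
}
csd = build_csd(reference, samples)
```

In this sketch each row of `csd` is one expression profile across samples, which is the form the exploratory analysis step would consume; samples in which a spot was not paired are simply missing from that spot's profile.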
The composite samples database is then used for exploratory data
analysis to investigate and find significant protein expression
profiles relevant to the experiments, for sample-classification, for
functional analysis of subsets of proteins with common expression,
etc.
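As a rough illustration of finding significant expression profiles, the sketch below screens a toy CSD for spots whose mean quantity differs between two groups of samples. It assumes the CSD is a dictionary of per-spot profiles keyed by sample name; the function names, the use of a log2 ratio combined with Welch's t statistic, and both thresholds are assumptions chosen for the example, not methods prescribed by the project.

```python
from math import copysign, inf, log2, sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two groups, each with at least two values."""
    diff = mean(a) - mean(b)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        # No within-group spread: the statistic is unbounded unless means match.
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / se


def differential_spots(csd, group_a, group_b, min_abs_log2=1.0, min_abs_t=2.0):
    """Return (spot_id, log2 ratio, t) for spots passing both thresholds.

    Spots with fewer than two values in either group, or with non-positive
    quantities, are skipped. The thresholds are arbitrary illustration values.
    """
    hits = []
    for spot_id, profile in csd.items():
        a = [profile[s] for s in group_a if s in profile]
        b = [profile[s] for s in group_b if s in profile]
        if len(a) < 2 or len(b) < 2 or min(a + b) <= 0:
            continue
        ratio = log2(mean(a) / mean(b))
        t = welch_t(a, b)
        if abs(ratio) >= min_abs_log2 and abs(t) >= min_abs_t:
            hits.append((spot_id, ratio, t))
    return hits


toy_csd = {
    1: {"c1": 100, "c2": 110, "t1": 400, "t2": 420},  # clearly changed
    2: {"c1": 100, "c2": 105, "t1": 102, "t2": 98},   # essentially unchanged
    3: {"c1": 50, "t1": 500},                         # too few values
}
hits = differential_spots(toy_csd, group_a=["t1", "t2"], group_b=["c1", "c2"])
```

A screen like this only produces candidates; with many spots tested at once, a real analysis would also need replicate design and multiple-testing control.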
If there are compatible reference CSD databases available over the
Internet, data from them could be cached locally to aid in the
analysis of the investigator's private database. For example, if there
were a reference database with identified proteins, those
identifications could be used to annotate spots in the investigator's
local experiment samples.
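The caching and annotation idea can be sketched as below. Everything here is hypothetical: the JSON cache file, the `fetch` callable standing in for a download from a reference database, and the assumption that local spots have already been paired to reference spot identifiers.

```python
import json
from pathlib import Path


def load_reference_annotations(cache_path, fetch):
    """Return {reference spot id: protein identification}.

    Uses the local cache file if it exists; otherwise calls fetch(), a
    caller-supplied function standing in for a remote download, and
    stores the result locally for later runs.
    """
    path = Path(cache_path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    annotations = fetch()
    path.write_text(json.dumps(annotations), encoding="utf-8")
    return annotations


def annotate_spots(local_to_ref, annotations):
    """Map local spot ids to identifications via their paired reference spot.

    Local spots whose reference spot has no identification are omitted.
    """
    return {
        local: annotations[ref]
        for local, ref in local_to_ref.items()
        if ref in annotations
    }
```

For example, given `annotations = {"R17": "albumin"}` and a pairing `{"L3": "R17", "L9": "R42"}`, `annotate_spots` would label only local spot `L3`, since `R42` has no identification in the reference data.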
When fully implemented, Open2Dprot will run as an integrated set of
stand-alone applications on your computer with connection to Internet
proteomics databases as required. The set of pipeline modules is
controlled by the main Open2Dprot pipeline control program scheduler
(see Figure 3).
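A pipeline control program of this kind can be pictured as running an ordered list of modules over a shared project state. The sketch below is an assumption-laden illustration: the stage names follow the steps listed earlier, while `run_pipeline`, `make_stage`, and the `skip` option (e.g., for skipping landmarking when automatic pairing is used) are hypothetical and do not describe the actual scheduler.

```python
def run_pipeline(stages, state, skip=()):
    """Run (name, module) stages in order, passing the state along.

    Each module takes the current state and returns a new state. Stages
    named in skip are not run. Returns the final state and a run log.
    """
    log = []
    for name, module in stages:
        if name in skip:
            log.append((name, "skipped"))
            continue
        state = module(state)
        log.append((name, "done"))
    return state, log


def make_stage(name):
    """Create a placeholder module that only records that it ran."""
    def module(state):
        return {**state, "history": state.get("history", []) + [name]}
    return module


# Hypothetical stage names mirroring the steps listed earlier.
STAGE_NAMES = ["accession", "quantify", "landmark", "pair", "build_csd"]
stages = [(n, make_stage(n)) for n in STAGE_NAMES]

final_state, log = run_pipeline(stages, {"project": "demo"}, skip={"landmark"})
```

Keeping each module as an independent function over the project state is one way to let subprojects be developed and released separately while still being driven by a single controller.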
Open2Dprot's exploratory data analysis environment will provide tools
for the data mining of quantitative expression profiles across
multiple 2D samples. We envision providing a few demonstration
databases as examples. Other repositories could house additional
public databases. The focus of Open2Dprot is to provide an open-source
set of software tools for the analysis of n-D data databases, not to
provide a database repository.
This Web site discusses the Open2Dprot software development plan. An
overview that summarizes Open2Dprot is available as PDF slides (full
slides, 2 slides/page, or 6 slides/page). See the subproject module
lists for more details on individual processing modules.
There is also a brief historical
description of the original GELLAB-II system. There were a number of papers describing
GELLAB-II and projects where it was used.