Open2Dprot Project Software Development Plan
The Open2Dprot Web site home page discusses why there is a need for an open-source n-dimensional proteomics data analysis effort. It lists the project goals and explains why we are using open-source resources for Open2Dprot. An overview presents the Composite Samples Database model (CSD) used in Open2Dprot. The basic analysis pipeline shows the data reduction steps in converting n-D sample image data to a CSD. Finally, the Open2Dprot pipeline control program schedules and runs the modules in the pipeline after doing a data-dependency analysis.
The Open2Dprot project is a community effort to create an open source n-dimensional (n-D) protein expression data pipeline-analysis system. It is downloadable and could be used for exploratory data analysis of protein expression data across sets of n-dimensional (n-D) data from research experiments. In the initial phase, as early beta-level versions of the pipeline modules are created, they are made available for download and may be used for performing parts of the analysis. Interchangeable subproject modules are being developed for 2-dimensional data including 2D-PAGE (polyacrylamide gel electrophoresis). Initial support for 2D LC-MS, protein arrays and other data will also be provided. In the second phase, it will be expanded to handle data from other n-dimensional protein separation methods as well as more extensive analyses on these data.
The Open2Dprot project has a much broader agenda than just the analysis
of 2D gels - it is expected to address the issue of analysis of
protein expression data from a variety of sources and integrate them into
a single proteomic expression database.
Because of the modular XML pipeline design of Open2Dprot, the user can
replace particular pipeline modules with alternative modules at run time
to configure the pipeline for their particular type of experimental
protein expression data. Some alternative modules will not depend on
all of the potential preceding steps in the pipeline. In those
cases, the unneeded preceding steps are omitted. Pipeline scheduling is
handled by the pipeline dependency analysis. A set of alternate modules
will be available for the user to assign to the scheduler, letting them
design a pipeline that is optimal for their particular data or choice of
preprocessing methods. We are collaborating
with other groups to make some alternate modules available for the pipeline.
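The dependency analysis described above can be illustrated with a minimal
Java sketch. It is not the Open2Dprot pipeline control program: the class
name PipelineScheduler, the module names other than BuildCSD, and the
prerequisite table are all hypothetical and chosen only for illustration.
The sketch walks the prerequisites of a target module depth-first and
returns only the modules the target actually needs, in an order where each
module follows its prerequisites.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of pipeline dependency analysis; not Open2Dprot code. */
class PipelineScheduler {

    /**
     * Returns the modules needed to run the target, ordered so that every
     * module appears after all of its prerequisites. Modules the target does
     * not depend on are left out of the schedule.
     */
    static List<String> schedule(Map<String, List<String>> prerequisites, String target) {
        Set<String> done = new LinkedHashSet<>();
        visit(target, prerequisites, done, new LinkedHashSet<>());
        return new ArrayList<>(done);
    }

    private static void visit(String module, Map<String, List<String>> prerequisites,
                              Set<String> done, Set<String> inProgress) {
        if (done.contains(module)) {
            return; // already scheduled
        }
        if (!inProgress.add(module)) {
            throw new IllegalStateException("Cyclic dependency at module " + module);
        }
        for (String dep : prerequisites.getOrDefault(module, List.of())) {
            visit(dep, prerequisites, done, inProgress);
        }
        inProgress.remove(module);
        done.add(module); // all prerequisites are scheduled before this module
    }

    public static void main(String[] args) {
        // Assumed module names and dependencies, for illustration only.
        Map<String, List<String>> prereqs = new LinkedHashMap<>();
        prereqs.put("Accession", List.of());
        prereqs.put("Segmenter", List.of("Accession"));
        prereqs.put("SpotPairing", List.of("Segmenter"));
        prereqs.put("BuildCSD", List.of("SpotPairing"));
        // Hypothetical alternate module that reads pre-quantified spot lists,
        // so it needs accessioning but no image segmentation.
        prereqs.put("SpotListImport", List.of("Accession"));

        System.out.println(schedule(prereqs, "BuildCSD"));

        // The user substitutes the alternate module; Segmenter is now unneeded.
        prereqs.put("SpotPairing", List.of("SpotListImport"));
        System.out.println(schedule(prereqs, "BuildCSD"));
    }
}
```

After the substitution, the second schedule no longer contains the
Segmenter step, which is the behavior the text describes for alternate
modules that do not depend on all of the preceding steps.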
In the later phase of the project, Open2Dprot will be extended to
include other donated 2D gel analysis, n-dimensional mass spectrometry
(LC-MS, IPG-MSN, etc.), protein array, and related proteomics software
codes. The main focus of Open2Dprot is to provide a flexible open
source pipeline of n-D protein expression database analysis software tools.
Other Web sites (WORLD-2DPAGE, etc.) are primarily concerned with 2D gel database
repository issues whereas this project is primarily concerned with providing
the tools to analyze these types of data.
The old GELLAB-II system was an integrated collection of programs for
the analysis of 2D PAGE (2-dimensional polyacrylamide gel electrophoresis)
electrophoretic gels using image processing, database, statistics and
data-mining techniques. It was written in C for UNIX (SunOS and
Solaris primarily), used X-windows for graphics and Tektronix graphics
for plotting. For historical reasons, a
brief description of GELLAB-II as well as a list of literature
references is available.
Open2Dprot resides on the open source Web site
(http://open2dprot.sourceforge.net/) that is part of the SourceForge.net
Web site. SourceForge consists of an integrated "farm" of servers to
provide free open source tools, servers, and
disk space. SourceForge is one of the largest resources providing
public access to groups wishing to create open source projects.
For example, there were over 115,949 projects with
over 1,274,624 registered developers (as of March 20, 2006). Users
of open source software do not need to register to download
software, but people who want to help with projects do need
to register to participate in software development of a SourceForge
project.
Open2Dprot is being developed in a way similar to how the open source
Java-based MicroArray Explorer (http://maexplorer.sourceforge.net/) was developed.
MAExplorer is a DNA microarray data-mining tool and is freely
available under the Mozilla Public License 1.1 (Netscape) to both
academic and commercial users. The same open source license has been adopted
for Open2Dprot. Many of the issues addressed in the old GELLAB-II
system influenced the design of MAExplorer. Some of what was learned
in building MAExplorer will be incorporated into the initial versions
of Open2Dprot. A small subset of the GELLAB-II code (C / Unix / X-windows),
as well as a subset of the MAExplorer and Flicker project code, is being
refactored (where required) to a more modern, modular, and portable
(JAVA / XML / MySQL-RDBMS) paradigm.
Our plan integrates the R language (http://www.r-project.org/), the open
source data, graphics and statistics programming language project, with
Open2Dprot in a way similar to the way R was integrated with MAExplorer.
However, whereas MAExplorer executed a new instance of the R program each
time R was run, we will use an R server to evaluate R programs with
Open2Dprot data, which is more efficient.
The Open2Dprot development plan attempts to incorporate new relevant
proteomics bioinformatics efforts, databases, and tools and libraries as
well as other ways of running 2D PAGE gels, and their integration with
mass spectrometry and other methods for the identification of individual
spots. We are also interested in extending Open2Dprot to work with
other n-dimensional proteomics data.
This project will require considerable effort by participating open
source members to restructure the code base and bring it up to current
researcher requirements for 2D proteomics analyses. Some of the issues are
summarized below. Open2Dprot consists of a set of subproject module
applications, which will be made available as beta-level software as they
become operational.
Although we are developing the CSDminer exploratory data analysis tool to
explore a CSD RDBMS previously constructed by the BuildCSD module, any
other software that can use the published XML CSD schema could be used.
The design uses modular analysis components at various stages of the
analysis by providing well-defined XML interfaces between the components
and APIs (Application Programming Interfaces) to common libraries. There
is a common Open2Dprot library called O2Plib that handles all XML, cache
and RDBMS I/O so that the pipeline modules can use these uniform data
interchange formats. This then allows alternate analysis methods to be
substituted for the original methods (for sample data accessioning, image
spot segmentation, spot pairing, databasing, data mining, etc.). We
envision eventually having a variety of
contributed modules available and allowing the user to configure their
particular analysis system to their particular requirements using the modules
that are superior for their needs.
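As a rough illustration of how a uniform interchange format lets modules
be swapped, the following Java sketch defines a single module contract and
a registry that maps a pipeline function to whichever module the user has
assigned. None of this is the O2Plib API: the names PipelineModule,
ModuleRegistry, and "spotPairing" are hypothetical, and the XML documents
are reduced to plain strings to keep the sketch self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical module contract: consume one XML document, produce another. */
interface PipelineModule {
    String run(String inputXml);
}

/** Hypothetical registry mapping a pipeline function to its assigned module. */
class ModuleRegistry {
    private final Map<String, PipelineModule> assigned = new LinkedHashMap<>();

    /** Assigns (or replaces) the module used for a pipeline function. */
    void assign(String function, PipelineModule module) {
        assigned.put(function, module);
    }

    /** Runs the module currently assigned to a pipeline function. */
    String run(String function, String inputXml) {
        PipelineModule module = assigned.get(function);
        if (module == null) {
            throw new IllegalArgumentException("No module assigned for " + function);
        }
        return module.run(inputXml);
    }
}

class ModuleRegistryDemo {
    public static void main(String[] args) {
        ModuleRegistry registry = new ModuleRegistry();

        // Two stand-in spot-pairing methods; each only tags its output.
        registry.assign("spotPairing",
                xml -> "<paired method=\"default\">" + xml + "</paired>");
        System.out.println(registry.run("spotPairing", "<spots/>"));

        // Substituting an alternate method leaves the caller unchanged,
        // because both modules accept and emit the same document format.
        registry.assign("spotPairing",
                xml -> "<paired method=\"alternate\">" + xml + "</paired>");
        System.out.println(registry.run("spotPairing", "<spots/>"));
    }
}
```

The design point is that the caller depends only on the shared contract
and document format, so a contributed module can replace the original
without changes elsewhere in the pipeline.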
As design specifications, components, and documentation become available,
they are posted on this Open2Dprot web site.
As the project proceeds, additional volunteers will be needed to help
staff it. Additional facilities will be set up at SourceForge
including: mailing lists, forums, bug reports, suggestions, etc.
SourceForge provides sophisticated project management resources that
can greatly facilitate developing open source projects. Some of these
include role assignment for those who join the active development on a
project. Examples of roles supported by the SourceForge.Net
server tools include: project manager, developer, documentation
writer, tester, support manager, graphic/other designer, documentation
translator, editorial/content writer, packager, analysis/design,
advisor/consultant, Web designer, etc. We hope to add individuals with
expertise in some of these roles. Long term, having multiple project
managers who are expert in the various proteomic and bioinformatic
areas will be essential.
Due to the size and scope of this project, it will require a minimum
level of core expert-developer participation from the bioinformatics
research community.
Once the initial pipeline and pipeline control program are
working, qualified developers will be invited to join the
SourceForge Open2Dprot core-development team.
1. Initial creation of software in the Open2Dprot project
In the initial phase, Open2Dprot is based partly on exploratory
data analysis, image processing software and methods from a variety of
bioinformatics software systems, and partly on new software methodologies.
Initially, Open2Dprot n-dimensional (2D)
proteomics analysis software is based on a major refactoring of some of this software.
Refactoring of these codes includes incrementally translating some C code
to Java, recoding some existing Java code, adding new algorithms and
paradigms including XML schemas and XMLbeans, and updating Java code
using modern programming practices. Some of the code base used includes
code from the Unix version of "GELLAB-II", Java and R code from
open-source efforts including the MAExplorer project and the Flicker
project, R code and related R-packages from the Bioconductor project, and
other open-source microarray data-mining, image processing and relevant
LC-MS software used in 2D and n-D analysis and data-mining.
[The Unix version of GELLAB-II is no longer being distributed].
2. Subsequent development of Open2Dprot
Although the original GELLAB-II system used X-windows (MAExplorer used
Java and R), the new paradigm uses direct-manipulation Java-graphics, R
graphics, and possibly an alternative graphics system such as SVG for
some of the graphics. The original GELLAB-II composite 2D gel (i.e.,
set of gel samples) database engine was not a relational database
because of efficiency and availability considerations at that time.
However, Open2Dprot will use an open source SQL RDBMS (such as MySQL
or Postgres) using a proteomics community standard schema as part of
the redesign effort. We are leaning toward using the Minimal
Information About a Proteomics Experiment (MIAPE standard, formerly
PEDRo) to implement the more general n-dimensional Composite Samples
Database (CSD). Our schemas will extend MIAPE schemas where required for
our data types not found in MIAPE. Early versions of a document describing
our (currently non-MIAPE compliant) Open2Dprot XML schemas are available.
3. Help and skills required for this project
Open2Dprot is targeted primarily towards the most common user systems,
including Microsoft Windows, Linux, Unix, and MacOS-X platforms. It is being
developed as a set of stand-alone Java tools that interact with a local
or network-based RDBMS and local data caches. The data-mining tools will
also include relational databases and R-language data analysis.
3.1 Help from the Bioinformatics community
Because it is critical that we develop a well-formed XML pipeline, this group
will remain small during the initial effort until the basic pipeline design
is functioning at a beta-level. As the early beta-level system is
released, the design and XML schemas will be refactored and redefined
based on feedback as new members of the group get involved. Some of the
expert help from the bioinformatics community might include:
3.2 Bioinformatics software core-developer skills required
The Open2Dprot project is evolving using the following technologies:
Java and R (for advanced statistics, clustering, and classification analyses),
and XML and SQL RDBMS database manipulation (probably using MySQL or Postgres)
for a new composite samples database structure. Additional core-developer
expertise will be required in:
3.3 Biologist users to Beta-test the system with their 2D proteomics data
The initial effort will also require users who would be willing to use
early beta-level versions on their n-D proteomics data. As Open2Dprot
is extended to other n-D protein expression data sources, beta-testers
will be needed to help test and develop the software modules with other
types of n-D proteomics data. They might work with software developers
within their home institution group to develop particular pipeline
modules to fit into the pipeline for the new types of data.
3.4 Subsequent extension and integration with other proteomics software and databases
The extended effort will require a core of bioinformatics software
developers who are willing and have the expertise to integrate other types
of proteomic software and data with the Open2Dprot project. Examples
include integrating 2D LC-MS mass spectrometry, protein arrays,
dye-multiplexed data, and other proteomic annotation, characterization,
nanobiology characterization, identification data, pathway data, and
related database methods. Open2Dprot will integrate these methods in a
relational database data server model being developed by international
proteomics standards groups including the Proteomics Standards Initiative
(PSI) and the Human Proteome Organization (HUPO), and using the Minimal
Information About a Proteomics Experiment (MIAPE, formerly PEDRo) schema
or an agreed upon proteomics community standard data description schema.
3.5 Using alternate computation modules for analysis pipeline
Finally, when the initial beta version is made available and the XML data
interchange schema is stable, we encourage others to contribute their
2D-gel analysis, related n-dimensional LC-MS proteomics, protein array,
and related bioinformatics software refactored for the Open2Dprot paradigm.
Because the pipeline data are exchanged in a standard XML format, it
will be easy to assign alternate pipeline modules that perform an
equivalent pipeline function. For example, there may be other methods of doing
spot quantification or spot pairing. Different types of 2D proteomics
data will require different processing methods when the tool set is
extended to integrate or analyze data from other protein separation
methods such as mass spectrometry, protein arrays, dye multiplexing, etc.
Other examples of alternate methods would be in the data-mining methods.
4. SourceForge project management tools
We use the SourceForge CVS system for maintaining code worked on by the
various developers. The beta releases are made available on the
SourceForge file mirror. These distributions include binary distributions,
demo data, and source code snapshots.
5. Bioinformatics community participation in the open source Open2Dprot project
If you are involved in bioinformatics and are interested in helping
with this effort and/or would consider participating in any of the roles
defined above (3. Help and skills required for this project), please
e-mail us with what you see as your contribution.
Revised: 03/20/2006