Software Development Plan for the Open2Dprot Project

The Open2Dprot Web site home page discusses why there is a need for an open-source n-dimensional proteomics data analysis effort. It lists the project goals and explains why we are using open-source resources for Open2Dprot. An overview presents the Composite Samples Database (CSD) model used in Open2Dprot. The basic analysis pipeline shows the data reduction steps in converting n-D sample image data to a CSD. Finally, the Open2Dprot pipeline control program schedules and runs the modules in the pipeline after doing a data-dependency analysis.

The Open2Dprot project is a community effort to create an open source n-dimensional (n-D) protein expression data pipeline-analysis system. It is downloadable and could be used for exploratory data analysis of protein expression data across sets of n-dimensional (n-D) data from research experiments. In the initial phase, as early beta-level versions of the pipeline modules are created, they are made available for download and may be used for performing parts of the analysis. Interchangeable subproject modules are being developed for 2-dimensional data including 2D-PAGE (polyacrylamide gel electrophoresis). Initial support for 2D LC-MS, protein arrays and other data will also be provided. In the second phase, it will be expanded to handle data from other n-dimensional protein separation methods as well as more extensive analyses on these data.

The Open2Dprot project has a much broader agenda than just the analysis of 2D gels: it is expected to address the analysis of protein expression data from a variety of sources and to integrate them into a single proteomic expression database.

1. Initial creation of software in the Open2Dprot project

In the initial phase, Open2Dprot is based partly on exploratory data analysis, image processing software and methods from a variety of bioinformatics software systems, and partly on new software methodologies. Initially, the Open2Dprot n-dimensional (2D) proteomics analysis software is based on a major refactoring of some of this software. Refactoring this code includes incrementally translating some C code to Java, recoding some existing Java code, adding new algorithms and paradigms including XML schemas and XMLBeans, and updating Java code using modern programming practices. The code base draws on code from the Unix version of "GELLAB-II"; Java and R code from open-source efforts including the MAExplorer project and the Flicker project; R code and related R packages from the Bioconductor project; and other open-source microarray data-mining, image processing and relevant LC-MS software used in 2D and n-D analysis and data-mining. [The Unix version of GELLAB-II is no longer being distributed.]

Because of the modular XML pipeline design of Open2Dprot, the user can replace particular pipeline modules at run time with alternative pipeline modules to configure the pipeline for their particular type of experimental protein expression data. Some alternative modules will not depend on performing all of the potential previous steps in the pipeline. In those cases, the unneeded preceding steps are omitted. Pipeline scheduling is handled by the pipeline dependency analysis. A set of alternate modules will be available for the user to assign to the scheduler, letting them design a pipeline optimal for their particular data or choice of preprocessing methods. We are collaborating with other groups to make some alternate modules available for the pipeline.
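The dependency analysis described above can be pictured as a depth-first walk over a module dependency graph. The following is only a minimal sketch under stated assumptions, not the Open2Dprot pipeline control program: the class PipelineScheduler, the module names other than BuildCSD, and the dependency edges between them are all hypothetical and chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of pipeline dependency analysis; not the actual Open2Dprot scheduler. */
class PipelineScheduler {

    /**
     * Returns the modules needed to produce the target module's output, ordered so
     * that every module appears after the modules it depends on. Modules the target
     * does not depend on (directly or indirectly) are left out of the schedule.
     */
    static List<String> schedule(Map<String, List<String>> deps, String target) {
        Set<String> done = new LinkedHashSet<>();        // scheduled modules, in run order
        Set<String> inProgress = new LinkedHashSet<>();  // modules on the current path
        visit(target, deps, done, inProgress);
        return new ArrayList<>(done);
    }

    private static void visit(String module, Map<String, List<String>> deps,
                              Set<String> done, Set<String> inProgress) {
        if (done.contains(module)) {
            return; // already scheduled
        }
        if (!inProgress.add(module)) {
            throw new IllegalStateException("Cyclic dependency at " + module);
        }
        // Schedule every prerequisite before the module itself.
        for (String prerequisite : deps.getOrDefault(module, List.of())) {
            visit(prerequisite, deps, done, inProgress);
        }
        inProgress.remove(module);
        done.add(module);
    }

    public static void main(String[] args) {
        // Assumed default pipeline (module names and edges are illustrative only).
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("Accession", List.of());
        deps.put("Segmenter", List.of("Accession"));
        deps.put("Landmark", List.of("Accession"));
        deps.put("SpotPairing", List.of("Segmenter", "Landmark"));
        deps.put("BuildCSD", List.of("SpotPairing"));
        System.out.println(schedule(deps, "BuildCSD"));

        // The user swaps in a hypothetical module that imports ready-made spot
        // lists, so the image segmentation step is no longer required.
        deps.put("ImportSpotList", List.of("Accession"));
        deps.put("SpotPairing", List.of("ImportSpotList", "Landmark"));
        System.out.println(schedule(deps, "BuildCSD"));
    }
}
```

In the second schedule the Segmenter module no longer appears, because nothing on the path to BuildCSD depends on it. This is the sense in which unneeded preceding steps are omitted.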

In the later phase of the project, Open2Dprot will be extended to include other donated 2D gel analysis, n-dimensional (LC-MS, IPG-MS^N, etc.) mass spectrometry, protein array, and related proteomics software. The main focus of Open2Dprot is to provide a flexible open source pipeline of n-D protein expression database analysis software tools. Other Web sites (WORLD-2DPAGE, etc.) are primarily concerned with 2D gel database repository issues, whereas this project is primarily concerned with providing the tools to analyze these types of data.

The old GELLAB-II system was an integrated collection of programs for the analysis of 2D PAGE (2-dimensional polyacrylamide gel electrophoresis) gels using image processing, database, statistics and data-mining techniques. It was written in C for UNIX (SunOS and Solaris primarily) and used X-windows for graphics and Tektronix graphics for plotting. For historical reasons, a brief description of GELLAB-II as well as a list of literature references is available.

Open2Dprot resides on the open source Web site (http://open2dprot.sourceforge.net/) that is part of the SourceForge.net Web site. SourceForge consists of an integrated "farm" of servers that provide free open source tools, servers, and disk space. SourceForge is one of the largest resources providing public access to groups wishing to create open source projects. For example, there were over 115,949 projects with over 1,274,624 registered developers (as of March 20, 2006). Users of open source software don't need to register to download software, but people who want to help with projects do need to register to participate in software development of a SourceForge project.

Open2Dprot is being developed in a way similar to the way the open source Java-based MicroArray Explorer (http://maexplorer.sourceforge.net/) was developed. MAExplorer is a DNA microarray data-mining tool and is freely available under the Mozilla 1.1 (Netscape) public license to both academic and commercial users. The same open source license has been adopted for Open2Dprot. Many of the issues addressed in the old GELLAB-II system influenced the design of MAExplorer. Some of what was learned in building MAExplorer will be incorporated into the initial versions of Open2Dprot. A small subset of the GELLAB-II (C / Unix / X-windows) code, as well as a subset of the MAExplorer and Flicker project code, is being refactored (where required) to a more modern, modular, and portable (Java / XML / MySQL-RDBMS) paradigm.

2. Subsequent development of Open2Dprot

Although the original GELLAB-II system used X-windows (MAExplorer used Java and R), the new paradigm uses direct-manipulation Java-graphics, R graphics, and possibly an alternative graphics system such as SVG for some of the graphics. The original GELLAB-II composite 2D gel (i.e., set of gel samples) database engine was not a relational database because of efficiency and availability considerations at that time. However, Open2Dprot will use an open source SQL RDBMS (such as MySQL or Postgres) using a proteomics community standard schema as part of the redesign effort. We are leaning toward using the Minimal Information About a Proteomics Experiment (MIAPE standard, formerly PEDRo) to implement the more general n-dimensional Composite Samples Database (CSD). Our schemas will extend MIAPE schemas where required for our data types not found in MIAPE. Early versions of a document describing our (currently non-MIAPE compliant) Open2Dprot XML schemas are available.

Our plan integrates R (http://www.r-project.org/), the open source programming language for data analysis, graphics and statistics, with Open2Dprot in a way similar to the way R was integrated with MAExplorer. However, whereas MAExplorer started a new instance of the R program each time R was run, we will use an R server to evaluate R programs on Open2Dprot data, which is more efficient.

The Open2Dprot development plan attempts to incorporate new relevant proteomics bioinformatics efforts, databases, and tools and libraries as well as other ways of running 2D PAGE gels, and their integration with mass spectrometry and other methods for the identification of individual spots. We are also interested in extending Open2Dprot to work with other n-dimensional proteomics data.

3. Help and skills required for this project

Open2Dprot is targeted primarily at the most common user systems, including Microsoft Windows, Linux, Unix, and MacOS-X platforms. It is being developed as a set of stand-alone Java tools that interact with a local or network-based RDBMS and local data caches. The data-mining tools will also include relational databases and R-language data analysis.

This project will require considerable effort by participating open source members to restructure the software and bring it up to current researcher requirements for 2D proteomics analyses. Some of the issues are summarized below. The software consists of a set of subproject module applications which will be made available as beta-level software as they become operational.

3.1 Help from the Bioinformatics community

Because it is critical that we develop a well-formed XML pipeline, this group will remain small during the initial effort until the basic pipeline design is functioning at a beta-level. As the early beta-level system is released, the design and XML schemas will be refactored and redefined based on feedback and as new members of the group get involved. Some of the expert help from the bioinformatics community might include:

Bioinformatics software core-developers with experience in Java-language / R-language / XML / SQL / (Windows, Linux, MacOS-X). They should have a good understanding of the 2D and n-D proteomics data analysis problem. Expertise in Java and bioinformatics is critical,
Bioinformatics software core-developers having additional specialized expertise in one or more of the following: image processing, databases, statistics, clustering, data-mining methods, Internet proteomics databases, etc.,
A few senior bioinformatics core-developers who might be interested in also taking on managerial and design roles (a long-term goal is to have multiple 'project managers' for the project),
Writers to help with user-friendly documentation, tutorials, and training materials which are critical for easy use of the software. These will come into play after the initial phase. Understanding of the bioinformatics issues is important.

3.2 Bioinformatics software core-developer skills required

As the Open2Dprot project evolves using the following technologies - Java and R (for advanced statistics, clustering, and classification analyses), XML and SQL RDBMS database manipulation (probably using MySQL or Postgres) for a new composite samples database structure, additional core-developer expertise will be required in:

XML data interchange to allow modular components to communicate between pipeline modules and the CSD,
Java graphics for GUI implementation, statistics for extended analyses,
R programming, graphics and statistics for extended analyses,
SQL for implementing the accession, annotation, landmark, spot lists, paired spot lists, and the composite samples databases as a relational database. This would incorporate a proteomics research community SQL standard schema such as MIAPE.

3.3 Biologist users to Beta-test the system with their 2D proteomics data

The initial effort will also require users who would be willing to use early beta-level versions on their n-D proteomics data. As Open2Dprot is extended to other n-D protein expression data sources, beta-testers will be needed to help test and develop the software modules with other types of n-D proteomics data. They might work with software developers within their home institution group to develop particular pipeline modules to fit into the pipeline for the new types of data.

3.4 Subsequent extension and integration with other proteomics software and databases

The extended effort will require a core of bioinformatics software developers who are willing and have the expertise to integrate other types of proteomic software and data with the Open2Dprot project. Examples include integrating 2D LC-MS mass spectrometry, protein arrays, dye-multiplexed data, and other proteomic annotation, characterization, nanobiology characterization, identification data, pathway data, and related database methods. Open2Dprot will integrate these methods in a relational database data server model being developed by international proteomics standards groups including the Proteomics Standards Initiative (PSI) and the Human Proteome Organization (HUPO), and using the Minimal Information About a Proteomics Experiment (MIAPE, formerly PEDRo) schema or an agreed upon proteomics community standard data description schema.

3.5 Using alternate computation modules for analysis pipeline

Finally, when the initial beta version is made available and the XML data interchange schema is stable, we encourage others to contribute their 2D-gel analysis, related n-dimensional LC-MS proteomics, protein array, and related bioinformatics software refactored for the Open2Dprot paradigm. Because pipeline modules exchange data in a standard XML format, it will be easy to substitute an alternate module for any equivalent pipeline function. For example, there may be other methods of doing spot quantification or spot pairing. Different types of 2D proteomics data will require different processing methods when the tool set is extended to integrate or analyze data from other protein separation methods such as mass spectrometry, protein arrays, dye multiplexing, etc. Data-mining methods are another area where alternate modules could be used.

Although we are developing the CSDminer exploratory data analysis tool to explore a previously constructed CSD RDBMS (built by the BuildCSD module), any other software that can use the published XML CSD schema could be used instead.

The design uses modular analysis components at the various stages of the analysis by providing well-defined XML interfaces between the components and APIs (Application Programming Interfaces) to common libraries. There is a common Open2Dprot library called O2Plib that handles all XML, cache and RDBMS I/O so that the pipeline modules can use these uniform data interchange formats. This allows alternate analysis methods to be substituted for the original methods (sample data accessioning, image spot segmentation, spot pairing, databasing, data mining, etc.). We envision eventually having a variety of contributed modules available and allowing users to configure their analysis system to their particular requirements using the modules that best suit their needs.
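As an illustration of how a shared XML format makes modules interchangeable, here is a minimal sketch, assuming a made-up spot list document. It does not use or reproduce the O2Plib API or the Open2Dprot XML schemas: the class InterchangeableModules, the SpotQuantifier interface, the module names "raw" and "normalized", and the spotList/spot elements with their id and density attributes are all hypothetical. Only standard Java XML classes are used.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical sketch; none of these names come from O2Plib or the Open2Dprot schemas. */
class InterchangeableModules {

    /** Every quantifier reads the same spot list XML and returns one value per spot id. */
    interface SpotQuantifier {
        Map<String, Double> quantify(Document spotList);
    }

    /** Alternative 1: report the density attribute of each spot unchanged. */
    static final SpotQuantifier RAW = doc -> {
        Map<String, Double> out = new LinkedHashMap<>();
        NodeList spots = doc.getElementsByTagName("spot");
        for (int i = 0; i < spots.getLength(); i++) {
            Element spot = (Element) spots.item(i);
            out.put(spot.getAttribute("id"), Double.parseDouble(spot.getAttribute("density")));
        }
        return out;
    };

    /** Alternative 2: divide each density by the total density of the sample. */
    static final SpotQuantifier NORMALIZED = doc -> {
        Map<String, Double> raw = RAW.quantify(doc);
        double total = 0.0;
        for (double density : raw.values()) {
            total += density;
        }
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : raw.entrySet()) {
            out.put(e.getKey(), e.getValue() / total);
        }
        return out;
    };

    /** Modules the user can choose from at run time, keyed by name. */
    static final Map<String, SpotQuantifier> REGISTRY = new LinkedHashMap<>();
    static {
        REGISTRY.put("raw", RAW);
        REGISTRY.put("normalized", NORMALIZED);
    }

    /** Parses an XML string into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse spot list XML", e);
        }
    }

    /** Runs the module selected by name on the given spot list XML. */
    static Map<String, Double> run(String moduleName, String xml) {
        SpotQuantifier module = REGISTRY.get(moduleName);
        if (module == null) {
            throw new IllegalArgumentException("Unknown module: " + moduleName);
        }
        return module.quantify(parse(xml));
    }

    public static void main(String[] args) {
        String xml = "<spotList sample=\"gel-1\">"
                   + "<spot id=\"s1\" density=\"30\"/>"
                   + "<spot id=\"s2\" density=\"10\"/>"
                   + "</spotList>";
        String choice = args.length > 0 ? args[0] : "normalized";
        System.out.println(run(choice, xml));
    }
}
```

Because both quantifiers accept the same document and return the same kind of result, the rest of the pipeline does not need to know which one the user selected. A contributed module would only have to honor the same input and output contract.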

As design specifications, components, and documentation become available, they are posted on this Open2Dprot web site.

4. SourceForge project management tools

We use the SourceForge CVS system for maintaining code worked on by the various developers. The beta releases are made available on the SourceForge file mirror. These distributions include binary distributions, demo data as well as source code snapshots.

As the project proceeds, additional volunteers will be needed to help staff it. Additional facilities will be set up at SourceForge including: mailing lists, forums, bug reports, suggestions, etc. SourceForge provides sophisticated project management resources that can greatly facilitate developing open source projects. Some of these include role assignment for those who join the active development on a project. Examples of roles supported by the SourceForge.Net server tools include: project manager, developer, documentation writer, tester, support manager, graphic/other designer, documentation translator, editorial/content writer, packager, analysis/design, advisor/consultant, Web designer, etc. We hope to add individuals with expertise in some of these roles. Long term, having multiple project managers who are expert in the various proteomic and bioinformatic areas will be essential.

5. Bioinformatics community participation in the open source Open2Dprot project

If you are involved in bioinformatics and are interested in helping with this effort and/or would consider participating in any of the roles defined above (3. Help and skills required for this project), please e-mail us with what you see as your contribution.

Due to the size and scope of this project, it will require a minimum level of core expert-developer participation from the bioinformatics research community.

Once the initial pipeline and pipeline control program are working, qualified developers will be invited to join the SourceForge Open2Dprot core-development team.