Open2Dprot Project for n-Dimensional Protein Expression Data Analysis

Welcome To Open2Dprot

The Open2Dprot project is a community effort to create an open-source n-dimensional (n-D) protein expression data pipeline-analysis system. It is downloadable and can be used for exploratory data analysis of protein expression data across sets of n-D data from research experiments. In the initial phase, as early beta-level versions of the pipeline modules are created, they are made available for download and may be used for performing parts of the analysis. Interchangeable subproject modules are being developed for 2-dimensional data including 2D-PAGE (polyacrylamide gel electrophoresis). Initial support for 2D LC-MS, protein arrays and other data will also be provided. In the second phase, it will be expanded to handle data from other n-dimensional protein separation methods as well as more extensive analyses of these data.

1. Introduction

There is a need for an open-source software tool for exploratory data analysis of proteomics expression data. Such a tool should handle data from 2D-PAGE, 2D LC-MS, 2D IPG-MS, n-dimensional LC-MS*MS*MS..., protein arrays, and other protein expression separation methods. All of these methods share a common paradigm: proteins separated by orthogonal features. Some of these methods are only semi-quantitative, but some of that data could still be analyzed using exploratory data analysis methods. Data in these systems can be represented as protein expression profiles, which lend themselves to exploratory data analysis. Open2Dprot could then be used as part of a broader set of integrated tools.

This Web site describes the Open2Dprot project. Use the table of contents on the left to help navigate the Web site.

2D-PAGE was not widely used until recently due to limitations in identifying differentially expressed spots, and the difficulty in resolving and detecting specialized classes of proteins (e.g., basic proteins, membrane proteins, low abundance proteins). 2D-PAGE is often used as a prescreening stage for mass spectrometry to identify excised spots found in differential analysis. Today 2D gels have improved resolution with zoom 2D gels and new pre-fractionation methods. There are other protein separation techniques, such as 2D LC-MS and protein arrays, that could use exploratory analysis paradigms developed for expression data of 2D gels and more recently for DNA microarrays.

An open-source n-dimensional proteomics data analysis effort

An open-source project can be beneficial to the proteomics community at large, much as the genomics community has benefited from open-source tools. Academic-style collaboration within the research community is far more likely to lead to wide distribution and to algorithm improvement than a closed-source business model. Researchers can more rapidly adapt new methods to existing software without waiting for release of commercial products. It is possible to use the contributed expertise and code of proteomics experts and bioinformaticians to help build and test open software. Algorithms are more transparent, so researchers can verify results more easily, and there is more opportunity to share data in standard non-proprietary formats. No expensive software licenses are required, which reduces deployment costs within large organizations and small labs. In addition, using the proper open-source licenses (see Open Source Licensing book by L. Rosen, 2005) can encourage adoption and collaboration between commercial and academic interests (e.g., Linux, Firefox, Eclipse, Apache, etc). Many free open-source repositories are available. Such repositories offer tools to support collaboration, software development, documentation, forums, and distribution.

Project goals for Open2Dprot

The Open2Dprot project goals include an international community effort to create an open-source n-D quantitative data analysis system. It will be a stand-alone downloadable system that can connect to relational databases both locally and over the Internet. It could be used for data mining protein expression across sets of samples from researchers' experiments to investigate and find significant protein expression from multiple experiments. It should provide an integrated set of software tools, analysis methods, and data structures for quantitative and systems biology protein expression. Finally, it should handle protein expression data from 2D gels, 2D LC-MS, protein arrays, and other protein separation methods.

Using open-source resources for Open2Dprot

Open2Dprot is hosted and developed on the SourceForge repository at open2dprot.sourceforge.net. This Web site discusses the current Open2Dprot software development plan. We are using an open-source development methodology similar to the one we used in our Java/R-based MAExplorer DNA microarray data-mining software. Open2Dprot could later reside as part of other proteomic resource Web sites that integrate it with other tools relating to mass spectrometry, dye multiplexing, protein arrays, Internet proteomic databases, etc.

2. The Open2Dprot project - phased implementation

In the initial phase, Open2Dprot is partly based on exploratory data analysis, image processing and methods from a variety of bioinformatics software systems. These include: a subset of methods from the last (1993) Unix version of the NCI "GELLAB-II" system, open-source bioinformatics codes such as MAExplorer, Flicker and R-language based systems. In addition, other open-source 2D-gel, 2D LC-MS analysis codes, and microarray data-mining codes may be incorporated. This is described in more detail in the Open2Dprot development plan. The overall analysis consists of two procedures - a pre-processing procedure to create the database and then an open-ended data mining procedure of the composite samples database. As subprojects in the analysis pipeline reach the beta-testing stage, they are made available for download and are announced in the module list of subprojects. We are also integrating our schema with developing community proteomics data schemas.

In the second phase, Open2Dprot will be extended with other donated 2D gel analysis and n-D LC-MS*MS*MS*... analysis, mass spectrometry, protein microarray, and related proteomics and bioinformatics software codes as well as core-developer efforts by the bioinformatics research community. We envision extending the Open2Dprot project to a more general proteomics expression analysis environment incorporating other types of n-D proteomics expression data. These will be integrated by a common proteomics database schema.

We welcome suggestions for extending or modifying this agenda for Open2Dprot for both the initial and second phase of development.

The Open2Dprot software is available as freely downloadable data mining software tools for n-D data analysis from the SourceForge files mirror. These will implement a minimal system including the ability to:

Accession (i.e., enter) samples (scanned images or spot data from other pre-processing software) and experiment information into a user's experiment project database. This will be extended to implement the Proteomics Standards Initiative MIAPE schema.
Quantify polypeptide "spots" by various methods including: image processing of 2D gel images, clustering polypeptide "spots" in a series of n-D LC-MS*MS*MS... data, using spot data from protein arrays analyzed by other software, or other separation methods.
If needed for spot pairing, create a landmark database between a reference sample and the remaining samples (e.g., this is not required for some automatic spot-pairing methods).
Pair "spots" between reference samples and the rest of the samples if required (e.g., this is not required for protein arrays since spots between arrays are paired by definition).
Construct a Composite Samples Database, CSD, from the paired-spot data for subsequent exploratory data analysis.
Explore the CSD data using exploratory data analysis techniques: visualization methods, statistics, clustering, direct-manipulation graphics and reports, using additional annotation data derived from Internet proteomic, genomic and functional bioinformatics databases, etc.

The composite samples database is then used for exploratory data analysis to investigate and find significant protein expression profiles relevant to the experiments, for sample-classification, for functional analysis of subsets of proteins with common expression, etc.

If there are compatible reference CSD databases available over the Internet, data from them could be cached locally to aid in the analysis of the investigator's private database. For example, if there were a reference database with identified proteins, those identifications could be used to annotate spots in the investigator's local experiment samples.

When fully implemented, Open2Dprot will run as an integrated set of stand-alone applications on your computer with connection to Internet proteomics databases as required. The set of pipeline modules is controlled by the main Open2Dprot pipeline control program scheduler (see Figure 3).

Open2Dprot's exploratory data analysis environment will provide tools for the data mining of quantitative expression profiles across multiple 2D samples. We envision providing a few demonstration databases as examples. Other repositories could house additional public databases. The focus of Open2Dprot is to provide an open source set of software tools for n-D data database analysis - not to provide a database repository.

This Web site discusses the Open2Dprot software development plan. An overview that summarizes Open2Dprot is available as slides (full slides PDF, 2 slides/page PDF, 6 slides/page PDF). See the subproject module lists for more details on individual processing modules.

There is also a brief historical description of the original GELLAB-II system. There were a number of papers describing GELLAB-II and projects where it was used.

Figure 1. Composite Samples Database model, CSD, used in Open2Dprot. a) Illustrates a composite sample database. Corresponding paired spots (circles) are denoted by diagonal lines drawn through them. Such sets of corresponding spots are called Rspot sets. One of the samples (from the set of samples {G_R, G₂, G₃, ..., G_n}) is selected to be a reference sample or Rsample, denoted G_R. The circle means the spot is present and the X means that it is missing in that sample. Spot A occurs in all n samples. Spot B occurs in the Rsample and in one other sample. Spot C is only in the Rsample. Spot D is not present in the Rsample but is in most of the other samples. Spots A, B, and C are in the un-extended Rspot database (since they occur in the Rsample) while spot D is in the eRspot part of the database (since it does not occur in the Rsample). Part of constructing the CSD is to extrapolate where missing spots would be in the samples from which they are missing, based on their relationship to neighboring spots common to all the samples. Extrapolated spots are assigned expression and area values of 0 and can be used for "missing spots" types of tests. Although the CSD in the initial Open2Dprot project is constructed using a common reference sample, it is possible to construct the CSD using a transitive model mapping across pairs of samples, although with higher error rates. b) Illustrates the basis of using mean (or median) spot positions for estimating canonical spots for a subset of k samples from the n samples database. The canonical spot can then be used to estimate the position of spots missing from some of the other samples. The means and variances of Rspot positions across a set of samples are mapped to the coordinate system of the Rsample. When that has been done, the set of samples of the same experimental class can be replaced by a single averaged sample called the Csample' (the estimate of the canonical sample for a set of replicate samples).
The mean displacement vector of a canonical spot from its associated landmark spot (in any sample under discussion) is used to extrapolate the position in samples where the expected canonical spot is missing. If no landmark spots were used in constructing the CSD, nearby Rspots that have spots present from all samples may be used to supply the vector offsets. Similarly, mean or median quantified spot data can be used to represent a set of replicate samples.
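
The extrapolation described in Figure 1 can be illustrated with a minimal Python sketch. It assumes that each spot and its associated landmark are given as (x, y) positions keyed by sample name; the function names, the sample names and the coordinates are hypothetical and are not part of the Open2Dprot code.

```python
from statistics import mean

def mean_displacement(spots, landmarks):
    """Mean (dx, dy) offset of a spot from its landmark, taken over the
    samples where the spot is present. Both arguments map a sample name
    to an (x, y) position."""
    present = [s for s in spots if s in landmarks]
    if not present:
        raise ValueError("spot is not present in any landmarked sample")
    dx = mean(spots[s][0] - landmarks[s][0] for s in present)
    dy = mean(spots[s][1] - landmarks[s][1] for s in present)
    return dx, dy

def extrapolate_missing(spots, landmarks):
    """Estimate the position of the spot in every sample where it is
    missing by adding the mean displacement to that sample's landmark."""
    dx, dy = mean_displacement(spots, landmarks)
    return {
        sample: (lx + dx, ly + dy)
        for sample, (lx, ly) in landmarks.items()
        if sample not in spots
    }

# Hypothetical data: the landmark is present in three samples,
# the spot of interest is missing from G3.
landmarks = {"G1": (100.0, 200.0), "G2": (110.0, 190.0), "G3": (95.0, 205.0)}
spot_d = {"G1": (112.0, 208.0), "G2": (124.0, 196.0)}

estimated = extrapolate_missing(spot_d, landmarks)
print(estimated)
```

In a CSD, a spot extrapolated this way would carry expression and area values of 0, as described in the caption; the sketch only estimates the position.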

Figure 2. Protein expression analysis consists of two parts. First, a pre-processing pipeline takes the sample experiment information and raw data (images, 2D MS spectra, protein array spot data), extracts the spots, and merges the spot lists into a Composite Samples Database (CSD). This is described in more detail in Figure 2.1. Second, after the CSD has been created, it can be used for data mining and exploratory data analysis.

Figure 2.1 Basic pre-processing pipeline showing the data reduction steps in converting n-D sample image (whether real images or "virtual" images) data to a composite sample database suitable for exploratory data analysis. In general, each step of the pipeline depends on the previous step being completed. Some steps can be omitted (e.g., if protein array pre-quantified spot data were used then Steps 2-4 can be omitted since no spot segmentation, no landmarking and no spot pairing is required; if a spot-pairing method were used that did not depend on predefined landmarks, then Step 3 can be omitted, etc). A key design element of Open2Dprot is that each pipeline step is assigned one of several alternate modules that adhere to the same XML input/output schema. That means that alternate methods can be substituted in the pipeline. For example, for 2D gels an image segmenter would be used for Step 2; for 2D LC-MS peak cluster data a clustering method might be used; for a protein array, spot-list data from one of the many microarray image spot segmentation programs might be used, etc. The pipeline is set up and controlled by the Pipeline Control Program, called Open2Dprot, which is under development.
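
The idea of substitutable modules for one pipeline step can be sketched in Python as follows. This is an illustration only: plain dictionaries stand in for the XML input/output schema, the thresholding "segmenter" is a toy placeholder rather than a real image segmentation method, and all names are hypothetical.

```python
# Keys every Step 2 module must provide for each spot; this stands in
# for the shared XML schema and is an assumption of this sketch.
REQUIRED_KEYS = {"x", "y", "density"}

def segment_gel_image(sample):
    """Toy stand-in for a 2D gel segmenter: every pixel at or above the
    threshold becomes one spot."""
    threshold = sample.get("threshold", 1)
    spots = []
    for y, row in enumerate(sample["pixels"]):
        for x, value in enumerate(row):
            if value >= threshold:
                spots.append({"x": float(x), "y": float(y), "density": float(value)})
    return spots

def import_array_spots(sample):
    """Protein array data is already quantified, so it is only reshaped."""
    return [
        {"x": float(col), "y": float(row), "density": float(value)}
        for (row, col, value) in sample["rows"]
    ]

# One registry per pipeline step: data type -> module for that step.
STEP2_MODULES = {
    "2d-gel": segment_gel_image,
    "protein-array": import_array_spots,
}

def run_step2(sample):
    """Pick the module for the sample type and check its output format."""
    module = STEP2_MODULES[sample["type"]]
    spots = module(sample)
    for spot in spots:
        missing = REQUIRED_KEYS - spot.keys()
        if missing:
            raise ValueError(f"module output lacks keys: {sorted(missing)}")
    return spots

gel = {"type": "2d-gel", "pixels": [[0, 5], [7, 0]], "threshold": 5}
array = {"type": "protein-array", "rows": [(0, 0, 3.5)]}
gel_spots = run_step2(gel)
array_spots = run_step2(array)
print(gel_spots)
print(array_spots)
```

Because both modules return spot lists with the same keys, later steps do not need to know which module produced them.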

Figure 3. Open2Dprot Pipeline Control Program. The pipeline control program (called Open2Dprot, and currently under development) will schedule and run the modules in the pipeline after doing a data-dependency analysis on 1) what data exists, and 2) what data needs to be created by running parts of the pipeline to proceed to the next stage in the pipeline. That is, it works backwards from the future CSD to determine what data is required and then repeats that analysis further back in the pipeline until it reaches the top of the pipeline (Figure 2) or reaches data that already exists in the pipeline. It then executes the pipeline processes to construct the CSD. Once the CSD is created, the pipeline is no longer needed except to add additional samples. Instead, the CSDminer program would then be used directly to access the already existing CSD database.
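
The backward data-dependency analysis described in Figure 3 can be sketched as a small recursive planner. The module names are taken from the module list on this site; the data product names and the dependencies between them are assumptions made for the illustration and do not describe the actual scheduler.

```python
# module name -> (data product it creates, data products it needs)
# The product names and dependencies are hypothetical.
PIPELINE = {
    "Accession": ("samples", []),
    "Seg2Dgel": ("spot_lists", ["samples"]),
    "Landmark": ("landmarks", ["spot_lists"]),
    "CmpSpots": ("paired_spots", ["spot_lists", "landmarks"]),
    "BuildCSD": ("csd", ["paired_spots"]),
}

def plan(target, existing, pipeline=PIPELINE):
    """Work backwards from the target product and return the modules that
    must run, in execution order, skipping data that already exists."""
    producers = {product: (step, needs) for step, (product, needs) in pipeline.items()}
    order = []

    def visit(product):
        if product in existing:
            return  # data is already there, no need to go further back
        step, needs = producers[product]
        if step in order:
            return  # already scheduled
        for need in needs:
            visit(need)
        order.append(step)

    visit(target)
    return order

full_run = plan("csd", existing=set())
partial_run = plan("csd", existing={"samples", "spot_lists"})
print(full_run)
print(partial_run)
```

With spot lists already available, only the landmarking, pairing and CSD construction modules are scheduled; if the CSD itself exists, the plan is empty.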

3. Open2Dprot capabilities

The analysis is broken up into two parts: the initial pipeline construction of the CSD and the subsequent analysis of the CSD.

Pipeline construction of the Composite Samples Database (CSD)

Accession n-D sample scanned images and experiment annotation data
Quantify spots from sample images, cluster peaks in 2D LC-MS data, or import protein array spot data
Pair spots between samples and a reference sample or a set of reference samples
Construct a CSD suitable for exploratory data analysis by merging spot lists relative to the reference sample(s) taking missing spots into account and extrapolating missing data
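
The merging step in the list above can be sketched as follows, assuming the pairing step has already labeled each spot in a non-reference sample with the identifier of its Rspot. The function name, the data layout and the density values are hypothetical; the example data mirrors spots A to D of Figure 1.

```python
def build_csd(rsample, ref_spots, paired):
    """Merge paired spot lists into one table: spot id -> {sample: density}.

    ref_spots maps spot id -> density in the reference sample.
    paired maps sample name -> {spot id: density} for the other samples.
    Spots missing from a sample get density 0.0. Spots that never occur
    in the reference sample are reported as extended (eRspot) entries.
    """
    densities = {rsample: ref_spots, **paired}
    samples = [rsample] + sorted(paired)
    seen_elsewhere = {sid for spots in paired.values() for sid in spots}
    erspots = sorted(seen_elsewhere - set(ref_spots))
    csd = {}
    for spot_id in list(ref_spots) + erspots:
        csd[spot_id] = {s: densities[s].get(spot_id, 0.0) for s in samples}
    return csd, set(erspots)

# Hypothetical densities: A is in all samples, B in the Rsample and one
# other sample, C only in the Rsample, D only outside the Rsample.
ref = {"A": 10.0, "B": 4.0, "C": 2.0}
others = {
    "G2": {"A": 11.0, "B": 5.0, "D": 3.0},
    "G3": {"A": 9.0, "D": 2.5},
}
csd, erspots = build_csd("GR", ref, others)
for spot_id, row in csd.items():
    print(spot_id, row)
```

The sketch only fills in zero densities for missing spots; estimating where a missing spot would lie is a separate step.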

After the CSD has been constructed, one would then use a data mining tool to analyze the data in the CSD. We are developing a data mining tool called CSDminer.

All pipeline modules and the CSDminer module use standard data interchange formats that make substitution of alternate pipeline modules possible. Open2Dprot will

Use XML data interchange formats (MIAPE, formerly PEDRo, supported by the Proteomics Standards Initiative (PSI) and HUPO) to facilitate using alternate components for parts of the analysis
Use a standard open-source SQL database (RDBMS) to store the data. Groups that want to establish a standard repository could share their data through proteomics community databases. An individual researcher could have their own RDBMS locally and merge data from the proteomics community database when they do their individual data mining.
Use local caching of data from the RDBMS for the efficient analysis of subsets of the CSD and for including data from other CSD databases.
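
As an illustration of the XML interchange idea in the list above, the following Python sketch writes a small CSD table to an XML string and reads it back. The element and attribute names (csd, rspot, spot, sample, density) are invented for this example; they are not the MIAPE schema and not the Open2Dprot schema.

```python
import xml.etree.ElementTree as ET

def csd_to_xml(csd):
    """Serialize {spot id: {sample: density}} to an XML string."""
    root = ET.Element("csd")
    for spot_id, row in csd.items():
        rspot = ET.SubElement(root, "rspot", id=spot_id)
        for sample, density in row.items():
            # repr() keeps enough digits for the float to round-trip
            ET.SubElement(rspot, "spot", sample=sample, density=repr(density))
    return ET.tostring(root, encoding="unicode")

def xml_to_csd(text):
    """Parse the XML produced by csd_to_xml back into nested dicts."""
    root = ET.fromstring(text)
    return {
        rspot.get("id"): {
            spot.get("sample"): float(spot.get("density"))
            for spot in rspot.findall("spot")
        }
        for rspot in root.findall("rspot")
    }

# Hypothetical two-spot, two-sample CSD
example_csd = {
    "A": {"GR": 10.0, "G2": 11.5},
    "D": {"GR": 0.0, "G2": 3.0},
}
snapshot = csd_to_xml(example_csd)
restored = xml_to_csd(snapshot)
print(snapshot)
```

Any module that reads and writes the same agreed format could replace another module without changes elsewhere, which is the point of using a shared interchange format.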

Types of exploratory data analysis on the CSD

The following is a short list of some of the types of operations planned to be made available using the CSDminer program. Because of the modularity of the constructed CSD database, alternative data-mining software that accesses the CSD using the published XML interface could be used instead. A CSD for a particular set of data could be exported as an XML "snapshot" or as an RDBMS instance, or it could be accessed through a live RDBMS where access was granted.

Handle multiple experimental condition sets of n-dimensional samples with replicate samples/condition set (e.g., time series, drug-dose, etc. with replicate samples at each point)

Named condition sets of samples may be managed

Named sets of proteins may be defined explicitly or by computation

Add annotation to proteins from Internet servers from proteomic, genomic, PubMed, GO, functional and pathway databases. This external data can then be used as part of the subsequent clustering and other analyses

Analyze protein expression profiles across multiple conditions

Normalize data by a variety of within-sample and between-sample methods

Select, find, and manage subsets of the CSD using named sets of samples, condition sets of samples, and sets of proteins for further processing

Cluster proteins or cluster samples based on various similarity criteria

Classify samples by a subset of proteins, cross-validation, false-discovery corrections

Data-filter protein sets by statistics, clustering, and protein subset membership

Directly manipulate data in graphics (scatter plots, expression profile plots, histograms, clustergrams or heat-maps, PCA, MDS, etc.), spreadsheets, sample and spot management

Integrate Java visual, statistical, clustering and other methods

Integrate R-language statistical, clustering and other methods via an interface to the R language and elements of Bioconductor (an R-based microarray data mining project)
Integrate access to proteomic / genomic / functional-genomics / pathways Internet server data for user-specified protein sets for use in extended analyses
Analyze data across compatible Internet reference databases by caching and analyzing the investigator's data with respect to the reference data
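
Two of the operations listed above, normalization and data filtering, can be illustrated with a minimal Python sketch. Normalizing by total sample density and filtering on a fold-change threshold are simple example methods chosen for this sketch, not the methods CSDminer will necessarily use; the function names, sample names and values are hypothetical.

```python
from math import log2
from statistics import mean

def normalize_by_total(csd):
    """Within-sample normalization: divide each density by the total
    density of its sample, so every sample sums to 1."""
    samples = {s for row in csd.values() for s in row}
    totals = {s: sum(row.get(s, 0.0) for row in csd.values()) for s in samples}
    return {
        spot_id: {s: (v / totals[s] if totals[s] else 0.0) for s, v in row.items()}
        for spot_id, row in csd.items()
    }

def filter_by_fold_change(csd, cond_a, cond_b, min_fold=2.0):
    """Return ids of spots whose mean density differs between the two
    condition sets by at least min_fold in either direction. Spots with
    a zero mean in either condition are skipped in this sketch."""
    hits = []
    for spot_id, row in csd.items():
        a = mean(row[s] for s in cond_a)
        b = mean(row[s] for s in cond_b)
        if a > 0 and b > 0 and abs(log2(b / a)) >= log2(min_fold):
            hits.append(spot_id)
    return hits

# Hypothetical densities: two control (c1, c2) and two treated (t1, t2) samples
raw = {
    "A": {"c1": 10.0, "c2": 10.0, "t1": 60.0, "t2": 60.0},
    "B": {"c1": 30.0, "c2": 30.0, "t1": 20.0, "t2": 20.0},
    "C": {"c1": 40.0, "c2": 40.0, "t1": 80.0, "t2": 80.0},
}
normalized = normalize_by_total(raw)
changed = filter_by_fold_change(normalized, ["c1", "c2"], ["t1", "t2"])
print(changed)
```

In the raw data, spot C doubles between conditions, but after normalization its share of each sample is unchanged, so it is not reported.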

4. Partial list of pipeline modules Web pages

This is a short list of the pipeline modules. See the sub-projects Module list for more details.

Pipeline Module	Pipeline Step #	Function
Open2Dprot	Scheduler	Run the pipeline
Accession	[1]	Enter sample information
Seg2Dgel	[2]	Segment and quantify spots in a sample
Landmark	[3]	Create/Edit landmarks between samples
AutoLandmark	[3]	Automatic landmark finding between samples
CmpSpots	[4]	Pair spot-lists between samples
BuildCSD	[5]	Build CSD database from a set of paired samples
CSDminer	[6]	Exploratory analysis of CSD database

Open2Dprot is hosted at open2dprot.sourceforge.net

Revised: 03/20/2006