Notes on the current Open2Dprot XML data schemas

 

    (*** PRELIMINARY AND NOT COMPLETE  - WORKING DOCUMENT ***)

 

 

Created: 3-27-2005

Revised: 3-28-2005

Revised: 12-17-2005

                                       

Peter F. Lemkin

CCRNP, CCR, NCI

Frederick, MD 21702 USA

 

E-mail: lemkin@users.sourceforge.net

http://open2dprot.sourceforge.net/

 

 

 

Table of contents

 

 

1. Introduction to use of XML in Open2Dprot

2. Open2Dprot computing environment

3. Open2Dprot pipeline processing modules and scheduler

  3.1 Primary data objects used in Open2Dprot

4. The use of standard XML interchange data files

5. The Open2Dprot project list of XML schemas
  5.1 Sample accession data XML schema

    5.1.1 Fields in Accession XSD (Open2Dprot-Accession.xsd)

    5.1.2 Required computable data fields for each <SampleEntry>

    5.1.3 Experiment information data fields for each <SampleEntry>

  5.2 Landmark spot data XML schema between two samples

    5.2.1 Fields in Landmark XSD (Open2Dprot-Landmark.xsd)

    5.2.2 Definition of a <LandmarkSet> entry

  5.3 Sample spot-list data (SSF) XML schema

    5.3.1 Fields in sample spot-list XSD (Open2Dprot-SSF.xsd)

    5.3.2 SSF Preface <Sample_parameters> schema

    5.3.3 SSF spot data <Spot> schema

    5.3.4 SSF Epilogue statistics <Global_segmenter_statistics> schema

  5.4 Paired sample spot-list data (SPF) XML schema

    5.4.1 Fields in sample paired-spot-list XSD (Open2Dprot-SPF.xsd)

      5.4.2.1 Contents of the <Rsample> and <Sample> objects in CSD schema

    5.4.2 SPF Preface <Pairing_parameters> schema

    5.4.3 SPF paired-spot <Pspot> schema

    5.4.4 SPF Epilogue <Global_Spot_pairing_statistics> schema

  5.5 Composite Sample Database (CSD) XML schema (Open2Dprot-CSD.xsd)

    5.5.1 The canonical sample, Csample’, for replicate samples in the CSD

    5.5.2 The expression data for a single spot in the CSD or Rspot

    5.5.3 A post-translational modification Rspot or <PTMRspot>

    5.5.4 The <ConditionsList> as a list of <Conditions> groups of samples

 

 

 

 


1. Introduction to use of XML in Open2Dprot

 

The following notes describe the way we are currently using XML data for interchange in Open2Dprot (http://open2dprot.sourceforge.net/). This document also discusses how we would like to take advantage of the Proteomics Standards Initiative (PSI at http://psidev.sourceforge.net/) MIAPE, GelML and related parts of the General Proteomics Standards (GPS) when they stabilize.

 

We are migrating our XML coding from the Apache Xerces (http://xerces.apache.org/) SAX XML reader and hand-coded XML writers to XMLbeans (http://xmlbeans.apache.org/). This document describes the current implementation.

 

The Open2Dprot project is a community effort to create an open-source n-dimensional (n-D) protein expression data analysis system. It will be downloadable and could be used for data mining protein expression across sets of n-D data from research experiments. Modules will be created for 2-dimensional data including 2D-PAGE (polyacrylamide gel electrophoresis) and initial support for 2D LC-MS, protein arrays and other data separation methods.

 

Our goal, in using an XML data interchange format, is to be as compliant as possible with MIAPE. For those cases where we have parameters and summary statistics that are not currently MIAPE compliant and there is an escape mechanism in MIAPE to encode this information, we will refactor our current XML to use that mechanism. Otherwise, we will use additional schemas to fill the gaps.

 

At the Montreal 2003 HUPO meeting, Chris Taylor (EBI) suggested we go ahead and implement our own schema while MIAPE was being developed. We took what we needed from what already existed in the PEDRo schema and then added the fields we needed that were missing from PEDRo.

 

In addition, when a more complete MIAPE model is available that handles these additional complex types, we will refactor our remaining code, new data-mining code, and schemas toward that model.

 

Note that the current Open2Dprot XML schemas we are presenting here are placeholders to be redone when the new GelML/MIAPE standard is available and meets our needs. At that time we plan to go back and refactor the code to take the new standard into account.

 

The files discussed throughout this document are available on the http://open2dprot.sourceforge.net/ server. We will specify the direct links to specific files throughout the document to make it easier to review the files. Documentation (manuals, PDFs, javadocs), source code (CVS and Files mirror releases), demo data, and downloadable installers are available on the Web site.

 

The primary concept of pipeline processing for the Open2Dprot system is to construct a Composite Samples Database (CSD) of protein expression values of corresponding proteins across multiple samples. Once this CSD database is constructed, it can then be used for subsequent analysis.

 

Section 2 describes the Open2Dprot computing environment; Figures 1 through 3, which follow, illustrate the CSD model, the processing pipeline, and the pipeline control program. Section 3 describes the pipeline-processing paradigm. Section 4 discusses our goal of using common XML standards to facilitate data interchange. Section 5 describes our current XML schemas.

 

Note that this is a work in progress. We are still defining the XML schema for the CSD and so do not yet have the full schema we will use. However, Section 5.5 outlines some of the key export data of an assembled CSD.


 

 

 

Figure 1. Composite Samples Database model, CSD, used in Open2Dprot.

a) Illustrates a composite sample database. Corresponding paired spots (circles) are denoted by diagonal lines drawn through them. Such sets of corresponding spots are called Rspot sets. One of the samples from the set {GR, G2, G3, ..., Gn} is selected to be the reference sample or Rsample, denoted GR. A circle means the spot is present and an X means that it is missing in that sample. Spot A occurs in all n samples. Spot B occurs in the Rsample and in one other sample. Spot C is only in the Rsample. Spot D is not present in the Rsample but is in most of the other samples. Spots A, B, and C are in the un-extended Rspot database (since they occur in the Rsample) while spot D is in the eRspot part of the database (since it does not occur in the Rsample). Part of constructing the CSD is to extrapolate where missing spots would be in the samples from which they are missing, based on their relationship to neighboring spots common to all the samples. Extrapolated spots are assigned expression and area values of 0 and can be used for "missing spots" types of tests. Although the CSD in the initial Open2Dprot project is constructed using a common reference sample, it is possible to construct the CSD using a transitive model mapping across pairs of samples, although with higher error rates.

b) Illustrates the basis of using mean (or median) spot positions for estimating canonical spots for a subset of k samples from the n-sample database. The canonical spot can then be used to estimate the position of spots missing from some of the other samples. The means and variances of Rspot positions across a set of samples are mapped to the coordinate system of the Rsample. When that has been done, the set of samples of the same experimental class can be replaced by a single averaged sample called the Csample' (the estimate of the canonical sample for a set of replicate samples).

The mean displacement vector of a canonical spot from its associated landmark spot (in any sample under discussion) is used to extrapolate the position in samples where the expected canonical spot is missing. If no landmark spots were used in constructing the CSD, nearby Rspots that have spots from all samples may be used to supply the vector offsets. Similarly, mean quantified spot data can be used to represent a set of replicate samples.

_________________________________________________________________________________
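
The extrapolation described in part b) of the Figure 1 legend can be sketched in a few lines of Java. This is a minimal illustration, not the Open2Dprot implementation: the class name CanonicalSpot, its method names, and the use of plain {x, y} arrays are all assumptions made for the example, and positions are assumed to be already mapped to the Rsample coordinate system.

```java
/** Hypothetical helper illustrating the Figure 1b idea; not part of O2Plib. */
final class CanonicalSpot {
    private CanonicalSpot() {}

    /**
     * Mean (dx, dy) offset of a spot from its associated landmark, taken over
     * the samples in which the spot is present. spots[i] and landmarks[i] are
     * {x, y} positions in sample i (assumed to be in Rsample coordinates).
     */
    static double[] meanDisplacement(double[][] spots, double[][] landmarks) {
        if (spots.length == 0 || spots.length != landmarks.length) {
            throw new IllegalArgumentException("Need one landmark per spot and at least one spot");
        }
        double dx = 0.0;
        double dy = 0.0;
        for (int i = 0; i < spots.length; i++) {
            dx += spots[i][0] - landmarks[i][0];
            dy += spots[i][1] - landmarks[i][1];
        }
        return new double[] { dx / spots.length, dy / spots.length };
    }

    /**
     * Estimated position of a missing spot: the landmark position in that
     * sample plus the mean displacement vector.
     */
    static double[] extrapolate(double[] landmarkInSample, double[] meanDisplacement) {
        return new double[] {
            landmarkInSample[0] + meanDisplacement[0],
            landmarkInSample[1] + meanDisplacement[1]
        };
    }
}
```

A spot extrapolated this way would then be stored with expression and area values of 0, as described in part a) of the legend.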

 

 

 

 

Figure 2 below illustrates the context of the schemas in Open2Dprot processing. Figure 2.1 shows more of the details of the XML schemas used between pipeline stages.

 

 

 

Figure 2. The Open2Dprot pipeline processing data reduction hierarchy.

This shows the data reduction steps in converting n-D sample image data (whether real images or "virtual" images) to a composite sample database suitable for exploratory data analysis. In general, each step of the pipeline depends on the previous step being completed. Some steps can be omitted. For example, if pre-quantified protein array spot data were used, Steps 2-4 can be omitted since no spot segmentation, landmarking, or spot pairing is required; if a spot-pairing method were used that did not depend on predefined landmarks, Step 3 can be omitted. A key design element of Open2Dprot is that each pipeline step is assigned one of several alternative modules that adhere to the same XML input/output schema. That means that alternate methods can be substituted in the pipeline. For example, for step 2 an image segmenter would be used for 2D gels; a clustering method might be used for 2D LC-MS peak cluster data; and for a protein array, spot-list data from one of the many microarray image spot segmentation programs might be used. The pipeline is set up and controlled by the Pipeline Control Program.


______________________________________________________________________________

 

          |   

          | (2D images, 2D LC-MS spot cluster data, protein array data)

          v

 1. Accession sample images or n-D data and experiment information,

          |

          | XML (sample Accession data Open2Dprot-Accession.xsd)

          v      

 2. Segment n-D data to quantify or extract "spots" for all

    samples (2D-gels, 2D-LC-MS, etc.),

          |

          | XML (sample spot-list SSF data Open2Dprot-SSF.xsd)

          v      

 3. Create a landmark database between reference sample and remaining

    samples by spot pairing algorithm (if required for spot pairing),

          |

          | XML (Landmark data Open2Dprot-Landmark.xsd)

          v      

 4. Pair spots between a reference sample and the rest of samples,

          |

          | XML (samples paired-spot-list SPF data Open2Dprot-SPF.xsd)

          v      

 5. Construct Composite Samples Database, CSD, by merging paired spot lists,

          |

          |

          v

       RDBMS and caches (CSD data Open2Dprot-CSD.xsd)

          ^

          |

          |

 6. Explore the CSD data using exploratory data analysis and data mining

    techniques: statistics, clustering, classification, direct-manipulation

    graphics, reports, etc. This may invoke Java plugins and R-language scripts.

 

______________________________________________________________________________

 

Figure 2.1 The Open2Dprot pipeline processing data reduction hierarchy. The pipeline processing hierarchy is illustrated by figure 2 of the Open2Dprot home page. See the figure legend and discussion of which stages could be run in background batch.

 

 

 

 

2. Open2Dprot computing environment

 

Open2Dprot is meant to be run locally on an investigator's or a collaborative group's computer. However, the CSD database will reside on an RDBMS that could be split between a working experiment database and a reference database (e.g., plasma proteins, etc.) with spot identifications and other proteomic information.

 

Currently, Open2Dprot saves all data (XML data interchange files as well as derived images) from a project in a set of sub-directories in a project directory. Multiple experiments, each consisting of multiple samples, could reside in the same project directory. Multiple CSDs could be constructed in the same project directory created from samples from different experiments or subsets of samples.

 

 

 <project-directory>/batch/ - batch files

 <project-directory>/cache/ - cache files

 <project-directory>/ppx/   - original input image files

 <project-directory>/rdbms/ - RDBMS CSD database files

 <project-directory>/tmp/   - generated temporary and derived image files

 <project-directory>/xml/   - accession DB, landmark DB, SSF files,

                              SPF files, etc.
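
As a small illustration of this layout, the sub-directories can be resolved from a single project directory. The sketch below is hypothetical: the class ProjectLayout and its method names are not part of O2Plib; only the directory names and the accession.xml file name come from this document.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical view of the project directory layout listed above. */
final class ProjectLayout {
    private final Path root;

    ProjectLayout(String projectDirectory) {
        this.root = Paths.get(projectDirectory);
    }

    Path batch() { return root.resolve("batch"); }   // batch files
    Path cache() { return root.resolve("cache"); }   // cache files used for data mining
    Path ppx()   { return root.resolve("ppx"); }     // original input image files
    Path rdbms() { return root.resolve("rdbms"); }   // RDBMS CSD database files
    Path tmp()   { return root.resolve("tmp"); }     // temporary and derived image files
    Path xml()   { return root.resolve("xml"); }     // accession DB, landmark DB, SSF, SPF files

    /** The accession database file, <project-directory>/xml/accession.xml. */
    Path accessionFile() { return xml().resolve("accession.xml"); }
}
```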

 

These data could reside in the RDBMS itself or in sets of distributed RDBMS. The cache directory data files are used for data mining rather than paging data from the RDBMS, which would be too slow for many exploratory data analysis operations.

 

 

 

3. Open2Dprot pipeline processing modules and scheduler

 

A top-level scheduler program called Open2Dprot (not yet released), illustrated in Figure 3, analyzes the data dependencies: what data exists and what data is required because the next processing step depends on it. It then runs the appropriate pipeline modules to create that data.

 

Each pipeline step may have a specific module dynamically assigned. This means that different pipeline processing method sets can be created for different types of proteomic expression data. For example, if no spot pairing is required (e.g., the input is a protein array and not a 2D gel image or 2D LC-MS data file), then the pipeline analyzer will skip the spot pairing steps. Similar dependencies can be assigned to all processing steps.

 

______________________________________________________________________________

 

Figure 3. The Open2Dprot pipeline control program.

The pipeline control program (called Open2Dprot and still under development) will schedule and run the modules in the pipeline after doing a data-dependency analysis on 1) what data exists, and 2) what data needs to be created by running parts of the pipeline to proceed to the next stage in the pipeline. That is, it works backwards from the future CSD to determine what data is required and then repeats that analysis further back in the pipeline until it reaches the top of the pipeline (Figure 2) or reaches data that already exists in the pipeline. It then executes the pipeline processes to construct the CSD. Once the CSD is created, the pipeline is no longer needed except to add additional samples. Instead, the CSDminer program would then be used directly to access the already existing CSD database.
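
The backward data-dependency analysis can be sketched as follows. Because the scheduler is not yet released, this is only an illustration under simplifying assumptions: the pipeline is treated as a single linear chain of data products (in practice dependencies are tracked per sample), and the class PipelinePlanner, its plan method, and the product labels are hypothetical names chosen for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the backward dependency walk; not the Open2Dprot scheduler. */
final class PipelinePlanner {
    private PipelinePlanner() {}

    /** Data products in pipeline order (Figure 2); each one is built from the one before it. */
    static final List<String> PRODUCTS =
        Arrays.asList("Accession", "SSF", "Landmark", "SPF", "CSD");

    /**
     * Works backwards from the target product until it reaches data that
     * already exists or the top of the pipeline, then returns the products
     * that must be created, in execution order. Products in 'skipped' are
     * steps that the chosen method set does not need (e.g. Landmark).
     */
    static List<String> plan(String target, Set<String> existing, Set<String> skipped) {
        int i = PRODUCTS.indexOf(target);
        if (i < 0) {
            throw new IllegalArgumentException("Unknown product: " + target);
        }
        Deque<String> toBuild = new ArrayDeque<>();
        for (; i >= 0; i--) {
            String product = PRODUCTS.get(i);
            if (existing.contains(product)) {
                break; // data already exists, so nothing earlier needs to run
            }
            if (!skipped.contains(product)) {
                toBuild.addFirst(product); // earlier steps end up first
            }
        }
        return new ArrayList<>(toBuild);
    }
}
```

With accession and SSF data already present, planning for the CSD yields the Landmark, SPF, and CSD steps, in that order.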

 

 

 

 

All currently released pipeline modules, source code, and the common library O2Plib are available from the Web site list of subprojects

     http://open2dprot.sourceforge.net/doc/subprojects.html

 

The direct links are:

 

  http://open2dprot.sourceforge.net/Accession

  http://open2dprot.sourceforge.net/CmpSpots

  http://open2dprot.sourceforge.net/Seg2Dgel

  http://open2dprot.sourceforge.net/Landmark

  http://open2dprot.sourceforge.net/O2Plib

 

where:

   Accession  - sample experiment and ROI accession program

   Seg2Dgel   - segment 2D gel (or similar) image into a

                Sample Spot-list File (SSF)

   Landmark   - interactively define a set of landmarks between

                two sample images (Landmark)

   CmpSpots   - pair two sample spot lists into a

                Sample Paired-spot-list File (SPF)

   O2Plib     - common Open2Dprot library

                O2Plib.db.* contains data objects and XML I/O

                O2Plib.db.CSD contains CSD data objects and XML I/O

 

 [Modules not yet released: Open2Dprot, BuildCSD, CSDminer]

   Open2Dprot - Open2Dprot pipeline scheduler

   BuildCSD   - construct/add-to the CSD from sets of SPF data files

   CSDminer   - exploratory analysis for CSD

 

Although we refer to an XML Sample Spot-list File (SSF) as a file, it could be an XML object in an RDBMS.

 

 

3.1 Primary data objects used in Open2Dprot

 

The primary classes that define the data objects used throughout Open2Dprot are in the O2Plib.db.* Java library modules. They define the base objects and their XML readers and writers.

 

  DbAccession.java - read, write and access accession sample database

  DbBaseSpot.java - base spot class

  DbBoundary.java - spot boundary manipulation (not released)

  DbLM.java - read, write and access paired sample landmark spots database

  DbPairSamples.java - read, write and access paired sample instance

  DbSample.java - read, write and access sample spot list instance

  DbPspot.java - SPF paired-spot feature object

  DbSpot.java - SSF spot feature object

  LMset.java - landmark set object

 

The O2Plib home page is http://open2dprot.sourceforge.net/O2Plib  

It has javadoc API documentation accessible from the home page showing the object dependencies.

 

 

3.1.1 Primary data objects used in CSD for Open2Dprot

 

The primary classes that define the CSD data objects used throughout Open2Dprot are in the O2Plib.db.CSD.* Java library modules. They define the CSD base objects and their XML readers and writers. See Section 5.5 for a discussion of how these describe the CSD XML schema.

 

  CSD.java – instance of CSD database

  CSDacc.java – Accession data instance

  CSDannotation.java – annotation instance

  CSDcache.java – cache instance

  CSDcal.java – grayscale to measurement units calibration instance

  CSDcond.java – sample conditions instance

  CSDexpr.java – expression profile instance

  CSDexprList.java – list of expression profiles instance

  CSDfilterState.java – data filtering instance

  CSDglb.java – global state instance

  CSDio.java – I/O instance

  CSDlimits.java – filter limits instance

  CSDlms.java – landmark data instance

  CSDnorm.java – sample spots normalization instance

  CSDRmap.java – reference map instance

  CSDRspot.java – Rspot (reference spot) set instance

  CSDRspotList.java – list of Rspots instance

  CSDsizes.java – current size limits instance

  CSDtotals.java – current samples, spots etc totals instance

 

 

 

 

4. The use of standard XML interchange data files

 

Open2Dprot standardizes data interchange through XML files. The I/O is currently to and from data files; in the future, the data could also be kept in a relational database.

 

To minimize problems porting data between processing modules, we have created a common Java library, O2Plib. This library defines all data structures that would be used in more than one Open2Dprot pipeline module. It also centralizes the XML I/O.

 

The current, early version of the Open2Dprot XML library uses the SAX XML reader to read XML data files and uses Java methods to generate the XML output explicitly. Optional document type definitions (DTDs) are used with the SAX readers if the -dtd command-line switch is specified with the module.

 

 

Future XSD XMLbeans I/O

 

We are in the process of refactoring the XML I/O to use XML readers and writers that XMLbeans (http://xmlbeans.apache.org/) generates as Java code from XSD schemas. This will make it much easier to integrate the MIAPE XSD schemas with Open2Dprot. We will be adding XSD namespaces to the Open2Dprot schemas to keep the fields unique, and we will merge with the MIAPE schema fields where possible. The namespace will be the following:

 

   o2p:    Open2Dprot name space that will be used if there is no

           equivalent MIAPE schema type or element

 

If there is a conflict between the Open2Dprot sub-schemas and there is no MIAPE replacement, then we may rename our elements and type names or we might use alternate namespaces as follows. This is to be avoided if possible by the renaming of fields.

 

   o2pa:   Open2Dprot accession space

   o2pl:   Open2Dprot landmark space

   o2ps:   Open2Dprot spot list space

   o2pp:   Open2Dprot paired-spots list space

   o2pc:   Open2Dprot CSD space

 

 

Current SAX XML readers

 

There are document type definitions (DTDs) associated with the XML data readers and writers in the Library. All XML I/O is handled completely by this library. Currently, we are using hardwired Xerces SAX readers, but hope to migrate this to dynamic XML schema I/O this year.

 

Currently, we have 4 published DTDs on the Web site (the CSD will be an XML schema file Open2Dprot-CSD.xsd and is being constructed for the Composite Sample Database).

 

The DTDs are available in two ways: 1) download and install the CmpSpots program (this includes the four DTDs), or 2) retrieve them from:

 

  http://open2dprot.sourceforge.net/O2Plib/Open2Dprot-Accession.dtd

  http://open2dprot.sourceforge.net/O2Plib/Open2Dprot-Landmark.dtd

  http://open2dprot.sourceforge.net/O2Plib/Open2Dprot-SSF.dtd

  http://open2dprot.sourceforge.net/O2Plib/Open2Dprot-SPF.dtd

 

 

 

5. The Open2Dprot project list of XML schemas

 

Once the Composite Sample Database (CSD) is constructed, it can be used for exploratory data analysis and data mining. It consists of MIAPE sample experiment data and a list of paired-"spot" expression values across all samples in the database. Missing sample spot values are allowed and are described in Section 5.5.

 

The CSD, viewed as a multiple-sample database, is illustrated in Figure 1 in the Introduction to this document and discussed on the home page of the Open2Dprot Web site.

 

There are other types of informatics data that are both useful and necessary such as sets (i.e. subsets) of spots, condition sets (subsets of samples), calibration data for each sample (as different from normalization), etc.

 

The complex XML type definitions are highlighted in yellow in this Section. E.g., see the definition of <CSD> in Section 5.5.

 

 

5.1 Sample accession data XML schema

 

A single sample is a 2D gel image, a 2D LC-MS data set, a protein array, or some other quantitative or semi-quantitative protein expression vector etc. A DIGE type sample is really a set of samples so each channel is a separate sample - even though spots are effectively paired by the way the samples are run simultaneously in the same gel with different dyes.

 

Each sample has some input data (e.g., a 2D gel image, a 2D LC-MS real image, 2D LC-MS clustered peaks data with a virtual image, or a protein array image or protein array spot list in some format). The sample name is currently the name of the data file without a path and without the extension (e.g., .tif, .jpg, .gif, etc.). The extension is determined at run time.
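
The sample-name rule can be illustrated with a short Java sketch. The class SampleNames and its baseName method are hypothetical and are not the O2Plib code; the sketch assumes that only the final extension is removed and that both '/' and '\' may appear as path separators.

```java
/** Hypothetical helper showing how a sample name could be derived; not part of O2Plib. */
final class SampleNames {
    private SampleNames() {}

    /** Returns the data file name with its directory path and final extension removed. */
    static String baseName(String dataFile) {
        // Accept both separators so the result does not depend on the host OS.
        int slash = Math.max(dataFile.lastIndexOf('/'), dataFile.lastIndexOf('\\'));
        String name = dataFile.substring(slash + 1);
        int dot = name.lastIndexOf('.');
        // A leading dot (e.g. ".hidden") is treated as part of the name, not as an extension.
        return dot > 0 ? name.substring(0, dot) : name;
    }
}
```

For example, a made-up input file ppx/gel-HM-019.tif would be accessioned under the sample name gel-HM-019.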

 

The accession database contains enough information to find the sample, grayscale or other calibration information, regions of interest, etc.

 

The Accession program currently creates and edits this information. We plan to replace this program with a more advanced program to fill in the other MIAPE experiment information fields.

 

The accession database DTD is available at

  http://open2dprot.sourceforge.net/O2Plib/Open2Dprot-Accession.dtd

 

The accession database XSD is available from the Open2Dprot Web site CVS server at

  http://cvs.sourceforge.net/viewcvs.py/open2dprot/schemas/Open2Dprot-Accession.xsd?rev=1.14&view=log

 

There is an example of an accession XML file in the file

  http://open2dprot.sourceforge.net/demo/xml/accession.xml 

 

 

5.1.1 Fields in the Accession XSD (Open2Dprot-Accession.xsd)

 

Samples are accessioned (i.e., entered) into a database containing experiment information about a sample as well as global image descriptions (file name, size, data region-of-interest (ROI), pixel grayscale to measurement units calibration, etc). In the case of non-image samples (e.g., 2D LC-MS peaks data, protein array data), these fields are defined for the virtual image of that data.

Additional experiment related information is also saved in the accession database for each sample. This includes study, investigator, date, sample conditions, etc.

 

This is currently defined using the Open2Dprot common library class O2Plib.db.DbAccession.java.

 

The accession database file

    <project-directory>/xml/accession.xml

 

contains sample and experiment information in a single flat file. It is similar to a subset of the original PEDRo proteomics schema. [We note that this list of sample accession data descriptors is inadequate for many research purposes and will be replaced with a more general MIAPE set of descriptors.] We currently use this accession schema as a placeholder until we implement a full MIAPE sample experiment information subset.

 

We would of course have to integrate the GUI for the ROIs (Region Of Interest) and grayscale calibrations.

 

A <SampleEntry> consists of required computable data fields and experiment information data fields, which are described in the subsections that follow.

 

The Accession pipeline module program allows the creation and editing of the accession sample database. We plan on making the Accession module more MIAPE compliant by either extending Accession or replacing it with another accessioning program. The home page is

 

  http://open2dprot.sourceforge.net/Accession

 

 

The top-level object contains two database identifier fields and an arbitrary list of <SampleEntry>s.

 

 

<Accession>

 

  <DatabaseName> (String) name of the accession database </DatabaseName>

 

  <Date> date database was created or modified</Date>

 

  <SampleEntry> sample entry 1 </SampleEntry>

 

  <SampleEntry> sample entry 2 </SampleEntry>

                 . . .

  <SampleEntry> sample entry n </SampleEntry>

 

</Accession>

 

The <SampleEntry> complex type contains two types of data: required computable data and optional experiment annotation data. These are described in Sections 5.1.2 and 5.1.3.
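
To show how a SAX reader might walk this top-level structure, here is a minimal sketch that uses the JDK's built-in SAX parser. It is not the O2Plib reader (DbAccession): the class AccessionSketch and its fields are hypothetical, it reads from a string rather than a file, performs no DTD validation, and collects only the <DatabaseName>, <Date>, and the <Sample> name of each <SampleEntry> (the <Sample> field is described in Section 5.1.2).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical, simplified accession reader; not the O2Plib implementation. */
final class AccessionSketch {
    String databaseName;
    String date;
    final List<String> sampleNames = new ArrayList<>();

    /** Parses accession XML text and returns the fields collected from it. */
    static AccessionSketch parse(String xml) {
        final AccessionSketch acc = new AccessionSketch();
        DefaultHandler handler = new DefaultHandler() {
            // Character data of the element currently being read.
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                text.setLength(0);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                String value = text.toString().trim();
                text.setLength(0);
                if (qName.equals("DatabaseName")) {
                    acc.databaseName = value;
                } else if (qName.equals("Date")) {
                    acc.date = value;
                } else if (qName.equals("Sample")) {
                    acc.sampleNames.add(value); // one <Sample> per <SampleEntry>
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse accession XML", e);
        }
        return acc;
    }
}
```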

 

 

5.1.2 Required computable data fields for each <SampleEntry>

 

The accession data contains several critical computable data fields for each <SampleEntry>. These fields are required by any Open2Dprot program that needs to look up accession information for one or more samples.

 

In particular, the spot segmentation module, Seg2Dgel, requires a region of interest (ROI), called the computing window, where spots will be found. If this ROI is not defined, it defaults to the entire image or (x,y) space of the data (in the case of abstract 2D LC-MS peak data or protein arrays). If a grayscale calibration is present, grayscale values are calibrated in terms of the calibration data. Fields that have optional entries are indicated with [opt].
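
The grayscale calibration is carried by the <WedgeGrayList> and <WedgeCalList> fields in the listing that follows. As an illustration only, the sketch below parses the two synchronized lists (delimited by commas, spaces, or tabs) and maps a gray value to calibrated units. The class WedgeCalibration is hypothetical, and the piecewise-linear interpolation between wedge steps, with clamping at both ends and ascending gray values, is an assumption made for the example rather than the documented Open2Dprot method.

```java
/** Hypothetical calibration sketch; the interpolation scheme is an assumption. */
final class WedgeCalibration {
    private final double[] gray; // grayscale peak values, assumed ascending
    private final double[] cal;  // corresponding calibration values

    WedgeCalibration(String wedgeGrayList, String wedgeCalList) {
        gray = parseList(wedgeGrayList);
        cal = parseList(wedgeCalList);
        if (gray.length != cal.length || gray.length < 2) {
            throw new IllegalArgumentException(
                "Lists must have the same length and at least two entries");
        }
    }

    /** Splits a list on commas, spaces, or tabs and parses each token as a number. */
    static double[] parseList(String list) {
        String trimmed = list.trim();
        if (trimmed.isEmpty()) {
            return new double[0];
        }
        String[] tokens = trimmed.split("[,\\s]+");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i]);
        }
        return values;
    }

    /** Linear interpolation between neighbouring wedge steps, clamped at both ends. */
    double calibrate(double g) {
        if (g <= gray[0]) {
            return cal[0];
        }
        for (int i = 1; i < gray.length; i++) {
            if (g <= gray[i]) {
                double t = (g - gray[i - 1]) / (gray[i] - gray[i - 1]);
                return cal[i - 1] + t * (cal[i] - cal[i - 1]);
            }
        }
        return cal[cal.length - 1];
    }
}
```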

 

<SampleEntry>

 

  <Sample> (String) sample base name (if there is an image, no path or

            extension)

  </Sample>

 

  <Rsample> (String) reference sample base name (if ANY) (if there is

            an image, no path or extension)

  </Rsample>

 

  <WedgeCalList> (String) optional calibration values list that is

            synchronized with <WedgeGrayList>. The delimiters may be commas,

            spaces, or tabs.

  </WedgeCalList>

 

  <WedgeGrayList> (String) optional corresponding grayscale peak values list

              synchronized with <WedgeCalList>. The delimiters may be commas,

              spaces, or tabs.

  </WedgeGrayList>