Notes on the current Open2Dprot XML
data schemas
(*** PRELIMINARY AND
NOT COMPLETE - WORKING DOCUMENT ***)
Created: 3-27-2005
Revised: 3-28-2005
Revised: 12-17-2005
Peter F. Lemkin
CCRNP, CCR, NCI
Frederick, MD 21702 USA
E-mail: lemkin@users.sourceforge.net
http://open2dprot.sourceforge.net/
Table of contents
1. Introduction to use of XML in Open2Dprot
2. Open2Dprot computing environment
3. Open2Dprot pipeline processing modules and scheduler
3.1 Classes of data objects used in Open2Dprot
4. The use of standard XML interchange
5. The Open2Dprot project list of XML schemas
5.1 Sample accession XML data schema
5.1.1 Fields in Accession XSD (Open2Dprot-Accession.xsd)
5.1.2 Required computable data fields for each <SampleEntry>
5.1.3 Experiment information data fields for each <SampleEntry>
5.2 Landmark spot data XML schema between two samples
5.2.1 Fields in Landmark XSD (Open2Dprot-Landmark.xsd)
5.2.2 Definition of a <LandmarkSet> entry
5.3 Sample spot-list data (SSF) XML schema
5.3.1 Fields in sample spot-list XSD (Open2Dprot-SSF.xsd)
5.3.2 SSF Preface <Sample_parameters> schema
5.3.3 SSF spot data <Spot> schema
5.3.4 SSF Epilogue statistics <Global_segmenter_statistics> schema
5.4 Paired sample spot-list data (SPF) XML schema
5.4.1 Fields in sample paired-spot-list XSD (Open2Dprot-SPF.xsd)
5.4.2.1 Contents of the <Rsample> and <Sample> objects in CSD schema
5.4.2 SPF Preface <Pairing_parameters> schema
5.4.3 SPF paired-spot <Pspot> schema
5.4.4 SPF Epilogue <Global_Spot_pairing_statistics> schema
5.5 Composite Sample Database (CSD) XML schema (Open2Dprot-CSD.xsd)
5.5.1 The canonical sample, Csample, for replicate samples in the CSD
5.5.2 The expression data for a single spot in the CSD or Rspot
5.5.3 A post-translational modification Rspot or <PTMRspot>
5.5.4 The <ConditionsList> as a list of <Conditions> groups of samples
1. Introduction to use of XML in Open2Dprot
The following notes describe the way we are currently using XML
data for interchange in Open2Dprot (http://open2dprot.sourceforge.net/).
This document also discusses how we would like to take advantage of the
Proteomics Standards Initiative (PSI at http://psidev.sourceforge.net/)
MIAPE, GelML and related parts of the General Proteomics Standards (GPS) when they
stabilize.
We are migrating our XML coding from the Apache Xerces (http://xerces.apache.org/) SAX XML reader
and hand-coded XML writers to XMLbeans (http://xmlbeans.apache.org/). This
document describes the current implementation.
The Open2Dprot project is a community effort to create an open-source n-dimensional (n-D) protein expression data analysis system. It will be downloadable and could be used for data mining protein expression across sets of n-D data from research experiments. Modules will be created for
2-dimensional data including 2D-PAGE (polyacrylamide gel electrophoresis) and initial support for 2D LC-MS, protein arrays and other data separation methods.
Our goal, in using an XML data interchange format, is to be as
compliant as possible with MIAPE. For those cases where we have parameters and
summary statistics that are not currently MIAPE compliant and there is an
escape mechanism in MIAPE to encode this information, we will refactor our
current XML to use that mechanism. Otherwise, we will use additional schemas to
fill the gaps.
At the Montreal 2003 HUPO meeting, Chris Taylor (EBI) suggested we
go ahead and implement our own schema while MIAPE was being developed. We used
the fields we needed that already existed in the PEDRo schema and
then added fields we needed that were missing from PEDRo.
In addition, when a more complete MIAPE model is available that
handles these additional complex types, we will then refactor our remaining
code and new data-mining code and schemas toward that model.
Note that the current Open2Dprot XML schemas we are presenting
here are placeholders to be redone when the new GelML/MIAPE standard is
available and meets our needs. At that time we plan to go back and refactor the
code to take the new standard into account.
The files discussed throughout this document are available on the http://open2dprot.sourceforge.net/
server. We will specify the direct links to specific files
throughout the document to make it easier to review the files. Documentation
(manuals, PDFs, javadocs), source code (CVS and Files mirror releases), demo
data, and downloadable installers are available on the Web site.
The primary concept of pipeline processing for the Open2Dprot
system is to construct a Composite Samples Database (CSD) of protein expression
values of corresponding proteins across multiple samples. Once this CSD
database is constructed, it can then be used for subsequent analysis.
Section 2 describes
the Open2Dprot computing environment shown in Figures 1
through 3 that follow. Section 3 describes the
pipeline-processing paradigm. Section 4
discusses our goal of using common XML standards to facilitate data
interchange. Section 5 describes
our current XML schemas.
Note that this is a work in progress. We are in the process of
defining the XML schema for the CSD and so do not yet have the full XML schema we
will use. However, we outline some of the key export data of an assembled CSD
as follows in Section 5.5.

Figure 1. Composite Samples Database model, CSD, used in Open2Dprot.
a) Illustrates a composite sample database. Corresponding
paired spots (circles) are denoted by diagonal lines drawn through them. Such
sets of corresponding spots are called Rspot sets. One of the samples (from the
set of samples {GR, G2, G3, ..., Gn})
is selected to be a reference sample or Rsample, denoted GR. A
circle means the spot is present and an X means that it is missing in that
sample. Spot A occurs in all n samples. Spot B occurs in the Rsample and in one
other sample. Spot C is only in the Rsample. Spot D is not present in the
Rsample but is in most of the other samples. Spots A, B, and C are in the
un-extended Rspot database (since they occur in the Rsample) while spot D is in
the eRspot part of the database (since it does not occur in the Rsample). Part
of constructing the CSD is to extrapolate where missing spots would be in the samples
from which they are missing, based on their relationship to neighboring spots common
to all samples. Extrapolated spots are assigned expression and area values
of 0 and can be used for "missing spots" types of tests. Although the
CSD in the initial Open2Dprot project is constructed using a common reference
sample, it would be possible to construct the CSD using a transitive model mapping
across pairs of samples, although with higher error rates. b) Illustrates the basis of using
mean (or median) spot positions for estimating canonical spots for a subset of
k samples from the n samples database. The canonical spot can then be used to
estimate the position of spots missing from some of the other samples. The means
and variances of Rspot positions across a set of samples are mapped to the
coordinate system of the Rsample. When that has been done, the set of samples
of the same experimental class can be replaced by a single averaged sample
called the Csample' (the estimate of the canonical sample for a set of replicate samples). The mean displacement vector of a canonical spot from its associated
landmark spot (in any sample under discussion) is used to extrapolate the
position in samples where the expected canonical spot is missing. If no
landmark spots were used in constructing the CSD, nearby Rspots that have spots
from all samples may be used to supply the vector offsets. Similarly, mean
quantified spot data can be used to represent a set of replicate samples.
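The extrapolation described in this legend can be sketched as follows. This is only a minimal illustration and not the Open2Dprot implementation: the class SpotExtrapolator and its method names are hypothetical, positions are plain {x, y} arrays, and it assumes one landmark position is available for every sample in which the spot is present.

```java
/** Hypothetical helper: estimates a missing spot position from landmark offsets. */
final class SpotExtrapolator {

    /**
     * Mean of (spot - landmark) over the samples where the spot is present.
     * spots[i] and landmarks[i] are {x, y} positions in the same sample i.
     */
    static double[] meanDisplacement(double[][] spots, double[][] landmarks) {
        if (spots.length == 0 || spots.length != landmarks.length) {
            throw new IllegalArgumentException("need equal-length, non-empty arrays");
        }
        double dx = 0.0;
        double dy = 0.0;
        for (int i = 0; i < spots.length; i++) {
            dx += spots[i][0] - landmarks[i][0];
            dy += spots[i][1] - landmarks[i][1];
        }
        return new double[] {dx / spots.length, dy / spots.length};
    }

    /** Estimated {x, y} position in a sample where the spot is missing. */
    static double[] extrapolate(double[] landmarkInSample, double[] meanDisplacement) {
        return new double[] {
            landmarkInSample[0] + meanDisplacement[0],
            landmarkInSample[1] + meanDisplacement[1]
        };
    }
}
```

As the legend states, a spot placed this way would carry expression and area values of 0; only its position is estimated.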
_________________________________________________________________________________
The following figure 2 illustrates
the context of the schemas in
Open2Dprot processing. Figure 2.1 shows more of the details of the
XML schemas used between pipeline stages.

Figure 2. The Open2Dprot pipeline
processing data reduction hierarchy.
This shows the data reduction steps in converting n-D
sample image (whether real images or "virtual" images) data to a
composite sample database suitable for exploratory data analysis. In general,
each step of the pipeline depends on the previous step being completed. Some
steps can be omitted (e.g., if protein array pre-quantified spot data were used
then Steps 2-4 can be omitted since no spot segmentation is required, no
landmarking and no spot pairing is required; if a spot-pairing method were used
that did not depend on predefined landmarks, then Step 3 can be omitted, etc).
A key design element of Open2Dprot is that each pipeline step is assigned one of
several alternative modules that adhere to the same XML input/output schema.
That means that alternate methods can be substituted in the pipeline. For
example, for 2D gels an image segmenter would be used for step 2; for 2D LC-MS peak
cluster data a clustering method might be used; for a protein array, spot-list
data from one of the many microarray image spot segmentation
programs might be used, etc. The pipeline is set up and controlled by the Pipeline Control
Program.
______________________________________________________________________________
  |
  | (2D images, 2D LC-MS spot cluster data, protein array data)
  v
1. Accession sample images or n-D data and experiment information,
  |
  | XML (sample Accession data Open2Dprot-Accession.xsd)
  v
2. Segment n-D data to quantify or extract "spots" for all
   samples (2D-gels, 2D-LC-MS, etc.),
  |
  | XML (sample spot-list SSF data Open2Dprot-SSF.xsd)
  v
3. Create a landmark database between reference sample and remaining
   samples by spot pairing algorithm (if required for spot pairing),
  |
  | XML (Landmark data Open2Dprot-Landmark.xsd)
  v
4. Pair spots between a reference sample and the rest of samples,
  |
  | XML (samples paired-spot-list SPF data Open2Dprot-SPF.xsd)
  v
5. Construct Composite Samples Database, CSD, by
   merging paired spot lists,
  |
  |
  v
RDBMS and caches (CSD data Open2Dprot-CSD.xsd)
  ^
  |
  |
6. Explore the CSD data using exploratory data analysis and data mining
   techniques: statistics, clustering, classification, direct-manipulation
   graphics, reports, etc. This may invoke Java plugins and R-language
   scripts.
______________________________________________________________________________
Figure 2.1 The Open2Dprot pipeline
processing data reduction hierarchy. The pipeline processing hierarchy is
illustrated by figure 2 of the Open2Dprot home page. See the figure legend and
discussion of which stages could be run in background batch.
2. Open2Dprot computing environment
Open2Dprot is meant to be run locally on an investigator's or a
collaborative group's computer. However, the CSD database will reside on an
RDBMS that could be split between a working experiment database and a reference
database (e.g., plasma proteins, etc.) with spot identifications and other
proteomic information.
Currently, Open2Dprot saves all data (XML data interchange files
as well as derived images) from a project in a set of sub-directories in a
project directory. Multiple experiments, each consisting of multiple samples,
could reside in the same project directory. Multiple CSDs could be constructed
in the same project directory created from samples from different experiments
or subsets of samples.
<project-directory>/batch/ - batch files
<project-directory>/cache/ - cache files
<project-directory>/ppx/ - original input image files
<project-directory>/rdbms/ - RDBMS CSD database files
<project-directory>/tmp/ - generated temporary and derived image files
<project-directory>/xml/ - accession DB, landmark DB, SSF files, SPF files, etc.
These data could reside in the RDBMS itself or in sets of
distributed RDBMS. The cache directory data files are used for data mining
rather than paging data from the RDBMS, which would be too slow for many
exploratory data analysis operations.
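The project directory layout listed above can be sketched in Java as follows. This is a hedged illustration only: the class ProjectLayout and its methods are hypothetical and are not part of O2Plib; only the six sub-directory names and the xml/accession.xml file name come from this document.

```java
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that resolves the project sub-directories. */
final class ProjectLayout {

    // Sub-directory names taken from the project directory listing.
    static final String[] SUBDIRS = {"batch", "cache", "ppx", "rdbms", "tmp", "xml"};

    /** Maps each sub-directory name to its path under the project directory. */
    static Map<String, Path> resolve(Path projectDir) {
        Map<String, Path> dirs = new LinkedHashMap<>();
        for (String name : SUBDIRS) {
            dirs.put(name, projectDir.resolve(name));
        }
        return dirs;
    }

    /** Location of the accession database file inside the xml/ sub-directory. */
    static Path accessionFile(Path projectDir) {
        return projectDir.resolve("xml").resolve("accession.xml");
    }
}
```

A caller could pass each resolved path to java.nio.file.Files.createDirectories to set up a new, empty project.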
3. Open2Dprot pipeline
processing modules and scheduler
There is a top-level scheduler program (not released yet) called
Open2Dprot, illustrated in Figure 3, that determines
the data dependencies: what data exists and what data is required (i.e., the next
step of the processing depends on it). It then runs the appropriate pipeline
modules to create that data.
Each pipeline step may have a specific module dynamically
assigned. This means that different pipeline processing method sets can be
created for different types of proteomic expression data. For example, if no
spot pairing is required (e.g., the input is a protein array and not a 2D gel
image or 2D LC-MS data file), then the pipeline analyzer will skip the spot
pairing steps. Similar dependencies can be assigned to all processing steps.
______________________________________________________________________________
Figure 3. The Open2Dprot pipeline
control program.
The pipeline control program (called Open2Dprot,
and under development) will schedule and run the modules in the pipeline
after doing a data-dependency analysis on 1) what data exists, and 2) what data
needs to be created by running parts of the pipeline to proceed to the next
stage in the pipeline. That is, it works backwards from the future CSD to determine
what data is required and then repeats that analysis further back in the
pipeline until it reaches the top of the pipeline (Figure 2) or reaches data
that already exists in the pipeline. It then executes the pipeline processes to
construct the CSD. Once the CSD is created, the pipeline is no longer needed
except to add additional samples. Instead, the CSDminer program would then be
used directly to access the already existing CSD database.
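The backward data-dependency analysis described in this legend can be sketched as follows. The scheduler is not yet released, so this is an assumption-laden illustration rather than its actual logic: the class PipelinePlanner, the Stage names, and the two input sets are hypothetical, and the pipeline is simplified to the linear steps 1 through 5 of Figure 2.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the backward data-dependency analysis. */
final class PipelinePlanner {

    /** Pipeline steps 1 through 5 of Figure 2, in processing order. */
    enum Stage { ACCESSION, SEGMENT_SSF, LANDMARK, PAIR_SPF, BUILD_CSD }

    /**
     * Walks backward from the CSD until it reaches a stage whose output
     * already exists, then returns the stages to run in forward order.
     * Stages in notRequired (e.g. landmarking) are passed over.
     */
    static List<Stage> plan(Set<Stage> outputExists, Set<Stage> notRequired) {
        Deque<Stage> toRun = new ArrayDeque<>();
        Stage[] all = Stage.values();
        for (int i = all.length - 1; i >= 0; i--) {
            Stage s = all[i];
            if (notRequired.contains(s)) {
                continue;            // step omitted for this type of data
            }
            if (outputExists.contains(s)) {
                break;               // data exists: stop walking backward
            }
            toRun.addFirst(s);       // output is missing and must be created
        }
        return new ArrayList<>(toRun);
    }
}
```

For example, if accession and SSF data already exist and no stage is omitted, the plan is LANDMARK, PAIR_SPF, BUILD_CSD; if the CSD already exists, the plan is empty.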
All currently released pipeline modules, source code, and the
common library O2Plib are available from the Web site list of subprojects
http://open2dprot.sourceforge.net/doc/subprojects.html
The direct links are:
http://open2dprot.sourceforge.net/Accession
http://open2dprot.sourceforge.net/CmpSpots
http://open2dprot.sourceforge.net/Seg2Dgel
http://open2dprot.sourceforge.net/Landmark
http://open2dprot.sourceforge.net/O2Plib
where:
Accession - sample experiment and ROI accession program
Seg2Dgel - segment 2D gel (or similar) image into a Sample Spot-list File (SSF)
Landmark - interactively define a set of landmarks between two sample images (Landmark)
CmpSpots - pair two sample spot lists into a Sample Paired-spot-list File (SPF)
O2Plib - common Open2Dprot library
O2Plib.db.* - contains data objects and XML I/O
O2Plib.db.CSD - contains CSD data objects and XML I/O
[Modules not yet released: Open2Dprot, BuildCSD, CSDminer]
Open2Dprot - Open2Dprot pipeline scheduler
BuildCSD - construct/add-to the CSD from sets of SPF data files
CSDminer - exploratory analysis for CSD
Although we refer to an XML Sample Spot-list File (SSF) as a file,
it could be an XML object in an RDBMS.
3.1 Primary data objects used in Open2Dprot
The primary
classes which define data objects used throughout Open2Dprot are defined in the
O2Plib.db.* Java library modules and define the base objects and their XML
readers and writers.
DbAccession.java - read, write and access accession sample database
DbBaseSpot.java - base spot class
DbBoundary.java - spot boundary manipulation (not released)
DbLM.java - read, write and access paired sample landmark spots database
DbPairSamples.java - read, write and access paired sample instance
DbSample.java - read, write and access sample spot list instance
DbPspot.java - SPF paired-spot feature object
DbSpot.java - SSF spot feature object
LMset.java - landmark set object
The O2Plib home
page is http://open2dprot.sourceforge.net/O2Plib
It has javadoc API documentation accessible from the home page showing the object dependencies.
3.1.1 Primary data objects used in CSD for Open2Dprot
The primary
classes which define CSD data objects used throughout Open2Dprot are defined in
the O2Plib.db.CSD.* Java library modules and define the CSD base objects and
their XML readers and writers. See Section 5.5 for discussion on how these
describe the CSD XML schema.
CSD.java - instance of CSD database
CSDacc.java - Accession data instance
CSDannotation.java - annotation instance
CSDcache.java - cache instance
CSDcal.java - grayscale to measurement units calibration instance
CSDcond.java - sample conditions instance
CSDexpr.java - expression profile instance
CSDexprList.java - list of expression profiles instance
CSDfilterState.java - data filtering instance
CSDglb.java - global state instance
CSDio.java - I/O instance
CSDlimits.java - filter limits instance
CSDlms.java - landmark data instance
CSDnorm.java - sample spots normalization instance
CSDRmap.java - reference map instance
CSDRspot.java - Rspot (reference spot) set instance
CSDRspotList.java - list of Rspots instance
CSDsizes.java - current size limits instance
CSDtotals.java - current samples, spots etc. totals instance
4. The use of standard XML interchange data files
Open2Dprot standardizes data interchange through XML files. Currently, the
I/O is to and from data files; in the future, the data may also be kept in
a relational database.
To minimize problems porting data between processing modules, we
have created a common Java library O2Plib. This library defines all data
structures that would be used in more than one Open2Dprot pipeline module.
It also centralizes the XML I/O.
The current, early version of the Open2Dprot XML library uses the
SAX XML reader to read XML data files. It also uses Java methods to explicitly
generate the XML output. These use optional document type definitions (DTDs)
with the SAX readers (if the dtd command-line switch is specified with
the module).
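A SAX reader of this kind can be sketched with the JDK's built-in javax.xml.parsers API. This is a minimal illustration and not the O2Plib reader: the class SampleNameReader is hypothetical, it performs no DTD validation, and it relies on the <SampleEntry> and <Sample> element names from the accession schema described in Section 5.1.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical SAX handler that collects <Sample> names; not the O2Plib reader. */
final class SampleNameReader extends DefaultHandler {
    private final List<String> names = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inSampleEntry;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (qName.equals("SampleEntry")) {
            inSampleEntry = true;
        }
        text.setLength(0);                  // start collecting this element's text
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);     // may be called several times per element
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (inSampleEntry && qName.equals("Sample")) {
            names.add(text.toString().trim());
        }
        if (qName.equals("SampleEntry")) {
            inSampleEntry = false;
        }
        text.setLength(0);
    }

    /** Parses accession XML text and returns the <Sample> names in document order. */
    static List<String> read(String xml) {
        SampleNameReader handler = new SampleNameReader();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Wrap parser configuration, SAX and I/O errors in one unchecked exception.
            throw new IllegalStateException("cannot parse accession XML", e);
        }
        return handler.names;
    }
}
```

The event-driven style means the reader never holds the whole document in memory, which is one reason SAX suits long spot lists; the cost is the hand-written state tracking that generated readers would remove.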
Future XSD XMLbeans I/O
We are in the process of refactoring the XML I/O to use XML readers and writers
generated by XMLbeans (http://xmlbeans.apache.org/) as Java code from the XSD schemas. This
will make it much easier to integrate the MIAPE XSD schemas with Open2Dprot. We
will be adding XSD namespaces to the Open2Dprot schemas to keep the fields
unique. We will merge with the MIAPE schema fields where possible. The
namespace will be the following:
o2p: Open2Dprot namespace that will be used if there is no equivalent MIAPE schema type or element
If there is a conflict between the Open2Dprot sub-schemas and
there is no MIAPE replacement, then we may rename our elements and type names
or we might use alternate namespaces as follows. This is to be avoided if
possible by the renaming of fields.
o2pa: Open2Dprot accession space
o2pl: Open2Dprot landmark space
o2ps: Open2Dprot spot list space
o2pp: Open2Dprot paired-spots list space
o2pc: Open2Dprot CSD space
Current SAX XML readers
There are document type definitions (DTDs) associated with the XML
data readers and writers in the Library. All XML I/O is handled completely by
this library. Currently, we are using hardwired Xerces SAX readers, but hope to migrate
this to dynamic XML schema I/O this year.
Currently, we have 4 published DTDs on the Web site (the CSD will
be an XML schema file Open2Dprot-CSD.xsd and is being constructed for the
Composite Sample Database).
The DTDs are available several ways: 1) download and install the
CmpSpots program (this includes the four DTDs), or 2) they are available at:
http://open2dprot.sourceforge.net/O2Plib/Open2Dprot-Accession.dtd
http://open2dprot.sourceforge.net/O2Plib/Open2Dprot-Landmark.dtd
http://open2dprot.sourceforge.net/O2Plib/Open2Dprot-SSF.dtd
http://open2dprot.sourceforge.net/O2Plib/Open2Dprot-SPF.dtd
5. The Open2Dprot project list of XML schemas
Once the Composite Sample Database (CSD) is constructed, it can
be used for exploratory data analysis and data mining. It consists of MIAPE
sample experiment data and a list of paired-"spot" expression values
across all samples in the database. Missing sample spot values are allowed and
are described in Section 5.5.
The CSD database, viewed as a multiple-sample database, is
illustrated in Figure 1 in the Introduction to
this document and discussed on the home page of the Open2Dprot Web site.
There are other types of informatics data that are both useful and
necessary such as sets (i.e. subsets) of spots, condition sets (subsets of
samples), calibration data for each sample (as different from normalization),
etc.
The complex XML type definitions are highlighted in yellow in
this Section. E.g., see the definition of <CSD> in Section 5.5.
5.1 Sample accession data XML
schema
A single sample is a 2D gel image, a 2D LC-MS data set, a protein
array, or some other quantitative or semi-quantitative protein expression
vector etc. A DIGE type sample is really a set of samples so each channel is a
separate sample - even though spots are effectively paired by the way the
samples are run simultaneously in the same gel with different dyes.
Each sample has some input data (e.g., a 2D gel image, a 2D LC-MS real
image, 2D LC-MS clustered peaks data with a virtual image, or a protein array image
or protein array spot list in some format). The sample name is currently
the name of the data file without a path and without the extension (e.g., .tif,
.jpg, .gif, etc.). The extension is
determined at run time.
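The sample naming rule above can be sketched as follows. The class SampleNames and the method baseName are hypothetical names chosen for illustration, and the sketch assumes that only the final extension is dropped.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper that derives a sample name from a data file name. */
final class SampleNames {

    /** Drops the directory path and the final extension, e.g. ".tif". */
    static String baseName(String dataFile) {
        Path fileName = Paths.get(dataFile).getFileName();
        if (fileName == null) {
            throw new IllegalArgumentException("no file name in: " + dataFile);
        }
        String file = fileName.toString();
        int dot = file.lastIndexOf('.');
        // A leading dot (index 0) is treated as part of the name, not an extension.
        return dot > 0 ? file.substring(0, dot) : file;
    }
}
```

With this rule, a hypothetical input file ppx/gel-01.tif yields the sample name gel-01.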
The accession database contains enough information to find the
sample, grayscale or other calibration information, regions of interest, etc.
The Accession program currently creates and edits this
information. We plan to replace this program with a more advanced program to
fill in the other MIAPE experiment information fields.
The accession database DTD is available at
http://open2dprot.sourceforge.net/O2Plib/Open2Dprot-Accession.dtd
The accession database XSD is available from the Open2Dprot Web
site CVS server at
http://cvs.sourceforge.net/viewcvs.py/open2dprot/schemas/Open2Dprot-Accession.xsd?rev=1.14&view=log
There is an example of an accession XML file in the file
http://open2dprot.sourceforge.net/demo/xml/accession.xml
5.1.1 Fields in the Accession XSD (Open2Dprot-Accession.xsd)
Samples are accessioned (i.e., entered) into a database containing
experiment information about a sample as well as global image descriptions
(file name, size, data region-of-interest (ROI), pixel grayscale to measurement
units calibration, etc). In the case of non-image samples (e.g., 2D LC-MS peaks
data, protein array data), these fields are defined for the virtual image of
that data.
Additional experiment related information is also saved in the
accession database for each sample. This includes study, investigator, date,
sample conditions, etc.
This is currently defined using the Open2Dprot common library
class O2Plib.db.DbAccession.java.
The accession database file
<project-directory>/xml/accession.xml
contains sample and experiment information in a single flat-file.
It is similar to a subset of the original PEDRo
proteomics schema. [We note that this list of sample description
accession data descriptors is inadequate for many research purposes and
will be replaced with a more general MIAPE set of descriptors.] We
currently use this accession schema as a placeholder until we implement a full
MIAPE sample experiment information subset.
We would of course have to integrate the GUI for the ROIs (Region
Of Interest) and grayscale calibrations.
A <SampleEntry> consists of required computable data
fields and
experiment information data fields, described in the subsections that follow.
The Accession
pipeline module program allows the creation and editing of the accession sample
database. We plan on making the Accession module more MIAPE compliant by either
extending Accession or replacing it with another accessioning program. The home
page is
http://open2dprot.sourceforge.net/Accession
The top-level object contains two database identifier fields and
a list of any number of <SampleEntry> elements.
<Accession>
<DatabaseName> (String) name of the accession database
</DatabaseName>
<Date> date database was created or modified</Date>
<SampleEntry> sample entry 1 </SampleEntry>
<SampleEntry> sample entry 2 </SampleEntry>
. . .
<SampleEntry> sample entry n </SampleEntry>
</Accession>
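A hand-coded writer for this top-level structure might look like the following sketch. It is an illustration only and not the O2Plib writer: the class AccessionWriter is hypothetical, the date is treated as free text because its format is not specified here, and each <SampleEntry> is reduced to its <Sample> name.

```java
import java.util.List;

/** Hypothetical hand-coded writer for the top-level <Accession> structure. */
final class AccessionWriter {

    /** Escapes markup characters so that a value is safe as element text. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds the XML text: two identifier fields, then one entry per sample. */
    static String write(String databaseName, String date, List<String> sampleNames) {
        StringBuilder sb = new StringBuilder();
        sb.append("<Accession>\n");
        sb.append("  <DatabaseName>").append(escape(databaseName)).append("</DatabaseName>\n");
        sb.append("  <Date>").append(escape(date)).append("</Date>\n");
        for (String name : sampleNames) {
            sb.append("  <SampleEntry>\n");
            sb.append("    <Sample>").append(escape(name)).append("</Sample>\n");
            sb.append("  </SampleEntry>\n");
        }
        sb.append("</Accession>\n");
        return sb.toString();
    }
}
```

Writers of this style are easy to follow but must be kept in step with the schema by hand, which is part of the motivation for moving to generated XMLbeans code.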
The <SampleEntry> complex type contains two types of data:
required computable data and optional experiment annotation data. These are
described in Sections 5.1.2 and 5.1.3.
5.1.2 Required computable data fields for each <SampleEntry>
The accession data contains several critical computable data
fields for each <SampleEntry>. These fields are required by any
Open2Dprot programs that need to look up accession information for one or more
samples.
In particular, the spot segmentation module, Seg2Dgel, requires a
region of interest (ROI) called the computing window where spots will be found.
If this ROI is not defined, it defaults to the entire image or (x,y) space of
the data (in the case of abstract 2D LC-MS peak data or protein arrays). If a
grayscale calibration is present, grayscale values will be calibrated in terms of the
calibration data. Fields that have optional entries are indicated with [opt].
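The grayscale calibration uses the synchronized <WedgeCalList> and <WedgeGrayList> fields of <SampleEntry>, whose delimiters may be commas, spaces, or tabs. The sketch below shows one way such lists could be parsed and applied. It is hedged heavily: the class WedgeCalibration is hypothetical, and piecewise-linear interpolation over ascending gray values is an assumption made for illustration, since this document does not state how Open2Dprot interpolates.

```java
import java.util.Arrays;

/** Hypothetical parser and lookup for the wedge calibration lists. */
final class WedgeCalibration {

    /** Splits a list whose delimiters may be commas, spaces, or tabs. */
    static double[] parseList(String list) {
        String trimmed = list.trim();
        if (trimmed.isEmpty()) {
            return new double[0];
        }
        return Arrays.stream(trimmed.split("[,\\s]+"))
                .mapToDouble(Double::parseDouble)
                .toArray();
    }

    /**
     * Maps a gray value to calibrated units by linear interpolation between
     * neighbouring wedge steps. ASSUMPTION: the interpolation method and the
     * ascending order of grayList are not specified by the source document.
     */
    static double calibrate(double gray, double[] grayList, double[] calList) {
        if (grayList.length != calList.length || grayList.length < 2) {
            throw new IllegalArgumentException("lists must be synchronized, length >= 2");
        }
        if (gray <= grayList[0]) {
            return calList[0];                      // clamp below the first step
        }
        for (int i = 1; i < grayList.length; i++) {
            if (gray <= grayList[i]) {
                double t = (gray - grayList[i - 1]) / (grayList[i] - grayList[i - 1]);
                return calList[i - 1] + t * (calList[i] - calList[i - 1]);
            }
        }
        return calList[calList.length - 1];         // clamp above the last step
    }
}
```

The length check mirrors the requirement that the two lists be synchronized: every gray peak value needs exactly one calibration value.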
<SampleEntry>
<Sample> (String) sample base name (if there is an image, no path or extension)
</Sample>
<Rsample> (String) reference sample base name (if ANY) (if there is an image, no path or extension)
</Rsample>
<WedgeCalList> (String) optional calibration values list that is synchronized with <WedgeGrayList>. The delimiters may be commas, spaces, or tabs.
</WedgeCalList>
<WedgeGrayList> (String) optional corresponding grayscale peak values list synchronized with <WedgeCalList>. The delimiters may be commas, spaces, or tabs.
</WedgeGrayList>