This is the developer documentation for the ProteomeCommons.org IO Framework. Here is where you will find basic information on the design choices of this framework and how to use the framework's features.
Regular release builds of the ProteomeCommons.org IO Framework may always be found at ProteomeCommons.org. These builds are intended for users who want relatively stable builds of the code, and who aren't interested in actively developing the code. If you just want to download and use the code, just grab the latest stable copy of ProteomeCommons.org.
If you are interested in getting the absolute latest code, you may download the code this project's developers are actively working on. All of the current code, including daily changes are kept in a public subversion repository. All of the documentation here assumes that you have downloaded some release of the ProteomeCommons.org IO Framework.
Full code releases are available at ProteomeCommons.org, and the development branch of the code is available from our subversion repository (Help! What is subversion?), svn://www.proteomecommons.org/svn/IO/dev. If you are looking to help develop the IO framework's code, use the latest development code. Don't submit new features that are based off of an old release.
Sometimes dev builds are based off a dev build of the Java Analysis Framework (JAF). You may check out the latest development JAF code from ProteomeCommons.org, svn://www.proteomecommons.org/svn/JAF/dev.
These are some simple instructions for getting the IO development code using subversion with the eclipse IDE.
you'll now be able to right click on C:\workspace\IO and c:\workspace\JAF to use update
You can help in many different ways, even if you aren't a Java developer. Here are some ideas.
All of the above are just general ideas of what you can do. If something looks interesting, please get in touch with the primary developers.
There are several key concepts that you'll have to wrap you head around before you can successfully use this framework. Most all of these concepts involve bridging a mass spectrometry concept to a Java abstraction. If you come from the Java world, you'll have no problems seeing why we chose to abstract things as we did, but you'll likely benefit from learning more about the domain specific mass spectrometry concepts. If you are a mass spectrometrist, you'll likely want to know how the Java abstractions can be used to handle your data.
The general approach for presenting the key concepts of the ProteomeCommons.org IO Framework is to break the whole framework in to a few important parts and present these parts in a logical order. Each of the following sections does just this. You are encouraged to read the documentation straight through as if it was a book, as each concept builds upon the earlier ones.
A "peak list" is used to refer to a mass spectra that has been centroided and reduced to what is considered the primary information: m/z values with associated intensities. In general, peak lists are relatively easy to work with because they represent exactly what most mass spectrometrist's want. In this framework, the basic abstraction for a peak list is the PeakList interface in the org.proteomecommons.io package. The basic interface requires only the following two things: a way to access peak m/z and intensity information and a way to assign an arbitrary name. The m/z values, intensities, and an arbitrary name are considered the bare minimum information that each peak list must contain. However, support is also provided for MSMS ion information, peak areas, charge states, and any other information you'd like to included. This support is provided by the interfaces taht extend the PeakList interface, e.g. TandemPeakList, MonoisotopicPeakList, and QuantitativePeakList. All of the peak list code in this framework is guaranteed to retrun an instance of PeakList, and you may type check and cast instance of PeakList to the appropriate sub class, depending on what information you are trying to access.
Here is a simple example of getting tandem MS information from a peak list.
// assume an instance of the peak list exists
PeakList peaklist = ...;
// check for MSMS
if (peaklist instanceof TandemPeakList) {
// do tandem MS specific stuff
TandemPeakList tandem = (TandemPeakList)peaklist;
PeakList ms = tandem.getParentIon();
int charge = tandem.getParentIonCharge();
double intensity = tandem.getParentIonIntensity();
}
Normally, instances of PeakList are instances of the GenericPeakList object. This object impements methods for all of the abstract PeakList-related interfaces, but it only formally implements PeakList. Specific PeakList readers and writers normally use this class and tack on the appropriate interfaces in order to specify what other information is stored in the peak list. You should never cast instances of PeakList to GenericPeakList. Always use the appropriate abstract interfaces. Often specific readers will return customized PeakList impelmentations that include all of the meta-data associated with a particular peak list format, e.g. the Mascot Generic Format peak list reader returns instances of MGFPeakList.
Here is a simple example of getting format specific infromation from a peak list, basically the same strategy as illustrated above.
// assume an instance of the peak list exists
import org.proteomecommons.io.mgf.*;
PeakList peaklist = ...;
// check for MGf
if (peaklist instanceof MGFPeakList) {
MGFPeakList mgf = (MGFPeakList)peaklist;
// invoke MGF-specific methods...
}
Whenever possible, it is encouraged that you avoid relying on peak list format-specific instances of PeakList. Code that does this will have to write methods that know how to handle each of the different file formats supported by the IO-framework. The proper way to get at commonly used information, say peak areas, is to check if the PeakList is an instance of QuantitationPeakList and use the methods definied by that interface. Type casting to a file-format specific peak list, say MGFPeakList, and accessing peak quantitation information that way will result in much messier code. However, in special cases where a peak list format provides information that is not properly abstracted by the IO framework, you have no choice but to type cast to the particular file-formats PeakList instance.
Filters are a simple method of encapsulating code that modifies a peak list or spectra and exposing the functionality via one simple method, filter(). The PeakListFilter interface defines a filter() method for PeakList objects. Returned from the method is another PeakList that may have been modified in any way by the encapsulated code. These filters are used either directly by applications or through the helper methods associated with the PeakList objects. If you'd like to see some examples, look in the org.proteomecommons.io.filters package.
The IO framework is based around several abstract base classes that represent the minimal amount of functionality that a peak list or a reader/writer should have. All code specific to a particular implementation of a peak list or reader/writer is included in an individual package, e.g. org.proteomecommons.io.mgf contains all the MGF file format code. During runtime a user is expected to use the factory methods of the PeakListReader or PeakListWriter, classes in order to get an appropriate instance of the associated class for the file format of interest. This design choice greatly simplifies how a user can access the frameworks functionality, and it also leaves lots of flexibility when supporting new file formats.
In general, new readers and writers are created by invoking the getPeakListReader() or getPeakListWriter() method of either the GenericPeakListReader or GenericPeakListWriter classes. The following code snippets illustrate how to obtain a new reader or writer for either peak lists or spectra.
// load the libraries
import org.proteomecommons.io.*;
...
// get a reader based on file extension
PeakListReader reader = GenericPeakListReader.getPeakListReader("my-file.mgf");
// use the reader
...
// get a writer based on file extension
PeakListWriter writer = GenericPeakListWriter.getPeakListWriter("output-file.dta");
// use the writer
...
Note that the class returned by the factory classes is always an instance of PeakListReader or PeakListWriter. All of those are interfaces. Specific implementations of those classes will be provided depending on the type of reading and or writing you intend to do. You can look at any of the existing packages (e.g. mgf, dta, pkl, txt) for examples.
The memory interface is designed to be as simple and intuitive to use as possible. Simply invoke the PeakListReader.getPeakList() method to read a peak list. Similarly, the PeakListWriter.writePeakList() methods work in a similar fashion. The only downside to the memory interface is that it requires the entire peak list must be present in memory in order to function. While this usually isn't an issue, it may be in situations where there is low memory or where you are attempting to handle an incredibly large peak list.
The basic technique for reading or writing a peak list is to get an appropriate peak list reader or writer, or you can simply use the helper methods in the PeakListReader and PeakListWriter classes. The following code snippet gives an examples of using the helper interfaces.
// load the libraries
import org.proteomecommons.io.*;
...
// load the peak list
PeakList peaklist = GenericPeakListReader.read("your-peaklist-file.mgf");
// modify the peak list some
...
// write out the new peak list to a file
GenericPeakListWriter.write("output-file.dta");
Reading and writing peak lists is near trivial. If you wish to customize the reader or writer, you can invoke the non-static read() or write() methods on the appropriate instance of the classes.
The stream interfaces replicate the functionality of the memory interfaces but are stream-lined to require the lowest amount of memory as possible. There is a slight trade off in ease of use, but the interfaces allow this framework to be helpul in both low memory situations and when handling extraordinarily large peak list resources. The stream interface is broken down in to the following three primary methods: isStartOfPeakList(), hasNext(), and next(). The isStartOfX() methods return a boolean indicating if the current reader or writer is ast the start of a peak list. The hasNext() method indicates if there is more data available in the given peak list resource. The next() method returns the next Peak. The overall functionality is similar to using standard Java Iterator classes for reading through a list.
Use of the stream interfaces in code is slightly more complex but still simple to use. Here is an example of stream reading a collection of peak lists encoded in MGF format.
PeakListReader reader = GenericPeakListReader.getPeakList("your-peaklist-file.mgf");
// read each peak list in the file
while (reader.isStartOfPeakList()) {
// handle first peak
Peak p = reader.next();
// write the peaks
while (reader.hasNext() && !reader.isStartOfPeakList()) {
// get the next peak
Peak p = reader.next();
// handle the peak as you please
...
}
}
While slightly more complex than the memory interface, the stream interfaces still accomplish the same task. The difference is that parts of a peak list are read or written serially instead of reading or writing an entire peak list all at once.
Coding your own reader or writer is as simple as extending the appropriate base class, either PeakListReader or PeakListWriter, and implementing the memory and stream interfaces. By convention, put all code associated with the class in a unique package, preferably named after the format that the code supports. If your format has more functionality that is currently present in the PeakList objects, do not change them, subclass them in your new package and add as many methods as you need. Other readers, writer, and users can cast PeakList objects in to the appropriate subclasses in order to use any functionality you've added.
If you'd like to use your reader or writer with the rest of the framework, register it using one of the GenericPeakListReader or GenericPeakListWriter classes. Each class has methods for registering a reader or writer with a regular expression that will dictate which files it can read.
The IO framework has full support for working with various standards for serializing protein sequences, including FASTA. The framework can read these resources and create appropriate Protein objects or Residue objects as specified in the JAF framework. The ProteinReader interface defines the functionality that must be implemented by all protein readers.
The org.proteomecommons.io.fasta.FASTAProteinReader class will read a InputStream containing FASTA data and construct an appropriate Protein object. See JavaDocs.
The IO framework has full support for peptides via the JAF framework's peptide object (org.proteomecommons.jaf.Peptide). You can create peptides from any set of amino-acid residues, including Protein objects. Additionally you can easily create commonly-used peptide sets such as typric peptides and non-specificly cleaved peptides, including support for missed cleaves and SNPs.
The IO framework abstracts peptide reading using the PeptideReader interface. The GenericPeptideReader class is an example implementation, and it also includes helper method for filtering peptides based on mass restrictions and/or length restrictions. See JavaDoc.
Tryptic peptides can eaisly be created from any array of residues by using the org.proteomecommons.io.TrypticPeptideReader object. See JavaDoc.
Non-specific peptides can eaisly be created from any array of residues by using the NonspecificPeptideReader object. See JavaDoc.
Missed cleaves are handled by creating a special PeptideReader (org.proteomecommons.io.MissedCleavePeptideReader) that draws data from another PeptideReader, buffering it and accounting for an arbitrary number of missed cleaves. See JavaDoc.
SNPs are handled by creating a special PeptideReader (org.proteomecommons.io.SingleNucleotidePolymorphismPeptideReader) that draws data from another PeptideReader, buffering it and accounting for an arbitrary number of SNPs. See JavaDoc.
Please note that looking for multiple SNPs is a combinatorial problem, and you'll quickly have far too many peptides to examine. It is strongly recommended that you restrict yourself to assuming peptides many only have one SNP.