ProteomeCommons.org IO Framework 6.21

This is a free, open-source Java framework for handling spectra and peak list files. The framework can read and write a number of different spectra and peak list formats, and it provides a simple, intuitive Java object model for working with spectra or peak lists. The overall goal of this project is to make working with peak list or spectral data easy, even if you are working with several different file formats or with incredibly large files. Additionally, we aren't trying to lock developers into using Java. Utility programs are included that will translate peak list or spectral data into a simple, plain-text format that is well-suited for parsing via regular expressions in Perl, Python, VBScript, or any other programming language.

In short, the goal of this framework is to support all the popular MS and MSMS data formats, and to eliminate any time or effort involved in figuring out how to read and write peak list or spectrum files.

You may always find a current download of this framework at http://www.proteomecommons.org/current/531/. If you'd like the source-code, please read the developer docs.

Special notice: please help us give the IO Framework a good peak picking algorithm!

Credits

This is a project supported and contributed to by many different people. Here is a formal list of all of those who have submitted code and/or documentation.

Primary developers and maintainers.

Code contributors, documentation contributors, and supporters.

Support for this project, in part, comes from the National Resource for Proteomics and Pathways (NRPP).

Changes since 6.2 (200)

Changes since 6.1 (200)

Changes since 6.0 (175)

Changes since 5.0 (150)

Changes since 4.4 (134)

Changes since 4.3 (Revision: 120)

Changes since 4.2 (Revision: 112)

Changes since 4.1 (Revision: 67)

Changes since 4.0 (Minor Release 4.1)

Changes since 3.0 (Major Release 4.0)

Changes since 2.0 (Major Release 3.0)

Changes since 1.0 (Major Release 2.0)

Format Information

The following peak list formats are supported, to the extent documented here.

Bruker: .baf, .fid, .yep, and AutoXecute runs of LCMALDI

Read support for several of Bruker's file formats is implemented by wrapping Bruker's CompassXport tool. This tool can be freely downloaded from Bruker's site after registering and logging in; instructions for locating the tool are provided by both IonSource and Bioinformatics Solutions Inc. Use the following steps to make the CompassXport tool work with the IO Framework.

  1. Download and install the CompassXport tool. It might be tricky to find, so use the instructions provided by either IonSource or Bioinformatics Solutions Inc.
  2. Copy CompassXport.exe to C:/Program Files/ProteomeCommons.org/IO/CompassXport.exe. Note, if this location doesn't contain the file the IO Framework will automatically try C:/Program Files/Common Files/Bruker Daltonik/AIDA/export/CompassXport.exe, which is where Bruker normally installs it.
  3. Convert and use any of the Bruker file formats listed below.

Special Note: When using the CompassXport tool from Bruker, an mzXML file is generated by CompassXport and then converted by the IO Framework into the desired format. If you want mzXML and you want the exact mzXML that CompassXport produces, select the "Don't Convert" option.

Read support exists for the following Bruker file formats.

.baf (micrOTOF, micrOTOF-Q, and ultraOTOF-Q)
Read support exists for .baf data via exporting data using the CompassXport tool. If you have the tool installed, the IO Framework will automatically support this format.
.yep (esquire4000, esquire6000)
Read support exists for .yep data via exporting data using the CompassXport tool. If you have the tool installed, the IO Framework will automatically support this format.
.fid (microflex, microflex LT)
Read support exists for .fid data via exporting data using the CompassXport tool. If you have the tool installed, the IO Framework will automatically support this format.
AutoXecute run for LCMALDI (autoflex II and autoflex II TOF/TOF, ultraflex II and ultraflex II TOF/TOF)
Read support exists for AutoXecute runs for LCMALDI data via exporting data using the CompassXport tool. If you have the tool installed, the IO Framework will automatically support this format.

Mascot Generic Format .mgf

MGF file format support is based on Matrix Science LLC's generic format. Complete MGF format support is provided by this framework; however, meta-information support is not well tested.

Helpful links

Sequest .dta

DTA file format support is based on the files used by the Sequest search algorithm. Complete DTA file support is provided by this framework. Any file ending with ".dta" is treated as if it is in the Sequest DTA format.

PLEASE NOTE! The IO Framework's normal behavior deviates when handling DTA files. DTA files are intended to have one peak list per file with a known charge. Tools such as Sequest assume this and might not work properly if you include multiple peak lists in one DTA file (i.e. treat it like a PKL). The IO Framework will automatically "explode" peak list files that contain more than one peak list if they are converted to DTA. You will get several files, each with one peak list, named after the original, e.g. example.dta, example.2.dta, example.3.dta, etc. If no charge information is known, you'll also get files for charge states 1 through 3, e.g. example.1.1.dta, example.1.2.dta, etc. This is done to make the default IO Framework behavior work for the majority of use cases. Note, as a developer you can use setExplode(boolean) to disable this feature.
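The naming scheme above can be sketched in a few lines of Java. This is only an illustration inferred from the example file names in the note; the class and method names (DtaExplodeNames, namesFor) are hypothetical and are not part of the IO Framework's API, and the real writer may differ in details.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: reproduces the file names shown in the note above. */
class DtaExplodeNames {

    /**
     * @param baseName    original file name without extension, e.g. "example"
     * @param index       1-based position of the peak list in the source file
     * @param knownCharge precursor charge, or 0 when the charge is unknown
     */
    static List<String> namesFor(String baseName, int index, int knownCharge) {
        List<String> names = new ArrayList<String>();
        if (knownCharge > 0) {
            // Known charge: example.dta, example.2.dta, example.3.dta, ...
            names.add(index == 1 ? baseName + ".dta" : baseName + "." + index + ".dta");
        } else {
            // Unknown charge: one file per assumed charge state 1 through 3.
            for (int charge = 1; charge <= 3; charge++) {
                names.add(baseName + "." + index + "." + charge + ".dta");
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(namesFor("example", 1, 2));
        System.out.println(namesFor("example", 2, 2));
        System.out.println(namesFor("example", 1, 0));
    }
}
```

Treating 0 as "charge unknown" is a choice made for this sketch only.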

PLEASE ALSO NOTE! DTA only handles data that includes precursor ion information, i.e. MSMS, MSMSMS, etc. but not MS. The default behavior of the DTA writer is to ignore MS peak lists. Developers can change this at runtime by setting the setIgnoreMS(boolean) flag.

Helpful links

Waters/Micromass .pkl and .raw (directory structure)

Waters uses two formats to represent mass spectrometry data via the MassLynx program. The .raw/ directory structure represents various forms of the raw data coming from a mass spectrometer. The .pkl format is a plain-text file containing one or more centroided, monoisotopic peak lists derived from a .raw/ directory structure.

Phil Andrews, University of Michigan, through collaboration with Waters, has implemented a free-to-use Java library for the IO Framework that can read all of the Waters file formats. Waters has shown great support for the IO Framework and our efforts to provide free support to the proteomics community for reading mass spec file formats. If you appreciate our efforts to support the Waters file formats, please take time to thank Waters; thanking your Waters sales rep is likely the best way to let the higher-ups know.

Special thanks to Ronan O'Malley and James Langridge at Waters.

Please note: the Waters .raw/ directory structure support is still considered unstable and is not part of the default IO Framework releases. Please contact jfalkner@umich.edu if you'd like to use this beta code.

PKL file format support is based on Micromass's PKL file format. Complete read and write support for the PKL format is implemented, including multiple peak lists in a single file.

PLEASE NOTE! PKL only handles data that includes precursor ion information, i.e. MSMS, MSMSMS, etc. but not MS. The default behavior of the PKL writer is to ignore MS peak lists. Developers can change this at runtime by setting the setIgnoreMS(boolean) flag.

Helpful links

Plain Text

Plain text (.txt) is not really an MS or MSMS data format, but it is often what you want when manipulating data by hand. Full support exists for both reading and writing peak lists in plain-text format, e.g. you can readily export a .raw file to a tab-delimited file that you can view in a text editor. Specialized support exists for both MS Windows and non-Windows users, particularly for people who want to look at spectra in Notepad.

The plain-text format follows these three rules for both the reader and writer:

  1. If it is MSMS data the first line contains three tab-delimited values: the precursor ion's m/z, intensity, and charge.
  2. Each peak is saved on its own line with the following two tab-delimited values: fragment m/z and intensity.
  3. The normal plain-text writer ends each line with just the "\n" character. The Notepad-friendly plain-text writer ends each line with the "\r\n" character combination. See Wikipedia's writeup for why this helps Windows users.

Here is an example of a peak list with a precursor m/z of 1000, an intensity of 2200, and a charge of 1, with two peaks of m/z 100 and 200 and intensities 1 and 2, respectively.

1000	2200	1
100	1
200	2

Note that the above is both an input and output format. If you wanted to make up a peak list, you could serialize your data in the above format and it'd be readable by the IO Framework and any tool that uses the IO Framework.
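If you'd like to produce or consume this format without the IO Framework, the three rules above are simple enough to sketch by hand. The following is a minimal, stand-alone illustration; the class PlainTextPeakList and its members are hypothetical names made up for this example and are not the framework's own reader or writer. It accepts either line-ending style when parsing.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone model of the plain-text peak list format. */
class PlainTextPeakList {
    boolean hasPrecursor;          // true for MSMS data (rule 1)
    double precursorMz;
    double precursorIntensity;
    int precursorCharge;
    final List<double[]> peaks = new ArrayList<double[]>(); // each entry: {m/z, intensity}

    /** Parses tab-delimited text; tolerates Unix and Windows line endings. */
    static PlainTextPeakList parse(String text) {
        PlainTextPeakList pl = new PlainTextPeakList();
        boolean firstLine = true;
        for (String rawLine : text.split("\n")) {
            String line = rawLine.trim();      // also drops a stray "\r"
            if (line.isEmpty()) {
                continue;
            }
            String[] fields = line.split("\t");
            if (firstLine && fields.length == 3) {
                // Rule 1: precursor m/z, intensity, and charge.
                pl.hasPrecursor = true;
                pl.precursorMz = Double.parseDouble(fields[0]);
                pl.precursorIntensity = Double.parseDouble(fields[1]);
                pl.precursorCharge = Integer.parseInt(fields[2]);
            } else if (fields.length == 2) {
                // Rule 2: one peak per line, m/z then intensity.
                pl.peaks.add(new double[] {
                    Double.parseDouble(fields[0]), Double.parseDouble(fields[1])});
            } else {
                throw new IllegalArgumentException("Unexpected line: " + line);
            }
            firstLine = false;
        }
        return pl;
    }

    /** Rule 3: the caller picks the line delimiter. */
    String write(String newline) {
        StringBuilder sb = new StringBuilder();
        if (hasPrecursor) {
            sb.append(precursorMz).append('\t').append(precursorIntensity)
              .append('\t').append(precursorCharge).append(newline);
        }
        for (double[] peak : peaks) {
            sb.append(peak[0]).append('\t').append(peak[1]).append(newline);
        }
        return sb.toString();
    }
}
```

Note that this sketch writes numbers using Java's default double formatting (e.g. 1000.0), which may not match the framework's output character for character.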

Applied Biosystems: .t2d, .bic, and .wiff

Applied Biosystems has a number of file formats including the T2D format used by their MALDI TOFTOF 4700 and 4800 instruments and the .bic file format used by the Voyager mass spectrometer.

.t2d
Complete support is provided for reading and writing files in the T2D format. The support is based on the ProteomeCommons.org IO-T2D project, which was created in collaboration with Applied Biosystems. Further documentation may be found in the "ProteomeCommons.org IO-T2D" project at ProteomeCommons.org.
.bic
Voyager instruments produce a .bic file and a text file for a continuous spectrum. Support for reading and writing this format is currently being worked on.
.wiff

Support exists for reading Analyst WIFF files from a QStar, QTrap 2000, or QTrap 4000 using the Analyst software libraries and the wiff2dta tool (see References below). To use this functionality you must be running Microsoft Windows, have ABI's Analyst installed, and have a local copy of the wiff2dta program. Use the following steps to get everything installed.

  1. Install ABI's Analyst software package.
  2. Download a copy of the wiff2dta program from http://sourceforge.net/projects/protms/. This should be a direct link to the download page.
  3. When using the IO Framework, do one of the following:
    • Set the JVM parameter "-Dwiff2data=[wiff2dta.exe]", replacing [wiff2dta.exe] with wherever you put the copy of the program, e.g. "-Dwiff2data=wiff2dta_1108_analyst141.exe".
    • Copy the wiff2dta program executable to your working directory, i.e. the directory you are executing Java from.
    • Copy the wiff2dta program in to your classpath and name it "wiff2dta.exe"
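For developers curious how such a lookup might work, here is a minimal sketch of checking the three locations listed above. It is an assumption-laden illustration, not the framework's actual code: the class name Wiff2dtaLocator is hypothetical, the order in which the locations are tried is a guess, and the property name "wiff2data" is simply copied from the step above.

```java
import java.io.File;
import java.net.URISyntaxException;
import java.net.URL;

/** Hypothetical lookup of the wiff2dta executable; the search order is an assumption. */
class Wiff2dtaLocator {

    static File locate(String propertyValue, File workingDir, ClassLoader loader) {
        // 1. Explicit path supplied through the JVM parameter.
        if (propertyValue != null && !propertyValue.isEmpty()) {
            File explicit = new File(propertyValue);
            if (explicit.isFile()) {
                return explicit;
            }
        }
        // 2. A copy named wiff2dta.exe in the working directory.
        File local = new File(workingDir, "wiff2dta.exe");
        if (local.isFile()) {
            return local;
        }
        // 3. A copy named wiff2dta.exe somewhere on the classpath.
        URL url = loader.getResource("wiff2dta.exe");
        if (url != null && "file".equals(url.getProtocol())) {
            try {
                return new File(url.toURI());
            } catch (URISyntaxException e) {
                return null; // unusable classpath location
            }
        }
        return null; // not found anywhere
    }

    public static void main(String[] args) {
        File exe = locate(System.getProperty("wiff2data"),
                new File(System.getProperty("user.dir")),
                Wiff2dtaLocator.class.getClassLoader());
        System.out.println(exe == null ? "wiff2dta not found" : exe.getAbsolutePath());
    }
}
```

Passing the property value and working directory as arguments, rather than reading them inside the method, keeps the sketch easy to test.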

References

Andreas M Boehm, Robert P Galvin, and Albert Sickmann, "Extractor for ESI quadrupole TOF tandem MS data enabled for high throughput batch processing", BMC Bioinformatics 2004, 5:162 doi:10.1186/1471-2105-5-162

Thermo Finnigan .raw

Read support is available for Thermo Finnigan's RAW file format via the ReAdW package from the Sashimi project. You must have Thermo Finnigan's XCalibur DLL to use this functionality. Use the following steps to enable RAW file reading.

  1. Install Thermo Finnigan's XCalibur 2.0+
  2. Download a copy of the ReAdW program from http://sashimi.sf.net
  3. Add ReAdW.exe to C:/Program Files/ProteomeCommons.org/IO/ReAdW.exe. If you have ReAdW.exe installed elsewhere, the GUI will prompt you for its location.

Special Note: When using the ReAdW tool to read .raw files, an mzXML file is generated by ReAdW and then converted by the IO Framework into the desired format. If you want mzXML and you want the exact mzXML that ReAdW produces, select the "Don't Convert" option.

ISB/PSI-led mzData and mzXML merge effort: dataXML

Efforts led by the ISB and PSI are underway to merge the mzXML and mzData file formats. The working name for the new file format is dataXML. Beta support exists for this file format; however, very few tools currently use it, and it will certainly change before the final version is released. You are not encouraged to use the beta dataXML format unless you simply want to see what it looks like.

Select the "dataXML (beta)" file format option to use dataXML file format. Both read and write support exists; however, full support for dataXML's meta-data is not exposed through the GUI.

mzXML

mzXML is an open, XML-based file format designed by the Institute for Systems Biology (ISB). The framework supports reading peak list or spectral data in *.mzXML files as described in the mzXML schema and documentation. Proper stream-based XML parsing is used, so mzXML files of any size can be read. Proper Base64 decoding is also used, allowing for data in either 32-bit or 64-bit precision.
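To illustrate why stream-based parsing copes with files of any size, here is a small sketch that counts the scans of a given MS level without ever holding the whole document in memory. It uses the JDK's StAX API as a stand-in; we are not claiming this is the parser the framework itself uses, and the class name MzXmlScanCounter is hypothetical. The "scan" element and its "msLevel" attribute come from the mzXML schema.

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical example: counts mzXML scans of one MS level using streaming XML. */
class MzXmlScanCounter {

    static int countScans(InputStream in, int msLevel) {
        String wanted = String.valueOf(msLevel);
        int count = 0;
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            try {
                // Only the current event is held in memory, never the whole file.
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "scan".equals(reader.getLocalName())
                            && wanted.equals(reader.getAttributeValue(null, "msLevel"))) {
                        count++;
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Not well-formed XML", e);
        }
        return count;
    }
}
```

Because mzXML nests MSMS scans inside their parent MS scan, a streaming reader like this sees every scan regardless of nesting depth.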

mzXML support by version

mzXML version 1.1.1
Near complete read and write support is implemented for mzXML 1.1.1 with lack of complete support for spotting and separation information. Any document with the mzXML 1.1.1 schema declared will be automatically read. Alternatively, files ending in .1.1.1.mzxml.xml will automatically be treated as files in mzXML 1.1.1 format.
mzXML version 2.0
Near complete read and write support is implemented for mzXML 2.0 with lack of complete support for spotting and separation information. Any document with the mzXML 2.0 schema declared will be automatically read. Alternatively, files ending in .2.0.mzxml.xml will automatically be treated as files in mzXML 2.0 format.
mzXML version 2.1
Near complete read and write support is implemented for mzXML 2.1 with lack of complete support for spotting and separation information. Any document with the mzXML 2.1 schema declared will be automatically read. Alternatively, files ending in .2.1.mzxml.xml will automatically be treated as files in mzXML 2.1 format.

Helpful mzXML links.

Proteomics Standards Initiative (PSI) .mzData

mzData is an open, XML based file format designed by US HUPO. The framework supports reading and writing data that is in the mzData file format as defined by the schema and accompanying documentation. Proper stream-based XML parsing is used, allowing for mzData files of any size to be used. Proper Base64 decoding is also used, allowing for data in either big endian or little endian format and data in either 32 bit or 64 bit precision.
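The byte order and precision handling mentioned above can be sketched with nothing but the JDK. This is a minimal illustration under stated assumptions: the class name Base64PeakDecoder is hypothetical, and the caller is assumed to already know the precision and byte order (in a real file these are declared alongside the encoded data).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Base64;

/** Hypothetical helper: decodes a Base64 block of 32-bit or 64-bit floating point values. */
class Base64PeakDecoder {

    static double[] decode(String base64, int precisionBits, ByteOrder order) {
        if (precisionBits != 32 && precisionBits != 64) {
            throw new IllegalArgumentException("Precision must be 32 or 64 bits");
        }
        // The MIME decoder ignores line breaks that may appear inside XML text.
        byte[] raw = Base64.getMimeDecoder().decode(base64);
        int width = precisionBits / 8;
        if (raw.length % width != 0) {
            throw new IllegalArgumentException("Data length is not a multiple of " + width);
        }
        ByteBuffer buffer = ByteBuffer.wrap(raw).order(order);
        double[] values = new double[raw.length / width];
        for (int i = 0; i < values.length; i++) {
            // Widen 32-bit floats to double so callers deal with one type.
            values[i] = precisionBits == 32 ? buffer.getFloat() : buffer.getDouble();
        }
        return values;
    }
}
```

For example, decode(text, 32, ByteOrder.BIG_ENDIAN) would handle 32-bit data stored in network byte order.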

mzData version 1.04
Read support for mzData 1.04 is provided by the mzData 1.05 reader described below. Write support is currently not available for mzData 1.04.
mzData version 1.05
Read and write support exist for the mzData 1.05 format. Any document with the mzData 1.05 schema declared will be automatically read. Alternatively, files ending in ..mzdata will automatically be treated as files in mzData 1.05 format.

Helpful mzData links.

National Institute of Standards and Technology (NIST) Library of Peptide Ion Fragmentation Spectra .msp

Steve Stein from the NIST has been working on creating high quality spectral libraries of MS/MS data. This data is available in several different formats along with a search tool. You can download the data from several locations including ProteomeCommons.org. The primary format intended for others' use is a custom, simple, plain-text format that uses the '.msp' extension.

The IO Framework includes read support for data in the .msp file format. Along with the standard MS/MS peak list information, the file format also includes information about the peptide sequence that most likely matches each spectrum. If you are interested in accessing this peptide sequence information, you can cast any PeakList object returned by the MSPPeakListReader class to MSPPeakList as illustrated below.

PeakListReader plr = GenericPeakListReader.getPeakListReader("C:/data/example.msp");
// get the abstract PeakList object -- no information about peptide sequence
PeakList pl = plr.getPeakList();
// cast to MSPPeakList to get peptide sequence info, if desired
if (pl instanceof MSPPeakList) {
  MSPPeakList msppl = (MSPPeakList) pl;
  msppl.getPeptides();
}
...

X!Tandem Output Files (aka TheGPM output) .xml, .thegpm.xml, .tandem.xml, .amethyst.xml, .opal.xml, .jasper.xml

The X!Tandem MS/MS search engine, which is used by TheGPM, produces output files that include both the identified peptide and protein sequences and the MS/MS peak lists that were identified. Several data sets, including Jasper, Amethyst, and Quartz, are published by TheGPM.org in this format. The ProteomeCommons.org IO Framework can be used to read through X!Tandem output files and retrieve the included peak lists. This is a convenient method of converting peak lists that X!Tandem identifies into a format that may be more familiar, such as MGF or DTA.

Peptide and protein information associated with peak lists in X!Tandem output files is included in PeakList objects returned by the X!Tandem PeakListReader implementation. If you wish to access associated peptide sequences, cast the PeakList object to an XTandemOutputPeakList as illustrated below.

PeakListReader plr = GenericPeakListReader.getPeakListReader("C:/data/example.tandem.xml");
// get the abstract PeakList object -- no information about peptide sequence
PeakList pl = plr.getPeakList();
// cast to XTandemOutputPeakList to get peptide sequence info, if desired
if (pl instanceof XTandemOutputPeakList) {
  XTandemOutputPeakList xtopl = (XTandemOutputPeakList) pl;
  xtopl.getPeptides();
}
...

Auto-compression/decompression support

Support exists for seamlessly using several compression algorithms with peak list formats. These compression formats are not peak list formats themselves, rather formats that can compress existing peak list files. In cases of the very verbose XML formats such as mzData and mzXML the compression can save lots of space.

ZIP

Complete read/write support exists for the ZIP format. Any file ending with .<extension>.zip, e.g. example.mgf.zip uses .mgf as the extension, will be read or serialized assuming use of the ZIP format.

The ZIP algorithm supports more than one file bundled together. The IO Framework will only read the first peak list included in a ZIP with multiple files. Note, the ZIP algorithm is possibly the most widely supported algorithm, but it normally doesn't provide compression as good as LZMA or bzip2.
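The "first file only" behavior can be pictured with the JDK's own ZIP classes. The sketch below is an illustration, not the framework's implementation: the class FirstZipEntry and its methods are hypothetical, and treating "first peak list" as "first non-directory entry" is our assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical example: reads only the first file stored in a ZIP archive. */
class FirstZipEntry {

    /** Builds an in-memory ZIP so the example needs no files on disk. */
    static byte[] zip(String[] names, String[] contents) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
            for (int i = 0; i < names.length; i++) {
                zout.putNextEntry(new ZipEntry(names[i]));
                zout.write(contents[i].getBytes(StandardCharsets.UTF_8));
                zout.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Returns the text of the first non-directory entry, or null if there is none. */
    static String readFirst(byte[] zipBytes) {
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry = zin.getNextEntry();
            while (entry != null && entry.isDirectory()) {
                entry = zin.getNextEntry();
            }
            if (entry == null) {
                return null;
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            // read() returns -1 at the end of the current entry, not the archive.
            while ((n = zin.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If you need every peak list from a multi-file ZIP, unpacking the archive first and converting the files individually is the safer route.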

Here are some helpful ZIP links

GZIP

Complete read/write support exists for the GZIP format. Any file ending with .<extension>.gzip or .<extension>.gz, e.g. example.mgf.gz uses .mgf as the extension, will be read or serialized assuming use of the GZIP format.
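The extension convention can be sketched as follows: strip the compression suffix to find the underlying peak list extension, and decompress only when the suffix calls for it. The class GzipPeakListFiles and its methods are hypothetical names for this example, and matching the suffix case-insensitively is an assumption rather than documented framework behavior.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical example of suffix-based GZIP handling. */
class GzipPeakListFiles {

    static boolean isGzipName(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return lower.endsWith(".gz") || lower.endsWith(".gzip");
    }

    /** "example.mgf.gz" and "example.mgf" both yield ".mgf". */
    static String innerExtension(String fileName) {
        String stem = isGzipName(fileName)
                ? fileName.substring(0, fileName.lastIndexOf('.'))
                : fileName;
        int dot = stem.lastIndexOf('.');
        return dot < 0 ? "" : stem.substring(dot);
    }

    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads text, transparently decompressing when the name has a GZIP suffix. */
    static String readText(String fileName, byte[] data) {
        try (InputStream in = isGzipName(fileName)
                ? new GZIPInputStream(new ByteArrayInputStream(data))
                : new ByteArrayInputStream(data)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```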

Here are some helpful GZIP links.

bzip2

Complete read/write support exists for the BZIP2 format. Any file ending with .<extension>.bzip2, e.g. example.mgf.bzip2 uses .mgf as the extension, will be read or serialized assuming use of the BZIP2 format.

Here are some helpful BZIP2 links.

LZMA

Complete read/write support exists for the LZMA format. Any file ending with .<extension>.LZMA, e.g. example.mgf.LZMA uses .mgf as the extension, will be read or serialized assuming use of the LZMA format.

LZMA is generally the best compression algorithm available; however, the benefit is normally marginal (5-10%) compared to the other formats.

Here are some helpful LZMA links.

Frequently Asked Questions (FAQ)

This is a list of frequently asked questions based on user e-mails from the ProteomeCommons.org e-mail list.

What does the "Don't Convert" option do?

A few of the file formats supported by the IO Framework rely on other tools, namely ReAdW for .raw files and CompassXport for Bruker files. In these cases the associated tool makes an mzXML file and the IO Framework converts that file to the desired final file format.

The "Don't Convert" option tells the IO Framework to leave the mzXML file alone. Instead of using the mzXML as an intermediate file a copy of it will be saved in the selected output directory.

What does the "Merge" option do?

The "Merge" option will concatenate all of the input files into one single output file. This is handy for squeezing many files into a single .mgf for Mascot searches and the like.

How do I disable that annoying popup box, aka "Unhandled Win32 Exception" or "Windows Error Reporting"?

A popup box may appear whenever native Windows code throws an exception, which may happen when using native programs such as ReAdW.exe to read data in the RAW file format. This popup box is the Windows Just-In-Time (JIT) debugger, and it mostly appears on machines that have Visual Studio installed. To disable this debugger, delete the following registry keys as specified by the MSDN documentation.

Disabling JIT-attach Debugging

After Visual Studio is installed on a server, the default behavior when an unhandled exception occurs is to show an Exception dialog that requires user intervention to either start Just-In-Time debugging or ignore the exception. This may be undesirable for unattended operation. To configure the server to no longer show a dialog when an unhandled exception occurs (the default behavior prior to installing Visual Studio), use the registry editor to delete the following registry keys:

If you are a developer, you'll probably want to enable the JIT when coding. To do this, follow the below instructions from MSDN.

Enabling JIT-attach Debugging

JIT-attach debugging is the phrase used to describe attaching a debugger to an executable image that throws an uncaught exception. In unmanaged code, it is what happens when you see a message box that invites you to:

If you click CANCEL, a debugger is started and attached to the process. The registry key that controls this is called HKEY_LOCAL_MACHINE\Software\Microsoft\Windows NT\CurrentVersion\AeDebug.

For an application that includes managed code, the common language runtime will present a similar dialog to JIT-attach a debugger. The registry key that controls this option is called HKEY_LOCAL_MACHINE\Software\Microsoft\.NETFramework\DbgJITDebugLaunchSetting.

Finally, if you are doing batch processing of peak lists and you don't want it to be interrupted at all, you can also disable Windows XP error reporting. Follow the instructions described on MSDN also summarized below.

How to Configure and Use Error Reporting

You can enable, disable, or modify the way that error reporting works on a Windows XP-based computer. When an error occurs, a dialog box is displayed that prompts you to report the problem to Microsoft. If you want to report the problem, technical information about the problem is sent to Microsoft over the Internet. You must be connected to the Internet to use the feature. If a similar problem has been reported by other users and information about the problem is available, you receive a link to a Web page that contains information about the problem.

To access the settings for the reporting feature:

To configure the error reporting feature:

Where is that peak list conversion tool?

The on-line GUI tool that converts peak lists is here.

Why doesn't "java -jar ProteomeCommons.org-IO.jar" work?

The JAR file included with the code is not executable. It is intended to be used as a Java library, i.e. "java -cp ProteomeCommons.org-IO.jar ". If you'd like to run the example programs from the command line, try executing "java -cp ProteomeCommons.org-IO.jar org.proteomecommons.io.util.PrintPeakList peaklist-file", where you replace peaklist-file with the location of a real peak list.

Where do I get help?

Generally speaking, this is free, open-source code. You don't get free tech support with your download. However, there are options for getting help. The best help comes straight from the people who made this code. If you are willing to hire the core developers to help with your problem or to teach you how the framework works, contact Jayson Falkner. If you want free help, use the e-mail list at ProteomeCommons.org. Most of the developers and users of this framework are on that e-mail list, and it is the appropriate place to look for free advice.

Links

This is a list of links that reference other projects that can read/write various mass spectrometry file formats. In most cases these are commercial tools that are either bundled with mass spectrometers or that cost a significant amount of money. If you know of a tool that isn't on this list, please send the information to this project's developers.

Todo List

This is the list of features that need to be done. If you are interested in helping out, please get in touch.

Licensing Information

The goal of this project is to provide free, open-source code that anyone may use as they please; however, everything in this project is strictly licensed under the Apache 2.0 license. This protects the authors and contributors of this project, and it encourages fair use of the code. A copy of the Apache 2.0 license may be found on-line at http://www.apache.org/licenses/LICENSE-2.0 or as plain text in the LICENSE.txt file included in this archive.

In addition to the Apache 2.0 license it is requested that any person or organization that uses the ProteomeCommons.org IO Framework properly reference this project. In at least one public, appropriate place, please note that the "ProteomeCommons.org IO Framework" is being used, and provide a URL to http://www.proteomecommons.org/current/531.

If you have further questions or comments regarding this project, please contact Jayson Falkner.