About FASTA File Format
This page describes the FASTA (a.k.a. Fast-A) file format and how it is relevant to ProteomeCommons.
What is FASTA?
FASTA is a text-based format for representing either nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. Each sequence can be preceded by a header line that carries the sequence name and comments. Because the files are plain text, text-parsing programs are used to manipulate the information they contain.

There are some pitfalls to the FASTA file format. ProteomeCommons is taking part in efforts to improve the format and the way it is used.

Traditional FASTA
Traditionally, FASTA files use a non-standardized header line or lines, which always begin with the greater-than symbol ('>'). The information contained in the header differs depending on the institution that created the file.
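Reading a traditional FASTA file comes down to splitting the text on lines that begin with '>'. The following is a minimal sketch in Python, not part of ProteomeCommons or Tranche. The function names `read_fasta` and `split_pipe_header` are hypothetical, and the second function assumes one particular pipe-delimited header style, which is exactly the kind of institution-specific parsing the traditional format forces on its users.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; the header is returned without '>'."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # sequence data may be wrapped over several lines
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_pipe_header(header: str) -> Tuple[List[str], str]:
    """Split a header of the assumed form 'a|b|c|d| free text'.

    Assumption: identifiers are pipe-delimited and end at the first space.
    Headers from other institutions will need different rules.
    """
    ids, _, description = header.partition(" ")
    return [field for field in ids.split("|") if field], description.strip()
```

The reader itself works for any institution's file; only the header-splitting step has to be rewritten per source.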
Example
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

FASTA files are constantly being updated by the institutions that maintain them. In general, a new release of the same FASTA files is made available each month. Not every institution keeps archives of all previously released files.

Pitfalls of Traditional FASTA
Can't globally parse header line. Because each institution has its own header format, the parsing code for each file must be custom-tailored to the institution from which the file came. This is a time-wasting exercise for both the institution and the researcher. HUPO's Proteomics Standards Initiative has created a standard FASTA format, which ProteomeCommons supports. This format is described fully on the Standard Initiative page.

Citing data that is no longer available. All FASTA data is made available by the institution that created it. Not all institutions keep archives of old FASTA data files for reference. Many simply provide a way for the researcher to be notified of an update, and concern themselves only with the up-to-date data. The major problem occurs when a researcher bases their research on an older set of data than is available online. The researcher must cite their use of the data, but the data can no longer be downloaded from the institution and thus cannot be independently verified.

Tranche is a file-sharing network that can be used to solve this problem. The Tranche network is a secure and reliable way of storing data online. Instead of being relied upon to keep their own archives of FASTA files, institutions could upload their data to Tranche and link it with the older version of the data.
One function of Tranche is to link archived data with all versions of that data, so it is easy to get the most up-to-date version. Furthermore, the data remains available, and the data version is citable with the data's Tranche hash (the way you reference Tranche data).

There are other perks to using the Tranche network to store, retrieve, and reference FASTA files. The Tranche Graphical User Interface includes software modules that make it very easy to manipulate FASTA files, for example creating concatenated or reversed FASTA files with only a couple of clicks.

HUPO Standardized FASTA Format
Note: not all issues have been worked out. The parts of this page that are still in question are shown in red text.
File Header
An example of the FASTA file header with two databases added together (two components):
#\Dbcomponent=1
#\Name=UniProt_SwissProt
#\PrimaryIdentifierType=sp_ac
#\Version=52.3
#\ReleaseDate=20070425
#\NumberOfEntries=248942
#\Sequence_type=Protein_sequence
#\Dbcomponent=2
#\Name=ENSEMBL
#\PrimaryIdentifierType=sp_ac
#\Version=12.45.3.2
#\ReleaseDate=20070425
#\NumberOfEntries=1234567
#\Sequence_type=Protein_sequence

Individual Protein Entries
Example protein entry
>sp_ac|P02769_WOSIG0 \ID=ALBU_BOVIN \DE="Serum albumin precursor (Allergen Bos d 6) (BSA)" \NCBITAXID=9913 \MODRES=(1|Acetyl) \VARIANT=(196|A|T) \LENGTH=589
RGVFRRDTHKSEIAHRFKDLGEEHFKGLVLIAFSQYLQQCPFDEHVKLVNELTEFAKTCV
ADESHAGCEKSLHTLFGDELCKVASLRETYGDMADCCEKQEPERNECFLSHKDDSPDLPK
LKPDPNTLCDEFKADEKKFWGKYLYEIARRHPYFYAPELLYYANKYNGVFQECCQAEDKG
ACLLPKIETMREKVLASSARQRLRCASIQKFGERALKAWSVARLSQKFPKAEFVEVTKLV
TDLTKVHKECCHGDLLECADDRADLAKYICDNQDTISSKLKECCDKPLLEKSHCIAEVEK
DAIPENLPPLTADFAEDKDVCKNYQEAKDAFLGSFLYEYSRRHPEYAVSVLLRLAKEYEA
TLEECCAKDDPHACYSTVFDKLKHLVDEPQNLIKQNCDQFEKLGEYGFQNALIVRYTRKV
PQVSTPTLVEVSRSLGKVGTRCCTKPESERMPCTEDYLSLILNRLCVLHEKTPVSEKVTK
CCTESLVNRRPCFSALTPDETYVPKAFDEKLFTFHADICTLPDTEKQIKKQTALVELLKH
KPKATEEQLKTVMENFVAFVDKCCAADDKEACFAVEGPKLVVSTQTALA
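Because every annotation in the standardized format is written as a backslash, a key, an equals sign and a value, a single set of rules can read headers from any contributing database. The sketch below, in Python, illustrates this. It is not an official parser: the grammar is inferred only from the examples on this page (file header lines of the form `#\Key=Value`, entry headers of the form `>type|accession \KEY=value ...`, with double quotes around values that contain spaces), and the names `parse_file_header` and `parse_entry_header` are hypothetical. Since parts of the format are still in question, the details may differ in the final specification.

```python
import re
from typing import Dict, List, Tuple

# Assumed grammar: '\KEY=' followed by a double-quoted value or a run of non-spaces.
FIELD = re.compile(r'\\(\w+)=("[^"]*"|\S+)')
# Assumed grammar for file header lines: '#\Key=Value'.
FILE_HEADER = re.compile(r'^#\\(\w+)=(.*)$')


def parse_file_header(lines: List[str]) -> List[Dict[str, str]]:
    """Group '#\\Key=Value' lines into one dict per database component."""
    components: List[Dict[str, str]] = []
    for line in lines:
        match = FILE_HEADER.match(line.strip())
        if not match:
            continue  # not a file header line
        key, value = match.groups()
        if key == "Dbcomponent":
            components.append({})  # each Dbcomponent line starts a new component
        if components:
            components[-1][key] = value
    return components


def parse_entry_header(line: str) -> Tuple[str, str, Dict[str, str]]:
    """Return (identifier type, accession, annotation fields) for one entry."""
    if not line.startswith(">"):
        raise ValueError("entry header must start with '>'")
    identifier, _, rest = line[1:].partition(" ")
    id_type, _, accession = identifier.partition("|")
    fields = {key: value.strip('"') for key, value in FIELD.findall(rest)}
    return id_type, accession, fields
```

Applied to the example entry above, `parse_entry_header` would return the identifier type `sp_ac`, the accession `P02769_WOSIG0`, and a dictionary with keys such as `ID`, `DE` and `NCBITAXID`. Values like `(196|A|T)` are left as raw strings, since their internal structure is not described here.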