
About FASTA File Format

This page describes the FASTA (a.k.a. Fast-A) file format and how it is relevant to ProteomeCommons.


FASTA



What is FASTA?

FASTA is a text-based format for representing nucleic acid or peptide sequences, in which nucleotides or amino acids are represented by single-letter codes. Each sequence can be preceded by a header line that holds the sequence name and comments. Because the files are plain text, text-parsing programs are used to manipulate the information they contain.

There are some pitfalls to the FASTA file format. ProteomeCommons is taking part in efforts to improve the format and the way it is used.


Traditional FASTA

Traditionally, FASTA files use a non-standardized header line or lines, which always begin with the greater-than symbol ('>'). The information contained in the header differs depending on the institution that created the file.

Example
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
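
The layout above (one '>' header line followed by a sequence wrapped over several lines) can be read with a few lines of code. The following is a minimal sketch in Python, not a reference implementation; the function name read_fasta is made up for this illustration, and it assumes one header line per entry.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # finish the previous entry
            header = line[1:]  # drop the leading '>'
            chunks = []
        elif header is not None:
            chunks.append(line)  # a sequence may wrap over many lines
    if header is not None:
        yield header, "".join(chunks)  # finish the last entry


example = """>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
"""

records = list(read_fasta(example.splitlines()))
for header, sequence in records:
    print(header)
    print(len(sequence), "residues")
```

The header is returned as an uninterpreted string: what it means still depends on who produced the file.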

FASTA files are constantly being updated by the institutions that maintain them. In general, a new release of the same FASTA files will be made available each month. Not every institution will keep archives of all previously released files.


Pitfalls of Traditional FASTA

Header lines cannot be parsed globally. Because each institution has its own header format, the code that parses these files must be custom-tailored to the institution from which each file came. This wastes time for both the institution and the researcher.
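
To illustrate the problem, here is a hedged Python sketch of a parser written for one header style only: the gi|number|database|accession| layout of the example entry shown earlier. The function name and the dictionary keys are made up for this illustration, and the field meanings are an assumption based on that single example. A header from another institution is rejected by it.

```python
from typing import Dict, Optional


def parse_gi_header(header: str) -> Dict[str, Optional[str]]:
    """Parse a header of the form gi|<number>|<db>|<accession>| description [organism]."""
    text = header.lstrip(">")
    parts = text.split("|", 4)
    if len(parts) != 5 or parts[0] != "gi":
        raise ValueError("not a gi|number|db|accession| style header")
    _, gi_number, db_tag, accession, rest = parts
    rest = rest.strip()
    description: str = rest
    organism: Optional[str] = None
    # Assumption: a trailing [...] holds the organism name.
    if rest.endswith("]") and "[" in rest:
        start = rest.rindex("[")
        organism = rest[start + 1:-1]
        description = rest[:start].strip()
    return {
        "gi": gi_number,
        "db": db_tag,
        "accession": accession,
        "description": description,
        "organism": organism,
    }


fields = parse_gi_header(
    ">gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]"
)
print(fields)

# A header in a different style does not fit this parser at all.
try:
    parse_gi_header(">sp_ac|P000761 some other layout")
    other_style_ok = True
except ValueError:
    other_style_ok = False
print("other style parsed:", other_style_ok)
```

Every additional header style needs another function like this one, which is the duplication of effort described above.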

HUPO's Proteomics Standards Initiative has created a standard FASTA format which ProteomeCommons supports. This format is described fully on the Standard Initiative page.

Citing data that is no longer available. FASTA data is made available by the institution that created it, and not every institution keeps archives of its old FASTA files for reference. Many simply notify researchers when the data is updated and concern themselves only with the current release. The major problem occurs when a researcher bases their work on an older data set than the one available online. The researcher must cite the data they used, but that data can no longer be downloaded from the institution and thus cannot be independently verified.

Tranche is a file-sharing network that can be used to solve this problem. The Tranche network is a secure and reliable way of storing data online. Instead of each institution having to keep its own archive of FASTA files, institutions could upload each release to Tranche and link it with the older versions of the data. Because Tranche links archived data with all versions of that data, it is easy to get the most up-to-date version. Furthermore, every version remains available and can be cited by its Tranche hash (the way Tranche data is referenced).

There are other perks to using the Tranche network to store, retrieve, and reference FASTA files. The Tranche Graphical User Interface contains software modules that make it very easy to manipulate FASTA files. With these modules, concatenated or reversed FASTA files can be created with only a couple of clicks.
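
For readers who want to see what a reversed and concatenated FASTA file amounts to, here is a minimal Python sketch. It does not show how the Tranche modules work internally; the function names, the "REV_" header prefix, and the toy record are assumptions made for this illustration.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def reverse_records(records: List[Record], prefix: str = "REV_") -> List[Record]:
    """Return copies of the records with each sequence reversed.

    The prefix marks the reversed entries; "REV_" is a hypothetical choice.
    """
    return [(prefix + header, sequence[::-1]) for header, sequence in records]


def format_fasta(records: List[Record], width: int = 60) -> str:
    """Render records as FASTA text, wrapping sequences at `width` characters."""
    lines: List[str] = []
    for header, sequence in records:
        lines.append(">" + header)
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Toy record, not real protein data.
targets = [("protein_1", "ACDEFGHIKL")]
decoys = reverse_records(targets)

# Concatenate the original and reversed entries into one file body.
combined = format_fasta(targets + decoys, width=5)
print(combined)
```

Writing `combined` to a file gives a single FASTA file that holds both the original and the reversed entries.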


HUPO Standardized FASTA Format

Note: not all issues have been worked out. The parts of this page that are still in question are shown in red text.
Among the other items that haven't been worked out:

  • FASTA file extension (.fasta, .seqa, or something else?)
  • Do we always need DBPrefix in all entries, even if only one database?
  • Taxonomy information in protein headers: NCBI_TaxID > Latin name > common name (mapping changes regularly)
  • Splicing variants: separate entries or annotations?
  • Processed sequences: separate entries or annotation? (removal of precursor peptide, active chain, …)
  • CV of database names
  • CV of internal terms

A source file includes:
  • A file header block with information about the included database(s)
    • All lines in the block start with the '#' character
    • One header term from the list below per line
  • Individual protein (polypeptide) entries
    • One header line that starts with the '>' character
    • The sequence


File Header

Terms for the header      Description                                                      Value
#\DbComponent=            Count increment                                                  Integer
#\Name=                   Name of the database                                             CV from database provider (UniprotKnowledgeBase)
#\PrimaryIdentifierType=  Identifier to be used as prefix for individual protein entries   CV
#\Decoy=                  Is it a decoy database?                                          true/false or description
#\Version=                Database version, according to the database provider             According to the database provider
#\ReleaseDate=            The date of the source database
#\NumberOfEntries=        Number of entries                                                Integer
#\Sequence_type=          Sequence type                                                    DNA, AA, RNA, EST, etc.

An example of the FASTA file header with two databases added together (two components):
#\DbComponent=1
#\Name=UniProt_SwissProt
#\PrimaryIdentifierType=sp_ac
#\Version=52.3
#\ReleaseDate=20070425
#\NumberOfEntries=248942
#\Sequence_type=Protein_sequence

#\DbComponent=2
#\Name=ENSEMBL
#\PrimaryIdentifierType=sp_ac
#\Version=12.45.3.2
#\ReleaseDate=20070425
#\NumberOfEntries=1234567
#\Sequence_type=Protein_sequence
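
A header block like this one can be split into one dictionary per database component. The following Python sketch is an illustration under stated assumptions, not part of the draft standard: the function name parse_file_header is made up, values are kept as plain strings, and the DbComponent term is matched case-insensitively because its capitalization may vary between files.

```python
from typing import Dict, Iterable, List


def parse_file_header(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Collect the '#\\term=value' lines of a file header, grouped by component."""
    components: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            break  # first protein entry: the header block is over
        if not line.startswith("#\\"):
            continue  # blank line or anything that is not a header term
        key, sep, value = line[2:].partition("=")
        if not sep:
            continue  # no '=' sign, so not a term=value line
        # A DbComponent term opens a new component (assumption: it comes first).
        if key.lower() == "dbcomponent" or not components:
            components.append({})
        components[-1][key] = value
    return components


header_block = r"""#\DbComponent=1
#\Name=UniProt_SwissProt
#\PrimaryIdentifierType=sp_ac
#\Version=52.3
#\ReleaseDate=20070425
#\NumberOfEntries=248942
#\Sequence_type=Protein_sequence

#\DbComponent=2
#\Name=ENSEMBL
#\PrimaryIdentifierType=sp_ac
#\Version=12.45.3.2
#\ReleaseDate=20070425
#\NumberOfEntries=1234567
#\Sequence_type=Protein_sequence
"""

components = parse_file_header(header_block.splitlines())
for component in components:
    print(component["Name"], component["Version"], component["NumberOfEntries"])
```

Converting NumberOfEntries to an integer or ReleaseDate to a date is left out on purpose, since the value formats are among the items still under discussion.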


Individual Protein Entries

Description of the individual entry header line, with examples:
  • The header starts with '>', followed by the primary AC, preceded by the database prefix (useful if more than one DB is concatenated). Mandatory field. Example: >sp_ac|P000761
  • All non-sequence information is given as \term=value (terms are controlled vocabulary descriptors). Example: \ID=ALBU_HUMAN
  • The order of the additional fields is not important.
  • A value can be a list. The elements of the list are represented as (value_1)(value_2). Example: \ALTERNATE_AC=(P00786)(Q22222)
  • A value can be embedded in " " if needed. Example: \DE="Human serum albumin"
  • '|' can be used as a separator for all individual fields. Example: \MODRES=(1|Acetyl)
  • Ctrl-A as a separator for multi-header entries? (NCBInr use case)

Header Field Term  Definition                               Format
ALT_AC             Alternative AC
ID                 SwissProt_ID
DE                 protein description
ALT_DE             alternative description
NCBITAXID          NCBI taxonomy identifier (9606)          integer
TAX_LATIN          taxonomy in Latin name (Homo sapiens)
TAX_COM            taxonomy in common name format (human)
MODRES             modified residue (PTM)                   (position|modification) (PSI_MOD)
VARIANT            residue mutation                         (position|original residue|final residue)

Example protein entry
>sp_ac|P02769_WOSIG0 \ID=ALBU_BOVIN \DE="Serum albumin precursor (Allergen Bos d 6) (BSA)" \NCBITAXID=9913 \MODRES=(1|Acetyl) \VARIANT=(196|A|T) \LENGTH=589
RGVFRRDTHKSEIAHRFKDLGEEHFKGLVLIAFSQYLQQCPFDEHVKLVNELTEFAKTCV
ADESHAGCEKSLHTLFGDELCKVASLRETYGDMADCCEKQEPERNECFLSHKDDSPDLPK
LKPDPNTLCDEFKADEKKFWGKYLYEIARRHPYFYAPELLYYANKYNGVFQECCQAEDKG
ACLLPKIETMREKVLASSARQRLRCASIQKFGERALKAWSVARLSQKFPKAEFVEVTKLV
TDLTKVHKECCHGDLLECADDRADLAKYICDNQDTISSKLKECCDKPLLEKSHCIAEVEK
DAIPENLPPLTADFAEDKDVCKNYQEAKDAFLGSFLYEYSRRHPEYAVSVLLRLAKEYEA
TLEECCAKDDPHACYSTVFDKLKHLVDEPQNLIKQNCDQFEKLGEYGFQNALIVRYTRKV
PQVSTPTLVEVSRSLGKVGTRCCTKPESERMPCTEDYLSLILNRLCVLHEKTPVSEKVTK
CCTESLVNRRPCFSALTPDETYVPKAFDEKLFTFHADICTLPDTEKQIKKQTALVELLKH
KPKATEEQLKTVMENFVAFVDKCCAADDKEACFAVEGPKLVVSTQTALA
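
Because every entry header follows the same \term=value pattern, one parser can serve all files in this format. Below is a minimal Python sketch of such a parser, applied to the header line of the example entry. The function name, the regular expression, and the "db_prefix" and "accession" keys are assumptions made for this illustration; the sketch does not handle the still-open Ctrl-A multi-header case.

```python
import re
from typing import Dict, List, Union

Value = Union[str, List[List[str]]]

# One field: backslash, term, '=', then a quoted string, a list like
# (a|b)(c|d), or a bare value without whitespace.
FIELD_RE = re.compile(r'\\(\w+)=("[^"]*"|(?:\([^)]*\))+|\S+)')
ITEM_RE = re.compile(r"\(([^)]*)\)")


def parse_entry_header(line: str) -> Dict[str, Value]:
    """Parse one standardized entry header line into a dictionary."""
    if not line.startswith(">"):
        raise ValueError("an entry header must start with '>'")
    identifier, _, rest = line[1:].partition(" ")
    db_prefix, sep, accession = identifier.partition("|")
    if not sep:
        raise ValueError("expected <database prefix>|<accession>")
    entry: Dict[str, Value] = {"db_prefix": db_prefix, "accession": accession}
    for term, raw in FIELD_RE.findall(rest):
        if raw.startswith('"'):
            entry[term] = raw[1:-1]  # strip the surrounding quotes
        elif raw.startswith("("):
            # Each (...) element becomes a list of its '|'-separated parts.
            entry[term] = [item.split("|") for item in ITEM_RE.findall(raw)]
        else:
            entry[term] = raw
    return entry


header_line = (
    r'>sp_ac|P02769_WOSIG0 \ID=ALBU_BOVIN '
    r'\DE="Serum albumin precursor (Allergen Bos d 6) (BSA)" '
    r'\NCBITAXID=9913 \MODRES=(1|Acetyl) \VARIANT=(196|A|T) \LENGTH=589'
)

entry = parse_entry_header(header_line)
for term, value in entry.items():
    print(term, "=", value)
```

Since the order of the additional fields is not important, the result is keyed by term rather than by position.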


