
About FASTA - HUPO Standard Format

There are some pitfalls to the traditional FASTA file format. ProteomeCommons supports the efforts of the Human Proteome Organisation's Proteomics Standards Initiative (HUPO-PSI) to improve the format and the way it is used. This page describes the standardized FASTA format.


FASTA


Note: not all issues have been worked out. The parts of this page that are still in question are shown in red text.
Among the other items that haven't been worked out:

  • FASTA file extension (.fasta, .seqa, or something else?)
  • Do we always need DBPrefix in all entries, even if only one database?
  • Taxonomy information in protein headers: NCBI_TaxID > Latin name > common name (mapping changes regularly)
  • Splice variants: separate entries or annotations?
  • Processed sequences: separate entries or annotations? (removal of precursor peptide, active chain, …)
  • Controlled vocabulary (CV) of database names
  • Controlled vocabulary (CV) of internal terms


HUPO Standardized FASTA Format

A source file that includes:
  • A file header-block with information about the included database(s)
    • All lines in the block start with the '#' character
    • One header term from the list below per line
  • Individual protein (polypeptide) entries
    • One header line that starts with the '>' character
    • The sequence
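
As a rough illustration of this layout, the following Python sketch separates the '#' header-block lines from the '>' entries. The function name split_fasta is hypothetical and not part of the proposal. The sketch assumes that a sequence may span several lines and that blank lines can be ignored.

```python
def split_fasta(text):
    """Split file text into (header_lines, entries).

    header_lines: every line that starts with '#'.
    entries: list of (header_without_gt, sequence) pairs.
    """
    header_lines = []
    entries = []
    current_header = None
    seq_parts = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no meaning
        if line.startswith("#"):
            header_lines.append(line)
        elif line.startswith(">"):
            # A new entry begins; store the previous one, if any.
            if current_header is not None:
                entries.append((current_header, "".join(seq_parts)))
            current_header = line[1:]
            seq_parts = []
        elif current_header is not None:
            # Sequence line belonging to the current entry.
            seq_parts.append(line)
        # Sequence-like lines before the first '>' are silently skipped.
    if current_header is not None:
        entries.append((current_header, "".join(seq_parts)))
    return header_lines, entries
```

The sketch does not check that all '#' lines come before the first entry; a stricter reader could reject files where they do not.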


File Header

Term                        Description                                                       Value
#\DbComponent=              Count increment                                                   Integer
#\Name=                     Name of the database                                              CV from database provider (UniprotKnowledgeBase)
#\PrimaryIdentifierType=    Identifier to be used as prefix for individual protein entries    CV
#\Decoy=                    Whether the database is a decoy database                          true/false or description
#\Version=                  Database version                                                  According to the database provider
#\ReleaseDate=              The date of the source database
#\NumberOfEntries=          Number of entries                                                 Integer
#\Sequence_type=            Sequence type                                                     DNA, AA, RNA, EST, etc.

An example of the FASTA file header with two databases added together (two components):
#\DbComponent=1
#\Name=UniProt_SwissProt
#\PrimaryIdentifierType=sp_ac
#\Version=52.3
#\ReleaseDate=20070425
#\NumberOfEntries=248942
#\Sequence_type=Protein_sequence

#\DbComponent=2
#\Name=ENSEMBL
#\PrimaryIdentifierType=sp_ac
#\Version=12.45.3.2
#\ReleaseDate=20070425
#\NumberOfEntries=1234567
#\Sequence_type=Protein_sequence
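
A header block like the one above could be read into one dictionary per database component. The sketch below is only an illustration: parse_file_header is a hypothetical name, all values are kept as strings, and the DbComponent term is matched case-insensitively as a precaution.

```python
def parse_file_header(lines):
    """Group '#\\Term=value' lines into one dict per database component."""
    components = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("#\\"):
            continue  # not a header-block term line
        term, sep, value = line[2:].partition("=")
        if not sep:
            continue  # no '=' present, so there is no value to record
        # A DbComponent line opens a new component; so does the first
        # term of a block that has no explicit DbComponent line.
        if term.lower() == "dbcomponent" or not components:
            components.append({})
        components[-1][term] = value
    return components
```

Applied to the two-component example, this returns a list of two dictionaries, the first with Name set to UniProt_SwissProt and the second with Name set to ENSEMBL.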


Individual Protein Entries

Description of the individual entry header line, with examples:
  • The header starts with '>', followed by the primary AC, which is prefixed by the database prefix (useful when more than one database is concatenated). This field is mandatory. Example: >sp_ac|P000761
  • All non-sequence information is given as \term=value, where the terms are controlled vocabulary descriptors. Example: \ID=ALBU_HUMAN
  • The order of the additional fields is not important.
  • A value can be a list. The elements of the list are written as (value_1)(value_2). Example: \ALTERNATE_AC=(P00786)(Q22222)
  • A value can be enclosed in double quotes if needed. Example: \DE="Human serum albumin"
  • '|' can be used as the separator for all individual fields. Example: \MODRES=(1|Acetyl)
  • Ctrl-A as separator for multi-header entries? (NCBInr use case)

Header Field Term    Definition                                Format
ALT_AC               Alternative AC
ID                   SwissProt_ID
DE                   Protein description
ALT_DE               Alternative description
NCBITAXID            NCBI taxonomy identifier (9606)           integer
TAX_LATIN            Taxonomy in Latin name (Homo sapiens)
TAX_COM              Taxonomy in common name format (human)
MODRES               Modified residue (PTM)                    (position|modification) (PSI_MOD)
VARIANT              Residue mutation                          (position|original residue|final residue)
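
The header-line rules can be sketched as a small parser. This is a minimal illustration under stated assumptions, not a reference implementation: parse_entry_header and FIELD_RE are hypothetical names, quoted values are assumed not to contain a double quote, list elements are assumed not to contain ')', and the open Ctrl-A multi-header question is not handled.

```python
import re

# \TERM= followed by a quoted string, a run of (..)(..) list elements,
# or a bare token without whitespace.
FIELD_RE = re.compile(r'\\(\w+)=("[^"]*"|(?:\([^)]*\))+|\S+)')


def parse_entry_header(line):
    """Return (db_prefix, accession, fields) for one '>' header line."""
    if not line.startswith(">"):
        raise ValueError("entry header must start with '>'")
    parts = line[1:].split(None, 1)
    identifier = parts[0] if parts else ""
    rest = parts[1] if len(parts) > 1 else ""
    prefix, sep, accession = identifier.partition("|")
    if not sep:
        # No database prefix present: treat the whole token as the accession.
        prefix, accession = None, identifier
    fields = {}
    for term, raw in FIELD_RE.findall(rest):
        if raw.startswith('"'):
            value = raw[1:-1]  # drop the surrounding quotes
        elif raw.startswith("("):
            # Each (..) element becomes a tuple of its '|'-separated parts.
            value = [tuple(item.split("|"))
                     for item in re.findall(r"\(([^)]*)\)", raw)]
        else:
            value = raw
        fields[term] = value
    return prefix, accession, fields
```

With this sketch, a field such as \MODRES=(1|Acetyl) is returned as a list with one tuple, ("1", "Acetyl"), and all values stay strings; converting NCBITAXID to an integer is left to the caller.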

Example protein entry
>sp_ac|P02769_WOSIG0 \ID=ALBU_BOVIN \DE="Serum albumin precursor (Allergen Bos d 6) (BSA)" \NCBITAXID=9913 \MODRES=(1|Acetyl) \VARIANT=(196|A|T) \LENGTH=589
RGVFRRDTHKSEIAHRFKDLGEEHFKGLVLIAFSQYLQQCPFDEHVKLVNELTEFAKTCV
ADESHAGCEKSLHTLFGDELCKVASLRETYGDMADCCEKQEPERNECFLSHKDDSPDLPK
LKPDPNTLCDEFKADEKKFWGKYLYEIARRHPYFYAPELLYYANKYNGVFQECCQAEDKG
ACLLPKIETMREKVLASSARQRLRCASIQKFGERALKAWSVARLSQKFPKAEFVEVTKLV
TDLTKVHKECCHGDLLECADDRADLAKYICDNQDTISSKLKECCDKPLLEKSHCIAEVEK
DAIPENLPPLTADFAEDKDVCKNYQEAKDAFLGSFLYEYSRRHPEYAVSVLLRLAKEYEA
TLEECCAKDDPHACYSTVFDKLKHLVDEPQNLIKQNCDQFEKLGEYGFQNALIVRYTRKV
PQVSTPTLVEVSRSLGKVGTRCCTKPESERMPCTEDYLSLILNRLCVLHEKTPVSEKVTK
CCTESLVNRRPCFSALTPDETYVPKAFDEKLFTFHADICTLPDTEKQIKKQTALVELLKH
KPKATEEQLKTVMENFVAFVDKCCAADDKEACFAVEGPKLVVSTQTALA
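
Going the other way, an entry like the one above could be written out from its parts. The sketch below is hedged in the same way as the format itself: format_entry and its helpers are hypothetical names, the 60-character line width is an assumption taken from the example and is not stated by the proposal, and values are quoted only when they contain whitespace.

```python
def format_item(item):
    """Render one list element as (a|b|...); a plain string becomes (a)."""
    parts = [item] if isinstance(item, str) else item
    return "(" + "|".join(str(p) for p in parts) + ")"


def format_value(value):
    """Render a field value as a list, a quoted string, or a bare token."""
    if isinstance(value, (list, tuple)):
        return "".join(format_item(item) for item in value)
    text = str(value)
    if any(ch.isspace() for ch in text):
        return '"' + text + '"'
    return text


def format_entry(prefix, accession, fields, sequence, width=60):
    """Build the '>' header line plus the wrapped sequence lines."""
    header = ">" + prefix + "|" + accession
    for term, value in fields.items():
        header += " \\" + term + "=" + format_value(value)
    # Wrap the sequence at a fixed width (assumed, not specified).
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"
```

Since the order of the additional fields is not important, the sketch simply writes them in the dictionary's insertion order.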


