FASTA
Note: not all issues have been worked out. The parts of this page that are still in question are shown in red text.
Among the other items that haven't been worked out:
- FASTA file extension (.fasta, .seqa, or something else?)
- Is DBPrefix always needed in every entry, even if there is only one database?
- Taxonomy information in protein headers: NCBI_TaxID > Latin name > common name (mapping changes regularly)
- Splice variants: separate entries or annotations?
- Processed sequences: separate entries or annotations? (removal of precursor peptide, active chain, …)
- Controlled vocabulary (CV) of database names
- CV of internal terms
HUPO Standardized FASTA Format
A source file that includes:
- A file header block with information about the included database(s)
  - All lines in the block start with the '#' character
  - One header term from the list below per line
- Individual protein (polypeptide) entries
  - One header line that starts with the '>' character
  - The sequence
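The layout above can be sketched as a small reader. This is a minimal Python illustration, not part of the proposal: the function name `split_fasta` is hypothetical, and collecting every '#' line into the header block regardless of where it appears in the file is an assumption.

```python
from typing import Iterable, List, Optional, Tuple


def split_fasta(lines: Iterable[str]) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return (header_block_lines, [(entry_header, sequence), ...])."""
    header_block: List[str] = []
    entries: List[Tuple[str, str]] = []
    current: Optional[str] = None   # header of the entry being read
    chunks: List[str] = []          # sequence lines of that entry

    for raw in lines:
        line = raw.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith("#"):
            header_block.append(line)     # file header-block line (assumed: anywhere)
        elif line.startswith(">"):
            if current is not None:       # close the previous entry
                entries.append((current, "".join(chunks)))
            current, chunks = line[1:], []
        elif current is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            chunks.append(line)           # a sequence may span several lines

    if current is not None:
        entries.append((current, "".join(chunks)))
    return header_block, entries
```

Because the function accepts any iterable of lines, it can be called on an open file object as well as on a list of strings.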
File Header
| Terms for the header | Description | Value |
| #\DbComponent= | Component counter, incremented for each included database | Integer |
| #\Name= | Name of the database | CV from database provider (UniprotKnowledgeBase) |
| #\PrimaryIdentifierType= | Identifier to be used as prefix for individual protein entries | CV |
| #\Decoy= | Whether this is a decoy database | ?: true/false or description |
| #\Version= | Database version | According to the database provider |
| #\ReleaseDate= | The date of the source database | |
| #\NumberOfEntries= | Number of entries | Integer |
| #\Sequence_type= | Sequence type | DNA, AA, RNA, EST, etc. |
An example of the FASTA file header with two databases added together (two components):
#\DbComponent=1
#\Name=UniProt_SwissProt
#\PrimaryIdentifierType=sp_ac
#\Version=52.3
#\ReleaseDate=20070425
#\NumberOfEntries=248942
#\Sequence_type=Protein_sequence
#\DbComponent=2
#\Name=ENSEMBL
#\PrimaryIdentifierType=sp_ac
#\Version=12.45.3.2
#\ReleaseDate=20070425
#\NumberOfEntries=1234567
#\Sequence_type=Protein_sequence
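A header block like this one could be read into one record per database component. The following is a hedged Python sketch: `parse_header_block` is a hypothetical name, values are kept as plain strings, and two behaviours are assumptions rather than rules from this page: a DbComponent line opens a new component, and that term is matched case-insensitively as a tolerance measure.

```python
from typing import Dict, Iterable, List


def parse_header_block(lines: Iterable[str]) -> List[Dict[str, str]]:
    r"""Group '#\Term=value' lines into one dict per database component."""
    components: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if not line.startswith("#\\"):
            raise ValueError(f"not a header-block line: {line!r}")
        term, sep, value = line[2:].partition("=")
        if not sep:
            raise ValueError(f"missing '=' in header-block line: {line!r}")
        if term.lower() == "dbcomponent":
            # Assumption: each DbComponent line starts a new component.
            components.append({"DbComponent": value})
        elif not components:
            raise ValueError("term appears before any DbComponent line")
        else:
            components[-1][term] = value
    return components


# Shortened version of the two-component example, as raw strings so that
# the backslashes are kept literally.
example = [
    r"#\DbComponent=1",
    r"#\Name=UniProt_SwissProt",
    r"#\NumberOfEntries=248942",
    r"#\DbComponent=2",
    r"#\Name=ENSEMBL",
    r"#\NumberOfEntries=1234567",
]
components = parse_header_block(example)
total_entries = sum(int(c["NumberOfEntries"]) for c in components)
```

Keeping the values as strings avoids guessing at formats that are still open, such as the form of Version or ReleaseDate.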
Individual Protein Entries
| Description of the individual entry header line | Example |
| Header starts with >, followed by the primary AC, which is preceded by the database prefix (useful if more than one DB is concatenated). Mandatory field. | >sp_ac|P000761 |
| All non-sequence information is given as \term=value (terms are controlled vocabulary descriptors) | \ID=ALBU_HUMAN |
| The order of the additional fields is not important | |
| A value can be a list. The elements of the list are represented as (value_1)(value_2) | \ALTERNATE_AC=(P00786)(Q22222) |
| A value can be enclosed in " " if needed | \DE="Human serum albumin" |
| '|' can be used as separator for all individual fields | \MODRES=(1|Acetyl) |
| Ctrl-A as separator for multi-header entries? | (NCBInr use case) |
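The three value forms described above (plain, quoted, and list with '|' sub-fields) could be interpreted along these lines. This is only a sketch in Python: `parse_value` is a hypothetical helper, and it assumes list elements never contain nested parentheses.

```python
import re
from typing import List, Union

# One list element: text between a pair of parentheses, no nesting assumed.
_LIST_ELEMENT = re.compile(r"\(([^()]*)\)")


def parse_value(raw: str) -> Union[str, List[List[str]]]:
    """Interpret one raw header value.

    "(a)(b|c)"  -> [["a"], ["b", "c"]]  (list elements, split on '|')
    '"text"'    -> "text"               (surrounding double quotes removed)
    other       -> returned unchanged
    """
    raw = raw.strip()
    if raw.startswith("(") and raw.endswith(")"):
        return [element.split("|") for element in _LIST_ELEMENT.findall(raw)]
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        return raw[1:-1]
    return raw


alternate_acs = parse_value("(P00786)(Q22222)")      # two one-field elements
description = parse_value('"Human serum albumin"')   # quotes stripped
modres = parse_value("(1|Acetyl)")                   # one two-field element
```

A quoted value is checked after the list form, so a description that merely contains parentheses inside its quotes is still returned as text.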
| Header Field Term | Definition | Format |
| ALT_AC | Alternative AC | |
| ID | SwissProt_ID | |
| DE | Protein description | |
| ALT_DE | Alternative description | |
| NCBITAXID | NCBI taxonomy identifier (e.g. 9606) | Integer |
| TAX_LATIN | Taxonomy as Latin name (e.g. Homo sapiens) | |
| TAX_COM | Taxonomy as common name (e.g. human) | |
| MODRES | Modified residue (PTM) | (position|modification) (PSI_MOD) |
| VARIANT | Residue mutation | (position|original residue|final residue) |
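The MODRES and VARIANT formats in the table could be decoded into typed records as sketched below. The class and function names are hypothetical, and treating positions as 1-based is an assumption, since the page does not state the numbering convention.

```python
import re
from typing import List, NamedTuple

# One parenthesised element, e.g. "(196|A|T)"; nesting is not handled.
_ELEMENT = re.compile(r"\(([^()]*)\)")


class ModRes(NamedTuple):
    position: int
    modification: str


class Variant(NamedTuple):
    position: int
    original: str
    final: str


def parse_modres(value: str) -> List[ModRes]:
    """Decode '(position|modification)' elements."""
    records = []
    for element in _ELEMENT.findall(value):
        position, modification = element.split("|")
        records.append(ModRes(int(position), modification))
    return records


def parse_variant(value: str) -> List[Variant]:
    """Decode '(position|original residue|final residue)' elements."""
    records = []
    for element in _ELEMENT.findall(value):
        position, original, final = element.split("|")
        records.append(Variant(int(position), original, final))
    return records


def variant_matches(sequence: str, variant: Variant) -> bool:
    """True if the sequence has the variant's original residue there.

    Assumption: positions are 1-based.
    """
    index = variant.position - 1
    return 0 <= index < len(sequence) and sequence[index] == variant.original
```

An element with the wrong number of '|'-separated sub-fields raises a `ValueError` during unpacking, which gives a simple format check.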
Example protein entry
>sp_ac|P02769_WOSIG0 \ID=ALBU_BOVIN \DE="Serum albumin precursor (Allergen Bos d 6) (BSA)" \NCBITAXID=9913 \MODRES=(1|Acetyl) \VARIANT=(196|A|T) \LENGTH=589
RGVFRRDTHKSEIAHRFKDLGEEHFKGLVLIAFSQYLQQCPFDEHVKLVNELTEFAKTCV
ADESHAGCEKSLHTLFGDELCKVASLRETYGDMADCCEKQEPERNECFLSHKDDSPDLPK
LKPDPNTLCDEFKADEKKFWGKYLYEIARRHPYFYAPELLYYANKYNGVFQECCQAEDKG
ACLLPKIETMREKVLASSARQRLRCASIQKFGERALKAWSVARLSQKFPKAEFVEVTKLV
TDLTKVHKECCHGDLLECADDRADLAKYICDNQDTISSKLKECCDKPLLEKSHCIAEVEK
DAIPENLPPLTADFAEDKDVCKNYQEAKDAFLGSFLYEYSRRHPEYAVSVLLRLAKEYEA
TLEECCAKDDPHACYSTVFDKLKHLVDEPQNLIKQNCDQFEKLGEYGFQNALIVRYTRKV
PQVSTPTLVEVSRSLGKVGTRCCTKPESERMPCTEDYLSLILNRLCVLHEKTPVSEKVTK
CCTESLVNRRPCFSALTPDETYVPKAFDEKLFTFHADICTLPDTEKQIKKQTALVELLKH
KPKATEEQLKTVMENFVAFVDKCCAADDKEACFAVEGPKLVVSTQTALA
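A header line such as the one in this example entry could be split into database prefix, accession, and \term=value fields as follows. This is a minimal Python sketch under stated assumptions: `parse_entry_header` and `EntryHeader` are hypothetical names, values containing spaces are assumed to be enclosed in double quotes, and a header without '|' is treated as having an empty prefix.

```python
import re
from typing import Dict, NamedTuple

# \TERM=value, where value is either a double-quoted string or a run of
# non-whitespace characters (assumption: unquoted values contain no spaces).
_FIELD = re.compile(r'\\(\w+)=("[^"]*"|\S*)')


class EntryHeader(NamedTuple):
    db_prefix: str
    accession: str
    fields: Dict[str, str]


def parse_entry_header(line: str) -> EntryHeader:
    """Split a '>' header line into prefix, accession, and raw field values."""
    if not line.startswith(">"):
        raise ValueError("entry header must start with '>'")
    parts = line[1:].split(None, 1)
    if not parts:
        raise ValueError("entry header has no identifier")
    identifier = parts[0]
    rest = parts[1] if len(parts) > 1 else ""

    prefix, sep, accession = identifier.partition("|")
    if not sep:                      # no database prefix present
        prefix, accession = "", identifier

    fields: Dict[str, str] = {}
    for term, value in _FIELD.findall(rest):
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]      # drop the surrounding quotes
        fields[term] = value
    return EntryHeader(prefix, accession, fields)


header = parse_entry_header(
    r'>sp_ac|P02769_WOSIG0 \ID=ALBU_BOVIN '
    r'\DE="Serum albumin precursor (Allergen Bos d 6) (BSA)" '
    r'\NCBITAXID=9913 \MODRES=(1|Acetyl) \VARIANT=(196|A|T) \LENGTH=589'
)
```

Field values are left as raw strings here, so list values such as \MODRES keep their parentheses for a later decoding step.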