Referring to SARS-CoV-2 Proteins in BEL
Many of the proteins in the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)
are cleavage products of the replicase polyprotein 1ab (uniprot:P0DTD1).
Unfortunately, the bioinformatics community is not so comfortable with proteins
like this and nomenclature remains tricky. Luckily, the Biological Expression Language (BEL)
has exactly the right tool to encode information about these proteins using the fragment()
function.
This image was modified from the C&EN article What do we know about the novel coronavirus’s 29 proteins?
UniProt lists each of the 16 non-structural proteins (often written as symbols nsp1-nsp16) as protein chains
of the main protein entry, uniprot:P0DTD1. These chains are assigned identifiers
following the regular expression pattern of PRO_\d{10}
. The Identifiers.org registered this pattern
under the prefix uniprot.chain
. While it resolves
to URLs following the pattern of https://www.uniprot.org/uniprot/<uniprot_id>#<chain_id>
, it appears that the
parent protein’s UniProt identifier is looked up automatically . This is really good news and means that we can
start using stable CURIEs to identify these proteins, even if like me, you’ve never used this prefix before.
Alternatively, BEL allows you to write out the relationship between the parent protein and the
fragment using the fragment() / frag()
function (docs).
For example, the nsp1 fragment from position 1-180 can be written in BEL either as
p(uniprot.chain:PRO_0000449619)
or as a fragment p(uniprot:P0DTD1 ! R1AB_SARS2, frag(1_180))
.
The entire table of non-structural proteins is written out below for your copy/paste convenience
in BEL coding.
Symbol | Chain | Positions | Name | BEL |
---|---|---|---|---|
nsp1 | PRO_0000449619 | 1 – 180 | Host translation inhibitor nsp1 | p(uniprot:P0DTD1 ! R1AB_SARS2, frag(1_180)) |
nsp2 | PRO_0000449620 | 181 – 818 | Non-structural protein 2 | p(uniprot:P0DTD1 ! R1AB_SARS2, frag(181_818)) |
nsp3 | PRO_0000449621 | 819 – 2763 | Non-structural protein 3 | p(uniprot:P0DTD1 ! R1AB_SARS2, frag(819_2763)) |
nsp4 | PRO_0000449622 | 2764 – 3263 | Non-structural protein 4 | p(uniprot:P0DTD1 ! R1AB_SARS2, frag(2764_3263)) |
nsp5 | PRO_0000449623 | 3264 – 3569 | 3C-like proteinase | p(uniprot:P0DTD1 ! R1AB_SARS2, frag(3264_3569)) |
nsp6 | PRO_0000449624 | 3570 – 3859 | Non-structural protein 6 | p(uniprot:P0DTD1 ! R1AB_SARS2, frag(3570_3859)) |
nsp7 | PRO_0000449625 | 3860 – 3942 | Non-structural protein 7 | p(uniprot:P0DTD1 ! R1AB_SARS2, frag(3860_3942)) |
nsp8 | PRO_0000449626 | 3943 – 4140 | Non-structural protein 8 | p(uniprot:P0DTD1 ! R1AB_SARS2, frag(3943_4140)) |
nsp9 | PRO_0000449627 | 4141 – 4253 | Non-structural protein 9 | p(uniprot:P0DTD1 ! R1AB_SARS2, frag(4141_4253)) |
nsp10 | PRO_0000449628 | 4254 – 4392 | Non-structural protein 10 | p(uniprot:P0DTD1 ! R1AB_SARS2, frag(4254_4392)) |
nsp12 | PRO_0000449629 | 4393 – 5324 | RNA-directed RNA polymerase | p(uniprot:P0DTD1 ! R1AB_SARS2, frag(4393_5324)) |
nsp13 | PRO_0000449630 | 5325 – 5925 | Helicase | p(uniprot:P0DTD1 ! R1AB_SARS2, frag(5325_5925)) |
nsp14 | PRO_0000449631 | 5926 – 6452 | Proofreading exoribonuclease | p(uniprot:P0DTD1 ! R1AB_SARS2, frag(5926_6452)) |
nsp15 | PRO_0000449632 | 6453 – 6798 | Uridylate-specific endoribonuclease | p(uniprot:P0DTD1 ! R1AB_SARS2, frag(6453_6798)) |
nsp16 | PRO_0000449633 | 6799 – 7096 | 2’-O-methyltransferase | p(uniprot:P0DTD1 ! R1AB_SARS2, frag(6799_7096)) |
I’m not sure what happened to #11. UniProt isn’t listing it here. There’s also the Replicase polyprotein 1a, which lists nsp1-nsp11, but I’m not sure what the difference is yet.
When I first started writing this, I wasn’t actually aware of the existence of the uniprot.chain
entry
in Identifiers.org. This makes things a lot better! However, this leaves two tasks for me:
- Integrate the
uniprot.chain
nomenclature into PyOBO such that identifiers can be validated and easily resolved to names -
Generate equivalence relationships in BEL linking the CURIE-named and ontologically-defined versions of each as in:
p(uniprot.chain:PRO_0000449619) equivalentTo p(uniprot:P0DTD1 ! R1AB_SARS2, frag(1_180)) ... p(uniprot.chain:PRO_0000449633) equivalentTo p(uniprot:P0DTD1 ! R1AB_SARS2, frag(6799_7096))
Happy BEL coding!