Perhaps unsurprisingly, one of my interests is functional, accurate protein annotations. The default way to annotate new sequences, especially in a high-throughput manner, is to use sequence identity with some form of BLAST and use the best hit to annotate your sequence of interest.
There are some limitations here. It’s been shown that enzyme function is not necessarily conserved even between fairly similar sequences. We’ve demonstrated that each orphan enzyme we find a sequence for can lead to the re-annotation of hundreds of genomes.
In their paper “DomSign: a top-down annotation pipeline to enlarge enzyme space in the protein universe,” Wang et al. from Tokyo Institute of Technology and Tsinghua University apply a protein-domain based approach to try to expand our ability to predict enzyme activities for proteins.
The short version – domain signatures and ECs
Read the whole paper for the details, but the gist of the approach is this:
First, take all the proteins in the curated Pfam-A list (more on Pfam here). Then for each one that is already classified as an enzyme (with a single EC number), reduce it to a domain signature. A domain signature is just a list of all the unique domains that protein has.
Then for each domain signature, they identify the “dominant” EC number for that signature for each step in the EC hierarchy. Based on what percent of the enzymes in the domain signature family have that EC number, they assign a confidence level.
For example, we can imagine a theoretical family of enzymes that all have the domain signature ABE (that is, they all contain one or more of each of domains A, B, and E). That family might have 93% of its members in the top-level EC category 1, so it would have a confidence level of 0.93 for EC 1. Then if 81% of its members had 3 for the next number, it would have a confidence level of 0.81 for EC 1.3.
…and so forth.
Then the basic procedure is to take a novel enzyme, find its domain signature, set your preferred confidence level, and predict an EC number for that enzyme.
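To make the procedure concrete, here is a minimal Python sketch of the idea as described above. It is not the authors' code. The function names (domain_signature, build_model, predict), the toy training data, the greedy walk down the EC hierarchy, and the choice to measure confidence as a fraction of the whole signature family are all my own assumptions for illustration.

```python
from collections import Counter, defaultdict

def domain_signature(domains):
    """Collapse a protein's domain list to its set of unique domains."""
    return frozenset(domains)

def build_model(training):
    """training: iterable of (domain_list, ec_string) pairs with full
    four-level EC numbers, e.g. (["A", "B"], "1.3.1.9").
    Returns {signature: [(ec_prefix, confidence), ...]}, one entry per EC level."""
    families = defaultdict(list)
    for domains, ec in training:
        families[domain_signature(domains)].append(tuple(ec.split(".")))
    model = {}
    for sig, ecs in families.items():
        levels, prefix = [], ()
        for depth in range(1, 5):
            # Count the extensions of the dominant prefix chosen one level up.
            counts = Counter(ec[:depth] for ec in ecs if ec[:depth - 1] == prefix)
            prefix, n = counts.most_common(1)[0]
            # Assumption: confidence is the fraction of the whole family.
            levels.append((prefix, n / len(ecs)))
        model[sig] = levels
    return model

def predict(model, domains, threshold=0.8):
    """Return the deepest EC prefix whose confidence meets the threshold,
    padded with '-' placeholders, or None if nothing can be predicted."""
    levels = model.get(domain_signature(domains))
    if levels is None:
        return None  # signature never seen in the training data
    best = ()
    for prefix, confidence in levels:
        if confidence < threshold:
            break
        best = prefix
    if not best:
        return None
    return ".".join(best + ("-",) * (4 - len(best)))

# Toy family with signature ABE (hypothetical data, 10 proteins).
training = (
    [(["A", "B", "E"], "1.3.1.1")] * 6
    + [(["A", "A", "B", "E"], "1.3.1.2")] * 2  # repeated domain, same signature
    + [(["A", "B", "E"], "1.4.1.1")]
    + [(["A", "B", "E"], "2.1.1.1")]
)
model = build_model(training)
print(predict(model, ["E", "B", "A"], threshold=0.75))  # prints 1.3.1.-
```

In this toy family the confidences come out to 0.9, 0.8, 0.8, and 0.6 for the four levels, so raising the threshold to 0.85 trims the prediction back to 1.-.-.-, and a signature that never appears in the training data returns nothing at all.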
One obvious limitation
The big limitation to this method is that it only works if you already have annotated enzymes in your training data set with a domain signature matching the new enzyme you’re trying to annotate.
In other words, it’s the same old problem – it’s awfully hard to predict things you’ve never seen before.
That’s fine, of course. One nice aspect of this method is that it only requires domain-level information and it’s able to predict to rougher levels of specificity, which helps cut down on over-prediction. This means that it can help fill that gap of predicting activity for sequences for which we have no reasonably specific matches.
Also, over-prediction is a huge pet peeve of mine. I firmly believe it messes with our ability to build accurate predictive models. I’d rather have an enzyme’s activity predicted to a rough class of activity than have a highly (and over-) specific prediction telling me that it can’t carry out some activity that may be biologically interesting, pharmaceutically relevant, or affect our ability to use it for synthetic biology.
So how well does it work?
The authors compare DomSign to blastp (an obvious choice) and another domain-based method.
Here’s the short version:
For proteins for which you have matches with greater than 30% identity, blastp is a more effective choice.
However, in the under 30% territory, where blastp starts becoming kind of bad at its job, DomSign is a somewhat better choice.
In that territory, blastp gives you an incorrect prediction about 40% of the time. The vast majority of these incorrect predictions are under-annotations, which is less than ideal but introduces no additional over-prediction errors to your annotations. Maybe 5% of blastp’s predictions are incorrect or over-predictions.
In contrast, DomSign gives an incorrect prediction a little over 20% of the time. About two thirds of the time, it yields an under-prediction, with the remaining third or so being an incorrect or over-prediction.
Note that the authors term adding an extra level to an EC prediction an “improvement.” That is, if the enzyme was assigned EC 1.2.3.- and their method predicts 1.2.3.4, they call that a win. I call that an over-prediction, although it’s a modest one (and comprises only a few percent of results for each method).
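Since the bookkeeping matters here, this is a small Python sketch of how I'm sorting predictions into categories. The function names (ec_levels, classify) and the category labels are mine, not the paper's, and it assumes unspecified EC levels are written as trailing “-” placeholders.

```python
def ec_levels(ec):
    """Return the specified levels of an EC number, stopping at the first '-'."""
    levels = []
    for part in ec.split("."):
        if part == "-":
            break
        levels.append(part)
    return levels

def classify(predicted, reference):
    """Compare a predicted EC number against the reference annotation."""
    p, r = ec_levels(predicted), ec_levels(reference)
    shared = min(len(p), len(r))
    if p[:shared] != r[:shared]:
        return "incorrect"          # disagrees at a level both specify
    if len(p) == len(r):
        return "exact"
    # Same prefix, different depth: the shorter one is the less specific.
    return "under-prediction" if len(p) < len(r) else "over-prediction"

print(classify("1.2.3.4", "1.2.3.-"))  # prints over-prediction
print(classify("1.2.-.-", "1.2.3.4"))  # prints under-prediction
```

Under the authors' scoring, the first case would count as an improvement rather than an error, which is exactly the difference in bookkeeping I'm flagging.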
So that’s pretty neat. Once identity plunges down into dismal territory, DomSign can pick up some of blastp’s slack in terms of nailing more specific EC predictions.
This is one part where I think the authors may overstate their gains (although maybe I just missed something).
They apply DomSign to expanding the enzyme annotations within large-scale data sets. For example, applying DomSign to the automatically annotated UniProt TrEMBL data set added EC classifications to another 18% of the sequences in TrEMBL.
That’s pretty sweet, adding EC classifications to almost 4 million proteins.
The authors kind of gloss over the false annotation rate here, however, which is about 5% for DomSign regardless of the degree of sequence identity. I’m counting incorrect and over-predictions here, but not under-predictions.
That means for those 4 million newly annotated TrEMBL proteins, some 200,000 of them have received new, misleading information.
As the authors themselves say, you want to use a suite of methods to annotate. However…
DomSign can be a helpful addition to annotations in those spaces where blastp falters. It has a lower error rate than the other domain-based method they compare it to, but I want to read more before I come to a final judgment on that.