In our work on orphan enzymes, we’ve consistently seen a “rich get richer” effect: research tends to accumulate on those proteins that already have assigned sequences. This is a systematic issue. Annotation based on sequence similarity means that we often assume a newly identified gene performs the same function as a known protein…when in reality, it may be closer to a highly similar orphan enzyme for which we lack sequence data.

We saw this occasionally in the generally awesome BRENDA enzyme database. In those cases, a curator had assigned a sequence to an orphan enzyme when that sequence actually belonged to a highly similar enzyme that did not catalyze the orphan enzyme’s activity. This kind of over-assignment likely discourages further research on the orphan enzyme and tends to focus more research on the enzyme for which we had sequence data in the first place.

Cracking the brain’s “ignorome”

In their recent paper, “Functionally Enigmatic Genes: A Case Study of the Brain Ignorome,” Pandey and colleagues tackle this problem from the other side of the mirror: uncharacterized genes.

They surveyed those genes that show “intense and highly selective” expression in the brain (ISE genes, for short) and asked “How well-characterized are they?” After all, one of the promises of modern high-throughput methods is that we can look at features such as tissue-specific expression and use that as a guide for which genes to devote our research attention to.

What they found is that despite our knowing that these genes are all intensely and selectively expressed in the brain, research about them has been tremendously lopsided.

I’ll quote them on just how off-kilter the research distribution is:

The number of publications for these 650 ISE genes is highly skewed (Figure 1). The top 5% account for ~68% of the relevant literature whereas the bottom 50% of genes account for only 1% of the literature.

Here’s Figure 1:

[Figure 1 from Pandey et al. (2014), image journal.pone.0088889.g001]

So that shows us that despite being specifically expressed in the brain, and presumably important to it, many of these genes remain understudied.
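For anyone who wants to measure this kind of skew on their own gene list, here is a minimal Python sketch. The publication counts are made up for illustration and are not the paper's data, and `share_of_literature` is a hypothetical helper name, not something from the paper.

```python
def share_of_literature(pub_counts, fraction, from_top=True):
    """Return the fraction of all publications contributed by the most-studied
    (from_top=True) or least-studied (from_top=False) `fraction` of genes."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Sort so that the genes we want to sum over come first.
    ranked = sorted(pub_counts, reverse=from_top)
    n_genes = max(1, round(len(ranked) * fraction))
    total = sum(ranked)
    if total == 0:
        return 0.0
    return sum(ranked[:n_genes]) / total


# Made-up publication counts for 20 hypothetical genes (NOT data from the paper).
toy_counts = [500, 120, 60, 30, 20, 10, 8, 6, 5, 4,
              3, 3, 2, 2, 1, 1, 1, 0, 0, 0]

top_5 = share_of_literature(toy_counts, 0.05)
bottom_50 = share_of_literature(toy_counts, 0.50, from_top=False)

print(f"Top 5% of genes: {top_5:.1%} of publications")
print(f"Bottom 50% of genes: {bottom_50:.1%} of publications")
```

With real data, `pub_counts` would be one publication count per gene, however you choose to count publications.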

Why?

What makes the ignorome different?

The short answer is “age.”

Much like the “rich get richer” phenomenon I talked about for orphan enzymes, there is (unsurprisingly) a correlation between when a gene was first characterized and how much research there has been on it. Nothing else really differs between the genes that are understudied and those that have been the focus of significant study.

That brings up the natural corollary question of “Okay, so are we figuring out what the other genes do?”

The answer here seems to be that we were for a while, but now the rate of advancing discovery is flattening out. I’ll quote the authors here as well:

While the average rate of decrease was rapid between 1991 and 2000 (−25 genes/year), the rate has been lethargic over the past five years (−6.4 genes/yr, Figure 5). This trend is surprising given the sharp increase in the rate of addition to the neuroscience literature. As a result, the number of neuroscience articles associated with the elimination of a single ignorome gene has gone up by a factor of three between 1991 and 2012 (Figure 5). The rate at which the ignorome is shrinking is approaching an asymptote, and without focused effort to functionally annotate the ignorome, it will likely make up 40–50 functionally important genes for more than a decade.

So what do we do about it?

One of the core reasons for “rich get richer” effects is that known genes (or proteins) simply have more “handles” you can work with. If your expression analysis tells you that 20 genes are significantly enriched in your test condition and you can find some functional characterization for 10 of them, it’s only natural to focus on those 10 first.

…and given how time and work tend to play out, “first” can quickly become “only.” Given how daunting a completely uncharacterized gene can be, who would fault researchers for spending the majority of their effort on those genes that have some functional characterizations (or predictions) available for them? That certainly fits the whole 80/20 rule idea of focusing most of your effort where you’ll have the most gain.

Pandey and colleagues attempt to address this by making more handles. They show how we can leverage high-throughput and large-scale phenotype databases to generate additional functional characterization for at least some of the ignorome genes without significant additional effort. Now, instead of flying relatively blind, a researcher can have both sequence-similarity-based predictions of function and some best guesses at phenotype associations for these genes.

I really like this kind of leveraging of existing data to make avenues of research more accessible and thus more likely. This kind of thing is going to be very important in tackling those dark areas of unknown function that exist all over biology.

Who did this research

Ashutosh K. Pandey, Lu Lu, Xusheng Wang, Ramin Homayouni, and Robert W. Williams.

(…and hey, Robert Williams is another UC alum!)

The full citation:

Pandey AK, Lu L, Wang X, Homayouni R, Williams RW (2014) Functionally Enigmatic Genes: A Case Study of the Brain Ignorome. PLoS ONE 9(2): e88889. doi:10.1371/journal.pone.0088889

Figure and quotes were used under the Creative Commons Attribution License.


We celebrated the end of 2013 with the release of our new paper, “Rapid identification of sequences for orphan enzymes to power accurate protein annotation,” in PLOS ONE.

So what’s the big deal? What are orphan enzymes and why do we need to identify them?

Sequences are card catalog numbers for everything

In modern biology, protein and nucleotide sequence data are the glue that holds everything together. When we sequence a new genome, for example, we make a “best guess” for what each gene does by comparing its sequence to a vast collection of sequences we already have. Essentially, that lets us go from this amino acid sequence:

1 MSLPLKTIVH LVKPFACTAR FSARYPIHVI VVAVLLSAAA YLSVTQSYLN
51 EWKLDSNQYS TYLSIKPDEL FEKCTHYYRS PVSDTWKLLS SKEAADIYTP
101 FHYYLSTISF QSKDNSTTLP SLDDVIYSVD HTRYLLSEEP KIPTELVSEN
151 GTKWRLRNNS NFILDLHNIY RNMVKQFSNK TSEFDQFDLF IILAAYLTLF
201 YTLCCLFNDM RKIGSKFWLS FSALSNSACA LYLSLYTTHS LLKKPASLLS
251 LVIGLPFIVV IIGFKHKVRL AAFSLQKFHR ISIDKKITVS NIIYEAMFQE
301 GAYLIRDYLF YISSFIGCAI YARHLPGLVN FCILSTFMLV FDLLLSATFY
351 SAILSMKLEI NIIHRSTVIR QTLEEDGVVP TTADIIYKDE TASEPHFLRS
401 NVAIILGKAS VIGLLLLINL YVFTDKLNAT ILNTVYFDST IYSLPNFINY
451 KDIGNLSNQV IISVLPKQYY TPLKKYHQIE DSVLLIIDSV SNAIRDQFIS
501 KLLFFAFAVS ISINVYLLNA AKIHTGYMNF QPQSNKIDDL VVQQKSATIE
551 FSETRSMPAS SGLETPVTAK DIIISEEIQN NECVYALSSQ DEPIRPLSNL
601 VELMEKEQLK NMNNTEVSNL VVNGKLPLYS LEKKLEDTTR AVLVRRKALS
651 TLAESPILVS EKLPFRNYDY DRVFGACCEN VIGYMPIPVG VIGPLIIDGT
701 SYHIPMATTE GCLVASAMRG CKAINAGGGA TTVLTKDGMT RGPVVRFPTL
751 IRSGACKIWL DSEEGQNSIK KAFNSTSRFA RLQHIQTCLA GDLLFMRFRT
801 TTGDAMGMNM ISKGVEYSLK QMVEEYGWED MEVVSVSGNY CTDKKPAAIN
851 WIEGRGKSVV AEATIPGDVV KSVLKSDVSA LVELNISKNL VGSAMAGSVG
901 GFNAHAANLV TALFLALGQD PAQNVESSNC ITLMKEVDGD LRISVSMPSI
951 EVGTIGGGTV LEPQGAMLDL LGVRGPHPTE PGANARQLAR IIACAVLAGE
1001 LSLCSALAAG HLVQSHMTHN RKTNKANELP QPSNKGPPCK TSALL*

…to predicting that this protein is probably an “HMG-CoA Reductase,” an enzyme that carries out a key step in cholesterol synthesis.
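To make the “best guess” idea concrete, here is a toy Python sketch of similarity-based annotation. It is only a stand-in: real annotation pipelines use alignment tools such as BLAST, while this sketch scores similarity with shared 3-residue “words” (a Jaccard score over k-mers). The function names, the 0.3 cutoff, and the second reference sequence are all made up for illustration; the first reference is just the first 30 residues of the sequence above.

```python
def kmers(seq, k=3):
    """Set of all overlapping k-residue words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(seq_a, seq_b, k=3):
    """Shared k-mers divided by total distinct k-mers (0.0 to 1.0)."""
    words_a, words_b = kmers(seq_a, k), kmers(seq_b, k)
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def best_guess(query, reference, k=3, min_score=0.3):
    """Return (name, score) of the most similar reference sequence,
    or (None, score) if nothing clears the (arbitrary) cutoff."""
    best_name, best_score = None, 0.0
    for name, seq in reference.items():
        score = jaccard(query, seq, k)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < min_score:
        return None, best_score
    return best_name, best_score


# Toy "database": one fragment from the post, one invented sequence.
reference = {
    "HMG-CoA reductase (first 30 residues)": "MSLPLKTIVHLVKPFACTARFSARYPIHVI",
    "made-up protein": "MKKLLPTAAAGLLLLAAQPAMAMDIGINSDP",
}

# Hypothetical "newly sequenced" gene product: the fragment with one substitution.
query = "MSLPLKTIVHLVKPFACTARFSGRYPIHVI"

name, score = best_guess(query, reference)
print(name, round(score, 3))
```

Note what happens if the true match is missing from `reference`: the function either returns nothing or, worse, returns the next most similar entry. That second case is exactly the over-assignment problem that orphan enzymes create.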

We can also get more specific, tying part of this sequence information to the specific activity of the protein. In the case of my example enzyme, the “business end” is the second half of the protein.

This kind of sequence data powers so much of what we do in modern biology, from guessing what individual proteins do all the way to generating entire metabolic models and then predicting literally every food source a microbe can grow on.

We’re missing a lot of sequences

Hundreds upon hundreds, in fact. For a lot of critical enzymes.

As part of our Orphan Enzymes Project, we’ve tried to figure out how we can find sequences for these hundreds of enzymes.

After all, each enzyme represents hundreds of thousands of dollars in lost research…and each enzyme sequence we don’t have undercuts the value of all of our fantastic sequence-based tools.

We can rapidly identify a lot of orphan enzymes

Our new paper describes a few case studies on how we can identify orphan enzymes in the lab and just how big an impact identifying a sequence for each orphan enzyme has.

We found several cases where we were actually able to buy samples of enzymes that had never been sequenced. We were also able to collaborate with Charles Waechter and Jeffrey Rush of the University of Kentucky to find sequence data for an enzyme they’d been working hard to characterize.

The key point of this part of our work is that many enzymes that are “tricky” for one set of researchers to sequence may be entirely doable for another group that specializes in sequencing. The more we collaborate, the more value we get out of all of our work.

Identifying orphan enzymes has a big impact

The second part of our work asks the simple question, “Does it matter?”

For each enzyme for which we found sequence data, we asked “How many enzymes should we now re-annotate?”

In other words, of all the guesses that have been made about what proteins do, for how many does our enzyme become the best guess, based on how close each protein’s sequence is to the one we found?

It turns out that each enzyme sequence we identified led to anywhere from 130 to 430 proteins getting new, better guesses about their functions.
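The counting logic behind that question can be sketched in a few lines of Python. This is an assumption about the general idea, not the procedure from our paper: the protein IDs, annotations, and similarity scores below are invented, and `proteins_to_reannotate` is a hypothetical helper name.

```python
def proteins_to_reannotate(current_best, new_enzyme_scores):
    """List proteins whose similarity to the newly sequenced enzyme beats
    the score behind their current annotation.

    current_best:      protein ID -> (current annotation, similarity score)
    new_enzyme_scores: protein ID -> similarity score against the new sequence
    """
    improved = []
    for protein, new_score in new_enzyme_scores.items():
        # Proteins with no current annotation are treated as having the worst score.
        _, old_score = current_best.get(protein, (None, float("-inf")))
        if new_score > old_score:
            improved.append(protein)
    return improved


# Invented example data (higher score = more similar).
current_best = {
    "P1": ("similar enzyme X", 210.0),
    "P2": ("similar enzyme X", 180.0),
    "P3": ("unrelated enzyme Y", 400.0),
}
new_enzyme_scores = {"P1": 350.0, "P2": 175.0, "P3": 90.0, "P4": 120.0}

candidates = proteins_to_reannotate(current_best, new_enzyme_scores)
print(len(candidates), "proteins would get a new best guess:", sorted(candidates))
```

In this sketch a protein only counts if the new sequence is a strictly better match than whatever it was annotated from before; a real analysis would also want a minimum similarity threshold so that weak matches are not promoted.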

That’s hundreds of potential incorrect predictions or misled researchers averted by just “finishing the job” of sequencing a handful of enzymes.

Given the tremendous amount of work that has gone into characterizing each of these enzymes, it’s essential that we take every opportunity to apply modern sequencing expertise to existing samples.

Comments on the paper are welcome, whether here or on the paper itself at PLOS ONE.