We celebrated the end of 2013 with the release of our new paper, “Rapid identification of sequences for orphan enzymes to power accurate protein annotation,” in PLOS ONE.

So what’s the big deal? What are orphan enzymes and why do we need to identify them?

Sequences are card catalog numbers for everything

In modern biology, protein and nucleotide sequence data are the glue that holds everything together. When we sequence a new genome, for example, we make a “best guess” for what each gene does by comparing its sequence to a vast collection of sequences we already have. Essentially, that lets us go from this amino acid sequence:

1 MSLPLKTIVH LVKPFACTAR FSARYPIHVI VVAVLLSAAA YLSVTQSYLN
51 EWKLDSNQYS TYLSIKPDEL FEKCTHYYRS PVSDTWKLLS SKEAADIYTP
101 FHYYLSTISF QSKDNSTTLP SLDDVIYSVD HTRYLLSEEP KIPTELVSEN
151 GTKWRLRNNS NFILDLHNIY RNMVKQFSNK TSEFDQFDLF IILAAYLTLF
201 YTLCCLFNDM RKIGSKFWLS FSALSNSACA LYLSLYTTHS LLKKPASLLS
251 LVIGLPFIVV IIGFKHKVRL AAFSLQKFHR ISIDKKITVS NIIYEAMFQE
301 GAYLIRDYLF YISSFIGCAI YARHLPGLVN FCILSTFMLV FDLLLSATFY
351 SAILSMKLEI NIIHRSTVIR QTLEEDGVVP TTADIIYKDE TASEPHFLRS
401 NVAIILGKAS VIGLLLLINL YVFTDKLNAT ILNTVYFDST IYSLPNFINY
451 KDIGNLSNQV IISVLPKQYY TPLKKYHQIE DSVLLIIDSV SNAIRDQFIS
501 KLLFFAFAVS ISINVYLLNA AKIHTGYMNF QPQSNKIDDL VVQQKSATIE
551 FSETRSMPAS SGLETPVTAK DIIISEEIQN NECVYALSSQ DEPIRPLSNL
601 VELMEKEQLK NMNNTEVSNL VVNGKLPLYS LEKKLEDTTR AVLVRRKALS
651 TLAESPILVS EKLPFRNYDY DRVFGACCEN VIGYMPIPVG VIGPLIIDGT
701 SYHIPMATTE GCLVASAMRG CKAINAGGGA TTVLTKDGMT RGPVVRFPTL
751 IRSGACKIWL DSEEGQNSIK KAFNSTSRFA RLQHIQTCLA GDLLFMRFRT
801 TTGDAMGMNM ISKGVEYSLK QMVEEYGWED MEVVSVSGNY CTDKKPAAIN
851 WIEGRGKSVV AEATIPGDVV KSVLKSDVSA LVELNISKNL VGSAMAGSVG
901 GFNAHAANLV TALFLALGQD PAQNVESSNC ITLMKEVDGD LRISVSMPSI
951 EVGTIGGGTV LEPQGAMLDL LGVRGPHPTE PGANARQLAR IIACAVLAGE
1001 LSLCSALAAG HLVQSHMTHN RKTNKANELP QPSNKGPPCK TSALL*

…to predicting that this protein is probably an “HMG-CoA Reductase,” an enzyme that carries out a key step in cholesterol synthesis.
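To make the “best guess” idea concrete, here is a minimal sketch in Python. It is only an illustration: real annotation pipelines use alignment tools rather than the simple shared-fragment score below, and the reference names, sequences, and function names are all made up for this example. The query is the first 20 residues of the sequence above with one residue changed.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-length fragments in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of two sequences' k-mer sets (0.0 to 1.0).

    Assumption: a crude stand-in for a real alignment score.
    """
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def best_guess(query, reference):
    """Return (name, score) for the reference sequence most similar to query."""
    name = max(reference, key=lambda n: similarity(query, reference[n]))
    return name, similarity(query, reference[name])


# Hypothetical toy reference collection; the labels are illustrative only.
reference = {
    "toy reductase": "MSLPLKTIVHLVKPFACTAR",  # first 20 residues shown above
    "toy other protein": "MKKLLPTAAAGLLLLAAQPA",  # made-up sequence
}

# Same fragment with one residue changed (C to S at position 17).
query = "MSLPLKTIVHLVKPFASTAR"

name, score = best_guess(query, reference)
print(name, round(score, 2))
```

The point is just the shape of the process: score the new sequence against everything we already know, then borrow the label of the closest match. If the right sequence is missing from the collection, the “closest” match can still be wrong.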

We can also get more specific, tying part of this sequence information to the specific activity of the protein. In the case of my example enzyme, the “business end” is the second half of the protein.

This kind of sequence data powers so much of what we do in modern biology, from guessing what individual proteins do all the way to generating entire metabolic models and then predicting literally every food source a microbe can grow on.

We’re missing a lot of sequences

Hundreds upon hundreds, in fact. For a lot of critical enzymes.

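These are the “orphan enzymes”: enzymes whose activity has been characterized but for which we have no sequence. As part of our Orphan Enzymes Project, we’ve tried to figure out how we can find sequences for these hundreds of enzymes.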

After all, each enzyme represents hundreds of thousands of dollars in lost research…and each enzyme sequence we don’t have undercuts the value of all of our fantastic sequence-based tools.

We can rapidly identify a lot of orphan enzymes

Our new paper describes a few case studies showing how we can identify orphan enzymes in the lab and just how big an impact identifying a sequence for each orphan enzyme has.

We found several cases where we were actually able to buy samples of enzymes that had never been sequenced. We were also able to collaborate with Charles Waechter and Jeffrey Rush of the University of Kentucky to find sequence data for an enzyme they’d been working hard to characterize.

The key point of this part of our work is that many enzymes that are “tricky” for one set of researchers to sequence may be entirely doable for another group that specializes in sequencing. The more we collaborate, the more value we get out of all of our work.

Identifying orphan enzymes has a big impact

The second part of our work asks the simple question, “Does it matter?”

For each enzyme for which we found sequence data, we asked “How many enzymes should we now re-annotate?”

In other words, of all the guesses that have been made about what proteins do, for how many is our enzyme now the best guess, based on how closely the protein’s sequence matches the one we found?

It turns out that each enzyme sequence we identified led to anywhere from 130 to 430 proteins getting new, better guesses about their functions.
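The counting behind a number like that can be sketched in a few lines of Python. This is not the procedure from the paper, just an illustration under simple assumptions: every protein already has a similarity score to its current best match, we score it against the newly sequenced enzyme, and we count the proteins where the new enzyme wins. The protein IDs, scores, and function name are hypothetical.

```python
def count_reannotations(current_best, new_enzyme_scores):
    """Count proteins whose best match would switch to the new enzyme.

    current_best: protein ID -> score of its current best reference match
    new_enzyme_scores: protein ID -> score against the newly sequenced enzyme
    Proteins with no current match are treated as having a score of 0.0.
    """
    return sum(
        1
        for protein, new_score in new_enzyme_scores.items()
        if new_score > current_best.get(protein, 0.0)
    )


# Hypothetical scores for three proteins (higher means more similar).
current_best = {"P1": 0.40, "P2": 0.85, "P3": 0.30}
new_enzyme_scores = {"P1": 0.75, "P2": 0.50, "P3": 0.90}

changed = count_reannotations(current_best, new_enzyme_scores)
print(changed)
```

In this toy data, P1 and P3 match the new enzyme better than their old best match, so they would be candidates for re-annotation, while P2 keeps its existing label.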

That’s hundreds of potentially incorrect predictions or misled researchers averted by just “finishing the job” of sequencing a handful of enzymes.

Given the tremendous amount of work that has gone into characterizing each of these enzymes, it’s essential that we take every opportunity to apply modern sequencing expertise to existing samples.

Comments on the paper are welcome, whether here or on the paper itself at PLOS ONE.

