I have a new addition to my list of favorite scientists.
There’s clever, and then there’s having a vision for future needs. It’s the second one that’s really tricky. It’s also important, since it opens up avenues of discovery that we might otherwise never see…or at the very least, that may force us to keep revising our methods over and over again to try and keep up with the data.
Margaret Oakley Dayhoff was responsible for one particularly elegant and yet stunningly important change in how we handle biological data, a change that required a vision for a future for bioinformatics well beyond what most of her peers imagined.
At last month’s 4th International Biocuration Conference in Tokyo, held by the International Society for Biocuration, Raja Mazumder gave a nice talk about “Community Annotation in Biology” – that is, getting biologists outside of bioinformatics to provide their expertise to efforts to distill the maximum possible value from the wealth of biological data we’re producing these days. By way of setting up his talk, he spoke about Margaret Oakley Dayhoff and her role in calling for community assistance in the early days of protein bioinformatics.
However, the part that really stuck in my head was something else entirely about Dayhoff’s work – something near and dear to my heart.
That something is the one-letter amino acid code.
Let me explain.
So how do we talk about protein parts?
A protein is made of a series of amino acids linked together in a continuous chain (and then folded up into a three-dimensional structure, but that’s a topic for another day).
Here’s the chemical structure of the amino acid lysine:
If you’re not used to looking at this way of representing a chemical structure, you can learn more here. The short-and-sweet version is that this is a drawing of what the amino acid lysine actually looks like, as a three-dimensional object.
Human insulin starts out as a chain of 110 amino acids in a row. You might imagine that it would be essentially impossible to talk or think about a protein like this if you had to do it by drawing over a hundred of those pictures in a row. It would be like having to give directions by drawing a picture of every house that you pass on the way from point A to point B.
So we don’t do that.
In fact, it’s even inconvenient if you have to write the full names of the amino acids. Technically, the first little bit of the insulin sequence should be written like so:
Even more technically, that should be written methionyl-alanyl-leucyl…, but either way, it’s super inconvenient, and makes for some very large words.
The short version
Sensing the impracticality of this method, biologists eventually settled on three-letter abbreviations for all the amino acids. For example, lysine becomes “Lys.” That bit of insulin sequence we showed you above gets a lot shorter as a result:
The three-letter codes are obviously a vast improvement over having to write out the full name of each amino acid every time you talk about them. They’re also easy to learn, since each one is basically a “scrunched up” version of the full name:
“Leucine” (pronounced loo-seen) becomes “Leu” (pronounced “loo”)
“Tryptophan” (trip-toe-fan) becomes “Trp” (we usually say “trip”)
This is the abbreviation system that was in place during the early days of bioinformatics.
Dayhoff, however, saw that it was not going to work, and suggested an alternative.
This is where the vision part comes in. Dayhoff was one of the early movers involved in collecting protein sequence data, well before we had easy access to computers, or, say, the Internet. In other words, she wrote letters to folks and asked them to send her the sequence information, so it could be published in book form for the edification of interested parties.
But she saw that the rate of acquisition of sequence data was increasing, and she also had the vision to realize that it would pick up tremendously in the future.
In his talk, Raja focused on the increasing need Dayhoff saw for community assistance as the pace of sequence acquisition increased. This is a big need even today, since expertise is scattered across a huge number of researchers, but only a small percentage work directly on actually collecting and evaluating sequence data.
However, along with this concern, Dayhoff thought that it would simply be impractical to keep using the three-letter code to refer to protein sequences. So, instead of this:
She proposed this:
In this version, each letter stands for one amino acid:
“Leucine” becomes “L”
“Tryptophan” becomes the somewhat less intuitive “W”
You see, there are a few sets of amino acids that share a given first letter, so we have to be a little creative when coming up with a single-letter code. You can find the full code here.
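To make the jump from three letters to one a bit more concrete, here is a minimal Python sketch of the translation. The table covers the 20 standard amino acids; the function name `three_to_one`, the hyphen separator, and the short fragment `Leu-Trp-Lys` are my own illustrative choices, not anything Dayhoff specified.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def three_to_one(sequence: str, sep: str = "-") -> str:
    """Convert a separator-delimited three-letter sequence to one-letter code.

    `sep` is an assumption of this sketch; real files use various layouts.
    """
    letters = []
    for token in sequence.split(sep):
        key = token.strip().capitalize()  # accept "LEU", "leu", " Leu "
        if key not in THREE_TO_ONE:
            raise ValueError(f"unknown amino acid abbreviation: {token!r}")
        letters.append(THREE_TO_ONE[key])
    return "".join(letters)


# Hypothetical fragment, chosen only because it uses residues named in this post.
fragment = "Leu-Trp-Lys"
print(three_to_one(fragment))                       # LWK
print(len(fragment), len(three_to_one(fragment)))   # 11 3
```

Even on a three-residue fragment, the one-letter form is less than a third as long, and each residue maps to exactly one character, which is what makes the sequences so easy to store and search.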
Although the three-letter code may seem a little bit more intuitive, the single-letter code ends up being a lot faster once you get used to it. More importantly, it makes it much easier to store and process vast amounts of sequence data. This is especially important since a search in the NCBI protein database for insulin gives you over 20,000 results:
…and since we can compute on these sequences, like so:
This is us comparing the sequence of the (predicted) insulin from a rodent called the degu with the insulin from chimpanzees. The score and the other statistics come from a computational method known as BLAST, which compares sequences by treating them as, effectively, lines of text.
In fact, many of the tools we use to work with protein sequence data were derived from computational methods originally designed for the analysis of human languages.
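To give a feel for what “computing on sequences” means once they are plain strings, here is a deliberately simple Python sketch. It is not BLAST, which searches for local alignments and scores them with substitution matrices and gap penalties; this only counts matching positions in two sequences that are assumed to be already aligned and of equal length. The function name `percent_identity` and the two toy sequences are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length sequences agree.

    Assumes the sequences are already aligned; no gaps are handled.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Toy one-letter sequences, made up for illustration (not real insulin data).
seq_a = "MALWKL"
seq_b = "MALWRL"
print(f"{percent_identity(seq_a, seq_b):.1f}% identical")  # 83.3% identical
```

With the one-letter code, the comparison is an ordinary character-by-character loop; with three-letter codes, you would first have to split every sequence into tokens.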
Margaret Oakley Dayhoff may not have imagined the specifics of how we work with protein sequence data, or the phenomenal amount of data we’d have just a decade after the turn of the millennium. What she saw, however, was that if we were going to work with a library of protein sequence data, we really needed to treat the protein sequences like words.
And for that, I will be forever grateful.