Quite a bit of the work in our group is about cutting the scut work out of experimental biology. My colleagues typically say that we’re “accelerating” research, but I usually couch it in terms of “cutting waste.” They’re two sides of the same coin, but I think it makes a lot more sense to consider bioinformatics tools in terms of how they impact the time and effort we spend doing things that we don’t want to be doing.
Consider the example of owning a car. What does that car do for you?
Sure, it accelerates your day – after all, you’re going everywhere faster than you would have without it. But the real tragedy happens when your car breaks down. Suddenly, that 20 minute drive to work is 40 minutes on a bus or a laughably impossible day hike. In other words, it’s about the cost that you’re cutting out by using the tool – in this case, the car.
Sprint, stumble, walk, then run again
Genome sequencing is ridiculously fast. Just in terms of pure sequencing power, it takes maybe 1% of a single run from a state-of-the-art sequencing setup to plow through a bacterial genome. That’s not to say that sequencing a new genome is trivial – it’s not. But the chemistry and machinery behind sequencing have both advanced so far that the simple ability to churn through genomes is no longer the bottleneck.
Then things slow down a bit.
The first slowdown comes during the assembly and error checking steps…which I thankfully never, ever touch. I have a passing familiarity with the problems that pop up during assembly. For example, read lengths are a major issue. The “read length” is how long a continuous stretch of genome sequence you get in one go.
As you can imagine, you’d like read lengths to be long, since that means you have more ability to stitch together a finished genome by overlapping the pieces you’ve sequenced (since, you know, they’re more likely to overlap if they’re longer).
Fortunately, that’s not my complication to deal with.
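For the curious, here is a minimal sketch of why longer reads are easier to stitch together. It is a toy, not what a real assembler does: the function names, the made-up 21-base “genome,” and the `min_overlap` threshold are all assumptions for illustration.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no overlap of at least `min_overlap` bases exists.
    """
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def merge_reads(left, right, min_overlap=3):
    """Stitch two reads together if they overlap; otherwise return None."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


# Hypothetical 21-base "genome", used only to cut example reads from.
genome = "ATGGCGTACGTTAGCCTAGGA"

# Two short reads that leave a gap between them.
short_reads = (genome[0:6], genome[8:14])
# Two longer reads that share four bases.
long_reads = (genome[0:12], genome[8:21])

print(merge_reads(*short_reads))  # None: nothing to stitch on
print(merge_reads(*long_reads))   # the full toy genome
```

The short reads never touch, so there is nothing to join them with. The longer reads cover the same territory with room to spare, and the shared stretch is what lets you glue them together.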
Once we get out of the genome assembly woods, then we have the annotation step…which I’m also not going to talk about here, but which a lot of our work does directly support. Annotation is the slow, messy, and often inaccurate process by which we guess what the genes in our newly sequenced organism actually do.
The quality of a genome’s annotation still relates pretty directly to the amount of hands-on human time put into it. There are a bunch of solid predictive tools that try to fill in the gaps, but there’s still a tremendous amount of room for error. Right now, we’re stuck with the choice between “good” and “fast,” and that means that annotation is, at best, a walking step.
Back up into a run
One of our group’s major products is the Pathway Tools software package. The Pathway Tools package is in many ways the next comprehensive step after that initial annotation. It takes your shiny, new annotated genome and converts it into a database representing that organism’s genes, proteins, transporters, and metabolic network. You can read more about this process elsewhere. The concise version is that the software takes the genome’s annotation, then uses that to guess which metabolic pathways, transporters, and protein complexes the organism has…and then uses those pathways it guessed to figure out additional enzymes that are likely to be present in your organism.
I like to call that post-pathways second pass “second stage” annotation.
These databases are super convenient in terms of getting to check out the organism’s biology. Here’s an overview of the metabolism of Staphylococcus aureus, strain MRSA252, an antibiotic-resistant strain found in hospitals in the United Kingdom:
That’s a visualization of the metabolic network of that drug-resistant bug as inferred and predicted by the Pathway Tools software. There are all sorts of handy tricks you can do with this kind of visualization-rich database, but for now, the important part is that this is a fully computable database.
In other words, you can use it to model the organism’s metabolic biology and ask handy questions like “Does this bug grow on glucose?”
Some easy steps from genome to model
So, after we’ve done all that work in building a database, we hit another slow, slow step…slower than that post-stumble walk I was just talking about above.
That’s the part where we actually do stuff with the organism in the lab. And then we have to troubleshoot over and over again for each new thing we want to do. Maybe we’re trying to get our microbe to crank out a bunch of our favorite small molecule – let’s say pinene, which you may have correctly guessed is responsible for pine scent. What should we grow our pine-scented microbe on? Will it make more pinene on sucrose instead of glucose? Are there some genes we should knock out to help raise our pinene yield?
Geez. Sounds like about a million experiments.
We can try to cut out some of the massive time and effort involved here using our expert knowledge, but each experiment is a ton of time and effort, even in a really friendly scientific model organism like E. coli.
We’d rather not do that.
MetaFlux to bridge genomes and lab work
So, this paper just came out from some of the very clever folks I work with:
In it, they describe some recent work they did on taking that first step that Pathway Tools handles, going from genome to model:
…and adding in the next step that we’d really want to have, where we go from model to what I’ll call a “functional model:”
Specifically, the new tool, called “MetaFlux,” uses mixed integer linear programming to make what’s called a “flux balance analysis” model from the database that Pathway Tools built from your genome.
Flux balance analysis, traditionally shortened to FBA, is a modeling method that approaches the organism’s metabolism as a steady-state system.
You’ll note right up front that a real organism’s metabolism is never a steady-state system. This is what we call a “simplifying assumption.” Even though it’s kind of vigorously wrong, it does a pretty good job of modeling many metabolic situations.
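To make “steady state” a little more concrete: the assumption says that every internal metabolite is made exactly as fast as it is used up, which is usually written as S·v = 0, where S is the stoichiometric matrix and v is the vector of reaction fluxes. Here is a small sketch of that bookkeeping. The three-reaction network, the metabolite names, and the flux values are all invented for illustration, and a real FBA run would go on to pick the best v with a linear programming solver rather than just checking one.

```python
# Toy network (hypothetical): uptake makes A, convert turns A into B,
# biomass consumes B.
REACTIONS = ["uptake", "convert", "biomass"]

# Rows are metabolites, columns follow the order of REACTIONS.
# Positive numbers mean "produced", negative mean "consumed".
STOICHIOMETRY = {
    "A": [1, -1, 0],
    "B": [0, 1, -1],
}


def net_production(fluxes):
    """Net rate of change of each metabolite for the given reaction fluxes."""
    return {
        metabolite: sum(coeff * flux for coeff, flux in zip(row, fluxes))
        for metabolite, row in STOICHIOMETRY.items()
    }


def is_steady_state(fluxes, tolerance=1e-9):
    """True when no metabolite is piling up or draining away."""
    return all(abs(rate) <= tolerance for rate in net_production(fluxes).values())


balanced = [10, 10, 10]    # everything taken up flows through to biomass
unbalanced = [10, 8, 8]    # A is taken up faster than it is converted

print(is_steady_state(balanced))     # True
print(net_production(unbalanced))    # {'A': 2, 'B': 0}
print(is_steady_state(unbalanced))   # False
```

In the unbalanced case, A accumulates at 2 units per time step. A living cell can do that for a while; the flux balance model simply refuses to consider it.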
Making a hideously slow step a whole lot faster
Folks in the field have been doing flux balance models for years… very, very slowly. It turns out that the traditional approach to FBA generation involves making the world’s biggest spreadsheet containing all of the known reactions in the organism’s metabolism. The next steps look like this:
That “figure out where it’s broken” step is very labor intensive. It’s also a good thing, since it catches weird holes in your model.
For example, consider the pathway in E. coli that breaks down cyanate
I introduced the idea of metabolic modeling by talking about how we’re going to go from genes to enzymes and then to chemical networks in the body. Sometimes, however, the reactions just happen, like the breakdown of carbamate that gives us ammonia and carbon dioxide. In these cases, we might actually miss the reaction when we’re making our database, if it isn’t already included in a metabolic pathway in our MetaCyc database.
That’s actually one reason to have those pathways – to catch those reactions we can’t predict directly from the organism’s genes.
If you didn’t have the carbamate breakdown reaction, it would leave a gap in your model metabolic network that would “break” if you tried to model growth that depended on cyanate.
Imagine finding gaps and breaks like this over and over again, and you have a good feel for the tediousness of troubleshooting a flux balance model.
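Here is a rough sketch of what that kind of break looks like in code. It is a deliberately simplified stand-in: it uses a plain “what can I reach from my nutrients?” check rather than a real flux balance calculation, the cyanate chemistry is boiled down to three reactions, and `ammonia_assimilation` and `biomass_nitrogen` are hypothetical placeholders for everything the cell does with ammonia afterwards.

```python
def producible(reactions, nutrients):
    """Everything reachable from the nutrients by firing reactions repeatedly."""
    available = set(nutrients)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions.values():
            # A reaction can fire once all of its substrates are available.
            if set(substrates) <= available and not set(products) <= available:
                available.update(products)
                changed = True
    return available


def missing_targets(reactions, nutrients, targets):
    """Targets the network cannot make from the given nutrients."""
    return sorted(set(targets) - producible(reactions, nutrients))


# Each reaction is (substrates, products). Enzyme-catalyzed steps only.
MODEL_WITH_GAP = {
    "cyanase": (("cyanate", "bicarbonate"), ("carbamate", "CO2")),
    "ammonia_assimilation": (("ammonia",), ("biomass_nitrogen",)),
}

# The spontaneous step that no gene would ever point you to.
SPONTANEOUS = {
    "carbamate_breakdown": (("carbamate",), ("ammonia", "CO2")),
}

NUTRIENTS = ("cyanate", "bicarbonate")
TARGETS = ("biomass_nitrogen",)

print(missing_targets(MODEL_WITH_GAP, NUTRIENTS, TARGETS))
print(missing_targets({**MODEL_WITH_GAP, **SPONTANEOUS}, NUTRIENTS, TARGETS))
```

The first call reports that the nitrogen target can’t be made: the network gets as far as carbamate and stalls. Add the spontaneous breakdown and the list of missing targets comes back empty.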
This is where MetaFlux steps in. Starting with a basic metabolic network model and a set of inputs (stuff you’re feeding that organism) and outputs (stuff it has to make to live), MetaFlux tries to see if it can make a working flux balance model…and if it can’t, it tries adding in the missing reactions until the model works.
It’s about weighing the costs
The basic principle behind what MetaFlux is trying to do is that basically all of our annotated genomes are going to have gaps – that is, areas where our predicted take on the organism’s metabolic network is going to be incomplete. This is just a given – we can’t even figure out any reasonable guess for what up to half the genes in any new genome do, so it would be surprising if we were able to build a perfect, working metabolism on the first try.
So, when we first build our metabolic network, it’s gonna have gaps. We want to fill those gaps, so we’re going to plug in reactions derived from our MetaCyc database, which contains over 10,000 distinct metabolic reactions culled from a wide range of organisms.
Now, it would be trivial to just chain together a bunch of reactions to patch a hole in our current metabolic model. Of course, that would be a lot like having a map that shows how to get from your home to the grocery store and “fixing” a missing city block from the route by replacing it with a journey back and forth across the entire continent.
With that in mind, we start introducing “costs” that help inform MetaFlux what direction we want our replacements to go. For example…
…if we were engineering a new pathway into E. coli and wanted to add as few new genes as possible, we could assign a high cost (expressed as an actual number) to the act of “adding a new reaction.”
…if we wanted to make sure that any new reactions didn’t undercut growth, we could assign a high value to growth, effectively making it costly to move away from robust growth.
…if we wanted to avoid predicting “plant” reactions for bacteria, we could assign a high cost to adding reactions that are not known to occur in our type of organism. This will mean that MetaFlux will only add these evolutionarily distant reactions if they are critical to making the model work.
MetaFlux takes all of these costs – that we’ve defined based on our goals – and uses them in calculating a working metabolic network that can yield a flux balance model.
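A toy version of the cost idea might look like the sketch below. To be clear about what is assumed: MetaFlux solves this with mixed integer linear programming, whereas this sketch just tries every combination of candidate reactions and keeps the cheapest one that works, which only scales to a handful of candidates. The reaction names, the two-step “plant detour,” and the cost numbers are all made up.

```python
from itertools import combinations


def can_make(reactions, nutrients, targets):
    """True if every target is reachable from the nutrients."""
    available = set(nutrients)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions.values():
            if set(substrates) <= available and not set(products) <= available:
                available.update(products)
                changed = True
    return set(targets) <= available


def cheapest_fill(model, candidates, costs, nutrients, targets):
    """Brute-force stand-in for the optimizer: try every subset of candidates.

    Returns (total_cost, [reaction names]) for the cheapest working patch,
    or None if no combination of candidates makes all the targets.
    """
    best = None
    names = sorted(candidates)
    for size in range(len(names) + 1):
        for combo in combinations(names, size):
            trial = dict(model)
            trial.update({name: candidates[name] for name in combo})
            if can_make(trial, nutrients, targets):
                total = sum(costs[name] for name in combo)
                if best is None or total < best[0]:
                    best = (total, list(combo))
    return best


# Simplified cyanate model with the spontaneous step missing.
MODEL = {
    "cyanase": (("cyanate", "bicarbonate"), ("carbamate", "CO2")),
    "ammonia_assimilation": (("ammonia",), ("biomass_nitrogen",)),
}

# Hypothetical pool of reactions we are allowed to add.
CANDIDATES = {
    "carbamate_breakdown": (("carbamate",), ("ammonia", "CO2")),
    "plant_step_1": (("carbamate",), ("intermediate_X",)),
    "plant_step_2": (("intermediate_X",), ("ammonia",)),
}

# Made-up costs: cheap for the plausible reaction, steep for the
# evolutionarily distant detour.
COSTS = {"carbamate_breakdown": 1, "plant_step_1": 10, "plant_step_2": 10}

NUTRIENTS = ("cyanate", "bicarbonate")
TARGETS = ("biomass_nitrogen",)

print(cheapest_fill(MODEL, CANDIDATES, COSTS, NUTRIENTS, TARGETS))
```

Both the one-step patch and the two-step detour would close the gap, but the costs steer the answer toward the single cheap reaction. Crank the cost of `carbamate_breakdown` up past 20 and the same code will happily take the long way around instead, which is the whole point: the numbers encode what you care about.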
MetaFlux is also set up to give you a “next best” answer when it simply can’t find a working metabolic network using all the rules you’ve given it. In those cases, it effectively tells you, “Well, you can’t have what you want, but what if you could make all but one of the products you wanted?”
The eventual goal – genome, model, go!
The end goal of all of this is to have a system where you can plug in an annotated genome, ask Pathway Tools to build a model organism database from it, and then ask MetaFlux to make a working flux balance model from that, so that you can now make predictions and model situations without months and months of scrolling through big spreadsheets of reactions looking for the missing and broken parts.
Right now, the chain from genome through to working model is not ready to go right out of the box, but it’s a pretty sweet set of tools if you have someone with a decent bit of technical savvy.