One of the most unnerving aspects of biological research is the possibility that your samples aren’t what you think they are. Most of my lab work has involved yeast (S. cerevisiae) and a handful of bacteria (largely E. coli and some cyanobacteria). In general, we didn’t maintain the cells by passaging them, and there were some obvious antibiotic and auxotrophic (nutrition-based) markers we could use to tell that the cells were basically what we thought they were.

But as I’ve learned since entering the exciting world of genome analyses, there is a great deal of variation within what is nominally the “same” organism and strain across different labs…or within the same lab at different times. The “default” E. coli strain, K-12 MG1655, has a neat little mutation in amino acid biosynthesis that easily reverts to wild type, which wreaks havoc on computational models that assume the affected gene is nonfunctional.

I’m quite interested in how we can account for these kinds of differences and make modeling and predictive tools that are resilient to them.

In their recent paper, “Hiding in plain view: Genetic profiling reveals decades old cross-contamination of bladder cancer cell line KU7 with HeLa,” Jager et al. applied a very basic kind of DNA profiling to many samples of a popular and widely used bladder cancer cell line. These cells were supposed to have been derived from a fairly mild bladder cancer sampled from a patient in 1980. They’ve been widely used since then to study and model bladder cancer.

Except it turns out that they’re not bladder cancer cells. As Jager and his colleagues discovered, basically all the KU7 cell lines in the world are actually a completely different kind of cell (the most common cancer cell line in the world, HeLa). This apparently started with cross-contamination back at the source.

So what does this mean for studies based on those cells? Presumably we’d want a way to mass-tag those publications, and all the databases or other informatics resources derived from them, with the true identity of the cells used. Is this reasonably achievable? And is there a good way to trace where the ideas or conclusions drawn from experiments on these misidentified cells ended up?

I’m not especially familiar with the bioinformatics and quantitative side of cancer biology, so I don’t know how much impact this specific discovery has on the large-scale data resources those fields rely on. Presumably this kind of thing is going to keep happening; we’ve certainly seen it in the misidentification and renaming of microbial samples from which enzyme and other metabolic data were derived. It would be handy to have consistent mechanisms in place for attaching additional metadata to publications so that this kind of “switch” can be tracked and propagated into downstream resources.
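To make the idea a bit more concrete, here is a minimal sketch of what that propagation might look like if provenance were recorded as a simple "derived from" graph. Everything in it is an assumption for illustration: the identifiers (paper:A, database:X, model:Y), the graph itself, and the function name propagate_correction are all hypothetical, and no existing database or publisher system is being described.

```python
from collections import deque

# Hypothetical provenance graph: each key is a resource, and each value lists
# the resources derived directly from it. All identifiers are made up.
DERIVED_FROM = {
    "cell_line:KU7": ["paper:A", "paper:B"],
    "paper:A": ["database:X"],
    "paper:B": ["database:X", "model:Y"],
    "database:X": ["model:Y"],
    "model:Y": [],
}


def propagate_correction(graph, source, note):
    """Breadth-first walk that attaches `note` to every resource downstream of `source`."""
    flagged = {}
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:  # visit each resource once, even if reachable by several paths
                seen.add(child)
                flagged[child] = note
                queue.append(child)
    return flagged


flags = propagate_correction(
    DERIVED_FROM, "cell_line:KU7", "KU7 samples reidentified as HeLa"
)
for resource in sorted(flags):
    print(resource, "->", flags[resource])
```

The traversal is the easy part. The hard part, and the one I don’t have an answer for, is getting publications and databases to record those "derived from" links consistently in the first place.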

There’s more discussion of this discovery and its consequences for publications that used the misidentified cells over at Retraction Watch.