Is there a reproducibility crisis in biomedical science? No, but there is a reproducibility problem

Posted by David Gorski on June 6, 2016

Most scientists I know get a chuckle out of the Journal of Irreproducible Results (JIR), a humor journal that often parodies scientific papers. Back in the day, we used to chuckle at articles like Any Eye for an Eye for an Arm and a Leg: Applied Dysfunctional Measurement and A Double Blind Efficacy Trial of Placebos, Extra Strength Placebos and Generic Placebos. Unfortunately, these days, reporting on science is giving the impression that the JIR is a little too close to the truth, at least when it comes to reproducibility, so much so that the issue even has its own name and Wikipedia entry: Replication (or reproducibility) crisis. It’s a topic I had been meaning to write about again for a while. Fortunately, a recent survey published in Nature under the somewhat click-baity title 1,500 scientists lift the lid on reproducibility finally prodded me to look into this question again. Before I get to the survey itself, though, I can’t help but do my usual pontificating to provide a bit of background.

The spectrum of reproducibility and the inherent messiness of science

Having been a PhD-holding scientist now for well over 20 years, and a physician since 1988, I’ve done a lot of experiments, published a fair number of papers in the peer-reviewed biomedical literature (but significantly fewer than I wish I had, because otherwise I would have reached the rank of Professor years ago instead of this year), and grappled with the problem of reproducibility in my areas of research. One of the first things I learned in graduate school is that, as nice and neat as the science sounds when it’s taught in the classroom, it’s anything but nice and neat. Indeed, when it comes to some molecular biology techniques, we used to joke about sacrificing goats to appease the gods of molecular biology to get experiments to work. Then, of course, the more complicated the experiment, the more ways there are for it to go awry. Of course, there is a difference between a technique, such as Western blot or PCR, not working and an experiment giving a result that can’t be reproduced. Both, however, happen, and it can be devilishly difficult to track down the cause. Indeed, I like to point out that one of the most difficult aspects of science to convey to the general public, particularly about science-based medicine, is just how messy it can be. One of the first lessons graduate students learn as they embark on their doctoral research is that early reports in the peer-reviewed literature are by their very nature tentative and have a high probability of ultimately being found to be incorrect or, more often, only partially correct.

Unfortunately, that is not science as it is imbibed by the public. Fed by too-trite tales of simple linear progressions from observation to theory to observation to better theory taught in school, as well as media portrayals of scientists as finding answers fast, most people seem to think that science is able to generate results virtually on demand. This sort of impression is fed even by shows that I used to watch a few years ago (and even liked) for their ability to excite people about science, for instance CSI: Crime Scene Investigation and its offspring. In their heyday, these shows portrayed beautiful people wearing beautiful pristine lab coats, backlit in beautiful labs, using perfectly styled multicolored Eppendorf tubes to do various assays and getting answers in minutes that normally take hours, days, or sometimes weeks. Often these assays are all done over a backing soundtrack consisting of classic rock or newer (but still relatively safe) “alternative” rock. And that’s just for applied science, in which no new ground is broken and no new discoveries made. And don’t even get me started on the stereotypical medical examiners in so many crime show dramas who can generate DNA results or detailed chemical analyses seemingly almost instantly. Real scientists know that life (and science) are complicated, much more than they are on television and in other works of fiction.

Still, the messy nature of scientific research doesn’t automatically mean that reproducibility isn’t a problem. At the very least, we as scientists should do all that we can to minimize the difficulty of reproducing the experiments that we do. One major reason, aside from wanting to get our results right, is that the first step in building on any scientific finding is to reproduce the experiments that led to that finding before going on to do more experiments to expand on or more deeply investigate that finding. When experiments can’t be reproduced, other labs waste a lot of time, effort, and resources. On the other hand, some experimental results are wrong, and there will always be experimental results that turn out to be wrong. Attempts to reproduce those results are how scientists find out they’re wrong. The question is: How much irreproducible science is unavoidable? Is there really a “replication crisis” or “reproducibility crisis” or whatever you want to call it? And if there is such a crisis, what should we scientists do to address it?

As a scientist quoted in the Nature article about the survey notes:

Failing to reproduce results is a rite of passage, says Marcus Munafo, a biological psychologist at the University of Bristol, UK, who has a long-standing interest in scientific reproducibility. When he was a student, he says, “I tried to replicate what looked simple from the literature, and wasn’t able to. Then I had a crisis of confidence, and then I learned that my experience wasn’t uncommon.”

Indeed. I myself went through just such a rite of passage.

A personal anecdote on reproducing scientific results

One of the key results trumpeted by this survey (more details on the survey and its findings, strengths, and shortcomings in a moment) is that more than 70% of researchers have tried and failed to replicate another scientist’s experiments. If anything, I suspect that number is probably low. In any case, I can certainly say that I’m one of those researchers who’s tried and failed to replicate another researcher’s results. No, maybe that’s not quite correct, as you will see.

Back in 1996, as part of my surgical oncology fellowship at the University of Chicago, I did research in the laboratory of the chair of the Department of Radiation and Cellular Oncology. It was there that I first learned of the work of one of my all-time most admired scientists, surgeon-scientist Dr. Judah Folkman, who was basically the father of modern tumor angiogenesis research. Angiogenesis is the normal physiologic process of growing new blood vessels. This process is critical to many normal bodily functions, such as wound healing, the menstrual cycle, and others, but tumors hijack the process to supply themselves with blood and the oxygen and nutrients it brings. Blocking angiogenesis, Folkman hypothesized, could therefore be an effective anticancer strategy. In any case, I admired Folkman so much that one of my earliest posts for Science-Based Medicine was a tribute to him after his sudden death from a heart attack at age 74. The way I first became acquainted with him was through our laboratory’s collaboration with him to study the effect of combining two of his angiogenesis inhibitors, angiostatin and endostatin, with radiation therapy in rodent models of cancer.

Dr. Folkman’s discovery of angiostatin and endostatin began with an ingenious strategy that grew out of the clinical observation that sometimes tumor metastases appear shortly after the operation to remove the primary tumor. Folkman found a mouse tumor model that mimicked this behavior and in the early 1990s did a series of pioneering experiments. In a strain of Lewis lung carcinoma cells of low metastatic potential (LLC-LM), when cells are injected into C57BL/6 mice and allowed to grow subcutaneously, if the tumor is left alone, mice develop only microscopic lung metastases. These metastases do not grow and kill the mouse. If, however, the primary cancer is removed, then many large lung metastases grow rapidly. The results of the experiment above strongly implied that the primary tumor was secreting something that suppresses the growth of microscopic metastases. After this, the Folkman group did what we like to call “brute force” science, collecting mouse urine and analyzing it for tumor suppressive activity until they were able to purify a single 38 kDa peptide, which they designated angiostatin. This involved analyzing literally gallons of mouse urine. (Who said science isn’t glamorous?) Once Folkman’s group had a bunch of angiostatin on hand, it performed the following experiment. Two groups of mice were injected with LLC-LM and the tumors allowed to grow to a certain size, after which they were surgically removed. One group was treated with angiostatin, and the control group with saline. The result was that the control group developed massive lung metastases and died, while the group treated with angiostatin had microscopic lung metastases that never grew beyond a ball of cells. Dr. Folkman then demonstrated that it was the inhibition of angiogenesis by the angiostatin that kept these tumors in check. Ultimately, he used a similar method to discover endostatin, and later he demonstrated that endostatin could induce tumor dormancy in mice.

You can see why Ralph Weichselbaum, my research mentor, wanted to test combining angiostatin with radiation therapy. Our results were ultimately published in Nature, the only Nature paper on my CV (and, alas, not even as first author). Science works, right? Yes, it does, but the path to these results was not straight. It was widely known at the time that other laboratories were having difficulty reproducing Folkman’s striking results. In our case, we were not observing nearly as potent an antitumor effect as Folkman had described with angiostatin. We wondered if it was something to do with the angiostatin itself, which was being made in bacteria from a plasmid by our collaborators at Northwestern University. Given that Folkman was one of the best scientists I ever met (and I did have the opportunity to meet him on three occasions), none of us doubted his results, and we assumed that it must be something we were doing.

So what happened next? Weichselbaum contacted Folkman, who provided reagents, protocols, and advice, as well as some angiostatin made in his laboratory. It turns out that the peptide we were making was easily denatured (unfolded), which was why it was not as potent as Folkman had reported. Now here’s why I say we couldn’t replicate his results. It’s because we couldn’t fully replicate his results. Our angiostatin inhibited the growth of a wide variety of tumors, but, even after applying the tweaks to our angiostatin production suggested by Folkman, in our hands angiostatin never inhibited tumor growth as potently as Folkman had reported. So in other words, there could easily have been something else going on that we never figured out. Be that as it may, Folkman had the best attitude I’ve ever seen in a scientist regarding reproducibility, as we learned later when we heard of how he had done the same thing for several other labs, even to the point of dispatching one of his postdocs to help other investigators to get angiostatin and endostatin to work. Still, few investigators could ever quite replicate Folkman’s initial results, although many demonstrated that angiostatin and endostatin were potent angiogenesis inhibitors.

Eventually, angiogenesis inhibitors were clinically validated, in particular Avastin, which is simply a humanized monoclonal antibody against vascular endothelial growth factor (VEGF). (I also played with anti-VEGF antibodies back in the day.) Unfortunately, no angiogenesis inhibitor in humans has ever been as potent as angiostatin and endostatin were in mice. Angiogenesis inhibitors were a useful addition to our anticancer armamentarium, but, contrary to how they were portrayed in 1998, they were no magic bullet.

The point of this anecdote is that reproducibility falls on a spectrum. Did I fail to reproduce Folkman’s results? Yes, I reproduced the general result that angiostatin inhibited tumor growth by blocking angiogenesis, but, no, I didn’t reproduce an effect anywhere near as powerful as the one Folkman had reported. Replication of a result can range from total failure to replicate (e.g., if I had failed to show any antitumor effect of angiostatin at all) to partial failure to replicate (e.g., what actually happened) to success at replication (e.g., if I had shown angiostatin to block tumor growth as powerfully as Folkman had).

Survey says: Reproducibility is a crisis!

Now let’s take a look at the Nature survey. It’s not a scientific survey or even a poll, really, which made me think of dismissing it almost out of hand. Basically, Nature e-mailed the survey to its readers and advertised it on affiliated websites and social media outlets as being “about reproducibility.” So, in other words, this is nothing even resembling a sampling designed to mirror the scientific community, as political polls are designed to mirror the population being polled. Nature itself even blithely notes that the survey “probably selected for respondents who are more receptive to and aware of concerns about reproducibility.” (“Probably”?) Even so, given that it’s basically an Internet poll, I don’t think the survey is without merit, as it does suggest that there is at least a widespread perception among scientists that there is a problem.

For example:

More than 70% of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own experiments. Those are some of the telling figures that emerged from Nature’s survey of 1,576 researchers who took a brief online questionnaire on reproducibility in research.

The data reveal sometimes-contradictory attitudes towards reproducibility. Although 52% of those surveyed agree that there is a significant ‘crisis’ of reproducibility, less than 31% think that failure to reproduce published results means that the result is probably wrong, and most say that they still trust the published literature.

The average lay person might find it odd that half of the respondents had failed to reproduce their own research, but it makes more sense if you look at the actual survey questions, to which the answers were Yes, No, or “I can’t remember”:

Tried and failed to reproduce one of your own experiments
Tried and failed to reproduce someone else’s experiment
Published a successful attempt to reproduce someone else’s work
Published a failed attempt to reproduce someone else’s work
Tried and failed to publish a successful reproduction
Tried and failed to publish an unsuccessful reproduction

I’ve failed to reproduce my own experiments before. When that happens, I either figure out why I couldn’t reproduce the results of that experiment, or I don’t publish and move on to something else. (Of course, that ignores the question of how hard and long I keep plugging away to find out why I can’t reproduce, a question impacted by multiple factors.) That’s how science works, and spurious results are not uncommon. It’s why we often do experiments in triplicate and repeat the same experiment.
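The logic of repeating an experiment can be illustrated with a rough simulation. This is only a sketch under stated assumptions, not anything taken from the survey: it assumes there is no real effect at all, that each run compares three control and three treated measurements drawn from a normal distribution with known standard deviation 1, and that a "hit" is a difference passing a simple two-sided z-test at the 5% level. The function names and parameters are hypothetical and chosen only for illustration.

```python
import math
import random


def run_experiment(rng, n=3, true_effect=0.0):
    """Return +1 or -1 for a 'significant' difference in that direction, else 0.

    Assumption: values are normal with known standard deviation 1, so a
    z-test on the difference of group means at the two-sided 5% level applies.
    """
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    diff = sum(treated) / n - sum(control) / n
    z = diff / math.sqrt(2.0 / n)  # standard error of the difference of two means
    if abs(z) > 1.96:
        return 1 if z > 0 else -1
    return 0


def simulate(trials=20000, repeats=3, seed=0):
    """Estimate how often a spurious hit appears once versus in every repeat."""
    rng = random.Random(seed)
    single_hits = 0
    repeated_hits = 0
    for _ in range(trials):
        outcomes = [run_experiment(rng) for _ in range(repeats)]
        if outcomes[0] != 0:
            single_hits += 1
            # Count only hits that recur in the same direction in every repeat.
            if all(o == outcomes[0] for o in outcomes):
                repeated_hits += 1
    return single_hits / trials, repeated_hits / trials


single_rate, repeated_rate = simulate()
print(f"spurious hit in a single run: {single_rate:.3f}")
print(f"same spurious hit in all three runs: {repeated_rate:.5f}")
```

Under these assumptions a spurious hit shows up in roughly one run in twenty, but the same spurious hit recurring in three independent runs is far rarer. Real experiments are messier than this toy model (variances are not known, repeats are not fully independent, and technical replicates are not the same as independent experiments), so the numbers should be read as an illustration of the principle only.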

One interesting result of this survey is which specialties view their published results as being the most reliable. Of note, physicists and chemists have the most confidence in their published literature, with medicine being among the lowest. I can think of potential reasons for that. Physicists, for instance, can often do many more replicates of their measurements than is possible in medicine; indeed, in medicine, given the ethics of clinical trials, there are often just a handful of trials addressing major questions, sometimes only one.

Is reproducibility in science a crisis?

In a word, no.

I don’t like the word crisis to describe what are in fact problems. The word crisis implies an acute time of difficulty or danger, a turning point, or a time when a difficult decision must be made. Reproducibility in science is a problem, a chronic problem, in fact. It is not a crisis, and there is nothing in this survey that suggests we’re coming to a turning point or something horrible is about to happen if we don’t act. In fact, I’m not even convinced that the problem is quite as severe as it is being portrayed. For example, consider one of the studies frequently cited as evidence that only 10% of biomedical science can be reproduced. As I discussed when I analyzed that claim when it was first made, that survey was written by Lee M. Ellis and a former Amgen executive (C. Glenn Begley) and looked at whether pharmaceutical company scientists could reproduce preclinical results from 53 “landmark” studies as they searched for a way to target new molecular mechanisms those results revealed. As I pointed out at the time, preclinical research is, by definition, preclinical. It’s the groundwork, the preliminary research, that needs to be done to determine the plausibility and feasibility of a new treatment before testing it out in humans. As such, preclinical research encompasses basic research and translational research and can include biochemical, cell culture, and animal experiments.

It’s worth reiterating that what was being discussed was, essentially, frontier science published in very high impact journals, which is why it struck me at the time as rather strange that the authors found it so amazing and deplorable that much of the science at the very frontiers turns out not to be correct when tested further. As I’ve discussed on multiple occasions, the science that is published in the highest profile, most prestigious journals is almost by definition the most tentative science. Given that, it is surprising how much of what is published in such journals actually does stand the test of time, but it should not be surprising that much of it does not. However, the very prestige of such journals gives such research seemingly more authority than research published in less prestigious journals. Moreover, the Amgen executive who co-authored this report led a group that scoured high impact journals for cutting edge studies that appeared to have identified promising molecular targets. Then he had a veritable army of scientists, about 100 of them in the Amgen replication team according to this news report, who were ready to pounce on any published study that suggested a molecular target the company deemed promising. No wonder he could replicate only 11% of the results, particularly given that their definition of “non-reproduced” was assigned “on the basis of findings not being sufficiently robust to drive a drug-development programme.” So in reality, the failure of reproducibility in this oft-cited article is a failure of being able to extend the results sufficiently to justify the resources needed to translate the result to humans, which is a very different thing than a failure to reproduce the experimental results.

Lost in the discussion of the reproducibility problem are a lot of these nuances, but, again, that doesn’t mean there isn’t a problem, and it doesn’t mean that scientists shouldn’t do something, but what?

Defining the reproducibility problem

One major issue that reformers run head on into whenever discussing reproducibility is that there is no consensus on what, exactly, constitutes adequate reproducibility in science. An accompanying editorial notes this very issue:

What does ‘reproducibility’ mean? Those who study the science of science joke that the definition of reproducibility itself is not reproducible. Reproducibility can occur across different realms: empirical, computational and statistical. Replication can be analytical, direct, systematic or conceptual. Different people use reproducibility to mean repeatability, robustness, reliability and generalizability.

Economists and social scientists often use the term to mean that computer code and data are available so that someone would be able, if so inclined, to redo the same analysis using the same data. For bench scientists, who made up most of our respondents, it usually means that another scientist using the same methods gets similar results and can draw the same conclusions. We asked respondents to use this definition.

Even with a fixed definition, the criteria for reproducibility can vary dramatically between scientists. Senior scientists will not expect each tumour sample they examine under a microscope to look exactly like the images presented in a scientific publication; less experienced scientists might worry that such a result shows lack of reproducibility.

In an article from last year about scientific reproducibility, John Ioannidis and C. Glenn Begley observed in a much better discussion of the problem of reproducibility than the one Begley co-authored that I mentioned above:

There is no clear consensus as to what constitutes a reproducible study. The inherent variability in biological systems means there is no expectation that results will necessarily be precisely replicated. So it is not reasonable to expect that each component of a research report will be replicated in perfect detail. However, it seems completely reasonable that the one or two big ideas or major conclusions that emerge from a scientific report should be validated and withstand close interrogation.

I guess by that definition, I did replicate Judah Folkman’s results after all!

Addressing the causes of the reproducibility problem

The problem of defining what constitutes adequate reproducibility is not a trivial one, and until there is a consensus, there will only be so much that can be done. Still, the Nature survey also asked scientists what they have done and what should be done to make research more reproducible. It turns out that two-thirds of the respondents had instituted procedures to increase reproducibility, one-third within the last five years.

As for the causes of lack of reproducibility, the scientists surveyed listed the usual suspects, such as selective reporting, “publish or perish” pressure, low statistical power, insufficient replications in the original lab, poor experimental design, variable technical expertise, variability in reagents, and even fraud. To show how subtle these problems can be, it’s useful to relate another anecdote, this time from the survey:

Consolidating methods is a project unto itself, says Laura Shankman, a postdoc studying smooth muscle cells at the University of Virginia, Charlottesville. After several postdocs and graduate students left her lab within a short time, remaining members had trouble getting consistent results in their experiments. The lab decided to take some time off from new questions to repeat published work, and this revealed that lab protocols had gradually diverged. She thinks that the lab saved money overall by getting synchronized instead of troubleshooting failed experiments piecemeal, but that it was a long-term investment.

This is not an uncommon tale. The “institutional” memory of a laboratory is something that is very hard to maintain, given that, other than the principal investigator and (sometimes) a permanent technician and/or lab manager, most personnel in labs are only there for at most a few years to get their PhD or do a postdoctoral fellowship. Turnover is high by design. Often there are little “tricks” or nuances to various experimental techniques to get them to work well that are lost when someone leaves a lab. That’s why maintaining protocol notebooks is so important, but few labs do this as rigorously as they should.

The scientists surveyed identified a number of interventions that could improve reproducibility, including a better understanding of statistics, better mentoring, more robust experimental design, more within-lab validation, more time checking notebooks, journals enforcing standards, and incentives for formal replication. These are similar to proposals surveyed by Ioannidis and Begley last year:

Editors solicit replication bids
Plea to improve editorial standards
Reward quality rather than quantity
Emphasis on hypothesis testing research
Prospective, rigorous experimental plan
Improved understanding of statistics
Improved experimental design
Systematic reviews of animal studies
Use clinically relevant concentrations
Consider litter effects
Recommendations to improve computational biology
Focus on reproducibility in training, grants, journals
Pathology: Biospecimen quality control
Microarray analyses: Provide data access
Psychology: open data, methods and workflow
Meta-analyses of animal data
Judge academics on quality, reproducibility, sharing
Greater institutional responsibility
Apply greater skepticism to new technologies

Ioannidis and Begley do note, though:

The fundamental problem with most, if not all, of these proposals is the requirement for investigators, institutions, and journals to willingly comply: it is not at all clear how reasonable recommendations will be implemented or monitored while they remain voluntary. Conversely, were they to be mandatory, then one has to examine carefully how they would be enforced and by whom. The details of how to make these changes work can have a major effect on their efficiency.

Indeed. One can easily envision selective enforcement of such rules that give famous and influential scientists an easier time. Making such a mandatory system fair would be a major challenge.

This, of course, brings us to a major part of the problem, namely incentivization and cost. Scientists have long bemoaned that there is little incentive to publish a positive replication of another scientist’s experiment. That’s why getting “scooped” by another scientist can be so disastrous. Because the result is no longer novel, the scientist who gets scooped will have a hard time publishing his results in the better journals. If that result took a lot of time and resources to obtain, not being first to publish can be devastating to future publication and future funding. Worse, there’s even less incentive to publish negative replications.

In the meantime, as these issues are hashed out in the scientific community, the NIH has acted, instituting new requirements for reproducibility and rigor for research grants and mentored career development award applications. The NIH now requires a discussion of the strengths and weaknesses of previous research and the scientific premise, a description of rigorous experimental design and how bias will be eliminated, consideration of sex and other key variables, and validation of key biological and/or chemical reagents. Regarding the latter, believe it or not, a lot of cancer cell lines out there, when tested, turn out not to be the cell line they were thought to be. As for eliminating bias, again, believe it or not, results from many preclinical experiments are not measured in a blinded fashion, allowing observer bias to taint the results.

The new NIH rules are a step in the right direction but clearly don’t go far enough. I don’t believe that reproducibility in science is in “crisis,” as so many are claiming, but I do believe it’s a significant problem that needs to be addressed in a thoughtful way. I also have to concede that it’s scientists’ fault that we’re in the mess we’re in and that we haven’t addressed problems with reproducibility more robustly before now, given that this problem has been festering for a while now. If it takes labeling the problem as a “crisis” to get some action, I suppose I can live with that.

In considering how to encourage good science and discourage bad science, it is important to note that not all science, particularly biomedical science, should be assumed or expected to result in findings that have direct applications or to result in treatments for humans. As Ioannidis and Begley put it, an efficacy “of 100% and waste of 0% is unlikely to be achievable” even as they note that there is “probably substantial room for improvement.” It is also important to note that, contrary to the way some paint this problem, the concerns about reproducibility in science don’t invalidate the scientific method itself nor disprove “scientism.” Science-based medicine has yielded incredible benefits to human health over the last 150 years. Indeed, the solutions to this problem being proposed are intended to enhance the rigorous application of science, not to abandon it. Finally, I can’t help but note that it is scientists themselves who are being openly self-critical and debating how to fix perceived problems in science. That is a major strength, not weakness, of science.