Several years ago, when scientists reported in Nature that various types of cancer were consistently associated with distinct communities of microbes, the clinical possibilities were tantalizing. The 2020 paper used a form of artificial intelligence to tease out microbial DNA indicative of particular cancers and proposed a “new class of microbiome-based cancer diagnostic tools.” The research has since garnered hundreds of citations, provided data for more than 10 other studies, and helped justify at least one commercial venture aiming to use microbial sequences in a person’s blood to reveal a cancer’s presence.

But that promise is under scrutiny now that a group of researchers claims to have found “major data analysis errors” that undermine the paper’s conclusions. According to a manuscript the critics posted this week on the preprint server bioRxiv, the Nature authors failed to properly filter out human DNA from a database of sequenced cancer tissues. This led to millions of human sequences being wrongly classified as microbial, which may explain why the study found improbable microbes such as a seaweed bacterium associated with bladder cancer.

A separate, computational error related to the team’s analysis inadvertently generated cancer-specific patterns where there weren’t any, the preprint also contends. The paper’s “major conclusions are completely wrong,” says one of the preprint’s authors, Johns Hopkins University computational biologist Steven Salzberg.

Rob Knight, a microbiologist at the University of California, San Diego, and senior author on the Nature paper, rejects the criticisms, noting his lab already rebutted them in a response to an earlier preprint from some of the same scientists. “There’s really nothing this new preprint has that hasn’t already been addressed openly,” says Knight, who co-founded the company Micronoma in 2019 to develop microbiome-based cancer diagnostics. He also highlights his team’s 2022 paper in Cell, which used updated methods to analyze fungi and bacteria in tumors and drew similar conclusions to the Nature paper.

But researchers watching from the sidelines say the new preprint goes significantly beyond the earlier charges and that its arguments are compelling. “It’s a forensic deconstruction of the places where errors have crept into the original manuscript,” says Julian Parkhill, a bacterial geneticist at the University of Cambridge.

Microbiome science undoubtedly holds biomedical promise, and other groups have associated microbes with particular cancers, multiple researchers tell Science. But the debate offers a cautionary tale for microbiome studies relying heavily on computational approaches, says Lesley Hoyles, a microbiologist and bioinformatician at Nottingham Trent University. “There’s a lack of critique of what’s coming out,” she says. “We need people doing these kinds of analyses.”

Bacteria in unexpected places

The Nature paper used a repository known as the Cancer Genome Atlas (TCGA), which stores reams of DNA sequences from human cancer samples. Sequences are classified by the database as human or nonhuman based on whether they match a so-called human reference genome—although that classification is known to be imperfect.

The Nature study authors compared TCGA’s “nonhuman” sequences, as well as sequences from several dozen cancer-free people and 100 with cancer, with a database of DNA from bacteria, viruses, and other microbes. Doing so showed that different cancer types had a specific community of resident microbes. Feeding the data into machine-learning algorithms let the researchers reliably predict—sometimes with accuracies approaching 100%—the cancer type, or cancer’s absence, just from the microbial composition of a sample.

These correlations, the researchers suggested, could be used to devise tests to detect cancers from a person’s blood sample. (Many other groups are developing blood tests for cancer that pick up human DNA or proteins shed by tumors.) Knight’s team shared detailed results on a public website.

But there were puzzling findings among them, some readers noticed. Although the work identified many human bacteria in cancerous tissues, there was, in addition to the mysterious seaweed bacterium, a marine hydrothermal vent bacterium linked to prostate cancer, and a coral bacterium associated with melanoma. In a January preprint, researchers at the University of East Anglia suggested this might signal problems in the study’s methodology. In particular, they noted, the presence of unexpected microbes in cancer tissue could be a result of database mistakes where sequences of one species are mislabeled as another’s.

Parkhill explains it’s common for human DNA to accidentally end up in microbial databases, where it’s wrongly listed under microbial species names. This means that unless researchers properly filter out human DNA from their human tissue sequencing data before comparing it to a microbial database, they risk detecting organisms that aren’t really in the tissue. This, the first preprint proposed, is what happened in the Nature study.

In a 27-page response, Knight and colleagues disputed the observations’ importance, adding that their Cell paper, which used updated methods, had replicated the Nature paper’s conclusions. But the response wasn’t convincing to Salzberg, who developed some of the computational tools used in the Nature paper.

Teaming up with the East Anglia researchers (who themselves have a patent application for using bacteria as biomarkers for prostate cancer), Salzberg downloaded and reanalyzed a subset of the Nature study’s data. When they included extra filters for human DNA—one alone is rarely sufficient, Salzberg says—their analysis found that millions of sequences the Nature authors had assumed were microbial were in fact human. Many of the microbes the study identified, the new preprint argues, were never in the TCGA cancer samples at all.
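The contamination problem Parkhill and Salzberg describe can be illustrated with a toy sketch. This is not the pipeline used in either the Nature paper or the preprint. The sequences, the species names, the `classify` helper, and the exact k-mer matching rule are all hypothetical, chosen only to show how a human read that slips past a single filter can be reported under a microbial name when the microbial database itself contains human DNA.

```python
def kmers(seq, k=8):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def is_human(read, reference, k=8):
    """Flag a read as human if any of its k-mers occurs in the reference."""
    ref_kmers = kmers(reference, k)
    return any(km in ref_kmers for km in kmers(read, k))

def classify(reads, human_refs, microbial_db, k=8):
    """Drop reads matching any human reference, then count hits per species."""
    hits = {}
    for read in reads:
        if any(is_human(read, ref, k) for ref in human_refs):
            continue  # filtered out as human
        for species, seq in microbial_db.items():
            if kmers(read, k) & kmers(seq, k):
                hits[species] = hits.get(species, 0) + 1
    return hits

# Hypothetical toy data: two human references, the second covering
# sequence that is missing from the first.
PRIMARY_REF = "ACGTACGTTAGCCGATAGGCTAAC"
EXTRA_REF = "TTGACCGGTATCGGAATTCCGGAT"

# Hypothetical microbial database; the "seaweed" entry is contaminated
# with a human fragment (the tail is copied from EXTRA_REF).
MICROBIAL_DB = {
    "gut_bacterium": "GGGCCCAAATTTGGGCCCAAATTT",
    "seaweed_bacterium": "CATCATCAT" + "GACCGGTATCGGAATT",
}

READS = [
    "ACGTTAGCCGAT",    # human, present in PRIMARY_REF
    "GACCGGTATCGGAA",  # human, present only in EXTRA_REF
    "CCCAAATTTGGG",    # genuinely microbial
]

one_filter = classify(READS, [PRIMARY_REF], MICROBIAL_DB)
two_filters = classify(READS, [PRIMARY_REF, EXTRA_REF], MICROBIAL_DB)
print(one_filter)   # includes a spurious seaweed_bacterium hit
print(two_filters)  # only the genuine gut_bacterium hit remains
```

With a single filter, the second read survives and is counted as a seaweed bacterium. Adding the second, equally hypothetical, reference removes it, which is the general effect Salzberg attributes to using more than one human filter.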

Knight says the exact identity of sequences found in specific cancers doesn’t change his group’s conclusions. For diagnostic purposes, “If what you’re trying to do is distinguish … cancer cases from controls or one cancer from another, it matters much more that those distinguishing sequences exist rather than what you call them.” Analyses “can be refined with successive techniques and data sources,” he adds, noting that other research—and a minianalysis he and his colleagues ran this week—shows the microbial differences remain even when human sequences are more stringently ruled out.

Disputed patterns

The new preprint from Salzberg and colleagues also addresses the impressive ability of the Nature paper’s computer models to predict cancer type based on a sample’s microbiome. Because the tissue samples came from multiple medical centers and were collected at different times, Knight and colleagues used a technique called normalization to try to remove some of the variability. But this process was problematic, the new preprint alleges: It introduced a distinct electronic tag to the data from each cancer type. This meant that when the team fed normalized data into their algorithms, the computer could surreptitiously just use the tag, rather than the microbial data, to determine which cancer type a sample came from.
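The kind of leakage the preprint alleges can be sketched with a toy simulation. It does not reproduce the normalization used in the Nature paper. The `leaky_normalize` function below is a hypothetical stand-in that simply shifts each sample by an offset tied to its cancer type, and the nearest-centroid classifier is likewise an assumption made for brevity. The features are pure random noise, so any predictive power after "normalization" can only come from the tag.

```python
import random

def make_noise_samples(n_per_type, n_features, types, rng):
    """Return (features, label) pairs whose features are pure noise."""
    return [([rng.gauss(0.0, 1.0) for _ in range(n_features)], t)
            for t in types for _ in range(n_per_type)]

def leaky_normalize(samples, types):
    """Hypothetical flawed step: adds a type-specific offset to every feature."""
    offset = {t: 3.0 * i for i, t in enumerate(types)}
    return [([x + offset[t] for x in feats], t) for feats, t in samples]

def split_by_type(samples, types, n_train):
    """Stratified split: the first n_train samples of each type go to training."""
    train, test = [], []
    for t in types:
        rows = [s for s in samples if s[1] == t]
        train += rows[:n_train]
        test += rows[n_train:]
    return train, test

def centroid_accuracy(train, test, types):
    """Fit per-type mean vectors on train; score nearest-centroid guesses on test."""
    n_features = len(train[0][0])
    centroids = {}
    for t in types:
        rows = [f for f, label in train if label == t]
        centroids[t] = [sum(r[j] for r in rows) / len(rows)
                        for j in range(n_features)]

    def predict(feats):
        return min(types, key=lambda t: sum((a - b) ** 2
                                            for a, b in zip(feats, centroids[t])))

    return sum(predict(f) == label for f, label in test) / len(test)

rng = random.Random(0)
TYPES = ["bladder", "prostate", "melanoma"]
samples = make_noise_samples(40, 5, TYPES, rng)

train, test = split_by_type(samples, TYPES, 20)
raw_accuracy = centroid_accuracy(train, test, TYPES)

tagged_train, tagged_test = split_by_type(leaky_normalize(samples, TYPES), TYPES, 20)
leaky_accuracy = centroid_accuracy(tagged_train, tagged_test, TYPES)

print(f"raw noise: {raw_accuracy:.2f}  after leaky step: {leaky_accuracy:.2f}")
```

On the raw noise the classifier should hover near chance, about one in three, while after the leaky step it should be close to perfect even though the data contain no biological signal at all. Whether the real normalization behaved this way is exactly what the two groups dispute.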

Knight says his group disagrees with the preprint’s analysis, and reemphasizes that the Cell paper, which didn’t process data in the same way, reached the same conclusions. He adds his team isn’t “especially motivated” to comb through the preprint’s lengthy analysis or address it on social media, where it is already generating a buzz. “If they were to publish this in a peer-reviewed journal, we would address [it], … which I think is the appropriate way to do science, as over the past few centuries.”

In a statement, Micronoma CEO Sandrine Miller-Montgomery notes the company’s products in development don’t rely on the Nature paper. “We developed additional human filtering and quality control methods that minimized human genomic DNA contamination, finding that doing so did not hinder the ability to diagnose cancer presence or types,” she says. For its lung cancer blood test, the furthest along in its pipeline, “Micronoma has generated an independent and proprietary microbial database based on metagenome assembly of nonhuman reads.”

Repercussions for research by other academic teams that used the Nature data aren’t clear. “These are very early days and this is a quite complex issue,” says the National Cancer Institute’s Eytan Ruppin, who relied on the data set for a 2022 paper. “It is now pertinent to hear from the original authors of the Nature paper, should they choose to respond, to get a possibly more balanced and even-handed perspective on this important topic.”

Ivan Vujkovic-Cvijin, a microbiome scientist at Cedars-Sinai Medical Center, says there’s a lack of standards in how to use machine learning in microbiome science. “I think this scientific disagreement underscores the need to develop them.”

Others hope the ensuing discussion will help fix any problems in the original paper. “As scientists, we should be open to challenge,” Parkhill says. “We should be capable of dealing with it objectively and correcting things when necessary. Hopefully, that’s what will happen.”