Adam Russell, an anthropologist and program manager at the Department of Defense’s mad-science division Darpa, laughs at the suggestion that he is trying to build a real, live, bullshit detector. But he doesn’t really seem to think it’s funny. The quite serious call for proposals Russell just sent out on Darpa stationery asks people—anyone! Even you!—for ways to determine what findings from the social and behavioral sciences are actually, you know, true. Or in his construction: “credible.”
Even for Darpa, that’s a big ask. The DoD has plenty of good reasons to want to know what social science to believe. But plenty more is at stake here. Darpa’s asking for a system that can solve one of the most urgent philosophical problems of our time: How do you know what’s true when science, the news, and social media all struggle with errors, advertising, propaganda, and lies?
Take a scientific claim. Do some kind of operation on it. Determine whether the claim is right enough to act on. So … a bullshit detector?
“I wouldn’t characterize it that way, and I think it’s important not to,” Russell says. He doesn’t want to contribute to cynicism that lets people think if scientists admit uncertainty, that means they can’t be trusted. “I have a deep faith that there is real science. It’s not that we know nothing about the world.” Science is still the best way of knowing stuff. Darpa just wants to know what stuff science is really sure about, and how it knows it. And how it knows it knows it.
You can imagine why Darpa and the DoD might want to shore up the social sciences. They want to understand how collective identity works, or why some groups (and nations) are stable and some fall apart. The military would like to get a better handle on how humans team up with machines before the machines get smarter and more of them get deployed. How does radicalization work, especially online? Why do people cooperate sometimes and compete at others? All these questions have two things in common: They are super-important to national security, and no one knows the answer.
The people who are supposed to figure out those knotty issues have their own problems. You might have heard about the “reproducibility crisis,” the concern that many scientific findings, particularly in psychology and sociology, don’t pass a fundamental test of validity: that subsequent researchers can do the same experiment and get the same results as the first ones. Or you might be familiar with “P-hacking” and other ways some researchers, under pressure to publish and get grants, cherry-pick their experimental results to ensure the appearance of statistical significance.
Those issues come up in Darpa’s call for proposals, but researchers acknowledge that the concerns don’t end there. “If you ask a bunch of social scientists how organizations work, you’re not just going to get 20 different answers. You’re going to get answers not even comparable to each other,” says Duncan Watts, a sociologist at Microsoft Research who wrote a blistering critique of the social sciences’ (as he terms it) incoherency problem in the January 2017 issue of Nature Human Behaviour. “You read one paper and then another paper, and it’s got the same words in the title but different units of analysis, different theoretical constructs, entirely different notions of causality. By the time you’ve done a literature review, you’re completely confused about what on Earth you even think. This is not about whether any one particular claim can be replicated, right? It’s that collectively the claims don’t make sense.”
But … Darpa, though, right? Impossible problems! Here’s an internet we made you! Darpa! The agency has an overarching program called Next Generation Social Science, set up in 2016 to use economics, sociology, anthropology, and so on to better understand everything from terrorism to the spread of propaganda online. And, yes, it’s an impossible problem. “In emerging fields you begin to see the development of standards as a good signal that something’s happening there,” Russell says. “We certainly don’t have those standards in social sciences.”
So Darpa wants to build them. “Confidence Levels for the Social and Behavioral Sciences” is the formal title of the agency’s “request for information,” the bureaucratic talk for “we have some grant money to award; send us your pitches.” But this RFI is capacious in its ambition, going well beyond reproducibility. It name-checks other bulwarks of scientific validation—peer review, meta-analyses, statistical techniques, and even more modern approaches like impact factors, citation webs, and expert prediction markets. But only to say, incorporate these and surpass them. Great stuff, guys, really great stuff. Anything else to pitch?
From the document: “There may be new ways to create automated or semi-automated capabilities to rapidly, accurately, and dynamically assign Confidence Levels to specific SBS results or claims.” (“SBS” = “social and behavioral sciences”) Help experts and non-experts separate scientific wheat from wrongheaded chaff using “machine reading, natural language processing, automated meta-analyses, statistics-checking algorithms, sentiment analytics, crowdsourcing tools, data sharing and archiving platforms, network analytics, etc.”
Clearly what we need here is some sort of machine with, like, a slot for feeding in journal articles. And two lights on the front: red and green. Ping or bzzzt.
Yeah, but no. “I think we’re many years from that,” says Matthew Salganik, a sociologist at Princeton who isn’t planning to submit an idea to Darpa but works on related validation issues. Though he’ll allow: “Something that might be more possible would be warning lights for papers from a relatively small number of journals.” Maybe just restrict the corpus to the Big Three—Science, Nature, and Proceedings of the National Academy of Sciences.
Really, though, no one knows what an answer will look like. In fact, one of the first people to submit a response to the Darpa RFI is asking for money to sketch one out. “The key challenge of doing this is there isn’t a gold standard for credibility. We don’t have a benchmark,” says University of Virginia psychologist Brian Nosek, head of the Center for Open Science and one of the main players in the fight for reproducibility. Lots of people say they have ways to validate scientific results, Nosek says. “So you have to play them against each other. We think all of these ideas say something about credibility, so let’s start comparing them.”
Nosek’s pitch is titled “Path to Iterative Confidence Level Evaluation,” or, charmingly, “Pickle.” It proposes that Darpa set up, in classic fashion, a competition: Let people with credibility-assessing models test them against specific corpuses, like replicated studies. Then he’d build what’s called a nomological network, pitting the ideas against each other. “The only way to develop confidence in the evidence is to look at the problem in lots of different ways and see where you start to get convergence,” Nosek says. In other words, establish a framework for establishing the credibility of establishing credibility. “It’s very meta,” he says.
Other fields of science have their own problems with replicability and reliability. (The cancer people are freaking right out.) But the social sciences have their own particular epistemological problem. Darpa’s big obstacle here might not be social science answers but social science questions. “Computer scientists are more accustomed to asking questions where they can easily verify the answer,” Salganik says. What spam filter best filters spam? Here’s 900,000 emails labeled “spam” and “not-spam.” Now, here’s another 100,000 emails. Let 10 systems label them, and we’ll see which one gets the most right. Result: spam filter. “The goal is explicitly about prediction, and prediction problems are easy to express quantifiably,” Salganik says. “But a lot of social science questions are different. They’re more about asking why something is happening.”
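Salganik’s spam-filter example can be sketched in a few lines of code. Everything here is illustrative, not a real filtering system: the two toy “filters,” the function names, and the tiny held-out set all stand in for real trained models and the 100,000 test emails he describes. The point is the shape of the evaluation, where every candidate gets scored against the same labeled hold-out set and the best score wins.

```python
# A minimal sketch of the verify-by-prediction setup Salganik describes:
# score each competing spam filter on held-out labeled messages it has
# never seen. All names and data here are illustrative.

def accuracy(filter_fn, held_out):
    """Fraction of held-out (text, label) pairs the filter labels correctly."""
    correct = sum(1 for text, label in held_out if filter_fn(text) == label)
    return correct / len(held_out)

# Two toy "filters" standing in for real trained systems.
def keyword_filter(text):
    return "spam" if "winner" in text.lower() else "not-spam"

def length_filter(text):
    return "spam" if len(text) > 40 else "not-spam"

held_out = [
    ("Congratulations WINNER, claim your prize now!!!", "spam"),
    ("Meeting moved to 3pm, see agenda attached", "not-spam"),
    ("You are a winner! Click here", "spam"),
    ("Lunch tomorrow?", "not-spam"),
]

# Score every candidate on the same held-out set; the best score wins.
scores = {name: accuracy(fn, held_out)
          for name, fn in [("keyword", keyword_filter),
                           ("length", length_filter)]}
print(scores)
```

The answer is easy to verify precisely because the question was posed as prediction. That is the property Salganik says many social science questions lack.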
Basic descriptive questions might be useful to Darpa and its military clients. But they’re tough to quantify. “This really does go beyond just statistical significance,” Salganik says. Even if you lower the P value threshold for statistical significance from 0.05 to 0.005 (the lower the value, the stronger the evidence), that won’t deal with, let’s say, bias resulting from corporate funding. (Particle physicists demand a P value below 0.0000003! And you gotta get below 0.00000005 for a genome-wide association study.)
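For scale, here are those thresholds side by side. The particle-physics figure corresponds to the field’s familiar five-sigma convention, computed here as a one-sided tail of the standard normal; the function name and the table layout are just this sketch’s choices.

```python
# Comparing the significance thresholds mentioned in the text.
import math

def one_sided_p(sigma):
    """One-sided tail probability of a standard normal beyond `sigma`."""
    return 0.5 * math.erfc(sigma / math.sqrt(2))

thresholds = {
    "social science (traditional)": 0.05,
    "social science (proposed)":    0.005,
    "particle physics (5 sigma)":   one_sided_p(5),  # roughly 0.0000003
    "genome-wide association":      0.00000005,
}
for name, p in thresholds.items():
    print(f"{name}: {p:.2e}")
```

The gap is striking: the physicists’ bar is more than five orders of magnitude stricter than even the proposed 0.005 standard, and as Salganik notes, none of these thresholds addresses problems like funding bias.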
So what’s the answer? One approach might be going beyond statistics and reproducibility to add new confidence tools to the kit. Nosek’s Center for Open Science awards “badges” to articles for things like pre-registering a research plan (to ward off accusations of P-hacking) and making full data sets and analysis code available. It’s like LEED certification for green buildings.
Social networks might also play a positive role—metrics can show not just how many people cited or linked to a study but how they talked about it. Blog posts and tweets about a new finding in, like, astronomy, could almost constitute a kind of post-publication peer review in which an entire scientific community digs into a paper. In other words, you know who’s gonna save science? Trolls.
Russell seems like he’d be totally open to that. He doesn’t know how many ideas he’ll fund—the RFI is open until mid-August—and he doesn’t know how much money he’ll be able to dole out. “The great thing about Darpa is, if you have a lot of great ideas and budgets are required to do that, you can make that argument,” Russell says. “In the long run, we’re all in this together. The better our science, the better decisions we can make.”
It’s a tricky philosophy-of-science problem, with a tricky philosophy-of-science answer. “It’s a little shocking in some ways that we’re even having this conversation now, that so many of us are waking up and realizing that we’re not as confident in our methods as we thought we were,” Watts says. “It’s going to be a big collective effort to just improve our ability to say we actually believe this result versus we shouldn’t yet.” You have to believe that science can fix science—with some help from Darpa.