A new tool called the R-factor could help ensure that science is reproducible and valid, according to a preprint posted on bioRxiv: Science with no fiction. The authors, led by Peter Grabitz, are so confident in their idea that they’ve created a company called Verum Analytics to promote it. But how useful is this new metric going to be?
Not very useful, in my view. The R-factor (which stands for “reproducibility, reputation, responsibility, and robustness”) strikes me as a flawed idea.
The R-factor of any result is calculated “simply by dividing the number of published reports that have verified a scientific claim by the number of attempts to do so.” In other words, it’s the proportion of published attempts to confirm a claim that were successful. Only independent attempts count. For “an investigator, a journal, or an institution”, their R-factor is the average of the R-factors for all the claims they’ve published.
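To make the definition concrete, here is a minimal sketch of the calculation as I read it from the preprint's description. The function names are my own, and I'm assuming the counts of independent published attempts and confirmations have already been tallied by hand.

```python
def r_factor(confirming: int, attempts: int) -> float:
    """Proportion of published independent attempts that confirmed a claim."""
    if attempts <= 0:
        raise ValueError("R-factor is undefined without at least one attempt")
    if not 0 <= confirming <= attempts:
        raise ValueError("confirming must be between 0 and attempts")
    return confirming / attempts


def aggregate_r_factor(claims: list[tuple[int, int]]) -> float:
    """Average the per-claim R-factors for an investigator, journal, or institution.

    Each claim is a (confirming, attempts) pair. This is an unweighted mean,
    which is my assumption about what "the average" means here.
    """
    if not claims:
        raise ValueError("need at least one claim")
    scores = [r_factor(c, n) for c, n in claims]
    return sum(scores) / len(scores)


# A claim confirmed in 3 of 4 published attempts.
print(r_factor(3, 4))
# An investigator with two claims: one confirmed 1/1, one confirmed 1/2.
print(aggregate_r_factor([(1, 1), (1, 2)]))
```

Note that the aggregate treats a claim tested once the same as a claim tested many times, which matters for some of the concerns below.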
Here are my main concerns with this idea:
1) It’s subject to publication biases
The R-factor is a summary of the published literature. We know that the literature is biased: for instance, positive results are more likely to be published than negative ones. Grabitz et al. know this as well. In fact, they suggest that the R-factor could help to solve these kinds of problems. But the R-factor, which takes the published literature ‘at face value’, will itself be affected by publication bias, p-hacking, etc.
There are many examples of results which have been ‘replicated’ in many papers (i.e. with a high R-factor) yet which on closer inspection are statistically improbable. This is the motivation behind proposals such as p-curve analysis and the R-Index (not related to the R-factor). These methods test whether the literature is plausible, rather than just assuming that it is, as the R-factor does.
2) It’s simplistic
The R-factor adopts a ‘show of hands’ definition of reproducibility: count the papers that support a claim, count the ones that refute it, and work out the percentages. This approach treats all studies as equally informative, but they rarely are. What about the sample sizes, for instance? Shouldn’t a study with 1,000 datapoints count more than a study with 10? In the R-factor, they’re treated the same.
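A toy sketch shows how much the equal-vote assumption can matter. The `Study` type, both function names, and the two studies are hypothetical illustrations of mine, not anything from the preprint. The weighted version is just one naive alternative and falls well short of a real meta-analysis.

```python
from dataclasses import dataclass


@dataclass
class Study:
    sample_size: int
    confirms: bool  # the black-and-white verdict the R-factor requires


def unweighted_r(studies: list[Study]) -> float:
    """'Show of hands': every study gets one vote, whatever its size."""
    return sum(s.confirms for s in studies) / len(studies)


def weighted_r(studies: list[Study]) -> float:
    """Hypothetical variant: each study's vote is weighted by its sample size."""
    total = sum(s.sample_size for s in studies)
    return sum(s.sample_size for s in studies if s.confirms) / total


# Made-up example: a large confirming study and a tiny refuting one.
studies = [Study(sample_size=1000, confirms=True),
           Study(sample_size=10, confirms=False)]

print(unweighted_r(studies))  # the two studies cancel out
print(weighted_r(studies))    # the large study dominates
```

With these invented numbers the unweighted score is 0.5, while the weighted score is 1000/1010, or about 0.99.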
There’s a deeper problem. It’s simplistic to treat every study in a black and white way as either “confirming” or “refuting” a claim. In reality, data may strongly support a hypothesis, weakly support it, or be inconclusive, and everything in between.
Now, it might be possible to modify the R-factor to address these criticisms. We could weight studies by sample size, for example. However, if we make these modifications, we’d soon end up re-inventing the existing and widely used technique of meta-analysis. Which brings me on to the next point:
3) It doesn’t improve on what we already have (meta-analysis)
The R-factor has no advantages over a proper meta-analysis. I suppose the R-factor might be easier to calculate in some cases, but probably not by much. Finding an R-factor requires us to check many papers (the authors suggest all of the papers citing the original study in question) and check whether the results confirm or refute the hypothesis. If we’re doing that, why not also record the results needed for a meta-analysis?
4) It glosses over hard questions
A selling-point of the R-factor is that it’s easy to use: “The R-factor is relatively easy to calculate, as the process… can be done by anyone with a general expertise in biomedical research.” However, this seems naive. If we ask “how many studies confirm the existence of phenomenon X?”, this raises at least two questions: what is X? And what does it take to confirm it? Both may be substantial scientific or even philosophical questions.
Suppose for example that we’re calculating the R-factor for the claim that ‘antidepressants cause suicide’. We find a paper reporting that antidepressants increase suicide attempts but not suicide deaths. Does that confirm the hypothesis, refute it, or neither? Opinions might differ. This is not a contrived example: it’s based on a real debate. So two people could calculate two different R-factors from the same literature.
5) It’s an impoverished metric
If my claim has only been tested once, and passed that one test, it will have an R-factor of 1. If your claim has passed 99 out of 100 tests, its R-factor (0.99) will be lower than mine, yet most people would say that your claim is more replicable than mine. The R-factor doesn’t take the number of replication attempts into account. This problem could be fixed, perhaps, by adding some kind of a confidence interval to the measure. (Edit: in fact the authors sometimes use subscripts to indicate the number; but not consistently. See comments.)
To be fair to Grabitz et al., I think they have a specific kind of study in mind for the R-factor, namely molecular biology studies. The authors don’t explicitly state this limitation. In fact, they say that “The R-factor is universal in that it is applicable to any scientific claim”, but most of the examples they give are from cancer biology.
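As a rough illustration of the confidence-interval idea, here is a sketch using the Wilson score interval for a binomial proportion. This is my choice of interval, not something the preprint proposes, and `wilson_interval` is a name I've made up. It also treats each replication attempt as an independent trial, which is a strong assumption.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("need at least one trial")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


mine = wilson_interval(1, 1)       # R-factor 1.00, tested once
yours = wilson_interval(99, 100)   # R-factor 0.99, tested 100 times

print(f"1/1:    {mine[0]:.2f} to {mine[1]:.2f}")
print(f"99/100: {yours[0]:.2f} to {yours[1]:.2f}")
```

Under these assumptions the interval for 1/1 runs from about 0.21 to 1, while the interval for 99/100 runs from about 0.95 to nearly 1. The claim with the "perfect" R-factor is the one we know far less about.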
For molecular biology, the R-factor does make some sense. Molecular biology studies don’t tend to use statistics. The results are presented in a qualitative manner, illustrated with blots. You can’t meta-analyze blots: they either show the pattern you’re looking for, or they don’t. So for this kind of study, my first three objections to the R-factor don’t really apply.
So the R-factor might work in some fields, but I don’t think it’s appropriate for any science that uses statistics – which includes the great majority of psychology and neuroscience.