In their follow-up paper where they looked at 12 randomly selected recent publications, they found no p-hacking and an 83% replication rate. But obviously that's a very small sample, and they did find some other worrying methodological problems that are less widely discussed.
Thank you for the comment. You're right, and this is their most recent post:
https://replications.clearerthinking.org/three-surprises-from-attempting-to-replicate-recent-studies-in-top-psychology-journals/
I don't agree that the papers are randomly selected. There is a random element to the selection, but the papers come from particular journals and must be replicable on an online platform within a limited budget, among other criteria:
https://replications.clearerthinking.org/what-we-do/
If a paper is preregistered with a detailed analysis plan and it replicates with the same analysis, you can say there likely wasn't p-hacking involved. (Reproducing the exact results by running the preregistered analysis plan on the original data is better for p-hacking detection: you're only checking what was done against what was planned.) From what I understand from personal correspondence with Spencer, this is not the method being used to detect p-hacking. I'd prefer Clearer Thinking state this in their own words, but our ideas about p-hacking detection are not the same. Many of the papers either aren't preregistered or deviate from their preregistration, yet are still judged not p-hacked.

While replication is a huge step forward in detecting p-hacking that exploits noise, it doesn't rule out p-hacking toward confounded effects, analyses that are trivially true, other borderline errors in reasoning or interpretation, or subtle flexibility in both the primary study and the replication. Running the same analysis on new data can reproduce a confound or a quirk of the study design, and the result only has to be significant in the same direction.
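To make the confound point concrete, here is a minimal toy simulation (purely hypothetical, and not Clearer Thinking's procedure): a noise-only "finding" rarely survives a direct replication, but a confounded association replicates in the same direction almost every time, even though the causal claim behind it is wrong.

```python
# Hypothetical toy simulation -- not Clearer Thinking's method.
# Scenario A: no real effect; any significant original result is noise.
# Scenario B: X and Y share an unmeasured confounder Z; the association
# is real, but the causal story "X -> Y" is wrong.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 200  # participants per study


def run_study(confounded: bool):
    """Return (p-value, sign) for a simple X-Y correlation test."""
    if confounded:
        z = rng.normal(size=N)               # unmeasured confounder
        x = z + rng.normal(size=N)
        y = z + rng.normal(size=N)
    else:
        x = rng.normal(size=N)               # true null
        y = rng.normal(size=N)
    r, p = stats.pearsonr(x, y)
    return p, np.sign(r)


def replication_rate(confounded: bool, runs: int = 2000) -> float:
    """Among significant originals, how often does a direct replication
    come out significant in the same direction?"""
    originals, successes = 0, 0
    for _ in range(runs):
        p1, s1 = run_study(confounded)
        if p1 >= 0.05:
            continue                         # original not "publishable"
        originals += 1
        p2, s2 = run_study(confounded)       # same analysis, new data
        if p2 < 0.05 and s2 == s1:
            successes += 1
    return successes / max(originals, 1)


print(f"noise-only replication rate: {replication_rate(False):.0%}")
print(f"confounded replication rate: {replication_rate(True):.0%}")
```

Under these toy assumptions the first rate typically lands in the low single digits, while the second is close to 100%, which is why a successful replication says more about noise-mining than about confounds.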
Once you layer on the fact that Clearer Thinking may consider some of what I just said to be importance hacking, you have something that is really difficult to nail down.
I also think the conclusion from this should be that we should fund replication efforts so that paper selection can be more random, replications can be published (preferably in the same journals), and so on. In other words, I don't think Clearer Thinking is the phenomenon we're witnessing, even though we should interpret their findings with due caution. They appear to be relatively solid, even though I don't completely agree with them on p-hacking. I also wish the method were better described.
All that said, you're right to point this out, and I'll include their interpretation from December 4th in future posts.
Brilliant, thank you Alex! Very clarifying
Great article. I think the change to require preregistration (or at least to normalize it as the standard while allowing justified exceptions) has to come from journals (through submission requirements) and universities (through annual performance evaluations).
Preregistration should reasonably protect against things like p-value hacking, but I realized recently, to my dismay, that it can still leave lots of room for other biases (e.g., confirmation bias toward pet theories) unless the researcher is really diligent about designing their preregistration to test competing theories against each other.
I wrote about an example recently in my field, but to put it in general terms: the preregistration simply says "we expect to find an association between A and B in observational data," and the authors then claim support for their pet causal theory while ignoring competing explanations (including reverse-causal ones). This is almost a worse situation, because the authors can claim their preregistered hypotheses were supported when the results don't really advance our understanding in a meaningful way.
https://arielleselyaphd.substack.com/p/pre-registration-of-research-plans?r=45yctx
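As a purely hypothetical illustration of the reverse-causation point (generic A and B, not the data from the linked article): if B actually drives A, a preregistered test for "an association between A and B" still comes out strongly significant, so the preregistration is "confirmed" without the causal theory ever being tested.

```python
# Hypothetical illustration -- generic A and B, not the linked article's data.
# True data-generating process: B causes A (the reverse of the pet theory).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 500

b = rng.normal(size=n)                   # B is actually the driver
a = 0.6 * b + rng.normal(size=n)         # reverse causation: B -> A

r, p = stats.pearsonr(a, b)
print(f"preregistered test 'A is associated with B': r = {r:.2f}, p = {p:.1e}")
# The preregistered hypothesis is "supported", but the result is equally
# consistent with A -> B, B -> A, or a third variable driving both.
```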
Edit to add: I still support preregistration, of course; it's a big step in the right direction. We just have to be mindful of its limitations and of what other good practices would help (e.g., adversarial collaborations, which are better at testing competing theories).
I have been reading Spencer Greenberg's work on metascience a lot, and he is good. But as you said, I think they are too nice and too charitable to science. I also got the sense that they selected the easier projects to reproduce, which only shows that low-hanging fruit has a certain reproducibility rate. But I could be wrong about that.
I think that if all the crap in science has the same causes as the crap in most institutions, it stands to reason that nice measures are never going to work. If it's Darwinian, then survival should depend on what actually makes, or could make, science better. And so letting selfish, greedy careerism rule and expecting it to solve its own problems is doomed to fail.
I agree we should be appropriately skeptical of both the replication project and the poll. For what it's worth, my due diligence on Clearer Thinking, by way of emailing them difficult questions and getting reasonable replies, has checked out both in the past and on this post. So I'm inclined to think they're credible, even if they have the limitations of a small org. Definitely a fair point.
On natural selection: yes, if we can change the environment enough, publication "fitness" would become scientific fitness. Unfortunately, the Darwinian possibility is the hardest version of the incentives problem to solve, and most people don't want to make eye contact with it. But ignoring or denying it is worse.
Thank you for the thoughtful comment!