Lies, Damn Lies, and Statistics... P-Hacking

Steve Brunton, Sep 29, 2025

Audio Brief

Transcript
This episode defines and explains P-hacking, a problematic statistical practice that can undermine research integrity. There are four key takeaways from the discussion.

First, researchers must pre-define their experimental protocols, including sample sizes, before data collection begins. This prevents the manipulative practice of stopping data collection prematurely, which can create misleadingly significant results from otherwise random data.

Second, be critical of studies that report a single significant finding, especially from large datasets, without disclosing how many other hypotheses were tested. This selective reporting, known as cherry-picking or data dredging, can surface spurious correlations that appear significant purely by chance, misrepresenting the overall findings.

Third, understand that a statistically significant result does not guarantee a true underlying effect. Significant p-values can arise purely by chance, particularly when numerous tests are conducted, inflating the rate of false positives and leading to incorrect conclusions.

Fourth, continuously monitoring the p-value as data is collected and prematurely halting an experiment once it crosses a significance threshold is a direct form of p-hacking. This practice, demonstrated in the episode with a simple fair-coin-toss example, generates unreliable conclusions that may not hold with more complete data.

Understanding and actively avoiding P-hacking is essential for robust and trustworthy scientific research and data analysis.

Episode Overview

  • The episode defines and explains "P-hacking," a problematic practice in statistics where data or analyses are manipulated to find statistically significant results.
  • It covers common forms of P-hacking, including cherry-picking data, data dredging (or fishing), and stopping data collection prematurely.
  • The speaker highlights the ethical implications, noting that P-hacking can be done accidentally, maliciously, or through a combination of the two.
  • A practical demonstration using Python code shows how stopping data collection at just the right moment can create a misleadingly significant result from a fair coin toss.
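The episode's actual notebook isn't reproduced here, but the optional-stopping demonstration can be sketched in Python. Everything below (the function names, the exact-binomial p-value helper, the flip limit) is illustrative rather than the episode's own code: flip a fair coin, recompute the two-sided p-value after every flip, and stop the moment it dips below 0.05.

```python
import math
import random

def binom_two_sided_p(heads, n):
    """Exact two-sided p-value for `heads` heads in n fair-coin flips."""
    dev = abs(heads - n / 2)
    total = sum(math.comb(n, i) for i in range(n + 1)
                if abs(i - n / 2) >= dev - 1e-12) * 0.5 ** n
    return min(total, 1.0)

def flip_until_significant(max_flips=200, alpha=0.05, seed=0):
    """P-hacked protocol: peek at the p-value after every flip and stop
    as soon as it crosses alpha. Returns (n, heads) at the stop, or
    None if the fair coin never looks 'significant' within max_flips."""
    rng = random.Random(seed)
    heads = 0
    for n in range(1, max_flips + 1):
        heads += rng.random() < 0.5
        if n >= 10 and binom_two_sided_p(heads, n) < alpha:
            return n, heads  # stopping here is the cheat
    return None
```

Run across many random seeds, a substantial fraction of perfectly fair coins "reach significance" at some point during the sequence, which is exactly why the number of samples must be fixed before data collection begins.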

Key Concepts

  • P-Hacking: The act of only reporting experiments or analyses that yield statistically significant p-values, while ignoring or omitting non-significant results. This practice inflates the rate of false positives.
  • P-value: The probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true. Researchers often seek a low p-value (e.g., < 0.05) to claim a finding is "statistically significant."
  • Cherry Picking / Multiple Comparisons: Running numerous hypothesis tests and selectively publishing only those that show a significant outcome, which misrepresents the overall findings.
  • Data Dredging (Fishing): A common issue in the age of big data where analyzing a massive number of variables can lead to the discovery of spurious correlations that appear significant purely by chance.
  • Stopping Data Collection Early: An unethical practice where an experiment is monitored continuously, and data collection is halted as soon as the p-value drops below the significance threshold, ignoring that the result might change with more data.
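The multiple-comparisons problem described above is easy to simulate. In this hypothetical sketch (the function names and parameters are mine, not the episode's), every "hypothesis" is just a fair coin, so the null is true in all 200 tests, yet roughly alpha × 200 ≈ 10 of them come out "significant" anyway:

```python
import math
import random

def two_sided_p(heads, n):
    """Two-sided p-value for a fair-coin null, via the normal approximation."""
    z = abs(heads - n / 2) / math.sqrt(n / 4)
    return math.erfc(z / math.sqrt(2))

def data_dredge(num_hypotheses=200, n=100, alpha=0.05, seed=1):
    """Test many hypotheses that are all null-true; return the false
    positives a cherry-picker would report, as (index, p-value) pairs."""
    rng = random.Random(seed)
    significant = []
    for h in range(num_hypotheses):
        heads = sum(rng.random() < 0.5 for _ in range(n))
        p = two_sided_p(heads, n)
        if p < alpha:
            significant.append((h, p))
    return significant
```

Reporting only the contents of `significant`, without disclosing the other ~190 tests, is exactly the cherry-picking defined above; corrections such as Bonferroni exist precisely to account for the number of tests run.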

Quotes

  • At 00:51 - "Sometimes people do p-hacking on accident...sometimes people do it maliciously or fraudulently, sometimes it's a combination of both." - Explaining that p-hacking is a complex issue that isn't always a result of deliberate scientific fraud.
  • At 01:28 - "Roughly speaking, p-hacking is a bad way of doing statistics where you only report experiments that yield significant p-values while ignoring or omitting non-significant results." - Providing a clear and straightforward definition of the core problem.
  • At 06:40 - "If I stopped right there, that's cheating. You have to specify how long, how many samples you're going to collect, ahead of time." - Emphasizing the importance of pre-registering an experimental plan to avoid the temptation of stopping data collection opportunistically.

Takeaways

  • To maintain scientific integrity, pre-define your experimental protocol, including the number of samples to be collected, before beginning your research.
  • Be critical of studies that report a single significant finding, especially from a large dataset, without disclosing how many other hypotheses were tested.
  • Understand that a statistically significant result does not guarantee a true effect; it can arise by chance, particularly when multiple tests are conducted.
  • Continuously checking your p-value as you collect data and stopping the experiment once it becomes significant is a form of p-hacking and leads to unreliable conclusions.