Commentary and Analysis of “The Power and Pitfalls of Megastudies for Advancing Applied Behavioral Science”, a podcast interview with Katherine L. Milkman

https://www.youtube.com/watch?v=zKa4YgktFF8&pp=ygUTbWVnYXN0dWRpZXMgbWlsa21hbg%3D%3D

Introducing Megastudies

Dr. Katherine L. Milkman begins by acknowledging the global advances made by behavioral insights units and the growing demand for a blend of psychological and economic insights. She then highlights several challenges this work faces: field experiments, on which policy advice relies, are resource-intensive in both time and money; effect sizes are difficult to compare across mismatched studies; and, in light of the replication crisis, replicating non-robust studies incurs high costs.

To tackle these issues, Milkman proposes the concept of "megastudies." These are large-scale field experiments that encompass multiple smaller sub-experiments using the same dependent variable but varying independent variables. The benefits include better comparability, shared fixed costs, interdisciplinary collaboration, behavioral phenotyping, and accelerated scientific discovery.
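
To make the design concrete, here is a minimal Python sketch of a megastudy-style analysis in which several treatment arms are compared against one shared placebo group on a single common outcome; all arm names, effect sizes, and data below are simulated for illustration and are not drawn from any of the studies discussed.

```python
# Minimal sketch of a megastudy-style comparison: many treatment arms, one
# shared placebo group, one common dependent variable (e.g., weekly gym
# visits), so every treatment effect is estimated on the same scale.
# All arm names and numbers are hypothetical, simulated for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_per_arm = 1000
true_effects = {
    "placebo": 0.0,          # shared control group
    "planning_prompt": 0.10,
    "reminder": 0.05,
    "small_incentive": 0.30,
    "social_proof": 0.15,
}

# Simulate weekly gym visits for each arm around a baseline of 1 visit/week,
# then estimate each treatment effect as the difference in means vs. placebo.
visits = {arm: rng.normal(loc=1.0 + lift, scale=2.0, size=n_per_arm)
          for arm, lift in true_effects.items()}
placebo_mean = visits["placebo"].mean()
for arm in true_effects:
    if arm == "placebo":
        continue
    effect = visits[arm].mean() - placebo_mean
    print(f"{arm}: estimated effect = {effect:+.2f} visits/week vs. placebo")
```

Because every arm is benchmarked against the same placebo group and the same dependent variable, the estimated effects land on a single common scale, which is the comparability advantage Milkman emphasizes.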

Strengths and Pitfalls

Milkman discusses four megastudies, highlighting in particular the one run with 24 Hour Fitness. In this study, approximately 63,000 gym members participated in a habit-building program designed by scientists. Although the sample of gym members may not reflect the general population, Milkman suggests that the effects could be even larger in the broader population. The study involved 53 experimental treatments, with participants randomized to interventions such as planning prompts, reminders, and incentives. Results showed that 45% of treatments outperformed the placebo, and 9% surpassed best-practice benchmarks. A particularly effective strategy was offering a bonus for gym visits, which increased attendance by 27%. Social proof messages about Americans' exercise habits also had a positive impact.

This megastudy approach aims to reduce publication and null-result biases. The pre-registration process enhances transparency and limits selective reporting. Milkman compares this to the Common Task Framework in AI research, where researchers work on the same problem under consistent conditions using shared datasets (Liberman & Jelinek, 2010; Donoho, 2015), fostering competition and standardization.

However, the 24 Hour Fitness megastudy did not control for interaction effects between treatments, which could affect the overall outcomes. Furthermore, no statistical power calculations were reported for the individual sub-studies, and Dr. Milkman acknowledged substantial variation in results across them. In the prediction studies, reporting R-squared rather than the raw correlation might have been more informative, since it expresses the share of variance in the actual effects that the predictions explain, which is often more intuitive for non-experts. Sample size could also have influenced these findings. This aligns with the guidance in "So You Want to Run an Experiment, Now What? Some Simple Rules of Thumb for Optimal Experimental Design" (List, Sadoff, & Wagner, 2011), which emphasizes accounting for such factors in experimental design to ensure robust and interpretable results.
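
To illustrate the power concern, here is a minimal sketch of the standard two-sample sample-size formula that rules of thumb like List, Sadoff, and Wagner's build on; the effect size, outcome standard deviation, and significance/power targets below are hypothetical and chosen only for illustration.

```python
# Minimal sketch: approximate sample size per arm for a two-sample comparison
# (one treatment arm vs. a shared placebo), using the normal-approximation
# formula n per arm ~= 2 * (z_{alpha/2} + z_beta)^2 * (sigma / delta)^2.
# All numbers below are hypothetical, for illustration only.
from scipy.stats import norm

def n_per_arm(delta, sigma, alpha=0.05, power=0.80):
    """Approximate participants per arm needed to detect a mean difference
    `delta` in an outcome with standard deviation `sigma` (two-sided test)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the significance level
    z_beta = norm.ppf(power)            # critical value for the desired power
    return 2 * ((z_alpha + z_beta) * sigma / delta) ** 2

# Hypothetical example: detecting an extra 0.3 gym visits per week when the
# outcome's standard deviation is 2.0 visits per week.
print(round(n_per_arm(delta=0.3, sigma=2.0)))  # roughly 700 members per arm
```

Running a calculation like this for each arm would make explicit which sub-studies were adequately powered for the effect sizes of interest.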

When relating insights from Dr. Milkman's study to other work, two connections stand out: people's ability to predict study outcomes and the potential to expand these methods.

People's Ability to Predict Study Outcomes

Milkman and colleagues also asked participants to predict the effects of the 53 experimental treatments, and the correlations between predicted and actual outcomes varied (an illustrative calculation of these statistics follows the list):

  • Study 1 (301 Prolific workers): r = 0.25, p = 0.07

  • Study 2 (156 public health professors): r = -0.07, p = 0.63

  • Study 3 (90 behavioral practitioners): r = -0.18, p = 0.19
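
As a concrete illustration of the correlation versus R-squared point raised above, the following sketch computes the Pearson correlation, its p-value, and R-squared for a set of predicted and observed treatment effects; the arrays are made-up placeholders, not data from the megastudy or its prediction surveys.

```python
# Minimal sketch: Pearson correlation (with p-value) and R-squared between
# predicted and observed treatment effects. The arrays are hypothetical
# placeholders, not data from the megastudy or its prediction studies.
import numpy as np
from scipy.stats import pearsonr

predicted = np.array([0.05, 0.12, 0.08, 0.20, 0.03, 0.15])  # hypothetical forecasts
observed = np.array([0.09, 0.10, 0.02, 0.27, 0.06, 0.11])   # hypothetical effects

r, p_value = pearsonr(predicted, observed)   # Pearson correlation and its p-value
r_squared = r ** 2                           # share of variance explained

print(f"r = {r:.2f}, p = {p_value:.2f}, R^2 = {r_squared:.2f}")
```

For a simple linear relationship, R-squared is just the square of r, so even the strongest reported correlation (r = 0.25) corresponds to only about 6% of variance explained, which underscores how weak the forecasts were.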

These findings contrast with those in "Politicizing Mask-Wearing: Predicting the Success of Behavioral Interventions Among Republicans and Democrats" (Dimant et al.). It would also be interesting to compare these results with predictions made by large language models (LLMs), as explored in "Predicting Results of Social Science Experiments Using Large Language Models" (Ashokkumar et al.).

Potential to Expand Economic Game Experiments with Megastudies

While economic games have faced criticism for their limited real-world applicability, they can reveal preferences and behaviors that other methods, like observational or self-report data, may miss. Similarly, megastudies like the 24 Hour Fitness experiment bridge theoretical constructs and real-world behavior, offering valuable insights despite challenges like variation across studies. Both economic games and megastudies complement traditional methods by providing deeper behavioral insights. The key lies in well-designed studies—whether through economic games or field experiments—that overcome biases, reduce risks, and accelerate scientific discovery, as demonstrated by the case study from Colombia in "Preferences and Constraints: The Value of Economic Games for Studying Human Behavior" (Pisor, Gervais, Purzycki, Ross).

Remaining Questions

Dr. Milkman's efforts are remarkable and noble, but establishing more robust megastudy frameworks that address statistical rigor will require more researchers like her driving this initiative.

- Are there other researchers making significant progress in this area?

- Should we continue looking to fields like AI and computer science to enhance collaboration and advance methodological approaches?
