(Revise & Resubmit)
Big data are increasingly used to make predictions about the value of uncertain investments, thereby helping firms identify innovation opportunities without the need for domain knowledge. This trend has raised questions about which firms will primarily benefit from the availability of these data-driven predictions. Contrary to existing research suggesting that data-driven predictions level the playing field for firms lacking domain knowledge, I argue, using a simple theoretical framework, that these predictions reinforce the competitive advantage of firms with domain knowledge. In high-stakes contexts like innovation, where returns are skewed and only a few leads can be pursued, domain knowledge helps evaluate predictions and avoid false positives. I test this idea using novel data on the pharmaceutical industry, exploiting the features of genome-wide association studies (GWAS) that provide data-driven predictions about new drug targets. The results show that GWAS stimulate corporate investments in innovation, yet around one-third of these efforts are misallocated toward false positive predictions. Companies lacking domain knowledge react more strongly but are disproportionately likely to fall into the trap of false positives. By contrast, domain knowledge helps firms pursue fewer alternatives that are more likely to be the best opportunities. Together, the results show that even if data-driven predictions are valuable in innovation, domain knowledge remains a crucial source of competitive advantage in the age of big data technologies.
[Working Paper] [Will Mitchell Dissertation Research Grant] [INFORMS/Organization Science Dissertation Proposal Competition]
(2nd Revise & Resubmit)
How does big data change discovery? Datasets covering broad portions of a scientific landscape enable a data-driven approach to search, uncovering findings whose underlying mechanisms may be unclear. This shift has raised concerns that decoupling discovery from theoretical understanding lowers innovation quality by prioritizing incremental ideas or false positive signals. Even when successful, data-driven search could weaken incentives to develop theory and leave the consequences of new discoveries poorly understood. I examine these issues in human genetics, where genome-wide association studies (GWAS) enable a data-driven search for the genetic roots of disease. Compared with traditional theory-based approaches, GWAS expand the genetic landscape examined, increase outcome variability with a proportionally larger increase in breakthroughs, and stimulate follow-on work aimed at clarifying causal mechanisms. An instrumental variable strategy exploiting a technology-driven decline in the cost of GWAS supports a causal interpretation of these results. Mechanism tests show that these effects arise because data-driven search surfaces more empirical anomalies: valuable discoveries that depart from theoretical expectations and redirect subsequent research toward new theorizing. Together, the results suggest that big data technologies can fuel virtuous cycles of knowledge accumulation by increasing the frequency of findings that challenge existing theories.
With Cecil-Francis Brenninkmeijer, Arul Murugan, and Abhishek Nagaraj
(2nd Revise & Resubmit)
Large Language Models (LLMs) are becoming useful tools for research, yet their potential for strategy remains underexplored. We show how LLMs can be used as synthetic subjects to study strategic interactions. We introduce a framework for designing and running simulated experiments with LLM-powered agents. We argue that this approach is useful for rapid, low-cost prototyping of human experiments and for generating novel hypotheses. We apply the framework to the exploration-exploitation dilemma and show that LLM-based experiments reproduce patterns observed among human participants. We then vary parameters and boundary conditions to illustrate how the same setup can support design iteration and generate hypotheses about when and why established results change. We conclude by discussing the promise and limitations of AI agents as “model organisms” for strategy.
With Abhishek Nagaraj
(Revise & Resubmit)
This study examines the impact of access to confidential administrative data on the rate, direction, and policy relevance of economics research. To do so, we exploit the progressive geographic expansion of the U.S. Census Bureau's Federal Statistical Research Data Centers (FSRDCs). FSRDCs boost data diffusion, help empirical researchers publish more articles in top outlets, and increase citation-weighted publications. Beyond direct data usage, spillovers to non-adopters also drive this effect. Further, citations to exposed researchers in policy documents increase significantly. Our findings underscore the importance of data access for scientific progress and evidence-based policy formulation.
[Working Paper] [NBER Working Paper] [Sloan Grant] [Tweetstorm summary]
With Christian Fons-Rosen and Lee Fleming
(Revise & Resubmit)
Does corporate authorship increase deceptive conduct in science? Existing research suggests that firms’ commercial stakes in the findings they publish increase the risks of misconduct. Yet when firms rely on basic science in their downstream development, misleading results can compromise costly innovation efforts, giving firms stronger incentives to get the science right. Corporate participation in basic research may therefore reduce deceptive conduct. We examine this argument in Alzheimer’s preclinical research, where inappropriate image alterations provide an objective marker of deceptive conduct. Using an AI-based detection tool validated through manual review, we scan the entire field and document a rising trend of data issues in published science. Problematic cases are nearly absent in corporate-authored papers and less frequent in academic-industry collaborations, but only when firms exercise meaningful control over the research process. The results are strongest when research is less verifiable or closer to commercialization. Taken together, our results identify a tension in the division of innovative labor, as separating research from development may weaken incentives to produce science reliable enough for downstream use.
With Johannes Hoelzemann, Gustavo Manso, and Abhishek Nagaraj
(Submitted)
We study exploration under uncertainty and show how access to data on past attempts can paradoxically hinder breakthrough discovery. We develop a model of the “streetlight effect” demonstrating that when data highlights attractive but ultimately suboptimal projects, it can narrow exploration and suppress innovation. In a laboratory experiment, we find that revealing the value of an enticing project lowers payoffs and reduces breakthrough discoveries. This drop stems from increased free-riding behavior, which crowds out the generation of new data. We then apply our theory in the context of scientific research into the genetic origins of human diseases, focusing on the drivers of limited exploration. To identify the causal impact of past data, we use an instrumental variable that leverages exogenous genetic overlaps between humans and laboratory mice, which reduce research costs for specific genes and lead to prioritized data collection about them. We find that diseases with early evidence of promising genetic targets are 16 percentage points less likely to yield breakthroughs than those where early efforts failed. While competition attenuates the streetlight effect, it does not eliminate it. Our paper provides the first analysis of this phenomenon, outlining the conditions under which data leads agents to look under the lamppost rather than engage in socially beneficial exploration.
With Michael Sockin and Richard Lowery
(Submitted)
We develop a model of how a principal motivates innovation when researchers generate ideas, but effort is unobservable. The principal relies on career incentives tied to peer recognition, measured through citations. Because citations rise with the number of researchers working on a topic, they create coordination incentives that can distort effort. Researchers may over-coordinate in crowded areas, producing “academic bubbles” with little prospect of advancing knowledge. We test the model in research on the genetic determinants of human disease and show that crowding inflates citation impact, with patterns suggesting that career concerns misallocate scientific effort in topic selection.
With Dan Schliesmann
Learning from failure is central to innovation. When experimentation is costly, firms often generalize from failed projects to evaluate related but untested opportunities. We argue that this seemingly efficient strategy can systematically misdirect innovation search even when feedback from one project is informative about nearby alternatives. Because firms tend to test promising approaches, failures occur disproportionately in regions of the innovation landscape where nearby alternatives are also likely valuable. Generalizing from those failures, therefore, mostly screens out promising opportunities. We test this mechanism in pharmaceutical R&D using data on patenting, clinical trial failures, and the biological relatedness of drug targets. Following a failure, firms reduce their investment not only in the focal target but also in related targets, especially high-potential ones. We find that the resulting increase in false negatives exceeds the decline in false positives, worsening the allocation of innovative effort. These effects are more pronounced in smooth landscapes, where local correlation makes generalization both more useful and more costly. Taken together, our findings identify a mechanism through which learning from sparse experiments can redirect search away from valuable opportunities.
With Abhishek Nagaraj
With Bikash Kumar Panda and Charlie Guthmann
With Enrico Berkes and Matthew Lee Chen
Research Policy, 2026
The study of innovation depends heavily on high-quality patent data. Yet, datasets containing complete patent documents focus only on recent decades, while historical patent datasets with broader temporal coverage typically lack detailed information. Therefore, our ability to leverage advances in textual analysis to study long-run innovation dynamics remains limited. To address this gap, we introduce a large-scale dataset of the universe of technical specifications of British patents granted between 1617 and 1899. Our data consists of the full specification texts alongside linked information about inventors, including their disambiguated names, occupations, and addresses. We use our data to document changes over time in total inventive activity, the geography of innovation, inventor occupations, and patent novelty and impact. Finally, we discuss use cases and avenues for subsequent research.
With Fernando Stipanicic and Abhishek Nagaraj
Harvard Data Science Review, 2025
Microdata from government agencies is believed to be valuable for economics research, and yet access to this data is highly restricted due to concerns about privacy and security. We provide an empirical assessment of the use and impact of restricted-access data that researchers can analyze at the U.S. Census Bureau's secure facilities. Our findings show that the use of the Census Bureau's confidential data is growing and that the publications employing it have a higher impact on the scientific and policy debate. However, adoption remains largely limited to established researchers from prestigious institutions. Our results and discussion inform the design of policies that balance privacy protection with accessibility to confidential microdata.
With Alessandro Nuvolari and Valentina Tartari
Explorations in Economic History, 82, 101419, 2021 [Lead article]
Winner, Bernardo Nobile Prize for the best Master's thesis using patent data
The distinction between macro- and microinventions is at the core of recent debates on the Industrial Revolution. Yet, the empirical testing of this notion has remained elusive. We address this issue by introducing a new quality indicator for all patents granted in England in the period 1700–1850. The indicator provides the opportunity for a large-scale empirical appraisal of macro- and microinventions. Our findings indicate that macroinventions did not exhibit any specific time-clustering, while microinventions were characterized by clustering behavior. In addition, we find that macroinventions displayed a labor-saving bias and were mostly introduced by professional engineers. These results suggest that Allen’s and Mokyr’s views of macroinventions, rather than conflicting, should be regarded as complementary.
With Giovanni Dosi
in Alcorta et al. (eds), New Perspectives on Structural Change: Causes and Consequences of Structural Change in the Global Economy, 2021, Oxford: Oxford University Press
In this chapter we discuss the role of natural resources and endowment structures in structural change. Departing from theories of trade that stress specialization according to one’s comparative advantages as the key route to development, we articulate an alternative point of view on the role of technological learning and absolute advantages in structural transformation. Ricardian adjustment processes relying on endowment-based comparative advantages are often a misleading driver of development; rather, technological competitiveness offers a better criterion for achieving sustained economic well-being. This theoretical perspective provides useful guidance for interpreting the effects of globalization and the role of natural resources relative to industrial and trade policies in shaping the process of structural change and economic development.
With Valeria Cirillo, Arianna Martinelli, and Alessandro Nuvolari
Research Policy, 48(4), 905-922, 2019
One of the most significant results of the qualitative literature on national systems of innovation (NSIs) is that different systemic arrangements (i.e. configurations of actors and institutions) can deliver similar levels of innovative performance. Using factor analysis on a novel dataset of 29 quantitative indicators of innovative activities, we provide an empirical characterization of the structure of European NSIs over the last ten years. Our results cast doubt on the empirical significance of the “equifinality” of heterogeneous systemic arrangements in the context of NSIs. Innovation systems show inherent complexity, which leads to a high level of complementarity among their constituent components and configurations. This result implies that successful innovation policies should be systemic, leaving little flexibility in policy design and scope.
[Paper]
+1 (341) 400-3543 | mtranc@wharton.upenn.edu | The Wharton School of the University of Pennsylvania