AI and Scientific Knowledge

Published at EMNLP, 2025.

Large language models (LLMs) are rapidly becoming tools for scientific information retrieval, yet their ability to recognize individual scientists and their contributions is far from uniform. In this work, we audit how three leading LLMs—GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro—recognize over 100,000 physicists across the productivity spectrum.
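A minimal sketch of what such a recognition audit might look like, assuming the audit asks each model whether it can describe a named scientist and then labels the reply. The prompt wording, the `UNKNOWN` convention, and the `mock_model` stand-in for the GPT-4o / Claude / Gemini APIs are all illustrative assumptions, not the paper's actual protocol.

```python
def build_prompt(name: str) -> str:
    """Prompt asking the model to identify a scientist by name (hypothetical wording)."""
    return (
        f"Who is the physicist {name}? "
        "If you do not know this person, reply exactly 'UNKNOWN'."
    )

def classify_reply(reply: str) -> str:
    """Label a model reply as 'recognized' or 'not_recognized'."""
    return "not_recognized" if "UNKNOWN" in reply.upper() else "recognized"

def audit(names, query_model):
    """Run the probe for each name and collect recognition labels."""
    return {name: classify_reply(query_model(build_prompt(name))) for name in names}

# Mocked model call standing in for a real LLM API; "Jane Example" is a
# placeholder for a scientist the model fails to recognize.
def mock_model(prompt: str) -> str:
    return "UNKNOWN" if "Jane Example" in prompt else "A physicist known for ..."

results = audit(["Albert Einstein", "Jane Example"], mock_model)
print(results)
# → {'Albert Einstein': 'recognized', 'Jane Example': 'not_recognized'}
```

Aggregating such labels by the scientists' gender and institutional affiliation is what would surface the disparities discussed below.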

Our analysis reveals systematic disparities: LLMs are significantly better at recognizing male scientists and those affiliated with Western institutions, mirroring and amplifying existing biases in scientific visibility. We trace these gaps in part to Wikipedia, which serves as a key training source and exhibits similar patterns of coverage inequality. The findings highlight that as LLMs become embedded in scientific workflows—from literature search to peer review—they risk reinforcing the unequal recognition structures already present in academia.