Natural language querying of biological databases with large language models 

Journal Article

Querying biological databases in a meaningful way currently requires specialist knowledge of structured query languages such as SQL, SPARQL, or Cypher, which makes data mining slow, labor-intensive, and inaccessible to many researchers. Reliable natural language querying would change that. It would also be a prerequisite for the next generation of AI co-scientist systems: tools that automate scientific hypothesis generation at scale.

This peer-reviewed paper, produced by the Pistoia Alliance’s Large Language Models in Life Sciences project and published in Drug Discovery Today (May 2026), reports the outcomes of a systematic assessment of current practices in natural language querying with LLMs. 

Highlights 

  • Accurate natural language data mining is a requirement for AI co-scientist systems 
  • Multi-agent LLM systems combined with deterministic queries offer the best accuracy-flexibility balance 
  • LLM agents must be Findable and Reusable, in keeping with the FAIR principles (Findable, Accessible, Interoperable, Reusable), and require open API standards
  • Shared benchmarks for natural language data mining systems are needed across the industry 

Authors 

Vladimir A. Makarov (Pistoia Alliance), Oleg Stroganov (Rancho Biosciences), Laura I. Furlong (MedBioInformatics Solutions), Brian Evarts (Crown Point Technologies), Loes van den Biggelaar (The Hyve), Alexandros Goulas (AbbVie), Etzard Stolte (Roche), Derek Marren (AstraZeneca), and Lars Greiffenberg (AbbVie). 

__________________________________ 

What did the study test? 

The authors evaluated 21 different strategies for translating natural language into structured database queries, using the Open Targets Platform as a real-world target discovery and validation use case. Five LLMs were tested — GPT-4o, Claude 3.5 Sonnet, o1, GPT-4o-mini, and open-mistral-7b — across naïve, template-based, retrieval-augmented, prompt-optimized, and multi-agent approaches. 

What works — and what doesn’t? 

Naïve prompting failed almost universally on complex scientific questions, even when the full database schema was supplied. Template-based strategies achieved 100% accuracy but are rigid: they cannot be transferred to new data sources without substantial human effort, and they do not scale. 
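
To illustrate why template-based strategies are both reliable and rigid, here is a minimal sketch in Python. It is not the authors' implementation: the table schema, the placeholder rows, the question pattern, and the `answer` function are all assumptions made up for this example, and an in-memory SQLite database stands in for a real biological data source.

```python
import re
import sqlite3

# Hypothetical schema and placeholder rows, invented purely for illustration.
SCHEMA = """
CREATE TABLE association (disease TEXT, target TEXT, score REAL);
INSERT INTO association VALUES
  ('disease x', 'TARGET_A', 0.9),
  ('disease x', 'TARGET_B', 0.4),
  ('disease y', 'TARGET_C', 0.7);
"""

# Each template pairs a question pattern with a hand-written, parameterized query.
TEMPLATES = [
    (
        re.compile(r"^which targets are associated with (?P<disease>.+?)\??$", re.I),
        "SELECT target FROM association WHERE disease = :disease ORDER BY score DESC",
    ),
]


def answer(question, conn):
    """Return query results if a template matches, or None if none applies."""
    for pattern, sql in TEMPLATES:
        match = pattern.match(question.strip())
        if match:
            # Normalize captured values to the lowercase form stored in the table.
            params = {key: value.lower() for key, value in match.groupdict().items()}
            return [row[0] for row in conn.execute(sql, params)]
    return None  # outside template coverage: someone must write a new template


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

print(answer("Which targets are associated with disease X?", conn))
print(answer("What drugs are in trials for disease X?", conn))
```

The first question matches the template and runs a fixed, vetted query; the second falls outside template coverage and returns `None`. Every new question shape or data source needs another hand-written entry, which is the scaling limit described above.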

Multi-agent strategies — in which multiple LLM agents challenge one another’s outputs and interact with a human user — achieved 83–98% accuracy on complex questions with the best-performing model (o1).  

This combination of accuracy, flexibility, and adaptability across data sources represents the most practical path forward identified in the study. Entity recognition (for example, correctly resolving “ALS” to “Amyotrophic Lateral Sclerosis”) was the single largest remaining source of error. 
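
The entity recognition step can be pictured as a lookup that either resolves a mention to one canonical name or declines to guess. The sketch below is a simplified assumption, not the method used in the study: the `SYNONYMS` table and `resolve_entity` function are hypothetical, and the "xyz" entry is a placeholder included only to show the ambiguous case.

```python
# Hypothetical synonym table; a real system would likely draw on an ontology
# or a curated vocabulary rather than a hand-written dictionary.
SYNONYMS = {
    "als": ["Amyotrophic Lateral Sclerosis"],
    "amyotrophic lateral sclerosis": ["Amyotrophic Lateral Sclerosis"],
    # Placeholder entry showing an abbreviation with more than one expansion.
    "xyz": ["Placeholder Condition One", "Placeholder Condition Two"],
}


def resolve_entity(mention):
    """Map a mention to one canonical name, or report why it was not resolved."""
    candidates = SYNONYMS.get(mention.strip().lower(), [])
    if len(candidates) == 1:
        return {"status": "resolved", "name": candidates[0]}
    if not candidates:
        return {"status": "unknown", "name": None}
    # More than one expansion: surface the options instead of picking one.
    return {"status": "ambiguous", "name": None, "candidates": candidates}


print(resolve_entity("ALS"))
print(resolve_entity("xyz"))
print(resolve_entity("not in table"))
```

Returning an explicit "ambiguous" or "unknown" status, rather than a best guess, gives a downstream agent or the human user a chance to clarify before a query is built on the wrong entity.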

What does the industry need next? 

The paper makes three calls to action for the field: 

  • Combine multi-agent LLM systems with deterministic API calls for simple, high-confidence retrieval tasks. 
  • Develop a shared industry benchmark for natural language data mining that is resilient to LLM background-knowledge contamination — a subtle but significant source of evaluation error. 
  • Establish FAIR-aligned open standards — such as the Model Context Protocol — for the discovery of and engagement with AI agents across commercial and academic systems. 
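
The first call to action, pairing deterministic retrieval with multi-agent querying, might be sketched as a simple router. Everything here is an illustrative assumption: the function names, the single recognized question shape, and the stub generator and critic, which stand in for LLM agents and make no model or API calls.

```python
from typing import Callable, Optional, Tuple


def deterministic_lookup(question: str) -> Optional[str]:
    """Stand-in for a fixed API call; handles one recognized question shape."""
    prefix = "what is the identifier of "
    normalized = question.strip().rstrip("?").lower()
    if normalized.startswith(prefix):
        return f"lookup:{normalized[len(prefix):]}"
    return None


def multi_agent_answer(
    question: str,
    generate: Callable[[str, Optional[str]], str],
    critique: Callable[[str, str], Optional[str]],
    max_rounds: int = 3,
) -> Optional[str]:
    """A generator proposes a draft; a critic returns feedback, or None to accept."""
    feedback = None
    for _ in range(max_rounds):
        draft = generate(question, feedback)
        feedback = critique(question, draft)
        if feedback is None:
            return draft
    return None  # no accepted draft: hand the question to the human user


def route(question, generate, critique) -> Tuple[str, Optional[str]]:
    """Try the deterministic path first; fall back to the agent loop."""
    result = deterministic_lookup(question)
    if result is not None:
        return ("deterministic", result)
    return ("multi_agent", multi_agent_answer(question, generate, critique))


# Stub agents standing in for LLM calls, so the sketch runs offline.
def stub_generate(question, feedback):
    return "draft v2" if feedback else "draft v1"


def stub_critique(question, draft):
    return None if draft == "draft v2" else "revise"


print(route("What is the identifier of target A?", stub_generate, stub_critique))
print(route("Which targets share pathways with target A?", stub_generate, stub_critique))
```

Simple, high-confidence questions never reach the agents, while anything else goes through a bounded generate-and-critique loop that gives up after `max_rounds` rather than returning an unreviewed draft.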

This research was coordinated by the Pistoia Alliance Large Language Models in Life Sciences project. To learn more about the project, visit https://pistoiaalliance.org/project/benchmarks-for-natural-language-data-mining-with-llms/

Published on: April 27, 2026