Natural language querying of biological databases with large language models

Journal Article

Harnessing AI To Expedite R&D

Meaningful querying of biological databases today requires specialist knowledge of structured languages such as SQL, SPARQL, or Cypher, which makes data mining slow, labor-intensive, and inaccessible to many researchers. Reliable natural language querying would change that. It is also a prerequisite for the next generation of AI co-scientist systems: tools that automate scientific hypothesis generation at scale.

This peer-reviewed paper, produced by the Pistoia Alliance’s Large Language Models in Life Sciences project and published in Drug Discovery Today (May 2026), reports the outcomes of a systematic assessment of current practices in natural language querying with LLMs.

Highlights

Accurate natural language data mining is a requirement for AI co-scientist systems

Multi-agent LLM systems combined with deterministic queries offer the best accuracy-flexibility balance

LLM agents must be Findable and Reusable, in line with the FAIR principles (Findable, Accessible, Interoperable, Reusable), and require open API standards

Shared benchmarks for natural language data mining systems are needed across the industry

Authors

Vladimir A. Makarov (Pistoia Alliance), Oleg Stroganov (Rancho Biosciences), Laura I. Furlong (MedBioInformatics Solutions), Brian Evarts (Crown Point Technologies), Loes van den Biggelaar (The Hyve), Alexandros Goulas (AbbVie), Etzard Stolte (Roche), Derek Marren (AstraZeneca), and Lars Greiffenberg (AbbVie).

__________________________________

What did the study test?

The authors evaluated 21 different strategies for translating natural language into structured database queries, using the Open Targets Platform as a real-world target discovery and validation use case. Five LLMs were tested — GPT-4o, Claude 3.5 Sonnet, o1, GPT-4o-mini, and open-mistral-7b — across naïve, template-based, retrieval-augmented, prompt-optimized, and multi-agent approaches.

What works — and what doesn’t?

Naïve prompting failed almost universally on complex scientific questions, even when the full database schema was supplied. Template-based strategies achieved 100% accuracy but are rigid: they cannot be transferred to new data sources without substantial human effort, and they do not scale.
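To make the trade-off concrete, the following is a minimal sketch of how a template-based strategy might work. It is not the authors' implementation. The question patterns, the function name question_to_query, the placeholder identifier EFO_0000000, and the GraphQL query shapes are all assumptions chosen for illustration and have not been checked against the live Open Targets schema.

```python
import re
from string import Template
from typing import Optional

# Hypothetical question patterns paired with query templates.
# Query shapes are illustrative only, not verified against a real schema.
TEMPLATES = [
    (
        re.compile(
            r"which targets are associated with (?P<disease_id>\w+_\d+)\??",
            re.IGNORECASE,
        ),
        Template(
            'query { disease(efoId: "$disease_id") '
            "{ associatedTargets { rows { target { approvedSymbol } score } } } }"
        ),
    ),
    (
        re.compile(
            r"what is the approved symbol of (?P<target_id>ENSG\d+)\??",
            re.IGNORECASE,
        ),
        Template('query { target(ensemblId: "$target_id") { approvedSymbol } }'),
    ),
]


def question_to_query(question: str) -> Optional[str]:
    """Return a filled-in query if the question matches a known pattern."""
    text = question.strip()
    for pattern, template in TEMPLATES:
        match = pattern.fullmatch(text)
        if match:
            # Only the captured identifiers vary; the query text is fixed.
            return template.substitute(match.groupdict())
    # Anything outside the hand-written patterns cannot be answered.
    return None


if __name__ == "__main__":
    print(question_to_query("Which targets are associated with EFO_0000000?"))
    print(question_to_query("Which pathways are enriched in this disease?"))
```

Because the query text is fixed and only identifiers are substituted, a matched question always yields a well-formed query. Every new question type or data source, however, needs another hand-written pattern, which is the rigidity described above.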

Multi-agent strategies — in which multiple LLM agents challenge one another’s outputs and interact with a human user — achieved 83–98% accuracy on complex questions with the best-performing model (o1).

This combination of accuracy, flexibility, and adaptability across data sources represents the most practical path forward identified in the study. Entity recognition (for example, correctly resolving “ALS” to “Amyotrophic Lateral Sclerosis”) was the single largest remaining source of error.
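One common way to reduce entity recognition errors is to expand abbreviations before the question reaches the query generator. The sketch below shows a dictionary-based version of that idea, as an illustration rather than the method used in the study. The synonym table and the function name expand_entities are hypothetical. A production system would more likely draw on a curated ontology and would need to handle abbreviations that have more than one meaning.

```python
import re
from typing import Dict

# Hypothetical synonym table holding the single example from the text.
SYNONYMS: Dict[str, str] = {
    "ALS": "Amyotrophic Lateral Sclerosis",
}


def expand_entities(question: str, synonyms: Dict[str, str] = SYNONYMS) -> str:
    """Replace whole-word, case-sensitive abbreviations with canonical names."""
    if not synonyms:
        return question
    # Longest keys first so a longer abbreviation wins over a shorter prefix.
    keys = sorted(synonyms, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: synonyms[m.group(1)], question)


if __name__ == "__main__":
    print(expand_entities("Which genes are linked to ALS?"))
```

Matching is case-sensitive and bounded by word boundaries, so ordinary words that merely contain the letters are left untouched.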

What does the industry need next?

The paper makes three calls to action for the field:

Combine multi-agent LLM systems with deterministic API calls for simple, high-confidence retrieval tasks.
Develop a shared industry benchmark for natural language data mining that is resilient to LLM background-knowledge contamination — a subtle but significant source of evaluation error.
Establish FAIR-aligned open standards — such as the Model Context Protocol — for the discovery of and engagement with AI agents across commercial and academic systems.
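The first call to action can be pictured as a simple router: questions that match a known, high-confidence retrieval task go to a deterministic call, and everything else goes to a loop in which one agent drafts a query and another critiques it. The sketch below is an assumption-laden illustration, not the architecture evaluated in the paper. The table DETERMINISTIC, the function answer, and the two stub agents are hypothetical, and the stubs stand in for real LLM calls.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical registry of simple questions with fixed, pre-validated calls.
DETERMINISTIC: Dict[str, str] = {
    "list approved drugs for a target": "deterministic:drugs_for_target",
}

Generator = Callable[[str, Optional[str]], str]
Critic = Callable[[str, str], Optional[str]]


def answer(
    question: str,
    generate: Generator,
    critique: Critic,
    max_rounds: int = 3,
) -> Tuple[str, Optional[str]]:
    """Route a question and return (route taken, query or None)."""
    key = question.strip().lower().rstrip("?")
    if key in DETERMINISTIC:
        # Simple, high-confidence task: no LLM involved.
        return "deterministic", DETERMINISTIC[key]

    feedback: Optional[str] = None
    for _ in range(max_rounds):
        query = generate(question, feedback)
        feedback = critique(question, query)
        if feedback is None:
            # The critic raised no objection, so accept the draft.
            return "multi-agent", query
    # No agreement within the round limit; a human user would step in here.
    return "unresolved", None


# Stub agents standing in for LLM calls (illustrative behavior only).
def stub_generate(question: str, feedback: Optional[str]) -> str:
    return "draft query" if feedback is None else "revised query"


def stub_critique(question: str, query: str) -> Optional[str]:
    return "add a score filter" if query == "draft query" else None


if __name__ == "__main__":
    print(answer("List approved drugs for a target?", stub_generate, stub_critique))
    print(answer("A complex scientific question", stub_generate, stub_critique))
```

The round limit keeps the critique loop bounded, and the "unresolved" outcome marks the point at which a human user would be asked to intervene.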

This research was coordinated by the Pistoia Alliance Large Language Models in Life Sciences project. To learn more about the project, visit https://pistoiaalliance.org/project/benchmarks-for-natural-language-data-mining-with-llms/.

Published on: April 27, 2026

