Benchmarks for Natural Language Data Mining with LLMs

Project Charter

Develop comprehensive benchmarks for natural-language data mining and scientific chat applications to facilitate rigorous evaluation of Large Language Model (LLM) performance across key stages of the data-to-insight pipeline in pharmaceutical R&D, ultimately supporting more effective deployment of AI in drug discovery.

The Challenge 

Pharma and biotech organizations are increasingly experimenting with natural‑language‑based scientific search, including NL‑to‑Query‑Language (NL2QL) translation, “AI co‑scientist” tools, and scientific chat systems. Recent Pistoia Alliance investigations show that, although these tools are advancing rapidly, there is no universally accepted set of benchmarks to evaluate each stage of the natural‑language data mining process. 

This absence of standardized test sets, evaluation metrics, and quality criteria makes it difficult to: 

  • Assess true tool performance 
  • Compare different systems objectively 
  • Identify gaps and opportunities for improvement 
  • Build confidence in AI‑enabled scientific assistants 

As a result, both developers and end users lack the foundational resources needed for reliable evaluation, which slows progress and responsible adoption.

 

The Solution 

The first phase of the Large Language Models in the Life Sciences project highlighted the importance of proper benchmarks to assess natural language data mining systems. As a second phase, this project aims to define, construct, and assess the quality of a comprehensive suite of benchmarks covering the four steps of the natural language data mining workflow: 

  1. Understanding the question 
  2. Recognizing named entities and synonyms, and disambiguating terms 
  3. Constructing accurate structured queries 
  4. Assessing the quality of final answers 
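To make the four stages concrete, a single benchmark item could pair a natural-language question with gold-standard annotations for each stage. The sketch below is purely illustrative; the field names and schema are hypothetical and do not represent a format defined by this project:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    # Stage 1: the natural-language question to be understood
    question: str
    # Stage 2: gold entities mapped to canonical identifiers (surface form -> ID)
    gold_entities: dict
    # Stage 3: a reference structured query (e.g. SQL or SPARQL)
    gold_query: str
    # Stage 4: the expected final answer for end-to-end scoring
    gold_answer: str

# Example item (illustrative values; EGFR resolves to HGNC:3236)
item = BenchmarkItem(
    question="Which approved drugs target EGFR?",
    gold_entities={"EGFR": "HGNC:3236"},
    gold_query=(
        "SELECT drug FROM targets "
        "WHERE gene = 'HGNC:3236' AND status = 'approved'"
    ),
    gold_answer="gefitinib, erlotinib, afatinib, osimertinib",
)
```

Structuring items this way lets each stage be scored independently (entity resolution, query correctness, answer quality) as well as end to end.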

To that end, the project will: 

  • Review existing or proposed benchmarks to determine coverage and where gaps remain 
  • Create new benchmarks where necessary, including test sets, metrics, and quality criteria 
  • Produce at least one white paper outlining the problem space, solutions, and long‑term strategy 
  • Provide a sustainable plan for benchmark evolution and maintenance 

Ultimately, this project will establish shared, industry-level best practices for benchmarking current and future AI-based research tools and strengthen the role of the Pistoia Alliance as a trusted steward of evaluation standards for AI. Member-specific value includes objective criteria for the selection of AI-based technologies, actionable guidance for the improvement of AI products, and overall enhanced reliability of natural language data mining tools to accelerate high-quality R&D. 

 

Targeted Outputs 

Planned outputs of this project are: 

  • A literature‑based review of existing benchmarks for all four workflow stages 
  • Identification and cataloguing of workflow gaps 
  • Construction of new benchmarks where gaps exist, including test sets, statistical evaluation metrics and cutoffs, and benchmark quality assessment strategies based on recognized criteria 
  • A white paper (or academic paper) documenting findings, recommendations, and best practices 
  • A long‑term maintenance and evolution plan for the constructed benchmarks 
  • A final report for the Pistoia Alliance AI community 
  • Publication of any new benchmark code and data to GitHub 
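For the entity-recognition stage, one commonly used family of evaluation metrics is set-based precision, recall, and F1 over recognized entities. A minimal sketch (the example entity sets are illustrative only):

```python
def entity_prf(gold: set, predicted: set) -> tuple:
    """Set-based precision, recall, and F1 for recognized entities."""
    if not gold and not predicted:
        return 1.0, 1.0, 1.0  # trivially perfect when both sets are empty
    tp = len(gold & predicted)                      # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# One of two predictions is correct, and one of two gold entities is found,
# so precision = recall = F1 = 0.5
p, r, f = entity_prf({"EGFR", "gefitinib"}, {"EGFR", "lung cancer"})
```

Analogous stage-specific metrics (e.g. exact-match or execution accuracy for generated queries) could be defined alongside the quality criteria the project will establish.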

 

Why This Is Important and Why Now 

AI‑assisted scientific search is rapidly becoming central to R&D, yet reliable evaluation remains difficult without high‑quality, community‑vetted benchmarks. As organizations accelerate their adoption of LLM‑based tools, the absence of shared standards has become a critical barrier. With committed support already in place (see project supporters below) and clear evidence that benchmarking gaps hinder effective tool selection and safe deployment, now is the ideal moment to establish a common framework that will guide responsible, comparable, and future‑proof use of natural language data mining technologies across the life sciences. We welcome additional sponsors and intellectual supporters.

Get Involved

To find out more or participate in this project, get in touch with Vladimir Makarov.


Project Sponsors

  • AbbVie