Benchmarks for Natural Language Data Mining with LLMs

Project Charter

Develop comprehensive benchmarks for natural-language data mining and scientific chat applications to facilitate rigorous evaluation of Large Language Model (LLM) performance across key stages of the data-to-insight pipeline in pharmaceutical R&D, ultimately supporting more effective deployment of AI in drug discovery.

The Challenge 

Pharma and biotech organizations are increasingly experimenting with natural‑language‑based scientific search, including NL‑to‑Query‑Language (NL2QL) translation, “AI co‑scientist” tools, and scientific chat systems. Recent Pistoia Alliance investigations show that, although these tools are advancing rapidly, there is no universally accepted set of benchmarks to evaluate each stage of the natural‑language data mining process. 

This absence of standardized test sets, evaluation metrics, and quality criteria makes it difficult to: 

  • Assess true tool performance 
  • Compare different systems objectively 
  • Identify gaps and opportunities for improvement 
  • Build confidence in AI‑enabled scientific assistants 

As a result, both developers and end users lack the foundational resources needed for reliable evaluation, which slows progress and responsible adoption.

 

The Solution 

The first phase of the Large Language Models in the Life Sciences project highlighted the importance of proper benchmarks to assess natural language data mining systems. As a second phase, this project aims to define, construct, and assess the quality of a comprehensive suite of benchmarks covering the four steps of the natural language data mining workflow: 

  1. Understanding the question 
  2. Recognizing named entities and synonyms, and disambiguating terms 
  3. Constructing accurate structured queries 
  4. Assessing the quality of final answers 
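To make the four stages concrete, a single benchmark item could pair a natural-language question with gold-standard annotations for each stage. The sketch below is purely illustrative; the field names and schema are hypothetical and do not represent a format defined by this project:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    # Stage 1: the natural-language question to be understood
    question: str
    # Stage 2: gold entities mapped to canonical identifiers (surface form -> ID)
    gold_entities: dict
    # Stage 3: a reference structured query (e.g. SQL or SPARQL)
    gold_query: str
    # Stage 4: the expected final answer for end-to-end scoring
    gold_answer: str

# Example item (illustrative values; EGFR resolves to HGNC:3236)
item = BenchmarkItem(
    question="Which approved drugs target EGFR?",
    gold_entities={"EGFR": "HGNC:3236"},
    gold_query=(
        "SELECT drug FROM targets "
        "WHERE gene = 'HGNC:3236' AND status = 'approved'"
    ),
    gold_answer="gefitinib, erlotinib, afatinib, osimertinib",
)
```

Structuring items this way lets each stage be scored independently (entity resolution, query correctness, answer quality) as well as end to end.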

To that end, the project will: 

  • Review existing or proposed benchmarks to determine coverage and where gaps remain 
  • Create new benchmarks where necessary, including test sets, metrics, and quality criteria 
  • Produce at least one white paper outlining the problem space, solutions, and long‑term strategy 
  • Provide a sustainable plan for benchmark evolution and maintenance 

Ultimately, this project will establish shared, industry-level best practices for benchmarking current and future AI-based research tools and strengthen the role of the Pistoia Alliance as a trusted steward of evaluation standards for AI. Member-specific value includes objective criteria for the selection of AI-based technologies, actionable guidance for the improvement of AI products, and overall enhanced reliability of natural language data mining tools to accelerate high-quality R&D. 

 

Targeted Outputs 

Planned outputs of this project are: 

  • A literature‑based review of existing benchmarks for all four workflow stages 
  • Identification and cataloguing of workflow gaps 
  • Construction of new benchmarks where gaps exist, including test sets, statistical evaluation metrics and cutoffs, and benchmark quality assessment strategies based on recognized criteria 
  • A white paper (or academic paper) documenting findings, recommendations, and best practices 
  • A long‑term maintenance and evolution plan for the constructed benchmarks 
  • A final report for the Pistoia Alliance AI community 
  • Publication of any new benchmark code and data to GitHub 
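For the entity-recognition stage, one commonly used family of evaluation metrics is set-based precision, recall, and F1 over recognized entities. A minimal sketch (the example entity sets are illustrative only):

```python
def entity_prf(gold: set, predicted: set) -> tuple:
    """Set-based precision, recall, and F1 for recognized entities."""
    if not gold and not predicted:
        return 1.0, 1.0, 1.0  # trivially perfect when both sets are empty
    tp = len(gold & predicted)                      # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# One of two predictions is correct, and one of two gold entities is found,
# so precision = recall = F1 = 0.5
p, r, f = entity_prf({"EGFR", "gefitinib"}, {"EGFR", "lung cancer"})
```

Analogous stage-specific metrics (e.g. exact-match or execution accuracy for generated queries) could be defined alongside the quality criteria the project will establish.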

 

Why This Is Important and Why Now 

AI‑assisted scientific search is rapidly becoming central to R&D, yet reliable evaluation remains difficult without high‑quality, community‑vetted benchmarks. As organizations accelerate their adoption of LLM‑based tools, the absence of shared standards has become a critical barrier. With committed support already in place (see project supporters below) and clear evidence that benchmarking gaps hinder effective tool selection and safe deployment, now is the ideal moment to establish a common framework that will guide responsible, comparable, and future‑proof use of natural language data mining technologies across the life sciences. We welcome additional sponsors and intellectual supporters.

Get Involved

To find out more or participate in this project, get in touch with Vladimir Makarov.


Project Sponsors

  • AbbVie