Benchmarks for Natural Language Data Mining with LLMs

Project Charter

Develop comprehensive benchmarks for natural language data mining and scientific chat applications to enable rigorous evaluation of Large Language Model (LLM) performance across the key stages of the data-to-insight pipeline in pharmaceutical R&D. This work will ultimately support more effective deployment of AI in drug discovery.

The Challenge 

Pharma and biotech organizations are increasingly experimenting with natural‑language‑based scientific search, including NL‑to‑Query‑Language (NL2QL) translation, “AI co‑scientist” tools, and scientific chat systems. Recent Pistoia Alliance investigations show that, although these tools are advancing rapidly, there is no universally accepted set of benchmarks to evaluate each stage of the natural‑language data mining process. 

This absence of standardized test sets, evaluation metrics, and quality criteria makes it difficult to: 

  • Assess true tool performance 
  • Compare different systems objectively 
  • Identify gaps and opportunities for improvement 
  • Build confidence in AI‑enabled scientific assistants 

As a result, both developers and end users lack the foundational resources needed for reliable evaluation, which slows progress and responsible adoption.

 

The Solution 

The first phase of the Large Language Models in the Life Sciences project highlighted the importance of proper benchmarks to assess natural language data mining systems. As a second phase, this project aims to define, construct, and assess the quality of a comprehensive suite of benchmarks covering the four steps of the natural language data mining workflow: 

  1. Understanding the question 
  2. Recognizing named entities and synonyms, and disambiguating terms 
  3. Constructing accurate structured queries 
  4. Assessing the quality of final answers 
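As a purely illustrative sketch (not a project deliverable), the four workflow stages above could each be scored independently against a gold-standard test case. All function names, the record layout, and the exact-match scoring scheme here are assumptions for demonstration only:

```python
"""Hypothetical stage-by-stage scoring harness for the four-step
natural language data mining workflow. Names and scoring are illustrative."""

from dataclasses import dataclass


@dataclass
class BenchmarkCase:
    """One gold-standard test case, with expected output at every stage."""
    question: str
    expected_intent: str        # stage 1: question understanding
    expected_entities: set      # stage 2: entities after disambiguation
    expected_query: str         # stage 3: structured query
    expected_answer: str        # stage 4: final answer


def score_stage(predicted, expected) -> float:
    """Toy exact-match score: 1.0 if the prediction equals the gold value."""
    return 1.0 if predicted == expected else 0.0


def evaluate(case: BenchmarkCase, system: dict) -> dict:
    """Run a candidate system through all four stages and score each one,
    so that failures can be localized to a specific step of the pipeline."""
    intent = system["understand"](case.question)
    entities = system["recognize_entities"](case.question)
    query = system["build_query"](case.question, entities)
    answer = system["answer"](query)
    return {
        "1_understanding": score_stage(intent, case.expected_intent),
        "2_entity_recognition": score_stage(entities, case.expected_entities),
        "3_query_construction": score_stage(query, case.expected_query),
        "4_answer_quality": score_stage(answer, case.expected_answer),
    }
```

In practice, scoring each stage independently (rather than only the final answer) is what lets a benchmark attribute an end-to-end failure to, say, entity disambiguation rather than query construction.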

To that end, the project will: 

  • Review existing or proposed benchmarks to determine coverage and where gaps remain 
  • Create new benchmarks where necessary, including test sets, metrics, and quality criteria 
  • Produce at least one white paper outlining the problem space, solutions, and long‑term strategy 
  • Provide a sustainable plan for benchmark evolution and maintenance 

Ultimately, this project will establish shared, industry-level best practices for benchmarking current and future AI-based research tools and strengthen the role of the Pistoia Alliance as a trusted steward of AI evaluation standards. For members, the benchmarks will provide objective criteria for selecting AI-based technologies, actionable guidance for improving AI products, and more reliable natural language data mining tools that accelerate high-quality R&D.

 

Targeted Outputs 

Planned outputs of this project are: 

  • A literature‑based review of existing benchmarks for all four workflow stages 
  • Identification and cataloguing of workflow gaps 
  • Construction of new benchmarks where gaps exist, including test sets, statistical evaluation metrics and cutoffs, and benchmark quality assessment strategies based on recognized criteria 
  • A white paper (or academic paper) documenting findings, recommendations, and best practices 
  • A long‑term maintenance and evolution plan for the constructed benchmarks 
  • A final report for the Pistoia Alliance AI community 
  • The publication of any new benchmark code and data to GitHub 
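To make the "statistical evaluation metrics and cutoffs" output concrete, here is one minimal, hypothetical example of such a metric: exact-match accuracy over (predicted, gold) query pairs, compared against an acceptance cutoff. The normalisation rule and the 0.8 threshold are arbitrary placeholders, not values the project has adopted:

```python
# Hypothetical benchmark metric: whitespace/case-normalised exact-match
# accuracy for structured-query construction, with a pass/fail cutoff.

def normalise(query: str) -> str:
    """Collapse whitespace and lowercase so trivial formatting differences
    between predicted and gold queries are not penalised."""
    return " ".join(query.lower().split())


def exact_match_accuracy(pairs) -> float:
    """Fraction of (predicted, gold) pairs that match after normalisation."""
    if not pairs:
        return 0.0
    hits = sum(1 for pred, gold in pairs if normalise(pred) == normalise(gold))
    return hits / len(pairs)


PASS_CUTOFF = 0.8  # placeholder acceptance threshold, not project-endorsed

pairs = [
    ("SELECT * FROM trials WHERE phase = 3",
     "select * from trials where phase = 3"),   # matches after normalisation
    ("SELECT name FROM compounds",
     "SELECT id FROM compounds"),               # genuine mismatch
]
acc = exact_match_accuracy(pairs)  # → 0.5, below the placeholder cutoff
```

Real benchmarks would pair such metrics with statistically justified cutoffs and quality criteria, as the outputs above describe; this sketch only shows the general shape of a metric-plus-cutoff evaluation.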

 

Why This Is Important and Why Now 

AI‑assisted scientific search is rapidly becoming central to R&D, yet reliable evaluation remains difficult without high‑quality, community‑vetted benchmarks. As organizations accelerate their adoption of LLM‑based tools, the absence of shared standards has become a critical barrier. With committed support already in place (see project supporters below) and clear evidence that benchmarking gaps hinder effective tool selection and safe deployment, now is the ideal moment to establish a common framework that will guide responsible, comparable, and future‑proof use of natural language data mining technologies across the life sciences. We welcome additional sponsors and intellectual supporters.

Get Involved

To find out more or participate in this project, get in touch with Vladimir Makarov.


Project Sponsors

  • AbbVie