Project Charter
Develop comprehensive benchmarks for natural language data mining and scientific chat applications to enable rigorous evaluation of Large Language Model (LLM) performance across key stages of the data-to-insight pipeline in pharmaceutical R&D, ultimately supporting more effective deployment of AI in drug discovery.
The Challenge
Pharma and biotech organizations are increasingly experimenting with natural‑language‑based scientific search, including NL‑to‑Query‑Language (NL2QL) translation, “AI co‑scientist” tools, and scientific chat systems. Recent Pistoia Alliance investigations show that, although these tools are advancing rapidly, there is no universally accepted set of benchmarks to evaluate each stage of the natural‑language data mining process.
This absence of standardized test sets, evaluation metrics, and quality criteria makes it difficult to:
- Assess true tool performance
- Compare different systems objectively
- Identify gaps and opportunities for improvement
- Build confidence in AI‑enabled scientific assistants
As a result, both developers and end users lack the foundational resources needed for reliable evaluation, which slows progress and responsible adoption.
The Solution
The first phase of the Large Language Models in the Life Sciences project highlighted the importance of proper benchmarks for assessing natural language data mining systems. In this second phase, the project aims to define, construct, and assess the quality of a comprehensive suite of benchmarks covering the four steps of the natural language data mining workflow (a sketch of how a single benchmark case might span these steps follows the list):
- Understanding the question
- Recognizing named entities and synonyms, and disambiguating terms
- Constructing accurate structured queries
- Assessing the quality of final answers
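To make the four steps concrete, the sketch below shows one possible shape for a single benchmark case spanning all four stages. This is a minimal illustration, not a project deliverable: the class, field names, and the example question, entities, query, and answer are all assumptions for exposition.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkCase:
    """One hypothetical test case spanning the four workflow stages.
    All field names and example values are illustrative assumptions."""
    question: str              # Stage 1: the natural-language question to understand
    intent: str                # Stage 1: gold-standard interpretation of the question
    entities: dict[str, str]   # Stage 2: surface form -> canonical identifier
    expected_query: str        # Stage 3: gold-standard structured query (e.g., SQL)
    reference_answer: str      # Stage 4: reference against which answers are scored

case = BenchmarkCase(
    question="Which approved JAK inhibitors are indicated for rheumatoid arthritis?",
    intent="list approved drugs filtered by target class and indication",
    entities={
        "JAK inhibitors": "Janus kinase inhibitor (drug class)",  # identifiers illustrative
        "rheumatoid arthritis": "MeSH:D001172",
    },
    expected_query=(
        "SELECT drug_name FROM approvals "
        "WHERE target_class = 'JAK' AND indication = 'rheumatoid arthritis'"
    ),
    reference_answer="tofacitinib; baricitinib; upadacitinib",
)
```

Structuring cases this way would let each stage be scored independently, so a failure to construct the query can be distinguished from a failure to understand the question or to resolve an entity.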
To that end, the project will:
- Review existing or proposed benchmarks to determine what they cover and where gaps remain
- Create new benchmarks where necessary, including test sets, metrics, and quality criteria
- Produce at least one white paper outlining the problem space, solutions, and long‑term strategy
- Provide a sustainable plan for benchmark evolution and maintenance
Ultimately, this project will establish shared, industry-level best practices for benchmarking current and future AI-based research tools and strengthen the role of the Pistoia Alliance as a trusted steward of AI evaluation standards. For members, the value includes objective criteria for selecting AI-based technologies, actionable guidance for improving AI products, and more reliable natural language data mining tools that accelerate high-quality R&D.
Targeted Outputs
Planned outputs of this project are:
- A literature‑based review of existing benchmarks for all four workflow stages
- Identification and cataloguing of workflow gaps
- Construction of new benchmarks where gaps exist, including test sets, statistical evaluation metrics and cutoffs, and benchmark quality assessment strategies based on recognized criteria (an illustrative metric sketch follows this list)
- A white paper (or academic paper) documenting findings, recommendations, and best practices
- A long‑term maintenance and evolution plan for the constructed benchmarks
- A final report for the Pistoia Alliance AI community
- Publication of any new benchmark code and data on GitHub
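As one illustration of what "statistical evaluation metrics and cutoffs" could mean in practice, the minimal sketch below computes benchmark accuracy with a 95% bootstrap confidence interval and checks the interval's lower bound against a pass/fail cutoff. The metric, the per-case scores, and the 0.80 threshold are assumptions for exposition, not project recommendations.

```python
import random

def accuracy_with_ci(scores: list[int], n_boot: int = 10_000, seed: int = 0):
    """Mean accuracy with a 95% bootstrap confidence interval.

    `scores` holds 1 for each benchmark case the system answered correctly,
    0 otherwise.
    """
    rng = random.Random(seed)
    point = sum(scores) / len(scores)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_boot)
    )
    return point, means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]

# Hypothetical per-case results and an illustrative quality bar.
scores = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
point, low, high = accuracy_with_ci(scores)
CUTOFF = 0.80  # illustrative cutoff, not a recommended threshold
print(f"accuracy={point:.2f}, 95% CI [{low:.2f}, {high:.2f}], pass={low >= CUTOFF}")
```

Reporting an interval rather than a bare score makes pass/fail decisions at a cutoff less sensitive to the size of the test set.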
Why This Is Important and Why Now
AI‑assisted scientific search is rapidly becoming central to R&D, yet reliable evaluation remains difficult without high‑quality, community‑vetted benchmarks. As organizations accelerate their adoption of LLM‑based tools, the absence of shared standards has become a critical barrier. With committed support already in place (see project supporters below) and clear evidence that benchmarking gaps hinder effective tool selection and safe deployment, now is the ideal moment to establish a common framework that will guide responsible, comparable, and future‑proof use of natural language data mining technologies across the life sciences. We welcome additional sponsors and intellectual supporters.
Get Involved
To find out more or participate in this project, get in touch with Vladimir Makarov.