Pre-Competitive Real-World Benchmarking of AI Solutions

Date Submitted: November 23, 2024
Authors: Vladimir Makarov, Pistoia Alliance, and Alexandros Karargyris, MLCommons
Idea Originators: Abhishek Pandey, Abbvie and Micah Sheller, Intel
Other Supporting Individuals/Companies: Bayer, Amgen, AstraZeneca, Roche, US FDA
Identified Funders: Abbvie
 

Problem Statement

Many AI tools and software solutions are used in drug discovery and manufacturing today, and there is evidence that the biotechnology and pharmaceutical industries are increasing their investment in AI. However, the performance of many AI solutions is unclear, which increases investment risk, hinders adoption, and slows the progress and business impact of AI.

Verifying the performance claims made by the authors of such software requires effort. Ideally, this testing would need to be done only once. In the competitive environment of the pharmaceutical industry, however, each company has to perform it individually, which duplicates effort.

Available public leaderboards address only the easiest use case, testing publicly available models on publicly available data sets, and ignore use cases with real-world business impact. Testing proprietary models against proprietary data requires strong IP protection, confidentiality, and integrity measures.

Without objective AI model performance data, the FDA and other regulatory bodies have limited ability to assess AI models in medicinal product approvals. The proposed project aims to close these gaps through a collaborative effort organized by two major non-profit organizations (Pistoia Alliance and MLCommons) to enforce neutrality while enabling commercial opportunities.

MLCommons is a community-driven organization that develops tools for the entire AI industry through benchmarks, benchmark best practices, public datasets, and measurements for AI risk and reliability. MLCommons’ MedPerf is an open platform that enables neutral benchmarking of AI on real-world datasets. It has been used in various projects [REF1, REF2, REF3, REF4].

Idea Proposal and Value Proposition

We propose the creation of a collaborative community to develop and maintain best practices for real-world benchmarking of AI tools used for a multiplicity of use cases along the drug discovery, manufacturing, and clinical research value chain. We also propose the deployment of a system for benchmarking such AI tools on private assets (models and data). The proposed vendor-neutral system should include benchmark data sets designed specifically for the named use cases, evaluation metrics, and infrastructure for secure compute and IP protection, so that AI tools can be benchmarked without disclosure of the model code or the test data sets.

  • Pharmaceutical and biotechnology firms and contract research organizations will benefit because they will be able to base AI technology investment decisions on objective evaluation of the technologies, including performance on their proprietary data sets, without disclosing the data themselves. This de-risks AI investments and provides critical information before procurement and development decisions.
  • Technology vendors will benefit by knowing where their offerings stand in comparison with other solutions, and will thus be able to plan product improvements.
  • Regulators will be able to assess AI model quality with little duplicated effort.
  • Patient advocacy groups can provide input and direction on benchmark use cases and metrics that capture patient population interests.
  • The overall quality and speed of drug discovery R&D will be improved.

Targeted Outputs

  • List of candidate AI/ML use cases in drug discovery, manufacturing, and clinical research that would benefit the most from benchmarking 
  • List of benchmarks that already exist for these use cases 
  • White paper describing the problem space and the proposed solution 
  • A community system for benchmarking, including the infrastructure for secure compute, homomorphic data transformation, or other appropriate IP protection tools; benchmark data sets for the specific use cases; and metadata management tools necessary for review of models and benchmark data sets without disclosure of any proprietary information 
  • Metrics useful for evaluation of specific use cases 
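
To illustrate the intended separation of concerns, the following is a minimal sketch in Python, under stated assumptions, of an evaluation step in which only aggregate metrics leave the environment where the private data and the model are brought together. All names here (`BenchmarkReport`, `evaluate_privately`, the toy model, and the toy data) are hypothetical and chosen for illustration; they are not part of MedPerf or any other existing platform, and the sketch does not implement secure compute or encryption, only the reporting boundary.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, Sequence

# Hypothetical type: a model is treated as an opaque function from a
# feature vector to a single predicted value (e.g. a solubility estimate).
Model = Callable[[Sequence[float]], float]


@dataclass(frozen=True)
class BenchmarkReport:
    """Aggregate metrics only: no raw records, no model internals."""
    n_samples: int
    mae: float   # mean absolute error
    rmse: float  # root mean squared error


def evaluate_privately(
    model: Model,
    features: Sequence[Sequence[float]],
    targets: Sequence[float],
) -> BenchmarkReport:
    """Score an opaque model on private data and return only summary metrics.

    In a real deployment this function would run inside a protected
    environment; here it simply shows what is, and is not, returned.
    """
    if len(features) != len(targets) or len(targets) == 0:
        raise ValueError("features and targets must be non-empty and equal in length")
    errors = [model(x) - y for x, y in zip(features, targets)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = sqrt(sum(e * e for e in errors) / n)
    return BenchmarkReport(n_samples=n, mae=mae, rmse=rmse)


# Toy stand-ins for a vendor's proprietary model and a company's private data.
def toy_vendor_model(x: Sequence[float]) -> float:
    return x[0] + x[1]


toy_private_features = [[1.0, 2.0], [0.5, 0.5], [3.0, 1.0]]
toy_private_targets = [3.0, 1.0, 5.0]

report = evaluate_privately(toy_vendor_model, toy_private_features, toy_private_targets)
print(report)
```

The design choice shown is that the report type carries summary statistics only, so neither the data owner's records nor the vendor's model code appears in what is shared with the benchmarking community. Which metrics belong in such a report would depend on the use case.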

Example Use Cases

Possible Topics for Future Benchmarking of AI models 

  • ADMET prediction 
  • Compound melting point prediction 
  • Solubility prediction 
  • pH alteration prediction 
  • Phenotypic profiling 
  • High content imaging 
  • Rare event detection 

This proposal is not limited to these ideas; it also includes actions to understand which use cases in the drug discovery industry would benefit the most from benchmarking of AI/ML tools.

Critical Success Factors

  • Financial and/or intellectual participation of:
      • Multiple pharmaceutical firms 
      • AI software vendor firms 
      • Technology companies (e.g. Intel, Google, Nvidia, Microsoft) 
      • Regulatory agencies 
      • Patient advocacy groups 

Why This Is a Good Idea / Why Now

AI solutions are already widely used in life sciences research. Over 75 AI-discovered molecules have entered clinical trials since 2015, with a CAGR of over 60% and a Phase I success rate of 80-90%, significantly higher than the historical average of 40-65% (reference: https://www.drugdiscoverytrends.com/six-signs-ai-driven-drug-discovery-trends-pharma-industry).

As a result of these early successes, investments in AI in biotech and pharma are increasing. In a Pistoia Alliance survey, 62% of respondents said they plan to invest in AI in the next two years (200 expert opinions across Europe, the Americas, and APAC; reference: https://www.drugtargetreview.com/news/153454/the-pistoia-alliance-key-findings-on-ai/).

Despite this success, the growth of the field is limited by the lack of standardized verification techniques for AI quality. Leaderboards such as those on Hugging Face require standardized public benchmarks, which for pharma R&D are not readily available and may not even be permitted by privacy considerations (see the limitations of ADMET property prediction and related use cases above). Testing private AI/ML models, or testing models on private data, requires IP protection measures that are not possible on public leaderboards. However, requiring each individual company to perform such testing and benchmarking behind in-house firewalls results in duplication of resources and effort.