Synthetic Data in Medical Imaging

Date Submitted: 10 July 2024
 
Person Creating Document: Angeli Möller
 
Idea Originator: Sadegh Mohammadi, Bayer AG
 
 

Problem Statement

Artificial Intelligence (AI) models have shown impressive capabilities in medical image analysis, with the potential to assist clinicians in their everyday tasks, from diagnosis to disease monitoring and surgical planning, thereby enhancing patient care. While the clinical adoption of these AI-based tools has increased, their development is still limited by the need for large-scale medical image datasets, which are a hard requirement for training performant AI systems but are often not available in this domain. Medical imaging data is often limited to a few of the more common conditions, while rare diseases and minority patient populations are left underrepresented, limiting the general efficacy of these tools. In addition, in contrast to other domains like natural imaging, where images can be captured with one’s phone, medical image data is usually captured with expensive medical devices and through complex medical procedures. Moreover, the associated image labels for supervised AI training must be created by domain experts, which leads to increased acquisition costs. Finally, privacy concerns also limit the access to and use of medical data for AI training, thus constraining the development of these tools.
 
Recent research suggests that synthetic data could help overcome these challenges by filling in the gaps where real data is scarce, of inconsistent quality or difficult to acquire, while also preserving privacy [1, 2]. This is especially relevant in the case of rare diseases, where diagnosis is performed by identifying altered phenotypes through pathology or radiology imaging, but these diseases often go undetected by medical AI due to the lack of relevant data for developing such models. While synthetic data augmentation is regularly applied in natural language processing for training large language models [3], this approach has not been fully applied to image analysis, except for some vision tasks in areas like face recognition, aerospace and maritime defence [4-6]. Thus, despite its potential, synthetic image data adoption still lags in the medical industry.
 
Several factors contribute to this lag. On the one hand, the lack of data hinders development, and several initiatives are being put in place to create more centralized data repositories for generative model development [7]. On the other hand, and most importantly, there is no standardised regulatory approach for validating synthetic data and the AI models trained on it, which causes uncertainty in the development of these technologies.

 

Idea Proposal and Value Proposition

The lack of a standardised regulatory approach for synthetic data causes uncertainty in the development of these technologies. They are currently being developed primarily by private corporations or non-profit organizations, given their development and deployment costs. However, regulators have taken no clear stance on the validation of the synthetic data itself or on any downstream usage of the generated data. This climate of uncertainty is already slowing down AI development, and has caused companies like Meta to limit the deployment of AI tools in some regions [8].
 
It is therefore imperative to initiate an industry-wide conversation on the adoption of synthetic medical image data, focusing particularly on the ethical and regulatory aspects of its validation and use. We propose to form a community of practice within the Pistoia Alliance framework to work with regulators to generate an acceptable validation pathway for algorithms trained on synthetic medical images as well as for synthetic data itself. Given the potential that synthetic data has for enhancing the performance of medical AI tools, we believe this to be a crucial step that will reduce uncertainty and foster its wider adoption, consequently contributing to improved quality of patient care.
 
By fostering an industry-wide agreement on how to tackle the regulatory concerns related to the use of synthetic medical image data, we would take an effective step towards the widespread use of this technology. Researchers and AI practitioners would benefit from synthetic data generation to fill in the gaps of their real datasets and boost the development of AI tools that perform well even for underrepresented patient populations. Beyond patients with rarer conditions, synthetic data could also help provide the larger amount of training data that is required to train effective AI tools across all population subsets, thus improving outcomes for all patients. AI research in this field could also be expected to accelerate, as synthetic data is a valuable alternative to real data, especially at an early development stage, when data acquisition can delay progress due to scarcity, privacy constraints and associated costs. There is also potential for synthetic control groups to be generated from historical records to minimise animal sacrifice in pre-clinical studies and, potentially, the number of placebo patients in clinical studies.
 

Targeted Outputs

We identify the following outputs for this project:

  • Identification of, and agreement on, common concerns, needs and perceived regulatory gaps when developing medical generative AI models and using the synthesised data for downstream clinical tasks.
  • Development of industry standards to propose to regulatory authorities, with the objective of minimizing clinical risks and improving outcomes.
  • The trust and certainty required for industry adoption and for the further development of synthetic data initiatives.
  • Improved knowledge transfer protocols for transparency.
  • Raised awareness within the medical community of the usefulness of synthetic data.
  • Improved clinical outcomes for patients from traditionally underrepresented populations (e.g. rare diseases and minority ethnicities).

 

Example Use Case(s)
  • Focused development of medically tailored GenAI models that adhere to objective standards, ensuring the quality of the research output and minimizing clinical and development risks.
  • Regulatory coverage for decision support systems that enable counterfactual reasoning about a diagnosis.

 

Critical Success Factors

It is imperative to foster a solid collaboration among MedTech industrial partners as well as other companies and public entities active in this space. Involvement of regulatory entities in this collaboration would also be of utmost importance, as the definition of the validation pathways must be centred around their concerns. The development of industry standards would bridge the gap between academic work and industrial support, while providing a solid basis for regulators to address public concerns.

 

Why This Is a Good Idea / Why Now

The current regulatory landscape is ill-defined, causing uncertainty in the development of clinical AI models. Moreover, these technologies are currently being developed primarily by private corporations or non-profit organizations, given their development and deployment costs. This climate of uncertainty is already slowing down development and has caused large AI companies to limit the deployment of AI tools in some regions. It is therefore imperative to form a community of practice to work with regulators to generate an acceptable validation pathway for algorithms trained on synthetic medical images and for synthetic data itself, in order to reduce uncertainty and foster research.

 

Other Relevant Information

Bayer is on board as a sponsor, and NVIDIA is keen to support building the synthetic medical imaging community.

 

References:
[1] Man, K.; Chahl, J. A Review of Synthetic Image Data and Its Use in Computer Vision. J. Imaging 2022, 8, 310. https://doi.org/10.3390/jimaging8110310

[2] Jordon, J., Szpruch, L., Houssiau, F., Bottarelli, M., Cherubin, G., Maple, C., … & Weller, A. (2022). Synthetic Data–what, why and how?. arXiv preprint arXiv:2205.03257.

[3] Puri, Raul, et al. “Training question answering models from synthetic data.” arXiv preprint arXiv:2002.09599 (2020).

[4] Improving the realism of synthetic images. Apple Machine Learning Research. (n.d.). https://machinelearning.apple.com/research/gan

[5] Rüter, J., Maienschein, T., Schirmer, S., Schopferer, S., & Torens, C. (2024). Filling the Gaps: Using Synthetic Low-Altitude Aerial Images to Increase Operational Design Domain Coverage. Sensors, 24(4), 1144.

[6] Benchmarking Success: Synthetic Data Matches and Surpasses Real Data in AI Training. DMI. https://datamachineintelligence.eu/post/benchmarking-success-synthetic-data-matches-and-surpasses-real-data-in-ai-training/

[7] Regulation (EU) 2023/2854 of the European Parliament and of the Council of 13 December 2023 on harmonised rules on fair access to and use of data and amending Regulation (EU) 2017/2394 and Directive (EU) 2020/1828 (Data Act). Council of European Union, 2023. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32023R2854

[8] Meta stops EU roll-out of AI model due to regulatory concerns. Euronews, 18 July 2024. https://www.euronews.com/next/2024/07/18/meta-stops-eu-roll-out-of-ai-model-due-to-regulatory-concerns