This project aims to convert unstructured assay protocol descriptions into a high-quality FAIR data set, and create standards for this information.
Why is this important?
This project has the potential to:
- Revolutionize biotech R&D by standardizing research methods and improving the reproducibility of experiments.
- Increase efficiency for bench scientists: reduce assay search, planning, and set-up time, and allow them to skip experiments already reported in the public domain as having failed
- Increase efficiency for data scientists: help harmonize and merge datasets and cleanse data for analytics, informatics, ML, and AI applications
- Support precompetitive collaborations, including a growing number of data science-focused initiatives that benefit from interoperable scientific data
- Decrease the costs of internal bioassay curation efforts
- Potentially simplify the preparation of regulatory submissions
- For assay kit vendors and CROs, provide a way to market products through references to them in major public databanks such as ChEMBL and PubChem
- For scientific publishing organizations, potentially increase the quality of publications, provided that authors use the common assay annotation standard and deposit their methods in a public databank before or alongside publication; this could also reduce the workload of peer reviewers and editors
- For research funders, increase the value of the science they fund
What will the project achieve?
The project will enable:
- Costs to be shared for converting published (unstructured) biological assay descriptions into high-accuracy machine-readable FAIR data objects described by a community-defined data model tailored to address current and future essential business questions
- The data model to be FAIR and based on public ontologies such as the BioAssay Ontology
- The data model to be developed in a community-wide collaborative way and eventually promoted as the industry standard for the publication of assay metadata
- The generated FAIR data to be made available to the public after a period of exclusivity for partnering organizations
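To make the idea of a machine-readable FAIR assay description concrete, the sketch below shows what such a record might look like once annotated with ontology terms. This is a minimal illustration only: the field names, identifiers, and BioAssay Ontology term IRIs are placeholders, not the data model the project will actually define.

```python
import json

# Illustrative sketch only: the field names and BioAssay Ontology (BAO) term IRIs
# below are placeholders, not the community-defined data model this project will produce.
assay_record = {
    "assay_id": "EXAMPLE-0001",                      # hypothetical identifier
    "title": "Biochemical kinase inhibition assay",  # free-text protocol title
    "annotations": [
        # Each annotation pairs a human-readable label with an ontology term IRI,
        # which is what makes the record machine-readable and interoperable (FAIR).
        {"property": "assay format",
         "label": "biochemical format",
         "term_iri": "http://www.bioassayontology.org/bao#BAO_EXAMPLE_1"},  # placeholder IRI
        {"property": "detection technology",
         "label": "fluorescence intensity",
         "term_iri": "http://www.bioassayontology.org/bao#BAO_EXAMPLE_2"},  # placeholder IRI
    ],
    "source_publication": "doi:10.0000/example",     # placeholder reference
}

# Serialize to JSON so the record can be exchanged, indexed, and queried.
print(json.dumps(assay_record, indent=2))
```

Because every annotation resolves to a public ontology term rather than free text, records produced by different organizations remain comparable and can be merged without manual reconciliation.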
How will the project do this?
The project will address a number of common issues:
- There are currently >1.4 million biological assay protocols contained in research publications.
- The biological assay is a popular data type for post-hoc data mining, yet most of these published data and metadata are not in a form suitable for automated mining. They are partially annotated in databases such as ChEMBL and PubChem, but the volume, depth, and quality of these annotations are inadequate for addressing many current and future business questions (see the sketch after this list).
- Significant labor efforts (estimated between 4 and 12 weeks per assay) are spent in research organizations to select, set up, and validate biological assays. Some of these efforts fail completely and lead to waste that could have been avoided had the assay selection and set-up processes been more efficient.
- Manual curation of bioassay data and metadata is feasible only for smaller datasets and systems, while fully automated curation via NLP and auto-classification is not yet accurate enough.
- About half of the organizations surveyed by the Pistoia Alliance in 2019 already engage in the conversion of unstructured assay protocols into machine-readable form. This is a high-cost process that also leads to duplication of effort across organizations.
- Every year new assay protocols are developed and published. At the same time, some assay protocols become obsolete, either because of technology development or because the organizations that created and maintained them go out of business. These factors contribute to difficulties with the interpretation and reproducibility of historically performed assays.
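The sketch below illustrates the mining point raised above: once assay metadata carry ontology term IRIs, a question such as "find all assays of a given format" reduces to a simple filter, whereas the same query over unstructured protocol text requires error-prone text mining. The records and term IRIs are placeholders carried over from the earlier example, not real database content.

```python
# Minimal sketch, assuming records shaped like the earlier example; all IDs and
# term IRIs are illustrative placeholders, not real ChEMBL/PubChem annotations.
records = [
    {"assay_id": "EXAMPLE-0001",
     "annotations": [{"property": "assay format", "term_iri": "bao:BAO_EXAMPLE_1"}]},
    {"assay_id": "EXAMPLE-0002",
     "annotations": [{"property": "assay format", "term_iri": "bao:BAO_EXAMPLE_2"}]},
]

def assays_with_term(records, term_iri):
    """Return the IDs of assay records annotated with the given ontology term."""
    return [r["assay_id"]
            for r in records
            if any(a["term_iri"] == term_iri for a in r["annotations"])]

# Structured metadata turn a mining question into a one-line filter.
print(assays_with_term(records, "bao:BAO_EXAMPLE_2"))  # -> ['EXAMPLE-0002']
```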