A full-length video recording of the webinar is available here.
We are now asking Dr. Ola Engqvist and Professor Alex Tropsha to answer these questions posted during and immediately after the webinar.
- How is stereochemistry handled in the SMILES analysis?
Stereochemistry is handled like any other token: if the structures in the training set contain stereochemistry information, molecules with stereocenters will be generated.
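To make "like any other token" concrete, here is a minimal sketch of a regex-based SMILES tokenizer. The pattern and the `tokenize` helper are illustrative assumptions, not the tokenizer used by the speakers. It only shows that chirality marks (`@`, `@@` inside bracket atoms) and double-bond geometry marks (`/`, `\`) end up as ordinary vocabulary entries.

```python
import re

# Hypothetical token pattern (an assumption, not the speakers' tokenizer).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atom, e.g. [C@@H] or [NH3+], kept as one token
    r"|Br|Cl"                # two-letter organic-subset atoms
    r"|[BCNOSPFI]"           # one-letter aliphatic atoms
    r"|[bcnosp]"             # aromatic atoms
    r"|%\d{2}"               # two-digit ring closures
    r"|\d"                   # one-digit ring closures
    r"|[()=#\-+\\/.:~@*$]"   # branches, bonds, and the / \ geometry marks
)


def tokenize(smiles):
    """Split a SMILES string into tokens; fail if any character is not covered."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in {smiles!r}")
    return tokens


if __name__ == "__main__":
    # A chiral centre and an E/Z double bond, written by hand for illustration.
    for smi in ["N[C@@H](C)C(=O)O", "F/C=C/F"]:
        print(smi, "->", tokenize(smi))
```

With such a tokenizer, a model trained on strings that contain `[C@@H]` or `/` can emit those tokens; a model trained on strings without them never sees them and so cannot.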
- How do you design correct structures? Doesn’t it create erroneous structures?
- What percentage of generated SMILES would be valid?
See our analysis of GDB-13: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0341-z There is a trade-off between generating novel, diverse structures and generating correct ones. To obtain maximal diversity, it appears that one needs to accept a small fraction of erroneous structures.
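The trade-off above is usually tracked with three simple ratios: valid, unique, and novel. The sketch below computes them for a list of generated strings. The function names are hypothetical, and `looks_well_formed` is only a crude syntactic stand-in (balanced parentheses and paired ring-closure digits), not a chemical validity check. In practice one would pass a real parser as `is_valid`, for example a wrapper that checks whether RDKit's `Chem.MolFromSmiles` returns `None`, and would canonicalize the strings before counting unique ones.

```python
def looks_well_formed(smiles):
    """Crude syntax check (assumption: a stand-in for a real SMILES parser).

    Checks only balanced parentheses, closed brackets, and paired
    single-digit ring closures. It does not check valence or aromaticity.
    """
    if not smiles:
        return False
    depth = 0
    open_rings = set()
    in_bracket = False
    for ch in smiles:
        if in_bracket:
            if ch == "]":
                in_bracket = False
            continue
        if ch == "[":
            in_bracket = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            open_rings ^= {ch}  # toggle: first occurrence opens, second closes
    return depth == 0 and not open_rings and not in_bracket


def generation_metrics(generated, training_set, is_valid=looks_well_formed):
    """Return the fractions of valid, unique (among valid), and novel (among unique) strings."""
    if not generated:
        raise ValueError("no generated strings to evaluate")
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "valid": len(valid) / len(generated),
        "unique": len(unique) / len(valid) if valid else 0.0,
        "novel": len(novel) / len(unique) if unique else 0.0,
    }


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    sampled = ["CCO", "CCO", "c1ccccc1", "C1CC", "CC(C"]
    print(generation_metrics(sampled, training_set=["CCO"]))
```

Reporting all three numbers together makes the trade-off visible: pushing the valid fraction toward 100% is only useful if the unique and novel fractions do not collapse.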
- Hi all, fascinating stuff! There is a certain beauty in using SMILES linguistically, as well as making it more accessible – how sensitive is it to SMILES flavor/quality?
In our experience, molecular generation is rather insensitive to the type of SMILES used, although some differences can be seen. A more detailed analysis is yet to be published.
- How do you assess uncertainty?
In general, for QSAR modelling we rely on conformal and Venn-ABERS predictions, an area in which AstraZeneca has also published extensively.
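For readers new to conformal prediction, the following is a minimal sketch of the split (inductive) variant for regression. It is a generic textbook construction, not AstraZeneca's implementation, and it does not cover Venn-ABERS. The function names and the toy numbers are assumptions made for illustration.

```python
import math


def conformal_quantile(residuals, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest absolute residual.

    Returns infinity when the calibration set is too small for the
    requested confidence level.
    """
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(residuals)[k - 1]


def prediction_interval(y_cal, pred_cal, pred_new, alpha=0.1):
    """Symmetric interval around pred_new, calibrated on held-out (y, prediction) pairs."""
    residuals = [abs(y - p) for y, p in zip(y_cal, pred_cal)]
    q = conformal_quantile(residuals, alpha)
    return pred_new - q, pred_new + q


if __name__ == "__main__":
    # Toy calibration data (invented): measured vs. predicted activities.
    measured = [1.0, 2.0, 3.0]
    predicted = [1.5, 2.25, 2.0]
    print(prediction_interval(measured, predicted, pred_new=4.0, alpha=0.5))
```

Under the usual exchangeability assumption, the interval covers the true value with probability at least 1 - alpha, whatever the underlying QSAR model is. A wide interval then signals a prediction that should not be trusted much.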
- Are SMILES stable with respect to small topology changes in the molecule? That looks like it might be important for the RNNs in the GAN?
A small change in a structure might result in a large change in the SMILES string. Whether this is an issue in practice still needs to be investigated. This potential issue is one reason why there has been a lot of research on using graphs instead of strings for molecular generation.
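One rough way to probe this question is to compare the character-level edit distance between the SMILES strings of two closely related molecules. The sketch below uses a plain Levenshtein distance and a hand-written pair, butylbenzene and tetralin, which differ by a single ring-closing bond. The choice of metric and of example pair is an assumption for illustration. Distances also depend on how each string is written, so a real study would canonicalize or enumerate the SMILES first.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    butylbenzene = "CCCCc1ccccc1"
    tetralin = "C1CCc2ccccc2C1"  # same atoms, one extra ring-closing bond
    print(levenshtein(butylbenzene, tetralin))
```

A graph representation would record this change as one added edge, which is part of the motivation for graph-based generators.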
- What Machine Learning tool do you use to develop reinforcement learning models?
PyTorch and scikit-learn
- How many of the molecules have moved into clinical candidate and what was the timeline?
It is still far too early to assess how much novel methods such as deep-learning-based de novo molecular generation and synthesis prediction can affect the timelines.
- If you generate 1000 molecules from a start molecule, how many of them give you IC50 < 10 µM in HTS?
Anywhere from 0 to 1000, depending on how good your affinity prediction model is for the target.
- Should we say bye bye to library enumeration?
No, but the use of library enumeration might change over time. The starting point for a library enumeration will be a molecule identified through deep-learning-based de novo molecular generation. Libraries are still a cost-efficient way to explore the chemical space around an active molecule.