Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules



title: [Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules]
authors: [Rafael Gómez-Bombarelli, Jennifer N. Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, Alán Aspuru-Guzik]
year: 2018
url: pubs.acs.org/doi/10.1021…



Who?

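Researchers from Harvard, the University of Toronto, Cambridge, and Princeton, among others. The last author is also the corresponding author of SELFIES, which suggests that this is a meaningful research direction.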

What?

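One of the earliest works to project discrete chemical molecules into a continuous space. It also laid the groundwork for later tasks, such as the points highlighted under Why?.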

Why?

Continuous representations of molecules allow:

  • generating novel chemical structures through simple operations in the latent space, such as decoding random vectors, perturbing known chemical structures, or interpolating between molecules
  • gradient-based optimization to guide the search for optimized functional compounds
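
The latent-space operations listed above can be sketched on plain vectors. This is only a minimal illustration, not the paper's code: the function names, the noise scale, and the 3-dimensional example vectors are all hypothetical, and straight-line interpolation is used here as the simplest choice. In a real pipeline each resulting vector would be passed to the trained decoder to obtain a SMILES string.

```python
import random


def random_latent(dim, rng=None):
    """Draw a random latent vector from a standard normal prior."""
    rng = rng or random.Random()
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def perturb(z, scale=0.1, rng=None):
    """Return a copy of z with Gaussian noise added to every coordinate."""
    rng = rng or random.Random()
    return [zi + rng.gauss(0.0, scale) for zi in z]


def interpolate(z_a, z_b, steps=5):
    """Return `steps` evenly spaced points on the line from z_a to z_b (inclusive)."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    points = []
    for k in range(steps):
        t = k / (steps - 1)
        points.append([(1 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return points


# Hypothetical latent codes of two known molecules (made-up numbers).
z_a = [0.0, 1.0, -1.0]
z_b = [1.0, 0.0, 1.0]

path = interpolate(z_a, z_b, steps=5)               # candidates "between" the molecules
neighbor = perturb(z_a, scale=0.1, rng=random.Random(0))  # a nearby candidate
novel = random_latent(3, rng=random.Random(1))      # a fully random candidate
```

Each of `path`, `neighbor`, and `novel` is only a point in latent space; whether it decodes to a valid molecule depends on the decoder.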

How?

One classic figure is enough.

image.png

Model architecture design

  • Autoencoder framework
    • The autoencoder is trained to minimize reconstruction error
    • To reduce invalid output strings => variational autoencoder (VAE)
      • Adding noise to the encoded molecules forces the decoder to learn how to decode a wider variety of latent points and to find more robust representations.
      • RDKit is used to validate output molecules, and invalid ones are discarded
    • Encoder: 1D CNN (vocab size = channel size)
      • The paper reports better performance with a convolutional encoder. This is explained by the presence of repetitive, translationally invariant substrings that correspond to chemical substructures, e.g., cycles and functional groups.
    • Decoder: GRU
    • Predictor: MLP
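
To make "vocab size = channel size" and the noise step concrete, here is a small sketch under stated assumptions. It one-hot encodes a SMILES string character by character, so each position has one channel per vocabulary symbol, and it draws a noisy latent sample using the standard VAE reparameterization. The helper names, the padding symbol, and the character-level tokenization are illustrative choices, not details taken from the paper.

```python
import math
import random


def build_vocab(smiles_list, pad=" "):
    """Map each character seen in the SMILES strings to an index; padding is index 0."""
    chars = sorted(set("".join(smiles_list)) - {pad})
    return {ch: i for i, ch in enumerate([pad] + chars)}


def one_hot(smiles, vocab, max_len, pad=" "):
    """Encode a SMILES string as a max_len x len(vocab) matrix of 0/1 values.

    Simplification: tokens are single characters, so multi-character
    atoms such as "Cl" or "Br" would be split into two symbols.
    """
    if len(smiles) > max_len:
        raise ValueError("SMILES longer than max_len")
    matrix = []
    for ch in smiles.ljust(max_len, pad):
        row = [0] * len(vocab)
        row[vocab[ch]] = 1
        matrix.append(row)
    return matrix


def sample_latent(mean, log_var, rng=None):
    """Reparameterized sample: z = mean + sigma * eps, with eps ~ N(0, 1)."""
    rng = rng or random.Random()
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


smiles = ["CCO", "c1ccccc1"]           # ethanol and benzene
vocab = build_vocab(smiles)            # 5 symbols: padding, '1', 'C', 'O', 'c'
x = one_hot("CCO", vocab, max_len=8)   # 8 positions, 5 channels each

# Hypothetical encoder outputs for a 2-dimensional latent space.
z = sample_latent([0.0, 1.0], [0.0, 0.0], rng=random.Random(0))
```

A 1D convolution over `x` would slide along the 8 positions and treat the 5 vocabulary entries as input channels, which is how a 1D CNN encoder typically consumes one-hot text.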

Experiments

Visualizing the insights from Why?:

  • perturbing known chemical structures

image.png

  • interpolating between molecules

image.png

  • guiding optimization
    • predictor + gradient ascent
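
The "predictor + gradient ascent" idea can be illustrated with a toy example. The predictor below is a hypothetical stand-in (a simple quadratic with a known maximum), not the paper's trained model; the learning rate, step count, and target vector are likewise made up for the sketch.

```python
def toy_predictor(z, target):
    """Hypothetical property predictor: largest (0.0) when z equals target."""
    return -sum((zi - ti) ** 2 for zi, ti in zip(z, target))


def toy_predictor_grad(z, target):
    """Analytic gradient of toy_predictor with respect to z."""
    return [-2.0 * (zi - ti) for zi, ti in zip(z, target)]


def gradient_ascent(z0, target, lr=0.1, steps=100):
    """Move z uphill on the predicted property, starting from z0."""
    z = list(z0)
    for _ in range(steps):
        grad = toy_predictor_grad(z, target)
        z = [zi + lr * gi for zi, gi in zip(z, grad)]
    return z


target = [1.0, -2.0]    # made-up latent point with the best predicted property
z_start = [0.0, 0.0]    # made-up latent code of a starting molecule
z_opt = gradient_ascent(z_start, target)
```

With a learned predictor, the gradient would instead come from automatic differentiation through the network, and the optimized latent vector would still need to be decoded and checked for validity.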

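This paper opened the way for a lot of follow-up work. Later methods based on 1D or 2D molecular representations largely follow the same framework: an encoder, a decoder, and a predictor. Whether it is the VAE used here or the many generative models that came later (flow-based models, diffusion models, autoregressive models, and so on), all of them can borrow this paper's approach. Its foundational role is substantial, so it is no surprise that it has more than 1.7k citations.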