title: [Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules]
authors: [Rafael Gómez-Bombarelli, Jennifer N. Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, Alán Aspuru-Guzik]
year: 2018
url: pubs.acs.org/doi/10.1021…
Who?
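Researchers from Harvard, the University of Toronto, Cambridge, and Princeton. The last author is also the corresponding author of SELFIES, which suggests that this is a meaningful research direction.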
What?
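One of the earliest works to project discrete chemical molecules into a continuous space. It also laid the groundwork for follow-up tasks, such as the capabilities highlighted under Why?.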
Why?
Continuous representations of molecules allow:
- generating novel chemical structures through simple operations in the latent space, such as decoding random vectors, perturbing known chemical structures, or interpolating between molecules
- gradient-based optimization to guide the search for optimized functional compounds
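The three latent-space operations listed above can be sketched on plain vectors. This is only an illustration: the helper names (`sample_latent`, `perturb`, `interpolate`), the 4-dimensional latent size, and the noise scale are assumptions, and linear interpolation is used here as the simplest choice, not necessarily the scheme used in the paper. A trained decoder would then map each vector back to a SMILES string.

```python
import random

def sample_latent(dim, rng):
    """Draw a random latent vector from a standard normal prior."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def perturb(z, scale, rng):
    """Add small Gaussian noise to the latent vector of a known molecule."""
    return [zi + rng.gauss(0.0, scale) for zi in z]

def interpolate(z_a, z_b, steps):
    """Return `steps` evenly spaced points on the straight line from z_a to z_b."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for k in range(steps):
        t = k / (steps - 1)  # t runs from 0.0 to 1.0 inclusive
        path.append([(1 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return path

rng = random.Random(0)           # fixed seed so the sketch is reproducible
z_a = sample_latent(4, rng)      # stands in for an encoded molecule
z_b = perturb(z_a, 0.1, rng)     # a nearby point in latent space
path = interpolate(z_a, z_b, 5)  # each point would be passed to the decoder
```

Every point in `path` is a candidate to decode; in practice some decoded strings will be invalid and need to be filtered out.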
How?
One classic figure is enough.
Model architecture design
- Autoencoder framework
  - The autoencoder is trained to minimize reconstruction error.
  - To avoid decoding to invalid strings => variational autoencoder (VAE)
    - Adding noise to the encoded molecules forces the decoder to learn how to decode a wider variety of latent points and to find more robust representations.
    - RDKit is used to validate output molecules and discard invalid ones.
- Encoder: 1D CNN (input channels = vocabulary size)
  - The benefit of convolutions is explained by the presence of repetitive, translationally invariant substrings that correspond to chemical substructures, e.g., cycles and functional groups.
- Decoder: GRU
- Predictor: MLP
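Two pieces of the design above are easy to make concrete without a deep learning library: the one-hot SMILES input, whose width equals the vocabulary size and so becomes the channel dimension of the 1D CNN, and the noise injection of the VAE. The sketch below is a minimal illustration under stated assumptions: `VOCAB` is a hypothetical toy vocabulary, `max_len=12` is arbitrary, and the function names are invented for this example rather than taken from the authors' code.

```python
import math
import random

# Hypothetical toy vocabulary; the first entry is the padding character.
VOCAB = [" ", "C", "N", "O", "c", "1", "(", ")", "="]

def one_hot_smiles(smiles, vocab, max_len):
    """Encode a SMILES string as a max_len x len(vocab) one-hot matrix."""
    if len(smiles) > max_len:
        raise ValueError("SMILES string longer than max_len")
    index = {ch: i for i, ch in enumerate(vocab)}
    rows = []
    for ch in smiles.ljust(max_len, vocab[0]):  # pad on the right
        row = [0] * len(vocab)
        row[index[ch]] = 1  # raises KeyError for characters outside the vocabulary
        rows.append(row)
    return rows

def reparameterize(mean, log_var, rng):
    """Sample z = mean + sigma * eps with eps ~ N(0, 1), the VAE noise injection."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]

def kl_to_standard_normal(mean, log_var):
    """KL divergence from N(mean, sigma^2) to N(0, 1), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mean, log_var))

x = one_hot_smiles("c1ccccc1", VOCAB, max_len=12)  # benzene as an aromatic SMILES
rng = random.Random(0)
mean, log_var = [0.5, -1.0], [0.0, 0.0]  # stand-ins for encoder outputs
z = reparameterize(mean, log_var, rng)
kl = kl_to_standard_normal(mean, log_var)
```

In a full model the training loss would combine the reconstruction error of the decoder with this KL term, and the predictor would take `z` as input.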
Experiments
Visualizing the insights from Why?:
- perturbing known chemical structures
- interpolating between molecules
- guide optimization
- predictor + gradient ascent
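The last item, gradient ascent on a predictor, can be sketched as follows. This is a toy under loud assumptions: the predictor is a hypothetical concave quadratic with a known peak, standing in for a trained MLP, and the step size and step count are arbitrary. It shows only the mechanics of moving a latent vector uphill on a predicted property, not the optimization procedure reported in the paper.

```python
def predict(z, peak):
    """Toy property predictor: highest (0.0) when z equals `peak`."""
    return -sum((zi - pi) ** 2 for zi, pi in zip(z, peak))

def grad_predict(z, peak):
    """Analytic gradient of `predict` with respect to z."""
    return [-2.0 * (zi - pi) for zi, pi in zip(z, peak)]

def gradient_ascent(z0, peak, lr=0.1, steps=50):
    """Repeatedly step along the gradient to increase the predicted property."""
    z = list(z0)
    for _ in range(steps):
        g = grad_predict(z, peak)
        z = [zi + lr * gi for zi, gi in zip(z, g)]
    return z

z_start = [0.0, 0.0]  # stands in for the latent vector of a seed molecule
peak = [1.0, -2.0]    # hypothetical optimum of the toy predictor
z_opt = gradient_ascent(z_start, peak)
```

With a real model, `grad_predict` would come from automatic differentiation of the predictor, and `z_opt` would be decoded back into a molecule and checked for validity.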
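This paper paved the way for a lot of follow-up work. Later models built on 1D or 2D molecular representations mostly follow the same framework: one encoder, one decoder, and one predictor. Whether it is the VAE used in this paper or the many generative models that came afterwards (flow-based models, diffusion models, autoregressive models, and so on), they can all borrow the approach of this paper. The work is therefore truly foundational; no wonder it has more than 1.7k citations.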