title: Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules
authors: Rafael Gómez-Bombarelli, Jennifer N. Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, Alán Aspuru-Guzik
year: 2018
url: pubs.acs.org/doi/10.1021…

Who?

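Researchers from Harvard, the University of Toronto, Cambridge, and Princeton. The last author is also the corresponding author of SELFIES, which suggests that this is a meaningful research direction.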

What?

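One of the earliest works to map discrete chemical molecules into a continuous space. It also laid the groundwork for follow-up tasks, such as the points highlighted under Why?.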

Why?

Continuous representations of molecules allow:

- generating novel chemical structures through simple operations in the latent space, such as decoding random vectors, perturbing known chemical structures, or interpolating between molecules
- gradient-based optimization to guide the search for optimized functional compounds
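
The three latent-space operations listed above can be sketched in a few lines. This is a minimal illustration in plain Python, not the paper's code: the function names (`sample_prior`, `perturb`, `slerp`) are hypothetical, latent vectors are plain lists of floats, and spherical interpolation is used here as one common choice for Gaussian latent spaces. Each resulting vector would then be passed to a trained decoder, which is not shown.

```python
import math
import random


def sample_prior(dim, rng):
    """Draw a random latent vector from a standard normal prior."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def perturb(z, scale, rng):
    """Add small Gaussian noise to the latent vector of a known molecule."""
    return [zi + rng.gauss(0.0, scale) for zi in z]


def slerp(z0, z1, t):
    """Spherical interpolation between two non-zero latent vectors, t in [0, 1]."""
    dot = sum(a * b for a, b in zip(z0, z1))
    n0 = math.sqrt(sum(a * a for a in z0))
    n1 = math.sqrt(sum(b * b for b in z1))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_omega = max(-1.0, min(1.0, dot / (n0 * n1)))
    omega = math.acos(cos_omega)
    s = math.sin(omega)
    if s < 1e-8:
        # Vectors are (anti)parallel: fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(z0, z1)]
    w0 = math.sin((1 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(z0, z1)]


rng = random.Random(0)
z_known = sample_prior(4, rng)            # stand-in for an encoded molecule
z_near = perturb(z_known, 0.1, rng)       # a nearby point to decode
z_other = sample_prior(4, rng)            # stand-in for a second molecule
path = [slerp(z_known, z_other, k / 5) for k in range(6)]
```

The `path` list holds six latent points running from the first vector to the second; decoding each one is what produces the interpolation figures discussed in the experiments.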

How?

The classic figure from the paper says it all.

Model architecture design

Autoencoder Framework
- The autoencoder is trained to minimize reconstruction error.
- To reduce the number of invalid decoded strings, a variational autoencoder (VAE) is used:
  - Adding noise to the encoded molecules forces the decoder to learn how to decode a wider variety of latent points and to find more robust representations.
  - RDKit is used to validate the output molecules and discard invalid ones.
- Encoder: 1D CNN (vocab size = channel size)
  - The choice of convolutions is explained by the presence of repetitive, translationally invariant substrings that correspond to chemical substructures, e.g., cycles and functional groups.
- Decoder: GRU
- Predictor: MLP
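
Two pieces of this design are easy to make concrete: the one-hot input whose channel dimension equals the vocabulary size, and the noise that turns a plain autoencoder into a VAE. The sketch below is an illustration under assumptions, not the paper's implementation: it tokenizes SMILES character by character (so multi-character atoms such as `Cl` are split), pads with a space, and the helper names `build_vocab`, `one_hot`, and `reparameterize` are hypothetical. The CNN, GRU, and MLP themselves are not shown.

```python
import math
import random


def build_vocab(smiles_list, pad=" "):
    """Map every character seen in the data (plus padding) to an index."""
    chars = sorted({c for s in smiles_list for c in s} | {pad})
    return {c: i for i, c in enumerate(chars)}


def one_hot(smiles, vocab, max_len, pad=" "):
    """Encode a string as a max_len x vocab_size matrix of 0/1 floats.

    Each row is one string position; each column is one vocabulary
    character, which is what a 1D CNN would treat as an input channel.
    """
    padded = smiles.ljust(max_len, pad)[:max_len]
    rows = []
    for c in padded:
        row = [0.0] * len(vocab)
        row[vocab[c]] = 1.0
        rows.append(row)
    return rows


def reparameterize(mean, log_var, rng):
    """Sample z = mean + sigma * eps with eps ~ N(0, 1), per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


vocab = build_vocab(["CCO", "c1ccccc1"])
x = one_hot("CCO", vocab, max_len=8)      # 8 positions, len(vocab) channels

rng = random.Random(0)
z = reparameterize([0.0, 1.0], [0.0, 0.0], rng)   # noisy latent point
```

Because the decoder sees `z` rather than the exact mean during training, it has to reconstruct the molecule from a neighborhood of latent points, which is the robustness argument made in the list above.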

Experiments

Visualizing the insights from Why?:

perturbing known chemical structures

interpolating between molecules

guide optimization
- predictor: gradient ascent in the latent space
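
Gradient ascent on a property predictor can be illustrated with a toy example. The sketch below assumes a hypothetical quadratic predictor with a known optimum and a hand-written gradient; in a real model the predictor would be the trained network and its gradient would come from automatic differentiation. The names `toy_predictor`, `toy_gradient`, and `gradient_ascent` are illustrative only.

```python
def toy_predictor(z, optimum):
    """Hypothetical property score: highest (zero) exactly at `optimum`."""
    return -sum((zi - oi) ** 2 for zi, oi in zip(z, optimum))


def toy_gradient(z, optimum):
    """Analytic gradient of toy_predictor with respect to z."""
    return [-2.0 * (zi - oi) for zi, oi in zip(z, optimum)]


def gradient_ascent(z0, grad_fn, step_size=0.1, n_steps=50):
    """Move a latent point uphill along the predictor's gradient.

    Returns every visited point so the path can be decoded and inspected.
    """
    z = list(z0)
    path = [list(z)]
    for _ in range(n_steps):
        g = grad_fn(z)
        z = [zi + step_size * gi for zi, gi in zip(z, g)]
        path.append(list(z))
    return path


optimum = [1.0, -0.5, 2.0]                 # assumed best latent point
start = [0.0, 0.0, 0.0]                    # stand-in for an encoded molecule
path = gradient_ascent(start, lambda z: toy_gradient(z, optimum))
scores = [toy_predictor(z, optimum) for z in path]
```

Each point in `path` would be passed to the decoder, so the optimization yields a sequence of candidate molecules whose predicted property improves step by step.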

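This paper opened the way for much follow-up work. Later models built on 1D or 2D molecular representations mostly follow its framework: an encoder, a decoder, and a predictor. Whether it is the VAE used here or the generative models that came later (flow-based models, diffusion models, autoregressive models, and so on), the same line of thinking applies. Its foundational role is substantial, so it is no surprise that it has more than 1.7k citations.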