LaDI-VTON | Diffusion-Based 2D Virtual Try-On

LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On



Abstract

  1. A latent diffusion model (LDM) extended with a novel additional autoencoder module that exploits learnable skip connections to enhance the generation process while preserving the model's (the person's) characteristics.
  2. A textual inversion component that maps the visual features of the garment to the CLIP token embedding space and thus generates a set of pseudo-word token embeddings capable of conditioning the generation process, so that the garment's texture and details are preserved.

Contributions

  1. First work to apply latent diffusion models (LDMs) to virtual try-on
  2. An autoencoder with skip connections preserves details outside the inpainted region
  3. A forward (feed-forward) textual-inversion module further preserves the input garment's texture during generation
  4. State-of-the-art results on the Dress Code and VITON-HD benchmarks

Related Work

  • Image-Based Virtual Try-On
    • VITON: coarse-to-fine pipeline, TPS warping, encoder-decoder architecture
    • Later works: learnable TPS, GAN-based, and diffusion-model-based approaches
  • Diffusion Models
    • text-to-image synthesis
    • image-to-image translation
    • image editing
    • inpainting
    • The task most closely related to virtual try-on is human image generation
  • Textual Inversion
    • Textual inversion is a recent technique proposed to learn a pseudo word in the embedding space of the text encoder starting from visual concepts.

Methodology

LaDI-VTON builds on the Stable Diffusion architecture. To adapt the text-to-image model to virtual try-on, the network is modified to take the garment and the model's pose as inputs; a forward textual inversion is proposed to preserve garment details; finally, masked skip connections enhance SD's image-reconstruction autoencoder, improving generation quality and better preserving the fine-grained details of the person image.

  • Stable Diffusion: text-to-image generation model
  • CLIP: vision-language model that aligns image and text features in a shared embedding space

Overview

(architecture overview figure)

Textual Inversion

  • Textual inversion: given an input image, predict pseudo-words in the CLIP token embedding space

  • q: the text prompt, mapped by the CLIP text encoder into the embedding space to obtain true-word embeddings

  • VE (visual encoder): OpenCLIP ViT-H/14 model pre-trained on LAION-2B

  • Adapter: a single ViT (transformer) layer + a 3-layer MLP with GELU activations and dropout (see the sketch below)
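A minimal PyTorch sketch of such an adapter, assuming illustrative layer sizes and a hypothetical `TextualInversionAdapter` name; the frozen OpenCLIP ViT-H/14 encoder supplies the patch features, and the predicted pseudo-word embeddings are later appended to the true-word embeddings of the prompt:

```python
import torch
import torch.nn as nn

class TextualInversionAdapter(nn.Module):
    """Maps garment visual features to pseudo-word token embeddings in the
    CLIP text-token space (sketch; sizes are illustrative assumptions)."""
    def __init__(self, visual_dim=1024, token_dim=1024,
                 num_pseudo_tokens=16, dropout=0.1):
        super().__init__()
        # One ViT-style transformer encoder layer over the patch features
        self.vit_layer = nn.TransformerEncoderLayer(
            d_model=visual_dim, nhead=8, dim_feedforward=4 * visual_dim,
            dropout=dropout, activation="gelu", batch_first=True)
        # 3-layer MLP with GELU and dropout, projecting to N pseudo-tokens
        self.mlp = nn.Sequential(
            nn.Linear(visual_dim, visual_dim), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(visual_dim, visual_dim), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(visual_dim, num_pseudo_tokens * token_dim))
        self.num_pseudo_tokens = num_pseudo_tokens
        self.token_dim = token_dim

    def forward(self, patch_features):
        # patch_features: (B, num_patches, visual_dim) from the frozen
        # OpenCLIP ViT-H/14 image encoder
        x = self.vit_layer(patch_features)
        x = x.mean(dim=1)  # pool over patches
        x = self.mlp(x)
        # (B, num_pseudo_tokens, token_dim): pseudo-word embeddings that
        # condition the diffusion model together with the true-word tokens
        return x.view(-1, self.num_pseudo_tokens, self.token_dim)
```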

Diffusion Virtual Try-On Model

  • Uses Stable Diffusion's inpainting pipeline
  • Conditioning inputs: the textual-inverted information Ŷ of the in-shop garment, the pose map P, and the garment fitted to the model's body shape C_W (the warped garment); see the sketch below
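A rough sketch of how the spatial conditionings could be stacked for the inpainting denoising UNet. The exact channel layout and the helper name `build_unet_input` are assumptions for illustration; the textual-inverted embeddings Ŷ enter through cross-attention rather than this concatenation:

```python
import torch

def build_unet_input(z_t, mask, masked_person_latent, pose_map,
                     warped_cloth_latent):
    """Channel-wise concatenation of the spatial conditioning signals (sketch).

    z_t                  : (B, 4, h, w)   noisy latent at timestep t
    mask                 : (B, 1, h, w)   inpainting mask at latent resolution
    masked_person_latent : (B, 4, h, w)   VAE-encoded masked person image
    pose_map             : (B, C_p, h, w) pose map resized to latent resolution
    warped_cloth_latent  : (B, 4, h, w)   VAE-encoded warped garment C_W
    """
    return torch.cat(
        [z_t, mask, masked_person_latent, pose_map, warped_cloth_latent], dim=1)

# The pseudo-word embeddings Ŷ (appended to the prompt's true-word tokens)
# condition the UNet via cross-attention, e.g.:
#   noise_pred = unet(build_unet_input(...), t, encoder_hidden_states=text_embeds)
```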


Enhanced Mask-Aware Skip Connections

  • EMASC
    • Applied to the masked image (the mask comes from the inpainting mask)
    • Goal: learn to propagate the relevant information between corresponding encoder and decoder layers (see the sketch below)
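A minimal sketch of one mask-aware skip connection, assuming a small convolutional projection gated by the inpainting mask; the `EMASCBlock` name and layer choices are illustrative, not the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EMASCBlock(nn.Module):
    """One mask-aware skip connection: a small learned module that carries
    encoder features of the unmasked region to the matching decoder layer."""
    def __init__(self, enc_channels, dec_channels):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(enc_channels, dec_channels, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(dec_channels, dec_channels, kernel_size=3, padding=1))

    def forward(self, enc_feat, dec_feat, inpaint_mask):
        # Resize the inpainting mask to the feature resolution; keep only the
        # region outside the mask, whose fine details should be preserved
        m = F.interpolate(inpaint_mask, size=enc_feat.shape[-2:], mode="nearest")
        skip = self.proj(enc_feat) * (1.0 - m)
        # Inject the skip information into the decoder feature map
        return dec_feat + skip
```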


Clothes Warping Procedure

Coarse-to-fine warping in two steps (see the sketch after this list):

  • Step 1: coarse warping
    • TPS: warps the garment to match the model's pose and mask shape
      • Toward characteristic-preserving image-based virtual try-on network.
  • Step 2: refinement warping
    • U-Net: takes the coarsely warped garment, the pose, and the model image as input, and outputs the target warped garment (supervised with the ground truth)
      • U-Net: Convolutional Networks for Biomedical Image Segmentation.
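A schematic sketch of the two-step procedure; `tps_regressor` and `refinement_unet` are hypothetical placeholders for the learned TPS parameter regressor (in the spirit of CP-VTON) and the refinement U-Net:

```python
import torch
import torch.nn.functional as F

def warp_garment(cloth, pose_map, person_repr, tps_regressor, refinement_unet):
    """Coarse-to-fine garment warping (sketch; the two modules passed in are
    hypothetical placeholders for the learned components)."""
    # Step 1: coarse warping. A learned module regresses a TPS sampling grid
    # from the garment and the person representation (pose + body shape),
    # and the garment is resampled with that grid.
    tps_grid = tps_regressor(cloth, person_repr)           # (B, H, W, 2)
    coarse_cloth = F.grid_sample(cloth, tps_grid, align_corners=False)

    # Step 2: refinement. A U-Net sees the coarse result together with the
    # pose/person cues and predicts the final warped garment C_W.
    unet_in = torch.cat([coarse_cloth, pose_map, person_repr], dim=1)
    return refinement_unet(unet_in)
```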

Experiments

(results figures)

Conclusions

  • The first virtual try-on method based on LDMs
