How to Fine-Tune a FLUX Model in Under an Hour with AI Toolkit and a 某中心 H100 GPU

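FLUX has taken the internet by storm over the past month, and not without reason. Its claims of superiority over models such as DALLE 3, Ideogram, and Stable Diffusion 3 have proven well founded. As its capabilities are integrated into more and more popular image generation tools, its spread through the Stable Diffusion ecosystem will only continue. Since the model's release, we have also seen several important advances in user workflows. Notable among them are the first LoRA (Low-Rank Adaptation) models and the first ControlNet models for improved guidance. These give users a degree of control over text guidance and object placement, respectively. In this article, we look at one of the first methods for training our own LoRA on custom data with AI Toolkit. The method comes from Jared Burkett's repository, and it gives us the best new way to fine-tune the FLUX schnell or dev models in quick succession. Follow along to learn every step needed to train your own FLUX LoRA.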

Setting up the H100 environment

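To get started, we recommend setting up a powerful GPU or multi-GPU environment on 某中心. Click the Gradient/Core button at the top left of the 某中心 console and switch to Core to launch a new H100 or multi-way A100/H100 instance. Then click the "Create instance" button on the far right. When creating the instance, be sure to select the right GPU and the ML-In-A-Box template, which comes with most of the packages we will use preinstalled. We should also choose an instance with enough storage (more than 250 GB) to avoid running out of disk space after training the models. Once that is done, start your instance and access it either through the desktop stream in your browser or over SSH from your local machine.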
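Once connected, it can be worth confirming that the instance really has the storage recommended above before any weights are downloaded. The following is a minimal sketch using only the Python standard library; the helper names `disk_report` and `meets_storage_recommendation` are illustrative assumptions and not part of the original setup.

```python
import shutil


def disk_report(path="."):
    """Return (total_gb, free_gb) for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.total / 1e9, usage.free / 1e9


def meets_storage_recommendation(total_gb, minimum_gb=250.0):
    """True if the disk is larger than the 250 GB suggested in the text (hypothetical helper)."""
    return total_gb > minimum_gb
```

Calling `disk_report()` from the home directory and passing the first value to `meets_storage_recommendation` gives a quick yes/no answer; the free-space figure shows how much room is left for checkpoints and cached model weights.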

Data preparation

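Now that everything is set up, we can start loading data for training. To choose training data, pick a subject that is distinctive on camera or in images and that we can easily obtain. This can be a style or a specific type of object, subject, or person.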

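For example, we chose to train on the face of this article's author. To do this, we took about 30 selfies at different angles and distances with a high-quality camera. The images were then cropped to squares and renamed to fit the required naming format. We then used Florence-2 to automatically generate a caption for each image and saved each caption in a text file matching its image.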
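The cropping and renaming step can be scripted. The sketch below assumes a simple center crop and an `img1.png`, `img2.png`, ... naming scheme; the article does not say how the author's images were actually cropped, so the function names and the center-crop choice are assumptions. Pillow is imported lazily so that the box calculation itself has no dependencies.

```python
def center_square_box(width, height):
    """Return the (left, top, right, bottom) box of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


def crop_and_rename(src_dir, dst_dir, prefix="img"):
    """Center-crop every image in src_dir and save it as img1.png, img2.png, ... in dst_dir."""
    import os
    from PIL import Image  # Pillow, assumed to be installed

    os.makedirs(dst_dir, exist_ok=True)
    names = sorted(
        n for n in os.listdir(src_dir) if n.lower().endswith((".png", ".jpg", ".jpeg"))
    )
    for index, name in enumerate(names, start=1):
        with Image.open(os.path.join(src_dir, name)) as image:
            box = center_square_box(image.width, image.height)
            square = image.convert("RGB").crop(box)
            square.save(os.path.join(dst_dir, f"{prefix}{index}.png"))
```

For portraits where the face sits off-center, a manual crop will usually give better results than this automatic one.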

The data must be stored in its own directory in the following format:

- img1.png
- img1.txt
- img2.png
- img2.txt
...

Each image and its text file must share the same base name.
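Once the captions exist, a quick check like the following can confirm that every image has a caption file with the same base name and vice versa. It is a minimal sketch using only the standard library; `find_unpaired` is a hypothetical helper name, and the extension list mirrors the jpg, jpeg, and png formats that the training config's comments say are supported.

```python
import os

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


def find_unpaired(folder):
    """Return (images_without_caption, captions_without_image) as sorted lists of base names."""
    image_stems, caption_stems = set(), set()
    for name in os.listdir(folder):
        stem, ext = os.path.splitext(name)
        if ext.lower() in IMAGE_EXTENSIONS:
            image_stems.add(stem)
        elif ext.lower() == ".txt":
            caption_stems.add(stem)
    return sorted(image_stems - caption_stems), sorted(caption_stems - image_stems)
```

Two empty lists mean the folder is consistent; any names returned point to files that still need a caption or an image.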

To do all of this, we suggest adapting the following snippet to run automatic captioning. Run it (or label.py from the GitHub repo) on your image folder.

!pip install -U oyaml transformers einops albumentations python-dotenv

import os

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Hugging Face model ID for Florence-2-large
model_id = '某机构/Florence-2-large'
# Load the model with the same dtype and device that the inputs are cast to below
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, torch_dtype=torch_dtype).eval().to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

prompt = "<MORE_DETAILED_CAPTION>"
image_dir = '<YOUR DIRECTORY NAME>'

for filename in os.listdir(image_dir):
    # Caption image files only; skip existing .txt captions and anything else
    if not filename.lower().endswith(('.png', '.jpg', '.jpeg')):
        continue
    image = Image.open(os.path.join(image_dir, filename)).convert("RGB")

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch_dtype)

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
        do_sample=False
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

    parsed_answer = processor.post_process_generation(generated_text, task=prompt, image_size=(image.width, image.height))
    print(parsed_answer)
    # Write the caption next to the image under the same base name
    caption_path = os.path.join(image_dir, os.path.splitext(filename)[0] + ".txt")
    with open(caption_path, "w") as f:
        f.write(parsed_answer[prompt])

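Once it has finished running on your image folder, the caption text files will be saved under matching names. At this point, we should be ready to get started with AI Toolkit!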

Setting up the training loop

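This work is based on the AI Toolkit from the Ostris repo, and we want to thank them for their excellent work. To get started with AI Toolkit, first paste the following into your terminal to set up the environment: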

git clone https://github.com/ostris/ai-toolkit.git
cd ai-toolkit
git submodule update --init --recursive
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
pip install peft
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

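This takes a few minutes. After that, there is one last step. Add a read-only token to the HuggingFace cache by logging in with the following terminal command: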

huggingface-cli login

Once setup is complete, we are ready to begin the training loop.

Configuring the training loop

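AI Toolkit provides a training script, run.py, that handles all the intricacies of training a FLUX.1 model. Either the schnell or the dev model can be fine-tuned, but we recommend training dev. dev is more restrictive in its usage license, but it is also far more capable than schnell at prompt understanding, spelling, and object composition. That said, schnell should be much faster to train thanks to its distillation.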

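run.py takes a yaml configuration file that holds the training parameters. For this use case, we will edit the train_lora_flux_24gb.yaml file. Here is an example version of the config: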

---
job: extension
config:
  # this name will be the folder and filename name
  name: <YOUR LORA NAME>
  process:
    - type: 'sd_trainer'
      # root folder to save training sessions/samples/weights
      training_folder: "output"
      # uncomment to see performance stats in the terminal every N steps
#      performance_log_every: 1000
      device: cuda:0
      # if a trigger word is specified, it will be added to captions of training data if it does not already exist
      # alternatively, in your captions you can add [trigger] and it will be replaced with the trigger word
#      trigger_word: "p3r5on"
      network:
        type: "lora"
        linear: 16
        linear_alpha: 16
      save:
        dtype: float16 # precision to save
        save_every: 250 # save every this many steps
        max_step_saves_to_keep: 4 # how many intermittent saves to keep
      datasets:
        # datasets are a folder of images. captions need to be txt files with the same name as the image
        # for instance image2.jpg and image2.txt. Only jpg, jpeg, and png are supported currently
        # images will automatically be resized and bucketed into the resolution specified
        # on windows, escape back slashes with another backslash so
        # "C:\\path\\to\\images\\folder"
        - folder_path: <PATH TO YOUR IMAGES>
          caption_ext: "txt"
          caption_dropout_rate: 0.05  # will drop out the caption 5% of time
          shuffle_tokens: false  # shuffle caption order, split by commas
          cache_latents_to_disk: true  # leave this true unless you know what you're doing
          resolution: [1024]  # flux enjoys multiple resolutions
      train:
        batch_size: 1
        steps: 2500  # total number of steps to train 500 - 4000 is a good range
        gradient_accumulation_steps: 1
        train_unet: true
        train_text_encoder: false  # probably won't work with flux
        gradient_checkpointing: true  # need this on unless you have a ton of vram
        noise_scheduler: "flowmatch" # for training only
        optimizer: "adamw8bit"
        lr: 1e-4
        # uncomment this to skip the pre training sample
#        skip_first_sample: true
        # uncomment to completely disable sampling
#        disable_sampling: true
        # enables the new bell curved weighting (comment out to disable). Experimental but may produce better results
        linear_timesteps: true

        # ema will smooth out learning, but could slow it down. Recommended to leave on.
        ema_config:
          use_ema: true
          ema_decay: 0.99

        # will probably need this if gpu supports it for flux, other dtypes may not work correctly
        dtype: bf16
      model:
        # huggingface model name or path
        name_or_path: "black-forest-labs/FLUX.1-dev"
        is_flux: true
        quantize: true  # run 8bit mixed precision
#        low_vram: true  # uncomment this if the GPU is connected to your monitors. It will use less vram to quantize, but is slower.
      sample:
        sampler: "flowmatch" # must match train.noise_scheduler
        sample_every: 250 # sample every this many steps
        width: 1024
        height: 1024
        prompts:
          # you can add [trigger] to the prompts here and it will be replaced with the trigger word
#          - "[trigger] holding a sign that says 'I LOVE PROMPTS!'"
          - "woman with red hair, playing chess at the park, bomb going off in the background"
          - "a woman holding a coffee cup, in a beanie, sitting at a cafe"
          - "a horse is a DJ at a night club, fish eye lens, smoke machine, lazer lights, holding a martini"
          - "a man showing off his cool new t shirt at the beach, a shark is jumping out of the water in the background"
          - "a bear building a log cabin in the snow covered mountains"
          - "woman playing the guitar, on stage, singing a song, laser lights, punk rocker"
          - "hipster man with a beard, building a chair, in a wood shop"
          - "photo of a man, white background, medium shot, modeling clothing, studio lighting, white backdrop"
          - "a man holding a sign that says, 'this is a sign'"
          - "a bulldog, in a post apocalyptic world, with a shotgun, in a leather jacket, in a desert, with a motorcycle"
        neg: ""  # not used on flux
        seed: 42
        walk_seed: true
        guidance_scale: 4
        sample_steps: 20
# you can add any additional meta info here. [name] is replaced with config name at top
meta:
  name: "[name]"
  version: '1.0'

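The most important lines to edit are line 5 (name), where we change the name; line 30 (folder_path), where we add the path to our image directory; and lines 69 and 70 (width and height), where we can edit the dimensions to reflect our training images. Edit these lines so that the trainer runs on your images. We may also want to edit the prompts. Several of them involve animals or scenes, so if we are trying to capture a specific person, we may want to rewrite them to better guide the model. We can further control these generated samples with the guidance scale and sample steps values on lines 87-88.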
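Rather than editing the example file by hand, the same fields can be changed programmatically. The sketch below is an illustration only: `customize_config` and `write_custom_config` are hypothetical helpers, the nesting they rely on is taken from the config shown above, and PyYAML is assumed to be installed. Note that loading and re-dumping the YAML drops the explanatory comments from the file.

```python
import copy


def customize_config(config, name, folder_path, width=1024, height=1024, prompts=None):
    """Return a copy of an AI Toolkit config dict with the most commonly edited fields replaced."""
    updated = copy.deepcopy(config)
    updated["config"]["name"] = name
    process = updated["config"]["process"][0]
    process["datasets"][0]["folder_path"] = folder_path
    process["sample"]["width"] = width
    process["sample"]["height"] = height
    if prompts is not None:
        process["sample"]["prompts"] = list(prompts)
    return updated


def write_custom_config(src_path, dst_path, **overrides):
    """Load the example YAML, apply the overrides, and write the result to a new file."""
    import yaml  # PyYAML, imported lazily; assumed to be available in the environment

    with open(src_path) as f:
        config = yaml.safe_load(f)
    with open(dst_path, "w") as f:
        yaml.safe_dump(customize_config(config, **overrides), f, sort_keys=False)
```

For example, `write_custom_config("config/examples/train_lora_flux_24gb.yaml", "config/my_flux_lora.yaml", name="my_first_flux_lora_v1", folder_path="/path/to/images")` writes a new config (the output file name and image path here are placeholders) that can then be passed to run.py in place of the example file.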

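To train the FLUX.1 model faster, we can tune training further by editing the batch size on line 37 and the gradient accumulation steps on line 39. When training on multiple GPUs or an H100, these values can be raised slightly; otherwise we recommend leaving them unchanged. Be aware that raising them may cause out-of-memory errors. On line 38, we can change the number of training steps. The config's comments suggest between 500 and 4000, so we went with 2500, roughly the middle of that range, and got good results with this value. A checkpoint is saved every 250 steps, and this too can be changed on line 22 if needed. Finally, we can switch the model from dev to schnell by replacing the HuggingFace ID on line 62 with the schnell ID ('black-forest-labs/FLUX.1-schnell'). Now that everything is set up, we can run the training!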
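To get a feel for how these settings interact, the short calculation below derives a few rough quantities from them. It is a back-of-the-envelope sketch, not something AI Toolkit reports: it assumes each training step processes batch_size × gradient_accumulation_steps images, and it does not count the optional pre-training sample. The function name `training_summary` is hypothetical.

```python
def training_summary(num_images, steps=2500, batch_size=1, gradient_accumulation_steps=1,
                     save_every=250, sample_every=250, num_prompts=10):
    """Derive rough quantities from the training settings (assumptions noted in the text)."""
    effective_batch = batch_size * gradient_accumulation_steps
    return {
        # images contributing to each weight update, under the stated assumption
        "effective_batch_size": effective_batch,
        # approximate number of times each training image is seen
        "passes_over_dataset": steps * effective_batch / num_images,
        # how many times a checkpoint is written during the run
        "save_points": steps // save_every,
        # sample images produced during training, excluding the pre-training sample
        "sample_images": (steps // sample_every) * num_prompts,
    }
```

With the defaults above and a 60-image dataset, this gives about 42 passes over the data, 10 save points (of which the config keeps the 4 most recent intermediate saves), and 100 sample images.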

Running the FLUX.1 training loop

To run the training loop, all we need now is the run.py script.

python3 run.py config/examples/train_lora_flux_24gb.yaml

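For our training loop, we used 60 images and trained for 2500 steps on a single H100. The whole process took about 45 minutes. Afterward, the LoRA file and its checkpoints were saved in Downloads/ai-toolkit/output/my_first_flux_lora_v1/. As we can see, the facial features gradually shift to match those of the intended subject more closely.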
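While training is running, it can be handy to find the newest checkpoint without knowing exactly how the files are named. The sketch below sorts the .safetensors files in an output folder by modification time; the helper names are hypothetical, and ordering by modification time rather than by file name is a deliberate choice to avoid assuming AI Toolkit's naming scheme.

```python
import os


def list_checkpoints(output_dir):
    """Return the .safetensors files in output_dir, oldest first, by modification time."""
    paths = [
        os.path.join(output_dir, name)
        for name in os.listdir(output_dir)
        if name.endswith(".safetensors")
    ]
    return sorted(paths, key=os.path.getmtime)


def latest_checkpoint(output_dir):
    """Return the most recently written checkpoint, or None if there are none yet."""
    checkpoints = list_checkpoints(output_dir)
    return checkpoints[-1] if checkpoints else None
```

Pointing `latest_checkpoint` at the output folder mentioned above returns the path of the file that was written last, which can then be loaded for a quick test generation.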

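In the output directory, we can also find the samples the model generated from the prompts listed in the config. These can be used to see how training is progressing.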

Inference with our new FLUX.1 LoRA

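Now that the model has finished training, we can use the newly trained LoRA to adjust the outputs of FLUX.1. We have provided a quick inference script in the Notebook.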

import torch
from diffusers import DiffusionPipeline

# Must match the `name` field set in the training config
lora_name = '<YOUR LORA NAME>'
model_id = 'black-forest-labs/FLUX.1-dev'
adapter_id = f'output/{lora_name}/{lora_name}.safetensors'

# Pick the best available device once and reuse it below
device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
# bfloat16 on a CUDA GPU roughly halves memory use compared with the float32 default
torch_dtype = torch.bfloat16 if device == 'cuda' else torch.float32

pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch_dtype)
pipeline.load_lora_weights(adapter_id)
pipeline.to(device)

prompt = "ethnographic photography of man at a picnic"

image = pipeline(
    prompt=prompt,
    num_inference_steps=50,
    generator=torch.Generator(device=device).manual_seed(1641421826),
    width=1152,
    height=768,
).images[0]
display(image)  # available in notebooks; use image.save("output.png") in a plain script

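With just 500 steps of fine-tuning on the face of this article's author, we were able to reproduce their features fairly accurately:

[Sample output image]

This process can be applied to LoRA training for any kind of object, subject, concept, or style. As with Stable Diffusion, we recommend trying a wide range of images that capture the subject or style in as much variety as possible.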

Closing thoughts

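FLUX.1 is truly a major step forward, and we personally cannot stop using it for all kinds of artistic tasks. It is rapidly replacing all other image generators, and for good reason. This tutorial showed how to fine-tune a LoRA for FLUX.1 using GPUs in the cloud. Readers should now understand how to train a custom LoRA with the techniques shown here. Stay tuned for more FLUX.1 blog posts in the near future!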