Batch-converting white-background product photos into showcase videos with sound via API: notes on using Seedream for scene images + Wan 2.7 for video, two…

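Making showcase clips for product photos in bulk is slow: editing one by hand takes about 20 minutes, so 30 SKUs eat a whole day. I tried automating it with an API, going from a white-background product photo to a showcase video with ambient sound, entirely in code.

The idea is simple, two steps: first use AI to turn the white-background photo into a product image with a real scene, then turn that image into a short video with camera movement and ambient sound.

How the two models work together

Step one uses Seedream 5.0 Lite for image editing, placing the white-background product into a scene. Step two uses Wan 2.7 I2V to turn the scene image into a video with audio.

When editing, Seedream 5.0 Lite keeps the product itself unchanged and only alters the background and mood. Wan 2.7 I2V supports 720p/1080p output and produces synchronized audio by default, so no separate audio step is needed. Both are billed per call.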

Install the SDK:

pip install wavespeed

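Create a key at wavespeed.ai/settings/api-keys.

From white background to scene image

Use the Seedream 5.0 Lite image-editing endpoint: pass one white-background product photo plus a one-sentence scene description, and the model places the product into that scene while keeping the product itself unchanged. I tried it with a white-background photo of a coffee mug.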

import wavespeed

product_url = wavespeed.upload("./mug_white_bg.jpg")

output = wavespeed.run(
    "bytedance/seedream-v5.0-lite/edit",
    {
        "images": [product_url],
        "prompt": "Place the coffee mug on a rustic wooden cafe table, morning sunlight streaming through a window, a croissant and newspaper slightly blurred in background, warm cozy atmosphere, the mug is the focal point, keep the mug design exactly unchanged"
    }
)

scene_url = output["outputs"][0]
print(scene_url)

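The endpoint is bytedance/seedream-v5.0-lite/edit, and the core inputs are just images + prompt. The second half of the prompt must say to keep the product unchanged, otherwise the model may alter the mug too. The first time I left out that constraint, the mug came back in a different color.

I tried three different scene prompts on the same white-background photo: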
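
Because the constraint is easy to forget when writing many prompts, one option is a small helper that always appends it. This is only a sketch: `build_scene_prompt` and `KEEP_CLAUSE` are hypothetical names, not part of the SDK, and the clause simply reuses the wording from the prompt above.

```python
# Hypothetical helper (not part of the wavespeed SDK): builds an edit prompt
# that always ends with the "keep the product unchanged" constraint.
KEEP_CLAUSE = "keep the {product} design exactly unchanged"


def build_scene_prompt(product: str, scene: str) -> str:
    """Compose 'Place the <product> <scene>, ..., keep ... unchanged'."""
    scene = scene.strip().rstrip(",.")  # avoid doubled punctuation
    return (
        f"Place the {product} {scene}, "
        f"the {product} is the focal point, "
        + KEEP_CLAUSE.format(product=product)
    )


prompt = build_scene_prompt(
    "coffee mug",
    "on a rustic wooden cafe table, morning sunlight streaming through a window",
)
print(prompt)
```

The returned string can be passed as the "prompt" value in the call shown earlier.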

scenes = [
    "Place the coffee mug on a rustic wooden cafe table, morning sunlight streaming through a window, warm cozy atmosphere, the mug is the focal point, keep the mug design exactly unchanged",
    "Place the coffee mug on a light marble surface in a minimalist studio, soft diffused window light from the left, clean neutral background, product photography style, keep the mug design exactly unchanged",
    "Place the coffee mug on a dark slate surface with dramatic rim lighting, dark moody background with warm amber tones, steam rising from the mug, high-end product advertisement style, keep the mug design exactly unchanged",
]

results = []
for scene_prompt in scenes:
    out = wavespeed.run("bytedance/seedream-v5.0-lite/edit", {
        "images": [product_url],
        "prompt": scene_prompt,
        "seed": 42
    })
    results.append(out["outputs"][0])

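The original white-background photo is on the left, followed by the three results: cafe morning light, minimalist studio, and dark-toned ad style. The mug's shape and pattern are largely preserved, and the background blends in naturally. The cafe version has the best lighting; the dark version has steam rising from the mug and suits a premium look; the studio version is the cleanest and suits a product detail page.

Same product photo, and a different prompt gives a completely different tone.

From scene image to video

Use the Wan 2.7 I2V endpoint: pass the scene image plus a description of the camera movement. A 5-second 720p video takes roughly 40 seconds and comes with ambient sound by default. I fed the cafe scene image to Wan 2.7 I2V: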

video_output = wavespeed.run(
    "alibaba/wan-2.7/image-to-video",
    {
        "image": scene_url,
        "prompt": "Slow push-in shot toward the coffee mug on the cafe table, steam gently rising, morning light shifts subtly, shallow depth of field, the mug stays centered and sharp, cozy warm atmosphere",
        "resolution": "720p",
        "duration": 5
    },
    timeout=300.0
)

video_url = video_output["outputs"][0]
print(video_url)

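The endpoint is alibaba/wan-2.7/image-to-video, with the parameters image (scene image URL), prompt (describes the camera movement), resolution, and duration. Wan 2.7 outputs audio by default; no extra parameter is needed.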
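
To keep the four parameters consistent across many calls, the dict can be built in one place. A minimal sketch, assuming only the parameter names listed above; `make_i2v_payload` is a hypothetical helper, and the resolution check reflects the two values mentioned in this write-up rather than any official API spec.

```python
# Hypothetical helper (not part of the wavespeed SDK): builds the parameter
# dict described above and rejects resolutions this write-up does not mention.
ALLOWED_RESOLUTIONS = ("720p", "1080p")  # taken from this article, not an API spec


def make_i2v_payload(image_url, prompt, resolution="720p", duration=5):
    """Return the dict to pass as the second argument of wavespeed.run()."""
    if resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"unexpected resolution {resolution!r}")
    return {
        "image": image_url,
        "prompt": prompt,
        "resolution": resolution,
        "duration": duration,  # seconds; passed through unchecked
    }


payload = make_i2v_payload(
    "https://example.com/scene.jpg",  # placeholder URL
    "Slow push-in shot toward the mug",
)
print(payload)
```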

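The result is a 5-second push-in, moving slowly from a slight distance to a close-up of the mug. The light shifts subtly and the background blur looks natural. The audio was generated automatically as cafe ambience, the kind with clinking cups and saucers and low murmurs, and it matches the picture well.

Looking closely at the geometric pattern on the mug, though, there is slight warping and flicker during the push-in. For a 5-second clip viewed casually it's fine, but frame by frame the AI artifacts are visible.

A few things to watch out for

The camera-movement description in the prompt matters a lot. The first time I only wrote "a mug on the table", and the video barely moved; it was basically a still image wobbling slightly. Only after adding "slow push-in shot" did it get a real sense of camera motion.

Don't make the video prompt too long. At first I crammed scene description, camera movement, lighting changes, and sound effects all in, and the model seemed to handle only the first half and ignore the rest. A core action plus one camera movement is enough.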
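
One way to enforce "one camera movement + one core action" is to assemble the prompt from exactly those two pieces and refuse anything too long. This is a sketch under assumptions: `build_video_prompt` is a hypothetical helper, and the 40-word budget is an arbitrary number to tune by experiment, not a documented model limit.

```python
# Hypothetical helper: one camera move + one core action, with a soft length cap.
MAX_WORDS = 40  # assumed budget; not a documented limit of the model


def build_video_prompt(camera_move: str, core_action: str,
                       max_words: int = MAX_WORDS) -> str:
    """Join the two parts and fail loudly if the prompt grows too long."""
    prompt = f"{camera_move.strip()}, {core_action.strip()}"
    n_words = len(prompt.split())
    if n_words > max_words:
        raise ValueError(
            f"prompt has {n_words} words, over the {max_words}-word budget"
        )
    return prompt


video_prompt = build_video_prompt(
    "Slow push-in shot toward the coffee mug",
    "steam gently rising, the mug stays centered and sharp",
)
print(video_prompt)
```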

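Generation takes much longer than for images. A 5-second 720p video took about 40 seconds, and a 10-second one takes over a minute. For batch runs, consider concurrency, but watch the API rate limits.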
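
If a call does get rejected under load, a common pattern is to retry with exponential backoff. The sketch below is generic: `call_with_backoff` is a hypothetical helper, the delays are assumptions, and it catches any exception because the SDK's rate-limit error type is not known here.

```python
import random
import time


def call_with_backoff(fn, *, retries=4, base_delay=2.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt (+ jitter) and retry."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:  # the SDK's rate-limit exception type is not assumed
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

In the pipeline this would wrap a single request, for example `call_with_backoff(lambda: wavespeed.run(endpoint, params, timeout=300.0))`, where `endpoint` and `params` are whatever that step already uses.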

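The audio is automatically generated ambient sound, not music. It follows the scene: a cafe scene gets cup clinks and murmurs, while a dark scene is quieter. If you need specific background music, add it yourself in post.

One more pitfall: if the white-background photo has jagged product edges or a sloppy cutout, the scene image from Seedream will show artifacts along the edges too. Use a clean product image, ideally a PNG with an alpha channel.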
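
Before a batch run it can help to flag inputs that are not PNGs with an alpha channel. The following standard-library sketch reads the PNG header and checks the IHDR colour type; `png_has_alpha` and `file_has_alpha` are hypothetical names. It does not detect palette PNGs that carry transparency in a tRNS chunk, and it says nothing about how clean the cutout actually is.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_has_alpha(header: bytes) -> bool:
    """True if the bytes start a PNG whose IHDR colour type includes alpha."""
    if len(header) < 26:
        return False
    if header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        return False
    color_type = header[25]  # byte after width (4), height (4), bit depth (1)
    return color_type in (4, 6)  # 4 = greyscale + alpha, 6 = RGBA


def file_has_alpha(path: str) -> bool:
    """Read only the first 26 bytes of the file and inspect them."""
    with open(path, "rb") as f:
        return png_has_alpha(f.read(26))
```

A JPEG such as ./mug_white_bg.jpg would return False here, which is a hint to re-export the cutout rather than an error.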

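Writing the batch run

Chain scene-image generation and video generation together and process all products in one loop. 30 SKUs run serially take about 40 minutes; concurrency is faster but mind the rate limits.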

import wavespeed
import urllib.request
import os

products = [
    {"file": "./products/mug_01.jpg", "scene": "rustic cafe table, morning light, cozy"},
    {"file": "./products/mug_02.jpg", "scene": "minimalist studio, soft window light, clean"},
    {"file": "./products/bottle_01.jpg", "scene": "outdoor picnic blanket, golden hour, lifestyle"},
]

output_dir = "./output_videos"
os.makedirs(output_dir, exist_ok=True)

for i, item in enumerate(products):
    print(f"Processing {i+1}/{len(products)}...")

    # Upload the product image
    img_url = wavespeed.upload(item["file"])

    # Step 1: generate the scene image
    scene_out = wavespeed.run("bytedance/seedream-v5.0-lite/edit", {
        "images": [img_url],
        "prompt": f"Place the product in a {item['scene']}, keep the product design exactly unchanged, product is the focal point",
        "seed": 42
    })
    scene_url = scene_out["outputs"][0]

    # Step 2: turn the scene image into a video
    video_out = wavespeed.run("alibaba/wan-2.7/image-to-video", {
        "image": scene_url,
        "prompt": "Slow push-in shot toward the product, subtle lighting shift, shallow depth of field, the product stays centered and sharp",
        "resolution": "720p",
        "duration": 5
    }, timeout=300.0)

    # Download the video
    video_path = os.path.join(output_dir, f"product_{i+1}.mp4")
    urllib.request.urlretrieve(video_out["outputs"][0], video_path)
    print(f"  Saved to {video_path}")

print("All done")

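30 products take about 40 minutes end to end (serially). The exact cost depends on pricing at the time; check wavespeed.ai/pricing. For speed you can use asyncio for concurrency, but mind the rate limits.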
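
A minimal sketch of the asyncio route, assuming the per-product work (upload, edit, image-to-video, download) is wrapped in one blocking function. `run_batch` and `fake_process` are hypothetical names, the concurrency cap of 3 is an arbitrary guess rather than a known rate limit, and `asyncio.to_thread` needs Python 3.9 or later.

```python
import asyncio


async def run_batch(items, process_one, max_concurrent=3):
    """Run blocking process_one(item) in threads, max_concurrent at a time."""
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(item):
        async with sem:  # caps the number of in-flight requests
            try:
                result = await asyncio.to_thread(process_one, item)
                return item, result, None
            except Exception as exc:  # one failed product should not stop the rest
                return item, None, exc

    # gather() returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(it) for it in items))


def fake_process(item):
    """Stand-in for the real upload + edit + video + download step."""
    if item == "bad":
        raise RuntimeError("simulated failure")
    return f"{item}.mp4"


results = asyncio.run(run_batch(["mug_01", "bad", "mug_02"], fake_process))
for item, path, err in results:
    print(item, path if err is None else f"failed: {err}")
```

Replacing `fake_process` with a function containing the body of the serial loop keeps the rest unchanged; failed items come back with their exception so they can be retried separately.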

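Is the result actually usable?

For e-commerce detail pages, social media shorts, and product teaser videos, it's good enough, and a looping 5-second clip works well. But fine product details (watch dial markings, small-text logos) get distorted, so it's not suitable for those.

The scene-image step worked better than I expected. Seedream 5.0 Lite preserves the product well; the mug kept its shape across all three scenes. The lighting blends naturally too, and it doesn't look like a cutout pasted on.

The video step is a mixed bag. Camera movement and ambient sound are both convincing, and for casual viewing it's entirely usable. But surface details (patterns, text, textures) warp slightly during motion, which is a common problem with current I2V models. If the product has a fine logo or small text, stick to 5-second clips, where the warping is less noticeable.

How the two steps actually performed:

Step	Model	Output	Result
White-background photo → scene image	Seedream 5.0 Lite Edit	JPG	Product well preserved, background blends naturally
Scene image → video	Wan 2.7 I2V	720p MP4 with audio	Smooth camera movement, slight warping of details

Where it falls short: anything that needs precise product detail (such as the markings on a watch dial), because AI-generated video still distorts details. Anything that needs music with a specific rhythm is also out, since the audio is auto-generated ambient sound and the beat can't be controlled.

For longer videos or multi-shot edits, generate several 5-second clips and stitch them together. Wan 2.7 also supports first/last-frame control (pass the last_image parameter), which makes the transition between consecutive clips smoother.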
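
For the stitching step, one possible approach (not something the original workflow used) is ffmpeg's concat demuxer, which joins clips without re-encoding when they share codec and resolution. The sketch below only builds the file list and the command; `concat_list_text` and `concat_command` are hypothetical helpers, and ffmpeg must be installed separately.

```python
def concat_list_text(clip_paths):
    """Return the file-list text that ffmpeg's concat demuxer expects."""
    lines = []
    for path in clip_paths:
        escaped = path.replace("'", "'\\''")  # escape single quotes for ffmpeg
        lines.append(f"file '{escaped}'\n")
    return "".join(lines)


def concat_command(list_path, out_path):
    """Build the ffmpeg argument list; stream copy, so no re-encoding."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0",
        "-i", list_path,
        "-c", "copy",
        out_path,
    ]


clips = ["product_1_a.mp4", "product_1_b.mp4"]  # placeholder file names
print(concat_list_text(clips), end="")
print(" ".join(concat_command("clips.txt", "product_1_full.mp4")))
```

To actually run it, write the list text to clips.txt and pass the command to `subprocess.run(..., check=True)`. Relative paths in the list are resolved against the list file's location.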

Seedream 5.0 Lite docs: wavespeed.ai/models/byte…
Wan 2.7 I2V docs: wavespeed.ai/models/alib…
SDK: github.com/WaveSpeedAI…