环境配置
首先需要安装必要的库。我们将安装 transformers 库以加载SAM 3模型和处理器,supervision 库用于标注、绘制和检查,以及 jupyter_bbox_widget 交互式小组件,用于在笔记本中点击图像添加点或绘制边界框。
# Install transformers from source (needed for SAM 3 support), plus the
# supervision annotation library and the jupyter_bbox_widget interactive widget.
!pip install --q git+https://github.com/huggingface/transformers supervision jupyter_bbox_widget
设置与导入
安装完成后,导入所需的库。
import io
import torch
import base64
import requests
import matplotlib
import numpy as np
import ipywidgets as widgets
import matplotlib.pyplot as plt
from google.colab import output
from accelerate import Accelerator
from IPython.display import display
from jupyter_bbox_widget import BBoxWidget
from PIL import Image, ImageDraw, ImageFont
from transformers import Sam3Processor, Sam3Model, Sam3TrackerProcessor, Sam3TrackerModel
加载 SAM 3 模型
检查GPU是否可用,然后加载处理器和模型。
# Prefer the GPU when CUDA is available; fall back to CPU otherwise.
cuda_available = torch.cuda.is_available()
device = "cuda" if cuda_available else "cpu"

# Download the pretrained processor and model, and move the model to the device.
processor = Sam3Processor.from_pretrained("facebook/sam3")
model = Sam3Model.from_pretrained("facebook/sam3").to(device)
下载示例图像
下载一些示例图像用于后续的分割演示。
# Download the sample images used throughout the segmentation demos.
!wget -q https://media.roboflow.com/notebooks/examples/birds.jpg
!wget -q https://media.roboflow.com/notebooks/examples/traffic_jam.jpg
!wget -q https://media.roboflow.com/notebooks/examples/basketball_game.jpg
!wget -q https://media.roboflow.com/notebooks/examples/dog-2.jpeg
单图像多文本提示
此示例对同一张图像应用两个不同的文本提示:"player in white" 和 "player in blue"。遍历每个提示,合并所有检测结果并一起可视化。
# Two different text prompts applied to the same image; detections are merged later.
prompts = ["player in white", "player in blue"]
IMAGE_PATH = "/content/basketball_game.jpg"

# Load the image
image = Image.open(IMAGE_PATH).convert("RGB")

# Accumulators for the per-prompt detections.
all_masks, all_boxes, all_scores = [], [], []
total_objects = 0
# Run one forward pass per text prompt and accumulate the per-prompt results.
# (Indentation restored: the loop and no_grad bodies must be nested.)
for prompt in prompts:
    inputs = processor(
        images=image,
        text=prompt,
        return_tensors="pt"
    ).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    # Keep instances whose score and mask probability exceed 0.5, resized
    # back to the original image resolution; [0] selects this single image.
    results = processor.post_process_instance_segmentation(
        outputs,
        threshold=0.5,
        mask_threshold=0.5,
        target_sizes=inputs["original_sizes"].tolist()
    )[0]
    num_objects = len(results["masks"])
    total_objects += num_objects
    print(f"为提示 '{prompt}' 找到 {num_objects} 个对象")
    all_masks.append(results["masks"])
    all_boxes.append(results["boxes"])
    all_scores.append(results["scores"])
# Concatenate the per-prompt tensors into one merged result dict.
results = {
    key: torch.cat(parts, dim=0)
    for key, parts in zip(
        ("masks", "boxes", "scores"),
        (all_masks, all_boxes, all_scores),
    )
}
print(f"\n所有提示共找到对象总数: {total_objects}")
输出:
为提示 'player in white' 找到 5 个对象
为提示 'player in blue' 找到 6 个对象
所有提示共找到对象总数: 11
生成标签并可视化结果。
# Build one label per detection: each prompt contributes len(scores) entries,
# in the same order the results were concatenated above.
# (Indentation restored: the loop body must be nested.)
labels = []
for prompt, scores in zip(prompts, all_scores):
    labels.extend([prompt] * len(scores))

# overlay_masks_boxes_scores is assumed to be a visualization helper defined elsewhere.
# overlay_masks_boxes_scores(
#     image=image,
#     masks=results["masks"],
#     boxes=results["boxes"],
#     scores=results["scores"],
#     labels=labels,
#     score_threshold=0.5,
#     alpha=0.45,
# )
多图像批量推理
此示例同时处理两张图像,并为每张图像提供独立的文本提示,实现并行处理。
# Download two COCO validation images; each gets its own text prompt,
# aligned by index, and both are processed in a single batch.
cat_url = "http://images.cocodataset.org/val2017/000000077595.jpg"
kitchen_url = "http://images.cocodataset.org/val2017/000000136466.jpg"
images = [
    Image.open(requests.get(cat_url, stream=True).raw).convert("RGB"),
    Image.open(requests.get(kitchen_url, stream=True).raw).convert("RGB")
]
text_prompts = ["ear", "dial"]
inputs = processor(images=images, text=text_prompts, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)
# Post-process the results for both images.
# (Use item access for "original_sizes", consistent with the other examples.)
results = processor.post_process_instance_segmentation(
    outputs,
    threshold=0.5,
    mask_threshold=0.5,
    target_sizes=inputs["original_sizes"].tolist()
)
print(f"图像 1: 找到 {len(results[0]['masks'])} 个对象")
print(f"图像 2: 找到 {len(results[1]['masks'])} 个对象")
输出:
图像 1: 找到 2 个对象
图像 2: 找到 7 个对象
单边界框提示
此示例使用边界框代替文本提示,为模型提供空间位置信息。
# 加载图像
# Load the image
image_url = "http://images.cocodataset.org/val2017/000000077595.jpg"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
# Bounding box in xyxy format: [x1, y1, x2, y2]
box_xyxy = [100, 150, 500, 450]
input_boxes = [[box_xyxy]]
input_boxes_labels = [[1]] # 1 = positive (foreground) box
# Helper: draw the input box on the image.
def draw_input_box(image, box, color="red", width=3):
    """Return an RGB copy of *image* with one xyxy box outlined on it.

    Args:
        image: PIL image to annotate (the original is left untouched).
        box: [x1, y1, x2, y2] pixel coordinates of the rectangle.
        color: outline color passed through to ImageDraw.
        width: outline thickness in pixels.
    """
    img = image.copy().convert("RGB")
    draw = ImageDraw.Draw(img)
    x1, y1, x2, y2 = box
    draw.rectangle([(x1, y1), (x2, y2)], outline=color, width=width)
    return img
# Preview the prompt region, then run box-prompted segmentation.
input_box_vis = draw_input_box(image, box_xyxy)
# display(input_box_vis)
inputs = processor(
    images=image,
    input_boxes=input_boxes,
    input_boxes_labels=input_boxes_labels,
    return_tensors="pt"
).to(device)
with torch.no_grad():
    outputs = model(**inputs)
results = processor.post_process_instance_segmentation(
    outputs,
    threshold=0.5,
    mask_threshold=0.5,
    target_sizes=inputs["original_sizes"].tolist()
)[0]
print(f"找到 {len(results['masks'])} 个对象")
输出:
找到 1 个对象
单图像多边界框提示(双正例前景区域)
此示例使用两个正例边界框引导SAM 3分割框内所有检测到的对象。
# Two positive boxes: both regions are treated as foreground exemplars.
kitchen_url = "http://images.cocodataset.org/val2017/000000136466.jpg"
kitchen_image = Image.open(
    requests.get(kitchen_url, stream=True).raw
).convert("RGB")
box1_xyxy = [59, 144, 76, 163]   # dial
box2_xyxy = [87, 148, 104, 159]  # button
input_boxes = [[box1_xyxy, box2_xyxy]]
input_boxes_labels = [[1, 1]]  # 1 = positive (foreground)
inputs = processor(
    images=kitchen_image,
    input_boxes=input_boxes,
    input_boxes_labels=input_boxes_labels,
    return_tensors="pt"
).to(device)
with torch.no_grad():
    outputs = model(**inputs)
results = processor.post_process_instance_segmentation(
    outputs,
    threshold=0.5,
    mask_threshold=0.5,
    target_sizes=inputs["original_sizes"].tolist()
)[0]
print(f"找到 {len(results['masks'])} 个对象")
输出:
找到 7 个对象
单图像多边界框提示(正例前景与负例背景控制)
此示例结合使用一个正例框和一个负例框。正例框突出想要分割的区域,负例框告知模型忽略附近区域。
# One positive and one negative box: the positive box marks the target region,
# the negative box tells the model to ignore the nearby area.
kitchen_url = "http://images.cocodataset.org/val2017/000000136466.jpg"
kitchen_image = Image.open(
    requests.get(kitchen_url, stream=True).raw
).convert("RGB")
box1_xyxy = [59, 144, 76, 163]   # dial
box2_xyxy = [87, 148, 104, 159]  # button
input_boxes = [[box1_xyxy, box2_xyxy]]
input_boxes_labels = [[1, 0]]  # 1 = positive, 0 = negative
inputs = processor(
    images=kitchen_image,
    input_boxes=input_boxes,
    input_boxes_labels=input_boxes_labels,
    return_tensors="pt"
).to(device)
with torch.no_grad():
    outputs = model(**inputs)
results = processor.post_process_instance_segmentation(
    outputs,
    threshold=0.5,
    mask_threshold=0.5,
    target_sizes=inputs["original_sizes"].tolist()
)[0]
print(f"找到 {len(results['masks'])} 个对象")
输出:
找到 6 个对象
结合文本与视觉提示进行选择性分割
此示例同时使用两种提示类型:文本提示 "handle" 和负例边界框以排除烤箱把手区域。
# Segment "handle" while excluding the oven-handle region via a negative box.
kitchen_url = "http://images.cocodataset.org/val2017/000000136466.jpg"
kitchen_image = Image.open(
    requests.get(kitchen_url, stream=True).raw
).convert("RGB")
text = "handle"
# Negative box covering the oven handle (xyxy): [40, 183, 318, 204]
oven_handle_box = [40, 183, 318, 204]
input_boxes = [[oven_handle_box]]
input_boxes_labels = [[0]]  # negative box
# Use the variables defined above instead of repeating the literals inline.
inputs = processor(
    images=kitchen_image,
    text=text,
    input_boxes=input_boxes,
    input_boxes_labels=input_boxes_labels,
    return_tensors="pt"
).to(device)
with torch.no_grad():
    outputs = model(**inputs)
results = processor.post_process_instance_segmentation(
    outputs,
    threshold=0.5,
    mask_threshold=0.5,
    target_sizes=inputs["original_sizes"].tolist()
)[0]
print(f"找到 {len(results['masks'])} 个对象")
输出:
找到 3 个对象
跨两张图像的批量混合提示分割
此示例演示了SAM 3如何在一个批次中处理多种提示类型。第一张图像接收文本提示 ("laptop"),第二张图像接收视觉提示(正例边界框)。
# 复用之前的 images 列表,包含 cat 和 kitchen 图像
# images = [cat_image, kitchen_image]
# Reuses the earlier `images` list ([cat_image, kitchen_image]).
# Per-image prompts, aligned by index: image 0 gets a text prompt,
# image 1 gets a positive box; None means "no prompt of this kind".
text = ["laptop", None]
box2_xyxy = [59, 144, 76, 163]  # reuse the dial box
input_boxes = [None, [box2_xyxy]]
input_boxes_labels = [None, [1]]
inputs = processor(
    images=images,
    text=text,
    input_boxes=input_boxes,
    input_boxes_labels=input_boxes_labels,
    return_tensors="pt"
).to(device)
with torch.no_grad():
    outputs = model(**inputs)
results = processor.post_process_instance_segmentation(
    outputs,
    threshold=0.5,
    mask_threshold=0.5,
    target_sizes=inputs["original_sizes"].tolist()
)
# results[0] corresponds to the text prompt, results[1] to the box prompt.
基于边界框优化的交互式分割
此示例将分割变为完全交互式的工作流程。通过UI小组件直接在图像上绘制边界框。绘制的每个框成为SAM 3的提示信号:绿色(正例)框标识想要分割的区域,红色(负例)框排除模型应忽略的区域。
# Enable Colab's custom widget manager so BBoxWidget renders in the notebook.
output.enable_custom_widget_manager()
# Load the image
url = "http://images.cocodataset.org/val2017/000000136466.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
# Convert a PIL image to base64 for embedding in the widget.
def pil_to_base64(img):
    """Encode *img* as a PNG data URL string usable by BBoxWidget."""
    buffer = io.BytesIO()
    img.save(buffer, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buffer.getvalue()).decode()
# Create the interactive widget: draw "positive" boxes to include regions,
# "negative" boxes to exclude them.
widget = BBoxWidget(
    image=pil_to_base64(image),
    classes=["positive", "negative"]
)
display(widget)
绘制框后,需要将小组件数据转换为SAM 3可用的格式。
def widget_to_sam_boxes(widget):
    """Convert BBoxWidget annotations to SAM-style boxes and labels.

    Args:
        widget: a BBoxWidget whose ``bboxes`` list holds xywh dicts with a
            "label" (or "class") key.

    Returns:
        (boxes, labels): boxes as integer ``[x1, y1, x2, y2]`` lists, and
        labels where 1 marks a "positive" box and 0 anything else.
    """
    boxes = []
    labels = []
    for ann in widget.bboxes:
        x = int(ann["x"])
        y = int(ann["y"])
        w = int(ann["width"])
        h = int(ann["height"])
        # The widget stores xywh; SAM expects xyxy corners.
        x1 = x
        y1 = y
        x2 = x + w
        y2 = y + h
        # Older widget versions use "label"; newer ones use "class".
        label = ann.get("label") or ann.get("class")
        boxes.append([x1, y1, x2, y2])
        labels.append(1 if label == "positive" else 0)
    return boxes, labels
# Pull the drawn boxes out of the widget and show what will be sent to SAM 3.
boxes, box_labels = widget_to_sam_boxes(widget)
print("框:", boxes)
print("标签:", box_labels)
输出示例:
框: [[58, 147, 76, 165], [88, 149, 106, 157]]
标签: [1, 0]
将转换后的框输入模型进行分割。
# Feed the interactively drawn boxes to the model.
inputs = processor(
    images=image,
    input_boxes=[boxes],  # batch size = 1
    input_boxes_labels=[box_labels],
    return_tensors="pt"
).to(device)
with torch.no_grad():
    outputs = model(**inputs)
results = processor.post_process_instance_segmentation(
    outputs,
    threshold=0.5,
    mask_threshold=0.5,
    target_sizes=inputs["original_sizes"].tolist()
)[0]
print(f"找到 {len(results['masks'])} 个对象")
输出:
找到 6 个对象
基于点提示的交互式分割
此示例使用点提示而非文本或边界框。点击图像标记正例点和负例点。每个点击点的中心成为引导坐标,SAM 3使用这些坐标优化分割。
首先加载适合点提示的模型和处理器。
# Select the device via Accelerate
device = Accelerator().device
# Load the point-prompt (tracker) model and processor
print("正在加载SAM 3模型...")
model = Sam3TrackerModel.from_pretrained("facebook/sam3").to(device)
processor = Sam3TrackerProcessor.from_pretrained("facebook/sam3")
print("模型加载成功!")
# Load the image
IMAGE_PATH = "/content/dog-2.jpeg"
raw_image = Image.open(IMAGE_PATH).convert("RGB")
创建交互界面。
def get_points_from_widget(widget):
    """Extract click points (box centers) from the widget annotations.

    Each small box drawn in the widget is reduced to its center pixel;
    "positive" boxes become foreground points, "negative" ones background.

    Returns:
        (positive_points, negative_points): lists of ``[x, y]`` centers.
    """
    positive_points = []
    negative_points = []
    for ann in widget.bboxes:
        x = int(ann["x"])
        y = int(ann["y"])
        w = int(ann["width"])
        h = int(ann["height"])
        # Use the center of the drawn box as the click coordinate.
        center_x = x + w // 2
        center_y = y + h // 2
        # Older widget versions use "label"; newer ones use "class".
        label = ann.get("label") or ann.get("class")
        if label == "positive":
            positive_points.append([center_x, center_y])
        elif label == "negative":
            negative_points.append([center_x, center_y])
    return positive_points, negative_points
# Create the widget used for point selection (tiny boxes act as clicks).
widget = BBoxWidget(
    image=pil_to_base64(raw_image),
    classes=["positive", "negative"]
)
# Create the UI buttons
segment_button = widgets.Button(
    description='🎯 分割',
    button_style='success',
    tooltip='使用标记点运行分割',
    icon='check',
    layout=widgets.Layout(width='150px', height='40px')
)
reset_button = widgets.Button(
    description='🔄 重置',
    button_style='warning',
    tooltip='清除所有点',
    icon='refresh',
    layout=widgets.Layout(width='150px', height='40px')
)
display(widgets.HBox([segment_button, reset_button]))
display(widget)
定义点击按钮后执行的分割函数。
def segment_from_widget(b=None):
    """Run point-prompted segmentation with the points marked in the widget.

    Args:
        b: the clicked button (unused; required by the ipywidgets
           ``on_click`` callback signature).
    """
    positive_points, negative_points = get_points_from_widget(widget)
    if not positive_points and not negative_points:
        print("⚠️ 请至少添加一个点(在图像上绘制小框)!")
        return
    # Merge the points and build matching 1/0 (foreground/background) labels.
    all_points = positive_points + negative_points
    all_labels = [1] * len(positive_points) + [0] * len(negative_points)
    print(f"\n🔄 正在运行分割...")
    print(f" • {len(positive_points)} 个正点: {positive_points}")
    print(f" • {len(negative_points)} 个负点: {negative_points}")
    # Nest once for batch and once for object: [batch, object, points, xy].
    input_points = [[all_points]]
    input_labels = [[all_labels]]  # [batch, object, labels]
    inputs = processor(
        images=raw_image,
        input_points=input_points,
        input_labels=input_labels,
        return_tensors="pt"
    ).to(device)
    # Run inference
    with torch.no_grad():
        outputs = model(**inputs)
    # Resize the predicted masks back to the original image resolution.
    masks = processor.post_process_masks(
        outputs.pred_masks.cpu(),
        inputs["original_sizes"]
    )[0]
    print(f"✅ 生成 {masks.shape[1]} 个掩码,形状为 {masks.shape}")
    # Visualize the result (requires a visualize_results helper to be defined).
    # visualize_results(masks, positive_points, negative_points)

segment_button.on_click(segment_from_widget)
定义重置函数。
def reset_widget(b=None):
    """Clear every annotation from the widget.

    Args:
        b: the clicked button (unused; required by the ``on_click`` API).
    """
    widget.bboxes = []
    print("🔄 重置!所有点已清除。")

reset_button.on_click(reset_widget)
总结
在本教程的第二部分,我们探索了SAM 3的高级功能,将其从一个强大的分割工具转变为一个灵活、交互式的视觉查询系统。演示了如何利用多种提示类型(文本、边界框和点),无论是单独使用还是组合使用,以实现精确、上下文感知的分割结果。
涵盖的高级工作流程包括:
- 在同一图像中同时分割多个概念
- 使用不同提示高效处理图像批次
- 使用正例边界框聚焦感兴趣区域
- 使用负例提示排除不需要的区域
- 结合文本和视觉提示进行选择性、细粒度的控制
- 构建完全交互式的分割界面,用户可通过绘制框或点击点实时查看结果
这些技术展示了SAM 3在现实世界应用中的多功能性,为大规模数据标注、智能视频编辑、增强现实体验和科学研究等任务提供了像素级精确的分割控制。