Deploying the Whisper Model with vLLM on SageMaker


Create the IAM Role Required by the Notebook

In the IAM console, click Create role.


For Service, select SageMaker, then create the role. For easier testing, attach the S3 and ECR admin policies to the newly created role (not recommended for production environments).

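If you prefer the SDK to the console, the same role setup can be scripted. A minimal sketch, assuming boto3 credentials with IAM permissions; the role name whisper-notebook-role is hypothetical, and the full-access policies mirror the console step above (again, not recommended for production):

import json
import boto3

iam = boto3.client('iam')

# Trust policy that lets SageMaker assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sagemaker.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

role = iam.create_role(
    RoleName='whisper-notebook-role',  # hypothetical name
    AssumeRolePolicyDocument=json.dumps(trust_policy)
)

# Attach SageMaker access plus broad S3/ECR access (testing only)
for policy_arn in [
    'arn:aws:iam::aws:policy/AmazonSageMakerFullAccess',
    'arn:aws:iam::aws:policy/AmazonS3FullAccess',
    'arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess',
]:
    iam.attach_role_policy(RoleName='whisper-notebook-role', PolicyArn=policy_arn)

print(role['Role']['Arn'])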

Create the Notebook


Be sure to change two settings: Volume Size and the IAM role.


Deploy the Whisper Inference Endpoint

Clone the sample repo:

cd SageMaker
git clone https://github.com/cuiIRISA/blazing-whisper-vllm-accelerator.git

Customize the notebook. The sample uses asynchronous inference; we need to change it to synchronous inference:

cd blazing-whisper-vllm-accelerator/
cp SageMaker_async.ipynb SageMaker_sync.ipynb

Open SageMaker_sync.ipynb and make the following changes:

  • Change the image_uri here to your own AWS account ID and the corresponding AWS region (a sketch for deriving both programmatically follows the create_model code below)
import boto3
import json
import time

# Set your ECR image URI
image_uri = "436103886277.dkr.ecr.ap-northeast-1.amazonaws.com/whisper-fastapi"
  • Change the instance_type here to match your requirements
model_name = "whisper"
instance_type = "ml.g5.2xlarge"
  • Change the HUGGING_FACE_HUB_TOKEN here to your own Hugging Face token
# Update your container definition with environment variables
container = {
    'Image': image_uri,  # Should be full ECR URI (e.g., account.dkr.ecr.region.amazonaws.com/whisper-fastapi:latest)
    'Mode': 'SingleModel',
    'Environment': {
        'HUGGING_FACE_HUB_TOKEN': 'hf_**********BeMfryIrriA',  # Your HF token
    }
}

# Create the model with the container
model_response = sm_client.create_model(
    ModelName=model_name,
    Containers=[container],  # Use `Containers` instead of `PrimaryContainer`
    ExecutionRoleArn=role_arn
)
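Rather than hard-coding the account ID and region, both can be resolved from the active session. A minimal sketch, assuming your credentials point at the account and region that host the ECR repo, and that the repo is named whisper-fastapi as in this walkthrough:

import boto3

# Resolve the account ID and region from the current session
session = boto3.session.Session()
region = session.region_name
account_id = boto3.client('sts').get_caller_identity()['Account']

image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/whisper-fastapi"
print(image_uri)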
  • Replace the following block:
# Create endpoint config
sm_client.create_endpoint_config(
    EndpointConfigName='whisper-async-config',
    AsyncInferenceConfig={
        'OutputConfig': {
            'S3OutputPath': 's3://sagemaker-ap-northeast-1-436103886277/transcript/outputs'
        }
    },
    ProductionVariants=[
        {
            'VariantName': 'whisper-variant',
            'ModelName': model_name,
            'InstanceType': instance_type,  # GPU instance
            'InitialInstanceCount': 1
        }
    ]
)

# Create async endpoint
sm_client.create_endpoint(
    EndpointName='whisper-async-endpoint',
    EndpointConfigName='whisper-async-config'
)

with this block:

# Create endpoint config for synchronous inference
sm_client.create_endpoint_config(
    EndpointConfigName='whisper-sync-config',  # changed config name
    ProductionVariants=[
        {
            'VariantName': 'whisper-variant',
            'ModelName': model_name,
            'InstanceType': instance_type,  # GPU instance
            'InitialInstanceCount': 1
        }
    ]
)


# Create sync endpoint
sm_client.create_endpoint(
    EndpointName='whisper-sync-endpoint',
    EndpointConfigName='whisper-sync-config'
)
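Endpoint creation takes several minutes. Before testing, you can block until the endpoint reaches InService with the standard boto3 waiter:

# Poll describe_endpoint until the endpoint is InService
waiter = sm_client.get_waiter('endpoint_in_service')
waiter.wait(EndpointName='whisper-sync-endpoint')
print(sm_client.describe_endpoint(EndpointName='whisper-sync-endpoint')['EndpointStatus'])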

The test code after the block above all uses asynchronous invocation, so it can be deleted.

Test the Whisper Inference Endpoint

Add the following block to test the endpoint:

import json
import boto3

# Read the audio file
with open("English_04.wav", "rb") as file:
    audio_data = file.read()

# Invoke the SageMaker endpoint (the synchronous endpoint created above)
client = boto3.client('runtime.sagemaker')
response = client.invoke_endpoint(
    EndpointName='whisper-sync-endpoint',
    ContentType="audio/wav",
    Body=audio_data
)

# Parse the result and extract the text
result = json.loads(response['Body'].read())
text = ' '.join([chunk['text'] for chunk in result['transcript']])

print(f"Transcript: {text}")

Load-Test Statistics

Test conditions:

  • Whisper model: whisper-large-v3
  • 50 short sentences, 30-40 English words each; audio length under 10 s
  • Statistics collected over multi-threaded concurrent requests (a sketch of this harness follows the list)
  • Short audio clips regenerated for every batch of requests, to rule out KV-cache effects
  • Latency window: from the client issuing the inference request to receiving the complete transcript
  • Both the SageMaker endpoint and the requesting client are in the Tokyo (ap-northeast-1) region
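The load-test harness itself is not part of the sample repo; the sketch below illustrates the methodology described above. The file names and the concurrency constant are assumptions for illustration:

import time
import boto3
from concurrent.futures import ThreadPoolExecutor

client = boto3.client('runtime.sagemaker')
CONCURRENCY = 4  # the tables below use 4 / 8 / 16

def transcribe(path):
    # Time one request end to end: send the audio, wait for the full transcript
    with open(path, "rb") as f:
        audio = f.read()
    start = time.time()
    client.invoke_endpoint(
        EndpointName='whisper-sync-endpoint',
        ContentType="audio/wav",
        Body=audio
    )
    return time.time() - start

# Hypothetical list of 50 freshly generated clips (regenerated per batch)
clips = [f"clip_{i:02d}.wav" for i in range(50)]
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(transcribe, clips))

print(f"average latency: {sum(latencies) / len(latencies):.2f}s")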
| Instance type | Concurrency | Avg. latency (s) |
| --- | --- | --- |
| g6e.2x | 4 | 0.62 |
| g6e.2x | 8 | 1.04 |
| g6e.2x | 16 | 2.11 |
| g6.2x | 4 | 0.92 |
| g6.2x | 8 | 1.40 |
| g6.2x | 16 | 2.79 |
| g5.2x | 4 | 0.67 |
| g5.2x | 8 | 1.22 |
| g5.2x | 16 | 2.51 |

Supplementary Spark-TTS load-test data:

Test conditions:

  • 50 short sentences, about 10 Chinese characters each
  • Statistics collected over multi-threaded concurrent requests
  • Sentences regenerated for every batch of requests, to rule out KV-cache effects
  • Latency window: from the client issuing the inference request to receiving the complete output
  • Deployed with the TensorRT framework; for deployment see: github.com/SparkAudio/…
| Instance type | Concurrency | Avg. latency (s) |
| --- | --- | --- |
| g6e.2x | 4 | 0.877 |
| g6e.2x | 8 | 1.459 |
| g6e.2x | 16 | 2.755 |
| g6.2x | 4 | 1.695 |
| g6.2x | 8 | 2.996 |
| g6.2x | 16 | 5.968 |
| g5.2x | 4 | 1.130 |
| g5.2x | 8 | 2.221 |
| g5.2x | 16 | 4.260 |