Deploying the Whisper Model with vLLM on SageMaker


Create the IAM Role Required by the Notebook

In the IAM console, click Create role.


For Service, select SageMaker, then create the role. For easier testing, attach the S3 and ECR admin policies to the newly created role (not recommended for production environments).

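If you prefer the SDK to the console, the same role setup can be scripted. A minimal sketch, assuming boto3 credentials with IAM permissions; the role name whisper-notebook-role is hypothetical, and the full-access policies mirror the console step above (again, not recommended for production):

import json
import boto3

iam = boto3.client('iam')

# Trust policy that lets SageMaker assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sagemaker.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

role = iam.create_role(
    RoleName='whisper-notebook-role',  # hypothetical name
    AssumeRolePolicyDocument=json.dumps(trust_policy)
)

# Attach SageMaker access plus broad S3/ECR access (testing only)
for policy_arn in [
    'arn:aws:iam::aws:policy/AmazonSageMakerFullAccess',
    'arn:aws:iam::aws:policy/AmazonS3FullAccess',
    'arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess',
]:
    iam.attach_role_policy(RoleName='whisper-notebook-role', PolicyArn=policy_arn)

print(role['Role']['Arn'])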

Create the Notebook


Be sure to change two settings: Volume Size and the IAM role.


Deploy the Whisper Inference Endpoint

Clone the sample repo:

cd SageMaker
git clone https://github.com/cuiIRISA/blazing-whisper-vllm-accelerator.git

Customize the notebook. The sample uses asynchronous inference; we need to change it to synchronous inference:

cd blazing-whisper-vllm-accelerator/
cp SageMaker_async.ipynb SageMaker_sync.ipynb

Open SageMaker_sync.ipynb and make the following changes:

  • Change the image_uri here to your own AWS account ID and the corresponding AWS region (a sketch for deriving both programmatically follows the create_model code below)
import boto3
import json
import time

# Set your ECR image URI
image_uri = "436103886277.dkr.ecr.ap-northeast-1.amazonaws.com/whisper-fastapi"
  • Change the instance_type here to match your requirements
model_name = "whisper"
instance_type = "ml.g5.2xlarge"
  • Change the HUGGING_FACE_HUB_TOKEN here to your own Hugging Face token
# Update your container definition with environment variables
container = {
    'Image': image_uri,  # Should be full ECR URI (e.g., account.dkr.ecr.region.amazonaws.com/whisper-fastapi:latest)
    'Mode': 'SingleModel',
    'Environment': {
        'HUGGING_FACE_HUB_TOKEN': 'hf_**********BeMfryIrriA',  # Your HF token
    }
}

# Create the model with the container
model_response = sm_client.create_model(
    ModelName=model_name,
    Containers=[container],  # Use `Containers` instead of `PrimaryContainer`
    ExecutionRoleArn=role_arn
)
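Rather than hard-coding the account ID and region, both can be resolved from the active session. A minimal sketch, assuming your credentials point at the account and region that host the ECR repo, and that the repo is named whisper-fastapi as in this walkthrough:

import boto3

# Resolve the account ID and region from the current session
session = boto3.session.Session()
region = session.region_name
account_id = boto3.client('sts').get_caller_identity()['Account']

image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/whisper-fastapi"
print(image_uri)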
  • Replace the following block:
# Create endpoint config
sm_client.create_endpoint_config(
    EndpointConfigName='whisper-async-config',
    AsyncInferenceConfig={
        'OutputConfig': {
            'S3OutputPath': 's3://sagemaker-ap-northeast-1-436103886277/transcript/outputs'
        }
    },
    ProductionVariants=[
        {
            'VariantName': 'whisper-variant',
            'ModelName': model_name,
            'InstanceType': instance_type,  # GPU instance
            'InitialInstanceCount': 1
        }
    ]
)

# Create async endpoint
sm_client.create_endpoint(
    EndpointName='whisper-async-endpoint',
    EndpointConfigName='whisper-async-config'
)

with this block:

# Create endpoint config for synchronous inference
sm_client.create_endpoint_config(
    EndpointConfigName='whisper-sync-config',  # changed config name
    ProductionVariants=[
        {
            'VariantName': 'whisper-variant',
            'ModelName': model_name,
            'InstanceType': instance_type,  # GPU instance
            'InitialInstanceCount': 1
        }
    ]
)


# Create sync endpoint
sm_client.create_endpoint(
    EndpointName='whisper-sync-endpoint',
    EndpointConfigName='whisper-sync-config'
)
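Endpoint creation takes several minutes. Before testing, you can block until the endpoint reaches InService with the standard boto3 waiter:

# Poll describe_endpoint until the endpoint is InService
waiter = sm_client.get_waiter('endpoint_in_service')
waiter.wait(EndpointName='whisper-sync-endpoint')
print(sm_client.describe_endpoint(EndpointName='whisper-sync-endpoint')['EndpointStatus'])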

The test code after the block above all uses asynchronous invocation, so it can be deleted.

Test the Whisper Inference Endpoint

Add the following block to test the endpoint:

import json
import boto3

# Read the audio file
with open("English_04.wav", "rb") as file:
    audio_data = file.read()

# Invoke the SageMaker endpoint (the synchronous endpoint created above)
client = boto3.client('runtime.sagemaker')
response = client.invoke_endpoint(
    EndpointName='whisper-sync-endpoint',
    ContentType="audio/wav",
    Body=audio_data
)

# Parse the result and extract the text
result = json.loads(response['Body'].read())
text = ' '.join([chunk['text'] for chunk in result['transcript']])

print(f"Transcript: {text}")

Load-Test Statistics

Test conditions:

  • Whisper model: whisper-large-v3
  • 50 short sentences, 30-40 English words each; audio length under 10 s
  • Statistics collected over multi-threaded concurrent requests (a sketch of this harness follows the list)
  • Short audio clips regenerated for every batch of requests, to rule out KV-cache effects
  • Latency window: from the client issuing the inference request to receiving the complete transcript
  • Both the SageMaker endpoint and the requesting client are in the Tokyo (ap-northeast-1) region
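The load-test harness itself is not part of the sample repo; the sketch below illustrates the methodology described above. The file names and the concurrency constant are assumptions for illustration:

import time
import boto3
from concurrent.futures import ThreadPoolExecutor

client = boto3.client('runtime.sagemaker')
CONCURRENCY = 4  # the tables below use 4 / 8 / 16

def transcribe(path):
    # Time one request end to end: send the audio, wait for the full transcript
    with open(path, "rb") as f:
        audio = f.read()
    start = time.time()
    client.invoke_endpoint(
        EndpointName='whisper-sync-endpoint',
        ContentType="audio/wav",
        Body=audio
    )
    return time.time() - start

# Hypothetical list of 50 freshly generated clips (regenerated per batch)
clips = [f"clip_{i:02d}.wav" for i in range(50)]
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(transcribe, clips))

print(f"average latency: {sum(latencies) / len(latencies):.2f}s")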
| Instance type | Concurrency | Avg. latency (s) |
| --- | --- | --- |
| g6e.2x | 4 | 0.62 |
| g6e.2x | 8 | 1.04 |
| g6e.2x | 16 | 2.11 |
| g6.2x | 4 | 0.92 |
| g6.2x | 8 | 1.40 |
| g6.2x | 16 | 2.79 |
| g5.2x | 4 | 0.67 |
| g5.2x | 8 | 1.22 |
| g5.2x | 16 | 2.51 |

Supplementary Spark-TTS load-test data:

Test conditions:

  • 50 short sentences, about 10 Chinese characters each
  • Statistics collected over multi-threaded concurrent requests
  • Sentences regenerated for every batch of requests, to rule out KV-cache effects
  • Latency window: from the client issuing the inference request to receiving the complete output
  • Deployed with the TensorRT framework; for deployment see: github.com/SparkAudio/…
| Instance type | Concurrency | Avg. latency (s) |
| --- | --- | --- |
| g6e.2x | 4 | 0.877 |
| g6e.2x | 8 | 1.459 |
| g6e.2x | 16 | 2.755 |
| g6.2x | 4 | 1.695 |
| g6.2x | 8 | 2.996 |
| g6.2x | 16 | 5.968 |
| g5.2x | 4 | 1.130 |
| g5.2x | 8 | 2.221 |
| g5.2x | 16 | 4.260 |