Create the IAM role required for the notebook
Click Create role.
Select SageMaker as the service, then create the role.
For easier testing, attach S3 and ECR admin policies to the newly created role (not recommended for production environments).
Create the notebook
Pay attention to two settings: Volume Size and the IAM role (use the role created above).
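The console steps above can also be scripted. A minimal sketch of the trust policy such a role needs, assuming you script it with boto3 (the role name below is illustrative, not from the original setup):

```python
import json

# Trust policy allowing the SageMaker service to assume this role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

print(json.dumps(trust_policy, indent=2))

# With boto3 this document would be passed as:
#   boto3.client("iam").create_role(
#       RoleName="whisper-notebook-role",  # illustrative name
#       AssumeRolePolicyDocument=json.dumps(trust_policy),
#   )
# and the S3 / ECR policies attached afterwards with iam.attach_role_policy(...).
```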
Deploy the Whisper inference endpoint
Clone the sample repo:
cd Sagemaker
git clone https://github.com/cuiIRISA/blazing-whisper-vllm-accelerator.git
Customize the notebook. The sample uses asynchronous inference; we need to change it to synchronous inference.
cd blazing-whisper-vllm-accelerator/
cp SageMaker_async.ipynb SageMaker_sync.ipynb
Open SageMaker_sync.ipynb and make the following changes:
- Change the image_uri here to your own AWS account ID and the corresponding AWS region.
import boto3
import json
import time
# Set your ECR image URI
image_uri = "436103886277.dkr.ecr.ap-northeast-1.amazonaws.com/whisper-fastapi"
- Change instance_type here according to your needs.
model_name = "whisper"
instance_type = "ml.g5.2xlarge"
- Change HUGGING_FACE_HUB_TOKEN here to your own Hugging Face token.
# Update your container definition with environment variables
container = {
    'Image': image_uri,  # Should be full ECR URI (e.g., account.dkr.ecr.region.amazonaws.com/whisper-fastapi:latest)
    'Mode': 'SingleModel',
    'Environment': {
        'HUGGING_FACE_HUB_TOKEN': 'hf_**********BeMfryIrriA',  # Your HF token
    }
}

# Create the model with the container
model_response = sm_client.create_model(
    ModelName=model_name,
    Containers=[container],  # Use `Containers` instead of `PrimaryContainer`
    ExecutionRoleArn=role_arn
)
- Replace the following block:
# Create endpoint config
sm_client.create_endpoint_config(
    EndpointConfigName='whisper-async-config',
    AsyncInferenceConfig={
        'OutputConfig': {
            'S3OutputPath': 's3://sagemaker-ap-northeast-1-436103886277/transcript/outputs'
        }
    },
    ProductionVariants=[
        {
            'VariantName': 'whisper-variant',
            'ModelName': model_name,
            'InstanceType': instance_type,  # GPU instance
            'InitialInstanceCount': 1
        }
    ]
)

# Create async endpoint
sm_client.create_endpoint(
    EndpointName='whisper-async-endpoint',
    EndpointConfigName='whisper-async-config'
)
with this block:
# Create endpoint config for synchronous inference
sm_client.create_endpoint_config(
    EndpointConfigName='whisper-sync-config',  # Changed config name
    ProductionVariants=[
        {
            'VariantName': 'whisper-variant',
            'ModelName': model_name,
            'InstanceType': instance_type,  # GPU instance
            'InitialInstanceCount': 1
        }
    ]
)

# Create sync endpoint
sm_client.create_endpoint(
    EndpointName='whisper-sync-endpoint',
    EndpointConfigName='whisper-sync-config'
)
All test code after the block above can be deleted, since it exercises the asynchronous path.
Test the Whisper inference endpoint
Add the following block for testing:
import json
import boto3

# Read the audio file
with open("English_04.wav", "rb") as file:
    audio_data = file.read()

# Invoke the SageMaker endpoint
client = boto3.client('runtime.sagemaker')
response = client.invoke_endpoint(
    EndpointName='whisper-sync-endpoint',
    ContentType="audio/wav",
    Body=audio_data
)

# Parse the result and extract the text
result = json.loads(response['Body'].read())
text = ' '.join([chunk['text'] for chunk in result['transcript']])
print(f"Transcript: {text}")
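The benchmark numbers below were gathered with multithreaded concurrent requests. A sketch of such a harness, where `invoke` is a stand-in for the real `client.invoke_endpoint` call (the helper and its names are ours):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_latencies(invoke, payloads, concurrency):
    """Fire invoke(payload) from `concurrency` threads; return per-request latencies."""
    def timed(payload):
        start = time.perf_counter()
        invoke(payload)
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed, payloads))

# Against the real endpoint, `invoke` would wrap the SageMaker call, e.g.
#   lambda body: client.invoke_endpoint(EndpointName='whisper-sync-endpoint',
#                                       ContentType='audio/wav', Body=body)
```

Average latency for a run is then `sum(lats) / len(lats)`.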
Load-test statistics
Test conditions:
- Whisper model: whisper-large-v3
- 50 short sentences, 30–40 English words each; audio clips under 10 s
- Statistics collected with multithreaded concurrent requests
- The short audio clips are regenerated for every batch of requests, to rule out KV-cache effects
- Latency measured from the client issuing the inference request until the complete transcript is received
- Both the SageMaker endpoint and the requesting client are in the Tokyo (ap-northeast-1) region
| Instance type | Concurrency | Avg latency (s) |
|---|---|---|
| g6e.2x | 4 | 0.62 |
| g6e.2x | 8 | 1.04 |
| g6e.2x | 16 | 2.11 |
| g6.2x | 4 | 0.92 |
| g6.2x | 8 | 1.40 |
| g6.2x | 16 | 2.79 |
| g5.2x | 4 | 0.67 |
| g5.2x | 8 | 1.22 |
| g5.2x | 16 | 2.51 |
Supplementary spark-tts load-test data:
Test conditions:
- 50 short sentences, around 10 Chinese characters each
- Statistics collected with multithreaded concurrent requests
- The short sentences are regenerated for every batch of requests, to rule out KV-cache effects
- Latency measured from the client issuing the inference request until the complete response is received
- Deployed with the TensorRT framework; for deployment see github.com/SparkAudio/…
| Instance type | Concurrency | Avg latency (s) |
|---|---|---|
| g6e.2x | 4 | 0.877 |
| g6e.2x | 8 | 1.459 |
| g6e.2x | 16 | 2.755 |
| g6.2x | 4 | 1.695 |
| g6.2x | 8 | 2.996 |
| g6.2x | 16 | 5.968 |
| g5.2x | 4 | 1.130 |
| g5.2x | 8 | 2.221 |
| g5.2x | 16 | 4.260 |