This post mainly compares the user experience of Amazon SageMaker and Kubeflow, looking at where they are alike and where they differ.
For a quick hands-on feel, Studio or a Notebook is enough.
Studio integrates features such as Experiments and AutoML, so we can operate them without code; we can also view Experiment results and other visualizations directly inside Studio.
However, Studio is slow to start, and it is ultimately a UI wrapper over the SDKs for these same features; so we can walk through the whole flow in a Notebook first.
This week's main task is to try out, in Amazon SageMaker, the features Kubeflow already offers, and compare the two.
SageMaker PyTorch MNIST
One difference between SageMaker and Kubeflow:
in Kubeflow, we can mount the same PV onto every component of a pipeline so that they run in a shared file system, or, as Elyra does, configure one MinIO bucket for the whole pipeline as a common file-system workspace;
in SageMaker, we create a session, set a default bucket, and then upload the dataset used for training to that bucket.
Download, transform, and upload the data to S3
Set the default S3 bucket URI for the current SageMaker session, create a new folder prefix, and upload the dataset into that folder.
import sagemaker
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/DEMO-pytorch-mnist"
role = sagemaker.get_execution_role()
### Download the data
from torchvision.datasets import MNIST
from torchvision import transforms
MNIST.mirrors = ["https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/"]
MNIST(
"data",
download=True,
transform=transforms.Compose(
[transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
),
)
### Upload the data to Amazon S3
inputs = sagemaker_session.upload_data(path="data", bucket=bucket, key_prefix=prefix)
print("input spec (in this case, just an S3 path): {}".format(inputs))
To be honest, this API is really handy; we could take it as a reference and wrap the native MinIO Python SDK into a similar high-level API such as Client().upload_data(path, bucket, key_prefix).
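As a rough sketch of that idea: a minimal wrapper over the native minio package. The Client class, its upload_data method, and the endpoint/credential defaults here are hypothetical placeholders, not an existing API:
import os
from minio import Minio
class Client:
    # Hypothetical high-level wrapper mimicking sagemaker_session.upload_data()
    def __init__(self, endpoint="localhost:9000", access_key="minioadmin", secret_key="minioadmin"):
        self._client = Minio(endpoint, access_key=access_key, secret_key=secret_key, secure=False)
    def upload_data(self, path, bucket, key_prefix=""):
        # Upload a single file, or a whole directory tree, under key_prefix
        if not self._client.bucket_exists(bucket):
            self._client.make_bucket(bucket)
        if os.path.isfile(path):
            files = [(path, os.path.basename(path))]
        else:
            files = [
                (os.path.join(root, name), os.path.relpath(os.path.join(root, name), path))
                for root, _, names in os.walk(path)
                for name in names
            ]
        for local_path, rel_path in files:
            object_name = "/".join(p for p in [key_prefix, rel_path.replace(os.sep, "/")] if p)
            self._client.fput_object(bucket, object_name, local_path)
        return "s3://{}/{}".format(bucket, key_prefix)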
Training script
As with Kubeflow, prepare a training script that can run on its own:
### ------------------------ mnist.py --------------------------
# Based on https://github.com/pytorch/examples/blob/master/mnist/main.py
import argparse
import json
import logging
import os
import sys
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data
import torch.utils.data.distributed
from torchvision import datasets, transforms
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))
class Net(nn.Module):
def __init__(self):
...
def forward(self, x):
...
def _get_train_data_loader(batch_size, training_dir, is_distributed, **kwargs):
logger.info("Get train data loader")
dataset = datasets.MNIST(
training_dir,
train=True,
transform=transforms.Compose(
[transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
),
)
train_sampler = (
torch.utils.data.distributed.DistributedSampler(dataset) if is_distributed else None
)
return torch.utils.data.DataLoader(
dataset,
batch_size=batch_size,
shuffle=train_sampler is None,
sampler=train_sampler,
**kwargs
)
def _get_test_data_loader(test_batch_size, training_dir, **kwargs):
logger.info("Get test data loader")
return torch.utils.data.DataLoader(
datasets.MNIST(
training_dir,
train=False,
transform=transforms.Compose(
[transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
),
),
batch_size=test_batch_size,
shuffle=True,
**kwargs
)
def _average_gradients(model):
# Gradient averaging.
size = float(dist.get_world_size())
for param in model.parameters():
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
param.grad.data /= size
def train(args):
is_distributed = len(args.hosts) > 1 and args.backend is not None
logger.debug("Distributed training - {}".format(is_distributed))
use_cuda = args.num_gpus > 0
logger.debug("Number of gpus available - {}".format(args.num_gpus))
kwargs = {"num_workers": 1, "pin_memory": True} if use_cuda else {}
device = torch.device("cuda" if use_cuda else "cpu")
if is_distributed:
# Initialize the distributed environment.
world_size = len(args.hosts)
os.environ["WORLD_SIZE"] = str(world_size)
host_rank = args.hosts.index(args.current_host)
os.environ["RANK"] = str(host_rank)
dist.init_process_group(backend=args.backend, rank=host_rank, world_size=world_size)
logger.info(
"Initialized the distributed environment: '{}' backend on {} nodes. ".format(
args.backend, dist.get_world_size()
)
+ "Current host rank is {}. Number of gpus: {}".format(dist.get_rank(), args.num_gpus)
)
# set the seed for generating random numbers
torch.manual_seed(args.seed)
if use_cuda:
torch.cuda.manual_seed(args.seed)
train_loader = _get_train_data_loader(args.batch_size, args.data_dir, is_distributed, **kwargs)
test_loader = _get_test_data_loader(args.test_batch_size, args.data_dir, **kwargs)
logger.debug(
"Processes {}/{} ({:.0f}%) of train data".format(
len(train_loader.sampler),
len(train_loader.dataset),
100.0 * len(train_loader.sampler) / len(train_loader.dataset),
)
)
logger.debug(
"Processes {}/{} ({:.0f}%) of test data".format(
len(test_loader.sampler),
len(test_loader.dataset),
100.0 * len(test_loader.sampler) / len(test_loader.dataset),
)
)
model = Net().to(device)
if is_distributed and use_cuda:
# multi-machine multi-gpu case
model = torch.nn.parallel.DistributedDataParallel(model)
else:
# single-machine multi-gpu case or single-machine or multi-machine cpu case
model = torch.nn.DataParallel(model)
optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)
for epoch in range(1, args.epochs + 1):
model.train()
for batch_idx, (data, target) in enumerate(train_loader, 1):
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
output = model(data)
loss = F.nll_loss(output, target)
loss.backward()
if is_distributed and not use_cuda:
# average gradients manually for multi-machine cpu case only
_average_gradients(model)
optimizer.step()
if batch_idx % args.log_interval == 0:
logger.info(
"Train Epoch: {} [{}/{} ({:.0f}%)] Loss: {:.6f}".format(
epoch,
batch_idx * len(data),
len(train_loader.sampler),
100.0 * batch_idx / len(train_loader),
loss.item(),
)
)
test(model, test_loader, device)
save_model(model, args.model_dir)
def test(model, test_loader, device):
model.eval()
test_loss = 0
correct = 0
with torch.no_grad():
for data, target in test_loader:
data, target = data.to(device), target.to(device)
output = model(data)
            test_loss += F.nll_loss(output, target, reduction="sum").item()  # sum up batch loss
pred = output.max(1, keepdim=True)[1] # get the index of the max log-probability
correct += pred.eq(target.view_as(pred)).sum().item()
test_loss /= len(test_loader.dataset)
logger.info(
"Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n".format(
test_loss, correct, len(test_loader.dataset), 100.0 * correct / len(test_loader.dataset)
)
)
# model_fn must be defined explicitly so that estimator.deploy() can load the model
def model_fn(model_dir):
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.DataParallel(Net())
with open(os.path.join(model_dir, "model.pth"), "rb") as f:
model.load_state_dict(torch.load(f))
return model.to(device)
# The deploy function's arguments let us set the number and type of instances used for the endpoint. They don't have to match the values used for training: we can train on a set of GPU instances and then deploy on CPU instances, but then we must make sure the model is returned or saved as a CPU model.
# For that reason, it is recommended to return or save the model as a CPU model.
def save_model(model, model_dir):
logger.info("Saving the model.")
path = os.path.join(model_dir, "model.pth")
torch.save(model.cpu().state_dict(), path)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
    # Training hyperparameters
parser.add_argument(
"--batch-size",
type=int,
default=64,
metavar="N",
help="input batch size for training (default: 64)",
)
parser.add_argument(
"--test-batch-size",
type=int,
default=1000,
metavar="N",
help="input batch size for testing (default: 1000)",
)
parser.add_argument(
"--epochs",
type=int,
default=10,
metavar="N",
help="number of epochs to train (default: 10)",
)
parser.add_argument(
"--lr", type=float, default=0.01, metavar="LR", help="learning rate (default: 0.01)"
)
parser.add_argument(
"--momentum", type=float, default=0.5, metavar="M", help="SGD momentum (default: 0.5)"
)
parser.add_argument("--seed", type=int, default=1, metavar="S", help="random seed (default: 1)")
parser.add_argument(
"--log-interval",
type=int,
default=100,
metavar="N",
help="how many batches to wait before logging training status",
)
parser.add_argument(
"--backend",
type=str,
default=None,
help="backend for distributed training (tcp, gloo on cpu and gloo, nccl on gpu)",
)
    # SageMaker environment parameters
parser.add_argument("--hosts", type=list, default=json.loads(os.environ["SM_HOSTS"]))
parser.add_argument("--current-host", type=str, default=os.environ["SM_CURRENT_HOST"])
parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
parser.add_argument("--data-dir", type=str, default=os.environ["SM_CHANNEL_TRAINING"])
parser.add_argument("--num-gpus", type=int, default=os.environ["SM_NUM_GPUS"])
train(parser.parse_args())
Note that here we also need to read some properties of the training environment from environment variables:
- SM_HOSTS: a JSON-encoded list of all hosts; in PyTorch, its length equals WORLD_SIZE;
- SM_CURRENT_HOST: the name of the current container; in PyTorch, its index in the host list equals RANK;
- SM_MODEL_DIR: the path where the model is saved; the model is uploaded to S3 afterwards;
- SM_NUM_GPUS: the number of GPUs available to the current container;
Note: PyTorch distributed training needs WORLD_SIZE and RANK for dist.init_process_group(backend, rank, world_size).
If an input channel named training is used when calling the PyTorch Estimator's fit() method, SM_CHANNEL_[channel_name] is set as follows:
- SM_CHANNEL_TRAINING: the path where the data from the training input channel is stored;
The training script loads the data from the training channel's path, configures training with the hyperparameters, trains the model, and saves it to model_dir so it can be hosted later. The hyperparameters are passed to the script as command-line arguments and can be retrieved with an argparse.ArgumentParser instance.
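Before handing the script to SageMaker, one cheap sanity check is to fake these environment variables locally and run the script directly; a minimal sketch, assuming the MNIST data was already downloaded into ./data in the earlier step:
import json
import os
import subprocess
env = dict(
    os.environ,
    SM_HOSTS=json.dumps(["algo-1"]),  # a single-host "cluster"
    SM_CURRENT_HOST="algo-1",
    SM_MODEL_DIR="/tmp/model",  # save_model() writes model.pth here
    SM_CHANNEL_TRAINING="data",  # local folder standing in for the training channel
    SM_NUM_GPUS="0",
)
os.makedirs("/tmp/model", exist_ok=True)
subprocess.run(["python", "mnist.py", "--epochs", "1"], env=env, check=True)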
Training in SageMaker
from sagemaker.pytorch import PyTorch
estimator = PyTorch(
entry_point="mnist.py",
role=role,
py_version="py38",
framework_version="1.11.0",
instance_count=2,
instance_type="ml.c5.2xlarge",
hyperparameters={"epochs": 1, "backend": "gloo"},
)
estimator.fit({"training": inputs})
sagemaker.pytorch.estimator.PyTorch is the Estimator that SageMaker provides for PyTorch. Its main parameters:
- entry_point: the entry script for training;
- py_version, framework_version: the Python and PyTorch versions; SageMaker allocates compute resources that satisfy these version requirements;
- instance_count, instance_type: the number and type of compute instances;
- hyperparameters: the hyperparameters passed to the training script;
- image_uri: if specified, the Estimator uses this image as the runtime for training and deployment, and py_version/framework_version are ignored; image_uri must be an ECR URL or a Docker Hub image;
Deploy and test the model
In mnist.py, model_fn must be defined explicitly by us, while input_fn, predict_fn, output_fn, and transform_fn already have default implementations in sagemaker-pytorch-containers.
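If the defaults don't fit, those handlers can be overridden in the same script. A minimal sketch of their documented signatures (the JSON handling below is illustrative, not a copy of what the container does by default):
import json
import torch
def input_fn(request_body, request_content_type):
    # Deserialize the request body into a tensor
    if request_content_type == "application/json":
        return torch.tensor(json.loads(request_body), dtype=torch.float32)
    raise ValueError("Unsupported content type: {}".format(request_content_type))
def predict_fn(input_object, model):
    # Apply the model returned by model_fn to the deserialized input
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    with torch.no_grad():
        return model(input_object.to(device))
def output_fn(prediction, content_type):
    # Serialize the prediction for the response
    return json.dumps(prediction.cpu().numpy().tolist())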
### Deploy the Predictor
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")
### Generate test data
import gzip
import numpy as np
import random
import os
data_dir = "data/MNIST/raw"
with gzip.open(os.path.join(data_dir, "t10k-images-idx3-ubyte.gz"), "rb") as f:
images = np.frombuffer(f.read(), np.uint8, offset=16).reshape(-1, 28, 28).astype(np.float32)
mask = random.sample(range(len(images)), 16) # randomly select some of the test images
mask = np.array(mask, dtype=int)  # np.int is deprecated; use the builtin int
data = images[mask]
### Test the Predictor
response = predictor.predict(np.expand_dims(data, axis=1))
print("Raw prediction result:")
print(response)
print()
labeled_predictions = list(zip(range(10), response[0]))
print("Labeled predictions: ")
print(labeled_predictions)
print()
labeled_predictions.sort(key=lambda label_and_prob: 1.0 - label_and_prob[1])
print("Most likely answer: {}".format(labeled_predictions[0]))
### Delete the deployed endpoint and release resources
sagemaker_session.delete_endpoint(endpoint_name=predictor.endpoint_name)
SageMaker automatic model tuning
SageMaker's tuning strategies:
- Grid Search
- Random Search
- Bayesian optimization
- Hyperband (which can stop under-performing training jobs early)
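In the Python SDK, the strategy is picked through the HyperparameterTuner's strategy argument (Bayesian is the default). A minimal sketch, assuming estimator, objective_metric_name, hyperparameter_ranges, and metric_definitions are defined as in the walkthrough below; note that some strategy values depend on the SDK version:
from sagemaker.tuner import HyperparameterTuner
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    strategy="Random",  # "Bayesian" is the default; newer SDK versions also accept "Hyperband" and "Grid"
    max_jobs=9,
    max_parallel_jobs=3,
)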
Setting up hyperparameter tuning for SageMaker PyTorch MNIST
The steps are the same as above. After finishing
- Download, transform, and upload the data to S3
- Training script
we set up the tuning step where Training in SageMaker used to be:
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import (
IntegerParameter,
CategoricalParameter,
ContinuousParameter,
HyperparameterTuner,
)
estimator = PyTorch(
entry_point="mnist.py",
role=role,
py_version="py3",
framework_version="1.8.0",
instance_count=1,
instance_type="ml.c5.2xlarge",
hyperparameters={"epochs": 1, "backend": "gloo"},
)
# ContinuousParameter is for continuous parameters: a random float between min and max
# IntegerParameter is similar, but samples integer values
# CategoricalParameter is for discrete parameters: it picks a value from a discrete set
hyperparameter_ranges = {
"lr": ContinuousParameter(0.001, 0.1),
"batch-size": CategoricalParameter([32, 64, 128, 256, 512]),
}
Similar to Kubeflow's Katib, we have to specify the objective metric to optimize and its definition, which includes the regular expression needed to extract the metric from the training job's CloudWatch logs.
We use average test loss as the objective metric and set objective_type to Minimize, so that tuning searches for hyperparameter settings that shrink it.
In mnist.py, each test() pass logs: logger.info("Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n".format(test_loss, correct, len(test_loader.dataset), 100.0 * correct / len(test_loader.dataset)));
so we set the Regex to: "Test set: Average loss: ([0-9\.]+)"
objective_metric_name = "average test loss"
objective_type = "Minimize"
metric_definitions = [{"Name": "average test loss", "Regex": "Test set: Average loss: ([0-9\.]+)"}]
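A quick local check that the regex really matches the log line is cheap insurance against a typo burning a whole tuning run; the sample line below is made up but follows the script's log format:
import re
sample = "Test set: Average loss: 0.2073, Accuracy: 9334/10000 (93%)"
match = re.search(r"Test set: Average loss: ([0-9\.]+)", sample)
assert match is not None
print(match.group(1))  # -> 0.2073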
Finally, pass the Estimator into the HyperparameterTuner:
tuner = HyperparameterTuner(
estimator,
objective_metric_name,
hyperparameter_ranges,
metric_definitions,
    max_jobs=9,  # total number of training jobs to run
    max_parallel_jobs=3,  # maximum number of training jobs that can run in parallel
objective_type=objective_type,
)
tuner.fit({"training": inputs})
We can then follow the tuning job's progress under SageMaker console -> Training -> Hyperparameter tuning jobs.
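The same information can also be pulled from the SDK; assuming the tuner above has finished fitting, something like:
# One row per training job, with its hyperparameters and final objective value
df = tuner.analytics().dataframe()
print(df.sort_values("FinalObjectiveValue").head())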
We can also deploy the model with the best objective metric straight to an endpoint:
# Deploy the model
predictor = tuner.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")
# Delete the endpoint and release resources (in SDK v2 this is done through the Predictor)
predictor.delete_endpoint()
SageMaker Pipelines
It turns out that viewing the pipeline graph requires logging into Studio, so Studio got pressed into service ahead of schedule...
Studio layout:
You could call it another upgrade on top of JupyterLab, with quite a few SageMaker features integrated into the left sidebar.
Orchestrating SageMaker Pipelines
When orchestrating SageMaker Pipelines, one thing matters above all:
the components of a SageMaker Pipeline are independent of one another, so each component's inputs come down from S3 and its outputs must go back up to S3. For every component we therefore have to define:
- the S3 location of the component's input, and the path where it is saved inside the component's container after download;
- the path of the component's output inside the container, and the S3 location where it is saved after upload;
Without further ado, let's walk through a pipeline.
First, an overview of the whole pipeline's structure.
We will:
- download and preprocess the data
- use the preprocessed data for hyperparameter tuning
- take the best parameters from tuning and evaluate the resulting model on the test set
- register the model in the model registry if its evaluation result exceeds a threshold we set;
SageMaker initialization
!pip install -U sagemaker==2.72.1 --quiet
### Initialization
import sagemaker
import sagemaker.session
session = sagemaker.session.Session()
region = session.boto_region_name
role = sagemaker.get_execution_role()
bucket = session.default_bucket()
prefix = "paramaterized" # Prefix to S3 artifacts
pipeline_name = "DEMO-parameterized-pipeline" # SageMaker Pipeline name
credit_model_group = "DEMO-credit-registry"
churn_model_group = "DEMO-churn-registry"
### Download the data used later
!aws s3 cp s3://sagemaker-sample-files/datasets/tabular/uci_statlog_german_credit_data/german_credit_data.csv credit_risk/german_credit_data.csv
!aws s3 cp s3://sagemaker-sample-files/datasets/tabular/synthetic/churn.csv customer_churn/churn-dataset.csv
Defining pipeline parameters
A note here: the pipeline parameters were actually added retroactively while building the components, whenever we ran into a value we didn't want to hard-code; for a clearer article structure, I'm putting them at the very beginning.
SageMaker pipeline parameters have to be wrapped in dedicated classes:
from sagemaker.workflow.parameters import (
ParameterInteger,
ParameterString,
ParameterFloat,
)
# which registry to register the model into
model_registry_package = ParameterString(name="ModelGroup", default_value="default-registry")
# URI of the input data
input_data = ParameterString(name="InputData", default_value="s3://{}/uri/data.csv".format(bucket))
# URI of the preprocessing script
preprocess_script = ParameterString(
name="PreprocessScript", default_value="s3://{}/uri/preprocess.py".format(bucket)
)
# URI of the model evaluation script
evaluate_script = ParameterString(
name="EvaluateScript", default_value="s3://{}/uri/evaluate.py".format(bucket)
)
# total number of jobs to run during tuning
max_training_jobs = ParameterInteger(name="MaxiumTrainingJobs", default_value=1)
# maximum number of parallel jobs during tuning
max_parallel_training_jobs = ParameterInteger(name="MaxiumParallelTrainingJobs", default_value=1)
# accuracy threshold that decides whether the model is registered
accuracy_condition_threshold = ParameterFloat(name="AccuracyConditionThreshold", default_value=0.7)
# instance type of the container used for data preprocessing
processing_instance_type = ParameterString(
name="ProcessingInstanceType", default_value="ml.m5.large"
)
# instance type of the containers used for training and tuning
training_instance_type = ParameterString(name="TrainingInstanceType", default_value="ml.m5.xlarge")
Building the data preprocessing component
The first step is preprocessing the data; write and save the two preprocessing scripts:
%%writefile customer_churn/preprocess.py
import logging
import numpy as np
import pandas as pd
import os
logger = logging.getLogger()
logger.setLevel(logging.INFO)
if __name__ == "__main__":
logger.info("Starting preprocessing.")
input_data_path = os.path.join("/opt/ml/processing/input", "churn-dataset.csv")
try:
os.makedirs("/opt/ml/processing/train")
os.makedirs("/opt/ml/processing/validation")
os.makedirs("/opt/ml/processing/test")
except:
pass
logger.info("Reading input data")
    # Process the data
df = pd.read_csv(input_data_path)
df = df.drop(["Phone"], axis=1)
df["Area Code"] = df["Area Code"].astype(object)
df = df.drop(["Day Charge", "Eve Charge", "Night Charge", "Intl Charge"], axis=1)
model_data = pd.get_dummies(df)
model_data = pd.concat(
[
model_data["Churn?_True."],
model_data.drop(["Churn?_False.", "Churn?_True."], axis=1),
],
axis=1,
)
    # Split the dataset
train_data, validation_data, test_data = np.split(
model_data.sample(frac=1, random_state=1729),
[int(0.7 * len(model_data)), int(0.9 * len(model_data))],
)
    # Save the results
train_data.to_csv("/opt/ml/processing/train/train.csv", header=False, index=False)
validation_data.to_csv(
"/opt/ml/processing/validation/validation.csv", header=False, index=False
)
test_data.to_csv("/opt/ml/processing/test/test.csv", header=False, index=False)
What this script does: inside the container, it reads the csv from /opt/ml/processing/input/churn-dataset.csv, processes it, splits it into training, validation, and test sets, and saves them to /opt/ml/processing/train/train.csv, /opt/ml/processing/validation/validation.csv, and /opt/ml/processing/test/test.csv.
%%writefile credit_risk/preprocess.py
import logging
import numpy as np
import pandas as pd
import os
logger = logging.getLogger()
logger.setLevel(logging.INFO)
if __name__ == "__main__":
logger.info("Starting preprocessing.")
input_data_path = os.path.join("/opt/ml/processing/input", "german_credit_data.csv")
try:
os.makedirs("/opt/ml/processing/train")
os.makedirs("/opt/ml/processing/validation")
os.makedirs("/opt/ml/processing/test")
except:
pass
logger.info("Reading input data")
data = pd.read_csv(input_data_path, sep=",")
model_data = pd.get_dummies(data)
    # Split the dataset
train_data, validation_data, test_data = np.split(
model_data.sample(frac=1, random_state=1729),
[int(0.7 * len(model_data)), int(0.9 * len(model_data))],
)
    # Save the results
pd.concat(
[
train_data["risk_high risk"],
train_data.drop(["risk_low risk", "risk_high risk"], axis=1),
],
axis=1,
).to_csv("/opt/ml/processing/train/train.csv", index=False, header=False)
pd.concat(
[
validation_data["risk_high risk"],
validation_data.drop(["risk_low risk", "risk_high risk"], axis=1),
],
axis=1,
).to_csv("/opt/ml/processing/validation/validation.csv", index=False, header=False)
pd.concat(
[test_data["risk_high risk"], test_data.drop(["risk_low risk", "risk_high risk"], axis=1)],
axis=1,
).to_csv("/opt/ml/processing/test/test.csv", index=False, header=False)
What this script does: inside the container, it reads the csv from /opt/ml/processing/input/german_credit_data.csv, processes it, splits it into training, validation, and test sets, and saves them to /opt/ml/processing/train/train.csv, /opt/ml/processing/validation/validation.csv, and /opt/ml/processing/test/test.csv.
Then let's push the datasets and the preprocessing scripts up to S3:
customer_churn_data_uri = session.upload_data(
path="customer_churn/churn-dataset.csv", key_prefix=prefix + "/data"
)
credit_data_uri = session.upload_data(
path="credit_risk/german_credit_data.csv", key_prefix=prefix + "/data"
)
churn_preprocess_uri = session.upload_data(
path="customer_churn/preprocess.py", key_prefix=prefix + "/preprocess/churn"
)
credit_preprocess_uri = session.upload_data(
path="credit_risk/preprocess.py", key_prefix=prefix + "/preprocess/credit"
)
print("Customer churn data set uploaded to ", customer_churn_data_uri)
print("Credit data set uploaded to ", credit_data_uri)
### printed output:
### Customer churn data set uploaded to s3://sagemaker-us-east-1-568816277838/paramaterized/data/churn-dataset.csv
### Credit data set uploaded to s3://sagemaker-us-east-1-568816277838/paramaterized/data/german_credit_data.csv
print("Customer churn preprocessing script uploaded to ", churn_preprocess_uri)
print("Credit preprocessing script uploaded to ", credit_preprocess_uri)
### printed output:
### Customer churn preprocessing script uploaded to s3://sagemaker-us-east-1-568816277838/paramaterized/preprocess/churn/preprocess.py
### Credit preprocessing script uploaded to s3://sagemaker-us-east-1-568816277838/paramaterized/preprocess/credit/preprocess.py
Now we can build the pipeline's first component, the data preprocessing component.
SageMaker requires each component's Input and Output to be defined strictly:
for inputs, source is the S3 location of the component's input, and destination is the path where it is saved inside the container after download;
for outputs, source is the path of the component's output inside the container, and destination is the S3 location where it is saved after upload.
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.workflow.functions import Join
from sagemaker.workflow.execution_variables import ExecutionVariables
sklearn_processor = SKLearnProcessor(
framework_version="0.23-1", role=role, instance_type=processing_instance_type, instance_count=1
)
step_preprocess_data = ProcessingStep(
name="Preprocess-Data",
processor=sklearn_processor,
inputs=[
ProcessingInput(source=input_data, destination="/opt/ml/processing/input"),
],
outputs=[
ProcessingOutput(
output_name="train",
source="/opt/ml/processing/train",
destination=Join(
on="/",
values=[
"s3://{}".format(bucket),
prefix,
ExecutionVariables.PIPELINE_EXECUTION_ID,
"train",
],
),
),
ProcessingOutput(
output_name="validation",
source="/opt/ml/processing/validation",
destination=Join(
on="/",
values=[
"s3://{}".format(bucket),
prefix,
ExecutionVariables.PIPELINE_EXECUTION_ID,
"validation",
],
),
),
ProcessingOutput(
output_name="test",
source="/opt/ml/processing/test",
destination=Join(
on="/",
values=[
"s3://{}".format(bucket),
prefix,
ExecutionVariables.PIPELINE_EXECUTION_ID,
"test",
],
),
),
],
code=preprocess_script,
)
Building the training and tuning component
Next, write the training and tuning component:
from sagemaker.inputs import TrainingInput
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter
from sagemaker.workflow.steps import TuningStep
### Retrieve the image used for model training
image_uri = sagemaker.image_uris.retrieve(
framework="xgboost",
region=region,
version="1.2-2",
py_version="py3",
)
### Define the training Estimator
xgb_estimator = Estimator(
image_uri=image_uri,
instance_type=training_instance_type,
instance_count=1,
role=role,
disable_profiler=True,
)
### Wrap it in a HyperparameterTuner
xgb_tuner = HyperparameterTuner(
estimator=xgb_estimator,
objective_metric_name="validation:auc",
hyperparameter_ranges={
"eta": ContinuousParameter(0, 0.5),
"alpha": ContinuousParameter(0, 1000),
"min_child_weight": ContinuousParameter(1, 120),
"max_depth": IntegerParameter(1, 10),
"num_round": IntegerParameter(1, 2000),
"subsample": ContinuousParameter(0.5, 1),
},
max_jobs=max_training_jobs,
max_parallel_jobs=max_parallel_training_jobs,
)
### Build the tuning step
step_tuning = TuningStep(
name="Train-And-Tune-Model",
tuner=xgb_tuner,
inputs={
"train": TrainingInput(
s3_data=step_preprocess_data.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
content_type="text/csv",
),
"validation": TrainingInput(
s3_data=step_preprocess_data.properties.ProcessingOutputConfig.Outputs["validation"].S3Output.S3Uri,
content_type="text/csv",
),
},
)
The tuning step reads the S3 URI of the training set from the "train" channel and the S3 URI of the validation set from the "validation" channel.
Building the model evaluation component
After training, we usually evaluate the model on the test set before registering it in the model registry. This keeps the registry from filling up with poorly performing model versions.
So let's write the model evaluation component; it is also a ProcessingStep.
The evaluation component consists of:
- processor: the counterpart of the training component's Estimator; we pick the same image_uri as for training, but not necessarily the same instance_type
- PropertyFile: specifies the processing step's output file;
- inputs: typically the S3 URI of the test set, plus the S3 URI of the best model parameters obtained from the TuningStep;
- outputs: define the output file's source path (inside the container) and destination path (in S3), while the PropertyFile pins down the concrete output file;
- code: the evaluation script;
Write the evaluation script:
%%writefile evaluate.py
import json
import logging
import pathlib
import pickle
import tarfile
import numpy as np
import pandas as pd
import xgboost
from sklearn.metrics import (
accuracy_score,
precision_score,
recall_score,
confusion_matrix,
roc_curve,
)
logger = logging.getLogger()
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())
if __name__ == "__main__":
model_path = "/opt/ml/processing/model/model.tar.gz"
with tarfile.open(model_path) as tar:
        tar.extractall(path=".")  # extract into the working directory so "xgboost-model" can be opened below
logger.debug("Loading xgboost model.")
model = pickle.load(open("xgboost-model", "rb"))
logger.debug("Loading test input data.")
test_path = "/opt/ml/processing/test/test.csv"
df = pd.read_csv(test_path, header=None)
logger.debug("Reading test data.")
y_test = df.iloc[:, 0].to_numpy()
df.drop(df.columns[0], axis=1, inplace=True)
X_test = xgboost.DMatrix(df.values)
logger.info("Performing predictions against test data.")
prediction_probabilities = model.predict(X_test)
predictions = np.round(prediction_probabilities)
precision = precision_score(y_test, predictions, zero_division=1)
recall = recall_score(y_test, predictions)
accuracy = accuracy_score(y_test, predictions)
conf_matrix = confusion_matrix(y_test, predictions)
fpr, tpr, _ = roc_curve(y_test, prediction_probabilities)
logger.debug("Accuracy: {}".format(accuracy))
logger.debug("Precision: {}".format(precision))
logger.debug("Recall: {}".format(recall))
logger.debug("Confusion matrix: {}".format(conf_matrix))
# Available metrics to add to model: https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-model-quality-metrics.html
report_dict = {
"binary_classification_metrics": {
"accuracy": {"value": accuracy, "standard_deviation": "NaN"},
"precision": {"value": precision, "standard_deviation": "NaN"},
"recall": {"value": recall, "standard_deviation": "NaN"},
"confusion_matrix": {
"0": {"0": int(conf_matrix[0][0]), "1": int(conf_matrix[0][1])},
"1": {"0": int(conf_matrix[1][0]), "1": int(conf_matrix[1][1])},
},
"receiver_operating_characteristic_curve": {
"false_positive_rates": list(fpr),
"true_positive_rates": list(tpr),
},
},
}
output_dir = "/opt/ml/processing/evaluation"
pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True)
evaluation_path = f"{output_dir}/evaluation.json"
with open(evaluation_path, "w") as f:
f.write(json.dumps(report_dict))
This script loads the model and the test set, computes the model's metrics, organizes them as JSON, and finally saves them to /opt/ml/processing/evaluation/evaluation.json.
Upload the script:
evaluate_script_uri = session.upload_data(path="evaluate.py", key_prefix=prefix + "/evaluate")
print("Evaluation script uploaded to ", evaluate_script_uri)
### printed output:
### Evaluation script uploaded to s3://sagemaker-us-east-1-568816277838/paramaterized/evaluate/evaluate.py
Build the model evaluation component:
from sagemaker.processing import ScriptProcessor
from sagemaker.workflow.properties import PropertyFile
evaluate_model_processor = ScriptProcessor(
image_uri=image_uri,
command=["python3"],
instance_type=processing_instance_type,
instance_count=1,
role=role,
)
evaluation_report = PropertyFile(
name="EvaluationReport", output_name="evaluation", path="evaluation.json"
)
step_evaluate_model = ProcessingStep(
name="Evaluate-Model",
processor=evaluate_model_processor,
inputs=[
ProcessingInput(
source=step_tuning.get_top_model_s3_uri(top_k=0, s3_bucket=bucket),
destination="/opt/ml/processing/model",
),
ProcessingInput(
source=step_preprocess_data.properties.ProcessingOutputConfig.Outputs[
"test"
].S3Output.S3Uri,
destination="/opt/ml/processing/test",
),
],
outputs=[
ProcessingOutput(
output_name="evaluation",
source="/opt/ml/processing/evaluation",
destination=Join(
on="/",
values=[
"s3://{}".format(bucket),
prefix,
ExecutionVariables.PIPELINE_EXECUTION_ID,
"evaluation-report",
],
),
),
],
code=evaluate_script,
property_files=[evaluation_report],
)
Building the model registration component
Next, build the model registration component: if the trained model meets the performance requirement, a new model version is registered in the model registry for further analysis.
from sagemaker.model_metrics import MetricsSource, ModelMetrics
from sagemaker.workflow.step_collections import RegisterModel
model_metrics = ModelMetrics(
model_statistics=MetricsSource(
s3_uri=Join(
on="/",
values=[
step_evaluate_model.arguments["ProcessingOutputConfig"]["Outputs"][0]["S3Output"]["S3Uri"],
"evaluation.json",
],
),
content_type="application/json",
)
)
step_register_model = RegisterModel(
name="Register-Model",
estimator=xgb_estimator,
model_data=step_tuning.get_top_model_s3_uri(top_k=0, s3_bucket=bucket),
content_types=["text/csv"],
response_types=["text/csv"],
inference_instances=["ml.m4.xlarge"],
transform_instances=["ml.m4.xlarge"],
model_package_group_name=model_registry_package,
model_metrics=model_metrics,
)
The RegisterModel step has several important parameters:
- model_metrics: to see model metrics in the model registry, we have to build a model-metrics object from the evaluation report created in the evaluation step, and then create the registration step.
- model_package_group_name: the name of the registry (model package group);
- estimator: the Estimator used for model training above;
- model_data: the S3 URI of the best parameters from the tuning step;
- inference_instances: instance types usable for real-time inference;
- transform_instances: instance types for batch transform jobs and for deploying the model to an endpoint;
Here, ModelMetrics takes the S3 URI of the evaluation step's evaluation.json to build the model-metrics object.
Afterwards, the registered model's accuracy, precision, recall, confusion_matrix, and ROC curve can all be viewed in the registry.
Adding a condition to the pipeline
Conditions are added to a pipeline with a ConditionStep; we only want to register new model versions that satisfy the accuracy condition.
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.functions import JsonGet
cond_gte = ConditionGreaterThanOrEqualTo(
left=JsonGet(
step_name=step_evaluate_model.name,
property_file=evaluation_report,
json_path="binary_classification_metrics.accuracy.value",
),
right=accuracy_condition_threshold,
)
step_cond = ConditionStep(
name="Accuracy-Condition",
conditions=[cond_gte],
if_steps=[step_register_model],
else_steps=[],
)
In a ConditionStep, whether if_steps or else_steps are executed depends on whether the conditions are satisfied.
For the condition we use the ConditionGreaterThanOrEqualTo class: it reads binary_classification_metrics.accuracy.value (a JSON path; if that looks unfamiliar, check the dict in evaluate.py) from the property_file produced by step_evaluate_model, and if that value is greater than or equal to our accuracy_condition_threshold, the if_steps run, i.e. the model is registered.
Creating the pipeline
This part of the code is clear enough to need no commentary:
from sagemaker.workflow.pipeline import Pipeline
pipeline = Pipeline(
name=pipeline_name,
parameters=[
processing_instance_type,
training_instance_type,
input_data,
preprocess_script,
evaluate_script,
accuracy_condition_threshold,
model_registry_package,
max_parallel_training_jobs,
max_training_jobs,
],
    steps=[step_preprocess_data, step_tuning, step_evaluate_model, step_cond],
)
pipeline.upsert(role_arn=role)
Once the pipeline is created, we can start it from the UI or via the SDK.
Starting via the SDK:
pipeline.start(
execution_display_name="Credit",
parameters=dict(
InputData=credit_data_uri,
PreprocessScript=credit_preprocess_uri,
EvaluateScript=evaluate_script_uri,
AccuracyConditionThreshold=0.2,
MaxiumParallelTrainingJobs=2,
MaxiumTrainingJobs=4,
ModelGroup=credit_model_group,
),
)
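If we capture the handle returned by pipeline.start(), the run can also be followed from code; a small sketch reusing the parameters above:
execution = pipeline.start(
    execution_display_name="Credit",
    parameters=dict(
        InputData=credit_data_uri,
        PreprocessScript=credit_preprocess_uri,
        EvaluateScript=evaluate_script_uri,
        ModelGroup=credit_model_group,
    ),
)
execution.describe()  # status and metadata of this run
execution.wait()  # block until the run succeeds or fails
execution.list_steps()  # per-step status, handy for locating a failing component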
Starting via the UI:
After starting it, wait for the run to go green:
We can inspect an individual component's run results:
The screenshot above shows the best result and parameters from the tuning step;
once the pipeline run completes, we can view our model in the model registry, together with the visualized evaluation report:
We can also tick several models and compare their results:
The overall logic of SageMaker Pipelines
Above I wrote down, at great length, the process of getting the pipeline to run and all its code details, so that anyone who later wants to get their own SageMaker Pipeline working has something to lean on.
With the details recorded, let's recap the overall orchestration logic.
It is actually quite clear: when defining a component, the three parts that matter most are
- inputs;
- outputs;
- the execution part;
For inputs and outputs we define source and destination respectively, i.e. where the component's data comes from and where it goes: a component usually loads its inputs from S3 into the container, and afterwards puts its outputs from the container back onto S3.
(Note: by comparison, Kubeflow's Elyra is much more convenient; it configures one S3 workspace for all components of a pipeline by default, i.e. every component works inside a bucket that Elyra creates and assigns dynamically. If you want a component's output to be available to other components, you just list its in-container path under the component's Properties -> output files, and Elyra persists it to the bucket.)
This standardizes the components, but it inevitably makes development tedious.
The execution part comes in two flavors:
- for components such as data preprocessing and model evaluation, the usual practice is to write the code, upload it to S3, and have the component load it;
- for training and tuning components, you typically have to package the training code into an image and give the Estimator its image URI, from Amazon ECR or Docker Hub;
Inconveniences of SageMaker Pipeline development
On instance restrictions
To be honest, SageMaker always makes you specify an instance type for data preprocessing, model training, and model evaluation, which gets inconvenient in two ways.
First inconvenience: the instance type may not be allowed for that task.
Oddly, SageMaker imposes per-task instance restrictions on data preprocessing, model training, model evaluation, and on the inference_instances and transform_instances specified at registration; and I managed to run into every single one of them.
Still, that's tolerable: I just check against the list.
The second inconvenience is worse:
my AWS account's resources were limited: I requested two ml.m5.12xlarge instances, but the quota was 0.
So you also have to log in and check your account's resource limits;
and those limits are murky: for example, Amazon's site says every new account can use 20 ml.m5.4xlarge instances for training jobs:
but my account shows
a quota of 0; the funny part is that while my quota is 0, it also shows I'm currently using 4,
because I was running 4 tuning jobs in parallel:
Anyway, this whole area is pretty confusing.
On the training Estimator
The training Estimator requires an image_uri, which is painful: after writing training code, I still have to package it into an image and push it, which makes iterating on the code awkward.
Of course, Amazon's Python SDK does layer some wrappers over Estimator; for PyTorch, it offers a higher-level PyTorch class: once your PyTorch code is written, you just point entry_point at the file.
But the underlying logic is unchanged:
it packages an image for you, namely a PyTorch image with your training script attached, and fills that image's URI into the Estimator; all the same, for us developers it is a real convenience.
SageMaker's no-code development
I found that SageMaker's no-code development has exactly the same problem as Kubeflow's: it's worse than just writing code.
Here are the creation pages for training jobs and hyperparameter tuning jobs.
Training job creation page:
Hyperparameter tuning job creation page:
You will be delighted to find that everything you would have written in code has to be filled in here too, without exception;
and it brings two extra pain points:
- with code you can work out your metric Regex, objective, and so on as you develop; with no-code you fill in blank form fields directly, which is far less intuitive.
- the training container in the no-code flow really only accepts the Amazon ECR URI of your training image; with the Python SDK we can use high-level classes like PyTorch to sidestep the hassle of building and uploading an image.
So in my view this no-code development ends up worse than development with code.