【Datawhale X 李宏毅苹果书 AI夏令营】Task 3: Machine Learning Framework


Machine Learning Framework

  1. Define a function with unknown parameters
  2. Define a loss function
  3. Optimize (find the parameters that minimize the loss)
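
These three steps map directly onto PyTorch. Below is a minimal sketch (the toy model and data are made up for illustration):

```python
import torch
import torch.nn as nn

# 1. Define a function set: a linear model with unknown parameters w, b.
model = nn.Linear(in_features=1, out_features=1)

# 2. Define a loss function: mean squared error between prediction and label.
criterion = nn.MSELoss()

# 3. Optimization: gradient descent on the parameters to minimize the loss.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 1)     # toy inputs
y = 3 * x + 1              # toy labels from a known function
for step in range(200):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()        # compute gradients of the loss
    optimizer.step()       # update the parameters
```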

A guide to training:


1. Check the loss on the training data

If the training loss is large, the model has not learned the training data well. There are two possible causes:

1. Model bias

The model is too simple, so the best function may not be in the candidate set at all (searching for a needle in the ocean, but the needle is not in the ocean).

Solution: design a more complex, more flexible model.


2. Optimization

Gradient descent fails to find the function with the lowest loss (the needle is in the ocean, but we cannot fish it out).


3. Model bias vs. optimization

(1) First train a simpler model.

(2) Compare the losses: if the more flexible model has a higher training loss than the simpler model, the problem is optimization.

Reference: arxiv.org/pdf/1512.03… On the training data, a 20-layer network has less flexibility yet reaches a lower loss, while a 56-layer network has more flexibility but a higher loss. This indicates an optimization problem: the optimizer cannot find the best solution, so the extra layers are not guaranteed to be used effectively.
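
To run this diagnostic in code, one can train a shallow and a deeper network on the same data and compare their training losses. A sketch (the models and toy data here are hypothetical, not from the paper):

```python
import torch
import torch.nn as nn

def final_train_loss(model, x, y, steps=300, lr=1e-2):
    # Train with Adam and report the loss on the *training* data.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

torch.manual_seed(0)
x = torch.randn(256, 10)
y = x.sum(dim=1, keepdim=True)          # a simple target function

shallow = nn.Linear(10, 1)
deep = nn.Sequential(nn.Linear(10, 64), nn.ReLU(),
                     nn.Linear(64, 64), nn.ReLU(),
                     nn.Linear(64, 1))

# If the more flexible model ends with a *higher* training loss,
# suspect optimization rather than model bias.
print('shallow:', final_train_loss(shallow, x, y))
print('deep:   ', final_train_loss(deep, x, y))
```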

2. Check the loss on the testing data

1. Overfitting

Cause: the model is so flexible that it fits the noise in the training data, so the training loss is small but the testing loss is large.

Solutions:

a. More training data

  • Collect more data directly

  • Data augmentation: create new data based on your understanding of the problem (the augmentation must be reasonable!), as shown in the sketch after this list

    For example, in image processing, flipping images can double the amount of data.

b. Constrain the model

  • Fewer parameters, or shared parameters (e.g., CNN)

  • Fewer features

  • Early stopping

  • Regularization

  • Dropout
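
For the data-augmentation bullet above, a minimal torchvision sketch (the transforms and the image are illustrative, not from the original post):

```python
import numpy as np
import torchvision.transforms as T
from PIL import Image

# A random stand-in image; in practice this is a real training image.
img = Image.fromarray(np.random.randint(0, 255, (32, 32, 3), dtype=np.uint8))

# Horizontal flips are usually label-preserving for natural photos;
# vertical flips often are not (e.g., digits), so the transforms must
# match your understanding of the task.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomRotation(degrees=10),
    T.ToTensor(),
])

augmented = augment(img)   # a new, slightly different training sample
print(augmented.shape)     # torch.Size([3, 32, 32])
```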

Cross-validation

Split the training data into a training set and a validation set, and select models by their validation loss. In k-fold cross-validation, the training data is divided into k folds; each fold serves as the validation set once, and the k results are averaged.
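
A minimal k-fold split sketch in plain NumPy (the fold count and data size are arbitrary):

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    # Shuffle the sample indices, then cut them into k roughly equal folds.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Each round trains on k-1 folds and validates on the held-out one;
# averaging the k validation scores gives a more reliable estimate.
for train_idx, val_idx in k_fold_indices(n_samples=10, k=5):
    print(len(train_idx), len(val_idx))   # 8 2, five times
```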

2. Mismatch

The training data and the testing data come from different distributions.

Classification

1. Classification vs. regression

Treating a classification problem as regression sometimes works and sometimes does not. For example, when guessing a student's grade from height, year-2 heights are close to year-1 heights while year-3 heights are farther from year-1 heights, so the numeric labels happen to reflect a real ordering. But if the classes have no such relationship, numeric labels impose a false ordering and cause problems. The fix is to represent each class as a one-hot vector.
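
In PyTorch, one-hot vectors can be produced directly (the class indices here are illustrative):

```python
import torch
import torch.nn.functional as F

labels = torch.tensor([0, 2, 1])        # three samples, three classes
print(F.one_hot(labels, num_classes=3))
# tensor([[1, 0, 0],
#         [0, 0, 1],
#         [0, 1, 0]])
```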

2. Softmax

a. Why softmax?

A simple explanation: the label $\widehat{y}$ is a one-hot vector, so its entries are only 0 and 1, while the network output $y$ can take any value. Since the target contains only 0s and 1s, we first normalize $y$ to values between 0 and 1 so that it can be compared with the label:

$$y_i' = \frac{\exp(y_i)}{\sum_{j}\exp(y_j)}$$

where $1 > y_i' > 0$ and $\sum_{i} y_i' = 1$.

When there are only two classes, sigmoid and softmax are equivalent: softmax over two logits gives the same probability as sigmoid applied to their difference.
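
A quick numeric check of both points, using torch.softmax (the logits are made up):

```python
import torch

y = torch.tensor([1.0, -0.5, 3.0])      # raw network outputs
y_prime = torch.softmax(y, dim=0)
print(y_prime)                          # every entry lies in (0, 1)
print(y_prime.sum())                    # tensor(1.)

# Two-class case: softmax over [y1, y2] equals sigmoid(y1 - y2)
# for the first class's probability.
y2 = torch.tensor([2.0, 0.5])
print(torch.softmax(y2, dim=0)[0])      # ~0.8176
print(torch.sigmoid(y2[0] - y2[1]))     # same value
```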

b. The loss function

Minimizing cross-entropy is in fact equivalent to maximizing likelihood.

Advantages of cross-entropy

With cross-entropy, gradient descent can keep reducing the loss.

With MSE, however, the loss surface is nearly flat where the prediction is very wrong, so gradient descent gets stuck and cannot find the minimum.
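
A small numeric check of this claim (the logits below are made up): when the prediction is confidently wrong, cross-entropy still produces a large gradient, while MSE applied after softmax produces an almost-zero one.

```python
import torch
import torch.nn.functional as F

target = torch.tensor([0])                                   # true class is 0
logits = torch.tensor([[-10.0, 10.0]], requires_grad=True)   # confidently wrong

# Cross-entropy on raw logits (softmax is applied internally).
F.cross_entropy(logits, target).backward()
print(logits.grad)    # roughly [[-1., 1.]]: a large, useful gradient

logits2 = torch.tensor([[-10.0, 10.0]], requires_grad=True)
probs = torch.softmax(logits2, dim=1)
F.mse_loss(probs, F.one_hot(target, num_classes=2).float()).backward()
print(logits2.grad)   # almost zero: gradient descent barely moves
```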

Hands-on assignment: HW2

1. Download and unzip the dataset

(1) Open a terminal

(2) Clone the repository:

git clone https://oauth2:3EQxRxxHC8AwoQfojKpK@www.modelscope.cn/datasets/Datawhale/HW2-DNN-libriphone.git

(3) Unzip the dataset:

cd HW2-DNN-libriphone/
ls
unzip -q ml2023spring-hw2.zip


2. Run the baseline


3. Read the code closely

datawhaler.feishu.cn/wiki/M7tqwI…

  1. Import libraries;
  2. Prepare and preprocess the data;
  3. Define the model;
  4. Define the loss function, optimizer, and other configuration;
  5. Train and evaluate the model;
  6. Make predictions.

01 Set the random seed so that experimental results are reproducible

import numpy as np
import torch
import random
def same_seeds(seed):
    random.seed(seed)                     # Python built-in RNG
    np.random.seed(seed)                  # NumPy RNG
    torch.manual_seed(seed)               # PyTorch CPU RNG
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)      # current GPU
        torch.cuda.manual_seed_all(seed)  # all GPUs
    torch.backends.cudnn.benchmark = False     # disable cuDNN autotuning
    torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels

02 Helper functions for the audio feature dataset: loading features, shifting frames, concatenating neighboring frames, and preprocessing the data

import os
import random  # used by preprocess_data for the train/val split
import torch
from tqdm import tqdm

def load_feat(path):
    feat = torch.load(path)
    return feat

def shift(x, n):
    # Shift the frame sequence by n steps, padding the boundary by
    # repeating the first (n < 0) or last (n > 0) frame.
    if n < 0:
        left = x[0].repeat(-n, 1)
        right = x[:n]
    elif n > 0:
        right = x[-1].repeat(n, 1)
        left = x[n:]
    else:
        return x

    return torch.cat((left, right), dim=0)

def concat_feat(x, concat_n):
    # Concatenate each frame with (concat_n - 1) / 2 neighboring frames on
    # each side, widening the feature dim from d to concat_n * d.
    assert concat_n % 2 == 1 # n must be odd
    if concat_n < 2:
        return x
    seq_len, feature_dim = x.size(0), x.size(1)
    x = x.repeat(1, concat_n) 
    x = x.view(seq_len, concat_n, feature_dim).permute(1, 0, 2) # concat_n, seq_len, feature_dim
    mid = (concat_n // 2)
    for r_idx in range(1, mid+1):
        x[mid + r_idx, :] = shift(x[mid + r_idx], r_idx)
        x[mid - r_idx, :] = shift(x[mid - r_idx], -r_idx)

    return x.permute(1, 0, 2).view(seq_len, concat_n * feature_dim)

def preprocess_data(split, feat_dir, phone_path, concat_nframes, train_ratio=0.8, random_seed=1213):
    class_num = 41 # NOTE: pre-computed, should not need change

    if split == 'train' or split == 'val':
        mode = 'train'
    elif split == 'test':
        mode = 'test'
    else:
        raise ValueError('Invalid \'split\' argument for dataset: PhoneDataset!')

    label_dict = {}
    if mode == 'train':
        for line in open(os.path.join(phone_path, f'{mode}_labels.txt')).readlines():
            line = line.strip('\n').split(' ')
            label_dict[line[0]] = [int(p) for p in line[1:]]
        
        # split training and validation data
        usage_list = open(os.path.join(phone_path, 'train_split.txt')).readlines()
        random.seed(random_seed)
        random.shuffle(usage_list)
        train_len = int(len(usage_list) * train_ratio)
        usage_list = usage_list[:train_len] if split == 'train' else usage_list[train_len:]

    elif mode == 'test':
        usage_list = open(os.path.join(phone_path, 'test_split.txt')).readlines()

    usage_list = [line.strip('\n') for line in usage_list]
    print('[Dataset] - # phone classes: ' + str(class_num) + ', number of utterances for ' + split + ': ' + str(len(usage_list)))

    max_len = 3000000
    X = torch.empty(max_len, 39 * concat_nframes)
    if mode == 'train':
        y = torch.empty(max_len, dtype=torch.long)

    idx = 0
    for i, fname in tqdm(enumerate(usage_list)):
        feat = load_feat(os.path.join(feat_dir, mode, f'{fname}.pt'))
        cur_len = len(feat)
        feat = concat_feat(feat, concat_nframes)
        if mode == 'train':
            label = torch.LongTensor(label_dict[fname])

        X[idx: idx + cur_len, :] = feat
        if mode == 'train':
            y[idx: idx + cur_len] = label

        idx += cur_len

    X = X[:idx, :]
    if mode == 'train':
        y = y[:idx]

    print(f'[INFO] {split} set')
    print(X.shape)
    if mode == 'train':
        print(y.shape)
        return X, y
    else:
        return X
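
As a sanity check on the helpers above, a toy run (the shapes are made up): concat_feat widens each of 4 frames from 2 to 3 × 2 = 6 dimensions by stitching in one neighbor on each side, repeating the edge frame where a neighbor is missing.

```python
x = torch.arange(8, dtype=torch.float32).view(4, 2)  # 4 frames, 2-dim features
out = concat_feat(x, 3)                              # 1 neighbor on each side
print(out.shape)  # torch.Size([4, 6])
print(out[0])     # tensor([0., 1., 0., 1., 2., 3.]) -- edge padded with frame 0
```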

03 A custom Dataset class for data loading and batching in PyTorch

import torch
from torch.utils.data import Dataset

class LibriDataset(Dataset):
    def __init__(self, X, y=None):
        self.data = X
        if y is not None:
            self.label = torch.LongTensor(y)
        else:
            self.label = None

    def __getitem__(self, idx):
        if self.label is not None:
            return self.data[idx], self.label[idx]
        else:
            return self.data[idx]

    def __len__(self):
        return len(self.data)

04 The basic building blocks of the neural network model

import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(BasicBlock, self).__init__()

        # TODO: apply batch normalization and dropout for strong baseline.
        # Reference: https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html (batch normalization)
        #       https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html (dropout)
        self.block = nn.Sequential(
            nn.Linear(input_dim, output_dim),
            nn.ReLU(),
        )

    def forward(self, x):
        x = self.block(x)
        return x


class Classifier(nn.Module):
    def __init__(self, input_dim, output_dim=41, hidden_layers=1, hidden_dim=256):
        super(Classifier, self).__init__()

        self.fc = nn.Sequential(
            BasicBlock(input_dim, hidden_dim),
            *[BasicBlock(hidden_dim, hidden_dim) for _ in range(hidden_layers)],
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        x = self.fc(x)
        return x

05 Parameters for data processing and model training

# data parameters
# TODO: change the value of "concat_nframes" for medium baseline
concat_nframes = 3   # the number of frames to concat with, n must be odd (total 2k+1 = n frames)
train_ratio = 0.75   # the ratio of data used for training, the rest will be used for validation

# training parameters
seed = 1213          # random seed
batch_size = 512        # batch size
num_epoch = 10         # the number of training epoch
learning_rate = 1e-4      # learning rate
model_path = './model.ckpt'  # the path where the checkpoint will be saved

# model parameters
# TODO: change the value of "hidden_layers" or "hidden_dim" for medium baseline
input_dim = 39 * concat_nframes  # the input dim of the model, you should not change the value
hidden_layers = 2          # the number of hidden layers
hidden_dim = 64           # the hidden dim

06 Prepare the DataLoaders used for model training and validation

from torch.utils.data import DataLoader
import gc

same_seeds(seed)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'DEVICE: {device}')

# preprocess data
train_X, train_y = preprocess_data(split='train', feat_dir='./libriphone/feat', phone_path='./libriphone', concat_nframes=concat_nframes, train_ratio=train_ratio, random_seed=seed)
val_X, val_y = preprocess_data(split='val', feat_dir='./libriphone/feat', phone_path='./libriphone', concat_nframes=concat_nframes, train_ratio=train_ratio, random_seed=seed)

# get dataset
train_set = LibriDataset(train_X, train_y)
val_set = LibriDataset(val_X, val_y)

# remove raw feature to save memory
del train_X, train_y, val_X, val_y
gc.collect()

# get dataloader
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_set, batch_size=batch_size, shuffle=False)

07 The training and validation loop: create the model, define the loss function, set up the optimizer, and iterate

# create model, define a loss function, and optimizer
model = Classifier(input_dim=input_dim, hidden_layers=hidden_layers, hidden_dim=hidden_dim).to(device)
criterion = nn.CrossEntropyLoss() 
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

best_acc = 0.0
for epoch in range(num_epoch):
    train_acc = 0.0
    train_loss = 0.0
    val_acc = 0.0
    val_loss = 0.0
    
    # training
    model.train() # set the model to training mode
    for i, batch in enumerate(tqdm(train_loader)):
        features, labels = batch
        features = features.to(device)
        labels = labels.to(device)
        
        optimizer.zero_grad() 
        outputs = model(features) 
        
        loss = criterion(outputs, labels)
        loss.backward() 
        optimizer.step() 
        
        _, train_pred = torch.max(outputs, 1) # get the index of the class with the highest probability
        train_acc += (train_pred.detach() == labels.detach()).sum().item()
        train_loss += loss.item()
    
    # validation
    model.eval() # set the model to evaluation mode
    with torch.no_grad():
        for i, batch in enumerate(tqdm(val_loader)):
            features, labels = batch
            features = features.to(device)
            labels = labels.to(device)
            outputs = model(features)
            
            loss = criterion(outputs, labels) 
            
            _, val_pred = torch.max(outputs, 1) # get the index of the class with the highest probability
            val_acc += (val_pred.cpu() == labels.cpu()).sum().item()
            val_loss += loss.item()

    print(f'[{epoch+1:03d}/{num_epoch:03d}] Train Acc: {train_acc/len(train_set):3.5f} Loss: {train_loss/len(train_loader):3.5f} | Val Acc: {val_acc/len(val_set):3.5f} loss: {val_loss/len(val_loader):3.5f}')

    # if the model improves, save a checkpoint at this epoch
    if val_acc > best_acc:
        best_acc = val_acc
        torch.save(model.state_dict(), model_path)
        print(f'saving model with acc {best_acc/len(val_set):.5f}')

08 Free the memory held by objects that are no longer needed

del train_set, val_set
del train_loader, val_loader
gc.collect()

09 Load the test data and create its DataLoader for prediction after training

# load data
test_X = preprocess_data(split='test', feat_dir='./libriphone/feat', phone_path='./libriphone', concat_nframes=concat_nframes)
test_set = LibriDataset(test_X, None)
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False)

10 Load the trained model checkpoint and prepare it for testing

# load model
model = Classifier(input_dim=input_dim, hidden_layers=hidden_layers, hidden_dim=hidden_dim).to(device)
model.load_state_dict(torch.load(model_path))

11 Run prediction on the test set and collect all predictions

pred = np.array([], dtype=np.int32)

model.eval()
with torch.no_grad():
    for i, batch in enumerate(tqdm(test_loader)):
        features = batch
        features = features.to(device)

        outputs = model(features)

        _, test_pred = torch.max(outputs, 1) # get the index of the class with the highest probability
        pred = np.concatenate((pred, test_pred.cpu().numpy()), axis=0)

12 Write the predictions to a CSV file for later analysis or submission

with open('prediction.csv', 'w') as f:
    f.write('Id,Class\n')
    for i, y in enumerate(pred):
        f.write('{},{}\n'.format(i, y))