Processing speech content (the spectral envelope) with a VAE neural network


Introduction

Voice conversion is the technique of converting one speaker's voice into another speaker's voice. In this project, inspired by the paper "Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder", our team focuses on the spectral conversion (SC) technique.

Our project is divided into two independent parts:

  1. Speech processing 
  2. Training of the VAE model

Purpose

To give the user a comprehensive understanding of VAEs, which can be used not only to process MNIST data but also speech files. Please refer to sideComparisonVAE.ipynb for a side-by-side comparison of how we process MNIST data and the speech content (spectral envelope) of speech files.

Architecture

Below is the structure of our project.

  • 📁 modules

    • 📑 voiceVaeModel.py
    • 📑 voiceTreatment.py
  • 📑 README.md

  • 📑 mnistVAE.ipynb

  • 📑 sideComparisonVAE.ipynb

  • 📑 voiceTreatment.ipynb

Prerequisites

  • numpy 👉 Python's fundamental scientific library, with built-in math functions and easy array handling.
  • librosa 👉 a package for music and audio analysis.
  • pyworld 👉 open-source software for high-quality speech analysis, manipulation and synthesis.
  • tensorflow 👉 a machine-learning package that can train/run deep neural networks (NNs).
  • matplotlib 👉 software for plotting graphs.
  • pandas 👉 for manipulating dataframes, a Python object that is very handy when working with large datasets.
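Assuming a standard Python 3 environment, all of the above can be installed from PyPI (note that on some platforms pyworld also needs a C compiler to build):

```shell
pip install numpy librosa pyworld tensorflow matplotlib pandas
```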

Provided data

We use the utterances provided in the VCC2016 dataset. It consists of a training set (150 utterances), a validation set (12 utterances) and a testing set (54 utterances). The source speakers include 3 females and 2 males, while the target speakers include 2 females and 3 males.

Speech processing

Parameters used

| ε | FFT size | Spectral envelope dimension | Feature dimension | f0 ceiling |
| --- | --- | --- | --- | --- |
| 1e-10 | 1024 | 513 | 513 + 513 + 1 + 1 + 1 = 1029 | 500 Hz |

The chosen FFT size (here 1024) determines the resolution of the resulting spectrum. The number of spectral lines is FFT size / 2 + 1, so the spectral envelope has 513 spectral lines. The resolution of each spectral line = sampling rate / FFT size = 16000 / 1024 ≈ 15.6 Hz. A larger FFT size therefore provides a higher resolution, but takes longer to compute.
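These numbers can be checked with a couple of lines of Python:

```python
FFT_SIZE = 1024
SAMPLE_RATE = 16000  # Hz, the rate used throughout this project

# A real FFT of size N yields N/2 + 1 spectral lines
sp_dim = FFT_SIZE // 2 + 1           # 513, matching SP_DIM in the source
# Frequency resolution of each spectral line
resolution = SAMPLE_RATE / FFT_SIZE  # 15.625 Hz per line

print(sp_dim, resolution)  # 513 15.625
```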

The wav2pw() function

def wav2pw(x, fs=16000, fft_size=FFT_SIZE):
    _f0, timeaxis = pw.dio(x, fs, f0_ceil = f0_ceil) # _f0 = Raw pitch
    f0 = pw.stonemask(x, _f0, timeaxis, fs)  # f0 = Refined pitch
    sp = pw.cheaptrick(x, f0, timeaxis, fs, fft_size=fft_size) # sp = smoothed spectrogram (spectral envelope)
    ap = pw.d4c(x, f0, timeaxis, fs, fft_size=fft_size) # extract aperiodicity
    return {
      'f0': f0,
      'sp': sp,
      'ap': ap,
    }

The pyworld package and its functions (dio, stonemask, cheaptrick and d4c) return the spectral envelope (SP), the aperiodicity (AP) and the fundamental frequency (f0) of the specified {}.wav file. The f0 ceiling is set such that we keep only the lower frequencies. For each frame there are 513 instances of SP and 513 instances of AP.

The analysis() function

def analysis(filename, fft_size=FFT_SIZE, dtype=np.float32):
    ''' Basic (WORLD) feature extraction ''' 
    fs = 16000
    x, _ = librosa.load(filename, sr=fs, mono=True, dtype=np.float64) #audio time series, sampling rate
    features = wav2pw(x, fs=16000, fft_size=fft_size)
    ap = features['ap']
    f0 = features['f0'].reshape([-1, 1]) #rows = unknown, columns = 1
    sp = features['sp']
    en = np.sum(sp + EPSILON, axis=1, keepdims=True) # Normalizing Factor
    sp_r = np.log10(sp / en) # Refined Spectrogram Normalization
    target = np.concatenate([sp_r, ap, f0, en], axis=1).astype(dtype)
    return target 

In this function we use the librosa package to obtain the amplitude and sampling rate of the audio file. The amplitude is then consumed by *wav2pw()*. We add a normalizing factor (the row-wise sum of SP + ε) and normalize the spectral envelope's spectrogram on a logarithmic scale.

Each audio file yields a different number of fundamental-frequency frames, depending on the length of the audio. For example, for SF1/100001.wav we get 704 instances of f0, while for SF1/100002.wav we get 216. This is because SF1/100001.wav lasts longer than SF1/100002.wav. The feature dimensions, however, stay the same: for every instance of f0 we still have 513 instances of SP and 513 instances of AP.
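The shape bookkeeping above can be verified with a NumPy sketch that mirrors analysis(), using random stand-ins for the WORLD features:

```python
import numpy as np

EPSILON = 1e-10
n_frames, sp_dim = 704, 513  # e.g. SF1/100001.wav has 704 f0 instances

# Random stand-ins for the WORLD features returned by wav2pw()
sp = np.random.rand(n_frames, sp_dim) + EPSILON  # spectral envelope
ap = np.random.rand(n_frames, sp_dim)            # aperiodicity
f0 = np.random.rand(n_frames, 1)                 # fundamental frequency

en = np.sum(sp + EPSILON, axis=1, keepdims=True)  # normalizing factor
sp_r = np.log10(sp / en)                          # log-scale normalization

target = np.concatenate([sp_r, ap, f0, en], axis=1).astype(np.float32)
print(target.shape)  # (704, 1028): 513 + 513 + 1 + 1 columns
```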

For example, in SF1/100002.wav, the spectral envelope at f0 = 212.431344 Hz looks like this:

*(figure: spectral envelope of SF1/100002.wav at f0 = 212.431344 Hz)*

The extract_and_save_bin_to(...) function

Because of the large amount of data to process, and to make our VAE model run faster, we convert the extracted features of each {}.wav file into a {}.bin binary file.
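extract_and_save_bin_to(...) itself is not listed in this article; below is a minimal sketch of what such a function could look like, assuming it simply dumps each float32 feature matrix with NumPy's tofile (the dummy shapes and output path here are illustrative):

```python
import os
import tempfile
import numpy as np

def extract_and_save_bin_to(features, out_path):
    """Dump a float32 feature matrix to a flat binary file.

    The reader can recover it later with
    np.fromfile(out_path, np.float32).reshape(-1, n_cols).
    """
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    features.astype(np.float32).tofile(out_path)

# Round trip with dummy data (1029 = FEAT_DIM columns)
feats = np.random.rand(10, 1029).astype(np.float32)
out_path = os.path.join(tempfile.mkdtemp(), '100001.bin')
extract_and_save_bin_to(feats, out_path)
back = np.fromfile(out_path, dtype=np.float32).reshape(-1, 1029)
assert np.array_equal(feats, back)
```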

The VAE model

The proposed approach is inspired by reference work on generating handwritten digits (the MNIST dataset). That work tries to extract the writing style and the digit identity from a large collection of handwritten images and to re-synthesize such images. Analogously, our two disentangled latent factors are variation and identity:

| | Variation | Identity |
| --- | --- | --- |
| Handwritten digits | writing style | nominal digit |
| Speech frames | speech content | speaker identity |

The structure of the encoder, the decoder and the VAE model can be found in sideComparisonVAE.ipynb.

Training procedure

  • Training a VAE involves sampling from the distribution of the latent variable z. We therefore introduce a reparameterization trick to inject the randomness into this latent variable.
  • Training a VAE is frame-wise: the spectral frames of the source and of the target are not separated into inputs and outputs; both are treated as inputs.
  • The input of the encoder is the concatenation of a spectral frame and the speaker identity. Since the encoder receives frames from both the source and the target, it learns a speaker-independent encoding.
  • VAE parameters:

| Number of hidden layers | Nodes per hidden layer | Latent space size | Mini-batch size |
| --- | --- | --- | --- |
| 2 | 512 | 128 | 128 |
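The reparameterization trick from the first bullet can be sketched in plain NumPy (the Keras version appears later as sample_z_voice; as there, the second encoder output is treated as a log-variance):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(mu, log_var):
    """Reparameterization: z = mu + std * eps, with eps ~ N(0, I).

    The noise eps is sampled independently of the network parameters,
    so the mapping from (mu, log_var) to z stays differentiable.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_var / 2) * eps

# One mini-batch of 128 frames with a 128-dimensional latent space
mu = np.zeros((128, 128))
log_var = np.zeros((128, 128))  # log-variance 0 -> unit variance
z = sample_z(mu, log_var)
print(z.shape)  # (128, 128)
```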

Inference and learning

Our goal is maximum-likelihood learning: maximizing the log-likelihood of the data under the model, max_θ log p_θ(x), where p_θ(x) is called the marginal likelihood of the observation x and θ denotes the model parameters.

Computing this marginal likelihood of x is intractable, because the joint likelihood model is given by p_θ(x, z) = p_θ(x|z) p(z), where z is the latent representation.

Since directly optimizing log p_θ(x) is infeasible, we optimize a lower bound on it instead (by splitting it into a reconstruction-loss term and a KL-divergence loss term). This lower bound is called the Evidence Lower Bound (ELBO). To fit the Keras model, instead of maximizing the ELBO we minimize the NELBO (negative ELBO).
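In standard notation, the bound just described reads:

```latex
\log p_\theta(x) \;\geq\; \mathrm{ELBO}
= \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right]}_{\text{reconstruction term}}
\;-\; \underbrace{D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)}_{\text{KL divergence term}}
```

where q_φ(z|x) is the approximate posterior produced by the encoder. Minimizing NELBO = −ELBO as the Keras loss gives exactly the reconstruction + KL split implemented later in kl_reconstruction_loss_voice.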

Source code

voiceTreatment.py

# -*- coding: utf-8 -*-
"""voiceTreatment

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1WSP9KjiBPkz_svVgjgc11huKtZLmHaIr

# **Dependencies**
"""
# Commented out IPython magic to ensure Python compatibility.
"""
Created on Mon Apr 9 05:30:42 2020
@author: yun bin choh
"""

import librosa
from scipy.io import wavfile
import scipy
import numpy as np
import os
from os.path import join
!pip install pyworld
import pyworld as pw
import tensorflow as tf
import matplotlib.pyplot as plt

"""# **Parameters**"""

EPSILON = 1e-10
SETS = ['Training Set', 'Testing Set']  # TODO: for VCC2016 only
SPEAKERS = [s.strip() for s in tf.io.gfile.GFile('/content/drive/My Drive/speakers.tsv', 'r').readlines()]
FFT_SIZE = 1024
SP_DIM = FFT_SIZE // 2 + 1 # =513
FEAT_DIM = SP_DIM + SP_DIM + 1 + 1 + 1  # [sp, ap, f0, en, s] = 1029
RECORD_BYTES = FEAT_DIM * 4  # all features saved in `float32` = 4116
f0_ceil = 500

"""# **wav2pw() function**
The pyworld package and its functions (dio, stonemask, cheaptrick and d4c) would return the spectral envelope (SP), aperiodicity (AP), fundamental frequency (f0) of the specified {}.wav file. \\
The f0 ceiling is set such that we only filter the lower frequencies. There would be 513 instances of SP and 513 instances of AP.
"""

def wav2pw(x, fs=16000, fft_size=FFT_SIZE):
    _f0, timeaxis = pw.dio(x, fs, f0_ceil = f0_ceil) # _f0 = Raw pitch
    f0 = pw.stonemask(x, _f0, timeaxis, fs)  # f0 = Refined pitch
    sp = pw.cheaptrick(x, f0, timeaxis, fs, fft_size=fft_size) # sp = smoothed spectrogram (spectral envelope)
    ap = pw.d4c(x, f0, timeaxis, fs, fft_size=fft_size) # extract aperiodicity
    return {
      'f0': f0,
      'sp': sp,
      'ap': ap,
    }

"""# **analysis() function**

In this function, we use the librosa package to acquire the amplitude and sampling rate of the audio file. The amplitude is then being used by the wav2pw() module. We add in a normalizing factor (summation of SP + epsilon), and refined the spectrogram of the spectral envelope on a logarithm scale.

Each single audio file would have a varying length of fundamental frequency depending on the length of the audio. For example: For SF1/100001.wav, we have 704 instances of f0. While for SF1/100002.wav, we have 216 instances of f0. This is because the audio duration for SF1/100001.wav is longer than SF1/100001.wav. However, the featured dimensions remain the same: For each instance of f0, we still have 513 instances of SP and 513 instances of AP.
"""

def analysis(filename, fft_size=FFT_SIZE, dtype=np.float32):
    ''' Basic (WORLD) feature extraction ''' 
    fs = 16000
    x, _ = librosa.load(filename, sr=fs, mono=True, dtype=np.float64) #audio time series, sampling rate
    features = wav2pw(x, fs=16000, fft_size=fft_size)
    ap = features['ap']
    f0 = features['f0'].reshape([-1, 1]) #rows = unknown, columns = 1
    sp = features['sp']
    en = np.sum(sp + EPSILON, axis=1, keepdims=True) # Normalizing Factor
    sp_r = np.log10(sp / en) # Refined Spectrogram Normalization
    target = np.concatenate([sp_r, ap, f0, en], axis=1).astype(dtype)
    return target 
    # Features are concatenated by position; the target for this file has 704 rows and 1028 columns

voiceVaeModel.py

# -*- coding: utf-8 -*-
"""voiceVaeModel.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1FSOeHjDL1BcYdYPbmFH7JKhyRALRYCq3
"""

# Commented out IPython magic to ensure Python compatibility.
"""
Created on Mon Apr 9 05:30:42 2020
@author: Yun Bin Choh
"""

# %tensorflow_version 1.x

import keras
from keras.layers import Conv2D, Conv2DTranspose, Input, Flatten, Dense, Lambda, Reshape
from keras.layers import BatchNormalization
from keras.models import Model
from keras.datasets import mnist
from keras.losses import binary_crossentropy
from keras import backend as K
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical
import seaborn as sns


import voiceTreatment

def voiceVAE(filename_source, filename_target, FFT_SIZE=1024, batch_size = 128, no_epochs = 50, validation_split = 0.2, verbosity = 1, latent_dim_voice = 128, num_channels = 1):
      '''
      INPUTS : 
              - filename_source: string, example: '/content/drive/My Drive/dataset/vcc2016/wav/Training Set/SF1/*.wav'
              - filename_target: string, example: '/content/drive/My Drive/dataset/vcc2016/wav/Training Set/TM1/*.wav'
              - FFT_SIZE: int, the fast fourier transform size to determine the number of spectral lines
              - batch_size: int, mini-batch size
              - no_epochs: int, number of epochs for model training
              - validation_split: float, the ratio to split the data set into training/validation sets
              - verbosity: int, 0, 1 or 2. 0 = silent, 1 = progress bar, 2 = one line per epoch
              - latent_dim_voice: int, dimension of latent space
              - num_channels: int, to reshape the data into (_,_,_,_) so as to fit into keras model
      OUTPUTS:
              - encoder_voice: keras.engine.training.Model
              - decoder_voice: keras.engine.training.Model
              - vae_voice: keras.engine.training.Model
              - hist.history['loss']: list, loss over no_epochs
              - hist.history['val_loss']: list, validation loss over no_epochs
      '''
      features = []
      files = tf.gfile.Glob(filename_source)
      for elem in files:
        save = voiceTreatment.analysis(elem, fft_size=FFT_SIZE, dtype=np.float32)
        for i in range(len(save)//100):
          features.append([save[i][0:513], 'SF1'])

      ## Adding in additional data for TM1
      files2 = tf.gfile.Glob(filename_target)
      for elem in files2:
        save = voiceTreatment.analysis(elem, fft_size=FFT_SIZE, dtype=np.float32)
        for i in range(len(save)//100):
          features.append([save[i][0:513], 'TM1'])

      print("From 162 utterances in SF1 & TM1 respectively, we have:",len(features),"batches")
      print("In each batch, we have:",len(features[0][0]),"spectral lines")

      # Note that different speakers have different numbers of batches
      count=0
      for i in range(len(features)):
        if features[i][1] == 'SF1':
          count += 1
      print('Note that different speakers have different number of batches (Due to varying length of sound files):')
      print("From 162 utterances in SF1, we have:",count,"batches")
      print("From 162 utterances in TM1, we have:",len(features)-count,"batches")
      
      # Structuring the data
      featuresDF = pd.DataFrame(features, columns=['feature','class_label'])

      y = np.array(featuresDF.class_label.tolist())
      X = np.array(featuresDF.feature.tolist())
      from sklearn.model_selection import train_test_split 
      x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 127)

      x_train = x_train.reshape(len(x_train),1,513)
      x_test = x_test.reshape(len(x_test),1,513)

      label_encoder = LabelEncoder()
      y_train = label_encoder.fit_transform(y_train)
      y_test = label_encoder.fit_transform(y_test)

      img_width_voice, img_height_voice = x_train.shape[1], x_train.shape[2]

      # Reshape data
      input_train_voice = x_train.reshape(x_train.shape[0], img_height_voice, img_width_voice, num_channels)
      input_test_voice = x_test.reshape(x_test.shape[0], img_height_voice, img_width_voice, num_channels)
      input_shape_voice = (img_height_voice, img_width_voice, num_channels)

      # Parse numbers as floats
      input_train_voice = input_train_voice.astype('float32')
      input_test_voice = input_test_voice.astype('float32')

      # Normalize data
      input_train_voice = input_train_voice / 8
      input_test_voice = input_test_voice / 8

      # Encoder parameters/architecture
      i_voice       = Input(shape=input_shape_voice, name='encoder_input_voice')
      cx_voice      = Conv2D(filters=8, kernel_size=[7,1], strides=[3,1], padding='same', activation='relu')(i_voice)
      cx_voice      = BatchNormalization()(cx_voice)
      cx_voice     = Conv2D(filters=16, kernel_size=[7,1], strides=[3,1], padding='same', activation='relu')(cx_voice)
      cx_voice      = BatchNormalization()(cx_voice)
      x_voice      = Flatten()(cx_voice)
      x_voice       = Dense(512, activation='relu')(x_voice)
      x_voice       = BatchNormalization()(x_voice)
      mu_voice      = Dense(latent_dim_voice, name='latent_mu_voice')(x_voice)
      sigma_voice   = Dense(latent_dim_voice, name='latent_sigma_voice')(x_voice)
      conv_shape_voice = K.int_shape(cx_voice)

      # Reparametrization trick
      def sample_z_voice(args):
        mu, sigma = args
        batch     = K.shape(mu)[0]
        dim       = K.int_shape(mu)[1]
        eps       = K.random_normal(shape=(batch, dim))
        return mu + K.exp(sigma / 2) * eps

      z_voice = Lambda(sample_z_voice, output_shape=(latent_dim_voice, ), name='z')([mu_voice, sigma_voice])

      # Activation of encoder
      encoder_voice = Model(i_voice, [mu_voice, sigma_voice, z_voice], name='encoder_voice')
      #encoder_voice.summary()

      # Decoder parameters/architecture
      d_i_voice   = Input(shape=(latent_dim_voice, ), name='decoder_input_voice')
      x_voice     = Dense(conv_shape_voice[1] * conv_shape_voice[2] * conv_shape_voice[3], activation='relu')(d_i_voice)
      x_voice     = BatchNormalization()(x_voice)
      x_voice    = Reshape((conv_shape_voice[1], conv_shape_voice[2], conv_shape_voice[3]))(x_voice)
      cx_voice    = Conv2DTranspose(filters=16, kernel_size=[7,1], strides=[3,1], padding='same', activation='relu')(x_voice)
      cx_voice    = BatchNormalization()(cx_voice)
      cx_voice    = Conv2DTranspose(filters=8, kernel_size=[7,1], strides=[3,1], padding='same',  activation='relu')(cx_voice)
      cx_voice    = BatchNormalization()(cx_voice)
      o_voice     = Conv2DTranspose(filters=num_channels, kernel_size=[7,1], activation='sigmoid', padding='same', name='decoder_output_voice')(cx_voice)

      # Activation of decoder
      decoder_voice = Model(d_i_voice, o_voice, name='decoder_voice')
      #decoder_voice.summary()

    
      # Activation of VAE model
      vae_outputs_voice = decoder_voice(encoder_voice(i_voice)[2])
      vae_voice         = Model(i_voice, vae_outputs_voice, name='vae_voice')
      #vae_voice.summary()

      # Defining the loss function
      def kl_reconstruction_loss_voice(true, pred):
        # Reconstruction loss
        reconstruction_loss = binary_crossentropy(K.flatten(true), K.flatten(pred)) * img_width_voice * img_height_voice
        # KL divergence loss
        kl_loss = 1 + sigma_voice - K.square(mu_voice) - K.exp(sigma_voice)
        kl_loss = K.sum(kl_loss, axis=-1)
        kl_loss *= -0.5
        # Total loss = 50% rec + 50% KL divergence loss
        return K.mean(reconstruction_loss + kl_loss)

      # Compile VAE
      vae_voice.compile(optimizer='adam', loss=kl_reconstruction_loss_voice)

      # Train autoencoder
      hist = vae_voice.fit(input_train_voice, input_train_voice, epochs = no_epochs, batch_size = batch_size, validation_split = validation_split)

      golden_size = lambda width: (width, 2. * width / (1 + np.sqrt(5)))
      #NELBO
      fig, ax = plt.subplots(figsize=golden_size(6))

      hist_df = pd.DataFrame(hist.history)
      hist_df.plot(ax=ax)

      ax.set_ylabel('NELBO')
      ax.set_xlabel('# epochs')

      ax.set_ylim(.99*hist_df[1:].values.min(), 
                  1.1*hist_df[1:].values.max())


      plt.show()

      # Latent space visualization for the first two dimensions
      def viz_latent_space_voice(encoder_voice, data):
        input_data, target_data = data
        mu, _, _ = encoder_voice.predict(input_data)
        plt.figure(figsize=(8, 10))
        plt.scatter(mu[:, 0], mu[:, 1], c=target_data)
        plt.xlabel('z - dim 1')
        plt.ylabel('z - dim 2')
        plt.colorbar()
        plt.show()

      data_v = (input_test_voice, y_test)
      viz_latent_space_voice(encoder_voice, data_v)

      # Latent space visualization for the first n dimensions, where 2 < n <= 128
      store = []
      for i in range(len(encoder_voice.predict(input_test_voice)[0])):
        store.append(encoder_voice.predict(input_test_voice)[0][i])

      storeFrame = pd.DataFrame(store)
      #storeFrame = storeFrame.transpose()

      pp = sns.pairplot(storeFrame.loc[:, 0:9], size=1.8, aspect=1.8,
                        plot_kws=dict(edgecolor="k", linewidth=0.5),
                        diag_kind="kde", diag_kws=dict(shade=True))

      fig = pp.fig 
      fig.subplots_adjust(top=0.93, wspace=0.3)
      t = fig.suptitle('Voice Latent Space Representation Pairwise Plots', fontsize=14)

      return encoder_voice, decoder_voice, vae_voice, hist.history['loss'], hist.history['val_loss']

Acknowledgements