Machine Learning
Emmmm, this semester I finished Andrew Ng's Machine Learning course on Coursera. I had never paid much attention to this course, but I accidentally signed up, and on a whim applied for financial aid, which meant that after finishing I would get the certificate for free (it normally sells for a few hundred RMB). The course also came with a legitimate copy of Matlab Online. This series of happy accidents (read: petty bargain-hunting) got me started on it. And then — it turned out to be great! It really is an excellent introduction to machine learning; no wonder so many people recommend it.
The course notes on Coursera include a summary of every chapter, and the summaries are very good — well worth reading once you finish. But I still prefer writing my own, so while watching the videos I wrote notes covering almost the entire course:
clownote.github.io/categories/…
This article is a condensed version of those notes.
Here I mainly describe the various algorithms and extract their rough code implementations from the programming assignments, for quick reference later. The material the instructor spends great effort on — designing, tuning, and debugging machine-learning systems, plus all the clever tricks — I skip entirely; for those you really need to watch the videos to get the essence (I'm definitely not just being lazy). Let's begin.
Supervised Learning
In supervised learning, the training data provides both the inputs x and the target outputs y.
Regression
Predicts values over a continuous range (e.g. an interval of real numbers).
Mathematical model:
- Hypothesis: $h_\theta(x) = \theta^T x = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n$
- Parameters to learn: $\theta = [\theta_0, \theta_1, \ldots, \theta_n]^T$
- Cost function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
👉 Code implementation:
function J = computeCostMulti(X, y, theta)
%COMPUTECOSTMULTI Compute cost for linear regression with multiple variables
% J = COMPUTECOSTMULTI(X, y, theta) computes the cost of using theta as the
% parameter for linear regression to fit the data points in X and y
m = length(y); % number of training examples
J = 0;
J = 1 / (2*m) * (X*theta - y)' * (X*theta - y);
% or, less vectorized:
% predictions = X * theta;
% sqrErrors = (predictions - y) .^ 2;
% J = 1 / (2*m) * sum(sqrErrors);
end
- Optimization objective: find the $\theta$ that minimizes $J(\theta)$.
Solution methods:
Gradient descent
(Here we only discuss batch gradient descent.)
Repeat until convergence, updating all $\theta_j$ simultaneously: $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
Vectorized: $\theta := \theta - \frac{\alpha}{m} X^T (X\theta - y)$
👉 Code implementation:
function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
%GRADIENTDESCENTMULTI Performs gradient descent to learn theta
% theta = GRADIENTDESCENTMULTI(x, y, theta, alpha, num_iters) updates theta by
% taking num_iters gradient steps with learning rate alpha
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
predictions = X * theta;
errors = (predictions - y);
theta = theta - alpha / m * (X' * errors);
% Save the cost J in every iteration
J_history(iter) = computeCostMulti(X, y, theta);
end
end
Normal equation
$\theta = (X^T X)^{-1} X^T y$
👉 Code implementation:
function [theta] = normalEqn(X, y)
%NORMALEQN Computes the closed-form solution to linear regression
% NORMALEQN(X,y) computes the closed-form solution to linear
% regression using the normal equations.
theta = zeros(size(X, 2), 1);
theta = pinv(X' * X) * X' * y;
end
Here we take the pseudoinverse (`pinv`) so the computation works even when $X^T X$ is non-invertible. Non-invertibility is usually caused by:
- Redundant features, i.e. two or more of the given features are linearly dependent; keeping one and deleting the others solves it. (e.g. there is the size of a house in feet² and the same size in meters², where we know that 1 meter = 3.28 feet)
- Too many features (e.g. $m \le n$). Delete some unimportant features (consider the PCA algorithm).
Gradient descent vs. normal equation

| | Gradient Descent | Normal Equation |
|---|---|---|
| Need to choose $\alpha$ | ✅ | - |
| Needs many iterations | ✅ | - |
| Time complexity | $O(kn^2)$ | $O(n^3)$ to compute $\mathrm{pinv}(X^T X)$ |
| When $n$ is very large | still works | very slow or even infeasible |

In practice, when $n$ reaches roughly $10^4$ or more we usually prefer gradient descent; otherwise the normal equation generally performs better.
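The comparison above is easy to sanity-check. Below is a minimal NumPy sketch (Python rather than the course's Octave; the data and names are made up) fitting the same noise-free line with both methods:

```python
import numpy as np

# Made-up noise-free data: y = 4 + 3*x, so an exact fit exists.
X = np.c_[np.ones(5), np.arange(5, dtype=float)]  # design matrix with bias column
y = 4 + 3 * np.arange(5, dtype=float)

# Normal equation: theta = pinv(X'X) X'y
theta_ne = np.linalg.pinv(X.T @ X) @ X.T @ y

# Batch gradient descent: theta := theta - (alpha/m) * X'(X*theta - y)
theta_gd = np.zeros(2)
alpha, m = 0.1, len(y)
for _ in range(5000):
    theta_gd -= alpha / m * (X.T @ (X @ theta_gd - y))

print(theta_ne, theta_gd)  # both approach [4, 3]
```

With only two parameters the normal equation is the obvious winner here; the trade-off flips as n grows.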
Note: feature scaling
Gradient descent runs faster when the input values all lie roughly within a common range; for example, we can bring every value into $[-1, 1]$. At the same time, we want to keep the inputs from differing too wildly in magnitude (e.g. having both 0.000001 and 1 among the input values hurts the efficiency of gradient descent).
In practice, we usually aim to keep every variable roughly within $-1 \le x_i \le 1$.
To achieve this, we do the following:
- Feature scaling
- Mean normalization
Combining the two operations:
$x_i := \dfrac{x_i - \mu_i}{s_i}$, where $\mu_i$ is the mean of feature $i$ and $s_i$ is its range (or standard deviation).
👉 Code implementation:
function [X_norm, mu, sigma] = featureNormalize(X)
%FEATURENORMALIZE Normalizes the features in X
% FEATURENORMALIZE(X) returns a normalized version of X where
% the mean value of each feature is 0 and the standard deviation
% is 1. This is often a good preprocessing step to do when
% working with learning algorithms.
X_norm = X;
mu = zeros(1, size(X, 2));
sigma = zeros(1, size(X, 2));
mu = mean(X);
sigma = std(X);
X_norm = (X - mu) ./ sigma;
end
Classification
Predicts values from a discrete set (e.g. 0 or 1; 0/1/2/3).
Logistic regression
Hypothesis: $h_\theta(x) = g(\theta^T x)$
where $g(z) = \dfrac{1}{1 + e^{-z}}$ is called the Sigmoid function, or logistic function; its graph is an S-shaped curve rising from 0 to 1.
👉 Code implementation:
function g = sigmoid(z)
%SIGMOID Compute sigmoid function
% J = SIGMOID(z) computes the sigmoid of z.
g = 1.0 ./ (1.0 + exp(-z));
end
Substituting $g$ gives: $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^T x}}$
The output of $h_\theta(x)$ is the estimated probability that $y = 1$: $h_\theta(x) = P(y = 1 \mid x; \theta)$, and $P(y = 0 \mid x; \theta) + P(y = 1 \mid x; \theta) = 1$.
Decision boundary: the decision boundary of logistic regression is the hypersurface splitting the space into a region where $y = 1$ is predicted and a region where $y = 0$ is predicted.
The decision boundary is determined by the hypothesis: to classify, we use the output of $h_\theta$ to decide whether the result is 0 or 1, taking 0.5 as the cutoff: predict $y = 1$ if $h_\theta(x) \ge 0.5$, otherwise predict $y = 0$.
By the properties of the Sigmoid function, this is equivalent to: predict $y = 1$ if and only if $\theta^T x \ge 0$.
So for a given $\theta$, e.g. $\theta = [-3, 1, 1]^T$, we have $y = 1$ if and only if $x_1 + x_2 \ge 3$, and the decision boundary is the line $x_1 + x_2 = 3$.
With polynomial features, the decision boundary can also take more complex shapes, such as a circle like $x_1^2 + x_2^2 = 1$.
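The equivalence between thresholding $h_\theta(x)$ at 0.5 and thresholding $\theta^T x$ at 0 can be verified numerically; a small NumPy sketch (the $\theta$ and the points are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters: theta = [-3, 1, 1] gives the boundary x1 + x2 = 3.
theta = np.array([-3.0, 1.0, 1.0])

points = np.array([[1.0, 1.0],   # x1 + x2 = 2 < 3 -> predict 0
                   [2.0, 2.0],   # x1 + x2 = 4 > 3 -> predict 1
                   [0.0, 3.5]])  # x1 + x2 = 3.5 > 3 -> predict 1
X = np.c_[np.ones(len(points)), points]   # add the bias column

pred_by_h = (sigmoid(X @ theta) >= 0.5).astype(int)  # threshold h at 0.5
pred_by_z = (X @ theta >= 0).astype(int)             # threshold theta'x at 0
print(pred_by_h)  # -> [0 1 1], and pred_by_z agrees
```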
Logistic regression model (cost function): $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$
Vectorized: $J(\theta) = \frac{1}{m} \left( -y^T \log(h) - (1 - y)^T \log(1 - h) \right)$, where $h = g(X\theta)$
Gradient descent: $\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ (identical in form to linear regression)
Vectorized: $\theta := \theta - \frac{\alpha}{m} X^T \left( g(X\theta) - y \right)$
👉 Code implementation (using Advanced Optimization):
- Provide the cost function:
function [J, grad] = costFunction(theta, X, y)
%COSTFUNCTION Compute cost and gradient for logistic regression
% J = COSTFUNCTION(theta, X, y) computes the cost of using theta as the
% parameter for logistic regression and the gradient of the cost
% w.r.t. to the parameters.
m = length(y); % number of training examples
J = 0;
grad = zeros(size(theta));
h = sigmoid(X*theta);
J = 1/m * (-y'*log(h) - (1-y)'*log(1-h));
grad = 1/m * X'*(h-y);
end
- Call an Advanced Optimization function to solve the optimization problem:
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@(t) costFunction(t, X, y), initialTheta, options);
Multi-class classification
We run a series of binary (logistic) classifiers, one per class, and pick the class whose classifier is most confident (one-vs-all):
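Once the K binary classifiers are trained, one-vs-all prediction is just "score every class and take the argmax". A NumPy sketch (the `all_theta` matrix here is hand-picked, not learned, purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(all_theta, X):
    """Pick the class whose binary classifier is most confident."""
    X = np.c_[np.ones(X.shape[0]), X]   # add bias column
    probs = sigmoid(X @ all_theta.T)    # one column of h per class
    return np.argmax(probs, axis=1)

# Hypothetical pre-trained parameters, one row per class; class k's
# classifier fires when feature k is large.
all_theta = np.array([[-1.0, 4.0, 0.0, 0.0],
                      [-1.0, 0.0, 4.0, 0.0],
                      [-1.0, 0.0, 0.0, 4.0]])

X = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
print(predict_one_vs_all(all_theta, X))  # -> [0 2]
```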
Note: overfitting
An overfit model predicts the training data very well but generalizes poorly to unseen examples.
Ways to address overfitting:
- Reduce the number of features (e.g. with PCA)
- Regularization: add a penalty on the magnitude of $\theta$ to the cost function: $J(\theta) = \cdots + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$
Note that $\theta_0$, the constant term we add ourselves, should not be regularized.
👉 Code implementation:
function [J, grad] = lrCostFunction(theta, X, y, lambda)
%LRCOSTFUNCTION Compute cost and gradient for logistic regression with
%regularization
% J = LRCOSTFUNCTION(theta, X, y, lambda) computes the cost of using
% theta as the parameter for regularized logistic regression and the
% gradient of the cost w.r.t. to the parameters.
m = length(y); % number of training examples
J = 0;
grad = zeros(size(theta));
% Unregularized cost function & gradient for logistic regression
h = sigmoid(X * theta);
J = 1/m * (-y'*log(h) - (1-y)'*log(1-h));
grad = 1/m * X'*(h-y);
% Regularize
temp = theta;
temp(1) = 0;
J = J + lambda/(2*m) * sum(temp.^2);
grad = grad + lambda/m * temp;
grad = grad(:);
end
Neural Networks
The first layer is the data, called the input layer, which can be viewed as $a^{(1)} = x$; in the middle are a number of hidden layers; the final layer produces the hypothesis and is called the output layer.
With $L$ layers, forward propagation is: $z^{(l+1)} = \Theta^{(l)} a^{(l)}$, $a^{(l+1)} = g(z^{(l+1)})$, adding a bias unit $a_0^{(l)} = 1$ at each layer, and $h_\Theta(x) = a^{(L)}$.
For example, with a single-layer neural network we can build networks expressing logical functions (AND, OR, NOT, ...):
Multi-class classification
Fitting a neural network
| Notation | Meaning |
|---|---|
| $L$ | total number of layers in the network |
| $s_l$ | number of units in layer $l$ (not counting the bias unit) |
| $K$ | number of output units |

Cost function:

$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log\big(h_\Theta(x^{(i)})\big)_k + (1 - y_k^{(i)}) \log\big(1 - (h_\Theta(x^{(i)}))_k\big) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(\Theta_{j,i}^{(l)}\big)^2$

Backpropagation: with $\delta^{(L)} = a^{(L)} - y$, propagate the errors backwards:

$\delta^{(l)} = \big((\Theta^{(l)})^T \delta^{(l+1)}\big) .* g'(z^{(l)}), \quad g'(z^{(l)}) = a^{(l)} .* (1 - a^{(l)})$

Note: `.*` above denotes element-wise multiplication, as in Matlab/Octave.
Using backpropagation:
First, a few techniques involved:
- Parameter unrolling: to use the optimization functions, we need to unroll all the matrices and concatenate them into one long vector:
thetaVector = [ Theta1(:); Theta2(:); Theta3(:) ];
deltaVector = [ D1(:); D2(:); D3(:) ];
After obtaining the optimized result, reshape back into the original matrices (here Theta1 and Theta2 are 10×11 and Theta3 is 1×11):
Theta1 = reshape(thetaVector(1:110),10,11)
Theta2 = reshape(thetaVector(111:220),10,11)
Theta3 = reshape(thetaVector(221:231),1,11)
- Gradient checking: using the two-sided difference over a small interval such as $\epsilon = 10^{-4}$, $\frac{\partial}{\partial \theta_i} J(\theta) \approx \frac{J(\ldots, \theta_i + \epsilon, \ldots) - J(\ldots, \theta_i - \epsilon, \ldots)}{2\epsilon}$, we can check whether the gradient computed by backpropagation is correct (if it is, gradApprox ≈ deltaVector holds). Code implementation:
epsilon = 1e-4;
for i = 1 : n
thetaPlus = theta;
thetaPlus(i) += epsilon;
thetaMinus = theta;
thetaMinus(i) -= epsilon;
gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*epsilon);
end
- Random initialization: at the start, initialize each element of $\Theta$ to a random value within $[-\epsilon, \epsilon]$ (this $\epsilon$ is unrelated to the one in gradient checking). Code implementation:
% If the dimensions of Theta1 are 10x11, Theta2 10x11 and Theta3 1x11:
Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Combining the techniques above with backpropagation, we get a method for training a neural network:
- Randomly initialize the weights
- Forward propagate to get $h_\Theta(x^{(i)})$ for every $x^{(i)}$
- Compute the cost function
- Backpropagate to compute the partial derivatives
- Use gradient checking to verify the backpropagation results, then disable the checking
- Use gradient descent or an optimization function to minimize the cost over $\Theta$
👉 Code implementation:
- Random initialization
function W = randInitializeWeights(L_in, L_out)
%RANDINITIALIZEWEIGHTS Randomly initialize the weights of a layer with L_in
%incoming connections and L_out outgoing connections
% W = RANDINITIALIZEWEIGHTS(L_in, L_out) randomly initializes the weights
% of a layer with L_in incoming connections and L_out outgoing
% connections.
%
% Note that W should be set to a matrix of size(L_out, 1 + L_in) as
% the first column of W handles the "bias" terms
%
W = zeros(L_out, 1 + L_in);
% epsilon_init = 0.12
epsilon_init = sqrt(6 / (L_in + L_out));
W = rand(L_out, 1 + L_in) * (2 * epsilon_init) - epsilon_init;
end
- Cost computation
function [J grad] = nnCostFunction(nn_params, ...
input_layer_size, ...
hidden_layer_size, ...
num_labels, ...
X, y, lambda)
%NNCOSTFUNCTION Implements the neural network cost function for a two layer
%neural network which performs classification
% [J grad] = NNCOSTFUNCTON(nn_params, hidden_layer_size, num_labels, ...
% X, y, lambda) computes the cost and gradient of the neural network. The
% parameters for the neural network are "unrolled" into the vector
% nn_params and need to be converted back into the weight matrices.
%
% The returned parameter grad should be a "unrolled" vector of the
% partial derivatives of the neural network.
%
% Reshape nn_params back into the parameters Theta_1 and Theta_2, the weight matrices
% for our 2 layer neural network
Theta_1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
hidden_layer_size, (input_layer_size + 1));
Theta_2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
num_labels, (hidden_layer_size + 1));
% Setup some useful variables
m = size(X, 1);
K = num_labels;
J = 0;
Theta_1_grad = zeros(size(Theta_1));
Theta_2_grad = zeros(size(Theta_2));
% y(5000x1) -> Y(5000x10)
Y = zeros(m, K);
for i = 1 : m
Y(i, y(i)) = 1;
end
% Feedforward
a_1 = X;
a_1_bias = [ones(m, 1), a_1];
z_2 = a_1_bias * Theta_1';
a_2 = sigmoid(z_2);
a_2_bias = [ones(m, 1), a_2];
z_3 = a_2_bias * Theta_2';
a_3 = sigmoid(z_3);
h = a_3;
% Cost Function
% for i = 1 : K
%     yK = Y(:, i);
%     hK = h(:, i);
%     J += 1/m * (-yK'*log(hK) - (1-yK)'*log(1-hK));
% end
% J can be computed more elegantly element-wise:
J = 1/m * sum(sum((-Y.*log(h) - (1-Y).*log(1-h))));
% Regularize
J = J + lambda/(2*m) * (sum(sum(Theta_1(:, 2:end).^2)) + sum(sum(Theta_2(:, 2:end).^2)));
% Backpropagation
delta_3 = a_3 - Y;
delta_2 = (delta_3 * Theta_2) .* sigmoidGradient([ones(m, 1), z_2]);
% sigmoidGradient: return g = sigmoid(z) .* (1 - sigmoid(z));
delta_2 = delta_2(:, 2:end);
Delta_1 = delta_2' * a_1_bias;
Delta_2 = delta_3' * a_2_bias;
Theta_1_grad = Delta_1 ./ m + lambda/m * [zeros(size(Theta_1, 1), 1), Theta_1(:, 2:end)];
Theta_2_grad = Delta_2 ./ m + lambda/m * [zeros(size(Theta_2, 1), 1), Theta_2(:, 2:end)];
% Unroll gradients
grad = [Theta_1_grad(:) ; Theta_2_grad(:)];
end
- Prediction
function p = predict(Theta1, Theta2, X)
%PREDICT Predict the label of an input given a trained neural network
% p = PREDICT(Theta1, Theta2, X) outputs the predicted label of X given the
% trained weights of a neural network (Theta1, Theta2)
% Useful values
m = size(X, 1);
num_labels = size(Theta2, 1);
p = zeros(size(X, 1), 1);
h1 = sigmoid([ones(m, 1) X] * Theta1');
h2 = sigmoid([ones(m, 1) h1] * Theta2');
[dummy, p] = max(h2, [], 2);
end
- Driver script
input_layer_size = 400; % 20x20 Input Images of Digits
hidden_layer_size = 25; % 25 hidden units
num_labels = 10; % 10 labels, from 1 to 10
% (note that we have mapped "0" to label 10)
fprintf('\nInitializing Neural Network Parameters ...\n')
load('Xy.mat');
initial_Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size);
initial_Theta2 = randInitializeWeights(hidden_layer_size, num_labels);
initial_nn_params = [initial_Theta1(:) ; initial_Theta2(:)];
fprintf('\nTraining Neural Network... \n')
% Change MaxIter to a larger value to see how more training helps.
options = optimset('MaxIter', 1000);
lambda = 1;
% Create "short hand" for the cost function to be minimized
costFunction = @(p) nnCostFunction(p, ...
input_layer_size, ...
hidden_layer_size, ...
num_labels, X, y, lambda);
% Now, costFunction is a function that takes in only one argument (the
% neural network parameters)
[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);
% Obtain Theta1 and Theta2 back from nn_params
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
num_labels, (hidden_layer_size + 1));
pred = predict(Theta1, Theta2, X);
fprintf('\nTraining Set Accuracy: %f\n', mean(double(pred == y)) * 100);
Support Vector Machines
Optimization objective: $\min_\theta\; C \sum_{i=1}^{m} \left[ y^{(i)} \mathrm{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)})\, \mathrm{cost}_0(\theta^T x^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^2$
When $C$ is fairly large, this objective drives the first sum to zero, so it becomes: $\min_\theta \frac{1}{2} \sum_{j=1}^{n} \theta_j^2$ subject to $\theta^T x^{(i)} \ge 1$ if $y^{(i)} = 1$ and $\theta^T x^{(i)} \le -1$ if $y^{(i)} = 0$.
From Euclidean geometry: $\theta^T x^{(i)} = p^{(i)} \, \lVert \theta \rVert$, where $p^{(i)}$ is the projection of $x^{(i)}$ onto the direction of $\theta$.
So the objective can be expressed as: $\min_\theta \frac{1}{2} \lVert \theta \rVert^2$ subject to $p^{(i)} \lVert \theta \rVert \ge 1$ (for $y^{(i)} = 1$) or $\le -1$ (for $y^{(i)} = 0$).
The SVM therefore chooses the boundary with the largest margin:
Kernels
Faced with a classification problem that is not linearly separable:
we could fit with polynomial features, e.g. predict $y = 1$ when $\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 + \theta_4 x_1^2 + \cdots \ge 0$;
but juggling that many polynomial terms is unwieldy, so we consider the following approach:
predict $y = 1$ when $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 + \cdots \ge 0$, where the $f_i$ are new features.
We replace the polynomial terms with the $f_i$, avoiding the trouble of high-order terms. So how do we determine the $f_i$? The rough idea is as follows:
For ease of description, suppose we only have $x_1, x_2$ and only intend to construct $f_1, f_2, f_3$. Ignoring $x_0$ (the bias term), we pick 3 points in the $x_1$-$x_2$ plane, denote them $l^{(1)}, l^{(2)}, l^{(3)}$, and call them landmarks. For any given $x$, we obtain the set of $f_i$ by computing its proximity to each landmark: $f_i = \mathrm{similarity}(x, l^{(i)})$
The specific similarity function here is called the kernel function, and there are many kinds of kernels. The one written here, $f_i = \exp\left( -\frac{\lVert x - l^{(i)} \rVert^2}{2\sigma^2} \right)$, is the very commonly used Gaussian kernel; its code implementation appears later in this section.
With this construction, we know that for a given $x$, each landmark $l^{(i)}$ yields an $f_i$ satisfying:
- $f_i \approx 1$ when $x$ is close to $l^{(i)}$
- $f_i \approx 0$ when $x$ is far from $l^{(i)}$
The choice of $\sigma$ affects how fast $f_i$ falls as $x$ moves away from $l^{(i)}$: the larger $\sigma$ is, the slower the decrease:
Using kernels, we can make predictions of this kind:
predict 1 if and only if the given $x$ is near $l^{(1)}$ or $l^{(2)}$, and predict 0 otherwise.
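The landmark construction is easy to try out; a NumPy sketch (the landmarks, query point, and σ are made up):

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    """f = exp(-||x - l||^2 / (2 * sigma^2))"""
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

# Made-up landmarks l1, l2, l3 and a query point x sitting exactly on l1.
landmarks = np.array([[3.0, 5.0], [1.0, 1.0], [6.0, 2.0]])
x = np.array([3.0, 5.0])
sigma = 1.0

f = np.array([gaussian_kernel(x, l, sigma) for l in landmarks])
print(f)  # f1 = 1.0 (x is on l1); f2 and f3 are near 0 (x is far from l2, l3)
```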
Using kernels in the SVM
- Given the training set $(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})$
- Choose the landmarks as the training examples themselves: $l^{(i)} = x^{(i)}$, $i = 1, \ldots, m$
- For an example $x$, compute the kernel features: $f_i = \mathrm{similarity}(x, l^{(i)})$
- Let $f = [f_0, f_1, \ldots, f_m]^T$, where $f_0 = 1$.
Prediction:
- Given $x$, compute $f$
- Predict $y = 1$ if $\theta^T f \ge 0$
Training: $\min_\theta\; C \sum_{i=1}^{m} \left[ y^{(i)} \mathrm{cost}_1(\theta^T f^{(i)}) + (1 - y^{(i)})\, \mathrm{cost}_0(\theta^T f^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{m} \theta_j^2$
Implementation:
We can use libraries such as liblinear or libsvm to obtain the SVM parameters $\theta$. To use these libraries, we generally need to do the following:
- Choose the parameter $C$
- Choose the kernel function
  - No kernel (i.e. a linear kernel, which amounts to logistic regression: predict $y = 1$ if $\theta^T x \ge 0$), suitable when n is large and m is small (to avoid overfitting)
  - Gaussian kernel (with the extra parameter $\sigma^2$), suitable when m is large and n is small (it can fit more complex non-linear boundaries)
- Provide the kernel function (taking the Gaussian kernel as the example):
Note: be sure to do feature scaling before using the Gaussian kernel!
👉 Code implementation without a kernel:
function sim = linearKernel(x1, x2)
%LINEARKERNEL returns a linear kernel between x1 and x2
% sim = linearKernel(x1, x2) returns a linear kernel between x1 and x2
% and returns the value in sim
% Ensure that x1 and x2 are column vectors
x1 = x1(:); x2 = x2(:);
% Compute the kernel
sim = x1' * x2; % dot product
end
👉 Code implementation of the Gaussian kernel:
function sim = gaussianKernel(x1, x2, sigma)
%RBFKERNEL returns a radial basis function kernel between x1 and x2
% sim = gaussianKernel(x1, x2) returns a gaussian kernel between x1 and x2
% and returns the value in sim
% Ensure that x1 and x2 are column vectors
x1 = x1(:); x2 = x2(:);
sim = 0;
sim = exp(-sum((x1 - x2).^2) / (2 * sigma ^ 2));
end
👉 SVM example:
% Load X, y, Xtest and ytest
load('data.mat');
% SVM Parameters
C = 1; % C = 1 ~ 100 is fine
sigma = 0.1; % sigma = 0.03 ~ 0.1 gives somewhat good boundary, less is better
% We set the tolerance and max_passes lower here so that the code will run
% faster. However, in practice, you will want to run the training to
% convergence.
model= svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma));
p = svmPredict(model, Xtest);
fprintf('Test Accuracy: %f\n', mean(double(p == ytest)) * 100);
Multi-class classification
Logistic regression vs. neural networks vs. SVM
$n$ = number of features
$m$ = number of training examples
- n large relative to m (e.g. $n \ge m$)
  - logistic regression, or SVM without a kernel
- n small, m intermediate
  - SVM with a Gaussian kernel
- n small, m large
  - create/add features, then use logistic regression or SVM without a kernel
A neural network can usually handle any of the cases above, but may be comparatively slow to train.
Unsupervised Learning
Unsupervised learning is given only the x data, without y.
K-Means Clustering
Automatically partitions a pile of things into K groups.
Input:
- $K$: the number of clusters
- $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$: the training set
Output:
- $K$ clusters
K-Means algorithm: repeat { (1) cluster assignment: for each $i$, set $c^{(i)} :=$ the index of the centroid closest to $x^{(i)}$; (2) move centroids: for each $k$, set $\mu_k :=$ the mean of the points assigned to cluster $k$ }
Cost function: $J(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \lVert x^{(i)} - \mu_{c^{(i)}} \rVert^2$
Optimization objective: minimize $J$ over the assignments $c^{(i)}$ and the centroids $\mu_k$.
Algorithm for getting a reasonably good solution (not necessarily the optimal one): run K-Means many times (e.g. 100) with different random initializations and keep the run with the lowest cost $J$.
Choosing $K$:
- Often easily determined from the needs of the actual problem;
- Otherwise pick the "elbow" of the cost-vs-$K$ curve:
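The elbow method can be sketched end to end with a toy K-Means. Everything below (the tiny Lloyd-iteration implementation, the blob data, all names) is made up for illustration, in Python rather than the course's Octave:

```python
import numpy as np

def kmeans_distortion(X, K, iters=20, n_init=30):
    """Minimal K-Means (Lloyd's algorithm) with random restarts; returns the
    lowest final cost J = average squared distance to the assigned centroid."""
    best = np.inf
    for seed in range(n_init):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), K, replace=False)].copy()
        for _ in range(iters):
            d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            idx = d.argmin(axis=1)
            for k in range(K):
                if np.any(idx == k):
                    centroids[k] = X[idx == k].mean(axis=0)
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d.min(axis=1).mean())
    return best

# Made-up data: three well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.1, size=(30, 2))
               for c in ([0.0, 0.0], [5.0, 5.0], [10.0, 0.0])])

costs = {K: kmeans_distortion(X, K) for K in range(1, 6)}
# J drops sharply until K = 3 (the true number of blobs), then flattens:
# the "elbow" is at K = 3.
```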
👉 Code implementation
- Find the closest centroids:
function idx = findClosestCentroids(X, centroids)
%FINDCLOSESTCENTROIDS computes the centroid memberships for every example
% idx = FINDCLOSESTCENTROIDS (X, centroids) returns the closest centroids
% in idx for a dataset X where each row is a single example. idx = m x 1
% vector of centroid assignments (i.e. each entry in range [1..K])
%
% Set K
K = size(centroids, 1);
idx = zeros(size(X,1), 1);
for i = 1 : size(X, 1)
min_j = 0;
min_l = Inf;
for j = 1 : size(centroids, 1)
l = sum((X(i, :) - centroids(j, :)) .^ 2);
if l <= min_l
min_j = j;
min_l = l;
end
end
idx(i) = min_j;
end
end
- Compute the centroids:
function centroids = computeCentroids(X, idx, K)
%COMPUTECENTROIDS returns the new centroids by computing the means of the
%data points assigned to each centroid.
% centroids = COMPUTECENTROIDS(X, idx, K) returns the new centroids by
% computing the means of the data points assigned to each centroid. It is
% given a dataset X where each row is a single data point, a vector
% idx of centroid assignments (i.e. each entry in range [1..K]) for each
% example, and K, the number of centroids. You should return a matrix
% centroids, where each row of centroids is the mean of the data points
% assigned to it.
%
% Useful variables
[m n] = size(X);
centroids = zeros(K, n);
for i = 1 : K
ck = find(idx == i);
centroids(i, :) = sum(X(ck,:)) / size(ck, 1);
end
end
- Run K-Means:
function [centroids, idx] = runkMeans(X, initial_centroids, ...
max_iters, plot_progress)
%RUNKMEANS runs the K-Means algorithm on data matrix X, where each row of X
%is a single example
% [centroids, idx] = RUNKMEANS(X, initial_centroids, max_iters, ...
% plot_progress) runs the K-Means algorithm on data matrix X, where each
% row of X is a single example. It uses initial_centroids used as the
% initial centroids. max_iters specifies the total number of interactions
% of K-Means to execute. plot_progress is a true/false flag that
% indicates if the function should also plot its progress as the
% learning happens. This is set to false by default. runkMeans returns
% centroids, a Kxn matrix of the computed centroids and idx, a m x 1
% vector of centroid assignments (i.e. each entry in range [1..K])
% If plot_progress is used, an extra plotting function is required; it is not given here.
%
% Set default value for plot progress
if ~exist('plot_progress', 'var') || isempty(plot_progress)
plot_progress = false;
end
% Plot the data if we are plotting progress
if plot_progress
figure;
hold on;
end
% Initialize values
[m n] = size(X);
K = size(initial_centroids, 1);
centroids = initial_centroids;
previous_centroids = centroids;
idx = zeros(m, 1);
% Run K-Means
for i=1:max_iters
% Output progress
fprintf('K-Means iteration %d/%d...\n', i, max_iters);
if exist('OCTAVE_VERSION')
fflush(stdout);
end
% For each example in X, assign it to the closest centroid
idx = findClosestCentroids(X, centroids);
% Optionally, plot progress here
if plot_progress
plotProgresskMeans(X, centroids, previous_centroids, idx, K, i);
previous_centroids = centroids;
fprintf('Press enter to continue.\n');
input("...");
end
% Given the memberships, compute new centroids
centroids = computeCentroids(X, idx, K);
end
% Hold off if we are plotting progress
if plot_progress
hold off;
end
end
- Driver script:
% Load an example dataset
load('data.mat');
% Settings for running K-Means
K = 3;
max_iters = 10;
% For consistency, here we set centroids to specific values
% but in practice you want to generate them automatically, such as by
% settings them to be random examples (as can be seen in
% kMeansInitCentroids).
initial_centroids = [3 3; 6 2; 8 5];
% Run K-Means algorithm. The 'true' at the end tells our function to plot
% the progress of K-Means
[centroids, idx] = runkMeans(X, initial_centroids, max_iters, true);
fprintf('\nK-Means Done.\n\n');
PCA (Dimensionality Reduction)
Principal component analysis: project n-dimensional data down to k dimensions, dropping the unimportant parts (k ≤ n).
PCA algorithm:
1) Data preprocessing
Training set: $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$
Preprocess (feature scaling & mean normalization):
- $\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}$
- Replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$
- If features are on different scales, also divide by $s_j$ (the range or standard deviation): $x_j^{(i)} := \frac{x_j^{(i)} - \mu_j}{s_j}$
2) Dimensionality reduction
- Compute the covariance matrix (written with a capital Sigma, $\Sigma$, not to be confused with the summation sign): $\Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} (x^{(i)})^T = \frac{1}{m} X^T X$
- Compute the eigenvectors of $\Sigma$ (in practice via singular value decomposition): `[U, S, V] = svd(Sigma);`
- From the svd result, take the first $k$ columns: $U_{\mathrm{reduce}} = U(:, 1\!:\!k)$
- Complete the reduction: $z^{(i)} = U_{\mathrm{reduce}}^T x^{(i)}$, or $Z = X\, U_{\mathrm{reduce}}$ in matrix form
👉 Code implementation:
% do feature scaling & mean normalization
Sigma = 1/m * X' * X;
[U, S, V] = svd(Sigma);
Ureduce = U(:, 1:K);
Z = X * Ureduce;
Data recovery: to go back to the original dimension, compute $x_{\mathrm{approx}} = U_{\mathrm{reduce}}\, z$.
In general $k < n$, so we can only expect $x_{\mathrm{approx}}$ to be close to the original $x$.
Choosing $k$ (the number of principal components)
Generally, choose the smallest $k$ such that: $\dfrac{\frac{1}{m} \sum_{i=1}^{m} \lVert x^{(i)} - x^{(i)}_{\mathrm{approx}} \rVert^2}{\frac{1}{m} \sum_{i=1}^{m} \lVert x^{(i)} \rVert^2} \le 0.01$, i.e. 99% of the variance is retained.
Algorithm: using the $S$ matrix returned by svd, this is equivalent to picking the smallest $k$ with $\dfrac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \ge 0.99$:
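That selection rule is a few lines once svd has produced S. A NumPy sketch on made-up data where the third feature is almost a copy of the first, so k = 2 already retains over 99% of the variance:

```python
import numpy as np

# Made-up data: 3 features, the third nearly a copy of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.c_[base, base[:, 0] + 1e-3 * rng.normal(size=200)]
X = (X - X.mean(axis=0)) / X.std(axis=0)   # feature scaling & mean normalization

Sigma = X.T @ X / len(X)                   # covariance matrix
U, S, _ = np.linalg.svd(Sigma)

retained = np.cumsum(S) / np.sum(S)        # variance retained for k = 1, 2, ...
k = int(np.argmax(retained >= 0.99)) + 1   # smallest k keeping >= 99%

Z = X @ U[:, :k]                           # project down to k dimensions
X_approx = Z @ U[:, :k].T                  # recover an approximation of X
print(k)  # -> 2
```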
Anomaly Detection
Finds, within a pile of data, the items that deviate from the rest.
Problem statement: given a dataset $\{x^{(1)}, \ldots, x^{(m)}\}$, train a model that decides whether a new example $x_{\mathrm{test}}$ is anomalous.
To solve this, we can build a model of the probability density $p(x)$ and choose a threshold $\varepsilon$ such that: $x$ is flagged as an anomaly if $p(x) < \varepsilon$.
The problem thus becomes density estimation. We commonly solve it with the Gaussian distribution.
Gaussian distribution
If $x$ follows a Gaussian distribution, $x \sim \mathcal{N}(\mu, \sigma^2)$,
then the density of $x$ is: $p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$
where the parameters $\mu$ and $\sigma^2$ are determined by (this is the convention commonly used in machine learning; it is not necessarily the same as in statistics, which divides by $m - 1$): $\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$, $\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} - \mu)^2$
👉 Code implementation:
function [mu sigma2] = estimateGaussian(X)
%ESTIMATEGAUSSIAN This function estimates the parameters of a
%Gaussian distribution using the data in X
% [mu sigma2] = estimateGaussian(X),
% The input X is the dataset with each n-dimensional data point in one row
% The output is an n-dimensional vector mu, the mean of the data set
% and the variances sigma^2, an n x 1 vector
%
% Useful variables
[m, n] = size(X);
mu = zeros(n, 1);
sigma2 = zeros(n, 1);
mu = mean(X);
sigma2 = var(X) * (m - 1) / m;
end
With this, we obtain the anomaly detection algorithm:
Anomaly detection algorithm
- Choose features $x_j$ that you think may indicate anomalous examples
- Estimate the parameters $\mu_1, \ldots, \mu_n,\ \sigma_1^2, \ldots, \sigma_n^2$
- For a newly given example $x$, compute $p(x)$: $p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$
- Flag as an anomaly if $p(x) < \varepsilon$.
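The per-feature product and the threshold test can be sketched in a few lines of NumPy (the training data, test points, and ε below are all made up):

```python
import numpy as np

def gaussian(x, mu, sigma2):
    """Univariate Gaussian density p(x; mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Made-up training data: two features centred near 0 and 5.
rng = np.random.default_rng(0)
X = np.c_[rng.normal(0, 1, 500), rng.normal(5, 1, 500)]

mu = X.mean(axis=0)
sigma2 = X.var(axis=0)          # 1/m * sum((x - mu)^2), the ML estimate

def p(x):
    # p(x) = product over features of p(x_j; mu_j, sigma_j^2)
    return np.prod(gaussian(x, mu, sigma2))

epsilon = 1e-4                  # threshold, e.g. chosen by maximizing F1 on a CV set
print(p(np.array([0.0, 5.0])) < epsilon)   # typical point: False (not anomalous)
print(p(np.array([8.0, -2.0])) < epsilon)  # far-off point: True (anomalous)
```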
Multivariate Gaussian distribution
Parameters:
- $\mu \in \mathbb{R}^n$
- $\Sigma \in \mathbb{R}^{n \times n}$ (covariance matrix; after mean normalization, `Sigma = 1/m * X' * X;`)
Density: $p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} \lvert \Sigma \rvert^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$
Parameter estimation: $\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$, $\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} - \mu)(x^{(i)} - \mu)^T$
👉 Code implementation:
function p = multivariateGaussian(X, mu, Sigma2)
%MULTIVARIATEGAUSSIAN Computes the probability density function of the
%multivariate gaussian distribution.
% p = MULTIVARIATEGAUSSIAN(X, mu, Sigma2) Computes the probability
% density function of the examples X under the multivariate gaussian
% distribution with parameters mu and Sigma2. If Sigma2 is a matrix, it is
% treated as the covariance matrix. If Sigma2 is a vector, it is treated
% as the \sigma^2 values of the variances in each dimension (a diagonal
% covariance matrix)
%
k = length(mu);
if (size(Sigma2, 2) == 1) || (size(Sigma2, 1) == 1)
Sigma2 = diag(Sigma2);
end
X = bsxfun(@minus, X, mu(:)');
p = (2 * pi) ^ (- k / 2) * det(Sigma2) ^ (-0.5) * ...
exp(-0.5 * sum(bsxfun(@times, X * pinv(Sigma2), X), 2));
end
Anomaly detection with the multivariate Gaussian
- Fit the model $p(x)$ by estimating the parameters $\mu$ and $\Sigma$:
- For a newly given $x$, compute: $p(x; \mu, \Sigma)$
- Flag as an anomaly if $p(x; \mu, \Sigma) < \varepsilon$.
Choosing the threshold $\varepsilon$
The best-suited $\varepsilon$ can be found by maximizing the $F_1$ score.
$F_1$ is given by precision ($P$) and recall ($R$): $F_1 = \frac{2PR}{P + R}$, with $P = \frac{tp}{tp + fp}$ and $R = \frac{tp}{tp + fn}$
where:
- $tp$ are the true positives: predicted positive, actually positive
- $fp$ are the false positives: predicted positive, actually negative
- $fn$ are the false negatives: predicted negative, actually positive
👉 Code implementation:
function [bestEpsilon bestF1] = selectThreshold(yval, pval)
%SELECTTHRESHOLD Find the best threshold (epsilon) to use for selecting
%outliers
% [bestEpsilon bestF1] = SELECTTHRESHOLD(yval, pval) finds the best
% threshold to use for selecting outliers based on the results from a
% validation set (pval) and the ground truth (yval).
%
bestEpsilon = 0;
bestF1 = 0;
F1 = 0;
stepsize = (max(pval) - min(pval)) / 1000;
for epsilon = min(pval):stepsize:max(pval)
cvPredictions = pval < epsilon;
tp = sum((cvPredictions == 1) & (yval == 1));
fp = sum((cvPredictions == 1) & (yval == 0));
fn = sum((cvPredictions == 0) & (yval == 1));
prec = tp / (tp + fp);
rec = tp / (tp + fn);
F1 = (2 * prec * rec) / (prec + rec);
if F1 > bestF1
bestF1 = F1;
bestEpsilon = epsilon;
end
end
end
Recommender Systems
Recommend new content to users based on their ratings.
Notation (suppose what we recommend are movies):
- $n_u$ = number of users
- $n_m$ = number of movies
- $r(i, j) = 1$ if user $j$ has rated movie $i$, and 0 otherwise
- $y^{(i,j)}$ = the rating user $j$ gave movie $i$ (defined only when $r(i, j) = 1$)
Content-based recommendation
Prediction model:
- $r(i, j) = 1$ if user $j$ has rated movie $i$
- $y^{(i,j)}$ = user $j$'s rating of movie $i$ (where defined)
- $\theta^{(j)}$ = the parameter vector for user $j$
- $x^{(i)}$ = the feature vector for movie $i$
For user $j$ and movie $i$, the predicted rating is: $(\theta^{(j)})^T x^{(i)}$.
Optimization objective:
- Optimize $\theta^{(j)}$ (the parameters for a single user $j$): $\min_{\theta^{(j)}} \frac{1}{2} \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} (\theta_k^{(j)})^2$
- Optimize $\theta^{(1)}, \ldots, \theta^{(n_u)}$ (for all users): sum the objective above over $j = 1, \ldots, n_u$
We can solve this with gradient descent:
Collaborative filtering
In content-based recommendation it is sometimes hard to pin down which features ($x$) a movie (the thing we recommend) has; we would like the learning algorithm to find the features itself, and that is where collaborative filtering comes in.
The newly added optimization objectives (the earlier objective from content-based recommendation still applies):
- Given $x^{(1)}, \ldots, x^{(n_m)}$, learn $\theta^{(1)}, \ldots, \theta^{(n_u)}$
- Given $\theta^{(1)}, \ldots, \theta^{(n_u)}$, learn $x^{(1)}, \ldots, x^{(n_m)}$
Collaborative filtering:
Our problem now is that we have neither a trained $\theta$ nor a sufficiently optimized set of $x$, yet learning $x$ requires $\theta$ first and learning $\theta$ requires $x$ first. This is a chicken-and-egg problem.
We can consider resolving it like this:
First randomly guess a set of $\theta$; with it we can obtain a set of $x$; that $x$ can in turn improve $\theta$, and the improved $\theta$ can again improve $x$... Repeating this process, we can expect a solution in which both $\theta$ and $x$ are well optimized (in fact they will converge).
Randomly initializing the parameters and computing $\theta$ and $x$ back and forth does reach a solution, but it is clumsy; we can do better:
Optimize $\theta$ and $x$ simultaneously: $\min_{x^{(1)},\ldots,x^{(n_m)},\ \theta^{(1)},\ldots,\theta^{(n_u)}} \frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} (x_k^{(i)})^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} (\theta_k^{(j)})^2$
Collaborative filtering algorithm:
- Initialize $x^{(1)}, \ldots, x^{(n_m)}, \theta^{(1)}, \ldots, \theta^{(n_u)}$ to small random values
- Minimize the cost using gradient descent (or an advanced optimization algorithm)
- For a user with parameters $\theta$ and a movie with learned features $x$, predict that the user would rate the movie $\theta^T x$.
Low-rank matrix factorization:
We can see that our final predictions form a matrix whose $(i, j)$ entry is $(\theta^{(j)})^T x^{(i)}$.
Considering that almost no user rates close to all the movies, the rating matrix is sparse, and storing the full prediction matrix would waste a lot of space; instead, let $X$ hold the movie features $x^{(i)}$ as rows and $\Theta$ hold the user parameters $\theta^{(j)}$ as rows,
so that: $Y \approx X \Theta^T$.
We have thus split it into two parts. Obtaining $X$ and $\Theta$ this way, which gives the recommender system the parameters it needs, is called low-rank matrix factorization. The method not only yields the parameters directly through vectorized code, but also saves memory through the factorization.
Finding related movies:
We often need to recommend movies $j$ related to a movie $i$; they can be found by looking for the smallest distances $\lVert x^{(i)} - x^{(j)} \rVert$.
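Finding related movies is then a nearest-neighbour search over the learned feature vectors; a NumPy sketch with made-up features:

```python
import numpy as np

# Made-up learned movie features x^(i) (one row per movie), e.g. from
# collaborative filtering.
movie_features = np.array([[0.9, 0.1],   # movie 0: mostly "romance"
                           [0.8, 0.2],   # movie 1: similar to movie 0
                           [0.1, 0.9]])  # movie 2: mostly "action"

def most_related(i, X):
    """Return the movie j != i minimizing ||x^(i) - x^(j)||."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf                      # exclude the movie itself
    return int(np.argmin(d))

print(most_related(0, movie_features))  # -> 1
```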
Mean normalization:
In the movie-rating problem the ratings are always 1 to 5 (or some other fixed range), so feature scaling is unnecessary, but mean normalization helps (e.g. for users who have not rated anything yet):
For user $j$ and movie $i$, predict: $(\theta^{(j)})^T x^{(i)} + \mu_i$, where $\mu_i$ is the mean rating of movie $i$.
👉 Code implementation:
- Cost function:
function [J, grad] = cofiCostFunc(params, Y, R, num_users, num_movies, ...
num_features, lambda)
%COFICOSTFUNC Collaborative filtering cost function
% [J, grad] = COFICOSTFUNC(params, Y, R, num_users, num_movies, ...
% num_features, lambda) returns the cost and gradient for the
% collaborative filtering problem.
%
% Unfold the U and W matrices from params
X = reshape(params(1:num_movies*num_features), num_movies, num_features);
Theta = reshape(params(num_movies*num_features+1:end), ...
num_users, num_features);
J = 0;
X_grad = zeros(size(X));
Theta_grad = zeros(size(Theta));
h = X * Theta';
er = (h - Y) .* R;
J = 1/2 * sum(sum(er.^2));
X_grad = er * Theta;
Theta_grad = er' * X;
% Regularized
J += lambda/2 *(sum(sum(Theta.^2)) + sum(sum(X.^2)));
X_grad += lambda * X;
Theta_grad += lambda * Theta;
grad = [X_grad(:); Theta_grad(:)];
end
- Mean normalization:
function [Ynorm, Ymean] = normalizeRatings(Y, R)
%NORMALIZERATINGS Preprocess data by subtracting mean rating for every
%movie (every row)
% [Ynorm, Ymean] = NORMALIZERATINGS(Y, R) normalized Y so that each movie
% has a rating of 0 on average, and returns the mean rating in Ymean.
%
[m, n] = size(Y);
Ymean = zeros(m, 1);
Ynorm = zeros(size(Y));
for i = 1:m
idx = find(R(i, :) == 1);
Ymean(i) = mean(Y(i, idx));
Ynorm(i, idx) = Y(i, idx) - Ymean(i);
end
end
- Driver script:
% Normalize Ratings
[Ynorm, Ymean] = normalizeRatings(Y, R);
% Useful Values
num_users = size(Y, 2);
num_movies = size(Y, 1);
num_features = 10;
% Set Initial Parameters (Theta, X)
X = randn(num_movies, num_features);
Theta = randn(num_users, num_features);
initial_parameters = [X(:); Theta(:)];
% Set options for fmincg
options = optimset('GradObj', 'on', 'MaxIter', 100);
% Set Regularization
lambda = 10;
theta = fmincg (@(t)(cofiCostFunc(t, Ynorm, R, num_users, num_movies, ...
num_features, lambda)), ...
initial_parameters, options);
% Unfold the returned theta back into U and W
X = reshape(theta(1:num_movies*num_features), num_movies, num_features);
Theta = reshape(theta(num_movies*num_features+1:end), ...
num_users, num_features);
fprintf('Recommender system learning completed.\n');
p = X * Theta';
my_predictions = p(:,1) + Ymean;