0. Background
This post walks through a paper that performs feature selection with an improved BPSO algorithm, using a 1-NN classifier to assess the quality of the selected features. The comparisons in the paper indicate that the proposed algorithm has certain advantages for feature selection.
Zhang Y, Gong D, Hu Y, et al. Feature selection algorithm based on bare bones particle swarm optimization[J]. Neurocomputing, 2015, 148: 150-157.
This is a paper from a tier-2 (Q2) journal. Personally, I find the improved BPSO proposed for this practical feature-selection problem not especially novel, but the paper does give a clear picture of the problems involved.
1. Core Idea of the Paper
The paper proposes a new method for finding an optimal feature subset with BPSO, called binary BPSO. The algorithm designs a reinforced memory strategy to update each particle's personal best, avoiding the degradation of prominent genes within a particle, and proposes a uniform combination (UC) operator to balance local exploitation against global exploration. In addition, a 1-NN classifier is used to evaluate each particle's classification accuracy. The algorithm is evaluated on a number of international benchmark datasets.
For the Binary PSO algorithm, see: Binary PSO; for the BPSO (bare bones PSO) algorithm, see: BPSO.
2. The Binary BPSO
The overall flow of Binary BPSO is given in the paper as a numbered procedure; the key steps are walked through below.
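As background, the position update of the underlying (continuous) bare bones PSO, which Binary BPSO builds on, can be sketched as follows. This is a minimal sketch, with names of my own choosing: with probability 0.5 a particle keeps its personal best; otherwise each coordinate is drawn from a Gaussian centred midway between pbest and gbest, with standard deviation |pbest - gbest|.

```python
import numpy as np

def bbpso_sample(pbest, gbest, rng):
    # With probability 0.5 the particle keeps its personal best unchanged.
    if rng.random() > 0.5:
        return pbest.copy()
    # Otherwise sample each coordinate from N((pbest+gbest)/2, |pbest-gbest|).
    mu = (pbest + gbest) / 2.0
    sigma = np.abs(pbest - gbest)
    return rng.normal(mu, sigma)

rng = np.random.default_rng(0)
pbest = np.array([0.2, 0.8, 0.5])
gbest = np.array([0.4, 0.6, 0.5])
x_new = bbpso_sample(pbest, gbest, rng)
# In a coordinate where pbest == gbest, both branches return that value
# exactly (the Gaussian has zero standard deviation there).
```

Note that this update has no velocity term at all, which is the defining feature of bare bones PSO.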
Step 1.1 involves initializing the swarm: N is the number of features (variables), |S| is the number of initial particles, and x_ij denotes the probability that the j-th feature of the i-th particle is selected.
How, then, is the actual selection state of the j-th feature in the i-th particle represented? By a binary variable z_ij: z_ij = 1 means the j-th feature of the i-th particle is selected, and z_ij = 0 means it is not.
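The initialization just described can be sketched as follows (a minimal sketch; the variable names are mine): each particle i carries a probability vector x_i in [0,1]^N, and the binary selection vector z_i is obtained by comparing each x_ij with a uniform random draw, so feature j is selected with probability x_ij.

```python
import numpy as np

rng = np.random.default_rng(42)
num_particles, num_features = 5, 8              # |S| = 5, N = 8
# x_ij: probability that particle i selects feature j
X = rng.random((num_particles, num_features))
# z_ij: 1 if feature j is selected in particle i, 0 otherwise
Z = (rng.random((num_particles, num_features)) < X).astype(int)
```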
How is the fitness in step 1.2 computed? According to which features a particle selects, the corresponding columns and the labels are taken from the original dataset to form a reduced dataset, on which classification accuracy is computed with 1-NN under a leave-one-out protocol: each single sample in turn serves as the test sample, and the remaining samples form the training set. The accuracy is then
    accuracy = (number of correctly classified samples) / (total number of samples).
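The leave-one-out 1-NN fitness described above can be sketched like this (a simplified sketch without the k-NN tie-breaking of the full reproduction; the function name is mine):

```python
import numpy as np

def loo_1nn_accuracy(X, y):
    # Each sample in turn is the test point; its label is predicted by
    # the nearest remaining sample under squared Euclidean distance.
    n = len(y)
    correct = 0
    for i in range(n):
        d = np.sum((X - X[i]) ** 2, axis=1)
        d[i] = np.inf                      # exclude the test point itself
        correct += (y[np.argmin(d)] == y[i])
    return correct / n

# Two well-separated clusters: LOO 1-NN classifies every point correctly.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y = np.array([0, 0, 1, 1])
acc = loo_1nn_accuracy(X, y)   # → 1.0
```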
This brings us to the paper's two key algorithmic contributions: the reinforced memory strategy and uniform combination (UC).
2.1 The reinforced memory strategy
The conventional way of updating Pbest was covered in the earlier PSO post, but it has a flaw. Suppose x_ij = 0.1 and the random draw happens to fall below 0.1, so the feature is selected; although this apparently weak feature got in by luck, the overall prediction accuracy rises, which suggests the feature's importance is more than 0.1. The conventional update simply copies x_i into Pbest and does nothing to raise the probability that such features are selected again; without that boost, the feature may well miss the cut in the next generation. Hence the reinforced memory strategy, which folds the realised binary selection vector back into Pbest.
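A sketch of this Pbest update, following the averaging used in the reproduction code later in this post: when a particle improves, its new Pbest is the mean of the probability vector x and the realised binary vector z, so a feature that was actually selected (z_ij = 1) has its selection probability pulled upward.

```python
import numpy as np

x = np.array([0.1, 0.9, 0.5])   # current selection probabilities
z = np.array([1, 1, 0])         # realised selection: feature 0 got in by luck
# Averaging rewards features that were selected in the improving draw:
# feature 0 rises from 0.1 to 0.55, so it is far likelier to be kept.
pbest_new = 0.5 * (x + z)       # → [0.55, 0.95, 0.25]
```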
2.2 Uniform combination
Binary BPSO can get stuck in a local optimum, so how do we escape it and restore global search? This is where uniform combination (UC) comes in. The idea: when Gbest stops improving, increase the probability of mutating Pbest genes. Concretely, a counter num is incremented in every generation in which Gbest does not improve, and is reset to 0 whenever Gbest improves; pc, the probability of mutating a particle's Pbest genes, grows with num. The mutation itself overwrites a few randomly chosen Pbest genes of a particle with perturbed genes taken from another randomly chosen particle's Pbest.
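The adaptive mutation probability can be sketched as follows, with the constants (cap 0.2, shift 5) taken from the reproduction code later in this post: pc grows sigmoidally with num, the number of consecutive generations in which Gbest has not improved.

```python
import math

def mutation_prob(num):
    # Sigmoid in num: near zero while the search is progressing,
    # approaching the cap 0.2 under prolonged stagnation.
    return 0.2 / (1 + math.exp(5 - num))

pc0 = mutation_prob(0)     # right after an improvement: pc is near zero
pc10 = mutation_prob(10)   # after long stagnation: pc approaches the cap 0.2
```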
3. Experimental Results in the Paper
The experimental setup uses an initial population of 20, with the per-dataset iteration counts given in the paper.
The paper reports the following results (shown there as tables and figures):
Classification accuracy of each algorithm on each dataset.
How accuracy and the number of selected features change as the algorithm iterates.
The concrete features selected during the iterations.
A comparison of Binary BPSO using both strategies against variants that drop uniform combination or the reinforced memory strategy.
4. My Reproduction and a Simple Experimental Check
I used four datasets and reproduced two algorithms, Binary PSO and Binary BPSO:
function test_y = knn(x, y, test_x, k, num)
% x: n-by-m training data, y: n-by-1 training labels
% test_x: n0-by-m test data, test_y: n0-by-1 predicted labels
% k: number of neighbours, num: number of label classes
n0 = size(test_x, 1);
test_y = zeros(n0, 1);
for i = 1 : n0
    ind = [];
    max_label = 0;
    Dist = sum((x - test_x(i, :)).^2, 2);
    % sort distances in ascending order and keep the k nearest
    [D, I] = sort(Dist);
    D = D(1:k, :);
    D_y = y(I(1:k), :);
    for j = 1 : num
        % count how many of the k neighbours belong to each class and
        % record the class (or tied classes) with the largest count
        label1 = sum(D_y == j);
        if label1 > max_label
            max_label = label1;
            ind = j;
        elseif label1 == max_label && label1 > 0
            ind = [ind; j];
        end
    end
    num_ind = size(ind, 1);
    min_dist = inf;
    % tie-break: among tied classes, choose the one whose neighbours have
    % the smallest total distance to the test sample
    for j1 = 1 : num_ind
        dist = sum(D(D_y == ind(j1)));
        if dist <= min_dist
            min_dist = dist;
            test_y(i, :) = ind(j1);
        end
    end
end
end
function F = evaluate(Z, data_x, data_y, num_lable)
% leave-one-out 1-NN accuracy for every particle's feature subset
% Z: binary selection matrix, one row per particle
num_simple = size(Z, 1);
F = zeros(num_simple, 1);
% 1-NN classifier
k = 1;
for i = 1 : num_simple
    data_x1 = data_x(:, Z(i, :) == 1);
    n = size(data_x1, 1);
    test_y = zeros(n, 1);
    for i1 = 1 : n
        % take one sample as the test set; the rest form the training set
        test_x = data_x1(i1, :);
        ind = [1 : i1 - 1, i1 + 1 : n];
        test_y(i1, :) = knn(data_x1(ind, :), data_y(ind, :), test_x, k, num_lable);
    end
    % number and rate of correct predictions
    acc_num = sum(test_y == data_y);
    acc_rate = acc_num / n;
    F(i, :) = acc_rate;
end
end
function [best_Zi, best_acc] = Bi_PSO(num_simple, num_iter, num_feature, data_x, data_y, num_lable)
% num_simple: swarm size; num_iter: number of iterations;
% num_feature: number of features per particle
% data_x: sample data; data_y: sample labels; num_lable: number of label classes
% best_Zi: feature-selection vector of the most accurate particle
% initialise the swarm: num_simple-by-num_feature
X = rand(num_simple, num_feature);
% convert X into the binary matrix Z
r1 = ones(num_simple, num_feature) * 0.5;
Z = zeros(num_simple, num_feature);
Z(r1 < X) = 1;
Z(r1 >= X) = 0;
pbestz = Z;
% classification accuracy F: num_simple-by-1
F = evaluate(Z, data_x, data_y, num_lable);
pbestf = F;
% gbest: index of the particle with the highest accuracy
[fbest, gbest] = max(pbestf);
fprintf("iter 0 feature: pbestz(%d,:) fbest: %.4f\n", gbest, fbest);
% the current positions start from the swarm that was just evaluated
presentz = Z;
vz = Z;
w = 1;
for i = 1 : num_iter
    r3 = rand(num_simple, num_feature);
    r4 = rand(num_simple, num_feature);
    % PSO velocity update, with independent random factors for the
    % cognitive and social terms
    vz = w .* vz + 2 * r3 .* (pbestz - presentz) + 2 * r4 .* (pbestz(gbest, :) - presentz);
    % clamp the velocity so the sigmoid s does not saturate at 0 or 1
    vz(vz > 10) = 10;
    vz(vz < -10) = -10;
    s = 1 ./ (1 + exp(-vz));
    r5 = rand(num_simple, num_feature);
    presentz(r5 < s) = 1;
    presentz(r5 >= s) = 0;
    presentf = evaluate(presentz, data_x, data_y, num_lable);
    % update each particle's personal best
    improved = presentf > pbestf;
    pbestz(improved, :) = presentz(improved, :);
    pbestf(improved, :) = presentf(improved, :);
    % update the global best
    [fbest, gbest] = max(pbestf);
    fprintf("iter %d feature: pbestz(%d,:) fbest: %.4f\n", i, gbest, fbest);
end
best_Zi = pbestz(gbest, :);
best_acc = fbest;
end
clc; clear;
% Ionosphere dataset
% load('lonosphere_X.mat', 'ionosphere');
% data_x = ionosphere;
% load('lonosphere_Y.mat', 'ionosphere1');
% data_y = ionosphere1;
% Vehicle dataset
% load('Vehicle_X.mat')
% load('Vehicle_Y.mat')
% data_x = xaa;
% data_y = xaa1;
% Sonar dataset (both .mat files store a variable named sonar1)
load('sonar_X.mat')
data_x = sonar1;
load('sonar_Y.mat')
data_y = sonar1;
% WDBC (breast cancer) dataset
% load('WDBC_X.mat')
% data_x = UCIBeastcancerwdbc;
% load('WDBC_Y.mat')
% data_y = UCIBeastcancerwdbc;
num_lable = 2;
% num_lable = 4;   % for the Vehicle dataset
num_simple = 20;
num_iter = 100;
num_feature = size(data_x, 2);
best_Z1 = zeros(1, num_feature);
best_acc1 = 0;
% for i = 1 : 10   % multiple independent runs
for i = 1
    [best_Zi, best_acc] = Bi_PSO(num_simple, num_iter, num_feature, data_x, data_y, num_lable);
    if best_acc > best_acc1
        best_acc1 = best_acc;
        best_Z1 = best_Zi;
    end
end
best_feature = find(best_Z1 == 1);
disp(sum(best_Z1));
disp(best_feature);
function [best_Zi, best_acc] = Bi_BPSO(num_simple, num_iter, num_feature, data_x, data_y, num_lable)
% num_simple: swarm size; num_iter: number of iterations;
% num_feature: number of features per particle
% data_x: sample data; data_y: sample labels; num_lable: number of label classes
% best_Zi: feature-selection vector of the most accurate particle
% initialise the swarm: num_simple-by-num_feature
X = rand(num_simple, num_feature);
% convert X into the binary matrix Z
r1 = ones(num_simple, num_feature) * 0.5;
Z = zeros(num_simple, num_feature);
Z(r1 < X) = 1;
Z(r1 >= X) = 0;
pbestx = X;
% classification accuracy F: num_simple-by-1
F = evaluate(Z, data_x, data_y, num_lable);
pbestf = F;
% gbest: index of the particle with the highest accuracy
[fbest, gbest] = max(pbestf);
fprintf("iter 0 feature: pbestz(%d,:) fbest: %.4f\n", gbest, fbest);
num1 = 0;   % generations since the last Gbest improvement
for i = 1 : num_iter
    s = zeros(num_simple, num_feature);
    % bare bones position update: keep pbestx with probability 0.5,
    % otherwise sample from a Gaussian between pbestx and the global best
    r = rand;
    if r > 0.5
        x = pbestx;
    else
        x = normrnd((pbestx + pbestx(gbest, :)) ./ 2, abs(pbestx - pbestx(gbest, :)));
    end
    r2 = rand(num_simple, num_feature);
    s(r2 < x) = 1;
    s(r2 >= x) = 0;
    presentf = evaluate(s, data_x, data_y, num_lable);
    % the reinforced memory strategy: a particle's personal best is
    % replaced by the average of x and the realised binary vector s,
    % either when its accuracy rises, or when accuracy ties but fewer
    % features are selected; both masks are computed before any update
    improved = presentf > pbestf;
    tied = (presentf == pbestf) & (sum(Z, 2) > sum(s, 2));
    update = improved | tied;
    pbestx(update, :) = 0.5 * (x(update, :) + s(update, :));
    pbestf(update, :) = presentf(update, :);
    Z(update, :) = s(update, :);
    % uniform combination (UC): mutate Pbest genes with probability pc,
    % which grows the longer Gbest stagnates
    pc = 0.2 / (1 + exp(5 - num1));
    for i1 = 1 : num_simple
        r3 = rand;
        if r3 < pc
            k = randi(num_simple, 1, 1);
            u_n = ceil(pc * num_feature);
            for i2 = 1 : u_n
                l = randi(num_feature, 1, 1);
                pbestx(i1, l) = pbestx(k, l) + rand;
            end
        end
    end
    % update the global best and the stagnation counter
    [fbest1, gbest] = max(pbestf);
    if fbest1 > fbest
        num1 = 0;
    else
        num1 = num1 + 1;
    end
    fbest = fbest1;
    fprintf("iter %d feature: pbestz(%d,:) fbest: %.4f\n", i, gbest, fbest);
end
best_Zi = Z(gbest, :);
best_acc = fbest;
end
clc; clear;
% Ionosphere dataset
% load('lonosphere_X.mat', 'ionosphere');
% data_x = ionosphere;
% load('lonosphere_Y.mat', 'ionosphere1');
% data_y = ionosphere1;
% Vehicle dataset
% load('Vehicle_X.mat')
% load('Vehicle_Y.mat')
% data_x = xaa;
% data_y = xaa1;
% Sonar dataset (both .mat files store a variable named sonar1)
load('sonar_X.mat')
data_x = sonar1;
load('sonar_Y.mat')
data_y = sonar1;
% Breast cancer (WDBC) dataset
% load('Beast-cancer_X.mat')
% data_x = UCIBeastcancerwdbc;
% load('Beast-cancer_Y.mat')
% data_y = UCIBeastcancerwdbc;
num_lable = 2;
num_simple = 20;
num_iter = 100;
num_feature = size(data_x, 2);
best_Z1 = zeros(1, num_feature);
best_acc1 = 0;
% for i = 1 : 10   % multiple independent runs
for i = 1
    [best_Zi, best_acc] = Bi_BPSO(num_simple, num_iter, num_feature, data_x, data_y, num_lable);
    if best_acc > best_acc1
        best_acc1 = best_acc;
        best_Z1 = best_Zi;
    end
end
best_feature = find(best_Z1 == 1);
disp(sum(best_Z1));
disp(best_feature);
disp(best_acc1);
For now, only the test results on the sonar and ionosphere datasets are shown:
Binary PSO, sonar dataset, accuracy:
Binary PSO, sonar dataset, number of selected features, settling at 31 (out of 60):
Binary PSO, ionosphere dataset, accuracy:
Binary PSO, ionosphere dataset, number of selected features, settling at 9 (out of 34):
Binary BPSO, sonar dataset, accuracy:
Binary BPSO, sonar dataset, number of selected features, settling at 30 (out of 60):
Binary BPSO, ionosphere dataset, accuracy:
Binary BPSO, ionosphere dataset, number of selected features, settling at 9 (out of 34):
If you spot any mistakes, please get in touch so I can correct them.