Preface
Python is a powerful language that performs well across many fields: software development, front-end and back-end development, data analysis, machine learning, and more. This article applies Python to data analysis in several different scenarios.
Stock Analysis
- The Pandas library was originally created for financial analysis, and it remains an enormous help for understanding, exploring, and analyzing datasets across many kinds of projects.
I. Exploratory Analysis
First, import the relevant modules:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Set the Jupyter Notebook display options (row and column limits, line wrapping):
pd.options.display.min_rows = None
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', 20)
Load the dataset and look at the first 10 rows:
data_all = pd.read_excel('data_all.xlsx')
data_all.head(10)
——————————————————————————————————————————————————
date code open high low close pre_close pct_chg vol amt
0 2018-11-01 000001.SZ 10.99 11.05 10.76 10.83 10.91 -0.7333 1542776.32 16.794434
1 2018-11-01 000002.SZ 24.90 25.27 24.28 24.42 24.23 0.7842 617847.25 15.318339
2 2018-11-01 000004.SZ 15.63 15.63 15.43 15.50 15.54 -0.2574 5597.02 0.086807
3 2018-11-01 000005.SZ 2.74 2.76 2.71 2.71 2.73 -0.7326 50199.00 0.137363
4 2018-11-01 000006.SZ 5.13 5.15 5.03 5.04 5.07 -0.5917 149151.86 0.759842
5 2018-11-01 000007.SZ 7.25 7.40 7.07 7.16 7.25 -1.2414 288091.37 2.075888
6 2018-11-01 000008.SZ 4.43 4.47 4.31 4.35 4.41 -1.3605 470410.19 2.058089
7 2018-11-01 000009.SZ 4.05 4.10 4.00 4.03 4.02 0.2488 128825.16 0.523551
8 2018-11-01 000010.SZ 4.38 4.38 4.31 4.34 4.38 -0.9132 26943.00 0.117115
9 2018-11-01 000011.SZ 8.90 9.08 8.85 8.86 8.90 -0.4494 37702.91 0.338891
The columns are:
- date: trading date
- code: stock code
- open: opening price
- high: daily high
- low: daily low
- close: closing price
- pre_close: previous close
- pct_chg: percent change
- vol: trading volume
- amt: turnover
Use describe() to get an overview of the distribution of the dataset:
data_all.describe()
——————————————————————————————————————————————————
open high low close pre_close pct_chg vol amt
count 984476.000000 984476.000000 984476.000000 984476.000000 984476.000000 984476.000000 9.844760e+05 984476.000000
mean 13.680437 13.960330 13.432177 13.696887 13.680717 0.103236 1.367581e+05 1.345854
std 22.011995 22.416306 21.665628 22.056704 22.002960 3.057492 3.112537e+05 3.061065
min 0.130000 0.140000 0.120000 0.130000 0.130000 -27.193700 2.000000e+00 0.000004
25% 5.400000 5.500000 5.310000 5.410000 5.400000 -1.308000 2.376319e+04 0.220520
50% 8.840000 9.000000 8.690000 8.840000 8.840000 0.000000 5.593079e+04 0.515636
75% 15.600000 15.920000 15.320000 15.620000 15.600000 1.339300 1.377442e+05 1.287892
max 1231.000000 1241.610000 1228.060000 1233.750000 1233.750000 400.153100 4.034860e+07 181.957345
Now let's look at the closing prices of these stocks:
price = data_all['close']
price.describe(),price.plot()
——————————————————————————————————————————————————
(count 984476.000000
mean 13.696887
std 22.056704
min 0.130000
25% 5.410000
50% 8.840000
75% 15.620000
max 1233.750000
Name: close, dtype: float64,
<AxesSubplot:>)
Next, the percent change; the maximum turns out to be 400.1531.
data_all['pct_chg'].max(),data_all['pct_chg'].hist(bins=200)
——————————————————————————————————————————————————
(400.1531, <AxesSubplot:>)
Now rework the index into a MultiIndex, so we can inspect the stocks and their closing prices on any given day:
price.index=pd.MultiIndex.from_frame(df=data_all[['date', 'code']])
price, price.index
——————————————————————————————————————————————————
(date        code     
 2018-11-01  000001.SZ    10.83
             000002.SZ    24.42
             000004.SZ    15.50
             000005.SZ     2.71
             000006.SZ     5.04
             000007.SZ     7.16
             000008.SZ     4.35
             000009.SZ     4.03
             000010.SZ     4.34
             000011.SZ     8.86
             000012.SZ     4.19
             000014.SZ     9.01
             000016.SZ     3.55
             000017.SZ     4.43
             000018.SZ     2.00
                          ...
 2019-12-12  688288.SH    31.28
             688299.SH    16.78
             688300.SH    32.52
             688310.SH    27.59
             688321.SH    55.52
             688333.SH    52.44
             688357.SH    45.22
             688358.SH    45.40
             688363.SH    82.78
             688366.SH    87.00
             688368.SH    78.83
             688369.SH    58.48
             688388.SH    45.19
             688389.SH    15.85
             688399.SH    55.90
 Name: close, Length: 984476, dtype: float64,
 MultiIndex([('2018-11-01', '000001.SZ'),
             ('2018-11-01', '000002.SZ'),
             ...
             ('2019-12-12', '688399.SH')],
            names=['date', 'code'], length=984476))
Next, unstack the series into a date x code table:
mat_close = price.unstack()
mat_close.head(10)
——————————————————————————————————————————————————
code 000001.SZ 000002.SZ 000004.SZ 000005.SZ 000006.SZ 000007.SZ 000008.SZ 000009.SZ 000010.SZ 000011.SZ ... 688333.SH 688357.SH 688358.SH 688363.SH 688366.SH 688368.SH 688369.SH 688388.SH 688389.SH 688399.SH
date
2018-11-01 10.83 24.42 15.50 2.71 5.04 7.16 4.35 4.03 4.34 8.86 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-11-02 11.09 24.62 15.76 2.76 5.14 7.17 4.37 4.14 4.38 9.06 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-11-05 10.92 24.04 16.31 2.81 5.13 7.64 4.52 4.29 4.40 9.18 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-11-06 10.84 24.14 16.26 2.82 5.12 7.03 4.43 4.41 4.34 9.27 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-11-07 10.81 23.85 16.12 2.79 5.08 7.03 NaN 4.31 4.35 9.06 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-11-08 10.89 23.99 16.30 2.86 5.09 7.32 NaN 4.29 4.33 9.14 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-11-09 10.55 23.55 16.18 2.87 5.03 7.29 NaN 4.29 4.26 8.95 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-11-12 10.56 23.88 16.59 3.00 5.13 7.58 4.22 4.45 4.31 9.22 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-11-13 10.54 23.94 17.12 3.15 5.28 8.34 4.30 4.56 4.40 9.40 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-11-14 10.44 24.15 17.17 3.08 5.30 8.00 4.22 4.56 4.39 9.49 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
After this reshaping, it is much clearer how each stock changes over the period. The steps above can also be wrapped into a function:
def variable_ts(data, field):
    data_s = data[field]
    data_s.index = pd.MultiIndex.from_frame(df=data[['date', 'code']])
    return data_s.unstack()
With this function, let's look at how the percent changes evolve.
mat_pct_chg = variable_ts(data=data_all, field='pct_chg')
mat_pct_chg
——————————————————————————————————————————————————
code 000001.SZ 000002.SZ 000004.SZ 000005.SZ 000006.SZ 000007.SZ 000008.SZ 000009.SZ 000010.SZ 000011.SZ ... 688333.SH 688357.SH 688358.SH 688363.SH 688366.SH 688368.SH 688369.SH 688388.SH 688389.SH 688399.SH
date
2018-11-01 -0.7333 0.7842 -0.2574 -0.7326 -0.5917 -1.2414 -1.3605 0.2488 -0.9132 -0.4494 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-11-02 2.4007 0.8190 1.6774 1.8450 1.9841 0.1397 0.4598 2.7295 0.9217 2.2573 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-11-05 -1.5329 -2.3558 3.4898 1.8116 -0.1946 6.5551 3.4325 3.6232 0.4566 1.3245 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-11-06 -0.7326 0.4160 -0.3066 0.3559 -0.1949 -7.9843 -1.9912 2.7972 -1.3636 0.9804 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-11-07 -0.2768 -1.2013 -0.8610 -1.0638 -0.7813 0.0000 NaN -2.2676 0.2304 -2.2654 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-11-08 0.7401 0.5870 1.1166 2.5090 0.1969 4.1252 NaN -0.4640 -0.4598 0.8830 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-11-09 -3.1221 -1.8341 -0.7362 0.3497 -1.1788 -0.4098 NaN 0.0000 -1.6166 -2.0788 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-11-12 0.0948 1.4013 2.5340 4.5296 1.9881 3.9781 -4.7404 3.7296 1.1737 3.0168 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-11-13 -0.1894 0.2513 3.1947 5.0000 2.9240 10.0264 1.8957 2.4719 2.0882 1.9523 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-11-14 -0.9488 0.8772 0.2921 -2.2222 0.3788 -4.0767 -1.8605 0.0000 -0.2273 0.9574 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10 rows × 3756 columns
Set the index to date while keeping the date column:
data_all.set_index(keys='date', drop=False, inplace=True)
Then use a rolling window to track the mean closing price.
ma = data_all['close'].rolling(window=20, min_periods=20).mean()
Join the two series: the right-hand close column holds the rolling values, which stay NaN until min_periods=20 observations are available.
pd.concat([data_all['close'],ma],axis=1,sort=False)
——————————————————————————————————————————————————
close close
date
2018-11-01 10.83 NaN
2018-11-01 24.42 NaN
2018-11-01 15.50 NaN
2018-11-01 2.71 NaN
2018-11-01 5.04 NaN
2018-11-01 7.16 NaN
2018-11-01 4.35 NaN
2018-11-01 4.03 NaN
2018-11-01 4.34 NaN
2018-11-01 8.86 NaN
2018-11-01 4.19 NaN
2018-11-01 9.01 NaN
2018-11-01 3.55 NaN
2018-11-01 4.43 NaN
2018-11-01 2.00 NaN
... ... ...
2019-12-12 31.28 49.3930
2019-12-12 16.78 47.4825
2019-12-12 32.52 48.1540
2019-12-12 27.59 48.7690
2019-12-12 55.52 43.8650
2019-12-12 52.44 45.0920
2019-12-12 45.22 45.1880
2019-12-12 45.40 45.8355
2019-12-12 82.78 49.0185
2019-12-12 87.00 52.5670
2019-12-12 78.83 55.0840
2019-12-12 58.48 56.5405
2019-12-12 45.19 54.0675
2019-12-12 15.85 47.9895
2019-12-12 55.90 48.8405
984476 rows × 2 columns
pd.concat([data_all['close'],ma],axis=1,sort=False).plot()
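One caveat worth noting: data_all['close'] stacks every stock into a single column, so the 20-day window above slides across stock boundaries. For a per-stock moving average, the wide date x code matrix built earlier does the job directly, since rolling works column by column. A minimal sketch reusing mat_close and data_all from above:
# 20-day moving average per stock: each column of the wide matrix is one
# stock, and rolling() slides down each column independently
ma_per_stock = mat_close.rolling(window=20, min_periods=20).mean()
# the long-format equivalent via groupby (rows are already in date order,
# so each group stays chronological)
ma_grouped = (data_all.groupby('code')['close']
                      .rolling(window=20, min_periods=20)
                      .mean())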
Next, compute the running (expanding) maximum of the data.
exmax = data_all['close'].expanding().max()
pd.concat([data_all['close'], exmax], axis=1, sort=False).plot()
Next, resample the dataset to weekly periods and compute the mean of each indicator.
mean = data_all.resample(rule='1W', on='date', closed='right', label='right').mean()
mean
——————————————————————————————————————————————————
open high low close pre_close pct_chg vol amt
date
2018-11-04 11.475561 11.776644 11.344113 11.580576 11.347201 1.686861 133421.827245 1.197930
2018-11-11 11.736163 11.949151 11.524949 11.720392 11.743626 0.035427 111001.643627 0.924826
2018-11-18 12.050949 12.414799 11.919639 12.251079 12.073681 1.607222 149621.296784 1.187044
2018-11-25 12.350931 12.580767 12.052282 12.264035 12.400272 -1.124476 129389.371799 1.063069
2018-12-02 11.858034 12.080183 11.601627 11.840531 11.845595 -0.007843 96146.719166 0.778999
2018-12-09 12.143786 12.380277 11.967465 12.178485 12.144444 0.350796 105900.684959 0.919899
2018-12-16 11.997293 12.178797 11.791668 11.960303 11.997663 -0.384111 84651.361308 0.720631
2018-12-23 11.668868 11.845912 11.462035 11.653172 11.706126 -0.376411 76366.230127 0.649877
2018-12-30 11.591742 11.788976 11.347982 11.564124 11.587275 -0.373679 80999.883521 0.690408
2019-01-06 11.332548 11.612886 11.125718 11.394576 11.368771 0.585630 89418.638862 0.750437
2019-01-13 11.710140 11.944959 11.567360 11.761189 11.697039 0.611978 113789.838744 0.919325
2019-01-20 11.829106 12.032179 11.649138 11.841103 11.832487 0.006484 106323.611524 0.858491
2019-01-27 11.915102 12.113135 11.748725 11.923547 11.919975 -0.091527 97456.824903 0.813621
2019-02-03 11.625442 11.843185 11.360569 11.571198 11.627905 -0.667317 93623.021020 0.772521
2019-02-10 NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ...
2019-09-08 14.596266 14.935986 14.401262 14.715649 14.564102 1.048966 155212.596756 1.722521
2019-09-15 15.144680 15.418701 14.884215 15.162391 15.096582 0.599808 158320.709253 1.768950
2019-09-22 15.135537 15.397141 14.874496 15.129959 15.115724 -0.088277 131388.196198 1.469611
2019-09-29 15.054052 15.331692 14.701197 14.958344 15.065550 -0.827569 118295.782658 1.374129
2019-10-06 14.758728 14.980529 14.387937 14.547902 14.732943 -1.002513 81982.295657 0.960229
2019-10-13 14.629373 14.930197 14.398692 14.710025 14.627513 0.636353 95204.032934 1.101362
2019-10-20 14.950047 15.203062 14.689691 14.903079 14.933476 -0.300134 109903.485569 1.195154
2019-10-27 14.691936 14.932835 14.423435 14.703832 14.694104 0.155208 92698.280959 1.013124
2019-11-03 14.803516 15.089399 14.525632 14.795101 14.809838 -0.193310 118589.653663 1.310630
2019-11-10 14.926280 15.196216 14.705147 14.934956 14.899184 0.034028 106198.282544 1.219384
2019-11-17 14.758897 15.002853 14.476706 14.721258 14.777994 -0.628522 93082.760733 1.046375
2019-11-24 14.932153 15.210862 14.681571 14.939452 14.937258 0.231435 96077.021700 1.100175
2019-12-01 14.592503 14.804645 14.330456 14.553943 14.599136 -0.127033 93553.277018 0.984303
2019-12-08 14.653081 14.916211 14.485489 14.753741 14.665127 0.522050 91099.107137 1.010874
2019-12-15 15.059145 15.326443 14.854766 15.088637 15.057555 0.123028 113471.141525 1.237854
59 rows × 8 columns
The result contains many NaNs (weeks with no trading days); clear them by forward-filling.
mean.fillna(method='ffill').plot()
Resampling can also produce richer aggregates:
data_all.resample(rule='1W', on='date', closed='right', label='right').agg({'open': 'first', 'high': 'max', 'low': 'min', 'close': 'last', 'vol': 'sum', 'amt': 'sum'})
——————————————————————————————————————————————————
open high low close vol amt
date
2018-11-04 10.99 600.00 1.02 4.88 9.290162e+08 8341.184058
2018-11-11 10.95 593.00 1.04 4.79 1.940087e+09 16164.108137
2018-11-18 10.46 570.00 0.67 5.28 2.626901e+09 20840.936047
2018-11-25 10.57 572.00 0.40 5.20 2.278417e+09 18719.589955
2018-12-02 10.34 569.80 0.26 4.91 1.697278e+09 13751.677451
2018-12-09 10.59 616.50 0.25 5.50 1.874760e+09 16284.974932
2018-12-16 10.22 606.88 0.25 5.13 1.501292e+09 12780.393349
2018-12-23 10.16 595.97 0.22 5.14 1.354202e+09 11524.263748
2018-12-30 9.40 596.40 0.20 4.84 1.435642e+09 12236.799335
2019-01-06 9.39 612.00 1.11 4.99 9.532921e+08 8000.406237
2019-01-13 9.84 637.00 1.14 5.04 2.023411e+09 16347.443668
2019-01-20 10.22 690.20 1.13 5.18 1.891497e+09 15272.562995
2019-01-27 10.34 698.88 1.00 4.94 1.734926e+09 14484.087155
2019-02-03 11.04 699.00 0.92 4.83 1.669392e+09 13774.813951
2019-02-10 NaN NaN NaN NaN 0.000000e+00 0.000000
... ... ... ... ... ... ...
2019-09-08 14.15 1151.02 0.27 59.66 2.841943e+09 31539.368510
2019-09-15 14.98 1148.00 0.19 57.04 2.319557e+09 25916.883317
2019-09-22 14.70 1160.00 0.19 56.69 2.406769e+09 26920.325267
2019-09-29 15.34 1188.87 0.18 52.92 2.166824e+09 25169.923667
2019-10-06 15.75 1169.43 0.18 49.65 3.008750e+08 3524.038865
2019-10-13 15.60 1180.00 0.15 48.72 1.396453e+09 16154.776554
2019-10-20 16.97 1215.68 0.15 45.93 2.017608e+09 21940.642711
2019-10-27 16.43 1181.50 0.20 47.69 1.702775e+09 18610.066243
2019-11-03 16.98 1199.96 0.21 44.66 2.182524e+09 24120.830213
2019-11-10 16.98 1215.65 0.24 20.00 1.960633e+09 22512.265428
2019-11-17 16.50 1240.00 0.24 17.08 1.722496e+09 19363.177225
2019-11-24 16.35 1241.61 0.23 16.42 1.781652e+09 20401.654126
2019-12-01 15.64 1198.60 0.25 16.34 1.737191e+09 18277.521529
2019-12-08 15.35 1170.00 0.24 51.00 1.692713e+09 18783.057471
2019-12-15 15.62 1176.00 0.24 55.90 1.689018e+09 18425.459966
59 rows × 6 columns
II. Problem Analysis
Load the datasets:
data_basic = pd.read_excel('data_basic.xlsx')
data_zt = pd.read_excel('data_zt.xlsx')
data_all = pd.read_excel('data_all.xlsx')
Bring in the MultiIndex reshaping helper:
def variable_ts(data, field):
    ser = data[field]
    ser.index = pd.MultiIndex.from_frame(df=data[['date', 'code']])
    return ser.unstack()
1. Find the daily maximum of num (the limit-up streak count)
a. Using groupby
result_max_num_date = data_zt.groupby('date')['num'].max()
b. Using pivot_table
result_max_num_date = data_zt.pivot_table(values='num', index='date', aggfunc='max')
c. Using a MultiIndex
mat_num = variable_ts(data=data_zt, field='num')
result_max_num_date = mat_num.max(axis=1)
result_max_num_date
result_max_num_date.plot()
——————————————————————————————————————————————————
date
2018-01-02 1.0
2018-01-03 2.0
2018-01-04 3.0
2018-01-05 3.0
2018-01-08 3.0
...
2019-12-06 6.0
2019-12-09 4.0
2019-12-10 5.0
2019-12-11 6.0
2019-12-12 7.0
Length: 474, dtype: float64
2. Tally each stock's limit-up days and its average turnover
First look at the values in mat_num: a non-NaN value means the stock hit its limit-up that day, and a run of consecutive numbers (1, 2, 3, ..., n) means an n-day limit-up streak. What we want here is the number of limit-up days: if 000009.SZ shows 1 and then 2, it hit the limit twice across two days. To get that count, set every non-NaN cell to 1 and sum per column.
mat_num
——————————————————————————————————————————————————
code 000004.SZ 000005.SZ 000006.SZ 000007.SZ 000008.SZ 000009.SZ 000010.SZ 000011.SZ 000012.SZ 000014.SZ ... 603987.SH 603988.SH 603989.SH 603990.SH 603992.SH 603993.SH 603996.SH 603997.SH 603998.SH 603999.SH
date
2018-01-02 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-01-03 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-01-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-01-05 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-01-08 NaN NaN NaN NaN NaN NaN 1.0 NaN NaN 2.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
...
2019-12-06 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2019-12-09 NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2019-12-10 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2019-12-11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2019-12-12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
474 rows × 3119 columns
Define an all-zero DataFrame with the same index and columns as mat_num to store the daily limit-up flags.
df = pd.DataFrame(0, index=mat_num.index, columns=mat_num.columns)
Mark each limit-up occurrence in df with a 1.
df[mat_num > 0]=1
The total limit-up count per stock then follows easily.
zt_sums_stock = df.sum(axis=0).sort_values(ascending=False)
——————————————————————————————————————————————————
code
603032.SH 43
300598.SZ 42
600776.SH 37
300663.SZ 36
002356.SZ 34
..
600297.SH 1
002358.SZ 1
603689.SH 1
002340.SZ 1
002438.SZ 1
Length: 3119, dtype: int64
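As a cross-check, pandas can count the non-NaN cells directly, without the intermediate 0/1 frame; a minimal sketch, assuming mat_num holds only positive streak counts or NaN as described above:
# notna() flags the limit-up days; column sums count them per stock
zt_check = mat_num.notna().sum(axis=0)
(zt_check.sort_index() == zt_sums_stock.sort_index()).all()   # True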
Now turn to turnover: fill the NaNs first, then use a rolling window to track the mean turnover. The first four rows of mat_amt_mean being NaN is exactly what window=5 should produce.
mat_amt = variable_ts(data=data_all, field='amt')
mat_amt = mat_amt.fillna(value=0)
mat_amt_mean = mat_amt.rolling(window=5).mean()
mat_amt_mean
——————————————————————————————————————————————————
code 000001.SZ 000002.SZ 000004.SZ 000005.SZ 000006.SZ 000007.SZ 000008.SZ 000009.SZ 000010.SZ 000011.SZ ... 688333.SH 688357.SH 688358.SH 688363.SH 688366.SH 688368.SH 688369.SH 688388.SH 688389.SH 688399.SH
date
2018-11-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-11-02 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-11-05 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-11-06 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-11-07 14.349779 13.219227 0.178639 0.169550 0.673762 2.084265 1.657001 1.001253 0.122785 0.463832 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2019-12-06 8.638587 14.973211 0.223669 0.151208 0.385596 0.375832 1.196141 0.444873 0.045953 0.115200 ... 0.364501 1.242398 2.479148 1.851320 0.292710 0.533009 0.481587 0.816026 0.434213 1.745468
2019-12-09 8.809291 15.861202 0.233287 0.139696 0.336337 0.548861 1.583662 0.524081 0.052064 0.114330 ... 0.440212 1.474739 2.711663 1.645825 0.313169 0.757351 0.538574 0.927789 0.480371 2.190393
2019-12-10 9.111459 16.341740 0.225406 0.144344 0.330028 0.796226 1.804214 0.751842 0.052398 0.126397 ... 0.716250 1.982968 1.926679 1.974561 0.413794 1.053451 0.740009 1.402833 0.660357 2.819268
2019-12-11 10.210811 19.963586 0.231022 0.160623 0.413147 0.872229 1.813619 0.911660 0.056627 0.189311 ... 0.826002 1.858374 1.613671 1.886199 0.459506 1.186162 0.858411 1.595369 0.730358 3.260726
2019-12-12 9.990485 21.051259 0.217922 0.150647 0.461222 0.889543 1.610768 0.955314 0.052097 0.209479 ... 0.889447 1.824044 1.462992 1.816224 0.478536 1.287344 1.103114 1.757640 0.758520 2.670553
273 rows × 3756 columns
One step from the finish line: put each stock's limit-up count and mean turnover side by side, and drop the missing values.
result = pd.concat([zt_sums_stock, mat_amt.mean()], axis=1, sort=False)
result.columns = ['sums', 'amt']
result = result.dropna(subset=['sums'], how='all')
result
——————————————————————————————————————————————————
sums amt
603032.SH 43.0 2.580331
300598.SZ 42.0 2.507630
600776.SH 37.0 11.090669
300663.SZ 36.0 3.338191
002356.SZ 34.0 1.084322
... ... ...
600297.SH 1.0 0.884834
002358.SZ 1.0 1.730683
603689.SH 1.0 0.413853
002340.SZ 1.0 2.840811
002438.SZ 1.0 0.324684
3119 rows × 2 columns
The scatter plot shows that stocks with more than 20 limit-ups do not have particularly high turnover, while stocks with turnover above 20 hit the limit fewer than ten times.
plt.scatter(x=result['sums'], y=result['amt'])
3. Find the limit-up stocks and their streak lengths
Problem 2 gave us some experience; again we build a container frame, this time to store the streak lengths.
def frame_like(data, value):
    return pd.DataFrame(data=value, index=data.index, columns=data.columns)
mat_zgb = frame_like(mat_num, value=None)
mat_num_fill = mat_num.fillna(value=0)
Now mat_num_fill holds the limit-up counts. For a stock showing 1, 2, 3 (a 3-day streak) we want to extract the 3. The idea: a streak, whether one day or many, ends on a day whose next value is 0, and on that final day the stored count equals the full streak length.
mat_zgb[(mat_num_fill > 0) & (mat_num_fill.shift(periods=-1) == 0)] = mat_num_fill
Finally, stack the result back into long form to see each stock's streaks clearly.
zgb = mat_zgb.stack().reset_index(drop=False)
zgb.columns = ['date', 'code', 'num']
——————————————————————————————————————————————————
date code num
0 2018-01-02 000672.SZ 1
1 2018-01-02 000703.SZ 1
2 2018-01-02 000885.SZ 1
3 2018-01-02 002372.SZ 1
4 2018-01-02 002793.SZ 1
... ... ... ...
16350 2019-12-11 600715.SH 1
16351 2019-12-11 600812.SH 1
16352 2019-12-11 601500.SH 1
16353 2019-12-11 601999.SH 1
16354 2019-12-11 603530.SH 1
16355 rows × 3 columns
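The shift(-1) mask used above is easiest to verify on a toy series (the numbers here are made up, not from the dataset):
import pandas as pd
s = pd.Series([0, 1, 2, 3, 0, 1, 0], dtype=float)
# keep only streak ends: the value is positive and the next value is 0
ends = s[(s > 0) & (s.shift(periods=-1) == 0)]
print(ends)   # index 3 -> 3.0 (3-day streak), index 5 -> 1.0 (1-day streak)
# caveat: a streak still open on the final row is missed, because
# shift(-1) yields NaN there rather than 0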
4. Find the stocks whose longest streak is 10, and the mean turnover over the 7 days before each streak
First sort by streak length.
zgb['num'].value_counts()
zgb.sort_values(by='num')
Shift the whole table down one row with shift, then compute the rolling mean turnover, giving the average over the preceding seven days.
mat_amt_mean = mat_amt.shift(1).rolling(window=7).mean()
Pick out that 7-day mean turnover where the streak length is 10.
result_zgb1_amt = mat_amt_mean[mat_zgb == 10].stack()
result_zgb1_amt
——————————————————————————————————————————————————
date code
2019-01-10 601700.SH 0.072690
2019-02-25 000859.SZ 0.128327
2019-03-15 002356.SZ 0.000000
2019-04-26 300573.SZ 0.689295
dtype: float64
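It is worth confirming what shift(1).rolling(7).mean() aligns to; a quick toy check with made-up numbers:
import pandas as pd
amt_toy = pd.Series(range(1, 11), dtype=float)   # rows 0..9 hold 1..10
prev7 = amt_toy.shift(1).rolling(window=7).mean()
print(prev7[7])   # 4.0 = mean of rows 0..6, the seven days before row 7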
Of course, we can also wrap this into a function to get the pre-streak mean turnover for any streak length; shifting by num pushes the 7-day window back to just before the streak began.
def get_amt_mean(num):
    mat_amt_mean = mat_amt.shift(num).rolling(window=7).mean()
    result_zgb_amt = mat_amt_mean[mat_zgb == num].stack()
    return result_zgb_amt
get_amt_mean(num=8)
——————————————————————————————————————————————————
date code
2019-03-11 002750.SZ 0.283191
2019-03-13 300370.SZ 1.743853
2019-03-20 600624.SH 3.506733
2019-04-01 000590.SZ 0.932301
2019-04-09 300099.SZ 1.005648
2019-04-10 300194.SZ 0.702668
dtype: float64
Collect the mean-turnover results for each streak length into a list.
r = [get_amt_mean(num=int(i)) for i in zgb['num'].value_counts().index]
With k as the position and v as the streak length, put each streak length and its mean turnover into a dictionary.
result_dict = {}
for k, v in enumerate(zgb['num'].value_counts().index):
    result_dict[int(v)] = r[k].mean()
    print(v)
——————————————————————————————————————————————————
1.0
2.0
3.0
4.0
5.0
6.0
7.0
8.0
9.0
10.0
pd.Series(result_dict).plot.bar()
Regression Model Analysis
- Linear regression, logistic regression, ridge regression, softmax regression, and the like are the commonly used regression models, and they serve us well for building and evaluating models in data analysis practice. The scikit-learn library offers a rich set of model modules, which makes data analysis and machine learning much more convenient. Building on a basic understanding of the underlying mathematics, this part explores where regression models apply, and also demonstrates the effect of combining several models.
Linear Regression in Practice
1. Predicting Bike-Share Rental Counts
Following the analysis workflow, import the relevant libraries:
import pandas as pd
import numpy as np
#preprocessing: one-hot encoding, polynomial expansion, standardization
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
#train/test split
from sklearn.model_selection import train_test_split
#linear regression and ridge regression models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
#model evaluation: mean squared error
from sklearn.metrics import mean_squared_error
Read the dataset:
path = 'datas/hour.csv'
df = pd.read_csv(path)
Looking over the dataset, the columns instant, dteday, casual, and registered carry no real value for the analysis target, so we drop them.
df.drop(columns = ['instant','dteday','casual','registered'],inplace=True)
One-hot encode the integer-coded categorical columns into 0/1 sequences; first check which columns need encoding.
for i in df.columns:
    a=df[i]
    print(a.value_counts())
Now one-hot encode season, mnth, hr, and weekday, and remove them from df.
hot = df[['season','mnth','hr','weekday']]
hotcoder = OneHotEncoder(sparse=False,handle_unknown ='ignore')
hot = pd.DataFrame(hotcoder.fit_transform(hot))
df.drop(columns =['season','mnth','hr','weekday'],inplace=True)
Perform polynomial expansion.
poly = df[['weathersit','temp','atemp','hum','windspeed']]
#expansion parameters: degree-3 polynomial with pure power terms (e.g. squares) allowed
polycoder=PolynomialFeatures(degree=3,interaction_only=False,include_bias=False)
#transform poly with the expander; get_feature_names supplies the column names
poly = pd.DataFrame(polycoder.fit_transform(poly),
                    columns =polycoder.get_feature_names())
Then standardize the expanded features, and drop the original columns from df.
ssconder = StandardScaler()
poly = pd.DataFrame(ssconder.fit_transform(poly))
df.drop(columns =['weathersit','temp','atemp','hum','windspeed'],inplace=True)
Merge the one-hot output, the standardized output, and df.
df = pd.concat([hot,poly,df],axis=1)
Also, in the Titanic exercise we used dummy coding (get_dummies), which could replace one-hot encoding here; wrapped as a function:
def Hotconder():
    global df
    for data in ['weekday','hr','mnth','season']:
        data_dummies =pd.get_dummies( df[data],prefix =data)
        df =pd.concat([data_dummies,df],axis=1)
        df.drop(data,axis=1,inplace=True)
    return df
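A self-contained variant of the same idea avoids the global df; this is a sketch under the assumption that the listed columns are still in raw form, i.e. the OneHotEncoder step above was skipped:
def dummy_encode(frame, cols=('weekday', 'hr', 'mnth', 'season')):
    # replace each listed column with its 0/1 dummy columns
    for col in cols:
        dummies = pd.get_dummies(frame[col], prefix=col)
        frame = pd.concat([dummies, frame.drop(columns=col)], axis=1)
    return frame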
Split the dataset; the last column, cnt, is the value we want to predict.
x = df.iloc[:,:-1]
y = df.iloc[:,[-1]]
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)
Build and evaluate the linear regression model:
model =LinearRegression()
model.fit(x_train,y_train)
model.score(x_test,y_test) , model.score(x_train,y_train)
mean_squared_error(y_pred=model.predict(x_test),y_true=y_test)
__________________________________________________________________
(0.7039761586925211, 0.7044838271295909)
9796.229240009625
Build and evaluate ridge regression models:
for alpha in [0.001,0.01,0.1,1,3,4,5,6,8,10]:
    print(f'alpha:{alpha}')
    model = Ridge(alpha=alpha)
    model.fit(x_train,y_train)
    print(f'score:{model.score(x_test,y_test)}')
    print(mean_squared_error(y_pred=model.predict(x_test),
                             y_true=y_test))
————————————————————————————————————————————————————————
alpha:0.001
score:0.7068131057191035
9795.994613920964
alpha:0.01
score:0.7067652761844607
9797.592699895771
alpha:0.1
score:0.7065896548359738
9803.460580818919
alpha:1
score:0.7059984174818978
9823.215072063425
alpha:3
score:0.7053880270231085
9843.609509070426
alpha:4
score:0.7051548705857995
9851.39975907352
alpha:5
score:0.7049398070935562
9858.585485491596
alpha:6
score:0.7047357832549445
9865.402353718442
alpha:8
score:0.7043477576411166
9878.367110661091
alpha:10
score:0.7039761586925211
9890.783017954314
2. Predicting Boston Housing Prices
Import the libraries:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import warnings
import sklearn
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
Prevent garbled Chinese characters in plots and intercept the warnings.
mpl.rcParams['font.sans-serif'] =[u'simHei']
mpl.rcParams['axes.unicode_minus']=False
warnings.filterwarnings(action='ignore',category=ConvergenceWarning)
warnings.filterwarnings(action='ignore',category=UserWarning)
Load the dataset:
data = pd.read_csv('datas/boston_housing_data.csv',sep=',')
The dataset contains NaN values; drop those rows.
data.isnull().sum()
data.dropna(inplace=True)
Then separate the independent variables from the dependent variable.
names=[]
for i in list(data):
    names.append(i)
names.remove('MEDV')
x= data[names]
y =data['MEDV'].ravel()
Pipeline lets us tune several models in parallel:
#the candidate models
models =[
Pipeline([('Ss',StandardScaler()), ('Poly',PolynomialFeatures()),
('Linear',RidgeCV(alphas=np.logspace(-2,1,15)))]),
Pipeline([('Ss',StandardScaler()), ('Poly',PolynomialFeatures()),
('Linear',LassoCV(alphas=np.logspace(-2,1,15)))])
]
#the parameter grid
parameters ={
'Poly__degree' : [3,2,1],
'Poly__interaction_only':[True,False],
'Poly__include_bias' : [True,False],
'Linear__fit_intercept' : [True,False]
}
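The 'Poly__degree'-style keys follow scikit-learn's <step>__<parameter> naming convention for pipelines: the step name from the Pipeline definition, a double underscore, then the parameter name. The valid names for a pipeline can be listed with get_params():
# inspect the tunable parameter names of the first pipeline
# (the exact list depends on the scikit-learn version)
print([k for k in models[0].get_params() if '__' in k])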
Split the dataset:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=0)
Plot the ground truth first; since there are two models, titles and colors hold each model's name and line style for its prediction curve.
titles = ['Ridge','Lasso']
colors=['r-','b-']
plt.figure(figsize=(25,10),facecolor='w')
ln_x_test = range(len(x_test))
plt.plot(ln_x_test,y_test,'g-',lw=2,label=u'ground truth')
#tune each model with grid search
for t in range(2):
    model =GridSearchCV(models[t],param_grid=parameters,cv=5,n_jobs=1)
    model.fit(x_train,y_train)
    print(f'{titles[t]} best parameters: {model.best_params_}')
    print(f'{titles[t]} R value: {model.best_score_}')
    y_predict = model.predict(x_test)
    plt.plot(ln_x_test,y_predict,colors[t],lw=t+2,alpha=0.75,
             label = '%s prediction, $R^2$=%.3f' % (titles[t],model.best_score_))
plt.legend(loc='upper left')
plt.grid(True)
plt.title(u'Boston housing price prediction')
plt.show()
————————————————————————————————————————————————————————
Ridge best parameters: {'Linear__fit_intercept': True, 'Poly__degree': 2, 'Poly__include_bias': True, 'Poly__interaction_only': False}
Ridge R value: 0.8568618675311532
Lasso best parameters: {'Linear__fit_intercept': True, 'Poly__degree': 2, 'Poly__include_bias': True, 'Poly__interaction_only': False}
Lasso R value: 0.8522318747421048
3. Predicting Wine Quality
Import the libraries:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import warnings
import sklearn
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LassoCV,LinearRegression,RidgeCV,ElasticNetCV
Set the parameters to avoid garbled Chinese and intercept the warnings.
mpl.rcParams['font.sans-serif'] =[u'simHei']
mpl.rcParams['axes.unicode_minus']=False
warnings.filterwarnings(action='ignore',category=ConvergenceWarning)
warnings.filterwarnings(action='ignore',category=UserWarning)
Load the datasets, concatenate the red and white wines, and mark them apart with a type column.
data_red = pd.read_csv('datas/winequality-red.csv',sep=';')
data_white = pd.read_csv('datas/winequality-white.csv',sep=';')
data_red['type'] =1
data_white['type']=2
data =pd.concat([data_red,data_white],axis=0)
Handle anomalous values:
data = data.replace('?',np.nan)
data.isnull().sum()
#datas= data.dropna(how='any')
#datas.isnull().sum()
————————————————————————————————————————————————————————
fixed acidity 0
volatile acidity 0
citric acid 0
residual sugar 0
chlorides 0
free sulfur dioxide 0
total sulfur dioxide 0
density 0
pH 0
sulphates 0
alcohol 0
quality 0
type 0
dtype: int64
Separate the independent variables from the dependent variable, fixing the features and the target.
names= []
for i in list(data):
    names.append(i)
names.remove('quality')
x = data[names]
y = data['quality']
names
————————————————————————————————————————————————
['fixed acidity',
'volatile acidity',
'citric acid',
'residual sugar',
'chlorides',
'free sulfur dioxide',
'total sulfur dioxide',
'density',
'pH',
'sulphates',
'alcohol',
'type']
Create the model list with Pipeline:
models = [ Pipeline([('Poly',PolynomialFeatures()), ('Linear',LinearRegression())]),
Pipeline([('Poly',PolynomialFeatures()),
('Linear',RidgeCV(alphas=np.logspace(-4,1,20)))]),
Pipeline([('Poly',PolynomialFeatures()),
('Linear',LassoCV(alphas=np.logspace(-4,1,20)))]),
Pipeline([('Poly',PolynomialFeatures()),
('Linear',ElasticNetCV(alphas=np.logspace(-4,1,20),
l1_ratio=np.linspace(0,1,5)))])
]
Set up the figure: size, background color, and the subplot titles.
plt.figure(figsize = (20,10),facecolor='w')
titles = u'Linear regression',u'Ridge regression',u'Lasso regression',u'ElasticNet regression'
Next, split the dataset; we can then plot the true quality values of the test set.
x_train,x_test,y_train,y_test =train_test_split(x,y,test_size=0.01,random_state=0)
ln_x_test=range(len(x_test))
plt.plot(ln_x_test,y_test,c='r',lw=2,alpha=0.75,zorder=10,label=u'ground truth')
We want four subplots in the figure, one per model (linear, Ridge, Lasso, and ElasticNet predictions, i.e. indices 0 to 3 in models). Within each subplot we also want prediction curves for the different polynomial degrees (Poly__degree values 1, 2, 3), each drawn in its own color.
First, store the degrees in a NumPy array and use linspace to generate distinct color values for colors; the index of each degree picks its color.
degree = np.arange(1,4,1)
l =len(degree)
colors =[]
for c in np.linspace(5570560,255,l):
    colors.append('#%06x' % int(c))
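For reference, with degree = [1, 2, 3] the endpoints 5570560 (0x550000) and 255 (0x0000ff) make this loop produce three colors sweeping from dark red toward blue:
print(colors)   # ['#550000', '#2a807f', '#0000ff']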
When the first model in models runs, its prediction curve should land in the first subplot, and so on.
for t in range(4):
    model = models[t]
    plt.subplot(2,2,t+1)
    plt.plot(ln_x_test,y_test,c='g',lw=2,alpha=0.75,zorder=10,label=u'ground truth')
    #pair the colors [0,1,2] with the degrees [1,2,3]
    for i,d in enumerate(degree):
        model.set_params(Poly__degree=d)
        model.fit(x_train,y_train)
        y_predict =model.predict(x_test)
        R = model.score(x_train,y_train)
        plt.plot(ln_x_test,y_predict,c=colors[i],lw=2,alpha=0.7,zorder=i,
                 label=u'degree-%d prediction, $R^2$=%.3f' % (d,R))
    plt.legend(loc='upper left')
    plt.grid(True)
    plt.title(titles[t],fontsize=22)
    plt.xlabel('x',fontsize=18)
    plt.ylabel('y',fontsize=18)
plt.suptitle(u'Wine quality prediction',fontsize=28)
plt.show()
Logistic Regression in Practice
1. Breast Cancer Classification
Import the required libraries:
import numpy as np,pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
The file turns out to contain only values and no column labels, so the column names are set by hand. The id column has no analytical value and is dropped to reduce the data size.
names = ['id','Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape',
'Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei',
'Bland Chromatin','Normal Nucleoli','Mitoses','Class']
data = pd.read_csv('../datas/breast-cancer-wisconsin.data',names=names)
data.drop('id',axis=1,inplace=True)
The data shows no NaNs at first, but '?' values turned up later during model training, so the anomalies are handled here. The classes 2 and 4 are also converted to 0 and 1.
data[data.values =='?']
data =data.replace('?',np.nan).dropna()
data['Class'] = data['Class'] /2-1
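The division works because the classes are coded 2 (benign) and 4 (malignant): 2/2-1 = 0 and 4/2-1 = 1. An equivalent, more explicit recoding uses map; a toy sketch on made-up values (not re-applied to data, which is already converted above):
demo = pd.Series([2, 2, 4, 2, 4])
print(demo.map({2: 0, 4: 1}).tolist())   # [0, 0, 1, 0, 1]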
Now extract the independent variables x and the dependent variable y, and check their types.
x= data.iloc[:,:-1]
y = data.iloc[:,[-1]]
type(x),type(y)
—————————————————————————————————————————————————————
(pandas.core.frame.DataFrame, pandas.core.frame.DataFrame)
Standardize the independent variables and split the dataset.
sscoder = StandardScaler()
x = sscoder.fit_transform(x)
x_train,x_test,y_train,y_test =train_test_split(x,y,test_size=0.1,random_state=0)
Train the model and plot the predictions against the ground truth.
model = LogisticRegression()
model.fit(x_train,y_train)
print(model.score(x_test,y_test))
y_predict=model.predict(x_test)
ln_x_test = range(len(x_test))
plt.plot(ln_x_test,y_predict,'b-',lw=2,alpha =0.75,zorder=10,label=u'prediction')
plt.plot(ln_x_test,y_test,'r-',lw=2,alpha =0.4,zorder=10,label=u'ground truth')
——————————————————————————————————————————————————
0.9855072463768116
Predict the class probabilities and compute the AUC to evaluate the model.
m =model.predict_proba(x_test)
print(m)
fpr,tpr,thresholds = metrics.roc_curve(y_test,y_score=[i[1] for i in m],pos_label=1)
metrics.auc(fpr,tpr)
————————————————————————————————————————————————————
0.9981096408317581
2. Credit Approval
Import the libraries:
import numpy as np,pandas as pd
import matplotlib as mpl,matplotlib.pyplot as plt
import warnings
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.exceptions import ConvergenceWarning
from typing import List
Set the font and intercept the warnings.
mpl.rcParams['font.sans-serif'] =[u'simHei']
mpl.rcParams['axes.unicode_minus'] =False
warnings.filterwarnings(action ='ignore',category=ConvergenceWarning)
Load the dataset and add the column labels; A16 is the target value to predict.
names = ['A1','A2','A3','A4','A5','A6','A7','A8',
'A9','A10','A11','A12','A13','A14','A15','A16']
data =pd.read_csv('../datas/crx.data',names=names)
First handle the '?' anomalies in the dataset, then look at the value distributions to decide which columns need encoding.
data=data.replace("?",np.nan).dropna()
for i in list(data):
    print(data[i].value_counts())
The results split the columns to be processed into two groups:
Dummy coding: A4 A5 A6 A7 A13
Binary coding (0/1): A1 A9 A10 A12 A16
We can now handle each group in turn; to be able to restart quickly after any error, I copy the columns that need encoding and experiment on the copies.
Binary coding first. I initially experimented on a single column; here all the columns that need encoding are processed at once to speed things up. The key idea: pick one concrete value as the reference separating 0 from 1, compare every entry against it, and turn the results into a 0/1 table.
data[['A11','A91','A101','A121','A161']] =data[['A1','A9','A10','A12','A16']]
for name in list(data[['A11','A91','A101','A121','A161']]):
    value_new=[]
    for value in data[name].values:
        vn=1 if value == data[name][0] else 0
        value_new.append(vn)
    data[name] = value_new
Of course, the tidiest approach is to wrap this in a function. The function takes two parameters, the dataset data and the list of column names to binary-encode, names, which may differ from how others structure it.
def two_coder(data,names)-> list:
    for name in names:
        value_new=[]
        for value in data[name].values:
            vn=1 if value == data[name][0] else 0
            value_new.append(vn)
        data[name] = value_new
    return data
Dummy coding uses pandas' get_dummies method, again presented as a function.
def dummies_coder(data,names)->list:
    for name in names:
        data_dummies = pd.get_dummies(data[name],prefix=name)
        data = pd.concat([data,data_dummies],axis=1)
        data.drop(name,axis=1,inplace=True)
    return data
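A toy run of the two helpers on a made-up mini-frame shows what each produces (the dummy columns may print as True/False on newer pandas):
demo = pd.DataFrame({'A1': ['b', 'a', 'b'], 'A4': ['u', 'y', 'u']})
demo = two_coder(demo, ['A1'])        # A1 -> 1 where the value equals row 0's 'b'
demo = dummies_coder(demo, ['A4'])    # A4 -> A4_u, A4_y dummy columns
print(demo)
#    A1  A4_u  A4_y
# 0   1     1     0
# 1   0     0     1
# 2   1     1     0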
Now the 10 columns A4 A5 A6 A7 A13 A1 A9 A10 A12 A16 can be encoded easily.
two_coder_names =['A1','A9','A10','A12','A16']
dummies_coder_names =['A4','A5','A6','A7','A13']
two_coder(data,two_coder_names)
dummies_coder(data,dummies_coder_names)
The column order ends up a bit jumbled, but that does not hinder separating the independent and dependent variables.
Alternatively, we can dummy-code all of those columns directly; the target value still needs binary coding to turn '+' and '-' into 1 and 0.
total_names =['A1','A9','A10','A12','A4','A5','A6','A7','A13']
data =dummies_coder(data,total_names)
data =two_coder(data,['A16'])
Now split the dataset and take a look at the processed data.
y =pd.DataFrame(data['A16'],columns=['A16'])
x= data.drop(['A16'],axis=1)
x_train,x_test,y_train,y_test =train_test_split(x,y,test_size=0.1,random_state=0)
x_train.describe().T
——————————————————————————————————————————————————————
count mean std min 25% 50% 75% max
A3 587.0 4.909319 5.073588 0.0 1.04 3.0 7.520 28.0
A8 587.0 2.221882 3.304041 0.0 0.21 1.0 2.605 28.5
A11 587.0 2.562181 5.056756 0.0 0.00 0.0 3.000 67.0
A15 587.0 943.959114 5081.188098 0.0 0.00 5.0 397.000 100000.0
A1_a 587.0 0.315162 0.464977 0.0 0.00 0.0 1.000 1.0
A1_b 587.0 0.684838 0.464977 0.0 0.00 1.0 1.000 1.0
A9_f 587.0 0.461670 0.498954 0.0 0.00 0.0 1.000 1.0
A9_t 587.0 0.538330 0.498954 0.0 0.00 1.0 1.000 1.0
A10_f 587.0 0.550256 0.497892 0.0 0.00 1.0 1.000 1.0
A10_t 587.0 0.449744 0.497892 0.0 0.00 0.0 1.000 1.0
A12_f 587.0 0.534923 0.499204 0.0 0.00 1.0 1.000 1.0
A12_t 587.0 0.465077 0.499204 0.0 0.00 0.0 1.000 1.0
A4_l 587.0 0.003407 0.058321 0.0 0.00 0.0 0.000 1.0
A4_u 587.0 0.761499 0.426530 0.0 1.00 1.0 1.000 1.0
A4_y 587.0 0.235094 0.424419 0.0 0.00 0.0 0.000 1.0
A5_g 587.0 0.761499 0.426530 0.0 1.00 1.0 1.000 1.0
A5_gg 587.0 0.003407 0.058321 0.0 0.00 0.0 0.000 1.0
A5_p 587.0 0.235094 0.424419 0.0 0.00 0.0 0.000 1.0
A6_aa 587.0 0.078365 0.268974 0.0 0.00 0.0 0.000 1.0
A6_c 587.0 0.211244 0.408539 0.0 0.00 0.0 0.000 1.0
A6_cc 587.0 0.061329 0.240137 0.0 0.00 0.0 0.000 1.0
A6_d 587.0 0.037479 0.190094 0.0 0.00 0.0 0.000 1.0
A6_e 587.0 0.035775 0.185887 0.0 0.00 0.0 0.000 1.0
A6_ff 587.0 0.069847 0.255106 0.0 0.00 0.0 0.000 1.0
A6_i 587.0 0.085179 0.279386 0.0 0.00 0.0 0.000 1.0
A6_j 587.0 0.015332 0.122975 0.0 0.00 0.0 0.000 1.0
A6_k 587.0 0.073254 0.260775 0.0 0.00 0.0 0.000 1.0
A6_m 587.0 0.059625 0.236993 0.0 0.00 0.0 0.000 1.0
A6_q 587.0 0.120954 0.326352 0.0 0.00 0.0 0.000 1.0
A6_r 587.0 0.005111 0.071367 0.0 0.00 0.0 0.000 1.0
A6_w 587.0 0.097104 0.296352 0.0 0.00 0.0 0.000 1.0
A6_x 587.0 0.049404 0.216894 0.0 0.00 0.0 0.000 1.0
A7_bb 587.0 0.081772 0.274250 0.0 0.00 0.0 0.000 1.0
A7_dd 587.0 0.010221 0.100669 0.0 0.00 0.0 0.000 1.0
A7_ff 587.0 0.076661 0.266280 0.0 0.00 0.0 0.000 1.0
A7_h 587.0 0.207836 0.406105 0.0 0.00 0.0 0.000 1.0
A7_j 587.0 0.011925 0.108641 0.0 0.00 0.0 0.000 1.0
A7_n 587.0 0.006814 0.082337 0.0 0.00 0.0 0.000 1.0
A7_o 587.0 0.003407 0.058321 0.0 0.00 0.0 0.000 1.0
A7_v 587.0 0.587734 0.492662 0.0 0.00 1.0 1.000 1.0
A7_z 587.0 0.013629 0.116042 0.0 0.00 0.0 0.000 1.0
A13_g 587.0 0.913118 0.281903 0.0 1.00 1.0 1.000 1.0
A13_p 587.0 0.003407 0.058321 0.0 0.00 0.0 0.000 1.0
A13_s 587.0 0.083475 0.276835 0.0 0.00 0.0 0.000 1.0
Standardize the independent variable x.
ss_coder = StandardScaler()
x_train =ss_coder.fit_transform(x_train)
x_test =ss_coder.transform(x_test)
Build a logistic regression model and train it on the data.
lgr = LogisticRegressionCV(Cs=np.logspace(-4,1,50),fit_intercept=True,penalty='l2',
solver ='lbfgs',tol=0.01,multi_class='ovr')
lgr.fit(x_train,y_train)
———————————————————————————————————————————————————————
LogisticRegressionCV(Cs=array([1.00000000e-04, 1.26485522e-04, 1.59985872e-04, 2.02358965e-04,
2.55954792e-04, 3.23745754e-04, 4.09491506e-04, 5.17947468e-04,
6.55128557e-04, 8.28642773e-04, 1.04811313e-03, 1.32571137e-03,
1.67683294e-03, 2.12095089e-03, 2.68269580e-03, 3.39322177e-03,
4.29193426e-03, 5.42867544e-03, 6.86648845e-03, 8.68511374e-03,
1.09854114e-02, 1.38...
7.19685673e-02, 9.10298178e-02, 1.15139540e-01, 1.45634848e-01,
1.84206997e-01, 2.32995181e-01, 2.94705170e-01, 3.72759372e-01,
4.71486636e-01, 5.96362332e-01, 7.54312006e-01, 9.54095476e-01,
1.20679264e+00, 1.52641797e+00, 1.93069773e+00, 2.44205309e+00,
3.08884360e+00, 3.90693994e+00, 4.94171336e+00, 6.25055193e+00,
7.90604321e+00, 1.00000000e+01]),
multi_class='ovr', tol=0.01)
Evaluate the model:
lgr_r = lgr.score(x_train,y_train)
print(f'Logistic R value: {lgr_r}')
print(f'Logistic feature sparsity ratio: {np.mean(lgr.coef_.ravel()==0)*100:.2f}%')
print(f'Logistic coefficients: {lgr.coef_}')
print(f'Logistic intercept: {lgr.intercept_}')
————————————————————————————————————————————————————
Logistic R value: 0.889267461669506
Logistic feature sparsity ratio: 0.00%
Logistic coefficients: [[ 0.06010294 0.06371679 0.14746233 0.17539052 -0.0760682 0.11441961
-0.00360566 0.00360566 -0.42879631 0.42879631 -0.15905789 0.15905789
0.00924079 -0.00924079 0.05970023 0.038181 -0.04657456 0.038181
0.05970023 -0.04657456 -0.02160594 0.00424491 0.09527565 -0.02703857
0.03342162 -0.10700193 -0.09250279 -0.01900214 -0.05327403 -0.01117224
0.04677697 0.0120337 0.03154771 0.12295968 -0.02175103 -0.00740624
-0.09594935 0.07653215 0.0248417 0.02830086 -0.00219562 -0.00572481
-0.00776041 0.01346844 -0.00498746 -0.01266428]]
Logistic intercept: [-0.24652859]
Use the model to predict y.
y_predict=lgr.predict(x_test)
y_proba = lgr.predict_proba(x_train)
y_predict,y_proba
——————————————————————————————————————————————————————
(array([1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0,
0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1],
dtype=int64),
array([[0.88120287, 0.11879713],
[0.51051602, 0.48948398],
[0.51993802, 0.48006198],
...,
[0.08128366, 0.91871634],
[0.87979668, 0.12020332],
[0.34058366, 0.65941634]]))
Plot the true and predicted credit-approval outcomes.
#sample index for the x-axis
ln_x_test =range(len(x_test))
#figure size and background color
plt.figure(figsize=(20,8),facecolor='w')
plt.ylim(-0.1,1.1)
plt.plot(ln_x_test,y_test,'ro',markersize=15,alpha=0.75,zorder=10,label=u'ground truth')
plt.plot(ln_x_test,y_predict,'bo',markersize=17,alpha=0.6,zorder=10,
         label =f'logistic prediction, $R^2$={lgr.score(x_test,y_test)}')
plt.legend(loc='center',fontsize=20)
plt.xlabel(u'sample index',fontsize=20)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.ylabel(u'approved (0: rejected, 1: approved)',fontsize=20)
plt.title('Logistic regression',fontsize=24)
plt.show()
3. Iris Classification
Import the libraries:
import numpy as np,pandas as pd,matplotlib as mpl
import matplotlib.pyplot as plt
import warnings
import sklearn
from sklearn.preprocessing import StandardScaler,label_binarize
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.exceptions import ConvergenceWarning
from sklearn import metrics
from typing import List
Prevent garbled Chinese characters and intercept warnings.
## character set, to avoid garbled Chinese in plots
mpl.rcParams['font.sans-serif']=[u'simHei']
mpl.rcParams['axes.unicode_minus']=False
## intercept warnings
warnings.filterwarnings(action = 'ignore', category=ConvergenceWarning)
Load the dataset:
names = ['sepal length', 'sepal width', 'petal length', 'petal width', 'cla']
data = pd.read_csv('../datas/iris.data',names=names)
data
——————————————————————————————————————————————————————
sepal length sepal width petal length petal width cla
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica
150 rows × 5 columns
Inspect the dataset for anomalies; there are no NaNs and no question marks.
data.isnull().sum()
data[data.values=="?"]
————————————————————
sepal length 0
sepal width 0
petal length 0
petal width 0
cla 0
dtype: int64
sepal length sepal width petal length petal width cla
Preprocessing here mainly means encoding cla, which turns out to have three distinct values.
data['cla'].value_counts()
————————————————————————————————————————————————————————
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
Name: cla, dtype: int64
One possible trick: a set deduplicates the labels, and a tuple's index() then returns the number we want.
tuple_claa =tuple(set(data['cla']))
tuple_claa[0],tuple_claa[1],tuple_claa[2],tuple_claa.index('Iris-virginica')
————————————————————————————————————————————————
('Iris-versicolor', 'Iris-virginica', 'Iris-setosa', 1)
Following this idea, define an encoding function that works directly on the original dataset data, with a column name as its other parameter.
def get_vn_coder(data,name) :
    new_value =[]
    tuple_name = tuple(set(data[name]))
    for value in data[name]:
        vn=tuple_name.index(value)+1
        new_value.append(vn)
    data[name] = new_value
    return data
get_vn_coder(data,'cla')
————————————————————————————————————————————————————
sepal length sepal width petal length petal width cla
0 5.1 3.5 1.4 0.2 3
1 4.9 3.0 1.4 0.2 3
2 4.7 3.2 1.3 0.2 3
3 4.6 3.1 1.5 0.2 3
4 5.0 3.6 1.4 0.2 3
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 2
146 6.3 2.5 5.0 1.9 2
147 6.5 3.0 5.2 2.0 2
148 6.2 3.4 5.4 2.3 2
149 5.9 3.0 5.1 1.8 2
150 rows × 5 columns
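One caveat with the set-based trick: Python randomizes string hashing between interpreter runs, so the ordering of tuple(set(...)) and hence the label-to-number mapping can change from run to run. pd.factorize is a deterministic alternative; a sketch on the raw labels:
# factorize assigns codes in order of first appearance, reproducible across runs
raw = pd.Series(['Iris-setosa', 'Iris-versicolor', 'Iris-setosa'])
codes, uniques = pd.factorize(raw)
print(codes + 1, list(uniques))   # [1 2 1] ['Iris-setosa', 'Iris-versicolor']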
Split the dataset and standardize it.
x= data.iloc[:,:-1]
y =pd.DataFrame(data.iloc[:,-1])
type(x),type(y)
x_train,x_test,y_train,y_test =train_test_split(x,y,test_size=0.2,random_state=0)
ss_coder = StandardScaler()
x_train =ss_coder.fit_transform(x_train)
x_test=ss_coder.transform(x_test)
Build and train the logistic regression model.
lgr = LogisticRegressionCV(Cs =np.logspace(-4,1,50),cv=3,fit_intercept=True,
penalty='l2',solver='lbfgs',tol=0.01,multi_class='multinomial')
lgr.fit(x_train,y_train)
Logistic regression results:
#binarize the test labels into a matrix
y_test_h = label_binarize(y_test,classes=(1,2,3))
#decision scores for the test samples
lgr_y_score =lgr.decision_function(x_test)
#compute the ROC curve; thresholds holds the cut-off values
lgr_fpr,lgr_tpr,lgr_thresholds =metrics.roc_curve(y_test_h.ravel(),
                                                  lgr_y_score.ravel())
lgr_auc = metrics.auc(lgr_fpr,lgr_tpr)
print(f'Logistic R value: {lgr.score(x_train,y_train)}')
print(f'Logistic AUC: {lgr_auc}')
#model prediction
y_pred =lgr.predict(x_test)
————————————————————————————————————————————————————
Logistic R value: 0.975
Logistic AUC: 0.9011111111111111
Build the KNN model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train,y_train)
KNN model results
#binarize the test labels into a matrix
y_test_h = label_binarize(y_test,classes=(1,2,3))
#predicted class probabilities
knn_y_score =knn.predict_proba(x_test)
#compute the ROC curve; thresholds holds the cut-off values
knn_fpr,knn_tpr,knn_thresholds =metrics.roc_curve(y_test_h.ravel(),
                                                  knn_y_score.ravel())
knn_auc = metrics.auc(knn_fpr,knn_tpr)
print(f'KNN R value: {knn.score(x_train,y_train)}')
print(f'KNN AUC: {knn_auc}')
knn_y_pred =knn.predict(x_test)
——————————————————————————————————————————————————
KNN R value: 0.9666666666666667
KNN AUC: 0.9972222222222222
Plot the ROC curves for logistic regression and KNN.
plt.figure(figsize=(20,8),facecolor='w')
plt.plot(lgr_fpr,lgr_tpr,c='b',lw=2,label=u'Logistic: AUC=%.3f' % lgr_auc)
plt.plot(knn_fpr,knn_tpr,c='r',lw=2,label=u'KNN: AUC=%.3f' % knn_auc)
plt.plot((0,1),(0,1),c='#a0a0a0',lw=2,ls='--')
#axis ranges
plt.xlim(-0.01,1.02)
plt.ylim(-0.01,1.02)
#axis ticks
plt.xticks(np.arange(0,1,0.1))
plt.yticks(np.arange(0,1,0.1))
#axis labels
plt.xlabel('FPR' ,fontsize=20)
plt.ylabel('TPR' ,fontsize=20)
#show the grid
plt.grid(b=True,ls=':')
#legend
plt.legend(loc='lower right',fancybox=True,framealpha=0.7,fontsize=18)
plt.title('ROC/AUC of Logistic regression and KNN on the iris data',fontsize=25)
plt.show()
Plot the predictions of the logistic regression and KNN models.
#sample index
ln_x_test =range(len(x_test))
#figure size and background color
plt.figure(figsize =(20,10),facecolor='w')
#y-axis range
plt.ylim(0.5,3.5)
plt.plot(ln_x_test,y_test,'ro',alpha=0.8,markersize=18,zorder=10,label=u'ground truth')
plt.plot(ln_x_test,y_pred,'bo',alpha=0.75,markersize=13,zorder=10,
         label=u'Logistic prediction, $R^2$=%.3f' % lgr.score(x_test,y_test))
plt.plot(ln_x_test,knn_y_pred,'go',alpha=0.9,markersize=8,zorder=10,
         label=u'KNN prediction, $R^2$=%.3f' % knn.score(x_test,y_test))
#legend
plt.legend(loc='lower right',fontsize=12)
plt.xlabel(u'sample index',fontsize=20)
plt.ylabel(u'class',fontsize=20)
plt.title(u'Iris classification',fontsize=24)
plt.show()
4. Wine Quality Prediction (Softmax)
Import the libraries and configure the environment.
import pandas as pd,numpy as np,matplotlib as mpl
import matplotlib.pyplot as plt
import sklearn
import warnings
from sklearn.preprocessing import StandardScaler,MinMaxScaler,LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV
from sklearn.exceptions import ConvergenceWarning
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler,Normalizer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
## character set, to avoid garbled Chinese in plots
mpl.rcParams['font.sans-serif']=[u'simHei']
mpl.rcParams['axes.unicode_minus']=False
## intercept warnings
warnings.filterwarnings(action = 'ignore', category=ConvergenceWarning)
Load the datasets, add the type column, and concatenate data_red and data_white.
data_red = pd.read_csv('../datas/winequality-red.csv',sep=';')
data_white = pd.read_csv('../datas/winequality-white.csv',sep=';')
data_red['type']=1
data_white['type']=2
data_all=pd.concat([data_red,data_white],axis=0)
data_all
————————————————————————————————————————————————————
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality type
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 5 1
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.99680 3.20 0.68 9.8 5 1
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.99700 3.26 0.65 9.8 5 1
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.99800 3.16 0.58 9.8 6 1
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 5 1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
4893 6.2 0.21 0.29 1.6 0.039 24.0 92.0 0.99114 3.27 0.50 11.2 6 2
4894 6.6 0.32 0.36 8.0 0.047 57.0 168.0 0.99490 3.15 0.46 9.6 5 2
4895 6.5 0.24 0.19 1.2 0.041 30.0 111.0 0.99254 2.99 0.46 9.4 6 2
4896 5.5 0.29 0.30 1.1 0.022 20.0 110.0 0.98869 3.34 0.38 12.8 7 2
4897 6.0 0.21 0.38 0.8 0.020 22.0 98.0 0.98941 3.26 0.32 11.8 6 2
6497 rows × 13 columns
Inspect the dataset info.
data_all.info()
————————————————————————————————————————————————————
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6497 entries, 0 to 4897
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 fixed acidity 6497 non-null float64
1 volatile acidity 6497 non-null float64
2 citric acid 6497 non-null float64
3 residual sugar 6497 non-null float64
4 chlorides 6497 non-null float64
5 free sulfur dioxide 6497 non-null float64
6 total sulfur dioxide 6497 non-null float64
7 density 6497 non-null float64
8 pH 6497 non-null float64
9 sulphates 6497 non-null float64
10 alcohol 6497 non-null float64
11 quality 6497 non-null int64
12 type 6497 non-null int64
dtypes: float64(11), int64(2)
memory usage: 710.6 KB
Handle anomalous values.
data =data_all.replace("?",np.nan).dropna(how='any')
Extract x and y from the dataset and split it.
y =pd.DataFrame(data['quality'])
x=data.drop('quality',axis=1)
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.1,random_state=0)
Fit the standardizer on x_train and apply the same transform to x_test.
ss_coder =StandardScaler()
x_train=ss_coder.fit_transform(x_train)
x_test =ss_coder.transform(x_test)
Build the logistic regression model and train it.
lgr = LogisticRegressionCV(fit_intercept=True,Cs =np.logspace(-3,1,50),
multi_class='multinomial',penalty='l2',solver='lbfgs')
lgr.fit(x_train,y_train)
Logistic regression model results.
lgr_R = lgr.score(x_train,y_train)
print('R value:' ,lgr_R)
print('Feature sparsity ratio: %.2f%%' % (np.mean(lgr.coef_.ravel()==0)*100))
print('Coefficients:', lgr.coef_)
print('Intercept:' ,lgr.intercept_)
y_pred =lgr.predict(x_test)
————————————————————————————————————————————————————
R value: 0.5496835984265436
Feature sparsity ratio: 0.00%
Coefficients: [[ 0.67752559 0.987899 -0.32027472 0.00677359 0.94350148 0.39520366 0.12637852 -0.14786997 0.16976999 -0.45294428 -0.52182398 0.61274299]
[-0.54979328 0.84826118 -0.01591337 -1.09324428 0.54110719 -0.88734675 0.1013851 0.99221528 -0.39935197 -0.07018644 -0.58198203 1.07255851]
[-0.68865143 0.30156317 0.08701101 -0.69110331 0.5099835 -0.25769075 0.43825846 0.67516745 -0.55340259 -0.16496295 -0.89681563 -0.4581928 ]
[-0.62705797 -0.35617218 0.00359714 -0.31382828 0.4693695 -0.04570205 0.06847681 0.51632857 -0.4794139 0.08879676 0.00749543 -0.52250221]
[-0.01449942 -0.7438577 -0.0296082 0.59747534 0.26343921 0.03448366 -0.01468908 -0.73431732 -0.08090158 0.39547905 0.27094513 -0.85751529]
[-0.11072351 -0.58705212 0.06808383 0.85254437 0.37615573 0.26429456 -0.09980634 -0.82237239 -0.06509039 0.35993973 0.56090155 -0.64243831]
[ 1.31320002 -0.45064136 0.20710431 0.64138256 -3.10355662 0.49675766 -0.62000346 -0.47915162 1.40839043 -0.15612186 1.16127953 0.79534712]]
Intercept: [-1.88365356 0.34148367 2.98466092 3.5517204 2.06959079 0.0302318 -7.09403403]
Plot the true and predicted values for the logistic regression model.
ln_x_test =range(len(x_test))
plt.figure(figsize=(20,10),facecolor='w')
plt.ylim(-1,11)
plt.plot(ln_x_test,y_test,'ro',markersize=10,alpha=0.7,zorder=10,label=u'ground truth')
plt.plot(ln_x_test,y_pred,'bo',markersize=15,alpha=0.7,zorder=10,
         label=u'prediction, $R^2$=%.3f' % lgr_R)
plt.legend(loc='upper left',fontsize=18)
plt.xlabel(u'sample index',fontsize=20)
plt.ylabel(u'wine quality',fontsize=20)
plt.title(u'Wine quality prediction',fontsize=24)
plt.show()
Dimensionality Reduction with PCA
Split the dataset.
x1_train,x1_test,y1_train,y1_test=train_test_split(x,y,test_size=0.01,random_state=0)
Normalize x1_train and x1_test.
nor =Normalizer()
x1_train=nor.fit_transform(x1_train)
x1_test =nor.transform(x1_test)
Apply the dimensionality reduction; the effect turns out to be limited.
# reduce the samples to 2 dimensions
pca = PCA(n_components=2)
x1_train = pca.fit_transform(x1_train)
print ("Explained variance:", pca.explained_variance_)
# project the test data with the same fitted PCA
x1_test = pca.transform(x1_test)
——————————————————————————————————————————————————————
Explained variance: [0.80467114 0.12287721]
Train the model.
lgr2 = LogisticRegressionCV(fit_intercept=True,Cs=np.logspace(-1,3,50),
multi_class='multinomial',penalty='l2',solver='lbfgs')
lgr2.fit(x1_train,y1_train)
Training results.
lgr2_R = lgr2.score(x1_train,y1_train)
print('R value:' ,lgr2_R)
print('Feature sparsity ratio: %.2f%%' % (np.mean(lgr2.coef_.ravel()==0)*100))
print('Coefficients:', lgr2.coef_)
print('Intercept:' ,lgr2.intercept_)
y1_pred =lgr2.predict(x1_test)
——————————————————————————————————————————————————————
R value: 0.45988805970149255
Feature sparsity ratio: 0.00%
Coefficients: [[ 0.41173597 1.59780389]
[ 0.59498972 0.99400199]
[ 0.04557274 1.36230804]
[-0.06424483 -0.11878825]
[-0.10983418 -0.93218132]
[-0.37749056 -1.34843384]
[-0.50072887 -1.55471051]]
Intercept: [-2.1018385 -0.15045045 2.26818011 2.61282555 1.57707652 -0.24589361 -3.95989963]
Plot the true and predicted values; after dimensionality reduction and normalization, the R value drops.
ln_x1_test=range(len(x1_test))
plt.figure(figsize=(20,10),facecolor='w')
plt.plot(ln_x1_test,y1_test,'go',markersize=15,zorder=10,alpha=0.75,label=u'ground truth')
plt.plot(ln_x1_test,y1_pred,'bo',markersize=10,zorder=10,alpha=0.8,
         label=u'prediction, $R^2$=%.3f' % lgr2_R)
plt.legend(loc='upper left',fontsize=20)
plt.xlabel(u'sample index',fontsize=20)
plt.ylabel(u'wine quality',fontsize=20)
plt.title(u'Wine quality prediction (after PCA)',fontsize=24)
plt.show()