Applied Data Analysis with Python

Preface

Python is an extremely powerful language that delivers in many fields, whether software development, front-end or back-end development, data analysis, or machine learning. This article applies Python to data analysis across a range of scenarios.

Stock Analysis

  • The Pandas library was originally created for financial analysis, and it remains an enormous help for understanding, exploring, and analyzing datasets in all kinds of projects.

I. Exploratory Analysis

First, import the relevant modules:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Configure the Jupyter Notebook display options (rows and columns):

pd.options.display.min_rows = None
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_rows', 30)
pd.set_option('max_columns', 20)

Load the dataset and inspect the first 10 rows:

data_all = pd.read_excel('data_all.xlsx')
data_all.head(10)
——————————————————————————————————————————————————
	date	code	open	high	low	close	pre_close	pct_chg	vol	amt
0	2018-11-01	000001.SZ	10.99	11.05	10.76	10.83	10.91	-0.7333	1542776.32	16.794434
1	2018-11-01	000002.SZ	24.90	25.27	24.28	24.42	24.23	0.7842	617847.25	15.318339
2	2018-11-01	000004.SZ	15.63	15.63	15.43	15.50	15.54	-0.2574	5597.02	0.086807
3	2018-11-01	000005.SZ	2.74	2.76	2.71	2.71	2.73	-0.7326	50199.00	0.137363
4	2018-11-01	000006.SZ	5.13	5.15	5.03	5.04	5.07	-0.5917	149151.86	0.759842
5	2018-11-01	000007.SZ	7.25	7.40	7.07	7.16	7.25	-1.2414	288091.37	2.075888
6	2018-11-01	000008.SZ	4.43	4.47	4.31	4.35	4.41	-1.3605	470410.19	2.058089
7	2018-11-01	000009.SZ	4.05	4.10	4.00	4.03	4.02	0.2488	128825.16	0.523551
8	2018-11-01	000010.SZ	4.38	4.38	4.31	4.34	4.38	-0.9132	26943.00	0.117115
9	2018-11-01	000011.SZ	8.90	9.08	8.85	8.86	8.90	-0.4494	37702.91	0.338891

The columns are:

  1. date: trade date
  2. code: stock ticker
  3. open: opening price
  4. high: daily high
  5. low: daily low
  6. close: closing price
  7. pre_close: previous close
  8. pct_chg: percent change
  9. vol: trading volume
  10. amt: turnover amount
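
Since the weekly resampling later relies on date being a datetime column, it is worth confirming the dtypes up front. A minimal check (the conversion line is an assumption, only needed if the Excel reader left the dates as strings):

print(data_all.dtypes)
# resample(on='date') below requires a datetime64 column; convert if necessary.
data_all['date'] = pd.to_datetime(data_all['date'])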

Use describe() to get an overview of the dataset's distribution:

data_all.describe()
——————————————————————————————————————————————————
open	high	low	close	pre_close	pct_chg	vol	amt
count	984476.000000	984476.000000	984476.000000	984476.000000	984476.000000	984476.000000	9.844760e+05	984476.000000
mean	13.680437	13.960330	13.432177	13.696887	13.680717	0.103236	1.367581e+05	1.345854
std	22.011995	22.416306	21.665628	22.056704	22.002960	3.057492	3.112537e+05	3.061065
min	0.130000	0.140000	0.120000	0.130000	0.130000	-27.193700	2.000000e+00	0.000004
25%	5.400000	5.500000	5.310000	5.410000	5.400000	-1.308000	2.376319e+04	0.220520
50%	8.840000	9.000000	8.690000	8.840000	8.840000	0.000000	5.593079e+04	0.515636
75%	15.600000	15.920000	15.320000	15.620000	15.600000	1.339300	1.377442e+05	1.287892
max	1231.000000	1241.610000	1228.060000	1233.750000	1233.750000	400.153100	4.034860e+07	181.957345

Now let's look at the closing prices:

price = data_all['close'] 
price.describe(),price.plot() 
——————————————————————————————————————————————————
(count    984476.000000
 mean         13.696887
 std          22.056704
 min           0.130000
 25%           5.410000
 50%           8.840000
 75%          15.620000
 max        1233.750000
 Name: close, dtype: float64,
 <AxesSubplot:>)

[figure: closing price series]

Next, the percent changes: the maximum reaches 400.1531.

data_all['pct_chg'].max(),data_all['pct_chg'].hist(bins=200)
——————————————————————————————————————————————————
(400.1531, <AxesSubplot:>)

[figure: histogram of pct_chg]

Rework the index into a MultiIndex so that we can see, for any given day, each stock and its closing price:

price.index=pd.MultiIndex.from_frame(df=data_all[['date', 'code']])
price, price.index
——————————————————————————————————————————————————
(date        code       amt      
 2018-11-01  000001.SZ  16.794434    10.83
             000002.SZ  15.318339    24.42
             000004.SZ  0.086807     15.50
             000005.SZ  0.137363      2.71
             000006.SZ  0.759842      5.04
             000007.SZ  2.075888      7.16
             000008.SZ  2.058089      4.35
             000009.SZ  0.523551      4.03
             000010.SZ  0.117115      4.34
             000011.SZ  0.338891      8.86
             000012.SZ  0.397421      4.19
             000014.SZ  0.509856      9.01
             000016.SZ  0.601305      3.55
             000017.SZ  0.594669      4.43
             000018.SZ  0.439367      2.00
                                     ...  
 2019-12-12  688288.SH  0.764449     31.28
             688299.SH  0.809870     16.78
             688300.SH  0.609886     32.52
             688310.SH  0.761119     27.59
             688321.SH  1.021553     55.52
             688333.SH  0.662783     52.44
             688357.SH  1.546412     45.22
             688358.SH  0.953628     45.40
             688363.SH  1.579945     82.78
             688366.SH  0.389405     87.00
             688368.SH  1.080490     78.83
             688369.SH  1.655059     58.48
             688388.SH  1.446511     45.19
             688389.SH  0.506921     15.85
             688399.SH  2.451546     55.90
 Name: close, Length: 984476, dtype: float64,

Next, unstack the series into a wide table:

mat_close = price.unstack()
mat_close.head(10)
——————————————————————————————————————————————————
code	000001.SZ	000002.SZ	000004.SZ	000005.SZ	000006.SZ	000007.SZ	000008.SZ	000009.SZ	000010.SZ	000011.SZ	...	688333.SH	688357.SH	688358.SH	688363.SH	688366.SH	688368.SH	688369.SH	688388.SH	688389.SH	688399.SH
date																					
2018-11-01	10.83	24.42	15.50	2.71	5.04	7.16	4.35	4.03	4.34	8.86	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-11-02	11.09	24.62	15.76	2.76	5.14	7.17	4.37	4.14	4.38	9.06	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-11-05	10.92	24.04	16.31	2.81	5.13	7.64	4.52	4.29	4.40	9.18	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-11-06	10.84	24.14	16.26	2.82	5.12	7.03	4.43	4.41	4.34	9.27	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-11-07	10.81	23.85	16.12	2.79	5.08	7.03	NaN	4.31	4.35	9.06	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-11-08	10.89	23.99	16.30	2.86	5.09	7.32	NaN	4.29	4.33	9.14	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-11-09	10.55	23.55	16.18	2.87	5.03	7.29	NaN	4.29	4.26	8.95	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-11-12	10.56	23.88	16.59	3.00	5.13	7.58	4.22	4.45	4.31	9.22	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-11-13	10.54	23.94	17.12	3.15	5.28	8.34	4.30	4.56	4.40	9.40	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-11-14	10.44	24.15	17.17	3.08	5.30	8.00	4.22	4.56	4.39	9.49	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

After this transformation it is much easier to see how each stock evolves over time. The steps above can also be wrapped into a function:

def variable_ts(data, field):
    data_s = data[field]
    data_s.index = pd.MultiIndex.from_frame(df=data[['date', 'code']])
    return data_s.unstack()

Use this function to examine the percent changes.

mat_pct_chg = variable_ts(data=data_all, field='pct_chg')
mat_pct_chg 
——————————————————————————————————————————————————
code	000001.SZ	000002.SZ	000004.SZ	000005.SZ	000006.SZ	000007.SZ	000008.SZ	000009.SZ	000010.SZ	000011.SZ	...	688333.SH	688357.SH	688358.SH	688363.SH	688366.SH	688368.SH	688369.SH	688388.SH	688389.SH	688399.SH
date						
2018-11-01	-0.7333	0.7842	-0.2574	-0.7326	-0.5917	-1.2414	-1.3605	0.2488	-0.9132	-0.4494	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-11-02	2.4007	0.8190	1.6774	1.8450	1.9841	0.1397	0.4598	2.7295	0.9217	2.2573	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-11-05	-1.5329	-2.3558	3.4898	1.8116	-0.1946	6.5551	3.4325	3.6232	0.4566	1.3245	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-11-06	-0.7326	0.4160	-0.3066	0.3559	-0.1949	-7.9843	-1.9912	2.7972	-1.3636	0.9804	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-11-07	-0.2768	-1.2013	-0.8610	-1.0638	-0.7813	0.0000	NaN	-2.2676	0.2304	-2.2654	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-11-08	0.7401	0.5870	1.1166	2.5090	0.1969	4.1252	NaN	-0.4640	-0.4598	0.8830	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-11-09	-3.1221	-1.8341	-0.7362	0.3497	-1.1788	-0.4098	NaN	0.0000	-1.6166	-2.0788	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-11-12	0.0948	1.4013	2.5340	4.5296	1.9881	3.9781	-4.7404	3.7296	1.1737	3.0168	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-11-13	-0.1894	0.2513	3.1947	5.0000	2.9240	10.0264	1.8957	2.4719	2.0882	1.9523	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-11-14	-0.9488	0.8772	0.2921	-2.2222	0.3788	-4.0767	-1.8605	0.0000	-0.2273	0.9574	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
10 rows × 3756 columns

Set the index to date while keeping the date column available:

data_all.set_index(keys='date', drop=False, inplace=True)

Then use a rolling window to track the moving average of the closing price.

ma = data_all['close'].rolling(window=20, min_periods=20).mean()
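
Note that data_all['close'] strings every stock into one long series, so this window mixes different codes at stock boundaries. A per-stock 20-day mean can instead be computed on the unstacked matrix (a sketch using mat_close from above):

ma_per_stock = mat_close.rolling(window=20, min_periods=20).mean()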

Concatenate the two series side by side; the right-hand close column holds the rolling mean, which stays NaN until the window has accumulated 20 observations.

pd.concat([data_all['close'],ma],axis=1,sort=False)
——————————————————————————————————————————————————
	close	close
date		
2018-11-01	10.83	NaN
2018-11-01	24.42	NaN
2018-11-01	15.50	NaN
2018-11-01	2.71	NaN
2018-11-01	5.04	NaN
2018-11-01	7.16	NaN
2018-11-01	4.35	NaN
2018-11-01	4.03	NaN
2018-11-01	4.34	NaN
2018-11-01	8.86	NaN
2018-11-01	4.19	NaN
2018-11-01	9.01	NaN
2018-11-01	3.55	NaN
2018-11-01	4.43	NaN
2018-11-01	2.00	NaN
...	...	...
2019-12-12	31.28	49.3930
2019-12-12	16.78	47.4825
2019-12-12	32.52	48.1540
2019-12-12	27.59	48.7690
2019-12-12	55.52	43.8650
2019-12-12	52.44	45.0920
2019-12-12	45.22	45.1880
2019-12-12	45.40	45.8355
2019-12-12	82.78	49.0185
2019-12-12	87.00	52.5670
2019-12-12	78.83	55.0840
2019-12-12	58.48	56.5405
2019-12-12	45.19	54.0675
2019-12-12	15.85	47.9895
2019-12-12	55.90	48.8405
984476 rows × 2 columns

pd.concat([data_all['close'],ma],axis=1,sort=False).plot()

[figure: closing price with 20-day moving average]

Next, compute the running maximum with an expanding window.

exmax = data_all['close'].expanding().max()
pd.concat([data_all['close'], exmax], axis=1, sort=False).plot()

[figure: closing price with expanding maximum]

Now resample the dataset at a weekly frequency and compute the mean of each indicator.

mean = data_all.resample(rule='1W', on='date', closed='right', label='right').mean()
mean
——————————————————————————————————————————————————
	open	high	low	close	pre_close	pct_chg	vol	amt
date								
2018-11-04	11.475561	11.776644	11.344113	11.580576	11.347201	1.686861	133421.827245	1.197930
2018-11-11	11.736163	11.949151	11.524949	11.720392	11.743626	0.035427	111001.643627	0.924826
2018-11-18	12.050949	12.414799	11.919639	12.251079	12.073681	1.607222	149621.296784	1.187044
2018-11-25	12.350931	12.580767	12.052282	12.264035	12.400272	-1.124476	129389.371799	1.063069
2018-12-02	11.858034	12.080183	11.601627	11.840531	11.845595	-0.007843	96146.719166	0.778999
2018-12-09	12.143786	12.380277	11.967465	12.178485	12.144444	0.350796	105900.684959	0.919899
2018-12-16	11.997293	12.178797	11.791668	11.960303	11.997663	-0.384111	84651.361308	0.720631
2018-12-23	11.668868	11.845912	11.462035	11.653172	11.706126	-0.376411	76366.230127	0.649877
2018-12-30	11.591742	11.788976	11.347982	11.564124	11.587275	-0.373679	80999.883521	0.690408
2019-01-06	11.332548	11.612886	11.125718	11.394576	11.368771	0.585630	89418.638862	0.750437
2019-01-13	11.710140	11.944959	11.567360	11.761189	11.697039	0.611978	113789.838744	0.919325
2019-01-20	11.829106	12.032179	11.649138	11.841103	11.832487	0.006484	106323.611524	0.858491
2019-01-27	11.915102	12.113135	11.748725	11.923547	11.919975	-0.091527	97456.824903	0.813621
2019-02-03	11.625442	11.843185	11.360569	11.571198	11.627905	-0.667317	93623.021020	0.772521
2019-02-10	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...
2019-09-08	14.596266	14.935986	14.401262	14.715649	14.564102	1.048966	155212.596756	1.722521
2019-09-15	15.144680	15.418701	14.884215	15.162391	15.096582	0.599808	158320.709253	1.768950
2019-09-22	15.135537	15.397141	14.874496	15.129959	15.115724	-0.088277	131388.196198	1.469611
2019-09-29	15.054052	15.331692	14.701197	14.958344	15.065550	-0.827569	118295.782658	1.374129
2019-10-06	14.758728	14.980529	14.387937	14.547902	14.732943	-1.002513	81982.295657	0.960229
2019-10-13	14.629373	14.930197	14.398692	14.710025	14.627513	0.636353	95204.032934	1.101362
2019-10-20	14.950047	15.203062	14.689691	14.903079	14.933476	-0.300134	109903.485569	1.195154
2019-10-27	14.691936	14.932835	14.423435	14.703832	14.694104	0.155208	92698.280959	1.013124
2019-11-03	14.803516	15.089399	14.525632	14.795101	14.809838	-0.193310	118589.653663	1.310630
2019-11-10	14.926280	15.196216	14.705147	14.934956	14.899184	0.034028	106198.282544	1.219384
2019-11-17	14.758897	15.002853	14.476706	14.721258	14.777994	-0.628522	93082.760733	1.046375
2019-11-24	14.932153	15.210862	14.681571	14.939452	14.937258	0.231435	96077.021700	1.100175
2019-12-01	14.592503	14.804645	14.330456	14.553943	14.599136	-0.127033	93553.277018	0.984303
2019-12-08	14.653081	14.916211	14.485489	14.753741	14.665127	0.522050	91099.107137	1.010874
2019-12-15	15.059145	15.326443	14.854766	15.088637	15.057555	0.123028	113471.141525	1.237854
59 rows × 8 columns

The result contains quite a few missing weeks; fill them by carrying the previous value forward (forward fill).

mean.fillna(method='ffill').plot()

[figure: weekly mean indicators]

Resampling can also aggregate each column differently to produce richer indicators:

data_all.resample(rule='1W', on='date', closed='right', label='right').agg({'open': 'first', 'high': 'max', 'low': 'min', 'close': 'last', 'vol': 'sum', 'amt': 'sum'})
——————————————————————————————————————————————————
	open	high	low	close	vol	amt
date						
2018-11-04	10.99	600.00	1.02	4.88	9.290162e+08	8341.184058
2018-11-11	10.95	593.00	1.04	4.79	1.940087e+09	16164.108137
2018-11-18	10.46	570.00	0.67	5.28	2.626901e+09	20840.936047
2018-11-25	10.57	572.00	0.40	5.20	2.278417e+09	18719.589955
2018-12-02	10.34	569.80	0.26	4.91	1.697278e+09	13751.677451
2018-12-09	10.59	616.50	0.25	5.50	1.874760e+09	16284.974932
2018-12-16	10.22	606.88	0.25	5.13	1.501292e+09	12780.393349
2018-12-23	10.16	595.97	0.22	5.14	1.354202e+09	11524.263748
2018-12-30	9.40	596.40	0.20	4.84	1.435642e+09	12236.799335
2019-01-06	9.39	612.00	1.11	4.99	9.532921e+08	8000.406237
2019-01-13	9.84	637.00	1.14	5.04	2.023411e+09	16347.443668
2019-01-20	10.22	690.20	1.13	5.18	1.891497e+09	15272.562995
2019-01-27	10.34	698.88	1.00	4.94	1.734926e+09	14484.087155
2019-02-03	11.04	699.00	0.92	4.83	1.669392e+09	13774.813951
2019-02-10	NaN	NaN	NaN	NaN	0.000000e+00	0.000000
...	...	...	...	...	...	...
2019-09-08	14.15	1151.02	0.27	59.66	2.841943e+09	31539.368510
2019-09-15	14.98	1148.00	0.19	57.04	2.319557e+09	25916.883317
2019-09-22	14.70	1160.00	0.19	56.69	2.406769e+09	26920.325267
2019-09-29	15.34	1188.87	0.18	52.92	2.166824e+09	25169.923667
2019-10-06	15.75	1169.43	0.18	49.65	3.008750e+08	3524.038865
2019-10-13	15.60	1180.00	0.15	48.72	1.396453e+09	16154.776554
2019-10-20	16.97	1215.68	0.15	45.93	2.017608e+09	21940.642711
2019-10-27	16.43	1181.50	0.20	47.69	1.702775e+09	18610.066243
2019-11-03	16.98	1199.96	0.21	44.66	2.182524e+09	24120.830213
2019-11-10	16.98	1215.65	0.24	20.00	1.960633e+09	22512.265428
2019-11-17	16.50	1240.00	0.24	17.08	1.722496e+09	19363.177225
2019-11-24	16.35	1241.61	0.23	16.42	1.781652e+09	20401.654126
2019-12-01	15.64	1198.60	0.25	16.34	1.737191e+09	18277.521529
2019-12-08	15.35	1170.00	0.24	51.00	1.692713e+09	18783.057471
2019-12-15	15.62	1176.00	0.24	55.90	1.689018e+09	18425.459966
59 rows × 6 columns

II. Problem Analysis

Load the datasets:

data_basic = pd.read_excel('data_basic.xlsx')
data_zt = pd.read_excel('data_zt.xlsx')
data_all = pd.read_excel('data_all.xlsx')

Bring in the MultiIndex helper function:

def variable_ts(data, field):
    ser = data[field]
    ser.index = pd.MultiIndex.from_frame(df=data[['date', 'code']])
    return ser.unstack()

1. Find each day's maximum num value

a. Using groupby

result_max_num_date = data_zt.groupby('date')['num'].max()   

b. Using pivot_table

result_max_num_date = data_zt.pivot_table(values='num', index='date', aggfunc='max')

c. Using a MultiIndex

mat_num = variable_ts(data=data_zt, field='num')
result_max_num_date = mat_num.max(axis=1)
result_max_num_date 
result_max_num_date.plot()
——————————————————————————————————————————————————
date
2018-01-02    1.0
2018-01-03    2.0
2018-01-04    3.0
2018-01-05    3.0
2018-01-08    3.0
             ... 
2019-12-06    6.0
2019-12-09    4.0
2019-12-10    5.0
2019-12-11    6.0
2019-12-12    7.0
Length: 474, dtype: float64

[figure: daily maximum num over time]
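
The three approaches should agree. A quick sanity check (a sketch; unstacking converts num to float, which the comparison tolerates):

a = data_zt.groupby('date')['num'].max()
b = data_zt.pivot_table(values='num', index='date', aggfunc='max')['num']
c = variable_ts(data=data_zt, field='num').max(axis=1)
print((a == b).all(), (a == c).all())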

2. Compute each stock's limit-up record and average turnover

First, inspect the values in mat_num. A non-NaN value means the stock hit its limit-up that day, and a run of consecutive integers (1, 2, 3, …, n) means an n-day limit-up streak. What we need, though, is the number of limit-up days: if 000009.SZ shows 1 and then 2, it hit the limit twice over two days. To count this, set every non-NaN value to 1; the column sums are then each stock's total number of limit-up days.

mat_num 
——————————————————————————————————————————————————
code	000004.SZ	000005.SZ	000006.SZ	000007.SZ	000008.SZ	000009.SZ	000010.SZ	000011.SZ	000012.SZ	000014.SZ	...	603987.SH	603988.SH	603989.SH	603990.SH	603992.SH	603993.SH	603996.SH	603997.SH	603998.SH	603999.SH
date							
2018-01-02	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-01-03	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-01-04	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-01-05	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-01-08	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	2.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
	...
2019-12-06	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2019-12-09	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2019-12-10	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2019-12-11	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2019-12-12	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
474 rows × 3119 columns

Define an all-zero DataFrame with the same index and columns as mat_num to hold the daily limit-up flags.

df = pd.DataFrame(0, index=mat_num.index, columns=mat_num.columns)

Mark every limit-up occurrence in df as 1.

df[mat_num > 0]=1

The total number of limit-up days per stock then follows directly.

zt_sums_stock = df.sum(axis=0).sort_values(ascending=False)
——————————————————————————————————————————————————
code
603032.SH    43
300598.SZ    42
600776.SH    37
300663.SZ    36
002356.SZ    34
             ..
600297.SH     1
002358.SZ     1
603689.SH     1
002340.SZ     1
002438.SZ     1
Length: 3119, dtype: int64
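
The same totals can be computed in one line, since counting non-NaN cells per column is exactly what the 0/1 mask encodes (an equivalent sketch):

zt_sums_alt = mat_num.notna().sum(axis=0).sort_values(ascending=False)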

Now turn to turnover. Fill the missing values first, then use a rolling window to track the average turnover. The first four rows of mat_amt_mean are NaN, exactly as a 5-day window should produce.

mat_amt = variable_ts(data=data_all, field='amt')
mat_amt = mat_amt.fillna(value=0)
mat_amt_mean = mat_amt.rolling(window=5).mean()
mat_amt_mean
——————————————————————————————————————————————————
code	     000001.SZ	000002.SZ	000004.SZ	000005.SZ	000006.SZ	000007.SZ	000008.SZ	000009.SZ	000010.SZ	000011.SZ	...	688333.SH	688357.SH	688358.SH	688363.SH	688366.SH	688368.SH	688369.SH	688388.SH	688389.SH	688399.SH
date								
2018-11-01	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-11-02	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-11-05	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-11-06	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2018-11-07	14.349779	13.219227	0.178639	0.169550	0.673762	2.084265	1.657001	1.001253	0.122785	0.463832	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2019-12-06	8.638587	14.973211	0.223669	0.151208	0.385596	0.375832	1.196141	0.444873	0.045953	0.115200	...	0.364501	1.242398	2.479148	1.851320	0.292710	0.533009	0.481587	0.816026	0.434213	1.745468
2019-12-09	8.809291	15.861202	0.233287	0.139696	0.336337	0.548861	1.583662	0.524081	0.052064	0.114330	...	0.440212	1.474739	2.711663	1.645825	0.313169	0.757351	0.538574	0.927789	0.480371	2.190393
2019-12-10	9.111459	16.341740	0.225406	0.144344	0.330028	0.796226	1.804214	0.751842	0.052398	0.126397	...	0.716250	1.982968	1.926679	1.974561	0.413794	1.053451	0.740009	1.402833	0.660357	2.819268
2019-12-11	10.210811	19.963586	0.231022	0.160623	0.413147	0.872229	1.813619	0.911660	0.056627	0.189311	...	0.826002	1.858374	1.613671	1.886199	0.459506	1.186162	0.858411	1.595369	0.730358	3.260726
2019-12-12	9.990485	21.051259	0.217922	0.150647	0.461222	0.889543	1.610768	0.955314	0.052097	0.209479	...	0.889447	1.824044	1.462992	1.816224	0.478536	1.287344	1.103114	1.757640	0.758520	2.670553
273 rows × 3756 columns

One step from the finish: put each stock's limit-up count and its average turnover side by side, then drop the rows with missing counts.

result = pd.concat([zt_sums_stock, mat_amt.mean()], axis=1, sort=False)
result.columns = ['sums', 'amt']
result = result.dropna(subset=['sums'], how='all')
result
——————————————————————————————————————————————————
                sums	   amt
603032.SH	43.0	2.580331
300598.SZ	42.0	2.507630
600776.SH	37.0	11.090669
300663.SZ	36.0	3.338191
002356.SZ	34.0	1.084322
...	...	...
600297.SH	1.0	0.884834
002358.SZ	1.0	1.730683
603689.SH	1.0	0.413853
002340.SZ	1.0	2.840811
002438.SZ	1.0	0.324684
3119 rows × 2 columns

The scatter plot shows that stocks with more than 20 limit-up days do not have especially high turnover, while stocks with turnover above 20 hit the limit fewer than ten times.

plt.scatter(x=result['sums'], y=result['amt'])

[figure: scatter of limit-up count vs. average turnover]

3. Find the limit-up stocks and their streak lengths

Problem 2 gave us a useful pattern: build a container with a constructor function to store the limit-up streaks.

def frame_like(data, value):
    return pd.DataFrame(data=value, index=data.index, columns=data.columns)
mat_zgb = frame_like(mat_num, value=None)
mat_num_fill = mat_num.fillna(value=0)

Now mat_num_fill holds the limit-up data. For a stock showing 1, 2, 3 (a 3-day streak) we want to extract the 3. The idea: a streak ends on a day whose value is positive while the next day's value is 0, and on that day the cell holds the full streak length.

mat_zgb[(mat_num_fill > 0) & (mat_num_fill.shift(periods=-1) == 0)] = mat_num_fill
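
On a toy column the mask picks out the last day of each streak, whose value is the streak length (an illustrative sketch):

s = pd.Series([0, 1, 2, 3, 0, 1, 0], dtype=float)
mask = (s > 0) & (s.shift(periods=-1) == 0)
print(s[mask])   # positions 3 and 5: streak lengths 3.0 and 1.0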

Finally, stack the result back into long form to see each stock's streaks clearly.

zgb = mat_zgb.stack().reset_index(drop=False)
zgb.columns = ['date', 'code', 'num']
——————————————————————————————————————————————————
            date	code	       num
0	2018-01-02	000672.SZ	1
1	2018-01-02	000703.SZ	1
2	2018-01-02	000885.SZ	1
3	2018-01-02	002372.SZ	1
4	2018-01-02	002793.SZ	1
...	...	...	...
16350	2019-12-11	600715.SH	1
16351	2019-12-11	600812.SH	1
16352	2019-12-11	601500.SH	1
16353	2019-12-11	601999.SH	1
16354	2019-12-11	603530.SH	1
16355 rows × 3 columns

4. Find the stocks whose longest streak is 10, and the 7-day average turnover before each streak

First sort by streak length.

zgb['num'].value_counts()
zgb.sort_values(by='num')

Shift the whole table down one row, then apply a rolling window; the rolling mean is the 7-day average turnover.

mat_amt_mean = mat_amt.shift(1).rolling(window=7).mean()

Extract the 7-day average turnover where the streak length equals 10.

result_zgb1_amt = mat_amt_mean[mat_zgb == 10].stack()
result_zgb1_amt 
——————————————————————————————————————————————————	
date        code     
2019-01-10  601700.SH    0.072690
2019-02-25  000859.SZ    0.128327
2019-03-15  002356.SZ    0.000000
2019-04-26  300573.SZ    0.689295
dtype: float64

Of course, we can also wrap this into a function to get the pre-streak average turnover for any streak length.

def get_amt_mean(num):
    mat_amt_mean = mat_amt.shift(num).rolling(window=7).mean()
    result_zgb_amt = mat_amt_mean[mat_zgb == num].stack()
    return result_zgb_amt
get_amt_mean(num=8)
——————————————————————————————————————————————————	
date        code     
2019-03-11  002750.SZ    0.283191
2019-03-13  300370.SZ    1.743853
2019-03-20  600624.SH    3.506733
2019-04-01  000590.SZ    0.932301
2019-04-09  300099.SZ    1.005648
2019-04-10  300194.SZ    0.702668
dtype: float64

Collect the average-turnover series for every streak length into a list.

r = [get_amt_mean(num=int(i)) for i in zgb['num'].value_counts().index]

With k as the position and v as the streak length, map each streak length to its mean turnover in a dictionary.

result_dict = {}
for k, v in enumerate(zgb['num'].value_counts().index[:-1]):
    result_dict[int(v)] = r[k].mean()
    print(v)
——————————————————————————————————————————————————    
1.0
2.0
3.0
4.0
5.0
6.0
7.0
8.0
9.0
10.0

pd.Series(result_dict).plot.bar()

[figure: mean turnover by streak length]

Regression Model Analysis

  • Linear regression, logistic regression, ridge regression, softmax regression, and the like are the standard regression models; they serve us well for building and evaluating models in data-analysis practice. Scikit-learn provides a rich set of model modules that make data analysis and machine learning much more convenient. Building on a basic understanding of the underlying math, this section explores where regression models apply and demonstrates the effect of combining several models.

Applications of Linear Regression

1. Predicting bike-share rental counts

Import the analysis libraries in workflow order:

import pandas as pd,numpy as np
# Preprocessing: one-hot encoding, polynomial expansion, standardization
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
# Train/test split
from sklearn.model_selection import train_test_split
# Linear regression and ridge regression models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
# Model evaluation: mean squared error
from sklearn.metrics import mean_squared_error

Read the dataset:

path = 'datas/hour.csv'
df = pd.read_csv(path)

Inspecting the dataset, the columns instant, dteday, casual, and registered carry no real value for the analysis target, so they are dropped.

df.drop(columns = ['instant','dteday','casual','registered'],inplace=True)

The integer-coded categorical columns should be one-hot encoded into sequences of 0s and 1s; first check which columns need encoding.

for i in df.columns:
    a=df[i]
    print(a.value_counts()) 

One-hot encode season, mnth, hr, and weekday, and drop the originals from df.

hot = df[['season','mnth','hr','weekday']]
hotcoder = OneHotEncoder(sparse=False,handle_unknown ='ignore')
hot = pd.DataFrame(hotcoder.fit_transform(hot))
df.drop(columns =['season','mnth','hr','weekday'],inplace=True)
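
A compatibility note: scikit-learn 1.2 renamed the sparse argument to sparse_output (and 1.4 removed the old name), so on newer versions the encoder is constructed differently (a version-tolerant sketch):

try:
    hotcoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')  # sklearn >= 1.2
except TypeError:
    hotcoder = OneHotEncoder(sparse=False, handle_unknown='ignore')         # older sklearn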

Apply polynomial feature expansion.

poly = df[['weathersit','temp','atemp','hum','windspeed']]
# Polynomial expansion: degree 3, allowing pure power terms such as x squared.
polycoder=PolynomialFeatures(degree=3,interaction_only=False,include_bias=False)
# Transform poly and recover column names via get_feature_names.
poly  = pd.DataFrame(polycoder.fit_transform(poly),
                     columns =polycoder.get_feature_names())
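
Similarly, get_feature_names was renamed to get_feature_names_out in scikit-learn 1.0 and removed in 1.2; a version-tolerant way to fetch the column names (a sketch):

cols = (polycoder.get_feature_names_out()
        if hasattr(polycoder, 'get_feature_names_out')
        else polycoder.get_feature_names())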

Then standardize the expanded features and drop the original columns from df.

ssconder = StandardScaler()
poly = pd.DataFrame(ssconder.fit_transform(poly)) 
df.drop(columns =['weathersit','temp','atemp','hum','windspeed'],inplace=True)

Concatenate the one-hot block, the standardized block, and the remaining df.

df = pd.concat([hot,poly,df],axis=1)

As an aside, the dummy encoding (pd.get_dummies) we used on Titanic could replace the one-hot encoder here; wrapped up as a function:

def Hotconder():
    global df 
    for data in ['weekday','hr','mnth','season']:
        data_dummies =pd.get_dummies( df[data],prefix =data)
        df  =pd.concat([data_dummies,df],axis=1)
        df.drop(data,axis=1,inplace=True)
    return df

Split the data; the last column, cnt, is the value to predict.

x = df.iloc[:,:-1]
y = df.iloc[:,[-1]]
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)

Fit and evaluate a linear regression model:

model  =LinearRegression()
model.fit(x_train,y_train)
model.score(x_test,y_test) , model.score(x_train,y_train)
mean_squared_error(y_pred=model.predict(x_test),y_true=y_test)
——————————————————————————————————————————————————
(0.7039761586925211, 0.7044838271295909) 
9796.229240009625

Fit and evaluate ridge regression models:

for alpha in [0.001,0.01,0.1,1,3,4,5,6,8,10]:
    print(f'alpha:{alpha}')
    model = Ridge(alpha=alpha)
    model.fit(x_train,y_train)
    print(f'score:{model.score(x_test,y_test)}')
    print(mean_squared_error(y_pred=model.predict(x_test),
                             y_true=y_test))
———————————————————————————————————————————————————————— 
alpha:0.001
score:0.7068131057191035
9795.994613920964
alpha:0.01
score:0.7067652761844607
9797.592699895771
alpha:0.1
score:0.7065896548359738
9803.460580818919
alpha:1
score:0.7059984174818978
9823.215072063425
alpha:3
score:0.7053880270231085
9843.609509070426
alpha:4
score:0.7051548705857995
9851.39975907352
alpha:5
score:0.7049398070935562
9858.585485491596
alpha:6
score:0.7047357832549445
9865.402353718442
alpha:8
score:0.7043477576411166
9878.367110661091
alpha:10
score:0.7039761586925211
9890.783017954314
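
Instead of the manual loop, RidgeCV searches an alpha grid in a single fit, using efficient leave-one-out cross-validation by default (a sketch with the same candidates):

from sklearn.linear_model import RidgeCV
ridge_cv = RidgeCV(alphas=[0.001, 0.01, 0.1, 1, 3, 4, 5, 6, 8, 10])
ridge_cv.fit(x_train, y_train)
print(ridge_cv.alpha_, ridge_cv.score(x_test, y_test))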

2. Boston housing price prediction

Import the libraries:

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import warnings
import sklearn
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

Configure fonts so Chinese plot labels render correctly, and intercept warnings.

mpl.rcParams['font.sans-serif'] =[u'simHei']
mpl.rcParams['axes.unicode_minus']=False
warnings.filterwarnings(action='ignore',category=ConvergenceWarning)
warnings.filterwarnings(action='ignore',category=UserWarning)

Load the dataset:

data = pd.read_csv('datas/boston_housing_data.csv',sep=',')

The dataset contains NaN values; drop them.

data.isnull().sum()
data.dropna(inplace=True)

Next, separate the independent variables from the dependent variable.

names=[]
for i in list(data):
    names.append(i)
names.remove('MEDV')
x= data[names]
y =data['MEDV'].ravel()
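
The four-line loop is equivalent to a one-line comprehension (a sketch):

names = [c for c in data.columns if c != 'MEDV']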

Use Pipelines so several models share one tuning setup:

# Candidate models
models =[
    Pipeline([('Ss',StandardScaler()), ('Poly',PolynomialFeatures()),
              ('Linear',RidgeCV(alphas=np.logspace(-2,1,15)))]),
    
    Pipeline([('Ss',StandardScaler()), ('Poly',PolynomialFeatures()),
              ('Linear',LassoCV(alphas=np.logspace(-2,1,15)))])  
]
# Shared parameter grid
parameters ={ 
    'Poly__degree' : [3,2,1],
    'Poly__interaction_only':[True,False],
   'Poly__include_bias' : [True,False],
    'Linear__fit_intercept' : [True,False]  
}

Split the dataset:

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=0)

Plot the true values first; since there are two models, titles and colors hold each model's title and curve style.

titles = ['Ridge','Lasso']
colors=['r-','b-']
plt.figure(figsize=(25,10),facecolor='w')
ln_x_test = range(len(x_test))
plt.plot(ln_x_test,y_test,'g-',lw=2,label=u'真实值')
#Tune and select each model with grid search
for t in range(2):
    model =GridSearchCV(models[t],param_grid=parameters,cv=5,n_jobs=1)
    model.fit(x_train,y_train)
    print(f'{titles[t]}算法的最优参数:{model.best_params_}')
    print(f'{titles[t]}算法的R值:{model.best_score_}')
    y_predict = model.predict(x_test)
    plt.plot(ln_x_test,y_predict,colors[t],lw=t+2,alpha=0.75,
             label = f'%s算法预测值,$R^2$=%.3f' % (titles[t],model.best_score_))
plt.legend(loc='upper left')
plt.grid(True)
plt.title(u'波士顿房屋价格预测')
plt.show()
————————————————————————————————————————————————————————
Ridge算法的最优参数:{'Linear__fit_intercept': True, 'Poly__degree': 2, 'Poly__include_bias': True, 'Poly__interaction_only': False}
Ridge算法的R值:0.8568618675311532
Lasso算法的最优参数:{'Linear__fit_intercept': True, 'Poly__degree': 2, 'Poly__include_bias': True, 'Poly__interaction_only': False}
Lasso算法的R值:0.8522318747421048

[figure: Boston housing price predictions, Ridge vs. Lasso]

3. Wine quality prediction

Import the libraries:

import numpy as np
import pandas as pd 
import matplotlib as mpl
import matplotlib.pyplot as plt
import warnings
import sklearn
from sklearn.preprocessing import  PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LassoCV,LinearRegression,RidgeCV,ElasticNetCV

Configure fonts for Chinese labels and intercept warnings.

mpl.rcParams['font.sans-serif'] =[u'simHei']
mpl.rcParams['axes.unicode_minus']=False
warnings.filterwarnings(action='ignore',category=ConvergenceWarning)
warnings.filterwarnings(action='ignore',category=UserWarning)

Load the datasets, merge red and white wine, and keep them apart with a type column.

data_red = pd.read_csv('datas/winequality-red.csv',sep=';')
data_white = pd.read_csv('datas/winequality-white.csv',sep=';')
data_red['type'] =1
data_white['type']=2
data =pd.concat([data_red,data_white],axis=0)

Handle anomalous values:

data = data.replace('?',np.nan)
data.isnull().sum()
#datas= data.dropna(how='any')
#datas.isnull().sum()
————————————————————————————————————————————————————————
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
type                    0
dtype: int64

Separate the independent variables from the dependent variable; this fixes the features and the target.

names= []
for i in list(data):
    names.append(i)
names.remove('quality')
names
————————————————————————————————————————————————
['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol',
 'type']

Build the model list with Pipelines:

models = [
    Pipeline([('Poly',PolynomialFeatures()),
              ('Linear',LinearRegression())]),

    Pipeline([('Poly',PolynomialFeatures()),
              ('Linear',RidgeCV(alphas=np.logspace(-4,1,20)))]),

    Pipeline([('Poly',PolynomialFeatures()),
              ('Linear',LassoCV(alphas=np.logspace(-4,1,20)))]),

    Pipeline([('Poly',PolynomialFeatures()),
              ('Linear',ElasticNetCV(alphas=np.logspace(-4,1,20),
                                     l1_ratio=np.linspace(0,1,5)))])
]

Set up the figure: size, background color, and subplot titles.

plt.figure(figsize = (20,10),facecolor='w')
titles = u'线性回归预测','Ridge回归预测','Lasso回归预测','ElasticNet回归预测'

Next split the dataset (the features x and target y follow from names above); we can then see how the test-set quality values vary.

x = data[names]
y = data['quality']
x_train,x_test,y_train,y_test =train_test_split(x,y,test_size=0.01,random_state=0)
ln_x_test=range(len(x_test))
plt.plot(ln_x_test,y_test,c='r',lw=2,alpha=0.75,zorder=10,label=u'真实值')

[figure: test-set true quality values]

We want four subplots in the figure, titled 线性回归预测, Ridge回归预测, Lasso回归预测, and ElasticNet回归预测, matching indices 0, 1, 2, 3 in models. Within each subplot we also want prediction curves for polynomial degrees 1, 2, and 3 (Poly__degree), each drawn in its own color.

First store the degrees in a NumPy array, then use linspace to generate distinct color values in colors; a degree's index selects its color.

degree = np.arange(1,4,1)
l =len(degree)
colors =[]
for c in np.linspace(5570560,255,l):
    colors.append('#%06x' % int(c))
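
With three degrees this yields three hex colors running from dark red toward blue, since each integer is formatted as a six-digit hexadecimal RGB string:

print(colors)   # ['#550000', '#2a807f', '#0000ff']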

As each model in models runs, its prediction curves go into the corresponding subplot of the figure.

for t in range(4):
    model = models[t]
    plt.subplot(2,2,t+1)
    plt.plot(ln_x_test,y_test,c='g',lw=2,alpha=0.75,zorder=10,label=u'真实值')
    # enumerate yields the color index [0,1,2] alongside the degree [1,2,3]
    for i,d in enumerate(degree):
        model.set_params(Poly__degree=d)
        model.fit(x_train,y_train)
        y_predict =model.predict(x_test)
        R  = model.score(x_train,y_train)
        plt.plot(ln_x_test,y_predict,c=colors[i],lw=2,alpha=0.7,zorder=i,
                 label=u'%d阶预测值,$R^2$=%.3f' % (d,R))
    plt.legend(loc='upper left')
    plt.grid(True)
    plt.title(titles[t],fontsize=22)
    plt.xlabel('x',fontsize=18)
    plt.ylabel('y',fontsize=18)
plt.suptitle(u'葡萄酒质量检测',fontsize=28)
plt.show()    

[figure: wine quality predictions, four models × three polynomial degrees]

Applications of Logistic Regression

1. Breast cancer classification

Import the required libraries:

import numpy as np,pandas  as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

The raw file turned out to contain only values and no header row, so the column names are set manually. The id column has no analytical value and is dropped to shrink the data.

names = ['id','Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape',
         'Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei',
        'Bland Chromatin','Normal Nucleoli','Mitoses','Class']
data = pd.read_csv('../datas/breast-cancer-wisconsin.data',names=names)
data.drop('id',axis=1,inplace=True)

The data has no missing values as such, but '?' entries surfaced later during model training, so they are handled here. At the same time, the class labels 2 and 4 are mapped to 0 and 1.

data[data.values =='?']
data =data.replace('?',np.nan).dropna()
data['Class'] = data['Class'] /2-1 

Extract the features x and the target y, and check their types.

x= data.iloc[:,:-1]
y = data.iloc[:,[-1]]
type(x),type(y)
—————————————————————————————————————————————————————
(pandas.core.frame.DataFrame, pandas.core.frame.DataFrame)

Standardize the features and split the dataset.

sscoder = StandardScaler()
x = sscoder.fit_transform(x)
x_train,x_test,y_train,y_test =train_test_split(x,y,test_size=0.1,random_state=0)

Train the model and plot the true versus predicted values.

model = LogisticRegression()
model.fit(x_train,y_train)
print(model.score(x_test,y_test))
y_predict=model.predict(x_test)
ln_x_test = range(len(x_test))
plt.plot(ln_x_test,y_predict,'b-',lw=2,alpha =0.75,zorder=10,label=u'预测值')
plt.plot(ln_x_test,y_test,'r-',lw=2,alpha =0.4,zorder=10,label=u'真实值')
——————————————————————————————————————————————————
0.9855072463768116

[figure: breast cancer true vs. predicted classes]

Predict class probabilities and compute the AUC to evaluate the model.

m  =model.predict_proba(x_test)
print(m)
fpr,tpr,thresholds = metrics.roc_curve(y_test,y_score=[i[1] for i in m],pos_label=1)
metrics.auc(fpr,tpr)
————————————————————————————————————————————————————
0.9981096408317581

2. Credit approval

Import the libraries:

import numpy as np,pandas as pd
import matplotlib as mpl,matplotlib.pyplot as plt
import warnings 
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.exceptions import ConvergenceWarning
from typing import List

Set the font and intercept warnings.

mpl.rcParams['font.sans-serif'] =[u'simHei']
mpl.rcParams['axes.unicode_minus'] =False
warnings.filterwarnings(action ='ignore',category=ConvergenceWarning)   

Load the dataset and attach column labels; A16 is the target to predict.

names = ['A1','A2','A3','A4','A5','A6','A7','A8',
         'A9','A10','A11','A12','A13','A14','A15','A16']
data =pd.read_csv('../datas/crx.data',names=names)

First handle the '?' anomalies in the dataset, then look at each column's value distribution to decide which columns need encoding.

data=data.replace("?",np.nan).dropna()
for i in list(data):                
    print(data[i].value_counts())   

The results split the columns needing treatment into two groups:

Dummy encoding: A4 A5 A6 A7 A13

Binary encoding (0/1): A1 A9 A10 A12 A16

This lets us work group by group; to be able to restart quickly after any error, the columns to be encoded are first copied and the encoding tried on the copies.

Binary encoding first. I initially experimented on a single column; here all the relevant columns are processed at once to speed things up. The key idea: pick one concrete value as the reference for 1 versus 0, compare every entry against it, and turn the result into a 0/1 table.

data[['A11','A91','A101','A121','A161']] =data[['A1','A9','A10','A12','A16']]
for name in list(data[['A11','A91','A101','A121','A161']]):
    value_new=[]
    for value in data[name].values:
        # compare against the column's first value (iloc avoids relying on label 0)
        vn=1 if value == data[name].iloc[0] else 0
        value_new.append(vn)
    data[name] = value_new

Of course, the tidiest approach is to wrap this in a function. The function takes two parameters, the dataset data and the list of column names to binary-encode, which may differ from how others set it up.

def two_coder(data,names)-> pd.DataFrame:
    for name in names:
        value_new=[]
        for value in data[name].values:
            vn=1 if value == data[name].iloc[0] else 0
            value_new.append(vn)
        data[name] = value_new
    return data
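
A vectorized equivalent that drops the inner Python loop, comparing each column against its own first value (a sketch):

def two_coder_vec(data, names):
    for name in names:
        data[name] = (data[name] == data[name].iloc[0]).astype(int)
    return data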

Dummy encoding is done with pandas' get_dummies, again presented as a function.

def dummies_coder(data,names)-> pd.DataFrame:
    for name in names:
        data_dummies = pd.get_dummies(data[name],prefix=name)
        data = pd.concat([data,data_dummies],axis=1)
        data.drop(name,axis=1,inplace=True)
    return data

Now the ten columns A4 A5 A6 A7 A13 A1 A9 A10 A12 A16 can be encoded with ease.

two_coder_names =['A1','A9','A10','A12','A16']
dummies_coder_names =['A4','A5','A6','A7','A13']
data = two_coder(data,two_coder_names)
data = dummies_coder(data,dummies_coder_names)

The column order ends up somewhat jumbled, but that does not affect separating the features from the target.

Alternatively, we can dummy-encode all of these columns directly; the target still needs binary encoding to turn '+' and '-' into 1 and 0.

total_names =['A1','A9','A10','A12','A4','A5','A6','A7','A13']
data =dummies_coder(data,total_names)
data =two_coder(data,['A16'])

Now split the dataset and inspect the processed data.

y =pd.DataFrame(data['A16'],columns=['A16'])
x= data.drop(['A16'],axis=1)
x_train,x_test,y_train,y_test =train_test_split(x,y,test_size=0.1,random_state=0)
x_train.describe().T
——————————————————————————————————————————————————————
	count	mean	std	min	25%	50%	75%	max
A3	587.0	4.909319	5.073588	0.0	1.04	3.0	7.520	28.0
A8	587.0	2.221882	3.304041	0.0	0.21	1.0	2.605	28.5
A11	587.0	2.562181	5.056756	0.0	0.00	0.0	3.000	67.0
A15	587.0	943.959114	5081.188098	0.0	0.00	5.0	397.000	100000.0
A1_a	587.0	0.315162	0.464977	0.0	0.00	0.0	1.000	1.0
A1_b	587.0	0.684838	0.464977	0.0	0.00	1.0	1.000	1.0
A9_f	587.0	0.461670	0.498954	0.0	0.00	0.0	1.000	1.0
A9_t	587.0	0.538330	0.498954	0.0	0.00	1.0	1.000	1.0
A10_f	587.0	0.550256	0.497892	0.0	0.00	1.0	1.000	1.0
A10_t	587.0	0.449744	0.497892	0.0	0.00	0.0	1.000	1.0
A12_f	587.0	0.534923	0.499204	0.0	0.00	1.0	1.000	1.0
A12_t	587.0	0.465077	0.499204	0.0	0.00	0.0	1.000	1.0
A4_l	587.0	0.003407	0.058321	0.0	0.00	0.0	0.000	1.0
A4_u	587.0	0.761499	0.426530	0.0	1.00	1.0	1.000	1.0
A4_y	587.0	0.235094	0.424419	0.0	0.00	0.0	0.000	1.0
A5_g	587.0	0.761499	0.426530	0.0	1.00	1.0	1.000	1.0
A5_gg	587.0	0.003407	0.058321	0.0	0.00	0.0	0.000	1.0
A5_p	587.0	0.235094	0.424419	0.0	0.00	0.0	0.000	1.0
A6_aa	587.0	0.078365	0.268974	0.0	0.00	0.0	0.000	1.0
A6_c	587.0	0.211244	0.408539	0.0	0.00	0.0	0.000	1.0
A6_cc	587.0	0.061329	0.240137	0.0	0.00	0.0	0.000	1.0
A6_d	587.0	0.037479	0.190094	0.0	0.00	0.0	0.000	1.0
A6_e	587.0	0.035775	0.185887	0.0	0.00	0.0	0.000	1.0
A6_ff	587.0	0.069847	0.255106	0.0	0.00	0.0	0.000	1.0
A6_i	587.0	0.085179	0.279386	0.0	0.00	0.0	0.000	1.0
A6_j	587.0	0.015332	0.122975	0.0	0.00	0.0	0.000	1.0
A6_k	587.0	0.073254	0.260775	0.0	0.00	0.0	0.000	1.0
A6_m	587.0	0.059625	0.236993	0.0	0.00	0.0	0.000	1.0
A6_q	587.0	0.120954	0.326352	0.0	0.00	0.0	0.000	1.0
A6_r	587.0	0.005111	0.071367	0.0	0.00	0.0	0.000	1.0
A6_w	587.0	0.097104	0.296352	0.0	0.00	0.0	0.000	1.0
A6_x	587.0	0.049404	0.216894	0.0	0.00	0.0	0.000	1.0
A7_bb	587.0	0.081772	0.274250	0.0	0.00	0.0	0.000	1.0
A7_dd	587.0	0.010221	0.100669	0.0	0.00	0.0	0.000	1.0
A7_ff	587.0	0.076661	0.266280	0.0	0.00	0.0	0.000	1.0
A7_h	587.0	0.207836	0.406105	0.0	0.00	0.0	0.000	1.0
A7_j	587.0	0.011925	0.108641	0.0	0.00	0.0	0.000	1.0
A7_n	587.0	0.006814	0.082337	0.0	0.00	0.0	0.000	1.0
A7_o	587.0	0.003407	0.058321	0.0	0.00	0.0	0.000	1.0
A7_v	587.0	0.587734	0.492662	0.0	0.00	1.0	1.000	1.0
A7_z	587.0	0.013629	0.116042	0.0	0.00	0.0	0.000	1.0
A13_g	587.0	0.913118	0.281903	0.0	1.00	1.0	1.000	1.0
A13_p	587.0	0.003407	0.058321	0.0	0.00	0.0	0.000	1.0
A13_s	587.0	0.083475	0.276835	0.0	0.00	0.0	0.000	1.0

Standardize the feature values.

ss_coder = StandardScaler()
x_train =ss_coder.fit_transform(x_train)
x_test =ss_coder.transform(x_test)

Build a logistic regression model and train it.

lgr = LogisticRegressionCV(Cs=np.logspace(-4,1,50),fit_intercept=True,penalty='l2',
                          solver ='lbfgs',tol=0.01,multi_class='ovr')
lgr.fit(x_train,y_train)
———————————————————————————————————————————————————————
LogisticRegressionCV(Cs=array([1.00000000e-04, 1.26485522e-04, 1.59985872e-04, 2.02358965e-04,
       2.55954792e-04, 3.23745754e-04, 4.09491506e-04, 5.17947468e-04,
       6.55128557e-04, 8.28642773e-04, 1.04811313e-03, 1.32571137e-03,
       1.67683294e-03, 2.12095089e-03, 2.68269580e-03, 3.39322177e-03,
       4.29193426e-03, 5.42867544e-03, 6.86648845e-03, 8.68511374e-03,
       1.09854114e-02, 1.38...
       7.19685673e-02, 9.10298178e-02, 1.15139540e-01, 1.45634848e-01,
       1.84206997e-01, 2.32995181e-01, 2.94705170e-01, 3.72759372e-01,
       4.71486636e-01, 5.96362332e-01, 7.54312006e-01, 9.54095476e-01,
       1.20679264e+00, 1.52641797e+00, 1.93069773e+00, 2.44205309e+00,
       3.08884360e+00, 3.90693994e+00, 4.94171336e+00, 6.25055193e+00,
       7.90604321e+00, 1.00000000e+01]),
                     multi_class='ovr', tol=0.01)

Evaluate the model:

lgr_r = lgr.score(x_train,y_train)
print(f'Logistic算法的R值:{lgr_r}')
print(f'Logistic算法的参数:{lgr.coef_}')
print(f'Logistic算法的截距:{lgr.intercept_}')
————————————————————————————————————————————————————
Logistic算法的R值:0.889267461669506
Logistic算法的参数:[[ 0.06010294  0.06371679  0.14746233  0.17539052 -0.0760682   0.11441961
  -0.00360566  0.00360566 -0.42879631  0.42879631 -0.15905789  0.15905789
   0.00924079 -0.00924079  0.05970023  0.038181   -0.04657456  0.038181
   0.05970023 -0.04657456 -0.02160594  0.00424491  0.09527565 -0.02703857
   0.03342162 -0.10700193 -0.09250279 -0.01900214 -0.05327403 -0.01117224
   0.04677697  0.0120337   0.03154771  0.12295968 -0.02175103 -0.00740624
  -0.09594935  0.07653215  0.0248417   0.02830086 -0.00219562 -0.00572481
  -0.00776041  0.01346844 -0.00498746 -0.01266428]]
Logistic算法的截距:[-0.24652859]

Use the model to predict y.

y_predict=lgr.predict(x_test)
y_proba = lgr.predict_proba(x_train)
y_predict,y_proba
——————————————————————————————————————————————————————
(array([1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0,
        0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1],
       dtype=int64),
 array([[0.88120287, 0.11879713],
        [0.51051602, 0.48948398],
        [0.51993802, 0.48006198],
        ...,
        [0.08128366, 0.91871634],
        [0.87979668, 0.12020332],
        [0.34058366, 0.65941634]]))

Plot the true and predicted credit-approval outcomes.

# sample index positions
ln_x_test =range(len(x_test))
# figure size and background color
plt.figure(figsize=(20,8),facecolor='w')
# y-axis range (the labels are 0/1)
plt.ylim(-0.1,1.1)
plt.plot(ln_x_test,y_test,'ro',markersize=15,alpha=0.75,zorder=10,label=u'真实值')
plt.plot(ln_x_test,y_predict,'bo',markersize=17,alpha=0.6,zorder=10,
        label =f'logis算法的预测值,$R^2$={lgr.score(x_test,y_test)}')
plt.legend(loc='center',fontsize=20)
plt.xlabel(u'数据编号',fontsize=20)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.ylabel(u'是否审批(0:不通过,1:通过)',fontsize=20)
plt.title(f'logistic回归算法',fontsize=24)
plt.show()

[figure: credit approval, true vs. predicted]

3. Iris classification

Import the libraries:

import numpy as np,pandas as pd,matplotlib as mpl
import matplotlib.pyplot as plt
import warnings
import sklearn
from sklearn.preprocessing import StandardScaler,label_binarize
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.exceptions import ConvergenceWarning
from sklearn import metrics
from typing import List

Prevent garbled Chinese labels and intercept warnings.

## Set the font so Chinese labels render correctly
mpl.rcParams['font.sans-serif']=[u'simHei']
mpl.rcParams['axes.unicode_minus']=False
## Intercept convergence warnings
warnings.filterwarnings(action = 'ignore', category=ConvergenceWarning)

Load the dataset:

names = ['sepal length', 'sepal width', 'petal length', 'petal width', 'cla']
data = pd.read_csv('../datas/iris.data',names=names)
data
——————————————————————————————————————————————————————

sepal length	sepal width	petal length	petal width	cla
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa
2	4.7	3.2	1.3	0.2	Iris-setosa
3	4.6	3.1	1.5	0.2	Iris-setosa
4	5.0	3.6	1.4	0.2	Iris-setosa
...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	Iris-virginica
146	6.3	2.5	5.0	1.9	Iris-virginica
147	6.5	3.0	5.2	2.0	Iris-virginica
148	6.2	3.4	5.4	2.3	Iris-virginica
149	5.9	3.0	5.1	1.8	Iris-virginica
150 rows × 5 columns

Check the dataset for anomalies; there are no missing values or question marks.

data.isnull().sum()
data[data.values=="?"]
————————————————————
sepal length    0
sepal width     0
petal length    0
petal width     0
cla             0
dtype: int64

sepal length	sepal width	petal length	petal width	cla

Preprocessing mainly means encoding cla, which takes three distinct values.

data['cla'].value_counts()
————————————————————————————————————————————————————————
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: cla, dtype: int64

One approach: a set deduplicates the labels, and a tuple's index lookup then yields the numbers we want.

tuple_claa =tuple(set(data['cla']))
tuple_claa[0],tuple_claa[1],tuple_claa[2],tuple_claa.index('Iris-virginica')
————————————————————————————————————————————————
('Iris-versicolor', 'Iris-virginica', 'Iris-setosa', 1)

Following this idea, define an encoding function that works directly on the original data, with the column name as a parameter.

def get_vn_coder(data,name):
    new_value =[]
    tuple_name = tuple(set(data[name]))
    for  value in data[name]:
        vn=tuple_name.index(value)+1
        new_value.append(vn)
    data[name] = new_value
    return data
get_vn_coder(data,'cla')
————————————————————————————————————————————————————
	sepal length	sepal width	petal length	petal width	cla
0	5.1	3.5	1.4	0.2	3
1	4.9	3.0	1.4	0.2	3
2	4.7	3.2	1.3	0.2	3
3	4.6	3.1	1.5	0.2	3
4	5.0	3.6	1.4	0.2	3
...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	2
146	6.3	2.5	5.0	1.9	2
147	6.5	3.0	5.2	2.0	2
148	6.2	3.4	5.4	2.3	2
149	5.9	3.0	5.1	1.8	2
150 rows × 5 columns
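
One caveat: set ordering is arbitrary, so the label-to-number mapping can change from run to run. pd.factorize would give a reproducible encoding in first-appearance order (an alternative sketch, not what the code above does):

codes, uniques = pd.factorize(data['cla'])
data['cla'] = codes + 1   # 1-based integer labels, stable across runs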

Split the dataset and fit the standardizer.

x= data.iloc[:,:-1]
y =pd.DataFrame(data.iloc[:,-1])
type(x),type(y)
x_train,x_test,y_train,y_test =train_test_split(x,y,test_size=0.2,random_state=0)

ss_coder = StandardScaler()
x_train =ss_coder.fit_transform(x_train)
x_test=ss_coder.transform(x_test)

Build the logistic regression model and train it.

lgr = LogisticRegressionCV(Cs =np.logspace(-4,1,50),cv=3,fit_intercept=True,
            penalty='l2',solver='lbfgs',tol=0.01,multi_class='multinomial')
lgr.fit(x_train,y_train)

Logistic regression model outputs:

#binarize the test labels into an indicator matrix
y_test_h = label_binarize(y_test,classes=(1,2,3))
#decision scores for each class
lgr_y_score =lgr.decision_function(x_test)
#ROC curve: fpr, tpr and the thresholds
lgr_fpr,lgr_tpr,lgr_thresholds =metrics.roc_curve(y_test_h.ravel(),
                                                  lgr_y_score.ravel())
lgr_auc = metrics.auc(lgr_fpr,lgr_tpr)
print(f'logistic算法的R值:{lgr.score(x_train,y_train)}')
print(f'logistic算法的AUC值:{lgr_auc}')
#model prediction
y_pred =lgr.predict(x_test)
————————————————————————————————————————————————————
logistic算法的R值:0.975
logistic算法的AUC值:0.9011111111111111
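
Raveling the binarized labels and the score matrix amounts to a micro-averaged multiclass ROC; the same AUC is available from the high-level API (a sketch, since roc_auc_score accepts an indicator matrix plus per-class scores):

micro_auc = metrics.roc_auc_score(y_test_h, lgr_y_score, average='micro')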

Build a KNN model:

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train,y_train)

KNN model outputs:

#binarize the test labels into an indicator matrix
y_test_h = label_binarize(y_test,classes=(1,2,3))
#class membership probabilities for each test sample
knn_y_score =knn.predict_proba(x_test)
#ROC curve: fpr, tpr and the thresholds
knn_fpr,knn_tpr,knn_thresholds =metrics.roc_curve(y_test_h.ravel(),
                                                  knn_y_score.ravel())
knn_auc = metrics.auc(knn_fpr,knn_tpr)
print(f'knn算法的R值:{knn.score(x_train,y_train)}')
print(f'knn算法的AUC值:{knn_auc}')
knn_y_pred =knn.predict(x_test)
——————————————————————————————————————————————————
knn算法的R值:0.9666666666666667
knn算法的AUC值:0.9972222222222222

Plot the ROC curves of the logistic regression and KNN models:

plt.figure(figsize=(20,8),facecolor='w')
plt.plot(lgr_fpr,lgr_tpr,c='b',lw=2,label=u'Logistic算法:AUC=%.3f' % lgr_auc)
plt.plot(knn_fpr,knn_tpr,c='r',lw=2,label=u'KNN算法:AUC=%.3f' % knn_auc)
plt.plot((0,1),(0,1),c='#a0a0a0',lw=2,ls='--')
# axis limits
plt.xlim(-0.01,1.02)
plt.ylim(-0.01,1.02)
# axis ticks
plt.xticks(np.arange(0,1,0.1))
plt.yticks(np.arange(0,1,0.1))
# axis labels
plt.xlabel('FPR' ,fontsize=20)
plt.ylabel('TPR' ,fontsize=20)
# grid
plt.grid(b=True,ls=':')
# legend
plt.legend(loc='lower right',fancybox=True,framealpha=0.7,fontsize=18)
plt.title(f'鸢尾花数据Logistic算法和KNN算法的ROC/AUC',fontsize=25)
plt.show()

[figure: ROC curves, Logistic vs. KNN]

Plot the predictions of both models against the true values.

# sample index positions
ln_x_test =range(len(x_test))
# figure size and background color
plt.figure(figsize =(20,10),facecolor='w')
# y-axis range
plt.ylim(0.5,3.5)
plt.plot(ln_x_test,y_test,'ro',alpha=0.8,markersize=18,zorder=10,label=u'真实值')
plt.plot(ln_x_test,y_pred,'bo',alpha=0.75,markersize=13,zorder=10,
         label=u'Logistic预测值,$R^2$=%.3f' % lgr.score(x_test,y_test))
plt.plot(ln_x_test,knn_y_pred,'go',alpha=0.9,markersize=8,zorder=10,
         label=u'KNN预测值,$R^2$=%.3f' % knn.score(x_test,y_test))
# legend
plt.legend(loc='lower right',fontsize=12)
plt.xlabel(u'数据编号',fontsize=20)
plt.ylabel(u'种类',fontsize=20)
plt.title(u'鸢尾花分类',fontsize=24)
plt.show()

[figure: iris classification, true vs. predicted]

4. Wine quality prediction (Softmax)

Import the libraries and configure the environment.

import pandas as pd,numpy as np,matplotlib as mpl
import matplotlib.pyplot as plt
import sklearn
import warnings
from sklearn.preprocessing import StandardScaler,MinMaxScaler,LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV
from sklearn.exceptions import ConvergenceWarning
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler,Normalizer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
## Set the font so Chinese labels render correctly
mpl.rcParams['font.sans-serif']=[u'simHei']
mpl.rcParams['axes.unicode_minus']=False
## Intercept convergence warnings
warnings.filterwarnings(action = 'ignore', category=ConvergenceWarning)

Load the datasets, add a type column, and concatenate data_red and data_white.

data_red = pd.read_csv('../datas/winequality-red.csv',sep=';')
data_white = pd.read_csv('../datas/winequality-white.csv',sep=';')
data_red['type']=1
data_white['type']=2
data_all=pd.concat([data_red,data_white],axis=0)
data_all
————————————————————————————————————————————————————
	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality	type
0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.99780	3.51	0.56	9.4	5	1
1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	0.99680	3.20	0.68	9.8	5	1
2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.99700	3.26	0.65	9.8	5	1
3	11.2	0.28	0.56	1.9	0.075	17.0	60.0	0.99800	3.16	0.58	9.8	6	1
4	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.99780	3.51	0.56	9.4	5	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...
4893	6.2	0.21	0.29	1.6	0.039	24.0	92.0	0.99114	3.27	0.50	11.2	6	2
4894	6.6	0.32	0.36	8.0	0.047	57.0	168.0	0.99490	3.15	0.46	9.6	5	2
4895	6.5	0.24	0.19	1.2	0.041	30.0	111.0	0.99254	2.99	0.46	9.4	6	2
4896	5.5	0.29	0.30	1.1	0.022	20.0	110.0	0.98869	3.34	0.38	12.8	7	2
4897	6.0	0.21	0.38	0.8	0.020	22.0	98.0	0.98941	3.26	0.32	11.8	6	2
6497 rows × 13 columns

Inspect the dataset's info.

data_all.info()
————————————————————————————————————————————————————
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6497 entries, 0 to 4897
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         6497 non-null   float64
 1   volatile acidity      6497 non-null   float64
 2   citric acid           6497 non-null   float64
 3   residual sugar        6497 non-null   float64
 4   chlorides             6497 non-null   float64
 5   free sulfur dioxide   6497 non-null   float64
 6   total sulfur dioxide  6497 non-null   float64
 7   density               6497 non-null   float64
 8   pH                    6497 non-null   float64
 9   sulphates             6497 non-null   float64
 10  alcohol               6497 non-null   float64
 11  quality               6497 non-null   int64  
 12  type                  6497 non-null   int64  
dtypes: float64(11), int64(2)
memory usage: 710.6 KB

Handle anomalous values.

data =data_all.replace("?",np.nan).dropna(how='any')

Extract x and y from the dataset and split it.

y =pd.DataFrame(data['quality'])
x=data.drop('quality',axis=1)
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.1,random_state=0)

Fit the standardizer on x_train and apply it to x_test.

ss_coder =StandardScaler()
x_train=ss_coder.fit_transform(x_train)
x_test =ss_coder.transform(x_test)

Build the logistic (softmax) regression model and train it.

lgr  = LogisticRegressionCV(fit_intercept=True,Cs =np.logspace(-3,1,50),
                           multi_class='multinomial',penalty='l2',solver='lbfgs')
lgr.fit(x_train,y_train)

Model outputs:

lgr_R = lgr.score(x_train,y_train)
print('R值是:' ,lgr_R)
print('特征稀疏化比例是:%.2f%%'  %  (np.mean(lgr.coef_.ravel()==0)*100))
print('参数是:', lgr.coef_)
print('截距是:' ,lgr.intercept_)
y_pred =lgr.predict(x_test)
————————————————————————————————————————————————————
R值是: 0.5496835984265436
特征稀疏化比例是:0.00%
参数是: [[ 0.67752559  0.987899   -0.32027472  0.00677359  0.94350148  0.39520366   0.12637852 -0.14786997  0.16976999 -0.45294428 -0.52182398  0.61274299]
 [-0.54979328  0.84826118 -0.01591337 -1.09324428  0.54110719 -0.88734675   0.1013851   0.99221528 -0.39935197 -0.07018644 -0.58198203  1.07255851]
 [-0.68865143  0.30156317  0.08701101 -0.69110331  0.5099835  -0.25769075   0.43825846  0.67516745 -0.55340259 -0.16496295 -0.89681563 -0.4581928 ]
 [-0.62705797 -0.35617218  0.00359714 -0.31382828  0.4693695  -0.04570205   0.06847681  0.51632857 -0.4794139   0.08879676  0.00749543 -0.52250221]
 [-0.01449942 -0.7438577  -0.0296082   0.59747534  0.26343921  0.03448366  -0.01468908 -0.73431732 -0.08090158  0.39547905  0.27094513 -0.85751529]
 [-0.11072351 -0.58705212  0.06808383  0.85254437  0.37615573  0.26429456  -0.09980634 -0.82237239 -0.06509039  0.35993973  0.56090155 -0.64243831]
 [ 1.31320002 -0.45064136  0.20710431  0.64138256 -3.10355662  0.49675766  -0.62000346 -0.47915162  1.40839043 -0.15612186  1.16127953  0.79534712]]
截距是: [-1.88365356  0.34148367  2.98466092  3.5517204   2.06959079  0.0302318 -7.09403403]

Plot the true versus predicted values.

ln_x_test =range(len(x_test))
plt.figure(figsize=(20,10),facecolor='w')
plt.ylim(-1,11)
plt.plot(ln_x_test,y_test,'ro',markersize=10,alpha=0.7,zorder=10,label=u'真实值')
plt.plot(ln_x_test,y_pred,'bo',markersize=15,alpha=0.7,zorder=10,
         label=u'预测值,$R^2$=%.3f' % lgr_R)
plt.legend(loc='upper left',fontsize=18)
plt.xlabel(u'数据编号',fontsize=20)
plt.ylabel(u'葡萄酒质量',fontsize=20)
plt.title(u'葡萄酒质量预测统计',fontsize=24)
plt.show()

[figure: wine quality, true vs. predicted]

PCA Dimensionality Reduction

Split the dataset.

x1_train,x1_test,y1_train,y1_test=train_test_split(x,y,test_size=0.01,random_state=0)

Normalize x1_train and x1_test.

nor =Normalizer()
x1_train=nor.fit_transform(x1_train)
x1_test =nor.transform(x1_test)

Apply dimensionality reduction; the effect turns out to be modest.

# reduce the samples to 2 dimensions
pca = PCA(n_components=2)
x1_train = pca.fit_transform(x1_train)
print ("贡献率:", pca.explained_variance_)
# project the test data with the PCA fitted on the training set
x1_test = pca.transform(x1_test)
——————————————————————————————————————————————————————
贡献率: [0.80467114 0.12287721]
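
Note that explained_variance_ is the absolute variance captured by each component; the proportional "contribution ratio" the print label suggests is explained_variance_ratio_ (a sketch):

print('explained variance ratio:', pca.explained_variance_ratio_)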

Train the model.

lgr2 = LogisticRegressionCV(fit_intercept=True,Cs=np.logspace(-1,3,50),
                            multi_class='multinomial',penalty='l2',solver='lbfgs')
lgr2.fit(x1_train,y1_train)

Training results:

lgr2_R = lgr2.score(x1_train,y1_train)
print('R值是:' ,lgr2_R)
print('特征稀疏化比例是:%.2f%%'  %  (np.mean(lgr2.coef_.ravel()==0)*100))
print('参数是:', lgr2.coef_)
print('截距是:' ,lgr2.intercept_)
y1_pred =lgr2.predict(x1_test)
——————————————————————————————————————————————————————
R值是: 0.45988805970149255
特征稀疏化比例是:0.00%
参数是: [[ 0.41173597  1.59780389]
 [ 0.59498972  0.99400199]
 [ 0.04557274  1.36230804]
 [-0.06424483 -0.11878825]
 [-0.10983418 -0.93218132]
 [-0.37749056 -1.34843384]
 [-0.50072887 -1.55471051]]
截距是: [-2.1018385  -0.15045045  2.26818011  2.61282555  1.57707652 -0.24589361 -3.95989963]

Plot the true versus predicted values; after normalization and PCA, the R value drops.

ln_x1_test=range(len(x1_test))
plt.figure(figsize=(20,10),facecolor='w')
plt.plot(ln_x1_test,y1_test,'go',markersize=15,zorder=10,alpha=0.75,label=u'真实值')
plt.plot(ln_x1_test,y1_pred,'bo',markersize=10,zorder=10,alpha=0.8,
         label=u'预测值,R$^2$=%.3f' % lgr2_R)
plt.legend(loc='upper left',fontsize=20)
plt.xlabel(u'数据编号',fontsize=20)
plt.ylabel(u'葡萄酒质量',fontsize=20)
plt.title(u'葡萄酒质量预测(PCA将降维处理)',fontsize=24)
plt.show()

[figure: wine quality prediction after PCA]